Google Ngram Viewer

jamesliu — Wed, 18 May 2022 09:10:59 +0000

The Google Ngram Viewer is an online search engine that charts the frequency of searched word strings, using yearly counts of n-grams found in Google’s text corpora. In humanities research, it is a useful tool for sociolinguistic study in both historical and contemporary contexts: it visualizes comparative literary trends by plotting how the frequency of a given string changes over a chosen time frame, and thus offers some tangible insight into social trends as reflected by the trajectory of that frequency in the literature of the period.

The term Ngram (n-gram) refers to a contiguous sequence of n items from a given sample of text, a concept used in language modelling to predict the next item in a sequence from the observed frequency of recurring sequences.
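As a minimal sketch of this idea, the snippet below splits an invented sentence into bigrams (n = 2) and estimates which word is likely to follow "the" from the bigram counts. The sentence and the helper name `ngrams` are illustrative assumptions, not part of the Ngram Viewer itself.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Invented example sentence, tokenized naively on whitespace.
text = "the cat sat on the mat and the cat slept"
tokens = text.split()

bigrams = ngrams(tokens, 2)
counts = Counter(bigrams)

# Estimate P(next word | "the") from how often each bigram starting with "the" occurs.
following_the = {pair[1]: c for pair, c in counts.items() if pair[0] == "the"}
total = sum(following_the.values())
probabilities = {word: c / total for word, c in following_the.items()}

print(probabilities)
```

Here "cat" follows "the" twice and "mat" once, so the sketch ranks "cat" as the more probable continuation. Real language models use far larger corpora and smoothing, which this example leaves out.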

In the standard workflow, one simply enters the desired search strings and sets a time frame between the 1500s and 2019, and the tool returns a line chart that compares the frequency of each Ngram within the database over time. The data are normalized: each value is expressed as a percentage of all Ngrams recorded for the respective year. Further parameters are available to account for specific language corpora, case sensitivity, distinctions between word forms and the like. Advanced features such as “wildcard” search allow findings to be contextualized beyond the original search string by comparing the Ngram sequences in which it occurs (for example, which words sit next to the search string in the sentences where it appears), though this capacity is limited to the immediately adjacent sequence.
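The normalization and wildcard ideas can be illustrated with a rough sketch. The toy corpus, the function names and the restriction to bigrams are all assumptions made for the example; the actual tool works on a far larger corpus and its internals are not described here.

```python
from collections import Counter

# Hypothetical toy corpus (invented data): year -> texts published that year.
corpus = {
    1900: ["the old king ruled", "the old queen ruled"],
    1950: ["the new king ruled the land"],
}

def bigram_counts(texts):
    """Count every pair of adjacent words across the given texts."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts

def yearly_share(corpus, query):
    """Percentage of all bigrams in each year that equal the query bigram."""
    shares = {}
    for year, texts in corpus.items():
        counts = bigram_counts(texts)
        total = sum(counts.values())
        shares[year] = 100 * counts[query] / total if total else 0.0
    return shares

def wildcard(corpus, year, first_word):
    """Expand 'first_word *' into the bigrams that actually occur in that year."""
    counts = bigram_counts(corpus[year])
    return {pair: c for pair, c in counts.items() if pair[0] == first_word}

print(yearly_share(corpus, ("king", "ruled")))
print(wildcard(corpus, 1900, "old"))
```

Dividing by the yearly total is what makes years with very different publication volumes comparable. The wildcard helper only looks at the word immediately after the search term, which mirrors the adjacency limitation described above.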

The tool operates on multiple text corpora, distributed as datasets in TSV format and compiled through ongoing book scanning that is updated periodically. The quality of findings (assuming valid search strings) depends largely on the quality of those corpora, and on the capacity of the “wildcard” function to contextualize the extracted data, which can be quite constraining in practice given its limited reach.
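For readers who want to work with the raw data rather than the chart, a short sketch of reading such a TSV file follows. It assumes each row holds an Ngram, a year, a match count and a volume count; this column layout is an assumption and differs between dataset releases, so it should be checked against the release in use. The sample rows are invented.

```python
import csv
import io

# Invented sample rows in the ASSUMED layout: ngram, year, match_count, volume_count.
sample = (
    "archaeology\t1900\t120\t35\n"
    "archaeology\t1901\t150\t40\n"
    "Archaeology\t1900\t30\t12\n"
)

def match_counts_by_year(tsv_text, ngram, case_sensitive=True):
    """Sum match counts per year for one Ngram, optionally ignoring case."""
    totals = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row_ngram, year, match_count, _volume_count in reader:
        if case_sensitive:
            same = row_ngram == ngram
        else:
            same = row_ngram.lower() == ngram.lower()
        if same:
            totals[int(year)] = totals.get(int(year), 0) + int(match_count)
    return totals

print(match_counts_by_year(sample, "archaeology"))
print(match_counts_by_year(sample, "archaeology", case_sensitive=False))
```

The second call merges "archaeology" and "Archaeology", which is roughly what a case-insensitive search setting amounts to.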

In the context of digital humanities, the Ngram Viewer is a clear example of data visualization that supports the acquisition of societal knowledge, displaying changes in societal trends as evidenced by their relative prominence in the literature published at a given time. Its key strengths are its operational efficiency and its simple input process, which make large, compacted datasets widely accessible and present them in an informative manner. A significant weakness, however, is its limited capacity to contextualize the data it presents, which places much of that responsibility on the quality of the search strings entered. The tool does not factor language comprehension into its function, which limits its capacity to produce solid knowledge without human interpretation.

Bibliography:

Basile, P., Caputo, A., Luisi, R., & Semeraro, G. (2016, December). Diachronic Analysis of the Italian Language exploiting Google Ngram. In CLiC-it/EVALITA.

Lin, Y., Michel, J. B., Lieberman, E. A., Orwant, J., Brockman, W., & Petrov, S. (2012, July). Syntactic annotations for the Google Books Ngram Corpus. In Proceedings of the ACL 2012 system demonstrations (pp. 169-174).

Zeng, R., & Greenfield, P. M. (2015). Cultural evolution over the last 40 years in China: Using the Google Ngram Viewer to study implications of social and political change for cultural values. International Journal of Psychology, 50(1), 47-55.

Machine learning in Archaeology

jamesliu — Thu, 24 Mar 2022 09:57:30 +0000

In a broad sense, machine learning (ML) describes an algorithmic process that derives mathematical classifiers from the statistical analysis of categorized “training data”, enabling a machine to make informed predictions about newly acquired data (Bickler, 2021). As archaeology has long placed emphasis on classification, the increasing application of ML in contemporary field research can be argued to be a natural technical progression.

ML can construct classification models from large sets of established (or “known”) data through testing and tuning, which ensures a considerable degree of internal consistency within the classification process while also offering considerable capacity for noise management (Resler et al., 2021). Using the logic models built from the classification of training data, an ML application is thus able to predict information from raw data input.
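The train, test and predict cycle described here can be sketched in a few lines. The example uses a nearest-centroid classifier, chosen only for brevity and not drawn from the cited studies; the vessel measurements and labels are invented for illustration.

```python
# Invented training data: (rim diameter in cm, wall thickness in mm) -> vessel type.
training = [
    ((12.0, 4.0), "cup"), ((13.0, 4.5), "cup"), ((11.5, 3.8), "cup"),
    ((30.0, 9.0), "jar"), ((28.0, 8.5), "jar"), ((32.0, 9.5), "jar"),
]
# Labelled examples kept aside to test the model after training.
held_out = [((12.5, 4.2), "cup"), ((29.0, 9.2), "jar")]

def fit_centroids(examples):
    """Learn one average feature vector (centroid) per label."""
    sums = {}
    for features, label in examples:
        total, count = sums.get(label, ([0.0] * len(features), 0))
        sums[label] = ([t + f for t, f in zip(total, features)], count + 1)
    return {label: [t / count for t in total] for label, (total, count) in sums.items()}

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(centroids, features):
    """Assign the label whose centroid lies closest to the input."""
    return min(centroids, key=lambda label: squared_distance(centroids[label], features))

centroids = fit_centroids(training)
accuracy = sum(predict(centroids, f) == y for f, y in held_out) / len(held_out)

print(accuracy)
print(predict(centroids, (31.0, 9.1)))
```

The held-out examples play the role of the "test" step: they check that the model built from the known data also classifies examples it has not seen. Production systems would use far more data, more features and a tuned model.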

Within archaeology, ML has been implemented in the field broadly for the processing of statistical data (such as chemical analysis), textual data (language translation), image data (automated identification and feature reconstruction), and geospatial data (Bickler, 2021). The last of these has been described as the most promising of the existing implementations: by combining the various raw information derived from archaeological sites and subterranean scanning, the algorithmic process is capable of producing a reasonably reliable reconstruction of human communal presence and activity in given time periods (Resler et al., 2021).

As the quality of an ML application depends heavily on the quality of its training data, a common difficulty of implementing ML in archaeology is rooted in the nature of archaeological data. Unlike traditional “big data”, it comes primarily as highly contextualized chunks of information that arise through sporadic discovery, and so without a consistent flow (Grosman et al., 2014). Poor-quality training data results in flawed logic models that can bias the processing results (especially if the training data cannot account for unfamiliar variables). In addition, the deep complexity of the interaction between different algorithms can produce a “black box” in which the logical process of the ML application is difficult to comprehend. Both are common challenges in ML implementations.

Regarding its current state of development, it is fair to conclude that while ML has proven a useful addition to the field study of archaeology, it is far from capable of completely replacing manual input and oversight, given the complexity of the variables involved in the raw data being processed (Davis, 2019). ML predicts information based on the information it knows, so one has to question the validity of its predictions when they involve information it may not account for.
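A small sketch can make this last point concrete. The class centres, the threshold and the "needs human review" fallback below are all invented for illustration; the sketch only shows that a classifier limited to known labels will still answer confidently for an input unlike anything it was trained on, unless such cases are explicitly flagged.

```python
# Invented one-dimensional example: typical rim diameter (cm) of each known vessel type.
centroids = {"cup": 12.0, "jar": 30.0}

def classify(diameter):
    """Always returns one of the known labels, however poor the fit."""
    return min(centroids, key=lambda label: abs(centroids[label] - diameter))

def classify_or_flag(diameter, max_distance=5.0):
    """Return a label only when the input lies near a known class (threshold is an assumption)."""
    label = classify(diameter)
    if abs(centroids[label] - diameter) > max_distance:
        return "needs human review"
    return label

print(classify(80.0))          # forced into a known class
print(classify_or_flag(80.0))  # flagged instead
print(classify_or_flag(13.0))
```

An 80 cm vessel resembles neither known type, yet the plain classifier still calls it a jar. Flagging distant inputs hands those cases back to a human interpreter, which is the kind of oversight argued for above.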

References:

Bickler, S. H. (2021). Machine Learning Arrives in Archaeology. Advances in Archaeological Practice, 9(2), 186-191.

Davis, D. S. (2019). Object‐based image analysis: a review of developments and future directions of automated feature detection in landscape archaeology. Archaeological Prospection, 26(2), 155-163.

Grosman, L., Karasik, A., Harush, O., & Smilansky, U. (2014). Archaeology in three dimensions: Computer-based methods in archaeological research. Journal of Eastern Mediterranean Archaeology and Heritage Studies, 2(1), 48-64.

Hörr, C., Lindinger, E., & Brunnett, G. (2014). Machine learning based typology development in archaeology. Journal on Computing and Cultural Heritage (JOCCH), 7(1), 1-23.

Resler, A., Yeshurun, R., Natalio, F., & Giryes, R. (2021). A deep-learning model for predictive archaeology and archaeological community detection. Humanities and Social Sciences Communications, 8(1), 1-10.

jamesliu – MetoDHology
