NGrams – MetoDHology (https://metodhology.anu.edu.au). A resource developed by the Centre for Digital Humanities Research at the Australian National University.

Text Analysis: Methods, Assessment, and Experience
https://metodhology.anu.edu.au/index.php/2022/05/26/text-analysis-methods-assessment-and-experience/
Thu, 26 May 2022 03:05:44 +0000

Text analysis can be a very general term. It’s often used to describe computational tools that analyse text (Reardon, 2020). Although computational text analysis, or machine analysis, is now prevalent, human text analysis provided its fundamental basis. A comparison of the two, together with a personal example of using a text analysis tool, can help explain why text analysis is so significant in the modern age.

Computational text analysis has become a far more widely used method of text analysis in recent years, and it has several benefits. Using this technique, the root of a problem within both unstructured and structured data can be identified, trends and limits can be recognised, and digital experiences can be enhanced (Haije, 2019). In addition, once the system behind the computational analysis has been trained to a sufficient level, the process becomes highly efficient and fast (Haije, 2019).

Human text analysis predates computational text analysis and is still used today, either alongside machine text analysis or in place of it. One benefit of human text analysis is how easy it is to start: once a topic and dictionary have been established, the reading and writing of annotations can begin almost immediately. In addition, human interpretive abilities have been trained and shaped throughout everyday life by everything we encounter. Humans are also able to interpret anomalies such as irony with a higher success rate (Wonderflow, 2019).

Though human text analysis does have some benefits, it also has many limitations that make computational analysis more accessible in the modern age. Consistency is often lacking in human text analysis, especially when the process is not repeated, as humans often evaluate things differently based on their mood (Wonderflow, 2019). Human memory can also constrain the competence and speed of human text analysis. Text analysis often involves many firm definitions and parameters, and the difficulty of remembering these terms can hinder the process (Wonderflow, 2019). Additionally, in comparison to computational text analysis, human text analysis can be slow because of the manual input it requires.

With the benefits and limitations of computational and human text analysis in mind, I chose to document my own experiences with text analysis. Google Ngram Viewer is a tool that can be used to analyse terms used in literature and their relevance over time. I used this site to research the terms “anxiety” and “depression” over the years 1800-2019. While the results showed a general increase, there was a peak in the use of the word “depression” in the 1930s. Once I realised that this peak was not related to mental health but instead referred to the Great Depression, one of Google Ngram’s complications became clear: context is not taken into account when analysing words. In addition, since Google Ngrams only covers written texts, much of the world’s material cannot be assessed.

There are many tools on the internet that can provide basic computational text analysis. These can be web-based applications, like Voyant, or command-line toolkits, like the Java-based MALLET. Either way, there are many ways to begin text analysis, and even more ways to enhance it.

References 

Haije, E. G. (2019). What is Text Analytics? And why should I care? Retrieved May 23, 2022, from https://mopinion.com/what-is-text-analytics-benefits/ 

Reardon, J. (2020). Text Analysis: An Overview. MetoDHology. Retrieved May 23, 2022, from https://metodhology.anu.edu.au/index.php/content/text-analysis/ 

Wonderflow. (2019). What are the pros and cons of human text analysis – Part 2. Retrieved May 23, 2022, from https://www.wonderflow.ai/blog/what-are-the-pros-and-cons-of-human-text-analysis-part-2 

Google Ngram Viewer
https://metodhology.anu.edu.au/index.php/2022/05/18/google-ngram-viewer/
Wed, 18 May 2022 09:10:59 +0000

The Google Ngram Viewer is an online search engine that charts the frequencies of searched word strings, using a yearly count of n-grams found in Google’s text corpora. In humanities research, it is a useful tool for sociolinguistic work in both historical and contemporary contexts. It provides strong data visualization of comparative literary trends by plotting how the frequency of a string changes within a given time frame, and so offers some tangible insight into comparative social trends, as evidenced by the trajectory of a term’s frequency in the literature of the time.

The term Ngram refers to a contiguous sequence of n items from a given sample of text. The concept is also used to predict language sequences probabilistically, based on how frequently particular elements recur.
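The definition above can be made concrete with a short sketch. This is only an illustration in Python: the function name `ngrams` and the toy sentence are invented here, and splitting on whitespace is a simplifying assumption rather than how Google tokenises its corpora.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy sentence invented for illustration, split naively on whitespace.
tokens = "the great depression followed the great crash".split()

bigrams = ngrams(tokens, 2)   # n = 2 gives two-word sequences
counts = Counter(bigrams)     # how often each bigram recurs

for gram, count in counts.most_common(3):
    print(" ".join(gram), count)
```

Counting how often each sequence recurs, as `Counter` does here, is the kind of frequency information on which probability-based prediction of the next word can be built.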

In the standard workflow, one simply enters the desired search strings and sets a time frame between the 1500s and 2019, and the tool outputs its findings as a line chart that displays the frequency of each Ngram in the database side by side. The data are normalized: each value is expressed as a percentage of the text published in the respective year. Further parameters are available to account for specific language sources, case sensitivity, distinctions between word forms, and so on. Advanced features such as “wildcard” searches allow findings to be contextualised beyond the original search string by grouping Ngram sequences comparatively (for example, by how the search string is positioned within the sentence in which it occurs), though this capacity is limited to immediately adjacent words.
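The normalization step can be sketched as follows. This is a minimal illustration, not the tool’s actual implementation: the function name `relative_frequency` is hypothetical and the numbers are invented toy values, not real Ngram Viewer data.

```python
def relative_frequency(match_counts, total_counts):
    """Divide each year's match count by the total counted for that year."""
    return {
        year: match_counts.get(year, 0) / total
        for year, total in total_counts.items()
        if total > 0  # skip years with no text at all
    }


# Invented toy numbers: occurrences of a term, and total n-grams, per year.
matches = {1900: 5, 1930: 40, 1960: 30}
totals = {1900: 1000, 1930: 4000, 1960: 6000}

freq = relative_frequency(matches, totals)
for year in sorted(freq):
    print(year, f"{freq[year]:.2%}")
```

In this toy data the raw count for 1960 is six times that for 1900, yet the relative share is the same, because far more text was counted for 1960. This is why the chart plots percentages rather than raw counts.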

The tool is driven by multiple text corpora, stored in TSV format and compiled through continuous book scanning that is updated on a regular basis. The quality of findings obtained this way (assuming the search strings themselves are valid) depends largely on the quality of the corpora, as well as on the capacity of the “wildcard” function to contextualise the extracted data. The latter can be quite constraining in practice because of its limitations.
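As a rough sketch of what working with such tab-separated data might look like, the Python example below reads rows and sums the yearly counts for one term. The column layout (ngram, year, match count, volume count) is an assumption made for illustration, and the published files may be laid out differently depending on the dataset version. The function name `yearly_counts` and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict


def yearly_counts(tsv_file, target):
    """Sum match counts per year for one n-gram.

    Assumes four tab-separated columns per row:
    ngram, year, match_count, volume_count (an assumed layout).
    """
    counts = defaultdict(int)
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for ngram, year, match_count, _volume_count in reader:
        if ngram == target:
            counts[int(year)] += int(match_count)
    return dict(counts)


# Invented sample rows standing in for a real corpus file.
sample = io.StringIO(
    "depression\t1929\t12\t7\n"
    "depression\t1930\t45\t20\n"
    "anxiety\t1930\t9\t6\n"
)

result = yearly_counts(sample, "depression")
print(result)
```

With a real file, the `io.StringIO` object would be replaced by an opened file handle. The match is a plain string comparison, which reflects the tool's own limitation: nothing here knows which sense of a word was intended.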

In the context of digital humanities, the Ngram Viewer is a clear example of data visualization that aids the acquisition of societal knowledge, displaying changes in societal trends as evidenced by their relative prominence in the literary publications of a given time. Its key strengths are its operational efficiency and its simple input process, which make large, condensed datasets broadly accessible and present them in an informative manner. A significant weakness, however, is its inability to contextualise the data it presents, which places much of that responsibility on the quality of the search string. The tool itself is not advanced enough to include language comprehension as part of its function, which limits its capacity to produce solid knowledge without human interpretation.

