Text Analysis – MetoDHology

Overview

Text analysis, also known as “text mining”, is a computational method used to extract information from large amounts of ‘unstructured’ data in documents and texts in their online forms (Reardon, 2020). Researchers who use text analysis tools often do so to collect specific information from the texts they are studying. For example, text analysis can detect important phrases, patterns of words and word frequencies.

The term encompasses a wide range of tools and techniques that are practised across many research areas. Research in the Humanities and Social Sciences uses text analysis methods most consistently, because qualitative and ‘unstructured’ datasets occur more frequently within the Humanities. However, these methods are gradually being used more in STEM, for example in the analysis of metadata.

Any form of written or transcribed text can be used as data for this method. The texts can range from social media posts to reviews to long novels. There is a range of publicly available datasets and collections of texts that can be run through text analysis tools. Project Gutenberg is a site that allows free access to literary texts that are out of copyright in the United States. The British National Corpus, the Internet Movie Script Database, scientific paper collections and the Digital National Security Archive are other databases that can be used in text analysis. Links to all these datasets and more can be found on the website below:

https://onlinelibrary-wiley-com.virtual.anu.edu.au/doi/full/10.1111/cgf.12873

Process of text analysis

The first step in using this method for research is to acquire a dataset (discussed above). This data or text then needs to be put through a text analysis tool. The most common tools used in the humanities and social sciences for text analysis are “Voyant”, “MALLET”, “Topic Modelling Tool” and “WordSeer”. Even “Wordle” (a popular word game site) can be used as a text analysis tool (Gupta, 2022).

Each of these platforms takes the data from various texts and turns it into visualisations such as word clouds, lists, graphs, tables, micro searches and many more. Some are more analytical than others: MALLET has a range of embedded tools and requires quite a lot of training to use, whereas Voyant makes it much easier to process and visualise data. Once the tools have collected the data from the texts, researchers examine and analyse the results themselves to aid their research. Often, the datasets produced by text analysis are good starting points that can be carried into other methodologies.

Text analysis does require researchers to have some knowledge of the context of the text or texts studied. This is because it is a quantitative study of ‘unstructured’ datasets. Texts, whatever their form, are usually written in sentences, which are less structured than, for example, a list of dates or names (a structured dataset). This makes text analysis a good aid to research rather than a method on which to base research entirely. However, this is not necessarily a downside: it requires researchers to bring their own research backgrounds to a digital lens, making their work with text analysis tools a collaboration between disciplines, which is what Digital Humanities is all about.

Some key words and phrases related to text mining:

Word frequencies: The number of times a word is written or occurs in the text.

An example of how this is used in literary research could be counting the number of times words like she/her, they/them and he/him are used in a text to detect a specific gendered lens.
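As a rough illustration of the word-frequency idea above, the following Python sketch counts every word in a short passage and then tallies a few pronoun groups. The sample sentence, the `PRONOUN_GROUPS` mapping and the function names are assumptions made for this example; they do not come from any of the tools mentioned on this page.

```python
import re
from collections import Counter

# Hypothetical pronoun groups chosen for this illustration only.
PRONOUN_GROUPS = {
    "she/her": {"she", "her", "hers"},
    "he/him": {"he", "him", "his"},
    "they/them": {"they", "them", "their", "theirs"},
}


def word_frequencies(text):
    """Return a Counter mapping each lower-cased word to its count."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def pronoun_counts(text):
    """Sum the word counts for each pronoun group."""
    freqs = word_frequencies(text)
    return {
        label: sum(freqs[w] for w in words)
        for label, words in PRONOUN_GROUPS.items()
    }


# Invented sample sentence standing in for a real text.
sample = "She gave him her book, and he thanked her. They left with their friends."
print(word_frequencies(sample).most_common(3))
print(pronoun_counts(sample))
```

A raw count like this is only a starting point: the researcher still has to judge what the balance of pronouns means in the context of the text.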

Word clouds: A collection of the most frequent words in a text, displayed as a cloud. This offers a snapshot of the text in one visualisation.
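A word cloud is essentially a list of the most frequent words with a display size attached to each. The sketch below computes that underlying data in plain Python without drawing anything. The small `STOP_WORDS` set, the `cloud_weights` function and the 1-to-5 size scale are assumptions for illustration, and tools such as Voyant may weight words differently.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list (an assumption for this sketch);
# real tools ship much longer ones.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "was"}


def cloud_weights(text, top_n=5, max_size=5):
    """Return (word, count, size) for the top_n most frequent words.

    size scales each count relative to the most frequent word, so the
    top word gets max_size and rarer words get proportionally less
    (never below 1).
    """
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS]
    top = Counter(words).most_common(top_n)
    if not top:
        return []
    highest = top[0][1]
    return [(word, count, max(1, round(max_size * count / highest)))
            for word, count in top]


# Invented sample passage standing in for a real text.
sample = ("The whale was in the sea. The sea is deep, and the whale "
          "is old. A ship followed the whale.")
for word, count, size in cloud_weights(sample, top_n=3):
    print(f"{word:<10} count={count} size={size}")
```

Removing very common words first is a design choice: without it, words like “the” would dominate the cloud and hide the vocabulary that is distinctive to the text.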

Topic modelling: Falls under the umbrella of text analysis but refers specifically to the grouping together of words or sentences relating to the same topic.

Distant reading: A method of collecting data through text analysis and using it to group together certain pieces of text or themes within a text in a quantitative way. It combines methods of literary analysis with computational methods to read a collection of texts at a “distance” in order to see the bigger picture. The term was coined by literary scholar Franco Moretti in 2000; however, versions of the method have been used throughout literary history, and the invention of the computer simply sped the process up (Underwood, 2017).
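To make the idea of reading at a “distance” slightly more concrete, the sketch below tracks how often one word appears, per 1,000 words, across a small collection of texts. The three-item `corpus`, its titles and the `rate_per_thousand` helper are invented placeholders, not real data; in practice the texts might come from a collection such as Project Gutenberg.

```python
import re


def rate_per_thousand(text, term):
    """Occurrences of term per 1,000 words of text (0.0 if text is empty)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return 1000 * words.count(term.lower()) / len(words)


# Invented placeholder texts standing in for a much larger collection.
corpus = {
    "Text A": "the city slept and the city woke",
    "Text B": "the river ran past the quiet farm",
    "Text C": "city lights city streets city noise",
}

for title, text in corpus.items():
    print(f"{title}: {rate_per_thousand(text, 'city'):.0f} per 1,000 words")
```

Using a rate rather than a raw count lets texts of very different lengths be compared on the same scale.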

References

Gupta, Ravi. “Wordle-Vision: Simple Analytics To Up Your Wordle Game”, Towards Data Science, 2022. https://towardsdatascience.com/wordle-vision-simple-analytics-to-up-your-wordle-game-65daf4f1aa6f

Reardon, Jed. “Text Analysis: An Overview”, MetoDHology, 2020. https://metodhology.anu.edu.au/index.php/content/text-analysis/

Underwood, Ted. “A Genealogy of Distant Reading”, Digital Humanities Quarterly 11, no. 2 (2017). http://digitalhumanities.org/dhq/vol/11/2/000317/000317.html