The aim of this lesson is to give the reader an overview of some of the different approaches to text analysis and the basics needed to get started, as well as to provide some inspiration.
What is text analysis?
Text analysis is an umbrella term for a wide variety of methods that all use computational tools to analyse text. Googling 'text analysis' will turn up many different terms, including: keyword frequency, distant reading, text similarity, topic modelling, sentiment analysis, text mining, stylometric analysis, and natural language processing. [1] All these terms reflect the tools used for text analysis and the context of the texts. For example, distant reading comes from the application of text analysis tools to literature.
Text analysis is used in various disciplines including digital humanities, social sciences, history, linguistics, and political science. It is also used in non-academic areas; for example, sentiment analysis is widely used for analysing responses to customer satisfaction surveys. Text analysis in the form of natural language processing has a role in the building of artificial intelligence and in mining large amounts of data such as that contained on social media.
What sort of things are texts? Texts don’t have to be literary, although many available digitised texts are copies of works by well known authors. A text can be any of a variety of documents including speeches, diaries, letters, academic writing, reports or biographical information (for example the Australian Dictionary of Biography). Other types of text on which text analysis can be used include dictionaries, collections of book titles, and large data repositories. Many online text collections are perfect for text analysis, including collections of blogs, discussion boards or forums, Twitter, and fanfiction on a site like AO3 (Archive of Our Own). [1] Text can refer to a single work, perhaps one novel or speech, or it can refer to a corpus: a large collection of works like the published works of Charles Dickens or all the tweets using a single hashtag. Thinking about the size of your text and what material it represents is important when deciding which tool to use to analyse it.
What does text analysis software do? How does it work? Essentially the software reviews the text and looks for features you have told it to find, whether those are particular words or phrases. Depending on the tool, the results are sorted in different ways that reveal patterns in the text. [1] Interpreting the data from text analysis is the important part: you need to be clear on what the tool has been asked to do and what this means in the context of your text. For example, sentiment analysis looks for words or phrases that indicate the tone of a debate and the positive or negative language being used; what does a negative tone tell you about the text you’re looking at? If it’s a parliamentary speech, that tone might say something about the politics surrounding the topic of the speech. Topic modelling applied to a collection of diary entries, on the other hand, might discover general themes in the diary that can be used to interpret the writer’s attitudes and experiences.
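As a minimal illustration of the sort of thing such software does, the short Python sketch below counts how often a few chosen words appear in a text; the file name and the keyword list are placeholders for whatever text and features you are interested in.

```python
# A minimal sketch of keyword counting, one of the simplest forms of text analysis.
# "speech.txt" and the keyword list are placeholders for your own text and features.
import re
from collections import Counter

with open("speech.txt", encoding="utf-8") as f:
    text = f.read().lower()

words = re.findall(r"[a-z']+", text)   # crude tokenisation: keep runs of letters
counts = Counter(words)                # frequency of every word in the text

for keyword in ["economy", "climate", "education"]:
    print(keyword, counts[keyword])    # how often each chosen word appears
```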
Downsides of text analysis? The important part of text analysis using computational tools is interpreting the data. For this you need a good idea of the broader context of your text and its purpose. Otherwise you might get conclusions from text analysis that are not supported by a more traditional reading of the text. In practice, that means that the findings you have about word patterns in the text do not make sense or do not fit within the meaning of the text when it is considered back in its context. An example is using text analysis on a series of political speeches. Unless you have the context in which these speeches were given (perhaps they all come from debate on one issue or piece of legislation), the findings of your text analysis may not make sense. That is not to say you need to have personally read everything that you are putting through your text analysis tool; you just need to be specific about what you are looking for and have information to help you interpret your findings.
It’s also important to tailor the tool to the text. Text analysis tools all do different jobs and some may not be suitable for the text you have. Topic modelling can suit large amounts of text produced in a specific context, such as diaries or political speeches. It can be useful to track themes in the text and the patterns in how these themes appear. But it would not be suitable for looking at whether a Twitter hashtag or a Reddit thread uses positive or negative language around an issue; for this you would need something like sentiment analysis.
Where is text analysis not suitable? Text analysis may not be as useful for texts that are highly repetitive or made up of very similar words, such as census information, lists, or some collections of records. Preparing data for text analysis can also be a time-consuming process, so if your data is not already suited to the tool you want to use, you may have to assess the time and effort it will take to convert the data and whether this is worthwhile. A further consideration is whether a text contains structural elements, such as formatting, that are important for its analysis. These elements may be lost or not recognised in text analysis, which can limit how useful text analysis is for that text.
What can you use text analysis to do?
Text analysis can be used for a variety of research, everything from detailed examination of one text to looking for patterns in a corpus. Text analysis allows a different way of reading a text compared to a human reading or close reading. Some of the things text analysis can look at include:
- Analysis of large text collections, including sorting texts and finding relevant text
- Themes in text
- Tone of the words used in a thread on social media
- Word frequency, showing the use of certain language or phrases
Text analysis can save time when a researcher is considering a large corpus of work, because in an age when so much information is available it is impossible to read and consider everything. Text analysis can provide a quick way of searching a large collection of words for themes, patterns of word usage, or tone which might help to answer a research question. It is particularly useful for considering changes in language over time, the development of themes over time, or shifts in ways of writing or speaking. Text analysis can also help a researcher locate an important text within a corpus: an electronic needle in a haystack.
Themes and tone are particularly important in considering the debates and discussions held on social media or in online forums. These public collections of words, reactions, microblogs and debate are now being examined with text analysis tools to look at the way issues are debated online and the kind of language being used. But themes and tone can also be examined in other collections of text. Political speeches and parliamentary debate are an area of growing study using text analysis, while the increased digitisation of the works of published authors allows collections to be examined in new ways with text analysis tools.
Text analysis can start simply, however, with looking at word frequency in a piece of text and building on this to create representations of key words such as word clouds or frequency graphs. Even at this basic level, text analysis can show how a text is shaped by the themes underlying the writing; it gives a broader view of the text and allows for a quick demonstration to others of its constituent parts.
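As a rough sketch of this kind of basic analysis, the Python snippet below counts word frequencies in a plain-text file and draws a simple frequency graph; the file name is a placeholder, and it assumes the matplotlib plotting library is installed.

```python
# A sketch of a simple frequency graph: count the words in a plain-text file and
# chart the most common ones. "novel.txt" is a placeholder for your own file.
import re
from collections import Counter
import matplotlib.pyplot as plt

with open("novel.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

top = Counter(words).most_common(20)   # the 20 most frequent words
labels, values = zip(*top)

plt.barh(labels, values)               # horizontal bar chart of word frequencies
plt.xlabel("Occurrences")
plt.title("Most frequent words")
plt.tight_layout()
plt.show()
```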
Jargon busting
There are many terms used to describe types of text analysis. Here are some you may come across:
Distant reading is the name given to using computational tools to analyse literary text databases. Generally attributed to Franco Moretti, distant reading focuses on literature, but its principles can be applied to a variety of texts including parliamentary Hansard transcripts.
Topic modelling software like MALLET sorts the words in a text into collections based on a statistical analysis of how often words appear together. The result is a set of ‘topics’: groups of words that tend to occur together across the text.
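MALLET itself is a command-line tool (see the Programming Historian lesson linked below), but the general idea can be sketched in Python using the gensim library; the tiny example 'documents' here are invented placeholders and a real corpus would be far larger.

```python
# A toy topic-modelling sketch using the gensim library (an alternative to MALLET);
# the documents are invented placeholders, and real corpora would be much bigger.
from gensim import corpora, models

documents = [
    "the harvest was poor and wheat prices rose",
    "members debated the wheat tariff in parliament",
    "the debate in parliament turned to taxation",
]
texts = [doc.split() for doc in documents]             # naive tokenisation
dictionary = corpora.Dictionary(texts)                 # map each word to an id
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words counts per document

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, topic_words in lda.print_topics():
    print(topic_id, topic_words)                       # each topic is a weighted group of words
```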
Sentiment analysis, or opinion mining, is text analysis that looks at the sentiment expressed in a text and can reveal the views held by a group or individual. It has plenty of commercial uses, for example on reviews, comments or customer surveys, but it can also be used for academic work, analysing large collections of opinion texts such as blogs or tweets.
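One of many possible approaches is a lexicon-based scorer such as NLTK's VADER; the sketch below, with placeholder sentences standing in for tweets, reviews or survey responses, shows the general idea.

```python
# A sketch of sentiment analysis using NLTK's VADER lexicon (one of many approaches);
# the example sentences are placeholders for tweets, reviews, or survey responses.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")                 # a lexicon of words scored for sentiment

analyser = SentimentIntensityAnalyzer()
for comment in ["The service was wonderful and quick.",
                "This policy is a disaster for ordinary families."]:
    scores = analyser.polarity_scores(comment)  # negative/neutral/positive/compound scores
    print(comment, "->", scores["compound"])
```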
Stylometric analysis, or stylometry, is the study of the linguistic style of a text and can be used to address questions of authorship, authenticity or other aspects of an author’s style. A range of software is available for stylometry, including free tools built on Python and R.
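As a very small illustration of the kind of signal stylometry relies on, the sketch below compares the relative frequency of common function words in two texts, a feature often used in authorship studies; the file names and the word list are placeholders.

```python
# A very small stylometry sketch: compare the relative frequency of common
# function words in two texts, a signal often used in authorship studies.
# "text_a.txt" and "text_b.txt" are placeholders for your own files.
import re
from collections import Counter

FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it", "was", "but"]

def profile(path):
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"[a-z']+", f.read().lower())
    counts = Counter(words)
    total = len(words)
    return {w: counts[w] / total for w in FUNCTION_WORDS}   # relative frequencies

for word, freq_a, freq_b in zip(FUNCTION_WORDS,
                                profile("text_a.txt").values(),
                                profile("text_b.txt").values()):
    print(f"{word:>6}  {freq_a:.4f}  {freq_b:.4f}")          # side-by-side comparison
```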
I want to do it! What do I do?
Understand the context of your text and how it is constructed. Ask yourself what your text is and how it sits within a broader context. Is it a diary? Letters? A novel? A set of speeches? Think about the construction of your text and the structural rules that govern it, and about the language used in the text and how it is used. For example, if you are looking at speeches made in Parliament about legislation, think about who made the speeches, what the legislation was about, when the speeches were made (language used in previous decades will differ from language used today), and whether there is any extra context to look out for (speeches recorded in parliamentary transcripts are edited, so there are no ‘ums’ and ‘ers’ and sentences are complete). All of these considerations will have a bearing on which tool is going to suit your work best and how you will interpret the results you get.
Hone your question – consider the tool you’re using and your text and ask yourself whether the two are suited. Think about what you want to do with your text. Is the text massive and you want a faster way to look at its themes? Or is the text smaller and you want to look specifically at the language used and the patterns that emerge in word frequency? Perhaps the text is made up of a lot of items (for example, tweets in a hashtag) and you want to see what language is common to all of them. Some tools are listed in this post and there are links to a number below, but there are a great many text analysis tools available, both open source and commercial. Knowing what you’re looking for in your research will help you narrow down which tool will work best with your text.
Choose your tool. Think about what the tool will do and how it will work. Different tools require different types of data and they can analyse data in various ways. Topic modelling using MALLET requires unstructured data, and for this plain .txt files work best. A simple and quick tool like VOYANT takes a URL or a PDF document. Python-based programs use unstructured text too, while other tools can work with PDFs. Some tools, like those developed for the HathiTrust database, work only on a particular database.
Choose your data and prepare it. Texts can come from a number of places (see below for some suggestions to get you started). There are lots of projects that create databases out of the collected works of a writer or writers, parliamentary speeches, or other publicly available data like court records or biographical data. You might have to prepare your text, depending on what tool you are using.
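As an example of what simple preparation can look like, the sketch below lowercases a text, strips punctuation and numbers, and removes a few common stop words before saving a plain .txt file; the file names and the stop-word list are placeholders, and many tools (MALLET included) offer their own stop-word handling that you may prefer to use instead.

```python
# A sketch of simple text preparation: lowercase the text, keep only runs of
# letters, drop a few common stop words, and save the result as plain text.
# File names and the stop-word list are placeholders for your own choices.
import re

STOP_WORDS = {"the", "and", "of", "to", "a", "in", "that", "it", "was", "is"}

with open("raw_text.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())      # strip punctuation and numbers

cleaned = " ".join(w for w in words if w not in STOP_WORDS)

with open("prepared_text.txt", "w", encoding="utf-8") as f:
    f.write(cleaned)                                       # plain .txt ready for a tool
```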
Some software to get started with
There is lots of text analysis software available – plenty of it free and open source, some proprietary. Here are a few free or open source tools that show the variety of text analysis software.
VOYANT https://voyant-tools.org/ VOYANT is a simple web-based tool into which a URL or PDF document can be loaded and analysed quickly and at a basic level. It is a good entry point into text analysis because no technical skill is required. While useful for generating word clouds and basic data about a text, VOYANT is not suited to in-depth or comparative research because of the lack of control over the analysis conducted by the program.
MALLET http://mallet.cs.umass.edu/index.php MALLET is used for topic modelling and, while it requires some skills (using the command prompt, installing software), it is not difficult to use. The Programming Historian website has an excellent lesson which can guide you through downloading and using MALLET (https://programminghistorian.org/en/lessons/topic-modeling-and-mallet).
PYTHON and R are used for a number of text analysis tools because they are open source programming languages. Some skill is required to use these tools and to write the code, and it may be that this investment of time is not suitable for a project. If you are interested, however, the Programming Historian website has lessons including Basic Text Processing in R (https://programminghistorian.org/en/lessons/basic-text-processing-in-r) and Introduction to Stylometry with Python (https://programminghistorian.org/en/lessons/introduction-to-stylometry-with-python).
AntConc https://www.laurenceanthony.net/software/antconc/ is a freely available tool used for word frequency and concordance (that is, showing how any particular word or phrase is used in context). The software has a number of supporting YouTube clips and guides, and some versions can handle multiple formats of text.
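For comparison, a concordance (keyword-in-context) view can also be produced in Python with NLTK's Text class, similar in spirit to what AntConc does; the file name and search word below are placeholders.

```python
# A sketch of a concordance (keyword-in-context) view using NLTK's Text class,
# similar in spirit to AntConc; "speeches.txt" and the search word are placeholders.
from nltk.text import Text

with open("speeches.txt", encoding="utf-8") as f:
    tokens = f.read().split()              # very simple whitespace tokenisation

# print each occurrence of the word with its surrounding context
Text(tokens).concordance("liberty", width=80, lines=10)
```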
For an idea of the sort of software that can be purchased, see:
NVIVO https://www.qsrinternational.com/nvivo-qualitative-data-analysis-software/about/nvivo NVIVO can be used to transcribe data and analyse it. Commercial software like this is used by businesses (for example, to analyse large amounts of customer survey data) as well as for academic research.
Leximancer https://info.leximancer.com/products-academic Web-based and able to handle a variety of text formats, Leximancer is similar to NVIVO in that it has commercial as well as academic research uses.
For even more text analysis tools, try this list from the Ohio State University: https://guides.osu.edu/DH/text
Places to get data to practise on
To Be Continued https://cdhrdatasys.anu.edu.au/tobecontinued/ the To Be Continued database has stories published in Australian newspapers and periodicals available for download in text format. Useful for Australian literature, particularly that from relatively unknown authors.
Trove at the National Library of Australia https://trove.nla.gov.au/ one of the largest digitised collections in the world, Trove includes books, newspapers and periodicals. Download is available in PDF or text format.
Data Foundry at the National Library of Scotland https://data.nls.uk/ digitised collections and metadata are available for a variety of publications. Download is available in several formats, including XML and original scans.
CLARIN-UK https://www.clarin.ac.uk/home is a collection of many different databases, including the British Parliamentary Corpus, the Bodleian Library, the British Library and many others. Text types and download formats vary depending on which database is being used.
HathiTrust Digital Library https://www.hathitrust.org/ and https://analytics.hathitrust.org/ have some free downloads, but an institutional login is required for full access. Downloads include PDF pages, and with full access whole books or works can be downloaded. HathiTrust Analytics has a number of digital tools for analysing the collection, and further information can be found in the Programming Historian lesson Text Mining with Extracted Features (HTRC) https://programminghistorian.org/en/lessons/text-mining-with-extracted-features.
JSTOR https://www.jstor.org/ has some downloads available, but requires institutional login for full access to its database and downloads.
Cool examples of text analysis
To Be Continued https://cdhrdatasys.anu.edu.au/tobecontinued/ is an example of text analysis carried out on the National Library of Australia’s Trove database. To Be Continued collects works published in Australian newspapers and periodicals into genre collections and allows sorting by author, publication date, and publication. The work used keywords to identify published stories and to sort them into the To Be Continued collection.
Martha Ballard’s diary http://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/ Martha Ballard was a midwife and healer in Maine, USA and kept a diary between 1785 and 1812. The diary has proved to be a wealth of information for social historians and came to prominence with the publication of A Midwife’s Tale: The Life of Martha Ballard based on her diary, 1785–1812 by historian Laurel Thatcher Ulrich in 1990. Martha Ballard’s diary has been digitised and researcher Cameron Blevins used topic modelling on the diary to discover patterns and themes within the diary’s almost 10,000 entries.
Analysing the Brexit debate through social media: topics, arguments, and attitudes https://www.aph.gov.au/About_Parliament/Senate/Whats_On/Seminars_and_Lectures/~/~/link.aspx?_id=8B299D56195D450198D08B0ECF5E9943&_z=z at a Senate Occasional Lecture on 8 March 2019, Professor Ken Benoit discussed his work analysing tweets and hashtags used throughout the Brexit debate. Professor Benoit’s work uses text analysis tools to analyse the language used in the debate and the tone of the arguments, finding that different tones were attached to the Leave and Remain arguments being made on Twitter.
Sheffield Digital Humanities Institute https://www.dhi.ac.uk/projects/ has a number of different text analysis projects based on large digitised collections.
Analysis of New Zealand Parliamentary Hansard transcripts https://theconversation.com/analysis-shows-how-the-greens-have-changed-the-language-of-economic-debate-in-new-zealand-144492 a project from the University of Canterbury, New Zealand, has used text analysis on New Zealand Parliamentary Hansard transcripts and examined the way in which language used in Parliamentary debate has changed over time.
References and further reading
[1] Text mining & text analysis. (2020, July 24). Retrieved from https://guides.library.uq.edu.au/research-techniques/text-mining-analysis/introduction
[2] Why Learn Text Analysis? (2020, May 15). Retrieved from https://docs.tdm-pilot.org/why-should-humanists-learn-text-analysis/
Image from https://unsplash.com/photos/ywqa9IZB-dU