It is often remarked that 80% of the work of data science is data cleaning, while only 20% is analysis (Bowne-Anderson, 2018). Despite this, what data cleaning actually entails is largely obscured, and it is often dismissed as a tedious and laborious yet necessary exercise (Rawson and Muñoz, 2019). While definitions vary, data cleaning can be described broadly as a process of data standardisation or ‘detecting, diagnosing, and editing faulty data’ (Van den Broeck et al., 2005). Implicit in this language of data ‘cleaning’ is the inverse notion of ‘messy’ or ‘untidy’ data which needs to be organised.
At a mechanical level, that might mean filtering out unnecessary variables, or conflating slight variations of the same concept, such as differences in letter case, typos, and word inflections, or ‘pruning’ words down to their stem. More generally, however, data cleaning can necessitate imposing a normative order, a process of standardisation which some humanities scholars have seen as reductive and as having serious intellectual and ethical implications (Drucker, 2021: 30; Rawson and Muñoz, 2019). I argue that, rather than being necessarily reductive, the act of data cleaning itself has potential as a critical cultural practice.
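The mechanical steps named above (filtering variables, folding letter case, and pruning words to a stem) can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended pipeline: the records, the field names, and the suffix list are all invented for the example, the stemmer is deliberately crude, and typo correction is left out entirely. Its point is that each rule is a small interpretive decision about which differences count as ‘the same concept’.

```python
import re

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"id": 1, "occupation": "Teachers", "notes": "n/a"},
    {"id": 2, "occupation": "teacher ", "notes": ""},
    {"id": 3, "occupation": "TEACHING", "notes": "checked"},
]

KEEP = ("id", "occupation")           # variables retained; all others are filtered out
SUFFIXES = ("ing", "ers", "er", "s")  # deliberately crude stemming rules (an assumption)


def crude_stem(word: str) -> str:
    """Strip the first matching suffix; a toy stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(record: dict) -> dict:
    """Drop unwanted variables, fold case, strip non-letters, then stem."""
    kept = {key: record[key] for key in KEEP}
    token = re.sub(r"[^a-z]", "", kept["occupation"].lower())
    kept["occupation"] = crude_stem(token)
    return kept


cleaned = [clean(r) for r in records]
for row in cleaned:
    print(row)
```

After cleaning, the three spellings collapse into a single value and the `notes` field disappears. Whether ‘Teachers’ and ‘TEACHING’ ought to be treated as one category is exactly the kind of judgement that the code makes silently.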
The cultural challenges of data cleaning (Olteanu et al., 2019) are well exemplified by the processing of data pertaining to the Indigenous population in the Australian census, run by the Australian Bureau of Statistics (ABS). Aboriginal and Torres Strait Islander people are consistently underrepresented in the Census, and methods of data collection and analysis have come under scrutiny from social science and humanities scholars. Frances Morphy (2007b) argued that the Census unsuccessfully tried to model remote Aboriginal social relationships in terms of the foundational Western metaphor of a ‘bounded container’.
Observing the Indigenous Processing Team (IPT) at the Data Processing Centre in Melbourne, where forms from remote Indigenous communities were manually processed and standardised, Morphy (2007a) praised these efforts while highlighting the challenges of parameterisation and the ethics of designating Indigenous people as ‘disorder’ which can be remedied. Her account also reveals an inherent tension between maintaining consistency and commensurability and making sure the data is coherent in the context of the community it pertains to.
Judgements about how to handle missing data can likewise reflect political and cultural factors. For example, when a Census question asking whether people identify as Aboriginal or Torres Strait Islander is left blank, should this value be excluded? Or should it be ‘imputed’ by assuming what someone might have answered based on similar ‘donor’ data, or through data-matching with previous Census data? This continues to be debated, although the ABS currently makes a deliberate choice not to impute the missing data because Indigenous identity is understood as a matter of self-determination, which is ultimately what the question seeks to measure.
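The options described above can be compared side by side in a short Python sketch. Everything in it is hypothetical: the responses are invented, ‘same household’ is only one possible definition of a similar donor, and the ‘not stated’ label is an illustrative name. It is not a description of ABS procedure; it only shows how the same blank answers produce different counts depending on the rule chosen.

```python
from collections import Counter

# Hypothetical responses; None marks a question that was left blank.
responses = [
    {"household": "A", "identifies": "yes"},
    {"household": "A", "identifies": None},
    {"household": "B", "identifies": "no"},
    {"household": "B", "identifies": "no"},
    {"household": "C", "identifies": None},
]


def exclude(rows):
    """Option 1: drop records with a blank answer."""
    return [r for r in rows if r["identifies"] is not None]


def impute_from_donors(rows):
    """Option 2: fill a blank with the most common answer among 'donor'
    records from the same household (an assumed notion of similarity);
    the blank is kept when no donor exists."""
    filled = []
    for r in rows:
        value = r["identifies"]
        if value is None:
            donors = [
                d["identifies"]
                for d in rows
                if d["household"] == r["household"] and d["identifies"] is not None
            ]
            if donors:
                value = Counter(donors).most_common(1)[0][0]
        filled.append({**r, "identifies": value})
    return filled


def keep_as_not_stated(rows):
    """Option 3: keep every record and label blanks explicitly."""
    return [{**r, "identifies": r["identifies"] or "not stated"} for r in rows]


def tally(rows):
    """Count how many records fall under each answer."""
    return Counter(r["identifies"] for r in rows)


for strategy in (exclude, impute_from_donors, keep_as_not_stated):
    print(strategy.__name__, dict(tally(strategy(responses))))
```

With these five invented records, exclusion shrinks the population to three, donor imputation adds a ‘yes’ that nobody actually gave, and the third option preserves the silence as its own category. Only the last leaves the act of identification with the respondent.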
In recognising that cleaning data can suppress diversity, Rawson and Muñoz (2019) argue that as humanities scholars we should consult with the relevant communities, whether Indigenous communities, analysts at the Data Processing Centre or librarians, to unpack the concepts which structure the data and its relationship to other data. The tension between standardisation and ensuring data faithfully represents the phenomenon being studied will probably always persist, but being able to clearly articulate and justify decisions in the data cleaning stage is a valuable exercise, one that is uniquely well-suited to digital humanities scholars. Clearly describing the process of data cleaning not only makes it easily reproducible, which is valued in data science, but also enriches understanding of the topic at hand.
Bowne-Anderson H (2018) What Data Scientists Really Do, According to 35 Data Scientists. Harvard Business Review, 15 August. Available at: https://hbr.org/2018/08/what-data-scientists-really-do-according-to-35-data-scientists (accessed 27 May 2022).
Van den Broeck J, Cunningham SA, Eeckels R, et al. (2005) Data Cleaning: Detecting, Diagnosing, and Editing Data Abnormalities. PLOS Medicine 2(10). Public Library of Science: e267. DOI: 10.1371/journal.pmed.0020267.
Drucker J (2021) Cleaning and using data. In: The Digital Humanities Coursebook: An Introduction to Digital Methods for Research and Scholarship. Routledge.
Morphy F (2007a) The transformation of input into output: At the Melbourne Data Processing Centre. ANU ePress. Available at: https://openresearch-repository.anu.edu.au/handle/1885/32592 (accessed 27 May 2022).
Morphy F (2007b) Uncontained subjects: Population and household in remote Aboriginal Australia. Journal of Population Research 24(2): 163–184. DOI: 10.1007/BF03031929.
Olteanu A, Castillo C, Diaz F, et al. (2019) Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries. Frontiers in Big Data 2. Available at: https://www.frontiersin.org/article/10.3389/fdata.2019.00013 (accessed 27 May 2022).
Rawson K and Muñoz T (2019) Against Cleaning. In: Gold MK and Klein LF (eds) Debates in the Digital Humanities 2019. University of Minnesota Press.