Getting, Making and Cleaning Data: A Practical Experience



This post introduces and evaluates my personal experience of collecting data from different sources and cleaning different types of data.

Text is an important information carrier for human beings (Dragulanescu, 2002; Yang et al., 2018), and text analysis is an important tool for understanding and analysing human interaction (Bernard and Ryan, 1998).

I mainly collect textual data in two ways: from online platforms and manually by myself. Collecting data online is interesting and important, because people are willing, and used, to express and exchange their opinions and attitudes on the internet and on social media platforms. These data are useful, up-to-date information that reveals the public voice and acts as a weathervane for social development. For example, collecting posts on Reddit or Twitter about the Australian election allows you to analyse people’s attitudes towards candidates, predict trends in election results, and visualize online social networks between users. Additionally, online data is mostly open to the public and can be accessed by anyone. Gathering information from the web can be easy, with both Twitter and Reddit offering free APIs for users to collect posts. It can also be complex, depending on the platform’s security and privacy policies. For example, Meituan, the largest food delivery app in China, allows a web crawler to collect data only if official approval has been granted, and the collecting account is disabled after a certain number of reviews have been collected. In either case, programming skills are required.
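As a rough illustration of what happens after an API call returns, the sketch below flattens a Reddit-style JSON response into simple rows of text. The sample payload is made up, and the field names (`children`, `title`, `selftext`, `score`) are assumptions about the response layout that should be checked against the platform's own documentation. No network request is made here.

```python
import json

# Made-up sample, shaped like a Reddit-style "listing" response (assumed layout).
SAMPLE = """
{"data": {"children": [
  {"data": {"id": "a1", "author": "user_one", "title": "Election debate thread",
            "selftext": "Who won tonight?", "score": 12}},
  {"data": {"id": "b2", "author": "user_two", "title": "Polling update",
            "selftext": "", "score": 5}}
]}}
"""

def listing_to_rows(payload):
    """Flatten a listing-style payload into a list of plain dicts."""
    rows = []
    for child in payload.get("data", {}).get("children", []):
        post = child.get("data", {})
        # Join title and body so each post becomes one piece of text.
        text = (post.get("title", "") + " " + post.get("selftext", "")).strip()
        rows.append({
            "id": post.get("id"),
            "author": post.get("author"),
            "text": text,
            "score": post.get("score", 0),
        })
    return rows

rows = listing_to_rows(json.loads(SAMPLE))
for row in rows:
    print(row["id"], "|", row["text"])
```

Each row can then be saved to a file or passed on to the cleaning steps.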

Collecting your own dataset can be time-consuming and cumbersome to document (research involving human participants, for example, requires official ethics approval), but it can give you the most relevant data for your experimental purposes. I created my own dataset of 20 interviews, in which participants were asked to verbalize every thought they had during a simulated food ordering process, such as their reasons for choosing or not choosing a restaurant. I recorded each interview and transcribed it into text myself.

Data is simply a collection of different facts. It includes not only structured data such as numbers but also unstructured data in the form of text, images, audio, video, etc. (Feldman and Sanger, 2007). Since unstructured data cannot be used directly for research, we need to convert it into structured data that computers can understand and process.
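One common way to turn text into structured data is to count how often each word occurs in each document. The following is a minimal sketch of that idea, using only the Python standard library. The two example sentences and the function names are made up for illustration, and this is not necessarily the exact representation I used.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def document_term_matrix(docs):
    """Return (vocabulary, matrix), with one row of word counts per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[w] for w in vocab] for c in counts]
    return vocab, matrix

docs = ["The food was cold", "Fast delivery and hot food"]
vocab, matrix = document_term_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each document becomes a row of numbers, which is a form that statistical models can work with.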

Data cleaning deals with detecting and eliminating errors, inconsistencies and unanalysable parts of the data to improve its quality (Ilyas and Chu, 2019; Rahm and Do, 2000). I taught myself how to use Python for this, but the more challenging part was finding the right list of stop words: common words in a language that carry little useful information and need to be removed. Examples of stop words in English are ‘a’, ‘the’, ‘is’ and ‘are’. Different languages and topics require different stop word lists, so if you can’t find a suitable one you need to build your own. Similarly, when it comes to semantic analysis, it is difficult to find a suitable sentiment dictionary or pre-trained model. This is because the meanings of words differ between languages and topics, and different contexts have their own proper nouns. What I did was compare the accuracy of different stop word lists and different semantic dictionaries on my food delivery review dataset. So far, I have trained a customised Naive Bayes model on my own dataset and, surprisingly, its accuracy is around 94%.
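To make the idea concrete, here is a minimal sketch of stop word removal followed by a small Naive Bayes classifier with add-one smoothing, written with the standard library only. The stop word list, the four training reviews and the labels are invented for illustration. This is not my actual pipeline or dataset, and a toy example like this says nothing about the 94% figure.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative stop word list; a real project needs one suited to its language and topic.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "and", "it", "very"}

def tokenize(text):
    """Lowercase, split into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def train(samples):
    """samples is a list of (text, label) pairs; returns the counts used for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        label_counts[label] += 1
        for w in tokenize(text):
            word_counts[label][w] += 1
            vocab.add(w)
    return label_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability under Naive Bayes."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in tokenize(text):
            if w not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen word/label pairs.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up toy reviews, far too few for a real model.
train_data = [
    ("The food was delicious and hot", "pos"),
    ("Fast delivery, tasty meal", "pos"),
    ("The food was cold and late", "neg"),
    ("Terrible delivery, cold meal", "neg"),
]
model = train(train_data)
print(predict(model, "The meal was cold"))
print(predict(model, "Delicious and hot food"))
```

Swapping in a different `STOP_WORDS` set and re-measuring accuracy on held-out reviews is one way to compare stop word lists.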

For the interviews, I wanted to know which factors actually influence purchase intention. I converted the transcripts into numbers, marking each key factor as 1 if it was mentioned and 0 if it was not. I found that the main challenge was prioritizing the factors that people mentioned. For example, a participant might say at the beginning that he or she wants to sort restaurants by distance, so that all of the following decisions are shaped by that distance preference. Or a participant might verbally emphasize the decisive role of a specific factor, such as personal preference. In these cases, the factor needs to be given more weight. However, the importance that people verbally assign to a factor may differ from how they actually behave. It is therefore difficult to set a standard, and there may be subjective bias when I convert this information into numbers.
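The coding scheme described above can be sketched roughly as follows. The factor names, the keyword lists and the emphasis weight of 2 are all hypothetical. In practice the coding was done by reading each transcript, and choosing the weight is exactly the subjective step discussed above.

```python
import re

# Hypothetical codebook: factor name -> words that count as a mention.
FACTORS = {
    "distance": {"distance", "near", "far", "close"},
    "price": {"price", "cheap", "expensive"},
    "rating": {"rating", "stars", "reviews"},
}

def code_transcript(text, emphasized=(), emphasis_weight=2):
    """Code one transcript as {factor: value}.

    0 = not mentioned, 1 = mentioned, emphasis_weight = mentioned and
    flagged by the coder as emphasized (the weight is an arbitrary choice).
    """
    words = set(re.findall(r"[a-z']+", text.lower()))
    row = {}
    for factor, keywords in FACTORS.items():
        if not (words & keywords):
            row[factor] = 0
        elif factor in emphasized:
            row[factor] = emphasis_weight
        else:
            row[factor] = 1
    return row

transcript = "First I sort by distance because I want something close. The price looks fine."
print(code_transcript(transcript))
print(code_transcript(transcript, emphasized={"distance"}))
```

Keeping the emphasis flag separate from the keyword matching makes it explicit which numbers come from the transcript itself and which come from the coder's judgement.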


Bernard, H.R. and Ryan, G., 1998. Text analysis. Handbook of methods in cultural anthropology, 613.

Dragulanescu, N.G., 2002. Website quality evaluations: Criteria and tools. The International Information & Library Review, 34(3), pp.247-254.

Ilyas, I.F. and Chu, X., 2019. Data cleaning. Morgan & Claypool.

Feldman, R. and Sanger, J., 2007. The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press.

Rahm, E. and Do, H.H., 2000. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4), pp.3-13.

Yang, Z., Zhang, P., Jiang, M., Huang, Y. and Zhang, Y.J., 2018, June. Rits: Real-time interactive text steganography based on automatic dialogue model. In International Conference on Cloud Computing and Security (pp. 253-264). Springer, Cham.
