5 Dec 2022
06:30 UTC

TWITTER-TOOLBOX: An algorithmic proposal for linguistic data collection and analysis

Computational methods have often been used in more areas than initially expected (BOSTROM, 2014, p. 42). Although analyzing the Portuguese language with computational tools is not new (SARDINHA, 2005), manual techniques for data collection and analysis remain common among linguists who do not work in Corpus Linguistics or Computational Linguistics. This limits the development of research with greater explanatory power in areas such as Sociolinguistics (SOUSA, 2022), because collecting and processing a large data sample, such as texts from the internet, takes a significant amount of time. The internet is an unrestricted source of texts (MOREIRA FILHO, 2021, p. 100), which come mainly from social networks. One example is Twitter, which has 19.05 million users in Brazil according to data from Statista, a German company specialized in market and consumer data. This makes Twitter a potential source of data for corpus construction, not only for description but also for linguistic analysis. Therefore, this study aims to present the interdisciplinarity between Linguistics and Computer Science by developing computational solutions to problems related to data collection and analysis in language-related areas of study. Using the Python programming language (ROSSUM; DRAKE, 2009), libraries available for it, in particular tweepy (ROESSLEIN, 2022) and NLTK (BIRD; LOPER; KLEIN, 2009), and natural language processing (NLP) techniques, a repository was built with algorithms for cleaning, organizing, and analyzing linguistic data. Moreover, because Twitter serves as the source of linguistic data, the work also focused on the step of collecting texts from that social network to build the corpus.
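As an illustration of the collection step described above, the following is a minimal sketch and not the repository's actual code. It assumes tweepy 4.x and its `Client.search_recent_tweets` method, a valid bearer token, and the search operators `lang:` and `-is:retweet`. The names `build_query` and `collect_tweets`, and the fields kept for each tweet, are hypothetical.

```python
from typing import Dict, Iterable, List


def build_query(keywords: Iterable[str], lang: str = "pt") -> str:
    """Combine keywords into one search query (hypothetical helper)."""
    # Multi-word keywords are quoted so they match as exact phrases.
    terms = [f'"{k}"' if " " in k else k for k in keywords]
    return f"({' OR '.join(terms)}) lang:{lang} -is:retweet"


def collect_tweets(bearer_token: str, keywords: Iterable[str],
                   limit: int = 100) -> List[Dict]:
    """Fetch recent tweets matching the keywords, keeping a few named fields."""
    import tweepy  # imported lazily so build_query works without the library

    client = tweepy.Client(bearer_token=bearer_token)
    response = client.search_recent_tweets(
        query=build_query(keywords),
        max_results=min(max(limit, 10), 100),  # clamp to the 10-100 range per request
        tweet_fields=["created_at", "lang"],
    )
    # response.data is None when nothing matches, hence the fallback to [].
    return [
        {"id": t.id, "created_at": t.created_at, "text": t.text}
        for t in (response.data or [])
    ]
```

A call such as `collect_tweets(token, ["a gente", "nós"])` would return a list of dictionaries ready to be written to a CSV or JSON corpus file. Access levels and available endpoints depend on the Twitter API plan in use.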
The algorithms proved effective and practical in the data processing steps: they enabled filtering by keywords through the creation of specific fields to build the corpus, cleaning the data by removing unwanted symbols, punctuation, and terms, and analysis using bigrams, an NLP technique. These results are shared in a public open-source repository (GOIS, 2022), in line with Open Science principles. Furthermore, the algorithms can serve as a basis for new modules that implement tools based on NLP techniques, such as tokenization and tagging, thus expanding the frontiers of interdisciplinarity between Linguistics and Computer Science.
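The cleaning and bigram steps can be sketched as follows. This is an illustrative example under stated assumptions, not the repository's implementation: it uses only the Python standard library so that it runs without extra downloads, treats URLs, mentions, and hashtags as unwanted symbols, and uses hypothetical function names (`clean`, `bigrams`, `bigram_counts`). With NLTK installed, `nltk.bigrams(tokens)` yields the same adjacent pairs.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumption: URLs, @mentions, and #hashtags count as unwanted symbols.
URL_OR_HANDLE = re.compile(r"https?://\S+|[@#]\w+")
# Punctuation, symbols, and underscores; accented letters are preserved.
NON_WORD = re.compile(r"[^\w\s]|_")


def clean(text: str, unwanted: Iterable[str] = ()) -> List[str]:
    """Lowercase, strip URLs/handles/punctuation, and drop unwanted terms."""
    text = URL_OR_HANDLE.sub(" ", text.lower())
    text = NON_WORD.sub(" ", text)
    drop = set(unwanted)
    return [tok for tok in text.split() if tok not in drop]


def bigrams(tokens: List[str]) -> List[Tuple[str, str]]:
    """Return adjacent token pairs, e.g. [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(tokens, tokens[1:]))


def bigram_counts(tweets: Iterable[str], unwanted: Iterable[str] = ()) -> Counter:
    """Count bigrams per tweet so pairs never span two different tweets."""
    drop = set(unwanted)  # materialize once so any iterable can be reused
    counts: Counter = Counter()
    for tweet in tweets:
        counts.update(bigrams(clean(tweet, drop)))
    return counts
```

Counting per tweet, rather than over one concatenated string, avoids spurious bigrams formed by the last word of one tweet and the first word of the next.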