Computational measures of linguistic diversity help us understand the linguistic landscape using digital language data. The contribution of this paper is to calibrate measures of linguistic diversity using restrictions on international travel resulting from the COVID-19 pandemic. Previous work has mapped the distribution of languages using geo-referenced social media and web data. The goal, however, has been to describe these corpora themselves rather than to make inferences about underlying populations. This paper shows that a difference-in-differences method based on the Herfindahl-Hirschman Index can identify the bias in digital corpora that is introduced by non-local populations. These methods tell us where significant changes have taken place and whether this leads to increased or decreased diversity. This is an important step in aligning digital corpora like social media with the real-world populations that have produced them.

1 Biases in digital language data

Data from social media and web-crawled sources has been used to map the distribution of both languages (Mocanu et al., 2013; Gonçalves and Sánchez, 2014; Lamanna et al., 2018; Dunn, 2020) and dialects (Eisenstein et al., 2014; Cook and Brinton, 2017; Dunn, 2019b,a; Grieve et al., 2019). This line of research is important because traditional methods have relied on census data and missionary reports (Eberhard et al., 2020; IMB, 2020), both of which are often out-of-date and can be inconsistent across countries. At the same time, we know that digital data sets do not necessarily reflect the underlying linguistic diversity of a country: the actual population of South Africa, for example, is not accurately represented by tweets from South Africa (Dunn and Adams, 2019). This becomes an important problem as soon as we try to use computational linguistics to tell us about people or language.
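As a concrete illustration of the diversity measure named in the abstract, a minimal sketch of the Herfindahl-Hirschman Index over language labels is given below. The function name and the tweet labels are illustrative assumptions, not drawn from the paper's data or code:

```python
from collections import Counter

def hhi(labels):
    """Herfindahl-Hirschman Index: the sum of squared proportions.
    Ranges from 1/k (k equally common languages) up to 1.0
    (a single language dominates), so higher HHI = lower diversity."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum((n / total) ** 2 for n in counts.values())

# Hypothetical per-tweet language labels for one country
tweets = ["en"] * 80 + ["mi"] * 15 + ["zh"] * 5
concentration = hhi(tweets)  # ≈ 0.665: one language dominates
```

A diversity score can then be derived as, e.g., `1 - hhi(labels)`, so that larger values mean a more even mix of languages.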
For example, if an application is using Twitter to track sentiment about COVID-19, that tracking is meaningless without good information about how well the sample represents the population. Or, if an application is using Twitter to study lexical choices, that study depends on a relationship between lexical choices on Twitter and lexical choices more generally. In other words, the more we use digital corpora for scientific purposes, the more we need to control for bias in that data. There are four sources of diversity-related bias that we need to take into account.

First, production bias occurs when one location (like the US) produces so much digital data that most corpora over-represent that location (Jurgens et al., 2017). For example, by default a corpus of English from the web or Twitter will mostly represent the US and the UK (Kulshrestha et al., 2012). It has been shown that this type of bias can be corrected using population-based sampling (Dunn and Adams, 2020) to enforce the representation of all relevant populations.

Second, sampling bias occurs when a subset of the population produces a disproportionate amount of the overall data. This type of bias has been shown to be closely related to economic measures: wealthier populations produce more digital language per capita (Dunn and Adams, 2019). By default, a corpus will contain more samples representing wealthier members of the population. Thus, this is similar to production bias, but with a demographic rather than a geographic scope.

Figure 1: Number of observations per country.

Third, non-local bias is the problem of over-representing those people in a place who are not from that place: tourists, aid workers, students, short-term visitors, etc. For example, in countries with low per-capita GDP (i.e., where local populations often lack internet access), digital language data is likely to represent outsiders like aid workers.
On the other hand, in countries with large numbers of international tourists (e.g., New Zealand), data sets are instead likely to be contaminated with samples from these tourists.

Fourth, majority language bias occurs when a multi-lingual population uses only some of its languages in digital contexts (Lackaff and Moner, 2016). Most often, majority languages like English and French are used online while minority languages are used in face-to-face contexts. The result is that even though an individual may be represented in a corpus, the full range of their linguistic behaviours is not. This is the only type of bias not quantified in this paper. For example, it is possible that changes in linguistic diversity are caused by a shift in behaviour rather than a shift in population characteristics.

Of the three sources of bias that we examine here, non-local bias is the most difficult to uncover (Graham et al., 2014; Johnson et al., 2016). We can identify production bias when the amount of data per country exceeds that country's share of the global population. In this sense, the ideal corpus of English would represent each country in proportion to its number of English speakers. Within a country, we can measure the amount of sampling bias by looking at how economic measures like GDP and rates of internet access correspond with the amount of data per person. Thus, we could use median income by zip code to ensure that the US is properly represented. But non-local bias is more challenging because we need to know which samples from a place like New Zealand come from speakers who are only passing through for a short time. Only with the widespread restrictions on international travel during the COVID-19 pandemic do we have access to a collection of digital language from which non-local populations are largely absent (Gössling et al., 2020; Hale et al., 2020).
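The production-bias check described above, comparing each country's share of the data against its share of the relevant population, can be sketched as follows. The function name and all country figures are illustrative assumptions, not values reported in the paper:

```python
def representation_ratios(data_counts, populations):
    """Ratio of each region's share of the corpus to its share of the
    total population; a ratio above 1 flags over-representation
    (production bias), below 1 flags under-representation."""
    total_data = sum(data_counts.values())
    total_pop = sum(populations.values())
    return {
        c: (data_counts[c] / total_data) / (populations[c] / total_pop)
        for c in data_counts
    }

# Hypothetical tweet counts and populations (millions), for illustration only
data = {"US": 600, "UK": 250, "NG": 50, "IN": 100}
pop = {"US": 330, "UK": 67, "NG": 206, "IN": 1380}
ratios = representation_ratios(data, pop)  # US, UK > 1; NG, IN < 1
```

Population-based sampling, as in the corrections cited above, would amount to drawing data until every ratio is pushed toward 1.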
This paper uses changes in linguistic diversity during these travel restrictions, measured against a historical baseline, to calibrate computational measures that support language and population mapping. This is part of the larger problem of estimating population characteristics from digital language data. We start by describing the data used for the experiments, drawn from Twitter over a two-year period (Section 2). We then explore sources of bias in this data set by quantifying production bias and sampling bias (Section 3) and establishing a baseline of temporal variation in the data (Section 4). We introduce a measure of geographic linguistic diversity (Section 5) and use this measure to find which countries and languages are most contaminated by non-local populations (Section 6). Finally, we examine the results to find where the linguistic landscape has changed during the COVID-19 pandemic.
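The difference-in-differences logic behind the calibration can be sketched in its simplest two-period form: the change in a treated country's concentration score across the travel restrictions, minus the change in a comparison baseline over the same period, so that any shared temporal trend cancels out. The HHI values below are hypothetical, not results from the paper:

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Two-period difference-in-differences: the change in the treated
    series minus the change in the control series. A non-zero result
    suggests an effect beyond the trend shared by both series."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# Hypothetical HHI values before/after travel restrictions: the treated
# country's concentration rose more than the baseline trend predicts,
# consistent with non-local speakers having inflated its diversity.
effect = diff_in_diff(0.60, 0.72, 0.61, 0.62)  # ≈ 0.11
```

In the paper's setting, the "treatment" is the absence of non-local populations, and the pre-period comes from the historical baseline described above.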
Jonathan Dunn, Tom Coupe, Benjamin Adams