Privacy is a growing challenge in the era of Big Data, a period punctuated by news about the unauthorized or legally murky access to, use of, and sale of information entrusted to institutions, including some information safeguarded by law. The issue has become more common because of the proliferation of electronic records of different kinds, combined with a growing technological capacity to mine and combine data in unprecedented ways. Together, these could lead to the specific and unauthorized identification of individuals or, more broadly, to the exploitation of their information in ways that violate privacy or confidentiality more “passively.”
This is a particularly important consideration when the data in question are part of the federal statistical system, including efforts like the decennial Census of population and housing and the American Community Survey, where records for the whole population, or otherwise for millions of people, are kept and disseminated under a strict and legally binding promise of confidentiality and privacy. As recent news coverage of the citizenship question that was ultimately not added to the 2020 Census made clear, keeping the public’s trust in this effort is of paramount importance.
To guard against the possibility of misuse of our data, the Census Bureau is planning to use a masking technique that, roughly described, adds random noise to its forthcoming data products. The goal is for estimates of the population size and composition of the nation, and of many geographic units used by the Census Bureau, to remain largely unchanged relative to the underlying true data, while some characteristics of individuals are masked or “distorted,” thus reducing the likelihood of unauthorized identification and other privacy violations.
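To make the idea of "adding random noise" concrete, here is a minimal Python sketch. It is not the Census Bureau's actual procedure: the block names, the counts, the noise scale, and the choice of Laplace-distributed noise are all assumptions made for illustration, and the sketch makes no attempt to keep totals consistent across geographic levels.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at zero.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def mask_counts(true_counts, scale, rng):
    """Return noisy, non-negative integer counts with the same keys."""
    return {
        unit: max(0, round(count + laplace_noise(scale, rng)))
        for unit, count in true_counts.items()
    }


# Hypothetical block-level population counts (made up for this example).
true_blocks = {"block_A": 12, "block_B": 340, "block_C": 0, "block_D": 57}

rng = random.Random(2020)  # fixed seed so the sketch is repeatable
masked_blocks = mask_counts(true_blocks, scale=2.0, rng=rng)

for unit, true_count in true_blocks.items():
    print(unit, true_count, masked_blocks[unit])
```

In this toy version, small units such as a block of 12 people can change by a noticeable share of their size, while the same amount of noise barely matters for a block of 340. That asymmetry is one reason the accuracy of masked data depends on the size of the geography being studied.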
Despite its societal importance and value for individuals, privacy also involves trade-offs that affect scientific work. To explore this issue in the context of forthcoming 2020 Census data products, last December the Census Bureau, in collaboration with the Committee on National Statistics (CNSTAT) at the National Academies of Sciences, Engineering, and Medicine, organized a workshop in which experts compared unmasked 2010 data products to an analogous “private demonstration” dataset that uses the masking technique under consideration for the 2020 round of Census products.
CUPC affiliate and Associate Professor of Geography Seth Spielman (who also currently serves as Chief Data Strategist and Analytics Officer at the CU office of Data Analytics) was an invited participant in the workshop, where he presented collaborative work led by David van Riper, director of spatial analysis at the Minnesota Population Center. In their analyses of many different types of population characteristics, notably including segregation indices, David and Seth examine how well the masking technique still depicts the characteristics of geographies at different levels of aggregation, from state to census block, and across space, from sea to shining sea.
This is a particularly important evaluation because, as van Riper and Spielman nicely argue and illustrate in their presentation, the masking technique used in the private demonstration data attempts to keep (and, they show, often does keep) counts of key populations at fairly accurate levels across a “spatial” hierarchy of geographic units that includes Census regions, divisions, states, counties, census tracts, block groups, and Census blocks. Despite these strengths, the technique does not explicitly strive for, and thus does not necessarily ensure or achieve, this type of accuracy for geographies that recombine spatial units within that hierarchy into other very important zonings, such as voting districts, Public Use Microdata Areas, or American Indian, Alaska Native, and Native Hawaiian Areas. Likewise, the technique does not ensure that the noise added to specific geographic units is randomly distributed across space, which results in distorted spatial patterns (and, potentially, distorted relationships to other variables) when using the masked data.
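One way to see why zones assembled from many small units can end up less accurate is a small simulation. The Python sketch below is only an illustration under strong assumptions: it gives every block its own independent Laplace noise (an assumed mechanism, not the one used in the demonstration data), uses a made-up noise scale, and treats a custom zone as a simple sum of 400 hypothetical blocks.

```python
import random


def laplace_noise(scale, rng):
    # Difference of two independent exponential draws with mean `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def mean_abs_error_of_sum(n_units, scale, trials, rng):
    """Average absolute error in a total built by summing n_units noisy counts.

    Because the noise is additive, the error of the summed total does not
    depend on the true counts, so only the noise terms are simulated.
    """
    total_error = 0.0
    for _ in range(trials):
        summed_noise = sum(laplace_noise(scale, rng) for _ in range(n_units))
        total_error += abs(summed_noise)
    return total_error / trials


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# A single block versus a hypothetical custom zone made of 400 blocks.
err_single_block = mean_abs_error_of_sum(1, 2.0, 500, rng)
err_custom_zone = mean_abs_error_of_sum(400, 2.0, 500, rng)

print("single block:", round(err_single_block, 1))
print("zone of 400 blocks:", round(err_custom_zone, 1))
```

Under these assumptions the error of the zone total grows with the number of blocks that are summed. A unit inside the hierarchy can have its total handled directly, whereas a zone that cuts across the hierarchy inherits the accumulated noise of its pieces.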
We hope that the work of van Riper, Spielman, and the several other demographers and social scientists who took part in the workshop helps the Census Bureau and other institutions continue to balance the fundamental right to privacy with the accuracy needed in population counts and their composition.