Privacy is a growing challenge in the era of Big Data, a period regularly marked by news about the unauthorized or legally murky access, use, and sale of information entrusted to institutions, including some information safeguarded by law. The issue has become more common due to the combination of proliferating electronic records of different kinds and a growing technological capacity to mine and combine data in unprecedented ways. This could lead to the specific and unauthorized identification of individuals or, more broadly, to the exploitation of their information in ways that violate privacy or confidentiality more “passively.”
This is a particularly important consideration when the data
in question are part of the federal statistical system, including efforts like the
decennial Census of Population and Housing and the American Community Survey, where
records for the whole population, or for millions of people, are kept and disseminated
under a strict and legally binding promise of confidentiality and privacy. As discussed
in recent news coverage of the citizenship question
that was ultimately not added to the 2020 Census, keeping the public’s trust in
this effort is of paramount importance.
To guard against the possibility of misuse of our data, the Census Bureau is planning to use a masking technique that, roughly described, adds random noise to its forthcoming data products. The goal is for estimates of the population size and composition of the nation and of many geographic units used by the Census Bureau to remain largely unchanged relative to the underlying true data, while some characteristics of individuals are masked or “distorted,” thus reducing the likelihood of unauthorized identification and other privacy violations.
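The rough idea described above can be illustrated with a toy sketch. This is not the Census Bureau's actual algorithm: the function name `mask_counts`, the choice of Laplace-distributed noise, the `scale` parameter, and the rescaling step that holds the overall total fixed are all assumptions made purely for illustration.

```python
import random


def mask_counts(true_counts, scale=2.0, seed=0):
    """Toy masking: perturb each count, then restore the overall total.

    Illustrative only; every design choice here is an assumption,
    not the Census Bureau's method.
    """
    rng = random.Random(seed)
    total = sum(true_counts)

    # The difference of two exponential draws is Laplace-distributed
    # noise centered on zero. Negative results are clipped to zero.
    noisy = [
        max(0.0, c + rng.expovariate(1 / scale) - rng.expovariate(1 / scale))
        for c in true_counts
    ]

    noisy_sum = sum(noisy)
    if noisy_sum == 0:
        # Degenerate case: spread the total evenly across units.
        noisy = [1.0] * len(true_counts)
        noisy_sum = float(len(true_counts))

    # Rescale so the masked counts add up to the true total, then round
    # down and hand the leftover units to the largest fractional parts.
    scaled = [x * total / noisy_sum for x in noisy] if noisy_sum else []
    masked = [int(x) for x in scaled]
    shortfall = total - sum(masked)
    by_fraction = sorted(
        range(len(scaled)), key=lambda i: scaled[i] - masked[i], reverse=True
    )
    for i in by_fraction[:shortfall]:
        masked[i] += 1
    return masked


# Hypothetical block-level counts for one larger unit.
true_counts = [12, 0, 45, 7, 30]
masked_counts = mask_counts(true_counts)
```

In this sketch the individual counts generally change, while their sum matches the true total by construction, which mirrors the trade-off described above in a very simplified form.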
Despite its societal importance and value for individuals,
privacy also has trade-offs that affect scientific endeavors. To explore this issue
in the context of forthcoming 2020 Census data products, last December the
Census Bureau, in collaboration with the Committee on National Statistics
(CNSTAT) at the National Academies of Sciences, Engineering, and Medicine, organized
a workshop in which experts compared unmasked
2010 data products to an analogous “private demonstration” dataset that uses the
masking technique under consideration for the 2020 round of Census products.
CUPC affiliate and Associate Professor of Geography Seth Spielman (who also currently serves as Chief Data Strategist and Analytics Officer at the CU office of Data Analytics) was an invited participant in the workshop, where he presented related work done in collaboration with, and led by, David van Riper, director of spatial analysis at the Minnesota Population Center. In their analyses of many different types of population characteristics, notably including segregation indices, David and Seth examine whether the masked data still accurately depict the characteristics of geographies at different levels of aggregation, from state to census block, and across space, from sea to shining sea.
This is a particularly important evaluation because, as van
Riper and Spielman nicely argue and illustrate in their
presentation, the masking technique used in the private demonstration data attempts
to keep (and, they show, often does keep) counts of key populations at fairly
accurate levels across a “spatial” hierarchy of geographic units that includes Census
regions, divisions, states, counties, census tracts, block groups, and census blocks.
Despite these strengths, the technique does not explicitly strive for (and thus
does not necessarily ensure or achieve) this type of accuracy for geographies that recombine
spatial units within the hierarchy described above into other types of very
important zonings, such as voting districts, Public Use Microdata Areas, or
American Indian, Alaska Native, and Native Hawaiian Areas. Likewise, the
technique does not ensure that the noise added to specific geographic
units is randomly distributed across space, which results in distorted spatial patterns
(and, potentially, distorted relationships to other variables) when using the masked data.
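The contrast between geographies inside and outside the hierarchy can be made concrete with a small sketch. All numbers below are invented for illustration and do not come from the workshop or from any Census product; the masked counts were hand-picked so that tract totals match the true totals while totals for a hypothetical district zoning that cuts across tracts do not.

```python
from collections import defaultdict

# Each row: (tract, district, true_count, masked_count).
# Tracts sit inside the hierarchy; the districts are a hypothetical
# zoning that recombines blocks across tracts. All values are made up.
blocks = [
    ("T1", "D1", 10, 13),
    ("T1", "D2", 20, 17),
    ("T2", "D1", 15, 18),
    ("T2", "D2", 25, 22),
]


def totals(rows, key_index, value_index):
    """Sum one count column after grouping rows by one geography column."""
    out = defaultdict(int)
    for row in rows:
        out[row[key_index]] += row[value_index]
    return dict(out)


def masking_errors(rows, key_index):
    """Masked total minus true total for every unit of a geography."""
    true_totals = totals(rows, key_index, 2)
    masked_totals = totals(rows, key_index, 3)
    return {unit: masked_totals[unit] - true_totals[unit] for unit in true_totals}


tract_errors = masking_errors(blocks, 0)     # {'T1': 0, 'T2': 0}
district_errors = masking_errors(blocks, 1)  # {'D1': 6, 'D2': -6}
```

In this toy case the block-level noise cancels within each tract but piles up in one direction within each district, which is also a simple instance of noise that is not randomly distributed across space.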
We hope that the work of van Riper, Spielman, and the several
other demographers and social scientists who took part in the workshop helps the
Census Bureau and other institutions continue to balance the fundamental right to
privacy with the accuracy needed in population counts and their composition.