Tweetorial: Formal Privacy for Social Scientists

Tweetorial: formal privacy for social scientists. If you collect, publish or analyze data, understand the revolution happening in safe data publication. Stat agencies, @Google, @Apple, @Microsoft, @Facebook, @LinkedIn are all struggling with the same problem.
#dataprivacy
2. What is formal privacy? Mathematical definitions and theorems that translate concepts from cryptography into algorithms that provably bound the worst-case information leakage due to the publication of a collection of statistics using confidential data.
#differentialprivacy
3. What is information leakage? Think of the confidential data as an encrypted message. Published statistics are clues to its contents (deliberately: they describe properties of the data). The more statistics published, the closer one gets to full knowledge of the confidential data.
4. This is called #databasereconstruction. Original paper: Dinur and Nissim 2003 http://www.cse.psu.edu/~ads22/privacy598/papers/dn03.pdf.
5. Easier read: @xchatty Garfinkel et al. 2018. (https://queue.acm.org/detail.cfm?id=3295691).
6. There is an unavoidable tension between publishing statistics and protecting confidentiality. Crypto lesson 1: publishing too many statistics, too accurately, leaks all the confidential data with near certainty. (https://arxiv.org/abs/1701.00752)
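A toy illustration of crypto lesson 1, with made-up numbers (a hypothetical block of three residents, not any real publication): four exactly published statistics are enough for a brute-force attacker to recover every confidential age.

```python
import itertools

# Toy confidential data: the exact ages of three residents of one block.
confidential_ages = (24, 37, 58)

# The "publication": four exact statistics about the block.
published = {
    "count":  len(confidential_ages),
    "min":    min(confidential_ages),
    "median": sorted(confidential_ages)[1],
    "mean":   sum(confidential_ages) / len(confidential_ages),
}

# Attacker's reconstruction: enumerate every possible set of three ages
# (0-99) and keep only those consistent with the published statistics.
consistent = [
    ages for ages in itertools.combinations_with_replacement(range(100), 3)
    if len(ages) == published["count"]
    and min(ages) == published["min"]
    and sorted(ages)[1] == published["median"]
    and abs(sum(ages) / 3 - published["mean"]) < 1e-9
]

print(consistent)  # [(24, 37, 58)] -- the confidential records, exactly recovered
```

Real reconstructions work the same way at scale: many exact statistics, solved as a system of constraints, leave only a handful of databases (often just one) that could have produced them.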
7. What’s the harm? Data are collected to be analyzed. #databasereconstruction rebuilds a record-level image of the confidential data outside the data curator’s firewall. Can individual records be re-identified from this image? Does the re-identification harm those individuals?
8. If I don’t publish the name, can it be reconstructed? No. If I don’t publish the address, can it be reconstructed? No. Telephone number? No. Social Security Number? No. Email address? No.
9. Then why worry? Because traditional methods of preventing re-identification assumed that reconstruction was impossible. These methods are called #statisticaldisclosurelimitation (#SDL) (with @ianschmutte https://www.brookings.edu/wp-content/uploads/2015/03/AbowdText.pdf)
10. Traditional #SDL methods assumed re-identification could be controlled by not publishing some statistics (suppression) or making the statistics more general, as in broad age or income categories (coarsening). Both assume the SDL can’t be undone via #databasereconstruction.
11. But traditional #SDL can be undone via #databasereconstruction. Example: the coarse geographic areas in public-use micro-data samples (PUMS) have large populations, generally 100,000+ people (Public Use Microdata Areas, PUMAs).
12. The geography in tabular summaries has much smaller populations. Census blocks have populations ranging from 0 to 1,000+ (average about 30). Same variables and same data rows are used in the PUMS and the tabular summaries. Combine to decoarsen; that’s #databasereconstruction.
13. Decoarsening a PUMS is the same thing as publishing the PUMS with block-level geography instead of the intended PUMA-level geography. Does that make the PUMS data easier to re-identify? Yes. Re-identification risks are orders of magnitude greater than originally thought.
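A minimal sketch of the decoarsening idea in tweets 11-13, using invented tables and pandas (the variables, counts, and the PUMA/block labels are hypothetical, for illustration only):

```python
import pandas as pd

# Hypothetical block-level tabulation, published with the same variables
# that also appear on the PUMS.
blocks = pd.DataFrame({
    "block": ["A", "A", "B", "C"],
    "age":   [30, 55, 30, 71],
    "sex":   ["F", "M", "M", "F"],
    "count": [12, 9, 7, 1],
})

# Hypothetical PUMS record, released only with coarse PUMA geography.
pums = pd.DataFrame({
    "puma":   [100],
    "age":    [71],
    "sex":    ["F"],
    "income": [48_000],
})

# Decoarsening: this (age, sex) cell appears in exactly one block, with a
# count of 1, so the PUMS record acquires block-level geography.
match = pums.merge(blocks, on=["age", "sex"])
print(match[["age", "sex", "income", "block", "count"]])
```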
14. Traditional #SDL underestimated the re-identification risk. It did not have an appropriate mathematical framework, primarily because it relied on a model of feasible attacks that could not incorporate the #databasereconstruction of correct, additional public micro-data.
15. The question to ask yourself is: would a micro-data publication be vulnerable to re-identification if every variable were released at the level of detail in the most detailed tabular summary from the same confidential data?
16. Every variable in a public micro-data file is a potential identifier, not just name, address or SSN, especially if used in combination.
17. #databasereconstruction proved vulnerability. #differentialprivacy provided the first, and most durable, way to control it. Crypto lesson 2: Add noise to each published statistic. Calibrate the noise to limit the worst-case global risk; that limit is called the privacy-loss budget (ε).
18. The original article, Dwork, @frankmcsherry, Nissim, Smith 2006 (no paywall: https://journalprivacyconfidentiality.org/index.php/jpc/article/download/405/388/, paywall: https://doi.org/10.1007/11681878_14).
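A minimal sketch of the mechanism behind tweets 17-18 (the Laplace mechanism from the Dwork et al. paper), applied to a hypothetical block count; the function name and numbers are illustrative, not any agency's production code:

```python
import numpy as np

def laplace_count(true_count, epsilon, rng=None):
    """ε-differentially private count via the Laplace mechanism.

    A count changes by at most 1 when one person is added or removed
    (sensitivity = 1), so Laplace noise with scale 1/ε bounds the
    worst-case privacy loss of this single release at ε.
    """
    if rng is None:
        rng = np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: publish a (hypothetical) block-level count of 87 with ε = 0.5.
print(laplace_count(87, epsilon=0.5))
```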
19. #differentialprivacy works because whether the data are released as summaries or PUMS, all the disclosure risk is accounted for. Neither release compromises the other. Both releases have controlled accuracy, which is a public feature of the data publication, not secret.
20. The controlled accuracy is the silver lining. Traditional SDL keeps the accuracy loss secret (see BPEA citation in tweet 9). This is scientifically dishonest. Calculations should be made on the original data. Margins of error should reflect all error including #SDL.
21. Data released by agencies are not the original data. Neither are the versions used in restricted-access enclaves, in general. Close or not, it is usually a violation of agency rules to quantify the accuracy after applying #SDL. Inferences are not necessarily valid.
22. The current methods used by U.S. statistical agencies are mostly summarized here: https://nces.ed.gov/FCSM/pdf/spwp22.pdf (Federal Committee on Statistical Methodology, WP 22, 2005). You won’t find any formulas for estimating margins of error.
23. That is a two-edged sword. If the effects on the margins of error are always small, then traditional #SDL is only effective when there is substantial suppression (unpublished data), which can result in biased published data—the suppression is not random.
24. But, if there isn’t much suppression, then traditional #SDL usually adds random noise, in which case the margins of error must be affected or else there is no effective protection because of accurate #databasereconstruction.
25. #differentialprivacy provides algorithms that quantify the accuracy of the published statistics as a function of the level of privacy-loss. As the privacy loss increases, the accuracy gets better. Better algorithms have more accuracy for any level of privacy loss.
26. When the privacy-loss is very large, the published data are completely accurate, but there is no protection against #databasereconstruction, and we are back to vulnerabilities inherent in traditional #SDL.
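A small worked example of tweets 25-26: for a count protected with Laplace(1/ε) noise, the 95% margin of error is ln(20)/ε, a public function of ε alone (an illustrative calculation, not any agency's accuracy target):

```python
import numpy as np

# For a count released with Laplace(1/ε) noise, P(|noise| > t) = exp(-εt),
# so the 95% margin of error is ln(20)/ε -- a known, public function of ε.
for eps in (0.1, 0.5, 1.0, 4.0, 16.0):
    moe95 = np.log(20) / eps
    print(f"ε = {eps:>5}: 95% margin of error ≈ ±{moe95:.1f} persons")

# Accuracy improves as ε (privacy loss) grows; at very large ε the noise is
# negligible, and so is the protection against database reconstruction.
```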
27. Technology using #differentialprivacy can’t determine an optimal combination of accuracy and privacy loss. For that decision, you need the tools of economics—a welfare function that captures the marginal social benefits of accuracy in terms of foregone privacy.
28. Citation for tweet 27: Abowd and @ianschmutte 2019 (paywall: https://www.aeaweb.org/articles?id=10.1257/aer.20170627&&from=f , no paywall: https://arxiv.org/abs/1808.06303).
29. An accessible discussion can be found in my KDD 2018 talk here: https://digitalcommons.ilr.cornell.edu/ldi/49/; highlights: https://www.kdnuggets.com/2018/09/kdd-2018-key-takeaways.html.
30. Does sampling help? Yes. If you run an ε-differentially-private algorithm on n units sampled at random from a population of N, then for small ε the effective privacy loss is approximately (n/N)ε.
31. Citation for tweet 30: Li et al. 2012 (paywall: https://dl.acm.org/citation.cfm?id=2414474, no paywall: https://arxiv.org/abs/1101.2604v2). It has the full formula without the “small ε” approximation.
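A sketch of the sampling result in tweets 30-31. The standard amplification-by-sampling bound (my reading of the Li et al. result) is ε′ = ln(1 + (n/N)(e^ε − 1)), which reduces to (n/N)ε when ε is small; the code just evaluates both:

```python
import numpy as np

def amplified_epsilon(eps, n, N):
    """Effective privacy loss of an ε-DP algorithm applied to a random
    n-of-N sample: ln(1 + (n/N)(e^ε - 1)).  For small ε this is ≈ (n/N)ε."""
    return np.log1p((n / N) * np.expm1(eps))

eps, n, N = 0.1, 1_000, 100_000         # a 1% sample and a small ε
print(amplified_epsilon(eps, n, N))      # ≈ 0.00105
print((n / N) * eps)                     # small-ε approximation: 0.001
```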
32. Why worst-case? Do we have to allow for very weird possible databases? Worst-case analysis is necessary for an important property of #differentialprivacy: closure under composition.
33. Because #differentialprivacy algorithms compose, if we apply two of them with privacy losses ε1 and ε2, then the total privacy loss is no greater than ε1 + ε2. Composition allows global risk accounting, which traditional #SDL can’t do.
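A minimal sketch of global privacy-loss accounting under basic (sequential) composition; the ledger class, budget, and release names are hypothetical:

```python
class PrivacyLedger:
    """Tracks total privacy loss under basic composition: the total is at
    most the sum of the per-release ε values, and releases stop once the
    budget is spent."""

    def __init__(self, budget):
        self.budget = budget
        self.spent = 0.0

    def charge(self, epsilon, description):
        if self.spent + epsilon > self.budget:
            raise RuntimeError(f"budget exhausted: cannot release {description}")
        self.spent += epsilon
        print(f"released {description}: ε = {epsilon}, total spent = {self.spent}")

ledger = PrivacyLedger(budget=1.0)
ledger.charge(0.25, "block-level counts")     # hypothetical releases
ledger.charge(0.50, "public-use microdata")
```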
34. Does #differentialprivacy mean no direct access to the confidential data? No. It means that every access must be accounted for in the privacy-loss budget (ε). All calculations are done on the confidential data. Published data and margins of error account for ε.
35. How do @Google, @Apple and @Microsoft use #differentialprivacy? Google’s RAPPOR, Apple’s iOS, and Microsoft’s telemetry implementations all use local differential privacy—data are modified before they are received for analysis. Google’s PROCHLO is a hybrid.
36. What about statistical agencies? Since they have already collected the data, and have a legal mandate to do so, they can use central #differentialprivacy. Statistics and micro-data are modified as they are published. The confidential data remain pristine.
37. What’s the difference? Local #differentialprivacy algorithms are always less efficient than centralized algorithms for the same statistical analysis. Local DP requires much more data to achieve the same accuracy as central DP for the same ε.
38. Randomized response is the classic example of local #differentialprivacy. See the paper in tweet 28, or Dwork and Roth 2014 https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf, p. 29.
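A minimal sketch of the classic coin-flip version of randomized response described in Dwork and Roth; it satisfies local differential privacy with ε = ln 3, and the sensitive question and rates below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_response(true_answer):
    """Coin-flip randomized response (local DP with ε = ln 3).

    Heads: answer truthfully.  Tails: flip again and report that coin,
    so every respondent has plausible deniability for any single answer.
    """
    if rng.random() < 0.5:
        return true_answer
    return rng.random() < 0.5

# Hypothetical sensitive yes/no question with a true "yes" rate of 30%.
true_answers = rng.random(100_000) < 0.30
reports = np.array([randomized_response(a) for a in true_answers])

# Debias: P(report yes) = 1/4 + (true rate)/2, so invert that relation.
estimate = 2 * reports.mean() - 0.5
print(f"estimated yes-rate: {estimate:.3f}")   # close to 0.30, but noisier
```

The debiasing step is why local DP needs so much more data for the same ε: the protection is applied to every individual response, not once to the aggregate.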
39. What can a social scientist do? Make inference validity primary. Registering randomized controlled trials has improved the statistical validity of those analyses. Using formal privacy methods improves research on confidential data compared to other feasible #SDL methods.
40. Contribute error metrics. Successful formal privacy systems rely on subject matter experts, statisticians, and computer scientists in equal measure. Subject experts define statistics of interest and the error metrics.
41. Crypto lesson 3: Transparency can’t be the harm. Transparency in #dataprivacy works like Kerckhoffs’ Principle in crypto. Confidentiality protection should still work when everything except the random number sequence actually used is public.
42. Cryptographers revolutionized data publication, just as they did for encryption technology. You wouldn’t be proud of using 1990s “state of the art” encryption on your most sensitive files. Why use 1990s technology for confidentiality protection? There is a better way.
43. Tools and primers:
http://sigmod2017.org/wp-content/uploads/2017/03/04-Differential-Privacy-in-the-wild-1.pdf
https://privacytools.seas.harvard.edu/files/privacytools/files/pedagogical-document-dp_new.pdf
44. Full series in my blog: https://blogs.cornell.edu/abowd/special-materials/tweetorial-formal-privacy-for-social-scientists/ #archetypally_unglamorous_expert.
45. https://www.nytimes.com/2018/12/05/upshot/to-reduce-privacy-risks-the-census-plans-to-report-less-accurate-data.html. I would have preferred the headline “… Census Acknowledges Accuracy Tradeoff.” Article on #databasereconstruction by @cocteau.
46. https://www.census.gov/about/cac/sac/meetings/2018-12-meeting.html starts today. @_kunal_talwar discusses Census Bureau #differentialprivacy implementation. Live video feed.
