Introduction to differential privacy with OpenDP and DP Wizard
These slides: (“✋” = class participation!) |
I’m in a class of 40 people, and the class average at midterm is 90. I drop the class, and the teacher announces the average is now 91. Can the other students figure out my grade?
Yes!
>>> class_size = 40
>>> mean_with_me = 90.0
>>> mean_without_me = 91.0
>>> # total_without_me / (class_size - 1) = mean_without_me
>>> total_without_me = (class_size - 1) * mean_without_me
>>> total_without_me
3549.0
>>> # (me + total_without_me) / class_size = mean_with_me
>>> me = mean_with_me * class_size - total_without_me
>>> me
51.0
This “solution” is a problem! What could the teacher have done instead?
Differential privacy suggests adding calibrated noise before releasing statistics.
Ok, but what does “adding calibrated noise” even mean?
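One answer, as a minimal sketch of the Laplace mechanism (an illustration, not the slides’ own code: it assumes grades are bounded in [0, 100] and the class size is public, and in production you would use a vetted library instead of rolling your own):
>>> import numpy as np
>>>
>>> rng = np.random.default_rng()
>>>
>>> def dp_mean(grades, lower=0.0, upper=100.0, epsilon=1.0):
...     """Release a mean with Laplace noise scaled to one person's influence."""
...     clipped = np.clip(grades, lower, upper)
...     n = len(clipped)  # class size, treated as public
...     sensitivity = (upper - lower) / n  # max effect of any one grade
...     return clipped.mean() + rng.laplace(scale=sensitivity / epsilon)
>>>
>>> dp_mean([91.0] * 39 + [51.0])  # doctest: +SKIP
Because any single grade can move the mean by at most (upper − lower) / n, noise on that scale masks whether a particular student stayed or dropped.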
|
An algorithm is differentially private if, by looking at the output, you can’t tell whether any individual’s data was included in the original dataset or not. Or: the behavior of the algorithm hardly changes when a single individual joins or leaves the dataset. We need to bound the contributions of individuals: for grades that is easy, but it isn’t always! |
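Formally (the standard formalization, not spelled out on this slide): a mechanism M is ε-differentially private if, for any two datasets D and D′ differing in one individual and any set of outputs S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]. |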
By Ted Rall (CC-BY-NC-ND). In Differential Privacy by Simson L. Garfinkel from MIT Press |
Ideas that we now identify as “DP” were in use before the term was coined: Stanley L. Warner: “Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias” (1965)
Here’s a question I might not feel comfortable answering honestly:
Are you here mostly for the free food?
| final: | heads (either flip) | tails (just last flip) |
|---|---|---|
| yes | A | B |
| no | B | A |
What does it look like in the limit?
| final: | heads | tails |
|---|---|---|
| yes | 3/4 are A | 1/4 are B |
| no | 3/4 are B | 1/4 are A |
Given the percentage “A”, solve for percentage “Yes”.
A% = 3/4 * Yes% + 1/4 * No%
A% = 3/4 * Yes% + 1/4 * (1 - Yes%)
A% = 1/2 * Yes% + 1/4
2 * (A% - 1/4) = Yes%
| A% | Yes% | |
|---|---|---|
| 25% | 0% | |
| 50% | 50% | 👉 These are noisy estimates! |
| 75% | 100% | |
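We can sanity-check the estimator by simulation. A sketch of one common two-coin protocol (the respondent count, 10,000, and the true rate, 30%, are made up):
>>> import random
>>>
>>> def randomized_response(truth: bool) -> bool:
...     if random.random() < 0.5:  # first flip: heads, answer honestly
...         return truth
...     return random.random() < 0.5  # tails: answer per the second flip
>>>
>>> answers = [
...     randomized_response(random.random() < 0.3)
...     for _ in range(10_000)
... ]
>>> a_pct = sum(answers) / len(answers)
>>> yes_pct = 2 * (a_pct - 0.25)  # invert A% = 1/2 * Yes% + 1/4
>>> # yes_pct lands near 0.30, give or take sampling noise.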
|
(1) What does this mean?
(2) How do we explain it to users?
(3) How can we make it more accurate? |
|
This is one draw from a random distribution. If we combine this number with others, negative values produce more accurate results. Imagine all the sessions make their own DP estimate: if we clipped each value at zero, the mean would be biased. |
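A quick sanity check of that claim (a sketch with arbitrary numbers: a true count of 2 and Laplace noise of scale 5 across 1,000 sessions):
>>> import numpy as np
>>>
>>> rng = np.random.default_rng(0)
>>> noisy = 2 + rng.laplace(scale=5, size=1_000)
>>> unbiased = noisy.mean()  # close to the true value, 2
>>> biased = np.clip(noisy, 0, None).mean()  # clipping pulls this well above 2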
|
This is hard! Bloomberg, August 12, 2021: “Data Scientists Square Off Over Trust and Privacy in 2020 Census”
Exercise ✋:
|
|
|
|
odds ratio ≤ e^ε, or equivalently: log(odds ratio) ≤ ε |
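For example, in the two-coin randomized response above, a true “yes” is reported as “yes” with probability 3/4, while a true “no” is reported as “yes” with probability 1/4. The ratio is 3, so that protocol satisfies ε = ln 3:
>>> import math
>>>
>>> # P(report "yes" | truth yes) = 3/4; P(report "yes" | truth no) = 1/4
>>> round(math.log((3 / 4) / (1 / 4)), 2)
1.1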
|
A sample from registry.opendp.org:
| Organization | Epsilon | Context |
|---|---|---|
| Microsoft | 0.1 | Broadband Coverage Estimates |
| Wikimedia Foundation | 1.0 | Wikipedia Usage Data |
| US Census | 19.61 | Disclosure Avoidance System for Redistricting Data |
Five-minute discussion ✋: Break into groups and…



|
Divide into four teams, and on one computer either:
|
|
Then:
|
|
For 5 minutes, experiment with one parameter, and then pick someone to summarize the effect of your changes on the accuracy of the DP statistics.
|
Team A: Imagine that rather than protecting the privacy of individual students, we’re interested in the privacy of student groups. What do we need to know, and what would we change?
Team B: Imagine we were just interested in pass/fail instead of letter grades. What can we change? |
Team C: Given what we know about the typical distribution of grades, is there something we can change to use our privacy budget more efficiently?
Team D: How does accuracy change if we collect the same statistic across the entire school with a population of 1000, instead of a class of 100? |
The preview visualization uses the OpenDP Library under the hood, but we can also generate OpenDP code.
We’re going to walk through this notebook.
This is a demonstration of how the OpenDP Library can be used to create a differentially private release. To learn more about what’s going on here, see the documentation for OpenDP: https://docs.opendp.org/
First install and import the required dependencies:
%pip install 'opendp[polars]==0.14.1' matplotlib
>>> import polars as pl
>>> import opendp.prelude as dp
>>> import matplotlib.pyplot as plt
>>>
>>> # The OpenDP team is working to vet the core algorithms.
>>> # Until that is complete we need to opt-in to use these features.
>>> dp.enable_features("contrib")
Then define some utility functions to handle dataframes and plot results:
>>> # These functions are used both in the application
>>> # and in generated notebooks.
>>> from polars import DataFrame
>>>
>>>
>>> def make_cut_points(
... lower_bound: float, upper_bound: float, bin_count: int
... ):
... """
... Returns one more cut point than the bin_count.
... (There are actually two more bins, extending to
... -inf and +inf, but we'll ignore those.)
... Cut points are evenly spaced from lower_bound to upper_bound.
... >>> make_cut_points(0, 10, 2)
... [0.0, 5.0, 10.0]
... """
... bin_width = (upper_bound - lower_bound) / bin_count
... return [
... round(lower_bound + i * bin_width, 2)
... for i in range(bin_count + 1)
... ]
Based on the input you provided, for each column we’ll create a Polars expression that describes how we want to summarize that column.
grade
>>> # See the OpenDP docs for more on making private histograms:
>>> # https://docs.opendp.org/en/stable/getting-started/examples/histograms.html
>>>
>>> # Use the public information to make cut points for 'grade':
>>> grade_cut_points = make_cut_points(
... lower_bound=0.0,
... upper_bound=100.0,
... bin_count=10,
... )
>>>
>>> # Use these cut points to add a new binned column to the table:
>>> grade_bin_expr = (
... pl.col("grade")
... .cut(grade_cut_points)
... .alias("grade_bin") # Give the new column a name.
... .cast(pl.String)
... )
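As a quick sanity check (a toy DataFrame, not part of the generated notebook), each grade lands in an interval-labeled bin along the lines of “(50, 60]”:
>>> toy = pl.DataFrame({"grade": [55.0, 87.0, 92.0]})
>>> toy.with_columns(grade_bin_expr)  # doctest: +SKIP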
Next, we’ll define our Context. This is where we set the privacy budget, and set the weight for each query under that overall budget.
>>> contributions = 1
>>> privacy_unit = dp.unit_of(contributions=contributions)
>>>
>>> # Consider how your budget compares to that of other projects.
>>> # https://registry.oblivious.com/#public-dp
>>> privacy_loss = dp.loss_of(
... epsilon=1.0,
... delta=1e-7,
... )
>>>
>>> # See the OpenDP docs for more on Context:
>>> # https://docs.opendp.org/en/stable/api/user-guide/context/index.html#context:
>>> context = dp.Context.compositor(
... data=pl.scan_csv(
... "docs/fill-in-correct-path.csv", encoding="utf8-lossy"
... ).with_columns(grade_bin_expr),
... privacy_unit=privacy_unit,
... privacy_loss=privacy_loss,
... split_by_weights=[2],
... margins=[
... # "max_length" should be a loose upper bound,
... # for example, the size of the total population being sampled.
... # https://docs.opendp.org/en/stable/api/python/opendp.extras.polars.html#opendp.extras.polars.Margin.max_length
... #
... # In production, "max_groups" should be set by considering the number
... # of possible values for each grouping column, and taking their product.
... dp.polars.Margin(
... by=[],
... invariant="keys",
... max_length=1000000,
... max_groups=100,
... ),
... dp.polars.Margin(
... by=[
... "grade_bin",
... ],
... invariant="keys",
... ),
... ],
... )
A note on utf8-lossy: CSVs can use different “character encodings” to represent characters outside the plain ASCII character set, but out of the box the Polars library only supports UTF8. Specifying utf8-lossy preserves as much information as possible, and any unrecognized characters will be replaced by “�”. If this is not sufficient, you will need to preprocess your data to reencode it as UTF8.
Finally, we run the queries and plot the results.
>>> confidence = 0.95 # 95% confidence interval
Query for grade:
>>> groups = ["grade_bin"] + []
>>> grade_query = context.query().group_by(groups).agg(pl.len().dp.noise())
>>> grade_accuracy = grade_query.summarize(alpha=1 - confidence)[
... "accuracy"
... ].item()
>>> grade_stats = grade_query.release().collect()
>>> grade_stats # doctest: +ELLIPSIS
shape: (..., 2)
┌───────────┬─────┐
│ grade_bin ┆ len │
│ --- ┆ --- │
│ str ┆ u32 │
╞═══════════╪═════╡
...
└───────────┴─────┘
If we try to run more queries at this point, they will error: once the privacy budget is consumed, the library prevents you from running more queries with that Queryable.
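For example, attempting one more release fails (a hypothetical extra query with the same shape as above; the exact exception comes from the OpenDP Library):
>>> try:
...     context.query().group_by(groups).agg(pl.len().dp.noise()).release()
... except Exception as e:
...     print("Rejected:", e)  # doctest: +SKIP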
With DP Wizard
|
With OpenDP
|
Other libraries
|
What is the alternative? If people believe their privacy has not been protected, then they may be less willing to participate in the future.
Other anonymization and grouping techniques are more fragile, and might introduce biases that are less well characterized.
If you let yourself be too flexible, there’s a risk of p-hacking. The methodology that DP imposes is similar to what we should be doing in any case if we want our research to be reproducible.
(But there is work adapting DP for exploratory analysis: Measure-Observe-Remeasure.)
For stats like income, try histograms or quantiles with varying step sizes?
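One hypothetical way to vary the step size is to reuse make_cut_points from the notebook above on a log scale, so bins are narrow for low incomes and wide for high ones:
>>> # Exponents 3.0 to 6.0 span incomes from 1,000 to 1,000,000:
>>> [round(10 ** e) for e in make_cut_points(3.0, 6.0, 6)]
[1000, 3162, 10000, 31623, 100000, 316228, 1000000]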
Multi-step workflows: Spend a little of your budget first, just to understand distributions.
Text or image data can’t just be dropped into DP.
Even if it can be reduced to a feature vector, can you define a unit of privacy?
With local DP, do you trust the software and hardware that it runs on?
With central DP, do you trust the authority to abide by their commitments?
DP is one of a number of privacy enhancing technologies (PETs), and in applications at scale these will be combined, utilizing the best parts of each.
Other PETs protect privacy during computation, but don’t preserve privacy in results.
| | OpenDP | DP Wizard |
|---|---|---|
| email: | info@opendp.org | cmccallum@g.harvard.edu |
| docs: | docs.opendp.org | opendp.github.io/dp-wizard |
| source: | github.com/opendp/opendp | github.com/opendp/dp-wizard |
| Books! |
|---|
| Differential Privacy, by Simson Garfinkel: From MIT Press. Available as a free PDF, this is a great intro to the social context of DP, and is light on math. |
| Programming Differential Privacy, by Joe Near and Chiké Abuah: Free ebook. The examples use Python, and do not assume any particular library. If you intend to use DP in production, you should not be implementing algorithms yourself! |
| Hands-On Differential Privacy: Introduction to the Theory and Practice Using OpenDP: From O’Reilly, this covers a wide range of topics, and the discussion is tied to the OpenDP Library. |