Introduction to differential privacy with OpenDP and DP Wizard

Outline

Introduction to differential privacy with OpenDP and DP Wizard

These slides: https://opendp.github.io/dp-wizard

(“✋” = class participation!)

Class grades problem

I’m in a class of 40 people, and the class average at midterm is 90. I drop the class, and the teacher announces the average is now 91. Can the other students figure out my grade?

Class grades solution

I’m in a class of 40 people, and the class average at midterm is 90. I drop the class, and the teacher announces the average is now 91. Can the other students figure out my grade?

Yes!

>>> class_size = 40
>>> mean_with_me = 90.0
>>> mean_without_me = 91.0

>>> # total_without_me / (class_size - 1) = mean_without_me
>>> total_without_me = (class_size - 1) * mean_without_me
>>> total_without_me
3549.0

>>> # (me + total_without_me) / class_size = mean_with_me
>>> me = mean_with_me * class_size - total_without_me
>>> me
51.0

This “solution” is a problem! What could the teacher have done instead?

Differential privacy suggests adding calibrated noise before releasing statistics.

Class grades solution problem solution problems!

Ok, but what does “adding calibrated noise” even mean?

What’s “differential” about differential privacy?

An algorithm is differentially private if by looking at the output, you can’t tell whether any individual’s data was included in the original dataset or not.

Or: The behavior of the algorithm hardly changes when a single individual joins or leaves the dataset.

We need to bound the contributions of individuals: For grades, that is easy, but it isn’t always!

By Ted Rall (CC-BY-NC-ND). In Differential Privacy by Simson L. Garfinkel from MIT Press

Randomized response ✋

Ideas that we now identify as “DP” were in use before the term was coined: Stanley L. Warner: “Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias” (1965)

Here’s a question I might not feel comfortable answering honestly:

Are you here mostly for the free food?

true answer    final: heads (either flip)    final: tails (just last flip)
yes            A                             B
no             B                             A

Let’s do differential privacy!

What does it look like in the limit?

true answer    heads        tails
yes            3/4 are A    1/4 are B
no             3/4 are B    1/4 are A

Given the percentage “A”, solve for percentage “Yes”.

A% = 3/4 * Yes% + 1/4 * No%
A% = 3/4 * Yes% + 1/4 * (1 - Yes%)
A% = 1/2 * Yes% + 1/4
2 * (A% - 1/4) = Yes%
A%     Yes%
25%    0%
50%    50%    👉 These are noisy estimates!
75%    100%
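
In the limit of many respondents the estimate converges, but any single survey gives a noisy draw. Here is a minimal simulation of the scheme and the de-biasing step above (a sketch for illustration, not part of DP Wizard; it assumes each person reports truthfully with probability 3/4, as in the coin-flip protocol):

>>> import random
>>> random.seed(0)  # seeded only so the demo is repeatable
>>>
>>> def randomized_response(truth: bool) -> bool:
...     """Report the true answer with probability 3/4, the opposite with probability 1/4."""
...     return truth if random.random() < 3 / 4 else not truth
>>>
>>> true_answers = [True] * 50 + [False] * 50        # true Yes% is 50%
>>> reports = [randomized_response(t) for t in true_answers]
>>> a_percent = sum(reports) / len(reports)          # observed A%
>>> estimated_yes = 2 * (a_percent - 1 / 4)          # invert: Yes% = 2 * (A% - 1/4)
>>> # estimated_yes is a noisy estimate of the true 50%.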

Hold on… We could get a negative number back?!

A%     Yes%
0%     -50%
25%    0%
50%    50%
75%    100%
100%   150%

(1) What does this mean?

(2) How do we explain it to users?

(3) How can we make it more accurate?

(1) What does this mean?

A%     Yes%
0%     -50%
25%    0%
50%    50%
75%    100%
100%   150%

This is one draw from a random distribution.

If we combine this number with others, keeping the negative values produces more accurate aggregate results.

Imagine all the sessions make their own DP estimate: if we clipped each value at zero, the mean would be biased upward.
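
Here is a quick way to see that bias (a sketch, reusing the randomized-response estimator from earlier, with a true Yes% of 0 so roughly half of the noisy estimates come back negative):

>>> import random
>>> random.seed(0)
>>>
>>> def dp_estimate(true_yes_fraction: float, n: int) -> float:
...     """One randomized-response estimate of Yes% from n respondents."""
...     a_count = 0
...     for _ in range(n):
...         truth = random.random() < true_yes_fraction
...         report = truth if random.random() < 3 / 4 else not truth
...         a_count += report
...     return 2 * (a_count / n - 1 / 4)
>>>
>>> estimates = [dp_estimate(0.0, 100) for _ in range(1000)]   # true Yes% is 0
>>> mean_raw = sum(estimates) / len(estimates)                  # centers near 0: unbiased
>>> mean_clipped = sum(max(e, 0.0) for e in estimates) / len(estimates)
>>> # mean_clipped sits noticeably above 0: clipping introduces an upward bias.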

(2) How do we explain it to users? ✋

A%     Yes%
0%     -50%
25%    0%
50%    50%
75%    100%
100%   150%

This is hard!

Bloomberg, August 12, 2021: “Data Scientists Square Off Over Trust and Privacy in 2020 Census”

New York’s Liberty Island, population 0, has emerged as the unlikely center of a tug-of-war over the U.S. census.

Aside from the Statue of Liberty, no one’s called the island home since 2012, when the former superintendent’s home was destroyed by Hurricane Sandy, leaving Liberty Island unoccupied for the first time in hundreds of years.

Quasi-officially, though, the current population for the 12-acre island stands at 48.

Exercise ✋:

  • Pair up, and in your own words explain our here-for-the-free-lunch stat.
  • Brainstorm how we could make it more accurate.

(3) How can we make it more accurate?

A%     Yes%
0%     -50%
25%    0%
50%    50%
75%    100%
100%   150%
  • Recruit more study participants.
    • May not be an option.
    • But, sampling itself is a source of randomness: “privacy amplification”
  • Instead of having every person randomize their response, we could collect all the data, and then only apply noise once.
    • This is changing the model, from local to central DP.

Who do we trust?

              Individuals   Authority   Public
Local DP           ✓
Central DP         ✓             ✓
No DP!             ✓             ✓         ✓

(✓ = trusted with the raw, unprotected data)

  • Instead of using a fair coin, we could randomize the response less than 50% of the time.
    • We’re changing the trade-off between privacy and accuracy, the “Privacy Budget”.

Privacy budgets and epsilon

odds ratio ≤ e^ε

log(odds ratio) ≤ ε

  • ε is on a log scale, and smaller values are safer.
  • If two calculations consume ε₁ and ε₂, together they consume ε₁ + ε₂ (see the quick check below).
  • ε is a proven worst-case bound: research is tightening the bounds on existing mechanisms, and developing new mechanisms which are more efficient.
  • The “right” value of ε in a given situation is a social question.
  • We can use the Privacy Deployments Registry to see what other people have used.
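
A quick check of these definitions against our coin-flip scheme (a sketch; it uses the fact that the scheme reports “A” with probability 3/4 when the true answer is yes and 1/4 when it is no):

>>> import math
>>>
>>> # Odds ratio for the coin-flip scheme: P(report A | yes) / P(report A | no).
>>> odds_ratio = (3 / 4) / (1 / 4)
>>> epsilon = math.log(odds_ratio)
>>> round(epsilon, 2)
1.1
>>>
>>> # Composition: running the same survey twice on the same people
>>> # consumes twice the budget.
>>> round(epsilon + epsilon, 2)
2.2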

Privacy Deployments Registry, and a discussion ✋

A sample from registry.opendp.org:

Organization           Epsilon   Context
Microsoft              0.1       Broadband Coverage Estimates
Wikimedia Foundation   1.0       Wikipedia Usage Data
US Census              19.61     Disclosure Avoidance System for Redistricting Data

Five minute discussion ✋: Break into groups and…

DP Wizard preview: Select Dataset

DP Wizard preview: Define Analysis

DP Wizard preview: Download Results

Return to the class grades example ✋

Divide into four teams, and on one computer either:

pip install 'dp_wizard[app]'
dp_wizard --cloud
(requires Python>=3.10)

… or go to: tinyurl.com/dp-wizard

Then:

  • 1: On “Select Dataset”:
    • Under “CSV Columns”, enter grade.
    • Leave the “Unit of Privacy” at 1.
    • Click “Define Analysis”.
  • 2: On “Define Analysis”:
    • Select grade in “Columns”.
    • Leave “Group By” empty.
    • Leave “Privacy Budget” at 1, and “Number of Rows” at 100.
  • 3: On the histogram:
    • Change the upper bound to 100.

DP Wizard experiments

For 5 minutes, experiment with one parameter, and then pick someone to summarize the effect of your changes on the accuracy of the DP statistics.

Team A: Imagine that rather than protecting the privacy of individual students, we’re interested in the privacy of student groups. What do we need to know, and what would we change?

Team B: Imagine we were just interested in pass/fail instead of letter grades. What can we change?

Team C: Given what we know about the typical distribution of grades, is there something we can change to use our privacy budget more efficiently?

Team D: How does accuracy change if we collect the same statistic across the entire school with a population of 1000, instead of a class of 100?

Generating OpenDP Library code with DP Wizard

The preview visualization uses the OpenDP Library under the hood, but we can also generate OpenDP code.

We’re going to walk through this notebook.

Introduction to the OpenDP library

OpenDP Demo

This is a demonstration of how the OpenDP Library can be used to create a differentially private release. To learn more about what’s going on here, see the documentation for OpenDP: https://docs.opendp.org/

Prerequisites

First install and import the required dependencies:

%pip install 'opendp[polars]==0.14.1' matplotlib
>>> import polars as pl
>>> import opendp.prelude as dp
>>> import matplotlib.pyplot as plt
>>>
>>> # The OpenDP team is working to vet the core algorithms.
>>> # Until that is complete we need to opt-in to use these features.
>>> dp.enable_features("contrib")

Utility functions

Then define some utility functions to handle dataframes and plot results:

>>> # These functions are used both in the application
>>> # and in generated notebooks.
>>> from polars import DataFrame
>>>
>>>
>>> def make_cut_points(
...     lower_bound: float, upper_bound: float, bin_count: int
... ):
...     """
...     Returns one more cut point than the bin_count.
...     (There are actually two more bins, extending to
...     -inf and +inf, but we'll ignore those.)
...     Cut points are evenly spaced from lower_bound to upper_bound.
...     >>> make_cut_points(0, 10, 2)
...     [0.0, 5.0, 10.0]
...     """
...     bin_width = (upper_bound - lower_bound) / bin_count
...     return [
...         round(lower_bound + i * bin_width, 2)
...         for i in range(bin_count + 1)
...     ]
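
For example, with the bounds and bin count used for grade below, this produces eleven evenly spaced cut points:

>>> make_cut_points(0, 100, 10)
[0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]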

Analysis

Based on the input you provided, for each column we’ll create a Polars expression that describes how we want to summarize that column.

Expression for grade

>>> # See the OpenDP docs for more on making private histograms:
>>> # https://docs.opendp.org/en/stable/getting-started/examples/histograms.html
>>>
>>> # Use the public information to make cut points for 'grade':
>>> grade_cut_points = make_cut_points(
...     lower_bound=0.0,
...     upper_bound=100.0,
...     bin_count=10,
... )
>>>
>>> # Use these cut points to add a new binned column to the table:
>>> grade_bin_expr = (
...     pl.col("grade")
...     .cut(grade_cut_points)
...     .alias("grade_bin")  # Give the new column a name.
...     .cast(pl.String)
... )
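
To see what this expression produces, we can apply it to a few made-up grades (the values here are hypothetical, not from the real dataset):

>>> pl.DataFrame({"grade": [55.0, 72.0, 98.0]}).with_columns(
...     grade_bin_expr
... )  # doctest: +ELLIPSIS
shape: (3, 2)
...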

Context

Next, we’ll define our Context. This is where we set the privacy budget, and set the weight for each query under that overall budget.

>>> contributions = 1
>>> privacy_unit = dp.unit_of(contributions=contributions)
>>>
>>> # Consider how your budget compares to that of other projects.
>>> # https://registry.oblivious.com/#public-dp
>>> privacy_loss = dp.loss_of(
...     epsilon=1.0,
...     delta=1e-7,
... )
>>>
>>> # See the OpenDP docs for more on Context:
>>> # https://docs.opendp.org/en/stable/api/user-guide/context/index.html#context:
>>> context = dp.Context.compositor(
...     data=pl.scan_csv(
...         "docs/fill-in-correct-path.csv", encoding="utf8-lossy"
...     ).with_columns(grade_bin_expr),
...     privacy_unit=privacy_unit,
...     privacy_loss=privacy_loss,
...     split_by_weights=[2],
...     margins=[
...         # "max_length" should be a loose upper bound,
...         # for example, the size of the total population being sampled.
...         # https://docs.opendp.org/en/stable/api/python/opendp.extras.polars.html#opendp.extras.polars.Margin.max_length
...         #
...         # In production, "max_groups" should be set by considering the number
...         # of possible values for each grouping column, and taking their product.
...         dp.polars.Margin(
...             by=[],
...             invariant="keys",
...             max_length=1000000,
...             max_groups=100,
...         ),
...         dp.polars.Margin(
...             by=[
...                 "grade_bin",
...             ],
...             invariant="keys",
...         ),
...     ],
... )

A note on character encodings

A note on utf8-lossy: CSVs can use different “character encodings” to represent characters outside the plain ASCII character set, but out of the box the Polars library only supports UTF-8. Specifying utf8-lossy preserves as much information as possible, and any unrecognized characters will be replaced by “�”. If this is not sufficient, you will need to preprocess your data to re-encode it as UTF-8.

Results

Finally, we run the queries and plot the results.

>>> confidence = 0.95  # 95% confidence interval

Query for grade:

>>> groups = ["grade_bin"] + []
>>> grade_query = context.query().group_by(groups).agg(pl.len().dp.noise())
>>> grade_accuracy = grade_query.summarize(alpha=1 - confidence)[
...     "accuracy"
... ].item()
>>> grade_stats = grade_query.release().collect()
>>> grade_stats # doctest: +ELLIPSIS
shape: (..., 2)
┌───────────┬─────┐
│ grade_bin ┆ len │
│ ---       ┆ --- │
│ str       ┆ u32 │
╞═══════════╪═════╡
...
└───────────┴─────┘

If we try to run more queries at this point, it will error. Once the privacy budget is consumed, the library prevents you from running more queries with that Queryable.

OpenDP architecture

What else can we do with DP?

With DP Wizard

  • Grouping
  • Means
  • Medians
  • Synthetic Data

With OpenDP

Other libraries

  • Stochastic gradient descent
  • SQL interfaces

DP tradeoffs

Less accurate stats

What is the alternative? If people believe their privacy has not been protected, then they may be less willing to participate in the future.

Other anonymization and grouping techniques are more fragile, and might introduce biases that are less well characterized.

Less flexible workflow

If you let yourself be too flexible, there’s a risk of p-hacking. The methodology that DP imposes is similar to what we should be doing in any case if we want our research to be reproducible.

(But there is work adapting DP for exploratory analysis: Measure-Observe-Remeasure.)

DP limits

Requires bounds on sensitivity

For stats like income, try histograms or quantiles with varying step sizes.

Multi-step workflows: Spend a little of your budget first, just to understand distributions.

Can’t anonymize arbitrary data

Text or image data can’t just be dropped into DP.

Even if it can be reduced to a feature vector, can you define a unit of privacy?

Requires trust

With local DP, do you trust the software and hardware that it runs on?

With central DP, do you trust the authority to abide by their commitments?

Other PETs

DP is one of a number of privacy enhancing technologies (PETs), and in applications at scale these will be combined, utilizing the best parts of each.

Other PETs protect privacy during computation, but don’t preserve privacy in results.

Thank you, and stay in touch!

         OpenDP                      DP Wizard
email:   info@opendp.org             cmccallum@g.harvard.edu
docs:    docs.opendp.org             opendp.github.io/dp-wizard
source:  github.com/opendp/opendp    github.com/opendp/dp-wizard
Books!
Differential Privacy, by Simson Garfinkel: From MIT Press. Available as a free PDF, this is a great intro to the social context of DP, and is light on math.
Programming Differential Privacy, by Joe Near and Chiké Abuah: Free ebook. The examples use Python, and do not assume any particular library. If you intend to use DP in production, you should not be implementing algorithms yourself!
Hands-On Differential Privacy: Introduction to the Theory and Practice Using OpenDP: From O’Reilly, this covers a wide range of topics, and the discussion is tied to the OpenDP Library.