# CS 2080: OpenDP Demo

Install OpenDP

In [3]:
pip install opendp

Collecting opendp
  Downloading opendp-0.12.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.1 kB)
Downloading opendp-0.12.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25.0 MB)
Installing collected packages: opendp
Successfully installed opendp-0.12.1


Import OpenDP and enable relevant flags

In [4]:
import pandas as pd
import numpy as np

import opendp.prelude as dp
dp.enable_features("honest-but-curious", "contrib")

- **Honest-but-Curious**: opts in to a looser trust model, because the library cannot verify the privacy or stability properties of user-defined functions (exercise 2).
- **Contrib**: enables mechanisms that have not yet been fully vetted.

In [6]:
# Read in the dataset
# We will look at income data from the California PUMS dataset
data = dp.examples.get_california_pums_path().read_text()

# the greatest number of records that any one individual can influence in the dataset
max_influence = 1

# establish public information
col_names = ["age", "sex", "educ", "race", "income", "married"]

# we can also reasonably intuit that age and income will be numeric,
# as well as bounds for them, without looking at the data
age_bounds = (0, 100)
income_bounds = (0, 150_000)

### Creating our first transformation

We will create a transformation that preprocesses the `income` column of the dataset. We use the chaining operator `>>` to combine transformations and measurements. When chaining `A >> B`, the input domain of `B` must match the output domain of `A`.

We will use the [`make_split_dataframe()`](https://docs.opendp.org/en/stable/api/python/opendp.transformations.html#opendp.transformations.make_split_dataframe) and [`make_select_column()`](https://docs.opendp.org/en/stable/api/python/opendp.transformations.html#opendp.transformations.make_select_column) transformations.
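
To build intuition for what the two steps do to the data, here is a minimal plain-Python analogue. The helper names `split_dataframe`, `select_column`, and `chain` are hypothetical stand-ins written for this sketch, not OpenDP functions, and the two CSV rows are made up. The sketch carries none of the domain or metric bookkeeping that the real constructors do.

```python
# Plain-Python analogue of "split into columns, then select one column".
# All helper names here are hypothetical; they are not part of OpenDP.

def split_dataframe(text, separator, col_names):
    """Parse delimited text into a dict mapping column name -> list of strings."""
    columns = {name: [] for name in col_names}
    for line in text.splitlines():
        if not line:
            continue  # skip empty lines
        fields = line.split(separator)
        for name, field in zip(col_names, fields):
            columns[name].append(field)
    return columns


def select_column(columns, key):
    """Pick one column (a list of strings) out of the parsed table."""
    return columns[key]


def chain(first, second):
    """Compose two functions so the output of `first` feeds `second`."""
    return lambda arg: second(first(arg))


col_names = ["age", "sex", "educ", "race", "income", "married"]
toy_csv = "59,1,9,1,0,1\n31,0,1,3,17000,0\n"  # made-up rows, not PUMS data

preprocess = chain(
    lambda text: split_dataframe(text, ",", col_names),
    lambda columns: select_column(columns, "income"),
)
print(preprocess(toy_csv))  # ['0', '17000']
```

The `chain` helper only works because the second function accepts exactly what the first one returns, which is the same compatibility requirement that `>>` enforces on domains.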

In [None]:
income_preprocessor = (
    # Convert data into a dataframe where columns are of type Vec<str>
    #TODO
    # Selects a column of df, Vec<str>
    #TODO
)

#inspect the preprocessor transformation
print(income_preprocessor)

In [None]:
transformed_data = income_preprocessor(data)
print(transformed_data[:10])

Observe that the above transformed data is a vector of strings. We will want to instead convert these into a vector of integers. Use [`then_cast()`](https://docs.opendp.org/en/stable/api/python/opendp.transformations.html#opendp.transformations.then_cast) to convert the vector of strings to a vector of ints. We will also chain with [`then_impute_constant()`](https://docs.opendp.org/en/stable/api/python/opendp.transformations.html#opendp.transformations.then_impute_constant) to insert the constant $0$ in any row where the string-to-int cast operation fails.
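
The cast-then-impute idea can be sketched in plain Python as follows. The function names are hypothetical and the input strings are made up. In this sketch, `None` plays the role of a failed parse.

```python
# Plain-Python sketch of "cast strings to ints, then fill failures with a constant".
# Function names are hypothetical; they are not OpenDP constructors.

def cast_to_optional_int(values):
    """Parse each string as an int; a failed parse becomes None."""
    parsed = []
    for value in values:
        try:
            parsed.append(int(value))
        except ValueError:
            parsed.append(None)
    return parsed


def impute_constant(values, constant):
    """Replace every None with `constant`, leaving parsed values untouched."""
    return [constant if value is None else value for value in values]


raw = ["17000", "", "9100", "n/a"]  # made-up strings
print(cast_to_optional_int(raw))                      # [17000, None, 9100, None]
print(impute_constant(cast_to_optional_int(raw), 0))  # [17000, 0, 9100, 0]
```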

In [None]:
cast_str_int = (
    # start with the output space of the income_preprocessor
    income_preprocessor.output_space >>
    # cast Vec<str> to Vec<Option<int>>
    #TODO
    # Replace any elements that failed to parse with 0, emitting a Vec<int>
    #TODO
)

#print(cast_str_int)

In [None]:
# replace the previous preprocessor: extend it with the caster
income_preprocessor = income_preprocessor >> cast_str_int
print(income_preprocessor(data)[:10])

[0, 17000, 0, 9100, 37000, 0, 6000, 350000, 33000, 25000]


Great! Now we have integer income data from our CSV. We can now compute our first private statistic. Suppose we want to know the number of records in the dataset. We can use the [list of aggregators](https://docs.opendp.org/en/stable/api/user-guide/transformations/index.html) in the transformation constructors section of the user guide to find `then_count()`.

In [None]:
count = income_preprocessor >> dp.t.then_count()
# NOT a DP release!
count_response = count(data)
print(count_response)

1000


We will need to chain the above counting transformation with a measurement to create a differentially private release.

When you use `then_laplace` below, it automatically chooses a discrete variant of the mechanism (i.e., the Geometric mechanism) for privatizing integers. Notice that the constructor now comes from `dp.m` (denoting measurement constructors), and that `type(dp_count)` is `Measurement`. This tells us that the output will be a differentially private release.
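
For intuition about what integer-valued Laplace noise looks like, here is a textbook-style sketch that samples two-sided geometric noise as the difference of two geometric draws. It is an illustration only: it uses ordinary floating-point arithmetic and a non-cryptographic generator, so it should not be treated as how OpenDP samples noise or as a safe DP sampler. The function name is hypothetical.

```python
import math
import random


def sample_discrete_laplace(scale, rng):
    """Draw integer noise with P(k) proportional to exp(-|k| / scale).

    Illustration only: not a secure or floating-point-safe sampler.
    """
    def sample_geometric():
        # Number of failures before the first success, where the success
        # probability is 1 - exp(-1 / scale), via inverse-CDF sampling.
        uniform = 1.0 - rng.random()  # lies in (0, 1], so log() is defined
        return math.floor(-scale * math.log(uniform))

    # The difference of two independent geometric draws is two-sided geometric.
    return sample_geometric() - sample_geometric()


rng = random.Random(0)
true_count = 1000  # made-up exact count
noisy_counts = [true_count + sample_discrete_laplace(1.0, rng) for _ in range(5)]
print(noisy_counts)  # five integers scattered around 1000
```

Because the noise is an integer, the noisy count stays an integer, which matches the kind of output one would expect when privatizing a count.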

In [None]:
dp_count = count >> dp.m.then_laplace(scale=1.)

In any realistic situation, you would likely want to estimate the budget utilization before you make a release. Use a search utility to quantify the privacy expenditure of this release. See [`binary_search`](https://docs.opendp.org/en/stable/api/user-guide/utilities/parameter-search.html) in the OpenDP docs.

In [None]:
epsilon = dp.binary_search(
    lambda eps: dp_count.check(d_in=max_influence, d_out=eps),
    bounds=(0., 100.))
print("DP count budget:", epsilon)

DP count budget: 1.0
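
A budget of 1.0 is what the usual Laplace relation $\varepsilon = d_{in} \cdot \Delta / \text{scale}$ predicts for a count, where $\Delta = 1$ because adding or removing one record changes a count by at most 1. The sketch below checks this with a hand-rolled bisection. Both helper names are hypothetical, and the search is a simplified stand-in for a library search utility, assuming ideal real arithmetic.

```python
def laplace_epsilon(d_in, sensitivity_per_record, scale):
    """Privacy loss of the Laplace mechanism under ideal arithmetic."""
    return d_in * sensitivity_per_record / scale


def smallest_passing(predicate, lower, upper, tolerance=1e-9):
    """Bisect for the smallest value in [lower, upper] where `predicate` holds.

    Assumes `predicate` is monotone: False below a threshold, True above it.
    """
    while upper - lower > tolerance:
        middle = (lower + upper) / 2
        if predicate(middle):
            upper = middle
        else:
            lower = middle
    return upper


# A count changes by at most 1 when one record is added or removed.
needed = laplace_epsilon(d_in=1, sensitivity_per_record=1, scale=1.0)
found = smallest_passing(lambda eps: eps >= needed, 0.0, 100.0)
print(round(found, 6))  # 1.0
```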


In [None]:
count_release = dp_count(data)
print("DP count:", count_release)

DP count: 999


### Computing a Private Sum

Suppose we want to know the total income of our dataset. First, take a look at the [list of aggregators](https://docs.opendp.org/en/stable/api/user-guide/transformations/index.html) and observe that `make_sum` meets our requirements. As indicated by the function’s API documentation, it expects bounded data, so we’ll also need to chain the transformation from `then_clamp` with the `income_preprocessor`.
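
To see why the sum needs bounded data, consider this plain-Python sketch. Without clamping, a single record could shift the sum by an arbitrary amount. With clamping, the shift from adding or removing one record is at most the largest absolute bound. The helper names are hypothetical and the income values are made up.

```python
# Plain-Python sketch of clamping and of the resulting sum sensitivity.
# Helper names are hypothetical; they are not OpenDP constructors.

def clamp(values, bounds):
    """Force every value into the closed interval [lower, upper]."""
    lower, upper = bounds
    return [min(max(value, lower), upper) for value in values]


def sum_sensitivity(bounds, d_in=1):
    """Largest change in a clamped sum when `d_in` records are added or removed."""
    lower, upper = bounds
    return d_in * max(abs(lower), abs(upper))


income_bounds = (0, 150_000)
incomes = [0, 17000, 350000, -50]  # made-up values

print(clamp(incomes, income_bounds))       # [0, 17000, 150000, 0]
print(sum(clamp(incomes, income_bounds)))  # 167000
print(sum_sensitivity(income_bounds))      # 150000
```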

In [None]:
bounded_income_sum = (
    income_preprocessor >>
    # clamp income values.
    # "then_*" means it uses the output domain and output metric from the previous transformation
    #TODO: apply clamping transformation

    # similarly, here we use "then_sum" to avoid needing to specify the input space.
    # the sum constructor gets told that the input consists of bounded data
    #TODO: apply sum transformation
)

In this example, instead of passing a fixed scale to `then_laplace`, we want whatever scale makes the measurement $\varepsilon$-DP for $\varepsilon=1$. Again, we can use a search utility to find such a scale.
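
Under ideal arithmetic the answer can be predicted by hand: the Laplace mechanism gives $\varepsilon = \Delta / \text{scale}$, so the smallest adequate scale is $\Delta / \varepsilon$. The sketch below assumes a sensitivity of 150,000 (one added or removed record, incomes clamped to `income_bounds`) and recovers the scale by bisection. The helper names are hypothetical, and a real search utility may return a slightly different value because of conservative rounding.

```python
def epsilon_of(scale, sensitivity, d_in=1):
    """Laplace privacy loss for a given noise scale, assuming ideal arithmetic."""
    return d_in * sensitivity / scale


def smallest_scale(sensitivity, target_epsilon, lower=1e-6, upper=1e9, tolerance=1e-6):
    """Bisect for the smallest scale whose privacy loss is at most `target_epsilon`."""
    while upper - lower > tolerance:
        middle = (lower + upper) / 2
        if epsilon_of(middle, sensitivity) <= target_epsilon:
            upper = middle  # enough noise: try a smaller scale
        else:
            lower = middle  # not enough noise: need a larger scale
    return upper


sensitivity = 150_000  # assumed: max(|0|, |150_000|) for one record
print(round(smallest_scale(sensitivity, 1.0)))  # 150000
print(round(smallest_scale(sensitivity, 0.5)))  # 300000
```

Halving the target $\varepsilon$ doubles the required scale, which is the utility cost of a stricter privacy guarantee.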

In [None]:
discovered_scale = dp.binary_search_param(
    lambda s: bounded_income_sum >> dp.m.then_laplace(scale=s),
    d_in=max_influence,
    d_out=1.)

dp_sum = bounded_income_sum >> dp.m.then_laplace(scale=discovered_scale)

and then we can release the private sum...

In [None]:
dp_sum = bounded_income_sum >> dp.m.then_laplace(scale=discovered_scale)
print(dp_sum(data))

30182007


### Computing a Private Mean
We may be more interested in the mean age. The constructor for this function expects sized, bounded data. Sized data is data that has a known number of rows. The constructor enforces this requirement because knowledge of the dataset size is necessary to bound the sensitivity of the function.

Luckily, we’ve already made a DP release of the number of rows in the dataset, which we can reuse as an argument here.
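
The role of the known size can be sketched in plain Python. Once the dataset is forced to exactly $n$ rows and every value lies in $[L, U]$, replacing one record moves the mean by at most $(U - L)/n$. The helper names and the toy ages below are hypothetical, and the sketch ignores how the library accounts for the resize step itself.

```python
# Plain-Python sketch of resizing and of the sensitivity of a bounded, sized mean.
# Helper names are hypothetical; they are not OpenDP constructors.

def resize(values, size, constant):
    """Force a list to exactly `size` entries: truncate extras or pad with `constant`."""
    if len(values) >= size:
        return values[:size]
    return values + [constant] * (size - len(values))


def mean_sensitivity(bounds, size):
    """Largest change in the mean when one of `size` bounded records is replaced."""
    lower, upper = bounds
    return (upper - lower) / size


ages = [59.0, 31.0, 36.0]  # made-up values

print(resize(ages, 5, 20.0))                 # [59.0, 31.0, 36.0, 20.0, 20.0]
print(resize(ages, 2, 20.0))                 # [59.0, 31.0]
print(mean_sensitivity((0.0, 100.0), 1000))  # 0.1
```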

In [None]:
dp_mean = (
    # Convert data into a dataframe of string columns
    dp.t.make_split_dataframe(separator=",", col_names=col_names) >>
    # Selects a column of df, Vec<str>
    dp.t.make_select_column(key="age", TOA=str) >>
    # Cast the column as Vec<float>, and fill nulls with the default value, 0.
    #TODO:
    # Clamp age values
    #TODO:
    # Resize the dataset to length `count_release`.
    #     If there are fewer than `count_release` rows in the data, fill with a constant of 20.
    #     If there are more than `count_release` rows in the data, only keep `count_release` rows
    #TODO:
    # Compute the mean
    #TODO:
    # add laplace noise
    #TODO:
)

#mean_release = dp_mean(data)
#print("DP mean:", mean_release)

## Composition

We can also compose multiple measurements into a single measurement using [basic composition](https://docs.opendp.org/en/stable/api/user-guide/combinators/compositors.html#basic-composition).
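
For pure $\varepsilon$-DP, basic composition means the privacy losses of the individual releases add up. The sketch below is a toy model of that bookkeeping. `make_composition` is a hypothetical helper, the two queries are non-private stand-ins, and the epsilon values are made up.

```python
def make_composition(queries):
    """Bundle (function, epsilon) pairs into one function and one total epsilon."""
    def function(data):
        # Run every query on the same data and return the results as a list.
        return [query(data) for query, _ in queries]

    total_epsilon = sum(epsilon for _, epsilon in queries)
    return function, total_epsilon


queries = [
    (lambda data: sum(data), 1.0),               # stand-in for a DP sum
    (lambda data: sum(data) / len(data), 0.25),  # stand-in for a DP mean
]
release_all, total_epsilon = make_composition(queries)

print(release_all([1, 2, 3]))  # [6, 2.0]
print(total_epsilon)           # 1.25
```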

In [None]:
composed = dp.c.make_basic_composition([dp_sum, dp_mean])
composed(data)

[30177386, 44.348417133156815]

... and look at the resulting privacy loss:

In [None]:
composed.map(1)

1.0999000999005568

## Plugins API

We can also create user-defined transformations and measurements using the plugins API.

In [None]:
def make_repeat(multiplicity):
    """Constructs a Transformation that duplicates each record `multiplicity` times"""
    def function(arg: list[int]) -> list[int]:
        #TODO

    def stability_map(d_in: int) -> int:
        # if a user could influence at most `d_in` records before,
        # they can now influence `d_in` * `multiplicity` records
        #TODO

    return dp.t.make_user_transformation(
        input_domain=dp.vector_domain(dp.atom_domain(T=int)),
        input_metric=dp.symmetric_distance(),
        output_domain=dp.vector_domain(dp.atom_domain(T=int)),
        output_metric=dp.symmetric_distance(),
        function=function,
        stability_map=stability_map,
    )

The resulting Transformation may be used interchangeably with those constructed via the library:


In [None]:
twice_sum_transformation = (
    income_preprocessor
    >> make_repeat(2)  # our custom transformation
    >> dp.t.then_clamp(income_bounds)
    >> dp.t.then_sum()
)

release = twice_sum_transformation(data)
twice_sum_transformation.map(1)

300000
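
The value 300000 is consistent with a simple hand calculation: duplicating every record twice doubles the number of records one individual can influence, and each clamped record can shift the sum by at most 150,000. The plain-Python check below illustrates the doubling on made-up data. The helpers `symmetric_distance` and `repeat` are hypothetical and show one possible way to express the idea, not the library's implementation.

```python
from collections import Counter


def symmetric_distance(left, right):
    """Number of records to add or remove to turn one multiset into the other."""
    left_counts, right_counts = Counter(left), Counter(right)
    keys = set(left_counts) | set(right_counts)
    return sum(abs(left_counts[key] - right_counts[key]) for key in keys)


def repeat(records, multiplicity):
    """Duplicate each record `multiplicity` times."""
    return [record for record in records for _ in range(multiplicity)]


x = [10, 20, 30]
x_neighbor = [10, 20, 30, 40]  # differs from x by one added record

print(symmetric_distance(x, x_neighbor))                        # 1
print(symmetric_distance(repeat(x, 2), repeat(x_neighbor, 2)))  # 2
print(2 * max(abs(0), abs(150_000)))                            # 300000
```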

We can also use the plugins API to create user-defined measurements. In this example, we'll make the most private DP mechanism ever:



In [None]:
def make_base_constant(constant):
    """Constructs a Measurement that only returns a constant value."""
    def function(_arg: int):
        #TODO

    def privacy_map(d_in: int) -> float:
        #TODO

    return dp.m.make_user_measurement(
        input_domain=dp.atom_domain(T=int),
        input_metric=dp.absolute_distance(T=int),
        output_measure=dp.max_divergence(),
        function=function,
        privacy_map=privacy_map,
        TO=type(constant),  # the expected type of the output
    )

The resulting Measurement may be used interchangeably with those constructed via the library:

In [None]:
meas = (
    twice_sum_transformation
    >> make_base_constant("denied")
)

print(meas(data))

# computes epsilon, because the output measure is max divergence
meas.map(1)

denied


0.0

## Using the Context API

In each of the examples above, we've demonstrated how a DP developer might work with the OpenDP library using some *non-sensitive* test data. A DP end user should **not** run the intermediate transformations on the data and inspect the results. Instead, the best practice for an end user is to use the Context API:
1. set up the privacy budget (and unit, domain) for the raw dataset up front and
2. have the framework mediate all access to the data, ensuring that we stay in budget.
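
The second point can be pictured with a toy budget tracker. `BudgetAccountant` is a hypothetical class written for this sketch and is not part of OpenDP. It assumes a total budget of $\varepsilon = 1$ split evenly over two queries, and it models only the bookkeeping, not the noise.

```python
class BudgetAccountant:
    """Toy stand-in for a mediator that tracks a total epsilon budget."""

    def __init__(self, total_epsilon, split_evenly_over):
        # Each query receives an equal share of the total budget.
        self.per_query_epsilon = total_epsilon / split_evenly_over
        self.remaining_queries = split_evenly_over

    def spend(self):
        """Hand out one query's share, or refuse once the budget is used up."""
        if self.remaining_queries == 0:
            raise RuntimeError("privacy budget exhausted")
        self.remaining_queries -= 1
        return self.per_query_epsilon


accountant = BudgetAccountant(total_epsilon=1.0, split_evenly_over=2)
print(accountant.spend())  # 0.5
print(accountant.spend())  # 0.5
try:
    accountant.spend()
except RuntimeError as error:
    print(error)  # privacy budget exhausted
```

Because every release has to pass through the tracker, a third query is refused rather than silently exceeding the budget.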

In [None]:
privacy_unit = dp.unit_of(contributions=1)
privacy_loss = dp.loss_of(epsilon=1.)

# dataset (should not inspect)
from random import randint
data = [float(randint(0, 100)) for _ in range(100)]
bounds = (0.0, 100.0)
imputed_value = 50.0

context = dp.Context.compositor(
    data=data,
    privacy_unit=privacy_unit,
    privacy_loss=privacy_loss,
    split_evenly_over=2
)

count_query = (
    context.query()
    .count()
    .laplace()
)

dp_count = count_query.release()

mean_query = (
    context.query()
    .clamp(bounds)
    .resize(size=dp_count, constant=imputed_value)
    .mean()
    .laplace()
)

dp_mean = mean_query.release()

print(dp_count)
print(dp_mean)