Introduction
When implementing MinDiff, you will need to make complex decisions as you choose and shape your input before passing it on to the model. These decisions will largely determine the behavior of MinDiff within your model.
This guide will cover the technical aspects of this process, but will not discuss how to evaluate a model for fairness, or how to identify particular slices and metrics for evaluation. Please see the Fairness Indicators guidance for details on this.
To demonstrate MinDiff, this guide uses the UCI income dataset. The model task is to predict whether an individual has an income exceeding $50k, based on various personal attributes. This guide assumes there is a problematic gap in the FNR (false negative rate) between "Male"
and "Female"
slices and the model owner (you) has decided to apply MinDiff to address the issue. For more information on the scenarios in which one might choose to apply MinDiff, see the requirements page.
MinDiff works by penalizing the difference in distribution scores between examples in two sets of data. This guide will demonstrate how to choose and construct these additional MinDiff sets as well as how to package everything together so that it can be passed to a model for training.
Setup
pip install --upgrade tensorflow-model-remediation
import tensorflow as tf
from tensorflow_model_remediation import min_diff
from tensorflow_model_remediation.tools.tutorials_utils import uci as tutorials_utils
2024-07-19 09:12:10.760771: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2024-07-19 09:12:10.781683: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2024-07-19 09:12:10.787865: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Original Data
For demonstration purposes and to reduce runtimes, this guide uses only a sample fraction of the UCI Income dataset. In a real production setting, the full dataset would be utilized.
# Sampled at 0.3 for reduced runtimes.
train = tutorials_utils.get_uci_data(split='train', sample=0.3)
print(len(train), 'train examples')
9768 train examples
Converting to tf.data.Dataset
MinDiffModel
requires that the input be a tf.data.Dataset
. If you were using a different format of input prior to integrating MinDiff, you will have to convert your input data.
Use tf.data.Dataset.from_tensor_slices
to convert to tf.data.Dataset
.
dataset = tf.data.Dataset.from_tensor_slices((x, y, weights))
dataset.shuffle(...) # Optional.
dataset.batch(batch_size)
See Model.fit
documentation for details on equivalences between the two methods of input.
In this guide, the input is downloaded as a Pandas DataFrame and therefore, needs this conversion.
# Function to convert a DataFrame into a tf.data.Dataset.
def df_to_dataset(dataframe, shuffle=True):
dataframe = dataframe.copy()
labels = dataframe.pop('target')
ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
if shuffle:
ds = ds.shuffle(buffer_size=5000) # Reasonable but arbitrary buffer_size.
return ds
# Convert the train DataFrame into a Dataset.
original_train_ds = df_to_dataset(train)
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR I0000 00:00:1721380333.740706 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380333.745983 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380333.749582 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380333.753131 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380333.763393 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380333.766897 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380333.770266 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380333.773538 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380333.776460 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380333.779899 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380333.783299 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380333.786581 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.025282 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.027381 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.029430 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.032077 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.034163 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.036087 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.038088 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.040061 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.042022 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.043955 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.045985 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.047967 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.086347 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.088345 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.090375 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.092391 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.094523 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.096455 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.098382 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.100366 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.102381 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.104813 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.107188 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721380335.109569 11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
Creating MinDiff data
During training, MinDiff will encourage the model to reduce differences in predictions between two additional datasets (which may include examples from the original dataset). The selection of these two datasets is the key decision which will determine the effect MinDiff has on the model.
The two datasets should be picked such that the disparity in performance that you are trying to remediate is evident and well-represented. Since the goal is to reduce a gap in FNR between "Male"
and "Female"
slices, this means creating one dataset with only positively labeled "Male"
examples and another with only positively labeled "Female"
examples; these will be the MinDiff datasets.
First, examine the data present.
female_pos = train[(train['sex'] == ' Female') & (train['target'] == 1)]
male_pos = train[(train['sex'] == ' Male') & (train['target'] == 1)]
print(len(female_pos), 'positively labeled female examples')
print(len(male_pos), 'positively labeled male examples')
385 positively labeled female examples 2063 positively labeled male examples
It is perfectly acceptable to create MinDiff datasets from subsets of the original dataset.
While there aren't 5,000 or more positive "Male"
examples as recommended in the requirements guidance, there are over 2,000 and it is reasonable to try with that many before collecting more data.
min_diff_male_ds = df_to_dataset(male_pos)
Positive "Female"
examples, however, are much scarcer at 385. This is probably too small for good performance and so will require pulling in additional examples.
full_uci_train = tutorials_utils.get_uci_data(split='train')
augmented_female_pos = full_uci_train[((full_uci_train['sex'] == ' Female') &
(full_uci_train['target'] == 1))]
print(len(augmented_female_pos), 'positively labeled female examples')
1179 positively labeled female examples
Using the full dataset has more than tripled the number of examples that can be used for MinDiff. It’s still low but it is enough to try as a first pass.
min_diff_female_ds = df_to_dataset(augmented_female_pos)
Both the MinDiff datasets are significantly smaller than the recommended 5,000 or more examples. While it is reasonable to attempt to apply MinDiff with the current data, you may need to consider collecting additional data if you observe poor performance or overfitting during training.
Using tf.data.Dataset.filter
Alternatively, you can create the two MinDiff datasets directly from the converted original Dataset
.
# Male
def male_predicate(x, y):
return tf.equal(x['sex'], b' Male') and tf.equal(y, 0)
alternate_min_diff_male_ds = original_train_ds.filter(male_predicate).cache()
# Female
def female_predicate(x, y):
return tf.equal(x['sex'], b' Female') and tf.equal(y, 0)
full_uci_train_ds = df_to_dataset(full_uci_train)
alternate_min_diff_female_ds = full_uci_train_ds.filter(female_predicate).cache()
The resulting alternate_min_diff_male_ds
and alternate_min_diff_female_ds
will be equivalent in output to min_diff_male_ds
and min_diff_female_ds
respectively.
Constructing your Training Dataset
As a final step, the three datasets (the two newly created ones and the original) need to be merged into a single dataset that can be passed to the model.
Batching the datasets
Before merging, the datasets need to batched.
- The original dataset can use the same batching that was used before integrating MinDiff.
- The MinDiff datasets do not need to have the same batch size as the original dataset. In all likelihood, a smaller one will perform just as well. While they don't even need to have the same batch size as each other, it is recommended to do so for best performance.
While not strictly necessary, it is recommended to use drop_remainder=True
for the two MinDiff datasets as this will ensure that they have consistent batch sizes.
original_train_ds = original_train_ds.batch(128) # Same as before MinDiff.
# The MinDiff datasets can have a different batch_size from original_train_ds
min_diff_female_ds = min_diff_female_ds.batch(32, drop_remainder=True)
# Ideally we use the same batch size for both MinDiff datasets.
min_diff_male_ds = min_diff_male_ds.batch(32, drop_remainder=True)
Packing the Datasets with pack_min_diff_data
Once the datasets are prepared, pack them into a single dataset which will then be passed along to the model. A single batch from the resulting dataset will contain one batch from each of the three datasets you prepared previously.
You can do this by using the provided utils
function in the tensorflow_model_remediation
package:
train_with_min_diff_ds = min_diff.keras.utils.pack_min_diff_data(
original_dataset=original_train_ds,
sensitive_group_dataset=min_diff_female_ds,
nonsensitive_group_dataset=min_diff_male_ds)
And that's it! You will be able to use other util
functions in the package to unpack individual batches if needed.
for inputs, original_labels in train_with_min_diff_ds.take(1):
# Unpacking min_diff_data
min_diff_data = min_diff.keras.utils.unpack_min_diff_data(inputs)
min_diff_examples, min_diff_membership = min_diff_data
# Unpacking original data
original_inputs = min_diff.keras.utils.unpack_original_inputs(inputs)
With your newly formed data, you are now ready to apply MinDiff in your model! To learn how this is done, please take a look at the other guides starting with Integrating MinDiff with MinDiffModel.
Using a Custom Packing Format (optional)
You may decide to pack the three datasets together in whatever way you choose. The only requirement is that you will need to ensure the model knows how to interpret the data. The default implementation of MinDiffModel
assumes that the data was packed using min_diff.keras.utils.pack_min_diff_data
.
One easy way to format your input as you want is to transform the data as a final step after you have used min_diff.keras.utils.pack_min_diff_data
.
# Reformat input to be a dict.
def _reformat_input(inputs, original_labels):
unpacked_min_diff_data = min_diff.keras.utils.unpack_min_diff_data(inputs)
unpacked_original_inputs = min_diff.keras.utils.unpack_original_inputs(inputs)
return {
'min_diff_data': unpacked_min_diff_data,
'original_data': (unpacked_original_inputs, original_labels)}
customized_train_with_min_diff_ds = train_with_min_diff_ds.map(_reformat_input)
Your model will need to know how to read this customized input as detailed in the Customizing MinDiffModel guide.
for batch in customized_train_with_min_diff_ds.take(1):
# Customized unpacking of min_diff_data
min_diff_data = batch['min_diff_data']
# Customized unpacking of original_data
original_data = batch['original_data']
Additional Resources
- For an in depth discussion on fairness evaluation see the Fairness Indicators guidance
- For general information on Remediation and MinDiff, see the remediation overview.
- For details on requirements surrounding MinDiff see this guide.
- To see an end-to-end tutorial on using MinDiff in Keras, see this tutorial.
Utility Functions for other Guides
This guide outlines the process and decision making that you can follow whenever applying MinDiff. The rest of the guides build off this framework. To make this easier, logic found in this guide has been factored out into helper functions:
get_uci_data
: This function is already used in this guide. It returns aDataFrame
containing the UCI income data from the indicated split sampled at whatever rate is indicated (100% if unspecified).df_to_dataset
: This function converts aDataFrame
into atf.data.Dataset
as detailed in this guide with the added functionality of being able to pass the batch_size as a parameter.get_uci_with_min_diff_dataset
: This function returns atf.data.Dataset
containing both the original data and the MinDiff data packed together using the Model Remediation Library util functions as described in this guide.
The rest of the guides will build off of these to show how to use other parts of the library.