{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "headers"
},
"source": [
"Project: /responsible_ai/_project.yaml\n",
"Book: /responsible_ai/_book.yaml\n",
"\n",
"
\n", " View on TensorFlow.org\n", " | \n", "\n", " \n", " Run in Google Colab\n", " | \n", "\n", " \n", " View source on GitHub\n", " | \n", "\n", " Download notebook\n", " | \n", "
tf.data.Dataset
\n",
"\n",
"`MinDiffModel` requires that the input be a tf.data.Dataset
. If you were using a different format of input prior to integrating MinDiff, you will have to convert your input data.\n",
"\n",
"Use tf.data.Dataset.from_tensor_slices
to convert to tf.data.Dataset
.\n",
"\n",
"```\n",
"dataset = tf.data.Dataset.from_tensor_slices((x, y, weights))\n",
"dataset.shuffle(...) # Optional.\n",
"dataset.batch(batch_size)\n",
"```\n",
"\n",
"See [`Model.fit`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) documentation for details on equivalences between the two methods of input.\n",
"\n",
"In this guide, the input is downloaded as a Pandas DataFrame and therefore, needs this conversion."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "XLEhhmxBjiG2"
},
"outputs": [],
"source": [
"# Function to convert a DataFrame into a tf.data.Dataset.\n",
"def df_to_dataset(dataframe, shuffle=True):\n",
" dataframe = dataframe.copy()\n",
" labels = dataframe.pop('target')\n",
" ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))\n",
" if shuffle:\n",
" ds = ds.shuffle(buffer_size=5000) # Reasonable but arbitrary buffer_size.\n",
" return ds\n",
"\n",
"# Convert the train DataFrame into a Dataset.\n",
"original_train_ds = df_to_dataset(train)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XqiTRpVqGnGh"
},
"source": [
"Note: The training dataset has not been batched yet but it will be later."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3hgcefRb84HT"
},
"source": [
"## Creating MinDiff data\n",
"\n",
"During training, MinDiff will encourage the model to reduce differences in predictions between two additional datasets (which may include examples from the original dataset). The selection of these two datasets is the key decision which will determine the effect MinDiff has on the model.\n",
"\n",
"The two datasets should be picked such that the disparity in performance that you are trying to remediate is evident and well-represented. Since the goal is to reduce a gap in FNR between `\"Male\"` and `\"Female\"` slices, this means creating one dataset with only _positively_ labeled `\"Male\"` examples and another with only _positively_ labeled `\"Female\"` examples; these will be the MinDiff datasets.\n",
"\n",
"Note: The choice of using only _positively_ labeled examples is directly tied to the target metric. This guide is concerned with _false negatives_ which, by definition, are _positively_ labeled examples that were incorrectly classified."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uWaRK3DVBeB9"
},
"source": [
"First, examine the data present."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "q9BbPpNTHs86"
},
"outputs": [],
"source": [
"female_pos = train[(train['sex'] == ' Female') & (train['target'] == 1)]\n",
"male_pos = train[(train['sex'] == ' Male') & (train['target'] == 1)]\n",
"print(len(female_pos), 'positively labeled female examples')\n",
"print(len(male_pos), 'positively labeled male examples')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZTUFR5mlIN06"
},
"source": [
"It is perfectly acceptable to create MinDiff datasets from subsets of the original dataset.\n",
"\n",
"While there aren't 5,000 or more positive `\"Male\"` examples as recommended in the [requirements guidance](https://www.tensorflow.org/responsible_ai/model_remediation/min_diff/guide/requirements#how_much_data_do_i_need), there are over 2,000 and it is reasonable to try with that many before collecting more data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0JtRx1iBIo7w"
},
"outputs": [],
"source": [
"min_diff_male_ds = df_to_dataset(male_pos)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Yn1dTuKRIyWK"
},
"source": [
"Positive `\"Female\"` examples, however, are much scarcer at 385. This is probably too small for good performance and so will require pulling in additional examples.\n",
"\n",
"Note: Since this guide began by reducing the dataset via sampling, this problem (and the corresponding solution) may seem contrived. However, it serves as a good example of how to approach concerns about the size of your MinDiff datasets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZzO6sOsDJsxA"
},
"outputs": [],
"source": [
"full_uci_train = tutorials_utils.get_uci_data(split='train')\n",
"augmented_female_pos = full_uci_train[((full_uci_train['sex'] == ' Female') &\n",
" (full_uci_train['target'] == 1))]\n",
"print(len(augmented_female_pos), 'positively labeled female examples')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "O6WLEgbaKD78"
},
"source": [
"Using the full dataset has more than tripled the number of examples that can be used for MinDiff. It’s still low but it is enough to try as a first pass."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "zB0UxWqfKhu3"
},
"outputs": [],
"source": [
"min_diff_female_ds = df_to_dataset(augmented_female_pos)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XmO9mTyDSJYI"
},
"source": [
"Both the MinDiff datasets are significantly smaller than the recommended 5,000 or more examples. While it is reasonable to attempt to apply MinDiff with the current data, you may need to consider collecting additional data if you observe poor performance or overfitting during training."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sq6nE2kCtVgP"
},
"source": [
"### Using tf.data.Dataset.filter
\n",
"\n",
"Alternatively, you can create the two MinDiff datasets directly from the converted original `Dataset`.\n",
"\n",
"Note: When using `.filter` it is recommended to use `.cache()` if the dataset can easily fit in memory for runtime performance. If it is too large to do so, consider storing your filtered datasets in your file system and reading them in."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "efFvtSCEubnm"
},
"outputs": [],
"source": [
"# Male\n",
"def male_predicate(x, y):\n",
" return tf.equal(x['sex'], b' Male') and tf.equal(y, 0)\n",
"\n",
"alternate_min_diff_male_ds = original_train_ds.filter(male_predicate).cache()\n",
"\n",
"# Female\n",
"def female_predicate(x, y):\n",
" return tf.equal(x['sex'], b' Female') and tf.equal(y, 0)\n",
"\n",
"full_uci_train_ds = df_to_dataset(full_uci_train)\n",
"alternate_min_diff_female_ds = full_uci_train_ds.filter(female_predicate).cache()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yf2WlD01MNS0"
},
"source": [
"The resulting `alternate_min_diff_male_ds` and `alternate_min_diff_female_ds` will be equivalent in output to `min_diff_male_ds` and `min_diff_female_ds` respectively."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zmi3Tak3vqP4"
},
"source": [
"## Constructing your Training Dataset\n",
"\n",
"As a final step, the three datasets (the two newly created ones and the original) need to be merged into a single dataset that can be passed to the model."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uwM9ka-uxBGB"
},
"source": [
"### Batching the datasets\n",
"\n",
"Before merging, the datasets need to batched.\n",
"\n",
"* The original dataset can use the same batching that was used before integrating MinDiff.\n",
"* The MinDiff datasets do not need to have the same batch size as the original dataset. In all likelihood, a smaller one will perform just as well. While they don't even need to have the same batch size as each other, it is recommended to do so for best performance.\n",
"\n",
"While not strictly necessary, it is recommended to use `drop_remainder=True` for the two MinDiff datasets as this will ensure that they have consistent batch sizes.\n",
"\n",
"\n",
"Warning: The 3 datasets must be batched **before** they are merged together. Failing to do so will likely result in unintended input shapes that will cause errors downstream."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RWoXWKNtwyYl"
},
"outputs": [],
"source": [
"original_train_ds = original_train_ds.batch(128) # Same as before MinDiff.\n",
"\n",
"# The MinDiff datasets can have a different batch_size from original_train_ds\n",
"min_diff_female_ds = min_diff_female_ds.batch(32, drop_remainder=True)\n",
"# Ideally we use the same batch size for both MinDiff datasets.\n",
"min_diff_male_ds = min_diff_male_ds.batch(32, drop_remainder=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_trgfQuazSaE"
},
"source": [
"### Packing the Datasets with `pack_min_diff_data`\n",
"\n",
"Once the datasets are prepared, pack them into a single dataset which will then be passed along to the model. A single batch from the resulting dataset will contain one batch from each of the three datasets you prepared previously.\n",
"\n",
"You can do this by using the provided `utils` function in the `tensorflow_model_remediation` package:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "m3-WDjO_z-cR"
},
"outputs": [],
"source": [
"train_with_min_diff_ds = min_diff.keras.utils.pack_min_diff_data(\n",
" original_dataset=original_train_ds,\n",
" sensitive_group_dataset=min_diff_female_ds,\n",
" nonsensitive_group_dataset=min_diff_male_ds)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "A-Z5Qjkm0xjM"
},
"source": [
"And that's it! You will be able to use other `util` functions in the package to unpack individual batches if needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "k8cCF0Sp1Dp6"
},
"outputs": [],
"source": [
"for inputs, original_labels in train_with_min_diff_ds.take(1):\n",
" # Unpacking min_diff_data\n",
" min_diff_data = min_diff.keras.utils.unpack_min_diff_data(inputs)\n",
" min_diff_examples, min_diff_membership = min_diff_data\n",
" # Unpacking original data\n",
" original_inputs = min_diff.keras.utils.unpack_original_inputs(inputs)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MkuIAPh5af20"
},
"source": [
"With your newly formed data, you are now ready to apply MinDiff in your model! To learn how this is done, please take a look at the other guides starting with [Integrating MinDiff with MinDiffModel](./integrating_min_diff_with_min_diff_model)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4i-SwiEe3zji"
},
"source": [
"### Using a Custom Packing Format (optional)\n",
"\n",
"You may decide to pack the three datasets together in whatever way you choose. The only requirement is that you will need to ensure the model knows how to interpret the data. The default implementation of `MinDiffModel` assumes that the data was packed using min_diff.keras.utils.pack_min_diff_data
.\n",
"\n",
"One easy way to format your input as you want is to transform the data as a final step after you have used min_diff.keras.utils.pack_min_diff_data
."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "eihOp-6E48Tk"
},
"outputs": [],
"source": [
"# Reformat input to be a dict.\n",
"def _reformat_input(inputs, original_labels):\n",
" unpacked_min_diff_data = min_diff.keras.utils.unpack_min_diff_data(inputs)\n",
" unpacked_original_inputs = min_diff.keras.utils.unpack_original_inputs(inputs)\n",
"\n",
" return {\n",
" 'min_diff_data': unpacked_min_diff_data,\n",
" 'original_data': (unpacked_original_inputs, original_labels)}\n",
"\n",
"customized_train_with_min_diff_ds = train_with_min_diff_ds.map(_reformat_input)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vozEV0ST60sN"
},
"source": [
"Your model will need to know how to read this customized input as detailed in the [Customizing MinDiffModel guide](./customizing_min_diff_model#customizing_default_behaviors_of_mindiffmodel)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "zgWyP9Om66SF"
},
"outputs": [],
"source": [
"for batch in customized_train_with_min_diff_ds.take(1):\n",
" # Customized unpacking of min_diff_data\n",
" min_diff_data = batch['min_diff_data']\n",
" # Customized unpacking of original_data\n",
" original_data = batch['original_data']"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XXJ_ShDEmMAy"
},
"source": [
"## Additional Resources\n",
"\n",
"* For an in depth discussion on fairness evaluation see the [Fairness Indicators guidance](https://www.tensorflow.org/responsible_ai/fairness_indicators/guide/guidance)\n",
"* For general information on Remediation and MinDiff, see the [remediation overview](https://www.tensorflow.org/responsible_ai/model_remediation).\n",
"* For details on requirements surrounding MinDiff see [this guide](https://www.tensorflow.org/responsible_ai/model_remediation/min_diff/guide/requirements).\n",
"* To see an end-to-end tutorial on using MinDiff in Keras, see [this tutorial](https://www.tensorflow.org/responsible_ai/model_remediation/min_diff/tutorials/min_diff_keras)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7xbNdDoIJLBt"
},
"source": [
"## Utility Functions for other Guides\n",
"\n",
"This guide outlines the process and decision making that you can follow whenever applying MinDiff. The rest of the guides build off this framework. To make this easier, logic found in this guide has been factored out into helper functions:\n",
"\n",
"* `get_uci_data`: This function is already used in this guide. It returns a `DataFrame` containing the UCI income data from the indicated split sampled at whatever rate is indicated (100% if unspecified).\n",
"* `df_to_dataset`: This function converts a `DataFrame` into a tf.data.Dataset
as detailed in this guide with the added functionality of being able to pass the batch_size as a parameter.\n",
"* `get_uci_with_min_diff_dataset`: This function returns a tf.data.Dataset
containing both the original data and the MinDiff data packed together using the Model Remediation Library util functions as described in this guide.\n",
"\n",
"Warning: These utility functions are **not** part of the official `tensorflow-model-remediation` package API and are subject to change at any time.\n",
"\n",
"The rest of the guides will build off of these to show how to use other parts of the library."
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "min_diff_data_preparation.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}