{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "headers"
   },
   "source": [
    "Project: /responsible_ai/_project.yaml\n",
    "Book: /responsible_ai/_book.yaml\n",
    "\n",
    "<devsite-mathjax config=\"TeX-AMS-MML_SVG\"></devsite-mathjax>\n",
    "<link rel=\"stylesheet\" href=\"/site-assets/css/style.css\">\n",
    "\n",
    "<!-- DO NOT EDIT! Automatically generated file. -->\n",
    "\n",
    "\n",
    "{% comment %}\n",
    "The source of truth file can be found [here]: http://google3/third_party/py/tensorflow_model_remediation/docs\n",
    "{% endcomment %}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jh2Sd3OgG17J"
   },
   "source": [
    "# MinDiff Data Preparation\n",
    "\n",
    "<div class=\"devsite-table-wrapper\"><table class=\"tfo-notebook-buttons\" align=\"left\">\n",
    "  <td><a target=\"_blank\" href=\"https://www.tensorflow.org/responsible_ai/model_remediation/min_diff/guide/min_diff_data_preparation\">\n",
    "  <img src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" />View on TensorFlow.org</a>\n",
    "</td>\n",
    "<td>\n",
    "  <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/model-remediation/blob/master/docs/min_diff/guide/min_diff_data_preparation.ipynb\">\n",
    "  <img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\">Run in Google Colab</a>\n",
    "</td>\n",
    "<td>\n",
    "  <a target=\"_blank\" href=\"https://github.com/tensorflow/model-remediation/blob/master/docs/min_diff/guide/min_diff_data_preparation.ipynb\">\n",
    "  <img width=32px src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\">View source on GitHub</a>\n",
    "</td>\n",
    "<td>\n",
    "  <a target=\"_blank\" href=\"https://storage.googleapis.com/tensorflow_docs/model-remediation/docs/min_diff/guide/min_diff_data_preparation.ipynb\"><img src=\"https://www.tensorflow.org/images/download_logo_32px.png\" />Download notebook</a>\n",
    "</td>\n",
    "</table></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GuAubYtgDiZ0"
   },
   "source": [
    "##Introduction\n",
    "\n",
    "When implementing MinDiff, you will need to make complex decisions as you choose and shape your input before passing it on to the model. These decisions will largely determine the behavior of MinDiff within your model.\n",
    "\n",
    "This guide will cover the technical aspects of this process, but will not discuss how to evaluate a model for fairness, or how to identify particular slices and metrics for evaluation. Please see the [Fairness Indicators guidance](https://www.tensorflow.org/responsible_ai/fairness_indicators/guide/guidance) for details on this.\n",
    "\n",
    "To demonstrate MinDiff, this guide uses the [UCI income dataset](https://archive.ics.uci.edu/ml/datasets/census+income). The model task is to predict whether an individual has an income exceeding $50k, based on various personal attributes. This guide assumes there is a problematic gap in the FNR (false negative rate) between `\"Male\"` and `\"Female\"` slices and the model owner (you) has decided to apply MinDiff to address the issue. For more information on the scenarios in which one might choose to apply MinDiff, see the [requirements page](https://www.tensorflow.org/responsible_ai/model_remediation/min_diff/guide/requirements).\n",
    "\n",
    "Note: We recognize the limitations of the categories used in the original dataset, and acknowledge that these terms do not encompass the full range of vocabulary used in describing gender. Further, we acknowledge that this task doesn’t represent a real-world use case, and is used only to demonstrate the technical details of the MinDiff library.\n",
    "\n",
    "MinDiff works by penalizing the difference in distribution scores between examples in two sets of data. This guide will demonstrate how to choose and construct these additional MinDiff sets as well as how to package everything together so that it can be passed to a model for training."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Vd2pkMQZL4W2"
   },
   "source": [
    "##Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-19T09:12:09.042304Z",
     "iopub.status.busy": "2024-07-19T09:12:09.041873Z",
     "iopub.status.idle": "2024-07-19T09:12:10.503573Z",
     "shell.execute_reply": "2024-07-19T09:12:10.502763Z"
    },
    "id": "kmAHyZt9TErX"
   },
   "outputs": [],
   "source": [
    "!pip install --upgrade tensorflow-model-remediation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-19T09:12:10.507458Z",
     "iopub.status.busy": "2024-07-19T09:12:10.507196Z",
     "iopub.status.idle": "2024-07-19T09:12:13.059349Z",
     "shell.execute_reply": "2024-07-19T09:12:13.058592Z"
    },
    "id": "XRa49AkYS6n1"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2024-07-19 09:12:10.760771: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
      "2024-07-19 09:12:10.781683: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
      "2024-07-19 09:12:10.787865: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n"
     ]
    }
   ],
   "source": [
    "import tensorflow as tf\n",
    "from tensorflow_model_remediation import min_diff\n",
    "from tensorflow_model_remediation.tools.tutorials_utils import uci as tutorials_utils"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "R7rxz9wsv2tS"
   },
   "source": [
    "## Original Data\n",
    "\n",
    "For demonstration purposes and to reduce runtimes, this guide uses only a sample fraction of the UCI Income dataset. In a real production setting, the full dataset would be utilized."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-19T09:12:13.063475Z",
     "iopub.status.busy": "2024-07-19T09:12:13.063084Z",
     "iopub.status.idle": "2024-07-19T09:12:13.230823Z",
     "shell.execute_reply": "2024-07-19T09:12:13.230161Z"
    },
    "id": "WtOSyqTh7SVZ"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "9768 train examples\n"
     ]
    }
   ],
   "source": [
    "# Sampled at 0.3 for reduced runtimes.\n",
    "train = tutorials_utils.get_uci_data(split='train', sample=0.3)\n",
    "\n",
    "print(len(train), 'train examples')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ELbff3-cZkfd"
   },
   "source": [
    "### Converting to <a href=\"https://www.tensorflow.org/api_docs/python/tf/data/Dataset\"><code>tf.data.Dataset</code></a>\n",
    "\n",
    "`MinDiffModel` requires that the input be a <a href=\"https://www.tensorflow.org/api_docs/python/tf/data/Dataset\"><code>tf.data.Dataset</code></a>. If you were using a different format of input prior to integrating MinDiff, you will have to convert your input data.\n",
    "\n",
    "Use <a href=\"https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices\"><code>tf.data.Dataset.from_tensor_slices</code></a> to convert to <a href=\"https://www.tensorflow.org/api_docs/python/tf/data/Dataset\"><code>tf.data.Dataset</code></a>.\n",
    "\n",
    "```\n",
    "dataset = tf.data.Dataset.from_tensor_slices((x, y, weights))\n",
    "dataset.shuffle(...)  # Optional.\n",
    "dataset.batch(batch_size)\n",
    "```\n",
    "\n",
    "See [`Model.fit`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) documentation for details on equivalences between the two methods of input.\n",
    "\n",
    "In this guide, the input is downloaded as a Pandas DataFrame and therefore, needs this conversion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-19T09:12:13.234144Z",
     "iopub.status.busy": "2024-07-19T09:12:13.233896Z",
     "iopub.status.idle": "2024-07-19T09:12:15.455206Z",
     "shell.execute_reply": "2024-07-19T09:12:15.454473Z"
    },
    "id": "XLEhhmxBjiG2"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n",
      "I0000 00:00:1721380333.740706   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "I0000 00:00:1721380333.745983   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380333.749582   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380333.753131   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380333.763393   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380333.766897   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380333.770266   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380333.773538   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380333.776460   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380333.779899   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380333.783299   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380333.786581   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.025282   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.027381   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.029430   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.032077   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.034163   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.036087   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.038088   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.040061   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.042022   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.043955   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.045985   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.047967   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.086347   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.088345   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.090375   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.092391   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.094523   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.096455   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.098382   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.100366   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.102381   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.104813   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.107188   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
      "I0000 00:00:1721380335.109569   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n"
     ]
    }
   ],
   "source": [
    "# Function to convert a DataFrame into a tf.data.Dataset.\n",
    "def df_to_dataset(dataframe, shuffle=True):\n",
    "  dataframe = dataframe.copy()\n",
    "  labels = dataframe.pop('target')\n",
    "  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))\n",
    "  if shuffle:\n",
    "    ds = ds.shuffle(buffer_size=5000)  # Reasonable but arbitrary buffer_size.\n",
    "  return ds\n",
    "\n",
    "# Convert the train DataFrame into a Dataset.\n",
    "original_train_ds = df_to_dataset(train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XqiTRpVqGnGh"
   },
   "source": [
    "Note: The training dataset has not been batched yet but it will be later."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3hgcefRb84HT"
   },
   "source": [
    "## Creating MinDiff data\n",
    "\n",
    "During training, MinDiff will encourage the model to reduce differences in predictions between two additional datasets (which may include examples from the original dataset). The selection of these two datasets is the key decision which will determine the effect MinDiff has on the model.\n",
    "\n",
    "The two datasets should be picked such that the disparity in performance that you are trying to remediate is evident and well-represented. Since the goal is to reduce a gap in FNR between `\"Male\"` and `\"Female\"` slices, this means creating one dataset with only _positively_ labeled `\"Male\"` examples and another with only _positively_ labeled `\"Female\"` examples; these will be the MinDiff datasets.\n",
    "\n",
    "Note: The choice of using only _positively_ labeled examples is directly tied to the target metric. This guide is concerned with _false negatives_ which, by definition, are _positively_ labeled examples that were incorrectly classified."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "uWaRK3DVBeB9"
   },
   "source": [
    "First, examine the data present."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-19T09:12:15.459264Z",
     "iopub.status.busy": "2024-07-19T09:12:15.458965Z",
     "iopub.status.idle": "2024-07-19T09:12:15.467165Z",
     "shell.execute_reply": "2024-07-19T09:12:15.466505Z"
    },
    "id": "q9BbPpNTHs86"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "385 positively labeled female examples\n",
      "2063 positively labeled male examples\n"
     ]
    }
   ],
   "source": [
    "female_pos = train[(train['sex'] == ' Female') & (train['target'] == 1)]\n",
    "male_pos = train[(train['sex'] == ' Male') & (train['target'] == 1)]\n",
    "print(len(female_pos), 'positively labeled female examples')\n",
    "print(len(male_pos), 'positively labeled male examples')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZTUFR5mlIN06"
   },
   "source": [
    "It is perfectly acceptable to create MinDiff datasets from subsets of the original dataset.\n",
    "\n",
    "While there aren't 5,000 or more positive `\"Male\"` examples as recommended in the [requirements guidance](https://www.tensorflow.org/responsible_ai/model_remediation/min_diff/guide/requirements#how_much_data_do_i_need), there are over 2,000 and it is reasonable to try with that many before collecting more data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-19T09:12:15.470380Z",
     "iopub.status.busy": "2024-07-19T09:12:15.470111Z",
     "iopub.status.idle": "2024-07-19T09:12:15.487890Z",
     "shell.execute_reply": "2024-07-19T09:12:15.487320Z"
    },
    "id": "0JtRx1iBIo7w"
   },
   "outputs": [],
   "source": [
    "min_diff_male_ds = df_to_dataset(male_pos)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Yn1dTuKRIyWK"
   },
   "source": [
    "Positive `\"Female\"` examples, however, are much scarcer at 385. This is probably too small for good performance and so will require pulling in additional examples.\n",
    "\n",
    "Note: Since this guide began by reducing the dataset via sampling, this problem (and the corresponding solution) may seem contrived. However, it serves as a good example of how to approach concerns about the size of your MinDiff datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-19T09:12:15.491197Z",
     "iopub.status.busy": "2024-07-19T09:12:15.490979Z",
     "iopub.status.idle": "2024-07-19T09:12:15.512440Z",
     "shell.execute_reply": "2024-07-19T09:12:15.511824Z"
    },
    "id": "ZzO6sOsDJsxA"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1179 positively labeled female examples\n"
     ]
    }
   ],
   "source": [
    "full_uci_train = tutorials_utils.get_uci_data(split='train')\n",
    "augmented_female_pos = full_uci_train[((full_uci_train['sex'] == ' Female') &\n",
    "                                       (full_uci_train['target'] == 1))]\n",
    "print(len(augmented_female_pos), 'positively labeled female examples')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "O6WLEgbaKD78"
   },
   "source": [
    "Using the full dataset has more than tripled the number of examples that can be used for MinDiff. It’s still low but it is enough to try as a first pass."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-19T09:12:15.515479Z",
     "iopub.status.busy": "2024-07-19T09:12:15.515237Z",
     "iopub.status.idle": "2024-07-19T09:12:15.533240Z",
     "shell.execute_reply": "2024-07-19T09:12:15.532655Z"
    },
    "id": "zB0UxWqfKhu3"
   },
   "outputs": [],
   "source": [
    "min_diff_female_ds = df_to_dataset(augmented_female_pos)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XmO9mTyDSJYI"
   },
   "source": [
    "Both the MinDiff datasets are significantly smaller than the recommended 5,000 or more examples. While it is reasonable to attempt to apply MinDiff with the current data, you may need to consider collecting additional data if you observe poor performance or overfitting during training."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sq6nE2kCtVgP"
   },
   "source": [
    "### Using <a href=\"https://www.tensorflow.org/api_docs/python/tf/data/Dataset#filter\"><code>tf.data.Dataset.filter</code></a>\n",
    "\n",
    "Alternatively, you can create the two MinDiff datasets directly from the converted original `Dataset`.\n",
    "\n",
    "Note: When using `.filter` it is recommended to use `.cache()` if the dataset can easily fit in memory for runtime performance. If it is too large to do so, consider storing your filtered datasets in your file system and reading them in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-19T09:12:15.536540Z",
     "iopub.status.busy": "2024-07-19T09:12:15.536302Z",
     "iopub.status.idle": "2024-07-19T09:12:15.662859Z",
     "shell.execute_reply": "2024-07-19T09:12:15.662214Z"
    },
    "id": "efFvtSCEubnm"
   },
   "outputs": [],
   "source": [
    "# Male\n",
    "def male_predicate(x, y):\n",
    "  return tf.equal(x['sex'], b' Male') and tf.equal(y, 0)\n",
    "\n",
    "alternate_min_diff_male_ds = original_train_ds.filter(male_predicate).cache()\n",
    "\n",
    "# Female\n",
    "def female_predicate(x, y):\n",
    "  return tf.equal(x['sex'], b' Female') and tf.equal(y, 0)\n",
    "\n",
    "full_uci_train_ds = df_to_dataset(full_uci_train)\n",
    "alternate_min_diff_female_ds = full_uci_train_ds.filter(female_predicate).cache()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yf2WlD01MNS0"
   },
   "source": [
    "The resulting `alternate_min_diff_male_ds` and `alternate_min_diff_female_ds` will be equivalent in output to `min_diff_male_ds` and `min_diff_female_ds` respectively."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zmi3Tak3vqP4"
   },
   "source": [
    "## Constructing your Training Dataset\n",
    "\n",
    "As a final step, the three datasets (the two newly created ones and the original) need to be merged into a single dataset that can be passed to the model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "uwM9ka-uxBGB"
   },
   "source": [
    "### Batching the datasets\n",
    "\n",
    "Before merging, the datasets need to batched.\n",
    "\n",
    "*   The original dataset can use the same batching that was used before integrating MinDiff.\n",
    "*   The MinDiff datasets do not need to have the same batch size as the original dataset. In all likelihood, a smaller one will perform just as well. While they don't even need to have the same batch size as each other, it is recommended to do so for best performance.\n",
    "\n",
    "While not strictly necessary, it is recommended to use `drop_remainder=True` for the two MinDiff datasets as this will ensure that they have consistent batch sizes.\n",
    "\n",
    "\n",
    "Warning: The 3 datasets must be batched **before** they are merged together. Failing to do so will likely result in unintended input shapes that will cause errors downstream."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-19T09:12:15.666949Z",
     "iopub.status.busy": "2024-07-19T09:12:15.666705Z",
     "iopub.status.idle": "2024-07-19T09:12:15.676141Z",
     "shell.execute_reply": "2024-07-19T09:12:15.675541Z"
    },
    "id": "RWoXWKNtwyYl"
   },
   "outputs": [],
   "source": [
    "original_train_ds = original_train_ds.batch(128)  # Same as before MinDiff.\n",
    "\n",
    "# The MinDiff datasets can have a different batch_size from original_train_ds\n",
    "min_diff_female_ds = min_diff_female_ds.batch(32, drop_remainder=True)\n",
    "# Ideally we use the same batch size for both MinDiff datasets.\n",
    "min_diff_male_ds = min_diff_male_ds.batch(32, drop_remainder=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_trgfQuazSaE"
   },
   "source": [
    "### Packing the Datasets with `pack_min_diff_data`\n",
    "\n",
    "Once the datasets are prepared, pack them into a single dataset which will then be passed along to the model. A single batch from the resulting dataset will contain one batch from each of the three datasets you prepared previously.\n",
    "\n",
    "You can do this by using the provided `utils` function in the `tensorflow_model_remediation` package:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-19T09:12:15.679352Z",
     "iopub.status.busy": "2024-07-19T09:12:15.679085Z",
     "iopub.status.idle": "2024-07-19T09:12:16.274053Z",
     "shell.execute_reply": "2024-07-19T09:12:16.273317Z"
    },
    "id": "m3-WDjO_z-cR"
   },
   "outputs": [],
   "source": [
    "train_with_min_diff_ds = min_diff.keras.utils.pack_min_diff_data(\n",
    "    original_dataset=original_train_ds,\n",
    "    sensitive_group_dataset=min_diff_female_ds,\n",
    "    nonsensitive_group_dataset=min_diff_male_ds)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "A-Z5Qjkm0xjM"
   },
   "source": [
    "And that's it! You will be able to use other `util` functions in the package to unpack individual batches if needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-19T09:12:16.278058Z",
     "iopub.status.busy": "2024-07-19T09:12:16.277800Z",
     "iopub.status.idle": "2024-07-19T09:12:16.435109Z",
     "shell.execute_reply": "2024-07-19T09:12:16.434367Z"
    },
    "id": "k8cCF0Sp1Dp6"
   },
   "outputs": [],
   "source": [
    "for inputs, original_labels in train_with_min_diff_ds.take(1):\n",
    "  # Unpacking min_diff_data\n",
    "  min_diff_data = min_diff.keras.utils.unpack_min_diff_data(inputs)\n",
    "  min_diff_examples, min_diff_membership = min_diff_data\n",
    "  # Unpacking original data\n",
    "  original_inputs = min_diff.keras.utils.unpack_original_inputs(inputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MkuIAPh5af20"
   },
   "source": [
    "With your newly formed data, you are now ready to apply MinDiff in your model! To learn how this is done, please take a look at the other guides starting with [Integrating MinDiff with MinDiffModel](./integrating_min_diff_with_min_diff_model)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4i-SwiEe3zji"
   },
   "source": [
    "### Using a Custom Packing Format (optional)\n",
    "\n",
    "You may decide to pack the three datasets together in whatever way you choose. The only requirement is that you will need to ensure the model knows how to interpret the data. The default implementation of `MinDiffModel` assumes that the data was packed using <a href=\"https://www.tensorflow.org/responsible_ai/model_remediation/api_docs/python/model_remediation/min_diff/keras/utils/pack_min_diff_data\"><code>min_diff.keras.utils.pack_min_diff_data</code></a>.\n",
    "\n",
    "One easy way to format your input as you want is to transform the data as a final step after you have used <a href=\"https://www.tensorflow.org/responsible_ai/model_remediation/api_docs/python/model_remediation/min_diff/keras/utils/pack_min_diff_data\"><code>min_diff.keras.utils.pack_min_diff_data</code></a>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-19T09:12:16.439341Z",
     "iopub.status.busy": "2024-07-19T09:12:16.438709Z",
     "iopub.status.idle": "2024-07-19T09:12:16.548717Z",
     "shell.execute_reply": "2024-07-19T09:12:16.548041Z"
    },
    "id": "eihOp-6E48Tk"
   },
   "outputs": [],
   "source": [
    "# Reformat input to be a dict.\n",
    "def _reformat_input(inputs, original_labels):\n",
    "  unpacked_min_diff_data = min_diff.keras.utils.unpack_min_diff_data(inputs)\n",
    "  unpacked_original_inputs = min_diff.keras.utils.unpack_original_inputs(inputs)\n",
    "\n",
    "  return {\n",
    "      'min_diff_data': unpacked_min_diff_data,\n",
    "      'original_data': (unpacked_original_inputs, original_labels)}\n",
    "\n",
    "customized_train_with_min_diff_ds = train_with_min_diff_ds.map(_reformat_input)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vozEV0ST60sN"
   },
   "source": [
    "Your model will need to know how to read this customized input as detailed in the [Customizing MinDiffModel guide](./customizing_min_diff_model#customizing_default_behaviors_of_mindiffmodel)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-07-19T09:12:16.552270Z",
     "iopub.status.busy": "2024-07-19T09:12:16.552009Z",
     "iopub.status.idle": "2024-07-19T09:12:16.732993Z",
     "shell.execute_reply": "2024-07-19T09:12:16.732232Z"
    },
    "id": "zgWyP9Om66SF"
   },
   "outputs": [],
   "source": [
    "for batch in customized_train_with_min_diff_ds.take(1):\n",
    "  # Customized unpacking of min_diff_data\n",
    "  min_diff_data = batch['min_diff_data']\n",
    "  # Customized unpacking of original_data\n",
    "  original_data = batch['original_data']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XXJ_ShDEmMAy"
   },
   "source": [
    "## Additional Resources\n",
    "\n",
    "*   For an in depth discussion on fairness evaluation see the [Fairness Indicators guidance](https://www.tensorflow.org/responsible_ai/fairness_indicators/guide/guidance)\n",
    "*   For general information on Remediation and MinDiff, see the [remediation overview](https://www.tensorflow.org/responsible_ai/model_remediation).\n",
    "*    For details on requirements surrounding MinDiff see [this guide](https://www.tensorflow.org/responsible_ai/model_remediation/min_diff/guide/requirements).\n",
    "*   To see an end-to-end tutorial on using MinDiff in Keras, see [this tutorial](https://www.tensorflow.org/responsible_ai/model_remediation/min_diff/tutorials/min_diff_keras)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7xbNdDoIJLBt"
   },
   "source": [
    "## Utility Functions for other Guides\n",
    "\n",
    "This guide outlines the process and decision making that you can follow whenever applying MinDiff. The rest of the guides build off this framework. To make this easier, logic found in this guide has been factored out into helper functions:\n",
    "\n",
    "*   `get_uci_data`: This function is already used in this guide. It returns a `DataFrame` containing the UCI income data from the indicated split sampled at whatever rate is indicated (100% if unspecified).\n",
    "*   `df_to_dataset`: This function converts a `DataFrame` into a <a href=\"https://www.tensorflow.org/api_docs/python/tf/data/Dataset\"><code>tf.data.Dataset</code></a> as detailed in this guide with the added functionality of being able to pass the batch_size as a parameter.\n",
    "*   `get_uci_with_min_diff_dataset`: This function returns a <a href=\"https://www.tensorflow.org/api_docs/python/tf/data/Dataset\"><code>tf.data.Dataset</code></a> containing both the original data and the MinDiff data packed together using the Model Remediation Library util functions as described in this guide.\n",
    "\n",
    "Warning: These utility functions are **not** part of the official `tensorflow-model-remediation` package API and are subject to change at any time.\n",
    "\n",
    "The rest of the guides will build off of these to show how to use other parts of the library."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "min_diff_data_preparation.ipynb",
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}