Ver en TensorFlow.org | Ejecutar en Google Colab | Ver fuente en GitHub | Descargar cuaderno |
Abstracto
En este colab, demostramos cómo usar los diversos optimizadores implementados en TensorFlow Probability.
Dependencias y requisitos previos
Importar
%matplotlib inline
import contextlib
import functools
import os
import time
import numpy as np
import pandas as pd
import scipy as sp
from six.moves import urllib
from sklearn import preprocessing
import tensorflow.compat.v2 as tf
tf.enable_v2_behavior()
import tensorflow_probability as tfp
Optimizadores BFGS y L-BFGS
Los métodos de cuasi Newton son una clase de algoritmo de optimización de primer orden popular. Estos métodos utilizan una aproximación definida positiva al hessiano exacto para encontrar la dirección de búsqueda.
El algoritmo Broyden- Fletcher-Goldfarb-Shanno ( BFGS ) es una implementación específica de esta idea general. Es aplicable y es el método de elección para problemas de tamaño medio en el que el gradiente es continuo en todas partes (por ejemplo, regresión lineal con un \(L_2\) penalización).
L-BFGS es una versión de memoria limitada de BFGS que es útil para resolver los problemas más grandes cuyas matrices de Hesse no puede ser calculado a un costo razonable o no son escasos. En lugar de almacenar totalmente densos \(n \times n\) aproximaciones de matrices de Hesse, que sólo ahorran unos vectores de longitud \(n\) que representan estas aproximaciones implícita.
Funciones auxiliares
CACHE_DIR = os.path.join(os.sep, 'tmp', 'datasets')
def make_val_and_grad_fn(value_fn):
@functools.wraps(value_fn)
def val_and_grad(x):
return tfp.math.value_and_gradient(value_fn, x)
return val_and_grad
@contextlib.contextmanager
def timed_execution():
t0 = time.time()
yield
dt = time.time() - t0
print('Evaluation took: %f seconds' % dt)
def np_value(tensor):
"""Get numpy value out of possibly nested tuple of tensors."""
if isinstance(tensor, tuple):
return type(tensor)(*(np_value(t) for t in tensor))
else:
return tensor.numpy()
def run(optimizer):
"""Run an optimizer and measure it's evaluation time."""
optimizer() # Warmup.
with timed_execution():
result = optimizer()
return np_value(result)
L-BFGS en una función cuadrática simple
# Fix numpy seed for reproducibility
np.random.seed(12345)
# The objective must be supplied as a function that takes a single
# (Tensor) argument and returns a tuple. The first component of the
# tuple is the value of the objective at the supplied point and the
# second value is the gradient at the supplied point. The value must
# be a scalar and the gradient must have the same shape as the
# supplied argument.
# The `make_val_and_grad_fn` decorator helps transforming a function
# returning the objective value into one that returns both the gradient
# and the value. It also works for both eager and graph mode.
dim = 10
minimum = np.ones([dim])
scales = np.exp(np.random.randn(dim))
@make_val_and_grad_fn
def quadratic(x):
return tf.reduce_sum(scales * (x - minimum) ** 2, axis=-1)
# The minimization routine also requires you to supply an initial
# starting point for the search. For this example we choose a random
# starting point.
start = np.random.randn(dim)
# Finally an optional argument called tolerance let's you choose the
# stopping point of the search. The tolerance specifies the maximum
# (supremum) norm of the gradient vector at which the algorithm terminates.
# If you don't have a specific need for higher or lower accuracy, leaving
# this parameter unspecified (and hence using the default value of 1e-8)
# should be good enough.
tolerance = 1e-10
@tf.function
def quadratic_with_lbfgs():
return tfp.optimizer.lbfgs_minimize(
quadratic,
initial_position=tf.constant(start),
tolerance=tolerance)
results = run(quadratic_with_lbfgs)
# The optimization results contain multiple pieces of information. The most
# important fields are: 'converged' and 'position'.
# Converged is a boolean scalar tensor. As the name implies, it indicates
# whether the norm of the gradient at the final point was within tolerance.
# Position is the location of the minimum found. It is important to check
# that converged is True before using the value of the position.
print('L-BFGS Results')
print('Converged:', results.converged)
print('Location of the minimum:', results.position)
print('Number of iterations:', results.num_iterations)
Evaluation took: 0.014586 seconds L-BFGS Results Converged: True Location of the minimum: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] Number of iterations: 10
Mismo problema con BFGS
@tf.function
def quadratic_with_bfgs():
return tfp.optimizer.bfgs_minimize(
quadratic,
initial_position=tf.constant(start),
tolerance=tolerance)
results = run(quadratic_with_bfgs)
print('BFGS Results')
print('Converged:', results.converged)
print('Location of the minimum:', results.position)
print('Number of iterations:', results.num_iterations)
Evaluation took: 0.010468 seconds BFGS Results Converged: True Location of the minimum: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] Number of iterations: 10
Regresión lineal con penalización L1: datos de cáncer de próstata
Ejemplo del libro: Los elementos del aprendizaje estadístico, minería de datos, Inferencia y predicción por Trevor Hastie, Robert Tibshirani y Jerome Friedman.
Tenga en cuenta que este es un problema de optimización con la penalización L1.
Obtener conjunto de datos
def cache_or_download_file(cache_dir, url_base, filename):
"""Read a cached file or download it."""
filepath = os.path.join(cache_dir, filename)
if tf.io.gfile.exists(filepath):
return filepath
if not tf.io.gfile.exists(cache_dir):
tf.io.gfile.makedirs(cache_dir)
url = url_base + filename
print("Downloading {url} to {filepath}.".format(url=url, filepath=filepath))
urllib.request.urlretrieve(url, filepath)
return filepath
def get_prostate_dataset(cache_dir=CACHE_DIR):
"""Download the prostate dataset and read as Pandas dataframe."""
url_base = 'http://web.stanford.edu/~hastie/ElemStatLearn/datasets/'
return pd.read_csv(
cache_or_download_file(cache_dir, url_base, 'prostate.data'),
delim_whitespace=True, index_col=0)
prostate_df = get_prostate_dataset()
Downloading http://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data to /tmp/datasets/prostate.data.
Definición del problema
np.random.seed(12345)
feature_names = ['lcavol', 'lweight', 'age', 'lbph', 'svi', 'lcp',
'gleason', 'pgg45']
# Normalize features
scalar = preprocessing.StandardScaler()
prostate_df[feature_names] = pd.DataFrame(
scalar.fit_transform(
prostate_df[feature_names].astype('float64')))
# select training set
prostate_df_train = prostate_df[prostate_df.train == 'T']
# Select features and labels
features = prostate_df_train[feature_names]
labels = prostate_df_train[['lpsa']]
# Create tensors
feat = tf.constant(features.values, dtype=tf.float64)
lab = tf.constant(labels.values, dtype=tf.float64)
dtype = feat.dtype
regularization = 0 # regularization parameter
dim = 8 # number of features
# We pick a random starting point for the search
start = np.random.randn(dim + 1)
def regression_loss(params):
"""Compute loss for linear regression model with L1 penalty
Args:
params: A real tensor of shape [dim + 1]. The zeroth component
is the intercept term and the rest of the components are the
beta coefficients.
Returns:
The mean square error loss including L1 penalty.
"""
params = tf.squeeze(params)
intercept, beta = params[0], params[1:]
pred = tf.matmul(feat, tf.expand_dims(beta, axis=-1)) + intercept
mse_loss = tf.reduce_sum(
tf.cast(
tf.losses.mean_squared_error(y_true=lab, y_pred=pred), tf.float64))
l1_penalty = regularization * tf.reduce_sum(tf.abs(beta))
total_loss = mse_loss + l1_penalty
return total_loss
Resolviendo con L-BFGS
Ajuste usando L-BFGS. Aunque la penalización L1 introduce discontinuidades derivadas, en la práctica, L-BFGS todavía funciona bastante bien.
@tf.function
def l1_regression_with_lbfgs():
return tfp.optimizer.lbfgs_minimize(
make_val_and_grad_fn(regression_loss),
initial_position=tf.constant(start),
tolerance=1e-8)
results = run(l1_regression_with_lbfgs)
minimum = results.position
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]
print('L-BFGS Results')
print('Converged:', results.converged)
print('Intercept: Fitted ({})'.format(fitted_intercept))
print('Beta: Fitted {}'.format(fitted_beta))
Evaluation took: 0.017987 seconds L-BFGS Results Converged: True Intercept: Fitted (2.3879985744556484) Beta: Fitted [ 0.68626215 0.28193532 -0.17030254 0.10799274 0.33634988 -0.24888523 0.11992237 0.08689026]
Resolviendo con Nelder Mead
El método de Nelder Mead es uno de los métodos más populares de minimización libres derivados. Este optimizador no usa información de gradiente y no hace suposiciones sobre la diferenciabilidad de la función de destino; por lo tanto, es apropiado para funciones objetivas no uniformes, por ejemplo, problemas de optimización con penalización L1.
Para un problema de optimización en \(n\)-dimensiones que mantiene un conjunto de\(n+1\) soluciones candidatas que abarcan un simple no degenerado. Modifica sucesivamente el símplex en función de un conjunto de movimientos (reflexión, expansión, encogimiento y contracción) utilizando los valores de la función en cada uno de los vértices.
# Nelder mead expects an initial_vertex of shape [n + 1, 1].
initial_vertex = tf.expand_dims(tf.constant(start, dtype=dtype), axis=-1)
@tf.function
def l1_regression_with_nelder_mead():
return tfp.optimizer.nelder_mead_minimize(
regression_loss,
initial_vertex=initial_vertex,
func_tolerance=1e-10,
position_tolerance=1e-10)
results = run(l1_regression_with_nelder_mead)
minimum = results.position.reshape([-1])
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]
print('Nelder Mead Results')
print('Converged:', results.converged)
print('Intercept: Fitted ({})'.format(fitted_intercept))
print('Beta: Fitted {}'.format(fitted_beta))
Evaluation took: 0.325643 seconds Nelder Mead Results Converged: True Intercept: Fitted (2.387998456121595) Beta: Fitted [ 0.68626266 0.28193456 -0.17030291 0.10799375 0.33635132 -0.24888703 0.11992244 0.08689023]
Regresión logística con penalización L2
Para este ejemplo, creamos un conjunto de datos sintéticos para la clasificación y usamos el optimizador L-BFGS para ajustar los parámetros.
np.random.seed(12345)
dim = 5 # The number of features
n_obs = 10000 # The number of observations
betas = np.random.randn(dim) # The true beta
intercept = np.random.randn() # The true intercept
features = np.random.randn(n_obs, dim) # The feature matrix
probs = sp.special.expit(
np.matmul(features, np.expand_dims(betas, -1)) + intercept)
labels = sp.stats.bernoulli.rvs(probs) # The true labels
regularization = 0.8
feat = tf.constant(features)
lab = tf.constant(labels, dtype=feat.dtype)
@make_val_and_grad_fn
def negative_log_likelihood(params):
"""Negative log likelihood for logistic model with L2 penalty
Args:
params: A real tensor of shape [dim + 1]. The zeroth component
is the intercept term and the rest of the components are the
beta coefficients.
Returns:
The negative log likelihood plus the penalty term.
"""
intercept, beta = params[0], params[1:]
logit = tf.matmul(feat, tf.expand_dims(beta, -1)) + intercept
log_likelihood = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(
labels=lab, logits=logit))
l2_penalty = regularization * tf.reduce_sum(beta ** 2)
total_loss = log_likelihood + l2_penalty
return total_loss
start = np.random.randn(dim + 1)
@tf.function
def l2_regression_with_lbfgs():
return tfp.optimizer.lbfgs_minimize(
negative_log_likelihood,
initial_position=tf.constant(start),
tolerance=1e-8)
results = run(l2_regression_with_lbfgs)
minimum = results.position
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]
print('Converged:', results.converged)
print('Intercept: Fitted ({}), Actual ({})'.format(fitted_intercept, intercept))
print('Beta:\n\tFitted {},\n\tActual {}'.format(fitted_beta, betas))
Evaluation took: 0.056751 seconds Converged: True Intercept: Fitted (1.4111415084244365), Actual (1.3934058329729904) Beta: Fitted [-0.18016612 0.53121578 -0.56420632 -0.5336374 2.00499675], Actual [-0.20470766 0.47894334 -0.51943872 -0.5557303 1.96578057]
Soporte de lotes
Tanto BFGS como L-BFGS admiten cálculos por lotes, por ejemplo, para optimizar una única función desde muchos puntos de partida diferentes; o múltiples funciones paramétricas desde un solo punto.
Función única, múltiples puntos de partida
La función de Himmelblau es un caso de prueba de optimización estándar. La función viene dada por:
\[f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2\]
La función tiene cuatro mínimos ubicados en:
- (3, 2),
- (-2,805118, 3,131312),
- (-3.779310, -3.283186),
- (3,584428, -1,848126).
Todos estos mínimos se pueden alcanzar desde puntos de partida apropiados.
# The function to minimize must take as input a tensor of shape [..., n]. In
# this n=2 is the size of the domain of the input and [...] are batching
# dimensions. The return value must be of shape [...], i.e. a batch of scalars
# with the objective value of the function evaluated at each input point.
@make_val_and_grad_fn
def himmelblau(coord):
x, y = coord[..., 0], coord[..., 1]
return (x * x + y - 11) ** 2 + (x + y * y - 7) ** 2
starts = tf.constant([[1, 1],
[-2, 2],
[-1, -1],
[1, -2]], dtype='float64')
# The stopping_condition allows to further specify when should the search stop.
# The default, tfp.optimizer.converged_all, will proceed until all points have
# either converged or failed. There is also a tfp.optimizer.converged_any to
# stop as soon as the first point converges, or all have failed.
@tf.function
def batch_multiple_starts():
return tfp.optimizer.lbfgs_minimize(
himmelblau, initial_position=starts,
stopping_condition=tfp.optimizer.converged_all,
tolerance=1e-8)
results = run(batch_multiple_starts)
print('Converged:', results.converged)
print('Minima:', results.position)
Evaluation took: 0.019095 seconds Converged: [ True True True True] Minima: [[ 3. 2. ] [-2.80511809 3.13131252] [-3.77931025 -3.28318599] [ 3.58442834 -1.84812653]]
Múltiples funciones
Para fines de demostración, en este ejemplo optimizamos simultáneamente una gran cantidad de cuencos cuadráticos generados aleatoriamente de alta dimensión.
np.random.seed(12345)
dim = 100
batches = 500
minimum = np.random.randn(batches, dim)
scales = np.exp(np.random.randn(batches, dim))
@make_val_and_grad_fn
def quadratic(x):
return tf.reduce_sum(input_tensor=scales * (x - minimum)**2, axis=-1)
# Make all starting points (1, 1, ..., 1). Note not all starting points need
# to be the same.
start = tf.ones((batches, dim), dtype='float64')
@tf.function
def batch_multiple_functions():
return tfp.optimizer.lbfgs_minimize(
quadratic, initial_position=start,
stopping_condition=tfp.optimizer.converged_all,
max_iterations=100,
tolerance=1e-8)
results = run(batch_multiple_functions)
print('All converged:', np.all(results.converged))
print('Largest error:', np.max(results.position - minimum))
Evaluation took: 1.994132 seconds All converged: True Largest error: 4.4131473142527966e-08