Ta strona została przetłumaczona przez Cloud Translation API.

Optymalizatory w TensorFlow Prawdopodobieństwo

Zobacz na TensorFlow.org

Uruchom w Google Colab

Wyświetl źródło na GitHub

Pobierz notatnik

Abstrakcyjny

W tej kolaboracji pokazujemy, jak korzystać z różnych optymalizatorów zaimplementowanych w TensorFlow Probability.

Zależności i wymagania wstępne

Import

%matplotlib inline


import contextlib
import functools
import os
import time

import numpy as np
import pandas as pd
import scipy as sp
from six.moves import urllib
from sklearn import preprocessing

import tensorflow.compat.v2 as tf
tf.enable_v2_behavior()

import tensorflow_probability as tfp

Optymalizatory BFGS i L-BFGS

Metody quasi Newton są klasą popularnych algorytmów optymalizacji pierwszego rzędu. Metody te wykorzystują pozytywne, określone przybliżenie do dokładnego hessu, aby znaleźć kierunek wyszukiwania.

Algorytm Broyden-Fletcher-Goldfarb-Shanno ( BFGS ) jest wdrożenie specyficznych tej ogólnej idei. Jest to zastosowanie i jest metodą z wyboru dla średnich rozmiarach problemów którym gradient jest ciągły na całym świecie (np regresji liniowej z \(L_2\) kary).

L BFGS jest wersja ograniczoną pamięć BFGS, które są przydatne do rozwiązywania problemów, których Heskie większe matryce nie może być obliczony przy rozsądnych kosztach i nie są rzadkie. Zamiast przechowywać w pełni gęste \(n \times n\) przybliżenia Hesji matrycach, tylko zapisać kilka wektorów o długości \(n\) że reprezentowania tych przybliżeń niejawnie.

Funkcje pomocnicze

CACHE_DIR = os.path.join(os.sep, 'tmp', 'datasets')


def make_val_and_grad_fn(value_fn):
  @functools.wraps(value_fn)
  def val_and_grad(x):
    return tfp.math.value_and_gradient(value_fn, x)
  return val_and_grad


@contextlib.contextmanager
def timed_execution():
  t0 = time.time()
  yield
  dt = time.time() - t0
  print('Evaluation took: %f seconds' % dt)


def np_value(tensor):
  """Get numpy value out of possibly nested tuple of tensors."""
  if isinstance(tensor, tuple):
    return type(tensor)(*(np_value(t) for t in tensor))
  else:
    return tensor.numpy()

def run(optimizer):
  """Run an optimizer and measure it's evaluation time."""
  optimizer()  # Warmup.
  with timed_execution():
    result = optimizer()
  return np_value(result)

L-BFGS na prostej funkcji kwadratowej

# Fix numpy seed for reproducibility
np.random.seed(12345)

# The objective must be supplied as a function that takes a single
# (Tensor) argument and returns a tuple. The first component of the
# tuple is the value of the objective at the supplied point and the
# second value is the gradient at the supplied point. The value must
# be a scalar and the gradient must have the same shape as the
# supplied argument.

# The `make_val_and_grad_fn` decorator helps transforming a function
# returning the objective value into one that returns both the gradient
# and the value. It also works for both eager and graph mode.

dim = 10
minimum = np.ones([dim])
scales = np.exp(np.random.randn(dim))

@make_val_and_grad_fn
def quadratic(x):
  return tf.reduce_sum(scales * (x - minimum) ** 2, axis=-1)

# The minimization routine also requires you to supply an initial
# starting point for the search. For this example we choose a random
# starting point.
start = np.random.randn(dim)

# Finally an optional argument called tolerance let's you choose the
# stopping point of the search. The tolerance specifies the maximum
# (supremum) norm of the gradient vector at which the algorithm terminates.
# If you don't have a specific need for higher or lower accuracy, leaving
# this parameter unspecified (and hence using the default value of 1e-8)
# should be good enough.
tolerance = 1e-10

@tf.function
def quadratic_with_lbfgs():
  return tfp.optimizer.lbfgs_minimize(
    quadratic,
    initial_position=tf.constant(start),
    tolerance=tolerance)

results = run(quadratic_with_lbfgs)

# The optimization results contain multiple pieces of information. The most
# important fields are: 'converged' and 'position'.
# Converged is a boolean scalar tensor. As the name implies, it indicates
# whether the norm of the gradient at the final point was within tolerance.
# Position is the location of the minimum found. It is important to check
# that converged is True before using the value of the position.

print('L-BFGS Results')
print('Converged:', results.converged)
print('Location of the minimum:', results.position)
print('Number of iterations:', results.num_iterations)

Evaluation took: 0.014586 seconds
L-BFGS Results
Converged: True
Location of the minimum: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Number of iterations: 10

Ten sam problem z BFGS

@tf.function
def quadratic_with_bfgs():
  return tfp.optimizer.bfgs_minimize(
    quadratic,
    initial_position=tf.constant(start),
    tolerance=tolerance)

results = run(quadratic_with_bfgs)

print('BFGS Results')
print('Converged:', results.converged)
print('Location of the minimum:', results.position)
print('Number of iterations:', results.num_iterations)

Evaluation took: 0.010468 seconds
BFGS Results
Converged: True
Location of the minimum: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Number of iterations: 10

Regresja liniowa z karą L1: dane dotyczące raka prostaty

Przykład z książki: The Elements of Statistical Learning, Data Mining, wnioskowania i przewidywania przez Trevora Hastie, Robert Tibshirani i Jerome Friedman.

Zauważ, że jest to problem z optymalizacją z karą L1.

Uzyskaj zbiór danych

def cache_or_download_file(cache_dir, url_base, filename):
  """Read a cached file or download it."""
  filepath = os.path.join(cache_dir, filename)
  if tf.io.gfile.exists(filepath):
    return filepath
  if not tf.io.gfile.exists(cache_dir):
    tf.io.gfile.makedirs(cache_dir)
  url = url_base + filename
  print("Downloading {url} to {filepath}.".format(url=url, filepath=filepath))
  urllib.request.urlretrieve(url, filepath)
  return filepath

def get_prostate_dataset(cache_dir=CACHE_DIR):
  """Download the prostate dataset and read as Pandas dataframe."""
  url_base = 'http://web.stanford.edu/~hastie/ElemStatLearn/datasets/'
  return pd.read_csv(
      cache_or_download_file(cache_dir, url_base, 'prostate.data'),
      delim_whitespace=True, index_col=0)

prostate_df = get_prostate_dataset()

Downloading http://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data to /tmp/datasets/prostate.data.

Definicja problemu

np.random.seed(12345)

feature_names = ['lcavol', 'lweight',   'age',  'lbph', 'svi', 'lcp',   
                 'gleason', 'pgg45']

# Normalize features
scalar = preprocessing.StandardScaler()
prostate_df[feature_names] = pd.DataFrame(
    scalar.fit_transform(
        prostate_df[feature_names].astype('float64')))

# select training set
prostate_df_train = prostate_df[prostate_df.train == 'T'] 

# Select features and labels 
features = prostate_df_train[feature_names]
labels =  prostate_df_train[['lpsa']]

# Create tensors
feat = tf.constant(features.values, dtype=tf.float64)
lab = tf.constant(labels.values, dtype=tf.float64)

dtype = feat.dtype

regularization = 0 # regularization parameter
dim = 8 # number of features

# We pick a random starting point for the search
start = np.random.randn(dim + 1)

def regression_loss(params):
  """Compute loss for linear regression model with L1 penalty

  Args:
    params: A real tensor of shape [dim + 1]. The zeroth component
      is the intercept term and the rest of the components are the
      beta coefficients.

  Returns:
    The mean square error loss including L1 penalty.
  """
  params = tf.squeeze(params)
  intercept, beta  = params[0], params[1:]
  pred = tf.matmul(feat, tf.expand_dims(beta, axis=-1)) + intercept
  mse_loss = tf.reduce_sum(
      tf.cast(
        tf.losses.mean_squared_error(y_true=lab, y_pred=pred), tf.float64))
  l1_penalty = regularization * tf.reduce_sum(tf.abs(beta))
  total_loss = mse_loss + l1_penalty
  return total_loss

Rozwiązywanie z L-BFGS

Dopasuj za pomocą L-BFGS. Mimo że kara L1 wprowadza nieciągłości pochodnych, w praktyce L-BFGS nadal działa całkiem dobrze.

@tf.function
def l1_regression_with_lbfgs():
  return tfp.optimizer.lbfgs_minimize(
    make_val_and_grad_fn(regression_loss),
    initial_position=tf.constant(start),
    tolerance=1e-8)

results = run(l1_regression_with_lbfgs)
minimum = results.position
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]

print('L-BFGS Results')
print('Converged:', results.converged)
print('Intercept: Fitted ({})'.format(fitted_intercept))
print('Beta:      Fitted {}'.format(fitted_beta))

Evaluation took: 0.017987 seconds
L-BFGS Results
Converged: True
Intercept: Fitted (2.3879985744556484)
Beta:      Fitted [ 0.68626215  0.28193532 -0.17030254  0.10799274  0.33634988 -0.24888523
  0.11992237  0.08689026]

Rozwiązywanie z Nelder Mead

Sposób Neldera Mead jest jednym z najbardziej popularnych sposobów minimalizacji wolnych pochodnych. Ten optymalizator nie używa informacji o gradientach i nie przyjmuje żadnych założeń dotyczących różniczkowalności funkcji docelowej; jest zatem odpowiedni dla niegładkich funkcji celu, na przykład problemów optymalizacji z karą L1.

W przypadku problemu optymalizacji w \(n\)-wymiary utrzymuje zestaw\(n+1\) rozwiązań kandydujących, które rozciągają się nie zdegenerowany simplex. Kolejno modyfikuje simpleks w oparciu o zestaw ruchów (odbicie, rozszerzenie, skurcz i skrócenie) przy użyciu wartości funkcji na każdym z wierzchołków.

# Nelder mead expects an initial_vertex of shape [n + 1, 1].
initial_vertex = tf.expand_dims(tf.constant(start, dtype=dtype), axis=-1)

@tf.function
def l1_regression_with_nelder_mead():
  return tfp.optimizer.nelder_mead_minimize(
      regression_loss,
      initial_vertex=initial_vertex,
      func_tolerance=1e-10,
      position_tolerance=1e-10)

results = run(l1_regression_with_nelder_mead)
minimum = results.position.reshape([-1])
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]

print('Nelder Mead Results')
print('Converged:', results.converged)
print('Intercept: Fitted ({})'.format(fitted_intercept))
print('Beta:      Fitted {}'.format(fitted_beta))

Evaluation took: 0.325643 seconds
Nelder Mead Results
Converged: True
Intercept: Fitted (2.387998456121595)
Beta:      Fitted [ 0.68626266  0.28193456 -0.17030291  0.10799375  0.33635132 -0.24888703
  0.11992244  0.08689023]

Regresja logistyczna z karą L2

W tym przykładzie tworzymy syntetyczny zestaw danych do klasyfikacji i używamy optymalizatora L-BFGS w celu dopasowania parametrów.

np.random.seed(12345)

dim = 5  # The number of features
n_obs = 10000  # The number of observations

betas = np.random.randn(dim)  # The true beta
intercept = np.random.randn()  # The true intercept

features = np.random.randn(n_obs, dim)  # The feature matrix
probs = sp.special.expit(
    np.matmul(features, np.expand_dims(betas, -1)) + intercept)

labels = sp.stats.bernoulli.rvs(probs)  # The true labels

regularization = 0.8
feat = tf.constant(features)
lab = tf.constant(labels, dtype=feat.dtype)

@make_val_and_grad_fn
def negative_log_likelihood(params):
  """Negative log likelihood for logistic model with L2 penalty

  Args:
    params: A real tensor of shape [dim + 1]. The zeroth component
      is the intercept term and the rest of the components are the
      beta coefficients.

  Returns:
    The negative log likelihood plus the penalty term. 
  """
  intercept, beta  = params[0], params[1:]
  logit = tf.matmul(feat, tf.expand_dims(beta, -1)) + intercept
  log_likelihood = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(
      labels=lab, logits=logit))
  l2_penalty = regularization * tf.reduce_sum(beta ** 2)
  total_loss = log_likelihood + l2_penalty
  return total_loss

start = np.random.randn(dim + 1)

@tf.function
def l2_regression_with_lbfgs():
  return tfp.optimizer.lbfgs_minimize(
      negative_log_likelihood,
      initial_position=tf.constant(start),
      tolerance=1e-8)

results = run(l2_regression_with_lbfgs)
minimum = results.position
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]

print('Converged:', results.converged)
print('Intercept: Fitted ({}), Actual ({})'.format(fitted_intercept, intercept))
print('Beta:\n\tFitted {},\n\tActual {}'.format(fitted_beta, betas))

Evaluation took: 0.056751 seconds
Converged: True
Intercept: Fitted (1.4111415084244365), Actual (1.3934058329729904)
Beta:
    Fitted [-0.18016612  0.53121578 -0.56420632 -0.5336374   2.00499675],
    Actual [-0.20470766  0.47894334 -0.51943872 -0.5557303   1.96578057]

Wsparcie dozowania

Zarówno BFGS, jak i L-BFGS obsługują obliczenia wsadowe, na przykład w celu optymalizacji pojedynczej funkcji z wielu różnych punktów początkowych; lub wiele funkcji parametrycznych z jednego punktu.

Pojedyncza funkcja, wiele punktów początkowych

Funkcja Himmelblau jest standardowym przypadkiem testowym optymalizacji. Funkcję podaje:

\[f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2\]

Funkcja posiada cztery minima zlokalizowane w:

(3, 2),
(-2.805118, 3.131312),
(-3.779310, -3.283186),
(3,584428, -1,848126).

Wszystkie te minima można osiągnąć z odpowiednich punktów startowych.

# The function to minimize must take as input a tensor of shape [..., n]. In
# this n=2 is the size of the domain of the input and [...] are batching
# dimensions. The return value must be of shape [...], i.e. a batch of scalars
# with the objective value of the function evaluated at each input point.

@make_val_and_grad_fn
def himmelblau(coord):
  x, y = coord[..., 0], coord[..., 1]
  return (x * x + y - 11) ** 2 + (x + y * y - 7) ** 2

starts = tf.constant([[1, 1],
                      [-2, 2],
                      [-1, -1],
                      [1, -2]], dtype='float64')

# The stopping_condition allows to further specify when should the search stop.
# The default, tfp.optimizer.converged_all, will proceed until all points have
# either converged or failed. There is also a tfp.optimizer.converged_any to
# stop as soon as the first point converges, or all have failed.

@tf.function
def batch_multiple_starts():
  return tfp.optimizer.lbfgs_minimize(
      himmelblau, initial_position=starts,
      stopping_condition=tfp.optimizer.converged_all,
      tolerance=1e-8)

results = run(batch_multiple_starts)
print('Converged:', results.converged)
print('Minima:', results.position)

Evaluation took: 0.019095 seconds
Converged: [ True  True  True  True]
Minima: [[ 3.          2.        ]
 [-2.80511809  3.13131252]
 [-3.77931025 -3.28318599]
 [ 3.58442834 -1.84812653]]

Wiele funkcji

W celach demonstracyjnych w tym przykładzie jednocześnie optymalizujemy dużą liczbę wielowymiarowych, losowo generowanych misek kwadratowych.

np.random.seed(12345)

dim = 100
batches = 500
minimum = np.random.randn(batches, dim)
scales = np.exp(np.random.randn(batches, dim))

@make_val_and_grad_fn
def quadratic(x):
  return tf.reduce_sum(input_tensor=scales * (x - minimum)**2, axis=-1)

# Make all starting points (1, 1, ..., 1). Note not all starting points need
# to be the same.
start = tf.ones((batches, dim), dtype='float64')

@tf.function
def batch_multiple_functions():
  return tfp.optimizer.lbfgs_minimize(
      quadratic, initial_position=start,
      stopping_condition=tfp.optimizer.converged_all,
      max_iterations=100,
      tolerance=1e-8)

results = run(batch_multiple_functions)
print('All converged:', np.all(results.converged))
print('Largest error:', np.max(results.position - minimum))

Evaluation took: 1.994132 seconds
All converged: True
Largest error: 4.4131473142527966e-08