flint.optim

class flint.optim.Optimizer(params=None, lr: float = 0.01, weight_decay: float = 0.0)[source]

Bases: object

Base class for all optimizers.

Parameters
  • params (iterable) – An iterable of Tensor

  • lr (float, optional, default=0.01) – Learning rate

  • weight_decay (float, optional, default=0.) – Weight decay (L2 penalty)

step()[source]
zero_grad()[source]

Set the gradients of all parameters to zero.
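
The contract this base class describes (parameters carrying data and gradients, zero_grad() clearing gradients, step() applying an update) can be sketched with plain NumPy arrays standing in for flint Tensors. The ToyParam and ToyOptimizer names below are illustrative assumptions, not part of flint, and folding the L2 penalty into the gradient is one common convention, not necessarily flint's:

    import numpy as np

    class ToyParam:
        """Stand-in for a flint Tensor: holds .data and an accumulated .grad."""
        def __init__(self, data):
            self.data = np.asarray(data, dtype=float)
            self.grad = np.zeros_like(self.data)

    class ToyOptimizer:
        """Mimics the Optimizer contract: store params, zero_grad(), step()."""
        def __init__(self, params, lr=0.01, weight_decay=0.0):
            self.params = list(params)
            self.lr = lr
            self.weight_decay = weight_decay

        def zero_grad(self):
            # Set the gradients of all parameters to zero.
            for p in self.params:
                p.grad = np.zeros_like(p.data)

        def step(self):
            # Plain gradient descent with the L2 penalty folded into the gradient.
            for p in self.params:
                g = p.grad + self.weight_decay * p.data
                p.data -= self.lr * g

    w = ToyParam([1.0, -2.0])
    opt = ToyOptimizer([w], lr=0.1)
    w.grad = np.array([0.5, -0.5])   # pretend a backward pass produced this
    opt.step()                       # w.data is now [0.95, -1.95]
    opt.zero_grad()                  # gradients are reset to zero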

class flint.optim.SGD(params=None, lr: float = 0.01, momentum: float = 0.0, nesterov: bool = False, weight_decay: float = 0.0)[source]

Bases: flint.optim.optimizer.Optimizer

Implementation of Stochastic Gradient Descent (optionally with momentum).

\[v_{t+1} = \mu \cdot v_t + g_{t+1} \]
\[\theta_{t+1} = \theta_t - \text{lr} \cdot v_{t+1} \]

where \(\theta\), \(g\), \(v\) and \(\mu\) denote the parameters, gradient, velocity, and momentum factor, respectively.

Parameters
  • params (iterable) – An iterable of Tensor

  • lr (float, optional, default=0.01) – Learning rate

  • momentum (float, optional, default=0.) – Momentum factor

  • nesterov (bool, optional, default=False) – Whether to enable Nesterov momentum

  • weight_decay (float, optional, default=0) – Weight decay (L2 penalty)

step()[source]
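
The update rule above translates into a few lines of NumPy. This is a sketch of the formulas, not flint's code; the Nesterov branch follows the common convention of applying the momentum-adjusted gradient, and folding the L2 penalty into the gradient is an assumption:

    import numpy as np

    def sgd_step(theta, grad, velocity, lr=0.01, momentum=0.0,
                 nesterov=False, weight_decay=0.0):
        # One SGD update per the formulas above (NumPy sketch).
        g = grad + weight_decay * theta      # L2 penalty folded into the gradient
        velocity = momentum * velocity + g   # v_{t+1} = mu * v_t + g_{t+1}
        update = (g + momentum * velocity) if nesterov else velocity
        theta = theta - lr * update          # theta_{t+1} = theta_t - lr * v_{t+1}
        return theta, velocity

    theta = np.array([1.0, -2.0])
    v = np.zeros_like(theta)
    theta, v = sgd_step(theta, np.array([0.5, -0.5]), v, lr=0.1, momentum=0.9)
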
class flint.optim.Adadelta(params=None, rho: float = 0.99, eps: float = 1e-06, lr: float = 1.0, weight_decay: float = 0.0)[source]

Bases: flint.optim.optimizer.Optimizer

Implementation of Adadelta algorithm proposed in [1].

\[h_t = \rho h_{t-1} + (1 - \rho) g_t^2 \]
\[g'_t = \sqrt{\frac{\Delta \theta_{t-1} + \epsilon}{h_t + \epsilon}} \cdot g_t \]
\[\Delta \theta_t = \rho \Delta \theta_{t-1} + (1 - \rho) (g'_t)^2 \]
\[\theta_t = \theta_{t-1} - g'_t \]

where \(h\) is the moving average of the squared gradients, \(\Delta \theta\) is the moving average of the squared parameter updates, and \(\epsilon\) is a term added for numerical stability.

Parameters
  • params (iterable) – An iterable of Tensor

  • rho (float, optional, default=0.99) – Coefficient used for computing a running average of squared gradients

  • eps (float, optional, default=1e-6) – Term added to the denominator to improve numerical stability

  • lr (float, optional, default=1.0) – Coefficient that scales the delta before it is applied to the parameters

  • weight_decay (float, optional, default=0) – Weight decay (L2 penalty)

References

  1. “ADADELTA: An Adaptive Learning Rate Method.” Matthew D. Zeiler. arXiv 2012.

step()[source]
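
A NumPy sketch of the four formulas above (illustrative only, not flint's implementation; the lr scaling and the weight-decay handling are assumptions based on the parameter descriptions):

    import numpy as np

    def adadelta_step(theta, grad, h, delta, rho=0.99, eps=1e-6,
                      lr=1.0, weight_decay=0.0):
        # One Adadelta update per the formulas above (NumPy sketch).
        g = grad + weight_decay * theta
        h = rho * h + (1 - rho) * g**2                  # running avg of squared gradients
        g_adj = np.sqrt((delta + eps) / (h + eps)) * g  # rescaled gradient g'_t
        delta = rho * delta + (1 - rho) * g_adj**2      # running avg of squared updates
        theta = theta - lr * g_adj                      # lr scales the delta (default 1.0)
        return theta, h, delta

    theta = np.array([1.0, -2.0])
    h = np.zeros_like(theta)
    delta = np.zeros_like(theta)
    theta, h, delta = adadelta_step(theta, np.array([0.5, -0.5]), h, delta)
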
class flint.optim.Adagrad(params=None, lr: float = 0.01, eps: float = 1e-10, weight_decay: float = 0.0)[source]

Bases: flint.optim.optimizer.Optimizer

Implementation of Adagrad algorithm proposed in [1].

\[h_t = h_{t-1} + g_t^2 \]
\[\theta_{t+1} = \theta_t - \frac{\text{lr}}{\sqrt{h_t + \epsilon}} \cdot g_t \]
Parameters
  • params (iterable) – An iterable of Tensor

  • lr (float, optional, default=0.01) – Learning rate

  • eps (float, optional, default=1e-10) – Term added to the denominator to improve numerical stability

  • weight_decay (float, optional, default=0) – Weight decay (L2 penalty)

References

  1. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” John Duchi et al. JMLR 2011.

step()[source]
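
The two formulas above translate directly into NumPy (illustrative sketch only; the weight-decay handling is an assumption):

    import numpy as np

    def adagrad_step(theta, grad, h, lr=0.01, eps=1e-10, weight_decay=0.0):
        # One Adagrad update per the formulas above (NumPy sketch).
        g = grad + weight_decay * theta
        h = h + g**2                               # h_t = h_{t-1} + g_t^2
        theta = theta - lr / np.sqrt(h + eps) * g  # adaptive per-coordinate step
        return theta, h

    theta = np.array([1.0, -2.0])
    h = np.zeros_like(theta)
    theta, h = adagrad_step(theta, np.array([0.5, -0.5]), h)
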
class flint.optim.Adam(params=None, lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0)[source]

Bases: flint.optim.optimizer.Optimizer

Implementation of Adam algorithm proposed in [1].

\[v_t = \beta_1 v_{t-1} + (1 - \beta_1) g_t \]
\[h_t = \beta_2 h_{t-1} + (1 - \beta_2) g_t^2 \]

Bias correction:

\[\hat{v}_t = \frac{v_t}{1 - \beta_1^t} \]
\[\hat{h}_t = \frac{h_t}{1 - \beta_2^t} \]

Update parameters:

\[\theta_t = \theta_{t-1} - \text{lr} \cdot \frac{\hat{v}_t}{\sqrt{\hat{h}_t + \epsilon}} \]
Parameters
  • params (iterable) – An iterable of Tensor

  • lr (float, optional, default=1e-3) – Learning rate

  • betas (Tuple[float, float], optional, default=(0.9, 0.999)) – Coefficients used for computing running averages of gradient and its square

  • eps (float, optional, default=1e-8) – Term added to the denominator to improve numerical stability

  • weight_decay (float, optional, default=0) – Weight decay (L2 penalty)

References

  1. “Adam: A Method for Stochastic Optimization.” Diederik P. Kingma and Jimmy Ba. ICLR 2015.

step()[source]
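
A NumPy sketch of the Adam update, including the bias correction (illustrative only, not flint's implementation; the weight-decay handling is an assumption and the step counter t starts at 1):

    import numpy as np

    def adam_step(theta, grad, v, h, t, lr=1e-3, betas=(0.9, 0.999),
                  eps=1e-8, weight_decay=0.0):
        # One Adam update per the formulas above (NumPy sketch); t is the step count.
        beta1, beta2 = betas
        g = grad + weight_decay * theta
        v = beta1 * v + (1 - beta1) * g      # first moment (running avg of gradients)
        h = beta2 * h + (1 - beta2) * g**2   # second moment (running avg of squared gradients)
        v_hat = v / (1 - beta1**t)           # bias correction
        h_hat = h / (1 - beta2**t)
        theta = theta - lr * v_hat / np.sqrt(h_hat + eps)
        return theta, v, h

    theta = np.array([1.0, -2.0])
    v = np.zeros_like(theta)
    h = np.zeros_like(theta)
    theta, v, h = adam_step(theta, np.array([0.5, -0.5]), v, h, t=1)
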
class flint.optim.RMSprop(params=None, lr: float = 0.01, alpha: float = 0.99, eps: float = 1e-08, weight_decay: float = 0.0)[source]

Bases: flint.optim.optimizer.Optimizer

Implementation of RMSprop algorithm proposed in [1].

\[h_t = \alpha h_{t-1} + (1 - \alpha) g_t^2 \]
\[\theta_{t+1} = \theta_t - \frac{\text{lr}}{\sqrt{h_t + \epsilon}} \cdot g_t \]
Parameters
  • params (iterable) – An iterable of Tensor

  • lr (float, optional, default=0.01) – Learning rate

  • alpha (float, optional, default=0.99) – Coefficient used for computing a running average of squared gradients

  • eps (float, optional, default=1e-8) – Term added to the denominator to improve numerical stability

  • weight_decay (float, optional, default=0) – Weight decay (L2 penalty)

References

  1. “Neural Networks for Machine Learning, Lecture 6e - rmsprop: Divide the gradient by a running average of its recent magnitude.” Geoffrey Hinton.

step()[source]
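
And the RMSprop rule as a NumPy sketch (again illustrative, with the weight-decay handling as an assumption):

    import numpy as np

    def rmsprop_step(theta, grad, h, lr=0.01, alpha=0.99, eps=1e-8, weight_decay=0.0):
        # One RMSprop update per the formulas above (NumPy sketch).
        g = grad + weight_decay * theta
        h = alpha * h + (1 - alpha) * g**2         # running avg of squared gradients
        theta = theta - lr / np.sqrt(h + eps) * g  # scale the step by the running RMS
        return theta, h

    theta = np.array([1.0, -2.0])
    h = np.zeros_like(theta)
    theta, h = rmsprop_step(theta, np.array([0.5, -0.5]), h)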