SGD Optimizer in PyTorch (torch.optim.SGD)

About torch.optim

In PyTorch, the torch.optim package contains many optimizers.

We can simply instantiate these objects to apply the corresponding optimization algorithm.

Gradient Descent Algorithm

Gradient descent is an optimization method that uses the gradient of the objective function to minimize that function.

It generally comes in the following three variants:

Batch Gradient Descent (BGD)

Batch gradient descent uses all of the provided training samples for every parameter update. If the dataset is large, training becomes very costly and slow. It also consumes a large amount of GPU memory (or RAM), so this method is generally not used.

Its update rule is:

 $\theta=\theta-\eta\cdot\frac{dJ(\theta)}{d\theta}$

Here $\theta$ is the parameter to update, $\eta$ is the learning rate, and $J(\theta)$ is the objective function.
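As a rough illustration of the update rule above, here is a minimal pure-Python sketch. The toy dataset, the squared-error objective, and the names `full_gradient` and `batch_gradient_descent` are all assumptions made for this example; they do not come from PyTorch.

```python
# Toy dataset (assumed for illustration): y = 2 * x, so the optimum is theta = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def full_gradient(theta):
    """dJ/dtheta for J(theta) = mean((theta * x - y) ** 2), using ALL samples."""
    n = len(xs)
    return sum(2.0 * (theta * x - y) * x for x, y in zip(xs, ys)) / n


def batch_gradient_descent(theta=0.0, eta=0.05, steps=100):
    """Apply theta = theta - eta * dJ/dtheta, one update per full pass."""
    for _ in range(steps):
        theta = theta - eta * full_gradient(theta)
    return theta


theta_bgd = batch_gradient_descent()
print(theta_bgd)  # should be very close to 2.0
```

Every update touches the whole dataset, which is exactly why this variant gets expensive as the dataset grows.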

Stochastic Gradient Descent (SGD)

Stochastic gradient descent randomly selects a single sample for each parameter update and computes the gradient from that sample alone.

Its update rule is:

 $\theta=\theta-\eta\cdot\frac{dJ(\theta,(x^{i},y^{i}))}{d\theta}$

Here $(x^{i},y^{i})$ is a single sample drawn from the dataset; the update does not use all of the samples.
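The single-sample update can be sketched in the same way. This is only a sketch under assumptions: the toy data, the squared-error objective, the fixed random seed, and the function names are invented for illustration.

```python
import random

# Toy dataset (assumed for illustration): y = 2 * x, so the optimum is theta = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def sample_gradient(theta, x, y):
    """Gradient of (theta * x - y) ** 2 for ONE sample (x, y)."""
    return 2.0 * (theta * x - y) * x


def stochastic_gradient_descent(theta=0.0, eta=0.01, steps=2000, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    for _ in range(steps):
        i = rng.randrange(len(xs))  # pick one sample at random
        theta = theta - eta * sample_gradient(theta, xs[i], ys[i])
    return theta


theta_sgd = stochastic_gradient_descent()
print(theta_sgd)  # should be very close to 2.0
```

Each step is cheap because it looks at one sample only, at the price of many more steps.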

Mini-batch Gradient Descent (MBGD)

Mini-batch gradient descent computes the gradient over all of the data in one mini-batch for each parameter update.

Its update rule is:

 $\theta=\theta-\eta\cdot\frac{dJ(\theta,(x^{i:i+b},y^{i:i+b}))}{d\theta}$

Here $(x^{i:i+b},y^{i:i+b})$ is one batch of $b$ samples from the dataset.
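A mini-batch version might look like the following sketch. Again, the toy data, the objective, the batch size `b=2`, and the function names are assumptions chosen for illustration only.

```python
import random

# Toy dataset (assumed for illustration): y = 2 * x, so the optimum is theta = 2.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]


def batch_gradient(theta, batch):
    """Mean gradient of (theta * x - y) ** 2 over the samples in one batch."""
    return sum(2.0 * (theta * x - y) * x for x, y in batch) / len(batch)


def minibatch_gradient_descent(theta=0.0, eta=0.01, b=2, epochs=200, seed=0):
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)  # new sample order each epoch
        for i in range(0, len(data), b):
            batch = data[i:i + b]  # samples i .. i + b - 1
            theta = theta - eta * batch_gradient(theta, batch)
    return theta


theta_mbgd = minibatch_gradient_descent()
print(theta_mbgd)  # should be very close to 2.0
```

Setting `b=1` in this sketch gives the single-sample case, and setting `b=len(xs)` gives the full-batch case.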

SGD in PyTorch

PyTorch provides torch.optim.SGD, which implements gradient descent. Pay attention, though: it can perform stochastic gradient descent, but strictly speaking it is stochastic gradient descent only when the batch size is 1.

For the implementation, see the official PyTorch documentation. The algorithm logic is as follows:

[Figure: torch.optim.SGD algorithm, from the official PyTorch documentation]

By default only the learning rate lr is supplied, in which case the update is $\theta_t\leftarrow\theta_{t-1}-\gamma \cdot \nabla_\theta f_t(\theta_{t-1})$, where $\gamma$ is the learning rate. As you can see, this is not stochastic gradient descent in the strict sense: each step computes the gradient over all of the data you pass in.
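One way to check this default behavior is to compare a single `optimizer.step()` against the formula computed by hand. The snippet below is a small sketch; the two-element parameter tensor and the toy loss are assumptions made for the example.

```python
import torch
from torch import optim

# A toy parameter tensor and loss (assumed for illustration).
theta = torch.tensor([1.0, -2.0], requires_grad=True)
lr = 0.1
optimizer = optim.SGD([theta], lr=lr)  # only lr is given; other options keep their defaults

loss = (theta ** 2).sum()  # the gradient of this loss is 2 * theta
optimizer.zero_grad()
loss.backward()

# Hand-computed update: theta_{t-1} - lr * gradient
expected = theta.detach().clone() - lr * theta.grad.clone()

optimizer.step()

matches = torch.allclose(theta.detach(), expected)
print(matches)
```

Nothing in the optimizer samples data at random; it simply applies the gradient that `loss.backward()` produced from whatever batch was fed to the model.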

Take the following code as an example:

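from torch import optim
from torch.utils.data import DataLoader
from tqdm import tqdm

# Model (xxx are placeholders to fill in)
model = xxx
criterion = xxx
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Dataset
data = DataLoader(train_data, batch_size=batch_size)

for i in range(xxx):  # number of epochs
    for x, y in tqdm(data):
        optimizer.zero_grad()        # clear gradients from the previous step
        y_hat = model(x)             # forward pass on the current batch
        loss = criterion(y_hat, y)   # compute the loss
        loss.backward()              # backpropagate to get gradients
        optimizer.step()             # update the parameters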

When batch_size = 1, the SGD module processes one sample at a time, which is stochastic gradient descent.

When 1 < batch_size < the size of the entire dataset, the SGD module processes one mini-batch at a time, which is mini-batch gradient descent.

When batch_size = the size of the entire dataset, the SGD module processes the entire dataset at once, which is batch gradient descent. This is not recommended, because the computational cost is too high.
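The three cases can be summarized with a little arithmetic. The helper names below are hypothetical and written only for this illustration; the count assumes the DataLoader keeps the last, possibly smaller, batch (its default behavior).

```python
import math


def updates_per_epoch(dataset_size, batch_size):
    """Number of optimizer.step() calls in one pass over the data,
    assuming the last, possibly smaller, batch is kept."""
    return math.ceil(dataset_size / batch_size)


def regime(dataset_size, batch_size):
    """Name the gradient descent variant implied by the batch size."""
    if batch_size == 1:
        return "stochastic gradient descent"
    if batch_size >= dataset_size:
        return "batch gradient descent"
    return "mini-batch gradient descent"


for bs in (1, 32, 1000):
    print(bs, regime(1000, bs), updates_per_epoch(1000, bs))
```

With 1000 samples, a batch size of 1 gives 1000 updates per epoch, a batch size of 32 gives 32 updates, and a batch size of 1000 gives a single update.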
