
From ReLU to GELU, an overview of the activation function of neural networks

Keywords: ReLU, GELU, neural network

The importance of activation functions to neural networks goes without saying. The Heart of Machines has published related introductory articles before, such as "An Overview of Activation Functions in Deep Learning in One Article". This article also focuses on activation functions: Casper Hansen from the Technical University of Denmark introduces sigmoid, ReLU, ELU, and the newer Leaky ReLU, SELU, and GELU activation functions through formulas, charts, and code experiments, and compares their advantages and disadvantages.

When computing the activations in each layer, we apply an activation function, and only then do we know what those activation values are. Based on the activations, weights, and biases of the previous layer, we compute a value for each neuron in the next layer, but before passing that value on, we scale it with an activation function. This article introduces the different activation functions available.

Before reading this article, you can read my previous article on forward and backpropagation in neural networks, which briefly mentioned activation functions, but not what they actually do. The content of this article will build on what you already know from the previous article.

Casper Hansen

Table of contents

Overview

What is the sigmoid function?

The Gradient Problem: Backpropagation

Gradient vanishing problem

Gradient explosion problem

An extreme case of exploding gradients

Avoid exploding gradients: Gradient clipping/norm

Rectified Linear Unit (ReLU)

Dead ReLU: Advantages and Disadvantages

Exponential Linear Unit (ELU)

Leaky Rectified Linear Unit (Leaky ReLU)

Scaled Exponential Linear Unit (SELU)

SELU: A special case of normalization

Weight initialization + dropout

Gaussian Error Linear Unit (GELU)

Code: Hyperparameter Search for Deep Neural Networks

Further Reading: Books and Papers

Overview

Activation functions are a crucial part of neural networks. In this long post, I'll take a comprehensive look at six different activation functions and explain their pros and cons. I'll give the equation and the derivative of each activation function, along with plots of both. The goal of this article is to explain these equations and graphs in simple terms.

I'll cover vanishing gradients and exploding gradients; for the latter, I'll follow Nielsen's great example to explain why gradients explode.

Finally, I'll also provide some code that you can run in Jupyter Notebook yourself.

I'll run some small code experiments on the MNIST dataset to get a loss and accuracy plot for each activation function.


What is the sigmoid function?

The sigmoid function is a logistic function, which means: no matter what the input is, the output you get is between 0 and 1. That is, every neuron, node, or activation you input is scaled to a value between 0 and 1.
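Written out, using σ for the sigmoid as in the rest of this post, the logistic sigmoid is:

σ(x) = 1 / (1 + e^(-x))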

Illustration of the sigmoid function.

A function like sigmoid is often called a nonlinear function because we cannot describe it in terms of linearity. Many activation functions are nonlinear or a combination of linear and nonlinear (it is possible that part of the function is linear, but this is rare).

This is basically fine, except when the output gets very close to 0 or 1 (which does happen). Why is this a problem?

This question is related to backpropagation (see my previous post for an introduction to backpropagation). In backpropagation, we want to compute the gradient of each weight, i.e. a small update for each weight. The purpose of this is to optimize the output of activation values throughout the network, so that it can get better results at the output layer, and then optimize the cost function.

During backpropagation, we calculate how much each weight affects the cost function by computing the partial derivative of the cost function with respect to that weight. Suppose that, instead of looking at individual weights, we consider all the weights w of the last layer L, written w^L; then their derivative is:
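For reference, the chain rule behind this derivative reads (with z^L = w^L × a^(L-1) + b^L and a^L = σ(z^L), the usual notation):

∂C/∂w^L = ∂C/∂a^L × ∂a^L/∂z^L × ∂z^L/∂w^L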

Note that when taking this partial derivative, we keep the term ∂C/∂a^L as it is and only differentiate the middle term, leaving the rest unchanged. We use the prime symbol "'" to denote the derivative of a function. When computing the partial derivative of the middle term ∂a^L/∂z^L, we have:

Then the derivative of the sigmoid function is:
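In the notation above, with σ denoting the sigmoid, this derivative is:

σ'(x) = σ(x) × (1 - σ(x))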

When we feed a large x value (positive or negative) into this derivative, we get a y value that is almost 0 - that is, when the input w × a + b is very large in magnitude, the derivative is close to 0.

Derivative illustration of the sigmoid function.

When x is a large value (positive or negative), we are essentially multiplying the remainder of this partial derivative by a value that is almost 0.

If this happens for too many weights, we end up with a network that can hardly adjust its weights at all, which is a big problem. If we don't adjust the weights, the network receives only tiny updates, and the algorithm barely improves the network over time. For each weight, we compute its partial derivative and put it into a gradient vector, which we then use to update the neural network. As you can imagine, if all the values in this gradient vector are close to 0, we can't really update anything at all.

What is described here is the vanishing gradient problem. This problem makes the sigmoid function impractical in neural networks, and we should use other activation functions described later.

Gradient problem

Gradient vanishing problem

My previous post said that if we want to update a specific weight, the update rule is:
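As a reminder, the standard gradient descent update is (with η the learning rate):

w_new = w_old - η × ∂C/∂w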

But what if the partial derivative ∂C/∂w^(L) is so small that it disappears? At this point we run into the vanishing gradient problem, where many weights and biases receive only very small updates.

As we will see, if the value of this weight is 0.2, it will barely change while the vanishing gradient problem is at play. Because this weight connects the first neuron of the first layer to the first neuron of the second layer, we can write it as:

Assuming that the value of this weight is 0.2, given a learning rate (how much is not important, 0.5 is used here), the new weight is:
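As a rough reconstruction of the arithmetic (assuming the computed gradient comes out at about 0.000000044):

new weight = 0.2 - 0.5 × 0.000000044 ≈ 0.199999978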

The original value of this weight is 0.2, and now it is updated to 0.199999978. Obviously, this is problematic: the gradients are so small that they disappear, leaving the weights in the neural network barely updated. This can cause nodes in the network to be far from their optimal values. This problem can seriously hinder the learning of neural networks.

It has also been observed that different layers learn at very different rates, which makes the problem worse: the first few layers consistently learn the slowest.

From Nielsen's book Neural Networks and Deep Learning.

In this example, hidden layer 4 learns the fastest because its cost function only depends on the weight changes connected to hidden layer 4. Let's look at hidden layer 1; the cost function here depends on the weight change connecting hidden layer 1 to hidden layers 2, 3, 4. If you read my previous article on backpropagation, you probably know that earlier layers in the network reuse computations from later layers.

Meanwhile, as introduced earlier, the last layer depends only on a set of changes that occur when computing partial derivatives:

Ultimately, this is a big problem because now the layers of weights are learning at different rates. This means that layers later in the network will almost certainly be better optimized than layers earlier in the network.

And the problem is that the backpropagation algorithm doesn't know in which direction it should adjust the early weights to optimize the cost function.

Gradient explosion problem

The exploding gradient problem is essentially the opposite of the vanishing gradient problem. Research shows that it arises when the weights are in an "exploding" state, i.e. their values grow rapidly.

We will follow Nielsen's example to illustrate this:

http://neuralnetworksanddeeplearning.com/chap5.html#what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets

Note that this example can also be used to demonstrate the vanishing gradient problem; I chose it because it is conceptually easier to explain.

Essentially, when the terms being multiplied are smaller than 1 we risk vanishing gradients, and when they are larger than 1 we risk exploding gradients. However, for a layer to actually run into either problem, a sufficient number of weights must satisfy the vanishing or exploding condition.

We start with a simple network. This network has a small number of weights, biases and activations, and also has only one node per layer.

The network is simple. The weights are denoted w_j, the biases are b_j, and the cost function is C. Nodes, neurons or activations are represented as circles.

Nielsen uses a notation common in physics, Δ, to describe a change in a value (as opposed to the partial-derivative notation ∂). For example, Δb_j describes the change in the value of the jth bias.

The core of my previous post is that we want to measure the rate of change of weights and biases in relation to the cost function. Layers aside, let's look at a specific bias, the first bias b_1. Then we measure the rate of change by:

The argument for the next equation is the same as for the partial derivative above, i.e. how do we measure the rate of change of the cost function relative to the rate of change of the bias? As just introduced, Nielsen uses Δ to describe change, so we can say that this partial derivative is roughly approximated by Δ:
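In symbols, the approximation is:

∂C/∂b_1 ≈ ΔC / Δb_1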

Changes in weights and biases can be visualized as follows:

The animation is from 3blue1brown; video address: https://www.youtube.com/watch?v=tIeHLnjs5U8.

We start at the beginning of the network and calculate how a change in the first bias b_1 will affect the network. Since we know from the previous post that the first bias b_1 feeds into the first activation a_1, we'll start there. Let's first review this equation:
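For this one-neuron-per-layer chain, the relevant definitions are (following Nielsen's notation, with a_0 denoting the input to the network):

z_1 = w_1 × a_0 + b_1
a_1 = σ(z_1)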

If b_1 changes, we denote this change as Δb_1. Therefore, we notice that when b_1 changes, the activation a_1 also changes - we usually denote this as ∂a_1/∂b_1.

Therefore, we have the expression for the partial derivative on the left-hand side: the change in a_1 relative to b_1. Now we start replacing terms on the left, beginning by replacing a_1 with the sigmoid of z_1:

The above formula indicates that when b_1 changes, there is a certain change in the activation value a_1. We describe this change as Δa_1.

We treat the change Δa_1 as approximately equal to the rate of change of a_1 with respect to b_1, multiplied by the change Δb_1.
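Written out, that approximation is:

Δa_1 ≈ (∂a_1/∂b_1) × Δb_1 = σ'(z_1) × Δb_1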

Here we skip a step, but essentially we just compute the partial derivative and replace the fractional part with the result of the partial derivative.

A change in a_1 results in a change in z_2

The described change Δa_1 now causes a change in the input z_2 of the next layer. If this seems odd or you're still not convinced, I suggest you read my previous article.

The representation is the same as before, and we denote the next change as Δz_2. We have to go through the previous process again, only this time to get the changes in z_2:

We can replace Δa_1 with:
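Following the same steps as for Δa_1 (and using z_2 = w_2 × a_1 + b_2), this gives:

Δz_2 ≈ (∂z_2/∂a_1) × Δa_1 = w_2 × Δa_1 ≈ w_2 × σ'(z_1) × Δb_1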

We simply computed this formula. Hopefully this step is clear - it is the same process we used to compute Δa_1.

This process repeats until we have worked through the entire network. By substituting in the Δa_j values, we get a final expression for the change in the cost function relative to the whole network (i.e. all weights, biases and activations).

Based on this, we calculate ∂C/∂b_1 again to get the final formula we need:
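Assuming the four-weight chain shown in Nielsen's figure, that final expression is:

∂C/∂b_1 ≈ σ'(z_1) × w_2 × σ'(z_2) × w_3 × σ'(z_3) × w_4 × σ'(z_4) × ∂C/∂a_4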

An extreme case of exploding gradients

According to this, if all the weights w_j are large, i.e. if many weights have values greater than 1, we start multiplying by ever larger values. For example, suppose all the weights have some very large value, like 100, and the derivative of the sigmoid takes some random values between 0 and 0.25 (its maximum):

The last partial derivative, ∂C/∂a_4, could reasonably be much larger than 1, but we set it to 1 for the sake of the example.

Using the update rule, and assuming that b_1 was previously equal to 1.56 and the learning rate is 0.5, the bias receives a huge update.
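As a rough, purely illustrative reconstruction (taking every σ'(z_j) ≈ 0.25, every weight w_j = 100, and the final partial derivative as 1):

∂C/∂b_1 ≈ 0.25 × (100 × 0.25) × (100 × 0.25) × (100 × 0.25) × 1 ≈ 3906

new b_1 = 1.56 - 0.5 × 3906 ≈ -1951

The exact numbers are not important; the point is that a single update throws the bias to an absurdly large magnitude.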

Although this is an edge case, you know what I mean. The values of weights and biases can increase explosively, causing the entire network to explode.

Now take a moment to think about the weights and biases of the network and the rest of the activations, updating their values explosively. This is what we call the exploding gradient problem. Obviously, such a network can't learn anything, so this completely ruins the task you're trying to solve.

Avoiding exploding gradients: Gradient clipping/norm

The basic idea of solving the gradient explosion problem is to set a rule for it. I won't go into the math in depth for this part, but I'll give the steps of the process:

Pick a threshold value - if a gradient exceeds this value, apply gradient clipping or gradient norm scaling;

Decide whether to use gradient clipping or the gradient norm. With gradient clipping, you specify a threshold such as 0.5: any gradient component beyond 0.5 or -0.5 is clipped back to the threshold. With the norm variant, the whole gradient vector is rescaled so that its norm stays within the threshold. A minimal Keras sketch of both options follows this list.
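As a minimal sketch (not from the original article), Keras optimizers expose both options directly through the clipvalue and clipnorm arguments; the experiment code later in this post uses clipvalue=0.5:

from keras.optimizers import Adam

# Clip each gradient component to the range [-0.5, 0.5]
clipped_opt = Adam(clipvalue=0.5)

# Or rescale the whole gradient vector whenever its L2 norm exceeds 1.0
normed_opt = Adam(clipnorm=1.0)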

Note, however, that neither of these gradient methods avoids the vanishing gradient problem, so we will look at more ways of solving that below. In general, you will need these methods if you are using a recurrent neural network architecture (like an LSTM or GRU), which tends to suffer from exploding gradients.

Rectified Linear Unit (ReLU)

Rectified linear units are our solution to the vanishing gradient problem, but could this lead to other problems? Please look down.

The formula of ReLU is as follows:
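In formula form:

R(x) = max(0, x)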

The ReLU formula says:

If the input x is less than 0, set the output equal to 0;

If the input x is greater than 0, then let the output equal the input.

Although we can't render the piecewise formula directly with most tools, you can explain ReLU graphically this way: everything with an x value less than zero maps to a y value of 0, while everything with an x value greater than zero maps to itself. That is, if we input x = 1, we get y = 1 back.

Diagram of the ReLU activation function.

That's fine, but what does this have to do with the vanishing gradient problem? First, we need its derivative:
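In formula form (the derivative is undefined at exactly x = 0; implementations conventionally use 0 there):

R'(x) = 1 if x > 0
R'(x) = 0 if x ≤ 0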

It means:

If the input x is greater than 0, the output is equal to 1;

If the input is less than or equal to 0, the output becomes 0.

Represented by the following diagram:

Differentiated ReLU.

Now we have the answer: when using the ReLU activation function, we do not get very small values (like 0.0000000438 for the sigmoid function above). Instead, it's either 0 (causing some gradients to return nothing) or 1.

But this creates another problem: the dead ReLU problem.

What if too many values are below 0 when computing the gradient? We get quite a few weights and biases that never get updated, because their update is 0. To see how this actually plays out, let's revisit the earlier exploding gradient example, this time with ReLU.

We denote ReLU as R; compared with the earlier equation, we just need to replace each sigmoid σ with R:
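Substituting R for σ in the earlier expression (same network layout as before):

∂C/∂b_1 ≈ R'(z_1) × w_2 × R'(z_2) × w_3 × R'(z_3) × w_4 × R'(z_4) × ∂C/∂a_4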

Now, say a random input z of this differentiated ReLU is less than 0 - this function will cause the bias to "die". Suppose R'(z_3)=0:

In other words, when R'(z_3) = 0, multiplying it with the other values can only give 0, which causes this bias to die. We know that a bias's new value is the old bias minus the learning rate times the gradient, which means we get an update of 0.

Dead ReLU: Advantages and Disadvantages

When we introduce the ReLU function into the neural network, we also introduce a lot of sparsity. So what exactly does the term sparsity mean?

Sparse: small in number, usually scattered over a large area. In a neural network, this means the activation matrices contain many zeros. What do we gain from this sparsity? When a certain percentage (say 50%) of the activations are zero, we say the neural network is sparse. This improves efficiency in terms of time and space complexity - constant (zero) values require less space and are computationally cheaper.

Yoshua Bengio et al. found that this ReLU-induced sparsity actually makes neural networks perform better, on top of the time and space efficiency mentioned above.

Paper address: https://arxiv.org/pdf/1905.01338.pdf

Gaussian Error Linear Unit (GELU)

The Gaussian error linear unit activation function is used in recent Transformer models (Google's BERT and OpenAI's GPT-2). The GELU paper dates from 2016, but it has only recently gained traction.

This activation function has the form:
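The commonly used tanh approximation, which is also what the code later in this post implements, is:

GELU(x) ≈ 0.5 × x × (1 + tanh(√(2/π) × (x + 0.044715 × x^3)))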

As you can see, this is a combination of the hyperbolic tangent function tanh and some approximate numerical constants. There is not much more to say about the formula itself; what is interesting is the graph of this function:

GELU activation function.

You can see that for x greater than 0 the output tends towards x, except in the interval from roughly x = 0 to x = 1, where the curve bends slightly towards the y-axis and the output stays a little below x.

I couldn't find the derivative of this function, so I used WolframAlpha to differentiate the function. The result is as follows:

As before, this is another combination of hyperbolic functions. But its graph looks interesting:

Differentiated GELU activation function.

Advantages:

Seems to be the current best choice in NLP, especially in Transformer models;

Avoids the vanishing gradient problem.

Disadvantages:

Although it was proposed in 2016, it is still a fairly new activation function in practical applications.

Code for Deep Neural Networks

Say you want to try all of these activation functions to see which one works best - how would you do it? Usually we perform hyperparameter optimization, which can be done with scikit-learn's GridSearchCV. But here we want to compare, so the idea is to pick some hyperparameters, keep them constant, and change only the activation function.

Here is what I'm going to do:


Train the same neural network model using the activation function mentioned in this article;

Using the history of each activation function, plot loss and accuracy versus epoch.

The code is also published on GitHub and supports colab so you can get it up and running quickly. Address: https://github.com/casperbh96/Activation-Functions-Search

I prefer to use Keras' high-level API, so this will be done with Keras.

First import everything we need. Note that 4 libraries are used here: tensorflow, numpy, matplotlib, keras.

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

from keras.datasets import mnist
from keras.utils.np_utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D, Activation, LeakyReLU
from keras.layers.noise import AlphaDropout
from keras.utils.generic_utils import get_custom_objects
from keras import backend as K
from keras.optimizers import Adam

Now load the dataset we need to run our experiments; here the MNIST dataset was chosen. We can import it directly from Keras.

(x_train, y_train), (x_test, y_test) = mnist.load_data()

Great, but we want to do some preprocessing on the data, like normalization. This takes a few steps, mainly reshaping the images (.reshape) and dividing by the maximum pixel value of 255 (/= 255). Finally, we one-hot encode the labels via to_categorical().

def preprocess_mnist(x_train, y_train, x_test, y_test):
    # Reshape all images to 28x28x1 pixels
    x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
    x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
    input_shape = (28, 28, 1)

    # Float values for division
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')

    # Normalize the pixel values by dividing by the max value (255)
    x_train /= 255
    x_test /= 255

    # One-hot encode the labels
    y_train = to_categorical(y_train)
    y_test = to_categorical(y_test)

    return x_train, y_train, x_test, y_test, input_shape

x_train, y_train, x_test, y_test, input_shape = preprocess_mnist(x_train, y_train, x_test, y_test)

Now that the data preprocessing is done, we can build the model and define the parameters Keras needs to run it. First, the convolutional neural network model itself. The SELU activation function is a special case: it needs the 'lecun_normal' kernel initializer and the special dropout variant AlphaDropout(); everything else stays as normal.

def build_cnn(activation,
              dropout_rate,
              optimizer):
    model = Sequential()

    if activation == 'selu':
        model.add(Conv2D(32, kernel_size=(3, 3),
                         activation=activation,
                         input_shape=input_shape,
                         kernel_initializer='lecun_normal'))
        model.add(Conv2D(64, (3, 3), activation=activation,
                         kernel_initializer='lecun_normal'))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(AlphaDropout(0.25))
        model.add(Flatten())
        model.add(Dense(128, activation=activation,
                        kernel_initializer='lecun_normal'))
        model.add(AlphaDropout(0.5))
        model.add(Dense(10, activation='softmax'))
    else:
        model.add(Conv2D(32, kernel_size=(3, 3),
                         activation=activation,
                         input_shape=input_shape))
        model.add(Conv2D(64, (3, 3), activation=activation))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(Dropout(0.25))
        model.add(Flatten())
        model.add(Dense(128, activation=activation))
        model.add(Dropout(0.5))
        model.add(Dense(10, activation='softmax'))

    model.compile(
        # Categorical cross-entropy for the 10-class one-hot labels
        loss='categorical_crossentropy',
        optimizer=optimizer,
        metrics=['accuracy'])

    return model

There is a small problem with using the GELU function; this function is not currently available in Keras. Fortunately, we can easily add new activation functions to Keras.

# Add the GELU function to Keras
def gelu(x):
    return 0.5 * x * (1 + tf.tanh(tf.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3))))

get_custom_objects().update({'gelu': Activation(gelu)})

# Add leaky-relu so we can use it as a string
get_custom_objects().update({'leaky-relu': Activation(LeakyReLU(alpha=0.2))})

act_func = ['sigmoid', 'relu', 'elu', 'leaky-relu', 'selu', 'gelu']

Now we can train the model using the different activation functions defined in the act_func array. We'll run a simple for loop over each activation function and add the results to an array:

result = []

for activation in act_func:
    print('\nTraining with -->{0}<-- activation function\n'.format(activation))

    model = build_cnn(activation=activation,
                      dropout_rate=0.2,
                      optimizer=Adam(clipvalue=0.5))

    # Note: since validation_data is given, Keras ignores validation_split
    history = model.fit(x_train, y_train,
                        validation_split=0.20,
                        batch_size=128,  # 128 is faster, but less accurate. 16/32 recommended
                        epochs=100,
                        verbose=1,
                        validation_data=(x_test, y_test))

    result.append(history)

    K.clear_session()
    del model

print(result)

Based on this, we can plot the history from model.fit() for each activation function and see how the loss and accuracy results change.

Now that we can plot the data, I wrote a small piece of code in matplotlib:

new_act_arr = act_func[1:]
new_results = result[1:]

def plot_act_func_results(results, activation_functions=[]):
    plt.figure(figsize=(10, 10))
    plt.style.use('dark_background')

    # Plot validation accuracy values
    for act_func in results:
        plt.plot(act_func.history['val_acc'])

    plt.title('Model accuracy')
    plt.ylabel('Test Accuracy')
    plt.xlabel('Epoch')
    plt.legend(activation_functions)
    plt.show()

    # Plot validation loss values
    plt.figure(figsize=(10, 10))
    for act_func in results:
        plt.plot(act_func.history['val_loss'])

    plt.title('Model loss')
    plt.ylabel('Test Loss')
    plt.xlabel('Epoch')
    plt.legend(activation_functions)
    plt.show()

plot_act_func_results(new_results, new_act_arr)

This results in a graph like this: