From ReLU to GELU, an overview of the activation function of neural networks
Keywords: ReLU, GELU, neural networks. The importance of activation functions to neural networks goes without saying. Machine Heart has published related introductory articles before, such as "An Overview of Activation Functions in Deep Learning in One Article". This article also focuses on activation functions: Casper Hansen from the Technical University of Denmark introduces sigmoid, ReLU, ELU, and the newer Leaky ReLU, SELU, and GELU through formulas, charts, and code experiments, and compares their advantages and disadvantages.
We use an activation function when computing the activation values of each layer. Based on the activations, weights, and biases of the previous layer, we compute a value for each activation in the next layer; but before sending that value on to the next layer, we scale it with an activation function. This article introduces the different activation functions available.
Before reading this article, you can read my previous article on forward and backpropagation in neural networks, which briefly mentioned activation functions but not what they actually do. This article builds on what you already know from that one.
Casper Hansen
Table of contents
Overview
What is the sigmoid function?
The Gradient Problem: Backpropagation
The vanishing gradient problem
The exploding gradient problem
An extreme case of exploding gradients
Avoid exploding gradients: Gradient clipping/norm
Rectified Linear Unit (ReLU)
Dead ReLU: Advantages and Disadvantages
Exponential Linear Unit (ELU)
Leaky Rectified Linear Unit (Leaky ReLU)
Scaled Exponential Linear Unit (SELU)
SELU: A special case of normalization
Weight initialization + dropout
Gaussian Error Linear Unit (GELU)
Code: Hyperparameter Search for Deep Neural Networks
Further Reading: Books and Papers
Overview
Activation functions are a crucial part of neural networks. In this long post, I'll take a comprehensive look at six different activation functions and explain their pros and cons. I'll give each activation function's equation and derivative, along with a plot of each. The goal of this article is to explain these equations and graphs in simple terms.
I'll cover vanishing gradients and exploding gradients; for the latter, I'll follow Nielsen's great example to explain why gradients explode.
Finally, I'll also provide some code that you can run in Jupyter Notebook yourself.
I'll run some small code experiments on the MNIST dataset to get a loss and accuracy plot for each activation function.
What is the sigmoid function?
The sigmoid function is a logistic function: no matter what the input is, the output lies between 0 and 1. That is, every neuron, node, or activation you feed in is scaled to a value between 0 and 1.
Illustration of the sigmoid function.
A function like sigmoid is usually called a nonlinear function, because it cannot be described by a linear expression. Many activation functions are nonlinear, or a combination of linear and nonlinear pieces (it is possible for part of a function to be linear, but this is rare).
This is basically fine, except when the output gets very close to 0 or 1 (which does happen in practice). Why is this a problem?
This question is related to backpropagation (see my previous post for an introduction to backpropagation). In backpropagation, we want to compute the gradient of each weight, i.e. a small update for each weight. The purpose of this is to optimize the output of activation values throughout the network, so that it can get better results at the output layer, and then optimize the cost function.
During backpropagation, we have to calculate how much each weight affects the cost function, by computing the partial derivative of the cost function with respect to each weight. Suppose that instead of working with individual weights, we define all the weights w in the last layer L as w^L; then their derivative is:
Note that when taking the partial derivative, we write out the equation for ∂a^L and then differentiate only ∂z^L, leaving the rest the same. We use the apostrophe "'" to denote the derivative of any function. When computing the partial derivative of the intermediate term ∂a^L/∂z^L, we have:
The derivative of the sigmoid function itself works out to σ'(x) = σ(x)(1 - σ(x)).
When we input a large x value (positive or negative) into this derivative, we get a y value that is almost 0; that is, when we input w × a + b, the derivative may be close to 0.
Derivative illustration of the sigmoid function.
When x is a large value (positive or negative), we are essentially multiplying the remainder of this partial derivative by a value that is almost 0.
If there are too many weights with such large values, then we can't get a network that can adjust its weights at all, which is a big problem. If we don't adjust these weights, the network receives only tiny updates, and the algorithm doesn't improve the network much over time. For each weight, we compute its partial derivative and put it into a gradient vector, and we use this gradient vector to update the neural network. As you can imagine, if all the values in this gradient vector are close to 0, we can't really update anything at all.
What is described here is the vanishing gradient problem. This problem makes the sigmoid function impractical in neural networks, and we should use other activation functions described later.
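To make this concrete, here is a minimal numpy sketch (my own illustration, not code from the article's experiments) of the sigmoid and its derivative, showing how the derivative collapses toward 0 for large inputs:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-20.0, -5.0, 0.0, 5.0, 20.0]:
    print(x, sigmoid(x), sigmoid_derivative(x))
# At x = 0 the derivative is 0.25 (its maximum); at x = +/-20 it is about 2e-9 -- effectively zero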
Gradient problem
The vanishing gradient problem
My previous post said that if we want to update a specific weight, the update rule is:
But what if the partial derivative ∂C/∂w^(L) is so small that it disappears? At this point we run into the vanishing gradient problem, where many weights and biases receive only very small updates.
It can be seen that if the weight has a value of 0.2, that value will barely change when the vanishing gradient problem occurs. Because this weight connects the first neuron of the first layer to the first neuron of the second layer, we can write it as:
Assuming that the value of this weight is 0.2, and given a learning rate (its exact value doesn't matter; 0.5 is used here), the new weight is:
The original value of this weight is 0.2, and now it is updated to 0.199999978. Obviously, this is problematic: the gradients are so small that they disappear, leaving the weights in the neural network barely updated. This can cause nodes in the network to be far from their optimal values. This problem can seriously hinder the learning of neural networks.
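As a quick check of that arithmetic (assuming, purely for illustration, a gradient of about 4.38e-8, the same order of magnitude quoted later for a saturated sigmoid):
weight = 0.2
learning_rate = 0.5
gradient = 4.38e-8  # assumed tiny gradient from a saturated sigmoid

new_weight = weight - learning_rate * gradient
print(new_weight)   # 0.199999978... -- essentially unchanged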
It has been observed that different layers learn at different speeds, and this makes the problem worse: the first few layers are always the worst off, receiving the smallest updates.
From Nielsen's book Neural Networks and Deep Learning.
In this example, hidden layer 4 learns the fastest because its cost function only depends on the weight changes connected to hidden layer 4. Let's look at hidden layer 1; the cost function here depends on the weight change connecting hidden layer 1 to hidden layers 2, 3, 4. If you read my previous article on backpropagation, you probably know that earlier layers in the network reuse computations from later layers.
Meanwhile, as introduced earlier, the last layer depends only on a set of changes that occur when computing partial derivatives:
Ultimately, this is a big problem, because now the weight layers are learning at different rates. This means that layers later in the network will almost certainly be better optimized than layers earlier in the network.
And the problem is that the backpropagation algorithm doesn't know in which direction the weights should be passed to optimize the cost function.
Gradient explosion problem
The exploding gradient problem is essentially the opposite of the vanishing gradient problem. Research shows that such a problem is possible when the weights are in a state of "exploding", i.e. their value grows rapidly.
We will use the following example (from chapter 5 of Nielsen's book) to illustrate this:
http://neuralnetworksanddeeplearning.com/chap5.html#what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets
Note that this example can also be used to demonstrate the vanishing gradient problem, and I chose it from a more conceptual perspective for easier explanation.
Essentially, when the terms being multiplied together are smaller than 1, the gradient shrinks layer by layer; when they are larger than 1, it grows.
We start with a simple network. This network has a small number of weights, biases and activations, and also has only one node per layer.
The network is simple. The weights are denoted w_j, the biases are b_j, and the cost function is C. Nodes, neurons or activations are represented as circles.
Nielsen used a common representation in physics, Δ, to describe a change in a value (this is different from the partial-derivative notation ∂). For example, Δb_j describes the change in the value of the jth bias.
The core of my previous post is that we want to measure the rate of change of weights and biases in relation to the cost function. Layers aside, let's look at a specific bias, the first bias b_1. Then we measure the rate of change by:
The argument for the following equation is the same as for the partial derivative above: how do we measure the rate of change of the cost function via the rate of change of the bias? As just introduced, Nielsen uses Δ to describe change, so we can say that this partial derivative can roughly be replaced by Δ:
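Concretely, the approximation being used is (restated in this article's notation):
∂C/∂b_1 ≈ ΔC / Δb_1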
Changes in weights and biases can be visualized as follows:
The animation is from 3Blue1Brown; the video address: https://www.youtube.com/watch?v=tIeHLnjs5U8.
We start with the starting point of the network and calculate how a change in the first bias b_1 will affect the network. Since we know from the previous post that the first bias b_1 feeds the first activation a_1, we'll start here. Let's review this equation first:
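In Nielsen's notation (a minimal restatement, where a_0 denotes the input to the network), that equation is:
a_1 = σ(z_1) = σ(w_1 a_0 + b_1)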
If b_1 changes, we denote this change as Δb_1. Therefore, we notice that when b_1 changes, the activation a_1 also changes - we usually denote this as ∂a_1/∂b_1.
Therefore, we have the expression for the partial derivative on the left, which is the change in a_1 relative to b_1. But we start replacing the left-hand term, first replacing a_1 with the sigmoid of z_1:
The above formula indicates that when b_1 changes, there is a certain change in the activation value a_1. We describe this change as Δa_1.
We treat the change Δa_1 as approximately equal to the derivative of the activation a_1 with respect to b_1, multiplied by the change Δb_1.
Here we skip a step, but essentially we just compute the partial derivative and replace the fractional part with the result of the partial derivative.
A change in a_1 results in a change in z_2
The described change Δa_1 now causes a change in the input z_2 of the next layer. If this seems odd or you're still not convinced, I suggest you read my previous article.
The representation is the same as before, and we denote the next change as Δz_2. We have to go through the previous process again, only this time to get the changes in z_2:
We can replace Δa_1 with:
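Written out (a sketch following Nielsen's derivation, using z_2 = w_2 a_1 + b_2), the substitution gives:
Δz_2 ≈ (∂z_2/∂a_1) Δa_1 = w_2 Δa_1 ≈ w_2 σ'(z_1) Δb_1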
I hope this step is clear - it is the same process as the one used to compute Δa_1.
This process repeats until we have computed through the entire network. By substituting the Δa_j values, we get a final function that computes the change in the cost function relative to the entire network (i.e. all weights, biases, and activations).
Based on this, we calculate ∂C/∂b_1 again to get the final formula we need:
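For this small four-layer network, the resulting expression (a sketch of the formula Nielsen arrives at in chapter 5) has the form:
∂C/∂b_1 ≈ σ'(z_1) × w_2 × σ'(z_2) × w_3 × σ'(z_3) × w_4 × σ'(z_4) × ∂C/∂a_4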
An extreme case of exploding gradients
According to this, if all weights w_j are large, i.e. if many weights have a value greater than 1, we start multiplying by larger and larger values. For example, suppose all the weights have some very high value, like 100, and the derivative of the sigmoid function takes some random value between 0 and 0.25:
The last partial derivative, ∂C/∂a_4, could reasonably be much larger than 1, but it is set to 1 for the sake of the example.
Using this update rule, assume that b_1 was previously equal to 1.56 and that the learning rate equals 0.5:
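To put rough numbers on this edge case (an illustrative sketch only: every weight set to 100, every sigmoid derivative at its maximum of 0.25, and the final partial derivative set to 1, as described above):
weights = [100.0, 100.0, 100.0]                 # w_2, w_3, w_4, all assumed huge
sigmoid_derivatives = [0.25, 0.25, 0.25, 0.25]  # sigma'(z_1) ... sigma'(z_4) at their maximum
dC_da4 = 1.0                                    # last partial derivative, set to 1 for the example

gradient = dC_da4
for w in weights:
    gradient *= w
for d in sigmoid_derivatives:
    gradient *= d
print(gradient)                        # 3906.25 -- far larger than 1

b_1, learning_rate = 1.56, 0.5
print(b_1 - learning_rate * gradient)  # about -1951.5: the bias explodes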
Although this is an edge case, you know what I mean. The values of weights and biases can increase explosively, causing the entire network to explode.
Now take a moment to imagine the rest of the network's weights, biases, and activations having their values updated this explosively. This is what we call the exploding gradient problem. Obviously, such a network can't learn anything, so this completely ruins the task you're trying to solve.
Avoiding exploding gradients: Gradient clipping/norm
The basic idea for solving the exploding gradient problem is to set a rule that limits it. I won't go deep into the math for this part, but I'll give the steps of the process (a Keras sketch of both options follows this list):
Pick a threshold value - if a gradient exceeds this value, apply gradient clipping or gradient norm scaling;
Decide whether to use gradient clipping or gradient norm scaling. With clipping, you specify a threshold such as 0.5: any gradient value beyond 0.5 or -0.5 is clipped back to the threshold. With norm scaling, the whole gradient vector is rescaled so that its norm falls within the threshold.
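Keras, which is used for the code experiments later in this article, exposes both options directly on the optimizer; a minimal sketch:
from keras.optimizers import Adam

# Gradient clipping: each gradient component is clipped to the range [-0.5, 0.5]
optimizer_clip = Adam(clipvalue=0.5)

# Gradient norm scaling: the gradient vector is rescaled whenever its L2 norm exceeds 1.0
optimizer_norm = Adam(clipnorm=1.0)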
Note, however, that none of these gradient methods can avoid the vanishing gradient problem. So we will further explore more ways to solve this problem. In general, you need these methods if you are using a recurrent neural network architecture (like LSTM or GRU), which tends to have exploding gradients.
Rectified Linear Unit (ReLU)
Rectified linear units are our solution to the vanishing gradient problem, but could this lead to other problems? Please look down.
The formula of ReLU is f(x) = max(0, x), which says:
If the input x is less than 0, set the output equal to 0;
If the input x is greater than 0, then let the output equal the input.
Although most graphing tools won't plot it as a single expression, you can describe ReLU graphically this way: everything with an x value less than zero maps to a y value of 0, and everything with an x value greater than zero maps to itself. That is, if we input x = 1, we get y = 1 back.
Diagram of the ReLU activation function.
That's fine, but what does this have to do with the vanishing gradient problem? First, we have to get its derivative:
It means:
If the input x is greater than 0, the output is equal to 1;
If the input is less than or equal to 0, the output becomes 0.
Represented by the following diagram:
Differentiated ReLU.
Now we have the answer: when using the ReLU activation function, we never get very small derivative values (like the 0.0000000438 we saw for the sigmoid function above). Instead, the derivative is either 0 (so some gradients pass nothing back) or 1.
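A minimal numpy sketch (my own illustration) of ReLU and its derivative shows this directly:
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    # 1 where the input is positive, 0 otherwise
    return (x > 0).astype(float)

z = np.array([-2.0, -0.1, 0.5, 3.0])
print(relu(z))             # [0.  0.  0.5 3. ]
print(relu_derivative(z))  # [0. 0. 1. 1.] -- never a tiny in-between value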
But this creates another problem: the dead ReLU problem.
What if too many values are below 0 when computing the gradient? We end up with quite a few weights and biases that never update, because their update is 0. To see how this actually plays out, let's revisit the earlier exploding gradient example, this time in reverse.
We denote ReLU as R in this equation, we just need to replace each sigmoid σ with R:
Now, say a random input z of this differentiated ReLU is less than 0 - this function will cause the bias to "die". Suppose R'(z_3)=0:
In turn, when we get R'(z_3) = 0, multiplying it with the other values can only give 0, which causes this bias to die. We know that the new value of a bias is the old bias minus the learning rate times the gradient, which means we get an update of 0.
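The arithmetic is easy to sketch (the factor values below are hypothetical, chosen only to show that a single 0 kills the whole product):
factors = [0.25, 100.0, 0.0, 100.0, 0.25]  # a chain of derivatives and weights with R'(z_3) = 0
gradient = 1.0
for factor in factors:
    gradient *= factor
print(gradient)                         # 0.0

bias, learning_rate = 1.56, 0.5
print(bias - learning_rate * gradient)  # 1.56 -- the bias does not move at all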
Dead ReLU: Advantages and Disadvantages
When we introduce the ReLU function into the neural network, we also introduce a lot of sparsity. So what exactly does the term sparsity mean?
Sparse: small in number, usually scattered over a large area. In a neural network, this means that the activation matrix contains many 0s. What does this sparsity buy us? When a certain percentage (say 50%) of the activations are 0, we say that the neural network is sparse. This improves efficiency in both time and space complexity: constant (zero) values require less space and are cheaper to compute with.
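For intuition, here is a tiny numpy sketch (random numbers, purely illustrative) of how ReLU produces a sparse activation matrix:
import numpy as np

np.random.seed(0)
pre_activations = np.random.randn(4, 6)       # random pre-activations, roughly half negative
activations = np.maximum(0, pre_activations)  # ReLU zeroes out the negative half

print(activations)
print('Fraction of zeros:', np.mean(activations == 0))  # roughly 0.5 -- a sparse matrix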
Yoshua Bengio et al. found that this ReLU property actually makes neural networks perform better, on top of the time and space efficiency mentioned above.
Paper address: https://arxiv.org/pdf/1905.01338.pdf
Gaussian Error Linear Unit (GELU)
The Gaussian error linear unit activation function is used in recent Transformer models such as Google's BERT and OpenAI's GPT-2. The GELU paper dates from 2016, but it has only recently gained traction.
This activation function has the form GELU(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x^3))). It can be seen that this is a combination of the hyperbolic tangent function tanh and a polynomial approximation. There is not much more to say about the formula itself; what is interesting is the graph of the function:
GELU activation function.
It can be seen that when x is greater than 0, the output tends to x, except in the interval from roughly x = 0 to x = 1, where the curve leans more towards the y-axis.
I couldn't find the derivative of this function, so I used WolframAlpha to differentiate the function. The result is as follows:
As before, this is another combination of hyperbolic functions. But its graph looks interesting:
Differentiated GELU activation function.
Advantages:
Seems to be the current state of the art in NLP, especially in Transformer models;
It can avoid the vanishing gradient problem.
Disadvantages:
Although proposed in 2016, it is a rather novel activation function in practical applications.
Code: Hyperparameter Search for Deep Neural Networks
Say you want to try all of these activation functions to see which one works best - how would you do it? Usually we would run a hyperparameter optimization, for example with scikit-learn's GridSearchCV. But here we want to compare, so the idea is to pick some hyperparameters, keep them constant, and change only the activation function.
Explain what I'm trying to do here:
Train the same neural network model using the activation function mentioned in this article;
Using the history of each activation function, plot loss and accuracy versus epoch.
The code is also published on GitHub and supports Colab, so you can get it up and running quickly. Address: https://github.com/casperbh96/Activation-Functions-Search
I prefer to use Keras' high-level API, so this will be done with Keras.
First import everything we need. Note that 4 libraries are used here: tensorflow, numpy, matplotlib, keras.
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
from keras.utils.np_utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D, Activation, LeakyReLU
from keras.layers.noise import AlphaDropout
from keras.utils.generic_utils import get_custom_objects
from keras import backend as K
from keras.optimizers import Adam
Now load the dataset we need to run our experiments; here the MNIST dataset was chosen. We can import it directly from Keras.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
Great, but we want to do some preprocessing on the data, such as normalization. We use a few functions to do this, mainly reshaping the images (.reshape) and dividing by the maximum pixel value of 255 (/= 255). Finally, we one-hot encode the labels via to_categorical().
def preprocess_mnist(x_train, y_train, x_test, y_test):
    # Normalizing all images of 28x28 pixels
    x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
    x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
    input_shape = (28, 28, 1)

    # Float values for division
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')

    # Normalizing by dividing by the max pixel value
    x_train /= 255
    x_test /= 255

    # Categorical y values
    y_train = to_categorical(y_train)
    y_test = to_categorical(y_test)

    return x_train, y_train, x_test, y_test, input_shape

x_train, y_train, x_test, y_test, input_shape = preprocess_mnist(x_train, y_train, x_test, y_test)
Now that we have finished data preprocessing, we can build the model and define the parameters required for Keras to run. First start with the convolutional neural network model itself. The SELU activation function is a special case, we need to use the kernel initializer 'lecun_normal' and the special form of dropout AlphaDropout(), everything else is left as normal.
def build_cnn(activation,
              dropout_rate,
              optimizer):
    model = Sequential()

    if activation == 'selu':
        model.add(Conv2D(32, kernel_size=(3, 3),
                         activation=activation,
                         input_shape=input_shape,
                         kernel_initializer='lecun_normal'))
        model.add(Conv2D(64, (3, 3), activation=activation,
                         kernel_initializer='lecun_normal'))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(AlphaDropout(0.25))
        model.add(Flatten())
        model.add(Dense(128, activation=activation,
                        kernel_initializer='lecun_normal'))
        model.add(AlphaDropout(0.5))
        model.add(Dense(10, activation='softmax'))
    else:
        model.add(Conv2D(32, kernel_size=(3, 3),
                         activation=activation,
                         input_shape=input_shape))
        model.add(Conv2D(64, (3, 3), activation=activation))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(Dropout(0.25))
        model.add(Flatten())
        model.add(Dense(128, activation=activation))
        model.add(Dropout(0.5))
        model.add(Dense(10, activation='softmax'))

    model.compile(
        loss='categorical_crossentropy',  # 10-class softmax output with one-hot labels
        optimizer=optimizer,
        metrics=['accuracy'])

    return model
There is a small problem with using the GELU function; this function is not currently available in Keras. Fortunately, we can easily add new activation functions to Keras.
# Add the GELU function to Keras
def gelu(x):
    return 0.5 * x * (1 + tf.tanh(tf.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3))))
get_custom_objects().update({'gelu': Activation(gelu)})
# Add leaky-relu so we can use it as a string
get_custom_objects().update({'leaky-relu': Activation(LeakyReLU(alpha=0.2))})
act_func = ['sigmoid', 'relu', 'elu', 'leaky-relu', 'selu', 'gelu']
Now we can train the model using the different activation functions defined in the act_func array. We'll run a simple for loop over each activation function and add the results to an array:
result = []

for activation in act_func:
    print('\nTraining with -->{0}<-- activation function\n'.format(activation))

    model = build_cnn(activation=activation,
                      dropout_rate=0.2,
                      optimizer=Adam(clipvalue=0.5))

    history = model.fit(x_train, y_train,
                        validation_split=0.20,
                        batch_size=128, # 128 is faster, but less accurate. 16/32 recommended
                        epochs=100,
                        verbose=1,
                        validation_data=(x_test, y_test))

    result.append(history)

    K.clear_session()
    del model

print(result)
Based on this, we can plot the history from model.fit() for each activation function and see how the loss and accuracy results change.
Now that we can plot the data, I wrote a small piece of code in matplotlib:
new_act_arr = act_func[1:]
new_results = result[1:]

def plot_act_func_results(results, activation_functions=[]):
    plt.figure(figsize=(10, 10))
    plt.style.use('dark_background')

    # Plot validation accuracy values
    for act_func in results:
        plt.plot(act_func.history['val_acc'])

    plt.title('Model accuracy')
    plt.ylabel('Test Accuracy')
    plt.xlabel('Epoch')
    plt.legend(activation_functions)
    plt.show()

    # Plot validation loss values
    plt.figure(figsize=(10, 10))

    for act_func in results:
        plt.plot(act_func.history['val_loss'])

    plt.title('Model loss')
    plt.ylabel('Test Loss')
    plt.xlabel('Epoch')
    plt.legend(activation_functions)
    plt.show()

plot_act_func_results(new_results, new_act_arr)
This results in a graph like this: