What Are They?
RNNs are like learners with short-term memory – they remember recent information but get fuzzy as time passes.
Simply put, a Recurrent Neural Network (RNN) is like a “learner” with memory. Imagine this learner reading an article, remembering the meaning of each word temporarily before moving on to the next. While traditional neural networks treat each input independently (like starting from scratch each time), RNNs “remember” previous information (earlier words) and pass this memory to the next step, creating context for the entire sentence.
However, RNNs have a glaring issue: strong short-term memory, weak long-term memory. It’s like remembering the words you just read but struggling to recall earlier parts. This becomes particularly evident when processing long texts or sequences. Their ability to “remember” early information rapidly declines over time, a consequence of what is known as the vanishing gradient problem.
So, how does LSTM (Long Short-Term Memory) solve RNN’s problem?
Think of LSTM as a “learner with smarter memory capabilities”. It doesn’t just remember recent information like an RNN; it learns to select important memories to retain and discard unimportant ones. LSTM uses gating mechanisms such as the “forget gate” and the “input gate” to decide which information to forget and which to remember long-term, thus solving RNN’s inability to retain information over extended periods.
Here’s an analogy: Imagine reading a novel where a character introduced at the beginning is crucial to the plot, but you don’t need to remember every minor detail as time goes on. LSTM is like your brain, capable of judging which information to remember for a long time (like key plot characters) and which to forget (like unimportant details). This allows it to handle information spanning longer time frames without forgetting as rapidly as RNNs do.
Some Technical Details
1. RNN Technical Details
The core of RNN lies in its “recurrence” – the network’s output depends not only on the current input but also on the hidden state from the previous time step. This is the source of its “memory”.
Mathematical Expression:
In RNNs, the hidden state \( h_t \) is computed using the current input \( x_t \) and the previous hidden state \( h_{t-1} \):
\[ h_t = \tanh(W_h h_{t-1} + W_x x_t + b) \]
- \( W_h \) is the weight matrix connecting hidden states (responsible for remembering past information).
- \( W_x \) is the input layer weight matrix (responsible for processing current input).
- \( \tanh \) is the activation function (can be replaced with others like ReLU, but \( \tanh \) is common).
- \( b \) is the bias term.
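To make the recurrence concrete, here is a minimal NumPy sketch of a single RNN step implementing the formula above. The layer sizes, random initialization, and toy sequence are illustrative assumptions, not details fixed by the formula itself:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_h, W_x, b):
    """One RNN step: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

# Illustrative sizes and random initialization (assumed, not from the text)
input_size, hidden_size = 4, 8
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
b = np.zeros(hidden_size)

# Run over a toy sequence of length 5, carrying the hidden state forward
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h = rnn_step(x_t, h, W_h, W_x, b)
print(h.shape)  # (8,)
```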
The Vanishing Gradient Problem in RNNs
During RNN training, we use backpropagation through time to update the weights. In long sequences, the hidden state \( h_t \) is passed forward step by step, so the influence of an early input on the current state has to travel through a long chain of multiplications. As the sequence length grows, the gradient flowing back to those early steps shrinks toward zero – this is the vanishing gradient.
In simpler terms, as sequences get longer, the impact of early input information (like \( h_1 \)) gets “diluted”, leading to poor memory of long-term dependencies. This makes RNNs ineffective at learning early information when processing long sequences.
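This dilution can be seen numerically. By the chain rule, the gradient of \( h_t \) with respect to the initial hidden state is a product of per-step Jacobians \( \mathrm{diag}(1 - h_t^2)\, W_h \), and that product tends to shrink toward zero as the sequence grows. Here is a small illustrative sketch (the weight scale and toy inputs are assumed values chosen to make the effect visible):

```python
import numpy as np

# Assumed illustrative setup: small random weights, toy inputs
rng = np.random.default_rng(0)
hidden_size, input_size = 8, 2
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)
J = np.eye(hidden_size)  # accumulated Jacobian d h_t / d h_0
for t, x_t in enumerate(rng.normal(size=(50, input_size)), start=1):
    h = np.tanh(W_h @ h + W_x @ x_t + b)
    J = np.diag(1 - h**2) @ W_h @ J  # one application of the chain rule
    if t in (1, 10, 25, 50):
        print(t, np.linalg.norm(J))  # the norm shrinks toward zero as t grows
```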
2. LSTM Technical Details
LSTMs are designed to solve the vanishing gradient problem. Through special gating mechanisms, they selectively retain and forget information, maintaining long-term memory capabilities. The core of LSTM includes three gates: forget gate, input gate, and output gate, along with a cell state to help control information flow.
Mathematical Expression:
LSTM updates occur in the following steps:
- Forget Gate \( f_t \): Decides which past memories to forget. It computes the forget gate value \( f_t \) (a vector of values between 0 and 1; entries closer to 1 mean less forgetting) from the current input \( x_t \) and the previous hidden state \( h_{t-1} \).
  \[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \]
  - \( W_f \) is the weight matrix for the forget gate.
  - \( \sigma \) is the sigmoid function (output between 0 and 1).
- Input Gate \( i_t \) and Candidate State \( \tilde{C}_t \): The input gate controls how new information updates the current cell state, deciding how much new information to add. The candidate state \( \tilde{C}_t \) is generated through a \( \tanh \) function, representing the content of the new information.
  \[ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \]
  \[ \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \]
- Cell State Update \( C_t \): The cell state \( C_t \) is the core of the LSTM structure. It is updated to a new cell state by combining the forget gate \( f_t \) and the input gate \( i_t \).
  \[ C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t \]
  - The first term \( f_t \cdot C_{t-1} \) retains the previous cell state (remembering the past).
  - The second term \( i_t \cdot \tilde{C}_t \) adds new memories (incorporating the current input).
- Output Gate \( o_t \) and Hidden State \( h_t \): Finally, the output gate controls which parts of the cell state are exposed as the hidden state for the current time step, so the hidden state \( h_t \) is also shaped by the cell state \( C_t \) (a full single-step implementation is sketched in code after this list):
  \[ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \]
  \[ h_t = o_t \cdot \tanh(C_t) \]
  - The current time step’s hidden state is produced by combining the squashed cell state \( \tanh(C_t) \) with the output gate \( o_t \).
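Putting the four steps together, here is a minimal NumPy sketch of one LSTM step following the equations above. The concatenation \( [h_{t-1}, x_t] \), the layer sizes, and the random initialization are illustrative assumptions; deep learning frameworks organize the same computation with fused weight matrices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step following the equations above."""
    (W_f, b_f), (W_i, b_i), (W_C, b_C), (W_o, b_o) = params
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    C_tilde = np.tanh(W_C @ z + b_C)       # candidate state
    C_t = f_t * C_prev + i_t * C_tilde     # cell state update
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(C_t)               # hidden state
    return h_t, C_t

# Illustrative sizes and random initialization (assumed, not from the text)
input_size, hidden_size = 4, 8
rng = np.random.default_rng(0)

def make_gate():
    W = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    return W, np.zeros(hidden_size)

params = [make_gate() for _ in range(4)]  # forget, input, candidate, output

# Run over a toy sequence, carrying both the hidden state and the cell state
h = np.zeros(hidden_size)
C = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, C = lstm_step(x_t, h, C, params)
print(h.shape, C.shape)  # (8,) (8,)
```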
How Does LSTM Solve the Problem?
- Long-Distance Memory in the Cell State: The cell state \( C_t \) acts like an “information highway”, largely free from the vanishing gradient problem seen in RNNs. Because the cell state is updated additively, the direct path from \( C_{t-1} \) to \( C_t \) is just an elementwise multiplication by the forget gate \( f_t \); when \( f_t \) stays close to 1, gradients can flow across many time steps, allowing the network to remember crucial information from far back even in long sequences (see the numerical sketch after this list).
- Flexibility of Gating Mechanisms: Based on the current input and the previous hidden state, LSTM can flexibly decide how much old information to forget and how much new information to introduce through the degree of gate openness (values between 0 and 1). This flexibility gives LSTM powerful memory control capabilities.
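As a rough numerical illustration (the per-step factors below are assumed, not measured): along the direct cell-state path the gradient over \( T \) steps behaves like the product of the forget gates, while along an RNN’s hidden-state path each step also multiplies in a \( \tanh \) derivative and the recurrent weights, which typically shrinks the product far faster.

```python
import numpy as np

T = 50
# Assumed illustrative per-step factors (not measured values):
forget_gates = np.full(T, 0.98)  # LSTM forget gates staying near 1
rnn_factors = np.full(T, 0.60)   # a typical |tanh'| * ||W_h|| factor per RNN step

print("LSTM cell-state path :", np.prod(forget_gates))  # about 0.36, the gradient survives
print("RNN hidden-state path:", np.prod(rnn_factors))   # about 8e-12, the gradient vanishes
```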
Summary:
- In RNNs, the hidden state \( h_t \) gradually “fades” early information over time (vanishing gradient problem), making them unsuitable for long sequences.
- In LSTMs, the cell state \( C_t \) can preserve important information long-term through gating mechanisms, solving the vanishing gradient problem and making them suitable for tasks with long-term dependencies.