Long Short-Term Memory

A Long Short-Term Memory (LSTM) network is a type of recurrent neural network (RNN) that was specifically designed to address the limitations of vanilla RNNs, particularly the vanishing gradient problem.

The key innovation of an LSTM is its unique internal structure, which allows it to remember important information over long sequences of data.

The Blocks: The Gating Mechanism

An LSTM cell is a complex block with four main components, often referred to as “gates” (strictly, three gates plus a candidate state). These components use mathematical functions to control the flow of information through the cell; a minimal code sketch of the corresponding updates follows the list below. The central component is the cell state (C_t), which acts as a “conveyor belt” of memory running along the entire sequence of time steps.

  1. Forget Gate: This gate decides what information from the previous cell state (C_{t-1}) should be thrown away or forgotten. It looks at the current input (x_t) and the previous hidden state (h_{t-1}) and outputs a value between 0 and 1 for each element of the cell state. A value of 0 means “forget this completely,” and a value of 1 means “keep this entirely.”
  2. Input Gate: This gate decides which new information from the current input should be stored in the cell state. It has two parts: a sigmoid layer that decides which values to update and a tanh layer that creates a vector of new candidate values (C̃_t) to add to the cell state.
  3. Candidate State (C̃_t): This is the new, potential information that could be added to the cell state. It is produced by the input gate’s tanh layer from the current input and the previous hidden state; the sigmoid part of the input gate then decides how much of it actually enters the cell state.
  4. Output Gate: This gate controls what information from the current cell state (C_t) is used to compute the new hidden state (h_t) and the output (y_t). It filters the cell state to pass on only the most relevant information for the current prediction.
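
To make the gating concrete, here is a minimal NumPy sketch of a single LSTM cell step, following the standard gate equations. The function name lstm_step, the params dictionary, and the weight names W_f, W_i, W_C, W_o (with matching biases) are illustrative assumptions rather than any particular library’s API; real implementations (e.g. in PyTorch or TensorFlow) fuse these operations for efficiency.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step following the standard gate equations.

    x_t: current input vector; h_prev / C_prev: previous hidden and cell states.
    params: dict of weight matrices W_* (acting on [h_prev; x_t]) and biases b_*.
    """
    # Concatenate the previous hidden state and the current input.
    z = np.concatenate([h_prev, x_t])

    # Forget gate: how much of each element of C_prev to keep (0 = forget, 1 = keep).
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])
    # Input gate: which elements of the candidate state to write into memory.
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])
    # Candidate state: new values that could be added to the cell state.
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])
    # Output gate: which parts of the cell state to expose as the hidden state.
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])

    # Cell state update: forget part of the old memory, add the gated candidates.
    C_t = f_t * C_prev + i_t * C_tilde
    # New hidden state: a filtered view of the updated cell state.
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t
```

Each gate is a sigmoid over the concatenated previous hidden state and current input, so its values lie between 0 and 1 and act as per-element switches on the memory.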

How an LSTM Works

At each time step, an LSTM unit goes through the following process:

  • It takes in the current input (x_t), the previous hidden state (h_{t-1}), and the previous cell state (C_{t-1}).
  • The forget gate determines what to discard from C_{t-1}.
  • The input gate and the candidate state determine what new information to add to the cell state.
  • The old cell state is scaled by the forget gate and combined with the gated candidate values to produce the new cell state: C_t = f_t * C_{t-1} + i_t * C̃_t.
  • Finally, the output gate decides what to expose from the new cell state, producing the new hidden state h_t = o_t * tanh(C_t) and, from it, the network’s prediction (y_t), as sketched in the code after this list.
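
Putting the steps together, the sketch below unrolls the cell over a sequence of inputs. It reuses the illustrative lstm_step and params layout from the earlier snippet; the random weights are purely for demonstration, and the final readout that would turn h_t into a prediction y_t is left out because it depends on the task.

```python
import numpy as np

def run_lstm(inputs, params, hidden_size):
    """Unroll the LSTM over a sequence of input vectors.

    inputs: list of input vectors x_t; returns the hidden state at every step.
    """
    h_t = np.zeros(hidden_size)   # initial hidden state
    C_t = np.zeros(hidden_size)   # initial cell state (the "conveyor belt" of memory)
    hidden_states = []
    for x_t in inputs:
        # Each step consumes x_t, h_{t-1}, C_{t-1} and produces h_t, C_t.
        h_t, C_t = lstm_step(x_t, h_t, C_t, params)
        hidden_states.append(h_t)
    return hidden_states

# Example with random weights for input size 4 and hidden size 3 (illustration only).
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
params = {
    name: rng.normal(size=(hidden_size, hidden_size + input_size))
    for name in ("W_f", "W_i", "W_C", "W_o")
}
params.update({name: np.zeros(hidden_size) for name in ("b_f", "b_i", "b_C", "b_o")})

sequence = [rng.normal(size=input_size) for _ in range(5)]
states = run_lstm(sequence, params, hidden_size)
# A prediction y_t would typically be a linear layer applied to h_t (task-dependent).
```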

The cell state and the gates are what allow LSTMs to retain crucial information over long sequences. The gates act as learned switches, deciding what information is important to keep and what can be discarded. Because the cell state is updated largely additively (the forget and input gates scale and add values rather than repeatedly multiplying the state by a weight matrix), gradients can flow back through many time steps without shrinking toward zero, which mitigates the vanishing gradient problem that plagued vanilla RNNs. This ability to handle long-term dependencies is why LSTMs were a breakthrough for tasks like speech recognition and language translation.


“What is LSTM (Long Short Term Memory)?” provides a visual and intuitive explanation of the LSTM architecture and its components.