LSTM Networks

Mohit Varikuti
5 min read · Aug 3, 2021

This article discusses the issues with traditional RNNs, such as vanishing and exploding gradients, and presents Long Short-Term Memory (LSTM) as a solution. Long Short-Term Memory (LSTM) is a more sophisticated variant of the recurrent neural network (RNN) architecture, created to model chronological sequences and their long-range dependencies more precisely than ordinary RNNs.

The highlights include the internal design of a basic LSTM cell, the modifications that have been made to the LSTM architecture, and a few applications of LSTMs that are in high demand. It also compares and contrasts LSTMs and GRUs. The article closes with a list of LSTM network drawbacks and a quick overview of the attention-based models that are rapidly replacing LSTMs in real-world applications.

Introduction:

LSTM networks are a type of recurrent neural network (RNN) developed to deal with situations where plain RNNs fail. RNNs are networks that operate on the current input while taking into account previous outputs (feedback) and keeping them in memory for a brief period of time (short-term memory). The most common applications are in the domains of speech processing, non-Markovian control, and music generation, among others. RNNs, however, have several disadvantages. For starters, they are unable to retain information for a prolonged length of time. Predicting the current output often requires information that was stored a long time ago, and RNNs are simply incapable of dealing with such “long-term dependencies.”
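To picture the recurrence just described, here is a minimal sketch of a single vanilla RNN step in NumPy; the weight names (W_xh, W_hh, b_h) and the toy dimensions are illustrative assumptions rather than values from any particular library.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: the new hidden state mixes the current input x_t
    with the previous hidden state h_prev (the feedback / short-term memory)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Toy sizes: 4-dimensional inputs, 3-dimensional hidden state.
rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)

h = np.zeros(3)                    # the short-term memory starts empty
for x in rng.normal(size=(5, 4)):  # a sequence of 5 inputs
    h = rnn_step(x, h, W_xh, W_hh, b_h)
print(h)
```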

Second, there is no fine-grained control over which parts of the context should be preserved and how much should be forgotten. Another concern with RNNs is the exploding and vanishing gradients (described later) that occur during the backpropagation phase of training. As a result, the concept of Long Short-Term Memory (LSTM) was introduced. It is built in such a way that the vanishing gradient problem is almost entirely eliminated, while the training procedure remains essentially unchanged. LSTMs are used to bridge long time lags in some applications, and they can also handle noise, distributed representations, and continuous values. With LSTMs, there is no need to fix a finite number of states from the beginning, as in the hidden Markov model (HMM). Learning rates and input and output biases are only a few of the parameters provided by LSTMs.

As a result, no fine adjustments are required. With LSTMs, the complexity of updating each weight is reduced to O(1), comparable to Backpropagation Through Time (BPTT), which is a benefit.
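For context only, this is roughly what instantiating and applying an LSTM layer looks like in PyTorch; the layer sizes, batch size, and sequence length below are arbitrary placeholders, not values from the article.

```python
import torch
import torch.nn as nn

# One LSTM layer processing a batch of sequences (illustrative sizes).
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=1, batch_first=True)

x = torch.randn(2, 10, 8)          # (batch=2, time steps=10, features=8)
outputs, (h_n, c_n) = lstm(x)      # outputs holds the hidden state at every step
print(outputs.shape, h_n.shape, c_n.shape)
# torch.Size([2, 10, 16]) torch.Size([1, 2, 16]) torch.Size([1, 2, 16])
```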

Exploding and Vanishing Gradients:

The fundamental aim of a network’s training process is to reduce the loss (in terms of error or cost) observed in the output when training data is passed through it. We compute the gradient of the loss with respect to a given set of weights, adjust the weights accordingly, and repeat until we arrive at an optimal set of weights with the least possible loss. This combination of propagating the error backwards to obtain the gradients and then updating the weights is known as backpropagation (with gradient descent).
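As a minimal sketch of that loop, here is a toy least-squares problem with a hand-written gradient and plain gradient-descent updates; every name and value in it is illustrative.

```python
import numpy as np

def loss(w, x, y):
    """Mean squared error of a linear model x @ w against targets y."""
    return np.mean((x @ w - y) ** 2)

def grad(w, x, y):
    """Gradient of the loss with respect to the weights w."""
    return 2 * x.T @ (x @ w - y) / len(y)

rng = np.random.default_rng(1)
x, true_w = rng.normal(size=(32, 3)), np.array([1.0, -2.0, 0.5])
y = x @ true_w

w, lr = np.zeros(3), 0.1           # learning rate in the typical 0.1-0.001 range
for _ in range(100):
    w -= lr * grad(w, x, y)        # nudge the weights against the gradient
print(loss(w, x, y), w)            # loss approaches 0, w approaches true_w
```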

Occasionally, the gradient is so small that it is virtually negligible. Note that a layer’s gradient depends on certain components from subsequent layers. If any of those components are small (less than 1), the resulting gradient will be even smaller. This is known as the scaling effect. When this gradient is multiplied by the learning rate, which is itself a small value typically between 0.1 and 0.001, the result is an even smaller number.

As a result, the change in the weights is tiny, producing almost the same output as before. Similarly, if the gradients are extremely large due to large component values, the weights are updated to a value far from the optimum. This is the problem of exploding gradients. To avoid this scaling effect, the neural network unit was rebuilt so that the scaling factor was fixed to one. The cell, enriched with several gating units, was named the LSTM.
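A few lines of arithmetic show the scaling effect directly: multiplying the gradient by a per-step factor below one over many time steps makes it vanish, while a factor above one makes it explode.

```python
# Repeated multiplication over 50 time steps (illustrative factors).
def scaled_gradient(per_step_factor, steps=50):
    grad = 1.0
    for _ in range(steps):
        grad *= per_step_factor
    return grad

print(scaled_gradient(0.9))  # ~0.005 -> vanishing gradient
print(scaled_gradient(1.1))  # ~117   -> exploding gradient
print(scaled_gradient(1.0))  # 1.0    -> the fixed scaling the LSTM cell aims for
```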

Architecture:

The main distinction between the RNN and LSTM architectures is that the LSTM’s hidden layer is a gated unit, or gated cell. It is made up of four layers that interact with one another to produce the cell’s output as well as the cell’s state. Both of these are then passed on to the next hidden layer. Unlike RNNs, which have a single tanh neural network layer, LSTMs contain three logistic sigmoid gates and one tanh layer.

Gates were introduced to limit the amount of information that passes through the cell. They determine which parts of the information will be needed by the next cell and which parts should be discarded. The output is generally in the 0–1 range, with 0 meaning “reject all” and 1 meaning “include all.”
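To make the four interacting layers concrete, here is a minimal sketch of one LSTM cell step in NumPy. Packing all four gate pre-activations into a single weight matrix over [h_prev, x_t], and every variable name below, are illustrative assumptions; library implementations organise these parameters differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: three sigmoid gates (forget, input, output) plus a tanh
    candidate layer. W maps the concatenation [h_prev, x_t] to 4*H units."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = len(h_prev)
    f = sigmoid(z[0:H])        # forget gate: 0 = "reject all", 1 = "include all"
    i = sigmoid(z[H:2*H])      # input gate: how much of the candidate to add
    o = sigmoid(z[2*H:3*H])    # output gate: how much of the cell state to expose
    g = np.tanh(z[3*H:4*H])    # candidate values (the tanh layer)
    c_t = f * c_prev + i * g   # new cell state
    h_t = o * np.tanh(c_t)     # new hidden state, passed to the next time step
    return h_t, c_t

rng = np.random.default_rng(0)
H, X = 3, 4                    # toy hidden and input sizes
W, b = rng.normal(size=(4 * H, H + X)), np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=X), h, c, W, b)
print(h, c)
```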

Variations:

With the growing popularity of LSTMs, several modifications have been made to the traditional LSTM architecture to simplify the internal design of cells, making them operate more efficiently and reducing computational complexity. Gers and Schmidhuber introduced peephole connections, which allow the gate layers to see the state of the cell at any time. Some LSTMs use a coupled input and forget gate instead of two separate gates, which lets them make both decisions at the same time (see the sketch below). Another modification is the Gated Recurrent Unit (GRU), which reduces the number of gates and therefore the design complexity. It merges the cell state and hidden state, and it uses an update gate that combines the forget and input gates.
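As a minimal sketch of the coupled variant (reusing the illustrative naming from the cell sketch above), new information is written exactly where old information is forgotten, so the input gate becomes 1 - f:

```python
import numpy as np

def coupled_cell_update(f, c_prev, g):
    """f: forget-gate activations in (0, 1); c_prev: old cell state; g: tanh candidate."""
    return f * c_prev + (1.0 - f) * g   # one gate makes both decisions

print(coupled_cell_update(np.array([0.9, 0.1]), np.array([1.0, 1.0]), np.array([0.5, 0.5])))
# [0.95 0.55] -> mostly old memory in the first slot, mostly new candidate in the second
```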

GRUs vs LSTMs:

Despite being quite similar to LSTMs, GRUs have never been as popular. But what exactly are GRUs? GRU stands for Gated Recurrent Unit. As the name implies, these recurrent units, proposed by Cho, use a gating mechanism to efficiently and adaptively capture dependencies across multiple time scales. They have a reset gate and an update gate. The former decides how much of the previous state to forget when forming the new candidate state, whereas the latter decides how much of the previous state is carried forward to the next recurrent unit.

Another noteworthy difference is that GRUs do not maintain a separate cell state, so they cannot control how much of their memory content the next unit is exposed to, whereas LSTMs regulate this exposure with the output gate. LSTMs also control the amount of new information entering the cell through the input gate. The GRU, on the other hand, controls the flow of information from the previous activation when computing the new candidate activation, but not the amount of the candidate activation being added; that control is tied to the update gate.
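For comparison with the LSTM sketch above, here is a minimal GRU step in NumPy. The weight names and the particular gate convention used (the update gate deciding how much of the old state to keep) are illustrative assumptions; libraries differ in how they orient this gate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step: no separate cell state, only the hidden state h."""
    xh = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ xh + b_z)   # update gate: how much of the old state to keep
    r = sigmoid(W_r @ xh + b_r)   # reset gate: how much of the old state feeds the candidate
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)
    return (1.0 - z) * h_cand + z * h_prev  # blend of candidate and previous state

rng = np.random.default_rng(0)
H, X = 3, 4                       # toy hidden and input sizes
new_W = lambda: rng.normal(size=(H, H + X))
h = gru_step(rng.normal(size=X), np.zeros(H), new_W(), new_W(), new_W(),
             np.zeros(H), np.zeros(H), np.zeros(H))
print(h)
```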
