To answer this, we need to understand the Transformer architecture introduced by researchers at Google in their paper "Attention Is All You Need." But before we dive into Transformers, let's look at how natural language tasks were handled before they arrived.
Before Transformers, recurrent neural networks (RNNs) like LSTMs were commonly used for natural language processing tasks. RNNs process input sequentially, incorporating information from previous inputs into a hidden state.
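To make this concrete, here is a minimal NumPy sketch of a single vanilla-RNN update step; the weight matrices, sizes, and names are hypothetical and chosen only for illustration. Notice that each step depends on the previous hidden state, so the sequence can only be processed one token at a time.

```python
import numpy as np

# A minimal sketch of one vanilla RNN step (hypothetical shapes, illustration only).
def rnn_step(x_t, h_prev, W_x, W_h, b):
    # The new hidden state mixes the current input with the previous hidden state.
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 8))          # toy sequence: 5 tokens, 8-dim embeddings
W_x = rng.normal(size=(8, 16)) * 0.1   # input-to-hidden weights
W_h = rng.normal(size=(16, 16)) * 0.1  # hidden-to-hidden weights
b = np.zeros(16)

h = np.zeros(16)
for x_t in seq:        # strictly sequential: step t depends on step t-1
    h = rnn_step(x_t, h, W_x, W_h, b)
print(h.shape)         # (16,) - the entire sequence compressed into one hidden state
```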
However, RNNs suffered from some major issues. One was their limited memory: as the sequence grew longer, the influence of earlier inputs on the hidden state shrank rapidly, making it difficult to capture long-range dependencies.
The Transformer introduced an alternative, non-recurrent architecture built on three key concepts to solve this:
1. Positional Encoding: the input tokens (words) are turned into embedding vectors, and a position-dependent signal is added to each vector so the model knows where each token sits in the sequence (see the sketch after this list).
2. Attention: attends to elements in one sequence based on their relevance to a specific context, for example the decoder attending to the encoder's outputs.
3. Self-Attention: attends to elements within the same sequence, relating each element to every other element, including itself.
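As referenced above, here is a minimal NumPy sketch of the sinusoidal positional encoding described in the paper, added to hypothetical token embeddings; the sequence length and embedding size are arbitrary and chosen only for illustration.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings in the style of 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]    # (seq_len, 1)
    dims = np.arange(d_model)[None, :]         # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates           # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])      # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])      # odd dimensions use cosine
    return pe

# Hypothetical token embeddings plus position information.
token_embeddings = np.random.default_rng(0).normal(size=(5, 16))
inputs = token_embeddings + sinusoidal_positional_encoding(5, 16)
print(inputs.shape)  # (5, 16) - same shape, now position-aware
```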
The self-attention mechanism allows models to look across all words in a sentence simultaneously. This provides better context and captures long-range dependencies more effectively compared to RNN models like LSTMs.
Additionally, the attention mechanism allows Transformers to selectively focus on the most relevant parts of the input while making predictions.
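To illustrate both points, here is a minimal NumPy sketch of single-head scaled dot-product self-attention; the projection matrices and sizes are hypothetical and kept tiny for readability. Every token scores every other token in a single matrix multiplication, which is why context can flow across the whole sentence at once instead of trickling through a recurrent hidden state.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over one sequence X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # Every token scores every other token (including itself) at once.
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)        # how relevant each token is to each other token
    return weights @ V                        # context-aware representation per token

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                  # 5 tokens, 16-dim embeddings (illustrative)
W_q = rng.normal(size=(16, 16)) * 0.1
W_k = rng.normal(size=(16, 16)) * 0.1
W_v = rng.normal(size=(16, 16)) * 0.1
print(self_attention(X, W_q, W_k, W_v).shape) # (5, 16)
```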
In summary, Transformers leverage attention and self-attention to overcome the shortcomings of RNN models, such as vanishing gradients and poor use of long-range context. The architecture has proven enormously effective for natural language understanding.