LSTMs, GRUs, and other flavors of RNNs were the essential building blocks of NLP models for two decades, starting in the 1990s. CNNs were the essential building blocks of vision (and some NLP) models for three decades, starting in the 1980s. In 2017, Transformers (proposed in the "Attention Is All You Need" paper) demonstrated that recurrence and/or convolutions are not essential for building high-performance natural language models. In 2020, the Vision Transformer (ViT) (proposed in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale") demonstrated that convolutions are not essential for building high-performance vision models either. Transformers were not an overnight success, though: it took GPT and BERT to immensely popularize them, but from 2018 onwards they have been revolutionizing the world of NLP, Speech, and Vision.

The most advanced architectures in use before Transformers were Recurrent Neural Networks with LSTM/GRU units. These architectures, however, struggle with really long sequences, despite their LSTM and GRU units. Intuitively, we can imagine an RNN layer as a conveyor belt, with the words being processed on it autoregressively, from left to right. In the end, we get a hidden feature for each word in the sentence, which we pass to the next RNN layer or use for our NLP tasks of choice. To develop a background in this area, Chris Olah's legendary blog posts on LSTMs and representation learning for NLP are highly recommended.

Transformers, by contrast, are big encoder-decoder models able to process a whole sequence with a sophisticated attention mechanism. Initially introduced for machine translation, they have gradually replaced RNNs in mainstream NLP. The architecture takes a fresh approach to representation learning: doing away with recurrence entirely, Transformers build the features of each word using an attention mechanism (an idea that had also been explored in the world of RNNs, as "Augmented RNNs") to figure out how important all the other words in the sentence are with respect to it. Knowing this, the word's updated features are simply the sum of linear transformations of the features of all the words, weighted by their importance; both styles of computation are sketched in code below. Back in 2017, this idea sounded very radical, because the NLP community was so used to the sequential, one-word-at-a-time style of processing text with RNNs. As recommended reading, Lilian Weng's "Attention? Attention!" offers a great overview of various attention types and their pros/cons.
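To ground the conveyor-belt picture, here is a minimal sketch of a vanilla RNN layer in plain numpy. The dimensions, random weights, and the `rnn_layer` helper are all made up for illustration; they stand in for learned parameters rather than coming from any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a 5-word sentence, 8-dim word features, 8-dim hidden state.
# The random matrices are placeholders for learned weights.
seq_len, d = 5, 8
words = rng.normal(size=(seq_len, d))        # input word features
W_x = rng.normal(size=(d, d)) * 0.1          # input-to-hidden weights
W_h = rng.normal(size=(d, d)) * 0.1          # hidden-to-hidden weights

def rnn_layer(x):
    """Process words left to right; emit one hidden feature per word."""
    h = np.zeros(d)                          # initial hidden state
    hidden_features = []
    for x_t in x:                            # autoregressive: one word at a time
        h = np.tanh(W_x @ x_t + W_h @ h)     # new state mixes this word + history
        hidden_features.append(h)
    return np.stack(hidden_features)         # (seq_len, d)

H = rnn_layer(words)   # pass H to the next RNN layer, or use it for the task
print(H.shape)         # (5, 8)
```

Each step depends on the previous one, so the loop cannot be parallelized across words; this is the conveyor belt in code.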
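And here is the attention update described above in the same toy setting: a single attention head with assumed random weights, a plain softmax, and the usual scaling by the square root of the feature dimension. This is a didactic sketch of the weighted-sum idea, not a full Transformer layer (no multiple heads, masking, or feed-forward sublayers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy setup as above: 5 words, 8-dim features; random stand-in weights.
seq_len, d = 5, 8
words = rng.normal(size=(seq_len, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_layer(x):
    """Update every word from all words at once -- no left-to-right loop."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v      # linear transformations of features
    scores = Q @ K.T / np.sqrt(d)            # importance of word j w.r.t. word i
    weights = softmax(scores)                # each row sums to 1
    return weights @ V                       # weighted sum over all the words

H = attention_layer(words)                   # (5, 8): updated feature per word
print(H.shape)
```

Note how every word's update is computed in a couple of matrix multiplications over the whole sentence at once; there is no sequential loop, which is exactly what made the approach feel so radical to the RNN-trained NLP community of 2017.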