LSTMs, GRUs, and other flavors of RNNs were the essential building blocks of NLP models for two decades, starting in the 1990s. CNNs were the essential building blocks of vision (and some NLP) models for three decades, starting in the 1980s. In 2017, Transformers (proposed in the "Attention Is All You Need" paper) demonstrated that recurrence and/or convolutions are not essential for building high-performance natural language models. In 2020, the Vision Transformer (ViT) (proposed in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale") demonstrated that convolutions are not essential for building high-performance vision models either. Transformers were not an overnight success, though: it took GPT and BERT to immensely popularize them, but from 2018 onwards they have been revolutionizing the world of NLP, Speech, and Vision.

The most advanced architectures in use before Transformers were Recurrent Neural Networks with LSTM/GRU units. These architectures, however, struggle with really long sequences, despite their LSTM and GRU units. Intuitively, we can imagine an RNN layer as a conveyor belt, with the words being processed on it autoregressively, from left to right. In the end, we get a hidden feature for each word in the sentence, which we pass to the next RNN layer or use for our NLP tasks of choice. To develop a background in this area, Chris Olah's legendary blog posts on LSTMs and representation learning for NLP are highly recommended.

Transformers, by contrast, are big encoder-decoder models able to process a whole sequence with a sophisticated attention mechanism. Initially introduced for machine translation, they have gradually replaced RNNs in mainstream NLP. The architecture takes a fresh approach to representation learning: doing away with recurrence entirely, Transformers build the features of each word using an attention mechanism (an idea that had also been explored in the world of RNNs, as "Augmented RNNs") to figure out how important all the other words in the sentence are with respect to it. Knowing this, the word's updated features are simply the sum of linear transformations of the features of all the words, weighted by their importance; both styles of computation are sketched in code below. Back in 2017, this idea sounded very radical, because the NLP community was so used to the sequential, one-word-at-a-time style of processing text with RNNs. As recommended reading, Lilian Weng's "Attention? Attention!" offers a great overview of various attention types and their pros/cons.
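To ground the conveyor-belt picture, here is a minimal sketch of a vanilla RNN layer in plain numpy. The dimensions, random weights, and the `rnn_layer` helper are all made up for illustration; they stand in for learned parameters rather than coming from any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a 5-word sentence, 8-dim word features, 8-dim hidden state.
# The random matrices are placeholders for learned weights.
seq_len, d = 5, 8
words = rng.normal(size=(seq_len, d))        # input word features
W_x = rng.normal(size=(d, d)) * 0.1          # input-to-hidden weights
W_h = rng.normal(size=(d, d)) * 0.1          # hidden-to-hidden weights

def rnn_layer(x):
    """Process words left to right; emit one hidden feature per word."""
    h = np.zeros(d)                          # initial hidden state
    hidden_features = []
    for x_t in x:                            # autoregressive: one word at a time
        h = np.tanh(W_x @ x_t + W_h @ h)     # new state mixes this word + history
        hidden_features.append(h)
    return np.stack(hidden_features)         # (seq_len, d)

H = rnn_layer(words)   # pass H to the next RNN layer, or use it for the task
print(H.shape)         # (5, 8)
```

Each step depends on the previous one, so the loop cannot be parallelized across words; this is the conveyor belt in code.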
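And here is the attention update described above in the same toy setting: a single attention head with assumed random weights, a plain softmax, and the usual scaling by the square root of the feature dimension. This is a didactic sketch of the weighted-sum idea, not a full Transformer layer (no multiple heads, masking, or feed-forward sublayers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy setup as above: 5 words, 8-dim features; random stand-in weights.
seq_len, d = 5, 8
words = rng.normal(size=(seq_len, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_layer(x):
    """Update every word from all words at once -- no left-to-right loop."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v      # linear transformations of features
    scores = Q @ K.T / np.sqrt(d)            # importance of word j w.r.t. word i
    weights = softmax(scores)                # each row sums to 1
    return weights @ V                       # weighted sum over all the words

H = attention_layer(words)                   # (5, 8): updated feature per word
print(H.shape)
```

Note how every word's update is computed in a couple of matrix multiplications over the whole sentence at once; there is no sequential loop, which is exactly what made the approach feel so radical to the RNN-trained NLP community of 2017.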