Attention Is All You Need? LSTM Inventor: I Don't Think So

Yann LeCun, a deep learning pioneer and Turing Award winner, has long believed that unsupervised learning is the path to true artificial intelligence, and that realizing it requires exploring energy-based learning. This direction has existed in AI for decades. Physicist John Hopfield formalized it in 1982 in the form of the Hopfield network, a breakthrough in machine learning at the time that spurred the development of other learning algorithms, such as Hinton's Boltzmann machine.

"Energy-based learning has been around for a while, and recently it has come back to me in the search for less supervised methods," Yann LeCun said in a talk at Princeton University in October 2019.

This prediction about the direction of AI from one of deep learning's leaders makes the "ancient" Hopfield network interesting again. Recently, a group of researchers showed that the update rule of the Transformer, the hottest architecture in NLP, is actually the same as that of a Hopfield network with continuous states.

In 2018, a Google paper shook the NLP community. In it, researchers proposed a model called BERT, which set new SOTA records on 11 NLP tasks. As is well known, BERT's success is largely due to the Transformer architecture behind it, which Google introduced in the 2017 paper "Attention Is All You Need." Since its introduction, the Transformer has achieved excellent results on many natural language processing problems: it not only trains faster but is also better suited to modeling long-range dependencies.

At present, most pre-trained models are built on the Transformer and use it as their feature extractor. It is fair to say that the appearance of the Transformer has changed deep learning, and NLP in particular. Why is the Transformer so powerful? The answer is the attention mechanism. The earliest use of attention in NLP dates back to 2014, when Bengio's team introduced it into neural machine translation, but the core of that model was still an RNN. The Transformer, by contrast, abandons CNNs and RNNs entirely: the whole network is built from attention. The effect of this change has been transformative.

However, a recent study shows that this attention mechanism in the Transformer is equivalent to the update rule of a modern Hopfield network extended to continuous states. The paper's authors come from Johannes Kepler University Linz in Austria, the University of Oslo in Norway, and other institutions; among them is Sepp Hochreiter, who co-authored the LSTM with Jürgen Schmidhuber.

The Hopfield network is a recurrent neural network model proposed by John Hopfield in 1982. It combines content-addressable memory with binary threshold units and is guaranteed to converge to a local minimum of its energy function, although it may converge to a wrong local minimum rather than the stored pattern. The Hopfield network played an important role in the revival of neural network research in the early 1980s. In the new paper, the researchers from Linz and Oslo propose a new Hopfield network that extends the modern Hopfield network from binary to continuous patterns, and they show that the update rule of this new network is equivalent to the attention mechanism in the Transformer.
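For readers unfamiliar with the 1982 model, here is a minimal sketch (my own illustration, not taken from the paper) of a classical binary Hopfield network: patterns are stored with a Hebbian outer-product rule, and retrieval repeats a sign update until the state stops changing, settling into a local energy minimum that may or may not be the intended pattern. All names and sizes are illustrative.

```python
import torch

def store(patterns: torch.Tensor) -> torch.Tensor:
    """Hebbian outer-product rule: W = (1/N) * sum_i x_i x_i^T, with zero diagonal."""
    n, d = patterns.shape
    W = patterns.t() @ patterns / n
    W.fill_diagonal_(0.0)          # no self-connections
    return W

def retrieve(W: torch.Tensor, xi: torch.Tensor, steps: int = 10) -> torch.Tensor:
    """Synchronous sign updates; converges to a (possibly wrong) local energy minimum."""
    for _ in range(steps):
        new = torch.sign(W @ xi)
        new[new == 0] = 1.0        # break ties
        if torch.equal(new, xi):
            break
        xi = new
    return xi

# store two random binary (+1/-1) patterns of dimension 64 and recall a noisy copy
pats = torch.sign(torch.randn(2, 64))
W = store(pats)
noisy = pats[0].clone()
noisy[:8] *= -1                    # flip a few bits
print(torch.equal(retrieve(W, noisy), pats[0]))  # usually True when few patterns are stored
```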
This new Hopfield network with continuous states retains the characteristics of the discrete network: exponential storage capacity and extremely fast convergence.

After establishing the equivalence between the new Hopfield update rule and Transformer attention, the researchers used it to analyze Transformer-based models such as BERT. They found that these models operate in different regimes and tend to operate at higher energy minima, which are metastable states.

Yannic Kilcher, a well-known YouTube blogger and PhD student at ETH Zurich, also walked through the paper; his video was viewed more than 10,000 times within two days.

The deep learning community has long looked for alternatives to RNNs for storing information, and most recent approaches are based on attention. The Transformer and BERT pushed the performance of NLP tasks to new levels with the attention mechanism. This study shows that the attention mechanism in the Transformer is in fact equivalent to the update rule of a modern Hopfield network extended to continuous states. The new Hopfield network offers exponentially large pattern storage, converges with a single update, and has exponentially small retrieval error. There is an unavoidable trade-off between the number of stored patterns on the one hand and convergence speed and retrieval error on the other.

The Transformer learns its attention mechanism by embedding queries and patterns into an association space. In the first few layers, Transformer and BERT models tend to operate in the global-averaging regime, while at higher layers they tend to operate in metastable states. The gradients in the Transformer are largest in the metastable regime, are uniformly distributed in the global-averaging regime, and vanish when a fixed point is close to a stored pattern.

Based on this Hopfield-network interpretation, the researchers analyzed how Transformer and BERT architectures learn. Learning starts with attention heads that average uniformly; most of them then switch to metastable states. However, most attention heads in the first few layers remain close to uniform averaging and can be replaced by averaging operations such as the Gaussian weighting proposed by the researchers. By contrast, the attention heads in the last few layers learn steadily and appear to use metastable states to gather the information created by the lower layers. These heads look like promising targets for improving the Transformer.

A neural network that incorporates the Hopfield network outperforms other methods on an immune repertoire classification task, where the Hopfield network has to store hundreds of thousands of patterns. The researchers also provide a PyTorch layer called "Hopfield" that brings modern Hopfield networks to deep learning: a powerful new building block that covers pooling, memory, and attention.

The paper proposes a new energy function that improves on the energy function of the modern Hopfield network so as to extend it to continuous states. With this improvement, the new modern Hopfield network can store continuous patterns while keeping the convergence and storage properties of the binary modern Hopfield network. Classical Hopfield networks do not need to constrain the norm of their state vectors, because the states are binary and therefore of fixed length.
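To make the continuous energy concrete, here is a rough sketch under my reading of the description above, with constant terms omitted: the energy combines a negative log-sum-exp over the pattern similarities with a quadratic term that keeps the norm of the state ξ finite, roughly E(ξ) = -(1/β) log Σ_i exp(β x_i^T ξ) + (1/2) ξ^T ξ.

```python
import torch

def hopfield_energy(X: torch.Tensor, xi: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """E(xi) = -(1/beta) * logsumexp(beta * X^T xi) + 0.5 * xi^T xi  (constant terms dropped).

    X:  (d, N) matrix whose columns are the stored continuous patterns.
    xi: (d,)   current state vector.
    """
    lse = torch.logsumexp(beta * (X.t() @ xi), dim=0) / beta
    return -lse + 0.5 * xi @ xi

# the quadratic term keeps ||xi|| finite, so the energy is bounded below
X = torch.randn(16, 5)       # 5 stored patterns in 16 dimensions
xi = torch.randn(16)
print(hopfield_energy(X, xi, beta=2.0))
```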
In addition, a new update rule is proposed and proven to converge to stationary points of the energy function. The researchers also show that a pattern which is well separated from the other patterns can be retrieved with a single update, with exponentially small retrieval error.

By taking the logarithm of the negative energy and adding a quadratic term that keeps the norm of the state vector ξ finite, the researchers generalized the energy function to continuous-valued patterns. They proved that these changes preserve the exponential storage capacity of the modern Hopfield network and its one-update convergence, as shown in Figure 1.

The researchers then argue that the update rule of the continuous-state Hopfield network is exactly the attention mechanism used in the Transformer and BERT. To see this, they assume patterns y_i that are mapped into a Hopfield space of dimension d_k via the projection matrices W_Q and W_K. Writing all queries and the softmax output in R^N as row vectors and multiplying the update rule by W_V gives Z = softmax(1/√d_k · Q K^T) V, which is precisely the Transformer's scaled dot-product attention.

Through a theoretical analysis of this attention mechanism, the researchers identify three kinds of fixed points: a) if the patterns x_i are not well separated, the iteration goes to a fixed point close to the arithmetic mean of the vectors, i.e., a global fixed point; b) if the patterns are well separated from one another, the iteration converges to the pattern the initial state ξ is most similar to: if ξ is similar to a pattern x_i, it converges to a vector close to x_i, and the softmax vector p converges to a vector close to the unit vector e_i; c) if some vectors are similar to one another but well separated from all the others, a so-called metastable state exists between the similar vectors, and iterations started near the metastable state converge to it.

The researchers observed that both the Transformer and BERT models contain attention heads operating in all three of these regimes, as shown in Figure 2.

Figure 3D below plots perplexity against the replaced layers; the red line indicates the perplexity of the original model. The researchers found that perplexity suffers less when lower layers are modified than when higher layers are, and it is almost unaffected for the first layer. This suggests that the attention heads of the first layer can be replaced by heads that simply average, without attention.

Based on these analyses, the researchers replaced the attention heads of the first layer with Gaussian weights whose mean and variance are learned. This yields a positional Gaussian weighted-averaging scheme that needs only two learnable parameters (a mean and a variance) instead of an attention head. Moreover, these heads perform the same averaging operation regardless of the input. The Gaussian weighting is similar to the random Synthesizer head, in which the attention weights are learned directly.

As shown in Figure 2 above, the researchers found that the attention heads in the later layers mainly operate in regime c), i.e., in metastable states. To study these layers, they again replaced the attention heads with averaging operations. The model then performs much worse than when the earlier or middle layers are replaced, as shown in Figure 3D above.
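The equivalence can be checked numerically. The following sketch is my own code rather than the authors' implementation: it applies the Hopfield update ξ_new = X softmax(β X^T ξ) to each query separately, projects the result with W_V, and compares it with the batched scaled dot-product attention formula; all matrices here are random placeholders.

```python
import math
import torch

torch.manual_seed(0)
d_model, d_k, N = 32, 16, 10

Y = torch.randn(N, d_model)              # raw patterns (e.g., token embeddings)
W_Q = torch.randn(d_model, d_k)          # query projection
W_K = torch.randn(d_model, d_k)          # key projection
W_V = torch.randn(d_k, d_k)              # value projection, applied after the update

Q = Y @ W_Q                              # state (query) patterns in the Hopfield space
K = Y @ W_K                              # stored (key) patterns in the Hopfield space
beta = 1.0 / math.sqrt(d_k)

# Hopfield update for each query: xi_new = X softmax(beta * X^T xi),
# where the columns of X are the stored patterns, i.e. X = K^T.
updated = torch.stack([K.t() @ torch.softmax(beta * (K @ xi), dim=0) for xi in Q])
Z_hopfield = updated @ W_V               # project the retrieved patterns

# Transformer-style attention in one shot: softmax(Q K^T / sqrt(d_k)) V with V = K @ W_V
# (here the values are a projection of the keys; merging the two matrices gives the usual form).
Z_attention = torch.softmax(Q @ K.t() / math.sqrt(d_k), dim=-1) @ (K @ W_V)

print(torch.allclose(Z_hopfield, Z_attention, atol=1e-5))   # True: same computation
```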
At around 9,000 updates, the loss curve drops sharply, which coincides with attention heads shifting to the other regimes. By contrast, the attention heads in the later layers keep learning after this drop and move closer to regime c).

Finally, layer normalization is closely tied to adjusting the most important parameters of the modern Hopfield network. The researchers identify β as a key parameter that determines the fixed points and thus the operating regime of the attention heads. The behavior of the Hopfield network is jointly determined by β, M, m_max, and ‖m_x‖: lower values of β induce global averaging, while higher values of β favor metastable states. Adjusting β or M is equivalent to adjusting the gain parameter of layer normalization. Layer normalization therefore controls the most important parameters of the Hopfield network: β, M, m_max, and ‖m_x‖.

In the experimental section, the researchers confirm that the modern Hopfield network can act as a Transformer-like attention mechanism in immune repertoire classification and in large-scale multiple instance learning. Theorem 3 of the paper shows that the modern Hopfield network has exponential storage capacity, which allows it to tackle large-scale multiple instance learning problems such as immune repertoire classification. The effectiveness of the modern Hopfield network is confirmed in a large-scale comparative study. In addition, a new PyTorch implementation of the Hopfield layer is provided.
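As an illustration of how such a layer can serve as Transformer-like attention for multiple instance learning (the setting of the immune repertoire experiments), here is a minimal, self-contained PyTorch sketch of a Hopfield-style pooling module with a learned query; it is not the authors' published Hopfield layer API, and all names and sizes are made up for the example.

```python
import torch
import torch.nn as nn

class HopfieldPoolingSketch(nn.Module):
    """Pools a bag of instance embeddings into one vector via a Hopfield/attention update.

    A learned state (query) pattern retrieves from the bag, so the whole bag acts as
    the stored patterns of a modern Hopfield network; beta plays the role of 1/sqrt(d_k).
    """

    def __init__(self, d_in: int, d_hidden: int, beta: float = 1.0):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, d_hidden))   # learned state pattern
        self.W_K = nn.Linear(d_in, d_hidden, bias=False)
        self.W_V = nn.Linear(d_in, d_hidden, bias=False)
        self.beta = beta

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        # bag: (num_instances, d_in), e.g. embeddings of all sequences in one repertoire
        K = self.W_K(bag)                                      # stored patterns
        V = self.W_V(bag)
        attn = torch.softmax(self.beta * self.query @ K.t(), dim=-1)
        return (attn @ V).squeeze(0)                           # (d_hidden,) bag representation

# toy usage: classify a bag of 1,000 instances with a linear head on the pooled vector
pool = HopfieldPoolingSketch(d_in=32, d_hidden=64, beta=0.125)
head = nn.Linear(64, 2)
bag = torch.randn(1000, 32)
logits = head(pool(bag))
print(logits.shape)   # torch.Size([2])
```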