
Jürgen Schmidhuber (Sep 2, 2020, updated 2022, 2025)
Pronounce: You_again Shmidhoobuh
Twitter: @SchmidhuberAI

2010: Breakthrough of end-to-end deep learning. No layer-by-layer training. No unsupervised pre-training. The rest is history.


In 2025, we are celebrating the 15-year anniversary of our publication [MLP1] in Neural Computation (2010) on deep multilayer perceptrons trained end-to-end by plain gradient descent on NVIDIA GPUs (no incremental layer-by-layer training, no unsupervised pre-training). Surprisingly, our simple but unusually deep supervised artificial neural network (NN) outperformed all previous methods on the then famous machine learning benchmark MNIST. So, in 2010, when compute was 1000 times more expensive than in 2025, both our feedforward NNs and our (earlier) recurrent NNs (e.g., CTC-LSTM for connected handwriting recognition) were able to beat all the competing algorithms on what were considered the key problems of the time. In the 2010s, this deep learning revolution quickly spread from Europe to America and Asia. Today's training sets are much bigger: in 2010, it was just MNIST; now it's the entire Internet!


Just 15 years ago, many thought that deep NNs could not learn much without layer-by-layer training (1965) [DEEP1-2] and unsupervised pre-training (a technique I introduced in 1991 [UN0-UN3] [UN] [DLH] [DLP], later built on by others, e.g., [UN4-5] [VID1] [NOB] [DLP] [T22]). In fact, it was claimed [VID1] that "nobody in their right mind would ever suggest" to use plain gradient descent through backpropagation [BP1] to train feedforward NNs (FNNs) with many layers of neurons (see also [BPA-C] [BP2-6] [R7]).

However, in March 2010, our team with my outstanding Romanian postdoc Dan Ciresan [MLP1] showed that deep FNNs can indeed be trained end-to-end by plain backpropagation for important applications. This required neither unsupervised pre-training nor Ivakhnenko's incremental layer-wise training of 1965 [DEEP1-2]. By the end-to-end deep learning standards of 2010, our supervised NN had many layers. It set a new performance record [MLP1] on the then famous and widely used image recognition benchmark MNIST [MNI]. This was achieved by greatly accelerating traditional multilayer perceptrons on NVIDIA's highly parallel graphics processing units (GPUs), going beyond the important GPU work of Jung & Oh (2004) [GPUNN]. A reviewer called this a "wake-up call to the machine learning community."
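To make the recipe concrete, here is a minimal modern sketch of the basic idea: a plain deep multilayer perceptron trained end-to-end by gradient descent, with no layer-wise training and no unsupervised pre-training. PyTorch is assumed (it did not exist in 2010; the original work relied on a carefully hand-tuned GPU implementation [MLP1], see also the remark on GPU internals below), and the layer sizes, activation functions and hyperparameters are merely illustrative, not those of [MLP1].

```python
# Minimal sketch: a deep MLP trained end-to-end by plain backpropagation.
# Assumes PyTorch and MNIST-like inputs (28x28 grayscale digits, 10 classes);
# layer sizes and hyperparameters are illustrative, not those of [MLP1].
import torch
import torch.nn as nn

torch.manual_seed(0)

# A plain deep MLP: no convolution, no pre-training, no layer-wise growing.
model = nn.Sequential(
 nn.Flatten(),
 nn.Linear(28 * 28, 1000), nn.Tanh(),
 nn.Linear(1000, 500), nn.Tanh(),
 nn.Linear(500, 250), nn.Tanh(),
 nn.Linear(250, 10), # one output unit per digit class
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Stand-in random batch; in practice one iterates over MNIST training images.
images = torch.rand(64, 1, 28, 28)
labels = torch.randint(0, 10, (64,))

for step in range(100):
 optimizer.zero_grad()
 loss = loss_fn(model(images), labels)
 loss.backward() # plain gradient descent through all layers at once
 optimizer.step()
```

The point of the sketch is only that the entire stack of layers is trained jointly by backpropagation from the output error, exactly the setting that was widely considered hopeless before 2010.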

Our results set the stage for the recent decade of deep learning [DEC]. In February 2011, our team extended the approach to deep Convolutional NNs (CNNs) [GPUCNN1]. This greatly improved earlier work on shallow GPU-based CNNs [GPUCNN]. Our so-called DanNet [GPUCNN1] [R6] broke several benchmark records [DAN]. In May 2011, DanNet was the first deep CNN to win a computer vision competition [GPUCNN5,3]. For a while, it enjoyed a monopoly: from 2011 to 2012 it won every contest it entered, four of them in a row (15 May 2011, 6 Aug 2011, 1 Mar 2012, 10 Sep 2012), driven by a very fast GPU implementation. In August 2011, it was the first to win a vision contest with superhuman performance [GPUCNN5] [DAN1]. In July 2012, our CVPR paper on DanNet [GPUCNN3] hit the computer vision community. Our team kept winning vision contests in 2012 [GPUCNN5]. Subsequently, many researchers adopted this technique.

By May 2015, we had the first extremely deep FNNs with more than 100 layers [HW1] (see also later work [HW2] [HW3]).
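As described under [HW1] in the references below, each non-input layer of a highway network computes g(x)x + t(x)h(x) for nonlinear differentiable functions g, t, h, and ResNets correspond to the special case where the gates are fixed open. The following sketch (PyTorch assumed; class name, layer width, and initialization constants are merely illustrative, not taken from [HW1]) shows one such layer.

```python
# Sketch of one highway layer computing g(x)*x + t(x)*h(x), cf. [HW1].
# PyTorch assumed; class and parameter names are illustrative.
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
 def __init__(self, dim: int):
 super().__init__()
 self.h = nn.Linear(dim, dim) # candidate transformation h(x)
 self.t = nn.Linear(dim, dim) # transform gate t(x)
 self.g = nn.Linear(dim, dim) # carry gate g(x)
 # One plausible initialization: bias the layer towards carrying x
 # (open carry gate, nearly closed transform gate) early in training.
 nn.init.constant_(self.g.bias, 2.0)
 nn.init.constant_(self.t.bias, -2.0)

 def forward(self, x: torch.Tensor) -> torch.Tensor:
 h = torch.tanh(self.h(x))
 t = torch.sigmoid(self.t(x))
 g = torch.sigmoid(self.g(x))
 # With g(x) = t(x) = 1 (gates always open), this reduces to the
 # residual layer x + h(x) used by ResNets [HW2].
 return g * x + t * h

x = torch.randn(8, 64)
layer = HighwayLayer(64)
print(layer(x).shape) # torch.Size([8, 64])
```

Stacking many such layers keeps a gated identity path from input to output, which is what made training nets with over 100 layers feasible.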


Jensen Huang, CEO of NVIDIA, which became the world's most valuable company in 2024, and the world's first 4ドルT company in 2025, and Juergen Schmidhuber. In 2010, NVIDIA was mostly about improving video games. Today, it's mostly about end-to-end deep learning for AI. See tweet of 2023: by 2023, compute was 100+ times cheaper than in 2010, and NVIDIA 100+ times more valuable.

The original success of 2010 required a precise understanding of the inner workings of GPUs [MLP1] [GPUCNN1]. Today (2025), convenient software packages shield the user from such details. Compute is roughly 1000 times cheaper than 15 years ago, and many commercial NN applications are based on what started in 2010 [MLP1-3] [DL1-4] [DEC].

And of course, today's training sets are much bigger: in 2010, it was just MNIST; 15 years later, in 2025, it's the entire WWW!

Right before the 2010s, our team had already achieved another breakthrough in supervised deep learning with the more powerful recurrent NNs (RNNs) whose basic architectures were introduced in the 1920s [L20] [I25] [K41] [MC43] [W45] [K56] [AMH1-2]. My PhD student Alex Graves won three connected handwriting competitions (French, Farsi, Arabic) at ICDAR 2009, the famous conference on document analysis and recognition. He used a combination of two methods developed in my research groups at TU Munich and the Swiss AI Lab IDSIA: Supervised LSTM RNNs (1990s-2005) [LSTM0-6] (which overcome the famous vanishing gradient problem analyzed by my brilliant student Sepp Hochreiter [VAN1] in 1991) and Connectionist Temporal Classification [CTC] (2006). CTC-trained LSTM was the first RNN to win international contests. Compare Sec. 4 of [MIR], Sec. A & B & XVII of [T22], and [DLH] [DLP].
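For readers who want to see the two ingredients of this combination side by side, here is a minimal modern sketch of an LSTM trained with the CTC loss on unsegmented label sequences. PyTorch is assumed; all sizes and names are illustrative, and the competition-winning systems were far more elaborate (e.g., multidimensional LSTM operating on raw images [LSTM6]).

```python
# Sketch of CTC-trained LSTM for unsegmented sequence labelling,
# the combination behind the 2009 handwriting competition wins [LSTM5][CTC].
# PyTorch assumed; sizes and names are illustrative only.
import torch
import torch.nn as nn

num_classes = 80 # e.g., characters; index 0 is the CTC "blank"
features, hidden = 32, 128

lstm = nn.LSTM(features, hidden, bidirectional=True) # input: (T, N, features)
proj = nn.Linear(2 * hidden, num_classes) # per-frame class scores
ctc = nn.CTCLoss(blank=0)

T, N, S = 100, 4, 12 # input frames, batch size, target label length
x = torch.randn(T, N, features) # e.g., pen-stroke or image features
targets = torch.randint(1, num_classes, (N, S)) # label sequences (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

out, _ = lstm(x)
log_probs = proj(out).log_softmax(dim=-1) # (T, N, num_classes)

# CTC sums over all alignments of the label sequence to the input frames,
# so no per-frame segmentation of the handwriting is needed.
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

The LSTM provides the long-range memory over the input sequence; CTC removes the need to segment the handwriting into characters before training.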

That is, by 2010, both our supervised FNNs and our supervised RNNs were able to outperform all other methods on important problems. In the 2010s, this supervised deep learning revolution quickly spread from Europe to North America and Asia, with enormous impact on industry and daily life [DL4] [DEC] [MOST]. However, it should be mentioned that the conceptual roots of deep learning reach back deep into the previous millennium [DLH] [DEEP1-2] [DL1-2] [MIR](Sec. 21 & Sec. 19) [T22](e.g., Sec. II & D)[NOB] [DLP] [MOST].

Finally, let me emphasize that the above-mentioned supervised deep learning revolutions of the early 1990s (for recurrent NNs) [MIR] and of 2010 (for feedforward NNs) [MLP1-3] did not at all kill unsupervised learning. For example, pre-trained language models are now heavily used by Transformers (see the T in ChatGPT). Transformers excel in the traditional LSTM domain of Natural Language Processing [TR1-6] (although there are still many language tasks that LSTM can quickly learn to solve [LSTM13] while plain Transformers can't). Remarkably, the 1991 unnormalized linear Transformer [ULTRA] was also first published [FWP0-7] in our Annus Mirabilis of 1990-1991 [MIR] [MOST], together with unsupervised pre-training for deep learning and neural network distillation [UN-UN3]. And our unsupervised generative adversarial NNs since 1990 [AC90-AC20] [PLAN] [AC] are still used to endow agents with artificial curiosity [MIR](Sec. 5 & Sec. 6)—see also a version of our adversarial NNs [AC90b] called GANs [AC20] [R2] [PLAN] [MOST] [T22](Sec. XVII) [DLP] [DLH]. Unsupervised learning still has a bright future!
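As a technical aside on the 1991 unnormalized linear Transformer mentioned above: as explained under [FWP0-2] [FWP6] [ULTRA] in the references, a fast weight matrix is updated by additive outer products of (what are now called) values and keys and then queried, which is equivalent to self-attention without softmax normalization. A minimal sketch of this equivalence (PyTorch assumed; dimensions are illustrative):

```python
# Sketch: additive outer-product fast weights = unnormalized linear attention.
# Cf. [FWP0-1] (1991) and the "linear Transformer" view [FWP6][ULTRA].
# PyTorch assumed; dimensions are illustrative.
import torch

torch.manual_seed(0)
T, d = 6, 4 # sequence length, key/value dimension
keys = torch.randn(T, d)
values = torch.randn(T, d)
queries = torch.randn(T, d)

# Fast weight programmer view: a "slow" net would produce keys/values/queries;
# the fast weight matrix W is programmed by additive outer products.
W = torch.zeros(d, d)
fwp_out = []
for t in range(T):
 W = W + torch.outer(values[t], keys[t]) # additive outer-product update
 fwp_out.append(W @ queries[t]) # retrieval with the query
fwp_out = torch.stack(fwp_out)

# Unnormalized linear attention view: attend to all steps so far with
# dot-product weights, but without softmax (no normalization).
attn_out = []
for t in range(T):
 scores = keys[: t + 1] @ queries[t] # (t+1,)
 attn_out.append(scores @ values[: t + 1]) # weighted sum of values
attn_out = torch.stack(attn_out)

print(torch.allclose(fwp_out, attn_out, atol=1e-5)) # True: identical outputs
```

Both loops compute the same outputs; the fast-weight view makes explicit that the per-step state is a single d-by-d matrix rather than the whole history of keys and values.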


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


References

[MIR] J. Schmidhuber (AI Blog, Oct 2019, updated 2025). Deep Learning: Our Miraculous Year 1990-1991. Preprint arXiv:2005.05744. The deep learning neural networks (NNs) of our team have revolutionised pattern recognition & machine learning & AI. Many of the basic ideas behind this revolution were published within fewer than 12 months in our "Annus Mirabilis" 1990-1991 at TU Munich, including principles of (1) LSTM, the most cited AI of the 20th century (based on constant error flow through residual connections); (2) ResNet, the most cited AI of the 21st century (based on our LSTM-inspired Highway Network, 10 times deeper than previous NNs); (3) GAN (for artificial curiosity and creativity); (4) Transformer (the T in ChatGPT—see the 1991 Unnormalized Linear Transformer); (5) Pre-training for deep NNs (the P in ChatGPT); (6) NN distillation (see DeepSeek); (7) recurrent World Models, and more.

[MLP1] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation 22(12): 3207-3220, 2010. ArXiv Preprint. Showed that plain backprop for deep standard NNs is sufficient to break benchmark records, without any unsupervised pre-training.

[MLP2] J. Schmidhuber (AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training. By 2010, when compute was 100 times more expensive than today, both the feedforward NNs[MLP1] and the earlier recurrent NNs of Schmidhuber's team were able to beat all competing algorithms on important problems of that time.

[MLP3] J. Schmidhuber (AI Blog, 2025). 2010: Breakthrough of end-to-end deep learning (no layer-by-layer training, no unsupervised pre-training). The rest is history. By 2010, when compute was 1000 times more expensive than in 2025, both our feedforward NNs[MLP1] and our earlier recurrent NNs were able to beat all competing algorithms on important problems of that time. This deep learning revolution quickly spread from Europe to North America and Asia.

[MOST] J. Schmidhuber (AI Blog, 2021, updated 2025). The most cited neural networks all build on work done in my labs: 1. Long Short-Term Memory (LSTM), the most cited AI of the 20th century. 2. ResNet (open-gated Highway Net), the most cited AI of the 21st century. 3. AlexNet & VGG Net (the similar but earlier DanNet of 2011 won 4 image recognition challenges before them). 4. GAN (an instance of Adversarial Artificial Curiosity of 1990). 5. Transformer variants—see the 1991 unnormalised linear Transformer (ULTRA). Foundations of Generative AI were published in 1991: the principles of GANs (now used for deepfakes), Transformers (the T in ChatGPT), Pre-training for deep NNs (the P in ChatGPT), NN distillation, and the famous DeepSeek—see the tweet.

[MNI] Y. LeCun (1998). The MNIST database of handwritten digits. Link.

[AMH1] S. I. Amari (1972). Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Transactions on Computers, C-21, 1197-1206, 1972. PDF. First publication of what was later sometimes called the Hopfield network[AMH2] or Amari-Hopfield Network,[AMH3] based on the (uncited) Lenz-Ising recurrent architecture.[L20] [I25] [T22] See also Little's work (1974-1980)[AMH1b-d] and this tweet.

[AMH2] J. J. Hopfield (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. of the National Academy of Sciences, vol. 79, pages 2554-2558, 1982. The so-called Hopfield network or Amari-Hopfield Network was first published in 1972 by Amari.[AMH1] [AMH2] did not cite [AMH1].

[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention[ATT0-2] (1990) and soft attention (1991-93).[FWP] Today, both types are very popular.

[DEC] J. Schmidhuber (AI Blog, 02/20/2020; revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The recent decade's most important developments and industrial applications based on our AI, with an outlook on the 2020s, also addressing privacy and data markets.

[DL1] J. Schmidhuber, 2015. Deep learning in neural networks: An overview. Neural Networks, 61, 85-117. More. Got the first Best Paper Award ever issued by the journal Neural Networks, founded in 1988.

[DL2] J. Schmidhuber, 2015. Deep Learning. Scholarpedia, 10(11):32832.

[DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By 2015-17, neural nets developed in Schmidhuber's labs were on over 3 billion devices such as smartphones, and used many billions of times per day, consuming a significant fraction of the world's compute. Examples: greatly improved (CTC-based) speech recognition on all Android phones, greatly improved machine translation through Google Translate and Facebook (over 4 billion LSTM-based translations per day), Apple's Siri and Quicktype on all iPhones, the answers of Amazon's Alexa, etc. Google's 2019 on-device speech recognition (on the phone, not the server) is still based on LSTM.

[DLH] J. Schmidhuber (AI Blog, 2022). Annotated History of Modern AI and Deep Learning. Technical Report IDSIA-22-22, IDSIA, Lugano, Switzerland, 2022. Preprint arXiv:2212.11279. Tweet of 2022.

[DLP] J. Schmidhuber (AI Blog, 2023). How 3 Turing awardees republished key methods and ideas whose creators they failed to credit.. Technical Report IDSIA-23-23, Swiss AI Lab IDSIA, 14 Dec 2023. Tweet of 2023.

[FWP] J. Schmidhuber (AI Blog, 26 March 2021, updated 2025). 26 March 1991: Neural nets learn to program neural nets with fast weights—like Transformer variants. 2021: New stuff! 30-year anniversary of a now popular alternative[FWP0-1] to recurrent NNs. A slow feedforward NN learns by gradient descent to program the changes of the fast weights[FAST,FASTa] of another NN, separating memory and control like in traditional computers. Such Fast Weight Programmers[FWP0-6,FWPMETA1-8] can learn to memorize past data, e.g., by computing fast weight changes through additive outer products of self-invented activation patterns[FWP0-1] (now often called keys and values for self-attention[TR1-6] ). The similar Transformers[TR1-2] combine this with projections and softmax and are now widely used in natural language processing. For long input sequences, their efficiency was improved through Transformers with linearized self-attention[TR5-6] which are formally equivalent to Schmidhuber's 1991 outer product-based Fast Weight Programmers (apart from normalization), now called unnormalized linear Transformers.[ULTRA] In 1993, he introduced the attention terminology[FWP2] now used in this context,[ATT] and extended the approach to RNNs that program themselves. See tweet of 2022.

[FWP0] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Technical Report FKI-147-91, Institut für Informatik, Technische Universität München, 26 March 1991. PDF. First paper on fast weight programmers that separate storage and control: a slow net learns by gradient descent to compute weight changes of a fast net. The outer product-based version (Eq. 5) is now known as an unnormalized linear Transformer or "Transformer with linearized self-attention."[FWP]

[FWP1] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131-139, 1992. Based on [FWP0]. PDF. HTML. Pictures (German). See tweet of 2022 for 30-year anniversary.

[FWP2] J. Schmidhuber. Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460-463. Springer, 1993. PDF. First recurrent NN-based fast weight programmer using outer products (a recurrent extension of the 1991 unnormalized linear Transformer), introducing the terminology of learning "internal spotlights of attention."

[FWP6] I. Schlag, K. Irie, J. Schmidhuber. Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174.

[FWP7] K. Irie, I. Schlag, R. Csordas, J. Schmidhuber. Going Beyond Linear Transformers with Recurrent Fast Weight Programmers. Preprint: arXiv:2106.06295 (June 2021).

[VID1] G. Hinton. The Next Generation of Neural Networks. Youtube video [see 28:16]. GoogleTechTalk, 2007. Quote: "Nobody in their right mind would ever suggest" to use plain backpropagation for training deep networks. But in 2010, our [MLP1] showed that unsupervised pre-training is not necessary to train deep feedforward nets.

[T22] J. Schmidhuber (AI Blog, 2022). Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21 (v3), IDSIA, Lugano, Switzerland, 22 June 2022.

[NOB] J. Schmidhuber. A Nobel Prize for Plagiarism. Technical Report IDSIA-24-24 (7 Dec 2024). Sadly, the Nobel Prize in Physics 2024 for Hopfield & Hinton is a Nobel Prize for plagiarism. They republished methodologies developed in Ukraine and Japan by Ivakhnenko and Amari in the 1960s & 1970s, as well as other techniques, without citing the original papers. Even in later surveys, they didn't credit the original inventors (thus turning what may have been unintentional plagiarism into a deliberate form). None of the important algorithms for modern Artificial Intelligence were created by Hopfield & Hinton. See also popular tweet1, tweet2, and LinkedIn post.

[I25] E. Ising (1925). Beitrag zur Theorie des Ferromagnetismus. Z. Phys., 31 (1): 253-258, 1925. First non-learning recurrent NN architecture: the Lenz-Ising model.

[K41] H. A. Kramers and G. H. Wannier (1941). Statistics of the Two-Dimensional Ferromagnet. Phys. Rev. 60, 252 and 263, 1941.

[W45] G. H. Wannier (1945). The Statistical Problem in Cooperative Phenomena. Rev. Mod. Phys. 17, 50.

[K56] S.C. Kleene. Representation of Events in Nerve Nets and Finite Automata. Automata Studies, Editors: C.E. Shannon and J. McCarthy, Princeton University Press, p. 3-42, Princeton, N.J., 1956.

[L20] W. Lenz (1920). Beiträge zum Verständnis der magnetischen Eigenschaften in festen Körpern. Physikalische Zeitschrift, 21: 613-615.

[MC43] W. S. McCulloch, W. Pitts. A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics, Vol. 5, p. 115-133, 1943.

[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. More on the Fundamental Deep Learning Problem.

[LSTM0] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. TR FKI-207-95, TUM, August 1995. PDF.

[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF. Based on [LSTM0]. More.

[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000. PDF. [The "vanilla LSTM architecture" that everybody is using today, e.g., in Google's Tensorflow.]

[LSTM3] A. Graves, J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:5-6, pp. 602-610, 2005. PDF.

[LSTM5] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009. PDF.

[LSTM6] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. NIPS'22, p 545-552, Vancouver, MIT Press, 2009. PDF.

[LSTM13] F. A. Gers and J. Schmidhuber. LSTM Recurrent Networks Learn Simple Context Free and Context Sensitive Languages. IEEE Transactions on Neural Networks 12(6):1333-1340, 2001. PDF.

[CTC] A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 06, Pittsburgh, 2006. PDF.


[HW1] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The first working very deep feedforward nets with over 100 layers (previous NNs had at most a few tens of layers). Let g, t, h, denote non-linear differentiable functions. Each non-input layer of a highway net computes g(x)x + t(x)h(x), where x is the data from the previous layer. (Like LSTM with forget gates[LSTM2] for RNNs.) The later Resnets[HW2] are a variant of this where the gates are always open: g(x)=t(x)=const=1. Highway Nets perform roughly as well as ResNets[HW2] on ImageNet.[HW3] Variants of highway gates are also used for certain algorithmic tasks, where the simpler residual layers do not work as well.[NDR] More.

[HW1a] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Presentation at the Deep Learning Workshop, ICML'15, July 10-11, 2015. Link.

[HW2] He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. Preprint arXiv:1512.03385 (Dec 2015). Residual nets are a variant of Highway Nets[HW1] where the gates are always open: g(x)=1 (a typical highway net initialization) and t(x)=1. More.

[HW3] K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint arxiv:1612.07771 (2016). Also at ICLR 2017.

[DAN] J. Schmidhuber (AI Blog, 2021). 10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named after Schmidhuber's outstanding postdoc Dan Ciresan, it was the first deep and fast CNN to win international computer vision contests, and had a temporary monopoly on winning them, driven by a very fast implementation based on graphics processing units (GPUs). 1st superhuman result in 2011.[DAN1] Now everybody is using this approach.


[DAN1] J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition. At the IJCNN 2011 computer vision competition in Silicon Valley, the artificial neural network called DanNet performed twice better than humans, three times better than the closest artificial competitor (from LeCun's team), and six times better than the best non-neural method.


[GPUNN] Oh, K.-S. and Jung, K. (2004). GPU implementation of neural networks. Pattern Recognition, 37(6):1311-1314. [Speeding up traditional NNs on GPU by a factor of 20.]

[GPUCNN] K. Chellapilla, S. Puri, P. Simard. High performance convolutional neural networks for document processing. International Workshop on Frontiers in Handwriting Recognition, 2006. [Speeding up shallow CNNs on GPU by a factor of 4.]

[GPUCNN1] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint. Speeding up deep CNNs on GPU by a factor of 60. Used to win four important computer vision competitions 2011-2012 before others won any with similar approaches.

[GPUCNN2] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. A Committee of Neural Networks for Traffic Sign Classification. International Joint Conference on Neural Networks (IJCNN-2011, San Francisco), 2011. PDF. HTML overview. First superhuman performance in a computer vision contest, with half the error rate of humans, and one third the error rate of the closest competitor.[DAN1] This led to massive interest from industry.

[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.

[GPUCNN4] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 25, MIT Press, Dec 2012. PDF. The paper describes AlexNet, which is similar to the earlier DanNet,[DAN,DAN1] [R6] which was the first pure deep CNN to win computer vision contests in 2011.[GPUCNN2-3,5] AlexNet and VGG Net[GPUCNN9] followed in 2012-2014 (using stochastic delta rule/dropout[Drop1-3] and ReLUs[RELU1] without citation).

[GPUCNN5] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet was the first CNN to win one, and won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.


[R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet.

[ULTRA] References on the 1991 unnormalized linear Transformer (ULTRA): original tech report (1991) [FWP0]. Journal publication (1992) [FWP1]. Recurrent ULTRA extension (1993) introducing the terminology of learning "internal spotlights of attention" [FWP2]. Modern "quadratic" Transformer (2017: "attention is all you need") scaling quadratically in input size [TR1]. Papers of 2020-21 using the terminology "linearized attention" for more efficient "linear Transformers" that scale linearly [TR5,TR6]. 2021 paper [FWP6] pointing out that ULTRA dates back to 1991 [FWP0] when compute was a million times more expensive. ULTRA overview (2021) [FWP]. See the T in ChatGPT! See also surveys [DLH] [DLP], 2022 tweet for ULTRA's 30-year anniversary, and 2024 tweet.

[UN] J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Unsupervised hierarchical predictive coding (with self-supervised target generation) finds compact internal representations of sequential data to facilitate downstream deep learning. The hierarchy can be distilled into a single deep neural network (suggesting a simple model of conscious and subconscious information processing). 1993: solving problems of depth>1000.

[UN0] J. Schmidhuber. Neural sequence chunkers. Technical Report FKI-148-91, Institut für Informatik, Technische Universität München, April 1991. PDF. Unsupervised/self-supervised learning and predictive coding is used in a deep hierarchy of recurrent neural networks (RNNs) to find compact internal representations of long sequences of data, across multiple time scales and levels of abstraction. Each RNN tries to solve the pretext task of predicting its next input, sending only unexpected inputs to the next RNN above. The resulting compressed sequence representations greatly facilitate downstream supervised deep learning such as sequence classification. By 1993, the approach solved problems of depth 1000 [UN2] (requiring 1000 subsequent computational stages/layers—the more such stages, the deeper the learning). A variant collapses the hierarchy into a single deep net. It uses a so-called conscious chunker RNN which attends to unexpected events that surprise a lower-level so-called subconscious automatiser RNN. The chunker learns to understand the surprising events by predicting them. The automatiser uses a neural knowledge distillation procedure to compress and absorb the formerly conscious insights and behaviours of the chunker, thus making them subconscious. The systems of 1991 allowed for much deeper learning than previous methods. More.

[UN1] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992. Based on TR FKI-148-91, TUM, 1991.[UN0] PDF. First working Deep Learner based on a deep RNN hierarchy (with different self-organising time scales), overcoming the vanishing gradient problem through unsupervised pre-training and predictive coding (with self-supervised target generation). Also: compressing or distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its old skills—such approaches are now widely used. See also this tweet of 2022 and the DeepSeek tweet of Jan 2025. More.

[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. An ancient experiment on "Very Deep Learning" with credit assignment across 1200 time steps or virtual layers and unsupervised/self-supervised pre-training for a stack of recurrent NNs can be found here (depth > 1000).

[UN3] J. Schmidhuber, M. C. Mozer, and D. Prelinger. Continuous history compression. In H. Hüning, S. Neuhauser, M. Raus, and W. Ritschel, editors, Proc. of Intl. Workshop on Neural Networks, RWTH Aachen, pages 87-95. Augustinus, 1993.

[UN4] G. E. Hinton, R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 2006. PDF. This work describes unsupervised pre-training of stacks of feedforward NNs (FNNs) called Deep Belief Networks (DBNs). It did not cite the much earlier 1991 unsupervised pre-training of stacks of more general recurrent NNs (RNNs)[UN0-3] which introduced the first NNs shown to solve very deep problems. The 2006 justification of the authors was essentially the one Schmidhuber used for the 1991 RNN stack: each higher level tries to reduce the description length (or negative log probability) of the data representation in the level below.[HIN] [DLP] [T22] [MIR] This can greatly facilitate very deep downstream learning.[UN0-3]

[UN5] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle. Greedy layer-wise training of deep networks. Proc. NIPS 06, pages 153-160, Dec. 2006. The comment under reference[UN4] applies here as well.

[AC] J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Schmidhuber's artificial scientists not only answer given questions but also invent new questions. They achieve curiosity through: (1990) the principle of generative adversarial networks, (1991) neural nets that maximise learning progress, (1995) neural nets that maximise information gain (optimally since 2011), (1997) adversarial design of surprising computational experiments, (2006) maximizing compression progress like scientists/artists/comedians do, (2011) PowerPlay... Since 2012: applications to real robots.

[AC90] J. Schmidhuber. Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990. PDF. The first paper on online planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks where a generator NN is fighting a predictor NN in a minimax game (more).

[AC90b] J. Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In J. A. Meyer and S. W. Wilson, editors, Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 222-227. MIT Press/Bradford Books, 1991. PDF. More.

[AC09] J. Schmidhuber. Art & science as by-products of the search for novel patterns, or data compressible in unknown yet learnable ways. In M. Botta (ed.), Et al. Edizioni, 2009, pp. 98-112. PDF. (More on artificial scientists and artists.)

[AC10] J. Schmidhuber. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2(3):230-247, 2010. IEEE link. PDF.

[AC20] J. Schmidhuber. Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991). Neural Networks, Volume 127, p 58-66, 2020. Preprint arXiv/1906.04493.

[BPA] H. J. Kelley. Gradient Theory of Optimal Flight Paths. ARS Journal, Vol. 30, No. 10, pp. 947-954, 1960.

[BPB] A. E. Bryson. A gradient method for optimizing multi-stage allocation processes. Proc. Harvard Univ. Symposium on digital computers and their applications, 1961.

[BPC] S. E. Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1): 30-45, 1962.

[BP1] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970. See chapters 6-7 and FORTRAN code on pages 58-60. PDF. See also BIT 16, 146-160, 1976. Link. The first publication on "modern" backpropagation, also known as the reverse mode of automatic differentiation.

[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.

[BP2] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin, (eds): System Modeling and Optimization: Proc. IFIP, Springer, 1982. PDF. First application of backpropagation[BP1] to NNs (concretizing thoughts in his 1974 thesis).

[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation? More.[DL2]

[BP5] A. Griewank (2012). Who invented the reverse mode of differentiation? Documenta Mathematica, Extra Volume ISMP (2012): 389-400.

[BP6] S. I. Amari (1977). Neural Theory of Association and Concept Formation. Biological Cybernetics, vol. 26, p. 175-185, 1977. See Section 3.1 on using gradient descent for learning in multilayer networks.

[DEEP1] Ivakhnenko, A. G. and Lapa, V. G. (1965). Cybernetic Predicting Devices. CCM Information Corporation. First working deep learning networks with many layers, learning internal representations through layer-by-layer training and subsequent fine-tuning by pruning superfluous hidden units.

[DEEP1a] Ivakhnenko, Alexey Grigorevich. The group method of data handling; a rival of the method of stochastic approximation. Soviet Automatic Control 13 (1968): 43-55.

[DEEP2] Ivakhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Transactions on Systems, Man and Cybernetics, (4):364-378. This paper describes a deep learning net with 8 layers. Given a training set of input vectors with corresponding target output vectors, layers are incrementally grown and trained by regression analysis. In a fine-tuning phase, superfluous hidden units are pruned with the help of a separate validation set.

[TR1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin (2017). Attention is all you need. NIPS 2017, pp. 5998-6008.

[TR2] J. Devlin, M. W. Chang, K. Lee, K. Toutanova (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. Preprint arXiv:1810.04805.

[TR3] K. Tran, A. Bisazza, C. Monz. The Importance of Being Recurrent for Modeling Hierarchical Structure. EMNLP 2018, p 4731-4736. ArXiv preprint 1803.03585.

[TR4] M. Hahn. Theoretical Limitations of Self-Attention in Neural Sequence Models. Transactions of the Association for Computational Linguistics, Volume 8, p.156-171, 2020.

[TR5] A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret. Transformers are RNNs: Fast autoregressive Transformers with linear attention. In Proc. Int. Conf. on Machine Learning (ICML), July 2020.

[TR6] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, et al. Rethinking attention with Performers. In Int. Conf. on Learning Representations (ICLR), 2021.

