Jürgen Schmidhuber (Jan 2025, F.A.Z. 2024) 
Pronounce: You_again Shmidhoobuh   AI Blog 
Twitter: @SchmidhuberAI

1995-2025: The Decline of Germany & Japan vs US & China. Can All-Purpose Robots Fuel a Comeback?


Abstract. In 1995, in terms of nominal gross domestic product (GDP), a combined Germany and Japan were almost 1:1 economically with a combined USA and China, according to the IMF (see chart above). Only 3 decades later, this ratio is down to 1:5! Self-replicating AI-driven all-purpose robots may be the answer. Around 2000, Japan was still the country with the most robots; Germany was second. Today, China is number 1. However, most existing robots are dumb. They are not adaptive like the coming smart robots that will learn to do all the jobs humans don't like, including making more such robots.

The text below extends an AI-based translation of my 2024 F.A.Z. guest article[FA24] written for a German audience, but its basic ideas apply to all countries with a strong engineering background.

Build the AI-controlled all-purpose robot!

Germany is losing out—partly because performance counts for too little. How can the country keep up in artificial intelligence (AI)? I have a suggestion.

Today's AI seems impressive. But it is nothing compared to what will come in the next 20 years. I said that 40 years ago, and I was right. I said that 20 years ago, and I was right. And I'm saying it again today: in 20 years' time, everything that currently seems impressive will seem trivial in retrospect. A key component of this is AI-controlled general-purpose robots.

Today, everyone is talking about Generative AI and ChatGPT. What many people don't know: the current boom in "Generative AI" using artificial neural networks has its roots in the early nineties at the Technical University of Munich, especially the "G" and the "P" and the "T" in "ChatGPT." At that time, we published "Artificial Curiosity" through what's now called "Generative Adversarial Networks" (1990, now widely used),[AC90-20][DLH][DLP] self-supervised pre-training for deep learning with long texts (1991; the P in ChatGPT stands for "pre-trained"),[UN][UN0-3][NOB] and unnormalized linear Transformers (1991; the T in ChatGPT stands for "Transformer").[ULTRA] The Long Short-Term Memory (LSTM)[LSTM0-15][VAN1]—the most cited AI of the 20th century and "arguably the most commercial AI achievement"[AV1]—was also developed in my lab at TU Munich. Here is an overview including references.[MOST] At that time, Munich was also the birthplace of the first self-driving cars in traffic.[AUT][DLH]

So why are the biggest beneficiaries of AI today not in Germany, but in America and China? Germany has messed this up itself. Research and art follow the money, but unfortunately Germany has spent almost nothing on AI since the 1990s compared to other countries.

Let's take a closer look. Germany has been going downhill since the late 1990s, and not just in AI. In 1995, according to the IMF, Germany still had 33% of the economic power of the USA in nominal terms; today it has only 16%. In 1995, Japan (the origin of foundational neural network breakthroughs of 1967-1988[GD1-2][RELU1][AMH1][NOB][CNN1][CNN1a][CNN1a+]) had 72% of the economic power of the USA; today it has only 15%. In 1995, the two big losers of WW II together were economically stronger than the USA; today they are less than a third as strong.
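A quick back-of-the-envelope check of these ratios, as a minimal Python sketch. Germany's and Japan's shares of US nominal GDP are the percentages quoted above; China's shares (roughly 10% of the USA in 1995 and roughly 64% today) are my own rough approximations from IMF nominal GDP figures and are only meant to be indicative.

```python
# Rough sanity check of the 1:1 (1995) vs 1:5 (today) GDP ratios above.
# All inputs are fractions of US nominal GDP; China's shares are approximate.
def de_jp_vs_us_cn(germany: float, japan: float, china: float) -> float:
    """(Germany + Japan) nominal GDP divided by (USA + China) nominal GDP."""
    return (germany + japan) / (1.0 + china)

print(round(de_jp_vs_us_cn(0.33, 0.72, 0.10), 2))  # 1995:  ~0.95, i.e. roughly 1:1
print(round(de_jp_vs_us_cn(0.16, 0.15, 0.64), 2))  # today: ~0.19, i.e. roughly 1:5
```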
In fact, according to the IMF, Japan (which had the five most valuable listed companies in the world in 1990) and Germany were roughly as strong economically in nominal terms as the much larger USA and China combined! See the title image.

Since then, things have gone downhill in waves. By 2008, Germany had only 25% of the nominal economic power of the USA, but at least the EU as a whole was still as strong as the USA and China combined.[EU08] As a result of the financial crisis, however, Germany's taxpayers then lost enormous sums of money to the USA, for which the crisis was not really a crisis at all but a major financial gain, one that led to the decline in importance of German and other European banks. Since 2015, the relative decline of Germany (and the EU) has progressed particularly rapidly. Since then, Germany has spent a lot of money on things that have yielded little and has become increasingly poorer and less significant compared to the USA. And as I said, research and art follow the money.

Germany was also much stronger militarily in the 1990s than it is today. And even in sport, the decline in performance has become clear. Until 2006, German athletes often led the Olympic medal tables, especially at the Winter Games.[OLY10] Today, they are mostly among the "distant runners-up." Even at the Summer Games, Germany won 90% as many gold medals as the USA in 1992, but only 30% as many in 2024. Its small neighbor, the Netherlands, won more. Why is that? Because Germany abolished its performance incentives. Example: as a pupil, I found the incentive provided by the points system in the National Youth Games enormously motivating: I wanted to be the best in the class. It was one of many performance incentives that politicians have since abolished.

I continue to train excellent young German researchers. But they often don't see any attractive opportunities in Germany afterwards. Instead, many want to go to the best-equipped foreign (mostly American) AI labs of the big platform companies, where they can earn 350,000 euros or more straight after their doctorate, much more than a German chair holder, who receives only a good 100,000 euros plus bonuses. In foreign AI laboratories, researchers are also provided with far more computing resources (very important in AI). They don't have to write research proposals and are still allowed to publish and make a name for themselves. Unfortunately, it looks a bit like my home country no longer wants to, or no longer can, keep up in this merciless global competition for outstanding talent. I can't tell you what a shame I think that is.

The incentives are wrong: many of the best and most expensively trained specialists are leaving the country and are being replaced by others who, with a lot of tax money and false incentives, can contribute little to the country's success. It's a self-reinforcing vicious circle.

What immigration policy would a rational country pursue? One that raises the country's averages through appropriate incentives. If someone comes into the country who is richer than the average of those already there, the average wealth in the country increases, and they are likely to pay more into the social systems than they take out. If they have a higher IQ than the average, the average IQ rises. If they are less prone to crime than the average, the number of crimes per inhabitant decreases. If they speak German better than the average, the German language skills per inhabitant increase. And so on.
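The arithmetic behind this "raising the average" argument is simply the update rule for a mean. A minimal sketch with made-up toy numbers (the values are illustrative only, not data):

```python
# Adding one newcomer above the current mean raises the mean;
# adding one below it lowers the mean. All numbers here are toy values.
def updated_mean(current_mean: float, n: int, newcomer: float) -> float:
    """Mean of a group of n after one newcomer joins."""
    return (current_mean * n + newcomer) / (n + 1)

m, n = 50.0, 1_000    # current average and group size (arbitrary)
print(updated_mean(m, n, 80.0) > m)   # True: above-average newcomer raises the mean
print(updated_mean(m, n, 20.0) < m)   # True: below-average newcomer lowers the mean
```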
Many politicians do not understand these truly simple correlations and instead set well-intentioned but deeply counterproductive incentives that do not raise the important averages but lower them, and thus harm the country. Germany has already worked its way back up from much worse valleys. I therefore remain a cautious optimist. But fundamental changes are needed.

AI in the physical world is still in its infancy

The only AI that works well today is AI in the virtual world behind the screen, for example for automatically summarizing documents and creating images, programs, and PowerPoint slides. The next big thing will be AI in the physical world. However, the latter is much more demanding than the world behind the screen. Today it is quite easy for AI to learn to play chess, Go, or video games superhumanly well. But there is no AI-controlled soccer-playing robot that can keep up with a little boy. There is no robot that can do what a plumber can do. That will come at some point, but AI software research alone is not enough; it has to be combined with the physical world of machines and robots.

That's why in 2014—when compute was 100 times more expensive than today—we founded our AI company for the physical world. Alas, like some of our projects, it may have been a bit ahead of its time, because the real world is very challenging. After all, passing the so-called "Turing Test"[TUR3,a,b][TUR21] is much easier than true AI in the physical world! But every 5 years, compute gets 10 times cheaper, and smart robots are now starting to become a reality. For every 10,000 AI software companies, there are perhaps only 10 AI robot companies. So the field is not yet overcrowded. With its strong mechanical engineering sector, Germany still has a chance of becoming an international leader.

However, I already wrote this six years ago in my F.A.Z. article "AI is a huge opportunity for Germany"[FA18] (see also my earlier 2015 F.A.Z. article on intelligent robots[FA15]). Volker Kauder, the leader of the CDU/CSU parliamentary group at the time, said that everyone had to read the article. I was invited to the Reichstag, where many famous politicians listened to what I had to say on the subject. I suggested investing a few billion in a world-class AI campus in an attractive city as a basis for further investment. The proposal was well received, and first steps were considered. A few weeks later, however, Kauder was no longer leader of the parliamentary group, and everything came to nothing. While the major powers are now investing hundreds of billions in AI, Germany prefers to spend hundreds of billions on unemployed immigrants. I can only recommend that our politicians rethink the incentives in this country.

There is a huge opportunity in German mechanical engineering

What can Germany do to get back on its feet and avoid a further exodus of the best? How about a big, visionary yet realistic national project that, if successful, would have a truly world-changing impact? Namely: build an all-purpose robot that can learn to do all the jobs humans don't like! In the not-too-distant future, someone will for the first time produce such intelligent (but not necessarily super-intelligent) robots at low cost: robots you can talk to and interact with, and which you can teach something new without much prior knowledge (there are already approaches to this).
A country with such versatile all-purpose robots would no longer need to worry about a shortage of skilled workers, about secure pensions, or about an unconditional basic income. AI-controlled general-purpose robots would of course also be extremely exportable, as everyone would like to have them to do thousands of inconvenient jobs. And they would be extremely scalable: robots that can operate the tools and machines operated by humans can also build (and repair when needed) more of their own kind. I have called this the ultimate form of scaling.[JY24] The country that, through a combination of private initiative, universities, and industrial policy, is the first to produce such general-purpose robots will change world history. Let's go, Germany!

The image below is an old PowerPoint slide from a talk I gave in 2002 (click on it for more info).

Acknowledgments

Thanks to several AI experts for useful comments. Since science is about self-correction, let me know at juergen@idsia.ch if you can spot any remaining error. The contents of this article may be used for educational and non-commercial purposes, including articles for Wikipedia and similar sites. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

References

[AC] J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Schmidhuber's artificial scientists not only answer given questions but also invent new questions. They achieve curiosity through: (1990) the principle of generative adversarial networks, (1991) neural nets that maximise learning progress, (1995) neural nets that maximise information gain (optimally since 2011), (1997) adversarial design of surprising computational experiments, (2006) maximizing compression progress like scientists/artists/comedians do, (2011) PowerPlay... Since 2012: applications to real robots.

[AC90] J. Schmidhuber. Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990. PDF. The first paper on online planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks where a generator NN is fighting a predictor NN in a minimax game (more).

[AC90b] J. Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In J. A. Meyer and S. W. Wilson, editors, Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 222-227. MIT Press/Bradford Books, 1991. PDF. More.

[AC09] J. Schmidhuber. Art & science as by-products of the search for novel patterns, or data compressible in unknown yet learnable ways. In M. Botta (ed.), Et al. Edizioni, 2009, pp. 98-112. PDF. (More on artificial scientists and artists.)

[AC10] J. Schmidhuber. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2(3):230-247, 2010. IEEE link. PDF.

[AC20] J. Schmidhuber. Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991). Neural Networks, Volume 127, p 58-66, 2020. Preprint arXiv/1906.04493.

[AMH1] S. I. Amari (1972). Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Transactions, C 21, 1197-1206, 1972. PDF.
First publication of what was later sometimes called the Hopfield network[AMH2] or Amari-Hopfield Network,[AMH3] based on the (uncited) Lenz-Ising recurrent architecture.[L20][I25][T22][NOB] See also Little's work (1974-1980)[AMH1b-d] and this tweet. [AMH1b] W. A. Little. The existence of persistent states in the brain. Mathematical Biosciences, 19.1-2, p. 101-120, 1974. Little uses Wannier's ideas of the 1940s[K41][W45] to express neural networks, and mentions the recurrent Ising model[L20][I25]on which the (uncited) Amari network[AMH1,2] is based. [AMH1c] W. A. Little and G. L. Shaw (1978). Analytic Study of the Memory Capacity of a Neural Network. Math Biosci. 39, 281–290 (1978). This paper shows explicitly how to store-recall patterns with the Ising-Lenz model. [AMH1d] W. A. Little (1980). An Ising Model of a Neural Network. In: W. Jaeger, H. Rost, P. Tautu (eds), Biological Growth and Spread. Lecture Notes in Biomathematics, vol 38. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-61850-5_18 [AMH2] J. J. Hopfield (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. of the National Academy of Sciences, vol. 79, pages 2554-2558, 1982. The Hopfield network or Amari-Hopfield Network was first published in 1972 by Amari.[AMH1] [AMH2] did not cite [AMH1]. [AMH3] A. P. Millan, J. J. Torres, J. Marro. How Memory Conforms to Brain Development. Front. Comput. Neuroscience, 2019 [ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. Schmidhuber had both hard attention for foveas (1990) and soft attention in form of Transformers with linearized self-attention (1991-93).[FWP] Today, both types are very popular. [ATT0] J. Schmidhuber and R. Huber. Learning to generate focus trajectories for attentive vision. Technical Report FKI-128-90, Institut für Informatik, Technische Universität München, 1990. PDF. [ATT1] J. Schmidhuber and R. Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(1 & 2):135-141, 1991. Based on TR FKI-128-90, TUM, 1990. PDF. More. [ATT2] J.  Schmidhuber. Learning algorithms for networks with internal and external feedback. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton, editors, Proc. of the 1990 Connectionist Models Summer School, pages 52-61. San Mateo, CA: Morgan Kaufmann, 1990. PS. (PDF.) [AUT] J.  Schmidhuber (AI Blog, 2005). Highlights of robot car history. Around 1986, Ernst Dickmanns and his group at Univ. Bundeswehr Munich built the world's first real autonomous robot cars, using saccadic vision, probabilistic approaches such as Kalman filters, and parallel computers. By 1994, they were in highway traffic, at up to 180 km/h, automatically passing other cars. [AV1] A. Vance. Google Amazon and Facebook Owe Jürgen Schmidhuber a Fortune—This Man Is the Godfather the AI Community Wants to Forget. Business Week, Bloomberg, May 15, 2018. [BPA] H. J. Kelley. Gradient Theory of Optimal Flight Paths. ARS Journal, Vol. 30, No. 10, pp. 947-954, 1960. Precursor of modern backpropagation.[BP1-4] [BPB] A. E. Bryson. A gradient method for optimizing multi-stage allocation processes. Proc. Harvard Univ. Symposium on digital computers and their applications, 1961. [BPC] S. E. Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1): 30-45, 1962. [BP1] S. Linnainmaa. 
The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970. See chapters 6-7 and FORTRAN code on pages 58-60. PDF. See also BIT 16, 146-160, 1976. Link. The first publication on "modern" backpropagation, also known as the reverse mode of automatic differentiation. [BP2] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin, (eds): System Modeling and Optimization: Proc. IFIP, Springer, 1982. PDF. First application of backpropagation[BP1] to NNs (concretizing thoughts in Werbos' 1974 thesis). [BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation? More.[DL2] [BP5] A. Griewank (2012). Who invented the reverse mode of differentiation? Documenta Mathematica, Extra Volume ISMP (2012): 389-400. [BP6] S. I. Amari (1977). Neural Theory of Association and Concept Formation. Biological Cybernetics, vol. 26, p. 175-185, 1977. See Section 3.1 on using gradient descent for learning in multilayer networks. [CNN1] K. Fukushima: Neural network model for a mechanism of pattern recognition unaffected by shift in position—Neocognitron. Trans. IECE, vol. J62-A, no. 10, pp. 658-665, 1979. The first deep convolutional neural network architecture, with alternating convolutional layers and downsampling layers. In Japanese. English version: [CNN1+]. More in Scholarpedia. [CNN1+] K. Fukushima: Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, vol. 36, no. 4, pp. 193-202 (April 1980). Link. [CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. First application of backpropagation[BP1][BP2] and weight-sharing to a convolutional architecture. [CNN1a+] W. Zhang, J. Tanida, K. Itoh, Y. Ichioka. Shift-invariant pattern recognition neural network and its optical architecture. Proc. Annual Conference of the Japan Society of Applied Physics, 1988. First backpropagation-trained 2D CNN. [CNN1b] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. J. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328-339, March 1989. Based on [CNN1a]. [CNN1c] Bower Award Ceremony 2021: Jürgen Schmidhuber lauds Kunihiko Fukushima. YouTube video, 2021. [CNN2] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4):541-551, 1989. PDF. [CNN3] Weng, J., Ahuja, N., and Huang, T. S. (1993). Learning recognition and segmentation of 3-D objects from 2-D images. Proc. 4th Intl. Conf. Computer Vision, Berlin, Germany, pp. 121-128. A CNN whose downsampling layers use Max-Pooling (which has become very popular) instead of Fukushima's Spatial Averaging.[CNN1] [CNN4] M. A. Ranzato, Y. LeCun: A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images. Proc. ICDAR, 2007 [DAN] J. Schmidhuber (AI Blog, 2021). 10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named after Schmidhuber's outstanding postdoc Dan Ciresan, it was the first deep and fast CNN to win international computer vision contests, and had a temporary monopoly on winning them, driven by a very fast implementation based on graphics processing units (GPUs). 
1st superhuman result in 2011.[DAN1] Now everybody is using this approach. [DAN1] J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition. At the IJCNN 2011 computer vision competition in Silicon Valley, the artificial neural network called DanNet performed twice better than humans, three times better than the closest artificial competitor (from LeCun's team), and six times better than the best non-neural method. [DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The recent decade's most important developments and industrial applications based on the AI of Schmidhuber's team, with an outlook on the 2020s, also addressing privacy and data markets. [DEEP1] Ivakhnenko, A. G. and Lapa, V. G. (1965). Cybernetic Predicting Devices. CCM Information Corporation. First working Deep Learners with many layers, learning internal representations. [DEEP1a] Ivakhnenko, Alexey Grigorevich. The group method of data of handling; a rival of the method of stochastic approximation. Soviet Automatic Control 13 (1968): 43-55. [DEEP2] Ivakhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Transactions on Systems, Man and Cybernetics, (4):364-378. [DL1] J. Schmidhuber, 2015. Deep learning in neural networks: An overview. Neural Networks, 61, 85-117. More. Got the first Best Paper Award ever issued by the journal Neural Networks, founded in 1988. [DL2] J. Schmidhuber, 2015. Deep Learning. Scholarpedia, 10(11):32832. [DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By 2015-17, neural nets developed in my labs were on over 3 billion devices such as smartphones, and used many billions of times per day, consuming a significant fraction of the world's compute. Examples: greatly improved (CTC-based) speech recognition on all Android phones, greatly improved machine translation through Google Translate and Facebook (over 4 billion LSTM-based translations per day), Apple's Siri and Quicktype on all iPhones, the answers of Amazon's Alexa, etc. Google's 2019 on-device speech recognition (on the phone, not the server) is still based on LSTM. [DLH] J. Schmidhuber (AI Blog, 2022). Annotated History of Modern AI and Deep Learning. Technical Report IDSIA-22-22, IDSIA, Lugano, Switzerland, 2022. Preprint arXiv:2212.11279. Tweet of 2022. [DLP] J. Schmidhuber (AI Blog, 2023). How 3 Turing awardees republished key methods and ideas whose creators they failed to credit.. Technical Report IDSIA-23-23, Swiss AI Lab IDSIA, 14 Dec 2023. Tweet of 2023. [EU08] J. Schmidhuber (AI Blog, 2009). A new kind of empire? [FA15] J. Schmidhuber. Intelligente Roboter werden vom Leben fasziniert sein. (Intelligent robots will be fascinated by life.) FAZ, 1 Dec 2015. Link. [FA18] J. Schmidhuber. KI ist eine Riesenchance für Deutschland. (AI is a huge chance for Germany.) FAZ, 2018. Link. [FA24] J. Schmidhuber. Baut den KI-gesteuerten Allzweckroboter! (Build the AI-controlled all-purpose robot!) FAZ, 2024. Link. [FAST] C. v.d. Malsburg. Tech Report 81-2, Abteilung f. Neurobiologie, Max-Planck Institut f. Biophysik und Chemie, Goettingen, 1981. First paper on fast weights or dynamic links. [FASTa] J. A. Feldman. Dynamic connections in neural networks. Biological Cybernetics, 46(1):27-39, 1982. 2nd paper on fast weights. 
[FB17] By 2017, Facebook used LSTM to handle over 4 billion automatic translations per day (The Verge, August 4, 2017); see also Facebook blog by J.M. Pino, A. Sidorov, N.F. Ayan (August 3, 2017) [FWP] J.  Schmidhuber (AI Blog, 26 March 2021, updated 2023). 26 March 1991: Neural nets learn to program neural nets with fast weights—like Transformer variants. 2021: New stuff! In 2022, ChatGPT took the world by storm, generating large volumes of text that are almost indistinguishable from what a human might write.[GPT3] ChatGPT and similar large language models (LLMs) are based on a family of artificial neural networks (NNs) called Transformers.[TR1-2] Already in 1991, when compute was a million times more expensive than today, Schmidhuber published the first Transformer variant, which is now called an unnormalised linear Transformer.[FWP0-1,6][TR5-6] That wasn't the name it got given at the time, but today the mathematical equivalence is obvious. In a sense, computational restrictions drove it to be even more efficient than later "quadratic" Transformer variants,[TR1-2] resulting in costs that scale linearly in input size, rather than quadratically. In the same year, Schmidhuber also introduced self-supervised pre-training for deep NNs, now used to train LLMs (the "P" in "GPT" stands for "pre-trained").[UN][UN0-3] In 1993, he introduced the attention terminology[FWP2] now used in this context,[ATT] and extended the approach to recurrent NNs that program themselves. See tweet of 2022. [FWP0] J.  Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Technical Report FKI-147-91, Institut für Informatik, Technische Universität München, 26 March 1991. PDF. First paper on fast weight programmers that separate storage and control: a slow net learns by gradient descent to compute weight changes of a fast net. The outer product-based version (Eq. 5) is now known as an "unnormalised linear Transformer."[FWP] That wasn't the name it got given at the time, but today the mathematical equivalence is obvious. In a sense, computational restrictions drove it to be even more efficient than later "quadratic" Transformer variants,[TR1-2] resulting in costs that scale linearly in input size, rather than quadratically. [FWP1] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131-139, 1992. Based on [FWP0]. PDF. See tweet of 2022 for 30-year anniversary. Overview. [FWP2] J. Schmidhuber. Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460-463. Springer, 1993. PDF. First recurrent NN-based fast weight programmer using outer products, introducing the terminology of learning "internal spotlights of attention." [FWP6] I. Schlag, K. Irie, J. Schmidhuber. Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174. [FWP7] K. Irie, I. Schlag, R. Csordas, J. Schmidhuber. Going Beyond Linear Transformers with Recurrent Fast Weight Programmers. Preprint: arXiv:2106.06295 (June 2021). [GAN1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio. Generative adversarial nets. NIPS 2014, 2672-2680, Dec 2014. 
A description of GANs that does not cite Schmidhuber's original GAN principle of 1990[AC][AC90,AC90b][AC20][R2][T22][DLP] (also containing wrong claims about Schmidhuber's adversarial NNs for Predictability Minimization[PM0-2][AC20][T22][DLP]). [GD1] S. I. Amari (1967). A theory of adaptive pattern classifier, IEEE Trans, EC-16, 279-307 (Japanese version published in 1965). PDF. Probably the first paper on using stochastic gradient descent for learning in multilayer neural networks (without specifying the specific gradient descent method now known as reverse mode of automatic differentiation or backpropagation[BP1]). [GD2] S. I. Amari (1968). Information Theory—Geometric Theory of Information, Kyoritsu Publ., 1968 (in Japanese). PDF. Contains computer simulation results for a five layer network (with 2 modifiable layers) which learns internal representations to classify non-linearily separable pattern classes. [GPT3] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei. Language Models are Few-Shot Learners (2020). Preprint arXiv/2005.14165. [GPUNN] Oh, K.-S. and Jung, K. (2004). GPU implementation of neural networks. Pattern Recognition, 37(6):1311-1314. Speeding up traditional NNs on GPU by a factor of 20. [GPUCNN] K. Chellapilla, S. Puri, P. Simard. High performance convolutional neural networks for document processing. International Workshop on Frontiers in Handwriting Recognition, 2006. Speeding up shallow CNNs on GPU by a factor of 4. [GPUCNN1] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint. Speeding up deep CNNs on GPU by a factor of 60. Used to win four important computer vision competitions 2011-2012 before others won any with similar approaches. [GPUCNN2] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. A Committee of Neural Networks for Traffic Sign Classification. International Joint Conference on Neural Networks (IJCNN-2011, San Francisco), 2011. PDF. HTML overview. First superhuman performance in a computer vision contest, with half the error rate of humans, and one third the error rate of the closest competitor.[DAN1] This led to massive interest from industry. [GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More. [GPUCNN5] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet was the first CNN to win one, and won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision. [GPUCNN8] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet). First deep learner to win a contest on object detection in large images— first deep learner to win a medical imaging contest (2012). Link. 
How the Swiss AI Lab IDSIA used GPU-based CNNs to win the ICPR 2012 Contest on Mitosis Detection and the MICCAI 2013 Grand Challenge. [GSR] H. Sak, A. Senior, K. Rao, F. Beaufays, J. Schalkwyk—Google Speech Team. Google voice search: faster and more accurate. Google Research Blog, Sep 2015, see also Aug 2015 Google's speech recognition based on CTC and LSTM. [GSR15] Dramatic improvement of Google's speech recognition through LSTM: Alphr Technology, Jul 2015, or 9to5google, Jul 2015 [GSR19] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. Chai Sim, T. Bagby, S. Chang, K. Rao, A. Gruenstein. Streaming end-to-end speech recognition for mobile devices. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019. [GT16] Google's dramatically improved Google Translate of 2016 is based on LSTM, e.g., WIRED, Sep 2016, or siliconANGLE, Sep 2016 [HW1] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The first working very deep feedforward nets with over 100 layers (previous NNs had at most a few tens of layers). Let g, t, h, denote non-linear differentiable functions. Each non-input layer of a highway net computes g(x)x + t(x)h(x), where x is the data from the previous layer. (Like LSTM with forget gates[LSTM2] for RNNs.) The later Resnets[HW2] are a variant of this where the gates are always open: g(x)=t(x)=const=1. Highway Nets perform roughly as well as ResNets[HW2] on ImageNet.[HW3] Variants of highway gates are also used for certain algorithmic tasks, where the simpler residual layers do not work as well.[NDR] More. [HW1a] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Presentation at the Deep Learning Workshop, ICML'15, July 10-11, 2015. Link. [HW2] He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. Preprint arXiv:1512.03385 (Dec 2015). Residual nets are a variant of Highway Nets[HW1] where the gates are always open: g(x)=1 (a typical highway net initialization) and t(x)=1. More. [HW3] K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint arxiv:1612.07771 (2016). Also at ICLR 2017. [JY24] The Father of Generative AI Without Turing Award. Jazzyear.com interviews J. Schmidhuber (August 2024). Quote: "What's next? It’s true AGI in the physical world, not just today’s AI behind the screen. The physical challenges of the real world are far more complex than those of the virtual one. AI still has a long way to go before it can replace skilled trades like plumbers or electricians. However, there is reason to believe AI in the physical world will soon make significant strides. A next major step will be self-replicating and self-improving societies of physical robots and other machines. We already have 3D printers that can print copies of parts of themselves. But no 3D printer can make a complete copy of itself, like a living being. To assemble a complete 3D printer, you need many other machines, for example, to take the raw material out of the ground, to refine it, to make the machines that make the machines that help make many unprintable parts of the 3D printer, to screw those parts together, and so on. Most importantly, you still need lots of people to oversee and manage all this, and to fix broken machines. 
Eventually, however, there will be entire societies of clever and not-so-clever physical machines that can collectively build from scratch all the things needed to make copies of themselves, mine the raw materials they need, repair broken robots and robot factories, and so on. Basically, a machine civilisation that can make copies of itself and then, of course, improve itself. Basically, I am talking about a new form of life, about self-replicating, self-maintaining, self-improving hardware, as opposed to the already existing, self-improving, machine-learning software. There will be enormous commercial pressure to create such life-like hardware, because it represents the ultimate form of scaling, and its owners will become very rich, because economic growth is all about scaling. Of course, such life-like hardware won't be confined to our little biosphere. No, variants of it will soon exist on other planets, or between planets, e.g. in the asteroid belt. As I have said many times in recent decades, space is hostile to humans but friendly to suitably designed robots, and it offers many more resources than our thin layer of biosphere, which receives less than a billionth of the energy of the Sun. Through life-like, self-replicating, self-maintaining hardware, the economy of our solar system will become billions of times larger than the current tiny economy of our biosphere. And of course, the coming expansion of the AI sphere won’t be limited to our tiny solar system." For decades, Schmidhuber has been obsessed with self-replicating robot factories.[FA15][SP16][SA17] [LSTM0] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. TR FKI-207-95, TUM, August 1995. PDF. [LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF. Based on [LSTM0]. More. [LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000. PDF. The "vanilla LSTM architecture" with forget gates that everybody is using today, e.g., in Google's Tensorflow. [LSTM3] A. Graves, J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:5-6, pp. 602-610, 2005. PDF. [LSTM4] S. Fernandez, A. Graves, J. Schmidhuber. An application of recurrent neural networks to discriminative keyword spotting. Intl. Conf. on Artificial Neural Networks ICANN'07, 2007. PDF. [LSTM5] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009. PDF. [LSTM6] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. NIPS'22, p 545-552, Vancouver, MIT Press, 2009. PDF. [LSTM15] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. Advances in Neural Information Processing Systems 22, NIPS'22, p 545-552, Vancouver, MIT Press, 2009. PDF. [LSTMPG] J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent famous applications: DeepMind's Starcraft player (2019) and OpenAI's dextrous robot hand & Dota player (2018)—Bill Gates called this a huge milestone in advancing AI. [MIR] J. Schmidhuber (AI Blog, Oct 2019, revised 2021). Deep Learning: Our Miraculous Year 1990-1991. 
Preprint arXiv:2005.05744, 2020. The deep learning neural networks of our team have revolutionised pattern recognition and machine learning, and are now heavily used in academia and industry. In 2020-21, we celebrate that many of the basic ideas behind this revolution were published within fewer than 12 months in our "Annus Mirabilis" 1990-1991 at TU Munich. [MLP1] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation 22(12): 3207-3220, 2010. ArXiv Preprint. Showed that plain backprop for deep standard NNs is sufficient to break benchmark records, without any unsupervised pre-training. [MLP2] J. Schmidhuber (AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training. By 2010, when compute was 100 times more expensive than today, both the feedforward NNs[MLP1] and the earlier recurrent NNs of Schmidhuber's team were able to beat all competing algorithms on important problems of that time. This deep learning revolution quickly spread from Europe to North America and Asia. The rest is history. [MOST] J.  Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in Schmidhuber's labs at TU Munich and IDSIA. (1) Long Short-Term Memory (LSTM), the most cited AI of the 20th century. (2) ResNet (open-gated Highway Net), the most cited AI of the 21st century. (3) AlexNet & VGG Net (the similar DanNet of 2011 won 4 image recognition challenges before them). (4) GAN (an instance of Schmidhuber's Adversarial Artificial Curiosity of 1990). (5) Transformer variants (unnormalised linear Transformers are formally equivalent to Schmidhuber's Fast Weight Programmers of 1991). In particular, Schmidhuber laid foundations of Generative AI, publishing principles of (4) GANs (1990, now used for deepfakes), (5) Transformers (1991, the "T" in "ChatGPT" stands for "Transformer"), and (6) self-supervised pre-training for deep NNs (the "P" in "GPT" stands for "pre-trained"). Most of this started with the Annus Mirabilis of 1990-1991.[MIR] [NDR] R. Csordas, K. Irie, J. Schmidhuber. The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization. Proc. ICLR 2022. Preprint arXiv/2110.07732. [NAT1] J. Schmidhuber. Citation bubble about to burst? Nature, vol. 469, p. 34, 6 January 2011. HTML. [NOB] J. Schmidhuber. A Nobel Prize for Plagiarism. Technical Report IDSIA-24-24. Sadly, the Nobel Prize in Physics 2024 for Hopfield & Hinton is a Nobel Prize for plagiarism. They republished methodologies developed in Ukraine and Japan by Ivakhnenko and Amari in the 1960s & 1970s, as well as other techniques, without citing the original papers. Even in later surveys, they didn't credit the original inventors (thus turning what may have been unintentional plagiarism into a deliberate form). None of the important algorithms for modern Artificial Intelligence were created by Hopfield & Hinton. See also popular tweet1, tweet2, and LinkedIn post. [Nob10] J. Schmidhuber (2010). Evolution of National Nobel Prize Shares in the 20th Century. Technical Report, IDSIA & USI & SUPSI, Switzerland, 14 September 2010. Preprint arXiv:1009.2634v1. (Compare ScienceNews Blog, 1 Oct 2010.) [NYT1] NY Times article by J. Markoff, Nov. 27, 2016: When A.I. Matures, It May Call Jürgen Schmidhuber 'Dad' [OAI2] OpenAI: C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. 
Fischer, S. Hashme, C. Hesse, R. Jozefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. de Oliveira Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, S. Zhang (Dec 2019). Dota 2 with Large Scale Deep Reinforcement Learning. Preprint arxiv:1912.06680. An LSTM composes 84% of the model's total parameter count. [OAI2a] J. Rodriguez. The Science Behind OpenAI Five that just Produced One of the Greatest Breakthrough in the History of AI. Towards Data Science, 2018. An LSTM with 84% of the model's total parameter count was the core of OpenAI Five. [OLY10] J. Schmidhuber (AI Blog, 2010). All time Olympic gold medal count. [PM0] J. Schmidhuber. Learning factorial codes by predictability minimization. TR CU-CS-565-91, Univ. Colorado at Boulder, 1991. PDF. More. [PM1] J. Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863-879, 1992. Based on [PM0], 1991. PDF. More. [PM2] J. Schmidhuber, M. Eldracher, B. Foltin. Semilinear predictability minimzation produces well-known feature detectors. Neural Computation, 8(4):773-786, 1996. PDF. More. Relevant threads with many comments at reddit.com/r/MachineLearning, the largest machine learning forum with over 800k subscribers in 2019 (note that my name is often misspelled): [R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990. [R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber. [R5] Reddit/ML, 2019. The 1997 LSTM paper by Hochreiter & Schmidhuber has become the most cited deep learning research paper of the 20th century. [R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet. [R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970. [R8] Reddit/ML, 2019. J. Schmidhuber on Alexey Ivakhnenko, godfather of deep learning 1965. [SA17] J. Schmidhuber. Falling Walls: The Past, Present and Future of Artificial Intelligence. Scientific American, Observations, Nov 2017. [RELU1] K. Fukushima (1969). Visual feature extraction by a multilayered network of analog threshold elements. IEEE Transactions on Systems Science and Cybernetics. 5 (4): 322-333. doi:10.1109/TSSC.1969.300225. This work introduced rectified linear units or ReLUs. [RELU2] C. v. d. Malsburg (1973). Self-Organization of Orientation Sensitive Cells in the Striate Cortex. Kybernetik, 14:85-100, 1973. See Table 1 for rectified linear units or ReLUs. Possibly this was also the first work on applying an EM algorithm to neural nets. [T22] J. Schmidhuber (AI Blog, 2022). Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022. See also [DLP]. [TR1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin (2017). Attention is all you need. NIPS 2017, pp. 5998-6008. This paper introduced the name "Transformers" for a now widely used NN type. It did not cite the 1991 publication on what's now called "Transformers with linearized self-attention."[FWP0-6][TR5-6] Schmidhuber also introduced the now popular attention terminology in 1993.[ATT][FWP2][R4] See tweet of 2022 for 30-year anniversary. [TR2] J. Devlin, M. W. Chang, K. Lee, K. Toutanova (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. Preprint arXiv:1810.04805. [TR3] K. 
Tran, A. Bisazza, C. Monz. The Importance of Being Recurrent for Modeling Hierarchical Structure. EMNLP 2018, p 4731-4736. ArXiv preprint 1803.03585. [TR4] M. Hahn. Theoretical Limitations of Self-Attention in Neural Sequence Models. Transactions of the Association for Computational Linguistics, Volume 8, p.156-171, 2020. [TR5] A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret. Transformers are RNNs: Fast autoregressive Transformers with linear attention. In Proc. Int. Conf. on Machine Learning (ICML), July 2020. [TR6] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, et al. Rethinking attention with Performers. In Int. Conf. on Learning Representations (ICLR), 2021. [TUR3] G. Oppy, D. Dowe (2021). The Turing Test. Stanford Encyclopedia of Philosophy. Quote: "it is sometimes suggested that the Turing Test is prefigured in Descartes' Discourse on the Method. (Copeland (2000:527) finds an anticipation of the test in the 1668 writings of the Cartesian de Cordemoy. Abramson (2011a) presents archival evidence that Turing was aware of Descartes' language test at the time that he wrote his 1950 paper. Gunderson (1964) provides an early instance of those who find that Turing's work is foreshadowed in the work of Descartes.)" [TUR3a] D. Abramson. Descartes' Influence on Turing. Studies in History and Philosophy of Science, 42:544-551, 2011. [TUR3b] Are computers conscious?—Panpsychism with Noam Chomsky | Theories of Everything. Mentioning the ancient "Turing Test" by Descartes. YouTube video, 2022. [TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold. It's not Turing's fault, though. This was number 1 on Hacker News. [ULTRA] References on the 1991 unnormalized linear Transformer (ULTRA): original tech report (1991) [FWP0]. Journal publication (1992) [FWP1]. Recurrent ULTRA extension (1993) introducing the terminology of learning "internal spotlights of attention” [FWP2]. Modern "quadratic" Transformer (2017: "attention is all you need") scaling quadratically in input size [TR1]. Papers of 2020-21 using the terminology "linearized attention" for more efficient "linear Transformers" that scale linearly [TR5,TR6]. 2021 paper [FWP6] pointing out that ULTRA dates back to 1991 [FWP0] when compute was a million times more expensive. ULTRA overview (2021) [FWP]. See the T in ChatGPT! See also surveys [DLH][DLP], 2022 tweet for ULTRA's 30-year anniversary, and 2024 tweet. [UN] J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Unsupervised hierarchical predictive coding (with self-supervised target generation) finds compact internal representations of sequential data to facilitate downstream deep learning. The hierarchy can be distilled into a single deep neural network (suggesting a simple model of conscious and subconscious information processing). 1993: solving problems of depth >1000. [UN0] J.  Schmidhuber. Neural sequence chunkers. Technical Report FKI-148-91, Institut für Informatik, Technische Universität München, April 1991. PDF. Unsupervised/self-supervised learning and predictive coding is used in a deep hierarchy of recurrent neural networks (RNNs) to find compact internal representations of long sequences of data, across multiple time scales and levels of abstraction. Each RNN tries to solve the pretext task of predicting its next input, sending only unexpected inputs to the next RNN above. 
The resulting compressed sequence representations greatly facilitate downstream supervised deep learning such as sequence classification. By 1993, the approach solved problems of depth 1000 [UN2] (requiring 1000 subsequent computational stages/layers—the more such stages, the deeper the learning). A variant collapses the hierarchy into a single deep net. It uses a so-called conscious chunker RNN which attends to unexpected events that surprise a lower-level so-called subconscious automatiser RNN. The chunker learns to understand the surprising events by predicting them. The automatiser uses a neural knowledge distillation procedure to compress and absorb the formerly conscious insights and behaviours of the chunker, thus making them subconscious. The systems of 1991 allowed for much deeper learning than previous methods. More. [UN1] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992. Based on TR FKI-148-91, TUM, 1991.[UN0] PDF. First working Deep Learner based on a deep RNN hierarchy (with different self-organising time scales), overcoming the vanishing gradient problem through unsupervised pre-training and predictive coding (with self-supervised target generation). Also: compressing or distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its old skills—such approaches are now widely used. See also this tweet. More. [UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. An ancient experiment on "Very Deep Learning" with credit assignment across 1200 time steps or virtual layers and unsupervised / self-supervised pre-training for a stack of recurrent NN can be found here (depth > 1000). [UN3] J.  Schmidhuber, M. C. Mozer, and D. Prelinger. Continuous history compression. In H. Hüning, S. Neuhauser, M. Raus, and W. Ritschel, editors, Proc. of Intl. Workshop on Neural Networks, RWTH Aachen, pages 87-95. Augustinus, 1993. [UN4] G. E. Hinton, R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, Vol. 313. no. 5786, pp. 504—507, 2006. PDF. This work describes unsupervised pre-training of stacks of feedforward NNs (FNNs) called Deep Belief Networks (DBNs). It did not cite the much earlier 1991 unsupervised pre-training of stacks of more general recurrent NNs (RNNs)[UN0-3] which introduced the first NNs shown to solve very deep problems. The 2006 justification of the authors was essentially the one Schmidhuber used for the 1991 RNN stack: each higher level tries to reduce the description length (or negative log probability) of the data representation in the level below.[HIN][DLP][T22][NOB][MIR] This can greatly facilitate very deep downstream learning.[UN0-3] [SP16] J. Schmidhuber interviewed by C. Stoecker: KI wird das All erobern. (AI will conquer the universe.) SPIEGEL, 6 Feb 2016. Link. [VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. More on the Fundamental Deep Learning Problem. [WU] Y. Wu et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Preprint arXiv:1609.08144 (PDF), 2016. Based on LSTM which it mentions at least 50 times. .
