🎄 Just in time for the magical week (*) 🎅: LightOn and Answer.AI have just released a new model called ModernBERT.
ModernBERT is available as a slot-in replacement for any BERT-like model, in both a base (139M params) and a large (395M params) size.
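If you want to try it right away, here is a minimal sketch of loading the base checkpoint for masked-token prediction with the Hugging Face transformers library (a sketch assuming a recent transformers release with ModernBERT support; the model ID is the one published on the Hub):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"  # 139M params; a large checkpoint exists too
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "Paris is the [MASK] of France."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the highest-scoring token at the masked position
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
print(tokenizer.decode(logits[0, mask_pos].argmax()))
```

Since it is a slot-in replacement, migrating an existing BERT pipeline should mostly be a matter of changing the model ID.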
To get a sense of how important the BERT model and its derivatives are, here are some figures:
Out of the 1.2 million different models uploaded to Hugging Face since its inception, Google's initial BERT model is the second most downloaded, with more than 65 million downloads last month.
Among the 30 most downloaded models, BERT and related models accounted for 325 million downloads last month.
We hope the community likes ModernBERT and builds applications that will be smarter 🧠 , better 🛰️ , faster 🚀 and with longer context 🦒 .
Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.
(*) The magical weeks are generally the last two weeks of December: Marie Curie discovered radium (Dec 21st), the Wright brothers made their first flight (Dec 17th), Brattain and H. R. Moore demonstrated the transistor (Dec 23rd), and Charles Babbage, inventor of the calculating machine, was born (Dec 26th).
Back in 2011, Marc Andreessen announced that software was eating the world, while everyone was still trying to make sense of the realities of cloud versus brick-and-mortar businesses. Eight years later, Tarry Singh articulated how AI was eating software, a year before GPT-3 and Codex gave solid ground to this prediction. Fast forward two years: we have just witnessed how AI ate HPC, and we believe these are the first steps towards AI eating Learning, Creative and Office work.
Let me explain.
At LightOn, we have been working on making AI transformative for everyone. To that end, we used the Jean Zay French national supercomputer for two different yet related reasons this past year. First, LightOn’s Optical Processing Unit (OPU) hardware was integrated into this top-105 supercomputer. Even though LightOn’s hardware is analog and uses a technology so far unknown to supercomputing, there are several good reasons the future of computing will rely on it. Relatedly, in a co-design fashion, we also used the Jean Zay facility to implement and run the code for building the Large Language/Foundation Models that we believe are key to Transformative AI. In March, we trained Auriga, the largest French language model ever, and made it available to everyone through our PAGnol demo.
In July, we launched the Muse API, making our language models available for business use. Initially released in private beta, Muse quickly gained its first customers, and a public commercial version supporting five languages will be released in early 2022. Some of these early customers are using this new AI to redefine SEO or the website-creation experience.
“True happiness comes from the joy of deeds well done, the zest of creating things new” Antoine de Saint-Exupéry
Eventually, a major impact of these Large Language Models trained on HPC infrastructure will be to let everyone learn faster and to let office workers worldwide get the job done in ways never seen before.
If you are a start-up company or an individual starting a business around this promise, don’t hesitate to join the Muse Partnership program, and let’s start a discussion around how Muse can help you.
These models will also have the same effect in creative work and in the discovery process.
Stay tuned, the true AI revolution is really coming!
It is with immense pride and pleasure that we announce that LightOn’s OPU has been installed in one of the world’s Top500 supercomputers as part of a pilot program with GENCI and IDRIS/CNRS.
The team at LightOn is immensely proud to write the future of computing with this world-first integration of a photonic computing device into an HPC infrastructure.
As larger models seem to be providing more context and more ability for zero-shot learning, Julien just created the Akronomicon: an Extreme-Scale Leaderboard featuring the world's largest Machine Learning Models. And yes, LightOn is on that board for the moment!
At LightOn, we build photonic hardware that performs random projections, so it is nice to find a comprehensive source of material on the subject in a single document. Here is a report presenting how randomized algorithms are key to the future of computing:
Randomized algorithms have propelled advances in artificial intelligence and represent a foundational research area in advancing AI for Science. Future advancements in DOE Office of Science priority areas such as climate science, astrophysics, fusion, advanced materials, combustion, and quantum computing all require randomized algorithms for surmounting challenges of complexity, robustness, and scalability. This report summarizes the outcomes of that workshop, "Randomized Algorithms for Scientific Computing (RASC)," held virtually across four days in December 2020 and January 2021.
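For intuition, here is a toy NumPy sketch of such a random projection, simulating in software what our OPU does optically (on the real device, the random matrix is realized physically by light scattering through a diffusive medium and the camera records intensities; the sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 1_000, 10_000  # arbitrary sizes, for illustration only

# Fixed complex Gaussian random matrix: the software stand-in for the
# transmission matrix of the scattering medium inside the OPU.
A = (rng.normal(size=(d_out, d_in)) + 1j * rng.normal(size=(d_out, d_in))) / np.sqrt(2 * d_in)

x = rng.normal(size=d_in)  # input vector (binary on the actual device)
y = np.abs(A @ x) ** 2     # intensities measured by the camera
print(y.shape)             # (10000,): random features of x
```

The appeal of the optical version is that the multiplication by A happens at the speed of light, at dimensions where even storing A explicitly would be impractical.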
Progress usually comes from a steady technology bootstrap…until it doesn’t.
Take for instance the race for the 1,000ドル genome that started in the early 2000s. Initially, sequencing the human genome meant a race between the well-funded public and private sectors; more importantly, the first breakthrough ended up costing upwards of 450ドルM. Yet despite all the economic promise of genome sequencing, had Moore’s law alone applied, sequencing one full genome would still cost 100,000ドル today. However, once the goal became clear to everyone, a diversity of technologies and challengers emerged. This intense competition eventually yielded growth faster than Moore’s Law. The main takeaway is that one cannot rely on the steady progress of one specific technology alone to commoditize tools.
What does this have to do with the current state of silicon computing and the new demand for Large Language Models (LLMs)? Everything, if you ask us, and here is how.
Interestingly, much like the mass industrialization in the 1930s, the good folks at OpenAI are sketching new scaling laws for the industrialization of these larger models.
The sad truth is that extrapolating their findings to the training of a 10-trillion-parameter model implies a supercomputer running continuously for two decades. The minimum capital expenditure for this adventure is estimated in the realm of several hundred million dollars.
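As a back-of-the-envelope check (all numbers here are round assumptions, not measurements): using the common C ≈ 6·N·D estimate for training compute, a model with N = 10^13 parameters trained on D = 10^12 tokens needs C ≈ ×ばつ10^25 FLOPs. At a sustained 10^17 FLOP/s (100 PFLOP/s), that is ×ばつ10^8 seconds, i.e. close to two decades of continuous running.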
Much like what happened in sequencing, while silicon improvements and new architectures may deliver speedups in the coming years, it is fair to say that, even with Moore’s law, no foreseeable technology can reasonably train a fully scaled-up GPT-4 and capture the economic value associated with it.
Rebooting silicon with a different physics, light, and NvNs
For a real breakthrough to occur, much like in the sequencing story, different technologies need to be jointly optimized. In our case, this means performing co-design with new hardware and physics, but also departing from full programmability.
LightOn’s photonic hardware can produce massively parallel matrix-vector multiplications with the equivalent of 2 trillion parameters "for free": about one-fifth of the number of parameters needed for GPT-4. Next comes revisiting programmability. LightOn’s current technology keeps these weights fixed by design. Co-design means finding the algorithms where CPUs and GPUs perform the most intelligent computations while LightOn’s massive Non-von-Neumann (NvN) hardware does the heavy lifting. We have already published how we are replacing backpropagation, the workhorse of Deep Learning, with an algorithm that unleashes the full potential of our hardware in distributed training. We are also working on an inference step that will take full advantage of the massive number of parameters at our disposal. This involved effort relies heavily on our access to half a million GPU hours on some of France’s and Europe’s largest supercomputers.
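The algorithm in question is Direct Feedback Alignment (DFA). For a rough feel of how it sidesteps backpropagation, here is a minimal NumPy sketch on a toy regression problem, following the standard DFA formulation (fixed random matrices project the output error directly to each hidden layer; the sizes, learning rate, and teacher task are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, lr = 20, 64, 5, 1e-2

# A fixed random "teacher", just to have something to learn
W_teacher = rng.normal(size=(n_out, n_in))

# Trainable weights
W1 = rng.normal(scale=0.1, size=(n_hid, n_in))
W2 = rng.normal(scale=0.1, size=(n_out, n_hid))
# Fixed random feedback matrix: it replaces W2.T in the backward pass,
# so no transposed forward weights need to be communicated.
B1 = rng.normal(scale=0.1, size=(n_hid, n_out))

for step in range(5000):
    x = rng.normal(size=n_in)
    y = W_teacher @ x

    # Forward pass
    a1 = W1 @ x
    h1 = np.tanh(a1)
    y_hat = W2 @ h1

    # DFA backward pass: the global output error is projected straight
    # to the hidden layer through the fixed random matrix B1.
    e = y_hat - y                       # gradient of the squared loss
    delta1 = (B1 @ e) * (1.0 - h1**2)   # tanh'(a1) = 1 - tanh(a1)^2

    W2 -= lr * np.outer(e, h1)
    W1 -= lr * np.outer(delta1, x)

    if step % 1000 == 0:
        print(step, 0.5 * float(e @ e))
```

The point for our hardware is that B1 @ e is exactly a fixed random projection, the operation the OPU performs optically at very large scale; and since every layer receives its error signal independently, the backward pass parallelizes naturally in distributed training.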
And this is just the beginning. There is a vast untapped potential in repurposing for computing the large swaths of optical technologies aimed primarily at entertainment and telecommunications.
The road towards a 1,000ドル GPT-3
Based on GPT-3 training cost estimates, achieving a 1,000ドル GPT-3 requires improvements of four orders of magnitude. Much like what occurred in 2007 with the genome sequencing revolution, Moore’s law may take care of the first two orders of magnitude in the coming decade, but the next two rely on an outburst of new efficient technologies, both hardware and algorithms. It just so happens that GPT-3 has close to 100 layers, so two orders of magnitude of savings may arrive faster than you can imagine. Stay tuned!
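(For the record, the arithmetic behind the four orders of magnitude, under the assumption of a commonly cited training cost on the order of 10ドルM for a single GPT-3 run: 10^7 / 10^3 = 10^4.)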
I gave a talk at the #mathia2021 conference on March 9th, 2021, where I drew a parallel between the scaling laws that enabled industrialization in the 1920s and the new scaling laws of AI in the 2020s. AI is in its infancy: it needs guiding principles (as embedded in these empirical laws) and it also needs to develop new hardware. I showed how, in this context, LightOn can help unlock Transformative AI. Enjoy!
Today is a big day at LightOn as we unveil a hardware product, the Appliance, the world’s first commercially available photonic co-processor for AI and HPC.
We have had a few of these optical processing units in our own LightOn Cloud for the past two years and just retired one after more than 800 days of full-time operation.
The scaling hypothesis motivates the expansion of models past trillions of parameters as a path towards better performance. Recent significant developments, such as GPT-3, have been driven by this conjecture. However, as models scale up, training them efficiently with backpropagation becomes difficult. Because model, pipeline, and data parallelism distribute parameters and gradients over compute nodes, communication is challenging to orchestrate: this is a bottleneck to further scaling. In this work, we argue that alternative training methods can mitigate these issues, and can inform the design of extreme-scale training hardware. Indeed, using a synaptically asymmetric method with a parallelizable backward pass, such as Direct Feedback Alignment, communication needs are drastically reduced. We present a photonic accelerator for Direct Feedback Alignment, able to compute random projections with trillions of parameters. We demonstrate our system on benchmark tasks, using both fully-connected and graph convolutional networks. Our hardware is the first architecture-agnostic photonic co-processor for training neural networks. This is a significant step towards building scalable hardware, able to go beyond backpropagation, and opening new avenues for deep learning.
Aurélien sent me an email back in October and we are now in December! Time flies.
Dear Igor,
I hope things are well.
I have been following your Nuit Blanche blog for quite a few years. It would thus be great for us if you would consider featuring a recent paper of ours on your blog, entitled "Diffraction-unlimited imaging based on conventional optical devices". This paper has been published in Optics Express this year and its link is: https://www.osapublishing.org/oe/abstract.cfm?uri=oe-28-8-11243
This manuscript proposes a new imaging paradigm for objects that are too far away to be illuminated or accessed, allowing them to be resolved beyond the diffraction limit; this is thus distinct from the microscopy setting. Our concept involves an easy-to-implement acquisition procedure in which a spatial light modulator (SLM) is placed some distance from a conventional optical device. After acquisition of a sequence of images for different SLM patterns, the object is reconstructed numerically. The key novelty of our acquisition approach is to ensure that the SLM modulates light before information is lost to diffraction.
Feel free to let us know what you think, and happy to provide more information/pictures if needed. Thanks a lot for your time and consideration!
We propose a computational paradigm where off-the-shelf optical devices can be used to image objects in a scene well beyond their native optical resolution. By design, our approach is generic, does not require active illumination, and is applicable to several types of optical devices. It only requires the placement of a spatial light modulator some distance from the optical system. In this paper, we first introduce the acquisition strategy together with the reconstruction framework. We then conduct practical experiments with a webcam that confirm that this approach can image objects with substantially enhanced spatial resolution compared to the performance of the native optical device. We finally discuss potential applications, current limitations, and future research directions.
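To make the scheme concrete, here is a deliberately simplified 1-D toy sketch (our own illustration, not the authors’ code): each SLM pattern modulates the object before a fixed low-pass operator standing in for diffraction-limited optics, and stacking the acquisitions yields a linear system that can be solved even though the low-pass operator alone discards information.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_meas, n_patterns = 128, 32, 16  # arbitrary toy dimensions

# Low-pass measurement operator: keeps only the lowest spatial frequencies,
# a crude stand-in for diffraction-limited optics.
P = np.fft.fft(np.eye(n))[:n_meas].real / n

x = rng.random(n)                   # unknown object
rows, ys = [], []
for _ in range(n_patterns):
    m = rng.integers(0, 2, size=n)  # one binary SLM pattern
    rows.append(P * m)              # combined operator P @ diag(m)
    ys.append(P @ (m * x))          # image acquired with this pattern
A = np.vstack(rows)
y = np.concatenate(ys)

# Numerical reconstruction by least squares over all acquisitions
x_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))  # small relative error
```

The modulation happens before the information-destroying step, which is precisely the authors’ point about placing the SLM upstream of diffraction.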
The combination of the post-Moore’s-law era and the advent of very large ML models requires all of us to think up new approaches to computing hardware and AI algorithms at the same time. LightOn is one of the few (20) companies in the world publishing in both AI and hardware venues, engaging both communities in thinking about how theories and workflows may eventually be transformed by the photonic technology we develop.
This year, thanks to the awesome Machine Learning team at LightOn, we have two accepted papers at NeurIPS, the AI flagship conference, and five papers in its "Beyond Backpropagation" satellite workshop taking place on Saturday. This is significant on many levels, not least because these papers were nurtured and spearheaded by two Ph.D. students (Ruben Ohana and Julien Launay) who are doing their theses as LightOn engineers.
Here is the list of the different papers accepted at NeurIPS this year that involved LightOn members:
Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures, Julien Launay, François Boniface, Iacopo Poli, Florent Krzakala (Presenter: Julien Launay).
Ignorance is Bliss: Adversarial Robustness by Design through Analog Computing and Synaptic Asymmetry, Alessandro Cappelli, Ruben Ohana, Julien Launay, Iacopo Poli, Florent Krzakala (Presenter: Alessandro Cappelli). We had a blog post on this recently.
Align, then Select: Analysing the Learning Dynamics of Feedback Alignment, Maria Refinetti, Stéphane d’Ascoli, Ruben Ohana, Sebastian Goldt (Presenter: Ruben Ohana).
How and When does Feedback Alignment Work, Stéphane d’Ascoli, Maria Refinetti, Ruben Ohana, Sebastian Goldt (Presenter: Ruben Ohana).