Transformer algorithm and deep-learning architecture is changing AI

  • Transformers enable the computer to understand the underlying structure of a mass of data, no matter what that data may relate to 
  • Text is converted to ‘tokens’ – numerical representations of the text itself 
  • Important tokens are magnified whilst unimportant ones are de-emphasised and relegated 
  • The attention mechanism encodes token embeddings across sequences that can range from a few tens of tokens to tens of millions of them, speeding up results and saving money

The race is on to exploit AI technologies that are able to far exceed the capabilities of the current wave of generative AI (GenAI) models which, as many have experienced, drive users to distraction almost as often as they provide them with a useful service.

One very promising area of research is focused on a particular aspect of prevalent AI models that is not only already commonplace but has also been identified as having immense potential in many, much broader applications. This is the “transformer” – in no way a relative of the robots that go by grandiose names such as Bonecrusher, Overbite and Wedge. 

In AI, a transformer is a type of algorithm and deep-learning architecture that permits a computer to understand the underlying structure of a mass of data, no matter what that data may relate to. It transforms input sequences into output sequences by learning about data context and, with that knowledge, can track relationships within and between the components of a sequence. 

The notion was first outlined by a group of eight Google computer scientists in a famous paper on machine learning (ML), Attention Is All You Need,  published in 2017. The paper states: “We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.”

With this model, text is converted into numerical representations of itself. These are called ‘tokens’, and each token is converted into a vector via lookup from a word-embedding table. At each layer, each token is given context by a parallel multi-head attention mechanism. This contextualisation allows the model to determine the relative importance of each part of a sequence compared with the other parts of the same sequence. Thus, the relevance of important tokens is magnified, whilst those adjudged to be of lesser worth are de-emphasised and relegated. The attention mechanism encodes the embedding of tokens across a fixed-width sequence that can range from a few tens of tokens to tens of millions of them. A toy illustration of this pipeline is sketched below.
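For readers curious about the mechanics, the following is a deliberately tiny, illustrative Python sketch of that pipeline: token IDs are looked up in an embedding table, then a single attention head weighs each token against the others. The vocabulary, dimensions and values are all invented for illustration, and real transformers run many such heads in parallel across many layers.

```python
# Toy sketch (NumPy): embedding lookup followed by single-head
# scaled dot-product attention. All sizes and values are invented.
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model = 50, 8                      # toy sizes, far smaller than any real model
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([3, 17, 42, 17, 9])         # 'tokens': numerical stand-ins for pieces of text
x = embedding_table[token_ids]                   # lookup -> one vector per token: (5, d_model)

# Learned projection matrices (random here) map embeddings to queries, keys and values
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d_model)              # how relevant each token is to every other token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax: important tokens get larger weights

output = weights @ V                             # contextualised token representations
print(weights.round(2))                          # rows sum to 1; high values are the 'magnified' tokens
```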

Over the past three years, since the commercial introduction of ChatGPT caught and beguiled the attention of the mass media and the general public, the transformer has become the integral architecture of large language models (LLMs) and the enormous datasets on which they are trained. However, transformers are now being applied to areas other than simply text. Research and experimentation are well advanced into the application of multimodal GenAI systems able to understand and generate outputs that can be put to use in multiple sectors, including automotive, biology, chemistry, computer vision, DNA sequence analysis, drug development, imaging, medicine, molecular science, physics, robotics, speech recognition, and video generation. 

Transformers (together with parallel computing) enable large-scale AI models to process long sequences in their entirety. This attribute results in quicker processing and AI training times and is regarded as the next step along the road to generalisable AI – a model and system that works well in settings new and different from the one in which it was trained. It is ‘generalisable’ if it works equally well in multiple locations – for example, in the healthcare sector, performing as well in many hospitals across a wide geographic area as it does in the original hospital where the AI was trained. The term also covers the ability of an AI model to perform with high and repeatable precision on data to which it has not previously been exposed and that lies beyond the parameters on which it was trained.
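To make the parallelism point concrete, the illustrative sketch below (again with invented sizes) contrasts a recurrent-style pass, which must walk a sequence one step at a time, with an attention-style pass that handles every position in a single batched matrix operation – exactly the kind of workload a GPU can spread across thousands of cores at once.

```python
# Illustrative contrast: sequential recurrence vs. whole-sequence attention.
# Sizes are invented; this is a sketch of the principle, not a benchmark.
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 1024, 64
x = rng.normal(size=(seq_len, d))

# Recurrent-style pass: each step depends on the previous hidden state,
# so the loop cannot be parallelised across positions.
W_h, W_x = rng.normal(size=(d, d)) * 0.01, rng.normal(size=(d, d)) * 0.01
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(h @ W_h + x[t] @ W_x)

# Attention-style pass: every position interacts with every other position
# through one dense computation that parallel hardware can execute at once.
scores = x @ x.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
context = weights @ x                            # the whole sequence is processed in its entirety
```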

Transformer models also integrate different types of data and help enable learning to be transferred and customised for individual, enterprise or organisational purposes. Models can be pre-trained on massive datasets and then adapted later to accommodate the specialised requirements of a particular organisation or agency. This means organisations no longer have to train their AI models from the ground up on enormous datasets, saving time and money.
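In code terms, that pre-train-then-adapt pattern looks roughly like the sketch below. It uses PyTorch purely as an assumed example framework (the article names no particular tool): a stand-in ‘pretrained’ encoder is frozen and only a small, organisation-specific head is trained on new data, rather than training the whole model from scratch.

```python
# Hedged sketch of transfer learning / fine-tuning with a frozen encoder.
# In practice the encoder weights would be loaded from a real checkpoint
# rather than initialised randomly as they are here.
import torch
import torch.nn as nn

d_model, num_classes = 128, 3

# Stand-in for a large pretrained transformer encoder.
pretrained_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
for p in pretrained_encoder.parameters():
    p.requires_grad = False                      # freeze: the general-purpose knowledge stays fixed

task_head = nn.Linear(d_model, num_classes)      # small, organisation-specific classifier

optimizer = torch.optim.Adam(task_head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Dummy batch: 8 sequences of 16 embedded tokens with made-up labels.
x = torch.randn(8, 16, d_model)
y = torch.randint(0, num_classes, (8,))

features = pretrained_encoder(x).mean(dim=1)     # pooled sequence representation
loss = loss_fn(task_head(features), y)
loss.backward()                                  # only the new head receives gradient updates
optimizer.step()
```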

Nvidia: Can’t do right for doing right?

Central to the success of AI are the semiconductors that drive the models, and by far the most powerful of them are the specialist GPU (graphics processing unit) chips. Nvidia is the dominant GPU vendor, currently commanding about 90% of the market. Furthermore, the company’s industry-leading software platform provides an extra competitive advantage that complements its outstanding hardware products, as was noted in our coverage of Nvidia’s latest financial results.

As the name ‘graphics processing unit’ makes evident, GPUs were initially designed to add graphical texture to 3D models by providing shading. Later, that capability was made programmable, allowing the addition of real-time lighting and shadows. Nvidia’s genius was to make its GPUs completely programmable, and by 2012 neural networks were being trained and run on Nvidia chips. The company also built its software on its own Cuda interface: the Compute Unified Device Architecture is a proprietary parallel computing platform and application programming interface (API) that allows software to use certain types of GPU to massively accelerate general-purpose processing. As a result, there are now huge libraries and software tools that enable people more or less anywhere on earth to program with Cuda.
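In practice, most developers reach Cuda through a CUDA-backed library rather than writing GPU kernels by hand. The hedged sketch below assumes PyTorch as one such library: the same matrix multiplication runs on the CPU or, when an Nvidia GPU is present, on Cuda kernels supplied by the library.

```python
# Hedged sketch: offloading a general-purpose computation to a Cuda-capable GPU
# via PyTorch (an assumed example library; any CUDA-backed library would do).
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

start = time.time()
c = a @ b                                        # dispatched to Cuda kernels when running on a GPU
if device == "cuda":
    torch.cuda.synchronize()                     # wait for the asynchronous GPU work to finish
print(f"matmul on {device}: {time.time() - start:.4f}s")
```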

Of course, it would be perfectly possible to use and program other GPUs from other manufacturers, but to do so would take a lot of personnel and money, so companies, including big tech firms such as Amazon and Meta, prefer to pay a premium to buy Nvidia chips to stay at the top of their commercial game. Meanwhile, other chip makers are struggling to catch up with Nvidia, although AMD is starting to have some success. 

Intel does have a software stack for machine learning on its most recent central processing unit (CPU) platforms, but few companies are buying them for AI work, as there’s not much point in training an AI model on a CPU when GPUs do it so much better. Meanwhile, Nvidia customers are stuck paying huge premiums for their GPUs and, as a result of very limited choice, Nvidia vies with Apple to be the most valuable company on the planet. 

– Martyn Warwick, Editor in Chief, TelecomTV
