Transformers: BERT


It is no secret that most devices will soon implement Artificial Intelligence. This is because Artificial Intelligence algorithms offer human-like reasoning capabilities to make predictions and decisions for very specific tasks, significantly reducing error and time spent, and therefore greatly increasing the efficiency of a system implementing them.

There are myriad research fields where progress has stalled because of technological limitations, such as the need for large amounts of memory and processing power. Artificial Intelligence techniques can help overcome many of these limitations.

[*] This post will discuss the field of Artificial Intelligence that helps machines understand, interpret, and manipulate human language: Natural Language Processing (NLP).

Reaching the current state of the art in NLP required creating and developing special algorithms, which led to rapid technological advances in this field of Artificial Intelligence. So much so that applications built on these algorithms now allow us to hold human-to-machine conversations of quite good quality.

These technologies allow us to identify the sentiment with which a sentence is expressed, to predict whether, given two sentences, the second is contextually related to the first, and even to check whether two phrases are semantically equivalent.

One of the most reliable and accurate technologies to perform these tasks is the BERT system developed by Google in 2018.

The process up to BERT

In NLP’s early days, Recurrent Neural Networks (RNN) were used. As the name implies, RNNs implement Deep Learning techniques to feed data in a recurrent manner (i.e., one after the other) into a neural network.

Now let us transfer this system to our field and see how it would work. Suppose we have the sentence, “This is a test”, and we need our neural network to process it so that the output gives us, let’s say, a semantic idea of what the sentence means.

The first step for our system is to break up the sentence into separate words. Once the words have been separated, we start the recurrent process, i.e., we feed only the first word as input into the neural network. When the neural network outputs the result for the word “This”, we combine it with the next word, “is”, and feed this as input into the same neural network. As you would expect, the new result takes the first two words, “This is”, into account. We continue the recurrent process in this way up to the last word, by which point the whole sentence has been processed and every word in it has been taken into account.
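The recurrent process described above can be sketched in a few lines of Python. This is only a toy illustration: the two-dimensional word vectors, the scalar weights `w_hidden` and `w_input`, and the function names are all assumptions made for the example, whereas a real RNN learns weight matrices during training.

```python
import math

def rnn_step(hidden, word_vector, w_hidden=0.5, w_input=1.0):
    """Combine the previous result with the current word (toy scalar weights)."""
    return [math.tanh(w_hidden * h + w_input * x)
            for h, x in zip(hidden, word_vector)]

def encode_sentence(sentence, embeddings, size=2):
    """Feed the words one after the other, carrying the result forward."""
    hidden = [0.0] * size              # nothing has been read yet
    for word in sentence.split():      # strictly sequential: one word per step
        hidden = rnn_step(hidden, embeddings[word])
    return hidden

# Hypothetical 2-dimensional word vectors, chosen by hand for the example
toy_embeddings = {
    "This": [0.1, 0.3],
    "is": [0.2, -0.1],
    "a": [0.0, 0.1],
    "test": [0.4, 0.2],
}

sentence_vector = encode_sentence("This is a test", toy_embeddings)
print(sentence_vector)
```

Because each step needs the result of the previous one, the loop cannot be run in parallel, and feeding the same words in a different order gives a different final vector.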

Although everything seems okay at first glance, this type of network has two major limitations:

  1. The first is obvious: because the network is recurrent, as we have already seen, the word “This” must be processed before the word “is” can be, and so on. Processing time therefore grows with the number of words in the sentence, so these algorithms are inherently slow.
  2. The second major problem is that such systems tend to “forget” parts of longer sentences, that is, the network fails to take into account the initial words of the sentence in the final calculation. In this case, assuming the input sentence is long enough, the network is unable to consider the word “This” in the final calculation, so the meaning of the sentence changes.
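The second limitation can be made visible with a deliberately simplified experiment. The sketch below assumes a one-dimensional network with a fixed weight `w_hidden` (a hypothetical value, not taken from any real model) and measures how much the final result changes when only the first input changes.

```python
import math

def final_state(inputs, w_hidden=0.5):
    """Run a toy one-dimensional recurrent network over a list of numbers."""
    hidden = 0.0
    for x in inputs:
        hidden = math.tanh(w_hidden * hidden + x)
    return hidden

def first_word_influence(length):
    """Change in the final state when only the first input is changed."""
    filler = [0.1] * (length - 1)      # the rest of the sentence stays the same
    return abs(final_state([1.0] + filler) - final_state([0.0] + filler))

for length in (2, 5, 10, 20):
    print(length, first_word_influence(length))
```

In this toy setting the influence of the first input shrinks at every step, so for a long enough sentence it barely affects the final calculation.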

The solution to this problem came from the famous 2017 paper, “Attention Is All You Need”, which proposed replacing RNNs with attention mechanisms.

But what are attention mechanisms? They are architectures that can process all the words of a sentence in parallel rather than one after the other, thereby overcoming the two limitations of RNNs (processing time and forgetting earlier words).

This is remarkably simple to achieve. Each word from the same sentence is fed as input into two different neural networks. The first neural network specifies what the meaning of the word is, and the second neural network specifies what types of words it seeks to relate to. Therefore, each word in the sentence will contain two values: its meaning, or key vector, and its needs, or query vector. Thus, the above example would look like this:


Once we have these results, a simple calculation can be used to obtain the correlation between each word and the rest of the phrase. We can achieve this by comparing what each word searches for (query vector), with what the rest of the words really are (key vector), giving a ‘match’ when both are compatible. To relate all the words to each other, you compare the query vector of a word with the key vectors of the other words, repeating this process for each word in the sentence.

Each result obtained from the comparisons is stored in a results matrix. Continuing with the above example, the system is as follows (X will be the value of the correlation between the word represented by the row and the column it belongs to):

The results matrix lets us consider the importance of each word with respect to the other words, enabling the machine to understand the sentence’s context.
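The comparison of query and key vectors can be sketched in Python. This is a minimal sketch, not the implementation from the paper. The vectors are hypothetical, hand-picked numbers, whereas in a real model they come from the two learned networks. Each word is also compared with itself, and the scaling and softmax steps follow the scaled dot-product attention of “Attention Is All You Need”. The value vectors that the paper also uses are left out, since they are not discussed here.

```python
import math

words = ["This", "is", "a", "test"]

# Hypothetical query vectors (what each word is looking for)
queries = {"This": [1.0, 0.0], "is": [0.0, 1.0], "a": [1.0, 1.0], "test": [0.5, -0.5]}
# Hypothetical key vectors (what each word is)
keys = {"This": [0.2, 0.1], "is": [0.9, 0.1], "a": [0.1, 0.8], "test": [0.7, 0.7]}

def dot(u, v):
    """Dot product: large when a query and a key are compatible."""
    return sum(a * b for a, b in zip(u, v))

def softmax(row):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(score - peak) for score in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_matrix(words, queries, keys):
    """One row per word: how strongly it relates to every word in the sentence."""
    scale = math.sqrt(len(keys[words[0]]))
    matrix = []
    for word in words:
        scores = [dot(queries[word], keys[other]) / scale for other in words]
        matrix.append(softmax(scores))
    return matrix

matrix = attention_matrix(words, queries, keys)
for word, row in zip(words, matrix):
    print(f"{word:>5}", [round(weight, 2) for weight in row])
```

No row depends on any other row, so all the words can be processed at the same time, which is what removes the sequential bottleneck of RNNs.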

This paper thus gave rise to the Transformers, the new generation of Artificial Intelligence systems for NLP.

It was only a matter of time before Transformer-based Artificial Intelligence was developed, with BERT being one of the most promising models.


Bidirectional Encoder Representations from Transformers, or BERT, is based on Transformer technology. Its greatest strength lies in its versatility since it can be used to solve multiple tasks. How is it able to do this?

It’s all down to its internal architecture. BERT contains two major blocks. The first is a general pre-trained block based on Transformer networks, which provides a basic understanding of the language. The second block, linked to the first, is for fine-tuning the system’s operation through Deep Learning, depending on the final need. The second block is what allows such great versatility, since it is a small block with great potential that relies on an already trained block providing basic language recognition.
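The two-block idea can be sketched as follows. Everything here is a hypothetical stand-in: `PRETRAINED_FEATURES` is a tiny hand-written table that plays the role of the pre-trained block, and the task-specific block is a single logistic unit trained on four made-up examples. A real BERT encoder is vastly larger, and fine-tuning it usually also adjusts the pre-trained weights, whereas this sketch keeps them frozen for simplicity.

```python
import math

# Hypothetical stand-in for the pre-trained block: fixed word features
# that are never updated below.
PRETRAINED_FEATURES = {
    "good": [1.0, 0.0], "great": [0.9, 0.1],
    "bad": [0.0, 1.0], "awful": [0.1, 0.9],
}

def pretrained_encoder(sentence):
    """Average the fixed features of the known words in the sentence."""
    vectors = [PRETRAINED_FEATURES[word] for word in sentence.lower().split()
               if word in PRETRAINED_FEATURES]
    if not vectors:
        return [0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def predict(head, sentence):
    """Task-specific block: one logistic unit on top of the encoder output."""
    weights, bias = head
    features = pretrained_encoder(sentence)
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune(examples, epochs=200, learning_rate=0.5):
    """Train only the small head with gradient descent on labelled examples."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for sentence, label in examples:
            error = predict((weights, bias), sentence) - label
            features = pretrained_encoder(sentence)
            weights = [w - learning_rate * error * f
                       for w, f in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

# Made-up sentiment examples: 1 = positive, 0 = negative
sentiment_examples = [("a good film", 1), ("a great film", 1),
                      ("a bad film", 0), ("an awful film", 0)]
sentiment_head = fine_tune(sentiment_examples)
print(predict(sentiment_head, "great acting"))
print(predict(sentiment_head, "awful acting"))
```

Training a different small head on different labelled examples, while reusing the same `pretrained_encoder`, is what would adapt the system to another task.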


To understand this, we can imagine a first-year university student who has to pass an exam on integration. The student must learn to integrate, but this goal will be a great deal easier with a solid grounding in differentiation. In this analogy, the student is the Artificial Intelligence, learning to integrate is the second, fine-tuning block, and the grounding in differentiation corresponds to the pre-trained block.

It is important to understand that, using the same grounding in differentiation and adjusting only the fine-tuning block, we could make the student learn to calculate the acceleration of a moving system (since this is the derivative of the velocity with respect to time) or compute current intensities in complex circuits (intensity being the derivative of the electric charge with respect to time).

To sum up, Artificial Intelligence does not disappoint in its promise to change the world as we know it. Especially regarding NLP, we can see how its progress in recent years has led to the development of algorithms which, thanks to Transformers, are achievable with current technology.

BERT is already being used in large projects, both academic and industrial. It only takes a quick Internet search to understand the importance of these systems in our society and how they are enabling great strides to be made in terms of comfort and needs, from allowing communication for individuals who are unable to speak for medical reasons, in the purest Stephen Hawking style, to the comfort of our homes, where we can ask our voice assistant whether we should wear warmer or cooler clothes to work today.

[*] At Teldat we thought it would be interesting to have external bloggers, and thus broaden the spectrum of information that is transmitted from Teldat Blog. This week, Iván Castro, a student at the University of Alcalá de Henares, writes about BERT.

Iván Castro

Final year student of Telecommunications Technology Engineering specialising in Telematics at the Polytechnic School of the University of Alcalá de Henares (EPS-UAH). Participation in projects for the implementation of Machine Learning and Natural Language Processing algorithms.
