BERT is a multi-layer bidirectional Transformer encoder. The model used here is BERT Base – 12 layers (transformer blocks), 12 attention heads, and 110 million parameters. Figure 2 shows the BERT architecture, and Figure 3 shows that the input embedding is a combination of three embeddings: the token embedding, the segment embedding, and the position embedding.
Figure 2: BERT Architecture
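To make these numbers concrete, here is a minimal sketch that loads BERT Base and reads back the layer count, head count, parameter count, and the three embedding tables that are summed to form the input embedding. The Hugging Face transformers library and the bert-base-uncased checkpoint are assumptions for illustration; they are not part of the text above.

```python
# Minimal sketch: inspect BERT Base with the Hugging Face `transformers` library.
# The library and the `bert-base-uncased` checkpoint are assumptions, not
# something specified in the text.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

print(model.config.num_hidden_layers)    # 12 transformer blocks
print(model.config.num_attention_heads)  # 12 attention heads per block
print(sum(p.numel() for p in model.parameters()))  # ~110 million parameters

# The input embedding is the element-wise sum of three lookup tables:
print(model.embeddings.word_embeddings)        # token (WordPiece) embeddings
print(model.embeddings.token_type_embeddings)  # segment embeddings
print(model.embeddings.position_embeddings)    # position embeddings
```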
Preprocessing Text for BERT
The input representation used by BERT is able to represent a single text sentence as well as a pair of sentences in a single sequence of tokens.
The first token of every input sequence is the special classification token – [CLS]. In classification tasks, the hidden state corresponding to this token is used as the aggregate representation of the entire sequence; it is ignored in non-classification tasks.
For single text sentence tasks, this [CLS] token is followed by the WordPiece tokens and the separator token – [SEP].
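The sketch below illustrates this layout, again assuming the Hugging Face tokenizer for bert-base-uncased: [CLS] at the start, the WordPiece tokens of the sentence, and [SEP] at the end; for a sentence pair, a second [SEP] separates the two segments.

```python
# Minimal sketch of BERT input preprocessing with the Hugging Face tokenizer
# (an assumption for illustration; any WordPiece tokenizer with the same
# vocabulary behaves the same way).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Single sentence: [CLS] <WordPiece tokens> [SEP]
single = tokenizer("BERT represents text as WordPiece tokens.")
print(tokenizer.convert_ids_to_tokens(single["input_ids"]))

# Sentence pair: [CLS] <sentence A> [SEP] <sentence B> [SEP]
pair = tokenizer("How are sentence pairs encoded?", "They share one token sequence.")
print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))
print(pair["token_type_ids"])  # segment ids: 0 for sentence A, 1 for sentence B
```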