BERT is a multi-layer bidirectional Transformer encoder. The BERT-base
model consists of 12 layers (Transformer blocks), each with 12 attention
heads, for a total of about 110 million parameters. Figure 2 shows the
BERT architecture, and Figure 3 shows that the input embedding is a
combination of 3 embeddings: the token embedding, the segment embedding
and the position embedding.
Figure 2: BERT Architecture
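To make the combination of the three embeddings concrete, here is a minimal PyTorch sketch that sums a token, segment and position embedding for every input position. The dimensions (vocabulary size 30522, maximum length 512, hidden size 768) are the standard BERT-base values, and the variable names and simplified setup are illustrative, not the original implementation.

```python
import torch
import torch.nn as nn

# Assumed BERT-base dimensions: vocab size, max sequence length, hidden size.
vocab_size, max_len, hidden = 30522, 512, 768

token_emb = nn.Embedding(vocab_size, hidden)   # one vector per WordPiece token
segment_emb = nn.Embedding(2, hidden)          # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, hidden)   # learned position vectors

# Example ids: 101 and 102 are the [CLS] and [SEP] ids in the standard BERT vocabulary.
token_ids = torch.tensor([[101, 7592, 2088, 102]])
segment_ids = torch.zeros_like(token_ids)                    # single sentence: all segment A
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)  # 0, 1, 2, ...

# The input representation is the element-wise sum of the three embeddings.
input_embeddings = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(position_ids)
print(input_embeddings.shape)  # torch.Size([1, 4, 768])
```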
Preprocessing Text for BERT
BERT's input representation can encode either a single text sentence or
a pair of sentences in one sequence of tokens.
The first token of every input sequence is the special classification
token – [CLS]. In classification tasks, the final hidden state of this
token serves as an aggregate representation of the entire sequence; in
non-classification tasks it is ignored.
For single-sentence tasks, this [CLS] token is followed by the WordPiece
tokens of the sentence and then the separator token – [SEP].
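As a quick illustration of this sequence layout, the sketch below builds such an input with the Hugging Face transformers tokenizer; the choice of the bert-base-uncased checkpoint and the sample sentence are assumptions for illustration only.

```python
from transformers import BertTokenizer

# Assumes the Hugging Face transformers package is installed.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encoding a single sentence adds [CLS] at the start and [SEP] at the end.
encoding = tokenizer("BERT represents text as a sequence of WordPiece tokens.")
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
print(tokens)  # starts with '[CLS]', ends with '[SEP]', WordPiece pieces in between
```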