Figure 3: Input embedding is a combination of 3 embeddings
The developers of BERT defined a specific set of rules for representing the input text before feeding it into the model.
- Position Embeddings: BERT learns and uses positional embeddings to
express the position of each token in a sentence. These are added to
overcome the limitation of the Transformer which, unlike an RNN, is
not able to capture “sequence” or “order” information.
- Segment Embeddings: BERT can also take sentence pairs as inputs for
certain tasks. Therefore, it learns a unique embedding for the first and the
second sentence to help the model distinguish between them. In the
example above, all the tokens marked as EA belong to sentence A.
- Token Embeddings: These are the embeddings learned for each
token from the WordPiece token vocabulary.
For a given token, its input representation is constructed by summing the
corresponding token, segment, and position embeddings, as sketched below.
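To make the summation concrete, here is a minimal PyTorch sketch (assuming PyTorch is available). The sizes follow bert-base, and the real implementation also applies layer normalization and dropout on top of the sum, which are omitted here:

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Sums token, segment, and position embeddings for each input token."""
    def __init__(self, vocab_size=30522, max_len=512, hidden_size=768):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.segment_embeddings = nn.Embedding(2, hidden_size)         # sentence A (0) or B (1)
        self.position_embeddings = nn.Embedding(max_len, hidden_size)  # learned, not sinusoidal

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        seq_len = token_ids.size(1)
        position_ids = torch.arange(seq_len, device=token_ids.device).unsqueeze(0)
        return (self.token_embeddings(token_ids)
                + self.segment_embeddings(segment_ids)
                + self.position_embeddings(position_ids))

# Example: one sequence of 6 token ids, first 4 from sentence A, last 2 from sentence B
token_ids = torch.tensor([[101, 7592, 2088, 102, 2026, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1]])
embeddings = BertInputEmbeddings()(token_ids, segment_ids)
print(embeddings.shape)  # torch.Size([1, 6, 768])
```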
Tokenization: BERT uses WordPiece tokenization. The vocabulary is
initialized with all the individual characters in the language, and
then the most frequent/likely combinations of the existing symbols in
the vocabulary are iteratively added.
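You can inspect the resulting WordPiece splits with the Hugging Face transformers library (assuming it is installed); words missing from the vocabulary are broken into subword pieces prefixed with ##:

```python
from transformers import BertTokenizer

# Load the WordPiece vocabulary used by the pre-trained bert-base-uncased model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A rare word is split into subword pieces from the vocabulary
print(tokenizer.tokenize("embeddings"))
# ['em', '##bed', '##ding', '##s']

# Common words stay whole
print(tokenizer.tokenize("the quick brown fox"))
# ['the', 'quick', 'brown', 'fox']
```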