BERT
- BERT is a deep learning model that has produced state-of-the-art results
on a wide variety of natural language processing tasks. It stands for
Bidirectional Encoder Representations from Transformers. It is pre-trained
on Wikipedia and BooksCorpus and requires only task-specific fine-tuning
(a loading sketch follows this list).
- BERT is essentially a stack of Transformer encoders (not the whole
Transformer architecture, just the encoder side; see the structural sketch
after this list). Bidirectionality is the key differentiator between BERT
and its predecessor, OpenAI GPT: BERT's self-attention layers are unmasked,
so every token attends to context on both its left and its right, whereas
GPT attends only to the tokens on the left.
- BERT is pre-trained on a large corpus of unlabelled text, including the
entire English Wikipedia (about 2,500 million words) and BooksCorpus (800
million words). This pre-training step is central to BERT's success:
training on such a large corpus lets the model build a deep, general
understanding of how language works before it ever sees a task-specific
label.
- BERT is a deeply bidirectional model: during pre-training it learns from
both the left and the right context of each token. This bidirectional view
of a sentence is a large part of what pushed its results past earlier
left-to-right models (the fill-in-the-blank sketch after this list
illustrates it).
- Finally, BERT's biggest advantage is transfer learning: its release is
widely described as NLP's ImageNet moment, and the most impressive aspect
is that we can fine-tune it by adding just a couple of additional output
layers to create state-of-the-art models for a variety of NLP tasks (a
fine-tuning sketch closes this list).
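
A minimal loading sketch for the first bullet, assuming the Hugging Face
`transformers` library and the `bert-base-uncased` checkpoint (both are
illustrative choices, not part of the original notes): it pulls the
pre-trained weights and produces one contextual vector per token.

```python
from transformers import BertTokenizer, BertModel

# Load the pre-trained tokenizer and encoder weights (assumed checkpoint name).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Encode one sentence; the model returns a contextual vector for every token.
inputs = tokenizer("BERT produces contextual representations.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, 768) for BERT-Base
```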
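
A rough structural sketch of the "stack of encoders" idea, using PyTorch's
generic encoder layers rather than the real BERT code; the 768/12/12 numbers
mirror the BERT-Base configuration.

```python
import torch
import torch.nn as nn

# One Transformer encoder layer with BERT-Base-like dimensions (assumed here).
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, dim_feedforward=3072, batch_first=True
)
# BERT-Base stacks 12 such layers; no causal mask is applied, so attention
# runs over the whole sequence in both directions.
encoder_stack = nn.TransformerEncoder(encoder_layer, num_layers=12)

token_embeddings = torch.randn(1, 16, 768)   # (batch, sequence, hidden)
contextual = encoder_stack(token_embeddings)
print(contextual.shape)                      # torch.Size([1, 16, 768])
```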
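
A small fill-in-the-blank illustration of bidirectionality, again assuming the
Hugging Face `transformers` library: the masked word can only be guessed well
by reading the words on both sides of it.

```python
from transformers import pipeline

# Masked-word prediction with a pre-trained BERT checkpoint (assumed name).
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The right-hand context ("for the patient's infection") is what makes a
# completion like "prescription" likely, not just the words to the left.
for prediction in fill_mask("The doctor wrote a [MASK] for the patient's infection."):
    print(prediction["token_str"], round(prediction["score"], 3))
```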
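
A hedged fine-tuning sketch for the last bullet, assuming the Hugging Face
`transformers` library; the two-class labels, example texts, and learning rate
are placeholders, not a recommended recipe.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# BertForSequenceClassification puts a small classification layer on top of the
# pre-trained encoder: the "couple of additional output layers" from the notes.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["great movie", "terrible plot"]          # placeholder task data
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)           # loss against the toy labels
outputs.loss.backward()                           # one fine-tuning step
optimizer.step()
print(float(outputs.loss))
```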