Step 1: Data Cleaning
- "Stop words" usually refers to the most common words in a language,
for example 'a', 'the', etc. These words are essential parts of any
language but add little to the meaning of a text, so they are removed.
- Punctuation marks (full stop, comma, question mark, etc.), used in
writing to separate sentences and their elements and to clarify meaning,
are removed.
- Convert all the messages to lowercase so that words differing only in
case are treated as the same token.
- Convert each word to its lemma form (lemmatization).
- Embedded special characters, URLs, and digits are removed from the
tweets.
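The cleaning steps above can be sketched in Python using only the standard library; the stop-word set and lemma map below are tiny illustrative stand-ins (a real pipeline would use full resources such as NLTK's stop-word list and WordNet lemmatizer):

```python
import re
import string

# Assumption: tiny illustrative subsets, not complete linguistic resources.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of"}
LEMMAS = {"cats": "cat", "running": "run", "better": "good"}

def clean_tweet(text):
    text = text.lower()                                   # lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # remove URLs
    text = re.sub(r"\d+", " ", text)                      # remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # stop words
    return [LEMMAS.get(t, t) for t in tokens]             # lemma form

print(clean_tweet("The cats are running to http://example.com 24/7!"))
# ['cat', 'run']
```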
Step 2: Apply this vocabulary to our train and test datasets.
Step 3: Apply N-gram analysis.
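N-gram analysis slides a window of n consecutive tokens across each cleaned tweet; a minimal sketch:

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["i", "love", "this", "movie"]
bigrams = ngrams(tokens, 2)
print(bigrams)
# [('i', 'love'), ('love', 'this'), ('this', 'movie')]
```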
Step 4: Word embedding methods are applied to learn a real-valued vector
representation for a predefined, fixed-size vocabulary from a corpus of
text.
Step 5: An embedding layer is added to the neural network and trained
with the backpropagation algorithm.
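The embedding layer of Steps 4-5 is essentially a lookup table: a trainable weight matrix whose rows backpropagation updates like any other layer's weights. A minimal NumPy sketch of the lookup, with toy sizes as assumptions:

```python
import numpy as np

vocab_size, embed_dim = 10, 4          # assumption: toy sizes
rng = np.random.default_rng(0)

# The embedding layer's weights: one real-valued vector per vocabulary word.
embedding = rng.normal(size=(vocab_size, embed_dim))

token_ids = np.array([2, 5, 2])        # a tweet encoded as integer word ids
vectors = embedding[token_ids]         # the "layer" is just row selection

print(vectors.shape)
# (3, 4): one embed_dim-sized vector per input token
```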
Step 6: Word2Vec is applied to efficiently learn a standalone word
embedding from a text corpus.
Step 7: The Continuous Bag-of-Words (CBOW) model is applied; it learns
the embedding by predicting the current word from its surrounding context.
Step 8: The Continuous Skip-Gram model is applied; it learns by predicting
the surrounding words for a given current word.
Step 9: The Global Vectors for Word Representation (GloVe) algorithm is
applied; it builds a classical vector-space representation of words using
matrix factorization techniques, which helps in calculating word analogies.
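Pretrained GloVe vectors are distributed as plain text, one word followed by its vector components per line. This sketch parses that format from an in-memory sample; the real files (e.g. glove.6B.100d.txt) are read the same way, with 50-300 dimensions per word:

```python
# Assumption: a two-line, 3-dimensional sample standing in for a real GloVe file.
sample = """the 0.418 0.24968 -0.41242
movie 0.013441 0.23682 -0.16899
"""

embeddings = {}
for line in sample.strip().splitlines():
    parts = line.split()
    embeddings[parts[0]] = [float(x) for x in parts[1:]]  # word -> vector

print(len(embeddings["movie"]))
# 3 (real GloVe vectors have 50-300 dimensions)
```

The resulting dictionary can seed the weights of the Keras embedding layer from Step 5.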
Step 10: Apply a CNN with word embeddings.
- Once the mapping of words to integers has been prepared, encode the
tweets in the training dataset and ensure that all documents have the
same length.
- Find the longest tweet using the max() function on the training
dataset and take its length; truncate longer tweets to that size and
zero-pad shorter ones.
- Define the neural network model with an embedding layer as the first
hidden layer, specifying the size of the real-valued vector space and
the maximum length of the input documents (the maximum document length
calculated above).
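The integer encoding, truncation, and zero-padding described above can be sketched in plain Python (the vocabulary and tweets here are toy assumptions; Keras offers the equivalent pad_sequences utility):

```python
def encode_and_pad(tweets, word_to_int, max_len):
    """Map words to integers, then truncate or zero-pad each tweet to max_len."""
    encoded = []
    for tokens in tweets:
        ids = [word_to_int.get(t, 0) for t in tokens][:max_len]  # truncate
        ids += [0] * (max_len - len(ids))                        # zero-pad
        encoded.append(ids)
    return encoded

word_to_int = {"love": 1, "movie": 2, "great": 3}   # assumption: toy vocabulary
tweets = [["love", "movie"],
          ["great", "movie", "love", "movie", "great"]]
max_len = max(len(t) for t in tweets)               # longest tweet via max()

print(encode_and_pad(tweets, word_to_int, max_len))
# [[1, 2, 0, 0, 0], [3, 2, 1, 2, 3]]
```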
Step 11: Develop a multi-channel convolutional neural network for the
tweet sentiment prediction problem.
- A CNN configuration with 32 filters and a kernel size of 8, using a
rectified linear unit (ReLU) activation function.
- A back-end of standard Multilayer Perceptron layers to interpret the
CNN features.
- An output layer with a sigmoid activation function to output a value
between 0 and 1 for negative and positive sentiment.
- Fit the network on the training data using a stochastic gradient
descent optimizer for 100 training epochs, and record the accuracy and
loss metrics.
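A minimal Keras sketch of the multi-channel CNN, under assumptions: the vocabulary size, document length, and embedding dimension are toy values; the text specifies 32 filters with kernel size 8, and the three channels here additionally use kernel sizes 4 and 6, a common multi-channel setup that the source does not state:

```python
import numpy as np
from tensorflow.keras.layers import (Input, Embedding, Conv1D, Dropout,
                                     MaxPooling1D, Flatten, Dense, concatenate)
from tensorflow.keras.models import Model

vocab_size, max_len, embed_dim = 100, 20, 16   # assumption: toy sizes

def channel(inp, kernel_size):
    """One channel: embedding -> convolution -> pooling -> flatten."""
    x = Embedding(vocab_size, embed_dim)(inp)
    x = Conv1D(filters=32, kernel_size=kernel_size, activation="relu")(x)
    x = Dropout(0.5)(x)
    x = MaxPooling1D(pool_size=2)(x)
    return Flatten()(x)

inputs = [Input(shape=(max_len,)) for _ in range(3)]
merged = concatenate([channel(i, k) for i, k in zip(inputs, (4, 6, 8))])
dense = Dense(10, activation="relu")(merged)            # MLP back-end
output = Dense(1, activation="sigmoid")(dense)          # sentiment in [0, 1]

model = Model(inputs=inputs, outputs=output)
model.compile(optimizer="sgd", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit([X, X, X], y, epochs=100)  # fit with SGD for 100 epochs

x = np.random.randint(0, vocab_size, size=(2, max_len))
preds = model.predict([x, x, x], verbose=0)
print(preds.shape)   # (2, 1)
```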
Step 12: Make predictions on test data.
Step 13: Evaluate and compare the model.