Step 1: Data Cleaning
  1. "stop words" usually refers to the most common words in a language, for example "a", "the", etc. These words are essential parts of any language but add little to the meaning of a sentence, so they are removed.
  2. punctuation marks, such as the full stop, comma, and question mark, separate sentences and their elements and clarify meaning in writing; they are stripped from the tweets.
  3. convert all the messages to lowercase so that words with different casing are treated as the same token.
  4. convert each word to its lemma (base) form.
  5. embedded special characters, URLs, and digits are removed from the tweets (a cleaning sketch follows this list).
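A minimal sketch that strings these five operations together, assuming NLTK for stop words and lemmatization; the function name clean_tweet and the regex patterns are illustrative choices, not the authors' exact code.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_tweet(text: str) -> str:
    text = text.lower()                                  # step 3: lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # step 5: strip URLs
    text = re.sub(r"\d+", " ", text)                     # step 5: strip digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # step 2
    tokens = [t for t in text.split() if t not in STOP_WORDS]         # step 1
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]                # step 4
    return " ".join(tokens)

print(clean_tweet("Loving the new phone!! See https://example.com 2023"))
```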
Step 2: Apply the vocabulary built from the cleaned tweets to our train and test datasets.
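A sketch of one way to build and apply such a vocabulary, assuming the Keras Tokenizer; train_texts and test_texts are hypothetical lists of cleaned tweets.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

train_texts = ["great phone love it", "battery life is terrible"]
test_texts = ["love the battery"]

tokenizer = Tokenizer(oov_token="<unk>")
tokenizer.fit_on_texts(train_texts)          # vocabulary from training data only
X_train = tokenizer.texts_to_sequences(train_texts)
X_test = tokenizer.texts_to_sequences(test_texts)  # unseen words map to <unk>

vocab_size = len(tokenizer.word_index) + 1   # +1 for the reserved 0 (padding) index
print(vocab_size, X_train, X_test)
```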
Step 3: Apply N-gram analysis.
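One possible way to run the N-gram analysis, here with scikit-learn's CountVectorizer (an assumed tooling choice): count unigrams and bigrams and inspect the most frequent ones.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["great phone love it", "love the great camera", "battery life is terrible"]

vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
counts = vectorizer.fit_transform(texts)

totals = counts.sum(axis=0).A1                    # total count per n-gram
top = sorted(zip(vectorizer.get_feature_names_out(), totals),
             key=lambda pair: -pair[1])[:5]
print(top)                                        # most frequent n-grams
```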
Step 4: Word embedding methods are applied to learn a real-valued vector representation for a predefined, fixed-size vocabulary from a corpus of text.
Step 5: An embedding layer is added to the neural network and trained with the backpropagation algorithm.
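A minimal sketch of Steps 4-5, assuming Keras: the Embedding layer holds the real-valued vector table and its weights are updated by backpropagation during training. The vocabulary size, vector dimension, input length, and dummy data are illustrative values.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense

vocab_size, embedding_dim, max_len = 100, 8, 10

model = Sequential([
    Input(shape=(max_len,)),
    Embedding(vocab_size, embedding_dim),   # learned lookup table of word vectors
    Flatten(),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

X = np.random.randint(1, vocab_size, size=(32, max_len))  # dummy encoded tweets
y = np.random.randint(0, 2, size=(32,))
model.fit(X, y, epochs=2, verbose=0)  # embedding weights updated by backprop
```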
Step 6: Word2Vec is applied to efficiently learn a standalone word embedding from the text corpus.
Step 7: The Continuous Bag-of-Words (CBOW) model is applied, which learns the embedding by predicting the current word from its surrounding context.
Step 8: The Continuous Skip-Gram model is applied, which learns the embedding by predicting the surrounding words for a given current word (a sketch covering Steps 6-8 follows).
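A sketch covering Steps 6-8 with gensim's Word2Vec (an assumed library choice): the sg flag switches between the CBOW objective (sg=0) and the skip-gram objective (sg=1); the toy sentences and hyperparameters are illustrative.

```python
from gensim.models import Word2Vec

sentences = [["great", "phone", "love", "it"],
             ["battery", "life", "is", "terrible"]]

# sg=0: CBOW, predicts the current word from its context window
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
# sg=1: skip-gram, predicts the context words from the current word
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(cbow.wv["phone"][:5])                       # 100-dim vector for "phone"
print(skipgram.wv.most_similar("phone", topn=2))  # nearest neighbours
```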
Step 9: The Global Vectors for Word Representation (GloVe) algorithm is applied; it produces a classical vector-space representation of words using matrix-factorization techniques, which helps in calculating analogies.
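A sketch of loading pre-trained GloVe vectors and testing the analogy property, assuming the glove.6B.100d.txt file from the Stanford GloVe release has been downloaded locally and that gensim 4.x is available.

```python
from gensim.models import KeyedVectors

# GloVe's text format has no header line, hence no_header=True (gensim >= 4.0)
vectors = KeyedVectors.load_word2vec_format(
    "glove.6B.100d.txt", binary=False, no_header=True)

# the classic analogy: vector("king") - vector("man") + vector("woman") ~ "queen"
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```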
Step 10: Apply a CNN with word embeddings.
  1. once the mapping of words to integers has been prepared, encode the tweets in the training dataset and ensure that all documents have the same length.
  2. find the longest tweet using the max() function on the training dataset and take its length; truncate longer tweets to this size and zero-pad shorter ones.
  3. define the neural network model with an embedding layer as the first hidden layer, specifying the size of the real-valued vector space and the maximum length of input documents.
  4. the maximum document length from item 2 is used as the input length of the embedding layer (a sketch of this step follows the list).
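A sketch of Step 10, assuming Keras: pad every encoded tweet to the length of the longest training tweet, then place the Embedding layer first. The toy sequences and layer sizes are illustrative; vocab_size and the encoded tweets are assumed to come from the tokenizer step above.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Embedding, Conv1D, MaxPooling1D,
                                     Flatten, Dense)

X_train_seqs = [[2, 5, 9], [4, 1], [7, 3, 8, 6]]   # toy encoded tweets
vocab_size, embedding_dim = 100, 100

max_len = max(len(seq) for seq in X_train_seqs)     # length of the longest tweet
X_train = pad_sequences(X_train_seqs, maxlen=max_len, padding="post")  # zero-pad

model = Sequential([
    Input(shape=(max_len,)),
    Embedding(vocab_size, embedding_dim),            # first hidden layer
    # padding="same" lets the kernel span these short toy sequences
    Conv1D(filters=32, kernel_size=8, activation="relu", padding="same"),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(10, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.summary()
```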
Step 11: Develop a multi-channel convolutional neural network for the tweet sentiment prediction problem.
  1. a CNN configuration with 32 filters and a kernel size of 8, using the rectified linear unit (ReLU) activation function.
  2. a back-end of standard Multilayer Perceptron layers to interpret the CNN features.
  3. an output layer with a sigmoid activation function that outputs a value between 0 and 1 for negative and positive sentiment in the tweet.
  4. fit the network on the training data using a stochastic gradient descent optimizer and 100 training epochs, recording the accuracy and loss metrics (a multi-channel sketch follows this list).
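A sketch of the multi-channel network with the Keras functional API. The 32 filters, kernel size 8, ReLU, sigmoid output, SGD optimizer, and 100 epochs come from the text; running three parallel channels with different kernel sizes is an assumption based on common multi-channel CNN designs.

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (Input, Embedding, Conv1D, MaxPooling1D,
                                     Flatten, Dense, concatenate)

vocab_size, embedding_dim, max_len = 100, 100, 50   # illustrative sizes

inputs = Input(shape=(max_len,))
embedding = Embedding(vocab_size, embedding_dim)(inputs)

# one channel per kernel size; the largest matches the 32-filter, size-8 config
channels = []
for kernel_size in (4, 6, 8):
    conv = Conv1D(filters=32, kernel_size=kernel_size,
                  activation="relu")(embedding)
    pool = MaxPooling1D(pool_size=2)(conv)
    channels.append(Flatten()(pool))

merged = concatenate(channels)
dense = Dense(10, activation="relu")(merged)       # MLP back-end (sub-step 2)
outputs = Dense(1, activation="sigmoid")(dense)    # 0..1 sentiment (sub-step 3)

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=100)          # sub-step 4
```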
Step 12: Make predictions on the test data.
Step 13: Evaluate the model and compare the results (a sketch of these two steps follows).
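A sketch of Steps 12-13, continuing from the multi-channel sketch above (it reuses model and max_len); X_test and y_test are toy stand-ins for tweets encoded and padded with the training tokenizer.

```python
import numpy as np
from sklearn.metrics import classification_report

X_test = np.random.randint(1, 100, size=(8, max_len))  # toy encoded tweets
y_test = np.random.randint(0, 2, size=(8,))

loss, accuracy = model.evaluate(X_test, y_test, verbose=0)   # Step 13 metrics
print(f"test loss={loss:.3f} accuracy={accuracy:.3f}")

probs = model.predict(X_test, verbose=0)       # Step 12: sigmoid probabilities
preds = (probs > 0.5).astype("int32").ravel()  # threshold into 0/1 labels
print(classification_report(y_test, preds, labels=[0, 1],
                            target_names=["negative", "positive"],
                            zero_division=0))
```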