Text Analysis using LSTM, RNN, and CNN

15 Sep 2020

PART-1: IMDB Modelling Task

The core of this project is based on a simple task – performing sentiment analysis with the IMDB dataset, available here on Kaggle. This dataset has 50,000 movie reviews from the IMDB corpus.

RNN Variants

I have compared performance on the classification task across Recurrent Network variants.
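As a sketch of how such a comparison can be set up (the vocabulary size, sequence length, and layer widths below are illustrative placeholders, not the values used in the project), each recurrent variant can be swapped into an otherwise identical Keras classifier:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(recurrent_layer, vocab_size=10000, embed_dim=64, seq_len=200):
    """Binary sentiment classifier with a swappable recurrent core."""
    inp = layers.Input(shape=(seq_len,))
    x = layers.Embedding(vocab_size, embed_dim)(inp)
    x = recurrent_layer(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# One model per recurrent variant, everything else held constant.
variants = {
    "SimpleRNN": layers.SimpleRNN(64),
    "LSTM": layers.LSTM(64),
    "GRU": layers.GRU(64),
    "BiLSTM": layers.Bidirectional(layers.LSTM(64)),
}
models_by_name = {name: build_classifier(rnn) for name, rnn in variants.items()}
```

Keeping the embedding and output layers fixed means any accuracy difference can be attributed to the recurrent layer itself.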

Embeddings

Distributed embeddings provide a lot of power in text classification, but there are many different embedding types that can be used. I am comparing text classification using -
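Where a pre-trained embedding (such as GloVe or word2vec) is compared against one learned from scratch, the usual recipe is to build an initialisation matrix mapping each tokenizer index to its pre-trained vector. The helper below is an illustrative sketch; `embeddings_index` stands for a word-to-vector dictionary parsed from a vector file:

```python
import numpy as np

def build_embedding_matrix(word_index, embeddings_index, dim):
    """Build a matrix whose rows initialise a Keras Embedding layer.

    word_index:       {word: integer index} from the tokenizer
    embeddings_index: {word: vector of shape (dim,)} from a vector file
    Words without a pre-trained vector keep a zero row (learned from scratch).
    """
    matrix = np.zeros((len(word_index) + 1, dim))  # +1 row for padding index 0
    for word, i in word_index.items():
        vec = embeddings_index.get(word)
        if vec is not None:
            matrix[i] = vec
    return matrix
```

The matrix is then passed to the `Embedding` layer as its initial weights, with `trainable=False` to keep the pre-trained vectors fixed, or `trainable=True` to fine-tune them on the task.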

CNN for Text Classification

CNNs are designed to model local features while LSTMs are very good at handling long-range dependencies. I am investigating the use of CNNs with multiple and heterogeneous kernel sizes both as an alternative to an LSTM solution, and as an additional layer before an LSTM solution.

Model Saving

From the various models created above, I am saving the one with the highest accuracy. There are many ways in which models can be saved; I am using the .h5 file extension, which can store the model architecture as well as the model weights. I have clearly documented my design in the code as well as in the README files.
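In outline, the save/restore round trip looks like this (the model below is a stand-in for whichever variant scored highest, and the file name is illustrative):

```python
import os
import tempfile
import tensorflow as tf
from tensorflow.keras import layers, models

# Stand-in for whichever trained variant scored highest in the comparison.
inp = layers.Input(shape=(200,))
x = layers.Embedding(10000, 64)(inp)
x = layers.LSTM(64)(x)
best_model = models.Model(inp, layers.Dense(1, activation="sigmoid")(x))

# The .h5 (HDF5) format stores architecture and weights in a single file.
path = os.path.join(tempfile.mkdtemp(), "best_imdb_model.h5")
best_model.save(path)

# Later, or in Part 2, everything is restored with a single call.
restored = tf.keras.models.load_model(path)
```

Because the .h5 file carries the architecture too, no model-building code is needed at load time, which is what makes it convenient as the starting point for transfer learning.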

PART-2: Working with a Very Small Dataset (Transfer Learning Use-Case)

A problem with libraries that provide wrappers for well-known datasets is that they make the dataset so easy to use that we do not appreciate what is involved in constructing and preparing data for Deep Learning. In real-world problems we have our own data, and it is often very small. Reusing a pre-trained model, so as to build on the learning it has already achieved, is called Transfer Learning.

Given these issues, I have collected a very small movie review dataset and used it to train a model that is based on the pre-trained model constructed in my Part 1.

Data Collection

Modelling
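A minimal sketch of the transfer-learning step: the Part-1 model is restored (e.g. via `tf.keras.models.load_model`; the `adapt_pretrained` name and learning rate below are illustrative assumptions), its earlier layers are frozen, and only the classifier head is trained on the small dataset:

```python
import tensorflow as tf

def adapt_pretrained(base_model, learning_rate=1e-4):
    """Freeze every layer except the final classifier head, then recompile.

    base_model would be the Part-1 model restored from its saved .h5 file.
    """
    for layer in base_model.layers[:-1]:
        layer.trainable = False  # keep the pre-trained representation fixed
    base_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return base_model
```

With only the head trainable, the number of parameters being fitted is small enough that even a handful of hand-collected reviews can meaningfully update the model without overfitting.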

PART-3: Text Generation - Writing My Own Movie Reviews

There is more to Language Processing than just classification. In this part I have used my skills in RNNs to generate some original text, and then benchmarked my model against a more classical statistical implementation. For this work I have again made use of the IMDB dataset of 50,000 movie reviews, except that I have split the data differently this time. My core model is based on LSTMs. I report model performance in terms of perplexity, and provide 5 outputs each from my best implementation and from the statistical model. My best model is saved and provided via a link in the submission.
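Perplexity is the exponential of the average negative log-likelihood the model assigns to the actual next tokens, so it can be computed the same way for both the LSTM and the statistical baseline (the helper below is a generic sketch, not the project's exact evaluation code):

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the true next tokens.

    token_probs: one value per position, P(true next token | context).
    Perplexity = exp(mean negative log-likelihood); lower is better.
    """
    probs = np.asarray(token_probs, dtype=np.float64)
    return float(np.exp(-np.mean(np.log(probs))))
```

As a sanity check, a model that always spreads probability uniformly over 4 candidate tokens assigns 0.25 to the true one each time and scores a perplexity of exactly 4, matching the intuition that perplexity measures the effective branching factor.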

The code for the entire project can be found on GitHub, here.