
Naive Bayes Classifier

  • Writer: Roma Fatima
  • Nov 30, 2021
  • 3 min read

A Naive Bayes classifier is a probabilistic machine learning model used to discriminate between different objects based on certain features. The concept of the classifier is based on Bayes' theorem.

P(A|B) = P(B|A) × P(A) / P(B)

Using Bayes' theorem, we can find the probability of A happening, given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that the predictors/features are independent: the presence of one particular feature does not affect the others. Hence it is called "naive".


The initial code for this classifier was taken from the Kaggle competition site, and changes have been made to that code.


We begin the program by importing the numpy, pandas, math and operator modules. We read the IMDB movie review file and proceed to divide the data frame into ranges for training, developing and testing.

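A minimal sketch of this step might look like the following; the file name imdb_reviews.csv and the column names review and label are assumptions, and the split boundaries are the ones found in the experiments below:

import numpy as np
import pandas as pd
import math
import operator

# Read the IMDB reviews (file and column names are assumed here).
df = pd.read_csv("imdb_reviews.csv")

n = len(df)
# Split boundaries: training 0-25.30%, developing 25.30-30%, testing 30-100%.
train = df.iloc[: int(0.2530 * n)].copy()
dev = df.iloc[int(0.2530 * n) : int(0.30 * n)].copy()
test = df.iloc[int(0.30 * n) :].copy()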

We calculate the prior probabilities of positive and negative reviews.

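Assuming a label column holding the values "positive" and "negative", the priors can be estimated roughly like this:

# Fraction of training reviews in each class.
p_pos = (train["label"] == "positive").mean()
p_neg = (train["label"] == "negative").mean()
print(p_pos, p_neg)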


We count how many positive and negative reviews are present in each of the training, developing and testing data sets.

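A sketch of the counting step, under the same assumed column names:

# Count positive and negative reviews in each split.
for name, part in [("training", train), ("developing", dev), ("testing", test)]:
    counts = part["label"].value_counts()
    print(name, counts.get("positive", 0), counts.get("negative", 0))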

We then print the training, development and testing data to verify the splits.

We now proceed to clean the data by removing special characters and converting all capital letters to lowercase.

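One way to implement this cleaning step (the exact regular expression in the original code may differ):

import re

def clean(text):
    # Replace anything that is not a letter, digit or space, then lowercase.
    return re.sub(r"[^A-Za-z0-9 ]", " ", text).lower()

for part in (train, dev, test):
    part["review"] = part["review"].apply(clean)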

We now build a vocabulary list.

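A sketch of the vocabulary-building step, counting word occurrences across the training reviews:

from collections import Counter

# Count how often each word appears in the training split.
word_counts = Counter()
for review in train["review"]:
    word_counts.update(review.split())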

We omit rare words: any word that occurs fewer than five times is dropped.

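With the counts above, dropping the rare words is a one-liner:

# Keep only words that occur at least 5 times.
vocabulary = [w for w, c in word_counts.items() if c >= 5]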

The total vocabulary list now contains 99 words.


We calculate the probability of each vocabulary word appearing in positive reviews and in negative reviews.

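A rough sketch of these likelihood estimates, reusing the Counter approach from above:

# Probability of each vocabulary word given a class, estimated from
# the word counts of that class's training reviews.
def word_probs(reviews):
    counts = Counter()
    for review in reviews:
        counts.update(review.split())
    total = sum(counts[w] for w in vocabulary)
    return {w: counts[w] / total for w in vocabulary}

p_word_pos = word_probs(train[train["label"] == "positive"]["review"])
p_word_neg = word_probs(train[train["label"] == "negative"]["review"])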

We now conduct 5-Fold Cross Validation.

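The cross validation could be sketched as follows; classify() scores a review under each class with log probabilities, skipping words it has no estimate for, and each fold re-estimates the model on the remaining four folds:

def classify(review, prior_pos, prior_neg, pw_pos, pw_neg):
    # Sum log probabilities instead of multiplying to avoid underflow.
    log_pos, log_neg = math.log(prior_pos), math.log(prior_neg)
    for w in review.split():
        if pw_pos.get(w, 0) > 0 and pw_neg.get(w, 0) > 0:
            log_pos += math.log(pw_pos[w])
            log_neg += math.log(pw_neg[w])
    return "positive" if log_pos > log_neg else "negative"

folds = np.array_split(train, 5)
accuracies = []
for i in range(5):
    held_out = folds[i]
    rest = pd.concat(folds[:i] + folds[i + 1 :])
    pos = rest[rest["label"] == "positive"]
    neg = rest[rest["label"] == "negative"]
    pw_pos, pw_neg = word_probs(pos["review"]), word_probs(neg["review"])
    preds = held_out["review"].apply(lambda r: classify(
        r, len(pos) / len(rest), len(neg) / len(rest), pw_pos, pw_neg))
    accuracies.append((preds == held_out["label"]).mean())
print("mean accuracy:", np.mean(accuracies))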


We now compare the effect of smoothing.

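Assuming the post uses the usual add-one (Laplace) smoothing, the word probabilities change like this:

# Add one to every count so unseen words keep a small non-zero probability.
def word_probs_smoothed(reviews):
    counts = Counter()
    for review in reviews:
        counts.update(review.split())
    total = sum(counts[w] for w in vocabulary) + len(vocabulary)
    return {w: (counts[w] + 1) / total for w in vocabulary}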

We then run the 5-fold cross validation again with the smoothed probabilities.


We compare the two results to see whether smoothing helps. The output shows that accuracy is better without smoothing.


We now derive the top 10 words that predict the positive and negative classes.

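One common way to rank the words (an assumption about the original approach) is the ratio of their class-conditional probabilities; this also puts the earlier operator import to use:

# Large ratios favour the positive class, small ratios the negative class.
ratios = {w: p_word_pos[w] / p_word_neg[w]
          for w in vocabulary if p_word_neg[w] > 0}
top_positive = sorted(ratios.items(), key=operator.itemgetter(1), reverse=True)[:10]
top_negative = sorted(ratios.items(), key=operator.itemgetter(1))[:10]
print("positive:", [w for w, _ in top_positive])
print("negative:", [w for w, _ in top_negative])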


Finally, we calculate the final accuracy using the hyperparameters chosen above.

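A sketch of the final evaluation, reusing classify() with the unsmoothed probabilities since those scored better above:

preds = test["review"].apply(
    lambda r: classify(r, p_pos, p_neg, p_word_pos, p_word_neg))
print("test accuracy:", (preds == test["label"]).mean())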



My Contribution:


With the help of the referenced articles, I was able to increase the prediction accuracy to 50%.


1. I changed the ratio of the Training, Testing and Developing data sets. In the following sections you will come across the many experiments I did on the program to find the best ratio for this data frame. I was able to obtain the best range of data for training, developing and testing, which is Training: 0-25.30%, Developing: 25.30-30% and Testing: 30-100%.


2. Omitting rare words. I tested minimum word frequencies in the range 1-10 and found the value that works best for my program, which happens to be 5: I removed all words that occur fewer than 5 times in the data frame.

Challenges Faced:


As I was working solely with the IMDB reviews data frame, it was difficult to improve the accuracy, and the highest I could achieve was only 50%.


It was also a challenge to run the program repeatedly, about 50-70 times, to find the best ratio for the Training, Testing and Developing data sets. It was done purely by approximation and trial and error.


Another challenge I faced was importing the files into Google Colab, which does not keep them in RAM after more than 30 minutes of idle time, so I had to import the files repeatedly. This issue resolved on its own.

Experiments:


From the Data Mining lectures, I learnt that the Naive Bayes classifier does not require a lot of training data for optimal classification. Hence, in my first experiment, I ran the program with various data ranges for the training, developing and testing data sets. The ranges are expressed as percentages of the total IMDB movie reviews data frame.

[Table: accuracy for each training/developing/testing split tried]

From this experiment I was able to obtain the best range of data for training, developing and testing, which is Training: 0-25.30%, Developing: 25.30-30% and Testing: 30-100%. This gives the highest accuracy of 50% without smoothing.


Another experiment I conducted was to see whether changing the minimum word frequency affects the accuracy. I got this idea from the referenced article, which states that changing data ranges can change the classification probabilities. I ran the program with minimum word frequencies from 0 to 10 and obtained the graph below.

[Graph: classification accuracy vs. minimum word frequency]

From this experiment, I concluded that using a minimum frequency of 5 gives the highest accuracy of 50%.

References:
