Naive Bayes Example on Iris Data Set

Step 0: Load Dataset

In [ ]:
import pandas as pd

data = pd.read_csv('../.datasets/iris.csv')
data
Out[ ]:
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica

150 rows × 5 columns

Step 1: Clean Dataset (if needed)

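NOTE: Some Naive Bayes variants (for example Multinomial, Categorical, and Bernoulli) are designed for discrete features, while Gaussian Naive Bayes handles continuous values directly. In this example, I will round all values to the nearest integer so the features are discrete. Rounding discards information, so with other algorithms this step is usually not needed.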

In [ ]:
import numpy as np

# Round every feature column to the nearest integer in one step.
feature_columns = data.columns.drop('class')
data[feature_columns] = np.round(data[feature_columns])
data
Out[ ]:
sepal_length sepal_width petal_length petal_width class
0 5.0 4.0 1.0 0.0 Iris-setosa
1 5.0 3.0 1.0 0.0 Iris-setosa
2 5.0 3.0 1.0 0.0 Iris-setosa
3 5.0 3.0 2.0 0.0 Iris-setosa
4 5.0 4.0 1.0 0.0 Iris-setosa
... ... ... ... ... ...
145 7.0 3.0 5.0 2.0 Iris-virginica
146 6.0 2.0 5.0 2.0 Iris-virginica
147 6.0 3.0 5.0 2.0 Iris-virginica
148 6.0 3.0 5.0 2.0 Iris-virginica
149 6.0 3.0 5.0 2.0 Iris-virginica

150 rows × 5 columns
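
One detail worth knowing about the rounding step: `np.round` rounds exact halves to the nearest even integer rather than always up. The sketch below uses a few hand-picked values (not rows from the dataset) to show this behaviour.

```python
import numpy as np

# Hand-picked illustrative values, not rows from iris.csv.
values = np.array([0.2, 0.5, 1.5, 2.5, 1.8])

# np.round uses "round half to even": 0.5 -> 0.0, 1.5 -> 2.0, 2.5 -> 2.0.
rounded = np.round(values)
print(rounded)  # [0. 0. 2. 2. 2.]
```

This is why a petal width of 0.5 ends up as 0.0 rather than 1.0 after this step.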

Step 2: Split the dataset into training and testing sets

In [ ]:
X_data = data[data.columns.drop("class")]
X_data
Out[ ]:
sepal_length sepal_width petal_length petal_width
0 5.0 4.0 1.0 0.0
1 5.0 3.0 1.0 0.0
2 5.0 3.0 1.0 0.0
3 5.0 3.0 2.0 0.0
4 5.0 4.0 1.0 0.0
... ... ... ... ...
145 7.0 3.0 5.0 2.0
146 6.0 2.0 5.0 2.0
147 6.0 3.0 5.0 2.0
148 6.0 3.0 5.0 2.0
149 6.0 3.0 5.0 2.0

150 rows × 4 columns

In [ ]:
y_data = data['class']
y_data
Out[ ]:
0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: class, Length: 150, dtype: object
In [ ]:
from sklearn.model_selection import train_test_split

# shuffle=True by default, so each run yields a different split unless random_state is set.
# In this example, I will split this into 80% train and 20% test.
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.20)

# the following lines are just type hints
X_train: pd.DataFrame
X_test: pd.DataFrame
y_train: pd.Series
y_test: pd.Series

print("Train:", len(X_train))
print("Test:", len(X_test))
Train: 120
Test: 30
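
Because the split above is reshuffled on every run, the accuracy figures later in this notebook will vary from run to run. A minimal sketch of a reproducible, class-balanced split follows. It assumes scikit-learn's bundled Iris data as a stand-in for iris.csv, and the seed value 42 is an arbitrary choice.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv.
X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify keeps the class proportions
# the same in the train and test parts. The seed 42 is arbitrary.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

test_counts = Counter(y_test.tolist())
print(len(X_train), len(X_test))    # 120 30
print(sorted(test_counts.items()))  # [(0, 10), (1, 10), (2, 10)]
```

With `stratify`, each of the three classes contributes exactly 10 of the 30 test rows, so no class is under-represented by chance.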

Step 3: Train the model on the training set using a specified algorithm and hyperparameters.

In this example, we are using a Naive Bayes classifier, specifically Bernoulli Naive Bayes.

In [ ]:
from sklearn.naive_bayes import BernoulliNB

# BernoulliNB(hyperparameter = value, ...)
# here, we are using the default provided values
model = BernoulliNB().fit(X_train, y_train)

y_pred = model.predict(X_test)
y_pred
Out[ ]:
array(['Iris-virginica', 'Iris-virginica', 'Iris-setosa', 'Iris-setosa',
       'Iris-virginica', 'Iris-virginica', 'Iris-virginica',
       'Iris-virginica', 'Iris-virginica', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-virginica', 'Iris-virginica',
       'Iris-virginica', 'Iris-setosa', 'Iris-virginica', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-virginica', 'Iris-virginica',
       'Iris-virginica', 'Iris-virginica', 'Iris-setosa',
       'Iris-virginica', 'Iris-setosa', 'Iris-virginica',
       'Iris-virginica', 'Iris-setosa'], dtype='<U15')

Step 4: Evaluate the model on the testing set.

In [ ]:
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)

print(f"Accuracy: {score * 100:.2f}%")
Accuracy: 60.00%
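
For reference, accuracy is simply the fraction of predictions that match the true labels. The sketch below is a plain-Python version of that calculation; the helper name `accuracy` and the toy labels are illustrative and not part of the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels for illustration only.
toy_true = ["setosa", "versicolor", "virginica", "versicolor", "setosa"]
toy_pred = ["setosa", "virginica", "virginica", "virginica", "setosa"]

print(f"Accuracy: {accuracy(toy_true, toy_pred) * 100:.2f}%")  # Accuracy: 60.00%
```

By the same arithmetic, 60.00% on a 30-row test set corresponds to 18 correct predictions.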

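Well, that performed badly!

BernoulliNB expects binary features. With its default binarize=0.0, every value greater than zero is mapped to 1, so most of the information in the measurements is discarded. Notice that it never predicted Iris-versicolor above. Let's try the Gaussian Naive Bayes method instead, which models each feature as a continuous, normally distributed value.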
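
To see why the Bernoulli variant may struggle here, recall that `BernoulliNB` thresholds every feature with its `binarize` parameter, which defaults to 0.0. The sketch below applies the same thresholding by hand to three rounded rows. The setosa and virginica rows are rows 0 and 145 from the rounded table in Step 1; the versicolor row is an assumed, illustrative value.

```python
import numpy as np

# Rounded rows: sepal_length, sepal_width, petal_length, petal_width.
setosa = np.array([5.0, 4.0, 1.0, 0.0])      # row 0 of the rounded table
virginica = np.array([7.0, 3.0, 5.0, 2.0])   # row 145 of the rounded table
versicolor = np.array([7.0, 3.0, 5.0, 1.0])  # assumed example row


def binarize(row, threshold=0.0):
    """Map values above the threshold to 1 and the rest to 0."""
    return (row > threshold).astype(int)


print(binarize(setosa))      # [1 1 1 0]
print(binarize(versicolor))  # [1 1 1 1]
print(binarize(virginica))   # [1 1 1 1]
```

After thresholding, the versicolor and virginica rows are identical, so the model has nothing left to tell them apart. That is consistent with Iris-versicolor never appearing in the Bernoulli predictions.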

In [ ]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB().fit(X_train, y_train)

y_pred = model.predict(X_test)
y_pred
Out[ ]:
array(['Iris-versicolor', 'Iris-versicolor', 'Iris-setosa', 'Iris-setosa',
       'Iris-virginica', 'Iris-virginica', 'Iris-virginica',
       'Iris-versicolor', 'Iris-virginica', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-virginica', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-setosa',
       'Iris-versicolor', 'Iris-setosa', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-setosa'], dtype='<U15')
In [ ]:
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)

print(f"Accuracy: {score * 100:.2f}%")
Accuracy: 86.67%
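
A single 30-row test set gives a noisy estimate, and the split changes on every run. As a sketch of a steadier comparison, the block below runs 5-fold cross-validation for both models. It assumes scikit-learn's bundled Iris data in place of iris.csv and applies the same rounding as Step 1; the fold count of 5 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Assumption: the bundled dataset stands in for iris.csv.
X, y = load_iris(return_X_y=True)
X_rounded = np.round(X)  # same discretization as Step 1

# For classifiers, an integer cv uses stratified folds.
bernoulli_scores = cross_val_score(BernoulliNB(), X_rounded, y, cv=5)
gaussian_scores = cross_val_score(GaussianNB(), X_rounded, y, cv=5)
gaussian_raw_scores = cross_val_score(GaussianNB(), X, y, cv=5)

print(f"BernoulliNB (rounded): {bernoulli_scores.mean():.2%}")
print(f"GaussianNB (rounded):  {gaussian_scores.mean():.2%}")
print(f"GaussianNB (raw):      {gaussian_raw_scores.mean():.2%}")
```

The third line is included because Gaussian Naive Bayes does not need discrete inputs, so it can also be evaluated on the unrounded measurements.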