Naive Bayes Example on the Iris Dataset¶
Step 0: Load Dataset¶
In [ ]:
import pandas as pd
data = pd.read_csv('../.datasets/iris.csv')
data
Out[ ]:
| | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
150 rows × 5 columns
Step 1: Clean Dataset (if needed)¶
NOTE: Some Naive Bayes variants (such as Bernoulli and multinomial) work better with discrete features than with continuous ones. In this example, I will round all values to the nearest integer. For other algorithms, this step may not be needed.
In [ ]:
import numpy as np
data['sepal_length'] = np.round(data['sepal_length'])
data['sepal_width'] = np.round(data['sepal_width'])
data['petal_length'] = np.round(data['petal_length'])
data['petal_width'] = np.round(data['petal_width'])
data
Out[ ]:
| | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 5.0 | 4.0 | 1.0 | 0.0 | Iris-setosa |
| 1 | 5.0 | 3.0 | 1.0 | 0.0 | Iris-setosa |
| 2 | 5.0 | 3.0 | 1.0 | 0.0 | Iris-setosa |
| 3 | 5.0 | 3.0 | 2.0 | 0.0 | Iris-setosa |
| 4 | 5.0 | 4.0 | 1.0 | 0.0 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 7.0 | 3.0 | 5.0 | 2.0 | Iris-virginica |
| 146 | 6.0 | 2.0 | 5.0 | 2.0 | Iris-virginica |
| 147 | 6.0 | 3.0 | 5.0 | 2.0 | Iris-virginica |
| 148 | 6.0 | 3.0 | 5.0 | 2.0 | Iris-virginica |
| 149 | 6.0 | 3.0 | 5.0 | 2.0 | Iris-virginica |
150 rows × 5 columns
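Rounding to the nearest integer is only one way to discretize. Here is a minimal alternative sketch using scikit-learn's KBinsDiscretizer; it is purely illustrative and not used in the rest of this notebook, and the feature list and bin count are my own choices.
In [ ]:
# Hypothetical alternative to rounding: equal-width binning into 4 bins,
# encoded as ordinal integers. Not used in the rest of this notebook.
from sklearn.preprocessing import KBinsDiscretizer

features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
binner = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')
binned = pd.DataFrame(binner.fit_transform(data[features]), columns=features)
binned.head()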
Step 2: Split the dataset into training and testing sets¶
In [ ]:
X_data = data.drop(columns="class")  # keep all feature columns
X_data
Out[ ]:
| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | 5.0 | 4.0 | 1.0 | 0.0 |
| 1 | 5.0 | 3.0 | 1.0 | 0.0 |
| 2 | 5.0 | 3.0 | 1.0 | 0.0 |
| 3 | 5.0 | 3.0 | 2.0 | 0.0 |
| 4 | 5.0 | 4.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... |
| 145 | 7.0 | 3.0 | 5.0 | 2.0 |
| 146 | 6.0 | 2.0 | 5.0 | 2.0 |
| 147 | 6.0 | 3.0 | 5.0 | 2.0 |
| 148 | 6.0 | 3.0 | 5.0 | 2.0 |
| 149 | 6.0 | 3.0 | 5.0 | 2.0 |
150 rows × 4 columns
In [ ]:
y_data = data['class']
y_data
Out[ ]:
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
...
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica
Name: class, Length: 150, dtype: object
In [ ]:
from sklearn.model_selection import train_test_split
# Shuffle is True by default.
# In this example, I will split this into 80% train and 20% test.
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.20)
# The annotations below are informational type hints only; given DataFrame
# and Series inputs, train_test_split returns DataFrames for X and Series for y.
X_train: pd.DataFrame
X_test: pd.DataFrame
y_train: pd.Series
y_test: pd.Series
print("Train:", len(X_train))
print("Test:", len(X_test))
Train: 120
Test: 30
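Because shuffling is on and no random_state is set, every run produces a different split, so the accuracies later in this notebook will vary from run to run. A minimal sketch of a reproducible, stratified split follows; the seed value is arbitrary, and the _s names are only there to avoid clobbering the split above.
In [ ]:
# Sketch: fix the seed for reproducibility, and stratify on the label so
# each split keeps the same class proportions as the full dataset.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_data, y_data, test_size=0.20, random_state=42, stratify=y_data)
print("Train:", len(X_train_s), "Test:", len(X_test_s))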
Step 3: Train the model on the training set using a specified algorithm and hyperparameters.¶
In this example, we use a Naive Bayes classifier, specifically Bernoulli Naive Bayes.
In [ ]:
from sklearn.naive_bayes import BernoulliNB
# BernoulliNB(hyperparameter=value, ...)
# Here we use the default values. Note that BernoulliNB expects binary
# features: by default it binarizes every input at the threshold binarize=0.0.
model = BernoulliNB().fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred
Out[ ]:
array(['Iris-virginica', 'Iris-virginica', 'Iris-setosa', 'Iris-setosa',
'Iris-virginica', 'Iris-virginica', 'Iris-virginica',
'Iris-virginica', 'Iris-virginica', 'Iris-setosa', 'Iris-setosa',
'Iris-setosa', 'Iris-virginica', 'Iris-virginica',
'Iris-virginica', 'Iris-setosa', 'Iris-virginica', 'Iris-setosa',
'Iris-setosa', 'Iris-setosa', 'Iris-virginica', 'Iris-virginica',
'Iris-virginica', 'Iris-virginica', 'Iris-setosa',
'Iris-virginica', 'Iris-setosa', 'Iris-virginica',
'Iris-virginica', 'Iris-setosa'], dtype='<U15')
Step 4: Evaluate the model on the testing set.¶
In [ ]:
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)
print("Accuracy: {score:.2f}%".format(score = score * 100))
Accuracy: 60.00%
Well, that performed badly!¶
Only 60% accuracy. This is not surprising: BernoulliNB models binary features, and with its default binarize=0.0 every positive value becomes 1. After rounding, essentially the only feature value that is ever 0 is Iris-setosa's petal_width, so the binarized data can separate setosa from the rest but carries almost no other class information; notice that Iris-versicolor never appears in the predictions above. The sketch below confirms which classes were confused.
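A minimal diagnostic sketch with scikit-learn's confusion_matrix and classification_report; y_test and y_pred come from the cells above, and zero_division=0 just silences the warning for the class that was never predicted.
In [ ]:
from sklearn.metrics import confusion_matrix, classification_report
# Rows are true classes, columns are predicted classes (labels sorted
# alphabetically), so versicolor's misclassifications show up off-diagonal.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=0))
What about using another algorithm to train our model? Let's try the Gaussian Naive Bayes method instead, which models each continuous feature with a class-conditional normal distribution.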
In [ ]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB().fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred
Out[ ]:
array(['Iris-versicolor', 'Iris-versicolor', 'Iris-setosa', 'Iris-setosa',
'Iris-virginica', 'Iris-virginica', 'Iris-virginica',
'Iris-versicolor', 'Iris-virginica', 'Iris-setosa', 'Iris-setosa',
'Iris-setosa', 'Iris-virginica', 'Iris-versicolor',
'Iris-versicolor', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa',
'Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor',
'Iris-versicolor', 'Iris-versicolor', 'Iris-setosa',
'Iris-versicolor', 'Iris-setosa', 'Iris-versicolor',
'Iris-versicolor', 'Iris-setosa'], dtype='<U15')
In [ ]:
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)
print("Accuracy: {score:.2f}%".format(score = score * 100))
Accuracy: 86.67%
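One caveat: both accuracies come from a single random split. A minimal sketch using 5-fold cross-validation gives a more stable comparison of the two variants; cross_val_score is the standard scikit-learn helper, and the fold count is my choice.
In [ ]:
from sklearn.model_selection import cross_val_score
# Evaluate each variant on 5 different train/test folds of the full
# (rounded) dataset and average the accuracies.
for nb in (BernoulliNB(), GaussianNB()):
    scores = cross_val_score(nb, X_data, y_data, cv=5)
    print(f"{type(nb).__name__}: mean accuracy {scores.mean() * 100:.2f}%")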