How I achieved a classification accuracy of 78.78% on the Pima Indians Diabetes Dataset

I picked up my first Machine Learning dataset from this list, and after spending a few days on exploratory analysis and massaging the data, I arrived at an accuracy of 78.78%.

The code for this can be downloaded from GitHub, or you can run it directly on Kaggle.

Here’s how I did it:

After carefully observing the data, I converted the Insulin and Diabetes Pedigree Function features into categories. I then did a train/test split to prepare for analysis and standardized the features using StandardScaler() from sklearn.
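The preprocessing steps above can be sketched roughly as follows. This is an illustration under assumptions, not the exact code from the post: the data is a synthetic stand-in, the column names are assumed to follow the Kaggle CSV, and the bin edges, the 75/25 split and the random seed are all hypothetical choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption: column names follow the Kaggle CSV).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Glucose": rng.normal(120, 30, n),
    "BMI": rng.normal(32, 7, n),
    "Age": rng.integers(21, 80, n),
    "Insulin": rng.uniform(0, 600, n),
    "DiabetesPedigreeFunction": rng.uniform(0.08, 2.4, n),
    "Outcome": rng.integers(0, 2, n),
})

# Turn the two continuous features into ordinal categories.
# The bin edges are hypothetical, not the ones used in the post.
df["Insulin"] = pd.cut(
    df["Insulin"], bins=[-np.inf, 100, 200, np.inf], labels=False
)
df["DiabetesPedigreeFunction"] = pd.cut(
    df["DiabetesPedigreeFunction"], bins=[-np.inf, 0.5, 1.0, np.inf], labels=False
)

X = df.drop(columns="Outcome")
y = df["Outcome"]

# Split first, so the scaler never sees the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training split only is one reasonable way to keep test-set statistics from leaking into the model.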


After trying various algorithms (Logistic Regression, Random Forest and XGBoost), I tried a Support Vector Machine with a linear kernel and got an accuracy of 78.78% on this dataset. This is by far the highest consistent accuracy that I got.
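For readers who want to try the same kind of model, here is a minimal sketch of a linear-kernel SVM in sklearn. It runs on synthetic data from make_classification with the same shape as the dataset (768 rows, 8 features), so it will not reproduce the 78.78% figure. The split ratio and random seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data, used only as a placeholder.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Standardize, then fit a Support Vector Machine with a linear kernel.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_train, y_train)

# Accuracy on the held-out test set.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.4f}")
```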


I also noticed that the regularization parameter “C” didn’t have any impact on the final accuracy of the SVM.
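One way to check an observation like this is to refit the model for several values of C and compare the test accuracy. The sketch below does that on synthetic placeholder data. The list of C values is a hypothetical choice, and the results on the real dataset may differ.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic placeholder data with the same shape as the dataset.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Hypothetical grid of regularization strengths to compare.
c_values = [0.01, 0.1, 1, 10, 100]
results = {}
for c in c_values:
    model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=c))
    model.fit(X_train, y_train)
    # Pipeline.score reports mean accuracy on the given data.
    results[c] = model.score(X_test, y_test)

for c, acc in results.items():
    print(f"C={c}: test accuracy {acc:.4f}")
```

If the accuracies barely move across several orders of magnitude of C, that matches what I saw. A cross-validated comparison would be a more robust check than a single split.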

Happy “Machine” Learning
