Last year, in my Advanced Topics in Computer Science class, we were tasked with designing the best system for supervised classification (a machine learning approach that learns to assign data points to predefined categories) of a dataset of our choice. I’d been eyeing the Sberbank Russian Housing Market dataset for quite a while, and this final project turned out to be the perfect place to explore it.
Some context on the dataset: Sberbank, the largest bank in Russia, sponsored a Kaggle competition to predict house prices in Russia’s turbulent economy, with the goal of giving customers more accurate real estate price estimates. The dataset consists of 6000+ housing data points from around Russia, each representing the sale of a house. There are 278 features associated with each point, including nearby preschool count, distance from the metro, nearby cafe count, and nearby mosque count.
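Since sale price is a continuous value, framing this as a classification problem requires discretizing prices into categories first. The sketch below shows one common way to do that with quantile binning; the synthetic prices and the four class labels are stand-ins I made up for illustration, not the actual preprocessing from the report.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataset's sale-price column;
# the real Sberbank data is not reproduced here.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=15.5, sigma=0.5, size=6042))

# Bin continuous sale prices into quartile-based classes so the
# problem becomes supervised classification rather than regression.
labels = pd.qcut(prices, q=4, labels=["low", "mid-low", "mid-high", "high"])

print(labels.value_counts())
```

Quantile binning keeps the four classes roughly balanced, which matters for classifiers (like KNN) that are sensitive to class imbalance.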
Through scikit-learn, a Python machine learning library, I experimented with each of the four classifiers we learned about – Naive Bayes, decision tree, k-nearest neighbors (KNN), and support vector machine (SVM) – varying parameters (tree depth, number of neighbors, etc.) along the way to identify the best classification method. The resulting confusion matrices (for both the reclassification and leave-one-out methods – the latter shows how well the system generalizes to unseen data, as explained in this Stack Overflow answer) are included and analyzed in the conclusions of my attached report. Note that due to the sky-high dimensionality of the data (278 features for each of 6042 points), PCA plots and corresponding decision region visualizations would have taken days to render and were therefore excluded from the report.
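The comparison described above can be sketched as follows. This is a minimal illustration on synthetic data, not the code from the report: the parameter values (tree depth, neighbor count, kernel) and the tiny stand-in dataset are assumptions chosen so the example runs quickly, since leave-one-out on all 6042 points would be slow.

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import confusion_matrix, accuracy_score

# Small synthetic stand-in for the 6042-point, 278-feature dataset.
X, y = make_classification(n_samples=200, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

# The four classifiers compared in the writeup; these parameter
# settings are illustrative, not the ones actually tuned there.
classifiers = {
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "svm": SVC(kernel="rbf"),
}

for name, clf in classifiers.items():
    # Reclassification: train and predict on the same data.
    resub_pred = clf.fit(X, y).predict(X)
    # Leave-one-out: each point is predicted by a model trained
    # on all the other points, estimating generalization.
    loo_pred = cross_val_predict(clf, X, y, cv=LeaveOneOut())
    print(name,
          "reclassification acc:", round(accuracy_score(y, resub_pred), 3),
          "leave-one-out acc:", round(accuracy_score(y, loo_pred), 3))
    print(confusion_matrix(y, loo_pred))
```

Comparing the two accuracies per classifier is what exposes overfitting: a model whose reclassification score is high but whose leave-one-out score collapses has memorized the training points rather than learned the class boundaries.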
I hope you enjoy the writeup, and I’d love to hear what you think – let me know in the comments!

Sberbank Housing Data Classification – Rhea Kapur