
Category: Connections with Computer Science

Combining Russian Studies + Machine Learning: Price Classification on Sberbank Russian Housing Market Data with Python

Last year, in my Advanced Topics in Computer Science class, we were tasked with designing the best system for supervised classification (a machine learning approach in which a model learns to assign data points to predefined categories) of a dataset of our choice. I’d been eyeing the Sberbank Russian Housing Market dataset for quite a while; this final project turned out to be the perfect place to explore it.

Some context on the dataset: Sberbank, the largest bank in Russia, sponsored a Kaggle competition to predict house prices in Russia’s turbulent economy, with the goal of giving its customers more accurate real estate price estimates. The dataset consists of over 6,000 housing data points from around Russia, where each point is the sale of a house. Each point has 278 associated features, including the number of nearby preschools, distance from the metro, nearby cafe count, nearby mosque count, and more.
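One detail the competition leaves to the modeler: the Kaggle target (`price_doc`) is a continuous sale price, so framing the task as classification means first discretizing prices into classes. The report covers the exact setup; as a rough, hypothetical sketch (synthetic prices stand in for the real column, and the three-way quantile split is an assumption, not necessarily what I used):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the continuous `price_doc` target column.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=15, sigma=0.5, size=1000), name="price_doc")

# Quantile-based binning turns the regression target into balanced
# classes (here: low / medium / high) suitable for a classifier.
price_class = pd.qcut(prices, q=3, labels=["low", "medium", "high"])
print(price_class.value_counts())
```

Quantile bins (`pd.qcut`) keep the classes roughly equal in size, which avoids the class imbalance that fixed price thresholds would create on skewed housing prices.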

Through scikit-learn, a Python machine learning library, I experimented with each of the four classifiers we learned about – Naive Bayes, decision tree, k-nearest neighbors (KNN), and support vector machine (SVM) – varying parameters (tree depth, number of neighbors, etc.) along the way to identify the best classification method. The resulting confusion matrices (for both the reclassification and leave-one-out methods – the latter shows how well the system generalizes to unseen data, as explained in this Stack Overflow answer) are included and analyzed in the conclusions of my attached report. Note that due to the large number of features (278 for each of the 6,042 points), PCA plots and corresponding decision-region visualizations would have taken days to render and were therefore excluded from the report.

I hope you enjoy the writeup, and I’d love to hear what you think – let me know in the comments!

Sberbank Housing Data Classification – Rhea Kapur

Examining Slavic Language Speaker Statistics in the U.S. with R

Hey everyone!

For the Data Science unit in my Programming Language and Design class at school, I used R (a statistical programming language) to visualize data on people in the U.S. who speak a Slavic language at home along with their respective English proficiencies.

I love data science and creating cool graphs and visualizations through programming. My research (more on that coming soon!) is done almost entirely in R, so I’m very familiar with the language and various IDEs that people use (RStudio, Vim, Emacs + ESS, etc.). I’m also an avid reader of the daily email newsletter from R-bloggers, a site that offers great R tutorials and discussion as well as a strong community of R users.

Of course, I can’t bear to leave out Twitter, which has introduced me to many awesome women in data science. Rachael Tatman, a Data Scientist at Kaggle with a PhD in linguistics from the University of Washington, is one of my inspirations. She works mainly with R and Python (also used for data science, with a broader ecosystem for algorithms and machine learning) and does really cool research in computational sociolinguistics, specifically looking at emoji and at how computational systems process different dialects.

For this project, I used R Markdown to create a report and add descriptions and analysis; it is attached below. My favorite graphs are the map and the pie charts on the second page. R has some really nice color palettes (check out RColorBrewer!) that make graphs look amazing, and those were pretty cool to play around with. Overall, it was an immensely fun project.

Hope you enjoy reading!

Examining Slavic Language Speaker Statistics in the United States