Last year, in my Advanced Topics in Computer Science class, we were tasked with designing the best system for supervised classification (a machine learning approach to separating data into representative regions) of a dataset of choice. I’d been eyeing the Sberbank Russian Housing Market dataset for quite a while leading up; this final project turned out to be the perfect place to explore it.
Some context on the dataset: Sberbank, the largest bank in Russia, sponsored a Kaggle competition to predict house prices in Russia’s turbulent economy to help them give more accurate real estate price predictions to their customers. Their given dataset consists of 6000+ housing data points from around Russia, where each point is the sale of a house. There are 278 features associated with each point, including: preschool count nearby, distance from the metro, cafe count nearby, mosque count nearby, etc.
Using scikit-learn, a Python machine learning library, I experimented with each of the four classifiers we learned about – Naive Bayes, decision tree, k-nearest neighbors (KNN), and support vector machine (SVM) – and varied their parameters (tree depth, number of neighbors, etc.) along the way to identify the best classification method. The resulting confusion matrices, for both the reclassification and leave-one-out methods, are included and analyzed in the conclusions of my attached report. Reclassification tests the system on the same data it was trained on, while leave-one-out tests each point with a system trained on all the others, so the latter shows how well the system generalizes to unseen data (as stated by this Stack Overflow answer). Note that due to the sky-high number of features (278 for each of the 6042 points), PCA plots and the corresponding decision region visualizations would have taken days to render and were therefore excluded from the report.
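To make the difference between the two evaluation methods concrete, here is a minimal sketch in plain Python (not scikit-learn, and not the code from my report). The seven data points, the two features (area and distance to the metro), and the "low"/"high" price classes are all made up for illustration, and the classifier is a bare-bones KNN with k = 1.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, x, k=1):
    """Predict the label of x by majority vote among its k nearest training points."""
    nearest = sorted((math.dist(p, x), label) for p, label in zip(train_X, train_y))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

def reclassification(X, y, k=1):
    """Test every point against a classifier trained on ALL points (itself included)."""
    return [knn_predict(X, y, x, k) for x in X]

def leave_one_out(X, y, k=1):
    """Test every point against a classifier trained on all the OTHER points."""
    predictions = []
    for i, x in enumerate(X):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        predictions.append(knn_predict(rest_X, rest_y, x, k))
    return predictions

def confusion_matrix(true_y, pred_y, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(true_y, pred_y):
        matrix[index[t]][index[p]] += 1
    return matrix

# Hypothetical toy data: (area in square meters, distance to metro in km).
X = [(30, 5.0), (35, 4.5), (40, 4.0), (80, 1.0), (90, 0.8), (85, 1.2), (38, 4.2)]
y = ["low", "low", "low", "high", "high", "high", "high"]  # last point is an outlier
labels = ["low", "high"]

reclass_cm = confusion_matrix(y, reclassification(X, y), labels)
loo_cm = confusion_matrix(y, leave_one_out(X, y), labels)
print("reclassification:", reclass_cm)
print("leave-one-out:   ", loo_cm)
```

With k = 1, reclassification is perfect on this toy data because every point is its own nearest neighbor, while leave-one-out exposes the mistakes around the outlier. That gap is why the leave-one-out matrices are the ones to trust when judging generalization.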
I hope you enjoy the writeup, and I’d love to hear what you think – let me know in the comments!
In my Advanced Topics in Computer Science class at school, we recently implemented k-Means Image Segmentation. The algorithm works by partitioning the dataset into k non-overlapping subgroups, or clusters. In this case, the dataset would be the set of pixels of an image (as we are performing image segmentation, or the process of breaking an image up into different sections). We are doing image segmentation based on color (on the R, G, B values), so our clusters would essentially be pixels that have the most similar colors.
Here’s a brief overview. See this link for more details.
Set a number of k clusters.
Initialize the k centroids (or cluster centers) by randomly selecting k points from the shuffled set of pixels.
At each iteration of the algorithm,
Compute the sum of the squared distance between all data points and all centroids.
Determine, for each pixel, which centroid is closest.
Assign that pixel to the corresponding (closest) cluster.
Re-assign each cluster’s center (i.e. re-compute the centroid) by averaging all of the data points in the cluster. Since we are clustering on color, this means averaging the R, G, and B values of all pixels in the cluster. The new centroid will then be (Ravg, Gavg, Bavg).
Stop iterations after a specified number has passed, or a certain error threshold has been reached, etc. You can set any end condition, just know that k-Means is an iterative algorithm, and it is in the programmer’s hands to terminate it.
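The steps above can be sketched in a few lines of plain Python. This is a minimal illustration and not the code from my GitHub: it clusters on color, so each pixel is an (R, G, B) tuple and each centroid is an average color. The function names, the `init` parameter, the `max_iters` default, and the six sample pixels are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two (R, G, B) tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_colors(pixels, k, max_iters=20, seed=0, init=None):
    """Cluster (R, G, B) pixels into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Initialize centroids: k randomly chosen pixels, unless given explicitly.
    centroids = list(init) if init is not None else rng.sample(pixels, k)
    assignments = None
    for _ in range(max_iters):  # end condition 1: iteration limit
        # Assign each pixel to the cluster whose centroid is closest in color.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in pixels
        ]
        if new_assignments == assignments:  # end condition 2: nothing changed
            break
        assignments = new_assignments
        # Re-compute each centroid as the average color of its pixels.
        for c in range(k):
            members = [p for p, a in zip(pixels, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(ch) / len(members) for ch in zip(*members))
    return centroids, assignments

# Hypothetical "image": three reddish and three bluish pixels, flattened to a list.
pixels = [(250, 10, 10), (240, 20, 15), (255, 0, 5),
          (10, 10, 250), (20, 15, 240), (0, 5, 255)]

centroids, assignments = kmeans_colors(pixels, k=2)
# The segmented image replaces every pixel with the color of its cluster's centroid.
segmented = [centroids[a] for a in assignments]
print(segmented)
```

For a real image you would first flatten the grid of pixels into a list like `pixels`, run the clustering, and then reshape `segmented` back into the original width and height.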
You can find the code here, on my GitHub. In the meantime, enjoy the segmentations and the analysis at the end!
Some key insights:
The k-Means algorithm favors separating different levels of shading (the colors that represent them) rather than separating distinctly different colors. I had originally thought this might be to incorporate detail, but going back through the steps of the algorithm revealed that shaded colors are simply more common than spots of different, vibrant, eye-catching colors. For instance, the shaded parts of the matryoshka dolls cover a far greater area than the blue pixels in their small blue eyes, and are therefore more likely to be picked as initial centroids. If you manually set a starting centroid to a pixel in that small blue region, though, that color would be captured (albeit covering a very small portion of the segmented image).
The algorithm runs faster on images with fewer pixels, as would be expected. And, some image-specific observations: note the good results on the FIALKA image with k = 4 (i.e., four clusters) – the texture and 3D aspect of the photo are captured really well. For the image of the person in Moscow fog, with St. Basil’s Cathedral in the background, the sky is separated into lighter and darker parts. You can see this gradation in the original image as well, but it is definitely not as distinct as the classification would suggest (it’s actually much more gradual).
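A quick back-of-the-envelope calculation shows how unlikely it is for a small region to be picked during random initialization. The sketch below computes the chance that at least one of the k starting pixels lands in a region of a given size. The image size (100 x 100) and the 50 blue pixels are made-up numbers, not measurements from the matryoshka image.

```python
from math import comb

def prob_region_seeded(total_pixels, region_pixels, k):
    """Chance that at least one of k distinct, randomly chosen pixels is in the region."""
    # Probability that all k picks miss the region (sampling without replacement).
    all_miss = comb(total_pixels - region_pixels, k) / comb(total_pixels, k)
    return 1 - all_miss

# Hypothetical numbers: a 100 x 100 image where the blue eyes cover 50 pixels.
p_blue = prob_region_seeded(total_pixels=100 * 100, region_pixels=50, k=4)
print(f"{p_blue:.1%}")
```

Under these assumed numbers the blue region is seeded only about 2% of the time with k = 4. Being seeded does not guarantee the color survives the later iterations, but it does explain why the common shaded colors tend to dominate the initial centroids.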
Hope you enjoyed reading! As always, let me know if you have any questions/thoughts in the comments. До скорого, Рая!
For the Data Science unit in my Programming Language and Design class at school, I used R (a statistical programming language) to visualize data on people in the U.S. who speak a Slavic language at home along with their respective English proficiencies.
I love data science and creating cool graphs and visualizations through programming. My research (more on that coming soon!) is done almost entirely in R, so I’m very familiar with the language and various IDEs that people use (RStudio, Vim, Emacs + ESS, etc.). I’m also an avid reader of the daily email newsletter from R-bloggers, a site that offers great R tutorials and discussion as well as a strong community of R users.
Of course, I can’t bear to leave out Twitter, which has introduced me to many awesome women in data science. Rachael Tatman, a Data Scientist at Kaggle with a PhD in linguistics from the University of Washington, is one of my inspirations. She works mainly with R and Python (also used for data science, but more powerful in terms of algorithms/machine learning) and does really cool research in computational sociolinguistics, specifically looking at emoji and how different dialects are processed by computational systems.
For this project, I used RMarkdown to create a report and add descriptions and analysis. It is attached below. My favorite graphs are the map and the pie charts on the second page. R has some really nice color palettes (check out RColorBrewer!) to make graphs look amazing, and those were pretty cool to play around with. It was overall an immensely fun project.
Ever since discovering Karen Spärck Jones and learning more about her, I’ve been reading a lot about information retrieval and natural language processing – especially as it relates to search engines. As I researched, I started to wonder about other languages. What about search engines for people who solely speak Spanish? Hindi? French? And even Russian! Are there gaps in efficiency & quality of these search engines compared to those in English, the world’s most widely spoken language? And if so, do any languages suffer more, and why?
I dug a bit more into widely used Russian search engines, out of curiosity. Apparently, the three most widely used ones are Yandex, Google.ru, and Rambler, listed in order of their popularity. I’ll focus on Yandex (or Яндекс, in Russian) in particular.
This past Friday, while reading the New York Times, I stumbled upon the Obituaries section. As I browsed through them, I found a particularly captivating one, published in January, that really struck me. It’s part of the “Overlooked” series, which features obituaries for people who were not honored at the time of their death.
Entitled “Overlooked No More: Karen Sparck Jones, Who Established the Basis for Search Engines,” the obituary describes Karen Spärck Jones (1935-2007), a self-taught programmer and an advocate for women in computer science. After meeting Margaret Masterman, the head of the Cambridge Language Research Unit, while studying at Cambridge University, Spärck Jones was inspired to enter natural language processing (NLP), a field in CS dedicated to improving how computers and machines interact with and process human language. Spärck Jones went on to publish “Synonymy and Semantic Classification” in 1964, one of the foundational papers in NLP, and also coauthored a seminal textbook about the field. In 1994, she was named president of the Association for Computational Linguistics (ACL), which hosts a highly regarded annual conference that, to this day, brings new research in computational linguistics and NLP to light. As a full-time professor at Cambridge, Spärck Jones also mentored many researchers, both male and female, and came up with the slogan “Computing is too important to be left to men” to encourage more women to enter the field.
As a woman, and an avid, self-taught programmer & aspiring linguist myself, I am in awe of Karen Spärck Jones. Even today, she has a huge impact. Her research continues to be cited, and her formulas are only just being implemented today – that goes to show just how ahead of her time she was. Also, 30 years ago, the gender disparity in CS was even more severe and difficult to deal with than it is today – and she dealt with it and surpassed it. She eventually became Professor of Computers and Information at Cambridge, though she noted that it took too long because of her gender, and that bothered her. And this is all in addition to conducting highly influential research. Pretty inspiring stuff.
The NYT obituary also brings to light her absolutely critical research in information retrieval (applicable to search systems, etc.). Spärck Jones published (another!) seminal paper, this time in the area of information processing, entitled “A statistical interpretation of term specificity and its application in retrieval” in the Journal of Documentation. She took a combined statistics + linguistics approach and came up with index-term weighting, where each query term is weighted by how many documents in the collection it appears in (think: all your possible search results). Terms that appear in fewer documents are more specific and count for more, so the most relevant results are retrieved. This research remains the basis of search engines like Google, even today! Spärck Jones tested her weighting formula on a number of well-known collections, and she even noted that reducing query words to their stems is important & produces better results: for instance, keeping “computing” as “comput-” to account for documents that may include “computation,” “computers,” “computed,” and more. I found that especially interesting and relevant – I’ve seen this accounted for countless times when I google things! In the rest of the paper, Spärck Jones devises and presents the results of her own weighting formula. It’s quite interesting stuff – I would recommend taking a look at the paper!
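To get a feel for the idea, here is a minimal sketch of term weighting in plain Python. It uses the commonly cited inverse document frequency form, log2(N / n), where N is the number of documents and n is the number of documents containing the term. This may differ from the exact formulation in the paper, and the four tiny "documents" and the function names are made up for illustration.

```python
import math

def idf_weights(documents):
    """Weight each term by log2(N / n): the rarer the term, the higher the weight."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(doc.lower().split()):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log2(n_docs / df) for term, df in doc_freq.items()}

def score(query, document, weights):
    """Sum the weights of the query terms that appear in the document."""
    doc_terms = set(document.lower().split())
    return sum(weights.get(t, 0.0) for t in query.lower().split() if t in doc_terms)

# Hypothetical mini-collection.
documents = [
    "the history of computing",
    "the price of housing in moscow",
    "the metro in moscow",
    "computing the price of a search",
]

weights = idf_weights(documents)
query = "the metro"
best = max(documents, key=lambda d: score(query, d, weights))
print(weights["the"], weights["computing"], weights["metro"])
print(best)
```

A word like "the" appears in every document, so it gets a weight of zero and cannot influence the ranking, while a rare word like "metro" dominates it. Stemming would be a separate preprocessing step applied to both queries and documents before counting.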
As a closing note, it was only later that day that I realized that it was International Women’s Day (Friday, March 8th, 2019). So I bring you this blog post to just point out this truly amazing, impactful woman – now one of my own inspirations.