Month: March 2019

Ever Heard of Yandex?

Ever since discovering Karen Spärck Jones and learning more about her, I’ve been reading a lot about information retrieval and natural language processing – especially as it relates to search engines. As I researched, I started to wonder about other languages. What about search engines for people who solely speak Spanish? Hindi? French? And even Russian! Are there gaps in efficiency & quality of these search engines compared to those in English, the world’s most widely spoken language? And if so, do any languages suffer more, and why?

I dug a bit more into widely used Russian search engines, out of curiosity. Apparently, the three most widely used ones are Yandex, Google.ru, and Rambler, listed in order of their popularity. I’ll focus on Yandex (or Яндекс, in Russian) in particular.

Yandex.ru main page

In Honor of International Women’s Day: Karen Spärck Jones, Information Processing and NLP Pioneer

This past Friday, while reading the New York Times, I stumbled upon the Obituaries section. As I browsed through them, I found a particularly captivating one, published in January, that really struck me. It’s part of the “Overlooked” series, which features obituaries for people who were not honored at the time of their death.

Entitled “Overlooked No More: Karen Sparck Jones, Who Established the Basis for Search Engines,” the obituary describes Karen Spärck Jones (1935-2007), a self-taught programmer and an advocate for women in computer science. After meeting Margaret Masterman, the head of the Cambridge Language Research Unit, while studying at Cambridge University, Spärck Jones was inspired to enter natural language processing (NLP), a field in CS dedicated to improving how computers and machines interact with and process human language. Spärck Jones went on to publish “Synonymy and Semantic Classification” in 1964, one of the foundational papers in NLP, and also coauthored a seminal textbook about the field. In 1994, she was named president of the Association for Computational Linguistics (ACL), which hosts a highly regarded annual conference that, to this day, brings new research in computational linguistics and NLP to light. As a full-time professor at Cambridge, Spärck Jones also mentored many researchers, both male and female, and came up with the slogan “Computing is too important to be left to men” to encourage more women to enter the field.

As a woman, and an avid, self-taught programmer & aspiring linguist myself, I am in awe of Karen Spärck Jones. Even today, she has a huge impact. Her research continues to be cited, and her formulas are only just being implemented today – that goes to show just how ahead of her time she was. Also, 30 years ago, the gender disparity in CS was even more severe and difficult to deal with than it is today – and she dealt with it and surpassed it. She eventually became Professor of Computers and Information at Cambridge, though she noted that it took too long because of her gender, and that bothered her. And this is all in addition to conducting highly influential research. Pretty inspiring stuff.

The NYT obituary also brings to light her absolutely critical research in information retrieval (applicable to search systems, etc.). Spärck Jones published (another!) seminal paper, this time in the area of information processing, entitled “A statistical interpretation of term specificity and its application in retrieval” in the Journal of Documentation. She used a combined statistics + linguistics approach and came up with index-term weighting, where each word in a query is weighted by how many documents in the collection it appears in (think: all your search results). Words that show up in only a few documents are more specific, so they count for more, and based on those weights, the most relevant results are retrieved. This research remains the basis of search engines like Google, even today! Spärck Jones tested her weighting formula on a number of different, well-known collections, and she even notes that reducing query words to their stems is important & produces better results. For instance, “computing” is kept as “comput-” to account for documents that may include “computation,” “computers,” “computed,” and more. I found that especially interesting and relevant – I’ve seen this accounted for countless times when I google things! In the rest of the paper, Spärck Jones devises and presents the results of her own weighting formula. It’s quite interesting stuff – I would recommend taking a look at the paper!

Karen Spärck Jones and her 1972 term weighting formula.
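
To make the idea of term weighting a bit more concrete, here is a minimal Python sketch. It is not the exact formula from the paper: it uses the common modern form, log(N / n), where N is the number of documents and n is the number of documents containing the term. The tiny document collection, the suffix list, and the function names (`crude_stem`, `idf_weights`, `score`) are all made up for illustration, and the "stemmer" is only a toy stand-in for a real one.

```python
import math
from collections import Counter

# Hypothetical toy collection (not data from the paper).
DOCS = [
    "computers search documents",
    "computing search results",
    "search engines rank documents",
    "linguistics and computation",
]

# Toy suffix list, an assumption for illustration only.
SUFFIXES = ("ation", "ing", "ers", "ed", "er", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least 4 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def idf_weights(docs):
    """Weight each stemmed term by log(N / n): rarer terms weigh more."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # A set, so each term is counted at most once per document.
        doc_freq.update({crude_stem(w) for w in doc.lower().split()})
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def score(query, doc, weights):
    """Sum the weights of the stemmed terms shared by query and document."""
    query_terms = {crude_stem(w) for w in query.lower().split()}
    doc_terms = {crude_stem(w) for w in doc.lower().split()}
    return sum(weights.get(t, 0.0) for t in query_terms & doc_terms)


weights = idf_weights(DOCS)
query = "computed results"
ranked = sorted(DOCS, key=lambda d: score(query, d, weights), reverse=True)
print(ranked[0])
```

In this toy collection, "comput-" appears in three of the four documents, so it carries little weight, while "result" appears in only one and counts for much more. Stemming is what lets the query word "computed" match "computing" and "computers" at all.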

As a closing note, it was only later that day that I realized it was International Women’s Day (Friday, March 8th, 2019). So I bring you this blog post to point out this truly amazing, impactful woman – now one of my own inspirations.