Category: Linguistics

In Honor of International Women’s Day: Karen Spärck Jones, Information Processing and NLP Pioneer

This past Friday, while reading the New York Times, I stumbled upon the Obituaries section. As I browsed through them, I found a particularly captivating one, published in January, that really struck me. It’s part of the “Overlooked” series, which features obituaries for people who were not honored at the time of their death.

Entitled “Overlooked No More: Karen Sparck Jones, Who Established the Basis for Search Engines,” the obituary describes Karen Spärck Jones (1935-2007), a self-taught programmer and an advocate for women in computer science. After meeting Margaret Masterman, the head of the Cambridge Language Research Unit, while studying at Cambridge University, Spärck Jones was inspired to enter natural language processing (NLP), a field of CS dedicated to improving how computers process and interact with human language. Spärck Jones went on to publish “Synonymy and Semantic Classification” in 1964, one of the foundational papers in NLP, and also coauthored a seminal textbook about the field. In 1994, she was named president of the Association for Computational Linguistics (ACL), which hosts a highly regarded annual conference that, to this day, brings new research in computational linguistics and NLP to light. As a full-time professor at Cambridge, Spärck Jones also mentored many researchers, both male and female, and came up with the slogan “Computing is too important to be left to men” to encourage more women to enter the field.

As a woman, and an avid, self-taught programmer & aspiring linguist myself, I am in awe of Karen Spärck Jones. Even today, she has a huge impact. Her research continues to be cited, and her formulas are only just being implemented today – that goes to show just how ahead of her time she was. Also, 30 years ago, the gender disparity between men & women in CS was even more severe and difficult to deal with than it is today – and she dealt with it and overcame it. She eventually became Professor of Computers and Information at Cambridge, though she noted that it took too long – because of her gender – and that bothered her. And this is all in addition to conducting highly influential research. Pretty inspiring stuff.

The NYT obituary also brings to light her absolutely critical research in information retrieval (applicable to search systems, etc.). Spärck Jones published (another!) seminal paper, this time in the area of information processing, entitled “A statistical interpretation of term specificity and its application in retrieval,” in the Journal of Documentation in 1972. She used a combined statistics + linguistics approach and came up with index-term weighting, where each term is weighted by how rarely it appears across the documents in a collection (think: all your possible search results). A query term found in only a few documents counts for more than one found in nearly all of them, so the most relevant results are retrieved first. This research remains the basis of search engines like Google, even today! Spärck Jones tested her weighting formula on a number of different, well-known collections, and she even noted that reducing the words in queries to their stems is important & produces better results. For instance, keeping “computing” as “comput-” to account for documents that may include “computation,” “computers,” “computed,” and more. I found that especially interesting and relevant – I’ve seen this accounted for countless times when I google things! In the rest of the paper, Spärck Jones devises her own weighting formula and presents its results. It’s quite interesting stuff – I would recommend taking a look at the paper!

Karen Spärck Jones and her 1972 term weighting formula.
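
To make the idea a bit more concrete, here is a small Python sketch of term weighting in the spirit of her paper. Fair warning: this uses the textbook inverse document frequency form, log2(N / n), where N is the number of documents and n is the number of documents containing the term, which is not necessarily the exact expression from the 1972 paper. The `crude_stem` function and the four tiny "documents" are my own made-up illustrations, and the stemmer is nowhere near a real one.

```python
import math
from collections import Counter

# Hypothetical toy stemmer (an assumption for illustration, not a real
# stemming algorithm): strip a few common English suffixes.
SUFFIXES = ("ation", "ing", "ers", "er", "ed", "s")

def crude_stem(word: str) -> str:
    for suffix in SUFFIXES:
        # Keep at least 4 characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def idf_weights(documents: list[str]) -> dict[str, float]:
    """Weight each stemmed term by log2(N / n): rarer terms score higher."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # A set, so each term is counted once per document.
        terms = {crude_stem(word) for word in doc.lower().split()}
        doc_freq.update(terms)
    return {term: math.log2(n_docs / n) for term, n in doc_freq.items()}

# Made-up mini collection, purely for illustration.
docs = [
    "computing the index",
    "the computers search",
    "the computation of weights",
    "the library catalogue",
]
weights = idf_weights(docs)

print(weights["the"])     # appears in every document → 0.0
print(weights["index"])   # appears in one of four documents → 2.0
print(weights["comput"])  # computing/computers/computation share this stem
```

Notice that “the” ends up with a weight of zero because it appears everywhere, while a term that shows up in a single document gets the highest weight. The three “comput-” words are counted as one term, which is the stemming point from the paper.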

As a closing note, it was only later that day that I realized that it was International Women’s Day (Friday, March 8th, 2019). So I bring you this blog post to just point out this truly amazing, impactful woman – now one of my own inspirations.

Linguistics and Genetics in Russian History: What’s the Connection?

For my first foray into Russian history, I’m reading Russia: The Once and Future Empire from Pre-History to Putin by Philip Longworth.

I have to say – I’m thoroughly enjoying it. The first chapter alone touches upon different environmental, linguistic, technological, and cultural aspects of Russian history that have shaped who the Russian people are today. For example, at least in early Russian civilization, women were apparently valued more highly than men, yet they were still subject to familiar stereotypes (there only to bear children, or solely responsible for providing, as Longworth puts it, “care and comfort”). Interestingly enough, Longworth mentions that the development of metal technology, which was essential for Russia’s technological revolution, may have played a role in reversing that status and re-establishing men as the center of society – talk about pros and cons!

There was, however, this one quote that really piqued my interest and got me thinking:

“Interestingly, geneticists suggest that linguistic variations are roughly in line with genetic variations. The Russian language and the genes that make Russians what they are physically are evidently inseparable.”

As a definite linguistics nerd and someone who’s always found “what makes you, you” (a.k.a. genes!) super cool, I was truly struck by that line.

Some background: according to Longworth, the geographical environment and climate primarily shaped the genetic structure of Russians, although there was some slight differentiation from mating with other ethnic groups. For instance, Longworth writes: “in more northerly areas, where [Russians] had less exposure to sunlight, their hair grew fairer and their skin lighter.” As the Russians migrated northward, they also faced geographical barriers such as dense forests, marshland, and a mountainous landscape which further diversified them genetically.

Interestingly, as the physical traits of the Russians transformed under geographical/climatic pressure, Old Slavonic (the first Slavic language) diversified at the same time, for some of the same reasons. Geographical barriers were responsible not only for genetic variation, but also for separating societies and promoting linguistic differentiation.

I thought this was pretty amazing – I didn’t even know there was a connection between the two! But I had so many questions. Why did geography play such a huge role in the diversification? And was this an isolated pattern, or one instance (maybe even a first indication) of a broader trend in human evolution?

I did some further research on the topic; turns out, the answer leans toward the latter. Longworth describes a generally well-known phenomenon in linguistics and biology, although so far it has only been documented in certain regions. I came across a 2011 paper called Parallel Evolution of Genes and Languages in the Caucasus Region, where the authors analyzed the languages and DNA of indigenous peoples from different populations in the Caucasus region (which includes parts of Russia, Georgia, Azerbaijan, and Armenia). They concluded that there was a strong correlation between genetic, geographical, and linguistic variation and that there was strong support for parallel evolution between language families and people’s physical traits. Another paper, published in the journal Current Biology, confirmed the existence of such a relationship in Cape Verde between the Creole population and their languages (Parallel Trajectories of Genetic and Linguistic Admixture in a Genetically Admixed Creole Population).

So, I hope you found this pretty interesting too. Let me know what you think in the comments, and stay tuned for more as I make my way through the book!
