29 Dec 2016
“The wisdom of humanity is coded in language.”
Lyle Campbell, an American scholar and linguist known for his work on indigenous American languages, hints at a problem much graver than most would care to admit. Even though I speak English fluently, there are times when it just seems more natural to speak in my native language, Hindi. It’s not that it’s any easier or more practical; it’s just that it seems more fitting in that particular situation. This is perhaps difficult to convey in words (isn’t that ironic?), but any bilingual speaker can confirm that instant connection you feel when you talk to someone in a native tongue, especially if it’s in a place where few people speak that language.
There is a Portuguese word, saudade, a term commonly used in Galician literature and heard in the music of Brazil. What strikes me most about saudade, and many other such words, is that they are untranslatable to other languages, yet so undeniably potent. The concept of saudade portrays a melancholy nostalgia for something that perhaps has not even happened. Take, for example, mamihlapinatapai, which is derived from the Yaghan language of Tierra del Fuego and known to be one of the hardest words to translate. It refers to an expressive and meaningful silence, the look that is shared across the table by two people where each understands the other and is in agreement with what is being expressed. I could go on with more examples of elaborate words in foreign tongues, but the point here is that there is a vast expanse of emotion captured in languages that is difficult to convey otherwise.
27 Sep 2016
In the early morning of 15 April 1912, a British passenger liner sank in the North Atlantic Ocean after colliding with an iceberg. More than 1,500 passengers died in the sinking, making it one of the deadliest maritime disasters. Since then, the Titanic has become one of the most famous ships in history, her memory kept alive in various forms of pop culture, museums, books and films.
We can use machine learning to explore some interesting questions. How much of a role did a passenger’s socio-economic status play in their chance of survival? Did their name or age make a difference? What about siblings, parents or children? Is one of these factors more significant than the rest? Using decision trees and a random forest model, we can analyze the passenger data from the ship, answer some of these questions and create a classifier that can predict whether a passenger survived the tragedy.
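To make the decision tree idea a little more concrete, here is a minimal sketch of the step at the heart of such a tree: choosing which feature to split on by comparing Gini impurity. It is not the model from the post. The feature names `ticket_class` and `has_family`, the helper functions, and the toy rows are all assumptions made up for illustration, and the rows are not real passenger records.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def split_impurity(rows, feature, target="survived"):
    """Weighted Gini impurity after grouping rows by one categorical feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    total = len(rows)
    return sum(len(g) / total * gini(g) for g in groups.values())


def best_feature(rows, features, target="survived"):
    """Feature whose split leaves the lowest weighted impurity."""
    return min(features, key=lambda f: split_impurity(rows, f, target))


# Made-up toy rows for illustration only; NOT real passenger records.
toy_rows = [
    {"ticket_class": "first", "has_family": True, "survived": 1},
    {"ticket_class": "first", "has_family": False, "survived": 1},
    {"ticket_class": "first", "has_family": True, "survived": 1},
    {"ticket_class": "third", "has_family": False, "survived": 0},
    {"ticket_class": "third", "has_family": True, "survived": 0},
    {"ticket_class": "third", "has_family": False, "survived": 1},
]

if __name__ == "__main__":
    for name in ("ticket_class", "has_family"):
        print(name, round(split_impurity(toy_rows, name), 3))
    print("split on:", best_feature(toy_rows, ["ticket_class", "has_family"]))
```

A full decision tree repeats this choice recursively on each resulting group, and a random forest averages many such trees, each grown on a random sample of rows and features.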
11 Sep 2016
Nearly twenty years ago, Kurt Vonnegut, an American author perhaps best known for his satirical novel Slaughterhouse-Five, gave a lecture that would change the way we think about stories. Standing in front of a blackboard, chalk in hand, he proclaims, “There’s no reason why the simple shapes of stories can’t be fed into computers; they are beautiful shapes.” He then proceeds to plot a cosine curve, and amidst applause and laughter, playfully declares, “People love this story!”
Those Who Tell the Stories Rule the World
The notion Vonnegut explores is an interesting one - can we quantitatively look at writing to understand how it is emotionally structured? When we read, we feel emotionally connected to the writing. We get so ‘lost’ in the fictional world and fall so deep into it that our own emotions become mapped to the narrative. In fact, narrative transportation theory in psychology studies exactly this. The quantitative meta-analysis by Van Laer, De Ruyter, Visconti and Wetzels on the effects of narrative transportation alludes to readers ‘mentally enter(ing) a world that a story evokes’. We feel what we read, and being able to understand how these emotions vary over the course of a story is, I think, an extremely interesting intellectual pursuit.
More importantly, this discussion leads to some interesting questions that we can now address through data analysis of big datasets. How does this emotional structure vary over generations of writing, from late 16th century Shakespeare to modern day Pratchett? How do these trends differ between cultures - how similar or different are Indian and Japanese literature in their emotional structure? Do certain authors have an emotional signature - a unique structure to their stories, a formula to their writing? Given an emotional structure, can we predict what kind of story it is (or perhaps even predict its ending)?
Many of these questions were inspired by the research of Andrew Reagan and the Computational Story Lab at the University of Vermont, where they used sentiment analysis to analyze the ‘emotional arcs’ of 1,700 stories to reveal the most common ones. Their findings are fascinating - according to the research, all stories conform to one of six basic emotional arcs.
As an avid reader, I find this research fascinating. In this multi-part blog series, I will try to understand the concept of an emotional timeseries in a piece of literature, and how it is affected by various factors. In future posts, I will attempt to address some of the more interesting questions that I brought up earlier.
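One simple way to picture an emotional timeseries is to slide a window over the words of a text and average a sentiment score inside each window. The sketch below does exactly that. It is an assumption-laden illustration, not the method used by the Computational Story Lab: the `TOY_LEXICON` scores, the `emotional_arc` function, and the window sizes are all hypothetical.

```python
import re

# Hypothetical mini-lexicon for illustration; a real analysis would use a
# published sentiment lexicon with thousands of scored words.
TOY_LEXICON = {
    "love": 1.0, "joy": 1.0, "happy": 1.0, "hope": 0.5,
    "sad": -1.0, "death": -1.0, "fear": -1.0, "lost": -0.5,
}


def emotional_arc(text, window=4, step=2, lexicon=TOY_LEXICON):
    """Mean sentiment score of each sliding window of words across the text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return []
    arc = []
    # The last window starts where a full window still fits (or at 0 for short texts).
    last_start = max(len(words) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = words[start:start + window]
        # Words missing from the lexicon count as neutral (0.0).
        scores = [lexicon.get(word, 0.0) for word in chunk]
        arc.append(sum(scores) / len(scores))
    return arc


if __name__ == "__main__":
    # Made-up sentence, used only to show the shape of the output.
    sample = "They were happy and full of hope until fear and death left them lost"
    print(emotional_arc(sample))
```

Plotting the returned list against window position gives a rough curve of how the sentiment rises and falls; larger windows smooth the curve, smaller ones make it noisier.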
26 Jul 2016
I was looking for public datasets to explore the other day, and I ran into Yelp’s dataset from the Yelp Dataset Challenge. After poking around the data, I realized that it was a treasure trove of data for local businesses – it had around 2.4GB of data and invaluable information ranging from details like location and opening hours of the businesses, to user reviews about service and quality of food.
There are five different datasets:
yelp_academic_dataset_business contains details about businesses such as opening and closing hours, location, categories, number of reviews, ratings, as well as other attributes ranging from whether it takes reservations to whether it would be considered ‘hipster’.
yelp_academic_dataset_checkin contains all the check-in information at businesses.
yelp_academic_dataset_review contains the reviews for all the businesses, as well as the number of stars associated with the review. It also contains information on whether the review was rated as ‘funny’, ‘useful’ or ‘cool’.
yelp_academic_dataset_tip contains the user provided tips for the businesses, as well as the number of likes the tip received.
yelp_academic_dataset_user is the user information dataset. It contains information such as how many votes the user got, the number of reviews the user wrote, as well as other information like friends, average ratings and so on.
That is a lot of data. I’m getting excited just by the possibility of exploring and learning from all this information. I thought I’d start off by doing something relatively simple - a sentiment analysis on Yelp reviews by training a multinomial naive Bayes classifier.
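For readers unfamiliar with the technique, here is a minimal from-scratch sketch of a multinomial naive Bayes classifier with add-one (Laplace) smoothing. It is not the code from the post and not any library's API: the class name `TinyMultinomialNB`, the `tokenize` helper, and the four training reviews are hypothetical and made up for illustration, not taken from the Yelp dataset.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class TinyMultinomialNB:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        tokens = [t for t in tokenize(text) if t in self.vocab]  # skip unseen words
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Start from the log prior, then add each word's log likelihood.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up reviews for illustration only; NOT taken from the Yelp dataset.
train_texts = [
    "great food and friendly service",
    "delicious pizza great staff",
    "terrible service and cold food",
    "awful pizza rude staff",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = TinyMultinomialNB().fit(train_texts, train_labels)

if __name__ == "__main__":
    print(model.predict("great friendly staff"))
    print(model.predict("rude and cold"))
```

Working in log space avoids multiplying many small probabilities together, and the add-one smoothing keeps a word that never appeared in one class from driving that class's probability to zero.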
10 Jun 2016
The ideas for my writing come from the strangest of places. This one began when I was searching for the syntactically correct way to comment out a block of code in Python. I came across this Stack Overflow link, which mentioned everything I expected it would, until I came across the following witty discussion on the # symbol.
Actually, that symbol is called an octothorp (referring to #). Please stop using local slang terms — few Americans call it a hash, and few non-Americans call it a pound, but nobody ever refers to anything else when they say octothorp. Except the person who chooses to defy this definitive answer by using it to mean something else. — @ArtOfWarfare
That escalated quickly. The debate had shifted from programming syntactic sugar to etymology. Stack Overflow can indeed be quite entertaining when you’ve been staring at code all day. What really got me interested in this topic, however, was this guy’s reply.
@ArtOfWarfare is correct, ‘#’ is an octothorpe. And ‘*’ is a hexathorpe, ‘+’ is a quadrathorpe, and ‘-’ is a duothorpe. Philosophical question: what is a thorpe? — @Pierre
Which led me to wonder: where does the symbol really come from? And what in the world is a thorpe?