
This Analysis on the Emergence of Clickbait Will Blow You Away

15 mins

Companies like Facebook and Google aren’t merely providing a free online service - they’re competing for your attention because they need it to thrive in today’s economy. Your interactions on these platforms generate invaluable data, without which the machine intelligence that makes them so intuitive and personalized would not exist. Having made their services free to maximize their user base, many of these companies adopt business models that depend on how much those services are used, such as data collection and advertising. This is where the attention economy comes in, a theory widely discussed in the fields of psychology, advertising, and economics.

Economist and Nobel Prize winner Herbert Simon was perhaps the first to discuss this concept when he wrote,

In an information-rich world, the wealth of information means a dearth of something else: a scarcity of whatever it is that information consumes. What information consumes is rather obvious: it consumes the attention of its recipients.

In other words, as online content becomes increasingly abundant and accessible, our attention becomes the limiting factor in which content gets consumed, and the businesses that understand this are the ones that end up winning. So to compete in today’s complex and dynamic online landscape, it is vital for companies to capitalize on their consumers’ attention.


Visualizing My Music Taste using Machine Learning and Sentiment Analysis

16 mins
Peter Margaritoff's Gradients: Created with the average color of album covers for the top 2,000 songs on Spotify

Intro • The xx

Music is a complicated thing. No one fully understands why we’re drawn towards a certain song, or so frustrated by another. In the recent paper Musical Preferences are Linked to Cognitive Styles, Greenberg et al. showed that cognitive style, or how individuals process information, influences musical preference. The first study showed a link between empathy levels and preferred genres. The second examined the effect of E-S cognitive styles (based on the Empathizing-Systemizing theory) on musical preference. Subjects with a bias towards empathizing preferred music with low arousal (gentle or warm), negative valence (sad or depressing) and emotional depth (poetic or thoughtful). On the other hand, those with a bias towards systemizing preferred music with high arousal (strong or thrilling), positive valence (lively) and cerebral depth (complex).

This connection between cognitive processing and musical preference is very interesting to me. Having never truly understood my affinity for certain genres of music, I was inspired to try to quantify my music taste using data, to really understand the similarities and differences across the music that I listen to. I was also excited to apply concepts from machine learning such as clustering and unsupervised learning, and from natural language processing such as sentiment analysis, to this project and see what insights they might produce. I’ve also wanted to explore Spotify’s API for some time now, and this proved to be the perfect opportunity to do so.

And yes, each title in this blog post will be a song from my dataset.


The Emotional Timeseries of Prose

15 mins

Nearly twenty years ago, Kurt Vonnegut, an American author perhaps best known for his satirical novel Slaughterhouse-Five, gave a lecture that would change the way we think about stories. Standing in front of a blackboard, chalk in hand, he proclaimed, “There’s no reason why the simple shapes of stories can’t be fed into computers; they are beautiful shapes.” He then proceeded to plot a cosine curve and, amidst applause and laughter, playfully declared, “People love this story!”

Those Who Tell the Stories Rule the World

The notion Vonnegut explores is an interesting one - can we quantitatively look at writing to understand how it is emotionally structured? When we read, we feel emotionally connected to the writing. We get so ‘lost’ in the fictional world and fall so deep into it that our own emotions become mapped to the narrative. In fact, narrative transportation theory in psychology studies exactly this. The quantitative meta-analysis by Van Laer, De Ruyter, Visconti and Wetzels on the effects of narrative transportation alludes to readers ‘mentally enter(ing) a world that a story evokes’. We feel what we read, and being able to understand how these emotions vary over the course of a story is, I think, an extremely interesting intellectual pursuit.

More importantly, this discussion leads to some interesting questions that we can now address through data analysis of big datasets. How does this emotional structure vary over generations of writing, from late 16th century Shakespeare to modern day Pratchett? How do these trends differ between cultures - how similar or different are Indian and Japanese literature in their emotional structure? Do certain authors have an emotional signature - a unique structure to their stories, a formula to their writing? Given an emotional structure, can we predict what kind of story it is (or perhaps even predict its ending)?

Many of these questions were inspired by the research of Andrew Reagan and the Computational Story Lab at the University of Vermont, who used sentiment analysis to analyze the ‘emotional arcs’ of 1,700 stories and reveal the most common ones. Their findings are fascinating - according to the research, all stories conform to one of six basic emotional arcs.


As an avid reader, this research really fascinates me. In this multi-part blog series, I will try to understand the concept of an emotional timeseries in a piece of literature, and how it is affected by various factors. In future posts, I will attempt to address some of the more interesting questions that I brought up earlier.
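To make the idea of an emotional timeseries concrete, here is a minimal sketch of one way it could be computed: score each word against a sentiment lexicon, then average the scores over a sliding window of words. This is an illustration under assumptions, not the method used in the research above. The lexicon `TOY_LEXICON`, the function name `emotional_timeseries` and the window size are all made up for the example; a real analysis would use a full sentiment lexicon and far larger windows.

```python
from statistics import mean

# Hypothetical toy lexicon (word -> valence). These scores are invented
# for illustration and do not come from any published lexicon.
TOY_LEXICON = {
    "love": 1.0, "joy": 1.0, "happy": 0.8, "hope": 0.6,
    "lost": -0.6, "fear": -0.7, "sad": -0.8, "death": -1.0,
}


def emotional_timeseries(text, window=5, step=1, lexicon=TOY_LEXICON):
    """Return the mean valence of each sliding window of words in `text`."""
    # Lowercase each word and strip surrounding punctuation.
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    # Words missing from the lexicon are treated as neutral (0.0).
    scores = [lexicon.get(w, 0.0) for w in words]
    if not scores:
        return []
    if len(scores) < window:
        # Text shorter than one window: return a single overall average.
        return [mean(scores)]
    return [
        mean(scores[i:i + window])
        for i in range(0, len(scores) - window + 1, step)
    ]


# A tiny "story" that starts happily and ends badly.
series = emotional_timeseries("happy happy joy sad sad death", window=3)
print(series)
```

Plotting the resulting list against window position gives the shape of the story: values above zero are emotional highs, values below zero are lows.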


Sentiment Analysis on Yelp Reviews

9 mins

I was looking for public datasets to explore the other day, and I ran into Yelp’s dataset from the Yelp Dataset Challenge. After poking around the data, I realized that it was a treasure trove of information on local businesses – around 2.4GB of data, ranging from details like the location and opening hours of the businesses to user reviews about service and quality of food.


There are five different datasets:

  • yelp_academic_dataset_business contains details about businesses such as opening and closing hours, location, categories, number of reviews, ratings, as well as other attributes ranging from whether it takes reservations to whether it would be considered ‘hipster’.

  • yelp_academic_dataset_checkin contains all the check-in information at businesses.

  • yelp_academic_dataset_review contains the reviews for all the businesses, as well as the number of stars associated with the review. It also contains information on whether the review was rated as ‘funny’, ‘useful’ or ‘cool’.

  • yelp_academic_dataset_tip contains the user provided tips for the businesses, as well as the number of likes the tip received.

  • yelp_academic_dataset_user is the user information dataset. It contains information such as how many votes the user got, the number of reviews the user wrote, as well as other information like friends, average ratings and so on.

That is a lot of data. I’m getting excited just by the possibility of exploring and learning from all this information. I thought I’d start off by doing something relatively simple - a sentiment analysis on Yelp reviews by training a multinomial naive Bayes classifier.
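For readers unfamiliar with the technique, here is a minimal sketch of a multinomial naive Bayes classifier with Laplace smoothing, written with the standard library only. It is an illustration under assumptions rather than the code used in the post: the class name `TinyMultinomialNB`, the `tokenize` helper and the four training reviews are all made up for the example and do not come from the Yelp dataset.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in stripped if w]


class TinyMultinomialNB:
    """Hypothetical minimal multinomial naive Bayes over word counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # log P(class) + sum of log P(word | class) with smoothing
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up reviews for illustration only (not taken from the Yelp data).
train_texts = [
    "great food and friendly service",
    "loved the pizza great place",
    "terrible service and cold food",
    "awful place rude staff",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = TinyMultinomialNB().fit(train_texts, train_labels)
print(model.predict("great friendly place"))
print(model.predict("rude staff cold food"))
```

On the real data, one plausible setup would be to derive the labels from the star ratings attached to each review, though the exact labeling scheme is a design choice.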
