Several months ago I was reading a data mining book and got to the chapter on distance metric algorithms. For one of the examples, the authors used whiskey scores to help users find related whiskeys based on their flavor profiles. I loved the idea, but wanted to see if there was anything like this for coffee data. After a bit of Googling, I found nothing. In fact, finding a solid coffee cupping scoring rubric was difficult enough (people don’t like to agree), but eventually I came across Sweet Maria’s, a roaster based out of Oakland, California. For the past decade, the roasters at Sweet Maria’s have followed a scoring system and have provided detailed reviews of their coffees which you can see here.
If you check out the reviews, you’ll notice that they are split into a couple of categories by year. It seems that over the years, roasters at Sweet Maria’s adjusted their scoring to account for the different aspects of the coffee they acquired. Naturally, with reviews in hand, I wanted to use the most recent scoring rubric, which began in 2008, but there was one issue: the scores were images.
Hundreds of reviews with detailed content and a beautiful scoring system stuck inside the pixels of a PNG image. Some would argue that OCR would easily handle this problem, but having gone down that route before, I found it to be problematic (would love to see someone do this), and in some weird way, I thought it would be therapeutic to manually transcribe the scores. So, after months of working on the transcription off and on, I finished with close to 500 different reviews. You can access that data here.
Having the scores was just step one. After collecting all the scores, then checking to make sure I didn’t mess up any, I built a simple parser to walk the Sweet Maria’s reviews in order to pair the number score with the detailed review. There was a snag with the HTML generation that forced me to save all the content and work on it locally, but the end format looked like the following:
Using the new format, I was able to test two basic distance metric algorithms on the scoring data: cosine similarity and Euclidean distance. Both algorithms worked well for comparing the score vectors, but I preferred the output of Euclidean distance since its measurements weren’t bound to fall between 0 and 1. In some way, it felt easier to process the distance in my head when it was represented with a greater range of numbers.
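To make the comparison concrete, here is a minimal sketch of both measures in plain Python. The three-value score vectors are made up for illustration and are not taken from the Sweet Maria’s data, and the function names are my own. It assumes both vectors have the same length and that neither is all zeros (cosine similarity is undefined for a zero vector).

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two score vectors (0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two score vectors.

    For non-negative scores the result falls between 0 and 1,
    where 1 means the vectors point in the same direction.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical score vectors, one number per scored attribute.
coffee_a = [8.0, 7.5, 9.0]
coffee_b = [8.0, 7.5, 9.0]
coffee_c = [5.0, 3.5, 9.0]

print(euclidean_distance(coffee_a, coffee_b))  # identical vectors: 0.0
print(euclidean_distance(coffee_a, coffee_c))  # differences of 3, 4 and 0: 5.0
print(cosine_similarity(coffee_a, coffee_c))
```

Note the direction of each measure: a smaller Euclidean distance means more similar coffees, while a larger cosine similarity means more similar coffees.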
Realizing I would be better off displaying the data in a web application, I set off to create something that retained the original scoring charts and data from Sweet Maria’s, but also allowed users to pick the aspects of the coffee that were most important to them in order to find a related coffee. After a quick search, I stumbled across a radar chart implementation built on D3.js that made it extremely easy to represent the coffee scores. The last item I needed was logic around the coffee features, so that any feature the user checked would be given a higher weight inside the Euclidean distance calculation.
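One way the feature weighting could work is sketched below. This is not the application’s actual code: the feature names, the `boost` value of 3.0, and the function names are all assumptions made for illustration. Each squared difference is multiplied by a weight, so a gap on a checked feature pushes two coffees further apart than the same gap on an unchecked one.

```python
import math


def build_weights(features, selected, boost=3.0):
    """Give checked features a higher weight; all others get 1.0."""
    return [boost if name in selected else 1.0 for name in features]


def weighted_euclidean(a, b, weights):
    """Euclidean distance with a per-feature weight on each squared difference."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def rank_related(target, coffees, features, selected, boost=3.0):
    """Return (name, distance) pairs sorted from most to least similar."""
    weights = build_weights(features, selected, boost)
    scored = [
        (name, weighted_euclidean(target, scores, weights))
        for name, scores in coffees.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical feature names and scores, for illustration only.
features = ["aroma", "body", "sweetness"]
target = [8.0, 8.0, 8.0]
coffees = {
    "A": [8.0, 5.0, 8.0],  # differs from the target on body
    "B": [5.0, 8.0, 8.0],  # differs from the target on aroma
}

# With nothing checked, A and B are equally far from the target.
print(rank_related(target, coffees, features, selected=set()))

# With "body" checked, A's gap on body counts for more, so B ranks first.
print(rank_related(target, coffees, features, selected={"body"}))
```

Setting every weight to 1.0 reduces this to the ordinary Euclidean distance, so an unchecked form behaves exactly like the unweighted comparison.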
When all was said and done, I ended up with the above application. Sure, a lot more could be done to refine the displayed data, but this worked for my little experiment. Having sat on this data for months, I wanted to release it to anyone who was interested. I still want to pursue more questions about the data, such as whether certain combinations of features create interesting country groupings and, if so, whether those groupings can be explained by farming methods or weather patterns. Aside from probing the data more, I am also planning to draw in another source, www.coffeereview.com. Though the scoring is different, there are thousands of reviews on coffees being harvested right now. There’s plenty to answer, just not enough time to get it all done.