Monday’s lecture will focus on Human Factors in Data Science. The class will be an onslaught of needs finding, design, prototyping, and evaluation. It will be intense; brace yourselves.
As data scientists, you will ultimately produce a data product, be it a graph or a report or a presentation. This product will affect the real world, either directly through its own actions or through the decisions people make based on it. A consequence of affecting the real world is that the product itself will change the underlying data or assumptions upon which it is built. This reality is often ignored when designing a data product.
Spam: a digital arms race
Consider the spam filter. It’s a quiet, unobtrusive part of your email pipeline, working tirelessly to keep your inbox (mostly) free of emails from Nigerian princes and low-interest loans. However, today’s spam filters are drastically different from the ones you used five years ago. Behind the scenes, a quiet arms race is underway. This article provides a retrospective on the spam war from folks on the front line. It sums up the problem quite well:
Whenever researchers and developers improve spam-filtering software, spammers devise more sophisticated techniques to defeat the filters. Without constant innovation from spam researchers, the deluge of spam being filtered would overwhelm users’ inboxes.
Haters gonna hate. Spammers gonna spam (at least until it no longer makes economic sense).
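The arms race is easier to see with a toy model. Below is a minimal sketch of a Naive Bayes-style filter, the classic early approach (all words, messages, and probabilities here are invented for illustration, not how production filters work): once the filter learns which words signal spam, a spammer can slip past it simply by obfuscating those words into spellings the filter has never seen.

```python
from collections import Counter
import math

# Toy training data (invented for illustration).
spam = ["win money now", "low interest loans now", "claim your prize money"]
ham = ["meeting at noon", "lecture notes attached", "see you at the lecture"]

def word_counts(msgs):
    return Counter(w for m in msgs for w in m.split())

spam_counts, ham_counts = word_counts(spam), word_counts(ham)
spam_total, ham_total = sum(spam_counts.values()), sum(ham_counts.values())
vocab = set(spam_counts) | set(ham_counts)

def score(msg):
    # Log-odds of spam vs. ham, with Laplace smoothing so unseen
    # words don't zero out the probabilities.
    s = math.log(len(spam) / len(ham))  # class prior
    for w in msg.split():
        s += math.log((spam_counts[w] + 1) / (spam_total + len(vocab)))
        s -= math.log((ham_counts[w] + 1) / (ham_total + len(vocab)))
    return s  # > 0 means "more likely spam"

print(score("win a prize now") > 0)   # True  — built from known spam words
print(score("lecture at noon") > 0)   # False — built from known ham words
print(score("w1n m0ney n0w") > 0)     # False — obfuscated words are unseen in
                                      # both classes, so their evidence cancels
```

The last line is the spammer’s counter-move: the filter only updates once it retrains on the new spellings, at which point the spammer obfuscates again — hence the need for constant innovation the article describes.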
Music: manipulating popularity
Not all problems require reverse engineering complex, learned models. Streaming music products such as YouTube, Pandora, and Spotify have forced companies like Billboard to redo their formulas for ranking songs. Unlike radio, impressions from streaming music vendors are directly under the control of fans. This article details how this realization allowed one band to game the system.
Is this wrong? How should this type of fan participation be counted? It’s not clear. On one hand, these are actual fans. They care enough about the band to manipulate the data. On the other hand, there is asymmetric information. Not all bands know how this ranking system works. If they did, they might launch their own fan campaign, and maybe (almost certainly) the results would be different.
Translation: too much success
Not all manipulations of data are malicious. Consider the following example:
You are trying to translate the web. To do this you need data: specifically, datasets containing the same text in different languages. One place to find such data is the web itself, since some webpages host the same content in multiple languages. Manual translation is expensive, but worthwhile for organizations that need a global presence.
Over time, your product will grow, producing higher and higher quality translations. Eventually, your translations will become “good enough”. People who can’t afford manual translations (and even some who can) may start using your service to translate and then host the translated website. Hooray for technology.
But wait! Your translations are only “good enough”; they still aren’t perfect. There are a number of applications you can’t power at the current quality of your translations. And besides, language is constantly evolving: rules change, words are added and deleted, usage shifts over time. You need fresh data to improve the quality of your system and adapt to these changes.
But now you’ve polluted your data source. Many of the parallel webpages on the web are now the product of your own algorithm. If you train on them, you are just reinforcing your model’s existing biases. You’re suffering from an embarrassment of riches: your product is too successful.
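The feedback loop can be captured in a back-of-the-envelope simulation (the numbers below are invented for illustration): human-translated pages carry new information, while machine-translated pages only echo what the model already knows, so as machine output crowds out human pages, progress stalls well short of where it would otherwise be.

```python
# Toy model: "quality" is the fraction of the language the model gets right.
# Each training round, only the human-translated share of the web closes
# part of the remaining gap; self-generated pages contribute nothing new.

def simulate(rounds, machine_share, learning_rate=0.3, start_quality=0.5):
    quality = start_quality
    for _ in range(rounds):
        human_fraction = 1 - machine_share
        quality += learning_rate * human_fraction * (1 - quality)
    return quality

print(round(simulate(10, machine_share=0.0), 3))  # 0.986 — clean web
print(round(simulate(10, machine_share=0.9), 3))  # 0.631 — web full of your own output
```

All parameters here (learning rate, starting quality, the 90% pollution figure) are made up; the point is only the shape of the curve, not the specific values.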
This is a problem that Google Translate faced as it matured as a product. Students, if you had this problem, how would you try to solve it?