Data products in the wild

Hi Students,

Monday’s lecture will focus on Human Factors in Data Science. The class will be an onslaught of needs finding, design, prototyping, and evaluation. It will be intense; brace yourselves.

As data scientists, you will ultimately produce a data product, be it a graph or a report or a presentation. This product will affect the real world, either through its own actions or through the decisions that people make based on it. A consequence of affecting the real world is that the product itself will change the underlying data or assumptions upon which it is built. This reality is often ignored when designing a data product.

Spam: a digital arms race
Consider the spam filter. It’s a quiet, unobtrusive part of your email pipeline. It works tirelessly to keep your inbox (mostly) free of emails from Nigerian princes and low-interest loans. However, today’s spam filters are drastically different from the ones you used five years ago. Behind the scenes there is a quiet arms race in effect. This article provides a retrospective on the spam war from folks on the front line. It sums up the problem quite well:

Whenever researchers and developers improve spam-filtering software, spammers devise more sophisticated techniques to defeat the filters. Without constant innovation from spam researchers, the deluge of spam being filtered would overwhelm users’ inboxes.

Haters gonna hate. Spammers gonna spam (at least until it no longer makes economic sense).

Music: manipulating popularity
Not all problems require reverse engineering of complex, learned models. Streaming music products, such as YouTube, Pandora, and Spotify, have forced companies like Billboard to redo their formulas for ranking songs. Unlike radio, impressions from streaming music vendors are directly under the control of fans. This article details how this realization allowed one band to game the system.

Is this wrong? How should this type of fan participation be counted? It’s not clear. On one hand, these are actual fans. They care enough about the band to manipulate the data. On the other hand, there is asymmetric information. Not all bands know how this ranking system works. If they did, they might launch their own fan campaign, and maybe (almost certainly) the results would be different.

Translation: too much success
Not all manipulations of data are malicious. Consider the following example:

You are trying to translate the web. To do this you need data — specifically datasets with the same text but in different languages. One place to find this data is the web itself. Some webpages will have the same content in multiple different languages. Manual translation is expensive, but worthwhile for organizations that need a global presence.

Over time, your product will grow, producing higher and higher quality translations. Eventually, your translations will become “good enough”. People who can’t afford manual translations (and even some who can) may start using your service to translate and then host the translated website. Hurray for technology.

But wait! Your translations are only “good enough”; they still aren’t perfect. There are a number of applications you can’t power with the current quality of your translations. And besides, language is constantly evolving! Rules change, words are added and deleted, usage varies over time. You need fresh data to improve the quality of your system and adapt to changes.

But now you’ve polluted your data source. Many of the parallel webpages on the web are now the product of your algorithm. If you train on them, you are just reinforcing existing biases. You’re suffering from a tragedy of riches. Your product is too successful.
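
To see why this is a trap, here is a toy sketch. It is my own illustration under loud assumptions, not a description of how any real translation system works: one word has two candidate translations, A and B; the model always outputs whichever translation is more common in the pages it crawls; and a fixed fraction of newly crawled parallel pages was produced by the model itself. The function names and all the numbers are hypothetical.

```python
def observed_share_of_a(human_share_a, machine_fraction, model_prefers_a):
    """Share of translation A among newly crawled parallel pages.

    human_share_a: fraction of human translators who choose A (hypothetical).
    machine_fraction: fraction of new pages produced by our own model.
    model_prefers_a: whether the current model outputs A.
    """
    machine_share_a = 1.0 if model_prefers_a else 0.0
    return (1 - machine_fraction) * human_share_a + machine_fraction * machine_share_a


def retrain(human_shares, machine_fraction=0.5):
    """Retrain once per round; the model picks the majority translation it sees."""
    # The initial model is assumed to have been trained on clean human data.
    model_prefers_a = human_shares[0] >= 0.5
    history = []
    for human_share_a in human_shares:
        seen = observed_share_of_a(human_share_a, machine_fraction, model_prefers_a)
        model_prefers_a = seen >= 0.5
        history.append((human_share_a, seen, model_prefers_a))
    return history


# Human usage drifts away from A: first 60% choose it, eventually only 20%.
drift = [0.6, 0.5, 0.4, 0.3, 0.2]
for human, seen, prefers_a in retrain(drift):
    print(f"humans choosing A: {human:.0%}  crawl shows A: {seen:.0%}  model outputs A: {prefers_a}")
```

With half of the crawl coming from the model itself, the observed share of A never falls below 50% in this toy, so the model keeps outputting A even after most human translators have moved to B. Setting `machine_fraction=0.0` lets the same model follow the drift.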

This is a problem that Google Translate faced as it matured as a product. Students, if you had this problem, how would you try to solve it?



  1. A large part of what the translator needs to do is diversify its training documents. If the training algorithm keeps reading the same documents, it will not learn anything useful. Accordingly, the translator should weed out documents that are similar (parallel in sentence structure, vocabulary, phrase usage, and other translation elements). It is my understanding that many translation services use probability to decide how to convert text. Sampling a dataset as formulaic and consistent as UN minutes might lead to inaccurate probability estimates of both sentence structure and vocabulary, among other factors. Instead, a translator should use a set of documents with great variance of subject matter and purpose in order to avoid training to a specific writing style.

    Another part of the solution is to lower the impact a single document has on the training algorithm, especially if the document very likely came from the translation service.

  2. One possible way to tackle the polluted data source is to label each page as “processed” or “raw”. Every time the translator algorithm produces a Google translation, we mark the website as “processed” in the webpage database. When we want to translate a new webpage (test data), the algorithm looks for matching texts in the webpage data source, which contains both processed and raw webpages. There’s really no gold standard in translation: rules change, and words are added and deleted every now and then. Hence I don’t think a webpage that has been processed by Google Translate is necessarily “polluted”; it might represent a new way of using the language as more and more people use a word that way. Still, the raw webpages carry less algorithm-generated bias than the processed ones. I would suggest that the algorithm assign a weight to each category, say 70% raw and 30% processed. Deciding the actual weights is another problem, which requires more fitting and adjusting. But the idea is to generate a more suitable “hybrid” data source as the training data.

    Also, the websites that offer multiple languages are mainly official corporate or organization websites and news websites. The language on these websites tends to be very formal, not something we would use on a daily basis. So it’s not uncommon to enter some dialogue into Google Translate and get a translation that doesn’t make much sense. I think the next step for Google may be to look for websites that offer informal translations and to put more weight on those webpages when the translator detects that the input is informal in nature. Possible sources I can think of are language-learning websites, online translated versions of novels, etc.

  3. Shijiao Zhou

    I personally found the earlier music example intriguing. When I wanted to listen to music, the first sources I turned to were rankings of various kinds, including Billboard. Surprisingly, in most cases I found that those “top 10s” were not the songs I would have chosen if I had been given the chance to vote. Not until I read the related articles in this post did I realize that these data could be manipulated in various ways – quite surprising! There is an undefined grey area where it’s hard to tell whether the top ratings were driven by the songs’ true popularity or mainly manipulated by those who find economic incentives in doing so – I suspect a mix of both factors is at play, so the rankings don’t reflect popularity and product quality within the music industry as faithfully as listeners would wish.

    Nevertheless, I really like the product “Hot Songs of Summer 2013” (the chart plots the relationship between # of plays and # of fans, and you can also play selected songs from it)… I think it addresses most of the concerns people have about the validity of those ranking sources, and it also uncovers some gaps between reality and what was being ranked.

    While I agree with most of the comments above about how to help Google Translate become a better product as it matures, something we especially need to take into consideration is the “constantly changing/evolving” character of language, given the complexities of both its cultural and habitual spaces. Manual translations unavoidably have to be fed into the algorithm from time to time just to keep the data source (not only the “raw” data) as clean and standardized as possible. On top of all the other technical aspects of machine learning mentioned by others, I believe the human factor (the UN example was a great one) also plays a huge role in controlling the quality of the product as it keeps improving over time.

  4. Yali Chen

    I think it is possible for Google to distinguish between fresh data and trained data through text analysis. Google can tag every translated page so that it is easier for people to extract only the fresh data and build a statistical model on that data.

    However, distinguishing fresh data from translated data can sometimes be very difficult. For example, Google may not be able to tag pages translated by other translators. Those pages would come in as fresh data. So another way to tackle this problem is to build a new model on the mixed data.

  5. Seyi Adebayo

    As many people have said, the best way to avoid a problem like this is to avoid training on Google’s own data and start completely from scratch. To do so, as some people have mentioned, I think the best plan of attack is to somehow mark data generated via Google Translate. This will ensure that any data being gathered is unbiased. After gathering the necessary amount of data, I would subset it, begin running algorithms on a subset of the raw data, and use the data available from my previous trials as a benchmark. This way I would not only be refining language translation techniques to better translate new trends in languages, but also ensuring that I did not repeat the mistakes previously observed.

  6. How about we pinpoint certain websites that carry human-translated texts in multiple languages? For example, we could run the training algorithm on’s English, Arabic, and Chinese pages. Many of the articles on the site are facsimiles of each other, just in different languages. The same thing could be done with the pages on the BBC World Service site.

    I am not sure about the legality of doing this. But I am pretty sure some agreement can be worked out between all the parties, since both sides stand to gain something out of this at the end of the day. The content providers can diversify their revenue streams, while Google can gain access to professionally translated articles on a daily basis, and it’s a heck of a lot cheaper than hiring a group of in-house language QA staff and translators.

  7. The goal is to get as many datasets with the same text but in different languages, with as much accuracy as possible. The best way to accomplish this is to use professionally translated documents, such as those from the UN, as previously mentioned.

    Pages that have already been translated can then be tagged and ranked. Then, using the professionally translated documents as a benchmark, sample translations from highly ranked (e.g. popular) websites and predict how effective the websites’ translations are. If the websites’ translations are highly effective, Google Translate could begin incorporating these websites into its translations.

  8. mipiccirilli

    The goal is to get as many datasets with the same text but in different languages, with as much accuracy as possible. The best way to accomplish this is to use professionally translated documents, such as those from the UN, as previously mentioned.

    Pages that have already been translated can then be tagged and ranked. Then, using the professionally translated documents as a benchmark, sample translations from highly ranked (e.g. popular) websites and predict how effective the websites’ translations are. If the websites’ translations are highly effective, GT could begin incorporating these websites into its translations.

  9. I think text mining of website translations is very useful. Nowadays, people get most of their information from the internet. Even TV, a traditional medium, is now combining with the internet as IPTV. The internet is one of the best ways of collecting data and information. We can analyse human behavior from the internet.

    One of the problems of analyzing website data is that the data is too large and most of the information is text and pictures. With a good translation of figures and text into data that a computer can use, we can find many useful patterns to help us live a better life.

  10. Yukai Wang

    I think the most important thing is to make some corrections to the training data set. There are several ways to weaken the impact of polluted data sources without impairing sensitivity to new things.
    1) Develop a filter to estimate how likely it is that a data source was produced by our translation algorithm. For example, we can run our translation algorithm on the original text and then compare the result with the text on the webpage. If the similarity is high, we down-weight that translation.
    2) Include some authoritative sources, such as UN documents and published books in different languages. Then we assign higher weights to them in the translation algorithm.
    3) Check the distribution of each word’s translations and then do a cross-term comparison. If the distribution of some words’ translations has converged to our own translation, those words might be polluted by our translation, and we need to make sure their translations are correct.

  11. delta-epsilon

    I feel that addressing these new trends would take two things. First, if we notice that our model contains a group of words or phrases whose frequency is really low, that could be a possible sign of a trend. The best way of seeing this is through testing. If a new saying is really popular, when people try to translate it, it should often be mistranslated. Since the saying is popular, websites that may use this new data would be numerous, and it should be easier to pick up on. Therefore, people can make models on both the old and new data.

  12. Ben Cheng

    As the post suggested, language is always evolving, and the translating mechanism will have to continue to adapt to these changes. The problem of a polluted data source boils down to the increasing difficulty of finding untainted data. In many cases, bibliographic tagging automatically inserted by Google Translate’s service should be enough to identify a piece as a product of Google’s translation product. Part of the solution to this problem is a matter of ensuring that this sort of tagging is inserted as much as possible whenever Google Translate is utilized.

    In other cases, the bibliographic information may have been doctored or removed altogether. In this situation, my intuition would be to build a pre-processing stage into the retraining of the translation model. Assuming that all new information is based on websites that have both an original-language and a translated-language version, each piece of content should be fed into a filter that attempts to drop any content that is likely to have been a product of Google Translate. One possible way to filter content is to use the pre-existing translation model to ‘reverse translate’ each piece of content. In cases where the original-language content is available, a near-perfect reverse translation would strongly suggest that the content should be excluded from the learning process. Manual evaluation of websites for which it is known whether or not the service was used can determine thresholds for how close a reverse translation can be before being ‘too perfect.’ Other solutions could be to tap into translated content that is just inherently less likely to have used Google Translate, such as internationally published literature.
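
Comments 10 and 12 both describe a filter that re-runs the translator and checks whether a web page’s translation looks “too perfect”. Below is a minimal sketch of that idea, not anyone’s actual pipeline. The `translate` argument is a hypothetical stand-in for the current model, the word-level similarity from Python’s `difflib` is a placeholder for a real translation-similarity metric, and the threshold (0.9) and penalty (0.1) are illustrative values that would need tuning against pages known to be machine-translated.

```python
from difflib import SequenceMatcher
from typing import Callable


def similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means identical word sequences."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def training_weight(
    source_text: str,
    page_translation: str,
    translate: Callable[[str], str],  # hypothetical: the current translation model
    threshold: float = 0.9,           # assumed cut-off for "too perfect"
    penalty: float = 0.1,             # assumed weight for suspected machine output
) -> float:
    """Down-weight a parallel page whose translation matches our own output."""
    machine_output = translate(source_text)
    score = similarity(machine_output, page_translation)
    return penalty if score >= threshold else 1.0


# Toy stand-in for the model: a lookup table instead of a real translator.
toy_model = {"bonjour le monde": "hello the world"}


def toy_translate(text: str) -> str:
    return toy_model.get(text, "")


suspect = training_weight("bonjour le monde", "hello the world", toy_translate)
human = training_weight("bonjour le monde", "hello, world", toy_translate)
print(suspect, human)
```

Returning a weight rather than dropping the page outright follows the suggestion in comment 10 to “weight down” suspect translations; setting `penalty=0.0` gives the hard filter described in comment 12.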
