The Nerdiest Detectives in the World

Each week Ethan Rouen, a student in the class, will post on a topic of his interest based on class lectures. Ethan is a Ph.D. student in accounting at Columbia Business School and a columnist for

“Feasibility,” “Restatement,” “Litigation,” “Where,” “Acquisitions,” “Technological,” “Update,” “Words,” “Considered,” “Pooling”

Some investigators think those 10 words will help them catch their prey. Can you figure out what they have in common? My guess would be some kind of software that builds portable swimming pools instantly, but a lot of people drown in those pools.

According to a working paper by Gerard Hoberg and Craig Lewis, these are the top 10 words that companies use more aggressively when they are committing fraud, as measured by whether the Securities and Exchange Commission filed an enforcement action against them in the year that they wrote those words in their annual disclosure (called a 10-K).

In class this week, we looked at different ways to measure text, including the bag of words technique, which is what is applied in this analysis. The more aggressive use of these 10 words (the authors include a total of 50 words) is statistically significant, but when I read them out loud, I can’t help thinking that the authors should have taken their analysis a bit further.
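To make the bag of words idea concrete, here is a minimal sketch in Python. It is not the authors' method: the tokenizer, the function names, and the choice to report each word as a share of all tokens are my own assumptions, and the watch list is just the 10 words quoted above.

```python
import re
from collections import Counter

# Hypothetical watch list: the 10 words quoted in this post, lowercased.
WATCH_WORDS = [
    "feasibility", "restatement", "litigation", "where", "acquisitions",
    "technological", "update", "words", "considered", "pooling",
]


def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count them.

    Word order and position are thrown away, which is the defining
    feature (and the main limitation) of a bag of words.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def watch_word_rates(text, watch_words=WATCH_WORDS):
    """Return each watch word's share of all tokens in the text."""
    counts = bag_of_words(text)
    total = sum(counts.values())
    if total == 0:
        return {word: 0.0 for word in watch_words}
    return {word: counts[word] / total for word in watch_words}
```

Calling `watch_word_rates` on the text of a 10-K gives one number per word, and those numbers can then be compared across firms. Deciding what counts as "more aggressive" use is the statistical part, which this sketch leaves out.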

Why didn’t they look at the order of words or their location in the text to provide better context? What about the change in text year-over-year? What other tests could be applied to this kind of analysis to make us more confident that frequent use of the word “words” means fraud is more likely?
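Two of these questions are easy to sketch, even if answering them well is not. The Python below counts adjacent word pairs (bigrams) as a crude way to keep some word order, and scores year-over-year change as one minus the cosine similarity of two years' word counts. Both choices are my own illustrations, not anything taken from the paper, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bigram_counts(text):
    """Count adjacent word pairs, which keeps a little of the word order."""
    toks = tokens(text)
    return Counter(zip(toks, toks[1:]))


def cosine_similarity(a, b):
    """Cosine similarity between two Counters treated as sparse vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def year_over_year_change(last_year_text, this_year_text):
    """Return 0 for identical word counts and 1 for no words in common."""
    last_year = Counter(tokens(last_year_text))
    this_year = Counter(tokens(this_year_text))
    return 1.0 - cosine_similarity(last_year, this_year)
```

A firm whose disclosure barely changes from one year to the next would score near 0, and a sudden rewrite would score closer to 1. Whether either pattern says anything about fraud is exactly the kind of question I wish the authors had tested.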

These questions are particularly important in this context because Craig Lewis is the chief economist of the SEC, the agency charged with rooting out fraud. His office has begun implementing a model (dubbed, very awesomely, “Robocop”) that scans companies’ public disclosures in an attempt to determine whether fraud is being committed.

This issue is timely because the SEC in recent years has stepped away from looking for accounting fraud like the Enron debacle, choosing instead to focus on insider trading and Bernie Madoff-type scams. Yet SEC officials have said that they believe accounting fraud is still going on.

Robocop combines traditional measures of searching for fraud (in layman’s terms, it searches for old-school ways that managers fudge their numbers) with text analysis. Recent academic research has shown that text analysis can help capture management behavior. Frequent use of the word “I” instead of “we” in public disclosures has been shown to provide insights into management intentions, as have the complexity and length of disclosures.
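The “I” versus “we” measure is simple enough to sketch in a few lines of Python. This is only an illustration under my own assumptions: the function names are hypothetical, the share of “I” among “I” and “we” is one of several ways to express the comparison, and a plain word count stands in for disclosure length. It is not how Robocop or any particular paper computes these measures.

```python
import re


def first_person_singular_share(text):
    """Return the share of "I" among all uses of "I" and "we".

    Returns None when the text uses neither word, since the share is
    undefined in that case.
    """
    words = re.findall(r"[a-z]+", text.lower())
    i_count = words.count("i")
    we_count = words.count("we")
    if i_count + we_count == 0:
        return None
    return i_count / (i_count + we_count)


def word_count(text):
    """A crude length measure: the number of alphabetic tokens."""
    return len(re.findall(r"[a-z]+", text.lower()))
```

For example, `first_person_singular_share("I think we agree. I do.")` returns 2/3, because “I” appears twice and “we” once.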

So the question will be, how effective is text analysis in this situation? Unlike when they fudge the numbers, managers can fairly easily and continuously change their text to meet the model’s expectations. They can do this by mimicking the text of similar firms (this argument is one of the points of the paper). My guess is that a lot of resources will go into figuring out what Robocop is looking for (other than punk criminals in futuristic Detroit) in order to avoid detection.

Measuring text is often a subjective exercise, so it seems to me that all text analysis (and who among us doesn’t want to do some kind of text analysis?) should start with the data scientist asking if and how that text can be measured. Just as important, when we put that model back into the real world, how will its influence change its predictive abilities?

(Here’s a link to the paper I discussed in this post:

