Realizations in Biostatistics: natural language processing


Friday, February 15, 2013

Sloppy journalism with interactive graphics is still sloppy journalism

The Guardian recently discussed the "declining linguistic standards" in State of the Union addresses. I thought this was an interesting exercise, but something seemed wrong about the article, and it turns out this is one case where the data do not really speak for themselves. There's a lot of interpretation and understanding behind cultural trends in the use of the English language in America, as well as the evolution of the presidents' intentions behind the address. There are a few important points:

The author correctly points out that Woodrow Wilson essentially changed the format of the address, by precedent, from a written document to a speech. Right after Wilson's first speech there is a huge drop in the "education level" (hang on for a discussion of this terminology) of these addresses. As I recall, Wilson is the only American president with a Ph.D.
The index used, Flesch-Kincaid (FK), is questionable. Good on The Guardian for using a single measure for all speeches, but I have to wonder whether it is wise to use the same measure for speeches and written addresses. Furthermore, FK is very sensitive to the placement of punctuation, because it weights sentence length heavily. For instance, as a friend pointed out, one of Wilson's speeches has an FK grade level of over 17, but if you replace one of the semicolons in the speech with a period, the FK grade drops to 12. This subtlety is lost in speech format, where a semicolon and a period sound much the same, giving FK an extremely high uncertainty (this same friend calls FK "utterly useless" for speeches).
The audience of the SOTU address has changed. Though the address is a constitutional duty of the president, delivering it as a speech is not, and it only has to be delivered to Congress. However, most modern addresses have been televised speeches, which have to be understood by a wider and less politically savvy audience.
In general, spoken and written English in America has trended toward shorter sentences over time.
In this case, a more sophisticated natural language processing analysis might reveal some interesting trends. For instance, how do wartime speeches compare to times of peace? Are there any natural categories of speeches that fall out? What are the outliers? How does this compare to polls?
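
The punctuation sensitivity of FK mentioned above is easy to see in a small sketch. The function below uses the standard Flesch-Kincaid grade formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59, but the syllable counter is a crude vowel-group heuristic and the sentence splitter is naive, so both are assumptions rather than what The Guardian actually used. The sample sentence is made up and is not taken from any address.

```python
import re

def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fk_grade(text):
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    # Sentences end at '.', '!' or '?'. A semicolon does NOT end a sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

# Made-up example text, 20 words long.
with_semicolon = ("We have met the demands of the hour with patience; "
                  "we shall meet the demands of the future with courage.")
with_period = with_semicolon.replace(";", ".")

print(round(fk_grade(with_semicolon), 1))
print(round(fk_grade(with_period), 1))
```

The words and syllables are identical in both versions, so the syllable term cancels and the whole gap comes from sentence length: 0.39 × (20 − 10) = 3.9 grade levels from changing a single punctuation mark.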

In short, we have some interesting data that need heavy qualification and critical analysis, but they are simply presented on a page and capped with a headline that gives an overly simplistic interpretation.

Posted by Unknown at 8:30 AM

Labels: interpretation, natural language processing, politics, visualization

Saturday, January 14, 2012

Faster reading through math

Let’s face it, there is a lot of content on the web, and one thing I hate is reading halfway through an article and realizing that the title and first paragraph indicate little about the rest of the article. In effect, I check out the quick content first (usually after following a link), and end up disappointed.

My strategy now is to use automatic summaries, which are a lot more accessible than they used to be. The underlying algorithm, due to H. P. Luhn, has been around since 1958 (!) and is described in books such as Mining the Social Web by Matthew Russell (where a Python implementation is given). With a little work, you can create a program that scrapes text from a blog, produces short and long summaries with links to the original posts, and packages it all up in a neat HTML page.
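
To give a flavor of how this kind of summarizer works, here is a minimal sketch in the spirit of Luhn's approach: find the most frequent non-stopwords, score each sentence by its densest cluster of those words, and keep the top-scoring sentences in their original order. This is a simplification and not the implementation from Mining the Social Web. The function names, the tiny stopword list, and the `max_gap` and `top_words` parameters are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "this", "for", "on", "with", "as", "are", "was", "be"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercase word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())

def luhn_score(tokens, significant, max_gap=4):
    """Score a sentence by its densest cluster of significant words.

    A cluster starts and ends with a significant word and never contains
    more than max_gap insignificant words in a row.
    Cluster score = (significant words in cluster)^2 / cluster length.
    """
    positions = [i for i, t in enumerate(tokens) if t in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1          # still inside the current cluster
        else:
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1  # begin a new cluster
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))

def summarize(text, n_sentences=2, top_words=10):
    """Return the n_sentences highest-scoring sentences, in original order."""
    sentences = split_sentences(text)
    tokenized = [tokenize(s) for s in sentences]
    freq = Counter(t for toks in tokenized for t in toks if t not in STOPWORDS)
    significant = {w for w, _ in freq.most_common(top_words)}
    scores = [luhn_score(toks, significant) for toks in tokenized]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]

if __name__ == "__main__":
    sample = "Cats chase mice. The weather is nice. Cats and mice are enemies."
    print(summarize(sample, n_sentences=2, top_words=2))
```

Varying `n_sentences` is one way to get the short and long summaries; scraping the blog and wrapping the result in HTML are separate steps not shown here.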

Or you can use the cute interface in Safari, if you care to switch.

Posted by Unknown at 4:08 PM

Labels: natural language processing