Monday, March 2, 2015
The Linguistics behind IBM Watson
I will be talking about the linguistics behind IBM Watson's Question Answering on March 11 at the DC Natural Language Processing MeetUp. Here's the blurb:
In February 2011, IBM Watson defeated Brad Rutter and Ken Jennings in the Jeopardy! Challenge. Today, Watson is a cognitive system that creates a new kind of partnership between people and computers, enhancing and scaling human expertise through a more natural relationship between human and machine.
One part of Watson’s cognitive computing platform is Question Answering. The main objective of QA is to analyze natural language questions and present concise answers with supporting evidence, rather than a list of possibly relevant documents like internet search engines.
This talk will describe some of the natural language processing components that go into just three of the basic stages of IBM Watson’s Question Answering pipeline:
- Question Analysis
- Hypothesis Generation
- Semantic Types
The NLP components that help make this happen include a full syntactic parse, entity and relationship extraction, semantic tagging, co-reference, automatic frame discovery, and many others. This talk will discuss how sophisticated linguistic resources allow Watson to achieve true question answering functionality.
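To make those three stages concrete, here is a toy sketch in Python. To be clear, this is my own illustration, not Watson's implementation: Watson relies on a full syntactic parse and large-scale evidence scoring, while the heuristics below (function names included) are invented stand-ins.

```python
# A toy illustration of the three pipeline stages above. These are my own
# invented stand-ins, NOT Watson's implementation.

def analyze_question(question):
    """Stage 1, Question Analysis: guess the expected answer type
    (a crude stand-in for lexical answer typing)."""
    q = question.lower()
    if q.startswith("who"):
        return "PERSON"
    if q.startswith("where"):
        return "LOCATION"
    return "THING"

def generate_hypotheses(question, corpus):
    """Stage 2, Hypothesis Generation: pull candidate answers (here, any
    capitalized token) from documents sharing vocabulary with the question."""
    q_words = set(question.lower().split())
    candidates = set()
    for doc in corpus:
        if q_words & set(doc.lower().split()):
            candidates.update(w.strip(".,?") for w in doc.split() if w[0].isupper())
    return candidates

def filter_by_semantic_type(candidates, answer_type, type_lexicon):
    """Stage 3, Semantic Types: keep candidates whose type matches the question's."""
    return {c for c in candidates if type_lexicon.get(c) == answer_type}

corpus = ["Watson defeated Ken Jennings in 2011."]
types = {"Jennings": "PERSON", "Ken": "PERSON", "Watson": "SYSTEM"}
question = "Who did Watson defeat?"
candidates = generate_hypotheses(question, corpus)
print(filter_by_semantic_type(candidates, analyze_question(question), types))
# {'Jennings', 'Ken'}  (set order may vary)
```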
Wednesday, February 13, 2013
IBM SPSS Text Analytics: A Shakespearean Stage
This is the third in a series of posts about IBM's SPSS Text Analytics platform, STAS (first post here, second here). These tests were performed Tuesday evening.
Yet again, my workaday schedule was a bit light on free time, so I didn't get to dig as deep as I had wanted (that 14-day free trial is tick-tocking away, a low, dull, quick sound, such as a watch makes when enveloped in cotton, but this only increases my fury, as the beating of a drum stimulates the soldier into courage).
With tonight's SOTU speech, I of course want to use my last day (tomorrow) to run a bake-off between the inevitable word-frequency analyses that will pop up and STAS's more in-depth tools.
So, for tonight, I went back to my literary roots and performed a simple Digital Humanities analysis of Shakespeare's 154 sonnets, using the free Project Gutenberg version. I had to do a little document pre-processing, of course (okay, I had to do A LOT of pre-processing). I've already noted that STAS requires unstructured language data to be ingested via cells in a spreadsheet, so I pasted each sonnet into its own cell, then ran STAS's automatic NLP tools. The processing took all of 90 seconds.
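For anyone who would rather script that step than paste by hand, here is a minimal sketch of the pre-processing. It assumes the Gutenberg header and footer have already been stripped and that each sonnet is preceded by a Roman-numeral heading on its own line; the file name is my own.

```python
# Minimal pre-processing sketch: split the Project Gutenberg sonnets into
# one sonnet per row of a CSV (i.e., one per spreadsheet cell). Assumes the
# Gutenberg boilerplate is already stripped and each sonnet is headed by a
# Roman numeral on its own line; "sonnets.txt" is a made-up file name.
import csv
import re

with open("sonnets.txt", encoding="utf-8") as f:
    text = f.read()

# Split on lines consisting solely of a Roman numeral (I ... CLIV).
chunks = re.split(r"(?m)^\s*[IVXLC]+\.?\s*$", text)
sonnets = [c.strip() for c in chunks if c.strip()]

with open("sonnets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["sonnet", "text"])
    for i, s in enumerate(sonnets, start=1):
        writer.writerow([i, " ".join(s.split())])

print(f"wrote {len(sonnets)} sonnets")  # expect 154
```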
What this gave me was a set of what IBM calls "concepts" and "types." I take a "concept" to be roughly a set of synonyms, with the most frequent lexeme serving as the concept's exemplar. For example, STAS identified a concept it called "excellent" containing 40 linguistic items, including "great", "best", and "perfection" (see image below).
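If that reading is right, a "concept" is just a synonym set keyed by its most frequent attested member, which is easy to sketch. The synonym sets below are hand-supplied stand-ins, since I don't know what lexical resources STAS draws on internally.

```python
# Minimal sketch of the "concept" structure as I understand it: a synonym
# set labeled by its most frequent member in the corpus. The synonym sets
# here are hand-supplied stand-ins for STAS's internal lexical resources.
from collections import Counter

def build_concepts(tokens, synonym_sets):
    """Label each synonym set with its most frequent attested member."""
    freq = Counter(tokens)
    concepts = {}
    for members in synonym_sets:
        attested = [w for w in members if freq[w] > 0]
        if attested:
            exemplar = max(attested, key=lambda w: freq[w])
            concepts[exemplar] = sorted(attested)
    return concepts

tokens = ("excellent great best perfection great fair "
          "sweet fair fair love love love").split()
synonym_sets = [{"excellent", "great", "best", "perfection"},
                {"fair", "sweet"}]
print(build_concepts(tokens, synonym_sets))
# {'great': ['best', 'excellent', 'great', 'perfection'], 'fair': ['fair', 'sweet']}
```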
So far, I'm pretty impressed. Remember, STAS only took about 90 seconds of processing to produce all this. And this isn't half of what it did.
While I'm impressed, I saw some clear bad apples. For example, STAS generated a concept called "art", but the set of linguistic items it included in that concept is overly tied to the literal string a-r-t; see below:
However, to give the IBM crew the credit they deserve, they never claim the default parameters are right for every project. They state clearly that the categorization process is iterative, requiring human-in-the-loop intervention to get the most value out of the tool. I simply let the default parameters run and looked at what popped out; STAS provides many ways to intervene manually in the process and improve the output.
In addition to providing concepts (synonyms), STAS analyzes "types", which I take to be higher-order groupings like entities. For example, it will identify Organizations, Dates, Persons, and Locations. These types are well known to information extraction (IE) practitioners; they are the bread and butter of IE.
For example, STAS identified a type it called "budget" with items like "pay", "loan", and "fortune". See the screenshot below for examples.
Another interesting example of a type that STAS identified in 90 seconds is "eyes", including "bright eyes", "eyes delight" and "far eyes".
The "types" are not typical types that IE pros are used to dealing with, but I suspect that's a function of the Shakespeare corpora I used. I previously ran some tweets through it and the types were more typical, like Microsoft and San Francisco and such.
I haven't delved deeply into STAS's sentiment analysis toolkit, but it does provide a variety of ways of analyzing the sentiment expressed in natural language. For example, the image below shows some of the positive-sentiment words it identified.
Keep in mind that the more powerful tools it provides (which I haven't played with yet) allow querying language data for combinations like Food + Positive, to capture positive opinion about food in a particular Shakespeare play or scene.
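The underlying query idea is simple to mock up: intersect a type lexicon with a sentiment lexicon over the same unit of text. A toy version follows, with both lexicons (and the example lines) hand-rolled purely for illustration.

```python
# Toy mock-up of a Type + Sentiment query (e.g., Food + Positive): flag any
# text unit containing both a food term and a positive-sentiment term.
# Both lexicons and the example lines are hand-rolled stand-ins.
import re

FOOD = {"feast", "wine", "fruit", "sweets"}
POSITIVE = {"delight", "sweet", "fair", "love", "joy"}

def matches(text, type_terms, sentiment_terms):
    words = set(re.findall(r"[a-z']+", text.lower()))
    return bool(words & type_terms) and bool(words & sentiment_terms)

units = ["the feast was a delight",
         "the wine turned sour",
         "no food here, just love"]
print([u for u in units if matches(u, FOOD, POSITIVE)])
# ['the feast was a delight']
```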
With that, I'm truly looking forward to pitting STAS against the SOTU word-count illiterati who will cloud the airwaves.