We were getting good results with a feature that turned out to be cheating: it made sense to use it, but the value in the logs reflected not the incoming request but the outgoing response, which indirectly encoded the category we were trying to classify.
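To make the leak concrete, here’s a minimal sketch (all field names are hypothetical, not our actual system): a feature pulled from the logged outgoing response is effectively a proxy for the label, so it has to be excluded from training, because it won’t exist when a live request arrives.

    def extract_features(log_record, at_request_time=True):
        """Build a feature dict from one logged transaction."""
        features = {
            # Fields genuinely present in the incoming request.
            "query_text": log_record["request"]["query"],
            "channel": log_record["request"]["channel"],
        }
        if not at_request_time:
            # Leaky feature: this value was written after the system produced
            # its response, so it indirectly encodes the category.
            features["response_code"] = log_record["response"]["code"]
        return features

    if __name__ == "__main__":
        record = {
            "request": {"query": "reset my password", "channel": "web"},
            "response": {"code": "ACCT_SUPPORT"},  # written after classification
            "label": "account_support",
        }
        print(extract_features(record, at_request_time=False))  # offline: leaks the label
        print(extract_features(record, at_request_time=True))   # runtime: honest features only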
The kicker is that this data’s only an approximation of the real problem. But it’s the best we have, and while more data’s better than more learning, some data’s better than nothing.
Other customers have wanted us to find things for them (e.g., forward earnings statements in 10-Q footnotes, opinions about cars in blogs, recording artists in news), but there was no existing data, so we had to (help them) create it. That’s when they run into problems like whether “Bob Dylans” is a person mention in “The Six Bob Dylans: More Photos From Todd Haynes’ ‘I’m Not There’ Movie”. It turns out the customers are not semantics grad students or ontology boffins, so they usually just don’t care.
But the real problem with all of this tuning a system to within a percent of its life is that it’s usually just overfitting once you go out into the wild. For instance, the customer mentioned above plans to change their site’s overall organization and instruction text, so none of our training data will exactly replicate the runtime environment.
Nice catch on this conference. Eric Siegel, the organizer, appears to be heavily focused on real-world case studies, which should be interesting.
As you know, the Bay Area R UseRs group is doing a free, co-located event on Wed evening of the conference — so if you’re interested in mingling with some PAW folks as well as some R users — you can sign up at: http://ia.meetup.com/67/calendar/9573566/
Mike