Tuesday, June 30, 2009
Less Talk, More Rock: Automated Organization of Community-Contributed Collections of Concert Videos
Can you detect the correct order of a bunch of videos recorded at the same event by different people?
Less Talk, More Rock: Automated Organization of Community-Contributed Collections of Concert Videos is a Yahoo paper that tries to answer this question on a testbed of YouTube videos.
The key idea is to compute fingerprints based on the audio track and then cluster the clips into a time sequence of videos.
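Here is a minimal sketch of the alignment idea, under my own assumptions (the paper's fingerprints are far more robust than this toy version): reduce each clip to a coarse per-frame fingerprint, then slide two fingerprint sequences against each other to find the relative offset where they agree the most.

```python
# A toy sketch of audio-fingerprint alignment (my own simplification, not
# the paper's method): hash each frame by its dominant frequency bin, then
# find the offset between two clips with the most matching frames.
import numpy as np

FRAME = 1024  # samples per fingerprint frame

def fingerprints(audio):
    """One small integer per frame: the index of the dominant frequency bin."""
    n = len(audio) // FRAME
    frames = audio[: n * FRAME].reshape(n, FRAME)
    return np.abs(np.fft.rfft(frames, axis=1)).argmax(axis=1)

def best_offset(fp_a, fp_b, min_overlap=8):
    """Frame offset of clip B relative to clip A that maximizes matches."""
    best, best_score = 0, -1
    for off in range(-len(fp_b) + min_overlap, len(fp_a) - min_overlap + 1):
        a0, b0 = max(0, off), max(0, -off)
        n = min(len(fp_a) - a0, len(fp_b) - b0)
        score = int(np.sum(fp_a[a0:a0 + n] == fp_b[b0:b0 + n]))
        if score > best_score:
            best, best_score = off, score
    return best, best_score

# Toy usage: two "recordings" of the same event, the second starting 10 frames later.
rng = np.random.default_rng(0)
event = rng.standard_normal(60 * FRAME)
clip_a, clip_b = event[: 40 * FRAME], event[10 * FRAME: 50 * FRAME]
print(best_offset(fingerprints(clip_a), fingerprints(clip_b)))  # (10, 30)
```

Once pairwise offsets are known, overlapping clips can be chained onto a global timeline, which is roughly the clustering step the paper describes.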
Monday, June 29, 2009
Large Scale Multi-Label Classification via MetaLabeler
SVMs are very effective for binary classification, and there are also extensions for multi-class SVM. Still, a more frequently used approach is to reduce the problem to binary classification with a one-against-all scheme (each category is considered against all the others, so n categories are handled by leveraging n binary classifiers).
Large Scale Multi-Label Classification via MetaLabeler suggests extending the one-against-all approach with an auxiliary classifier, which learns how many of the top-k scores are actually relevant, i.e., how many labels to assign.
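A hedged sketch of the idea with scikit-learn; the choice of meta-features below is an assumption (the paper evaluates several, including content features and score-based features):

```python
# Sketch of the MetaLabeler idea: n one-vs-rest binary SVMs produce scores,
# and an auxiliary "meta" regressor predicts how many labels to keep.
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import Ridge
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, Y = make_multilabel_classification(n_samples=500, n_classes=10, random_state=0)

ovr = OneVsRestClassifier(LinearSVC()).fit(X, Y)  # n binary classifiers
meta = Ridge().fit(X, Y.sum(axis=1))              # predicts the label count

def predict_labels(x):
    x = x.reshape(1, -1)
    scores = ovr.decision_function(x)[0]
    k = max(1, round(float(meta.predict(x)[0])))  # how many labels to assign
    return sorted(np.argsort(scores)[::-1][:k].tolist())

print(predict_labels(X[0]), "true:", np.flatnonzero(Y[0]).tolist())
```

The point is that the one-against-all scores are kept as-is; only the cut-off point is learned.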
Sunday, June 28, 2009
Efficient Multiple-Click Models in Web Search
Can you use past clicks on search results to predict future clicks? Efficient Multiple-Click Models in Web Search is a Microsoft paper that aims to answer this question.
Two different models are evaluated: an Independent Click Model (ICM), where clicks are independent of each other (i.e., position is not taken into account), and a Dependent Click Model (DCM), where position is taken into account. The models are built on a strong theoretical background based on log-likelihood maximization. Nevertheless, the final formulas are very simple to compute: they require linear time and space and can be updated incrementally.
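A much-simplified sketch of what "incremental, linear time and space" means in practice; the counters below are my own caricature of the two models, not the paper's exact estimators:

```python
# Both models reduce to a few counters per query-URL pair, so each logged
# impression is an O(1) update and total memory is linear in the number of
# pairs. This is a deliberately simplified caricature of ICM/DCM.
from collections import defaultdict

class ICM:
    """Clicks are independent of position: P(click) = clicks / impressions."""
    def __init__(self):
        self.clicks = defaultdict(int)
        self.views = defaultdict(int)

    def update(self, query, shown_urls, clicked):
        for url in shown_urls:
            self.views[(query, url)] += 1
            self.clicks[(query, url)] += url in clicked

    def p_click(self, query, url):
        v = self.views[(query, url)]
        return self.clicks[(query, url)] / v if v else 0.0

class DCM(ICM):
    """Position-aware: count a view only down to the last clicked position."""
    def update(self, query, shown_urls, clicked):
        last = max((shown_urls.index(u) for u in clicked),
                   default=len(shown_urls) - 1)
        for url in shown_urls[: last + 1]:
            self.views[(query, url)] += 1
            self.clicks[(query, url)] += url in clicked

icm, dcm = ICM(), DCM()
for model in (icm, dcm):
    model.update("q", ["a", "b", "c"], clicked={"a"})
print(icm.views[("q", "c")], dcm.views[("q", "c")])  # 1 vs 0: DCM skips "c"
```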
Saturday, June 27, 2009
Generate all the permutations of a multiset
A multiset has fewer than n! distinct permutations, since elements can be repeated. The puzzle: generate all of them in optimal space and time.
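One classic solution (my own sketch, certainly not the only answer): start from the sorted arrangement and repeatedly apply the "next permutation" step, which emits each distinct permutation exactly once using only O(n) working space.

```python
# Generate all distinct permutations of a multiset in lexicographic order.
# Because we start sorted and always move to the next arrangement, no
# duplicates are produced and no seen-set is needed.
def multiset_permutations(items):
    a = sorted(items)
    while True:
        yield tuple(a)
        # Find the rightmost i with a[i] < a[i+1]; if none, we are done.
        i = len(a) - 2
        while i >= 0 and a[i] >= a[i + 1]:
            i -= 1
        if i < 0:
            return
        # Swap a[i] with the rightmost element greater than it, then
        # reverse the tail to get the smallest next arrangement.
        j = len(a) - 1
        while a[j] <= a[i]:
            j -= 1
        a[i], a[j] = a[j], a[i]
        a[i + 1:] = reversed(a[i + 1:])

print(list(multiset_permutations("aab")))
# [('a', 'a', 'b'), ('a', 'b', 'a'), ('b', 'a', 'a')] -- 3 results, not 3! = 6
```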
Thursday, June 25, 2009
Stephen Wolfram talking about Wolfram|Alpha at Pisa
Stephen Wolfram gave a public talk in Pisa about Wolfram|Alpha, his latest research pillar. This was his first public talk worldwide since the product launch about five weeks ago; he had already given other talks to restricted audiences.
I liked the approach he adopted for the presentation: he went straight to the point, giving a demonstration of the innovative aspects of Wolfram|Alpha.
He started with queries related to his research activities, such as Mathematica, NKS, Computational Knowledge Engine, and maths. Then he showed some natural language queries, such as "integrate cosx dx" and "find the integral of six x cos y". This was a first step beyond the traditional search engines' territory.
I believe the audience was impressed by that, and I heard people asking: "Is this just a nice demo, or can we move to other domains as well?" Stephen gave an impressive answer with queries like "gdp france" (economy) and "geoip" (internet), and with simple but effective computations on top of this data, such as "what is the gdp of spain?", "gdp france / italy", "italy internet users", "Pisa", "Lexington", and "Pisa Lexington" (where the engine assumes a travel intent), "sun", and many others.
Then he moved to more sophisticated forms of computation, such as "jupiter pisa dec 10 1608" (a tribute to Galileo, who was born in Pisa and observed Jupiter); metric computations such as "5 miles/sec"; chemical computations such as "pentane 2 atm 400 centigrade"; genetics such as "aggttgagaggatt" (with an impressive approximate pattern matching technology); finance with "msft apple" (a nice comparison between the two stocks); nutrition with "apple" (where the meaning is disambiguated between the stock and the fruit) and computations such as "potassium 3 apple + 1 cheddar cheese"; and personal finance computations like "5% mortgage 50000 euros 20 years". I liked it very much when the computational engine tried to guess mathematical formulae, as for "1 1 2 3 5", the famous Fibonacci numbers: another tribute to another famous man born in Pisa.
After this initial demo, Stephen discussed, at a very high level, the four fundamental pieces composing Wolfram|Alpha:
- A very large number of real-time data feeds, continuously cleaned up by human editors who are experts in their respective domains. Stephen gave no precise figure for how many people are involved in this cleansing step;
- A computational engine that maps the cleansed data feeds into an internal representation. Stephen said that any piece of information is mapped into "short fragments", which are not necessarily NLP units;
- A query engine that maps user queries into the internal representation. He claimed that currently about 22% of queries find no matching answer;
- A collection of ranking algorithms for selecting the best answer among the candidates.
Here is a collection of the questions, with a synthesis of the answers.
Q: How do you position W|A compared with the rest of NLP technology?
A: The problems we face are quite different from traditional NLP ones. We don't have well-structured documents, and we don't have syntactically well-formed sentences with subject, verb, and so on. We used Wikipedia quite extensively for understanding entities and pre-aggregated data; apart from that, though, Wikipedia is of little use, and even the infoboxes are not good from a quality point of view. One thing we want to investigate is letting users upload their own data to our engine and perform some computation on top of it. An additional research field we are investigating is what we call "minimum preserving transformation", a methodology to map different but semantically similar queries into our internal fragment-based representation.
Q: What are the resources involved in W|A in terms of people and servers?
A: We have 4 different co-location facilities (data centers). Every single request goes to 8 CPUs in parallel. At the very beginning we started with about 10,000 servers, and now we are increasing that number.
For each request, we have an initial serving time of about 5 ms before the first result is sent to the client; further results are then injected with AJAX. About 100 people worked on the project for about 3 years. Some of them have now returned to our core Mathematica development, but we are doing a massive hiring campaign. We are starting to mine our query logs, and there is a huge amount of information there: people want to search, not just to test knowledge.
Q: How important is the caching strategy?
A: Our queries are quite different from the ones submitted to search engines, where traditionally at least 25% of queries can be served from a cache. We have a rather low percentage, almost zero. One thing that is very promising is to use past successful queries to suggest related queries to new users.
Q: Can you search Wolfram?
(The vanity query worked quite well.)
Q: What about a deal with Google or other engines?
A: The future is pretty interesting, and we have nice relations with media and news companies. A bunch of new things will come in the next months...
Q: Do you think that Wolfram|Alpha will have a negative impact on homework, making people lazy about studying?
A: Any new technology raises the same questions. I believe similar questions were asked when libraries arrived, when encyclopedias shipped on CDs, and again with Wikipedia. Actually, I strongly believe that Wolfram|Alpha can encourage people to study more and more.
Q: I am quite interested in the type of queries and questions you receive. How many of them are about facts that are happening right now? For instance, "iran election".
A: A large number of the queries we get are about real-time data. We are investigating this area. Potentially, we would need a large number of feeds, to clean them in real time, and to perform real-time computation on top of them. This is interesting, and we get a lot of queries like this, even though our query log analysis is just 5 weeks old.
(Disclaimer: the question above was my own. I have noticed that W|A performs poorly on facts that are happening right now, or that happened just a few weeks ago. It's good that they want to address the issue.)
Q: How do you see the future of W|A? How difficult is it to run a project like this?
A: Wolfram is an independent company, and we have the money and the craziness to throw hundreds of millions at a project that, at the beginning, had no clear future at all. I wanted to answer a question: "With current technology, is a computational knowledge engine feasible?" At the beginning I thought "No way", but I invested the money, and after two years I said "Maybe"; one year ago I started to think "Yes, it is". Many things are necessary to drive a project like this. You need the Web, with all the feeds and the information you can get from there. You need the craziness, and you need the power to take decisions freely, with a limited number of people involved. Basically, you need to drive it. Today we are just at time t + 5 weeks. The answer is yes, anyway.
Q: Do you think that this technology can be used to make guesses?
A: What do you mean?
Q: I mean that humans are used to reasoning by making guesses. Suppose someone asks me, "Is Los Angeles larger than Tallahassee?" I have never heard of Tallahassee, but I have heard of Los Angeles many times. Therefore, I would use the frequency of the name as an indicator of how large the city is; I would use a different measure to infer something I don't know.
A: (Stephen started to mumble... then he continued to mumble, and I saw some computations drawing on his face.) Then he said: "Quite an interesting question, I never thought about it. I guess you are referring to rules of thumb. So my question is: how many such rules are out there? I will think about it. We make some initial guesses when we try to infer mathematical formulas, as in "13 5445656 32", or when you ask for things like "5000 words". Personally, I believe that human reason is great, but science is better."
Q: Do you believe that W|A can answer philosophical questions? Like the ultimate one: "Does God exist?" He ran the query and got this answer:
"I'm sorry, but I don't think a poor computational knowledge engine, no matter how powerful, is capable of providing a simple answer to that question."
(Just after that, his phone rang. I thought he had received a call from above!) Was this a prepared question? Maybe; I don't know.
So, the conclusion. The presentation was quite interesting, and we are definitely in front of something new. When W|A gives an answer, it is generally quite impressive, and you cannot stop playing with it. The precision is good, but the recall is still quite low: coverage within certain specific domains is limited. Anyway, we are just at "t + 5 weeks", as Stephen said, so it is too early to express a definitive judgment.
I can say that computation is quite effective when you navigate through specific domains such as maths, physics, nutrition, geography, finance, and a bunch of others. There are domains where they have no data, and hence no computation at all. And, as Stephen said, the engine currently knows only English, no other languages.
I have a request for the W|A team. I understand the need to protect the IP of what you are doing with patents and the like; nevertheless, I hope you will publish a bit more about your results in scientific venues, as all the other big engines do (e.g., Google Research, Microsoft Research, and Yahoo Research). I hope you are not going in the direction of keeping all your industrial results secret. I remember hearing the term "knowledge engine" back in '98, and that engine is no longer around; I believe this was partly because they adopted a rather obscure way of describing their technology (personal opinion).
W|A is a quite different story: I want to see it evolving and opening up to the rest of the world. I will keep monitoring your results.
Tuesday, June 23, 2009
What is happening in Washington? Real-time text and other signals
I strongly believe real-time search should adopt new ranking signals, rather than being just text-based.
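To make the point concrete, here is a toy sketch (my own, not any engine's actual formula) that blends a text-relevance score with a freshness signal:

```python
# Toy real-time ranking: damp the text score with an exponential freshness
# decay so that, all else being equal, newer items rank higher. The blend
# weights and half-life are arbitrary illustration values.
import math
import time

def realtime_score(text_score, posted_at, half_life_s=3600.0):
    """Halve the freshness boost every half_life_s seconds."""
    age = max(0.0, time.time() - posted_at)
    freshness = math.exp(-math.log(2) * age / half_life_s)
    return text_score * (0.5 + 0.5 * freshness)  # keep some weight on text

now = time.time()
print(realtime_score(1.0, now))          # fresh item: ~1.0
print(realtime_score(1.0, now - 7200))   # two hours old: ~0.625
```

Other candidate signals (author reputation, propagation speed, and so on) could be blended in the same way.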
Monday, June 22, 2009
How Much Can Behavioral Targeting Help Online Advertising?
How Much Can Behavioral Targeting Help Online Advertising? is a Microsoft paper about Internet monetization. The paper shows how behavioral targeting can help improve monetization and click-through rate (CTR).
In other words, users with similar search needs tend to click similar ads. Segmenting users by means of clustering is therefore a good strategy for increasing the revenue of a search engine. Users are profiled by means of their search queries and clicked URLs. Standard clustering algorithms (such as k-means, min-wise hashing, and CLUTO) are then used to segment the users.
The reported results are pretty impressive.
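A hedged sketch of the segmentation step with scikit-learn; the exact features and clustering setup below are my assumptions, not the paper's:

```python
# Segment users by clustering their query/click profiles: each user becomes
# a bag-of-words "document" built from their queries and clicked URLs, and
# k-means groups users with similar needs (toy data, toy scale).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

user_profiles = [
    "cheap flights rome hotel rome booking.com",   # travel intent
    "flight deals paris hotel paris expedia.com",  # travel intent
    "python tutorial numpy docs scipy.org",        # programming intent
    "java generics tutorial oracle.com docs",      # programming intent
]
X = TfidfVectorizer().fit_transform(user_profiles)
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(segments)  # e.g. [1 1 0 0]: travel users vs. programming users
```

Ads can then be targeted per segment, and segment-level CTR compared against untargeted CTR.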
Sunday, June 21, 2009
Speeding up Algorithms on Compressed Web Graphs
Speeding up Algorithms on Compressed Web Graphs is a Microsoft paper @ WSDM09 that leverages a smart transformation for compressing the Web graph. The idea is pretty simple: a heuristic based on frequent itemset mining is used to transform cliques into star connections.
As a consequence, the number of edges can be dramatically reduced, at the cost of introducing some new virtual nodes.
The paper then introduces an optimized matrix-vector multiplication that reorders the adjacency matrix of the transformed Web graph. This multiplication is used to speed up PageRank, HITS, and SALSA on the compressed graph.
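A small sketch of just the structural rewrite, under my own simplifications (the paper uses frequent itemset mining to find the dense subgraphs, and adjusts the matrix-vector product accordingly):

```python
# Clique-to-star rewrite: a clique on k nodes has k*(k-1) directed edges,
# but routing them through one virtual node needs only 2*k edges, while
# preserving reachability (every intra-clique path just gains one hop).
def clique_to_star(edges, clique, virtual):
    """Replace intra-clique edges with edges through a new virtual node."""
    clique = set(clique)
    kept = {(u, v) for u, v in edges if not (u in clique and v in clique)}
    star = {(u, virtual) for u in clique} | {(virtual, v) for v in clique}
    return kept | star

nodes = range(10)                                # 10 mutually linked pages
edges = {(u, v) for u in nodes for v in nodes if u != v}
compressed = clique_to_star(edges, nodes, virtual="V")
print(len(edges), "->", len(compressed))         # 90 -> 20 edges
```

PageRank-style iterations on the rewritten graph must account for the virtual nodes (e.g., by not letting them absorb rank), which is where the paper's reordered matrix-vector multiplication comes in.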
Saturday, June 20, 2009
Steve Ballmer on Search
Nice article on Bing, the new Microsoft search engine.
"Sometimes the error you make is what you don't do or what you don't start soon enough," he said. "Most of our mistakes came not because we didn't see the technology change that was coming. Ironically, we didn't see the business change that was coming."
"He blames Microsoft's corporate heft, in part. Microsoft had spent richly on research and development. Its R&D budget comes to 9ドル billion this year alone. And the company had plenty of people working on producing a search engine. But Microsoft had lost its startup brashness. New companies have an edge: They have to succeed big or go bankrupt. That forces them to take risks fast, before it's too late."
"We've got our mojo going now. We're rolling. We're the little engine that could."
"Sometimes the error you make is what you don't do or what you don't start soon enough," he said. "Most of our mistakes came not because we didn't see the technology change that was coming. Ironically, we didn't see the business change that was coming."
"He blames Microsoft's corporate heft, in part. Microsoft had spent richly on research and development. Its R&D budget comes to 9ドル billion this year alone. And the company had plenty of people working on producing a search engine. But Microsoft had lost its startup brashness. New companies have an edge: They have to succeed big or go bankrupt. That forces them to take risks fast, before it's too late."
"We've got our mojo going now. We're rolling. We're the little engine that could."
Thursday, June 18, 2009
Real time search (what about Ranking?)
The number of real-time search engines is increasing, and yesterday two new engines joined the race. CrowdEye is from Ken Moss, who ran search engineering at Microsoft and built the new engine himself. Collecta is from Gerry Campbell, who was a search executive at AOL and Reuters, as well as an adviser to Summize (now Twitter Search). OneRiot is run by Kimbal Musk and my old friend Alessio Signorini. Other engines are Topsy, Tweetmeme, and Scoopler, not to mention Twitter Search itself.
This reminds me of the old times when Excite, Lycos, AltaVista, and a lot of newcomers joined the search race, back in 1997. Search is alive and kicking, with exciting new stuff to play with.
I would expect more and more academic publications on real-time search problems, such as real-time ranking.
The Tradeoffs Between Open and Traditional Relation Extraction
The Tradeoffs Between Open and Traditional Relation Extraction is a paper about extracting relations between entities on a massive scale. The system is based on a self-training (unsupervised) approach built on Conditional Random Fields (CRFs): a set of training examples is extracted from a training corpus by means of hand-crafted parsing rules, and these examples are then used to train a linear-chain CRF.
Results are pretty impressive in terms of both precision and recall.
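A hedged sketch of just the bootstrapping step; the patterns, the entity regex, and the example text are my own, and the CRF training itself is not reproduced here:

```python
# Hand-crafted seed rules pull (arg1, relation, arg2) triples out of raw
# sentences; in the paper, such self-labeled examples then train a
# linear-chain CRF. The "entity" regex is a naive capitalized-phrase match.
import re

NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"
SEED_PATTERNS = [
    re.compile(rf"(?P<a1>{NAME}), (?P<rel>founder|CEO) of (?P<a2>{NAME})"),
    re.compile(rf"(?P<a1>{NAME}) (?P<rel>was born) in (?P<a2>{NAME})"),
]

def seed_triples(sentence):
    """Yield (arg1, relation, arg2) triples matched by the seed rules."""
    for pattern in SEED_PATTERNS:
        for m in pattern.finditer(sentence):
            yield m.group("a1"), m.group("rel"), m.group("a2")

text = "Galileo was born in Pisa. Stephen Wolfram, founder of Wolfram Research, spoke."
for sentence in text.split(". "):
    print(list(seed_triples(sentence)))
# [('Galileo', 'was born', 'Pisa')]
# [('Stephen Wolfram', 'founder', 'Wolfram Research')]
```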