Tuesday, June 30, 2009
Less Talk, More Rock: Automated Organization of Community-Contributed Collections of Concert Videos
Can you detect the correct order of a bunch of videos recorded at the same event by different people?
Less Talk, More Rock: Automated Organization of Community-Contributed Collections of Concert Videos is a Yahoo paper that tries to answer this question on a testbed of YouTube videos.
The key idea is to compute fingerprints based on the audio track and then cluster the clips into a time sequence of videos.
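Here is a minimal sketch of the alignment idea, under my own assumptions (the paper's fingerprints are far more robust than this toy version): reduce each clip to a coarse per-frame fingerprint, then slide two fingerprint sequences against each other to find the relative offset where they agree the most.

```python
# A toy sketch of audio-fingerprint alignment (my own simplification, not
# the paper's method): hash each frame by its dominant frequency bin, then
# find the offset between two clips with the most matching frames.
import numpy as np

FRAME = 1024  # samples per fingerprint frame

def fingerprints(audio):
    """One small integer per frame: the index of the dominant frequency bin."""
    n = len(audio) // FRAME
    frames = audio[: n * FRAME].reshape(n, FRAME)
    return np.abs(np.fft.rfft(frames, axis=1)).argmax(axis=1)

def best_offset(fp_a, fp_b, min_overlap=8):
    """Frame offset of clip B relative to clip A that maximizes matches."""
    best, best_score = 0, -1
    for off in range(-len(fp_b) + min_overlap, len(fp_a) - min_overlap + 1):
        a0, b0 = max(0, off), max(0, -off)
        n = min(len(fp_a) - a0, len(fp_b) - b0)
        score = int(np.sum(fp_a[a0:a0 + n] == fp_b[b0:b0 + n]))
        if score > best_score:
            best, best_score = off, score
    return best, best_score

# Toy usage: two "recordings" of the same event, the second starting 10 frames later.
rng = np.random.default_rng(0)
event = rng.standard_normal(60 * FRAME)
clip_a, clip_b = event[: 40 * FRAME], event[10 * FRAME: 50 * FRAME]
print(best_offset(fingerprints(clip_a), fingerprints(clip_b)))  # (10, 30)
```

Once pairwise offsets are known, overlapping clips can be chained onto a global timeline, which is roughly the clustering step the paper describes.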
Monday, June 29, 2009
Large Scale Multi-Label Classification via MetaLabeler
SVMs are very effective for binary classification, and there are also extensions for multi-class SVM. Still, a more frequently used approach is to reduce the problem to binary classification with a one-against-all scheme (each category is considered against all the others, so n categories are handled by leveraging n binary classifiers).
Large Scale Multi-Label Classification via MetaLabeler suggests extending the one-against-all approach with an auxiliary classifier, which learns how many of the top-k scores are actually relevant, i.e., how many labels to assign.
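A hedged sketch of the idea with scikit-learn; the choice of meta-features below is an assumption (the paper evaluates several, including content features and score-based features):

```python
# Sketch of the MetaLabeler idea: n one-vs-rest binary SVMs produce scores,
# and an auxiliary "meta" regressor predicts how many labels to keep.
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import Ridge
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, Y = make_multilabel_classification(n_samples=500, n_classes=10, random_state=0)

ovr = OneVsRestClassifier(LinearSVC()).fit(X, Y)  # n binary classifiers
meta = Ridge().fit(X, Y.sum(axis=1))              # predicts the label count

def predict_labels(x):
    x = x.reshape(1, -1)
    scores = ovr.decision_function(x)[0]
    k = max(1, round(float(meta.predict(x)[0])))  # how many labels to assign
    return sorted(np.argsort(scores)[::-1][:k].tolist())

print(predict_labels(X[0]), "true:", np.flatnonzero(Y[0]).tolist())
```

The point is that the one-against-all scores are kept as-is; only the cut-off point is learned.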
Sunday, June 28, 2009
Efficient Multiple-Click Models in Web Search
Can you use past clicks on search results to predict future clicks? Efficient Multiple-Click Models in Web Search is a Microsoft paper that aims to answer this question.
Two different models are evaluated: an Independent Click Model (ICM), where clicks are independent of each other (i.e., position is not taken into account), and a Dependent Click Model (DCM), where position is taken into account. The models are built on a strong theoretical background based on log-likelihood maximization. Nevertheless, the final formulas are very simple to compute: they require linear time and space and can be updated incrementally.
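A much-simplified sketch of what "incremental, linear time and space" means in practice; the counters below are my own caricature of the two models, not the paper's exact estimators:

```python
# Both models reduce to a few counters per query-URL pair, so each logged
# impression is an O(1) update and total memory is linear in the number of
# pairs. This is a deliberately simplified caricature of ICM/DCM.
from collections import defaultdict

class ICM:
    """Clicks are independent of position: P(click) = clicks / impressions."""
    def __init__(self):
        self.clicks = defaultdict(int)
        self.views = defaultdict(int)

    def update(self, query, shown_urls, clicked):
        for url in shown_urls:
            self.views[(query, url)] += 1
            self.clicks[(query, url)] += url in clicked

    def p_click(self, query, url):
        v = self.views[(query, url)]
        return self.clicks[(query, url)] / v if v else 0.0

class DCM(ICM):
    """Position-aware: count a view only down to the last clicked position."""
    def update(self, query, shown_urls, clicked):
        last = max((shown_urls.index(u) for u in clicked),
                   default=len(shown_urls) - 1)
        for url in shown_urls[: last + 1]:
            self.views[(query, url)] += 1
            self.clicks[(query, url)] += url in clicked

icm, dcm = ICM(), DCM()
for model in (icm, dcm):
    model.update("q", ["a", "b", "c"], clicked={"a"})
print(icm.views[("q", "c")], dcm.views[("q", "c")])  # 1 vs 0: DCM skips "c"
```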
Saturday, June 27, 2009
Generate all the permutations of a multiset
A multiset has fewer than n! distinct permutations, since elements can be repeated. The puzzle: generate all of them in optimal space and time.
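One classic solution (my own sketch, certainly not the only answer): start from the sorted arrangement and repeatedly apply the "next permutation" step, which emits each distinct permutation exactly once using only O(n) working space.

```python
# Generate all distinct permutations of a multiset in lexicographic order.
# Because we start sorted and always move to the next arrangement, no
# duplicates are produced and no seen-set is needed.
def multiset_permutations(items):
    a = sorted(items)
    while True:
        yield tuple(a)
        # Find the rightmost i with a[i] < a[i+1]; if none, we are done.
        i = len(a) - 2
        while i >= 0 and a[i] >= a[i + 1]:
            i -= 1
        if i < 0:
            return
        # Swap a[i] with the rightmost element greater than it, then
        # reverse the tail to get the smallest next arrangement.
        j = len(a) - 1
        while a[j] <= a[i]:
            j -= 1
        a[i], a[j] = a[j], a[i]
        a[i + 1:] = reversed(a[i + 1:])

print(list(multiset_permutations("aab")))
# [('a', 'a', 'b'), ('a', 'b', 'a'), ('b', 'a', 'a')] -- 3 results, not 3! = 6
```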
Thursday, June 25, 2009
Stephen Wolfram talking about Wolfram|Alpha at Pisa
Stephen Wolfram gave a public talk in Pisa about Wolfram|Alpha, his latest research pillar. This was his first public talk worldwide since the product launch about five weeks ago; he had already given other talks to restricted audiences.
I liked the approach he adopted for the presentation: he went straight to the point, giving a demonstration of the innovative aspects of Wolfram|Alpha.
He started with queries related to his research activities, such as Mathematica, NKS, Computational Knowledge Engine, and maths. Then he showed some natural language queries, such as "integrate cosx dx" and "find the integral of six x cos y". This was a first step beyond the traditional search engines' territory.
I believe the audience was impressed by that, and I heard people asking: "Is this just a nice demo, or can we move to other domains as well?" Stephen gave an impressive answer with queries like "gdp france" (economy) and "geoip" (internet), and with simple but effective computations on top of this data, such as "what is the gdp of spain?", "gdp france / italy", "italy internet users", "Pisa", "Lexington", and "Pisa Lexington" (where the engine assumes a travel intent), "sun", and many others.
Then he moved to more sophisticated forms of computation, such as "jupiter pisa dec 10 1608" (a tribute to Galileo, who was born in Pisa and observed Jupiter); metric computations such as "5 miles/sec"; chemical computations such as "pentane 2 atm 400 centigrade"; genetics such as "aggttgagaggatt" (with an impressive approximate pattern matching technology); finance with "msft apple" (a nice comparison between the two stocks); nutrition with "apple" (where the meaning is disambiguated between the stock and the fruit) and computations such as "potassium 3 apple + 1 cheddar cheese"; and personal finance computations like "5% mortgage 50000 euros 20 years". I liked it very much when the computational engine tried to guess mathematical formulae, as for "1 1 2 3 5", the famous Fibonacci numbers: another tribute to another famous man born in Pisa.
After this initial demo, Stephen discussed, at a very high level, the four fundamental pieces composing Wolfram|Alpha:
- A very large number of real-time data feeds, continuously cleaned up by human editors who are experts in their respective domains. Stephen gave no precise figure for how many people are involved in this cleansing step;
- A computational engine that maps the cleansed data feeds into an internal representation. Stephen said that any piece of information is mapped into "short fragments", which are not necessarily NLP units;
- A query engine that maps user queries into the internal representation. He claimed that currently about 22% of queries find no matching answer;
- A collection of ranking algorithms for selecting the best answer among the candidates.
Here is a collection of the questions, with a synthesis of the answers.
Q: How do you position W|A compared with the rest of NLP technology?
A: The problems we face are quite different from traditional NLP ones. We don't have well-structured documents, and we don't have syntactically well-formed sentences with subject, verb, and so on. We used Wikipedia quite extensively for understanding entities and pre-aggregated data; apart from that, though, Wikipedia is of little use, and even the infoboxes are not good from a quality point of view. One thing we want to investigate is letting users upload their own data to our engine and perform some computation on top of it. An additional research field we are investigating is what we call "minimum preserving transformation", a methodology to map different but semantically similar queries into our internal fragment-based representation.
Q: What are the resources involved in W|A in terms of people and servers?
A: We have 4 different co-location facilities (data centers). Every single request goes to 8 CPUs in parallel. At the very beginning we started with about 10,000 servers, and now we are increasing that number.
For each request, we have an initial serving time of about 5 ms before the first result is sent to the client; further results are then injected with AJAX. About 100 people worked on the project for about 3 years. Some of them have now returned to our core Mathematica development, but we are doing a massive hiring campaign. We are starting to mine our query logs, and there is a huge amount of information there: people want to search, not just to test knowledge.
Q: How important is the caching strategy?
A: Our queries are quite different from the ones submitted to search engines, where traditionally at least 25% of queries can be served from a cache. We have a rather low percentage, almost zero. One thing that is very promising is to use past successful queries to suggest related queries to new users.
Q: Can you search Wolfram?
(The vanity query worked quite well.)
Q: What about a deal with Google or other engines?
A: The future is pretty interesting, and we have nice relations with media and news companies. A bunch of new things will come in the next months...
Q: Do you think that Wolfram|Alpha will have a negative impact on homework, making people lazy about studying?
A: Any new technology raises the same questions. I believe similar questions were asked when libraries arrived, when encyclopedias shipped on CDs, and again with Wikipedia. Actually, I strongly believe that Wolfram|Alpha can encourage people to study more and more.
Q: I am quite interested in the type of queries and questions you receive. How many of them are about facts that are happening right now? For instance, "iran election".
A: A large number of the queries we get are about real-time data. We are investigating this area. Potentially, we would need a large number of feeds, to clean them in real time, and to perform real-time computation on top of them. This is interesting, and we get a lot of queries like this, even though our query log analysis is just 5 weeks old.
(Disclaimer: the question above was my own. I have noticed that W|A performs poorly on facts that are happening right now, or that happened just a few weeks ago. It's good that they want to address the issue.)
Q: How do you see the future of W|A? How difficult is it to run a project like this?
A: Wolfram is an independent company, and we have the money and the craziness to throw hundreds of millions at a project that, at the beginning, had no clear future at all. I wanted to answer a question: "With current technology, is a computational knowledge engine feasible?" At the beginning I thought "No way", but I invested the money, and after two years I said "Maybe"; one year ago I started to think "Yes, it is". Many things are necessary to drive a project like this. You need the Web, with all the feeds and the information you can get from there. You need the craziness, and you need the power to take decisions freely, with a limited number of people involved. Basically, you need to drive it. Today we are just at time t + 5 weeks. The answer is yes, anyway.
Q: Do you think that this technology can be used to make guesses?
A: What do you mean?
Q: I mean that humans are used to reasoning by making guesses. Suppose someone asks me, "Is Los Angeles larger than Tallahassee?" I have never heard of Tallahassee, but I have heard of Los Angeles many times. Therefore, I would use the frequency of the name as an indicator of how large the city is; I would use a different measure to infer something I don't know.
A: (Stephen started to mumble... then he continued to mumble, and I saw some computations drawing on his face.) Then he said: "Quite an interesting question, I never thought about it. I guess you are referring to rules of thumb. So my question is: how many such rules are out there? I will think about it. We make some initial guesses when we try to infer mathematical formulas, as in "13 5445656 32", or when you ask for things like "5000 words". Personally, I believe that human reason is great, but science is better."
Q: Do you believe that W|A can answer philosophical questions? Like the ultimate one: "Does God exist?" He ran the query and got this answer:
"I'm sorry, but I don't think a poor computational knowledge engine, no matter how powerful, is capable of providing a simple answer to that question."
(Just after that, his phone rang. I thought he had received a call from above!) Was this a prepared question? Maybe; I don't know.
So, the conclusion. The presentation was quite interesting, and we are definitely in front of something new. When W|A gives an answer, it is generally quite impressive, and you cannot stop playing with it. The precision is good, but the recall is still quite low: coverage within certain specific domains is limited. Anyway, we are just at "t + 5 weeks", as Stephen said, so it is too early to express a definitive judgment.
I can say that computation is quite effective when you navigate through specific domains such as maths, physics, nutrition, geography, finance, and a bunch of others. There are domains where they have no data, and hence no computation at all. And, as Stephen said, the engine currently knows only English, no other languages.
I have a request for the W|A team. I understand the need to protect the IP of what you are doing with patents and the like; nevertheless, I hope you will publish a bit more about your results in scientific venues, as all the other big engines do (e.g., Google Research, Microsoft Research, and Yahoo Research). I hope you are not going in the direction of keeping all your industrial results secret. I remember hearing the term "knowledge engine" back in '98, and that engine is no longer around; I believe this was partly because they adopted a rather obscure way of describing their technology (personal opinion).
W|A is a quite different story: I want to see it evolving and opening up to the rest of the world. I will keep monitoring your results.
Tuesday, June 23, 2009
What is happening in Washington? Real-time text and other signals
I strongly believe real-time search should adopt new ranking signals, rather than being just text-based.
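To make the point concrete, here is a toy sketch (my own, not any engine's actual formula) that blends a text-relevance score with a freshness signal:

```python
# Toy real-time ranking: damp the text score with an exponential freshness
# decay so that, all else being equal, newer items rank higher. The blend
# weights and half-life are arbitrary illustration values.
import math
import time

def realtime_score(text_score, posted_at, half_life_s=3600.0):
    """Halve the freshness boost every half_life_s seconds."""
    age = max(0.0, time.time() - posted_at)
    freshness = math.exp(-math.log(2) * age / half_life_s)
    return text_score * (0.5 + 0.5 * freshness)  # keep some weight on text

now = time.time()
print(realtime_score(1.0, now))          # fresh item: ~1.0
print(realtime_score(1.0, now - 7200))   # two hours old: ~0.625
```

Other candidate signals (author reputation, propagation speed, and so on) could be blended in the same way.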
Monday, June 22, 2009
How Much Can Behavioral Targeting Help Online Advertising?
How Much Can Behavioral Targeting Help Online Advertising? is a Microsoft paper about Internet monetization. The paper shows how behavioral targeting can help improve monetization and click-through rate (CTR).
In other words, users with similar search needs tend to click similar ads. Segmenting users by means of clustering is therefore a good strategy for increasing the revenue of a search engine. Users are profiled by means of their search queries and clicked URLs. Standard clustering algorithms (such as k-means, min-wise hashing, and CLUTO) are then used to segment the users.
The reported results are pretty impressive.
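A hedged sketch of the segmentation step with scikit-learn; the exact features and clustering setup below are my assumptions, not the paper's:

```python
# Segment users by clustering their query/click profiles: each user becomes
# a bag-of-words "document" built from their queries and clicked URLs, and
# k-means groups users with similar needs (toy data, toy scale).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

user_profiles = [
    "cheap flights rome hotel rome booking.com",   # travel intent
    "flight deals paris hotel paris expedia.com",  # travel intent
    "python tutorial numpy docs scipy.org",        # programming intent
    "java generics tutorial oracle.com docs",      # programming intent
]
X = TfidfVectorizer().fit_transform(user_profiles)
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(segments)  # e.g. [1 1 0 0]: travel users vs. programming users
```

Ads can then be targeted per segment, and segment-level CTR compared against untargeted CTR.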
Sunday, June 21, 2009
Speeding up Algorithms on Compressed Web Graphs
Speeding up Algorithms on Compressed Web Graphs is a Microsoft paper @ WSDM09 that leverages a smart transformation for compressing the Web graph. The idea is pretty simple: a heuristic based on frequent itemset mining is used to transform cliques into star connections.
As a consequence, the number of edges can be dramatically reduced, at the cost of introducing some new virtual nodes.
The paper then introduces an optimized matrix-vector multiplication that reorders the adjacency matrix of the transformed Web graph. This multiplication is used to speed up PageRank, HITS, and SALSA on the compressed graph.
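A small sketch of just the structural rewrite, under my own simplifications (the paper uses frequent itemset mining to find the dense subgraphs, and adjusts the matrix-vector product accordingly):

```python
# Clique-to-star rewrite: a clique on k nodes has k*(k-1) directed edges,
# but routing them through one virtual node needs only 2*k edges, while
# preserving reachability (every intra-clique path just gains one hop).
def clique_to_star(edges, clique, virtual):
    """Replace intra-clique edges with edges through a new virtual node."""
    clique = set(clique)
    kept = {(u, v) for u, v in edges if not (u in clique and v in clique)}
    star = {(u, virtual) for u in clique} | {(virtual, v) for v in clique}
    return kept | star

nodes = range(10)                                # 10 mutually linked pages
edges = {(u, v) for u in nodes for v in nodes if u != v}
compressed = clique_to_star(edges, nodes, virtual="V")
print(len(edges), "->", len(compressed))         # 90 -> 20 edges
```

PageRank-style iterations on the rewritten graph must account for the virtual nodes (e.g., by not letting them absorb rank), which is where the paper's reordered matrix-vector multiplication comes in.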
Saturday, June 20, 2009
Steve Ballmer on Search
Nice article on Bing, the new Microsoft search engine.
"Sometimes the error you make is what you don't do or what you don't start soon enough," he said. "Most of our mistakes came not because we didn't see the technology change that was coming. Ironically, we didn't see the business change that was coming."
"He blames Microsoft's corporate heft, in part. Microsoft had spent richly on research and development. Its R&D budget comes to 9ドル billion this year alone. And the company had plenty of people working on producing a search engine. But Microsoft had lost its startup brashness. New companies have an edge: They have to succeed big or go bankrupt. That forces them to take risks fast, before it's too late."
"We've got our mojo going now. We're rolling. We're the little engine that could."
"Sometimes the error you make is what you don't do or what you don't start soon enough," he said. "Most of our mistakes came not because we didn't see the technology change that was coming. Ironically, we didn't see the business change that was coming."
"He blames Microsoft's corporate heft, in part. Microsoft had spent richly on research and development. Its R&D budget comes to 9ドル billion this year alone. And the company had plenty of people working on producing a search engine. But Microsoft had lost its startup brashness. New companies have an edge: They have to succeed big or go bankrupt. That forces them to take risks fast, before it's too late."
"We've got our mojo going now. We're rolling. We're the little engine that could."
Thursday, June 18, 2009
Real time search (what about Ranking?)
The number of real-time search engines is increasing, and yesterday two new engines joined the race. CrowdEye is from Ken Moss, who ran search engineering at Microsoft and built the new engine himself. Collecta is from Gerry Campbell, who was a search executive at AOL and Reuters, as well as an adviser to Summize (now Twitter Search). OneRiot is run by Kimbal Musk and my old friend Alessio Signorini. Other engines are Topsy, Tweetmeme, and Scoopler, not to mention Twitter Search itself.
This reminds me of the old times when Excite, Lycos, AltaVista, and a lot of newcomers joined the search race, back in 1997. Search is alive and kicking, with exciting new stuff to play with.
I would expect more and more academic publications on real-time search problems, such as real-time ranking.
The Tradeoffs Between Open and Traditional Relation Extraction
The Tradeoffs Between Open and Traditional Relation Extraction is a paper about extracting relations between entities on a massive scale. The system is based on a self-training (unsupervised) approach built on Conditional Random Fields (CRFs): a set of training examples is extracted from a training corpus by means of hand-crafted parsing rules, and these examples are then used to train a linear-chain CRF.
Results are pretty impressive in terms of both precision and recall.
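A hedged sketch of just the bootstrapping step; the patterns, the entity regex, and the example text are my own, and the CRF training itself is not reproduced here:

```python
# Hand-crafted seed rules pull (arg1, relation, arg2) triples out of raw
# sentences; in the paper, such self-labeled examples then train a
# linear-chain CRF. The "entity" regex is a naive capitalized-phrase match.
import re

NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"
SEED_PATTERNS = [
    re.compile(rf"(?P<a1>{NAME}), (?P<rel>founder|CEO) of (?P<a2>{NAME})"),
    re.compile(rf"(?P<a1>{NAME}) (?P<rel>was born) in (?P<a2>{NAME})"),
]

def seed_triples(sentence):
    """Yield (arg1, relation, arg2) triples matched by the seed rules."""
    for pattern in SEED_PATTERNS:
        for m in pattern.finditer(sentence):
            yield m.group("a1"), m.group("rel"), m.group("a2")

text = "Galileo was born in Pisa. Stephen Wolfram, founder of Wolfram Research, spoke."
for sentence in text.split(". "):
    print(list(seed_triples(sentence)))
# [('Galileo', 'was born', 'Pisa')]
# [('Stephen Wolfram', 'founder', 'Wolfram Research')]
```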