How much text versus metadata is in a tweet?
Posted on June 14, 2011
This should have been a blog post, but I got lazy and wrote a plaintext document instead.
For Twitter, context matters: roughly 90% of a tweet is metadata and 10% is text, measured by (an approximation of) information content; by raw data size, the split is closer to 95/5.
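Here is a minimal sketch of the idea, assuming a tweet arrives as JSON with a text field; the payload and field split below are made up for illustration. The approximation is just compressed size: compress the text and the rest of the record separately and compare. A real measurement would compress a large batch of tweets together rather than a single one, since compressor overhead dominates on short strings.

    # Sketch only: hypothetical tweet payload, zlib size as a rough proxy
    # for information content.
    import json
    import zlib

    def compressed_size(s: str) -> int:
        """Approximate information content (in bytes) via zlib compression."""
        return len(zlib.compress(s.encode("utf-8")))

    tweet = {
        "id": 123456789,
        "created_at": "Tue Jun 14 12:00:00 +0000 2011",
        "text": "Measuring how much of a tweet is metadata versus text.",
        "user": {"id": 42, "screen_name": "example", "followers_count": 100},
        "entities": {"hashtags": [], "urls": [], "user_mentions": []},
    }

    text_part = tweet["text"]
    metadata = {k: v for k, v in tweet.items() if k != "text"}

    text_bytes = compressed_size(text_part)
    meta_bytes = compressed_size(json.dumps(metadata))
    total = text_bytes + meta_bytes

    print(f"text:     {text_bytes / total:.0%}")
    print(f"metadata: {meta_bytes / total:.0%}")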
2 Responses to How much text versus metadata is in a tweet?
Nice! Compressibility’s totally the way to measure this. The only question is what scale you do the compression on.
Implementations like zip wind up providing only upper bounds on the amount of information. PPM with longer contexts does better, but these compressors still don't get close to simple smoothed language models because of their online constraints. There are some really cool nonparametric hierarchical Bayesian language models that are even better, developed by Frank Wood, Nick Bartlett, David Pfau, Yee Whye Teh, and a few others:
http://www.stat.columbia.edu/~fwood/Papers/Bartlett-DCC-2011.pdf
Thanks for the comment, Bob; I really appreciate it. Somehow I missed the literature that compares PPM to what are now standard smoothed LMs. That paper looks great.
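For anyone curious about the comparison Bob describes, here is a rough sketch (not from the post or the paper) of how one would compute bits per character under a general-purpose compressor versus a smoothed language model. The toy string and the add-one-smoothed character bigram are stand-ins, and the model is scored on its own training text, so the numbers only illustrate the mechanics.

    # Sketch: bits per character from zlib versus a crude smoothed LM.
    import math
    import zlib
    from collections import Counter

    text = ("compressibility is a practical way to upper bound the information "
            "content of a piece of text, since any lossless code that reproduces "
            "the text exactly cannot beat its entropy on average")

    # Compressor: bits per character from zlib (includes header overhead).
    zlib_bpc = 8 * len(zlib.compress(text.encode("utf-8"))) / len(text)

    # Language model: add-one smoothed character bigram, trained and scored on
    # the same string (optimistic; a held-out split would be fairer).
    vocab_size = len(set(text))
    bigram_counts = Counter(zip(text, text[1:]))
    context_counts = Counter(text[:-1])

    def bits(prev: str, ch: str) -> float:
        p = (bigram_counts[(prev, ch)] + 1) / (context_counts[prev] + vocab_size)
        return -math.log2(p)

    lm_bpc = sum(bits(p, c) for p, c in zip(text, text[1:])) / (len(text) - 1)

    print(f"zlib:         {zlib_bpc:.2f} bits/char")
    print(f"bigram model: {lm_bpc:.2f} bits/char")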