Sunday, December 11, 2016
I set up a new data analysis blog
Sunday, November 11, 2012
Analysis of the statistics blogosphere
Included is most of the Python code I used to obtain blog content, some of my attempts to automate the building of the network (I ended up using a manual process), and my analysis. I also included the data. (You can probably see some of your own content.)
Here's what I learned, or was reminded of, the most:
- Doing projects like this is hard when you have other responsibilities, and you usually end up paring down your ambitions toward the end
- Data collection and curation were, as usual, the most difficult part
- Network analysis is fun, but I have a ways to go to know where to start first, what questions to ask, and so forth (these are the things you learn with experience)
- The measures that seem to be the most revealing are not always obvious -- in this network, it was the number of shortest paths compared to a random graph
- Andrew Gelman's blog is central (but you probably don't need a formal analysis to tell you that)
- There's a lot of great content about statistics, data analysis, data science, and statistical computing out there. I've relied on blog posts for a lot of my work, and I've found even more great stuff. It's a firehose of information.
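For readers who want to try the shortest-path comparison on their own data, here is a minimal sketch. It is not the code from the project: the toy blogroll network, the function names, and the choice of a random directed graph with the same number of nodes and links as the baseline are all assumptions made for illustration. It counts how many ordered pairs of blogs are connected by some path, and the average length of the shortest such path, for both graphs.

```python
import random
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths (in links) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def path_summary(adj):
    """Return (number of ordered reachable pairs, mean shortest-path length)."""
    count = 0
    total = 0
    for source in adj:
        for target, d in bfs_distances(adj, source).items():
            if target != source:
                count += 1
                total += d
    mean = total / count if count else 0.0
    return count, mean

def random_digraph(nodes, n_edges, seed=0):
    """Random directed graph on the same nodes with the same number of links."""
    rng = random.Random(seed)
    possible = [(u, v) for u in nodes for v in nodes if u != v]
    adj = {u: set() for u in nodes}
    for u, v in rng.sample(possible, n_edges):
        adj[u].add(v)
    return adj

# Hypothetical blogroll links: each key links to the blogs in its set.
blogrolls = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
    "d": {"hub"},
}

n_links = sum(len(targets) for targets in blogrolls.values())
baseline = random_digraph(list(blogrolls), n_links, seed=1)

print("observed:", path_summary(blogrolls))
print("random:  ", path_summary(baseline))
```

A single random graph is a noisy baseline; averaging the summary over many seeds would give a fairer comparison.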
Monday, November 5, 2012
Snapshot of the statistics blogosphere
This was generated during my social network analysis project. I haven’t finished yet, but I did want to show the cute picture. The statistics blogosphere is like a school of jellyfish.
Monday, October 29, 2012
The most valuable thing about my little stat blog network project
So, I decided to construct the linking graph through blogrolls, and finally settled on using a manual process. The best part of this project is really finding out for myself all the great content out there!
Monday, October 22, 2012
SNA class proposal
I’ve been taking several classes through Coursera (nothing against the other platforms; I took two of the original three classes via Stanford and just stuck with the platform). The latest one is Social Network Analysis, which has a programming project. Here is what I have posted as a proposal:
Ok, I've been thinking about the programming project idea some, and at first I was thinking of analyzing the statistics blogging community, mostly because I belong to it and I wanted to see what comes out. The analysis below can be done for any sort of community. I've developed this idea a little further and wanted to record it here for two reasons. First, I simply need to write it down to get it out of my head and in such a way that the public can understand it. Second, I'd like feedback.
As it turns out, I took the NLP class in the spring and think there's some overlap that can be exploited. (This comes up nicely in the Mining the Social Web and Programming Collective Intelligence books.) There are measures of content similarity, such as cosine similarity, which are simple to compute and work reasonably well for seeing how similar content is. Content can then be clustered based on similarity. So, then, I have the following questions:
- What are the communities, and do they relate to clusters of content similarity?
- If so, who are the "brokers" between different communities, and what do they blog about? There are a couple of aggregators, such as StatBlogs and R-Bloggers, that I imagine would glue together several communities (that's their purpose and value), but I imagine there are a few others that are aggregator-like + commentary as well. Original content generators, like mine, will probably be on the edges.
- Is it better to threshold edges based on a number of mentions, or use an edge weight based on the number of mentions?
- If I have time, I may try to do some sort of topic or named entity extraction, and get an automated way of seeing what these different communities are talking about.
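The cosine similarity mentioned in the proposal can be sketched in a few lines. This is only an illustration under simple assumptions: posts are reduced to raw word counts (no stemming, stop-word removal, or TF-IDF weighting), and the blog names and snippets below are made up.

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Lowercase the text and count word tokens (a crude bag of words)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical post snippets, one per blog.
posts = {
    "blog_a": "Bayesian models and multilevel regression in R",
    "blog_b": "Multilevel regression with Bayesian priors",
    "blog_c": "Clinical trial design and sample size",
}
vectors = {name: term_counts(text) for name, text in posts.items()}

for left, right in [("blog_a", "blog_b"), ("blog_a", "blog_c")]:
    score = cosine_similarity(vectors[left], vectors[right])
    print(f"{left} vs {right}: {score:.3f}")
```

The resulting pairwise scores could feed a clustering step, or be compared against the communities found in the link graph.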
Monday, August 6, 2012
Getting connected: why you should get connected to people, and how
Getting connected to professionals in your field can be difficult, but it’s worth the effort. Here’s why, in no particular order:
- you exchange ideas, find new ways to approach problems, share career experiences, and learn to navigate the multitude of aspects of your profession
- you form connections that can potentially help if you need to change jobs
- it’s fun to be social (even if you are an introvert like me)
- you can potentially add a lot of value to your company, leading to career advancement opportunities
- you can justify going to cool conferences, if you enjoy those
- you have a better chance of new opportunities to publish, get invited talks, or collaborate
The above is fairly general, but for the how I will focus on statisticians because that is where I can offer the most:
- Join a professional organization. For statisticians in the US, join the American Statistical Association (ASA). Other countries have similar organizations; for instance, the UK has the Royal Statistical Society, and Canada, India, and China have similar societies. In addition, there are more specialized groups such as the Institute of Mathematical Statistics, the Society for Industrial and Applied Mathematics, the Eastern North American Region and Western North American Region of the International Biometric Society, and so forth. The ASA is very broad, and these other groups are more specialized. Chances are, there is a specialized group in your area.
- If you join the ASA, join a section, and find out if there is an active local chapter as well. The ASA is so huge that it is overwhelming to new members, but sections are smaller and more focused, and local chapters offer the opportunity to connect personally without a lot of travel or distance communication.
- You might start a group in your home town, such as an R User’s Group. Revolution Analytics will often sponsor a fledgling R User’s Group. Of course, this startup doesn’t have to be focused on R.
- If you have been a member for a couple of years, offer to volunteer. Chances are, the work is not glorious, but it will be important. The most important part, anyway, is that you will gain skills coordinating others and meet new people.
- If you go to a conference, offer to chair and try to speak. It is very easy to speak at the Joint Statistical Meetings.
- Use social media to get online connections, then try to meet these people in real life. I have formed several connections because I blog and tweet (@randomjohn). You can also use Google+, though I haven’t quite figured out how to do so effectively. I also don’t use Facebook that much for my professional outlet, but it is possible. Blogging offers a lot of other benefits as well, if you do it correctly. Blogging communities, such as R-Bloggers and SAS Community, enhance the value of blogging.
Getting connected is valuable, and it takes a lot of work. Think of it as a career-long effort, and your network as a garden. It takes time to set up, maintain, and cultivate, but the effort is worth it.
Friday, January 21, 2011
List of statistics blogs
Friday, November 21, 2008
PMean - STaTS reborn?
Sunday, August 19, 2007
What does the First Ever Pharma Blogosphere Survey tell us
Web polls in blog entries - I don't trust them
- Most web polls do not control whether one person can vote multiple times. Most services will now use cookies or IP addresses to block multiple votes from one person, but these controls are imperfect at best. Changing an IP address is easy (just go to a different Starbucks), and cookies are easily deleted.
- Wording questions in surveys is a tricky proposition, and millions of billable hours are spent agonizing over the wording. (Perhaps 75% of that is going a bit too far, but you get the point.) Very little time is generally spent wording the question of a web poll. The end result is that readers may not be answering the same question a blogger asks.
- Forget random sampling, matching cases, identifying demographic information, or any of the classical statistical controls that are intended to isolate noise and false signal from true signal. Web poll samples are "People who happen to find the blog entry and care enough to click on a web poll." At best, the readers who feel strongly about an issue are the ones likely to click, while people who feel less strongly (but might lean a certain way) will probably just glaze over.
- Answers to web polls will typically be immediate reactions to the blog post, rather than thoughtful, considered answers. Internet life is fast-paced, and readers (in general) simply don't have the time to thoughtfully answer a web poll.