Realizations in Biostatistics: blogging


Sunday, December 11, 2016

I set up a new data analysis blog

Well, I tried to write a blog post using the RStudio R Markdown system, and utterly failed. So I set up a GitHub Pages blog at randomjohn.github.io, where I can write from RStudio and easily publish posts involving data analysis.

Posted by Unknown at 9:18 PM

Labels: blogging, open data, R

Sunday, November 11, 2012

Analysis of the statistics blogosphere

My analysis of the statistics blogosphere for the Coursera Social Networking Analysis class is up. The Python code and the data are up at my github repository. Enjoy!

Included is most of the Python code I used to obtain blog content, some of my attempts to automate the building of the network (I ended up using a manual process), and my analysis. I also included the data. (You can probably see some of your own content.)

Here's what I learned/got reminded of the most:

Doing projects like this is hard when you have other responsibilities, and you usually end up paring down your ambitions toward the end
Data collection and curation was, as usual, the most difficult process
Network analysis is fun, but I have a ways to go to know where to start first, what questions to ask, and so forth (these are the things you learn with experience)
The measures that seem to be the most revealing are not always obvious -- in this network, it was the number of shortest paths compared to a random graph
Andrew Gelman's blog is central (but you probably don't need a formal analysis to tell you that)
There's a lot of great content about statistics, data analysis, data science, and statistical computing out there. I've relied on blog posts for a lot of my work, and I've found even more great stuff. It's a firehose of information.
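
To make the shortest-path comparison concrete, here is a minimal Python sketch. It is not the code from the project, and it assumes one particular reading of the measure: count the ordered pairs of blogs joined by a directed path (and their mean path length), then compare against random graphs with the same number of nodes and edges. The toy `blogroll` network and all function names are hypothetical.

```python
import random
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def path_summary(adj):
    """Return (number of ordered reachable pairs, mean shortest-path length)."""
    pairs = 0
    total = 0
    for source in adj:
        dist = bfs_distances(adj, source)
        pairs += len(dist) - 1          # exclude the source itself
        total += sum(dist.values())
    mean = total / pairs if pairs else float("nan")
    return pairs, mean


def random_directed_graph(nodes, n_edges, rng):
    """Random directed graph on the same nodes with n_edges edges, no self-loops."""
    nodes = list(nodes)
    possible = [(u, v) for u in nodes for v in nodes if u != v]
    adj = {u: set() for u in nodes}
    for u, v in rng.sample(possible, n_edges):
        adj[u].add(v)
    return adj


# Hypothetical toy blogroll network: blog -> blogs it links to.
blogroll = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"A"},
    "D": {"C"},
    "E": set(),
}

observed_pairs, observed_mean = path_summary(blogroll)
n_edges = sum(len(links) for links in blogroll.values())

# Baseline: the same summary averaged over many random graphs.
rng = random.Random(0)
baseline = [
    path_summary(random_directed_graph(blogroll, n_edges, rng))[0]
    for _ in range(200)
]
print("observed reachable pairs:", observed_pairs)
print("random-graph average:", sum(baseline) / len(baseline))
```

A large gap between the observed count and the random-graph average suggests the linking structure is doing something that chance alone would not. A real analysis would more likely use a graph library and a baseline that preserves the degree distribution.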

Posted by Unknown at 12:16 AM

Labels: blogging, Other blogs, Python, social network analysis

Monday, November 5, 2012

Snapshot of the statistics blogosphere

[Image: stats_blogs]

This was generated during my social network analysis project. I haven’t finished yet, but I did want to show the cute picture. The statistics blogosphere is like a school of jellyfish.

Posted by Unknown at 8:45 AM

Labels: blogging, social network analysis, statistics

Monday, October 29, 2012

The most valuable thing about my little stat blog network project

So, I decided to construct the linking graph through blogrolls, and finally settled on using a manual process. The best part of this project is really finding out for myself all the great content out there!

Posted by Unknown at 9:45 PM

Labels: blogging, social network analysis, statistics

Monday, October 22, 2012

SNA class proposal

I’ve been taking several classes through Coursera (nothing against the other platforms; I took two of the original three classes via Stanford and just stuck with the platform). The latest one is Social Network Analysis, which has a programming project. Here is what I have posted as a proposal:

Ok, I've been thinking about the programming project idea some, and at first I was thinking of analyzing the statistics blogging community, mostly because I belong to it and I wanted to see what comes out. The analysis below can be done for any sort of community. I've developed this idea a little further and wanted to record it here for two reasons. First, I simply need to write it down to get it out of my head and in such a way that the public can understand it. Second, I'd like feedback.
As it turns out, I took the NLP class in the spring and think there's some overlap that can be exploited. (This comes up nicely in the Mining the Social Web and Programming Collective Intelligence books.) There are measures of content similarity, such as cosine similarity, which are simple to compute and work reasonably well to see how similar content is. Content can then be clustered based on similarity. So, then, I have the following questions:

What are the communities, and do they relate to clusters of content similarity?
If so, who are the "brokers" between different communities, and what do they blog about? There are a couple of aggregators, such as StatBlogs and R-Bloggers, that I imagine would glue together several communities (that's their purpose and value), but I imagine there are a few others that are aggregator-like + commentary as well. Original content generators, like mine, will probably be on the edges.
Is it better to threshold edges based on a number of mentions, or use an edge weight based on the number of mentions?
If I have time, I may try to do some sort of topic or named entity extraction, and get an automated way of seeing what these different communities are talking about.
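
For readers unfamiliar with the cosine similarity mentioned in the proposal, here is a minimal sketch in Python using plain word counts. It is an illustration under simple assumptions (lowercased words, no stemming, no TF-IDF weighting), not the method used in the project; the function names and sample posts are made up.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Bag-of-words counts for a piece of text (lowercased, letters only)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical post snippets, for illustration only.
post_1 = term_counts("Bayesian models for clinical trial data")
post_2 = term_counts("Fitting Bayesian models to trial data in R")
post_3 = term_counts("Notes from a cycling holiday")

print(cosine_similarity(post_1, post_2))  # shares several words with post_1
print(cosine_similarity(post_1, post_3))  # shares no words with post_1
```

Pairwise similarities like these can feed a clustering step. In practice, down-weighting very common words (for example with TF-IDF) tends to matter more than the similarity formula itself.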

Posted by Unknown at 8:45 AM

Labels: blogging, coursera, social network analysis, statistics

Monday, August 6, 2012

Getting connected: why you should get connected to people, and how

Getting connected to professionals in your field can be difficult, but it’s worth the effort. Here’s why, in no particular order:

you exchange ideas, find new ways to approach problems, share career experiences, and learn to navigate the multitude of aspects of your profession
you form connections that can potentially help if you need to change jobs
it’s fun to be social (even if you are an introvert like me)
you can potentially add a lot of value to your company, leading to career advancement opportunities
you can justify going to cool conferences, if you enjoy those
you have a better chance of new opportunities to publish, get invited talks, or collaborate

The above is fairly general, but for the how I will focus on statisticians because that is where I can offer the most:

Join a professional organization. For statisticians in the US, join the American Statistical Association (ASA). Other countries have similar organizations; for instance, the UK has the Royal Statistical Society, and Canada, India, and China have similar societies. In addition, there are more specialized groups such as the Institute of Mathematical Statistics, Society for Industrial and Applied Mathematics, Eastern North American Region of the International Biometric Society, Western North American Region of the International Biometric Society, and so forth. The ASA is very broad, and these other groups are more specialized. Chances are, there is a specialized group in your area.
If you join the ASA, join a section, and find out if there is an active local chapter as well. The ASA is so huge that it is overwhelming to new members, but sections are smaller and more focused, and local chapters offer the opportunity to connect personally without a lot of travel or distance communication.
You might start a group in your home town, such as an R User’s Group. Revolution Analytics will often sponsor a fledgling R User’s Group. Of course, this startup doesn’t have to be focused on R.
If you have been a member for a couple of years, offer to volunteer. Chances are, the work is not glorious, but it will be important. The most important part, anyway, is that you will gain skills coordinating others and meet new people.
If you go to a conference, offer to chair and try to speak. It is very easy to speak at the Joint Statistical Meetings.
Use social media to get online connections, then try to meet these people in real life. I have formed several connections because I blog and tweet (@randomjohn). You can also use Google+, though I haven’t quite figured out how to do so effectively. I also don’t use Facebook that much for my professional outlet, but it is possible. Blogging offers a lot of other benefits as well, if you do it correctly. Blogging communities, such as R-Bloggers and SAS Community, enhance the value of blogging.

Getting connected is valuable, and it takes a lot of work. Think of it as a career-long effort, and your network as a garden. It takes time to set up, maintain, and cultivate, but the effort is worth it.

Posted by Unknown at 11:45 AM

Labels: blogging, careers, JSM, networking

Friday, January 21, 2011

List of statistics blogs

Here is a list of 40 statistics blogs. I read many of them regularly, but there are a few that are new to me.

Posted by Unknown at 9:18 AM

Labels: blogging, Other blogs

Friday, November 21, 2008

PMean - STaTS reborn?

I have referred to the STaTS pages in the past for statistical references. Sadly, Children's Mercy Hospital has temporarily taken down the pages, but Steve Simon has resurrected some of his content (and hopes to be able to get all of his content back) and more at pmean.

Posted by Unknown at 6:47 PM

Labels: blogging, Other blogs, statistics

Sunday, August 19, 2007

What does the First Ever Pharma Blogosphere Survey tell us

First, let me make a few comments. I find John Mack's Pharma Marketing blog useful. Marketing tends to be a black box for me. From my perspective, for the inputs you have guys who want to sell things, and for the outputs you have commercials and other promotional materials. I (partly by choice and partly by the way my brain works) understand very little about what happens between input and output. All I know about it is to play to my strengths and downplay my weaknesses. This is part of the reason I'm limiting myself to discussing statistical issues, at least on this blog.

However, when he came out with his First Ever Pharma Blogosphere Survey®©™, I was skeptical. In fact, I didn't pay much attention to it. But then he started making claims based on the survey, especially surrounding Peter Rost's new gig at BrandweekNRX. He predicted Brandweek would "flush its credibility down the toilet" by hiring Rost, and cited his survey data to back up his case (he had other arguments as well, but, as noted above, I'm just covering what I know). And since I'm skeptical of his data, I'm skeptical of his analysis, and, therefore, his arguments, conclusions, and predictions based on the data. To his credit, however, he posts the raw data so at least we know he didn't use a graphics program to draw his bar graphs.

Rost's counterarguments are worthy of analysis as well. He notes that most people read the Pharma Marketing blog (the survey was conducted from its sister site Pharma Blogosphere), raising the question of which population Mack was really sampling. The correct answer, of course, is people who happened to read that blog entry around the day it was posted and who cared enough to bother to take a web survey. I would agree that Mack's following probably makes up the bulk of the survey.

But more important is the comparison of the survey to more objective data, such as site counters. (Note that site counter data isn't perfect, either, but it is more objective than web polls since the data collection does not require user interaction.) And it looks like that objective data doesn't match Mack's data.

Then you throw in the data from eDrugSearch, which has its own algorithm for ranking healthcare websites, but its rankings seem very out of line with those from Technorati, which uses some modifications to the number of incoming links (I think to adjust for the fact that some blogs just all link to one another).

So, at any rate, you can be sure that Peter Rost will keep you abreast of his rankings, and for now they certainly do not seem to match Mack's predictions. And, while the eDrugSearch and Technorati rankings seem far from perfect, they do tend to agree on the upward trend in readership of BrandweekNRx and Rost's personal blog, at least for now. Mack's survey, and the predictions based on it, are the only data I've seen so far that have not agreed.

In the meantime, I say the proof is in the pudding. Read these sites, or, better yet, put them in an RSS reader so you can skim for the material you like. Discard the material you don't like. As for me, well, I like to keep abreast of the news in my industry because, well, it could affect my ability to feed my children. So far, Rost's blog breaks news that doesn't get picked up anywhere else (as do Pharmalot and PharmGossip). Mack's blogs did, too, at least until he started getting obsessed with his subjective evaluation of Rost's content.

Posted by Unknown at 3:03 PM

Labels: blogging, Peter Rost

Web polls in blog entries - I don't trust them

I distrust web polls. While there are more trustworthy sources of polling such as SurveyMonkey, these web surveying sites have to be backed up with essentially the same type of operational techniques found in standard paper surveys. The web polls I distrust are the ones that bloggers put in their blog entries to poll their readers on their thoughts about certain issues. Sometimes they will even follow up with an entry saying "this isn't a scientific poll, but here are the results."

A small step up from this are the web surveys, such as John Mack's First Ever Pharma Blogosphere Survey®™©. They have a lot of the same problems as the simple web poll, and few of the controls necessary to ensure valid results. So I'll discuss simple one-off web polls and web surveys together.

Most of the problems and biases with these web polls aren't statistical; rather, they are operational. The data from these is so bad that no amount of statistics can rescue them. It's better not to even bring statistics into the equation here. Following are the operational biases I consider unavoidable and insurmountable:

Many web polls do not control whether one person can vote multiple times. Most services now use cookies or IP addresses to block multiple votes from one person, but these measures are imperfect at best. Changing an IP address is easy (just go to a different Starbucks), and cookies are easily deleted.
Wording questions in surveys is a tricky proposition, and millions of billable hours are spent agonizing over the wording. (Perhaps 75% of that is going a bit too far, but you get the point.) Very little time is generally spent wording the question of a web poll. The end result is that readers may not be answering the same question a blogger asks.
Forget random sampling, matching cases, identifying demographic information, or any of the classical statistical controls that are intended to isolate noise and false signal from true signal. Web poll samples are "People who happen to find the blog entry and care enough to click on a web poll." At best, the readers who feel strongly about an issue are the ones likely to click, while people who feel less strongly (but might lean a certain way) will probably just glaze over.
Answers to web polls will typically be immediate reactions to the blog post, rather than thoughtful, considered answers. Internet life is fast-paced, and readers (in general) simply don't have the time to thoughtfully answer a web poll.
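
The self-selection problem in the list above can be seen in a small simulation. This is a minimal Python sketch with entirely made-up numbers, not an analysis of any real poll: it assumes 40% of readers favor "yes", that "yes" readers are more likely to feel strongly, and that readers who feel strongly are far more likely to click. The function name and all parameter values are hypothetical.

```python
import random


def simulate_poll(n_readers, rng, p_yes=0.4, p_strong_yes=0.5, p_strong_no=0.1,
                  p_click_strong=0.3, p_click_mild=0.02):
    """Simulate a self-selected web poll.

    Returns (true share of "yes" among all readers,
             share of "yes" among readers who actually clicked).
    All probabilities are illustrative assumptions.
    """
    yes_readers = 0
    votes = 0
    yes_votes = 0
    for _ in range(n_readers):
        says_yes = rng.random() < p_yes
        # How likely this reader is to feel strongly depends on their opinion.
        strong = rng.random() < (p_strong_yes if says_yes else p_strong_no)
        # Readers who feel strongly are much more likely to click the poll.
        clicks = rng.random() < (p_click_strong if strong else p_click_mild)
        yes_readers += says_yes
        if clicks:
            votes += 1
            yes_votes += says_yes
    poll_share = yes_votes / votes if votes else float("nan")
    return yes_readers / n_readers, poll_share


true_share, poll_share = simulate_poll(100_000, random.Random(0))
print(f"true 'yes' share among readers: {true_share:.2f}")
print(f"'yes' share among poll clicks:  {poll_share:.2f}")
```

Under these assumed settings the expected poll result is close to 70% "yes" even though only 40% of readers hold that view, and nothing in the poll data alone reveals the gap.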

Web polls and surveys might be useful for gauging whether readers are interested in a particular topic posted by the blogger, and so they do have a use in guiding the future material in a blog. But beyond that, I can't trust them.

Next step: an analysis of the John Mack/Peter Rost kerfuffle.

Posted by Unknown at 12:22 PM

Labels: bias, blogging, statistics, surveys