Realizations in Biostatistics: classes


Thursday, December 22, 2011

A statistician’s view of Stanford’s open Introduction to Databases class

In addition to the Introduction to Machine Learning class (which I have reviewed), I took a class on introduction to databases, taught by Prof. Jennifer Widom. This class consisted of video lectures, review questions, and exercises. Topics covered included XML (markup, DTDs, schema, XPath, XQuery, and XSLT) and relational databases (relational algebra, database normalization, SQL, constraints, triggers, views, authentication, online analytical processing, and recursion). At the end we had a quick lesson on NoSQL systems just to introduce the topic and discuss where they are appropriate.

This class was different in structure from the Machine Learning class in two ways: there were two exams and the potential for a statement of accomplishment.

I think any practicing statistician should learn at least the material on relational databases, because data storage and retrieval is such an important part of statistics today. Many statistical packages now connect to relational databases through technologies like ODBC, and knowledge of SQL can enhance the use of these systems. For example, for subset analyses it is usually better to do the subsetting in the database than to pull all the data into the statistical analysis package and then perform the subset. In biostatistics, the data are usually collected using an electronic data capture or paper-based system, which stores the data in an Oracle database.
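As a minimal sketch of the subsetting point above: the example below uses Python's built-in sqlite3 module with an in-memory database. The `visits` table, its columns, and its values are all invented for illustration (they are not from the class or from any real study), and a production system would more likely sit behind ODBC than SQLite.

```python
import sqlite3

# Hypothetical in-memory table standing in for a clinical database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (subject_id INTEGER, arm TEXT, age INTEGER, sbp REAL)"
)
rows = [
    (1, "placebo", 34, 121.0),
    (2, "active", 67, 135.5),
    (3, "active", 71, 142.0),
    (4, "placebo", 69, 138.0),
    (5, "active", 45, 118.5),
]
conn.executemany("INSERT INTO visits VALUES (?, ?, ?, ?)", rows)


def fetch_subset(conn, arm, min_age):
    """Let the database do the subsetting; only matching rows come back."""
    query = (
        "SELECT subject_id, age, sbp FROM visits "
        "WHERE arm = ? AND age >= ? ORDER BY subject_id"
    )
    return conn.execute(query, (arm, min_age)).fetchall()


# Only the older subjects on the active arm are transferred for analysis.
subset = fetch_subset(conn, "active", 65)
mean_sbp = sum(row[2] for row in subset) / len(subset)
```

The WHERE clause runs inside the database, so the analysis code never sees the rows it does not need. With a table of millions of rows, that can be the difference between a quick query and an exhausted workstation.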

I found that I already use the material in this course, even though I typically don’t write SQL queries more complicated than a simple subset. Some of the examples for the exercises involved representing a social network, which may help when I do my own analyses of networks. Other examples were of relationships that you might find in biostatistics and other fields.
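To give a flavor of what representing a social network in a relational database can look like, here is a rough sketch using Python's sqlite3 module. The `person` and `friend` tables, the names, and the `friends_of_friends` helper are hypothetical and are not the schema used in the course exercises.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE friend (id1 INTEGER, id2 INTEGER);
    """
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob"), (3, "Cal"), (4, "Dee")],
)
# Friendship is undirected, so each edge is stored in both directions.
edges = [(1, 2), (2, 3), (3, 4)]
conn.executemany(
    "INSERT INTO friend VALUES (?, ?)", edges + [(b, a) for a, b in edges]
)


def friends_of_friends(conn, person_id):
    """Names two steps away: friends of friends who are not already friends."""
    query = """
        SELECT DISTINCT p.name
        FROM friend f1
        JOIN friend f2 ON f1.id2 = f2.id1
        JOIN person p ON p.id = f2.id2
        WHERE f1.id1 = ?
          AND f2.id2 <> f1.id1
          AND f2.id2 NOT IN (SELECT id2 FROM friend WHERE id1 = ?)
        ORDER BY p.name
    """
    return [row[0] for row in conn.execute(query, (person_id, person_id))]
```

The self-join on the `friend` table is the relational equivalent of walking two edges in the graph. Longer paths are where recursive queries come in.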

I found that by spending up to 5 hours a week on the class I was able to get a lot out of it. Unfortunately, Stanford is not offering this in the winter quarter, but they have promised to offer this again in the future. I heartily recommend it for anyone practicing statistics, and the price is right.

Posted by Unknown at 11:43 AM

Labels: classes, databases

Monday, December 19, 2011

A statistician’s view on Stanford’s public machine learning course

This past fall, I took Stanford’s class on machine learning. Overall, it was a terrific experience, and I’d like to share a few thoughts on it:

A lot of participants were concerned that it was a watered down version of Stanford’s CS229. And, in fact, the course was more limited in scope and more applied than the official Stanford class. However, I found this to be a strength. Because I was already familiar with most of the methods in the beginning (linear and multiple regression, logistic regression), I could focus more on the machine learning perspective that the class brought to these methods. This helped in later sections where I wasn’t so familiar with the methods.
The embedded review questions and the end of section review questions were very well done, with a randomization scheme that made it impossible to simply guess until everything was right.
Programming exercises were done in Octave, an open source Matlab-like programming environment. I really enjoyed doing this programming, because it meant I essentially programmed regression and logistic regression algorithms by hand with the exception of a numerical optimization algorithm. I got a huge confidence boost when I managed to get the backpropagation algorithm for neural networks correct. The emphasis in these exercises was on the loops, which you could code the “slow” way (with explicit for loops, for instance), but which you really needed to vectorize using the principles of linear algebra. For instance, there was an algorithm for a recommender system that would take hours if coded using for loops, but ran in minutes using a vectorized implementation. (This is because the implicit loops of vectorization were run using optimized linear algebra routines.) In statistics, we don’t always worry about implementation details so much, but in machine learning situations, implementation is important because these algorithms often need to run in real time.
The class encouraged me to look at the Kaggle competitions. I’m not doing terribly well in them, but now at least I’m hacking on some data myself and learning a lot in the process.
The structure of the public class helps a lot over, for example, the iTunes U version of the class. But now I’m looking at the CS229 lectures on iTunes U and am understanding them a lot better.
Kudos to Stanford for taking the lead on this effort. This is the next logical progression of distance education, and takes a lot of effort and time.
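The loop-versus-vectorization point in the programming exercises can be sketched as follows. The class used Octave, so this Python/NumPy version is a substitute and not the course code. It computes the gradient of the squared-error cost for linear regression in two ways, and the function names and simulated data are made up for illustration.

```python
import numpy as np


def gradient_loop(X, y, theta):
    """Gradient of the mean squared-error cost, written with explicit loops."""
    m, n = X.shape
    grad = np.zeros(n)
    for i in range(m):
        pred = 0.0
        for j in range(n):
            pred += X[i, j] * theta[j]
        err = pred - y[i]
        for j in range(n):
            grad[j] += err * X[i, j]
    return grad / m


def gradient_vectorized(X, y, theta):
    """Same gradient as one expression: X'(X theta - y) / m."""
    m = X.shape[0]
    return X.T @ (X @ theta - y) / m


# Simulated data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 2.0, 3.0])
theta = np.array([0.5, -1.0, 2.0])

same = np.allclose(gradient_loop(X, y, theta), gradient_vectorized(X, y, theta))
```

Both functions return the same numbers. The vectorized one hands the loops to compiled linear algebra routines, which is where the speedup on larger problems comes from.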

I also took the databases class, which was even more structured with a mid-term and final exam. This was a bit of a stretch for me, but learning about data storage and retrieval is a good complement to statistics and machine learning. I’ve coded a few complex SQL queries in my life, but this class really took my understanding of both XML-based and relational database systems to the next level.

Stanford is offering machine learning again, along with a gaggle of other classes. I recommend you check them out. (Find a list, for example, at the bottom of the Probabilistic Graphical Models site.) (Note: Stanford does not offer official credit for these classes.)

Posted by Unknown at 8:45 AM

Labels: classes, learning, matrix, professionalism

Friday, April 2, 2010

Adventures in graduate school

I was recently reflecting that, basic classes aside, I mainly use information from three graduate classes. Two of them were special topics classes, and one was a class that had finally evolved from a special topics class.

In one of the special topics classes, we were given a choice of two topics: a field survey of Gaussian processes, which would have been useful but was not so interesting to the professor, and local time (i.e. the amount of time a continuous process spends in the neighborhood of a point), which was much more specialized (and for which I did not meet the prerequisites) and much more interesting to the professor. I chose local time because I figured if the professor was excited about it, I would be excited enough to learn what I needed to understand the class. As a result, I have a much deeper understanding of time series and stochastic processes in general.

The second special topics class seemed to have a very specialized focus, pattern recognition. It covered the abstract Vapnik-Chervonenkis theory in detail, and we discussed rates of convergence, exponential inequalities on probabilities, and other hard-core theory. I could have easily forgotten that class, but the professor was excited about it, and because of it I am having a much easier time understanding data mining methods than I would have otherwise.

The third class, though it was not labeled a special topics class, was a statistical computing class where the professor shared his new research in addition to the basics. There I learned a lot about scatterplot smoothing, Fourier analysis, and local polynomial and other nonparametric regression methods that I still use very often.

In each of these cases, I decided to forgo a basic or survey class for a special topics class. Because of the professor's enthusiasm toward the subject in each case, I was willing to go the extra mile and learn whatever prerequisite information I needed to understand the class. In each case as well, that willingness to go the extra mile and fill in the gaps has carried over more than a decade later: I have kept up my interest and am always looking to apply these methods to new cases, when appropriate.

I am currently taking the bootstrapping course at statistics.com and am happy to say that I am experiencing the same thing. (In fact, I was introduced to the bootstrap in the computing class mentioned above, but we never got beyond the basics due to time.) We are getting the basics and current research, and I'm already able to apply it to problems I have now.
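For readers who have not met the bootstrap, here is a minimal sketch of the basic idea: a percentile confidence interval for the mean, using only the Python standard library. The `bootstrap_ci` function, its defaults, and the sample values are assumptions made for illustration and are not taken from the course.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample with replacement, recompute
    the statistic each time, and read off the empirical quantiles."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    lower = estimates[round((alpha / 2) * n_boot)]
    upper = estimates[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Made-up measurements, for illustration only.
sample = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.4, 6.2, 5.0]
lower, upper = bootstrap_ci(sample)
```

The percentile interval is only the simplest variant. Refinements such as bias-corrected intervals exist, but the resample-and-recompute loop stays the same.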

Posted by Unknown at 12:39 PM

Labels: Bayesian statistics, classes, graduate school, statistical programming