Wednesday, October 16, 2013
Progressively complex representations for graphs
After NIPS last year, I had posted a note on the evolution of data models and how we think about the "shape" of data. In brief, we have increasingly complex ways of thinking about the (geometric) shape of data that carefully balance the need for expressivity and the ability to determine the parameters of the representation efficiently.
There's a basic fact of data analysis that you come up against once you start playing with data. Either you endow the space with a geometry, or you endow it with a graph structure (and of course, nowadays you might throw in a spherical cow or two). And there are only a few ways to mix the representations (spectral methods being one such approach).
But as far as I know, there are no standard ways to create a layered representation of graphs to model increasing complexity and expressivity in a similarly balanced manner. There are of course an infinite number of ways to parametrize graphs by quantities that capture increasing complexity (treewidth, connectivity, expansion, induced metric structure, and so on). But I don't know of any established and standard ways to model families of graphs with increasing complexity that capture data patterns of interest.
One of the things that I've been tasked with (along with Peter Richtarik and Han Liu) is to draft a report on the activities of the big data program at Simons this semester. We're also looking at challenges in the theory of big data, and in my mind coming up with good models for graph structures that capture the intricacies of real data is one of them.
Labels: big-data, research, simons foundation
Friday, September 06, 2013
More camping with high dimensional boots...
I discovered today that you don't have to wait for talk videos to be posted on the Simons website. All videos are live streamed via ustream, and they have their own channel for the boot camp talks, where you can watch the videos immediately after they're streamed.
Martin Wainwright gave a 3 hour presentation on high dimensional statistics. Michael Jordan's talk earlier was good preparation for this, just to get familiar with basic concepts like the risk of an estimator and minimax theory.
Wainwright's talk was much denser, and it would be hard to do an intelligent summary of the entire presentation. Its central theme was this:
In classical statistics, we can evaluate the quality of an estimator as $n \rightarrow \infty$ using standard asymptotic methods, and for the most part they are well understood (convergence, rates of convergence, sampling distributions, and so on). But all of these results assume that the data dimensionality stays fixed. What if it doesn't?
In particular, suppose you have a situation where $d/n \rightarrow \alpha > 0$ (a notion first introduced by Kolmogorov). For example, suppose $\alpha = 0.5$. What happens to the behavior of your estimation methods? He worked through a few simple examples with experiments to show that in such cases, classical asymptotic bounds fail to capture what's really going on. For example, suppose you wanted to estimate the mean of a distribution and you used the sample mean, relying on the central limit theorem. It turns out that in the $d/n \rightarrow \alpha$ regime, the convergence is slowed down by the parameter $\alpha$.
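To make this concrete, here's a minimal simulation (my own sketch, not from the talk; the value $\alpha = 0.5$ and the sample sizes are arbitrary) of what happens to the humble sample mean when $d$ grows in proportion to $n$: the squared error settles near $\alpha$ instead of vanishing.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.5

for n in [100, 400, 1600, 6400]:
    d = int(alpha * n)                    # dimension grows in proportion to the sample size
    X = rng.normal(size=(n, d))           # n samples from N(0, I_d); the true mean is 0
    err = np.sum(X.mean(axis=0) ** 2)     # squared l2 error of the sample mean
    print(f"n = {n:5d}, d = {d:5d}: ||sample mean||^2 = {err:.3f}   (compare alpha = {alpha})")
```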
Another example of this problem is estimating a covariance matrix. Assume you have a sample of iid Gaussian random variables drawn from $N(0, I_{d \times d}),ドル and you want to use the sample covariance to estimate the population covariance (in this case, the identity matrix). You can look at the distribution of eigenvalues of the resulting matrix, which you might expect to be sharply concentrated around 1; in fact you get a much more spread out distribution (the Marchenko-Pastur distribution). You can show that the maximum singular value of the matrix is no bigger than 1ドル + \sqrt{d/n}$ with high probability, but this error term $\sqrt{d/n}$ does not go away.
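Here's a small sketch of that spreading (my own, with arbitrary sizes): the eigenvalues of the sample covariance of $N(0, I_d)$ data fill out an interval whose upper edge is roughly $(1 + \sqrt{d/n})^2,ドル the square of the singular-value bound above, rather than concentrating at 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 500                             # d/n = 0.5
X = rng.normal(size=(n, d))                  # n samples from N(0, I_d)
S = X.T @ X / n                              # sample covariance; the population covariance is I_d
eigs = np.linalg.eigvalsh(S)

upper = (1 + np.sqrt(d / n)) ** 2            # Marchenko-Pastur upper edge
print(f"sample covariance eigenvalues lie in [{eigs.min():.2f}, {eigs.max():.2f}]; "
      f"predicted upper edge (1 + sqrt(d/n))^2 = {upper:.2f}")
```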
If the data is indeed high dimensional, is there low-dimensional or sparse structure one can exploit to do inference more efficiently? This gets us into the realm of sparse approximations and compressed sensing. He spent some time explaining why sparse recovery via the LASSO is actually possible, and described a condition called the "restricted null space property" that characterizes when exact recovery can be done (this property is implied by the RIP, but is strictly weaker).
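As a toy illustration of sparse recovery (not from the talk; the sizes, noise level, regularization strength, and the use of scikit-learn's Lasso are all my own choices), the LASSO typically recovers a $k$-sparse vector from far fewer random measurements than the ambient dimension:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, k = 100, 400, 5                        # far fewer observations than dimensions
beta = np.zeros(d)
support = rng.choice(d, size=k, replace=False)
beta[support] = 5.0 * rng.normal(size=k)     # a k-sparse signal
A = rng.normal(size=(n, d)) / np.sqrt(n)     # random Gaussian design
y = A @ beta + 0.01 * rng.normal(size=n)     # nearly noiseless observations

beta_hat = Lasso(alpha=1e-3, fit_intercept=False, max_iter=100000).fit(A, y).coef_
print("true support:     ", np.sort(support))
print("recovered support:", np.flatnonzero(np.abs(beta_hat) > 0.1))
print(f"relative error: {np.linalg.norm(beta_hat - beta) / np.linalg.norm(beta):.3f}")
```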
In the second part of the talk he talked more generally about so-called regularized M-estimators, and how one might prove minimax bounds for parameter estimation. Again, while the specifics are quite technical, he brought up one point in passing that I think is worth highlighting.
When doing parameter estimation, the "curvature" of the space plays an important role, and has done so since the Cramer-Rao bound, the idea of Fisher information, and Efron's differential-geometric perspective. The idea is that if the optimal parameter lies in a highly curved region of loss-parameter space, then estimation is easy, because any deviation from the true parameter incurs a huge loss. Conversely, a region where the loss function doesn't change much is difficult for parameter estimation, because the parameter can move significantly without noticeably changing the loss.
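A quick numerical illustration of the curvature point (my own, with arbitrary parameters): for a Gaussian with known scale $\sigma,ドル the Fisher information per sample is 1ドル/\sigma^2,ドル so a sharply curved log-likelihood (small $\sigma$) pins down the mean far more tightly than a nearly flat one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, theta_true = 50, 2000, 1.0

for sigma in [0.1, 10.0]:                 # sharply curved vs. nearly flat log-likelihood
    X = rng.normal(theta_true, sigma, size=(trials, n))
    mle = X.mean(axis=1)                  # MLE of the mean for each simulated data set
    print(f"sigma={sigma:5.1f}  Fisher info per sample = {1/sigma**2:8.2f}  "
          f"spread of MLE = {mle.std():.4f}  (Cramer-Rao bound: {sigma/np.sqrt(n):.4f})")
```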
Once again, this is in sharp contrast to how we view the landscape of approximations for a hard problem. If we have a cost function that varies gently over the space, then this actually makes approximating the function a little easier. But a cost function that has sharp spikes is a little trickier to approximate, because a small movement away from the optimal solution changes the cost dramatically.
There are some problems with this intuition. After all, a sharp potential well is good for gradient descent methods. But the difference here is that in estimation, you only care about the loss function as a tool to get at the true goal: the desired parameter. However, in much of algorithm design, the loss function IS the thing you're optimizing, and you don't necessarily care about the object that achieves that optimal loss. This goes back to my earlier point about the data versus the problem. If optimizing the loss is the end goal, then a certain set of tools come into play. But if the goal is to find the "true" answer, and the loss function is merely a means to this end, then our focus on problem definition isn't necessarily helpful.
Labels: big-data, sabbatical, simons foundation
Wednesday, September 04, 2013
Statistics, geometry and computer science.
Geometry is the glue between statistics and computer science. -- Michael Jordan
That's why everyone gets stuck in it. -- Graham Cormode
Michael Jordan opened the proceedings with a talk titled "Big Data: The computation/statistics interface". From a CS perspective, it was an excellent talk that hit just the right level of detail for me to grasp some basic terminology in statistics, as well as understand some of the questions people are pondering right now.
We think of big data as a PROBLEM. More data, more scaling issues, O(n) becomes infeasible, and so on. In particular, we think of large data as a workload that taxes our resources: time, space, communication, and so on.
Michael presented a counterpoint. From a statistical perspective, more data is better. Estimators converge, error reduces, and the law of large numbers kicks in. Rather than treating data as something that we have to manage efficiently, can we think of data (quality) as a resource itself that can be managed efficiently? In other words, can we trade off estimator accuracy against other resources?
He proceeded to give a wonderful exposition of the basics of statistical decision theory, including the frequentist and Bayesian views of estimator risk. Along the way, he described a geometric interpretation of the James-Stein estimator which is too neat to skip.
The James-Stein estimator is really a very counterintuitive object. Consider a set of samples $X_i \sim N(\theta_i, 1), i \le n,ドル where the $\theta_i$ are unknown parameters. The goal is to estimate the $\theta_i$ from the samples $X_i,ドル with loss measured by $\sum_i (\theta_i - \hat{\theta_i})^2$. The $\theta_i$ are assumed unrelated to each other.
The "obvious" estimator is $\hat{\theta_i} = X_i$. It's the MLE after all. But it turns out that this estimator is strictly dominated (in terms of expected loss aka risk) by the so-called shrinkage estimator:
\[ \hat{\theta_i} = (1 - \frac{c}{S^2}) X_i \]
where $S^2 = \sum_j X_j^2$ and $c$ is a suitable constant (classically $c = n - 2$).
What's surprising about this is that the estimator for a single $\theta_i$ takes into account all the other samples $X_j,ドル even though the $\theta_i$ are assumed to be unrelated.
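Here's a quick simulation of that dominance (my own sketch, not from the talk; I use the classical choice $c = n - 2$ and an arbitrary random draw of the $\theta_i$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 5000
theta = rng.normal(size=n)                      # arbitrary, "unrelated" true parameters

X = theta + rng.normal(size=(trials, n))        # X_i ~ N(theta_i, 1), one draw per trial
S2 = np.sum(X ** 2, axis=1, keepdims=True)      # S^2 = sum_j X_j^2
js = (1 - (n - 2) / S2) * X                     # shrinkage estimator with c = n - 2

risk_mle = np.mean(np.sum((X - theta) ** 2, axis=1))   # risk of theta_hat_i = X_i
risk_js = np.mean(np.sum((js - theta) ** 2, axis=1))   # risk of the James-Stein estimator
print(f"risk of MLE ~ {risk_mle:.1f} (should be ~{n}); risk of James-Stein ~ {risk_js:.1f}")
```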
It turns out that there's an elegant geometric interpretation of this estimator, first attributed to Galton by Stephen Stigler. Consider the pairs $(X_i, \theta_i)$ as points in the plane. We don't quite know where these points lie because we don't know what $\theta_i$ is. Because the $X_i$ are normally distributed around the $\theta_i,ドル each point is really a horizontal interval of interest.
Now the standard estimator $\hat{\theta_i} = X_i$ arises from trying to solve a regression problem of $X$ versus $\theta,ドル or more precisely solving the regression $\hat{X} = E(X\mid \theta)$. But really, what we're trying to do is solve the regression $\hat{\theta} = E(\theta \mid X)$. In other words, regression of $y$ against $x$ is different from the regression of $x$ against $y,ドル as long as we have more than three points. And this is precisely what the JS estimator says.
Returning to the question he started the talk with, can we show a formal tradeoff between data estimation accuracy and sampling cost? In a recent paper with Venkat Chandrasekaran from Caltech, they show a very nice (and counterintuitive) result: using a cruder relaxation of an optimization problem can actually lead to more efficient estimation as long as you have sufficient data. Note that this goes against the standard idea that a crude relaxation is "worse" in an approximation sense.
The idea is as follows. Consider the very simple problem of denoising where the observations $\mathbf{y}$ are generated from the input $\mathbf{x}$ by a noise perturbation:
\[ \mathbf{y} = \mathbf{x} + \sigma \mathbf{z} \]
where $\mathbf{z}$ is normally distributed. Let us assume that $\mathbf{x}$ is drawn from some set $S$ (for example, $S$ is the set of low-rank matrices, or the set of permutation matrices).
The simplest estimator for $\mathbf{x}$ is an $\ell_2$ projection: compute the sample mean $\overline{\mathbf{y}}$ and then find its projection onto $S$. But this involves a minimization over $S,ドル which might be intractable.
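For concreteness, here's a toy version of this estimator (entirely my own sketch) in a case where the projection happens to be tractable: $S$ is the set of rank-one matrices, so the $\ell_2$ (Frobenius) projection of $\overline{\mathbf{y}}$ is just a truncated SVD. The point of the discussion, of course, is that for other choices of $S$ this projection step is the expensive part.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, sigma = 20, 200, 1.0

x_true = np.outer(rng.normal(size=p), rng.normal(size=p))   # the unknown x: a rank-1 matrix, so x is in S
Y = x_true + sigma * rng.normal(size=(n, p, p))             # n noisy observations y = x + sigma * z

y_bar = Y.mean(axis=0)                                      # sample mean of the observations
U, s, Vt = np.linalg.svd(y_bar)
x_hat = s[0] * np.outer(U[:, 0], Vt[0])                     # projection of y_bar onto rank-1 matrices

rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"relative error of the projection estimator: {rel_err:.3f}")
```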
We can relax $S$ to some $S',ドル where $S \subset S'$ and minimization becomes easier. But this would worsen the approximation ratio of the optimization depending on the relaxation. And here's where the insight comes in.
Suppose you instead look at the statistical risk of this estimator. In other words, look at the expected difference between the true $\mathbf{x}^*$ and the estimated $\hat{\mathbf{x}}$. The main result they show in this paper (paraphrased) is
\[ \text{expected risk} \;\le\; \frac{C}{n} \cdot \big(\text{geometric complexity of } S \text{ at } \mathbf{x}^*\big) \]
where $n$ is the number of samples.
Suppose we fix the desired risk. Then an increase in $n$ can be used to "pay for" increased "geometric complexity". And here's where the final idea comes in. The "geometric complexity" used here is the Gaussian-squared complexity, which is defined as
\[ g(D) = E[ \sup_{x \in D} \langle x, z\rangle^2 ] \]
where $z$ is normally distributed.
In particular, the set $D$ used in the above expression is the set of directions from $\mathbf{x}^*$ to points in $S$. Suppose $S$ is very "pointed" at $\mathbf{x}^*$. Then the Gaussian-squared complexity is small and the number of samples needed is small. Suppose instead we use a relaxation $S'$ that is "blunt": the Gaussian complexity goes up, but having more samples keeps the risk fixed. If the optimization over $S'$ is significantly easier than the corresponding problem for $S,ドル then we still win, even though $n$ has increased, and even though the classical "approximation ratio" might have become worse.
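A rough Monte Carlo picture of this complexity measure (my own illustration with hand-picked direction sets, not the sets from the paper): a single direction, the unit sphere of a $k$-dimensional slice, and the full unit sphere in $\mathbb{R}^d$ have Gaussian-squared complexity roughly 1ドル,ドル $k,ドル and $d$ respectively, so the blunter the set, the more samples are needed to hold $\frac{C}{n} g(D)$ fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, trials = 100, 10, 20000
Z = rng.normal(size=(trials, d))                 # Gaussian z, one row per trial

g_point = np.mean(Z[:, 0] ** 2)                  # D = {e_1}:             sup <x,z>^2 = z_1^2
g_slice = np.mean(np.sum(Z[:, :k] ** 2, axis=1)) # D = k-dim unit sphere: sup = ||z_{1..k}||^2
g_sphere = np.mean(np.sum(Z ** 2, axis=1))       # D = full unit sphere:  sup = ||z||^2

print(f"g(single direction) ~ {g_point:.1f}, g(k-dim slice) ~ {g_slice:.1f}, "
      f"g(full sphere) ~ {g_sphere:.1f}   (roughly 1, {k}, {d})")
```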
In the final part of his talk, Michael talked about his recent work on the "bag of little bootstraps", which Muthu also covered in his post. This post is already too long, so I'll defer this to another time.
Labels: big-data, sabbatical, simons foundation
Tuesday, August 27, 2013
On "a theory of big data"
+Moritz Hardt kicked off our Simons big data program with an excellent rumination on the role of theory in "big data". Some followup thoughts:
Moritz makes the point that theory, for better or for worse (mostly for better), made the choice to give problems primacy over data. I harp on this a lot in my algorithms class, and also talked about this in the context of computational thinking: the idea of 'naming a problem' is a key intellectual contribution of theoryCS.
But to see what might happen if we allowed more flexibility in problem definition, we don't have to look too far. Machine learning (a discipline that faces data head on) is characterized by the study of problems that are well defined in the broad sense, but have a lot of wiggle room in the specifics (and I'm now hoping +Sebastien Bubeck will respond and critique what follows).
Consider classification. In a broad sense, it's very well defined: given a collection of points labeled $+1$ or $-1,ドル find a decision rule that can be used to separate the +s from the -s. But in the specifics: what's the decision rule? what's the penalty for making a mistake? how much labeled data do we have? does it cost us to get labels? and so on.
For each possible answer to these questions, you can construct a well defined problem. And you could focus on solutions to that very well defined problem. But that's not particularly important. Maybe I use hinge loss to capture errors in my classification, or maybe I use some other error function. I don't care as much as I care about a formulation that allows me to solve different flavors of the problem using a single paradigm: possibly some form of gradient descent.
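To make the "single paradigm" point concrete, here's a hedged sketch (my own toy code, not from the post): a stochastic subgradient loop for a regularized linear classifier in which the loss is a plug-in argument. Swapping hinge loss for some other loss changes one function, not the algorithm.

```python
import numpy as np

def hinge_grad(w, x, y):
    # subgradient of the hinge loss max(0, 1 - y <w, x>) with respect to w
    return -y * x if y * np.dot(w, x) < 1 else np.zeros_like(x)

def sgd_classifier(X, y, loss_grad=hinge_grad, lam=0.01, epochs=20, eta=0.1):
    # stochastic subgradient descent on (1/n) sum_i loss(w; x_i, y_i) + (lam/2) ||w||^2
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            w -= eta * (loss_grad(w, X[i], y[i]) + lam * w)
    return w

# toy (essentially separable) data: two Gaussian blobs around (2, 2) and (-2, -2)
rng = np.random.default_rng(1)
signs = rng.choice([-1.0, 1.0], size=(200, 1))
X = rng.normal(size=(200, 2)) + 2.0 * signs
y = signs.ravel()
w = sgd_classifier(X, y)
print("training accuracy: %.2f" % np.mean(np.sign(X @ w) == y))
```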
This theme shows up again and again. In clustering. In regression. In the different flavors of learning (active, semisupervised, transfer, multitask, ...). A good solution to a problem focuses not on the specific definition, but on a general framework that captures different variations on the problem and reduces them to solving some optimization that can then be engineered. This is also why optimization (and understanding heuristics for optimization) is such a focus in machine learning (Bubeck's lecture notes are a great resource on this, btw).
There is of course a downside. The downside is that you (could) miss out on connections between problems: the reductions that are the lifeblood of work in theoryCS. In fact, John Langford has looked into this issue specifically, with his machine learning reductions project.
So returning to the original question, what should a theory of big data look like? A big part of theoryCS research is the skillful management of resources (space, time, random bits, communication, what have you...). But an even more important part is the abstraction of computation as a first-order phenomenon, a "theory of the feasible", as it were.
Consider the example of privacy. Privacy is a concern that arises from access to data, and is a core component of any "big data" discussion. What differential privacy achieved was to frame the problem computationally: as a definition of what's feasible to protect, as a set of mechanisms to guarantee protection, and as a characterization of what cannot be protected.
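As a concrete (if standard) instance of a mechanism in that framework, here's a minimal sketch of the Laplace mechanism for a counting query; the data and the choice of $\epsilon$ are hypothetical.

```python
import numpy as np

def laplace_count(data, predicate, epsilon, rng):
    """epsilon-differentially private count of records satisfying `predicate`.
    A counting query changes by at most 1 when one record changes (sensitivity 1),
    so adding Laplace noise of scale 1/epsilon gives epsilon-differential privacy."""
    true_count = sum(1 for record in data if predicate(record))
    return true_count + rng.laplace(scale=1.0 / epsilon)

rng = np.random.default_rng(0)
ages = [23, 35, 41, 29, 57, 62, 31, 44]           # hypothetical private records
noisy = laplace_count(ages, lambda a: a > 40, epsilon=0.5, rng=rng)
print(f"noisy count of ages over 40: {noisy:.1f} (true count is 4)")
```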
Similarly, I'm interested in other problems arising out of our need to interact with data that can be framed abstractly in "the language of the feasible". There's work on how to value published data, and how to price it. There's work on how to verify computations, and how to do computations securely. There's work on how to understand and interpret data analysis. And then of course there's the large body of work on how to manage and compute efficiently with data under brand new models of computation.
The "language of the feasible" is our not-so-secret weapon in defining abstractions: it's more than just resource allocation and management, and it's what gives theoryCS power to speak to other domains.
Labels: big-data, research, simons foundation
Monday, October 08, 2012
On why I'm excited about "big data"
I was in Aarhus recently for a MADALGO workshop on large-scale parallel and distributed models, where I did a sequence of lectures on GPU algorithms. I was briefly interviewed by a university reporter for an article, and did a little video on why I think big data/big iron problems are interesting.
At the risk of embarrassing myself even more than I usually do, here's the video. Note that this was recorded at a time of great crisis across the globe, when all hair styling products had mysteriously disappeared for a few days.
Sunday, June 17, 2012
A Reflection On Big Data (Guest post)
Ed: Graham Cormode kindly agreed to send a missive from the recently concluded big data workshop at Duke.
There seems to be no shortage of Big Data events at the moment, reflecting some of the enthusiasm around this topic. Last week, there were meetings at NIST and at Duke, and there are upcoming events at Stanford and SAMSI, to name but a few. (Ed: also the new Simons Foundation program on big data)
I attended the Duke event, and was hoping to blog about the meeting, but suffered a big data problem of my own: with 25 speakers, plus panels and breakout sessions, how to pull out any meaningful summary? So instead, I'll just pick on one issue that was the subject of much discussion: what exactly is new about big data? In particular, there has been much interest in what we called 'massive' data over the last few years (as captured by MADALGO, the center for massive data algorithmics, and the accompanying MASSIVE workshop). Is Big Data just a rebranding of Massive Data? Which is bigger: Big Data, Massive Data, or Very Large Data?
What came out from our discussions is that we believe that Big Data is qualitatively different from the challenges that have come before. The major change is the increasing heterogeneity of data that is being produced. Previously, we might have considered data that was represented in a simple form, as a very high-dimensional vector, over which we want to capture key properties. While Big Data does include such problems, it also includes many other types of data: database tables, text data, graph data, social network data, GPS trajectories, and more. Big Data problems can consist of a mixture of examples of such data, not all of which are individually "big", but which together are a big problem to make sense of. Addressing these challenges requires not only algorithmics and systems, but machine learning, statistics, and data mining insights.
This is by no means a new observation: the axes of Volume, Velocity and Variety have been widely discussed before. But it did remind me that we should not get too hung up on how big our big data is, since size is not the sole defining characteristic of handling big data (although it is a necessary characteristic).
Many other familiar topics came up in discussions, such as: how can big data sets be made more accessible to researchers? how can privacy concerns with big data be addressed? what are the right models for big data, for both the data and the computation? what are the key techniques for working with big data? do we even know enough to teach a course on big data?
What struck me about these discussions is that not only were we far from reaching a consensus on any point, but also no one seemed to think they had any strong candidate solutions. Consequently, I left assured that Big Data is much more than a rebranding of existing work. It represents a significant research agenda that will challenge us for years to come.