Monday, March 30, 2015
Revisiting the Misra-Gries estimator
If you've ever taken a class on streaming algorithms, or want to ace some tech company puzzle interview question, you've probably heard of this little gem:
Given $n$ items, determine whether a majority element (one occurring strictly more than $n/2$ times) exists. You are allowed one pass over the data (which means you can read the elements in sequence, and that's it), and exactly TWO units of working storage.

This is a special case of the general Misra-Gries estimator for finding frequent items in a stream with small space. The MG estimator allows you to find (approximately) all items that occur more than $\epsilon n$ times using $O(1/\epsilon)$ space. Setting $\epsilon = 1/2$ yields the desired result.
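For concreteness, here is a minimal sketch of the MG routine (a toy rendering of the usual description, not code from any particular paper): keep at most $k - 1$ counters, where $k = \lceil 1/\epsilon \rceil$; any item occurring more than $n/k$ times is guaranteed to survive to the end.

```python
def misra_gries(stream, k):
    """Keep at most k-1 counters; any item occurring more than n/k times
    in the stream is guaranteed to appear among the surviving candidates."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No room: decrement every counter, dropping those that hit zero.
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters  # candidates; a second pass can verify exact counts

# With k = 2 (i.e., epsilon = 1/2) the single surviving candidate, if any,
# is the majority candidate.
print(misra_gries([3, 1, 3, 2, 3, 3, 4, 3], 2))
```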
This result has been proved and reproved a number of times, and there's a fairly general proof based on the fact that what you're really doing is approximating the true frequency to an additive error of $\epsilon n$ (and so any item that occurs more often than that will be retained).
What's more interesting is how you might go about proving the basic statement from first principles if you don't know the proof. This came up at dinner, when I was talking to people who hadn't heard of the problem before and were able to reconstruct a proof on the fly.
So let's return to the basic statement, and let's be a little more precise. The precise guarantee required is as follows:
- If a strict majority element exists, you MUST return it
- If not, you can return any element you want (even the most infrequent one)
Because of the second statement above, we only have to worry about streams in which a strict majority element does exist.
Observation 1: if no element is a majority of some prefix of the stream, you can throw that prefix away.
Why: we know there's a majority element somewhere. If it isn't a majority of the prefix we threw away, it has to be at least as much of a majority in what's remaining.
Observation 2: This prefix could be as short as two elements long.
In other words, we see an element. Call it $x$. If we now see $y \ne x,ドル then $x$ is no longer a majority element of what we've seen so far, and we can chuck both, starting a new stream at $y$.
But if we do see $x$ again, then it could be a majority element. How do we keep track of whether it ever stops being a majority element ?
We could keep track of the number of items seen so far, but that's an extra counter. Why not instead pair each instance of $x$ against an instance of some other element, by keeping a count that we increment on seeing $x$ and decrement on seeing anything else? If the count ever hits zero, we can invoke Observation 1 and start again.
If not, then we will end with a nonzero count for $x,ドル which must be the correct element.
And this gives us the exact algorithm we need.
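Put together, that's one stored candidate and one counter. Here's a minimal sketch of the argument above in Python (a toy rendering, not anyone's reference code):

```python
def majority_candidate(stream):
    """One pass, two units of storage: a candidate and a counter.
    If a strict majority element exists, it is returned; otherwise
    the returned element is arbitrary."""
    candidate, count = None, 0
    for x in stream:
        if count == 0:          # Observation 1: discard the prefix, restart
            candidate, count = x, 1
        elif x == candidate:    # another vote for the current candidate
            count += 1
        else:                   # pair x off against one copy of the candidate
            count -= 1
    return candidate

print(majority_candidate([2, 7, 2, 2, 5, 2, 2]))  # 2
```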
Tuesday, January 27, 2015
Streaming @ SODA: Part II
This is the second of two posts by Samira Daruki on the streaming sessions at SODA 2015. For the first post, see here.
In the third paper from the streaming graph family in SODA15, "Parameterized Streaming: Maximal Matching and Vertex Cover", Chitnis, Cormode, Hajiaghayi and Monemizadeh introduce a new approach to handling graph streams called parameterized streaming algorithms. In addition to the insertion-only model, they consider the dynamic model of streaming graphs, in which the input is a sequence of edge insertions and deletions.
This dynamic model of streaming graph processing is natural when the graph is changing over time, and has recently received much attention due to breakthroughs by Ahn, Guha and McGregor (one, and two). Over these two papers, they gave the first results for a number of graph problems over dynamic streams. This has provoked much interest in what can be computed over dynamic graph streams, although there is still not much work on solving graph optimization problems in this model. The challenge is that when an edge is deleted, it can take substantial work to repair the solution, so the algorithm must retain enough information to do so while keeping only a bounded amount of working space. (ed: not surprisingly, some of these ideas are useful for non-streaming dynamic graph algorithms: witness the paper by Kapron, King and Mountjoy on dynamic connectivity in (randomized) worst-case polylog time from SODA a few years ago)
Returning to parameterized streaming: in this paper, instead of solving the optimization problem on the graph stream exactly, the goal is to solve the "parameterized" version of the problem, where a parameter $k$ is given and we want to answer the following decision problem:
Is there a solution of size at most $k$?

The motivation for parameterizing the problem comes from real-world applications in which the solution to the graph problem is small compared to the size of the input (i.e. sublinear in the input size). In these cases, the interesting challenge is to solve graph optimization problems in a streaming fashion using space bounded by some function of the "solution size" instead of the "input size".
One standard technique for solving parameterized problems is kernelization, which uses polynomial-time preprocessing to map the input to an equivalent instance of size $f(k)$ (called a kernel) with a new parameter value $k' \le g(k),ドル for some computable function $g$.
In this paper, by combining kernelization techniques with randomized sketch structures, the authors obtain the first streaming algorithms for the parameterized version of Vertex Cover. The main idea is to maintain a maximal matching of the underlying graph in a streaming fashion, and then run the well-known kernelization algorithm for Vertex Cover on the maintained maximal matching. The data structure maintaining the maximal matching uses a $k$-sample recovery sketch, a generalization of linear sketching for $\ell_0$-sampling, as its main tool, applying it to the neighborhood (incident edges) of each vertex in the matching. As edges are inserted or deleted, these sketches can be updated without needing knowledge of the full neighborhood of each node. However, deletions pose a challenge: as edges are deleted, we need an intelligent mechanism to ensure the matching remains maximal using only the limited stored information.
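To see why a maximal matching is such a useful summary, here is a toy, non-streaming sketch of the matching-based argument for parameterized Vertex Cover (illustrative only; the paper's contribution is maintaining this under insertions and deletions via sketches):

```python
def matching_based_vc_check(edges, k):
    """Toy illustration: a greedy maximal matching as the core of a
    Vertex Cover kernel.

    Any vertex cover must pick at least one endpoint of every matching
    edge, so a matching with more than k edges certifies that no cover
    of size k exists. If the maximal matching has at most k edges, every
    edge of the graph touches one of its <= 2k endpoints (by maximality),
    which is the small structure the kernelization then works with.
    """
    matching, matched = [], set()
    for u, v in edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
            if len(matching) > k:
                return False, None  # no vertex cover of size <= k
    return True, matched            # candidate region: at most 2k vertices

ok, region = matching_based_vc_check([(1, 2), (2, 3), (3, 4), (4, 5)], 2)
print(ok, region)  # True, {1, 2, 3, 4}
```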
Another nice result here is a tight lower bound of $\Omega(k^2)$ (by reduction from the INDEX problem in communication complexity) on the space complexity of any (randomized) streaming algorithm for parameterized Vertex Cover, which holds even in the insertion-only model.
Besides the general insertion-only and dynamic models, a restricted dynamic model is also discussed, in which we are promised that at each time $i$ the size of the vertex cover of the graph induced by the edges seen so far is at most $k$. With this promise, they develop a dynamic parameterized streaming algorithm whose space usage matches the lower bound above.
It is interesting to think about other NP-hard problems in the framework of parameterized streaming, to explore how kernelization can help in this direction, and to see whether there are other powerful hammers for the challenges that arise in designing streaming algorithms for hard problems.
Coda:
Along with the three papers discussed above, there was another paper on streaming presented at SODA (by Naumovitz and Saks) which provides a deterministic polylogarithmic-space streaming algorithm for approximating distance to monotonicity for a sequence of $n$ numbers, compared to the corresponding randomized result presented at SODA two years ago.
While I won't discuss this work here, so as to keep these posts just about streaming graph algorithms, I encourage interested readers to take a look at this paper as well, as the last one in the streaming family of SODA15.
Wednesday, July 25, 2012
Data Streaming in Dortmund: Day II
Andrew McGregor and I are tag-blogging the workshop. Read his post on day 1 here.
Day II could roughly be summarized as:
Sitting by the stream recovering from a sparse attack of acronyms.
There were a number of talks on sparse recovery, which includes compressed sensing, and asks the question: how best can we reconstruct a signal from linear probes if we know the signal is sparse?
Eric Price led things off with a discussion of his recent breakthroughs on sparse FFTs. While the regular DFT takes $n \log n$ time, it's reasonable to ask if we can do better if we know the signal has only $k$ nonzero Fourier coefficients. He talked about a sequence of papers that do this "almost optimally", in that they improve the $n\log n$ running time as long as the sparsity parameter $k = o(n)$.
Anna Gilbert provided an interesting counterpoint to this line of work. She argued that for real analog signals, the process of sensing and recording them, even if the original signal was extremely sparse, can lead to discrete signals that have $\Theta(n)$ nonzero Fourier coefficients, and this is in some way intrinsic to the sensing process. This was part of an attempt to explain why a long line of sublinear methods (including Eric's work) doesn't do very well on real signals. This line of attack is called "off the grid" reconstruction, because you're off the (discrete) grid of a digital signal.
In sparse recovery, there are two parameters of interest: the number of measurements you make of the signal, and the time for reconstruction. Obviously, you'd like to get both to be as small as possible, and information-theoretic arguments show that you have to spend at least $k \log(n/k)$ measurements. Martin Strauss focused on speeding up the reconstruction time while maintaining measurement-optimality, in a setting known as the "for-all" formulation, where the adversary is allowed to pick a signal after seeing the probe matrix (there's a related "foreach" model that is subtly different, and I'm still a little confused about the two).
On a slightly different note (but not that different as it turns out), Atri Rudra talked about a fun problem: Given the "shadows" of a set of 3D points along three orthogonal planes, can you find the minimum number of points that could yield the specific shadows ? If all projections have complexity $n,ドル it's well known that the right bound is $n^{3/2}$. While this bound was known, it wasn't constructive, and part of Atri's work was providing an actual construction. There are all kinds of interesting connections between this problem, join queries, triangle counting in graphs, and even the scary-hairy 3SUM, but that's a topic for another time.
Other talks in the afternoon: Alex Andoni talked about finding the eigenvalues of a matrix efficiently on streams (specifically finding the "heavy" eigenvalues). I learnt about the Cauchy interlacing theorem from his talk - it's a neat result about how the eigenvalues of submatrices behave. Ely Porat talked about the problem of preventing evil entities (read: Hollywood) from poisoning a BitTorrent stream of packets, and presented ideas involving homomorphic signatures for packets via error correcting codes.
Joshua Brody returned to the basic streaming setup. Most stream algorithms that estimate some quantity introduce two-sided error (the true value could be above or below the reported estimate). He asked whether this is necessary to stay within sublinear bounds: it turns out that for some problems, limiting yourself to one-sided error worsens the space complexity needed to solve the problem (note that for structures like the Count-Min sketch, the estimate is one-sided by design, and so there's no deterioration).
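(As an aside on that last parenthetical, here is a minimal Count-Min-style sketch, with purely illustrative parameters and hashing, showing why its error is one-sided: collisions can only inflate a cell, never decrease it.)

```python
import random

class CountMinSketch:
    """Tiny Count-Min sketch: estimates are always >= the true count,
    because hash collisions only ever add to a cell."""
    def __init__(self, width=256, depth=4, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        return [hash((salt, item)) % self.width for salt in self.salts]

    def add(self, item, count=1):
        for row, col in enumerate(self._cells(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # Minimum over rows: still an overestimate, never an underestimate.
        return min(self.table[row][col]
                   for row, col in enumerate(self._cells(item)))

cms = CountMinSketch()
for x in [1, 2, 1, 3, 1]:
    cms.add(x)
print(cms.estimate(1))  # >= 3, and exactly 3 unless 1 collides in every row
```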
Coming up next: Andrew's summary of day three, and a report on our heroic tree-swinging adventures.
Friday, September 23, 2011
Finding the right reference for finding the majority element
Probably everyone knows the standard trick for determining in one pass whether a stream admits an element occurring more than half the time. But what not everyone might know is the origin of this method.
Stuart Shieber writes about the algorithm (discovered by Boyer and Moore) and the mishaps it had on the way to publication. The article also describes how the followup paper by Misra and Gries came into being. Interestingly, I almost always see the Misra-Gries generalization being cited, and not the original Boyer-Moore method.
(HT Fernando Pereira)
Thursday, June 30, 2011
Bob Morris and stream algorithms
Bob Morris, one of the early contributors to UNIX and an NSA cryptographer, died on Sunday. The New York Times has a remembrance that talks about his contribution to password security and cryptography in general.
What's not mentioned in the NYT profile is that he's the Morris of the 'Morris counter'. This was arguably the very first streaming algorithm, nearly 20 years before the AMS paper and the HRR tech report first introduced streaming algorithms, five years before the Flajolet-Martin paper on counting distinct elements in a stream (also not framed as a streaming problem), and two years before the Munro-Paterson algorithm for streaming median finding.
The Morris paper is titled "Counting large numbers of events in small registers":
It is possible to use a small counter to keep approximate counts of large numbers. The resulting expected error can be rather precisely controlled. An example is given in which 8-bit counters (bytes) are used to keep track of as many as 130,000 events with a relative error which is substantially independent of the number n of events. This relative error can be expected to be 24 percent or less 95 percent of the time (i.e. σ = n/8). The techniques could be used to advantage in multichannel counting hardware or software used for the monitoring of experiments or processes.

In modern language, this can be re-rendered as:
It is possible to approximately count the number of elements in a stream using $O(\log \log n)$ bits, where $n$ is the length of the stream.

The idea is very simple. The trick is to count $C = \log n$ instead of $n$. Clearly, doing so requires only $\log\log n$ bits. Then any additive approximation to $C$ yields a multiplicative approximation to $n$.
So how do we count $\log n$ ? Let's assume for now that we're using base 2 logarithms. If the current counter value is $C,ドル then we "know" that roughly 2ドル^C$ elements have gone by, and the counter shouldn't be bumped again until roughly another 2ドル^C$ have arrived. We can't afford to track that explicitly, so we do it probabilistically: toss a coin whose probability of coming up heads is 2ドル^{-C},ドル and only increment the counter if the coin comes up heads.
Philippe Flajolet did a masterful analysis of this approach, and showed that the expected value of 2ドル^C$ was $n+2,ドル yielding an unbiased estimator for the true count. He also showed that the variance of 2ドル^C$ is $n(n+1)/2$.
This approach yields a fixed error bound for the resulting count. The next trick is then to change the base from 2 to some parameter $a,ドル and then reanalyze the method. Much work later, it can be shown that with roughly $\log \log n + \log(1/\epsilon)$ bits, you can get a multiplicative approximation of 1ドル+\epsilon$.
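Here is a minimal sketch of the counter with a tunable base $a$ (my own rendering; the exact constants in the estimator depend on the initialization convention, which is why they differ slightly from the $n+2$ quoted above):

```python
import random

def morris_count(stream, a=2.0):
    """Approximate counting with a tiny counter.

    The counter C is incremented with probability a**(-C), so C tracks
    roughly log_a of the stream length; the count is reconstructed as
    (a**C - 1) / (a - 1), which is unbiased under this convention. Bases
    closer to 1 give better accuracy at the cost of a slightly larger counter.
    """
    c = 0
    for _ in stream:
        if random.random() < a ** (-c):
            c += 1
    return (a ** c - 1) / (a - 1)

# Average several independent counters to tame the variance.
n = 100_000
estimates = [morris_count(range(n), a=1.1) for _ in range(10)]
print(sum(estimates) / len(estimates))  # roughly n
```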
Postscript
Stream algorithms are a good icebreaker in an algorithms class: they're unusual, seem to solve an impossible problem, and do it really well. I also like to point out to students that three of the more important streaming algorithms were designed well before people cared about "big data" or streaming. It's a valuable lesson in the importance of understanding "old" tools.
In case you're wondering if anyone still uses Morris counters, here's a paper from IJCAI 2009 that plugs them into a larger system to maintain counts of $n$-grams when building a statistical language model.
Wednesday, March 11, 2009
Memory exercises and streaming.
The latest guest column on Olivia Judson's "Wild Side" blog at the NYT is on increasing intelligence: Can we do exercises to increase it ? In the article, they describe a test that when performed repeatedly leads to measurable improvement in IQ:
A common way to measure working memory is called the “n-back” task. Presented with a sequential series of items, the person taking the test has to report when the current item is identical to the item that was presented a certain number (n) of items ago in the series. For example, the test taker might see a sequence of letters like
L K L R K H H N T T N X
presented one at a time. If the test is an easy 1-back task, she should press a button when she sees the second H and the second T. For a 3-back task, the right answers are K and N, since they are identical to items three places before them in the list. Most people find the 3-back condition to be challenging.
Of course, to folks who do streaming, this is easily recognized as a simple subcase of detecting a duplicate item in a stream (that problem allows n to vary over the entire stream), which in turn is a special case of the most frequent element estimation problem.
These are hard problems for sublinear space stream algorithms. It's interesting that they're taxing for human working memory as well.
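Just for fun, here is what the n-back task looks like when written as a small streaming check with an $n$-item buffer (a toy sketch, not a claim about how working memory actually does it):

```python
from collections import deque

def n_back_hits(stream, n):
    """Report positions where the current item matches the one n steps back.
    Uses O(n) working memory: a sliding buffer of the last n items."""
    buffer, hits = deque(maxlen=n), []
    for i, item in enumerate(stream):
        if len(buffer) == n and buffer[0] == item:
            hits.append((i, item))
        buffer.append(item)
    return hits

print(n_back_hits("LKLRKHHNTTNX", 1))  # the second H and the second T
print(n_back_hits("LKLRKHHNTTNX", 3))  # the second K and the second N
```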