Hi,

I've been plotting data with millions of datapoints, and it takes a long time for the plot to draw and to rescale with the mouse. I was playing around with the draw_lines function in the agg backend and sped it up a bit; maybe others will find it useful. I've attached my optimized version of _backend_agg.cpp (just replace the old src/_backend_agg.cpp and recompile).

The main optimization is to combine sequential lines that are parallel into a single line to be drawn once. Try "plot(rand(100000),'-')" to see the change. Resizing and zooming should be faster than before.

This comes at the expense of accuracy, since many lines are not drawn. I've tried to balance accuracy vs. speed to my liking. The main thing to tweak is a length threshold, which I have hardcoded to 0.25. This is the squared length in pixels that we can move in the direction perpendicular to the set of parallel lines we are combining and still combine the lines (so a point can move 0.5 pixels to the side before a new line is started). If you set it to 0, you should get the same plots as without my changes.

I also made it skip any lines that are outside of the drawing area.

Hopefully it's not buggy!

Allan
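The idea, in a minimal Python sketch (the real implementation is the attached C++; this simplified variant checks each dropped vertex against the current chord rather than tracking the whole run, with the same hardcoded 0.25 threshold):

    THRESH = 0.25  # squared perpendicular distance, in pixels

    def merge_collinear(points, thresh=THRESH):
        """Reduce a polyline given as (x, y) pixel pairs, dropping each
        vertex that lies within sqrt(thresh) pixels, perpendicular, of
        the chord from the current run's start to the next vertex."""
        if len(points) < 3:
            return list(points)
        out = [points[0]]
        for i in range(1, len(points) - 1):
            x0, y0 = out[-1]          # start of the run being merged
            x1, y1 = points[i + 1]    # candidate new endpoint
            px, py = points[i]        # vertex we may drop
            dx, dy = x1 - x0, y1 - y0
            d2 = dx * dx + dy * dy
            # squared perpendicular distance of (px, py) from the chord
            cross = (px - x0) * dy - (py - y0) * dx
            if d2 > 0.0 and cross * cross <= thresh * d2:
                continue              # near-collinear: merge and move on
            out.append(points[i])     # deviates too much: start a new run
        out.append(points[-1])
        return out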
John,

Have you considered Allan's version of _backend_agg.cpp (message date was 06/27)? I just tried it, and so far, I like it. I have not found any problems with the backend_driver.py test, and the improvement in speed that Allan notes with the "plot(rand(100000))" example is dramatic: the difference between unusable for panning and zooming with the original backend, and quite responsive and very usable with Allan's version.

I have not tried to understand the algorithm or code.

Allan,

Can you suggest a simple test that graphically illustrates the "loss of accuracy", perhaps a worst case, so we can see whether this is an actual problem?

I am also wondering whether your optimization technique, or some variation of it, could and should be applied in the generation of vector-based printer files: ps, pdf, svg. They can get excessively large and slow with very large datasets, and the ability to trim them with minimal loss of genuine information conveyed to the ultimate viewer of the plot would be very nice.

Eric
Hi,

First, I noticed a bug in the version I sent before. I've attached a new version in patch form, to be applied from the base directory of a normal 0.90.1 installation with 'patch -p1 <patch'.

One artifact I can clearly see is partly due to skipping lines < 1 pixel in length. The original agg renderer did this too, so it is not too different from before. My patch makes it worse, though. You can see it with plot(rand(10000)) again, but zoom out on only the y axis until the data is almost a straight line from right to left. You should see little clumps appear as it gets close to a straight line.

Besides this I have not noticed anything. There may be problem cases I have not thought of, of course. To see the kind of problem that might occur, you can try changing the distance cutoff from 0.25 to something huge like 25.0. There you will definitely notice errors.

About doing this for other image formats: I think it is OK for pixel-based images, but probably not for SVG or other vector-based formats. I can imagine someone writing to an SVG so that they can use another program to make a pretty rendering of it, but if lines are missing, this second program does not have the full data to work with.

Allan
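To see the clumping without the interactive zoom, something along these lines works (the axis limits here are arbitrary; the import style mirrors the test script later in this thread):

    import pylab
    from scipy import *

    pylab.plot(rand(10000))
    # zoom far out on y only: as the trace flattens toward a straight
    # line, the sub-pixel segments get skipped and clumps appear
    pylab.axis([0, 10000, -50, 50])
    pylab.show()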
On 7/8/07, ah...@cs... <ah...@cs...> wrote:

> First, I noticed a bug in the version I sent before. I've attached a
> new version in patch form, to be applied from the base directory of a
> normal 0.90.1 installation with 'patch -p1 <patch'.

Thanks Allan, I just applied this to svn.

Some time ago I made an optimization to Agg's draw_markers to use cached rasters of the markers, and this made a dramatic difference in speed. So it is nice to complement that with this draw_lines optimization.

I was curious to see what approach you took to cull the points outside the window -- just looking at the vertices of the current point and last point, and dropping if both are outside the x and y interval of the viewport. As you mention in the comment, we could do a line segment/rectangle intersection test to cull more, but it is not clear that would be faster, because of the tradeoff between the cost of the test and the cost of drawing segments we don't need.

I wonder if you cached some of the interval tests whether you might see some additional speedups. Eg, something like

    thisoutsidex = (thisx < 0 || thisx > width);
    thisoutsidey = (thisy < 0 || thisy > height);
    if (thisoutsidex && thisoutsidey && lastoutsidex && lastoutsidey) {
        // other stuff as necessary...
        lastoutsidex = thisoutsidex;
        lastoutsidey = thisoutsidey;
        continue;
    }
    lastoutsidex = thisoutsidex;
    lastoutsidey = thisoutsidey;

I suspect the comparisons are pretty cheap, so I don't know how much this will save, but it might be worth looking into.

There is one case where your culling algorithm will fail, I think:

              view window
                               x2
      -----------------------------
      |                           |
      |                           |
      |                           |
      _____________________________
    x1

The segment connecting x1 and x2 will be dropped, but the user should see a line here, right? If we could do a rectangle/line intersection test w/o giving up too much in terms of performance, that would be better, I think. That, or give users the ability to turn it off. Since people use mpl to view their scientific data, we should be careful to make sure they can see it....

> One artifact I can clearly see is partly due to skipping lines < 1
> pixel in length. The original agg renderer did this too, so it is not
> too different from before. My patch makes it worse, though. You can
> see it with plot(rand(10000)) again, but zoom out on only the y axis
> until the data is almost a straight line from right to left. You
> should see little clumps appear as it gets close to a straight line.
>
> Besides this I have not noticed anything. There may be problem cases
> I have not thought of, of course. To see the kind of problem that
> might occur, you can try changing the distance cutoff from 0.25 to
> something huge like 25.0. There you will definitely notice errors.

Do you think it is worth making this a line property so users can tweak it? There are a couple of ways to get the line property information into the backend -- the typical way is to make a corresponding graphics context property and then set it in the line draw method.

> About doing this for other image formats: I think it is OK for
> pixel-based images, but probably not for SVG or other vector-based
> formats. I can imagine someone writing to an SVG so that they can use
> another program to make a pretty rendering of it, but if lines are
> missing, this second program does not have the full data to work with.

I do think it would be nice to support culling points outside the viewport across backends.
A recurring criticism is that our postscript files with lots of points, in which the zoom is set to just a subset of points, are quite large, because we use graphics clipping to get rid of them but the data are still in the file. As you note, this can be a good thing for users who want to do post-hoc editing, and culling is tricky because, for example, a point whose center is outside the viewport could have an edge vertex inside the viewport. So we would need polygon intersections to do it right (and even this would not completely solve the case of polygons with thick edge strokes), but having the option to cull points whose centers fall outside the viewport, or line segments whose vertices fall outside the x and y intervals, would be a big win in most common use cases.

Since we already support masks in the line class, probably the best way to do this would be some special-purpose (possibly extension) code that efficiently sets the mask. Then we would automatically get support across backends.

JDH
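A rough numpy sketch of that mask idea (the helper name and the neighbor handling are assumptions, not code from any patch in this thread):

    import numpy as np
    import numpy.ma as ma

    def cull_to_mask(x, y, xmin, xmax, ymin, ymax):
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        # "outside" means beyond both the x and the y interval of the
        # viewport -- the simple vertex test discussed above
        outside = ((x < xmin) | (x > xmax)) & ((y < ymin) | (y > ymax))
        keep = ~outside
        # keep an outside vertex whenever a neighbor is inside, so a
        # segment entering the viewport keeps its outer endpoint
        keep[:-1] |= ~outside[1:]
        keep[1:] |= ~outside[:-1]
        return ma.masked_where(~keep, x), ma.masked_where(~keep, y)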
On Mon, July 9, 2007 10:57 am, John Hunter wrote:

> I wonder if you cached some of the interval tests whether you might
> see some additional speedups. Eg, something like
> ....

You could do this, but I don't think you would even notice the difference. Really, my impression is that the only things that really take time are drawing a line, and maybe doing excessive amounts of multiplication/division. It can't hurt to try, though, if there is time.

The full rectangle test would give a boost in some cases, but I think those cases are going to be extremely rare, so I didn't bother. All x-vs-time type data works fine under the simpler test, for example. I'll give it a try sometime later this week, though, just to see.

> There is one case where your culling algorithm will fail, I think:
> [the x1-x2 diagram above]

It doesn't fail, as you can see by trying to plot something like that. Both points need to be on the 'undrawn' side of one of the lines of the box. In your case, the points are on either side of the top line, the left line, and the bottom line, and are both on the 'drawn' side of the right line. They are not both on the undrawn side of any line.

> Do you think it is worth making this a line property so users can
> tweak it? There are a couple of ways to get the line property
> information into the backend -- the typical way is to make a
> corresponding graphics context property and then set it in the line
> draw method.

I thought of doing this, but I didn't get to it. I originally wrote it just for myself. If you know how, it seems like a good idea.

> I do think it would be nice to support culling points outside the
> viewport across backends. A recurring criticism is that our
> postscript files with lots of points, in which the zoom is set to
> just a subset of points, are quite large, because we use graphics
> clipping to get rid of them but the data are still in the file. As
> you note, this can be a good thing for users who want to do post-hoc
> editing, and culling is tricky because, for example, a point whose
> center is outside the viewport could have an edge vertex inside the
> viewport. So we would need polygon intersections to do it right (and
> even this would not completely solve the case of polygons with thick
> edge strokes), but having the option to cull points whose centers
> fall outside the viewport, or line segments whose vertices fall
> outside the x and y intervals, would be a big win in most common use
> cases.

Dropping parallel lines in the visible area still seems like a bad idea for SVG, but I can see that culling points outside it could be OK.

There are two things you want to cull: lines between two points, and polygons drawn around single points. I don't know exactly how the polygon parts work in matplotlib, but it seems reasonable that you could find an upper bound on the radius of the polygon around a point. Add the max radius of the polygon plus the stroke width to get a 'bound' value, and then instead of (x < 0 || x > width ...) do (x < -bound || x > width + bound ...) to cull. Just make the bound big enough that you are sure nothing could be in the viewing area, even if you draw a few more points than you need to.

For lines between two points, if you use just the simple cull test I am using, you could do the same thing as for polygons, where the bound depends only on the stroke width. Actually, I should really do that in the agg code.
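In Python terms, the padded trivial-reject test being described here looks something like this (a sketch with a hypothetical name, not the actual C++ from the patch):

    def segment_cullable(x0, y0, x1, y1, width, height, bound=0.0):
        # bound = max marker/polygon radius + stroke width, so nothing
        # whose ink could reach the viewport is ever dropped
        # cull only when BOTH endpoints are on the undrawn side of the
        # SAME edge; the x1-x2 diagonal above fails this for every edge
        return ((x0 < -bound and x1 < -bound) or
                (x0 > width + bound and x1 > width + bound) or
                (y0 < -bound and y1 < -bound) or
                (y0 > height + bound and y1 > height + bound))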
If you use a full box test, you would again need a more complicated algorithm with extra intersection tests; it could get messy.

Also, I am not sure how and when transforms are applied to the data, but I can imagine the transform and the culling interfering with each other. In the agg code, the data is transformed to its final pixel units before I do the culling.

Allan
John (and others),

I've made a quick change to the svg backend to cull the data; see the attached patch. I haven't tested it extensively, but it seems OK. It culls both lines and polygons out of the SVG output.

About making it general across backends: I've looked a bit at how things get rendered, and here is what I understand. We can only do the culling in pixel coordinates, since the height/width of the figure is in pixel units. The transform from data to pixel coordinates happens either in Line2D.draw() (as for SVG) or in the backend's draw function (as for agg). Therefore we could only generalize for the case where the transform is in Line2D, and even then, it looks simpler to me to do it in the backend, since that is the only place we conveniently have all the pixel information we need. Maybe I have misunderstood, though.

Also, I noticed I forgot to comment out a debugging cout from my last agg patch, near the end of draw_lines. Oops, sorry; you should fix that in svn.

Allan

By the way, here's what I used to test the svg patch, using the -dSVG switch:

    import pylab
    from scipy import *

    pylab.plot(sin(arange(1000)*pi/100.0))
    pylab.plot(rand(1000)*2, '.')
    pylab.axis([300, 800, -0.5, 1])
    pylab.savefig('test')
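Schematically, the constraint on where the cull can live looks like this (every name here is a stand-in, not real matplotlib API; the point is only the ordering of transform and cull):

    def draw_lines(x, y, transform_to_pixels, draw_segment, width, height):
        # the cull needs pixel units, so it can only run after the
        # data->pixel transform; for agg that transform happens in the
        # backend, for SVG it has already happened in Line2D.draw()
        x, y = transform_to_pixels(x, y)
        for i in range(len(x) - 1):
            x0, y0, x1, y1 = x[i], y[i], x[i + 1], y[i + 1]
            # trivial reject: both endpoints past the same edge
            if ((x0 < 0 and x1 < 0) or (x0 > width and x1 > width) or
                    (y0 < 0 and y1 < 0) or (y0 > height and y1 > height)):
                continue
            draw_segment(x0, y0, x1, y1)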
ah...@cs... wrote:

> John (and others),
>
> I've made a quick change to the svg backend to cull the data; see the
> attached patch. I haven't tested it extensively, but it seems OK. It
> culls both lines and polygons out of the SVG output.

Allan,

Looks good, and the basic idea should be generally applicable. It can (probably) be sped up, and the code made more compact, by using numpy arrays; but this can wait.

> About making it general across backends: I've looked a bit at how
> things get rendered, and here is what I understand. We can only do
> the culling in pixel coordinates, since the height/width of the
> figure is in pixel units. The transform from data to pixel
> coordinates happens either in Line2D.draw() (as for SVG) or in the
> backend's draw function (as for agg). Therefore we could only
> generalize for the case where the transform is in Line2D, and even
> then, it looks simpler to me to do it in the backend, since that is
> the only place we conveniently have all the pixel information we
> need. Maybe I have misunderstood, though.

The relevant transformation is between data coordinates and pixel coordinates, so provided the transformation is from one rectilinear system to another, it should be possible to do the culling in either system. One way to do it would be by inserting nans in the x and y arrays (I am thinking in terms of numpy array-based algorithms), since nan is already supported in the backends as a do-not-draw marker.

This brings up the larger question of why we have "newstyle" (do the transformation in the backend) and oldstyle (transform before calling the backend) at all, and whether we still need both. My guess is that we don't: by consistently using numpy arrays for x and y, and by using the transformation function that works with these arrays, we should be able to do the transform equally efficiently in the backend or earlier. Making the backends consistent would take a bit of work, and perhaps some culling of backends that are neither widely used nor actively maintained.

Eric
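A sketch of the nan variant (the array handling and the neighbor test are assumptions, mirroring the mask sketch earlier in the thread):

    import numpy as np

    def cull_with_nan(x, y, xmin, xmax, ymin, ymax):
        x = np.array(x, dtype=float)   # copies, so the caller's data
        y = np.array(y, dtype=float)   # is left untouched
        outside = ((x < xmin) | (x > xmax)) & ((y < ymin) | (y > ymax))
        # blank a vertex only when its neighbors are outside too, so
        # segments crossing into the viewport keep their endpoints
        drop = outside.copy()
        drop[:-1] &= outside[1:]
        drop[1:] &= outside[:-1]
        x[drop] = np.nan               # backends already treat nan
        y[drop] = np.nan               # as a do-not-draw marker
        return x, y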