As a followup to the work on floats, I fixed rec2csv to deal with funky strings (strings with commas and quotes). I've checked this is as r4749, but I thought I'd announce here in case someone (John, in particular) was depending on some peculiar aspect of the old implementation. If there is something you depend on, can you update the unittests in unit/mlab_unit.py to check for the behavior you need? Also, John, I didn't see mlab.FormatString being used anywhere, but I didn't want to change it, either. So I made FormatString2 and used it. But if there's no known use of FormatString, lets kill the original and more FormatString2 into its place. -Andrew
On Dec 16, 2007 5:26 PM, Andrew Straw <str...@as...> wrote: > As a followup to the work on floats, I fixed rec2csv to deal with funky > strings (strings with commas and quotes). I've checked this is as r4749, > but I thought I'd announce here in case someone (John, in particular) > was depending on some peculiar aspect of the old implementation. If > there is something you depend on, can you update the unittests in > unit/mlab_unit.py to check for the behavior you need? > > Also, John, I didn't see mlab.FormatString being used anywhere, but I > didn't want to change it, either. So I made FormatString2 and used it. > But if there's no known use of FormatString, lets kill the original and > more FormatString2 into its place. I am happy to leave it as is, and fix it if I bump into anything. There was a use case at work that made me quote all the strings with " but I am not sure what it is right now so I will test with your version tomorrow and see if I can find any problems. There is one more thing I'd like to clean up in this code, and that is the handling of date and datetime. The current implementation will return datetime regardless of whether the string looks like a date or a datetime. Thus it would fail a roundtrip tests if you had a date field. I may make some changes tomorrow to support date or datetime. What do you think of making checkrows=0 the default for csv2rec? It is slower, but mostly guaranteed to do the right thing, whereas checkrows=5 which is the current default can often fail if the first 5 rows match some type but later rows do not, and may lead to confusion among new users. JDH
On Dec 16, 2007 9:10 PM, John Hunter <jd...@gm...> wrote: > On Dec 16, 2007 5:26 PM, Andrew Straw <str...@as...> wrote: > > > As a followup to the work on floats, I fixed rec2csv to deal with funky > > strings (strings with commas and quotes). I've checked this is as r4749, > > but I thought I'd announce here in case someone (John, in particular) > > was depending on some peculiar aspect of the old implementation. If > > there is something you depend on, can you update the unittests in > > unit/mlab_unit.py to check for the behavior you need? > > > > Also, John, I didn't see mlab.FormatString being used anywhere, but I > > didn't want to change it, either. So I made FormatString2 and used it. > > But if there's no known use of FormatString, lets kill the original and > > more FormatString2 into its place. > > I am happy to leave it as is, and fix it if I bump into anything. > There was a use case at work that made me quote all the strings with " > but I am not sure what it is right now so I will test with your > version tomorrow and see if I can find any problems. > > There is one more thing I'd like to clean up in this code, and that is > the handling of date and datetime. The current implementation will > return datetime regardless of whether the string looks like a date or > a datetime. Thus it would fail a roundtrip tests if you had a date > field. I may make some changes tomorrow to support date or datetime. I'm CCing Chris Burns, from Berkeley here, but it would be great if you post a note of these changes on the numpy list. One of the things to come out of the sprint was a lot of work on numpy i/o, and that includes looking into integrating much of the MPL facilities. So it would be great if you keep the numpy team posted, to make sure they get your latest code. Cheers. > > What do you think of making checkrows=0 the default for csv2rec? It > is slower, but mostly guaranteed to do the right thing, whereas > checkrows=5 which is the current default can often fail if the first 5 > rows match some type but later rows do not, and may lead to confusion > among new users.
On Dec 16, 2007 5:26 PM, Andrew Straw <str...@as...> wrote: > As a followup to the work on floats, I fixed rec2csv to deal with funky > strings (strings with commas and quotes). I've checked this is as r4749, > but I thought I'd announce here in case someone (John, in particular) > was depending on some peculiar aspect of the old implementation. If > there is something you depend on, can you update the unittests in > unit/mlab_unit.py to check for the behavior you need? OK, I found one problem with repr. When adding support for date vs datetime to the csv2rec roundtrip, and adding this to the unit test, I notice that repr is probably not what we want for datetime; str makes more sense here: In [1]: import datetime In [2]: d = datetime.date.today() In [3]: repr(d) Out[3]: 'datetime.date(2007, 12, 16)' In [4]: str(d) Out[4]: '2007-12-16' The latter is more natural in the CSV file, and the repr version is not supported by dateutil, at least not the one we are shipping: In [5]: import dateutil.parser In [6]: dateutil.parser.parse(repr(d)) --------------------------------------------------------------------------- ValueError Traceback (most recent call last) /Users/jdhunter/python/svn/matplotlib/examples/<ipython console> in <module>() ...snip the rest of the traceback... In [7]: dateutil.parser.parse(str(d)) Out[7]: datetime.datetime(2007, 12, 16, 0, 0) So I changed FormatObject to use str, pending further discussion. At least for my common use cases, the only obj types I have in my record arrays are dates and datetimes, and I find this to be a pretty compelling use case since it is the type least likely to be supported by other persistence methods (tostring and pickle both fail or do not behave as expected with datetimes in the recarray). But there is an oddity in the parsing of milliseconds which is causing the updated unit test to fail; the code below illustrates the problem: In [3]: import dateutil.parser In [4]: import datetime In [5]: s = '2007-12-18 22:29:34.924122' In [6]: dateutil.parser.parse(s) Out[6]: datetime.datetime(2007, 12, 18, 22, 29, 34, 924121) Thoughts? JDH
Thanks for looking at this. From the unittest that you checked in, it appears you've succeeded in getting dates and datetimes to work, modulo the apparent datetime precision buglet you mention below. My understanding with repr() is that, in general, it should attempt to support lossless copying of data with the eval() function. It does appear that dateutils does this -- repr() produces a string that looks like what you'd use to call the constructor. (Numpy scalars don't follow this idiom, and it's doubtful they could change without breaking a lot of code.) I also agree that for a CSV file, something like the str() on date and datetime objects makes most sense. This is all just to say that the date and datetime stuff looks good to my eye, but you're using that stuff a lot more than me. Finally, from your example, it does look like there's a bug somewhere in the datetime code. Since I don't use them, I leave that to you... Finally, as far as the checkrows -- I must admit that I wasn't even aware of its existence until you asked the question. (That's the problem with writing code that just works - no one notices it!) Now that you've posed the question, I suppose I'm in favor a default of 0 (which actually means infinity). I suppose no one in their right mind is going to use CSV files to store gigabytes of data, so parsing the whole thing for consistency seems like a worthwhile expenditure of cycles. (And if they're thinking about it, maybe the slowness with dissuade them, or at least cause them to read the docstrings! :) John Hunter wrote: > On Dec 16, 2007 5:26 PM, Andrew Straw <str...@as...> wrote: > > >> As a followup to the work on floats, I fixed rec2csv to deal with funky >> strings (strings with commas and quotes). I've checked this is as r4749, >> but I thought I'd announce here in case someone (John, in particular) >> was depending on some peculiar aspect of the old implementation. If >> there is something you depend on, can you update the unittests in >> unit/mlab_unit.py to check for the behavior you need? >> > > OK, I found one problem with repr. When adding support for date vs > datetime to the csv2rec roundtrip, and adding this to the unit test, I > notice that repr is probably not what we want for datetime; str makes > more sense here: > > In [1]: import datetime > > In [2]: d = datetime.date.today() > > In [3]: repr(d) > Out[3]: 'datetime.date(2007, 12, 16)' > > In [4]: str(d) > Out[4]: '2007-12-16' > > The latter is more natural in the CSV file, and the repr version is > not supported by dateutil, at least not the one we are shipping: > > In [5]: import dateutil.parser > > In [6]: dateutil.parser.parse(repr(d)) > --------------------------------------------------------------------------- > ValueError Traceback (most recent call last) > > /Users/jdhunter/python/svn/matplotlib/examples/<ipython console> in <module>() > > ...snip the rest of the traceback... > > In [7]: dateutil.parser.parse(str(d)) > Out[7]: datetime.datetime(2007, 12, 16, 0, 0) > > > So I changed FormatObject to use str, pending further discussion. At > least for my common use cases, the only obj types I have in my record > arrays are dates and datetimes, and I find this to be a pretty > compelling use case since it is the type least likely to be supported > by other persistence methods (tostring and pickle both fail or do not > behave as expected with datetimes in the recarray). > > But there is an oddity in the parsing of milliseconds which is causing > the updated unit test to fail; the code below illustrates the problem: > > In [3]: import dateutil.parser > > In [4]: import datetime > > In [5]: s = '2007-12-18 22:29:34.924122' > > In [6]: dateutil.parser.parse(s) > Out[6]: datetime.datetime(2007, 12, 18, 22, 29, 34, 924121) > > Thoughts? > > JDH > > ------------------------------------------------------------------------- > SF.Net email is sponsored by: > Check out the new SourceForge.net Marketplace. > It's the best place to buy or sell services > for just about anything Open Source. > http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace > _______________________________________________ > Matplotlib-devel mailing list > Mat...@li... > https://lists.sourceforge.net/lists/listinfo/matplotlib-devel >