Issue 18219: csv.DictWriter is slow when writing files with large number of columns

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/62419

classification

Title:	csv.DictWriter is slow when writing files with large number of columns
Type:	performance	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.7, Python 3.6

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	Nosy List:	Mariatta, hughdbrown, methane, mtraskin, peter.otten, python-dev, r.david.murray, serhiy.storchaka, terry.reedy, vstinner
Priority:	normal	Keywords:	patch

Created on 2013-06-15 05:12 by mtraskin, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded
csvdictwriter.patch	mtraskin, 2013-06-15 05:12
csvdictwriter.v2.patch	mtraskin, 2013-06-16 06:19
csvdictwriter.v3.patch	mtraskin, 2013-08-15 05:23
csvdictwriter.v4.patch	mtraskin, 2013-09-03 03:30
issue18219.patch	Mariatta, 2016-10-21 03:15
issue18219v2.patch	Mariatta, 2016-10-21 04:29
issue18219v3.patch	Mariatta, 2016-10-21 09:27
issue18219v4.patch	Mariatta, 2016-10-21 09:44
issue18219v5.patch	Mariatta, 2016-10-21 10:09
issue18219v6.patch	Mariatta, 2016-10-21 10:11
issue18219v7.patch	Mariatta, 2016-10-21 10:28
issue18219v8.patch	Mariatta, 2016-10-21 10:48
issue18219v9.patch	Mariatta, 2016-10-21 14:15

Pull Requests
URL	Status	Linked
PR 552	closed	dstufft, 2017-03-31 16:36

Messages (23)
msg191197	Author: Mikhail Traskin (mtraskin) *	Date: 2013-06-15 05:12
The _dict_to_list method of csv.DictWriter objects created with extrasaction="raise" looks each key up in the list of field names to check whether the current row has any unknown fields. This results in O(n^2) execution time and is very slow when a CSV file has many columns (hundreds or thousands). Replacing the look-up in a list with a look-up in a set solves the issue (see the attached patch).
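A minimal sketch of the complexity difference described above. It is not the attached patch; the helper names wrong_fields_list and wrong_fields_set are hypothetical and only mirror the membership check that the report attributes to _dict_to_list.

```python
def wrong_fields_list(rowdict, fieldnames):
    # Each `k not in fieldnames` scans the list, so a row with n keys
    # checked against n field names costs O(n^2) comparisons.
    return [k for k in rowdict if k not in fieldnames]


def wrong_fields_set(rowdict, fieldset):
    # Set membership is O(1) on average, so the whole check is linear.
    return [k for k in rowdict if k not in fieldset]


fieldnames = ["col%d" % i for i in range(2000)]
fieldset = set(fieldnames)

row = dict.fromkeys(fieldnames, 0)
row["unexpected"] = 1  # the only key that is not a known field

# Both variants report the same unknown field; only the cost differs.
assert wrong_fields_list(row, fieldnames) == wrong_fields_set(row, fieldset)
```

Timing the two helpers with timeit on a wide row should show the gap growing with the number of columns.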
msg191198	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2013-06-15 05:52
I think there is no need for a public fieldset property. Just use a private self._fieldset field in the private _dict_to_list() method.
msg191263	Author: Mikhail Traskin (mtraskin) *	Date: 2013-06-16 06:19
Either way is fine with me. If you prefer to avoid having a public fieldset property, please use the attached patch.
msg191604	Author: Terry J. Reedy (terry.reedy) * (Python committer)	Date: 2013-06-21 19:05
What is the purpose in touching fieldnames, either in tuple-izing it or in making it private and wrapped with a property? If someone wants to modify it, that is up to them. In any case, this change is not germane to the issue and could break code, so I would not make it.

wrong_fields could be calculated with

    any(k for k in rowdict if k not in self._fieldset)

to stop on the first extra, if any. That said, in 3.x, replacing

    wrong_fields = <long expression>
    if wrong_fields:

with

    if rowdict.keys() - self._fieldset:

should be even faster because the iteration, which will nearly always go to completion, is entirely in C (or whatever).

Does test/test_csv have tests for DictWriter, both with and without rowdict errors? If so, or if added, I would be willing to commit a patch that simply added ._fieldset and used it as above for a set difference.

Also, if you have not done so yet, please go to http://www.python.org/psf/contrib/ and http://www.python.org/psf/contrib/contrib-form/ (the new electronic form) and submit a contributor agreement. An '*' will appear after your name here when it is processed.
msg195233	Author: Mikhail Traskin (mtraskin) *	Date: 2013-08-15 05:23
> What is the purpose in touching fieldnames [...]

Wrapping the fieldnames property and tupleizing it guarantees that the fieldnames and _fieldset fields are consistent. Otherwise, having a separate _fieldset field means that someone who modifies the fieldnames field will not modify _fieldset. This will result in inconsistent DictWriter behavior. Normal DictWriter users (ones that do not modify fieldnames after the DictWriter was created) will not notice this wrapper. "Non-normal" DictWriter users will have their code broken, but that is better than having inconsistent internal data structures, since these errors are very hard to detect. If you insist on keeping the interface intact, then use the attached v3 of the patch: it creates a fieldset object every time the _dict_to_list method is executed. This does slow execution down, but performance is acceptable, just about 1.5 times slower than the version with the _fieldset field.

> wrong_fields could be calculated with [...]

I believe it is better to report all wrong fields at once. In addition, this optimization is meaningless, since usually, unless something is wrong, the field check will require a full scan of the rowdict.

> That said, in 3.x, replacing [...]

In 2.x the list comprehension version is faster than the set difference version. In 3.x the set difference is slightly faster (maybe 10% faster). However, the list comprehension works in both 2.x and 3.x, while the set difference requires different code for each. Hence I prefer sticking with the list comprehension.

> Does test/test_csv have tests [...]

No, there are no tests for wrong fields. Correct fields are already checked by the standard writing tests. I do not know how you write tests for exception handling. If you provide a link with instructions, I can write the missing test part.

> Also, if you have not done so yet, please go to [...]

I have already done this.
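To make the trade-off in this exchange concrete, here is a small sketch contrasting the two approaches: a comprehension that reports every unknown key, and an any() call that stops at the first one. The sample fieldset and rows are made up for illustration. The last lines show a caveat of the any(k for k in ...) spelling quoted above: it tests the truthiness of the keys themselves, so an unknown key such as "" or 0 would go unnoticed.

```python
fieldset = {"id", "name"}
row = {"id": 1, "name": "x", "": "blank key", "extra": 2}

# Reports every unknown key, in row order.
all_wrong = [k for k in row if k not in fieldset]

# Stops at the first unknown key, but only says that one exists.
has_wrong = any(k not in fieldset for k in row)

# Caveat: any(k for k in ...) evaluates the truthiness of each key,
# so a row whose only unknown key is falsy is not flagged.
only_blank = {"id": 1, "": "blank key"}
missed = any(k for k in only_blank if k not in fieldset)
caught = any(k not in fieldset for k in only_blank)
```

With the comprehension, the error message can list all offending keys at once, which is the behaviour argued for in the reply above.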
msg195245	Author: Peter Otten (peter.otten) *	Date: 2013-08-15 10:23
Note that set operations on dict views work with lists, too. So the only change necessary is to replace

    wrong_fields = [k for k in rowdict if k not in self.fieldnames]

with

    wrong_fields = rowdict.keys() - self.fieldnames

(A backport to 2.7 would need to replace keys() with viewkeys().)
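A short illustration of the point above, using made-up field names: in Python 3 a dict keys view supports set difference with any iterable on the right-hand side, including a plain list, and the result is an ordinary (unordered) set.

```python
fieldnames = ["id", "name", "email"]  # a plain list, not a set
rowdict = {"id": 1, "name": "x", "phone": "555"}

# dict.keys() returns a set-like view; subtracting a list is allowed.
wrong_fields = rowdict.keys() - fieldnames

# An empty result is falsy, so `if wrong_fields:` still works as a check.
no_wrong_fields = {"id": 1, "name": "x"}.keys() - fieldnames
```

Because the result is a set, the order in which unknown fields appear in an error message is not guaranteed to match the order of the row.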
msg196821	Author: Mikhail Traskin (mtraskin) *	Date: 2013-09-03 03:30
Peter, thank you for letting me know that views work with lists; I was not aware of this. This is indeed the best solution, and it also keeps the DictWriter interface unchanged. Terry, the attached patch contains the DictWriter change and a test case in test_csv.py.
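For readers wondering how such a test can be written, here is a hedged sketch using unittest.TestCase.assertRaises. It is not the test case from the attached patch; the class and method names are hypothetical. It relies only on documented behaviour: csv.DictWriter raises ValueError for unknown keys when extrasaction is "raise" (the default) and drops them when it is "ignore".

```python
import csv
import io
import unittest


class DictWriterExtraFieldsTest(unittest.TestCase):
    def test_extra_field_raises(self):
        # Default extrasaction="raise": an unknown key is an error.
        writer = csv.DictWriter(io.StringIO(), fieldnames=["a", "b"])
        with self.assertRaises(ValueError):
            writer.writerow({"a": 1, "b": 2, "c": 3})

    def test_extra_field_ignored(self):
        # extrasaction="ignore": the unknown key is silently dropped.
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=["a", "b"],
                                extrasaction="ignore")
        writer.writerow({"a": 1, "b": 2, "c": 3})
        self.assertEqual(out.getvalue(), "1,2\r\n")
```

Saved as a module, the tests can be run with python -m unittest.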
msg279058	Author: Hugh Brown (hughdbrown)	Date: 2016-10-20 17:25
I came across this problem today when I was using a 1000+ column CSV from a client. It was taking about 15 minutes to process each file. I found the problem and made this change:

    # wrong_fields = [k for k in rowdict if k not in self.fieldnames]
    wrong_fields = set(rowdict.keys()) - set(self.fieldnames)

And my processing time went down to 12 seconds per file -- a 75x speedup.

It's kind of sad that this change has been waiting for over three years when it is so simple. Any chance we could make one of the acceptable code changes and release it?
msg279101	Author: Mariatta (Mariatta) * (Python committer)	Date: 2016-10-21 03:15
Hello, please review my patch. I used set subtraction to calculate wrong_fields, added more test cases, and clarified the documentation with regard to the extrasaction parameter. Please let me know if this works. Thanks :)
msg279105	Author: Hugh Brown (hughdbrown)	Date: 2016-10-21 03:27
Fabulous. Looks great. Let's ship! It is not the optimal fix for 3.x platforms. A better fix would calculate the set of fieldnames only once in __init__ (or only as often as fieldnames is changed). But I stress that it is a robust change that works in versions 2.7 through 3.x for sure. And it is way better than the alternative of searching a list.
msg279107	Author: Mariatta (Mariatta) * (Python committer)	Date: 2016-10-21 04:20
Thanks Hugh,

Are you thinking of something like the following?

    class DictWriter:
        def __init__(self, f, fieldnames, restval="", extrasaction="raise",
                     dialect="excel", *args, **kwds):
            self._fieldnames = fieldnames    # list of keys for the dict
            self._fieldnames_set = set(self._fieldnames)

        @property
        def fieldnames(self):
            return self._fieldnames

        @fieldnames.setter
        def fieldnames(self, value):
            self._fieldnames = value
            self._fieldnames_set = set(self._fieldnames)

        def _dict_to_list(self, rowdict):
            if self.extrasaction == "raise":
                wrong_fields = rowdict.keys() - self._fieldnames_set
            ...

If so, I can work on another patch. Thanks.
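A runnable sketch of the caching idea proposed above, reduced to a hypothetical stand-in class (FieldChecker is not part of the csv module). It shows that reassigning fieldnames refreshes the cached set, and also that mutating the list in place bypasses the setter and leaves the cache stale, which is the kind of inconsistency raised earlier in this thread.

```python
class FieldChecker:
    """Hypothetical stand-in for the relevant part of DictWriter."""

    def __init__(self, fieldnames):
        self.fieldnames = fieldnames  # goes through the setter below

    @property
    def fieldnames(self):
        return self._fieldnames

    @fieldnames.setter
    def fieldnames(self, value):
        self._fieldnames = value
        self._fieldnames_set = set(value)  # rebuilt on every reassignment

    def wrong_fields(self, rowdict):
        return rowdict.keys() - self._fieldnames_set


checker = FieldChecker(["a", "b"])
before = checker.wrong_fields({"a": 1, "c": 3})

checker.fieldnames = ["a", "b", "c"]  # reassignment refreshes the cache
after = checker.wrong_fields({"a": 1, "c": 3})

checker.fieldnames.append("d")        # in-place mutation bypasses the setter
stale = checker.wrong_fields({"d": 4})
```

Computing the difference directly against the fieldnames list on each call avoids the stale cache, at the cost of doing the set work for every row.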
msg279108	Author: Hugh Brown (hughdbrown)	Date: 2016-10-21 04:24
Mariatta: Yes, that is what I was thinking of. That takes my 12-second execution time down to 10 seconds. (Or, at least, a fix I did of this nature had that effect -- I have not timed your patch but it should be the same.)
msg279109	Author: Mariatta (Mariatta) * (Python committer)	Date: 2016-10-21 04:29
Thanks, Hugh. Please check the updated patch :)
msg279115	Author: Inada Naoki (methane) * (Python committer)	Date: 2016-10-21 10:19
LGTM, Thanks Mariatta. (But one more LGTM from coredev is required for commit)
msg279116	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2016-10-21 10:22
issue18219v6.patch: LGTM, but I added a minor PEP 8 comment.

INADA Naoki: "LGTM, Thanks Mariatta. (But one more LGTM from coredev is required for commit)"

If you are confident (ex: if the change is simple, like this one), you can push it directly.
msg279117	Author: Mariatta (Mariatta) * (Python committer)	Date: 2016-10-21 10:28
Inada-san, Victor, thank you. Here is the updated patch.
msg279118	Author: Inada Naoki (methane) * (Python committer)	Date: 2016-10-21 10:37
> If you are confident (ex: if the change is simple, like this one), you can push it directly.

My mentor (Yury) prohibits it while I'm a beginner. And as you saw, I missed a PEP 8 violation :)
msg279119	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2016-10-21 10:45
> My mentor (Yury) prohibits it while I'm a beginner.

Oh right, trust your mentor more than me ;-)
msg279120	Author: Roundup Robot (python-dev) (Python triager)	Date: 2016-10-21 10:53
New changeset 1928074e6519 by INADA Naoki in branch '3.6':
Issue #18219: Optimize csv.DictWriter for large number of columns.
https://hg.python.org/cpython/rev/1928074e6519

New changeset 6f1602dfa4d5 by INADA Naoki in branch 'default':
Issue #18219: Optimize csv.DictWriter for large number of columns.
https://hg.python.org/cpython/rev/6f1602dfa4d5
msg279121	Author: Inada Naoki (methane) * (Python committer)	Date: 2016-10-21 10:54
committed.
msg279124	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2016-10-21 13:13
Shouldn't docs changes and new tests be added to 3.5?
msg279128	Author: R. David Murray (r.david.murray) * (Python committer)	Date: 2016-10-21 14:01
Serhiy: I know you prefer applying test changes to the maint version, and I don't disagree, but there are others who prefer not to, and we really don't have an official policy on it at this point. (We used to say no, a few years ago :)

The doc change looks wrong to me. It looks like an rst source paragraph was split into separate lines instead of being a flowed paragraph in the source? I don't understand why that was done.
msg279129	Author: Mariatta (Mariatta) * (Python committer)	Date: 2016-10-21 14:15
Thanks David. I uploaded a patch to address your concern with the docs. Can you please check?

Serhiy, with regard to applying the docs and tests to 3.5, does that require a different patch than what I have? Thanks.

History
Date	User	Action	Args
2022-04-11 14:57:46	admin	set	github: 62419
2017-03-31 16:36:23	dstufft	set	pull_requests: + pull_request969
2016-10-21 14:15:53	Mariatta	set	files: + issue18219v9.patch messages: + msg279129
2016-10-21 14:01:42	r.david.murray	set	nosy: + r.david.murray messages: + msg279128
2016-10-21 13:13:08	serhiy.storchaka	set	messages: + msg279124
2016-10-21 10:54:34	methane	set	status: open -> closed resolution: fixed messages: + msg279121 stage: commit review -> resolved
2016-10-21 10:53:41	python-dev	set	nosy: + python-dev messages: + msg279120
2016-10-21 10:48:17	Mariatta	set	files: + issue18219v8.patch
2016-10-21 10:45:16	vstinner	set	messages: + msg279119
2016-10-21 10:37:59	methane	set	messages: + msg279118
2016-10-21 10:28:01	Mariatta	set	files: + issue18219v7.patch messages: + msg279117
2016-10-21 10:22:26	vstinner	set	nosy: + vstinner messages: + msg279116
2016-10-21 10:19:35	methane	set	nosy: + methane messages: + msg279115 versions: - Python 3.5
2016-10-21 10:11:49	Mariatta	set	files: + issue18219v6.patch
2016-10-21 10:09:58	Mariatta	set	files: + issue18219v5.patch
2016-10-21 09:44:55	Mariatta	set	files: + issue18219v4.patch
2016-10-21 09:27:07	Mariatta	set	files: + issue18219v3.patch
2016-10-21 04:29:58	Mariatta	set	files: + issue18219v2.patch messages: + msg279109
2016-10-21 04:24:09	hughdbrown	set	messages: + msg279108
2016-10-21 04:20:11	Mariatta	set	messages: + msg279107
2016-10-21 03:27:23	hughdbrown	set	messages: + msg279105
2016-10-21 03:15:19	Mariatta	set	files: + issue18219.patch nosy: + Mariatta messages: + msg279101
2016-10-20 18:05:29	SilentGhost	set	stage: commit review versions: + Python 3.5, Python 3.6, Python 3.7, - Python 3.4
2016-10-20 17:25:36	hughdbrown	set	nosy: + hughdbrown messages: + msg279058
2013-09-03 03:30:23	mtraskin	set	files: + csvdictwriter.v4.patch messages: + msg196821
2013-08-15 10:23:17	peter.otten	set	nosy: + peter.otten messages: + msg195245
2013-08-15 05:23:29	mtraskin	set	files: + csvdictwriter.v3.patch messages: + msg195233
2013-06-21 19:05:40	terry.reedy	set	nosy: + terry.reedy messages: + msg191604 versions: + Python 3.4
2013-06-16 06:19:13	mtraskin	set	files: + csvdictwriter.v2.patch messages: + msg191263
2013-06-15 05:52:23	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg191198
2013-06-15 05:12:39	mtraskin	create