Issue 15958: bytes.join() should allow arbitrary buffer objects

➜

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/60162

classification

Title:	bytes.join() should allow arbitrary buffer objects
Type:	enhancement	Stage:	resolved
Components:	Interpreter Core	Versions:	Python 3.4

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	Nosy List:	Arfrever, ezio.melotti, glyph, pitrou, python-dev, serhiy.storchaka, skrah
Priority:	normal	Keywords:	patch

Created on 2012年09月17日 18:03 by pitrou, last changed 2022年04月11日 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
issue15958.diff	ezio.melotti, 2012年09月22日 08:00	Patch for bytes + tests	review
issue15958-2.diff	ezio.melotti, 2012年09月22日 09:13	review
issue15958-3.diff	ezio.melotti, 2012年09月22日 10:25	review
bytes_join_buffers.patch	pitrou, 2012年10月13日 19:24	review
bytes_join_buffers2.patch	pitrou, 2012年10月15日 19:08	review
bytes_join_buffers3.patch	pitrou, 2012年10月16日 17:53	review

Messages (27)
msg170618 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2012年09月17日 18:03
This should ideally succeed: >>> b''.join([memoryview(b'foo'), b'bar']) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: sequence item 0: expected bytes, memoryview found (ditto for bytearray.join)
msg170965 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2012年09月22日 08:00
Attached patch adds support for memoryviews to bytes.join: >>> b''.join([memoryview(b'foo'), b'bar']) b'foobar' The implementation currently has some duplication, because it does a first pass to calculate the total size to allocate, and another pass to create the result that calculates the individual sizes again. Py_SIZE(item) can't be used here for memoryviews, so during the first pass it's now necessary to check for memoryviews and extract the len from the memoryview buffer, and then do it again during the second pass. If necessary this could be optimized/improved. I also tried to check for multi-dimensional and non-contiguous buffers, but I didn't manage to obtain a failure so I removed the check for now. More tests should probably be added to cover these cases, and possibly the patch should be adjusted accordingly. If/when the patch is OK I'll do the same for bytearrays.
msg170974 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012年09月22日 08:51
I think memoryview here is example only, and Antoine had in mind arbitrary buffer objects.
msg170976 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2012年09月22日 09:13
Indeed. Attached new patch. Tests still need to be improved; bytearrays are still not changed.
msg170978 - (view)	Author: Stefan Krah (skrah) * (Python committer)	Date: 2012年09月22日 09:35
We would need to release the buffers and also check for format 'B'. With issue15958-2.diff this is possible: >>> import array >>> a = array.array('d', [1.2345]) >>> b''.join([b'ABC', a]) b'ABC\x8d\x97n\x12\x83\xc0\xf3?' It is unfortunate that a PyBUF_SIMPLE request does not guarantee 'B'. I think that's probably a mistake, but it's certainly existing practice.
msg170980 - (view)	Author: Stefan Krah (skrah) * (Python committer)	Date: 2012年09月22日 09:43
Also, perhaps we can keep a fast path for bytes and bytearray, but I didn't time the difference.
msg170982 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2012年09月22日 10:21
Well, given the following works: >>> import array >>> a = array.array('d', [1.2345]) >>> b'' + a b'\x8d\x97n\x12\x83\xc0\xf3?' It should also work for bytes.join(). I guess that means I'm against the strict-typedness of memoryviews. As the name suggests, it provides access to some memory area, not some structured array of data.
msg170983 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2012年09月22日 10:25
Attached new refleakless patch.
msg170984 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2012年09月22日 10:31
> Attached new refleakless patch. Your approach is dangerous, because the buffers may change size between two calls to PyObject_GetBuffer(). I think you should keep the Py_buffers alive in an array, and only release them at the end (it may also be slightly faster to do so). A nit: you are adding a lot of newlines in test_bytes.py.
msg171008 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012年09月22日 17:08
> I think you should keep the > Py_buffers alive in an array, and only release them at the end (it may > also be slightly faster to do so). However allocation of this array may considerably slow down the function. We may need the special-case for bytes and bytearray. Stop, and the bytearray (or bytearray subclass) can change size between two calls to Py_SIZE()?
msg172824 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2012年10月13日 19:24
Here is a patch with tests.
msg172828 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012年10月13日 20:30
Patch LGTM, however... $ ./python -m timeit -s "a=[b'a']*100000" "b','.join(a)" Vanilla: 3.69 msec per loop Patched: 11.6 msec per loop
msg172829 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2012年10月13日 20:38
> Patch LGTM, however... > > $ ./python -m timeit -s "a=[b'a']*100000" "b','.join(a)" > > Vanilla: 3.69 msec per loop > Patched: 11.6 msec per loop True. It is a bit of a pathological case, though.
msg172834 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012年10月13日 22:19
Here is a patch with restored performance. Is not it too complicated?
msg172836 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2012年10月13日 22:23
The problem with your approach is that the sequence could be mutated while another thread is running (_getbuffer() may release the GIL). Then the pre-computed size gets wrong.
msg172838 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012年10月13日 22:40
> The problem with your approach is that the sequence could be mutated while > another thread is running (_getbuffer() may release the GIL). Then the > pre-computed size gets wrong. Well, then I withdraw my patch. But what if the sequence will be mutated and PySequence_Size(seq) will become less seqlen? Then using PySequence_Fast_GET_ITEM() will be incorrect.
msg172839 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2012年10月13日 22:47
> But what if the sequence will be mutated and PySequence_Size(seq) will become > less seqlen? Then using PySequence_Fast_GET_ITEM() will be incorrect. Perhaps we should detect that case and raise, then.
msg172996 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2012年10月15日 19:08
Here is an updated patch using PySequence_Fast_GET_SIZE to avoid problems when the sequence is resized during iteration.
msg173064 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2012年10月16日 17:53
Here is a new patch checking that the sequence size didn't change. I also refactored the join() implementation into a shared function in stringlib.
msg173067 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012年10月16日 18:30
I added new comments. :-(
msg173069 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2012年10月16日 19:04
> I added new comments. :-( Thanks. I think I will commit after adding the missing #undef :-)
msg173071 - (view)	Author: Roundup Robot (python-dev) (Python triager)	Date: 2012年10月16日 19:09
New changeset 16285c1b4dda by Antoine Pitrou in branch 'default': Issue #15958: bytes.join and bytearray.join now accept arbitrary buffer objects. http://hg.python.org/cpython/rev/16285c1b4dda
msg173075 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2012年10月16日 19:11
Done now.
msg173079 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012年10月16日 19:23
Well done. However check at top of Objects/stringlib/join.h does not protect from using the file with asciilib or ucs1lib.
msg173080 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2012年10月16日 19:24
> However check at top of Objects/stringlib/join.h does not protect from using > the file with asciilib or ucs1lib. I'm not sure that's a problem. Someone would have to go out of their way to use join.h with only UCS1 unicode strings. Also tests would probably start failing, since str.join doesn't have the right signature :-)
msg173254 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012年10月18日 11:06
Yet one issue. You forgot to add join.h to BYTESTR_DEPS in Makefile.pre.in.
msg173290 - (view)	Author: Roundup Robot (python-dev) (Python triager)	Date: 2012年10月18日 19:33
New changeset 388e43bb519d by Antoine Pitrou in branch 'default': Followup to issue #15958: add join.h to Makefile dependencies for byte strings http://hg.python.org/cpython/rev/388e43bb519d

History
Date	User	Action	Args
2022年04月11日 14:57:36	admin	set	github: 60162
2013年01月22日 19:36:53	glyph	set	nosy: + glyph
2012年10月18日 19:33:20	python-dev	set	messages: + msg173290
2012年10月18日 11:06:50	serhiy.storchaka	set	messages: + msg173254
2012年10月16日 19:24:55	pitrou	set	messages: + msg173080
2012年10月16日 19:23:08	serhiy.storchaka	set	messages: + msg173079
2012年10月16日 19:11:34	pitrou	set	status: open -> closed resolution: fixed stage: patch review -> resolved
2012年10月16日 19:11:26	pitrou	set	messages: + msg173075
2012年10月16日 19:09:17	python-dev	set	nosy: + python-dev messages: + msg173071
2012年10月16日 19:04:53	pitrou	set	messages: + msg173069
2012年10月16日 18:30:02	serhiy.storchaka	set	messages: + msg173067
2012年10月16日 17:53:29	pitrou	set	files: + bytes_join_buffers3.patch messages: + msg173064
2012年10月15日 19:08:42	pitrou	set	files: + bytes_join_buffers2.patch messages: + msg172996
2012年10月13日 23:28:17	serhiy.storchaka	set	files: - bytes_join_buffers_2.patch
2012年10月13日 22:47:18	pitrou	set	messages: + msg172839
2012年10月13日 22:40:44	serhiy.storchaka	set	messages: + msg172838
2012年10月13日 22:23:25	pitrou	set	messages: + msg172836
2012年10月13日 22:19:09	serhiy.storchaka	set	files: + bytes_join_buffers_2.patch messages: + msg172834
2012年10月13日 20:38:17	pitrou	set	messages: + msg172829
2012年10月13日 20:30:28	serhiy.storchaka	set	messages: + msg172828
2012年10月13日 19:24:45	pitrou	set	files: + bytes_join_buffers.patch messages: + msg172824
2012年09月22日 17:08:59	serhiy.storchaka	set	messages: + msg171008
2012年09月22日 10:31:09	pitrou	set	messages: + msg170984
2012年09月22日 10:25:20	ezio.melotti	set	files: + issue15958-3.diff messages: + msg170983
2012年09月22日 10:21:42	pitrou	set	messages: + msg170982
2012年09月22日 09:43:03	skrah	set	messages: + msg170980
2012年09月22日 09:35:33	skrah	set	nosy: + skrah messages: + msg170978
2012年09月22日 09:13:56	ezio.melotti	set	files: + issue15958-2.diff messages: + msg170976
2012年09月22日 08:51:45	serhiy.storchaka	set	messages: + msg170974
2012年09月22日 08:00:24	ezio.melotti	set	files: + issue15958.diff nosy: + ezio.melotti messages: + msg170965 keywords: + patch stage: needs patch -> patch review
2012年09月18日 03:23:22	Arfrever	set	nosy: + Arfrever
2012年09月17日 18:23:48	serhiy.storchaka	set	nosy: + serhiy.storchaka
2012年09月17日 18:03:30	pitrou	create

homepage