This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.
Created on 2010-05-27 23:13 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.
| Messages (27) | |||
|---|---|---|---|
| msg106625 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010-05-27 23:13 | |
readbuffer_encode() and charbuffer_encode() are not really encoders, nor are they related to encodings: they are related to PyBuffer. readbuffer_encode() uses the "s#" format and charbuffer_encode() uses the "t#" format to parse their arguments. Both functions were introduced by the creation of the _codecs module 10 years ago (r14660). I think that these functions should be removed. memoryview() should be used instead. Note: charbuffer_encode() is the last function using one of the "t" formats (t, t#, t*) in Python3. |
|||
| msg106626 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010-05-27 23:17 | |
A Google search doesn't show anything interesting: it looks like these functions were never used outside the Python test suite. I just noticed r41461: "Add tests for various error cases and for readbuffer_encode() and charbuffer_encode(). This increases code coverage in Modules/_codecsmodule.c from 83% to 95%." (4 years ago) |
|||
| msg106640 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010-05-28 07:54 | |
STINNER Victor wrote:
> readbuffer_encode() and charbuffer_encode() are not really encoders, nor are they related to encodings: they are related to PyBuffer. readbuffer_encode() uses the "s#" format and charbuffer_encode() uses the "t#" format to parse their arguments. Both functions were introduced by the creation of the _codecs module 10 years ago (r14660).
>
> I think that these functions should be removed. memoryview() should be used instead.
>
> Note: charbuffer_encode() is the last function using one of the "t" formats (t, t#, t*) in Python3.
Those two encoder functions were meant to be used by Python codec implementations which want to use the readbuffer and charbuffer interfaces available in Python via "s#" and "t#" to access input object data. They are not used by the builtin codecs, but may well be in use by 3rd party codecs. I'm not sure why you think those functions are not encoders.
|
|||
| msg106645 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010-05-28 11:14 | |
> Those two encoder functions were meant to be used by Python codec
> implementations which want to use the readbuffer and charbuffer
> interfaces available in Python via "s#" and "t#" to access input
> object data.
Ah ok.
> They are not used by the builtin codecs,
> but may well be in use by 3rd party codecs.
My quick Google search didn't find any of those. I suppose that str and bytes are enough for most people. Do you know of a use case for text or bytes stored in types other than str and bytes? (I suppose that bytearray is compatible with bytes, and so it can be used instead of bytes.)
> I'm not sure why you think those functions are not encoders.
I consider that the Python3 codecs module only encodes and decodes text to/from an encoding, whereas Python2 had extra unrelated codecs like "base64" or "hex" (but it was decided to remove them to clean up the codecs module).
|
|||
| msg106650 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010-05-28 11:35 | |
STINNER Victor wrote:
>> Those two encoder functions were meant to be used by Python codec
>> implementations which want to use the readbuffer and charbuffer
>> interfaces available in Python via "s#" and "t#" to access input
>> object data.
>
> Ah ok.
>
>> They are not used by the builtin codecs,
>> but may well be in use by 3rd party codecs.
>
> My quick Google search didn't find any of those. I suppose that str and bytes are enough for most people. Do you know of a use case for text or bytes stored in types other than str and bytes? (I suppose that bytearray is compatible with bytes, and so it can be used instead of bytes.)
Any Python object can expose a buffer interface and the above functions then allow accessing these interfaces from within Python. Think of e.g. memory mapped files, image/audio/video objects, database BLOBs, scientific data types, numeric arrays, etc. There are lots of such object types.
>> I'm not sure why you think those functions are not encoders.
>
> I consider that the Python3 codecs module only encodes and decodes text to/from an encoding, whereas Python2 had extra unrelated codecs like "base64" or "hex" (but it was decided to remove them to clean up the codecs module).
Those codecs will be reenabled in Python 3.2. Removing them was a mistake. The codec machinery is not limited to only working on Unicode and bytes. It can work on arbitrary type combinations, depending on what a codec wants to implement.
|
|||
| msg106653 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2010-05-28 12:19 | |
> Any Python object can expose a buffer interface and the above
> functions then allow accessing these interfaces from within
> Python.
What's the point? The codecs functions already support objects exposing the buffer interface:
>>> b = b"\xe9"
>>> codecs.latin_1_decode(memoryview(b))
('é', 1)
>>> codecs.latin_1_decode(array.array("b", b))
('é', 1)
Those two functions are undocumented. They serve no useful purpose (you can call the bytes(...) constructor instead, or even use the buffer object directly as shown above). They are badly named since they don't have anything to do with codecs. Google Code Search shows them appearing nowhere other than in implementations of the Python stdlib. Removing them seems only reasonable.
|
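The point about buffer objects can be checked with a minimal sketch. It assumes only standard-library calls (`codecs.latin_1_decode`, `bytes()`, `memoryview`, `array.array`); the variable names are illustrative and not taken from the thread.

```python
import array
import codecs

raw = b"\xe9"  # the single byte 0xE9, which is "é" in Latin-1

# Four different objects that all expose the buffer interface.
sources = [raw, bytearray(raw), memoryview(raw), array.array("B", raw)]

# latin_1_decode returns a (decoded_text, bytes_consumed) tuple.
decoded = [codecs.latin_1_decode(src) for src in sources]

# bytes() copies the contents of any buffer object into a new bytes object.
copies = [bytes(src) for src in sources]

print(decoded)
print(copies)
```

Every source decodes to the same tuple and copies to the same bytes object, which is the behaviour the message relies on when it suggests `bytes(...)` as a substitute.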
|||
| msg106656 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010-05-28 12:39 | |
Antoine Pitrou wrote:
>> Any Python object can expose a buffer interface and the above
>> functions then allow accessing these interfaces from within
>> Python.
>
> What's the point? The codecs functions already support objects exposing the buffer interface:
>
> >>> b = b"\xe9"
> >>> codecs.latin_1_decode(memoryview(b))
> ('é', 1)
> >>> codecs.latin_1_decode(array.array("b", b))
> ('é', 1)
>
> Those two functions are undocumented. They serve no useful purpose (you can call the bytes(...) constructor instead, or even use the buffer object directly as shown above). They are badly named since they don't have anything to do with codecs. Google Code Search shows them appearing nowhere other than in implementations of the Python stdlib. Removing them seems only reasonable.
readbuffer_encode and charbuffer_encode convert objects to bytes and provide a codec encoder interface for this, hence the naming. They are meant to be used as encode methods for codecs, just like the other *_encode functions exposed in the _codecs module, e.g.

    class BinaryDataCodec(codecs.Codec):

        # Note: Binding these as C functions will result in the class not
        # converting them to methods. This is intended.
        encode = codecs.readbuffer_encode
        decode = codecs.latin_1_decode

While it's possible to emulate the functions via other methods, these methods always introduce intermediate objects, which isn't necessary and only costs performance. Given that "t#" was basically rendered useless in Python3 (see issue8839), removing charbuffer_encode() is indeed possible, so
+1 on removing charbuffer_encode()
-1 on removing readbuffer_encode()
|
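The codec class described above can be exercised with a short sketch. It assumes a CPython 3 build where the undocumented `codecs.readbuffer_encode()` is still available; the sample data and variable names are made up for illustration.

```python
import codecs


class BinaryDataCodec(codecs.Codec):
    # Built-in C functions are not descriptors, so the class does not turn
    # them into bound methods: calling instance.encode(obj) passes obj as
    # the first (and only) positional argument.
    encode = codecs.readbuffer_encode
    decode = codecs.latin_1_decode


codec = BinaryDataCodec()

# Any buffer object works as input; a memoryview is used here.
encoded, consumed = codec.encode(memoryview(b"caf\xe9"))

# Decoding goes the other way and yields text, so the codec is asymmetric.
decoded, length = codec.decode(encoded)

print(encoded, consumed)
print(decoded, length)
```

The encoder returns the raw bytes together with their length, while the decoder returns a str, which is the asymmetry questioned later in the thread.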
|||
| msg106657 - (view) | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2010-05-28 12:45 | |
I’d be grateful if someone could post links to discussion about the removal of codecs like hex and rot13 and about their coming back. It may be useful for a NEWS entry too, not just for my personal curiosity ;) I’ll try to find them next week or so if nobody posts them before. Thanks. |
|||
| msg106658 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2010-05-28 13:02 | |
> class BinaryDataCodec(codecs.Codec):
>
>     # Note: Binding these as C functions will result in the class not
>     # converting them to methods. This is intended.
>     encode = codecs.readbuffer_encode
>     decode = codecs.latin_1_decode
What's the point, though? Creating a non-symmetrical codec doesn't sound like a very useful or recommendable thing to do. Especially in the py3k codec model where encode() only works on unicode objects.
> While it's possible to emulate the functions via other methods,
> these methods always introduce intermediate objects, which isn't
> necessary and only costs performance.
The bytes() constructor doesn't (shouldn't) create any more intermediate objects than read/charbuffer_encode() do. And all this doesn't address the fact that these functions have never been documented, and don't seem to be used in the outside world (understandably so, since there's no way to know about their existence, and their intended use).
|
|||
| msg106659 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010-05-28 13:09 | |
> I’d be grateful if someone could post links to discussion
> about the removal of codecs like hex and rot13
r55932 (~3 years ago): "Rip out all codecs that can't work in a unicode/bytes world: base64, uu, zlib, rot_13, hex, quopri, bz2, string_escape. However codecs.escape_encode() and codecs.escape_decode() still exist, as they are used for pickling str8 objects (so those two functions can go, when the str8 type is removed)." They were removed a year and a half before the Python 3.0 release.
> ... and about their coming back
Which coming back?
|
|||
| msg106660 - (view) | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2010-05-28 13:12 | |
Thanks for the link. Do you have a pointer to the PEP or ML thread discussing that change? "Which coming back?" Martin said these codecs are coming back in 3.2. |
|||
| msg106661 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010-05-28 13:18 | |
> Martin said these codecs are coming back in 3.2.
Oh, there is the issue #7485 where Martin wrote:
* 2009-12-10 23:15: "It was a mistake that they were integrated"
* 2009-12-12 19:25: "I would still be opposed to such a change (...) adding them would be really confusing."
|
|||
| msg106662 - (view) | Author: Walter Dörwald (doerwalter) * (Python committer) | Date: 2010-05-28 13:20 | |
> > I’d be grateful if someone could post links to discussion
> > about the removal of codecs like hex and rot13
> r55932 (~3 years ago):
That was my commit. ;)
> Thanks for the link. Do you have a pointer to the PEP or ML thread
> discussing that change?
The removal is documented here: http://www.artima.com/weblogs/viewpost.jsp?thread=208549
"""
We are adopting a slightly different approach to codecs: while in Python 2, codecs can accept either Unicode or 8-bits as input and produce either as output, in Py3k, encoding is always a translation from a Unicode (text) string to an array of bytes, and decoding always goes the opposite direction. This means that we had to drop a few codecs that don't fit in this model, for example rot13, base64 and bz2 (those conversions are still supported, just not through the encode/decode API).
"""
A post by Georg Brandl about this is at http://mail.python.org/pipermail/python-3000/2007-June/008420.html
(Note that this thread began in private email between Guido, MvL, Georg and myself. If needed I can dig up the emails.)
|
|||
| msg106663 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010-05-28 13:23 | |
> Oh, there is the issue #7485 where Martin wrote: Copy/paste failure: issue #7475. |
|||
| msg106664 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010-05-28 13:23 | |
Antoine Pitrou wrote:
>> class BinaryDataCodec(codecs.Codec):
>>
>>     # Note: Binding these as C functions will result in the class not
>>     # converting them to methods. This is intended.
>>     encode = codecs.readbuffer_encode
>>     decode = codecs.latin_1_decode
>
> What's the point, though? Creating a non-symmetrical codec doesn't sound
> like a very useful or recommendable thing to do.
Why not? If you're only interested in the binary data and don't care about the original input object type, that's a very natural thing to do. E.g. you could use a memory mapped file as input to the encoder. Would you really expect the codec to recreate such a file object when decoding the binary data?
> Especially in the py3k
> codec model where encode() only works on unicode objects.
That's a common misunderstanding. The codec system does not mandate a specific type combination. Only the helper methods .encode() and .decode() on bytes and str objects in Python3 do.
>> While it's possible to emulate the functions via other methods,
>> these methods always introduce intermediate objects, which isn't
>> necessary and only costs performance.
>
> The bytes() constructor doesn't (shouldn't) create any more intermediate
> objects than read/charbuffer_encode() do.
Looking at the code, the data takes quite a long path through the whole machinery. For non-Unicode objects, it always tries to create an integer and only if that fails reverts back to the buffer interface after a few more function calls. Furthermore, the bytes() constructor accepts a lot more objects than the "s#" parser marker, e.g. lists of integers, plain integers, arbitrary iterators, which a codec just interested in the binary representation of an object via the buffer interface most likely doesn't want to accept.
> And all this doesn't address the fact that these functions have never
> been documented, and don't seem to be used in the outside world
> (understandably so, since there's no way to know about their existence,
> and their intended use).
That's a documentation bug and probably the result of the fact that none of the exposed encoder/decoder APIs are documented.
|
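The difference in accepted input types can be made concrete with a small sketch. The helper `buffer_to_bytes()` is hypothetical (it is not part of the stdlib or of the thread) and stands in for a buffer-only conversion; unlike the "s#" marker discussed above, it also rejects str.

```python
def buffer_to_bytes(obj):
    """Copy the raw bytes out of any object exposing the buffer protocol."""
    # memoryview() raises TypeError for objects without a buffer interface,
    # so lists, plain integers, iterators and str are all rejected here.
    with memoryview(obj) as view:
        return view.tobytes()


def accepts(func, value):
    """Return True if func(value) succeeds, False if it raises TypeError."""
    try:
        func(value)
    except TypeError:
        return False
    return True


def make_samples():
    # Built fresh on every call because the iterator is consumed when used.
    return {
        "bytes": b"ab",
        "bytearray": bytearray(b"ab"),
        "list of ints": [97, 98],
        "plain int": 2,
        "iterator": iter([97, 98]),
    }


accepted_by_bytes = {k: accepts(bytes, v) for k, v in make_samples().items()}
accepted_by_buffer = {
    k: accepts(buffer_to_bytes, v) for k, v in make_samples().items()
}

for name in accepted_by_bytes:
    print(name, accepted_by_bytes[name], accepted_by_buffer[name])
```

`bytes()` accepts every sample, while the buffer-only helper accepts just the two objects that really expose a buffer, which is the stricter contract a binary codec may want.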
|||
| msg106665 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010-05-28 13:25 | |
STINNER Victor wrote:
>> Martin said these codecs are coming back in 3.2.
I said that and it was discussed on the python-dev mailing list a while back. We'll also add .transform() methods on bytes and str objects to access same-type codecs.
|
|||
| msg106666 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010-05-28 13:33 | |
> readbuffer_encode() and charbuffer_encode() are not really encoder
> nor related to encodings: they are related to PyBuffer
That was the initial problem: codecs is specific to encodings (in Python3), encodes str to bytes, and decodes bytes (or any read buffer) to str. I don't like the readbuffer_*encode* and *charbuffer_encode* function names, because they are different from the other codecs: they encode *bytes* to bytes (and not str to bytes). I think that these functions should be removed or moved somewhere else under a different name.
|
|||
| msg106667 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2010-05-28 13:35 | |
> > And all this doesn't address the fact that these functions have never
> > been documented, and don't seem used in the outside world
> > (understandably so, since there's no way to know about their existence,
> > and their intended use).
>
> That's a documentation bug and probably the result of the fact
> that none of the exposed encoder/decoder APIs are documented.
Are you planning to fix it? It is not obvious anybody else is able to properly document those functions.
|
|||
| msg106672 - (view) | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2010-05-28 13:52 | |
> I don't like the readbuffer_*encode* and *charbuffer_encode*
> function names, because they are different from the other codecs
"transform" as hinted by MvL seems perfect. Thanks everyone for the pointers here and in #7475! I’ll search for the missing one ("it was discussed on the python-dev mailing list a while back") later.
|
|||
| msg106693 - (view) | Author: Martin v. Löwis (loewis) * (Python committer) | Date: 2010-05-28 22:29 | |
> Martin said these codecs are coming back in 3.2.
I think you are confusing me with MAL. I remain opposed to adding them back. Users ought to use the modules that provide these conversions as functions.
|
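As an illustration of the module-level functions referred to here, a minimal sketch using the standard `binascii`, `base64` and `zlib` modules; the choice of these three modules and the sample data are assumptions made for the example.

```python
import base64
import binascii
import zlib

data = b"hello"

# Each conversion is an ordinary bytes-to-bytes function, no codec involved.
hex_form = binascii.hexlify(data)
b64_form = base64.b64encode(data)
compressed = zlib.compress(data)

# The matching inverse functions restore the original bytes.
restored = [
    binascii.unhexlify(hex_form),
    base64.b64decode(b64_form),
    zlib.decompress(compressed),
]

print(hex_form, b64_form, restored)
```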
|||
| msg107288 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010-06-07 22:48 | |
MAL agreed to remove the "t#" parsing format (#8839), whereas charbuffer_encode()'s main goal was to offer the "t#" parsing format to Python object space. charbuffer_encode() is now useless in Python3. bytes() accepts any buffer object (read-only and read/write buffer), so readbuffer_encode() became useless in Python3. readbuffer_encode() and charbuffer_encode() were never documented, and are not used by any 3rd party library. Can we remove these two functions? |
|||
| msg107307 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010-06-08 08:00 | |
STINNER Victor wrote:
> MAL agreed to remove the "t#" parsing format (#8839), whereas charbuffer_encode()'s main goal was to offer the "t#" parsing format to Python object space. charbuffer_encode() is now useless in Python3. bytes() accepts any buffer object (read-only and read/write buffer), so readbuffer_encode() became useless in Python3.
>
> readbuffer_encode() and charbuffer_encode() were never documented, and are not used by any 3rd party library.
>
> Can we remove these two functions?
Like I said before: We can remove charbuffer_encode() now and perhaps add it again later on when buffers have learned (again) to provide access to a text version of their data. In this case, we'd likely add t# back again as well. Please leave readbuffer_encode() as-is.
|
|||
| msg107318 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2010-06-08 12:42 | |
> Please leave readbuffer_encode() as-is. Then please add documentation for it. |
|||
| msg107319 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010-06-08 12:44 | |
Antoine Pitrou wrote:
>> Please leave readbuffer_encode() as-is.
>
> Then please add documentation for it.
Will do.
|
|||
| msg107363 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010-06-08 23:13 | |
r81854 removes codecs.charbuffer_encode() (and the t# parsing format) from Python 3.2 (blocked in 3.1: r81855).
--
My problem with codecs.readbuffer_encode() is that it accepts byte *and* character strings. If you want to get a byte string, just use bytes(input). If you want to convert a character string to a byte string, use input.encode("utf-8"). But accepting both types may lead to mojibake as we had in Python2.
MAL> That's a common misunderstanding. The codec system does not
MAL> mandate a specific type combination. Only the helper methods
MAL> .encode() and .decode() on bytes and str objects in Python3 do.
This is related to #7475: we have to decide if we completely drop this (currently unused) feature (e.g. remove codecs.readbuffer_encode()), or if we "reenable" this feature again (reintroduce the hex, bz2, rot13, ... codecs). This discussion should occur on the mailing list.
|
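The mojibake concern can be illustrated with a short sketch. The helper `to_bytes()` is hypothetical and only shows one way to keep the text encoding explicit; it is not an API proposed in the thread.

```python
def to_bytes(value, encoding="utf-8"):
    """Convert text or a buffer object to bytes with an explicit encoding."""
    if isinstance(value, str):
        # Text always goes through a named encoding.
        return value.encode(encoding)
    # Anything else must expose the buffer protocol, or TypeError is raised.
    return bytes(memoryview(value))


text = "é"
utf8_bytes = to_bytes(text)               # encoded as UTF-8
latin1_bytes = to_bytes(text, "latin-1")  # encoded as Latin-1

# Decoding with the wrong encoding silently produces garbage (mojibake).
mojibake = utf8_bytes.decode("latin-1")

# bytes() itself refuses a str unless an encoding is given.
try:
    bytes(text)
    bytes_rejects_str = False
except TypeError:
    bytes_rejects_str = True

print(utf8_bytes, latin1_bytes, mojibake, bytes_rejects_str)
```

The same character yields different byte strings depending on the encoding, so a function that silently picks one for str input hides exactly the choice that causes the garbled result.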
|||
| msg107373 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010-06-09 08:20 | |
STINNER Victor wrote:
> r81854 removes codecs.charbuffer_encode() (and the t# parsing format) from Python 3.2 (blocked in 3.1: r81855).
>
> --
>
> My problem with codecs.readbuffer_encode() is that it accepts byte *and* character strings. If you want to get a byte string, just use bytes(input). If you want to convert a character string to a byte string, use input.encode("utf-8"). But accepting both types may lead to mojibake as we had in Python2.
The point is to have an interface to the "s#" parser marker from Python. This accepts bytes, objects with a buffer interface and Unicode objects (via the default encoding). It does not accept e.g. lists, tuples or plain integers like bytes() does.
> MAL> That's a common misunderstanding. The codec system does not
> MAL> mandate a specific type combination. Only the helper methods
> MAL> .encode() and .decode() on bytes and str objects in Python3 do.
>
> This is related to #7475: we have to decide if we completely drop this (currently unused) feature (e.g. remove codecs.readbuffer_encode()), or if we "reenable" this feature again (reintroduce the hex, bz2, rot13, ... codecs). This discussion should occur on the mailing list.
We are not going to drop this design feature of the codec system and we've already had the discussion in 2008. The statement that it is an unused feature is plain wrong. Please don't forget that people are actually using these things in their applications, many of which have not been ported to Python3. We're not just talking about code that you find in CPython or the stdlib. The removed codecs will go back into 3.2.
|
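The distinction between the codec machinery and the str/bytes helper methods can be sketched as follows. The thread does not show how same-type codecs are reached; this example assumes a current CPython 3 release, where `codecs.encode()` and `codecs.decode()` accept the "hex_codec" and "rot_13" codec names.

```python
import codecs

data = b"hello"

# The generic codecs.encode()/decode() functions do not restrict types:
# hex_codec maps bytes to bytes, rot_13 maps str to str.
hex_encoded = codecs.encode(data, "hex_codec")
round_trip = codecs.decode(hex_encoded, "hex_codec")
rot13_text = codecs.encode("hello", "rot_13")

# The str.encode() helper method only accepts text encodings, so it is
# expected to refuse a bytes-to-bytes codec.
try:
    "hello".encode("hex_codec")
    str_encode_allows_hex = True
except (LookupError, TypeError):
    str_encode_allows_hex = False

print(hex_encoded, round_trip, rot13_text, str_encode_allows_hex)
```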
|||
| msg107807 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010-06-14 19:02 | |
This issue was about removing codecs.readbuffer_encode() and codecs.charbuffer_encode(). codecs.charbuffer_encode() was removed, but Marc-Andre explained that codecs.readbuffer_encode() should be kept. So I am closing this issue because there is nothing more to do on this topic. @lemburg: You still have to write some docs (and tests?) for codecs.readbuffer_encode() ;-) |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:01 | admin | set | github: 53084 |
| 2010-06-14 19:03:04 | vstinner | set | status: open -> closed resolution: fixed |
| 2010-06-14 19:02:42 | vstinner | set | messages: + msg107807 |
| 2010-06-09 08:20:32 | lemburg | set | messages: + msg107373 title: Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() -> Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() |
| 2010-06-08 23:13:33 | vstinner | set | messages: + msg107363 |
| 2010-06-08 12:44:27 | lemburg | set | messages: + msg107319 title: Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() -> Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() |
| 2010-06-08 12:42:15 | pitrou | set | messages: + msg107318 title: Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() -> Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() |
| 2010-06-08 08:00:05 | lemburg | set | messages: + msg107307 title: Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() -> Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() |
| 2010-06-07 22:48:44 | vstinner | set | messages: + msg107288 |
| 2010-05-28 22:29:27 | loewis | set | messages: + msg106693 title: Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() -> Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() |
| 2010-05-28 13:52:36 | eric.araujo | set | messages: + msg106672 title: Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() -> Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() |
| 2010-05-28 13:35:01 | pitrou | set | messages: + msg106667 title: Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() -> Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() |
| 2010-05-28 13:33:41 | vstinner | set | messages: + msg106666 |
| 2010-05-28 13:25:43 | lemburg | set | messages: + msg106665 title: Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() -> Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() |
| 2010-05-28 13:23:29 | lemburg | set | messages: + msg106664 title: Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() -> Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() |
| 2010-05-28 13:23:18 | vstinner | set | messages: + msg106663 |
| 2010-05-28 13:20:50 | doerwalter | set | nosy: + doerwalter messages: + msg106662 |
| 2010-05-28 13:18:42 | vstinner | set | messages: + msg106661 |
| 2010-05-28 13:12:04 | eric.araujo | set | messages: + msg106660 title: Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() -> Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() |
| 2010-05-28 13:09:57 | vstinner | set | messages: + msg106659 |
| 2010-05-28 13:02:21 | pitrou | set | messages: + msg106658 title: Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() -> Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() |
| 2010-05-28 12:45:25 | eric.araujo | set | nosy: + eric.araujo messages: + msg106657 |
| 2010-05-28 12:39:40 | lemburg | set | messages: + msg106656 title: Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() -> Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() |
| 2010-05-28 12:19:19 | pitrou | set | nosy: + loewis, pitrou messages: + msg106653 |
| 2010-05-28 11:35:25 | lemburg | set | messages: + msg106650 title: Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() -> Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() |
| 2010-05-28 11:14:56 | vstinner | set | messages: + msg106645 title: Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() -> Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() |
| 2010-05-28 07:54:17 | lemburg | set | nosy: + lemburg title: Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() -> Remove codecs.readbuffer_encode() and codecs.charbuffer_encode() messages: + msg106640 |
| 2010-05-27 23:17:21 | vstinner | set | messages: + msg106626 |
| 2010-05-27 23:13:35 | vstinner | create | |