Construct: (un)dumping bytes to hex

Question 1

This code is part of the Construct library. Docstrings explain what the code is supposed to do.

from construct.lib.py3compat import byte2int, int2byte, bytes2str, iteratebytes, iterateints
# Map an integer in the inclusive range 0-255 to its string byte representation
_printable = dict((i, ".") for i in range(256))
_printable.update((i, bytes2str(int2byte(i))) for i in range(32, 128))
def hexdump(data, linesize):
 r"""
 Turns bytes into a unicode string of the format:
 >>>print(hexdump(b'0' * 100, 16))
 0000 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0010 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0020 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0040 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0050 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0060 30 30 30 30 0000 
 """
 prettylines = []
 fmt = "%%04X %%-%ds %%s" % (3 * linesize - 1,)
 fmtlinesize = 7 + 3*linesize + 3 + linesize
 for i in range(0, len(data), linesize):
 line = data[i:i+linesize]
 hextext = " ".join('%02x' % b for b in iterateints(line))
 rawtext = "".join(_printable[b] for b in iterateints(line))
 prettylines.append(fmt % (i, str(hextext), str(rawtext)))
 if prettylines:
 prettylines[-1] = prettylines[-1].ljust(fmtlinesize)
 prettylines.append("")
 return "\n".join(prettylines)
def hexundump(data, linesize):
 r"""
 Reverse of ``hexdump()``.
 """
 raw = []
 fmtlinesize = 7 + 3*linesize + 3 + linesize
 for line in data.split("\n"):
 bytes = [int2byte(int(s,16)) for s in line[7:7+3*linesize].split()]
 raw.extend(bytes)
 return b"".join(raw)

some of the code refers to: https://github.com/construct/construct/blob/master/construct/lib/py3compat.py

Question 2

Could you write a summary of what this code is meant to do

Question 3

@Jamal Considering your meta answer meta.codereview.stackexchange.com/questions/1938/… I am surprised that you closed it.

Question 4

Both answers there state that they aren't off-topic, but the other answer (that you've linked to) is consistent with what I've done here.

Question 5

@Jamal The title and the docstring describe the code clearly enough to me. I've reopened the question.

Question 6

@ArekBulski: The docstring looks fine to me, but I think many users here like to have an introduction explaining what the code does and what you would like to get out of the review. You could say, "please read the docstring" or words to that effect.

Question 7

The construct library implements a Python 2/3 compatibility layer. Would it make sense to use a third-party library here like six? Even if you have good reasons for using your own compatibility library, it might make sense to use the same interface names as six, for example iterbytes instead of your iterateints. This would make it easier to collaborate with Python programmers who are used to the six interface.
The name iteratebytes is imported but not used.
There's no need to use bytes2str for compatibility here — int2byte(i).decode('ascii') works in both Python 2 and 3, since you know that 32 <= i < 128.
In fact, I doubt that you need a bytes2str compatibility shim at all — b.decode('latin1') works for all byte strings in both Python 2 and 3.
The _printable data structure is a dictionary mapping numbers between 0 and 255 to their string representation. This would be slightly simpler if implemented as a list (because then there would be no need to store the keys):
```
_printable = [int2byte(i).decode('ascii') if 32 <= i < 128 else '.'
 for i in range(256)]
```
I think the _printable name could be improved — perhaps _raw_repr to parallel rawtext?
If it makes sense to cache the raw representation of bytes (as you do in the _printable data structure, then it also makes sense to cache the hexadecimal representation:
```
_hex_repr = [format(i, '02X') for i in range(256)]
```
There does not seem to be any reason to make the docstring an r-string.
"Unicode" has a capital "U".

The doctest in the docstring does not pass.

$ python -mdoctest cr143734.py
**********************************************************************
File "cr143734.py", line 11, in cr143734.hexdump
Failed example:
 print(hexdump(b'0' * 100, 16))
Expected:
 0000 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0010 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0020 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0040 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0050 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0060 30 30 30 30 0000 
Got:
 0000 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0010 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0020 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0040 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0050 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0060 30 30 30 30 0000 
 <BLANKLINE>
**********************************************************************
1 items had failures:
 1 of 1 in cr143734.hexdump
***Test Failed*** 1 failures.

The computation of fmtlinesize is off by one. It uses 3*linesize but the previous line has 3 * linesize - 1.
Because of this off-by-one error, the last but one line of the output has an unnecessary extra space.
If there are more than 65535 bytes in data, then the formatting goes wrong:
```
FFF0 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
10000 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
```
Since the hexdump function is given a byte string (and not, say, an iterable or a file object) it is easy to work out the necessary number of digits in the address.
The first argument to hexundump should have a different name — this is not the data, it's the dump of the data.
It seems inelegant for hexundump to take a linesize argument when this can be deduced from the dump.

Question 8

For disclosure, I am the current maintainer of Construct and I did purge six from dependencies. Since the library already had both compatibility module and six, I just moved what I needed into pycompat and purged six. One less depencency. Import is copy pasted from __init__, skip that. Docstrings were not meant as doctests, I have a huge test suite for that.

Question 9

What I took: _printable made into a list (indexing faster than of dict), _hexprintable also added, supporting 04X and 08X offsets on left side (larger wont be printed ever), fmtlinesize removed, and finally linesize could be computed but the way library is made, that value is always available.

Question 10

@ArekBulski: The point of doctest is not just to use the docstrings as part of your test suite, but also to check that the examples are correct.

Gareth Rees Gareth Rees 50.1k3 gold badges130 silver badges210 bronze badges · Answer 1 · 2016-10-10 14:36:48Z

The construct library implements a Python 2/3 compatibility layer. Would it make sense to use a third-party library here like six? Even if you have good reasons for using your own compatibility library, it might make sense to use the same interface names as six, for example iterbytes instead of your iterateints. This would make it easier to collaborate with Python programmers who are used to the six interface.
The name iteratebytes is imported but not used.
There's no need to use bytes2str for compatibility here — int2byte(i).decode('ascii') works in both Python 2 and 3, since you know that 32 <= i < 128.
In fact, I doubt that you need a bytes2str compatibility shim at all — b.decode('latin1') works for all byte strings in both Python 2 and 3.
The _printable data structure is a dictionary mapping numbers between 0 and 255 to their string representation. This would be slightly simpler if implemented as a list (because then there would be no need to store the keys):
```
_printable = [int2byte(i).decode('ascii') if 32 <= i < 128 else '.'
 for i in range(256)]
```
I think the _printable name could be improved — perhaps _raw_repr to parallel rawtext?
If it makes sense to cache the raw representation of bytes (as you do in the _printable data structure, then it also makes sense to cache the hexadecimal representation:
```
_hex_repr = [format(i, '02X') for i in range(256)]
```
There does not seem to be any reason to make the docstring an r-string.
"Unicode" has a capital "U".

The doctest in the docstring does not pass.

$ python -mdoctest cr143734.py
**********************************************************************
File "cr143734.py", line 11, in cr143734.hexdump
Failed example:
 print(hexdump(b'0' * 100, 16))
Expected:
 0000 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0010 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0020 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0040 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0050 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0060 30 30 30 30 0000 
Got:
 0000 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0010 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0020 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0040 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0050 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
 0060 30 30 30 30 0000 
 <BLANKLINE>
**********************************************************************
1 items had failures:
 1 of 1 in cr143734.hexdump
***Test Failed*** 1 failures.

The computation of fmtlinesize is off by one. It uses 3*linesize but the previous line has 3 * linesize - 1.
Because of this off-by-one error, the last but one line of the output has an unnecessary extra space.
If there are more than 65535 bytes in data, then the formatting goes wrong:
```
FFF0 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
10000 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 0000000000000000
```
Since the hexdump function is given a byte string (and not, say, an iterable or a file object) it is easy to work out the necessary number of digits in the address.
The first argument to hexundump should have a different name — this is not the data, it's the dump of the data.
It seems inelegant for hexundump to take a linesize argument when this can be deduced from the dump.

For disclosure, I am the current maintainer of Construct and I did purge six from dependencies. Since the library already had both compatibility module and six, I just moved what I needed into pycompat and purged six. One less depencency. Import is copy pasted from __init__, skip that. Docstrings were not meant as doctests, I have a huge test suite for that.
What I took: _printable made into a list (indexing faster than of dict), _hexprintable also added, supporting 04X and 08X offsets on left side (larger wont be printed ever), fmtlinesize removed, and finally linesize could be computed but the way library is made, that value is always available.
@ArekBulski: The point of doctest is not just to use the docstrings as part of your test suite, but also to check that the examples are correct.

Stack Exchange Network

Construct: (un)dumping bytes to hex

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Construct: (un)dumping bytes to hex

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions