My aim is to print the lines of a bytes string in Python 3, truncating each line to a maximum width so that it fits the current terminal window.
My first attempt was simply print(output[:max_width]), but this did not work because Python counts e.g. a tab (\t) as one character, while the terminal displays it as several columns. It would also act on carriage returns and other control characters.
Therefore I now have this snippet of code, where line is a bytes string:
output = line.decode(codec, "replace")
if max_width:
    output = "".join(c for c in output if c.isprintable())
    print(output[:max_width])
else:
    print(output)
However, I guess it's pretty slow to rebuild each line this way just to filter out non-printable characters like \t and \r (and whatever characters I might have forgotten).
Please note that codec is specified by the user. It might be "ascii", "utf-8", "utf-16" or any other valid built-in codec.
Can you please suggest how to improve this code's performance?
1 Answer
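Something that may help performance-wise is itertools.islice.
It lets you stop calling str.isprintable() as soon as max_width printable characters have been collected, instead of calling it once for every character in the line.
As this is a binary file that may not have many \n characters, individual lines can be very long, so this can save a lot of effort.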
import itertools

output = line.decode(codec, "replace")
if max_width:
    # islice stops pulling characters once max_width printable ones are found.
    print("".join(itertools.islice((c for c in output if c.isprintable()), max_width)))
else:
    print(output)
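To illustrate why islice helps on long lines, here is a minimal sketch that counts how many characters are actually inspected. The helper name truncate_printable and the counter argument are hypothetical and exist only for this demonstration; they are not part of the original code.

```python
import itertools

def truncate_printable(text, max_width, counter):
    """Return the first max_width printable characters of text.

    counter is a one-element list, used here only to count how many
    characters were tested with isprintable().
    """
    def printable_chars():
        for c in text:
            counter[0] += 1
            if c.isprintable():
                yield c
    # islice stops consuming the generator after max_width items.
    return "".join(itertools.islice(printable_chars(), max_width))

calls = [0]
long_line = "ab\tcd" * 20000  # 100,000 characters, no newline
result = truncate_printable(long_line, 8, calls)
print(result, calls[0])
```

On this input only a handful of characters are inspected, whereas filtering the whole string first would test all 100,000 of them before slicing.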
This on its own may not help on files that have a lot of \n characters.
The bottleneck in these files would most likely be the overhead incurred by calling print once per line.
It is then much faster to build a single string and display it once.
In these cases you would want to use something like the following (untested code):
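import itertools

def read_data(path):
    # codec and max_width are assumed to be defined elsewhere, as in the question.
    # Open in binary mode so that each line is bytes and can be decoded.
    with open(path, "rb") as f:
        for line in f:
            output = line.decode(codec, "replace")
            if max_width:
                yield "".join(itertools.islice(
                    (c for c in output if c.isprintable()),
                    max_width))
            else:
                yield output

print('\n'.join(read_data(...)))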
However, the above is not suitable for machines with limited memory or for extremely large files, because it builds the whole output in memory before printing. In these cases you would want to use a buffer and print the buffer when a threshold has been reached.
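A buffered variant could look roughly like the sketch below. The function name print_buffered, the threshold default of 64 KiB and the out parameter are all assumptions made for illustration; a sensible threshold would need to be measured.

```python
import io
import sys

def print_buffered(lines, threshold=64 * 1024, out=None):
    """Write lines to out, flushing whenever about threshold characters are pending."""
    out = sys.stdout if out is None else out
    buffer = []
    pending = 0
    for line in lines:
        buffer.append(line)
        pending += len(line) + 1  # +1 for the newline added on output
        if pending >= threshold:
            out.write("\n".join(buffer) + "\n")
            buffer.clear()
            pending = 0
    if buffer:
        # Write whatever is left after the last full batch.
        out.write("\n".join(buffer) + "\n")

# Small demonstration writing into an in-memory stream instead of the terminal.
demo = io.StringIO()
print_buffered((f"line {i}" for i in range(5)), threshold=16, out=demo)
print(demo.getvalue(), end="")
```

This keeps memory use bounded by roughly the threshold while still avoiding one write call per line.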
Since PEP 3138, your method of removing non-printable characters with str.isprintable() seems to be the correct way.
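As a quick check of what that filter does, here is a small sketch with a made-up sample string. It relies only on str.isprintable(), which treats control characters such as \t, \r and \n as non-printable but keeps the ASCII space and ordinary non-ASCII letters.

```python
raw = "ab\tc\r\nd \u00e9"  # contains a tab, a carriage return and a newline

# Keep only the characters that str.isprintable() accepts.
cleaned = "".join(c for c in raw if c.isprintable())
print(repr(cleaned))

# Individual characters, for reference.
for name, ch in [("tab", "\t"), ("carriage return", "\r"), ("space", " "), ("letter", "a")]:
    print(name, ch.isprintable())
```

Because this works on the decoded str, it should behave the same regardless of which codec was used to decode the bytes.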
codec may vary. It could be ascii as well as utf-16 or some ISO codec. Not sure which methods are suitable for that, but manually crafting a set of forbidden characters is probably difficult.
output = ''.join(c for c in output if c.isprintable()); print(output[:max_width] if max_width else output) ?"