Message 139733
Author: Devin Jeanpierre
Recipients: Devin Jeanpierre
Date: 2011-07-04 03:58:16
Message-id: <1309751897.67.0.332272921906.issue12486@psf.upfronthosting.co.za>

Content:
tokenize only deals with bytes: tokenize.tokenize() expects a readline that returns bytes. Users might want to deal with unicode source (for example, if Python source is embedded in a document with an already-known encoding).
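For concreteness, here is a minimal sketch of the mismatch (my_oldreadline is just an io.StringIO stand-in for whatever unicode-producing readline the user already has, and the exact exception can vary across versions):

import io
import tokenize

# Stand-in for a readline that yields already-decoded (str) lines.
my_oldreadline = io.StringIO('x = 1\n').readline

try:
    list(tokenize.tokenize(my_oldreadline))
except (TypeError, AttributeError) as exc:
    # tokenize.tokenize() assumes bytes and fails while detecting the encoding.
    print('str readline rejected:', exc)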
The naive approach might be something like:
def my_readline():
    return my_oldreadline().encode('utf-8')
But this doesn't work for Python source that declares its encoding, which might be something other than utf-8. The only safe approaches are either to manually add a coding line yourself (there are lots of ways to do this; I picked a dumb one):
def my_readline_safe(was_read=[]):
    if not was_read:
        was_read.append(True)
        # Emit a coding line first so tokenize's encoding detection
        # matches the utf-8 re-encoding below.
        return b'# coding: utf-8\n'
    return my_oldreadline().encode('utf-8')

tokenstream = tokenize.tokenize(my_readline_safe)
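To make the failure mode of the naive version concrete, here is a small sketch (the latin-1 source is hypothetical, and my_oldreadline is again an io.StringIO stand-in); the declared latin-1 cookie wins, so tokenize decodes the utf-8 bytes wrongly, whereas my_readline_safe's forced utf-8 cookie would round-trip the text intact:

import io
import tokenize

# Hypothetical unicode source that declares a non-utf-8 encoding.
source = '# coding: latin-1\ns = "café"\n'
my_oldreadline = io.StringIO(source).readline

def my_readline():
    return my_oldreadline().encode('utf-8')

for tok in tokenize.tokenize(my_readline):
    if tok.type == tokenize.STRING:
        print(tok.string)   # '"cafÃ©"' -- utf-8 bytes decoded as latin-1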
Or you can use the same my_readline as before (no added coding line) and, instead of passing it to tokenize.tokenize, pass it together with the encoding to the undocumented _tokenize function:
tokenstream = tokenize._tokenize(my_readline, 'utf-8')
Or, ideally, you'd just pass the original readline that produces unicode into a utokenize function:
tokenstream = tokenize.utokenize(my_oldreadline)
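For what it's worth, here is a minimal sketch of what such a utokenize helper could look like, just generalizing the my_readline_safe trick above (utokenize does not exist in the stdlib; this is purely illustrative, and it shares the quirk that the injected coding line shifts reported line numbers by one):

import tokenize

def utokenize(readline):
    # readline is expected to return str (unicode) lines, like my_oldreadline.
    sent_cookie = [False]

    def bytes_readline():
        if not sent_cookie[0]:
            sent_cookie[0] = True
            # Force the detected encoding to match the re-encoding below.
            return b'# coding: utf-8\n'
        return readline().encode('utf-8')

    return tokenize.tokenize(bytes_readline)

tokenstream = utokenize(my_oldreadline)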