Message 139733
Author: Devin Jeanpierre
Recipients: Devin Jeanpierre
Date: 2011-07-04 03:58:16
Message-id: <1309751897.67.0.332272921906.issue12486@psf.upfronthosting.co.za>

Content:
tokenize only deals with bytes: tokenize.tokenize() expects a readline that returns bytes. Users might want to deal with unicode source (for example, if Python source is embedded in a document with an already-known encoding).
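For concreteness, here is a minimal sketch of the mismatch (my_oldreadline is just an io.StringIO stand-in for whatever unicode-producing readline the user already has, and the exact exception can vary across versions):

import io
import tokenize

# Stand-in for a readline that yields already-decoded (str) lines.
my_oldreadline = io.StringIO('x = 1\n').readline

try:
    list(tokenize.tokenize(my_oldreadline))
except (TypeError, AttributeError) as exc:
    # tokenize.tokenize() assumes bytes and fails while detecting the encoding.
    print('str readline rejected:', exc)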
The naive approach might be something like:
def my_readline():
    return my_oldreadline().encode('utf-8')
But this doesn't work for Python source that declares its encoding, which might be something other than utf-8. The only safe approaches are either to manually add a coding line yourself (there are lots of ways to do this; I picked a dumb one):
def my_readline_safe(was_read=[]):
    if not was_read:
        was_read.append(True)
        # Emit a coding line first so tokenize's encoding detection
        # matches the utf-8 re-encoding below.
        return b'# coding: utf-8\n'
    return my_oldreadline().encode('utf-8')

tokenstream = tokenize.tokenize(my_readline_safe)
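To make the failure mode of the naive version concrete, here is a small sketch (the latin-1 source is hypothetical, and my_oldreadline is again an io.StringIO stand-in); the declared latin-1 cookie wins, so tokenize decodes the utf-8 bytes wrongly, whereas my_readline_safe's forced utf-8 cookie would round-trip the text intact:

import io
import tokenize

# Hypothetical unicode source that declares a non-utf-8 encoding.
source = '# coding: latin-1\ns = "café"\n'
my_oldreadline = io.StringIO(source).readline

def my_readline():
    return my_oldreadline().encode('utf-8')

for tok in tokenize.tokenize(my_readline):
    if tok.type == tokenize.STRING:
        print(tok.string)   # '"cafÃ©"' -- utf-8 bytes decoded as latin-1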
Or you can use the same my_readline as before (no added coding line) and, instead of passing it to tokenize.tokenize, pass it together with the encoding to the undocumented _tokenize function:
tokenstream = tokenize._tokenize(my_readline, 'utf-8')
Or, ideally, you'd just pass the original readline that produces unicode into a utokenize function:
tokenstream = tokenize.utokenize(my_oldreadline)
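For what it's worth, here is a minimal sketch of what such a utokenize helper could look like, just generalizing the my_readline_safe trick above (utokenize does not exist in the stdlib; this is purely illustrative, and it shares the quirk that the injected coding line shifts reported line numbers by one):

import tokenize

def utokenize(readline):
    # readline is expected to return str (unicode) lines, like my_oldreadline.
    sent_cookie = [False]

    def bytes_readline():
        if not sent_cookie[0]:
            sent_cookie[0] = True
            # Force the detected encoding to match the re-encoding below.
            return b'# coding: utf-8\n'
        return readline().encode('utf-8')

    return tokenize.tokenize(bytes_readline)

tokenstream = utokenize(my_oldreadline)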