Efficient, built-in way to determine if string has non-ASCII chars outside ASCII 32-127, CRLF, Tab?

Mon Oct 31 20:01:14 EDT 2011

On 10/31/11 18:02, Steven D'Aprano wrote:
> # Define legal characters:
> LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
> # everybody forgets about formfeed... \f
> # and are you sure you want to include chr(127) as a text char?
>
> def is_ascii_text(text):
>     for c in text:
>         if c not in LEGAL:
>             return False
>     return True
>
> Algorithmically, that's as efficient as possible: there's no faster way
> of performing the test, although one implementation may be faster or
> slower than another. (PyPy is likely to be faster than CPython, for
> example.)

Additionally, if one has some foreknowledge of the character 
distribution, one might be able to tweak your
> def is_ascii_text(text):
>     legal = frozenset(LEGAL)
>     return all(c in legal for c in text)

with some if/else chain that might be faster than the hashing 
involved in a set lookup (emphasis on the *might*, not being an 
expert on CPython internals) such as
  def is_ascii_text(text):
      return all(
          (' ' <= c <= '\x7f') or
          c == '\n' or
          c == '\r' or
          c == '\t'
          for c in text)
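
Since I can only say it *might* be faster, the honest thing is to measure. Here is a rough sketch using the stdlib timeit module to compare the set lookup against the comparison chain. The names is_ascii_text_set, is_ascii_text_chain and compare are my own, made up for illustration, and I've included '\r' and '\f' in the chain so that both variants accept exactly the characters in LEGAL. Results will depend on the interpreter and on what the input looks like, so treat any numbers as local to your machine.

```python
import timeit

# Same character set as the quoted LEGAL, built once as a frozenset.
LEGAL = frozenset(''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f')


def is_ascii_text_set(text):
    """Set-lookup variant: one hash lookup per character."""
    legal = LEGAL
    return all(c in legal for c in text)


def is_ascii_text_chain(text):
    """Comparison-chain variant: range test first, then the odd ones out."""
    return all(
        (' ' <= c <= '\x7f') or
        c == '\n' or
        c == '\r' or
        c == '\t' or
        c == '\f'
        for c in text)


def compare(sample, number=1000):
    """Return total seconds for `number` runs of each variant over `sample`."""
    return {
        func.__name__: timeit.timeit(lambda: func(sample), number=number)
        for func in (is_ascii_text_set, is_ascii_text_chain)
    }


if __name__ == '__main__':
    # Mostly-printable sample; swap in data shaped like your real input.
    sample = 'The quick brown fox jumps over the lazy dog.\r\n' * 200
    for name, seconds in compare(sample).items():
        print('%-22s %.4f s' % (name, seconds))
```

Ordering the chain so the most common case is tested first is where any foreknowledge of the character distribution would come in.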
But Steven's main points are all spot on: (1) use an O(1) lookup; 
(2) return at the first sign of trouble; and (3) push it into the 
C implementation rather than a for-loop. (And the "locals are 
faster in CPython" tip is something I didn't know.)
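
To convince myself of point (2), here is a small sketch (Python 3, since it uses nonlocal) showing that all() really does stop at the first illegal character. The chars_examined helper is hypothetical, something I made up purely to count how many characters get inspected. Passing the set in as a default argument is just one way to avoid the global lookup; I haven't measured whether it makes a noticeable difference.

```python
# Same character set as the quoted LEGAL, as a frozenset for O(1) average lookups.
LEGAL = frozenset(''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f')


def is_ascii_text(text, legal=LEGAL):
    # all() drives the iteration and returns as soon as it sees a False.
    return all(c in legal for c in text)


def chars_examined(text, legal=LEGAL):
    """Hypothetical helper: count the characters inspected before all() stops."""
    seen = 0

    def checked():
        nonlocal seen
        for c in text:
            seen += 1
            yield c in legal

    all(checked())
    return seen
```

For example, chars_examined('ab\x00cd') returns 3: the NUL is the third character, and the trailing 'cd' is never looked at.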
-tkc