I'm trying to turn a string into a list whose items alternate between words and the characters that separate them, like ['Hello', ' ', 'World'].
Is there a built-in function, an existing module, or a simpler way to achieve something like the code below? The set of characters used for parsing needs to be variable.
def parse_chars(string, chars):
    parse_set = {c for c in chars}
    string_list = []
    start = 0
    for index, char in enumerate(string):
        if char not in parse_set:
            # Flush the word collected so far, then keep the separator itself.
            if index - start > 0:
                word = string[start:index]
                string_list.append(word)
            string_list.append(char)
            start = index + 1
    # Append the trailing word, if the string does not end with a separator.
    document_len = len(string)
    if start != document_len:
        word = string[start:document_len]
        string_list.append(word)
    return string_list
filename = 'sample.txt'
with open(filename) as document_open:
    document_string = document_open.read()
alphanumeric = (list(map(chr, range(48, 58))) +    # '0'-'9'
                list(map(chr, range(65, 91))) +    # 'A'-'Z'
                list(map(chr, range(97, 123))))    # 'a'-'z'
print(parse_chars(document_string, alphanumeric))
[' ', 'A', ' ', 'space', ' ', 'then', ' ', '3', ' ', 'blank', ' ', 'lines', '\n', '\n', '\n', '3', ' ', 'blank', ' ', 'spaces', ' ', ' ', ' ', 'end']
1 Answer
The documentation for re.split says:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
For example:
>>> import re
>>> re.split('( )', 'hello world')
['hello', ' ', 'world']
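To make the effect of the parentheses concrete, here is a small side-by-side sketch. The variable names are only for illustration:

```python
import re

text = 'hello world'

# Without a capturing group, the matched separator is discarded.
without_group = re.split(' ', text)

# With a capturing group, the matched separator is kept in the result.
with_group = re.split('( )', text)

print(without_group)  # ['hello', 'world']
print(with_group)     # ['hello', ' ', 'world']
```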
If the string starts or ends with a separator, you get an empty string:
>>> re.split('( )', ' a b c ')
['', ' ', 'a', ' ', 'b', ' ', 'c', ' ', '']
You probably don't want these empty strings, so you should filter them out:
>>> [w for w in re.split('( )', ' a b c ') if w]
[' ', 'a', ' ', 'b', ' ', 'c', ' ']
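Empty strings also appear between two adjacent separators, which matters for input like the runs of spaces and newlines in the question's sample. A short sketch, using a fragment of that sample as the test string:

```python
import re

text = 'blank spaces   end'

# Each pair of adjacent spaces has an empty string between them.
raw = re.split('( )', text)

# Filtering on truthiness drops the empty strings but keeps every separator.
filtered = [w for w in raw if w]

print(raw)       # ['blank', ' ', 'spaces', ' ', '', ' ', '', ' ', 'end']
print(filtered)  # ['blank', ' ', 'spaces', ' ', ' ', ' ', 'end']
```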
So your parse_chars function would become:
[w for w in re.split('([^0-9A-Za-z])', string) if w]
For example:
>>> [w for w in re.split('([^0-9A-Za-z])', '10 green bottles!') if w]
['10', ' ', 'green', ' ', 'bottles', '!']
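Since the question asks for a variable set of characters, one possible way to keep the chars argument is to build the negated character class from it. This is only a sketch: the parameter name text, the empty-chars fallback, and the use of the string module constants are my own choices, not something from the question.

```python
import re
import string


def parse_chars(text, chars):
    """Split text into runs of characters from chars, keeping every
    other character as a separate one-character item."""
    if not chars:
        # Assumption: with no word characters, every character stands alone.
        return list(text)
    # re.escape protects characters such as ']', '^', '-' and backslash,
    # which would otherwise be special inside a character class.
    word_chars = re.escape(''.join(chars))
    separator = re.compile('([^' + word_chars + '])')
    return [w for w in separator.split(text) if w]


alphanumeric = string.ascii_letters + string.digits
print(parse_chars('10 green bottles!', alphanumeric))
# ['10', ' ', 'green', ' ', 'bottles', '!']
print(parse_chars('a-b_c', 'abc-'))
# ['a-b', '_', 'c']
```

Compiling the pattern once inside the function keeps the call signature from the question, at the cost of rebuilding the pattern on every call. If the same character set is reused many times, the compiled pattern could be created once outside the function instead.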