Parse a string as a list of words and parse characters

Question 1

I'm trying to create a list from a string, with items alternating between words and parse characters, like ['Hello', ' ', 'World'].

Is there a built-in function, an existing module, or a simpler way to achieve something like the code below? I'd like the set of characters used for parsing to be variable.

sample.txt

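def parse_chars(string, chars):
    # Characters in `chars` make up words; anything else is a parse character.
    parse_set = {c for c in chars}
    string_list = []
    start = 0
    for index, char in enumerate(string):
        if char not in parse_set:
            # Emit the word (if any) that ended just before this character.
            if index - start > 0:
                word = string[start:index]
                string_list.append(word)
            string_list.append(char)
            start = index + 1
    # Emit the trailing word, if the string does not end with a parse character.
    document_len = len(string)
    if start != document_len:
        word = string[start:document_len]
        string_list.append(word)
    return string_list

filename = 'sample.txt'
with open(filename) as document_open:
    document_string = document_open.read()
# list(...) and print(...) keep this runnable on both Python 2 and Python 3.
alphanumeric = (list(map(chr, range(48, 58))) +    # '0'-'9'
                list(map(chr, range(65, 91))) +    # 'A'-'Z' (upper bound is exclusive)
                list(map(chr, range(97, 123))))    # 'a'-'z'
print(parse_chars(document_string, alphanumeric))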

[' ', 'A', ' ', 'space', ' ', 'then', ' ', '3', ' ', 'blank', ' ', 'lines', '\n', '\n', '\n', '3', ' ', 'blank', ' ', 'spaces', ' ', ' ', ' ', 'end']

Answer 1

The documentation for re.split says:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

For example:

>>> import re
>>> re.split('( )', 'hello world')
['hello', ' ', 'world']
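
To make the effect of the capturing group concrete, here is a small sketch comparing the same split with and without the parentheses. The variable names are illustrative only.

```python
import re

text = 'hello world'

# Without a group, the matched separators are discarded.
without_group = re.split(' ', text)

# With a capturing group, each matched separator is kept in the result.
with_group = re.split('( )', text)

print(without_group)  # ['hello', 'world']
print(with_group)     # ['hello', ' ', 'world']
```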

If the string starts or ends with a separator, you get an empty string:

>>> re.split('( )', ' a b c ')
['', ' ', 'a', ' ', 'b', ' ', 'c', ' ', '']

You probably don't want these empty strings, so you should filter them out:

>>> [w for w in re.split('( )', ' a b c ') if w]
[' ', 'a', ' ', 'b', ' ', 'c', ' ']

So your parse_chars function would become:

[w for w in re.split('([^0-9A-Za-z])', string) if w]

For example:

>>> [w for w in re.split('([^0-9A-Za-z])', '10 green bottles!') if w]
['10', ' ', 'green', ' ', 'bottles', '!']
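
Since the question asks for a variable set of characters, the one-liner can be wrapped in a function that builds the character class from its argument. This is only a sketch, written for Python 3: the parameter names `text` and `word_chars` are my own, and it assumes `word_chars` is a plain string of the characters that make up words. `re.escape` is used so that characters such as `]`, `^` or `-` stay literal inside the class.

```python
import re
import string


def parse_chars(text, word_chars):
    """Split text into runs of word_chars and single parse characters."""
    if not word_chars:
        # No word characters at all: every character is a parse character.
        return list(text)
    # Negated character class: matches any single character not in word_chars.
    separator = re.compile('([^' + re.escape(word_chars) + '])')
    # The capturing group keeps the separators; the filter drops empty strings.
    return [piece for piece in separator.split(text) if piece]


alphanumeric = string.digits + string.ascii_letters
print(parse_chars('10 green bottles!', alphanumeric))
# ['10', ' ', 'green', ' ', 'bottles', '!']
```

Compiling the pattern inside the function keeps the call signature the same as the original `parse_chars`; if the same character set is reused many times, the compiled pattern could be built once and passed around instead.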

Gareth Rees · Accepted Answer · 2015-05-20 12:28:46Z
