NOTE: Python 3.2
I want to make a Python script that receives simple C++ expressions as input and outputs the very same expressions as tokens.
I vaguely remember my course in compilation, and I need something far less complex than a compiler.
Examples
int& name1=arr1[place1];
int *name2= arr2[ place2];
should output
[ "int", "&", "name1", "=", "arr1", "[", "place1", "]" ]
[ "int", "*", "name2", "=", "arr2", "[", "place2", "]" ]
The spaces shouldn't matter, and I don't want them in the output.
This seems like a very simple task for someone who knows what they're doing, but I keep getting stray whitespace or splits in the wrong places.
I would greatly appreciate a quick solution for this. It really looks like a one-liner to me.
Note that I only need expressions like I showed here. Nothing fancy.
Thanks
- It's generally appreciated to show the code you already got. – Eli Korvigo, Aug 26, 2015 at 17:41
- @EliKorvigo I'm in a military environment that is closed to the world network. Can't get my code out. Anyway, I thought this would be an easy question that doesn't really need preliminary work. If it isn't, do tell. – Gulzar, Aug 26, 2015 at 17:43
- If these suggestions aren't working, try describing your algorithm since you can't post code. – Surreal Dreams, Aug 26, 2015 at 18:01
- You can probably repeatedly refine regular expressions to get an approximation to what you want. Or you could build a simple, readable and maintainable lexer using PLY or some similar Python library. I'd strongly suggest option 2. – rici, Aug 26, 2015 at 19:03
4 Answers
I'm not overly familiar with C++, but you could use re.findall with a character class of special characters:
import re
lines = """int& name1=arr1[place1];
int *name2= arr2[ place2];"""
for line in lines.splitlines():
    # Match one special character, or a run of word characters.
    print(re.findall(r"[\*\$\[\]&=]|\w+", line))
['int', '&', 'name1', '=', 'arr1', '[', 'place1', ']']
['int', '*', 'name2', '=', 'arr2', '[', 'place2', ']']
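The character class above lists only the operators that appear in the question's examples, so any other character (for instance `+` or `;`) is silently dropped. Here is a minimal sketch of a more permissive variant, assuming every non-space, non-word character should become its own token and that a trailing semicolon should be discarded. The names `TOKEN_RE` and `tokenize` are made up for illustration.

```python
import re

# Hypothetical helper, not part of the original answer.
# \w+      -> a run of letters, digits and underscores
# [^\w\s]  -> any single character that is neither a word character nor whitespace
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(line):
    """Split a simple C++ expression into word and single-character tokens."""
    # Assumption: a trailing ';' is a terminator, not a token we want to keep.
    return TOKEN_RE.findall(line.strip().rstrip(";"))

for line in ["int& name1=arr1[place1];", "int *name2= arr2[ place2];"]:
    print(tokenize(line))
```

Note that multi-character operators such as `->` or `==` would come out as two separate tokens with this pattern, which is fine for the expressions in the question but not for general C++.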
Looks to me like you need to define a list of "special/operator" characters. Replace each of those characters with itself plus a space of padding on either side, then use str.split() to turn the string into a list of "words". If you need a string representation, finish up with "', '".join(wordlist) and add "[ '" to the front and "' ]" to the end.
I'm almost certainly missing a few things, like looking for semicolons to strip off, or to use in breaking apart concatenated expressions. You weren't specific about how many expressions you'd read in at once. If you read in many at a time, you could split on the semicolon character, then iterate over the resulting list of expressions.
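The pad-and-split idea could be sketched roughly as follows. This assumes the operator set is just the characters from the question's examples and that input is split on semicolons first; `SPECIAL_CHARS` and `pad_and_split` are illustrative names, not anything from a library.

```python
# Assumption: this set of operator characters is illustrative, not exhaustive.
SPECIAL_CHARS = "&*=[]"

def pad_and_split(expression):
    """Tokenize by padding operator characters with spaces, then splitting."""
    expression = expression.strip().rstrip(";")
    for ch in SPECIAL_CHARS:
        expression = expression.replace(ch, " " + ch + " ")
    # str.split() with no argument splits on any run of whitespace,
    # so the extra padding never produces empty tokens.
    return expression.split()

source = "int& name1=arr1[place1]; int *name2= arr2[ place2];"
for expression in source.split(";"):
    if expression.strip():  # skip the empty piece after the last ';'
        print(pad_and_split(expression))
```

Because `split()` discards all whitespace, the original spacing of the input does not matter, which matches the requirement in the question.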
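The first step is to remove the spaces, that is, replace ' ' with ''. Then make a list of special characters or words, and surround each of them with a delimiter. Finally, split the line on the delimiter. Note that removing every space also fuses adjacent words, so this only works when an operator separates them, as in the question's examples. Here is an example:
import sys

for line in sys.stdin:
    line = line.strip().rstrip(';').replace(' ', '')
    for ch in '&*=[]':  # special characters from the question's examples
        line = line.replace(ch, ',' + ch + ',')
    a = [tok for tok in line.split(',') if tok]  # drop empty strings
    print(a)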
Here is a generator that might do the trick:
def parseCPP(line):
    line = line.rstrip(";")
    word = ""
    for i in line:
        if i.isalnum():
            word += i
        else:
            if word:
                yield word
                word = ""
            if i != " ":
                yield i
    # Flush a word that runs to the end of the line.
    if word:
        yield word
Note this just picks up consecutive strings of alphanumeric characters. Any non-space characters are assumed to be operators/tokens by themselves.
Hope this helps :)