I have this code that uses loops and random generators to create a random list of values (operators, strings, integers, floats, and new lines) and a corresponding list of the correct tokens, to test a lexer I have made. Currently it is estimated to take 1:50 (1 hour 50 minutes) to create 1 billion tokens, but I am wondering if it can be faster. I attempted to optimize it by using numpy random functions, but that resulted in a time estimate of 32 hours. Is it possible to improve performance? Here is my code (using tqdm for a progress bar and time estimate):
from tqdm import tqdm
import random
import string
nOp = {
    '+': 'add',
    '-': 'subtract',
    '*': 'multiply',
    '/': 'divide',
    '^': 'exponent',
    '/^': 'root',
    '%': 'modulo',
    '==': 'equal',
    '!': 'not',
    '>': 'greater',
    '<': 'less',
    '>=': 'greater_equal',
    '<=': 'less_equal',
    '&': 'and',
    '|': 'or',
    '%/': 'divisible'
}
strTuple = tuple((string.ascii_letters + (" " * 8) + "\"" + "'").split())
digitTuple = tuple(string.digits)
s = []
correct = []
for i in tqdm(range(1000000000)):
    # get a random number to determine the data type
    tn = random.randint(0, 16)
    # operator (random from dict above)
    if tn >= 0 and tn <= 3:
        op = random.choice(tuple(nOp.keys()))
        s.append(op.strip().replace("\\", "").replace("(?<= )", "").replace("(?= )", ""))
        correct.append(nOp[op])
    # string (random length and characters)
    elif tn >= 4 and tn <= 7:
        s.append('"' + "".join([random.choice(strTuple) for i in range(random.randint(0, 75))]) + '"')
        correct.append("string")
    # integer (random length and digits)
    elif tn >= 8 and tn <= 11:
        s.append(random.choice(["-", ""]) + "".join([random.choice(digitTuple) for i in range(random.randint(1, 10))]))
        correct.append("integer")
    # float (random length and digits before and after decimal place)
    elif tn >= 12 and tn <= 15:
        s.append(random.choice(["-", ""]) + "".join([random.choice(digitTuple) for i in range(random.randint(1, 10))]) + "." + "".join([random.choice(digitTuple) for i in range(random.randint(1, 10))]))
        correct.append("float")
    # new line
    elif tn == 16:
        s.append("\n")
        correct.append("EOL")
Comment (Reinderien): Will this even fit in memory? What will you do with the tokens once this loop is done?
2 Answers
You generate a random float with this line:
s.append(random.choice(["-", ""]) + "".join([random.choice(digitTuple) for i in range(random.randint(1, 10))]) + "." + "".join([random.choice(digitTuple) for i in range(random.randint(1, 10))]))
There is a function for what you're trying to do:
s.append(str(random.uniform(-10**10, 10**10)))
Same applies to the line where you generate integers.
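For example, a sketch along the same lines (these bounds are illustrative, not taken from your code):

s.append(str(random.randint(-10**10, 10**10)))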
s.append(op.strip().replace("\\", "").replace("(?<= )", "").replace("(?= )", ""))

It's unclear what you're trying to do here, since op doesn't contain any question marks (or backslashes), so every one of those replace() calls is a no-op. strip() is also redundant.
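Since none of those substrings can appear in a key of nOp, the whole chain reduces to just:

s.append(op)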
Instead of

for i in tqdm(range(1000000000)):
    tn = random.randint(0, 16)
    if tn >= 0 and tn <= 3:
        ...
    elif tn >= 4 and tn <= 7:
        ...
    ...

you can use choices with weights. This allows you to write it a bit cleaner:
for data_type in random.choices(
        ('operator', 'string', 'integer', 'float', 'new line'),
        (4, 4, 4, 4, 1),
        k=10
):
    if data_type == 'operator':
        ...
    if data_type == 'string':
        ...
    ...
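(The weights here are relative, so (4, 4, 4, 4, 1) reproduces your original split of 4/17 per type and 1/17 for newlines; k=10 is just a placeholder for your real token count.)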
You won't actually be able to store that much data in a list because of limited memory. Why do you need so much anyway?
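(For scale: a short Python str costs roughly 50 to 90 bytes of object overhead, so a billion of them in s, plus the matching tags in correct, would need on the order of 100 GB.)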
Comment (ShapeOfMatter): "You generate a random float with this line: ... There is a function for what you're trying to do: ..." If we assume (boldly, I know) that OP's code is doing what they want, then this probably isn't a good substitution. 1.0 and 1.0000000000 are the same "number", but it appears they're testing a parser of some kind, so it makes sense that they'd want redundant representations like this.
Formatting:

- Use a linter. Maybe you already are; this isn't bad, but there are a couple of things that are annoying to read which a linter would have caught.
- What is going on with the formatting of nOp?
- Put comments inside if blocks, the same way you put them inside functions.
Design principles:

- Use a main method.
- Parameterize key things like the number of items you want to generate and the RNG you want to use.
- Try using generators and streaming. The itertools package will help with this. It's easy to get carried away, and it rarely actually affects performance, but it's a nice flexible way to think about things (see the sketch after this list).
- When there's an existing tool that allows you to replace several operations with one, it's usually a good idea to use it:
  - Instead of [prng.choice(xs) for _ in range(n)], use prng.choices(xs, k=n).
  - Avoid string concatenation. It's basically fine for these situations, but f-strings are better for more complicated constructions, so just use them for everything.
  - Instead of matching ranges of numbers, just draw the random number from a non-uniform distribution using choices.
- Most of the time, an if/elif batch should also have an else that says what happens if they all fail.
- Especially when there are a lot of strings being used (which is itself often a red flag), declare them as constants (kinda like you do with digitTuple).
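To make the generator idea concrete, here is a rough sketch; the function names and the pared-down token mix are mine, not from your post:

import itertools
import random

def token_stream(rng):
    """Yield (text, tag) pairs forever instead of building a giant list."""
    while True:
        kind = rng.choices(('integer', 'EOL'), (4, 1))[0]
        if kind == 'integer':
            yield str(rng.randint(-10**10, 10**10)), 'integer'
        else:
            yield '\n', 'EOL'

def main(n=1_000_000, seed=0):
    rng = random.Random(seed)  # injectable, reproducible RNG
    for text, tag in itertools.islice(token_stream(rng), n):
        pass  # feed each (text, tag) to the lexer here instead of storing it

if __name__ == '__main__':
    main()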
Is this actually doing what you want?

- Should there be integer cases that start with a plus sign?
- Is it ok that negative zero will show up in this? Multiple leading zeros?
- What is op.strip().replace("\\", "").replace("(?<= )", "").replace("(?= )", "") doing? This probably shouldn't be there.
- tuple((string.ascii_letters + (" " * 8) + "\"" + "'").split()) yields ('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', '"\''); you probably wanted each character as its own option, right (a per-character version is sketched after this list)? Possibly including space. Did you mean for spaces to be weighted more heavily? But won't the double quotes inside the words then be indistinguishable from the ones you're wrapping around the words?
- It looks like you're randomly generating test cases for a parser? There's probably an off-the-shelf tool for that; I know the normal practice is to generate fewer test cases in a less random way, to make sure you're covering all the interesting edge cases instead of wasting time exploring homogeneous planes of randomness.
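If per-character sampling with extra weight on spaces was the intent, one guess at a fix (leaving the quote collision unsolved) would be:

strTuple = tuple(string.ascii_letters + " " * 8)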
The above changes have no effect on the performance that I've been able to discern.
- Precalculate even more stuff?
- If the nOp keys don't need further processing, just sample directly from nOp.items()?
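For instance, a sketch of the second idea, reusing the question's nOp:

ops = tuple(nOp.items())  # precomputed once, outside the loop
text, tag = random.choice(ops)  # one draw yields both the token and its label
s.append(text)
correct.append(tag)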
But still, no meaningful change...
I was able to eke out another 30% speed improvement (give or take measurement error) by just cutting out tqdm, but I don't know what good that does you.
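If you want to keep a rough progress readout anyway, one compromise (untested on this exact workload) is to make the bar refresh far less often via tqdm's miniters/mininterval parameters:

for i in tqdm(range(1000000000), miniters=10**7, mininterval=1.0):
    ...  # same loop body; the bar now updates at most once per 10M iterations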