I have this code that uses loops and random generators to create a random list of values (operators, strings, integers, floats, and new lines) and a corresponding list of the correct tokens, to test a lexer I have made. Currently it is estimated to take 1:50 (1 hour 50 minutes) to create 1 billion tokens, but I am wondering if it can be faster. I attempted to optimize it by using numpy random functions, but that resulted in a time estimate of 32 hours. Is it possible to improve performance? Here is my code (using tqdm for a progress bar and time estimate):
from tqdm import tqdm
import random
import string
nOp = {
    '+': 'add',
    '-': 'subtract',
    '*': 'multiply',
    '/': 'divide',
    '^': 'exponent',
    '/^': 'root',
    '%': 'modulo',
    '==': 'equal',
    '!': 'not',
    '>': 'greater',
    '<': 'less',
    '>=': 'greater_equal',
    '<=': 'less_equal',
    '&': 'and',
    '|': 'or',
    '%/': 'divisible'
}
strTuple = tuple((string.ascii_letters + (" " * 8) + "\"" + "'").split())
digitTuple = tuple(string.digits)
s = []
correct = []
for i in tqdm(range(1000000000)):
    # get a random number to determine the data type
    tn = random.randint(0, 16)
    # operator (random from dict above)
    if tn >= 0 and tn <= 3:
        op = random.choice(tuple(nOp.keys()))
        s.append(op.strip().replace("\\", "").replace("(?<= )", "").replace("(?= )", ""))
        correct.append(nOp[op])
    # string (random length and characters)
    elif tn >= 4 and tn <= 7:
        s.append('"' + "".join([random.choice(strTuple) for i in range(random.randint(0, 75))]) + '"')
        correct.append("string")
    # integer (random length and digits)
    elif tn >= 8 and tn <= 11:
        s.append(random.choice(["-", ""]) + "".join([random.choice(digitTuple) for i in range(random.randint(1, 10))]))
        correct.append("integer")
    # float (random length and digits before and after decimal place)
    elif tn >= 12 and tn <= 15:
        s.append(random.choice(["-", ""]) + "".join([random.choice(digitTuple) for i in range(random.randint(1, 10))]) + "." + "".join([random.choice(digitTuple) for i in range(random.randint(1, 10))]))
        correct.append("float")
    # new line
    elif tn == 16:
        s.append("\n")
        correct.append("EOL")
Comment (Reinderien): Will this even fit in memory? What will you do with the tokens once this loop is done?
2 Answers
You generate a random float with this line:
s.append(random.choice(["-", ""]) + "".join([random.choice(digitTuple) for i in range(random.randint(1, 10))]) + "." + "".join([random.choice(digitTuple) for i in range(random.randint(1, 10))]))
There is a function for what you're trying to do:
s.append(str(random.uniform(-10**10, 10**10)))
Same applies to the line where you generate integers.
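For example, a sketch along the same lines (these bounds are illustrative, not taken from your code):

s.append(str(random.randint(-10**10, 10**10)))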
s.append(op.strip().replace("\\", "").replace("(?<= )", "").replace("(?= )", ""))

It's unclear what you're trying to do here, since op doesn't contain any question marks (or backslashes), so every one of those replace() calls is a no-op. strip() is also redundant.
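Since none of those substrings can appear in a key of nOp, the whole chain reduces to just:

s.append(op)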
Instead of

for i in tqdm(range(1000000000)):
    tn = random.randint(0, 16)
    if tn >= 0 and tn <= 3:
        ...
    elif tn >= 4 and tn <= 7:
        ...
    ...

you can use choices with weights. This allows you to write it a bit cleaner:
for data_type in random.choices(
        ('operator', 'string', 'integer', 'float', 'new line'),
        (4, 4, 4, 4, 1),
        k=10
):
    if data_type == 'operator':
        ...
    if data_type == 'string':
        ...
    ...
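(The weights here are relative, so (4, 4, 4, 4, 1) reproduces your original split of 4/17 per type and 1/17 for newlines; k=10 is just a placeholder for your real token count.)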
You won't actually be able to store that much data in a list because of limited memory. Why do you need so much anyway?
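(For scale: a short Python str costs roughly 50 to 90 bytes of object overhead, so a billion of them in s, plus the matching tags in correct, would need on the order of 100 GB.)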
Comment (ShapeOfMatter): "You generate a random float with this line: ... There is a function for what you're trying to do: ..." If we assume (boldly, I know) that OP's code is doing what they want, then this probably isn't a good substitution. 1.0 and 1.0000000000 are the same "number", but it appears they're testing a parser of some kind, so it makes sense that they'd want redundant representations like this.
Formatting:

- Use a linter. Maybe you already are; this isn't bad, but there are a couple of things that are annoying to read which a linter would have caught.
- What is going on with the formatting of nOp?
- Put comments inside if blocks, the same way you put them inside functions.
Design principles:

- Use a main method.
- Parameterize key things like the number of items you want to generate and the RNG you want to use.
- Try using generators and streaming. The itertools package will help with this. It's easy to get carried away, and it rarely actually affects performance, but it's a nice flexible way to think about things (see the sketch after this list).
- When there's an existing tool that allows you to replace several operations with one, it's usually a good idea to use it:
  - Instead of [prng.choice(xs) for _ in range(n)], use prng.choices(xs, k=n).
  - Avoid string concatenation. It's basically fine for these situations, but f-strings are better for more complicated constructions, so just use them for everything.
  - Instead of matching ranges of numbers, just draw the random number from a non-uniform distribution using choices.
- Most of the time, an if/elif batch should also have an else that says what happens if they all fail.
- Especially when there are a lot of strings being used (which is itself often a red flag), declare them as constants (kinda like you do with digitTuple).
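To make the generator idea concrete, here is a rough sketch; the function names and the pared-down token mix are mine, not from your post:

import itertools
import random

def token_stream(rng):
    """Yield (text, tag) pairs forever instead of building a giant list."""
    while True:
        kind = rng.choices(('integer', 'EOL'), (4, 1))[0]
        if kind == 'integer':
            yield str(rng.randint(-10**10, 10**10)), 'integer'
        else:
            yield '\n', 'EOL'

def main(n=1_000_000, seed=0):
    rng = random.Random(seed)  # injectable, reproducible RNG
    for text, tag in itertools.islice(token_stream(rng), n):
        pass  # feed each (text, tag) to the lexer here instead of storing it

if __name__ == '__main__':
    main()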
Is this actually doing what you want?

- Should there be integer cases that start with a plus sign?
- Is it ok that negative zero will show up in this? Multiple leading zeros?
- What is op.strip().replace("\\", "").replace("(?<= )", "").replace("(?= )", "") doing? This probably shouldn't be there.
- tuple((string.ascii_letters + (" " * 8) + "\"" + "'").split()) yields ('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', '"\''); you probably wanted each character as its own option, right (a per-character version is sketched after this list)? Possibly including space. Did you mean for spaces to be weighted more heavily? But won't the double quotes inside the words then be indistinguishable from the ones you're wrapping around the words?
- It looks like you're randomly generating test cases for a parser? There's probably an off-the-shelf tool for that; I know the normal practice is to generate fewer test cases in a less random way, to make sure you're covering all the interesting edge cases instead of wasting time exploring homogeneous planes of randomness.
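If per-character sampling with extra weight on spaces was the intent, one guess at a fix (leaving the quote collision unsolved) would be:

strTuple = tuple(string.ascii_letters + " " * 8)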
The above changes have no effect on the performance that I've been able to discern.
- Precalculate even more stuff?
- If the nOp keys don't need further processing, just sample directly from nOp.items()?
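For instance, a sketch of the second idea, reusing the question's nOp:

ops = tuple(nOp.items())  # precomputed once, outside the loop
text, tag = random.choice(ops)  # one draw yields both the token and its label
s.append(text)
correct.append(tag)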
But still, no meaningful change...
I was able to eke out another 30% speed improvement (give or take measurement error) by just cutting out tqdm, but I don't know what good that does you.
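If you want to keep a rough progress readout anyway, one compromise (untested on this exact workload) is to make the bar refresh far less often via tqdm's miniters/mininterval parameters:

for i in tqdm(range(1000000000), miniters=10**7, mininterval=1.0):
    ...  # same loop body; the bar now updates at most once per 10M iterations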