12
\$\begingroup\$

I have a Python script that given some configuration, generates a random CSV file that can be used for testing purposes.

I want to know if it adheres to the best Python and coding practices. It works for cases where I need upwards of 10K+ rows fast enough for my requirements, so I am not too worried about performance although inputs on performance are also appreciated.

Input:

  1. Schema: as a dict, information about each column name, data type and some other constraints (like fixed length/in a range/ from a given list)
  2. Number of rows
  3. Name of the output CSV file

Script:

import random as rnd
import csv
from abc import ABC, abstractmethod
# csv creator, creates a csv files with a given config
roundPrecision = 3
class BoundType(ABC):
 def __init__(self, dtype, params):
 self.dType = dtype
 self.params = params
 @abstractmethod
 def generate(self):
 pass
class FixedLength(BoundType):
 # params is length
 def generate(self):
 length = self.params.get("len", 1)
 if self.dType == "int":
 return rnd.randint(10 ** (length - 1), 10 ** length - 1)
 elif self.dType == "float":
 return FixedLength("int", self.params).generate() + round(rnd.random(), roundPrecision)
 elif self.dType == "string":
 alphabet = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
 word = [rnd.choice(alphabet) for _ in range(length)]
 return ''.join(word)
 else:
 return None
class FixedRange(BoundType):
 # params is range
 def generate(self):
 lo, hi = (self.params.get("lohi"))
 if self.dType == "int":
 return rnd.randint(lo, hi)
 elif self.dType == "float":
 return round(rnd.uniform(lo, hi), roundPrecision)
 else:
 return None
class FromPossibleValues(BoundType):
 # params is a list
 def generate(self):
 possibleval = self.params.get("set", set())
 return rnd.choice(possibleval)
def createcsv(rows, filename, schema):
 with open(f'./output/{filename}.csv', 'w', encoding='UTF8', newline='') as f:
 writer = csv.writer(f)
 writer.writerow(schema.keys())
 for _ in range(rows):
 writer.writerow([x.generate() for x in schema.values()])

Test:

from csvGen.csvGenerator import FixedLength, FixedRange, FromPossibleValues, createcsv
schema = {
 "col1": FixedLength("int", {"len": 5}),
 "col2": FixedLength("float", {"len": 5}),
 "col3": FixedLength("string", {"len": 5}),
 "col4": FixedRange("int", {"lohi": (10, 15)}),
 "col5": FixedRange("float", {"lohi": (5.5, 6.7)}),
 "col6": FromPossibleValues("int", {"set": [1, 2, 3, 4, 5]}),
 "col7": FromPossibleValues("int", {"set": [1.1, 2.2, 3.3]}),
 "col8": FromPossibleValues("int", {"set": ["A", "AB"]})
}
rows = 10
fileName = "eightVals"
createcsv(rows, fileName, schema)

This is what the output looks like for the given test :

col1 col2 col3 col4 col5 col6 col7 col8
51685 71830.471 PAXBK 12 6.192 1 2.2 AB
60384 42341.991 RHNUK 11 6.037 1 1.1 AB
73505 30997.171 DVOGT 10 6.69 5 2.2 A
60528 85072.731 FWWXW 10 5.761 1 2.2 A
23048 65401.245 EVPUX 13 6.474 4 1.1 AB
74748 66969.774 PEULP 15 6.546 3 2.2 AB
88763 34749.184 VOAUO 10 6.402 4 2.2 AB
77351 44566.163 JOBQF 13 5.683 1 2.2 AB
50820 73002.154 EACZT 15 5.711 1 1.1 AB
53037 89225.572 YTLBI 13 6.328 1 2.2 AB
Sᴀᴍ Onᴇᴌᴀ
29.5k16 gold badges45 silver badges202 bronze badges
asked Oct 17, 2022 at 14:51
\$\endgroup\$
1
  • 1
    \$\begingroup\$ @Peter, your edit was mostly good, but please don't change the code comments for grammar/spelling, as that's an aspect of review. \$\endgroup\$ Commented Oct 18, 2022 at 15:24

2 Answers 2

9
\$\begingroup\$

This seems overbuilt, and would also be more tersely expressed with Numpy/Pandas. Additionally, your manual loops - whereas you're satisfied with current performance - will be faster in vectorised form.

This program can be as simple as

import numpy as np
import pandas as pd
from string import ascii_uppercase
from numpy.random import default_rng
round_precision = 3
nrows = 10
rand = default_rng()
array_3 = rand.choice(tuple(ascii_uppercase), size=(nrows, 5))
df = pd.DataFrame({
 'col1': rand.integers(low=1e4, high=1e5, size=nrows),
 'col2': rand.uniform(low=1e4, high=1e5, size=nrows).round(decimals=round_precision),
 'col3': pd.DataFrame(array_3).agg(''.join, axis=1),
 'col4': rand.integers(low=10, high=16, size=nrows),
 'col5': rand.uniform(low=5.5, high=6.7, size=nrows).round(decimals=round_precision),
 'col6': rand.integers(low=1, high=6, size=nrows),
 'col7': rand.choice(np.linspace(start=1.1, stop=3.3, num=3), size=nrows),
 'col8': rand.choice(('A', 'AB'), size=nrows),
})
df.to_csv('eightVals.csv', index=None)

with output

col1,col2,col3,col4,col5,col6,col7,col8
90510,54337.357,TITJJ,12,6.254,2,2.2,A
82827,10498.364,MLZFY,11,5.881,5,2.2,A
81292,95873.98,UOHVL,14,5.931,2,1.1,AB
44278,69859.781,RGQVO,11,6.492,2,2.2,AB
26424,18106.356,XPCYV,15,6.176,5,2.2,AB
12324,52143.779,VUEWA,15,5.959,2,3.3,A
38774,34881.201,URXXL,15,6.545,2,2.2,AB
25801,31699.204,XEWGS,14,6.427,3,3.3,A
59555,98153.322,WYGBV,11,6.062,1,2.2,A
44449,86569.519,MDNXN,15,6.229,1,3.3,A

If this needs to be reusable (and I'm not entirely convinced that it is, given what you've shown), you can adapt the above to utility functions. I don't know that classes are necessary.

answered Oct 18, 2022 at 1:00
\$\endgroup\$
1
  • 1
    \$\begingroup\$ Storing the config separately makes a lot of sense in data pipelines IMHO. Classes don't seem necessary just for that indeed though. \$\endgroup\$ Commented Oct 18, 2022 at 12:53
9
\$\begingroup\$

I like how you started designing with generate method as interface for your generators. I like how you define your schema. I like also how the final loop and code works.

Really can't agree with the way you implemented your generate methods. First it was incredibly confusing and it took me unnecessarily too long to figure out, what are these classes about. To be more specific:

  • Why passing params as dict? Why not pass named parameters and make them instance variables? It would make code much more readable and easier to use. Only reason I can think about is that you could instantiate all those classes with same type of parameters (but you don't need to do that).
  • That big if in your generate method is code smell. Each of those if branches should be it's own class with it's own generate implementation.
  • Names FixedLength and FromPossibleValues can be better.

Some possibilities examples how to implement your classes better (imho):

class FixedLengthIntGenerator(BoundType):
 def __init__(self, length):
 self.length = length
 def generate(self):
 length = self.length
 return rnd.randint(10 ** (length - 1), 10 ** length - 1)
# TODO: other classes for string, float, range ...
class FromPresetChoicesGenerator(BoundType):
 def __init__(self, choices):
 self.choices = choices
 def generate(self):
 return rnd.choice(self.choices)

Yes it is more verbose, but it is more readable, extensible, different implementations are not so bound. Also it's not necessary to pass this "dType" anymore. Instead, you choose what implementation to use.

answered Oct 17, 2022 at 18:23
\$\endgroup\$
1
  • 3
    \$\begingroup\$ If you use a class having two methods, one of which is __init__, you should be using a function instead. Additionally using inheritance without a single call to super() or any of the parent class' features, is a big no no for me. \$\endgroup\$ Commented Oct 18, 2022 at 5:52

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.