Algorithm for generating a fake, but feasible, dataset

Question 1

So, I was working on a dashboard for a potential customer and I needed a fake dataset with employee information for the demo. Mainly, I needed to know when the employee arrived at the company (first swipe), when he left (last swipe), and the hours he spent in the various areas of the company (ORC is one room, DSP is another, and so on).

Basically, I create a random time between 8am and 10am and assign it to the first swipe. I do the same for the last swipe, but in a range from 5pm to 7pm. Then I calculate how many hours he worked by subtracting one from the other.

With this information, I then calculate how many hours he spent in each area of the company. The ORC is the working room, so I want to keep between 50% and 80% of the hours worked there and randomly assign the rest to the other areas.

I spent a lot of time on this code, and it's been a while since I wrote it. It's not the most pythonic code you will ever see, but it worked :D

import calendar
import datetime
import random


def random_hour(start, end):
    hour_rand = random.randint(start, end)
    minutes_rand = random.randint(0, 59)
    return datetime.timedelta(hours=hour_rand, minutes=minutes_rand)


def random_weight(total_working_hours):
    working_hours_per_area = {
        'in_orc': 0,
        'in_cafe': 0,
        'in_dsp': 0,
        'in_kiosk': 0,
        'in_training': 0,
    }
    whole_time = 100
    total_time = datetime.timedelta()
    for i, area in enumerate(working_hours_per_area):
        if whole_time < 0:
            whole_time = 0
        if i == 0:
            rand_time = random.randint(50, 80)
        else:
            rand_time = random.randint(0, whole_time)
        whole_time -= rand_time
        working_hours_per_area[area] = rand_time / 100
    total_aux = sum(working_hours_per_area.values())
    if total_aux < 1.0:
        diff = 1.0 - total_aux
        min_hour = min(working_hours_per_area.keys(), key=(lambda k: working_hours_per_area[k]))
        working_hours_per_area[area] += diff
    for area in working_hours_per_area:
        working_hours_per_area[area] = working_hours_per_area[area] * total_working_hours
    return working_hours_per_area


employees = [
    ['CHI-123', 'CLOVIS TONELADA'],
    ['CHI-456', 'JOSE DA COVA'],
    ['CHI-789', 'EMERSON PEDREIRA'],
    ['CHI-321', 'GREYCE CROQUETE'],
    ['CHI-654', 'ROBERTO PINGA'],
    ['CHI-987', 'CAROLINA DOZE AVOS'],
]
days = []
cal = calendar.Calendar()
for week in cal.monthdatescalendar(2020,9):
    for day in week:
        if day.weekday() < 5:
            days.append(day)
f = open('dataset.csv', 'w+')
f.write('Date;Employee_Name;Employee_Code;First_Swipe;Last_Swipe;Total_Working_Hours;In_ORC;In_Cafe;In_DSP;In_Kiosk;In_Training\n')
for day in days:
    for i in range(0, 6):
        date = day
        employee_name = employees[i][1]
        employee_code = employees[i][0]
        first_swipe = random_hour(8, 10)
        last_swipe = random_hour(17, 19)
        total_working_hours = last_swipe - first_swipe
        working_hours_per_area = random_weight(total_working_hours)
        total_working_hours_per_area = datetime.timedelta(hours=0, minutes=0)
        locals().update(working_hours_per_area)
        write = ';'.join([str(date), employee_name, employee_code, str(first_swipe), str(last_swipe), str(total_working_hours), \
                          str(in_orc), str(in_cafe), str(in_dsp), str(in_kiosk), str(in_training)])
        f.write(write + '\n')
f.close()

Answer 1

You didn't include any specific request, so here are some general comments.

comments / documentation

You say it's been a while since you wrote it. When you look at the code now, are there places you ask yourself "why did I do that?" or where it takes time to figure out what is going on? If so, those are good places to add comments.

Also, docstrings could be added to the module and to the functions.

random_hour(start, end)

The write-up says it returns a random time between start and end. However, it actually returns a random timedelta between start and end + 59 minutes, so random_hour(8, 10) can return anything up to 10:59. Also, similar Python functions tend to include the start and exclude the end (e.g. random.randrange), so it would be good to document this behavior.
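One way to make the range explicit is to draw a single number of minutes and state the bounds in the docstring. This is only a sketch: the name random_time_of_day and the half-open range are assumptions chosen to mirror random.randrange, not the original function's behavior.

```python
import datetime
import random


def random_time_of_day(start_hour, end_hour):
    """Return a random timedelta in [start_hour:00, end_hour:00).

    The start is included and the end is excluded, mirroring
    random.randrange. The resolution is one minute.
    """
    # Hypothetical replacement for random_hour; the name is illustrative.
    start_minutes = start_hour * 60
    end_minutes = end_hour * 60
    minutes = random.randrange(start_minutes, end_minutes)
    return datetime.timedelta(minutes=minutes)


first_swipe = random_time_of_day(8, 10)   # somewhere from 8:00 to 9:59
last_swipe = random_time_of_day(17, 19)   # somewhere from 17:00 to 18:59
```

Drawing minutes in one call also avoids combining two independent random values, which is what lets the original overshoot the end hour.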

random_weight(total_working_hours)

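Dicts are only guaranteed to preserve insertion order from Python 3.7 onward. On earlier versions, i == 0 may not correspond to in_orc. It would be better to iterate over the keys and check whether key == 'in_orc'.

min_hour is calculated but never used. I think the next line is supposed to use min_hour as the key, where it currently uses area (which at that point is just the last key from the loop).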
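Both points could be addressed along these lines. This is a sketch under two assumptions: that the leftover share was meant to go to the area with the smallest share, and that integer percentages are acceptable (they avoid float rounding when checking that the shares add up to 100).

```python
import datetime
import random

AREAS = ['in_orc', 'in_cafe', 'in_dsp', 'in_kiosk', 'in_training']


def random_weight(total_working_hours):
    """Split total_working_hours (a timedelta) across AREAS.

    'in_orc' receives 50-80% of the time; the other areas share the
    rest in random proportions.
    """
    percent_left = 100
    percent_per_area = {}
    for area in AREAS:
        # Check the key itself instead of relying on its position.
        if area == 'in_orc':
            percent = random.randint(50, 80)
        else:
            percent = random.randint(0, percent_left)
        percent_left -= percent
        percent_per_area[area] = percent

    # Assumption: the unassigned remainder goes to the smallest share.
    smallest = min(percent_per_area, key=percent_per_area.get)
    percent_per_area[smallest] += percent_left

    return {area: total_working_hours * percent / 100
            for area, percent in percent_per_area.items()}


hours = random_weight(datetime.timedelta(hours=9))
```

Because in_orc always holds at least 50%, it can never be the smallest share, so the remainder never pushes it above 80%.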

module level code

It is common to put the top-level code in a function such as main(), and then call main() from code such as

if __name__ == '__main__':
    main()

csv module

The standard library includes the csv module for reading and writing CSV and other kinds of delimited text files. It takes care of escaping characters or enclosing strings in quotes if needed.
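As a rough illustration, csv.writer with delimiter=';' produces the same layout as the hand-built strings, and quotes a field when it contains the delimiter. The helper name to_delimited_text and the second data row are made up for this example.

```python
import csv
import io


def to_delimited_text(header, rows, delimiter=';'):
    """Render a header and rows as delimited text using csv.writer."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


header = ['Date', 'Employee_Name', 'Employee_Code']
rows = [
    ['2020-09-01', 'CLOVIS TONELADA', 'CHI-123'],
    # Made-up row: the name contains the delimiter, so it gets quoted.
    ['2020-09-01', 'SILVA; MARIA', 'CHI-000'],
]
text = to_delimited_text(header, rows)
print(text)
```

When writing to a real file instead of a StringIO, the csv documentation recommends opening it with newline='', for example open('dataset.csv', 'w', newline='').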

unpacking

Instead of using for i in range(0,6) to iterate over the employees, use something like:

for employee_code, employee_name in employees:
    ...

locals()

The Python documentation says the dictionary returned by locals() should not be modified; the changes may not be picked up by the interpreter. It only appears to work here because the code runs at module level.
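A simple alternative is to read the values straight from the dictionary in a fixed column order. The names AREA_COLUMNS and area_fields below are illustrative, and the example values are invented.

```python
import datetime

# Column order for the per-area fields in each output row.
AREA_COLUMNS = ['in_orc', 'in_cafe', 'in_dsp', 'in_kiosk', 'in_training']


def area_fields(working_hours_per_area):
    """Return the per-area values as strings, in column order."""
    return [str(working_hours_per_area[area]) for area in AREA_COLUMNS]


# Invented values standing in for the result of random_weight().
example = {
    'in_orc': datetime.timedelta(hours=5),
    'in_cafe': datetime.timedelta(minutes=30),
    'in_dsp': datetime.timedelta(hours=2),
    'in_kiosk': datetime.timedelta(hours=1),
    'in_training': datetime.timedelta(minutes=30),
}
line = ';'.join(area_fields(example))
print(line)  # 5:00:00;0:30:00;2:00:00;1:00:00;0:30:00
```

This keeps the data in one place and works the same inside a function, where locals().update() would silently have no effect.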

That's enough for now.

RootTwo · Accepted Answer · 2019-11-15 06:18:33Z
