Commit 88c69a4

Merge branch 'main' into 'main'
Day 48 See merge request postgres-ai/postgresql-consulting/postgres-howtos!17
2 parents adba15b + 66fdf02 commit 88c69a4

2 files changed: +317 -0 lines changed

0048_how_to_generate_fake_data.md

Lines changed: 316 additions & 0 deletions

@@ -0,0 +1,316 @@

Originally from: [tweet](https://twitter.com/samokhvalov/status/1723973291692691581), [LinkedIn post]().

---

# How to generate fake data

> I post a new PostgreSQL "howto" article every day. Join me in this
> journey – [subscribe](https://twitter.com/samokhvalov/), provide feedback, share!

## Simple numbers

Sequential numbers from 1 to 5:

```sql
nik=# select i from generate_series(1, 5) as i;
 i
---
 1
 2
 3
 4
 5
(5 rows)
```

5 random `BIGINT` numbers from 0 to 100:

```sql
nik=# select (random() * 100)::int8 from generate_series(1, 5);
 int8
------
   85
   61
   44
   70
   16
(5 rows)
```

Note that per the [docs](https://postgresql.org/docs/current/functions-math.html#FUNCTIONS-MATH-RANDOM-TABLE), `random()`

> uses a deterministic pseudo-random number generator. It is fast but not suitable for cryptographic applications...

We shouldn't use it for tasks such as token or password generation (for that, use the `pgcrypto` extension).
But it is okay to use it for purely random data generation (not for obfuscation).

Starting with Postgres 16, there is also
[random_normal(mean, stddev)](https://postgresql.org/docs/16/functions-math.html#FUNCTIONS-MATH-RANDOM-TABLE):

> Returns a random value from the normal distribution with the given parameters; `mean` defaults to 0.0 and `stddev`
> defaults to 1.0

Generating 1 million numbers and checking their distribution:

```sql
nik=# with data (r) as (
  select random_normal()
  from generate_series(1, 1000000)
)
select
  width_bucket(r, -3, 3, 5) as bucket,
  count(*) as frequency
from data
group by bucket
order by bucket;

 bucket | frequency
--------+-----------
      0 |      1374
      1 |     34370
      2 |    238786
      3 |    450859
      4 |    238601
      5 |     34627
      6 |      1383
(7 rows)
```
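
The bucketing done by `width_bucket(r, -3, 3, 5)` can also be reproduced outside the database. Below is a minimal Python sketch, not Postgres's implementation: the `width_bucket` helper is a hypothetical stand-in that mirrors the documented behavior (bucket 0 below the range, bucket `count + 1` at or above the upper bound), and `random.gauss` stands in for `random_normal()`.

```python
import random
from collections import Counter


def width_bucket(value: float, low: float, high: float, count: int) -> int:
    """Return the 1-based equal-width bucket for value in [low, high).

    Values below low go to bucket 0 and values at or above high go to
    bucket count + 1, which is why the query above returns 7 rows.
    """
    if value < low:
        return 0
    if value >= high:
        return count + 1
    return int((value - low) / (high - low) * count) + 1


rng = random.Random(42)  # fixed seed so the sketch is repeatable
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
histogram = Counter(width_bucket(r, -3, 3, 5) for r in samples)

for bucket in sorted(histogram):
    print(bucket, histogram[bucket])
```

As in the SQL output, the middle bucket should hold the largest share and the two overflow buckets (0 and 6) only a small fraction of the values.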

Again, neither `random()` nor `random_normal()` should be used as cryptographically strong random number generators –
for that, use `pgcrypto`. Otherwise, knowing one value, one could "guess" the next one:

```sql
nik=# set seed to 0.1234;
SET

nik=# select random_normal();
    random_normal
----------------------
 -0.32020779450641174
(1 row)

nik=# select random_normal();
   random_normal
--------------------
 0.8995226247227294
(1 row)

nik=# set seed to 0.1234; -- start again
SET

nik=# select random_normal();
    random_normal
----------------------
 -0.32020779450641174
(1 row)

nik=# select random_normal();
   random_normal
--------------------
 0.8995226247227294
(1 row)
```
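
The same reproducibility shows up in any seeded pseudo-random generator. Here is a minimal Python sketch using only the standard library (Python's `random` module, which is a different generator from the one in Postgres, so the values will not match the output above). Re-seeding replays the sequence, while the `secrets` module draws from the operating system's cryptographically strong source and is the kind of tool suited to tokens.

```python
import random
import secrets

rng = random.Random()

rng.seed(0.1234)
first_run = [rng.gauss(0.0, 1.0) for _ in range(2)]

rng.seed(0.1234)  # start again with the same seed
second_run = [rng.gauss(0.0, 1.0) for _ in range(2)]

# The seeded generator replays exactly the same values.
print(first_run == second_run)

# For tokens, use a cryptographically strong source instead.
token = secrets.token_hex(16)  # 16 random bytes, hex-encoded
print(len(token))
```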

## Timestamps, dates, intervals

Timestamps (with timezone) for January 2024, starting with 2024-01-01, with 7-day shift:

```sql
nik=# show timezone;
      TimeZone
---------------------
 America/Los_Angeles
(1 row)

nik=# select i from generate_series(timestamptz '2024-01-01', timestamptz '2024-01-31', interval '7 day') i;
           i
------------------------
 2024-01-01 00:00:00-08
 2024-01-08 00:00:00-08
 2024-01-15 00:00:00-08
 2024-01-22 00:00:00-08
 2024-01-29 00:00:00-08
(5 rows)
```

Generate 3 random timestamps for the previous week (useful for filling columns such as `created_at`):

```sql
nik=# select
  date_trunc('week', now())
  - interval '7 day'
  + format('%s day', (random() * 7))::interval
from generate_series(1, 3);

           ?column?
-------------------------------
 2023-10-31 00:50:59.503352-07
 2023-11-03 11:25:39.770384-07
 2023-11-03 13:43:27.087973-07
(3 rows)
```
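
The arithmetic in this query is: take the start of the current week, step back 7 days, then add a random fraction of 7 days. A minimal Python sketch of the same steps follows. The function name `random_timestamp_previous_week` is hypothetical, and the sketch assumes the week starts on Monday at 00:00, which is how `date_trunc('week', ...)` behaves.

```python
import random
from datetime import datetime, timedelta, timezone


def random_timestamp_previous_week(now: datetime, rng: random.Random) -> datetime:
    """Pick a random moment between last week's Monday 00:00 and this week's."""
    # Equivalent of date_trunc('week', now): Monday 00:00 of the current week.
    this_monday = (now - timedelta(days=now.weekday())).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    previous_monday = this_monday - timedelta(days=7)
    # Equivalent of format('%s day', random() * 7)::interval.
    return previous_monday + timedelta(days=rng.random() * 7)


rng = random.Random()
now = datetime.now(timezone.utc)
for _ in range(3):
    print(random_timestamp_previous_week(now, rng).isoformat())
```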

Generate a random birthdate for a person aged 18-100 years:

```sql
nik=# select
  (
    now()
    - format('%s day', 365 * 18)::interval
    - format('%s day', 365 * random() * 82)::interval
  )::date;
    date
------------
 1954-01-17
(1 row)
```
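
The birthdate query subtracts a fixed 18 "years" and then a random share of the remaining 82, with each year counted as 365 days. A minimal Python sketch of the same calculation is below. The function name `random_birthdate` is hypothetical, and Python's `date` arithmetic ignores the fractional part of a day, so results can differ from the SQL version by a day.

```python
import random
from datetime import date, timedelta


def random_birthdate(today: date, rng: random.Random) -> date:
    """Mirror the query: today - 365*18 days - a random share of 365*82 days."""
    minimum_age = timedelta(days=365 * 18)
    extra_age = timedelta(days=365 * rng.random() * 82)
    return today - minimum_age - extra_age


rng = random.Random()
print(random_birthdate(date.today(), rng))
```

Because a year is approximated as 365 days, `365 * 18` days falls four or five days short of 18 calendar years, so the youngest generated person can be slightly under 18. For test data this is usually acceptable.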

## Pseudowords

Generate a pseudoword consisting of 2-12 lowercase Latin letters:

```sql
nik=# select string_agg(chr((random() * 25)::int + 97), '')
from generate_series(1, 2 + (10 * random())::int);
 string_agg
------------
 yegwrsl
(1 row)

nik=# select string_agg(chr((random() * 25)::int + 97), '')
from generate_series(1, 2 + (10 * random())::int);
 string_agg
------------
 wusapjx
(1 row)
```

Generate a "sentence" consisting of 5 such words:

```sql
nik=# select string_agg(w, ' ')
from
  generate_series(1, 5) as i,
  lateral (
    select string_agg(chr((random() * 25)::int + 97), ''), i
    from generate_series(1, 2 + (10 * random())::int + i - i)
  ) as words(w);
             string_agg
-------------------------------------
 uvo bwp kcypvcnctui tn demkfnxruwxk
(1 row)
```

Note `LATERAL` and the trick with "idle" references to the outer generator (`, i` and `+ i - i`) to generate new random
values in each iteration.
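
For comparison, here is a minimal Python sketch of the same idea, using hypothetical helper names `pseudoword` and `pseudosentence`. It is not a literal translation: `randint` and `choice` pick uniformly, whereas the SQL version rounds `random() * 25` and `10 * random()` to the nearest integer, which makes the values at both ends of each range (the letters `a` and `z`, and the lengths 2 and 12) about half as likely as the others.

```python
import random
import string


def pseudoword(rng: random.Random, min_len: int = 2, max_len: int = 12) -> str:
    """Return min_len..max_len random lowercase Latin letters."""
    length = rng.randint(min_len, max_len)  # both bounds are inclusive
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def pseudosentence(rng: random.Random, word_count: int = 5) -> str:
    """Join word_count independently generated pseudowords with spaces."""
    # Each call to pseudoword() draws fresh random values, which is what the
    # idle references to i achieve in the LATERAL subquery.
    return " ".join(pseudoword(rng) for _ in range(word_count))


rng = random.Random()
print(pseudoword(rng))
print(pseudosentence(rng))
```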

## Normal words, names, emails, SSNs, etc. (Faker)

[Faker](https://faker.readthedocs.io/en/master/) is a Python library that enables you to generate fake data such as
names, addresses, phone numbers, and more. For other languages:

- [Faker for Ruby](https://github.com/faker-ruby/faker)
- [Faker for Go](https://github.com/go-faker/faker)
- [Faker for Java](https://github.com/DiUS/java-faker)
- [Faker for Rust](https://github.com/cksac/fake-rs)
- [Faker for JavaScript](https://github.com/faker-js/faker)

There are several options to use Faker for Python:

- A regular Python program with a Postgres connection (boring, but it works with any Postgres, including RDS).
- [postgresql_faker](https://gitlab.com/dalibo/postgresql_faker/)
- PL/Python functions.

Here, we'll demonstrate the last approach, using the "untrusted" version of PL/Python
(see [Day 47: How to install Postgres 16 with plpython3u](0047_how_to_install_postgres_16_with_plpython3u.md)). It is
not available on managed Postgres services such as RDS. Note that for this use case, the "trusted" version should suit too.

```sql
nik=# create or replace function generate_random_sentence(
  min_length int,
  max_length int
) returns text
as $$
  from faker import Faker
  import random

  if min_length > max_length:
    raise ValueError('min_length > max_length')

  fake = Faker()

  # Number of words in the sentence, both bounds inclusive
  sentence_length = random.randint(min_length, max_length)

  return ' '.join(fake.words(nb=sentence_length))
$$ language plpython3u;
CREATE FUNCTION

nik=# select generate_random_sentence(7, 15);
                            generate_random_sentence
---------------------------------------------------------------------------------
 operation day down forward foreign left your anything clear age seven memory as
(1 row)
```

A function to generate names, emails, and SSNs:

```sql
nik=# create or replace function generate_faker_data(
  data_type text,
  locale text default 'en_US'
)
returns text as $$
  from faker import Faker

  fake = Faker(locale)

  if data_type == 'email':
    result = fake.email()
  elif data_type == 'lastname':
    result = fake.last_name()
  elif data_type == 'firstname':
    result = fake.first_name()
  elif data_type == 'ssn':
    result = fake.ssn()
  else:
    raise Exception('Invalid type')

  return result
$$ language plpython3u;
CREATE FUNCTION

nik=# select
  locale,
  generate_faker_data('firstname', locale) as firstname,
  generate_faker_data('lastname', locale) as lastname
from
  (values ('en_US'), ('uk_UA'), ('it_IT')) as _(locale);
 locale | firstname | lastname
--------+-----------+-----------
 en_US  | Ashley    | Rodgers
 uk_UA  | Анастасія | Матвієнко
 it_IT  | Carolina  | Donatoni
(3 rows)

nik=# select generate_faker_data('ssn');
 generate_faker_data
---------------------
 008-47-2950
(1 row)

nik=# select generate_faker_data('email', 'es') from generate_series(1, 5);
   generate_faker_data
-------------------------
 isidoro42@example.net
 anselma04@example.com
 torreatilio@example.com
 natanael39@example.org
 teodosio79@example.net
(5 rows)
```

‎README.md

Lines changed: 1 addition & 0 deletions

@@ -74,6 +74,7 @@ As an example, first 2 rows:
 - 0045 [How to monitor xmin horizon to prevent XID/MultiXID wraparound and high bloat](./0045_how_to_monitor_xmin_horizon.md)
 - 0046 [How to deal with bloat](./0046_how_to_deal_with_bloat.md)
 - 0047 [How to install Postgres 16 with plpython3u: Recipes for macOS, Ubuntu, Debian, CentOS, Docker](./0047_how_to_install_postgres_16_with_plpython3u.md)
+- 0048 [How to generate fake data](./0048_how_to_generate_fake_data.md)
 - ...

 ## Contributors
