
I would like to be able to generate random bytea fields of arbitrary length (<1Gb) for populating test data.

What is the best way of doing this?

Evan Carroll
asked Aug 15, 2012 at 10:08

3 Answers


Enhancing Jack Douglas's answer to avoid the need for PL/PgSQL looping and bytea concatenation, you can use:

CREATE OR REPLACE FUNCTION random_bytea(bytea_length integer)
RETURNS bytea AS $body$
 SELECT decode(string_agg(lpad(to_hex(width_bucket(random(), 0, 1, 256)-1),2,'0') ,''), 'hex')
 FROM generate_series(1, 1ドル);
$body$
LANGUAGE sql
VOLATILE
SET search_path = 'pg_catalog';
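
For example, a quick sanity check (the byte content varies per call, but the length is deterministic):

```sql
-- Length always matches the requested size; content is random each call.
SELECT length(random_bytea(16)); -- 16
SELECT random_bytea(4); -- e.g. a value like \x9f03c2d1 (varies)
```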

It's a simple SQL function that's cheaper to call than PL/PgSQL.

The change in aggregation method makes an immense difference for larger bytea values: the original function is actually up to 3x faster for sizes under 50 bytes, but this one scales much better as the output grows.
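The per-byte trick is the same in both versions: `width_bucket(random(), 0, 1, 256)` maps a uniform double in [0,1) to a bucket numbered 1..256, so subtracting 1 gives a uniform value 0..255, and `to_hex`/`lpad` turn that into exactly two hex digits for `decode(..., 'hex')`. A minimal illustration:

```sql
-- width_bucket(random(), 0, 1, 256) returns 1..256; minus 1 gives 0..255,
-- and lpad/to_hex always produce a two-character hex pair ('00'..'ff').
SELECT lpad(to_hex(width_bucket(random(), 0, 1, 256) - 1), 2, '0') AS one_random_hex_byte;
```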

Or use a C extension function:

I've implemented a random bytea generator as a simple C extension function. It's in my scrapcode repository on GitHub. See the README there.

It nukes the performance of the above SQL version:

regress=# \a
regress=# \o /dev/null
regress=# \timing on
regress=# select random_bytea(2000000);
Time: 895.972 ms
regress=# drop function random_bytea(integer);
regress=# create extension random_bytea;
regress=# select random_bytea(2000000);
Time: 24.126 ms
answered Aug 16, 2012 at 0:35
  • Well, I came up with nearly the same solution, but tested only for lower values. There @Jack's solution was a clear winner. +1 for you for not stopping here :) Commented Aug 16, 2012 at 4:35
  • Thank you - this is excellent and thought provoking. I think FROM generate_series(0, 1ドル); needs to be FROM generate_series(1, 1ドル);. Have you tried recursion? My limited testing implies that this scales better: Commented Aug 16, 2012 at 5:45
  • I tried symlinking /dev/urandom into /var/lib/pgsql/data and reading it with pg_read_file() for bonus crazy points, but unfortunately pg_read_file() reads text input via an encoding conversion, so it can't read bytea. If you really want max speed, write a C extension function that uses a fast pseudo-random number generator to produce binary data and wrap a bytea datum around the buffer :-) Commented Aug 16, 2012 at 6:29
  • @JackDouglas I couldn't help it. C extension version of random_bytea. github.com/ringerc/scrapcode/tree/master/postgresql/… Commented Aug 16, 2012 at 8:57
  • Another excellent answer! Actually one of the best I've seen so far. I haven't tested the extension, but I trust it works as advertised. Commented Aug 16, 2012 at 22:42

The pgcrypto extension has gen_random_bytes(count integer):

test=# create extension pgcrypto;
test=# select gen_random_bytes(16);
 gen_random_bytes
------------------------------------
 \xaeb98ae41489460c5292aafade4498ee
(1 row)

The create extension only needs to be done once.
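Note that gen_random_bytes() raises an error for counts above 1024, so for larger values you have to aggregate chunks. The wrapper below, random_bytea_pgcrypto, is a hypothetical sketch of that approach (not part of pgcrypto itself), stitching 1024-byte pieces together with string_agg:

```sql
-- Hypothetical wrapper: works around gen_random_bytes's 1024-byte-per-call
-- limit by concatenating full 1024-byte chunks plus a final partial chunk.
CREATE OR REPLACE FUNCTION random_bytea_pgcrypto(n integer)
RETURNS bytea
LANGUAGE sql VOLATILE AS $$
  SELECT string_agg(gen_random_bytes(least(1024, n - (chunk - 1) * 1024)), ''::bytea)
  FROM generate_series(1, (n + 1023) / 1024) AS chunk;
$$;

SELECT length(random_bytea_pgcrypto(3000)); -- 3000
```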

answered Mar 24, 2020 at 9:06
  • That's the right answer as long as you need <=1024 bytes at a time. Commented Apr 24, 2023 at 17:06

I would like to be able to generate random bytea fields of arbitrary length

This function will do it, but 1Gb will take a long time because it does not scale linearly with output length:

create function random_bytea(p_length in integer) returns bytea language plpgsql as $$
declare
 o bytea := '';
begin 
 for i in 1..p_length loop
 o := o||decode(lpad(to_hex(width_bucket(random(), 0, 1, 256)-1),2,'0'), 'hex');
 end loop;
 return o;
end;$$;

output test:

select random_bytea(2);
/*
|random_bytea|
|:-----------|
|\xcf99 |
*/
select random_bytea(10);
/*
|random_bytea |
|:---------------------|
|\x781b462c3158db229b3c|
*/
select length(random_bytea(100000))
 , clock_timestamp()-statement_timestamp() time_taken;
/*
|length|time_taken |
|-----:|:--------------|
|100000|00:00:00.654008|
*/

dbfiddle here

answered Aug 15, 2012 at 10:08
  • Nice use of width_bucket. Handy. Commented Aug 16, 2012 at 0:22
  • I've enhanced your approach to avoid the PL/PgSQL and expensive concatenation loop; see new answer. By using string_agg over generate_series instead of a PL/PgSQL concatenation loop on bytea I'm seeing a 150-fold improvement in performance. Commented Aug 16, 2012 at 0:38
