How To Generate Random Data in PostgreSQL
It doesn’t happend every day, but there are occasionally reasons to use random data, or even random sequences of data. PostgreSQL, “The World’s Most Advanced Open Source Relational Database” allows to generate random data in a lot of different ways. While I was preparing the new version and sample data for our recommendation system RecommenderPro I was using PostgreSQL to create some (a lot) of anonimized data to demonstrate the functionality of the product, hence the idea of this post. Here’s some example of how to use our beloved PostgreSQL to generate random data:
random() function
If you just need a random float number the random function is enough. It return a floating point number in the range 0 <= x < 1
.
select random() /* 0.0 <= value < 1 */
If you need a floating point number between 0 and, let’s say, 9, you can use the following select statement.
select random()*9 /* 0.0 <= value < 9 */
If you need an integer number between 0 and 9 use the following select statement.
select floor(random()*9) /* 0 <= value < 9 */
And if you’d like to have some different offset you can simply subtract or add that. The following will return values between -20 and 20:
SELECT trunc(40*random() - 20);
PostgreSQL doesn a good job in returning random values however, there are chanches, that you need repetible values to make some kind of observation. How to force random to generate always the same random values? Simple! Just seed the random!
Try run the following query multiple times
SELECT setseed(.5), random();
It returns a random result as expected but it’s the same random result every time. Changing the seed value from 0.5 to some other value, the value returned by random will be different but will be still always the same for any query execution. In case you have multiple query that need the same sequence of randomic values, you can also set the seed in the first select.
SELECT setseed(.5);
select random();
select random()*10;
select floor(random()*10);
Sequences of random data using generate_series
One of the most important tool in the perfect PostgreSQL developer toolbox is the generate_series
function.
The generate_series
function generates a list of numbers (or timestamp) from a starting value to a stop value using a step (by default the step is 1).
select * from generate_series(1,10)
generate_series |
---|
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
If you need to define a column name, use the following syntax.
select * from generate_series(1,10) as mytable(myvalue)
myvalue |
---|
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
To generate some random data with an id
you can get some idea from the following select.
SELECT
generate_series(1,10) AS id,
md5(random()::text) AS random_text,
random() * 10 AS random_number,
floor(random() * 10) AS random_integer,
'2000-01-01'::date + trunc(random() * 366 * 10)::int random_date /* 10 year range starting from jan 1st, 2000 */
id | random_text | random_number | random_integer | random_date |
---|---|---|---|---|
1 | 8a17b7ed70b2b1835727682787f39097 | 0.4225312727089303 | 3.0 | 2000-09-08 |
2 | f47324eb11a6416b66e53d775f1f4098 | 3.621055028581388 | 3.0 | 2009-01-16 |
3 | 8593ee40889c503387193cc771683e45 | 6.211348115097302 | 7.0 | 2004-05-02 |
4 | 0f8d782bed4d512d0e3e5c5d6dd0ca34 | 8.290160101204087 | 8.0 | 2010-01-03 |
5 | 58f2a1ff22f712cac6dfee48a5c97ba6 | 6.324909868740711 | 3.0 | 2005-10-11 |
6 | f98823fd945e5b698dfeab2aa08e3df4 | 7.469451084566678 | 0.0 | 2003-07-01 |
7 | 5c5835cbf6b0b64c463ea7594139269b | 8.83030099747625 | 9.0 | 2007-12-18 |
8 | e810e029b46c9c548cd3cf3982c1ce34 | 0.4947420333170882 | 3.0 | 2000-05-06 |
9 | 8a8f2fd5ba9bf9cf3488ce68b82b718e | 4.968150791127037 | 3.0 | 2004-07-28 |
10 | 5de881f247410b7ea409d5b41c10c284 | 1.0645231178098058 | 2.0 | 2008-05-13 |
To create a table with random data you can use the create table ... as select
statement.
create table my_random_data as
select
generate_series(1,10) AS id,
md5(random()::text) AS random_text,
random() * 10 AS random_number,
floor(random() * 10) AS random_integer,
'2000-01-01'::date + trunc(random() * 366 * 10)::int random_date /* 10 year range starting from jan 1st, 2000 */
In this case the column data types are inferred by the select output, as shown below
TABLE my_random_data (
id int4 NULL,
random_text text NULL,
random_number float8 NULL,
random_integer float8 NULL,
random_date date NULL
);
Generate random first names and last name
Working with random numbers and dates are quite simple. Dealing with random character data with a “potential” meaning isn’t. How to generate unique random first names and last names?
Here it is!
drop table if exists random_people;
create table random_people as
with rnd_people as (
select distinct
('[0:9]={"Daniele","Debora","Mattia","Jake","Amy","Henry","Neil","Tim","John","George"}'::text[])
[floor(random()*10)] first_name,
('[0:9]={"Rossi","Verdi","Gialli","Caponi","Gallini","Gatti","Ford","Daniel","Harrison","Macdonald"}'::text[])
[floor(random()*10)] last_name
from
generate_series(1, 10 * 10 * 50)
)
select
row_number() over() id, first_name, last_name
from
rnd_people;
After executing this query the random_people
table contains (very probably) 100 records with data similar to the followings.
id | first_name | last_name |
---|---|---|
1 | Henry | Daniel |
2 | John | Gallini |
3 | Daniele | Gallini |
4 | Debora | Caponi |
5 | Neil | Gialli |
6 | George | Harrison |
7 | John | Gatti |
8 | Jake | Gialli |
9 | Jake | Harrison |
10 | Tim | Rossi |
… | … | … |
Generate random data from a text file
Having the correct grants PostgreSQL allows to read files from the file system.
The following query generates 10000 records with verses from the “DIVINA COMMEDIA - Canto 1° Inferno” in random order.
The text is stored in a text file accessible by the PG server. There are 136 lines of text in the file (hence the random() * 136). Array returned by string_to_array
starts by 1.
WITH versi_canto_1(verso) AS(
SELECT (string_to_array(pg_read_file('C:\DEV\recommenderpro\data\inferno_canto_1.txt')::text,E'\r\n'))
)
SELECT versi_canto_1.verso[floor(random() * 136) + 1] FROM versi_canto_1, generate_series(1,10000);
The output is similar to the following
verso |
---|
Io non so ben ridir com’i’ v’intrai, |
ripresi via per la piaggia diserta, |
tu se’ solo colui da cu’ io tolsi |
Vedi la bestia per cu’ io mi volsi: |
Ed una lupa, che di tutte brame |
che tu mi meni là dov’or dicesti, |
Ma tu perché ritorni a tanta noia? |
E qual è quei che volontieri acquista, |
con lei ti lascerò nel mio partire; |
con la paura ch’uscia di sua vista, |
nel tempo de li dèi falsi e bugiardi. |
rispuose poi che lagrimar mi vide, |
mantoani per patria ambedui. |
… |
Conclusions
Hope you’ll find this post useful. Do you have any other advice about data generation in PostgreSQL? Let me know in the comments.
Comments
comments powered by Disqus