I regularly import data from httparchive.org. The data is a MySQL CSV export and I use pgloader, which handles the quirks of this export (\N for NULL), etc.
I also need to do some additional processing for normalisation:
- split the URL into protocol (http|https) and host parts
- convert the string date "Mon DD YYYY" to a date object
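As an illustration of the second step, here is a minimal Python sketch of converting a "Mon DD YYYY" string before the data reaches Postgres. The function name `parse_mon_dd_yyyy` and the sample value are hypothetical, and it assumes English month abbreviations and that NULLs arrive as \N, as in the MySQL export.

```python
from datetime import date, datetime
from typing import Optional


def parse_mon_dd_yyyy(value: str) -> Optional[date]:
    """Parse a 'Mon DD YYYY' string (e.g. 'Apr 07 2022') into a date.

    Returns None for an empty string or the MySQL NULL marker \\N.
    """
    if value in ("", r"\N"):
        return None
    # %b matches the abbreviated month name in the current locale
    # (English names are assumed here).
    return datetime.strptime(value.strip(), "%b %d %Y").date()


if __name__ == "__main__":
    print(parse_mon_dd_yyyy("Apr 07 2022"))
```

Inside Postgres, the equivalent conversion is typically done with to_date and a matching format string.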
Currently, I have some triggers to do this when importing the data but I'm looking for ways to improve this, particularly to see whether some steps can be run in parallel.
I have the following CTE for extracting protocol and host:
with split as (
  select regexp_match(url, '(https|http)://(.+)/') as parts
  from urls
)
Running locally, this seems faster than ts_debug.
This works well with a select but seems very slow as an update.
with split as (
  select regexp_match(url, '(https|http)://(.+)/') as parts
  from urls
)
update urls
set protocol = parts[1],
    host = parts[2]
from split
An alternative, especially when working with a text source, would be to split the URL before it goes to Postgres.
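A rough sketch of what splitting outside Postgres could look like, using the same pattern in Python. The names `split_url` and `with_parts` are hypothetical, as is the assumption that each CSV row is already a list of strings with the URL at a known index. Note that the greedy (.+) runs up to the last slash, so a URL with a path yields more than the bare host.

```python
import re
from typing import Iterable, Iterator, List, Optional, Tuple

# Same pattern as the SQL version. The greedy (.+) extends to the LAST "/",
# so everything between "://" and the final slash lands in the second group.
URL_PARTS = re.compile(r"(https|http)://(.+)/")


def split_url(url: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (protocol, host) or (None, None) if the pattern does not match."""
    match = URL_PARTS.search(url)
    if match is None:
        return None, None
    return match.group(1), match.group(2)


def with_parts(rows: Iterable[List[str]], url_index: int) -> Iterator[List[str]]:
    """Append protocol and host columns to each row, using \\N for missing values."""
    for row in rows:
        protocol, host = split_url(row[url_index])
        yield row + [protocol or r"\N", host or r"\N"]


if __name__ == "__main__":
    for out in with_parts([["1", "https://example.com/"]], url_index=1):
        print(out)
```

The rows could come from csv.reader and go back out through csv.writer, so pgloader would only see a file that already has the two extra columns.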
The uncompressed CSV is 3526430884 bytes and takes around 20 minutes to import with no processing, but more than twice that with the processing. FWIW I have also tried using a Foreign Data Wrapper but, even after solving various problems with the CSV (nulls, encoding), this leads to a memory error.
With some help, I've managed to run benchmarks and improve my triggers.
CREATE OR REPLACE FUNCTION public.extract_protocol()
  RETURNS trigger
  LANGUAGE plpgsql
AS $function$
DECLARE
  elements text[];
BEGIN
  elements := regexp_match(NEW.url, '(https|http)://(.+)/');
  NEW.protocol = elements[1];
  NEW.host = elements[2];
  RETURN NEW;
END;
$function$
This now runs faster than doing a subsequent update, but neither is the limiting factor. The bottleneck is now the overhead of the indices when inserting the cleaned-up data into the main table. I think my only option there is to weigh the cost of maintaining the indices during the insert against dropping them and recreating them afterwards.
1 Answer
Your UPDATE syntax generates a cross join of the table urls with the split result, which is essentially a cross join of the table with itself.
You need to have some kind of join condition between the target table and the source. The obvious choice would be the primary key column of the table.
with split as (
  select pk_column, regexp_match(url, '(https|http)://(.+)/') as parts
  from urls
)
update urls
set protocol = s.parts[1],
    host = s.parts[2]
from split s
where urls.pk_column = s.pk_column --<< here
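To make the size of that cross join concrete, here is a small sketch that counts the row pairs with and without the join condition. It uses Python's sqlite3 module purely as a stand-in for Postgres, with a hypothetical 100-row urls table and a plain column in place of the regexp_match call. Only the join behaviour is being illustrated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table urls (pk_column integer primary key, url text)")
conn.executemany(
    "insert into urls (url) values (?)",
    [(f"https://host{i}.example/",) for i in range(100)],
)

# Stand-in for the CTE: the url column replaces the regexp_match result.
cte = "with split as (select pk_column, url as parts from urls) "

# No join condition: every row of urls is paired with every row of split.
unjoined = conn.execute(cte + "select count(*) from urls, split").fetchone()[0]

# Join on the key: each row is paired only with its own split result.
joined = conn.execute(
    cte + "select count(*) from urls join split s on urls.pk_column = s.pk_column"
).fetchone()[0]

print(unjoined, joined)  # prints "10000 100"
```

The pair count grows with the square of the table size when the condition is missing, which is consistent with the update being very slow on a multi-gigabyte import.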
I think your attempt to avoid evaluating the regex expression twice by using the CTE makes things slower, rather than faster. I would expect the overhead of joining the table with itself is far bigger than the overhead of evaluating the regex twice.
So I think you should also try:
update urls
set protocol = (regexp_match(url, '(https|http)://(.+)/'))[1],
    host = (regexp_match(url, '(https|http)://(.+)/'))[2]
Thanks, I assumed there was a problem with the update syntax. As this is a temporary table, I don't have any PKs so that's not an option anyway. Would it be possible to use a variable in a trigger to avoid running the function twice? – Charlie Clark, Apr 7, 2022 at 8:52
@CharlieClark: you can use the system column ctid instead of pk_column. But I am quite certain the version without the CTE will be faster. – user1822, Apr 7, 2022 at 8:53
It is; it takes about 2 minutes, but it is faster still as a trigger.
AS $function$BEGIN NEW.protocol = (regexp_match(NEW.url, '(https|http)://(.+)/'))[1]; NEW.host = (regexp_match(NEW.url, '(https|http)://(.+)/'))[2]; RETURN NEW;
Can I use a variable here so the regexp is only called once? Asking mainly out of interest. – Charlie Clark, Apr 7, 2022 at 9:01
Sure, e.g. declare elements text[], then use elements := regexp_match(NEW.url, '(https|http)://(.+)/'); and then reference elements[1] in your assignment in the trigger. – user1822, Apr 7, 2022 at 9:02
That gives me a 'duplicate declaration at or near "elements"' error. – Charlie Clark, Apr 7, 2022 at 9:19