I have the following query:

explain analyze
SELECT split_part(full_path, '/', 4)::INT AS account_id,
 split_part(full_path, '/', 6)::INT AS note_id,
 split_part(full_path, '/', 9)::TEXT AS variation,
 st_size,
 segment_index,
 reverse(split_part(reverse(full_path), '/', 1)) as file_name,
 i.st_ino,
 full_path,
 (i.st_size / 1000000::FLOAT)::NUMERIC(5,2) || 'MB' AS size_mb
FROM gorfs.inodes i
JOIN gorfs.inode_segments s
 ON i.st_ino = s.st_ino_target
WHERE
 i.checksum_md5 IS NOT NULL
 AND s.full_path ~ '^/userfiles/account/[0-9]+/[a-z]+/[0-9]+'
 AND i.st_size > 0
 AND split_part(s.full_path, '/', 4)::INT IN (
SELECT account.id
 FROM public.ja_clients AS account
 WHERE
 NOT (
 ((account.last_sub_pay > EXTRACT('epoch' FROM (transaction_timestamp() - CAST('4 Months' AS INTERVAL)))) AND (account.price_model > 0)) OR
 (account.regdate > EXTRACT('epoch' FROM (transaction_timestamp() - CAST('3 Month' AS INTERVAL)))) OR
 (((account.price_model = 0) AND (account.jobcredits > 0)) AND (account.last_login > EXTRACT('epoch' FROM (transaction_timestamp() - CAST('4 Month' AS INTERVAL)))))
 ) LIMIT 100
);

The query is taking ages, and I can't get the problem solved.

These are the indexes I've already created on the inode_segments table:

Indexes:
 "ix_account_id_from_full_path" "btree" (("split_part"("full_path"::"text", '/'::"text", 4)::integer)) WHERE "full_path"::"text" ~ '^/userfiles/account/[0-9]+/[a-z]+/[0-9]+'::"text"
 "ix_inode_segments_ja_files_lookup" "btree" ((
CASE
 WHEN "full_path"::"text" ~ '/[^/]*\.[^/]*$'::"text" THEN "upper"("regexp_replace"("full_path"::"text", '.*\.'::"text", ''::"text", 'g'::"text"))
 ELSE NULL::"text"
END)) WHERE "gorfs"."is_kaminski_note_path"("full_path"::"text")
 "ix_inode_segments_notes_clientids" "btree" (("split_part"("full_path"::"text", '/'::"text", 4)::integer)) WHERE "gorfs"."is_kaminski_note_path"("full_path"::"text")
 "ix_inode_segments_notes_clientids2" "btree" ("full_path")
 "ix_inode_segments_notes_fileids" "btree" (("split_part"("full_path"::"text", '/'::"text", 8)::integer)) WHERE "gorfs"."is_kaminski_note_path"("full_path"::"text")
 "ix_inode_segments_notes_noteids" "btree" ((NULLIF("split_part"("full_path"::"text", '/'::"text", 6), 'unassigned'::"text")::integer)) WHERE "gorfs"."is_kaminski_note_path"("full_path"::"text")

These are the indexes I've already created on the inodes table:

 Indexes:
 "ix_inodes_checksum_st_size" "btree" ("checksum_md5", "st_size") WHERE "checksum_md5" IS NOT NULL

Question:

What else can I do to improve the performance of the query?

UPDATE 1:

Explain analyze: http://explain.depesz.com/s/UBr

The index and function have been created as mentioned in the answer below.

UPDATE 2:

Explain analyze: http://explain.depesz.com/s/LHS

Using the query provided in the answer below.

asked May 3, 2016 at 3:39
  • You're limiting your selection to a random 100 account ids. Is that intentional? Commented May 3, 2016 at 4:25
  • yes @ZiggyCrueltyfreeZeitgeister - Otherwise the query will be running for weeks =\ Commented May 3, 2016 at 4:25

2 Answers


Perhaps this will help.

If you'll rely on the account_id from full_path often, then you'll benefit from a function and a functional index for it:

CREATE OR REPLACE FUNCTION gorfs.f_get_account_from_full_path(p_full_path text) RETURNS int AS $body$
SELECT (regexp_matches(1,ドル '^/userfiles/account/([0-9]+)/[a-z]+/[0-9]+'))[1]::int
$body$ LANGUAGE SQL IMMUTABLE SECURITY DEFINER RETURNS NULL ON NULL INPUT;
CREATE INDEX ON gorfs.inode_segments (gorfs.f_get_account_from_full_path(full_path));
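
Note that the planner will only consider this functional index when the query repeats the indexed expression verbatim, e.g. (12345 is a made-up account id):

-- The WHERE clause must use the exact indexed expression:
EXPLAIN
SELECT st_ino_target
FROM gorfs.inode_segments
WHERE gorfs.f_get_account_from_full_path(full_path) = 12345;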

Ensure gorfs.inodes has an index on st_ino (or better yet, a key, if applicable)!
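
For instance (a sketch, assuming st_ino uniquely identifies a row in gorfs.inodes):

-- Assumption: st_ino is unique in gorfs.inodes; if so, a primary key is preferable:
ALTER TABLE gorfs.inodes ADD PRIMARY KEY (st_ino);
-- If uniqueness cannot be guaranteed, a plain index still helps the join:
-- CREATE INDEX ON gorfs.inodes (st_ino);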

You run the function split_part several times for each row, and this is likely taking a significant toll. I've replaced it with a single string_to_array call, fetching the individual pieces from the array as needed. I also didn't understand what you intended to obtain for the field file_name using reverse; the query below simply returns the last path element for it.
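
To illustrate with a made-up path:

-- One string_to_array call replaces several split_part calls; the leading
-- '/' yields an empty first element, so parts[4] matches split_part(..., 4):
SELECT parts[4]::int AS account_id,
 parts[6]::int AS note_id,
 parts[array_upper(parts, 1)] AS file_name
FROM (SELECT string_to_array('/userfiles/account/42/notes/7/a/b/img.jpg', '/') AS parts) t;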

Your query returns many millions of rows. Even if PostgreSQL processes the query reasonably quickly, your client application (especially if you use PgAdminIII) will struggle to allocate enough memory to receive and format the results, and that will probably be what takes the most time. So you may want to create a temporary table with the results, and then query against the temporary table:

CREATE TEMP TABLE myresults AS
WITH
 accounts AS (
 SELECT id
 FROM public.ja_clients
 WHERE NOT (
 (last_sub_pay > EXTRACT('epoch' FROM now() - '4 Months'::INTERVAL) AND price_model > 0) OR
 regdate > EXTRACT('epoch' FROM now() - '3 Month'::INTERVAL) OR
 (price_model = 0 AND jobcredits > 0 AND last_login > EXTRACT('epoch' FROM now() - '4 Month'::INTERVAL))
 )
 ORDER BY 1 LIMIT 100 -- first 100 accounts for testing purposes; comment out this line once the query is proven performant enough
 ) 
SELECT r.parts[4]::INT AS account_id, r.parts[6]::INT AS note_id, r.parts[9] AS variation,
 st_size, segment_index, r.parts[array_upper(r.parts, 1)] AS file_name, st_ino, full_path, size_mb
FROM (
 SELECT string_to_array(full_path, '/') AS parts, st_size, segment_index, i.st_ino, full_path,
 (i.st_size / 1000000::FLOAT)::NUMERIC(5,2) || 'MB' AS size_mb
 FROM gorfs.inode_segments s
 JOIN gorfs.inodes i ON (i.st_ino = s.st_ino_target)
 WHERE gorfs.f_get_account_from_full_path(s.full_path) IN (SELECT * FROM accounts)
 AND i.checksum_md5 IS NOT NULL
 AND i.st_size > 0
 ) r;
SELECT *
FROM myresults
LIMIT 100;
answered May 3, 2016 at 5:19
  • Thanks for that. But I got an error when creating the index: ERROR: functions in index expression must be marked IMMUTABLE Commented May 3, 2016 at 5:32
  • @JohnThomaz query updated (I forgot about that), just replace STABLE with IMMUTABLE ;) Commented May 3, 2016 at 5:34
  • Thanks for that. The function's been created and the index as well. But the query is still taking ages and is not using the new index - I've updated the question. Please have a look Commented May 3, 2016 at 6:59
  • @JohnThomaz I can't see any changes in the question. Can you publish the new explain analyze statistics? Commented May 3, 2016 at 7:05
  • Yep it's in the question... update 1 Commented May 3, 2016 at 7:06

Your strategy for extracting information from full_path can be useful for a one-off, but for ongoing queries against it, especially over millions of records where you expect quick results, it is far from optimal.

Considering the sheer size of your tables, you'll probably benefit from data warehousing. If your tables are constantly updated, you'll need a trigger to keep your data warehouse in sync (see the sketch after the extract below).

The following is not exactly datawarehousing, but it follows the principle of creating the most lightweight extract of the data you'll need for future queries:

CREATE TABLE gorfs.inodes_segments_dw(
 st_ino_target "gorfs"."ino_t" NOT NULL,
 segment_index "gorfs"."pathname_component" NOT NULL,
 account_id int NULL,
 note_id int NULL,
 variation text NULL,
 file_name text NULL
);
INSERT INTO gorfs.inodes_segments_dw
SELECT st_ino_target, segment_index, parts[4]::INT AS account_id, parts[6]::INT AS note_id, parts[9] AS variation,
 parts[array_upper(parts, 1)] AS file_name
FROM (
 SELECT *, string_to_array(full_path, '/') AS parts
 FROM gorfs.inode_segments
 ) s;
CREATE INDEX ON gorfs.inodes_segments_dw (account_id);
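
If gorfs.inode_segments is written to continuously, a trigger along these lines can keep the extract current (a sketch only, handling inserts; updates and deletes would follow the same pattern):

-- Sketch: sync the extract on INSERT, assuming the column layout above;
-- only paths matching the account pattern are extracted.
CREATE OR REPLACE FUNCTION gorfs.f_inodes_segments_dw_sync() RETURNS trigger AS $body$
DECLARE
 v_parts text[];
BEGIN
 IF NEW.full_path ~ '^/userfiles/account/[0-9]+/[a-z]+/[0-9]+' THEN
 v_parts := string_to_array(NEW.full_path, '/');
 INSERT INTO gorfs.inodes_segments_dw
 (st_ino_target, segment_index, account_id, note_id, variation, file_name)
 VALUES (NEW.st_ino_target, NEW.segment_index,
 v_parts[4]::int, v_parts[6]::int, v_parts[9],
 v_parts[array_upper(v_parts, 1)]);
 END IF;
 RETURN NEW;
END;
$body$ LANGUAGE plpgsql;
CREATE TRIGGER trg_inodes_segments_dw_sync
 AFTER INSERT ON gorfs.inode_segments
 FOR EACH ROW EXECUTE PROCEDURE gorfs.f_inodes_segments_dw_sync();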

Then you can run the full query against this extract, rather than against the full table with fields calculated on the fly:

CREATE TEMP TABLE myresults AS
WITH
 accounts AS (
 SELECT id
 FROM public.ja_clients
 WHERE NOT (
 (last_sub_pay > EXTRACT('epoch' FROM now() - '4 Months'::INTERVAL) AND price_model > 0) OR
 regdate > EXTRACT('epoch' FROM now() - '3 Month'::INTERVAL) OR
 (price_model = 0 AND jobcredits > 0 AND last_login > EXTRACT('epoch' FROM now() - '4 Month'::INTERVAL))
 )
 ORDER BY 1 LIMIT 100 -- first 100 accounts for testing purposes; comment out this line once the query is proven performant enough
 ) 
SELECT s.account_id, s.note_id, s.variation, s.st_size, s.segment_index, s.file_name, s.st_ino, iseg.full_path, s.size_mb
FROM (
 SELECT s.account_id, s.note_id, s.variation, i.st_size, s.segment_index, s.file_name, i.st_ino,
 (i.st_size / 1000000::FLOAT)::NUMERIC(5,2) || 'MB' AS size_mb
 FROM accounts a
 JOIN gorfs.inodes_segments_dw s ON (s.account_id = a.id)
 JOIN gorfs.inodes i ON (i.st_ino = s.st_ino_target AND i.checksum_md5 IS NOT NULL AND i.st_size > 0)
 ) s
JOIN gorfs.inode_segments iseg ON ((iseg.st_ino_target, iseg.segment_index) = (s.st_ino, s.segment_index));

Another advisable strategy is to separate this query into parts, using temporary tables, step by step. This is not necessarily faster, but it is convenient for finding problems with the selection logic, or for applying a strategy that best suits your own knowledge of the data:

CREATE TEMP TABLE accounts (account_id int NOT NULL PRIMARY KEY);
INSERT INTO accounts
SELECT id
FROM public.ja_clients
WHERE NOT (
 (last_sub_pay > EXTRACT('epoch' FROM now() - '4 Months'::INTERVAL) AND price_model > 0) OR
 regdate > EXTRACT('epoch' FROM now() - '3 Month'::INTERVAL) OR
 (price_model = 0 AND jobcredits > 0 AND last_login > EXTRACT('epoch' FROM now() - '4 Month'::INTERVAL))
 );
CREATE TEMP TABLE inodes (st_ino "gorfs"."ino_t" NOT NULL PRIMARY KEY, st_size "gorfs"."off_t", size_mb text);
INSERT INTO inodes
SELECT st_ino, st_size, (st_size / 1000000::FLOAT)::NUMERIC(5,2) || 'MB' AS size_mb
FROM gorfs.inodes
WHERE checksum_md5 IS NOT NULL AND st_size > 0;
CREATE TEMP TABLE segments (
 st_ino_target "gorfs"."ino_t" NOT NULL,
 segment_index "gorfs"."pathname_component" NOT NULL,
 account_id int NOT NULL
);
INSERT INTO segments
SELECT *
FROM (
 SELECT st_ino_target, segment_index, (regexp_matches(full_path, '^/userfiles/account/([0-9]+)/[a-z]+/[0-9]+'))[1]::int AS account_id
 FROM gorfs.inode_segments
 ) s
WHERE account_id IS NOT NULL;
CREATE INDEX ON segments (account_id);
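
Fresh temporary tables have no statistics and autovacuum does not analyze them, so it can be worth analyzing them manually before the big join (optional, but cheap):

ANALYZE accounts;
ANALYZE inodes;
ANALYZE segments;

Now the final assembly: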
CREATE TEMP TABLE results AS
SELECT account_id, parts[6]::INT AS note_id, parts[9] AS variation,
 st_size, segment_index, parts[array_upper(parts, 1)] AS file_name, st_ino, full_path, size_mb
FROM (
 SELECT sall.*, string_to_array(full_path, '/') AS parts, full_path
 FROM (
 SELECT i.*, s.segment_index, s.account_id
 FROM accounts a
 JOIN segments s ON (s.account_id = a.account_id)
 JOIN inodes i ON (i.st_ino = s.st_ino_target)
 ) sall
 JOIN gorfs.inode_segments s ON ((s.st_ino_target, s.segment_index) = (sall.st_ino, sall.segment_index))
 ) s;
SELECT *
FROM results
LIMIT 100;
answered May 3, 2016 at 9:19
  • Thanks for that! Just one thing.. I have to select the data from one (or more) account_id. I didn't see on your query a place to put the ID that I want... can you clarify for me please? Commented May 3, 2016 at 9:45
  • @JohnThomaz In both cases the filter by account is achieved via a JOIN with accounts Commented May 3, 2016 at 9:47
  • Well.. I don't think that is a good idea... The GORFS.inode_segment table is 1.7TB in size. Putting the data in another table is just not viable for me =\ Commented May 3, 2016 at 9:52
  • @JohnThomaz you had no problem creating indexes, so you should be able to create narrow subsets of inode_segments. How wide is segment_index typically? For example, 80 million rows, for 100 bytes per row, would be an 8GB table. Far from the current 1700GB of the full table. Commented May 3, 2016 at 22:26
