I have a table structured as follows:

| legal_entity_id (text) | solution_string (text) | total_score (bigint) |
|---|---|---|
| asjY4474dgd | Server | 9 |
For a user-supplied list of companies and solutions, I need to determine how many of those companies have expressed interest in each solution (you can assume that a company listed in this table has interest in at least one solution). Here's an example of the desired output:
| solution_string | count |
|---|---|
| Server | 315 |
| Automotive | 256 |
The table contains approximately 2.5 million rows, and I need the query to execute in under 100 ms in 95% of cases when querying for 25,000 companies and 20 solutions. Indexes are already in place on both the `legal_entity_id` and `solution_string` columns. However, the table has no primary key (edit: a primary key has since been added on `(legal_entity_id, solution_string)`). The current query is as follows:
```sql
SELECT solution_string, COUNT(*)
FROM intent_signals
WHERE legal_entity_id IN (id1, id2, ...)
  AND solution_string IN (solution_1, solution_2, ...)
GROUP BY solution_string;
```
The query is extremely slow, and we have no idea how to optimize it. Currently, it takes an average of 2.8 seconds to complete with 5000 legal_entity_ids.
Edit: additional info
PostgreSQL 10.22 (Ubuntu 10.22-0ubuntu0.18.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
```
QUERY PLAN
Finalize GroupAggregate  (cost=4888296.66..4888317.51 rows=27 width=26) (actual time=5830.528..5848.222 rows=2 loops=1)
  Group Key: solution_string
  ->  Gather Merge  (cost=4888296.66..4888316.83 rows=81 width=26) (actual time=5830.303..5848.206 rows=8 loops=1)
        Workers Planned: 3
        Workers Launched: 3
        ->  Partial GroupAggregate  (cost=4887296.62..4887307.28 rows=27 width=26) (actual time=5807.789..5807.904 rows=2 loops=4)
              Group Key: solution_string
              ->  Sort  (cost=4887296.62..4887300.08 rows=1385 width=18) (actual time=5807.701..5807.759 rows=955 loops=4)
                    Sort Key: solution_string
                    Sort Method: quicksort  Memory: 116kB
                    ->  Parallel Seq Scan on intent_signals  (cost=0.00..4887224.35 rows=1385 width=18) (actual time=0.149..5807.172 rows=955 loops=4)
                          Filter: ((solution_string = ANY ('{Automotive,"Banking Other",Financial}'::text[])) AND (legal_entity_id = ANY ('{0010800002nSsILAA0,}'::text[])))
                          Rows Removed by Filter: 600911
Planning time: 36.703 ms
Execution time: 5848.318 ms
```
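The plan above shows a parallel sequential scan that discards roughly 600,000 rows per worker, so the existing indexes are not being used. One quick way to check whether the planner could produce an index plan at all is to temporarily discourage sequential scans (a diagnostic sketch with placeholder values; not something to leave enabled):

```sql
-- Diagnostic only, never a production setting: disable seq scans for this
-- session so EXPLAIN reveals whether an index-based plan is available
-- and what the planner thinks it would cost.
SET enable_seqscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT solution_string, COUNT(*)
FROM intent_signals
WHERE legal_entity_id IN ('id1', 'id2')            -- placeholder ids
  AND solution_string IN ('Automotive', 'Financial')
GROUP BY solution_string;

RESET enable_seqscan;
```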
- Please consider reading this advice – mustaccio, Sep 24, 2024 at 16:14
- Pasting the execution plan helps us understand the case. Is your index composite? If not, I believe a composite index on both columns (company_id, solution_name), with the high-cardinality column first, could improve things. – goodfella, Sep 25, 2024 at 4:12
- To clarify: I renamed the columns so that the actual names are used. @goodfella I added the recommended information. I also added a PK on (legal_entity_id, solution_string); before, these columns had separate indexes. – user306828, Sep 25, 2024 at 11:36
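For concreteness, the composite index described in the comments would look something like this (the index name is hypothetical, and note that the PK the OP added on `(legal_entity_id, solution_string)` already creates an equivalent unique index automatically):

```sql
-- Composite index per the comment above: high-cardinality column first.
-- Illustrative only: the primary key later added on
-- (legal_entity_id, solution_string) provides an equivalent index.
CREATE INDEX intent_signals_entity_solution_idx
    ON intent_signals (legal_entity_id, solution_string);
```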
1 Answer
From here, you have an approximate `COUNT(*)`! Gotta keep your tables `ANALYZE`-ed or `VACUUM`-ed, though.
The query to run is:

```sql
SELECT reltuples AS estimate
FROM pg_class
WHERE relname = 'table_name';
```
See also Erwin Brandstetter's post here for the ins and outs.
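If the table name is not unique across schemas, a schema-qualified variant of the same lookup avoids picking the wrong row (the schema and table names below are placeholders):

```sql
-- Resolve the table through a regclass cast so the schema is explicit.
SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE oid = 'public.intent_signals'::regclass;
```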
You also have the `TABLESAMPLE` sampling method, explained here. Again, Erwin Brandstetter to the rescue here. The query to run is:

```sql
SELECT 100 * count(*) AS estimate
FROM mytable TABLESAMPLE SYSTEM (1);
```

Depending on the accuracy desired, increase `1` to a suitable value.
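Note that `SYSTEM` samples whole pages, which is fast but can skew the estimate if similar rows are physically clustered; `BERNOULLI` samples individual rows instead, at the cost of reading the whole table. A row-level variant of the same estimate:

```sql
-- BERNOULLI reads every page but keeps each row with 1% probability,
-- usually giving a more even sample than page-level SYSTEM.
SELECT 100 * count(*) AS estimate
FROM mytable TABLESAMPLE BERNOULLI (1);
```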
Maybe of interest - a bit of a monster post on sampling (by me); see here!
I would suggest testing on your own system(s) to see what works best for you!
Your times seem very high - is your `IN` clause very long? Maybe try a sub-`SELECT`?
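As a sketch of that sub-`SELECT` idea (the alias and placeholder values below are assumptions): joining against the id list lets the planner hash-join the 25,000 ids instead of evaluating a huge array filter for every row.

```sql
-- Sketch only: join against the id list rather than a giant IN (...).
-- The VALUES list stands in for however the 25,000 ids arrive; a temp
-- table (ANALYZEd after loading) would work the same way.
SELECT s.solution_string, COUNT(*)
FROM intent_signals s
JOIN (VALUES ('id1'), ('id2') /* ... */) AS ids(legal_entity_id)
  ON ids.legal_entity_id = s.legal_entity_id
WHERE s.solution_string IN ('solution_1', 'solution_2')
GROUP BY s.solution_string;
```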
- Unfortunately I need the exact count, and the table has no dead tuples. The `IN` clause can be very long and can contain up to 25,000 legal_entity_id(s), while the solutions can be up to 30. – user306828, Sep 25, 2024 at 15:40
- @user306828 - if there are no dead tuples (and the table is analyzed/vacuumed), the pg_class count should be exact. Very long `IN` clauses can be problematic - can't you turn it into a `SELECT`? Ya pays yer money and you makes your choice! – Vérace, Sep 25, 2024 at 16:52