
I have a table structured as follows:

legal_entity_id (text) | solution_string (text) | total_score (bigint)
asjY4474dgd            | Server                 | 9

For a user-supplied set of companies and solutions, I need to determine, for each solution, how many of those companies have expressed interest in it (you can assume that if a company is listed in this table, it has interest in at least one solution). Here is an example of the desired output:

solution_string | count
Server          | 315
Automotive      | 256

The table contains approximately 2.5 million rows, and I need the query to execute in under 100 ms in 95% of cases when querying for 25,000 companies and 20 solutions. Indexes are already in place on both the legal_entity_id and solution_string columns; however, the table has no primary key (edit: a primary key has since been added on (legal_entity_id, solution_string)). The current query is as follows:

SELECT solution_string, COUNT(*)
FROM intent_signals
WHERE legal_entity_id IN (id1, id2, ...)
  AND solution_string IN (solution_1, solution_2, ...)
GROUP BY solution_string;

The query is extremely slow, and we have no idea how to optimize it. It currently takes an average of 2.8 seconds to complete with 5,000 legal_entity_ids.

Edit: additional info

PostgreSQL 10.22 (Ubuntu 10.22-0ubuntu0.18.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
QUERY PLAN
Finalize GroupAggregate (cost=4888296.66..4888317.51 rows=27 width=26) (actual time=5830.528..5848.222 rows=2 loops=1)
Group Key: solution_string
-> Gather Merge (cost=4888296.66..4888316.83 rows=81 width=26) (actual time=5830.303..5848.206 rows=8 loops=1)
Workers Planned: 3
Workers Launched: 3
-> Partial GroupAggregate (cost=4887296.62..4887307.28 rows=27 width=26) (actual time=5807.789..5807.904 rows=2 loops=4)
Group Key: solution_string
-> Sort (cost=4887296.62..4887300.08 rows=1385 width=18) (actual time=5807.701..5807.759 rows=955 loops=4)
Sort Key: solution_string
Sort Method: quicksort Memory: 116kB
-> Parallel Seq Scan on intent_signals (cost=0.00..4887224.35 rows=1385 width=18) (actual time=0.149..5807.172 rows=955 loops=4)
Filter: ((solution_string = ANY ('{Automotive,"Banking Other",Financial}'::text[])) AND (legal_entity_id = ANY ('{0010800002nSsILAA0,}'::text[])))
Rows Removed by Filter: 600911
Planning time: 36.703 ms
Execution time: 5848.318 ms
asked Sep 24, 2024 at 15:57
  • Please consider reading this advice. Commented Sep 24, 2024 at 16:14
  • Pasting the execution plan helps in understanding the case. Is your index composite? If not, I believe a composite index on both columns (company_id, solution_name), with the high-cardinality column first, could improve performance. Commented Sep 25, 2024 at 4:12
  • To clarify: I renamed the columns, so the actual names are used. @goodfella I added the recommended information. I also added a PK on (legal_entity_id, solution_string); before, these columns had separate indexes. Commented Sep 25, 2024 at 11:36
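
The composite index suggested in the comment above can be sketched as follows (a sketch only; table and column names are taken from the question, the index name is made up, and the PRIMARY KEY variant corresponds to what the asker says they ultimately added):

```sql
-- Option A: a plain composite index covering both filter columns.
-- Putting the higher-cardinality column (legal_entity_id) first
-- lets the planner resolve the large IN list via the index.
CREATE INDEX idx_intent_signals_entity_solution
    ON intent_signals (legal_entity_id, solution_string);

-- Option B: the primary key the asker added, which implicitly
-- creates an equivalent unique composite index.
ALTER TABLE intent_signals
    ADD PRIMARY KEY (legal_entity_id, solution_string);
```

Either variant should let the planner replace the Parallel Seq Scan shown in the plan above with an index scan on both predicates.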

1 Answer


From here, you can get an approximate COUNT(*)! You do have to keep your tables ANALYZE-d or VACUUM-ed, though.

The query to run is:

SELECT reltuples AS estimate FROM pg_class WHERE relname = 'table_name';

See also Erwin Brandstetter's post here for the ins and outs.

You can also use the TABLESAMPLE sampling method technique, explained here. Again, Erwin Brandstetter to the rescue here. The query to run is:

SELECT 100 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (1);

Depending on the accuracy desired, increase 1 to a suitable value.
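
To keep the estimate unbiased, the multiplier must stay the inverse of the sampled percentage (a sketch, using the same hypothetical mytable as above):

```sql
-- SYSTEM (1)  samples ~1% of the table's pages -> multiply by 100
-- SYSTEM (10) samples ~10% of the pages        -> multiply by 10
SELECT 10 * count(*) AS estimate
FROM mytable TABLESAMPLE SYSTEM (10);
```

A larger percentage reads more pages (slower) but tightens the estimate; SYSTEM sampling is page-based, so it is fast but can be skewed if rows cluster by page.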

Maybe of interest: a bit of a monster post on sampling (by me), see here!

I would suggest testing on your own system(s) to see what works best for you!

Your times seem very high. Is your IN clause very long? Maybe try a sub-SELECT?
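
The sub-SELECT idea can be sketched like this (an untested rewrite of the asker's query; the id1/id2 placeholders and the wanted_ids temp-table name are illustrative): instead of a 25,000-item IN list, present the IDs as a relation so the planner can join against them.

```sql
-- Sketch: replace the long IN list with a join on a VALUES list.
-- The VALUES rows stand in for the user-supplied legal_entity_ids.
SELECT s.solution_string, COUNT(*)
FROM intent_signals s
JOIN (VALUES (id1), (id2), (id3)) AS ids(legal_entity_id)
  ON s.legal_entity_id = ids.legal_entity_id
WHERE s.solution_string IN (solution_1, solution_2)
GROUP BY s.solution_string;

-- Alternatively, load the IDs into a temp table first and ANALYZE it,
-- so the planner has row-count statistics before planning the join:
CREATE TEMP TABLE wanted_ids (legal_entity_id text PRIMARY KEY);
-- ... populate via COPY or batched INSERTs ...
ANALYZE wanted_ids;
```

With the IDs in a relation, the planner can choose a hash or merge join rather than evaluating a huge = ANY array per row.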

answered Sep 25, 2024 at 14:16
  • Unfortunately, I need the exact count, and the table has no dead tuples. The IN clause can be very long and can contain up to 25,000 legal_entity_id(s), while the solutions can number up to 30. Commented Sep 25, 2024 at 15:40
  • @user306828 - if there are no dead tuples (and the table is ANALYZE-d/VACUUM-ed), the pg_class estimate should be exact. Very long IN clauses can be problematic - can't you turn it into a SELECT? Ya pays yer money and you makes your choice! Commented Sep 25, 2024 at 16:52
