I have a table race_racer that contains time serialized logs for each racer in a race. I am trying to select the lastest record for a given racer in x number of races prior to a date. Basically selecting their finalized stats. The table has approximately 64mm records. I'm using a framework that doesn't support window functions and DB functions are not an option.
I cannot select based on a list of race_ids because a racer might not be in a given race.
Postgres 15
Table DLL
CREATE TABLE race_racer (
race_racer_id SERIAL PRIMARY KEY,
log_id integer NOT NULL,
racer_id integer NOT NULL,
race_id integer NOT NULL,
stats jsonb,
created_at timestamp without time zone NOT NULL DEFAULT now()
);
-- Indices -------------------------------------------------------
CREATE UNIQUE INDEX race_racer_pkey ON race_racer(race_racer_id int4_ops);
CREATE INDEX race_racer_log_id_idx ON race_racer(log_id int4_ops);
CREATE INDEX race_racer_racer_id_idx ON race_racer(racer_id int4_ops);
CREATE INDEX race_racer_race_id_idx ON race_racer(race_id int4_ops);
CREATE INDEX race_racer_created_at_idx ON race_racer(created_at timestamp_ops);
race_racer_id is an auto increment and log_id the record id from our vendor. It's possible for duplicate log_id records to exist as the data is populated in batches.
Query
SELECT MAX(p2.race_racer_id) race_racer_id
FROM race_racer p2
WHERE p2.racer_id = 10002093
AND p2.created_at < '2024-08-01'
GROUP BY p2.log_id, p2.race_id
ORDER BY p2.log_id DESC
LIMIT 25
EXPLAIN(ANALYZE,BUFFERS,SETTINGS,VERBOSE)
Limit (cost=115.70..2445.14 rows=25 width=12) (actual time=150899.392..369576.667 rows=25 loops=1)
Output: (max(race_racer_id)), log_id, race_id
Buffers: shared hit=3984768 read=553269 written=66
-> GroupAggregate (cost=115.70..9417387.87 rows=101068 width=12) (actual time=150899.390..369576.638 rows=25 loops=1)
Output: max(race_racer_id), log_id, race_id
Group Key: p2.log_id, p2.race_id
Buffers: shared hit=3984768 read=553269 written=66
-> Incremental Sort (cost=115.70..9415611.55 rows=102085 width=12) (actual time=147171.318..369575.442 rows=7578 loops=1)
Output: log_id, race_id, race_racer_id
Sort Key: p2.log_id DESC, p2.race_id
Presorted Key: p2.log_id
Full-sort Groups: 26 Sort Method: quicksort Average Memory: 28kB Peak Memory: 28kB
Pre-sorted Groups: 26 Sort Method: quicksort Average Memory: 41kB Peak Memory: 41kB
Buffers: shared hit=3984768 read=553269 written=66
-> Index Scan Backward using race_racer_log_id_idx on public.race_racer p2 (cost=0.44..9411732.46 rows=102085 width=12) (actual time=142302.220..369572.386 rows=7865 loops=1)
Output: log_id, race_id, race_racer_id
Filter: ((p2.created_at < '2024-08-01 00:00:00'::timestamp without time zone) AND (p2.racer_id = 10002093))
Rows Removed by Filter: 4547156
Buffers: shared hit=3984765 read=553269 written=66
Settings: effective_cache_size = '19256MB', effective_io_concurrency = '300', maintenance_io_concurrency = '300', random_page_cost = '1', work_mem = '25671kB'
Query Identifier: 7029425313897609646
Planning:
Buffers: shared hit=182 read=10
Planning Time: 11.919 ms
Execution Time: 369579.279 ms
The query is executing in about 45 seconds which is incredibly slow. Please help! If I select the latest race data for a specific racer using race_id, it resolves in under 20ms.
I've tried making indexes and changing the query around to no avail.
2 Answers 2
This query could benefit from an index like this:
CREATE INDEX idx_cover_race_racer ON race_racer(
racer_id, created_at, log_id, race_id, race_racer_id);
You might have to change/optimize the order of the columns in the index.
3 Comments
Try an index on (racer_id, log_id).
That way it doesn't need to decide whether to get the selectivity of racer_id or the ordering of log_id. It can get both at the same time.
3 Comments
Explore related questions
See similar questions with these tags.
EXPLAIN(ANALYZE,BUFFERS,SETTINGS,VERBOSE). What framework are you using that precludes the use of window functions? If you're able to run an arbitrary query, then I don't see how a framework could impose such a limitation. Consider adding code to generate a synthetic data set with characteristics similar to the live data. At the very least, provide the number of races, number of racers, and the min, max, and mean number of racers per race.EXPLAIN(ANALYZE,BUFFERS,SETTINGS,VERBOSE), not justexplain. A plan is just a plan, if will never be fast, slow, good or bad. You have to execute the plan to get this information.