Context

I have a service responsible for handling search queries from our users. The goal of the project is to build a unified storage system that can hold various entities from our logistics domain; at the moment, it only stores logistics orders.

The idea is that if, for instance, I add a new delivery method using drones, I wouldn’t need to add new columns like drone_model in the database schema — instead, that data would simply be stored as part of the unified order record in a JSONB field. This way, users could search across entities by these custom attributes.

To implement this idea, I decided to store entity data in JSONB.

As a result, I have a table with the following fields:

id, entity_type, entity_id, created_at and entity_data (JSONB).
The main task is to retrieve entity_id values based on jsonpath queries.
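
For reference, a simplified sketch of the DDL (exact types are my approximation; entity_id behaves like text, since I sort by LENGTH(entity_id) further down):

CREATE TABLE entity (
    id          bigserial   PRIMARY KEY,
    entity_type text        NOT NULL,
    entity_id   text        NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now(),
    entity_data jsonb       NOT NULL
);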

Implementation issues

Initially, the table had only one GIN index:
gin(entity_data jsonb_path_ops).
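
Spelled out as DDL (the index name here is my own placeholder):

CREATE INDEX entity__gin__entity_data__idx
ON entity USING gin (entity_data jsonb_path_ops);

With jsonb_path_ops, such an index can only support the @>, @? and @@ operators.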

During a dry-run under load, I noticed that some queries took over 20 seconds to complete — especially those involving date-based sorting, e.g.
.date >= "..." && .date <= "....".
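
In SQL, such a query looked roughly like this (the dates are placeholders):

SELECT entity_id
FROM entity
WHERE entity_type = 'order'
  AND entity_data @? '$ ? (@.date >= "2025-01-01" && @.date <= "2025-01-31")';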

To improve performance, I added btree expression indexes on specific JSON fields (a jsonb_path_ops GIN index cannot serve range comparisons like these):

CREATE INDEX entity__btree__date__idx
ON entity USING btree ((entity_data->>'date'))
WHERE entity_type = 'order' AND entity_data->>'date' IS NOT NULL;

and another one for a nested field:

CREATE INDEX entity__btree__first_dispatch_loading_arrival_at__idx
ON entity USING btree (
    (entity_data->'attributes'->'order_dispatch_route'->'dispatches'->0->'loading_point'->>'arrival_at')
)
WHERE entity_type = 'order'
  AND entity_data->'attributes'->'order_dispatch_route'->'dispatches' IS NOT NULL
  AND jsonb_typeof(entity_data->'attributes'->'order_dispatch_route'->'dispatches') = 'array'
  AND jsonb_array_length(entity_data->'attributes'->'order_dispatch_route'->'dispatches') > 0
  AND entity_data->'attributes'->'order_dispatch_route'->'dispatches'->0->'loading_point'->>'arrival_at' IS NOT NULL;

To actually use these indexes, I extract the part of the incoming jsonpath condition that can be optimized (i.e. has an index) and append it to the final query via an AND clause.
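
For example, for the date index above, the rewritten query looks roughly like this (placeholder dates again):

SELECT entity_id
FROM entity
WHERE entity_type = 'order'
  -- the original jsonpath condition is kept as-is
  AND entity_data @? '$ ? (@.date >= "2025-01-01" && @.date <= "2025-01-31")'
  -- and an extracted, index-friendly copy is appended; it uses exactly
  -- the expression indexed by entity__btree__date__idx
  AND entity_data->>'date' >= '2025-01-01'
  AND entity_data->>'date' <= '2025-01-31';

The appended predicates have to match the indexed expression verbatim, otherwise the planner will not consider the index.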

After that, I ran another dry-run. The 20+ second queries were still there. Looking at the most CPU-hungry queries, I found that most were of this form:

$ ? (!((@.status == "CANCELED" || @.status == "COMPLETE")) && (@.tariff_type == "Shuttle") && @.vehicle_plate_number like_regex ".*E292.*" && (@.organization_id == 2226) && @.namespace == "some-namespace")

Execution time: 35 seconds

After that, I tried adding a composite index to optimize such queries:

CREATE INDEX entity__btree__organization_id__namespace__tariff_type__idx
ON entity USING btree (
    ((entity_data->>'organization_id')::bigint),
    (entity_data->>'namespace'),
    (entity_data->>'tariff_type')
)
WHERE entity_type = 'order'
  AND entity_data->>'organization_id' IS NOT NULL
  AND entity_data->>'namespace' IS NOT NULL
  AND entity_data->>'tariff_type' IS NOT NULL;

I hoped this would help with the heaviest queries. However, after running another dry-run, CPU usage spiked to 1000%+, and my service throughput (on read) dropped from 50 RPS down to 20 RPS.

I had to urgently shut down the service because the database was hitting CPU saturation.

Even when I repeated the test with zero write load, the issue persisted.

So now the main question: why did adding this particular index cause such severe CPU overhead? Interestingly, the queries I targeted with this index did get faster; they now complete in 1–3 seconds instead of 20+.

Overall, I’d love to hear your opinion on my plan for implementing unified search via JSONB. I understand there’s no single "magic index" to solve everything, but I want to get a clearer sense of which direction I should move in.

Some more numbers:

Total rows: 3,287,327

Total on-disk size, including indexes and TOAST:
SELECT pg_size_pretty(pg_total_relation_size('entity')); -- 9 GB

Table (heap) size alone:
SELECT pg_size_pretty(pg_relation_size('entity')); -- 4 GB

About 18% of all rows have a JSONB value larger than 2 KB.
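
(That last figure can be reproduced with a query along these lines; pg_column_size reports the stored, possibly compressed, size of the value:)

-- share of rows whose stored JSONB value exceeds 2 KB
SELECT round(100.0 * count(*) FILTER (WHERE pg_column_size(entity_data) > 2048)
             / count(*), 1) AS pct_over_2kb
FROM entity;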

explain (analyze, buffers)
SELECT entity_id
FROM entity
WHERE entity_type = 'order'
 AND ((entity_data->>'organization_id')::bigint) = 2226
 AND (entity_data->>'namespace') = 'some-orders'
 AND (entity_data->>'tariff_type') = 'Shuttle'
 AND entity_data @? '$ ? (!((@.status == "CANCELED" || @.status == "COMPLETE")))'
ORDER BY LENGTH(entity_id) DESC, entity_id DESC
LIMIT 10 OFFSET 0;
Limit  (cost=2.47..2.47 rows=1 width=12) (actual time=30039.434..30039.446 rows=10 loops=1)
  ->  Sort  (cost=2.47..2.47 rows=1 width=12) (actual time=30039.432..30039.443 rows=10 loops=1)
        Sort Key: (length(entity_id)) DESC, entity_id DESC
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Index Scan using entity__btree__organization_id__namespace__tariff_type__idx on entity  (cost=0.43..2.46 rows=1 width=12) (actual time=1.090..29975.095 rows=51424 loops=1)
              Index Cond: ((((entity_data ->> 'organization_id'::text))::bigint = 2226) AND ((entity_data ->> 'namespace'::text) = 'some-orders'::text) AND ((entity_data ->> 'tariff_type'::text) = 'Shuttle'::text))
              Filter: (entity_data @? '$?(!(@."status" == "CANCELED" || @."status" == "COMPLETE"))'::jsonpath)
              Rows Removed by Filter: 19459
Planning Time: 2.627 ms
Execution Time: 30039.531 ms
One comment so far: "I'd extract the columns you frequently need in your WHERE conditions from the JSON and define regular table columns for them."
