When does it make sense to put data in Elasticsearch vs. creating secondary indexes on the primary datastore?
Elasticsearch with another primary store
Pros:
- The primary datastore can stay optimised for its core read/write use cases.
- Elasticsearch supports more than key/value matching: fuzzy matching, full-text search, etc. (see the sketch after this list).
Cons:
- The search index can drift out of sync with the primary datastore.
- Two more components to manage (ES itself, plus a pipeline to insert into ES).
- Requires some form of change data capture (CDC) from the primary datastore.
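For illustration, here is a minimal sketch of the kind of query that goes beyond key/value lookups: a fuzzy full-text match that tolerates typos. The index name, field name, sample query, and local endpoint are made-up assumptions, and it assumes the elasticsearch-py 8.x client.

```python
# Sketch only: a fuzzy full-text query against a hypothetical "products" index.
# Assumes a local Elasticsearch node and the 8.x Python client.
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="products",
    query={
        "match": {
            "name": {
                "query": "blutooth speker",   # deliberately misspelled
                "fuzziness": "AUTO",          # edit distance chosen from term length
            }
        }
    },
)

for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("name"))
```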
Secondary indexes on Primary datastore
Pros:
- Fewer moving parts.
- Fewer consistency issues (at worst the secondary indexes are eventually consistent, rather than drifting out of sync with a separate store).
Cons:
- Not all datastores support secondary indexes.
- Secondary-index queries are often scatter-gather in distributed datastores; running them at high QPS will eat into the read/write capacity available to the primary access patterns (read/write by PK). A minimal single-node example follows this list.
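As a toy, single-node illustration of the secondary-index option (the schema here is invented, and SQLite stands in for whatever the real primary datastore is): the index makes a non-primary-key lookup cheap, while every write must now also maintain it.

```python
# Sketch only: a secondary index on a hypothetical "users" table in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, country TEXT)")
conn.execute("CREATE INDEX idx_users_country ON users (country)")  # the secondary index

conn.executemany(
    "INSERT INTO users (email, country) VALUES (?, ?)",
    [("a@example.com", "NL"), ("b@example.com", "DE"), ("c@example.com", "NL")],
)

# Query by the secondary key instead of the primary key; the query plan
# confirms the index is used rather than a full table scan.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT email FROM users WHERE country = ?", ("NL",)
).fetchall())
print(conn.execute("SELECT email FROM users WHERE country = ?", ("NL",)).fetchall())
```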
Are there other considerations when deciding between these two approaches?
2 Answers
I was given good advice by an architect early in my career: don't design in extra complexity to improve performance before you've discovered where the actual bottlenecks in the system are. Otherwise, you will likely end up optimizing the wrong area.
I’d recommend trying this out at full (possibly synthetic) volume on your primary data store with a secondary index, to determine whether that meets your performance requirements and whether it really is the bottleneck. A rough sketch of such a measurement follows.
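Something along these lines, again with SQLite and synthetic data purely for illustration; swap in your real datastore, schema, row counts, and latency budget.

```python
# Sketch only: load a realistic row count, then time the secondary-index query path.
import random
import sqlite3
import string
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, status TEXT)")
conn.execute("CREATE INDEX idx_orders_status ON orders (status)")

rows = [
    ("".join(random.choices(string.ascii_lowercase, k=8)),
     random.choice(["NEW", "PAID", "SHIPPED"]))
    for _ in range(1_000_000)  # approximate your real volume here
]
conn.executemany("INSERT INTO orders (customer, status) VALUES (?, ?)", rows)

start = time.perf_counter()
paid = conn.execute("SELECT COUNT(*) FROM orders WHERE status = 'PAID'").fetchone()[0]
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"{paid} matching rows in {elapsed_ms:.1f} ms")  # compare with your latency budget
```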
It would depend on the scale of your problem.
If you have identified one new query in the business domain that will be stable and used regularly, but is inefficient with the current schema, add the supporting index. Each DML statement must now keep that index in sync with the base table, however, so the system overall has more work to do; latency on everything will increase ever so slightly (see the timing sketch below).
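A toy illustration of that extra work on writes: the same bulk insert timed into a table with and without the supporting index. The absolute numbers are meaningless; the comparison is the point.

```python
# Sketch only: measure the write overhead introduced by maintaining an extra index.
import sqlite3
import time

def timed_inserts(with_index: bool) -> float:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, kind TEXT)")
    if with_index:
        conn.execute("CREATE INDEX idx_events_kind ON events (kind)")  # the supporting index
    rows = [(i % 1000, f"kind_{i % 50}") for i in range(200_000)]
    start = time.perf_counter()
    conn.executemany("INSERT INTO events (user_id, kind) VALUES (?, ?)", rows)
    conn.commit()
    return time.perf_counter() - start

print(f"without index: {timed_inserts(False):.2f}s")
print(f"with index:    {timed_inserts(True):.2f}s")
```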
If instead the requirement is to support arbitrary ad hoc queries over all tables, something like Elasticsearch will be the answer. The cost is that of syncing the two stores and the latency of that process; a naive polling-based sync is sketched below.
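One deliberately simple way to do that sync is to poll the primary store for rows changed since the last run and bulk-index them into Elasticsearch. Production setups more often use a CDC tool that reads the database's change log; the table, column, and index names here are assumptions, as is the elasticsearch-py 8.x client.

```python
# Sketch only: a polling-based sync from a relational primary store into Elasticsearch.
import sqlite3
from elasticsearch import Elasticsearch, helpers  # pip install elasticsearch

conn = sqlite3.connect("primary.db")           # hypothetical primary store
es = Elasticsearch("http://localhost:9200")

def sync_since(last_sync_ts: str) -> str:
    """Re-index every row updated after last_sync_ts; returns the new watermark.

    Assumes an ISO-8601 updated_at column so string comparison orders correctly.
    """
    rows = conn.execute(
        "SELECT id, name, description, updated_at FROM products WHERE updated_at > ?",
        (last_sync_ts,),
    ).fetchall()
    actions = (
        {"_index": "products", "_id": row[0],
         "_source": {"name": row[1], "description": row[2]}}
        for row in rows
    )
    helpers.bulk(es, actions)  # idempotent: the same _id overwrites the existing document
    return max((row[3] for row in rows), default=last_sync_ts)
```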
At some point the cumulative cost of incrementally adding those secondary indexes will be more than that of replicating to another storage engine. If you envisage that future as most likely, you can design for it from the outset. Otherwise go with the secondary indexes, monitor, and be prepared to switch.
I have maintained several large RDBMS applications. Often tables on latency-sensitive paths will have several secondary indexes, sometimes many. And I have chosen not to add indexes for non-sensitive workloads (say, overnight batch reporting) to minimise the impact on those same tables. There is a balance to be found according to what is important to this application, and there are no free lunches.