
I would really appreciate your help here with an important question about modifying our architecture.

  • We use an ETL process to fetch data from external services (for example, GitHub).

  • Extract the data (e.g. issues and PRs) and create raw_data objects.

  • Transform the data into our own objects - let's call them assets.

  • Load the assets as JSON (a jsonb column) into a Postgres DB (one generic assets table); a rough sketch of that table is below.
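A minimal sketch of what such a generic table might look like (column names are illustrative, not our exact schema):

    CREATE TABLE assets (
        id         bigserial PRIMARY KEY,
        asset_type text NOT NULL,          -- e.g. 'user', 'group', 'issue', 'pull_request'
        data       jsonb NOT NULL,         -- the whole transformed asset as JSON
        loaded_at  timestamptz DEFAULT now()
    );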

We have a few problems with this approach, and we are now considering changing our pipeline.

Bidirectional asset connections: our assets have direct relationships between them; for example, every user has groups and every group has users. Currently we manually fill in the relationship from each side, again and again, before loading the data into Postgres, and we manage all of this in memory, which does not scale well.

Very slow analysis: we need to be able to present the whole JSON, but we also need to run analyses across all of the assets. The problem is that once there are many assets (many JSON documents in the assets table), the analysis becomes very slow. For example, for 'find all issues that are related to this pull request' we have to iterate over all of the stored JSON documents and run a regex search on a specific field.
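To illustrate, the analysis today boils down to something like this (the field name and regex pattern are made up, but the shape is the same): a sequential scan over every document with a regex match on one extracted field.

    SELECT *
    FROM assets
    WHERE data->>'body' ~ 'PR[-# ]?1234';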

What would you recommend in this case? Suggestions:

Manage it in Postgres: use functions to convert the JSON into tables, and create triggers to fill in the bidirectional relations (a rough trigger sketch follows this list).

A data warehouse? I lack knowledge on the subject, but I am not sure it would be ideal, first of all because of OLAP vs. OLTP: we need to run analyses but also to be able to present whole rows. And how would we fill in the bidirectional connections?

A graph DB? It sounds ideal for the bidirectional relationships, but I am not sure about scaling problems.
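To make the first option concrete, I imagine something roughly like this (the table, column, and field names are made up for illustration; user_groups is assumed to have PRIMARY KEY (user_id, group_id)):

    -- Trigger function that mirrors the "groups" array of a user asset into a
    -- plain link table, so the user<->group relation exists on both sides
    -- without the manual in-memory work.
    CREATE OR REPLACE FUNCTION fill_user_groups() RETURNS trigger AS $$
    BEGIN
        IF NEW.data->>'type' = 'user' THEN
            INSERT INTO user_groups (user_id, group_id)
            SELECT NEW.data->>'id', g->>'id'
            FROM jsonb_array_elements(NEW.data->'groups') AS g
            ON CONFLICT DO NOTHING;
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER assets_fill_user_groups
    AFTER INSERT OR UPDATE ON assets
    FOR EACH ROW EXECUTE FUNCTION fill_user_groups();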

What do you think guys? What would you do in this case?

Rohit Gupta
asked Feb 28, 2023 at 11:15

1 Answer

There are two distinct issues here:

One is the storage of data in JSON

PostgreSQL (like all other relational databases) is fundamentally designed to process relational structures, i.e. tables. It is very efficient when it comes to filtering, joining and aggregating those.

Like other relational databases, PostgreSQL has been extended to support other data structures, such as JSON. To a first approximation, you can imagine that constantly parsing complex JSON text is going to be costly. Combine that with regex searches and you are in for a very slow application, or one that requires massive hardware to perform.

I don't know every detail of how JSON support is implemented inside PostgreSQL; the jsonb type does store documents in a pre-parsed binary form and supports indexing (GIN indexes, for example), which helps. But if the documents are effectively treated as plain JSON strings and scanned with regexes, the processing will be very inefficient.
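As a hedged sketch (assuming the assets table and jsonb column named data from the question), a containment query backed by a GIN index can avoid a full scan of the documents:

    -- Index the jsonb column for containment (@>) queries.
    CREATE INDEX assets_data_gin ON assets USING GIN (data jsonb_path_ops);

    -- Example: find issue documents that reference a given PR id, assuming the
    -- documents carry a "linked_pr" field (an assumption for illustration).
    SELECT *
    FROM assets
    WHERE data @> '{"type": "issue", "linked_pr": 1234}';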

So, my suggestion is to shred the JSON documents into plain relational tables (i.e. your "manage it in Postgres" option). You can do that using JSON operations in the code that feeds your database; you did not mention what language you use for that. The better approach IMO is Python, because JSON objects can easily be turned into Python dicts and from there easily processed and extracted into plain relational structures. If Java, you can use Jackson for the same purpose. Or you can use PostgreSQL's JSON functions to do it inside the database, driven by stored procedures and triggers. The choice is yours.
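A minimal sketch of the in-database variant, assuming the generic assets table from the question and an embedded array of PR references on each issue document (the field names are assumptions):

    -- A plain link table; one table serves both directions of the relation.
    CREATE TABLE IF NOT EXISTS issue_pr_links (
        issue_id bigint NOT NULL,
        pr_id    bigint NOT NULL,
        PRIMARY KEY (issue_id, pr_id)
    );

    -- Shred the embedded PR references out of every issue document.
    INSERT INTO issue_pr_links (issue_id, pr_id)
    SELECT (a.data->>'id')::bigint, (pr->>'id')::bigint
    FROM assets a
    CROSS JOIN LATERAL jsonb_array_elements(a.data->'linked_prs') AS pr
    WHERE a.data->>'type' = 'issue'
    ON CONFLICT DO NOTHING;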

Once your data is in plain relational structures, then you can use plain SQL queries on your data.
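For example, the "issues related to a pull request" question from your post becomes a single-hop join against the link table sketched above (names remain illustrative):

    SELECT i.issue_id, i.title
    FROM issues i
    JOIN issue_pr_links l ON l.issue_id = i.issue_id
    WHERE l.pr_id = 1234;   -- hypothetical pull request id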

The other is the need for graph-based processing

You mention graph databases as an alternative. That may or may not be useful. It depends on the kind of queries you do on your data. Graph structures shine when you do multi-hop queries.

For single-hop queries, plain joins are just as efficient (probably more so). For example, finding all assets directly related to other assets is a plain join, just like finding customers who purchased certain products.

However, finding assets related to assets related to assets related to assets ... i.e. analyzing dependencies in a BOM-like structure, is difficult to express in relational terms (you need recursive query syntax; see the sketch after the list below) and is generally inefficient. Again, this is where graphs shine:

  • They allow you to express such multi-hop queries in a very simple syntax (like the MATCH clause in PGQL or the similar clause in Neo4j's Cypher language). That syntax even lets you search to unlimited depth, i.e. in a single statement find all the final assemblies that depend on one particular screw, irrespective of the complexity of the assembly.

  • They operate on data structures that are specifically optimized for graphs (sparse matrices, pointers etc).
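For contrast, here is roughly what the recursive SQL mentioned above could look like, assuming a link table asset_links(child_id, parent_id) with one row per "child is a component of parent" (the names and starting id are made up):

    WITH RECURSIVE uses_screw(asset_id) AS (
        SELECT parent_id FROM asset_links WHERE child_id = 'screw-42'
      UNION
        -- UNION (not UNION ALL) de-duplicates, so the recursion terminates
        -- even if the link data contains cycles.
        SELECT l.parent_id
        FROM asset_links l
        JOIN uses_screw u ON l.child_id = u.asset_id
    )
    SELECT * FROM uses_screw;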

So, if this is your need (detecting complex multi-level dependencies), then yes, you may want to consider graph techniques: specialized graph databases like Neo4j or TigerGraph, relational databases with graph extensions like PostgreSQL's recently announced AGE extension, Oracle with Oracle Property Graph, or even RDF.

answered Mar 1, 2023 at 10:17
