Setup/Intro
My Neo4j graph has 10k+ nodes. On the frontend app I need to display a sub-graph (100-500 nodes) between 2 start/end nodes, along with info about the critical path and, for each node, all of its dependencies (the upstream/downstream paths from the start node and to the end node).
The list of all possible start/end nodes is tiny (~10 pairs).
The start and end nodes are the params of the request.
The response currently sent from the middleware API to the UI looks something like this:
Nodes: [
{
Id: 4,
downstreamIds: [5,6,7], //all nodes on the paths leading to end node
upstreamIds: [1,2,3], //all nodes on the paths coming from start node
...
},
...
]
Problem
The issue is that each node needs 2 separate queries, one for its downstream list and one for its upstream list, like this one:
MATCH path = (o:Operation)-[:DEPENDS_ON*]->(start:Operation) WHERE id(o) = $operationId RETURN path
...so for n nodes I have 1 query for the nodes + 2n queries for downstream+upstream + 1 query for some aggregated stats.
By that count, it takes 1,002 queries (1 + 2 * 500 + 1) to fetch a start/end sub-graph that has 500 nodes in it.
The aggregated stats query is a single traversal and is fast, so it is not an issue.
However, overall this request can take up to 2 minutes in the worst case, i.e. when each node has all other nodes as downstream and upstream dependencies.
Possible solutions
1. Return a list of all relationships, which is up to 2n² edges (500 * 500 * 2 = 500,000 in the worst case), and calculate the downstream/upstream lists in the UI using JavaScript. I'm not really sure how to do that with Cypher. Also, storing 500,000 objects and filtering them in the UI doesn't sound right.
2. Pre-process the downstream/upstream queries for each node and cache the results in a separate fast key-value store. I'm thinking of something NoSQL like MongoDB. I would then request the nodes from the graph and get the dependencies from the key-value store with 1 extra query (much faster, no graph traversal).
Which is better? Any other solutions?
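To make the first option more concrete, here is a minimal sketch of deriving the per-node downstream/upstream lists from a single edge list instead of running 2n traversals. It is written in Python for brevity, and the same logic ports directly to JavaScript in the UI. The assumptions are mine, not from the question: the edges have already been fetched as (source, target) id pairs where the target is downstream of the source, and `reachable` and `build_nodes` are hypothetical helper names.

```python
from collections import defaultdict, deque


def reachable(adjacency, origin):
    """Return every node id reachable from origin (excluding origin itself)."""
    seen = set()
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, ()):
            # Skip already-visited nodes so cycles cannot loop forever.
            if neighbor not in seen and neighbor != origin:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def build_nodes(edges):
    """Build the response shape from the question out of (source, target) pairs."""
    forward = defaultdict(set)   # source -> direct downstream ids
    backward = defaultdict(set)  # target -> direct upstream ids
    node_ids = set()
    for source, target in edges:
        forward[source].add(target)
        backward[target].add(source)
        node_ids.update((source, target))
    return [
        {
            "Id": node_id,
            "downstreamIds": sorted(reachable(forward, node_id)),
            "upstreamIds": sorted(reachable(backward, node_id)),
        }
        for node_id in sorted(node_ids)
    ]


# Hypothetical edge list shaped like the example response above.
edges = [(1, 4), (2, 4), (3, 4), (4, 5), (4, 6), (6, 7)]
nodes = build_nodes(edges)
# The entry for node 4 has downstreamIds [5, 6, 7] and upstreamIds [1, 2, 3].
```

Each breadth-first search visits every edge at most once, so the whole response costs roughly 2n traversals in memory rather than 2n round trips to the database. If the dependency graph is acyclic, it has at most n(n-1)/2 edges (124,750 for n = 500), well below the 2n² worst case.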
Where does operationId come from? Could you query for all IDs in a single query, then split them back out on the client side? – user1937198, Aug 17, 2020 at 13:46
operationId is the id of the node I want to get the relationships for. This query runs once for each node. I need to get this info for each node separately. – IamMowgoud, Aug 17, 2020 at 14:18
1 Answer
I know this is a little more than half a year old, but thought I might share an answer just in case. I don't have a specific answer for this problem, but I can share my use case and personal solution for inspiration.
My use case for needing subgraphs: I have a chain of processes that produce output files which are inputs to other processes, e.g. (Process_1)-[:OUTPUTS]->(File)-[:INPUTS]->(Process_2). I wanted to find all downstream processes given any starting one. I use the APOC library's apoc.path.subgraphNodes procedure to define my path and return only the nodes.
MATCH (n:Process {id: $id})
CALL apoc.path.subgraphNodes(n, {
  relationshipFilter: 'OUTPUTS,INPUTS',
  labelFilter: '>Process'
})
YIELD node
RETURN node
In my case I didn't need to provide an ending node since I wanted the whole subgraph. But perhaps you could use the terminatorNodes parameter for that.
Thanks Christopher. I ended up caching the downstream IDs in a property. This solution suited my use case and improved performance dramatically. – IamMowgoud, Mar 30, 2021 at 8:38
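The comment does not say how the cached downstream IDs were computed, so the following is only one possible sketch in Python. It assumes the dependency graph is acyclic and that the edges are available as (source, target) id pairs with the target downstream of the source. The function name `downstream_closure` and the row shape are hypothetical. Each node's full downstream set is built once from its children's sets, and the resulting rows could then be written back to the nodes as a property.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def downstream_closure(edges):
    """Map each node id to the sorted list of all ids downstream of it.

    Assumes the edges form a DAG; static_order() raises CycleError otherwise.
    """
    children = {}
    for source, target in edges:
        children.setdefault(source, set()).add(target)
        children.setdefault(target, set())

    closure = {}
    # TopologicalSorter treats the mapped values as prerequisites, so passing
    # the children map yields every node after all of its children.
    for node in TopologicalSorter(children).static_order():
        downstream = set(children[node])
        for child in children[node]:
            downstream |= closure[child]  # reuse the child's finished set
        closure[node] = downstream
    return {node: sorted(ids) for node, ids in closure.items()}


# Hypothetical edge list; each row is what would be cached on a node.
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
rows = [
    {"id": node, "downstreamIds": ids}
    for node, ids in downstream_closure(edges).items()
]
```

Because each node reuses its children's finished sets, no path is traversed twice. The cached property does have to be recomputed whenever dependencies change.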