[Python-Dev] Re: Should set objects maintain insertion order too?

2019年12月29日 18:03:35 -0800

Due to version and platform constraints, a SAT solver is necessary for
conda and (now) pip. Libsolv is one of the fastest around.
 https://github.com/QuantStack/mamba is conda implemented with libsolv (and
parallel downloading of *declarative* dependency metadata).
For general purpose graphs with implicit node instantation from edge
declaration, NetworkX has implementations of many graph algorithms.
https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.dag.topological_sort.html
CuGraph (a product of the RAPIDS.ai project) does graphs with CUDA (from
Python) with a "NetworkX-like API" but doesn't yet have a topological sort
(though it does have BFS)
https://github.com/rapidsai/cugraph
"pip needs a dependency resolver" + End (Fn+Right) links to the latest work
on the new pip code (that must require declarative dependency metadata)
https://github.com/pypa/pip/issues/988
That said, this implementation of topo sort could have a deterministic
output given an OrderedSet:
https://rosettacode.org/wiki/Topological_sort#Python
A table including Big-O and memory usage for the necessary data structure
and methods would be useful.
On Sun, Dec 29, 2019, 5:12 PM Tim Peters <[email protected]> wrote:
> [Larry]
> > It's a lightweight abstract dependency graph. Its nodes are opaque,
> > only required to be hashable. And it doesn't require that you give it
> > all the nodes in strict dependency order.
> >
> > When you add a node, you can also optionally specify
> > dependencies, and those dependencies aren't required
> > to be nodes that the graph has previously seen. So you can add
> > a node A that depends on B and C, without showing it B or C
> > first. When that happens it creates placeholder nodes for B
> > and C, and naturally these nodes have no dependencies. You
> > can then later inform the graph that node B has dependencies
> > on X Y and Z.
> >
> > (My specific use case is a package manager. Packages are identified
> > with unique strings. So the nodes I give the give the graph are simply
> > those package names. It's pretty common for me to get a package
> > with dependencies on packages I haven't seen yet. Designing the
> > graph to create placeholders let me skip making two passes over
> > the package list, pre-creating the package objects, etc etc. This
> > made the code simpler and presumably faster.)
> >
> > The graph API maintains an externally-visible "ready" iterable of
> > nodes. And since B can go from ready to not-ready, it can get added
> > to "ready" and subsequently removed.
> >
> > Also, when a node is satisfied, you simply remove it from the graph.
> > If A depends on B and C, and they all represent jobs, and B and C
> > have no dependencies, they'll be in the "ready" iterable. You iterate
> > over "ready", and execute those jobs, and assuming they are
> > successful you then remove them from the graph. When A's
> > dependencies are all satisfied, it'll get added to the "ready" iterable.
> > And naturally when B and C were removed from the graph, they were
> > removed from the "ready" iterable too.
>
> OK! You're doing a topological sort. There are natural & simple ways
> to do that right now that scale efficiently to very large graphs (time
> & space linear in the sum of the number of nodes and the number of
> dependencies). Curiously, the difficulties are mostly driven by the
> quality of _error_ messages you want (in case of, e.g., cycles in the
> dependency graph).
>
> Lots of those implementations became deterministic "by magic" when
> ordered dicts were introduced. This is why: a bare bones
> implementation needs to associate two pieces of info with each node:
> a list of its immediate successors, and an integer count of the number
> of its immediate predecessors. A dict is the natural way to implement
> that association.
>
> When all the dependency info has been entered, then:
>
> - First all nodes are emitted whose predecessor count is 0. Iterating
> over the association dict once is the natural way to find them, and
> that order is defined now.
>
> - Then, as each emitted node N is marked done:
> for child in N.successors:
> assert child.predcount > 0
> child.predcount -= 1
> if child.predcount == 0:
> emit(child)
>
> That's about all there is to it. But there's no end to the cruft that
> _can_ be added to, e.g., verify that a node is marked done no more
> than once, or to compute & display strongly connected components if
> not all nodes are emitted, etc.
>
> Ordered sets could improve this "natural" implementation if the
> successor lists were successor sets instead, simply because then -
> like lists - their iteration order would be defined, which can in turn
> affect the order in which nodes are emitted in the loop just above.
> But lists are adequate to the task, are cheaper in both time and
> space, and can't _not_ be ordered ;-)
>
>
> > Thus it's this "ready" collection that I want to a) be iterable, b) be
> stable, and
> > c) support fast removal. I add and remove nodes from that set a lot,
> which I
> > realized would perturb the iteration order from run to run, etc etc etc,
> and here
> > we are.
>
> The sketch above (which is a bog-common way to implement topsorts)
> doesn't build a data structure recording the final order, and, in
> fact, never deletes anything. You might protest that the initial
> iteration step (scanning the association dict to find nodes with no
> predecessors) is an expense you skip because nodes with predecessors
> are deleted from your "ready" structure all along. But nodes with
> predecessors are _touched_ either way, and merely skipping over them
> is bound to be cheaper than deleting them.
>
> After that initial step, no search of any kind is needed again.
>
> > I grant you maybe it's a weird approach. But after two false starts,
> where I
> > was starting to boil the oceans with ABCs and "node is satisfied" APis
> and
> > separate iterator objects, I had a re-think and hit on this super
> lightweight
> > design. I'm actually quite pleased with it--it's short, it has a
> minimal API,
> > it's readable, it was easy to get right, and it solves my problem.
>
> Whereas I would have started with a topsort and finished it while I
> was sleeping ;-) Seriously, I've written such a thing many times, but
> never reuse the code because it's so easy to redo from scratch. Every
> application has _some_ unique twist, which makes a general-purpose API
> so bloated it's harder to figure out how to use than to write custom
> code.
>
> In your application, I'm guessing (but don't know) that when a package
> name is emitted, it's possible you're not able to find the package,
> and so the process has to stop. That's why I inserted a "as each
> emitted node N is marked done" step to my description. That is, I'm
> picturing an API that adds a (say)
>
> topsorter.record_succeeded(node)
>
> method to say that - yes - the node was processed successfully. Else,
> by default, its successors (if any) must not be emitted.
>
> But if that isn't needed, the whole topsort order can be materialized
> into (say) a list in one gulp, or delivered by a generator.
>
>
> > Happy new year,
>
> And to you too! :-)
> _______________________________________________
> Python-Dev mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/[email protected]/message/HRKQPTPQQNIU3DYMYUYRC7USVRJGMRFC/
> Code of Conduct: http://python.org/psf/codeofconduct/
>
_______________________________________________
Python-Dev mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/[email protected]/message/6UKXR63M6NYI3EQRXL7AOH3AWMTSNXUW/
Code of Conduct: http://python.org/psf/codeofconduct/

Reply via email to