Packages · guacsec/trustify · Discussion #478

ctron
Jun 28, 2024
Maintainer

I am sorry for re-iterating over this. But maybe we can find a better way to name and do things. So I'll try to take a step back, maybe it needs a bit more than just renaming. Maybe not.

I'm assuming the intention is to ingest all kinds of stuff, and then collect/aggregate that data into a model that grows the more we ingest.

Right now, we have the following tables:

Qualified Package -> Versioned Package -> Package

However, these tables only store fragments of PURLs, in a way that lets us easily reference things inside the database. Better names would IMO be:

Qualified PURL -> Versioned PURL -> Base PURL

This set of information grows with each SBOM we ingest, because we extract PURLs from SBOMs. We also extract PURLs from other sources. But let's ignore that for now.
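To make the three levels concrete, here is a minimal sketch of how one PURL could be split into the fragments those tables would hold. The struct names and the `split_purl` helper are hypothetical, not trustify's actual code, and the parsing is deliberately simplified: no percent-decoding, subpaths are dropped, and a PURL without a version is rejected.

```rust
use std::collections::BTreeMap;

/// Base PURL: type, optional namespace, and name.
#[derive(Debug, Clone, PartialEq, Eq)]
struct BasePurl {
    ty: String,
    namespace: Option<String>,
    name: String,
}

/// Versioned PURL: a base PURL plus a version.
#[derive(Debug, Clone, PartialEq, Eq)]
struct VersionedPurl {
    base: BasePurl,
    version: String,
}

/// Qualified PURL: a versioned PURL plus qualifiers such as `type=jar`.
#[derive(Debug, Clone, PartialEq, Eq)]
struct QualifiedPurl {
    versioned: VersionedPurl,
    qualifiers: BTreeMap<String, String>,
}

/// Simplified parser (an assumption for illustration only).
/// Expects `pkg:type/[namespace/]name@version[?key=value&...]`.
fn split_purl(purl: &str) -> Option<QualifiedPurl> {
    let rest = purl.strip_prefix("pkg:")?;
    // Drop an optional `#subpath`; this sketch does not model it.
    let rest = rest.split('#').next()?;
    // Separate the qualifiers first, so an '@' inside a qualifier value is harmless.
    let (path, query) = match rest.split_once('?') {
        Some((p, q)) => (p, Some(q)),
        None => (rest, None),
    };
    let (path, version) = path.rsplit_once('@')?;
    let (ty, full_name) = path.split_once('/')?;
    let (namespace, name) = match full_name.rsplit_once('/') {
        Some((ns, n)) => (Some(ns.to_string()), n),
        None => (None, full_name),
    };
    if ty.is_empty() || name.is_empty() || version.is_empty() {
        return None;
    }
    let mut qualifiers = BTreeMap::new();
    if let Some(q) = query {
        for pair in q.split('&').filter(|p| !p.is_empty()) {
            let (k, v) = pair.split_once('=')?;
            qualifiers.insert(k.to_string(), v.to_string());
        }
    }
    Some(QualifiedPurl {
        versioned: VersionedPurl {
            base: BasePurl {
                ty: ty.to_string(),
                namespace,
                name: name.to_string(),
            },
            version: version.to_string(),
        },
        qualifiers,
    })
}
```

In this sketch, two SBOMs that mention the same artifact with different qualifiers would share one base and one versioned fragment and differ only at the qualified level.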

A simplified view on the SBOMs looks like this:

SBOM:
 sbom: Uuid
SBOM Package:
 sbom: Uuid
 node: Uuid
 name: String
 version: Option<String>
SBOM -[0..*]-> SBOM Packages -[0..*]-> Qualified PURL
                             -[0..*]-> CPE

So SBOMs contain packages (SBOM packages) and may (or may not) declare an alternative name for their packages.

We can browse through SBOM packages, get them by ID. Get relationships between them. All without ever touching PURLs.

In some cases (for RH data, in most cases), these have PURLs attached, which we can use to cross-reference other documents that use PURLs.

Going through the conversations again, I think we might actually miss another "package".

SBOM -[0..*]-> SBOM Packages -[1..1]-> THIS_PACKAGE -[0..*]-> Qualified PURL
                                                    -[0..*]-> CPE
THIS_PACKAGE:
 purls: Vec<>,
 cpes: Vec<>,
 ...

THIS_PACKAGE is independent of an SBOM. And collects (grows) with each package that gets ingested into the system.

The question to me is: how do we identify this package?

By name (from the SBOM) won't work. By hash of an artifact? When ingesting a new SBOM package, how would we know which THIS_PACKAGE it needs to contribute its information to?

And if we move references (like PURLs and CPEs) from the SBOM package to THIS_PACKAGE, how would we know what the SBOM contributed? How the SBOM named that package (aside from the SBOM package name)?

On the other side, do we really need to store THIS_PACKAGE in the database? All the information is there via the SBOM packages anyway. Using SBOM packages also doesn't cause the issue of not knowing where it came from or finding an identifier that would be required to aggregate information.

Maybe THIS_PACKAGE is just a virtual construct, returned by some APIs, based on the PURL tables and the SBOM packages?
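A minimal sketch of that "virtual" idea, under loud assumptions: the struct and function names are hypothetical, `u128` stands in for a UUID to keep the example dependency-free, and the lookup key is a single PURL string, which is only one of the possible answers to the identification question above. Nothing is stored; the view is derived from SBOM packages on demand, so provenance is never lost.

```rust
use std::collections::BTreeSet;

/// An SBOM-owned package, following the simplified model above.
struct SbomPackage {
    sbom: u128,
    node: u128,
    name: String,
    #[allow(dead_code)]
    version: Option<String>,
    purls: Vec<String>,
    cpes: Vec<String>,
}

/// Virtual "package": computed per request, never persisted.
#[derive(Debug, Default, PartialEq)]
struct PackageView {
    purls: BTreeSet<String>,
    cpes: BTreeSet<String>,
    /// (sbom, node, name used by that SBOM) for every contributing SBOM package.
    described_by: Vec<(u128, u128, String)>,
}

/// Collect everything the known SBOM packages say about one PURL.
fn package_view(purl: &str, packages: &[SbomPackage]) -> PackageView {
    let mut view = PackageView::default();
    for p in packages
        .iter()
        .filter(|p| p.purls.iter().any(|x| x == purl))
    {
        view.purls.extend(p.purls.iter().cloned());
        view.cpes.extend(p.cpes.iter().cloned());
        view.described_by.push((p.sbom, p.node, p.name.clone()));
    }
    view
}
```

Because each entry in `described_by` keeps the SBOM and node it came from, the questions "what did this SBOM contribute?" and "how did this SBOM name the package?" remain answerable. SBOM packages without any PURL simply never show up in such a view, which is a limitation of choosing the PURL as the key.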

We could call that "package". Maybe there's a better name for that too?

Replies: 6 comments 8 replies

bobmcwhirter
Jun 28, 2024

Thanks for this.

First, I'm unclear on what THIS_PACKAGE ... is? So any further comments are based on a giant black-hole of my understanding.

It sounds like SBOMs aggregate/reference a multitude of $things, which may be addressed by various names.

Perhaps if we take @JimFuller-RedHat's thought of calling those $things Components, we can avoid other thing-or-name-of-thing issues.

Given that a pURL is a "package URL", it feels like "package" may indeed be "things you can assign a pURL to".

Likewise, CPEs are names of $things, but tend to name "products" for want of a better word.

Like pURLs, CPEs can be roughly ambiguous, in that they are based around pattern-matching. While every product identifiable by a CPE should have a canonical CPE, an arbitrary CPE (with possible wildcards) may not canonically point to a single product.

So, thinking, in human prose (not DB DDL)...

An SBOM references 0-or-more components
A component may be a reference to a package or a product (or a third thing we've not yet discovered).

That being said, I think you're 100% right in that our current package-related tables are indeed pURL-centric.

I also agree that those tables could/should be better named, such as base_purl, versioned_purl, and qualified_purl.

Likewise, we have a cpe table that may or may not be in play at the moment (ignorance on my part).

Jumping up from human prose to APIs, connecting through the DB DDL...

A human wants to find packages and products, and just so happens to need to use a pURL or CPE to communicate their desires.

If we stick to the idea that a "package" is "anything addressable using a pURL", then /api/v1/package/... still continues to make sense to me. Sometimes we want to speak about log4j. Sometimes we want to speak about log4j@1.2.3. But a human is still talking about "the package generally known as Apache Log4j".

Likewise, a human may want to understand things about a product (RHEL, RHEL8.2, RHEL8.2 on Sparc), and may end up using a CPE to do so. Ergo, still /api/v1/product/... endpoints.

So if we separate the human-facing prose-centric desires from the DB DDL implementation details, I think my proposal is...

SBOM "packages" should be DDLd as "components" name-wise
Our "package" tables which are pURL-centric should be renamed to be pURL-centric.
Our "package" API makes sense, where human-provided keys are pURL-centric.
Our "product" API makes sense, where human-provided keys are CPE-centric.
Both of the above APIs could/should have our UUID-based escape hatch for simpler URLs and determinism.

Another analogy would be the use-case of "I'd like to call Jim on the telephone". You certainly dial his phone number (the key used to indicate your desires), but you're not talking to the phone number. You're talking to Jim on the other end of the connection.

"I liked to call Jim" -> "I'd like information about log4j, version 1.2.3"
<dials +1-404-xxx-xxxx> -> /api/v1/package/{purlish}
Chats with Jim -> learns about log4j version 1.2.3

0 replies

bobmcwhirter
Jun 28, 2024

wrt THIS_PACKAGE...

Perhaps this is ... another set of tables?

Just like a product can be associated with a cpe, maybe we need another package table that can reference to a qualified_purl table?

Or possibly 1+ pURLs/CPEs, depending?

Keep the distinction between $things and $one_of_possibly_many_names_for_a_thing.

0 replies

bobmcwhirter
Jun 28, 2024

Part of our ambiguity in pURLs, and our table layout is to support "pointing to more than a single bag of bits".

An advisory says log4j@[2.0,3.0) is affected, let's say. So we point to the versionless purl table, joined to a version-range table. And then we want to know which concrete fully-qualified pURLs fall inside that assertion.

Ultimately, to answer the questions you posited above.
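As a rough illustration of the range lookup described above, the sketch below checks which known versions fall inside a half-open range like [2.0,3.0). All names are hypothetical, and the version ordering is a simplifying assumption: only dotted numeric versions are understood, and anything else (for example 2.0-beta9) is treated as not matching. Real ecosystems each have their own ordering rules.

```rust
/// Parse a dotted numeric version such as "2.17.1".
/// Trailing zeros are dropped so "2", "2.0" and "2.0.0" compare as equal.
fn parse_version(v: &str) -> Option<Vec<u64>> {
    let mut parts = v
        .split('.')
        .map(|p| p.parse::<u64>().ok())
        .collect::<Option<Vec<_>>>()?;
    while parts.last() == Some(&0) {
        parts.pop();
    }
    Some(parts)
}

/// Half-open version range `[low, high)`.
struct VersionRange {
    low: Vec<u64>,
    high: Vec<u64>,
}

impl VersionRange {
    fn new(low: &str, high: &str) -> Option<Self> {
        Some(Self {
            low: parse_version(low)?,
            high: parse_version(high)?,
        })
    }

    /// Versions this sketch cannot parse are reported as not contained.
    fn contains(&self, version: &str) -> bool {
        match parse_version(version) {
            Some(v) => self.low <= v && v < self.high,
            None => false,
        }
    }
}

/// Given the versions known for one versionless PURL,
/// return those that the advisory's range covers.
fn affected<'a>(range: &VersionRange, known: &[&'a str]) -> Vec<&'a str> {
    known
        .iter()
        .copied()
        .filter(|v| range.contains(v))
        .collect()
}
```

In database terms, `known` would correspond to the versioned rows hanging off one base PURL row, and the range to the advisory's assertion joined against them.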

0 replies

ctron
Jul 1, 2024
Maintainer Author

I wish there were a mindmap mode for discussions. Branching off individual topics :)

So I'll start slow and try to separate this:

The idea of THIS_PACKAGE was to have (without having a proper name) a counterpart to an "SBOM package" but on a global/universal level. An SBOM package is a resource/thing owned by an SBOM. A global package (this_package) is a package which exists outside of any SBOM. There is a reference from an SBOM package to that THIS_PACKAGE. Probably more than one SBOM package points to that global THIS_PACKAGE.

Then again, that might just be a virtual thing, as the same can be achieved by doing the lookups we do today.

0 replies

ctron
Jul 1, 2024
Maintainer Author

I think I mostly agree with your initial comment on this @bobmcwhirter ... I am not sure if "component" is a better pick for a name, because it feels like a term that is equally ambiguous and overloaded. Aside from that, SPDX uses "package" and CDX uses "component", but they mostly mean the same thing.

What might make sense, is to come up with a mapping glossary: Trustify / SPDX, Trustify / CDX, Trustify / Klingon.

And as you said, the endpoints still deal with "packages", that's why I think it makes sense keeping that prefix. But part of this interaction is based on PURLs. So it might make sense to have that by-purl indicator. And we can extend this with by-cpe or by-digest to make clear when we do operations by other identifiers. I think this pattern works.

And it might be, that those endpoints go to the same service functions internally, just with different enums or ID types. But I think having a /api/v1/package/by-purl/{purl} is much clearer than having something like /api/v1/package/{anything}. Because it makes it clear on the API what to expect.
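A minimal sketch of that pattern, assuming hypothetical names throughout (`PackageKey`, `parse_package_path`, `describe_lookup`) and assuming the path value has already been URL-decoded: each by-xxx segment maps onto one variant of an enum, and a single internal function handles all variants.

```rust
/// The identifier kinds mentioned above, one variant per `by-xxx` segment.
#[derive(Debug, PartialEq, Eq)]
enum PackageKey {
    Purl(String),
    Cpe(String),
    Digest(String),
}

/// Map a request path onto a typed key.
/// Everything after the selector segment is the value, so PURLs may contain '/'.
fn parse_package_path(path: &str) -> Option<PackageKey> {
    let rest = path.strip_prefix("/api/v1/package/")?;
    let (selector, value) = rest.split_once('/')?;
    if value.is_empty() {
        return None;
    }
    match selector {
        "by-purl" => Some(PackageKey::Purl(value.to_string())),
        "by-cpe" => Some(PackageKey::Cpe(value.to_string())),
        "by-digest" => Some(PackageKey::Digest(value.to_string())),
        // No catch-all `{anything}` route: unknown selectors are rejected.
        _ => None,
    }
}

/// Stand-in for the shared service function: one entry point,
/// with the key type telling it which lookup to perform.
fn describe_lookup(key: &PackageKey) -> String {
    match key {
        PackageKey::Purl(p) => format!("lookup by PURL {}", p),
        PackageKey::Cpe(c) => format!("lookup by CPE {}", c),
        PackageKey::Digest(d) => format!("lookup by digest {}", d),
    }
}
```

The point of the enum is that the caller states which kind of identifier it is passing, so the service never has to guess whether a string is a PURL, a CPE, or something else.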

I am not sure how to name things internally. I think what would help a lot is to add code comments, to explain what the idea of a function or structure is. Renaming all structs and functions might be overkill.

Renaming the database tables is the right thing to do IMO.

5 replies

jcrossley3 Jul 1, 2024
Maintainer

And it might be, that those endpoints go to the same service functions internally, just with different enums or ID types. But I think having a /api/v1/package/by-purl/{purl} is much clearer than having something like /api/v1/package/{anything}. Because it makes it clear on the API what to expect.

Though I agree that /api/v1/package/{anything} is too clever, I feel like our API design is being driven by our DDL rather than the UI. If the common way the UI requests a package is by purl, then we need /api/v1/package/{purl}. If the UI less commonly needs to request a single package by {cpe}, {uuid} or {digest}, then we can add those .../by-xxx/{xxx} path segments. But let's not do it just because we can. We control the UI, so we can remove the ambiguity in our responses (why return multiple identifiers for a single resource?) and simplify our API.

ctron Jul 2, 2024
Maintainer Author

I agree that we should have an optimized API for the UI. However, I don't think that the UI is the only client to the API.

Having a /api/v1/package/{purl} just for the UI sake, might still make it confusing for all others. On the other side, for the UI, it's "just carlos" that needs to code it once. So the user will never interact with that API directly. So I think that the path actually shouldn't matter.

carlosthe19916 Jul 2, 2024
Collaborator

IMHO, generally speaking, we can do better in 2 things:

Single way of fetching entities: It does not matter whether the "key" of a package is a PURL, UUID, name, etc. What matters is that there must be a single consistent way of identifying and fetching that entity. If we add other ways of fetching that entity besides the "main" key, that is a bonus. I mentioned "Package", but the same concept applies to every single entity.
- This issue is a clear/concrete example of not being consistent when defining the main key for fetching packages: "/api/v1/package/by-purl/{id} requires an UUID rather than a PURL" (#490)
DTO models: A "Package" should be a "Package" regardless of where it appears. If I hit Endpoint1 and the response tells me EntityA has field1, field2, field3, then when I hit another endpoint Endpoint2 whose response also includes EntityA, EntityA should have field1, field2, field3 there as well.

@jcrossley3 @ctron I don't mean to be negative but I am having a difficult time dealing with the current endpoints, sometimes it confuses me what we are actually doing. If there are other clients of the API I hope they are not having the same difficulties as me.

ctron Jul 2, 2024
Maintainer Author

Just quoting from above:

I think its entirely fine to use a purl query as a lookup to get to concrete packages but when we started corgi ... purls were not stable enough to be considered for lookup table duty ... maybe things have changed.

I still think that's the case. So I think having the ability to search by an actual purl is fine, but in most cases we return the UUID of a known PURL. And we know we can resolve this. So I think we should not make our life more complicated. And I don't see any benefit for the user in this case. The UI gets an ID and forwards it to another call. The user never interacts with that.

there must be a single consistent way of identifying and fetching that entity

And that's exactly the problem. If there is a single way, it will be hard to stay consistent, because it makes a difference whether it's a base PURL, a qualified PURL, the UUID of an SBOM package, the name of an SBOM package, or the name of a PURL. Having dedicated endpoints for this makes it predictable what the operation is. Many of the issues opened lately are exactly around that confusion.

jcrossley3 Jul 2, 2024
Maintainer

I agree that we should have an optimized API for the UI. However, I don't think that the UI is the only client to the API.

It's our only client now. We should incorporate feedback we get from it. Other clients will likely give the same feedback.

Having a /api/v1/package/{purl} just for the UI sake, might still make it confusing for all others. On the other side, for the UI, it's "just carlos" that needs to code it once. So the user will never interact with that API directly. So I think that the path actually shouldn't matter.

Show me the client who is less confused by /api/v1/package/by-purl/{uuid} than /api/v1/package/{purl}.
