-
Notifications
You must be signed in to change notification settings - Fork 300
Improve prefetching of relationships #921
-
Currently, to improve performance one can specify select_for_includes and prefetch_for_includes in viewsets. This ensures that the required fields are loaded in the minimum amount of queries when ?includeed.
I haven't seen anything talking about how to reuse defined fields in other views.
Example:
CommentViewSet has these attributes defined:
select_for_includes = {"writer": ["author__bio"]} prefetch_for_includes = { "__all__": [], "author": ["author__bio", "author__entries"], }
EntrySerializer allows one to include comments, but since we haven't defined something like the above for EntryViewSet, nothing will be pre-fetched.
We can go and define it like so:
prefetch_for_includes = { "__all__": [], "comments.writer": ["comments__author__bio", "comments__author__entries"], "comments.author": ["comments__author__bio", "comments__author__entries"], }
As we see, we kind of repeat ourselves.
We also should note that writer was a select_related field, which is now prefetch_related field.
Actual discussion
Now since serializers themselves are the ones defining all the includes, they should be the ones defining what to prefetch and what to not, and how, without having to repeat the same fields many times across different viewsets
It also should be smart enough to understand that a field can be selected in a viewset, and prefetch in another viewset, depending on the "parent" field.
Beta Was this translation helpful? Give feedback.
All reactions
Replies: 4 comments 18 replies
-
And while we are at this, we need to think how to support prefetching related fields in SerializerMethodResourceRelatedField results.
Currently, http://127.0.0.1:8000/entries?include=suggested.authors will add 9 more queries to fetch the authors.
What comes to mind is passing extra kwargs to get_<field_name> letting the user handle the rest if needed.
Beta Was this translation helpful? Give feedback.
All reactions
-
I have run into this issue as well that defining prefetch_for_includes is a bit of a head ache. Main issue here is though that AutoPrefetchMixin is not smart enough. AutoPrefetchMixin should be able to automatically prefetch the relationships of an included resource and most prefetch_for_includes configuration would be obsolete (at least the once I have done in the past... 😄).
It is an interesting idea though to move the prefetching configuration to the serializer. A proof of concept would be needed also in terms of what API would fit. I would see above issue improving of AutoPrefetchMixin the higher priority though as it is a quick win for all users without a need for configuration.
Beta Was this translation helpful? Give feedback.
All reactions
-
The problem I found with AutoPrefetchMixin is that it's completely automatic, with no easy way (but to rewrite it) to fine-tune it.
Many times I need to use Prefetch to get a subset of related items (based on user permissions, etc), which AutoPrefetchMixin doesn't offer.
I don't think that the use cases where you'll prefetch every related entity is more than when you need to customize it and limit it more.
Most simple ones are not showing hidden/deleted/removed/inactive/flagged-for-review entities, and the list goes on.
Beta Was this translation helpful? Give feedback.
All reactions
-
I agree automatic will never cover all use cases but the most common once it should. And for the rest manually configuring prefetching needs to be possible.
You are talking about having a subset of related items - where do you usually do this? By filtering the queryset of a ResourceRelatedField or somewhere else?
Beta Was this translation helpful? Give feedback.
All reactions
-
The queryset of ResourceRelatedField is afaik for input validation. I do it in prefetch_for_includes, where I modified PreloadIncludesMixin to accept a callable (passing the request to it), to filter against user, etc.
Beta Was this translation helpful? Give feedback.
All reactions
-
I was a bit too quick when writing 😉 but of course queryset on ResourceRelatedField is about input validation. I would rather do this by overwriting the get_queryset on the view.
I do see your problem though when it comes to nested includes and multiple definition of the Prefetch configuration on different views. This brings the question whether included_serializers is actually the correct way of defining includes and whether it should not be configured on the view as included_views. Idea behind this is that a include should be serialized exactly the same then when the specific end point is called. And the get_queryset is important in this as it defines what data can actually been seen.
I haven't thought this through but just came to my mind when thinking about your issue because serializers are about serializing objects and the view about what data should actually be serialized.
Beta Was this translation helpful? Give feedback.
All reactions
-
On second thought I do not think prefetching etc. would work with included views and basically included_serializers is like nested serializers in DRF (but simply optional by the include parameter).
I do think though related_serializers feature needs to be thought through and should rather use views instead of serializers directly (but I guess this is rather the topic of discussion #916)
I think the limitation with using included_serializers (as in fact the same in DRF with nested serialzers) though is that the permission of the parent object apply. So if a relationship has different permissions than main object the included feature should not be activated for this relationship.
Beta Was this translation helpful? Give feedback.
All reactions
-
My two cents: I started playing with this a while back and also came to the conclusion that all the prefetching logic needs to be defined on the serializer, else includes are either left out in the dark, or require a tonne of repetition to ensure they're supported across all views.
My approach would've been to define a SparseFieldsMeta inner class on the serializer which'd be responsible for:
- Annotations -- Performing particular annotations on the queryset when particular fields are requested. (eg something like
.annotate_with_totals()whentotalis in the sparsefields request). And of course conversely avoiding calling the potentially expensive annotation when the field is not needed. - Permissioning -- refusing to return certain fields under certain conditions. For this package I think it'd be nice to support permission-based field control out of the box, but leave space for extensibility, so end users can customise this further to check for arbitrary feature flags etc.
- On-demand-only -- Perhaps breaking scope of this discussion completely, however currently when no sparsefields are defined, ALL fields are returned. We've found it useful in our own project to roll with a concept called "on_demand_only" fields, which will not come through be default (because they're eg usually irrelevant, or expensive to compute), and must be explicitly requested through sparsefields. The reason I mention it here is that such a a thing would also fit quite well into this
SparseFieldsMetaconcept. - Prefetching/Selectrelating -- and of course the meat of this whole discussion. This needs to be defined in a way that allows for recursive construction of
Prefectchobjects, as well as appropriate attachment ofselect_relatedandprefetch_relatedcalls. It needs to be intelligent -- a relationship requested through sparsefields isn't enough to trigger it; the include is needed too. Further, all of the other points above (eg qs annotations) would need to be packed into these Prefetch objects too, since we're looking at a nested qs that might be built of out multiple layers.
Here's an example of what such a thing would look like:
MySerializer:
class SparseFieldMeta:
# Example of a permission-based field. LHS defines req'd perm, RHS is
# a list of fields protected by that perm.
permission_required = {
'accounts.view_rate': ['rate'],
}
# Example of annotations & on-demand-only. LHS defines annotation
# that will get called when any of the fields on the RHS are requested.
# Fields on RHS will not come through by default, only when explicitly
# listed in the sparsefields list.
on_demand_only = {
'all': ['__str__', 'contractor_set', 'client_set', 'clientgroup_set'],
'annotate_with_usage': ['usage_percentage'],
}
# Example of prefetching logic -- if any of these are listed in the includes,
# they will be added on as a prefetch too. This section would potentially
# need to support tuples in the list as well, for more complex Prefetch
# definitions. Conversely, perhaps this section would be entirely inferable
# from the Serializer definition and other entries in the SparseFieldMeta.
prefetch_related = [
'branches',
'groups',
]
# Example of select_related logic. LHS defines additional FKs to be
# select_related when any of the fields on the RHS are requested. The
# fields on RHS would likely be computed on the fly and require data from
# these relationships to build.
select_related = {
'type': ['__str__'],
'variant': ['__str__'],
}
# Example of a custom addition that hides fields on the RHS if the
# feature_flag on the LHS isn't enabled. The logic for supporting this
# config would need to live somewhere comfortable, probably as a
# Serializer mixin that can interact cleanly with whatever consumes
# SparseFieldMeta.
feature_flag_required = {
'branches': ['branches'],
}
# Example of a custom addition that restricts non-staff users to only a
# small subset of the field. Ditto re mixin.
nonstaff_fields = [
'__str__', 'name',
]
Beta Was this translation helpful? Give feedback.
All reactions
-
👍 1
-
Hey @sliverc thanks for taking a look!
1a. SerializerMethodResourceRelatedFields aren't covered (and can't be), because the relation won't exist on the model, so prefetch_related just can't be used. In these cases whatever these things are relying on could be handled manually in ViewSet.get_queryset (in the absence of prefetch_for_related). This wouldn't work with nested includes, but it doesn't currently either. So no feature loss. We could attempt to discuss how we could find a way to fold in support for these as part of the work I'm proposing. Potentially there could be some arg on SerializerMethodResourceRelatedField that dictates how to approach prefetching them. I think if we do consider something like this, it'd be in a future iteration though.
1b. I have a solution for user-based visibility reduction in my production code, but didn't want to bloat out the PR further. Summarising briefly, where I've put OnDemandFieldsMixin in this PR, in our production code I've called it RestrictedFieldsMixin, and it's responsible for popping off not only expensive fields, but also fields that are restricted based on various rules (feature flags, user permissions, etc). I'll paste a sanitised copy of it in a followup comment. Let me know your thoughts. I don't want to bloat out this work too much, but we could potentially just leave a stub method in there, for developers to be able to overwrite if they want to be doing such magic. That same mixin also allows for explicitly defining one's own ondemand fields, other than the expensive ones. Eg stuff like cached_properties or calculations that one might not want to always auto-include.
2.. Agreed. I'll move the ondemand stuff into a separate PR. The two work best hand-in-hand, but given what I've said in 1b, we could probs put a bit more work into the ondemand mixin to support configurability in addition to the automatic excludes. This'd cover the expensive file field you mentioned too.
I'll close the current PR, and we can spin up a couple of smaller ones off it, one at a time (can happen over several releases, even).
I'm thinking, in this order:
- Automatic nested prefetching: Upgrade
AutoPrefetchMixinto useadd_nested_prefetches_to_qslogic. - Automatic ondemand: Introduce
OnDemandFieldsMixinto automatically remove "expensive" fields from the Serializer unless they've been explicitly asked for in the sparsefieldset. By default, "expensive fields" would just be reverse relations and m2ms, toggleable with a settings option. - Extend
OnDemandFieldsMixinto allow explicitly defining one's own "expensive" fields. I have an implementation already in use in production, but it introduces an opinionated config syntax for serializers, so would want to workshop this here first and make sure the syntax is as appropriate as possible for the users of this package before continuing. - Field permissioning: Introduce a
RestrictedFieldsMixinto automatically remove fields from the Serializer based on information on the request. Could be user permissions, api version, system feature flags, etc. These fields would always get excluded, regardless of whether sparsefieldsets tried to ask for them or not. - Remove/Deprecate
PreloadIncludesMixin, given all use cases are covered by the above.
Let me know what you think. Gimme a green light and I'm on it. I'm itching to get it done while it's still fresh in memory hehe. I'd also be keen to do some housekeeping cleanup on this package too. There's currently a lot of cruft, and the single utils file is uniweldy.
Beta Was this translation helpful? Give feedback.
All reactions
-
Here's the full RestrictedFields implementation I'm using in production. Would split it into points 2,3,4 above, using two separate mixins.
RestrictedFieldsMixin
class RestrictedFieldsMixin: def __init_subclass__(cls, **kwargs): """Run a smidge of validation at class declaration, to avoid silly mistakes.""" # Throw error if there are expensive relations listed in the on_demand_only config. if restricted_fields_meta := getattr(cls, 'RestrictedFields', None): expensive_relation_fields = set(get_expensive_relational_fields(cls)) for fields in getattr(restricted_fields_meta, 'on_demand_only', {}).values(): if mistake_fields := expensive_relation_fields & set(fields): raise AttributeError( f"{cls.__name__!r} includes relational fields in its `on_demand_only` list." " Expensive relations are already considered ondemand implicitly." f" Please remove: {mistake_fields}" ) return super().__init_subclass__(**kwargs) def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) # Pop any fields off the serializer that shouldn't come through. for field in self.get_excluded_restricted_fields() | self.get_excluded_ondemand_fields(): self.fields.pop(field, None) def get_excluded_restricted_fields(self) -> list[str]: """ Determines which fields may not be accessed, based on any of the following reasons: - Non-staff -- user is not staff, and thus only has access to a limited subset of all fields. - Permission -- user is lacking particular permissions to view certain fields. - Extension -- site has certain extensions disabled, concealing certain fields. """ restricted_fields_meta = getattr(self, 'RestrictedFields', None) restricted_fields = set() if user := getattr(self.context.get('request', None), 'user', None): # If user is not staff, swap out the base list of fields for a cut-down version. if not (user.is_staff or user.is_superuser): nonstaff_fields = getattr(restricted_fields_meta, 'nonstaff_fields', []) # Mark all serializer fields for removal, except those listed in nonstaff_fields. # Also, custom fields are handled elsewhere, so don't remove any of those either. restricted_fields |= {f for f in self.fields.keys() if f not in nonstaff_fields and not f.startswith('extra_fields_')} # If the user is lacking certain permissions, remove corresponding fields. for perm, fields in getattr(restricted_fields_meta, 'permission_required', {}).items(): if not user.has_perm(perm): restricted_fields |= set(fields) # If the customer has certain extensions disabled, remove corresponding fields. for app, fields in getattr(restricted_fields_meta, 'extension_required', {}).items(): if not Extension.objects.is_installed(app): restricted_fields |= set(fields) return restricted_fields def get_excluded_ondemand_fields(self) -> list[str]: """ Determine which fields should be popped off if not explicitly asked for. Will not nominate any fields that have been designated as `demanded_fields` in context. Ondemand fields are determined in two ways: - Fields that we automatically determine to be expensive, and thus automatically remove from the default offering. Currently such fields are M2Ms and reverse FKs. - Explicitly listed on the serializer, in the `RestrictedFields.on_demand_only` section, oft accompanied by an annotation method that needs to be run in order for the ondemand field to exist. """ restricted_fields_meta = getattr(self, 'RestrictedFields', None) # M2Ms and reverse FKs. if <some_settings_flag>: ondemand_fields = set() # Expensive fields are not considered. else: ondemand_fields = set(get_expensive_relational_fields(type(self))) # `on_demand_only` declarations. for fields in getattr(restricted_fields_meta, 'on_demand_only', {}).values(): ondemand_fields |= set(fields) # If we've instantiated the serializer ourselves, we'll have fed `demanded_fields` into its context. # If it's happened as part of drf render internals, then we have a fallback where the view # has provided the entire sparsefields context for us to pick through. if 'demanded_fields' in self.context: demanded_fields = set(self.context.get('demanded_fields')) else: resource_name = get_resource_type_from_serializer(type(self)) demanded_fields = set(self.context.get('all_sparsefields', {}).get(resource_name, [])) # We only want to exclude those ondemand fields that haven't been explicitly requested. return ondemand_fields - set(demanded_fields)
Beta Was this translation helpful? Give feedback.
All reactions
-
Thanks for your response.
In terms of RestrictedFieldsMixin that was not what I wanted to refer to. I also think your code snippet is not REST conform, see more in this discussion. What I was actually referring to is when you have a M2M relation and the API should not render relations which for instance are marked as deleted.
In DRF someone could overwrite get_queryset like the following (a dirty example):
def get_queryset(self): return models.Blog.objects.prefetch_related( Prefetch('tags', models.Tags.objects.filter(is_deleted=False)) )
This is defined on the view but actually does not work with the DJA included feature. For this we need a solution which PreloadIncludesMixin currently also does not offer.
There are other use cases where an automatic approach won't work. For instance on a serializer a SerializerMethodField is used which calculates a name which it gets from two different relations (FK). This serializer gets then defined as a include in another serializer. PreloadIncludesMixin currently supports such use cases where the configuration defines what relations need to be prefetched for such a field in case serializer is actually included in the request.
All in all It is really important to make AutoPrefetchMixin better but I believe there will always be a need for manual configuration as only the API developer knows what actually needs to be prefetched especially in complex cases.
In terms of priorities as DJA is an open source project it really depends on the contributors what they want to invest time on.
From a maintainer perspective though I certainly prefer to fix existing features first before adding new features. So this said I would prefer to make AutoPrefetchMixin and PreloadIncludesMixin work well first.
In terms of the on demand only fields I prefer this to be discussed in another discussion thread where we can go deeper on what the actual use cases are for this feature and how those use cases can be best covered.
Beta Was this translation helpful? Give feedback.
All reactions
-
I also think your code snippet is not REST conform, see more in this discussion.
Not sure I'm following on this one. What I believe you're saying here is that whether I hit /api/a/?include=b, or /api/b/, the shape of b should present the exact same way regardless. Am I understanding correctly? Where in what I've proposed is that conformance broken?
What I was actually referring to is when you have a M2M relation and the API should not render relations which for instance are marked as deleted.
In DRF someone could overwrite
get_querysetlike the following (a dirty example):def get_queryset(self): return models.Blog.objects.prefetch_related( Prefetch('tags', models.Tags.objects.filter(is_deleted=False)) )This is defined on the view but actually does not work with the DJA included feature. For this we need a solution which
PreloadIncludesMixincurrently also does not offer.
Yeah, so the solution I'm currently running with is to use the queryset defined on the ResourceRelatedField on the serializer. This is the best place to specify this, as it means we can abide by it at all levels of nesting. However I'm aware this is somewhat of an abuse, as ResourceRelatedField.queryset is designed more for filtering options/validating input, rather than the actual related items. There is an overlap in concepts here, but the overlap isn't perfect.. my thoughts of how to get around this are either to introduce an additional field on ResourceRelatedField, which allows defining a get_queryset() method for this purpose (this'd be a hard sell i suspect...) or conjuring up a way to tell the automatic prefetcher to skip the given field, because the user will deal with its prefetching themselves.
There are other use cases where an automatic approach won't work. For instance on a serializer a
SerializerMethodFieldis used which calculates a name which it gets from two different relations (FK). This serializer gets then defined as a include in another serializer.PreloadIncludesMixincurrently supports such use cases where the configuration defines what relations need to be prefetched for such a field in case serializer is actually included in the request.
It "supports" it, but at great cost -- any viewset that might potentially drill down to hitting this field, needs to have its own version of specifications for this include in its config. The solution I'm currently running builds this into the ondemand fields concept: you can designate certain fields as ondemand_only, and as part of that config you can nominate queryset methods you want run when said fields are requested, to perform any additional annotations or prefetch the SerializerMethodField might need. In short, I believe this'd be buildable into some sort of config on the serializer, or through introduction of additional arguments to SerializerMethodField, to achieve the same result.
All in all It is really important to make
AutoPrefetchMixinbetter but I believe there will always be a need for manual configuration as only the API developer knows what actually needs to be prefetched especially in complex cases.
Agreed, there need to be escape hatches for non-standard behaviour. DJA should be opinionated to some degree, to prevent foot-shooting, and to push REST/JSONAPI spec compliance, but we certainly shouldn't be regressing away from what's currently possible, at least not without a very strong reason.
In terms of priorities as DJA is an open source project it really depends on the contributors what they want to invest time on.
From a maintainer perspective though I certainly prefer to fix existing features first before adding new features. So this said I would prefer to make
AutoPrefetchMixinandPreloadIncludesMixinwork well first.
Can we agree PreloadIncludesMixin is inherently flawed and the goal should be to offer sufficient parallel functionality in order to be able to remove it? Viewset-based prefetching/select_relating will never play nicely with nested includes.
I see what you're saying though, and I definitely appreciate a measured approach to this. Stability is important when there are so many projects depending on the package, and I wish it had come sooner (eg stuff like this needless and expensive pluralisation, using inflection library, no less, should never have made it in)
In my view, as it stands, the n+1 problem requires significant changes for it to get addressed properly. I've currently got some energy to make these changes, and am happy to put some time into drilling down into this discussion with yourself or whoever, step by step, to implement these changes in as minimally regressive a way as possible, with an approach we all agree on. If you're not sold on the 5 steps I proposed above, what sort of roadmap would you suggest?
In terms of the on demand only fields I prefer this to be discussed in another discussion thread where we can go deeper on what the actual use cases are for this feature and how those use cases can be best covered.
Sounds good. I wanted to mention them initially here, for context, as they play a role in addressing the problem too, however they can be fleshed out more carefully down the line, separately.
Cheers,
Beta Was this translation helpful? Give feedback.
All reactions
-
Thanks for your lengthy response and willingness to work on this. I am not gonna address all what you mentioned above as of lack of time (can follow up on the different issue in the respective discussions).
I noticed though that my sentence about AutoPrefetchMixin and PreloadIncludesMixin can be misunderstood. I am not suggesting that those mixins remain but the concept behind them of auto-prefetching and manual configuration of prefetching do. As those concepts are already introduced in DJA they have in my opinion priority over totally new features.
Hence this would be my action plan:
- Improve auto prefetching: As far as I see is there no more discussion needed on this and you could open an issue explaining what needs to be improved and create a PR which address this. (so far what I see is the missing sparse field and recurse M2M and reverse relationship support. There might be more.)
- Continue discussion on how manual configuration of prefetching could work. As this is the initial idea of this discussion I would continue here. A PR with some POC code might also help.
- Open another discussion about on demand only fields so we can talk about use cases and possible implementation.
The different discussions can of course also happen in parallel.
Beta Was this translation helpful? Give feedback.
All reactions
-
👍 1
-
In issue #337 @sliverc specifically mentions an existing N+1 query problem for any reverse relationships (eg. one to one in reverse or one to many) and how that's not handled by AutoPrefetchMixin
In our use of the framework, we make our own base classes and have amended AutoPrefetchMixin to fix the N+1 query reverse relationship issue. Basically our mixin is just a copy of the AutoPrefetchMixin with added logic at the beginning, examing the users "include" strings and adding auto fetch of reverse relationships 1 level deeper.
Since it is using the serializers included_serializers, it can only apply auto prefetch when it is able to fetch the classes (in our projects, we require that included_serializers is always complete and have added validations for this), but even without having completeness the addition to the mixin will still work to reduce N+1 queries in any cases where the class can be found.
The core of the solution is examining the relationships for models for each serializer:
# snip ... looping through the base view serializer and any serializers that the user asked for with "include" ... level_model = getattr(level_serializer_class.Meta, "model") info = model_meta.get_field_info(level_model) relations_set = { x for x in info.relations.keys() if not x.endswith("_set") } for field in relations_set: field_info = info.relations[field] if field_info.reverse or field_info.to_many: included_recursive_prefetch.add(f"{level_agg}{field}") #... snip # And then resume the work of `AutoPrefetchMixin` included_resources = get_included_resources( self.request, self.serializer_class ) + list(included_recursive_prefetch) qs = super().get_queryset(*args, **kwargs) # ...
Full gist here: https://gist.github.com/dsschneidermann/ddf9c9e4a782c2f6769d503ff0a0a42e
An additional optimization missing from here would be restricting the queries to do SELECT id FROM .. rather than fetching the full instances - but I honestly have no idea how to make that effect happen in Django ORM. And in that case it would not anymore be an addition to AutoPrefetchMixin but a new thing entirely.
As it is, it can be used as a drop-in replacement for AutoPrefetchMixin to fix only N+1 queries on reverse relationships and it'll bring no new complexity to the framework, which I think is a major advantage.
Beta Was this translation helpful? Give feedback.
All reactions
-
Nice approach! Minimising added complexity is a definite boon, however correctness/useability come first.
Currently the two offerings in drfjsonapi don't cut it:
PreloadIncludesMixinforces unnecessary manual effort, and is fundamentally flawed in that it cannot be set up to consider nesting properly.AutoPrefetchMixindoesn't recurse (which your solution addresses), nor does it offer a way of opting out of fetching the additional data if sparsefieldsets haven't asked for it. Specifically, for reverse relations and M2Ms, there are three cases to consider: [1] don't return the field at all (no prefetch), [2] return only the ids (simple prefetch), and [3] return the entire relations (deep prefetch and recurse this same logic at the next level down).
So I don't think avoiding added complexity or shaking this up too much is an option here.
Beta Was this translation helpful? Give feedback.
All reactions
-
Thanks @dsschneidermann for posting your mixin here. Very helpful for people who stumble upon this thread and need a quick solution which might very well work for their use case. As @jarekwg mentioned and he is already working on in #964 I would prefer when in the future AutoPrefetchMixin also consider sparse field sets.
Beta Was this translation helpful? Give feedback.