- 
  Notifications
 
You must be signed in to change notification settings  - Fork 25.6k
 
Add ES93BloomFilterStoredFieldsFormat for efficient field existence checks #137331
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ES93BloomFilterStoredFieldsFormat for efficient field existence checks #137331
Conversation
...hecks Introduces a new stored fields format that builds a Bloom filter for a specific field to enable fast existence checks without storing the field itself. This delegates storage of all other fields to another StoredFieldsFormat while maintaining the ability to quickly determine if a document might contain the target field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! I left many comments but most of them are nits.
I think this is ready for review, we'll improve the format step by step.
 
 
 server/src/main/java/org/elasticsearch/index/codec/bloomfilter/BloomFilterHashFunctions.java
 
 Show resolved
 Hide resolved
 
  
 
 ...c/main/java/org/elasticsearch/index/codec/bloomfilter/ES93BloomFilterStoredFieldsFormat.java
 
 Outdated
 
 Show resolved
 Hide resolved
 
  
 
 ...c/main/java/org/elasticsearch/index/codec/bloomfilter/ES93BloomFilterStoredFieldsFormat.java
 
 Outdated
 
 Show resolved
 Hide resolved
 
  
 
 ...c/main/java/org/elasticsearch/index/codec/bloomfilter/ES93BloomFilterStoredFieldsFormat.java
 
 Outdated
 
 Show resolved
 Hide resolved
 
  
 
 ...c/main/java/org/elasticsearch/index/codec/bloomfilter/ES93BloomFilterStoredFieldsFormat.java
 
 Outdated
 
 Show resolved
 Hide resolved
 
  
 
 ...c/main/java/org/elasticsearch/index/codec/bloomfilter/ES93BloomFilterStoredFieldsFormat.java
 
 Outdated
 
 Show resolved
 Hide resolved
 
 There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could have a better coverage if we were randomly using some field names that are generated in BaseStoredFieldsFormatTestCase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid that we'll broke some of the integration tests that rely on these fields actually being stored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is null, then we haven't seen an _id stored field. Does it still make sense to write an empty file then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, for merges maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's for merges. Once we get merges working, bloomFilterFieldInfo should be always not null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sfbf will be Special Francisco's Bloom Filter in my mind forever once this is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to return the delegateReader even if bloomFilterFieldReader is null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily, but I would expect that the bloom filter field reader will be always != null
Skip skipping the bloom filter field for all data types Compute the bloom filter size in a simpler way
Pinging @elastic/es-storage-engine (Team:StorageEngine)
Hi @fcofdez, I've created a changelog YAML for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit isn't addressed yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 1e4ea4c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it depends on how this bloom stored field format will be registered. If it's registered only for the _id field through (an adjusted version of) PerFieldFormatSupplier, what I think we'll do, then I prefer to have all field behave the same as we know other types of fields won't be called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think there was an assertion on info.name being equal to the bloomFilterFieldName which I liked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 4293207
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks!
...z/elasticsearch into bloom-filter-stored-fields-format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martijnvg knows this part better for sure.
 
 
 server/src/main/java/org/elasticsearch/index/codec/bloomfilter/BloomFilterHashFunctions.java
 
 Show resolved
 Hide resolved
 
  
 
 server/src/main/java/org/elasticsearch/index/codec/bloomfilter/BloomFilterHashFunctions.java
 
 Show resolved
 Hide resolved
 
 There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit isn't addressed yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think typically codec versions start at zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 605c93d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity what is the plan to support merging? Iirc ES87BloomFilterPostingsFormat supports this by rebuilding the bloom filter from the merged terms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we would just OR the bitsets, but that depends a bit on the sizing strategy. Otherwise we could just rebuild the bloom filter as you mentioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: instead of makeing the BloomFilterFieldReader public, maybe keep it package protected / private like other classes in here and add an public interface for this specific method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 6550601
Thanks all for the reviews!
Uh oh!
There was an error while loading. Please reload this page.
Introduces a new stored fields format that builds a Bloom filter for a specific field to enable fast existence checks without storing the field itself. This delegates storage of all other fields to another StoredFieldsFormat while maintaining the ability to quickly determine if a document might contain the target field.
Missing parts