Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Add ES93BloomFilterStoredFieldsFormat for efficient field existence checks #137331

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
fcofdez merged 13 commits into elastic:main from fcofdez:bloom-filter-stored-fields-format
Nov 3, 2025

Conversation

@fcofdez
Copy link
Contributor

@fcofdez fcofdez commented Oct 29, 2025
edited
Loading

Introduces a new stored fields format that builds a Bloom filter for a specific field to enable fast existence checks without storing the field itself. This delegates storage of all other fields to another StoredFieldsFormat while maintaining the ability to quickly determine if a document might contain the target field.

Missing parts

  • Merge support for the bloom filter
  • Dynamic bloom filter sizing
  • More extensive testing

...hecks
Introduces a new stored fields format that builds a Bloom filter for a
specific field to enable fast existence checks without storing the field
itself. This delegates storage of all other fields to another
StoredFieldsFormat while maintaining the ability to quickly determine if
a document might contain the target field.
Copy link
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I left many comments but most of them are nits.

I think this is ready for review, we'll improve the format step by step.

"",
TestUtil.getDefaultCodec().storedFieldsFormat(),
ByteSizeValue.ofKb(2),
IdFieldMapper.NAME
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could have a better coverage if we were randomly using some field names that are generated in BaseStoredFieldsFormatTestCase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid that we'll broke some of the integration tests that rely on these fields actually being stored.


private void finishBloomFilterStoredFormat() throws IOException {
BloomFilterMetadata bloomFilterMetadata = null;
if (bloomFilterFieldInfo != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is null, then we haven't seen an _id stored field. Does it still make sense to write an empty file then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, for merges maybe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's for merges. Once we get merges working, bloomFilterFieldInfo should be always not null

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks!

*/
public class ES93BloomFilterStoredFieldsFormat extends StoredFieldsFormat {
public static final String STORED_FIELDS_BLOOM_FILTER_FORMAT_NAME = "ES93BloomFilterStoredFieldsFormat";
public static final String STORED_FIELDS_BLOOM_FILTER_EXTENSION = "sfbf";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sfbf will be Special Francisco's Bloom Filter in my mind forever once this is merged.

fcofdez and martijnvg reacted with laugh emoji
var success = false;
try {
bloomFilterFieldReader = BloomFilterFieldReader.open(directory, si, fn, context, segmentSuffix);
success = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to return the delegateReader even if bloomFilterFieldReader is null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily, but I would expect that the bloom filter field reader will be always != null

Skip skipping the bloom filter field for all data types
Compute the bloom filter size in a simpler way
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

Copy link
Collaborator

Hi @fcofdez, I've created a changelog YAML for you.

Copy link
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

+ " or less (rounded to nearest power of two)"
);
}
this.bloomFilterSizeInBits = (int) Math.multiplyExact(closestPowerOfTwoBloomFilterSizeInBytes, Byte.SIZE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
this.bloomFilterSizeInBits = (int) Math.multiplyExact(closestPowerOfTwoBloomFilterSizeInBytes, Byte.SIZE);
this.bloomFilterSizeInBits = Math.toIntExact(Math.multiplyExact(closestPowerOfTwoBloomFilterSizeInBytes, Byte.SIZE));

Copy link
Member

@martijnvg martijnvg Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit isn't addressed yet

Copy link
Contributor Author

@fcofdez fcofdez Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 1e4ea4c

if (info.getName().equals(bloomFilterFieldName)) {
addToBloomFilter(info, value);
} else {
delegateWriter.writeField(info, value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it depends on how this bloom stored field format will be registered. If it's registered only for the _id field through (an adjusted version of) PerFieldFormatSupplier, what I think we'll do, then I prefer to have all field behave the same as we know other types of fields won't be called.

}

private void addToBloomFilter(FieldInfo info, BytesRef value) {
bloomFilterFieldInfo = info;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think there was an assertion on info.name being equal to the bloomFilterFieldName which I liked

Copy link
Contributor Author

@fcofdez fcofdez Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 4293207


private void finishBloomFilterStoredFormat() throws IOException {
BloomFilterMetadata bloomFilterMetadata = null;
if (bloomFilterFieldInfo != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks!

Copy link
Contributor

@kkrik-es kkrik-es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martijnvg knows this part better for sure.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

+ " or less (rounded to nearest power of two)"
);
}
this.bloomFilterSizeInBits = (int) Math.multiplyExact(closestPowerOfTwoBloomFilterSizeInBytes, Byte.SIZE);
Copy link
Member

@martijnvg martijnvg Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit isn't addressed yet

public static final String STORED_FIELDS_BLOOM_FILTER_FORMAT_NAME = "ES93BloomFilterStoredFieldsFormat";
public static final String STORED_FIELDS_BLOOM_FILTER_EXTENSION = "sfbf";
public static final String STORED_FIELDS_METADATA_BLOOM_FILTER_EXTENSION = "sfbfm";
private static final int VERSION_START = 1;
Copy link
Member

@martijnvg martijnvg Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think typically codec versions start at zero?

Copy link
Contributor Author

@fcofdez fcofdez Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 605c93d


@Override
public int merge(MergeState mergeState) throws IOException {
// Skip merging the bloom filter for now
Copy link
Member

@martijnvg martijnvg Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity what is the plan to support merging? Iirc ES87BloomFilterPostingsFormat supports this by rebuilding the bloom filter from the merged terms.

Copy link
Contributor Author

@fcofdez fcofdez Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we would just OR the bitsets, but that depends a bit on the sizing strategy. Otherwise we could just rebuild the bloom filter as you mentioned.

this.bloomFilterData = bloomFilterData;
}

public boolean mayContainTerm(String field, BytesRef term) throws IOException {
Copy link
Member

@martijnvg martijnvg Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: instead of makeing the BloomFilterFieldReader public, maybe keep it package protected / private like other classes in here and add an public interface for this specific method?

Copy link
Contributor Author

@fcofdez fcofdez Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 6550601

@fcofdez fcofdez requested a review from burqen November 3, 2025 08:28
@fcofdez fcofdez merged commit 2f6ea7a into elastic:main Nov 3, 2025
34 checks passed
Copy link
Contributor Author

fcofdez commented Nov 3, 2025

Thanks all for the reviews!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

@martijnvg martijnvg martijnvg approved these changes

@tlrx tlrx tlrx approved these changes

@kkrik-es kkrik-es kkrik-es approved these changes

@burqen burqen Awaiting requested review from burqen

Assignees

No one assigned

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

AltStyle によって変換されたページ (->オリジナル) /