Our HBase cluster (version 2.1.10) is generating excessively small HFiles, often below 10 MB and sometimes as low as 2 KB.
This occurs despite setting hbase.hregion.memstore.flush.size to 256 MB and
tuning parameters such as hbase.hregion.percolumnfamilyflush.size.lower.bound.min to match the flush size. Additionally, we've enabled BASIC in-memory compaction.
The root cause appears to be the global nature of MemStoreSizing. This variable, shared across all column families within a region, triggers a region-wide flush when the total memstore size exceeds the threshold.
Consequently, even if only one column family is actively accumulating data, the entire region is flushed, potentially leading to the creation of small HFiles.
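To make the effect concrete, here is a rough back-of-the-envelope sketch in plain Java (no HBase dependency). It assumes every column family is flushed as soon as the region-wide threshold is reached, and that each family holds a fixed share of the incoming writes. The class name, family names, and write shares are hypothetical, and the numbers are memstore bytes at flush time, not final HFile sizes (compression and encoding would shrink them further).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of HBase: estimates how much data each column
// family holds in memory at the moment a region-wide flush is triggered.
final class RegionFlushEstimate {

    // regionFlushSizeBytes: the region-wide flush threshold.
    // writeShare: assumed fraction of incoming bytes per family (should sum to ~1.0).
    static Map<String, Long> estimate(long regionFlushSizeBytes, Map<String, Double> writeShare) {
        Map<String, Long> bytesPerFamily = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : writeShare.entrySet()) {
            bytesPerFamily.put(e.getKey(), Math.round(regionFlushSizeBytes * e.getValue()));
        }
        return bytesPerFamily;
    }

    public static void main(String[] args) {
        long flushSize = 256L * 1024 * 1024; // 256 MB, as in the question

        // Assumed write distribution: one busy family, two nearly idle ones.
        Map<String, Double> share = new LinkedHashMap<>();
        share.put("cf_hot", 0.98);
        share.put("cf_warm", 0.0199);
        share.put("cf_cold", 0.0001);

        estimate(flushSize, share).forEach((family, bytes) ->
            System.out.printf("%s: ~%d KB in memstore at flush%n", family, bytes / 1024));
    }
}
```

With a distribution like this, the nearly idle families reach only a few kilobytes or megabytes by the time the region hits 256 MB, which is in line with the file sizes we observe.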
We seek guidance on strategies to prevent the generation of small HFiles and to enable per-column-family flushing in multi-column-family HBase tables.
1 Answer
To avoid creating small files in HBase, increase the MemStore flush size
(hbase.hregion.memstore.flush.size)
and the client write buffer size
(hbase.client.write.buffer)
so that flushes happen less often. Adjust compaction settings, such as the store file compaction threshold
(hbase.hstore.compactionThreshold)
to tune how often minor compactions run. Raise the blocking store file count
(hbase.hstore.blockingStoreFiles)
if writes stall while compactions catch up; note that this setting limits the number of store files per store, not their size. Consider using bulk loading for large data imports, since it writes HFiles directly and bypasses the MemStore. Also, tune region sizes by adjusting the region split size
(hbase.hregion.max.filesize).
Regularly monitor HBase metrics and adjust configurations for efficient flushing and file management.
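On the per-column-family part of the question, the following is a simplified model in plain Java of how a "flush only the large stores" policy picks families, as I understand HBase's default policy for multi-family tables. It is a sketch under assumptions, not HBase source code: the class and method names are hypothetical, and it ignores other triggers such as stores holding very old edits. The assumed rule is that the lower bound is the larger of (region flush size / number of families) and the configured minimum, and that all families are flushed when none exceeds that bound.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of per-family flush selection; not the HBase implementation.
final class FlushSelectionSketch {

    // Assumed rule: the per-family bound is the average flush size per family,
    // unless the configured minimum is larger.
    static long lowerBound(long regionFlushSize, int familyCount, long configuredMin) {
        return Math.max(regionFlushSize / familyCount, configuredMin);
    }

    // Select families whose memstore exceeds the bound; if none do, flush them all.
    static List<String> selectFamiliesToFlush(Map<String, Long> memstoreBytes, long lowerBound) {
        List<String> large = new ArrayList<>();
        for (Map.Entry<String, Long> e : memstoreBytes.entrySet()) {
            if (e.getValue() > lowerBound) {
                large.add(e.getKey());
            }
        }
        return large.isEmpty() ? new ArrayList<>(memstoreBytes.keySet()) : large;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long regionFlushSize = 256 * mb;

        // Illustrative memstore sizes at the moment the region crosses 256 MB.
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("cf_hot", 250 * mb);
        sizes.put("cf_warm", 6 * mb);
        sizes.put("cf_cold", 1 * mb);

        // Minimum set equal to the flush size: no family exceeds it, so all are flushed.
        long strict = lowerBound(regionFlushSize, sizes.size(), 256 * mb);
        System.out.println(selectFamiliesToFlush(sizes, strict)); // [cf_hot, cf_warm, cf_cold]

        // Smaller minimum (16 MB here): only the busy family is selected.
        long relaxed = lowerBound(regionFlushSize, sizes.size(), 16 * mb);
        System.out.println(selectFamiliesToFlush(sizes, relaxed)); // [cf_hot]
    }
}
```

If this model matches your version, setting hbase.hregion.percolumnfamilyflush.size.lower.bound.min equal to the region flush size means a family is selected on its own only when it alone exceeds the whole-region threshold, so falling back to flushing every family becomes the common case. A lower minimum may be worth testing before changing anything else.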