
I am writing a program to store, retrieve and delete "blocks" of data of varying sizes.

It currently works by keeping a database that stores the locations of the blocks and the locations of the free space in the file.

The file is split into pages such that within a page no two free chunks are adjacent (during a delete operation, any adjacent free chunks are merged into one larger chunk).

The problem with this is that I am seeing horrible IO performance when I remove a bunch of blocks and insert new ones of different sizes. Blocks range from 1 kB to about 200 kB and may be written anywhere in the file, provided they fit in an existing free chunk; if no such free chunk is found, a new page is created at the end of the file.
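To make the setup concrete, the behaviour described above is roughly the following sketch (the names and the page size are illustrative only, not my actual code): free chunks are tracked as (offset, size) pairs, allocation is first-fit, and a delete merges adjacent free chunks.

```python
import bisect

PAGE_SIZE = 256 * 1024  # page size assumed only for the sketch

class FreeSpaceMap:
    def __init__(self):
        self.free = []        # sorted list of (offset, size) free chunks
        self.file_end = 0     # current end of the file

    def allocate(self, size):
        # First fit: scan free chunks in offset order.
        for i, (off, sz) in enumerate(self.free):
            if sz >= size:
                if sz == size:
                    self.free.pop(i)
                else:
                    self.free[i] = (off + size, sz - size)
                return off
        # No fit: append a new page at the end of the file.
        off = self.file_end
        page = max(size, PAGE_SIZE)
        self.file_end += page
        if page > size:
            bisect.insort(self.free, (off + size, page - size))
        return off

    def release(self, off, size):
        # Insert the freed chunk and merge it with any adjacent free chunks.
        bisect.insort(self.free, (off, size))
        merged = []
        for o, s in self.free:
            if merged and merged[-1][0] + merged[-1][1] == o:
                merged[-1] = (merged[-1][0], merged[-1][1] + s)
            else:
                merged.append((o, s))
        self.free = merged
```

So, for example, allocating 4 kB out of a fresh page and then releasing it collapses the page back into a single free chunk.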

Can anyone suggest a way to improve on this, or maybe point me in the right direction?

asked Mar 27, 2013 at 10:08
  • Are you interested in conceptual, or rather in implementation-level answers? Commented Mar 27, 2013 at 12:14
  • Conceptual would be enough; I just want some ideas for how I can improve. This does have to take average disk IO into account (a normal hard disk, not an SSD). I included the implementation details so whoever answers can get an idea of how the file is being used. Commented Mar 27, 2013 at 16:01

2 Answers


A starting point for this would be heap-management algorithms; perhaps start here. I remember that Knuth has a detailed discussion in The Art of Computer Programming, I believe in Volume 1. The algorithms described there are designed for RAM; adding in the non-uniform access of rotating disks will be quite a challenge...

Isn't it much simpler to use a database and blobs? Your time is valuable too...

answered Mar 28, 2013 at 0:41
  • I can't really use a database for this thing :( Commented Mar 28, 2013 at 10:37

Since you mention in a comment on your original post that you are writing to an ordinary HDD, the likely culprit is the way you write or remap the data.

Ordinary hard drives store and read data on rotating magnetic platters; the data itself is laid out in concentric "rings" (tracks) on the platters.

Because of how they work, hard drives are most efficient when data can be read or written sequentially. When reading or writing at random places in a file or file system, performance drops drastically, as the drive has to wait (some 6-10+ milliseconds per request) for the heads to seek and for the requested data to spin past them.

There are many likely culprits, but a few may be:

  • The file you are working on is itself heavily fragmented. Try defragmenting the hard drive.
  • Your method of searching for free chunks may be inefficient (and may also generate lots of random reads on the HDD, especially if you use a parallelized/multithreaded search approach); see the sketch after this list.
  • Your chunk/page/block size and/or your method of reading/writing them may generate a lot of overhead or random reads/writes, perhaps because the chunk/page size is too small (and thus there are more pages/chunks to handle).
  • Your method of updating your database of pages/chunks may be dragging you down.
  • You may have a lot of (meta)data that is processed even though it might not need to be.
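On the free-chunk search in particular: if the chunk metadata fits in memory, keeping the free chunks in a list sorted by size turns the "find a chunk of at least N bytes" query into a binary search instead of a scan that may itself touch the disk. A rough sketch (the names are mine, not your code):

```python
import bisect

class SizeIndex:
    def __init__(self):
        self.by_size = []   # sorted list of (size, offset) pairs

    def add(self, offset, size):
        bisect.insort(self.by_size, (size, offset))

    def take_best_fit(self, size):
        # Smallest free chunk that is still large enough (best fit).
        i = bisect.bisect_left(self.by_size, (size, 0))
        if i == len(self.by_size):
            return None                      # no chunk is big enough
        sz, off = self.by_size.pop(i)
        if sz > size:
            self.add(off + size, sz - size)  # return the remainder to the index
        return off
```

First fit over an offset-sorted list works too; a size-sorted index simply avoids walking past many chunks that are too small.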

One suggestion for optimizing performance and efficiency on HDDs would be to allow fragmentation of data blocks.

The drawback is that you will have to write a "defragmenter" program (for future maintenance) and implement a block "jump" scheme (see the sketch after the list below).

The advantages you get are:

  • Considerably more efficient use and allocation of disk space, as large blocks can be spread over numerous smaller free chunks.
  • Increased IO speed, as the read/write load is adapted to the hardware in use and kept as sequential as possible.
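To make the "jump" scheme concrete, here is a rough sketch in which a logical block is stored as a list of (offset, length) extents; the function names and the extent-table layout are illustrative, not a fixed format:

```python
def write_block(f, data, extents):
    """Write `data` across the given extents; extents must exactly cover len(data)."""
    pos = 0
    for offset, length in extents:
        f.seek(offset)
        f.write(data[pos:pos + length])
        pos += length
    assert pos == len(data), "extents do not cover the block"

def read_block(f, extents):
    """Re-assemble a block by following its extent list in order."""
    parts = []
    for offset, length in extents:
        f.seek(offset)
        parts.append(f.read(length))
    return b"".join(parts)
```

Keeping each block's extents in ascending offset order keeps the head movement as close to sequential as the free-space layout allows.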

If you have major speed problems in your deletion step, don't actually delete the data (unless you really need to): simply clear the "in use" marker (that is your whole "deletion" step) and let those chunks be overwritten with new data on a later write. This is how most filesystems and hard drives handle file "deletion" (and I'm not talking about trash cans :o)).
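In other words, a "delete" can be metadata-only. A rough sketch, where block_table is assumed to be a dict mapping block id to (offset, size) and free_chunks a list of reusable ranges:

```python
def delete_block(block_table, free_chunks, block_id):
    offset, size = block_table.pop(block_id)  # forget where the block lived
    free_chunks.append((offset, size))        # mark that range as reusable
    # No seek/write of the data itself: the old bytes are simply overwritten
    # whenever the allocator hands this range to a future insert.
```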

answered Feb 20, 2018 at 19:41