How do compression algorithms keep track of multiple files?

Question 1

I noticed that compression tools can compress multiple files at once. Since all files are made up of bytes, how do these compression algorithms keep track of where one file ends and the next begins? What separators do they use?

As I understand it, it works like this: say I have an array [1,2,3] and another array [4,12,6], and I store them as the string "1234126". These algorithms somehow keep track of the original form, so decompression still gives "1234126" → "[1,2,3]", "[4,12,6]". How do compression algorithms do this?

My only solution is not very efficient: [1,2,3] would be converted to 111213, so that when I decompress I know the length of each number. For example, in 111213, [0] = 1 means the next number has a length of 1, which is [1] = 1, and so on. But I don't think this is efficient, since it bloats the data, and if a length goes beyond 9 you need to add another number. For example, the string "this is a data" has a length of 14, so before compression it would become 214this is a data (2 is the number of digits in the length, 14 is the length, then comes the data), and the compressed result would be bloated as well. Is this how compression algorithms handle this?

Steps:

file1 contains "this is a data from file1" and file2 contains "this is a data from file2" → encode the data from both files → 225this is a data from file1225this is a data from file2 → compress → result: 776^&676s → decompress "776^&676s" → 225this is a data from file1225this is a data from file2 → decode → file1 contains "this is a data from file1" and file2 contains "this is a data from file2"
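The framing scheme described above (one digit giving the number of digits in the length, then the length, then the data) could be sketched roughly like this in Python. This is only an illustration of the idea in the question, not how any real archive format works; the names `encode` and `decode` are made up for the example, and it operates on text characters rather than raw bytes.

```python
def encode(records):
    """Frame each string as <digit count><length><payload> and concatenate."""
    out = []
    for text in records:
        length = str(len(text))
        if len(length) > 9:
            # The scheme only reserves one digit for the size of the length field.
            raise ValueError("payload too long for a one-digit length-of-length")
        out.append(f"{len(length)}{length}{text}")
    return "".join(out)


def decode(blob):
    """Split a framed string back into the original list of strings."""
    records = []
    pos = 0
    while pos < len(blob):
        digits = int(blob[pos])                 # how many digits the length has
        pos += 1
        length = int(blob[pos:pos + digits])    # the payload length itself
        pos += digits
        records.append(blob[pos:pos + length])  # the payload
        pos += length
    return records


if __name__ == "__main__":
    framed = encode(["this is a data"])
    print(framed)          # 214this is a data
    print(decode(framed))  # ['this is a data']
```

Because the decoder always knows how many characters to read next, the payload may itself contain digits without confusing it.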

I hope this question isn't so confusing :)

gnasher729 · Answer 1 · 2022-09-20 11:23:55Z

A simple method: first, you create an algorithm that can translate a sequence of bytes into a compressed sequence, and can decompress the compressed sequence back into the original file, provided the length of the compressed sequence is known.

Then let's say you have ten files. You take their filenames and produce a compressed sequence for each file. The output file starts with an array describing the files: the first filename plus the start and length of the first file's compressed data, the second filename plus the start and length of the second file's compressed data, and so on. After that table you append the first compressed file, then the second compressed file, and so on.

If you want to decompress file xyz, you go through the table at the start of the compressed file, find the start and length of the file named xyz, and then you know what to decompress.

You can vary this slightly to make it possible to add, remove or change files without re-writing everything.
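A minimal sketch of this layout in Python, under some assumptions that are not part of the answer: the table is stored as JSON, it is preceded by a 4-byte big-endian size field, `zlib` stands in for the compression algorithm, and `start` is counted from the first byte after the table. The function names `pack` and `extract` are hypothetical.

```python
import json
import struct
import zlib


def pack(files):
    """Build an archive from a dict of name -> bytes."""
    table = []
    blobs = []
    offset = 0
    for name, data in files.items():
        blob = zlib.compress(data)  # each file is compressed on its own
        table.append({"name": name, "start": offset, "length": len(blob)})
        blobs.append(blob)
        offset += len(blob)
    header = json.dumps(table).encode("utf-8")
    # Layout: 4-byte table size, the table, then all compressed blobs back to back.
    return struct.pack(">I", len(header)) + header + b"".join(blobs)


def extract(archive, wanted):
    """Look up one file in the table and decompress only its blob."""
    (header_size,) = struct.unpack(">I", archive[:4])
    table = json.loads(archive[4:4 + header_size].decode("utf-8"))
    data_start = 4 + header_size
    for entry in table:
        if entry["name"] == wanted:
            begin = data_start + entry["start"]
            return zlib.decompress(archive[begin:begin + entry["length"]])
    raise KeyError(wanted)


if __name__ == "__main__":
    archive = pack({"abc": b"this is a data from abc", "xyz": b"this is a data from xyz"})
    print(extract(archive, "xyz"))
```

No separator bytes are needed between the blobs: the table says exactly where each one starts and how long it is, so the compressed data can contain any byte values.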

ratchet freak · Answer 2 · 2022-09-26 09:43:34Z

There are two options.

Zip archive style: each file is compressed separately. There is an index at the end of the archive (a format could also put it at the start) that lists the metadata for each file and where the compressed block for each file is located.

Tarball style: first the set of files is combined into an uncompressed byte stream (in tar, a header block in front of each file records its name and size), and then that whole byte stream is compressed.

Both options have their advantages and disadvantages. With the zip style, you can access each file without touching anything other than the index and the block of compressed data for that file. Each file can also use a different compression algorithm, including none at all for media formats that are already compressed. With the tarball style, you can get better compression overall when subsequent files have a similar format, because the compressor can reuse what it has seen in earlier files.
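Both styles can be tried with Python's standard `zipfile` and `tarfile` modules. The sketch below builds each kind of archive in memory; the helper names (`make_zip`, `make_tar_gz`, `read_from_zip`, `read_from_tar_gz`) are made up for the example, and gzip is just one possible choice of compressor for the tarball.

```python
import io
import tarfile
import zipfile


def make_zip(files):
    """Zip style: every member is deflated separately, index written at the end."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def make_tar_gz(files):
    """Tarball style: members are concatenated with headers, then gzipped as one stream."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)  # the header records the size, so no separator is needed
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_from_zip(archive, name):
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(name)


def read_from_tar_gz(archive, name):
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tf:
        return tf.extractfile(name).read()


if __name__ == "__main__":
    files = {f"file{i}": f"this is a data from file{i}\n".encode() * 20 for i in range(50)}
    print("zip bytes:   ", len(make_zip(files)))
    print("tar.gz bytes:", len(make_tar_gz(files)))
    print(read_from_zip(make_zip(files), "file7")[:27])
```

Reading one member from the zip only needs the index and that member's block, while reaching a member in the gzipped tarball means decompressing the stream up to that point.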
