0
$\begingroup$

I noticed that compression tools can pack multiple files into a single archive. Since every file is just a sequence of bytes, how do these algorithms keep track of where one file ends and the next begins? What separators do they use?

From my understanding it goes like this: say I have an array [1,2,3] and another array [4,12,6], and I store them as the string "1234126". These algorithms can still recover the original form, so decompressing "1234126" gives back "[1,2,3]","[4,12,6]". How do compression algorithms do this?

My only solution is not very efficient: [1,2,3] would be converted to 111213, so that when I decode it I know the length of each number. For example, in 111213 the digit at [0] is 1, so the next number has length 1, which is the 1 at [1], and so on. I don't think this is efficient, because it bloats the data, and once a length goes beyond 9 you need one more digit to say how long the length itself is. For example, the text "this is a data" has a length of 14, so before compression it would become 214this is a data (2 digits of length, then the length 14, then the text), and only then be compressed, which bloats it. Is this how compression algorithms handle this?

Steps:

1. file1 contains "this is a data from file1" and file2 contains "this is a data from file2".
2. Encode the data from both files (each text is 25 characters long, so the prefix is 225): 225this is a data from file1225this is a data from file2
3. Compress. The result is something like: 776^&676s
4. Decompress "776^&676s": 225this is a data from file1225this is a data from file2
5. Decode: file1 contains "this is a data from file1" and file2 contains "this is a data from file2".

I hope this question isn't too confusing :)

asked Sep 20, 2022 at 11:15
$\endgroup$

2 Answers

1
$\begingroup$

A simple method: first, create an algorithm that can translate a sequence of bytes into a compressed sequence, and can decompress that compressed sequence back into the original file, provided the length of the compressed sequence is known.

Then let's say you have ten files. For each one you have its filename and its compressed sequence. The output file starts with a table describing the files: the first filename with the start and length of the first compressed file, the second filename with the start and length of the second compressed file, and so on. After the table you append the first compressed file, then the second compressed file, and so on.

If you want to decompress file xyz, you go through the table at the start of the compressed file, find the start and length of the file named xyz, and then you know what to decompress.
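The layout described above could be sketched roughly as follows. This is only an illustration under my own assumptions: the function names `pack` and `extract`, the JSON table, and the 4-byte big-endian table length are all made up for the example and are not any real archive format. `zlib` stands in for "some compression algorithm".

```python
import json
import struct
import zlib


def pack(files: dict[str, bytes]) -> bytes:
    """Hypothetical layout: 4-byte table length, JSON table, compressed blobs."""
    table = []
    blobs = []
    offset = 0  # start of each blob, counted from the end of the table
    for name, data in files.items():
        blob = zlib.compress(data)
        table.append({"name": name, "start": offset, "length": len(blob)})
        blobs.append(blob)
        offset += len(blob)
    header = json.dumps(table).encode("utf-8")
    # The table length is stored in a fixed 4 bytes, so it never needs
    # a "length of the length" prefix.
    return struct.pack(">I", len(header)) + header + b"".join(blobs)


def extract(archive: bytes, wanted: str) -> bytes:
    """Look up one file in the table and decompress only its blob."""
    (header_len,) = struct.unpack(">I", archive[:4])
    table = json.loads(archive[4:4 + header_len].decode("utf-8"))
    body_start = 4 + header_len
    for entry in table:
        if entry["name"] == wanted:
            begin = body_start + entry["start"]
            return zlib.decompress(archive[begin:begin + entry["length"]])
    raise KeyError(wanted)
```

Note that the lengths are stored as fixed-width binary numbers rather than decimal digits, which avoids the bloat of the digit-prefix scheme in the question.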

You can vary this slightly to make it possible to add, remove or change files without re-writing everything.

answered Sep 20, 2022 at 11:23
$\endgroup$
1
$\begingroup$

There are two options.

Zip archive style: each file is compressed separately. There is an index at the end of the file (it can also be at the start) that lists the metadata for each file and where the compressed block for each file is located.

The other option is the tarball style: first the set of files is combined into one uncompressed byte stream with separators (tar itself puts a header block recording the name and size in front of each file), and then that whole byte stream is compressed.
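A minimal sketch of the second, tarball-like approach, assuming a made-up framing rather than the real tar format: each file is preceded by a fixed-width name length and data length, and the combined stream is compressed once. The names `pack_solid` and `unpack_solid` are hypothetical.

```python
import struct
import zlib

# Fixed-width frame header: 2-byte name length, 4-byte data length (big-endian).
FRAME = ">HI"


def pack_solid(files: dict[str, bytes]) -> bytes:
    """Concatenate all files with length headers, then compress the whole stream."""
    stream = bytearray()
    for name, data in files.items():
        encoded_name = name.encode("utf-8")
        stream += struct.pack(FRAME, len(encoded_name), len(data))
        stream += encoded_name
        stream += data
    return zlib.compress(bytes(stream))


def unpack_solid(archive: bytes) -> dict[str, bytes]:
    """Decompress everything, then walk the frames to split the files apart."""
    stream = zlib.decompress(archive)
    files = {}
    pos = 0
    while pos < len(stream):
        name_len, data_len = struct.unpack_from(FRAME, stream, pos)
        pos += struct.calcsize(FRAME)
        name = stream[pos:pos + name_len].decode("utf-8")
        pos += name_len
        files[name] = stream[pos:pos + data_len]
        pos += data_len
    return files
```

The compressor never needs to know about file boundaries here. It just sees one byte stream, and the boundaries are recovered from the length headers after decompression.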

Both options have their advantages and disadvantages. With the zip style you can access any single file by reading only the index and that file's block of compressed data, and each file can use a different compression algorithm (including none, for already-compressed media formats). With the tarball style you can get better compression overall when subsequent files have a similar format, because the compressor can reuse patterns it has already seen in earlier files.
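To see the zip-style properties in practice, here is a small sketch using Python's standard `zipfile` module. The file names and contents are invented for the example. One entry is deflated, the other is stored uncompressed, and a single file is read back without touching the other.

```python
import io
import zipfile

text = b"this is a data from file1" * 40
media = b"\xff\xd8 pretend these bytes are already compressed"

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    # Each entry picks its own compression method.
    archive.writestr("notes.txt", text, compress_type=zipfile.ZIP_DEFLATED)
    archive.writestr("photo.jpg", media, compress_type=zipfile.ZIP_STORED)

with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    # The index (central directory) records where each entry lives and how
    # large it is, both compressed and uncompressed.
    entries = {info.filename: info for info in archive.infolist()}
    for info in entries.values():
        print(info.filename, info.header_offset, info.compress_size, info.file_size)
    # Read a single file; the other entry is never decompressed.
    notes = archive.read("notes.txt")
```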

answered Sep 26, 2022 at 9:43
$\endgroup$
