I finally got off my ass and put together a fully bijective from any file to any file
BWTS DC thing.
I would like to thank Yuta since we talked alot and fixed his DC to make it shorter
for some cases. So that it would lead to this code. I have the original stuff Yuta
sent me in the file. He really does fast clean work in changing DC.
I would also like to thank Shelwien who help motivate me.
ALso Matt for his encouragement.
There are many other people that helped so far.
The guy who uses Fibonacci on comp.compression I told him years ago I would
write a bijective Fibonacii thing OK i did.
Look it may have errors I tested in forward direction the calgary18 got 882,031
which considering I have not added an arb255 style arithmetic coder yet I
think is pretty good. I still got the file "BANANAS" to compress smaller.
I tested the reverse on short files. The problem is that I wanted to use 0 as
a table exit code. This meant '1111' in the input close to front would cause
symbol table to end when testing even short files. So I used the 256th
symbol bijectively as the stop in table. The same problem occurs when using
0 to terminate the n-2 character runs it occurs randomly to often so used 21.
Please test it on short files when testing for full bijectiveity.
by short I mean 100 bytes or less. It can blow up to a gig even
in a dozen bytes. I don't count that as an error. Look I can go
a lot farther but this is a start. Please look at it.
Offtopic a bit, but I always wanted to ask.
Since the bijective BWT transform is 1:1 with no expansion or starting pointer, are there any interesting properties by applying the bij-BWT twice? Or un-bij-BWTing a text file? Or applying bijBWT a very large number of times. (In theory if you apply bij-BWT enough times sequentially you should get your original file back! Of course this may take a googolplex of iterations..)
None of those crazy applications would be useful for searching or compression or anything, but it'd be curious if there were some other pattern or structure revealed by such abuse.
When I think about bijections there are several types. One is a bijection from a closed finite set to another closed finite set. Such a bijection in this class is the standard BWTS it works on a starting file and then ouputs a permutation of the starting set. If you do this standard BWTS on any file then if you do it enough times you always get the original file back.Quote Originally Posted by GerryB View PostOfftopic a bit, but I always wanted to ask.
Since the bijective BWT transform is 1:1 with no expansion or starting pointer, are there any interesting properties by applying the bij-BWT twice? Or un-bij-BWTing a text file? Or applying bijBWT a very large number of times. (In theory if you apply bij-BWT enough times sequentially you should get your original file back! Of course this may take a googolplex of iterations..)
None of those crazy applications would be useful for searching or compression or anything, but it'd be curious if there were some other pattern or structure revealed by such abuse.
Another kind of bijection to the set of all files from all files is one where there is an inifinte number of items. One with a little care could reduce that to some finite set but I generally leave it open ended.
Let me give an example of this second type. Think of the set of all binary strings of one and zeros. In this case the bit one or zero could be thought of as a bit. This set can be bijective mapped to the set of 8 bit byte files, I do that all the time but not here. Look at the files in an ordered way
0
1
00
01
10
11
000
0001
and so on forever
I could map this set to the set of number starting with 2 by putting a 1
in front and reading the resultant number example.
0 becomes 10 which is 2
1 becomes 11 which is 3
00 becomes 100 which is 4
01 becomes 101 which is 5
notice this is a familar pattern where every file is mapped bijectively to set of numbers
starting with 2.
I can define a bijection. where if the file is represented by an odd number its mapped to
the next lower file that uses the next lower odd number except if the file is represented by
3 it maps to 2. And define the mapping of files represented by even numbers as being
replaced by the file that has next larger even number. You could make this a loop
by fixing the max length of string. In this case when there is no file represented by
the next even number map that file to the one with the highest odd number. But
I am allowing all files so not a loop.
This bijection is not a loop. When you allow the set of all files you can have loops you
can have things that get smaller and then get bigger. You can even define thing that
may have what likes like a dampened ocsellation (spell check failed me here) that gets
smaller then larger. You can have subsets in the files that form loops and so on.
When you repeatedly compress much can happen but in general repeated even
one time for most real life files causes expansion.
My main hobby is compression and encryption. I almost had a chance to go to Israel
and talk to some of the big boys. Gil was going to introduce me. It had free tickets to
fly to Prague and then to Israel but my wife worried the trip would kill me so I did not
go. I actually think she thought they would mistake my for a terrorist since I act so
differently from a normal person and I know only english. Twice in my life I was almost
killed by police. Once in Japan going though a checkpoint long story. And once in
Ridgecrest California where a woman cop stopped me for no reason and I asked an
innocent question about the glock she was carrying. She immediately pulled it out
and was shaking. I was not sure if I should rush her or disarm her. I mentally decided
that if she shot I would do my best to see her dead along with me. She really had
no business being a cop. I froze and she called for back up. Fortunately I knew many
of the cops that showed up we played poker together. They took her aside still
shaking and told me she just over reacted and that she was transfered in from anther
place and never had met people who act like me so I am different.
Anyway I had discussions years ago about bijective compression as a first
pass before encryption. There was a German guy very interested in it.
Anyway to make a long story short. The BWTS-DC-FIB in its present form
I would not use as the stage immediately before something
like AES. First of all the code is not complete there may some
errors. Second of all though the classic Unitcity Distance may
be larger. There are weakness that one could distinguish most
compressed files from a true random file. But these things can
be improved upon.
If you want to test this code please when doing compression try to do
either the 3 things in order or just test the middle one dctemp4.
Note dctemp4 e inputfile outputfile
input file is any file output file is a mulitply of 8bytes where each 8 bytes represents a postitve
number from zero to what ever. The code is bijective from bytes to the 8byte grouping.
However realistically if you have a random file and do dectemp4 d inputfile outputfile
it can easily blowup in just 8 bytes. I want to test this thing in the d direction for small
numbers just to see if the concept is correct. A later version will make this less likely
I just want to see if the bijective DC concept is correct.
For the reverse direction if you get dctemp4 d to fail where the sum of each of the 64bit
numbers totals less than a million then you have found an error. In either the DC itself
or my interface to it. Like wise if the sum is less than a million and your
do dctemp4 d filea fileb followed by dectemp4 fileb filec and
filea not equal to filec then there is either an error in DC itseld not being bijective
or in my connecting to it. Once again even for this filea is made of 64 bit numbers
each of which is 0 to a large number the total value is less than a million and the
number of numbers is less than a thousand. I expect that I still have a few errors
I hope if that is true that one of you find the error and send me the file causing
the error and I will try to fix it before I do more with this code. Note
I have not yet at this stage added an arithmetic compressor it will compress
better when I do that. However even now it wil compress enwik9
t0 185,709,858 bytes. Using a machine with 6GB memory and the 64
bit exes.
Thank You
David A. Scott
here is the replacement for dctemp5.exe
I will post a cleaner one soon
it is just for testing.
it was complied with no options
X:\>fc /L /N /W dctemp6.c dctemp5.c
Comparing files dctemp6.c and DCTEMP5.C
***** dctemp6.c
59: DCV[m] = (n - 1) - (j + 1);
61: if( m == 1) DCV[0]--;
63: return m;
***** DCTEMP5.C
59: DCV[m] = (n - 1) - (j + 1);
61: return m;
*****
***** dctemp6.c
102: }
103: static int ds1 = 0;
104: static int bread ( int *c , long long *t, FILE *fp )
***** DCTEMP5.C
100: }
101: static int ds1 = 3;
102: static int bread ( int *c , long long *t, FILE *fp )
*****
***** dctemp6.c
117: *t = bn / (k + eonsflg);
118: ds1 = (x == EOF) && ( 0 == ( (bn +1) % (k + eonsflg)));
120: *c = adjidr(BR, bn % (k-- + eonsflg ));
***** DCTEMP5.C
115: *t = bn / (k + eonsflg);
116: ds1 = (bn +1) % (k + eonsflg);
118: *c = adjidr(BR, bn % (k-- + eonsflg ));
*****
***** dctemp6.c
170: if ( m == 0 ) return;
171: // if(m == 1 ) DCV[0] -= 1;
172: if ( 256 != a ) {
173: fprintf(stderr," eons DVC[0] = %u ",DCV[0]);
174: c = adjid(B, 256);
175: bwrite( c ,(long long) DCV[0], fp ); /* write eons with next DCV val
ue */
176: fprintf(stderr," rest of DCV follows %u \n ",m-1);
***** DCTEMP5.C
168: if ( m == 0 ) return;
169: if ( 256 != a ) {
170: fprintf(stderr," eons DVC[0] = %u ",DCV[0]-1);
171: c = adjid(B, 256);
172: bwrite( c ,(long long) DCV[0] - 1, fp ); /* write eons with next DCV
value */
173: fprintf(stderr," rest of DCV follows %u \n ",m-1);
*****
***** dctemp6.c
177: } else {
178: fprintf(stderr," NO eons all 256 symbols used DVC[0] = %u \n",DCV[0]
);
179: bwrite(-2,(long long)DCV[0],fp);
180: fprintf(stderr," no eons DVC[0] = %u ",DCV[0]);
***** DCTEMP5.C
174: } else {
175: fprintf(stderr," NO eons all 256 symbols used DVC[0] = %u \n",DCV[0]
-1);
176: bwrite(-2,(long long)DCV[0] - 1,fp);
177: fprintf(stderr," no eons DVC[0] = %u ",DCV[0]);
*****
***** dctemp6.c
237: /* specail case no eonst ends early */
238: if (ds1) t++;
239: // t++;
***** DCTEMP5.C
234: /* specail case no eonst ends early */
235: if (ds1 == 0) t++;
236: // t++;
*****