You mission today is to invent a text compressor.
Task
You'll write two functions:
The packer is a function that accepts a string of ASCII characters (U+0000 to U+007F) and outputs a Unicode string (U+0000 to U+10FFFF), containing the fewest characters possible.
The unpacker is a function that accepts an encoded Unicode string and outputs exactly the original ASCII string.
Input
The only authorized input is the ASCII string (for the packer) and the packed Unicode string (for the unpacker). No user input, no internet connection, no use of file system.
Your functions can have access to this list of english words. You can use this list as a local txt file, or copy its content in your source code as a string or an array of strings.
You can't hardcode the snippets below in your functions.
Output
The only authorized output for both functions is a string.
The output of the unpacker must contain exactly the same characters as the input of the packer.
Your inputs and outputs can use any character encoding supporting all Unicode (UTF-8/16/32, GB18030, ...), as your score will only depend on the number of Unicode characters in the output. Please precise which encoding you're using, though.
To count the number of Unicode characters in your output, you can use this tool: http://mothereff.in/byte-counter
Scoring
Your entry must be able to pack and unpack the 10 following text snippets (that I took on this forum).
Your score will be the sum of the sizes of your 10 packed strings (in Unicode characters) + the size of your two functions (in Unicode characters, too)
Don't count the size of the dictionnary if you use it.
Please include in your entries the "score" of each snippet and their packed version.
Lowest score wins.
Data
Here are the snippets to encode to compute your score:
1: Rick Roll lyrics (1870b): We're no strangers to code golf, you know the rules, and so do I
We're no strangers to love You know the rules and so do I A full commitment's what I'm thinking of You wouldn't get this from any other guy I just wanna tell you how I'm feeling Gotta make you understand Never gonna give you up Never gonna let you down Never gonna run around and desert you Never gonna make you cry Never gonna say goodbye Never gonna tell a lie and hurt you We've known each other for so long Your heart's been aching but You're too shy to say it Inside we both know what's been going on We know the game and we're gonna play it And if you ask me how I'm feeling Don't tell me you're too blind to see Never gonna give you up Never gonna let you down Never gonna run around and desert you Never gonna make you cry Never gonna say goodbye Never gonna tell a lie and hurt you Never gonna give you up Never gonna let you down Never gonna run around and desert you Never gonna make you cry Never gonna say goodbye Never gonna tell a lie and hurt you (Ooh, give you up) (Ooh, give you up) (Ooh) Never gonna give, never gonna give (Give you up) (Ooh) Never gonna give, never gonna give (Give you up) We've know each other for so long Your heart's been aching but You're too shy to say it Inside we both know what's been going on We know the game and we're gonna play it I just wanna tell you how I'm feeling Gotta make you understand Never gonna give you up Never gonna let you down Never gonna run around and desert you Never gonna make you cry Never gonna say goodbye Never gonna tell a lie and hurt you Never gonna give you up Never gonna let you down Never gonna run around and desert you Never gonna make you cry Never gonna say goodbye Never gonna tell a lie and hurt you Never gonna give you up Never gonna let you down Never gonna run around and desert you Never gonna make you cry Never gonna say goodbye Never gonna tell a lie and hurt you
2: The Golfer (412b): Golfing ASCII-art
'\ . . |>18>> \ . ' . | O>> . 'o | \ . | /\ . | / / .' | jgs^^^^^^^`^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
3: the number diamond (233b): Print this diamond
1 121 12321 1234321 123454321 12345654321 1234567654321 123456787654321 12345678987654321 123456787654321 1234567654321 12345654321 123454321 1234321 12321 121 1
4: the alphabet four times (107b): Print the alphabet four times
abcdefghijklmnopqrstuvwxyz qwertyuiopasdfghjklzxcvbnm pyfgcrlaoeuidhtnsqjkxbmwvz zyxwvutsrqponmlkjihgfedcba
5: Old McDonald's lyrics (203b): Old MacDonald function
Old MacDonald had a farm, E-I-E-I-O, And on that farm he had a cow, E-I-E-I-O, With a moo moo here and a moo moo there, Here a moo, there a moo, everywhere a moo moo, Old MacDonald had a farm, E-I-E-I-O!
6: Rock around the clock lyrics (144b): Rock Around the Clock
1, 2, 3 o'clock, 4 o'clock rock, 5, 6, 7 o'clock, 8 o'clock rock, 9, 10, 11 o'clock, 12 o'clock rock, We're gonna rock around the clock tonight.
7: Hello World (296b): Say "Hello" to the world in ASCII art
_ _ _ _ _ _ _ | | | | ___| | | ___ __ _____ _ __| | __| | | | |_| |/ _ \ | |/ _ \ \ \ /\ / / _ \| '__| |/ _` | | | _ | __/ | | (_) | \ V V / (_) | | | | (_| |_| |_| |_|\___|_|_|\___( ) \_/\_/ \___/|_| |_|\__,_(_) |/
8: Irish blessing (210b): An Old Irish Blessing
May the road rise up to meet you May the wind be always at your back May the sun shine warm upon your face The rains fall soft upon your fields And until we meet again May God hold you in the hollow of His hand
9: There was an old lady lyrics (1208b): There Was an Old Lady
There was an old lady who swallowed a fly. I don't know why she swallowed that fly, Perhaps she'll die. There was an old lady who swallowed a spider, That wriggled and iggled and jiggled inside her. She swallowed the spider to catch the fly, I don't know why she swallowed that fly, Perhaps she'll die. There was an old lady who swallowed a bird, How absurd to swallow a bird. She swallowed the bird to catch the spider, She swallowed the spider to catch the fly, I don't know why she swallowed that fly, Perhaps she'll die. There was an old lady who swallowed a cat, Imagine that to swallow a cat. She swallowed the cat to catch the bird, She swallowed the bird to catch the spider, She swallowed the spider to catch the fly, I don't know why she swallowed that fly, Perhaps she'll die. There was an old lady who swallowed a dog, What a hog to swallow a dog. She swallowed the dog to catch the cat, She swallowed the cat to catch the bird, She swallowed the bird to catch the spider, She swallowed the spider to catch the fly, I don't know why she swallowed that fly, Perhaps she'll die. There was an old lady who swallowed a horse, She died of course.
10: gettysburg address (1452b): How random is the Gettysburg Address
Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battlefield of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate, we can not consecrate, we can not hallow this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us-that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion-that we here highly resolve that these dead shall not have died in vain-that this nation, under God, shall have a new birth of freedom-and that government of the people, by the people, for the people, shall not perish from the earth.
Total (uncompressed): 6135 chars/bytes.
Have fun!
4 Answers 4
Haskell - 5322 points
Bytes of code : 686
Original size : 6147 = 1871+415+234+108+204+145+297+211+1209+1453
Encoded size : 4636 = 1396+233+163+92+153+115+197+164+979+1144
Score : 686+ 4636
Character count compression : ~25%
Code
Optimizations aside, this stores values between 0 and 7f in unicode characters as their prime factors.
It does not lower the number of bytes of the encoded output, it only lowers the number of unicode characters. For instance, test #4 contains 108 characters and the encoded output, 92. Their respective sizes are however 108 and 364 bytes.
import Data.Bits
import Data.List
import Data.Numbers.Primes
import qualified Data.Text as T
a=nub$concat$map(primeFactors)[0..127]
d(a:r)c=let s=shift c 5in if s<=0x10ffffthen d r(s+a)else c:d r a
d[]c=[c]
f(a:r)=let o=a.&.0x1fin(if o/=a then f((shiftR a 5):r)else f r)++[o]
f[]=[]
g=T.pack.map(toEnum).(\(a:r)->d r a).concatMap j.map(map(\x->head$elemIndices x a)).map(primeFactors.fromEnum).T.unpack
h=T.pack.map(toEnum.product.map((!!)a)).i.f.reverse.map(fromEnum).T.unpack
i(a:r)=let z=a`clearBit`4;x=if a`testBit`4then(take z$repeat$head r,tail r)else splitAt z r in[fst x]++i(snd x)
i[]=[]
j x@(a:_)=let l=length x in if(take l(repeat a))==x then[l`setBit`4,a]else[l]++x
j[]=[0]
How it works
Encoding
- Each character is converted to its numeric equivalent, lets call this number
n. nis then converted into a list of its prime factors,ps.- It conveniently happens that the numbers 0 through 127 have 32 common prime factors, excluding
1. This mean they, the factors, can be stored as indexes on as little as 5 bits. 1is a special case and is represented by an empty list.
- It conveniently happens that the numbers 0 through 127 have 32 common prime factors, excluding
- With
psthe encoding can now start.- Each number of
psis converted into its index in the list of 32 unique factors (In the above code this list is identified asa). - (Keep in mind at this point we are dealing with a list of list of indexes of prime factors) To proceed to the next step,
psneeds to be flattened. To preserve the integrity of the data, each list is converted into another list of two parts- The first element stores its length and if the it is composed of the same factor.
- There are at most 6 prime factors per list, this information is stored on the rightmost 3 bits. The fifth bit is used as a flag to indicate if the list is comprised of a single factor.
- The remaining elements are the indexes themselves or a single index if there is less than two different factors in the list.
- The first element stores its length and if the it is composed of the same factor.
- These lists are then concatenated into a single flattened list,
fs.
- Each number of
- The elements of
fscan then be packed into unicode characters using bit shifting.
- Each character is converted to its numeric equivalent, lets call this number
Decoding
- Do the encoding steps in reverse.
- If you are wondering how
1fits into this, I would like to remind you thatproduct [] == 1.
Tests
Using this interface for testing would be painful so I used this function to provide the results below.
edTest f = do
t <- readFile f
let txt = T.pack t
enc = g txt
dec = h enc
tst = txt == dec
putStrLn $ (show $ T.length txt) ++ "," ++ (show $ T.length enc) ++ "," ++ (show $ T.length dec)++","++(show tst)
putStrLn $ if not tst then T.unpack txt ++ "\n---NEXT---\n" ++ T.unpack dec else ""
λ> edTest "1"
1871,1396,1871,True
λ> edTest "2"
412,233,412,True
λ> edTest "3"
234,163,234,True
λ> edTest "4"
108,92,108,True
λ> edTest "5"
204,153,204,True
λ> edTest "6"
145,115,145,True
λ> edTest "7"
297,197,297,True
λ> edTest "8"
211,164,211,True
λ> edTest "9"
1209,979,1209,True
λ> edTest "10"
1453,1144,1453,True
Sample
The output of the encoding function g for test #4 is this
"99429円582753円135266円70785円35953円855074円247652円1082563円68738円49724円164898円68157円99429円67973円1082404円587873円73795円298017円330818円198705円69861円1082435円595009円607426円36414円69873円855074円265249円346275円67779円68738円77985円1082513円821353円132131円101410円247652円1082562円49724円164898円67649円594977円34915円67746円50273円135265円103997円563265円103457円1086021円99399円584802円70753円73889円34882円582722円411459円67779円68740円1084516円1082563円1091681円103491円313282円49724円164897円68705円135741円69858円50241円607426円35905円608421円1082435円69858円50274円71777円43075円298018円280517円1082404円67971円36017円955425円67665円919600円100452円132129円214883円35057円856097円101474円70753円135737円"
or if you are an adept of gibberish, this
𘑥𡁢𑒁豱𐲂숼𨐢𘑥𐦅𒁃𰠱𑃥踾𑃱𐲂𓂡𠐣𘰢숼𨐢𐡁衣쑡𡁡𘑇𑑡𒂡衂𐲄숼𨐡𡈽𑃢쑁豁𑃢쑢ꡃ𐦃貱𐡑𘡤𠐡裱𘱢𑑡𡈹
Additionnal notes
- Using http://mothereff.in/byte-counter, directory listings and
edTestthe size of the tests are all consistent but still differs from the indicated size in the question. - Test #10 contains a couple of EM DASHes (
—) that I replaced with-since they are outside of the0-7frange. - Further compression could be achieved using the remaining fourth bit while flattening, for instance,
00base case,01repeat all,10repeat except last,11repeat except last two. - The test files and the code are all available here https://github.com/gxtaillon/codegolf/tree/master/Kolmogorov
-
\$\begingroup\$ Hello, thanks for this answer! :) I didn't understand what happens in binary when you convert
abcdefghijklm...to𘑥𡁢𑒁豱..., could you explain it a little bit more please? Also, I've fixed the char counts and converted the the em-dashes in #10, in the question. My char counts are still different than yours, though. Dunno why, I used the mothereff.in tool. \$\endgroup\$xem– xem2014年07月20日 07:38:53 +00:00Commented Jul 20, 2014 at 7:38 -
\$\begingroup\$ @xem The intricate details have been revealed. \$\endgroup\$gxtaillon– gxtaillon2014年07月20日 18:22:42 +00:00Commented Jul 20, 2014 at 18:22
-
\$\begingroup\$ My mind is (figuratively) literally blown by the idea that the numbers 0 and 2-127 can be encoded on 5 bits. Did you find this by yourself or was it something known? Bonus question: how many bits do you need to store only the printable ascii characters, i.e. 95 different chars? \$\endgroup\$xem– xem2014年07月20日 18:43:33 +00:00Commented Jul 20, 2014 at 18:43
-
\$\begingroup\$ @xem The numbers are not encoded on 5 bits, each of their factors are. I would be very happy if I had found a way of encoding 7 bits on only 5. As for ascii characters, using this method they would still need 5 bits each. \$\endgroup\$gxtaillon– gxtaillon2014年07月20日 19:15:58 +00:00Commented Jul 20, 2014 at 19:15
-
1\$\begingroup\$ Since in the range you specified there are at most 6 factors per number, the length uses 3 out of a 5 bits "block". Then the indexes are encoded on 5 bits, yes. In this implementation, one of the 2 unused bits in the length block is used to get additional compression. \$\endgroup\$gxtaillon– gxtaillon2014年07月21日 02:41:05 +00:00Commented Jul 21, 2014 at 2:41
C++ (C++11), 2741 points
This answer uses UTF-32 as the encoding for the compressed text.
#include <cstdio>
#include <iostream>
#include <locale>
#include <string>
#define L locale
using namespace std;long b,n,i,l;void c(){string d;char x;while((x=cin.get())!=EOF)d+=x;b=(d.size()*7)%20;n=5;wcout.imbue(L());for(char y:d){b=b<<7|y&127;n+=7;if(n>=20)wcout.put(b>>(n-=20)&0xFFFFF);}if(n)wcout.put(b<<20-n&0xFFFFF);}void d(){wstring d;wchar_t w;wcin.imbue(L());while((w=wcin.get())!=EOF)d+=w;l=-1;for(wchar_t y:d){b=b<<20|y;n+=20;if(l<0)l=b>>15&31,n-=5;while(n>=7&(i<d.size()-1|n>20-l))cout.put(b>>(n-=7)&127);++i;}}int main(int t,char**a){L::global(L("en_US.utf8"));**++a<'d'?c():d();}
Char counts and scoring
Code: 593 chars (the trailing newline is stripped)
Compressed texts (unicode characters): 654 + 145 + 82 + 38 + 51 + 104 + 73 + 423 + 506 = 2148 (counted with wc -m for the number of unicode characters rather than bytes, the byte counts are, as as with @gxtaillon's answer, higher than the originals, 8413 bytes in total, as counted with wc -c).
Compression ratio (ASCII to unicode): 35.01% (using the 6135 bytes from the question (same as wc -c))
Beware:
A lot of shells cannot handle the unicode characters this program produces. Thus, decompressing may lead to the text being cut off somewhere when the shell cannot handle a character, as input is taken via stdin from the shell.
Compiling
It should compile with clang++ and g++ -std=c++11, but it will show some warnings about operator precedence, as an expression like b<<20-n&0xFFFFF will not be treated as ((b << 20) - n) & 0xFFFFF, as one might expect, but rather as (b << (20 - n)) & 0xFFFFF.
Usage
- Compile the program into an executable, e.g.
./compress. - Run the program as
./compress cto compress or./compress dto decompress. (Careful, leaving out the option gives a SEGFAULT (error checking is so character expensive...) and other options (such as usingDinstead ofd) may give unexpected results - Input is read from
stdinand output written tostdout
How it works
Ungolfed
#include <cstdio>
#include <iostream>
#include <locale>
#include <string>
using namespace std;
long b, n, i, l;
// Compress
void c() {
string d;
char x;
// Read from STDIN until EOF
while((x = cin.get()) != EOF)
d += x;
// Calculate the number of bits used from the last unicode character
// (maximum 19) and store it into the first 5 bits
b = (d.size() * 7) % 20;
n = 5;
// Set the output locale to allow unicode
wcout.imbue(locale());
// For each character in the input...
for (char y : d) {
// Add its bit representation (7 bits) to the right of the buffer
// by shifting the buffer left and ORing with the character code
b = (b << 7) | (y & 127);
// Add 7 to the bit counter
n += 7;
// If enough data is present (20 bits per output character),
// calculate the output character by shifting the buffer right,
// so that the last 20 bits are the left 20 bits of the buffer.
// Also decrement the bit counter by 20, as 20 bits are removed.
if (n >= 20)
wcout.put((b >> (n -= 20)) & 0xFFFFF);
}
// If there are still bits in the buffer, write them to the front of
// another unicode character
if (n)
wcout.put((b << (20 - n)) & 0xFFFFF);
}
// Decompress
void d() {
wstring d;
wchar_t w;
// Set STDIN to UNICODE
wcin.imbue(locale());
// Read wide characters from STDIN (std::wcin) until EOF
while ((w = wcin.get()) != EOF)
d += w;
// `l' represents the number of bits used in the last unicode character.
// It will be set later
l = -1;
// For each input character...
for (wchar_t y : d) {
// Add it to the buffer and add 20 to the bit counter
b = (b << 20) | y;
n += 20;
// If the number of bits in the last unicode character has not been
// set yet, read the first 5 buffer bits into `l'. This is
// necessary because the last character may contain more than 7
// (one entire uncompressed character) unused bits which may
// otherwise be interpreted as garbage.
if (l < 0) {
l = (b >> 15) & 31;
n -= 5;
}
// As long as there is data to turn into output characters
// (at least 7 bits in the buffer and either not the last
// unicode character or before the unused bits)
while (n >= 7 && ((i < d.size() - 1) || (n > (20 - l)))
cout.put((b >> (n -= 7)) & 127); // Output the left 7 bits in the buffer as an ASCII character
++i; // Increment the character index, so that we know when we reach the last input character
}
}
int main(int t, char**a) {
// Set the default locale to en_US.utf8 (with unicode)
locale::global(locale("en_US.utf8"));
// Decide whether to compress or decompress.
// This is just fancy pointer stuff for a[1][0] < 'd' ? c() : d()
(**(++a) < 'd') ? c() : d();
}
Explanation
As all unicode characters from U+0000 to U+10FFFF are allowed, we can use 20 bits per unicode char: U+FFFFF uses 20 bits and is still included in the allowed range. Thus, we just try to cram all the individual ASCII char bits into the unicode characters to store multiple ASCII characters in one unicode character. However, we also need to store the number of bits used in the last unicode character, because unused garbage bits may otherwise be interpreted as proper compressed ASCII characters. As the maximum number of used bits in the last unicode character is 20, we will need 5 bits for that, which are placed in the beginning of the compressed data.
Example output
This is the output for example #4 (as given by less):
<U+4E1C5><U+8F265><U+CD9F4><U+69D5A><U+F66DD><U+DBF87><U+1E5CF><U+A75ED>
<U+DFC79><U+F42B8><U+F7CBC><U+BA79E><U+BA77F>쏏𦛏<U+A356B><U+D9EBC><U+63ED8>
<U+B76D1><U+5C3CE><U+6CF8F><U+96CC3><U+BF2F5><U+D3934><U+74DDC><U+F8EAD>
<U+7E316><U+DEFDB><U+D0AF5><U+E7C77><U+EDD7A><U+73E5C><U+786FD><U+DB766>
<U+BD5A7><U+467CD><U+97263><U+C5840>
(쏏𦛏 give <U+C3CF><U+266CF> as character codes, but I might have gotten that wrong)
Python 3, 289+818=1107 points
Only lightly golfed.
import zlib as Z
def p(s):
z=int.from_bytes(Z.compress(s),'big');o=''
while z:
z,d=divmod(z,1<<20)
if d>0xd000:d+=1<<16
o+=chr(d)
return o[::-1]
def u(s):
i=0
for c in s:
d=ord(c)
if d>0xe000:d-=1<<16
i=(i<<20)+d
return Z.decompress(i.to_bytes(i.bit_length()//8+1,'big'))
The total code size is 289 bytes, and encodes the given 6135 bytes into 818 Unicode characters -- the total output byte count is 3201 bytes, significantly smaller than the original inputs.
Encodes using zlib, then secondarily using unicode encoding. Some extra logic was needed to avoid surrogates (which Python really hates).
Example output from #4, as seen by less (37 Unicode chars):
x<U+AC0DC><U+BB701><U+D0200><U+D00B0><U+AD2F4><U+EEFC5>𤆺<U+F4F34>멍<U+3C63A><U+2F62C><U+BA5B6><U+4E70A><U+F7D88><U+FF138><U+40CAE>
<U+CB43E><U+C30F5><U+6FFEF>𥠝<U+698BE><U+9D73A><U+95199><U+BD941><U+10B55E><U+88889><U+75A1F><U+4C4BB><U+5C67A><U+1089A3><U+C75A7>
<U+38AC1><U+4B6BB><U+592F0>ᚋ<U+F2C9B>
Driver program for testing:
if __name__ == '__main__':
import os
total = 0
for i in range(1,10+1):
out = p(open('data/%d.txt'%i,'rb').read())
total += len(out)
open('out/%d.bin'%i,'w',encoding='utf8').write(out)
print(total)
for i in range(1,10+1):
out = u(open('out/%d.bin'%i,'r',encoding='utf8').read())
open('data2/%d.txt'%i,'wb').write(out)
Output byte counts:
607 out/1.bin
128 out/2.bin
101 out/3.bin
143 out/4.bin
177 out/5.bin
145 out/6.bin
186 out/7.bin
222 out/8.bin
389 out/9.bin
1103 out/10.bin
3201 total
-
1\$\begingroup\$ Isn't the fact that this is using a compression library cheating a bit? \$\endgroup\$Beta Decay– Beta Decay2014年11月01日 10:53:19 +00:00Commented Nov 1, 2014 at 10:53
-
\$\begingroup\$ @BetaDecay: It doesn't restrict that in the question, so I figured it was fair game. \$\endgroup\$nneonneo– nneonneo2014年11月01日 17:59:12 +00:00Commented Nov 1, 2014 at 17:59
-
\$\begingroup\$ Also, you need to include a decompresser. \$\endgroup\$Beta Decay– Beta Decay2014年11月04日 07:07:00 +00:00Commented Nov 4, 2014 at 7:07
-
\$\begingroup\$ @BetaDecay:
pis the packer,uis the unpacker. \$\endgroup\$nneonneo– nneonneo2014年11月04日 15:04:42 +00:00Commented Nov 4, 2014 at 15:04
Python 2 - 1141 points
from zlib import *;v=256
def e(b):
x=0
for c in compress(b,9):x=(x*v)+ord(c)
b=bin(x)[2:]
return "".join(unichr(int("1"+b[a:a+19],2))for a in range(0,len(b),19))
def d(s):
y=int("".join(bin(ord(a))[3:]for a in s),2);x=""
while y:y,d=(y/v,chr(y%v));x=d+x
return decompress(x)
Code size is 281 bytes and it encodes the 6135 bytes into 860 unicode characters.
How it works:
To Encode:
- Compress the string to be encoded.
- Interpret the compressed string as a base 256 number.
- Convert the number to binary.
- Split the binary into groups of
19bits, add a1bit to the beginning of each of them, and then convert to Unicode characters.
Decoding is the reverse.
Note that some versions of python can only handle unicode characters up to 0xFFFF, and thus this code will raise a ValueError.
private static final String RICK_ROLL_RETURN = "We're no strangers to love...\$\endgroup\$