Who wants to be a Kolmogorov complexity winner?

Question 1

You mission today is to invent a text compressor.

Task

You'll write two functions:

The packer is a function that accepts a string of ASCII characters (U+0000 to U+007F) and outputs a Unicode string (U+0000 to U+10FFFF), containing the fewest characters possible.
The unpacker is a function that accepts an encoded Unicode string and outputs exactly the original ASCII string.

Input

The only authorized input is the ASCII string (for the packer) and the packed Unicode string (for the unpacker). No user input, no internet connection, no use of file system.

Your functions can have access to this list of english words. You can use this list as a local txt file, or copy its content in your source code as a string or an array of strings.

You can't hardcode the snippets below in your functions.

Output

The only authorized output for both functions is a string.

The output of the unpacker must contain exactly the same characters as the input of the packer.

Your inputs and outputs can use any character encoding supporting all Unicode (UTF-8/16/32, GB18030, ...), as your score will only depend on the number of Unicode characters in the output. Please precise which encoding you're using, though.

To count the number of Unicode characters in your output, you can use this tool: http://mothereff.in/byte-counter

Scoring

Your entry must be able to pack and unpack the 10 following text snippets (that I took on this forum).

Your score will be the sum of the sizes of your 10 packed strings (in Unicode characters) + the size of your two functions (in Unicode characters, too)

Don't count the size of the dictionnary if you use it.

Please include in your entries the "score" of each snippet and their packed version.

Lowest score wins.

Data

Here are the snippets to encode to compute your score:

1: Rick Roll lyrics (1870b): We're no strangers to code golf, you know the rules, and so do I

We're no strangers to love
You know the rules and so do I
A full commitment's what I'm thinking of
You wouldn't get this from any other guy
I just wanna tell you how I'm feeling
Gotta make you understand
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
We've known each other for so long
Your heart's been aching but
You're too shy to say it
Inside we both know what's been going on
We know the game and we're gonna play it
And if you ask me how I'm feeling
Don't tell me you're too blind to see
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
(Ooh, give you up)
(Ooh, give you up)
(Ooh)
Never gonna give, never gonna give
(Give you up)
(Ooh)
Never gonna give, never gonna give
(Give you up)
We've know each other for so long
Your heart's been aching but
You're too shy to say it
Inside we both know what's been going on
We know the game and we're gonna play it
I just wanna tell you how I'm feeling
Gotta make you understand
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you

2: The Golfer (412b): Golfing ASCII-art

 '\ . . |>18>>
 \ . ' . |
 O>> . 'o |
 \ . |
 /\ . |
 / / .' |
 jgs^^^^^^^`^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

3: the number diamond (233b): Print this diamond

 1
 121
 12321
 1234321
 123454321
 12345654321
 1234567654321
 123456787654321
12345678987654321
 123456787654321
 1234567654321
 12345654321
 123454321
 1234321
 12321
 121
 1

4: the alphabet four times (107b): Print the alphabet four times

abcdefghijklmnopqrstuvwxyz
qwertyuiopasdfghjklzxcvbnm
pyfgcrlaoeuidhtnsqjkxbmwvz
zyxwvutsrqponmlkjihgfedcba

5: Old McDonald's lyrics (203b): Old MacDonald function

Old MacDonald had a farm, E-I-E-I-O,
And on that farm he had a cow, E-I-E-I-O,
With a moo moo here and a moo moo there,
Here a moo, there a moo, everywhere a moo moo,
Old MacDonald had a farm, E-I-E-I-O!

6: Rock around the clock lyrics (144b): Rock Around the Clock

1, 2, 3 o'clock, 4 o'clock rock,
5, 6, 7 o'clock, 8 o'clock rock,
9, 10, 11 o'clock, 12 o'clock rock,
We're gonna rock around the clock tonight.

7: Hello World (296b): Say "Hello" to the world in ASCII art

 _ _ _ _ _ _ _
| | | | ___| | | ___ __ _____ _ __| | __| | |
| |_| |/ _ \ | |/ _ \ \ \ /\ / / _ \| '__| |/ _` | |
| _ | __/ | | (_) | \ V V / (_) | | | | (_| |_|
|_| |_|\___|_|_|\___( ) \_/\_/ \___/|_| |_|\__,_(_)
 |/

8: Irish blessing (210b): An Old Irish Blessing

May the road rise up to meet you
May the wind be always at your back
May the sun shine warm upon your face
The rains fall soft upon your fields
And until we meet again
May God hold you in the hollow of His hand

9: There was an old lady lyrics (1208b): There Was an Old Lady

There was an old lady who swallowed a fly. 
I don't know why she swallowed that fly, 
Perhaps she'll die.
There was an old lady who swallowed a spider, 
That wriggled and iggled and jiggled inside her. 
She swallowed the spider to catch the fly, 
I don't know why she swallowed that fly, 
Perhaps she'll die.
There was an old lady who swallowed a bird, 
How absurd to swallow a bird. 
She swallowed the bird to catch the spider, 
She swallowed the spider to catch the fly, 
I don't know why she swallowed that fly, 
Perhaps she'll die.
There was an old lady who swallowed a cat, 
Imagine that to swallow a cat. 
She swallowed the cat to catch the bird, 
She swallowed the bird to catch the spider, 
She swallowed the spider to catch the fly, 
I don't know why she swallowed that fly, 
Perhaps she'll die.
There was an old lady who swallowed a dog, 
What a hog to swallow a dog. 
She swallowed the dog to catch the cat, 
She swallowed the cat to catch the bird, 
She swallowed the bird to catch the spider, 
She swallowed the spider to catch the fly, 
I don't know why she swallowed that fly, 
Perhaps she'll die.
There was an old lady who swallowed a horse, 
She died of course.

10: gettysburg address (1452b): How random is the Gettysburg Address

Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battlefield of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate, we can not consecrate, we can not hallow this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us-that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion-that we here highly resolve that these dead shall not have died in vain-that this nation, under God, shall have a new birth of freedom-and that government of the people, by the people, for the people, shall not perish from the earth.

Total (uncompressed): 6135 chars/bytes.

Have fun!

Question 2

This isn't a challenge to invent a language, this is a challenge to compress something.

Question 3

I think not including the size of the compiler/executor (compressor/decompressor) in the score makes this challenge a little open-ended. At some point, the line between dictionary and hard-coding will become very thin.

Question 4

Darn, and here I was already typing private static final String RICK_ROLL_RETURN = "We're no strangers to love...

Question 5

I don't think you've addressed Dennis' observation.

Question 6

@xem I think the post could be improved by organizing the information into sections such like #Task, #Input, #Output, #Scoring. I also think that the size of the source code for the compressor and the decompressor should be included in the score. This doesn't hurt anything, but it solves the problem Dennis pointed out.

Question 7

Haskell - 5322 points

Bytes of code : 686

Original size : 6147 = 1871+415+234+108+204+145+297+211+1209+1453

Encoded size : 4636 = 1396+233+163+92+153+115+197+164+979+1144

Score : 686+ 4636

Character count compression : ~25%

Code

Optimizations aside, this stores values between 0 and 7f in unicode characters as their prime factors.

It does not lower the number of bytes of the encoded output, it only lowers the number of unicode characters. For instance, test #4 contains 108 characters and the encoded output, 92. Their respective sizes are however 108 and 364 bytes.

import Data.Bits
import Data.List
import Data.Numbers.Primes
import qualified Data.Text as T
a=nub$concat$map(primeFactors)[0..127]
d(a:r)c=let s=shift c 5in if s<=0x10ffffthen d r(s+a)else c:d r a
d[]c=[c]
f(a:r)=let o=a.&.0x1fin(if o/=a then f((shiftR a 5):r)else f r)++[o]
f[]=[]
g=T.pack.map(toEnum).(\(a:r)->d r a).concatMap j.map(map(\x->head$elemIndices x a)).map(primeFactors.fromEnum).T.unpack
h=T.pack.map(toEnum.product.map((!!)a)).i.f.reverse.map(fromEnum).T.unpack
i(a:r)=let z=a`clearBit`4;x=if a`testBit`4then(take z$repeat$head r,tail r)else splitAt z r in[fst x]++i(snd x)
i[]=[]
j x@(a:_)=let l=length x in if(take l(repeat a))==x then[l`setBit`4,a]else[l]++x
j[]=[0]

How it works

Encoding
1. Each character is converted to its numeric equivalent, lets call this number n.
2. n is then converted into a list of its prime factors, ps.
  - It conveniently happens that the numbers 0 through 127 have 32 common prime factors, excluding 1. This mean they, the factors, can be stored as indexes on as little as 5 bits.
  - 1 is a special case and is represented by an empty list.
3. With ps the encoding can now start.
  1. Each number of ps is converted into its index in the list of 32 unique factors (In the above code this list is identified as a).
  2. (Keep in mind at this point we are dealing with a list of list of indexes of prime factors) To proceed to the next step, ps needs to be flattened. To preserve the integrity of the data, each list is converted into another list of two parts
    1. The first element stores its length and if the it is composed of the same factor.
      - There are at most 6 prime factors per list, this information is stored on the rightmost 3 bits. The fifth bit is used as a flag to indicate if the list is comprised of a single factor.
    2. The remaining elements are the indexes themselves or a single index if there is less than two different factors in the list.
  3. These lists are then concatenated into a single flattened list, fs.
4. The elements of fs can then be packed into unicode characters using bit shifting.
Decoding
- Do the encoding steps in reverse.
- If you are wondering how 1 fits into this, I would like to remind you that product [] == 1.

Tests

Using this interface for testing would be painful so I used this function to provide the results below.

edTest f = do
 t <- readFile f
 let txt = T.pack t
 enc = g txt
 dec = h enc
 tst = txt == dec
 putStrLn $ (show $ T.length txt) ++ "," ++ (show $ T.length enc) ++ "," ++ (show $ T.length dec)++","++(show tst)
 putStrLn $ if not tst then T.unpack txt ++ "\n---NEXT---\n" ++ T.unpack dec else ""

λ> edTest "1"
1871,1396,1871,True
λ> edTest "2"
412,233,412,True
λ> edTest "3"
234,163,234,True
λ> edTest "4"
108,92,108,True
λ> edTest "5"
204,153,204,True
λ> edTest "6"
145,115,145,True
λ> edTest "7"
297,197,297,True
λ> edTest "8"
211,164,211,True
λ> edTest "9"
1209,979,1209,True
λ> edTest "10"
1453,1144,1453,True

Sample

The output of the encoding function g for test #4 is this
"99429円582753円135266円70785円35953円855074円247652円1082563円68738円49724円164898円68157円99429円67973円1082404円587873円73795円298017円330818円198705円69861円1082435円595009円607426円36414円69873円855074円265249円346275円67779円68738円77985円1082513円821353円132131円101410円247652円1082562円49724円164898円67649円594977円34915円67746円50273円135265円103997円563265円103457円1086021円99399円584802円70753円73889円34882円582722円411459円67779円68740円1084516円1082563円1091681円103491円313282円49724円164897円68705円135741円69858円50241円607426円35905円608421円1082435円69858円50274円71777円43075円298018円280517円1082404円67971円36017円955425円67665円919600円100452円132129円214883円35057円856097円101474円70753円135737円"
or if you are an adept of gibberish, this
𘑥򎑡𡁢𑒁豱󐰢𼝤􈓃𐲂숼𨐢𐨽𘑥𐦅􈐤򏡡𒁃񈰡񐱂𰠱𑃥􈑃򑑁򔓂踾𑃱󐰢񀰡񔢣𐣃𐲂𓂡􈒑󈡩𠐣𘰢𼝤􈓂숼𨐢𐡁򑐡衣𐢢쑡𡁡𙘽򉡁𙐡􉉅𘑇򎱢𑑡𒂡衂򎑂񤝃𐣃𐲄􈱤􈓃􊡡𙑃񌟂숼𨐡𐱡𡈽𑃢쑁򔓂豁򔢥􈑃𑃢쑢𑡡ꡃ񈰢񄟅􈐤𐦃貱󩐡𐡑󠠰𘡤𠐡𴝣裱󑀡𘱢𑑡𡈹

Additionnal notes

Using http://mothereff.in/byte-counter, directory listings and edTest the size of the tests are all consistent but still differs from the indicated size in the question.
Test #10 contains a couple of EM DASHes (—) that I replaced with - since they are outside of the 0-7f range.
Further compression could be achieved using the remaining fourth bit while flattening, for instance, 00 base case, 01 repeat all, 10 repeat except last, 11 repeat except last two.
The test files and the code are all available here https://github.com/gxtaillon/codegolf/tree/master/Kolmogorov

Question 8

Hello, thanks for this answer! :) I didn't understand what happens in binary when you convert abcdefghijklm... to 𘑥򎑡𡁢𑒁豱󐰢𼝤..., could you explain it a little bit more please? Also, I've fixed the char counts and converted the the em-dashes in #10, in the question. My char counts are still different than yours, though. Dunno why, I used the mothereff.in tool.

Question 9

@xem The intricate details have been revealed.

Question 10

My mind is (figuratively) literally blown by the idea that the numbers 0 and 2-127 can be encoded on 5 bits. Did you find this by yourself or was it something known? Bonus question: how many bits do you need to store only the printable ascii characters, i.e. 95 different chars?

Question 11

@xem The numbers are not encoded on 5 bits, each of their factors are. I would be very happy if I had found a way of encoding 7 bits on only 5. As for ascii characters, using this method they would still need 5 bits each.

Question 12

Since in the range you specified there are at most 6 factors per number, the length uses 3 out of a 5 bits "block". Then the indexes are encoded on 5 bits, yes. In this implementation, one of the 2 unused bits in the length block is used to get additional compression.

Question 13

C++ (C++11), 2741 points

This answer uses UTF-32 as the encoding for the compressed text.

#include <cstdio>
#include <iostream>
#include <locale>
#include <string>
#define L locale
using namespace std;long b,n,i,l;void c(){string d;char x;while((x=cin.get())!=EOF)d+=x;b=(d.size()*7)%20;n=5;wcout.imbue(L());for(char y:d){b=b<<7|y&127;n+=7;if(n>=20)wcout.put(b>>(n-=20)&0xFFFFF);}if(n)wcout.put(b<<20-n&0xFFFFF);}void d(){wstring d;wchar_t w;wcin.imbue(L());while((w=wcin.get())!=EOF)d+=w;l=-1;for(wchar_t y:d){b=b<<20|y;n+=20;if(l<0)l=b>>15&31,n-=5;while(n>=7&(i<d.size()-1|n>20-l))cout.put(b>>(n-=7)&127);++i;}}int main(int t,char**a){L::global(L("en_US.utf8"));**++a<'d'?c():d();}

Char counts and scoring

Code: 593 chars (the trailing newline is stripped)

Compressed texts (unicode characters): 654 + 145 + 82 + 38 + 51 + 104 + 73 + 423 + 506 = 2148 (counted with wc -m for the number of unicode characters rather than bytes, the byte counts are, as as with @gxtaillon's answer, higher than the originals, 8413 bytes in total, as counted with wc -c).

Compression ratio (ASCII to unicode): 35.01% (using the 6135 bytes from the question (same as wc -c))

Beware:

A lot of shells cannot handle the unicode characters this program produces. Thus, decompressing may lead to the text being cut off somewhere when the shell cannot handle a character, as input is taken via stdin from the shell.

Compiling

It should compile with clang++ and g++ -std=c++11, but it will show some warnings about operator precedence, as an expression like b<<20-n&0xFFFFF will not be treated as ((b << 20) - n) & 0xFFFFF, as one might expect, but rather as (b << (20 - n)) & 0xFFFFF.

Usage

Compile the program into an executable, e.g. ./compress.
Run the program as ./compress c to compress or ./compress d to decompress. (Careful, leaving out the option gives a SEGFAULT (error checking is so character expensive...) and other options (such as using D instead of d) may give unexpected results
Input is read from stdin and output written to stdout

How it works

Ungolfed

#include <cstdio>
#include <iostream>
#include <locale>
#include <string>
using namespace std;
long b, n, i, l;
// Compress
void c() {
 string d;
 char x;
 // Read from STDIN until EOF
 while((x = cin.get()) != EOF)
 d += x;
 // Calculate the number of bits used from the last unicode character
 // (maximum 19) and store it into the first 5 bits
 b = (d.size() * 7) % 20;
 n = 5;
 // Set the output locale to allow unicode
 wcout.imbue(locale());
 // For each character in the input...
 for (char y : d) {
 // Add its bit representation (7 bits) to the right of the buffer
 // by shifting the buffer left and ORing with the character code
 b = (b << 7) | (y & 127);
 // Add 7 to the bit counter
 n += 7;
 // If enough data is present (20 bits per output character),
 // calculate the output character by shifting the buffer right,
 // so that the last 20 bits are the left 20 bits of the buffer.
 // Also decrement the bit counter by 20, as 20 bits are removed.
 if (n >= 20)
 wcout.put((b >> (n -= 20)) & 0xFFFFF);
 }
 // If there are still bits in the buffer, write them to the front of
 // another unicode character
 if (n)
 wcout.put((b << (20 - n)) & 0xFFFFF);
}
// Decompress
void d() {
 wstring d;
 wchar_t w;
 // Set STDIN to UNICODE
 wcin.imbue(locale());
 // Read wide characters from STDIN (std::wcin) until EOF
 while ((w = wcin.get()) != EOF)
 d += w;
 // `l' represents the number of bits used in the last unicode character.
 // It will be set later
 l = -1;
 // For each input character...
 for (wchar_t y : d) {
 // Add it to the buffer and add 20 to the bit counter
 b = (b << 20) | y;
 n += 20;
 // If the number of bits in the last unicode character has not been
 // set yet, read the first 5 buffer bits into `l'. This is
 // necessary because the last character may contain more than 7
 // (one entire uncompressed character) unused bits which may
 // otherwise be interpreted as garbage.
 if (l < 0) {
 l = (b >> 15) & 31;
 n -= 5;
 }
 // As long as there is data to turn into output characters
 // (at least 7 bits in the buffer and either not the last
 // unicode character or before the unused bits)
 while (n >= 7 && ((i < d.size() - 1) || (n > (20 - l)))
 cout.put((b >> (n -= 7)) & 127); // Output the left 7 bits in the buffer as an ASCII character
 ++i; // Increment the character index, so that we know when we reach the last input character
 }
}
int main(int t, char**a) {
 // Set the default locale to en_US.utf8 (with unicode)
 locale::global(locale("en_US.utf8"));
 // Decide whether to compress or decompress.
 // This is just fancy pointer stuff for a[1][0] < 'd' ? c() : d()
 (**(++a) < 'd') ? c() : d();
}

Explanation

As all unicode characters from U+0000 to U+10FFFF are allowed, we can use 20 bits per unicode char: U+FFFFF uses 20 bits and is still included in the allowed range. Thus, we just try to cram all the individual ASCII char bits into the unicode characters to store multiple ASCII characters in one unicode character. However, we also need to store the number of bits used in the last unicode character, because unused garbage bits may otherwise be interpreted as proper compressed ASCII characters. As the maximum number of used bits in the last unicode character is 20, we will need 5 bits for that, which are placed in the beginning of the compressed data.

Example output

This is the output for example #4 (as given by less):

<U+4E1C5><U+8F265><U+CD9F4><U+69D5A><U+F66DD><U+DBF87><U+1E5CF><U+A75ED>
<U+DFC79><U+F42B8><U+F7CBC><U+BA79E><U+BA77F>쏏𦛏<U+A356B><U+D9EBC><U+63ED8>
<U+B76D1><U+5C3CE><U+6CF8F><U+96CC3><U+BF2F5><U+D3934><U+74DDC><U+F8EAD>
<U+7E316><U+DEFDB><U+D0AF5><U+E7C77><U+EDD7A><U+73E5C><U+786FD><U+DB766>
<U+BD5A7><U+467CD><U+97263><U+C5840>

(쏏𦛏 give <U+C3CF><U+266CF> as character codes, but I might have gotten that wrong)

Question 14

Python 3, 289+818=1107 points

Only lightly golfed.

import zlib as Z
def p(s):
 z=int.from_bytes(Z.compress(s),'big');o=''
 while z:
 z,d=divmod(z,1<<20)
 if d>0xd000:d+=1<<16
 o+=chr(d)
 return o[::-1]
def u(s):
 i=0
 for c in s:
 d=ord(c)
 if d>0xe000:d-=1<<16
 i=(i<<20)+d
 return Z.decompress(i.to_bytes(i.bit_length()//8+1,'big'))

The total code size is 289 bytes, and encodes the given 6135 bytes into 818 Unicode characters -- the total output byte count is 3201 bytes, significantly smaller than the original inputs.

Encodes using zlib, then secondarily using unicode encoding. Some extra logic was needed to avoid surrogates (which Python really hates).

Example output from #4, as seen by less (37 Unicode chars):

x<U+AC0DC><U+BB701><U+D0200><U+D00B0><U+AD2F4><U+EEFC5>𤆺<U+F4F34>멍<U+3C63A><U+2F62C><U+BA5B6><U+4E70A><U+F7D88><U+FF138><U+40CAE>
<U+CB43E><U+C30F5><U+6FFEF>𥠝<U+698BE><U+9D73A><U+95199><U+BD941><U+10B55E><U+88889><U+75A1F><U+4C4BB><U+5C67A><U+1089A3><U+C75A7>
<U+38AC1><U+4B6BB><U+592F0>ᚋ<U+F2C9B>

Driver program for testing:

if __name__ == '__main__':
 import os
 total = 0
 for i in range(1,10+1):
 out = p(open('data/%d.txt'%i,'rb').read())
 total += len(out)
 open('out/%d.bin'%i,'w',encoding='utf8').write(out)
 print(total)
 for i in range(1,10+1):
 out = u(open('out/%d.bin'%i,'r',encoding='utf8').read())
 open('data2/%d.txt'%i,'wb').write(out)

Output byte counts:

 607 out/1.bin
 128 out/2.bin
 101 out/3.bin
 143 out/4.bin
 177 out/5.bin
 145 out/6.bin
 186 out/7.bin
 222 out/8.bin
 389 out/9.bin
1103 out/10.bin
3201 total

Question 15

Isn't the fact that this is using a compression library cheating a bit?

Question 16

@BetaDecay: It doesn't restrict that in the question, so I figured it was fair game.

Question 17

Also, you need to include a decompresser.

Question 18

@BetaDecay: p is the packer, u is the unpacker.

Question 19

Python 2 - 1141 points

from zlib import *;v=256
def e(b):
 x=0
 for c in compress(b,9):x=(x*v)+ord(c)
 b=bin(x)[2:]
 return "".join(unichr(int("1"+b[a:a+19],2))for a in range(0,len(b),19))
def d(s):
 y=int("".join(bin(ord(a))[3:]for a in s),2);x=""
 while y:y,d=(y/v,chr(y%v));x=d+x
 return decompress(x)

Code size is 281 bytes and it encodes the 6135 bytes into 860 unicode characters.

How it works:

To Encode:

Compress the string to be encoded.
Interpret the compressed string as a base 256 number.
Convert the number to binary.
Split the binary into groups of 19 bits, add a 1 bit to the beginning of each of them, and then convert to Unicode characters.

Decoding is the reverse.

Note that some versions of python can only handle unicode characters up to 0xFFFF, and thus this code will raise a ValueError.

gxtaillon 6074 silver badges14 bronze badges · Accepted Answer · 2014-07-19 22:12:56Z

Haskell - 5322 points

Bytes of code : 686

Original size : 6147 = 1871+415+234+108+204+145+297+211+1209+1453

Encoded size : 4636 = 1396+233+163+92+153+115+197+164+979+1144

Score : 686+ 4636

Character count compression : ~25%

Code

Optimizations aside, this stores values between 0 and 7f in unicode characters as their prime factors.

It does not lower the number of bytes of the encoded output, it only lowers the number of unicode characters. For instance, test #4 contains 108 characters and the encoded output, 92. Their respective sizes are however 108 and 364 bytes.

import Data.Bits
import Data.List
import Data.Numbers.Primes
import qualified Data.Text as T
a=nub$concat$map(primeFactors)[0..127]
d(a:r)c=let s=shift c 5in if s<=0x10ffffthen d r(s+a)else c:d r a
d[]c=[c]
f(a:r)=let o=a.&.0x1fin(if o/=a then f((shiftR a 5):r)else f r)++[o]
f[]=[]
g=T.pack.map(toEnum).(\(a:r)->d r a).concatMap j.map(map(\x->head$elemIndices x a)).map(primeFactors.fromEnum).T.unpack
h=T.pack.map(toEnum.product.map((!!)a)).i.f.reverse.map(fromEnum).T.unpack
i(a:r)=let z=a`clearBit`4;x=if a`testBit`4then(take z$repeat$head r,tail r)else splitAt z r in[fst x]++i(snd x)
i[]=[]
j x@(a:_)=let l=length x in if(take l(repeat a))==x then[l`setBit`4,a]else[l]++x
j[]=[0]

How it works

Encoding
1. Each character is converted to its numeric equivalent, lets call this number n.
2. n is then converted into a list of its prime factors, ps.
  - It conveniently happens that the numbers 0 through 127 have 32 common prime factors, excluding 1. This mean they, the factors, can be stored as indexes on as little as 5 bits.
  - 1 is a special case and is represented by an empty list.
3. With ps the encoding can now start.
  1. Each number of ps is converted into its index in the list of 32 unique factors (In the above code this list is identified as a).
  2. (Keep in mind at this point we are dealing with a list of list of indexes of prime factors) To proceed to the next step, ps needs to be flattened. To preserve the integrity of the data, each list is converted into another list of two parts
    1. The first element stores its length and if the it is composed of the same factor.
      - There are at most 6 prime factors per list, this information is stored on the rightmost 3 bits. The fifth bit is used as a flag to indicate if the list is comprised of a single factor.
    2. The remaining elements are the indexes themselves or a single index if there is less than two different factors in the list.
  3. These lists are then concatenated into a single flattened list, fs.
4. The elements of fs can then be packed into unicode characters using bit shifting.
Decoding
- Do the encoding steps in reverse.
- If you are wondering how 1 fits into this, I would like to remind you that product [] == 1.

Tests

Using this interface for testing would be painful so I used this function to provide the results below.

edTest f = do
 t <- readFile f
 let txt = T.pack t
 enc = g txt
 dec = h enc
 tst = txt == dec
 putStrLn $ (show $ T.length txt) ++ "," ++ (show $ T.length enc) ++ "," ++ (show $ T.length dec)++","++(show tst)
 putStrLn $ if not tst then T.unpack txt ++ "\n---NEXT---\n" ++ T.unpack dec else ""

λ> edTest "1"
1871,1396,1871,True
λ> edTest "2"
412,233,412,True
λ> edTest "3"
234,163,234,True
λ> edTest "4"
108,92,108,True
λ> edTest "5"
204,153,204,True
λ> edTest "6"
145,115,145,True
λ> edTest "7"
297,197,297,True
λ> edTest "8"
211,164,211,True
λ> edTest "9"
1209,979,1209,True
λ> edTest "10"
1453,1144,1453,True

Sample

The output of the encoding function g for test #4 is this
"99429円582753円135266円70785円35953円855074円247652円1082563円68738円49724円164898円68157円99429円67973円1082404円587873円73795円298017円330818円198705円69861円1082435円595009円607426円36414円69873円855074円265249円346275円67779円68738円77985円1082513円821353円132131円101410円247652円1082562円49724円164898円67649円594977円34915円67746円50273円135265円103997円563265円103457円1086021円99399円584802円70753円73889円34882円582722円411459円67779円68740円1084516円1082563円1091681円103491円313282円49724円164897円68705円135741円69858円50241円607426円35905円608421円1082435円69858円50274円71777円43075円298018円280517円1082404円67971円36017円955425円67665円919600円100452円132129円214883円35057円856097円101474円70753円135737円"
or if you are an adept of gibberish, this
𘑥򎑡𡁢𑒁豱󐰢𼝤􈓃𐲂숼𨐢𐨽𘑥𐦅􈐤򏡡𒁃񈰡񐱂𰠱𑃥􈑃򑑁򔓂踾𑃱󐰢񀰡񔢣𐣃𐲂𓂡􈒑󈡩𠐣𘰢𼝤􈓂숼𨐢𐡁򑐡衣𐢢쑡𡁡𙘽򉡁𙐡􉉅𘑇򎱢𑑡𒂡衂򎑂񤝃𐣃𐲄􈱤􈓃􊡡𙑃񌟂숼𨐡𐱡𡈽𑃢쑁򔓂豁򔢥􈑃𑃢쑢𑡡ꡃ񈰢񄟅􈐤𐦃貱󩐡𐡑󠠰𘡤𠐡𴝣裱󑀡𘱢𑑡𡈹

Additionnal notes

Using http://mothereff.in/byte-counter, directory listings and edTest the size of the tests are all consistent but still differs from the indicated size in the question.
Test #10 contains a couple of EM DASHes (—) that I replaced with - since they are outside of the 0-7f range.
Further compression could be achieved using the remaining fourth bit while flattening, for instance, 00 base case, 01 repeat all, 10 repeat except last, 11 repeat except last two.
The test files and the code are all available here https://github.com/gxtaillon/codegolf/tree/master/Kolmogorov

Hello, thanks for this answer! :) I didn't understand what happens in binary when you convert abcdefghijklm... to 𘑥򎑡𡁢𑒁豱󐰢𼝤..., could you explain it a little bit more please? Also, I've fixed the char counts and converted the the em-dashes in #10, in the question. My char counts are still different than yours, though. Dunno why, I used the mothereff.in tool.
My mind is (figuratively) literally blown by the idea that the numbers 0 and 2-127 can be encoded on 5 bits. Did you find this by yourself or was it something known? Bonus question: how many bits do you need to store only the printable ascii characters, i.e. 95 different chars?
@xem The numbers are not encoded on 5 bits, each of their factors are. I would be very happy if I had found a way of encoding 7 bits on only 5. As for ascii characters, using this method they would still need 5 bits each.
Since in the range you specified there are at most 6 factors per number, the length uses 3 out of a 5 bits "block". Then the indexes are encoded on 5 bits, yes. In this implementation, one of the 2 unused bits in the length block is used to get additional compression.

Stack Exchange Network

Who wants to be a Kolmogorov complexity winner?

Task

Input

Output

Scoring

Data

4 Answers 4

Haskell - 5322 points

Code

How it works

Tests

Sample

Additionnal notes

C++ (C++11), 2741 points

Char counts and scoring

Beware:

Compiling

Usage

How it works

Ungolfed

Explanation

Example output

Python 3, 289+818=1107 points

Python 2 - 1141 points

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Who wants to be a Kolmogorov complexity winner?

Task

Input

Output

Scoring

Data

4 Answers 4

Haskell - 5322 points

Code

How it works

Tests

Sample

Additionnal notes

C++ (C++11), 2741 points

Char counts and scoring

Beware:

Compiling

Usage

How it works

Ungolfed

Explanation

Example output

Python 3, 289+818=1107 points

Python 2 - 1141 points

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions