Shayne Fletcher: Run length encoding data compression method

Saturday, May 30, 2015

Run length encoding data compression method

Functional programming and lists go together like Fred and Ginger. This little exercise is one of Werner Hett's "Ninety-Nine Prolog Problems". The idea is to implement the run length encoding data compression method.

Here's how we start. First we write a function that packs consecutive duplicates of a list into sublists e.g.

# B.pack ['a'; 'a'; 'a'; 'b'; 'c'; 'c'; 'd'] ;;
- : char list list = [['a'; 'a'; 'a']; ['b']; ['c'; 'c']; ['d'];]

Then, consecutive duplicates of elements are encoded as terms $(N, E)$ where $N$ is the number of duplicates of the element $E$ e.g.

# B.rle (B.pack ['a'; 'a'; 'a'; 'b'; 'c'; 'c'; 'd'; 'e'; 'e']) ;;
- : (int * char) list = [(3, 'a'); (1, 'b'); (2, 'c'); (1, 'd'); (2, 'e')]

We will of course require a function to decode compressed data e.g.

 # B.unrle(B.rle(
 B.pack ['a'; 'a'; 'a'; 'b'; 'c'; 'c'; 'd'; 'e'; 'e'])
 ) ;;
- : char list = ['a'; 'a'; 'a'; 'b'; 'c'; 'c'; 'd'; 'e'; 'e']

(Credit goes to Harvey Stein for the names rle and unrle by the way).

So that's it for the first iteration - here's some code that aims to implement these specifications.

module B = struct
let pack (x : α list) : α list list =
 let f (acc : α list list) (c : α) : α list list =
 match acc with
 | (((b :: _) as hd) :: tl) when c = b -> (c :: hd) :: tl
 | _ -> [c] :: acc
 in List.fold_left f [] x
let rle (x : α list list) : (int * α) list =
 let f (acc : (int * α) list) (l : α list) : (int * α) list =
 (List.length l, List.hd l) :: acc
 in List.fold_left f [] x
 
let unrle (data : (int * α) list) =
 let repeat ((n : int), (c : α)) : α list =
 let rec aux acc i = if i = 0 then acc else aux (c :: acc) (i - 1) in
 aux [] n in
 let f (acc : α list) (elem : (int * α)) : α list =
 acc @ (repeat elem) in
 List.fold_left f [] data
 
end

Now, pack is just a device of course. We don't really need it so here's the next iteration that does away with it.

module E = struct
let rle (x : α list) : (int * α) list =
 let f (acc : (int * α) list) (c : α) : (int * α) list =
 match acc with
 | ((n, e) :: tl) when e = c -> (n + 1, c):: tl
 | _-> (1, c) :: acc
 in List.rev (List.fold_left f [] x)
let unrle (data : (int * α) list) =
 let repeat ((n : int), (c : α)) : α list =
 let rec aux acc i = if i = 0 then acc else aux (c :: acc) (i - 1) in
 aux [] n in
 let f (acc : α list) (elem : (int * α)) : α list =
 acc @ (repeat elem) in
 List.fold_left f [] data
end

Nifty!

Ok, the next idea is that when a singleton byte is encountered, we don't write the term $(1, E)$ instead, we just write $E$. Now OCaml doesn't admit heterogenous lists like Prolog appears to do so we need a sum type for the two possibilities. This then is the final version.

module F = struct
 type α t = | S of α | C of (int * α)
 let rle (bytes : α list) : α t list =
 let f (acc : α t list) (b : α) : α t list =
 match acc with
 | ((S e) :: tl) when e = b -> (C (2, e)) :: tl
 | ((C (n, e)) :: tl) when e = b -> (C (n + 1, b)) :: tl
 | _-> S b :: acc
 in List.rev (List.fold_left f [] bytes)
 let unrle (data : (α t) list) =
 let rec aux (acc : α list) (b : α) : (int -> α list) = 
 function | 0 -> acc | i -> aux (b :: acc) b (i - 1) in
 let f (acc : α list) (e : α t) : α list =
 acc @ (match e with | S b -> [b]| C (n, b) -> aux [] b n) in
 List.fold_left f [] data
end

Having worked out the details in OCaml, translation into C++ is reasonably straight-forward. One economy granted by this language is that we can do away with the data constructor S in this version.

#include <boost/variant.hpp>
#include <boost/variant/apply_visitor.hpp>
#include <boost/range.hpp>
#include <boost/range/numeric.hpp>
#include <list>
//Representation of the encoding
template <class A> struct C { std::pair <int, A> item; };
template <class A> using datum = boost::variant <A, C<A>>;
template <class A> using encoding = std::list<datum<A>>;
//Procedural function object that updates an encoding given a
//datum
template <class A>
struct update : boost::static_visitor<> {
 A c;
 encoding<A>& l;
 update (A c, encoding<A>& l) : c (c), l (l) 
 {}
 void operator ()(A e) const { 
 if (e == c) {
 l.back () = C<A>{ std::make_pair(2, c) }; 
 return;
 }
 l.push_back (c);
 }
 void operator ()(C<A> const& elem) const { 
 if (elem.item.second == c) {
 l.back () = C<A>{ std::make_pair (elem.item.first + 1, c) };
 return;
 }
 l.push_back (c);
 }
};
template <class R>
encoding<typename boost::range_value<R>::type> rle (R bytes) {
 typedef boost::range_value<R>::type A;
 auto f = [](encoding<A> acc, A b) -> encoding<A> {
 if (acc.size () == 0)
 acc.push_back (b);
 else {
 boost::apply_visitor (update<A>(b, acc), acc.back ());
 }
 return acc;
 };
 return boost::accumulate (bytes, encoding<A> (), f);
}

I've left implementing unrle () as an exercise. Here's a little test though that confirms that we are getting savings in from the compression scheme as we hope for.

int main () {
 std::string buf=
 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
 "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
 "c"
 "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
 "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
 "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
 "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
 "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
 "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
 "eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee"
 "z";
 std::list<char> data(buf.begin (), buf.end());
 encoding<char> compressed = rle (data);
 std::cout << sizeof (char) * (data.size ()) << std::endl;
 std::cout << sizeof (datum <char>) * (compressed.size ()) << std::endl;
 return 0;
}

On my machine, this program prints the values 484ドル$ and 72ドル$.

Posted by Shayne Fletcher at 2:03 PM

Email This BlogThis! Share to X Share to Facebook Share to Pinterest

Labels: Algorithms, C++, Compression, OCaml, Prolog