Saturday, May 30, 2015

Run length encoding data compression method

Functional programming and lists go together like Fred and Ginger. This little exercise is one of Werner Hett's "Ninety-Nine Prolog Problems". The idea is to implement the run length encoding data compression method.

Here's how we start. First we write a function that packs consecutive duplicates of a list into sublists e.g.

# B.pack ['a'; 'a'; 'a'; 'b'; 'c'; 'c'; 'd'] ;;
- : char list list = [['a'; 'a'; 'a']; ['b']; ['c'; 'c']; ['d'];]
Then, consecutive duplicates of elements are encoded as terms $(N, E)$ where $N$ is the number of duplicates of the element $E$ e.g.
# B.rle (B.pack ['a'; 'a'; 'a'; 'b'; 'c'; 'c'; 'd'; 'e'; 'e']) ;;
- : (int * char) list = [(3, 'a'); (1, 'b'); (2, 'c'); (1, 'd'); (2, 'e')]
We will of course require a function to decode compressed data e.g.
 # B.unrle(B.rle(
 B.pack ['a'; 'a'; 'a'; 'b'; 'c'; 'c'; 'd'; 'e'; 'e'])
 ) ;;
- : char list = ['a'; 'a'; 'a'; 'b'; 'c'; 'c'; 'd'; 'e'; 'e']
(Credit goes to Harvey Stein for the names rle and unrle by the way).

So that's it for the first iteration - here's some code that aims to implement these specifications.
module B = struct
let pack (x : α list) : α list list =
 let f (acc : α list list) (c : α) : α list list =
 match acc with
 | (((b :: _) as hd) :: tl) when c = b -> (c :: hd) :: tl
 | _ -> [c] :: acc
 in List.fold_left f [] x
let rle (x : α list list) : (int * α) list =
 let f (acc : (int * α) list) (l : α list) : (int * α) list =
 (List.length l, List.hd l) :: acc
 in List.fold_left f [] x
 
let unrle (data : (int * α) list) =
 let repeat ((n : int), (c : α)) : α list =
 let rec aux acc i = if i = 0 then acc else aux (c :: acc) (i - 1) in
 aux [] n in
 let f (acc : α list) (elem : (int * α)) : α list =
 acc @ (repeat elem) in
 List.fold_left f [] data
 
end

Now, pack is just a device of course. We don't really need it so here's the next iteration that does away with it.

module E = struct
let rle (x : α list) : (int * α) list =
 let f (acc : (int * α) list) (c : α) : (int * α) list =
 match acc with
 | ((n, e) :: tl) when e = c -> (n + 1, c):: tl
 | _-> (1, c) :: acc
 in List.rev (List.fold_left f [] x)
let unrle (data : (int * α) list) =
 let repeat ((n : int), (c : α)) : α list =
 let rec aux acc i = if i = 0 then acc else aux (c :: acc) (i - 1) in
 aux [] n in
 let f (acc : α list) (elem : (int * α)) : α list =
 acc @ (repeat elem) in
 List.fold_left f [] data
end

Nifty!

Ok, the next idea is that when a singleton byte is encountered, we don't write the term $(1, E)$ instead, we just write $E$. Now OCaml doesn't admit heterogenous lists like Prolog appears to do so we need a sum type for the two possibilities. This then is the final version.

module F = struct
 type α t = | S of α | C of (int * α)
 let rle (bytes : α list) : α t list =
 let f (acc : α t list) (b : α) : α t list =
 match acc with
 | ((S e) :: tl) when e = b -> (C (2, e)) :: tl
 | ((C (n, e)) :: tl) when e = b -> (C (n + 1, b)) :: tl
 | _-> S b :: acc
 in List.rev (List.fold_left f [] bytes)
 let unrle (data : (α t) list) =
 let rec aux (acc : α list) (b : α) : (int -> α list) = 
 function | 0 -> acc | i -> aux (b :: acc) b (i - 1) in
 let f (acc : α list) (e : α t) : α list =
 acc @ (match e with | S b -> [b]| C (n, b) -> aux [] b n) in
 List.fold_left f [] data
end

Having worked out the details in OCaml, translation into C++ is reasonably straight-forward. One economy granted by this language is that we can do away with the data constructor S in this version.

#include <boost/variant.hpp>
#include <boost/variant/apply_visitor.hpp>
#include <boost/range.hpp>
#include <boost/range/numeric.hpp>
#include <list>
//Representation of the encoding
template <class A> struct C { std::pair <int, A> item; };
template <class A> using datum = boost::variant <A, C<A>>;
template <class A> using encoding = std::list<datum<A>>;
//Procedural function object that updates an encoding given a
//datum
template <class A>
struct update : boost::static_visitor<> {
 A c;
 encoding<A>& l;
 update (A c, encoding<A>& l) : c (c), l (l) 
 {}
 void operator ()(A e) const { 
 if (e == c) {
 l.back () = C<A>{ std::make_pair(2, c) }; 
 return;
 }
 l.push_back (c);
 }
 void operator ()(C<A> const& elem) const { 
 if (elem.item.second == c) {
 l.back () = C<A>{ std::make_pair (elem.item.first + 1, c) };
 return;
 }
 l.push_back (c);
 }
};
template <class R>
encoding<typename boost::range_value<R>::type> rle (R bytes) {
 typedef boost::range_value<R>::type A;
 auto f = [](encoding<A> acc, A b) -> encoding<A> {
 if (acc.size () == 0)
 acc.push_back (b);
 else {
 boost::apply_visitor (update<A>(b, acc), acc.back ());
 }
 return acc;
 };
 return boost::accumulate (bytes, encoding<A> (), f);
}

I've left implementing unrle () as an exercise. Here's a little test though that confirms that we are getting savings in from the compression scheme as we hope for.

int main () {
 std::string buf=
 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
 "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
 "c"
 "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
 "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
 "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
 "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
 "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
 "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
 "eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee"
 "z";
 std::list<char> data(buf.begin (), buf.end());
 encoding<char> compressed = rle (data);
 std::cout << sizeof (char) * (data.size ()) << std::endl;
 std::cout << sizeof (datum <char>) * (compressed.size ()) << std::endl;
 return 0;
}
On my machine, this program prints the values 484ドル$ and 72ドル$.

AltStyle によって変換されたページ (->オリジナル) /