Combine imports like
io::Read
andio::Write
withuse std::io::{Read, Write}
. In this case, consider using the IO prelude:use std::io::prelude::*
.You've intermixed test and production code, as well as test and production dependencies. This produces compile-time warnings (
warning: function is never used: `read_file`, `rand_string`, `test`
) and wastes end-user time during compilation and bandwidth during download.I prefer to use a crate like quick-error to succinctly create error macros. I combined the description and display text where appropriate.
"frecuency" is misspelled; should be frequency.
It's idiomatic to use
|&foo| foo
instead of|foo| *foo
when you can; generallyCopy
types allow this.If you are creating and then throwing away the
HashMap
, just consume it during iteration to not have any references.Prefer
match *foo { Foo:Variant => /* ... */ }
overmatch foo { &Foo::Variant => /* ... */ }
. Same forif let
.Pattern match against
Some(true)
directly instead of using a match guard. I'd also explicitly saySome(false)
as there are only the two cases.Using
NativeEndian
is suspicious; that means the saved file will not be easily portable across different endian machines.There's nothing wrong with using
reverse
, but you could just reverse the order of arguments tocmp
.I like to introduce a
key
method thatEq
,Ord
,Hash
, etc. all use. This prevents them from drifting out of sync.I ignored
BitVector
as it has already been asked about in its own question as it has already been asked about in its own question.Decompress is usually written as one word with no separation.
&string.as_bytes()
returns a&&[u8]
, more references than needed.string.as_bytes()
is sufficient.The rand crate should be in
dev-dependencies
and only used within the test module or only included when `#[cfg(test)].Generating a bunch of random strings feels like a quickcheck style test. I changed it, but the tests fail for strings containing
'0円'
'1円'
, maybe more. These are valid characters, but the error is alwaysEmpty
.Avoid writing to files in a test. Files are slow compared to memory, and involving any external resources makes things harder. Since Rust tests run in parallel, as soon as you have a second test, you will have conflicts.
Combine imports like
io::Read
andio::Write
withuse std::io::{Read, Write}
. In this case, consider using the IO prelude:use std::io::prelude::*
.You've intermixed test and production code, as well as test and production dependencies. This produces compile-time warnings (
warning: function is never used: `read_file`, `rand_string`, `test`
) and wastes end-user time during compilation and bandwidth during download.I prefer to use a crate like quick-error to succinctly create error macros. I combined the description and display text where appropriate.
"frecuency" is misspelled; should be frequency.
It's idiomatic to use
|&foo| foo
instead of|foo| *foo
when you can; generallyCopy
types allow this.If you are creating and then throwing away the
HashMap
, just consume it during iteration to not have any references.Prefer
match *foo { Foo:Variant => /* ... */ }
overmatch foo { &Foo::Variant => /* ... */ }
. Same forif let
.Pattern match against
Some(true)
directly instead of using a match guard. I'd also explicitly saySome(false)
as there are only the two cases.Using
NativeEndian
is suspicious; that means the saved file will not be easily portable across different endian machines.There's nothing wrong with using
reverse
, but you could just reverse the order of arguments tocmp
.I like to introduce a
key
method thatEq
,Ord
,Hash
, etc. all use. This prevents them from drifting out of sync.I ignored
BitVector
as it has already been asked about in its own question.Decompress is usually written as one word with no separation.
&string.as_bytes()
returns a&&[u8]
, more references than needed.string.as_bytes()
is sufficient.The rand crate should be in
dev-dependencies
and only used within the test module or only included when `#[cfg(test)].Generating a bunch of random strings feels like a quickcheck style test. I changed it, but the tests fail for strings containing
'0円'
'1円'
, maybe more. These are valid characters, but the error is alwaysEmpty
.Avoid writing to files in a test. Files are slow compared to memory, and involving any external resources makes things harder. Since Rust tests run in parallel, as soon as you have a second test, you will have conflicts.
Combine imports like
io::Read
andio::Write
withuse std::io::{Read, Write}
. In this case, consider using the IO prelude:use std::io::prelude::*
.You've intermixed test and production code, as well as test and production dependencies. This produces compile-time warnings (
warning: function is never used: `read_file`, `rand_string`, `test`
) and wastes end-user time during compilation and bandwidth during download.I prefer to use a crate like quick-error to succinctly create error macros. I combined the description and display text where appropriate.
"frecuency" is misspelled; should be frequency.
It's idiomatic to use
|&foo| foo
instead of|foo| *foo
when you can; generallyCopy
types allow this.If you are creating and then throwing away the
HashMap
, just consume it during iteration to not have any references.Prefer
match *foo { Foo:Variant => /* ... */ }
overmatch foo { &Foo::Variant => /* ... */ }
. Same forif let
.Pattern match against
Some(true)
directly instead of using a match guard. I'd also explicitly saySome(false)
as there are only the two cases.Using
NativeEndian
is suspicious; that means the saved file will not be easily portable across different endian machines.There's nothing wrong with using
reverse
, but you could just reverse the order of arguments tocmp
.I like to introduce a
key
method thatEq
,Ord
,Hash
, etc. all use. This prevents them from drifting out of sync.I ignored
BitVector
as it has already been asked about in its own question.Decompress is usually written as one word with no separation.
&string.as_bytes()
returns a&&[u8]
, more references than needed.string.as_bytes()
is sufficient.The rand crate should be in
dev-dependencies
and only used within the test module or only included when `#[cfg(test)].Generating a bunch of random strings feels like a quickcheck style test. I changed it, but the tests fail for strings containing
'0円'
'1円'
, maybe more. These are valid characters, but the error is alwaysEmpty
.Avoid writing to files in a test. Files are slow compared to memory, and involving any external resources makes things harder. Since Rust tests run in parallel, as soon as you have a second test, you will have conflicts.
Combine imports like
io::Read
andio::Write
withuse std::io::{Read, Write}
. In this case, consider using the IO prelude:use std::io::prelude::*
.You've intermixed test and production code, as well as test and production dependencies. This produces compile-time warnings (
warning: function is never used: `read_file`, `rand_string`, `test`
) and wastes end-user time during compilation and bandwidth during download.I prefer to use a crate like quick-error to succinctly create error macros. I combined the description and display text where appropriate.
"frecuency" is misspelled; should be frequency.
It's idiomatic to use
|&foo| foo
instead of|foo| *foo
when you can; generallyCopy
types allow this.If you are creating and then throwing away the
HashMap
, just consume it during iteration to not have any references.Prefer
match *foo { Foo:Variant => /* ... */ }
overmatch foo { &Foo::Variant => /* ... */ }
. Same forif let
.Pattern match against
Some(true)
directly instead of using a match guard. I'd also explicitly saySome(false)
as there are only the two cases.Using
NativeEndian
is suspicious; that means the saved file will not be easily portable across different endian machines.There's nothing wrong with using
reverse
, but you could just reverse the order of arguments tocmp
.I like to introduce a
key
method thatEq
,Ord
,Hash
, etc. all use. This prevents them from drifting out of sync.I ignored
BitVector
as it has already been asked about in its own question.Decompress is usually written as one word with no separation.
&string.as_bytes()
returns a&&[u8]
, more references than needed.string.as_bytes()
is sufficient.The rand crate should be in
dev-dependencies
and only used within the test module or only included when `#[cfg(test)].Generating a bunch of random strings feels like a quickcheck style test. I changed it, but the tests fail for strings containing
'0円'
'1円'
, maybe more. These are valid characters, but the error is alwaysEmpty
.Avoid writing to files in a test. Files are slow compared to memory, and involving any external resources makes things harder. Since Rust tests run in parallel, as soon as you have a second test, you will have conflicts.
#[macro_use]
extern crate quick_error;
extern crate byteorder;
#[cfg(test)]
#[macro_use]
extern crate quickcheck;
use std::collections::hash_map::HashMap;
use std::collections::BinaryHeap;
use byteorder::{ReadBytesExt, WriteBytesExt, NativeEndian};
use std::{result, io, path, cmp, fs};
type Result<T> = result::Result<T, HuffmanError>;
const BITS: usize = 8;
quick_error! {
#[derive(Debug)]
pub enum HuffmanError {
Io(err: io::Error) {
from()
cause(err)
description(err.description())
display("IO error: {}", err)
}
ParseTree {
description("Parse tree error: invalid encoded huffman tree")
}
AlphabetMismatch {
description("Alphabet mismatch error: alphabet doesn't match parsed tree")
}
Empty {
description("Empty stream")
}
}
}
#[derive(Eq, Debug)]
enum HuffmanTree {
Inner {
frequency: u32,
left: Box<HuffmanTree>,
right: Box<HuffmanTree>,
},
Leaf { frequency: u32, character: char },
}
fn characters_mode(file: &str) -> HashMap<char, u32> {
let mut table = HashMap::new();
for c in file.chars() {
*(table.entry(c).or_insert(0)) += 1;
}
table
}
fn create_priority_queue(input: &str) -> BinaryHeap<HuffmanTree> {
characters_mode(input)
.into_iter()
.map(|(c, f)| {
HuffmanTree::Leaf {
character: c,
frequency: f,
}
})
.collect()
}
impl HuffmanTree {
fn new(input: &str) -> Result<HuffmanTree> {
let mut pq = create_priority_queue(input);
for _ in 1..pq.len() {
let min1 = pq.pop().unwrap();
let min2 = pq.pop().unwrap();
pq.push(min1.join(min2));
}
pq.pop().ok_or(HuffmanError::Empty)
}
fn freq(&self) -> u32 {
use HuffmanTree::*;
match *self {
Inner { frequency, .. } |
Leaf { frequency, .. } => frequency,
}
}
fn join(self, other: HuffmanTree) -> HuffmanTree {
HuffmanTree::Inner {
frequency: self.freq() + other.freq(),
left: Box::new(self),
right: Box::new(other),
}
}
fn create_char_mapper_recur(&self,
bit_vec: &mut BitVector,
map: &mut HashMap<char, BitVector>) {
use HuffmanTree::*;
match *self {
Inner { ref left, ref right, .. } => {
bit_vec.push(true);
left.create_char_mapper_recur(bit_vec, map);
bit_vec.push(false);
right.create_char_mapper_recur(bit_vec, map);
}
Leaf { character, .. } => {
map.insert(character, bit_vec.clone());
}
}
bit_vec.pop();
}
fn create_char_mapper(&self) -> HashMap<char, BitVector> {
let mut bit_vec = BitVector::new();
let mut map = HashMap::new();
if let HuffmanTree::Leaf { .. } = *self {
bit_vec.push(true);
}
self.create_char_mapper_recur(&mut bit_vec, &mut map);
map
}
fn decode<I, T>(encoded_walk: &mut I, mut chars: &mut T) -> Result<HuffmanTree>
where I: Iterator<Item = bool>,
T: Iterator<Item = char>
{
match encoded_walk.next() {
Some(true) => {
let left = Self::decode(encoded_walk, chars)?;
let right = Self::decode(encoded_walk, chars)?;
Ok(left.join(right))
}
Some(false) => {
let c = chars.next()
.ok_or(HuffmanError::AlphabetMismatch)?;
Ok(HuffmanTree::Leaf {
frequency: 0,
character: c,
})
}
None => Err(HuffmanError::ParseTree),
}
}
fn serialize<W: std::io::Write>(&self, writer: &mut W) -> Result<()> {
let (encoded_walk, alphabet) = self.encode();
let walk_bit_len = encoded_walk.len() as u64;
writer.write_u64::<NativeEndian>(walk_bit_len)?;
writer.write(&encoded_walk.bits)?;
let encoded_alphabet = alphabet.as_bytes();
let alphabet_byte_len = encoded_alphabet.len() as u64;
writer.write_u64::<NativeEndian>(alphabet_byte_len)?;
writer.write_all(encoded_alphabet)?;
Ok(())
}
fn de_serialize<R: std::io::Read>(reader: &mut R) -> Result<HuffmanTree> {
use std::io::Read;
let walk_len = reader.read_u64::<NativeEndian>()?;
let mut walk_bytes = Vec::new();
reader.take((walk_len + BITS as u64 - 1) / BITS as u64)
.read_to_end(&mut walk_bytes)?;
let bit_vec = BitVector {
bits: walk_bytes,
size: walk_len as usize,
};
let chars_len = reader.read_u64::<NativeEndian>()?;
let mut chars = String::new();
reader.take(chars_len)
.read_to_string(&mut chars)?;
let bit_iter = &mut bit_vec.iter();
Self::decode(bit_iter, &mut chars.chars())
}
fn encode_recur(&self, bit_vec: &mut BitVector, alphabet: &mut String) {
use HuffmanTree::*;
match *self {
Inner { ref left, ref right, .. } => {
bit_vec.push(true);
left.encode_recur(bit_vec, alphabet);
right.encode_recur(bit_vec, alphabet);
}
Leaf { character, .. } => {
bit_vec.push(false);
alphabet.push(character);
}
}
}
fn encode(&self) -> (BitVector, String) {
let mut bit_vector = BitVector::new();
let mut alphabet = String::new();
self.encode_recur(&mut bit_vector, &mut alphabet);
(bit_vector, alphabet)
}
fn encode_string(&self, file_str: &str) -> BitVector {
let char_map = self.create_char_mapper();
let itr = file_str.chars()
.map(|c| char_map.get(&c).expect("Assertion error"));
let mut bit_vec = BitVector::new();
for code in itr {
bit_vec.append(&code);
}
bit_vec
}
fn decode_string<I>(&self, bit_iter: I) -> String
where I: Iterator<Item = bool>
{
use HuffmanTree::*;
let mut node = self;
let mut output = String::new();
for bit in bit_iter {
if let Inner { ref left, ref right, .. } = *node {
node = if bit { left } else { right }
}
if let Leaf { character, .. } = *node {
output.push(character);
node = self;
}
}
output
}
}
impl Ord for HuffmanTree {
fn cmp(&self, other: &HuffmanTree) -> cmp::Ordering {
self.freq().cmp(&other.freq()).reverse()
}
}
impl PartialOrd for HuffmanTree {
fn partial_cmp(&self, other: &HuffmanTree) -> Option<cmp::Ordering> {
Some(self.cmp(other))
}
}
impl PartialEq for HuffmanTree {
fn eq(&self, other: &HuffmanTree) -> bool {
self.freq() == other.freq()
}
}
// BitVector elided
pub trait HuffmanCompress {
fn compress<T: std::io::Write>(&mut self, &mut T) -> Result<()>;
}
pub trait HuffmanDeCompress {
fn de_compress<T: std::io::Write>(&mut self, writer: &mut T) -> Result<()> {
let string = self.de_compress_to_string()?;
writer.write_all(string.as_bytes())?;
Ok(())
}
fn de_compress_to_string(&mut self) -> Result<String>;
}
pub trait HuffmanCodes: HuffmanDeCompress + HuffmanCompress {}
impl<T: std::io::Read> HuffmanCodes for T {}
impl<T: std::io::Read> HuffmanCompress for T {
fn compress<W: std::io::Write>(&mut self, writer: &mut W) -> Result<()> {
let mut file_str = String::new();
self.read_to_string(&mut file_str)?;
let tree = HuffmanTree::new(&file_str)?;
let encoded_file = tree.encode_string(&file_str);
let junk = BITS - (encoded_file.len() % BITS);
let mask = (((junk != BITS) as i8) << (BITS - 1)) >> (BITS - 1);
tree.serialize(writer)?;
writer.write_i8(junk as i8 & mask)?;
writer.write(&encoded_file.bits)?;
Ok(())
}
}
impl<T: std::io::Read> HuffmanDeCompress for T {
fn de_compress_to_string(&mut self) -> Result<String> {
let tree = HuffmanTree::de_serialize(self)?;
let junk = self.read_i8()?;
let mut bytes = Vec::new();
self.read_to_end(&mut bytes)?;
let bit_vec = BitVector {
size: BITS * bytes.len() - junk as usize,
bits: bytes,
};
let string = tree.decode_string(bit_vec.iter());
Ok(string)
}
}
pub fn encode<P, T>(path: P, target: T) -> Result<()>
where P: AsRef<path::Path>,
T: AsRef<path::Path>
{
let f1 = fs::File::open(path)?;
let mut reader = io::BufReader::new(f1);
let f2 = fs::File::create(target)?;
let mut writer = io::BufWriter::new(f2);
reader.compress(&mut writer)?;
Ok(())
}
pub fn decode<P>(path: P) -> Result<String>
where P: AsRef<path::Path>
{
let f = fs::File::open(path)?;
let mut reader = &mut io::BufReader::new(f);
Ok(reader.de_compress_to_string()?)
}
#[cfg(test)]
mod test {
use quickcheck::TestResult;
use std::{io, path, fs};
use super::*;
use std::io::prelude::*;
fn read_file<P: AsRef<path::Path>>(path: P) -> io::Result<String> {
let f = fs::File::open(path)?;
let mut reader = io::BufReader::new(f);
let mut result = String::new();
reader.read_to_string(&mut result)?;
Ok(result)
}
quickcheck! {
fn prop(text: String) -> super::Result<TestResult> {
// TODO: What is the appropriate invalid string set?
if text.is_empty() {
return Ok(TestResult::discard());
}
const FILE: &'static str = "test";
const CFILE: &'static str = "compress";
let f = fs::File::create(FILE)?;
let mut writer = io::BufWriter::new(f);
writer.write_all(text.as_bytes())?;
encode(FILE, CFILE)?;
let x = decode(CFILE)?;
let y = read_file(FILE)?;
// TODO: Should probably make these into two separate
// tests for better error reporting.
let eq = x == y && x == text;
Ok(TestResult::from_bool(eq))
}
}
}
Bigger questions:
Why use
char
as the smallest unit instead ofu8
?Why force the user to read the whole input all the way? If compressing a 2GiB file, I'd have to allocate 2GiB of memory!
new
andencode
could stream data in and out, allowing the user to choose to allocate in one chunk or not.Dealing with
File
s andPath
s seems too specific and low-level for this library.