In the following:
import Data.Bits (popCount, shiftL, (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)
import qualified Data.ByteString as B
data MByteSequence = MByteSequence [Word8] -- a sequence of map bytes (len 0-15)
data Atom = TerminationAtom |
NonOffsetAtom { nonOffsetGapLength :: Integer -- the number of gap bytes (max 2**53-1)
, nonOffsetGapValue :: Bool -- the value of the gap bytes
, followingMapBytes :: MByteSequence -- a sequence of following map bytes (len 1-15)
} |
OffsetAtom { offsetGapLength :: Integer -- the number of gap bytes (max 2**53-1)
, offsetGapValue :: Bool -- the value of the gap bytes and the majority of the offset byte
, offsetBit :: Int -- the bit with a different value in the offset byte (0-7)
}
getAtomType :: Atom -> Int
getAtomType TerminationAtom = 1
getAtomType (NonOffsetAtom len _ _)
| len <= 3 = fromInteger len
| len > 3 = 4
getAtomType (OffsetAtom len val _)
| len <= 3 && val == False = 5
| len > 3 = 6
| len <= 3 && val == True = 7
getFillField :: Atom -> Int
getFillField TerminationAtom = 0
getFillField (NonOffsetAtom _ val _) = if val then 1 else 0
getFillField (OffsetAtom len val _) = if len > 3 then (if val then 1 else 0) else fromInteger len
getDataField :: Atom -> Int
getDataField TerminationAtom = 0
getDataField (NonOffsetAtom _ val (MByteSequence mbytes))
| length mbytes == 1 &&
head mbytes == (if val then 0 else 0xFF) = 0
| otherwise = length mbytes
getDataField (OffsetAtom _ _ bit) = bit
serialize :: Atom -> Word8
serialize TerminationAtom = 0
serialize (NonOffsetAtom x y z) = fromIntegral $ (getAtomType all) `shiftL` 6 .|.
(getFillField all) `shiftL` 4 .|.
(getDataField all)
where all = NonOffsetAtom x y z
serialize (OffsetAtom x y z) = fromIntegral $ (getAtomType all) `shiftL` 6 .|.
(getFillField all) `shiftL` 3 .|.
(getDataField all)
where all = OffsetAtom x y z
To me, serialize
especially seems pretty ugly; the use of the where
clauses and the similarity of the latter two pattern match cases all seems bad to me, yet I'm unsure what should be done instead. The other functions, while split up into logical blocks, seem perhaps too small? Should I use Word8
as the return type of all the functions (the return values are all small enough to fit)?
I'd welcome any feedback on code style, Haskell best practices (in general!), and any design patterns I should be using here!
1 Answer 1
To me, serialize especially seems pretty ugly; the use of the where clauses and the similarity of the latter two pattern match cases all seems bad to me, yet I'm unsure what should be done instead.
You can rewrite the serialize
function using as-patterns which will eliminate the where
clause like so.
serialize :: Atom -> Word8
serialize TerminationAtom = 0
serialize all@(NonOffsetAtom _ _ _) = fromIntegral $ (getAtomType all) `shiftL` 6 .|.
(getFillField all) `shiftL` 4 .|.
(getDataField all)
serialize all@(OffsetAtom _ _ _) = fromIntegral $ (getAtomType all) `shiftL` 6 .|.
(getFillField all) `shiftL` 3 .|.
(getDataField all)
You could eliminate the duplication further by extracting the body of the function into another function parameterized over the shift value. Something like this.
serialize all@(NonOffsetAtom _ _ _) = shifty 4 all
serialize all@(OffsetAtom _ _ _) = shifty 3 all
shifty n a = fromIntegral $ (getAtomType a) `shiftL` 6 .|.
(getFillField a) `shiftL` n .|.
(getDataField a)
Should I use Word8 as the return type of all the functions (the return values are all small enough to fit)?
Yes. If your functions logically only operate on 8-bit byte values, then that's an important fact you can and should capture in the type signature. If returning a value of 256
or 500000
or -1
would be absurd, then eliminate the possibility of that even happening!
In General
There are other odds and ends that could be changed or cleaned up, like comparing Bool
arguments to True
or False
instead of using them directly (getAtomType
) or using if
s instead of more pattern matching (getFillField (NonOffsetAtom _ True _) = ...
).
Update
Having slept on it and cleared my head, I think your code would be much improved by leveraging ADTs and a serialization typeclass. As it stands your solution feels very C-like, and your data types are used like structs, arrays and low-level machine ints instead of rich data. I'll sketch an outline of what I would do here, please feel welcome to interrogate me in the comments if the full picture doesn't come through for you.
data Four = Zero | One | Two | Three
data AtomType = ShortBase Four | LongBase
| ShortOffset Bool | LongOffset
newtype FillType = Fill Four
data DataField = ...
data Word = Terminal
| Word AtomType FillType DataField
class Serialize a where
serialize :: a -> Word8
instance Serialize Four where
serialize Zero = 0
serialize One = 1
serialize Two = 2
serialize Three = 3
instance Serialize AtomType where
serialize (ShortBase n) = serialize n `shiftL` 6
serialize LongBase = 4 `shiftL` 6
serialize (ShortOffset False) = 5 `shiftL` 6
serialize (ShortOffset True) = 7 `shiftL` 6
serialize LongOffset = 6 `shiftL` 6
...
instance Serialize Word where
serialize Terminal = 0
serialize (Word t f d) = serialize t .|. serialize f .|. serialize d
My goal here was to make everything as correct-by-construction as possible (q.v. the Four
type) and to reduce the opportunities for ambiguity (e.g., a "Fill Field" may mean any number of things if all you're looking at is a literal 0
, so the representation should be as precise as necessary and no more).
On the benefits of this approach
- If you can determine the names based on the problem domain, this will be extremely readable
- Requires less dead code (In particular I'm looking at the support for
TerminationAtom
everywhere) - Easily made symmetrical with a
Deserialize
class
Intermediary representations of data can always slow down your code, but with aggressive inlining I don't think this would be too much of a performance hit.
-
\$\begingroup\$ A couple of quick notes before I head to bed (will give some more useful comment tomorrow, and edit the question a bit!):
getFillField
is safe because it's in the else clause ofif len > 3
(hence len is 0-3). When it comes to type ranges, many of them only logically make sense in a range like 0-15 while Haskell (AFAIK) cannot easily represent (refinement types would be lovely!). As for what the code does, it's the start of me trying to get my head around byte-aligned bitmap compression (Google will help there with detail, for now!). \$\endgroup\$gsnedders– gsnedders2014年07月23日 00:42:54 +00:00Commented Jul 23, 2014 at 0:42 -
\$\begingroup\$ Ah you're right about the length thing! I parsed that backward. I've got some thoughts I'll edit my answer with on that. \$\endgroup\$bisserlis– bisserlis2014年07月23日 03:04:36 +00:00Commented Jul 23, 2014 at 3:04