Separation of code into functions and use of patern matching in Haskell

Question 1

In the following:

import Data.Bits (popCount, shiftL, (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)
import qualified Data.ByteString as B
data MByteSequence = MByteSequence [Word8] -- a sequence of map bytes (len 0-15)
data Atom = TerminationAtom |
 NonOffsetAtom { nonOffsetGapLength :: Integer -- the number of gap bytes (max 2**53-1)
 , nonOffsetGapValue :: Bool -- the value of the gap bytes
 , followingMapBytes :: MByteSequence -- a sequence of following map bytes (len 1-15)
 } |
 OffsetAtom { offsetGapLength :: Integer -- the number of gap bytes (max 2**53-1)
 , offsetGapValue :: Bool -- the value of the gap bytes and the majority of the offset byte
 , offsetBit :: Int -- the bit with a different value in the offset byte (0-7)
 }
getAtomType :: Atom -> Int
getAtomType TerminationAtom = 1
getAtomType (NonOffsetAtom len _ _)
 | len <= 3 = fromInteger len
 | len > 3 = 4
getAtomType (OffsetAtom len val _)
 | len <= 3 && val == False = 5
 | len > 3 = 6
 | len <= 3 && val == True = 7
getFillField :: Atom -> Int
getFillField TerminationAtom = 0
getFillField (NonOffsetAtom _ val _) = if val then 1 else 0
getFillField (OffsetAtom len val _) = if len > 3 then (if val then 1 else 0) else fromInteger len
getDataField :: Atom -> Int
getDataField TerminationAtom = 0
getDataField (NonOffsetAtom _ val (MByteSequence mbytes))
 | length mbytes == 1 &&
 head mbytes == (if val then 0 else 0xFF) = 0
 | otherwise = length mbytes
getDataField (OffsetAtom _ _ bit) = bit
serialize :: Atom -> Word8
serialize TerminationAtom = 0
serialize (NonOffsetAtom x y z) = fromIntegral $ (getAtomType all) `shiftL` 6 .|.
 (getFillField all) `shiftL` 4 .|.
 (getDataField all)
 where all = NonOffsetAtom x y z
serialize (OffsetAtom x y z) = fromIntegral $ (getAtomType all) `shiftL` 6 .|.
 (getFillField all) `shiftL` 3 .|.
 (getDataField all)
 where all = OffsetAtom x y z

To me, serialize especially seems pretty ugly; the use of the where clauses and the similarity of the latter two pattern match cases all seems bad to me, yet I'm unsure what should be done instead. The other functions, while split up into logical blocks, seem perhaps too small? Should I use Word8 as the return type of all the functions (the return values are all small enough to fit)?

I'd welcome any feedback on code style, Haskell best practices (in general!), and any design patterns I should be using here!

Question 2

To me, serialize especially seems pretty ugly; the use of the where clauses and the similarity of the latter two pattern match cases all seems bad to me, yet I'm unsure what should be done instead.

You can rewrite the serialize function using as-patterns which will eliminate the where clause like so.

serialize :: Atom -> Word8
serialize TerminationAtom = 0
serialize all@(NonOffsetAtom _ _ _) = fromIntegral $ (getAtomType all) `shiftL` 6 .|.
 (getFillField all) `shiftL` 4 .|.
 (getDataField all)
serialize all@(OffsetAtom _ _ _) = fromIntegral $ (getAtomType all) `shiftL` 6 .|.
 (getFillField all) `shiftL` 3 .|.
 (getDataField all)

You could eliminate the duplication further by extracting the body of the function into another function parameterized over the shift value. Something like this.

serialize all@(NonOffsetAtom _ _ _) = shifty 4 all
serialize all@(OffsetAtom _ _ _) = shifty 3 all
shifty n a = fromIntegral $ (getAtomType a) `shiftL` 6 .|.
 (getFillField a) `shiftL` n .|.
 (getDataField a)

Should I use Word8 as the return type of all the functions (the return values are all small enough to fit)?

Yes. If your functions logically only operate on 8-bit byte values, then that's an important fact you can and should capture in the type signature. If returning a value of 256 or 500000 or -1 would be absurd, then eliminate the possibility of that even happening!

In General

There are other odds and ends that could be changed or cleaned up, like comparing Bool arguments to True or False instead of using them directly (getAtomType) or using ifs instead of more pattern matching (getFillField (NonOffsetAtom _ True _) = ...).

Update

Having slept on it and cleared my head, I think your code would be much improved by leveraging ADTs and a serialization typeclass. As it stands your solution feels very C-like, and your data types are used like structs, arrays and low-level machine ints instead of rich data. I'll sketch an outline of what I would do here, please feel welcome to interrogate me in the comments if the full picture doesn't come through for you.

data Four = Zero | One | Two | Three
data AtomType = ShortBase Four | LongBase
 | ShortOffset Bool | LongOffset
newtype FillType = Fill Four
data DataField = ...
data Word = Terminal
 | Word AtomType FillType DataField
class Serialize a where
 serialize :: a -> Word8
instance Serialize Four where
 serialize Zero = 0
 serialize One = 1
 serialize Two = 2
 serialize Three = 3
instance Serialize AtomType where
 serialize (ShortBase n) = serialize n `shiftL` 6
 serialize LongBase = 4 `shiftL` 6
 serialize (ShortOffset False) = 5 `shiftL` 6
 serialize (ShortOffset True) = 7 `shiftL` 6
 serialize LongOffset = 6 `shiftL` 6
...
instance Serialize Word where
 serialize Terminal = 0
 serialize (Word t f d) = serialize t .|. serialize f .|. serialize d

My goal here was to make everything as correct-by-construction as possible (q.v. the Four type) and to reduce the opportunities for ambiguity (e.g., a "Fill Field" may mean any number of things if all you're looking at is a literal 0, so the representation should be as precise as necessary and no more).

On the benefits of this approach

If you can determine the names based on the problem domain, this will be extremely readable
Requires less dead code (In particular I'm looking at the support for TerminationAtom everywhere)
Easily made symmetrical with a Deserialize class

Intermediary representations of data can always slow down your code, but with aggressive inlining I don't think this would be too much of a performance hit.

Question 3

A couple of quick notes before I head to bed (will give some more useful comment tomorrow, and edit the question a bit!): getFillField is safe because it's in the else clause of if len > 3 (hence len is 0-3). When it comes to type ranges, many of them only logically make sense in a range like 0-15 while Haskell (AFAIK) cannot easily represent (refinement types would be lovely!). As for what the code does, it's the start of me trying to get my head around byte-aligned bitmap compression (Google will help there with detail, for now!).

Question 4

Ah you're right about the length thing! I parsed that backward. I've got some thoughts I'll edit my answer with on that.

bisserlis bisserlis 3,3311 gold badge13 silver badges17 bronze badges · Answer 1 · 2014-07-23 00:24:10Z

To me, serialize especially seems pretty ugly; the use of the where clauses and the similarity of the latter two pattern match cases all seems bad to me, yet I'm unsure what should be done instead.

You can rewrite the serialize function using as-patterns which will eliminate the where clause like so.

serialize :: Atom -> Word8
serialize TerminationAtom = 0
serialize all@(NonOffsetAtom _ _ _) = fromIntegral $ (getAtomType all) `shiftL` 6 .|.
 (getFillField all) `shiftL` 4 .|.
 (getDataField all)
serialize all@(OffsetAtom _ _ _) = fromIntegral $ (getAtomType all) `shiftL` 6 .|.
 (getFillField all) `shiftL` 3 .|.
 (getDataField all)

You could eliminate the duplication further by extracting the body of the function into another function parameterized over the shift value. Something like this.

serialize all@(NonOffsetAtom _ _ _) = shifty 4 all
serialize all@(OffsetAtom _ _ _) = shifty 3 all
shifty n a = fromIntegral $ (getAtomType a) `shiftL` 6 .|.
 (getFillField a) `shiftL` n .|.
 (getDataField a)

Should I use Word8 as the return type of all the functions (the return values are all small enough to fit)?

Yes. If your functions logically only operate on 8-bit byte values, then that's an important fact you can and should capture in the type signature. If returning a value of 256 or 500000 or -1 would be absurd, then eliminate the possibility of that even happening!

In General

There are other odds and ends that could be changed or cleaned up, like comparing Bool arguments to True or False instead of using them directly (getAtomType) or using ifs instead of more pattern matching (getFillField (NonOffsetAtom _ True _) = ...).

Update

Having slept on it and cleared my head, I think your code would be much improved by leveraging ADTs and a serialization typeclass. As it stands your solution feels very C-like, and your data types are used like structs, arrays and low-level machine ints instead of rich data. I'll sketch an outline of what I would do here, please feel welcome to interrogate me in the comments if the full picture doesn't come through for you.

data Four = Zero | One | Two | Three
data AtomType = ShortBase Four | LongBase
 | ShortOffset Bool | LongOffset
newtype FillType = Fill Four
data DataField = ...
data Word = Terminal
 | Word AtomType FillType DataField
class Serialize a where
 serialize :: a -> Word8
instance Serialize Four where
 serialize Zero = 0
 serialize One = 1
 serialize Two = 2
 serialize Three = 3
instance Serialize AtomType where
 serialize (ShortBase n) = serialize n `shiftL` 6
 serialize LongBase = 4 `shiftL` 6
 serialize (ShortOffset False) = 5 `shiftL` 6
 serialize (ShortOffset True) = 7 `shiftL` 6
 serialize LongOffset = 6 `shiftL` 6
...
instance Serialize Word where
 serialize Terminal = 0
 serialize (Word t f d) = serialize t .|. serialize f .|. serialize d

My goal here was to make everything as correct-by-construction as possible (q.v. the Four type) and to reduce the opportunities for ambiguity (e.g., a "Fill Field" may mean any number of things if all you're looking at is a literal 0, so the representation should be as precise as necessary and no more).

On the benefits of this approach

If you can determine the names based on the problem domain, this will be extremely readable
Requires less dead code (In particular I'm looking at the support for TerminationAtom everywhere)
Easily made symmetrical with a Deserialize class

Intermediary representations of data can always slow down your code, but with aggressive inlining I don't think this would be too much of a performance hit.

A couple of quick notes before I head to bed (will give some more useful comment tomorrow, and edit the question a bit!): getFillField is safe because it's in the else clause of if len > 3 (hence len is 0-3). When it comes to type ranges, many of them only logically make sense in a range like 0-15 while Haskell (AFAIK) cannot easily represent (refinement types would be lovely!). As for what the code does, it's the start of me trying to get my head around byte-aligned bitmap compression (Google will help there with detail, for now!).
Ah you're right about the length thing! I parsed that backward. I've got some thoughts I'll edit my answer with on that.

Stack Exchange Network

Separation of code into functions and use of patern matching in Haskell

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Separation of code into functions and use of patern matching in Haskell

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions