I put together a Haskell function for Knuth's word wrap algorithm. A simplified version of the algorithm is described here.
I've already been told that certain aspects of my style are not very agreeable (e.g. the lack of comments and type declarations). However, I'd like some opinions on where I might improve the performance of the code itself without resorting to aggressive, structure-breaking optimization techniques.
import Data.Array
import Data.List
import Data.Ord
knuth m xs = map (concat . intersperse " ") $ split ixs 1 ws
  where
    (_:ixs) = reverse $ snd (c ! lls)
    ws = words xs
    ls = map length ws
    lls = length ls
    split [] _ ws = [ws]
    split (x:xs) n ws = let (a, b) = splitAt (x-n) ws in a : split xs x b
    c = listArray (0, lls) $ map go [0..lls]
      where
        go 0 = (0, [])
        go j = minimumBy (comparing fst) $ map (path j) [1..j]
        path j i = let (x, p) = c ! (i-1) in (lc ! (i, j) + x, i:p)
    lc = listArray ((1, 1), (lls, lls)) $ concat $ go ls
      where
        go = map (costs . extras) . take lls . tails
        extras = lpad lls 0 . map (m -) . zipWith (+) [0..] . scanl1 (+)
        lpad m x = reverse . take m . (++ repeat x) . reverse
        costs [] = []
        costs (e:es)
          | e < 0 = (1/0) : costs es
          | null es = [0]
          | otherwise = exp (fromIntegral e) : costs es
The code can be exercised with examples like the following:
t1 = knuth 6 "aaa bb cc ddddd"
t2 = knuth 37 "You cant trust code that you did not create yourself. (Especially code from companies that employ people like me.) No amount of source-level verification or scrutiny will protect you from untrusted code."
- Is there a link to Knuth's algorithm with actual code or pseudocode? Preferably the one you used. (Or can you quote the code here?) – abuzittin gillifirca, Mar 20, 2015 at 7:17
- Sure. csl.mtu.edu/cs1141/www/assignments/prog2.pdf – Ana, Mar 23, 2015 at 20:44
2 Answers
Lack of comments, type information, and good naming makes this code much harder to read than it should be.
`lls` is not descriptive, and the `s` at the end makes it sound like a list of something, when it's just a number. Since `n` is not used for anything else, I'd go with that.

The names `c` and `lc` are horrible: these arrays are central to the algorithm, and their definitions take up most of the function, but there is nothing in the program to tell us what they are. The linked article says that the `c` array represents cumulative costs, and `lc[i][j]` is the cost of a line containing words i through j, so I would suggest the names `cumulativeCost` and `lineCost`. However, the `c` in your program also incorporates the `p` array from the article, so `cumulativeCost` doesn't quite cover it. `cumulativeCostAndPath` is a bit long, but not too bad.

Since we're only ever interested in either the cost or the path from `cumulativeCostAndPath` (though they're calculated together), I'd suggest `getpath = snd` and `cumCost = fst` as ways of being clearer about what you mean (you could also separate the pieces, but that would be more work).

With these name changes, the first confusing line of the program,

    (_:ixs) = reverse $ snd (c ! lls)

becomes

    (_:ixs) = reverse $ getpath (cumulativeCostAndPath ! n)
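The accessor idea could be sketched like this (a hypothetical fragment using the answer's suggested names, not code from the question):

```haskell
-- Sketch of the suggested accessors: naming the tuple projections
-- lets each call site say which half of the pair it wants.
type CostAndPath = (Double, [Int])

cumCost :: CostAndPath -> Double
cumCost = fst

getpath :: CostAndPath -> [Int]
getpath = snd
```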
    path j i = let (x, p) = c ! (i-1) in (lc ! (i, j) + x, i:p)

is also hard to read, mostly because of naming. It uses 6 names, of which `lc` is the only one with more than one character, and only `i` and `j` have obvious meanings. I'm going to call `x` `costSoFar` and `p` `pathSoFar`. I'm also going to change the `let` into a `where` and break it into 2 lines.

    path j i = (lineCost ! (i, j) + costSoFar, i:pathSoFar)
      where (costSoFar, pathSoFar) = cumulativeCostAndPath ! (i-1)
It is not obvious what `extras` is trying to do, even with the article as a reference. It would be much easier to follow with a comment giving the article's formula for `extras[i][j]`, explaining what `extras` is meant to represent, and mentioning that the `scanl1` call in `extras` and the `tails` call in `go` interact to create the last term in `extras`.

The `costs` function could use a comment explaining what each of the 3 possibilities means. If `extras` were extracted into its own array, it would allow `costs` to be rewritten more cleanly, as well as potentially simplifying the definition of `lc`. Before figuring out the details of what's going on, that deep pipeline of functions was hard to reason about.
You spend a lot of effort making lists which have to match up to arrays. This is harder to follow than defining each array element with a function (which is possible for these arrays), and it sets you up for off-by-one errors if you're not careful.
I can see two possible ways to speed it up. You use `exp` in the definition of `costs`, which is more expensive than the two multiplications required for the cubing suggested in the article. Also, the calculation of `extras` doesn't stop once a line becomes impossible: after finding that words i through j won't fit on a line, none of the longer ranges will fit either, so you could avoid calculating everything after the first negative number. If you have 500 words to wrap and an 80-column limit, figuring out whether you can fit words 1 through 200 on a line is a waste of time.
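Both suggestions could be sketched together. This is a hypothetical variant of the question's `costs` (named `costs'` here to avoid a clash): it cubes the slack instead of applying `exp`, and once the first negative slack appears it pads the rest of the row with infinity rather than computing each entry, since every longer range overflows too.

```haskell
-- Hypothetical variant of the question's costs: cube instead of exp,
-- and once words overflow the line (negative slack e), skip the rest
-- of the row, because every longer range overflows as well.
costs' :: [Int] -> [Double]
costs' [] = []
costs' (e:es)
  | e < 0     = replicate (1 + length es) (1/0)  -- rest of row is infinite
  | null es   = [0]                              -- range ends at the last word
  | otherwise = fromIntegral e ^ 3 : costs' es
```

Padding with `replicate` keeps each row at full length, so the `concat`-into-`listArray` structure of the original would still line up.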
First of all, I agree with @MichaelShaw on most points, so I won't repeat the "use meaningful names and types" part, as he covered it well enough. I will also touch upon the problem I see with nesting; and on readability, I will second his comment

You spend a lot of effort making lists which have to match up to arrays. This is harder to follow than defining each array element with a function

with actual examples, and with special emphasis on using the same language as your requirements (within technical limitations).
Too much nesting
There is too much nesting in the code. Any function doing anything of interest should be top level, so that it can be tested individually, developed in the REPL, etc.; functions defined in `where` clauses cannot be tested in isolation.
If you're doing this because you want to expose only the `knuth` function, put your functions in a module and export only the ones you want to expose, something like: `module Homework2 ( knuth ) where`.
And if you're doing this because you want to close over fixed inputs such as `m`, just don't. If you use consistent and/or meaningful names for common parameters, it won't be a problem.
For example, in the homework you are given:
extras i j = {- expression containing m l in addition to i j -}
can be expressed as
extras :: {- after you develop your function in the REPL,
            copy, fix if too general, and paste the type signature the REPL gives -}
extras m l = extras' where extras' i j = ...
or
(extras m l) i j = ...
Opaqueness
Another problem with the code is its opaqueness. Your code should speak the language of your requirements:
$$extras[i][j] = M - j + i - \sum_{k=i}^{j} l_k$$
Compare:
extra m l i j = m - j + i - sumLk
  where
    sumLk = sum [l!k | k <- [i .. j]]
With:
extras = lpad lls 0 . map (m -) . zipWith (+) [0..] . scanl1 (+)
  where
    lpad m x = reverse . take m . (++ repeat x) . reverse
$$ lc[i][j] = \begin{cases} \infty & \text{if } extras[i][j] < 0 \\ 0 & \text{if } j = n \text{ and } extras[i][j] \geq 0 \\ (extras[i][j])^3 & \text{otherwise} \end{cases} $$
Compare:
cost m n l = lc
  where
    lc i j
      | extra' < 0 = 1/0
      | j == n = 0
      | otherwise = fromIntegral $ extra' ^ 3
      where
        extra' = extra m l i j
With:
costs [] = []
costs (e:es)
  | e < 0 = (1/0) : costs es
  | null es = [0]
  | otherwise = exp (fromIntegral e) : costs es
The problem with the naive implementation `cost` is that it takes \$O(n^3)\$ time to fill its results into an array. But with running totals it can be brought back to \$O(n^2)\$:
lc m n ls = array ((1,1),(n,n)) [ ((i, j), cost' i j)
                                | i <- [1..n], j <- [1..n], i <= j]
  where
    r = runningTotals n ls
    cost' = cost m n r

cost :: Int -> Int -> Array Int Int -> (Int -> Int -> Float)
cost m n r i j
  | extra < 0 = 1/0
  | j == n = 0
  | otherwise = fromIntegral $ extra ^ 3
  where
    extra = m - j + i - sumLk
    sumLk = r!j - r!(i-1)

runningTotals :: (Num a) => Int -> [a] -> Array Int a
runningTotals n = listArray (0,n) . scanl (+) 0
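As a quick sanity check (a hypothetical harness, not part of the answer's code), the prefix-sum trick should agree with the direct sum of word lengths over any range:

```haskell
import Data.Array

-- Running totals r with r!0 = 0, so the sum of l_i..l_j is r!j - r!(i-1).
runningTotals :: Num a => Int -> [a] -> Array Int a
runningTotals n = listArray (0, n) . scanl (+) 0

sumDirect :: Array Int Int -> Int -> Int -> Int
sumDirect l i j = sum [l ! k | k <- [i .. j]]   -- O(j - i) per query

sumFast :: Array Int Int -> Int -> Int -> Int
sumFast r i j = r ! j - r ! (i - 1)             -- O(1) per query
```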
Also note that I added `i <= j` to the definition of `lc`. Even if it doesn't give any performance benefit, I can now be sure I will not be iterating over the cases i > j, for which the cost is trivially infinite, when defining `c`. I am also sure I won't accidentally access `lc!(j,i)` instead of `lc!(i,j)`, or make a similar typo or off-by-one error.
$$ c[j] = \begin{cases} 0 & j = 0 \\ \min_{1 \leq i \leq j} (c[i-1] + lc[i][j]) & j > 0 \end{cases} $$
Naive implementation:
cumulativeCost m n l = c
  where
    c 0 = 0
    c j = minimum [c (i-1) + lc i j | i <- [1 .. j]]
    lc = cost m n l
Memoized implementation:
c m n ls = c'
  where
    c' = array (0,n) [(i, c'' i) | i <- [0..n]]
    c'' 0 = 0
    c'' j = minimum [c'!(i-1) + lc'!(i,j) | i <- [1 .. j]]
    lc' = lc m n ls
The definition of `c` is not compared to yours because the one above doesn't contain path building code.
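The path building could be added back with a sketch like this (`cWithPath` is a hypothetical name, mirroring the question's (cost, path) pairs; it takes the precomputed `lc` array as a parameter):

```haskell
import Data.Array
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Hypothetical sketch: memoized c[j] that also records the chosen break
-- positions, so the optimal path can be read off the array afterwards.
cWithPath :: Int -> Array (Int, Int) Double -> Array Int (Double, [Int])
cWithPath n lc = c'
  where
    c' = array (0, n) [(j, go j) | j <- [0 .. n]]
    go 0 = (0, [])
    go j = minimumBy (comparing fst)
             [ (fst prev + lc ! (i, j), i : snd prev)
             | i <- [1 .. j]
             , let prev = c' ! (i - 1) ]
```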
Some further refactorings: the `c` above depends on the particular `lc`, and the `lc` above depends on the particular `cost` function, only by accident; you could pass `lc` and `cost` in as parameters, respectively. (The same is true for your `knuth` function: you shouldn't need to recompile it in order to use `exp extra` instead of `extra^3` as the cost. You should also be able to run `knuth` with different cost functions and compare the results.)
As for performance, you can do some cut-off in the calculation of `c[j]`. I think this is also the improvement suggested by @MichaelShaw, but his wording

Since after finding that words i through j won't fit, all of the rest of them won't fit either,

suggests (even if it may not be what he means) "if i..j doesn't fit, i..j+1 won't fit either", and exploiting that would require mutable arrays, I guess. Instead you should think "if i..j doesn't fit, i-1..j won't fit either", which means you should iterate over the `i`s in decreasing order.
-- instead of this
c'' j = minimum [c'!(i-1) + lc'!(i,j) | i <- [1 .. j]]
-- this
c'' j = minimum $ takeWhile (< 1/0) [c'!(i-1) + lc'!(i,j) | i <- [j,j-1..1]]
This can be a very significant improvement for large n, if your function will be run against huge inputs (as is the case in programming competitions). However, IRL you word-wrap one paragraph at a time, where n~50. Another concern is that your implementation is now tied to this particular cost function again. Although it is safe to assume costs will never exceed `1/0`, a cost function shouldn't have to use `Infinity` to disallow some other kind of line break. You should put your assumption `lc[i][j] == Infinity => lc[i-1][j] == Infinity` in the documentation for `c` (which is another sign that `c` should take `lc` as a parameter).

Another cut-off strategy that could improve performance is a `takeWhile` on `j - i < maxWordsPerLine`; `maxWordsPerLine = 1 + lineWidth/spaceWidth` would be a safe bet. This strategy decreases the time complexity significantly for n >> maxWordsPerLine, from O(n^2) to O(n*maxWordsPerLine), though that is never the case IRL either. Of course, if your program were run against n >> maxWordsPerLine, an n-by-n monolithic array would not be the best data structure anyway.
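That word-count cut-off could be sketched as follows (`candidateIs` is a made-up name for illustration, enumerating the break points considered for a fixed `j`):

```haskell
-- Hypothetical sketch: when minimizing over i for a fixed j, only consider
-- break points within maxWordsPerLine words of j; nothing longer can fit.
candidateIs :: Int -> Int -> [Int]
candidateIs maxWordsPerLine j =
  takeWhile (\i -> j - i < maxWordsPerLine) [j, j - 1 .. 1]
```

Because the list is descending from `j`, this composes naturally with the `takeWhile (< 1/0)` cut-off above it.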
- +1. What I meant by that wording was "if i..j doesn't fit then i..j+1 won't either", but your way of looking at it is also true. Both are adding words to a line that's already too long. I was thinking in terms of generating `extras` a line at a time (with a fixed `i` for a given line). Your point that it would also save time in generating `c` is a good one, and for that, since we're interested in `c[j]`, fixing `j` makes sense. – Michael Shaw, Mar 26, 2015 at 18:41