I put together a Haskell function for Knuth's word wrap algorithm. A simplified version of the algorithm is described here.
I've already been told that certain aspects of my style are not very agreeable (e.g. the lack of comments and type declarations). However, I'd like some opinions on where I might improve the performance of the code itself without resorting to aggressive, structure-breaking optimization techniques.
import Data.Array
import Data.List
import Data.Ord
knuth m xs = map (concat . intersperse " ") $ split ixs 1 ws
  where
    (_:ixs) = reverse $ snd (c ! lls)
    ws = words xs
    ls = map length ws
    lls = length ls
    split [] _ ws = [ws]
    split (x:xs) n ws = let (a, b) = splitAt (x-n) ws in a : split xs x b
    c = listArray (0, lls) $ map go [0..lls]
      where
        go 0 = (0, [])
        go j = minimumBy (comparing fst) $ map (path j) [1..j]
        path j i = let (x, p) = c ! (i-1) in (lc ! (i, j) + x, i:p)
    lc = listArray ((1, 1), (lls, lls)) $ concat $ go ls
      where
        go = map (costs . extras) . take lls . tails
        extras = lpad lls 0 . map (m -) . zipWith (+) [0..] . scanl1 (+)
        lpad m x = reverse . take m . (++ repeat x) . reverse
        costs [] = []
        costs (e:es)
          | e < 0 = (1/0) : costs es
          | null es = [0]
          | otherwise = exp (fromIntegral e) : costs es
The code can be exercised with examples like the following:
t1 = knuth 6 "aaa bb cc ddddd"
t2 = knuth 37 "You cant trust code that you did not create yourself. (Especially code from companies that employ people like me.) No amount of source-level verification or scrutiny will protect you from untrusted code."
- Is there a link to Knuth's algorithm with actual code or pseudocode? Preferably the one you used. (Or can you quote the code here?) – abuzittin gillifirca, Mar 20, 2015 at 7:17
- Sure. csl.mtu.edu/cs1141/www/assignments/prog2.pdf – Ana, Mar 23, 2015 at 20:44
2 Answers
Lack of comments, type information, and good naming makes this code much harder to read than it should be.
`lls` is not descriptive, and the `s` at the end makes it sound like a list of something, when it's just a number. Since `n` is not used for anything else, I'd go with that.

The names `c` and `lc` are horrible: these arrays are central to the algorithm, and their definitions take up most of the function, but there is nothing in the program to tell us what they are. The linked article says that the `c` array represents cumulative costs, and `lc[i][j]` is the cost of a line containing words i through j, so I would suggest the names `cumulativeCost` and `lineCost`. However, the `c` in your program also incorporates the `p` array from the article, so `cumulativeCost` doesn't quite cover it. `cumulativeCostAndPath` is a bit long, but not too bad.

Since we're only ever interested in either the cost or the path from `cumulativeCostAndPath` (though they're calculated together), I'd suggest `getpath = snd` and `cumCost = fst` as ways of being clearer about what you mean (you could also separate the pieces, but that would be more work).

With these name changes, the first confusing line of the program,

    (_:ixs) = reverse $ snd (c ! lls)

becomes

    (_:ixs) = reverse $ getpath (cumulativeCostAndPath ! n)
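The accessor idea could be sketched like this (a hypothetical fragment using the answer's suggested names, not code from the question):

```haskell
-- Sketch of the suggested accessors: naming the tuple projections
-- lets each call site say which half of the pair it wants.
type CostAndPath = (Double, [Int])

cumCost :: CostAndPath -> Double
cumCost = fst

getpath :: CostAndPath -> [Int]
getpath = snd
```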
    path j i = let (x, p) = c ! (i-1) in (lc ! (i, j) + x, i:p)

is also hard to read, mostly because of naming. It uses 6 names, of which `lc` is the only one with more than one character, and only `i` and `j` have obvious meanings. I'm going to call `x` `costSoFar` and `p` `pathSoFar`. I'm also going to change the `let` into a `where` and break it into 2 lines.

    path j i = (lineCost ! (i, j) + costSoFar, i:pathSoFar)
      where (costSoFar, pathSoFar) = cumulativeCostAndPath ! (i-1)
It is not obvious what `extras` is trying to do, even with the article as a reference. It would be much easier to follow with a comment giving the article's formula for `extras[i][j]`, explaining what `extras` is meant to represent, and mentioning that the `scanl1` call in `extras` and the `tails` call in `go` interact to create the last term in `extras`.

The `costs` function could use a comment explaining what each of the 3 possibilities means. If `extras` were extracted into its own array, it would allow `costs` to be rewritten more cleanly, as well as potentially simplifying the definition of `lc`. Before figuring out the details of what's going on, that deep pipeline of functions was hard to reason about.
You spend a lot of effort making lists which have to match up to arrays. This is harder to follow than defining each array element with a function (which is possible for these arrays), and it sets you up for off-by-one errors if you're not careful.
I can see two possible ways to speed it up. You use `exp` in the definition of `costs`, which is more expensive than the two multiplications required for the cubing suggested in the article. Also, the calculation of `extras` doesn't stop once a line becomes impossible: after finding that words i through j won't fit on a line, none of the longer ranges will fit either, so you could avoid calculating everything after the first negative number. If you have 500 words to wrap and an 80-column limit, figuring out whether you can fit words 1 through 200 on a line is a waste of time.
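Both suggestions could be sketched together. This is a hypothetical variant of the question's `costs` (named `costs'` here to avoid a clash): it cubes the slack instead of applying `exp`, and once the first negative slack appears it pads the rest of the row with infinity rather than computing each entry, since every longer range overflows too.

```haskell
-- Hypothetical variant of the question's costs: cube instead of exp,
-- and once words overflow the line (negative slack e), skip the rest
-- of the row, because every longer range overflows as well.
costs' :: [Int] -> [Double]
costs' [] = []
costs' (e:es)
  | e < 0     = replicate (1 + length es) (1/0)  -- rest of row is infinite
  | null es   = [0]                              -- range ends at the last word
  | otherwise = fromIntegral e ^ 3 : costs' es
```

Padding with `replicate` keeps each row at full length, so the `concat`-into-`listArray` structure of the original would still line up.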
First of all, I agree with @MichaelShaw on most points, so I won't repeat the "use meaningful names and types" part, as he covered it well enough. I will also touch upon the problem I see with nesting; and on readability, I will second his comment

You spend a lot of effort making lists which have to match up to arrays. This is harder to follow than defining each array element with a function

with actual examples, and with special emphasis on using the same language as your requirements (within technical limitations).
Too much nesting
There is too much nesting in the code. Any function doing anything of interest should be top level, so that it can be tested individually, developed in the REPL, etc.; functions defined in `where` clauses cannot be tested in isolation.
If you're doing this because you want to expose only the `knuth` function, put your functions in a module and export only the ones you want to expose, something like: `module Homework2 ( knuth ) where`.
And if you're doing this because you want to close over fixed inputs such as `m`, just don't. If you use consistent and/or meaningful names for common parameters, it won't be a problem.
For example, in the homework you are given:
extras i j = {- expression containing m l in addition to i j -}
can be expressed as
extras :: {- after you develop your function in the REPL,
            copy, fix if too general, and paste the type signature the REPL gives -}
extras m l = extras' where extras' i j = ...
or
(extras m l) i j = ...
Opaqueness
Another problem with the code is its opaqueness. Your code should speak the language of your requirements:
$$extras[i][j] = M - j + i - \sum_{k=i}^{j} l_k$$
Compare:
extra m l i j = m - j + i - sumLk
  where
    sumLk = sum [l!k | k <- [i .. j]]
With:
extras = lpad lls 0 . map (m -) . zipWith (+) [0..] . scanl1 (+)
  where
    lpad m x = reverse . take m . (++ repeat x) . reverse
$$ lc[i][j] = \begin{cases} \infty & \text{if } extras[i][j] < 0 \\ 0 & \text{if } j = n \text{ and } extras[i][j] \geq 0 \\ (extras[i][j])^3 & \text{otherwise} \end{cases} $$
Compare:
cost m n l = lc
  where
    lc i j
      | extra' < 0 = 1/0
      | j == n = 0
      | otherwise = fromIntegral $ extra' ^ 3
      where
        extra' = extra m l i j
With:
costs [] = []
costs (e:es)
  | e < 0 = (1/0) : costs es
  | null es = [0]
  | otherwise = exp (fromIntegral e) : costs es
The problem with the naive implementation `cost` is that it takes \$O(n^3)\$ time to fill its results into an array. But with running totals it can be brought back to \$O(n^2)\$:
lc m n ls = array ((1,1),(n,n)) [ ((i, j), cost' i j)
                                | i <- [1..n], j <- [1..n], i <= j]
  where
    r = runningTotals n ls
    cost' = cost m n r

cost :: Int -> Int -> Array Int Int -> (Int -> Int -> Float)
cost m n r i j
  | extra < 0 = 1/0
  | j == n = 0
  | otherwise = fromIntegral $ extra ^ 3
  where
    extra = m - j + i - sumLk
    sumLk = r!j - r!(i-1)

runningTotals :: (Num a) => Int -> [a] -> Array Int a
runningTotals n = listArray (0,n) . scanl (+) 0
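As a quick sanity check (a hypothetical harness, not part of the answer's code), the prefix-sum trick should agree with the direct sum of word lengths over any range:

```haskell
import Data.Array

-- Running totals r with r!0 = 0, so the sum of l_i..l_j is r!j - r!(i-1).
runningTotals :: Num a => Int -> [a] -> Array Int a
runningTotals n = listArray (0, n) . scanl (+) 0

sumDirect :: Array Int Int -> Int -> Int -> Int
sumDirect l i j = sum [l ! k | k <- [i .. j]]   -- O(j - i) per query

sumFast :: Array Int Int -> Int -> Int -> Int
sumFast r i j = r ! j - r ! (i - 1)             -- O(1) per query
```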
Also note that I added `i <= j` to the definition of `lc`. Even if it doesn't give any performance benefit, I can now be sure I will not be iterating over the cases i > j, for which the cost is trivially infinite, when defining `c`. I am also sure I won't accidentally access `lc!(j,i)` instead of `lc!(i,j)`, or make a similar typo or off-by-one error.
$$ c[j] = \begin{cases} 0 & j = 0 \\ \min_{1 \leq i \leq j} (c[i-1] + lc[i][j]) & j > 0 \end{cases} $$
Naive implementation:
cumulativeCost m n l = c
  where
    c 0 = 0
    c j = minimum [c (i-1) + lc i j | i <- [1 .. j]]
    lc = cost m n l
Memoized implementation:
c m n ls = c'
  where
    c' = array (0,n) [(i, c'' i) | i <- [0..n]]
    c'' 0 = 0
    c'' j = minimum [c'!(i-1) + lc'!(i,j) | i <- [1 .. j]]
    lc' = lc m n ls
The definition of `c` is not compared to yours because the one above doesn't contain path building code.
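The path building could be added back with a sketch like this (`cWithPath` is a hypothetical name, mirroring the question's (cost, path) pairs; it takes the precomputed `lc` array as a parameter):

```haskell
import Data.Array
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Hypothetical sketch: memoized c[j] that also records the chosen break
-- positions, so the optimal path can be read off the array afterwards.
cWithPath :: Int -> Array (Int, Int) Double -> Array Int (Double, [Int])
cWithPath n lc = c'
  where
    c' = array (0, n) [(j, go j) | j <- [0 .. n]]
    go 0 = (0, [])
    go j = minimumBy (comparing fst)
             [ (fst prev + lc ! (i, j), i : snd prev)
             | i <- [1 .. j]
             , let prev = c' ! (i - 1) ]
```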
Some further refactorings: the `c` above depends on the particular `lc`, and the `lc` above depends on the particular `cost` function, only by accident; you could pass `lc` and `cost` in as parameters, respectively. (The same is true for your `knuth` function: you shouldn't need to recompile it in order to use `exp extra` instead of `extra^3` as the cost. You should also be able to run `knuth` with different cost functions and compare the results.)
As for performance, you can do some cut-off in the calculation of `c[j]`. I think this is also the improvement suggested by @MichaelShaw, but his wording

Since after finding that words i through j won't fit, all of the rest of them won't fit either,

suggests (even if it may not be what he means) "if i..j doesn't fit, i..j+1 won't fit either", and exploiting that would require mutable arrays, I guess. Instead you should think "if i..j doesn't fit, i-1..j won't fit either", which means you should iterate over the `i`s in decreasing order.
-- instead of this
c'' j = minimum [c'!(i-1) + lc'!(i,j) | i <- [1 .. j]]
-- this
c'' j = minimum $ takeWhile (< 1/0) [c'!(i-1) + lc'!(i,j) | i <- [j,j-1..1]]
This can be a very significant improvement for large n, if your function will be run against huge inputs (as is the case in programming competitions). However, IRL you word-wrap one paragraph at a time, where n~50. Another concern is that your implementation is now tied to this particular cost function again. Although it is safe to assume costs will never exceed `1/0`, a cost function shouldn't have to use `Infinity` to disallow some other kind of line break. You should put your assumption `lc[i][j] == Infinity => lc[i-1][j] == Infinity` in the documentation for `c` (which is another sign that `c` should take `lc` as a parameter).

Another cut-off strategy that could improve performance is a `takeWhile` on `j - i < maxWordsPerLine`; `maxWordsPerLine = 1 + lineWidth/spaceWidth` would be a safe bet. This strategy decreases the time complexity significantly for n >> maxWordsPerLine, from O(n^2) to O(n*maxWordsPerLine), though that is never the case IRL either. Of course, if your program were run against n >> maxWordsPerLine, an n-by-n monolithic array would not be the best data structure anyway.
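That word-count cut-off could be sketched as follows (`candidateIs` is a made-up name for illustration, enumerating the break points considered for a fixed `j`):

```haskell
-- Hypothetical sketch: when minimizing over i for a fixed j, only consider
-- break points within maxWordsPerLine words of j; nothing longer can fit.
candidateIs :: Int -> Int -> [Int]
candidateIs maxWordsPerLine j =
  takeWhile (\i -> j - i < maxWordsPerLine) [j, j - 1 .. 1]
```

Because the list is descending from `j`, this composes naturally with the `takeWhile (< 1/0)` cut-off above it.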
- +1. What I meant by that wording was "if i..j doesn't fit then i..j+1 won't either", but your way of looking at it is also true. Both are adding words to a line that's already too long. I was thinking in terms of generating `extras` a line at a time (with a fixed `i` for a given line). Your point that it would also save time in generating `c` is a good one, and for that, since we're interested in `c[j]`, fixing `j` makes sense. – Michael Shaw, Mar 26, 2015 at 18:41