I wrote a bit of code that takes a file, justifies the text and writes to another file.
It uses a DP approach to minimise a badness metric, which is the amount of undershoot of the line length, cubed (see my answer below, the badness function isn't quite right if there is an overshoot, the badness should be inf
- I believe this is how LaTeX justifies its text).
I encapsulated it in a class
(which is the bit I'm least confident about). Be careful running it on big files, because I think it's got a pretty nasty complexity (\$O(n^2)\$ maybe).
class justifyText:
def __init__(self, inFileName, outFileName, pageWidth):
with open(inFileName, "r") as fp:
self.words = [word for line in fp for word in line.split()]
self.outFileName = outFileName
self.pageWidth = pageWidth
self.memo = {}
self.breakPointTrack = {}
self.n = len(self.words)
def badness(self, i, j):
totalWidth = 0;
for word in self.words[i:j]:
totalWidth += len(word)
return abs((self.pageWidth - totalWidth)**3)
def minBadness(self, i):
if i in self.memo:
return self.memo[i]
if i == self.n:
f = 0
j_min = self.n
else:
f = None
for j in range(i + 1, self.n + 1):
temp = self.minBadness(j) + self.badness(i, j)
if (f is None) or (temp < f):
f = temp
j_min = j
self.memo[i] = f
self.breakPointTrack[i] = j_min
return f
def justify(self):
self.minBadness(0)
fOut = open(self.outFileName, "w")
brk = 0
while brk < self.n:
start = brk
brk = self.breakPointTrack[brk]
line = " ".join(self.words[start:brk]) + "\n"
fOut.write(line)
fOut.close()
test = justifyText("test.txt", "out.txt", 80)
test.justify()
Is this a standard way to implement a DP program? I know that global variables are evil, so I was mainly trying to avoid my memo etc. being that.
2 Answers 2
You should have a look at Python's official style-guide, PEP8. One of its recommendations is to use PascalCase
for class names and lower_case
for function and variable names.
Since you read all words into memory anyway, you can just write:
with open(in_file_name) as fp:
self.words = fp.read().split()
This is because split
splits by default on all whitespace (so both space and newlines). Also note that the default behaviour of open
is to open a file in read mode, so "r"
is also not needed here.
Caching (or memoization) of a functions return value is best done with a decorator. This way you avoid interleaving the actual implementation of the function with the caching. One example could be:
import functools
def memoize_method(func):
cache = func.cache = {}
@functools.wraps(func)
def wrapper(self, n):
if n not in cache:
cache[n] = func(self, n)
return cache[n]
return wrapper
class JustifyText:
...
@memoize_method
def min_badness(self, i):
if i == self.n:
f, j_min = 0, self.n
else:
f, j_min = min((self.min_badness(j) + self.badness(i, j), j)
for j in range(i + 1, self.n + 1))
self.break_point_track[i] = j_min
return f
Your badness
function can be simplified by using sum
, similar to how I used min
above:
def badness(self, start, end):
total_width = sum(len(word) for word in self.words[start:end])
return abs((self.page_width - total_width)**3)
While it might be your most used use case, writing the output to a file prevents anyone from using this module in any other way. It would be better if justify
just returned (even better and simpler: yielded) the justified text and writing to a file is left to the user or a dedicated method using justify
internally.
def justify(self):
self.min_badness(0) # no-op or to populate the cache?
brk = 0
while brk < self.n:
start, brk = brk, self.break_point_track[brk]
yield " ".join(self.words[start:brk]) + "\n"
def write(self):
with open(self.out_file_name, "w") as f_out:
f_out.writelines(self.justify())
I'm not exactly sure why you have a lone self.min_badness(0)
in justify
. is it just to populate the cache with the value for zero? If so, this is worth to put a comment here, because otherwise you might accidentally delete it.
-
\$\begingroup\$ Great review. Thanks a lot. The
self.min_badness(0)
starts the recursiveself.min_badness()
at the start of the file, it should then take care of the rest of the file; stopping at the base case which is the end of the file in this case. Is there anywhere that explains the@memoize_method
syntax that you use? I've seen this around, but I don't know what it's called or how it works. \$\endgroup\$Aidenhjj– Aidenhjj2016年11月23日 08:15:11 +00:00Commented Nov 23, 2016 at 8:15 -
1\$\begingroup\$ @Aidenhjj Yes that is a decorator. What it does is call
min_badness = memoize_method(min_badness)
after the definition of the function. \$\endgroup\$Graipher– Graipher2016年11月23日 08:26:55 +00:00Commented Nov 23, 2016 at 8:26 -
1\$\begingroup\$ The advantage of
[word for line in fp for word in line.split()]
overfp.read().split()
is that it doesn't use twice the amount of memory: once for reading the file, once for storing the list of words... \$\endgroup\$301_Moved_Permanently– 301_Moved_Permanently2016年11月23日 08:34:36 +00:00Commented Nov 23, 2016 at 8:34 -
1\$\begingroup\$ @MathiasEttinger I don't actually think this is the case. Using
memory_profiler
on the code reports that both use roughly the same amount. ~6.5MiB using comp, ~6.7MiB using.read
, for a 1MiB file. I also checked splitting.read.split
to use an intermarry variable and it used ~7.4MiB. Honestly I think the, large, speed increase is worth the little extra memory. \$\endgroup\$2016年11月23日 12:07:27 +00:00Commented Nov 23, 2016 at 12:07 -
2\$\begingroup\$ @Peilonrayz Interesting tool, this
memory_profiler
. Usingmprof run
with a very small period revealed that thesplit
used way more memory internally than theread
or the list-comp. So using the fastest one (read
) seems the best call. TIL. \$\endgroup\$301_Moved_Permanently– 301_Moved_Permanently2016年11月23日 15:19:32 +00:00Commented Nov 23, 2016 at 15:19
I realised that this code wasn't dealing correctly with the overshoots. I should have coded badness()
as something like:
def badness(self, i, j):
total_width = -1;
for word in self.words[i:j]:
total_width += len(word) + 1
if total_width > self.page_width:
return float("inf")
return abs((self.page_width - total_width)**3)
-
2\$\begingroup\$ +500 bounty on this answer has been awarded as a prize for Best of Code Review 2016 — Best Newcomer (question), since bounties cannot be awarded on questions. \$\endgroup\$200_success– 200_success2017年01月18日 19:08:06 +00:00Commented Jan 18, 2017 at 19:08
Explore related questions
See similar questions with these tags.
abs()
call in yourbadness()
function that is mentioned in the answer. If you have found a bug, you can write an answer to your own question explaining what the bug is. \$\endgroup\$