Justify text from a file using a LaTeX method

Question 1

I wrote a bit of code that takes a file, justifies the text and writes to another file.

It uses a DP approach to minimise a badness metric, which is the amount of undershoot of the line length, cubed (see my answer below, the badness function isn't quite right if there is an overshoot, the badness should be inf - I believe this is how LaTeX justifies its text).

I encapsulated it in a class (which is the bit I'm least confident about). Be careful running it on big files, because I think it's got a pretty nasty complexity (\$O(n^2)\$ maybe).

class justifyText:
 def __init__(self, inFileName, outFileName, pageWidth):
 with open(inFileName, "r") as fp:
 self.words = [word for line in fp for word in line.split()]
 self.outFileName = outFileName
 self.pageWidth = pageWidth
 self.memo = {}
 self.breakPointTrack = {}
 self.n = len(self.words)
 def badness(self, i, j):
 totalWidth = 0;
 for word in self.words[i:j]:
 totalWidth += len(word)
 return abs((self.pageWidth - totalWidth)**3)
 def minBadness(self, i):
 if i in self.memo:
 return self.memo[i]
 if i == self.n:
 f = 0
 j_min = self.n
 else:
 f = None
 for j in range(i + 1, self.n + 1):
 temp = self.minBadness(j) + self.badness(i, j)
 if (f is None) or (temp < f):
 f = temp
 j_min = j
 self.memo[i] = f
 self.breakPointTrack[i] = j_min
 return f
 def justify(self):
 self.minBadness(0)
 fOut = open(self.outFileName, "w")
 brk = 0
 while brk < self.n:
 start = brk
 brk = self.breakPointTrack[brk]
 line = " ".join(self.words[start:brk]) + "\n"
 fOut.write(line)
 fOut.close()
test = justifyText("test.txt", "out.txt", 80)
test.justify()

Is this a standard way to implement a DP program? I know that global variables are evil, so I was mainly trying to avoid my memo etc. being that.

Question 2

I have rolled back Rev 4 → 3. Please see What to do when someone answers .

Question 3

You dropped an abs() call in your badness() function that is mentioned in the answer. If you have found a bug, you can write an answer to your own question explaining what the bug is.

Question 4

You should have a look at Python's official style-guide, PEP8. One of its recommendations is to use PascalCase for class names and lower_case for function and variable names.

Since you read all words into memory anyway, you can just write:

with open(in_file_name) as fp:
 self.words = fp.read().split()

This is because split splits by default on all whitespace (so both space and newlines). Also note that the default behaviour of open is to open a file in read mode, so "r" is also not needed here.

Caching (or memoization) of a functions return value is best done with a decorator. This way you avoid interleaving the actual implementation of the function with the caching. One example could be:

import functools
def memoize_method(func):
 cache = func.cache = {}
 @functools.wraps(func)
 def wrapper(self, n):
 if n not in cache:
 cache[n] = func(self, n)
 return cache[n]
 return wrapper
class JustifyText:
 ...
 @memoize_method
 def min_badness(self, i):
 if i == self.n:
 f, j_min = 0, self.n
 else:
 f, j_min = min((self.min_badness(j) + self.badness(i, j), j) 
 for j in range(i + 1, self.n + 1))
 self.break_point_track[i] = j_min
 return f

Your badness function can be simplified by using sum, similar to how I used min above:

def badness(self, start, end):
 total_width = sum(len(word) for word in self.words[start:end])
 return abs((self.page_width - total_width)**3)

While it might be your most used use case, writing the output to a file prevents anyone from using this module in any other way. It would be better if justify just returned (even better and simpler: yielded) the justified text and writing to a file is left to the user or a dedicated method using justify internally.

def justify(self):
 self.min_badness(0) # no-op or to populate the cache?
 brk = 0
 while brk < self.n:
 start, brk = brk, self.break_point_track[brk]
 yield " ".join(self.words[start:brk]) + "\n"
def write(self):
 with open(self.out_file_name, "w") as f_out:
 f_out.writelines(self.justify())

I'm not exactly sure why you have a lone self.min_badness(0) in justify. is it just to populate the cache with the value for zero? If so, this is worth to put a comment here, because otherwise you might accidentally delete it.

Question 5

Great review. Thanks a lot. The self.min_badness(0) starts the recursive self.min_badness() at the start of the file, it should then take care of the rest of the file; stopping at the base case which is the end of the file in this case. Is there anywhere that explains the @memoize_method syntax that you use? I've seen this around, but I don't know what it's called or how it works.

Question 6

@Aidenhjj Yes that is a decorator. What it does is call min_badness = memoize_method(min_badness) after the definition of the function.

Question 7

The advantage of [word for line in fp for word in line.split()] over fp.read().split() is that it doesn't use twice the amount of memory: once for reading the file, once for storing the list of words...

Question 8

@MathiasEttinger I don't actually think this is the case. Using memory_profiler on the code reports that both use roughly the same amount. ~6.5MiB using comp, ~6.7MiB using .read, for a 1MiB file. I also checked splitting .read.split to use an intermarry variable and it used ~7.4MiB. Honestly I think the, large, speed increase is worth the little extra memory.

Question 9

@Peilonrayz Interesting tool, this memory_profiler. Using mprof run with a very small period revealed that the split used way more memory internally than the read or the list-comp. So using the fastest one (read) seems the best call. TIL.

Question 10

I realised that this code wasn't dealing correctly with the overshoots. I should have coded badness() as something like:

def badness(self, i, j):
 total_width = -1;
 for word in self.words[i:j]:
 total_width += len(word) + 1
 if total_width > self.page_width:
 return float("inf")
 return abs((self.page_width - total_width)**3)

Question 11

+500 bounty on this answer has been awarded as a prize for Best of Code Review 2016 — Best Newcomer (question), since bounties cannot be awarded on questions.

Graipher Graipher 41.6k7 gold badges70 silver badges134 bronze badges · Accepted Answer · 2016-11-23 07:58:59Z

You should have a look at Python's official style-guide, PEP8. One of its recommendations is to use PascalCase for class names and lower_case for function and variable names.

Since you read all words into memory anyway, you can just write:

with open(in_file_name) as fp:
 self.words = fp.read().split()

This is because split splits by default on all whitespace (so both space and newlines). Also note that the default behaviour of open is to open a file in read mode, so "r" is also not needed here.

Caching (or memoization) of a functions return value is best done with a decorator. This way you avoid interleaving the actual implementation of the function with the caching. One example could be:

import functools
def memoize_method(func):
 cache = func.cache = {}
 @functools.wraps(func)
 def wrapper(self, n):
 if n not in cache:
 cache[n] = func(self, n)
 return cache[n]
 return wrapper
class JustifyText:
 ...
 @memoize_method
 def min_badness(self, i):
 if i == self.n:
 f, j_min = 0, self.n
 else:
 f, j_min = min((self.min_badness(j) + self.badness(i, j), j) 
 for j in range(i + 1, self.n + 1))
 self.break_point_track[i] = j_min
 return f

Your badness function can be simplified by using sum, similar to how I used min above:

def badness(self, start, end):
 total_width = sum(len(word) for word in self.words[start:end])
 return abs((self.page_width - total_width)**3)

While it might be your most used use case, writing the output to a file prevents anyone from using this module in any other way. It would be better if justify just returned (even better and simpler: yielded) the justified text and writing to a file is left to the user or a dedicated method using justify internally.

def justify(self):
 self.min_badness(0) # no-op or to populate the cache?
 brk = 0
 while brk < self.n:
 start, brk = brk, self.break_point_track[brk]
 yield " ".join(self.words[start:brk]) + "\n"
def write(self):
 with open(self.out_file_name, "w") as f_out:
 f_out.writelines(self.justify())

I'm not exactly sure why you have a lone self.min_badness(0) in justify. is it just to populate the cache with the value for zero? If so, this is worth to put a comment here, because otherwise you might accidentally delete it.

Great review. Thanks a lot. The self.min_badness(0) starts the recursive self.min_badness() at the start of the file, it should then take care of the rest of the file; stopping at the base case which is the end of the file in this case. Is there anywhere that explains the @memoize_method syntax that you use? I've seen this around, but I don't know what it's called or how it works.
@Aidenhjj Yes that is a decorator. What it does is call min_badness = memoize_method(min_badness) after the definition of the function.
The advantage of [word for line in fp for word in line.split()] over fp.read().split() is that it doesn't use twice the amount of memory: once for reading the file, once for storing the list of words...
@MathiasEttinger I don't actually think this is the case. Using memory_profiler on the code reports that both use roughly the same amount. ~6.5MiB using comp, ~6.7MiB using .read, for a 1MiB file. I also checked splitting .read.split to use an intermarry variable and it used ~7.4MiB. Honestly I think the, large, speed increase is worth the little extra memory.
@Peilonrayz Interesting tool, this memory_profiler. Using mprof run with a very small period revealed that the split used way more memory internally than the read or the list-comp. So using the fastest one (read) seems the best call. TIL.

Stack Exchange Network

Justify text from a file using a LaTeX method

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Justify text from a file using a LaTeX method

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions