More Help with python .find function

Sat Jan 8 00:35:45 EST 2011

On Fri, 07 Jan 2011 22:43:54 -0600, Keith Anthony wrote:
> My previous question asked how to read a file into a structure a line at
> a time. Figured it out. Now I'm trying to use .find to separate out
> the PDF objects. (See code) PROBLEM/QUESTION: My call to lines[i].find
> does NOT find all instances of endobj. Any help available? Any
> insights?
>
> #!/usr/bin/python
>
> inputfile = file('sample.pdf','rb')   # This is PDF with which we will work
> lines = inputfile.readlines()         # read file one line at a time

That's incorrect. readlines() reads the entire file in one go, and splits 
it into individual lines.
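
To make the difference concrete, here is a minimal sketch. It is written in Python 3 syntax and uses an in-memory io.BytesIO object with made-up contents as a stand-in for the real file; both of those are my assumptions for illustration, not part of the original script. readlines() hands back the complete list at once, whereas iterating over the file object yields one line at a time.

```python
import io

# Stand-in for open('sample.pdf', 'rb'); the contents are invented for illustration.
data = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"

# readlines() consumes the whole file and returns a list of every line.
all_lines = io.BytesIO(data).readlines()
print(type(all_lines).__name__, len(all_lines))

# Iterating over the file object yields one line at a time instead.
lazy_lines = []
for line in io.BytesIO(data):
    lazy_lines.append(line)

# Both approaches see the same lines; only the timing of the reads differs.
print(all_lines == lazy_lines)
```

For a small file the two are interchangeable. For a large one, the loop avoids holding every line in memory at the same time.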

> linestart = []    # Starting address for each line
> lineend = []      # Ending address for each line
> linetype = []

*raises eyebrow*
How is an empty list a starting or ending address?
The only thing worse than no comments where you need them is misleading 
comments. A variable called "linestart" implies that it should be a 
position, e.g. linestart = 0. Or possibly a flag.

> print len(lines)    # print number of lines
>
> i = 0               # define an iterator, i

Again, 0 is not an iterator. 0 is a number.
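
A short sketch of the distinction, in Python 3 syntax and with a made-up two-item list (both are assumptions for illustration): an iterator is something you can call next() on, and an integer used as an index is not.

```python
i = 0                                # a plain integer, usable as an index
lines = ["first\n", "second\n"]      # hypothetical data

it = iter(lines)                     # this *is* an iterator
first = next(it)                     # "first\n"
second = next(it)                    # "second\n"

try:
    next(i)                          # an int does not support next()
except TypeError as err:
    int_error = str(err)
    print(int_error)
```

So "index" or "counter" would be a more accurate word for i in the comment.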

> addr = 0            # and address pointer
>
> while i < len(lines):    # Go through each line
>     linestart = linestart + [addr]
>     length = len(lines[i])
>     lineend = lineend + [addr + (length-1)]
>     addr = addr + length
>     i = i + 1

Complicated and confusing and not the way to do it in Python. Something 
like this is much simpler:

linetypes = []  # note plural
inputfile = open('sample.pdf','rb')  # Don't use file, use open.
for line_number, line in enumerate(inputfile):
    # Process one line at a time. No need for that nonsense with manually
    # tracked line numbers, enumerate() does that for us.
    # No need to initialise linetypes.
    status = 'normal'
    i = line.find(' obj')
    if i >= 0:
        print "Object found at offset %d in line %d" % (i, line_number)
        status = 'object'
    i = line.find('endobj')
    if i >= 0:
        print "endobj found at offset %d in line %d" % (i, line_number)
        if status == 'normal': status = 'endobj'
        else: status = 'object & endobj'  # both found on the one line
    linetypes.append(status)
    # What if obj or endobj exist more than once in a line?

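
On that last question: find() only ever reports the first match at or after the position where it starts looking, which is one likely reason for "missing" instances. It accepts an optional start argument, so one way to collect every occurrence is to call it repeatedly, resuming just past the previous hit. The sketch below is in Python 3 syntax; find_all is a hypothetical helper name and the sample line is made up.

```python
def find_all(text, target):
    """Return every offset at which target occurs in text.

    Hypothetical helper, not part of the standard library.
    Matches are non-overlapping.
    """
    if not target:
        raise ValueError("target must be non-empty")
    offsets = []
    start = 0
    while True:
        i = text.find(target, start)   # search from 'start' onwards
        if i < 0:                      # -1 means no further match
            break
        offsets.append(i)
        start = i + len(target)        # resume just past this match
    return offsets

line = "1 0 obj endobj 2 0 obj endobj"  # invented sample line
print(find_all(line, "endobj"))         # [8, 23]
print(find_all(line, " obj"))           # [3, 18]
```

The same function works on byte strings, as long as text and target are both bytes.
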
One last thing... if PDF files are a binary format, what makes you think 
that they can be processed line-by-line? They may not have lines, except 
by accident.
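
If lines turn out to be unreliable, one alternative is to read the whole file as a single byte string and search that, so the result does not depend on where any newlines happen to fall. What follows is only a rough sketch in Python 3 syntax, using made-up bytes in place of open('sample.pdf', 'rb').read(). The sample deliberately uses carriage returns as line endings, which PDF permits, so readlines() would see it as one long line. Be warned that a plain byte search like this can also hit "obj" or "endobj" where they occur by chance inside stream data, and it sees nothing inside compressed content, so treat it as a diagnostic aid rather than a PDF parser.

```python
import re

# Stand-in for open('sample.pdf', 'rb').read(); the bytes are invented.
data = b"1 0 obj\r<< >>\rendobj\r2 0 obj << >> endobj\r"

# \b stops "obj" from matching inside "endobj".
pattern = re.compile(rb"\b(endobj|obj)\b")

# One (offset, keyword) pair for every marker found in the file.
markers = [(m.start(), m.group(1)) for m in pattern.finditer(data)]

for offset, keyword in markers:
    print(offset, keyword.decode("ascii"))
```

The offsets here are positions within the whole file, which is roughly what the linestart/lineend bookkeeping was trying to compute.
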
-- 
Steven