I'm working on a document processing system.
I feel confident with a Document
class which represents each document being processed.
The issue:
Each Document
can have a CoverSheet
, and if it does, we need to get CoverSheetInfo
from this CoverSheet
(for renaming and processing). But checking for a CoverSheet
and coercing the info into CoverSheetInfo
involves a fair amount of Apache PDFBox code.
I'm trying to decide on the best place to have this functionality.
Option 1
Document
class will have these methods:
public boolean hasCoverSheet()
public CoverSheetInfo getCoverSheetInfo()
Pros:
Behavior is close to the data -- the process of checking a Document
for a CoverSheet
takes place in Document
which seems sensible.
Cons:
This adds a lot of PDFBox - related parsing lines which make the other-wise simple set-get Document
look cluttered and makes the Document
class exceed 300 lines to include this functionality. Thus Option 2...
Option 2
Create a DocumentParser
class which would have:
public boolean hasCoverSheet(Document document)
public CoverSheetInfo getCoverSheetInfo(Document document)
Pros:
All the PDFBox - specific parsing code is in it's own Class. I think this is a good example of enforcing Single Responsibility/Law of Demeter As I don't think Document
should necessarily know how to parse information from cover sheets.
Cons:
Awkward(?) separation of behavior from data(?)
Which one seems most reasonable and how so?
Edit: I'm desperate. Any feed back would be absolutely. fricken. loved.
Edit 2
A Document
is in this case a scanned mortgage document, and it will always be a PDF. A Document
is created when my app finds files in a directory (one Document
is made for each file found). DocumentParser
should process Documents
, right, File was a typo.
At this point, Document
is just a wrapper around the File essentially. In Option 1, it would have CoverSheetInfo
as a field and File stubFile
as well as boolean
regarding the existence of these things.
Here's the "story" for what I'm doing:
Someone will scan a document. It will end up in a directory. My app needs to look at that directory, and
rename the files by their cover sheet (if they have one)
make a stub out of the first 8 pages (if the file is very large)
Upload these files (and any stubs made) to Google Drive.
2 Answers 2
As far as I understood, the cover page is an important property of a document in your application. Then I'd represent it with a getter. The business classes dealing with renaming the document then only need the Document
object and have no dependency to a parser which they do not care about.
How that getter is implemented is a different question. You could for example provide a mock implementation that provides a pre-defined cover page for unit testing. To solve the problem of a long class with multiple responsibilities, you could extract the parsing code to a separate class that is used by the real Document
implementation.
To go one step further, the parser instance could be provided to the Document
constructor. That's called dependency injection and decoples the parser from the document, so that other parsers could be used. For exapmle, the unit test of Document
can use a mock implementation of the parser. There are frameworks like weld which essentially provide a factory to create classes without knowing the exact dependency (i.e. the parser).
-
I know it's been a while... but I came back to this issue again today and so I checked here. I appreciate / understand your first paragraph. That makes sense to me... a
Coversheet
is an important part of aDocument
. But you lost me in the second paragraph... are you suggesting I pass in a Parser object toDocument
constructor, and then use that Parser inDocument.get...
? If I'm passing a Parser it makes me think I might as well use the Parser. Does your 2nd paragraph contradict your first?Don Cheadle– Don Cheadle11/03/2014 20:33:40Commented Nov 3, 2014 at 20:33 -
@mmcrae: Yes, that's called dependency injection. Note that only the one who creates a document has to know the parser, not the one using it. But you you don't need to use DI if you think it's overkill - my suggestions on the model go on their own. I extracted the DI part into a new paragraph. Did that make it clear?Yogu– Yogu11/03/2014 21:53:19Commented Nov 3, 2014 at 21:53
Why not have a separate CoverSheet
object? Then you can cleanly do
CoverSheetInfo coverInfo = null;
if(theDocument.hasCoverSheet()) {
CoverSheet cover = theDocument.getCoverSheet();
coverInfo = cover.getInfo();
}
Sorry, I'm not familiar with PDFBox, so I don't know if this introduces more complexity than it solves, but (IMHO) it's a much cleaner design.
Document
for the existence of aCoverSheet
an expensive operation? Or is just retrieving the fullCoverSheetInfo
the expensive part?CoverSheetInfo
andStubInfo
are different/interesting. One thing on my mind about this though: I feel like it's a bad(ish) thing to have references to specific classes (CoverSheetInfo
andStubInfo
in this case) set inDocument
's class i.e. I feel it's best to have those things passed into a constructor -- at least that's something discussed in Dependency Injection but maybe that doesn't quite apply here... Thanks a bunch for your feedback