I want to read a file into a collection of objects. The data (it's a Leica PTX file if you're curious) is formatted as follows:
640 [begin item #1: number of columns]
480 [number of rows]
0 0 0 [position information: 3x3 matrix]
0 0 0 [position ctd]
0 0 0 [position ctd]
1 0 0 0 [calibration information: 3x4 matrix]
0 1 0 0 [calibration ctd]
0 0 1 0 [calibration ctd]
1.1 2.2 3.3 0.5 [point 1: X Y Z I]
-0.2 2.3 1.4 0.2 [point 2: X Y Z I]
...
3.9 1.2 -7.7 0.8 [point n: X Y Z I]
640 [begin item #2: number of columns]
480 [number of rows]
0 0 0 [position information]
0 0 0
0 0 0
1 0 0 0 [calibration information]
0 1 0 0
0 0 1 0
0.2 1.3 -2.1 0.4 [point 1 of item #2]
... [etc]
I.e. Two lines each containing a single float/double (as text) signal the beginning of a new item, and no other lines are exactly one number.
It's easy to do if there's only one item per file, it's a simple fold
operation. It would also be relatively straightforward to do with a while
type loop that keeps state, but I'm new to functional programming and wondering if there isn't a more concise and elegant way to do this functionally with standard tools (like reduce
, fold
, split
/partition
, etc). The best solutions I've come up with so far involve "peeking" ahead using lists or arrays, but I'd like something that allows for the possibility of a sequence that is consumed as it iterates for maximum generality. (I'm using F# and .NET but I don't see this as a language-specific problem.)
-
1What have you tried so far? Where are you stuck? Can you share some example data?Mark Seemann– Mark Seemann2015年10月26日 10:57:15 +00:00Commented Oct 26, 2015 at 10:57
-
Thanks for your comment. I edited the question. Does it make it more clear what I'm struggled with?Robotman– Robotman2015年10月26日 17:05:47 +00:00Commented Oct 26, 2015 at 17:05
-
2Looks like a job for a parser combinator library. I'm not an F# guru but Googling "F# parser combinators" turns up thisBenjamin Hodgson– Benjamin Hodgson2015年10月26日 22:54:18 +00:00Commented Oct 26, 2015 at 22:54
-
@BenjaminHodgson I think learning about parser combinators is an overkill for parsing something that's this simple.svick– svick2015年10月27日 14:02:51 +00:00Commented Oct 27, 2015 at 14:02
3 Answers 3
If the whole file is small enough to fit into memory, then I think one way would be to parse the file into a list and then process the list piece by piece, something like:
let parseItem (columnsString::rowsString::rest) =
let columns = int columnsString
let rows = int rowsString
let position, rest = parsePosition rest
let calibration, rest = parseCalibration rest
let points, rest = parsePoints columns rows rest
(columns, rows, position, calibration, points), rest
let rec parseItems = function
| [] -> []
| lines ->
let item, restLines = parseItem lines
let restItems = parseItems restLines
item :: restItems
As you surely noticed, this code is pretty repetitive, but you can simplify it using the state workflow (known as state monad in Haskell) from ExtCore:
let parseItem =
state {
let! columns = parseInt
let! rows = parseInt
let! position = parsePosition
let! calibration = parseCalibration
let! points = parsePoints columns rows
return columns, rows, position, calibration, points
}
let rec parseItems = function
| [] -> [], []
| lines ->
state {
let! item = parseItem
let! restItems = parseItems
return item::restItems
} <| lines
-
Thanks - passing
rest
back from the inner function calls is exactly the sort of thing I was trying to think of, and couldn't.Robotman– Robotman2015年10月29日 17:13:20 +00:00Commented Oct 29, 2015 at 17:13
I'd wager that the most popular tool in the Haskell world for this sort of problem is parser combinators. The classic library for this is parsec, but for a machine generated file format the attoparsec library is generally a better choice (faster parsing at the cost of less friendly error reporting—a good tradeoff when you're reading large files that were not written by humans).
There is an excellent tutorial demonstrating attoparsec's use by parsing log files.
Attoparsec is popular enough that somebody tried to port it to F#, but I've never even written an F# program, so I definitely don't have any way of judging the quality of that library. There's also a port of the regular (non-"atto") Haskell parsec library to F#, that might be worth looking at—it also comes with a tutorial.
If you wish to understand a little bit of how parser combinators work under the hood, Graham Hutton and Erik Meijer wrote two simple tutorials showing simplified internals. The first, I believe, is an abridged version of the second:
These are in Haskell but if you know F# you will likely be able to follow a lot of it.
This isn't exactly one of functional programming's best use cases, but most languages have a construct that will create an infinite stream of lines, which you generally feed into a fold
with an initial state. Here's an example in Scala (sorry, I don't know F#):
val lines = io.Source.stdin.getLines
val state = lines.foldLeft(initialState)(processLine)
Basically, the processLine
function is whatever you would put in your while
loop, and it returns a new state each time through. Inside it, you can use split
, regexes, pattern matching, or even parser combinators if you're brave.
I can't claim the functional solution ends up more elegant with something this stateful, but it does have the advantage that the wrong path is usually obvious pretty quickly.