For some context, I'm trying to write a streaming tokenizer in Haskell and noticed that I was writing a lot of functions with the type signature String -> Maybe Int
that attempted to consume "the longest possible token matching a particular pattern" from the input stream. The pattern might be something like a quoted string literal in JSON.
I tried using Parsec for a while to get a "prefix matcher", but I kept getting tripped up over what the appropriate user state is supposed to be, and what the m argument in ParsecT means. Parsec is probably a strict superset of the functionality that this little library exposes.
A "Matcher" is anything with the type String -> Maybe Int. Matchers are supposed to be greedy and produce the longest prefix that they possibly can. There's also an unenforced property that, in cases where the match is a proper prefix of the input string, adding more characters to the end of the string doesn't extend the prefix. The library also encourages/permits users to write sloppy grammars (for instance using symOrExn) that fail in ugly ways.
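For a concrete example of the greedy, longest-prefix behaviour, here's a hypothetical matcher for a run of decimal digits (not part of the library below):

import Data.Char (isDigit)

digits :: String -> Maybe Int
digits str = case length (takeWhile isDigit str) of
  0 -> Nothing   -- no digits at the front: no match
  n -> Just n    -- length of the longest digit prefix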
I could use some help figuring out how to generalize String -> Maybe Int. I don't know what class I should insist on for the input; Eq a => [a] is the first thing that comes to mind, but it might be possible to make something more general than that. As for the output type, maybe a type that's constrained to be a monoid or a group? I really just need a notion of zero and the ability to increment for the index... although it might also be possible to generalize the notion of a "prefix" to tree-like structures.
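To make the Eq a => [a] idea concrete, here's a rough sketch of what I mean (the names GenMatcher and makeGenMatcher are just placeholders):

import Data.List (isPrefixOf)
import Data.Monoid (Sum (..))

type GenMatcher a = [a] -> Maybe (Sum Int)

makeGenMatcher :: Eq a => [a] -> GenMatcher a
makeGenMatcher spec str
  | spec `isPrefixOf` str = Just (Sum (length spec))   -- Sum supplies the zero/add structure
  | otherwise             = Nothing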
I'm hoping to structure this library eventually as a collection of really concrete implementations for a handful of commonly-used text-like types, as well as a generic implementation that can be used in other circumstances. Advice on how to generalize the library is much appreciated.
module Text.PrefixMatcher.String where
import Data.List
-- A simple non-generic prefix matching library
-- with a few simple combinators
--
-- a Matcher has type (String -> Maybe Int)
--
-- It takes a String and returns a number corresponding
-- to the length of the matched prefix.
--
-- The intended use case here is to provide a simple utility
-- to help in converting a string to a stream of tokens.
--
-- makeMatcher -- make a matcher for a constant string
--
type Matcher = String -> Maybe Int
identityMatcher :: Matcher
identityMatcher str = Just 0
makeMatcher :: String -> Matcher
makeMatcher spec str = case isPrefixOf spec str of
  True  -> Just (length spec)
  False -> Nothing
andThen :: Matcher -> Matcher -> Matcher
andThen m1 m2 str = ltotal where
  l1 = m1 str
  ltotal = case l1 of
    Nothing -> Nothing
    (Just l1) ->
      let rest = drop l1 str in
      let l2 = m2 rest in   -- run the second matcher on the remainder of the input
      case l2 of
        Nothing -> Nothing
        (Just l2) -> Just (l1 + l2)
symOrExn :: Matcher -> Matcher -> Matcher
symOrExn m1 m2 str = res where
  l1 = m1 str
  l2 = m2 str
  res = case (l1, l2) of
    -- good cases
    (Just l1, Nothing) -> Just l1
    (Nothing, Just l2) -> Just l2
    -- bad cases
    (Nothing, Nothing) -> Nothing
    -- if you reach this case, the grammar was
    -- constructed badly
    (Just l1, Just l2) -> error "ambiguous match"
leftOr :: Matcher -> Matcher -> Matcher
leftOr m1 m2 str = result where
  l1 = m1 str
  l2 = m2 str
  result = case (l1, l2) of
    (Just l1, _) -> Just l1
    (Nothing, Just l2) -> Just l2
    (Nothing, Nothing) -> Nothing
andThenList :: [Matcher] -> Matcher
andThenList [] = identityMatcher
andThenList [x] = x
andThenList (x:xs) = andThen x (andThenList xs)
leftOrList :: [Matcher] -> Matcher
leftOrList [] = identityMatcher
leftOrList [x] = x
leftOrList (x:xs) = leftOr x (leftOrList xs)
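For illustration, here's a hypothetical use of these combinators (keyword isn't part of the module, just an example):

keyword :: Matcher
keyword = leftOrList [makeMatcher "true", makeMatcher "false", makeMatcher "null"]

-- keyword "true,1]" == Just 4
-- keyword "1,2]"    == Nothing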
2 Answers
Looks like you're on the right track, and this might eventually evolve into a parser, or converge to an existing one.
From a software engineering point of view, I'd suggest making Matcher a newtype. This better separates its API from the implementation, and allows you to make changes later without breaking the API.
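A minimal sketch of that suggestion (the field name runMatcher is just a suggestion):

newtype Matcher = Matcher { runMatcher :: String -> Maybe Int }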
Notice these two functions:
identityMatcher :: Matcher
andThen :: Matcher -> Matcher -> Matcher
This looks very much like a monoid (see also Monoid on the Haskell Wiki). And this is indeed what you want from your parser: to be compositional, and to adhere to certain laws, namely associativity and identity. So it'd make sense to have a Monoid instance for Matcher. Note that andThenList then becomes just mconcat.
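A sketch of what those instances could look like, assuming the newtype wrapper from above:

instance Semigroup Matcher where
  Matcher m1 <> Matcher m2 = Matcher $ \str -> do
    l1 <- m1 str             -- match the first prefix
    l2 <- m2 (drop l1 str)   -- then match against the remainder
    pure (l1 + l2)

instance Monoid Matcher where
  mempty = Matcher (const (Just 0))   -- identityMatcher

andThenList :: [Matcher] -> Matcher
andThenList = mconcat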
Another Monoid instance would be leftOr, whose identity would be the always-failing matcher const Nothing. This is very much like the two monoids for natural numbers - multiplication with 1 (andThen) and addition with 0 (leftOr). We'll come back to this later.
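Just as Sum and Product wrap the same numbers in different newtypes, this second monoid would need its own wrapper to coexist with the first; a sketch (the name OrMatcher is made up):

import Control.Applicative ((<|>))

newtype OrMatcher = OrMatcher { runOrMatcher :: String -> Maybe Int }

instance Semigroup OrMatcher where
  OrMatcher m1 <> OrMatcher m2 = OrMatcher $ \str -> m1 str <|> m2 str   -- leftOr

instance Monoid OrMatcher where
  mempty = OrMatcher (const Nothing)   -- the always-failing matcher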
The next thing you might consider is returning a value from your matcher. For constant matchers this doesn't add much, but there are several reasons why it can be useful:
- The matchers can optionally parse the input; for example, when matching numbers, they can already return the parsed number.
- Some matchers might match multiple strings, and then it's useful to be able to provide more information about the outcome (for example, regex matched groups).
- A simple example of the previous point: when calling leftOr, you might want to know which of the matchers matched, and returning a value allows you to distinguish that (see the sketch after this list).
- Sometimes it's useful if a matcher can depend on the result of the previous match.
- When you don't need a value, it's easy to disregard/discard it by using () as the result.
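As a toy illustration of the "which matcher matched" point, still in the plain String -> Maybe Int style but paired with a tag (Token and boolToken are made up for this example):

import Data.List (isPrefixOf)

data Token = TTrue | TFalse deriving (Show)

boolToken :: String -> Maybe (Token, Int)
boolToken str
  | "true"  `isPrefixOf` str = Just (TTrue, 4)
  | "false" `isPrefixOf` str = Just (TFalse, 5)
  | otherwise                = Nothing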
This leads you to creating a monadic interface for your parser. So your type could be something like:
newtype Matcher a = Matcher (String -> Maybe (a, Int))
Interestingly, the inner type is already a monad, corresponding to ReaderT String (WriterT (Sum Int) Maybe) a, or alternatively RWST String (Sum Int) () Maybe a. The standard monadic >> should then correspond to andThen, and return to the identity matcher; likewise mplus corresponds to leftOr. Such an interface will give your matchers all the power monads offer with very little effort.
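A hand-rolled sketch of those instances (repeating the newtype so the snippet stands on its own; the field name runMatcher is my own addition):

import Control.Applicative (Alternative (..))
import Control.Monad (ap)

newtype Matcher a = Matcher { runMatcher :: String -> Maybe (a, Int) }

instance Functor Matcher where
  fmap f (Matcher m) = Matcher $ \str -> do
    (a, n) <- m str
    pure (f a, n)

instance Applicative Matcher where
  pure a = Matcher $ \_ -> Just (a, 0)   -- the identity matcher, now carrying a value
  (<*>)  = ap

instance Monad Matcher where
  Matcher m >>= f = Matcher $ \str -> do
    (a, n1) <- m str
    (b, n2) <- runMatcher (f a) (drop n1 str)   -- andThen, plus passing the value along
    pure (b, n1 + n2)

instance Alternative Matcher where
  empty = Matcher (const Nothing)
  Matcher m1 <|> Matcher m2 = Matcher $ \str -> m1 str <|> m2 str   -- leftOr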
Regarding your paragraph:
I tried using Parsec for a while to get a "prefix matcher", but I kept getting tripped up over what the appropriate user state is supposed to be, and what the m argument in ParsecT means. Parsec is probably a strict superset of the functionality that this little library exposes.
If you don't need any user state, you can just ignore it. And the m argument allows you to run your parser in another monad - ParsecT is a monad transformer. If you don't need it, just use Identity, which is exactly how Parsec is defined.
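For reference, this is essentially how parsec itself defines the plain Parsec type in Text.Parsec.Prim:

import Data.Functor.Identity (Identity)
import Text.Parsec.Prim (ParsecT)

type Parsec s u = ParsecT s u Identity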
You might be interested in using ReadP, a conceptually simpler parser bundled with GHC.
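A small sketch of the same kind of prefix matching with ReadP (matchPrefix is just a made-up name for this example):

import Text.ParserCombinators.ReadP (readP_to_S, string)

matchPrefix :: String -> String -> Maybe Int
matchPrefix spec input =
  case readP_to_S (string spec) input of
    (matched, _rest) : _ -> Just (length matched)   -- length of the consumed prefix
    []                   -> Nothing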
Also, this blog post about writing your own simple parser might be useful for you.
Library functions, particularly Maybe's Applicative/Alternative instances, can make your code more concise:
-- additional imports needed, beyond Data.List:
import Control.Applicative (liftA2, (<|>))
import Control.Monad (guard)
import Data.Maybe (catMaybes)

makeMatcher spec str = length spec <$ guard (isPrefixOf spec str)

andThen m1 m2 str = do
  l1 <- m1 str
  l2 <- m2 $ drop l1 str
  Just $ l1 + l2

symOrExn m1 m2 str = case catMaybes [m1 str, m2 str] of
  [] -> Nothing
  [l] -> Just l
  -- if you reach this case, the grammar was
  -- constructed badly
  [_, _] -> error "ambiguous match"

leftOr = liftA2 (<|>)

andThenList = foldr andThen identityMatcher

leftOrList = foldr leftOr identityMatcher
A more potent approach is to return not the length at which to split, but the results of the split. In fact, StateT String Maybe String makes this work out of the box:
{-# LANGUAGE TupleSections #-}
import Control.Applicative (liftA2, (<|>))
import Control.Monad.Trans.State (StateT (..))
import Data.Foldable (asum)
import Data.List (stripPrefix)

type Matcher = StateT String Maybe String
makeMatcher :: String -> Matcher
makeMatcher spec = StateT $ fmap (spec,) . stripPrefix spec
runMatcher :: Matcher -> String -> Maybe (String, String)
runMatcher = runStateT
andThen :: Matcher -> Matcher -> Matcher
andThen = liftA2 (++)
symOrExn m1 m2 = m1 <* (m2 *> error "ambiguous match" <|> pure ()) <|> m2
-- I'll admit, this one looks clunky
leftOr :: Matcher -> Matcher -> Matcher
leftOr = (<|>)
andThenList :: [Matcher] -> Matcher
andThenList = fmap concat . sequenceA
leftOrList :: [Matcher] -> Matcher
leftOrList = asum
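A hypothetical usage example, showing how both the matched token and the rest of the stream come back:

demo :: Maybe (String, String)
demo = runMatcher (makeMatcher "tr" `andThen` makeMatcher "ue") "true,1]"
-- demo == Just ("true", ",1]")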