Split a string on spaces, accounting for backslash as escape character, in Haskell

Question 1

This code works, but since I'm just learning Haskell I'm wondering if it could have been done in a more idiomatic way.

The main function is called tokenize, and it uses helper functions called tokenizeOne and tokenizeSeveral.

Here is the code:

tokenizeOne "" (' ':rest) = tokenizeOne "" rest 
tokenizeOne str ('\\':ch:rest) = tokenizeOne (str ++ [ch]) rest
tokenizeOne x (' ':rest) = (x, rest)
tokenizeOne x "" = (x, "")
tokenizeOne x (ch:rest) = tokenizeOne (x ++ [ch]) rest
tokenizeSeveral x "" = x
tokenizeSeveral x str = let (token, rest) = tokenizeOne "" str in
 tokenizeSeveral (x ++ [token]) rest
tokenize str = tokenizeSeveral [] str

Question 2

Apart from general code layout I don't see much to improve - the only issue worth talking about is that you should never ever actually use something like ++ [ch]. Your function is needlessly O(n2) for that reason alone. That might not sound like much here, but it is a good idea to shake this habit early.

So let's talk about that a bit. Your options generally are:

Find a library function that frees you from actually constructing the list yourself. But to be fair, for this examples I can't actually think of anything helpful myself.
Use a data structure that copes better with it, such as Seq from Data.Sequence.
First construct the list using just (:), then reverse it at the end. That might seem inefficient, but given that reverse is only O(n), it will never actually increase algorithmic complexity.
Rebuild your function so it constructs the list in the right order right away.

Here is what things look like if we go for the last option:

import Control.Arrow ( first )
token :: String -> (String, String)
token "" = ("", "")
token (' ' : cs) = ("", cs)
token ('\\':c:cs) = first (c:) $ token cs
token (c : cs) = first (c:) $ token cs

The key here is that we add the character c using first (c:) after the recursive call returns, and therefore to the front of the string.

A notable advantage is that this version has better laziness properties:

*Main> head $ fst $ token ('t':undefined)
't'

We know that the first character is 't' even without looking at the rest of the string, whereas your function always needs to finish a complete token before it can return. And as we are programming in Haskell, we often like our functions as lazy as possible.

Finally, a matching tokens implementation:

tokens :: String -> [String]
tokens str
 | null str' = []
 | otherwise = tok : tokens rest
 where str' = dropWhile (==' ') str
 (tok, rest) = token str'

This is again basically your version using direct list construction. Only difference is that we now deal with whitespace once per token, which I feel is slightly more elegant.

Peter Wortmann Peter Wortmann 1,2167 silver badges5 bronze badges · Answer 1 · 2014-02-07 17:51:27Z

Apart from general code layout I don't see much to improve - the only issue worth talking about is that you should never ever actually use something like ++ [ch]. Your function is needlessly O(n2) for that reason alone. That might not sound like much here, but it is a good idea to shake this habit early.

So let's talk about that a bit. Your options generally are:

Find a library function that frees you from actually constructing the list yourself. But to be fair, for this examples I can't actually think of anything helpful myself.
Use a data structure that copes better with it, such as Seq from Data.Sequence.
First construct the list using just (:), then reverse it at the end. That might seem inefficient, but given that reverse is only O(n), it will never actually increase algorithmic complexity.
Rebuild your function so it constructs the list in the right order right away.

Here is what things look like if we go for the last option:

import Control.Arrow ( first )
token :: String -> (String, String)
token "" = ("", "")
token (' ' : cs) = ("", cs)
token ('\\':c:cs) = first (c:) $ token cs
token (c : cs) = first (c:) $ token cs

The key here is that we add the character c using first (c:) after the recursive call returns, and therefore to the front of the string.

A notable advantage is that this version has better laziness properties:

*Main> head $ fst $ token ('t':undefined)
't'

We know that the first character is 't' even without looking at the rest of the string, whereas your function always needs to finish a complete token before it can return. And as we are programming in Haskell, we often like our functions as lazy as possible.

Finally, a matching tokens implementation:

tokens :: String -> [String]
tokens str
 | null str' = []
 | otherwise = tok : tokens rest
 where str' = dropWhile (==' ') str
 (tok, rest) = token str'

This is again basically your version using direct list construction. Only difference is that we now deal with whitespace once per token, which I feel is slightly more elegant.

Stack Exchange Network

Split a string on spaces, accounting for backslash as escape character, in Haskell

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Split a string on spaces, accounting for backslash as escape character, in Haskell

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions