Function to get NGrams fast in Haskell

Question 1

Is this a good way to get all sublists of a sequence/list that have a given length?

An inefficient way to do it would be something like

f n = (filter (\x -> (length x) > n)) . (take n) . tails

This just takes the n first elements of each tail of the original list. I think it should be slow because of the length check on every tail-element.

A smarter way would should be to "slide" a sequence of length n over an input sequence and save the result of each slide by one to the right.

-- | Get all the subsequences of a given sequence sq of length n
ngrams::Int -> Seq a -> Seq (Seq a)
ngrams n sq | length sq < n = empty
ngrams n sq | otherwise = ngrams' restSequence initialWindow empty
 where
 initialWindow = take n sq
 restSequence = drop n sq
 ngrams' (viewl -> EmptyL) window acc = acc |> window
 ngrams' (viewl -> x :< s) window@(viewl -> a :< r) acc = 
 ngrams' s (r |> x) (acc |> window)

Somehow I have the feeling I am missing an obvious way to do this better...

Question 2

A little math goes a long way.

ngrams :: Int -> [a] -> [[a]]
ngrams n l = take (length l - (n - 1)) . map (take n) . tails $ l

Question 3

Thanks for your reply. I guess this is still not optimal because it won't work on an infinite list. I guess that each call to take also takes n operations for each tail, right ?

Question 4

Re: infinite lists, neither would your Seq version. Re: take, yes but no, remember that it will be evaluated lazily, so you'll only compute the elements of take n as you use them, it won't walk the spine of the ngram twice.

Question 5

I think it is better to replace take (length) l - (n-1) by zipWith (flip const) (drop (n-1) l). This increases lazyness and makes the function work for infinite lists.

bisserlis bisserlis 3,3311 gold badge13 silver badges17 bronze badges · Accepted Answer · 2014-11-17 03:24:01Z

2

\$\begingroup\$

A little math goes a long way.

ngrams :: Int -> [a] -> [[a]]
ngrams n l = take (length l - (n - 1)) . map (take n) . tails $ l

Share

answered Nov 17, 2014 at 3:24

bisserlis's user avatar

bisserlis bisserlis

3,3311 gold badge13 silver badges17 bronze badges

\$\endgroup\$

3

\$\begingroup\$ Thanks for your reply. I guess this is still not optimal because it won't work on an infinite list. I guess that each call to take also takes n operations for each tail, right ? \$\endgroup\$

Christof Schramm
– Christof Schramm

2014年11月17日 09:29:30 +00:00
Commented Nov 17, 2014 at 9:29
\$\begingroup\$ Re: infinite lists, neither would your Seq version. Re: take, yes but no, remember that it will be evaluated lazily, so you'll only compute the elements of take n as you use them, it won't walk the spine of the ngram twice. \$\endgroup\$

bisserlis
– bisserlis

2014年11月17日 14:14:46 +00:00
Commented Nov 17, 2014 at 14:14
\$\begingroup\$ I think it is better to replace take (length) l - (n-1) by zipWith (flip const) (drop (n-1) l). This increases lazyness and makes the function work for infinite lists. \$\endgroup\$

Tashi Walde
– Tashi Walde

2018年09月07日 14:23:15 +00:00
Commented Sep 7, 2018 at 14:23

Add a comment |

Stack Exchange Network

Function to get NGrams fast in Haskell

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Function to get NGrams fast in Haskell

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions