Out of pure curiosity, I decided to create a short F# script to calculate the letter frequencies based on the input file. Everything works smoothly and as I would expect. My questions regarding my code are as follows:
Does it conform to formal F# standards?
I have done my best to keep it as functional as possible while maintaining a style similar to the most common one I've seen in example programs, but I am unsure if it is up to par.
Does it maximize F#'s capabilities while maintaining readability?
This is as it seems, I am curious as to whether or not I have used all relevant and applicable F# functions inside the program (e.g. map). The portion I am most concerned about is this bit:
Array.map (fun z -> let e, f = z match count |> Array.tryFind (fun a -> let f,s = a (e = f)) with | Some x -> (e, Convert.ToInt64(snd x) + f) | None -> z)
I am unsure if my approach is proper, that
Array.tryFind
bit is especially horrendous, but I can't think of a better way to approach this. In addition, instead of thetuple
type, should I instead go forrecords
or is there some other, more appropriate, solution that I am missing?Are there any ways to optimize or simplify the program I have written?
I have a decently powerful machine (i5 6600, 16GB RAM) and per FSI's
#time
directive, the program has the following footprint when processing ~10.1gb of textReal: 00:14:30.479, CPU: 00:14:21.421, GC gen0: 208720, gen1: 96, gen2: 8
. I have had very little optimization experience whatsoever, so I am unsure if (by my calculations) 11.89mb/s is a good speed for such a relatively simple program.
#if INTERACTIVE
#time
#else
module LetterFreqCalc
#endif
open System
open System.IO
let explode (x:string) = [|for c in x -> c|]
let ToUpper (x:string) = x.ToUpper()
let numLetterInWord word =
word
|> ToUpper
|> explode
|> Array.countBy id
let sumTupArray (tupArr:(char*int64)[]) =
tupArr
|> Array.sumBy snd
let div (x:float) (y:float) = y/x
let mul (x:float) (y:float) = x*y
let rec CalcNextLine (reader:StreamReader) accumulator =
match reader.EndOfStream with
| true -> accumulator
| false ->
let line = reader.ReadLine()
let count = numLetterInWord line
accumulator
|> Array.map (fun z ->
let e, f = z
match count |> Array.tryFind (fun a -> let f,s = a
(e = f)) with
| Some x -> (e, Convert.ToInt64(snd x) + f)
| None -> z)
|> CalcNextLine reader
let CalcLetterFreqOfFile (path:string) =
let reader = new StreamReader(path)
let start = [|for c in [|'A'..'Z'|] -> (c,0L)|]
CalcNextLine reader start
[<EntryPoint>]
let main argv =
let occurences = @"dictionary.txt"
|> CalcLetterFreqOfFile
let total = occurences
|> Array.sumBy snd
|> float
let freqs = occurences
|> Array.map (fun x -> (fst x, (let percentage = snd x
|> float
|> div total
|> mul 100.0
Math.Round(percentage, 2))))
printfn "%A" freqs
printfn "%A" (freqs |> Array.sortByDescending snd)
printfn "%A" argv
0 // return an integer exit code
1 Answer 1
Your
explode
function is almost just equivalent toSeq.toArray<char>
. There's no need to have aToUpper
function just to call a method, especially if you're using it just once. And there is already a.ToCharArray()
method anyway. I would recommend getting rid of both of those functions and using method calls instead:let numLetterInWord (word:string) = word.ToUpper().ToCharArray() |> Array.countBy id
Similarly, it seems like overkill to have
div
andmul
functions just so they can be used in a pipeline. Their usage later on can be replaced by a single lambda:fun x -> x / total * 100.0
The
sumTupArray
isn't used and not really necessary. At one point you just do an inlineArray.sumBy snd
instead and that's fine.Regarding the portion you're concerned about: Note that you can do pattern matching in function arguments just like you can in a
let
binding. So this pattern...|> Array.map (fun z -> let e, f = z
... can be replaced with |> Array.map (fun (e, f) ->
The
Array.tryFind
can be replaced withArray.tryFind (fst >> (=) e)
(using the function composition operator>>
and the curried prefix version of=
). If you don't like that you can still just use a lambda:Array.tryFind (fun (f,_) -> e = f)
F# has its own more concise version of
Convert.ToInt64
calledint64
. Ironically, it's less type-safe thanConvert.ToInt64
because it also tries to convert other things like strings and can throw an exception if they're not valid. I would still use it in this case since we know anint
can always be converted toint64
.I think that most people don't use the style of indentation that you have used because it quickly uses of a lot of horizontal space. Instead of this, I would suggest starting a new line and indenting by four spaces. This can mean using a bit more vertical space, but it's worth it in my opinion.
With these suggestions applied, we end up with something a bit more concise and readable:
open System
open System.IO
let numLetterInWord (word:string) = word.ToUpper().ToCharArray() |> Array.countBy id
let rec CalcNextLine (reader:StreamReader) accumulator =
match reader.EndOfStream with
| true -> accumulator
| false ->
let line = reader.ReadLine()
let count = numLetterInWord line
accumulator
|> Array.map (fun ((e, f) as z) ->
match count |> Array.tryFind (fst >> (=) e) with
| Some x -> (e, int64 (snd x) + f)
| None -> z)
|> CalcNextLine reader
let CalcLetterFreqOfFile (path:string) =
let reader = new StreamReader(path)
let start = [| for c in [|'A'..'Z'|] -> (c,0L) |]
CalcNextLine reader start
[<EntryPoint>]
let main _ =
let occurences = @"dictionary.txt" |> CalcLetterFreqOfFile
let total = occurences |> Array.sumBy snd
let freqs = occurences |> Array.map (fun (e, f) ->
let percentage = float f / float total * 100.0
(e, Math.Round(percentage, 2)))
printfn "%A" freqs
printfn "%A" (freqs |> Array.sortByDescending snd)
0
I just realised your original time was 14 minutes, not 14 seconds! Which explains a lot. I found a solution that I think is a good balance of fast and simple:
open System
open System.IO
let countByIdBig xs =
let counts = Collections.Generic.Dictionary<_,_>()
for x in xs do
match counts.TryGetValue x with
| true, c -> counts.[x] <- c + 1L
| false, _ -> counts.[x] <- 1L
counts
|> Seq.map (fun (KeyValue kv) -> kv)
let CalcLetterFreqOfFile (path:string) =
seq {
for line in File.ReadLines path do
for c in line do
if Char.IsLetter c then yield Char.ToUpper c }
|> countByIdBig
The countByIdBig
is similar to Seq.countBy id
but it counts using int64
instead of int32
. It uses a mutable dictionary for performance in a very contained way.
Then it's just a matter of getting a seq
of characters to count and passing them to countByIdBig
.
-
\$\begingroup\$ Excellent feedback, thank you. This project branched off from a completely unrelated project (Affine Cipher Solver), which is why little odds and ends like the ToUpper and line-by-line exist. I appreciate the solution you have provided, as it has helped me understand some of the facets of the F# language. I was, unfortunately, unaware of the StreamReader.Read() method as well. Thank you for your assistance, I appreciate it. \$\endgroup\$Chris Altig– Chris Altig2017年07月08日 17:12:14 +00:00Commented Jul 8, 2017 at 17:12
-
\$\begingroup\$ Hello, after applying your suggested improvements to my code, the time has dropped significantly from what it used to be. From
Real: 00:14:30.479, CPU: 00:14:21.421, GC gen0: 208720, gen1: 96, gen2: 8
toReal: 00:09:40.527, CPU: 00:09:32.984, GC gen0: 177638, gen1: 64, gen2: 5
. The changes were mostly about how the code looks, not how it operates. Would the change tonumLetterInLine
really have made such a large difference? Why is the old way I did things slower by five minutes? I apologize for bombarding you with questions, but I'm very curious. \$\endgroup\$Chris Altig– Chris Altig2017年07月09日 03:19:33 +00:00Commented Jul 9, 2017 at 3:19 -
\$\begingroup\$ The difference probably comes from using .NET's in-built
.ToCharArray()
method instead of[|for c in x -> c|]
. \$\endgroup\$TheQuickBrownFox– TheQuickBrownFox2017年07月09日 14:03:11 +00:00Commented Jul 9, 2017 at 14:03 -
\$\begingroup\$ @ChrisAltig I've added to my answer a simpler way of doing this which allows
int64
counts and is faster than my revised version of your code. \$\endgroup\$TheQuickBrownFox– TheQuickBrownFox2017年07月12日 15:47:46 +00:00Commented Jul 12, 2017 at 15:47