I'm very new to Haskell, I've read most of learnyouahaskell.com and played around with some simple things, but this is probably the most 'complicated' bit of Haskell I've written so far. I have an implementation in PHP that does basically the same thing, but it is much much faster.. I'm guessing my bottleneck here is that randomRIO takes a long time to return a value, is there any way to increase the speed, or get an array of values instead of grabbing them one at a time?
That aside, any other tips and/or suggestions about improving my implementation would be very helpful!
module Main where
import System.Environment
import System.Exit
import System.Random (randomRIO)
import Control.Monad (replicateM)
import qualified Data.Map as M
import Text.Regex.Posix
main :: IO ()
main = do
args <- getArgs
let helpOnly = "-h" `elem` args
if helpOnly
then putStrLn usage
else do
password <- generatePassword ("-w" `elem` args)
putStrLn password
exitSuccess
usage :: String
usage = unlines [
"",
"Usage: [ -w, --with-symbols ] [ -h, --help ]",
"",
"By default, generates a 16 character password that does not include symbols.",
"",
"-w, --with-symbols include symbols",
"-h, --help print a brief help message"
]
symbols :: String
symbols = "!$%^&*()-_=+[{]};:@#~,<.>/?"
randomReplace :: String -> String -> IO String
randomReplace [] subject = return subject
randomReplace (replacement:rs) subject = do
randomIndex <- randomRIO (0, length subject - 1) :: IO Int
let hash = zip [0 .. length subject - 1] subject
(randomReplace rs . map snd . M.toList . M.insert randomIndex replacement . M.fromList) hash
generatePassword :: Bool -> IO String
generatePassword withSymbols = do
let passwordLength = 500 -- obviously you wouldn't use 500 as a default here, but I'm just benchmarking
numDigits <- randomRIO (1, passwordLength) :: IO Int
digits <- replicateM numDigits (randomRIO (1, 9) :: IO Int)
numUppercase <- randomRIO (1, passwordLength) :: IO Int
uppercaseLetters <- replicateM numUppercase (randomRIO ('A', 'Z') :: IO Char)
p1 <- replicateM passwordLength (randomRIO ('a', 'z') :: IO Char)
p2 <- randomReplace (concatMap show digits) p1
p3 <- randomReplace uppercaseLetters p2
password <- if withSymbols
then do
numSymbols <- randomRIO (1, passwordLength) :: IO Int
symbolsToReplace <- replicateM numSymbols ((randomRIO (0, length symbols - 1) :: IO Int) >>= (\x -> return $ symbols !! x))
randomReplace symbolsToReplace p3
else return p3
if (password =~ "[a-z]" :: Bool) && (password =~ "[A-Z]" :: Bool) && (password =~ "[0-9]" :: Bool) -- we knows symbols are in there since it went last
then return password
else generatePassword withSymbols
-
\$\begingroup\$ Totally unrelated to your question (which seems good to me!) learnyouahaskell.com seems super cool. \$\endgroup\$anon– anon2016年01月17日 03:54:07 +00:00Commented Jan 17, 2016 at 3:54
-
\$\begingroup\$ @QPaysTaxes good! it was definitely a great resource \$\endgroup\$bruchowski– bruchowski2016年01月17日 03:59:56 +00:00Commented Jan 17, 2016 at 3:59
2 Answers 2
- Too much
IO
. Text.Regex.Posix
is too much for those simple tests. Why notany isUpper
,any isLower
, anyisDigit
?- Your program ignores
--help
and--with-symbols
(have a look atOptParse-Applicative
) randomReplace
has too many temporary structures (see below)- You use
length
too often. Unlike in PHP, it's quite slow (O(n) vs O(1)).
A short search for bottlenecks
I'm guessing my bottleneck here ...
Don't guess, measure. Enable profiling to check where you're actually losing time:
$ stack install --profile random
$ stack install --profile regex-posix
$ stack exec -- ghc -O2 -prof -auto-all PWGen.hs
$ ./PWGen+RTS -s -p
If you don't use stack
, make sure to enable profiling when you install the packages:
$ cabal sandbox init
$ cabal install -p random regex-posix
$ cabal exec -- ghc -O2 -prof -auto-all PWGen.hs
$ ./PWGen +RTS -s -p
This is actually fast, but not fast enough:
9,420,841,024 bytes allocated in the heap 6,425,011,808 bytes copied during GC 1,021,088 bytes maximum residency (2915 sample(s)) 59,672 bytes maximum slop 4 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 15199 colls, 0 par 1.844s 1.971s 0.0001s 0.0011s Gen 1 2915 colls, 0 par 1.188s 1.076s 0.0004s 0.0009s INIT time 0.000s ( 0.001s elapsed) MUT time 1.953s ( 2.001s elapsed) GC time 3.031s ( 3.048s elapsed) RP time 0.000s ( 0.000s elapsed) PROF time 0.000s ( 0.000s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 5.000s ( 5.050s elapsed) %GC time 60.6% (60.4% elapsed) Alloc rate 4,823,470,604 bytes per MUT second Productivity 39.4% of total user, 39.0% of total elapsed
Note: I've changed passwordLength
to 5000
, since it was too quick with 500
. However, the times above don't give enough information where you actually lose that time. That's what -p
was for. PWGen.prof
contains the following data:
COST CENTRE MODULE no. entries %time %alloc %time %alloc MAIN MAIN 59 0 0.0 0.0 100.0 100.0 main Main 119 0 0.8 0.0 99.9 100.0 generatePassword Main 121 1 0.4 0.1 99.1 100.0 randomReplace Main 124 5730 69.5 62.9 98.7 99.9 randomReplace.hash Main 125 5728 29.2 36.9 29.2 36.9 main.helpOnly Main 120 1 0.0 0.0 0.0 0.0
Almost all memory is allocated in randomReplace
. After all, you break both the map and the list apart for every character in replacement
. That's quite costly. Instead, let's try to stay on a single map as long as possible:
-- This is still not an idiomatic version, but better
randomReplace :: String -> String -> IO String
randomReplace rs subject = fmap (map snd . M.toList) $ go rs $ M.fromList $ zip [0..] subject
where
l = length subject
go [] m = return m
go (r:rs) m = do
randomIndex <- randomRIO (0, l - 1) :: IO Int
go rs $ M.insert randomIndex r m
What's the big difference? Well, we're not using length
repeatedly, which is a big plus. length
needs to traverse the whole list to get its result. Also, we're not switching between lists and maps all the time, which gets rid of all intermediate lists.
Running the benchmark again, we gain the following result:
INIT time 0.000s ( 0.001s elapsed) MUT time 0.016s ( 0.021s elapsed) GC time 0.016s ( 0.011s elapsed) RP time 0.000s ( 0.000s elapsed) PROF time 0.000s ( 0.000s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.031s ( 0.033s elapsed) %GC time 50.0% (32.6% elapsed) Alloc rate 2,105,396,736 bytes per MUT second Productivity 50.0% of total user, 47.7% of total elapsed
individual inherited COST CENTRE MODULE no. entries %time %alloc %time %alloc MAIN MAIN 59 0 5.6 0.0 100.0 100.0 main Main 119 0 5.6 0.1 94.4 97.8 generatePassword Main 121 1 55.6 44.7 88.9 97.7 randomReplace Main 124 2 0.0 16.4 33.3 53.0 randomReplace.l Main 126 2 0.0 0.0 0.0 0.0 randomReplace.go Main 125 5521 33.3 36.6 33.3 36.6 main.helpOnly Main 120 1 0.0 0.0 0.0 0.0
Runtime is down to 0.03s from 5s. Keep in mind that this is for passwordLength = 5000
. This is 0.6% of the original runtime. I can even crank passwordLength
up to 100000
and it still takes only 0.7s total.
The reason here is (somewhat) simple. randomReplace
had a bad asymptoticl complexity. If n
is the length of subject
and k
is the length of rs
, you get roughly:
k -- for every character in rs
* ( n -- get the length of subject (every time!)
+ n * log(n) -- create the map
+ log (n) -- insert a character at a random position
+ 2 * n -- zip the list and zip it back
)
Compare this to the new version:
k * (log (n)) -- for every character insert a character into a map
+ n -- get the length once(!)
+ n * log (n) -- create a map once(!)
+ 2 * n -- zip the map and zip it back
So this would be a way to improve your runtime tremendously.
Further places for improvement
generatePassword
is too opaque, it's not clear what you're doing. Try to split it into several sections. For example, all those lists of symbols can be abstracted into
randomListOf :: Random g => Int -> (g, g) -> IO [g]
randomListOf n r = do
l <- randomRIO (1, n)
replicateM l $ randomRIO r
-- ...
generatePassword = do
-- ...
digits <- randomListOf passwordLength ('0','9')
uppercase <- randomListOf passwordLength ('A','Z')
However, you should try to make all your functions pure. That way, you can easily test them later.
Also, you can try to generate passwords that contain at least one digit. It's not possible with random by default, but you can probably achieve something similar to a modified Test.QuickCheck.frequency
.
-
\$\begingroup\$ wow, incredible review -- thanks so much for taking the time! I didn't know about the profiling options available, that's incredibly useful -- I'll also definitely be using optparse-applicative from now on, work on simpler abstractions, and making my Haskell code more idiomatic by example. \$\endgroup\$bruchowski– bruchowski2016年01月17日 17:19:39 +00:00Commented Jan 17, 2016 at 17:19
I'm guessing my bottleneck here is that randomRIO takes a long time to return a value, is there any way to increase the speed, or get an array of values instead of grabbing them one at a time?
Yes there is a way, and not only it is way faster, it is also way simpler:
randomsUpTo :: R.RandomGen g => g -> Int -> [Int]
randomsUpTo seed max = map (`mod` max) $ R.randoms seed
This just gives an infinite list of random numbers up to a max given an initial seed.
Generating a password is now trivial:
randomPassword seed len charset = map (charset !!) randomIndexes
where
randomIndexes = take len (randomsUpTo seed (length charset))
Benchmarking with:
allAscii = map C.chr [32..127]
main = do
seed <- R.newStdGen
let x = randomPassword seed 100000 allAscii
print $ x
Takes \0ドル.27\$ seconds to run.
Running your code for 10000
(that is 10 times less what I ran my code with) took \95ドル\$ seconds.