2
\$\begingroup\$

I am running a simple OCaml program which opens a CSV file with a pseudo dict-reader and hashes "key" + "value" (where key and values are strings). Then some counts are evaluated on the hashes (but it is not really relevant for what follows).

After a quick look at the default OCaml profiler (gprof), I noticed that my program was mostly spending time in hashing elements (I don't know what caml_page_table_lookup does though).

Each sample counts as 0.01 seconds.
 % cumulative self self total 
 time seconds seconds calls s/call s/call name 
 8.27 4.12 4.12 97999951 0.00 0.00 caml_hash
 7.53 7.88 3.76 97999951 0.00 0.00 caml_hash_mix_string
 5.12 10.43 2.55 136894800 0.00 0.00 caml_page_table_lookup

The hashtables are seldomly called in my code :

let get_indices dict n = Hashtbl.fold (fun k v acc -> ((Hashtbl.hash (k^v)) mod n) :: acc) dict [] ;;

And here :

let dict_reader file_path = 
 let line_stream = read_lines file_path in
 let header = split_line (Stream.next line_stream) in
 Stream.from
 (fun _ ->
 try Some (to_dict header (split_line (Stream.next line_stream))) with End_of_file -> None)

Is there a smart way to optimize these bottlenecks ? I feel like there is something redundant in evaluating hashes and storing values in hashtables...

asked Sep 2, 2016 at 16:05
\$\endgroup\$

1 Answer 1

3
\$\begingroup\$

The hash function by itself is already optimized and very fast. It is hard to make it any faster.

In the code snippets, that you provided, I see only one optimization opportunity: instead of first concatenating a key and values and then hashing the result, it might be more optimal to hash the parts separately and then merge them with xor:

 Hashtbl.hash k lxor Hashtbl.hash v

Also, other than explicit calls to Hashtbl.hash each search on a Hashtbl (e.g., Hashtbl.find also calls the hash function). So you might be the case, that you actually didn't show the actual place, that is responsible for the most calls to hash. (Look at the detailed gprof output, it will provide a context for the calls).

The caml_page_table_lookup is an internal OCaml function for memory management (it also uses hashing, but it doesn't call caml_hash function). An optimization with the string concatenation should also lessen the impact from this function.

Finally, the hash function is only responsible for the 8% of the performance cost. It is not the bottleneck. If you think that your program is too slow, then possibly you need to find ways to make it more efficient algorithmically, rather then trying to apply microoptmizations.

If you will post the whole code, it might be possible, that we will point to a real problem in your code, that causes the performance problems.

answered Sep 2, 2016 at 16:53
\$\endgroup\$
2
  • \$\begingroup\$ Wow thanks ! the speedup was impressive. The execution gained roughly 10% - the amount shown in the profiler ! \$\endgroup\$ Commented Sep 2, 2016 at 17:09
  • \$\begingroup\$ I posted the complete version of the code here : codereview.stackexchange.com/questions/141418/… \$\endgroup\$ Commented Sep 21, 2016 at 13:58

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.