I am running a simple OCaml program which opens a CSV file with a pseudo dict-reader and hashes "key" + "value" (where key and values are strings). Then some counts are evaluated on the hashes (but it is not really relevant for what follows).
After a quick look at the default OCaml profiler (gprof
), I noticed that my program was mostly spending time in hashing elements (I don't know what caml_page_table_lookup
does though).
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
8.27 4.12 4.12 97999951 0.00 0.00 caml_hash
7.53 7.88 3.76 97999951 0.00 0.00 caml_hash_mix_string
5.12 10.43 2.55 136894800 0.00 0.00 caml_page_table_lookup
The hashtables are seldomly called in my code :
let get_indices dict n = Hashtbl.fold (fun k v acc -> ((Hashtbl.hash (k^v)) mod n) :: acc) dict [] ;;
And here :
let dict_reader file_path =
let line_stream = read_lines file_path in
let header = split_line (Stream.next line_stream) in
Stream.from
(fun _ ->
try Some (to_dict header (split_line (Stream.next line_stream))) with End_of_file -> None)
Is there a smart way to optimize these bottlenecks ? I feel like there is something redundant in evaluating hashes and storing values in hashtables...
1 Answer 1
The hash function by itself is already optimized and very fast. It is hard to make it any faster.
In the code snippets, that you provided, I see only one optimization opportunity: instead of first concatenating a key and values and then hashing the result, it might be more optimal to hash the parts separately and then merge them with xor:
Hashtbl.hash k lxor Hashtbl.hash v
Also, other than explicit calls to Hashtbl.hash
each search on a Hashtbl (e.g., Hashtbl.find
also calls the hash function). So you might be the case, that you actually didn't show the actual place, that is responsible for the most calls to hash. (Look at the detailed gprof output, it will provide a context for the calls).
The caml_page_table_lookup
is an internal OCaml function for memory management (it also uses hashing, but it doesn't call caml_hash
function). An optimization with the string concatenation should also lessen the impact from this function.
Finally, the hash function is only responsible for the 8% of the performance cost. It is not the bottleneck. If you think that your program is too slow, then possibly you need to find ways to make it more efficient algorithmically, rather then trying to apply microoptmizations.
If you will post the whole code, it might be possible, that we will point to a real problem in your code, that causes the performance problems.
-
\$\begingroup\$ Wow thanks ! the speedup was impressive. The execution gained roughly 10% - the amount shown in the profiler ! \$\endgroup\$RUser4512– RUser45122016年09月02日 17:09:20 +00:00Commented Sep 2, 2016 at 17:09
-
\$\begingroup\$ I posted the complete version of the code here : codereview.stackexchange.com/questions/141418/… \$\endgroup\$RUser4512– RUser45122016年09月21日 13:58:00 +00:00Commented Sep 21, 2016 at 13:58