String replacement in OCaml

Question 1

This is part of my first OCaml program.

Its job is to replace placeholder character pairs with umlauts. Taking the German word Ruebe as an example, the program turns it into Rübe. Other examples are Moewe -> Möwe and aendern -> ändern.

In general, every time the program encounters the character pairs ae, oe, or ue, it turns them into umlauts. There is one exception: if the preceding character is a vowel, the pair is left unchanged. This ensures words like Treue remain as they are (since there is no word Treü).

I've used the library Re2 for regular expressions. Re2 doesn't implement lookarounds, which was my first thought for checking for preceding vowels. This is the reason for the function replace_if_not_after_vowel.

I welcome any suggestions, especially if they help make the code more idiomatic or simpler.

To compile the program, I used the command ocamlbuild -use-ocamlfind -package re2 -package core -tag thread myCompiledFile.byte

open Core.Std
open Re2.Std
open Re2.Infix

(* If the word contains an umlaut placeholder like "ue", we replace it with
   the proper umlaut "ü", unless a vowel directly precedes the placeholder,
   as in "Treue". *)
let placeholders_to_umlauts =
  [("ue", "ü"); ("oe", "ö"); ("ae", "ä"); ("Ue", "Ü"); ("Oe", "Ö"); ("Ae", "Ä")]

(* A regex that matches a vowel *)
let vowel = ~/"[aeiou]"

(* Applies a list of changes to a word *)
let rec apply_changes word changes =
  match changes with
  | [] -> word
  | change :: rest -> apply_changes (change word) rest

(* Replaces replace_this with replacement inside the text if the preceding
   character is not a vowel. Since Re2 doesn't implement lookarounds, we
   can't use a negative lookbehind. *)
let replace_if_not_after_vowel replace_this replacement text =
  Re2.replace_exn ~/replace_this text ~f:(fun regex_match ->
    (* Returns true if there is a vowel at the given position in the text *)
    let is_vowel text pos =
      if pos >= 0 && pos < String.length text then
        let maybe_vowel = String.get text pos in
        Re2.matches vowel (Char.to_string maybe_vowel)
      else false
    in
    (* Get the position in the text where the regex matched *)
    let match_pos, _ = Re2.Match.get_pos_exn ~sub:(`Index 0) regex_match in
    (* Replace the placeholder only if it doesn't follow a vowel *)
    if is_vowel text (match_pos - 1) then replace_this else replacement
  )

let change_word word =
  (* These are the changes that we will apply to the word *)
  let changes = List.map placeholders_to_umlauts ~f:(
    fun (placeholder, umlaut) -> replace_if_not_after_vowel placeholder umlaut
  )
  in
  apply_changes word changes

let () =
  (* We want to change this word into "Übergrößenträgertreue" *)
  let word = "Uebergroeßentraegertreue" in
  let new_word = change_word word in
  printf "new word: %s\n" new_word

Question 2

Why not just replace everything that matches [^aeyoiu] or the beginning of a word, followed by ("ue"|"oe"|"ae"|"Ue"|"Oe"|"Ae")? You don't really need lookaround for that. At the moment you are running multiple regexp checks, but a single big one would do.

Question 3

Have you considered an exception dictionary? Reading "Göthe" would be very strange! :) Also, if you want to easily turn this into a standard Unix filter that supports filtering, reading files, or in-place editing, you can try Gasoline; see the caesar example.

Question 4

For anyone interested in a program performing such conversions, here's the repository to the OCaml code I currently use when writing in German with a non-German keyboard layout: gitlab.com/bullbytes/umlaut-conversion

Question 5

Re version, since it was amusing to do. My version uses a slightly different technique: I build one big regexp, with two groups and replace everything in one go without any additional checking. If Re.replace provided slightly more control (per-group substitution) it would avoid the concatenation.

I used the combinators for building the regexp, instead of the symbolic version, because that's much more readable, really.

let map_to_umlauts =
  [ "ue","ü" ; "oe","ö" ; "ae","ä" ; "Ue","Ü" ; "Oe","Ö" ; "Ae","Ä" ]

let regexp =
  let open Re in compile @@
  seq [
    group @@ alt [ bow ; compl [no_case @@ set "aeiouy"] ] ;
    group @@ alt (List.map (fun (s,_) -> str s) map_to_umlauts) ;
  ]

let replace s =
  let f subs =
    Re.get subs 1 ^ List.assoc (Re.get subs 2) map_to_umlauts
  in
  Re.replace ~f regexp s

let () =
  print_endline @@ replace Sys.argv.(1)

On your version, apart from the change in algorithm, I have only one comment: You really don't need a regexp to check if a char is a vowel. ;)
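As a minimal sketch of that point, a plain pattern match on the character can stand in for the vowel regexp. The helper name is_vowel_at is hypothetical; the vowel set mirrors the question's [aeiou] (lowercase only, no y), and only the standard library is assumed.

```ocaml
(* True if [c] is one of the lowercase vowels matched by the question's
   "[aeiou]" regexp. *)
let is_vowel = function
  | 'a' | 'e' | 'i' | 'o' | 'u' -> true
  | _ -> false

(* Hypothetical helper: bounds-checked variant, like the inner function in
   the question. Out-of-range positions count as "not a vowel". *)
let is_vowel_at text pos =
  pos >= 0 && pos < String.length text && is_vowel text.[pos]

let () =
  Printf.printf "%b %b %b\n"
    (is_vowel_at "Treue" 2)    (* 'e' directly before "ue" *)
    (is_vowel_at "Ruebe" 0)    (* 'R' directly before "ue" *)
    (is_vowel_at "Ruebe" (-1)) (* before the start of the string *)
```

This prints true false false, so the "ue" in Treue would be left alone while the one in Ruebe would be replaced.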

Also, do note that all of this is playing fast and, more importantly, very very loose with Unicode.
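To illustrate the caveat: OCaml's string type is a sequence of bytes, so a UTF-8 encoded umlaut occupies two of them. The sketch below spells the bytes out with escapes so that it does not depend on the source file's encoding; it uses only the standard library.

```ocaml
(* "ü" encoded as UTF-8: two bytes, 0xC3 0xBC. *)
let u_umlaut = "\xc3\xbc"

let () =
  (* String.length counts bytes, not characters. *)
  Printf.printf "length of ü: %d\n" (String.length u_umlaut);
  (* Indexing yields a single byte, i.e. only half of the umlaut. *)
  Printf.printf "first byte: 0x%02X\n" (Char.code u_umlaut.[0]);
  (* ASCII-only case mapping leaves the umlaut untouched. *)
  Printf.printf "unchanged by uppercase_ascii: %b\n"
    (String.uppercase_ascii u_umlaut = u_umlaut)
```

Assuming the regexp engine also matches byte by byte, a character class sees an umlaut as two separate non-vowel bytes. That is harmless for the ASCII placeholders used here, but it would matter as soon as the vowel set had to include non-ASCII letters.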

Question 6

Thanks a lot for your answer! Since I'm new to OCaml, I would very much appreciate it if you could explain what goes on inside seq [...]. Also, how could I make sure that this is safe with regard to Unicode? I'd be happy with a general hint or link.

Question 7

It's really just the equivalent of ([^aeiouy]|bow)("ue"|"oe"|"ae"|"Ue"|"Oe"|"Ae"), but written with Re's combinators. seq is a sequence, alt is a choice between alternatives, compl is the complement of a character set. About Unicode ... I don't really know, unfortunately.
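To make the shape of that pattern concrete without depending on Re, here is a stdlib-only sketch that applies the same rule by hand: a placeholder is replaced when it sits at the start of the string or follows a byte that is not a vowel. This is not the Re implementation from the answer. The names to_umlauts and is_vowel are hypothetical, and start-of-string stands in for bow, on the assumption that any other word start is preceded by a non-vowel anyway.

```ocaml
(* Placeholder/umlaut pairs, as in map_to_umlauts from the answer. *)
let map_to_umlauts =
  [ "ue", "ü"; "oe", "ö"; "ae", "ä"; "Ue", "Ü"; "Oe", "Ö"; "Ae", "Ä" ]

(* Case-insensitive vowel test, mirroring [no_case @@ set "aeiouy"]. *)
let is_vowel c =
  match Char.lowercase_ascii c with
  | 'a' | 'e' | 'i' | 'o' | 'u' | 'y' -> true
  | _ -> false

(* Hypothetical hand-written equivalent of the single-regexp replacement. *)
let to_umlauts s =
  let n = String.length s in
  let buf = Buffer.create n in
  let rec go i =
    if i < n then begin
      (* The two bytes starting at [i], or "" if fewer than two remain. *)
      let candidate = if i + 2 <= n then String.sub s i 2 else "" in
      let after_vowel = i > 0 && is_vowel s.[i - 1] in
      match List.assoc_opt candidate map_to_umlauts with
      | Some umlaut when not after_vowel ->
        (* The "alt" of placeholders matched in an allowed position. *)
        Buffer.add_string buf umlaut;
        go (i + 2)
      | _ ->
        (* No placeholder here, or it follows a vowel: copy one byte. *)
        Buffer.add_char buf s.[i];
        go (i + 1)
    end
  in
  go 0;
  Buffer.contents buf

let () = print_endline (to_umlauts "Uebergroeßentraegertreue")
```

Run on the question's test word, this prints Übergrößenträgertreue: the leading Ue, the oe, and the ae are replaced, while the final ue follows a vowel and stays as it is.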

Drup · Accepted Answer · 2015-10-07 22:34:20Z

