This is part of my first OCaml program.
Its job is to replace a set of placeholder characters with umlauts. Taking the German word Ruebe as an example, the program turns it into Rübe. Other examples are Moewe -> Möwe and aendern -> ändern.
Generally, every time the program encounters the characters ae, oe, or ue, it turns them into umlauts. There is an exception though: if the preceding character is a vowel, it does not change the word. This ensures words like Treue remain as they are (since there is no word Treü).
I've used the library Re2 for regular expressions. Re2 doesn't implement lookarounds, which was my first thought for checking for preceding vowels. This is the reason for the function replace_if_not_after_vowel.
I welcome all suggestions, especially if they help make the code more idiomatic or simpler.
To compile the program, I used the command ocamlbuild -use-ocamlfind -package re2 -package core -tag thread myCompiledFile.byte
open Core.Std
open Re2.Std
open Re2.Infix

(* If the word contains an umlaut placeholder like "ue" we replace that with the proper umlaut "ü". Except if there is a vowel directly before the placeholder like in "Treue" *)
let placeholders_to_umlauts = [("ue", "ü"); ("oe","ö"); ("ae","ä"); ("Ue", "Ü"); ("Oe","Ö"); ("Ae","Ä")]

(* A regex that matches a vowel *)
let vowel = ~/"[aeiou]"

(* Applies a list of changes to a word *)
let rec apply_changes word changes =
  match changes with
  | [] -> word
  | change :: rest -> apply_changes (change word) rest

(* Replaces replace_this with replacement inside the text if the preceding character is not a vowel. Since Re2 doesn't implement lookarounds we can't use a negative lookbehind *)
let replace_if_not_after_vowel replace_this replacement text =
  Re2.replace_exn ~/replace_this text ~f:(fun regex_match ->
    (* Returns true if there is a vowel at the given position in the text *)
    let is_vowel text pos =
      if pos >= 0 && pos < String.length text then
        let maybe_vowel = String.get text pos in
        Re2.matches vowel (Char.to_string maybe_vowel)
      else false
    in
    (* Get the position in the text where the regex matched *)
    let match_pos, _ = Re2.Match.get_pos_exn ~sub:(`Index 0) regex_match in
    (* Replace the placeholder if it doesn't follow a vowel *)
    if is_vowel text (match_pos - 1) then replace_this else replacement
  )

let change_word word =
  (* These are the changes that we will apply to the word *)
  let changes = List.map placeholders_to_umlauts ~f:(
    fun (placeholder, umlaut) -> replace_if_not_after_vowel placeholder umlaut
  )
  in
  apply_changes word changes

let () =
  (* We want to change this word into "Übergrößenträgertreue" *)
  let word = "Uebergroeßentraegertreue" in
  let new_word = change_word word in
  printf "new word: %s\n" new_word
1 Answer
Here is a version using Re, since it was amusing to do. My version uses a slightly different technique: I build one big regexp with two groups and replace everything in one go, without any additional checking. If Re.replace provided slightly more control (per-group substitution), it would avoid the concatenation.
I used the combinators for building the regexp instead of the symbolic version, because that's much more readable.
let map_to_umlauts =
  [ "ue","ü" ; "oe","ö" ; "ae","ä" ; "Ue","Ü" ; "Oe","Ö" ; "Ae","Ä" ]

let regexp =
  let open Re in compile @@
  seq [
    group @@ alt [ bow ; compl [no_case @@ set "aeiouy"] ] ;
    group @@ alt (List.map (fun (s,_) -> str s) map_to_umlauts) ;
  ]

let replace s =
  let f subs =
    Re.get subs 1 ^ List.assoc (Re.get subs 2) map_to_umlauts
  in
  Re.replace ~f regexp s

let () =
  print_endline @@ replace Sys.argv.(1)
On your version, apart from the change in algorithm, I have only one comment: you really don't need a regexp to check if a char is a vowel. ;)
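As a minimal sketch of that point, the vowel test can be a plain pattern match on the character. It uses only the standard library. The name is_vowel mirrors the helper in the question, is_vowel_at is a made-up name for this illustration, and counting only lowercase a, e, i, o, u as vowels is an assumption taken from the question's "[aeiou]" regex (the Re version above also counts y and ignores case).

```ocaml
(* Sketch: check for a vowel with a pattern match instead of a regexp.
   Assumption: only lowercase ASCII vowels count, as in the question's
   "[aeiou]" regex. *)
let is_vowel (c : char) : bool =
  match c with
  | 'a' | 'e' | 'i' | 'o' | 'u' -> true
  | _ -> false

(* Bounds-checked variant that looks at a position in a string, like the
   helper in the question. is_vowel_at is a hypothetical name. *)
let is_vowel_at (text : string) (pos : int) : bool =
  pos >= 0 && pos < String.length text && is_vowel text.[pos]

let () =
  (* In "Treue" the placeholder "ue" starts at index 3, right after 'e'. *)
  assert (is_vowel_at "Treue" 2);
  (* In "Ruebe" the placeholder starts at index 1, right after 'R'. *)
  assert (not (is_vowel_at "Ruebe" 0));
  (* Out-of-range positions are simply "not a vowel". *)
  assert (not (is_vowel_at "Ruebe" (-1)))
```

Besides being shorter, this avoids running a regexp match on a one-character string for every placeholder found.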
Also, do note that all of this is playing fast and, more importantly, very, very loose with Unicode.
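To make the Unicode caveat a bit more concrete, here is a small sketch using only the standard library. It assumes the text is UTF-8 encoded, which neither the question nor this answer states. OCaml strings are sequences of bytes, so an umlaut takes two bytes and index-based operations see bytes rather than letters.

```ocaml
(* Sketch, assuming UTF-8 encoded text: OCaml's string type is a sequence
   of bytes, not of Unicode characters. *)
let u_umlaut = "\xc3\xbc" (* the UTF-8 encoding of the letter ü *)

let () =
  (* One visible letter, but two bytes. *)
  assert (String.length u_umlaut = 2);
  (* "Rübe" has four letters but five bytes. *)
  assert (String.length ("R" ^ u_umlaut ^ "be") = 5);
  (* ASCII-only case conversion leaves the non-ASCII bytes untouched. *)
  assert (String.uppercase_ascii u_umlaut = u_umlaut)
```

One consequence: any check that looks at "the preceding character" by index, such as match_pos - 1 in the question, really looks at the preceding byte, which for a non-ASCII letter is only the second half of that letter.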
Thanks a lot for your answer! Being new to OCaml, I would very much appreciate it if you could explain what goes on inside seq [...]. Also, how could I make sure that this is safe regarding Unicode? I'd be happy about a general hint or link. - Matthias Braun, Oct 8, 2015 at 8:04
It's really just the equivalent of ([^aeiouy]|bow)("ue"|"oe"|"ae"|"Ue"|"Oe"|"Ae"), but written with Re's combinators. seq is a sequence, alt is alternative choices, compl is the complement. About Unicode ... I don't really know, unfortunately. - Drup, Oct 8, 2015 at 10:53
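A small sketch of how these combinators compose, assuming the re library is installed (for example through opam) and linked. The names prefix, placeholder and re are made up for this illustration, the vowel set is spelled out in both cases instead of using no_case, and Re.Group.get is, to my knowledge, the current spelling of the Re.get accessor used in the answer.

```ocaml
(* Sketch using the Re library's combinators (requires the "re" package). *)

(* alt: either the beginning of a word (bow), or one character that is NOT
   in the given set (compl is the complement of a character set). *)
let prefix = Re.(alt [ bow; compl [ set "aeiouyAEIOUY" ] ])

(* alt over literal strings: any one of the placeholders. *)
let placeholder =
  Re.(alt [ str "ue"; str "oe"; str "ae"; str "Ue"; str "Oe"; str "Ae" ])

(* seq: the prefix followed by the placeholder. group makes each part
   retrievable after a match. *)
let re = Re.(compile (seq [ group prefix; group placeholder ]))

let () =
  (* "Ruebe": 'R' is not a vowel, so "Rue" matches. *)
  let groups = Re.exec re "Ruebe" in
  assert (Re.Group.get groups 1 = "R");
  assert (Re.Group.get groups 2 = "ue");
  (* "Treue": the only "ue" follows the vowel 'e', so nothing matches. *)
  assert (not (Re.execp re "Treue"))
```

Because the first group consumes the character before the placeholder, the replacement function has to put that character back, which is the concatenation mentioned in the answer.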
("ue"|"oe"|"ae"|"Ue"|"Oe"|"Ae","Ä")? You don't really need lookaround for that. At the moment, you are doing multiple regexp checks, but you could do a single big one.