Regular expressions are one of the worst aspects of debugging hell. Since they are contained in string literals, It's hard to comment how they work when the expressions are fairly long.
I have the following regular expression:
\b\d{3}[-.]?\d{3}[-.]?\d{4}\b
Source: RegExr.com
I commented it like this:
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String regexStr = "";
regexStr += "\\b"; //Begin match at the word boundary(whitespace boundary)
regexStr += "\\d{3}"; //Match three digits
regexStr += "[-.]?"; //Optional - Match dash or dot
regexStr += "\\d{3}"; //Match three digits
regexStr += "[-.]?"; //Optional - Match dash or dot
regexStr += "\\d{4}"; //Match four digits
regexStr += "\\b"; //End match at the word boundary(whitespace boundary)
if (args[0].matches(regexStr)) {
System.out.println("Match!");
} else {
System.out.println("No match.");
}
}
}
What would be the best way to go about commenting regular expressions to make them readable for beginners? Is there a better way than what I have shown?
-
2\$\begingroup\$ What language are you embedding the regexes in? Java? Please tag. \$\endgroup\$rolfl– rolfl2014年04月17日 03:44:11 +00:00Commented Apr 17, 2014 at 3:44
-
\$\begingroup\$ @rolfl Edited! I was hoping for a cross-language solution. \$\endgroup\$rootmeanclaire– rootmeanclaire2014年04月17日 03:48:09 +00:00Commented Apr 17, 2014 at 3:48
-
1\$\begingroup\$ While the concept of regular expressions exists in many languages, the syntax varies wildly. A cross-language solution would therefore be hypothetical code, which is off-topic for Code Review. \$\endgroup\$200_success– 200_success2014年04月17日 03:50:22 +00:00Commented Apr 17, 2014 at 3:50
-
1\$\begingroup\$ Isn't better to comment "what the regex should match" and not "the parts of the regex"? a tool online could do it for you \$\endgroup\$Marco Acierno– Marco Acierno2014年04月18日 18:20:09 +00:00Commented Apr 18, 2014 at 18:20
-
\$\begingroup\$ @Marco An excellent point \$\endgroup\$rootmeanclaire– rootmeanclaire2014年04月18日 18:24:38 +00:00Commented Apr 18, 2014 at 18:24
4 Answers 4
This is a difficult topic to pin down; there will be a lot of opinions, without any clear "right" answers. Additionally, a given match might contextually be better-commented in one fashion vs. a different fashion. That said, I will do my best to try and give rationale for making some commenting choices over others.
While I know about embedded regex comments, they tend to make things more confusing, rather than less. Using them causes subtle changes in how whitespace is treated in the regex, and they're visually rather intrusive. Unless you are allowed to pass only a regex around, with no other attendant code or comments, I would avoid using embedded regex comments. The only time I have ever used these is with regexes that were consumed by a client application (I had no other means by which to comment my expressions), and when I had to write some regexes that needed to be carried back and forth between two languages.
Line-by-line commenting in the enclosing language, as in your selection, is the next option. Most programming/scripting environments support breaking strings up onto more than one line and commenting them. This can be somewhat less visually intrusive than directly embedded regex comments, especially given that the regex's whitespace isn't compromised in any way, but there is still additional "noise" overhead in terms of extra quotes and joining syntax (such as + signs in C# and Java, and << in C++). While the underlying strategy is not inherently a bad one, commenting every single atom of the regex is probably too extreme -- try to break the comments down into larger functional groups, and only go atom-by-atom for particularly tricky stuff. There is an inherent downside to the multiline comment scheme, though, and that is not being able to see the whole regex in one piece. What I see happen in practice is that people write the regex in a single place, then come back and re-work it into multiple lines like this as they go through and comment up their finished code. Ironically, the next person tends to wind up putting it all back into one line so they can more readily edit it. :)
I very recently wrote a shell script that did a huge number of complicated regexes. The route that I took was a hybrid -- above my sed commands, I broke down useful matching units and explained them, but the actual pipeline remained in its normal context, as in this example snippet:
#!/bin/bash
# Rename the files to reflect the class of tests they contain.
# head -n5 "$FILE" - Grab the first five lines of the file, which hold (in order) the values for key length, IV length, text length, AAD length, and tag length for all the test entries contained in that file.
# ^.* 0x(.?.) 0x(.?.) - Match the two 1-2 digit hex numbers at the end of the lines
# ibase=16; 2円1円 - Put the bytes back into big-endian, and strip the 0x (prep for BC)
# { while read; do echo $REPLY | bc; done; } - Pipe each line to BC one by one, converting the hex values back to decimal
# :a - Label "a"
# N - Append another line to the buffer
# $!ba - If this is NOT the last line, branch to A
# s/\n/-/g - Replace all the newlines in the processing space with dashes
mv "$FILE" "$BASEFILE"`head -n5 "$FILE" | sed -re 's/^.* 0x(.?.) 0x(.?.)/ibase=16; 2円1円/g' | { while read; do echo $REPLY | bc; done; } | sed -re ':a' -e 'N' -e '$!ba' -e 's/\n/-/g'`.mspd
The upside to this is that you get the benefit of comments tied to specific chunks of regex, while also being able to see all the parts of the regex in their full context. The downside, rather obviously, is that when you update the regex in their full context, you then have to update your comment-copy of those parts to match. In practice, however, I found it easier to alter/fix my regexes in the full context, and then fix up the comments in my "go through and comment up the finished code" phase, than to try and wrangle with the chopped-up regexes.
As with all workflows, your preferences may vary. At the very least, however, I would recommend you comment larger chunks of your regex at more logical points, like so:
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String regexStr = "";
regexStr = "\\b" //Begin match at the word boundary(whitespace boundary)
+ "\\d{3}[-.]?" //Match three digits, then an optional . or - separator (area code)
+ "\\d{3}[-.]?" //Repeat (three more digits)
+ "\\d{4}" //Last four digits
+ "\\b"; //End match at the word boundary(whitespace boundary)
if (args[0].matches(regexStr)) {
System.out.println("Match!");
} else {
System.out.println("No match.");
}
}
}
(Note how the "+" are aligned with the assignment, making it easy to visually track the span, and also keeping them away from both the regex data and the comments).
Or, using my preferred method:
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
// \b Begin match at the word boundary
// \d{3}[-.]? Match three digits, then an optional . or -
// Repeat (first one is area code, second is first three digits of the local number)
// \d{4} Last four digits of the local number
// \b End match at the word boundary
String regexStr = "\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b";
if (args[0].matches(regexStr)) {
System.out.println("Match!");
} else {
System.out.println("No match.");
}
}
}
-
1\$\begingroup\$ Your analysis and advice for Java are excellent! However, I humbly suggest that you post your Bash script as a question for review. ☺ \$\endgroup\$200_success– 200_success2014年04月19日 02:02:01 +00:00Commented Apr 19, 2014 at 2:02
-
\$\begingroup\$ Hah -- the full bash script is a nightmare, and I know it. I figured I wasn't going to cheat and tune up this section of it before posting if I was going to go on about "real world" usage, though. The script is so incredibly special-case (converting NIST GCM AES test vectors into mspdebug sim scripts), that "working and commented" is good enough. I'll put in a note that the Bash script is not a sterling example of how to do things, or clean it up for this post, if you have a suggestion either way. :) \$\endgroup\$Travis Snoozy– Travis Snoozy2014年04月19日 02:24:14 +00:00Commented Apr 19, 2014 at 2:24
-
2\$\begingroup\$ @200_success you now can tear up the entirety of that script on this question, if you so desire. \$\endgroup\$Travis Snoozy– Travis Snoozy2014年04月19日 04:34:02 +00:00Commented Apr 19, 2014 at 4:34
Whatever you do, don't append strings like that! If you look at the generated bytecode, you'll see that the compiler does exactly what you told it to do:
$ javap -c Main
Compiled from "Main.java"
public class Main {
public Main();
Code:
0: aload_0
1: invokespecial #1 // Method java/lang/Object."<init>":()V
4: return
public static void main(java.lang.String[]);
Code:
0: ldc #2 // String
2: astore_1
3: new #3 // class java/lang/StringBuilder
6: dup
7: invokespecial #4 // Method java/lang/StringBuilder."<init>":()V
10: aload_1
11: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
14: ldc #6 // String \b
16: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
19: invokevirtual #7 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
22: astore_1
23: new #3 // class java/lang/StringBuilder
26: dup
27: invokespecial #4 // Method java/lang/StringBuilder."<init>":()V
30: aload_1
31: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
34: ldc #8 // String \d{3}
36: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
39: invokevirtual #7 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
42: astore_1
43: new #3 // class java/lang/StringBuilder
46: dup
47: invokespecial #4 // Method java/lang/StringBuilder."<init>":()V
50: aload_1
51: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
54: ldc #9 // String [-.]?
56: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
59: invokevirtual #7 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
62: astore_1
63: new #3 // class java/lang/StringBuilder
66: dup
67: invokespecial #4 // Method java/lang/StringBuilder."<init>":()V
70: aload_1
71: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
74: ldc #8 // String \d{3}
76: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
79: invokevirtual #7 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
82: astore_1
83: new #3 // class java/lang/StringBuilder
86: dup
87: invokespecial #4 // Method java/lang/StringBuilder."<init>":()V
90: aload_1
91: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
94: ldc #9 // String [-.]?
96: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
99: invokevirtual #7 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
102: astore_1
103: new #3 // class java/lang/StringBuilder
106: dup
107: invokespecial #4 // Method java/lang/StringBuilder."<init>":()V
110: aload_1
111: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
114: ldc #10 // String \d{4}
116: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
119: invokevirtual #7 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
122: astore_1
123: new #3 // class java/lang/StringBuilder
126: dup
127: invokespecial #4 // Method java/lang/StringBuilder."<init>":()V
130: aload_1
131: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
134: ldc #6 // String \b
136: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
139: invokevirtual #7 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
142: astore_1
143: aload_0
144: iconst_0
145: aaload
146: aload_1
147: invokevirtual #11 // Method java/lang/String.matches:(Ljava/lang/String;)Z
150: ifeq 164
153: getstatic #12 // Field java/lang/System.out:Ljava/io/PrintStream;
156: ldc #13 // String Match!
158: invokevirtual #14 // Method java/io/PrintStream.println:(Ljava/lang/String;)V
161: goto 172
164: getstatic #12 // Field java/lang/System.out:Ljava/io/PrintStream;
167: ldc #15 // String No match.
169: invokevirtual #14 // Method java/io/PrintStream.println:(Ljava/lang/String;)V
172: return
}
If you change it to concatenate all the string literals in one statement...
public class Main {
public static void main(String[] args) {
String regexStr =
"\\b" + //Begin match at the word boundary(whitespace boundary)
"\\d{3}" + //Match three digits
"[-.]?" + //Optional - Match dash or dot
"\\d{3}" + //Match three digits
"[-.]?" + //Optional - Match dash or dot
"\\d{4}" + //Match four digits
"\\b"; //End match at the word boundary(whitespace boundary)
if (args[0].matches(regexStr)) {
System.out.println("Match!");
} else {
System.out.println("No match.");
}
}
}
Then the bytecode is much saner (one literal string at offset 0):
$ javap -c Main
Compiled from "Main.java"
public class Main {
public Main();
Code:
0: aload_0
1: invokespecial #1 // Method java/lang/Object."<init>":()V
4: return
public static void main(java.lang.String[]);
Code:
0: ldc #2 // String \b\d{3}[-.]?\d{3}[-.]?\d{4}\b
2: astore_1
3: aload_0
4: iconst_0
5: aaload
6: aload_1
7: invokevirtual #3 // Method java/lang/String.matches:(Ljava/lang/String;)Z
10: ifeq 24
13: getstatic #4 // Field java/lang/System.out:Ljava/io/PrintStream;
16: ldc #5 // String Match!
18: invokevirtual #6 // Method java/io/PrintStream.println:(Ljava/lang/String;)V
21: goto 32
24: getstatic #4 // Field java/lang/System.out:Ljava/io/PrintStream;
27: ldc #7 // String No match.
29: invokevirtual #6 // Method java/io/PrintStream.println:(Ljava/lang/String;)V
32: return
}
-
\$\begingroup\$ My issue isn't that you criticised the performance of my methods; that's welcomed feedback. I only commented because I felt that the core of my question wasn't addressed. \$\endgroup\$rootmeanclaire– rootmeanclaire2014年04月23日 03:14:59 +00:00Commented Apr 23, 2014 at 3:14
-
\$\begingroup\$ That would be addressed in my other answer, which you didn't like. That's OK — people can disagree on aesthetic matters. \$\endgroup\$200_success– 200_success2014年04月23日 03:22:43 +00:00Commented Apr 23, 2014 at 3:22
-
\$\begingroup\$ (A)People usually put everything in one answer. (B)I didn't know you could post two answers to the same question. (C)Sorry if I've come across as stubborn or annoying, its just that I'm new to this SE. \$\endgroup\$rootmeanclaire– rootmeanclaire2014年04月23日 03:27:04 +00:00Commented Apr 23, 2014 at 3:27
The regex-native way for commenting is using the /x
(free spacing or extended mode) flag.
In many languages there is support for multiline strings, so your regular expression will look like:
regex = /\b # Begin match at the word boundary(whitespace boundary)
\d{3} # Match three digits
[-.]? # Optional - Match dash or dot
\d{3} # Match three digits
[-.]? # Optional - Match dash or dot
\d{4} # Match four digits
\b # End match at the word boundary(whitespace boundary)
/x
In java, however, there is still no way to do this, so you can either comment it as you did, or otherwise carefully construct your string with \n
between each line:
String regex = "\\b # Begin match at the word boundary(whitespace boundary)\n" +
"\\d{3} # Match three digits\n" +
"[-.]? # Optional - Match dash or dot\n" +
"\\d{3} # Match three digits\n" +
"[-.]? # Optional - Match dash or dot\n" +
"\\d{4} # Match four digits\n" +
"\\b # End match at the word boundary(whitespace boundary)";
// Of course, you need to compile with `Pattern.COMMENTS`
Pattern pattern = Pattern.compile(regex, Pattern.COMMENTS);
if (pattern.matcher(args[0]).matches()) {
System.out.println("Match!");
} else {
System.out.println("No match.");
}
The above has no advantage to your suggestion on how to comment your regex, so I guess it a matter of taste...
I recommend against writing comments on your regular expressions like that, for several reasons.
- Readability: Just the act of splitting up the string makes it harder to read. Compactness is also a virtue. Regular expressions are punctuation-heavy enough to begin with; the extra
"
and+
symbols make things worse. - Redundancy: You're stating the same thing twice on every line. To anyone who understands regular expressions, you've just made the code more verbose.
Lack of insight: Comments that describe why are more valuable than comments that describe what. A comment like this would be more helpful:
String regexStr = "\\b" + "\\d{3}" + // 3-digit area code ...
What would be most beneficial, I think, is just one comment that describes the intention of the entire regular expression, and a self-documenting variable name.
// A 10-digit phone number, optionally delimited into groups of 3, 3, 4 digits
// by hyphens or dots. We also check for word boundaries before and after.
Pattern phonePattern = Pattern.compile("\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b");
I think that this comment is at least as informative as your original, without suffering from the disadvantages I mentioned. With that description, even a complete novice to regular expressions should be able to figure out what each part of the regex does.
-
1\$\begingroup\$ I disagree with your "readability" argument; splitting up a string makes it easier to assess each of its parts individually when debugging or improving. \$\endgroup\$rootmeanclaire– rootmeanclaire2014年04月23日 02:52:33 +00:00Commented Apr 23, 2014 at 2:52