Regex to match phone numbers, with comments

Question 1

Regular expressions are one of the worst aspects of debugging hell. Since they are contained in string literals, It's hard to comment how they work when the expressions are fairly long.

I have the following regular expression:

\b\d{3}[-.]?\d{3}[-.]?\d{4}\b

Source: RegExr.com

I commented it like this:

import java.util.regex.Pattern;
public class Main {
 public static void main(String[] args) {
 String regexStr = "";
 regexStr += "\\b"; //Begin match at the word boundary(whitespace boundary)
 regexStr += "\\d{3}"; //Match three digits
 regexStr += "[-.]?"; //Optional - Match dash or dot
 regexStr += "\\d{3}"; //Match three digits
 regexStr += "[-.]?"; //Optional - Match dash or dot
 regexStr += "\\d{4}"; //Match four digits
 regexStr += "\\b"; //End match at the word boundary(whitespace boundary)
 if (args[0].matches(regexStr)) {
 System.out.println("Match!");
 } else {
 System.out.println("No match.");
 }
 }
}

What would be the best way to go about commenting regular expressions to make them readable for beginners? Is there a better way than what I have shown?

Question 2

What language are you embedding the regexes in? Java? Please tag.

Question 3

@rolfl Edited! I was hoping for a cross-language solution.

Question 4

While the concept of regular expressions exists in many languages, the syntax varies wildly. A cross-language solution would therefore be hypothetical code, which is off-topic for Code Review.

Question 5

Isn't better to comment "what the regex should match" and not "the parts of the regex"? a tool online could do it for you

Question 6

@Marco An excellent point

Question 7

This is a difficult topic to pin down; there will be a lot of opinions, without any clear "right" answers. Additionally, a given match might contextually be better-commented in one fashion vs. a different fashion. That said, I will do my best to try and give rationale for making some commenting choices over others.

While I know about embedded regex comments, they tend to make things more confusing, rather than less. Using them causes subtle changes in how whitespace is treated in the regex, and they're visually rather intrusive. Unless you are allowed to pass only a regex around, with no other attendant code or comments, I would avoid using embedded regex comments. The only time I have ever used these is with regexes that were consumed by a client application (I had no other means by which to comment my expressions), and when I had to write some regexes that needed to be carried back and forth between two languages.

Line-by-line commenting in the enclosing language, as in your selection, is the next option. Most programming/scripting environments support breaking strings up onto more than one line and commenting them. This can be somewhat less visually intrusive than directly embedded regex comments, especially given that the regex's whitespace isn't compromised in any way, but there is still additional "noise" overhead in terms of extra quotes and joining syntax (such as + signs in C# and Java, and << in C++). While the underlying strategy is not inherently a bad one, commenting every single atom of the regex is probably too extreme -- try to break the comments down into larger functional groups, and only go atom-by-atom for particularly tricky stuff. There is an inherent downside to the multiline comment scheme, though, and that is not being able to see the whole regex in one piece. What I see happen in practice is that people write the regex in a single place, then come back and re-work it into multiple lines like this as they go through and comment up their finished code. Ironically, the next person tends to wind up putting it all back into one line so they can more readily edit it. :)

I very recently wrote a shell script that did a huge number of complicated regexes. The route that I took was a hybrid -- above my sed commands, I broke down useful matching units and explained them, but the actual pipeline remained in its normal context, as in this example snippet:

#!/bin/bash
# Rename the files to reflect the class of tests they contain.
# head -n5 "$FILE" - Grab the first five lines of the file, which hold (in order) the values for key length, IV length, text length, AAD length, and tag length for all the test entries contained in that file.
# ^.* 0x(.?.) 0x(.?.) - Match the two 1-2 digit hex numbers at the end of the lines
# ibase=16; 2円1円 - Put the bytes back into big-endian, and strip the 0x (prep for BC)
# { while read; do echo $REPLY | bc; done; } - Pipe each line to BC one by one, converting the hex values back to decimal
# :a - Label "a"
# N - Append another line to the buffer
# $!ba - If this is NOT the last line, branch to A
# s/\n/-/g - Replace all the newlines in the processing space with dashes
mv "$FILE" "$BASEFILE"`head -n5 "$FILE" | sed -re 's/^.* 0x(.?.) 0x(.?.)/ibase=16; 2円1円/g' | { while read; do echo $REPLY | bc; done; } | sed -re ':a' -e 'N' -e '$!ba' -e 's/\n/-/g'`.mspd

The upside to this is that you get the benefit of comments tied to specific chunks of regex, while also being able to see all the parts of the regex in their full context. The downside, rather obviously, is that when you update the regex in their full context, you then have to update your comment-copy of those parts to match. In practice, however, I found it easier to alter/fix my regexes in the full context, and then fix up the comments in my "go through and comment up the finished code" phase, than to try and wrangle with the chopped-up regexes.

As with all workflows, your preferences may vary. At the very least, however, I would recommend you comment larger chunks of your regex at more logical points, like so:

import java.util.regex.Pattern;
public class Main {
 public static void main(String[] args) {
 String regexStr = "";
 regexStr = "\\b" //Begin match at the word boundary(whitespace boundary)
 + "\\d{3}[-.]?" //Match three digits, then an optional . or - separator (area code)
 + "\\d{3}[-.]?" //Repeat (three more digits)
 + "\\d{4}" //Last four digits
 + "\\b"; //End match at the word boundary(whitespace boundary)
 if (args[0].matches(regexStr)) {
 System.out.println("Match!");
 } else {
 System.out.println("No match.");
 }
 }
}

(Note how the "+" are aligned with the assignment, making it easy to visually track the span, and also keeping them away from both the regex data and the comments).

Or, using my preferred method:

import java.util.regex.Pattern;
public class Main {
 public static void main(String[] args) {
 // \b Begin match at the word boundary
 // \d{3}[-.]? Match three digits, then an optional . or -
 // Repeat (first one is area code, second is first three digits of the local number)
 // \d{4} Last four digits of the local number
 // \b End match at the word boundary
 String regexStr = "\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b";
 if (args[0].matches(regexStr)) {
 System.out.println("Match!");
 } else {
 System.out.println("No match.");
 }
 }
}

Question 8

Your analysis and advice for Java are excellent! However, I humbly suggest that you post your Bash script as a question for review. ☺

Question 9

Hah -- the full bash script is a nightmare, and I know it. I figured I wasn't going to cheat and tune up this section of it before posting if I was going to go on about "real world" usage, though. The script is so incredibly special-case (converting NIST GCM AES test vectors into mspdebug sim scripts), that "working and commented" is good enough. I'll put in a note that the Bash script is not a sterling example of how to do things, or clean it up for this post, if you have a suggestion either way. :)

Question 10

@200_success you now can tear up the entirety of that script on this question, if you so desire.

Question 11

Whatever you do, don't append strings like that! If you look at the generated bytecode, you'll see that the compiler does exactly what you told it to do:

$ javap -c Main
Compiled from "Main.java"
public class Main {
 public Main();
 Code:
 0: aload_0 
 1: invokespecial #1 // Method java/lang/Object."<init>":()V
 4: return 
 public static void main(java.lang.String[]);
 Code:
 0: ldc #2 // String 
 2: astore_1 
 3: new #3 // class java/lang/StringBuilder
 6: dup 
 7: invokespecial #4 // Method java/lang/StringBuilder."<init>":()V
 10: aload_1 
 11: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
 14: ldc #6 // String \b
 16: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
 19: invokevirtual #7 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
 22: astore_1 
 23: new #3 // class java/lang/StringBuilder
 26: dup 
 27: invokespecial #4 // Method java/lang/StringBuilder."<init>":()V
 30: aload_1 
 31: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
 34: ldc #8 // String \d{3}
 36: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
 39: invokevirtual #7 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
 42: astore_1 
 43: new #3 // class java/lang/StringBuilder
 46: dup 
 47: invokespecial #4 // Method java/lang/StringBuilder."<init>":()V
 50: aload_1 
 51: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
 54: ldc #9 // String [-.]?
 56: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
 59: invokevirtual #7 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
 62: astore_1 
 63: new #3 // class java/lang/StringBuilder
 66: dup 
 67: invokespecial #4 // Method java/lang/StringBuilder."<init>":()V
 70: aload_1 
 71: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
 74: ldc #8 // String \d{3}
 76: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
 79: invokevirtual #7 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
 82: astore_1 
 83: new #3 // class java/lang/StringBuilder
 86: dup 
 87: invokespecial #4 // Method java/lang/StringBuilder."<init>":()V
 90: aload_1 
 91: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
 94: ldc #9 // String [-.]?
 96: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
 99: invokevirtual #7 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
 102: astore_1 
 103: new #3 // class java/lang/StringBuilder
 106: dup 
 107: invokespecial #4 // Method java/lang/StringBuilder."<init>":()V
 110: aload_1 
 111: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
 114: ldc #10 // String \d{4}
 116: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
 119: invokevirtual #7 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
 122: astore_1 
 123: new #3 // class java/lang/StringBuilder
 126: dup 
 127: invokespecial #4 // Method java/lang/StringBuilder."<init>":()V
 130: aload_1 
 131: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
 134: ldc #6 // String \b
 136: invokevirtual #5 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
 139: invokevirtual #7 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
 142: astore_1 
 143: aload_0 
 144: iconst_0 
 145: aaload 
 146: aload_1 
 147: invokevirtual #11 // Method java/lang/String.matches:(Ljava/lang/String;)Z
 150: ifeq 164
 153: getstatic #12 // Field java/lang/System.out:Ljava/io/PrintStream;
 156: ldc #13 // String Match!
 158: invokevirtual #14 // Method java/io/PrintStream.println:(Ljava/lang/String;)V
 161: goto 172
 164: getstatic #12 // Field java/lang/System.out:Ljava/io/PrintStream;
 167: ldc #15 // String No match.
 169: invokevirtual #14 // Method java/io/PrintStream.println:(Ljava/lang/String;)V
 172: return 
}

If you change it to concatenate all the string literals in one statement...

public class Main {
 public static void main(String[] args) {
 String regexStr =
 "\\b" + //Begin match at the word boundary(whitespace boundary)
 "\\d{3}" + //Match three digits
 "[-.]?" + //Optional - Match dash or dot
 "\\d{3}" + //Match three digits
 "[-.]?" + //Optional - Match dash or dot
 "\\d{4}" + //Match four digits
 "\\b"; //End match at the word boundary(whitespace boundary)
 if (args[0].matches(regexStr)) {
 System.out.println("Match!");
 } else {
 System.out.println("No match.");
 }
 }
}

Then the bytecode is much saner (one literal string at offset 0):

$ javap -c Main
Compiled from "Main.java"
public class Main {
 public Main();
 Code:
 0: aload_0 
 1: invokespecial #1 // Method java/lang/Object."<init>":()V
 4: return 
 public static void main(java.lang.String[]);
 Code:
 0: ldc #2 // String \b\d{3}[-.]?\d{3}[-.]?\d{4}\b
 2: astore_1 
 3: aload_0 
 4: iconst_0 
 5: aaload 
 6: aload_1 
 7: invokevirtual #3 // Method java/lang/String.matches:(Ljava/lang/String;)Z
 10: ifeq 24
 13: getstatic #4 // Field java/lang/System.out:Ljava/io/PrintStream;
 16: ldc #5 // String Match!
 18: invokevirtual #6 // Method java/io/PrintStream.println:(Ljava/lang/String;)V
 21: goto 32
 24: getstatic #4 // Field java/lang/System.out:Ljava/io/PrintStream;
 27: ldc #7 // String No match.
 29: invokevirtual #6 // Method java/io/PrintStream.println:(Ljava/lang/String;)V
 32: return 
}

Question 12

My issue isn't that you criticised the performance of my methods; that's welcomed feedback. I only commented because I felt that the core of my question wasn't addressed.

Question 13

That would be addressed in my other answer, which you didn't like. That's OK — people can disagree on aesthetic matters.

Question 14

(A)People usually put everything in one answer. (B)I didn't know you could post two answers to the same question. (C)Sorry if I've come across as stubborn or annoying, its just that I'm new to this SE.

Question 15

The regex-native way for commenting is using the /x (free spacing or extended mode) flag.

In many languages there is support for multiline strings, so your regular expression will look like:

regex = /\b # Begin match at the word boundary(whitespace boundary)
 \d{3} # Match three digits
 [-.]? # Optional - Match dash or dot
 \d{3} # Match three digits
 [-.]? # Optional - Match dash or dot
 \d{4} # Match four digits
 \b # End match at the word boundary(whitespace boundary)
 /x

In java, however, there is still no way to do this, so you can either comment it as you did, or otherwise carefully construct your string with \n between each line:

String regex = "\\b # Begin match at the word boundary(whitespace boundary)\n" +
 "\\d{3} # Match three digits\n" +
 "[-.]? # Optional - Match dash or dot\n" +
 "\\d{3} # Match three digits\n" +
 "[-.]? # Optional - Match dash or dot\n" +
 "\\d{4} # Match four digits\n" +
 "\\b # End match at the word boundary(whitespace boundary)";
// Of course, you need to compile with `Pattern.COMMENTS`
Pattern pattern = Pattern.compile(regex, Pattern.COMMENTS);
if (pattern.matcher(args[0]).matches()) {
 System.out.println("Match!");
} else {
 System.out.println("No match.");
}

The above has no advantage to your suggestion on how to comment your regex, so I guess it a matter of taste...

Question 16

I recommend against writing comments on your regular expressions like that, for several reasons.

Readability: Just the act of splitting up the string makes it harder to read. Compactness is also a virtue. Regular expressions are punctuation-heavy enough to begin with; the extra " and + symbols make things worse.
Redundancy: You're stating the same thing twice on every line. To anyone who understands regular expressions, you've just made the code more verbose.
Lack of insight: Comments that describe why are more valuable than comments that describe what. A comment like this would be more helpful:
```
String regexStr =
 "\\b" +
 "\\d{3}" + // 3-digit area code
 ...
```

What would be most beneficial, I think, is just one comment that describes the intention of the entire regular expression, and a self-documenting variable name.

// A 10-digit phone number, optionally delimited into groups of 3, 3, 4 digits
// by hyphens or dots. We also check for word boundaries before and after.
Pattern phonePattern = Pattern.compile("\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b");

I think that this comment is at least as informative as your original, without suffering from the disadvantages I mentioned. With that description, even a complete novice to regular expressions should be able to figure out what each part of the regex does.

Question 17

I disagree with your "readability" argument; splitting up a string makes it easier to assess each of its parts individually when debugging or improving.

Travis Snoozy Travis Snoozy 9156 silver badges13 bronze badges · Accepted Answer · 2014-04-18 19:02:33Z

This is a difficult topic to pin down; there will be a lot of opinions, without any clear "right" answers. Additionally, a given match might contextually be better-commented in one fashion vs. a different fashion. That said, I will do my best to try and give rationale for making some commenting choices over others.

While I know about embedded regex comments, they tend to make things more confusing, rather than less. Using them causes subtle changes in how whitespace is treated in the regex, and they're visually rather intrusive. Unless you are allowed to pass only a regex around, with no other attendant code or comments, I would avoid using embedded regex comments. The only time I have ever used these is with regexes that were consumed by a client application (I had no other means by which to comment my expressions), and when I had to write some regexes that needed to be carried back and forth between two languages.

Line-by-line commenting in the enclosing language, as in your selection, is the next option. Most programming/scripting environments support breaking strings up onto more than one line and commenting them. This can be somewhat less visually intrusive than directly embedded regex comments, especially given that the regex's whitespace isn't compromised in any way, but there is still additional "noise" overhead in terms of extra quotes and joining syntax (such as + signs in C# and Java, and << in C++). While the underlying strategy is not inherently a bad one, commenting every single atom of the regex is probably too extreme -- try to break the comments down into larger functional groups, and only go atom-by-atom for particularly tricky stuff. There is an inherent downside to the multiline comment scheme, though, and that is not being able to see the whole regex in one piece. What I see happen in practice is that people write the regex in a single place, then come back and re-work it into multiple lines like this as they go through and comment up their finished code. Ironically, the next person tends to wind up putting it all back into one line so they can more readily edit it. :)

I very recently wrote a shell script that did a huge number of complicated regexes. The route that I took was a hybrid -- above my sed commands, I broke down useful matching units and explained them, but the actual pipeline remained in its normal context, as in this example snippet:

#!/bin/bash
# Rename the files to reflect the class of tests they contain.
# head -n5 "$FILE" - Grab the first five lines of the file, which hold (in order) the values for key length, IV length, text length, AAD length, and tag length for all the test entries contained in that file.
# ^.* 0x(.?.) 0x(.?.) - Match the two 1-2 digit hex numbers at the end of the lines
# ibase=16; 2円1円 - Put the bytes back into big-endian, and strip the 0x (prep for BC)
# { while read; do echo $REPLY | bc; done; } - Pipe each line to BC one by one, converting the hex values back to decimal
# :a - Label "a"
# N - Append another line to the buffer
# $!ba - If this is NOT the last line, branch to A
# s/\n/-/g - Replace all the newlines in the processing space with dashes
mv "$FILE" "$BASEFILE"`head -n5 "$FILE" | sed -re 's/^.* 0x(.?.) 0x(.?.)/ibase=16; 2円1円/g' | { while read; do echo $REPLY | bc; done; } | sed -re ':a' -e 'N' -e '$!ba' -e 's/\n/-/g'`.mspd

The upside to this is that you get the benefit of comments tied to specific chunks of regex, while also being able to see all the parts of the regex in their full context. The downside, rather obviously, is that when you update the regex in their full context, you then have to update your comment-copy of those parts to match. In practice, however, I found it easier to alter/fix my regexes in the full context, and then fix up the comments in my "go through and comment up the finished code" phase, than to try and wrangle with the chopped-up regexes.

As with all workflows, your preferences may vary. At the very least, however, I would recommend you comment larger chunks of your regex at more logical points, like so:

import java.util.regex.Pattern;
public class Main {
 public static void main(String[] args) {
 String regexStr = "";
 regexStr = "\\b" //Begin match at the word boundary(whitespace boundary)
 + "\\d{3}[-.]?" //Match three digits, then an optional . or - separator (area code)
 + "\\d{3}[-.]?" //Repeat (three more digits)
 + "\\d{4}" //Last four digits
 + "\\b"; //End match at the word boundary(whitespace boundary)
 if (args[0].matches(regexStr)) {
 System.out.println("Match!");
 } else {
 System.out.println("No match.");
 }
 }
}

(Note how the "+" are aligned with the assignment, making it easy to visually track the span, and also keeping them away from both the regex data and the comments).

Or, using my preferred method:

import java.util.regex.Pattern;
public class Main {
 public static void main(String[] args) {
 // \b Begin match at the word boundary
 // \d{3}[-.]? Match three digits, then an optional . or -
 // Repeat (first one is area code, second is first three digits of the local number)
 // \d{4} Last four digits of the local number
 // \b End match at the word boundary
 String regexStr = "\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b";
 if (args[0].matches(regexStr)) {
 System.out.println("Match!");
 } else {
 System.out.println("No match.");
 }
 }
}

Your analysis and advice for Java are excellent! However, I humbly suggest that you post your Bash script as a question for review. ☺
Hah -- the full bash script is a nightmare, and I know it. I figured I wasn't going to cheat and tune up this section of it before posting if I was going to go on about "real world" usage, though. The script is so incredibly special-case (converting NIST GCM AES test vectors into mspdebug sim scripts), that "working and commented" is good enough. I'll put in a note that the Bash script is not a sterling example of how to do things, or clean it up for this post, if you have a suggestion either way. :)
@200_success you now can tear up the entirety of that script on this question, if you so desire.

Stack Exchange Network

Regex to match phone numbers, with comments

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Regex to match phone numbers, with comments

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions