Regex to find addresses and phone numbers

Question 1

I am trying to optimize my Java code where I am parsing an address field.

Address fields have the format:

full_address;phone; full_address;phone; full_address;phone;

where full_address = addresstype^street^city^state^zip
and where street = street1;street2;street3;street4;

So my string is

final String addressLongText = "Billing^Tata;3001 Garden Parkway^^NJ^;100-00-0009;Home^Goggle;3341 Main Parkway^^NY^;;";

My Location object stores each of the above attributes.

// regular expression matching one full entry: addresstype^street^city^state^zip;phone;
Pattern newPattern = Pattern.compile("(([^\\^]*)\\^([^\\^]*)\\^([^\\^]*)\\^([^\\^]*)\\^([^;]*);([^;]*);)");
Matcher newMatcher = newPattern.matcher(addressLongText);
List<Location> discreteListOfLocations = new ArrayList<Location>();
MatchResult result = null;
while (newMatcher.find())
{
    result = newMatcher.toMatchResult();
    Location location = new Location();
    location.setAddressTypeCdValue(result.group(2));
    // the street field holds up to four parts separated by ';'
    String[] str_arr = result.group(3).split(";");
    if (str_arr.length > 0)
    {
        location.setStreetAddress1(str_arr[0]);
    }
    if (str_arr.length > 1)
    {
        location.setStreetAddress2(str_arr[1]);
    }
    if (str_arr.length > 2)
    {
        location.setStreetAddress3(str_arr[2]);
    }
    if (str_arr.length > 3)
    {
        location.setStreetAddress4(str_arr[3]);
    }
    location.setCity(result.group(4));
    location.setState(result.group(5));
    location.setZip(result.group(6));
    discreteListOfLocations.add(location);
}

I am unsure how to simplify the regex so that it is easier for someone else to understand what it is doing. Any ideas or suggestions would be helpful.

Question 2

Building on @sln's regex, this is valid Java code:

Pattern pattern = Pattern.compile("(?x)(  # (1), whole entry\n" +
        "  ([^^]*)  # (2), Address type\n" +
        "  \\^" +
        "  ([^^]*)  # (3), street1;street2;street3;street4;\n" +
        "  \\^" +
        "  ([^^]*)  # (4), City\n" +
        "  \\^" +
        "  ([^^]*)  # (5), State\n" +
        "  \\^" +
        "  ([^;]*)  # (6), Zip\n" +
        "  ;" +
        "  ([^;]*)  # (7), Phone\n" +
        "  ;)");

I simplified it a bit: instead of [^\\^] you can write [^^], because ^ only needs escaping inside [ ... ] when it is the first character, where it would otherwise negate the class.
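To check the claim about the two spellings, here is a minimal sketch that runs both character classes against a few inputs. The class name CaretInClass and its helper methods are made up for this illustration.

```java
import java.util.regex.Pattern;

public class CaretInClass {
    // True when the whole input contains no '^' (escaped spelling).
    public static boolean escaped(String s) {
        return Pattern.matches("[^\\^]*", s);
    }

    // Same check with the unescaped spelling of the caret.
    public static boolean unescaped(String s) {
        return Pattern.matches("[^^]*", s);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Billing", "", "a^b", "^"}) {
            System.out.println("\"" + s + "\": " + escaped(s) + " / " + unescaped(s));
        }
    }
}
```

Both methods should agree on every input: the first ^ negates the class and the second one is taken literally.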

Aside from the regex, the code could also be better:

MatchResult result = matcher.toMatchResult();
Location location = new Location();
location.setAddressTypeCdValue(result.group(2));
String[] streetAddressParts = result.group(3).split(";");
location.setStreetAddress1(streetAddressParts[0]);
if (streetAddressParts.length > 1) {
    location.setStreetAddress2(streetAddressParts[1]);
    if (streetAddressParts.length > 2) {
        location.setStreetAddress3(streetAddressParts[2]);
        if (streetAddressParts.length > 3) {
            location.setStreetAddress4(streetAddressParts[3]);
        }
    }
}

The improvements:

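I renamed str_arr to streetAddressParts. This follows standard Java naming, and better reflects what it is.
The result of split has at least one element unless the input consists only of separators (in Java, ";".split(";") returns an empty array). So as long as the street field is never just semicolons, there is no need to check for streetAddressParts.length > 0.
I moved the subsequent checks of streetAddressParts.length > n into nested if statements, to avoid checks that cannot be true once an earlier one has failed.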
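Putting the pieces together, here is a self-contained sketch that runs the commented pattern over the sample string from the question. It drops the outer group, so the address type is group 1 here. The nested Location class with public fields is a hypothetical stand-in for the real Location bean, and a loop replaces the nested ifs so that an empty split result is handled too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddressParser {
    // One entry: addresstype^street^city^state^zip;phone;
    private static final Pattern ENTRY = Pattern.compile("(?x)" +
            " ([^^]*) \\^   # (1) address type\n" +
            " ([^^]*) \\^   # (2) street1;street2;street3;street4\n" +
            " ([^^]*) \\^   # (3) city\n" +
            " ([^^]*) \\^   # (4) state\n" +
            " ([^;]*) ;     # (5) zip\n" +
            " ([^;]*) ;     # (6) phone\n");

    // Hypothetical stand-in for the question's Location class.
    public static final class Location {
        public String type, city, state, zip, phone;
        public final String[] streets = new String[4]; // null where a part is absent
    }

    public static List<Location> parse(String text) {
        List<Location> locations = new ArrayList<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            Location loc = new Location();
            loc.type = m.group(1);
            // split drops trailing empty strings, so the array may even be empty
            String[] parts = m.group(2).split(";");
            for (int i = 0; i < parts.length && i < loc.streets.length; i++) {
                loc.streets[i] = parts[i];
            }
            loc.city = m.group(3);
            loc.state = m.group(4);
            loc.zip = m.group(5);
            loc.phone = m.group(6);
            locations.add(loc);
        }
        return locations;
    }

    public static void main(String[] args) {
        String input = "Billing^Tata;3001 Garden Parkway^^NJ^;100-00-0009;"
                + "Home^Goggle;3341 Main Parkway^^NY^;;";
        for (Location loc : parse(input)) {
            System.out.println(loc.type + " | " + loc.streets[0] + " | "
                    + loc.streets[1] + " | " + loc.state + " | " + loc.phone);
        }
    }
}
```

The loop bound on loc.streets.length silently ignores a fifth street part; whether that is acceptable depends on the real data.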

Question 3

I'm not sure about Java string concatenation.
Below is your regex formatted and commented (by RegexFormat 5).

This puts it in expanded mode. The good thing is that anybody can read it in
your source code for later reference.

Below are two versions. The first uses normal C++ concatenation, where the
newlines (\n) are added explicitly. The second is a single-quoted string
where the newlines are natural.

The nice thing about doing this in your code is that you can always print the
regex out for debugging purposes. It prints in a nicely formatted layout.
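For Java, the same idea might look like the sketch below. Java does not join adjacent string literals the way C++ does, so this version builds the text with String.join and passes Pattern.COMMENTS instead of a leading (?x). The class name ReadablePattern and the sample entry are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReadablePattern {
    // One array element per line of the expanded regex.
    public static final String SOURCE = String.join("\n",
            "  ( [^\\^]* )   # (1), Address type",
            "  \\^",
            "  ( [^\\^]* )   # (2), street1;street2;street3;street4;",
            "  \\^",
            "  ( [^\\^]* )   # (3), City",
            "  \\^",
            "  ( [^\\^]* )   # (4), State",
            "  \\^",
            "  ( [^;]* )     # (5), Zip",
            "  ;",
            "  ( [^;]* )     # (6), Phone",
            "  ;");

    // Pattern.COMMENTS has the same effect as the embedded flag (?x).
    public static final Pattern ENTRY = Pattern.compile(SOURCE, Pattern.COMMENTS);

    public static void main(String[] args) {
        // Prints the regex in its expanded, commented layout.
        System.out.println(ENTRY.pattern());

        Matcher m = ENTRY.matcher("Billing^Tata;3001 Garden Parkway^^NJ^;100-00-0009;");
        if (m.find()) {
            System.out.println("type=" + m.group(1) + ", phone=" + m.group(6));
        }
    }
}
```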

"(?x) \n"
" ( [^\\^]* ) # (1), Address type \n"
" \\^ \n"
" ( [^\\^]* ) # (2), street1;street2;street3;street4; \n"
" \\^ \n"
" ( [^\\^]* ) # (3), City \n"
" \\^ \n"
" ( [^\\^]* ) # (4), State \n"
" \\^ \n"
" ( [^;]* ) # (5), Zip \n"
" ; \n"
" ( [^;]* ) # (6), Phone \n"
" ; \n"

======================================

"(?x)
 ( [^\\^]* ) # (1), Address type
 \\^
 ( [^\\^]* ) # (2), street1;street2;street3;street4;
 \\^
 ( [^\\^]* ) # (3), City
 \\^
 ( [^\\^]* ) # (4), State
 \\^
 ( [^;]* ) # (5), Zip
 ;
 ( [^;]* ) # (6), Phone
 ;
"

Question 4

I'm not sure this matches all your needs but seems to match your test data:

(.*?)\^(.*?)\^(.*?)\^(.*?)\^;(.*?);

Here a reluctant quantifier is used to match as little as possible, making the expression work on your test data and easier to understand:

group 1: anything up to the first circumflex (^)
group 2: anything between the first and second circumflex (^)
group 3: anything between the second and third circumflex (^)
group 4: anything between the third and fourth circumflex (^)
group 5: anything between two semicolons (;) after the fourth circumflex (^)
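As a quick check of the group list above, this sketch applies the reluctant expression to the sample string from the question. The class name ReluctantMatch and the match helper are hypothetical. Note the assumption built into the expression: the fourth circumflex must be followed directly by a semicolon, so an entry with a non-empty zip is not matched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReluctantMatch {
    private static final Pattern ENTRY =
            Pattern.compile("(.*?)\\^(.*?)\\^(.*?)\\^(.*?)\\^;(.*?);");

    // Returns groups 1..5 of every match as a String[5].
    public static List<String[]> match(String text) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            String[] row = new String[5];
            for (int g = 1; g <= 5; g++) {
                row[g - 1] = m.group(g);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        String input = "Billing^Tata;3001 Garden Parkway^^NJ^;100-00-0009;"
                + "Home^Goggle;3341 Main Parkway^^NY^;;";
        for (String[] row : match(input)) {
            System.out.println(String.join(" | ", row));
        }
        // An entry with a zip has no "^;" sequence, so nothing matches.
        System.out.println(match("Billing^Main St^Town^NJ^08540;555;").size());
    }
}
```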

Another approach might be to match the entire entry first, e.g. using the following expression

(?>.*?\^){4};.*?;

Then split the resulting matches (group(0)) by ^ and remove leading and trailing semicolons from the last part (the zip and phone).
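The two-step approach could be sketched as follows. The class name MatchThenSplit and the parse helper are made up for this example. Like the expression itself, the sketch assumes an empty zip, so after trimming the semicolons the last part holds only the phone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchThenSplit {
    // Step 1: match a whole entry (four ^-terminated fields, then ;phone;).
    private static final Pattern ENTRY = Pattern.compile("(?>.*?\\^){4};.*?;");

    public static List<String[]> parse(String text) {
        List<String[]> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            // Step 2: split on ^; the limit -1 keeps empty fields such as the city.
            String[] parts = m.group(0).split("\\^", -1);
            int last = parts.length - 1;
            // Strip leading and trailing semicolons from the last part.
            parts[last] = parts[last].replaceAll("^;+|;+$", "");
            entries.add(parts);
        }
        return entries;
    }

    public static void main(String[] args) {
        String input = "Billing^Tata;3001 Garden Parkway^^NJ^;100-00-0009;"
                + "Home^Goggle;3341 Main Parkway^^NY^;;";
        for (String[] parts : parse(input)) {
            System.out.println(String.join(" | ", parts));
        }
    }
}
```

The street field (index 1) still contains its own semicolons and would need a further split(";").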

Accepted answer ("Building on @sln's regex ..."): janos, 2014-11-07 23:50:40Z

