5
\$\begingroup\$

I am trying to optimize my Java code where I am parsing an address field.

Address fields have the format:

full_address;phone; full_address;phone; full_address;phone; 

where full_address = addresstype^street^city^state^zip
and where street = street1;street2;street3;street4;

So my string is

final String addressLongText = "Billing^Tata;3001 Garden Parkway^^NJ^;100-00-0009;Home^Goggle;3341 Main Parkway^^NY^;;";

My Location object stores each of the above attributes.

// regular expression to match one full entry: the address fields followed by the phone
Pattern newPattern = Pattern.compile("(([^\\^]*)\\^([^\\^]*)\\^([^\\^]*)\\^([^\\^]*)\\^([^;]*);([^;]*);)");
Matcher newMatcher = newPattern.matcher(addressLongText);
List<Location> discreteListOfLocations = new ArrayList<Location>();
MatchResult result = null;
while (newMatcher.find())
{
    result = newMatcher.toMatchResult();
    Location location = new Location();
    location.setAddressTypeCdValue(result.group(2));
    String[] str_arr = result.group(3).split(";");
    if (str_arr.length > 0)
    {
        location.setStreetAddress1(str_arr[0]);
    }
    if (str_arr.length > 1)
    {
        location.setStreetAddress2(str_arr[1]);
    }
    if (str_arr.length > 2)
    {
        location.setStreetAddress3(str_arr[2]);
    }
    if (str_arr.length > 3)
    {
        location.setStreetAddress4(str_arr[3]);
    }
    location.setCity(result.group(4));
    location.setState(result.group(5));
    location.setZip(result.group(6));
    discreteListOfLocations.add(location);
}

I am unsure how to restructure the regex so that someone else can easily understand what it does. Any ideas or suggestions would be helpful.

asked Nov 7, 2014 at 16:31
\$\endgroup\$
0

3 Answers

1
\$\begingroup\$

Building on @sln's regex, this is valid Java code:

Pattern pattern = Pattern.compile("(?x)(  # (1), whole entry\n" +
        "  ([^^]*)  # (2), Address type\n" +
        "  \\^" +
        "  ([^^]*)  # (3), street1;street2;street3;street4;\n" +
        "  \\^" +
        "  ([^^]*)  # (4), City\n" +
        "  \\^" +
        "  ([^^]*)  # (5), State\n" +
        "  \\^" +
        "  ([^;]*)  # (6), Zip\n" +
        "  ;" +
        "  ([^;]*)  # (7), Phone\n" +
        "  ;)");

I simplified it a bit: instead of [^\\^] you can write [^^], because ^ needs no escaping inside [ ... ] unless it is the first character, where it negates the class.
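
As a possible further step toward readability, the same expanded-mode pattern could use named groups (available since Java 7), so the calling code no longer depends on group numbers. This is only a sketch: the class name, method name and group names below are made up for illustration and do not come from the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupAddressDemo {
    // Expanded-mode pattern; the group names are illustrative assumptions.
    static final Pattern ENTRY = Pattern.compile(
        "(?x)" +
        "(?<type>   [^^]*) \\^  # address type\n" +
        "(?<street> [^^]*) \\^  # street1;street2;street3;street4;\n" +
        "(?<city>   [^^]*) \\^  # city\n" +
        "(?<state>  [^^]*) \\^  # state\n" +
        "(?<zip>    [^;]*) ;    # zip\n" +
        "(?<phone>  [^;]*) ;    # phone\n");

    // Returns {type, street, city, state, zip, phone} of the first entry,
    // or null when the text contains no entry.
    static String[] firstEntry(String text) {
        Matcher m = ENTRY.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] {
            m.group("type"), m.group("street"), m.group("city"),
            m.group("state"), m.group("zip"), m.group("phone")
        };
    }

    public static void main(String[] args) {
        String sample = "Billing^Tata;3001 Garden Parkway^^NJ^;100-00-0009;"
                      + "Home^Goggle;3341 Main Parkway^^NY^;;";
        System.out.println(String.join(" | ", firstEntry(sample)));
    }
}
```

With named groups, a call such as m.group("city") documents itself, at the cost of a slightly longer pattern.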

Aside from the regex, the code could also be better:

MatchResult result = matcher.toMatchResult();
Location location = new Location();
location.setAddressTypeCdValue(result.group(2));
String[] streetAddressParts = result.group(3).split(";");
if (streetAddressParts.length > 0) {
    location.setStreetAddress1(streetAddressParts[0]);
    if (streetAddressParts.length > 1) {
        location.setStreetAddress2(streetAddressParts[1]);
        if (streetAddressParts.length > 2) {
            location.setStreetAddress3(streetAddressParts[2]);
            if (streetAddressParts.length > 3) {
                location.setStreetAddress4(streetAddressParts[3]);
            }
        }
    }
}

The improvements:

  • I renamed str_arr to streetAddressParts. This follows the standard Java naming convention and better reflects what the variable holds.
  • I kept the streetAddressParts.length > 0 check. The result of split usually has at least one element, but Java drops trailing empty strings, so a street field made up only of semicolons (for example ";") yields an empty array.
  • I moved the subsequent checks of streetAddressParts.length > n into nested if statements, to avoid unnecessary checks that cannot be true.
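
One caveat about relying on the length of the array returned by split: String.split(";") drops trailing empty strings, so a street field consisting only of semicolons produces an array of length 0, while an empty field produces a single empty element. The sketch below shows a defensive lookup; the class and the helper name streetLine are hypothetical and only serve the illustration.

```java
class StreetSplitDemo {
    // Returns the street line at the given index, or an empty string
    // when the field has fewer lines than requested.
    static String streetLine(String street, int index) {
        String[] parts = street.split(";");
        return index < parts.length ? parts[index] : "";
    }

    public static void main(String[] args) {
        System.out.println("".split(";").length);       // one element: the empty string
        System.out.println(";".split(";").length);      // trailing empties removed: no elements
        System.out.println(";".split(";", -1).length);  // a negative limit keeps the empty strings
        System.out.println(streetLine("Tata;3001 Garden Parkway", 1));
    }
}
```

Passing a negative limit, as in split(";", -1), is an alternative when the empty positions themselves matter.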
answered Nov 7, 2014 at 23:50
\$\endgroup\$
2
\$\begingroup\$

I'm not sure about Java string concatenation.
Below is your regex, formatted and commented (by RegexFormat 5).

This puts it in expanded mode. The good thing is that anybody can read it in
your source code for later reference.

Below are two versions. The first is a normal C++ string concatenation where the newlines (\n) are
added explicitly. The second is a single-quoted string where the newlines are natural.

The nice thing about doing this in your code is that you can always print it out
for debugging purposes. It prints nicely formatted.
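
For Java specifically, one way to get the same effect is sketched here, ahead of the two versions: the lines are joined with String.join (Java 8 or later), so no line needs an explicit \n, and the pattern is compiled with Pattern.COMMENTS instead of a leading (?x). The class and member names are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JoinedPatternDemo {
    // One regex fragment per line; String.join supplies the newlines
    // that terminate the # comments.
    static final String SOURCE = String.join("\n",
        "( [^\\^]* )   # (1), Address type",
        "\\^",
        "( [^\\^]* )   # (2), street1;street2;street3;street4;",
        "\\^",
        "( [^\\^]* )   # (3), City",
        "\\^",
        "( [^\\^]* )   # (4), State",
        "\\^",
        "( [^;]* )     # (5), Zip",
        ";",
        "( [^;]* )     # (6), Phone",
        ";");

    // Pattern.COMMENTS is the flag form of the inline (?x) modifier.
    static final Pattern ENTRY = Pattern.compile(SOURCE, Pattern.COMMENTS);

    // Counts how many entries the pattern finds in the text.
    static int countEntries(String text) {
        Matcher m = ENTRY.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Printing the source shows the pattern laid out line by line.
        System.out.println(ENTRY.pattern());
    }
}
```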

"(?x) \n"
" ( [^\\^]* ) # (1), Address type \n"
" \\^ \n"
" ( [^\\^]* ) # (2), street1;street2;street3;street4; \n"
" \\^ \n"
" ( [^\\^]* ) # (3), City \n"
" \\^ \n"
" ( [^\\^]* ) # (4), State \n"
" \\^ \n"
" ( [^;]* ) # (5), Zip \n"
" ; \n"
" ( [^;]* ) # (6), Phone \n"
" ; \n"

======================================

"(?x)
 ( [^\\^]* ) # (1), Address type
 \\^
 ( [^\\^]* ) # (2), street1;street2;street3;street4;
 \\^
 ( [^\\^]* ) # (3), City
 \\^
 ( [^\\^]* ) # (4), State
 \\^
 ( [^;]* ) # (5), Zip
 ;
 ( [^;]* ) # (6), Phone
 ;
"
answered Nov 7, 2014 at 17:15
\$\endgroup\$
0
\$\begingroup\$

I'm not sure this matches all your needs, but it seems to match your test data:

(.*?)\^(.*?)\^(.*?)\^(.*?)\^;(.*?);

Here reluctant quantifiers match as little as possible, which makes the expression work on your test data and easier to understand. Note that \^; requires the zip to be empty, as it is in your test data. The groups are:

  • group 1: anything up to the first circumflex (^)
  • group 2: anything between the first and second circumflex (^)
  • group 3: anything between the second and third circumflex (^)
  • group 4: anything between the third and fourth circumflex (^)
  • group 5: anything between two semicolons (;) after the fourth circumflex (^)
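
The group layout listed above can be tried out with a short Java sketch. The class and method names are hypothetical, and the second input in the usage note is invented solely to show what happens when the zip is not empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantAddressDemo {
    // The reluctant pattern from this answer, escaped for a Java string literal.
    static final Pattern ENTRY =
        Pattern.compile("(.*?)\\^(.*?)\\^(.*?)\\^(.*?)\\^;(.*?);");

    // Collects groups 1 to 5 of every match, in order of appearance.
    static List<String[]> parse(String text) {
        List<String[]> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            String[] fields = new String[5];
            for (int i = 0; i < fields.length; i++) {
                fields[i] = m.group(i + 1);
            }
            entries.add(fields);
        }
        return entries;
    }

    public static void main(String[] args) {
        String sample = "Billing^Tata;3001 Garden Parkway^^NJ^;100-00-0009;"
                      + "Home^Goggle;3341 Main Parkway^^NY^;;";
        for (String[] fields : parse(sample)) {
            System.out.println(String.join(" | ", fields));
        }
    }
}
```

Because the pattern needs a literal ^; sequence, an invented entry with a zip, such as "Billing^1 Main St^Newark^NJ^07102;555-0100;", produces no match at all.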

Another approach might be to match the entire entry first, e.g. using the following expression:

(?>.*?\^){4};.*?;

Then split each resulting match (group(0)) by ^ and remove the leading and trailing semicolons from the last part, which holds the zip and phone.
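
A rough Java sketch of this match-then-split idea follows. The names are made up for illustration, and it assumes, like the expression itself, that the zip is empty, so the last part reduces to the phone once the semicolons are stripped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchThenSplitDemo {
    // Matches one whole entry: four ^-terminated fields, then ;phone;
    static final Pattern ENTRY = Pattern.compile("(?>.*?\\^){4};.*?;");

    static List<String[]> parse(String text) {
        List<String[]> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            // A limit of -1 keeps empty fields such as a missing city.
            String[] parts = m.group(0).split("\\^", -1);
            int last = parts.length - 1;
            // Strip one leading and one trailing semicolon from the last part.
            parts[last] = parts[last].replaceAll("^;|;$", "");
            entries.add(parts);
        }
        return entries;
    }

    public static void main(String[] args) {
        String sample = "Billing^Tata;3001 Garden Parkway^^NJ^;100-00-0009;"
                      + "Home^Goggle;3341 Main Parkway^^NY^;;";
        for (String[] parts : parse(sample)) {
            System.out.println(String.join(" | ", parts));
        }
    }
}
```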

answered Nov 7, 2014 at 16:45
\$\endgroup\$
