I'm trying to parse addresses out of blocks of text, and have the following expression to do so:
/\d+\s(?:[sewnSEWN]\.?\s)?[\d\w]+\s(?:(?:[\d\w]+\s){0,3})?\w+\.?/
It will currently parse addresses such as:
300 E. Randolph St. Chicago, IL >> Returns 300 E. Randolph St.
5553 Bay Shore Drive >> Returns input
23 Joseph E Lowery Boulevard >> Returns input
513 Martin Luther King Jr Boulevard >> Returns input
This is exactly what I want. As this is the first expression I have ever written, I was wondering whether there is a way to shorten or refine it a little?
1 Answer
I don't know which regex implementation you are using, so translate this to your language as needed.
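\w is equivalent to [a-zA-Z0-9_], which already includes the digits, so [\d\w] is the same as \w.
Note that (?:(?:[\w]+\s){0,3})? is the same as (?:\w+\s){0,3}, because the {0,3} quantifier already lets the inner expression match zero times, so the outer optional group adds nothing.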
You can also fold the \w+\s at the beginning of the street name into that repeated group and make it repeat from 1 to 4 times: (?:\w+\s){1,4}
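To check that the simplification does not change behavior, here is a minimal sketch in Python (an assumption on my part, since the question does not say which regex implementation is in use). The names ORIGINAL, SIMPLIFIED, SAMPLES and first_match are made up for illustration. It runs both expressions over the four sample addresses from the question and collects the results side by side.

```python
import re

# The expression from the question, unchanged.
ORIGINAL = re.compile(
    r"\d+\s(?:[sewnSEWN]\.?\s)?[\d\w]+\s(?:(?:[\d\w]+\s){0,3})?\w+\.?"
)
# The simplified form: [\d\w] -> \w, redundant optional group removed,
# and the leading word folded into a {1,4} repetition.
SIMPLIFIED = re.compile(r"\d+\s(?:[sewnSEWN]\.?\s)?(?:\w+\s){1,4}\w+\.?")

# Sample inputs taken from the question.
SAMPLES = [
    "300 E. Randolph St. Chicago, IL",
    "5553 Bay Shore Drive",
    "23 Joseph E Lowery Boulevard",
    "513 Martin Luther King Jr Boulevard",
]


def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(0) if match else None


# Map each sample to (result of ORIGINAL, result of SIMPLIFIED).
results = {s: (first_match(ORIGINAL, s), first_match(SIMPLIFIED, s)) for s in SAMPLES}

for sample, (old, new) in results.items():
    print(f"{sample!r}: original={old!r} simplified={new!r}")
```

This only shows agreement on these four inputs; it is not a proof that the two expressions are equivalent on every string.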
Here is the resulting expression, annotated against your last example and written in free-spacing (extended) mode. Not knowing much about your format, I list what I find odd after it.
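/
\d+                    # 513
\s
(?:[sewnSEWN]\.?\s)?   # optional direction such as "E. " (absent in this example)
(?:\w+\s){1,4}         # Martin Luther King Jr
\w+                    # Boulevard
\.?                    # optional period after an abbreviation such as "St."
/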
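The commented layout above only works when the engine ignores whitespace and # comments inside the pattern. As a hedged sketch, assuming Python (the question does not name an implementation), the same thing can be written with the re.VERBOSE flag. The names ADDRESS, text and found, and the surrounding sentence in text, are illustrative only.

```python
import re

# re.VERBOSE makes the engine skip unescaped whitespace and "#" comments
# in the pattern, so the expression can be documented line by line.
ADDRESS = re.compile(
    r"""
    \d+                     # house number, e.g. 513
    \s
    (?:[sewnSEWN]\.?\s)?    # optional direction, e.g. "E. "
    (?:\w+\s){1,4}          # one to four name parts, e.g. "Martin Luther King Jr "
    \w+                     # descriptor, e.g. "Boulevard" or "St"
    \.?                     # optional period after an abbreviation
    """,
    re.VERBOSE,
)

# Hypothetical block of text containing one of the question's addresses.
text = "Meet me at 300 E. Randolph St. Chicago, IL tomorrow."
match = ADDRESS.search(text)
found = match.group(0) if match else None
print(found)
```

Other engines usually spell this as an x flag, for example /.../x, but check the documentation of whichever one you use.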
- Each gap is restricted to a single whitespace character (\s). Is the format really that strict? Perhaps you want \s+
- If I understand you right, after the optional N/S/E/W direction there have to be at least two and at most five words separated by spaces. Is this a correct interpretation?
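To illustrate the first point, here is a small sketch, again assuming Python. STRICT is the single-\s expression and LENIENT is a variant with \s+ in every gap; both names and the double-spaced input are made up for the demonstration.

```python
import re

# Single-whitespace version, as discussed above.
STRICT = re.compile(r"\d+\s(?:[sewnSEWN]\.?\s)?(?:\w+\s){1,4}\w+\.?")
# Same expression with every \s relaxed to \s+ (one or more whitespace characters).
LENIENT = re.compile(r"\d+\s+(?:[sewnSEWN]\.?\s+)?(?:\w+\s+){1,4}\w+\.?")

# Hypothetical input with two spaces after the house number.
double_spaced = "5553  Bay Shore Drive"

strict_hit = STRICT.search(double_spaced)
lenient_hit = LENIENT.search(double_spaced)

print(strict_hit)                 # no match: the second space breaks the pattern
print(lenient_hit.group(0))       # the whole address, extra space included
```

Keep in mind that \s also matches tabs and newlines, so \s+ can let a match run across a line break. If that is not wanted, a literal space or [ \t]+ may be a safer choice.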
- I have it as single spaces because I assumed that's how standard addresses are broken up. The way I was trying to do it, the street name could contain at least one part (such as Randolph) and at most 4 different parts, and after the street name there would be the descriptor, i.e. street, park, drive, etc. But they can be abbreviated, which is why I searched for the period after. – ayyp, Jun 13, 2012 at 18:45
- @AndrewPeacock Makes sense. If you are trying to match generic street names, what about directions like NW, SE, etc.? – Rahul Gopinath, Jun 13, 2012 at 18:49
- I hadn't thought about that... Also, when I change (?:(?:[\w]+\s){0,3})? to (?:[\w]+\s){0,3}? the second doesn't return the same result as the first. – ayyp, Jun 13, 2012 at 18:51
- @AndrewPeacock Remove the ? in your second expression. – Rahul Gopinath, Jun 13, 2012 at 18:54
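The reason the trailing ? matters is that a ? placed directly after a quantifier such as {0,3} does not mean "optional". It makes the quantifier lazy, so the engine tries the fewest repetitions first. A minimal sketch, assuming Python and using illustrative names GREEDY, LAZY and address:

```python
import re

# {0,3} is greedy: it takes as many "word + space" groups as it can.
GREEDY = re.compile(r"\d+\s(?:[sewnSEWN]\.?\s)?\w+\s(?:\w+\s){0,3}\w+\.?")
# {0,3}? is lazy: it takes as few groups as possible, here zero.
LAZY = re.compile(r"\d+\s(?:[sewnSEWN]\.?\s)?\w+\s(?:\w+\s){0,3}?\w+\.?")

address = "513 Martin Luther King Jr Boulevard"

greedy_result = GREEDY.search(address).group(0)
lazy_result = LAZY.search(address).group(0)

print(greedy_result)  # 513 Martin Luther King Jr Boulevard
print(lazy_result)    # 513 Martin Luther
```

With the lazy form the match stops as soon as the rest of the pattern can succeed, which is why the street name gets cut short.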