I developed a PHP function that takes an unformatted address and splits it into a street name and a number.
The following are some patterns of the addresses I receive:
- StreetName Number
- StreetName, Number
- Number StreetName
- Number-Number StreetName
- StreetName, Number, Complement
- StreetName Number/Number
- StreetName Number - ZipCode (ZipCode could be ignored)
- StreetName (without number)
I'm using regex to identify the pattern and then splitting it. Here is the function (the code is commented for better understanding):
<?php
function getInfoAddress ($address)
{
    $return = array('street'=>NULL,
                    'number'=>NULL,
                    'complement'=>NULL);
    //first, remove the spaces from the string
    $addressWithoutSpace = str_replace(' ', '', $address);
    //discover the pattern using regex
    if(preg_match('/^([0-9.-])+(.)*$/',$addressWithoutSpace) === 1) {
        //here, the number comes first and then the information about the street
        $info1 = preg_split('/[[:alpha:]]/', $addressWithoutSpace);
        $info2 = preg_split('/[0-9.-]/', $address);
        $return['number'] = $info1[0];
        $return['street'] = end($info2);
    }
    elseif(preg_match('/^([[:alpha:]]|[[:punct:]])+(.)*$/',$addressWithoutSpace) === 1) {
        //here, I have an alphanumeric word in the first part of the address
        if(preg_match('/^(.)+([[:punct:]])+(.)*([0-9.-])*$/',$addressWithoutSpace) === 1) {
            if(preg_match('/,/',$addressWithoutSpace) === 1) {
                //has one or more commas and ends with the number
                $info1 = explode(",", $address);
                $return['number'] = trim(preg_replace('/([^0-9-.])/', ' ', end($info1)));//the last element of the array is the number
                array_pop($info1);//pop the number from the array
                $return['street'] = str_replace(",", "",implode(" ",$info1));//the rest of the string is the street name
            }
            else {
                //ends with the number, without a comma
                $info1 = explode(" ", $address);
                $return['number'] = end($info1);//the last element of the array is the number
                array_pop($info1);//pop the number from the array
                $return['street'] = implode(" ",$info1);//the rest of the string is the street name
            }
        }
        elseif(preg_match('/^(.)+([0-9.-])+$/',$addressWithoutSpace) === 1) {
            //ends with the number, without punctuation
            $info1 = explode(" ", $address);
            $return['number'] = end($info1);//the last element of the array is the number
            array_pop($info1);//pop the number from the array
            $return['street'] = implode(" ",$info1);//the rest of the string is the street name
        }
        else {
            //case without any number
            if (preg_match('/,/',$addressWithoutSpace) === 1) {
                $return['number'] = NULL;
                $endArray = explode(',', $address);
                $return['complement'] = end($endArray);//the complement is the last element of the array
                array_pop($endArray);//pop the last element
                $return['street'] = implode(" ", $endArray);//the rest of the string is the name of the street
            }
            else {
                $return['number'] = NULL;
                $return['street'] = $address;//the address is just the street name
            }
        }
    }
    return ($return);
}
$address = $_POST['address'];
$addressArray = getInfoAddress($address);
print_r($addressArray);
?>
This works in most cases (enough for me for a while), but I'd like to improve some points:
- Improve readability: I care about readable code, but in this case I don't think I did a good job. Are there any useless if/else blocks, for example?
- Improve reliability: The code fails in some cases (when the street name includes a number, as in "5thAvenue", or when the complement comes before the number, as in "rue de la montagne BL2 52", for example). Is there some way to improve the reliability?
- I would also like some suggestions for improvement without the use of regex, although I have not been able to figure out anything along those lines.
1 Answer
Looking at it very quickly, it seems like fragile code.
Still, it works as expected.
But I saw a few things to improve:
- You called your function getInfoAddress().
  It sounds like it will fetch the address from somewhere, but that isn't the case...
  The address is being parsed. A name like parseAddress() seems better.
- But your function casing is wrong, in my opinion.
  PHP isn't case-sensitive regarding function names, so parseaddress() would call the same function.
  If you write parseaddress() somewhere, you may have problems in the future if you need to change something.
  My recommendation is parse_address().
- Be explicit regarding your regular expressions.
  Avoid this:
      /^([[:alpha:]]|[[:punct:]])+(.)*$/
  Be explicit. I have no idea what punct means. Is it punctuation?
- You over-use preg_match.
  You have this line:
      if (preg_match('/,/',$addressWithoutSpace) === 1){
  You should use strpos() for this:
      if (strpos($addressWithoutSpace, ',') !== false){
  This avoids invoking the regex engine for a plain substring search, which should improve performance.
- Please, don't mix Portuguese with English.
  Your $endereco variable should have another name.
  Please give your variables English names only.
  Everybody will thank you.
- Right at the top, you "normalize" your input:
      $addressWithoutSpace = str_replace(' ', '', $endereco);
  But you use that $endereco variable everywhere. Maybe it was by mistake?
- Avoid closing the PHP tag in a file that only has PHP code.
  This will avoid frustrations due to forgotten whitespace after the closing tag.
  Many services, like GitHub, add a newline to the end.
  PHP automatically ignores one, and only one, newline directly after ?>, but not more.
  If you leave one more newline by mistake, you can seriously break stuff everywhere.
  Just remove the ?> at the end.
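A minimal sketch of how these suggestions might fit together is shown below. It is an illustration under stated assumptions, not the asker's updated code: the body of parse_address() is hypothetical, it covers only some of the listed patterns (comma-separated, number first, number last, street only), and it does not handle the "StreetName Number - ZipCode" form. It uses a snake_case name, strpos() for the comma check, explicit character classes instead of POSIX ones, and no closing ?> tag.

```php
<?php
// Hypothetical sketch: parse_address() and its rules are assumptions made
// for illustration; only a subset of the question's patterns is handled.
function parse_address(string $address): array
{
    $result = ['street' => null, 'number' => null, 'complement' => null];
    $address = trim($address);

    // A plain substring check is enough to detect a comma.
    // strpos() can return 0, so compare strictly against false.
    if (strpos($address, ',') !== false) {
        // Assumed layout: "StreetName, Number" or "StreetName, Number, Complement"
        $parts = array_map('trim', explode(',', $address));
        $result['street'] = $parts[0];
        $result['number'] = $parts[1] !== '' ? $parts[1] : null;
        $result['complement'] = $parts[2] ?? null;
        return $result;
    }

    // "Number StreetName" or "Number-Number StreetName"
    if (preg_match('@^([0-9]+(?:-[0-9]+)?)\s+(.+)$@', $address, $m) === 1) {
        $result['number'] = $m[1];
        $result['street'] = $m[2];
        return $result;
    }

    // "StreetName Number" or "StreetName Number/Number"
    if (preg_match('@^(.+?)\s+([0-9]+(?:/[0-9]+)?)$@', $address, $m) === 1) {
        $result['street'] = $m[1];
        $result['number'] = $m[2];
        return $result;
    }

    // No recognizable number: treat the whole input as the street name.
    $result['street'] = $address;
    return $result;
}

print_r(parse_address('Main Street, 12, Apt 3'));
print_r(parse_address('12-14 Main Street'));
```

Each pattern gets its own early return, so the nested if/else chain disappears. Because the number-first rule requires whitespace right after the digits, an input such as "5th Avenue" falls through to the street-only case instead of being split.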
- I updated my code following your tips. Thank you for it. Just two points: 1) The use of $endereco was a mistake. 2) The classes [[:punct:]] and [[:alpha:]] come from the POSIX Extended Regular Expression (ERE) syntax. – James, Dec 21, 2015 at 11:26
- @James Just because it comes from the standard doesn't mean it is the right way. Honestly, [[:alpha:]] can either be alphabetic or alphanumeric. Which one is it? Most people will think alphanumeric. I thought it was too. And do you really need an address with ][!"#$%&'()*+,./:;<=>?@\^_`{|}~- ? The only characters you care about are ,-/ which act as separators. Seriously, get rid of those. You can just do @[a-z]@i, which has the same effect as @[[:alpha:]]@, and is easily understood by anyone who knows basic regular expressions. – Ismael Miguel, Dec 23, 2015 at 14:14
- I got your point. I'll change it to use simpler regexes. – James, Dec 23, 2015 at 15:58
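To make the character-class discussion concrete, here is a small sketch that compares the POSIX class with the explicit one for single ASCII characters. The helper classify_char() is a hypothetical name introduced for this example, and the comparison assumes plain ASCII input without the u modifier; it makes no claim about non-ASCII letters or other locales.

```php
<?php
// Hypothetical helper: reports which of the discussed classes match a
// single ASCII character. Assumes ASCII input only.
function classify_char(string $char): array
{
    return [
        // POSIX class: alphabetic characters only, digits are not included
        'posix_alpha'    => preg_match('@[[:alpha:]]@', $char) === 1,
        // Explicit, case-insensitive letter range
        'explicit_alpha' => preg_match('@[a-z]@i', $char) === 1,
        // Only the separators that appear in the address patterns
        'separator'      => preg_match('@[,/-]@', $char) === 1,
    ];
}

foreach (['a', 'Z', '7', ',', '-', '/'] as $char) {
    echo $char, ' => ', json_encode(classify_char($char)), "\n";
}
```

For these inputs the two letter classes agree, and neither matches the digit, which settles the alphabetic versus alphanumeric question for [[:alpha:]].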
$return for each input.