I have the following code to split an address string in T-SQL. Excepting the unit number, which is already in its own field, it would affect too much of the application to split the address into different database fields. It's a bit of a long story, but it's also not possible to make it a parameterized query, or to write the equivalent application code in C#. Still, I'm curious if anyone can think of any edge case that would put things into the wrong fields.
The only case I've come across is "East End Ave", but even the US Postal Service suggests incorrectly parsing 'East' as a directional (for consistency) (reference: very long pdf, search for 'directional', or go to page 233).
I'm only worried about Canadian and American addresses, but in Canada, French-style street types before the street name are fair game (e.g. 'Rue Montreal'). For a tie-breaker, I default to the English style (e.g. 'Rue Street' is parsed with the name 'Rue' and type 'Street').
StreetType
and StreetDirection
are tables which contain street types and street directions, as TypeName
/TypeAbbr
and DirectionName
/DirectionType
pairs respectively - I can post the values I've been using there if anyone's curious, but the StreetType
is quite long. StreetDirection
also includes French directions ('ouest', 'nord', 'sud', 'est', etc.), as well as their equivalents of 'northeast', 'northwest', etc.
DECLARE @CivicNum AS NVARCHAR(10) = ''
DECLARE @Predir AS NVARCHAR(10) = ''
DECLARE @Name AS NVARCHAR(100) = ''
DECLARE @Type AS NVARCHAR(15) = ''
DECLARE @FrenchType AS NVARCHAR(15) = ''
DECLARE @Postdir AS NVARCHAR(10) = ''
SET @Name = '123 E Fake St. S'
SET @CivicNum = SUBSTRING(@Name, 1, 1)
IF ISNUMERIC(@CivicNum) = 1
BEGIN
-- will assume an initial integer is a civic number
DECLARE @LastSpace AS INTEGER
DECLARE @NextSpace AS INTEGER
DECLARE @Temp as VARCHAR(15)
-- get the first token, and push it into @CivicNum
SET @LastSpace = 1
SET @NextSpace = CHARINDEX(' ', @Name)
SET @CivicNum = SUBSTRING(@Name, 1, @NextSpace)
-- get the next token, to check for a fractional @CivicNum
SET @LastSpace = @NextSpace + 1
SET @NextSpace = CHARINDEX(' ', @Name, @LastSpace)
SET @Temp = SUBSTRING(@Name, @LastSpace, @NextSpace - @LastSpace)
-- assumption: if the token contains a slash, it is a fraction
-- for @CivicNum
IF CHARINDEX('\\', @Temp) > 0 OR CHARINDEX ('/', @Temp) > 0
SET @CivicNum = @CivicNum + ' ' + @Temp
ELSE
SET @NextSpace = @LastSpace
-- reverse the string to extract type, direction from right
SET @Name = REVERSE(SUBSTRING(@Name, @NextSpace, LEN(@Name)))
-- get the final token, check if it is a post directional (N, S, E, W)
SET @LastSpace = 1
SET @NextSpace = CHARINDEX(' ', @Name)
SET @Temp = RTRIM(LTRIM(REVERSE(SUBSTRING(@Name, 1, @NextSpace))))
-- strip period, if one exists
IF SUBSTRING(@Temp, LEN(@Temp), 1) = '.'
SET @Temp = SUBSTRING(@Temp, 1, LEN(@Temp) - 1)
-- check if it exists in the StreetDirection table
SELECT @PostDir=DirectionAbbr FROM StreetDirection WHERE UPPER(DirectionName) LIKE UPPER(@Temp)
IF NOT (@PostDir LIKE @Temp) -- try abbreviation
SELECT @PostDir=DirectionAbbr FROM StreetDirection WHERE UPPER(DirectionAbbr) LIKE UPPER(@Temp)
-- if @PostDir was found, get the next token
IF NOT (@PostDir LIKE '')
BEGIN
SET @LastSpace = @NextSpace + 1
SET @NextSpace = CHARINDEX(' ', @Name, @LastSpace + 1)
SET @Temp = RTRIM(LTRIM(REVERSE(SUBSTRING(@Name, @LastSpace, @NextSpace - @LastSpace))))
IF SUBSTRING(@Temp, LEN(@Temp), 1) = '.' -- strip period
SET @Temp = SUBSTRING(@Temp, 1, LEN(@Temp) - 1)
END
-- check if the current token is a StreetType
SELECT @Type=TypeName FROM StreetType WHERE UPPER(TypeName) LIKE UPPER(@Temp)
IF NOT (@Type LIKE @Temp) -- try abbreviation
SELECT @Type=TypeName FROM StreetType WHERE UPPER(TypeAbbr) LIKE UPPER(@Temp)
-- reverse address again to check for pre-directional and french street type
IF NOT (@Type LIKE '')
SET @LastSpace = @NextSpace + 1
SET @Name = RTRIM(LTRIM(REVERSE(SUBSTRING(@Name, @LastSpace, LEN(@Name)))))
-- get the next token, and check for a pre-directional
SET @LastSpace = 1
SET @NextSpace = CHARINDEX(' ', @Name)
SET @Temp = RTRIM(LTRIM(SUBSTRING(@Name, 1, @NextSpace)))
IF SUBSTRING(@Temp, LEN(@Temp), 1) = '.' -- strip period
SET @Temp = SUBSTRING(@Temp, 1, LEN(@Temp) - 1)
SELECT @PreDir=DirectionAbbr FROM StreetDirection WHERE UPPER(DirectionName) LIKE UPPER(@Temp)
IF NOT (@PreDir LIKE @Temp) -- try abbreviation
SELECT @PreDir=DirectionAbbr FROM StreetDirection WHERE UPPER(DirectionAbbr) LIKE UPPER(@Temp)
-- if a pre-directional was found, get the next token
IF NOT (@PreDir LIKE '')
BEGIN
SET @LastSpace = @NextSpace + 1
SET @NextSpace = CHARINDEX(' ', @Name, @LastSpace + 1)
SET @Temp = RTRIM(LTRIM(SUBSTRING(@Name, @LastSpace, LEN(@Name))))
IF SUBSTRING(@Temp, LEN(@Temp), 1) = '.' -- strip period
SET @Temp = SUBSTRING(@Temp, 1, LEN(@Temp) - 1)
END
-- check for French street type before address, if an
-- English street type has not already been found
IF (@Type LIKE '') -- only if not already found (prefer English type)
BEGIN
SELECT @FrenchType=TypeName FROM StreetType WHERE UPPER(TypeName) LIKE UPPER(@Temp)
IF NOT (@FrenchType LIKE @Temp) -- try abbreviation
SELECT @FrenchType=TypeName FROM StreetType WHERE UPPER(TypeAbbr) LIKE UPPER(@Temp)
END
-- if @FrenchType was found, get the final remaining token
IF NOT (@FrenchType Like '')
BEGIN
SET @LastSpace = @NextSpace
SET @Type = @FrenchType
END
-- assume it is the street name
SET @Name = SUBSTRING(@Name, @LastSpace, LEN(@Name))
SELECT @CivicNum AS CivicNum, @Name AS Address, @Type AS StreetType, @PostDir + @PreDir AS Direction
END
ELSE
SELECT @CivicNum AS CivicNum, @Name AS Address, @Type AS StreetType, @PostDir + @PreDir AS Direction
If nothing else, I hope this is a snippet of code someone finds handy! Although they don't particularly apply in my case, here is a handy list of address edge cases from @200_success.
-
2\$\begingroup\$ This is probably covered by your French cases but I wanted to point out that in some American cities (San Diego) Spanish names are numerous. Avenida Montuosa, Via San Marco, Caminito Suenos, etc. And you may have spanish directionals as well. "Paseo del Pueblo Sur". So it is not just French names that you need to account for. \$\endgroup\$nickles80– nickles802012年08月23日 21:32:25 +00:00Commented Aug 23, 2012 at 21:32
-
1\$\begingroup\$ @dnix Sorry for the late reply! In case it's still useful: US Street Types, Street Directionals Canada Street Types, Street Directionals \$\endgroup\$Hannele– Hannele2013年10月22日 15:04:54 +00:00Commented Oct 22, 2013 at 15:04
-
3\$\begingroup\$ I'm curious what the motivation is for trying to decompose an address. In general, it's an impossible task, though the situation is slightly more sane in the US and Canada. \$\endgroup\$200_success– 200_success2013年11月15日 20:46:13 +00:00Commented Nov 15, 2013 at 20:46
-
1\$\begingroup\$ @200_success One particular government form has split fields, while literally every other situation in our application does not. So, the requirement is currently very specific to not even just Canadian, but to Ontario customers. \$\endgroup\$Hannele– Hannele2013年11月15日 20:49:17 +00:00Commented Nov 15, 2013 at 20:49
-
2\$\begingroup\$ @200_success That's a great link for edge cases! I doubt most of them will apply in my case, but thanks. \$\endgroup\$Hannele– Hannele2013年11月15日 21:02:25 +00:00Commented Nov 15, 2013 at 21:02
1 Answer 1
Reviewing this code is essentially meaningless without a ream of test data to compare against.... and it would be really nice to have a debugger!
The REVERSE operations make keeping track of the data state quite complicated.
In general though, the code is formatted well, though there are a couple of issues:
Your first if block
IF ISNUMERIC(@CivicNum) = 1
has aBEGIN
, but I can't see the closingEND
, which I would have expected to find before the indent-matchingELSE
in the second-to-last line. I would assume this is a typo?you have a number of
DECLARE
instances that are issued inside various other blocks. It is traditional to move all your declare blocks to be the initial part of your t-sql source... I am not certain whether this code-style is relevant though, butDECLARE @LastSpace AS INTEGER
and others should be declared at the top.The code is just too much for this area. You need to break it down some more, or find a better algorithm. Tracking the string manipulations is a nightmare, and, frankly, I gave up... which means the code is not well documented, or simply too cumbersome.
My recommendations to you (in a priority of sorts):
- replace the 1-column system with a pre-parsed data system
- create a separate table with the data columns split, and have a join that your can access to see the fields.
- change the user requirmenets to not need this breakdown.
- Create a CDL User-Defined-Function (C# or something) that does the parse using a more friendly format, perhaps returning the data in tab-separated values.... and then splitting them in the calling SQL.
- go with what you have, it basically works.
Conclusion:
I figure your code is close to right, It may not be worth breaking it. Adding a CDL would be a pain to do, but the code for the parse would be much easier to manage.
My money is on the system-change to store the parsed fields.... ;-)
-
\$\begingroup\$ You can do debugging in Microsoft SQL Management Studio - it's really the only reason I was able to keep track of anything. Putting the data into columns instead was considered, it's just that it was a very hard requirement for a very small percentage of customers. Fixed the missing
END
, thanks. \$\endgroup\$Hannele– Hannele2014年02月10日 13:07:49 +00:00Commented Feb 10, 2014 at 13:07