I'm working on a tool to import data from one database to another. One requirement is that I have to split a string from one source field into three (shorter) fields at the target. If possible the string should be split at a space character. If the string doesn't fit completely into the target fields, the rest can be omitted.
Usually I would solve this using an UDF but unfortunately neither UDF's nor Stored Procedures are allowed in my scenario.
My source database has the following table:
CREATE TABLE dbo.Organisations
(
OrganisationID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
OrganisationName nvarchar(180) NOT NULL
/* More columns omitted for brevity */
)
This table contains company names, for example:
OrganisationID | OrganisationName ---------------+-------------------------------------------------------------------- 1 | Microsoft Corporation 2 | S&T System Integration & Technology Distribution Aktiengesellschaft
During the import the records of this table should be inserted into a stating table in the target database. The staging table looks like this:
CREATE TABLE dbo.OrgStaging
(
OrganisationID int NOT NULL,
Name1 nvarchar(50) NOT NULL,
Name2 nvarchar(50) NOT NULL,
Name3 nvarchar(50) NOT NULL
/* More columns omitted for brevity */
)
If I would simply use SUBSTRING
to split the name I would end up in the staging table like this:
OrganisationID | Name1 | Name2 ---------------+---------------------------------------------------+----------------- 1 | Microsoft Corporation | 2 | S&T System Integration & Technology Distribution A|ktiengesellschaft
But I don't want to split in the middle of a word so I would like to have the result like this:
OrganisationID | Name1 | Name2 ---------------+---------------------------------------------------+------------------ 1 | Microsoft Corporation | 2 | S&T System Integration & Technology Distribution |Aktiengesellschaft
To achieve this I came up with the following rather complex query:
DECLARE
@MaxLen int = 50; -- Maximum length of a target column
WITH
SpacePositions AS
(
SELECT
O.OrganisationID,
CHARINDEX(' ', O.OrganisationName, 0) AS Position
FROM
SourceDB.dbo.Organisations O
UNION ALL SELECT
O.OrganisationID,
CHARINDEX(' ', O.OrganisationName, S.Position + 1) AS Position
FROM
SourceDB.dbo.Organisations O
INNER JOIN SpacePositions S
ON CHARINDEX(' ', O.OrganisationName, S.Position + 1) > S.Position
AND S.OrganisationID = O.OrganisationID
),
SplitPositions AS
(
SELECT
S.OrganisationID,
S.Position - 1 AS Position
FROM
SpacePositions S
WHERE
S.Position != 0
UNION SELECT
O.OrganisationID,
LEN(O.OrganisationName) AS Position
FROM
SourceDB.dbo.Organisations O
),
FirstChunk AS
(
SELECT
D.OrganisationID,
1 AS ChunkStart,
MAX(D.Position) AS ChunkEnd
FROM
(
SELECT
S.OrganisationID,
S.Position + 1 AS Position
FROM
SplitPositions S
WHERE
Position BETWEEN 1 AND @MaxLen
UNION SELECT
S.OrganisationID,
@MaxLen
FROM
SplitPositions S
WHERE
NOT EXISTS
(
SELECT
*
FROM
SplitPositions SI
WHERE
SI.Position BETWEEN 1 AND @MaxLen
AND SI.OrganisationID = S.OrganisationID
)
) D
GROUP BY
D.OrganisationID
),
SecondChunk AS
(
SELECT
C.OrganisationID,
C.ChunkEnd + 1 AS ChunkStart,
MAX(D.Position) AS ChunkEnd
FROM
FirstChunk C
INNER JOIN
(
SELECT
S.OrganisationID,
S.Position + 1 AS Position
FROM
SplitPositions S
INNER JOIN FirstChunk C
ON C.OrganisationID = S.OrganisationID
WHERE
S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
UNION SELECT
S.OrganisationID,
C.ChunkEnd + @MaxLen AS Position
FROM
SplitPositions S
INNER JOIN FirstChunk C
ON C.OrganisationID = S.OrganisationID
WHERE
NOT EXISTS
(
SELECT
*
FROM
SplitPositions SI
WHERE
SI.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
AND OrganisationID = C.OrganisationID
)
) D
ON D.OrganisationID = C.OrganisationID
GROUP BY
C.OrganisationID,
C.ChunkEnd
),
ThirdChunk AS
(
SELECT
C.OrganisationID,
C.ChunkEnd + 1 AS ChunkStart,
MAX(D.Position) AS ChunkEnd
FROM
SecondChunk C
INNER JOIN
(
SELECT
S.OrganisationID,
S.Position + 1 AS Position
FROM
SplitPositions S
INNER JOIN SecondChunk C
ON C.OrganisationID = S.OrganisationID
WHERE
S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
UNION SELECT
S.OrganisationID,
C.ChunkEnd + @MaxLen AS Position
FROM
SplitPositions S
INNER JOIN SecondChunk C
ON C.OrganisationID = S.OrganisationID
WHERE
NOT EXISTS
(
SELECT
*
FROM
SplitPositions SI
WHERE
SI.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
AND OrganisationID = C.OrganisationID
)
) D
ON D.OrganisationID = C.OrganisationID
GROUP BY
C.OrganisationID,
C.ChunkEnd
)
INSERT INTO dbo.OrgStaging
(
OrganisationID,
Name1,
Name2,
Name3
)
SELECT
O.OrganisationID,
LTRIM(RTRIM(SUBSTRING(O.OrganisationName, C1.ChunkStart, C1.ChunkEnd))),
LTRIM(RTRIM(SUBSTRING(O.OrganisationName, C2.ChunkStart, 1 + C2.ChunkEnd - C2.ChunkStart))),
LTRIM(RTRIM(SUBSTRING(O.OrganisationName, C3.ChunkStart, 1 + C3.ChunkEnd - C3.ChunkStart)))
FROM
SourceDB.dbo.Organisations O
INNER JOIN FirstChunk C1
ON C1.OrganisationID = O.OrganisationID
INNER JOIN SecondChunk C2
ON C2.OrganisationID = O.OrganisationID
INNER JOIN ThirdChunk C3
ON C3.OrganisationID = O.OrganisationID
ORDER BY
O.OrganisationID;
It works as desired but I'm wondering if this can be stated a little more compact. I tried to combine the CTE's FirstChunk
, SecondChunk
and ThirdChunk
into one recursive CTE but that is not working because of the GROUP BY
clause which is not allowed in recursive CTEs.
Can this be restated more compact or is it already the best I can get?
1 Answer 1
A nitpick
This first point will not really change the query, and it is probably
only added for testing purposes, but an ORDER BY
with an INSERT INTO
statement does not really do anything useful (unless you insert
into a table with an IDENTITY
column).
Us a more compact style
It looks like you either have a very strong policy on style and
layout, or used an auto-formatter, because in some places, SELECT *
is written in two lines. That is good, because the query is at least
well formatted and readable. If you want it to be more compact,
though, you might want to skimp a bit on the newlines.
When formatting code more compactly, there is still "breathing space", but the scroll-factor is tuned down a bit, so you have a bit higher view of the code. I find that it can help.
I like to use a combination of indentation to recognize the query parts (my indentation is by no means the default in SQL) and a thing I call "one concept per line", where each line tells me something that can stand on its own for logic.
Not indenting UNION
s helps to see the "equal level" of both sides of
the UNION
, and prevents the lines from getting too long.
DECLARE @MaxLen int = 50; -- Maximum length of a target column
WITH SpacePositions AS
( SELECT O.OrganisationID
, CHARINDEX(' ', O.OrganisationName, 0) AS Position
FROM dbo.Organisations O
UNION ALL
SELECT O.OrganisationID
, CHARINDEX(' ', O.OrganisationName, S.Position + 1) AS Position
FROM dbo.Organisations O
INNER JOIN SpacePositions S
ON CHARINDEX(' ', O.OrganisationName, S.Position + 1) > S.Position
AND S.OrganisationID = O.OrganisationID
)
, SplitPositions AS
( SELECT S.OrganisationID
, S.Position - 1 AS Position
FROM SpacePositions S
WHERE S.Position != 0
UNION
SELECT O.OrganisationID
, LEN(O.OrganisationName) AS Position
FROM dbo.Organisations O
)
, FirstChunk AS
( SELECT D.OrganisationID
, 1 AS ChunkStart
, MAX(D.Position) AS ChunkEnd
FROM ( SELECT S.OrganisationID
, S.Position + 1 AS Position
FROM SplitPositions S
WHERE Position BETWEEN 1 AND @MaxLen
UNION
SELECT S.OrganisationID
, @MaxLen
FROM SplitPositions S
WHERE NOT EXISTS ( SELECT *
FROM SplitPositions SI
WHERE SI.Position BETWEEN 1 AND @MaxLen
AND SI.OrganisationID = S.OrganisationID
)
) D
GROUP BY D.OrganisationID
)
, SecondChunk AS
( SELECT C.OrganisationID
, C.ChunkEnd + 1 AS ChunkStart
, MAX(D.Position) AS ChunkEnd
FROM FirstChunk C
INNER JOIN ( SELECT S.OrganisationID
, S.Position + 1 AS Position
FROM SplitPositions S
INNER JOIN FirstChunk C ON C.OrganisationID = S.OrganisationID
WHERE S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
UNION
SELECT S.OrganisationID
, C.ChunkEnd + @MaxLen AS Position
FROM SplitPositions S
INNER JOIN FirstChunk C ON C.OrganisationID = S.OrganisationID
WHERE NOT EXISTS ( SELECT *
FROM SplitPositions SI
WHERE SI.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
AND OrganisationID = C.OrganisationID
)
) D ON D.OrganisationID = C.OrganisationID
GROUP BY C.OrganisationID, C.ChunkEnd
)
, ThirdChunk AS
( SELECT C.OrganisationID
, C.ChunkEnd + 1 AS ChunkStart
, MAX(D.Position) AS ChunkEnd
FROM SecondChunk C
INNER JOIN ( SELECT S.OrganisationID
, S.Position + 1 AS Position
FROM SplitPositions S
INNER JOIN SecondChunk C ON C.OrganisationID = S.OrganisationID
WHERE S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
UNION
SELECT S.OrganisationID
, C.ChunkEnd + @MaxLen AS Position
FROM SplitPositions S
INNER JOIN SecondChunk C ON C.OrganisationID = S.OrganisationID
WHERE NOT EXISTS ( SELECT *
FROM SplitPositions SI
WHERE SI.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
AND OrganisationID = C.OrganisationID
)
) D ON D.OrganisationID = C.OrganisationID
GROUP BY C.OrganisationID, C.ChunkEnd
)
INSERT INTO dbo.OrgStaging
( OrganisationID
, Name1
, Name2
, Name3 )
SELECT O.OrganisationID
, LTRIM(RTRIM(SUBSTRING(O.OrganisationName, C1.ChunkStart, C1.ChunkEnd)))
, LTRIM(RTRIM(SUBSTRING(O.OrganisationName, C2.ChunkStart, 1 + C2.ChunkEnd - C2.ChunkStart)))
, LTRIM(RTRIM(SUBSTRING(O.OrganisationName, C3.ChunkStart, 1 + C3.ChunkEnd - C3.ChunkStart)))
FROM dbo.Organisations O
INNER JOIN FirstChunk C1 ON C1.OrganisationID = O.OrganisationID
INNER JOIN SecondChunk C2 ON C2.OrganisationID = O.OrganisationID
INNER JOIN ThirdChunk C3 ON C3.OrganisationID = O.OrganisationID
ORDER BY O.OrganisationID;
There is still some compactness to be gained if you write JOIN
s with
just one ON
clause on a single line, but only when that JOIN
is
trivial (for instance on matching primary keys). Especially in the
final part, I find the symmetry of the JOIN
s to be clear.
Use LEFT JOIN
and COALESCE
for edge cases
Four times you add a UNION
on a subquery to account for an edge
case. Three of those are when you want to split on @MaxLen
,
because there is no shorter match. But there is another way to do
that.
In SQL, missing data is represented as a NULL
value. When we use
an INNER JOIN
, those NULL
s disappear, because we can only join
on data that we know. Adding missing data afterwards through a
UNION
and a back reference (querying the same data but asking
where it is missing) is possible. But we can also just take those
NULL
s with an OUTER JOIN
(mostly LEFT
or RIGHT
), and tell
SQL to replace missing values by something else, using COALESCE
.
In the Chunk CTEs
In the FirstChunk, we only know that something is missing if we know of all the organisations, so we need to select the data from the origin as well:
, FirstChunk AS
( SELECT O.OrganisationID
, 1 AS ChunkStart
, COALESCE(MAX(D.Position), @MaxLen) AS ChunkEnd
FROM dbo.Organisations O
LEFT JOIN ( SELECT S.OrganisationID
, S.Position + 1 AS Position
FROM SplitPositions S
WHERE Position BETWEEN 1 AND @MaxLen
) D ON D.OrganisationID = O.OrganisationID
GROUP BY O.OrganisationID
)
Notice that we don't select D.OrganisationID
for the first column
any more, because that can also be NULL
if we can't split. That
also means that we need to GROUP BY
the newly selected value.
In the other two CTEs, we can just take the existing values of the previous CTEs:
, SecondChunk AS
( SELECT C.OrganisationID
, C.ChunkEnd + 1 AS ChunkStart
, COALESCE(MAX(D.Position), C.ChunkEnd + @MaxLen) AS ChunkEnd
FROM FirstChunk C
LEFT JOIN ( SELECT S.OrganisationID
, S.Position + 1 AS Position
FROM SplitPositions S
INNER JOIN FirstChunk C ON C.OrganisationID = S.OrganisationID
WHERE S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
) D ON D.OrganisationID = C.OrganisationID
GROUP BY C.OrganisationID, C.ChunkEnd
)
, ThirdChunk AS
( SELECT C.OrganisationID
, C.ChunkEnd + 1 AS ChunkStart
, COALESCE(MAX(D.Position), C.ChunkEnd + @MaxLen) AS ChunkEnd
FROM SecondChunk C
LEFT JOIN ( SELECT S.OrganisationID
, S.Position + 1 AS Position
FROM SplitPositions S
INNER JOIN SecondChunk C ON C.OrganisationID = S.OrganisationID
WHERE S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
) D ON D.OrganisationID = C.OrganisationID
GROUP BY C.OrganisationID, C.ChunkEnd
)
Now the query is more compact, but also faster, because we add the edge cases in the same "swoop".
In the SpacePositions and SplitPositions CTEs
Yes, in the -Positions CTEs we do this as well, although this is a bit less clear at first, and will probably not gain as much in readability or performance. For completeness, I will explain it here as well.
The edge case is the final position. It does not contain a space,
but needs to be taken into account as well. Or does it? Now that
we take @MaxLen
where there is no space to split on, we will take
@MaxLen
also when there is no space left to split on. Which
may or may not be past the end of the input string. Lets see what
happens if we just remove the SplitPositions
CTE. I will add
some additional test data to see what happens:
INSERT INTO Organisations ([OrganisationName])
SELECT SUBSTRING(OrganisationName, 1, 180)
FROM ( SELECT 'Microsoft Corporation'
UNION ALL SELECT 'S&T System Integration & Technology Distribution Aktiengesellschaft'
UNION ALL SELECT 'VeryLongOrganisationNameThatWillHaveToBeSplitWithoutASpace Because It Really Is A Long Name, But In The Second Column We Can Split It'
UNION ALL SELECT 'Another VeryLongOrganisationNameThatWillHaveToBeSplitWithoutASpaceButOnlyInTheSecondColumn, Because It Really Is A Long Name'
UNION ALL SELECT 'AnotherVeryLongOrganisationNameThatWillHaveToBeSplitWithoutASpaceBecauseItReallyIsALongNameButNowItEvenExceedsTheLimitOfAllThreeColumnsWithAMaximumLenghtOf50Characters(WhichIsACombinedTotalOf150Characters)AndNowWeDon''tHaveAnythingToPutInTheLastBox'
UNION ALL SELECT 'OneWordOnly'
UNION ALL SELECT 'A' -- Single letter edge case
UNION ALL SELECT '' -- Empty string edge case
) Data(OrganisationName);
Now when, after the query, we run
SELECT *, LEN(Name1), LEN(Name2), LEN(Name3) FROM dbo.OrgStaging;
The results are:
| OrganisationID | Name1 | Name2 | Name3 | | | |
|----------------|----------------------------------------------------|----------------------------------------------------|----------------------------------------------------|----|----|----|
| 1 | Microsoft Corporation | | | 21 | 0 | 0 |
| 2 | S&T System Integration & Technology Distribution | Aktiengesellschaft | | 48 | 18 | 0 |
| 3 | VeryLongOrganisationNameThatWillHaveToBeSplitWitho | utASpace Because It Really Is A Long Name, But In | The Second Column We Can Split It | 50 | 49 | 33 |
| 4 | Another | VeryLongOrganisationNameThatWillHaveToBeSplitWitho | utASpaceButOnlyInTheSecondColumn, Because It | 7 | 50 | 44 |
| 5 | AnotherVeryLongOrganisationNameThatWillHaveToBeSpl | itWithoutASpaceBecauseItReallyIsALongNameButNowItE | venExceedsTheLimitOfAllThreeColumnsWithAMaximumLen | 50 | 50 | 50 |
| 6 | OneWordOnly | | | 11 | 0 | 0 |
| 7 | A | | | 1 | 0 | 0 |
| 8 | | | | 0 | 0 | 0 |
Now lets remove the SplitPositions CTE, and add the - 1
to the
SpacePositions CTE. Furthermore, we replace all references to
SplitPositions to refer to SpacePositions (of course).
WITH SpacePositions AS
( SELECT O.OrganisationID
, CHARINDEX(' ', O.OrganisationName, 0) - 1 AS Position
FROM dbo.Organisations O
UNION ALL
SELECT O.OrganisationID
, CHARINDEX(' ', O.OrganisationName, S.Position + 2) - 1 AS Position
FROM dbo.Organisations O
INNER JOIN SpacePositions S
ON CHARINDEX(' ', O.OrganisationName, S.Position + 2) - 1 > S.Position
AND S.OrganisationID = O.OrganisationID
)
, FirstChunk AS
Which yields:
| OrganisationID | Name1 | Name2 | Name3 | | | |
|----------------|----------------------------------------------------|----------------------------------------------------|----------------------------------------------------|----|----|----|
| 1 | Microsoft | Corporation | | 9 | 11 | 0 |
| 2 | S&T System Integration & Technology Distribution | Aktiengesellschaft | | 48 | 18 | 0 |
| 3 | VeryLongOrganisationNameThatWillHaveToBeSplitWitho | utASpace Because It Really Is A Long Name, But In | The Second Column We Can Split | 50 | 49 | 30 |
| 4 | Another | VeryLongOrganisationNameThatWillHaveToBeSplitWitho | utASpaceButOnlyInTheSecondColumn, Because It | 7 | 50 | 44 |
| 5 | AnotherVeryLongOrganisationNameThatWillHaveToBeSpl | itWithoutASpaceBecauseItReallyIsALongNameButNowItE | venExceedsTheLimitOfAllThreeColumnsWithAMaximumLen | 50 | 50 | 50 |
| 6 | OneWordOnly | | | 11 | 0 | 0 |
| 7 | A | | | 1 | 0 | 0 |
| 8 | | | | 0 | 0 | 0 |
Looks good to me :)
Name1
,Name2
,Name3
? Is there also, or will there ever be Name4, Name5, etc...? Are you sure that's how you want to structure it? Perhaps a one-to-many relationship is something to consider? \$\endgroup\$