Split string in chunks preferable at spaces

Question 1

I'm working on a tool to import data from one database to another. One requirement is that I have to split a string from one source field into three (shorter) fields at the target. If possible the string should be split at a space character. If the string doesn't fit completely into the target fields, the rest can be omitted.

Usually I would solve this using an UDF but unfortunately neither UDF's nor Stored Procedures are allowed in my scenario.

My source database has the following table:

CREATE TABLE dbo.Organisations
(
 OrganisationID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
 OrganisationName nvarchar(180) NOT NULL
 /* More columns omitted for brevity */
)

This table contains company names, for example:

OrganisationID | OrganisationName
---------------+--------------------------------------------------------------------
 1 | Microsoft Corporation
 2 | S&T System Integration & Technology Distribution Aktiengesellschaft

During the import the records of this table should be inserted into a stating table in the target database. The staging table looks like this:

CREATE TABLE dbo.OrgStaging
(
 OrganisationID int NOT NULL,
 Name1 nvarchar(50) NOT NULL,
 Name2 nvarchar(50) NOT NULL,
 Name3 nvarchar(50) NOT NULL
 /* More columns omitted for brevity */
)

If I would simply use SUBSTRING to split the name I would end up in the staging table like this:

OrganisationID | Name1 | Name2
---------------+---------------------------------------------------+-----------------
 1 | Microsoft Corporation |
 2 | S&T System Integration & Technology Distribution A|ktiengesellschaft

But I don't want to split in the middle of a word so I would like to have the result like this:

OrganisationID | Name1 | Name2
---------------+---------------------------------------------------+------------------
 1 | Microsoft Corporation |
 2 | S&T System Integration & Technology Distribution |Aktiengesellschaft

To achieve this I came up with the following rather complex query:

DECLARE
 @MaxLen int = 50; -- Maximum length of a target column
WITH
 SpacePositions AS
 (
 SELECT
 O.OrganisationID,
 CHARINDEX(' ', O.OrganisationName, 0) AS Position
 FROM
 SourceDB.dbo.Organisations O
 UNION ALL SELECT
 O.OrganisationID,
 CHARINDEX(' ', O.OrganisationName, S.Position + 1) AS Position
 FROM
 SourceDB.dbo.Organisations O
 INNER JOIN SpacePositions S
 ON CHARINDEX(' ', O.OrganisationName, S.Position + 1) > S.Position
 AND S.OrganisationID = O.OrganisationID
 ),
 SplitPositions AS 
 (
 SELECT
 S.OrganisationID,
 S.Position - 1 AS Position
 FROM
 SpacePositions S
 WHERE
 S.Position != 0
 UNION SELECT
 O.OrganisationID,
 LEN(O.OrganisationName) AS Position
 FROM
 SourceDB.dbo.Organisations O
 ),
 FirstChunk AS
 (
 SELECT
 D.OrganisationID,
 1 AS ChunkStart,
 MAX(D.Position) AS ChunkEnd
 FROM
 (
 SELECT
 S.OrganisationID,
 S.Position + 1 AS Position
 FROM
 SplitPositions S
 WHERE
 Position BETWEEN 1 AND @MaxLen
 UNION SELECT
 S.OrganisationID,
 @MaxLen
 FROM
 SplitPositions S
 WHERE
 NOT EXISTS
 (
 SELECT
 *
 FROM
 SplitPositions SI
 WHERE
 SI.Position BETWEEN 1 AND @MaxLen
 AND SI.OrganisationID = S.OrganisationID
 )
 ) D
 GROUP BY
 D.OrganisationID
 ),
 SecondChunk AS
 (
 SELECT
 C.OrganisationID,
 C.ChunkEnd + 1 AS ChunkStart,
 MAX(D.Position) AS ChunkEnd
 FROM
 FirstChunk C
 INNER JOIN
 (
 SELECT
 S.OrganisationID,
 S.Position + 1 AS Position
 FROM
 SplitPositions S
 INNER JOIN FirstChunk C
 ON C.OrganisationID = S.OrganisationID
 WHERE
 S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
 UNION SELECT
 S.OrganisationID,
 C.ChunkEnd + @MaxLen AS Position
 FROM
 SplitPositions S
 INNER JOIN FirstChunk C
 ON C.OrganisationID = S.OrganisationID
 WHERE
 NOT EXISTS
 (
 SELECT
 *
 FROM
 SplitPositions SI
 WHERE
 SI.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
 AND OrganisationID = C.OrganisationID
 )
 ) D
 ON D.OrganisationID = C.OrganisationID
 GROUP BY
 C.OrganisationID,
 C.ChunkEnd
 ),
 ThirdChunk AS
 (
 SELECT
 C.OrganisationID,
 C.ChunkEnd + 1 AS ChunkStart,
 MAX(D.Position) AS ChunkEnd
 FROM
 SecondChunk C
 INNER JOIN
 (
 SELECT
 S.OrganisationID,
 S.Position + 1 AS Position
 FROM
 SplitPositions S
 INNER JOIN SecondChunk C
 ON C.OrganisationID = S.OrganisationID
 WHERE
 S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
 UNION SELECT
 S.OrganisationID,
 C.ChunkEnd + @MaxLen AS Position
 FROM
 SplitPositions S
 INNER JOIN SecondChunk C
 ON C.OrganisationID = S.OrganisationID
 WHERE
 NOT EXISTS
 (
 SELECT
 *
 FROM
 SplitPositions SI
 WHERE
 SI.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
 AND OrganisationID = C.OrganisationID
 )
 ) D
 ON D.OrganisationID = C.OrganisationID
 GROUP BY
 C.OrganisationID,
 C.ChunkEnd
 )
INSERT INTO dbo.OrgStaging
 (
 OrganisationID,
 Name1,
 Name2,
 Name3
 )
SELECT
 O.OrganisationID,
 LTRIM(RTRIM(SUBSTRING(O.OrganisationName, C1.ChunkStart, C1.ChunkEnd))),
 LTRIM(RTRIM(SUBSTRING(O.OrganisationName, C2.ChunkStart, 1 + C2.ChunkEnd - C2.ChunkStart))),
 LTRIM(RTRIM(SUBSTRING(O.OrganisationName, C3.ChunkStart, 1 + C3.ChunkEnd - C3.ChunkStart)))
 FROM
 SourceDB.dbo.Organisations O
 INNER JOIN FirstChunk C1
 ON C1.OrganisationID = O.OrganisationID
 INNER JOIN SecondChunk C2
 ON C2.OrganisationID = O.OrganisationID
 INNER JOIN ThirdChunk C3
 ON C3.OrganisationID = O.OrganisationID
 ORDER BY
 O.OrganisationID;

It works as desired but I'm wondering if this can be stated a little more compact. I tried to combine the CTE's FirstChunk, SecondChunk and ThirdChunk into one recursive CTE but that is not working because of the GROUP BY clause which is not allowed in recursive CTEs.

Can this be restated more compact or is it already the best I can get?

Question 2

Why do you have Name1, Name2, Name3? Is there also, or will there ever be Name4, Name5, etc...? Are you sure that's how you want to structure it? Perhaps a one-to-many relationship is something to consider?

Question 3

What is the reason you split it up in the first place? Why not keep it as one column?

Question 4

@SimonAndréForsberg the fields in the stating table just reflect the fields given in the target database. Both databases belong to already existing and more or less widespread applicatons and therefore cannot be changed. The idea behind my import is that data which is entered in one application (the source database) doesn't need to be reentered in the second application (the target database).

Question 5

Still looking at writing a full answer for this, but I notice a // comment in the query, T-SQL uses -- for comments.

Question 6

Okay, so I think the best way to do this would be to split the string at EVERY space. Then you can rebuild it from the individual words, stopping if you reach 50 characters. I'll try and get a proper answer up soon, just about to leave work so I don't really have the time.

Question 7

A nitpick

This first point will not really change the query, and it is probably only added for testing purposes, but an ORDER BY with an INSERT INTO statement does not really do anything useful (unless you insert into a table with an IDENTITY column).

Us a more compact style

It looks like you either have a very strong policy on style and layout, or used an auto-formatter, because in some places, SELECT * is written in two lines. That is good, because the query is at least well formatted and readable. If you want it to be more compact, though, you might want to skimp a bit on the newlines.

When formatting code more compactly, there is still "breathing space", but the scroll-factor is tuned down a bit, so you have a bit higher view of the code. I find that it can help.

I like to use a combination of indentation to recognize the query parts (my indentation is by no means the default in SQL) and a thing I call "one concept per line", where each line tells me something that can stand on its own for logic.

Not indenting UNION s helps to see the "equal level" of both sides of the UNION, and prevents the lines from getting too long.

DECLARE @MaxLen int = 50; -- Maximum length of a target column
 WITH SpacePositions AS
 ( SELECT O.OrganisationID
 , CHARINDEX(' ', O.OrganisationName, 0) AS Position
 FROM dbo.Organisations O
 UNION ALL
 SELECT O.OrganisationID
 , CHARINDEX(' ', O.OrganisationName, S.Position + 1) AS Position
 FROM dbo.Organisations O
 INNER JOIN SpacePositions S
 ON CHARINDEX(' ', O.OrganisationName, S.Position + 1) > S.Position
 AND S.OrganisationID = O.OrganisationID
 )
 , SplitPositions AS
 ( SELECT S.OrganisationID
 , S.Position - 1 AS Position
 FROM SpacePositions S
 WHERE S.Position != 0
 UNION
 SELECT O.OrganisationID
 , LEN(O.OrganisationName) AS Position
 FROM dbo.Organisations O
 )
 , FirstChunk AS
 ( SELECT D.OrganisationID
 , 1 AS ChunkStart
 , MAX(D.Position) AS ChunkEnd
 FROM ( SELECT S.OrganisationID
 , S.Position + 1 AS Position
 FROM SplitPositions S
 WHERE Position BETWEEN 1 AND @MaxLen
 UNION
 SELECT S.OrganisationID
 , @MaxLen
 FROM SplitPositions S
 WHERE NOT EXISTS ( SELECT *
 FROM SplitPositions SI
 WHERE SI.Position BETWEEN 1 AND @MaxLen
 AND SI.OrganisationID = S.OrganisationID
 )
 ) D
 GROUP BY D.OrganisationID
 )
 , SecondChunk AS
 ( SELECT C.OrganisationID
 , C.ChunkEnd + 1 AS ChunkStart
 , MAX(D.Position) AS ChunkEnd
 FROM FirstChunk C
 INNER JOIN ( SELECT S.OrganisationID
 , S.Position + 1 AS Position
 FROM SplitPositions S
 INNER JOIN FirstChunk C ON C.OrganisationID = S.OrganisationID
 WHERE S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
 UNION
 SELECT S.OrganisationID
 , C.ChunkEnd + @MaxLen AS Position
 FROM SplitPositions S
 INNER JOIN FirstChunk C ON C.OrganisationID = S.OrganisationID
 WHERE NOT EXISTS ( SELECT *
 FROM SplitPositions SI
 WHERE SI.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
 AND OrganisationID = C.OrganisationID
 )
 ) D ON D.OrganisationID = C.OrganisationID
 GROUP BY C.OrganisationID, C.ChunkEnd
 )
 , ThirdChunk AS
 ( SELECT C.OrganisationID
 , C.ChunkEnd + 1 AS ChunkStart
 , MAX(D.Position) AS ChunkEnd
 FROM SecondChunk C
 INNER JOIN ( SELECT S.OrganisationID
 , S.Position + 1 AS Position
 FROM SplitPositions S
 INNER JOIN SecondChunk C ON C.OrganisationID = S.OrganisationID
 WHERE S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
 UNION
 SELECT S.OrganisationID
 , C.ChunkEnd + @MaxLen AS Position
 FROM SplitPositions S
 INNER JOIN SecondChunk C ON C.OrganisationID = S.OrganisationID
 WHERE NOT EXISTS ( SELECT *
 FROM SplitPositions SI
 WHERE SI.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
 AND OrganisationID = C.OrganisationID
 )
 ) D ON D.OrganisationID = C.OrganisationID
 GROUP BY C.OrganisationID, C.ChunkEnd
 )
INSERT INTO dbo.OrgStaging
 ( OrganisationID
 , Name1
 , Name2
 , Name3 )
SELECT O.OrganisationID
 , LTRIM(RTRIM(SUBSTRING(O.OrganisationName, C1.ChunkStart, C1.ChunkEnd)))
 , LTRIM(RTRIM(SUBSTRING(O.OrganisationName, C2.ChunkStart, 1 + C2.ChunkEnd - C2.ChunkStart)))
 , LTRIM(RTRIM(SUBSTRING(O.OrganisationName, C3.ChunkStart, 1 + C3.ChunkEnd - C3.ChunkStart)))
 FROM dbo.Organisations O
 INNER JOIN FirstChunk C1 ON C1.OrganisationID = O.OrganisationID
 INNER JOIN SecondChunk C2 ON C2.OrganisationID = O.OrganisationID
 INNER JOIN ThirdChunk C3 ON C3.OrganisationID = O.OrganisationID
 ORDER BY O.OrganisationID;

There is still some compactness to be gained if you write JOIN s with just one ON clause on a single line, but only when that JOIN is trivial (for instance on matching primary keys). Especially in the final part, I find the symmetry of the JOIN s to be clear.

Use `LEFT JOIN` and `COALESCE` for edge cases

Four times you add a UNION on a subquery to account for an edge case. Three of those are when you want to split on @MaxLen, because there is no shorter match. But there is another way to do that.

In SQL, missing data is represented as a NULL value. When we use an INNER JOIN, those NULL s disappear, because we can only join on data that we know. Adding missing data afterwards through a UNION and a back reference (querying the same data but asking where it is missing) is possible. But we can also just take those NULL s with an OUTER JOIN (mostly LEFT or RIGHT), and tell SQL to replace missing values by something else, using COALESCE.

In the Chunk CTEs

In the FirstChunk, we only know that something is missing if we know of all the organisations, so we need to select the data from the origin as well:

, FirstChunk AS
( SELECT O.OrganisationID
 , 1 AS ChunkStart
 , COALESCE(MAX(D.Position), @MaxLen) AS ChunkEnd
 FROM dbo.Organisations O
 LEFT JOIN ( SELECT S.OrganisationID
 , S.Position + 1 AS Position
 FROM SplitPositions S
 WHERE Position BETWEEN 1 AND @MaxLen
 ) D ON D.OrganisationID = O.OrganisationID
 GROUP BY O.OrganisationID
)

Notice that we don't select D.OrganisationID for the first column any more, because that can also be NULL if we can't split. That also means that we need to GROUP BY the newly selected value.

In the other two CTEs, we can just take the existing values of the previous CTEs:

, SecondChunk AS
( SELECT C.OrganisationID
 , C.ChunkEnd + 1 AS ChunkStart
 , COALESCE(MAX(D.Position), C.ChunkEnd + @MaxLen) AS ChunkEnd
 FROM FirstChunk C
 LEFT JOIN ( SELECT S.OrganisationID
 , S.Position + 1 AS Position
 FROM SplitPositions S
 INNER JOIN FirstChunk C ON C.OrganisationID = S.OrganisationID
 WHERE S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
 ) D ON D.OrganisationID = C.OrganisationID
 GROUP BY C.OrganisationID, C.ChunkEnd
)
, ThirdChunk AS
( SELECT C.OrganisationID
 , C.ChunkEnd + 1 AS ChunkStart
 , COALESCE(MAX(D.Position), C.ChunkEnd + @MaxLen) AS ChunkEnd
 FROM SecondChunk C
 LEFT JOIN ( SELECT S.OrganisationID
 , S.Position + 1 AS Position
 FROM SplitPositions S
 INNER JOIN SecondChunk C ON C.OrganisationID = S.OrganisationID
 WHERE S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
 ) D ON D.OrganisationID = C.OrganisationID
 GROUP BY C.OrganisationID, C.ChunkEnd
)

Now the query is more compact, but also faster, because we add the edge cases in the same "swoop".

In the SpacePositions and SplitPositions CTEs

Yes, in the -Positions CTEs we do this as well, although this is a bit less clear at first, and will probably not gain as much in readability or performance. For completeness, I will explain it here as well.

The edge case is the final position. It does not contain a space, but needs to be taken into account as well. Or does it? Now that we take @MaxLen where there is no space to split on, we will take @MaxLen also when there is no space left to split on. Which may or may not be past the end of the input string. Lets see what happens if we just remove the SplitPositions CTE. I will add some additional test data to see what happens:

INSERT INTO Organisations ([OrganisationName])
SELECT SUBSTRING(OrganisationName, 1, 180)
 FROM ( SELECT 'Microsoft Corporation'
 UNION ALL SELECT 'S&T System Integration & Technology Distribution Aktiengesellschaft'
 UNION ALL SELECT 'VeryLongOrganisationNameThatWillHaveToBeSplitWithoutASpace Because It Really Is A Long Name, But In The Second Column We Can Split It'
 UNION ALL SELECT 'Another VeryLongOrganisationNameThatWillHaveToBeSplitWithoutASpaceButOnlyInTheSecondColumn, Because It Really Is A Long Name'
 UNION ALL SELECT 'AnotherVeryLongOrganisationNameThatWillHaveToBeSplitWithoutASpaceBecauseItReallyIsALongNameButNowItEvenExceedsTheLimitOfAllThreeColumnsWithAMaximumLenghtOf50Characters(WhichIsACombinedTotalOf150Characters)AndNowWeDon''tHaveAnythingToPutInTheLastBox'
 UNION ALL SELECT 'OneWordOnly'
 UNION ALL SELECT 'A' -- Single letter edge case
 UNION ALL SELECT '' -- Empty string edge case
 ) Data(OrganisationName);

Now when, after the query, we run

SELECT *, LEN(Name1), LEN(Name2), LEN(Name3) FROM dbo.OrgStaging;

The results are:

| OrganisationID | Name1 | Name2 | Name3 | | | |
|----------------|----------------------------------------------------|----------------------------------------------------|----------------------------------------------------|----|----|----|
| 1 | Microsoft Corporation | | | 21 | 0 | 0 |
| 2 | S&T System Integration & Technology Distribution | Aktiengesellschaft | | 48 | 18 | 0 |
| 3 | VeryLongOrganisationNameThatWillHaveToBeSplitWitho | utASpace Because It Really Is A Long Name, But In | The Second Column We Can Split It | 50 | 49 | 33 |
| 4 | Another | VeryLongOrganisationNameThatWillHaveToBeSplitWitho | utASpaceButOnlyInTheSecondColumn, Because It | 7 | 50 | 44 |
| 5 | AnotherVeryLongOrganisationNameThatWillHaveToBeSpl | itWithoutASpaceBecauseItReallyIsALongNameButNowItE | venExceedsTheLimitOfAllThreeColumnsWithAMaximumLen | 50 | 50 | 50 |
| 6 | OneWordOnly | | | 11 | 0 | 0 |
| 7 | A | | | 1 | 0 | 0 |
| 8 | | | | 0 | 0 | 0 |

Now lets remove the SplitPositions CTE, and add the - 1 to the SpacePositions CTE. Furthermore, we replace all references to SplitPositions to refer to SpacePositions (of course).

WITH SpacePositions AS
 ( SELECT O.OrganisationID
 , CHARINDEX(' ', O.OrganisationName, 0) - 1 AS Position
 FROM dbo.Organisations O
 UNION ALL
 SELECT O.OrganisationID
 , CHARINDEX(' ', O.OrganisationName, S.Position + 2) - 1 AS Position
 FROM dbo.Organisations O
 INNER JOIN SpacePositions S
 ON CHARINDEX(' ', O.OrganisationName, S.Position + 2) - 1 > S.Position
 AND S.OrganisationID = O.OrganisationID
 )
 , FirstChunk AS

Which yields:

| OrganisationID | Name1 | Name2 | Name3 | | | |
|----------------|----------------------------------------------------|----------------------------------------------------|----------------------------------------------------|----|----|----|
| 1 | Microsoft | Corporation | | 9 | 11 | 0 |
| 2 | S&T System Integration & Technology Distribution | Aktiengesellschaft | | 48 | 18 | 0 |
| 3 | VeryLongOrganisationNameThatWillHaveToBeSplitWitho | utASpace Because It Really Is A Long Name, But In | The Second Column We Can Split | 50 | 49 | 30 |
| 4 | Another | VeryLongOrganisationNameThatWillHaveToBeSplitWitho | utASpaceButOnlyInTheSecondColumn, Because It | 7 | 50 | 44 |
| 5 | AnotherVeryLongOrganisationNameThatWillHaveToBeSpl | itWithoutASpaceBecauseItReallyIsALongNameButNowItE | venExceedsTheLimitOfAllThreeColumnsWithAMaximumLen | 50 | 50 | 50 |
| 6 | OneWordOnly | | | 11 | 0 | 0 |
| 7 | A | | | 1 | 0 | 0 |
| 8 | | | | 0 | 0 | 0 |

Looks good to me :)

joranvar joranvarjoranvar 9927 silver badges16 bronze badges · Accepted Answer · 2015-12-31 20:26:53Z

A nitpick

This first point will not really change the query, and it is probably only added for testing purposes, but an ORDER BY with an INSERT INTO statement does not really do anything useful (unless you insert into a table with an IDENTITY column).

Us a more compact style

It looks like you either have a very strong policy on style and layout, or used an auto-formatter, because in some places, SELECT * is written in two lines. That is good, because the query is at least well formatted and readable. If you want it to be more compact, though, you might want to skimp a bit on the newlines.

When formatting code more compactly, there is still "breathing space", but the scroll-factor is tuned down a bit, so you have a bit higher view of the code. I find that it can help.

I like to use a combination of indentation to recognize the query parts (my indentation is by no means the default in SQL) and a thing I call "one concept per line", where each line tells me something that can stand on its own for logic.

Not indenting UNION s helps to see the "equal level" of both sides of the UNION, and prevents the lines from getting too long.

DECLARE @MaxLen int = 50; -- Maximum length of a target column
 WITH SpacePositions AS
 ( SELECT O.OrganisationID
 , CHARINDEX(' ', O.OrganisationName, 0) AS Position
 FROM dbo.Organisations O
 UNION ALL
 SELECT O.OrganisationID
 , CHARINDEX(' ', O.OrganisationName, S.Position + 1) AS Position
 FROM dbo.Organisations O
 INNER JOIN SpacePositions S
 ON CHARINDEX(' ', O.OrganisationName, S.Position + 1) > S.Position
 AND S.OrganisationID = O.OrganisationID
 )
 , SplitPositions AS
 ( SELECT S.OrganisationID
 , S.Position - 1 AS Position
 FROM SpacePositions S
 WHERE S.Position != 0
 UNION
 SELECT O.OrganisationID
 , LEN(O.OrganisationName) AS Position
 FROM dbo.Organisations O
 )
 , FirstChunk AS
 ( SELECT D.OrganisationID
 , 1 AS ChunkStart
 , MAX(D.Position) AS ChunkEnd
 FROM ( SELECT S.OrganisationID
 , S.Position + 1 AS Position
 FROM SplitPositions S
 WHERE Position BETWEEN 1 AND @MaxLen
 UNION
 SELECT S.OrganisationID
 , @MaxLen
 FROM SplitPositions S
 WHERE NOT EXISTS ( SELECT *
 FROM SplitPositions SI
 WHERE SI.Position BETWEEN 1 AND @MaxLen
 AND SI.OrganisationID = S.OrganisationID
 )
 ) D
 GROUP BY D.OrganisationID
 )
 , SecondChunk AS
 ( SELECT C.OrganisationID
 , C.ChunkEnd + 1 AS ChunkStart
 , MAX(D.Position) AS ChunkEnd
 FROM FirstChunk C
 INNER JOIN ( SELECT S.OrganisationID
 , S.Position + 1 AS Position
 FROM SplitPositions S
 INNER JOIN FirstChunk C ON C.OrganisationID = S.OrganisationID
 WHERE S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
 UNION
 SELECT S.OrganisationID
 , C.ChunkEnd + @MaxLen AS Position
 FROM SplitPositions S
 INNER JOIN FirstChunk C ON C.OrganisationID = S.OrganisationID
 WHERE NOT EXISTS ( SELECT *
 FROM SplitPositions SI
 WHERE SI.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
 AND OrganisationID = C.OrganisationID
 )
 ) D ON D.OrganisationID = C.OrganisationID
 GROUP BY C.OrganisationID, C.ChunkEnd
 )
 , ThirdChunk AS
 ( SELECT C.OrganisationID
 , C.ChunkEnd + 1 AS ChunkStart
 , MAX(D.Position) AS ChunkEnd
 FROM SecondChunk C
 INNER JOIN ( SELECT S.OrganisationID
 , S.Position + 1 AS Position
 FROM SplitPositions S
 INNER JOIN SecondChunk C ON C.OrganisationID = S.OrganisationID
 WHERE S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
 UNION
 SELECT S.OrganisationID
 , C.ChunkEnd + @MaxLen AS Position
 FROM SplitPositions S
 INNER JOIN SecondChunk C ON C.OrganisationID = S.OrganisationID
 WHERE NOT EXISTS ( SELECT *
 FROM SplitPositions SI
 WHERE SI.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
 AND OrganisationID = C.OrganisationID
 )
 ) D ON D.OrganisationID = C.OrganisationID
 GROUP BY C.OrganisationID, C.ChunkEnd
 )
INSERT INTO dbo.OrgStaging
 ( OrganisationID
 , Name1
 , Name2
 , Name3 )
SELECT O.OrganisationID
 , LTRIM(RTRIM(SUBSTRING(O.OrganisationName, C1.ChunkStart, C1.ChunkEnd)))
 , LTRIM(RTRIM(SUBSTRING(O.OrganisationName, C2.ChunkStart, 1 + C2.ChunkEnd - C2.ChunkStart)))
 , LTRIM(RTRIM(SUBSTRING(O.OrganisationName, C3.ChunkStart, 1 + C3.ChunkEnd - C3.ChunkStart)))
 FROM dbo.Organisations O
 INNER JOIN FirstChunk C1 ON C1.OrganisationID = O.OrganisationID
 INNER JOIN SecondChunk C2 ON C2.OrganisationID = O.OrganisationID
 INNER JOIN ThirdChunk C3 ON C3.OrganisationID = O.OrganisationID
 ORDER BY O.OrganisationID;

There is still some compactness to be gained if you write JOIN s with just one ON clause on a single line, but only when that JOIN is trivial (for instance on matching primary keys). Especially in the final part, I find the symmetry of the JOIN s to be clear.

Use `LEFT JOIN` and `COALESCE` for edge cases

Four times you add a UNION on a subquery to account for an edge case. Three of those are when you want to split on @MaxLen, because there is no shorter match. But there is another way to do that.

In SQL, missing data is represented as a NULL value. When we use an INNER JOIN, those NULL s disappear, because we can only join on data that we know. Adding missing data afterwards through a UNION and a back reference (querying the same data but asking where it is missing) is possible. But we can also just take those NULL s with an OUTER JOIN (mostly LEFT or RIGHT), and tell SQL to replace missing values by something else, using COALESCE.

In the Chunk CTEs

In the FirstChunk, we only know that something is missing if we know of all the organisations, so we need to select the data from the origin as well:

, FirstChunk AS
( SELECT O.OrganisationID
 , 1 AS ChunkStart
 , COALESCE(MAX(D.Position), @MaxLen) AS ChunkEnd
 FROM dbo.Organisations O
 LEFT JOIN ( SELECT S.OrganisationID
 , S.Position + 1 AS Position
 FROM SplitPositions S
 WHERE Position BETWEEN 1 AND @MaxLen
 ) D ON D.OrganisationID = O.OrganisationID
 GROUP BY O.OrganisationID
)

Notice that we don't select D.OrganisationID for the first column any more, because that can also be NULL if we can't split. That also means that we need to GROUP BY the newly selected value.

In the other two CTEs, we can just take the existing values of the previous CTEs:

, SecondChunk AS
( SELECT C.OrganisationID
 , C.ChunkEnd + 1 AS ChunkStart
 , COALESCE(MAX(D.Position), C.ChunkEnd + @MaxLen) AS ChunkEnd
 FROM FirstChunk C
 LEFT JOIN ( SELECT S.OrganisationID
 , S.Position + 1 AS Position
 FROM SplitPositions S
 INNER JOIN FirstChunk C ON C.OrganisationID = S.OrganisationID
 WHERE S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
 ) D ON D.OrganisationID = C.OrganisationID
 GROUP BY C.OrganisationID, C.ChunkEnd
)
, ThirdChunk AS
( SELECT C.OrganisationID
 , C.ChunkEnd + 1 AS ChunkStart
 , COALESCE(MAX(D.Position), C.ChunkEnd + @MaxLen) AS ChunkEnd
 FROM SecondChunk C
 LEFT JOIN ( SELECT S.OrganisationID
 , S.Position + 1 AS Position
 FROM SplitPositions S
 INNER JOIN SecondChunk C ON C.OrganisationID = S.OrganisationID
 WHERE S.Position BETWEEN C.ChunkEnd + 1 AND C.ChunkEnd + @MaxLen
 ) D ON D.OrganisationID = C.OrganisationID
 GROUP BY C.OrganisationID, C.ChunkEnd
)

Now the query is more compact, but also faster, because we add the edge cases in the same "swoop".

In the SpacePositions and SplitPositions CTEs

Yes, in the -Positions CTEs we do this as well, although this is a bit less clear at first, and will probably not gain as much in readability or performance. For completeness, I will explain it here as well.

The edge case is the final position. It does not contain a space, but needs to be taken into account as well. Or does it? Now that we take @MaxLen where there is no space to split on, we will take @MaxLen also when there is no space left to split on. Which may or may not be past the end of the input string. Lets see what happens if we just remove the SplitPositions CTE. I will add some additional test data to see what happens:

INSERT INTO Organisations ([OrganisationName])
SELECT SUBSTRING(OrganisationName, 1, 180)
 FROM ( SELECT 'Microsoft Corporation'
 UNION ALL SELECT 'S&T System Integration & Technology Distribution Aktiengesellschaft'
 UNION ALL SELECT 'VeryLongOrganisationNameThatWillHaveToBeSplitWithoutASpace Because It Really Is A Long Name, But In The Second Column We Can Split It'
 UNION ALL SELECT 'Another VeryLongOrganisationNameThatWillHaveToBeSplitWithoutASpaceButOnlyInTheSecondColumn, Because It Really Is A Long Name'
 UNION ALL SELECT 'AnotherVeryLongOrganisationNameThatWillHaveToBeSplitWithoutASpaceBecauseItReallyIsALongNameButNowItEvenExceedsTheLimitOfAllThreeColumnsWithAMaximumLenghtOf50Characters(WhichIsACombinedTotalOf150Characters)AndNowWeDon''tHaveAnythingToPutInTheLastBox'
 UNION ALL SELECT 'OneWordOnly'
 UNION ALL SELECT 'A' -- Single letter edge case
 UNION ALL SELECT '' -- Empty string edge case
 ) Data(OrganisationName);

Now when, after the query, we run

SELECT *, LEN(Name1), LEN(Name2), LEN(Name3) FROM dbo.OrgStaging;

The results are:

| OrganisationID | Name1 | Name2 | Name3 | | | |
|----------------|----------------------------------------------------|----------------------------------------------------|----------------------------------------------------|----|----|----|
| 1 | Microsoft Corporation | | | 21 | 0 | 0 |
| 2 | S&T System Integration & Technology Distribution | Aktiengesellschaft | | 48 | 18 | 0 |
| 3 | VeryLongOrganisationNameThatWillHaveToBeSplitWitho | utASpace Because It Really Is A Long Name, But In | The Second Column We Can Split It | 50 | 49 | 33 |
| 4 | Another | VeryLongOrganisationNameThatWillHaveToBeSplitWitho | utASpaceButOnlyInTheSecondColumn, Because It | 7 | 50 | 44 |
| 5 | AnotherVeryLongOrganisationNameThatWillHaveToBeSpl | itWithoutASpaceBecauseItReallyIsALongNameButNowItE | venExceedsTheLimitOfAllThreeColumnsWithAMaximumLen | 50 | 50 | 50 |
| 6 | OneWordOnly | | | 11 | 0 | 0 |
| 7 | A | | | 1 | 0 | 0 |
| 8 | | | | 0 | 0 | 0 |

Now lets remove the SplitPositions CTE, and add the - 1 to the SpacePositions CTE. Furthermore, we replace all references to SplitPositions to refer to SpacePositions (of course).

WITH SpacePositions AS
 ( SELECT O.OrganisationID
 , CHARINDEX(' ', O.OrganisationName, 0) - 1 AS Position
 FROM dbo.Organisations O
 UNION ALL
 SELECT O.OrganisationID
 , CHARINDEX(' ', O.OrganisationName, S.Position + 2) - 1 AS Position
 FROM dbo.Organisations O
 INNER JOIN SpacePositions S
 ON CHARINDEX(' ', O.OrganisationName, S.Position + 2) - 1 > S.Position
 AND S.OrganisationID = O.OrganisationID
 )
 , FirstChunk AS

Which yields:

| OrganisationID | Name1 | Name2 | Name3 | | | |
|----------------|----------------------------------------------------|----------------------------------------------------|----------------------------------------------------|----|----|----|
| 1 | Microsoft | Corporation | | 9 | 11 | 0 |
| 2 | S&T System Integration & Technology Distribution | Aktiengesellschaft | | 48 | 18 | 0 |
| 3 | VeryLongOrganisationNameThatWillHaveToBeSplitWitho | utASpace Because It Really Is A Long Name, But In | The Second Column We Can Split | 50 | 49 | 30 |
| 4 | Another | VeryLongOrganisationNameThatWillHaveToBeSplitWitho | utASpaceButOnlyInTheSecondColumn, Because It | 7 | 50 | 44 |
| 5 | AnotherVeryLongOrganisationNameThatWillHaveToBeSpl | itWithoutASpaceBecauseItReallyIsALongNameButNowItE | venExceedsTheLimitOfAllThreeColumnsWithAMaximumLen | 50 | 50 | 50 |
| 6 | OneWordOnly | | | 11 | 0 | 0 |
| 7 | A | | | 1 | 0 | 0 |
| 8 | | | | 0 | 0 | 0 |

Looks good to me :)

Stack Exchange Network

Split string in chunks preferable at spaces

1 Answer 1

A nitpick

Us a more compact style

Use `LEFT JOIN` and `COALESCE` for edge cases

In the Chunk CTEs

In the SpacePositions and SplitPositions CTEs

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Split string in chunks preferable at spaces

1 Answer 1

A nitpick

Us a more compact style

Use LEFT JOIN and COALESCE for edge cases

In the Chunk CTEs

In the SpacePositions and SplitPositions CTEs

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Use `LEFT JOIN` and `COALESCE` for edge cases