I have a flat-text data file containing records whose fields are separated by the non-printable character "File Separator" (0x1c). I am trying to use SQL Server's bcp utility to load this data into my database. Yet when I use the hex-encoded value of the File Separator as TERMINATOR, I get a syntax error.
I have tried using
- the hex encoded value: "0x1c"
- the XML encoded value as hex: "&#x1c;"
- the XML encoded value as decimal: "&#28;"
None of these work, yet when using the same encodings for a printable character like tab, they do work: both &#x09; and &#9; are accepted ("0x9" isn't. Unsurprisingly, since this is an XML file.)
The conclusion seems to be that non-printable characters are not supported. Is this the case? That would be ironic, since the non-printable separator characters were created exactly for this purpose...
Below you can find all code to reproduce this issue:
XML format file: test.xml
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="" MAX_LENGTH="10" COLLATION="Latin1_General_CS_AS_WS"/>
<FIELD ID="2" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="41"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="COL1" xsi:type="SQLVARYCHAR"/>
<COLUMN SOURCE="2" NAME="COL2" xsi:type="SQLNUMERIC" PRECISION="4" SCALE="0"/>
</ROW>
</BCPFORMAT>
Data file: test.txt
This is just one row, as a test case. StackExchange doesn't show the separator in the row below, yet when you click "Edit" for this post, the separator is included, and you should be able to copy-paste this.
1111111112008
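Since the separator is invisible, here is a small PowerShell sketch to recreate the data file (an assumption based on the format file above: the 0x1C byte sits between the nine-digit first field and "2008", and the row ends with \r\n):
# Write test.txt with the invisible File Separator (0x1C) between the two fields.
Set-Content -NoNewline -Encoding Ascii test.txt ("111111111" + [char]0x1C + "2008" + "`r`n")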
Command line
bcp TEST_DB.dbo.UL_TEST in "test.txt" -T -f "test.xml"
2 Answers
It seems, from the documentation about BCP - Specify Field and Row Terminators (SQL Server), that non-printable characters are not supported:
Characters Supported As Terminators
The bcp command, BULK INSERT statement, and the OPENROWSET bulk rowset provider support a variety of characters as field or row terminators and always look for the first instance of each terminator. The following table lists the supported characters for terminators.
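The table itself is not reproduced here; the relevant entries are "any printable character (control characters are not printable, except null)" and strings of up to 10 printable characters, plus a few escape sequences such as \t and \n. Arbitrary control characters such as 0x1C are not among them.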
As an option, you could use PowerShell to read in the source file, search/replace the 0x1C characters with some other character like a tilde (~), and output to a different file. Then use that character as your terminator.
(Get-Content c:\test\test.txt) | ForEach-Object {$_ -replace [char]0x1C,'~'} | Set-Content c:\test\testout.txt
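The format file from the question would then be adjusted to use the replacement character as the terminator for the first field (a sketch, reusing the question's own FIELD definition):
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="~" MAX_LENGTH="10" COLLATION="Latin1_General_CS_AS_WS"/>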
Comments:
- How arbitrary... Printable characters only, for something that has nothing to do with printing? – TT. (Jun 7, 2018)
- @ScottHodgin: The documentation indeed implies that non-printable characters can't be used. It's weird, it sucks, and it should be fixed, but this is still a valid answer to the question :) Thanks – Wouter (Jun 8, 2018)
- Thanks for the PowerShell. In the meantime I had already worked out something similar, yet slightly more elaborate. Posted it as a separate answer. – Wouter (Jun 8, 2018)
- After some testing, I've figured out that only printable ASCII characters are supported, so up to hex 127 (0x7F). Any other character in the cp1252 set, like ,ドル is not supported, and neither is any other UTF-8 character. – Wouter (Jun 8, 2018)
- @Wouter - yeah, bummer! Glad you solved it another way :) – Scott Hodgin (Jun 8, 2018)
Since @ScottHodgin pointed out that non-printable characters are not supported, replacing them in the source file is the only alternative.
Below you can find an example PowerShell script that makes this replacement on a file encoded in UTF-8 (with or without a BOM header; it doesn't matter) and writes an output file encoded in UTF-8 without a BOM header.
Additionally, it uses AppendAllLines to turn this into a streaming operation. This way it also works for huge files, since they don't have to be loaded entirely into memory before doing the replacements. -ReadCount 1000 speeds up the process drastically.
# The output encoding: UTF-8 without a BOM header.
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
# Stream the input in batches of 1000 lines to keep memory usage low.
Get-Content -Encoding UTF8 -ReadCount 1000 test.txt |
ForEach-Object {
    # Replace every 0x1C in the batch (the emoji exercises the UTF-8
    # output path) and append it to a separate output file; appending to
    # the input file itself would corrupt it while it is still being read.
    # The name test_out.txt is just an example.
    [System.IO.File]::AppendAllLines(
        [string]'test_out.txt'
        , [string[]]($_ -replace '\x1C', '😎')
        , $Utf8NoBomEncoding
    )
}
Make sure you save the script above in a file that is encoded as UTF-8 with a BOM header; otherwise PowerShell doesn't process the characters as UTF-8.
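To verify the replacement before running bcp, here is a quick sanity check (a sketch; it assumes the test_out.txt output name used above):
# Count any remaining 0x1C characters in the output; this should print 0.
([regex]::Matches((Get-Content -Raw -Encoding UTF8 test_out.txt), '\x1C')).Count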
Ironically, the documentation for the TERMINATOR attribute in XML format files still claims: "This attribute specifies the terminator of a data field. The terminator can be any character."