I receive several CSV files each quarter to process through an SSIS package, and the formatting is inconsistent. This quarter, I encountered what I term optional text qualifiers--text qualifer of double quote (") used at times, but not always. We appear not to have any sway over the people who provide this data to convince them to provide clean, consistent data.
Example of the scenario.
Bob,7823984234,"Lions, Tigers, Bears",This
Fred,67961654,"Little,Mermaid, ",That
George,4568,"Things,Stuff",The Other Thing
Barney,324578,"Items,",Something Else
Al,2q36789472,"And More",Another
Update: Here is what I want to convert the data to.
"Bob","7823984234","Lions, Tigers, Bears","This"
"Fred","67961654","Little,Mermaid, ","That"
"George","4568","Things,Stuff","The Other Thing"
"Barney","324578","Items,","Something Else"
"Al","2q36789472","And More","Another"
I used these steps in a text editor to cleanse the data manually. I suspect a Python script would be the fastest way to do this in a script, and create a good opportunity for me to dive into Python.
Step 1
Search for ÿ - A character not found in the data file
Step 2
Regex Replace this: ("[^"\r\n]*),([^"\r\n]*")
With this: 1ドルÿ2ドル
Repeat until no more hits
Step 3
Regex Replace this: "([^"\r\n]*)"
With this: 1ドル
Step 4
Replace this: ("[^"])
With this: "1ドル
Step 5
Replace this: ,
With this: ","
Step 6
Regex Replace this: ^
With this: "
Step 7
Regex Replace this: $
With this: "
Step 8
Replace this: ÿ
With this: ,
Options
Should I try to resolve this in SSIS in a Script Task?
Cleanse the data in a script outside of the SSIS package?
Look at a Third Party SSIS toolset like Pragmatic Works or Cozyroc?
Some other better approach?
-
Fixing it by manipulating the data is dead end. You need more powerful parse be that a library or something you build.paparazzo– paparazzo2015年07月28日 18:41:40 +00:00Commented Jul 28, 2015 at 18:41
1 Answer 1
I am not sure what is your end goal in term of cleaning.
Anyway, Python can really help you here. You can use the csv module in the standard library, or an external package like PETL.
Both will automatically quote field values that need to be quoted.
So for example, if the 3rd column value is Items,
, it will be quoted (as it contains the separator character), but if it it Items
, it won't be quoted.
You will also be able to pass values to your own methods to clean 1,2,
into 1,2
if that is your goal