There are a lot of variants of the CSV "standard" (or lack thereof). I've never personally seen any that use an escape character (like \) instead of surrounding each field with double quotes. Instead of foo,bar,"foo,bar" it would be foo,bar,foo\,bar.
This would be handy for situations where a file needs to be inspected or edited by hand. When counting commas to find the right field, it seems that it would be easier to tell which ones were not field separators if they were escaped instead of quoted.
I don't see how it would make a difference from a parsing perspective, though.
Why quote instead of escape?
3 Answers
Your question includes the answer, where you wrote "I don't see how it would make a difference from a parsing perspective, though."
There is no compelling reason, it just is. CSV is a data format, so the main goal is to be parseable.
CSV originates from the early seventies (defined in IBM Fortran 77). It was introduced to make data transfer on punch cards less error-prone, as the previously used fixed-length format was prone to errors when one or more spaces were missing.
The format is described in IBM DB2 administrative guide: Load, Import and Export file formats
ref: https://www.columbia.edu/sec/acis/db2/db2d0/db2d053.htm
More recently, the format was defined in RFC 4180; a file needs to follow these guidelines to be compliant.
What is the RFC 4180 CSV file?
RFC 4180 defines a standard dialect for CSV, that specifies delimiters, quoting, and line breaks. As well as resolving these historical variations in CSV, RFC 4180 also resolves other potential inconsistencies, such as requiring the same number of fields on each line.
Ref: https://www.ietf.org/rfc/rfc4180.txt
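To make the quoting rule concrete: under RFC 4180, a field that contains a comma, a double quote, or a line break is enclosed in double quotes, and an embedded double quote is written twice. A minimal sketch using Python's standard csv module, whose default dialect follows the same convention (the helper name to_csv_line is made up for this illustration):

```python
import csv
import io

def to_csv_line(fields):
    """Serialize one record with the csv module's default, quote-based dialect.

    Hypothetical helper for illustration only.
    """
    buf = io.StringIO()
    # The default "excel" dialect quotes only the fields that need it and
    # doubles any embedded double quote.
    csv.writer(buf, lineterminator="\r\n").writerow(fields)
    return buf.getvalue()

print(to_csv_line(["foo", "bar", "foo,bar"]))  # foo,bar,"foo,bar"
print(to_csv_line(['say "hi"']))               # "say ""hi"""
```

Only the third field in the first record is quoted, because it is the only one containing the delimiter.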
The suggestions for a standard in RFC 4180 were later enhanced by the W3C in 2015.
The file type is used by all major players in the industry. Major changes are not easily applied.
A CSV file doesn't need to rely on commas as the separator between elements. The delimiter can be a semicolon, space, or some other character, though the comma is most common.
E.g., in countries which use the comma as the decimal separator, the semicolon is used as the delimiter between elements.
This is why the escape character is not needed.
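As a small illustration of the alternative-delimiter point, here is a sketch that reads semicolon-separated data in which the comma is a decimal separator. It assumes Python's standard csv module; the function name and the sample data are invented for the example.

```python
import csv
import io

def read_semicolon_csv(text):
    """Parse semicolon-delimited text into a list of rows (illustrative helper)."""
    # With ';' as the delimiter, a comma inside a value is ordinary data
    # and needs neither quoting nor escaping.
    return list(csv.reader(io.StringIO(text), delimiter=";"))

rows = read_semicolon_csv("name;price\nwidget;3,50\n")
print(rows)  # [['name', 'price'], ['widget', '3,50']]
```

This only moves the problem, of course: a value containing a semicolon would still need to be quoted.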
- This just moves the question around though - why did the RFC authors decide to use quoting rather than escaping? – Philip Kendall, Oct 13, 2023 at 11:23
- @plykkegaard The RFC contains this comment: "This section documents the format that seems to be followed by most implementations", so it looks like the double-quote convention was already the de facto standard. – Pieter B, Oct 13, 2023 at 11:50
- Call it something else rather than CSV and choose your own path, or follow the suggested implementation guidelines. – plykkegaard, Oct 13, 2023 at 12:25
- Yeah, you can always downvote if the answer does not suit you! Integration systems like Microsoft BizTalk Server or Seeburger Integration Suite will have a hard time with flat files that use escape characters rather than quotes. Use another separator like tab or pipe and you are good to go. – plykkegaard, Oct 13, 2023 at 12:57
Chicken and egg.
If I was defining a spec, I'd escape for the reasons that you state. But any CSV parsing tool that you encounter (including Excel, sqloader, etc) is almost certain to use quoting, not escaping. So if you want to produce it, you need to produce it quoted. If you want to consume it, it is safe to assume that whoever generates it will quote.
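If you do control both the producer and the consumer, an escape-only dialect is easy to get from existing tooling. A minimal sketch, assuming Python's standard csv module with quoting turned off and a backslash as the escape character (the ESCAPED settings and the two helper names are made up for this example):

```python
import csv
import io

# Escape-only dialect: no quoting, a backslash protects special characters.
ESCAPED = {"quoting": csv.QUOTE_NONE, "escapechar": "\\"}

def write_escaped(fields):
    """Serialize one record using escaping instead of quoting."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n", **ESCAPED).writerow(fields)
    return buf.getvalue()

def read_escaped(text):
    """Parse text written in the escape-only dialect."""
    return list(csv.reader(io.StringIO(text), **ESCAPED))

line = write_escaped(["foo", "bar", "foo,bar"])
print(line, end="")        # foo,bar,foo\,bar
print(read_escaped(line))  # [['foo', 'bar', 'foo,bar']]
```

A file written this way round-trips through these two functions, but a tool expecting the quoted form would see four fields, which is the chicken-and-egg problem in practice.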
foo,bar,foo\\,bar, the last comma would be a field separator."
into the cell. Quotes also allow you to put newlines in a field more readably. When writing the parser, if you are going to implement one you may as well implement both; it does not add much complexity.
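To give a sense of how little extra complexity supporting both takes, here is a rough sketch of a single-record splitter that honors quoting and backslash escaping together. The function name and its defaults are hypothetical, and it makes no attempt to reject malformed input.

```python
def split_record(line, delimiter=",", quotechar='"', escapechar="\\"):
    """Split one record into fields, honoring both quoting and escaping.

    Hypothetical helper for illustration; it does not validate input.
    """
    fields, current = [], []
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == escapechar and i + 1 < len(line):
            # Escape: take the next character literally, whatever it is.
            current.append(line[i + 1])
            i += 2
            continue
        if in_quotes:
            if ch == quotechar:
                if line[i + 1 : i + 2] == quotechar:
                    # A doubled quote inside a quoted field is a literal quote.
                    current.append(quotechar)
                    i += 2
                    continue
                in_quotes = False  # closing quote
            else:
                current.append(ch)  # delimiters and newlines are plain data here
        elif ch == quotechar:
            in_quotes = True
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields

print(split_record('foo,bar,"foo,bar"'))   # ['foo', 'bar', 'foo,bar']
print(split_record("foo,bar,foo\\,bar"))   # ['foo', 'bar', 'foo,bar']
print(split_record("foo,bar,foo\\\\,bar")) # ['foo', 'bar', 'foo\\', 'bar']
```

The last call shows the escaped-backslash case: the field ends in a literal backslash, so the comma after it is a real separator.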