Problem
I want to parse some data structured as lines (\n-separated) with fields separated by the NUL character, 0円.
Many Linux commands handle this separator with options such as -print0 for find, -0 for xargs, or by defining the separator as 0円 for gawk.
I didn't manage to work out how to make column interpret NUL as a separator.
Example
If you generate the following set of data (2 lines with 3 columns, separated by 0円):
echo -e "line1\nline2" | awk 'BEGIN {OFS="0円"} {print 1ドル"columnA",1ドル"columnB",1ドル"columnC"}'
You would get the following expected output (the 0円 separators are not displayed, but they do separate each field):
line1columnAline1columnBline1columnC
line2columnAline2columnBline2columnC
But when I try to use column to display my columns, despite passing 0円 as the separator, the output for some reason only shows the first column:
echo -e "line1\nline2" \
  | awk 'BEGIN {FS="0円"; OFS="0円"} {print 1ドル"columnA",1ドル"columnB",1ドル"columnC"}' \
  | column -s '0円'
line1columnA line2columnA
Actually, even without providing the delimiter, column seems to break on the NUL character:
echo -e "line1\nline2" \
  | awk 'BEGIN {FS="0円"; OFS="0円"} {print 1ドル"columnA",1ドル"columnB",1ドル"columnC"}' \
  | column
line1columnA line2columnA
Question
- Is there a way to use 0円 as a field/column separator in column?
- Optional/bonus question: Why does column behave like this? (I would expect the 0円 to be ignored entirely if not handled, and the whole line to be printed as a single field.)
- Optional/bonus question 2: Some of the data in these columns will be file paths, and I wanted to use 0円 as a best practice. Do you have a better practice to recommend for storing "random strings" in a file without having to escape any conflicting field-separator characters they may contain?
2 Answers
Is there a way to use 0円 as a field/column separator in column?
No. Both implementations of column that I am aware of (the historical BSD one and the one in the util-linux package) use the standard C library's string-manipulation functions to parse input lines, and those functions work under the assumption that strings are NUL-terminated. In other words, a NUL byte always marks the end of a string.
Optional/bonus question: Why does column behave like this? (I would expect the 0円 to be ignored entirely if not handled, and the whole line to be printed as a single field.)
On top of what I explained above, note that option -s expects literal characters. It does not parse an escape syntax like 0円 (nor \n, for that matter). This means that you told column to consider either a \ or a 0 as a valid separator for its input.
You can provide escape sequences through the $'' string syntax if you are using one of the many shells that support it (e.g. it is available in bash but not in dash). So, for instance, column -s $'\n' would be valid (to specify a <newline> as the column separator) if run by one of those shells.
As a side note, it's not clear to me what you'd expect from column. Even if it did support NUL as a separator, it would just turn each line of that input into a whole column on output. Perhaps you also wanted to use -t, so as to columnize the individual fields of each line?
Optional/bonus question 2: Some of the data in these columns will be file paths, and I wanted to use 0円 as a best practice. Do you have a better practice to recommend for storing "random strings" in a file without having to escape any conflicting field-separator characters they may contain?
The only one I know of is prefixing each field with its length, expressed as text or binary as you see fit. But then you certainly could not pipe the result into column.
Also, if your concern is file paths, then you should consider not using \n as a "structure" separator either, because a newline is a perfectly valid character in a filename.
Just as a proof of concept, based on your example but using NUL as the structure/record separator and length-prefixed fields (I also fiddled a bit with your example strings to involve multibyte characters):
echo -e 'line1\nline2 ò' \
  | LC_ALL=C awk '
    BEGIN {
        ORS="0円"
        # here we just move arguments away from ARGV
        # so that awk reads input from stdin
        for (i in ARGV) {
            c[i]=ARGV[i]
            delete ARGV[i]
        }
    }
    {
        # first field is the line read
        printf "%4.4d%s", length, 0ドル
        # then a field for each argument
        for (i=1; i<length(c); i++)
            printf "%4.4d%s", length(c[i]), c[i]
        printf "%s", ORS
    }
' "€ column A" $'colu\nmnB' "column C"
Use arguments to awk to pass as many arbitrary column strings as you wish.
Then, a hypothetical counterpart script in awk (it actually has to be gawk or mawk in order to handle RS="0円"):
LC_ALL=C awk '
    BEGIN { RS="0円" }
    {
        nf=0
        while (length) {
            field_length = substr(0,ドル 1, 4)
            printf "field %d: \"%s\"" ORS, ++nf, substr(0,ドル 5, field_length)
            0ドル = substr(0,ドル 5+field_length)
        }
        printf "%s", ORS
    }
'
Note that it is important to specify the same locale for both scripts, so that character sizes match. Specifying LC_ALL=C for both is fine.
Your columns didn't even reach your awk command. Everything past the first zero was lost even before the echo command. You can't store a binary zero in a variable.
var=$'zzz\x00zzz'
echo "${#var}"
3
var=$'zzz\xFFzzz'
echo "${#var}"
7
You could use tr to change all the zeros to any other delimiter of your choice, before you even begin doing what you plan on doing.
Or you could change your shell to zsh.
- phuclv (Apr 11, 2021 at 0:34): zsh can store null bytes in its variables, so one solution is to change the shell.
- steeldriver (Apr 11, 2021 at 0:48): Hmm... I don't see where shell variables are involved here (just pipes); isn't what you're demonstrating here the limited escape handling of echo? In bash, for example, printf 'zzz\x00zzz' | hexdump -C appears to preserve the null bytes.
- steeldriver (Apr 11, 2021 at 1:14): @Pourko OK, point conceded - but the OP is not using $'string' to insert the null character, they're using awk. Are you suggesting that FS="0円" (for example, awk 'BEGIN{OFS="0円"; print "foo","bar"}' | hexdump -C) is not outputting null-separated fields?
- dave_thompson_085 (Apr 11, 2021 at 3:02): The only escape sequence the OP uses in echo is \n, which does work in many implementations, though not all, especially with -e, which the OP also uses. In GNU awk a variable containing 0円 works, but not FreeBSD awk (at least not my rather old one).
- Pourko (Apr 11, 2021 at 7:17): @Pierre-Jean "displays the first column correctly"... When passing things around, the first zero signified an end to the string, and everything after it got lost. Just pick another delimiter, anything but zero, and save yourself the headaches.