Problem
I want to parse some data structured as lines (\n-separated) with fields separated by the NUL character, 0円.
Many Linux commands handle this separator with options such as -print0 for find, -0 for xargs, or by defining the separator as 0円 for gawk.
I didn't manage to work out how to make column interpret NUL as a separator.
Example
If you generate the following set of data (2 lines with 3 columns, separated by 0円):
echo -e "line1\nline2" | awk 'BEGIN {OFS="0円"} {print 1ドル"columnA",1ドル"columnB",1ドル"columnC"}'
You would get the following expected output (the 0円 separators are not displayed, but they do separate each field):
line1columnAline1columnBline1columnC
line2columnAline2columnBline2columnC
But when I try to use column to display my columns, despite passing 0円 as the separator, the output for some reason only shows the first column:
echo -e "line1\nline2" \
  | awk 'BEGIN {FS="0円"; OFS="0円"} {print 1ドル"columnA",1ドル"columnB",1ドル"columnC"}' \
  | column -s '0円'
line1columnA line2columnA
Actually, even without providing the delimiter, column seems to break on the NUL character:
echo -e "line1\nline2" \
  | awk 'BEGIN {FS="0円"; OFS="0円"} {print 1ドル"columnA",1ドル"columnB",1ドル"columnC"}' \
  | column
line1columnA line2columnA
Question
- Is there a way to use 0円 as a field/column separator in column?
- Optional/bonus question: Why does column behave like this? (I would expect the 0円 to be ignored entirely if not handled, and the whole line to be printed as a single field.)
- Optional/bonus question 2: Some of the data in these columns will be file paths, and I wanted to use 0円 as a best practice. Do you have a better practice to recommend for storing "random strings" in a file without having to escape any conflicting field-separator characters they may contain?
2 Answers
Is there a way to use 0円 as a field/column separator in column?
No. Both implementations of column that I am aware of (the historical BSD one and the one in the util-linux package) use the standard C library's string-manipulation functions to parse input lines, and those functions work under the assumption that strings are NUL-terminated. In other words, a NUL byte always marks the end of a string.
Optional/bonus question: Why does column behave like this? (I would expect the 0円 to be ignored entirely if not handled, and the whole line to be printed as a single field.)
On top of what I explained above, note that option -s expects literal characters. It does not parse an escape syntax like 0円 (nor \n, for that matter). This means that you told column to consider either a \ or a 0 as a valid separator for its input.
You can provide escape sequences through the $'' string syntax if you are using one of the many shells that support it (e.g. it is available in bash but not in dash). So, for instance, column -s $'\n' would be valid (to specify a <newline> as the column separator) if run by one of those shells.
As a side note, it's not clear to me what you'd expect from column. Even if it did support NUL as a separator, it would just turn each line of that input into a whole column on output. Perhaps you also wanted to use -t, so as to columnize the individual fields of each line?
Optional/bonus question 2: Some of the data in these columns will be file paths, and I wanted to use 0円 as a best practice. Do you have a better practice to recommend for storing "random strings" in a file without having to escape any conflicting field-separator characters they may contain?
The only one I know of is prefixing each field with its length, expressed as text or binary as you see fit. But then you certainly could not pipe the result into column.
Also, if your concern is file paths, then you should consider not using \n as a "structure" separator either, because a newline is a perfectly valid character in a filename.
Just as a proof of concept, based on your example but using NUL as the structure/record separator and length-prefixed fields (I also fiddled a bit with your example strings to involve multibyte characters):
echo -e 'line1\nline2 ò' \
  | LC_ALL=C awk '
    BEGIN {
        ORS="0円"
        # here we just move arguments away from ARGV
        # so that awk reads input from stdin
        for (i in ARGV) {
            c[i]=ARGV[i]
            delete ARGV[i]
        }
    }
    {
        # first field is the line read
        printf "%4.4d%s", length, 0ドル
        # then a field for each argument
        for (i=1; i<length(c); i++)
            printf "%4.4d%s", length(c[i]), c[i]
        printf "%s", ORS
    }
' "€ column A" $'colu\nmnB' "column C"
Use arguments to awk to pass as many arbitrary column strings as you wish.
Then, a hypothetical counterpart script in awk (it actually has to be gawk or mawk in order to handle RS="0円"):
LC_ALL=C awk '
    BEGIN { RS="0円" }
    {
        nf=0
        while (length) {
            field_length = substr(0,ドル 1, 4)
            printf "field %d: \"%s\"" ORS, ++nf, substr(0,ドル 5, field_length)
            0ドル = substr(0,ドル 5+field_length)
        }
        printf "%s", ORS
    }
'
Note that it is important to specify the same locale for both scripts, so that character sizes match. Specifying LC_ALL=C for both is fine.
Your columns didn't even reach your awk command. Everything past the first zero was lost even before the echo command. You can't store a binary zero in a variable.
var=$'zzz\x00zzz'
echo "${#var}"
3
var=$'zzz\xFFzzz'
echo "${#var}"
7
You could use tr to change all the zeros to any other delimiter of your choice, before you even begin doing what you plan on doing.
Or you could change your shell to zsh.
- phuclv (Apr 11, 2021 at 0:34): zsh can store null bytes in its variables, so one solution is to change the shell.
- steeldriver (Apr 11, 2021 at 0:48): Hmm... I don't see where shell variables are involved here (just pipes); isn't what you're demonstrating here the limited escape handling of echo? In bash, for example, printf 'zzz\x00zzz' | hexdump -C appears to preserve the null bytes.
- steeldriver (Apr 11, 2021 at 1:14): @Pourko OK, point conceded - but the OP is not using $'string' to insert the null character, they're using awk. Are you suggesting that FS="0円" (for example, awk 'BEGIN{OFS="0円"; print "foo","bar"}' | hexdump -C) is not outputting null-separated fields?
- dave_thompson_085 (Apr 11, 2021 at 3:02): The only escape sequence the OP uses in echo is \n, which does work in many implementations, though not all, especially with -e, which the OP also uses. In GNU awk a variable containing 0円 works, but not FreeBSD awk (at least not my rather old one).
- Pourko (Apr 11, 2021 at 7:17): @Pierre-Jean "displays the first column correctly"... When passing things around, the first zero signified an end to the string, and everything after it got lost. Just pick another delimiter, anything but zero, and save yourself the headaches.