I've written an awk script that performs the same operation, with the same parameters, as Python's pandas.DataFrame.pivot, for 2D tables. It reshapes a table from "long" format to "wide" format.
In a "long" table, some variable names are specified by a field on each row. E.g., Temperature and Humidity below are specified row-wise like this:
Room,"Variable Name",Value
Bedroom,Temperature,23
Bedroom,Humidity,78
Office,Temperature,26
Office,Humidity,82
Bathroom,Temperature,18
Bathroom,Humidity,97
My script finds those variables for each key and outputs a new table where each of them becomes a column instead:
% cat room_data.csv | pivot.awk -v 'columns="Variable Name"' -v key=Room -v values=Value
Room,Temperature,Humidity
Bedroom,23,78
Office,26,82
Bathroom,18,97
Another test case (this time with a compound key):
Company,Model Name,Variable Name,Value
Apple,iPhone 11,Mass,194g
Apple,iPhone 11,Clock Speed,2.66 GHz
Apple,iPhone 11,Year,2019
Apple,iPhone XS Max,Mass,208g
Apple,iPhone XS Max,Clock Speed,2.5 GHz
Apple,iPhone XS Max,Year,2018
Apple,"MacBook Pro (15-inch, 2016)",Clock Speed,2.9GHz
Apple,"MacBook Pro (15-inch, 2016)",Mass,1.83kg
Apple,"MacBook Pro (15-inch, 2016)",Year,2016
Microsoft,Surface Pro,Year,2013
Microsoft,Surface Pro,Clock Speed,1.7 GHz
Microsoft,Surface Pro,Mass,910g
% cat people.csv | spritesheet-tools/pivot.awk -v columns="Variable Name" -v key=Company,"Model Name" -v values=Value
Company,Model Name,Mass,Clock Speed,Year
Apple,iPhone 11,194g,2.66 GHz,2019
Apple,iPhone XS Max,208g,2.5 GHz,2018
Apple,"MacBook Pro (15-inch, 2016)",1.83kg,2.9GHz,2016
Microsoft,Surface Pro,910g,1.7 GHz,2013
The script reads and writes CSV tables. I've tried to follow what I think is the most popular CSV standard, RFC 4180. So: any field may be surrounded by double quotes; if a field's contents include a comma or a double-quote character, that field must be quoted; and to represent a double quote inside a quoted field, use two double-quote characters. My implementation currently ignores the fact that quoted fields are also allowed to contain newlines; maybe after receiving feedback on this version of my script I'll try to support that too.
#!/usr/bin/awk -f
function usage_and_exit(message) {
    print "usage: pivot.awk -v columns=<columns_with_new_column_names> -v key=<index_column_names> -v values=<column_of_values_for_new_columns>" > "/dev/stderr"
    if (message) {
        print message > "/dev/stderr"
    }
    exit 1
}
function use_csv_fields(line) {
    i = 0
    while (length(line)) {
        i = i + 1
        $i = substr(line, match(line, /"(("")*[^"]*)*"|[^",]*/), RLENGTH);
        line = substr(line, RSTART + RLENGTH + 1);
    }
    NF = i
}
BEGIN {
    if (!columns || !key || !values) {
        usage_and_exit()
    }
    use_csv_fields(columns)
    for (i = 1; i <= NF; i++) {
        name_columns[$i] = 0
        name_for_i[i] = $i
    }
    use_csv_fields(key)
    for (i = 1; i <= NF; i++) {
        key_columns[$i] = 0
        key_for_i[i] = $i
    }
    use_csv_fields(values)
    for (i = 1; i <= NF; i++) {
        value_columns[$i] = 0
        value_for_i[i] = $i
    }
    if (length(key_columns) == 0) {
        usage_and_exit("At least one column name to use as the key must be specified.")
    }
    if (length(value_columns) == 0) {
        usage_and_exit("At least one column name each must be provided for columns and values.")
    }
    if (length(value_columns) != length(name_columns)) {
        usage_and_exit("The number of items specified in columns and values must be the same.")
    }
    FS = OFS = ","
}
function assert_all_columns_found(arr) {
    for (keyname in arr) {
        if (!arr[keyname]) {
            usage_and_exit("Detected that the column " keyname " was not found in the CSV header.")
        }
    }
}
{
    use_csv_fields(0ドル)
}
NR == 1 {
    for (i = 1; i <= NF; i++) {
        if ($i in key_columns) {
            key_columns[$i] = i
        }
        if ($i in name_columns) {
            name_columns[$i] = i
        }
        if ($i in value_columns) {
            value_columns[$i] = i
        }
    }
    assert_all_columns_found(key_columns)
    assert_all_columns_found(name_columns)
    assert_all_columns_found(value_columns)
    for (i = 1; i <= length(key_for_i); ++i) {
        cells[1,++numCols] = $key_columns[key_for_i[i]]
    }
    numRows = 1
}
NR > 1 {
    key = sep = ""
    for (i = 1; i <= length(key_for_i); i++) {
        key = key sep $key_columns[key_for_i[i]]
        sep = OFS
    }
    rowNr = key2row[key] = ( key in key2row ? key2row[key] : ++numRows )
    for (i = 1; i <= length(name_for_i); i++) {
        hdr = $name_columns[name_for_i[i]]
        val = $value_columns[value_for_i[i]]
        colNr = hdr2col[hdr] = ( hdr in hdr2col ? hdr2col[hdr] : ++numCols )
        cells[1,colNr] = hdr
        cells[rowNr,colNr] = val
    }
    for (colNr = 1; colNr <= length(key_for_i); colNr++) {
        cells[rowNr,colNr] = $key_columns[key_for_i[colNr]]
    }
}
END {
    for (rowNr = 1; rowNr <= numRows; rowNr++) {
        for (colNr = 1; colNr <= numCols; colNr++) {
            printf "%s%s", cells[rowNr,colNr], (colNr < numCols ? OFS : ORS)
        }
    }
}
1 Answer
When I say "GNU-ism" below about any specific construct, I really mean "non-POSIX, but will work in GNU awk and MAY work in some other awks".
To read CSV in general with awk, see whats-the-most-robust-way-to-efficiently-parse-csv-using-awk. Your code uses a GNU-ism with print message > "/dev/stderr", so maybe you can decide to stick with gawk and use awk --csv.
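If you'd rather stay portable, a commonly used POSIX alternative to the /dev/stderr file name is to pipe the message to the shell command cat>&2 (a sketch; the usage string here is just a stand-in):

```shell
# POSIX-portable write to stderr from awk: pipe to "cat>&2" instead
# of redirecting to the gawk-ism file name /dev/stderr (a sketch).
awk 'BEGIN { print "usage: pivot.awk ..." | "cat>&2" }' </dev/null
```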
The code you're calling in the BEGIN section to do things with input lines, fields, and/or field numbers, such as for (i = 1; i <= NF; i++) {, can't work, as the BEGIN section is executed before the first input file is opened, so no lines have been read at that time. You're also trying to use that code before setting FS, so the fields wouldn't be what you expect.
cat room_data.csv | pivot.awk ... is a UUOC (useless use of cat); use pivot.awk ... room_data.csv instead.
if (length(key_columns) == 0) is a lengthy, slow way of writing if (key_columns == "").
You need to declare function-local variables as such to avoid them clashing with global variables of the same name, e.g. function use_csv_fields(line, i) instead of just function use_csv_fields(line).
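The scoping pitfall is easy to demonstrate (a sketch): extra parameters act as locals in awk, and without them a loop variable inside a function clobbers the global of the same name.

```shell
awk '
function f(    i) {   # i is a local: extra parameters act as locals
    for (i = 1; i <= 3; i++) x = i
}
function g() {        # no locals declared: this i is the global i
    for (i = 1; i <= 3; i++) x = i
}
BEGIN {
    i = 99; f(); print i   # global i untouched: prints 99
    i = 99; g(); print i   # global i clobbered by the loop: prints 4
}'
```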
I see another GNU-ism: setting NF = i and hoping it'll change the fields. That's undefined behavior per POSIX, so again you may want to stick with GNU awk, and then you can use --csv or FPAT to make your life easier.
In usage_and_exit() you have exit 1. Just be aware that that statement sets the exit status and takes you to the END section; it doesn't immediately terminate the script, so your code will still loop printing array contents before exiting with that status. You may want to do something more like:
function usage_and_exit() {
    ... whatever ...
    err = 1
    exit
}
END {
    if ( err ) {
        exit err
    } else {
        do the loops
    }
}
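The err-flag pattern can be seen working in a minimal sketch: exit in BEGIN jumps to END rather than terminating, and the flag lets END skip its output and re-raise the status.

```shell
# exit in BEGIN jumps to END; the err flag makes END exit early
# instead of printing (a sketch of the pattern above).
awk 'BEGIN { err = 1; exit } END { if (err) exit err; print "do the loops here" }'
echo "exit status: $?"   # prints "exit status: 1", and END printed nothing
```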
For a POSIX solution, just use the readRec() and splitRec() functions from whats-the-most-robust-way-to-efficiently-parse-csv-using-awk to read records including quoted fields, even if they contain newlines, quotes, or commas, and then split each record into the correct fields. After that, all you have left to do is whatever you need to create your pivot table from the usual fields, 1,ドル etc. You'll just have to change this:
function readRec( line,rec) {
    # Keep reading lines until we have an even number of quotes
    # as an incomplete record will have an uneven number.
    rec = 0ドル
    while ( (gsub(/"/,"&",rec) % 2) && ((getline line) > 0) ) {
        rec = rec RS line
        NR--
        FNR--
    }
    0ドル = rec
}
to this:
function readRec( line,rec) {
    # Keep reading lines until we have an even number of quotes
    # as an incomplete record will have an uneven number.
    rec = 0ドル
    while ( (gsub(/"/,"&",rec) % 2) && ((getline line) > 0) ) {
        rec = rec RS line
    }
    0ドル = rec
    myNR++
    myFNR++
}
and then use myNR and myFNR instead of NR and FNR (if you need them) for 100% POSIX functionality, since changing the value of NR or FNR is undefined by POSIX.
- Is key_columns == "" always the same as !key_columns in this case? – user98809, Mar 6, 2024 at 13:26
- @minseong no, try both with key_columns=0. – Ed Morton, Mar 6, 2024 at 13:50
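The distinction raised in this comment exchange can be checked directly (a sketch): with a value of 0, the string comparison and the boolean test disagree.

```shell
# With k = 0: (k == "") compares "0" against "" and is false,
# while (!k) treats the number 0 as false, so it is true.
awk 'BEGIN { k = 0; print (k == ""), (!k) }'
```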
- print message > "/dev/stderr" works on my mac, I do not have GNU awk
- The code in BEGIN works because I set NF, 0,ドル etc., manually
- When I set NF and the 1,ドル 2,ドル ... fields, 0ドル seems to be automatically updated