I've written an awk script that performs the same operation, with the same parameters, as Python's pandas.DataFrame.pivot, for 2D tables. It reshapes a table from "long" format to "wide" format.
In a "long" table, some variable names are specified by a field on each row. E.g., Temperature and Humidity below are specified row-wise like this:
Room,"Variable Name",Value
Bedroom,Temperature,23
Bedroom,Humidity,78
Office,Temperature,26
Office,Humidity,82
Bathroom,Temperature,18
Bathroom,Humidity,97
My script finds those variables for each key and outputs a new table where each of them becomes a column instead:
% cat room_data.csv | pivot.awk -v 'columns="Variable Name"' -v key=Room -v values=Value
Room,Temperature,Humidity
Bedroom,23,78
Office,26,82
Bathroom,18,97
Another test case (this time with a compound key):
Company,Model Name,Variable Name,Value
Apple,iPhone 11,Mass,194g
Apple,iPhone 11,Clock Speed,2.66 GHz
Apple,iPhone 11,Year,2019
Apple,iPhone XS Max,Mass,208g
Apple,iPhone XS Max,Clock Speed,2.5 GHz
Apple,iPhone XS Max,Year,2018
Apple,"MacBook Pro (15-inch, 2016)",Clock Speed,2.9GHz
Apple,"MacBook Pro (15-inch, 2016)",Mass,1.83kg
Apple,"MacBook Pro (15-inch, 2016)",Year,2016
Microsoft,Surface Pro,Year,2013
Microsoft,Surface Pro,Clock Speed,1.7 GHz
Microsoft,Surface Pro,Mass,910g
% cat people.csv | spritesheet-tools/pivot.awk -v columns="Variable Name" -v key=Company,"Model Name" -v values=Value
Company,Model Name,Mass,Clock Speed,Year
Apple,iPhone 11,194g,2.66 GHz,2019
Apple,iPhone XS Max,208g,2.5 GHz,2018
Apple,"MacBook Pro (15-inch, 2016)",1.83kg,2.9GHz,2016
Microsoft,Surface Pro,910g,1.7 GHz,2013
The script reads and writes CSV tables. I've tried to follow what I think is the most popular CSV standard, RFC 4180. So: any field may be surrounded by double quotes; if a field's contents include a comma or a double-quote character, that field must be quoted; and to represent a double quote inside a quoted field, use two double-quote characters. My implementation currently ignores the fact that quoted fields are also allowed to contain newlines; maybe after receiving feedback on this version of my script I'll try to support that too.
#!/usr/bin/awk -f
function usage_and_exit(message) {
    print "usage: pivot.awk -v columns=<columns_with_new_column_names> -v key=<index_column_names> -v values=<column_of_values_for_new_columns>" > "/dev/stderr"
    if (message) {
        print message > "/dev/stderr"
    }
    exit 1
}
function use_csv_fields(line) {
    i = 0
    while (length(line)) {
        i = i + 1
        $i = substr(line, match(line, /"(("")*[^"]*)*"|[^",]*/), RLENGTH);
        line = substr(line, RSTART + RLENGTH + 1);
    }
    NF = i
}
BEGIN {
    if (!columns || !key || !values) {
        usage_and_exit()
    }
    use_csv_fields(columns)
    for (i = 1; i <= NF; i++) {
        name_columns[$i] = 0
        name_for_i[i] = $i
    }
    use_csv_fields(key)
    for (i = 1; i <= NF; i++) {
        key_columns[$i] = 0
        key_for_i[i] = $i
    }
    use_csv_fields(values)
    for (i = 1; i <= NF; i++) {
        value_columns[$i] = 0
        value_for_i[i] = $i
    }
    if (length(key_columns) == 0) {
        usage_and_exit("At least one column name to use as the key must be specified.")
    }
    if (length(value_columns) == 0) {
        usage_and_exit("At least one column name each must be provided for columns and values.")
    }
    if (length(value_columns) != length(name_columns)) {
        usage_and_exit("The number of items specified in columns and values must be the same.")
    }
    FS = OFS = ","
}
function assert_all_columns_found(arr) {
    for (keyname in arr) {
        if (!arr[keyname]) {
            usage_and_exit("Detected that the column " keyname " was not found in the CSV header.")
        }
    }
}
{
    use_csv_fields(0ドル)
}
NR == 1 {
    for (i = 1; i <= NF; i++) {
        if ($i in key_columns) {
            key_columns[$i] = i
        }
        if ($i in name_columns) {
            name_columns[$i] = i
        }
        if ($i in value_columns) {
            value_columns[$i] = i
        }
    }
    assert_all_columns_found(key_columns)
    assert_all_columns_found(name_columns)
    assert_all_columns_found(value_columns)
    for (i = 1; i <= length(key_for_i); ++i) {
        cells[1,++numCols] = $key_columns[key_for_i[i]]
    }
    numRows = 1
}
NR > 1 {
    key = sep = ""
    for (i = 1; i <= length(key_for_i); i++) {
        key = key sep $key_columns[key_for_i[i]]
        sep = OFS
    }
    rowNr = key2row[key] = ( key in key2row ? key2row[key] : ++numRows )
    for (i = 1; i <= length(name_for_i); i++) {
        hdr = $name_columns[name_for_i[i]]
        val = $value_columns[value_for_i[i]]
        colNr = hdr2col[hdr] = ( hdr in hdr2col ? hdr2col[hdr] : ++numCols )
        cells[1,colNr] = hdr
        cells[rowNr,colNr] = val
    }
    for (colNr = 1; colNr <= length(key_for_i); colNr++) {
        cells[rowNr,colNr] = $key_columns[key_for_i[colNr]]
    }
}
END {
    for (rowNr = 1; rowNr <= numRows; rowNr++) {
        for (colNr = 1; colNr <= numCols; colNr++) {
            printf "%s%s", cells[rowNr,colNr], (colNr < numCols ? OFS : ORS)
        }
    }
}
1 Answer
When I say "GNU-ism" below about any specific construct, I really mean "non-POSIX, but will work in GNU awk and MAY work in some other awks".
To read CSV in general with awk, see whats-the-most-robust-way-to-efficiently-parse-csv-using-awk. Your code uses a GNU-ism with print message > "/dev/stderr", so maybe you can decide to stick with gawk and use awk --csv.
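If you'd rather stay portable, a commonly used POSIX alternative to the /dev/stderr file name is to pipe the message to the shell command cat>&2 (a sketch; the usage string here is just a stand-in):

```shell
# POSIX-portable write to stderr from awk: pipe to "cat>&2" instead
# of redirecting to the gawk-ism file name /dev/stderr (a sketch).
awk 'BEGIN { print "usage: pivot.awk ..." | "cat>&2" }' </dev/null
```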
The code you're calling in the BEGIN section to do things with input lines, fields, and/or field numbers, such as for (i = 1; i <= NF; i++) {, can't work, as the BEGIN section is executed before the first input file is opened, so no lines have been read at that time. You're also trying to use that code before setting FS, so the fields wouldn't be what you expect.
cat room_data.csv | pivot.awk ... is a UUOC (useless use of cat); use pivot.awk ... room_data.csv instead.
if (length(key_columns) == 0) is a lengthy, slow way of writing if (key_columns == "").
You need to declare function-local variables as such to avoid them clashing with global variables of the same name, e.g. function use_csv_fields(line, i) instead of just function use_csv_fields(line).
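The scoping pitfall is easy to demonstrate (a sketch): extra parameters act as locals in awk, and without them a loop variable inside a function clobbers the global of the same name.

```shell
awk '
function f(    i) {   # i is a local: extra parameters act as locals
    for (i = 1; i <= 3; i++) x = i
}
function g() {        # no locals declared: this i is the global i
    for (i = 1; i <= 3; i++) x = i
}
BEGIN {
    i = 99; f(); print i   # global i untouched: prints 99
    i = 99; g(); print i   # global i clobbered by the loop: prints 4
}'
```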
I see another GNU-ism: setting NF = i and hoping it'll change the fields. That's undefined behavior per POSIX, so again you may want to stick with GNU awk, and then you can use --csv or FPAT to make your life easier.
In usage_and_exit() you have exit 1. Just be aware that that statement sets the exit status and takes you to the END section; it doesn't immediately terminate the script, so your code will still loop printing array contents before exiting with that status. You may want to do something more like:
function usage_and_exit() {
    ... whatever ...
    err = 1
    exit
}
END {
    if ( err ) {
        exit err
    } else {
        do the loops
    }
}
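The err-flag pattern can be seen working in a minimal sketch: exit in BEGIN jumps to END rather than terminating, and the flag lets END skip its output and re-raise the status.

```shell
# exit in BEGIN jumps to END; the err flag makes END exit early
# instead of printing (a sketch of the pattern above).
awk 'BEGIN { err = 1; exit } END { if (err) exit err; print "do the loops here" }'
echo "exit status: $?"   # prints "exit status: 1", and END printed nothing
```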
For a POSIX solution, just use the readRec() and splitRec() functions from whats-the-most-robust-way-to-efficiently-parse-csv-using-awk to read records including quoted fields, even if they contain newlines, quotes, or commas, and then split each record into the correct fields. After that, all you have left to do is whatever you need to create your pivot table from the usual fields, 1,ドル etc. You'll just have to change this:
function readRec( line,rec) {
    # Keep reading lines until we have an even number of quotes
    # as an incomplete record will have an uneven number.
    rec = 0ドル
    while ( (gsub(/"/,"&",rec) % 2) && ((getline line) > 0) ) {
        rec = rec RS line
        NR--
        FNR--
    }
    0ドル = rec
}
to this:
function readRec( line,rec) {
    # Keep reading lines until we have an even number of quotes
    # as an incomplete record will have an uneven number.
    rec = 0ドル
    while ( (gsub(/"/,"&",rec) % 2) && ((getline line) > 0) ) {
        rec = rec RS line
    }
    0ドル = rec
    myNR++
    myFNR++
}
and then use myNR and myFNR instead of NR and FNR (if you need them) for 100% POSIX functionality, since changing the value of NR or FNR is undefined by POSIX.
- Is key_columns == "" always the same as !key_columns in this case? – user98809, Mar 6, 2024 at 13:26
- @minseong no, try both with key_columns=0. – Ed Morton, Mar 6, 2024 at 13:50
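The distinction raised in this comment exchange can be checked directly (a sketch): with a value of 0, the string comparison and the boolean test disagree.

```shell
# With k = 0: (k == "") compares "0" against "" and is false,
# while (!k) treats the number 0 as false, so it is true.
awk 'BEGIN { k = 0; print (k == ""), (!k) }'
```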
- print message > "/dev/stderr" works on my mac, I do not have GNU awk
- The code in BEGIN works because I set NF, 0,ドル etc., manually
- When I set NF and the 1,ドル 2,ドル ... fields, 0ドル seems to be automatically updated