Consider the following CSV file:
A; B; ;
B; ; A;
C; ; E F;
D; ; E;
E; C; ;
The fields:
- $1: the jname. A unique id of the entry.
- $2: a " " (space)-separated list of incond.
- $3: a " " (space)-separated list of outcond.
For the "link" A-B to be valid, jname A must define B as outcond, and job B must define A as incond.
In the above example, D-E is not a valid "link" because E doesn't define D as incond.
C-F is not a valid "link" because F doesn't exist.
A cond is not valid if the link it forms is not valid. The script must detect all invalid conds and which jobs are affected.
#!/usr/bin/awk -f
BEGIN {
    FS=" *; *";
    delim = "-";
    conds[""]=0;
}
{
    icnd_size = split($2, incond_list, " ");
    for (i=1; i<=icnd_size; ++i) {
        conds[incond_list[i] delim $1]++;
    }
    ocnd_size = split($3, outcond_list, " ");
    for (i=1; i<=ocnd_size; ++i) {
        conds[$1 delim outcond_list[i]]--;
    }
}
END {
    for (i in conds) {
        sz = split(i, answer, delim);
        if (conds[i] == 1) {
            j = answer[2];
            c = answer[1];
            inorout = "INCOND";
        }
        if (conds[i] == -1) {
            j = answer[1];
            c = answer[2];
            inorout = "OUTCOND";
        }
        if (conds[i] != 0)
            print "Invalid", inorout, c, "on job", j;
    }
}
The script works, although I do not have large data to test against. I see 2 problems with it:
- the script will break if some cond has the character delim in the name
- the script might break (and/or return false positives) if a line is inserted twice or if two lines have the same jname.
I could use any tip on addressing the two problems, as well as any critique of the code; it's literally my first Awk code.
1 Answer
Your main questions
The script works, although I do not have large data to test against.
You don't necessarily need a large dataset.
It's better to think of all possible corner cases.
For example, your sample data demonstrates failures of OUTCOND but not of INCOND.
Also, although there is an example of a job with more than one outgoing link,
there is no example of a job with more than one incoming link.
There are not too many interesting cases;
if you add examples for all of them,
you can be fairly confident in your solution.
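As a rough sketch of such a corner-case test, the snippet below wraps the question's logic in a shell function and feeds it an extended sample. The function name validate_links and the extra jobs G and H are made up for illustration: E now has two incoming links (from C and G), and H lists A as incond although A never lists H as outcond.

```sh
# validate_links is a hypothetical wrapper around the link-checking logic
# from the question; it reads the named files, or stdin if none are given.
validate_links() {
    awk '
    BEGIN { FS = " *; *"; delim = "-" }
    {
        # +1 for every incond a job declares, -1 for every outcond.
        n = split($2, inconds, " ")
        for (i = 1; i <= n; i++) conds[inconds[i] delim $1]++
        n = split($3, outconds, " ")
        for (i = 1; i <= n; i++) conds[$1 delim outconds[i]]--
    }
    END {
        # A pair that nets to zero was declared on both ends and is valid.
        for (pair in conds) {
            split(pair, parts, delim)
            if (conds[pair] == 1)
                print "Invalid INCOND", parts[1], "on job", parts[2]
            else if (conds[pair] == -1)
                print "Invalid OUTCOND", parts[2], "on job", parts[1]
        }
    }
    ' "$@"
}

# Extended sample: the last three lines are additions to the original data.
validate_links <<'EOF' | sort
A; B; ;
B; ; A;
C; ; E F;
D; ; E;
E; C G; ;
G; ; E;
H; A; ;
EOF
```

The output is piped through sort because the iteration order of for (pair in conds) is unspecified. Besides the two OUTCOND failures already present in the original sample, the run should report an invalid INCOND A on job H.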
- The script will break if some cond has the character delim in the name
If you want to be really safe, you could add a sanity check for that and raise an error when such a name is found, for example by calling exit with a non-zero value.
- The script might break (and/or return false positives) if a line is inserted twice or if two lines have the same jname.
Ditto.
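Both sanity checks could look roughly like the sketch below. The function name check_input and the wording of the messages are placeholders of mine, and writing to /dev/stderr assumes an awk or operating system that provides it (gawk, mawk and typical Linux/BSD systems do).

```sh
# check_input is a hypothetical pre-flight check: it exits with status 1
# on the first line that would confuse the link-checking script.
check_input() {
    awk '
    BEGIN {
        FS = " *; *"
        delim = "-"
    }
    {
        # The delimiter must not occur in the jname or in any cond name,
        # otherwise the "cond delim jname" keys become ambiguous.
        if (index($1 " " $2 " " $3, delim) > 0) {
            printf("line %d: name contains \"%s\"\n", NR, delim) > "/dev/stderr"
            exit 1
        }
        # Each jname may be defined only once.
        if ($1 in seen) {
            printf("line %d: duplicate jname %s (first seen on line %d)\n", NR, $1, seen[$1]) > "/dev/stderr"
            exit 1
        }
        seen[$1] = NR
    }
    ' "$@"
}
```

A possible way to use it, with jobs.csv and script.awk as stand-in file names: check_input jobs.csv && ./script.awk jobs.csv. If the checks are merged into the main script instead, keep in mind that exit inside a main rule still runs the END block, so END would need a flag to skip its report.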
Simplify
Many things can be simplified in this code.
The conds[""]=0; is unnecessary; you can simply delete that line.
Instead of this:
icnd_size = split($2, incond_list, " ");
for (i=1; i<=icnd_size; ++i) {
    conds[incond_list[i] delim $1]++;
}
You don't really need the return value of split,
because instead of a counting loop,
you can use a more idiomatic for-each loop
(its iteration order is unspecified, but the order does not matter here):
split($2, inconds, " ");
for (i in inconds) {
    conds[inconds[i] delim $1]++;
}
The same goes for outconds as well.
Mutually exclusive if statements
These if statements cannot both be true at the same time:
if (conds[i] == 1) {
    # ...
}
if (conds[i] == -1) {
    # ...
}
So they should be chained together with an else if.
Formatting
Instead of this:
for (i=1; i<=ocnd_size; ++i) {
    conds[$1 delim outcond_list[i]]--;
}
It would be better to write like this:
for (i = 1; i <= ocnd_size; ++i) {
    conds[$1 delim outcond_list[i]]--;
}
Naming
Some of the names are not so great.
For example sz, i, j, c in the END block.
sz is actually unnecessary,
and I would rename the others to pair, job, and cond,
respectively.
Putting it together
Consider this alternative implementation:
#!/usr/bin/awk -f
BEGIN {
    FS = " *; *";
    delim = "-";
}
{
    split($2, inconds, " ");
    for (i in inconds) {
        conds[inconds[i] delim $1]++;
    }
    split($3, outconds, " ");
    for (i in outconds) {
        conds[$1 delim outconds[i]]--;
    }
}
END {
    oformat = "Invalid %s %s on job %s\n";
    for (pair in conds) {
        split(pair, parts, delim);
        if (conds[pair] == 1) {
            job = parts[2];
            cond = parts[1];
            inorout = "INCOND";
        } else if (conds[pair] == -1) {
            job = parts[1];
            cond = parts[2];
            inorout = "OUTCOND";
        }
        if (conds[pair] != 0) printf oformat, inorout, cond, job;
    }
}