Removing duplicate field entries from sorted csv data

Question 1

Given the following input (cat i.txt), I want to remove duplicate field entries in each of the first three columns and none of the others.

DLORENZ;EDDELAK;BCL;G1;2019-04-01;175
DLORENZ;EDDELAK;BRV/COV;G1;2018-01-31;165
DLORENZ;EDDELAK;BRV/COV;G2;2018-02-28;165
DLORENZ;EDDELAK;BRV/COV;WH;2018-05-29;88
DLORENZ;EDDELAK;BRV/COV;WH;2018-10-02;139
...

The input is sorted first on column 1, then on column 2, then on column 3, then on column 4, then on column 5, then on column 6.

That is, from here (cat i.txt | column -s ';' -t)

DLORENZ EDDELAK BCL G1 2019-04-01 175
DLORENZ EDDELAK BRV/COV G1 2018-01-31 165
DLORENZ EDDELAK BRV/COV G2 2018-02-28 165
DLORENZ EDDELAK BRV/COV WH 2018-05-29 88
DLORENZ EDDELAK BRV/COV WH 2018-10-02 139
DLORENZ EDDELAK BRV/COV WH 2019-01-07 140
HELMGBR GUDENDORF BCL G1 2018-04-29 600
HELMGBR GUDENDORF BCL G2 2018-05-28 580
HELMGBR GUDENDORF BCL WH 2018-11-21 600
HELMGBR GUDENDORF BOT G1 2018-07-09 600
HELMGBR GUDENDORF BOT G2 2018-08-06 600
HELMGBR GUDENDORF BOT WH 2019-02-13 600
HELMGBR GUDENDORF CHLM G1 2017-12-14 600
HELMGBR GUDENDORF CHLM G2 2018-01-11 600
HELMGBR GUDENDORF CHLM WH 2018-09-05 550
HKARSTENS KUDEN BCL G1 2019-03-11 255
HKARSTENS KUDEN BCL G2 2019-04-10 255
HSCHLADETSCH EDDELAK BCL G1 2019-03-11 213
HSCHLADETSCH EDDELAK BCL G2 2019-04-08 201
HSCHLADETSCH EDDELAK BRV/COV G1 1979-01-01 218
HSCHLADETSCH EDDELAK BRV/COV G2 1979-01-01 218
HSCHLADETSCH EDDELAK BRV/COV WH 2018-03-13 218
HSCHLADETSCH EDDELAK BRV/COV WH 2018-09-10 160
HWULFF KUDEN BCL G1 2018-02-28 244
HWULFF KUDEN BCL G2 2018-03-28 244
HWULFF KUDEN BCL WH 2018-09-20 190
HWULFF KUDEN BCL WH 2019-03-19 250
HWULFF KUDEN CHLM G1 2018-04-01 244
HWULFF KUDEN CHLM G2 2018-04-29 244
HWULFF KUDEN CHLM WH 2019-03-28 250
JMEIER EDDELAK BCL G1 2018-04-30 360
JMEIER EDDELAK BCL G2 2018-05-28 360
JPETERS KAISERWILHELMKOOG CHLM G1 2018-02-26 65
JPETERS KAISERWILHELMKOOG CHLM G2 2018-03-26 65
JPETERS KAISERWILHELMKOOG CHLM WH 2019-01-18 79
JTHODE BUCHHOLZ BCL G1 2019-03-12 253
JTHODE BUCHHOLZ BCL G2 2019-04-12 253
KMEHLERT BRUNSBUETTEL BCL G1 2018-12-13 79
KMEHLERT BRUNSBUETTEL BCL G2 2019-01-10 119
MMAGENS BARLT CHLM G1 2018-02-13 165
MMAGENS BARLT CHLM G2 2018-03-13 165
MMAGENS BARLT CHLM WH 2018-09-12 136
MMAGENS BARLT CHLM WH 2019-03-14 132
MSCHNEPEL WINDBERGEN CHLM G1 2017-10-09 205
MSCHNEPEL WINDBERGEN CHLM G2 2017-11-02 263
MSCHNEPEL WINDBERGEN CHLM WH 2018-04-10 272
MSCHNEPEL WINDBERGEN CHLM WH 2018-10-25 208
NJUNGE EDDELAK BCL G1 2018-03-07 146
NJUNGE EDDELAK BCL G2 2018-04-04 146
NJUNGE EDDELAK BCL WH 2018-08-06 100
NJUNGE EDDELAK BCL WH 2018-11-14 105
NJUNGE EDDELAK BCL WH 2019-03-12 118
SMOHR BRUNSBUETTEL CHLM G1 2018-04-30 110
SMOHR BRUNSBUETTEL CHLM G2 2018-05-28 110
SMOHR BRUNSBUETTEL CHLM WH 2018-12-18 98

... I want to arrive at the following output (cat 1fertig.txt | column -s ';' -t):

DLORENZ EDDELAK BCL G1 2019-04-01 175
---- ---- BRV/COV G1 2018-01-31 165
---- ---- ---- G2 2018-02-28 165
---- ---- ---- WH 2018-05-29 88
---- ---- ---- WH 2018-10-02 139
---- ---- ---- WH 2019-01-07 140
HELMGBR GUDENDORF BCL G1 2018-04-29 600
---- ---- ---- G2 2018-05-28 580
---- ---- ---- WH 2018-11-21 600
---- ---- BOT G1 2018-07-09 600
---- ---- ---- G2 2018-08-06 600
---- ---- ---- WH 2019-02-13 600
---- ---- CHLM G1 2017-12-14 600
---- ---- ---- G2 2018-01-11 600
---- ---- ---- WH 2018-09-05 550
HKARSTENS KUDEN BCL G1 2019-03-11 255
---- ---- ---- G2 2019-04-10 255
HSCHLADETSCH EDDELAK BCL G1 2019-03-11 213
---- ---- ---- G2 2019-04-08 201
---- ---- BRV/COV G1 1979-01-01 218
---- ---- ---- G2 1979-01-01 218
---- ---- ---- WH 2018-03-13 218
---- ---- ---- WH 2018-09-10 160
HWULFF KUDEN BCL G1 2018-02-28 244
---- ---- ---- G2 2018-03-28 244
---- ---- ---- WH 2018-09-20 190
---- ---- ---- WH 2019-03-19 250
---- ---- CHLM G1 2018-04-01 244
---- ---- ---- G2 2018-04-29 244
---- ---- ---- WH 2019-03-28 250
JMEIER EDDELAK BCL G1 2018-04-30 360
---- ---- ---- G2 2018-05-28 360
JPETERS KAISERWILHELMKOOG CHLM G1 2018-02-26 65
---- ---- ---- G2 2018-03-26 65
---- ---- ---- WH 2019-01-18 79
JTHODE BUCHHOLZ BCL G1 2019-03-12 253
---- ---- ---- G2 2019-04-12 253
KMEHLERT BRUNSBUETTEL BCL G1 2018-12-13 79
---- ---- ---- G2 2019-01-10 119
MMAGENS BARLT CHLM G1 2018-02-13 165
---- ---- ---- G2 2018-03-13 165
---- ---- ---- WH 2018-09-12 136
---- ---- ---- WH 2019-03-14 132
MSCHNEPEL WINDBERGEN CHLM G1 2017-10-09 205
---- ---- ---- G2 2017-11-02 263
---- ---- ---- WH 2018-04-10 272
---- ---- ---- WH 2018-10-25 208
NJUNGE EDDELAK BCL G1 2018-03-07 146
---- ---- ---- G2 2018-04-04 146
---- ---- ---- WH 2018-08-06 100
---- ---- ---- WH 2018-11-14 105
---- ---- ---- WH 2019-03-12 118
SMOHR BRUNSBUETTEL CHLM G1 2018-04-30 110
---- ---- ---- G2 2018-05-28 110
---- ---- ---- WH 2018-12-18 98

The output will be further processed into a LaTeX input file.

The code I wrote is reasonably straightforward:

First kill the duplicates in column 3 of input, then kill the duplicates in column 2 of the result, then kill the duplicates in column 1 of that result.

It is even efficient enough for my needs (and I can't come up with anything substantially faster offhand, except for not writing to disk that much). But it is not readable at all.

n="$(wc -l < i.txt)"
rm -rfv f
mkdir f
cat i.txt > f/i.txt
cd f
while IFS=';' read lbezg rest
do
 echo "$lbezg"';'"$rest" >> 1lw_"$lbezg"
done < i.txt
for file in 1lw_*
do
 while IFS=';' read lbezg sbezg rest
 do
 echo "$lbezg"';'"$sbezg"';'"$rest" >> 1lw_2so_"$lbezg"_"$sbezg"
 done < "$file"
done
for file in 1lw_2so_*
do
 while IFS=';' read lbezg sbezg impfstoff rest
 do
 ii="$(echo "$impfstoff" | tr -d '/')"
 echo "$lbezg"';'"$sbezg"';'"$impfstoff"';'"$rest" >> 1lw_2so_3impfstoff_"$lbezg"_"$sbezg"_"$ii"
 done < "$file"
done
for file in 1lw_2so_3impfstoff_*
do
 awk -F';' -v OFS=';' ' {if (NR>1) $3="----"; print $0}' < "$file"
done > 3fertig.txt
rm 1lw*
while IFS=';' read lbezg rest
do
 echo "$lbezg"';'"$rest" >> 1lw_"$lbezg"
done < 3fertig.txt
for file in 1lw_*
do
 while IFS=';' read lbezg sbezg rest
 do
 echo "$lbezg"';'"$sbezg"';'"$rest" >> 1lw_2so_"$lbezg"_"$sbezg"
 done < "$file"
done
for file in 1lw_2so_*
do
 awk -F';' -v OFS=';' ' {if (NR>1) $2="----"; print $0}' < "$file"
done > 2fertig.txt
rm 1lw*
while IFS=';' read lbezg rest
do
 echo "$lbezg"';'"$rest" >> 1lw_"$lbezg"
done < 2fertig.txt
for file in 1lw_*
do
 awk -F';' -v OFS=';' ' {if (NR>1) $1="----"; print $0}' < "$file"
done > 1fertig.txt
rm 1lw*
####### the rest is for nice error checking and not strictly necessary
time for i in $(seq 1 "$n")
do
 l1="$(sed -n "$i"p < i.txt)" 
 l2="$(sed -n "$i"p < 1fertig.txt)"
 echo "$i"';'"$l1"'|'"$i"';'"$l2"
done | column -s '|' -t > differ.txt
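
A note on the checking loop above: it starts `sed` twice for every line, so each file is re-read once per line. A minimal sketch of the same side-by-side listing in one pass is shown below. The helper name `side_by_side` is hypothetical, and the sketch assumes both files have the same number of lines and that the first file is not empty (the `NR == FNR` idiom relies on that).

```shell
#!/bin/sh
# side_by_side: hypothetical helper, not part of the original script.
# Prints "N;<line N of file 1>|N;<line N of file 2>" for every line of file 2.
# File 1 is read once and held in memory; file 2 is streamed.
side_by_side() {
  awk 'NR == FNR { a[FNR] = $0; next }              # first file: remember each line
       { print FNR ";" a[FNR] "|" FNR ";" $0 }      # second file: pair it up
      ' "$1" "$2"
}

# Usage with the file names from the script above:
#   side_by_side i.txt 1fertig.txt | column -s '|' -t > differ.txt
```

The output format matches what the loop writes, so `column -s '|' -t` can stay as it is.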

I wonder how you would go about this?


Accepted Answer · Oh My Goodness · 2019-04-30 12:55:26Z

The shell is usually a poor choice for processing data. Let awk do it for you:

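#!/usr/bin/awk -f
BEGIN { FS = OFS = ";" }
{
 # stub grows into the key prefix: ";f1", then ";f1;f2", then ";f1;f2;f3"
 stub=""
 # replace field i with "----" if this exact prefix occurred on an earlier line
 for (i=1;i<=3;i++) if (saw[ stub = stub FS $i ]++) $i="----"
 print
}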

If it has to be bash:

#!/bin/bash
awk -F\; -vOFS=\; '{s=0; for(i=1;i<=3;i++) if(saw[s=s FS $i]++) $i="----"} 1' i.txt > 1fertig.txt
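
To see why the array key is the growing prefix rather than the field on its own, here is a minimal sketch that runs the same awk program on a made-up five-line sample. The wrapper name `blank_repeats` and the values `A`, `B`, `P`, `Q`, `x`, `y` are illustrative and not taken from the question.

```shell
#!/bin/sh
# blank_repeats: hypothetical wrapper around the awk program shown above.
# Reads ';'-separated lines on stdin and replaces a value in columns 1-3
# with "----" when the same prefix (columns 1..i) was seen on an earlier line.
blank_repeats() {
  awk 'BEGIN { FS = OFS = ";" }
  {
    stub = ""
    for (i = 1; i <= 3; i++)
      if (saw[stub = stub FS $i]++) $i = "----"
    print
  }'
}

# Made-up sample: the column-3 value "x" occurs under both A and B.
printf '%s\n' \
  'A;P;x;1' \
  'A;P;x;2' \
  'A;P;y;3' \
  'B;P;x;4' \
  'B;Q;x;5' | blank_repeats
```

On this sample the fourth line comes out unchanged as `B;P;x;4`: `x` and `P` were seen before, but not under `B`, so the keys `;B;P` and `;B;P;x` are new. The fifth line becomes `----;Q;x;5` because only the key `;B` repeats.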


I'm afraid this does not produce the intended output. For example, the first line of the output now reads `DLORENZ;EDDELAK;BRV/COV;G2;2018-02-28;165`, which is incorrect because no line above it has "G1" in the fourth field, whereas in the original output we have `DLORENZ;EDDELAK;BRV/COV;G1;2018-01-31;165` in line 2, followed by `DLORENZ;EDDELAK;BRV/COV;G2;2018-02-28;165` in line 3. I'll see where I end up using your approach with arrays, though. Decorating the output of your script and re-sorting will allow me to restore the sort.
– Thure Dührsen, 2019-05-02 12:18:59 +00:00

I think you've introduced an error somewhere, or changed your input without realizing it. Before posting, I saved your example input and output. A diff between your output and mine gave zero differences. Testing again just now, the first line of output I get is `DLORENZ;EDDELAK;BCL;G1;2019-04-01;175`
– Oh My Goodness, 2019-05-02 22:04:18 +00:00

What awk version do you use?
[tdu:gimli] /tmp/tdu/work awk --version | head -n 2
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)
Copyright (C) 1989, 1991-2016 Free Software Foundation.
[tdu:gimli] /tmp/tdu/work head -n 1 i.txt
DLORENZ;EDDELAK;BRV/COV;G2;2018-02-28;165;Lorenz:Dirk;DLORENZ
[tdu:gimli] /tmp/tdu/work LC_ALL=C awk -F\; -vOFS=\; '{s=0; for(i=1;i<=3;i++) if(saw[s=s FS $i]++) $i="----"} 1' i.txt > 1fertig.txt
[tdu:gimli] /tmp/tdu/work head -n 1 1fertig.txt
DLORENZ;EDDELAK;BRV/COV;G2;2018-02-28;165;Lorenz:Dirk;DLORENZ
– Thure Dührsen, 2019-05-03 08:55:09 +00:00

and the backticks for code seem not to be suitable for several lines in a row... :(
– Thure Dührsen, 2019-05-03 08:55:49 +00:00

GNU Awk 4.2.1. Look at your own input. The first line is G2 in the 4th field, and the output is the same. There's nothing in the awk code that can change the 4th field at all, let alone from G1 to G2.
– Oh My Goodness, 2019-05-03 09:25:19 +00:00
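
The disagreement in these comments appears to come down to the input file: the `i.txt` shown in the terminal session starts with a `BRV/COV;G2` record and has eight fields, so it is apparently not the sorted sample from the question. A minimal sketch for checking the order before running the awk program, and for sorting first when the check fails, might look like this. The helper names `is_sorted` and `sort_and_blank` are hypothetical, and `LC_ALL=C` (plain byte order, string comparison on every column) is an assumption about what "sorted" means here.

```shell
#!/bin/sh
# Hypothetical helpers; the key list mirrors the order stated in the question
# (column 1, then 2, ... then 6), compared as strings in the C locale.

# is_sorted FILE: exit status 0 if FILE is already in that order.
is_sorted() {
  LC_ALL=C sort -c -t ';' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 "$1" 2>/dev/null
}

# sort_and_blank FILE: sort FILE, then replace repeated values in columns 1-3.
sort_and_blank() {
  LC_ALL=C sort -t ';' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 "$1" |
    awk -F ';' -v OFS=';' '{
      s = ""
      for (i = 1; i <= 3; i++) if (saw[s = s FS $i]++) $i = "----"
    } 1'
}

# Usage with the file names from the question:
#   is_sorted i.txt || echo "i.txt is not in the expected order" >&2
#   sort_and_blank i.txt > 1fertig.txt
```

Sorting first keeps each `----` directly below the line that carries the value it stands for, which is what the expected output in the question relies on.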
