Optimize text search in files with Bash

Question 1

I would like some suggestions for improving the performance of a simple project I wrote in Bash on Linux.

The goal is to read all the *.desktop files and extract the Name, Exec, Icon and Comment entries. The results are then displayed in a GTK yad list.

I have written the code in two versions. Both versions work, but they are very slow.

Version 1 :

Read the files one by one and grep each for the fields. This version needs about 9 seconds to read and grep 300 desktop files.

TimeStarted=$(date +%s.%N)
files=/usr/share/applications/*.desktop
fileindex=1 
for i in $(ls $files); do
 readarray -t executable < <(grep -m 1 "^Exec=" $i |cut -f 2 -d '=')
 readarray -t comment < <(grep -m 1 "^Comment=" $i |cut -f 2 -d '=')
 readarray -t comment2 < <(grep -m 1 "^GenericName=" $i |cut -f 2 -d '=')
 readarray -t mname < <(grep -m 1 "^Name=" $i |cut -f 2 -d '=')
 readarray -t icon < <(grep -m 1 "^Icon=" $i |cut -f 2 -d '=')
 if [[ $comment = "" ]]; then
 comment=$comment2
 fi
 yadlist+=( "$fileindex" "${icon[0]}" "${mname[0]}" "$i" "${executable[0]}" "${comment[0]}" ) #this sets double quotes in each variable.
 fileindex=$(($fileindex+1))
done
TimeFinished=$(date +%s.%N); TimeDiff=$(echo "$TimeFinished - $TimeStarted" | bc -l)

Version 2:

Grep all the files at once for the required fields. This version is faster and needs about 2 seconds to grep 300 desktop files.

TimeStarted=$(date +%s.%N)
files=/usr/share/applications/*.desktop
i=$files 
fileindex=1
IFS=$'\n'
readarray -t fi < <(printf '%s\n' $i)
readarray -t executable < <(grep -m 1 '^Exec=' $i)
readarray -t noexecutable < <(grep -L '^Exec=' $i)
readarray -t comment < <(grep -m 1 "^Comment=" $i )
readarray -t nocomment < <(grep -L "^Comment=" $i ) 
readarray -t comment2 < <(grep -m 1 "^GenericName=" $i )
readarray -t nocomment2 < <(grep -L "^GenericName=" $i )
readarray -t mname < <(grep -m 1 "^Name=" $i )
readarray -t nomname < <(grep -L "^Name=" $i ) 
readarray -t icon < <(grep -m 1 "^Icon=" $i )
readarray -t noicon < <(grep -L "^Icon=" $i ) 
for items1 in ${noexecutable[@]}; do
 executable+=($(echo "$items1"":Exec=None"))
done
for items2 in ${nocomment[@]}; do
 comment+=($(echo "$items2"":Comment=None"))
done
for items3 in ${nocomment2[@]}; do
 comment2+=($(echo "$items3"":GenericName=None"))
done
for items4 in ${nomname[@]}; do
 mname+=($(echo "$items4"":Name=None"))
done
for items5 in ${noicon[@]}; do
 icon+=($(echo "$items5"":Icon=None"))
done
sortexecutable=($(sort <<<"${executable[*]}"))
sortcomment=($(sort <<<"${comment[*]}"))
sortcomment2=($(sort <<<"${comment2[*]}"))
sortmname=($(sort <<<"${mname[*]}"))
sorticon=($(sort <<<"${icon[*]}"))
trimexecutable=($(grep -Po '(?<=Exec=)[ --0-9A-Za-z/]*' <<<"${sortexecutable[*]}"))
trimcomment=($(grep -Po '(?<=Comment=)[ --0-9A-Za-z/]*' <<<"${sortcomment[*]}"))
trimcomment2=($(grep -Po '(?<=GenericName=)[ --0-9A-Za-z/]*' <<<"${sortcomment2[*]}"))
trimmname=($(grep -Po '(?<=Name=)[ --0-9A-Za-z/]*' <<<"${sortmname[*]}"))
trimicon=($(grep -Po '(?<=Icon=)[ --0-9A-Za-z/]*' <<<"${sorticon[*]}"))
unset IFS
ae=0
for aeitem in ${fi[@]};do
 if [[ ${trimcomment[ae]} = "None" ]]; then
 trimcomment[ae]=${trimcomment2[ae]}
 fi
 yadlist+=( "$fileindex" "${trimicon[$ae]}" "${trimmname[$ae]}" "${fi[$ae]}" "${trimexecutable[$ae]}" "${trimcomment[$ae]}" ) #this sets double quotes in each variable.
 fileindex=$(($fileindex+1))
 ae=$(($ae+1))
done
TimeFinished=$(date +%s.%N); TimeDiff=$(echo "$TimeFinished - $TimeStarted" | bc -l)

Remarks:

a) Some .desktop files do not include all the required fields.

b) The timings were measured on a 64-bit Intel Celeron N3050 machine with 4 GB RAM, running 64-bit Debian 8 Sid with XFCE, GNU bash 4.4.0(1) and GNU grep 2.26. PS: The timings of 9 and 2 seconds were also confirmed with time ./script.sh.

c) The version 2 script runs in under 0.5 seconds if I remove the "for" sections, but then yadlist becomes a mess, because the fields missing from some .desktop files shift the entries out of step.

Result:

In my opinion, even 2 seconds to grep 300 files is still too long for such a small number of files.

Is it possible to further optimize this script's performance in Bash?

As a sample, here is the caja.desktop file taken from my system. Notice that the Comment entry is missing.

[Desktop Entry]
Name=Caja
Name[af]=Caja
<More Name entries for different locale>
GenericName=File Manager
GenericName[af]=Lêerbestuurder
<more GenericName entries for different locale>
Exec=caja
Icon=system-file-manager
Terminal=false
Type=Application
StartupNotify=true
NoDisplay=true
OnlyShowIn=MATE;

In other .desktop files, the comment entry (if present) looks like this:

Comment=View multi-page documents
<various Comment entries for different locale>

Question 2

Welcome to Code Review. Could you please post an example of a .desktop file?

Question 3

Hello! Nice to meet you all. I added a real desktop file sample in my main question.

Question 4

Review w.r.t the algorithm only (language independent):

You run five greps per file to extract what you need. Instead, search for all five at once: grep -E "A|B|C|D|E" (the -E option is needed for | to act as alternation). If this doesn't suit your requirement, write a simple program that reads the file and extracts all five parameters in one read instead of five.
Calculate comment2 only if [[ $comment = "" ]].
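
The single-grep idea can be sketched as follows. This is only a minimal sketch, assuming GNU grep and Bash 4 or later (for associative arrays); the function name parse_desktop and the array fields are hypothetical and do not come from the original script. Each matched line is stored under its own key, so a missing entry leaves that key empty instead of shifting the other values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract five keys from one .desktop file with a single grep.
declare -A fields

parse_desktop() {
    local file=$1 key value
    fields=([Name]="" [Exec]="" [Icon]="" [Comment]="" [GenericName]="")
    # -E enables alternation; the trailing '=' excludes localized keys such as Name[af]=.
    while IFS='=' read -r key value; do
        # Keep only the first non-empty occurrence of each key.
        [[ -z ${fields[$key]} ]] && fields[$key]=$value
    done < <(grep -E '^(Name|Exec|Icon|Comment|GenericName)=' "$file")
    # Fall back to GenericName when Comment is absent.
    [[ -z ${fields[Comment]} ]] && fields[Comment]=${fields[GenericName]}
    return 0
}
```

After parse_desktop "$file", the values can be appended to yadlist as "${fields[Icon]}", "${fields[Name]}" and so on. This still starts one grep process per file, so it reduces the process count roughly fivefold rather than eliminating it.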

Question 5

A single array filled by one grep for all five entries is indeed fast, but if one of the entries is missing entirely, the array gets out of step, since grep returns nothing for the missing entry. Fixing up that array afterwards is slower than running 5 greps.

Question 6

By the way, your advice for comment2 is good. Thanks.

Question 7

@George: If you cannot use the OR functionality, I would suggest reading the file yourself instead of using grep. That would be faster.
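
Reading the file yourself could look roughly like this in Bash. It is a sketch under assumptions, not a measured improvement: the function name read_desktop and the variables name, exec_cmd, icon, comment and generic are hypothetical. It uses only shell builtins, so no grep or cut process is started for any file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one .desktop file using Bash builtins only.
# Results are left in the global variables name, exec_cmd, icon, comment, generic.
read_desktop() {
    local file=$1 line
    name="" exec_cmd="" icon="" comment="" generic=""
    # The '|| [[ -n $line ]]' part handles a last line without a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        [[ $line == *=* ]] || continue      # skip group headers and blank lines
        case ${line%%=*} in                 # text before the first '='
            Name)        name=${name:-${line#*=}} ;;
            Exec)        exec_cmd=${exec_cmd:-${line#*=}} ;;
            Icon)        icon=${icon:-${line#*=}} ;;
            Comment)     comment=${comment:-${line#*=}} ;;
            GenericName) generic=${generic:-${line#*=}} ;;
        esac
    done < "$file"
    # Fall back to GenericName when Comment is absent.
    comment=${comment:-$generic}
}
```

A caller would loop over the glob directly, for example for f in /usr/share/applications/*.desktop, call read_desktop "$f", and append the variables to yadlist. Localized keys such as Name[af] do not match any case pattern, so only the unlocalized entries are kept.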

Question 8

After a lot of research and "observation" I found the problem...

The real cause of the script's limited performance was CPU frequency scaling. As soon as I forced the processor to run at full speed (1.6 GHz), version 2 finished in 0.5 seconds!

All I had to do was check the script's performance on another machine, and I was lucky that this other machine did not have CPU scaling enabled.

From a programmer's point of view, there is no doubt that version 2 is MUCH faster than version 1. It also seems that version 2 is about the most we can get out of Bash.

PS1: I adopted thepace's recommendation for the calculation of comment2. That improved the script's performance by a few milliseconds.

PS2: To make my CPU run at full speed I had to disable intel_pstate and apply the performance governor with cpufrequtils (cpufreq-set -c 0 -g performance, and the same for -c 1), or even better, pin the CPU at its maximum frequency with cpufreq-set -c 0 -f 1600000.

PS3: The performance governor is also available with intel_pstate enabled (the default setting), but in practice intel_pstate keeps adjusting and reducing the CPU speed even under the performance governor, as shown by cpufreq-info (though it behaves better than the default powersave governor). With intel_pstate disabled and the performance governor applied, the CPU stays at 1.6 GHz.

PS4: I had no idea that cpufrequtils is installed by default in Debian 8...
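
Before benchmarking, it can help to check which governor each CPU is using. The following is a small sketch, assuming a Linux system that exposes cpufreq through sysfs; the function name show_governors and its optional root-directory parameter are hypothetical additions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the scaling governor of each CPU.
# The sysfs root can be overridden by the first argument.
show_governors() {
    local root=${1:-/sys/devices/system/cpu} f cpu
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [[ -r $f ]] || continue      # cpufreq may be absent, e.g. inside a VM
        cpu=${f#"$root"/}            # strip the root prefix
        cpu=${cpu%%/*}               # keep only the cpuN component
        printf '%s: %s\n' "$cpu" "$(<"$f")"
    done
}
```

Calling show_governors with no argument prints one line per CPU, such as the CPU name followed by its governor. If nothing is printed, the system does not expose cpufreq at the assumed path.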

For those who want to give it a try, the full script can be found here: https://github.com/gevasiliou/PythonTests/blob/master/appslist.sh

If you don't have .desktop files in your system (usually found at /usr/share/applications/) you can download this folder with around 300 files for testing: https://github.com/gevasiliou/PythonTests/tree/master/appsfiles

Question 9

Another pitfall when benchmarking these things is the disk cache. Try, for example, time find ~ -name unlikely. On my current machine it took 3.6 s the first time and 0.05 s the second time.
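
One way to reduce the effect of the disk cache on a measurement is to run the command once untimed and only then time it. This is a rough sketch, assuming GNU date (for the %N nanoseconds format); the function name bench is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: warm the disk cache with one untimed run,
# then print the wall-clock time of each following run in milliseconds.
# Usage: bench RUNS COMMAND [ARGS...]
bench() {
    local runs=$1 start end i
    shift
    "$@" >/dev/null 2>&1             # warm-up run, not timed
    for ((i = 1; i <= runs; i++)); do
        start=$(date +%s%N)          # nanoseconds since the epoch (GNU date)
        "$@" >/dev/null 2>&1
        end=$(date +%s%N)
        printf 'run %d: %d ms\n' "$i" $(( (end - start) / 1000000 ))
    done
}
```

For example, bench 3 ./script.sh would report three timings taken with the files already cached. The figures include the cost of starting date itself, so they are only meaningful for comparing runs against each other.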

Question 10

@5gon12eder Very interesting info. Thanks!

thepace · Answer 1 · 2016-11-07 16:55:17Z


Comments:

George Vasiliou (Nov 7, 2016 at 17:16): A single array filled by one grep for all five entries is indeed fast, but if one of the entries is missing entirely, the array gets out of step, since grep returns nothing for the missing entry. Fixing up that array afterwards is slower than running 5 greps.

George Vasiliou (Nov 7, 2016 at 17:18): By the way, your advice for comment2 is good. Thanks.

thepace (Nov 7, 2016 at 17:27): @George: If you cannot use the OR functionality, I would suggest reading the file yourself instead of using grep. That would be faster.

George Vasiliou · Answer 2 · 2016-11-11 03:03:06Z