\$\begingroup\$

I am running a shell script on machineA that copies files from machineB and machineC to machineA.

If a file is not on machineB, it is guaranteed to be on machineC. So I try to copy from machineB first, and if the file is not there, I go to machineC for the same file.

On machineB and machineC there is a folder named in the form YYYYMMDD inside this folder:

/data/pe_t1_snapshot

Whichever folder has the latest date in the YYYYMMDD format inside the above folder is the one I copy from.

For example, if 20140317 is the latest date folder inside /data/pe_t1_snapshot, then this is the full path:

/data/pe_t1_snapshot/20140317

from which I need to copy the files on machineB and machineC. I need to copy around 400 files to machineA from machineB and machineC, and each file is 3.5 GB.

My current shell script, shown below, works fine using scp, but it takes ~3 hours to copy the 400 files to machineA.

Below is my shell script:

#!/bin/bash
readonly PRIMARY=/export/home/david/dist/primary
readonly SECONDARY=/export/home/david/dist/secondary
readonly FILERS_LOCATION=(machineB machineC)
readonly MEMORY_MAPPED_LOCATION=/data/pe_t1_snapshot
PRIMARY_PARTITION=(0 3 5 7 9) # this will have more file numbers around 200
SECONDARY_PARTITION=(1 2 4 6 8) # this will have more file numbers around 200
dir1=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[0]} ls -dt1 "$MEMORY_MAPPED_LOCATION"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -n1)
dir2=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[1]} ls -dt1 "$MEMORY_MAPPED_LOCATION"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -n1)
echo $dir1
echo $dir2
if [ "$dir1" = "$dir2" ]
then
    # delete all the files first
    find "$PRIMARY" -mindepth 1 -delete
    for el in "${PRIMARY_PARTITION[@]}"
    do
        scp -o ControlMaster=auto -o 'ControlPath=~/.ssh/control-%r@%h:%p' -o ControlPersist=900 david@${FILERS_LOCATION[0]}:$dir1/t1_weekly_1680_"$el"_200003_5.data $PRIMARY/. || scp -o ControlMaster=auto -o 'ControlPath=~/.ssh/control-%r@%h:%p' -o ControlPersist=900 david@${FILERS_LOCATION[1]}:$dir2/t1_weekly_1680_"$el"_200003_5.data $PRIMARY/.
    done
    # delete all the files first
    find "$SECONDARY" -mindepth 1 -delete
    for sl in "${SECONDARY_PARTITION[@]}"
    do
        scp -o ControlMaster=auto -o 'ControlPath=~/.ssh/control-%r@%h:%p' -o ControlPersist=900 david@${FILERS_LOCATION[0]}:$dir1/t1_weekly_1680_"$sl"_200003_5.data $SECONDARY/. || scp -o ControlMaster=auto -o 'ControlPath=~/.ssh/control-%r@%h:%p' -o ControlPersist=900 david@${FILERS_LOCATION[1]}:$dir2/t1_weekly_1680_"$sl"_200003_5.data $SECONDARY/.
    done
fi

I am copying the PRIMARY_PARTITION files into the PRIMARY folder and the SECONDARY_PARTITION files into the SECONDARY folder on machineA.

Is there any way to move the files to machineA faster? Can I copy 5 or 10 files at a time in parallel, rather than all of them at once, to speed this up? Or is there another approach?

I don't want to download all the files in parallel. rsync did not help either; I tried copying with it too. I guess I need to go the multithreaded way, meaning copy three files in parallel instead of one file at a time, and only move on to the next three files once those three are done.

Maybe I only need to run two parallel processes, one downloading from B and one from C. Each process removes a file name from the master list, puts that file name into its own list, and tries to download it. If the download fails, it puts the file name back onto the master list. The loop then repeats, making sure a process does not retry a file whose name is already in its own list. I simply need to run two processes that do this in parallel, one for machine B and one for machine C.

Is this possible? If yes, can anyone provide an example? I am trying to limit the number of threads, not run 400 transfers in parallel.

NOTE: machineA is running on SSD and has 10 gig ethernet.

Jamal
asked May 4, 2014 at 2:47
\$\endgroup\$
  • I would expect network bandwidth to be the bottleneck here. If packets from both B and C travel through the same pipe, you're unlikely to gain anything by running in parallel. Can you describe the network topology a little? Commented May 4, 2014 at 6:25
  • @DavidHarkness: As far as I know, we have a 10 gig network. I am not a Unix guy, so I'm not sure what other information you might need. If you can ask me specifically, I might be able to answer, or I can check with our Unix admin. Commented May 4, 2014 at 6:27
  • Are these machines all on the same local network? Of course, compressing the files first may shave off quite a bit of time if these are simple log files. Commented May 4, 2014 at 6:27
  • Yeah, they are all on the same local network, but there might be some cross-datacenter file transfer as well. The files I am transferring are memory-mapped files generated by a Java program. Commented May 4, 2014 at 6:29
  • But I guess that same pipe can handle three file transfers at a time, instead of one? Commented May 4, 2014 at 6:29

1 Answer

\$\begingroup\$

You posted another question on this same topic two weeks later, and it has a better solution using GNU parallel. I'll review this one anyway on its own merit, though it might be a bit of a moot point.
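
As a rough illustration of the bounded parallelism the question asks about (this is not the GNU parallel solution from the other question), plain bash job control may be enough to copy a few files at a time and wait for each batch to finish. The sketch below assumes bash; the function names `fetch_one` and `fetch_all`, the `COPY` variable, and the `HOSTS` array are hypothetical and not part of the original script.

```shell
#!/bin/bash
# Sketch only: hosts, user, and helper names are illustrative assumptions.
HOSTS=(david@machineB david@machineC)
COPY=${COPY:-scp}   # transfer command; a variable so it can be swapped for a dry run

# Try each host in order until one copy succeeds.
fetch_one() {
    local file=$1 dest=$2 host
    for host in "${HOSTS[@]}"; do
        "$COPY" "$host:$file" "$dest/" && return 0
    done
    echo "failed to copy $file from any host" >&2
    return 1
}

# fetch_all DEST MAX FILE...: copy FILEs into DEST, at most MAX at a time.
# Each batch must finish before the next one starts.
fetch_all() {
    local dest=$1 max=$2 status=0 pid file
    shift 2
    local pids=()
    for file in "$@"; do
        fetch_one "$file" "$dest" &
        pids+=("$!")
        if (( ${#pids[@]} >= max )); then
            for pid in "${pids[@]}"; do wait "$pid" || status=1; done
            pids=()
        fi
    done
    # Wait for the final, possibly partial, batch.
    for pid in "${pids[@]}"; do wait "$pid" || status=1; done
    return "$status"
}
```

Under these assumptions, a call such as `fetch_all "$PRIMARY" 3 "$dir1"/t1_weekly_1680_{0,3,5,7,9}_200003_5.data` would copy three files at a time. A slow file holds up its whole batch, which is one reason a job queue like GNU parallel tends to do better.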


It's not a good idea to set StrictHostKeyChecking=no when using ssh. A server's host key should not normally change. If it does and you don't know why, it might be a man-in-the-middle attack. If the change is part of a scheduled server update, you can update the ~/.ssh/known_hosts file manually.


When filtering the output of ls like this:

ls -dt1 path | head -n1

you don't need the -1 flag. The purpose of that flag is to print one entry per line. But when the output is piped to another command instead of a terminal, ls already prints one entry per line.
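
A quick way to check this locally is to count the lines both variants produce when piped. The snippet is only a demonstration; the temporary directory and the two date-named folders in it are made up for the purpose.

```shell
# Throwaway directory with two date-named folders (names invented for the demo).
tmp=$(mktemp -d)
mkdir "$tmp/20140316" "$tmp/20140317"

# Count the lines each variant writes into a pipe.
with_flag=$(ls -d1 "$tmp"/[0-9]* | wc -l)
without_flag=$(ls -d "$tmp"/[0-9]* | wc -l)

# Both counts are the same: one path per line either way.
echo "with -1: $with_flag, without -1: $without_flag"
rm -r "$tmp"
```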


Instead of this:

if [ "$dir1" = "$dir2" ]

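This is more modern and a little safer, because [[ does not word-split its operands. Do quote the right-hand side, though, so that it is compared literally rather than matched as a glob pattern:

if [[ $dir1 = "$dir2" ]]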
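
One caveat worth seeing in action: inside [[ ]], an unquoted right-hand side of = is treated as a pattern, so quoting it still matters when a literal comparison is intended. A small sketch, assuming bash; the path with a space and the pattern value are hypothetical and chosen only to show the behavior.

```shell
dir1="/data/pe t1/20140317"   # hypothetical path containing a space
dir2="/data/pe t1/20140317"

# [[ ]] does not word-split its operands, so the unquoted left side is safe.
if [[ $dir1 = "$dir2" ]]; then same=yes; else same=no; fi

# An unquoted right side is matched as a pattern, not compared literally.
pattern='2014*'
if [[ 20140317 = $pattern ]]; then glob=match; else glob=nomatch; fi
if [[ 20140317 = "$pattern" ]]; then literal=match; else literal=nomatch; fi

echo "same=$same glob=$glob literal=$literal"
```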

Instead of cramming so many ssh options onto the command line:

scp -o ControlMaster=auto -o 'ControlPath=~/.ssh/control-%r@%h:%p' -o ControlPersist=900 david@machineB:...

it's better to put these options in the ~/.ssh/config file, like this:

Host machineB machineC
    User david
    ControlMaster auto
    ControlPath ~/.ssh/control-%r@%h:%p
    ControlPersist 900

This way the scp command becomes simply:

scp machineB:...

I hope this (and even more, my other answer) helps!

answered Oct 4, 2014 at 20:28
\$\endgroup\$
