I am running a shell script on machineA that copies files from machineB and machineC to machineA. If a file is not on machineB, then it is guaranteed to be on machineC, so I try to copy each file from machineB first and fall back to machineC if it is not there.
On machineB and machineC, inside the folder /data/pe_t1_snapshot, there are folders named by date in YYYYMMDD format. Whichever date folder is the latest is the one I need to copy from. For example, if 20140317 is the latest date folder inside /data/pe_t1_snapshot, then the full path on machineB and machineC is /data/pe_t1_snapshot/20140317.
I need to copy around 400 files, each 3.5 GB in size, to machineA from machineB and machineC. My shell script below works fine, as I am using scp, but somehow it takes ~3 hours to copy the 400 files to machineA.
Below is my shell script:
#!/bin/bash

readonly PRIMARY=/export/home/david/dist/primary
readonly SECONDARY=/export/home/david/dist/secondary
readonly FILERS_LOCATION=(machineB machineC)
readonly MEMORY_MAPPED_LOCATION=/data/pe_t1_snapshot
PRIMARY_PARTITION=(0 3 5 7 9)     # this will have more file numbers, around 200
SECONDARY_PARTITION=(1 2 4 6 8)   # this will have more file numbers, around 200

dir1=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[0]} ls -dt1 "$MEMORY_MAPPED_LOCATION"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -n1)
dir2=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[1]} ls -dt1 "$MEMORY_MAPPED_LOCATION"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -n1)

echo $dir1
echo $dir2

if [ "$dir1" = "$dir2" ]
then
    # delete all the files first
    find "$PRIMARY" -mindepth 1 -delete
    for el in "${PRIMARY_PARTITION[@]}"
    do
        scp -o ControlMaster=auto -o 'ControlPath=~/.ssh/control-%r@%h:%p' -o ControlPersist=900 \
            david@${FILERS_LOCATION[0]}:$dir1/t1_weekly_1680_"$el"_200003_5.data $PRIMARY/. \
        || scp -o ControlMaster=auto -o 'ControlPath=~/.ssh/control-%r@%h:%p' -o ControlPersist=900 \
            david@${FILERS_LOCATION[1]}:$dir2/t1_weekly_1680_"$el"_200003_5.data $PRIMARY/.
    done

    # delete all the files first
    find "$SECONDARY" -mindepth 1 -delete
    for sl in "${SECONDARY_PARTITION[@]}"
    do
        scp -o ControlMaster=auto -o 'ControlPath=~/.ssh/control-%r@%h:%p' -o ControlPersist=900 \
            david@${FILERS_LOCATION[0]}:$dir1/t1_weekly_1680_"$sl"_200003_5.data $SECONDARY/. \
        || scp -o ControlMaster=auto -o 'ControlPath=~/.ssh/control-%r@%h:%p' -o ControlPersist=900 \
            david@${FILERS_LOCATION[1]}:$dir2/t1_weekly_1680_"$sl"_200003_5.data $SECONDARY/.
    done
fi
I am copying the PRIMARY_PARTITION files into the PRIMARY folder and the SECONDARY_PARTITION files into the SECONDARY folder on machineA.
Is there any way to move the files to machineA faster? Can I copy 10 files at a time, or 5 files at a time, in parallel instead of downloading all the files at once to speed up this process, or is there some other approach?
I don't want to download all the files in parallel. rsync is not helping me either; I tried copying with rsync as well. I guess I need a multithreaded approach: copy 3 files in parallel instead of one at a time, and only move on to the next three once those are done.
Maybe I only need to run two parallel processes, one downloading from B and one from C. Each process removes a file name from the master list, puts that name on its own list, and tries to download it. If the download fails, it puts the name back on the master list. The loop then repeats, skipping any file whose name is already on the process's own list. I would simply run two such processes in parallel, one each for machineB and machineC.
Is this possible to do? If yes, can anyone provide an example? I am trying to limit the number of concurrent transfers, not spawn 400 parallel processes.
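For what it's worth, the "N files at a time" idea can be sketched with `xargs -P`. This is only an illustration: `fetch_one` here is a hypothetical stand-in for the real copy-from-B-with-fallback-to-C command, and the file names are just a sample of the real list.

```shell
#!/bin/bash
# Sketch: keep at most 3 transfers in flight at once.
# fetch_one is a placeholder; in the real script it would be something like:
#   scp machineB:"$dir1/$1" "$PRIMARY/" || scp machineC:"$dir2/$1" "$PRIMARY/"
fetch_one() {
    echo "copied $1"
}
export -f fetch_one

# Feed the file names to xargs; -P3 caps the number of parallel jobs.
printf 't1_weekly_1680_%s_200003_5.data\n' 0 3 5 7 9 \
    | xargs -n1 -P3 -I{} bash -c 'fetch_one "$1"' _ {}
```

Note that `xargs` starts a new job as soon as one finishes, so the pipe stays full without ever exceeding the `-P` limit; this is usually faster than strict batches of three that all have to finish before the next batch starts.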
NOTE: machineA has an SSD and 10-gig Ethernet.
- I would expect network bandwidth to be the bottleneck here. If packets from both B and C travel through the same pipe, you're unlikely to gain anything by running in parallel. Can you describe the network topology a little? – David Harkness, May 4, 2014 at 6:25
- @DavidHarkness: As far as I know, we have a 10-gig network. I am not a Unix guy, so I'm not sure what other information you might need. If you ask me specifically, I might be able to answer, or I can check with our Unix admin. – arsenal, May 4, 2014 at 6:27
- Are these machines all on the same local network? Of course, compressing the files first may shave quite a bit of time if these are simple log files. – David Harkness, May 4, 2014 at 6:27
- Yeah, they are all on the same local network, but there might be some cross-datacenter file transfer as well. The files I am transferring are memory-mapped files generated by a Java program. – arsenal, May 4, 2014 at 6:29
- But I guess that same pipe can handle three file transfers at a time, instead of one? – arsenal, May 4, 2014 at 6:29
1 Answer
You have already posted another question on this same topic two weeks later, with a better solution using GNU parallel. I'll review this one anyway on its own merits, though it may be a bit of a moot point.
It's not a good idea to set StrictHostKeyChecking=no when using ssh. The host key of a server should normally not change. When it does, and you don't know why, it might be a man-in-the-middle attack. If the change is part of a scheduled server update, you can manually update the ~/.ssh/known_hosts file accordingly.
When filtering the output of ls like this:

ls -dt1 path | head -n1

you don't need the -1 flag. The purpose of that flag is to print the list of files in a single column, but when you pipe the output to another command, it is always a single column anyway.
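As an aside, for date-named folders like these, sorting the glob by name is deterministic regardless of modification times. A small self-contained sketch (demo_snap is a throwaway directory I made up, not part of the original script):

```shell
# Create two date-named folders and pick the latest by name.
mkdir -p demo_snap/20140316 demo_snap/20140317

# Piped ls prints one entry per line even without -1;
# sort -r puts the newest YYYYMMDD name first.
latest=$(ls -d demo_snap/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | sort -r | head -n1)
echo "$latest"   # demo_snap/20140317

rm -rf demo_snap
```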
Instead of this:
if [ "$dir1" = "$dir2" ]
This is easier (the variables don't need quoting inside [[ ]]) and more modern:
if [[ $dir1 = $dir2 ]]
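One caveat worth adding (my note, not part of the original answer): inside [[ ]], the right-hand side of = is treated as a glob pattern when left unquoted, so quoting it keeps the comparison literal:

```shell
dir1='/data/pe_t1_snapshot/20140317'
dir2='/data/pe_t1_snapshot/*'   # suppose the variable happened to contain a glob character

# Unquoted right-hand side: treated as a pattern, so this matches.
[[ $dir1 = $dir2 ]] && echo 'pattern match'

# Quoted right-hand side: compared literally, so this does not match.
[[ $dir1 = "$dir2" ]] || echo 'literal mismatch'
```

For the date paths in this script the unquoted form is harmless, but quoting the right-hand side costs nothing.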
Instead of cramming so many ssh options onto the command line:

scp -o ControlMaster=auto -o 'ControlPath=~/.ssh/control-%r@%h:%p' -o ControlPersist=900 david@machineB:...

it's better to put these options in the ~/.ssh/config file, like this:

Host machineB
    Hostname machineB
    User david
    ControlMaster auto
    ControlPath ~/.ssh/control-%r@%h:%p
    ControlPersist 900

(and a similar entry for machineC). This way the scp command becomes simply:

scp machineB:...
I hope this (and even more, my other answer) helps!