This is my first Go program. I'm learning the language, but some of the concepts are hard to grasp, so I wrote this to practice. It's a simple program that recursively checks a directory for duplicate files.
It uses a SHA256 hash on files in order to identify if two files are the same or not. I spawn multiple workers to handle this hashing.
Here is how it works:
- n workers (goroutines) are spawned, each waiting for file paths to process on the same channel, named input in my code.
- 1 goroutine is spawned to recursively search for files in the directory and populate the input channel with file names.
- The main goroutine processes the results as soon as they are available and adds them to a map of sha256->[file, file, ...].
Finally we just display the duplicates.
Please feel free to comment on anything; I really want to improve in Go, especially "idiomatic" Go.
EDIT: Improved my initial code with flags and error management.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
	"sync"
	"flag"
	"runtime"
	"io"
)

var dir string
var workers int

type Result struct {
	file string
	sha256 [32]byte
}

func worker(input chan string, results chan<- *Result, wg *sync.WaitGroup) {
	for file := range input {
		var h = sha256.New()
		var sum [32]byte
		f, err := os.Open(file)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		if _, err = io.Copy(h, f); err != nil {
			fmt.Fprintln(os.Stderr, err)
			f.Close()
			continue
		}
		f.Close()
		copy(sum[:], h.Sum(nil))
		results <- &Result{
			file: file,
			sha256: sum,
		}
	}
	wg.Done()
}

func search(input chan string) {
	filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
		} else if info.Mode().IsRegular() {
			input <- path
		}
		return nil
	})
	close(input)
}

func main() {
	flag.StringVar(&dir, "dir", ".", "directory to search")
	flag.IntVar(&workers, "workers", runtime.NumCPU(), "number of workers")
	flag.Parse()
	fmt.Printf("Searching in %s using %d workers...\n", dir, workers)
	input := make(chan string)
	results := make(chan *Result)
	wg := sync.WaitGroup{}
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go worker(input, results, &wg)
	}
	go search(input)
	go func() {
		wg.Wait()
		close(results)
	}()
	counter := make(map[[32]byte][]string)
	for result := range results {
		counter[result.sha256] = append(counter[result.sha256], result.file)
	}
	for sha, files := range counter {
		if len(files) > 1 {
			fmt.Printf("Found %d duplicates for %s: \n", len(files), hex.EncodeToString(sha[:]))
			for _, f := range files {
				fmt.Println("-> ", f)
			}
		}
	}
}
- hint: You may want to change the DIR for those trying elsewhere, so it doesn't panic immediately. – Randy Howard, Feb 10, 2018 at 3:39
- Thanks, I did it and added flags to set directory and number of workers :-) – Thibaut D., Feb 10, 2018 at 9:24
1 Answer
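1. Declare all vars at once
Instead of
var dir string
var workers int
you can write
var (
	dir string
	workers int
)
Or, even better, use local variables instead of globals, directly in your main() function (flag.String and flag.Int return pointers, so the values are then read as *dir and *workers):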
dir := flag.String("dir", ".", "directory to search")
workers := flag.Int("workers", runtime.NumCPU(), "number of workers")
2. Make sure that arguments are valid
If workers is negative, wg.Add(workers) panics; if it is 0, no worker is started and no file is ever hashed. A small check after flag.Parse() prevents both:
if *workers <= 0 {
	fmt.Printf("workers has to be > 0, was %d\n", *workers)
	os.Exit(1)
}
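One way to keep this check testable is to move flag parsing into its own function. The following is only a sketch: parseArgs and the flag set name "dupes" are hypothetical and not part of the reviewed code, and main feeds it sample argument lists instead of os.Args[1:] so the output is deterministic.

```go
package main

import (
	"flag"
	"fmt"
	"runtime"
)

// parseArgs is a hypothetical helper (not part of the reviewed code): it
// parses the two flags from args and rejects a non-positive worker count.
func parseArgs(args []string) (string, int, error) {
	fs := flag.NewFlagSet("dupes", flag.ContinueOnError)
	dir := fs.String("dir", ".", "directory to search")
	workers := fs.Int("workers", runtime.NumCPU(), "number of workers")
	if err := fs.Parse(args); err != nil {
		return "", 0, err
	}
	if *workers <= 0 {
		return "", 0, fmt.Errorf("workers has to be > 0, was %d", *workers)
	}
	return *dir, *workers, nil
}

func main() {
	// Sample argument lists stand in for os.Args[1:] in this sketch.
	for _, args := range [][]string{
		{"-dir", "/tmp", "-workers", "4"},
		{"-workers", "0"},
	} {
		dir, workers, err := parseArgs(args)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("dir=%s workers=%d\n", dir, workers)
	}
}
```

In a real program main would pass os.Args[1:], print the error to os.Stderr and exit with a non-zero status.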
3. Improve hash computing:
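First, each worker only needs a single instance of hash.Hash, as you can call Reset() on it for each file. Calling it at the top of the loop keeps the state clean even when the previous file failed halfway through and the loop hit a continue:
h := sha256.New()
for file := range input {
	h.Reset()
	...
	results <- &Result{...}
}
Also, the hash of each file could be stored as a string instead of a [32]byte, which removes the copy() into a fixed-size array and the hex encoding at print time:
results <- &Result{
	file: file,
	sha256: fmt.Sprintf("%x", h.Sum(nil)),
}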
4. Always specify the channel direction when you can
From the Go specification:
A channel provides a mechanism for concurrently executing functions to communicate by sending and receiving values of a specified element type. The value of an uninitialized channel is nil.
ChannelType = ( "chan" | "chan" "<-" | "<-" "chan" ) ElementType .
The optional <- operator specifies the channel direction, send or receive. If no direction is given, the channel is bidirectional. A channel may be constrained only to send or only to receive by conversion or assignment.
Specifying the channel direction makes it easier to see what a function does with the channel, and the compiler rejects any use in the wrong direction.
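As a minimal illustration of directional channels (produce and sum are made-up names, unrelated to the reviewed program):

```go
package main

import "fmt"

// produce may only send on out; receiving from out would not compile.
func produce(out chan<- int, n int) {
	for i := 0; i < n; i++ {
		out <- i
	}
	close(out) // closing is allowed on a send-only channel
}

// sum may only receive from in; sending on it or closing it would not compile.
func sum(in <-chan int) int {
	total := 0
	for v := range in {
		total += v
	}
	return total
}

func main() {
	ch := make(chan int) // bidirectional; converted implicitly at each call
	go produce(ch, 5)
	fmt.Println(sum(ch)) // 0+1+2+3+4 = 10
}
```

The channel is still created as a plain chan int; only the function signatures narrow it, so the caller does not need any explicit conversion.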
Here is the new version of the code:
package main

import (
	"crypto/sha256"
	"flag"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"runtime"
	"sync"
)

type Result struct {
	file string
	sha256 string
}

func search(dir string, input chan<- string) {
	filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
		} else if info.Mode().IsRegular() {
			input <- path
		}
		return nil
	})
	close(input)
}

func startWorker(input <-chan string, results chan<- *Result, wg *sync.WaitGroup) {
	defer wg.Done()
	h := sha256.New()
	for file := range input {
		// Reset first, so a failed copy on the previous file cannot leak into this hash.
		h.Reset()
		f, err := os.Open(file)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		if _, err := io.Copy(h, f); err != nil {
			fmt.Fprintln(os.Stderr, err)
			f.Close()
			continue
		}
		f.Close()
		results <- &Result{
			file: file,
			sha256: fmt.Sprintf("%x", h.Sum(nil)),
		}
	}
}

func run(dir string, workers int) (map[string][]string, error) {
	input := make(chan string)
	go search(dir, input)

	counter := make(map[string][]string)
	results := make(chan *Result)
	done := make(chan struct{})
	go func() {
		defer close(done)
		for r := range results {
			counter[r.sha256] = append(counter[r.sha256], r.file)
		}
	}()

	var wg sync.WaitGroup
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go startWorker(input, results, &wg)
	}
	wg.Wait()
	close(results)
	// Wait until the collector has drained results before reading counter.
	<-done
	return counter, nil
}

func main() {
	dir := flag.String("dir", ".", "directory to search")
	workers := flag.Int("workers", runtime.NumCPU(), "number of workers")
	flag.Parse()
	if *workers <= 0 {
		fmt.Printf("workers has to be > 0, was %d \n", *workers)
		os.Exit(1)
	}
	fmt.Printf("Searching in %s using %d workers...\n", *dir, *workers)
	counter, err := run(*dir, *workers)
	if err != nil {
		fmt.Printf("failed! %v\n", err)
		os.Exit(1)
	}
	for sha, files := range counter {
		if len(files) > 1 {
			fmt.Printf("Found %d duplicates for %v: \n", len(files), sha)
			for _, f := range files {
				fmt.Println("-> ", f)
			}
		}
	}
}
Possible improvements:
Currently, if an error occurs somewhere in the code, the program does not stop; it just writes the error to os.Stderr. It may be better to return the error to main() and call os.Exit(1) there.
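A rough sketch of what returning the error could look like for the hashing step. hashFile and demo are hypothetical names, and demo hashes a temporary file only to keep the example self-contained (os.CreateTemp needs Go 1.16 or later).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// hashFile is a hypothetical helper: it streams the file through SHA-256 and
// returns any error to the caller instead of printing it.
func hashFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err // the *PathError from os.Open already names the path
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", fmt.Errorf("reading %s: %w", path, err)
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// demo writes a small temporary file, hashes it, and propagates every error.
func demo() error {
	tmp, err := os.CreateTemp("", "dupes-*.txt")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name())
	if _, err := tmp.WriteString("hello"); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	sum, err := hashFile(tmp.Name())
	if err != nil {
		return err
	}
	fmt.Println(sum)
	return nil
}

func main() {
	// Only main decides to exit; deferred cleanups in demo still run.
	if err := demo(); err != nil {
		fmt.Fprintln(os.Stderr, "failed:", err)
		os.Exit(1)
	}
}
```

Whether one unreadable file should abort the whole scan is a design choice; a worker could also send the error back through the results channel and let the caller decide.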
- Instead of ioutil.ReadFile, you could follow the example (file) of golang.org and use Open followed by io.Copy. Using this, the whole file does not need to be loaded in memory! – oliverpool, Feb 22, 2018 at 17:59
- @oliverpool you're right, thanks for noticing this! I've updated the code – felix, Feb 27, 2018 at 13:06