Java app for getting metadata of millions of files in a directory

Question 1

I am writing a Java app that gets the metadata of files in a directory and exports it to a .csv file. The app works fine if the number of files is fewer than a million. But if I feed in a path that has about 3,200,000 files across all of its directories and sub-directories, it takes forever. Is there a way I can speed things up here?

private void extractDetailsCSV(File libSourcePath, String extractFile) throws ScraperException {
    log.info("Inside extract details csv");
    try{
        FileMetadataUtil fileUtil = new FileMetadataUtil();
        File[] listOfFiles = libSourcePath.listFiles();
        for(int i = 0; i < listOfFiles.length; i++) {
            if(listOfFiles[i].isDirectory()) {
                extractDetailsCSV(listOfFiles[i],extractFile);
            }
            if(listOfFiles[i].isFile()){
                ScraperOutputVO so = new ScraperOutputVO();
                Path path = Paths.get(listOfFiles[i].getAbsolutePath());
                so.setFilePath(listOfFiles[i].getParent());
                so.setFileName(listOfFiles[i].getName());
                so.setFileType(getFileType(listOfFiles[i].getAbsolutePath()));
                BasicFileAttributes basicAttribs = fileUtil.getBasicFileAttributes(path);
                if(basicAttribs != null) {
                    so.setDateCreated(basicAttribs.creationTime().toString().substring(0, 10) + " " + basicAttribs.creationTime().toString().substring(11, 16));
                    so.setDateLastModified(basicAttribs.lastModifiedTime().toString().substring(0, 10) + " " + basicAttribs.lastModifiedTime().toString().substring(11, 16));
                    so.setDateLastAccessed(basicAttribs.lastAccessTime().toString().substring(0, 10) + " " + basicAttribs.lastAccessTime().toString().substring(11, 16));
                }
                so.setFileSize(String.valueOf(listOfFiles[i].length()));
                so.setAuthors(fileUtil.getOwner(path));
                so.setFolderLink(listOfFiles[i].getAbsolutePath());
                writeCsvFileDtl(extractFile, so);
                so.setFileName(listOfFiles[i].getName());
                noOfFiles ++;
            }
        }
    } catch (Exception e) {
        log.error("IOException while setting up columns" + e.fillInStackTrace());
        throw new ScraperException("IOException while setting up columns" , e.fillInStackTrace());
    }
    log.info("Done extracting details to csv file");
}
public void writeCsvFileDtl(String extractFile, ScraperOutputVO scraperOutputVO) throws ScraperException {
    try {
        FileWriter writer = new FileWriter(extractFile, true);
        writer.append(scraperOutputVO.getFilePath());
        writer.append(',');
        writer.append(scraperOutputVO.getFileName());
        writer.append(',');
        writer.append(scraperOutputVO.getFileType());
        writer.append(',');
        writer.append(scraperOutputVO.getDateCreated());
        writer.append(',');
        writer.append(scraperOutputVO.getDateLastModified());
        writer.append(',');
        writer.append(scraperOutputVO.getDateLastAccessed());
        writer.append(',');
        writer.append(scraperOutputVO.getFileSize());
        writer.append(',');
        writer.append(scraperOutputVO.getAuthors());
        writer.append(',');
        writer.append(scraperOutputVO.getFolderLink());
        writer.append('\n');
        writer.flush();
        writer.close();
    } catch (IOException e) {
        log.info("IOException while writing to csv file" + e.fillInStackTrace());
        throw new ScraperException("IOException while writing to csv file" , e.fillInStackTrace());
    }
}
}

Question 2

Which version of Java do you use?

Question 3

@Marc-Andre I use Java 7

Question 4

This link could help you: docs.oracle.com/javase/tutorial/essential/io/walk.html. It's a tutorial about walking a file tree using Java NIO.

Marc-Andre · Answer 1 · 2013-08-20 00:08:17Z

I suspect that the old File class is the root of your problem. Since you're using Java 7, you should use the new java.nio.file classes. I've seen that you already use some of them, like Path, so it shouldn't be too difficult. I don't know what your class looks like at the moment, so I've changed some methods based on what I'm used to doing. The class I will be using is SimpleFileVisitor, since it is the basic implementation of FileVisitor.

So I've created a class Walker (this is a very bad name; you should change it to something clearer, since I have no good idea right now) that extends SimpleFileVisitor. The class has an attribute extractFile that corresponds to the filename of the CSV. The class overrides preVisitDirectory, visitFile and visitFileFailed from FileVisitor. I've also added your method writeCsvFileDtl and a createDate method (thanks to @unholysampler; you should read his answer too).

The class should look like this:

import static java.nio.file.FileVisitResult.CONTINUE;

import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;

public class Walker extends SimpleFileVisitor<Path> {

    // FileMetadataUtil, ScraperOutputVO and getFileType(String) come from the
    // question's code; copy getFileType into this class or make it accessible.
    private final FileMetadataUtil fileUtil = new FileMetadataUtil();
    private final String extractFile;

    public Walker(String extractFile) {
        this.extractFile = extractFile;
    }

    @Override
    public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attr)
            throws IOException {
        populateAndWrite(dir, attr);
        return CONTINUE;
    }

    @Override
    public FileVisitResult visitFile(Path file, BasicFileAttributes attr)
            throws IOException {
        populateAndWrite(file, attr);
        return CONTINUE;
    }

    @Override
    public FileVisitResult visitFileFailed(Path file, IOException exc) {
        // You should determine if you need this method or not
        return CONTINUE;
    }

    // Fills a ScraperOutputVO from the path and its attributes, then writes one CSV row.
    // An IOException is not caught here: it propagates out of Files.walkFileTree
    // so the caller can decide how to handle it.
    private void populateAndWrite(Path file, BasicFileAttributes attr) throws IOException {
        ScraperOutputVO so = new ScraperOutputVO();
        // getParent() and getFileName() return null for a root path
        if (file.getParent() != null) {
            so.setFilePath(file.getParent().toString());
        }
        if (file.getFileName() != null) {
            so.setFileName(file.getFileName().toString());
        }
        so.setFileType(getFileType(file.toAbsolutePath().toString()));
        if (attr != null) {
            so.setDateCreated(createDate(attr.creationTime()));
            so.setDateLastModified(createDate(attr.lastModifiedTime()));
            so.setDateLastAccessed(createDate(attr.lastAccessTime()));
            // The size is only recorded for non-directories
            if (!attr.isDirectory()) {
                so.setFileSize(String.valueOf(attr.size()));
            }
        }
        so.setAuthors(fileUtil.getOwner(file));
        so.setFolderLink(file.toAbsolutePath().toString());
        writeCsvFileDtl(so);
    }

    // Turns "2013-08-19T21:11:50Z" into "2013-08-19 21:11"
    private String createDate(FileTime time) {
        String timeStr = time.toString();
        return timeStr.substring(0, 10) + " " + timeStr.substring(11, 16);
    }

    // Appends one row to the CSV file; try-with-resources closes the writer
    private void writeCsvFileDtl(ScraperOutputVO scraperOutputVO) throws IOException {
        try (FileWriter writer = new FileWriter(extractFile, true)) {
            writer.append(scraperOutputVO.getFilePath());
            writer.append(',');
            writer.append(scraperOutputVO.getFileName());
            writer.append(',');
            writer.append(scraperOutputVO.getFileType());
            writer.append(',');
            writer.append(scraperOutputVO.getDateCreated());
            writer.append(',');
            writer.append(scraperOutputVO.getDateLastModified());
            writer.append(',');
            writer.append(scraperOutputVO.getDateLastAccessed());
            writer.append(',');
            writer.append(scraperOutputVO.getFileSize());
            writer.append(',');
            writer.append(scraperOutputVO.getAuthors());
            writer.append(',');
            writer.append(scraperOutputVO.getFolderLink());
            writer.append('\n');
        }
    }
}

The method populateAndWrite is used in preVisitDirectory and visitFile. Basically, it populates each attribute of your ScraperOutputVO object and then sends it to the write method. I'm not sure whether you want to list directories; if you don't, just remove preVisitDirectory. I've added some null checks, since getParent and getFileName can return null if you start at the root directory.
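To illustrate why those null checks matter, here is a minimal sketch. The class RootPathCheck and its two helper methods are made up for this example and are not part of the answer's Walker class; the sketch only relies on Path.getFileName() and Path.getParent() returning null for a root path.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class RootPathCheck {

    // Returns the last name element as a string, or "" when the path has none
    // (which is the case for a root such as "/")
    public static String nameOrEmpty(Path p) {
        Path name = p.getFileName();
        return name == null ? "" : name.toString();
    }

    // Returns the parent as a string, or "" when the path has no parent
    public static String parentOrEmpty(Path p) {
        Path parent = p.getParent();
        return parent == null ? "" : parent.toString();
    }

    public static void main(String[] args) {
        Path root = Paths.get("/");
        // Both calls would throw a NullPointerException without the null checks
        System.out.println("name=[" + nameOrEmpty(root) + "]");
        System.out.println("parent=[" + parentOrEmpty(root) + "]");
    }
}
```

Whether an empty string is the right placeholder in the CSV is a design choice; the Walker class above simply leaves the field unset.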

You may need to tweak some attributes, because I didn't have access to your fileUtil and getFileType, so you should test to make sure you get the same values.

To launch the class, you simply need to do something like this:

// Needs java.io.IOException, java.nio.file.Files, java.nio.file.Path and java.nio.file.Paths
public static void main(String[] args) {
    Path root = Paths.get("Path to your directory");
    Walker walker = new Walker("Name of your csv file");
    try {
        Files.walkFileTree(root, walker);
    } catch (IOException e) {
        // you should handle the exception here
        // log.info("Problem walking the directory")
        e.printStackTrace();
    }
}

If you need more help just comment on the answer.

Edit: I found another improvement that lowers the execution time. The problem is that every time you write to the file, you create a new FileWriter, open the file, write to it and then close it. So for each file you hit, you open and close the output file, which is a major drag on performance. Using your current writeCsvFileDtl, it took me about 170 seconds to write 100 000 entries. With the FileWriter left open, writing took less than one second. You should do something like the following; it is not very elegant (you could probably do something more readable or cleaner), but it will improve performance.

import java.io.FileWriter;
import java.io.IOException;

public class Writer {

    // Shared writer: opened once, reused for every row, closed at the end
    static FileWriter writer = null;

    public static void openFileWriter() {
        try {
            writer = new FileWriter("C:/dev/file.txt", true);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Despite its name, this method reuses the already open writer;
    // call openFileWriter() once before the first row.
    public static void writeToTheFileWithOpenEachTime() {
        try {
            writer.append("firstColumn");
            writer.append(',');
            writer.append("secondColumn");
            writer.append(',');
            writer.append("thirdColumn");
            writer.append(',');
            writer.append("fourthColumn");
            writer.append(',');
            writer.append("fifthColumn");
            writer.append(',');
            writer.append("sixthColumn");
            writer.append(',');
            writer.append("seventhColumn");
            writer.append(',');
            writer.append("eighthColumn");
            writer.append(',');
            writer.append("ninthColumn");
            writer.append('\n');
            writer.flush();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Call once after the last row has been written
    public static void closeWriter() {
        try {
            writer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
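One way to combine the file walk with a writer that stays open could look like the sketch below. It is only an illustration under assumptions: the class name CsvWalk, the method export and the reduced two-column output (path and size) are invented here and are not the original poster's format. It uses Files.newBufferedWriter and try-with-resources, both available in Java 7.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class CsvWalk {

    // Walks root and writes one "path,size" line per entry passed to visitFile.
    // Returns the number of rows written. The csv file should live outside root,
    // otherwise the walk will also visit the output file itself.
    public static long export(Path root, Path csv) throws IOException {
        final long[] rows = {0};
        // The writer is opened once here and closed once after the whole walk
        try (final BufferedWriter out = Files.newBufferedWriter(csv, StandardCharsets.UTF_8)) {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                        throws IOException {
                    out.write(file.toString());
                    out.write(',');
                    out.write(Long.toString(attrs.size()));
                    out.newLine();
                    rows[0]++;
                    return FileVisitResult.CONTINUE;
                }
            });
        }
        return rows[0];
    }
}
```

A call such as CsvWalk.export(Paths.get("some/dir"), Paths.get("out.csv")) would then replace both the static Writer helper and the per-row FileWriter. Because the BufferedWriter is not flushed after every row, rows only reach the disk when the buffer fills or the writer is closed.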

unholysampler · Answer 2 · 2013-08-19 21:11:50Z

You are continually accessing the array for the current item, but never using the index for anything else. This is exactly what for-each loops are for.

for (File f : listOfFiles) {
    // use f instead of listOfFiles[i]
}

The following code is full of repetition:

if(basicAttribs != null) {
    so.setDateCreated(basicAttribs.creationTime().toString().substring(0, 10) + " " + basicAttribs.creationTime().toString().substring(11, 16));
    so.setDateLastModified(basicAttribs.lastModifiedTime().toString().substring(0, 10) + " " + basicAttribs.lastModifiedTime().toString().substring(11, 16));
    so.setDateLastAccessed(basicAttribs.lastAccessTime().toString().substring(0, 10) + " " + basicAttribs.lastAccessTime().toString().substring(11, 16));
}

Instead, make a function to perform the repeated action.

// Returns "yyyy-MM-dd HH:mm" cut out of the FileTime's ISO 8601 string
public String createDate(FileTime time) {
    String timeStr = time.toString();
    return timeStr.substring(0, 10) + " " + timeStr.substring(11, 16);
}

There should also be a way to create a Date from a FileTime, which would let you use SimpleDateFormat instead of doing the string manipulation by hand.
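A possible version of that idea is sketched below, using FileTime.toMillis() to build a java.util.Date. The class name FileTimeFormat is made up for the example, and the UTC time zone is an assumption chosen so that the result lines up with the substring approach, since FileTime.toString() prints UTC.

```java
import java.nio.file.attribute.FileTime;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class FileTimeFormat {

    // Formats a FileTime as "yyyy-MM-dd HH:mm" in UTC
    public static String format(FileTime time) {
        // SimpleDateFormat is not thread-safe, so a new instance is created per call
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(time.toMillis()));
    }

    public static void main(String[] args) {
        // prints "1970-01-01 00:00"
        System.out.println(format(FileTime.fromMillis(0L)));
    }
}
```

For millions of files, creating a formatter per call adds some overhead; in a single-threaded walk one instance could be kept in a field and reused.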

Leo · Answer 3 · 2014-11-29 19:12:08Z

I know this is an old post, but something you may want to consider in order to speed up the suggested solution is to avoid (if you can) the FOLLOW_LINKS option in the code below; this improved performance quite significantly in an app I've recently developed:

EnumSet<FileVisitOption> opts = EnumSet.of(FileVisitOption.FOLLOW_LINKS);
...
Files.walkFileTree(path, opts, Integer.MAX_VALUE, visitor);

If you don't need to follow links, you can simply call the simpler version of walkFileTree:

Files.walkFileTree(path, visitor);

This calls the following under the covers:

Files.walkFileTree(path, EnumSet.noneOf(FileVisitOption.class), Integer.MAX_VALUE, visitor);
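If following links has to stay configurable, the option set could be built from a flag, roughly as in this sketch. LinkAwareCount, options and count are hypothetical names used only for illustration; the visitor here just counts entries instead of writing CSV rows.

```java
import java.io.IOException;
import java.nio.file.FileVisitOption;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.EnumSet;
import java.util.Set;

public class LinkAwareCount {

    // Builds the option set: empty unless the caller explicitly asks to follow links
    public static Set<FileVisitOption> options(boolean followLinks) {
        return followLinks
                ? EnumSet.of(FileVisitOption.FOLLOW_LINKS)
                : EnumSet.noneOf(FileVisitOption.class);
    }

    // Counts the entries passed to visitFile under root
    public static long count(Path root, boolean followLinks) throws IOException {
        final long[] n = {0};
        Files.walkFileTree(root, options(followLinks), Integer.MAX_VALUE,
                new SimpleFileVisitor<Path>() {
                    @Override
                    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                        n[0]++;
                        return FileVisitResult.CONTINUE;
                    }
                });
        return n[0];
    }
}
```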
