Efficiently calculating differences between files using a diff file

Question 1

I'm using SVNKit to get diff information between two revisions. Its diff utility generates a diff file, but I still need to parse that output into numbers (files changed, additions, deletions).

I implemented a solution, but it is rather slow. JGit does something similar, but it parses the values itself and returns an object rather than an output stream, and it is much faster. I was unable to determine how to leverage that for SVNKit, so I attempted the following solution:

private Diff compareRevisions(final SVNRevision rev1, final SVNRevision rev2) throws SVNException {
    final Diff diff = new Diff();
    try (final ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
        doDiff(rev1, rev2, baos);
        int filesChanged = 0;
        int additions = 0;
        int deletions = 0;
        final String[] lines = baos.toString().split("\n");
        for (final String line : lines) {
            if (line.startsWith("---")) {
                filesChanged++;
            } else if (line.startsWith("+++")) {
                // No action needed
            } else if (line.startsWith("+")) {
                additions++;
            } else if (line.startsWith("-")) {
                deletions++;
            }
        }
        diff.additions = additions;
        diff.deletions = deletions;
        diff.changedFiles = filesChanged;
        return diff;
    } catch (final IOException e) {
        LOGGER.trace("Could not close stream", e);
        return diff;
    }
}

I've taken to caching the values in files to save time, but ideally I'd like to speed up the calculation itself. Perhaps I could use external programs?

Roland Illig · Accepted Answer · 2016-07-24 14:20:59Z

You need to parse the patch file format correctly. Otherwise the next patch that deletes an SQL comment will confuse your program, as it looks like this:

--- old_file.sql
+++ new_file.sql
@@ -1,1 +1,1 @@
--- SQL comment
+SELECT * FROM table;

Your current code interprets the removed line as another changed file instead of counting it as a deletion.

The file format is explained here: http://www.gnu.org/software/diffutils/manual/html_node/Detailed-Unified.html
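
One way to avoid that misreading is to use the line counts in each `@@` hunk header: while a hunk still has lines outstanding, a line is classified by its first character only, so `--- SQL comment` counts as a deletion. The following is a minimal sketch under that assumption, not SVNKit's or diffparser's API. The class name `UnifiedDiffStats` and its fields are hypothetical, and it reads line by line instead of splitting one large string.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of SVNKit): counts changes in a unified diff. */
final class UnifiedDiffStats {
    // Hunk header "@@ -start[,count] +start[,count] @@"; an omitted count means 1.
    private static final Pattern HUNK_HEADER =
            Pattern.compile("@@ -\\d+(?:,(\\d+))? \\+\\d+(?:,(\\d+))? @@.*");

    int changedFiles;
    int additions;
    int deletions;

    static UnifiedDiffStats parse(BufferedReader reader) throws IOException {
        UnifiedDiffStats stats = new UnifiedDiffStats();
        int oldLeft = 0; // old-file lines still expected in the current hunk
        int newLeft = 0; // new-file lines still expected in the current hunk
        String line;
        while ((line = reader.readLine()) != null) {
            if (oldLeft > 0 || newLeft > 0) {
                // Inside a hunk only the first character decides the line type.
                if (line.startsWith("+")) {
                    stats.additions++;
                    newLeft--;
                } else if (line.startsWith("-")) {
                    stats.deletions++;
                    oldLeft--;
                } else if (!line.startsWith("\\")) {
                    // Context line; "\ No newline at end of file" is skipped.
                    oldLeft--;
                    newLeft--;
                }
            } else if (line.startsWith("--- ")) {
                // Outside a hunk, "--- " starts the header of a changed file.
                stats.changedFiles++;
            } else {
                Matcher m = HUNK_HEADER.matcher(line);
                if (m.matches()) {
                    oldLeft = m.group(1) == null ? 1 : Integer.parseInt(m.group(1));
                    newLeft = m.group(2) == null ? 1 : Integer.parseInt(m.group(2));
                }
            }
        }
        return stats;
    }

    /** Convenience overload for diffs that are already held in memory. */
    static UnifiedDiffStats parse(String diffText) {
        try (BufferedReader reader = new BufferedReader(new StringReader(diffText))) {
            return parse(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String patch = "--- old_file.sql\n"
                + "+++ new_file.sql\n"
                + "@@ -1,1 +1,1 @@\n"
                + "--- SQL comment\n"
                + "+SELECT * FROM table;\n";
        UnifiedDiffStats s = parse(patch);
        System.out.println(s.changedFiles + " file(s), +" + s.additions + " -" + s.deletions);
    }
}
```

With the code from the question, the reader could wrap the buffered bytes, for example `new BufferedReader(new InputStreamReader(new ByteArrayInputStream(baos.toByteArray()), StandardCharsets.UTF_8))`, assuming the diff is UTF-8 encoded. The sketch only counts what appears as text hunks; anything the diff reports without a `--- ` header or an `@@` hunk is ignored.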

Since there are other people who had the same problem, you could just build on their work instead of writing your own, e.g. https://github.com/thombergs/diffparser.

Comment: That does take comments into account, although it still seems a bit slower than my implementation. I'm guessing that since it is Java, it's hard to make it as fast as Perl might be.
