I have a large number of sometimes-massive Subversion repositories that I'm trying to scan to find the number of additions, deletions, and files changed per commit, similar to what GitHub displays. The diff operation in SVNKit is very slow, and combined with 1000-4000 entries on many branches across 750+ repositories, my application runs for a full 24 hours and barely makes a dent in the repositories that need processing. I've resorted to analyzing only commits after a certain date, but it's still phenomenally slow. Is there a way I could give this a significant boost?
Here is my code. It's rather large; hopefully someone is still able to offer advice. Execution starts when I call updateAuthorInfo(), and takes off from there.
private void updateAuthorInfo(final BranchInfo bi, final SVNRevision endA, final SVNRevision endB,
        final Date earliestDate) throws SVNException {
    LOGGER.info("Getting author information for branch {}", bi.getBranch());
    DiffWrapper information = getCommitsInfoForPath(bi.getBranch(), endA, endB);
    while (true) {
        for (final Commit commit : information.commits) {
            final CommitterInfo ai = bi.getAuthorInfo(commit.getCommitter(), "", commit.getCommitter(), "");
            ai.incrementAdditions(commit.getAdditions());
            ai.incrementDeletions(commit.getDeletions());
            ai.add(commit);
        }
        if ("".equals(information.source)) {
            break;
        }
        LOGGER.debug("Continuing to path: {}", information.source);
        information = getCommitsInfoForPath(information.source, endA, s(information.revSource));
    }
}
@SuppressWarnings("unchecked")
private DiffWrapper getCommitsInfoForPath(final String path, final SVNRevision endA, final SVNRevision endB)
        throws SVNException {
    final SVNRevision start = endA == null ? s(0L) : endA;
    Collection<SVNLogEntry> logEntries;
    if (theRepo.checkPath(path, endB.getNumber()) == SVNNodeKind.NONE) {
        logEntries = Lists.newArrayList();
        LOGGER.info("No history found on path {} for revision {}", path, endB.getNumber());
    } else {
        logEntries = theRepo.log(new String[] { path }, null, start.getNumber(), endB.getNumber(), true, true);
        LOGGER.info("Analyzing {} entries.", logEntries.size());
    }
    SVNLogEntry firstEntry = null;
    final SVNLogEntry firstPathEntry = getPathRootLog(path);
    final Set<Commit> commits = Sets.newHashSet();
    long currRev = 0L;
    String source = "";
    long revSource = 0L;
    for (final SVNLogEntry leEntry : logEntries) {
        if (leEntry == null) {
            continue;
        } else if (leEntry.getDate().before(earliestDate)) {
            break;
        }
        if (firstEntry == null) {
            firstEntry = leEntry;
        }
        LOGGER.debug("Revision {}", leEntry.getRevision());
        final long rev = leEntry.getRevision();
        final String author = leEntry.getAuthor();
        Commit commit = commitLogger.getCommit(rev);
        if (commit != null && !leEntry.equals(firstPathEntry)) {
            LOGGER.debug("Commit rev {} already exists in log file, skipping.", commit.getId());
        } else {
            final Diff diffs;
            if (leEntry.equals(firstEntry)) {
                source = getSource(firstEntry, path);
                if (Strings.isNullOrEmpty(source)) {
                    continue;
                }
                LOGGER.debug("Source is {}", source);
                revSource = firstEntry.getRevision() - 1L;
                diffs = compareRevisions(source, path, s(revSource), s(rev));
            } else if (leEntry.getRevision() != 0L) {
                diffs = compareRevisions(path, null, s(leEntry.getRevision() - 1), s(leEntry.getRevision()));
            } else {
                // Revision 0 has nothing to diff against; skip it rather than
                // dereferencing a null Diff below (the original assigned null here
                // and then NPE'd on diffs.additions).
                continue;
            }
            LOGGER.debug("Differences calculated with {} additions, {} deletions, and {} files changed",
                    diffs.additions, diffs.deletions, diffs.changedFiles);
            commit = new Commit(Long.toString(leEntry.getRevision()), leEntry.getDate(), diffs.changedFiles,
                    diffs.additions, diffs.deletions, false, leEntry.getMessage().replace("\n", " "));
            commit.setCommitter(author);
            commit.setAuthor(author);
            commitLogger.addCommitToJsonLog(commit);
        }
        commits.add(commit);
        currRev++;
        if (currRev % 100 == 0) {
            LOGGER.info("{}/{} entries processed", currRev, logEntries.size());
        }
    }
    LOGGER.info("All {} log entries processed", currRev);
    return new DiffWrapper(commits, source, revSource);
}
/**
 * Finds the first log entry for a path, i.e. where the branch started.
 *
 * @param path the repository path to trace back
 * @return the oldest log entry for the path, or {@code null} if the path no longer exists
 */
@SuppressWarnings("unchecked")
private SVNLogEntry getPathRootLog(final String path) {
    LOGGER.debug("Root log for path {}", path);
    try {
        final Collection<SVNLogEntry> logEntries = theRepo.log(new String[] { path }, null, 0L,
                theRepo.getLatestRevision(), true, true);
        for (final SVNLogEntry leEntry : logEntries) {
            return leEntry;
        }
    } catch (final SVNException e) {
        LOGGER.trace("Path doesn't exist", e);
        LOGGER.debug("Can't trace back any farther, path probably no longer exists");
    }
    return null;
}
private String getSource(final SVNLogEntry leEntry, final String path) {
    String temp = "";
    // Stop at trunk
    if (TRUNK.equalsIgnoreCase(path)) {
        LOGGER.debug("We're at the trunk");
        return TRUNK;
    }
    for (final Entry<String, SVNLogEntryPath> entry : leEntry.getChangedPaths().entrySet()) {
        LOGGER.debug("{}", entry.getValue());
        if (entry.getValue().getCopyPath() == null || entry.getValue().getKind() != SVNNodeKind.DIR) {
            continue;
        }
        temp = entry.getValue().getPath().replace(path, "");
        if (!"/".equals(temp)) {
            temp = entry.getValue().getCopyPath().replace(temp, "");
        } else {
            temp = entry.getValue().getCopyPath();
        }
    }
    return temp;
}
private void doDiff(final String branch, final String branch2, final SVNRevision rev1, final SVNRevision rev2,
        final OutputStream baos) throws SVNException {
    final SVNDiffClient diffs = new SVNDiffClient(authManager, null);
    final SVNURL url1 = theRepo.getLocation().appendPath(branch, true);
    final SVNURL url2 = theRepo.getLocation().appendPath(branch2, true);
    if (branch2 == null) {
        diffs.doDiff(url1, rev2, rev1, rev2, SVNDepth.INFINITY, true, baos);
    } else {
        diffs.doDiff(url1, rev1, url2, rev2, SVNDepth.INFINITY, true, baos);
    }
}
1 Answer
I'm not familiar with SVNKit, but when I've worked with slow APIs I've used threading. Say the API returns a response in 250 milliseconds:

--our call--> slow api <-- 250 ms

Single-threaded code does things in sequence, so 3000 operations at 250 ms each look like:

---> 250 ms --> 250 ms --> 250 ms --> 250 ms

Done with multiple threads, they overlap:

---> 250 ms
---> 250 ms
.... n threads

The elapsed time is the same, but you get [thread count] times the results; in my case that was 50-100 times as many in the same timeframe. As others suggested, measure first to find the slowest part: there may be one expensive operation, or a call that returns more data than you need and therefore takes longer than it should. Then work out what resources you have and how many concurrent calls you can make. Are the repositories on different servers? Can you make parallel calls against them? Among the commands available, pick the least expensive. Not very specific advice, but the general idea might be useful to you.
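The fan-out idea can be sketched with a plain `java.util.concurrent` pool. This is a minimal illustration, not the asker's code: `analyzeBranch()` is a hypothetical stand-in for the expensive `getCommitsInfoForPath()` call, and the thread count is arbitrary. Note that an SVNKit `SVNRepository` session is, as far as I know, not safe for concurrent use, so each worker task would need to open its own connection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScan {

    // Hypothetical stand-in for the expensive per-branch work; a real version
    // would create its own SVNRepository session and return a DiffWrapper.
    static int analyzeBranch(String branch) {
        return branch.length(); // placeholder "result"
    }

    // Fans the branches out over a fixed-size pool and sums the results.
    static int parallelTotal(List<String> branches, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String b : branches) {
                futures.add(pool.submit((Callable<Integer>) () -> analyzeBranch(b)));
            }
            int total = 0;
            for (Future<Integer> f : futures) {
                total += f.get(); // blocks; rethrows any task failure
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parallelTotal(List.of("trunk", "branches/a", "branches/b"), 8));
    }
}
```

The pool size should be tuned to what the Subversion server can tolerate; past a point, more threads just queue up on the server side.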
Comments:

Running svn log --diff and then parsing the output has been the fastest way of parsing svn logs I've experienced so far.

If you only need per-file change information, it is already in the log entries via SVNLogEntry.getChangedPaths() -> SVNLogEntryPath.getType(). If you do need line-based info per revision (not per file), the fastest way would probably be to count the number of occurrences of - and + in the output of svn diff -x -U0 -c <revisionNumber> <repositoryPath>.
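The counting approach from the last comment is straightforward to sketch. This assumes the output of `svn diff -x -U0 -c <rev> <path>` has already been captured into a string (for example via `ProcessBuilder`); the class and method names here are made up for illustration.

```java
public class DiffCounter {

    // Counts added/deleted lines in unified-diff text.
    // Lines starting with "+++" / "---" are file headers, not content,
    // so they must be skipped before testing the one-character prefixes.
    public static int[] countChanges(String diffText) {
        int additions = 0;
        int deletions = 0;
        for (String line : diffText.split("\n")) {
            if (line.startsWith("+++") || line.startsWith("---")) {
                continue;
            }
            if (line.startsWith("+")) {
                additions++;
            } else if (line.startsWith("-")) {
                deletions++;
            }
        }
        return new int[] { additions, deletions };
    }

    public static void main(String[] args) {
        String diff = "--- foo.txt\t(revision 1)\n"
                + "+++ foo.txt\t(revision 2)\n"
                + "@@ -1,0 +1,2 @@\n"
                + "+hello\n"
                + "+world\n"
                + "@@ -5 +6,0 @@\n"
                + "-gone\n";
        int[] counts = countChanges(diff);
        System.out.println(counts[0] + " additions, " + counts[1] + " deletions");
    }
}
```

Because `-U0` suppresses context lines, every non-header `+`/`-` line is a real change, which keeps the counting trivial.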