I have a large number of sometimes-massive Subversion repositories that I'm trying to scan to find the number of additions, deletions, and files changed per commit, similar to what GitHub displays. The diff operation in SVNKit is very slow, and combined with 1000-4000 entries on many branches across 750+ repositories, my application runs for a full 24 hours and barely makes a dent in the repositories that need processing. I've resorted to analyzing only commits after a certain date, but it's still phenomenally slow. Is there a way I could give this a significant boost?
Here is my code. It's rather large; hopefully someone is still able to offer advice. Execution starts when I call updateAuthorInfo(), and takes off from there.
private void updateAuthorInfo(final BranchInfo bi, final SVNRevision endA, final SVNRevision endB,
        final Date earliestDate) throws SVNException {
    LOGGER.info("Getting author information for branch {}", bi.getBranch());
    DiffWrapper information = getCommitsInfoForPath(bi.getBranch(), endA, endB);
    while (true) {
        for (final Commit commit : information.commits) {
            final CommitterInfo ai = bi.getAuthorInfo(commit.getCommitter(), "", commit.getCommitter(), "");
            ai.incrementAdditions(commit.getAdditions());
            ai.incrementDeletions(commit.getDeletions());
            ai.add(commit);
        }
        if ("".equals(information.source)) {
            break;
        }
        LOGGER.debug("Continuing to path: {}", information.source);
        information = getCommitsInfoForPath(information.source, endA, s(information.revSource));
    }
}
@SuppressWarnings("unchecked")
private DiffWrapper getCommitsInfoForPath(final String path, final SVNRevision endA, final SVNRevision endB)
        throws SVNException {
    final SVNRevision start = endA == null ? s(0L) : endA;
    Collection<SVNLogEntry> logEntries;
    if (theRepo.checkPath(path, endB.getNumber()) == SVNNodeKind.NONE) {
        logEntries = Lists.newArrayList();
        LOGGER.info("No history found on path {} for revision {}", path, endB.getNumber());
    } else {
        logEntries = theRepo.log(new String[] { path }, null, start.getNumber(), endB.getNumber(), true, true);
        LOGGER.info("Analyzing {} entries.", logEntries.size());
    }
    SVNLogEntry firstEntry = null;
    final SVNLogEntry firstPathEntry = getPathRootLog(path);
    final Set<Commit> commits = Sets.newHashSet();
    long currRev = 0L;
    String source = "";
    long revSource = 0L;
    for (final SVNLogEntry leEntry : logEntries) {
        if (leEntry == null) {
            continue;
        } else if (leEntry.getDate().before(earliestDate)) {
            break;
        }
        if (firstEntry == null) {
            firstEntry = leEntry;
        }
        LOGGER.debug("Revision {}", leEntry.getRevision());
        final long rev = leEntry.getRevision();
        final String author = leEntry.getAuthor();
        Commit commit = commitLogger.getCommit(rev);
        if (commit != null && !leEntry.equals(firstPathEntry)) {
            LOGGER.debug("Commit rev {} already exists in log file, skipping.", commit.getId());
        } else {
            final Diff diffs;
            if (leEntry.equals(firstEntry)) {
                source = getSource(firstEntry, path);
                if (Strings.isNullOrEmpty(source)) {
                    continue;
                }
                LOGGER.debug("Source is {}", source);
                revSource = firstEntry.getRevision() - 1L;
                diffs = compareRevisions(source, path, s(revSource), s(rev));
            } else if (leEntry.getRevision() != 0L) {
                diffs = compareRevisions(path, null, s(leEntry.getRevision() - 1), s(leEntry.getRevision()));
            } else {
                // Revision 0 has nothing to diff against; skip it rather than
                // dereferencing a null Diff below (the original assigned null here
                // and then NPE'd on diffs.additions).
                continue;
            }
            LOGGER.debug("Differences calculated with {} additions, {} deletions, and {} files changed",
                    diffs.additions, diffs.deletions, diffs.changedFiles);
            commit = new Commit(Long.toString(leEntry.getRevision()), leEntry.getDate(), diffs.changedFiles,
                    diffs.additions, diffs.deletions, false, leEntry.getMessage().replace("\n", " "));
            commit.setCommitter(author);
            commit.setAuthor(author);
            commitLogger.addCommitToJsonLog(commit);
        }
        commits.add(commit);
        currRev++;
        if (currRev % 100 == 0) {
            LOGGER.info("{}/{} entries processed", currRev, logEntries.size());
        }
    }
    LOGGER.info("All {} log entries processed", currRev);
    return new DiffWrapper(commits, source, revSource);
}
/**
 * Finds the first log entry for a path, i.e. where the branch started.
 *
 * @param path the repository path to trace back
 * @return the oldest log entry for the path, or {@code null} if the path no longer exists
 */
@SuppressWarnings("unchecked")
private SVNLogEntry getPathRootLog(final String path) {
    LOGGER.debug("Root log for path {}", path);
    try {
        final Collection<SVNLogEntry> logEntries = theRepo.log(new String[] { path }, null, 0L,
                theRepo.getLatestRevision(), true, true);
        for (final SVNLogEntry leEntry : logEntries) {
            return leEntry;
        }
    } catch (final SVNException e) {
        LOGGER.trace("Path doesn't exist", e);
        LOGGER.debug("Can't trace back any farther, path probably no longer exists");
    }
    return null;
}
private String getSource(final SVNLogEntry leEntry, final String path) {
    String temp = "";
    // Stop at trunk
    if (TRUNK.equalsIgnoreCase(path)) {
        LOGGER.debug("We're at the trunk");
        return TRUNK;
    }
    for (final Entry<String, SVNLogEntryPath> entry : leEntry.getChangedPaths().entrySet()) {
        LOGGER.debug("{}", entry.getValue());
        if (entry.getValue().getCopyPath() == null || entry.getValue().getKind() != SVNNodeKind.DIR) {
            continue;
        }
        temp = entry.getValue().getPath().replace(path, "");
        if (!"/".equals(temp)) {
            temp = entry.getValue().getCopyPath().replace(temp, "");
        } else {
            temp = entry.getValue().getCopyPath();
        }
    }
    return temp;
}
private void doDiff(final String branch, final String branch2, final SVNRevision rev1, final SVNRevision rev2,
        final OutputStream baos) throws SVNException {
    final SVNDiffClient diffs = new SVNDiffClient(authManager, null);
    final SVNURL url1 = theRepo.getLocation().appendPath(branch, true);
    final SVNURL url2 = theRepo.getLocation().appendPath(branch2, true);
    if (branch2 == null) {
        diffs.doDiff(url1, rev2, rev1, rev2, SVNDepth.INFINITY, true, baos);
    } else {
        diffs.doDiff(url1, rev1, url2, rev2, SVNDepth.INFINITY, true, baos);
    }
}
1 Answer
I'm not familiar with SVNKit, but when I've worked with slow APIs I've used threading. Say the API returns a response in 250 milliseconds:

--our call--> slow api <-- 250 ms

Single-threaded code does things in sequence, so 3000 operations at 250 ms each look like:

---> 250 ms --> 250 ms --> 250 ms --> 250 ms

Done with multiple threads, they overlap:

---> 250 ms
---> 250 ms
.... n threads

The elapsed time is the same, but you get [thread count] times the results; in my case that was 50-100 times as many in the same timeframe. As others suggested, measure first to find the slowest part: there may be one expensive operation, or a call that returns more data than you need and therefore takes longer than it should. Then work out what resources you have and how many concurrent calls you can make. Are the repositories on different servers? Can you make parallel calls against them? Among the commands available, pick the least expensive. Not very specific advice, but the general idea might be useful to you.
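The fan-out idea can be sketched with a plain `java.util.concurrent` pool. This is a minimal illustration, not the asker's code: `analyzeBranch()` is a hypothetical stand-in for the expensive `getCommitsInfoForPath()` call, and the thread count is arbitrary. Note that an SVNKit `SVNRepository` session is, as far as I know, not safe for concurrent use, so each worker task would need to open its own connection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScan {

    // Hypothetical stand-in for the expensive per-branch work; a real version
    // would create its own SVNRepository session and return a DiffWrapper.
    static int analyzeBranch(String branch) {
        return branch.length(); // placeholder "result"
    }

    // Fans the branches out over a fixed-size pool and sums the results.
    static int parallelTotal(List<String> branches, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String b : branches) {
                futures.add(pool.submit((Callable<Integer>) () -> analyzeBranch(b)));
            }
            int total = 0;
            for (Future<Integer> f : futures) {
                total += f.get(); // blocks; rethrows any task failure
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parallelTotal(List.of("trunk", "branches/a", "branches/b"), 8));
    }
}
```

The pool size should be tuned to what the Subversion server can tolerate; past a point, more threads just queue up on the server side.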
Comments:

Running svn log --diff and then parsing the output has been the fastest way of parsing svn logs I've experienced so far.

If you only need per-file change information, it is already in the log entries via SVNLogEntry.getChangedPaths() -> SVNLogEntryPath.getType(). If you do need line-based info per revision (not per file), the fastest way would probably be to count the number of occurrences of - and + in the output of svn diff -x -U0 -c <revisionNumber> <repositoryPath>.
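The counting approach from the last comment is straightforward to sketch. This assumes the output of `svn diff -x -U0 -c <rev> <path>` has already been captured into a string (for example via `ProcessBuilder`); the class and method names here are made up for illustration.

```java
public class DiffCounter {

    // Counts added/deleted lines in unified-diff text.
    // Lines starting with "+++" / "---" are file headers, not content,
    // so they must be skipped before testing the one-character prefixes.
    public static int[] countChanges(String diffText) {
        int additions = 0;
        int deletions = 0;
        for (String line : diffText.split("\n")) {
            if (line.startsWith("+++") || line.startsWith("---")) {
                continue;
            }
            if (line.startsWith("+")) {
                additions++;
            } else if (line.startsWith("-")) {
                deletions++;
            }
        }
        return new int[] { additions, deletions };
    }

    public static void main(String[] args) {
        String diff = "--- foo.txt\t(revision 1)\n"
                + "+++ foo.txt\t(revision 2)\n"
                + "@@ -1,0 +1,2 @@\n"
                + "+hello\n"
                + "+world\n"
                + "@@ -5 +6,0 @@\n"
                + "-gone\n";
        int[] counts = countChanges(diff);
        System.out.println(counts[0] + " additions, " + counts[1] + " deletions");
    }
}
```

Because `-U0` suppresses context lines, every non-header `+`/`-` line is a real change, which keeps the counting trivial.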