Beware. This page was up to date as of March 1995. Now I'm more interested in approximation algorithms for the TSP. Until I write up my thoughts on that field, I'll keep this page around. (I'm getting a degree doing research, not updating web pages.) Until then, visit my current research interests page.

Research interests

I'm looking into innovative ways of parallelizing branch and bound codes for the traveling salesman problem (TSP).

Here is the outline for the parallelization research.

Introduction

Conventional wisdom holds that parallelism should be exploited at the highest level, resulting in the coarsest grain computation and lowest synchronization costs. In the branch and bound framework, this means using one processor per subproblem node, but processing many nodes at once. Tarek Abdelrahman calls this algorithm-level parallelism. However, he has shown that naive implementation of this strategy on shared memory machines can lead to excessive contention and synchronization costs when accessing the central pool of nodes waiting to be processed.
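
To make the contention point concrete, here is a minimal sketch of a worker loop operating on a central node pool, assuming POSIX threads; Node, pool_pop_best, pool_push, and expand are hypothetical names, and termination is simplified. Every pop and push serializes on a single lock:

    /* Algorithm-level parallelism with a central node pool (sketch).
       Every access to the pool goes through one mutex -- the
       contention point described above.  Termination is simplified:
       a real code must distinguish "pool empty" from "search done". */
    #include <pthread.h>
    #include <stddef.h>

    typedef struct Node Node;          /* a branch and bound subproblem */

    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

    extern Node *pool_pop_best(void);  /* assumed: best-first removal */
    extern void  pool_push(Node *n);   /* assumed: insert a child node */
    extern int   expand(Node *n, Node *kids[], int max);  /* assumed */

    void *worker(void *arg)
    {
        Node *kids[2];
        int   i, k;
        for (;;) {
            Node *n;
            pthread_mutex_lock(&pool_lock);   /* all workers queue here */
            n = pool_pop_best();
            pthread_mutex_unlock(&pool_lock);
            if (n == NULL)
                break;
            k = expand(n, kids, 2);           /* bound, then branch */
            pthread_mutex_lock(&pool_lock);
            for (i = 0; i < k; i++)
                pool_push(kids[i]);
            pthread_mutex_unlock(&pool_lock);
        }
        return NULL;
    }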

I'm particularly interested in finding the best way to parallelize branch and bound codes that use Held and Karp's 1-tree relaxation. These codes are of interest because the relaxation is the basis for the most efficient sequential algorithms known for the symmetric TSP, according to Balas and Toth's survey in The Traveling Salesman Problem (Lawler et al., eds., 1985). It is always a good idea to start one's work from the best algorithm known so far.

Branch and bound algorithms based on 1-trees are also notable because the search trees they generate are very small. For instance, Smith and Thompson report that on problems of size 42, 48, 57, and 100 their algorithm generates 7, 7, 34, and 95 branch and bound search nodes, respectively. (Most of the time is spent computing many minimum 1-trees per branch and bound node.) Compare this with the thousands of nodes generated on a 48-city problem by the LMSK algorithm, as reported by Abdelrahman in 1994.

It therefore seems worthwhile to study the profitability of using many processors per branch and bound node. Abdelrahman calls this subproblem-level parallelism. I want to gain experience with exploiting parallelism at this level, especially in using the 1-tree framework where it appears to be most promising.

Held and Karp's 1-tree relaxation

So what is a 1-tree? Suppose we are given a graph G=(V,E) with n nodes, i.e., |V|=n. Then a 1-tree for G is a spanning tree of the subgraph of G induced by V\{1}, together with two edges of G incident upon vertex 1.

Notice that any tour of G qualifies as a 1-tree for G. It follows that the length of a shortest 1-tree of G is a lower bound on the length of a shortest tour of G. A minimum cost 1-tree for G can be found by computing a minimum spanning tree of the subgraph induced by V\{1} and then scanning for the two shortest edges incident upon vertex 1. There are efficient algorithms for both steps. This is the basis for the use of the 1-tree relaxation in branch and bound codes for the TSP.
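
As a concrete illustration, here is a sketch of that computation in C, assuming a complete graph whose costs come from a function cost(i,j); vertex 0 below plays the role of vertex 1 in the text, and MAXN is an arbitrary bound:

    #include <float.h>

    #define MAXN 1024

    extern double cost(int i, int j);  /* assumed: symmetric edge costs */

    /* Returns the 1-tree cost; parent[v] records the MST edge
       (v, parent[v]) for v = 2..n-1.  Vertex 0 is the special vertex;
       the spanning tree is built over vertices 1..n-1. */
    double one_tree(int n, int parent[])
    {
        double key[MAXN];     /* cheapest known connection into the tree */
        int    in_tree[MAXN];
        int    i, v, best;
        double total = 0.0, c, e1 = DBL_MAX, e2 = DBL_MAX;

        /* Prim's algorithm over V\{0}, seeded with vertex 1.
           Linear extra storage; each edge is relaxed exactly once. */
        for (v = 2; v < n; v++) {
            key[v] = cost(1, v);
            parent[v] = 1;
            in_tree[v] = 0;
        }
        in_tree[1] = 1;
        for (i = 0; i < n - 2; i++) {       /* grow the tree by one vertex */
            best = -1;
            for (v = 2; v < n; v++)
                if (!in_tree[v] && (best < 0 || key[v] < key[best]))
                    best = v;
            in_tree[best] = 1;
            total += key[best];
            for (v = 2; v < n; v++)         /* relax edges at the new vertex */
                if (!in_tree[v] && (c = cost(best, v)) < key[v]) {
                    key[v] = c;
                    parent[v] = best;
                }
        }

        /* The two shortest edges incident upon vertex 0 complete the 1-tree. */
        for (v = 1; v < n; v++) {
            c = cost(0, v);
            if (c < e1)      { e2 = e1; e1 = c; }
            else if (c < e2)   e2 = c;
        }
        return total + e1 + e2;
    }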

Lagrangean cost function

When used in a branch and bound framework, it pays to have as tight a lower bounding function as possible. Pearl has shown that a linear increase in the error of the lower bounding function leads to an exponential increase in the number of subproblems generated.

Unfortunately, the length of the minimum 1-tree of a graph is on average only about 60% of the length of the shortest tour of that graph. How can we improve on this? The trick is to use Lagrange multipliers to incorporate some degree constraints. That is, we can add a cost to vertices that have either too many or too few edges.

Instead of computing the minimum 1-tree using the plain cost function c(i,j), use c(i,j) + l(i) + l(j), where l is some real-valued function on the vertices. With some rearrangement, and noting that every vertex in a tour is of degree 2, we can show that for any choice of l,

    L(l) = [minimum 1-tree cost under c(i,j)+l(i)+l(j)] - 2 sum_{i in V} l(i) <= minimum tour cost.

(Sorry, but HTML isn't TeX.)
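
Spelled out (in TeX notation, which this page can't render), the rearrangement is short. For any tour T, every vertex has degree 2, so each l(i) is paid exactly twice:

    \sum_{(i,j) \in T} ( c(i,j) + l(i) + l(j) )  =  c(T) + 2 \sum_{i \in V} l(i).

Since every tour is a 1-tree, the minimum 1-tree cost under the modified costs is at most the left-hand side; subtracting 2 \sum_{i \in V} l(i) from both sides gives L(l) <= c(T) for every tour T, and in particular for the shortest one.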

Given the right choice of l, this Lagrangean lower bound consistently achieves 99.5% of the minimum tour cost on symmetric instances, and is the basis for the most successful algorithms for the symmetric TSP (up to 1985, according to Balas and Toth).

We use what is called a subgradient optimization method to compute the n-vector l. This is basically a numerical hill-climbing procedure in n dimensions. Some theory has been developed to aid in the choice of step sizes, though there are still some parameters that must be chosen; this is still a black art.
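
For concreteness, here is a bare-bones version of the ascent, assuming a routine one_tree_deg that returns the minimum 1-tree cost under the modified costs c(i,j)+l[i]+l[j] and fills deg[] with each vertex's 1-tree degree; the step-size schedule shown (periodic halving) is one common heuristic, not necessarily the one my code uses:

    #include <float.h>

    #define MAXN 1024

    /* Assumed: minimum 1-tree cost under c(i,j)+l[i]+l[j], with
       deg[v] set to vertex v's degree in that 1-tree. */
    extern double one_tree_deg(int n, const double l[], int deg[]);

    double held_karp_bound(int n, int iters)
    {
        double l[MAXN] = { 0.0 };      /* Lagrange multipliers, all zero */
        int    deg[MAXN];
        double best = -DBL_MAX, t = 1.0;
        int    it, v;

        for (it = 0; it < iters; it++) {
            double sum_l = 0.0, bound;
            for (v = 0; v < n; v++)
                sum_l += l[v];
            bound = one_tree_deg(n, l, deg) - 2.0 * sum_l;  /* L(l) */
            if (bound > best)
                best = bound;
            /* Hill-climb: raise l where degree > 2, lower it where < 2. */
            for (v = 0; v < n; v++)
                l[v] += t * (deg[v] - 2);
            if (it % 10 == 9)
                t *= 0.5;              /* step-size decay: the black art */
        }
        return best;
    }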

Why use Prim's algorithm?

The graphs that we compute MSTs for are complete or very nearly so. In such cases, Prim's algorithm is asymptotically optimal because it scans each edge exactly once and its storage requirements can be made linear in the number of vertices. See the MILES_SPAN program in Knuth's excellent The Stanford GraphBase, especially sections 69 and 70.

As suggested by Quinn's parallel programming text, I also implemented Sollin's algorithm. It suffers from two drawbacks. First, it is quite complex; the parallel version would require much fine-grained locking and synchronization. Second, it turns out to be much slower than Prim's algorithm on the (nearly) complete graphs that we encounter. (See my research log book for timing comparisons.)

[Perhaps insert descriptions of Prim's and Sollin's algorithm.]

Computational experience

Minimum spanning trees

I have tested code for minimum spanning trees using Prim's algorithm, targeted at KSR shared-memory parallel machines. I have modified the algorithm so that certain edges can be required to appear in the 1-tree and others forbidden from it. These are called fixed-in and fixed-out edges, respectively.
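
One standard way to impose such constraints, sketched below with hypothetical names (my implementation may differ in detail), is to overwrite edge costs before the spanning tree computation:

    /* Fixed-in / fixed-out edges via cost adjustment (sketch).
       FIXED_IN edges get a hugely negative cost so the MST must take
       them; FIXED_OUT edges get a hugely positive cost so it never
       will.  The artificial -BIG offsets must be backed out of the
       bound afterwards, restoring the true cost of fixed-in edges. */
    #define BIG 1.0e9

    enum edge_state { FREE, FIXED_IN, FIXED_OUT };

    extern enum edge_state state(int i, int j);  /* assumed lookup */
    extern double cost(int i, int j);

    double constrained_cost(int i, int j)
    {
        switch (state(i, j)) {
        case FIXED_IN:  return -BIG;   /* force the edge into the tree */
        case FIXED_OUT: return  BIG;   /* keep the edge out of the tree */
        default:        return cost(i, j);
        }
    }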

I have run experiments to determine the speedup for various problem sizes and numbers of processors. For small problems (e.g., 51 cities), there is no speedup over the sequential algorithm. For larger problems we get some speedup, the best being 60% of optimal speedup on a 3038-city problem using up to 10 processors, taking 8.6 seconds.

Branch and bound code for the TSP itself

I have completed a sequential version of the entire branch and bound algorithm. It solves instances to optimality in the following times:
Instance     Cities   Ascents     Task List Length    1-Trees    Time
                      Performed    Avg     Std Dev    Computed   (seconds)
---------    ------   ---------   ------   -------    --------   ---------
grid25           25          65     16.4       8.6         675         3.7
dantzig42        42         376     67.9      42.5        3635        31.0
hk48             48          13      8.1       3.6         242         2.8
eil51            51         126     92.1      43.6        1706        29.5
st70             70        1854   1294.4     722.8       30412       982.5
eil76            76         373    328.2     189.9        6998       260.5
kroA100         100        9473   2836.1    2004.1      161288     10078.7
eil101          101        1009    291.3      94.7       10028       713.6
lin105          105          26     21.7      11.1         722        54.7
randA50          50          29      3.5       2.6         240         3.0
randB50          50          22      4.3       2.9         257         3.0
randC50          50          23     12.7       7.8         448         5.0
randD50          50          32      6.4       6.0         518         5.9
randA100        100          80     58.7      37.0        2164        89.8
randB100        100         225    172.5      94.6        4987       194.4
randC100        100          36     29.8      17.1         877        33.7
randD100        100          19     16.3       8.6         573        23.9
randA150        150          42     37.1      20.5        1269       134.8
randB150        150          96     76.1      49.0        2872       273.1
randC150        150          46     43.1      24.7        1690       158.1
randD150        150          21     18.7      10.9         726        70.0
randA200        200          26     25.5      14.5         930       159.7
randB200        200         200    184.3     106.7        9157      1540.9
randC200        200          44     36.4      23.2        1962       332.3
randD200        200          30     29.6      16.8        1396       237.3
randA300        300          57     54.3      30.7        2677      1040.9
randB300        300          86     82.5      47.3        6333      2435.4
randC300        300          40     39.2      22.8        1760       699.4
randD300        300         113    107.7      61.1        5611      2162.2
randA1000      1000          90     89.1      51.8        7204     32579.5

Instances whose names begin with rand have edge lengths drawn from a uniform random distribution over [1,1000].

A posting to comp.theory by, and email from, Jay Herder (jherder at yahoo dot com) prompted an investigation into grid graphs. He was interested in approximate solutions on graphs with embedded grids, as part of a wiring problem. Instance grid25 is a 5 by 5 grid of points spaced 10 units apart. Arbitrary insertion found an optimal tour in this case, suggesting that it is a good heuristic for these kinds of graphs.
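
For reference, arbitrary insertion grows a subtour by taking the remaining cities in arbitrary order and splicing each one in where it lengthens the tour least; a minimal sketch, reusing the assumed cost(i,j) from above:

    #include <float.h>

    extern double cost(int i, int j);  /* assumed: symmetric edge costs */

    /* Arbitrary-insertion tour heuristic (sketch, assumes n >= 3).
       tour[] ends up holding a cyclic order of all n cities; the
       tour length is returned. */
    double arbitrary_insertion(int n, int tour[])
    {
        int    k, v, i, best_pos;
        double best_inc, inc, len;

        tour[0] = 0; tour[1] = 1; tour[2] = 2;  /* initial 3-city subtour */
        len = cost(0, 1) + cost(1, 2) + cost(2, 0);
        for (k = 3; k < n; k++) {
            v = k;                       /* next city, in arbitrary order */
            best_pos = 0; best_inc = DBL_MAX;
            for (i = 0; i < k; i++) {    /* try every edge of the subtour */
                int a = tour[i], b = tour[(i + 1) % k];
                inc = cost(a, v) + cost(v, b) - cost(a, b);
                if (inc < best_inc) { best_inc = inc; best_pos = i; }
            }
            /* splice v between tour[best_pos] and its successor */
            for (i = k; i > best_pos + 1; i--)
                tour[i] = tour[i - 1];
            tour[best_pos + 1] = v;
            len += best_inc;
        }
        return len;
    }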

All others were taken from the TSPLIB library of instances.

The compiler was GCC 2.5.8 with no optimization, under Linux 1.0.8. The machine was an Intel 486 DX2/66 with 256K 20ns cache and 16MB 70ns main memory. For comparison, this machine is about as fast as a single processor of a KSR-1, and about 3 times faster (for this task) than a SPARC IPC.

The program requires little memory: the program and data on instance eil51.tsp take up about 130 kilobytes on the 486, as reported by top(1) under Linux. Note that this fits entirely within the 256K cache, even though I exclusively use the double type for storage and manipulation of the Lagrangean multipliers and edge lengths.

Note, however, that on Euclidean instances, I do not store the entire cost matrix but instead recompute the cost function on the fly from the stored coordinates. This is done to reduce the working set size, especially for larger instances. I experimented with storing the edge costs but found that this slowed down the program by about 5% even on the relatively small 51 city problem. That is, table lookup was slower than recomputation.
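
The recomputation itself is trivial, which helps explain why it competes so well with a table lookup that may miss the cache; a sketch, assuming coordinate arrays x[] and y[]:

    #include <math.h>

    /* On-the-fly Euclidean edge cost from stored coordinates, instead
       of an n-by-n cost matrix: 2n doubles of storage rather than n*n.
       (TSPLIB's EUC_2D convention rounds this value to the nearest
       integer; I keep everything in doubles, as noted above.) */
    extern double x[], y[];    /* assumed: city coordinates */

    double euclid_cost(int i, int j)
    {
        double dx = x[i] - x[j];
        double dy = y[i] - y[j];
        return sqrt(dx * dx + dy * dy);
    }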

Things to do
