Memoization-based puzzle solution - sums in a triangle

Question 1

I am trying to solve a problem at Codechef. As I am new to programming, I am not familiar with dynamic programming. However, I do know the basics of recursion and memoization. Hence I implemented a solution to the above mentioned problem. The problem with my solution is that it exceeds the time limit given. As far as I can guess, the problem seems to be with my memoization logic. Please help me optimize the code.

#include <iostream>
#include <algorithm>
#include <vector>
#include <string>
using namespace std;
typedef vector<int> vector1D ;
typedef vector<vector1D > vector2D ; // Two Dimensional vector
int rownum;
int noOfCalls = 0;
vector2D a;
vector2D cache; // The cache to implement memoization
int solve (int i, int j)
{
 int t1,t2,total;
 if( i > rownum-1 ) {
 return 0;
 }
 else {
 if (cache[i][j] != 0 && ( i < 50) && ( j < 50)) { //for small values use memoization
 return cache[i][j];
 }
 t1 = solve(i+1,j);
 t2 = solve(i+1,j+1);
 total = max(t1,t2) + a[i][j];
 cache[i][j] = total;//store in cache
 }
 return total;
}
int main(int argc, char *argv[])
{
 int tries;
 cin>>tries;
 while(tries--) {
 cin>>rownum;
 int temp = rownum;
 for ( int i = 0 ; i < rownum ; i++) {
 a.push_back(vector1D(rownum,0));//initialize memory
 }
 for ( int j = 0; j < 10; j++) {
 cache.push_back(vector1D(10,0));//initialize cache
 }
 for ( int i = 0; i < rownum; i++) {
 for ( int j = 0 ; j <= i; j++) {
 cin>>a[i][j];
 }
 }
 cout<<solve(0,0)<<endl;//call the solving function
 cache.clear();
 }
 return 0;
}

Question 2

your solution overall looks Ok. You should extend the memoization for all values. Max no of lines is <=100 and 100X100 vector is Ok to have.

Question 3

General Comments.

I hate your use of global variables.

int rownum;
int noOfCalls = 0;
vector2D a;
vector2D cache; // The cache to implement memoization

I suppose it is an attempt at better performance.

The test for j here is redundant:

if (cache[i][j] != 0 && ( i < 50) && ( j < 50))

If i is smaller than 50 then j can not be greater than 50. If i is greater than 50 then the test will fail and there is no need to test what the value of j is.

Bugs

 for ( int i = 0 ; i < rownum ; i++) {
 a.push_back(vector1D(rownum,0));//initialize memory
 }

As you never clear a the size increases for every puzzle.
This can become quite expensive as a will continuously keep breaking its buffer and have to be re-allocated (copying all those internal arrays).

Speed

If you are solely going for speed then operator>> is about half the speed of scanf as it (operator>>) takes into account a lot of local stuff that is not done with scanf.

Your recursive algorithm is computing some spots multiple times.

// Your Triangle
1
2 3
4 5 6
7 8 9 10
11 12 13 14 15
// Number of times each spot is hit with recursion
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1

As you can see some of the spots in the center bottom of the triangle are being hit many many times by recursion when you only want to hit each location once. Thus an iterative solution is probably better.

You are using a 2D array to represent the triangle which requires two de-references (which can be inefficient (but without measuring I am not convinced this is a major issue)). Rather than do that use a 1-D array given the number of rows n you can quickly compute the number of elements in the triangle as n*(n+1)/2 and thus re-size the array in one statement. Element access for a child of m is m+row(m)+0 and m+row(m)+1 (note the calculation of row(m) does not need to be done each time you can keep track of this as you increment m).

Question 4

Why only use memoization for "small" values of i and j?

Without memoization, your program runs in exponential time. The problem, which memoization intends to solve, is that you tend to compute the same value many times along the way.

So you memoize some of the values, to save time. On those areas of the table you aren't memoizing, you still suffer the exponential slowdown. And you're only memoizing 1/4 of your domain.

Question 5

If your priority is performance with this, in addition to having a memoizing cache for up to the size of about the L1 cache - maybe 4K or so, I'd also get rid of the use of the vector of vector.

You're spending a disproportionate amount of time in your vectors' [] operator() functions which will cause an over-abundance of stack pushes and context switches - you're using that operator in at least four places in each pass through solve() - six when not cached already. You can see this if you profile your code. I'd keep it to a vector with dynamically allocated columns and use only pointer arithmetic so that all accesses are happening from the same stack frame in solve(). I'd pre-calculate base pointers in advance and use those as much as possible in the pointer calculations using the bitwise << sizeof(int) to traverse.

I would also stay local and convert your max(t1,t2) call to use a bit-twiddle trick so that total is now:

total = t1 ^ ((t1 ^ t2) & -(t1 < t2)) + *paij;

where *paij is the dereferenced arithmetically calc'ed pointer instead of a[i][j]. That should keep the pipeline full more often.

To get to the next level from there, I would probably move from recursion to an iterative approach if you're not married to recursion.

That should buy you some performance gain, although more could be done depending on your machine configuration.

Edit: Test for speed.

#include <iostream>
#include <algorithm>
#include <sys/time.h> 
using namespace std; 
inline int getmax( int a, int b ) { if ( a < b ) return b; else return a; }
int main()
{
 time_t timev;
 timeval t1, t2; 
 double elapsed_time; 
 const unsigned int RUNL = 1000000000; 
 int i = 0; 
 int j = 1; 
 int k = 1; 
 int total;
 gettimeofday( &t1, NULL );
 while ( i++ < RUNL )
 {
 total = j ^ ( ( j ^ k ) & -( j < k ) );
 k ^= 1;
 }
 gettimeofday( &t2, NULL );
 elapsed_time = ( t2.tv_sec - t1.tv_sec ) * 1000.0;
 elapsed_time += ( t2.tv_usec - t1.tv_usec ) / 1000.0;
 cout << "Elapsed time for the bitop was: " << elapsed_time << " ms\n" << endl;
 i = 0;
 k = 1;
 gettimeofday( &t1, NULL );
 while ( i++ < RUNL )
 {
 total = max( j, k ); k ^= 1;
 }
 gettimeofday( &t2, NULL );
 elapsed_time = ( t2.tv_sec - t1.tv_sec ) * 1000.0;
 elapsed_time += ( t2.tv_usec - t1.tv_usec ) / 1000.0;
 cout << "Elapsed time for the stl call was: " << elapsed_time << " ms\n" << endl;
 i = 0;
 k = 1;
 gettimeofday( &t1, NULL );
 while ( i++ < RUNL )
 {
 total = getmax( j, k ); k ^= 1;
 }
 gettimeofday( &t2, NULL );
 elapsed_time = ( t2.tv_sec - t1.tv_sec ) * 1000.0;
 elapsed_time += ( t2.tv_usec - t1.tv_usec ) / 1000.0;
 cout << "Elapsed time for the getmax call was: " << elapsed_time << " ms\n" << endl; 
}

Edit (Edit) Fixed Speed Test:

#include <iostream>
#include <algorithm>
#include <sys/time.h>
#include <vector>
using namespace std;
void printTime(int total, std::string const& name, timeval& t1, timeval& t2)
{
 double elapsed_time;
 elapsed_time = ( t2.tv_sec - t1.tv_sec ) * 1000.0;
 elapsed_time += ( t2.tv_usec - t1.tv_usec ) / 1000.0;
 std::cout << "Result(" << total << ") Elapsed time for the " << name << " call was: " << elapsed_time << " ms\n" << std::endl;
}
inline int getmax( int a, int b ) { if ( a < b ) return b; else return a; }
int main()
{
 timeval t1, t2;
 const unsigned int RUNL = 1000000000;
 std::cout << "LoadData" << std::endl;
 std::vector<int> data(RUNL);
 for(int loop=0;loop < RUNL;++loop)
 {
 data[loop] = rand();
 }
 std::cout << "LoadData DONE" << std::endl;
 int j = data[0];
 int k;
 gettimeofday( &t1, NULL );
 for(int loop = 1;loop < RUNL;++loop)
 {
 k = data[loop];
 j = j ^ ( ( j ^ k ) & -( j < k ) );
 }
 gettimeofday( &t2, NULL );
 printTime(j, "BIT TWIDDLE", t1, t2);
 j = data[0];
 gettimeofday( &t1, NULL );
 for(int loop = 1;loop < RUNL;++loop)
 {
 k = data[loop];
 j = max( j, k );
 }
 gettimeofday( &t2, NULL );
 printTime(j, "std::max", t1, t2);
 j = data[0];
 gettimeofday( &t1, NULL );
 for(int loop = 1;loop < RUNL;++loop)
 {
 k = data[loop];
 j = getmax( j, k );
 }
 gettimeofday( &t2, NULL );
 printTime(j, "inline max", t1, t2);
}

Results:

> g++ -O3 x.cpp
> ./a.out
LoadData
LoadData DONE
Result: total( 2147483645) Elapsed time for the BIT TWIDDLE call was: 1968.2 ms
Result: total( 2147483645) Elapsed time for the std::max call was: 1170.83 ms
Result: total( 2147483645) Elapsed time for the inline max call was: 1178.05 ms

It's easy to see why the optimizer runs so much faster than the bit hack from this code from between the gettimeofday calls):

STLMAX
 movl 28(%esp), %ecx
 movl 1,ドル %edx
 movl 1,ドル %eax
 .p2align 4,,7
L34:
 movl (%ecx,%edx,4), %edx
 cmpl %edx, %ebx
 cmovl %edx, %ebx
 addl 1,ドル %eax
 cmpl 1000000000,ドル %eax
 movl %eax, %edx
 jne L34
 leal 40(%esp), %eax
 movl 0,ドル 4(%esp)
 movl %eax, (%esp)
BIT HACK
 movl 28(%esp), %edi
 movl 1,ドル %eax
 movl 1,ドル %edx
 .p2align 4,,7
L25:
 movl (%edi,%eax,4), %ecx
 xorl %eax, %eax
 cmpl %ebx, %ecx
 setg %al
 xorl %ebx, %ecx
 negl %eax
 addl 1,ドル %edx
 andl %ecx, %eax
 xorl %eax, %ebx
 cmpl 1000000000,ドル %edx
 movl %edx, %eax
 jne L25
 leal 40(%esp), %eax
 movl 0,ドル 4(%esp)
 movl %eax, (%esp)

Basically two effective instructions in the loop doing the actual max in optimized STL compiler compared with seven for the bit hack. Lesson learned.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Code for hasZeroByte which gave about 2.5x improvement for the bit hack over the other two under -O3 optimization conditions:

#include <iostream>
#include <algorithm>
#include <sys/time.h>
#include <vector>
using namespace std;
void printTime(int total, std::string const& name, timeval& t1, timeval& t2)
{
 double elapsed_time;
 elapsed_time = ( t2.tv_sec - t1.tv_sec ) * 1000.0;
 elapsed_time += ( t2.tv_usec - t1.tv_usec ) / 1000.0;
 std::cout << "Result(" << total << ") Elapsed time for the " << name << " call was: " << elapsed_time << " ms\n" << std::endl;
}
inline bool checkZbyte1( int a ) 
{ 
 for ( int i = 0; i < sizeof( int ); i++ )
 {
 if ( !( 0xff & a ) )
 return true;
 else
 a >>= 8;
 }
 return false;
}
inline bool checkZbyte2( int a ) 
{ 
 if ( !( 0xff & a ) || !( 0xff00 & a ) || !( 0xff0000 & a ) || !( 0xff000000 & a ) )
 return true;
 return false;
}
int main()
{
 timeval t1, t2;
 const unsigned int RUNL = 100000000;
 std::cout << "LoadData" << std::endl;
 std::vector<int> data(RUNL);
 for(int loop=0; loop < RUNL; ++loop)
 data[loop] = rand();
 std::cout << "LoadData DONE" << std::endl;
 int j;
 int count = 0;
 gettimeofday( &t1, NULL );
 for(int loop = 0; loop < RUNL ;++loop)
 {
 if ( checkZbyte1( data[loop] ) )
 count++;
 }
 gettimeofday( &t2, NULL );
 printTime( count, "inline version #1", t1, t2 );
 count = 0;
 gettimeofday( &t1, NULL );
 for( int loop = 0; loop < RUNL; ++loop )
 {
 if ( checkZbyte2( data[loop] ) )
 count++;
 }
 gettimeofday( &t2, NULL );
 printTime( count, "inline version #2", t1, t2 );
 count = 0;
 gettimeofday( &t1, NULL );
 for( int loop = 0; loop < RUNL; ++loop )
 {
 j = data[loop];
 if ( ~( ( ( (j & 0x7F7F7F7F ) + 0x7F7F7F7F ) | j ) | 0x7F7F7F7F ) )
 count++;
 }
 gettimeofday( &t2, NULL );
 printTime( count, "BIT HACK", t1, t2 );
}

Question 6

Unless your compiler is extremely ancient/stupid, it will generate better code for max than any bit-twidding hack you come up with. Otherwise these are good ideas :-)

Question 7

I think bit twiddling is silly. Also I am not sure I agree that operator[] is a problem. This is simply returns base[index] and as such is probably inlined and is no more expensive than a normal array access. If you are saying that the 2D array is the problem then maybe.

Question 8

hmm.. Why are bit hacks silly? I just ran 3 different cases of max using 1Billion of each all in identical loops. The results:

Question 9

Thing cut me off for taking to long... ironic... the results: bit hack max as above: 3.052s stl max() call: 4.238s my own getmax() call: 4.470s the results were consistent. I'm using cygwin with gcc 4.5 on a quad 2.5GHz Dell M6400. Try it yourself... I don't think we could disregard an improvement of 29%.

Question 10

They are silly because the compiler will always do as well and usually better. If you have found a new optimization to max (unlikely) then it will be introduced into the compiler and the next evolution of the compiler will be just as quick as your code. Post your code. :-) we will see how well it really does.

Loki Astari Loki Astari 97.7k5 gold badges126 silver badges341 bronze badges · Accepted Answer · 2012-01-26 14:38:54Z

General Comments.

I hate your use of global variables.

int rownum;
int noOfCalls = 0;
vector2D a;
vector2D cache; // The cache to implement memoization

I suppose it is an attempt at better performance.

The test for j here is redundant:

if (cache[i][j] != 0 && ( i < 50) && ( j < 50))

If i is smaller than 50 then j can not be greater than 50. If i is greater than 50 then the test will fail and there is no need to test what the value of j is.

Bugs

 for ( int i = 0 ; i < rownum ; i++) {
 a.push_back(vector1D(rownum,0));//initialize memory
 }

As you never clear a the size increases for every puzzle.
This can become quite expensive as a will continuously keep breaking its buffer and have to be re-allocated (copying all those internal arrays).

Speed

If you are solely going for speed then operator>> is about half the speed of scanf as it (operator>>) takes into account a lot of local stuff that is not done with scanf.

Your recursive algorithm is computing some spots multiple times.

// Your Triangle
1
2 3
4 5 6
7 8 9 10
11 12 13 14 15
// Number of times each spot is hit with recursion
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1

As you can see some of the spots in the center bottom of the triangle are being hit many many times by recursion when you only want to hit each location once. Thus an iterative solution is probably better.

You are using a 2D array to represent the triangle which requires two de-references (which can be inefficient (but without measuring I am not convinced this is a major issue)). Rather than do that use a 1-D array given the number of rows n you can quickly compute the number of elements in the triangle as n*(n+1)/2 and thus re-size the array in one statement. Element access for a child of m is m+row(m)+0 and m+row(m)+1 (note the calculation of row(m) does not need to be done each time you can keep track of this as you increment m).

Stack Exchange Network

Memoization-based puzzle solution - sums in a triangle

3 Answers 3

General Comments.

Bugs

Speed

Edit: Test for speed.

Edit (Edit) Fixed Speed Test:

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Memoization-based puzzle solution - sums in a triangle

3 Answers 3

General Comments.

Bugs

Speed

Edit: Test for speed.

Edit (Edit) Fixed Speed Test:

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions