I am trying to solve a problem at Codechef. As I am new to programming, I am not familiar with dynamic programming. However, I do know the basics of recursion and memoization. Hence I implemented a solution to the above mentioned problem. The problem with my solution is that it exceeds the time limit given. As far as I can guess, the problem seems to be with my memoization logic. Please help me optimize the code.
#include <iostream>
#include <algorithm>
#include <vector>
#include <string>
using namespace std;
typedef vector<int> vector1D ;
typedef vector<vector1D > vector2D ; // Two Dimensional vector
int rownum;
int noOfCalls = 0;
vector2D a;
vector2D cache; // The cache to implement memoization
int solve (int i, int j)
{
int t1,t2,total;
if( i > rownum-1 ) {
return 0;
}
else {
if (cache[i][j] != 0 && ( i < 50) && ( j < 50)) { //for small values use memoization
return cache[i][j];
}
t1 = solve(i+1,j);
t2 = solve(i+1,j+1);
total = max(t1,t2) + a[i][j];
cache[i][j] = total;//store in cache
}
return total;
}
int main(int argc, char *argv[])
{
int tries;
cin>>tries;
while(tries--) {
cin>>rownum;
int temp = rownum;
for ( int i = 0 ; i < rownum ; i++) {
a.push_back(vector1D(rownum,0));//initialize memory
}
for ( int j = 0; j < 10; j++) {
cache.push_back(vector1D(10,0));//initialize cache
}
for ( int i = 0; i < rownum; i++) {
for ( int j = 0 ; j <= i; j++) {
cin>>a[i][j];
}
}
cout<<solve(0,0)<<endl;//call the solving function
cache.clear();
}
return 0;
}
3 Answers 3
General Comments.
I hate your use of global variables.
int rownum;
int noOfCalls = 0;
vector2D a;
vector2D cache; // The cache to implement memoization
I suppose it is an attempt at better performance.
The test for j here is redundant:
if (cache[i][j] != 0 && ( i < 50) && ( j < 50))
If i is smaller than 50 then j can not be greater than 50. If i is greater than 50 then the test will fail and there is no need to test what the value of j is.
Bugs
for ( int i = 0 ; i < rownum ; i++) {
a.push_back(vector1D(rownum,0));//initialize memory
}
As you never clear a
the size increases for every puzzle.
This can become quite expensive as a
will continuously keep breaking its buffer and have to be re-allocated (copying all those internal arrays).
Speed
If you are solely going for speed then operator>>
is about half the speed of scanf
as it (operator>>) takes into account a lot of local stuff that is not done with scanf.
Your recursive algorithm is computing some spots multiple times.
// Your Triangle
1
2 3
4 5 6
7 8 9 10
11 12 13 14 15
// Number of times each spot is hit with recursion
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
As you can see some of the spots in the center bottom of the triangle are being hit many many times by recursion when you only want to hit each location once. Thus an iterative solution is probably better.
You are using a 2D array to represent the triangle which requires two de-references (which can be inefficient (but without measuring I am not convinced this is a major issue)). Rather than do that use a 1-D array given the number of rows n you can quickly compute the number of elements in the triangle as n*(n+1)/2
and thus re-size the array in one statement. Element access for a child of m is m+row(m)+0
and m+row(m)+1
(note the calculation of row(m) does not need to be done each time you can keep track of this as you increment m).
Why only use memoization for "small" values of i
and j
?
Without memoization, your program runs in exponential time. The problem, which memoization intends to solve, is that you tend to compute the same value many times along the way.
So you memoize some of the values, to save time. On those areas of the table you aren't memoizing, you still suffer the exponential slowdown. And you're only memoizing 1/4 of your domain.
If your priority is performance with this, in addition to having a memoizing cache for up to the size of about the L1 cache - maybe 4K or so, I'd also get rid of the use of the vector of vector.
You're spending a disproportionate amount of time in your vectors' [] operator() functions which will cause an over-abundance of stack pushes and context switches - you're using that operator in at least four places in each pass through solve() - six when not cached already. You can see this if you profile your code. I'd keep it to a vector with dynamically allocated columns and use only pointer arithmetic so that all accesses are happening from the same stack frame in solve(). I'd pre-calculate base pointers in advance and use those as much as possible in the pointer calculations using the bitwise << sizeof(int) to traverse.
I would also stay local and convert your max(t1,t2) call to use a bit-twiddle trick so that total is now:
total = t1 ^ ((t1 ^ t2) & -(t1 < t2)) + *paij;
where *paij is the dereferenced arithmetically calc'ed pointer instead of a[i][j]. That should keep the pipeline full more often.
To get to the next level from there, I would probably move from recursion to an iterative approach if you're not married to recursion.
That should buy you some performance gain, although more could be done depending on your machine configuration.
Edit: Test for speed.
#include <iostream>
#include <algorithm>
#include <sys/time.h>
using namespace std;
inline int getmax( int a, int b ) { if ( a < b ) return b; else return a; }
int main()
{
time_t timev;
timeval t1, t2;
double elapsed_time;
const unsigned int RUNL = 1000000000;
int i = 0;
int j = 1;
int k = 1;
int total;
gettimeofday( &t1, NULL );
while ( i++ < RUNL )
{
total = j ^ ( ( j ^ k ) & -( j < k ) );
k ^= 1;
}
gettimeofday( &t2, NULL );
elapsed_time = ( t2.tv_sec - t1.tv_sec ) * 1000.0;
elapsed_time += ( t2.tv_usec - t1.tv_usec ) / 1000.0;
cout << "Elapsed time for the bitop was: " << elapsed_time << " ms\n" << endl;
i = 0;
k = 1;
gettimeofday( &t1, NULL );
while ( i++ < RUNL )
{
total = max( j, k ); k ^= 1;
}
gettimeofday( &t2, NULL );
elapsed_time = ( t2.tv_sec - t1.tv_sec ) * 1000.0;
elapsed_time += ( t2.tv_usec - t1.tv_usec ) / 1000.0;
cout << "Elapsed time for the stl call was: " << elapsed_time << " ms\n" << endl;
i = 0;
k = 1;
gettimeofday( &t1, NULL );
while ( i++ < RUNL )
{
total = getmax( j, k ); k ^= 1;
}
gettimeofday( &t2, NULL );
elapsed_time = ( t2.tv_sec - t1.tv_sec ) * 1000.0;
elapsed_time += ( t2.tv_usec - t1.tv_usec ) / 1000.0;
cout << "Elapsed time for the getmax call was: " << elapsed_time << " ms\n" << endl;
}
Edit (Edit) Fixed Speed Test:
#include <iostream>
#include <algorithm>
#include <sys/time.h>
#include <vector>
using namespace std;
void printTime(int total, std::string const& name, timeval& t1, timeval& t2)
{
double elapsed_time;
elapsed_time = ( t2.tv_sec - t1.tv_sec ) * 1000.0;
elapsed_time += ( t2.tv_usec - t1.tv_usec ) / 1000.0;
std::cout << "Result(" << total << ") Elapsed time for the " << name << " call was: " << elapsed_time << " ms\n" << std::endl;
}
inline int getmax( int a, int b ) { if ( a < b ) return b; else return a; }
int main()
{
timeval t1, t2;
const unsigned int RUNL = 1000000000;
std::cout << "LoadData" << std::endl;
std::vector<int> data(RUNL);
for(int loop=0;loop < RUNL;++loop)
{
data[loop] = rand();
}
std::cout << "LoadData DONE" << std::endl;
int j = data[0];
int k;
gettimeofday( &t1, NULL );
for(int loop = 1;loop < RUNL;++loop)
{
k = data[loop];
j = j ^ ( ( j ^ k ) & -( j < k ) );
}
gettimeofday( &t2, NULL );
printTime(j, "BIT TWIDDLE", t1, t2);
j = data[0];
gettimeofday( &t1, NULL );
for(int loop = 1;loop < RUNL;++loop)
{
k = data[loop];
j = max( j, k );
}
gettimeofday( &t2, NULL );
printTime(j, "std::max", t1, t2);
j = data[0];
gettimeofday( &t1, NULL );
for(int loop = 1;loop < RUNL;++loop)
{
k = data[loop];
j = getmax( j, k );
}
gettimeofday( &t2, NULL );
printTime(j, "inline max", t1, t2);
}
Results:
> g++ -O3 x.cpp
> ./a.out
LoadData
LoadData DONE
Result: total( 2147483645) Elapsed time for the BIT TWIDDLE call was: 1968.2 ms
Result: total( 2147483645) Elapsed time for the std::max call was: 1170.83 ms
Result: total( 2147483645) Elapsed time for the inline max call was: 1178.05 ms
It's easy to see why the optimizer runs so much faster than the bit hack from this code from between the gettimeofday calls):
STLMAX
movl 28(%esp), %ecx
movl 1,ドル %edx
movl 1,ドル %eax
.p2align 4,,7
L34:
movl (%ecx,%edx,4), %edx
cmpl %edx, %ebx
cmovl %edx, %ebx
addl 1,ドル %eax
cmpl 1000000000,ドル %eax
movl %eax, %edx
jne L34
leal 40(%esp), %eax
movl 0,ドル 4(%esp)
movl %eax, (%esp)
BIT HACK
movl 28(%esp), %edi
movl 1,ドル %eax
movl 1,ドル %edx
.p2align 4,,7
L25:
movl (%edi,%eax,4), %ecx
xorl %eax, %eax
cmpl %ebx, %ecx
setg %al
xorl %ebx, %ecx
negl %eax
addl 1,ドル %edx
andl %ecx, %eax
xorl %eax, %ebx
cmpl 1000000000,ドル %edx
movl %edx, %eax
jne L25
leal 40(%esp), %eax
movl 0,ドル 4(%esp)
movl %eax, (%esp)
Basically two effective instructions in the loop doing the actual max in optimized STL compiler compared with seven for the bit hack. Lesson learned.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Code for hasZeroByte which gave about 2.5x improvement for the bit hack over the other two under -O3 optimization conditions:
#include <iostream>
#include <algorithm>
#include <sys/time.h>
#include <vector>
using namespace std;
void printTime(int total, std::string const& name, timeval& t1, timeval& t2)
{
double elapsed_time;
elapsed_time = ( t2.tv_sec - t1.tv_sec ) * 1000.0;
elapsed_time += ( t2.tv_usec - t1.tv_usec ) / 1000.0;
std::cout << "Result(" << total << ") Elapsed time for the " << name << " call was: " << elapsed_time << " ms\n" << std::endl;
}
inline bool checkZbyte1( int a )
{
for ( int i = 0; i < sizeof( int ); i++ )
{
if ( !( 0xff & a ) )
return true;
else
a >>= 8;
}
return false;
}
inline bool checkZbyte2( int a )
{
if ( !( 0xff & a ) || !( 0xff00 & a ) || !( 0xff0000 & a ) || !( 0xff000000 & a ) )
return true;
return false;
}
int main()
{
timeval t1, t2;
const unsigned int RUNL = 100000000;
std::cout << "LoadData" << std::endl;
std::vector<int> data(RUNL);
for(int loop=0; loop < RUNL; ++loop)
data[loop] = rand();
std::cout << "LoadData DONE" << std::endl;
int j;
int count = 0;
gettimeofday( &t1, NULL );
for(int loop = 0; loop < RUNL ;++loop)
{
if ( checkZbyte1( data[loop] ) )
count++;
}
gettimeofday( &t2, NULL );
printTime( count, "inline version #1", t1, t2 );
count = 0;
gettimeofday( &t1, NULL );
for( int loop = 0; loop < RUNL; ++loop )
{
if ( checkZbyte2( data[loop] ) )
count++;
}
gettimeofday( &t2, NULL );
printTime( count, "inline version #2", t1, t2 );
count = 0;
gettimeofday( &t1, NULL );
for( int loop = 0; loop < RUNL; ++loop )
{
j = data[loop];
if ( ~( ( ( (j & 0x7F7F7F7F ) + 0x7F7F7F7F ) | j ) | 0x7F7F7F7F ) )
count++;
}
gettimeofday( &t2, NULL );
printTime( count, "BIT HACK", t1, t2 );
}
-
\$\begingroup\$ Unless your compiler is extremely ancient/stupid, it will generate better code for
max
than any bit-twidding hack you come up with. Otherwise these are good ideas :-) \$\endgroup\$Nemo– Nemo2012年01月26日 02:22:55 +00:00Commented Jan 26, 2012 at 2:22 -
\$\begingroup\$ I think bit twiddling is silly. Also I am not sure I agree that operator[] is a problem. This is simply
returns base[index]
and as such is probably inlined and is no more expensive than a normal array access. If you are saying that the 2D array is the problem then maybe. \$\endgroup\$Loki Astari– Loki Astari2012年01月26日 13:50:33 +00:00Commented Jan 26, 2012 at 13:50 -
\$\begingroup\$ hmm.. Why are bit hacks silly? I just ran 3 different cases of max using 1Billion of each all in identical loops. The results: \$\endgroup\$Wade Stone– Wade Stone2012年01月27日 00:58:28 +00:00Commented Jan 27, 2012 at 0:58
-
\$\begingroup\$ Thing cut me off for taking to long... ironic... the results: bit hack max as above: 3.052s stl max() call: 4.238s my own getmax() call: 4.470s the results were consistent. I'm using cygwin with gcc 4.5 on a quad 2.5GHz Dell M6400. Try it yourself... I don't think we could disregard an improvement of 29%. \$\endgroup\$Wade Stone– Wade Stone2012年01月27日 01:09:55 +00:00Commented Jan 27, 2012 at 1:09
-
\$\begingroup\$ They are silly because the compiler will always do as well and usually better. If you have found a new optimization to max (unlikely) then it will be introduced into the compiler and the next evolution of the compiler will be just as quick as your code. Post your code. :-) we will see how well it really does. \$\endgroup\$Loki Astari– Loki Astari2012年01月27日 03:10:02 +00:00Commented Jan 27, 2012 at 3:10
Explore related questions
See similar questions with these tags.
100X100
vector is Ok to have. \$\endgroup\$