I have code in which I need to build a map whose keys are doubles (the F-test value between two clusters, for which I need to calculate the residual sum of squares) and whose mapped values are of type cluspair, a pair of pointers to the Cluster class that I created. The map is meant to store the F-test values between all clusters so that I don't need to repeat the calculation at every step. A cluster is a tree structure in which every cluster contains two subclusters, and the stored values are 70-dimensional vectors.

The problem is that, in order to calculate the RSS, I need recursive code that finds the distance between every element of the cluster and the mean of the cluster, and this seems to consume an enormous amount of memory. When I create the same map with the keys being the simple distance between the means of two clusters, the program uses minimal memory, so I think the increase in memory use is caused by the call to the recursive function rss. What should I do to manage the memory use in the code below? In its current implementation the system runs out of memory, and Windows closes the application saying that the system ran out of virtual memory.

The main code:

    map<double,cluspair> createRSSMap( list<Cluster*> cluslist )
    {
        list<Cluster*>::iterator it1;
        list<Cluster*>::iterator it2;
        map<double,cluspair> rtrnmap;
        for(it1=cluslist.begin(); it1!= --cluslist.end() ;it1++)
        {
            it2=it1;
            ++it2;
            cout << ".";
            list<Cluster*>::iterator itc;
            double cFvalue=10000000000000000000;
            double rIt1 = (*it1)->rss();
            for(int kk=0 ; it2!=cluslist.end(); it2++)
            {
                Cluster tclustr ((*it1) , (*it2));
                double r1 = tclustr.rss();
                double r2= rIt1 + (*it2)->rss();
                int df2 = tclustr.getNumOfVecs() - 2;
                double fvalue = (r1 - r2) / (r2 / df2);
                if(fvalue<cFvalue)
                {
                    cFvalue=fvalue;
                    itc=it2;
                }
            }
            cluspair clp;
            clp.c1 = *it1;
            clp.c2 = *itc;
            bool doesexists = (rtrnmap.find(cFvalue) != rtrnmap.end());
            while(rtrnmap)
            {
                cFvalue+= 0.000000001;
                rtrnmap= (rtrnmap.find(cFvalue) != rtrnmap.end());
            }
            rtrnmap[cFvalue] = clp;
        }
        return rtrnmap;
    }

And the implementation of the function rss:

    double Cluster::rss()
    {
        return rss(cnode->mean);
    }
    double Cluster::rss(vector<double> &cmean)
    {
        if(cnode->numOfVecs==1)
        {
            return vectorDist(cmean,cnode->mean);
        }
        else
        {
            return ( ec1->rss(cmean) + ec2->rss(cmean) );
        }
    }

Much thanks in advance. I really don't know what to do at this point.


Below is the code that I use to create a map whose keys are the simple Euclidean distance between two cluster means. As I said above, it is quite similar and uses minimal memory. It differs only in the calculation of the key: instead of the recursive calculation, it computes the simple distance between the means of two clusters. I hope it helps to identify the problem.

    map<double,cluspair> createDistMap( list<Cluster*> cluslist )
    {
        list<Cluster*>::iterator it1;
        list<Cluster*>::iterator it2;
        map<double,cluspair> rtrnmap;
        for(it1=cluslist.begin(); it1!= --cluslist.end() ;it1++)
        {
            it2=it1;
            ++it2;
            cout << ".";
            list<Cluster*>::iterator itc;
            double cDist=1000000000000000;
            for(int kk=0 ; it2!=cluslist.end(); it2++)
            {
                double nDist = vectorDist( (*it1)->getMean(),(*it2)->getMean());
                if (nDist<cDist)
                {
                    cDist = nDist;
                    itc=it2;
                }
            }
            cluspair clp;
            clp.c1 = *it1;
            clp.c2 = *itc;
            bool doesexists = (rtrnmap.find(cDist) != rtrnmap.end());
            while(doesexists)
            {
                cDist+= 0.000000001;
                doesexists = (rtrnmap.find(cDist) != rtrnmap.end());
            }
            rtrnmap[cDist] = clp;
        }
        return rtrnmap;
    }

The Cluster constructor:

    Cluster::Cluster (Cluster *C1, Cluster *C2)
    {
        ec1=C1;
        ec2=C2;
        node* cn = new node;
        cn->numOfVecs = C1->cnode->numOfVecs + C2->cnode->numOfVecs;
        double nov = cn->numOfVecs;
        double div = (1 / nov);
        cn->mean = scalarMultVect(div,vectAdd(scalarMultVect(C1->cnode->numOfVecs,C1->cnode->mean),scalarMultVect(C2->cnode->numOfVecs,C2->cnode->mean)));
        mvect tmv;
        tmv.stock="";
        cn->v1 = tmv;
        cnode = cn;
    }

asked Jun 19, 2011 at 13:25
  • You haven't given us enough code to reproduce the problem, but it looks as if you could save a lot of calculation (and maybe memory) by storing these RSS values and not recomputing them so many times. Commented Jun 19, 2011 at 13:40
  • The loop finds the F-values between cluster pairs. When the loop reaches the last element, there is no cluster left to pair it with. Commented Jun 19, 2011 at 13:41
  • Does the Cluster::Cluster(const Cluster* pc1, const Cluster* pc2) constructor do a deep copy of the entire tree nodes of the two input clusters? Please explain its memory usage, and/or post the constructor's code. Commented Jun 19, 2011 at 13:44
  • @Beta vectorDist is the Euclidean distance between two mean vectors. ec1 and ec2 are the two subclusters, and cnode->mean gives the mean of the current cluster. If you want to know anything else I would gladly post more code; I just didn't want to fill the page with unnecessary code. As for storing the RSS values: this is a tree-building process, so the RSS values would not stay the same; they change with the means of the new parent clusters. Commented Jun 19, 2011 at 13:46
  • @Beta and @aristos: the RSS values of a single cluster (the r2) can be precomputed. But I think the RSS values of a combination of clusters will have to be computed recursively. Commented Jun 19, 2011 at 13:46
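
On that last point, one possible shortcut: if RSS is taken to mean the sum of *squared* Euclidean distances to the cluster mean (an assumption; the posted rss() sums whatever vectorDist returns), then the RSS of a merged cluster follows from the two children's sizes, means, and RSS values alone, with no recursion. A minimal sketch of that identity; `Summary`, `sqDist`, `mergedRss`, and `summarize` are hypothetical names, not part of the posted code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<double> Vec;

// Hypothetical per-cluster summary: member count, mean, and cached RSS
// (here: sum of squared Euclidean distances of the members to the mean).
struct Summary {
    std::size_t n;
    Vec mean;
    double rss;
};

// Squared Euclidean distance between two vectors of equal length.
double sqDist(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

// RSS(A u B) = RSS(A) + RSS(B) + nA*nB/(nA+nB) * ||meanA - meanB||^2
double mergedRss(const Summary& a, const Summary& b) {
    const double na = static_cast<double>(a.n);
    const double nb = static_cast<double>(b.n);
    return a.rss + b.rss + (na * nb / (na + nb)) * sqDist(a.mean, b.mean);
}

// Brute-force summary built directly from the points, for checking only.
Summary summarize(const std::vector<Vec>& pts) {
    assert(!pts.empty());
    Summary s;
    s.n = pts.size();
    s.mean.assign(pts[0].size(), 0.0);
    for (std::size_t p = 0; p < pts.size(); ++p)
        for (std::size_t i = 0; i < s.mean.size(); ++i)
            s.mean[i] += pts[p][i] / static_cast<double>(s.n);
    s.rss = 0.0;
    for (std::size_t p = 0; p < pts.size(); ++p)
        s.rss += sqDist(pts[p], s.mean);
    return s;
}
```

Under that assumption, the r1 of the posted inner loop could be obtained from cached per-cluster values without building a temporary Cluster or walking the tree. If vectorDist returns the plain (unsquared) distance, the identity does not apply.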

1 Answer

You've asked exactly the same question before: Enormous Increase In the Use Of Memory

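  1. rtrnmap= (rtrnmap.find(cFvalue) != rtrnmap.end()); does not make sense: it assigns the bool result of a comparison to the map itself. Presumably doesexists was meant, as in createDistMap.
  2. You were told to pass data through references.
  3. You were told to add logging information and see how many iterations are performed.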
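
Points 2 and 3 can be sketched together. This is only an illustration under assumptions: `Item` is a hypothetical stand-in for Cluster and `countPairs` is an invented helper that mirrors the nested loop of the question while counting how many pairs it visits.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <list>

struct Item { int id; };  // hypothetical stand-in for Cluster

// The list is taken by const reference, so the call copies nothing.
// Returns the number of (it1, it2) pairs the nested loop visits.
std::size_t countPairs(const std::list<Item*>& items) {
    std::size_t pairs = 0;
    for (std::list<Item*>::const_iterator it1 = items.begin();
         it1 != items.end(); ++it1) {
        std::list<Item*>::const_iterator it2 = it1;
        for (++it2; it2 != items.end(); ++it2) {
            assert(*it1 != 0 && *it2 != 0);  // raw pointers: check before use
            ++pairs;
        }
    }
    // Logging goes to stderr so it does not mix with normal output.
    std::clog << "pairs visited: " << pairs << '\n';
    return pairs;
}
```

For n elements the loop visits n(n-1)/2 pairs, so a list of about 14,000 clusters means roughly 98 million inner iterations per call. Iterating up to `end()` in the outer loop also sidesteps `--cluslist.end()`, which is not valid on an empty list.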

A few comments:

  • It is a bad idea to have a map with a double as key as you may find yourself unable to retrieve an element due to a tiny difference in the double.
  • Add only a few elements to the collection and manually step through all the functions in the debugger. You'll get to "see" what gets executed and can immediately tell whether the actual execution flow matches your expectations.
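
One way to avoid nudging a double key until it is unique is a container that accepts duplicate keys. A minimal sketch using std::multimap; `NamedPair` is a hypothetical stand-in for cluspair, and `addScore` and `bestPair` are invented helper names:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for cluspair.
struct NamedPair { std::string first, second; };

// multimap keeps entries sorted by key and allows equal keys,
// so no artificial offset has to be added to make a key unique.
typedef std::multimap<double, NamedPair> ScoreMap;

void addScore(ScoreMap& scores, double score, const NamedPair& p) {
    scores.insert(ScoreMap::value_type(score, p));
}

// The entry with the smallest key is always the first one.
const NamedPair& bestPair(const ScoreMap& scores) {
    assert(!scores.empty());
    return scores.begin()->second;
}
```

The keys stay exactly as computed, and looking up the smallest score does not depend on matching a double exactly.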

And please don't double post your questions (even if you use different users).

EDIT:

We all assumed a proper destructor. Make sure you deallocate any memory you explicitly allocate with new or new[] with delete or delete[] as appropriate.
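
To make this concrete: the posted constructor allocates a node with new, and the inner loop of createRSSMap builds a temporary Cluster for every pair. If the destructor (not shown in the question, so this is an assumption) does not delete that node, every pair leaks one node together with its 70-element mean. A simplified sketch in which a cluster owns its own node but not its subclusters; `Node::live` is a hypothetical counter added only to make the effect observable:

```cpp
#include <cassert>
#include <vector>

struct Node {
    static int live;          // hypothetical counter of allocated nodes
    int numOfVecs;
    std::vector<double> mean;
    Node() : numOfVecs(0) { ++live; }
    ~Node() { --live; }
};
int Node::live = 0;

class Cluster {
public:
    // Leaf cluster holding a single vector.
    explicit Cluster(const std::vector<double>& v)
        : ec1(0), ec2(0), cnode(new Node) {
        cnode->numOfVecs = 1;
        cnode->mean = v;
    }
    // Parent of two existing clusters (mean computation omitted here).
    Cluster(Cluster* c1, Cluster* c2) : ec1(c1), ec2(c2), cnode(new Node) {
        cnode->numOfVecs = c1->cnode->numOfVecs + c2->cnode->numOfVecs;
    }
    // Frees only the node this object allocated; ec1 and ec2 are
    // borrowed pointers and are deliberately not deleted.
    ~Cluster() { delete cnode; }

    bool isLeaf() const { return ec1 == 0 && ec2 == 0; }
    int size() const { return cnode->numOfVecs; }

private:
    Cluster(const Cluster&);             // non-copyable, avoids double delete
    Cluster& operator=(const Cluster&);
    Cluster* ec1;
    Cluster* ec2;
    Node* cnode;
};
```

With such a destructor, a temporary like tclustr releases its node at the end of each inner iteration. Whether the real Cluster should also own its subclusters depends on how the tree is built, which the question does not show.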

answered Jun 19, 2011 at 14:08

11 Comments

Sorry for that, I just thought that the old post would go unseen. doesexists = (rtrnmap.find(cDist) != rtrnmap.end()) returns true if the map already contains an element with the key that is about to be added. In that case, a tiny amount of 0.00000001 is added, so no info would be lost. The counter would give a value with a maximum of 2 to the power 14000. I'm aware that this is a huge number, but that is the nature of the data I have to deal with. The tree is quite large.
@aristos : You have a limited amount of stack space for the recursion (and your function is not tail call optimisable).
@aristos : are you sure you know what 2 to the power 14000 means?
And what do you mean "no info will be lost"? You do realize that "rtrnmap[cDist] = clp;" will add a new element in the rtrnmap every time it is called, right? (Because of the loop above which finds a cDist not in the map)
@andrei Well I have about 14k vectors to build a cluster tree from. 2 to the power 14000 minus one would be equal to the number of recursions. Is there no way to make an optimisation for that kind of problem? By "no info will be lost" I meant that there would be no overwriting. If the given key is already in the map, while loop would find a double value close to cDist which is not a key of the map.
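
On the recursion depth and count: a binary tree in which every internal node has exactly two children and which has N leaves contains 2N - 1 nodes, so a full traversal visits 2N - 1 nodes, not 2^N. The depth can still reach N - 1 in a degenerate, chain-like tree, which is where stack space matters. A minimal sketch of the same leaf sum done with an explicit stack instead of recursion; `TreeNode` and `sumLeaves` are hypothetical names, and `value` stands in for the per-leaf distance:

```cpp
#include <cassert>
#include <cstddef>
#include <stack>

// Hypothetical simplified tree node; 'value' is used only at leaves.
struct TreeNode {
    TreeNode* left;
    TreeNode* right;
    double value;
    explicit TreeNode(double v) : left(0), right(0), value(v) {}
    TreeNode(TreeNode* l, TreeNode* r) : left(l), right(r), value(0.0) {}
};

// Sums the leaf values using a heap-allocated stack, so the depth of
// the tree does not consume call-stack space. 'visited', if non-null,
// is incremented once per node.
double sumLeaves(const TreeNode* root, std::size_t* visited) {
    double total = 0.0;
    std::stack<const TreeNode*> todo;
    if (root) todo.push(root);
    while (!todo.empty()) {
        const TreeNode* n = todo.top();
        todo.pop();
        if (visited) ++*visited;
        if (!n->left && !n->right) {
            total += n->value;
        } else {
            // Every internal node has exactly two children, as in the question.
            assert(n->left && n->right);
            todo.push(n->left);
            todo.push(n->right);
        }
    }
    return total;
}
```

The work per call is still proportional to the number of nodes; the explicit stack only removes the dependence on call-stack depth.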