1
\$\begingroup\$

I am having some performance problems with a little R script of mine that I use to visualize simulation results of a project of mine. It now takes longer on my machine to run the R script than the simulation itself, and I guess I am doing something wrong.

library(ggplot2) 
DownloadTime <- function(data, prefix) {
 pIds = unique(data$Pid)
 nPeers = length(pIds)
 type <- 1:nPeers
 downloadTime <- 1:nPeers
 for(i in 1:(nPeers)){ 
 type[i] <- toString(unique(data[ (data$Pid == i), ]$Type))
 start <- unique(data[ (data$Pid == i), ]$Start)
 lastRound <- max(data[ (data$Pid == i), ]$Tick)
 end <- data[ (data$Tick == lastRound) & (data$Pid==i), ]$End
 if( end >= 0){
 downloadTime[i] <- end - start 
 } else {
 downloadTime[i] <- -1
 }
 }
 pData <- data.frame(type, downloadTime )
 hist = ggplot(pData, aes(x=downloadTime, fill=type)) + 
 xlab("Download time [ticks]") + ylab("Peers") + 
 geom_histogram(position=position_dodge()) + 
 opts(title="Download time") + 
 scale_fill_hue( name="Peers", 
 breaks=c("Peer", "Peer_C1"), labels=c("BT","BT_ext") )
 density = ggplot(pData, aes(x=downloadTime, colour=type)) + 
 xlab("Download time [ticks]") + 
 ylab("Peers [ratio]") + geom_density() + 
 opts(title="Download time") + 
 scale_colour_hue( name="Peers", 
 breaks=c("Peer", "Peer_C1"), labels=c("BT","BT_ext") )
 path = paste(prefix, "downloadTime_hist.png", sep="_")
 ggsave(file=path , plot=hist , dpi=100)
 path = paste( prefix, "downloadTime_den.png", sep="_")
 ggsave(file=path, plot=density , dpi=100)
 return(pData)
}
proccessData <- function(data, prefix)
{
 maxTick = data$Tick[length(data$Tick)]
 tick <- 0:(maxTick*2-1)
 type <- 1:(maxTick*2)
 online <- 1:(maxTick*2)
 completed <- 1:(maxTick*2)
 avgnTFTSlots <- 1:(maxTick*2)
 avgnOUSlots <- 1:(maxTick*2)
 upRate <- 1:(maxTick*2)
 downRate <- 1:(maxTick*2)
 type <- 1:(maxTick*2)
 tftouUpRatio <- 1:(maxTick*2)
 tftouDownRatio <- 1:(maxTick*2)
 shareRatio <- 1:(maxTick*2)
 for(i in 0:(maxTick-1)){
 tick[i*2 + 1] <- i
 type[i*2 +1] <- "Peer"
 online[i*2 +1] <- nrow( data[ (data$Tick == i) & (data$Type=="Peer"), ] )
 completed[i*2 +1] <- nrow( data[ (data$Tick == i) & (data$Type=="Peer") & (data$End != -1), ] )
 avgnTFTSlots[i*2 +1] <- mean( data[ (data$Tick == i) & (data$Type=="Peer") & (data$End == -1),]$TFT )
 avgnOUSlots[i*2 +1] <- mean( data[ (data$Tick == i) & (data$Type=="Peer") & (data$End == -1),]$OU )
 downRate[i*2 +1] <- mean( data[ (data$Tick == i) & (data$Type=="Peer") & (data$End == -1),]$Download/data[(data$Tick == i) & (data$Type=="Peer") & (data$End == -1),]$MaxDownload )
 upRate[i*2 +1] <- mean( data[ (data$Tick == i) & (data$Type=="Peer") & (data$End == -1),]$Upload/data[(data$Tick == i) & (data$Type=="Peer") & (data$End == -1),]$MaxUpload )
 tftouUpRatio[i*2 +1] <- mean( data[ (data$Tick == i) & (data$Type=="Peer") & (data$End == -1),]$tftouUpRatio )
 tftouDownRatio[i*2 +1] <- mean( data[ (data$Tick == i) & (data$Type=="Peer") & (data$End == -1),]$tftouDownRatio )
 shareRatio[i*2 +1] <- mean( data[ (data$Tick == i) & (data$Type=="Peer") & (data$End == -1),]$shareRatio )
 tick[i*2+1 + 1] <- i
 type[i*2+1 + 1] <- "Peer_C1"
 online[i*2+1 +1] <- nrow( data[ (data$Tick == i) & (data$Type=="Peer_C1"), ] )
 completed[i*2+1 +1] <- nrow( data[ (data$Tick == i) & (data$Type=="Peer_C1") & (data$End != -1), ] )
 avgnTFTSlots[i*2+1 +1] <- mean( data[ (data$Tick == i) & (data$Type=="Peer_C1") & (data$End == -1),]$TFT )
 avgnOUSlots[i*2+1 +1] <- mean( data[ (data$Tick == i) & (data$Type=="Peer_C1") & (data$End == -1),]$OU )
 downRate[i*2+1 + 1] <- mean( data[ (data$Tick == i) & (data$Type=="Peer_C1") & (data$End == -1) ,]$Download/data[(data$Tick == i) & (data$Type=="Peer_C1") & (data$End == -1),]$MaxDownload )
 upRate[i*2+1 + 1] <- mean( data[ (data$Tick == i) & (data$Type=="Peer_C1") & (data$End == -1),]$Upload/data[(data$Tick == i) & (data$Type=="Peer_C1") & (data$End == -1),]$MaxUpload )
 tftouUpRatio[i*2+1 +1] <- mean( data[ (data$Tick == i) & (data$Type=="Peer_C1") & (data$End == -1),]$tftouUpRatio )
 tftouDownRatio[i*2+1 +1] <- mean( data[ (data$Tick == i) & (data$Type=="Peer_C1") & (data$End == -1),]$tftouDownRatio )
 shareRatio[i*2+1 +1] <- mean( data[ (data$Tick == i) & (data$Type=="Peer_C1") & (data$End == -1),]$shareRatio )
 }
 pData <- data.frame(tick, type, online, completed, avgnTFTSlots, avgnOUSlots, upRate, downRate, tftouUpRatio, tftouDownRatio, shareRatio )
 #Generate Plots
 cm = scale_color_manual( name="Peers (without seeders)", breaks=c("Peer", "Peer_C1"), labels=c("BT","BT_ext") , values=c("red", "blue") )
 pData.upload = ggplot(pData, aes(x=tick) ) + geom_line(aes(y=upRate, colour=type) ) + xlab("Ticks") + ylab("Ratio") + opts(title="Upload usage") + cm 
 pData.download = ggplot(pData, aes(x=tick) ) + geom_line(aes(y=downRate, colour=type) ) + xlab("Ticks") + ylab("Ratio") + opts(title="Download usage") + cm
 pData.shareRatio = ggplot(pData, aes(x=tick) ) + geom_line(aes(y=shareRatio, colour=type) ) + xlab("Ticks") + ylab("Ratio") + opts(title="Download / Upload") + cm 
 pData.tftouUpRatio = ggplot(pData, aes(x=tick) ) + geom_line(aes(y=tftouUpRatio, colour=type) ) + xlab("Ticks") + ylab("Ratio") + opts(title="TFT/OU Upload") + cm
 pData.tftouDownRatio = ggplot(pData, aes(x=tick) ) + geom_line(aes(y=tftouDownRatio, colour=type) ) + xlab("Ticks") + ylab("Ratio") + opts(title="TFT/OU Download") + cm
 pData.connPlot = ggplot(pData, aes(x=tick) ) + geom_area(aes(y=online, fill=type) , alpha=0.4 , position=position_identity() ) + geom_line( aes(y=completed, colour=type) ,position=position_identity()) + ylab("Peers") + xlab("Ticks") + opts(title="Total and completed Peers") + scale_fill_manual( name="Total Peers", breaks=c("Peer", "Peer_C1"), labels=c("BT","BT_ext") , values=c("red", "blue") ) + scale_color_manual( name="Completed Peers", breaks=c("Peer", "Peer_C1"), labels=c("BT","BT_ext") , values=c("red", "blue") )
 pData.ouPlot = ggplot(pData, aes(x=tick) ) + geom_line(aes(y=avgnOUSlots, colour=type) ) + xlab("Ticks") + ylab("OU Slots") + opts(title="Average number of OU Slots") + cm
 pData.tftPlot = ggplot(pData, aes(x=tick) ) + geom_line(aes(y=avgnTFTSlots, colour=type) ) + xlab("Ticks") + ylab("TFT Slots") + opts(title="Average number of TFT Slots") + cm
 #Save Plots 
 path = paste(prefix, "Connections_OU.png", sep="_")
 ggsave(file=path , plot=pData.ouPlot, dpi=100)
 path = paste(prefix, "Connections_TFT.png", sep="_")
 ggsave(file=path , plot=pData.tftPlot, dpi=100)
 path = paste(prefix, "Peer_Count.png", sep="_")
 ggsave(file=path , plot=pData.connPlot, dpi=100)
 path = paste(prefix, "uploadRatio.png", sep="_")
 ggsave(file=path, plot = pData.upload, dpi=100)
 path = paste(prefix, "downloadRatio.png", sep="_")
 ggsave(file=path, plot=pData.download, dpi=100)
 path = paste(prefix, "shareRatio.png", sep="_")
 ggsave(file=path, plot=pData.shareRatio, dpi=100)
 path = paste(prefix, "tftouUpRatio.png", sep="_")
 ggsave(file=path, plot=pData.tftouUpRatio, dpi=100)
 path = paste(prefix, "tftouDownRatio.png", sep="_")
 ggsave(file=path, plot=pData.tftouDownRatio, dpi=100)
 return(pData)
}
#Generate a copy of an vector v, with elment e insert at position pos ( index starting from 1 ) , pos = -1 appends e to the end
insert <- function(v, e, pos)
{
 if( pos == 1){
 return( c(e,v) )
 } else {
 if( pos > length(v) | (pos == -1) )
 pos = length(v)
 if( pos == length(v))
 return( c(v,e) )
 else
 return( c(v[1:(pos-1)],e,v[(pos):length(v)]))
 }
}
#Script has to be called like : Rscript Statistics.R [STATS_FILE] [OUTPUT_DIR] [SUMMARY_FILE] [PREFIX] OR [SUMMARY_FILE] [SUMMART_OUTPUT_DIR] [PREFIX]
#Whereby STATS_FILE points to the csv input file and PREFIX will be
#Check arguments ( argument TRUE will filter all system arguments )
arg = commandArgs()
writeLines( paste("Received " , length(arg) , " arguments", sep="") )
writeLines( paste("Received args ... ", arg, sep="") )
#Own arguments start with arg[6]
if(length(arg) < 9){
 writeLines( "Missing arguments!" )
 #quit("no")
}
#writeLines("Enough arguments!")
#writeLines(paste("arg[6] ", arg[6], sep=""))
if( length(arg) == 8 ){
 dataFile = arg[6]
 outputDir = arg[7]
 prefix = arg[8]
 outputDir = paste( outputDir, prefix, sep="/")
 ecdfFile = paste( outputDir, "ECDF.png", sep="")
 histFile = paste( outputDir, "histogram.png", sep="")
 writeLines( "Assuming script was called to generate summary statistics!" )
 #Load data
 data = read.csv(dataFile, comment.char='#', sep=';', header=F )
 #Set col names
 colnames(data) <- c("Type", "DownloadTime")
 #Create plot and store file
 hist = ggplot(data, aes(x=DownloadTime, fill=Type)) + xlab("Download time") + ylab("Peers") + geom_histogram(position=position_dodge()) + opts(title="Download time") + scale_fill_hue( name="Peers", breaks=c("Peer", "Peer_C1"), labels=c("BT","BT_ext") )
 ggsave(file=histFile, plot=hist , dpi=100)
 #Create ECDF
 data.reduced = data[data$DownloadTime != -1, ] #Remove peers that did not complete their download
 #Adds a ecdf column to the data, containing the ecdf value for each line
 #Note: the ddply is used to group the data depening on the Type coloum ( so is like generating two tables, calculating ecdf for each and than join them again)
 data.reduced <- ddply(data.reduced, .(Type), transform , ecdf = ecdf(DownloadTime)(DownloadTime) )
 data.ecdf = ggplot(data=data.reduced) + geom_step(aes(x=DownloadTime, y=ecdf, color=Type) ) + xlab("Download Time") + ylab("ECDF") + opts(title="Download time") + scale_color_hue( name="Peers", breaks=c("Peer", "Peer_C1"), labels=c("BT","BT_ext") )
 ggsave(file=ecdfFile, plot=data.ecdf , dpi=100)
}
if( length(arg) == 9)
{
 dataFile = arg[6]
 workingDir = arg[7]
 histFile = arg[8]
 prefix = arg[9]
 writeLines( paste("Assuming script was called to generate simulation statistics! On statistic data ", dataFile, " with histfile ", histFile , " and prefix ", prefix ,sep="") )
 setwd(workingDir)
 #Load data
 data = read.csv(dataFile, comment.char='#', sep=';', header=F )
 # Log format <Tick> <peer Type> <id> <downloadStart> <DownloadEnd> \
 #<Max Upload Rate> <Max Download Rate> <Current Upload Rate> <Current Download Rate> \
 #<Total # of max TFT Slots> <Total # of max OU Slots> <Total # of used TFT Slots> <Total # of used OU Slots>\n"
 #Name data
 colnames(data) <- c("Tick","Type","Pid","Start","End","MaxUpload","MaxDownload","Upload","Download","MaxTFT","MaxOU","TFT","OU","TFTDown","TFTUp","OUDown","OUUp" )
 #Calculate averages
 data$upUsage <- data$Upload / data$MaxUpload
 data$downUsage <- data$Download / data$MaxDownload
 data$shareRatio <- data$Download / data$Upload
 data$tftouUpRatio <- data$TFTUp / data$OUUp
 data$tftouDownRatio <- data$TFTDown / data$OUDown
 #Remove irregularities , NaN and inf are set to zero
 data[ is.nan(data$tftouUpRatio),]$tftouUpRatio = 0
 data[ is.nan(data$tftouDownRatio),]$tftouDownRatio = 0
 data[ is.nan(data$shareRatio),]$shareRatio = 0
 data[ is.infinite(data$tftouUpRatio),]$tftouUpRatio = 0
 data[ is.infinite(data$tftouDownRatio),]$tftouDownRatio = 0
 data[ is.infinite(data$shareRatio),]$shareRatio = 0
 #Filter too high values
 m = mean(data$shareRatio) * 10
 data = data[ ((data$shareRatio) < m),]
 m = mean(data$tftouUpRatio) * 10
 data = data[ ((data$tftouUpRatio) < m),]
 m = mean(data$tftouDownRatio) * 10
 data = data[ ((data$tftouDownRatio) < m),]
 #Do processing and generate Plots
 proccessData(data, prefix)
 pData = DownloadTime(data, prefix)
 #Save the processed data into the summary file
 write.table(pData, file=histFile, sep=";", 
 append=TRUE, col.names=FALSE, 
 row.names = FALSE)
}

The heavy lifting is done in processData(). I think the problem is either the for loop itself or the condition based filtering on the data table.

Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked Apr 5, 2012 at 20:49
\$\endgroup\$
4
  • \$\begingroup\$ Oh, didnt know about something like this existing. Is there a way to move the question? \$\endgroup\$ Commented Apr 5, 2012 at 20:59
  • \$\begingroup\$ Try these tips first: stackoverflow.com/a/8474941/636656 \$\endgroup\$ Commented Apr 5, 2012 at 21:15
  • 1
    \$\begingroup\$ @user - if you want, you can flag the question for a moderator to move it for you. Alternatively, I'd recommend paring your question down and focusing it on a specific part of your code that isn't performing optimally. ?Rprof can help identify which parts of your code are slow. Having some sample data to go with it will also encourage others to try and help! Here's the guide to making a great question: stackoverflow.com/questions/5963269/…. Good luck! \$\endgroup\$ Commented Apr 5, 2012 at 21:36
  • \$\begingroup\$ Too much code. But you should read The R Inferno to get awesome tips on speeding up your loops: burns-stat.com/pages/Tutor/R_inferno.pdf \$\endgroup\$ Commented Apr 5, 2012 at 21:59

1 Answer 1

2
\$\begingroup\$

That's a lot of code to wade through. But after a quick look at your processData function, some things stand out.

# This part:
online[i*2 +1] <- nrow( data[ (data$Tick == i) & (data$Type=="Peer"), ] )
completed[i*2 +1] <- nrow( data[ (data$Tick == i) & (data$Type=="Peer") & (data$End != -1), ] )
# etc ...
# Can be replaced with this:
d <- data[ (data$Tick == i) & (data$Type=="Peer"), ]
online[i*2 +1] <- nrow( d )
completed[i*2 +1] <- nrow( d[(data$End != -1), ] )
# etc...
# And yet again replaced with this:
idx <- (data$Tick == i) & (data$Type=="Peer")
online[i*2 +1] <- sum(idx)
completed[i*2 +1] <- sum( idx & (data$End != -1) )
# etc...

..The basic idea here is to avoid doing the same calculation many times. Extracting data from a data.frame is rather costly, and you seem to only need the number of matches. Indexing using a logical expression will produce a TRUE/FALSE vector, and the rows extracted are the TRUE values, so summing the index vector is the same as the row count...

answered Apr 5, 2012 at 21:45
\$\endgroup\$
0

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.