9

One of the things Stata does well is the way it constructs new variables (see example below). How to do this in R?

foreach i in A B C D {
    forval n = 1990/2000 {
        local m = `n' - 1
        * create new columns from existing ones on-the-fly
        generate pop`i'`n' = pop`i'`m' * (1 + trend`n')
    }
}
asked Feb 17, 2011 at 2:25
5
  • 5
    For those who don't speak Stata, maybe add what the final output should look like? And the input data, for that matter... Commented Feb 17, 2011 at 2:29
  • I'm wondering what idiot designer of a statistical package decided that 1990/2000 was a range rather than a division facepalm Commented Feb 17, 2011 at 15:25
  • 2
    @Spacedman: You don't know the half of it. I used Stata for 3 years. Worst. Programming. Language. Ever. Commented Feb 17, 2011 at 15:29
  • @Joshua : May I kindly agree :-) But it has to be said, it is quite a powerful statistical package. You just shouldn't be dreaming about anything else but scripting your analysis. Commented Feb 17, 2011 at 15:40
  • 2
    @Joris: Though I didn't explicitly say so, I agree that Stata has a lot of statistical capability. That's why I was careful to specifically say programming in Stata is terrible. ;-) Commented Feb 17, 2011 at 15:45

4 Answers 4

15

DON'T do it in R. The reason it's messy is that it's UGLY code. Constructing lots of variables with programmatic names is a BAD THING. Names are names. They have no structure, so do not try to impose one on them. Decent programming languages have structures for this; rubbishy programming languages have tacked-on 'Macro' features and end up with this awful pattern of constructing variable names by pasting strings together. This is a practice from the 1970s that should have died out by now. Don't be a programming dinosaur.

For example, how do you know how many popXXXX variables you have? How do you know if you have a complete sequence of pop1990 to pop2000? What if you want to save the variables to a file to give to someone? Yuck, yuck, yuck.

Use a data structure that the language gives you. In this case probably a list.
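
As a rough sketch of that advice (not part of the original answer): keep one numeric vector per population in a single named list, with one element per year. The starting values and the trend vector below are made up for illustration.

```r
# Hypothetical inputs: 1989 populations and one growth rate per year 1990-2000
pop1989 <- c(A = 100, B = 200, C = 300, D = 400)
years   <- 1989:2000
set.seed(1)
trend   <- setNames(runif(11, -0.1, 0.1), 1990:2000)

# One named vector per population, indexed by year, all held in one list
pop <- lapply(pop1989, function(start) {
  setNames(start * cumprod(c(1, 1 + trend)), years)
})

length(pop)      # how many populations there are
names(pop$A)     # which years population A covers
pop$B["1995"]    # population B in 1995
```

With this layout the questions above have direct answers: length(pop) counts the populations, names(pop$A) lists the years, and the whole list can be written out with a single save() or saveRDS() call.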

answered Feb 17, 2011 at 8:05

9

Both Spacedman and Joshua have very valid points. As Stata has only one dataset in memory at any given time, I'd suggest adding the variables to a data frame (which is also a kind of list) instead of to the global environment (see below).

But honestly, the more R-ish way to do so is to keep your factors factors instead of variable names.

I make some data as I believe it is in your R version now (at least, I hope so...)

Data <- data.frame(
 popA1989 = 1:10,
 popB1989 = 10:1,
 popC1989 = 11:20,
 popD1989 = 20:11
)
Trend <- replicate(11,runif(10,-0.1,0.1))

You can then use the stack() function to obtain a long data frame, and split the old column names into a pop variable and a numeric year variable:

newData <- stack(Data)
newData$pop <- substr(newData$ind,4,4)
newData$year <- as.numeric(substr(newData$ind,5,8))
newData$ind <- NULL

Filling up the data frame is then quite easy:

for(i in 1:11){
  # rows for the previous year (1989 when i = 1)
  tmp <- newData[newData$year==(1988+i),]
  newData <- rbind(newData,
    data.frame(values = tmp$values*(1+Trend[,i]), # grow by that year's trend
               pop = tmp$pop,
               year = tmp$year+1
    )
  )
}

In this format, you'll find most R commands (selections of some years, of a single population, modelling effects of either or both, ...) a whole lot easier to perform later on.
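
To make that concrete, here is a small sketch of the kind of selections meant. The data frame long and its values are invented for the example; only the column names follow the answer.

```r
# Hypothetical long-format data: one row per population and year
long <- expand.grid(pop = c("A", "B"), year = 1989:1991,
                    stringsAsFactors = FALSE)
long$values <- c(10, 20, 11, 22, 12, 24)

# Select a single population
onlyA <- subset(long, pop == "A")

# Select some years
recent <- subset(long, year >= 1990)

# Summarise per population (mean over all years)
means <- aggregate(values ~ pop, data = long, FUN = mean)
```

None of these steps needs to build or parse a column name; the population and the year are ordinary variables that can be filtered on or used in a model formula.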

And if you insist, you can still create a wide format with unstack()

unstack(newData,values~paste("pop",pop,year,sep=""))

Adaptation of Joshua's answer to add the columns to the data frame:

for(L in LETTERS[1:4]) {
  for(i in 1990:2000) {
    new <- paste("pop",L,i,sep="") # create name for new variable
    old <- get(paste("pop",L,i-1,sep=""),Data) # get old variable
    trend <- Trend[,i-1989] # get trend variable
    Data <- within(Data,assign(new, old*(1+trend)))
  }
}
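
For comparison, a sketch of the same loop written with [[ ]] indexing, which avoids get(), assign() and within() altogether. This is an alternative phrasing, not the answer's own code; the setup lines repeat the made-up Data and Trend from earlier so the snippet runs on its own.

```r
# Same illustrative starting data as in the answer
Data <- data.frame(popA1989 = 1:10, popB1989 = 10:1,
                   popC1989 = 11:20, popD1989 = 20:11)
Trend <- replicate(11, runif(10, -0.1, 0.1))  # column k = trend for year 1989 + k

for (L in LETTERS[1:4]) {
  for (i in 1990:2000) {
    new <- paste0("pop", L, i)      # name of the column to create
    old <- paste0("pop", L, i - 1)  # name of the previous year's column
    Data[[new]] <- Data[[old]] * (1 + Trend[, i - 1989])
  }
}
```

Looking columns up by name with Data[[old]] keeps everything inside the data frame, so nothing leaks into or depends on the global environment.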
answered Feb 17, 2011 at 15:39

2 Comments

Can you explain what you mean by "keep your factors factors instead of variable names"?
@KevinM That's the difference between "long format" and "wide format". You put all data in a single column, and use a factor or categorical variable to describe which data is from which population and year. If you use your variable names to indicate which year and population we're talking about, you'll have more difficulty using that information. Both population and year are categorical variables in terms of statistical analysis. So I keep them as a categorical variable (factor) instead of combining them to construct variable names.
3

Assuming popA1989, popB1989, popC1989, popD1989 (and trend1990 through trend2000) already exist in your global environment, the code below should work. There are certainly more "R-like" ways to do this, but I wanted to give you something similar to your Stata code.

for(L in LETTERS[1:4]) {
  for(i in 1990:2000) {
    new <- paste("pop",L,i,sep="") # create name for new variable
    old <- get(paste("pop",L,i-1,sep="")) # get old variable
    trend <- get(paste("trend",i,sep="")) # get trend variable
    assign(new, old*(1+trend))
  }
}
answered Feb 17, 2011 at 3:21


1

Assuming you have the 1989 population data in a vector pop1989 (one value per population) and the yearly trend data in a vector trend:

require(stringr) # because str_c has a better default for the sep parameter
dta <- kronecker(pop1989,cumprod(1+trend))
names(dta) <- kronecker(str_c("pop",LETTERS[1:4]),1990:2000,str_c)
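
To show the shape of the result, here is a base-R sketch that swaps kronecker() and str_c() for outer() and paste0(). It is intended to mirror the ordering of the code above (all years for A, then all years for B, and so on); the input values are invented for the example.

```r
# Hypothetical inputs: four 1989 populations (A to D) and a flat 10% yearly trend
pop1989 <- c(100, 200, 300, 400)
trend   <- rep(0.1, 11)              # one rate per year, 1990 to 2000

growth <- cumprod(1 + trend)         # cumulative growth factor per year

# 11 x 4 matrix (years x populations), flattened column by column
dta <- as.vector(outer(growth, pop1989))
names(dta) <- paste0("pop",
                     rep(LETTERS[1:4], each = 11),
                     rep(1990:2000, times = 4))

dta["popA1990"]
dta["popD2000"]
```

The cumulative product does the work of the year-by-year loop: each year's value is the 1989 value times the product of (1 + trend) up to that year.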
answered Feb 20, 2011 at 7:19
