How can one dodge the encoding problems when reading Stata-data into R?

The dataset I wish to read is a .dta in either Stata 12 or Stata 13 (before Stata introduced support for utf-8 in version 14). Text-variables with Swedish and German letters å, ä, ö, ß, as well as other characters do not import well.

I have tried these answers, read.dta in foreign, the haven package (with no encoding parameters), and now read.dta13 from readstata13, which informs me that it expects Stata files to be encoded in CP1252. But alas, the encoding doesn't work. Should I give up and use a .csv export as a bridge instead, or is it actually possible to read .dta files in R?

Minimal example:
This code downloads the first few lines of my dataset and illustrates the problem, for example in the variable vocation, which contains text in Scandinavian languages.

setwd("~/Downloads/")
system("curl -O http://www.lilljegren.com/stackoverflow/example.stata13.dta", intern=F)
library(haven)  # read_dta() is exported by haven, not by foreign
?read_dta
df1 <- read_dta('example.stata13.dta', encoding="latin1")
df2 <- read_dta('example.stata13.dta', encoding="CP1252")
library(readstata13)
df3 <- read.dta13('example.stata13.dta', fromEncoding="latin1")
df4 <- read.dta13('example.stata13.dta', fromEncoding="CP1252")
df5 <- read.dta13('example.stata13.dta', fromEncoding="utf-8")
vocation <- c("Brandkorpral","Sömmerska","Jungfru","Timmerman","Skomakare","Skräddare","Föreståndare","Platsförsäljare","Sömmerska")
df4$vocation == vocation
# [1] TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
asked Nov 6, 2018 at 15:33
  • csv is probably the best thing to do. Or if you have Stata 14 convert the files to Unicode first and save. Commented Nov 6, 2018 at 16:02
  • This is what I'm fearing. I'm looking at different files Stata builds using enca, but it is not able to guess what encoding they are, and I also have some encoding problems reading the csv-files that Stata generates. Uhhh. Stata really isn't awesome :/ 21st century software without support for utf-8 :( Commented Nov 6, 2018 at 17:20
  • Stata's current version is 15 and as of version 14 supports Unicode. Not sure why you are complaining for features that are not available in software that is two versions behind and no longer supported / maintained. Upgrade? Commented Nov 6, 2018 at 17:49
  • I am poor, and Stata is licensed software; an upgrade would be expensive merely to resolve this encoding problem, which, I think one could argue, shouldn't have to belong to our decade. But duly noted: I was grumpy. :) Besides, the correct encoding was "macroman", and I found out by going through the csv solution, as you suggested, so thank you. Commented Nov 7, 2018 at 8:49

1 Answer

The correct encoding for reading files generated on Macs by Stata versions prior to 14 is "macroman":

df <- read.dta13('example.stata13.dta', fromEncoding="macroman")

On my Mac, both .dta-files in stata13 and stata12 formats (saved by saveold in Stata 13) imported nicely like this.
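To see why "CP1252" gave FALSE for every vocation containing ö, here is a minimal sketch that decodes the same bytes under both assumptions. The byte vector is a hand-built stand-in for what a Mac build of Stata 13 would store, not something read from the example file, and the encoding name "macintosh" (the IANA name for MacRoman) is my assumption about what the local iconv accepts; on my Mac "macroman" worked as well.

```r
# Hand-built bytes for "Sömmerska" in MacRoman, where ö is the single byte 0x9A.
# (Illustrative stand-in; not read from example.stata13.dta.)
mac_bytes  <- as.raw(c(0x53, 0x9A, 0x6D, 0x6D, 0x65, 0x72, 0x73, 0x6B, 0x61))
raw_string <- rawToChar(mac_bytes)

# Decode the same bytes twice, once per assumed source encoding.
as_macroman <- iconv(raw_string, from = "macintosh", to = "UTF-8")
as_cp1252   <- iconv(raw_string, from = "CP1252",    to = "UTF-8")

expected <- "S\u00f6mmerska"  # "Sömmerska" written with a Unicode escape

as_macroman == expected  # does the MacRoman reading reproduce the word?
as_cp1252   == expected  # does the CP1252 reading reproduce it?
```

Byte 0x9A is a different letter in CP1252 than in MacRoman, so only one of the two comparisons should hold. If iconv reports an unsupported conversion, iconvlist() shows the names available on that platform.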

The readstata13 manual is presumably right to assume "CP1252" on other platforms. For me, however, "macroman" did the trick (also for the .csv files that Stata 13 generated with export delimited).
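For the .csv route, a rough sketch of the same idea: read the lines as raw text, convert them from MacRoman to UTF-8, and only then parse. The temporary file below is a hypothetical one-row stand-in for a Stata 13 export, and "macintosh" is again an assumed iconv name for MacRoman. Passing fileEncoding to read.csv is the more direct option; the explicit iconv step is used here only because it does not depend on the session locale.

```r
# Hypothetical stand-in for a CSV exported by Stata 13 on a Mac:
# a header line plus one row, with ö stored as MacRoman byte 0x9A.
csv_bytes <- c(
  charToRaw("vocation\n"),
  as.raw(c(0x53, 0x9A)), charToRaw("mmerska\n")
)
csv_path <- tempfile(fileext = ".csv")
writeBin(csv_bytes, csv_path)

# Read the lines untouched, then convert them to UTF-8 before parsing.
mac_lines  <- readLines(csv_path, warn = FALSE)
utf8_lines <- iconv(mac_lines, from = "macintosh", to = "UTF-8")

# Parse the converted text as CSV.
df_csv <- read.csv(text = utf8_lines, stringsAsFactors = FALSE)

df_csv$vocation == "S\u00f6mmerska"  # compare against "Sömmerska"
```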

Nick Cox
answered Nov 7, 2018 at 8:52

1 Comment

Note that you make no mention whatsoever in your question that you are using a Mac. Which is probably why nobody answered.
