I retrieved a number of text records from my PostgreSQL database and intend to preprocess these text documents before analyzing them.
I want to tokenize the documents, but I ran into a problem while tokenizing:
# ...some other regex replacements above...
# toTokens is the text string being tokenized
toTokens = self.regexClitics1.sub(r" \1", toTokens)
toTokens = self.regexClitics2.sub(r" \1 \2", toTokens)
toTokens = str.strip(toTokens)  # this line raises the error
The error is: TypeError: descriptor 'strip' requires a 'str' object but received a 'unicode'. I'm curious: why does this error occur when the encoding of the database is UTF-8?
asked Jun 23, 2011 at 6:59
goh
1 Answer
Why don't you just use toTokens.strip()? There is no need to call str.strip directly.
There are two string types in Python 2, str and unicode. Look at this for an explanation.
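To see the distinction in runnable form, here is a minimal sketch. It uses Python 3's bytes/str pair as a stand-in for Python 2's str/unicode, since the same rule applies: a method accessed through a type only accepts instances of that exact type, while a method called on the object itself dispatches correctly.

```python
# In Python 2, str.strip is an unbound method of the str type, so it only
# accepts str instances -- passing a unicode object raises the TypeError
# from the question. Python 3's bytes type shows the same rule:
data = b"  hello  "            # stands in for the "other" string type

# Calling the method on the instance dispatches to the right type:
print(data.strip())            # b'hello'

# Calling it through the wrong type's descriptor fails:
try:
    str.strip(data)
except TypeError as err:
    print("TypeError:", err)
```

The fix suggested above, toTokens.strip(), works for both string types because the method is looked up on the object itself rather than on one specific type.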
answered Jun 23, 2011 at 7:13
Samuel
3 Comments
Eric O. Lebigot
+1. A shorter explanation can be found on StackOverflow: stackoverflow.com/questions/4545661/… (shameless plug). :)
goh
Does that mean that the strings I get from my queries are unicode? Why is that so?
Samuel
@amateur It seems so. It's strange, because AFAIK psycopg returns str objects unless instructed to do otherwise, but I can't know without more information about your setup.
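One common way psycopg2 ends up returning unicode on Python 2 is that the unicode type adapters were registered somewhere in the codebase. A minimal configuration sketch, assuming psycopg2 and a placeholder DSN (the connection string and column names here are hypothetical):

```python
import psycopg2
import psycopg2.extensions

# Registering these adapters makes psycopg2 decode text columns to unicode;
# without them, psycopg2 on Python 2 returns str by default.
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

# "dbname=test user=me" is a placeholder DSN, not from the question
conn = psycopg2.connect("dbname=test user=me")
cur = conn.cursor()
cur.execute("SELECT some_text_column FROM documents")
row = cur.fetchone()   # row[0] would now be a unicode object
```

Checking whether such a register_type call exists in the application (or in a framework it uses) would explain why the query results are unicode rather than str.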