
I retrieved a set of text records from my PostgreSQL database and intend to preprocess these documents before analyzing them.

I want to tokenize the documents, but I ran into a problem while tokenizing:

 # some other regex replacements omitted
 # toTokens is the text string
 toTokens = self.regexClitics1.sub(r" \1", toTokens)
 toTokens = self.regexClitics2.sub(r" \1 \2", toTokens)
 toTokens = str.strip(toTokens)

The error is TypeError: descriptor 'strip' requires a 'str' object but received a 'unicode'. I'm curious why this error occurs when the encoding of the database is UTF-8.
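In Python 2, the unbound descriptor str.strip type-checks its first argument, so passing a unicode object raises exactly this TypeError regardless of the database encoding. The same mechanics can be reproduced in Python 3 with str vs bytes (a sketch; the sample text is illustrative, not from the original setup):

```python
# Unbound descriptor: str.strip type-checks its first argument.
# In Python 3, passing bytes instead of str fails the same way that
# passing unicode to str.strip failed in Python 2.
try:
    str.strip(b" tokenized output ")
except TypeError as exc:
    print("TypeError:", exc)

# Calling the method on the object itself dispatches on the
# object's own type, so it works for either string type:
toTokens = " tokenized output "
print(repr(toTokens.strip()))
```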

asked Jun 23, 2011 at 6:59

1 Answer


Why don't you use toTokens.strip()? There's no need to go through the str type.

There are two string types in Python 2, str and unicode. Look at this for an explanation.
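In concrete terms, the fix is just to call the method on the value itself (a minimal sketch with a made-up sample string):

```python
# Works whether toTokens is str or unicode, because the strip method
# is looked up on the object's own type rather than forced through str.
toTokens = u"  some clitic-split text  "
toTokens = toTokens.strip()
print(toTokens)
```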

answered Jun 23, 2011 at 7:13

3 Comments

+1. A shorter explanation can be found on StackOverflow: stackoverflow.com/questions/4545661/… (shameless plug). :)
Does that mean the strings I get from my queries are unicode? Why is that?
@amateur It seems so. It's strange, because AFAIK psycopg returns str objects unless instructed to do otherwise, but I can't know without more information about your setup.
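If the driver's return type is uncertain, one defensive option is to normalize values before tokenizing. A minimal sketch in modern Python 3 terms (ensure_text is a hypothetical helper name, not part of psycopg):

```python
def ensure_text(value, encoding="utf-8"):
    """Decode driver-returned byte strings; pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

print(ensure_text(b"caf\xc3\xa9"))      # bytes from a UTF-8 database column
print(ensure_text(u"already text"))     # text passes through untouched
```

Alternatively, psycopg2 can be told to always return unicode by registering its UNICODE typecaster (psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)), so all text columns come back as one consistent type.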
