What exactly do "u" and "r" string prefixes do, and what are raw string literals?

Question 1

While asking this question, I realized I didn't know much about raw strings. For somebody claiming to be a Django trainer, this sucks.

I know what an encoding is, and I know what u'' alone does since I understand what Unicode is.

But what does r'' do exactly? What kind of string does it result in?
And above all, what the heck does ur'' do?
Finally, is there any reliable way to go back from a Unicode string to a simple raw string?
Ah, and by the way, if your system and your text editor charset are set to UTF-8, does u'' actually do anything?

Question 2

This question is similar to: What's the difference between r'string' and normal 'string' in Python? and What's the u prefix in a Python string? Close voters, please vote to close as a duplicate of the second one, since I already voted to close as a duplicate of the first one.

Question 3

This question is similar but not a duplicate. It asks for broader information.

Question 4

There's not really any "raw string"; there are raw string literals, which are exactly the string literals marked by an r before the opening quote.

A "raw string literal" is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning "just a backslash" (except when it comes right before a quote that would otherwise terminate the literal) -- no "escape sequences" to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.
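A quick way to see the difference for yourself is the sketch below. It assumes Python 3, and the variable names are only illustrative:

```python
# Assumes Python 3; names are illustrative only.
normal = 'a\nb'   # backslash + n becomes one newline character
raw = r'a\nb'     # backslash and n stay as two separate characters

print(len(normal))                  # 3
print(len(raw))                     # 4
print('\n' in normal, '\n' in raw)  # True False
```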

This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the "except" clause above doesn't matter) and it looks a bit better when you avoid doubling up each of them -- that's all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that's very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the "except" clause above).
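To illustrate the "imperfect" part for Windows paths: a raw literal cannot end in a single backslash, so one possible workaround (a sketch, assuming Python 3 and a made-up path) is to append the trailing backslash from a normal literal:

```python
# Assumes Python 3; the path is a made-up example.
folder = r'C:\temp\logs' + '\\'            # raw part plus a normal literal for the last backslash
same = 'C:/temp/logs/'.replace('/', '\\')  # or write forward slashes and convert them

print(folder == same)         # True
print(folder.endswith('\\'))  # True
```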

r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).

Not sure what you mean by "going back" - there is no intrinsically back and forward directions, because there's no raw string type, it's just an alternative syntax to express perfectly normal string objects, byte or unicode as they may be.
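As a small check of the "no raw string type" point, here is a sketch assuming Python 3 (where both literals produce str; in Python 2 they would both be byte strings):

```python
# Assumes Python 3; both spellings build the very same kind of object.
raw = r'\d+\.\d+'
doubled = '\\d+\\.\\d+'

print(type(raw) is type(doubled))  # True
print(raw == doubled)              # True
```

Once the literal has been evaluated, nothing records which spelling was used.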

And yes, in Python 2.*, u'...' is of course always distinct from just '...' -- the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.

E.g., consider (Python 2.6):

>>> import sys
>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34

The Unicode object of course takes more memory space (very small difference for a very short string, obviously ;-).

Question 5

Understanding "r" doesn't imply any type or encoding issues; it's much simpler.

Question 6

Note that ur"C:\foo\unstable" will fail in Python 2 because \u is still treated as a Unicode escape sequence in ur mode. Plain r mode does not process \u.

Question 7

Note that u and r are not commutative: ur'str' works, ru'str' doesn't. (at least in ipython 2.7.2 on win7)

Question 8

Just tested r strings and noticed that if \ is the last character it will not be taken as a literal but instead escapes the closing quote, causing SyntaxError: EOL while scanning string literal. So \\ still must be used for the final instance of \ in any strings ending with a backslash.

Question 9

python 3.x - sys.getsizeof('cioa') == sys.getsizeof(r'cioa') == sys.getsizeof(u'cioa') (Ubuntu 16.04 with UTF8 lang). Similarly, type('cioa') == type(r'cioa') == type(u'cioa'). BUT, the raw string interpolation makes a difference, so sys.getsizeof('\ncioa') == sys.getsizeof(u'\ncioa') != sys.getsizeof(r'\ncioa')

Question 10

There are two types of string in Python 2: the traditional str type and the newer unicode type. If you type a string literal without the u in front you get the old str type which stores 8-bit characters, and with the u in front you get the newer unicode type that can store any Unicode character.

The r doesn't change the type at all, it just changes how the string literal is interpreted. Without the r, backslashes are treated as escape characters. With the r, backslashes are treated as literal. Either way, the type is the same.

ur is of course a Unicode string where backslashes are literal backslashes, not part of escape codes.

You can try to convert a Unicode string to an old string using the str() function, but if there are any unicode characters that cannot be represented in the old string, you will get an exception. You could replace them with question marks first if you wish, but of course this would cause those characters to be unreadable. It is not recommended to use the str type if you want to correctly handle unicode characters.
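The answer above is about Python 2's str(). A rough Python 3 analogue of the same idea, offered only as a sketch, is encoding to ASCII: the strict attempt raises, and errors='replace' substitutes question marks:

```python
# Assumes Python 3; this mirrors the Python 2 str() behaviour only loosely.
text = 'caf\u00e9'  # 'café'

try:
    text.encode('ascii')      # strict by default
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

lossy = text.encode('ascii', errors='replace')

print(strict_ok)  # False
print(lossy)      # b'caf?'
```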

Question 11

Backslashes are not treated as literal in raw string literals, which is why r"\" is a syntax error.

Question 12

Only applies to Python 2.

Question 13

@PaulMcG print(r"\") gives error in Python3 too : SyntaxError: EOL while scanning string literal

Question 14

'raw string' means it is stored as it appears. For example, '\' is just a backslash instead of an escape character.

Question 15

...unless it's the last character of the string, in which case it does escape the closing quote.

Question 16

Let me explain it simply: In python 2, you can store string in 2 different types.

The first one is the str type, which holds bytes: each character uses 1 byte of memory (256 possible values, enough for the English alphabet and simple symbols).

The 2nd type is UNICODE which is unicode type in python. Unicode stores all types of languages.

By default, python will prefer str type but if you want to store string in unicode type you can put u in front of the text like u'text' or you can do this by calling unicode('text')

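So u is roughly a shorthand for getting a unicode object instead of a str (strictly speaking the literal is created as unicode directly, no function is called at run time). That's it!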

Now the r part, you put it in front of the text to tell the computer that the text is raw text, backslash should not be an escaping character. r'\n' will not create a new line character. It's just plain text containing 2 characters.

If you want to convert str to unicode and also put raw text in there, use ur because ru will raise an error.

NOW, the important part:

You cannot store one backslash by using r, it's the only exception. So this code will produce error: r'\'

To store a backslash (only one) you need to use '\\'

If you want to store more than one backslash you can still use r: r'\\' will produce 2 backslashes, as you would expect.

I don't know why r doesn't work for storing a single backslash, and I haven't seen the reason described yet. I hope that it is a bug.

Question 17

You will notice that not only is r'\' illegal, you can't even put a single '\' at the tail of any raw string. For example, r'xxxxxx\' is an illegal string.

Question 18

what about python 3 ?

Question 19

@Krissh All python 3 strings are Unicode supported. Its type will be str. Read more for better understanding here: medium.com/better-programming/…

Question 20

r'\' gives a SyntaxError: unterminated string literal as intended, and noted in: docs.python.org/3/reference/…: Even in a raw literal, quotes can be escaped with a backslash, but the backslash remains in the result... This was also pointed out in another answer by @Jeyekomon.

Question 21

Why can’t raw strings (r-strings) end with a backslash? (cite: More precisely, they can’t end with an odd number of backslashes: the unpaired backslash at the end escapes the closing quote character, leaving an unterminated string.)
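The odd/even rule can be checked without typing the broken literals directly. The sketch below assumes Python 3; is_valid_literal is a hypothetical helper, not a standard function. It builds the source text of r"\", r"\\", and so on, and asks compile() whether each one parses:

```python
# Assumes Python 3. `is_valid_literal` is a hypothetical helper for illustration.
def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, '<literal>', 'eval')
    except SyntaxError:
        return False
    return True

backslash = chr(92)
# Source text of a raw literal containing n backslashes and nothing else.
results = {n: is_valid_literal('r"' + backslash * n + '"') for n in range(1, 5)}

print(results)  # {1: False, 2: True, 3: False, 4: True}
```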

Question 22

A "u" prefix denotes the value has type unicode rather than str.

Raw string literals, with an "r" prefix, escape any escape sequences within them, so len(r"\n") is 2. Because they escape escape sequences, you cannot end a string literal with a single backslash: that's not a valid escape sequence (e.g. r"\").

"Raw" is not part of the type, it's merely one way to represent the value. For example, "\\n" and r"\n" are identical values, just like 32, 0x20, and 0b100000 are identical.

You can have unicode raw string literals:

>>> u = ur"\n"
>>> print type(u), len(u)
<type 'unicode'> 2

The source file encoding just determines how to interpret the source file, it doesn't affect expressions or types otherwise. However, it's recommended to avoid code where an encoding other than ASCII would change the meaning:

Files using ASCII (or UTF-8, for Python 3.0) should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals.

Question 23

Unicode string literals

Unicode string literals (string literals prefixed by u) are no longer used in Python 3. They are still valid but just for compatibility purposes with Python 2.

Raw string literals

If you want to create a string literal consisting of only easily typable characters like English letters or numbers, you can simply type them: 'hello world'. But if you also want to include some more exotic characters, you'll have to use some workaround.

One of the workarounds is escape sequences. This way you can for example represent a new line in your string simply by adding two easily typable characters \n to your string literal. So when you print the 'hello\nworld' string, the words will be printed on separate lines. That's very handy!

On the other hand, sometimes you might want to include the actual characters \ and n into your string – you might not want them to be interpreted as a new line. Look at these examples:

'New updates are ready in c:\windows\updates\new'
'In this lesson we will learn what the \n escape sequence does.'

In such situations you can just prefix the string literal with the r character like this: r'hello\nworld' and no escape sequences will be interpreted by Python. The string will be printed exactly as you created it.

Raw string literals are not completely "raw"?

Many people expect the raw string literals to be raw in a sense that "anything placed between the quotes is ignored by Python". That is not true. Python still recognizes all the escape sequences, it just does not interpret them - it leaves them unchanged instead. It means that raw string literals still have to be valid string literals.

From the lexical definition of a string literal:

string ::= "'" stringitem* "'"
stringitem ::= stringchar | escapeseq
stringchar ::= <any source character except "\" or newline or the quote>
escapeseq ::= "\" <any source character>

It is clear that string literals (raw or not) containing a bare quote character: 'hello'world' or ending with a backslash: 'hello world\' are not valid.

Question 24

Maybe this is obvious, maybe not, but you can make the string '\' by calling x=chr(92)

Python2

x=chr(92)
print type(x), len(x) # <type 'str'> 1
y='\\'
print type(y), len(y) # <type 'str'> 1
x==y # True
x is y # False

Python3 (3.11.1)

x=chr(92)
print(type(x), len(x)) # <class 'str'> 1
# Note The Type Change To Class
y='\\'
print(type(y), len(y)) # <class 'str'> 1
# Note The Type Change To Class
x==y # True
x is y # True
# Note this is now True

Question 25

x is y evaluates to True in python3?

Question 26

@HabeebPerwad, that is because of string interning. You should never rely on the fact that x is y happens to evaluate to True because of interning. Instead use x == y (unless you really are checking whether x and y are exactly the same object stored at a single memory position, that is).
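A small sketch of the point about == versus is, assuming Python 3 (the variable names are illustrative). Only the value comparisons are guaranteed; the identity result depends on the interpreter and is deliberately not shown:

```python
# Assumes Python 3; names are illustrative.
x = chr(92)
y = '\\'
print(x == y)            # True: same value

built = ''.join(['ab', 'c'])   # string assembled at run time
literal = 'abc'
print(built == literal)  # True: same value
# `built is literal` may be True or False depending on the interpreter,
# so it should not be relied on.
```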

Question 27

r-strings by example (raw strings)

Just to provide a slightly more "example-oriented" pedagogy, with an eye out for the edge cases:

Syntax	Meaning	Note
`'a\nb'`	`a`, `\n`, `b`	"Regular" string with a newline
`r'a\nb'`	`a`, `\`, `n`, `b`	`r` string: `\` does not create magic characters anymore
`r'a\\b'`	`a`, `\`, `\`, `b`	`\` doesn't even escape itself, we get two `\`
`r'a\'b'`	`a`, `\`, `'`, `b`	The only thing that `\` "escapes" in r-strings is `'` itself. But the `\` and `'` still appear in the string.
`r"a\"b"`	`a`, `\`, `"`, `b`	Double quotes is analogous.
`r'a'b'`	Syntax error	Unbalanced single quotes, we'd need the `\` like above
`r"a'b"`	`a`, `'`, `b`	We can get a single quote without `\` by using `"` instead
`r'''a'"b'''`	`a`, `'`, `"`, `b`	We can get both `'` and `"` in by using triple quotes
`'a\'\'\'"""b'`	`a`, `'`, `'`, `'`, `"`, `"`, `"`, `b`	It is impossible to have both triple `'` and triple `"` in a raw string without some backslash escaping: https://stackoverflow.com/questions/4630465/how-to-include-a-double-quote-and-or-single-quote-character-in-a-raw-python-stri

The easiest way to play with this yourself is to convert the input string literal to a list of characters with the list() function as mentioned at How do I split a string into a list of characters? e.g.:

>>> list(r'a\nb')
['a', '\\', 'n', 'b']

Application of r-strings: it removes the need to escape \, common in regexes

E.g. if you want to match ISO dates yyyy-mm-dd, without r, the cleanest way would be to write:

re.compile('\\d{4}-\\d{2}-\\d{2}')

because the \ has to be present in the final string seen by regexp. It would also actually work in certain Python versions if you did just:

re.compile('\d{4}-\d{2}-\d{2}')

because \d is not a valid escape sequence and gets interpreted as \ + d. But that is confusing, as it is hard to remember what is a valid escape or not (\a, \b, \f, \n, \r, \t, \v are valid, what in the name is a "Vertical Tab"???), and Python 3.12 already gives a warning if you do that:

<stdin>:1: SyntaxWarning: invalid escape sequence '\d'

see also: How to fix "SyntaxWarning: invalid escape sequence" in Python?
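If you want to observe that warning programmatically, something like the following sketch might do. It assumes a Python 3 version where an unknown escape is still only a warning (DeprecationWarning on older releases, SyntaxWarning from 3.12), so the category name is printed rather than assumed:

```python
# Assumes Python 3 where invalid escapes warn instead of failing.
import warnings

source = "'" + chr(92) + "d'"   # source text of the non-raw literal '\d'

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')          # make sure the warning is not filtered out
    value = eval(compile(source, '<example>', 'eval'))

print(value == chr(92) + 'd')                 # the backslash was kept in the value
print([w.category.__name__ for w in caught])  # category depends on the Python version
```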

So with r-string we can write the simpler:

re.compile(r'\d{4}-\d{2}-\d{2}')

which is much more readable and sane.
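For completeness, here is the r-string pattern in use. This is a sketch assuming Python 3; ISO_DATE and the sample sentence are made up for illustration:

```python
# Assumes Python 3; ISO_DATE and the sample text are illustrative.
import re

ISO_DATE = re.compile(r'\d{4}-\d{2}-\d{2}')
match = ISO_DATE.search('Answered on 2010-01-17 at 16:38')

print(match.group())                               # 2010-01-17
print(ISO_DATE.pattern == '\\d{4}-\\d{2}-\\d{2}')  # True: same pattern as the doubled form
```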

The downside of r strings is that you then can't have magic characters like newline in your string. But these are not very common in regular expressions.
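One possible workaround when you do need both, sketched here for Python 3 with a made-up path: adjacent string literals are concatenated by the compiler, and a raw literal can sit next to a normal one:

```python
# Assumes Python 3; the path is a made-up example.
line = r'C:\new\table' '\n'   # raw literal followed by a normal literal

print(len(line))            # 13: twelve path characters plus one newline
print(line.endswith('\n'))  # True
print(line.count('\\'))     # 2: both backslashes survived
```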

u strings are the default in Python 3 (Unicode strings)

In Python 3, 'abc' is the same as u'abc', and the u syntax exists just to help with code backward compatibility and is never needed.

And to get a Python 2 'abc' (byte string), you have to do b'abc' in Python 3.

See also: What's the u prefix in a Python string?

Syntax	Python version	Meaning
`'abc'`	2	Byte string
`u'abc'`	2	Unicode string
`b'abc'`	2	Didn't exist
`'abc'`	3	Unicode string
`u'abc'`	3	Unicode string
`b'abc'`	3	Byte string
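The Python 3 rows of the table can be checked with a few lines. This is only a sketch and assumes a Python 3 interpreter:

```python
# Assumes Python 3.
text = 'abc'
data = b'abc'

print(u'abc' == text)                # True: the u prefix changes nothing
print(type(u'abc') is str)           # True
print(type(data) is bytes)           # True
print(isinstance(data, str))         # False: bytes are a separate type
print(data.decode('ascii') == text)  # True once decoded
```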

Unicode string vs byte string

Byte strings can only have "ASCII characters" (more precisely, single byte values 0-255), while the Unicode string can have any Unicode character.

For example, in Python 3, if we play around with é, an 'e' with an acute accent present e.g. in French and Portuguese, and encoded as two bytes in UTF-8 0xC3 + 0xA9 we get:

>>> list('aéi')
['a', 'é', 'i']
>>> list(b'aéi')
 File "<stdin>", line 1
 list(b'aéi')
 ^^^^^^
SyntaxError: bytes can only contain ASCII literal characters
>>> list(map(lambda x: hex(x), list(bytes('aéi', 'utf8'))))
['0x61', '0xc3', '0xa9', '0x69']

so we see that:

'aéi' contains three Unicode characters. Doing e.g. 'aéi'[1] gives é as intuitively expected
bytes('aéi', 'utf8') contains four bytes, because the é is made up of two bytes

What usually happens is that when you read a file in binary mode with:

f = open("myfile.txt", "rb")
b = f.read()

what you get from it are bytes, not the Unicode string, because the filesystem does not know about encodings, it just provides bytes. (In Python 3, text mode "r" would instead decode the bytes for you using a default encoding.)

Then, using either some external knowledge, or some information contained in the file itself, you decide its encoding, e.g. UTF-8.

Once you have decided the encoding, you convert the bytes to a Unicode string with the .decode method:

u = b'\x61\xc3\xa9\x69'.decode('utf8')

which gives us the Unicode string:

aéi

and then you continue on your merry way doing operations on the Unicode string, which is almost always the level that you want to operate on text, since you generally want to operate on characters rather than parts of characters. E.g. you might want to replace the acute accent with a grave accent. Strings are immutable, so item assignment such as u[1] = 'è' raises a TypeError; you build a new string instead with:

u = u[:1] + 'è' + u[2:]

which gives:

aèi

And then to write it back to a file, you first have to convert it back to a stream of bytes with:

u.encode()
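Putting the whole round trip together, here is a sketch assuming Python 3 and UTF-8. The file name is hypothetical and a temporary directory is used so nothing is left behind:

```python
# Assumes Python 3 and UTF-8; the file name is hypothetical.
import os
import tempfile

text = 'a\u00e9i'             # 'aéi'
data = text.encode('utf-8')   # four bytes: the é takes two

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'myfile.txt')
    with open(path, 'wb') as f:                   # write raw bytes
        f.write(data)
    with open(path, 'rb') as f:                   # binary mode gives bytes back
        raw_bytes = f.read()
    with open(path, 'r', encoding='utf-8') as f:  # text mode decodes for you
        decoded = f.read()

print(len(raw_bytes), len(decoded))  # 4 3
print(decoded == text)               # True
```

Passing encoding explicitly avoids depending on whatever the platform default happens to be.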

Question 28

The ASCII BELL (\a) can be found with r'\\a' in a regex to prevent it from being seen as a group, same for backspace (\b) and a few others. And all the others are covered under \s (whitespace). When the third party regex module is used, you can even capture things like backspace and BELL with the POSIX [:print:] group

Accepted answer by Alex Martelli · 2010-01-17 16:38:39Z

