[Python-Dev] Python-3.0, unicode, and os.environ

Thu Dec 4 23:47:52 CET 2008

* Adam Olsen wrote: 
> On Thu, Dec 4, 2008 at 2:09 PM, André Malo <nd at perlig.de> wrote:

> > Here's an example which will become popular soon, I guess: CGI scripts
> > and, of course WSGI applications. All those get their environment in an
> > unknown encoding. In the worst case one can blow up the application by
> > simply sending strange header lines over the wire. But there's more:
> > consider running the server in C locale, then probably even a single 8
> > bit char might break something (?).
>
> I think that's an argument that the framework should reencode all
> input text into the correct system encoding before passing it on to
> the CGI script or WSGI app. If the framework doesn't have a clear way
> to determine the client's encoding then it's all just gibberish
> anyway. A HTTP 400 or 500 range error code is appropriate here.

Duh.

See, you're already mixing different encodings and creating issues here!
You're talking about the client encoding (whatever that is) and the correct
system encoding (whatever that is, too) in the same paragraph and assuming
they are the same or compatible.

There are several points here:
- there is no clear way to get a single client encoding for the whole HTTP 
 transaction (headers + body), because *there is none*. If the whole 
 header set matches the same encoding, it's more or less luck.
- there is no correct system encoding either. As I said, I prefer running my 
 servers in C locale, so it's all ASCII. In fact, it shouldn't matter. The 
 locale should not have anything to do with an application called over the 
 network.
- A 400 or 500 response for a header containing something like my name is 
 not appropriate.
- Octets in HTTP headers are allowed. And they are what they are -
 octets. The interpretation has to be left to the application, not the 
 framework.
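
To illustrate the last point, here is a minimal Python sketch of an
application deciding for itself how to read header octets. The function
name interpret_header and the candidate encoding list are hypothetical,
chosen only for this example; nothing here is an existing framework API.

```python
def interpret_header(raw: bytes, candidates=("utf-8", "latin-1")) -> str:
    """Decode header octets with encodings chosen by the application.

    Hypothetical helper: the candidate list is an assumption and would
    differ from one application to the next.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # No candidate fits: keep the octets visible as escapes instead of
    # rejecting the request with a 400 or 500.
    return raw.decode("ascii", errors="backslashreplace")
```

Note that latin-1 accepts every byte sequence, so with the default
candidates the escape fallback is never reached. Whether that is the right
trade-off is exactly the kind of decision that belongs to the application.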
>
> >> However, some pragmatism is also possible. Many uses of PATH may
> >> allow it to be treated as black-box bytes, rather than text. The
> >> minimal solution I see is to make os.getenv() and os.putenv() switch
> >> to byte modes when given byte arguments, as os.listdir() does. This
> >> use case doesn't require the ability to iterate over all environment
> >> variables, as os.environb would allow.
> >>
> >> I do wonder if controlling the environment given to a subprocess
> >> requires os.environb, but it may be too obscure to really matter.
> >
> > IMHO, environment variables are no text. They are bytes by definition
> > and should be treated as such.
> > I know, there's windows having unicode enabled env vars on demand, but
> > there's only trouble with those over there in apache's httpd (when
> > passing them to CGI scripts, oh well...).
>
> Environment variables have textual names, are set via text, frequently

Well, think about my example again. The friendly way to maintain them is not 
the issue. The problems arise at least when the variables are set by an 
attacker.

> contain textual file names or paths, and my shell (bash in
> gnome-terminal on ubuntu) lets me put unicode text in just fine. The
> underlying APIs may use bytes, but they're *intended* to be encoded
> text.

Yes, encoded text == bytes. No, they're intended to be C strings. And well, 
even if we assume that they should contain text (as in encoded unicode), 
their meaning is application specific and so is the encoding (even if it's 
mixed).

What I'm saying is: I don't see much use for unicode APIs for the 
environment at all, because I don't know what's in there before inspecting 
them. And apparently the only reliable way to inspect them is via a byte 
oriented API.
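
A sketch of what such byte-oriented inspection could look like. It assumes
a later Python 3 release (3.2 or newer) where os.environb and
os.supports_bytes_environ exist; neither was available when this thread was
written. The helper names getenv_bytes and decode_if_possible are
hypothetical.

```python
import os


def getenv_bytes(name: bytes, default=None):
    """Return the raw octets of an environment variable, or default."""
    if os.supports_bytes_environ:
        # POSIX: the environment is exposed as bytes, untouched.
        return os.environb.get(name, default)
    # Platforms with a native unicode environment (e.g. Windows):
    # go through the str API and encode the result.
    value = os.environ.get(os.fsdecode(name))
    return default if value is None else os.fsencode(value)


def decode_if_possible(raw: bytes, encoding="utf-8"):
    """Return decoded text, or None if the octets do not fit the encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None
```

The application inspects the octets first and only then decides whether,
and in which encoding, to treat them as text.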
nd