Python redirect input to child process

Question 1

Normally this isn't much of an issue as passing data through STDIN/STDOUT is straightforward.

But I am working on a diff util, and this has two inputs and one output.

Consider:

diff <(curl 'http://google.com') <(curl 'https://google.com')
5c5
< <A HREF="http://www.google.com/">here</A>.
---
> <A HREF="https://www.google.com/">here</A>.

Now this is fine with a plain old python program as I can open(sys.argv[1], 'r').read() to get the data just fine for both argv[1] and argv[2].

The problem is that my differ is a C++ implementation of google_diff_match_patch, and to keep things simple I am calling into that program (which reads its argvs using wifstream, wstring, and getline).

So now I have to "give" my /dev/fd/11 to the child started with subprocess.Popen(['dmp']). But I can't simply pass the paths (usually /dev/fd/11 and /dev/fd/12) as arguments to the dmp C++ program, because its /dev/fd/11 isn't my Python program's /dev/fd/11.

To further muddle the issue, I must read out the contents of the files before sending them to the child, because I am using the file command as an "is binary file" oracle:

from subprocess import Popen, PIPE

# Ask file(1) to classify the data; '-' makes it read from stdin
file_process = Popen(['file', '-'], stdin=PIPE, stdout=PIPE)
with open(filename, 'rb') as f:
    file_content = f.read()
(filetype, err) = file_process.communicate(file_content)
if filetype.find(b'text') == -1:
    pass  # Popen my c++ program and try to feed it file_content

Please don't give an answer like "write to a file" or something. I want to implement these input-redirect FIFOs so that I can use the program as effectively as any other command line diff (and that includes, for example, curling something off the net without saving it to a file).

Edit: According to the subprocess documentation, the child should inherit the file descriptors if the close_fds argument is the default value of False. Okay, so this would seem to indicate that if my Python wrapper program calls open('/dev/fd/11') and doesn't close it, and then forks a child using Popen(), that child should be able to read file descriptor 11 somehow.

Okay, so now that I have the contents of Python's file descriptor 11, how can I set up a file for the child to read? E.g. how do I replicate the shell's functionality of <(echo file contents) (without using shell=True and echo, which I recognize I should just do right now)?

Question 2

Sorry, I do not understand what you are asking for.

Question 3

Yeah, my question was bad; I edited it.

Question 4

It sounds to me like you have an external executable which is expecting filenames as arguments and you want to pass it open file descriptors from your Python script instead, correct? And these file descriptors might not be actual files, they might be stdin or other pipes?

If that's the case, you're not going to have an easy job - the application is expecting filenames, not open files. Because that executable's code is opening files by name, you can't change that behaviour from your Python script - even if the executable has inherited the file descriptors, its code would need to have been written with that assumption in mind. It's not that the code in that executable can't do what you're suggesting, it's just that it's not written to do it. So what you're attempting to do is a workaround, and there isn't necessarily a clean option.

You say you don't want solutions involving writing to files, but I feel I should point out that really is the simplest option if you're after minimal work. If you're worried about writing to disk then you could create a tmpfs partition or something, but that's getting rather fiddly (and not very portable).
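For reference, the temp-file route can be sketched roughly like this. It's only a sketch: run_with_temp_files and the stand-in child command are names made up for illustration, and the child here is a Python one-liner that opens its two filename arguments, standing in for the dmp executable as you describe it.

```python
import os
import subprocess
import sys
import tempfile

def run_with_temp_files(command, blobs):
    """Write each bytes blob to a temp file, append the paths to command, run it."""
    paths = []
    try:
        for blob in blobs:
            # delete=False so the file can be closed before the child opens it by name
            with tempfile.NamedTemporaryFile(delete=False) as tmp:
                tmp.write(blob)
                paths.append(tmp.name)
        return subprocess.check_output(list(command) + paths)
    finally:
        # Clean up whatever was created, even if the child failed
        for path in paths:
            os.remove(path)

# Hypothetical stand-in for the real differ: reports whether its two
# filename arguments have identical contents.
child = [sys.executable, "-c",
         "import sys; print(open(sys.argv[1]).read() == open(sys.argv[2]).read())"]
```

You'd call it as run_with_temp_files(child, [first_bytes, second_bytes]); pointing tempfile at a tmpfs mount via its dir argument would keep the data off disk, if that matters to you.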

The next simplest might even be to write a C extension which calls directly into the Google library rather than using a third party executable - I would say that's considerably cleaner (and more portable) than messing around with /proc/self/fd or anything. In fact, just having checked the project it already offers a Python API so is there a reason you're not just calling directly into that? Personally, this is definitely the approach I'd be taking.

EDIT: Ah, I just spotted that the Python API is pure Python as opposed to a wrapper around the C++ module, so I guess you might not be using that for performance reasons. Unless you have stringent performance requirements, I still think this is the easiest option, but if you really need C++ performance then you still have the option of writing your own wrapper.

If you're really intent on calling the executable and you don't want to write to intermediate files, then I guess you could use the /dev/fd/* files, but this is likely to only work for real files. At least on Linux, these files are symlinks to the underlying files on the filesystem, so if your executable re-opens them via the symlinks it should get a read pointer at the start of each file and be able to do the diff correctly.
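A minimal sketch of that /dev/fd idea for a real file, assuming Python 3 on Linux (pass_fds is POSIX-only). The helper name run_with_inherited_fd is made up for illustration, and command is whatever executable expects a filename argument.

```python
import subprocess
import sys

def run_with_inherited_fd(command, path):
    """Open path here, then hand the child a /dev/fd/N name for the same descriptor."""
    with open(path, "rb") as handle:
        fd = handle.fileno()
        # pass_fds keeps fd open, under the same number, in the child
        return subprocess.check_output(
            list(command) + ["/dev/fd/%d" % fd], pass_fds=(fd,))

# Hypothetical stand-in child: echoes the contents of the file named in argv[1]
child = [sys.executable, "-c",
         "import sys; sys.stdout.write(open(sys.argv[1]).read())"]
```

Because the descriptor number is preserved by pass_fds, the child's /dev/fd/N refers to the same file as the parent's.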

In the case of stdin, however, you're dealing with pipes not real files, so I don't believe this trick will work. If you try it, you'll have two processes with the same underlying pipe open for reading. This means that any output from the pipe will arrive at a random child process (not exactly random, but unpredictable from your point of view). Now, as long as your process isn't reading from stdin then you might get away with this, but it's a pretty dubious thing to be doing.

In short, you might get away with just opening the /proc/self/fd files (or /dev/fd if you prefer), but it's not something that I would recommend. If the executable you're using doesn't invoke the library in the way you want, I suggest calling into the library directly, either writing your own Python C extension wrapper or by using the Python API already available.

Question 5

Well, so I am also writing the C++ program so if there is an easy way to make it receptive to input from file descriptors owned by its caller, that would likely be the way to go. It would however also be good to have my C++ program itself also be usable directly from a shell as well. Anyway, I think my solution to the problem is to make more clear which program is meant to process the input. With a pipe the input only can come in once. With a pipe if I want to dupe the data, I basically gotta replicate what the tee program does.
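Something like the following is roughly what I have in mind: read each input once in the wrapper, then re-serve the bytes to the child through fresh pipes that only the child reads. This is a sketch, assuming Python 3 on Linux; run_with_pipe_inputs and the stand-in child are names I made up, and the real child would be dmp.

```python
import os
import subprocess
import sys
import threading

def run_with_pipe_inputs(command, blobs):
    """Give the child one /dev/fd/N path per bytes blob, each backed by a fresh pipe."""
    read_fds, writers = [], []
    for blob in blobs:
        read_fd, write_fd = os.pipe()
        read_fds.append(read_fd)

        def feed(fd=write_fd, data=blob):
            try:
                # Closing the write end is what lets the child see end-of-file
                with os.fdopen(fd, "wb") as pipe_in:
                    pipe_in.write(data)
            except BrokenPipeError:
                pass  # the child exited without reading everything

        writers.append(threading.Thread(target=feed))
    args = list(command) + ["/dev/fd/%d" % fd for fd in read_fds]
    # Only the read ends are passed on; the write ends stay in the parent
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, pass_fds=tuple(read_fds))
    for fd in read_fds:
        os.close(fd)  # the parent no longer needs the read ends
    for writer in writers:
        writer.start()
    output = proc.communicate()[0]
    for writer in writers:
        writer.join()
    return output

# Hypothetical stand-in child: concatenates the two "files" named in its arguments
child = [sys.executable, "-c",
         "import sys; sys.stdout.write(open(sys.argv[1]).read() + open(sys.argv[2]).read())"]
```

The writer threads are there so a large input can't deadlock against the pipe buffer while the child is still busy with the other argument.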

Question 6

And yes, a proper workable approach is to not make the native component its own process. I should just be using something to bind it to Python (swig? boost.python?).

Question 7

I can't comment on Swig or boost.python, I've only used the Python C API - those frameworks may be more convenient, but in this case it shouldn't be a complicated job writing it to the raw Python API. The main thing that might trip you up is that you need to reference count all Python objects manually, but if your extension is C++ then RAII can help there.

Question 8

Cool, thanks. For this particular implementation I think I can get away with what I've been doing by offloading the extra processing that I'm currently doing from Python to C++. While using the file command to find out about the nature of the data is nice (in theory), it is a fragile interface. I'll solve the problem by sidestepping it.

Cartroo · Accepted Answer · 2013-07-01 12:04:31Z