Sandboxing interpreted code

Question 1

I have a little pet compiler project that generates bytecode interpreted by a virtual machine. The language is kind of low-level, as it allows the user to manually allocate memory and dereference any pointer as they see fit. This can of course lead to bugs which crash the interpreted program. When the VM runtime is used in a host environment, though, I would like to prevent the host application from crashing if a script has memory bugs in it. What I thought I could do is install a signal handler that catches SIGSEGV (and perhaps other signals) and longjmps back into the runtime. The runtime could then clean up after the script, as it can track all resource allocations made by the user through the language facilities. I tested it and it works nicely for simple cases; however, please correct me if I'm wrong on this.

What makes things complicated though is that it is possible for the environment to install callbacks that the script can call. Then those callbacks can execute other script functions in the VM runtime. So basically the host program can have callstacks that look like this:

Host -> ScriptFunctionA -> HostCallback -> ScriptFunctionB -> AnotherHostCallback -> ...

So I would install the signal handler when the VM runtime is constructed, call setjmp whenever a script function is called and keep a stack of jmp_bufs so the signal handler can jump to the script invocation at the top of the callstack. What is not considered in this design is that the host might install other signal handlers for SIGSEGV overriding mine.

Here is a slightly simplified code example:

Runtime runtime;
runtime.installCallback(HostFuncA); // HostFuncA calls bar() defined in the script below, actual design is a bit more complicated than this
runtime.installCallback(HostFuncB);
runtime.compile(R"(
func foo() { 
 HostFuncA(); 
}
func bar() { 
 var baz = *roguePointer; // Segfault
 HostFuncB(); 
}
)");
auto hostFoo = runtime.getFunction("foo");
hostFoo();

The constructor of Runtime installs a signal handler via

sigaction(SIGSEGV, &sa, nullptr); // Or perhaps store already installed handler and reinstall it later...

The signal handler calls

longjmp(jmpBufStack.top(), 1); // second argument becomes the return value of the matching setjmp

And the function hostFoo() defined by the runtime looks like this

if (setjmp(jmpBufStack.push())) { jmpBufStack.pop(); cleanup(); throw /*...*/; } // pop on the error path too, so outer invocations see their own jmp_buf
executeScript();
jmpBufStack.pop();

So my question is this: Is this design sound? Can I even handle segfaults in user code reliably? Would I (as the maintainer of the runtime) have to reinstall the handler every time a host callback returns, in case it installed another signal handler?

Question 2

"as it allows the user to manually allocate memory and dereference any pointer" - so your VM allows memory allocations directly using the related OS functions? No managed memory management?

Question 3

Right now it directly forwards to malloc and free, but the language exposes them in a RAII-like fashion, so 'destructors' of owning pointers call free. I will probably change this to sandboxed arena allocations in the future. A script could still dereference pointers that it receives from the host environment, but I guess in that case it is the responsibility of the host to pass only valid pointers to the script functions. What is not possible, though, is pointer arithmetic (beyond array indexing, which is always bounds-checked).

Question 4

I can't see how handling segfaults would be sufficient if you want to run in-process. As far as I can tell, you would need to validate each pointer access; otherwise you would risk trashing the host memory.

Question 5

@chrysante: you have two conflicting requirements. On one hand, you want direct, low level memory access from your language's byte code to the host. On the other hand, you want to prevent the host application from crashing by something which happens in the byte code. You can't have both - pick one.

Question 6

Yes, your VM would need some way to register or convert pointers, so it can mark them as valid. But I'm doubtful that using a low-level memory model is a good idea if you also want safety. "Safe" platforms like .NET and Java use fully managed memory models for good reasons.
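To illustrate what registering pointers could look like, here is a minimal sketch. The `PointerRegistry` class and its method names are hypothetical and not part of the asker's runtime. It assumes the runtime records every allocation it hands to a script and checks each dereference against that record. It only confirms that an access stays inside a known allocation; it does nothing about pointers the host passes in without registering them.

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>

// Hypothetical registry of memory ranges the script is allowed to touch.
class PointerRegistry {
public:
    // Record an allocation of `size` bytes starting at `base`.
    void add(const void* base, std::size_t size) {
        ranges_[reinterpret_cast<std::uintptr_t>(base)] = size;
    }

    // Forget an allocation, e.g. when the owning pointer is destroyed.
    void remove(const void* base) {
        ranges_.erase(reinterpret_cast<std::uintptr_t>(base));
    }

    // True if [p, p + width) lies entirely inside one registered range.
    bool isValid(const void* p, std::size_t width) const {
        const auto addr = reinterpret_cast<std::uintptr_t>(p);
        auto it = ranges_.upper_bound(addr);  // first range starting after addr
        if (it == ranges_.begin()) return false;
        --it;                                 // last range starting at or before addr
        const std::size_t offset = addr - it->first;
        // Compare against size minus width so the check cannot overflow.
        return width <= it->second && offset <= it->second - width;
    }

private:
    std::map<std::uintptr_t, std::size_t> ranges_;  // base address -> size in bytes
};
```

The interpreter would call `isValid` before every load or store and raise a script-level error on failure, which costs a map lookup per access.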

Question 8

Can I even handle segfaults in user code reliably?

No, because the worst case is that it doesn't segfault but instead overwrites part of the runtime's state which is in the same process.

You can't achieve sandboxing this way. Have you considered targeting e.g. the WASM runtime? It manages this by having fake "pointers" that are just offsets into a block of memory that can be bounds-checked by the runtime.
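The offset-based idea can be sketched roughly as follows. This is a simplified illustration, not how any particular WASM runtime is implemented; `LinearMemory`, `load32`, and `store32` are names invented for the example. Guest "pointers" are 32-bit offsets into one block owned by the runtime, so a bad pointer can at worst corrupt the guest's own data.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical guest memory: one contiguous block owned by the runtime.
// Scripts never see host addresses, only offsets into this block.
class LinearMemory {
public:
    explicit LinearMemory(std::size_t size) : bytes_(size, 0) {}

    // Returns the 32-bit value at `offset`, or nullopt if any byte of the
    // access falls outside the block. The VM can turn nullopt into a trap.
    std::optional<std::uint32_t> load32(std::uint32_t offset) const {
        if (!inBounds(offset, sizeof(std::uint32_t))) return std::nullopt;
        std::uint32_t value;
        std::memcpy(&value, bytes_.data() + offset, sizeof value);
        return value;
    }

    // Writes `value` at `offset`; returns false instead of writing out of bounds.
    bool store32(std::uint32_t offset, std::uint32_t value) {
        if (!inBounds(offset, sizeof value)) return false;
        std::memcpy(bytes_.data() + offset, &value, sizeof value);
        return true;
    }

private:
    bool inBounds(std::uint32_t offset, std::size_t width) const {
        // Compare against size minus width so the check cannot overflow.
        return width <= bytes_.size() && offset <= bytes_.size() - width;
    }

    std::vector<std::uint8_t> bytes_;
};
```

With this layout no signal handler is needed for script memory errors: an out-of-range access is detected before it happens and reported as an ordinary error value.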

Question 9

Yes, I haven't really considered that possibility. I think I will do just that, have a big block of 'virtual memory' and share data between guest and host by value only.

Doc Brown · Accepted Answer · 2023-09-25 15:35:17Z

It seems you have two conflicting requirements.

On one hand, you want direct, low level memory access from your language's byte code to the host.
On the other hand, you want to prevent the host application from crashing by something which happens in the byte code.

In reality, it is hard to get both. If you effectively want to prevent your VM interpreter from crashing the host application, you need to isolate the interpreter execution in a separate process, and that forbids most kinds of low-level memory access. Shared memory might be an option, but that will undoubtedly make the memory interface more complex. Callback functions to the host will become quite a challenge. If you want to go this route, you had better rethink your whole execution model and switch to an asynchronous, event-based approach.
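As a rough picture of process isolation, the sketch below assumes a POSIX system. `runIsolated` and `RunResult` are hypothetical names. It runs a script entry point in a forked child and reports whether the child was killed by a signal. It passes nothing back except an exit code, so host callbacks would still need some form of inter-process communication, and calling `fork` from a multithreaded host needs extra care.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <csignal>
#include <functional>

enum class RunResult { Ok, Failed, Crashed, SpawnError };

// Runs `script` in a child process so that a fatal signal such as SIGSEGV
// terminates only the child, never the host.
RunResult runIsolated(const std::function<int()>& script) {
    const pid_t pid = fork();
    if (pid < 0) return RunResult::SpawnError;
    if (pid == 0) {
        // Child: _exit avoids running the host's atexit handlers a second time.
        _exit(script());
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return RunResult::SpawnError;
    if (WIFSIGNALED(status)) return RunResult::Crashed;  // killed by a signal
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) return RunResult::Ok;
    return RunResult::Failed;                            // nonzero exit code
}
```

The host would call `runIsolated` with a function that executes the script and treat `RunResult::Crashed` as a script failure rather than a host failure.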

If you just want to reduce the probability of crashing to a reasonable degree, but still prefer an in-process solution, you may consider not passing any data by pointer reference, but only "by value", or through some kind of "managed" or "smart" pointer capsule. The goal should be to avoid the occurrence of error signals like a segmentation fault, so you can leave the implementation of such handlers to the host process.
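One possible shape for such a capsule is an opaque handle that the host resolves through a table. The following is a sketch under that assumption; `Handle` and `HandleTable` are illustrative names, not an existing API. The script only ever holds an index and a generation counter, so a stale or forged handle resolves to `nullptr` instead of a wild pointer.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical opaque value handed to scripts in place of a raw host pointer.
struct Handle {
    std::uint32_t index;
    std::uint32_t generation;
};

// Host-side table that owns the objects and validates every handle lookup.
template <typename T>
class HandleTable {
public:
    Handle insert(T value) {
        if (!free_.empty()) {
            const std::uint32_t i = free_.back();  // reuse a released slot
            free_.pop_back();
            slots_[i].value = std::move(value);
            slots_[i].live = true;
            return Handle{i, slots_[i].generation};
        }
        slots_.push_back(Slot{std::move(value), 0, true});
        return Handle{static_cast<std::uint32_t>(slots_.size() - 1), 0};
    }

    // Returns nullptr for out-of-range, released, or stale handles.
    // The pointer is only valid until the next insert, which may reallocate.
    T* get(Handle h) {
        if (h.index >= slots_.size()) return nullptr;
        Slot& s = slots_[h.index];
        if (!s.live || s.generation != h.generation) return nullptr;
        return &s.value;
    }

    bool release(Handle h) {
        if (get(h) == nullptr) return false;
        slots_[h.index].live = false;
        ++slots_[h.index].generation;  // invalidates every outstanding copy of h
        free_.push_back(h.index);
        return true;
    }

private:
    struct Slot {
        T value;
        std::uint32_t generation;
        bool live;
    };
    std::vector<Slot> slots_;
    std::vector<std::uint32_t> free_;
};
```

Host callbacks would accept a `Handle` from the script, call `get`, and report a script error when the result is `nullptr`.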
