This patch introduces a couple of new concepts into the kernel. The first one has an old name: "subsystem". Fortunately, the driver core has just removed its "subsystem" concept, leaving the term free. In the container patch, a subsystem is some part of the kernel which might have an interest in what groups of processes are doing. Chances are that most subsystems will be involved with resource management; for example, the container patch turns the Linux cpusets mechanism (which binds processes to specific groups of processors) into a subsystem.
A "container" is a group of processes which shares a set of parameters used by one or more subsystems. In the cpuset example, a container would have a set of processors which it is entitled to use; all processes within the container inherit that same set. Other (not yet existing) subsystems could use containers to enforce limits on CPU time, I/O bandwidth usage, memory usage, filesystem visibility, and so on. Containers are hierarchical, in that one container can hold others.
[Figure: container hierarchy]

As an example, consider the simple hierarchy shown in the figure. A server used to host containerized guests could establish two top-level containers to control the usage of CPU time. Guests, perhaps, could be allowed 90% of the CPU, but the administrator may want to place system tasks in a separate container which will always get at least 10% of the processor - that way, the mail will continue to be delivered regardless of what the guests are doing. Within the "Guests" container, each individual guest has its own container with specific CPU usage policies.
The container mechanism is not limited to a single hierarchy; instead, the administrator can create as many hierarchies as desired. So, for example, the administrator of the system described above could create an entirely different hierarchy for the control of network bandwidth usage. By default, all processes would be in the same container, but it is possible to set up policy which would shift processes to a different container when they run a specific application. So a web browser might be moved into a container which gets a relatively high portion of the available bandwidth while Bittorrent clients find themselves relegated to an unhappy container with almost no bandwidth available.
Different container hierarchies need not resemble each other in any way. Each hierarchy has one or more subsystems associated with it; a subsystem can only be attached to a single hierarchy. If there is more than one hierarchy, each process in the system will be in more than one container - one in each hierarchy.
The administration of containers is performed through a special virtual filesystem. The documentation suggests that it could be mounted on /dev/container, which is a bit strange; it has nothing to do with devices. One container filesystem instance will be mounted for each hierarchy to be created. The association of subsystems with hierarchies is done at mount time, by way of mount options. By default, all known subsystems are associated with a hierarchy, so a command like:
mount -t container none /containers
would create a single container hierarchy with all known subsystems on /containers. A setup like the one described above, instead, could be created with something like:
mount -t container -o cpu cpu /containers/cpu
mount -t container -o net net /containers/net
The desired subsystems for each container hierarchy are simply provided as options at mount time. Note that the "cpu" and "net" subsystems mentioned above do not actually exist in the current container patch set.
Creating new containers is just a matter of making a directory in the appropriate spot in the hierarchy. Containers have a file called tasks; reading that file will yield a list of all processes currently in the container. A process can be added to a container by writing its ID to the tasks file. So a simple way to create a container and move a shell into it would be:
mkdir /containers/new_container
echo $$ > /containers/new_container/tasks
Subsystems can add files to containers for use in setting resource limits or otherwise controlling how the subsystem works. For example, the cpuset subsystem (which does exist) adds a file called cpus containing the list of CPUs established for that container; there are several other files added as well.
It's worth noting that the container patch does not add a single system call; all of the management is performed through the virtual filesystem.
With a basic container mechanism in place, most of the action in the future is likely to be in the creation of new subsystems. One can imagine, for example, hooking the existing process ID virtualization code into containers, as well as adding no end of resource controllers. The creation of a subsystem is relatively straightforward; the subsystem code starts by creating and registering a container_subsys structure. That structure contains an integer subsys_id field which should be set to the subsystem's specific ID number; these numbers are set statically in <linux/container_subsys.h>. Implicit in this arrangement is that subsystems must be built into the kernel; there is no provision for adding subsystems as loadable modules.
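To make the shape of a subsystem concrete, here is a rough sketch (not code from the patch) of what a hypothetical "example" subsystem's declaration might look like. The ex_subsys_id constant stands in for whatever ID the subsystem has been assigned in <linux/container_subsys.h>, and the ex_*() methods are sketched below; the usual kernel headers and the container patch's own header are assumed.

struct container_subsys ex_subsys = {
    .subsys_id  = ex_subsys_id,   /* statically assigned in <linux/container_subsys.h> */
    .create     = ex_create,      /* set up per-container state */
    .populate   = ex_populate,    /* add this subsystem's control files */
    .destroy    = ex_destroy,     /* tear down per-container state */
    .can_attach = ex_can_attach,  /* veto (or prepare for) moving a task */
    .attach     = ex_attach,
    .fork       = ex_fork,
    .exit       = ex_exit,
};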
Each subsystem defines a set of methods to be used by the container code, beginning with:
int (*create)(struct container_subsys *ss, struct container *cont);
int (*populate)(struct container_subsys *ss, struct container *cont);
void (*destroy)(struct container_subsys *ss, struct container *cont);
The create() and destroy() methods are called when a container is created or destroyed; they give the subsystem the chance to set up any bookkeeping it will need for the new container (or to clean up after a container which is going away). The populate() method is called after the successful creation of a new container; its purpose is to allow the subsystem to add management files to that container.
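As an illustration only, the creation-time methods of the hypothetical subsystem above might look roughly like the following. The ex_state structure and the ex_set_state()/ex_get_state() helpers are assumptions made for the sketch, not part of the documented interface; the file-creation helpers used by populate() are not described in the article, so they are omitted.

struct ex_state {
    int share;                       /* whatever this subsystem tracks per container */
};

static int ex_create(struct container_subsys *ss, struct container *cont)
{
    struct ex_state *st = kzalloc(sizeof(*st), GFP_KERNEL);

    if (!st)
        return -ENOMEM;
    st->share = 100;                 /* default: unrestricted */
    ex_set_state(cont, st);          /* hypothetical helper: remember our state for this container */
    return 0;
}

static int ex_populate(struct container_subsys *ss, struct container *cont)
{
    /* Create this subsystem's control files in the new container's
       directory (helpers not shown here). */
    return 0;
}

static void ex_destroy(struct container_subsys *ss, struct container *cont)
{
    kfree(ex_get_state(cont));       /* hypothetical helper: look up our state again */
}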
Four methods are for the addition and removal of processes:
int (*can_attach)(struct container_subsys *ss, struct container *cont,
                  struct task_struct *tsk);
void (*attach)(struct container_subsys *ss, struct container *cont,
               struct container *old_cont, struct task_struct *tsk);
void (*fork)(struct container_subsys *ss, struct task_struct *task);
void (*exit)(struct container_subsys *ss, struct task_struct *task);
If a process is explicitly added to a container after creation, the container code will call can_attach() to determine whether the addition should succeed. If the subsystem allows the move, it should also perform any needed allocations at this point, ensuring that the subsequent attach() call cannot fail. When a process forks, fork() is called to add the new child to its parent's container. When a process exits, exit() gives the subsystem a chance to clean up.
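Continuing the sketch (again, illustrative stubs rather than code from the patch), the process-tracking methods might be structured along these lines:

static int ex_can_attach(struct container_subsys *ss, struct container *cont,
                         struct task_struct *tsk)
{
    /* Check the request and do any allocations needed later by attach(),
       so that the attach() call itself cannot fail. */
    return 0;                        /* 0 means the move is allowed */
}

static void ex_attach(struct container_subsys *ss, struct container *cont,
                      struct container *old_cont, struct task_struct *tsk)
{
    /* Move this task's accounting from old_cont to cont. */
}

static void ex_fork(struct container_subsys *ss, struct task_struct *task)
{
    /* The new child starts out in its parent's container; update any
       per-container counts accordingly. */
}

static void ex_exit(struct container_subsys *ss, struct task_struct *task)
{
    /* The task is going away; release anything held on its behalf. */
}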
Clearly, there's more to the interface than described here; see the thorough documentation file packaged with the patch for much more detail. Your editor would not venture a guess as to when this code might be merged, but it does seem that this is the mechanism that the containers community has decided to push. So, sooner or later, it will likely be contained within the mainline.
| Index entries for this article | |
|---|---|
| Kernel | Containers |
| Kernel | Virtualization/Containers |
Posted May 31, 2007 18:20 UTC (Thu) by utoddl (guest, #1232)

This is a reimplementation of groups, but with more features attached than simply "you may or may not access this local file or directory". It looks like an extension of what OpenAFS's PAGs (process authentication groups) give you -- and what has kept their camel out of the kernel tent for years.

Posted May 31, 2007 21:51 UTC (Thu) by IkeTo (subscriber, #2122)

I have some difficulties understanding your comment. I've looked at OpenAFS for a tiny bit of time; my impression is exactly what you say: it is a filesystem, and PAG is a system for you to tell the filesystem who you are. How does this have anything to do with process containers, which seem to be mainly a tool for system administrators or service startup scripts to limit the amount (rather than the identities) of system resources like CPU and network bandwidth (rather than files) that a process can use, based on "echo" commands executed by the administrator manually or via scripts (rather than via the user creation and login procedure)?

Posted Jun 1, 2007 14:02 UTC (Fri) by utoddl (guest, #1232)

Fair enough. Let's see if I can connect the dots.

Ignore for the moment the implementation of either groups or process containers, and just look at the semantics. A given process can be in multiple groups; child processes inherit groups from their parents; special circumstances can alter which groups are added or dropped from a process's group list. Likewise for processes in containers. If you were to replace the labels in the diagram from the article with numbers, you could implement a process's "in-container-x" property with the existing group mechanism.

Process group lists have always been a light-weight set of properties that processes carry around and pass on through fork(). The fact that (almost) nothing except filesystems uses them notwithstanding, it seems somebody finally noticed that the semantics of passing around properties in this way are useful for other things like processor affinity, throttling, and the other things the article mentions.

AFS (and later OpenAFS) piggy-backed process authentication group membership on the group mechanism. The AFS kernel module would add a group (actually a pair of group numbers) to a process's group list to create a new PAG. Child processes would inherit these just like any other groups through fork(), but no filesystem -- including AFS -- used these group numbers to check file access. Instead, AFS would use these numbers to associate a process with a specific PAG, which is just a set of processes which share a cached token. The token *is* used for access control, but membership in a PAG is just a property like any other group membership. The semantics for group membership and inheritance just happen to be exactly what you want for an authenticated filesystem like AFS.

Besides that, though, these semantics happen to be exactly what you want for processor affinity, bandwidth throttling, CPU limits, etc. But rather than piggy-backing these capabilities onto the existing group mechanism as AFS did, they've invented another parallel mechanism for passing process properties around. Group membership and process container "in-ness" are just properties after all.

To be fair, the time-tested group mechanism has its limits. Group lists are rather short (or they were the last time I ran into that issue). They also aren't explicitly hierarchical like process containers (though what that buys us wasn't immediately obvious to me upon reading the article). It wouldn't surprise me if the old UNIX groups were eventually reimplemented as containers. Then you could eventually have hierarchical UNIX groups!

The point of my "camel in the tent" comment was that the way AFS piggy-backed the process properties it was interested in on top of groups was met with skepticism and sometimes outright contempt by some kernel developers. The reasons include NIH (Not Invented Here -- AFS predates Linux by a fair few years), the fact that the kernel module itself is maintained out of tree (it builds for several OSes other than Linux and not just on the current versions, so it contains a lot of "cruft", at least in the eyes of the kernel hard-core), and the fact that it's hobbled by being under the IPL license (basically IBM's GPL with a "we can take it proprietary later if we want" clause). AFS on recent kernels has switched to using keyrings -- yet another special-purpose property propagation mechanism -- to implement PAGs, but the other factors still keep AFS/OpenAFS on the outside looking in.

The kernel goes through this periodic process where some new functionality is added, then somebody points out that this new thing and this other old thing have similar operations, then some common code is developed that they can both use, or one gets folded into the other. We've seen it over and over, and I wouldn't be surprised to see it happen with groups and process properties.

Posted Jun 1, 2007 14:33 UTC (Fri) by IkeTo (subscriber, #2122)

Can you clarify a little bit? AFAIK, there are two concepts of "groups" in the current kernel. One is the "process group", as set by setpgid(). Each process belongs to one such group (rather than many). That group is used for signal delivery, allowing users to send signals to all processes of a group, either by an explicit "kill" command/system call or by using a special terminal character.

The other is the set of "supplementary group IDs", as set by setgroups(). Each process has a small number of those. It is used by system administrators to control the files or other resources that each user can access. The numeric values are meaningful not only to the kernel, but to the admin as well: they assign each user a list of such group IDs in /etc/group, and the login procedure will assign that list to the login shell (or X session) process. There is also the session ID, but that doesn't seem to be what you mean.

So by "process group" do you mean one of these existing concepts, or is there yet another group concept carried by the process that is either hidden in the kernel or that I have forgotten?

Posted Jun 1, 2007 15:06 UTC (Fri) by utoddl (guest, #1232)

> Process group lists have always been a light-weight set of properties that
> processes carry around and pass on through fork().

I was talking about supplementary group IDs as set by setgroups().

In the particular AFS context, when the older libafs kernel module loaded, it would swipe the setgroups entry in the sys_call_table so it could handle the necessary details of associating an AFS PAG, token, and process. It was an admitted hack, but one that has worked in various forms for over a decade in a half dozen major flavors of UNIX. Other methods were invented for Linux when the kernel police made the sys_call_table read-only.

BTW, this was/is another reason to dislike what AFS does with the supplementary group list. It's rather disconcerting to do "id -a" and see groups with no associated names, but that's common if your shell is in a PAG. Behold:

$ id -a
uid=12428(utoddl) gid=12428(utoddl) \
groups=10(wheel),1511(atnid),12428(utoddl),1094942735

Posted Jun 1, 2007 16:40 UTC (Fri) by IkeTo (subscriber, #2122)

Thanks. I understand your posts now. But I don't think I like the idea. At the very least, I don't think it reasonable to arbitrarily allocate user ID space to something completely unrelated to users this way. And of course it provides a horrible interface to users.

Posted Jan 25, 2008 17:35 UTC (Fri) by rijrunner (guest, #49442)

Well, my read of this is a bit different. This looks to me like a side effect of the virtualization changes to the kernel and how Oracle works. Basically, virtualization requires carving out a set of system resources (memory, CPU, disk, network, etc.) and assigning them to a virtual machine to manage. The key is that the kernel has to be able to define parameters that can be isolated and restricted in their size and scope. What the container concept seems to be - and I could be misunderstanding based on only a cursory reading - is extending that ability to isolate resources to processes running within the base OS.

i.e., if you are putting hooks into the kernel to be able to define and limit system resources for virtual machines, why not extend that to processes and resources at the OS level?

Posted May 31, 2007 22:24 UTC (Thu) by riddochc (guest, #43)

I must admit, this is more abstract than usual. I think the other two comments suggest that it's not really clear what exactly these containers are *for*. Can someone give me an example of how such containers could be used? I'm confused.

Posted May 31, 2007 23:34 UTC (Thu) by i3839 (guest, #31386)

The key point is:

> Other (not yet existing) subsystems could use containers to enforce
> limits on CPU time, I/O bandwidth usage, memory usage, filesystem
> visibility, and so on. Containers are hierarchical, in that one
> container can hold others.

Right now all resource management is done globally or per process/thread, but not much else. Process containers make it possible to group a bunch of processes and do resource allocation for them as a group (think ulimit, but more). What resource that is doesn't matter right now, as this article is about the basic infrastructure which is put into place to make everything possible.

This is useful for multi-purpose and multi-user machines. E.g. if you want your server to spend 50% of its CPU time, disk I/O and/or memory on the webserver and a database, 25% on finding aliens, and the rest on reading LWN, it can be done.

It seems it can also function as a sort of jail, limiting the filesystem and process namespace view/access that processes have.

(I might be mixing multiple things though.)

Posted Jun 5, 2007 14:40 UTC (Tue) by vMeson (subscriber, #45212)

An industrial use case: let's say you are a network infrastructure vendor, and you'd like to allocate 60% of the CPU to processing packets for existing work, 10% for handling new work, 10% for system maintenance, 10% for I/O, and 10% for spying^Hlawful intercept. ;-) The missing bit is how these containers or classes interact. Is system maintenance more important than new work, or do you have a policy of fairness?

Containers coupled to the new scheduler (CFS) seem like a powerful combination.

Posted Jun 9, 2007 10:57 UTC (Sat) by muwlgr (guest, #35359)

And don't forget, there is no meaningful way to shape incoming traffic, so the dream about BitTorrent & browser is just that, a dream :>

Posted Jul 12, 2007 10:19 UTC (Thu) by Stephen_Beynon (guest, #4090)

It is possible to shape incoming traffic for TCP streams. Just drop any packets that would cause the required bandwidth to be exceeded. TCP is designed to assume packet loss means a saturated link, and to back off. While it is not possible to get the bandwidth exact, it is good enough to be useful.

When it comes to BitTorrent I tend to find the problem is the upstream bandwidth use, and that is much more controllable :)

Stephen