Capabilities Isolators Guide
This document is a walk-through guide describing how to use rkt isolators for Linux Capabilities.
- About Linux Capabilities
- Default Capabilities
- Capability Isolators
- Usage Example
- Overriding Capabilities
- Recommendations
About Linux Capabilities
Linux capabilities are meant to be a modern evolution of traditional UNIX
permissions checks.
The goal is to split the permissions granted to privileged processes into a set
of capabilities (eg. CAP_NET_RAW
to open a raw socket), which can be
separately handled and assigned to single threads.
Processes can gain specific capabilities by either being run by superuser, or by having the setuid/setgid bits or specific file-capabilities set on their executable file. Once running, each process has a bounding set of capabilities which it can enable and use; such process cannot get further capabilities outside of this set.
In the context of containers, capabilities are useful for:
- Restricting the effective privileges of applications running as root
- Allowing applications to perform specific privileged operations, without having to run them as root
For the complete list of existing Linux capabilities and a detailed description of this security mechanism, see the capabilities(7) man page.
Default capabilities
By default, rkt enforces a default set of capabilities onto applications.
This default set is tailored to stop applications from performing a large
variety of privileged actions, while not impacting their normal behavior.
Operations which are typically not needed in containers and which may
impact host state, eg. invoking reboot(2)
, are denied in this way.
However, this default set is mostly meant as a safety precaution against erratic and misbehaving applications, and will not suffice against tailored attacks. As such, it is recommended to fine-tune the capabilities bounding set using one of the customizable isolators available in rkt.
Capability Isolators
When running Linux containers, rkt provides two mutually exclusive isolators to define the bounding set under which an application will be run:
os/linux/capabilities-retain-set
os/linux/capabilities-remove-set
Those isolators cover different use-cases and employ different techniques to achieve the same goal of limiting available capabilities. As such, they cannot be used together at the same time, and recommended usage varies on a case-by-case basis.
As the granularity of capabilities varies for specific permission cases, a word
of warning is needed in order to avoid a false sense of security.
In many cases it is possible to abuse granted capabilities in order to
completely subvert the sandbox: for example, CAP_SYS_PTRACE
allows to access
stage1 environment and CAP_SYS_ADMIN
grants a broad range of privileges,
effectively equivalent to root.
Many other ways to maliciously transition across capabilities have already been
reported.
Retain-set
os/linux/capabilities-retain-set
allows for an additive approach to
capabilities: applications will be stripped of all capabilities, except the ones
listed in this isolator.
This whitelisting approach is useful for completely locking down environments and whenever application requirements (in terms of capabilities) are well-defined in advance. It allows one to ensure that exactly and only the specified capabilities could ever be used.
For example, an application that will only need to bind to port 80 as
a privileged operation, will have CAP_NET_BIND_SERVICE
as the only entry in
its "retain-set".
Remove-set
os/linux/capabilities-remove-set
tackles capabilities in a subtractive way:
starting from the default set of capabilities, single entries can be further
forbidden in order to prevent specific actions.
This blacklisting approach is useful to somehow limit applications which have broad requirements in terms of privileged operations, in order to deny some potentially malicious operations.
For example, an application that will need to perform multiple privileged
operations but is known to never open a raw socket, will have
CAP_NET_RAW
specified in its "remove-set".
Usage Example
The goal of these examples is to show how to build ACIs with acbuild
,
where some capabilities are either explicitly blocked or allowed.
For simplicity, the starting point will be the official Alpine Linux image from
CoreOS which ships with ping
and nc
commands (from busybox). Those
commands respectively requires CAP_NET_RAW
and CAP_NET_BIND_SERVICE
capabilities in order to perform privileged operations.
To block their usage, capabilities bounding set
can be manipulated via os/linux/capabilities-remove-set
or
os/linux/capabilities-retain-set
; both approaches are shown here.
Removing specific capabilities
This example shows how to block ping
only, by removing CAP_NET_RAW
from
capabilities bounding set.
First, a local image is built with an explicit "remove-set" isolator.
This set contains the capabilities that need to be forbidden in order to block
ping
usage (and only that):
$ acbuild begin
$ acbuild set-name localhost/caps-remove-set-example
$ acbuild dependency add quay.io/coreos/alpine-sh
$ acbuild set-exec -- /bin/sh
$ echo '{ "set": ["CAP_NET_RAW"] }' | acbuild isolator add "os/linux/capabilities-remove-set" -
$ acbuild write caps-remove-set-example.aci
$ acbuild end
Once properly built, this image can be run in order to check that ping
usage has
been effectively disabled:
$ sudo rkt run --interactive --insecure-options=image caps-remove-set-example.aci
image: using image from file stage1-coreos.aci
image: using image from file caps-remove-set-example.aci
image: using image from local store for image name quay.io/coreos/alpine-sh
/ # whoami
root
/ # ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
ping: permission denied (are you root?)
This means that CAP_NET_RAW
had been effectively disabled inside the container.
At the same time, CAP_NET_BIND_SERVICE
is still available in the default bounding
set, so the nc
command will be able to bind to port 80:
$ sudo rkt run --interactive --insecure-options=image caps-remove-set-example.aci
image: using image from file stage1-coreos.aci
image: using image from file caps-remove-set-example.aci
image: using image from local store for image name quay.io/coreos/alpine-sh
/ # whoami
root
/ # nc -v -l -p 80
listening on [::]:80 ...
Allowing specific capabilities
In contrast to the example above, this one shows how to allow ping
only, by
removing all capabilities except CAP_NET_RAW
from the bounding set.
This means that all other privileged operations, including binding to port 80
will be blocked.
First, a local image is built with an explicit "retain-set" isolator.
This set contains the capabilities that need to be enabled in order to allowed
ping
usage (and only that):
$ acbuild begin
$ acbuild set-name localhost/caps-retain-set-example
$ acbuild dependency add quay.io/coreos/alpine-sh
$ acbuild set-exec -- /bin/sh
$ echo '{ "set": ["CAP_NET_RAW"] }' | acbuild isolator add "os/linux/capabilities-retain-set" -
$ acbuild write caps-retain-set-example.aci
$ acbuild end
Once run, it can be easily verified that ping
from inside the container is now
functional:
$ sudo rkt run --interactive --insecure-options=image caps-retain-set-example.aci
image: using image from file stage1-coreos.aci
image: using image from file caps-retain-set-example.aci
image: using image from local store for image name quay.io/coreos/alpine-sh
/ # whoami
root
/ # ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=41 time=24.910 ms
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 24.910/24.910/24.910 ms
However, all others capabilities are now not anymore available to the application.
For example, using nc
to bind to port 80 will now result in a failure due to
the missing CAP_NET_BIND_SERVICE
capability:
$ sudo rkt run --interactive --insecure-options=image caps-retain-set-example.aci
image: using image from file stage1-coreos.aci
image: using image from file caps-retain-set-example.aci
image: using image from local store for image name quay.io/coreos/alpine-sh
/ # whoami
root
/ # nc -v -l -p 80
nc: bind: Permission denied
Overriding capabilities
Capability sets are typically defined when creating images, as they are tightly linked to specific app requirements. However, image consumers may need to further tweak/restrict the set of available capabilities in specific local scenarios. This can be done either by permanently patching the manifest of specific images, or by overriding capability isolators with command line options.
Patching images
Image manifests can be manipulated manually, by unpacking the image and editing
the manifest file, or with helper tools like actool
.
To override an image's pre-defined capabilities set, replace the existing capabilities
isolators in the image with new isolators defining the desired capabilities.
The patch-manifest
subcommand to actool
manipulates the capabilities sets
defined in an image.
actool patch-manifest --capability
changes the retain
capabilities set.
actool patch-manifest --revoke-capability
changes the remove
set.
These commands take an input image, modify its existing capabilities sets, and
write the changes to an output image, as shown in the example:
$ actool cat-manifest caps-retain-set-example.aci
...
"isolators": [
{
"name": "os/linux/capabilities-retain-set",
"value": {
"set": [
"CAP_NET_RAW"
]
}
}
]
...
$ actool patch-manifest -capability CAP_NET_RAW,CAP_NET_BIND_SERVICE caps-retain-set-example.aci caps-retain-set-patched.aci
$ actool cat-manifest caps-retain-set-patched.aci
...
"isolators": [
{
"name": "os/linux/capabilities-retain-set",
"value": {
"set": [
"CAP_NET_RAW",
"CAP_NET_BIND_SERVICE"
]
}
}
]
...
Now run the image to check that the CAP_NET_BIND_SERVICE
capability added to
the patched image is retained as expected by using nc
to listen on a
"privileged" port:
$ sudo rkt run --interactive --insecure-options=image caps-retain-set-patched.aci
image: using image from file stage1-coreos.aci
image: using image from file caps-retain-set-patched.aci
image: using image from local store for image name quay.io/coreos/alpine-sh
/ # nc -v -l -p 80
listening on [::]:80 ...
Overriding capabilities at run-time
Capabilities can be directly overridden at run time from the command-line,
without changing the executed images.
The --caps-retain
option to rkt run
manipulates the retain
capabilities set.
The --caps-remove
option manipulates the remove
set.
Capabilities specified from the command-line will replace all capability settings in the image manifest.
Also as stated above the options --caps-retain
, and --caps-remove
are mutually exclusive.
Only one can be specified at a time.
Capabilities isolators can be added on the command line at run time by specifying the desired overriding set, as shown in this example:
$ sudo rkt run --interactive quay.io/coreos/alpine-sh --caps-retain CAP_NET_BIND_SERVICE
image: using image from file /usr/local/bin/stage1-coreos.aci
image: using image from local store for image name quay.io/coreos/alpine-sh
/ # whoami
root
/ # ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
ping: permission denied (are you root?)
Capability sets are application-specific configuration entries, and in a
rkt run
command line, they must follow the application container image to
which they apply.
Each application within a pod can have different capability sets.
Recommendations
As with most security features, capability isolators may require some application-specific tuning in order to be maximally effective. For this reason, for security-sensitive environments it is recommended to have a well-specified set of capabilities requirements and follow best practices:
- Always follow the principle of least privilege and, whenever possible, avoid running applications as root
- Only grant the minimum set of capabilities needed by an application, according to its typical usage
- Avoid granting overly generic capabilities. For example,
CAP_SYS_ADMIN
andCAP_SYS_PTRACE
are typically bad choices, as they open large attack surfaces. - Prefer a whitelisting approach, trying to keep the "retain-set" as small as possible.