Restrict syscalls with seccomp
"You shall not pass!" by Gandalf - 10/16/2022
Hi folks, today we will study seccomp, a Linux kernel security feature. This post is a technical compendium with some references that can help you in this odyssey. Seccomp is a simple sandboxing tool in the Linux kernel, available since Linux 2.6.12. When a process enables seccomp in Mode 1, aka Strict, it enters a "secure mode" where only a very small number of system calls are available (exit(), read(), write() and sigreturn()). Writing code to work in this environment is difficult; for example, a simple dynamic memory allocation is not possible, because malloc() relies internally on brk() or mmap().
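To make Mode 1 concrete, here is a minimal sketch in C. It is only an illustration under some assumptions: the helper name run_strict_child() is hypothetical, the kernel must be built with seccomp support, and the code forks so that only the child is locked down. Note that prctl() refuses strict mode with EINVAL when the process already runs under a seccomp filter (for example inside a container with a seccomp profile), so the child reports that case with exit code 2.

```c
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: forks a child, switches it to strict mode and
 * returns the child's wait status (or -1 if fork/waitpid failed).
 * With misbehave != 0 the child tries a syscall outside the strict set. */
static int run_strict_child(int misbehave)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2); /* not restricted yet, so _exit() still works */

        static const char msg[] = "hello from strict mode\n";
        ssize_t n = write(STDOUT_FILENO, msg, sizeof(msg) - 1); /* allowed */
        (void)n;

        if (misbehave)
            syscall(SYS_getpid); /* not in the strict set: SIGKILL */

        /* glibc's _exit() issues exit_group(), which strict mode forbids,
         * so the plain exit syscall is invoked directly. */
        syscall(SYS_exit, 0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

With this sketch, run_strict_child(0) should report a child that exited normally, and run_strict_child(1) a child killed by SIGKILL, which is how the kernel punishes any syscall outside the strict set.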
On the other hand, Mode 2, seccomp BPF, was introduced in Linux kernel 3.5 for x86_64 systems and in Linux kernel 3.10 for ARM systems. It lets userspace create a policy and send it to the kernel, defining which syscalls are permitted, which arguments are allowed for those syscalls, and what action should be taken when a syscall violates the policy. The filter comes in the form of BPF bytecode, a particular instruction set that is interpreted in the kernel and used to implement filters. This feature is used in the Chrome, Firefox and OpenSSH sandboxes on Linux, for example.
Seccomp BPF is a more recent extension to seccomp, which allows filtering system calls with BPF (Berkeley Packet Filter) programs. These filters can be used to allow or deny an arbitrary set of system calls, as well as to filter on system call arguments (numeric values only; pointer arguments can't be dereferenced). Another point: system call filtering isn't a sandbox. It provides a clearly defined mechanism for minimizing the exposed kernel surface, and it is meant to be a tool for sandbox developers to use. All right, for more in-depth information, we can read more at this URL:
The allowlist approach is a good course if we have a strategy for listing the proper syscalls and open-source tools to create this mapping. A custom profile like that gives more security, but also more chances to crash the program when a needed syscall is missing from the list. If the project where it is implemented doesn't have mature TDD, maybe it is not a good course.
The blocklist approach: we can read an example in this post. It is not hard to write, and it is easier to create a proper strategy for the restrictions.
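The two approaches are really the same BPF program with the actions swapped, and the following C sketch tries to show that. It is not the code of this post: install_list_filter(), getppid_is_denied(), NATIVE_ARCH and MAX_LISTED are hypothetical names, only x86-64 and AArch64 are covered, and the demo forks so the filter stays in the child.

```c
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* constant for this architecture"
#endif

#define MAX_LISTED 32

/* Hypothetical helper: every syscall in nrs[] gets the action `listed`,
 * everything else gets the action `others`. */
static int install_list_filter(const int *nrs, size_t n,
                               uint32_t listed, uint32_t others)
{
    struct sock_filter prog[4 + MAX_LISTED + 2];
    size_t i, len = 0;

    if (n > MAX_LISTED) {
        errno = E2BIG;
        return -1;
    }
    /* Syscall numbers differ between ABIs, so foreign ones are killed. */
    prog[len++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                      offsetof(struct seccomp_data, arch));
    prog[len++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                      NATIVE_ARCH, 1, 0);
    prog[len++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                      SECCOMP_RET_KILL_PROCESS);
    prog[len++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                      offsetof(struct seccomp_data, nr));
    /* On a match, jump forward over the remaining tests and `others`. */
    for (i = 0; i < n; i++)
        prog[len++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                          (uint32_t)nrs[i], (uint8_t)(n - i), 0);
    prog[len++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, others);
    prog[len++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, listed);

    struct sock_fprog fprog = { .len = (unsigned short)len, .filter = prog };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
}

/* Hypothetical demo: the child installs the filter and tries getppid().
 * Returns 0 if it was refused with EPERM, 1 if it went through,
 * 2 if the filter could not be installed, -1 on fork/wait problems. */
static int getppid_is_denied(int as_allowlist)
{
    static const int allow[] = { SYS_write, SYS_exit_group };
    static const int block[] = { SYS_getppid };
    const uint32_t deny = SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        int rc = as_allowlist
            ? install_list_filter(allow, 2, SECCOMP_RET_ALLOW, deny)
            : install_list_filter(block, 1, deny, SECCOMP_RET_ALLOW);
        if (rc != 0)
            _exit(2);
        errno = 0;
        long r = syscall(SYS_getppid);
        _exit(r == -1 && errno == EPERM ? 0 : 1);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Both getppid_is_denied(1) and getppid_is_denied(0) should return 0, but for different reasons: the allowlist denies getppid() because nobody listed it, the blocklist because somebody did. That difference is the whole trade-off: forgetting a syscall in an allowlist breaks the program, forgetting one in a blocklist leaves a hole. On x86-64 a production blocklist would also have to deal with the x32 ABI, which reports the same arch value; the pitfalls section of the kernel documentation covers it.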
Filters by inspecting arguments:
Filtering syscall arguments with seccomp BPF costs performance, because the extra comparisons run on every filtered syscall. Also, only numeric values can be compared; a string behind a pointer argument is out of reach of the filter.
Filtering a syscall like connect(), for example, to inspect the addr argument and block intrusions (addr checked against an allowlist), is clever, but it can cause TOCTOU pitfalls and may need mutex locks. Yes, I thought about it when I created an LKM generator for HiddenFirewall. So a good course for beginners is to use only an allowlist or a blocklist, and to test each strategy in a controlled environment before going to production, to prevent crashes.
Other courses to intercept syscall arguments are more complex. Based on my experience, I do not recommend ptrace(), ftrace() or kprobes in this context for beginners, because they cost performance too and need a proper strategy to succeed.
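As an illustration of filtering on a numeric argument, the following C sketch allows socket() only for the AF_UNIX domain and refuses every other domain with EACCES. The policy itself and the names install_socket_domain_filter(), socket_domain_demo() and NATIVE_ARCH are assumptions made for this example. It also assumes a little-endian 64-bit target (x86-64 or AArch64), where the low 32 bits of an argument sit at the start of args[n].

```c
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* constant for this architecture"
#endif

/* Hypothetical policy: socket() is only allowed when domain == AF_UNIX. */
static int install_socket_domain_filter(void)
{
    struct sock_filter prog[] = {
        /* 0-2: kill anything that is not the native ABI */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* 3-4: anything that is not socket() skips to ALLOW (index 7) */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_socket, 0, 2),
        /* 5-6: load the low 32 bits of the first argument; domain is an
         * int, so the upper half is irrelevant here */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[0])),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_UNIX, 0, 1),
        /* 7-8: the two possible verdicts */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),
    };
    struct sock_fprog fprog = {
        .len = (unsigned short)(sizeof(prog) / sizeof(prog[0])),
        .filter = prog,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
}

/* Hypothetical demo, run in a child so the caller stays unfiltered.
 * Returns 0 when AF_INET is refused and AF_UNIX still works. */
static int socket_domain_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install_socket_domain_filter() != 0)
            _exit(2);
        int inet = socket(AF_INET, SOCK_STREAM, 0);
        int inet_denied = (inet == -1 && errno == EACCES);
        int local = socket(AF_UNIX, SOCK_STREAM, 0);
        _exit(inet_denied && local >= 0 ? 0 : 1);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The same pattern does not work for connect(): its addr argument is a pointer, the filter only sees the pointer value, and whatever it points to can change after the check. That is the TOCTOU pitfall mentioned above.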
This example uses seccomp to filter a small list of syscalls, and the function main() shows what happens when we call functions like printf() (which ends in write()) and system() (which ends in execve()). Look at the following:
Compile the code and run it. Look at the following:
The seccomp mode is enabled via the prctl() system call using the PR_SET_SECCOMP argument, or via the seccomp() system call. The blocklist used in this source code acts on syscalls like write(), chmod(), chown(), execve() and symlink().
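BPF_JUMP() - Macro that builds a conditional jump instruction. It compares the loaded value (for example, the syscall number) and decides in which scenario a rule will be effective.
BPF_STMT() - Macro that builds a non-branching instruction, such as loading a field of the syscall data or returning the action of a rule.
SECCOMP_RET_ERRNO - Action that makes the syscall fail with the given errno when the rule matches.
SECCOMP_RET_LOG - Optional action that allows the syscall and saves it in the log when the rule matches.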
We can insert this rule example into the filter if we need to save a log during the event. Look at the following:
To see the logs, we can run the following:
SECCOMP_RET_KILL_PROCESS - Optional action that kills the whole process when the rule matches.
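The action macros listed above are plain 32-bit values: the upper 16 bits select the action and the lower 16 bits carry data, such as the errno for SECCOMP_RET_ERRNO. The small C sketch below only illustrates that encoding. The helper names deny_with(), action_data() and action_name() are hypothetical, and the constants assume kernel headers from Linux 4.14 or later.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <stdint.h>

/* Builds the filter return value "fail the syscall with errno err". */
static uint32_t deny_with(int err)
{
    return SECCOMP_RET_ERRNO | ((uint32_t)err & SECCOMP_RET_DATA);
}

/* Extracts the data half (for SECCOMP_RET_ERRNO, the errno value). */
static uint32_t action_data(uint32_t ret)
{
    return ret & SECCOMP_RET_DATA;
}

/* Gives a readable name to the action half of a filter return value. */
static const char *action_name(uint32_t ret)
{
    switch (ret & SECCOMP_RET_ACTION_FULL) {
    case SECCOMP_RET_KILL_PROCESS: return "kill_process";
    case SECCOMP_RET_KILL_THREAD:  return "kill_thread";
    case SECCOMP_RET_ERRNO:        return "errno";
    case SECCOMP_RET_LOG:          return "log";
    case SECCOMP_RET_ALLOW:        return "allow";
    default:                       return "other";
    }
}
```

So a rule that should fail with EPERM returns deny_with(EPERM), and switching the same rule to logging or killing only means returning SECCOMP_RET_LOG or SECCOMP_RET_KILL_PROCESS instead.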
All right, the last function calls, printf() and system(), do not run. Yes, because these functions end up using the syscalls write() and execve(). Using a tool like "strace", we can validate this point. We can also see it by setting breakpoints with a proper debugger like "GDB", but remember, we need to comment out line 79 of the function init_call_filter() for a practical overview of the action of seccomp's blocklist.
Another point of attention when working with seccomp BPF: the boring fact is that the filter must check the architecture properly. Look at the following:
Without these bizarre macros, we have a portability problem, which is a big pitfall. The official documentation has a point about it. Please look at the following: https://www.kernel.org/doc/html/v4.18/userspace-api/seccomp_filter.html#pitfalls
More cool stuff around macros:
Kafel - https://github.com/google/kafel
From my friend hc0d3r - https://github.com/hc0d3r/seccomp-macros
In this example, we follow another approach, using libseccomp for modern code. Yes, fewer lines of code and a more human approach:
Before compiling, please install the library. Look at the following:
Link the external library, then compile and run like the following:
All right, the function system() uses the syscall execve(), which is not present in the allowlist, so seccomp ends the process with a "bad system call". Mission complete here.
A little analysis with seccomp-tools:
All right, the fantastic point about libseccomp is that it can be used from other programming languages, like Python, Rust and Go.
So seccomp has an everyday use in a sandbox context. Sandboxing is a technique to isolate specific programs to prevent a vulnerability from compromising the rest of the system by restricting access to unnecessary resources. All common browsers nowadays include a sandbox and utilise a multi-process architecture. The browser splits itself into different processes (e.g. the content process, GPU process, RDD process, etc.) and sandboxes them individually, strictly adhering to the principle of least privilege.
A browser must use a sandbox, as it processes untrusted input by design, poses an enormous attack surface and is one of the most used applications on the system. Without a sandbox, any exploit in the browser can be used to take over the rest of the system.
Back to the real world, looking beyond Chrome and Firefox: filtering syscalls is essential to control security in software. Maybe we can restrict a syscall like execve(), or the ones used by coreutils like chmod, chown and so on. Imagine a developer with bad intentions commits a call to execve() for a web shell that executes input received by HTTP POST, or another resource that smells like a backdoor. If the software has a seccomp restriction on the syscall execve(), that code and any derivative cannot run in the tested software. There are other ways to insert the seccomp BPF filter, for example in Kubernetes, but my favourite model is writing the seccomp filter in the software itself.
Another possible scenario: we can block a syscall so that nothing gives permission to execute or write. It's a piece of cake, we just need to filter syscalls like chmod().
In the third scenario, when we do some upload using socket() with the sendfile() syscall, we can create a proper context to audit these inputs. Yes, open-source tools exist to put a limit in this context:
At this point, there are many courses to find out which syscalls a program uses on Linux, for example hooking applications with LD_PRELOAD, ptrace() or ftrace() to intercept the calls, like the strace tool does. Another course uses static ways, for example parsing the ELF file or other resources. All right, look at the following:
Using this on the seccomp BPF example of this post, we can view the result in the following:
So with these functions, a programmer can correlate and get all the syscalls. Now you ask: hey Cooler, do you know if there can be another approach? Maybe using the Capstone framework, which is used as an engine by many tools like radare, IDA Pro and so on.
https://elixir.bootlin.com/linux/v6.0.3/C/ident/__NR_syscalls
Syscall intercept with lib capstone - https://github.com/pmem/syscall_intercept
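To show why a real disassembler such as Capstone helps here, this is a deliberately naive static scan in C. It is only a sketch and not what the tools above do: scan_syscalls() is a hypothetical helper that looks for the x86-64 `syscall` opcode (bytes 0x0f 0x05) in a raw buffer and recovers the syscall number only when a `mov eax, imm32` (0xb8 followed by four bytes) sits immediately before it.

```c
#include <stddef.h>

/* Hypothetical helper: scans raw x86-64 code for the `syscall` opcode.
 * For each hit, out_nrs[] receives the syscall number when the opcode is
 * directly preceded by `mov eax, imm32`, or -1 when it is unknown.
 * Returns the number of hits stored (at most max). */
static size_t scan_syscalls(const unsigned char *code, size_t len,
                            long *out_nrs, size_t max)
{
    size_t i, hits = 0;

    for (i = 0; i + 1 < len && hits < max; i++) {
        if (code[i] != 0x0f || code[i + 1] != 0x05)
            continue;

        long nr = -1;
        if (i >= 5 && code[i - 5] == 0xb8) {
            /* little-endian 32-bit immediate of the mov */
            nr = (long)code[i - 4]
               | ((long)code[i - 3] << 8)
               | ((long)code[i - 2] << 16)
               | ((long)code[i - 1] << 24);
        }
        out_nrs[hits++] = nr;
    }
    return hits;
}
```

For the bytes `b8 3c 00 00 00 0f 05` the scan reports syscall number 60, which is exit on x86-64. Without decoding instruction boundaries, a scan like this also hits 0x0f 0x05 pairs inside immediates or data, and it loses the number as soon as another instruction sits between the mov and the syscall. A proper disassembler avoids both problems.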
Another possible course is reading the source code and tokenising each chunk to extract the function names, then correlating them through a hash table with the mnemonic entries of the syscall table. Maybe we can use LLVM, or crazy parsers in the OCaml language.
libseccomp - https://github.com/seccomp/libseccomp (Go, Python and so on)
Infect to protect, infect an ELF to force seccomp to restrict syscalls. This resource is very interesting and is present in POC|GTFO book - https://github.com/lpereira/infect-to-protect
Android (it is Linux, runs OK) - https://android-developers.googleblog.com/2017/07/seccomp-filter-in-android-o.html (we can insert the implementation in Kotlin, Java, C++ and so on)
lib Kore - https://kore.io/ ( AWESOME PROJECT, really I am a big fan of this project )
Seccomp in elastic - https://www.elastic.co/blog/seccomp-in-the-elastic-stack
With these empirical points around seccomp in mind: yes, there are weak points and ways to bypass and escape from seccomp, and yes, seccomp has more vulnerabilities than Capsicum from FreeBSD or pledge from OpenBSD, which have the same purpose if we look at sandboxing. But seccomp is more popular. We can look at the Android context.
Remember, using any protection, restriction or defensive resource is better than using nothing. Another good option, in addition, is MAC, "Mandatory Access Control", like AppArmor, SELinux and TOMOYO. MAC is a framework for defining what a program can and cannot do on a whitelist basis. A program is represented as a subject. Anything the program wants to act on, such as a file, path, network interface, or port, is represented as an object. The rules for accessing the object are called permissions or flags.
To check the security side, a good indicator is listing CVEs and studying each context and possibility, like a binary without proper hardening such as full RELRO, or lacking other protections. Beyond these points, it is worth trying to bypass the protections of a binary: using seccomp-tools, trying to understand the rules and the possibilities to bypass them, maybe trying patching. There is a list of CTF games with a seccomp theme, and this kind of challenge is the best course to learn bypassing in a practical context.
CVE-2022-30594
https://nvd.nist.gov/vuln/detail/CVE-2022-30594
MEDIUM
The Linux kernel before 5.17.2 mishandles seccomp permissions. The PTRACE_SEIZE code path allows attackers to bypass intended restrictions on setting the PT_SUSPEND_SECCOMP flag.
CVE-2021-41133
https://nvd.nist.gov/vuln/detail/CVE-2021-41133
MEDIUM
Flatpak is a system for building
CVE-2020-0261
https://nvd.nist.gov/vuln/detail/CVE-2020-0261
HIGH
In C2 flame devices
CVE-2019-7303
https://nvd.nist.gov/vuln/detail/CVE-2019-7303
MEDIUM
A vulnerability in the seccomp filters of Canonical snapd before version 2.37.4 allows a strict mode snap to insert characters into a terminal on a 64-bit host. The seccomp rules were generated to match 64-bit ioctl(2) commands on a 64-bit platform
CVE-2017-18367
https://nvd.nist.gov/vuln/detail/CVE-2017-18367
MEDIUM
libseccomp-golang 0.9.0 and earlier incorrectly generates BPFs that OR multiple arguments rather than ANDing them. A process running under a restrictive seccomp filter that specified multiple syscall arguments could bypass intended access restrictions by specifying a single matching argument.
CVE-2019-10145
https://nvd.nist.gov/vuln/detail/CVE-2019-10145
MEDIUM
rkt through version 1.30.0 does not isolate processes in containers that are run with `rkt enter`. Processes run with `rkt enter` do not have seccomp filtering during stage 2 (the actual environment in which the applications run). Compromised containers could exploit this flaw to access host resources.
CVE-2018-15746
https://nvd.nist.gov/vuln/detail/CVE-2018-15746
LOW
qemu-seccomp.c in QEMU might allow local OS guest users to cause a denial of service (guest crash) by leveraging mishandling of the seccomp policy for threads other than the main thread.
CVE-2019-12589
https://nvd.nist.gov/vuln/detail/CVE-2019-12589
MEDIUM
In Firejail before 0.9.60
CVE-2019-2054
https://nvd.nist.gov/vuln/detail/CVE-2019-2054
MEDIUM
In the seccomp implementation prior to kernel version 4.8
CVE-2019-9893
https://nvd.nist.gov/vuln/detail/CVE-2019-9893
HIGH
libseccomp before 2.4.0 did not correctly generate 64-bit syscall argument comparisons using the arithmetic operators (LT
Note: this list was made with the NVD API.
Thank you for reading.
Cheers!