Restrict syscalls with seccomp

"You shall not pass!" by Gandalf - 10/16/2022

Hi folks, today we will study seccomp, a Linux kernel security feature. This post is a technical compendium with some references that can help you in this odyssey. seccomp is a simple sandboxing tool in the Linux kernel, available since Linux 2.6.12. When seccomp is enabled in Mode 1, aka Strict, the process enters a "secure mode" where only a very small number of system calls are available (_exit(), read(), write(), sigreturn()). Writing code to work in this environment is difficult; for example, a simple dynamic memory allocation is not possible, because malloc() internally relies on brk() or mmap().
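To make the strict-mode limits concrete, here is a minimal sketch in C. The helper name strict_mode_demo() and the exit code 42 are my own choices for illustration, not something from the kernel samples. It forks a child, switches the child to Mode 1 with prctl(), performs an allowed write(), and then tries getpid(), which is outside the four permitted calls, so the kernel should answer with SIGKILL.

```c
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo helper: returns the signal that killed the child,
 * 0 if the child exited normally, or -1 if strict mode is unavailable. */
int strict_mode_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(42); /* strict mode could not be enabled */

        static const char msg[] = "write() still works in strict mode\n";
        syscall(SYS_write, STDOUT_FILENO, msg, sizeof(msg) - 1); /* allowed */

        syscall(SYS_getpid); /* not allowed: the kernel kills the child */
        syscall(SYS_exit, 0); /* never reached */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    if (WIFEXITED(status) && WEXITSTATUS(status) == 42)
        return -1;
    return 0;
}
```

Calling strict_mode_demo() from a small main() should return SIGKILL on a kernel where strict mode is available. The raw syscall() wrapper is used on purpose: even glibc's _exit() calls exit_group(), which strict mode does not permit.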

Mode 2, seccomp BPF, on the other hand, was introduced in Linux kernel version 3.5 for x86_64 systems and in version 3.10 for ARM systems. It involves a userspace-created policy being sent to the kernel, defining which syscalls are permitted, what arguments are allowed for those syscalls, and what action should be taken in the case of a syscall violation. The filter comes in the form of BPF bytecode, a particular instruction set that is interpreted in the kernel and used to implement filters. This feature is used in the Chrome, Firefox and OpenSSH sandboxes on Linux, for example.

Seccomp BPF is a more recent extension to seccomp, which allows filtering system calls with BPF (Berkeley Packet Filter) programs. These filters can be used to allow or deny an arbitrary set of system calls, as well as filter on system call arguments (numeric values only; pointer arguments can't be dereferenced). Another point: system call filtering isn't a sandbox. It provides a clearly defined mechanism for minimizing the exposed kernel surface, and it is meant to be a tool for sandbox developers to use. For more in-depth information, we can read more at this URL:

Allowlist or blocklist?

The allowlist approach is a good course if we have a strategy for listing the syscalls the program really needs, and open-source tools exist to create this mapping. A custom profile like that gives more security but also more chances to crash the program. If the project where it is implemented doesn't have mature TDD, maybe it is not a good course.

The blocklist approach: we can read an example in this post. It is not hard to write, nor to build a proper restriction strategy around it.
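The difference between the two approaches comes down to what the filter does by default. The sketch below builds both kinds of filter from the same syscall list without installing anything. The function build_list_filter() and its parameters are hypothetical names of mine; the layout (check arch, load the syscall number, compare against each entry, then a default action and a match action) is one possible arrangement, not the only one.

```c
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>

/* Hypothetical helper: writes n + 6 instructions into out.
 * allowlist != 0: listed syscalls are allowed, everything else is killed.
 * allowlist == 0: listed syscalls fail with EPERM, everything else is allowed.
 * Returns the number of instructions written, or 0 if it does not fit. */
size_t build_list_filter(struct sock_filter *out, size_t cap, uint32_t arch,
                         const int *nrs, size_t n, int allowlist)
{
    /* jt is an 8-bit offset, so keep the list short in this sketch. */
    if (n == 0 || n > 255 || cap < n + 6)
        return 0;

    uint32_t on_match = allowlist ? SECCOMP_RET_ALLOW
                                  : (SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA));
    uint32_t on_miss = allowlist ? SECCOMP_RET_KILL_PROCESS : SECCOMP_RET_ALLOW;
    size_t pc = 0;

    /* A call from an unexpected architecture is always killed. */
    out[pc++] = (struct sock_filter)BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
                                             offsetof(struct seccomp_data, arch));
    out[pc++] = (struct sock_filter)BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, arch, 1, 0);
    out[pc++] = (struct sock_filter)BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL_PROCESS);

    /* Compare the syscall number with every entry; a hit jumps to on_match. */
    out[pc++] = (struct sock_filter)BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
                                             offsetof(struct seccomp_data, nr));
    for (size_t i = 0; i < n; i++)
        out[pc++] = (struct sock_filter)BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K,
                                                 (uint32_t)nrs[i], (uint8_t)(n - i), 0);

    out[pc++] = (struct sock_filter)BPF_STMT(BPF_RET + BPF_K, on_miss);  /* default */
    out[pc++] = (struct sock_filter)BPF_STMT(BPF_RET + BPF_K, on_match); /* listed */
    return pc;
}
```

With the list { SYS_read, SYS_write } and AUDIT_ARCH_X86_64, both modes produce eight instructions; only the last two return values swap roles. That swap is why a forgotten syscall crashes an allowlisted program but silently slips through a blocklist.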

Filters by inspecting arguments:

Filtering on syscall arguments with seccomp BPF adds overhead to every call, and anything that needs to compare strings, for example, costs even more, because the filter only sees raw register values and cannot dereference pointers.
Filtering a syscall like connect() by inspecting its addr argument, to block connections outside an address allowlist, is clever, but it can run into TOCTOU pitfalls and may need mutex locks. Yes, I thought about this when I created an LKM generator for HiddenFirewall. A good course for beginners is to use only a plain allowlist or blocklist and to test each strategy in a controlled environment before going to production, to prevent crashes.
Other courses to intercept syscall arguments are more complex. Based on my experience, I do not recommend ptrace(), ftrace() or kprobes in this context for beginners, because they also cost performance and need a proper strategy to succeed.
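Numeric arguments, on the other hand, are safe to inspect, because the value lives in a register and cannot change after the check. The sketch below is an assumption-laden example of mine: it denies socket() only when the first argument (the domain) is AF_INET6, and it assumes a little-endian x86_64 or aarch64 build, where the low 32 bits of args[0] sit at offsetof(struct seccomp_data, args[0]). The helper name arg_filter_demo() and its exit codes are hypothetical. The filter is installed in a forked child so the caller stays unfiltered.

```c
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define DEMO_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define DEMO_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only covers x86_64 and aarch64"
#endif

/* Hypothetical demo helper: returns 0 if the filter behaved as expected,
 * 1 if it did not, 2 if the filter could not be installed, -1 on other errors. */
int arg_filter_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Note: this sketch does not handle the x32 ABI on x86_64. */
        struct sock_filter filter[] = {
            BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, arch)),
            BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, DEMO_AUDIT_ARCH, 1, 0),
            BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL_PROCESS),
            BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, nr)),
            BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_socket, 0, 3), /* not socket(): allow */
            BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, args[0])),
            BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, AF_INET6, 0, 1),    /* other domain: allow */
            BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),
            BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
            .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
            .filter = filter,
        };

        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(2);

        int denied = socket(AF_INET6, SOCK_STREAM, 0); /* should fail with EACCES */
        int denied_errno = errno;
        int allowed = socket(AF_UNIX, SOCK_STREAM, 0); /* should still work */
        _exit((denied == -1 && denied_errno == EACCES && allowed >= 0) ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The same pattern does not carry over to connect(): its addr argument is a pointer, so the filter would only see an address in user memory, not the sockaddr behind it.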

The first proof of concept blocklist of syscalls

This example uses seccomp to filter a short list of syscalls, and main() shows what happens when we call functions that depend on them, like printf() (which uses write()) and system() (which uses execve()). Look at the following:

test.c

// Coded by CoolerVoid, based on the official seccomp examples
#include <errno.h>
#include <linux/audit.h>
#include <linux/bpf.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <linux/unistd.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <unistd.h>

#if defined(__i386__)
#define SECCOMP_AUDIT_ARCH AUDIT_ARCH_I386
#elif defined(__x86_64__)
#define SECCOMP_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__arm__)
#define SECCOMP_AUDIT_ARCH AUDIT_ARCH_ARM
#elif defined(__aarch64__)
#define SECCOMP_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#warning "seccomp: unsupported platform"
#define SECCOMP_AUDIT_ARCH 0
#endif

// Based on https://github.com/torvalds/linux/tree/master/samples/seccomp
// Installs one filter that makes syscall_label fail with the errno in `error`.
static int add_call_filter(int syscall_label, int arch, int error)
{
  struct sock_filter filter[] = {
    // Load the architecture; if it does not match, skip 3 instructions (to ALLOW).
    BPF_STMT(BPF_LD + BPF_W + BPF_ABS, (offsetof(struct seccomp_data, arch))),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, arch, 0, 3),
    // Load the syscall number; if it is not the blocked one, skip 1 instruction (to ALLOW).
    BPF_STMT(BPF_LD + BPF_W + BPF_ABS, (offsetof(struct seccomp_data, nr))),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, syscall_label, 0, 1),
    // Blocked syscall: return an error. To kill the process instead,
    // use SECCOMP_RET_KILL_PROCESS as the return value here.
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ERRNO | (error & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
  };
  
  struct sock_fprog prog = {
  .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
  .filter = filter,
  };

  if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
  {
      perror("prctl(NO_NEW_PRIVS)");
      return 1;
  }

  if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
  {
      perror("prctl(PR_SET_SECCOMP)");
      return 1;
  }
  
  return 0;
}

// To understand this function, see "$ man seccomp"
void init_call_filter()
{
  // Syscall list: https://github.com/torvalds/linux/blob/master/include/linux/syscalls.h
  // Note: aarch64 has no unlink/symlink/chown/chmod syscalls, only the *at variants.
    add_call_filter(__NR_unlink, SECCOMP_AUDIT_ARCH , EPERM);
    add_call_filter(__NR_write, SECCOMP_AUDIT_ARCH , EPERM);
    add_call_filter(__NR_symlink, SECCOMP_AUDIT_ARCH , EPERM);
    // Permission calls: prevent e.g. a path traversal ending in chmod 777 on /etc/passwd, /etc/shadow...
    add_call_filter(__NR_chown, SECCOMP_AUDIT_ARCH , EPERM);
    add_call_filter(__NR_chmod, SECCOMP_AUDIT_ARCH , EPERM);
    // Useful against overflow exploits that use the ret2libc technique to reuse libc calls like system(),
    // because it blocks execve(), which is used by system(), popen() and so on.
    add_call_filter(__NR_execve, SECCOMP_AUDIT_ARCH , EPERM);
    // Other ideas: block ports in socket()/bind() by filtering on arguments and reading each register
}

int main()
{
    printf("All right here, stay strong like a nail in the sand!\n");
    init_call_filter();
    
    // Dangerous function that creates a file; the idea is to block RCE and other risky points
    system("touch bazinga"); // system() is a libc function that uses the execve() syscall to run a command in the operating system
    printf("something's gonna happen!!\n");
    printf("it will not definitely print this here\n");

    return 0;
}

Compile the code and run it. Look at the following:

cooler@ubuntu:~/codes/poc$ gcc -o test test.c; ./test 
All right here, stay strong like nail in the sand!
Bad system call

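Seccomp mode is enabled via the prctl() system call with the PR_SET_SECCOMP argument, or via the seccomp() system call. The blocklist in this source code covers unlink(), write(), symlink(), chown(), chmod() and execve(). Note that "Bad system call" is what the shell reports when a process dies from SIGSYS, so the output above corresponds to the kill action (SECCOMP_RET_KILL_PROCESS); with SECCOMP_RET_ERRNO the blocked calls simply fail with EPERM and the process keeps running.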

 struct sock_filter filter[] = {
    BPF_STMT(BPF_LD + BPF_W + BPF_ABS, (offsetof(struct seccomp_data, arch))),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, arch, 0, 3),
    BPF_STMT(BPF_LD + BPF_W + BPF_ABS, (offsetof(struct seccomp_data, nr))),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, syscall_label, 0, 1),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ERRNO | (error & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW), 
  };

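BPF_JUMP() - builds a conditional jump: it compares the loaded value with a constant and skips the given number of instructions when the test is true or false. This is how the rule selects the architecture and the syscall it applies to.
BPF_STMT() - builds a single non-branching instruction: here it either loads a field of struct seccomp_data (arch or nr) or returns an action.
SECCOMP_RET_ERRNO - return action that makes the filtered syscall fail with the given errno.
SECCOMP_RET_LOG - optional return action that allows the syscall and saves it in the log.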
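One way to check your reading of the jump offsets is to walk the six instructions by hand, or with a tiny userspace model. The sketch below is my own illustration, not the kernel's interpreter: run_filter() understands only the three instruction forms the PoC uses, and check_syscall() is a hypothetical wrapper that rebuilds the same filter and feeds it a fake struct seccomp_data.

```c
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>

/* Hypothetical model of the classic BPF subset used above:
 * 32-bit absolute load, JEQ against a constant, RET constant. */
static uint32_t run_filter(const struct sock_filter *f, size_t len,
                           const struct seccomp_data *data)
{
    uint32_t acc = 0;
    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *ins = &f[pc];
        switch (ins->code) {
        case BPF_LD + BPF_W + BPF_ABS:
            if (ins->k > sizeof(*data) - sizeof(acc))
                return SECCOMP_RET_KILL_PROCESS;
            memcpy(&acc, (const char *)data + ins->k, sizeof(acc));
            break;
        case BPF_JMP + BPF_JEQ + BPF_K:
            /* Offsets are relative to the next instruction. */
            pc += (acc == ins->k) ? ins->jt : ins->jf;
            break;
        case BPF_RET + BPF_K:
            return ins->k;
        default:
            return SECCOMP_RET_KILL_PROCESS; /* unsupported in this sketch */
        }
    }
    return SECCOMP_RET_KILL_PROCESS; /* fell off the end */
}

/* Builds the PoC filter for (filter_arch, blocked_nr) and evaluates it
 * against a call described by (call_arch, call_nr). */
uint32_t check_syscall(uint32_t filter_arch, int blocked_nr,
                       uint32_t call_arch, int call_nr)
{
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, filter_arch, 0, 3),
        BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, (uint32_t)blocked_nr, 0, 1),
        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
    };
    struct seccomp_data data = { .nr = call_nr, .arch = call_arch };
    return run_filter(filter, sizeof(filter) / sizeof(filter[0]), &data);
}
```

Feeding it execve with a matching architecture returns SECCOMP_RET_ERRNO | EPERM, and any other syscall returns SECCOMP_RET_ALLOW. It also shows a detail worth knowing: when the architecture does not match, the jump of 3 lands on ALLOW, so this filter lets foreign-architecture calls through rather than killing them.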

We can insert this rule into the filter if we need to save a log during the event. Look at the following:

        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_LOG),

To see the logs, we can run the following:

$ sudo ausearch -ui $USER --format text --start recent -c your_binary_program_name
or
$ sudo ausearch --format text --start recent -c your_binary

SECCOMP_RET_KILL_PROCESS - optional return action that kills the process when an event triggers the rule.

All right, the last calls in main(), system() and printf(), do not run, because these functions need the syscalls execve() and write(). We can validate this point with a tool like "strace", or by setting breakpoints with a proper debugger like "GDB". But remember, for a practical overview of the blocklist in action, we need to comment out line 79, in function init_call_filter().

Another point of attention: when working with seccomp BPF, the boring fact is that the filter must check the proper architecture. Look at the following:


#if defined(__i386__)
#define SECCOMP_AUDIT_ARCH AUDIT_ARCH_I386
#elif defined(__x86_64__)
#define SECCOMP_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__arm__)
#define SECCOMP_AUDIT_ARCH AUDIT_ARCH_ARM
#elif defined(__aarch64__)
#define SECCOMP_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#warning "seccomp: unsupported platform"
#define SECCOMP_AUDIT_ARCH 0
#endif

Without these bizarre macros we have a portability problem, which is a big pitfall, because syscall numbers differ between architectures. The official documentation has a section about it. Please look at the following: https://www.kernel.org/doc/html/v4.18/userspace-api/seccomp_filter.html#pitfalls

More cool stuff around writing filters with macros:

Kafel - https://github.com/google/kafel
From my friend hc0d3r - https://github.com/hc0d3r/seccomp-macros

The second PoC allowlist of syscalls

In this example we follow another approach, using libseccomp for more modern code: fewer lines and a more human-friendly API:

allowlist.c

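#include <seccomp.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
 // Default action: kill on any syscall that is not explicitly allowed
 scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
 if (ctx == NULL)
  return 1;
 // Allow only write(), then load the filter into the kernel
 if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0) < 0 || seccomp_load(ctx) < 0)
 {
  seccomp_release(ctx);
  return 1;
 }
 syscall(SYS_write, 1, "TEST here\n", 10); // write() is on the allowlist
 system("touch erorr.log"); // not on the allowlist: the process is killed here
 return 0;
}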

Before compiling, please install the library. Look at the following:

 $ sudo apt install libseccomp-dev
 or, on RPM-based systems:
 $ sudo yum install libseccomp-devel

Link against the library, then compile and run like the following:

$ gcc -o allowlist allowlist.c -lseccomp; ./allowlist
TEST here
Bad system call (core dumped)

All right: system() needs syscalls such as execve() that are not on the allowlist, so seccomp treats it as a "bad system call" and kills the process. Mission complete here.

A little analysis with seccomp-tools:

$ seccomp-tools dump ./allowlist
 line  CODE  JT   JF      K
=================================
 0000: 0x20 0x00 0x00 0x00000004  A = arch
 0001: 0x15 0x00 0x05 0xc000003e  if (A != ARCH_X86_64) goto 0007
 0002: 0x20 0x00 0x00 0x00000000  A = sys_number
 0003: 0x35 0x00 0x01 0x40000000  if (A < 0x40000000) goto 0005
 0004: 0x15 0x00 0x02 0xffffffff  if (A != 0xffffffff) goto 0007
 0005: 0x15 0x00 0x01 0x00000001  if (A != write) goto 0007
 0006: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0007: 0x06 0x00 0x00 0x00000000  return KILL
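The K column of the two return instructions holds the raw action value that the filter hands back to the kernel. A small sketch for decoding such values with the masks from linux/seccomp.h follows; the function names seccomp_action_name() and seccomp_action_data() are hypothetical, and the list of actions is deliberately not exhaustive.

```c
#include <linux/seccomp.h>
#include <stdint.h>

/* Hypothetical helper: name the action encoded in a filter return value. */
const char *seccomp_action_name(uint32_t ret)
{
    switch (ret & SECCOMP_RET_ACTION_FULL) {
    case SECCOMP_RET_KILL_PROCESS: return "KILL_PROCESS";
    case SECCOMP_RET_KILL_THREAD:  return "KILL_THREAD";
    case SECCOMP_RET_TRAP:         return "TRAP";
    case SECCOMP_RET_ERRNO:        return "ERRNO";
    case SECCOMP_RET_TRACE:        return "TRACE";
    case SECCOMP_RET_LOG:          return "LOG";
    case SECCOMP_RET_ALLOW:        return "ALLOW";
    default:                       return "UNKNOWN";
    }
}

/* The low 16 bits carry action data, for example the errno for ERRNO. */
uint16_t seccomp_action_data(uint32_t ret)
{
    return (uint16_t)(ret & SECCOMP_RET_DATA);
}
```

Applied to the dump, 0x7fff0000 decodes to ALLOW and 0x00000000 decodes to KILL_THREAD, which is the value seccomp-tools prints as KILL.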

All right, the fantastic point about libseccomp is that it can be used from other programming languages like Python, Rust and Go.

Why is it useful?

So seccomp has an everyday use in a sandbox context. Sandboxing is a technique to isolate specific programs to prevent a vulnerability from compromising the rest of the system by restricting access to unnecessary resources. All common browsers nowadays include a sandbox and utilise a multi-process architecture. The browser splits itself into different processes (e.g. the content process, GPU process, RDD process, etc.) and sandboxes them individually, strictly adhering to the principle of least privilege.

A browser must use a sandbox, as it processes untrusted input by design, poses an enormous attack surface and is one of the most used applications on the system. Without a sandbox, any exploit in the browser can be used to take over the rest of the system.

Back to the real world, beyond Chrome and Firefox: filtering syscalls is essential to control security in software. Maybe we can restrict a syscall like execve(), or the syscalls behind core utilities like chmod, chown and so on. Suppose a developer with bad intentions commits a call to execve() for a web shell that executes input received via HTTP POST, or some other resource that smells like a backdoor. If the software has a seccomp restriction on execve(), that call and any derivative cannot run in the tested software. There are other ways to insert a seccomp BPF filter, for example in Kubernetes, but my favourite model is writing the seccomp filter in the software itself.

Another possible scenario is preventing a program from granting execute or write permissions. It's a piece of cake: we only need to filter the syscalls behind chmod().

In a third scenario, when we do an upload using socket() with the sendfile() syscall, we can create a proper context to audit these inputs. Open-source tools exist to put limits in this context:

Firejail (netblue30/firejail on GitHub): Linux namespaces and seccomp-bpf sandbox

How do we enumerate syscalls of software?

At this point there are many courses to solve this in the Linux context, for example hooking the application with LD_PRELOAD, or tracing it with ptrace() (like the strace tool) or ftrace(). Another course uses static analysis, for example parsing the ELF file or other resources. All right, look at the following:

list_calls.py

#!/usr/bin/python3
# ELF parse example by CoolerVoid
# based in examples of https://github.com/eliben/pyelftools
from elftools.elf.elffile import ELFFile
from elftools.elf.relocation import RelocationSection
from elftools.elf.sections import NullSection
import argparse

# Relocation type 7 is R_X86_64_JUMP_SLOT on x86_64: the PLT entries of imported functions
JUMP_SLOT = 7

def list_funcs(elffile):
 tmp_array = []
 for section in elffile.iter_sections():
  if not isinstance(section, RelocationSection):
   continue
  symbol_table = elffile.get_section(section['sh_link'])
  if isinstance(symbol_table, NullSection):
   continue
  for rel in section.iter_relocations():
   symbol = symbol_table.get_symbol(rel['r_info_sym'])
   if rel['r_info_type'] == JUMP_SLOT:
    tmp_array.append(symbol.name)
 return tmp_array

def parse_elf(bin_file):
 with open(bin_file, "rb") as f:
  elffile = ELFFile(f)
  functions = list_funcs(elffile)
  print(functions)

def banner():
 print("\nSimple ELF parser to extract functions")
 print("\r$ python3 elf_parse.py -f binary_file\n")

def arguments():
 parser = argparse.ArgumentParser(description = banner())
 parser.add_argument('-f', '--file', action = 'store', dest = 'binfile',default='0',required = True, help = 'ELF file')
 args = parser.parse_args()
 return args.binfile


def main():
 try:
  file_in = arguments()
  parse_elf(file_in)
 except Exception as e:
  print(" log error : "+str(e))
  exit(0)

if __name__=="__main__":
 main()

Running this against the seccomp BPF example from this post gives the following result:

cooler@fedora:~/codes/poc$ python3 list_calls.py -f test

Simple ELF parser to extract functions
$ python3 elf_parse.py -f binary_file

['puts', '__stack_chk_fail', 'system', 'prctl', 'perror']

With these functions, a programmer can correlate each one with the syscalls it uses. Now you may ask: hey Cooler, is there another approach? Maybe using the Capstone framework, which is used by engines in many tools like radare, IDA Pro and so on.

https://elixir.bootlin.com/linux/v6.0.3/C/ident/__NR_syscalls

Syscall intercept with lib capstone - https://github.com/pmem/syscall_intercept
Another possible course is reading the source code and tokenising each chunk to extract function names, then correlating them through a hash table with the mnemonic names in the syscall table. Maybe we can use LLVM, or crazy parsers written in OCaml.
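The correlation step itself can be as simple as a lookup table from library function to syscall. The sketch below is a hand-written, incomplete illustration covering only the imports that list_calls.py reported for the first PoC; the struct and function names are hypothetical, and each entry lists only the main syscall behind the function (system(), for instance, also needs syscalls to create the child process and wait for it).

```c
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>

/* Hypothetical mapping entry: one libc function and its main syscall. */
struct func_syscall {
    const char *func;
    long nr;
    const char *syscall_name;
};

/* Illustrative table only; a real tool would need a far larger map
 * and would have to track the libc version in use. */
static const struct func_syscall func_table[] = {
    { "puts",   SYS_write,  "write"  },
    { "perror", SYS_write,  "write"  },
    { "system", SYS_execve, "execve" },
    { "prctl",  SYS_prctl,  "prctl"  },
};

/* Returns the table entry for func, or NULL when the function is unknown
 * (for example __stack_chk_fail, which has no entry here). */
const struct func_syscall *lookup_syscall(const char *func)
{
    for (size_t i = 0; i < sizeof(func_table) / sizeof(func_table[0]); i++)
        if (strcmp(func_table[i].func, func) == 0)
            return &func_table[i];
    return NULL;
}
```

Looping over the names printed by the ELF parser and collecting the nr fields gives a first draft of an allowlist, which still has to be validated at runtime with something like strace.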

How do we use it with other programming languages?

libseccomp - https://github.com/seccomp/libseccomp (Go, Python and so on)
Rust - https://docs.rs/seccomp/latest/seccomp/
Infect to protect - infects an ELF to force seccomp to restrict its syscalls. This resource is very interesting and appears in the PoC||GTFO book - https://github.com/lpereira/infect-to-protect
Node js - https://www.npmjs.com/package/node-seccomp
Android (it is Linux, runs OK) - https://android-developers.googleblog.com/2017/07/seccomp-filter-in-android-o.html (we can insert the implementation in Kotlin, Java, C++ and so on)
Kore - https://kore.io/ (AWESOME PROJECT, I am really a big fan of this project)

How do we monitor seccomp events?

Seccomp in elastic - https://www.elastic.co/blog/seccomp-in-the-elastic-stack
Osquery events - https://github.com/osquery/osquery/blob/master/specs/linux/seccomp_events.table

My Security Overview

With these empirical points around seccomp in mind: yes, there are weak points and ways to bypass and escape from seccomp, and yes, seccomp has had more vulnerabilities than Capsicum on FreeBSD or pledge on OpenBSD, which serve the same sandboxing purpose. Seccomp is more popular, though; just look at the Android context.

Remember, using any protection, restriction or resource for defence is better than using nothing. Another good option, in addition, is MAC ("Mandatory Access Control"), such as AppArmor, SELinux or TOMOYO. MAC is a framework for defining what a program can and cannot do on an allowlist basis. A program is represented as a subject. Anything the program wants to act on, such as a file, path, network interface, or port, is represented as an object. The rules for accessing the object are called permissions or flags.

The CVE list

To check the security standpoint, a good indicator is listing the CVEs and studying each context and possibility, such as binaries without proper hardening like full RELRO or lacking other protections. Beyond that, it is worth trying to bypass a binary's restrictions: use seccomp-tools to understand the rules and the possible bypasses, and maybe try patching. There is a list of CTF games with a seccomp theme, and this kind of challenge is the best course to learn bypassing in a practical context.

CVE number | URL | Risk | NIST description
CVE-2022-30594 | https://nvd.nist.gov/vuln/detail/CVE-2022-30594 | MEDIUM | The Linux kernel before 5.17.2 mishandles seccomp permissions. The PTRACE_SEIZE code path allows attackers to bypass intended restrictions on setting the PT_SUSPEND_SECCOMP flag.
CVE-2021-41133 | https://nvd.nist.gov/vuln/detail/CVE-2021-41133 | MEDIUM | Flatpak is a system for building
CVE-2020-0261 | https://nvd.nist.gov/vuln/detail/CVE-2020-0261 | HIGH | In C2 flame devices
CVE-2019-7303 | https://nvd.nist.gov/vuln/detail/CVE-2019-7303 | MEDIUM | A vulnerability in the seccomp filters of Canonical snapd before version 2.37.4 allows a strict mode snap to insert characters into a terminal on a 64-bit host. The seccomp rules were generated to match 64-bit ioctl(2) commands on a 64-bit platform
CVE-2017-18367 | https://nvd.nist.gov/vuln/detail/CVE-2017-18367 | MEDIUM | libseccomp-golang 0.9.0 and earlier incorrectly generates BPFs that OR multiple arguments rather than ANDing them. A process running under a restrictive seccomp filter that specified multiple syscall arguments could bypass intended access restrictions by specifying a single matching argument.
CVE-2019-10145 | https://nvd.nist.gov/vuln/detail/CVE-2019-10145 | MEDIUM | rkt through version 1.30.0 does not isolate processes in containers that are run with `rkt enter`. Processes run with `rkt enter` do not have seccomp filtering during stage 2 (the actual environment in which the applications run). Compromised containers could exploit this flaw to access host resources.
CVE-2018-15746 | https://nvd.nist.gov/vuln/detail/CVE-2018-15746 | LOW | qemu-seccomp.c in QEMU might allow local OS guest users to cause a denial of service (guest crash) by leveraging mishandling of the seccomp policy for threads other than the main thread.
CVE-2019-12589 | https://nvd.nist.gov/vuln/detail/CVE-2019-12589 | MEDIUM | In Firejail before 0.9.60
CVE-2019-2054 | https://nvd.nist.gov/vuln/detail/CVE-2019-2054 | MEDIUM | In the seccomp implementation prior to kernel version 4.8
CVE-2019-9893 | https://nvd.nist.gov/vuln/detail/CVE-2019-9893 | HIGH | libseccomp before 2.4.0 did not correctly generate 64-bit syscall argument comparisons using the arithmetic operators (LT