Commit bd4cf0e

Alexei Starovoitov authored and davem330 committed
net: filter: rework/optimize internal BPF interpreter's instruction set

This patch replaces/reworks the kernel-internal BPF interpreter with
an optimized BPF instruction set format that is modelled closer to
mimic native instruction sets and is designed to be JITed with one to
one mapping. Thus, the new interpreter is noticeably faster than the
current implementation of sk_run_filter(); mainly for two reasons:

1. Fall-through jumps:

  BPF jump instructions are forced to go either 'true' or 'false'
  branch which causes branch-miss penalty. The new BPF jump
  instructions have only one branch and fall-through otherwise,
  which fits the CPU branch predictor logic better. `perf stat`
  shows drastic difference for branch-misses between the old and
  new code.

2. Jump-threaded implementation of interpreter vs switch statement:

  Instead of single table-jump at the top of 'switch' statement,
  gcc will now generate multiple table-jump instructions, which
  helps CPU branch predictor logic.

Note that the verification of filters is still being done through
sk_chk_filter() in classical BPF format, so filters from user- or
kernel space are verified in the same way as we do now, and same
restrictions/constraints hold as well.

We reuse current BPF JIT compilers in a way that this upgrade would
even be fine as is, but nevertheless allows for a successive upgrade
of BPF JIT compilers to the new format.

The internal instruction set migration is being done after the
probing for JIT compilation, so in case JIT compilers are able to
create a native opcode image, we're going to use that, and in all
other cases we're doing a follow-up migration of the BPF program's
instruction set, so that it can be transparently run in the new
interpreter.
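
The "jump-threaded" dispatch named as the second reason can be illustrated outside the kernel. What follows is a minimal sketch and not the kernel's interpreter: the opcodes, `struct toy_insn` and `toy_run()` are hypothetical names made up for this example, and the code relies on the GCC/Clang "labels as values" extension. Each handler ends in its own indirect jump instead of returning to one shared `switch`, and the conditional jump has a single target and falls through otherwise.

```c
#include <stdint.h>

/* Hypothetical toy opcodes; these are NOT the kernel's BPF opcodes. */
enum { OP_ADD, OP_MUL, OP_JEQ, OP_EXIT };

struct toy_insn {
	uint8_t op;
	int16_t off;	/* relative jump offset, used by OP_JEQ only */
	int32_t imm;
};

/* Run a toy program on one accumulator and return its final value.
 * The sketch trusts its input: opcodes are not range-checked.
 */
static uint64_t toy_run(const struct toy_insn *insn, uint64_t acc)
{
	/* One label address per opcode (GCC/Clang "labels as values"). */
	static const void *jumptable[] = {
		[OP_ADD]  = &&do_add,
		[OP_MUL]  = &&do_mul,
		[OP_JEQ]  = &&do_jeq,
		[OP_EXIT] = &&do_exit,
	};
/* Every handler ends with its own indirect jump to the next handler. */
#define DISPATCH() goto *jumptable[insn->op]

	DISPATCH();
do_add:
	acc += (uint64_t)(int64_t)insn->imm;
	insn++;
	DISPATCH();
do_mul:
	acc *= (uint64_t)(int64_t)insn->imm;
	insn++;
	DISPATCH();
do_jeq:
	/* Single branch target: taken if equal, fall through otherwise. */
	if (acc == (uint64_t)(int64_t)insn->imm)
		insn += insn->off;
	insn++;
	DISPATCH();
do_exit:
	return acc;
#undef DISPATCH
}
```

A four-instruction program such as `ADD 2; JEQ 5, +1; MUL 10; EXIT` skips the multiply when the accumulator reaches 5 and executes it otherwise.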

In short, the *internal* format extends BPF in the following way (more
details can be taken from the appended documentation):

  - Number of registers increase from 2 to 10
  - Register width increases from 32-bit to 64-bit
  - Conditional jt/jf targets replaced with jt/fall-through
  - Adds signed > and >= insns
  - 16 4-byte stack slots for register spill-fill replaced with
    up to 512 bytes of multi-use stack space
  - Introduction of bpf_call insn and register passing convention
    for zero overhead calls from/to other kernel functions
  - Adds arithmetic right shift and endianness conversion insns
  - Adds atomic_add insn
  - Old tax/txa insns are replaced with 'mov dst,src' insn

Performance of two BPF filters generated by libpcap resp. bpf_asm
was measured on x86_64, i386 and arm32 (other libpcap programs
have similar performance differences):

fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd

fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
   ((tcp[12]&0xf0)>>2)) != 0)' -dd

Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
smaller is better:

--x86_64--
              fprog #1   fprog #1    fprog #2   fprog #2
              cache-hit  cache-miss  cache-hit  cache-miss
old BPF          90        101         192        202
new BPF          31         71          47         97
old BPF jit      12         34          17         44
new BPF jit     TBD

--i386--
              fprog #1   fprog #1    fprog #2   fprog #2
              cache-hit  cache-miss  cache-hit  cache-miss
old BPF         107        136         227        252
new BPF          40        119          69        172

--arm32--
              fprog #1   fprog #1    fprog #2   fprog #2
              cache-hit  cache-miss  cache-hit  cache-miss
old BPF         202        300         475        540
new BPF         180        270         330        470
old BPF jit      26        182          37        202
new BPF jit     TBD

Thus, without changing any userland BPF filters, applications on
top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
classifier, netfilter's xt_bpf, team driver's load-balancing mode,
and many more will have better interpreter filtering performance.
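
The "jt/jf targets replaced with jt/fall-through" item can be made concrete with a small sketch. It is an illustration under stated assumptions, not the conversion code of this patch: `struct classic_jmp`, `struct ft_jmp`, `expand_jump()` and `ft_next()` are hypothetical, and targets are kept as absolute program counters so that offset remapping (which a real converter would presumably need once instruction counts change) stays out of the picture. A classic two-target jump becomes one conditional single-target jump, followed by an unconditional jump only when the false target is not simply the next instruction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Classic form: both relative branch targets live in the instruction. */
struct classic_jmp {
	uint8_t jt;	/* skip count if the condition is true */
	uint8_t jf;	/* skip count if the condition is false */
};

/* Hypothetical single-target form: jump if taken, else fall through. */
struct ft_jmp {
	bool conditional;	/* false means "always taken" */
	int target;		/* absolute pc to continue at if taken */
};

/* Where the classic two-target jump at pc continues. */
static int classic_next(int pc, struct classic_jmp j, bool cond)
{
	return pc + 1 + (cond ? j.jt : j.jf);
}

/* Expand one classic jump into one or two single-target jumps.
 * Returns the number of entries written to out.
 */
static int expand_jump(int pc, struct classic_jmp j, struct ft_jmp out[2])
{
	int n = 0;

	out[n++] = (struct ft_jmp){ true, pc + 1 + j.jt };
	if (j.jf != 0)	/* false target is not the plain fall-through */
		out[n++] = (struct ft_jmp){ false, pc + 1 + j.jf };
	return n;
}

/* Evaluate the expansion: first taken jump wins, else fall through. */
static int ft_next(int pc, const struct ft_jmp *seq, int n, bool cond)
{
	for (int i = 0; i < n; i++)
		if (!seq[i].conditional || cond)
			return seq[i].target;
	return pc + 1;
}
```

For any `jt`, `jf` and condition value, `ft_next()` on the expansion lands on the same program counter as `classic_next()`, which is the property such a rewrite has to preserve.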

While we are replacing the internal BPF interpreter, we also need
to convert seccomp BPF in the same step to make use of the new
internal structure since it makes use of lower-level API details
without being further decoupled through higher-level calls like
sk_unattached_filter_{create,destroy}(), for example.

Just as for normal socket filtering, also seccomp BPF experiences
a time-to-verdict speedup:

05-sim-long_jumps.c of libseccomp was used as micro-benchmark:

  seccomp_rule_add_exact(ctx,...
  seccomp_rule_add_exact(ctx,...

  rc = seccomp_load(ctx);

  for (i = 0; i < 10000000; i++)
     syscall(199, 100);

  'short filter' has 2 rules
  'large filter' has 200 rules

  'short filter' performance is slightly better on x86_64/i386/arm32
  'large filter' is much faster on x86_64 and i386 and shows no
                 difference on arm32

--x86_64-- short filter
old BPF: 2.7 sec
 39.12%  bench  libc-2.15.so       [.] syscall
  8.10%  bench  [kernel.kallsyms]  [k] sk_run_filter
  6.31%  bench  [kernel.kallsyms]  [k] system_call
  5.59%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
  4.37%  bench  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
  3.70%  bench  [kernel.kallsyms]  [k] __secure_computing
  3.67%  bench  [kernel.kallsyms]  [k] lock_is_held
  3.03%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
new BPF: 2.58 sec
 42.05%  bench  libc-2.15.so       [.] syscall
  6.91%  bench  [kernel.kallsyms]  [k] system_call
  6.25%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
  6.07%  bench  [kernel.kallsyms]  [k] __secure_computing
  5.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp

--arm32-- short filter
old BPF: 4.0 sec
 39.92%  bench  [kernel.kallsyms]  [k] vector_swi
 16.60%  bench  [kernel.kallsyms]  [k] sk_run_filter
 14.66%  bench  libc-2.17.so       [.] syscall
  5.42%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
  5.10%  bench  [kernel.kallsyms]  [k] __secure_computing
new BPF: 3.7 sec
 35.93%  bench  [kernel.kallsyms]  [k] vector_swi
 21.89%  bench  libc-2.17.so       [.] syscall
 13.45%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
  6.25%  bench  [kernel.kallsyms]  [k] __secure_computing
  3.96%  bench  [kernel.kallsyms]  [k] syscall_trace_exit

--x86_64-- large filter
old BPF: 8.6 seconds
 73.38%  bench  [kernel.kallsyms]  [k] sk_run_filter
 10.70%  bench  libc-2.15.so       [.] syscall
  5.09%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
  1.97%  bench  [kernel.kallsyms]  [k] system_call
new BPF: 5.7 seconds
 66.20%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
 16.75%  bench  libc-2.15.so       [.] syscall
  3.31%  bench  [kernel.kallsyms]  [k] system_call
  2.88%  bench  [kernel.kallsyms]  [k] __secure_computing

--i386-- large filter
old BPF: 5.4 sec
new BPF: 3.8 sec

--arm32-- large filter
old BPF: 13.5 sec
 73.88%  bench  [kernel.kallsyms]  [k] sk_run_filter
 10.29%  bench  [kernel.kallsyms]  [k] vector_swi
  6.46%  bench  libc-2.17.so       [.] syscall
  2.94%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
  1.19%  bench  [kernel.kallsyms]  [k] __secure_computing
  0.87%  bench  [kernel.kallsyms]  [k] sys_getuid
new BPF: 13.5 sec
 76.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
 10.98%  bench  [kernel.kallsyms]  [k] vector_swi
  5.87%  bench  libc-2.17.so       [.] syscall
  1.77%  bench  [kernel.kallsyms]  [k] __secure_computing
  0.93%  bench  [kernel.kallsyms]  [k] sys_getuid

BPF filters generated by seccomp are very branchy, so the new
internal BPF performance is better than the old one. Performance
gains will be even higher when BPF JIT is committed for the
new structure, which is planned in future work (as successive
JIT migrations).

BPF has also been stress-tested with trinity's BPF fuzzer.

Joint work with Daniel Borkmann.

Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Cc: Hagen Paul Pfeifer <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Paul Moore <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: [email protected]
Acked-by: Kees Cook <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

1 parent 77e0114 commit bd4cf0e

File tree

4 files changed

+1279 -372 lines changed

include/linux/filter.h

+64 -10
@@ -9,13 +9,58 @@
 #include <linux/workqueue.h>
 #include <uapi/linux/filter.h>
 
-#ifdef CONFIG_COMPAT
-/*
- * A struct sock_filter is architecture independent.
+/* Internally used and optimized filter representation with extended
+ * instruction set based on top of classic BPF.
  */
+
+/* instruction classes */
+#define BPF_ALU64	0x07	/* alu mode in double word width */
+
+/* ld/ldx fields */
+#define BPF_DW		0x18	/* double word */
+#define BPF_XADD	0xc0	/* exclusive add */
+
+/* alu/jmp fields */
+#define BPF_MOV		0xb0	/* mov reg to reg */
+#define BPF_ARSH	0xc0	/* sign extending arithmetic shift right */
+
+/* change endianness of a register */
+#define BPF_END		0xd0	/* flags for endianness conversion: */
+#define BPF_TO_LE	0x00	/* convert to little-endian */
+#define BPF_TO_BE	0x08	/* convert to big-endian */
+#define BPF_FROM_LE	BPF_TO_LE
+#define BPF_FROM_BE	BPF_TO_BE
+
+#define BPF_JNE		0x50	/* jump != */
+#define BPF_JSGT	0x60	/* SGT is signed '>', GT in x86 */
+#define BPF_JSGE	0x70	/* SGE is signed '>=', GE in x86 */
+#define BPF_CALL	0x80	/* function call */
+#define BPF_EXIT	0x90	/* function return */
+
+/* BPF has 10 general purpose 64-bit registers and stack frame. */
+#define MAX_BPF_REG	11
+
+/* BPF program can access up to 512 bytes of stack space. */
+#define MAX_BPF_STACK	512
+
+/* Arg1, context and stack frame pointer register positions. */
+#define ARG1_REG	1
+#define CTX_REG		6
+#define FP_REG		10
+
+struct sock_filter_int {
+	__u8	code;		/* opcode */
+	__u8	a_reg:4;	/* dest register */
+	__u8	x_reg:4;	/* source register */
+	__s16	off;		/* signed offset */
+	__s32	imm;		/* signed immediate constant */
+};
+
+#ifdef CONFIG_COMPAT
+/* A struct sock_filter is architecture independent. */
 struct compat_sock_fprog {
 	u16		len;
-	compat_uptr_t	filter;		/* struct sock_filter * */
+	compat_uptr_t	filter;	/* struct sock_filter * */
 };
 #endif
 
@@ -26,6 +71,7 @@ struct sock_fprog_kern {
 
 struct sk_buff;
 struct sock;
+struct seccomp_data;
 
 struct sk_filter {
 	atomic_t		refcnt;
@@ -34,9 +80,10 @@ struct sk_filter {
 	struct sock_fprog_kern	*orig_prog;	/* Original BPF program */
 	struct rcu_head		rcu;
 	unsigned int		(*bpf_func)(const struct sk_buff *skb,
-					    const struct sock_filter *filter);
+					    const struct sock_filter_int *filter);
 	union {
-		struct sock_filter     	insns[0];
+		struct sock_filter	insns[0];
+		struct sock_filter_int	insnsi[0];
 		struct work_struct	work;
 	};
 };
@@ -50,9 +97,18 @@ static inline unsigned int sk_filter_size(unsigned int proglen)
 #define sk_filter_proglen(fprog)			\
 		(fprog->len * sizeof(fprog->filter[0]))
 
+#define SK_RUN_FILTER(filter, ctx)			\
+		(*filter->bpf_func)(ctx, filter->insnsi)
+
 int sk_filter(struct sock *sk, struct sk_buff *skb);
-unsigned int sk_run_filter(const struct sk_buff *skb,
-			   const struct sock_filter *filter);
+
+u32 sk_run_filter_int_seccomp(const struct seccomp_data *ctx,
+			      const struct sock_filter_int *insni);
+u32 sk_run_filter_int_skb(const struct sk_buff *ctx,
+			  const struct sock_filter_int *insni);
+
+int sk_convert_filter(struct sock_filter *prog, int len,
+		      struct sock_filter_int *new_prog, int *new_len);
 
 int sk_unattached_filter_create(struct sk_filter **pfp,
 				struct sock_fprog *fprog);
@@ -86,7 +142,6 @@ static inline void bpf_jit_dump(unsigned int flen, unsigned int proglen,
 		print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
 			       16, 1, image, proglen, false);
 }
-#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
 #else
 #include <linux/slab.h>
 static inline void bpf_jit_compile(struct sk_filter *fp)
@@ -96,7 +151,6 @@ static inline void bpf_jit_free(struct sk_filter *fp)
 {
 	kfree(fp);
 }
-#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
 #endif
 
 static inline int bpf_tell_extensions(void)
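
To make the 8-byte instruction layout introduced in this header easier to picture, here is a minimal userspace sketch. It is an illustration only: `struct insn_sketch` and `mov64_reg()` are hypothetical names, the struct mirrors `struct sock_filter_int` with `<stdint.h>` types, and `BPF_X` (0x08) is taken from the existing classic BPF uapi definitions rather than from this diff. Bit-field layout is implementation-defined in C, so the size and packing shown are what GCC and Clang produce on common targets, not a guarantee.

```c
#include <stdint.h>

/* Values copied from the filter.h hunk of this commit. */
#define BPF_ALU64	0x07	/* alu mode in double word width */
#define BPF_MOV		0xb0	/* mov reg to reg */
#define FP_REG		10	/* stack frame pointer register */

/* Classic BPF source-operand flag from the existing uapi header. */
#define BPF_X		0x08

/* Userspace mirror of struct sock_filter_int, for illustration only. */
struct insn_sketch {
	uint8_t code;		/* opcode */
	uint8_t a_reg:4;	/* dest register */
	uint8_t x_reg:4;	/* source register */
	int16_t off;		/* signed offset */
	int32_t imm;		/* signed immediate constant */
};

/* Build a 64-bit register-to-register move: dst = src. */
static struct insn_sketch mov64_reg(unsigned int dst, unsigned int src)
{
	struct insn_sketch insn = {
		.code  = BPF_ALU64 | BPF_MOV | BPF_X,
		.a_reg = dst & 0xf,
		.x_reg = src & 0xf,
		.off   = 0,
		.imm   = 0,
	};

	return insn;
}
```

Calling `mov64_reg(1, FP_REG)` yields an instruction whose opcode byte is the OR of class, operation and source flag, with both register numbers sharing one byte; that sharing is why register numbers up to `FP_REG` have to fit in four bits.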

include/linux/seccomp.h

-1
@@ -76,7 +76,6 @@ static inline int seccomp_mode(struct seccomp *s)
 #ifdef CONFIG_SECCOMP_FILTER
 extern void put_seccomp_filter(struct task_struct *tsk);
 extern void get_seccomp_filter(struct task_struct *tsk);
-extern u32 seccomp_bpf_load(int off);
 #else  /* CONFIG_SECCOMP_FILTER */
 static inline void put_seccomp_filter(struct task_struct *tsk)
 {
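
The declaration of seccomp_bpf_load() goes away because the seccomp.c changes in this commit fill a struct seccomp_data once per syscall and hand it to the interpreter as context, instead of resolving each load on demand. The sketch below is a userspace illustration of serving a 32-bit load from such a pre-populated buffer. `struct seccomp_data_sketch` mirrors the uapi `struct seccomp_data` layout with `<stdint.h>` types, and `load_w()` is a hypothetical helper; only the bounds and alignment test is taken from seccomp_check_filter().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the uapi struct seccomp_data, for illustration. */
struct seccomp_data_sketch {
	int32_t  nr;			/* syscall number */
	uint32_t arch;			/* AUDIT_ARCH_* value */
	uint64_t instruction_pointer;
	uint64_t args[6];		/* syscall arguments */
};

/* Load the 32-bit word at byte offset k of the populated buffer.
 * Returns 0 and stores the word in *out, or -1 if k is rejected.
 */
static int load_w(const struct seccomp_data_sketch *sd, uint32_t k,
		  uint32_t *out)
{
	/* Same test as seccomp_check_filter(): in bounds, 32-bit aligned. */
	if (k >= sizeof(*sd) || (k & 3))
		return -1;

	/* memcpy avoids aliasing problems when reading halves of a u64. */
	memcpy(out, (const unsigned char *)sd + k, sizeof(*out));
	return 0;
}
```

As the comment retained in seccomp.c says, endianness is left to filter authors: which half of a 64-bit argument sits at the lower offset depends on the architecture.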

kernel/seccomp.c

+58 -61
@@ -55,60 +55,33 @@ struct seccomp_filter {
 	atomic_t usage;
 	struct seccomp_filter *prev;
 	unsigned short len;  /* Instruction count */
-	struct sock_filter insns[];
+	struct sock_filter_int insnsi[];
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
 #define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter))
 
-/**
- * get_u32 - returns a u32 offset into data
- * @data: a unsigned 64 bit value
- * @index: 0 or 1 to return the first or second 32-bits
- *
- * This inline exists to hide the length of unsigned long.  If a 32-bit
- * unsigned long is passed in, it will be extended and the top 32-bits will be
- * 0. If it is a 64-bit unsigned long, then whatever data is resident will be
- * properly returned.
- *
+/*
  * Endianness is explicitly ignored and left for BPF program authors to manage
  * as per the specific architecture.
  */
-static inline u32 get_u32(u64 data, int index)
+static void populate_seccomp_data(struct seccomp_data *sd)
 {
-	return ((u32 *)&data)[index];
-}
+	struct task_struct *task = current;
+	struct pt_regs *regs = task_pt_regs(task);
 
-/* Helper for bpf_load below. */
-#define BPF_DATA(_name) offsetof(struct seccomp_data, _name)
-/**
- * bpf_load: checks and returns a pointer to the requested offset
- * @off: offset into struct seccomp_data to load from
- *
- * Returns the requested 32-bits of data.
- * seccomp_check_filter() should assure that @off is 32-bit aligned
- * and not out of bounds.  Failure to do so is a BUG.
- */
-u32 seccomp_bpf_load(int off)
-{
-	struct pt_regs *regs = task_pt_regs(current);
-	if (off == BPF_DATA(nr))
-		return syscall_get_nr(current, regs);
-	if (off == BPF_DATA(arch))
-		return syscall_get_arch(current, regs);
-	if (off >= BPF_DATA(args[0]) && off < BPF_DATA(args[6])) {
-		unsigned long value;
-		int arg = (off - BPF_DATA(args[0])) / sizeof(u64);
-		int index = !!(off % sizeof(u64));
-		syscall_get_arguments(current, regs, arg, 1, &value);
-		return get_u32(value, index);
-	}
-	if (off == BPF_DATA(instruction_pointer))
-		return get_u32(KSTK_EIP(current), 0);
-	if (off == BPF_DATA(instruction_pointer) + sizeof(u32))
-		return get_u32(KSTK_EIP(current), 1);
-	/* seccomp_check_filter should make this impossible. */
-	BUG();
+	sd->nr = syscall_get_nr(task, regs);
+	sd->arch = syscall_get_arch(task, regs);
+
+	/* Unroll syscall_get_args to help gcc on arm. */
+	syscall_get_arguments(task, regs, 0, 1, (unsigned long *) &sd->args[0]);
+	syscall_get_arguments(task, regs, 1, 1, (unsigned long *) &sd->args[1]);
+	syscall_get_arguments(task, regs, 2, 1, (unsigned long *) &sd->args[2]);
+	syscall_get_arguments(task, regs, 3, 1, (unsigned long *) &sd->args[3]);
+	syscall_get_arguments(task, regs, 4, 1, (unsigned long *) &sd->args[4]);
+	syscall_get_arguments(task, regs, 5, 1, (unsigned long *) &sd->args[5]);
+
+	sd->instruction_pointer = KSTK_EIP(task);
 }
 
 /**
@@ -133,17 +106,17 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
 
 		switch (code) {
 		case BPF_S_LD_W_ABS:
-			ftest->code = BPF_S_ANC_SECCOMP_LD_W;
+			ftest->code = BPF_LDX | BPF_W | BPF_ABS;
 			/* 32-bit aligned and not out of bounds. */
 			if (k >= sizeof(struct seccomp_data) || k & 3)
 				return -EINVAL;
 			continue;
 		case BPF_S_LD_W_LEN:
-			ftest->code = BPF_S_LD_IMM;
+			ftest->code = BPF_LD | BPF_IMM;
 			ftest->k = sizeof(struct seccomp_data);
 			continue;
 		case BPF_S_LDX_W_LEN:
-			ftest->code = BPF_S_LDX_IMM;
+			ftest->code = BPF_LDX | BPF_IMM;
 			ftest->k = sizeof(struct seccomp_data);
 			continue;
 		/* Explicitly include allowed calls. */
@@ -185,6 +158,7 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
 		case BPF_S_JMP_JGT_X:
 		case BPF_S_JMP_JSET_K:
 		case BPF_S_JMP_JSET_X:
+			sk_decode_filter(ftest, ftest);
 			continue;
 		default:
 			return -EINVAL;
@@ -202,18 +176,21 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
 static u32 seccomp_run_filters(int syscall)
 {
 	struct seccomp_filter *f;
+	struct seccomp_data sd;
 	u32 ret = SECCOMP_RET_ALLOW;
 
 	/* Ensure unexpected behavior doesn't result in failing open. */
 	if (WARN_ON(current->seccomp.filter == NULL))
 		return SECCOMP_RET_KILL;
 
+	populate_seccomp_data(&sd);
+
 	/*
 	 * All filters in the list are evaluated and the lowest BPF return
 	 * value always takes priority (ignoring the DATA).
 	 */
 	for (f = current->seccomp.filter; f; f = f->prev) {
-		u32 cur_ret = sk_run_filter(NULL, f->insns);
+		u32 cur_ret = sk_run_filter_int_seccomp(&sd, f->insnsi);
 		if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
 			ret = cur_ret;
 	}
@@ -231,6 +208,8 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
 	struct seccomp_filter *filter;
 	unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
 	unsigned long total_insns = fprog->len;
+	struct sock_filter *fp;
+	int new_len;
 	long ret;
 
 	if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
@@ -252,28 +231,43 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
 				     CAP_SYS_ADMIN) != 0)
 		return -EACCES;
 
-	/* Allocate a new seccomp_filter */
-	filter = kzalloc(sizeof(struct seccomp_filter) + fp_size,
-			 GFP_KERNEL|__GFP_NOWARN);
-	if (!filter)
+	fp = kzalloc(fp_size, GFP_KERNEL|__GFP_NOWARN);
+	if (!fp)
 		return -ENOMEM;
-	atomic_set(&filter->usage, 1);
-	filter->len = fprog->len;
 
 	/* Copy the instructions from fprog. */
 	ret = -EFAULT;
-	if (copy_from_user(filter->insns, fprog->filter, fp_size))
-		goto fail;
+	if (copy_from_user(fp, fprog->filter, fp_size))
+		goto free_prog;
 
 	/* Check and rewrite the fprog via the skb checker */
-	ret = sk_chk_filter(filter->insns, filter->len);
+	ret = sk_chk_filter(fp, fprog->len);
 	if (ret)
-		goto fail;
+		goto free_prog;
 
 	/* Check and rewrite the fprog for seccomp use */
-	ret = seccomp_check_filter(filter->insns, filter->len);
+	ret = seccomp_check_filter(fp, fprog->len);
+	if (ret)
+		goto free_prog;
+
+	/* Convert 'sock_filter' insns to 'sock_filter_int' insns */
+	ret = sk_convert_filter(fp, fprog->len, NULL, &new_len);
+	if (ret)
+		goto free_prog;
+
+	/* Allocate a new seccomp_filter */
+	filter = kzalloc(sizeof(struct seccomp_filter) +
+			 sizeof(struct sock_filter_int) * new_len,
+			 GFP_KERNEL|__GFP_NOWARN);
+	if (!filter)
+		goto free_prog;
+
+	ret = sk_convert_filter(fp, fprog->len, filter->insnsi, &new_len);
 	if (ret)
-		goto fail;
+		goto free_filter;
+
+	atomic_set(&filter->usage, 1);
+	filter->len = new_len;
 
 	/*
 	 * If there is an existing filter, make it the prev and don't drop its
@@ -282,8 +276,11 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
 	filter->prev = current->seccomp.filter;
 	current->seccomp.filter = filter;
 	return 0;
-fail:
+
+free_filter:
 	kfree(filter);
+free_prog:
+	kfree(fp);
 	return ret;
 }
