
Commit 4f738ad

jrfastab authored and borkmann committed
bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data
This implements a BPF ULP layer to allow policy enforcement and monitoring at the socket layer. To support this, a new program type, BPF_PROG_TYPE_SK_MSG, is used to run the policy at the sendmsg/sendpage hook. To attach the policy to sockets, a sockmap is used with a new program attach type, BPF_SK_MSG_VERDICT.

Similar to previous sockmap usages, when a sock is added to a sockmap via a map update, if the map has a BPF_SK_MSG_VERDICT program attached, then the BPF ULP layer is created on the socket and the attached BPF_PROG_TYPE_SK_MSG program is run for every msg in the sendmsg case and every page/offset in the sendpage case.

BPF_PROG_TYPE_SK_MSG Semantics/API:

BPF_PROG_TYPE_SK_MSG supports only two return codes, SK_PASS and SK_DROP. Returning SK_DROP frees the copied data in the sendmsg case and leaves the data untouched in the sendpage case. Both cases return -EACCES to the user. Returning SK_PASS allows the msg to be sent.

In the sendmsg case, data is copied into kernel space buffers before running the BPF program. The kernel space buffers are stored in a scatterlist object where each element is a kernel memory buffer. Some effort is made to coalesce data from the sendmsg call here. For example, a sendmsg call with many one byte iov entries will likely be pushed into a single entry. The BPF program is run with data pointers (start/end) pointing to the first sg element.

In the sendpage case, data is not copied. We opt not to copy the data by default here because the BPF infrastructure does not know what bytes will be needed nor when they will be needed, so copying all bytes may be wasteful. Because of this, the initial start/end data pointers are (0,0), meaning no data can be read or written. This avoids reading data that may be modified by the user. A new helper is added later in this series if reading and writing the data is needed. The helper call will do a copy by default so that the page is exclusively owned by the BPF call.
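The verdict semantics above can be modelled in plain userspace C. This is only a sketch, not the kernel code path: model_sendmsg(), msg_verdict_fn, allow_all() and deny_all() are hypothetical names introduced for illustration, and treating every value other than SK_PASS as a drop is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Mirrors the two verdicts an SK_MSG program may return. */
enum sk_action { SK_DROP = 0, SK_PASS };

/* Hypothetical stand-in for an attached BPF_PROG_TYPE_SK_MSG program. */
typedef int (*msg_verdict_fn)(const void *data, const void *data_end);

/*
 * Hypothetical model of the send path: a single verdict covers the
 * whole message. Anything other than SK_PASS is treated as a drop
 * (an assumption of this sketch).
 */
long model_sendmsg(msg_verdict_fn prog, const char *buf, size_t len)
{
	assert(prog != NULL);

	if (prog(buf, buf + len) != SK_PASS)
		return -EACCES;	/* dropped: caller sees -EACCES */

	return (long)len;	/* passed: the entire message is sent */
}

/* Two trivial example policies. */
int allow_all(const void *data, const void *data_end)
{
	(void)data;
	(void)data_end;
	return SK_PASS;
}

int deny_all(const void *data, const void *data_end)
{
	(void)data;
	(void)data_end;
	return SK_DROP;
}
```

In this model, sending through allow_all reports the full length as sent, while sending through deny_all fails with -EACCES, matching the behaviour described for both the sendmsg and sendpage cases.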
The verdict from the BPF_PROG_TYPE_SK_MSG program applies to the entire msg in the sendmsg() case and the entire page/offset in the sendpage case. This avoids ambiguity on how to handle mixed return codes in the sendmsg case. Again, a helper is added later in the series if a verdict needs to apply to multiple system calls and/or only a subpart of the message currently being processed.

The helper msg_redirect_map() can be used to select the socket to send the data on. It is used similarly to existing redirect use cases. This allows policy to redirect msgs.

Pseudo code simple example:

The basic logic to attach a program to a socket is as follows,

    // load the programs
    bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
                  &obj, &msg_prog);

    // lookup the sockmap
    bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");

    // get fd for sockmap
    map_fd_msg = bpf_map__fd(bpf_map_msg);

    // attach program to sockmap
    bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);

Adding sockets to the map is done in the normal way,

    // Add a socket 'fd' to sockmap at location 'i'
    bpf_map_update_elem(map_fd_msg, &i, &fd, BPF_ANY);

After the above, any socket attached to "my_sock_map", in this case 'fd', will run the BPF msg verdict program (msg_prog) on every sendmsg and sendpage system call.

For a complete example see BPF selftests or sockmap samples.

Implementation notes:

It seemed the simplest, to me at least, to use a refcnt to ensure psock is not lost across the sendmsg copy into the sg, the bpf program running on the data in sg_data, and the final pass to the TCP stack. Some performance testing may show a better method to do this and avoid the refcnt cost, but for now use the simpler method.

Another item that will come after basic support is in place is supporting the MSG_MORE flag. At the moment we call sendpages even if the MSG_MORE flag is set. An enhancement would be to collect the pages into a larger scatterlist and pass it down the stack.
Notice that bpf_tcp_sendmsg() could support this with some additional state saved across sendmsg calls. I built the code to support this without having to do refactoring work. Other features TBD include ZEROCOPY and the TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow the initial series shortly.

Future work could improve size limits on the scatterlist rings used here. Currently, we use MAX_SKB_FRAGS simply because this was being used already in the TLS case. Future work could extend the kernel sk APIs to tune this depending on workload. This is a trade-off between memory usage and throughput performance.

Signed-off-by: John Fastabend <[email protected]>
Acked-by: David S. Miller <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
1 parent 8c05dbf commit 4f738ad

8 files changed, 857 insertions(+), 21 deletions(-)


include/linux/bpf.h

+1
@@ -21,6 +21,7 @@ struct bpf_verifier_env;
 struct perf_event;
 struct bpf_prog;
 struct bpf_map;
+struct sock;
 
 /* map is generic key/value storage optionally accesible by eBPF programs */
 struct bpf_map_ops {

include/linux/bpf_types.h

+1
@@ -13,6 +13,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
 #endif
 #ifdef CONFIG_BPF_EVENTS
 BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)

include/linux/filter.h

+17
@@ -507,6 +507,22 @@ struct xdp_buff {
 	struct xdp_rxq_info *rxq;
 };
 
+struct sk_msg_buff {
+	void *data;
+	void *data_end;
+	__u32 apply_bytes;
+	__u32 cork_bytes;
+	int sg_copybreak;
+	int sg_start;
+	int sg_curr;
+	int sg_end;
+	struct scatterlist sg_data[MAX_SKB_FRAGS];
+	bool sg_copy[MAX_SKB_FRAGS];
+	__u32 key;
+	__u32 flags;
+	struct bpf_map *map;
+};
+
 /* Compute the linear packet data range [data, data_end) which
  * will be accessed by various program types (cls_bpf, act_bpf,
  * lwt, ...). Subsystems allowing direct data access must (!)
@@ -771,6 +787,7 @@ xdp_data_meta_unsupported(const struct xdp_buff *xdp)
 void bpf_warn_invalid_xdp_action(u32 act);
 
 struct sock *do_sk_redirect_map(struct sk_buff *skb);
+struct sock *do_msg_redirect_map(struct sk_msg_buff *md);
 
 #ifdef CONFIG_BPF_JIT
 extern int bpf_jit_enable;
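The key, flags and map fields of struct sk_msg_buff, together with do_msg_redirect_map(), suggest a two-step redirect: the helper records the target and the send path resolves it later. The userspace model below illustrates that split. It is an assumption-laden sketch, not the kernel implementation: every model_* name is hypothetical, the sockmap is reduced to a small array, and rejecting non-zero flags and clearing the recorded target after lookup are both assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAP_SLOTS 4	/* hypothetical sockmap size */

enum sk_action { SK_DROP = 0, SK_PASS };

struct model_sock { int fd; };

/* Hypothetical sockmap: an array of socket pointers indexed by key. */
struct model_sockmap { struct model_sock *slots[MODEL_MAP_SLOTS]; };

/* Only the redirect-related fields of struct sk_msg_buff. */
struct model_msg {
	unsigned int key;
	unsigned int flags;
	struct model_sockmap *map;
};

/* Step 1, called by the program: record the redirect target. */
int model_msg_redirect_map(struct model_msg *msg, struct model_sockmap *map,
			   unsigned int key, unsigned int flags)
{
	assert(msg != NULL);

	if (flags)		/* assumption: reserved flags must be zero */
		return SK_DROP;

	msg->key = key;
	msg->flags = flags;
	msg->map = map;
	return SK_PASS;
}

/* Step 2, called by the send path: resolve and clear the target. */
struct model_sock *model_do_msg_redirect_map(struct model_msg *msg)
{
	struct model_sock *sk = NULL;

	assert(msg != NULL);

	if (msg->map) {
		if (msg->key < MODEL_MAP_SLOTS)
			sk = msg->map->slots[msg->key];
		msg->key = 0;		/* assumption: one-shot redirect */
		msg->map = NULL;
	}
	return sk;
}
```

With no redirect recorded, the lookup yields NULL and the message would stay on its original socket. After a successful model_msg_redirect_map() call, the next lookup returns the socket stored at that key.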

include/uapi/linux/bpf.h

+21 -1
@@ -133,6 +133,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SOCK_OPS,
 	BPF_PROG_TYPE_SK_SKB,
 	BPF_PROG_TYPE_CGROUP_DEVICE,
+	BPF_PROG_TYPE_SK_MSG,
 };
 
 enum bpf_attach_type {
@@ -143,6 +144,7 @@ enum bpf_attach_type {
 	BPF_SK_SKB_STREAM_PARSER,
 	BPF_SK_SKB_STREAM_VERDICT,
 	BPF_CGROUP_DEVICE,
+	BPF_SK_MSG_VERDICT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -718,6 +720,15 @@ union bpf_attr {
  * int bpf_override_return(pt_regs, rc)
  *     @pt_regs: pointer to struct pt_regs
  *     @rc: the return value to set
+ *
+ * int bpf_msg_redirect_map(map, key, flags)
+ *     Redirect msg to a sock in map using key as a lookup key for the
+ *     sock in map.
+ *     @map: pointer to sockmap
+ *     @key: key to lookup sock in map
+ *     @flags: reserved for future use
+ *     Return: SK_PASS
+ *
  */
 #define __BPF_FUNC_MAPPER(FN) \
 	FN(unspec), \
@@ -779,7 +790,8 @@ union bpf_attr {
 	FN(perf_prog_read_value), \
 	FN(getsockopt), \
 	FN(override_return), \
-	FN(sock_ops_cb_flags_set),
+	FN(sock_ops_cb_flags_set), \
+	FN(msg_redirect_map),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -942,6 +954,14 @@ enum sk_action {
 	SK_PASS,
 };
 
+/* user accessible metadata for SK_MSG packet hook, new fields must
+ * be added to the end of this structure
+ */
+struct sk_msg_md {
+	void *data;
+	void *data_end;
+};
+
 #define BPF_TAG_SIZE 8
 
 struct bpf_prog_info {
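From the program's side, struct sk_msg_md exposes only data and data_end, so any access has to be bounds-checked first, and the sendpage case starts with an empty (0,0) range. The sketch below shows that pattern as ordinary userspace C so it can be compiled and exercised without a kernel. The struct is a local mirror of the uapi definition, and msg_verdict() with its "drop if the first byte is 'X'" policy is a hypothetical example, not part of the commit.

```c
#include <assert.h>
#include <stddef.h>

enum sk_action { SK_DROP = 0, SK_PASS };

/* Userspace mirror of the uapi struct added in this commit. */
struct sk_msg_md {
	void *data;
	void *data_end;
};

/* Hypothetical policy: drop any message whose first byte is 'X'. */
int msg_verdict(struct sk_msg_md *msg)
{
	const unsigned char *data;
	const unsigned char *data_end;

	assert(msg != NULL);
	data = msg->data;
	data_end = msg->data_end;

	/*
	 * Bounds check before touching the data. In the sendpage case
	 * the range is (0,0), so nothing is readable and this sketch
	 * simply passes the message.
	 */
	if (data == NULL || data + 1 > data_end)
		return SK_PASS;

	return data[0] == 'X' ? SK_DROP : SK_PASS;
}
```

A real BPF_PROG_TYPE_SK_MSG program would use the same compare-before-access shape, since the verifier rejects reads that are not proven to lie within [data, data_end).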
