- Mar 27, 2018
-
-
David Howells authored
In rxrpc and afs, use the debug_ids that are monotonically allocated to various objects as they're allocated rather than pointers as kernel pointers are now hashed making them less useful. Further, the debug ids aren't reused anywhere nearly as quickly. In addition, allow kernel services that use rxrpc, such as afs, to take numbers from the rxrpc counter, assign them to their own call struct and pass them in to rxrpc for both client and service calls so that the trace lines for each will have the same ID tag. Signed-off-by:
David Howells <dhowells@redhat.com>
-
David Howells authored
Add a tracepoint to trace packet resend events and to dump the Tx annotation buffer for added illumination. Signed-off-by:
David Howells <dhowells@rdhat.com>
-
Kirill Tkhai authored
This adds comments to different places to improve readability. Signed-off-by:
Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Kirill Tkhai authored
net_sem is some undefined area name, so it will be better to make the area more defined. Rename it to pernet_ops_rwsem for better readability and better intelligibility. Signed-off-by:
Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Kirill Tkhai authored
Synchronous pernet_operations are not allowed anymore. All are asynchronous. So, drop the structure member. Signed-off-by:
Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Xin Long authored
After Commit dae399d7 ("sctp: hold transport instead of assoc when lookup assoc in rx path"), it put transport instead of asoc in sctp_has_association. Variable 'asoc' is not used any more. So this patch is to remove it, while at it, it also changes the return type of sctp_has_association to bool, and does the same for it's caller sctp_endpoint_is_peeled_off. Signed-off-by:
Xin Long <lucien.xin@gmail.com> Acked-by:
Neil Horman <nhorman@tuxdriver.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 26, 2018
-
-
Or Gerlitz authored
Newer NICs (ConnectX-5 and onward) can apply vlan pop or push as an action taking place during flow steering. Add the core bits for that. Signed-off-by:
Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by:
Mark Bloch <markb@mellanox.com> Signed-off-by:
Saeed Mahameed <saeedm@mellanox.com>
-
Moshe Shemesh authored
Added the following packets dropped while vport down statistics: Rx dropped while vport down - counts packets which were steered by e-switch to a vport, but dropped since the vport was down. This counter will be shown on ip link tool as part of the vport rx_dropped counter. Tx dropped while vport down - counts packets which were transmitted by a vport, but dropped due to vport logical link down. This counter will be shown on ip link tool as part of the vport tx_dropped counter. The counters are read from FW by command QUERY_VNIC_ENV. Signed-off-by:
Moshe Shemesh <moshe@mellanox.com> Signed-off-by:
Saeed Mahameed <saeedm@mellanox.com>
-
Moshe Shemesh authored
Added the following packets drop counter: Rx steering missed dropped packets - counts packets which were dropped due to miss on NIC rx steering rules. This counter will be shown on ethtool as a new counter called rx_steer_missed_packets. Signed-off-by:
Moshe Shemesh <moshe@mellanox.com> Signed-off-by:
Saeed Mahameed <saeedm@mellanox.com>
-
Moshe Shemesh authored
Add support for new FW command QUERY_VNIC_ENV. The command is used by the driver to query vnic diagnostic statistics from FW. Signed-off-by:
Moshe Shemesh <moshe@mellanox.com> Signed-off-by:
Saeed Mahameed <saeedm@mellanox.com>
-
Inbar Karmy authored
Implement set/get functions to configure PFC stall prevention timeout by tunables api through ethtool. By default the stall prevention timeout is configured to 8 sec. Timeout range is: 80-8000 msec. Enabling stall prevention with the auto timeout will set the timeout to 100 msec. Signed-off-by:
Inbar Karmy <inbark@mellanox.com> Signed-off-by:
Saeed Mahameed <saeedm@mellanox.com>
-
Inbar Karmy authored
In the event where the device unexpectedly becomes unresponsive for a long period of time, flow control mechanism may propagate pause frames which will cause congestion spreading to the entire network. To prevent this scenario, when the device is stalled for a period longer than a pre-configured timeout, flow control mechanisms are automatically disabled. This patch adds support for the ETHTOOL_PFC_STALL_PREVENTION as a tunable. This API provides support for configuring flow control storm prevention timeout (msec). Signed-off-by:
Inbar Karmy <inbark@mellanox.com> Cc: Michal Kubecek <mkubecek@suse.cz> Cc: Andrew Lunn <andrew@lunn.ch> Signed-off-by:
Saeed Mahameed <saeedm@mellanox.com>
-
Inbar Karmy authored
Add the needed capability bit and counters to device spec description. Expose the following two counters in ethtool: tx_pause_storm_warning_events: when the device is stalled for a period longer than a pre-configured watermark, the counter increase, allowing the debug utility an insight into current device status. tx_pause_storm_error_events: when the device is stalled for a period longer than a pre-configured timeout, the pause transmission is disabled, and the counter increase. Signed-off-by:
Inbar Karmy <inbark@mellanox.com> Signed-off-by:
Saeed Mahameed <saeedm@mellanox.com>
-
Yuval Mintz authored
Since ipmr and ip6mr are using the same mr_mfc struct at their core, we can now refactor the ipmr_cache_{hold,put} logic and apply refcounting to both ipmr and ip6mr. Signed-off-by:
Yuval Mintz <yuvalm@mellanox.com> Signed-off-by:
Ido Schimmel <idosch@mellanox.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Yuval Mintz authored
Add the ability to discern whether a given FIB rule notification relates to the default rule inserted when registering ip6mr or a different one. Would later be used by drivers wishing to offload ipv6 multicast routes but unable to offload rules other than the default one. Signed-off-by:
Yuval Mintz <yuvalm@mellanox.com> Signed-off-by:
Ido Schimmel <idosch@mellanox.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Yuval Mintz authored
In similar fashion to ipmr, support fib notifications for ip6mr mfc and vif related events. This would later allow drivers to react to said notifications and offload the IPv6 mroutes. Signed-off-by:
Yuval Mintz <yuvalm@mellanox.com> Signed-off-by:
Ido Schimmel <idosch@mellanox.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Yuval Mintz authored
Since all the primitive elements used for the notification done by ipmr are now common [mr_table, mr_mfc, vif_device] we can refactor the logic for dumping them to a common file. Signed-off-by:
Yuval Mintz <yuvalm@mellanox.com> Signed-off-by:
Ido Schimmel <idosch@mellanox.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Yuval Mintz authored
Like vif notifications, move the notifier struct for MFC as well as its helpers into a common file; Currently they're only used by ipmr. Signed-off-by:
Yuval Mintz <yuvalm@mellanox.com> Signed-off-by:
Ido Schimmel <idosch@mellanox.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Yuval Mintz authored
The fib-notifiers are tightly coupled with the vif_device which is already common. Move the notifier struct definition and helpers to the common file; Currently they're only used by ipmr. Signed-off-by:
Yuval Mintz <yuvalm@mellanox.com> Signed-off-by:
Ido Schimmel <idosch@mellanox.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Kirill Tkhai authored
Last user is gone after bdf5bd7f "rds: tcp: remove register_netdevice_notifier infrastructure.", so we can remove this netdevice command. This allows to delete rtnl_lock() in netdev_run_todo(), which is hot path for net namespace unregistration. dev_change_net_namespace() and netdev_wait_allrefs() have rcu_barrier() before NETDEV_UNREGISTER_FINAL call, and the source commits say they were introduced to delemit the call with NETDEV_UNREGISTER, but this patch leaves them on the places, since they require additional analysis, whether we need in them for something else. Signed-off-by:
Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Kirill Tkhai authored
This patch is preparation to drop NETDEV_UNREGISTER_FINAL. Since the cmd is used in usnic_ib_netdev_event_to_string() to get cmd name, after plain removing NETDEV_UNREGISTER_FINAL from everywhere, we'd have holes in event2str[] in this function. Instead of that, let's make NETDEV_XXX commands names available for everyone, and to define netdev_cmd_to_name() in the way we won't have to shaffle names after their numbers are changed. Signed-off-by:
Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 24, 2018
-
-
Davide Caratti authored
tcf_idr_cleanup() is no more used, so remove it. Suggested-by:
Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by:
Davide Caratti <dcaratti@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 23, 2018
-
-
Jon Maloy authored
We add a 128-bit node identity, as an alternative to the currently used 32-bit node address. For the sake of compatibility and to minimize message header changes we retain the existing 32-bit address field. When not set explicitly by the user, this field will be filled with a hash value generated from the much longer node identity, and be used as a shorthand value for the latter. We permit either the address or the identity to be set by configuration, but not both, so when the address value is set by a legacy user the corresponding 128-bit node identity is generated based on the that value. Acked-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Dave Watson authored
Add rx path for tls software implementation. recvmsg, splice_read, and poll implemented. An additional sockopt TLS_RX is added, with the same interface as TLS_TX. Either TLX_RX or TLX_TX may be provided separately, or together (with two different setsockopt calls with appropriate keys). Control messages are passed via CMSG in a similar way to transmit. If no cmsg buffer is passed, then only application data records will be passed to userspace, and EIO is returned for other types of alerts. EBADMSG is passed for decryption errors, and EMSGSIZE is passed for framing too big, and EBADMSG for framing too small (matching openssl semantics). EINVAL is returned for TLS versions that do not match the original setsockopt call. All are unrecoverable. strparser is used to parse TLS framing. Decryption is done directly in to userspace buffers if they are large enough to support it, otherwise sk_cow_data is called (similar to ipsec), and buffers are decrypted in place and copied. splice_read always decrypts in place, since no buffers are provided to decrypt in to. sk_poll is overridden, and only returns POLLIN if a full TLS message is received. Otherwise we wait for strparser to finish reading a full frame. Actual decryption is only done during recvmsg or splice_read calls. Signed-off-by:
Dave Watson <davejwatson@fb.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Dave Watson authored
Several config variables are prefixed with tx, drop the prefix since these will be used for both tx and rx. Signed-off-by:
Dave Watson <davejwatson@fb.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Dave Watson authored
Pass EBADMSG explicitly to tls_err_abort. Receive path will pass additional codes - EMSGSIZE if framing is larger than max TLS record size, EINVAL if TLS version mismatch. Signed-off-by:
Dave Watson <davejwatson@fb.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Dave Watson authored
Separate tx crypto parameters to a separate cipher_context struct. The same parameters will be used for rx using the same struct. tls_advance_record_sn is modified to only take the cipher info. Signed-off-by:
Dave Watson <davejwatson@fb.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
David Ahern authored
Earlier change missed the path where CONFIG_NET_DEVLINK is disabled. Thanks to Jiri for spotting. Fixes: 14530746 ("devlink: Remove top_hierarchy arg to devlink_resource_register") Signed-off-by:
David Ahern <dsahern@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Daniel Vacek authored
This reverts commit b92df1de ("mm: page_alloc: skip over regions of invalid pfns where possible"). The commit is meant to be a boot init speed up skipping the loop in memmap_init_zone() for invalid pfns. But given some specific memory mapping on x86_64 (or more generally theoretically anywhere but on arm with CONFIG_HAVE_ARCH_PFN_VALID) the implementation also skips valid pfns which is plain wrong and causes 'kernel BUG at mm/page_alloc.c:1389!' crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1 kernel BUG at mm/page_alloc.c:1389! invalid opcode: 0000 [#1] SMP -- RIP: 0010: move_freepages+0x15e/0x160 -- Call Trace: move_freepages_block+0x73/0x80 __rmqueue+0x263/0x460 get_page_from_freelist+0x7e1/0x9e0 __alloc_pages_nodemask+0x176/0x420 -- crash> page_init_bug -v | grep RAM <struct resource 0xffff88067fffd2f8> 1000 - 9bfff System RAM (620.00 KiB) <struct resource 0xffff88067fffd3a0> 100000 - 430bffff System RAM ( 1.05 GiB = 1071.75 MiB = 1097472.00 KiB) <struct resource 0xffff88067fffd410> 4b0c8000 - 4bf9cfff System RAM ( 14.83 MiB = 15188.00 KiB) <struct resource 0xffff88067fffd480> 4bfac000 - 646b1fff System RAM (391.02 MiB = 400408.00 KiB) <struct resource 0xffff88067fffd560> 7b788000 - 7b7fffff System RAM (480.00 KiB) <struct resource 0xffff88067fffd640> 100000000 - 67fffffff System RAM ( 22.00 GiB) crash> page_init_bug | head -6 <struct resource 0xffff88067fffd560> 7b788000 - 7b7fffff System RAM (480.00 KiB) <struct page 0xffffea0001ede200> 1fffff00000000 0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32 4096 1048575 <struct page 0xffffea0001ede200> 505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0> <struct page 0xffffea0001ed8000> 0 0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA 1 4095 <struct page 0xffffea0001edffc0> 1fffff00000400 0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32 4096 1048575 BUG, zones differ! crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000 PAGE PHYSICAL MAPPING INDEX CNT FLAGS ffffea0001e00000 78000000 0 0 0 0 ffffea0001ed7fc0 7b5ff000 0 0 0 0 ffffea0001ed8000 7b600000 0 0 0 0 <<<< ffffea0001ede1c0 7b787000 0 0 0 0 ffffea0001ede200 7b788000 0 0 1 1fffff00000000 Link: http://lkml.kernel.org/r/20180316143855.29838-1-neelx@redhat.com Fixes: b92df1de ("mm: page_alloc: skip over regions of invalid pfns where possible") Signed-off-by:
Daniel Vacek <neelx@redhat.com> Acked-by:
Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by:
Michal Hocko <mhocko@suse.com> Reviewed-by:
Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Toshi Kani authored
On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may create pud/pmd mappings. A kernel panic was observed on arm64 systems with Cortex-A75 in the following steps as described by Hanjun Guo. 1. ioremap a 4K size, valid page table will build, 2. iounmap it, pte0 will set to 0; 3. ioremap the same address with 2M size, pgd/pmd is unchanged, then set the a new value for pmd; 4. pte0 is leaked; 5. CPU may meet exception because the old pmd is still in TLB, which will lead to kernel panic. This panic is not reproducible on x86. INVLPG, called from iounmap, purges all levels of entries associated with purged address on x86. x86 still has memory leak. The patch changes the ioremap path to free unmapped page table(s) since doing so in the unmap path has the following issues: - The iounmap() path is shared with vunmap(). Since vmap() only supports pte mappings, making vunmap() to free a pte page is an overhead for regular vmap users as they do not need a pte page freed up. - Checking if all entries in a pte page are cleared in the unmap path is racy, and serializing this check is expensive. - The unmap path calls free_vmap_area_noflush() to do lazy TLB purges. Clearing a pud/pmd entry before the lazy TLB purges needs extra TLB purge. Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which clear a given pud/pmd entry and free up a page for the lower level entries. This patch implements their stub functions on x86 and arm64, which work as workaround. [akpm@linux-foundation.org: fix typo in pmd_free_pte_page() stub] Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com Fixes: e61ce6ad ("mm: change ioremap to set up huge I/O mappings") Reported-by:
Lei Li <lious.lilei@hisilicon.com> Signed-off-by:
Toshi Kani <toshi.kani@hpe.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Wang Xuefeng <wxf.wang@hisilicon.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Borislav Petkov <bp@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Chintan Pandya <cpandya@codeaurora.org> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Mar 22, 2018
-
-
Kirill Tkhai authored
Since ra_chain is per-net, we may use per-net mutexes to protect them in ip_ra_control(). This improves scalability. Signed-off-by:
Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Kirill Tkhai authored
This is optimization, which makes ip_call_ra_chain() iterate less sockets to find the sockets it's looking for. Signed-off-by:
Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Subash Abhinov Kasiviswanathan authored
Define new netlink attributes for rmnet mux_id and flags. These flags / mux_id were earlier using vlan flags / id respectively. The flag bits are also moved to uapi and are renamed with prefix RMNET_FLAG_*. Also add the rmnet policy to handle the new netlink attributes. Signed-off-by:
Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
GhantaKrishnamurthy MohanKrishna authored
Currently when tipc is unable to queue a received message on a socket, the message is rejected back to the sender with error TIPC_ERR_OVERLOAD. However, the application on this socket has no knowledge about these discards. In this commit, we try to step the sk_drops counter when tipc is unable to queue a received message. Export sk_drops using tipc socket diagnostics. Acked-by:
Jon Maloy <jon.maloy@ericsson.com> Acked-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com> Signed-off-by:
Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
GhantaKrishnamurthy MohanKrishna authored
This commit adds socket diagnostics capability for AF_TIPC in netlink family NETLINK_SOCK_DIAG in a new kernel module (diag.ko). The following are key design considerations: - config TIPC_DIAG has default y, like INET_DIAG. - only requests with flag NLM_F_DUMP is supported (dump all). - tipc_sock_diag_req message is introduced to send filter parameters. - the response attributes are of TLV, some nested. To avoid exposing data structures between diag and tipc modules and avoid code duplication, the following additions are required: - export tipc_nl_sk_walk function to reuse socket iterator. - export tipc_sk_fill_sock_diag to fill the tipc diag attributes. - create a sock_diag response message in __tipc_add_sock_diag defined in diag.c and use the above exported tipc_sk_fill_sock_diag to fill response. Acked-by:
Jon Maloy <jon.maloy@ericsson.com> Acked-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com> Signed-off-by:
Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
David Ahern authored
top_hierarchy arg can be determined by comparing parent_resource_id to DEVLINK_RESOURCE_ID_PARENT_TOP so it does not need to be a separate argument. Signed-off-by:
David Ahern <dsahern@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Kevin Hao authored
For some phy devices, even though they don't support the MMD extended register access, it does have some side effect if we are trying to read/write the MMD registers via indirect method. So introduce general dummy stubs for MMD register access which these devices can use to avoid such side effect. Fixes: b6b5e8a6 ("gianfar: Disable EEE autoneg by default") Signed-off-by:
Kevin Hao <haokexin@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Christian Brauner authored
This commit adds struct uevent_sock to struct net. Since struct uevent_sock records the position of the uevent socket in the uevent socket list we can trivially remove it from the uevent socket list during cleanup. This speeds up the old removal codepath. Note, list_del() will hit __list_del_entry_valid() in its call chain which will validate that the element is a member of the list. If it isn't it will take care that the list is not modified. Signed-off-by:
Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 21, 2018
-
-
Ben Caradoc-Davies authored
Commit 7b6ddeaf ("mac80211: use QoS NDP for AP probing") added an argument qos_ok to ieee80211_nullfunc_get to support QoS NDP. Despite the claim in the commit log "Change all the drivers to *not* allow QoS NDP for now, even though it looks like most of them should be OK with that", this commit enables QoS NDP in response to beacons (see change to mlme.c:ieee80211_send_nullfunc), causing ath9k_htc to lose IP connectivity. See: https://patchwork.kernel.org/patch/10241109/ https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=891060 Introduce a hardware flag to allow such buggy drivers to override the correct default behaviour of mac80211 of sending QoS NDP packets. Signed-off-by:
Ben Caradoc-Davies <ben@transient.nz> Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
- Mar 19, 2018
-
-
John Fastabend authored
Currently, if a bpf sk msg program is run the program can only parse data that the (start,end) pointers already consumed. For sendmsg hooks this is likely the first scatterlist element. For sendpage this will be the range (0,0) because the data is shared with userspace and by default we want to avoid allowing userspace to modify data while (or after) BPF verdict is being decided. To support pulling in additional bytes for parsing use a new helper bpf_sk_msg_pull(start, end, flags) which works similar to cls tc logic. This helper will attempt to point the data start pointer at 'start' bytes offest into msg and data end pointer at 'end' bytes offset into message. After basic sanity checks to ensure 'start' <= 'end' and 'end' <= msg_length there are a few cases we need to handle. First the sendmsg hook has already copied the data from userspace and has exclusive access to it. Therefor, it is not necessesary to copy the data. However, it may be required. After finding the scatterlist element with 'start' offset byte in it there are two cases. One the range (start,end) is entirely contained in the sg element and is already linear. All that is needed is to update the data pointers, no allocate/copy is needed. The other case is (start, end) crosses sg element boundaries. In this case we allocate a block of size 'end - start' and copy the data to linearize it. Next sendpage hook has not copied any data in initial state so that data pointers are (0,0). In this case we handle it similar to the above sendmsg case except the allocation/copy must always happen. Then when sending the data we have possibly three memory regions that need to be sent, (0, start - 1), (start, end), and (end + 1, msg_length). This is required to ensure any writes by the BPF program are correctly transmitted. Lastly this operation will invalidate any previous data checks so BPF programs will have to revalidate pointers after making this BPF call. Signed-off-by:
John Fastabend <john.fastabend@gmail.com> Acked-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net>
-