- Feb 27, 2020
Christian Borntraeger authored
Now that everything is in place, we can announce the feature.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
-
Janosch Frank authored
diag 308 subcodes 0 and 1 require several KVM and Ultravisor interactions. Specific to these "soft" reboots are:

* the "unshare all" UVC
* the "prepare for reset" UVC

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
[borntraeger@de.ibm.com: patch merging, splitting, fixing]
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
Janosch Frank authored
Now that we can't access guest memory anymore, we have a dedicated satellite block that serves as a bounce buffer for instruction data.

We reuse the memop interface to copy the instruction data to and from userspace. This lets us reuse a lot of QEMU code that used that interface to make logical guest memory accesses, which are not possible anymore in protected mode anyway.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[borntraeger@de.ibm.com: patch merging, splitting, fixing]
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
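The bounce-buffer idea can be sketched in userspace C. This is a minimal model under stated assumptions: `struct sida_sketch`, `sida_memop()`, `enum memop_dir` and the 4096-byte size are hypothetical stand-ins, not the KVM memop ABI or the real satellite block layout. It only shows that every copy is bounds-checked against the dedicated buffer and never touches guest memory directly.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the satellite block: a page-sized buffer
 * the host may access on behalf of a protected guest. */
#define SIDA_SKETCH_SIZE 4096u

struct sida_sketch {
	uint8_t data[SIDA_SKETCH_SIZE];
};

enum memop_dir { MEMOP_READ, MEMOP_WRITE }; /* hypothetical */

/* Copy 'len' bytes between the caller's buffer 'buf' and the bounce
 * buffer at 'offset'. Returns 0 on success, or -EINVAL if the range
 * does not fit inside the satellite block. */
int sida_memop(struct sida_sketch *sida, enum memop_dir dir,
	       size_t offset, void *buf, size_t len)
{
	assert(sida != NULL && buf != NULL);
	if (offset > SIDA_SKETCH_SIZE || len > SIDA_SKETCH_SIZE - offset)
		return -EINVAL;
	if (dir == MEMOP_READ)
		memcpy(buf, sida->data + offset, len);
	else
		memcpy(sida->data + offset, buf, len);
	return 0;
}
```

The range check is written as `len > SIZE - offset` rather than `offset + len > SIZE` so that it cannot overflow for large `len`.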
-
Janosch Frank authored
This contains 3 main changes:
1. changes in SIE control block handling for secure guests
2. helper functions for creating/destroying/unpacking secure guests
3. the KVM_S390_PV_COMMAND ioctl, to allow userspace to deal with secure machines

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
[borntraeger@de.ibm.com: patch merging, splitting, fixing]
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
- Feb 14, 2020
Randy Dunlap authored
Eliminate all kernel-doc and Sphinx warnings in <linux/netdevice.h>.

Fixes these warnings:

```
../include/linux/netdevice.h:2100: warning: Function parameter or member 'gso_partial_features' not described in 'net_device'
../include/linux/netdevice.h:2100: warning: Function parameter or member 'l3mdev_ops' not described in 'net_device'
../include/linux/netdevice.h:2100: warning: Function parameter or member 'xfrmdev_ops' not described in 'net_device'
../include/linux/netdevice.h:2100: warning: Function parameter or member 'tlsdev_ops' not described in 'net_device'
../include/linux/netdevice.h:2100: warning: Function parameter or member 'name_assign_type' not described in 'net_device'
../include/linux/netdevice.h:2100: warning: Function parameter or member 'ieee802154_ptr' not described in 'net_device'
../include/linux/netdevice.h:2100: warning: Function parameter or member 'mpls_ptr' not described in 'net_device'
../include/linux/netdevice.h:2100: warning: Function parameter or member 'xdp_prog' not described in 'net_device'
../include/linux/netdevice.h:2100: warning: Function parameter or member 'gro_flush_timeout' not described in 'net_device'
../include/linux/netdevice.h:2100: warning: Function parameter or member 'xdp_bulkq' not described in 'net_device'
../include/linux/netdevice.h:2100: warning: Function parameter or member 'xps_cpus_map' not described in 'net_device'
../include/linux/netdevice.h:2100: warning: Function parameter or member 'xps_rxqs_map' not described in 'net_device'
../include/linux/netdevice.h:2100: warning: Function parameter or member 'qdisc_hash' not described in 'net_device'
../include/linux/netdevice.h:3552: WARNING: Inline emphasis start-string without end-string.
../include/linux/netdevice.h:3552: WARNING: Inline emphasis start-string without end-string.
```

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- Feb 13, 2020
Jason A. Donenfeld authored
This introduces a helper function, to be called only by network drivers, that wraps calls to icmp[v6]_send in a conntrack transformation in case NAT has been used. We don't want to pollute the non-driver path, though, so we introduce this as a helper to be called only by the places that actually need it, as suggested by Florian.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
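The effect of such a helper can be modelled in a few lines. This is a heavily simplified sketch, assuming a hypothetical `struct pkt_sketch` in which "conntrack" has kept the pre-NAT source address; `icmp_send_sketch()` and `icmp_ndo_send_sketch()` are illustrative names, not kernel functions. The driver-only wrapper restores the original source on a copy before sending, so the ICMP error goes back to the real sender, while the ordinary path is left untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a packet after source NAT. */
struct pkt_sketch {
	uint32_t saddr;       /* source address as seen now (post-NAT) */
	uint32_t daddr;
	bool     nat_applied; /* did source NAT rewrite saddr? */
	uint32_t orig_saddr;  /* pre-NAT source remembered by "conntrack" */
};

/* Stand-in for icmp_send(): records where the ICMP error would go. */
static uint32_t last_icmp_dst;
static void icmp_send_sketch(const struct pkt_sketch *p)
{
	last_icmp_dst = p->saddr; /* ICMP errors go back to the source */
}

/* Driver-only wrapper: undo the source NAT on a copy, then send.
 * Callers outside drivers keep using icmp_send_sketch() directly. */
void icmp_ndo_send_sketch(const struct pkt_sketch *p)
{
	struct pkt_sketch copy;

	assert(p != NULL);
	copy = *p;
	if (copy.nat_applied)
		copy.saddr = copy.orig_saddr;
	icmp_send_sketch(&copy);
}
```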
-
Hangbin Liu authored
@thoff has moved to struct flow_dissector_key_control.

Fixes: 42aecaa9 ("net: Get skb hash over flow_keys structure")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- Feb 12, 2020
Randy Dunlap authored
Fix kernel-doc warnings in struct pipe_inode_info after @wait was split into @rd_wait and @wr_wait.

```
include/linux/pipe_fs_i.h:66: warning: Function parameter or member 'rd_wait' not described in 'pipe_inode_info'
include/linux/pipe_fs_i.h:66: warning: Function parameter or member 'wr_wait' not described in 'pipe_inode_info'
```

Fixes: 0ddad21d ("pipe: use exclusive waits when reading or writing")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Trond Myklebust authored
If a dentry was not initially looked up while we were holding a delegation, then we still need to revalidate that it still has the same name. If there are multiple hard links to the same file, then all the hard links need validation.

Reported-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
[Anna: Put nfs_unset_verifier_delegated() under CONFIG_NFS_V4]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
- Feb 11, 2020
Rafael J. Wysocki authored
Introduce a new helper function, acpi_any_gpe_status_set(), for checking the status bits of all enabled GPEs in one go. It is needed to distinguish spurious SCIs from genuine ones when deciding whether or not to wake up the system from suspend-to-idle.

Cc: 5.4+ <stable@vger.kernel.org> # 5.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
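The "all enabled GPEs in one go" check reduces to masking each status register with its enable register. The sketch below assumes a hypothetical flat array of 8-bit register pairs (`struct gpe_register_sketch`); the real helper walks ACPICA's GPE block structures, which are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one GPE register pair: each register covers
 * 8 GPEs, with one status bit and one enable bit per GPE. */
struct gpe_register_sketch {
	uint8_t status; /* set by hardware when a GPE fires */
	uint8_t enable; /* set by software for GPEs it cares about */
};

/* Return true if any *enabled* GPE has its status bit set, scanning all
 * registers in one pass; status bits of disabled GPEs are ignored. */
bool any_gpe_status_set_sketch(const struct gpe_register_sketch *regs,
			       size_t nregs)
{
	assert(regs != NULL || nregs == 0);
	for (size_t i = 0; i < nregs; i++)
		if (regs[i].status & regs[i].enable)
			return true;
	return false;
}
```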
-
Rafael J. Wysocki authored
It is theoretically possible for the ACPI EC GPE to be set after the s2idle_ops->wake() called from s2idle_loop() has returned and before the subsequent pm_wakeup_pending() check is carried out. If that happens, the resulting wakeup event will cause the system to resume even though it may be a spurious one.

To avoid that race, first make the ->wake() callback in struct platform_s2idle_ops return a bool value indicating whether or not to let the system resume, and rearrange s2idle_loop() to use that value instead of the direct pm_wakeup_pending() call if ->wake() is present.

Next, rework acpi_s2idle_wake() to process EC events and check pm_wakeup_pending() before re-arming the SCI for system wakeup, to prevent it from triggering prematurely, and add comments to that function to explain the rationale for the new code flow.

Fixes: 56b99184 ("PM: sleep: Simplify suspend-to-idle control flow")
Cc: 5.4+ <stable@vger.kernel.org> # 5.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
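The reworked control flow can be sketched as a loop that lets the ->wake() callback make the resume decision itself. Everything here is a hypothetical model: `struct s2idle_ops_sketch`, `pm_wakeup_pending_sketch()` and `example_wake()` only mirror the roles described above. `example_wake()` swallows a configurable number of spurious events (returning false, so the loop goes back to idle) and otherwise reports whether a wakeup is pending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct s2idle_ops_sketch {
	bool (*wake)(void); /* returns true if the system should resume; may be NULL */
};

static bool wakeup_pending; /* stand-in for pm_wakeup_pending() state */
static bool pm_wakeup_pending_sketch(void) { return wakeup_pending; }

/* Hypothetical callback: handle spurious events first, then decide. */
static int spurious_left;
static bool example_wake(void)
{
	if (spurious_left > 0) {
		spurious_left--; /* process the event, stay suspended */
		return false;
	}
	return pm_wakeup_pending_sketch();
}

/* Returns the number of idle iterations performed before resuming
 * (or max_iters if nothing woke the system). */
int s2idle_loop_sketch(const struct s2idle_ops_sketch *ops, int max_iters)
{
	int iters = 0;

	assert(ops != NULL);
	while (iters < max_iters) {
		iters++;
		/* ... enter the idle state here ... */
		if (ops->wake) {
			/* Use the callback's verdict; no separate check afterwards. */
			if (ops->wake())
				break;
		} else if (pm_wakeup_pending_sketch()) {
			break;
		}
	}
	return iters;
}
```

Because the decision and the event processing happen in the same callback, there is no window between them in which a newly set GPE could be misread as a genuine wakeup.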
-
Tom Zanussi authored
Move the checking, buffer reserve and buffer commit code in synth_event_trace_start/end() into inline functions __synth_event_trace_start/end() so they can also be used by synth_event_trace() and synth_event_trace_array(), and then have all those functions use them.

Also, change synth_event_trace_state.enabled to disabled so it only needs to be set if the event is disabled, which is not normally the case.

Link: http://lkml.kernel.org/r/b1f3108d0f450e58192955a300e31d0405ab4149.1581374549.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
- Feb 08, 2020
Linus Torvalds authored
This makes the pipe code use separate wait-queues and exclusive waiting for readers and writers, avoiding a nasty thundering herd problem when there are lots of readers waiting for data on a pipe (or, less commonly, lots of writers waiting for a pipe to have space).

While this isn't a common occurrence in the traditional "use a pipe as a data transport" case, where you typically only have a single reader and a single writer process, there is one common special case: using a pipe as a source of "locking tokens" rather than for data communication.

In particular, the GNU make jobserver code ends up using a pipe as a way to limit parallelism, where each job consumes a token by reading a byte from the jobserver pipe, and releases the token by writing a byte back to the pipe.

This pattern is fairly traditional on Unix, and works very well, but will waste a lot of time waking up a lot of processes when only a single reader needs to be woken up when a writer releases a new token.

A simplified test-case of just this pipe interaction is to create 64 processes, and then pass a single token around between them (this test-case also intentionally passes another token that gets ignored to test the "wake up next" logic too, in case anybody wonders about it):

```c
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd[2], counters[2];

	pipe(fd);
	counters[0] = 0;
	counters[1] = -1;
	write(fd[1], counters, sizeof(counters));

	/* 64 processes */
	fork(); fork(); fork(); fork(); fork(); fork();

	do {
		int i;
		read(fd[0], &i, sizeof(i));
		if (i < 0)
			continue;
		counters[0] = i+1;
		write(fd[1], counters, (1+(i & 1)) *sizeof(int));
	} while (counters[0] < 1000000);
	return 0;
}
```

and in a perfect world, passing that token around should only cause one context switch per transfer, when the writer of a token causes a directed wakeup of just a single reader.

But with the "writer wakes all readers" model we traditionally had, on my test box the above case causes more than an order of magnitude more scheduling: instead of the expected ~1M context switches, "perf stat" shows

```
        231,852.37 msec task-clock                #   15.857 CPUs utilized
        11,250,961      context-switches          #    0.049 M/sec
           616,304      cpu-migrations            #    0.003 M/sec
             1,648      page-faults               #    0.007 K/sec
 1,097,903,998,514      cycles                    #    4.735 GHz
   120,781,778,352      instructions              #    0.11  insn per cycle
    27,997,056,043      branches                  #  120.754 M/sec
       283,581,233      branch-misses             #    1.01% of all branches

      14.621273891 seconds time elapsed

       0.018243000 seconds user
       3.611468000 seconds sys
```

before this commit. After this commit, I get

```
          5,229.55 msec task-clock                #    3.072 CPUs utilized
         1,212,233      context-switches          #    0.232 M/sec
           103,951      cpu-migrations            #    0.020 M/sec
             1,328      page-faults               #    0.254 K/sec
    21,307,456,166      cycles                    #    4.074 GHz
    12,947,819,999      instructions              #    0.61  insn per cycle
     2,881,985,678      branches                  #  551.096 M/sec
        64,267,015      branch-misses             #    2.23% of all branches

       1.702148350 seconds time elapsed

       0.004868000 seconds user
       0.110786000 seconds sys
```

instead. Much better.

[ Note! This kernel improvement seems to be very good at triggering a race condition in the make jobserver (in GNU make 4.2.1) for me. It's a long known bug that was fixed back in June 2017 by GNU make commit b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to avoid hangs.").

  But there wasn't a new release of GNU make until 4.3 on Jan 19 2020, so a number of distributions may still have the buggy version. Some have backported the fix to their 4.2.1 release, though, and even without the fix it's quite timing-dependent whether the bug actually is hit. ]

Josh Triplett says:
 "I've been hammering on your pipe fix patch (switching to exclusive wait queues) for a month or so, on several different systems, and I've run into no issues with it. The patch *substantially* improves parallel build times on large (~100 CPU) systems, both with parallel make and with other things that use make's pipe-based jobserver.

  All current distributions (including stable and long-term stable distributions) have versions of GNU make that no longer have the jobserver bug"

Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Zenghui Yu authored
Currently, we do not set vpe_l1_page for the current RD if we can inherit the vPE configuration table from another RD (or ITS), which results in an inconsistency between RDs within the same CommonLPIAff group.

Let's rename it to vpe_l1_base to indicate the base address of the vPE configuration table of this RD, and set it properly for *all* v4.1 redistributors.

Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200206075711.1275-3-yuzenghui@huawei.com
-
- Feb 07, 2020
Al Viro authored
called errorfc/infofc/warnfc/invalfc

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Don't bother with "mixed" options that would allow both the form with and without an argument (i.e. both -o foo and -o foo=bar). Rather than trying to shove both into a single fs_parameter_spec, allow having with-argument and no-argument specs with the same name and teach fs_parse to handle that.

There are very few options of that sort, and they are actually easier to handle that way - callers end up with less postprocessing.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
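The lookup rule can be sketched with a table that lists the same name twice, once as a flag and once taking an argument. The types and names below (`struct param_spec_sketch`, `parse_sketch()`, the `OPT_*` codes) are hypothetical and not the kernel's fs_parameter_spec; the sketch only shows how the presence of "=value" selects between the two same-named entries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical spec: the same option name may appear twice, once as a
 * flag and once taking an argument, instead of one "mixed" entry. */
struct param_spec_sketch {
	const char *name;
	bool takes_value;
	int opt; /* result code returned to the caller */
};

enum { OPT_FOO_FLAG, OPT_FOO_VALUE, OPT_BAR };

static const struct param_spec_sketch specs[] = {
	{ "foo", false, OPT_FOO_FLAG  },
	{ "foo", true,  OPT_FOO_VALUE },
	{ "bar", true,  OPT_BAR       },
	{ NULL,  false, 0 },            /* terminator */
};

/* Look up 'key'; 'has_value' says whether the user wrote key=value.
 * Returns the matching opt, or -1 if no spec of that shape exists. */
int parse_sketch(const char *key, bool has_value)
{
	assert(key != NULL);
	for (const struct param_spec_sketch *p = specs; p->name; p++)
		if (strcmp(p->name, key) == 0 && p->takes_value == has_value)
			return p->opt;
	return -1;
}
```

Each caller's switch then gets a distinct case for `foo` and `foo=...`, which is the "less postprocessing" the message refers to.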
-
Al Viro authored
The former contains nothing but a pointer to an array of the latter...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Eric Sandeen authored
Unused now.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
... turning it into struct p_log embedded into fs_context. Initialize the prefix with fs_type->name, turning fs_parse() into a trivial inline wrapper for __fs_parse().

This makes fs_parameter_description->name completely unused.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
... and now errorf() et al. are never called with a NULL fs_context, so we can get rid of the conditional in those.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
fs_parse() analogue taking p_log instead of fs_context. fs_parse() is turned into a wrapper, and the callers in ceph_common and rbd are switched to __fs_parse().

As a result, fs_parse() never gets a NULL fs_context, and neither do the fs_context-based logging primitives.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
primitives for prefixed logging

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Its behaviour is identical to that of fs_value_is_filename. It makes no sense anyway: LOOKUP_EMPTY affects nothing whatsoever once the pathname has been imported from userland, and both fs_value_is_filename and fs_value_is_filename_empty carry an already imported pathname.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Have the arrays of constant_table self-terminated (by a NULL ->name in the final entry). This simplifies lookup_constant() and allows reusing the search for enum params as well.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
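A sentinel-terminated table means the lookup needs no length argument. The following is a userspace re-creation for illustration only: the `name`/`value` fields follow the description above, but the exact `lookup_constant()` signature (table, name, fallback value) and the `cache_modes` example table are assumptions, not copied from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct constant_table {
	const char *name;
	int value;
};

/* Walk the table until the entry whose ->name is NULL. */
static const struct constant_table *
find_constant(const struct constant_table *tbl, const char *name)
{
	for (; tbl->name; tbl++)
		if (strcmp(name, tbl->name) == 0)
			return tbl;
	return NULL;
}

/* Return the value for 'name', or 'not_found' if it is not listed. */
int lookup_constant(const struct constant_table *tbl, const char *name,
		    int not_found)
{
	const struct constant_table *p;

	assert(tbl != NULL && name != NULL);
	p = find_constant(tbl, name);
	return p ? p->value : not_found;
}

/* Hypothetical example table, terminated by an entry with a NULL name. */
static const struct constant_table cache_modes[] = {
	{ "none",   0 },
	{ "loose",  1 },
	{ "strict", 2 },
	{ NULL,     0 },
};
```

Since the search helper returns the matching entry rather than a value, the same walk could also serve an enum-parameter parser, which is the reuse the message mentions.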
-
Johannes Berg authored
It turns out that this wasn't a good idea: I hit a test failure in hwsim due to this. That particular failure was easily worked around, but it raised questions: if an AP needs to, for example, send action frames to each connected station, the current limit is nowhere near enough (especially if those stations are sleeping and the frames are queued for a while).

Shuffle around some bits to make more room for ack_frame_id, allowing up to 8192 queued-up frames; that's enough for queueing 4 frames to each connected station, even at the maximum of 2007 stations on a single AP.

We take the bits from band (which currently needs only 2 bits, but I leave 3 in case we add another band) and from hw_queue, which needs only 4 bits since it has a limit of 16 queues.

Fixes: 6912daed ("mac80211: Shrink the size of ack_frame_id to make room for tx_time_est")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20200115122549.b9a4ef9f4980.Ied52ed90150220b83a280009c590b65d125d087c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
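The bit budget above can be checked at compile time. The widths are inferred from the numbers in the message (8192 ids need 13 bits, 16 queues need 4 bits, band keeps 3), and `struct tx_flags_sketch` is a hypothetical layout, not the real mac80211 TX info structure.

```c
#include <assert.h>

#define ACK_FRAME_ID_BITS 13 /* 2^13 = 8192 queued frames */
#define BAND_BITS          3 /* 2 needed today, 1 spare */
#define HW_QUEUE_BITS      4 /* 2^4 = 16 hardware queues */

#define MAX_STATIONS       2007
#define FRAMES_PER_STATION 4

static_assert((1 << ACK_FRAME_ID_BITS) == 8192, "8192 ack frame ids");
static_assert(FRAMES_PER_STATION * MAX_STATIONS <= (1 << ACK_FRAME_ID_BITS),
	      "4 frames for each of 2007 stations must fit");
static_assert((1 << HW_QUEUE_BITS) == 16, "16 hardware queues");

/* Hypothetical packing of the three fields: 13 + 3 + 4 = 20 bits. */
struct tx_flags_sketch {
	unsigned int ack_frame_id : ACK_FRAME_ID_BITS;
	unsigned int band         : BAND_BITS;
	unsigned int hw_queue     : HW_QUEUE_BITS;
};
```

With 4 × 2007 = 8028 outstanding frames, 13 bits leave 164 ids of headroom.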
-
Damien Le Moal authored
zonefs is a very simple file system exposing each zone of a zoned block device as a file. Unlike a regular file system with zoned block device support (e.g. f2fs), zonefs does not hide the sequential write constraint of zoned block devices from the user. Files representing sequential write zones of the device must be written sequentially starting from the end of the file (append-only writes).

As such, zonefs is in essence closer to a raw block device access interface than to a full-featured POSIX file system. The goal of zonefs is to simplify the implementation of zoned block device support in applications by replacing raw block device file accesses with a richer file API, avoiding reliance on direct block device file ioctls, which may be more obscure to developers. One example of this approach is the implementation of LSM (log-structured merge) tree structures (such as those used in RocksDB and LevelDB) on zoned block devices, by allowing SSTables to be stored in a zone file similarly to a regular file system rather than as a range of sectors of a zoned device. The introduction of the higher-level construct "one file is one zone" can help reduce the amount of changes needed in the application, as well as introduce support for different application programming languages.

Zonefs on-disk metadata is reduced to an immutable super block to persistently store a magic number and optional feature flags and values. On mount, zonefs uses blkdev_report_zones() to obtain the device zone configuration and populates the mount point with a static file tree solely based on this information. E.g. file sizes come from the device zone type and write pointer offset managed by the device itself.

The zone files created on mount have the following characteristics.
1) Files representing zones of the same type are grouped together under a common sub-directory:
   * For conventional zones, the sub-directory "cnv" is used.
   * For sequential write zones, the sub-directory "seq" is used.
   These two directories are the only directories that exist in zonefs. Users cannot create other directories and cannot rename nor delete the "cnv" and "seq" sub-directories.
2) The name of each zone file is the number of the file within the zone type sub-directory, in order of increasing zone start sector.
3) The size of conventional zone files is fixed to the device zone size. Conventional zone files cannot be truncated.
4) The size of a sequential zone file represents the file's zone write pointer position relative to the zone start sector. Truncating these files is allowed only down to 0, in which case the zone is reset to rewind the zone write pointer position to the start of the zone, or up to the zone size, in which case the file's zone is transitioned to the FULL state (finish zone operation).
5) No read or write operation is allowed beyond the file zone size. Any access exceeding the zone size fails with the -EFBIG error.
6) Creating, deleting, renaming or modifying any attribute of files and sub-directories is not allowed.
7) There are no restrictions on the type of read and write operations that can be issued to conventional zone files. Buffered, direct and mmap read & write operations are accepted. For sequential zone files, there are no restrictions on read operations, but all write operations must be direct IO append writes. mmap write of sequential files is not allowed.

Several optional features of zonefs can be enabled at format time.
* Conventional zone aggregation: ranges of contiguous conventional zones can be aggregated into a single larger file instead of the default one file per zone.
* File ownership: the owner UID and GID of zone files is by default 0 (root) but can be changed to any valid UID/GID.
* File access permissions: the default 640 access permissions can be changed.

The mkzonefs tool is used to format zoned block devices for use with zonefs. This tool is available on GitHub at: git@github.com:damien-lemoal/zonefs-tools.git.

zonefs-tools also includes a test suite which can be run against any zoned block device, including a null_blk block device created in zoned mode.

Example: the following formats a 15TB host-managed SMR HDD with 256 MB zones with the conventional zones aggregation feature enabled.

```
$ sudo mkzonefs -o aggr_cnv /dev/sdX
$ sudo mount -t zonefs /dev/sdX /mnt
$ ls -l /mnt/
total 0
dr-xr-xr-x 2 root root     1 Nov 25 13:23 cnv
dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq
```

The size of the zone files sub-directories indicates the number of files existing for each type of zone. In this example, there is only one conventional zone file (all conventional zones are aggregated under a single file).

```
$ ls -l /mnt/cnv
total 137101312
-rw-r----- 1 root root 140391743488 Nov 25 13:23 0
```

This aggregated conventional zone file can be used as a regular file.

```
$ sudo mkfs.ext4 /mnt/cnv/0
$ sudo mount -o loop /mnt/cnv/0 /data
```

The "seq" sub-directory grouping files for sequential write zones has in this example 55356 zones.

```
$ ls -lv /mnt/seq
total 14511243264
-rw-r----- 1 root root 0 Nov 25 13:23 0
-rw-r----- 1 root root 0 Nov 25 13:23 1
-rw-r----- 1 root root 0 Nov 25 13:23 2
...
-rw-r----- 1 root root 0 Nov 25 13:23 55354
-rw-r----- 1 root root 0 Nov 25 13:23 55355
```

For sequential write zone files, the file size changes as data is appended at the end of the file, similarly to any regular file system.

```
$ dd if=/dev/zero of=/mnt/seq/0 bs=4K count=1 conv=notrunc oflag=direct
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000452219 s, 9.1 MB/s

$ ls -l /mnt/seq/0
-rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0
```

The written file can be truncated to the zone size, preventing any further write operation.

```
$ truncate -s 268435456 /mnt/seq/0
$ ls -l /mnt/seq/0
-rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0
```

Truncation to 0 size allows freeing the file zone storage space and restarting append-writes to the file.

```
$ truncate -s 0 /mnt/seq/0
$ ls -l /mnt/seq/0
-rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0
```

Since files are statically mapped to zones on the disk, the number of blocks of a file as reported by stat() and fstat() indicates the size of the file zone.

```
$ stat /mnt/seq/0
  File: /mnt/seq/0
  Size: 0           Blocks: 524288     IO Block: 4096   regular empty file
Device: 870h/2160d  Inode: 50431       Links: 1
Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-11-25 13:23:57.048971997 +0900
Modify: 2019-11-25 13:52:25.553805765 +0900
Change: 2019-11-25 13:52:25.553805765 +0900
 Birth: -
```

The number of blocks of the file ("Blocks"), in units of 512B blocks, gives the maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone size in this example. Of note is that the "IO Block" field always indicates the minimum IO size for writes and corresponds to the device physical sector size.

This code contains contributions from:
* Johannes Thumshirn <jthumshirn@suse.de>,
* Darrick J. Wong <darrick.wong@oracle.com>,
* Christoph Hellwig <hch@lst.de>,
* Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> and
* Ting Yao <tingyao@hust.edu.cn>.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
-
Al Viro authored
no real difference now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Don't do a single array; attach them to the fsparam_enum() entry instead. And don't bother trying to embed the names into those - it actually loses memory, with no real speedup worth mentioning.

Simplifies validation as well.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
As it is, vfs_parse_fs_string() makes "foo" and "foo=" indistinguishable; both get fs_value_is_string for ->type and NULL for ->string. To make it even more unpleasant, that combination is impossible to produce with fsconfig().

Much saner rules would be

    "foo"      => fs_value_is_flag, NULL
    "foo="     => fs_value_is_string, ""
    "foo=bar"  => fs_value_is_string, "bar"

All cases are distinguishable, all results are expressible by fsconfig(), ->has_value checks are much simpler that way (to the point of the field being useless), and quite a few regressions go away (gfs2 has no business accepting -o nodebug=, for example).

Partially based upon patches from Miklos.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
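The three proposed rules translate directly into a splitter that treats "no '='" and "'=' with an empty value" differently. `struct param_sketch`, `enum value_type_sketch` and `split_param_sketch()` are hypothetical names that echo the message; this is not vfs_parse_fs_string() itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum value_type_sketch { VALUE_IS_FLAG, VALUE_IS_STRING };

struct param_sketch {
	char key[64];
	enum value_type_sketch type;
	const char *string; /* NULL for flags, else points into the input */
};

/* Split "key[=value]" according to the rules above:
 *   "foo"     -> flag,   NULL
 *   "foo="    -> string, ""
 *   "foo=bar" -> string, "bar"
 * Returns 0 on success, -1 if the key does not fit in out->key. */
int split_param_sketch(const char *opt, struct param_sketch *out)
{
	const char *eq;
	size_t klen;

	assert(opt != NULL && out != NULL);
	eq = strchr(opt, '=');
	klen = eq ? (size_t)(eq - opt) : strlen(opt);
	if (klen >= sizeof(out->key))
		return -1;
	memcpy(out->key, opt, klen);
	out->key[klen] = '\0';
	if (eq) {
		out->type = VALUE_IS_STRING;
		out->string = eq + 1; /* may be the empty string */
	} else {
		out->type = VALUE_IS_FLAG;
		out->string = NULL;
	}
	return 0;
}
```

With this split, an option that only accepts a flag can reject "nodebug=" simply because its type is VALUE_IS_STRING.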
-
- Feb 06, 2020
Tariq Toukan authored
Deprecate the generic TLS cap bit; use the new TX-specific TLS cap bit instead.

Fixes: a12ff35e ("net/mlx5: Introduce TLS TX offload hardware bits and structures")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Qian Cai authored
sk_buff.qlen can be accessed concurrently, as noticed by KCSAN:

```
BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_dgram_sendmsg

read to 0xffff8a1b1d8a81c0 of 4 bytes by task 5371 on cpu 96:
 unix_dgram_sendmsg+0x9a9/0xb70 include/linux/skbuff.h:1821
                                net/unix/af_unix.c:1761
 ____sys_sendmsg+0x33e/0x370
 ___sys_sendmsg+0xa6/0xf0
 __sys_sendmsg+0x69/0xf0
 __x64_sys_sendmsg+0x51/0x70
 do_syscall_64+0x91/0xb47
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

write to 0xffff8a1b1d8a81c0 of 4 bytes by task 1 on cpu 99:
 __skb_try_recv_from_queue+0x327/0x410 include/linux/skbuff.h:2029
 __skb_try_recv_datagram+0xbe/0x220
 unix_dgram_recvmsg+0xee/0x850
 ____sys_recvmsg+0x1fb/0x210
 ___sys_recvmsg+0xa2/0xf0
 __sys_recvmsg+0x66/0xf0
 __x64_sys_recvmsg+0x51/0x70
 do_syscall_64+0x91/0xb47
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
```

Since only the read operates locklessly, load tearing could introduce a logic bug in unix_recvq_full(). Fix it by adding lockless variants of skb_queue_len() and unix_recvq_full() that use READ_ONCE() on the read side and WRITE_ONCE() on the write side, similar to commit d7d16a89 ("net: add skb_queue_empty_lockless()").

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- Feb 05, 2020
Geert Uytterhoeven authored
Depending on include order:

```
include/linux/of_clk.h:11:45: warning: ‘struct device_node’ declared inside parameter list will not be visible outside of this definition or declaration
 unsigned int of_clk_get_parent_count(struct device_node *np);
                                             ^~~~~~~~~~~
include/linux/of_clk.h:12:43: warning: ‘struct device_node’ declared inside parameter list will not be visible outside of this definition or declaration
 const char *of_clk_get_parent_name(struct device_node *np, int index);
                                           ^~~~~~~~~~~
include/linux/of_clk.h:13:31: warning: ‘struct of_device_id’ declared inside parameter list will not be visible outside of this definition or declaration
 void of_clk_init(const struct of_device_id *matches);
                               ^~~~~~~~~~~~
```

Fix this by adding forward declarations for struct device_node and struct of_device_id.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lkml.kernel.org/r/20200205194649.31309-1-geert+renesas@glider.be
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
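The fix pattern is plain C: declare the structs at file scope before any prototype mentions them, so every prototype refers to the same file-scope type no matter which headers came first. The prototypes below carry a hypothetical `_sketch` suffix and the struct bodies are made up for the example; they are not the real of_clk.h declarations or OF structures.

```c
#include <assert.h>
#include <stddef.h>

/* Forward declarations: without these two lines, each parameter list
 * below would declare its own private struct type, GCC would warn as
 * shown above, and the later definitions would conflict with the
 * prototypes. */
struct device_node;
struct of_device_id;

unsigned int of_clk_get_parent_count_sketch(struct device_node *np);
size_t count_matches_sketch(const struct of_device_id *matches);

/* Full definitions can come later or from another header; incomplete
 * types are enough for pointer parameters. Fields are illustrative. */
struct device_node {
	unsigned int nparents;
};

struct of_device_id {
	const char *compatible;
};

unsigned int of_clk_get_parent_count_sketch(struct device_node *np)
{
	assert(np != NULL);
	return np->nparents;
}

/* Count entries up to the empty sentinel at the end of the table. */
size_t count_matches_sketch(const struct of_device_id *matches)
{
	size_t n = 0;

	assert(matches != NULL);
	while (matches[n].compatible)
		n++;
	return n;
}
```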
-
Eric Dumazet authored
syzbot managed to send an IPX packet through bond_alb_xmit() and af_packet and triggered a use-after-free.

First, bond_alb_xmit() was using the ipx_hdr() helper to reach the IPX header, but ipx_hdr() was using the transport offset instead of the network offset. In the particular syzbot report, the transport offset was 0xFFFF.

This patch removes ipx_hdr() since it was only (mis)used from bonding.

Then we need to make sure the IPv4/IPv6/IPX headers are pulled into skb->head before dereferencing anything.

```
BUG: KASAN: use-after-free in bond_alb_xmit+0x153a/0x1590 drivers/net/bonding/bond_alb.c:1452
Read of size 2 at addr ffff8801ce56dfff by task syz-executor.2/18108
 (if (ipx_hdr(skb)->ipx_checksum != IPX_NO_CHECKSUM) ...)

Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 [<ffffffff8441fc42>] __dump_stack lib/dump_stack.c:17 [inline]
 [<ffffffff8441fc42>] dump_stack+0x14d/0x20b lib/dump_stack.c:53
 [<ffffffff81a7dec4>] print_address_description+0x6f/0x20b mm/kasan/report.c:282
 [<ffffffff81a7e0ec>] kasan_report_error mm/kasan/report.c:380 [inline]
 [<ffffffff81a7e0ec>] kasan_report mm/kasan/report.c:438 [inline]
 [<ffffffff81a7e0ec>] kasan_report.cold+0x8c/0x2a0 mm/kasan/report.c:422
 [<ffffffff81a7dc4f>] __asan_report_load_n_noabort+0xf/0x20 mm/kasan/report.c:469
 [<ffffffff82c8c00a>] bond_alb_xmit+0x153a/0x1590 drivers/net/bonding/bond_alb.c:1452
 [<ffffffff82c60c74>] __bond_start_xmit drivers/net/bonding/bond_main.c:4199 [inline]
 [<ffffffff82c60c74>] bond_start_xmit+0x4f4/0x1570 drivers/net/bonding/bond_main.c:4224
 [<ffffffff83baa558>] __netdev_start_xmit include/linux/netdevice.h:4525 [inline]
 [<ffffffff83baa558>] netdev_start_xmit include/linux/netdevice.h:4539 [inline]
 [<ffffffff83baa558>] xmit_one net/core/dev.c:3611 [inline]
 [<ffffffff83baa558>] dev_hard_start_xmit+0x168/0x910 net/core/dev.c:3627
 [<ffffffff83bacf35>] __dev_queue_xmit+0x1f55/0x33b0 net/core/dev.c:4238
 [<ffffffff83bae3a8>] dev_queue_xmit+0x18/0x20 net/core/dev.c:4278
 [<ffffffff84339189>] packet_snd net/packet/af_packet.c:3226 [inline]
 [<ffffffff84339189>] packet_sendmsg+0x4919/0x70b0 net/packet/af_packet.c:3252
 [<ffffffff83b1ac0c>] sock_sendmsg_nosec net/socket.c:673 [inline]
 [<ffffffff83b1ac0c>] sock_sendmsg+0x12c/0x160 net/socket.c:684
 [<ffffffff83b1f5a2>] __sys_sendto+0x262/0x380 net/socket.c:1996
 [<ffffffff83b1f700>] SYSC_sendto net/socket.c:2008 [inline]
 [<ffffffff83b1f700>] SyS_sendto+0x40/0x60 net/socket.c:2004
```

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
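The two corrections (locate the header via the network offset, and verify the bytes are present before reading) can be modelled on a flat buffer. This is a hypothetical sketch: `struct pkt_buf_sketch`, `may_pull_sketch()` and `ipx_checksum_sketch()` are illustrative names; the real pskb_may_pull() can also move paged data into skb->head, which a flat buffer does not need. The 30-byte fixed IPX header with the 16-bit checksum as its first field is standard IPX layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat packet: 'len' bytes present, network header at
 * 'network_offset'. */
struct pkt_buf_sketch {
	const uint8_t *data;
	size_t len;
	size_t network_offset;
};

#define IPX_HDR_LEN 30u /* fixed IPX header length */

/* True if 'hdr_len' bytes starting at the network header exist. */
static bool may_pull_sketch(const struct pkt_buf_sketch *p, size_t hdr_len)
{
	return p->network_offset <= p->len &&
	       hdr_len <= p->len - p->network_offset;
}

/* Read the big-endian checksum (first 16 bits of the IPX header), found
 * via the *network* offset and only after the length check.
 * Returns 0 on success, -1 if the header is not fully present. */
int ipx_checksum_sketch(const struct pkt_buf_sketch *p, uint16_t *csum)
{
	const uint8_t *h;

	assert(p != NULL && csum != NULL);
	if (!may_pull_sketch(p, IPX_HDR_LEN))
		return -1;
	h = p->data + p->network_offset;
	*csum = (uint16_t)((h[0] << 8) | h[1]);
	return 0;
}
```

A bogus offset such as 0xFFFF, as in the syzbot report, is now rejected instead of being dereferenced.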
-
Andy Shevchenko authored
Replace with appropriate types.h.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Andy Shevchenko authored
Replace with appropriate types.h.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- Feb 04, 2020
Dai Ngo authored
When the directory is large and is being modified by one client while another client is doing an 'ls -l' on the same directory, the cache page invalidation from nfs_force_use_readdirplus causes the reading client to keep restarting READDIRPLUS from cookie 0. This causes the 'ls -l' to take a very long time to complete, possibly never completing.

Currently, when nfs_force_use_readdirplus is called to switch from READDIR to READDIRPLUS, it invalidates all the cached pages of the directory. This cache page invalidation causes the next nfs_readdir to re-read the directory content from cookie 0.

This patch optimises the cache invalidation in nfs_force_use_readdirplus by only truncating the cached pages from the last page index accessed to the end of the file. It also marks the inode to delay invalidating all the cached pages of the directory until the next initial nfs_readdir of the next 'ls' instance.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com>
[Anna - Fix conflicts with Trond's readdir patches]
[Anna - Remove redundant call to nfs_zap_mapping()]
[Anna - Replace d_inode(file_dentry(desc->file)) with file_inode(desc->file)]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
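The change from "drop everything" to "truncate the tail and defer the rest" can be sketched with a toy page cache. All names are hypothetical (`struct dir_cache_sketch` and its helpers), the cache is a fixed array of validity flags, and whether the page at the last accessed index itself is kept is a detail this sketch does not try to match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DIR_PAGES_SKETCH 8 /* hypothetical cache size in pages */

struct dir_cache_sketch {
	bool valid[DIR_PAGES_SKETCH];
	size_t last_index;       /* last page index the reader accessed */
	bool invalidate_on_open; /* deferred full invalidation */
};

/* Old behaviour: drop every page, so the reader restarts at cookie 0. */
void invalidate_all_sketch(struct dir_cache_sketch *d)
{
	for (size_t i = 0; i < DIR_PAGES_SKETCH; i++)
		d->valid[i] = false;
}

/* New behaviour when switching to READDIRPLUS: keep the pages already
 * consumed, truncate from last_index to the end, and defer the full
 * invalidation to the next initial readdir. */
void force_readdirplus_sketch(struct dir_cache_sketch *d)
{
	assert(d != NULL && d->last_index < DIR_PAGES_SKETCH);
	for (size_t i = d->last_index; i < DIR_PAGES_SKETCH; i++)
		d->valid[i] = false;
	d->invalidate_on_open = true;
}

/* Start of a new 'ls' (the initial readdir): apply the deferred drop. */
void initial_readdir_sketch(struct dir_cache_sketch *d)
{
	assert(d != NULL);
	if (d->invalidate_on_open) {
		invalidate_all_sketch(d);
		d->invalidate_on_open = false;
	}
}
```

The reader in the middle of a listing keeps its already cached prefix, so it resumes from where it was instead of cookie 0, while the next listing still starts from a clean cache.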
-
Michal Simek authored
dma-contiguous.h is generic for all architectures except arm32, which has its own version.

A similar change was done for msi.h by commit a1b39bae ("asm-generic: Make msi.h a mandatory include/asm header").

Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/linux-arm-kernel/20200117080446.GA8980@lst.de/T/#m92bb56b04161057635d4142e1b3b9b6b0a70122e
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Paul Walmsley <paul.walmsley@sifive.com> # for arch/riscv
-
Yury Norov authored
The new design of the inner bitmap_parse() makes it possible to avoid calculating the size of a null-terminated string.

Link: http://lkml.kernel.org/r/20200102043031.30357-8-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Amritha Nambiar <amritha.nambiar@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: "Tobin C . Harding" <tobin@kernel.org>
Cc: Vineet Gupta <vineet.gupta1@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-