aboutsummaryrefslogtreecommitdiff
path: root/include
Commit message (Collapse)AuthorAgeFilesLines
* mm/oom_kill: squashed reverts to a stable stateCorinna Vinschen2019-07-192-11/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Revert "mm, oom: fix use-after-free in oom_kill_process" This reverts commit e1bebdeedb497f03d426c85a89c3807c7e75268d. Signed-off-by: Corinna Vinschen <xda@vinschen.de> Revert "mm,oom: make oom_killer_disable() killable" This reverts commit 65a7400a432639aa8d5e572f30687fbca204b6f8. Signed-off-by: Corinna Vinschen <xda@vinschen.de> Revert "mm: oom_kill: don't ignore oom score on exiting tasks" This reverts commit d60dae46b27a8f381e4a7ad9dde870faa49fa5f1. Signed-off-by: Corinna Vinschen <xda@vinschen.de> Revert "mm/oom_kill.c: avoid attempting to kill init sharing same memory" This reverts commit 10773c0325259d6640b93c0694b5598ddf84939f. Signed-off-by: Corinna Vinschen <xda@vinschen.de> Revert "CHROMIUM: DROP: mm/oom_kill: Double-check before killing a child in our place" This reverts commit 2bdd9a2042a0e12d96c545773d9d8038c920f813. Revert "mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process()" This reverts commit 419a313435b31821e4d045ca4b7ea1cc5fa02035. Signed-off-by: Corinna Vinschen <xda@vinschen.de> Revert "mm/oom_kill: cleanup the "kill sharing same memory" loop" This reverts commit afda78c6de38f9f66eba0955153b380d540d8276. Revert "mm/oom_kill: remove the wrong fatal_signal_pending() check in oom_kill_process()" This reverts commit acde9c2ace298b249c06ec5b0b971c333449dc09. Signed-off-by: Corinna Vinschen <xda@vinschen.de> Revert "mm, oom: remove task_lock protecting comm printing" This reverts commit 9a9ca142d250ec9de1215284857f4528c6ddb080. Signed-off-by: Corinna Vinschen <xda@vinschen.de> Revert "mm/oom_kill.c: suppress unnecessary "sharing same memory" message" This reverts commit 1aa2960f7c70d65b1481f805ac73b988faff6747. Signed-off-by: Corinna Vinschen <xda@vinschen.de> Revert "mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL" This reverts commit f028aedfcfd2e2bb98921b98d3ae183387ab8fed. Revert "mm, oom: remove unnecessary variable" This reverts commit 54b0b58224146d68a11bccb5e64683ab3029373a. Revert "mm/oom_kill.c: print points as unsigned int" This reverts commit 603f975a6d4f0b56c7f6df7889ef2a704eca94a3. Signed-off-by: Corinna Vinschen <xda@vinschen.de> Revert "mm: oom_kill: simplify OOM killer locking" This reverts commit 7951a52ed35d162063fa08b27894e302fd716ccd. Revert "mm: oom_kill: remove unnecessary locking in exit_oom_victim()" This reverts commit f0739b25ac884682865d6aae7485e79489107bfb. Revert "mm: oom_kill: generalize OOM progress waitqueue" This reverts commit eb4b1243c72ba0b392bbe05dbf9f91959f70eb18. Revert "mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear" This reverts commit e611f16275c3642cb8a6345ff2470926fef52110. Revert "mm: oom_kill: clean up victim marking and exiting interfaces" This reverts commit c6fada01b9370e3d7603b4ad8c26b56759174667. Revert "mm: oom_kill: remove unnecessary locking in oom_enable()" This reverts commit 5dd152d7351b3805f59b2b1f624722ab2f3c5fd8. Revert "oom, PM: make OOM detection in the freezer path raceless" This reverts commit 5fc5b1ddee5404a7629dd7045f54eaf8941bc11c.
* mm: Add notifier framework for showing memoryLaura Abbott2019-07-191-0/+20
| | | | | | | | | | | | There are many drivers in the kernel which can hold on to lots of memory. It can be useful to dump out all those drivers at key points in the kernel. Introduct a notifier framework for dumping this information. When the notifiers are called, drivers can dump out the state of any memory they may be using. Change-Id: Ifb2946964bf5d072552dd56d8d6dfdd794af6d84 Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
* Bluetooth: Align minimum encryption key size for LE and BR/EDR connectionsMarcel Holtmann2019-07-181-0/+3
| | | | | | | | | | | | | | commit d5bb334a8e171b262e48f378bd2096c0ea458265 upstream. The minimum encryption key size for LE connections is 56 bits and to align LE with BR/EDR, enforce 56 bits of minimum encryption key size for BR/EDR connections as well. Change-Id: Iaa1e00cab1ca82f42098c461f91fe370e501d826 Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Bluetooth: Convert hci_conn->link_mode into flagsJohan Hedberg2019-07-181-3/+6
| | | | | | | | | | | | | Since the link_mode member of the hci_conn struct is a bit field and we already have a flags member as well it makes sense to merge these two together. This patch moves all used link_mode bits into corresponding flags. To keep backwards compatibility with user space we still need to provide a get_link_mode() helper function for the ioctl's that expect a link_mode style value. Change-Id: Ia885bce68ab454ad47230a6a577e7ddd9319d73c Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
* BACKPORT: tcp: add tcp_min_snd_mss sysctlEric Dumazet2019-07-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5f3e2bf008c2221478101ee72f5cb4654b9fc363 upstream. Some TCP peers announce a very small MSS option in their SYN and/or SYN/ACK messages. This forces the stack to send packets with a very high network/cpu overhead. Linux has enforced a minimal value of 48. Since this value includes the size of TCP options, and that the options can consume up to 40 bytes, this means that each segment can include only 8 bytes of payload. In some cases, it can be useful to increase the minimal value to a saner value. We still let the default to 48 (TCP_MIN_SND_MSS), for compatibility reasons. Note that TCP_MAXSEG socket option enforces a minimal value of (TCP_MIN_MSS). David Miller increased this minimal value in commit c39508d6f118 ("tcp: Make TCP_MAXSEG minimum more correct.") from 64 to 88. We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS. CVE-2019-11479 -- tcp mss hardcoded to 48 Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Bruce Curtis <brucec@netflix.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [BACKPORT to 3.10: use previous sysctrl method] Signed-off-by: syphyr@gmail.com Change-Id: Ib5e91a60fe4f4c00afc27ed92b1bd8dfe39fb7c9
* tcp: limit payload size of sacked skbsEric Dumazet2019-07-182-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3b4929f65b0d8249f19a50245cd88ed1a2f78cff upstream. Jonathan Looney reported that TCP can trigger the following crash in tcp_shifted_skb() : BUG_ON(tcp_skb_pcount(skb) < pcount); This can happen if the remote peer has advertized the smallest MSS that linux TCP accepts : 48 An skb can hold 17 fragments, and each fragment can hold 32KB on x86, or 64KB on PowerPC. This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs can overflow. Note that tcp_sendmsg() builds skbs with less than 64KB of payload, so this problem needs SACK to be enabled. SACK blocks allow TCP to coalesce multiple skbs in the retransmit queue, thus filling the 17 fragments to maximal capacity. CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs Backport notes, provided by Joao Martins <joao.m.martins@oracle.com> v4.15 or since commit 737ff314563 ("tcp: use sequence distance to detect reordering") had switched from the packet-based FACK tracking and switched to sequence-based. v4.14 and older still have the old logic and hence on tcp_skb_shift_data() needs to retain its original logic and have @fack_count in sync. In other words, we keep the increment of pcount with tcp_skb_pcount(skb) to later used that to update fack_count. To make it more explicit we track the new skb that gets incremented to pcount in @next_pcount, and we get to avoid the constant invocation of tcp_skb_pcount(skb) all together. Fixes: 832d11c5cd07 ("tcp: Try to restore large SKBs while SACK processing") Change-Id: Ia549e9b12cd033edd93f90e13c6c0e255f74c399 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Bruce Curtis <brucec@netflix.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tcp: tcp_fragment() should apply sane memory limitsEric Dumazet2019-07-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f070ef2ac66716357066b683fb0baf55f8191a2e upstream. Jonathan Looney reported that a malicious peer can force a sender to fragment its retransmit queue into tiny skbs, inflating memory usage and/or overflow 32bit counters. TCP allows an application to queue up to sk_sndbuf bytes, so we need to give some allowance for non malicious splitting of retransmit queue. A new SNMP counter is added to monitor how many times TCP did not allow to split an skb if the allowance was exceeded. Note that this counter might increase in the case applications use SO_SNDBUF socket option to lower sk_sndbuf. CVE-2019-11478 : tcp_fragment, prevent fragmenting a packet when the socket is already using more than half the allowed space Change-Id: I594a9f68263f774fa6f0824042bc287bba6dc927 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Cc: Bruce Curtis <brucec@netflix.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm: introduce vma_is_anonymous(vma) helperOleg Nesterov2019-07-181-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b5330628546616af14ff23075fbf8d4ad91f6e25 upstream. special_mapping_fault() is absolutely broken. It seems it was always wrong, but this didn't matter until vdso/vvar started to use more than one page. And after this change vma_is_anonymous() becomes really trivial, it simply checks vm_ops == NULL. However, I do think the helper makes sense. There are a lot of ->vm_ops != NULL checks, the helper makes the caller's code more understandable (self-documented) and this is more grep-friendly. This patch (of 3): Preparation. Add the new simple helper, vma_is_anonymous(vma), and change handle_pte_fault() to use it. It will have more users. The name is not accurate, say a hpet_mmap()'ed vma is not anonymous. Perhaps it should be named vma_has_fault() instead. But it matches the logic in mmap.c/memory.c (see next changes). "True" just means that a page fault will use do_anonymous_page(). Change-Id: I024c69016c5125b6f40e990a2f63c6630f641b28 Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.16 as dependency of "mm/mincore.c: make mincore() more conservative"; adjusted context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> (cherry picked from commit e3bcb8e29b639d822175be5cb1b8e6b124edf98e)
* mm, oom: remove task_lock protecting comm printingDavid Rientjes2019-07-081-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | The oom killer takes task_lock() in a couple of places solely to protect printing the task's comm. A process's comm, including current's comm, may change due to /proc/pid/comm or PR_SET_NAME. The comm will always be NULL-terminated, so the worst race scenario would only be during update. We can tolerate a comm being printed that is in the middle of an update to avoid taking the lock. Other locations in the kernel have already dropped task_lock() when printing comm, so this is consistent. Change-Id: I89f64666a1db5d414aa53862fd6b665bbb8125bc Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: oom_kill: simplify OOM killer lockingJohannes Weiner2019-07-081-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The zonelist locking and the oom_sem are two overlapping locks that are used to serialize global OOM killing against different things. The historical zonelist locking serializes OOM kills from allocations with overlapping zonelists against each other to prevent killing more tasks than necessary in the same memory domain. Only when neither tasklists nor zonelists from two concurrent OOM kills overlap (tasks in separate memcgs bound to separate nodes) are OOM kills allowed to execute in parallel. The younger oom_sem is a read-write lock to serialize OOM killing against the PM code trying to disable the OOM killer altogether. However, the OOM killer is a fairly cold error path, there is really no reason to optimize for highly performant and concurrent OOM kills. And the oom_sem is just flat-out redundant. Replace both locking schemes with a single global mutex serializing OOM kills regardless of context. Change-Id: Ieb0b621bc3a391cc0a826a3ae53bf28ea4a8dbe5 Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: oom_kill: clean up victim marking and exiting interfacesJohannes Weiner2019-07-081-3/+4
| | | | | | | | | | | | | | | | | | | | | | Rename unmark_oom_victim() to exit_oom_victim(). Marking and unmarking are related in functionality, but the interface is not symmetrical at all: one is an internal OOM killer function used during the killing, the other is for an OOM victim to signal its own death on exit later on. This has locking implications, see follow-up changes. While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which is easier on the eye. Change-Id: I8956f6357e98f17e0ae6096c6a2c7027886a4fda Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom, PM: make OOM detection in the freezer path racelessMichal Hocko2019-07-081-11/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM suspend") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, "We used to have freezing points deep in file system code which may be reacheable from page fault." so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Change-Id: Ie72c2cfc39dad6420802b873053c739e804f956f Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* partial merge fix - BACKPORT: random: introduce getrandom(2) system callTheodore Ts'o2019-07-072-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Almost clean cherry pick of c6e9d6f38894798696f23c8084ca7edbf16ee895, includes change made by merge 0891ad829d2a0501053703df66029e843e3b8365. The getrandom(2) system call was requested by the LibreSSL Portable developers. It is analoguous to the getentropy(2) system call in OpenBSD. The rationale of this system call is to provide resiliance against file descriptor exhaustion attacks, where the attacker consumes all available file descriptors, forcing the use of the fallback code where /dev/[u]random is not available. Since the fallback code is often not well-tested, it is better to eliminate this potential failure mode entirely. The other feature provided by this new system call is the ability to request randomness from the /dev/urandom entropy pool, but to block until at least 128 bits of entropy has been accumulated in the /dev/urandom entropy pool. Historically, the emphasis in the /dev/urandom development has been to ensure that urandom pool is initialized as quickly as possible after system boot, and preferably before the init scripts start execution. This is because changing /dev/urandom reads to block represents an interface change that could potentially break userspace which is not acceptable. In practice, on most x86 desktop and server systems, in general the entropy pool can be initialized before it is needed (and in modern kernels, we will printk a warning message if not). However, on an embedded system, this may not be the case. And so with this new interface, we can provide the functionality of blocking until the urandom pool has been initialized. Any userspace program which uses this new functionality must take care to assure that if it is used during the boot process, that it will not cause the init scripts or other portions of the system startup to hang indefinitely. SYNOPSIS #include <linux/random.h> int getrandom(void *buf, size_t buflen, unsigned int flags); DESCRIPTION The system call getrandom() fills the buffer pointed to by buf with up to buflen random bytes which can be used to seed user space random number generators (i.e., DRBG's) or for other cryptographic uses. It should not be used for Monte Carlo simulations or other programs/algorithms which are doing probabilistic sampling. If the GRND_RANDOM flags bit is set, then draw from the /dev/random pool instead of the /dev/urandom pool. The /dev/random pool is limited based on the entropy that can be obtained from environmental noise, so if there is insufficient entropy, the requested number of bytes may not be returned. If there is no entropy available at all, getrandom(2) will either block, or return an error with errno set to EAGAIN if the GRND_NONBLOCK bit is set in flags. If the GRND_RANDOM bit is not set, then the /dev/urandom pool will be used. Unlike using read(2) to fetch data from /dev/urandom, if the urandom pool has not been sufficiently initialized, getrandom(2) will block (or return -1 with the errno set to EAGAIN if the GRND_NONBLOCK bit is set in flags). The getentropy(2) system call in OpenBSD can be emulated using the following function: int getentropy(void *buf, size_t buflen) { int ret; if (buflen > 256) goto failure; ret = getrandom(buf, buflen, 0); if (ret < 0) return ret; if (ret == buflen) return 0; failure: errno = EIO; return -1; } RETURN VALUE On success, the number of bytes that was filled in the buf is returned. This may not be all the bytes requested by the caller via buflen if insufficient entropy was present in the /dev/random pool, or if the system call was interrupted by a signal. On error, -1 is returned, and errno is set appropriately. ERRORS EINVAL An invalid flag was passed to getrandom(2) EFAULT buf is outside the accessible address space. EAGAIN The requested entropy was not available, and getentropy(2) would have blocked if the GRND_NONBLOCK flag was not set. EINTR While blocked waiting for entropy, the call was interrupted by a signal handler; see the description of how interrupted read(2) calls on "slow" devices are handled with and without the SA_RESTART flag in the signal(7) man page. NOTES For small requests (buflen <= 256) getrandom(2) will not return EINTR when reading from the urandom pool once the entropy pool has been initialized, and it will return all of the bytes that have been requested. This is the recommended way to use getrandom(2), and is designed for compatibility with OpenBSD's getentropy() system call. However, if you are using GRND_RANDOM, then getrandom(2) may block until the entropy accounting determines that sufficient environmental noise has been gathered such that getrandom(2) will be operating as a NRBG instead of a DRBG for those people who are working in the NIST SP 800-90 regime. Since it may block for a long time, these guarantees do *not* apply. The user may want to interrupt a hanging process using a signal, so blocking until all of the requested bytes are returned would be unfriendly. For this reason, the user of getrandom(2) MUST always check the return value, in case it returns some error, or if fewer bytes than requested was returned. In the case of !GRND_RANDOM and small request, the latter should never happen, but the careful userspace code (and all crypto code should be careful) should check for this anyway! Finally, unless you are doing long-term key generation (and perhaps not even then), you probably shouldn't be using GRND_RANDOM. The cryptographic algorithms used for /dev/urandom are quite conservative, and so should be sufficient for all purposes. The disadvantage of GRND_RANDOM is that it can block, and the increased complexity required to deal with partially fulfilled getrandom(2) requests. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Zach Brown <zab@zabbo.net> Bug: http://b/29621447 Change-Id: I189ba74070dd6d918b0fdf83ff30bb74ec0f7556 (cherry picked from commit 4af712e8df998475736f3e2727701bd31e3751a9)
* lsm: split the xfrm_state_alloc_security() hook implementationPaul Moore2019-07-061-8/+18
| | | | | | | | | | | | | | | | | | The xfrm_state_alloc_security() LSM hook implementation is really a multiplexed hook with two different behaviors depending on the arguments passed to it by the caller. This patch splits the LSM hook implementation into two new hook implementations, which match the LSM hooks in the rest of the kernel: * xfrm_state_alloc * xfrm_state_alloc_acquire Also included in this patch are the necessary changes to the SELinux code; no other LSMs are affected. Change-Id: I455e3b62cc439127b735aa0b0a5183b98919255e Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
* xattr: Constify ->name member of "struct xattr".Tetsuo Handa2019-07-063-6/+6
| | | | | | | | | | | | | | | | | Since everybody sets kstrdup()ed constant string to "struct xattr"->name but nobody modifies "struct xattr"->name , we can omit kstrdup() and its failure checking by constifying ->name member of "struct xattr". Change-Id: I84a47af13e3c77b394218cc12ac8901d87b0fd69 Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Joel Becker <jlbec@evilplan.org> [ocfs2] Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Reviewed-by: Paul Moore <paul@paul-moore.com> Tested-by: Paul Moore <paul@paul-moore.com> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
* Provide a function to create a NUL-terminated string from unterminated dataDavid Howells2019-07-061-0/+1
| | | | | | | | | | | | | | | commit f35157417215ec138c920320c746fdb3e04ef1d5 upstream. Provide a function, kmemdup_nul(), that will create a NUL-terminated string from an unterminated character array where the length is known in advance. This is better than kstrndup() in situations where we already know the string length as the strnlen() in kstrndup() is superfluous. Change-Id: I52f8594090d324c4aa9530c0ad9d287ac43ac0fc Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* netfilter: x_tables: add and use xt_check_proc_nameFlorian Westphal2019-05-031-0/+2
| | | | | | | | | | | | | | | | | | | | | commit b1d0a5d0cba4597c0394997b2d5fced3e3841b4e upstream. recent and hashlimit both create /proc files, but only check that name is 0 terminated. This can trigger WARN() from procfs when name is "" or "/". Add helper for this and then use it for both. Change-Id: I6772510c4de2697a546204cb0d11df406a17e2a1 Cc: Eric Dumazet <eric.dumazet@gmail.com> Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Reported-by: <syzbot+0502b00edac2a0680b61@syzkaller.appspotmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> [bwh: Backported to 3.16: - xt_hashlimit has only one check function - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* netfilter: ipv6: nf_defrag: reduce struct net memory wasteEric Dumazet2019-05-032-1/+1
| | | | | | | | | | | | | | | | | | | | commit 9ce7bc036ae4cfe3393232c86e9e1fea2153c237 upstream. It is a waste of memory to use a full "struct netns_sysctl_ipv6" while only one pointer is really used, considering netns_sysctl_ipv6 keeps growing. Also, since "struct netns_frags" has cache line alignment, it is better to move the frags_hdr pointer outside, otherwise we spend a full cache line for this pointer. This saves 192 bytes of memory per netns. Fixes: c038a767cd69 ("ipv6: add a new namespace for nf_conntrack_reasm") Change-Id: Id8c8bde53be35e8a927eb26db74b607739ac952b Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* soreuseport: initialise timewait reuseport fieldEric Dumazet2019-05-031-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3099a52918937ab86ec47038ad80d377ba16c531 upstream. syzbot reported an uninit-value in inet_csk_bind_conflict() [1] It turns out we never propagated sk->sk_reuseport into timewait socket. [1] BUG: KMSAN: uninit-value in inet_csk_bind_conflict+0x5f9/0x990 net/ipv4/inet_connection_sock.c:151 CPU: 1 PID: 3589 Comm: syzkaller008242 Not tainted 4.16.0+ #82 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:53 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676 inet_csk_bind_conflict+0x5f9/0x990 net/ipv4/inet_connection_sock.c:151 inet_csk_get_port+0x1d28/0x1e40 net/ipv4/inet_connection_sock.c:320 inet6_bind+0x121c/0x1820 net/ipv6/af_inet6.c:399 SYSC_bind+0x3f2/0x4b0 net/socket.c:1474 SyS_bind+0x54/0x80 net/socket.c:1460 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x4416e9 RSP: 002b:00007ffce6d15c88 EFLAGS: 00000217 ORIG_RAX: 0000000000000031 RAX: ffffffffffffffda RBX: 0100000000000000 RCX: 00000000004416e9 RDX: 000000000000001c RSI: 0000000020402000 RDI: 0000000000000004 RBP: 0000000000000000 R08: 00000000e6d15e08 R09: 00000000e6d15e08 R10: 0000000000000004 R11: 0000000000000217 R12: 0000000000009478 R13: 00000000006cd448 R14: 0000000000000000 R15: 0000000000000000 Uninit was stored to memory at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline] kmsan_save_stack mm/kmsan/kmsan.c:293 [inline] kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684 __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521 tcp_time_wait+0xf17/0xf50 net/ipv4/tcp_minisocks.c:283 tcp_rcv_state_process+0xebe/0x6490 net/ipv4/tcp_input.c:6003 tcp_v6_do_rcv+0x11dd/0x1d90 net/ipv6/tcp_ipv6.c:1331 sk_backlog_rcv include/net/sock.h:908 [inline] __release_sock+0x2d6/0x680 net/core/sock.c:2271 release_sock+0x97/0x2a0 net/core/sock.c:2786 tcp_close+0x277/0x18f0 net/ipv4/tcp.c:2269 inet_release+0x240/0x2a0 net/ipv4/af_inet.c:427 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:435 sock_release net/socket.c:595 [inline] sock_close+0xe0/0x300 net/socket.c:1149 __fput+0x49e/0xa10 fs/file_table.c:209 ____fput+0x37/0x40 fs/file_table.c:243 task_work_run+0x243/0x2c0 kernel/task_work.c:113 exit_task_work include/linux/task_work.h:22 [inline] do_exit+0x10e1/0x38d0 kernel/exit.c:867 do_group_exit+0x1a0/0x360 kernel/exit.c:970 SYSC_exit_group+0x21/0x30 kernel/exit.c:981 SyS_exit_group+0x25/0x30 kernel/exit.c:979 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Uninit was stored to memory at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline] kmsan_save_stack mm/kmsan/kmsan.c:293 [inline] kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684 __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521 inet_twsk_alloc+0xaef/0xc00 net/ipv4/inet_timewait_sock.c:182 tcp_time_wait+0xd9/0xf50 net/ipv4/tcp_minisocks.c:258 tcp_rcv_state_process+0xebe/0x6490 net/ipv4/tcp_input.c:6003 tcp_v6_do_rcv+0x11dd/0x1d90 net/ipv6/tcp_ipv6.c:1331 sk_backlog_rcv include/net/sock.h:908 [inline] __release_sock+0x2d6/0x680 net/core/sock.c:2271 release_sock+0x97/0x2a0 net/core/sock.c:2786 tcp_close+0x277/0x18f0 net/ipv4/tcp.c:2269 inet_release+0x240/0x2a0 net/ipv4/af_inet.c:427 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:435 sock_release net/socket.c:595 [inline] sock_close+0xe0/0x300 net/socket.c:1149 __fput+0x49e/0xa10 fs/file_table.c:209 ____fput+0x37/0x40 fs/file_table.c:243 task_work_run+0x243/0x2c0 kernel/task_work.c:113 exit_task_work include/linux/task_work.h:22 [inline] do_exit+0x10e1/0x38d0 kernel/exit.c:867 do_group_exit+0x1a0/0x360 kernel/exit.c:970 SYSC_exit_group+0x21/0x30 kernel/exit.c:981 SyS_exit_group+0x25/0x30 kernel/exit.c:979 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline] kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314 kmem_cache_alloc+0xaab/0xb90 mm/slub.c:2756 inet_twsk_alloc+0x13b/0xc00 net/ipv4/inet_timewait_sock.c:163 tcp_time_wait+0xd9/0xf50 net/ipv4/tcp_minisocks.c:258 tcp_rcv_state_process+0xebe/0x6490 net/ipv4/tcp_input.c:6003 tcp_v6_do_rcv+0x11dd/0x1d90 net/ipv6/tcp_ipv6.c:1331 sk_backlog_rcv include/net/sock.h:908 [inline] __release_sock+0x2d6/0x680 net/core/sock.c:2271 release_sock+0x97/0x2a0 net/core/sock.c:2786 tcp_close+0x277/0x18f0 net/ipv4/tcp.c:2269 inet_release+0x240/0x2a0 net/ipv4/af_inet.c:427 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:435 sock_release net/socket.c:595 [inline] sock_close+0xe0/0x300 net/socket.c:1149 __fput+0x49e/0xa10 fs/file_table.c:209 ____fput+0x37/0x40 fs/file_table.c:243 task_work_run+0x243/0x2c0 kernel/task_work.c:113 exit_task_work include/linux/task_work.h:22 [inline] do_exit+0x10e1/0x38d0 kernel/exit.c:867 do_group_exit+0x1a0/0x360 kernel/exit.c:970 SYSC_exit_group+0x21/0x30 kernel/exit.c:981 SyS_exit_group+0x25/0x30 kernel/exit.c:979 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Fixes: da5e36308d9f ("soreuseport: TCP/IPv4 implementation") Change-Id: I272409adab361e9281a887e52863babd21b3d89d Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* driver core: dev_set_drvdata returns voidJean Delvare2019-05-031-1/+1
| | | | | | | | | | | | | dev_set_drvdata can no longer fail, so it could return void. All callers have hopefully been updated to no longer check for the return value. [apq8084: fix callers still expecting a return value] Change-Id: I5e8995b0783727139901dcc7a810e8d0d39a0eed Signed-off-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm: add zone counter for cma pagesVinayak Menon2019-05-032-2/+3
| | | | | | | | | | | | Add per free area nr_free_cma counter. The idea is to also track the number of cma pages present in free pages. This will be used in later patches to fix issues with zone_watermark_ok. Change-Id: I97da9d2f3642db56fc541c48ab56a7ce78e2333c Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Prakash Gupta <guptap@codeaurora.org> (cherry picked from commit a147305588507b1a241af87f1006c5d0b30beade)
* mm: use a dedicated lock to protect totalram_pages and zone->managed_pagesJiang Liu2019-05-022-8/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently lock_memory_hotplug()/unlock_memory_hotplug() are used to protect totalram_pages and zone->managed_pages. Other than the memory hotplug driver, totalram_pages and zone->managed_pages may also be modified at runtime by other drivers, such as Xen balloon, virtio_balloon etc. For those cases, memory hotplug lock is a little too heavy, so introduce a dedicated lock to protect totalram_pages and zone->managed_pages. Now we have a simplified locking rules totalram_pages and zone->managed_pages as: 1) no locking for read accesses because they are unsigned long. 2) no locking for write accesses at boot time in single-threaded context. 3) serialize write accesses at runtime by acquiring the dedicated managed_page_count_lock. Also adjust zone->managed_pages when freeing reserved pages into the buddy system, to keep totalram_pages and zone->managed_pages in consistence. [akpm@linux-foundation.org: don't export adjust_managed_page_count to modules (for now)] Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: <sworddragon2@aol.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: c3d5f5f0c2bc4eabeaf49f1a21e1aeb965246cd2 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [imaund@codeaurora.org: resolve merge conflcits] Signed-off-by: Ian Maund <imaund@codeaurora.org> Change-Id: I4ab46ac402c57b079f1680da1c8c119663060a72 (cherry picked from commit e238160acbf97306e1283ec921c6db5c54af359b)
* mm/vmalloc.c: rename VM_UNLIST to VM_UNINITIALIZEDZhang Yanfei2019-05-021-7/+7
| | | | | | | | | | | | | | | | | | VM_UNLIST was used to indicate that the vm_struct is not listed in vmlist. But after commit 4341fa454796 ("mm, vmalloc: remove list management of vmlist after initializing vmalloc"), the meaning of this flag changed. It now means the vm_struct is not fully initialized. So renaming it to VM_UNINITIALIZED seems more reasonable. Also change clear_vm_unlist to clear_vm_uninitialized_flag. Change-Id: Icad57fdbbe6cf057b35102a23ed5bd90d1658b3b Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit b5de696d12c90a617476dcdfa18dbb13acf2558f)
* mm: Update is_vmalloc_addr to account for vmalloc savingsLaura Abbott2019-05-022-6/+7
| | | | | | | | | | is_vmalloc_addr current assumes that all vmalloc addresses exist between VMALLOC_START and VMALLOC_END. This may not be the case when interleaving vmalloc and lowmem. Update the is_vmalloc_addr to properly check for this. Change-Id: I5def3d6ae1a4de59ea36f095b8c73649a37b1f36 Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
* vmalloc: introduce remap_vmalloc_range_partialHATAYAMA Daisuke2019-05-021-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We want to allocate ELF note segment buffer on the 2nd kernel in vmalloc space and remap it to user-space in order to reduce the risk that memory allocation fails on system with huge number of CPUs and so with huge ELF note segment that exceeds 11-order block size. Although there's already remap_vmalloc_range for the purpose of remapping vmalloc memory to user-space, we need to specify user-space range via vma. Mmap on /proc/vmcore needs to remap range across multiple objects, so the interface that requires vma to cover full range is problematic. This patch introduces remap_vmalloc_range_partial that receives user-space range as a pair of base address and size and can be used for mmap on /proc/vmcore case. remap_vmalloc_range is rewritten using remap_vmalloc_range_partial. [akpm@linux-foundation.org: use PAGE_ALIGNED()] Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Lisa Mitchell <lisa.mitchell@hp.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: If11092672ab09b7d75d905f5e7276f2bee2415cb (cherry picked from commit 5c27777e9c15793fe55422bb475daf1e4eb5ddb6)
* oom: add helpers for setting and clearing TIF_MEMDIEMichal Hocko2019-05-021-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patchset addresses a race which was described in the changelog for 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM suspend"): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Change-Id: If4188f68cdfb73f8cf2721f89c9739ec0a8f0e12 Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: don't assume that a coredumping thread will exit soonOleg Nesterov2019-05-021-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | oom_kill.c assumes that PF_EXITING task should exit and free the memory soon. This is wrong in many ways and one important case is the coredump. A task can sleep in exit_mm() "forever" while the coredumping sub-thread can need more memory. Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account, we add the new trivial helper for that. Note: this is only the first step, this patch doesn't try to solve other problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can participate in coredump after it was already observed in PF_EXITING state, so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set. fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it. And even the name/usage of the new helper is confusing, an exiting thread can only free its ->mm if it is the only/last task in thread group. [akpm@linux-foundation.org: add comment] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: I6a69bf26a477f31e733ed6911da4f299d49cdfe1
* mm, oom: rename zonelist locking functionsDavid Rientjes2019-05-021-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | try_set_zonelist_oom() and clear_zonelist_oom() are not named properly to imply that they require locking semantics to avoid out_of_memory() being reordered. zone_scan_lock is required for both functions to ensure that there is proper locking synchronization. Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper locking semantics. At the same time, convert oom_zonelist_trylock() to return bool instead of int since only success and failure are tested. Change-Id: Ide8f1efa2763212ffb2797979582c43251c89697 Signed-off-by: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, oom: ensure memoryless node zonelist always includes zonesDavid Rientjes2019-05-021-1/+10
| | | | | | | | | | | | | | | | | | | | | | With memoryless node support being worked on, it's possible that for optimizations that a node may not have a non-NULL zonelist. When CONFIG_NUMA is enabled and node 0 is memoryless, this means the zonelist for first_online_node may become NULL. The oom killer requires a zonelist that includes all memory zones for the sysrq trigger and pagefault out of memory handler. Ensure that a non-NULL zonelist is always passed to the oom killer. [akpm@linux-foundation.org: fix non-numa build] Signed-off-by: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: Ic93a32abbf875a331244a0f95f2a0bcd0764360b
* PM / sleep: Mechanism for aborting system suspends unconditionallyRafael J. Wysocki2019-05-021-0/+4
| | | | | | | | | | | | | It sometimes may be necessary to abort a system suspend in progress or wake up the system from suspend-to-idle even if the pm_wakeup_event()/pm_stay_awake() mechanism is not enabled. For this purpose, introduce a new global variable pm_abort_suspend and make pm_wakeup_pending() check its value. Also add routines for manipulating that variable. Change-Id: I694bee866a1b9e85a724f3efec09326d47a4ef6e Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* exit: ptrace: shift "reap dead" code from exit_ptrace() to ↵Oleg Nesterov2019-05-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | forget_original_parent() Now that forget_original_parent() uses ->ptrace_entry for EXIT_DEAD tasks, we can simply pass "dead_children" list to exit_ptrace() and remove another release_task() loop. Plus this way we do not need to drop and reacquire tasklist_lock. Also shift the list_empty(ptraced) check, if we want this optimization it makes sense to eliminate the function call altogether. Change-Id: I07b22f882fe7d5ec0bdc4f12d416992f96567610 Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com>, Cc: Sterling Alexander <stalexan@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roland McGrath <roland@hack.frob.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sched, exit: Deal with nested sleepsPeter Zijlstra2019-05-021-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | do_wait() is a big wait loop, but we set TASK_RUNNING too late; we end up calling potential sleeps before we reset it. Not strictly a bug since we're guaranteed to exit the loop and not call schedule(); put in annotations to quiet might_sleep(). WARNING: CPU: 0 PID: 1 at ../kernel/sched/core.c:7123 __might_sleep+0x7e/0x90() do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8109a788>] do_wait+0x88/0x270 Call Trace: [<ffffffff81694991>] dump_stack+0x4e/0x7a [<ffffffff8109877c>] warn_slowpath_common+0x8c/0xc0 [<ffffffff8109886c>] warn_slowpath_fmt+0x4c/0x50 [<ffffffff810bca6e>] __might_sleep+0x7e/0x90 [<ffffffff811a1c15>] might_fault+0x55/0xb0 [<ffffffff8109a3fb>] wait_consider_task+0x90b/0xc10 [<ffffffff8109a804>] do_wait+0x104/0x270 [<ffffffff8109b837>] SyS_wait4+0x77/0x100 [<ffffffff8169d692>] system_call_fastpath+0x16/0x1b Change-Id: I6d0a7e2ac676fe95e27068518d1fe69108864fe9 Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: tglx@linutronix.de Cc: umgwanakikbuti@gmail.com Cc: ilya.dryomov@inktank.com Cc: Alex Elder <alex.elder@linaro.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Axel Lin <axel.lin@ingics.com> Cc: Daniel Borkmann <dborkman@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Guillaume Morin <guillaume@morinfr.org> Cc: Ionut Alexa <ionut.m.alexa@gmail.com> Cc: Jason Baron <jbaron@akamai.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Michal Schmidt <mschmidt@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20140924082242.186408915@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* signals: mv {dis,}allow_signal() from sched.h/exit.c to signal.[ch]Oleg Nesterov2019-05-022-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | Move the declaration/definition of allow_signal/disallow_signal to signal.h/signal.c. The new place is more logical and allows to use the static helpers in signal.c (see the next changes). While at it, make them return void and remove the valid_signal() check. Nobody checks the returned value, and in-kernel users must not pass the wrong signal number. Change-Id: I75a6d15eaa4a6c01175823f8d07c356869da9db2 Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Weinberger <richard@nod.at> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* exit.c: unexport __set_special_pids()Oleg Nesterov2019-05-021-2/+0
| | | | | | | | | | | | | | Move __set_special_pids() from exit.c to sys.c close to its single caller and make it static. And rename it to set_special_pids(), another helper with this name has gone away. Change-Id: I0095999c845fabe07cdb3854c5ee1866220e3198 Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* wait: introduce EXIT_TRACE to avoid the racy EXIT_DEAD->EXIT_ZOMBIE transitionOleg Nesterov2019-05-021-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and drops tasklist_lock. If this task is not the natural child and it is traced, we change its state back to EXIT_ZOMBIE for ->real_parent. The last transition is racy, this is even documented in 50b8d257486a "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE race". wait_consider_task() tries to detect this transition and clear ->notask_error but we can't rely on ptrace_reparented(), debugger can exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE. And there is another problem which were missed before: this transition can also race with reparent_leader() which doesn't reset >exit_signal if EXIT_DEAD, assuming that this task must be reaped by someone else. So the tracee can be re-parented with ->exit_signal != SIGCHLD, and if /sbin/init doesn't use __WALL it becomes unreapable. This was fixed by the previous commit, but it was the temporary hack. 1. Add the new exit_state, EXIT_TRACE. It means that the task is the traced zombie, debugger is going to detach and notify its natural parent. This new state is actually EXIT_ZOMBIE | EXIT_DEAD. This way we can avoid the changes in proc/kgdb code, get_task_state() still reports "X (dead)" in this case. Note: with or without this change userspace can see Z -> X -> Z transition. Not really bad, but probably makes sense to fix. 2. Change wait_task_zombie() to use EXIT_TRACE instead of EXIT_DEAD if we need to notify the ->real_parent. 3. Revert the previous hack in reparent_leader(), now that EXIT_DEAD is always the final state we can safely ignore such a task. 4. Change wait_consider_task() to check EXIT_TRACE separately and kill the racy and no longer needed ptrace_reparented() case. If ptrace == T an EXIT_TRACE thread should be simply ignored, the owner of this state is going to ptrace_unlink() this task. We can pretend that it was already removed from ->ptraced list. Otherwise we should skip this thread too but clear ->notask_error, we must be the natural parent and debugger is going to untrace and notify us. IOW, this doesn't differ from "EXIT_ZOMBIE && p->ptrace" even if the task was already untraced. Change-Id: I972c5bc91a93901ef836bf4f6a53af06f6a0a1e9 Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com> Reported-by: Michal Schmidt <mschmidt@redhat.com> Tested-by: Michal Schmidt <mschmidt@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Lennart Poettering <lpoetter@redhat.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vfs: delete vfs_readdir function declarationZhang Zhen2019-05-021-1/+0
| | | | | | | | | | | vfs_readdir() was replaced by iterate_dir() in commit 5c0ba4e0762e ("[readdir] introduce iterate_dir() and dir_context"). Change-Id: I0a04fe567a55afabb19a75dc944b4ef62c6cadb4 Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* freezer: set PF_SUSPEND_TASK flag on tasks that call freeze_processesMoyster2019-05-021-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Calling freeze_processes sets a global flag that will cause any process that calls try_to_freeze to enter the refrigerator. It skips sending a signal to the current task, but if the current task ever hits try_to_freeze, all threads will be frozen and the system will deadlock. Set a new flag, PF_SUSPEND_TASK, on the task that calls freeze_processes. The flag notifies the freezer that the thread is involved in suspend and should not be frozen. Also add a WARN_ON in thaw_processes if the caller does not have the PF_SUSPEND_TASK flag set to catch if a different task calls thaw_processes than the one that called freeze_processes, leaving a task with PF_SUSPEND_TASK permanently set on it. Threads that spawn off a task with PF_SUSPEND_TASK set (which swsusp does) will also have PF_SUSPEND_TASK set, preventing them from freezing while they are helping with suspend, but they need to be dead by the time suspend is triggered, otherwise they may run when userspace is expected to be frozen. Add a WARN_ON in thaw_processes if more than one thread has the PF_SUSPEND_TASK flag set. Change-Id: I621e70c26203a2470b3e60f3f3f5991a00762e09 Reported-and-tested-by: Michael Leun <lkml20130126@newton.leun.net> Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Moyster <oysterized@gmail.com>
* clean up process flagsCorinna Vinschen2019-05-021-0/+5
| | | | | | | | | | * Move PF_WAKE_UP_IDLE to 0x00000002 to make room for PF_SUSPEND_TASK * Drop PF_SU in favor of a bit 'task_is_su' in the task_struct bitfield which has still lots of room without changing the struct size. Change-Id: I2af053ebcbb3c41b7407560008da8150a73c8c05 Signed-off-by: Corinna Vinschen <xda@vinschen.de> Signed-off-by: Moyster <oysterized@gmail.com>
* memcg: kill CONFIG_MM_OWNEROleg Nesterov2019-05-022-3/+3
| | | | | | | | | | | | | CONFIG_MM_OWNER makes no sense. It is not user-selectable, it is only selected by CONFIG_MEMCG automatically. So we can kill this option in init/Kconfig and do s/CONFIG_MM_OWNER/CONFIG_MEMCG/ globally. Change-Id: I07980d9557cef16a102ed293bc4a8ad1f9302777 Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sched/cpuset/pm: Fix cpuset vs. suspend-resume bugsPeter Zijlstra2019-05-021-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 50e76632339d4655859523a39249dd95ee5e93e7 upstream. Cpusets vs. suspend-resume is _completely_ broken. And it got noticed because it now resulted in non-cpuset usage breaking too. On suspend cpuset_cpu_inactive() doesn't call into cpuset_update_active_cpus() because it doesn't want to move tasks about, there is no need, all tasks are frozen and won't run again until after we've resumed everything. But this means that when we finally do call into cpuset_update_active_cpus() after resuming the last frozen cpu in cpuset_cpu_active(), the top_cpuset will not have any difference with the cpu_active_mask and this it will not in fact do _anything_. So the cpuset configuration will not be restored. This was largely hidden because we would unconditionally create identity domains and mobile users would not in fact use cpusets much. And servers what do use cpusets tend to not suspend-resume much. An addition problem is that we'd not in fact wait for the cpuset work to finish before resuming the tasks, allowing spurious migrations outside of the specified domains. Fix the rebuild by introducing cpuset_force_rebuild() and fix the ordering with cpuset_wait_for_hotplug(). Reported-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: deb7aa308ea2 ("cpuset: reorganize CPU / memory hotplug handling") Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: Ia40ffcf49507af1d5493d7e534e9433ca346db02
* nospec: Include <asm/barrier.h> dependencyDan Williams2019-05-021-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The nospec.h header expects the per-architecture header file <asm/barrier.h> to optionally define array_index_mask_nospec(). Include that dependency to prevent inadvertent fallback to the default array_index_mask_nospec() implementation. The default implementation may not provide a full mitigation on architectures that perform data value speculation. Change-Id: Ib3bb49179541ca4f187f7ec6b91603a58db358f7 Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arch@vger.kernel.org Link: http://lkml.kernel.org/r/151881605404.17395.1341935530792574707.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* nospec: Allow index argument to have const-qualified typeRasmus Villemoes2019-05-021-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The last expression in a statement expression need not be a bare variable, quoting gcc docs The last thing in the compound statement should be an expression followed by a semicolon; the value of this subexpression serves as the value of the entire construct. and we already use that in e.g. the min/max macros which end with a ternary expression. This way, we can allow index to have const-qualified type, which will in some cases avoid the need for introducing a local copy of index of non-const qualified type. That, in turn, can prevent readers not familiar with the internals of array_index_nospec from wondering about the seemingly redundant extra variable, and I think that's worthwhile considering how confusing the whole _nospec business is. The expression _i&_mask has type unsigned long (since that is the type of _mask, and the BUILD_BUG_ONs guarantee that _i will get promoted to that), so in order not to change the type of the whole expression, add a cast back to typeof(_i). Change-Id: I1fcaea6268a40e10c67d58e906d757b5bce7e8d0 Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arch@vger.kernel.org Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/151881604837.17395.10812767547837568328.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* nospec: Kill array_index_nospec_mask_check()Dan Williams2019-05-021-21/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are multiple problems with the dynamic sanity checking in array_index_nospec_mask_check(): * It causes unnecessary overhead in the 32-bit case since integer sized @index values will no longer cause the check to be compiled away like in the 64-bit case. * In the 32-bit case it may trigger with user controllable input when the expectation is that should only trigger during development of new kernel enabling. * The macro reuses the input parameter in multiple locations which is broken if someone passes an expression like 'index++' to array_index_nospec(). Change-Id: I845dbcb830b4d72cf7f0b835bfa64476d3763268 Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arch@vger.kernel.org Link: http://lkml.kernel.org/r/151881604278.17395.6605847763178076520.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* nospec: Move array_index_nospec() parameter checking into separate macroWill Deacon2019-05-021-15/+21
| | | | | | | | | | | | | | | | | | | For architectures providing their own implementation of array_index_mask_nospec() in asm/barrier.h, attempting to use WARN_ONCE() to complain about out-of-range parameters using WARN_ON() results in a mess of mutually-dependent include files. Rather than unpick the dependencies, simply have the core code in nospec.h perform the checking for us. Change-Id: I36857abc8226fcdfd9c9014e3ffee17f9dc2cbca Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1517840166-15399-1-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* array_index_nospec: Sanitize speculative array de-referencesDan Williams2019-05-021-0/+72
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | array_index_nospec() is proposed as a generic mechanism to mitigate against Spectre-variant-1 attacks, i.e. an attack that bypasses boundary checks via speculative execution. The array_index_nospec() implementation is expected to be safe for current generation CPUs across multiple architectures (ARM, x86). Based on an original implementation by Linus Torvalds, tweaked to remove speculative flows by Alexei Starovoitov, and tweaked again by Linus to introduce an x86 assembly implementation for the mask generation. Change-Id: Id28afe5e117369ab52a9400c2cb7d639630f3a08 Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org> Co-developed-by: Alexei Starovoitov <ast@kernel.org> Suggested-by: Cyril Novikov <cnovikov@lynx.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: kernel-hardening@lists.openwall.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: gregkh@linuxfoundation.org Cc: torvalds@linux-foundation.org Cc: alan@linux.intel.com Link: https://lkml.kernel.org/r/151727414229.33451.18411580953862676575.stgit@dwillia2-desk3.amr.corp.intel.com
* Make file credentials available to the seqfile interfacesLinus Torvalds2019-05-021-9/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 34dbbcdbf63360661ff7bda6c5f52f99ac515f92 upstream. A lot of seqfile users seem to be using things like %pK that uses the credentials of the current process, but that is actually completely wrong for filesystem interfaces. The unix semantics for permission checking files is to check permissions at _open_ time, not at read or write time, and that is not just a small detail: passing off stdin/stdout/stderr to a suid application and making the actual IO happen in privileged context is a classic exploit technique. So if we want to be able to look at permissions at read time, we need to use the file open credentials, not the current ones. Normal file accesses can just use "f_cred" (or any of the helper functions that do that, like file_ns_capable()), but the seqfile interfaces do not have any such options. It turns out that seq_file _does_ save away the user_ns information of the file, though. Since user_ns is just part of the full credential information, replace that special case with saving off the cred pointer instead, and suddenly seq_file has all the permission information it needs. Change-Id: Ic6fc339d315f6aa5774f7f5356747aebf47cb45f Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jann Horn <jannh@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* allow O_TMPFILE to work with O_WRONLYAl Viro2019-05-021-2/+2
| | | | | Change-Id: I90171d1b53a4c35bfa76757ecfdfb6f95330d107 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Safer ABI for O_TMPFILEAl Viro2019-05-021-2/+6
| | | | | | | | | [suggested by Rasmus Villemoes] make O_DIRECTORY | O_RDWR part of O_TMPFILE; that will fail on old kernels in a lot more cases than what I came up with. And make sure O_CREAT doesn't get there... Change-Id: Iaa3c8b487d44515b539150bdb5d0b749b87d3ea2 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* allow the temp files created by open() to be linked toAl Viro2019-05-021-0/+1
| | | | | | | | | O_TMPFILE | O_CREAT => linkat() with AT_SYMLINK_FOLLOW and /proc/self/fd/<n> as oldpath (i.e. flink()) will create a link O_TMPFILE | O_CREAT | O_EXCL => ENOENT on attempt to link those guys Change-Id: I5e28485680c3320cd0fccc0ba1bea8b963fca7fe Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* it's still short a few helpers, but infrastructure should be OK now...Al Viro2019-05-023-0/+7
| | | | | Change-Id: I0adb8fe9c5029bad3ac52629003c3b78e9442936 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>