aboutsummaryrefslogtreecommitdiff
path: root/arch/x86
Commit message (Collapse)AuthorAgeFilesLines
* partial merge fix - BACKPORT: random: introduce getrandom(2) system callTheodore Ts'o2019-07-072-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Almost clean cherry pick of c6e9d6f38894798696f23c8084ca7edbf16ee895, includes change made by merge 0891ad829d2a0501053703df66029e843e3b8365. The getrandom(2) system call was requested by the LibreSSL Portable developers. It is analoguous to the getentropy(2) system call in OpenBSD. The rationale of this system call is to provide resiliance against file descriptor exhaustion attacks, where the attacker consumes all available file descriptors, forcing the use of the fallback code where /dev/[u]random is not available. Since the fallback code is often not well-tested, it is better to eliminate this potential failure mode entirely. The other feature provided by this new system call is the ability to request randomness from the /dev/urandom entropy pool, but to block until at least 128 bits of entropy has been accumulated in the /dev/urandom entropy pool. Historically, the emphasis in the /dev/urandom development has been to ensure that urandom pool is initialized as quickly as possible after system boot, and preferably before the init scripts start execution. This is because changing /dev/urandom reads to block represents an interface change that could potentially break userspace which is not acceptable. In practice, on most x86 desktop and server systems, in general the entropy pool can be initialized before it is needed (and in modern kernels, we will printk a warning message if not). However, on an embedded system, this may not be the case. And so with this new interface, we can provide the functionality of blocking until the urandom pool has been initialized. Any userspace program which uses this new functionality must take care to assure that if it is used during the boot process, that it will not cause the init scripts or other portions of the system startup to hang indefinitely. SYNOPSIS #include <linux/random.h> int getrandom(void *buf, size_t buflen, unsigned int flags); DESCRIPTION The system call getrandom() fills the buffer pointed to by buf with up to buflen random bytes which can be used to seed user space random number generators (i.e., DRBG's) or for other cryptographic uses. It should not be used for Monte Carlo simulations or other programs/algorithms which are doing probabilistic sampling. If the GRND_RANDOM flags bit is set, then draw from the /dev/random pool instead of the /dev/urandom pool. The /dev/random pool is limited based on the entropy that can be obtained from environmental noise, so if there is insufficient entropy, the requested number of bytes may not be returned. If there is no entropy available at all, getrandom(2) will either block, or return an error with errno set to EAGAIN if the GRND_NONBLOCK bit is set in flags. If the GRND_RANDOM bit is not set, then the /dev/urandom pool will be used. Unlike using read(2) to fetch data from /dev/urandom, if the urandom pool has not been sufficiently initialized, getrandom(2) will block (or return -1 with the errno set to EAGAIN if the GRND_NONBLOCK bit is set in flags). The getentropy(2) system call in OpenBSD can be emulated using the following function: int getentropy(void *buf, size_t buflen) { int ret; if (buflen > 256) goto failure; ret = getrandom(buf, buflen, 0); if (ret < 0) return ret; if (ret == buflen) return 0; failure: errno = EIO; return -1; } RETURN VALUE On success, the number of bytes that was filled in the buf is returned. This may not be all the bytes requested by the caller via buflen if insufficient entropy was present in the /dev/random pool, or if the system call was interrupted by a signal. On error, -1 is returned, and errno is set appropriately. ERRORS EINVAL An invalid flag was passed to getrandom(2) EFAULT buf is outside the accessible address space. EAGAIN The requested entropy was not available, and getentropy(2) would have blocked if the GRND_NONBLOCK flag was not set. EINTR While blocked waiting for entropy, the call was interrupted by a signal handler; see the description of how interrupted read(2) calls on "slow" devices are handled with and without the SA_RESTART flag in the signal(7) man page. NOTES For small requests (buflen <= 256) getrandom(2) will not return EINTR when reading from the urandom pool once the entropy pool has been initialized, and it will return all of the bytes that have been requested. This is the recommended way to use getrandom(2), and is designed for compatibility with OpenBSD's getentropy() system call. However, if you are using GRND_RANDOM, then getrandom(2) may block until the entropy accounting determines that sufficient environmental noise has been gathered such that getrandom(2) will be operating as a NRBG instead of a DRBG for those people who are working in the NIST SP 800-90 regime. Since it may block for a long time, these guarantees do *not* apply. The user may want to interrupt a hanging process using a signal, so blocking until all of the requested bytes are returned would be unfriendly. For this reason, the user of getrandom(2) MUST always check the return value, in case it returns some error, or if fewer bytes than requested was returned. In the case of !GRND_RANDOM and small request, the latter should never happen, but the careful userspace code (and all crypto code should be careful) should check for this anyway! Finally, unless you are doing long-term key generation (and perhaps not even then), you probably shouldn't be using GRND_RANDOM. The cryptographic algorithms used for /dev/urandom are quite conservative, and so should be sufficient for all purposes. The disadvantage of GRND_RANDOM is that it can block, and the increased complexity required to deal with partially fulfilled getrandom(2) requests. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Zach Brown <zab@zabbo.net> Bug: http://b/29621447 Change-Id: I189ba74070dd6d918b0fdf83ff30bb74ec0f7556 (cherry picked from commit 4af712e8df998475736f3e2727701bd31e3751a9)
* Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds2018-11-2935-35/+35
| | | | | | | | | | | | | | This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Moyster <oysterized@gmail.com>
* Finally eradicate CONFIG_HOTPLUGStephen Rothwell2017-12-221-1/+1
| | | | | | | | | | | | | | | | | | | Ever since commit 45f035ab9b8f ("CONFIG_HOTPLUG should be always on"), it has been basically impossible to build a kernel with CONFIG_HOTPLUG turned off. Remove all the remaining references to it. Cc: Russell King <linux@arm.linux.org.uk> Cc: Doug Thompson <dougthompson@xmission.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Acked-by: Hans Verkuil <hans.verkuil@cisco.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* seccomp,x86,arm,mips,s390: Remove nr parameter from secure_computingAndy Lutomirski2017-12-052-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The secure_computing function took a syscall number parameter, but it only paid any attention to that parameter if seccomp mode 1 was enabled. Rather than coming up with a kludge to get the parameter to work in mode 2, just remove the parameter. To avoid churn in arches that don't have seccomp filters (and may not even support syscall_get_nr right now), this leaves the parameter in secure_computing_strict, which is now a real function. For ARM, this is a bit ugly due to the fact that ARM conditionally supports seccomp filters. Fixing that would probably only be a couple of lines of code, but it should be coordinated with the audit maintainers. This will be a slight slowdown on some arches. The right fix is to pass in all of seccomp_data instead of trying to make just the syscall nr part be fast. This is a prerequisite for making two-phase seccomp work cleanly. Cc: Russell King <linux@arm.linux.org.uk> Cc: linux-arm-kernel@lists.infradead.org Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Cc: x86@kernel.org Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Kees Cook <keescook@chromium.org>
* x86/apic: fix build breakage caused by incomplete backport to 3.10Willy Tarreau2017-11-061-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | Commit 928a277 ("x86/apic: Do not init irq remapping if ioapic is disabled") introduced in 3.10.105 introduced an implicit dependency of CONFIG_X86_LOCAL_APIC to CONFIG_X86_IO_APIC which was later solved as part of simplifications on the config dependencies in more recent kernels. This dependency results in build failure when CONFIG_X86_LOCAL_APIC is set without CONFIG_X86_IO_APIC (this setup requires CONFIG_SMP=n). The reason is that skip_ioapic_setup is declared in apic.c and that the backported code was picked from a context where the #ifdef surrounding the function used to cover this condition. Let's just add the appropriate #ifdef to fix the 3.10 backport. Thanks to Christoph Biedl for reporting and diagnosing this one. Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Cc: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Cc: Jan Beulich <JBeulich@suse.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
* KVM: nVMX: fix guest CR4 loading when emulating L2 to L1 exitHaozhong Zhang2017-11-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8eb3f87d903168bdbd1222776a6b1e281f50513e upstream. When KVM emulates an exit from L2 to L1, it loads L1 CR4 into the guest CR4. Before this CR4 loading, the guest CR4 refers to L2 CR4. Because these two CR4's are in different levels of guest, we should vmx_set_cr4() rather than kvm_set_cr4() here. The latter, which is used to handle guest writes to its CR4, checks the guest change to CR4 and may fail if the change is invalid. The failure may cause trouble. Consider we start a L1 guest with non-zero L1 PCID in use, (i.e. L1 CR4.PCIDE == 1 && L1 CR3.PCID != 0) and a L2 guest with L2 PCID disabled, (i.e. L2 CR4.PCIDE == 0) and following events may happen: 1. If kvm_set_cr4() is used in load_vmcs12_host_state() to load L1 CR4 into guest CR4 (in VMCS01) for L2 to L1 exit, it will fail because of PCID check. As a result, the guest CR4 recorded in L0 KVM (i.e. vcpu->arch.cr4) is left to the value of L2 CR4. 2. Later, if L1 attempts to change its CR4, e.g., clearing VMXE bit, kvm_set_cr4() in L0 KVM will think L1 also wants to enable PCID, because the wrong L2 CR4 is used by L0 KVM as L1 CR4. As L1 CR3.PCID != 0, L0 KVM will inject GP to L1 guest. Fixes: 4704d0befb072 ("KVM: nVMX: Exiting from L2 to L1") Cc: qemu-stable@nongnu.org Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* KVM: x86: zero base3 of unusable segmentsRadim Krčmář2017-11-061-0/+2
| | | | | | | | | | | | | | | commit f0367ee1d64d27fa08be2407df5c125442e885e3 upstream. Static checker noticed that base3 could be used uninitialized if the segment was not present (useable). Random stack values probably would not pass VMCS entry checks. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 1aa366163b8b ("KVM: x86 emulator: consolidate segment accessors") Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
* x86/mm/32: Set the '__vmalloc_start_set' flag in initmem_init()Laura Abbott2017-11-061-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | commit 861ce4a3244c21b0af64f880d5bfe5e6e2fb9e4a upstream. '__vmalloc_start_set' currently only gets set in initmem_init() when !CONFIG_NEED_MULTIPLE_NODES. This breaks detection of vmalloc address with virt_addr_valid() with CONFIG_NEED_MULTIPLE_NODES=y, causing a kernel crash: [mm/usercopy] 517e1fbeb6: kernel BUG at arch/x86/mm/physaddr.c:78! Set '__vmalloc_start_set' appropriately for that case as well. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Laura Abbott <labbott@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: dc16ecf7fd1f ("x86-32: use specific __vmalloc_start_set flag in __virt_addr_valid") Link: http://lkml.kernel.org/r/1494278596-30373-1-git-send-email-labbott@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* kvm: async_pf: fix rcu_irq_enter() with irqs enabledPaolo Bonzini2017-11-061-1/+1
| | | | | | | | | | | | | | | | | | commit bbaf0e2b1c1b4f88abd6ef49576f0efb1734eae5 upstream. native_safe_halt enables interrupts, and you just shouldn't call rcu_irq_enter() with interrupts enabled. Reorder the call with the following local_irq_disable() to respect the invariant. Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* x86/io: Add "memory" clobber to insb/insw/insl/outsb/outsw/outslArnd Bergmann2017-11-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7206f9bf108eb9513d170c73f151367a1bdf3dbf upstream. The x86 version of insb/insw/insl uses an inline assembly that does not have the target buffer listed as an output. This can confuse the compiler, leading it to think that a subsequent access of the buffer is uninitialized: drivers/net/wireless/wl3501_cs.c: In function ‘wl3501_mgmt_scan_confirm’: drivers/net/wireless/wl3501_cs.c:665:9: error: ‘sig.status’ is used uninitialized in this function [-Werror=uninitialized] drivers/net/wireless/wl3501_cs.c:668:12: error: ‘sig.cap_info’ may be used uninitialized in this function [-Werror=maybe-uninitialized] drivers/net/sb1000.c: In function 'sb1000_rx': drivers/net/sb1000.c:775:9: error: 'st[0]' is used uninitialized in this function [-Werror=uninitialized] drivers/net/sb1000.c:776:10: error: 'st[1]' may be used uninitialized in this function [-Werror=maybe-uninitialized] drivers/net/sb1000.c:784:11: error: 'st[1]' may be used uninitialized in this function [-Werror=maybe-uninitialized] I tried to mark the exact input buffer as an output here, but couldn't figure it out. As suggested by Linus, marking all memory as clobbered however is good enough too. For the outs operations, I also add the memory clobber, to force the input to be written to local variables. This is probably already guaranteed by the "asm volatile", but it can't hurt to do this for symmetry. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Link: http://lkml.kernel.org/r/20170719125310.2487451-5-arnd@arndb.de Link: https://lkml.org/lkml/2017/7/12/605 Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* misc: replace __FUNCTION__ by __function__Moyster2017-09-232-2/+2
| | | | | result of : git grep -l '__FUNCTION__' | xargs sed -i 's/__FUNCTION__/__func__/g'
* kernel.h: remove ancient __FUNCTION__ hackRasmus Villemoes2017-09-232-3/+3
| | | | | | | | | | | | | | | | __FUNCTION__ hasn't been treated as a string literal since gcc 3.4, so this only helps people who only test-compile using 3.3 (compiler-gcc3.h barks at anything older than that). Besides, there are almost no occurrences of __FUNCTION__ left in the tree. [akpm@linux-foundation.org: convert remaining __FUNCTION__ references] Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Moyster <oysterized@gmail.com>
* crypto: xts - consolidate sanity check for keysStephan Mueller2017-08-316-45/+18
| | | | | | | | | | | | | | | The patch centralizes the XTS key check logic into the service function xts_check_key which is invoked from the different XTS implementations. With this, the XTS implementations in ARM, ARM64, PPC and S390 have now a sanity check for the XTS keys similar to the other arches. In addition, this service function received a check to ensure that the key != the tweak key which is mandated by FIPS 140-2 IG A.9. As the check is not present in the standards defining XTS, it is only enforced in FIPS mode of the kernel. Signed-off-by: Stephan Mueller <smueller@chronox.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* Make ARCH_HAS_FAST_MULTIPLIER a real config variableLinus Torvalds2017-08-312-2/+1
| | | | | | | | | | | | | | | | | | | | | It used to be an ad-hoc hack defined by the x86 version of <asm/bitops.h> that enabled a couple of library routines to know whether an integer multiply is faster than repeated shifts and additions. This just makes it use the real Kconfig system instead, and makes x86 (which was the only architecture that did this) select the option. NOTE! Even for x86, this really is kind of wrong. If we cared, we would probably not enable this for builds optimized for netburst (P4), where shifts-and-adds are generally faster than multiplies. This patch does *not* change that kind of logic, though, it is purely a syntactic change with no code changes. This was triggered by the fact that we have other places that really want to know "do I want to expand multiples by constants by hand or not", particularly the hash generation code. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* UPSTREAM: x86/acpi: Prevent out of bound access caused by broken ACPI tablesSeunghun Han2017-08-161-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | The bus_irq argument of mp_override_legacy_irq() is used as the index into the isa_irq_to_gsi[] array. The bus_irq argument originates from ACPI_MADT_TYPE_IO_APIC and ACPI_MADT_TYPE_INTERRUPT items in the ACPI tables, but is nowhere sanity checked. That allows broken or malicious ACPI tables to overwrite memory, which might cause malfunction, panic or arbitrary code execution. Add a sanity check and emit a warning when that triggers. [ tglx: Added warning and rewrote changelog ] Signed-off-by: Seunghun Han <kkamagui@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: security@kernel.org Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: stable@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit dad5ab0db8deac535d03e3fe3d8f2892173fa6a4) Bug: 64253928 Change-Id: I34c8305afb582298618cf8e6d889985134137265
* mm: larger stack guard gap, between vmasHugh Dickins2017-07-042-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream. Stack guard page is a useful feature to reduce a risk of stack smashing into a different mapping. We have been using a single page gap which is sufficient to prevent having stack adjacent to a different mapping. But this seems to be insufficient in the light of the stack usage in userspace. E.g. glibc uses as large as 64kB alloca() in many commonly used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX] which is 256kB or stack strings with MAX_ARG_STRLEN. This will become especially dangerous for suid binaries and the default no limit for the stack size limit because those applications can be tricked to consume a large portion of the stack and a single glibc call could jump over the guard page. These attacks are not theoretical, unfortunatelly. Make those attacks less probable by increasing the stack guard gap to 1MB (on systems with 4k pages; but make it depend on the page size because systems with larger base pages might cap stack allocations in the PAGE_SIZE units) which should cover larger alloca() and VLA stack allocations. It is obviously not a full fix because the problem is somehow inherent, but it should reduce attack space a lot. One could argue that the gap size should be configurable from userspace, but that can be done later when somebody finds that the new 1MB is wrong for some special case applications. For now, add a kernel command line option (stack_guard_gap) to specify the stack gap size (in page units). Implementation wise, first delete all the old code for stack guard page: because although we could get away with accounting one extra page in a stack vma, accounting a larger gap can break userspace - case in point, a program run with "ulimit -S -v 20000" failed when the 1MB gap was counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK and strict non-overcommit mode. Instead of keeping gap inside the stack vma, maintain the stack guard gap as a gap between vmas: using vm_start_gap() in place of vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few places which need to respect the gap - mainly arch_get_unmapped_area(), and and the vma tree's subtree_gap support for that. Original-patch-by: Oleg Nesterov <oleg@redhat.com> Original-patch-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Hugh Dickins <hughd@google.com> [wt: backport to 4.11: adjust context] [wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide] [wt: backport to 4.4: adjust context ; drop ppc hugetlb_radix changes] [wt: backport to 3.18: adjust context ; no FOLL_POPULATE ; s390 uses generic arch_get_unmapped_area()] [wt: backport to 3.16: adjust context] [wt: backport to 3.10: adjust context ; code logic in PARISC's arch_get_unmapped_area() wasn't found ; code inserted into expand_upwards() and expand_downwards() runs under anon_vma lock; changes for gup.c:faultin_page go to memory.c:__get_user_pages(); included Hugh Dickins' fixes] Signed-off-by: Willy Tarreau <w@1wt.eu>
* x86/mm/32: Enable full randomization on i386 and X86_32Hector Marco-Gisbert2017-07-041-13/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8b8addf891de8a00e4d39fc32f93f7c5eb8feceb upstream. Currently on i386 and on X86_64 when emulating X86_32 in legacy mode, only the stack and the executable are randomized but not other mmapped files (libraries, vDSO, etc.). This patch enables randomization for the libraries, vDSO and mmap requests on i386 and in X86_32 in legacy mode. By default on i386 there are 8 bits for the randomization of the libraries, vDSO and mmaps which only uses 1MB of VA. This patch preserves the original randomness, using 1MB of VA out of 3GB or 4GB. We think that 1MB out of 3GB is not a big cost for having the ASLR. The first obvious security benefit is that all objects are randomized (not only the stack and the executable) in legacy mode which highly increases the ASLR effectiveness, otherwise the attackers may use these non-randomized areas. But also sensitive setuid/setgid applications are more secure because currently, attackers can disable the randomization of these applications by setting the ulimit stack to "unlimited". This is a very old and widely known trick to disable the ASLR in i386 which has been allowed for too long. Another trick used to disable the ASLR was to set the ADDR_NO_RANDOMIZE personality flag, but fortunately this doesn't work on setuid/setgid applications because there is security checks which clear Security-relevant flags. This patch always randomizes the mmap_legacy_base address, removing the possibility to disable the ASLR by setting the stack to "unlimited". Signed-off-by: Hector Marco-Gisbert <hecmargi@upv.es> Acked-by: Ismael Ripoll Ripoll <iripoll@upv.es> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: kees Cook <keescook@chromium.org> Link: http://lkml.kernel.org/r/1457639460-5242-1-git-send-email-hecmargi@upv.es Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Willy Tarreau <w@1wt.eu>
* x86: standardize mmap_rnd() usageKees Cook2017-07-041-18/+22
| | | | | | | | | | | | | | | | | commit 82168140bc4cec7ec9bad39705518541149ff8b7 upstream. In preparation for splitting out ET_DYN ASLR, this refactors the use of mmap_rnd() to be used similarly to arm, and extracts the checking of PF_RANDOMIZE. Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Willy Tarreau <w@1wt.eu>
* kvm: nVMX: Allow L1 to intercept software exceptions (#BP and #OF)Jim Mattson2017-07-041-6/+5
| | | | | | | | | | | | | commit ef85b67385436ddc1998f45f1d6a210f935b3388 upstream. When L2 exits to L0 due to "exception or NMI", software exceptions (#BP and #OF) for which L1 has requested an intercept should be handled by L1 rather than L0. Previously, only hardware exceptions were forwarded to L1. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
* ftrace/x86: Fix triple fault with graph tracing and suspend-to-ramJosh Poimboeuf2017-07-041-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 34a477e5297cbaa6ecc6e17c042a866e1cbe80d6 upstream. On x86-32, with CONFIG_FIRMWARE and multiple CPUs, if you enable function graph tracing and then suspend to RAM, it will triple fault and reboot when it resumes. The first fault happens when booting a secondary CPU: startup_32_smp() load_ucode_ap() prepare_ftrace_return() ftrace_graph_is_dead() (accesses 'kill_ftrace_graph') The early head_32.S code calls into load_ucode_ap(), which has an an ftrace hook, so it calls prepare_ftrace_return(), which calls ftrace_graph_is_dead(), which tries to access the global 'kill_ftrace_graph' variable with a virtual address, causing a fault because the CPU is still in real mode. The fix is to add a check in prepare_ftrace_return() to make sure it's running in protected mode before continuing. The check makes sure the stack pointer is a virtual kernel address. It's a bit of a hack, but it's not very intrusive and it works well enough. For reference, here are a few other (more difficult) ways this could have potentially been fixed: - Move startup_32_smp()'s call to load_ucode_ap() down to *after* paging is enabled. (No idea what that would break.) - Track down load_ucode_ap()'s entire callee tree and mark all the functions 'notrace'. (Probably not realistic.) - Pause graph tracing in ftrace_suspend_notifier_call() or bringup_cpu() or __cpu_up(), and ensure that the pause facility can be queried from real mode. Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Tested-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: "Rafael J . Wysocki" <rjw@rjwysocki.net> Cc: linux-acpi@vger.kernel.org Cc: Borislav Petkov <bp@alien8.de> Cc: Len Brown <lenb@kernel.org> Link: http://lkml.kernel.org/r/5c1272269a580660703ed2eccf44308e790c7a98.1492123841.git.jpoimboe@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Willy Tarreau <w@1wt.eu>
* xen/x86: don't lose event interruptsStefano Stabellini2017-07-041-3/+3
| | | | | | | | | | | | | | | | | | | | | | commit c06b6d70feb32d28f04ba37aa3df17973fd37b6b upstream. On slow platforms with unreliable TSC, such as QEMU emulated machines, it is possible for the kernel to request the next event in the past. In that case, in the current implementation of xen_vcpuop_clockevent, we simply return -ETIME. To be precise the Xen returns -ETIME and we pass it on. However the result of this is a missed event, which simply causes the kernel to hang. Instead it is better to always ask the hypervisor for a timer event, even if the timeout is in the past. That way there are no lost interrupts and the kernel survives. To do that, remove the VCPU_SSHOTTMR_future flag. Signed-off-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Juergen Gross <jgross@suse.com> Cc: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Willy Tarreau <w@1wt.eu>
* x86/mce/AMD: Give a name to MCA bank 3 when accessed with legacy MSRsYazen Ghannam2017-07-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 29f72ce3e4d18066ec75c79c857bee0618a3504b upstream. MCA bank 3 is reserved on systems pre-Fam17h, so it didn't have a name. However, MCA bank 3 is defined on Fam17h systems and can be accessed using legacy MSRs. Without a name we get a stack trace on Fam17h systems when trying to register sysfs files for bank 3 on kernels that don't recognize Scalable MCA. Call MCA bank 3 "decode_unit" since this is what it represents on Fam17h. This will allow kernels without SMCA support to see this bank on Fam17h+ and prevent the stack trace. This will not affect older systems since this bank is reserved on them, i.e. it'll be ignored. Tested on AMD Fam15h and Fam17h systems. WARNING: CPU: 26 PID: 1 at lib/kobject.c:210 kobject_add_internal kobject: (ffff88085bb256c0): attempted to be registered with empty name! ... Call Trace: kobject_add_internal kobject_add kobject_create_and_add threshold_create_device threshold_init_device Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1490102285-3659-1-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Willy Tarreau <w@1wt.eu>
* x86/vdso: Plug race between mapping and ELF header setupThomas Gleixner2017-07-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6fdc6dd90272ce7e75d744f71535cfbd8d77da81 upstream. The vsyscall32 sysctl can racy against a concurrent fork when it switches from disabled to enabled: arch_setup_additional_pages() if (vdso32_enabled) --> No mapping sysctl.vsysscall32() --> vdso32_enabled = true create_elf_tables() ARCH_DLINFO_IA32 if (vdso32_enabled) { --> Add VDSO entry with NULL pointer Make ARCH_DLINFO_IA32 check whether the VDSO mapping has been set up for the newly forked process or not. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mathias Krause <minipli@googlemail.com> Link: http://lkml.kernel.org/r/20170410151723.602367196@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Willy Tarreau <w@1wt.eu>
* crypto: ghash-clmulni - Fix load failureWang, Rui Y2017-07-041-0/+26
| | | | | | | | | | | | | | | | | | | | | commit 3a020a723c65eb8ffa7c237faca26521a024e582 upstream. ghash_clmulni_intel fails to load on Linux 4.3+ with the following message: "modprobe: ERROR: could not insert 'ghash_clmulni_intel': Invalid argument" After 8996eafdc ("crypto: ahash - ensure statesize is non-zero") all ahash drivers are required to implement import()/export(), and must have a non- zero statesize. This patch has been tested with the algif_hash interface. The calculated digest values, after several rounds of import()s and export()s, match those calculated by tcrypt. Signed-off-by: Rui Wang <rui.y.wang@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Cc: Sumit Semwal <sumit.semwal@linaro.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Willy Tarreau <w@1wt.eu>
* KVM: VMX: use correct vmcs_read/write for guest segment selector/baseChao Peng2017-07-041-1/+1
| | | | | | | | | | | | | | | | commit 96794e4ed4d758272c486e1529e431efb7045265 upstream. Guest segment selector is 16 bit field and guest segment base is natural width field. Fix two incorrect invocations accordingly. Without this patch, build fails when aggressive inlining is used with ICC. [js] no vmx_dump_sel in 3.12 Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Willy Tarreau <w@1wt.eu>
* x86/PCI: Ignore _CRS on Supermicro X8DTH-i/6/iF/6FBjorn Helgaas2017-07-041-0/+10
| | | | | | | | | | | | | | | | | | | | | commit 89e9f7bcd8744ea25fcf0ac671b8d72c10d7d790 upstream. Martin reported that the Supermicro X8DTH-i/6/iF/6F advertises incorrect host bridge windows via _CRS: pci_root PNP0A08:00: host bridge window [io 0xf000-0xffff] pci_root PNP0A08:01: host bridge window [io 0xf000-0xffff] Both bridges advertise the 0xf000-0xffff window, which cannot be correct. Work around this by ignoring _CRS on this system. The downside is that we may not assign resources correctly to hot-added PCI devices (if they are possible on this system). Link: https://bugzilla.kernel.org/show_bug.cgi?id=42606 Reported-by: Martin Burnicki <martin.burnicki@meinberg.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
* ftrace/x86: Set ftrace_stub to weak to prevent gcc from using short jumps to itSteven Rostedt2017-07-041-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8329e818f14926a6040df86b2668568bde342ebf upstream. Matt Fleming reported seeing crashes when enabling and disabling function profiling which uses function graph tracer. Later Namhyung Kim hit a similar issue and he found that the issue was due to the jmp to ftrace_stub in ftrace_graph_call was only two bytes, and when it was changed to jump to the tracing code, it overwrote the ftrace_stub that was after it. Masami Hiramatsu bisected this down to a binutils change: 8dcea93252a9ea7dff57e85220a719e2a5e8ab41 is the first bad commit commit 8dcea93252a9ea7dff57e85220a719e2a5e8ab41 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri May 15 03:17:31 2015 -0700 Add -mshared option to x86 ELF assembler This patch adds -mshared option to x86 ELF assembler. By default, assembler will optimize out non-PLT relocations against defined non-weak global branch targets with default visibility. The -mshared option tells the assembler to generate code which may go into a shared library where all non-weak global branch targets with default visibility can be preempted. The resulting code is slightly bigger. This option only affects the handling of branch instructions. Declaring ftrace_stub as a weak call prevents gas from using two byte jumps to it, which would be converted to a jump to the function graph code. Link: http://lkml.kernel.org/r/20160516230035.1dbae571@gandalf.local.home Reported-by: Matt Fleming <matt@codeblueprint.co.uk> Reported-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Matt Fleming <matt@codeblueprint.co.uk> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* xen: do not re-use pirq number cached in pci device msi msg dataDan Streetman2017-06-171-16/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c74fd80f2f41d05f350bb478151021f88551afe8 upstream. Revert the main part of commit: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") That commit introduced reading the pci device's msi message data to see if a pirq was previously configured for the device's msi/msix, and re-use that pirq. At the time, that was the correct behavior. However, a later change to Qemu caused it to call into the Xen hypervisor to unmap all pirqs for a pci device, when the pci device disables its MSI/MSIX vectors; specifically the Qemu commit: c976437c7dba9c7444fb41df45468968aaa326ad ("qemu-xen: free all the pirqs for msi/msix when driver unload") Once Qemu added this pirq unmapping, it was no longer correct for the kernel to re-use the pirq number cached in the pci device msi message data. All Qemu releases since 2.1.0 contain the patch that unmaps the pirqs when the pci device disables its MSI/MSIX vectors. This bug is causing failures to initialize multiple NVMe controllers under Xen, because the NVMe driver sets up a single MSIX vector for each controller (concurrently), and then after using that to talk to the controller for some configuration data, it disables the single MSIX vector and re-configures all the MSIX vectors it needs. So the MSIX setup code tries to re-use the cached pirq from the first vector for each controller, but the hypervisor has already given away that pirq to another controller, and its initialization fails. This is discussed in more detail at: https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests") Signed-off-by: Dan Streetman <dan.streetman@canonical.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
* x86/platform/goldfish: Prevent unconditional loadingThomas Gleixner2017-06-171-1/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 47512cfd0d7a8bd6ab71d01cd89fca19eb2093eb upstream. The goldfish platform code registers the platform device unconditionally which causes havoc in several ways if the goldfish_pdev_bus driver is enabled: - Access to the hardcoded physical memory region, which is either not available or contains stuff which is completely unrelated. - Prevents that the interrupt of the serial port can be requested - In case of a spurious interrupt it goes into a infinite loop in the interrupt handler of the pdev_bus driver (which needs to be fixed seperately). Add a 'goldfish' command line option to make the registration opt-in when the platform is compiled in. I'm seriously grumpy about this engineering trainwreck, which has seven SOBs from Intel developers for 50 lines of code. And none of them figured out that this is broken. Impressive fail! Fixes: ddd70cf93d78 ("goldfish: platform device for x86") Reported-by: Gabriel C <nix.or.die@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Willy Tarreau <w@1wt.eu>
* x86/cpu: Fix bootup crashes by sanitizing the argument of the 'clearcpuid=' ↵Lukasz Odzioba2017-06-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | command-line option commit dd853fd216d1485ed3045ff772079cc8689a9a4a upstream. A negative number can be specified in the cmdline which will be used as setup_clear_cpu_cap() argument. With that we can clear/set some bit in memory predceeding boot_cpu_data/cpu_caps_cleared which may cause kernel to misbehave. This patch adds lower bound check to setup_disablecpuid(). Boris Petkov reproduced a crash: [ 1.234575] BUG: unable to handle kernel paging request at ffffffff858bd540 [ 1.236535] IP: memcpy_erms+0x6/0x10 Signed-off-by: Lukasz Odzioba <lukasz.odzioba@intel.com> Acked-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: andi.kleen@intel.com Cc: bp@alien8.de Cc: dave.hansen@linux.intel.com Cc: luto@kernel.org Cc: slaoub@gmail.com Fixes: ac72e7888a61 ("x86: add generic clearcpuid=... option") Link: http://lkml.kernel.org/r/1482933340-11857-1-git-send-email-lukasz.odzioba@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* ftrace/x86_32: Set ftrace_stub to weak to prevent gcc from using short jumps ↵Steven Rostedt (Red Hat)2017-06-171-2/+2
| | | | | | | | | | | | | | | | | | | | | | | to it commit 847fa1a6d3d00f3bdf68ef5fa4a786f644a0dd67 upstream. With new binutils, gcc may get smart with its optimization and change a jmp from a 5 byte jump to a 2 byte one even though it was jumping to a global function. But that global function existed within a 2 byte radius, and gcc was able to optimize it. Unfortunately, that jump was also being modified when function graph tracing begins. Since ftrace expected that jump to be 5 bytes, but it was only two, it overwrote code after the jump, causing a crash. This was fixed for x86_64 with commit 8329e818f149, with the same subject as this commit, but nothing was done for x86_32. Fixes: d61f82d06672 ("ftrace: use dynamic patching for updating mcount calls") Reported-by: Colin Ian King <colin.king@canonical.com> Tested-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* KVM: x86: Introduce segmented_write_stdSteve Rutherford2017-06-171-2/+16
| | | | | | | | | | | | | | | | | | | | | | commit 129a72a0d3c8e139a04512325384fe5ac119e74d upstream. Introduces segemented_write_std. Switches from emulated reads/writes to standard read/writes in fxsave, fxrstor, sgdt, and sidt. This fixes CVE-2017-2584, a longstanding kernel memory leak. Since commit 283c95d0e389 ("KVM: x86: emulate FXSAVE and FXRSTOR", 2016-11-09), which is luckily not yet in any final release, this would also be an exploitable kernel memory *write*! Reported-by: Dmitry Vyukov <dvyukov@google.com> Fixes: 96051572c819194c37a8367624b285be10297eca Fixes: 283c95d0e3891b64087706b344a4b545d04a6e62 Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Steve Rutherford <srutherford@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
* KVM: x86: fix emulation of "MOV SS, null selector"Paolo Bonzini2017-06-171-10/+38
| | | | | | | | | | | | | | | | | | | | | | | | | commit 33ab91103b3415e12457e3104f0e4517ce12d0f3 upstream. This is CVE-2017-2583. On Intel this causes a failed vmentry because SS's type is neither 3 nor 7 (even though the manual says this check is only done for usable SS, and the dmesg splat says that SS is unusable!). On AMD it's worse: svm.c is confused and sets CPL to 0 in the vmcb. The fix fabricates a data segment descriptor when SS is set to a null selector, so that CPL and SS.DPL are set correctly in the VMCS/vmcb. Furthermore, only allow setting SS to a NULL selector if SS.RPL < 3; this in turn ensures CPL < 3 because RPL must be equal to CPL. Thanks to Andy Lutomirski and Willy Tarreau for help in analyzing the bug and deciphering the manuals. [js] backport to 3.12 Reported-by: Xiaohan Zhang <zhangxiaohan1@huawei.com> Fixes: 79d5b4c3cd809c770d4bf9812635647016c56011 Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Willy Tarreau <w@1wt.eu>
* mm: Tighten x86 /dev/mem with zeroing readsKees Cook2017-05-071-10/+30
| | | | | | | | | | | | | | | | | | | | | | | | commit a4866aa812518ed1a37d8ea0c881dc946409de94 upstream. Under CONFIG_STRICT_DEVMEM, reading System RAM through /dev/mem is disallowed. However, on x86, the first 1MB was always allowed for BIOS and similar things, regardless of it actually being System RAM. It was possible for heap to end up getting allocated in low 1MB RAM, and then read by things like x86info or dd, which would trip hardened usercopy: usercopy: kernel memory exposure attempt detected from ffff880000090000 (dma-kmalloc-256) (4096 bytes) This changes the x86 exception for the low 1MB by reading back zeros for System RAM areas instead of blindly allowing them. More work is needed to extend this to mmap, but currently mmap doesn't go through usercopy, so hardened usercopy won't Oops the kernel. Change-Id: If63add732b7f654918aded70f5196166416ae905 Reported-by: Tommi Rantala <tommi.t.rantala@nokia.com> Tested-by: Tommi Rantala <tommi.t.rantala@nokia.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Brad Spengler <spender@grsecurity.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* um: Stop abusing __KERNEL__Richard Weinberger2017-05-031-3/+3
| | | | | | | | Currently UML is abusing __KERNEL__ to distinguish between kernel and host code (os-Linux). It is better to use a custom define such that existing users of __KERNEL__ don't get confused. Signed-off-by: Richard Weinberger <richard@nod.at>
* locking/mcs: Allow architecture specific asm files to be used for contended caseTim Chen2017-05-031-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch allows each architecture to add its specific assembly optimized arch_mcs_spin_lock_contended and arch_mcs_spinlock_uncontended for MCS lock and unlock functions. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: AswinChandramouleeswaran <aswin@hp.com> Cc: George Spelvin <linux@horizon.com> Cc: Rik vanRiel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: MichelLespinasse <walken@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alex Shi <alex.shi@linaro.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Figo.zhang" <figo1802@gmail.com> Cc: "Paul E.McKenney" <paulmck@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Waiman Long <waiman.long@hp.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Will Deacon <will.deacon@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew R Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1390347382.3138.67.camel@schen9-DESK Signed-off-by: Ingo Molnar <mingo@kernel.org> Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Git-commit: ddf1d169c0a489d498c1799a7043904a43b0c159 [joonwoop@codeaurora.org: Resolve merge conflicts; we don't have changes for arch other than ARM/ARM64] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
* um: x86: add relaxed.hMister Oyster2017-05-031-0/+22
|
* um: remove dead codeRichard Weinberger2017-05-031-1/+0
| | | | | | | | | | | "me" is not used. [cherry-pick of upstream 9e82d450531c79b18ab18c9b9645cdd9db31ee98] Bug: 21631098 Change-Id: Ifc3550184931dddf8feebd6c3137e60b97f6a0f1 Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
* BACKPORT: FROMLIST: mm: ASLR: use get_random_long()dcashman2017-04-111-2/+2
| | | | | | | | | | | | | (cherry picked from commit https://lkml.org/lkml/2016/2/4/833) Replace calls to get_random_int() followed by a cast to (unsigned long) with calls to get_random_long(). Also address shifting bug which, in case of x86 removed entropy mask for mmap_rnd_bits values > 31 bits. Bug: 26963541 Signed-off-by: Daniel Cashman <dcashman@android.com> Signed-off-by: Daniel Cashman <dcashman@google.com> Change-Id: Ie577b21a0678cf4b21eae06bddd8ccb27cbe70ff
* Get rid of __cpuinitMoyster2017-04-1166-313/+313
| | | | | | | | | | | | | | | | | | | | | This commit is the result of find . -name '*.c' | xargs sed -i 's/ __cpuinit / /g' find . -name '*.c' | xargs sed -i 's/ __cpuexit / /g' find . -name '*.c' | xargs sed -i 's/ __cpuinitdata / /g' find . -name '*.c' | xargs sed -i 's/ __cpuinit$//g' find ./arch/ -name '*.h' | xargs sed -i 's/ __cpuinit//g' find . -name '*.c' | xargs sed -i 's/^__cpuinit //g' find . -name '*.c' | xargs sed -i 's/^__cpuinitdata //g' find . -name '*.c' | xargs sed -i 's/\*__cpuinit /\*/g' find . -name '*.c' | xargs sed -i 's/ __cpuinitconst / /g' find . -name '*.h' | xargs sed -i 's/ __cpuinit / /g' find . -name '*.h' | xargs sed -i 's/ __cpuinitdata / /g' git add . git reset include/linux/init.h git checkout -- include/linux/init.h based off : https://github.com/jollaman999/jolla-kernel_bullhead/commit/bc15db84a622eed7d61d3ece579b577154d0ec29
* Fix potential infoleak in older kernelsLinus Torvalds2017-04-111-1/+1
| | | | | | | | | | | | | | | | | | | Not upstream as it is not needed there. So a patch something like this might be a safe way to fix the potential infoleak in older kernels. THIS IS UNTESTED. It's a very obvious patch, though, so if it compiles it probably works. It just initializes the output variable with 0 in the inline asm description, instead of doing it in the exception handler. It will generate slightly worse code (a few unnecessary ALU operations), but it doesn't have any interactions with the exception handler implementation. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Willy Tarreau <w@1wt.eu>
* x86/um: reuse asm-generic/barrier.hMichael S. Tsirkin2017-04-111-5/+1
| | | | | | | | | | | | | | | | commit 577f183acc88645eae116326cc2203dc88ea730c upstream. On x86/um CONFIG_SMP is never defined. As a result, several macros match the asm-generic variant exactly. Drop the local definitions and pull in asm-generic/barrier.h instead. This is in preparation to refactoring this code area. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Richard Weinberger <richard@nod.at> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* x86/build: Build compressed x86 kernels as PIEH.J. Lu2017-04-113-1/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6d92bc9d483aa1751755a66fee8fb39dffb088c0 upstream. The 32-bit x86 assembler in binutils 2.26 will generate R_386_GOT32X relocation to get the symbol address in PIC. When the compressed x86 kernel isn't built as PIC, the linker optimizes R_386_GOT32X relocations to their fixed symbol addresses. However, when the compressed x86 kernel is loaded at a different address, it leads to the following load failure: Failed to allocate space for phdrs during the decompression stage. If the compressed x86 kernel is relocatable at run-time, it should be compiled with -fPIE, instead of -fPIC, if possible and should be built as Position Independent Executable (PIE) so that linker won't optimize R_386_GOT32X relocation to its fixed symbol address. Older linkers generate R_386_32 relocations against locally defined symbols, _bss, _ebss, _got and _egot, in PIE. It isn't wrong, just less optimal than R_386_RELATIVE. But the x86 kernel fails to properly handle R_386_32 relocations when relocating the kernel. To generate R_386_RELATIVE relocations, we mark _bss, _ebss, _got and _egot as hidden in both 32-bit and 64-bit x86 kernels. To build a 64-bit compressed x86 kernel as PIE, we need to disable the relocation overflow check to avoid relocation overflow errors. We do this with a new linker command-line option, -z noreloc-overflow, which got added recently: commit 4c10bbaa0912742322f10d9d5bb630ba4e15dfa7 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 15 11:07:06 2016 -0700 Add -z noreloc-overflow option to x86-64 ld Add -z noreloc-overflow command-line option to the x86-64 ELF linker to disable relocation overflow check. This can be used to avoid relocation overflow check if there will be no dynamic relocation overflow at run-time. The 64-bit compressed x86 kernel is built as PIE only if the linker supports -z noreloc-overflow. So far 64-bit relocatable compressed x86 kernel boots fine even when it is built as a normal executable. Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org [ Edited the changelog and comments. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* x86/paravirt: Do not trace _paravirt_ident_*() functionsSteven Rostedt2017-04-111-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 15301a570754c7af60335d094dd2d1808b0641a5 upstream. Łukasz Daniluk reported that on a RHEL kernel that his machine would lock up after enabling function tracer. I asked him to bisect the functions within available_filter_functions, which he did and it came down to three: _paravirt_nop(), _paravirt_ident_32() and _paravirt_ident_64() It was found that this is only an issue when noreplace-paravirt is added to the kernel command line. This means that those functions are most likely called within critical sections of the funtion tracer, and must not be traced. In newer kenels _paravirt_nop() is defined within gcc asm(), and is no longer an issue. But both _paravirt_ident_{32,64}() causes the following splat when they are traced: mm/pgtable-generic.c:33: bad pmd ffff8800d2435150(0000000001d00054) mm/pgtable-generic.c:33: bad pmd ffff8800d3624190(0000000001d00070) mm/pgtable-generic.c:33: bad pmd ffff8800d36a5110(0000000001d00054) mm/pgtable-generic.c:33: bad pmd ffff880118eb1450(0000000001d00054) NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd-journal:469] Modules linked in: e1000e CPU: 2 PID: 469 Comm: systemd-journal Not tainted 4.6.0-rc4-test+ #513 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012 task: ffff880118f740c0 ti: ffff8800d4aec000 task.ti: ffff8800d4aec000 RIP: 0010:[<ffffffff81134148>] [<ffffffff81134148>] queued_spin_lock_slowpath+0x118/0x1a0 RSP: 0018:ffff8800d4aefb90 EFLAGS: 00000246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88011eb16d40 RDX: ffffffff82485760 RSI: 000000001f288820 RDI: ffffea0000008030 RBP: ffff8800d4aefb90 R08: 00000000000c0000 R09: 0000000000000000 R10: ffffffff821c8e0e R11: 0000000000000000 R12: ffff880000200fb8 R13: 00007f7a4e3f7000 R14: ffffea000303f600 R15: ffff8800d4b562e0 FS: 00007f7a4e3d7840(0000) GS:ffff88011eb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7a4e3f7000 CR3: 00000000d3e71000 CR4: 00000000001406e0 Call Trace: _raw_spin_lock+0x27/0x30 handle_pte_fault+0x13db/0x16b0 handle_mm_fault+0x312/0x670 __do_page_fault+0x1b1/0x4e0 do_page_fault+0x22/0x30 page_fault+0x28/0x30 __vfs_read+0x28/0xe0 vfs_read+0x86/0x130 SyS_read+0x46/0xa0 entry_SYSCALL_64_fastpath+0x1e/0xa8 Code: 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 6d 01 00 48 03 14 c5 80 6a 5d 82 48 89 0a 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85> c0 74 f7 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02 f3 90 8b Reported-by: Łukasz Daniluk <lukasz.daniluk@intel.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* x86/mm/pat, /dev/mem: Remove superfluous error messageJiri Kosina2017-04-111-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 39380b80d72723282f0ea1d1bbf2294eae45013e upstream. Currently it's possible for broken (or malicious) userspace to flood a kernel log indefinitely with messages a-la Program dmidecode tried to access /dev/mem between f0000->100000 because range_is_allowed() is case of CONFIG_STRICT_DEVMEM being turned on dumps this information each and every time devmem_is_allowed() fails. Reportedly userspace that is able to trigger contignuous flow of these messages exists. It would be possible to rate limit this message, but that'd have a questionable value; the administrator wouldn't get information about all the failing accessess, so then the information would be both superfluous and incomplete at the same time :) Returning EPERM (which is what is actually happening) is enough indication for userspace what has happened; no need to log this particular error as some sort of special condition. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1607081137020.24757@cbobk.fhfr.pm Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* x86/apic: Do not init irq remapping if ioapic is disabledWanpeng Li2017-04-111-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | commit 2e63ad4bd5dd583871e6602f9d398b9322d358d9 upstream. native_smp_prepare_cpus -> default_setup_apic_routing -> enable_IR_x2apic -> irq_remapping_prepare -> intel_prepare_irq_remapping -> intel_setup_irq_remapping So IR table is setup even if "noapic" boot parameter is added. As a result we crash later when the interrupt affinity is set due to a half initialized remapping infrastructure. Prevent remap initialization when IOAPIC is disabled. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Joerg Roedel <joro@8bytes.org> Link: http://lkml.kernel.org/r/1471954039-3942-1-git-send-email-wanpeng.li@hotmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Willy Tarreau <w@1wt.eu>
* x86/mm: Disable preemption during CR3 read+writeSebastian Andrzej Siewior2017-04-111-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5cf0791da5c162ebc14b01eb01631cfa7ed4fa6e upstream. There's a subtle preemption race on UP kernels: Usually current->mm (and therefore mm->pgd) stays the same during the lifetime of a task so it does not matter if a task gets preempted during the read and write of the CR3. But then, there is this scenario on x86-UP: TaskA is in do_exit() and exit_mm() sets current->mm = NULL followed by: -> mmput() -> exit_mmap() -> tlb_finish_mmu() -> tlb_flush_mmu() -> tlb_flush_mmu_tlbonly() -> tlb_flush() -> flush_tlb_mm_range() -> __flush_tlb_up() -> __flush_tlb() -> __native_flush_tlb() At this point current->mm is NULL but current->active_mm still points to the "old" mm. Let's preempt taskA _after_ native_read_cr3() by taskB. TaskB has its own mm so CR3 has changed. Now preempt back to taskA. TaskA has no ->mm set so it borrows taskB's mm and so CR3 remains unchanged. Once taskA gets active it continues where it was interrupted and that means it writes its old CR3 value back. Everything is fine because userland won't need its memory anymore. Now the fun part: Let's preempt taskA one more time and get back to taskB. This time switch_mm() won't do a thing because oldmm (->active_mm) is the same as mm (as per context_switch()). So we remain with a bad CR3 / PGD and return to userland. The next thing that happens is handle_mm_fault() with an address for the execution of its code in userland. handle_mm_fault() realizes that it has a PTE with proper rights so it returns doing nothing. But the CPU looks at the wrong PGD and insists that something is wrong and faults again. And again. And one more time… This pagefault circle continues until the scheduler gets tired of it and puts another task on the CPU. It gets little difficult if the task is a RT task with a high priority. The system will either freeze or it gets fixed by the software watchdog thread which usually runs at RT-max prio. But waiting for the watchdog will increase the latency of the RT task which is no good. Fix this by disabling preemption across the critical code section. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/1470404259-26290-1-git-send-email-bigeasy@linutronix.de [ Prettified the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* x86/traps: Ignore high word of regs->cs in early_idt_handler_commonAndy Lutomirski2017-04-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | This is a backport of: commit fc0e81b2bea0ebceb71889b61d2240856141c9ee upstream On the 80486 DX, it seems that some exceptions may leave garbage in the high bits of CS. This causes sporadic failures in which early_fixup_exception() refuses to fix up an exception. As far as I can tell, this has been buggy for a long time, but the problem seems to have been exacerbated by commits: 1e02ce4cccdc ("x86: Store a per-cpu shadow copy of CR4") e1bfc11c5a6f ("x86/init: Fix cr4_init_shadow() on CR4-less machines") This appears to have broken for as long as we've had early exception handling. [ This backport should apply to kernels from 3.4 - 4.5. ] Fixes: 4c5023a3fa2e ("x86-32: Handle exception table entries during early boot") Cc: H. Peter Anvin <hpa@zytor.com> Cc: stable@vger.kernel.org Reported-by: Matthew Whitehead <tedheadster@gmail.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* x86/xen: fix upper bound of pmd loop in xen_cleanhighmap()Juergen Gross2017-04-111-1/+1
| | | | | | | | | | | | | commit 1cf38741308c64d08553602b3374fb39224eeb5a upstream. xen_cleanhighmap() is operating on level2_kernel_pgt only. The upper bound of the loop setting non-kernel-image entries to zero should not exceed the size of level2_kernel_pgt. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
* x86/mm/xen: Suppress hugetlbfs in PV guestsJan Beulich2017-04-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 103f6112f253017d7062cd74d17f4a514ed4485c upstream. Huge pages are not normally available to PV guests. Not suppressing hugetlbfs use results in an endless loop of page faults when user mode code tries to access a hugetlbfs mapped area (since the hypervisor denies such PTEs to be created, but error indications can't be propagated out of xen_set_pte_at(), just like for various of its siblings), and - once killed in an oops like this: kernel BUG at .../fs/hugetlbfs/inode.c:428! invalid opcode: 0000 [#1] SMP ... RIP: e030:[<ffffffff811c333b>] [<ffffffff811c333b>] remove_inode_hugepages+0x25b/0x320 ... Call Trace: [<ffffffff811c3415>] hugetlbfs_evict_inode+0x15/0x40 [<ffffffff81167b3d>] evict+0xbd/0x1b0 [<ffffffff8116514a>] __dentry_kill+0x19a/0x1f0 [<ffffffff81165b0e>] dput+0x1fe/0x220 [<ffffffff81150535>] __fput+0x155/0x200 [<ffffffff81079fc0>] task_work_run+0x60/0xa0 [<ffffffff81063510>] do_exit+0x160/0x400 [<ffffffff810637eb>] do_group_exit+0x3b/0xa0 [<ffffffff8106e8bd>] get_signal+0x1ed/0x470 [<ffffffff8100f854>] do_signal+0x14/0x110 [<ffffffff810030e9>] prepare_exit_to_usermode+0xe9/0xf0 [<ffffffff814178a5>] retint_user+0x8/0x13 This is CVE-2016-3961 / XSA-174. Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Juergen Gross <JGross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: xen-devel <xen-devel@lists.xenproject.org> Link: http://lkml.kernel.org/r/57188ED802000078000E431C@prv-mh.provo.novell.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Willy Tarreau <w@1wt.eu>