aboutsummaryrefslogtreecommitdiff
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* BACKPORT: futex: Prevent overflow by strengthen input validationLi Jinyue2018-04-131-0/+3
| | | | | | | | | | | | | | | | | | | | | | UBSAN reports signed integer overflow in kernel/futex.c: UBSAN: Undefined behaviour in kernel/futex.c:2041:18 signed integer overflow: 0 - -2147483648 cannot be represented in type 'int' Add a sanity check to catch negative values of nr_wake and nr_requeue. Signed-off-by: Li Jinyue <lijinyue@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Erick Reyes <erickreyes@google.com> Cc: peterz@infradead.org Cc: dvhart@infradead.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1513242294-31786-1-git-send-email-lijinyue@huawei.com Cherry-picked from fbe0e839d1e22d88810f3ee3e2f1479be4c0aa4a Change-Id: I954cc2848678318b60ec3f103d0c15f87b4605a4
* BACKPORT: audit: consistently record PIDs with task_tgid_nr()Paul Moore2017-12-282-7/+13
| | | | | | | | | | | | | | | | | Unfortunately we record PIDs in audit records using a variety of methods despite the correct way being the use of task_tgid_nr(). This patch converts all of these callers, except for the case of AUDIT_SET in audit_receive_msg() (see the comment in the code). Reported-by: Jeff Vander Stoep <jeffv@google.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Bug: 28952093 (cherry picked from commit fa2bea2f5cca5b8d4a3e5520d2e8c0ede67ac108) Signed-off-by: Jeff Vander Stoep <jeffv@google.com> Change-Id: I36508a25c957f5108299e68a3b0f627c94eb27eb Signed-off-by: Joe Maples <joe@frap129.org>
* perf: Use hrtimers for event multiplexingStephane Eranian2017-12-261-7/+106
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current scheme of using the timer tick was fine for per-thread events. However, it was causing bias issues in system-wide mode (including for uncore PMUs). Event groups would not get their fair share of runtime on the PMU. With tickless kernels, if a core is idle there is no timer tick, and thus no event rotation (multiplexing). However, there are events (especially uncore events) which do count even though cores are asleep. This patch changes the timer source for multiplexing. It introduces a per-PMU per-cpu hrtimer. The advantage is that even when a core goes idle, it will come back to service the hrtimer, thus multiplexing on system-wide events works much better. The per-PMU implementation (suggested by PeterZ) enables adjusting the multiplexing interval per PMU. The preferred interval is stashed into the struct pmu. If not set, it will be forced to the default interval value. In order to minimize the impact of the hrtimer, it is turned on and off on demand. When the PMU on a CPU is overcommited, the hrtimer is activated. It is stopped when the PMU is not overcommitted. In order for this to work properly, we had to change the order of initialization in start_kernel() such that hrtimer_init() is run before perf_event_init(). The default interval in milliseconds is set to a timer tick just like with the old code. We will provide a sysctl to tune this in another patch. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Link: http://lkml.kernel.org/r/1364991694-5876-2-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: mydongistiny <jaysonedson@gmail.com> Signed-off-by: Joe Maples <joe@frap129.org>
* cpu: Defer smpboot kthread unparking until CPU known to schedulerPaul E. McKenney2017-12-261-3/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, smpboot_unpark_threads() is invoked before the incoming CPU has been added to the scheduler's runqueue structures. This might potentially cause the unparked kthread to run on the wrong CPU, since the correct CPU isn't fully set up yet. That causes a sporadic, hard to debug boot crash triggering on some systems, reported by Borislav Petkov, and bisected down to: 2a442c9c6453 ("x86: Use common outgoing-CPU-notification code") This patch places smpboot_unpark_threads() in a CPU hotplug notifier with priority set so that these kthreads are unparked just after the CPU has been added to the runqueues. Change-Id: I8921987de9c2a2f475cc63dc82662d6ebf6e8725 Reported-and-tested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Git-commit: 00df35f991914db6b8bde8cf09808e19a9cffc3d Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Matt Wagantall <mattw@codeaurora.org> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com> Signed-off-by: Joe Maples <joe@frap129.org>
* arm64: use the new *_relaxed macros for lower power usagefranciscofranco2017-12-256-18/+20
| | | | | | Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com> Signed-off-by: Joe Maples <joe@frap129.org> Signed-off-by: Mister Oyster <oysterized@gmail.com>
* Finally eradicate CONFIG_HOTPLUGStephen Rothwell2017-12-221-1/+0
| | | | | | | | | | | | | | | | | | | Ever since commit 45f035ab9b8f ("CONFIG_HOTPLUG should be always on"), it has been basically impossible to build a kernel with CONFIG_HOTPLUG turned off. Remove all the remaining references to it. Cc: Russell King <linux@arm.linux.org.uk> Cc: Doug Thompson <dougthompson@xmission.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Acked-by: Hans Verkuil <hans.verkuil@cisco.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched/idle: Optimize the generic idle loopGaurav Jindal (Gaurav Jindal)2017-12-221-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, smp_processor_id() is used to fetch the current CPU in cpu_idle_loop(). Every time the idle thread runs, it fetches the current CPU using smp_processor_id(). Since the idle thread is per CPU, the current CPU is constant, so we can lift the load out of the loop, saving execution cycles/time in the loop. x86-64: Before patch (execution in loop): 148: 0f ae e8 lfence 14b: 65 8b 04 25 00 00 00 00 mov %gs:0x0,%eax 152: 00 153: 89 c0 mov %eax,%eax 155: 49 0f a3 04 24 bt %rax,(%r12) After patch (execution in loop): 150: 0f ae e8 lfence 153: 4d 0f a3 34 24 bt %r14,(%r12) ARM64: Before patch (execution in loop): 168: d5033d9f dsb ld 16c: b9405661 ldr w1,[x19,#84] 170: 1100fc20 add w0,w1,#0x3f 174: 6b1f003f cmp w1,wzr 178: 1a81b000 csel w0,w0,w1,lt 17c: 130c7000 asr w0,w0,#6 180: 937d7c00 sbfiz x0,x0,#3,#32 184: f8606aa0 ldr x0,[x21,x0] 188: 9ac12401 lsr x1,x0,x1 18c: 36000e61 tbz w1,#0,358 After patch (execution in loop): 1a8: d50339df dsb ld 1ac: f8776ac0 ldr x0,[x22,x23] ab0: ea18001f tst x0,x24 1b4: 54000ea0 b.eq 388 Further observance on ARM64 for 4 seconds shows that cpu_idle_loop is called 8672 times. Shifting the code will save instructions executed in loop and eventually time as well. Signed-off-by: Gaurav Jindal <gaurav.jindal@spreadtrum.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sanjeev Yadav <sanjeev.yadav@spreadtrum.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160512101330.GA488@gauravjindalubtnb.del.spreadtrum.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* log: Initial dmesg pruningNathan Chancellor2017-12-183-4/+4
| | | | | | | | | These are all of the annoying messages on just the stock kernel... More to follow in future patches! Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Mister Oyster <oysterized@gmail.com>
* cpuset: Add allow_attach hook for cpusets on android.Riley Andrews2017-12-111-0/+18
| | | | Change-Id: Ic1b61b2bbb7ce74c9e9422b5e22ee9078251de21
* cpuset: Make cpusets restore on hotplugRiley Andrews2017-12-111-15/+33
| | | | | | | | | | | | This deliberately changes the behavior of the per-cpuset cpus file to not be effected by hotplug. When a cpu is offlined, it will be removed from the cpuset/cpus file. When a cpu is onlined, if the cpuset originally requested that that cpu was part of the cpuset, that cpu will be restored to the cpuset. The cpus files still have to be hierachical, but the ranges no longer have to be out of the currently online cpus, just the physically present cpus. Change-Id: I3efbae24a1f6384be1e603fb56f0d3baef61d924
* clocksource: Fix abs() usage w/ 64bit valuesJohn Stultz2017-12-061-1/+1
| | | | | | | | | | | | | | | | | | | | commit 67dfae0cd72fec5cd158b6e5fb1647b7dbe0834c upstream. This patch fixes one cases where abs() was being used with 64-bit nanosecond values, where the result may be capped at 32-bits. This potentially could cause watchdog false negatives on 32-bit systems, so this patch addresses the issue by using abs64(). Change-Id: I0fea076d3af13221cd43082eb94de4bdcf013275 Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/1442279124-7309-2-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [lizf: Backported to 3.4: adjust context] Signed-off-by: Zefan Li <lizefan@huawei.com>
* sched/cputime: Fix invalid gtime in procHiroshi Shimamoto2017-12-061-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | /proc/stats shows invalid gtime when the thread is running in guest. When vtime accounting is not enabled, we cannot get a valid delta. The delta is calculated with now - tsk->vtime_snap, but tsk->vtime_snap is only updated when vtime accounting is runtime enabled. This patch makes task_gtime() just return gtime without computing the buggy non-existing tickless delta when vtime accounting is not enabled. Use context_tracking_is_enabled() to check if vtime is accounting on some cpu, in which case only we need to check the tickless delta. This way we fix the gtime value regression on machines not running nohz full. The kernel config contains CONFIG_VIRT_CPU_ACCOUNTING_GEN=y and CONFIG_NO_HZ_FULL_ALL=n and boot without nohz_full. I ran and stop a busy loop in VM and see the gtime in host. Dump the 43rd field which shows the gtime in every second: # while :; do awk '{print $3" "$43}' /proc/3955/task/4014/stat; sleep 1; done S 4348 R 7064566 R 7064766 R 7064967 R 7065168 S 4759 S 4759 During running busy loop, it returns large value. After applying this patch, we can see right gtime. # while :; do awk '{print $3" "$43}' /proc/10913/task/10956/stat; sleep 1; done S 5338 R 5365 R 5465 R 5566 R 5666 S 5726 S 5726 Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447948054-28668-2-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* seccomp,x86,arm,mips,s390: Remove nr parameter from secure_computingAndy Lutomirski2017-12-051-19/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The secure_computing function took a syscall number parameter, but it only paid any attention to that parameter if seccomp mode 1 was enabled. Rather than coming up with a kludge to get the parameter to work in mode 2, just remove the parameter. To avoid churn in arches that don't have seccomp filters (and may not even support syscall_get_nr right now), this leaves the parameter in secure_computing_strict, which is now a real function. For ARM, this is a bit ugly due to the fact that ARM conditionally supports seccomp filters. Fixing that would probably only be a couple of lines of code, but it should be coordinated with the audit maintainers. This will be a slight slowdown on some arches. The right fix is to pass in all of seccomp_data instead of trying to make just the syscall nr part be fast. This is a prerequisite for making two-phase seccomp work cleanly. Cc: Russell King <linux@arm.linux.org.uk> Cc: linux-arm-kernel@lists.infradead.org Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Cc: x86@kernel.org Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Kees Cook <keescook@chromium.org>
* sched/idle: Move cpu/idle.c to sched/idle.cNicolas Pitre2017-12-054-2/+1
| | | | | | | | | | | | | | | Integration of cpuidle with the scheduler requires that the idle loop be closely integrated with the scheduler proper. Moving cpu/idle.c into the sched directory will allow for a smoother integration, and eliminate a subdirectory which contained only one source file. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/alpine.LFD.2.11.1401301102210.1652@knanqh.ubzr Signed-off-by: Ingo Molnar <mingo@kernel.org>
* PM: convert do_each_thread to for_each_process_threadMichal Hocko2017-12-051-8/+8
| | | | | | | | | | | | as per 0c740d0afc3b (introduce for_each_thread() to replace the buggy while_each_thread()) get rid of do_each_thread { } while_each_thread() construct and replace it by a more error prone for_each_thread. This patch doesn't introduce any user visible change. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* perf: Optimize group_sched_in()Peter Zijlstra2017-12-051-1/+1
| | | | | | | | | | | | | | | | Use the ctx pmu instead of the event pmu. When a group leader is a software event but the group contains hardware events, the entire group is on the hardware PMU. Using the hardware PMU for the transaction makes most sense since that's the most expensive one to programm (and software PMUs generally don't have TXN support anyway). Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/n/tip-sctoo9t2f3nn2c9g568928q3@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* workqueue: allow work_on_cpu() to be called recursivelyLai Jiangshan2017-12-051-10/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the @fn call work_on_cpu() again, the lockdep will complain: > [ INFO: possible recursive locking detected ] > 3.11.0-rc1-lockdep-fix-a #6 Not tainted > --------------------------------------------- > kworker/0:1/142 is trying to acquire lock: > ((&wfc.work)){+.+.+.}, at: [<ffffffff81077100>] flush_work+0x0/0xb0 > > but task is already holding lock: > ((&wfc.work)){+.+.+.}, at: [<ffffffff81075dd9>] process_one_work+0x169/0x610 > > other info that might help us debug this: > Possible unsafe locking scenario: > > CPU0 > ---- > lock((&wfc.work)); > lock((&wfc.work)); > > *** DEADLOCK *** It is false-positive lockdep report. In this sutiation, the two "wfc"s of the two work_on_cpu() are different, they are both on stack. flush_work() can't be deadlock. To fix this, we need to avoid the lockdep checking in this case, thus we instroduce a internal __flush_work() which skip the lockdep. tj: Minor comment adjustment. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Reported-by: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Reported-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: move flush_scheduled_work() to workqueue.hLai Jiangshan2017-12-051-30/+0
| | | | | | | flush_scheduled_work() is just a simple call to flush_work(). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* kernel/workqueue.c: pr_warning/pr_warn & printk/pr_infoFabian Frederick2017-12-051-3/+3
| | | | | | | | tj: Refreshed on top of wq/for-3.16. Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: simplify wq_update_unbound_numa() by jumping to use_dfl_pwq if ↵Daeseok Youn2017-12-051-6/+2
| | | | | | | | | | | | | | the target cpumask equals wq's wq_update_unbound_numa(), when it's decided that the newly updated cpumask equals the default, looks at whether the current pwq is already the default one and skips setting pwq to the default one. This extra step is unnecessary and we can always jump to use_dfl_pwq instead. Simplify the code by removing the conditional. This doesn't make any functional difference. Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: wake regular worker if need_more_worker() when rescuer leave the poolLai Jiangshan2017-12-051-2/+2
| | | | | | | | | | | | We don't need to wake up regular worker when nr_running==1, so need_more_worker() is sufficient here. And need_more_worker() gives us better readability due to the name of "keep_working()" implies the rescuer should keep working now but the rescuer is actually leaving. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: alloc struct worker on its local nodeLai Jiangshan2017-12-051-4/+4
| | | | | | | | | | | | | | When the create_worker() is called from non-manager, the struct worker is allocated from the node of the caller which may be different from the node of pool->node. So we add a node ID argument for the alloc_worker() to ensure the struct worker is allocated from the preferable node. tj: @nid renamed to @node for consistency. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: reuse the already calculated pwq in try_to_grab_pending()Lai Jiangshan2017-12-051-1/+1
| | | | | | | | | | | | | try_to_grab_pending() was re-calculating the associated pwq using get_work_pwq() when it already has it cached in a local varible and the association can't change. Reuse the local variable instead. This doesn't introduce any functional changes. tj: Updated description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: stronger test in process_one_work()Lai Jiangshan2017-12-051-7/+1
| | | | | | | | | | | | | | | | | When POOL_DISASSOCIATED is cleared, the running worker's local CPU should be the same as pool->cpu without any exception even during cpu-hotplug. This patch changes "(proposition_A && proposition_B && proposition_C)" to "(proposition_B && proposition_C)", so if the old compound proposition is true, the new one must be true too. so this won't hide any possible bug which can be hit by old test. tj: Minor description update and dropped the obvious comment. CC: Jason J. Herne <jjherne@linux.vnet.ibm.com> CC: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: remove useless WARN_ON_ONCE()Lai Jiangshan2017-12-051-2/+0
| | | | | | | | The @cpu is fetched via smp_processor_id() in this function, so the check is useless. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: sanity check pool->cpu in wq_worker_sleeping()Lai Jiangshan2017-12-051-1/+1
| | | | | | | | | | In theory, pool->cpu is equals to @cpu in wq_worker_sleeping() after worker->flags is checked. And "pool->cpu != cpu" sanity check will help us if something wrong. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: use schedule_timeout_interruptible() instead of open codeLai Jiangshan2017-12-051-2/+1
| | | | | | | | schedule_timeout_interruptible(CREATE_COOLDOWN) is exactly the same as the original code. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: remove the empty check in too_many_workers()Lai Jiangshan2017-12-051-7/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | The commit ea1abd6197d5 ("workqueue: reimplement idle worker rebinding") used a trick which simply removes all to-be-bound idle workers from the idle list and lets them add themselves back after completing rebinding. And this trick caused the @worker_pool->nr_idle may deviate than the actual number of idle workers on @worker_pool->idle_list. More specifically, nr_idle may be non-zero while ->idle_list is empty. All users of ->nr_idle and ->idle_list are audited. The only affected one is too_many_workers() which is updated to check %false if ->idle_list is empty regardless of ->nr_idle. The commit/trick was complicated due to it just tried to simplify an even more complicated problem (workers had to rebind itself). But the commit a9ab775bcadf ("workqueue: directly restore CPU affinity of workers from CPU_ONLINE") fixed all these problems and the mentioned trick was useless and is gone. So, now the @worker_pool->nr_idle is exactly the actual number of workers on @worker_pool->idle_list. too_many_workers() should recover as it was before the trick. So we remove the empty check. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: use "pool->cpu < 0" to stand for an unbound poolLai Jiangshan2017-12-051-1/+1
| | | | | | | | | | | | | | | | | | There is a piece of sanity checks code in the put_unbound_pool(). The meaning of this code is "if it is not an unbound pool, it will complain and return" IIUC. But the code uses "pool->flags & POOL_DISASSOCIATED" imprecisely due to a non-unbound pool may also have this flags. We should use "pool->cpu < 0" to stand for an unbound pool, so we covert the code to it. There is no strictly wrong if we still keep "pool->flags & POOL_DISASSOCIATED" here, but it is just a noise if we keep it: 1) we focus on "unbound" here, not "[dis]association". 2) "pool->cpu < 0" already implies "pool->flags & POOL_DISASSOCIATED". Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: Provide destroy_delayed_work_on_stack()Thomas Gleixner2017-12-051-0/+7
| | | | | | | | | | | | If a delayed or deferrable work is on stack we need to tell debug objects that we are destroying the timer and the work. Otherwise we leak the tracking object. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Acked-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20140323141939.911487677@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* workqueue: fix pool ID allocation leakage and remove BUILD_BUG_ON() in ↵Li Bin2017-12-051-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | init_workqueues When one work starts execution, the high bits of work's data contain pool ID. It can represent a maximum of WORK_OFFQ_POOL_NONE. Pool ID is assigned WORK_OFFQ_POOL_NONE when the work being initialized indicating that no pool is associated and get_work_pool() uses it to check the associated pool. So if worker_pool_assign_id() assigns a ID greater than or equal WORK_OFFQ_POOL_NONE to a pool, it triggers leakage, and it may break the non-reentrance guarantee. This patch fix this issue by modifying the worker_pool_assign_id() function calling idr_alloc() by setting @end param WORK_OFFQ_POOL_NONE. Furthermore, in the current implementation, the BUILD_BUG_ON() in init_workqueues makes no sense. The number of worker pools needed cannot be determined at compile time, because the number of backing pools for UNBOUND workqueues is dynamic based on the assigned custom attributes. So remove it. tj: Minor comment and indentation updates. Signed-off-by: Li Bin <huawei.libin@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: Fix workqueue stall issue after cpu down failureSe Wang (Patrick) Oh2017-12-051-9/+15
| | | | | | | | | | | | | | When the hotplug notifier call chain with CPU_DOWN_PREPARE is broken before reaching workqueue_cpu_down_callback(), rebind_workers() adds WORKER_REBOUND flag for running workers. Hence, the nr_running of the pool is not increased when scheduler wakes up the worker. The fix is skipping adding WORKER_REBOUND flag when the worker doesn't have WORKER_UNBOUND flag in CPU_DOWN_FAILED path. Change-Id: I2528e9154f4913d9ec14b63adbcbcd1eaa8a8452 Signed-off-by: Se Wang (Patrick) Oh <sewango@codeaurora.org> Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com>
* workqueue: clear POOL_DISASSOCIATED in rebind_workers()Lai Jiangshan2017-12-051-4/+1
| | | | | | | | | | | | | | | | | | | | | | a9ab775bcadf ("workqueue: directly restore CPU affinity of workers from CPU_ONLINE") moved pool locking into rebind_workers() but left "pool->flags &= ~POOL_DISASSOCIATED" in workqueue_cpu_up_callback(). There is nothing necessarily wrong with it, but there is no benefit either. Let's move it into rebind_workers() and achieve the following benefits: 1) better readability, POOL_DISASSOCIATED is cleared in rebind_workers() as expected. 2) we can guarantee that, when POOL_DISASSOCIATED is clear, the running workers of the pool are on the local CPU (pool->cpu). tj: Minor description update. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com>
* tracing: do not leak kernel addressesNick Desaulniers2017-12-031-1/+1
| | | | | | | | This likely breaks tracing tools like trace-cmd. It logs in the same format but now addresses are all 0x0. Bug: 34277115 Change-Id: Ifb0d4d2a184bf0d95726de05b1acee0287a375d9
* Revert "power: make sync on suspend optional"Moyster2017-12-012-7/+0
| | | | | | seems like gueste kanged franciscofranco, sorry for that :D This reverts commit e68ce258c35d28b497cdb11e1e5e9949f660d1ed.
* timer/debug: Change /proc/timer_list from 0444 to 0400Ingo Molnar2017-12-011-1/+1
| | | | | | | | | | | While it uses %pK, there's still few reasons to read this file as non-root. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* tracing: Erase irqsoff trace with empty writeBo Yan2017-11-061-2/+8
| | | | | | | | | | | | | | | | | | | | | commit 8dd33bcb7050dd6f8c1432732f930932c9d3a33e upstream. One convenient way to erase trace is "echo > trace". However, this is currently broken if the current tracer is irqsoff tracer. This is because irqsoff tracer use max_buffer as the default trace buffer. Set the max_buffer as the one to be cleared when it's the trace buffer currently in use. Link: http://lkml.kernel.org/r/1505754215-29411-1-git-send-email-byan@nvidia.com Cc: <mingo@redhat.com> Cc: stable@vger.kernel.org Fixes: 4acd4d00f ("tracing: give easy way to clear trace buffer") Signed-off-by: Bo Yan <byan@nvidia.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* tracing: Apply trace_clock changes to instance max bufferBaohong Liu2017-11-061-1/+1
| | | | | | | | | | | | | | | | | | | commit 170b3b1050e28d1ba0700e262f0899ffa4fccc52 upstream. Currently trace_clock timestamps are applied to both regular and max buffers only for global trace. For instance trace, trace_clock timestamps are applied only to regular buffer. But, regular and max buffers can be swapped, for example, following a snapshot. So, for instance trace, bad timestamps can be seen following a snapshot. Let's apply trace_clock timestamps to instance max buffer as well. Link: http://lkml.kernel.org/r/ebdb168d0be042dcdf51f81e696b17fabe3609c1.1504642143.git.tom.zanussi@linux.intel.com Cc: stable@vger.kernel.org Fixes: 277ba0446 ("tracing: Add interface to allow multiple trace buffers") Signed-off-by: Baohong Liu <baohong.liu@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* workqueue: implicit ordered attribute should be overridableTejun Heo2017-11-061-4/+9
| | | | | | | | | | | | | | | | | | | | | commit 0a94efb5acbb6980d7c9ab604372d93cd507e4d8 upstream. 5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") automatically enabled ordered attribute for unbound workqueues w/ max_active == 1. Because ordered workqueues reject max_active and some attribute changes, this implicit ordered mode broke cases where the user creates an unbound workqueue w/ max_active == 1 and later explicitly changes the related attributes. This patch distinguishes explicit and implicit ordered setting and overrides from attribute changes if implict. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") Cc: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* kernel/extable.c: mark core_kernel_text notraceMarcin Nowakowski2017-11-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c0d80ddab89916273cb97114889d3f337bc370ae upstream. core_kernel_text is used by MIPS in its function graph trace processing, so having this method traced leads to an infinite set of recursive calls such as: Call Trace: ftrace_return_to_handler+0x50/0x128 core_kernel_text+0x10/0x1b8 prepare_ftrace_return+0x6c/0x114 ftrace_graph_caller+0x20/0x44 return_to_handler+0x10/0x30 return_to_handler+0x0/0x30 return_to_handler+0x0/0x30 ftrace_ops_no_ops+0x114/0x1bc core_kernel_text+0x10/0x1b8 core_kernel_text+0x10/0x1b8 core_kernel_text+0x10/0x1b8 ftrace_ops_no_ops+0x114/0x1bc core_kernel_text+0x10/0x1b8 prepare_ftrace_return+0x6c/0x114 ftrace_graph_caller+0x20/0x44 (...) Mark the function notrace to avoid it being traced. Link: http://lkml.kernel.org/r/1498028607-6765-1-git-send-email-marcin.nowakowski@imgtec.com Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Meyer <thomas@m3y3r.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* workqueue: restore WQ_UNBOUND/max_active==1 to be orderedTejun Heo2017-11-061-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5c0338c68706be53b3dc472e4308961c36e4ece1 upstream. The combination of WQ_UNBOUND and max_active == 1 used to imply ordered execution. After NUMA affinity 4c16bd327c74 ("workqueue: implement NUMA affinity for unbound workqueues"), this is no longer true due to per-node worker pools. While the right way to create an ordered workqueue is alloc_ordered_workqueue(), the documentation has been misleading for a long time and people do use WQ_UNBOUND and max_active == 1 for ordered workqueues which can lead to subtle bugs which are very difficult to trigger. It's unlikely that we'd see noticeable performance impact by enforcing ordering on WQ_UNBOUND / max_active == 1 workqueues. Let's automatically set __WQ_ORDERED for those workqueues. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Christoph Hellwig <hch@infradead.org> Reported-by: Alexei Potashnik <alexei@purestorage.com> Fixes: 4c16bd327c74 ("workqueue: implement NUMA affinity for unbound workqueues") Cc: stable@vger.kernel.org # v3.10+ Signed-off-by: Willy Tarreau <w@1wt.eu> Conflicts: kernel/workqueue.c
* fork: make whole stack_canary randomJann Horn2017-10-141-1/+1
| | | | | | | | | | | | | | | | On machines with sizeof(unsigned long)==8, this ensures that the more significant 32 bits of stack_canary are random, too. stack_canary is defined as unsigned long, all the architectures with stack protector support already pick the stack_canary of init as a random unsigned long, and get_random_long() should be as fast as get_random_int(), so there seems to be no good reason against this. This should help if someone tries to guess a stack canary with brute force. (This change has been made in PaX already, with a different RNG.) Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org>
* CHROMIUM: remove Android's cgroup generic permissions checksDmitry Torokhov2017-09-302-50/+3
| | | | | | | | | | | | | | | | | | | | The implementation is utterly broken, resulting in all processes being allows to move tasks between sets (as long as they have access to the "tasks" attribute), and upstream is heading towards checking only capability anyway, so let's get rid of this code. BUG=b:31790445,chromium:647994 TEST=Boot android container, examine logcat Change-Id: I2f780a5992c34e52a8f2d0b3557fc9d490da2779 Signed-off-by: Dmitry Torokhov <dtor@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/394967 Reviewed-by: Ricky Zhou <rickyz@chromium.org> Reviewed-by: John Stultz <john.stultz@linaro.org> (cherry picked from commit 6895149f8bf0719aa70487e285fa6a8ad3d2692d) Reviewed-on: https://chromium-review.googlesource.com/399858 Reviewed-by: Guenter Roeck <groeck@chromium.org> Signed-off-by: Mister Oyster <oysterized@gmail.com>
* misc: replace __FUNCTION__ by __function__Moyster2017-09-233-4/+4
| | | | | result of : git grep -l '__FUNCTION__' | xargs sed -i 's/__FUNCTION__/__func__/g'
* drivers: merged Android Binder from 4.9Lukas06102017-09-161-9/+0
| | | | | Change-Id: I857ef86b2d502293fb8c37398383dceaa21dd29f Signed-off-by: Mister Oyster <oysterized@gmail.com>
* lib: vsprintf: whitelist stack tracesDave Weinstein2017-09-141-1/+1
| | | | | | | | | Use the %pP functionality to explicitly allow kernel pointers to be logged for stack traces BUG: 30368199 Change-Id: I495915465565293e9e4da5aa28fbd1d14538d99b Signed-off-by: Dave Weinstein <olorin@google.com>
* perf/core: Fix group {cpu,task} validationMark Rutland2017-09-111-20/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 64aee2a965cf2954a038b5522f11d2cd2f0f8f3e upstream. Regardless of which events form a group, it does not make sense for the events to target different tasks and/or CPUs, as this leaves the group inconsistent and impossible to schedule. The core perf code assumes that these are consistent across (successfully intialised) groups. Core perf code only verifies this when moving SW events into a HW context. Thus, we can violate this requirement for pure SW groups and pure HW groups, unless the relevant PMU driver happens to perform this verification itself. These mismatched groups subsequently wreak havoc elsewhere. For example, we handle watchpoints as SW events, and reserve watchpoint HW on a per-CPU basis at pmu::event_init() time to ensure that any event that is initialised is guaranteed to have a slot at pmu::add() time. However, the core code only checks the group leader's cpu filter (via event_filter_match()), and can thus install follower events onto CPUs violating thier (mismatched) CPU filters, potentially installing them into a CPU without sufficient reserved slots. This can be triggered with the below test case, resulting in warnings from arch backends. #define _GNU_SOURCE #include <linux/hw_breakpoint.h> #include <linux/perf_event.h> #include <sched.h> #include <stdio.h> #include <sys/prctl.h> #include <sys/syscall.h> #include <unistd.h> static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu, int group_fd, unsigned long flags) { return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags); } char watched_char; struct perf_event_attr wp_attr = { .type = PERF_TYPE_BREAKPOINT, .bp_type = HW_BREAKPOINT_RW, .bp_addr = (unsigned long)&watched_char, .bp_len = 1, .size = sizeof(wp_attr), }; int main(int argc, char *argv[]) { int leader, ret; cpu_set_t cpus; /* * Force use of CPU0 to ensure our CPU0-bound events get scheduled. */ CPU_ZERO(&cpus); CPU_SET(0, &cpus); ret = sched_setaffinity(0, sizeof(cpus), &cpus); if (ret) { printf("Unable to set cpu affinity\n"); return 1; } /* open leader event, bound to this task, CPU0 only */ leader = perf_event_open(&wp_attr, 0, 0, -1, 0); if (leader < 0) { printf("Couldn't open leader: %d\n", leader); return 1; } /* * Open a follower event that is bound to the same task, but a * different CPU. This means that the group should never be possible to * schedule. */ ret = perf_event_open(&wp_attr, 0, 1, leader, 0); if (ret < 0) { printf("Couldn't open mismatched follower: %d\n", ret); return 1; } else { printf("Opened leader/follower with mismastched CPUs\n"); } /* * Open as many independent events as we can, all bound to the same * task, CPU0 only. */ do { ret = perf_event_open(&wp_attr, 0, 0, -1, 0); } while (ret >= 0); /* * Force enable/disble all events to trigger the erronoeous * installation of the follower event. */ printf("Opened all events. Toggling..\n"); for (;;) { prctl(PR_TASK_PERF_EVENTS_DISABLE, 0, 0, 0, 0); prctl(PR_TASK_PERF_EVENTS_ENABLE, 0, 0, 0, 0); } return 0; } Fix this by validating this requirement regardless of whether we're moving events. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Zhou Chengming <zhouchengming1@huawei.com> Link: http://lkml.kernel.org/r/1498142498-15758-1-git-send-email-mark.rutland@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>
* hrtimer: Prevent enqueue of hrtimer on dead CPUMichael Bohan2017-09-111-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Date Wed, 10 Apr 2013 14:07:48 -0700 When switching the hrtimer cpu_base, we briefly allow for preemption to become enabled by unlocking the cpu_base lock. During this time, the CPU corresponding to the new cpu_base that was selected may in fact go offline. In this scenario, the hrtimer is enqueued to a CPU that's not online, and therefore it never fires. As an example, consider this example: CPU #0 CPU #1 ---- ---- ... hrtimer_start() lock_hrtimer_base() switch_hrtimer_base() cpu = hrtimer_get_target() -> 1 spin_unlock(&cpu_base->lock) <migrate thread to CPU #0> <offline> spin_lock(&new_base->lock) this_cpu = 0 cpu != this_cpu enqueue_hrtimer(cpu_base #1) To prevent this scenario, verify that the CPU corresponding to the new cpu_base is indeed online before selecting it in hrtimer_switch_base(). If it's not online, fallback to using the base of the current CPU. Change-Id: I3aaf5b806a25d5a8b96d6ccea7a042d2718091f7 Signed-off-by: Michael Bohan <mbohan@codeaurora.org>
* hrtimer: Consider preemption when migrating hrtimer cpu_basesMichael Bohan2017-09-111-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Date Wed, 10 Apr 2013 14:07:47 -0700 When switching to a new cpu_base in switch_hrtimer_base(), we briefly enable preemption by unlocking the cpu_base lock in two places. During this interval it's possible for the running thread to be swapped to a different CPU. Consider the following example: CPU #0 CPU #1 ---- ---- hrtimer_start() ... lock_hrtimer_base() switch_hrtimer_base() this_cpu = 0; target_cpu_base = 0; raw_spin_unlock(&cpu_base->lock) <migrate to CPU 1> ... this_cpu == 0 cpu == this_cpu timer->base = CPU #0 timer->base != LOCAL_CPU Since the cached this_cpu is no longer accurate, we'll skip the hrtimer_check_target() check. Once we eventually go to program the hardware, we'll decide not to do so since it knows the real CPU that we're running on is not the same as the chosen base. As a consequence, we may end up missing the hrtimer's deadline. Fix this by updating the local CPU number each time we retake a cpu_base lock in switch_hrtimer_base(). Another possibility is to disable preemption across the whole of switch_hrtimer_base. This looks suboptimal since preemption would be disabled while waiting for lock(s). Change-Id: I46be5cf3069f8e6683ad8fff0841949bdb801684 Signed-off-by: Michael Bohan <mbohan@codeaurora.org>
* sched: Micro-optimize the smart wake-affine logicPeter Zijlstra2017-09-113-2/+8
| | | | | | | | | | | | | | | | | | | | | Smart wake-affine is using node-size as the factor currently, but the overhead of the mask operation is high. Thus, this patch introduce the 'sd_llc_size' percpu variable, which will record the highest cache-share domain size, and make it to be the new factor, in order to reduce the overhead and make it more reasonable. Tested-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Tested-by: Michael Wang <wangyun@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Michael Wang <wangyun@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Link: http://lkml.kernel.org/r/51D5008E.6030102@linux.vnet.ibm.com [ Tidied up the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Change-Id: I8fb3ceac1e6944db078932172c99d544a4e304e4 Signed-off-by: Paul Reioux <reioux@gmail.com>