Commit message | Author | Age | Files | Lines
...
* ASoC: jack: Use power efficient workqueueMark Brown2016-08-261-1/+1
| | | | | | | | | | | The accessory detect debounce work is not performance sensitive so let the scheduler run it wherever is most efficient rather than in a per CPU workqueue by using the system power efficient workqueue. Signed-off-by: Mark Brown <broonie@linaro.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> (cherry picked from commit e6058aaadcd473e5827720dc143af56aabbeecc7) Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* regulator: core: Use the power efficient workqueue for delayed powerdownMark Brown2016-08-261-2/+3
| | | | | | | | | | | | There is no need to use a normal per-CPU workqueue for delayed power downs as they're not timing or performance critical and waking up a core for them would defeat some of the point. Signed-off-by: Mark Brown <broonie@linaro.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Liam Girdwood <liam.r.girdwood@intel.com> (cherry picked from commit 070260f07c7daec311f2466eb9d1df475d5a46f8) Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* ASoC: pcm: Use the power efficient workqueue for delayed powerdownMark Brown2016-08-261-2/+3
| | | | | | | | | | | There is no need to use a normal per-CPU workqueue for delayed power downs as they're not timing or performance critical and waking up a core for them would defeat some of the point. Signed-off-by: Mark Brown <broonie@linaro.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> (cherry picked from commit d4e1a73acd4e894f8332f2093bceaef585cfab67) Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
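The three workqueue conversions above share one pattern: non-critical delayed work is queued on system_power_efficient_wq instead of the default per-CPU workqueue. A minimal sketch of that pattern (illustrative only, not the actual diffs; the handler name is made up):

    #include <linux/workqueue.h>
    #include <linux/jiffies.h>

    /* Hypothetical delayed-power-down handler standing in for the ASoC/regulator work items. */
    static void powerdown_fn(struct work_struct *work) { /* ... power down the block ... */ }
    static DECLARE_DELAYED_WORK(powerdown_work, powerdown_fn);

    static void schedule_powerdown(unsigned int delay_ms)
    {
        /* Before: schedule_delayed_work(&powerdown_work, msecs_to_jiffies(delay_ms));
         * After: let the scheduler place the work on an already-awake CPU. */
        queue_delayed_work(system_power_efficient_wq, &powerdown_work,
                           msecs_to_jiffies(delay_ms));
    }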
* fbcon: queue work on power efficient wqViresh Kumar2016-08-261-1/+1
| | | | | | | | | | | | | | | | | | | fbcon uses workqueues and it has no real dependency on scheduling these on the cpu which scheduled them. On an idle system, it is observed that an idle cpu wakes up many times just to service this work. It would be better if we could schedule it on a cpu which the scheduler believes to be the most appropriate one. This patch replaces system_wq with system_power_efficient_wq. Cc: Dave Airlie <airlied@redhat.com> Cc: linux-fbdev@vger.kernel.org Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from commit a85f1a41f020bc2c97611060bcfae6f48a1db28d) Signed-off-by: Mark Brown <broonie@linaro.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* block: queue work on power efficient wqViresh Kumar2016-08-263-6/+12
| | | | | | | | | | | | | | | | | | The block layer uses workqueues for multiple purposes. There is no real dependency on scheduling these on the cpu which scheduled them. On an idle system, it is observed that an idle cpu wakes up many times just to service this work. It would be better if we could schedule it on a cpu which the scheduler believes to be the most appropriate one. This patch replaces normal workqueues with power efficient versions. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from commit 695588f9454bdbc7c1a2fbb8a6bfdcfba6183348) Signed-off-by: Mark Brown <broonie@linaro.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* PHYLIB: queue work on system_power_efficient_wqViresh Kumar2016-08-261-4/+5
| | | | | | | | | | | | | | | | | | | | Phylib uses workqueues for multiple purposes. There is no real dependency on scheduling these on the cpu which scheduled them. On an idle system, it is observed that an idle cpu wakes up many times just to service this work. It would be better if we could schedule it on a cpu which the scheduler believes to be the most appropriate one. This patch replaces system_wq with system_power_efficient_wq for PHYLIB. Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from commit bbb47bdeae756f04b896b55b51f230f3eb21f207) Signed-off-by: Mark Brown <broonie@linaro.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* workqueue: Add system wide power_efficient workqueuesViresh Kumar2016-08-262-1/+20
| | | | | | | | | | | | | | This patch adds system wide workqueues aligned towards power saving. This is done by allocating them with WQ_UNBOUND flag if 'wq_power_efficient' is set to 'true'. tj: updated comments a bit. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from commit 0668106ca3865ba945e155097fb042bf66d364d3) Signed-off-by: Mark Brown <broonie@linaro.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
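For reference, the system-wide queues this commit introduces are allocated roughly like this (a sketch based on the commit description, not a verbatim excerpt):

    /* In kernel/workqueue.c, alongside the existing system_wq allocations: */
    system_power_efficient_wq = alloc_workqueue("events_power_efficient",
                                                WQ_POWER_EFFICIENT, 0);
    system_freezable_power_efficient_wq =
            alloc_workqueue("events_freezable_power_efficient",
                            WQ_FREEZABLE | WQ_POWER_EFFICIENT, 0);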
* workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueuesViresh Kumar2016-08-264-0/+75
| | | | | | | | | | | | | | | | | | | | | | | | | Workqueues can be performance or power-oriented. Currently, most workqueues are bound to the CPU they were created on. This gives good performance (due to cache effects) at the cost of potentially waking up otherwise idle cores (Idle from scheduler's perspective. Which may or may not be physically idle) just to process some work. To save power, we can allow the work to be rescheduled on a core that is already awake. Workqueues created with the WQ_UNBOUND flag will allow some power savings. However, we don't change the default behaviour of the system. To enable power-saving behaviour, a new config option CONFIG_WQ_POWER_EFFICIENT needs to be turned on. This option can also be overridden by the workqueue.power_efficient boot parameter. tj: Updated config description and comments. Renamed CONFIG_WQ_POWER_EFFICIENT to CONFIG_WQ_POWER_EFFICIENT_DEFAULT. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org> Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from commit cee22a15052faa817e3ec8985a28154d3fabc7aa) Signed-off-by: Mark Brown <broonie@linaro.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
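The flag itself is cheap: unless the power-efficient mode is enabled via the config default or the boot parameter, it changes nothing. A sketch of the intended semantics inside alloc_workqueue() (simplified from the commit description):

    bool wq_power_efficient = IS_ENABLED(CONFIG_WQ_POWER_EFFICIENT_DEFAULT);
    module_param_named(power_efficient, wq_power_efficient, bool, 0444);

    /* inside the workqueue allocation path: */
    if ((flags & WQ_POWER_EFFICIENT) && wq_power_efficient)
            flags |= WQ_UNBOUND;    /* trade cache locality for letting idle CPUs sleep */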
* power: make sync on suspend optionalStefan Guendhoer2016-08-262-0/+7
| | | | | | | | | | | | | | | | | propagate from (CR). On embedded devices with built-in batteries, it is not so important to sync the file systems before suspend. The chance of losing power during suspend is no greater than it is when the system is awake. The sync operations can greatly increase suspend latency when the system has accrued many dirty pages and/or the target storage devices are not particularly fast. This commit adds a kernel config option to allow file system sync in the suspend path to be disabled. It is enabled by default.
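A rough sketch of the behaviour described here; the config symbol and helper name are assumptions, not taken from the commit:

    /* kernel/power/suspend.c-style sketch: skip the sync when the (hypothetical)
     * CONFIG_SUSPEND_SYNC_FS option is disabled. The default stays y (sync on). */
    static void suspend_sync_filesystems(void)
    {
            if (!IS_ENABLED(CONFIG_SUSPEND_SYNC_FS))
                    return;     /* embedded targets may opt out to cut suspend latency */

            printk(KERN_INFO "PM: Syncing filesystems ... ");
            sys_sync();
            printk("done.\n");
    }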
* mm: don't use compound_head() in virt_to_head_page()Joonsoo Kim2016-08-261-1/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | compound_head() is implemented with the assumption that there could be a race condition when checking the tail flag. This assumption is only true when we try to access an arbitrarily positioned struct page. The situation in which virt_to_head_page() is called is a different case. We call virt_to_head_page() only in the range of allocated pages, so there is no race condition on the tail flag. In this case, we don't need to handle the race condition and we can reduce overhead slightly. This patch implements compound_head_fast(), which is similar to compound_head() except for the tail flag race handling. And then, virt_to_head_page() uses this optimized function to improve performance. I saw a 1.8% win in a fast-path loop over kmem_cache_alloc/free, (14.063 ns -> 13.810 ns) if the target object is on a tail page. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
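A sketch of the helper described above, assuming the 3.x-era struct page layout where tail pages carry a first_page pointer:

    static inline struct page *compound_head_fast(struct page *page)
    {
            /* The caller guarantees the page is within an allocation it owns,
             * so the tail flag cannot change underneath us: no race handling. */
            if (unlikely(PageTail(page)))
                    return page->first_page;
            return page;
    }

    static inline struct page *virt_to_head_page(const void *x)
    {
            struct page *page = virt_to_page(x);

            return compound_head_fast(page);
    }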
* sched/rt: Reduce rq lock contention by eliminating locking of non-feasible ↵Stefan Guendhoer2016-08-261-0/+10
| | | | target
* sched/rt: Reduce rq lock contention by eliminating locking of non-feasible ↵Tim Chen2016-08-261-1/+16
| | | | | | | | | | | | | | | | | | | | | target This patch adds checks that prevent futile attempts to move rt tasks to a CPU with active tasks of equal or higher priority. This reduces run queue lock contention and improves the performance of a well known OLTP benchmark by 0.7%. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shawn Bohrer <sbohrer@rgmadvisors.com> Cc: Suruchi Kadu <suruchi.a.kadu@intel.com> Cc: Doug Nelson <doug.nelson@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1421430374.2399.27.camel@schen9-desk2.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
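The check itself is a cheap priority comparison performed before taking the target runqueue's lock; a hedged sketch with field names as in 3.x kernel/sched/rt.c (the helper name is invented for illustration):

    /* In the RT push path, before the expensive double_lock_balance(): if the
     * chosen CPU already runs something of equal or higher RT priority (lower
     * numeric prio), pushing there is futile, so don't lock it at all. */
    static bool rt_push_is_feasible(struct rq *lowest_rq, struct task_struct *p)
    {
            return lowest_rq->rt.highest_prio.curr > p->prio;
    }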
* aio: Skip timer for io_getevents if timeout=0Fam Zheng2016-08-261-2/+6
| | | | | | | | | | | | In this case, it is basically polling. Let's not involve a timer at all because that would hurt performance for application event loops. In an arbitrary test I've done, the io_getevents syscall's elapsed time drops from 50000+ nanoseconds to a few hundred. Signed-off-by: Fam Zheng <famz@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
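Conceptually the fast path looks like this (a fragment simplified from fs/aio.c read_events() of that era, not the exact diff):

    if (until.tv64 == 0) {
            /* timeout == 0: the caller is polling, so scan the completion
             * ring once and return without arming an hrtimer. */
            aio_read_events(ctx, min_nr, nr, event, &ret);
    } else {
            wait_event_interruptible_hrtimeout(ctx->wait,
                            aio_read_events(ctx, min_nr, nr, event, &ret),
                            until);
    }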
* workqueue: allow rescuer thread to do more work.NeilBrown2016-08-261-1/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When there is serious memory pressure, all workers in a pool could be blocked, and a new thread cannot be created because it requires memory allocation. In this situation a WQ_MEM_RECLAIM workqueue will wake up the rescuer thread to do some work. The rescuer will only handle requests that are already on ->worklist. If max_requests is 1, that means it will handle a single request. The rescuer will be woken again in 100ms to handle another max_requests requests. I've seen a machine (running a 3.0 based "enterprise" kernel) with thousands of requests queued for xfslogd, which has a max_requests of 1, and is needed for retiring all 'xfs' write requests. When one of the worker pools gets into this state, it progresses extremely slowly and possibly never recovers (only waited an hour or two). With this patch we leave a pool_workqueue on mayday list until it is clearly no longer in need of assistance. This allows all requests to be handled in a timely fashion. We keep each pool_workqueue on the mayday list until need_to_create_worker() is false, and no work for this workqueue is found in the pool. I have tested this in combination with a (hackish) patch which forces all work items to be handled by the rescuer thread. In that context it significantly improves performance. A similar patch for a 3.0 kernel significantly improved performance on a heavy work load. Thanks to Jan Kara for some design ideas, and to Dongsu Park for some comments and testing. tj: Inverted the lock order between wq_mayday_lock and pool->lock with a preceding patch and simplified this patch. Added comment and updated changelog accordingly. Dongsu spotted missing get_pwq() in the simplified code. Cc: Dongsu Park <dongsu.park@profitbricks.com> Cc: Jan Kara <jack@suse.cz> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* mm/slub: optimize alloc/free fastpath by removing preemption on/offJoonsoo Kim2016-08-261-12/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We had to insert a preempt enable/disable in the fastpath a while ago in order to guarantee that tid and kmem_cache_cpu are retrieved on the same cpu. It is the problem only for CONFIG_PREEMPT in which scheduler can move the process to other cpu during retrieving data. Now, I reach the solution to remove preempt enable/disable in the fastpath. If tid is matched with kmem_cache_cpu's tid after tid and kmem_cache_cpu are retrieved by separate this_cpu operation, it means that they are retrieved on the same cpu. If not matched, we just have to retry it. With this guarantee, preemption enable/disable isn't need at all even if CONFIG_PREEMPT, so this patch removes it. I saw roughly 5% win in a fast-path loop over kmem_cache_alloc/free in CONFIG_PREEMPT. (14.821 ns -> 14.049 ns) Below is the result of Christoph's slab_test reported by Jesper Dangaard Brouer. * Before Single thread testing ===================== 1. Kmalloc: Repeatedly allocate then free test 10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles 10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles 10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles 10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles 10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles 10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles 10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles 10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles 10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles 10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles 10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles 10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles 2. Kmalloc: alloc/free test 10000 times kmalloc(8)/kfree -> 68 cycles 10000 times kmalloc(16)/kfree -> 68 cycles 10000 times kmalloc(32)/kfree -> 69 cycles 10000 times kmalloc(64)/kfree -> 68 cycles 10000 times kmalloc(128)/kfree -> 68 cycles 10000 times kmalloc(256)/kfree -> 68 cycles 10000 times kmalloc(512)/kfree -> 74 cycles 10000 times kmalloc(1024)/kfree -> 75 cycles 10000 times kmalloc(2048)/kfree -> 74 cycles 10000 times kmalloc(4096)/kfree -> 74 cycles 10000 times kmalloc(8192)/kfree -> 75 cycles 10000 times kmalloc(16384)/kfree -> 510 cycles * After Single thread testing ===================== 1. Kmalloc: Repeatedly allocate then free test 10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles 10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles 10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles 10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles 10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles 10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles 10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles 10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles 10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles 10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles 10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles 10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles 2. 
Kmalloc: alloc/free test 10000 times kmalloc(8)/kfree -> 65 cycles 10000 times kmalloc(16)/kfree -> 66 cycles 10000 times kmalloc(32)/kfree -> 65 cycles 10000 times kmalloc(64)/kfree -> 66 cycles 10000 times kmalloc(128)/kfree -> 66 cycles 10000 times kmalloc(256)/kfree -> 71 cycles 10000 times kmalloc(512)/kfree -> 72 cycles 10000 times kmalloc(1024)/kfree -> 71 cycles 10000 times kmalloc(2048)/kfree -> 71 cycles 10000 times kmalloc(4096)/kfree -> 71 cycles 10000 times kmalloc(8192)/kfree -> 65 cycles 10000 times kmalloc(16384)/kfree -> 511 cycles Most of the results are better than before. Note that this change slightly worsens performance in !CONFIG_PREEMPT, roughly 0.3%. Implementing each case separately would help performance, but, since it's so marginal, I didn't do that. This also helps maintenance since we have the same code for all cases. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
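The core of the trick is a double read plus compare instead of preempt_disable()/preempt_enable(); a sketch simplified from mm/slub.c of that era (the wrapper function is invented for illustration):

    static void slab_fetch_cpu_state(struct kmem_cache *s,
                                     struct kmem_cache_cpu **cp, unsigned long *tidp)
    {
            struct kmem_cache_cpu *c;
            unsigned long tid;

            do {
                    tid = this_cpu_read(s->cpu_slab->tid);
                    c = this_cpu_ptr(s->cpu_slab);
                    /* If we migrated between the two reads, the tids differ and
                     * we retry on the new CPU; otherwise both values were read
                     * on the same CPU and the cmpxchg-based fastpath may proceed. */
            } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));

            *cp = c;
            *tidp = tid;
    }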
* mm: replace __get_cpu_var uses with this_cpu_ptrChristoph Lameter2016-08-268-12/+12
| | | | | | | | | | | | Replace places where __get_cpu_var() is used for an address calculation with this_cpu_ptr(). Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* locking/rtmutex: Optimize setting task running after being blockedDavidlohr Bueso2016-08-261-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | We explicitly mark the task running after returning from a __rt_mutex_slowlock() call, which does the actual sleeping via wait-wake-trylocking. As such, this patch does two things: (1) refactors the code so that setting current to TASK_RUNNING is done by __rt_mutex_slowlock(), and not by the callers. The downside to this is that it becomes a bit unclear at what point we block. As such I've added a comment that the task blocks when calling __rt_mutex_slowlock() so readers can figure out when it is running again. (2) relaxes setting current's state through __set_current_state(), instead of its more expensive barrier alternative. There was no need for the implied barrier as we're obviously not planning on blocking. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1422857784.18096.1.camel@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* sched/completion: Add lock-free checking of the blocking caseNicholas Mc Guire2016-08-261-0/+9
| | | | | | | | | | | | | | | | | | The "thread would block" case can be checked without grabbing ->wait.lock. [ If the check does not return early then grab the lock and recheck. A memory barrier is not needed as complete() and complete_all() imply a barrier. The ACCESS_ONCE() is needed for calls in a loop that, if inlined, could optimize out the re-fetching of x->done. ] Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1422013307-13200-1-git-send-email-der.herr@hofr.at Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
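A sketch of the resulting try_wait_for_completion() shape (based on the description above; simplified):

    bool try_wait_for_completion(struct completion *x)
    {
            unsigned long flags;
            int ret = 1;

            /* Lock-free early exit: if nothing has been completed yet we would
             * block, so report that without touching wait.lock at all. */
            if (!ACCESS_ONCE(x->done))
                    return 0;

            spin_lock_irqsave(&x->wait.lock, flags);
            if (!x->done)           /* recheck under the lock */
                    ret = 0;
            else
                    x->done--;
            spin_unlock_irqrestore(&x->wait.lock, flags);
            return ret;
    }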
* Optimise apply_slack() for size and speed.Chinmay V S2016-08-261-6/+3
| | | | | | | | | | | | | To apply proper slack, the original algorithm used to prepare a mask and then apply the mask to obtain the appropriately rounded-off absolute time the timer expires. This patch replaces the masking logic with bit-shift logic, thereby reducing the complexity and number of operations and obtaining a minor speed-up. Signed-off-by: Chinmay V S <chinmay.v.s@pathpartnertech.com> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
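The idea in one small illustrative helper: clearing the low bits with two shifts is equivalent to the old mask-and-complement, but cheaper to set up (a sketch of the described optimization, not the actual diff):

    static unsigned long apply_slack_sketch(unsigned long expires,
                                            unsigned long expires_limit)
    {
            unsigned long mask = expires ^ expires_limit;
            int bit;

            if (!mask)
                    return expires;

            /* Highest bit position where the original and slack-extended expiry
             * differ; everything below it may be rounded away. */
            bit = find_last_bit(&mask, BITS_PER_LONG);

            /* Old: expires_limit &= ~((1UL << bit) - 1);
             * New: shift the low bits out and back in. */
            return (expires_limit >> bit) << bit;
    }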
* cpufreq: Introduce new relation for freq selectionStratos Karafotis2016-08-262-1/+12
| | | | | | | | | | | Introduce CPUFREQ_RELATION_C for frequency selection. It selects the frequency with the minimum euclidean distance to target. In case of equal distance between 2 frequencies, it will select the greater frequency. Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
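A self-contained illustration of the selection rule (not the cpufreq table-walk code itself): minimum distance to the target, ties resolved toward the higher frequency.

    static unsigned int pick_closest_freq(const unsigned int *freqs, unsigned int n,
                                          unsigned int target)
    {
            unsigned int best = freqs[0];
            unsigned int i;

            for (i = 1; i < n; i++) {
                    unsigned int d = (freqs[i] > target) ? freqs[i] - target
                                                         : target - freqs[i];
                    unsigned int d_best = (best > target) ? best - target
                                                          : target - best;

                    /* Closer wins; on a tie, prefer the greater frequency. */
                    if (d < d_best || (d == d_best && freqs[i] > best))
                            best = freqs[i];
            }
            return best;
    }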
* audit: kiss goodbye you stupid piece of crap logging messages.Francisco Franco2016-08-261-1/+1
| | | | | | Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com> Signed-off-by: engstk <eng.stk@sapo.pt> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* mm: reduce readahead from 1024kB to 512kBStefan Guendhoer2016-08-261-1/+1
|
* block: row: add magic values.franciscofranco2016-08-261-9/+9
| | | | | Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* Add ROW IO scheduler.Raymond Golo (intersectRaven)2016-08-263-0/+1147
| | | | | | | Import ROW scheduler from lenok source with additional changes due to past MediaTek commits to ensure compatibility. Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* mm: pass readahead info down to the i/o schedulerLee Susman2016-08-263-0/+5
| | | | | | | | | | | | | | | | Some i/o schedulers (e.g. row-iosched, cfq-iosched) deploy an idling algorithm in order to be better synced with the readahead algorithm. Idling is a prediction algorithm for incoming read requests. In this patch we mark pages which are part of a readahead window by setting a newly introduced flag. With this flag, the i/o scheduler can identify a request which is associated with a readahead page. This enables the i/o scheduler's idling mechanism to stay in sync with the readahead mechanism and, in turn, can increase read throughput. Change-Id: I0654f23315b6d19d71bcc9cc029c6b281a44b196 Signed-off-by: Lee Susman <lsusman@codeaurora.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* block: add REQ_URGENT to request flagsTatyana Brokhman2016-08-261-1/+3
| | | | | | | | | | | | This patch adds a new flag to be used in the cmd_flags field of struct request for marking a request as urgent. An urgent request is one that should be given priority over the currently handled (regular) request by the device driver. The decision on a request's urgency is taken by the scheduler. Change-Id: Ic20470987ef23410f1d0324f96f00578f7df8717 Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* block: Add API for urgent request handlingTatyana Brokhman2016-08-266-2/+58
| | | | | | | | | | | | | | | | | This patch adds support in the block & elevator layers for handling urgent requests. The decision whether a request is urgent or not is taken by the scheduler. Urgent request notification is passed to the underlying block device driver (eMMC for example). The block device driver may decide to interrupt the currently running low priority request to serve the new urgent request. By doing so, READ latency is greatly reduced in read&write collision scenarios. Note that if the current scheduler doesn't implement the urgent request mechanism, this code path is never activated. Change-Id: I8aa74b9b45c0d3a2221bd4e82ea76eb4103e7cfa Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* block: Add support for reinsert a dispatched reqTatyana Brokhman2016-08-264-0/+86
| | | | | | | | | | | | | | | | | | | Add support for reinserting a dispatched request back to the scheduler's internal data structures. This capability is used by the device driver when it chooses to interrupt the current request transmission and execute another (more urgent) pending request. For example: interrupting long write in order to handle pending read. The device driver re-inserts the remaining write request back to the scheduler, to be rescheduled for transmission later on. Add API for verifying whether the current scheduler supports reinserting requests mechanism. If reinsert mechanism isn't supported by the scheduler, this code path will never be activated. Change-Id: I5c982a66b651ebf544aae60063ac8a340d79e67f Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* block, bfq: add Early Queue Merge (EQM) to BFQ-v7r8 for 3.10.8+Mauro Andreolini2016-08-263-252/+580
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A set of processes may happen to perform interleaved reads, i.e., requests whose union would give rise to a sequential read pattern. There are two typical cases: in the first case, processes read fixed-size chunks of data at a fixed distance from each other, while in the second case processes may read variable-size chunks at variable distances. The latter case occurs for example with QEMU, which splits the I/O generated by the guest into multiple chunks, and lets these chunks be served by a pool of cooperating processes, iteratively assigning the next chunk of I/O to the first available process. CFQ uses actual queue merging for the first type of processes, whereas it uses preemption to get a sequential read pattern out of the read requests performed by the second type of processes. In the end it uses two different mechanisms to achieve the same goal: boosting the throughput with interleaved I/O. This patch introduces Early Queue Merge (EQM), a unified mechanism to get a sequential read pattern with both types of processes. The main idea is checking newly arrived requests against the next request of the active queue both in case of actual request insert and in case of request merge. By doing so, both types of processes can be handled by just merging their queues. EQM is then simpler and more compact than the pair of mechanisms used in CFQ. Finally, EQM also preserves the typical low-latency properties of BFQ, by properly restoring the weight-raising state of a queue when it gets back to a non-merged state. Change-Id: I464d77705cfd2a382150264c52405887e3903358 Signed-off-by: Mauro Andreolini <mauro.andreolini@unimore.it> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* block: introduce the BFQ-v7r8 I/O sched for 3.10.8+Paolo Valente2016-08-265-0/+6824
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the BFQ-v7r8 I/O scheduler to 3.10.8+. The general structure is borrowed from CFQ, as much of the code for handling I/O contexts Over time, several useful features have been ported from CFQ as well (details in the changelog in README.BFQ). A (bfq_)queue is associated to each task doing I/O on a device, and each time a scheduling decision has to be made a queue is selected and served until it expires. - Slices are given in the service domain: tasks are assigned budgets, measured in number of sectors. Once got the disk, a task must however consume its assigned budget within a configurable maximum time (by default, the maximum possible value of the budgets is automatically computed to comply with this timeout). This allows the desired latency vs "throughput boosting" tradeoff to be set. - Budgets are scheduled according to a variant of WF2Q+, implemented using an augmented rb-tree to take eligibility into account while preserving an O(log N) overall complexity. - A low-latency tunable is provided; if enabled, both interactive and soft real-time applications are guaranteed a very low latency. - Latency guarantees are preserved also in the presence of NCQ. - Also with flash-based devices, a high throughput is achieved while still preserving latency guarantees. - BFQ features Early Queue Merge (EQM), a sort of fusion of the cooperating-queue-merging and the preemption mechanisms present in CFQ. EQM is in fact a unified mechanism that tries to get a sequential read pattern, and hence a high throughput, with any set of processes performing interleaved I/O over a contiguous sequence of sectors. - BFQ supports full hierarchical scheduling, exporting a cgroups interface. Since each node has a full scheduler, each group can be assigned its own weight. - If the cgroups interface is not used, only I/O priorities can be assigned to processes, with ioprio values mapped to weights with the relation weight = IOPRIO_BE_NR - ioprio. - ioprio classes are served in strict priority order, i.e., lower priority queues are not served as long as there are higher priority queues. Among queues in the same class the bandwidth is distributed in proportion to the weight of each queue. A very thin extra bandwidth is however guaranteed to the Idle class, to prevent it from starving. Change-Id: Ib51ef300946b21e19cbaad97eb415c05bddd3ead Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* sched/idle: Avoid spurious wakeup IPIsPeter Zijlstra2016-08-261-6/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because mwait_idle_with_hints() gets called from !idle context it must call current_clr_polling(). This however means that resched_task() is very likely to send an IPI even when we were polling: CPU0 CPU1 if (current_set_polling_and_test()) goto out; __monitor(&ti->flags); if (!need_resched()) __mwait(eax, ecx); set_tsk_need_resched(p); smp_mb(); out: current_clr_polling(); if (!tsk_is_polling(p)) smp_send_reschedule(cpu); So while it is correct (extra IPIs aren't a problem, whereas a missed IPI would be) it is a performance problem (for some). Avoid this issue by using fetch_or() to atomically set NEED_RESCHED and test if POLLING_NRFLAG is set. Since a CPU stuck in mwait is unlikely to modify the flags word, contention on the cmpxchg is unlikely and thus we should mostly succeed in a single go. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Nicolas Pitre <nico@linaro.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-kf5suce6njh5xf5d3od13rr0@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
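The fetch_or() trick in isolation (a sketch of the shape the commit describes; fetch_or() is the atomic read-modify-write helper the patch relies on):

    /* Atomically set NEED_RESCHED and, from the returned old flags word, learn
     * whether the target CPU was polling on its flags. Only send the IPI if it
     * was not polling, i.e. it really is sleeping/running and must be kicked. */
    static bool set_nr_and_not_polling(struct task_struct *p)
    {
            struct thread_info *ti = task_thread_info(p);

            return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
    }

    /* caller side:
     *      if (set_nr_and_not_polling(p))
     *              smp_send_reschedule(cpu_of(rq));
     */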
* lib/string: use glibc versionDooMLoRD2016-08-261-17/+12
| | | | | | | | | The generic versions of memcpy and memmove perform poorly; this patch improves them. Signed-off-by: Miao Xie <miaox*******> Signed-off-by: faux123 <reioux@gmail.com> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* lib/memcopy: use glibc versionfaux1232016-08-263-1/+630
| | | | | | | | | | | | | | | | | | The kernel's memcpy and memmove are very inefficient, while the glibc versions are quite fast, in some cases 10 times faster than the kernel version. So I introduce some memory copy macros and functions from glibc to improve the kernel version's performance. The strategy of the memory functions is: 1. Copy bytes until the destination pointer is aligned. 2. Copy words in unrolled loops. If the source and destination are not aligned in the same way, use word memory operations, but shift and merge two read words before writing. 3. Copy the few remaining bytes. Signed-off-by: Miao Xie <miaox*******> Signed-off-by: faux123 <reioux@gmail.com> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
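A much-simplified illustration of the three-step strategy listed above (aligned case only; the real code unrolls the word loop and uses shift-and-merge when source and destination alignment differ):

    void *memcpy_sketch(void *dst, const void *src, size_t n)
    {
            unsigned char *d = dst;
            const unsigned char *s = src;

            /* 1. Copy bytes until the destination pointer is word aligned. */
            while (n && ((unsigned long)d & (sizeof(long) - 1))) {
                    *d++ = *s++;
                    n--;
            }

            /* 2. Copy whole words (assumes the source is now aligned too). */
            for (; n >= sizeof(long); n -= sizeof(long)) {
                    *(unsigned long *)d = *(const unsigned long *)s;
                    d += sizeof(long);
                    s += sizeof(long);
            }

            /* 3. Copy the few remaining bytes. */
            while (n--)
                    *d++ = *s++;

            return dst;
    }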
* cpuidle: remove cross-cpu IPI by new latency request. when drivers request ↵Guojian Chen2016-08-261-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | new latency requirement, it's not necessary to immediately wake up another cpu by sending cross-cpu IPI, we can consider the new latency to be taken into effect after next wakeup from idle, this can save the unnecessary wakeup cost, and reduce the risk that drivers may request latency in irq disabled context. [<c08e0cc0>] (__irq_svc+0x40/0x70) from [<c00e801c>] (smp_call_function_single+0x16c/0x240) [<c00e801c>] (smp_call_function_single+0x16c/0x240) from [<c00e855c>] (smp_call_function+0x40/0x6c) [<c00e855c>] (smp_call_function+0x40/0x6c) from [<c0601c9c>] (cpuidle_latency_notify+0x18/0x20) [<c0601c9c>] (cpuidle_latency_notify+0x18/0x20) from [<c00b7c28>] (blocking_notifier_call_chain+0x74/0x94) [<c00b7c28>] (blocking_notifier_call_chain+0x74/0x94) from [<c00d563c>] (pm_qos_update_target+0xe0/0x128) [<c00d563c>] (pm_qos_update_target+0xe0/0x128) from [<c0620d3c>] (msmsdcc_enable+0xac/0x158) [<c0620d3c>] (msmsdcc_enable+0xac/0x158) from [<c06050e0>] (mmc_try_claim_host+0xb0/0xb8) [<c06050e0>] (mmc_try_claim_host+0xb0/0xb8) from [<c0605318>] (mmc_start_bkops.part.15+0x50/0x2f4) [<c0605318>] (mmc_start_bkops.part.15+0x50/0x2f4) from [<c00ab768>] (process_one_work+0x124/0x55c) [<c00ab768>] (process_one_work+0x124/0x55c) from [<c00abfc8>] (worker_thread+0x178/0x45c) [<c00abfc8>] (worker_thread+0x178/0x45c) from [<c00b0b24>] (kthread+0x84/0x90) [<c00b0b24>] (kthread+0x84/0x90) from [<c000fdd4>] (kernel_thread_exit+0x0/0x8) Disabling lock debugging due to kernel taint coresight-etb coresight-etb.0: ETB aborted Kernel panic - not syncing: softlockup: hung tasks Signed-off-by: Guojian Chen <a21757@motorola.com> Reviewed-on: http://gerrit.pcs.mot.com/532702 SLT-Approved: Slta Waiver <sltawvr@motorola.com> Tested-by: Jira Key <jirakey@motorola.com> Reviewed-by: Klocwork kwcheck <klocwork-kwcheck@sourceforge.mot.com> Reviewed-by: Christopher Fries <qcf001@motorola.com> Reviewed-by: David Ding <dding@motorola.com> Submit-Approved: Jira Key <jirakey@motorola.com> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* jiffies conversions: Use compile time constants when possibleJoe Perches2016-08-262-98/+144
| | | | | | | | | | | | | | | Do the multiplications and divisions at compile time instead of runtime when the converted value is a constant. Make the calculation functions static __always_inline to jiffies.h. Add #defines with __builtin_constant_p to test and use the static inline or the runtime functions as appropriate. Prefix the old exported symbols/functions with __ Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
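The pattern the commit describes, sketched for the simple case where MSEC_PER_SEC is a multiple of HZ (names are illustrative, not the exact symbols from the patch):

    /* Runtime fallback, kept out of line and now prefixed with "__". */
    extern unsigned long __msecs_to_jiffies(const unsigned int m);

    /* Compile-time path: with a constant argument the compiler folds this. */
    static __always_inline unsigned long _msecs_to_jiffies(const unsigned int m)
    {
            return (m + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ);
    }

    #define msecs_to_jiffies(m)                                 \
            (__builtin_constant_p(m) ? _msecs_to_jiffies(m)     \
                                     : __msecs_to_jiffies(m))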
* hotplug cpu touch frequencyStefan Guendhoer2016-08-261-1/+2
|
* Remove the '+' sign on the kernel local version.franciscofranco2016-08-261-1/+0
| | | | Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* Makefile: use ccacheLevin Calado2016-08-261-2/+2
| | | | Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* Revert "fib_rules: fix fib rule dumps across multiple skbs"Levin Calado2016-08-261-9/+5
| | | | | | This reverts commit d0550a3f2d313a5310ace03299d45ef63edfb750. Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* Linux 3.10.90Greg Kroah-Hartman2016-08-261-1/+1
| | | | Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* Revert "iio: bmg160: IIO_BUFFER and IIO_TRIGGERED_BUFFER are required"Markus Pargmann2016-08-261-2/+1
| | | | | | | | | | | | | This reverts commit 35c45e8bce3c92fb1ff94d376f1d4bfaae079d66 which was commit 06d2f6ca5a38abe92f1f3a132b331eee773868c3 upstream as it should not have been applied. Reported-by: Luis Henriques <luis.henriques@canonical.com> Cc: Markus Pargmann <mpa@pengutronix.de> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Jonathan Cameron <jic23@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* vfs: Remove incorrect debugging WARN in prepend_pathEric W. Biederman2016-08-261-9/+0
| | | | | | | | | | | | | | | | | | | | | | | | commit 93e3bce6287e1fb3e60d3324ed08555b5bbafa89 upstream. The warning message in prepend_path is unclear and outdated. It was added as a warning that the mechanism for generating names of pseudo files had been removed from prepend_path and d_dname should be used instead. Unfortunately the warning reads like a general warning, making it unclear what to do with it. Remove the warning. The transition it was added to warn about is long over, and I added code several years ago which in rare cases causes the warning to fire on legitimate code, and the warning is now firing and scaring people for no good reason. Reported-by: Ivan Delalande <colona@arista.com> Reported-by: Omar Sandoval <osandov@osandov.com> Fixes: f48cfddc6729e ("vfs: In d_path don't call d_dname on a mount point") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> [ vlee: Backported to 3.10. Adjusted context. ] Signed-off-by: Vinson Lee <vlee@twitter.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* fib_rules: fix fib rule dumps across multiple skbsWilson Kok2016-08-261-5/+9
| | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 41fc014332d91ee90c32840bf161f9685b7fbf2b ] dump_rules returns skb length and not error. But when family == AF_UNSPEC, the caller of dump_rules assumes that it returns an error. Hence, when family == AF_UNSPEC, we continue trying to dump on -EMSGSIZE errors resulting in incorrect dump idx carried between skbs belonging to the same dump. This results in fib rule dump always only dumping rules that fit into the first skb. This patch fixes dump_rules to return error so that we exit correctly and idx is correctly maintained between skbs that are part of the same dump. Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* sctp: fix race on protocol/netns initializationMarcelo Ricardo Leitner2016-08-261-23/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 8e2d61e0aed2b7c4ecb35844fe07e0b2b762dee4 ] Consider sctp module is unloaded and is being requested because an user is creating a sctp socket. During initialization, sctp will add the new protocol type and then initialize pernet subsys: status = sctp_v4_protosw_init(); if (status) goto err_protosw_init; status = sctp_v6_protosw_init(); if (status) goto err_v6_protosw_init; status = register_pernet_subsys(&sctp_net_ops); The problem is that after those calls to sctp_v{4,6}_protosw_init(), it is possible for userspace to create SCTP sockets like if the module is already fully loaded. If that happens, one of the possible effects is that we will have readers for net->sctp.local_addr_list list earlier than expected and sctp_net_init() does not take precautions while dealing with that list, leading to a potential panic but not limited to that, as sctp_sock_init() will copy a bunch of blank/partially initialized values from net->sctp. The race happens like this: CPU 0 | CPU 1 socket() | __sock_create | socket() inet_create | __sock_create list_for_each_entry_rcu( | answer, &inetsw[sock->type], | list) { | inet_create /* no hits */ | if (unlikely(err)) { | ... | request_module() | /* socket creation is blocked | * the module is fully loaded | */ | sctp_init | sctp_v4_protosw_init | inet_register_protosw | list_add_rcu(&p->list, | last_perm); | | list_for_each_entry_rcu( | answer, &inetsw[sock->type], sctp_v6_protosw_init | list) { | /* hit, so assumes protocol | * is already loaded | */ | /* socket creation continues | * before netns is initialized | */ register_pernet_subsys | Simply inverting the initialization order between register_pernet_subsys() and sctp_v4_protosw_init() is not possible because register_pernet_subsys() will create a control sctp socket, so the protocol must be already visible by then. Deferring the socket creation to a work-queue is not good specially because we loose the ability to handle its errors. So, as suggested by Vlad, the fix is to split netns initialization in two moments: defaults and control socket, so that the defaults are already loaded by when we register the protocol, while control socket initialization is kept at the same moment it is today. Fixes: 4db67e808640 ("sctp: Make the address lists per network namespace") Signed-off-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* net/ipv6: Correct PIM6 mrt_lock handlingRichard Laing2016-08-261-1/+1
| | | | | | | | | | | | | | | | | [ Upstream commit 25b4a44c19c83d98e8c0807a7ede07c1f28eab8b ] In the IPv6 multicast routing code the mrt_lock was not being released correctly in the MFC iterator, as a result adding or deleting a MIF would cause a hang because the mrt_lock could not be acquired. This fix is a copy of the code for the IPv4 case and ensures that the lock is released correctly. Signed-off-by: Richard Laing <richard.laing@alliedtelesis.co.nz> Acked-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* ipv6: fix exthdrs offload registration in out_rt pathDaniel Borkmann2016-08-261-1/+1
| | | | | | | | | | | | | | | | [ Upstream commit e41b0bedba0293b9e1e8d1e8ed553104b9693656 ] We previously register IPPROTO_ROUTING offload under inet6_add_offload(), but in error path, we try to unregister it with inet_del_offload(). This doesn't seem correct, it should actually be inet6_del_offload(), also ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it also uses rthdr_offload twice), but it got removed entirely later on. Fixes: 3336288a9fea ("ipv6: Switch to using new offload infrastructure.") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* usbnet: Get EVENT_NO_RUNTIME_PM bit before it is clearedEugene Shatokhin2016-08-261-3/+4
| | | | | | | | | | | | | | | | | [ Upstream commit f50791ac1aca1ac1b0370d62397b43e9f831421a ] It is needed to check EVENT_NO_RUNTIME_PM bit of dev->flags in usbnet_stop(), but its value should be read before it is cleared when dev->flags is set to 0. The problem was spotted and the fix was provided by Oliver Neukum <oneukum@suse.de>. Signed-off-by: Eugene Shatokhin <eugene.shatokhin@rosalab.ru> Acked-by: Oliver Neukum <oneukum@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* ip6_gre: release cached dst on tunnel removalhuaibin Wang2016-08-261-0/+1
| | | | | | | | | | | | | | | | | | [ Upstream commit d4257295ba1b389c693b79de857a96e4b7cd8ac0 ] When a tunnel is deleted, the cached dst entry should be released. This problem may prevent the removal of a netns (seen with a x-netns IPv6 gre tunnel): unregister_netdevice: waiting for lo to become free. Usage count = 3 CC: Dmitry Kozlov <xeb@mail.ru> Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Signed-off-by: huaibin Wang <huaibin.wang@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* rds: fix an integer overflow test in rds_info_getsockopt()Dan Carpenter2016-08-261-1/+1
| | | | | | | | | | | | | | | | | | [ Upstream commit 468b732b6f76b138c0926eadf38ac88467dcd271 ] "len" is a signed integer. We check that len is not negative, so it goes from zero to INT_MAX. PAGE_SIZE is unsigned long so the comparison is type promoted to unsigned long. ULONG_MAX - 4095 is a higher than INT_MAX so the condition can never be true. I don't know if this is harmful but it seems safe to limit "len" to INT_MAX - 4095. Fixes: a8c879a7ee98 ('RDS: Info and stats') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
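A tiny userspace demonstration of the promotion pitfall described above (values and names invented for illustration):

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
            int len = INT_MAX;              /* non-negative, so it passed the len < 0 check */
            unsigned long page_size = 4096;

            /* 'len' is promoted to unsigned long, so this compares against
             * ULONG_MAX - 4095 and can never be true. */
            if (len > ULONG_MAX - page_size)
                    puts("old check fires (never happens)");

            /* Bounding in the signed domain first catches the oversized value. */
            if (len > (long)(INT_MAX - page_size))
                    puts("new check fires");

            return 0;
    }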
* netlink: don't hold mutex in rcu callback when releasing mmapd ringFlorian Westphal2016-08-261-32/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 0470eb99b4721586ccac954faac3fa4472da0845 ] Kirill A. Shutemov says: This simple test-case trigers few locking asserts in kernel: int main(int argc, char **argv) { unsigned int block_size = 16 * 4096; struct nl_mmap_req req = { .nm_block_size = block_size, .nm_block_nr = 64, .nm_frame_size = 16384, .nm_frame_nr = 64 * block_size / 16384, }; unsigned int ring_size; int fd; fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC); if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0) exit(1); if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0) exit(1); ring_size = req.nm_block_nr * req.nm_block_size; mmap(NULL, 2 * ring_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); return 0; } +++ exited with 0 +++ BUG: sleeping function called from invalid context at /home/kas/git/public/linux-mm/kernel/locking/mutex.c:616 in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init 3 locks held by init/1: #0: (reboot_mutex){+.+...}, at: [<ffffffff81080959>] SyS_reboot+0xa9/0x220 #1: ((reboot_notifier_list).rwsem){.+.+..}, at: [<ffffffff8107f379>] __blocking_notifier_call_chain+0x39/0x70 #2: (rcu_callback){......}, at: [<ffffffff810d32e0>] rcu_do_batch.isra.49+0x160/0x10c0 Preemption disabled at:[<ffffffff8145365f>] __delay+0xf/0x20 CPU: 1 PID: 1 Comm: init Not tainted 4.1.0-00009-gbddf4c4818e0 #253 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Debian-1.8.2-1 04/01/2014 ffff88017b3d8000 ffff88027bc03c38 ffffffff81929ceb 0000000000000102 0000000000000000 ffff88027bc03c68 ffffffff81085a9d 0000000000000002 ffffffff81ca2a20 0000000000000268 0000000000000000 ffff88027bc03c98 Call Trace: <IRQ> [<ffffffff81929ceb>] dump_stack+0x4f/0x7b [<ffffffff81085a9d>] ___might_sleep+0x16d/0x270 [<ffffffff81085bed>] __might_sleep+0x4d/0x90 [<ffffffff8192e96f>] mutex_lock_nested+0x2f/0x430 [<ffffffff81932fed>] ? _raw_spin_unlock_irqrestore+0x5d/0x80 [<ffffffff81464143>] ? __this_cpu_preempt_check+0x13/0x20 [<ffffffff8182fc3d>] netlink_set_ring+0x1ed/0x350 [<ffffffff8182e000>] ? netlink_undo_bind+0x70/0x70 [<ffffffff8182fe20>] netlink_sock_destruct+0x80/0x150 [<ffffffff817e484d>] __sk_free+0x1d/0x160 [<ffffffff817e49a9>] sk_free+0x19/0x20 [..] Cong Wang says: We can't hold mutex lock in a rcu callback, [..] Thomas Graf says: The socket should be dead at this point. It might be simpler to add a netlink_release_ring() function which doesn't require locking at all. Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name> Diagnosed-by: Cong Wang <cwang@twopensource.com> Suggested-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>