aboutsummaryrefslogtreecommitdiff
path: root/mm
Commit message (Collapse)AuthorAgeFilesLines
...
* uksm: modify ema logic and tidy upRyan Pennucci2017-04-111-57/+41
|
* uksm: enhancements and cleanupsRyan Pennucci2017-04-111-69/+169
| | | | | | | | | | | | - Replace the old throttle multiplier with a divisor - Fix up the logic for saved_pages_to_scan - Fix a couple memory leaks and assorted tiny bugs - Add a cpu_scales tuner to better configure rung CPU usage Signed-off-by: Joe Maples <joe@frap129.org> Conflicts: mm/uksm.c
* uksm: squashed fixupsRyan Pennucci2017-04-111-121/+96
| | | | | | | | | | | | | | | | | | | | | | uksm: cleanups and scheduling changes - Fix deprecated function warnings - Use nice 15 instead of 5 - Use fixed periods instead of adjusting - Rewrite pages_to_scan logic - Modify kthread behavior to freeze more aggressively uksm: fixups - enforce usage limit across all rungs - mark uksm_do_scan hot - frob a few tuning knobs uksm: back off during load Signed-off-by: Joe Maples <joe@frap129.org> Conflicts: mm/uksm.c
* UKSM: cast variable as constNathan Chancellor2017-04-111-1/+1
| | | | | | | | | | | | gcc 4.9 produces this error: mm/uksm.c: In function 'is_full_zero': mm/uksm.c:169:23: warning: initialization discards 'const' qualifier from pointer target type error, forbidden warning: uksm.c:169 Casting *src as a constant fixes this. Could also remove the const from the parameter. Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
* UKSM: remove U64_MAX definitionNathan Chancellor2017-04-111-1/+0
| | | | | | This is already defined in include/linux/kernel.h, causing a compilation error; UKSM has not been updated for the later versions of 3.10 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
* mm,ksm: add __GFP_HIGH to the allocation in alloc_stable_node()zhong jiang2017-04-111-1/+6
| | | | | | | | | | | | | | | | | | | | According to Hugh's suggestion, alloc_stable_node() with GFP_KERNEL can in rare cases cause a hung task warning. At present, if alloc_stable_node() allocation fails, two break_cows may want to allocate a couple of pages, and the issue will come up when free memory is under pressure. We fix it by adding __GFP_HIGH to GFP, to grant access to memory reserves, increasing the likelihood of allocation success. [akpm@linux-foundation.org: tweak comment] Link: http://lkml.kernel.org/r/1474354484-58233-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Suggested-by: Hugh Dickins <hughd@google.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm,ksm: fix endless looping in allocating memory when ksm enablezhong jiang2017-04-111-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I hit the following hung task when runing a OOM LTP test case with 4.1 kernel. Call trace: [<ffffffc000086a88>] __switch_to+0x74/0x8c [<ffffffc000a1bae0>] __schedule+0x23c/0x7bc [<ffffffc000a1c09c>] schedule+0x3c/0x94 [<ffffffc000a1eb84>] rwsem_down_write_failed+0x214/0x350 [<ffffffc000a1e32c>] down_write+0x64/0x80 [<ffffffc00021f794>] __ksm_exit+0x90/0x19c [<ffffffc0000be650>] mmput+0x118/0x11c [<ffffffc0000c3ec4>] do_exit+0x2dc/0xa74 [<ffffffc0000c46f8>] do_group_exit+0x4c/0xe4 [<ffffffc0000d0f34>] get_signal+0x444/0x5e0 [<ffffffc000089fcc>] do_signal+0x1d8/0x450 [<ffffffc00008a35c>] do_notify_resume+0x70/0x78 The oom victim cannot terminate because it needs to take mmap_sem for write while the lock is held by ksmd for read which loops in the page allocator ksm_do_scan scan_get_next_rmap_item down_read get_next_rmap_item alloc_rmap_item #ksmd will loop permanently. There is no way forward because the oom victim cannot release any memory in 4.1 based kernel. Since 4.6 we have the oom reaper which would solve this problem because it would release the memory asynchronously. Nevertheless we can relax alloc_rmap_item requirements and use __GFP_NORETRY because the allocation failure is acceptable as ksm_do_scan would just retry later after the lock got dropped. Such a patch would be also easy to backport to older stable kernels which do not have oom_reaper. While we are at it add GFP_NOWARN so the admin doesn't have to be alarmed by the allocation failure. Link: http://lkml.kernel.org/r/1474165570-44398-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Suggested-by: Hugh Dickins <hughd@google.com> Suggested-by: Michal Hocko <mhocko@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ksm: fix conflict between mmput and scan_get_next_rmap_itemZhou Chengming2017-04-111-5/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A concurrency issue about KSM in the function scan_get_next_rmap_item. task A (ksmd): |task B (the mm's task): | mm = slot->mm; | down_read(&mm->mmap_sem); | | ... | | spin_lock(&ksm_mmlist_lock); | | ksm_scan.mm_slot go to the next slot; | | spin_unlock(&ksm_mmlist_lock); | |mmput() -> | ksm_exit(): | |spin_lock(&ksm_mmlist_lock); |if (mm_slot && ksm_scan.mm_slot != mm_slot) { | if (!mm_slot->rmap_list) { | easy_to_free = 1; | ... | |if (easy_to_free) { | mmdrop(mm); | ... | |So this mm_struct may be freed in the mmput(). | up_read(&mm->mmap_sem); | As we can see above, the ksmd thread may access a mm_struct that already been freed to the kmem_cache. Suppose a fork will get this mm_struct from the kmem_cache, the ksmd thread then call up_read(&mm->mmap_sem), will cause mmap_sem.count to become -1. As suggested by Andrea Arcangeli, unmerge_and_remove_all_rmap_items has the same SMP race condition, so fix it too. My prev fix in function scan_get_next_rmap_item will introduce a different SMP race condition, so just invert the up_read/spin_unlock order as Andrea Arcangeli said. Link: http://lkml.kernel.org/r/1462708815-31301-1-git-send-email-zhouchengming1@huawei.com Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Geliang Tang <geliangtang@163.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Ding Tianhong <dingtianhong@huawei.com> Cc: Li Bin <huawei.libin@huawei.com> Cc: Zhen Lei <thunder.leizhen@huawei.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/ksm.c: mark stable page dirtyMinchan Kim2017-04-111-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The MADV_FREE patchset changes page reclaim to simply free a clean anonymous page with no dirty ptes, instead of swapping it out; but KSM uses clean write-protected ptes to reference the stable ksm page. So be sure to mark that page dirty, so it's never mistakenly discarded. [hughd@google.com: adjusted comments] Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Hugh Dickins <hughd@google.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Shaohua Li <shli@kernel.org> Cc: <yalin.wang2010@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen Gang <gang.chen.5i5j@gmail.com> Cc: Chris Zankel <chris@zankel.net> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jason Evans <je@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mika Penttil <mika.penttila@nextfour.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Rik van Riel <riel@redhat.com> Cc: Roland Dreier <roland@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Shaohua Li <shli@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/ksm.c: use list_for_each_entry_safeGeliang Tang2017-04-111-13/+7
| | | | | | | | | | Use list_for_each_entry_safe() instead of list_for_each_safe() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ksm: unstable_tree_search_insert error checking cleanupAndrea Arcangeli2017-04-111-2/+3
| | | | | | | | | | | | | | | get_mergeable_page() can only return NULL (also in case of errors) or the pinned mergeable page. It can't return an error different than NULL. This optimizes away the unnecessary error check. Add a return after the "out:" label in the callee to make it more readable. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Petr Holasek <pholasek@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ksm: use find_mergeable_vma in try_to_merge_with_ksm_pageAndrea Arcangeli2017-04-111-6/+2
| | | | | | | | | | | | | | Doing the VM_MERGEABLE check after the page == kpage check won't provide any meaningful benefit. The !vma->anon_vma check of find_mergeable_vma is the only superfluous bit in using find_mergeable_vma because the !PageAnon check of try_to_merge_one_page() implicitly checks for that, but it still looks cleaner to share the same find_mergeable_vma(). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Petr Holasek <pholasek@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ksm: use the helper method to do the hlist_empty checkAndrea Arcangeli2017-04-111-1/+1
| | | | | | | | | | | | This just uses the helper function to cleanup the assumption on the hlist_node internals. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Petr Holasek <pholasek@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Joe Maples <joe@frap129.org>
* ksm: don't fail stable tree lookups if walking over stale stable_nodesAndrea Arcangeli2017-04-111-5/+27
| | | | | | | | | | | | | | | | | | | | | | | The stable_nodes can become stale at any time if the underlying pages gets freed. The stable_node gets collected and removed from the stable rbtree if that is detected during the rbtree lookups. Don't fail the lookup if running into stale stable_nodes, just restart the lookup after collecting the stale stable_nodes. Otherwise the CPU spent in the preparation stage is wasted and the lookup must be repeated at the next loop potentially failing a second time in a second stale stable_node. If we don't prune aggressively we delay the merging of the unstable node candidates and at the same time we delay the freeing of the stale stable_nodes. Keeping stale stable_nodes around wastes memory and it can't provide any benefit. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Petr Holasek <pholasek@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Joe Maples <joe@frap129.org>
* ksm: add cond_resched() to the rmap_walksAndrea Arcangeli2017-04-112-0/+5
| | | | | | | | | | | | | | | While at it add it to the file and anon walks too. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Petr Holasek <pholasek@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Joe Maples <joe@frap129.org> Conflicts: mm/rmap.c
* ksm: Don't bug on pmd_trans_hugeJoe Maples2017-04-111-1/+0
| | | | Signed-off-by: Joe Maples <joe@frap129.org>
* mm: ksm use pr_err instead of printkPaul McQuade2017-04-111-2/+2
| | | | | | | | | WARNING: Prefer: pr_err(... to printk(KERN_ERR ... [akpm@linux-foundation.org: remove KERN_ERR] Signed-off-by: Paul McQuade <paulmcquad@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* KSM: mediatek: implement Adaptive KSMAnmin Hsu2016-12-251-1/+144
| | | | | | | | | | | | | | [Detail] implement Adaptive KSM [Solution] implement Adaptive KSM [Feature] Adaptive KSM MTK-Commit-Id: 39cab34ba6273f932a61698513284383dc6bb3bc Change-Id: I844434ef2d1159fea0f0c265d05a0e118cf618f6 Signed-off-by: Casper Li <casper.li@mediatek.com> CR-Id: ALPS02298405
* zsmalloc: Fix CPU hotplug callback registrationSrivatsa S. Bhat2016-12-111-3/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Subsystems that want to register CPU hotplug callbacks, as well as perform initialization for the CPUs that are already online, often do it as shown below: get_online_cpus(); for_each_online_cpu(cpu) init_cpu(cpu); register_cpu_notifier(&foobar_cpu_notifier); put_online_cpus(); This is wrong, since it is prone to ABBA deadlocks involving the cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently with CPU hotplug operations). Instead, the correct and race-free way of performing the callback registration is: cpu_notifier_register_begin(); for_each_online_cpu(cpu) init_cpu(cpu); /* Note the use of the double underscored version of the API */ __register_cpu_notifier(&foobar_cpu_notifier); cpu_notifier_register_done(); Fix the zsmalloc code by using this latter form of callback registration. Cc: Nitin Gupta <ngupta@vflare.org> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* zsmalloc: add copyrightMinchan Kim2016-12-111-0/+1
| | | | | | | | | Add my copyright to the zsmalloc source code which I maintain. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: move it under mmMinchan Kim2016-12-113-0/+1131
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch moves zsmalloc under mm directory. Before that, description will explain why we have needed custom allocator. Zsmalloc is a new slab-based memory allocator for storing compressed pages. It is designed for low fragmentation and high allocation success rate on large object, but <= PAGE_SIZE allocations. zsmalloc differs from the kernel slab allocator in two primary ways to achieve these design goals. zsmalloc never requires high order page allocations to back slabs, or "size classes" in zsmalloc terms. Instead it allows multiple single-order pages to be stitched together into a "zspage" which backs the slab. This allows for higher allocation success rate under memory pressure. Also, zsmalloc allows objects to span page boundaries within the zspage. This allows for lower fragmentation than could be had with the kernel slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE. With the kernel slab allocator, if a page compresses to 60% of it original size, the memory savings gained through compression is lost in fragmentation because another object of the same size can't be stored in the leftover space. This ability to span pages results in zsmalloc allocations not being directly addressable by the user. The user is given an non-dereferencable handle in response to an allocation request. That handle must be mapped, using zs_map_object(), which returns a pointer to the mapped region that can be used. The mapping is necessary since the object data may reside in two different noncontigious pages. The zsmalloc fulfills the allocation needs for zram perfectly [sjenning@linux.vnet.ibm.com: borrow Seth's quote] Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Nitin Gupta <ngupta@vflare.org> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/Kconfig mm/Makefile
* mm: reorder can_do_mlock to fix audit denialJeff Vander Stoep2016-11-171-2/+2
| | | | | | | | | | | | | | | | | | | | | A userspace call to mmap(MAP_LOCKED) may result in the successful locking of memory while also producing a confusing audit log denial. can_do_mlock checks capable and rlimit. If either of these return positive can_do_mlock returns true. The capable check leads to an LSM hook used by apparmour and selinux which produce the audit denial. Reordering so rlimit is checked first eliminates the denial on success, only recording a denial when the lock is unsuccessful as a result of the denial. Change-Id: Ic6e724554a7d566768a594917f160ab5b732108e Signed-off-by: Jeff Vander Stoep <jeffv@google.com> Acked-by: Nick Kralevich <nnk@google.com> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Paul Cassella <cassella@cray.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* proc: much faster /proc/vmstatFrancisco Franco2016-11-071-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Every current KDE system has process named ksysguardd polling files below once in several seconds: $ strace -e trace=open -p $(pidof ksysguardd) Process 1812 attached open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 8 open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 8 open("/proc/net/dev", O_RDONLY) = 8 open("/proc/net/wireless", O_RDONLY) = -1 ENOENT (No such file or directory) open("/proc/stat", O_RDONLY) = 8 open("/proc/vmstat", O_RDONLY) = 8 Hell knows what it is doing but speed up reading /proc/vmstat by 33%! Benchmark is open+read+close 1.000.000 times. BEFORE $ perf stat -r 10 taskset -c 3 ./proc-vmstat Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs): 13146.768464 task-clock (msec) # 0.960 CPUs utilized ( +- 0.60% ) 15 context-switches # 0.001 K/sec ( +- 1.41% ) 1 cpu-migrations # 0.000 K/sec ( +- 11.11% ) 104 page-faults # 0.008 K/sec ( +- 0.57% ) 45,489,799,349 cycles # 3.460 GHz ( +- 0.03% ) 9,970,175,743 stalled-cycles-frontend # 21.92% frontend cycles idle ( +- 0.10% ) 2,800,298,015 stalled-cycles-backend # 6.16% backend cycles idle ( +- 0.32% ) 79,241,190,850 instructions # 1.74 insn per cycle # 0.13 stalled cycles per insn ( +- 0.00% ) 17,616,096,146 branches # 1339.956 M/sec ( +- 0.00% ) 176,106,232 branch-misses # 1.00% of all branches ( +- 0.18% ) 13.691078109 seconds time elapsed ( +- 0.03% ) ^^^^^^^^^^^^ AFTER $ perf stat -r 10 taskset -c 3 ./proc-vmstat Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs): 8688.353749 task-clock (msec) # 0.950 CPUs utilized ( +- 1.25% ) 10 context-switches # 0.001 K/sec ( +- 2.13% ) 1 cpu-migrations # 0.000 K/sec 104 page-faults # 0.012 K/sec ( +- 0.56% ) 30,384,010,730 cycles # 3.497 GHz ( +- 0.07% ) 12,296,259,407 stalled-cycles-frontend # 40.47% frontend cycles idle ( +- 0.13% ) 3,370,668,651 stalled-cycles-backend # 11.09% backend cycles idle ( +- 0.69% ) 28,969,052,879 instructions # 0.95 insn per cycle # 0.42 stalled cycles per insn ( +- 0.01% ) 6,308,245,891 branches # 726.058 M/sec ( +- 0.00% ) 214,685,502 branch-misses # 3.40% of all branches ( +- 0.26% ) 9.146081052 seconds time elapsed ( +- 0.07% ) ^^^^^^^^^^^ vsnprintf() is slow because: 1. format_decode() is busy looking for format specifier: 2 branches per character (not in this case, but in others) 2. approximately million branches while parsing format mini language and everywhere 3. just look at what string() does /proc/vmstat is good case because most of its content are strings Link: http://lkml.kernel.org/r/20160806125455.GA1187@p183.telecom.by Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>
* mm: remove gup_flags FOLL_WRITE games from __get_user_pages()Linus Torvalds2016-11-071-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 19be0eaffa3ac7d8eb6784ad9bdbc7d67ed8e619 upstream. This is an ancient bug that was actually attempted to be fixed once (badly) by me eleven years ago in commit 4ceb5db9757a ("Fix get_user_pages() race for write access") but that was then undone due to problems on s390 by commit f33ea7f404e5 ("fix get_user_pages bug"). In the meantime, the s390 situation has long been fixed, and we can now fix it by checking the pte_dirty() bit properly (and do it better). The s390 dirty bit was implemented in abf09bed3cce ("s390/mm: implement software dirty bits") which made it into v3.9. Earlier kernels will have to look at the page state itself. Also, the VM has become more scalable, and what used a purely theoretical race back then has become easier to trigger. To fix it, we introduce a new internal FOLL_COW flag to mark the "yes, we already did a COW" rather than play racy games with FOLL_WRITE that is very fundamental, and then use the pte dirty flag to validate that the FOLL_COW flag is still valid. Reported-and-tested-by: Phil "not Paul" Oester <kernel@linuxace.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Nick Piggin <npiggin@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [wt: s/gup.c/memory.c; s/follow_page_pte/follow_page_mask; s/faultin_page/__get_user_page] Signed-off-by: Willy Tarreau <w@1wt.eu>
* mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEEDAndrea Arcangeli2016-11-071-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit ad33bb04b2a6cee6c1f99fabb15cddbf93ff0433 upstream. pmd_trans_unstable()/pmd_none_or_trans_huge_or_clear_bad() were introduced to locklessy (but atomically) detect when a pmd is a regular (stable) pmd or when the pmd is unstable and can infinitely transition from pmd_none() and pmd_trans_huge() from under us, while only holding the mmap_sem for reading (for writing not). While holding the mmap_sem only for reading, MADV_DONTNEED can run from under us and so before we can assume the pmd to be a regular stable pmd we need to compare it against pmd_none() and pmd_trans_huge() in an atomic way, with pmd_trans_unstable(). The old pmd_trans_huge() left a tiny window for a race. Useful applications are unlikely to notice the difference as doing MADV_DONTNEED concurrently with a page fault would lead to undefined behavior. [js] 3.12 backport: no pmd_devmap in 3.12 yet. [akpm@linux-foundation.org: tidy up comment grammar/layout] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* mm, vmalloc: remove useless variable in vmap_blockJoonsoo Kim2016-09-281-2/+0
| | | | | | | | | | | | | | vbq in vmap_block isn't used. So remove it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com> Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com> Signed-off-by: engstk <eng.stk@sapo.pt>
* mm, vmalloc: use well-defined find_last_bit() funcJoonsoo Kim2016-09-281-9/+6
| | | | | | | | | | | | | | | | | Our intention in here is to find last_bit within the region to flush. There is well-defined function, find_last_bit() for this purpose and its performance may be slightly better than current implementation. So change it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com> Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com> Signed-off-by: engstk <eng.stk@sapo.pt>
* proc/maps: make vm_is_stack() logic namespace-friendlyOleg Nesterov2016-09-281-17/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Rename vm_is_stack() to task_of_stack() and change it to return "struct task_struct *" rather than the global (and thus wrong in general) pid_t. - Add the new pid_of_stack() helper which calls task_of_stack() and uses the right namespace to report the correct pid_t. Unfortunately we need to define this helper twice, in task_mmu.c and in task_nommu.c. perhaps it makes sense to add fs/proc/util.c and move at least pid_of_stack/task_of_stack there to avoid the code duplication. - Change show_map_vma() and show_numa_map() to use the new helper. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg Ungerer <gerg@uclinux.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com> Conflicts: fs/proc/task_nommu.c mm/util.c
* readahead: make context readahead more conservativeFengguang Wu2016-09-281-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This helps performance on moderately dense random reads on SSD. Transaction-Per-Second numbers provided by Taobao: QPS case ------------------------------------------------------- 7536 disable context readahead totally w/ patch: 7129 slower size rampup and start RA on the 3rd read 6717 slower size rampup w/o patch: 5581 unmodified context readahead Before, readahead will be started whenever reading page N+1 when it happen to read N recently. After patch, we'll only start readahead when *three* random reads happen to access pages N, N+1, N+2. The probability of this happening is extremely low for pure random reads, unless they are very dense, which actually deserves some readahead. Also start with a smaller readahead window. The impact to interleaved sequential reads should be small, because for a long run stream, the the small readahead window rampup phase is negletable. The context readahead actually benefits clustered random reads on HDD whose seek cost is pretty high. However as SSD is increasingly used for random read workloads it's better for the context readahead to concentrate on interleaved sequential reads. Another SSD rand read test from Miao # file size: 2GB # read IO amount: 625MB sysbench --test=fileio \ --max-requests=10000 \ --num-threads=1 \ --file-num=1 \ --file-block-size=64K \ --file-test-mode=rndrd \ --file-fsync-freq=0 \ --file-fsync-end=off run shows the performance of btrfs grows up from 69MB/s to 121MB/s, ext4 from 104MB/s to 121MB/s. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Tested-by: Tao Ma <tm@tao.ma> Tested-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: flar2 <asegaert@gmail.com> Signed-off-by: Brandon Berhent <bbedward@gmail.com> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* slub: do not assert not having lock in removing freed partialSteven Rostedt2016-09-281-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Vladimir reported the following issue: Commit c65c1877bd68 ("slub: use lockdep_assert_held") requires remove_partial() to be called with n->list_lock held, but free_partial() called from kmem_cache_close() on cache destruction does not follow this rule, leading to a warning: WARNING: CPU: 0 PID: 2787 at mm/slub.c:1536 __kmem_cache_shutdown+0x1b2/0x1f0() Modules linked in: CPU: 0 PID: 2787 Comm: modprobe Tainted: G W 3.14.0-rc1-mm1+ #1 Hardware name: 0000000000000600 ffff88003ae1dde8 ffffffff816d9583 0000000000000600 0000000000000000 ffff88003ae1de28 ffffffff8107c107 0000000000000000 ffff880037ab2b00 ffff88007c240d30 ffffea0001ee5280 ffffea0001ee52a0 Call Trace: __kmem_cache_shutdown+0x1b2/0x1f0 kmem_cache_destroy+0x43/0xf0 xfs_destroy_zones+0x103/0x110 [xfs] exit_xfs_fs+0x38/0x4e4 [xfs] SyS_delete_module+0x19a/0x1f0 system_call_fastpath+0x16/0x1b His solution was to add a spinlock in order to quiet lockdep. Although there would be no contention to adding the lock, that lock also requires disabling of interrupts which will have a larger impact on the system. Instead of adding a spinlock to a location where it is not needed for lockdep, make a __remove_partial() function that does not test if the list_lock is held, as no one should have it due to it being freed. Also added a __add_partial() function that does not do the lock validation either, as it is not needed for the creation of the cache. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Reported-by: Vladimir Davydov <vdavydov@parallels.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* mm: slub: work around unneeded lockdep warningDave Hansen2016-09-281-3/+6
| | | | | | | | | | | | | | | | | | | | | | The slub code does some setup during early boot in early_kmem_cache_node_alloc() with some local data. There is no possible way that another CPU can see this data, so the slub code doesn't unnecessarily lock it. However, some new lockdep asserts check to make sure that add_partial() _always_ has the list_lock held. Just add the locking, even though it is technically unnecessary. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@arm.linux.org.uk> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com> Signed-off-by: Anik1199 <anik9280@gmail.com> Conflicts: mm/slub.c
* slab: do not panic if we fail to create memcg cacheVladimir Davydov2016-09-281-1/+8
| | | | | | | | | | | | | | | | | There is no point in flooding logs with warnings or especially crashing the system if we fail to create a cache for a memcg. In this case we will be accounting the memcg allocation to the root cgroup until we succeed to create its own cache, but it isn't that critical. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* slab: clean up kmem_cache_create_memcg() error handlingVladimir Davydov2016-09-281-34/+31
| | | | | | | | | | | | | | | | | | | | | | | | | Currently kmem_cache_create_memcg() backoffs on failure inside conditionals, without using gotos. This results in the rollback code duplication, which makes the function look cumbersome even though on error we should only free the allocated cache. Since in the next patch I am going to add yet another rollback function call on error path there, let's employ labels instead of conditionals for undoing any changes on failure to keep things clean. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com> Conflicts: mm/slab_common.c Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* mm/slab: Give s_next and s_stop slab-specific namesWanpeng Li2016-09-283-8/+8
| | | | | | | | | Give s_next and s_stop slab-specific names instead of exporting "s_next" and "s_stop". Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* mm/slab: Fix /proc/slabinfo unwriteable for slabWanpeng Li2016-09-281-1/+9
| | | | | | | | | | | | Slab have some tunables like limit, batchcount, and sharedfactor can be tuned through function slabinfo_write. Commit (b7454ad3: mm/sl[au]b: Move slabinfo processing to slab_common.c) uncorrectly change /proc/slabinfo unwriteable for slab, this patch fix it by revert to original mode. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* mm/slab: Sharing s_next and s_stop between slab and slubWanpeng Li2016-09-283-12/+5
| | | | | | | | | This patch shares s_next and s_stop between slab and slub. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* topology: add support for node_to_mem_node() to determine the fallback nodeJoonsoo Kim2016-09-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Anton noticed (http://www.spinics.net/lists/linux-mm/msg67489.html) that on ppc LPARs with memoryless nodes, a large amount of memory was consumed by slabs and was marked unreclaimable. He tracked it down to slab deactivations in the SLUB core when we allocate remotely, leading to poor efficiency always when memoryless nodes are present. After much discussion, Joonsoo provided a few patches that help significantly. They don't resolve the problem altogether: - memory hotplug still needs testing, that is when a memoryless node becomes memory-ful, we want to dtrt - there are other reasons for going off-node than memoryless nodes, e.g., fully exhausted local nodes Neither case is resolved with this series, but I don't think that should block their acceptance, as they can be explored/resolved with follow-on patches. The series consists of: [1/3] topology: add support for node_to_mem_node() to determine the fallback node [2/3] slub: fallback to node_to_mem_node() node if allocating on memoryless node - Joonsoo's patches to cache the nearest node with memory for each NUMA node [3/3] Partial revert of 81c98869faa5 (""kthread: ensure locality of task_struct allocations") - At Tejun's request, keep the knowledge of memoryless node fallback to the allocator core. This patch (of 3): We need to determine the fallback node in slub allocator if the allocation target node is memoryless node. Without it, the SLUB wrongly select the node which has no memory and can't use a partial slab, because of node mismatch. Introduced function, node_to_mem_node(X), will return a node Y with memory that has the nearest distance. If X is memoryless node, it will return nearest distance node, but, if X is normal node, it will return itself. We will use this function in following patch to determine the fallback node. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Han Pingtian <hanpt@linux.vnet.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Anton Blanchard <anton@samba.org> Cc: Christoph Lameter <cl@linux.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* slub: fall back to node_to_mem_node() node if allocating on memoryless nodeJoonsoo Kim2016-09-281-6/+18
| | | | | | | | | | | | | | | | | | | | | | | Update the SLUB code to search for partial slabs on the nearest node with memory in the presence of memoryless nodes. Additionally, do not consider it to be an ALLOC_NODE_MISMATCH (and deactivate the slab) when a memoryless-node specified allocation goes off-node. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Han Pingtian <hanpt@linux.vnet.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Anton Blanchard <anton@samba.org> Cc: Christoph Lameter <cl@linux.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* slub: disable tracing and failslab for merged slabsChristoph Lameter2016-09-281-0/+11
| | | | | | | | | | | | | | | | | | Tracing of mergeable slabs as well as uses of failslab are confusing since the objects of multiple slab caches will be affected. Moreover this creates a situation where a mergeable slab will become unmergeable. If tracing or failslab testing is desired then it may be best to switch merging off for starters. Signed-off-by: Christoph Lameter <cl@linux.com> Tested-by: WANG Chao <chaowang@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* slub: remove kmemcg id from create_unique_idVladimir Davydov2016-09-281-6/+0
| | | | | | | | | | | | | | | | This function is never called for memcg caches, because they are unmergeable, so remove the dead code. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* slab: fix the alias count (via sysfs) of slab cacheGu Zheng2016-09-281-1/+1
| | | | | | | | | | | | | | | We mark some slab caches (e.g. kmem_cache_node) as unmergeable by setting refcount to -1, and their alias should be 0, not refcount-1, so correct it here. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* slub: avoid duplicate creation on the first objectWei Yang2016-09-281-8/+11
| | | | | | | | | | | | | | | | | | | | When a kmem_cache is created with ctor, each object in the kmem_cache will be initialized before ready to use. While in slub implementation, the first object will be initialized twice. This patch reduces the duplication of initialization of the first object. Fix commit 7656c72b ("SLUB: add macros for scanning objects in a slab"). Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* mm, slub: mark resiliency_test as init textDavid Rientjes2016-09-281-1/+1
| | | | | | | | | | | | | resiliency_test() is only called for bootstrap, so it may be moved to init.text and freed after boot. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* slub: fix off by one in number of slab testsJoonsoo Kim2016-09-281-3/+3
| | | | | | | | | | | | | | | | | | | | | | | min_partial means minimum number of slab cached in node partial list. So, if nr_partial is less than it, we keep newly empty slab on node partial list rather than freeing it. But if nr_partial is equal or greater than it, it means that we have enough partial slabs so should free newly empty slab. Current implementation missed the equal case so if we set min_partial is 0, then, at least one slab could be cached. This is critical problem to kmemcg destroying logic because it doesn't works properly if some slabs is cached. This patch fixes this problem. Fixes 91cb69620284 ("slub: make dead memcg caches discard free slabs immediately"). Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* slub: search partial list on numa_mem_id(), instead of numa_node_id()Joonsoo Kim2016-09-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | Currently, if allocation constraint to node is NUMA_NO_NODE, we search a partial slab on numa_node_id() node. This doesn't work properly on a system having memoryless nodes, since it can have no memory on that node so there must be no partial slab on that node. On that node, page allocation always falls back to numa_mem_id() first. So searching a partial slab on numa_node_id() in that case is the proper solution for the memoryless node case. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Han Pingtian <hanpt@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* mm, slab_common: add 'unlikely' to size check of kmalloc_slab()Joonsoo Kim2016-09-281-1/+1
| | | | | | | | | | Size is usually below than KMALLOC_MAX_SIZE. If we add a 'unlikely' macro, compiler can make better code. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* mm: slub: fix ALLOC_SLOWPATH statDave Hansen2016-09-281-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There used to be only one path out of __slab_alloc(), and ALLOC_SLOWPATH got bumped in that exit path. Now there are two, and a bunch of gotos. ALLOC_SLOWPATH can now get set more than once during a single call to __slab_alloc() which is pretty bogus. Here's the sequence: 1. Enter __slab_alloc(), fall through all the way to the stat(s, ALLOC_SLOWPATH); 2. hit 'if (!freelist)', and bump DEACTIVATE_BYPASS, jump to new_slab (goto #1) 3. Hit 'if (c->partial)', bump CPU_PARTIAL_ALLOC, goto redo (goto #2) 4. Fall through in the same path we did before all the way to stat(s, ALLOC_SLOWPATH) 5. bump ALLOC_REFILL stat, then return Doing this is obviously bogus. It keeps us from being able to accurately compare ALLOC_SLOWPATH vs. ALLOC_FASTPATH. It also means that the total number of allocs always exceeds the total number of frees. This patch moves stat(s, ALLOC_SLOWPATH) to be called from the same place that __slab_alloc() is. This makes it much less likely that ALLOC_SLOWPATH will get botched again in the spaghetti-code inside __slab_alloc(). Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* mm, slab: suppress out of memory warning unless debug is enabledDavid Rientjes2016-09-282-14/+25
| | | | | | | | | | | | | | | | | | | | | | | | When the slab or slub allocators cannot allocate additional slab pages, they emit diagnostic information to the kernel log such as current number of slabs, number of objects, active objects, etc. This is always coupled with a page allocation failure warning since it is controlled by !__GFP_NOWARN. Suppress this out of memory warning if the allocator is configured without debug supported. The page allocation failure warning will indicate it is a failed slab allocation, the order, and the gfp mask, so this is only useful to diagnose allocator issues. Since CONFIG_SLUB_DEBUG is already enabled by default for the slub allocator, there is no functional change with this patch. If debug is disabled, however, the warnings are now suppressed. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* mm/slub.c: convert vnsprintf-static to va_formatFabian Frederick2016-09-281-7/+9
| | | | | | | | | | | | Inspired by Joe Perches suggestion in ntfs logging clean-up. Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Christoph Lameter <cl@linux.com> Cc: Joe Perches <joe@perches.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* mm/slub.c: convert printk to pr_foo()Fabian Frederick2016-09-281-72/+57
| | | | | | | | | | | | | | | | All printk(KERN_foo converted to pr_foo() Default printk converted to pr_warn() Coalesce format fragments Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Christoph Lameter <cl@linux.com> Cc: Joe Perches <joe@perches.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>