aboutsummaryrefslogtreecommitdiff
path: root/mm
Commit message (Collapse)AuthorAgeFilesLines
...
* Readahead: Optimize divide/multiply by power of 2 using L/R shift (thx ramgear)hellsgod2017-12-261-10/+10
| | | | Signed-off-by: Joe Maples <joe@frap129.org>
* arm64: use the new *_relaxed macros for lower power usagefranciscofranco2017-12-252-5/+5
| | | | | | Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com> Signed-off-by: Joe Maples <joe@frap129.org> Signed-off-by: Mister Oyster <oysterized@gmail.com>
* Finally eradicate CONFIG_HOTPLUGStephen Rothwell2017-12-221-1/+1
| | | | | | | | | | | | | | | | | | | Ever since commit 45f035ab9b8f ("CONFIG_HOTPLUG should be always on"), it has been basically impossible to build a kernel with CONFIG_HOTPLUG turned off. Remove all the remaining references to it. Cc: Russell King <linux@arm.linux.org.uk> Cc: Doug Thompson <dougthompson@xmission.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Acked-by: Hans Verkuil <hans.verkuil@cisco.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm: page_alloc: add kasan hooks on alloc and free pathsAndrey Ryabinin2017-12-185-0/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | Add kernel address sanitizer hooks to mark allocated page's addresses as accessible in corresponding shadow region. Mark freed pages as inaccessible. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrey Konovalov <adech.fo@gmail.com> Cc: Yuri Gribov <tetra2005@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/slob.c: remove an unnecessary check for __GFP_ZEROMiles Chen2017-12-021-1/+1
| | | | | | | | | | | | | | Current flow guarantees a valid pointer when handling the __GFP_ZERO case. So remove the unnecessary NULL pointer check. Link: http://lkml.kernel.org/r/1507203141-11959-1-git-send-email-miles.chen@mediatek.com Signed-off-by: Miles Chen <miles.chen@mediatek.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* slub: Fix sysfs duplicate filename creation when slub_debug=OMiles Chen2017-11-241-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When slub_debug=O is set. It is possible to clear debug flags for an "unmergeable" slab cache in kmem_cache_open(). It makes the "unmergeable" cache became "mergeable" in sysfs_slab_add(). These caches will generate their "unique IDs" by create_unique_id(), but it is possible to create identical unique IDs. In my experiment, sgpool-128, names_cache, biovec-256 generate the same ID ":Ft-0004096" and the kernel reports "sysfs: cannot create duplicate filename '/kernel/slab/:Ft-0004096'". To repeat my experiment, set disable_higher_order_debug=1, CONFIG_SLUB_DEBUG_ON=y in kernel-4.14. Fix this issue by setting unmergeable=1 if slub_debug=O and the the default slub_debug contains any no-merge flags. call path: kmem_cache_create() __kmem_cache_alias() -> we set SLAB_NEVER_MERGE flags here create_cache() __kmem_cache_create() kmem_cache_open() -> clear DEBUG_METADATA_FLAGS sysfs_slab_add() -> the slab cache is mergeable now [ 0.674272] sysfs: cannot create duplicate filename '/kernel/slab/:Ft-0004096' [ 0.674473] ------------[ cut here ]------------ [ 0.674653] WARNING: CPU: 0 PID: 1 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x60/0x7c [ 0.674847] Modules linked in: [ 0.674969] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 4.14.0-rc7ajb-00131-gd4c2e9f-dirty #123 [ 0.675211] Hardware name: linux,dummy-virt (DT) [ 0.675342] task: ffffffc07d4e0080 task.stack: ffffff8008008000 [ 0.675505] PC is at sysfs_warn_dup+0x60/0x7c [ 0.675633] LR is at sysfs_warn_dup+0x60/0x7c [ 0.675759] pc : [<ffffff8008235808>] lr : [<ffffff8008235808>] pstate: 60000145 [ 0.675948] sp : ffffff800800bb40 [ 0.676048] x29: ffffff800800bb40 x28: 0000000000000040 [ 0.676209] x27: ffffffc07c52a380 x26: 0000000000000000 [ 0.676369] x25: ffffff8008af4ad0 x24: ffffff8008af4000 [ 0.676528] x23: ffffffc07c532580 x22: ffffffc07cf04598 [ 0.676695] x21: ffffffc07cf26578 x20: ffffffc07c533700 [ 0.676857] x19: ffffffc07ce67000 x18: 0000000000000002 [ 0.677017] x17: 0000000000007ffe x16: 0000000000000007 [ 0.677176] x15: 0000000000000001 x14: 0000000000007fff [ 0.677335] x13: 0000000000000394 x12: 0000000000000000 [ 0.677492] x11: 00000000000001ab x10: 0000000000000007 [ 0.677651] x9 : 00000000000001ac x8 : ffffff800835d114 [ 0.677809] x7 : 656b2f2720656d61 x6 : 0000000000000017 [ 0.677967] x5 : ffffffc07ffdb9a8 x4 : 0000000000000000 [ 0.678124] x3 : 0000000000000000 x2 : ffffffffffffffff [ 0.678282] x1 : ffffff8008a4e878 x0 : 0000000000000042 [ 0.678442] Call trace: [ 0.678528] Exception stack(0xffffff800800ba00 to 0xffffff800800bb40) [ 0.678706] ba00: 0000000000000042 ffffff8008a4e878 ffffffffffffffff 0000000000000000 [ 0.678914] ba20: 0000000000000000 ffffffc07ffdb9a8 0000000000000017 656b2f2720656d61 [ 0.679121] ba40: ffffff800835d114 00000000000001ac 0000000000000007 00000000000001ab [ 0.679326] ba60: 0000000000000000 0000000000000394 0000000000007fff 0000000000000001 [ 0.679532] ba80: 0000000000000007 0000000000007ffe 0000000000000002 ffffffc07ce67000 [ 0.679739] baa0: ffffffc07c533700 ffffffc07cf26578 ffffffc07cf04598 ffffffc07c532580 [ 0.679944] bac0: ffffff8008af4000 ffffff8008af4ad0 0000000000000000 ffffffc07c52a380 [ 0.680149] bae0: 0000000000000040 ffffff800800bb40 ffffff8008235808 ffffff800800bb40 [ 0.680354] bb00: ffffff8008235808 0000000060000145 ffffffc07c533700 0000000062616c73 [ 0.680560] bb20: ffffffffffffffff 0000000000000000 ffffff800800bb40 ffffff8008235808 [ 0.680774] [<ffffff8008235808>] sysfs_warn_dup+0x60/0x7c [ 0.680928] [<ffffff8008235920>] sysfs_create_dir_ns+0x98/0xa0 [ 0.681095] [<ffffff8008539274>] kobject_add_internal+0xa0/0x294 [ 0.681267] [<ffffff80085394f8>] kobject_init_and_add+0x90/0xb4 [ 0.681435] [<ffffff80081b524c>] sysfs_slab_add+0x90/0x200 [ 0.681592] [<ffffff80081b62a0>] __kmem_cache_create+0x26c/0x438 [ 0.681769] [<ffffff80081858a4>] kmem_cache_create+0x164/0x1f4 [ 0.681940] [<ffffff80086caa98>] sg_pool_init+0x60/0x100 [ 0.682094] [<ffffff8008084144>] do_one_initcall+0x38/0x12c [ 0.682254] [<ffffff80086a0d10>] kernel_init_freeable+0x138/0x1d4 [ 0.682423] [<ffffff8008547388>] kernel_init+0x10/0xfc [ 0.682571] [<ffffff80080851e0>] ret_from_fork+0x10/0x18 Signed-off-by: Miles Chen <miles.chen@mediatek.com>
* mm: fix overflow check in expand_upwards()Helge Deller2017-11-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit 37511fb5c91db93d8bd6e3f52f86e5a7ff7cfcdf upstream. Jörn Engel noticed that the expand_upwards() function might not return -ENOMEM in case the requested address is (unsigned long)-PAGE_SIZE and if the architecture didn't defined TASK_SIZE as multiple of PAGE_SIZE. Affected architectures are arm, frv, m68k, blackfin, h8300 and xtensa which all define TASK_SIZE as 0xffffffff, but since none of those have an upwards-growing stack we currently have no actual issue. Nevertheless let's fix this just in case any of the architectures with an upward-growing stack (currently parisc, metag and partly ia64) define TASK_SIZE similar. Link: http://lkml.kernel.org/r/20170702192452.GA11868@p100.box Fixes: bd726c90b6b8 ("Allow stack to grow up to address space limit") Signed-off-by: Helge Deller <deller@gmx.de> Reported-by: Jörn Engel <joern@purestorage.com> Cc: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* mm/page_alloc: Remove kernel address exposure in free_reserved_area()Josh Poimboeuf2017-11-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit adb1fe9ae2ee6ef6bc10f3d5a588020e7664dfa7 upstream. Linus suggested we try to remove some of the low-hanging fruit related to kernel address exposure in dmesg. The only leaks I see on my local system are: Freeing SMP alternatives memory: 32K (ffffffff9e309000 - ffffffff9e311000) Freeing initrd memory: 10588K (ffffa0b736b42000 - ffffa0b737599000) Freeing unused kernel memory: 3592K (ffffffff9df87000 - ffffffff9e309000) Freeing unused kernel memory: 1352K (ffffa0b7288ae000 - ffffa0b728a00000) Freeing unused kernel memory: 632K (ffffa0b728d62000 - ffffa0b728e00000) Linus says: "I suspect we should just remove [the addresses in the 'Freeing' messages]. I'm sure they are useful in theory, but I suspect they were more useful back when the whole "free init memory" was originally done. These days, if we have a use-after-free, I suspect the init-mem situation is the easiest situation by far. Compared to all the dynamic allocations which are much more likely to show it anyway. So having debug output for that case is likely not all that productive." With this patch the freeing messages now look like this: Freeing SMP alternatives memory: 32K Freeing initrd memory: 10588K Freeing unused kernel memory: 3592K Freeing unused kernel memory: 1352K Freeing unused kernel memory: 632K Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/6836ff90c45b71d38e5d4405aec56fa9e5d1d4b2.1477405374.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* mm: vmscan: support equal reclaim for anon and file pagesLiam Mark2017-10-212-1/+11
| | | | | | | | | | | | When performing memory reclaim support treating anonymous and file backed pages equally. Swapping anonymous pages out to memory can be efficient enough to justify treating anonymous and file backed pages equally. CRs-Fixed: 648984 Change-Id: I6315b8557020d1e27a34225bb9cefbef1fb43266 Signed-off-by: Liam Mark <lmark@codeaurora.org>
* gather extra early boot entropy like PaXDaniel Micay2017-10-141-0/+11
|
* BACKPORT: Sanitize 'move_pages()' permission checksMarissa Wall2017-10-141-8/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'move_paghes()' system call was introduced long long ago with the same permission checks as for sending a signal (except using CAP_SYS_NICE instead of CAP_SYS_KILL for the overriding capability). That turns out to not be a great choice - while the system call really only moves physical page allocations around (and you need other capabilities to do a lot of it), you can check the return value to map out some the virtual address choices and defeat ASLR of a binary that still shares your uid. So change the access checks to the more common 'ptrace_may_access()' model instead. This tightens the access checks for the uid, and also effectively changes the CAP_SYS_NICE check to CAP_SYS_PTRACE, but it's unlikely that anybody really _uses_ this legacy system call any more (we hav ebetter NUMA placement models these days), so I expect nobody to notice. Famous last words. Reported-by: Otto Ebeling <otto.ebeling@iki.fi> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Willy Tarreau <w@1wt.eu> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> cherry-picked from: 197e7e521384a23b9e585178f3f11c9fa08274b9 This branch does not have the PTRACE_MODE_REALCREDS flag but its default behavior is the same as PTRACE_MODE_REALCREDS. So use PTRACE_MODE_READ instead of PTRACE_MODE_READ_REALCREDS. Change-Id: I75364561d91155c01f78dd62cdd41c5f0f418854
* CHROMIUM: remove Android's cgroup generic permissions checksDmitry Torokhov2017-09-301-12/+0
| | | | | | | | | | | | | | | | | | | | The implementation is utterly broken, resulting in all processes being allows to move tasks between sets (as long as they have access to the "tasks" attribute), and upstream is heading towards checking only capability anyway, so let's get rid of this code. BUG=b:31790445,chromium:647994 TEST=Boot android container, examine logcat Change-Id: I2f780a5992c34e52a8f2d0b3557fc9d490da2779 Signed-off-by: Dmitry Torokhov <dtor@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/394967 Reviewed-by: Ricky Zhou <rickyz@chromium.org> Reviewed-by: John Stultz <john.stultz@linaro.org> (cherry picked from commit 6895149f8bf0719aa70487e285fa6a8ad3d2692d) Reviewed-on: https://chromium-review.googlesource.com/399858 Reviewed-by: Guenter Roeck <groeck@chromium.org> Signed-off-by: Mister Oyster <oysterized@gmail.com>
* mm/zpool: add name argument to create zpoolGanesh Mahendran2017-09-252-3/+6
| | | | | | | | | | | | | | | | | | | | | | | Currently the underlay of zpool: zsmalloc/zbud, do not know who creates them. There is not a method to let zsmalloc/zbud find which caller they belong to. Now we want to add statistics collection in zsmalloc. We need to name the debugfs dir for each pool created. The way suggested by Minchan Kim is to use a name passed by caller(such as zram) to create the zsmalloc pool. /sys/kernel/debug/zsmalloc/zram0 This patch adds an argument `name' to zs_create_pool() and other related functions. Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Dan Streetman <ddstreet@ieee.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/zpool: use prefixed module loadingKees Cook2017-09-252-1/+86
| | | | | | | | | | | | | | | | | To avoid potential format string expansion via module parameters, do not use the zpool type directly in request_module() without a format string. Additionally, to avoid arbitrary modules being loaded via zpool API (e.g. via the zswap_zpool_type module parameter) add a "zpool-" prefix to the requested module, as well as module aliases for the existing zpool types (zbud and zsmalloc). Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Acked-by: Dan Streetman <ddstreet@ieee.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zbud: add to mm/Seth Jennings2017-09-252-0/+528
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | zbud is an special purpose allocator for storing compressed pages. It is designed to store up to two compressed pages per physical page. While this design limits storage density, it has simple and deterministic reclaim properties that make it preferable to a higher density approach when reclaim will be used. zbud works by storing compressed pages, or "zpages", together in pairs in a single memory page called a "zbud page". The first buddy is "left justifed" at the beginning of the zbud page, and the last buddy is "right justified" at the end of the zbud page. The benefit is that if either buddy is freed, the freed buddy space, coalesced with whatever slack space that existed between the buddies, results in the largest possible free region within the zbud page. zbud also provides an attractive lower bound on density. The ratio of zpages to zbud pages can not be less than 1. This ensures that zbud can never "do harm" by using more pages to store zpages than the uncompressed zpages would have used on their own. This implementation is a rewrite of the zbud allocator internally used by zcache in the driver/staging tree. The rewrite was necessary to remove some of the zcache specific elements that were ingrained throughout and provide a generic allocation interface that can later be used by zsmalloc and others. This patch adds zbud to mm/ for later use by zswap. Change-Id: I5120b1acd22f15c5dc3d2a0e6f1a34a73f97be3a Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Robert Jennings <rcj@linux.vnet.ibm.com> Cc: Jenifer Hopper <jhopper@us.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Hansen <dave@sr71.net> Cc: Joe Perches <joe@perches.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: Hugh Dickens <hughd@google.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Bob Liu <bob.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: 4e2e2770b1529edc5849c86b29a6febe27e2f083 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [venkatg@codeaurora.org: keep msm-3.10 changes, add zbud cfg] Signed-off-by: Venkat Gopalakrishnan <venkatg@codeaurora.org>
* zsmalloc: use OBJ_TAG_BIT for bit shifterMinchan Kim2017-09-251-6/+6
| | | | | | | | | | | | | Static check warns using tag as bit shifter. It doesn't break current working but not good for redability. Let's use OBJ_TAG_BIT as bit shifter instead of OBJ_ALLOCATED_TAG. Link: http://lkml.kernel.org/r/20160607045146.GF26230@bbox Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: use freeobj for indexMinchan Kim2017-09-251-66/+73
| | | | | | | | | | | | | | Zsmalloc stores first free object's <PFN, obj_idx> position into freeobj in each zspage. If we change it with index from first_page instead of position, it makes page migration simple because we don't need to correct other entries for linked list if a page is migrated out. Link: http://lkml.kernel.org/r/1464736881-24886-11-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: separate free_zspage from putback_zspageMinchan Kim2017-09-251-16/+11
| | | | | | | | | | | | | | | Currently, putback_zspage does free zspage under class->lock if fullness become ZS_EMPTY but it makes trouble to implement locking scheme for new zspage migration. So, this patch is to separate free_zspage from putback_zspage and free zspage out of class->lock which is preparation for zspage migration. Link: http://lkml.kernel.org/r/1464736881-24886-10-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: introduce zspage structureMinchan Kim2017-09-251-293/+241
| | | | | | | | | | | | | | | | | | | | We have squeezed meta data of zspage into first page's descriptor. So, to get meta data from subpage, we should get first page first of all. But it makes trouble to implment page migration feature of zsmalloc because any place where to get first page from subpage can be raced with first page migration. IOW, first page it got could be stale. For preventing it, I have tried several approahces but it made code complicated so finally, I concluded to separate metadata from first page. Of course, it consumes more memory. IOW, 16bytes per zspage on 32bit at the moment. It means we lost 1% at *worst case*(40B/4096B) which is not bad I think at the cost of maintenance. Link: http://lkml.kernel.org/r/1464736881-24886-9-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: factor page chain functionality outMinchan Kim2017-09-251-24/+36
| | | | | | | | | | | | For page migration, we need to create page chain of zspage dynamically so this patch factors it out from alloc_zspage. Link: http://lkml.kernel.org/r/1464736881-24886-8-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: use accessorMinchan Kim2017-09-251-22/+60
| | | | | | | | | | | | Upcoming patch will change how to encode zspage meta so for easy review, this patch wraps code to access metadata as accessor. Link: http://lkml.kernel.org/r/1464736881-24886-7-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: use bit_spin_lockMinchan Kim2017-09-251-7/+3
| | | | | | | | | | | | | | Use kernel standard bit spin-lock instead of custom mess. Even, it has a bug which doesn't disable preemption. The reason we don't have any problem is that we have used it during preemption disable section by class->lock spinlock. So no need to go to stable. Link: http://lkml.kernel.org/r/1464736881-24886-6-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: keep max_object in size_classMinchan Kim2017-09-251-19/+15
| | | | | | | | | | | | Every zspage in a size_class has same number of max objects so we could move it to a size_class. Link: http://lkml.kernel.org/r/1464736881-24886-5-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* update "mm/zsmalloc: don't fail if can't create debugfs info"Dan Streetman2017-09-251-19/+16
| | | | | | | | | | | | | | | | | Some updates to commit d34f615720d1 ("mm/zsmalloc: don't fail if can't create debugfs info"): - add pr_warn to all stat failure cases - do not prevent module loading on stat failure Link: http://lkml.kernel.org/r/1463671123-5479-1-git-send-email-ddstreet@ieee.org Signed-off-by: Dan Streetman <ddstreet@ieee.org> Reviewed-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Dan Streetman <dan.streetman@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/zsmalloc: don't fail if can't create debugfs infoDan Streetman2017-09-251-11/+7
| | | | | | | | | | | | | | | | | | | Change the return type of zs_pool_stat_create() to void, and remove the logic to abort pool creation if the stat debugfs dir/file could not be created. The debugfs stat file is for debugging/information only, and doesn't affect operation of zsmalloc; there is no reason to abort creating the pool if the stat file can't be created. This was seen with zswap, which used the same name for all pool creations, which caused zsmalloc to fail to create a second pool for zswap if CONFIG_ZSMALLOC_STAT was enabled. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Dan Streetman <dan.streetman@canonical.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: require GFP in zs_malloc()Sergey Senozhatsky2017-09-251-11/+13
| | | | | | | | | | | | | | | | | | Pass GFP flags to zs_malloc() instead of using a fixed mask supplied to zs_create_pool(), so we can be more flexible, but, more importantly, we need this to switch zram to per-cpu compression streams -- zram will try to allocate handle with preemption disabled in a fast path and switch to a slow path (using different gfp mask) if the fast one has failed. Apart from that, this also align zs_malloc() interface with zspool/zbud. [sergey.senozhatsky@gmail.com: pass GFP flags to zs_malloc() instead of using a fixed mask] Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: remove unused pool param in obj_freeMinchan Kim2017-09-251-4/+3
| | | | | | | | | Let's remove unused pool param in obj_free Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: reorder function parametersMinchan Kim2017-09-251-24/+26
| | | | | | | | | | Clean up function parameter ordering to order higher data structure first. Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: use first_page rather than pageMinchan Kim2017-09-251-30/+32
| | | | | | | | | | | Clean up function parameter "struct page". Many functions of zsmalloc expect that page paramter is "first_page" so use "first_page" rather than "page" for code readability. Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: fix zs_can_compact() integer overflowSergey Senozhatsky2017-09-251-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | zs_can_compact() has two race conditions in its core calculation: unsigned long obj_wasted = zs_stat_get(class, OBJ_ALLOCATED) - zs_stat_get(class, OBJ_USED); 1) classes are not locked, so the numbers of allocated and used objects can change by the concurrent ops happening on other CPUs 2) shrinker invokes it from preemptible context Depending on the circumstances, thus, OBJ_ALLOCATED can become less than OBJ_USED, which can result in either very high or negative `total_scan' value calculated later in do_shrink_slab(). do_shrink_slab() has some logic to prevent those cases: vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-64 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62 However, due to the way `total_scan' is calculated, not every shrinker->count_objects() overflow can be spotted and handled. To demonstrate the latter, I added some debugging code to do_shrink_slab() (x86_64) and the results were: vmscan: OVERFLOW: shrinker->count_objects() == -1 [18446744073709551615] vmscan: but total_scan > 0: 92679974445502 vmscan: resulting total_scan: 92679974445502 [..] vmscan: OVERFLOW: shrinker->count_objects() == -1 [18446744073709551615] vmscan: but total_scan > 0: 22634041808232578 vmscan: resulting total_scan: 22634041808232578 Even though shrinker->count_objects() has returned an overflowed value, the resulting `total_scan' is positive, and, what is more worrisome, it is insanely huge. This value is getting used later on in shrinker->scan_objects() loop: while (total_scan >= batch_size || total_scan >= freeable) { unsigned long ret; unsigned long nr_to_scan = min(batch_size, total_scan); shrinkctl->nr_to_scan = nr_to_scan; ret = shrinker->scan_objects(shrinker, shrinkctl); if (ret == SHRINK_STOP) break; freed += ret; count_vm_events(SLABS_SCANNED, nr_to_scan); total_scan -= nr_to_scan; cond_resched(); } `total_scan >= batch_size' is true for a very-very long time and 'total_scan >= freeable' is also true for quite some time, because `freeable < 0' and `total_scan' is large enough, for example, 22634041808232578. The only break condition, in the given scheme of things, is shrinker->scan_objects() == SHRINK_STOP test, which is a bit too weak to rely on, especially in heavy zsmalloc-usage scenarios. To fix the issue, take a pool stat snapshot and use it instead of racy zs_stat_get() calls. Link: http://lkml.kernel.org/r/20160509140052.3389-1-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> [4.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/zsmalloc: add `freeable' column to pool statSergey Senozhatsky2017-09-251-7/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new column to pool stats, which will tell how many pages ideally can be freed by class compaction, so it will be easier to analyze zsmalloc fragmentation. At the moment, we have only numbers of FULL and ALMOST_EMPTY classes, but they don't tell us how badly the class is fragmented internally. The new /sys/kernel/debug/zsmalloc/zramX/classes output look as follows: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable [..] 12 224 0 2 146 5 8 4 4 13 240 0 0 0 0 0 1 0 14 256 1 13 1840 1672 115 1 10 15 272 0 0 0 0 0 1 0 [..] 49 816 0 3 745 735 149 1 2 51 848 3 4 361 306 76 4 8 52 864 12 14 378 268 81 3 21 54 896 1 12 117 57 26 2 12 57 944 0 0 0 0 0 3 0 [..] Total 26 131 12709 10994 1071 134 For example, from this particular output we can easily conclude that class-896 is heavily fragmented -- it occupies 26 pages, 12 can be freed by compaction. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: drop unused member 'mapping_area->huge'YiPing Xu2017-09-251-6/+3
| | | | | | | | | | | | When unmapping a huge class page in zs_unmap_object, the page will be unmapped by kmap_atomic. the "!area->huge" branch in __zs_unmap_object is alway true, and no code set "area->huge" now, so we can drop it. Signed-off-by: YiPing Xu <xuyiping@huawei.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: fix migrate_zspage-zs_free race conditionJunil Lee2017-09-251-1/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | record_obj() in migrate_zspage() does not preserve handle's HANDLE_PIN_BIT, set by find_aloced_obj()->trypin_tag(), and implicitly (accidentally) un-pins the handle, while migrate_zspage() still performs an explicit unpin_tag() on the that handle. This additional explicit unpin_tag() introduces a race condition with zs_free(), which can pin that handle by this time, so the handle becomes un-pinned. Schematically, it goes like this: CPU0 CPU1 migrate_zspage find_alloced_obj trypin_tag set HANDLE_PIN_BIT zs_free() pin_tag() obj_malloc() -- new object, no tag record_obj() -- remove HANDLE_PIN_BIT set HANDLE_PIN_BIT unpin_tag() -- remove zs_free's HANDLE_PIN_BIT The race condition may result in a NULL pointer dereference: Unable to handle kernel NULL pointer dereference at virtual address 00000000 CPU: 0 PID: 19001 Comm: CookieMonsterCl Tainted: PC is at get_zspage_mapping+0x0/0x24 LR is at obj_free.isra.22+0x64/0x128 Call trace: get_zspage_mapping+0x0/0x24 zs_free+0x88/0x114 zram_free_page+0x64/0xcc zram_slot_free_notify+0x90/0x108 swap_entry_free+0x278/0x294 free_swap_and_cache+0x38/0x11c unmap_single_vma+0x480/0x5c8 unmap_vmas+0x44/0x60 exit_mmap+0x50/0x110 mmput+0x58/0xe0 do_exit+0x320/0x8dc do_group_exit+0x44/0xa8 get_signal+0x538/0x580 do_signal+0x98/0x4b8 do_notify_resume+0x14/0x5c This patch keeps the lock bit in migration path and update value atomically. Signed-off-by: Junil Lee <junil0814.lee@lge.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: <stable@vger.kernel.org> [4.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: reorganize struct size_class to pack 4 bytes holeWeijie Yang2017-09-251-2/+2
| | | | | | | | | | | Reoder the pages_per_zspage field in struct size_class which can eliminate the 4 bytes hole between it and stats field. Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: use page->private instead of page->first_pageKirill A. Shutemov2017-09-251-6/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | We are going to rework how compound_head() work. It will not use page->first_page as we have it now. The only other user of page->first_page beyond compound pages is zsmalloc. Let's use page->private instead of page->first_page here. It occupies the same storage space. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: reduce size_class memory usageSergey Senozhatsky2017-09-251-4/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each `struct size_class' contains `struct zs_size_stat': an array of NR_ZS_STAT_TYPE `unsigned long'. For zsmalloc built with no CONFIG_ZSMALLOC_STAT this results in a waste of `2 * sizeof(unsigned long)' per-class. The patch removes unneeded `struct zs_size_stat' members by redefining NR_ZS_STAT_TYPE (max stat idx in array). Since both NR_ZS_STAT_TYPE and zs_stat_type are compile time constants, GCC can eliminate zs_stat_inc()/zs_stat_dec() calls that use zs_stat_type larger than NR_ZS_STAT_TYPE: CLASS_ALMOST_EMPTY and CLASS_ALMOST_FULL at the moment. ./scripts/bloat-o-meter mm/zsmalloc.o.old mm/zsmalloc.o.new add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-39 (-39) function old new delta fix_fullness_group 97 94 -3 insert_zspage 100 86 -14 remove_zspage 141 119 -22 To summarize: a) each class now uses less memory b) we avoid a number of dec/inc stats (a minor optimization, but still). The gain will increase once we introduce additional stats. A simple IO test. iozone -t 4 -R -r 32K -s 60M -I +Z patched base " Initial write " 4145599.06 4127509.75 " Rewrite " 4146225.94 4223618.50 " Read " 17157606.00 17211329.50 " Re-read " 17380428.00 17267650.50 " Reverse Read " 16742768.00 16162732.75 " Stride read " 16586245.75 16073934.25 " Random read " 16349587.50 15799401.75 " Mixed workload " 10344230.62 9775551.50 " Random write " 4277700.62 4260019.69 " Pwrite " 4302049.12 4313703.88 " Pread " 6164463.16 6126536.72 " Fwrite " 7131195.00 6952586.00 " Fread " 12682602.25 12619207.50 Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/zsmalloc.c: remove useless line in obj_free()Hui Zhu2017-09-251-3/+0
| | | | | | | | Signed-off-by: Hui Zhu <zhuhui@xiaomi.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: use preempt.h for in_interrupt()Sergey Senozhatsky2017-09-251-1/+1
| | | | | | | | | | | | | | A cosmetic change. Commit c60369f01125 ("staging: zsmalloc: prevent mappping in interrupt context") added in_interrupt() check to zs_map_object() and 'hardirq.h' include; but in_interrupt() macro is defined in 'preempt.h' not in 'hardirq.h', so include it instead. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: fix obj_to_head use page_private(page) as value but not pointerHui Zhu2017-09-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In obj_malloc(): if (!class->huge) /* record handle in the header of allocated chunk */ link->handle = handle; else /* record handle in first_page->private */ set_page_private(first_page, handle); In the hugepage we save handle to private directly. But in obj_to_head(): if (class->huge) { VM_BUG_ON(!is_first_page(page)); return *(unsigned long *)page_private(page); } else return *(unsigned long *)obj; It is used as a pointer. The reason why there is no problem until now is huge-class page is born with ZS_FULL so it can't be migrated. However, we need this patch for future work: "VM-aware zsmalloced page migration" to reduce external fragmentation. Signed-off-by: Hui Zhu <zhuhui@xiaomi.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: add comments for ->inuse to zspageHui Zhu2017-09-251-0/+1
| | | | | | | | | | [akpm@linux-foundation.org: fix grammar] Signed-off-by: Hui Zhu <zhuhui@xiaomi.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: remove null check from destroy_handle_cache()Sergey Senozhatsky2017-09-251-2/+1
| | | | | | | | | | | We can pass a NULL cache pointer to kmem_cache_destroy(), because it NULL-checks its argument now. Remove redundant test from destroy_handle_cache(). Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: use class->pages_per_zspageMinchan Kim2017-09-251-3/+2
| | | | | | | | | | There is no need to recalcurate pages_per_zspage in runtime. Just use class->pages_per_zspage to avoid unnecessary runtime overhead. Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: consider ZS_ALMOST_FULL as migrate sourceMinchan Kim2017-09-251-7/+10
| | | | | | | | | | | | There is no reason to prevent select ZS_ALMOST_FULL as migration source if we cannot find source from ZS_ALMOST_EMPTY. With this patch, zs_can_compact will return more exact result. Signed-off-by: Minchan Kim <minchan.kim@lge.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: partial page ordering within a fullness_listSergey Senozhatsky2017-09-251-5/+14
| | | | | | | | | | | | | | | | | | | | | We want to see more ZS_FULL pages and less ZS_ALMOST_{FULL, EMPTY} pages. Put a page with higher ->inuse count first within its ->fullness_list, which will give us better chances to fill up this page with new objects (find_get_zspage() return ->fullness_list head for new object allocation), so some zspages will become ZS_ALMOST_FULL/ZS_FULL quicker. It performs a trivial and cheap ->inuse compare which does not slow down zsmalloc and in the worst case keeps the list pages in no particular order. A more expensive solution could sort fullness_list by ->inuse count. [minchan@kernel.org: code adjustments] Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: account the number of compacted pagesSergey Senozhatsky2017-09-251-10/+17
| | | | | | | | | | | | | | | | | | | | | Compaction returns back to zram the number of migrated objects, which is quite uninformative -- we have objects of different sizes so user space cannot obtain any valuable data from that number. Change compaction to operate in terms of pages and return back to compaction issuer the number of pages that were freed during compaction. So from now on we will export more meaningful value in zram<id>/mm_stat -- the number of freed (compacted) pages. This requires: (a) a rename of `num_migrated' to 'pages_compacted' (b) a internal API change -- return first_page's fullness_group from putback_zspage(), so we know when putback_zspage() did free_zspage(). It helps us to account compaction stats correctly. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc/zram: introduce zs_pool_stats apiSergey Senozhatsky2017-09-251-14/+15
| | | | | | | | | | | | | | | | | | | | | | | | `zs_compact_control' accounts the number of migrated objects but it has a limited lifespan -- we lose it as soon as zs_compaction() returns back to zram. It worked fine, because (a) zram had it's own counter of migrated objects and (b) only zram could trigger compaction. However, this does not work for automatic pool compaction (not issued by zram). To account objects migrated during auto-compaction (issued by the shrinker) we need to store this number in zs_pool. Define a new `struct zs_pool_stats' structure to keep zs_pool's stats there. It provides only `num_migrated', as of this writing, but it surely can be extended. A new zsmalloc zs_pool_stats() symbol exports zs_pool's stats back to caller. Use zs_pool_stats() in zram and remove `num_migrated' from zram_stats. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Suggested-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: cosmetic compaction code adjustmentsSergey Senozhatsky2017-09-251-6/+6
| | | | | | | | | | | | | | | | | Change zs_object_copy() argument order to be (DST, SRC) rather than (SRC, DST). copy/move functions usually have (to, from) arguments order. Rename alloc_target_page() to isolate_target_page(). This function doesn't allocate anything, it isolates target page, pretty much like isolate_source_page(). Tweak __zs_compact() comment. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: introduce zs_can_compact() functionSergey Senozhatsky2017-09-251-0/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This function checks if class compaction will free any pages. Rephrasing -- do we have enough unused objects to form at least one ZS_EMPTY page and free it. It aborts compaction if class compaction will not result in any (further) savings. EXAMPLE (this debug output is not part of this patch set): - class size - number of allocated objects - number of used objects - max objects per zspage - pages per zspage - estimated number of pages that will be freed [..] class-512 objs:544 inuse:540 maxobj-per-zspage:8 pages-per-zspage:1 zspages-to-free:0 ... class-512 compaction is useless. break class-496 objs:660 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:2 class-496 objs:627 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:1 class-496 objs:594 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:0 ... class-496 compaction is useless. break class-448 objs:657 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:4 class-448 objs:648 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:3 class-448 objs:639 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:2 class-448 objs:630 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:1 class-448 objs:621 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:0 ... class-448 compaction is useless. break class-432 objs:728 inuse:685 maxobj-per-zspage:28 pages-per-zspage:3 zspages-to-free:1 class-432 objs:700 inuse:685 maxobj-per-zspage:28 pages-per-zspage:3 zspages-to-free:0 ... class-432 compaction is useless. break class-416 objs:819 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:2 class-416 objs:780 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:1 class-416 objs:741 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:0 ... class-416 compaction is useless. break class-400 objs:690 inuse:674 maxobj-per-zspage:10 pages-per-zspage:1 zspages-to-free:1 class-400 objs:680 inuse:674 maxobj-per-zspage:10 pages-per-zspage:1 zspages-to-free:0 ... class-400 compaction is useless. break class-384 objs:736 inuse:709 maxobj-per-zspage:32 pages-per-zspage:3 zspages-to-free:0 ... class-384 compaction is useless. break [..] Every "compaction is useless" indicates that we saved CPU cycles. class-512 has 544 object allocated 540 objects used 8 objects per-page Even if we have a ALMOST_EMPTY zspage, we still don't have enough room to migrate all of its objects and free this zspage; so compaction will not make a lot of sense, it's better to just leave it as is. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: always keep per-class statsSergey Senozhatsky2017-09-251-32/+8
| | | | | | | | | | | | | | | | | | | | | | Always account per-class `zs_size_stat' stats. This data will help us make better decisions during compaction. We are especially interested in OBJ_ALLOCATED and OBJ_USED, which can tell us if class compaction will result in any memory gain. For instance, we know the number of allocated objects in the class, the number of objects being used (so we also know how many objects are not used) and the number of objects per-page. So we can ensure if we have enough unused objects to form at least one ZS_EMPTY zspage during compaction. We calculate this value on per-class basis so we can calculate a total number of zspages that can be released. Which is exactly what a shrinker wants to know. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: drop unused variable `nr_to_migrate'Sergey Senozhatsky2017-09-251-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | This patchset tweaks compaction and makes it possible to trigger pool compaction automatically when system is getting low on memory. zsmalloc in some cases can suffer from a notable fragmentation and compaction can release some considerable amount of memory. The problem here is that currently we fully rely on user space to perform compaction when needed. However, performing zsmalloc compaction is not always an obvious thing to do. For example, suppose we have a `idle' fragmented (compaction was never performed) zram device and system is getting low on memory due to some 3rd party user processes (gcc LTO, or firefox, etc.). It's quite unlikely that user space will issue zpool compaction in this case. Besides, user space cannot tell for sure how badly pool is fragmented; however, this info is known to zsmalloc and, hence, to a shrinker. This patch (of 7): __zs_compact() does not use `nr_to_migrate', drop it. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>