aboutsummaryrefslogtreecommitdiff
path: root/mm
Commit message (Collapse)AuthorAgeFilesLines
...
* mm/vmalloc.c: rename VM_UNLIST to VM_UNINITIALIZEDZhang Yanfei2019-05-021-9/+9
| | | | | | | | | | | | | | | | | | VM_UNLIST was used to indicate that the vm_struct is not listed in vmlist. But after commit 4341fa454796 ("mm, vmalloc: remove list management of vmlist after initializing vmalloc"), the meaning of this flag changed. It now means the vm_struct is not fully initialized. So renaming it to VM_UNINITIALIZED seems more reasonable. Also change clear_vm_unlist to clear_vm_uninitialized_flag. Change-Id: Icad57fdbbe6cf057b35102a23ed5bd90d1658b3b Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit b5de696d12c90a617476dcdfa18dbb13acf2558f)
* mm: Update is_vmalloc_addr to account for vmalloc savingsLaura Abbott2019-05-021-0/+33
| | | | | | | | | | is_vmalloc_addr current assumes that all vmalloc addresses exist between VMALLOC_START and VMALLOC_END. This may not be the case when interleaving vmalloc and lowmem. Update the is_vmalloc_addr to properly check for this. Change-Id: I5def3d6ae1a4de59ea36f095b8c73649a37b1f36 Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
* mm/vmalloc.c: emit the failure message before returnZhang Yanfei2019-05-021-1/+1
| | | | | | | | | | | | Use goto to jump to the fail label to give a failure message before returning NULL. This makes the failure handling in this function consistent. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: I8487b08f9b8557dde0c685b263ff066384b0750e (cherry picked from commit 5208b5405b8d725d5d74d4734d2ee8ba8896843a)
* mm/vmalloc.c: remove alloc_map from vmap_blockZhang Yanfei2019-05-021-3/+0
| | | | | | | | | | | | | | As we have removed the dead code in the vb_alloc, it seems there is no place to use the alloc_map. So there is no reason to maintain the alloc_map in vmap_block. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: I28de3daf10ceefda9c19a9a9e253c7624038a323 (cherry picked from commit 68482067e5c67f79c8671dfc20bb3c9e2d3022e6)
* mm/vmalloc.c: remove unused purge_fragmented_blocks_thiscpuZhang Yanfei2019-05-021-5/+0
| | | | | | | | | | | | This function is nowhere used now, so remove it. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: Ifd266d83adc838867409c91342315656ed778aa3 (cherry picked from commit e866039819842d384b0b4de557d20beb5d8afe50)
* mm/vmalloc.c: remove dead code in vb_allocZhang Yanfei2019-05-021-15/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Space in a vmap block that was once allocated is considered dirty and not made available for allocation again before the whole block is recycled. The result is that free space within a vmap block is always contiguous. So if a vmap block has enough free space for allocation, the allocation is impossible to fail. Thus, the fragmented block purging was never invoked from vb_alloc(). So remove this dead code. [ Same patches also sent by: Chanho Min <chanho.min@lge.com> Johannes Weiner <hannes@cmpxchg.org> but git doesn't do "multiple authors" ] Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: Ia406f64ff51247f6643cb469bf8b3ca5a603abdf (cherry picked from commit c920b64e6ab415e78ffd3805f76af96ec7a6636e)
* mm/vmalloc.c: unbreak __vunmap()Dan Carpenter2019-05-021-1/+1
| | | | | | | | | | | There is an extra semi-colon so the function always returns. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: I2c62c8670715b88ea20b0a26810f803f0e58076b (cherry picked from commit 9629ae527b04c773265787b05f75e79fdae58940)
* mm, vmalloc: use clamp() to simplify codeZhang Yanfei2019-05-021-10/+2
| | | | | | | | Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: I2bb66875ce24a79e5210b5ee37c106221b6e890d (cherry picked from commit 0e3158c292b9219c5d8524d6a1bbdaeecf11de47)
* mm, vmalloc: remove insert_vmalloc_vm()Zhang Yanfei2019-05-021-7/+0
| | | | | | | | | | | Now this function is nowhere used, we can remove it directly. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: If56e743dffd9fbf5c554faf50645a8770e7e217e (cherry picked from commit c6e9ede3e9f65b65fc5b4bdaf09470089fa59d7a)
* mm, vmalloc: call setup_vmalloc_vm() instead of insert_vmalloc_vm()Zhang Yanfei2019-05-021-2/+2
| | | | | | | | | | | | | Here we pass flags with only VM_ALLOC bit set, it is unnecessary to call clear_vm_unlist to clear VM_UNLIST bit. So use setup_vmalloc_vm instead of insert_vmalloc_vm. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: Ib9bd0f1bcb11386dff15399930e64cbcc898abfa (cherry picked from commit 0b854a0f845cd89823c525b057d19cdbd3d6d192)
* mm, vmalloc: only call setup_vmalloc_vm() only in __get_vm_area_node()Zhang Yanfei2019-05-021-10/+1
| | | | | | | | | | | | | | | | | | | | Now for insert_vmalloc_vm, it only calls the two functions: - setup_vmalloc_vm: fill vm_struct and vmap_area instances - clear_vm_unlist: clear VM_UNLIST bit in vm_struct->flags So in __get_vm_area_node(), if VM_UNLIST bit unset in flags, that is the else branch here, we don't need to clear VM_UNLIST bit for vm->flags since this bit is obviously not set. That is to say, we could only call setup_vmalloc_vm instead of insert_vmalloc_vm here. And then we could even remove the if test here. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: Ib1b5eb912cf6bd4d8ccec4afd2807679008c0140 (cherry picked from commit 32c0d2d011a868c8d607b1201df140a3a7b6ca5e)
* vmalloc: introduce remap_vmalloc_range_partialHATAYAMA Daisuke2019-05-021-22/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We want to allocate ELF note segment buffer on the 2nd kernel in vmalloc space and remap it to user-space in order to reduce the risk that memory allocation fails on system with huge number of CPUs and so with huge ELF note segment that exceeds 11-order block size. Although there's already remap_vmalloc_range for the purpose of remapping vmalloc memory to user-space, we need to specify user-space range via vma. Mmap on /proc/vmcore needs to remap range across multiple objects, so the interface that requires vma to cover full range is problematic. This patch introduces remap_vmalloc_range_partial that receives user-space range as a pair of base address and size and can be used for mmap on /proc/vmcore case. remap_vmalloc_range is rewritten using remap_vmalloc_range_partial. [akpm@linux-foundation.org: use PAGE_ALIGNED()] Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Lisa Mitchell <lisa.mitchell@hp.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: If11092672ab09b7d75d905f5e7276f2bee2415cb (cherry picked from commit 5c27777e9c15793fe55422bb475daf1e4eb5ddb6)
* vmalloc: make find_vm_area check in rangeHATAYAMA Daisuke2019-05-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | Currently, __find_vmap_area searches for the kernel VM area starting at a given address. This patch changes this behavior so that it searches for the kernel VM area to which the address belongs. This change is needed by remap_vmalloc_range_partial to be introduced in later patch that receives any position of kernel VM area as target address. This patch changes the condition (addr > va->va_start) to the equivalent (addr >= va->va_end) by taking advantage of the fact that each kernel VM area is non-overlapping. Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Lisa Mitchell <lisa.mitchell@hp.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: I3af544abd5fb67bf49f03a241618dcd24fca564e (cherry picked from commit 2e8110e8fad77bf3f81aedcaaf877ec42a590a39)
* mm: page_alloc: embed OOM killing naturally into allocation slowpathJohannes Weiner2019-05-021-48/+36
| | | | | | | | | | | | | | | | | | | | | | | | The OOM killing invocation does a lot of duplicative checks against the task's allocation context. Rework it to take advantage of the existing checks in the allocator slowpath. The OOM killer is invoked when the allocator is unable to reclaim any pages but the allocation has to keep looping. Instead of having a check for __GFP_NORETRY hidden in oom_gfp_allowed(), just move the OOM invocation to the true branch of should_alloc_retry(). The __GFP_FS check from oom_gfp_allowed() can then be moved into the OOM avoidance branch in __alloc_pages_may_oom(), along with the PF_DUMPCORE test. __alloc_pages_may_oom() can then signal to the caller whether the OOM killer was invoked, instead of requiring it to duplicate the order and high_zoneidx checks to guess this when deciding whether to continue. Change-Id: I49515bba44d0c99743d03d357f57a8d6e215aff2 Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: thaw the OOM victim if it is frozenMichal Hocko2019-05-021-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the victim. This is basically noop when the task is frozen though because the task sleeps in the uninterruptible sleep. The victim is eventually thawed later when oom_scan_process_thread meets the task again in a later OOM invocation so the OOM killer doesn't live lock. But this is less than optimal. Let's add __thaw_task into mark_tsk_oom_victim after we set TIF_MEMDIE to the victim. We are not checking whether the task is frozen because that would be racy and __thaw_task does that already. oom_scan_process_thread doesn't need to care about freezer anymore as TIF_MEMDIE and freezer are excluded completely now. Change-Id: I0e1e58f06dab168d2e94f2b475e93763e73101ec Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: add helpers for setting and clearing TIF_MEMDIEMichal Hocko2019-05-022-4/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patchset addresses a race which was described in the changelog for 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM suspend"): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Change-Id: If4188f68cdfb73f8cf2721f89c9739ec0a8f0e12 Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: don't count on mm-less current processTetsuo Handa2019-05-021-1/+5
| | | | | | | | | | | | | | | | | | | | | out_of_memory() doesn't trigger the OOM killer if the current task is already exiting or it has fatal signals pending, and gives the task access to memory reserves instead. However, doing so is wrong if out_of_memory() is called by an allocation (e.g. from exit_task_work()) after the current task has already released its memory and cleared TIF_MEMDIE at exit_mm(). If we again set TIF_MEMDIE to post-exit_mm() current task, the OOM killer will be blocked by the task sitting in the final schedule() waiting for its parent to reap it. It will trigger an OOM livelock if its parent is unable to reap it due to doing an allocation and waiting for the OOM killer to kill it. Change-Id: I8d23a0e41df8380ee55c368121ba3fec2b62fced Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: make sure that TIF_MEMDIE is set under task_lockMichal Hocko2019-05-021-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | OOM killer tries to exclude tasks which do not have mm_struct associated because killing such a task wouldn't help much. The OOM victim gets TIF_MEMDIE set to disable OOM killer while the current victim releases the memory and then enables the OOM killer again by dropping the flag. oom_kill_process is currently prone to a race condition when the OOM victim is already exiting and TIF_MEMDIE is set after the task releases its address space. This might theoretically lead to OOM livelock if the OOM victim blocks on an allocation later during exiting because it wouldn't kill any other process and the exiting one won't be able to exit. The situation is highly unlikely because the OOM victim is expected to release some memory which should help to sort out OOM situation. Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock which will serialize the OOM killer with exit_mm which sets task->mm to NULL. Setting the flag for current is not necessary because check and set is not racy. Change-Id: Ic251ce93e4e2311791a3ecb5a8718ceec9948c7f Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: patch mtk custom codeMoyster2019-05-021-1/+1
|
* oom: don't assume that a coredumping thread will exit soonOleg Nesterov2019-05-022-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | oom_kill.c assumes that PF_EXITING task should exit and free the memory soon. This is wrong in many ways and one important case is the coredump. A task can sleep in exit_mm() "forever" while the coredumping sub-thread can need more memory. Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account, we add the new trivial helper for that. Note: this is only the first step, this patch doesn't try to solve other problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can participate in coredump after it was already observed in PF_EXITING state, so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set. fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it. And even the name/usage of the new helper is confusing, an exiting thread can only free its ->mm if it is the only/last task in thread group. [akpm@linux-foundation.org: add comment] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: I6a69bf26a477f31e733ed6911da4f299d49cdfe1
* mm, oom: rename zonelist locking functionsDavid Rientjes2019-05-022-20/+16
| | | | | | | | | | | | | | | | | | | | | | | | try_set_zonelist_oom() and clear_zonelist_oom() are not named properly to imply that they require locking semantics to avoid out_of_memory() being reordered. zone_scan_lock is required for both functions to ensure that there is proper locking synchronization. Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper locking semantics. At the same time, convert oom_zonelist_trylock() to return bool instead of int since only success and failure are tested. Change-Id: Ide8f1efa2763212ffb2797979582c43251c89697 Signed-off-by: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, oom: ensure memoryless node zonelist always includes zonesDavid Rientjes2019-05-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | With memoryless node support being worked on, it's possible that for optimizations that a node may not have a non-NULL zonelist. When CONFIG_NUMA is enabled and node 0 is memoryless, this means the zonelist for first_online_node may become NULL. The oom killer requires a zonelist that includes all memory zones for the sysrq trigger and pagefault out of memory handler. Ensure that a non-NULL zonelist is always passed to the oom killer. [akpm@linux-foundation.org: fix non-numa build] Signed-off-by: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: Ic93a32abbf875a331244a0f95f2a0bcd0764360b
* mm, oom: prefer thread group leaders for display purposesDavid Rientjes2019-05-022-11/+20
| | | | | | | | | | | | | | | | | | | | When two threads have the same badness score, it's preferable to kill the thread group leader so that the actual process name is printed to the kernel log rather than the thread group name which may be shared amongst several processes. This was the behavior when select_bad_process() used to do for_each_process(), but it now iterates threads instead and leads to ambiguity. Change-Id: I71ebfbeb4b030d3b4cc342d68af70526e29fd8d4 Signed-off-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, oom: remove unnecessary exit_state checkDavid Rientjes2019-05-021-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | The oom killer scans each process and determines whether it is eligible for oom kill or whether the oom killer should abort because of concurrent memory freeing. It will abort when an eligible process is found to have TIF_MEMDIE set, meaning it has already been oom killed and we're waiting for it to exit. Processes with task->mm == NULL should not be considered because they are either kthreads or have already detached their memory and killing them would not lead to memory freeing. That memory is only freed after exit_mm() has returned, however, and not when task->mm is first set to NULL. Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process is no longer considered for oom kill, but only until exit_mm() has returned. This was fragile in the past because it relied on exit_notify() to be reached before no longer considering TIF_MEMDIE processes. Change-Id: I072465956d32c33a3d08975849a0adbe43ee4389 Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* shmem: fix buildMoyster2019-05-021-1/+1
|
* tmpfs: fix uninitialized return value in shmem_linkDarrick J. Wong2019-05-021-1/+1
| | | | | | | | | | | | | | | | | When we made the shmem_reserve_inode call in shmem_link conditional, we forgot to update the declaration for ret so that it always has a known value. Dan Carpenter pointed out this deficiency in the original patch. Fixes: 1062af920c07 ("tmpfs: fix link accounting when a tmpfile is linked in") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Matej Kupljen <matej.kupljen@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 29b00e609960ae0fcff382f4c7079dd0874a5311) Change-Id: I648253008c7977fabb9d04655c11f99a536f43ca
* tmpfs: fix link accounting when a tmpfile is linked inDarrick J. Wong2019-05-021-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tmpfs has a peculiarity of accounting hard links as if they were separate inodes: so that when the number of inodes is limited, as it is by default, a user cannot soak up an unlimited amount of unreclaimable dcache memory just by repeatedly linking a file. But when v3.11 added O_TMPFILE, and the ability to use linkat() on the fd, we missed accommodating this new case in tmpfs: "df -i" shows that an extra "inode" remains accounted after the file is unlinked and the fd closed and the actual inode evicted. If a user repeatedly links tmpfiles into a tmpfs, the limit will be hit (ENOSPC) even after they are deleted. Just skip the extra reservation from shmem_link() in this case: there's a sense in which this first link of a tmpfile is then cheaper than a hard link of another file, but the accounting works out, and there's still good limiting, so no need to do anything more complicated. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1902182134370.7035@eggly.anvils Fixes: f4e0c30c191 ("allow the temp files created by open() to be linked to") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Matej Kupljen <matej.kupljen@gmail.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 1062af920c07f5b54cf5060fde3339da6df0cf6b) Change-Id: I194ccdf709b1a4a385e00e07f97085521bcabdc1
* mremap: properly flush TLB before releasing the pageLinus Torvalds2019-05-022-6/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit eb66ae030829605d61fbef1909ce310e29f78821 upstream. This is a backport to stable 3.18.y, based on Will Deacon's 4.4.y backport. Jann Horn points out that our TLB flushing was subtly wrong for the mremap() case. What makes mremap() special is that we don't follow the usual "add page to list of pages to be freed, then flush tlb, and then free pages". No, mremap() obviously just _moves_ the page from one page table location to another. That matters, because mremap() thus doesn't directly control the lifetime of the moved page with a freelist: instead, the lifetime of the page is controlled by the page table locking, that serializes access to the entry. As a result, we need to flush the TLB not just before releasing the lock for the source location (to avoid any concurrent accesses to the entry), but also before we release the destination page table lock (to avoid the TLB being flushed after somebody else has already done something to that page). This also makes the whole "need_flush" logic unnecessary, since we now always end up flushing the TLB for every valid entry. Bug: 118836219 Reported-and-tested-by: Jann Horn <jannh@google.com> Acked-by: Will Deacon <will.deacon@arm.com> Tested-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [will: backport to 4.4 stable] Signed-off-by: Will Deacon <will.deacon@arm.com> [ghackmann@google.com: adjust context] Signed-off-by: Greg Hackmann <ghackmann@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: I653b28b6c2fd6ec00e4b0be2b3289dcab1dcc4b1 Signed-off-by: Greg Hackmann <ghackmann@google.com>
* it's still short a few helpers, but infrastructure should be OK now...Al Viro2019-05-021-0/+32
| | | | | Change-Id: I0adb8fe9c5029bad3ac52629003c3b78e9442936 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* mm: vmpressure.c mismergeMoyster2018-12-111-1/+1
|
* arm, pm, vmpressure: add missing slab.h includesTejun Heo2018-12-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | arch/arm/mach-tegra/pm.c, kernel/power/console.c and mm/vmpressure.c were somehow getting slab.h indirectly through cgroup.h which in turn was getting it indirectly through xattr.h. A scheduled cgroup change drops xattr.h inclusion from cgroup.h and breaks compilation of these three files. Add explicit slab.h includes to the three files. A pending cgroup patch depends on this change and it'd be great if this can be routed through cgroup/for-3.14-fixes branch. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Stephen Warren <swarren@wwwdotorg.org> Cc: Thierry Reding <thierry.reding@gmail.com> Cc: linux-tegra@vger.kernel.org Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linux-pm@vger.kernel.org Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: cgroups@vger.kernel.org Signed-off-by: Joe Maples <joe@frap129.org>
* mm: vmpressure: dynamic window sizing.Martijn Coenen2018-12-112-32/+102
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The window size used for calculating vm pressure events was previously fixed at 512 pages. The window size has a big impact on the rate of notifications sent off to userspace, in particular when using the "low" level. On machines with a lot of memory, the current value is likely excessive. On the other hand, if the window size is too big, we might delay memory pressure events for too long, especially at critical levels of memory pressure. This patch attempts to address that problem with two changes. The first change is to calculate the window size based on the machine size, quite similar to how the vm watermarks are being calculated. This reduces the chance of false positives at any pressure level. Using the machine size only makes sense on the root cgroup though; for non-root cgroups, their hard memory limit is used to calculate the window size. If no hard memory limit is set, we fall back to the default window size that was previously used. The second change is based on an idea from Johannes Weiner, to only report medium and low pressure levels for every X windows that we scan. This reduces the frequency with which we report low/medium pressure levels, but at the same time will still report critical memory pressure immediately. Change-Id: Ieffca055b2fb6aa27ae0179e0a588e6fcb173a61 Signed-off-by: Martijn Coenen <maco@google.com>
* mm, oom: make dump_tasks publicLiam Mark2018-12-011-1/+1
| | | | | | | | Allow other functions to dump the list of tasks. Useful for when debugging memory leaks. Change-Id: I76c33a118a9765b4c2276e8c76de36399c78dbf6 Signed-off-by: Liam Mark <lmark@codeaurora.org>
* CHROMIUM: DROP: mm/oom_kill: Avoid deadlock; allow multiple victimsDouglas Anderson2018-12-011-4/+27
| | | | | | | | | | | | | | | | | | | | | DROP THIS PATCH ON REBASE. It is part of a patch series that attempts to solve similar problems to the OOM reaper upstream. NOTE that the OOM reaper patches weren't backported because mm patches in general are fairly intertwined and picking the OOM reaper without an mm expert and lots of careful testing could cause new mysterious problems. Compared to the OOM reaper, this patch series: + Is simpler to reason about. + Is not very intertwined to the rest of the system. + Can be fairly easily reasoned about to see that it should do no harm, even if it doesn't fix every problem. - Can more easily get into the state where all memory reserves are gone, since it allows more than one task to be in "MEMDIE". - Doesn't free memory as quickly. The reaper kills anonymous / swapped memory even before the doomed task exits. - Contains the magic "100 ms" heuristic, which is non-ideal.
* CHROMIUM: DROP: mm/oom_kill: Double-check before killing a child in our placeDouglas Anderson2018-12-011-0/+9
| | | | | | | | | | | | | | | | | | | | | DROP THIS PATCH ON REBASE. It is part of a patch series that attempts to solve similar problems to the OOM reaper upstream. NOTE that the OOM reaper patches weren't backported because mm patches in general are fairly intertwined and picking the OOM reaper without an mm expert and lots of careful testing could cause new mysterious problems. Compared to the OOM reaper, this patch series: + Is simpler to reason about. + Is not very intertwined to the rest of the system. + Can be fairly easily reasoned about to see that it should do no harm, even if it doesn't fix every problem. - Can more easily get into the state where all memory reserves are gone, since it allows more than one task to be in "MEMDIE". - Doesn't free memory as quickly. The reaper kills anonymous / swapped memory even before the doomed task exits. - Contains the magic "100 ms" heuristic, which is non-ideal.
* BACKPORT: mm: oom_kill: don't ignore oom score on exiting tasksJohannes Weiner2018-12-011-8/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the OOM killer scans tasks and encounters a PF_EXITING one, it force-selects that task regardless of the score. The problem is that if that task got stuck waiting for some state the allocation site is holding, the OOM reaper can not move on to the next best victim. Frankly, I don't even know why we check for exiting tasks in the OOM killer. We've tried direct reclaim at least 15 times by the time we decide the system is OOM, there was plenty of time to exit and free memory; and a task might exit voluntarily right after we issue a kill. This is testing pure noise. Remove it. BUG=chromium:706048, chromium:702707 TEST=With whole series, eating memory doesn't hang system Conflicts: mm/oom_kill.c There were extra checked in 3.14, but the concept is still the same. Change-Id: Ied8391fd2d8cda4d1728a28dbe6cfea31883b24e Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Douglas Anderson <dianders@chromium.org> (cherry picked from commit 6a618957ad17d8f4f4c7eeede752685374b1b176) Reviewed-on: https://chromium-review.googlesource.com/465411 Reviewed-by: Guenter Roeck <groeck@chromium.org> Signed-off-by: Joe Maples <joe@frap129.org>
* mm: add CMA page allocationfire8552018-11-302-5/+21
| | | | needed for new ion driver
* Revert "GCC: Fix up for gcc 5+"Moyster2018-11-302-2/+1
| | | | This reverts commit ff505baaf412985af758d5820cd620ed9f1a7e05.
* Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds2018-11-2911-11/+11
| | | | | | | | | | | | | | This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Moyster <oysterized@gmail.com>
* GCC: Fix up for gcc 5+mydongistiny2018-11-292-1/+2
| | | | | Signed-off-by: mydongistiny <jaysonedson@gmail.com> Signed-off-by: Mister Oyster <oysterized@gmail.com>
* panic on !PageSlab && !PageCompound in ksizeDaniel Micay2018-05-161-1/+1
|
* panic on kmem_cache_free with the wrong cacheDaniel Micay2018-05-161-4/+2
|
* always perform cache_from_obj sanity checksDaniel Micay2018-05-161-10/+0
|
* real slab_equal_or_root check for !MEMCG_KMEMDaniel Micay2018-05-161-1/+1
|
* add missing cache_from_obj !PageSlab checkDaniel Micay2018-05-161-0/+1
| | | | Taken from PaX.
* add slub free list XOR encryptionDaniel Micay2018-05-161-4/+15
| | | | Based on the grsecurity feature, but with a per-cache random value.
* add slub sanitizationDaniel Micay2018-05-161-0/+7
|
* add page sanitization / verificationDaniel Micay2018-05-161-2/+10
|
* BACKPORT: [UPSTREAM] mm: new shrinker APIDave Chinner2017-12-311-20/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | (Cherry-pick from commit 24f7c6b981fb70084757382da464ea85d72af300) The current shrinker callout API uses an a single shrinker call for multiple functions. To determine the function, a special magical value is passed in a parameter to change the behaviour. This complicates the implementation and return value specification for the different behaviours. Separate the two different behaviours into separate operations, one to return a count of freeable objects in the cache, and another to scan a certain number of objects in the cache for freeing. In defining these new operations, ensure the return values and resultant behaviours are clearly defined and documented. Modify shrink_slab() to use the new API and implement the callouts for all the existing shrinkers. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* readahead: Fix an error (thx ramgear)hellsgod2017-12-261-2/+2
| | | | Signed-off-by: Joe Maples <joe@frap129.org>