aboutsummaryrefslogtreecommitdiff
path: root/mm/page_alloc.c
Commit message (Collapse)AuthorAgeFilesLines
* mm/oom_kill: squashed reverts to a stable stateCorinna Vinschen2019-07-191-6/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Revert "mm, oom: fix use-after-free in oom_kill_process" This reverts commit e1bebdeedb497f03d426c85a89c3807c7e75268d. Signed-off-by: Corinna Vinschen <xda@vinschen.de> Revert "mm,oom: make oom_killer_disable() killable" This reverts commit 65a7400a432639aa8d5e572f30687fbca204b6f8. Signed-off-by: Corinna Vinschen <xda@vinschen.de> Revert "mm: oom_kill: don't ignore oom score on exiting tasks" This reverts commit d60dae46b27a8f381e4a7ad9dde870faa49fa5f1. Signed-off-by: Corinna Vinschen <xda@vinschen.de> Revert "mm/oom_kill.c: avoid attempting to kill init sharing same memory" This reverts commit 10773c0325259d6640b93c0694b5598ddf84939f. Signed-off-by: Corinna Vinschen <xda@vinschen.de> Revert "CHROMIUM: DROP: mm/oom_kill: Double-check before killing a child in our place" This reverts commit 2bdd9a2042a0e12d96c545773d9d8038c920f813. Revert "mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process()" This reverts commit 419a313435b31821e4d045ca4b7ea1cc5fa02035. Signed-off-by: Corinna Vinschen <xda@vinschen.de> Revert "mm/oom_kill: cleanup the "kill sharing same memory" loop" This reverts commit afda78c6de38f9f66eba0955153b380d540d8276. Revert "mm/oom_kill: remove the wrong fatal_signal_pending() check in oom_kill_process()" This reverts commit acde9c2ace298b249c06ec5b0b971c333449dc09. Signed-off-by: Corinna Vinschen <xda@vinschen.de> Revert "mm, oom: remove task_lock protecting comm printing" This reverts commit 9a9ca142d250ec9de1215284857f4528c6ddb080. Signed-off-by: Corinna Vinschen <xda@vinschen.de> Revert "mm/oom_kill.c: suppress unnecessary "sharing same memory" message" This reverts commit 1aa2960f7c70d65b1481f805ac73b988faff6747. Signed-off-by: Corinna Vinschen <xda@vinschen.de> Revert "mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL" This reverts commit f028aedfcfd2e2bb98921b98d3ae183387ab8fed. Revert "mm, oom: remove unnecessary variable" This reverts commit 54b0b58224146d68a11bccb5e64683ab3029373a. Revert "mm/oom_kill.c: print points as unsigned int" This reverts commit 603f975a6d4f0b56c7f6df7889ef2a704eca94a3. Signed-off-by: Corinna Vinschen <xda@vinschen.de> Revert "mm: oom_kill: simplify OOM killer locking" This reverts commit 7951a52ed35d162063fa08b27894e302fd716ccd. Revert "mm: oom_kill: remove unnecessary locking in exit_oom_victim()" This reverts commit f0739b25ac884682865d6aae7485e79489107bfb. Revert "mm: oom_kill: generalize OOM progress waitqueue" This reverts commit eb4b1243c72ba0b392bbe05dbf9f91959f70eb18. Revert "mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear" This reverts commit e611f16275c3642cb8a6345ff2470926fef52110. Revert "mm: oom_kill: clean up victim marking and exiting interfaces" This reverts commit c6fada01b9370e3d7603b4ad8c26b56759174667. Revert "mm: oom_kill: remove unnecessary locking in oom_enable()" This reverts commit 5dd152d7351b3805f59b2b1f624722ab2f3c5fd8. Revert "oom, PM: make OOM detection in the freezer path raceless" This reverts commit 5fc5b1ddee5404a7629dd7045f54eaf8941bc11c.
* mm: oom_kill: simplify OOM killer lockingJohannes Weiner2019-07-081-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The zonelist locking and the oom_sem are two overlapping locks that are used to serialize global OOM killing against different things. The historical zonelist locking serializes OOM kills from allocations with overlapping zonelists against each other to prevent killing more tasks than necessary in the same memory domain. Only when neither tasklists nor zonelists from two concurrent OOM kills overlap (tasks in separate memcgs bound to separate nodes) are OOM kills allowed to execute in parallel. The younger oom_sem is a read-write lock to serialize OOM killing against the PM code trying to disable the OOM killer altogether. However, the OOM killer is a fairly cold error path, there is really no reason to optimize for highly performant and concurrent OOM kills. And the oom_sem is just flat-out redundant. Replace both locking schemes with a single global mutex serializing OOM kills regardless of context. Change-Id: Ieb0b621bc3a391cc0a826a3ae53bf28ea4a8dbe5 Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom, PM: make OOM detection in the freezer path racelessMichal Hocko2019-07-081-15/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM suspend") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, "We used to have freezing points deep in file system code which may be reacheable from page fault." so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Change-Id: Ie72c2cfc39dad6420802b873053c739e804f956f Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_alloc: remove more mtk bitsMoyster2019-05-031-7/+0
|
* mm: adjust page migration heuristicTim Murray2019-05-031-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The page allocator's heuristic to decide when to migrate page blocks to unmovable seems to have been tuned on architectures that do not have kernel drivers that would make unmovable allocations of several megabytes or greater--ie, no cameras or shared-memory GPUs. The number of allocations from these drivers may be unbounded and may occupy a significant percentage of overall system memory (>50%). As a result, every Android device has suffered to some extent from increasing fragmentation due to unmovable page block migration over time. This change adjusts the page migration heuristic to only migrate page blocks for unmovable allocations when the order of the requested allocation is order-5 or greater. This prevents migration due to GPU and ion allocations so long as kernel drivers allocate memory at runtime using order-4 or smaller pages. Experimental results running the Android longevity test suite on a Nexus 5X for 10 hours: old heuristic: 116 unmovable blocks after boot -> 281 unmovable blocks new heuristic: 105 unmovable blocks after boot -> 101 unmovable blocks bug 26916944 Change-Id: I5b7ccbbafa4049a2f47f399df4cb4779689f4c40 (cherry picked from commit f0e444d2ebab56eedc22fdc3d5376e41e66cce6c)
* mm: always steal split buddies in fallback allocationsVlastimil Babka2019-05-031-40/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When allocation falls back to another migratetype, it will steal a page with highest available order, and (depending on this order and desired migratetype), it might also steal the rest of free pages from the same pageblock. Given the preference of highest available order, it is likely that it will be higher than the desired order, and result in the stolen buddy page being split. The remaining pages after split are currently stolen only when the rest of the free pages are stolen. This can however lead to situations where for MOVABLE allocations we split e.g. order-4 fallback UNMOVABLE page, but steal only order-0 page. Then on the next MOVABLE allocation (which may be batched to fill the pcplists) we split another order-3 or higher page, etc. By stealing all pages that we have split, we can avoid further stealing. This patch therefore adjusts the page stealing so that buddy pages created by split are always stolen. This has effect only on MOVABLE allocations, as RECLAIMABLE and UNMOVABLE allocations already always do that in addition to stealing the rest of free pages from the pageblock. The change also allows to simplify try_to_steal_freepages() and factor out CMA handling. According to Mel, it has been intended since the beginning that buddy pages after split would be stolen always, but it doesn't seem like it was ever the case until commit 47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added"). The commit has unintentionally introduced this behavior, but was reverted by commit 0cbef29a7821 ("mm: __rmqueue_fallback() should respect pageblock type"). Neither included evaluation. My evaluation with stress-highalloc from mmtests shows about 2.5x reduction of page stealing events for MOVABLE allocations, without affecting the page stealing events for other allocation migratetypes. Change-Id: I2c5b1a7fd01fc080efb689da07d380abd0e030ee Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Corinna Vinschen <xda@vinschen.de>
* mm/page_alloc: prevent MIGRATE_RESERVE pages from being misplacedVlastimil Babka2019-05-031-13/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For the MIGRATE_RESERVE pages, it is useful when they do not get misplaced on free_list of other migratetype, otherwise they might get allocated prematurely and e.g. fragment the MIGRATE_RESEVE pageblocks. While this cannot be avoided completely when allocating new MIGRATE_RESERVE pageblocks in min_free_kbytes sysctl handler, we should prevent the misplacement where possible. Currently, it is possible for the misplacement to happen when a MIGRATE_RESERVE page is allocated on pcplist through rmqueue_bulk() as a fallback for other desired migratetype, and then later freed back through free_pcppages_bulk() without being actually used. This happens because free_pcppages_bulk() uses get_freepage_migratetype() to choose the free_list, and rmqueue_bulk() calls set_freepage_migratetype() with the *desired* migratetype and not the page's original MIGRATE_RESERVE migratetype. This patch fixes the problem by moving the call to set_freepage_migratetype() from rmqueue_bulk() down to __rmqueue_smallest() and __rmqueue_fallback() where the actual page's migratetype (e.g. from which free_list the page is taken from) is used. Note that this migratetype might be different from the pageblock's migratetype due to freepage stealing decisions. This is OK, as page stealing never uses MIGRATE_RESERVE as a fallback, and also takes care to leave all MIGRATE_CMA pages on the correct freelist. Therefore, as an additional benefit, the call to get_pageblock_migratetype() from rmqueue_bulk() when CMA is enabled, can be removed completely. This relies on the fact that MIGRATE_CMA pageblocks are created only during system init, and the above. The related is_migrate_isolate() check is also unnecessary, as memory isolation has other ways to move pages between freelists, and drain pcp lists containing pages that should be isolated. The buffered_rmqueue() can also benefit from calling get_freepage_migratetype() instead of get_pageblock_migratetype(). Change-Id: I045fc4b3a1e25ea217453abe54f849714cc37d5c Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Yong-Taek Lee <ytk.lee@samsung.com> Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Minchan Kim <minchan@kernel.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: "Wang, Yalin" <Yalin.Wang@sonymobile.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Corinna Vinschen <xda@vinschen.de>
* mm: more aggressive page stealing for UNMOVABLE allocationsVlastimil Babka2019-05-031-8/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When allocation falls back to stealing free pages of another migratetype, it can decide to steal extra pages, or even the whole pageblock in order to reduce fragmentation, which could happen if further allocation fallbacks pick a different pageblock. In try_to_steal_freepages(), one of the situations where extra pages are stolen happens when we are trying to allocate a MIGRATE_RECLAIMABLE page. However, MIGRATE_UNMOVABLE allocations are not treated the same way, although spreading such allocation over multiple fallback pageblocks is arguably even worse than it is for RECLAIMABLE allocations. To minimize fragmentation, we should minimize the number of such fallbacks, and thus steal as much as is possible from each fallback pageblock. Note that in theory this might put more pressure on movable pageblocks and cause movable allocations to steal back from unmovable pageblocks. However, movable allocations are not as aggressive with stealing, and do not cause permanent fragmentation, so the tradeoff is reasonable, and evaluation seems to support the change. This patch thus adds a check for MIGRATE_UNMOVABLE to the decision to steal extra free pages. When evaluating with stress-highalloc from mmtests, this has reduced the number of MIGRATE_UNMOVABLE fallbacks to roughly 1/6. The number of these fallbacks stealing from MIGRATE_MOVABLE block is reduced to 1/3. There was no observation of growing number of unmovable pageblocks over time, and also not of increased movable allocation fallbacks. Change-Id: I61b1be192ca2374350800181d74f34dcfa9e2cff Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Corinna Vinschen <xda@vinschen.de>
* mm/page_alloc: missing argument in move_freepages_blockMoyster2019-05-031-1/+1
|
* mm/page_allo.c: restructure free-page stealing code and fix a bugSrivatsa S. Bhat2019-05-031-43/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The free-page stealing code in __rmqueue_fallback() is somewhat hard to follow, and has an incredible amount of subtlety hidden inside! First off, there is a minor bug in the reporting of change-of-ownership of pageblocks. Under some conditions, we try to move upto 'pageblock_nr_pages' no. of pages to the preferred allocation list. But we change the ownership of that pageblock to the preferred type only if we manage to successfully move atleast half of that pageblock (or if page_group_by_mobility_disabled is set). However, the current code ignores the latter part and sets the 'migratetype' variable to the preferred type, irrespective of whether we actually changed the pageblock migratetype of that block or not. So, the page_alloc_extfrag tracepoint can end up printing incorrect info (i.e., 'change_ownership' might be shown as 1 when it must have been 0). So fixing this involves moving the update of the 'migratetype' variable to the right place. But looking closer, we observe that the 'migratetype' variable is used subsequently for checks such as "is_migrate_cma()". Obviously the intent there is to check if the *fallback* type is MIGRATE_CMA, but since we already set the 'migratetype' variable to start_migratetype, we end up checking if the *preferred* type is MIGRATE_CMA!! To make things more interesting, this actually doesn't cause a bug in practice, because we never change *anything* if the fallback type is CMA. So, restructure the code in such a way that it is trivial to understand what is going on, and also fix the above mentioned bug. And while at it, also add a comment explaining the subtlety behind the migratetype used in the call to expand(). [akpm@linux-foundation.org: remove unneeded `inline', small coding-style fix] Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan@kernel.org> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: I2e84c3b2a45dc063402117dd74179585caa7234c Signed-off-by: Corinna Vinschen <xda@vinschen.de>
* mm/page_alloc: remove mtk custom bitsMoyster2019-05-031-5/+0
|
* mm: fix cma accounting in zone_watermark_okVinayak Menon2019-05-031-7/+11
| | | | | | | | | | | | | | | | | | Some cases were reported where atomic unmovable allocations of order 2 fails, but kswapd does not wakeup. And in such cases it was seen that, when zone_watermark_ok check is performed to decide whether to wake up kswapd, there were lot of CMA pages of order 2 and above. This makes the watermark check succeed resulting in kswapd not being woken up. But since these atomic unmovable allocations can't come from CMA region, further atomic allocations keeps failing, without kswapd trying to reclaim. Usually concurrent movable allocations result in reclaim and improves the situtation, but the case reported was from a network test which was resulting in only atomic skb allocations being attempted. Change-Id: If953b8a8cfb0a5caa1fb63d3c032b194942f8091 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Prakash Gupta <guptap@codeaurora.org> (cherry picked from commit ea934a2665d641ca879b2c374d06da64c832f00a)
* mm: add zone counter for cma pagesVinayak Menon2019-05-031-6/+28
| | | | | | | | | | | | Add per free area nr_free_cma counter. The idea is to also track the number of cma pages present in free pages. This will be used in later patches to fix issues with zone_watermark_ok. Change-Id: I97da9d2f3642db56fc541c48ab56a7ce78e2333c Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Prakash Gupta <guptap@codeaurora.org> (cherry picked from commit a147305588507b1a241af87f1006c5d0b30beade)
* mm/page_alloc.c: use '__paginginit' instead of '__init'Chen Gang2019-05-021-2/+2
| | | | | | | | | | | | | | | | | | | | | set_pageblock_order() may be called when memory hotplug, so need use '__paginginit' instead of '__init'. The related warning: The function __meminit .free_area_init_node() references a function __init .set_pageblock_order(). If .set_pageblock_order is only used by .free_area_init_node then annotate .set_pageblock_order with a matching annotation. Change-Id: I982ee702a2ff92670cf386cabcc47fdfd3de8180 Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: 15ca220e1a63af06e000691e4ae1beaba5430c32 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Laura Abbott <lauraa@codeaurora.org> (cherry picked from commit 7f00b507f3dce0c81ba3a4a189ee836564016804)
* mm: use a dedicated lock to protect totalram_pages and zone->managed_pagesJiang Liu2019-05-021-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently lock_memory_hotplug()/unlock_memory_hotplug() are used to protect totalram_pages and zone->managed_pages. Other than the memory hotplug driver, totalram_pages and zone->managed_pages may also be modified at runtime by other drivers, such as Xen balloon, virtio_balloon etc. For those cases, memory hotplug lock is a little too heavy, so introduce a dedicated lock to protect totalram_pages and zone->managed_pages. Now we have a simplified locking rules totalram_pages and zone->managed_pages as: 1) no locking for read accesses because they are unsigned long. 2) no locking for write accesses at boot time in single-threaded context. 3) serialize write accesses at runtime by acquiring the dedicated managed_page_count_lock. Also adjust zone->managed_pages when freeing reserved pages into the buddy system, to keep totalram_pages and zone->managed_pages in consistence. [akpm@linux-foundation.org: don't export adjust_managed_page_count to modules (for now)] Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: <sworddragon2@aol.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: c3d5f5f0c2bc4eabeaf49f1a21e1aeb965246cd2 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [imaund@codeaurora.org: resolve merge conflcits] Signed-off-by: Ian Maund <imaund@codeaurora.org> Change-Id: I4ab46ac402c57b079f1680da1c8c119663060a72 (cherry picked from commit e238160acbf97306e1283ec921c6db5c54af359b)
* mm: page_alloc: embed OOM killing naturally into allocation slowpathJohannes Weiner2019-05-021-48/+36
| | | | | | | | | | | | | | | | | | | | | | | | The OOM killing invocation does a lot of duplicative checks against the task's allocation context. Rework it to take advantage of the existing checks in the allocator slowpath. The OOM killer is invoked when the allocator is unable to reclaim any pages but the allocation has to keep looping. Instead of having a check for __GFP_NORETRY hidden in oom_gfp_allowed(), just move the OOM invocation to the true branch of should_alloc_retry(). The __GFP_FS check from oom_gfp_allowed() can then be moved into the OOM avoidance branch in __alloc_pages_may_oom(), along with the PF_DUMPCORE test. __alloc_pages_may_oom() can then signal to the caller whether the OOM killer was invoked, instead of requiring it to duplicate the order and high_zoneidx checks to guess this when deciding whether to continue. Change-Id: I49515bba44d0c99743d03d357f57a8d6e215aff2 Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, oom: rename zonelist locking functionsDavid Rientjes2019-05-021-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | try_set_zonelist_oom() and clear_zonelist_oom() are not named properly to imply that they require locking semantics to avoid out_of_memory() being reordered. zone_scan_lock is required for both functions to ensure that there is proper locking synchronization. Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper locking semantics. At the same time, convert oom_zonelist_trylock() to return bool instead of int since only success and failure are tested. Change-Id: Ide8f1efa2763212ffb2797979582c43251c89697 Signed-off-by: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: add CMA page allocationfire8552018-11-301-0/+20
| | | | needed for new ion driver
* add page sanitization / verificationDaniel Micay2018-05-161-2/+10
|
* arm64: use the new *_relaxed macros for lower power usagefranciscofranco2017-12-251-3/+3
| | | | | | Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com> Signed-off-by: Joe Maples <joe@frap129.org> Signed-off-by: Mister Oyster <oysterized@gmail.com>
* mm: page_alloc: add kasan hooks on alloc and free pathsAndrey Ryabinin2017-12-181-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | Add kernel address sanitizer hooks to mark allocated page's addresses as accessible in corresponding shadow region. Mark freed pages as inaccessible. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrey Konovalov <adech.fo@gmail.com> Cc: Yuri Gribov <tetra2005@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_alloc: Remove kernel address exposure in free_reserved_area()Josh Poimboeuf2017-11-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit adb1fe9ae2ee6ef6bc10f3d5a588020e7664dfa7 upstream. Linus suggested we try to remove some of the low-hanging fruit related to kernel address exposure in dmesg. The only leaks I see on my local system are: Freeing SMP alternatives memory: 32K (ffffffff9e309000 - ffffffff9e311000) Freeing initrd memory: 10588K (ffffa0b736b42000 - ffffa0b737599000) Freeing unused kernel memory: 3592K (ffffffff9df87000 - ffffffff9e309000) Freeing unused kernel memory: 1352K (ffffa0b7288ae000 - ffffa0b728a00000) Freeing unused kernel memory: 632K (ffffa0b728d62000 - ffffa0b728e00000) Linus says: "I suspect we should just remove [the addresses in the 'Freeing' messages]. I'm sure they are useful in theory, but I suspect they were more useful back when the whole "free init memory" was originally done. These days, if we have a use-after-free, I suspect the init-mem situation is the easiest situation by far. Compared to all the dynamic allocations which are much more likely to show it anyway. So having debug output for that case is likely not all that productive." With this patch the freeing messages now look like this: Freeing SMP alternatives memory: 32K Freeing initrd memory: 10588K Freeing unused kernel memory: 3592K Freeing unused kernel memory: 1352K Freeing unused kernel memory: 632K Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/6836ff90c45b71d38e5d4405aec56fa9e5d1d4b2.1477405374.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
* gather extra early boot entropy like PaXDaniel Micay2017-10-141-0/+11
|
* mm/init: fix zone boundary creationOliver O'Halloran2017-06-171-7/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 90cae1fe1c3540f791d5b8e025985fa5e699b2bb upstream. As a part of memory initialisation the architecture passes an array to free_area_init_nodes() which specifies the max PFN of each memory zone. This array is not necessarily monotonic (due to unused zones) so this array is parsed to build monotonic lists of the min and max PFN for each zone. ZONE_MOVABLE is special cased here as its limits are managed by the mm subsystem rather than the architecture. Unfortunately, this special casing is broken when ZONE_MOVABLE is the not the last zone in the zone list. The core of the issue is: if (i == ZONE_MOVABLE) continue; arch_zone_lowest_possible_pfn[i] = arch_zone_highest_possible_pfn[i-1]; As ZONE_MOVABLE is skipped the lowest_possible_pfn of the next zone will be set to zero. This patch fixes this bug by adding explicitly tracking where the next zone should start rather than relying on the contents arch_zone_highest_possible_pfn[]. Thie is low priority. To get bitten by this you need to enable a zone that appears after ZONE_MOVABLE in the zone_type enum. As far as I can tell this means running a kernel with ZONE_DEVICE or ZONE_CMA enabled, so I can't see this affecting too many people. I only noticed this because I've been fiddling with ZONE_DEVICE on powerpc and 4.6 broke my test kernel. This bug, in conjunction with the changes in Taku Izumi's kernelcore=mirror patch (d91749c1dda71) and powerpc being the odd architecture which initialises max_zone_pfn[] to ~0ul instead of 0 caused all of system memory to be placed into ZONE_DEVICE at boot, followed a panic since device memory cannot be used for kernel allocations. I've already submitted a patch to fix the powerpc specific bits, but I figured this should be fixed too. Link: http://lkml.kernel.org/r/1462435033-15601-1-git-send-email-oohall@gmail.com Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Cc: Anton Blanchard <anton@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Willy Tarreau <w@1wt.eu>
* memblock, numa: binary search node idYinghai Lu2017-05-241-10/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | Current early_pfn_to_nid() on arch that support memblock go over memblock.memory one by one, so will take too many try near the end. We can use existing memblock_search to find the node id for given pfn, that could save some time on bigger system that have many entries memblock.memory array. Here are the timing differences for several machines. In each case with the patch less time was spent in __early_pfn_to_nid(). 3.11-rc5 with patch difference (%) -------- ---------- -------------- UV1: 256 nodes 9TB: 411.66 402.47 -9.19 (2.23%) UV2: 255 nodes 16TB: 1141.02 1138.12 -2.90 (0.25%) UV2: 64 nodes 2TB: 128.15 126.53 -1.62 (1.26%) UV2: 32 nodes 2TB: 121.87 121.07 -0.80 (0.66%) Time in seconds. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Acked-by: Russ Anderson <rja@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* mm: disable zone_reclaim_mode by defaultMel Gorman2017-05-241-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When it was introduced, zone_reclaim_mode made sense as NUMA distances punished and workloads were generally partitioned to fit into a NUMA node. NUMA machines are now common but few of the workloads are NUMA-aware and it's routine to see major performance degradation due to zone_reclaim_mode being enabled but relatively few can identify the problem. Those that require zone_reclaim_mode are likely to be able to detect when it needs to be enabled and tune appropriately so lets have a sensible default for the bulk of users. This patch (of 2): zone_reclaim_mode causes processes to prefer reclaiming memory from local node instead of spilling over to other nodes. This made sense initially when NUMA machines were almost exclusively HPC and the workload was partitioned into nodes. The NUMA penalties were sufficiently high to justify reclaiming the memory. On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default. Users that are sophisticated enough to know they need zone_reclaim_mode will detect it. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* topology: add support for node_to_mem_node() to determine the fallback nodeJoonsoo Kim2016-09-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Anton noticed (http://www.spinics.net/lists/linux-mm/msg67489.html) that on ppc LPARs with memoryless nodes, a large amount of memory was consumed by slabs and was marked unreclaimable. He tracked it down to slab deactivations in the SLUB core when we allocate remotely, leading to poor efficiency always when memoryless nodes are present. After much discussion, Joonsoo provided a few patches that help significantly. They don't resolve the problem altogether: - memory hotplug still needs testing, that is when a memoryless node becomes memory-ful, we want to dtrt - there are other reasons for going off-node than memoryless nodes, e.g., fully exhausted local nodes Neither case is resolved with this series, but I don't think that should block their acceptance, as they can be explored/resolved with follow-on patches. The series consists of: [1/3] topology: add support for node_to_mem_node() to determine the fallback node [2/3] slub: fallback to node_to_mem_node() node if allocating on memoryless node - Joonsoo's patches to cache the nearest node with memory for each NUMA node [3/3] Partial revert of 81c98869faa5 (""kthread: ensure locality of task_struct allocations") - At Tejun's request, keep the knowledge of memoryless node fallback to the allocator core. This patch (of 3): We need to determine the fallback node in slub allocator if the allocation target node is memoryless node. Without it, the SLUB wrongly select the node which has no memory and can't use a partial slab, because of node mismatch. Introduced function, node_to_mem_node(X), will return a node Y with memory that has the nearest distance. If X is memoryless node, it will return nearest distance node, but, if X is normal node, it will return itself. We will use this function in following patch to determine the fallback node. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Han Pingtian <hanpt@linux.vnet.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Anton Blanchard <anton@samba.org> Cc: Christoph Lameter <cl@linux.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: W4TCH0UT <ateekujjawal@gmail.com>
* mm: pass readahead info down to the i/o schedulerLee Susman2016-08-261-0/+1
| | | | | | | | | | | | | | | | Some i/o schedulers (i.e. row-iosched, cfq-iosched) deploy an idling algorithm in order to be better synced with the readahead algorithm. Idling is a prediction algorithm for incoming read requests. In this patch we mark pages which are part of a readahead window, by setting a newly introduced flag. With this flag, the i/o scheduler can identify a request which is associated with a readahead page. This enables the i/o scheduler's idling mechanism to be en-sync with the readahead mechanism and, in turn, can increase read throughput. Change-Id: I0654f23315b6d19d71bcc9cc029c6b281a44b196 Signed-off-by: Lee Susman <lsusman@codeaurora.org> Signed-off-by: Stefan Guendhoer <stefan@guendhoer.com>
* first commitMeizu OpenSource2016-08-151-0/+7661