aboutsummaryrefslogtreecommitdiff
path: root/include/linux
Commit message (Collapse)AuthorAgeFilesLines
* non-ext4 portions of "direct-io: Implement generic deferred AIO completions"Theodore Ts'o2017-05-272-2/+7
| | | | | | Originally from 7b7a8665edd8db73 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: use old interface for ext4_readdir()Theodore Ts'o2017-05-271-1/+2
| | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* sync: don't block the flusher thread waiting on IODave Chinner2017-05-271-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When sync does it's WB_SYNC_ALL writeback, it issues data Io and then immediately waits for IO completion. This is done in the context of the flusher thread, and hence completely ties up the flusher thread for the backing device until all the dirty inodes have been synced. On filesystems that are dirtying inodes constantly and quickly, this means the flusher thread can be tied up for minutes per sync call and hence badly affect system level write IO performance as the page cache cannot be cleaned quickly. We already have a wait loop for IO completion for sync(2), so cut this out of the flusher thread and delegate it to wait_sb_inodes(). Hence we can do rapid IO submission, and then wait for it all to complete. Effect of sync on fsmark before the patch: FSUse% Count Size Files/sec App Overhead ..... 0 640000 4096 35154.6 1026984 0 720000 4096 36740.3 1023844 0 800000 4096 36184.6 916599 0 880000 4096 1282.7 1054367 0 960000 4096 3951.3 918773 0 1040000 4096 40646.2 996448 0 1120000 4096 43610.1 895647 0 1200000 4096 40333.1 921048 And a single sync pass took: real 0m52.407s user 0m0.000s sys 0m0.090s After the patch, there is no impact on fsmark results, and each individual sync(2) operation run concurrently with the same fsmark workload takes roughly 7s: real 0m6.930s user 0m0.000s sys 0m0.039s IOWs, sync is 7-8x faster on a busy filesystem and does not have an adverse impact on ongoing async data write operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 7747bd4bceb3079572695d3942294a6c7b265557)
* ext4: backport mm portion of: fix data integrity sync in ordered modeTheodore Ts'o2017-05-271-1/+11
| | | | | | | | Commit 1c8349a17137: "ext4: fix data integrity sync in ordered mode" included changes to include/linux/page-flags.h and mm/page-writeback.c. Apply them as part of the 3.18 ext4 backport. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* include/linux/fs.h: add dir_relax() for 3.18 backportMister Oyster2017-05-271-0/+7
| | | | | | partial commit from ec991b05a282642359e81e65855f189f7881009c include/linux/fs.h: add dir_emit() and dir_relax() for 3.18 backport Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* BACKPORT: ext4 from 3.18 to mtk-3.10Mister Oyster2017-05-272-90/+148
|
* fs: add inode_set_flags() for ext4 3.18 backportTheodore Ts'o2017-05-271-0/+3
| | | | | | Excerpted from commit 5f16f3225b062 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* sched: add bit_wait_io for 3.18 ext4 backportTheodore Ts'o2017-05-271-1/+29
| | | | | | | Excerpted from commit 743162013: "sched: Remove proliferation of wait_on_bit() action functions" Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* buffer_head.h: add getblk_unmovable() for 3.18 ext4 backportTheodore Ts'o2017-05-271-0/+6
| | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* mm.h: add truncate_inode_pages_final() for 3.18 ext4 backportTheodore Ts'o2017-05-271-0/+6
|
* bitops.h: add smp_mb__after_atomic() for 3.18 backportTheodore Ts'o2017-05-271-0/+3
| | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* buffer_head.h: add sb_bread_unmovable() for 3.18 ext4 backportTheodore Ts'o2017-05-271-0/+6
| | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* mm: add find_get_page_flags() for 3.18 ext4 backportTheodore Ts'o2017-05-271-2/+12
| | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* fs: add {lock,unlock}_two_nondirectories for 3.18 backportTheodore Ts'o2017-05-271-1/+6
| | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* UPSTREAM: math64: New separate div64_u64_rem helperMike Snitzer2017-05-241-0/+13
| | | | | | | | | | | | | | | | | | | | | | Commit f792685006274a850e6cc0ea9ade275ccdfc90bc ("math64: New div64_u64_rem helper") implemented div64_u64 in terms of div64_u64_rem. But div64_u64_rem was removed because it slowed down div64_u64 (and there were no other users of div64_u64_rem). Device Mapper's I/O statistics support has a need for div64_u64_rem; reintroduce this helper as a separate method that doesn't slow down div64_u64, especially on 32-bit systems. BUG: 27175947 Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Change-Id: I05274c1a7235dfa972f5ddc4778e0154240fd9b6 Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com>
* include/linux/mm.h: remove ifdef conditionRashika Kheria2017-05-241-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ifdef conditions in include/linux/mm.h presents three cases: - !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) There is no actual definition of function but include/linux/mm.h has a static inline stub defined. - defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) linux/mm.h does not define a prototype, but mm/page_alloc.c defines the function. Hence, compiler reports the following warning: mm/page_alloc.c:4300:15: warning: no previous prototype for `__early_pfn_to_nid' [-Wmissing-prototypes] - defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) The architecture defines the function, and linux/mm.h has a prototype. Thus, join the conditions of Case 2 and 3 ie eliminate the ifdef condition of CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID to eliminate the missing prototype warning from file mm/page_alloc.c. Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* mm/readahead.c: inline ra_submitFabian Frederick2017-05-241-3/+0
| | | | | | | | | | | | | | Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left ra_submit with a single function call. Move ra_submit to internal.h and inline it to save some stack. Thanks to Andrew Morton for commenting different versions. Signed-off-by: Fabian Frederick <fabf@skynet.be> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* mm/memblock.c: introduce bottom-up allocation modeTang Chen2017-05-242-0/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Linux kernel cannot migrate pages used by the kernel. As a result, kernel pages cannot be hot-removed. So we cannot allocate hotpluggable memory for the kernel. ACPI SRAT (System Resource Affinity Table) contains the memory hotplug info. But before SRAT is parsed, memblock has already started to allocate memory for the kernel. So we need to prevent memblock from doing this. In a memory hotplug system, any numa node the kernel resides in should be unhotpluggable. And for a modern server, each node could have at least 16GB memory. So memory around the kernel image is highly likely unhotpluggable. So the basic idea is: Allocate memory from the end of the kernel image and to the higher memory. Since memory allocation before SRAT is parsed won't be too much, it could highly likely be in the same node with kernel image. The current memblock can only allocate memory top-down. So this patch introduces a new bottom-up allocation mode to allocate memory bottom-up. And later when we use this allocation direction to allocate memory, we will limit the start address above the kernel. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Toshi Kani <toshi.kani@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Tejun Heo <tj@kernel.org> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Thomas Renninger <trenn@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* memblock, numa: binary search node idYinghai Lu2017-05-241-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | Current early_pfn_to_nid() on arch that support memblock go over memblock.memory one by one, so will take too many try near the end. We can use existing memblock_search to find the node id for given pfn, that could save some time on bigger system that have many entries memblock.memory array. Here are the timing differences for several machines. In each case with the patch less time was spent in __early_pfn_to_nid(). 3.11-rc5 with patch difference (%) -------- ---------- -------------- UV1: 256 nodes 9TB: 411.66 402.47 -9.19 (2.23%) UV2: 255 nodes 16TB: 1141.02 1138.12 -2.90 (0.25%) UV2: 64 nodes 2TB: 128.15 126.53 -1.62 (1.26%) UV2: 32 nodes 2TB: 121.87 121.07 -0.80 (0.66%) Time in seconds. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Acked-by: Russ Anderson <rja@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* mm: disable zone_reclaim_mode by defaultMel Gorman2017-05-241-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When it was introduced, zone_reclaim_mode made sense as NUMA distances punished and workloads were generally partitioned to fit into a NUMA node. NUMA machines are now common but few of the workloads are NUMA-aware and it's routine to see major performance degradation due to zone_reclaim_mode being enabled but relatively few can identify the problem. Those that require zone_reclaim_mode are likely to be able to detect when it needs to be enabled and tune appropriately so lets have a sensible default for the bulk of users. This patch (of 2): zone_reclaim_mode causes processes to prefer reclaiming memory from local node instead of spilling over to other nodes. This made sense initially when NUMA machines were almost exclusively HPC and the workload was partitioned into nodes. The NUMA penalties were sufficiently high to justify reclaiming the memory. On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default. Users that are sophisticated enough to know they need zone_reclaim_mode will detect it. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* swap: maybe_preload & refactoringDmitry Safonov2017-05-241-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | zswap_get_swap_cache_page and read_swap_cache_async have pretty much the same code with only significant difference in return value and usage of swap_readpage. I a helper __read_swap_cache_async() with the common code. Behavior change: now zswap_get_swap_cache_page will use radix_tree_maybe_preload instead radix_tree_preload. Looks like, this wasn't changed only by the reason of code duplication. Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* fs/mpage.c: factor page_endio() out of mpage_end_io()Matthew Wilcox2017-05-241-0/+2
| | | | | | | | | | | | | | | page_endio() takes care of updating all the appropriate page flags once I/O has finished to a page. Switch to using mapping_set_error() instead of setting AS_EIO directly; this will handle thin-provisioned devices correctly. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dheeraj Reddy <dheeraj.reddy@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* fs/block_dev.c: add bdev_read_page() and bdev_write_page()Matthew Wilcox2017-05-241-0/+4
| | | | | | | | | | | | | | | | A block device driver may choose to provide a rw_page operation. These will be called when the filesystem is attempting to do page sized I/O to page cache pages (ie not for direct I/O). This does preclude I/Os that are larger than page size, so this may only be a performance gain for some devices. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Tested-by: Dheeraj Reddy <dheeraj.reddy@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* mm/mempolicy.c: convert the shared_policy lock to a rwlockNathan Zimmer2017-05-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When running the SPECint_rate gcc on some very large boxes it was noticed that the system was spending lots of time in mpol_shared_policy_lookup(). The gamess benchmark can also show it and is what I mostly used to chase down the issue since the setup for that I found to be easier. To be clear the binaries were on tmpfs because of disk I/O requirements. We then used text replication to avoid icache misses and having all the copies banging on the memory where the instruction code resides. This results in us hitting a bottleneck in mpol_shared_policy_lookup() since lookup is serialised by the shared_policy lock. I have only reproduced this on very large (3k+ cores) boxes. The problem starts showing up at just a few hundred ranks getting worse until it threatens to livelock once it gets large enough. For example on the gamess benchmark at 128 ranks this area consumes only ~1% of time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over 90%. To alleviate the contention in this area I converted the spinlock to an rwlock. This allows a large number of lookups to happen simultaneously. The results were quite good reducing this consumtion at max ranks to around 2%. [akpm@linux-foundation.org: tidy up code comments] Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Nadia Yvette Chambers <nyc@holomorphy.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* timer: Added usleep[_range] timerPatrick Pannuto2017-05-221-0/+5
| | | | | | | | | | | | | | | usleep[_range] are finer precision implementations of msleep and are designed to be drop-in replacements for udelay where a precise sleep / busy-wait is unnecessary. They also allow an easy interface to specify slack when a precise (ish) wakeup is unnecessary to help minimize wakeups As ACK'd upstream: https://patchwork.kernel.org/patch/112813/ Change-Id: I277737744ca58061323837609b121a0fc9d27f33 Signed-off-by: Patrick Pannuto <ppannuto@codeaurora.org> (cherry picked from commit 08c118890b06595dfc26d47ee63f59e73256c270)
* kernel: Only expose su when daemon is runningTom Marshall2017-05-214-0/+19
| | | | | | | | | | | | | | It has been claimed that the PG implementation of 'su' has security vulnerabilities even when disabled. Unfortunately, the people that find these vulnerabilities often like to keep them private so they can profit from exploits while leaving users exposed to malicious hackers. In order to reduce the attack surface for vulnerabilites, it is therefore necessary to make 'su' completely inaccessible when it is not in use (except by the root and system users). Change-Id: I79716c72f74d0b7af34ec3a8054896c6559a181d
* fscrypt: introduce helper function for filename matchingEric Biggers2017-05-212-0/+87
| | | | | | | | | Introduce a helper function fscrypt_match_name() which tests whether a fscrypt_name matches a directory entry. Also clean up the magic numbers and document things properly. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* fscrypt: eliminate ->prepare_context() operationEric Biggers2017-05-211-1/+0
| | | | | | | | | | | | | | | | The only use of the ->prepare_context() fscrypt operation was to allow ext4 to evict inline data from the inode before ->set_context(). However, there is no reason why this cannot be done as simply the first step in ->set_context(), and in fact it makes more sense to do it that way because then the policy modes and flags get validated before any real work is done. Therefore, merge ext4_prepare_context() into ext4_set_context(), and remove ->prepare_context(). Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Conflicts: fs/ext4/super.c
* f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discardChao Yu2017-05-211-0/+1
| | | | | | | | | | | | | Introduce CP_TRIMMED_FLAG to indicate all invalid block were trimmed before umount, so once we do mount with image which contain the flag, we don't record invalid blocks as undiscard one, when fstrim is being triggered, we can avoid issuing redundant discard commands. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Conflicts: include/trace/events/f2fs.h
* f2fs: sanity check segment countJin Qian2017-05-211-0/+6
| | | | | | | | F2FS uses 4 bytes to represent block address. As a result, supported size of disk is 16 TB and it equals to 16 * 1024 * 1024 / 2 segments. Signed-off-by: Jin Qian <jinqian@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
* FROMLIST: pstore: drop pmsg bounce bufferMark Salyzyn2017-05-202-5/+13
| | | | | | | | | | | | | | | | | | (from https://lkml.org/lkml/2016/9/1/428) (cherry pick from android-3.10 commit b58133100b38f2bf83cad2d7097417a3a196ed0b) Removing a bounce buffer copy operation in the pmsg driver path is always better. We also gain in overall performance by not requesting a vmalloc on every write as this can cause precious RT tasks, such as user facing media operation, to stall while memory is being reclaimed. Added a write_buf_user to the pstore functions, a backup platform write_buf_user that uses the small buffer that is part of the instance, and implemented a ramoops write_buf_user that only supports PSTORE_TYPE_PMSG. Signed-off-by: Mark Salyzyn <salyzyn@google.com> Bug: 31057326 Change-Id: I4cdee1cd31467aa3e6c605bce2fbd4de5b0f8caa
* give up on gcc ilog2() constant optimizationsLinus Torvalds2017-05-201-11/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | gcc-7 has an "optimization" pass that completely screws up, and generates the code expansion for the (impossible) case of calling ilog2() with a zero constant, even when the code gcc compiles does not actually have a zero constant. And we try to generate a compile-time error for anybody doing ilog2() on a constant where that doesn't make sense (be it zero or negative). So now gcc7 will fail the build due to our sanity checking, because it created that constant-zero case that didn't actually exist in the source code. There's a whole long discussion on the kernel mailing about how to work around this gcc bug. The gcc people themselevs have discussed their "feature" in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72785 but it's all water under the bridge, because while it looked at one point like it would be solved by the time gcc7 was released, that was not to be. So now we have to deal with this compiler braindamage. And the only simple approach seems to be to just delete the code that tries to warn about bad uses of ilog2(). So now "ilog2()" will just return 0 not just for the value 1, but for any non-positive value too. It's not like I can recall anybody having ever actually tried to use this function on any invalid value, but maybe the sanity check just meant that such code never made it out in public. Reported-by: Laura Abbott <labbott@redhat.com> Cc: John Stultz <john.stultz@linaro.org>, Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Joe Maples <joe@frap129.org>
* ANDROID: Add untag hacks to inet_release functionChenbo Feng2017-05-201-0/+1
| | | | | | | | | | | | | | | | | To prevent protential risk of memory leak caused by closing socket with out untag it from qtaguid module, the qtaguid module now do not hold any socket file reference count. Instead, it will increase the sk_refcnt of the sk struct to prevent a reuse of the socket pointer. And when a socket is released. It will delete the tag if the socket is previously tagged so no more resources is held by xt_qtaguid moudle. A flag is added to the untag process to prevent possible kernel crash caused by fail to delete corresponding socket_tag_entry list. Bug: 36374484 Test: compile and run test under system/extra/test/iptables, run cts -m CtsNetTestCases -t android.net.cts.SocketRefCntTest Signed-off-by: Chenbo Feng <fengc@google.com> Change-Id: Iea7c3bf0c59b9774a5114af905b2405f6bc9ee52
* Revert "BACKPORT: [UPSTREAM] mbcache2: reimplement mbcache"Mister Oyster2017-05-111-50/+0
| | | | This reverts commit 20ccd1e3ce3323d66ab29bf71cd75b337b2667a1.
* Revert "BACKPORT: [UPSTREAM] mm: new shrinker API"Mister Oyster2017-05-111-29/+9
| | | | This reverts commit db537c9914552c3472bd5c75ffe72327e9076f76.
* android: fiq_debugger: restrict access to critical commands.Mark Salyzyn2017-05-101-0/+1
| | | | | | | | | | | | | | | | | | Sysrq must be enabled via /proc/sys/kernel/sysrq as a security measure to enable various critical fiq debugger commands that either leak information or can be used as a system attack. Default disabled, this will leave the reboot, reset, irqs, sleep, nosleep, console and ps commands. Reboot and reset commands will be restricted from taking any parameters. We will also switch to showing the limited command set in this mode. Signed-off-by: Mark Salyzyn <salyzyn@google.com> Bug: 32402555 Change-Id: I3f74b1ff5e4971d619bcb37a911fed68fbb538d5 Git-repo: https://android.googlesource.com/kernel/msm Git-commit: 1031836c0895f1f5a05c25efec83bfa11aa08ca9 Signed-off-by: Dennis Cagle <d-cagle@codeaurora.org>
* locking/mcs: Allow architecture specific asm files to be used for contended caseTim Chen2017-05-031-0/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch allows each architecture to add its specific assembly optimized arch_mcs_spin_lock_contended and arch_mcs_spinlock_uncontended for MCS lock and unlock functions. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: AswinChandramouleeswaran <aswin@hp.com> Cc: George Spelvin <linux@horizon.com> Cc: Rik vanRiel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: MichelLespinasse <walken@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alex Shi <alex.shi@linaro.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Figo.zhang" <figo1802@gmail.com> Cc: "Paul E.McKenney" <paulmck@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Waiman Long <waiman.long@hp.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Will Deacon <will.deacon@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew R Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1390347382.3138.67.camel@schen9-DESK Signed-off-by: Ingo Molnar <mingo@kernel.org> Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Git-commit: ddf1d169c0a489d498c1799a7043904a43b0c159 [joonwoop@codeaurora.org: Resolve merge conflicts; we don't have changes for arch other than ARM/ARM64] Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
* BACKPORT: [UPSTREAM] mbcache2: reimplement mbcacheJan Kara2017-04-251-0/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | (Cherry-pick from commit f9a61eb4e2471c56a63cd804c7474128138c38ac) Original mbcache was designed to have more features than what ext? filesystems ended up using. It supported entry being in more hashes, it had a home-grown rwlocking of each entry, and one cache could cache entries from multiple filesystems. This genericity also resulted in more complex locking, larger cache entries, and generally more code complexity. This is reimplementation of the mbcache functionality to exactly fit the purpose ext? filesystems use it for. Cache entries are now considerably smaller (7 instead of 13 longs), the code is considerably smaller as well (414 vs 913 lines of code), and IMO also simpler. The new code is also much more lightweight. I have measured the speed using artificial xattr-bench benchmark, which spawns P processes, each process sets xattr for F different files, and the value of xattr is randomly chosen from a pool of V values. Averages of runtimes for 5 runs for various combinations of parameters are below. The first value in each cell is old mbache, the second value is the new mbcache. V=10 F\P 1 2 4 8 16 32 64 10 0.158,0.157 0.208,0.196 0.500,0.277 0.798,0.400 3.258,0.584 13.807,1.047 61.339,2.803 100 0.172,0.167 0.279,0.222 0.520,0.275 0.825,0.341 2.981,0.505 12.022,1.202 44.641,2.943 1000 0.185,0.174 0.297,0.239 0.445,0.283 0.767,0.340 2.329,0.480 6.342,1.198 16.440,3.888 V=100 F\P 1 2 4 8 16 32 64 10 0.162,0.153 0.200,0.186 0.362,0.257 0.671,0.496 1.433,0.943 3.801,1.345 7.938,2.501 100 0.153,0.160 0.221,0.199 0.404,0.264 0.945,0.379 1.556,0.485 3.761,1.156 7.901,2.484 1000 0.215,0.191 0.303,0.246 0.471,0.288 0.960,0.347 1.647,0.479 3.916,1.176 8.058,3.160 V=1000 F\P 1 2 4 8 16 32 64 10 0.151,0.129 0.210,0.163 0.326,0.245 0.685,0.521 1.284,0.859 3.087,2.251 6.451,4.801 100 0.154,0.153 0.211,0.191 0.276,0.282 0.687,0.506 1.202,0.877 3.259,1.954 8.738,2.887 1000 0.145,0.179 0.202,0.222 0.449,0.319 0.899,0.333 1.577,0.524 4.221,1.240 9.782,3.579 V=10000 F\P 1 2 4 8 16 32 64 10 0.161,0.154 0.198,0.190 0.296,0.256 0.662,0.480 1.192,0.818 2.989,2.200 6.362,4.746 100 0.176,0.174 0.236,0.203 0.326,0.255 0.696,0.511 1.183,0.855 4.205,3.444 19.510,17.760 1000 0.199,0.183 0.240,0.227 1.159,1.014 2.286,2.154 6.023,6.039 ---,10.933 ---,36.620 V=100000 F\P 1 2 4 8 16 32 64 10 0.171,0.162 0.204,0.198 0.285,0.230 0.692,0.500 1.225,0.881 2.990,2.243 6.379,4.771 100 0.151,0.171 0.220,0.210 0.295,0.255 0.720,0.518 1.226,0.844 3.423,2.831 19.234,17.544 1000 0.192,0.189 0.249,0.225 1.162,1.043 2.257,2.093 5.853,4.997 ---,10.399 ---,32.198 We see that the new code is faster in pretty much all the cases and starting from 4 processes there are significant gains with the new code resulting in upto 20-times shorter runtimes. Also for large numbers of cached entries all values for the old code could not be measured as the kernel started hitting softlockups and died before the test completed. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* BACKPORT: [UPSTREAM] mm: new shrinker APIDave Chinner2017-04-251-9/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | (Cherry-pick from commit 24f7c6b981fb70084757382da464ea85d72af300) The current shrinker callout API uses an a single shrinker call for multiple functions. To determine the function, a special magical value is passed in a parameter to change the behaviour. This complicates the implementation and return value specification for the different behaviours. Separate the two different behaviours into separate operations, one to return a count of freeable objects in the cache, and another to scan a certain number of objects in the cache for freeing. In defining these new operations, ensure the return values and resultant behaviours are clearly defined and documented. Modify shrink_slab() to use the new API and implement the callouts for all the existing shrinkers. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* cred/userns: define current_user_ns() as a functionArnd Bergmann2017-04-252-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current_user_ns() macro currently returns &init_user_ns when user namespaces are disabled, and that causes several warnings when building with gcc-6.0 in code that compares the result of the macro to &init_user_ns itself: fs/xfs/xfs_ioctl.c: In function 'xfs_ioctl_setattr_check_projid': fs/xfs/xfs_ioctl.c:1249:22: error: self-comparison always evaluates to true [-Werror=tautological-compare] if (current_user_ns() == &init_user_ns) This is a legitimate warning in principle, but here it isn't really helpful, so I'm reprasing the definition in a way that shuts up the warning. Apparently gcc only warns when comparing identical literals, but it can figure out that the result of an inline function can be identical to a constant expression in order to optimize a condition yet not warn about the fact that the condition is known at compile time. This is exactly what we want here, and it looks reasonable because we generally prefer inline functions over macros anyway. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: David Howells <dhowells@redhat.com> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: James Morris <james.l.morris@oracle.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* rbtree: add postorder iteration functionsCody P Schafer2017-04-171-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Postorder iteration yields all of a node's children prior to yielding the node itself, and this particular implementation also avoids examining the leaf links in a node after that node has been yielded. In what I expect will be its most common usage, postorder iteration allows the deletion of every node in an rbtree without modifying the rbtree nodes (no _requirement_ that they be nulled) while avoiding referencing child nodes after they have been "deleted" (most commonly, freed). I have only updated zswap to use this functionality at this point, but numerous bits of code (most notably in the filesystem drivers) use a hand rolled postorder iteration that NULLs child links as it traverses the tree. Each of those instances could be replaced with this common implementation. 1 & 2 add rbtree postorder iteration functions. 3 adds testing of the iteration to the rbtree runtime tests 4 allows building the rbtree runtime tests as builtins 5 updates zswap. This patch: Add postorder iteration functions for rbtree. These are useful for safely freeing an entire rbtree without modifying the tree at all. Change-Id: I5d0f2db0b5bcb57da9c7fa1c5f34b8686db8dcc9 Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Reviewed-by: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Theodore Ts'o <tytso@google.com>
* rbtree: add rbtree_postorder_for_each_entry_safe() helperCody P Schafer2017-04-171-0/+18
| | | | | | | | | | | | | | | | | Because deletion (of the entire tree) is a relatively common use of the rbtree_postorder iteration, and because doing it safely means fiddling with temporary storage, provide a helper to simplify postorder rbtree iteration. Change-Id: Ifb89570a13fe7f3f480aa48f4281c21d99e28094 Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Reviewed-by: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Theodore Ts'o <tytso@google.com>
* vfs: add setattr2 fix mergeMister Oyster2017-04-171-1/+1
|
* BACKPORT: posix_acl: Clear SGID bit when setting file permissionsJan Kara2017-04-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | (cherry pick from commit 073931017b49d9458aa351605b43a7e34598caef) When file permissions are modified via chmod(2) and the user is not in the owning group or capable of CAP_FSETID, the setgid bit is cleared in inode_change_ok(). Setting a POSIX ACL via setxattr(2) sets the file permissions as well as the new ACL, but doesn't clear the setgid bit in a similar way; this allows to bypass the check in chmod(2). Fix that. NB: conflicts resolution included extending the change to all visible users of the near deprecated function posix_acl_equiv_mode replaced with posix_acl_update_mode. We did not resolve the ACL leak in this CL, require additional upstream fixes. References: CVE-2016-7097 Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Bug: 32458736 Change-Id: I19591ad452cc825ac282b3cfd2daaa72aa9a1ac1
* fs: limit filesystem stacking depthMiklos Szeredi2017-04-161-0/+11
| | | | | | | | | | | | | | | Add a simple read-only counter to super_block that indicates how deep this is in the stack of filesystems. Previously ecryptfs was the only stackable filesystem and it explicitly disallowed multiple layers of itself. Overlayfs, however, can be stacked recursively and also may be stacked on top of ecryptfs or vice versa. To limit the kernel stack usage we must limit the depth of the filesystem stack. Initially the limit is set to 2. Change-Id: I91549cf876ed11a4265487f6b2d980b459399f9d Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* BACKPORT: block: add blk_rq_set_block_pc()Jens Axboe2017-04-131-0/+1
| | | | | | | | | | | | | | | | | With the optimizations around not clearing the full request at alloc time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC up to the user allocating the request. Add a blk_rq_set_block_pc() that sets the command type to REQ_TYPE_BLOCK_PC, and properly initializes the members associated with this type of request. Update callers to use this function instead of manipulating rq->cmd_type directly. Includes fixes from Christoph Hellwig <hch@lst.de> for my half-assed attempt. Change-Id: Ifc386dfb951c5d6adebf48ff38135dda28e4b1ce Signed-off-by: Jens Axboe <axboe@fb.com>
* BACKPORT: HID: input: generic hidinput_input_event handlerDavid Herrmann2017-04-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | The hidinput_input_event() callback converts input events written from userspace into HID reports and sends them to the device. We currently implement this in every HID transport driver, even though most of them do the same. This provides a generic hidinput_input_event() implementation which is mostly copied from usbhid. It uses a delayed worker to allow multiple LED events to be collected into a single output event. We use the custom ->request() transport driver callback to allow drivers to adjust the outgoing report and handle the request asynchronously. If no custom ->request() callback is available, we fall back to the generic raw output report handler (which is synchronous). Drivers can still provide custom hidinput_input_event() handlers (see logitech-dj) if the generic implementation doesn't fit their needs. Conflicts: drivers/hid/hid-input.c Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Michael Wright <michaelwr@google.com> Change-Id: Iccb0b1de6460f6854b3d55d4008cc1d744472a06
* ipv6: Remove privacy config option.David S. Miller2017-04-131-2/+0
| | | | | | | | The code for privacy extentions is very mature, and making it configurable only gives marginal memory/code savings in exchange for obfuscation and hard to read code via CPP ifdef'ery. Signed-off-by: David S. Miller <davem@davemloft.net>
* fscrypt: catch up to v4.11-rc1Jaegeuk Kim2017-04-134-434/+380
| | | | | | | | | | | | | | | Keep validate_user_key() due to kasprintf() panic. fscrypt: - skcipher_ -> ablkcipher_ - fs/crypto/bio.c changes f2fs: - fscrypt: use ENOKEY when file cannot be created w/o key - fscrypt: split supp and notsupp declarations into their own headers - fscrypt: make fscrypt_operations.key_prefix a string Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
* f2fs: introduce free nid bitmapChao Yu2017-04-131-1/+2
| | | | | | | | | | | | | | | | | | | | | | In scenario of intensively node allocation, free nids will be ran out soon, then it needs to stop to load free nids by traversing NAT blocks, in worse case, if NAT blocks does not be cached in memory, it generates IOs which slows down our foreground operations. In order to speed up node allocation, in this patch we introduce a new free_nid_bitmap array, so there is an bitmap table for each NAT block, Once the NAT block is loaded, related bitmap cache will be switched on, and bitmap will be set during traversing nat entries in NAT block, later we can query and update nid usage status in memory completely. With such implementation, I expect performance of node allocation can be improved in the long-term after filesystem image is mounted. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Conflicts: include/linux/f2fs_fs.h