aboutsummaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
...
* ext4: use old interface for ext4_readdir()Theodore Ts'o2017-05-272-3/+17
| | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* sync: don't block the flusher thread waiting on IODave Chinner2017-05-272-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When sync does it's WB_SYNC_ALL writeback, it issues data Io and then immediately waits for IO completion. This is done in the context of the flusher thread, and hence completely ties up the flusher thread for the backing device until all the dirty inodes have been synced. On filesystems that are dirtying inodes constantly and quickly, this means the flusher thread can be tied up for minutes per sync call and hence badly affect system level write IO performance as the page cache cannot be cleaned quickly. We already have a wait loop for IO completion for sync(2), so cut this out of the flusher thread and delegate it to wait_sb_inodes(). Hence we can do rapid IO submission, and then wait for it all to complete. Effect of sync on fsmark before the patch: FSUse% Count Size Files/sec App Overhead ..... 0 640000 4096 35154.6 1026984 0 720000 4096 36740.3 1023844 0 800000 4096 36184.6 916599 0 880000 4096 1282.7 1054367 0 960000 4096 3951.3 918773 0 1040000 4096 40646.2 996448 0 1120000 4096 43610.1 895647 0 1200000 4096 40333.1 921048 And a single sync pass took: real 0m52.407s user 0m0.000s sys 0m0.090s After the patch, there is no impact on fsmark results, and each individual sync(2) operation run concurrently with the same fsmark workload takes roughly 7s: real 0m6.930s user 0m0.000s sys 0m0.039s IOWs, sync is 7-8x faster on a busy filesystem and does not have an adverse impact on ongoing async data write operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 7747bd4bceb3079572695d3942294a6c7b265557)
* ext4: fix up bio->bi_iter.bi_sectorTheodore Ts'o2017-05-271-2/+2
| | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: backport mm portion of: fix data integrity sync in ordered modeTheodore Ts'o2017-05-272-6/+17
| | | | | | | | Commit 1c8349a17137: "ext4: fix data integrity sync in ordered mode" included changes to include/linux/page-flags.h and mm/page-writeback.c. Apply them as part of the 3.18 ext4 backport. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: use old truncate_pagecache() interface for ext4 3.18 backportTheodore Ts'o2017-05-272-3/+5
| | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: use old percpu_counter_init() function signatureTheodore Ts'o2017-05-272-9/+6
| | | | | | Older kernels don't pass in a gfp_flags argument. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: use the old shrinker APITheodore Ts'o2017-05-271-6/+10
| | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: disable ext4 acl handling for 3.18 backportTheodore Ts'o2017-05-271-0/+3
| | | | | | | The ACL code changes too much between 3.18 and 3.10; so disable acl handling so we don't have make things actually work. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* include/linux/fs.h: add dir_relax() for 3.18 backportMister Oyster2017-05-271-0/+7
| | | | | | partial commit from ec991b05a282642359e81e65855f189f7881009c include/linux/fs.h: add dir_emit() and dir_relax() for 3.18 backport Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* BACKPORT: ext4 from 3.18 to mtk-3.10Mister Oyster2017-05-2741-6405/+7260
|
* fs: add inode_set_flags() for ext4 3.18 backportTheodore Ts'o2017-05-272-0/+34
| | | | | | Excerpted from commit 5f16f3225b062 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* sched: add bit_wait_io for 3.18 ext4 backportTheodore Ts'o2017-05-272-1/+38
| | | | | | | Excerpted from commit 743162013: "sched: Remove proliferation of wait_on_bit() action functions" Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* buffer_head.h: add getblk_unmovable() for 3.18 ext4 backportTheodore Ts'o2017-05-271-0/+6
| | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* uapi: add new system call ABI codepoints for ext4 3.18 backportTheodore Ts'o2017-05-272-0/+37
| | | | | | | | | | | note: this doesn't guarantee that functionality provided by FALLOC_FL_COLLAPSE_RANGE, FALLOC_FL_ZERO_RANGE, and FIEMAP_FLAG_CACHE to necessarily _work_; it only allows ext4 from 3.18 to *compile*. Fortunately, these are exotic bits of functionality that most people never use. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* mm.h: add truncate_inode_pages_final() for 3.18 ext4 backportTheodore Ts'o2017-05-271-0/+6
|
* bitops.h: add smp_mb__after_atomic() for 3.18 backportTheodore Ts'o2017-05-271-0/+3
| | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* buffer_head.h: add sb_bread_unmovable() for 3.18 ext4 backportTheodore Ts'o2017-05-271-0/+6
| | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* percpu: add raw_cpu_ptr() for 3.18 ext4 backportTheodore Ts'o2017-05-271-0/+3
|
* mm: add find_get_page_flags() for 3.18 ext4 backportTheodore Ts'o2017-05-272-3/+16
| | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* fs: add {lock,unlock}_two_nondirectories for 3.18 backportTheodore Ts'o2017-05-272-1/+41
| | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* defconfig: f2fs: enable more options to fix twrpMoyster2017-05-261-7/+7
|
* Kconfig: add depends for UID_SYS_STATSGanesh Mahendran2017-05-241-1/+1
| | | | | | | | uid_io depends on TASK_XACCT and TASK_IO_ACCOUNTING. So add depends in Kconfig before compiling code. Change-Id: Ie6bf57ec7c2eceffadf4da0fc2aca001ce10c36e Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
* ANDROID: uid_sys_stats: defer io stats calulation for dead tasksJin Qian2017-05-241-65/+42
| | | | | | | | | Store sum of dead task io stats in uid_entry and defer uid io calulation until next uid proc stat change or dumpsys. Bug: 37754877 Change-Id: I970f010a4c841c5ca26d0efc7e027414c3c952e0 Signed-off-by: Jin Qian <jinqian@google.com>
* ANDROID: uid_sys_stats: fix access of task_uid(task)Ganesh Mahendran2017-05-241-4/+5
| | | | | | | | struct task_struct *task should be proteced by tasklist_lock. Change-Id: Iefcd13442a9b9d855a2bbcde9fd838a4132fee58 Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> (cherry picked from commit 90d78776c4a0e13fb7ee5bd0787f04a1730631a6)
* ANDROID: uid_sys_stats: reduce update_io_stats overheadJin Qian2017-05-241-10/+51
| | | | | | | | | | | | Replaced read_lock with rcu_read_lock to reduce time that preemption is disabled. Added a function to update io stats for specific uid and moved hash table lookup, user_namespace out of loops. Bug: 37319300 Change-Id: I2b81b5cd3b6399b40d08c3c14b42cad044556970 Signed-off-by: Jin Qian <jinqian@google.com>
* ANDROID: sdcardfs: Check for NULL in revalidateDaniel Rosenberg2017-05-241-3/+5
| | | | | | | | | If the inode is in the process of being evicted, the top value may be NULL. Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 38502532 Change-Id: I0b9d04aab621e0398d44d1c5dc53293106aa5f89
* selinux: conditionally reschedule in hashtab_insert while loading selinux policyDave Jones2017-05-241-0/+3
| | | | | | | | | | After silencing the sleeping warning in mls_convert_context() I started seeing similar traces from hashtab_insert. Do a cond_resched there too. Signed-off-by: Dave Jones <davej@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* selinux: conditionally reschedule in mls_convert_context while loading ↵Dave Jones2017-05-241-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | selinux policy On a slow machine (with debugging enabled), upgrading selinux policy may take a considerable amount of time. Long enough that the softlockup detector gets triggered. The backtrace looks like this.. > BUG: soft lockup - CPU#2 stuck for 23s! [load_policy:19045] > Call Trace: > [<ffffffff81221ddf>] symcmp+0xf/0x20 > [<ffffffff81221c27>] hashtab_search+0x47/0x80 > [<ffffffff8122e96c>] mls_convert_context+0xdc/0x1c0 > [<ffffffff812294e8>] convert_context+0x378/0x460 > [<ffffffff81229170>] ? security_context_to_sid_core+0x240/0x240 > [<ffffffff812221b5>] sidtab_map+0x45/0x80 > [<ffffffff8122bb9f>] security_load_policy+0x3ff/0x580 > [<ffffffff810788a8>] ? sched_clock_cpu+0xa8/0x100 > [<ffffffff810786dd>] ? sched_clock_local+0x1d/0x80 > [<ffffffff810788a8>] ? sched_clock_cpu+0xa8/0x100 > [<ffffffff8103096a>] ? __change_page_attr_set_clr+0x82a/0xa50 > [<ffffffff810786dd>] ? sched_clock_local+0x1d/0x80 > [<ffffffff810788a8>] ? sched_clock_cpu+0xa8/0x100 > [<ffffffff8103096a>] ? __change_page_attr_set_clr+0x82a/0xa50 > [<ffffffff810788a8>] ? sched_clock_cpu+0xa8/0x100 > [<ffffffff81534ddc>] ? retint_restore_args+0xe/0xe > [<ffffffff8109c82d>] ? trace_hardirqs_on_caller+0xfd/0x1c0 > [<ffffffff81279a2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f > [<ffffffff810d28a8>] ? rcu_irq_exit+0x68/0xb0 > [<ffffffff81534ddc>] ? retint_restore_args+0xe/0xe > [<ffffffff8121e947>] sel_write_load+0xa7/0x770 > [<ffffffff81139633>] ? vfs_write+0x1c3/0x200 > [<ffffffff81210e8e>] ? security_file_permission+0x1e/0xa0 > [<ffffffff8113952b>] vfs_write+0xbb/0x200 > [<ffffffff811581c7>] ? fget_light+0x397/0x4b0 > [<ffffffff81139c27>] SyS_write+0x47/0xa0 > [<ffffffff8153bde4>] tracesys+0xdd/0xe2 Stephen Smalley suggested: > Maybe put a cond_resched() within the ebitmap_for_each_positive_bit() > loop in mls_convert_context()? That seems to do the trick. Tested by downgrading and re-upgrading selinux-policy-targeted. Signed-off-by: Dave Jones <davej@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* selinux: no recursive read_lock of policy_rwlock in security_genfs_sid()Waiman Long2017-05-241-9/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the introduction of fair queued rwlock, recursive read_lock() may hang the offending process if there is a write_lock() somewhere in between. With recursive read_lock checking enabled, the following error was reported: ============================================= [ INFO: possible recursive locking detected ] 3.16.0-rc1 #2 Tainted: G E --------------------------------------------- load_policy/708 is trying to acquire lock: (policy_rwlock){.+.+..}, at: [<ffffffff8125b32a>] security_genfs_sid+0x3a/0x170 but task is already holding lock: (policy_rwlock){.+.+..}, at: [<ffffffff8125b48c>] security_fs_use+0x2c/0x110 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(policy_rwlock); lock(policy_rwlock); This patch fixes the occurrence of recursive read_lock() of policy_rwlock by adding a helper function __security_genfs_sid() which requires caller to take the lock before calling it. The security_fs_use() was then modified to call the new helper function. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <pmoore@redhat.com> Conflicts: security/selinux/ss/services.c (Adapted for Shamu 3.10 Kernel) Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* selinux: fix a possible memory leak in cond_read_node()Namhyung Kim2017-05-241-1/+1
| | | | | | | | | | The cond_read_node() should free the given node on error path as it's not linked to p->cond_list yet. This is done via cond_node_destroy() but it's not called when next_entry() fails before the expr loop. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* selinux: simple cleanup for cond_read_node()Namhyung Kim2017-05-241-7/+2
| | | | | | | | | | The node->cur_state and len can be read in a single call of next_entry(). And setting len before reading is a dead write so can be eliminated. Signed-off-by: Namhyung Kim <namhyung@kernel.org> (Minor tweak to the length parameter in the call to next_entry()) Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* selinux: normalize audit log formattingRichard Guy Briggs2017-05-241-6/+8
| | | | | | | | | | Restructure to keyword=value pairs without spaces. Drop superfluous words in text. Make invalid_context a keyword. Change result= keyword to seresult=. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [Minor rewrite to the patch subject line] Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* selinux: cleanup error reporting in selinux_nlmsg_perm()Richard Guy Briggs2017-05-241-4/+3
| | | | | | | | | | | | | | | Convert audit_log() call to WARN_ONCE(). Rename "type=" to nlmsg_type=" to avoid confusion with the audit record type. Added "protocol=" to help track down which protocol (NETLINK_AUDIT?) was used within the netlink protocol family. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [Rewrote the patch subject line] Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* selinux: Remove unused function avc_sidcmp()Rickard Strandqvist2017-05-241-5/+0
| | | | | | | | | | | Remove the function avc_sidcmp() that is not used anywhere. This was partially found by using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> [PM: rewrite the patch subject line] Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* selinux: quiet the filesystem labeling behavior messagePaul Moore2017-05-241-4/+0
| | | | | | | | | | | | | | While the filesystem labeling method is only printed at the KERN_DEBUG level, this still appears in dmesg and on modern Linux distributions that create a lot of tmpfs mounts for session handling, the dmesg can easily be filled with a lot of "SELinux: initialized (dev X ..." messages. This patch removes this notification for the normal case but leaves the error message intact (displayed when mounting a filesystem with an unknown labeling behavior). Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* SELinux: do all flags twiddling in one placeEric Paris2017-05-241-7/+5
| | | | | | | | | | Currently we set the initialize and seclabel flag in one place. Do some unrelated printk then we unset the seclabel flag. Eww. Instead do the flag twiddling in one place in the code not seperated by unrelated printk. Also don't set and unset the seclabel flag. Only set it if we need to. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* selinux: add force_audit sysfs node to enable logging of dontauditimoseyon2017-05-242-0/+7
| | | | | | | | * for kernel selinux debugging * to enable: * echo Y > /sys/module/selinux/parameters/force_audit Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* selinux: remove unused variabled in the netport, netnode, and netif cachesPaul Moore2017-05-243-6/+4
| | | | | | | | | This patch removes the unused return code variable in the netport, netnode, and netif initialization functions. Reported-by: fengguang.wu@intel.com Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* nohz: Avoid tick's double reprogramming in highres modeViresh Kumar2017-05-241-0/+4
| | | | | | | | | | | | | | | | | | | | | In highres mode, the tick reschedules itself unconditionally to the next jiffies. However while this clock reprogramming is relevant when the tick is in periodic mode, it's not that interesting when we run in dynticks mode because irq exit is likely going to overwrite the next tick to some randomly deferred future. So lets just get rid of this tick self rescheduling in dynticks mode. This way we can avoid some clockevents double write in favourable scenarios like when we stop the tick completely in idle while no other hrtimer is pending. Change-Id: I4e73da9bffbcac49789e5f279b67873f414b22a8 Suggested-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
* nohz: Fix spurious periodic tick behaviour in low-res dynticks modeViresh Kumar2017-05-241-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we reach the end of the tick handler, we unconditionally reschedule the next tick to the next jiffy. Then on irq exit, the nohz code overrides that setting if needed and defers the next tick as far away in the future as possible. Now in the best dynticks case, when we actually don't need any tick in the future (ie: expires == KTIME_MAX), low-res and high-res behave differently. What we want in this case is to cancel the next tick programmed by the previous one. That's what we do in high-res mode. OTOH we lack a low-res mode equivalent of hrtimer_cancel() so we simply don't do anything in this case and the next tick remains scheduled to jiffies + 1. As a result, in low-res mode, when the dynticks code determines that no tick is needed in the future, we can recursively get a spurious tick every jiffy because then the next tick is always reprogrammed from the tick handler and is never cancelled. And this can happen indefinetly until some subsystem actually needs a precise tick in the future and only then we eventually overwrite the previous tick handler setting to defer the next tick. We are fixing this by introducing the ONESHOT_STOPPED mode which will let us pause a clockevent when no further interrupt is needed. Meanwhile we can't expect all drivers to support this new mode. So lets reduce much of the symptoms by skipping the nohz-blind tick rescheduling from the tick-handler when the CPU is in dynticks mode. That tick rescheduling wrongly assumed periodicity and the low-res dynticks code can't cancel such decision. This breaks the recursive (and thus the worst) part of the problem. In the worst case now, we'll get only one extra tick due to uncancelled tick scheduled before we entered dynticks mode. This also removes a needless clockevent write on idle ticks. Since those clock write are usually considered to be slow, it's a general win. Change-Id: Id416f55d0356c0f3d178e87d9821ea50d2e1ed69 Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
* nohz: Get timekeeping max deferment outside jiffies_lockFrederic Weisbecker2017-05-241-1/+2
| | | | | | | | | | | | | | | | | | | | | We don't need to fetch the timekeeping max deferment under the jiffies_lock seqlock. If the clocksource is updated concurrently while we stop the tick, stop machine is called and the tick will be reevaluated again along with uptodate jiffies and its related values. Change-Id: I2889f0b3da1be1b78dfc74e62986deaafd27b679 Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alex Shi <alex.shi@linaro.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Kevin Hilman <khilman@linaro.org> Link: http://lkml.kernel.org/r/1387320692-28460-9-git-send-email-fweisbec@gmail.com Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
* IKSWL-3373: selinux: Improve avc loggingJoel Voss2017-05-241-0/+4
| | | | | | | | | | | | | | | | | | | | Where applicable, include the process UID in the audit log message. This assists debugging the source of denials, especially in the application domain. Change-Id: I082398f0216db893b51f9371f98e6b230d2e9147 Signed-off-by: Joel Voss <jvoss@motorola.com> Reviewed-by: Connie Zhao <czhao1@motorola.com> Reviewed-on: http://gerrit.mot.com/689473 SLTApproved: Slta Waiver <sltawvr@motorola.com> Tested-by: Jira Key <jirakey@motorola.com> Reviewed-by: Christopher Fries <cfries@motorola.com> Submit-Approved: Jira Key <jirakey@motorola.com> Signed-off-by: kgudeth <kgudeth@motorola.com> Reviewed-on: http://gerrit.mot.com/695886 Reviewed-on: http://gerrit.mot.com/727995 SME-Granted: SME Approvals Granted Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com>
* UPSTREAM: math64: New separate div64_u64_rem helperMike Snitzer2017-05-242-0/+53
| | | | | | | | | | | | | | | | | | | | | | Commit f792685006274a850e6cc0ea9ade275ccdfc90bc ("math64: New div64_u64_rem helper") implemented div64_u64 in terms of div64_u64_rem. But div64_u64_rem was removed because it slowed down div64_u64 (and there were no other users of div64_u64_rem). Device Mapper's I/O statistics support has a need for div64_u64_rem; reintroduce this helper as a separate method that doesn't slow down div64_u64, especially on 32-bit systems. BUG: 27175947 Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Change-Id: I05274c1a7235dfa972f5ddc4778e0154240fd9b6 Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com>
* UPSTREAM: proc: actually make proc_fd_permission() thread-friendlyOleg Nesterov2017-05-241-2/+5
| | | | | | | | | | | | | | | | | | | | | (cherry pick from commit 54708d2858e79a2bdda10bf8a20c80eb96c20613) The commit 96d0df79f264 ("proc: make proc_fd_permission() thread-friendly") fixed the access to /proc/self/fd from sub-threads, but introduced another problem: a sub-thread can't access /proc/<tid>/fd/ or /proc/thread-self/fd if generic_permission() fails. Change proc_fd_permission() to check same_thread_group(pid_task(), current). Fixes: 96d0df79f264 ("proc: make proc_fd_permission() thread-friendly") Reported-by: "Jin, Yihua" <yihua.jin@intel.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Bug: 26016905 Change-Id: I1894c78d0b13f0bde8cde84bd142ba67590dc0f1
* UPSTREAM: proc: make proc_fd_permission() thread-friendlyOleg Nesterov2017-05-241-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | (cherry pick from commit 96d0df79f2644fc823f26c06491e182d87a90c2a) proc_fd_permission() says "process can still access /proc/self/fd after it has executed a setuid()", but the "task_pid() = proc_pid() check only helps if the task is group leader, /proc/self points to /proc/<leader-pid>. Change this check to use task_tgid() so that the whole thread group can access its /proc/self/fd or /proc/<tid-of-sub-thread>/fd. Notes: - CLONE_THREAD does not require CLONE_FILES so task->files can differ, but I don't think this can lead to any security problem. And this matches same_thread_group() in __ptrace_may_access(). - /proc/self should probably point to /proc/<thread-tid>, but it is too late to change the rules. Perhaps it makes sense to add /proc/thread though. Test-case: void *tfunc(void *arg) { assert(opendir("/proc/self/fd")); return NULL; } int main(void) { pthread_t t; pthread_create(&t, NULL, tfunc, NULL); pthread_join(t, NULL); return 0; } fails if, say, this executable is not readable and suid_dumpable = 0. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Bug: 26016905 Change-Id: Ifdd6403a8fccd073122e89d3547c13ccc08f0dce Signed-off-by: Joe Maples <joe@frap129.org> Conflicts: fs/proc/fd.c
* include/linux/mm.h: remove ifdef conditionRashika Kheria2017-05-241-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ifdef conditions in include/linux/mm.h presents three cases: - !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) There is no actual definition of function but include/linux/mm.h has a static inline stub defined. - defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) linux/mm.h does not define a prototype, but mm/page_alloc.c defines the function. Hence, compiler reports the following warning: mm/page_alloc.c:4300:15: warning: no previous prototype for `__early_pfn_to_nid' [-Wmissing-prototypes] - defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) The architecture defines the function, and linux/mm.h has a prototype. Thus, join the conditions of Case 2 and 3 ie eliminate the ifdef condition of CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID to eliminate the missing prototype warning from file mm/page_alloc.c. Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* mm/readahead.c: inline ra_submitFabian Frederick2017-05-243-21/+18
| | | | | | | | | | | | | | Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left ra_submit with a single function call. Move ra_submit to internal.h and inline it to save some stack. Thanks to Andrew Morton for commenting different versions. Signed-off-by: Fabian Frederick <fabf@skynet.be> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* mm/memblock.c: introduce bottom-up allocation modeTang Chen2017-05-243-3/+108
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Linux kernel cannot migrate pages used by the kernel. As a result, kernel pages cannot be hot-removed. So we cannot allocate hotpluggable memory for the kernel. ACPI SRAT (System Resource Affinity Table) contains the memory hotplug info. But before SRAT is parsed, memblock has already started to allocate memory for the kernel. So we need to prevent memblock from doing this. In a memory hotplug system, any numa node the kernel resides in should be unhotpluggable. And for a modern server, each node could have at least 16GB memory. So memory around the kernel image is highly likely unhotpluggable. So the basic idea is: Allocate memory from the end of the kernel image and to the higher memory. Since memory allocation before SRAT is parsed won't be too much, it could highly likely be in the same node with kernel image. The current memblock can only allocate memory top-down. So this patch introduces a new bottom-up allocation mode to allocate memory bottom-up. And later when we use this allocation direction to allocate memory, we will limit the start address above the kernel. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Toshi Kani <toshi.kani@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Tejun Heo <tj@kernel.org> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Thomas Renninger <trenn@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* mm/memblock.c: factor out of top-down allocationTang Chen2017-05-241-13/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [Problem] The current Linux cannot migrate pages used by the kernel because of the kernel direct mapping. In Linux kernel space, va = pa + PAGE_OFFSET. When the pa is changed, we cannot simply update the pagetable and keep the va unmodified. So the kernel pages are not migratable. There are also some other issues will cause the kernel pages not migratable. For example, the physical address may be cached somewhere and will be used. It is not to update all the caches. When doing memory hotplug in Linux, we first migrate all the pages in one memory device somewhere else, and then remove the device. But if pages are used by the kernel, they are not migratable. As a result, memory used by the kernel cannot be hot-removed. Modifying the kernel direct mapping mechanism is too difficult to do. And it may cause the kernel performance down and unstable. So we use the following way to do memory hotplug. [What we are doing] In Linux, memory in one numa node is divided into several zones. One of the zones is ZONE_MOVABLE, which the kernel won't use. In order to implement memory hotplug in Linux, we are going to arrange all hotpluggable memory in ZONE_MOVABLE so that the kernel won't use these memory. To do this, we need ACPI's help. In ACPI, SRAT(System Resource Affinity Table) contains NUMA info. The memory affinities in SRAT record every memory range in the system, and also, flags specifying if the memory range is hotpluggable. (Please refer to ACPI spec 5.0 5.2.16) With the help of SRAT, we have to do the following two things to achieve our goal: 1. When doing memory hot-add, allow the users arranging hotpluggable as ZONE_MOVABLE. (This has been done by the MOVABLE_NODE functionality in Linux.) 2. when the system is booting, prevent bootmem allocator from allocating hotpluggable memory for the kernel before the memory initialization finishes. The problem 2 is the key problem we are going to solve. But before solving it, we need some preparation. Please see below. [Preparation] Bootloader has to load the kernel image into memory. And this memory must be unhotpluggable. We cannot prevent this anyway. So in a memory hotplug system, we can assume any node the kernel resides in is not hotpluggable. Before SRAT is parsed, we don't know which memory ranges are hotpluggable. But memblock has already started to work. In the current kernel, memblock allocates the following memory before SRAT is parsed: setup_arch() |->memblock_x86_fill() /* memblock is ready */ |...... |->early_reserve_e820_mpc_new() /* allocate memory under 1MB */ |->reserve_real_mode() /* allocate memory under 1MB */ |->init_mem_mapping() /* allocate page tables, about 2MB to map 1GB memory */ |->dma_contiguous_reserve() /* specified by user, should be low */ |->setup_log_buf() /* specified by user, several mega bytes */ |->relocate_initrd() /* could be large, but will be freed after boot, should reorder */ |->acpi_initrd_override() /* several mega bytes */ |->reserve_crashkernel() /* could be large, should reorder */ |...... |->initmem_init() /* Parse SRAT */ According to Tejun's advice, before SRAT is parsed, we should try our best to allocate memory near the kernel image. Since the whole node the kernel resides in won't be hotpluggable, and for a modern server, a node may have at least 16GB memory, allocating several mega bytes memory around the kernel image won't cross to hotpluggable memory. [About this patchset] So this patchset is the preparation for the problem 2 that we want to solve. It does the following: 1. Make memblock be able to allocate memory bottom up. 1) Keep all the memblock APIs' prototype unmodified. 2) When the direction is bottom up, keep the start address greater than the end of kernel image. 2. Improve init_mem_mapping() to support allocate page tables in bottom up direction. 3. Introduce "movable_node" boot option to enable and disable this functionality. This patch (of 6): Create a new function __memblock_find_range_top_down to factor out of top-down allocation from memblock_find_in_range_node. This is a preparation because we will introduce a new bottom-up allocation mode in the following patch. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Toshi Kani <toshi.kani@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Thomas Renninger <trenn@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
* memblock, numa: binary search node idYinghai Lu2017-05-243-10/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | Current early_pfn_to_nid() on arch that support memblock go over memblock.memory one by one, so will take too many try near the end. We can use existing memblock_search to find the node id for given pfn, that could save some time on bigger system that have many entries memblock.memory array. Here are the timing differences for several machines. In each case with the patch less time was spent in __early_pfn_to_nid(). 3.11-rc5 with patch difference (%) -------- ---------- -------------- UV1: 256 nodes 9TB: 411.66 402.47 -9.19 (2.23%) UV2: 255 nodes 16TB: 1141.02 1138.12 -2.90 (0.25%) UV2: 64 nodes 2TB: 128.15 126.53 -1.62 (1.26%) UV2: 32 nodes 2TB: 121.87 121.07 -0.80 (0.66%) Time in seconds. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Acked-by: Russ Anderson <rja@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>