| Commit message (Collapse) | Author | Age | Files | Lines |
| ... | |
| |
|
|
| |
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When sync does it's WB_SYNC_ALL writeback, it issues data Io and
then immediately waits for IO completion. This is done in the
context of the flusher thread, and hence completely ties up the
flusher thread for the backing device until all the dirty inodes
have been synced. On filesystems that are dirtying inodes constantly
and quickly, this means the flusher thread can be tied up for
minutes per sync call and hence badly affect system level write IO
performance as the page cache cannot be cleaned quickly.
We already have a wait loop for IO completion for sync(2), so cut
this out of the flusher thread and delegate it to wait_sb_inodes().
Hence we can do rapid IO submission, and then wait for it all to
complete.
Effect of sync on fsmark before the patch:
FSUse% Count Size Files/sec App Overhead
.....
0 640000 4096 35154.6 1026984
0 720000 4096 36740.3 1023844
0 800000 4096 36184.6 916599
0 880000 4096 1282.7 1054367
0 960000 4096 3951.3 918773
0 1040000 4096 40646.2 996448
0 1120000 4096 43610.1 895647
0 1200000 4096 40333.1 921048
And a single sync pass took:
real 0m52.407s
user 0m0.000s
sys 0m0.090s
After the patch, there is no impact on fsmark results, and each
individual sync(2) operation run concurrently with the same fsmark
workload takes roughly 7s:
real 0m6.930s
user 0m0.000s
sys 0m0.039s
IOWs, sync is 7-8x faster on a busy filesystem and does not have an
adverse impact on ongoing async data write operations.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 7747bd4bceb3079572695d3942294a6c7b265557)
|
| |
|
|
| |
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
|
|
|
|
|
|
| |
Commit 1c8349a17137: "ext4: fix data integrity sync in ordered mode"
included changes to include/linux/page-flags.h and
mm/page-writeback.c. Apply them as part of the 3.18 ext4 backport.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
|
|
| |
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
|
|
|
|
| |
Older kernels don't pass in a gfp_flags argument.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
|
|
| |
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
|
|
|
|
|
| |
The ACL code changes too much between 3.18 and 3.10; so disable acl
handling so we don't have make things actually work.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
|
|
|
|
| |
partial commit from ec991b05a282642359e81e65855f189f7881009c
include/linux/fs.h: add dir_emit() and dir_relax() for 3.18 backport
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| | |
|
| |
|
|
|
|
| |
Excerpted from commit 5f16f3225b062
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
|
|
|
|
|
| |
Excerpted from commit 743162013: "sched: Remove proliferation of
wait_on_bit() action functions"
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
|
|
| |
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
|
|
|
|
|
|
|
|
|
| |
note: this doesn't guarantee that functionality provided by
FALLOC_FL_COLLAPSE_RANGE, FALLOC_FL_ZERO_RANGE, and FIEMAP_FLAG_CACHE
to necessarily _work_; it only allows ext4 from 3.18 to *compile*.
Fortunately, these are exotic bits of functionality that most people
never use.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| | |
|
| |
|
|
| |
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
|
|
| |
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| | |
|
| |
|
|
| |
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
|
|
| |
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| | |
|
| |
|
|
|
|
|
|
| |
uid_io depends on TASK_XACCT and TASK_IO_ACCOUNTING.
So add depends in Kconfig before compiling code.
Change-Id: Ie6bf57ec7c2eceffadf4da0fc2aca001ce10c36e
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
|
| |
|
|
|
|
|
|
|
| |
Store sum of dead task io stats in uid_entry and defer uid io
calulation until next uid proc stat change or dumpsys.
Bug: 37754877
Change-Id: I970f010a4c841c5ca26d0efc7e027414c3c952e0
Signed-off-by: Jin Qian <jinqian@google.com>
|
| |
|
|
|
|
|
|
| |
struct task_struct *task should be proteced by tasklist_lock.
Change-Id: Iefcd13442a9b9d855a2bbcde9fd838a4132fee58
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
(cherry picked from commit 90d78776c4a0e13fb7ee5bd0787f04a1730631a6)
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Replaced read_lock with rcu_read_lock to reduce time that preemption
is disabled.
Added a function to update io stats for specific uid and moved
hash table lookup, user_namespace out of loops.
Bug: 37319300
Change-Id: I2b81b5cd3b6399b40d08c3c14b42cad044556970
Signed-off-by: Jin Qian <jinqian@google.com>
|
| |
|
|
|
|
|
|
|
| |
If the inode is in the process of being evicted,
the top value may be NULL.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 38502532
Change-Id: I0b9d04aab621e0398d44d1c5dc53293106aa5f89
|
| |
|
|
|
|
|
|
|
|
| |
After silencing the sleeping warning in mls_convert_context() I started
seeing similar traces from hashtab_insert. Do a cond_resched there too.
Signed-off-by: Dave Jones <davej@redhat.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
selinux policy
On a slow machine (with debugging enabled), upgrading selinux policy may take
a considerable amount of time. Long enough that the softlockup detector
gets triggered.
The backtrace looks like this..
> BUG: soft lockup - CPU#2 stuck for 23s! [load_policy:19045]
> Call Trace:
> [<ffffffff81221ddf>] symcmp+0xf/0x20
> [<ffffffff81221c27>] hashtab_search+0x47/0x80
> [<ffffffff8122e96c>] mls_convert_context+0xdc/0x1c0
> [<ffffffff812294e8>] convert_context+0x378/0x460
> [<ffffffff81229170>] ? security_context_to_sid_core+0x240/0x240
> [<ffffffff812221b5>] sidtab_map+0x45/0x80
> [<ffffffff8122bb9f>] security_load_policy+0x3ff/0x580
> [<ffffffff810788a8>] ? sched_clock_cpu+0xa8/0x100
> [<ffffffff810786dd>] ? sched_clock_local+0x1d/0x80
> [<ffffffff810788a8>] ? sched_clock_cpu+0xa8/0x100
> [<ffffffff8103096a>] ? __change_page_attr_set_clr+0x82a/0xa50
> [<ffffffff810786dd>] ? sched_clock_local+0x1d/0x80
> [<ffffffff810788a8>] ? sched_clock_cpu+0xa8/0x100
> [<ffffffff8103096a>] ? __change_page_attr_set_clr+0x82a/0xa50
> [<ffffffff810788a8>] ? sched_clock_cpu+0xa8/0x100
> [<ffffffff81534ddc>] ? retint_restore_args+0xe/0xe
> [<ffffffff8109c82d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
> [<ffffffff81279a2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [<ffffffff810d28a8>] ? rcu_irq_exit+0x68/0xb0
> [<ffffffff81534ddc>] ? retint_restore_args+0xe/0xe
> [<ffffffff8121e947>] sel_write_load+0xa7/0x770
> [<ffffffff81139633>] ? vfs_write+0x1c3/0x200
> [<ffffffff81210e8e>] ? security_file_permission+0x1e/0xa0
> [<ffffffff8113952b>] vfs_write+0xbb/0x200
> [<ffffffff811581c7>] ? fget_light+0x397/0x4b0
> [<ffffffff81139c27>] SyS_write+0x47/0xa0
> [<ffffffff8153bde4>] tracesys+0xdd/0xe2
Stephen Smalley suggested:
> Maybe put a cond_resched() within the ebitmap_for_each_positive_bit()
> loop in mls_convert_context()?
That seems to do the trick. Tested by downgrading and re-upgrading selinux-policy-targeted.
Signed-off-by: Dave Jones <davej@redhat.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With the introduction of fair queued rwlock, recursive read_lock()
may hang the offending process if there is a write_lock() somewhere
in between.
With recursive read_lock checking enabled, the following error was
reported:
=============================================
[ INFO: possible recursive locking detected ]
3.16.0-rc1 #2 Tainted: G E
---------------------------------------------
load_policy/708 is trying to acquire lock:
(policy_rwlock){.+.+..}, at: [<ffffffff8125b32a>]
security_genfs_sid+0x3a/0x170
but task is already holding lock:
(policy_rwlock){.+.+..}, at: [<ffffffff8125b48c>]
security_fs_use+0x2c/0x110
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(policy_rwlock);
lock(policy_rwlock);
This patch fixes the occurrence of recursive read_lock() of
policy_rwlock by adding a helper function __security_genfs_sid()
which requires caller to take the lock before calling it. The
security_fs_use() was then modified to call the new helper function.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Conflicts:
security/selinux/ss/services.c
(Adapted for Shamu 3.10 Kernel)
Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
|
| |
|
|
|
|
|
|
|
|
| |
The cond_read_node() should free the given node on error path as it's
not linked to p->cond_list yet. This is done via cond_node_destroy()
but it's not called when next_entry() fails before the expr loop.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
|
| |
|
|
|
|
|
|
|
|
| |
The node->cur_state and len can be read in a single call of next_entry().
And setting len before reading is a dead write so can be eliminated.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
(Minor tweak to the length parameter in the call to next_entry())
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
|
| |
|
|
|
|
|
|
|
|
| |
Restructure to keyword=value pairs without spaces. Drop superfluous words in
text. Make invalid_context a keyword. Change result= keyword to seresult=.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[Minor rewrite to the patch subject line]
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Convert audit_log() call to WARN_ONCE().
Rename "type=" to nlmsg_type=" to avoid confusion with the audit record
type.
Added "protocol=" to help track down which protocol (NETLINK_AUDIT?) was used
within the netlink protocol family.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[Rewrote the patch subject line]
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
|
| |
|
|
|
|
|
|
|
|
|
| |
Remove the function avc_sidcmp() that is not used anywhere.
This was partially found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
[PM: rewrite the patch subject line]
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
While the filesystem labeling method is only printed at the KERN_DEBUG
level, this still appears in dmesg and on modern Linux distributions
that create a lot of tmpfs mounts for session handling, the dmesg can
easily be filled with a lot of "SELinux: initialized (dev X ..."
messages. This patch removes this notification for the normal case
but leaves the error message intact (displayed when mounting a
filesystem with an unknown labeling behavior).
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
|
| |
|
|
|
|
|
|
|
|
| |
Currently we set the initialize and seclabel flag in one place. Do some
unrelated printk then we unset the seclabel flag. Eww. Instead do the flag
twiddling in one place in the code not seperated by unrelated printk. Also
don't set and unset the seclabel flag. Only set it if we need to.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
|
| |
|
|
|
|
|
|
| |
* for kernel selinux debugging
* to enable:
* echo Y > /sys/module/selinux/parameters/force_audit
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
|
| |
|
|
|
|
|
|
|
| |
This patch removes the unused return code variable in the netport,
netnode, and netif initialization functions.
Reported-by: fengguang.wu@intel.com
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In highres mode, the tick reschedules itself unconditionally to the
next jiffies.
However while this clock reprogramming is relevant when the tick is
in periodic mode, it's not that interesting when we run in dynticks mode
because irq exit is likely going to overwrite the next tick to some
randomly deferred future.
So lets just get rid of this tick self rescheduling in dynticks mode.
This way we can avoid some clockevents double write in favourable
scenarios like when we stop the tick completely in idle while no other
hrtimer is pending.
Change-Id: I4e73da9bffbcac49789e5f279b67873f414b22a8
Suggested-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When we reach the end of the tick handler, we unconditionally reschedule
the next tick to the next jiffy. Then on irq exit, the nohz code
overrides that setting if needed and defers the next tick as far away in
the future as possible.
Now in the best dynticks case, when we actually don't need any tick in
the future (ie: expires == KTIME_MAX), low-res and high-res behave
differently. What we want in this case is to cancel the next tick
programmed by the previous one. That's what we do in high-res mode. OTOH
we lack a low-res mode equivalent of hrtimer_cancel() so we simply don't
do anything in this case and the next tick remains scheduled to jiffies + 1.
As a result, in low-res mode, when the dynticks code determines that no
tick is needed in the future, we can recursively get a spurious tick
every jiffy because then the next tick is always reprogrammed from the
tick handler and is never cancelled. And this can happen indefinetly
until some subsystem actually needs a precise tick in the future and only
then we eventually overwrite the previous tick handler setting to defer
the next tick.
We are fixing this by introducing the ONESHOT_STOPPED mode which will
let us pause a clockevent when no further interrupt is needed. Meanwhile
we can't expect all drivers to support this new mode.
So lets reduce much of the symptoms by skipping the nohz-blind tick
rescheduling from the tick-handler when the CPU is in dynticks mode.
That tick rescheduling wrongly assumed periodicity and the low-res
dynticks code can't cancel such decision. This breaks the recursive (and
thus the worst) part of the problem. In the worst case now, we'll get
only one extra tick due to uncancelled tick scheduled before we entered
dynticks mode.
This also removes a needless clockevent write on idle ticks. Since those
clock write are usually considered to be slow, it's a general win.
Change-Id: Id416f55d0356c0f3d178e87d9821ea50d2e1ed69
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We don't need to fetch the timekeeping max deferment under the
jiffies_lock seqlock.
If the clocksource is updated concurrently while we stop the tick,
stop machine is called and the tick will be reevaluated again along with
uptodate jiffies and its related values.
Change-Id: I2889f0b3da1be1b78dfc74e62986deaafd27b679
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Kevin Hilman <khilman@linaro.org>
Link: http://lkml.kernel.org/r/1387320692-28460-9-git-send-email-fweisbec@gmail.com
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Where applicable, include the process UID in the audit
log message. This assists debugging the source of denials,
especially in the application domain.
Change-Id: I082398f0216db893b51f9371f98e6b230d2e9147
Signed-off-by: Joel Voss <jvoss@motorola.com>
Reviewed-by: Connie Zhao <czhao1@motorola.com>
Reviewed-on: http://gerrit.mot.com/689473
SLTApproved: Slta Waiver <sltawvr@motorola.com>
Tested-by: Jira Key <jirakey@motorola.com>
Reviewed-by: Christopher Fries <cfries@motorola.com>
Submit-Approved: Jira Key <jirakey@motorola.com>
Signed-off-by: kgudeth <kgudeth@motorola.com>
Reviewed-on: http://gerrit.mot.com/695886
Reviewed-on: http://gerrit.mot.com/727995
SME-Granted: SME Approvals Granted
Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit f792685006274a850e6cc0ea9ade275ccdfc90bc ("math64: New
div64_u64_rem helper") implemented div64_u64 in terms of div64_u64_rem.
But div64_u64_rem was removed because it slowed down div64_u64 (and
there were no other users of div64_u64_rem).
Device Mapper's I/O statistics support has a need for div64_u64_rem;
reintroduce this helper as a separate method that doesn't slow down
div64_u64, especially on 32-bit systems.
BUG: 27175947
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Change-Id: I05274c1a7235dfa972f5ddc4778e0154240fd9b6
Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
(cherry pick from commit 54708d2858e79a2bdda10bf8a20c80eb96c20613)
The commit 96d0df79f264 ("proc: make proc_fd_permission() thread-friendly")
fixed the access to /proc/self/fd from sub-threads, but introduced another
problem: a sub-thread can't access /proc/<tid>/fd/ or /proc/thread-self/fd
if generic_permission() fails.
Change proc_fd_permission() to check same_thread_group(pid_task(), current).
Fixes: 96d0df79f264 ("proc: make proc_fd_permission() thread-friendly")
Reported-by: "Jin, Yihua" <yihua.jin@intel.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug: 26016905
Change-Id: I1894c78d0b13f0bde8cde84bd142ba67590dc0f1
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
(cherry pick from commit 96d0df79f2644fc823f26c06491e182d87a90c2a)
proc_fd_permission() says "process can still access /proc/self/fd after it
has executed a setuid()", but the "task_pid() = proc_pid() check only
helps if the task is group leader, /proc/self points to
/proc/<leader-pid>.
Change this check to use task_tgid() so that the whole thread group can
access its /proc/self/fd or /proc/<tid-of-sub-thread>/fd.
Notes:
- CLONE_THREAD does not require CLONE_FILES so task->files
can differ, but I don't think this can lead to any security
problem. And this matches same_thread_group() in
__ptrace_may_access().
- /proc/self should probably point to /proc/<thread-tid>, but
it is too late to change the rules. Perhaps it makes sense
to add /proc/thread though.
Test-case:
void *tfunc(void *arg)
{
assert(opendir("/proc/self/fd"));
return NULL;
}
int main(void)
{
pthread_t t;
pthread_create(&t, NULL, tfunc, NULL);
pthread_join(t, NULL);
return 0;
}
fails if, say, this executable is not readable and suid_dumpable = 0.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug: 26016905
Change-Id: Ifdd6403a8fccd073122e89d3547c13ccc08f0dce
Signed-off-by: Joe Maples <joe@frap129.org>
Conflicts:
fs/proc/fd.c
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The ifdef conditions in include/linux/mm.h presents three cases:
- !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
There is no actual definition of function but include/linux/mm.h has a
static inline stub defined.
- defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
linux/mm.h does not define a prototype, but mm/page_alloc.c defines
the function.
Hence, compiler reports the following warning:
mm/page_alloc.c:4300:15: warning: no previous prototype for `__early_pfn_to_nid' [-Wmissing-prototypes]
- defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
The architecture defines the function, and linux/mm.h has a
prototype.
Thus, join the conditions of Case 2 and 3 ie eliminate the ifdef
condition of CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID to eliminate the missing
prototype warning from file mm/page_alloc.c.
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left
ra_submit with a single function call.
Move ra_submit to internal.h and inline it to save some stack. Thanks
to Andrew Morton for commenting different versions.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The Linux kernel cannot migrate pages used by the kernel. As a result,
kernel pages cannot be hot-removed. So we cannot allocate hotpluggable
memory for the kernel.
ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
info. But before SRAT is parsed, memblock has already started to allocate
memory for the kernel. So we need to prevent memblock from doing this.
In a memory hotplug system, any numa node the kernel resides in should be
unhotpluggable. And for a modern server, each node could have at least
16GB memory. So memory around the kernel image is highly likely
unhotpluggable.
So the basic idea is: Allocate memory from the end of the kernel image and
to the higher memory. Since memory allocation before SRAT is parsed won't
be too much, it could highly likely be in the same node with kernel image.
The current memblock can only allocate memory top-down. So this patch
introduces a new bottom-up allocation mode to allocate memory bottom-up.
And later when we use this allocation direction to allocate memory, we
will limit the start address above the kernel.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[Problem]
The current Linux cannot migrate pages used by the kernel because of the
kernel direct mapping. In Linux kernel space, va = pa + PAGE_OFFSET.
When the pa is changed, we cannot simply update the pagetable and keep the
va unmodified. So the kernel pages are not migratable.
There are also some other issues will cause the kernel pages not
migratable. For example, the physical address may be cached somewhere and
will be used. It is not to update all the caches.
When doing memory hotplug in Linux, we first migrate all the pages in one
memory device somewhere else, and then remove the device. But if pages
are used by the kernel, they are not migratable. As a result, memory used
by the kernel cannot be hot-removed.
Modifying the kernel direct mapping mechanism is too difficult to do. And
it may cause the kernel performance down and unstable. So we use the
following way to do memory hotplug.
[What we are doing]
In Linux, memory in one numa node is divided into several zones. One of
the zones is ZONE_MOVABLE, which the kernel won't use.
In order to implement memory hotplug in Linux, we are going to arrange all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use these
memory. To do this, we need ACPI's help.
In ACPI, SRAT(System Resource Affinity Table) contains NUMA info. The
memory affinities in SRAT record every memory range in the system, and
also, flags specifying if the memory range is hotpluggable. (Please refer
to ACPI spec 5.0 5.2.16)
With the help of SRAT, we have to do the following two things to achieve our
goal:
1. When doing memory hot-add, allow the users arranging hotpluggable as
ZONE_MOVABLE.
(This has been done by the MOVABLE_NODE functionality in Linux.)
2. when the system is booting, prevent bootmem allocator from allocating
hotpluggable memory for the kernel before the memory initialization
finishes.
The problem 2 is the key problem we are going to solve. But before solving it,
we need some preparation. Please see below.
[Preparation]
Bootloader has to load the kernel image into memory. And this memory must
be unhotpluggable. We cannot prevent this anyway. So in a memory hotplug
system, we can assume any node the kernel resides in is not hotpluggable.
Before SRAT is parsed, we don't know which memory ranges are hotpluggable.
But memblock has already started to work. In the current kernel,
memblock allocates the following memory before SRAT is parsed:
setup_arch()
|->memblock_x86_fill() /* memblock is ready */
|......
|->early_reserve_e820_mpc_new() /* allocate memory under 1MB */
|->reserve_real_mode() /* allocate memory under 1MB */
|->init_mem_mapping() /* allocate page tables, about 2MB to map 1GB memory */
|->dma_contiguous_reserve() /* specified by user, should be low */
|->setup_log_buf() /* specified by user, several mega bytes */
|->relocate_initrd() /* could be large, but will be freed after boot, should reorder */
|->acpi_initrd_override() /* several mega bytes */
|->reserve_crashkernel() /* could be large, should reorder */
|......
|->initmem_init() /* Parse SRAT */
According to Tejun's advice, before SRAT is parsed, we should try our best
to allocate memory near the kernel image. Since the whole node the kernel
resides in won't be hotpluggable, and for a modern server, a node may have
at least 16GB memory, allocating several mega bytes memory around the
kernel image won't cross to hotpluggable memory.
[About this patchset]
So this patchset is the preparation for the problem 2 that we want to
solve. It does the following:
1. Make memblock be able to allocate memory bottom up.
1) Keep all the memblock APIs' prototype unmodified.
2) When the direction is bottom up, keep the start address greater than the
end of kernel image.
2. Improve init_mem_mapping() to support allocate page tables in
bottom up direction.
3. Introduce "movable_node" boot option to enable and disable this
functionality.
This patch (of 6):
Create a new function __memblock_find_range_top_down to factor out of
top-down allocation from memblock_find_in_range_node. This is a
preparation because we will introduce a new bottom-up allocation mode in
the following patch.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Current early_pfn_to_nid() on arch that support memblock go over
memblock.memory one by one, so will take too many try near the end.
We can use existing memblock_search to find the node id for given pfn,
that could save some time on bigger system that have many entries
memblock.memory array.
Here are the timing differences for several machines. In each case with
the patch less time was spent in __early_pfn_to_nid().
3.11-rc5 with patch difference (%)
-------- ---------- --------------
UV1: 256 nodes 9TB: 411.66 402.47 -9.19 (2.23%)
UV2: 255 nodes 16TB: 1141.02 1138.12 -2.90 (0.25%)
UV2: 64 nodes 2TB: 128.15 126.53 -1.62 (1.26%)
UV2: 32 nodes 2TB: 121.87 121.07 -0.80 (0.66%)
Time in seconds.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
|