This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Moyster <oysterized@gmail.com>
|
While compiling, the integer err was flagged as a set-but-unused
variable. elevator_init_fn can be cfq_init_queue, deadline_init_queue,
or noop_init_queue, and all three of these functions return -ENOMEM if
they fail to allocate the queue. So we should actually return that
error code rather than always returning 0.
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: mydongistiny <jaysonedson@gmail.com>
Signed-off-by: Harshit Jain <harshitjain6751@gmail.com>
Signed-off-by: dev-harsh1998 <harshitjain6751@gmail.com>
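A minimal sketch of the shape of the fix (hypothetical and simplified;
the real change is in block/elevator.c, and names/signatures follow the
old elevator API loosely):

    int elevator_init(struct request_queue *q, char *name)
    {
            struct elevator_type *e = elevator_get(name);
            int err;

            if (!e)
                    return -ENOENT;

            /* cfq/deadline/noop init; may fail with -ENOMEM */
            err = e->ops.elevator_init_fn(q, e);
            if (err)
                    elevator_put(e);
            return err;     /* was: return 0; */
    }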
|
commit ac34f15e0c6d2fd58480052b6985f6991fb53bcc upstream.
When tearing down a block device early in its lifetime, userspace may
still be performing discovery actions like blkdev_ioctl() to re-read
partitions.
The nvdimm_revalidate_disk() implementation depends on
disk->driverfs_dev to be valid at entry. However, it is set to NULL in
del_gendisk() and fatally this is happening *before* the disk device is
deleted from userspace view.
There's no reason for del_gendisk() to clear ->driverfs_dev. That
device is the parent of the disk. It is guaranteed to not be freed
until the disk, as a child, drops its ->parent reference.
We could also fix this issue locally in nvdimm_revalidate_disk() by
using disk_to_dev(disk)->parent, but let's fix it globally, since
->driverfs_dev follows the lifetime of the parent. Longer term we
should probably just add a @parent parameter to add_disk(), and stop
carrying this pointer in the gendisk.
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffffa00340a8>] nvdimm_revalidate_disk+0x18/0x90 [libnvdimm]
CPU: 2 PID: 538 Comm: systemd-udevd Tainted: G O 4.4.0-rc5 #2257
[..]
Call Trace:
[<ffffffff8143e5c7>] rescan_partitions+0x87/0x2c0
[<ffffffff810f37f9>] ? __lock_is_held+0x49/0x70
[<ffffffff81438c62>] __blkdev_reread_part+0x72/0xb0
[<ffffffff81438cc5>] blkdev_reread_part+0x25/0x40
[<ffffffff8143982d>] blkdev_ioctl+0x4fd/0x9c0
[<ffffffff811246c9>] ? current_kernel_time64+0x69/0xd0
[<ffffffff812916dd>] block_ioctl+0x3d/0x50
[<ffffffff81264c38>] do_vfs_ioctl+0x308/0x560
[<ffffffff8115dbd1>] ? __audit_syscall_entry+0xb1/0x100
[<ffffffff810031d6>] ? do_audit_syscall_entry+0x66/0x70
[<ffffffff81264f09>] SyS_ioctl+0x79/0x90
[<ffffffff81902672>] entry_SYSCALL_64_fastpath+0x12/0x76
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Reported-by: Robert Hu <robert.hu@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
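The global fix is essentially a one-line removal in del_gendisk()
(sketch; the local alternative mentioned above would instead read the
parent via disk_to_dev(disk)->parent inside nvdimm_revalidate_disk()):

    /* del_gendisk(): the parent outlives the disk, which holds a
     * ->parent reference, so there is no need to clear the pointer. */
    -       disk->driverfs_dev = NULL;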
|
commit 25cdb64510644f3e854d502d69c73f21c6df88a9 upstream.
The WRITE_SAME commands are not present in the blk_default_cmd_filter
write_ok list, and thus are failed with -EPERM when the SG_IO ioctl()
is executed without CAP_SYS_RAWIO capability (e.g., unprivileged users).
[ sg_io() -> blk_fill_sghdr_rq() -> blk_verify_command() -> -EPERM ]
The problem can be reproduced with the sg_write_same command:
# sg_write_same --num 1 --xferlen 512 /dev/sda
#
# capsh --drop=cap_sys_rawio -- -c \
'sg_write_same --num 1 --xferlen 512 /dev/sda'
Write same: pass through os error: Operation not permitted
#
For comparison, the WRITE_VERIFY command does not observe this problem,
since it is in that list:
# capsh --drop=cap_sys_rawio -- -c \
'sg_write_verify --num 1 --ilen 512 --lba 0 /dev/sda'
#
So, this patch adds the WRITE_SAME commands to the list, in order
for the SG_IO ioctl to finish successfully:
# capsh --drop=cap_sys_rawio -- -c \
'sg_write_same --num 1 --xferlen 512 /dev/sda'
#
That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
(qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2]),
which employs the SG_IO ioctl() and runs as an unprivileged user (libvirt-qemu).
In that scenario, when a filesystem (e.g., ext4) performs its zero-out calls,
which are translated to write-same calls in the guest kernel, and then into
SG_IO ioctls to the host kernel, SCSI I/O errors may be observed in the guest:
[...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[...] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
[...] sd 0:0:0:0: [sda] tag#0 Add. Sense: I/O process terminated
[...] sd 0:0:0:0: [sda] tag#0 CDB: Write Same(10) 41 00 01 04 e0 78 00 00 08 00
[...] blk_update_request: I/O error, dev sda, sector 17096824
Links:
[1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
[2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Brahadambal Srinivasan <latha@linux.vnet.ibm.com>
Reported-by: Manjunatha H R <manjuhr1@in.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
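The fix boils down to setting the corresponding opcode bits in the
default command filter (hedged sketch; the opcode macros come from the
SCSI headers, and the exact set added upstream may differ):

    /* in blk_set_cmd_filter_defaults(), alongside the other write_ok bits */
    __set_bit(WRITE_SAME, filter->write_ok);
    __set_bit(WRITE_SAME_16, filter->write_ok);
    __set_bit(WRITE_SAME_32, filter->write_ok);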
|
With the optimizations around not clearing the full request at alloc
time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC
up to the user allocating the request.
Add a blk_rq_set_block_pc() that sets the command type to
REQ_TYPE_BLOCK_PC, and properly initializes the members associated
with this type of request. Update callers to use this function instead
of manipulating rq->cmd_type directly.
Includes fixes from Christoph Hellwig <hch@lst.de> for my half-assed
attempt.
Change-Id: Ifc386dfb951c5d6adebf48ff38135dda28e4b1ce
Signed-off-by: Jens Axboe <axboe@fb.com>
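A hedged sketch of what such a helper looks like (field names per the
pre-4.x struct request; the actual helper may reset a slightly
different set of members):

    void blk_rq_set_block_pc(struct request *rq)
    {
            rq->cmd_type = REQ_TYPE_BLOCK_PC;
            rq->__data_len = 0;
            rq->__sector = (sector_t) -1;
            rq->bio = rq->biotail = NULL;
            memset(rq->__cmd, 0, sizeof(rq->__cmd));
    }
    EXPORT_SYMBOL(blk_rq_set_block_pc);

Callers then pair blk_get_request() with blk_rq_set_block_pc(rq)
instead of assigning rq->cmd_type by hand.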
|
Both damn things interpret userland pointers embedded into the payload;
worse, they are actually traversing those. Leaving aside the bad
API design, this is very much _not_ safe to call with KERNEL_DS.
Bail out early if that happens.
Change-Id: I383485b4d44970eb61b7178c0b9a4376abfe8cd1
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Corinna Vinschen <xda@vinschen.de>
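The early bail-out amounts to a check like the following at the top of
the affected write paths (sketch; kernels of this vintage spell the
check with segment_eq()/get_fs(), later ones with uaccess_kernel()):

    /* sg_write()/bsg_write(): the payload embeds userland pointers that
     * get dereferenced, so refuse to run with a kernel data segment. */
    if (unlikely(segment_eq(get_fs(), KERNEL_DS)))
            return -EINVAL;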
|
Linux 3.19 commit 69c953c ("lib/lcm.c: lcm(n,0)=lcm(0,n) is 0, not n")
caused blk_stack_limits() to not properly stack queue_limits for stacked
devices (e.g. DM).
Fix this regression by establishing lcm_not_zero() and switching
blk_stack_limits() over to using it.
DM uses blk_set_stacking_limits() to establish the initial top-level
queue_limits that are then built up based on underlying devices' limits
using blk_stack_limits(). In the case of optimal_io_size (io_opt)
blk_set_stacking_limits() establishes a default value of 0. With commit
69c953c, lcm(0, n) is no longer n, which compromises proper stacking of
the underlying devices' io_opt.
Test:
$ modprobe scsi_debug dev_size_mb=10 num_tgts=1 opt_blks=1536
$ cat /sys/block/sde/queue/optimal_io_size
786432
$ dmsetup create node --table "0 100 linear /dev/sde 0"
Before this fix:
$ cat /sys/block/dm-5/queue/optimal_io_size
0
After this fix:
$ cat /sys/block/dm-5/queue/optimal_io_size
786432
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.19+
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
Signed-off-by: mydongistiny <jaysonedson@gmail.com>
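The new helper is small (sketch of include/linux/lcm.h; lcm() is the
existing helper that now returns 0 when either argument is 0):

    static inline unsigned long lcm_not_zero(unsigned long a, unsigned long b)
    {
            unsigned long l = lcm(a, b);

            if (l)
                    return l;
            return (a ? a : b);     /* fall back to whichever side is set */
    }

blk_stack_limits() then computes, e.g.,
t->io_opt = lcm_not_zero(t->io_opt, b->io_opt) so a still-default 0 on
one side no longer zeroes the stacked value.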
|
This commit is the result of
find . -name '*.c' | xargs sed -i 's/ __cpuinit / /g'
find . -name '*.c' | xargs sed -i 's/ __cpuexit / /g'
find . -name '*.c' | xargs sed -i 's/ __cpuinitdata / /g'
find . -name '*.c' | xargs sed -i 's/ __cpuinit$//g'
find ./arch/ -name '*.h' | xargs sed -i 's/ __cpuinit//g'
find . -name '*.c' | xargs sed -i 's/^__cpuinit //g'
find . -name '*.c' | xargs sed -i 's/^__cpuinitdata //g'
find . -name '*.c' | xargs sed -i 's/\*__cpuinit /\*/g'
find . -name '*.c' | xargs sed -i 's/ __cpuinitconst / /g'
find . -name '*.h' | xargs sed -i 's/ __cpuinit / /g'
find . -name '*.h' | xargs sed -i 's/ __cpuinitdata / /g'
git add .
git reset include/linux/init.h
git checkout -- include/linux/init.h
Based off: https://github.com/jollaman999/jolla-kernel_bullhead/commit/bc15db84a622eed7d61d3ece579b577154d0ec29
|
This reference count has been around since before git history, but the
only place where it's used is in blk_execute_rq, and there it is
entirely useless, as it is incremented before submitting the request
and decremented in the end_io handler before waking up the submitter
thread.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
We have officially run out of flags in a 32-bit space. Extend it
to 64-bit even on 32-bit archs.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
Use the helper function instead of __GFP_ZERO.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
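The conversion pattern (illustrative; kzalloc(size, flags) is defined
as kmalloc(size, flags | __GFP_ZERO)):

    -       buf = kmalloc(size, GFP_KERNEL | __GFP_ZERO);
    +       buf = kzalloc(size, GFP_KERNEL);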
|
This patch fixes spelling typos found in DocBook/kernel-api.xml.
Because the file is generated from the source comments, the fix has to
be applied to the comments in the source code.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
commit 3932a86b4b9d1f0b049d64d4591ce58ad18b44ec upstream.
While debugging timeouts happening in my application workload (ScyllaDB), I have
observed calls to open() taking a long time, ranging everywhere from 2 seconds -
the first ones that are enough to time out my application - to more than 30
seconds.
The problem seems to happen because XFS may block on pending metadata updates
under certain circumstances, and that's confirmed with the following backtrace
taken by the offcputime tool (iovisor/bcc):
ffffffffb90c57b1 finish_task_switch
ffffffffb97dffb5 schedule
ffffffffb97e310c schedule_timeout
ffffffffb97e1f12 __down
ffffffffb90ea821 down
ffffffffc046a9dc xfs_buf_lock
ffffffffc046abfb _xfs_buf_find
ffffffffc046ae4a xfs_buf_get_map
ffffffffc046babd xfs_buf_read_map
ffffffffc0499931 xfs_trans_read_buf_map
ffffffffc044a561 xfs_da_read_buf
ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
ffffffffc0452b90 xfs_dir2_leaf_lookup_int
ffffffffc0452e0f xfs_dir2_leaf_lookup
ffffffffc044d9d3 xfs_dir_lookup
ffffffffc047d1d9 xfs_lookup
ffffffffc0479e53 xfs_vn_lookup
ffffffffb925347a path_openat
ffffffffb9254a71 do_filp_open
ffffffffb9242a94 do_sys_open
ffffffffb9242b9e sys_open
ffffffffb97e42b2 entry_SYSCALL_64_fastpath
00007fb0698162ed [unknown]
Inspecting my run with blktrace, I can see that the xfsaild kthread exhibits
very high "Dispatch wait" times, in the dozens-of-seconds range and consistent
with the open() times I saw in that run.
Still from the blktrace output, we can, after searching a bit, identify the
request that wasn't dispatched:
8,0 11 152 81.092472813 804 A WM 141698288 + 8 <- (8,1) 141696240
8,0 11 153 81.092472889 804 Q WM 141698288 + 8 [xfsaild/sda1]
8,0 11 154 81.092473207 804 G WM 141698288 + 8 [xfsaild/sda1]
8,0 11 206 81.092496118 804 I WM 141698288 + 8 ( 22911) [xfsaild/sda1]
<==== 'I' means Inserted (into the IO scheduler) ===================================>
8,0 0 289372 96.718761435 0 D WM 141698288 + 8 (15626265317) [swapper/0]
<==== Only 15s later the CFQ scheduler dispatches the request ======================>
As we can see above, in this particular example CFQ took 15 seconds to dispatch
this request. Going back to the full trace, we can see that the xfsaild queue
had plenty of opportunity to run, and it was selected as the active queue many
times. It would just always be preempted by something else (example):
8,0 1 0 81.117912979 0 m N cfq1618SN / insert_request
8,0 1 0 81.117913419 0 m N cfq1618SN / add_to_rr
8,0 1 0 81.117914044 0 m N cfq1618SN / preempt
8,0 1 0 81.117914398 0 m N cfq767A / slice expired t=1
8,0 1 0 81.117914755 0 m N cfq767A / resid=40
8,0 1 0 81.117915340 0 m N / served: vt=1948520448 min_vt=1948520448
8,0 1 0 81.117915858 0 m N cfq767A / sl_used=1 disp=0 charge=0 iops=1 sect=0
where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
IO dispatchers.
The requests preempting the xfsaild queue are synchronous requests. That's a
characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
While it can be argued that preempting ASYNC requests in favor of SYNC is part
of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
goal.
Moreover, unless I am misunderstanding something, that breaks the expectation
set by the "fifo_expire_async" tunable, which in my system is set to the
default.
Looking at the code, it seems to me that the issue is that after we make
an async queue active, there is no guarantee that it will execute any request.
When the queue itself tests cfq_may_dispatch() it can bail if it sees SYNC
requests in flight. An incoming request from another queue can also preempt it
in such a situation before we have the chance to execute anything (as seen in
the trace above).
This patch sets the must_dispatch flag if we notice that we have requests
that are already fifo_expired. This flag is always cleared after
cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
the queue for subsequent requests (unless they are themselves expired).
Care is taken during preempt to still allow rt requests to preempt us
regardless.
Testing my workload with this patch applied produces much better results.
From the application side I see no timeouts, and the open() latency histogram
generated by systemtap looks much better, with the worst outlier at 131ms:
Latency histogram of xfs_buf_lock acquisition (microseconds):
value |-------------------------------------------------- count
0 | 11
1 |@@@@ 161
2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1966
4 |@ 54
8 | 36
16 | 7
32 | 0
64 | 0
~
1024 | 0
2048 | 0
4096 | 1
8192 | 1
16384 | 2
32768 | 0
65536 | 0
131072 | 1
262144 | 0
524288 | 0
Signed-off-by: Glauber Costa <glauber@scylladb.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: linux-block@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
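In rough outline the change touches two spots (hedged sketch, not the
literal diff; flag helpers follow cfq-iosched.c naming conventions, and
on older kernels fifo_time is in jiffies):

    /* cfq_check_fifo(): if the head request has waited past its fifo
     * expiry, pin the queue so it actually gets to dispatch. */
    if (time_after_eq(jiffies, rq->fifo_time))
            cfq_mark_cfqq_must_dispatch(cfqq);

    /* cfq_should_preempt(): only RT queues may preempt a queue that is
     * marked must_dispatch. */
    if (cfq_cfqq_must_dispatch(cfqq) && !cfq_class_rt(new_cfqq))
            return false;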
|
We've switched over every architecture that supports SMP to it, so
remove the now-useless config variable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[imaund@codeaurora.org: resolve merge conflicts]
Signed-off-by: Ian Maund <imaund@codeaurora.org>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
|
Squashed commit of the following:
commit f49e14ccdcb6694ed27754e020057d27a8fcca07
Author: Andrei F <luxneb@gmail.com>
Date: Thu Nov 26 22:40:38 2015 +0100
elevator: Fix a race in elevator switching
commit d50235b7bc3ee0a0427984d763ea7534149531b4 upstream.
There's a race between elevator switching and normal io operation.
The allocations of struct elevator_queue and struct elevator_data
are not done in one atomic operation, so there is a window in which
a NULL ->elevator_data can be used.
For example:
Thread A:                         Thread B:
blk_queue_bio                     elevator_switch
  spin_lock_irq(q->queue_lock)      elevator_alloc
  elv_merge                         elevator_init_fn
Because elevator_alloc() is called without holding the queue_lock,
->elevator_data is still NULL; at the same time thread A calls
elv_merge, which needs some info from elevator_data, and so the
crash happens.
Move elevator_alloc() into elevator_init_fn so that the two
operations become atomic.
The bug can easily be reproduced with the following:
1: dd if=/dev/sdb of=/dev/null
2: while true; do echo noop > scheduler; echo deadline > scheduler; done
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Jonghwan Choi <jhbird.choi@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit daf22a727e64f1277b074442efb821366015ca72
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Thu Jul 25 13:45:21 2013 +0300
block: row: Remove warning massage from add_request
A regular priority queue is marked as "starved" if it skipped a dispatch
due to being empty. When a new request is added to a "starved" queue
it will be marked as urgent.
The removed WARN_ON was warning about a supposedly impossible case in
which a regular priority (read) queue was marked as starved but wasn't
empty. This is actually a possible case, as described below:
If the device driver fetched a read request that is pending for
transmission and an URGENT request arrives, the fetched read will be
reinserted back into the scheduler. It's possible that the queue it is
reinserted into was marked as "starved" in the meanwhile due to being empty.
CRs-fixed: 517800
Change-Id: Iaae642ea0ed9c817c41745b0e8ae2217cc684f0c
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
commit dca47e75f1413d58e4f97ef638e5d4456c55bdce
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Tue Jul 2 14:43:13 2013 +0300
block: row: change hrtimer_cancel to hrtimer_try_to_cancel
Calling hrtimer_cancel with interrupts disabled can result in a livelock.
When flushing the plug list in the block layer, interrupts are disabled and
an hrtimer is used when adding requests from that plug list to the scheduler.
In this code flow, if the hrtimer (which is used for idling) is set, it is
canceled by calling hrtimer_cancel. hrtimer_cancel will perform
the following in an endless loop:
1. try to cancel the timer
2. if that fails - rest the cpu
The cancellation can fail if the timer function has already started. Since
interrupts are disabled, it can never complete.
This patch reduces the number of times the hrtimer lock is taken while
interrupts are disabled by calling hrtimer_try_to_cancel instead. The latter
will try to cancel the timer just once and return with an error code if it
fails.
CRs-fixed: 499887
Change-Id: I25f79c357426d72ad67c261ce7cb503ae97dc7b9
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
commit a6047b9d808eaa787e4df3107bea7536334856cd
Author: Lee Susman <lsusman@codeaurora.org>
Date: Sun Jun 23 16:27:40 2013 +0300
block: row-iosched idling triggered by readahead pages
In the current implementation idling is triggered only by request
insertion frequency. This heuristic is not very accurate and may hit
random requests that shouldn't trigger idling. This patch uses the
PG_readahead flag in struct page's flags, which indicates that the page
is part of a readahead window, to start idling upon dispatch of a request
associated with a readahead page.
The above readahead flag is used together with the existing
insertion-frequency trigger. The frequency timer will catch read requests
which are not part of a readahead window, but are still part of a
sequential stream (and therefore dispatched in small time intervals).
Change-Id: Icb7145199c007408de3f267645ccb842e051fd00
Signed-off-by: Lee Susman <lsusman@codeaurora.org>
commit e70e4e8e1d1f111023dd2b2d0fc9237240cab9ab
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Wed May 1 14:35:20 2013 +0300
block: urgent: Fix dispatching of URGENT mechanism
There are cases when blk_peek_request() is called not from
blk_fetch_request(), and thus an URGENT request may be started but the
flag q->dispatched_urgent is not updated.
Change-Id: I4fb588823f1b2949160cbd3907f4729767932e12
CRs-fixed: 471736
CRs-fixed: 473036
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
commit 0e36870f6a436840eed1782d0e85b4adb300b59f
Author: Maya Erez <merez@codeaurora.org>
Date: Sun Apr 14 15:19:52 2013 +0300
block: row: Fix starvation tolerance values
The current starvation tolerance values increase the boot time
since high priority SW requests are delayed by regular priority requests.
In order to overcome this, increase the starvation tolerance values.
Change-Id: I9947fca9927cbd39a1d41d4bd87069df679d3103
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
Signed-off-by: Maya Erez <merez@codeaurora.org>
commit 3cab8d28e735fdad300eda3bed703129ba05d70a
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Thu Apr 11 14:57:15 2013 +0300
block: urgent request: Update dispatch_urgent in case of requeue/reinsert
The block layer implements a mechanism for verifying that the device
driver won't be notified of an URGENT request if there is already an
URGENT request in flight. This is due to the fact that interrupting an
URGENT request isn't efficient.
This patch fixes the above described mechanism in case the URGENT request
was returned back to the block layer for some reason: by requeue or
reinsert.
CRs-fixed: 473376, 473036, 471736
Change-Id: Ie8b8208230a302d4526068531616984825f1050d
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
commit e052e4574bb928b44e660b9679d23e14011b0b9d
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Thu Mar 21 11:04:02 2013 +0200
block: row: Update sysfs functions
All ROW (time related) configurable parameters are stored in ms so there
is no need to convert from/to ms when reading/updating them via sysfs.
Change-Id: Ib6a1de54140b5d25696743da944c076dd6fc02ae
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
Conflicts:
block/row-iosched.c
commit 2c3203650c2109c18abb3b17a5114d54bb22e683
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Thu Mar 21 13:02:07 2013 +0200
block: row: Prevent starvation of regular priority by high priority
At the moment all REGULAR and LOW priority requests are starved as long as
there are HIGH priority requests to dispatch.
This patch prevents the above starvation by setting a starvation limit the
REGULAR\LOW priority requests can tolerate.
Change-Id: Ibe24207982c2c55d75c0b0230f67e013d1106017
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
commit a5434f618d395a03fe19ef430a8c5747bad069f9
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Tue Mar 12 21:02:33 2013 +0200
block: urgent request: remove unnecessary urgent marking
An urgent request is marked by the scheduler in rq->cmd_flags with the
REQ_URGENT flag. There is no need to add an additional marking by
the block layer.
Change-Id: I05d5e9539d2f6c1bfa80240b0671db197a5d3b3f
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
commit 3928fb74c2f78578c57913938644acb704b77586
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Tue Mar 12 21:17:18 2013 +0200
block: row: Re-design urgent request notification mechanism
When ROW scheduler reports to the block layer that there is an urgent
request pending, the device driver may decide to stop the transmission
of the current request in order to handle the urgent one. This is done
in order to reduce the latency of an urgent request. For example:
long WRITE may be stopped to handle an urgent READ.
This patch updates the ROW URGENT notification policy to apply with the
below:
- Don't notify URGENT if there is an un-completed URGENT request in the
driver.
- After notifying that an URGENT request is present, the next request
dispatched is the URGENT one.
- At any given moment only 1 request can be marked as URGENT,
independent of its location (driver or scheduler).
Other changes to URGENT policy:
- Only READ queues are allowed to notify of an URGENT request pending.
CR fix:
If a pending urgent request (A) gets merged with another request (B)
A is removed from scheduler queue but is not removed from
rd->pending_urgent_rq.
CRs-Fixed: 453712
Change-Id: I321e8cf58e12a05b82edd2a03f52fcce7bc9a900
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
commit 8912aa92e3d919ceabc72b2eddc829fc5e4bd7eb
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Thu Jan 24 16:17:27 2013 +0200
block: row: Update initial values of ROW data structures
This patch sets the initial values of internal ROW
parameters.
Change-Id: I38132062a7fcbe2e58b9cc757e55caac64d013dc
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
[smuckle@codeaurora.org: ported from msm-3.7]
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
commit b709e1a8a56784cb83c2c31a4e7df574a6b29802
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Thu Jan 24 15:08:40 2013 +0200
block: row: Don't notify URGENT if there are un-completed urgent req
When ROW scheduler reports to the block layer that there is an urgent
request pending, the device driver may decide to stop the transmission
of the current request in order to handle the urgent one. If the current
transmitted request is an urgent request - we don't want it to be
stopped.
Due to the above ROW scheduler won't notify of an urgent request if
there are urgent requests in flight.
Change-Id: I2fa186d911b908ec7611682b378b9cdc48637ac7
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
commit eba966603cc8e6f8fb418bf702f5a6eca5f56f34
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Thu Jan 24 04:01:59 2013 +0200
block: add REQ_URGENT to request flags
This patch adds a new flag to be used in the cmd_flags field of struct
request for marking a request as urgent.
An urgent request is one that should be given priority over the
currently handled (regular) request by the device driver. The decision
on a request's urgency is taken by the scheduler.
Change-Id: Ic20470987ef23410f1d0324f96f00578f7df8717
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
Conflicts:
include/linux/blk_types.h
commit 7c865ab1a9ae626d023d0b03ed7fbe5c57bcbe7c
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Thu Jan 17 20:56:07 2013 +0200
block: row: Idling mechanism re-factoring
At the moment idling in ROW is implemented by a delayed work item that
uses jiffies granularity, which is not very accurate. This patch
replaces the current idling mechanism implementation with the hrtimer
API, which gives nanosecond resolution (instead of jiffies).
Change-Id: I86c7b1776d035e1d81571894b300228c8b8f2d92
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
commit 72ea1d39c04734bf5eb52117968704148d2da42f
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Wed Jan 23 17:15:49 2013 +0200
block: row: Dispatch requests according to their io-priority
This patch implements "application-hints" which is a way the issuing
application can notify the scheduler on the priority of its request.
This is done by setting the io-priority of the request.
This patch reuses an already existing mechanism of io-priorities developed
for CFQ. Please refer to kernel/Documentation/block/ioprio.txt for
usage example and explanations.
Change-Id: I228ec8e52161b424242bb7bb133418dc8b73925a
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
commit 9f8f3d2757788477656b1d25a3055ae11d97cee4
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Sat Jan 12 16:23:18 2013 +0200
block: row: Aggregate row_queue parameters to one structure
Each ROW queue has several parameters whose default values are defined
in separate arrays. This patch aggregates all the default values into
one array.
The values in question are:
- is idling enabled for the queue
- queue quantum
- can the queue notify on urgent request
Change-Id: I3821b0a042542295069b340406a16b1000873ec6
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
commit d84ad45f3077661cab5984cd2fb7d5ef2ff06e39
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Sat Jan 12 16:21:47 2013 +0200
block: row: fix sysfs functions - idle_time conversion
idle_time was updated to be stored in msec instead of jiffies, so there
is no need to convert the value when reading it from the user or
displaying it.
Change-Id: I58e074b204e90a90536d32199ac668112966e9cf
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
commit 202b21e9daf7b8a097f97f764bb4ad4712c75fa7
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Sat Jan 12 16:21:12 2013 +0200
block: row: Insert dispatch_quantum into struct row_queue
There is really no point in keeping the dispatch quantum
of a queue outside of it. By inserting it into the row_queue
structure we spare an extra level of indirection when accessing it.
Change-Id: Ic77571818b643e71f9aafbb2ca93d0a92158b199
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
commit 58ca84f091faa6ff8c4f567b158be5d38f9a5c58
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Sun Jan 13 22:04:59 2013 +0200
block: row: Add some debug information on ROW queues
1. Add a counter for the number of requests on the queue.
2. Add a function to print queue status (the number of requests
currently on the queue and the number already dispatched in the
current dispatch cycle).
Change-Id: I1e98b9ca33853e6e6a8ddc53240f6cd6981e6024
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
commit 1bbb2c7ada5a647cab1f2306458d6cf9b821ddf7
Author: Subhash Jadavani <subhashj@codeaurora.org>
Date: Thu Jan 10 02:15:13 2013 +0530
block: blk-merge: don't merge the pages with non-contiguous descriptors
blk_rq_map_sg() merges physically contiguous pages into the same
scatter-gather node without checking whether their page descriptors are
contiguous.
Now when dma_map_sg() is called on the scatter-gather list, it takes
the base page pointer from each node (one by one) and iterates through
all of the pages in the same sg node by incrementing the base page
pointer, with the assumption that physically contiguous pages have
contiguous page descriptor addresses. That assumption may not hold if
SPARSEMEM is enabled, so we may end up referring to an invalid page
descriptor.
Following table shows the example of physically contiguous pages but
their page descriptor addresses non-contiguous.
-------------------------------------------
| Page Descriptor | Physical Address |
------------------------------------------
| 0xc1e43fdc | 0xdffff000 |
| 0xc2052000 | 0xe0000000 |
-------------------------------------------
With this patch, the relevant blk-merge functions will also check
whether physically contiguous pages have contiguous page descriptor
addresses. If not, these pages are separated into different
scatter-gather nodes.
CRs-Fixed: 392141
Change-Id: I3601565e5569a69f06fb3af99061c4d4c23af241
Signed-off-by: Subhash Jadavani <subhashj@codeaurora.org>
Conflicts:
block/blk-merge.c
commit 9a9b428480c932ef8434d8b9bd3b7bafdcac3f84
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Thu Dec 20 19:23:58 2012 +0200
row: Add support for urgent request handling
This patch adds support for handling urgent requests.
A ROW queue can be marked as "urgent" so that if it was un-served in the
last dispatch cycle and a request was added to it - it will trigger
issuing an urgent-request notification to the block device driver.
The block device driver may choose to stop the transmission of the
current ongoing request to handle the urgent one. For example: a long
WRITE may be stopped to handle an urgent READ. This decreases READ latency.
Change-Id: I84954c13f5e3b1b5caeadc9fe1f9aa21208cb35e
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
commit 8d5ec526b7e70307d3c4ce587b714349f44c0be8
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Thu Dec 6 13:17:19 2012 +0200
block:row: fix idling mechanism in ROW
This patch addresses the following issues found in the ROW idling
mechanism:
1. Fix the delay passed to queue_delayed_work (pass actual delay
and not the time when to start the work)
2. Change the idle time and the idling-trigger frequency to be
HZ dependent (instead of using msec_to_jiffies())
3. Destroy idle_workqueue() in queue_exit
Change-Id: If86513ad6b4be44fb7a860f29bd2127197d8d5bf
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
Conflicts:
block/row-iosched.c
commit c26a95811462b9ba8eca23b4ba2150e7b660ca40
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Tue Oct 30 08:33:06 2012 +0200
row: Adding support for reinsert already dispatched req
Add support for reinserting an already dispatched request back into the
scheduler's internal data structures.
The request will be reinserted back to the queue (head) it was
dispatched from as if it was never dispatched.
Change-Id: I70954df300774409c25b5821465fb3aa33d8feb5
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
commit a1a6f09cae0149d935bcea3f20d4acb6556d68f9
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Tue Dec 4 16:04:15 2012 +0200
block: Add API for urgent request handling
This patch adds support in the block & elevator layers for handling
urgent requests. The decision if a request is urgent or not is taken
by the scheduler. Urgent request notification is passed to the underlying
block device driver (eMMC for example). Block device driver may decide to
interrupt the currently running low priority request to serve the new
urgent request. By doing so READ latency is greatly reduced in read&write
collision scenarios.
Note that if the current scheduler doesn't implement the urgent request
mechanism, this code path is never activated.
Change-Id: I8aa74b9b45c0d3a2221bd4e82ea76eb4103e7cfa
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
Conflicts:
block/blk-core.c
commit 4e907d9d6079629d6ce61fbdfb1a629d3587e176
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Tue Dec 4 15:54:43 2012 +0200
block: Add support for reinsert a dispatched req
Add support for reinserting a dispatched request back to the
scheduler's internal data structures.
This capability is used by the device driver when it chooses to
interrupt the current request transmission and execute another (more
urgent) pending request. For example: interrupting long write in order
to handle pending read. The device driver re-inserts the
remaining write request back to the scheduler, to be rescheduled
for transmission later on.
Add API for verifying whether the current scheduler
supports reinserting requests mechanism. If reinsert mechanism isn't
supported by the scheduler, this code path will never be activated.
Change-Id: I5c982a66b651ebf544aae60063ac8a340d79e67f
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
commit 0675c27faab797f7149893b84cc357aadb37c697
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Mon Oct 15 20:56:02 2012 +0200
block: ROW: Fix forced dispatch
This patch fixes forced dispatch in the ROW scheduling algorithm.
When the dispatch function is called with the forced flag on, we
can't delay the dispatch of the requests that are in scheduler queues.
Thus, when dispatch is called with forced turned on, we need to cancel
idling, or not to idle at all.
Change-Id: I3aa0da33ad7b59c0731c696f1392b48525b52ddc
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
commit ce6acf59662d1bbe5663a64aef9fe1695b8bbe1b
Author: Tatyana Brokhman <tlinder@codeaurora.org>
Date: Thu Sep 20 10:46:10 2012 +0300
block: Adding ROW scheduling algorithm
This patch adds the implementation of a new scheduling algorithm - ROW.
The policy of this algorithm is to prioritize READ requests over WRITE
as much as possible without starving the WRITE requests.
Change-Id: I4ed52ea21d43b0e7c0769b2599779a3d3869c519
Signed-off-by: Tatyana Brokhman <tlinder@codeaurora.org>
Signed-off-by: Tkkg1994 <luca.grifo@outlook.com>
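For the hrtimer_cancel livelock fix in the middle of this squash, the
change reduces to something like the following (hedged sketch; the
rd->idle_timer field name is illustrative):

    /* With interrupts disabled, hrtimer_cancel() would spin forever if
     * the timer callback has already started; try once and bail instead. */
    if (hrtimer_try_to_cancel(&rd->idle_timer) < 0)
            return;     /* callback running: skip, don't livelock */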
|
bfq maintains a 'next-in-service' cache to prevent expensive lookups in
the hot path. However, the cache sometimes becomes inconsistent and
triggers a BUG:
[44042.622839] -(3)[154:mmcqd/0]BUG: failure at ../../../../../../kernel/cyanogen/mt6735/block/bfq-sched.c:72/bfq_check_next_in_service()!
[44042.622858] -(3)[154:mmcqd/0]Unable to handle kernel paging request at virtual address 0000dead
[44042.622866] -(3)[154:mmcqd/0]pgd = ffffffc001361000
[44042.622872] [0000dead] *pgd=000000007d816003, *pud=000000007d816003, *pmd=000000007d817003, *pte=0000000000000000
[44042.622890] -(3)[154:mmcqd/0]Internal error: Oops: 96000045 [#1] PREEMPT SMP
[44042.622907] -(3)[154:mmcqd/0]CPU: 3 PID: 154 Comm: mmcqd/0 Tainted:
[44042.622915] -(3)[154:mmcqd/0]Hardware name: MT6735 (DT)
[44042.622922] -(3)[154:mmcqd/0]task: ffffffc0378a6000 ti: ffffffc0378c4000
[44042.622936] -(3)[154:mmcqd/0]PC is at bfq_dispatch_requests+0x6c4/0x9bc
[44042.622944] -(3)[154:mmcqd/0]LR is at bfq_dispatch_requests+0x6bc/0x9bc
[44042.622952] -(3)[154:mmcqd/0]pc : [<ffffffc000306a68>] lr : [<ffffffc000306a60>] pstate: 800001c5
[44042.622958] -(3)[154:mmcqd/0]sp : ffffffc0378c7d30
[44042.622962] x29: ffffffc0378c7d30 x28: 0000000000000000
[44042.622972] x27: 0000000000000000 x26: ffffffc006c58810
[44042.622981] x25: ffffffc037f89820 x24: ffffffc000f14000
[44042.622990] x23: ffffffc036adb088 x22: ffffffc0369b2800
[44042.623000] x21: ffffffc036adb098 x20: ffffffc01d6a3b60
[44042.623009] x19: ffffffc036adb0c8 x18: 0000007f8cfa1500
[44042.623018] x17: 0000007f8db44f40 x16: ffffffc00012d0c0
[44042.623027] x15: 0000007f8dde04d8 x14: 676f6e6179632f6c
[44042.623037] x13: 656e72656b2f2e2e x12: 2f2e2e2f2e2e2f2e
[44042.623046] x11: 2e2f2e2e2f2e2e20 x10: 7461206572756c69
[44042.623055] x9 : 6166203a4755425d x8 : 00000000001f0cc5
[44042.623064] x7 : ffffffc000f3d5a0 x6 : 000000000000008b
[44042.623073] x5 : 0000000000000000 x4 : 0000000000000004
[44042.623082] x3 : 0000000000000002 x2 : 0000000000000001
[44042.623091] x1 : 0000000000000aee x0 : 000000000000dead
This patch makes the lookup resilient to cache inconsistencies by doing
the expensive recomputation in cases where the bug would otherwise be
triggered.
Ticket: PORRDIGE-527
Change-Id: I5dd701960057983a42d3d3bd57521e8d17c03d7f
|
Currently, when the last reference of a blkcg_gq is put, all the
release operations sans the actual freeing happen directly in
blkg_put(). As blkg_put() may be called under queue_lock, all
pd_exit_fn()s may be too. This makes it impossible for pd_exit_fn()s
to use del_timer_sync() on timers which grab the queue_lock, which is
an irq-safe lock, due to the deadlock possibility described in the
comment on top of del_timer_sync().
This can be easily avoided by performing the release operations in the
RCU callback instead of directly from blkg_put(). This patch moves
the blkcg_gq release operations to the RCU callback.
As this leaves __blkg_release() with only call_rcu() invocation,
blkg_rcu_free() is renamed to __blkg_release_rcu(), exported and
call_rcu() invocation is now done directly from blkg_put() instead of
going through __blkg_release() which is removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
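After the change, blkg_put() reduces to scheduling the RCU callback
directly (sketch per the description above; refcounting details vary by
version):

    void blkg_put(struct blkcg_gq *blkg)
    {
            lockdep_assert_held(blkg->q->queue_lock);
            WARN_ON_ONCE(blkg->refcnt <= 0);
            if (!--blkg->refcnt)
                    call_rcu(&blkg->rcu_head, __blkg_release_rcu);
    }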
|
Currently, when creating a new blkcg_gq, each policy's pd_init_fn() is
invoked in blkg_alloc() before the parent is linked. This makes it
difficult for policies to perform initializations which are dependent
on the parent.
This patch moves pd_init_fn() invocations to blkg_create() after the
parent blkg is linked where the new blkg is fully initialized. As
this means that blkg_free() can't assume that pd's are initialized,
pd_exit_fn() invocations are moved to __blkg_release(). This
guarantees that pd_exit_fn() is also invoked with fully initialized
blkgs with valid parent pointers.
This will help implementing hierarchy support in blk-throttle.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
This will be used by blk-throttle hierarchy support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
blk-throttle hierarchy support will make use of it. Move
blkg_for_each_descendant_pre() from block/blk-cgroup.c to
block/blk-cgroup.h.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
In blkg_create(), after the lookup of the parent fails, control jumps
to the error path with the error code encoded into @blkg; however, the
error path doesn't use @blkg for the return value, it returns
ERR_PTR(ret). Make the lookup-failure path set @ret instead of @blkg.
Note that the parent lookup is guaranteed to succeed at that point, and
the condition check is purely for sanity; it triggers a WARN when it
fails. As such, I don't think it's necessary to mark this for -stable.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
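The fix in blkg_create()'s lookup-failure branch (sketch; the error
label name is illustrative):

    if (WARN_ON_ONCE(!blkg_lookup(blkcg_parent(blkcg), q))) {
            ret = -EINVAL;          /* was: blkg = ERR_PTR(-EINVAL); */
            goto err_put_css;
    }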
|
With the recent updates, blk-throttle is finally ready for proper
hierarchy support. Dispatching now honors service_queue->parent_sq
and propagates correctly. The only thing missing is setting
->parent_sq correctly so that throtl_grp hierarchy matches the cgroup
hierarchy.
This patch updates throtl_pd_init() such that service_queues form the
same hierarchy as the cgroup hierarchy if sane_behavior is enabled.
As this concludes proper hierarchy support for blkcg, the shameful
.broken_hierarchy tag is removed from blkio_subsys.
v2: Updated blkio-controller.txt as suggested by Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
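The gist in throtl_pd_init() (hedged sketch following the blkcg API of
the time):

    /* Make the throtl_grp tree mirror the cgroup tree, but only under
     * sane_behavior, since the !sane hierarchy stays flat. */
    sq->parent_sq = &td->service_queue;
    if (cgroup_sane_behavior(blkg->blkcg->css.cgroup) && blkg->parent)
            sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;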
|
blk_throtl_bio() has a quick exit path for throtl_grps without limits
configured. It looks at the bps and iops limits and if both are not
configured, the bio is issued immediately. While this is correct in
the current flat hierarchy as each throtl_grp behaves completely
independently, it would become wrong in proper hierarchy mode. A
group without any limits could still be limited by one of its
ancestors and bio's queued for such group should not bypass
blk-throtl.
As having a quick bypass mechanism is beneficial, this patch
reimplements the mechanism such that it's correct even with proper
hierarchy. throtl_grp->has_rules[] is added. These booleans are
updated for the whole subtree whenever a config is updated so that
has_rules[] of the whole subtree stays synchronized. They're also
updated when a new throtl_grp comes online so that it can't escape the
limits of its ancestors.
As no throtl_grp has another throtl_grp as parent now, this patch
doesn't yet make any behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
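The helper that keeps the booleans in sync looks roughly like this
(sketch; -1 denotes "no limit configured"):

    static void tg_update_has_rules(struct throtl_grp *tg)
    {
            struct throtl_grp *parent_tg = sq_to_tg(tg->service_queue.parent_sq);
            int rw;

            for (rw = READ; rw <= WRITE; rw++)
                    tg->has_rules[rw] =
                            (parent_tg && parent_tg->has_rules[rw]) ||
                            tg->bps[rw] != -1 || tg->iops[rw] != -1;
    }

It is called for the whole subtree on config updates and for a new
throtl_grp when it comes online.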
|
With the planned proper hierarchy support, a bio will climb up the
tree before actually being dispatched. This makes sure the bio is also
subjected to the parent's throttling limits, if any.
It might happen that the parent is idle and when the bio is transferred
to the parent, a new slice starts fresh. But that is incorrect, as the
parent's wait time should have started when the bio was queued in the
child group; starting fresh causes IOs to be throttled more than
configured as they climb the hierarchy.
Given the fact that we have not written the hierarchical algorithm in a
way where the child's and parent's time slices are synchronized, we
transfer the child's start time to the parent if the parent was idling.
If the parent was busy doing dispatch of other bios all this while,
this is not an issue.
The child's slice start time is passed to the parent. The parent looks
at its last expired slice start time. If the child's start time is
after the parent's old start time, that means the parent had been idle,
and after the parent went idle the child had an IO queued. So use the
child's start time as the parent's start time.
If the parent's start time is after the child's start time, that means
that when the IO got queued in the child group the parent was not idle;
but later it dispatched some IO, its slice got trimmed, and then it
went idle. After a while the child's request got shifted into the
parent group. In this case use the parent's old start time as the new
start time, as that's the duration of slice we did not use.
This logic is far from perfect: if there are multiple children, the
first child transferring a bio decides the start time, while a bio
might have been queued even earlier in another child that is yet to be
transferred up to the parent. In that case we will lose time and
bandwidth in the parent. This patch is just an approximation to make
the situation somewhat better.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
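A hedged sketch of the helper this suggests (field names per
blk-throttle.c of that era; "credit" is the already-elapsed wait):

    static inline void
    throtl_start_new_slice_with_credit(struct throtl_grp *tg, bool rw,
                                       unsigned long start)
    {
            tg->bytes_disp[rw] = 0;
            tg->io_disp[rw] = 0;
            /* If the child's start is after our own last start, we were
             * idle and inherit the child's start; otherwise keep ours. */
            if (time_after_eq(start, tg->slice_start[rw]))
                    tg->slice_start[rw] = start;
            tg->slice_end[rw] = jiffies + throtl_slice;
    }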
|
With a flat hierarchy, there's only a single level of dispatching
happening, and fairness beyond that point is the responsibility of the
rest of the block layer and driver, which usually works out okay;
however, with the planned hierarchy support,
service_queue->bio_lists[] can be filled up by bios from a single
source. While the limits would still be honored, it'd be very easy to
starve IOs from siblings or children.
To avoid such starvation, this patch implements throtl_qnode and
converts service_queue->bio_lists[] to lists of per-source qnodes
which in turn contains the bio's. For example, when a bio is
dispatched from a child group, the bio doesn't get queued on
->bio_lists[] directly but it first gets queued on the group's qnode
which in turn gets queued on service_queue->queued[]. When
dispatching for the upper level, the ->queued[] list is consumed in
round-robin order so that the dispatch window is consumed fairly by
all IO sources.
There are two ways a bio can come to a throtl_grp - directly queued to
the group or dispatched from a child. For the former
throtl_grp->qnode_on_self[rw] is used. For the latter, the child's
->qnode_on_parent[rw].
Note that this means that the child which is contributing a bio to its
parent should stay pinned until all its bios are dispatched to its
grand-parent. This patch moves blkg refcnting from bio add/remove
spots to qnode activation/deactivation so that the blkg containing an
active qnode is always pinned. As child pins the parent, this is
sufficient for keeping the relevant sub-tree pinned while bios are in
flight.
The starvation issue was spotted by Vivek Goyal.
v2: The original patch used the same throtl_grp->qnode_on_self/parent
for reads and writes causing RWs to be queued incorrectly if there
already are outstanding IOs in the other direction. They should
be throtl_grp->qnode_on_self/parent[2] so that READs and WRITEs
can use different qnodes. Spotted by Vivek Goyal.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
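The qnode itself is small (sketch matching the description above):

    struct throtl_qnode {
            struct list_head        node;   /* membership on ->queued[rw] */
            struct bio_list         bios;   /* queued bios from one source */
            struct throtl_grp       *tg;    /* tg this qnode belongs to;
                                             * pinned while qnode is active */
    };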
|
throtl_pending_timer_fn() currently assumes that the parent_sq is the
top level one and the bio's dispatched are ready to be issued;
however, this assumption will be wrong with proper hierarchy support.
This patch makes the following changes to make
throtl_pending_timer_fn() ready for hierarchy:
* If the parent_sq isn't the top-level one, update the parent
throtl_grp's dispatch time and schedule the next dispatch as
necessary. If the parent's dispatch time is now, repeat the
function for the parent throtl_grp.
* If the parent_sq is the top-level one, kick issue work_item as
before.
* The debug message printed by throtl_log() now prints out the
service_queue's nr_queued[] instead of the total nr_queued as the
latter becomes uninteresting and misleading with hierarchical
dispatch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
tg_dispatch_one_bio() currently assumes that the parent_sq is the top
level one and the bio being dispatched is ready to be issued; however,
this assumption will be wrong with proper hierarchy support. This
patch makes the following changes to make tg_dispatch_one_bio() ready
for hierarchy.
* throtl_data->nr_queued[] is incremented in blk_throtl_bio() instead
of throtl_add_bio_tg() so that throtl_add_bio_tg() can be used to
transfer a bio from a child tg to its parent.
* tg_dispatch_one_bio() is updated to distinguish whether its parent
is another throtl_grp or the throtl_data. If former, the bio is
transferred to the parent throtl_grp using throtl_add_bio_tg(). If
latter, the bio is ready to be issued and put on the top-level
service_queue's bio_lists[] and throtl_data->nr_queued is
decremented.
As all throtl_grps currently have the top level service_queue as their
->parent_sq, this patch in itself doesn't make any behavior
difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
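The resulting branch in tg_dispatch_one_bio() (hedged sketch; list
names per the pre-qnode code this patch modifies):

    if (parent_tg) {
            /* parent is another throtl_grp: transfer the bio upwards */
            throtl_add_bio_tg(bio, parent_tg);
    } else {
            /* parent is the throtl_data: the bio is ready to be issued */
            bio_list_add(&parent_sq->bio_lists[rw], bio);
            BUG_ON(tg->td->nr_queued[rw] <= 0);
            tg->td->nr_queued[rw]--;
    }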
|
Currently, blk_throtl_bio() issues the passed in bio directly if it's
within limits of its associated tg (throtl_grp). This behavior
becomes incorrect with hierarchy support as the bio should be
accounted to and throttled by the ancestor throtl_grps too.
This patch makes the direct issue path of blk_throtl_bio() to loop
until it reaches the top-level service_queue or gets throttled. If
the former, the bio can be issued directly; otherwise, it gets queued
at the first layer it was above limits.
As tg->parent_sq is always the top-level service queue currently, this
patch in itself doesn't make any behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
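The direct-issue path becomes a climb loop (hedged sketch of
blk_throtl_bio(); label name illustrative):

    while (true) {
            if (!tg_may_dispatch(tg, bio, NULL))
                    break;          /* over limit: queue at this level */

            /* within limits: charge this level and climb one step */
            throtl_charge_bio(tg, bio);
            sq = sq->parent_sq;
            tg = sq_to_tg(sq);
            if (!tg)
                    goto out_unlock;        /* reached the top: issue bio */
    }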
|
The current blk_throtl_drain() assumes that all active throtl_grps are
queued on throtl_data->service_queue, which won't be true once
hierarchy support is implemented.
This patch makes blk_throtl_drain() perform post-order walk of the
blkg hierarchy draining each associated throtl_grp, which guarantees
that all bios will eventually be pushed to the top-level service_queue
in throtl_data.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
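The walk itself (sketch; the iteration cursor type of
blkg_for_each_descendant_post() varies by kernel version):

    /* Drain every group bottom-up, then issue what reached the top. */
    blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg)
            tg_drain_bios(&blkg_to_tg(blkg)->service_queue);
    tg_drain_bios(&td->service_queue);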
|
Currently, blk_throtl_dispatch_work_fn() is responsible for both
dispatching bio's from throtl_grp's according to their limits and then
issuing the dispatched bios.
This patch moves the dispatch part to throtl_pending_timer_fn() so
that the work item is kicked iff there are bio's to issue. This is to
avoid work item execution at each step when hierarchy support is
enabled. bio's will be dispatched towards the top-level service_queue
from the timers at each layer and the work item will only be used to
issue the bio's which reached the top-level service_queue.
While fetching bio's to issue from bio_lists[],
blk_throtl_dispatch_work_fn() fetches all READs before WRITEs. While
the original code also dispatched READs first, if multiple throtl_grps
are dispatched on the same run, WRITEs from throtl_grp which is
dispatched first would precede READs from throtl_grps which are
dispatched later. While this is a behavior change, given that the
previous code already prioritized READs and block layer generally
prioritizes and segregates READs from WRITEs, this isn't likely to
make any noticeable differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
throtl_select_dispatch() only dispatches throtl_quantum bios on each
invocation. blk_throtl_dispatch_work_fn() in turn depends on
throtl_schedule_next_dispatch() scheduling the next dispatch window
immediately so that undue delays aren't incurred. This effectively
chains multiple dispatch work item executions back-to-back when there
are more than throtl_quantum bios to dispatch on a given tick.
There is no reason to finish the current work item just to repeat it
immediately. This patch makes throtl_schedule_next_dispatch() return
%false without doing anything if the current dispatch window is still
open, and updates blk_throtl_dispatch_work_fn() to repeat dispatching
after cpu_relax() on a %false return.
This change will help implementing hierarchy support as dispatching
will be done from pending_timer and immediate reschedule of timer
function isn't supported and doesn't make much sense.
While this patch changes how dispatch behaves when there are more than
throtl_quantum bios to dispatch on a single tick, the behavior change
is immaterial.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
Currently, throtl_data->dispatch_work is a delayed_work item which
handles both delayed dispatch and issuing bios. The two tasks will be
separated to support proper hierarchy. To prepare for that, this
patch separates out the timer into throtl_service_queue->pending_timer
from throtl_data->dispatch_work and make the latter a work_struct.
* As the timer is now per-service_queue, it's initialized and
del_sync'd as its corresponding service_queue is created and
destroyed. The timer, when triggered, simply schedules
throtl_data->dispatch_work for execution.
* throtl_schedule_delayed_work() is renamed to
throtl_schedule_pending_timer() and takes @sq and @expires now.
* Similarly, throtl_schedule_next_dispatch() now takes @sq, which
should be the parent_sq of the service_queue which just got a new
bio or updated. As the parent_sq is always the top-level
service_queue now, this doesn't change anything at this point.
This patch doesn't introduce any behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
With proper hierarchy support, a bio can be dispatched multiple times
until it reaches the top-level service_queue and we don't want to
update dispatch stats at each step. They are local stats and will be
kept local. If recursive stats are necessary, they should be
implemented separately and definitely not by updating counters
recursively on each dispatch.
This patch moves the REQ_THROTTLED setting to throtl_charge_bio() and
gates the stats update with it, so that dispatch stats are updated only
the first time the bio is charged to a throtl_grp, which will always be
the throtl_grp the bio was originally queued to.
This means that REQ_THROTTLED would be set even for bios which don't
get throttled. As we don't want bios to leave blk-throtl with the
flag set, move the REQ_THROTTLED clearing to the end of blk_throtl_bio()
and clear it if the bio is being issued directly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
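The gate in throtl_charge_bio() (hedged sketch; bio field names per the
3.x block layer):

    if (!(bio->bi_rw & REQ_THROTTLED)) {
            bio->bi_rw |= REQ_THROTTLED;
            throtl_update_dispatch_stats(tg_to_blkg(tg),
                                         bio->bi_size, bio->bi_rw);
    }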
|
Now that both throtl_data and throtl_grp embed throtl_service_queue,
we can unify throtl_log() and throtl_log_tg().
* sq_to_tg() is added. This returns the throtl_grp a service_queue is
embedded in. If the service_queue is the top-level one embedded in
throtl_data, NULL is returned.
* sq_to_td() is added. A service_queue is always associated with a
throtl_data. This function finds the associated td and returns it.
* throtl_log() is updated to take throtl_service_queue instead of
throtl_data. If the service_queue is one embedded in throtl_grp, it
prints the same header as throtl_log_tg() did. If it's one embedded
in throtl_data, it behaves the same as before. This renders
throtl_log_tg() unnecessary. Removed.
This change is necessary for hierarchy support as we're gonna be using
the same code paths to dispatch bios to intermediate service_queues
embedded in throtl_grps and the top-level service_queue embedded in
throtl_data.
This patch doesn't make any behavior changes.
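For illustration, the two helpers could look roughly like this (a
simplified sketch, not the verbatim patch):
  static struct throtl_grp *sq_to_tg(struct throtl_service_queue *sq)
  {
          if (sq && sq->parent_sq)   /* embedded in a throtl_grp */
                  return container_of(sq, struct throtl_grp, service_queue);
          else
                  return NULL;       /* top-level sq of throtl_data */
  }

  static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
  {
          struct throtl_grp *tg = sq_to_tg(sq);

          if (tg)
                  return tg->td;
          return container_of(sq, struct throtl_data, service_queue);
  }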
v2: throtl_log() didn't print a space after blkg path. Updated so
that it prints a space after throtl_grp path. Spotted by Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
To prepare for hierarchy support, this patch adds
throtl_service_queue->parent_sq, which points to the parent
service_queue. Currently, for all service_queues embedded in
throtl_grps, it points to throtl_data->service_queue. As
throtl_data->service_queue doesn't have a parent, its parent_sq is set
to NULL.
There are a number of functions which take both throtl_grp *tg and
throtl_service_queue *parent_sq. With this patch, the parent
service_queue can be determined from @tg and the @parent_sq arguments
are removed.
This patch doesn't make any behavior differences.
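The linkage is just a back-pointer; in sketch form (fields abridged):
  struct throtl_service_queue {
          /* NULL for throtl_data->service_queue, the tree root */
          struct throtl_service_queue *parent_sq;
          /* ... pending tree and dispatch fields ... */
  };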
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When blk_throtl_bio() wants to queue a bio to a tg (throtl_grp), it
avoids invoking tg_update_disptime() and
throtl_schedule_next_dispatch() if the tg already has bios queued in
that direction. As a new bio is appended after the existing ones, it
can't change the tg's next dispatch time or the parent's dispatch
schedule.
This optimization is currently open coded in blk_throtl_bio().
Whether the target biolist was occupied was recorded in a local
variable and later used to skip the disptime update. This patch
generalizes it so that throtl_add_bio_tg() sets a new flag
THROTL_TG_WAS_EMPTY if the biolist was empty before the new bio was
added. tg_update_disptime() clears the flag automatically.
blk_throtl_bio() is updated to simply test the flag before updating
disptime.
This patch doesn't make any functional differences now but will enable
using the same optimization for recursive dispatch.
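In sketch form (signatures simplified; only the flag handling is
shown):
  /* in throtl_add_bio_tg() -- sq is &tg->service_queue,
   * rw is bio_data_dir(bio) */
  if (!sq->nr_queued[rw])
          tg->flags |= THROTL_TG_WAS_EMPTY;
  bio_list_add(&sq->bio_lists[rw], bio);
  sq->nr_queued[rw]++;

  /* in blk_throtl_bio(), after queueing */
  if (tg->flags & THROTL_TG_WAS_EMPTY) {
          tg_update_disptime(tg);   /* clears THROTL_TG_WAS_EMPTY */
          throtl_schedule_next_dispatch(sq->parent_sq);
  }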
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
throtl_service_queues will eventually form a tree which is anchored at
throtl_data->service_queue and queue bios will climb the tree to the
top service_queue to be executed.
This patch makes the dispatch paths in blk_throtl_dispatch_work_fn()
and blk_throtl_drain() dispatch bios to
throtl_data->service_queue.bio_lists[] instead of the on-stack
bio_lists. This lets the final dispatch to the top-level
service_queue share the same mechanism as dispatches through the rest
of the hierarchy.
As bios should be issued in a sleepable context,
blk_throtl_dispatch_work_fn() transfers all dispatched bios from the
service_queue bio_lists[] into an on-stack one before dropping
queue_lock and issuing the bios.
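The transfer is a standard drain-under-lock pattern; roughly (a
sketch, with the surrounding bookkeeping elided):
  struct bio_list bio_list_on_stack;
  struct bio *bio;
  int rw;

  bio_list_init(&bio_list_on_stack);

  spin_lock_irq(q->queue_lock);
  for (rw = READ; rw <= WRITE; rw++) {
          /* splice queued bios onto the stack, reinit the source */
          bio_list_merge(&bio_list_on_stack,
                         &td->service_queue.bio_lists[rw]);
          bio_list_init(&td->service_queue.bio_lists[rw]);
  }
  spin_unlock_irq(q->queue_lock);

  /* issue outside the lock, in a sleepable context */
  while ((bio = bio_list_pop(&bio_list_on_stack)))
          generic_make_request(bio);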
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
throtl_service_queues will eventually form a tree which is anchored at
throtl_data->service_queue and queue bios will climb the tree to the
top service_queue to be executed.
This patch moves bio_lists[] and nr_queued[] from throtl_grp to its
service_queue to prepare for that. As currently only the
throtl_data->service_queue is in use, this patch just ends up moving
throtl_grp->bio_lists[] and ->nr_queued[] to
throtl_grp->service_queue.bio_lists[] and ->nr_queued[] without making
any functional differences.
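After the move, the fields live in the service_queue; in sketch form
(abridged):
  struct throtl_service_queue {
          /* ... */
          struct bio_list bio_lists[2];  /* queued bios, READ/WRITE */
          unsigned int    nr_queued[2];  /* number of queued bios */
  };

  /* accesses change from tg->bio_lists[rw] to
   * tg->service_queue.bio_lists[rw] */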
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, there's single service_queue per queue -
throtl_data->service_queue. All active throtl_grp's are queued on the
queue and dispatched according to their limits. To support hierarchy,
this will be expanded such that active throtl_grp's form a tree
anchored at throtl_data->service_queue and chained through each
intermediate throtl_grp's service_queue.
This patch adds throtl_grp->service_queue to prepare for hierarchy
support. The initialization function - throtl_service_queue_init() -
is added and replaces the macro initializer. The newly added
tg->service_queue isn't used yet; following patches will make use of it.
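A sketch of the initializer and its call sites (simplified; only the
pending tree setup is shown):
  static void throtl_service_queue_init(struct throtl_service_queue *sq)
  {
          sq->pending_tree = RB_ROOT;
  }

  /* called wherever a td or tg is set up */
  throtl_service_queue_init(&td->service_queue);
  throtl_service_queue_init(&tg->service_queue);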
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
throtl_service_queue will be the building block of hierarchy support
and will form a tree. This patch updates its usages as arguments to
reduce confusion.
* When a service queue is used as the parent role - the host of the
rbtree - use @parent_sq instead of @sq.
* For functions taking both @tg and @parent_sq, reorder them so that
the order is (@tg, @parent_sq) not the other way around. This makes
the code follow the usual convention of specifying the primary
target of the operation as the first argument.
This patch doesn't make any functional differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
throtl_service_queue will be used as the basic block to implement
hierarchy support. Pass around throtl_service_queue *sq instead of
throtl_data *td in the following functions which will be used across
multiple levels of hierarchy.
* [__]throtl_enqueue/dequeue_tg()
* throtl_add_bio_tg()
* tg_update_disptime()
* throtl_select_dispatch()
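The shape of the change, illustrated on one of them (not the exact
diff; the argument order shown is an assumption):
  -static void throtl_enqueue_tg(struct throtl_data *td,
  -                              struct throtl_grp *tg)
  +static void throtl_enqueue_tg(struct throtl_service_queue *sq,
  +                              struct throtl_grp *tg)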
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add throtl_grp->td so that the td (throtl_data) a given tg
(throtl_grp) belongs to can be determined, and remove @td argument
from functions which take both @td and @tg as the former now can be
determined from the latter.
This generally simplifies the code and removes a number of cases where
@td is passed as an argument without being actually used. This will
also help hierarchy support implementation.
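In sketch form (abridged):
  struct throtl_grp {
          /* ... */
          struct throtl_data *td;  /* throtl_data this group belongs to */
  };

  /* callers shrink from f(td, tg, ...) to f(tg, ...), with the td
   * recovered inside as tg->td where actually needed */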
While at it, in multi-line conditions, move the logical operators
that begin continuation lines to the end of the previous line.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
|
|
|
|
|
|
|
|
|
|
| |
blk-throttle is still using function-defining macros to define flag
handling functions, which went out of style at least a decade ago.
Just define the flag as bitmask and use direct bit operations.
This patch doesn't make any functional changes.
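A sketch of the replacement style (the flag name and value are
illustrative):
  enum tg_state_flags {
          THROTL_TG_PENDING = 1 << 0,  /* on parent's pending tree */
  };

  /* set, test and clear with direct bit operations instead of
   * macro-generated helper functions */
  tg->flags |= THROTL_TG_PENDING;
  if (tg->flags & THROTL_TG_PENDING)
          tg->flags &= ~THROTL_TG_PENDING;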
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
throtl_rb_root will be expanded to cover more roles for hierarchy
support. Rename it to throtl_service_queue and make its fields more
descriptive.
* rb -> pending_tree
* left -> first_pending
* count -> nr_pending
* min_disptime -> first_pending_disptime
This patch is purely cosmetic.
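The renamed structure, in sketch form (types are assumptions based on
the old throtl_rb_root):
  struct throtl_service_queue {
          struct rb_root   pending_tree;           /* was: rb           */
          struct rb_node  *first_pending;          /* was: left         */
          unsigned int     nr_pending;             /* was: count        */
          unsigned long    first_pending_disptime; /* was: min_disptime */
  };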
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
throtl_nr_queued() is used in several places to avoid performing
certain operations when the throtl_data is empty. This is usually
pointless, as those paths aren't traveled when no bio is queued.
* throtl_schedule_delayed_work() skips scheduling dispatch work item
if @td doesn't have any bios queued; however, the only case it can
be called when @td is empty is from tg_set_conf() which isn't
something we should be optimizing for.
* throtl_schedule_next_dispatch() takes a quick exit if @td is empty;
however, right after that it triggers BUG if the service tree is
empty. The two conditions are equivalent and it can just test
@st->count for the quick exit.
* blk_throtl_dispatch_work_fn() skips dispatch if @td is empty. This
work function isn't usually invoked when @td is empty. The only
possibility is from tg_set_conf() and when it happens the normal
dispatching path can handle empty @td fine. No need to add special
skip path.
This patch removes the above three unnecessary optimizations, which
leaves the throtl_log() call in blk_throtl_dispatch_work_fn() as the
only user of throtl_nr_queued(). Remove throtl_nr_queued() and open
code it in
throtl_log(). I don't think we need td->nr_queued[] at all. Maybe we
can remove it later.
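The open-coded form is simply the sum of the two counters
(illustrative):
  /* in throtl_log(): throtl_nr_queued(td) becomes */
  td->nr_queued[READ] + td->nr_queued[WRITE]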
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
|
|
|
|
|
|
|
|
| |
Move throtl_schedule_delayed_work() above its first user so that the
forward declaration can be removed.
This patch is pure relocation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
blk-throttle is about to go through major restructuring to support
hierarchy. Do cosmetic updates in preparation.
* s/throtl_data->throtl_work/throtl_data->dispatch_work/
* s/blk_throtl_work()/blk_throtl_dispatch_work_fn()/
* Collapse throtl_dispatch() into blk_throtl_dispatch_work_fn()
This patch is purely cosmetic.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
|