| Commit message (Collapse) | Author | Age | Files | Lines |
| ... | |
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 7fabaf303458fcabb694999d6fa772cc13d4e217 upstream.
fuse_request_send_notify_reply() may fail if the connection was reset for
some reason (e.g. fs was unmounted). Don't leak request reference in this
case. Besides leaking memory, this resulted in fc->num_waiting not being
decremented and hence fuse_wait_aborted() left in a hanging and unkillable
state.
Fixes: 2d45ba381a74 ("fuse: add retrieve request")
Fixes: b8f95e5d13f5 ("fuse: umount should wait for all requests")
Reported-and-tested-by: syzbot+6339eda9cb4ebbc4c37b@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> #v2.6.36
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 1823342a1f2b47a4e6f5667f67cd28ab6bc4d6cd upstream.
gcc 8.1.0 complains:
fs/configfs/symlink.c:67:3: warning:
'strncpy' output truncated before terminating nul copying as many
bytes from a string as its length
fs/configfs/symlink.c: In function 'configfs_get_link':
fs/configfs/symlink.c:63:13: note: length computed here
Using strncpy() is indeed less than perfect since the length of data to
be copied has already been determined with strlen(). Replace strncpy()
with memcpy() to address the warning and optimize the code a little.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nobuhiro Iwamatsu <nobuhiro.iwamatsu@cybertrust.co.jp>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
| |
|
|
|
|
|
| |
The final logical and limits the result the 32bit. Adding interim
logical and to ensure the compiler an overflow never happens
Signed-off-by: Andreas Fenkart <andreas.fenkart@digitalstrom.com>
|
| |
|
|
| |
This reverts commit ff505baaf412985af758d5820cd620ed9f1a7e05.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Moyster <oysterized@gmail.com>
|
| |
|
|
|
| |
Signed-off-by: mydongistiny <jaysonedson@gmail.com>
Signed-off-by: Mister Oyster <oysterized@gmail.com>
|
| |
|
|
|
|
|
|
| |
bug: 111641492
Change-Id: I79e9894f94880048edaf0f7cfa2d180f65cbcf3b
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Rosenberg <drosen@google.com>
|
| |
|
|
|
|
|
|
|
|
|
| |
The macro hides some control flow, making it easier
to run into bugs.
bug: 111642636
Change-Id: I37ec207c277d97c4e7f1e8381bc9ae743ad78435
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Rosenberg <drosen@google.com>
|
| |
|
|
|
|
|
|
|
| |
We forgot to remov memory footprint accounting of per-cpu type
variables, fix it.
Fixes: 35782b233f37 ("f2fs: remove percpu_count due to performance regression")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| |
|
|
|
|
|
|
| |
No need to read nat block if nat_block_bitmap is set.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| |
|
|
|
|
|
| |
Handle this asynchronously to avoid wake-up delay.
Use system_power_efficient_wq since this is not CPU intensive.
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
|
| |
|
|
|
|
|
|
|
|
| |
Wait until invalid blocks to reach 3% of utilization before triggering rapid GC
to avoid unnecessarily aggressive GC.
We use 3% from (total - written) to avoid GC not doing its job
when the storage gets nearly full.
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
|
| |
|
|
|
|
|
| |
With the previous patch, we now have plenty of chances to GC properly
even with a very long interval between looped GC runs.
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Rapidly GC'ing all invalid blocks only takes a few minute,
even under extreme scenario.
Wait until the screen goes off and detect if the power source is plugged in.
If it is, GC thread runs repeatedly with 1ms interval.
Since the device can go to sleep even with power source plugged in,
catch an additional wakelock.
When no more GC victim block is found, GC thread sleeps for
gc_no_gc_sleep_time(5m).
With the previous ioprio patch, the impact to system responsiveness
(e.g. waking up the screen) is minimal, if not zero.
This patch is intended to optimize background GC behavior specific to Android.
So, disable background_gc by default as well.
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
|
| |
|
|
|
|
|
|
|
| |
GC should run conservatively as possible to reduce latency spikes to the user.
Setting ioprio to idle class will allow the kernel to schedule GC thread's I/O
to not affect any other processes' I/O requests.
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
sync_file_range(2) is documented to issue writeback only for pages that
are not currently being written. After all the system call has been
created for userspace to be able to issue background writeout and so
waiting for in-flight IO is undesirable there. However commit
ee53a891f474 ("mm: do_sync_mapping_range integrity fix") switched
do_sync_mapping_range() and thus sync_file_range() to issue writeback in
WB_SYNC_ALL mode since do_sync_mapping_range() was used by other code
relying on WB_SYNC_ALL semantics.
These days do_sync_mapping_range() went away and we can switch
sync_file_range(2) back to issuing WB_SYNC_NONE writeback. That should
help PostgreSQL avoid large latency spikes when flushing data in the
background.
Andres measured a 20% increase in transactions per second on an SSD disk.
Change-Id: Ib3e8a5e27501165fdd10486792d2a1989a841c9e
Signed-off-by: Jan Kara <jack@suse.com>
Reported-by: Andres Freund <andres@anarazel.de>
Tested-By: Andres Freund <andres@anarazel.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: engstk <eng.stk@sapo.pt>
|
| | |
|
| |
|
|
|
|
|
| |
This was extracted from the PaX RANDUSTACK feature in grsecurity, where
all of the lower bits are randomized. PaX keeps 16-byte alignment.
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
|
| | |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
The permission call for xattr operations happens regardless of
whether or not the xattr functions are implemented.
The xattr functions currently don't have support for permission2.
Passing EOPNOTSUPP as the mount point in xattr_permission allows
us to return EOPNOTSUPP early in permission2, if the filesystem
supports it.
Change-Id: I9d07e4cd633cf40af60450ffbff7ac5c1b4e8c2c
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35848445
|
| |
|
|
|
|
|
|
|
| |
Moving default_normal from mount info to superblock info
as it doesn't need to change between mount points.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 72158116
Change-Id: I16c6a0577c601b4f7566269f7e189fcf697afd4e
|
| |
|
|
|
|
|
|
|
| |
The default_normal option causes mounts with the gid set to
AID_SDCARD_RW to have user specific gids, as in the normal case.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: I9619b8ac55f41415df943484dc8db1ea986cef6f
Bug: 64672411
|
| | |
|
| |
|
|
| |
This reverts commit 7f4eab0e0a02c586b53466dc016f1bbd7b17b69e.
|
| |
|
|
|
|
|
| |
mark alloc_fd with EXPORT_SYMBOL so it can be used by modules.
Change-Id: Ic0dedbadecd2d0937cad8268aaa6eabbc52019ff
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Mateusz Guzik reported :
Currently obtaining a new file descriptor results in locking fdtable
twice - once in order to reserve a slot and second time to fill it.
Holding the spinlock in __fd_install() is needed in case a resize is
done, or to prevent a resize.
Mateusz provided an RFC patch and a micro benchmark :
http://people.redhat.com/~mguzik/pipebench.c
A resize is an unlikely operation in a process lifetime,
as table size is at least doubled at every resize.
We can use RCU instead of the spinlock.
__fd_install() must wait if a resize is in progress.
The resize must block new __fd_install() callers from starting,
and wait that ongoing install are finished (synchronize_sched())
resize should be attempted by a single thread to not waste resources.
rcu_sched variant is used, as __fd_install() and expand_fdtable() run
from process context.
It gives us a ~30% speedup using pipebench on a dual Intel(R) Xeon(R)
CPU E5-2696 v2 @ 2.50GHz
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Mateusz Guzik <mguzik@redhat.com>
Tested-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[@nathanchance: Leave out documentation, https://github.com/torvalds/linux/commit/8a81252b774b53e628a8a0fe18e2b8fc236d92cc]
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
mpage_submit_page() can race with another process growing i_size and
writing data via mmap to the written-back page. As mpage_submit_page()
samples i_size too early, it may happen that ext4_bio_write_page()
zeroes out too large tail of the page and thus corrupts user data.
Fix the problem by sampling i_size only after the page has been
write-protected in page tables by clear_page_dirty_for_io() call.
Reported-by: Michael Zimmer <michael@swarm64.com>
CC: stable@vger.kernel.org
Fixes: cb20d5188366f04d96d2e07b1240cc92170ade40
Signed-off-by: Jan Kara <jack@suse.cz>
[@nathanchance: fixed conflicts]
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
| |
This reverts commit 156b66c5bad7fa7621024c2537bfeb48e5a21da0.
|
| |
|
|
|
|
|
|
|
|
|
| |
The conversion is generally straightforward. The only tricky part is
that xattr block corresponding to found mbcache entry can get freed
before we get buffer lock for that block. So we have to check whether
the entry is still valid after getting buffer lock.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
(Cherry-pick from commit be0726d33cb8f411945884664924bed3cb8c70ee)
The conversion is generally straightforward. We convert filesystem from
a global cache to per-fs one. Similarly to ext4 the tricky part is that
xattr block corresponding to found mbcache entry can get freed before we
get buffer lock for that block. So we have to check whether the entry is
still valid after getting the buffer lock.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
(Cherry-pick from commit f9a61eb4e2471c56a63cd804c7474128138c38ac)
Original mbcache was designed to have more features than what ext?
filesystems ended up using. It supported entry being in more hashes, it
had a home-grown rwlocking of each entry, and one cache could cache
entries from multiple filesystems. This genericity also resulted in more
complex locking, larger cache entries, and generally more code
complexity.
This is reimplementation of the mbcache functionality to exactly fit the
purpose ext? filesystems use it for. Cache entries are now considerably
smaller (7 instead of 13 longs), the code is considerably smaller as
well (414 vs 913 lines of code), and IMO also simpler. The new code is
also much more lightweight.
I have measured the speed using artificial xattr-bench benchmark, which
spawns P processes, each process sets xattr for F different files, and
the value of xattr is randomly chosen from a pool of V values. Averages
of runtimes for 5 runs for various combinations of parameters are below.
The first value in each cell is old mbache, the second value is the new
mbcache.
V=10
F\P 1 2 4 8 16 32 64
10 0.158,0.157 0.208,0.196 0.500,0.277 0.798,0.400 3.258,0.584 13.807,1.047 61.339,2.803
100 0.172,0.167 0.279,0.222 0.520,0.275 0.825,0.341 2.981,0.505 12.022,1.202 44.641,2.943
1000 0.185,0.174 0.297,0.239 0.445,0.283 0.767,0.340 2.329,0.480 6.342,1.198 16.440,3.888
V=100
F\P 1 2 4 8 16 32 64
10 0.162,0.153 0.200,0.186 0.362,0.257 0.671,0.496 1.433,0.943 3.801,1.345 7.938,2.501
100 0.153,0.160 0.221,0.199 0.404,0.264 0.945,0.379 1.556,0.485 3.761,1.156 7.901,2.484
1000 0.215,0.191 0.303,0.246 0.471,0.288 0.960,0.347 1.647,0.479 3.916,1.176 8.058,3.160
V=1000
F\P 1 2 4 8 16 32 64
10 0.151,0.129 0.210,0.163 0.326,0.245 0.685,0.521 1.284,0.859 3.087,2.251 6.451,4.801
100 0.154,0.153 0.211,0.191 0.276,0.282 0.687,0.506 1.202,0.877 3.259,1.954 8.738,2.887
1000 0.145,0.179 0.202,0.222 0.449,0.319 0.899,0.333 1.577,0.524 4.221,1.240 9.782,3.579
V=10000
F\P 1 2 4 8 16 32 64
10 0.161,0.154 0.198,0.190 0.296,0.256 0.662,0.480 1.192,0.818 2.989,2.200 6.362,4.746
100 0.176,0.174 0.236,0.203 0.326,0.255 0.696,0.511 1.183,0.855 4.205,3.444 19.510,17.760
1000 0.199,0.183 0.240,0.227 1.159,1.014 2.286,2.154 6.023,6.039 ---,10.933 ---,36.620
V=100000
F\P 1 2 4 8 16 32 64
10 0.171,0.162 0.204,0.198 0.285,0.230 0.692,0.500 1.225,0.881 2.990,2.243 6.379,4.771
100 0.151,0.171 0.220,0.210 0.295,0.255 0.720,0.518 1.226,0.844 3.423,2.831 19.234,17.544
1000 0.192,0.189 0.249,0.225 1.162,1.043 2.257,2.093 5.853,4.997 ---,10.399 ---,32.198
We see that the new code is faster in pretty much all the cases and
starting from 4 processes there are significant gains with the new code
resulting in upto 20-times shorter runtimes. Also for large numbers of
cached entries all values for the old code could not be measured as the
kernel started hitting softlockups and died before the test completed.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
In commit bcff24887d00 ("ext4: don't read blocks from disk after extents
being swapped") bh is not updated correctly in the for loop and wrong
data has been written to disk. generic/324 catches this on sub-page
block size ext4.
Fixes: bcff24887d00 ("ext4: don't read blocks from disk after extentsbeing swapped")
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Online defragging of encrypted files is not currently implemented.
However, the move extent ioctl can still return successfully when
called. For example, this occurs when xfstest ext4/020 is run on an
encrypted file system, resulting in a corrupted test file and a
corresponding test failure.
Until the proper functionality is implemented, fail the move extent
ioctl if either the original or donor file is encrypted.
Cc: stable@vger.kernel.org
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I notice ext4/307 fails occasionally on ppc64 host, reporting md5
checksum mismatch after moving data from original file to donor file.
The reason is that move_extent_per_page() calls __block_write_begin()
and block_commit_write() to write saved data from original inode blocks
to donor inode blocks, but __block_write_begin() not only maps buffer
heads but also reads block content from disk if the size is not block
size aligned. At this time the physical block number in mapped buffer
head is pointing to the donor file not the original file, and that
results in reading wrong data to page, which get written to disk in
following block_commit_write call.
This also can be reproduced by the following script on 1k block size ext4
on x86_64 host:
mnt=/mnt/ext4
donorfile=$mnt/donor
testfile=$mnt/testfile
e4compact=~/xfstests/src/e4compact
rm -f $donorfile $testfile
# reserve space for donor file, written by 0xaa and sync to disk to
# avoid EBUSY on EXT4_IOC_MOVE_EXT
xfs_io -fc "pwrite -S 0xaa 0 1m" -c "fsync" $donorfile
# create test file written by 0xbb
xfs_io -fc "pwrite -S 0xbb 0 1023" -c "fsync" $testfile
# compute initial md5sum
md5sum $testfile | tee md5sum.txt
# drop cache, force e4compact to read data from disk
echo 3 > /proc/sys/vm/drop_caches
# test defrag
echo "$testfile" | $e4compact -i -v -f $donorfile
# check md5sum
md5sum -c md5sum.txt
Fix it by creating & mapping buffer heads only but not reading blocks
from disk, because all the data in page is guaranteed to be up-to-date
in mext_page_mkuptodate().
Cc: stable@vger.kernel.org
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently the case of writing via mmap to a file with inline data is not
handled. This is maybe a rare case since it requires a writable memory
map of a very small file, but it is trivial to trigger with on
inline_data filesystem, and it causes the
'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in
ext4_writepages() to be hit:
mkfs.ext4 -O inline_data /dev/vdb
mount /dev/vdb /mnt
xfs_io -f /mnt/file \
-c 'pwrite 0 1' \
-c 'mmap -w 0 1m' \
-c 'mwrite 0 1' \
-c 'fsync'
kernel BUG at fs/ext4/inode.c:2723!
invalid opcode: 0000 [#1] SMP
CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
task: ffff88003d3a8040 task.stack: ffffc90000300000
RIP: 0010:ext4_writepages+0xc89/0xf8a
RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283
RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc
RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246
RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5
R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698
R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff
FS: 00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0
Call Trace:
? _raw_spin_unlock+0x27/0x2a
? kvm_clock_read+0x1e/0x20
do_writepages+0x23/0x2c
? do_writepages+0x23/0x2c
__filemap_fdatawrite_range+0x80/0x87
filemap_write_and_wait_range+0x67/0x8c
ext4_sync_file+0x20e/0x472
vfs_fsync_range+0x8e/0x9f
? syscall_trace_enter+0x25b/0x2d0
vfs_fsync+0x1c/0x1e
do_fsync+0x31/0x4a
SyS_fsync+0x10/0x14
do_syscall_64+0x69/0x131
entry_SYSCALL64_slow_path+0x25/0x25
We could try to be smart and keep the inline data in this case, or at
least support delayed allocation when allocating the block, but these
solutions would be more complicated and don't seem worthwhile given how
rare this case seems to be. So just fix the bug by calling
ext4_convert_inline_data() when we're asked to make a page writable, so
that any inline data gets evicted, with the block allocated immediately.
Reported-by: Nick Alcock <nick.alcock@oracle.com>
Cc: stable@vger.kernel.org
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Competing overwrite DIO in dioread_nolock mode will just overwrite
pointer to io_end in the inode. This may result in data corruption or
extent conversion happening from IO completion interrupt because we
don't properly set buffer_defer_completion() when unlocked DIO races
with locked DIO to unwritten extent.
Since unlocked DIO doesn't need io_end for anything, just avoid
allocating it and corrupting pointer from inode for locked DIO.
A cleaner fix would be to avoid these games with io_end pointer from the
inode but that requires more intrusive changes so we leave that for
later.
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Pages clear buffers after ext4 delayed block allocation failed,
However, it does not clean its pte_dirty flag.
if the pages unmap ,in cording to the pte_dirty ,
unmap_page_range may try to call __set_page_dirty,
which may lead to the bugon at
mpage_prepare_extent_to_map:head = page_buffers(page);.
This patch just call clear_page_dirty_for_io to clean pte_dirty
at mpage_release_unused_pages for pages mmaped.
Steps to reproduce the bug:
(1) mmap a file in ext4
addr = (char *)mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED,
fd, 0);
memset(addr, 'i', 4096);
(2) return EIO at
ext4_writepages->mpage_map_and_submit_extent->mpage_map_one_extent
which causes this log message to be print:
ext4_msg(sb, KERN_CRIT,
"Delayed block allocation failed for "
"inode %lu at logical offset %llu with"
" max blocks %u with error %d",
inode->i_ino,
(unsigned long long)map->m_lblk,
(unsigned)map->m_len, -err);
(3)Unmap the addr cause warning at
__set_page_dirty:WARN_ON_ONCE(warn && !PageUptodate(page));
(4) wait for a minute,then bugon happen.
Cc: stable@vger.kernel.org
Signed-off-by: wangguang <wangguang03@zte.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ext4 can update bh->b_state non-atomically in _ext4_get_block() and
ext4_da_get_block_prep(). Usually this is fine since bh is just a
temporary storage for mapping information on stack but in some cases it
can be fully living bh attached to a page. In such case non-atomic
update of bh->b_state can race with an atomic update which then gets
lost. Usually when we are mapping bh and thus updating bh->b_state
non-atomically, nobody else touches the bh and so things work out fine
but there is one case to especially worry about: ext4_finish_bio() uses
BH_Uptodate_Lock on the first bh in the page to synchronize handling of
PageWriteback state. So when blocksize < pagesize, we can be atomically
modifying bh->b_state of a buffer that actually isn't under IO and thus
can race e.g. with delalloc trying to map that buffer. The result is
that we can mistakenly set / clear BH_Uptodate_Lock bit resulting in the
corruption of PageWriteback state or missed unlock of BH_Uptodate_Lock.
Fix the problem by always updating bh->b_state bits atomically.
CC: stable@vger.kernel.org
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
|
|
|
|
|
|
|
| |
The write_end() function must always unlock the page and drop its ref
count, even on an error.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
[@nathanchance: fixed conflicts]
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Instead of just printing warning messages, if the orphan list is
corrupted, declare the file system is corrupted. If there are any
reserved inodes in the orphaned inode list, declare the file system
corrupted and stop right away to avoid doing more potential damage to
the file system.
Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When filesystem is corrupted in the right way, it can happen
ext4_mark_iloc_dirty() in ext4_orphan_add() returns error and we
subsequently remove inode from the in-memory orphan list. However this
deletion is done with list_del(&EXT4_I(inode)->i_orphan) and thus we
leave i_orphan list_head with a stale content. Later we can look at this
content causing list corruption, oops, or other issues. The reported
trace looked like:
WARNING: CPU: 0 PID: 46 at lib/list_debug.c:53 __list_del_entry+0x6b/0x100()
list_del corruption, 0000000061c1d6e0->next is LIST_POISON1
0000000000100100)
CPU: 0 PID: 46 Comm: ext4.exe Not tainted 4.1.0-rc4+ #250
Stack:
60462947 62219960 602ede24 62219960
602ede24 603ca293 622198f0 602f02eb
62219950 6002c12c 62219900 601b4d6b
Call Trace:
[<6005769c>] ? vprintk_emit+0x2dc/0x5c0
[<602ede24>] ? printk+0x0/0x94
[<600190bc>] show_stack+0xdc/0x1a0
[<602ede24>] ? printk+0x0/0x94
[<602ede24>] ? printk+0x0/0x94
[<602f02eb>] dump_stack+0x2a/0x2c
[<6002c12c>] warn_slowpath_common+0x9c/0xf0
[<601b4d6b>] ? __list_del_entry+0x6b/0x100
[<6002c254>] warn_slowpath_fmt+0x94/0xa0
[<602f4d09>] ? __mutex_lock_slowpath+0x239/0x3a0
[<6002c1c0>] ? warn_slowpath_fmt+0x0/0xa0
[<60023ebf>] ? set_signals+0x3f/0x50
[<600a205a>] ? kmem_cache_free+0x10a/0x180
[<602f4e88>] ? mutex_lock+0x18/0x30
[<601b4d6b>] __list_del_entry+0x6b/0x100
[<601177ec>] ext4_orphan_del+0x22c/0x2f0
[<6012f27c>] ? __ext4_journal_start_sb+0x2c/0xa0
[<6010b973>] ? ext4_truncate+0x383/0x390
[<6010bc8b>] ext4_write_begin+0x30b/0x4b0
[<6001bb50>] ? copy_from_user+0x0/0xb0
[<601aa840>] ? iov_iter_fault_in_readable+0xa0/0xc0
[<60072c4f>] generic_perform_write+0xaf/0x1e0
[<600c4166>] ? file_update_time+0x46/0x110
[<60072f0f>] __generic_file_write_iter+0x18f/0x1b0
[<6010030f>] ext4_file_write_iter+0x15f/0x470
[<60094e10>] ? unlink_file_vma+0x0/0x70
[<6009b180>] ? unlink_anon_vmas+0x0/0x260
[<6008f169>] ? free_pgtables+0xb9/0x100
[<600a6030>] __vfs_write+0xb0/0x130
[<600a61d5>] vfs_write+0xa5/0x170
[<600a63d6>] SyS_write+0x56/0xe0
[<6029fcb0>] ? __libc_waitpid+0x0/0xa0
[<6001b698>] handle_syscall+0x68/0x90
[<6002633d>] userspace+0x4fd/0x600
[<6002274f>] ? save_registers+0x1f/0x40
[<60028bd7>] ? arch_prctl+0x177/0x1b0
[<60017bd5>] fork_handler+0x85/0x90
Fix the problem by using list_del_init() as we always should with
i_orphan list.
CC: stable@vger.kernel.org
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When ext4_map_blocks() is called with EXT4_GET_BLOCKS_ZERO to zero-out
allocated blocks and these blocks are actually converted from unwritten
extent the following race can happen:
CPU0 CPU1
page fault page fault
... ...
ext4_map_blocks()
ext4_ext_map_blocks()
ext4_ext_handle_unwritten_extents()
ext4_ext_convert_to_initialized()
- zero out converted extent
ext4_zeroout_es()
- inserts extent as initialized in status tree
ext4_map_blocks()
ext4_es_lookup_extent()
- finds initialized extent
write data
ext4_issue_zeroout()
- zeroes out new extent overwriting data
This problem can be reproduced by generic/340 for the fallocated case
for the last block in the file.
Fix the problem by avoiding zeroing out the area we are mapping with
ext4_map_blocks() in ext4_ext_convert_to_initialized(). It is pointless
to zero out this area in the first place as the caller asked us to
convert the area to initialized because he is just going to write data
there before the transaction finishes. To achieve this we delete the
special case of zeroing out full extent as that will be handled by the
cases below zeroing only the part of the extent that needs it. We also
instruct ext4_split_extent() that the middle of extent being split
contains data so that ext4_split_extent_at() cannot zero out full extent
in case of ENOSPC.
CC: stable@vger.kernel.org
Fixes: 12735f881952c32b31bc4e433768f18489f79ec9
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If s_reserved_gdt_blocks is extremely large, it's possible for
ext4_init_block_bitmap(), which is called when ext4 sets up an
uninitialized block bitmap, to corrupt random kernel memory. Add the
same checks which e2fsck has --- it must never be larger than
blocksize / sizeof(__u32) --- and then add a backup check in
ext4_init_block_bitmap() in case the superblock gets modified after
the file system is mounted.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
[@nathanchance: fixed conflicts]
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, ext4 would fail the mount if the file system had the quota
feature enabled and quota mount options (used for the older quota
setups) were present. This broke xfstests, since xfs silently ignores
the usrquote and grpquota mount options if they are specified. This
commit changes things so that we are consistent with xfs; having the
mount options specified is harmless, so no sense break users by
forbidding them.
Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
[@nathanchance: fixed conflicts]
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
.... because otherwise a corrupted root file file system (caused by
flash corruption, etc.) will result in the device becoming bricked
after an attempted OTA update --- which can only be recovered by a
custom recovery image manually flashed onto the device (or a warantee
return).
Cherry picked from android-common 3.10 branch: 00a6e4dfa
BUG=20939131
Change-Id: I7d5fe852c30efc35aa39fba7f708a22c034a1839
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: "Theodore Ts'o" <tytso@google.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There is a use-after-free possibility in __ext4_journal_stop() in the
case that we free the handle in the first jbd2_journal_stop() because
we're referencing handle->h_err afterwards. This was introduced in
9705acd63b125dee8b15c705216d7186daea4625 and it is wrong. Fix it by
storing the handle->h_err value beforehand and avoid referencing
potentially freed handle.
Fixes: 9705acd63b125dee8b15c705216d7186daea4625
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
(cherry picked from commit 6934da9238da947628be83635e365df41064b09b)
Signed-off-by: Bernie Thompson <bhthompson@google.com>
Change-Id: I788135d7558d1c6ddcdef4366e106b17e9df80a5
Reviewed-on: https://chromium-review.googlesource.com/412985
Commit-Ready: Bernie Thompson <bhthompson@chromium.org>
Tested-by: Bernie Thompson <bhthompson@chromium.org>
Reviewed-by: Bernie Thompson <bhthompson@chromium.org>
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently when journal restart fails, we'll have the h_transaction of
the handle set to NULL to indicate that the handle has been effectively
aborted. We handle this situation quietly in the jbd2_journal_stop() and just
free the handle and exit because everything else has been done before we
attempted (and failed) to restart the journal.
Unfortunately there are a number of problems with that approach
introduced with commit
41a5b913197c "jbd2: invalidate handle if jbd2_journal_restart()
fails"
First of all in ext4 jbd2_journal_stop() will be called through
__ext4_journal_stop() where we would try to get a hold of the superblock
by dereferencing h_transaction which in this case would lead to NULL
pointer dereference and crash.
In addition we're going to free the handle regardless of the refcount
which is bad as well, because others up the call chain will still
reference the handle so we might potentially reference already freed
memory.
Moreover it's expected that we'll get aborted handle as well as detached
handle in some of the journalling function as the error propagates up
the stack, so it's unnecessary to call WARN_ON every time we get
detached handle.
And finally we might leak some memory by forgetting to free reserved
handle in jbd2_journal_stop() in the case where handle was detached from
the transaction (h_transaction is NULL).
Fix the NULL pointer dereference in __ext4_journal_stop() by just
calling jbd2_journal_stop() quietly as suggested by Jan Kara. Also fix
the potential memory leak in jbd2_journal_stop() and use proper
handle refcounting before we attempt to free it to avoid use-after-free
issues.
And finally remove all WARN_ON(!transaction) from the code so that we do
not get random traces when something goes wrong because when journal
restart fails we will get to some of those functions.
Cc: stable@vger.kernel.org
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
(cherry picked from commit 9d506594069355d1fb2de3f9104667312ff08ed3)
Signed-off-by: Bernie Thompson <bhthompson@google.com>
Change-Id: Ic0b0f902ccbacb6d5f3388e601464e203b78ccf0
Reviewed-on: https://chromium-review.googlesource.com/412984
Commit-Ready: Bernie Thompson <bhthompson@chromium.org>
Tested-by: Bernie Thompson <bhthompson@chromium.org>
Reviewed-by: Bernie Thompson <bhthompson@chromium.org>
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
|
|
|
|
|
|
| |
There is no reason to allow ext2 filesystems be mounted with journal
mount options. So, this patch adds them to the MOPT_NO_EXT2 mount
options list.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
|
|
|
|
|
|
| |
This makes a very large function a little smaller.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joe Maples <joe@frap129.org>
|
| |
|
|
|
|
|
|
|
|
| |
This allows us to refactor the procfs code, which saves a bit of
compiled space. More importantly it isolates most of the procfs
support code into a single file, so it's easier to #ifdef it out if
the proc file system has been disabled.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joe Maples <joe@frap129.org>
|