Commit message (author, age; files changed, lines removed/added)
...
* ext4: correctly migrate a file with a hole at the beginning (Eryu Guan, 2017-05-29; 1 file, -4/+5)
    commit 8974fec7d72e3e02752fe0f27b4c3719c78d9a15 upstream.

    Currently ext4_ind_migrate() doesn't correctly handle a file which
    contains a hole at the beginning of the file. This caused the
    migration to be done incorrectly, and then if there is a subsequent
    delayed allocation write to the "hole", this would reclaim the same
    data blocks again and result in fs corruption.

        # assuming 4k block size ext4, with delalloc enabled
        # skip the first block and write to the second block
        xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile

        # converting to indirect-mapped file, which would move the data blocks
        # to the beginning of the file, but extent status cache still marks
        # that region as a hole
        chattr -e /mnt/ext4/testfile

        # delayed allocation writes to the "hole", reclaim the same data block
        # again, results in i_blocks corruption
        xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile
        umount /mnt/ext4
        e2fsck -nf /dev/sda6
        ...
        Inode 53, i_blocks is 16, should be 8.  Fix? no
        ...

    Signed-off-by: Eryu Guan <guaneryu@gmail.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: be more strict when migrating to non-extent based file (Eryu Guan, 2017-05-29; 1 file, -1/+11)
    commit d6f123a9297496ad0b6335fe881504c4b5b2a5e5 upstream.

    Currently the check in ext4_ind_migrate() is not enough before doing
    the real conversion:

    a) delayed allocated extents could bypass the check on eh->eh_entries
       and eh->eh_depth

    This can be demonstrated by this script

        xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile
        chattr -e /mnt/ext4/testfile

    where testfile has two extents but is still converted to the
    non-extent based file format.

    b) only extent length is checked but not the offset, which would
       result in data loss (delalloc) or fs corruption (nodelalloc),
       because a non-extent based file only supports at most
       (12 + 2^10 + 2^20 + 2^30) blocks

    This can be demonstrated by

        xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile
        chattr -e /mnt/ext4/testfile
        sync

    If delalloc is enabled, dmesg prints

        EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53
        EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5
        EXT4-fs (dm-4): This should not happen!! Data will be lost

    If delalloc is disabled, e2fsck -nf shows corruption

        Inode 53, i_size is 5497558142976, should be 4096.  Fix? no

    Fix the two issues by

    a) forcing all delayed allocation blocks to be allocated before
       checking eh->eh_depth and eh->eh_entries
    b) ensuring that the last logical block of the extent is within the
       direct map

    Signed-off-by: Eryu Guan <guaneryu@gmail.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
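The block limit quoted in the commit message above can be checked with a little arithmetic. The following is a minimal sketch, not the kernel's code: `ind_max_blocks()` and `fits_direct_map()` are hypothetical helpers, and the sketch assumes 32-bit block pointers (so a 4k block holds 2^10 of them) and 12 direct pointers per inode.

```c
#include <assert.h>
#include <stdint.h>

#define NDIR_BLOCKS 12 /* assumed number of direct block pointers per inode */

/* Upper bound on the blocks an indirect-mapped file can address:
 * direct + single + double + triple indirect. */
uint64_t ind_max_blocks(uint32_t block_size)
{
    /* block size must be a power of two of at least 1k */
    assert(block_size >= 1024 && (block_size & (block_size - 1)) == 0);
    uint64_t n = block_size / 4; /* 32-bit block pointers per block */
    return NDIR_BLOCKS + n + n * n + n * n * n;
}

/* Hypothetical form of check (b): the extent [start, start + len) must
 * end inside the direct map, so both offset and length are constrained. */
int fits_direct_map(uint64_t start, uint64_t len)
{
    return start + len <= NDIR_BLOCKS;
}
```

With 4k blocks the bound is 12 + 2^10 + 2^20 + 2^30 = 1074791436 blocks, while the write at 5T lands on logical block 5 * 2^40 / 4096 = 1342177280, the same number that appears in the dmesg warning.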
* ext4: fix reservation release on invalidatepage for delalloc fs (Lukas Czerner, 2017-05-29; 1 file, -3/+12)
    commit 9705acd63b125dee8b15c705216d7186daea4625 upstream.

    On a delalloc-enabled file system, during the invalidatepage
    operation in ext4_da_page_release_reservation() we want to clear the
    delayed buffer and remove the extent covering the delayed buffer from
    the extent status tree.

    However, currently there is a bug where on systems with page size >
    block size we will always remove extents from the start of the page
    regardless of where the actual delayed buffers are positioned in the
    page. This leads to errors like this:

        EXT4-fs warning (device loop0): ext4_da_release_space:1225:
        ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data blocks

    This however can cause data loss at writeback time if the file system
    is in an ENOSPC condition, because we're releasing the reservation
    for someone else's delayed buffer.

    Fix this by only removing extents that correspond to the part of the
    page we want to invalidate.

    This problem is reproducible by the following fio recipe (however I
    was only able to reproduce it with fio-2.1 or older).

        [global]
        bs=8k
        iodepth=1024
        iodepth_batch=60
        randrepeat=1
        size=1m
        directory=/mnt/test
        numjobs=20

        [job1]
        ioengine=sync
        bs=1k
        direct=1
        rw=randread
        filename=file1:file2

        [job2]
        ioengine=libaio
        rw=randwrite
        direct=1
        filename=file1:file2

        [job3]
        bs=1k
        ioengine=posixaio
        rw=randwrite
        direct=1
        filename=file1:file2

        [job5]
        bs=1k
        ioengine=sync
        rw=randread
        filename=file1:file2

        [job7]
        ioengine=libaio
        rw=randwrite
        filename=file1:file2

        [job8]
        ioengine=posixaio
        rw=randwrite
        filename=file1:file2

        [job10]
        ioengine=mmap
        rw=randwrite
        bs=1k
        filename=file1:file2

        [job11]
        ioengine=mmap
        rw=randwrite
        direct=1
        filename=file1:file2

    Signed-off-by: Lukas Czerner <lczerner@redhat.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
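To make the "page size > block size" case in the commit above concrete, here is a rough sketch of how the logical blocks covered by an invalidated byte range of a page could be computed. It is an illustration under stated assumptions, not the kernel implementation: `struct blk_range` and `invalidated_blocks()` are hypothetical names, only blocks lying entirely inside the range are counted, and buffer state (delayed or not) is ignored.

```c
#include <assert.h>
#include <stdint.h>

struct blk_range {
    uint64_t first; /* first logical block covered by the range */
    uint32_t count; /* number of whole blocks covered */
};

/* Map the byte range [offset, offset + length) of page `page_index`
 * to the logical blocks that lie entirely inside it. */
struct blk_range invalidated_blocks(uint64_t page_index, uint32_t offset,
                                    uint32_t length, unsigned page_shift,
                                    unsigned blkbits)
{
    uint32_t bs = 1u << blkbits;
    uint32_t stop = offset + length;
    assert(blkbits <= page_shift && stop <= (1u << page_shift));

    uint32_t first_in_page = (offset + bs - 1) >> blkbits; /* round up */
    uint32_t end_in_page = stop >> blkbits;                 /* round down */

    struct blk_range r;
    /* Start from the page's first block, then skip the untouched blocks. */
    r.first = (page_index << (page_shift - blkbits)) + first_in_page;
    r.count = end_in_page > first_in_page ? end_in_page - first_in_page : 0;
    return r;
}
```

For a 4k page with 1k blocks, invalidating the second half of page 2 covers blocks 10 and 11. Always starting at the first block of the page (block 8 here) is the kind of mistake the commit message describes.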
* ext4: don't retry file block mapping on bigalloc fs with non-extent file (Darrick J. Wong, 2017-05-29; 1 file, -1/+1)
    commit 292db1bc6c105d86111e858859456bcb11f90f91 upstream.

    ext4 isn't willing to map clusters to a non-extent file. Don't signal
    this with an out of space error, since the FS will retry the
    allocation (which didn't fail) forever. Instead, return EUCLEAN so
    that the operation will fail immediately all the way back to
    userspace.

    (The fix is either to run e2fsck -E bmap2extent, or to chattr +e the
    file.)

    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: call sync_blockdev() before invalidate_bdev() in put_super() (Theodore Ts'o, 2017-05-29; 1 file, -0/+1)
    commit 89d96a6f8e6491f24fc8f99fd6ae66820e85c6c1 upstream.

    Normally all of the buffers will have been forced out to disk before
    we call invalidate_bdev(), but there will be some cases, where a file
    system operation was aborted due to an ext4_error(), where there may
    still be some dirty buffers in the buffer cache for the device. So
    try to force them out to disk before calling invalidate_bdev().

    This fixes a warning triggered by generic/081:

        WARNING: CPU: 1 PID: 3473 at /usr/projects/linux/ext4/fs/block_dev.c:56 __blkdev_put+0xb5/0x16f()

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: fix race between truncate and __ext4_journalled_writepage() (Theodore Ts'o, 2017-05-29; 1 file, -4/+19)
    commit bdf96838aea6a265f2ae6cbcfb12a778c84a0b8e upstream.

    The commit cf108bca465d: "ext4: Invert the locking order of page_lock
    and transaction start" caused __ext4_journalled_writepage() to drop
    the page lock before the page was written back, as part of changing
    the locking order to jbd2_journal_start -> page_lock. However, this
    introduced a potential race if there was a truncate racing with the
    data=journalled writeback mode.

    Fix this by grabbing the page lock after starting the journal handle,
    and then checking to see if page had gotten truncated out from under
    us.

    This fixes a number of different warnings or BUG_ON's when running
    xfstests generic/086 in data=journalled mode, including:

        jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7 c0, 164), jh->b_transaction ( (null), 0), jh->b_next_transaction ( (null), 0), jlist 0

    - and -

        kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
        ...
        Call Trace:
         [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
         [<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117
         [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
         [<c027d883>] ? lock_buffer+0x36/0x36
         [<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22
         [<c0229139>] do_invalidatepage+0x22/0x26
         [<c0229198>] truncate_inode_page+0x5b/0x85
         [<c022934b>] truncate_inode_pages_range+0x156/0x38c
         [<c0229592>] truncate_inode_pages+0x11/0x15
         [<c022962d>] truncate_pagecache+0x55/0x71
         [<c02b913b>] ext4_setattr+0x4a9/0x560
         [<c01ca542>] ? current_kernel_time+0x10/0x44
         [<c026c4d8>] notify_change+0x1c7/0x2be
         [<c0256a00>] do_truncate+0x65/0x85
         [<c0226f31>] ? file_ra_state_init+0x12/0x29

    - and -

        WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396
        irty_metadata+0x14a/0x1ae()
        ...
        Call Trace:
         [<c01b879f>] ? console_unlock+0x3a1/0x3ce
         [<c082cbb4>] dump_stack+0x48/0x60
         [<c0178b65>] warn_slowpath_common+0x89/0xa0
         [<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae
         [<c0178bef>] warn_slowpath_null+0x14/0x18
         [<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae
         [<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d
         [<c02b2f44>] write_end_fn+0x40/0x53
         [<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a
         [<c02b59e7>] ext4_writepage+0x354/0x3b8
         [<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4
         [<c02b1b21>] ? wait_on_buffer+0x2c/0x2c
         [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
         [<c02b5a5b>] __writepage+0x10/0x2e
         [<c0225956>] write_cache_pages+0x22d/0x32c
         [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
         [<c02b6ee8>] ext4_writepages+0x102/0x607
         [<c019adfe>] ? sched_clock_local+0x10/0x10e
         [<c01a8a7c>] ? __lock_is_held+0x2e/0x44
         [<c01a8ad5>] ? lock_is_held+0x43/0x51
         [<c0226dff>] do_writepages+0x1c/0x29
         [<c0276bed>] __writeback_single_inode+0xc3/0x545
         [<c0277c07>] writeback_sb_inodes+0x21f/0x36d
         ...

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: check for zero length extent explicitly (Eryu Guan, 2017-05-29; 1 file, -1/+1)
    commit 2f974865ffdfe7b9f46a9940836c8b167342563d upstream.

    The following commit introduced a bug when checking for zero length
    extent

        5946d08 ext4: check for overlapping extents in ext4_valid_extent_entries()

    Zero length extent could pass the check if lblock is zero.

    Adding the explicit check for zero length back.

    Signed-off-by: Eryu Guan <guaneryu@gmail.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
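One way a zero-length extent at lblock 0 can slip through a purely range-based check is unsigned wraparound. The sketch below is a simplified illustration, not the actual ext4 validation code: both function names are hypothetical, and it assumes 32-bit unsigned logical block numbers and that the check in question compares the start block with a computed last block.

```c
#include <stdint.h>

/* Range check only: compute the last block and compare it with the start.
 * For lblock == 0 and len == 0 the subtraction wraps to UINT32_MAX, so the
 * comparison succeeds and the empty extent looks valid. */
int extent_ok_range_only(uint32_t lblock, uint32_t len)
{
    uint32_t last = lblock + len - 1;
    return lblock <= last;
}

/* Same check with the explicit zero-length test restored. */
int extent_ok_explicit(uint32_t lblock, uint32_t len)
{
    if (len == 0)
        return 0;
    uint32_t last = lblock + len - 1;
    return lblock <= last;
}
```

For any non-zero lblock the range-only check already rejects a zero length, because the computed last block is lblock - 1. Only lblock 0 is affected, which matches the case named in the commit message.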
* ext4: fix fencepost error in lazytime optimization (Theodore Ts'o, 2017-05-29; 1 file, -1/+6)
    Commit 8f4d8558391: "ext4: fix lazytime optimization" was not a
    complete fix. In the case where the inode number is a multiple of 16,
    we could still end up with dirty timestamps written to the wrong
    inode on disk. Oops.

    This can be easily reproduced by using generic/005 with a file system
    with metadata_csum and lazytime enabled.

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Cc: stable@vger.kernel.org
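The "multiple of 16" case above is a classic one-based indexing fencepost. The sketch below illustrates it under assumptions that are not spelled out in the commit message: inode numbers start at 1, an inode table block holds 16 inodes, and the goal is to find the first inode number stored in the same block as a given inode. Both helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Naive rounding that treats inode numbers as zero-based. For ino == 16
 * it returns 16, although inode 16 is the last inode of the block that
 * starts at inode 1. */
uint32_t first_ino_naive(uint32_t ino, uint32_t per_block)
{
    return ino & ~(per_block - 1);
}

/* One-based rounding: shift to zero-based, round down, shift back. */
uint32_t first_ino_one_based(uint32_t ino, uint32_t per_block)
{
    assert(ino >= 1 && per_block != 0 && (per_block & (per_block - 1)) == 0);
    return ((ino - 1) & ~(per_block - 1)) + 1;
}
```

With 16 inodes per block, inodes 1 through 16 share a block and inodes 17 through 32 share the next one, so exact multiples of 16 are where the two roundings disagree.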
* ext4: fix lazytime optimization (Theodore Ts'o, 2017-05-29; 1 file, -1/+1)
    We had a fencepost error in the lazytime optimization which meant
    that timestamps would get written to the wrong inode.

    Cc: stable@vger.kernel.org
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: set lazytime on remount if MS_LAZYTIME is set by mount (Theodore Ts'o, 2017-05-29; 1 file, -0/+3)
    Newer versions of mount parse the lazytime feature and pass it to the
    mount system call via the flags field in the mount system call,
    removing the lazytime string from the mount options list. So we need
    to check for the presence of MS_LAZYTIME and set it in sb->s_flags in
    order for this flag to be set on a remount.

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Cc: stable@vger.kernel.org
* ext4: add optimization for the lazytime mount option (Theodore Ts'o, 2017-05-29; 3 files, -2/+102)
    Add an optimization for the MS_LAZYTIME mount option so that we will
    opportunistically write out any inodes with the I_DIRTY_TIME flag set
    in a particular inode table block when we need to update some inode
    in that inode table block anyway.

    Also add some temporary code so that we can set the lazytime mount
    option without needing a modified /sbin/mount program which can set
    MS_LAZYTIME. We can eventually make this go away once util-linux has
    added support.

    Google-Bug-Id: 18297052
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: add support for a lazytime mount option (Theodore Ts'o, 2017-05-29; 13 files, -35/+187)
    Add a new mount option which enables a new "lazytime" mode. This mode
    causes atime, mtime, and ctime updates to only be made to the
    in-memory version of the inode. The on-disk times will only get
    updated (a) when the inode needs to be updated for some non-time
    related change, (b) when userspace calls fsync(), syncfs() or sync(),
    or (c) just before an undeleted inode is evicted from memory.

    This is OK according to POSIX because there are no guarantees after a
    crash unless userspace explicitly requests them via a fsync(2) call.

    For workloads which feature a large number of random writes to a
    preallocated file, the lazytime mount option significantly reduces
    writes to the inode table. The repeated 4k writes to a single block
    will result in undesirable stress on flash devices and SMR disk
    drives. Even on conventional HDDs, the repeated writes to the inode
    table block will trigger Adjacent Track Interference (ATI)
    remediation latencies, which very negatively impact long tail
    latencies --- which is a very big deal for web serving tiers (for
    example).

    Google-Bug-Id: 18297052
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs: Add dynamic sync control (jollaman999, 2017-05-29; 4 files, -72/+298)
    Adapted for jolla-kernel
    Original by @faux123

    The dynamic sync control interface uses the Android kernel's unique
    early suspend / late resume interface.

    While the screen is on, file sync is disabled. When the screen is
    off, a file sync is called to flush all outstanding writes and file
    sync operation is restored as normal.

    Signed-off-by: Paul Reioux <reioux@gmail.com>
* Revert "Dynamic Fsync Control" (Mister Oyster, 2017-05-29; 4 files, -264/+0)
    This reverts commit d2cdb4e1ce3df4e28b0e51807455ddf424ff2e71.
* writeback: Do not sort b_io list only because of block device inode (Jan Kara, 2017-05-29; 3 files, -4/+12)
    It is very likely that the block device inode will be part of the BDI
    dirty list as well. However it doesn't make sense to sort inodes on
    the b_io list just because of this inode (as it contains buffers all
    over the device anyway). So save some CPU cycles, which is valuable
    since we hold the relatively contended wb->list_lock.

    Signed-off-by: Jan Kara <jack@suse.cz>
* fs: push sync_filesystem() down to the file system's remount_fs() (Theodore Ts'o, 2017-05-29; 3 files, -0/+5)
    Previously, the no-op "mount -o remount /dev/xxx" operation when the
    file system is already mounted read-write causes an implied,
    unconditional syncfs(). This seems pretty stupid, and it's certainly
    not documented or guaranteed to do this, nor is it particularly
    useful, except in the case where the file system was mounted rw and
    is getting remounted read-only.

    However, it's possible that there might be some file systems that are
    actually depending on this behavior. In most file systems, it's
    probably fine to only call sync_filesystem() when transitioning from
    read-write to read-only, and there are some file systems where this
    is not needed at all (for example, for a pseudo-filesystem or
    something like romfs).

    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Cc: linux-fsdevel@vger.kernel.org
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Artem Bityutskiy <dedekind1@gmail.com>
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Evgeniy Dushistov <dushistov@mail.ru>
    Cc: Jan Kara <jack@suse.cz>
    Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
    Cc: Anders Larsen <al@alarsen.net>
    Cc: Phillip Lougher <phillip@squashfs.org.uk>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
    Cc: Petr Vandrovec <petr@vandrovec.name>
    Cc: xfs@oss.sgi.com
    Cc: linux-btrfs@vger.kernel.org
    Cc: linux-cifs@vger.kernel.org
    Cc: samba-technical@lists.samba.org
    Cc: codalist@coda.cs.cmu.edu
    Cc: linux-ext4@vger.kernel.org
    Cc: linux-f2fs-devel@lists.sourceforge.net
    Cc: fuse-devel@lists.sourceforge.net
    Cc: cluster-devel@redhat.com
    Cc: linux-mtd@lists.infradead.org
    Cc: jfs-discussion@lists.sourceforge.net
    Cc: linux-nfs@vger.kernel.org
    Cc: linux-nilfs@vger.kernel.org
    Cc: linux-ntfs-dev@lists.sourceforge.net
    Cc: ocfs2-devel@oss.oracle.com
    Cc: reiserfs-devel@vger.kernel.org
    Change-Id: Ie6fc68d845b0d327f56e4da91a8a9ba0673e5d5e
* fs: ext4: disable support for fallocate FALLOC_FL_PUNCH_HOLE (Nick Desaulniers, 2017-05-29; 1 file, -0/+7)
    Bug: 28760453
    Change-Id: I019c2de559db9e4b95860ab852211b456d78c4ca
    Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
* ext4 crypto: use dget_parent() in ext4_d_revalidate() (Theodore Ts'o, 2017-05-29; 1 file, -4/+8)
    This avoids potential problems caused by a race where the inode gets
    renamed out from its parent directory and the parent directory is
    deleted while ext4_d_revalidate() is running.

    Upstream commit: 3d43bcfef5f0548845a425365011c499875491b0
    Fixes: 28b4c263961c
    Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Cc: stable@vger.kernel.org
    Change-Id: Ia970597753fae0d67fa6eebb972de24d5c1194f8
* ext4 crypto: don't let data integrity writebacks fail with ENOMEM (Theodore Ts'o, 2017-05-29; 4 files, -19/+38)
    We don't want the writeback triggered from the journal commit (in
    data=writeback mode) to cause the journal to abort due to
    generic_writepages() returning an ENOMEM error. In addition, if
    fsync() fails with ENOMEM, most applications will probably not do the
    right thing.

    So if we are doing a data integrity sync, and ext4_encrypt() returns
    ENOMEM, we will submit any queued I/O to date, and then retry the
    allocation using GFP_NOFAIL.

    Upstream commit: c9af28fdd44922a6c10c9f8315718408af98e315
    Google-Bug-Id: 27641567
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Change-Id: I55b6ab35c9ad4eb2ca6d06380755395f17525496
* ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea() (Theodore Ts'o, 2017-05-29; 2 files, -1/+4)
    We aren't checking to see if the in-inode extended attribute is
    corrupted before we try to expand the inode's extra isize fields.

    This can lead to potential crashes caused by the BUG_ON() check in
    ext4_xattr_shift_entries().

    Upstream commit: 9e92f48c34eb2b9af9d12f892e2fe1fce5e8ce35
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Change-Id: Idd5c5eaaaf7e244e3d310fc528840c13ce4c44a4
* ext4 crypto: fix memleak in ext4_readdir() (Kirill Tkhai, 2017-05-29; 1 file, -2/+5)
    When ext4_bread() fails, fname_crypto_str remains allocated after
    return. Fix that.

    Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    CC: Dmitry Monakhov <dmonakhov@virtuozzo.com>
    Signed-off-by: Theodore Ts'o <tytso@google.com>
    Change-Id: Ie137cb7be090c52c65c65872035b537ece8c2f17
* ext4 crypto: revalidate dentry after adding or removing the key (Theodore Ts'o, 2017-05-29; 5 files, -0/+98)
    Add a validation check for dentries in encrypted directories to make
    sure we're not caching stale data after a key has been added or
    removed.

    Also check to make sure that the status of the encryption key is
    updated when readdir(2) is executed.

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Theodore Ts'o <tytso@google.com>
    Change-Id: Ic7a90d79d9447272fc512ae2abbd299523de02b8
* ext4 crypto: simplify interfaces to directory entry insert functions (Theodore Ts'o, 2017-05-29; 3 files, -17/+11)
    A number of functions, including ext4_add_dx_entry, make_indexed_dir,
    etc., are being passed a dentry even though the only thing they use
    is the containing parent. We can shrink the code size slightly by
    making this replacement. This will also be useful in cases where we
    don't have a dentry as the argument to the directory entry insert
    functions.

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: "Theodore Ts'o" <tytso@google.com>
    Change-Id: I9267c577ab4d7d60e34cbf37c71eaf443e637c5f
* ext4 crypto: add missing locking for keyring_key access (Theodore Ts'o, 2017-05-29; 1 file, -0/+4)
    Cc: stable@kernel.org
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: "Theodore Ts'o" <tytso@google.com>
    Change-Id: Ia13629b6512a0c5dd2a09e7e3676c74af20c96a3
* ext4 crypto: exit cleanly if ext4_derive_key_aes() fails (Laurent Navet, 2017-05-29; 1 file, -0/+2)
    The return value of ext4_derive_key_aes() is stored but not used. Add
    a test to exit cleanly if ext4_derive_key_aes() fails. Also fix
    coverity CID 1309760.

    Signed-off-by: Laurent Navet <laurent.navet@gmail.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: "Theodore Ts'o" <tytso@google.com>
    Change-Id: I796cdfc65386e546f332a3dbbf9f2c2cd76e3301
* ext4 crypto: check for too-short encrypted file names (Theodore Ts'o, 2017-05-29; 1 file, -0/+4)
    An encrypted file name should never be shorter than 16 bytes, the AES
    block size. The 3.10 crypto layer will oops and crash the kernel if
    ciphertext shorter than the block size is passed to it.

    Fortunately, in modern kernels the crypto layer will not crash the
    kernel in this scenario, but nevertheless, it represents a corrupted
    directory, and we should detect it and mark the file system as
    corrupted so that e2fsck can fix this.

    Change-Id: Ic42808e5161b22ff607689d3570be4d04e6158ed
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Theodore Ts'o <tytso@google.com>
* ext4 crypto: use a jbd2 transaction when adding a crypto policy (Theodore Ts'o, 2017-05-29; 1 file, -2/+15)
    Start a jbd2 transaction, and mark the inode dirty under that
    transaction after setting the encrypt flag. Otherwise if the
    directory isn't modified after setting the crypto policy, the
    encrypted flag might not survive the inode getting pushed out from
    memory, or the file system getting unmounted and remounted.

    Change-Id: I5868e0531881922d8a5e68fa88b6cf2bb1675b99
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Theodore Ts'o <tytso@google.com>
* ext4: fix data corruption caused by unwritten and delayed extents (Lukas Czerner, 2017-05-29; 2 files, -0/+10)
    Currently it is possible to lose a whole file system block worth of
    data when we hit the specific interaction with unwritten and delayed
    extents in the extent status tree.

    The problem is that when we insert a delayed extent into the extent
    status tree, the only way to get rid of it is when we write out the
    delayed buffer. However there is a limitation in the extent status
    tree implementation so that when inserting an unwritten extent,
    should there be even a single delayed block, the whole unwritten
    extent would be marked as delayed.

    At this point, there is no way to get rid of the delayed extents,
    because there are no delayed buffers to write out. So when we write
    into said unwritten extent we will convert it to written, but it
    still remains delayed.

    When we try to write into that block later, ext4_da_map_blocks() will
    set the buffer new and delayed and map it to an invalid block, which
    causes the rest of the block to be zeroed, losing already written
    data.

    For now we can fix this by simply not allowing the delayed status to
    be set on a written extent in the extent status tree. Also add a
    WARN_ON() to make sure that we notice if this happens in the future.

    This problem can be easily reproduced by running the following
    xfs_io.

        xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
                  -c "falloc 0 131072" \
                  -c "pwrite -S 0xbb 65536 2048" \
                  -c "fsync" /mnt/test/fff

        echo 3 > /proc/sys/vm/drop_caches
        xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff

    This can theoretically also be reproduced at random by running fsx,
    but it's not very reliable, though on machines with bigger page size
    (like ppc) this can be seen more often (especially xfstest
    generic/127).

    Change-Id: I0ba800f68cf35a0137a5c5b0903017e08bc6f814
    Signed-off-by: Lukas Czerner <lczerner@redhat.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: "Theodore Ts'o" <tytso@google.com>
    Cc: stable@vger.kernel.org
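The byte offsets in the xfs_io reproducer above are easier to follow once mapped to block numbers. This is a small worked sketch that assumes a 4k block size, which the commit message does not state; `block_of()` and `offset_in_block()` are hypothetical helpers.

```c
#include <stdint.h>

/* Index of the file system block that contains a byte offset. */
uint64_t block_of(uint64_t byte_offset, uint32_t block_size)
{
    return byte_offset / block_size;
}

/* Position of a byte offset inside its block. */
uint32_t offset_in_block(uint64_t byte_offset, uint32_t block_size)
{
    return (uint32_t)(byte_offset % block_size);
}
```

Under that assumption, the first write (offset 4096) touches block 1, the falloc covers blocks 0 through 31, and the writes at 65536 and 67584 both land in block 16: the first fills its first half, the second its second half. Zeroing "the rest of the block" during the last write would therefore wipe the 0xbb data written earlier.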
* ext4 crypto: fix bugs in ext4_encrypted_zeroout() (Theodore Ts'o, 2017-05-29; 2 files, -4/+21)
    Fix multiple bugs in ext4_encrypted_zeroout(), including one that
    could cause us to write an encrypted zero page to the wrong location
    on disk, potentially causing data and file system corruption.
    Fortunately, this tends to only show up in stress tests, but even
    with these fixes, we are seeing some test failures with generic/127
    --- but these are now caused by data failures instead of metadata
    corruption.

    Since ext4_encrypted_zeroout() is only used for some optimizations to
    keep the extent tree from being too fragmented, and
    ext4_encrypted_zeroout() itself isn't all that optimized from a time
    or IOPS perspective, disable the extent tree optimization for
    encrypted inodes for now. This prevents the data corruption issues
    reported by generic/127 until we can figure out what's going wrong.

    Change-Id: I795e6b479c75f0f930bb47092720c4d7add538da
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: "Theodore Ts'o" <tytso@google.com>
    Cc: stable@vger.kernel.org
* ext4 crypto: replace some BUG_ON()'s with error checks (Theodore Ts'o, 2017-05-29; 4 files, -7/+15)
    Buggy (or hostile) userspace should not be able to cause the kernel
    to crash.

    Change-Id: I67f7b32dd458d577b506ddff6ef07955e804e3ff
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: "Theodore Ts'o" <tytso@google.com>
    Cc: stable@vger.kernel.org
* ext4 crypto: ext4_page_crypto() doesn't need a encryption context (Theodore Ts'o, 2017-05-29; 4 files, -28/+9)
    Since ext4_page_crypto() doesn't need an encryption context (at least
    not any more), this allows us to simplify a number of function
    signatures and also allows us to avoid needing to allocate a context
    in ext4_block_write_begin(). It also means we no longer need a
    separate ext4_decrypt_one() function.

    Change-Id: I2f83f5745487ef85312bf8469a6b2a190545a5e4
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: "Theodore Ts'o" <tytso@google.com>
* ext4 crypto: fix memory leak in ext4_bio_write_page() (Theodore Ts'o, 2017-05-29; 1 file, -1/+4)
    There are times when ext4_bio_write_page() is called even though we
    don't actually need to do any I/O. This happens when ext4_writepage()
    gets called by the jbd2 commit path when an inode needs to force its
    pages written out in order to provide data=ordered guarantees --- and
    a page is backed by an unwritten (e.g., uninitialized) block on disk,
    or if delayed allocation means the page's backing store hasn't been
    allocated yet. In that case, we need to skip the call to
    ext4_encrypt_page(), since in addition to wasting CPU, it leads to a
    bounce page and an ext4 crypto context getting leaked.

    Change-Id: Icd2123808fd7372c11e6f9e17849e242837d729d
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: "Theodore Ts'o" <tytso@google.com>
    Cc: stable@vger.kernel.org
* mtk: binder: remove debug stuff to ease future merge (Mister Oyster, 2017-05-28; 1 file, -2392/+11)
* mtk: binder: 3.10 updates (Mister Oyster, 2017-05-28; 1 file, -43/+41)
* disable aio support in recommended configuration (Daniel Micay, 2017-05-28; 1 file, -0/+1)
    The aio interface adds substantial attack surface for a feature
    that's not being exposed by Android at all. It's unlikely that anyone
    is using the kernel feature directly either. This feature is rarely
    used even on servers. The glibc POSIX aio calls really use thread
    pools. The lack of widespread usage also means this is relatively
    poorly audited/tested.

    The kernel's aio rarely provides performance benefits over using a
    thread pool and is quite incomplete in terms of system call coverage
    along with having edge cases where blocking can occur. Part of the
    performance issue is the fact that it only supports direct io, not
    buffered io. The existing API is considered fundamentally flawed and
    it's unlikely it will be expanded, but rather replaced:

    https://marc.info/?l=linux-aio&m=145255815216051&w=2

    Since ext4 encryption means no direct io support, kernel aio isn't
    even going to work properly on Android devices using file-based
    encryption.

    Change-Id: Iccc7cab4437791240817e6275a23e1d3f4a47f2d
    Signed-off-by: Daniel Micay <danielmicay@gmail.com>
* Fix "Information disclosure vulnerability in MediaTek driver" (fire855, 2017-05-28; 1 file, -93/+14)
    CVE-2017-0529
* Fix "Elevation of privilege vulnerability in MediaTek components" (fire855, 2017-05-28; 3 files, -32/+38)
    CVE-2017-0502
* Fix security vulnerablity in cmdq driver (fire855, 2017-05-28; 2 files, -1/+10)
* Fix "Elevation of privilege vulnerability in MediaTek components" (fire855, 2017-05-28; 1 file, -2/+4)
    CVE-2017-0503
* Fix "Elevation of privilege vulnerability in MediaTek Hardware Sensor Driver" (fire855, 2017-05-28; 1 file, -2/+2)
    CVE-2017-0517
* Revert "defconfig: kill LMK and enable MEMCG" (Mister Oyster, 2017-05-28; 1 file, -5/+4)
    This reverts commit eccc87176fd5865301cc59e5f6987449f22a4489.
* makefile: reorganize cflags (Mister Oyster, 2017-05-28; 1 file, -10/+9)
* ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea() (Theodore Ts'o, 2017-05-27; 1 file, -4/+28)
    commit 9e92f48c34eb2b9af9d12f892e2fe1fce5e8ce35 upstream.

    We aren't checking to see if the in-inode extended attribute is
    corrupted before we try to expand the inode's extra isize fields.

    This can lead to potential crashes caused by the BUG_ON() check in
    ext4_xattr_shift_entries().

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Cc: Julia Lawall <julia.lawall@lip6.fr>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm: change invalidatepage prototype to accept length (Lukas Czerner, 2017-05-27; 27 files, -60/+101)
    Currently there is no way to truncate a partial page where the end
    truncate point is not at the end of the page. This is because it was
    not needed and the functionality was enough for the file system
    truncate operation to work properly. However more file systems now
    support the punch hole feature and it can benefit from mm supporting
    truncating a page just up to a certain point.

    Specifically, with this functionality truncate_inode_pages_range()
    can be changed so it supports truncating a partial page at the end of
    the range (currently it will BUG_ON() if 'end' is not at the end of
    the page).

    This commit changes the invalidatepage() address space operation
    prototype to accept the range to be invalidated and updates all the
    instances of it.

    We also change block_invalidatepage() in the same way and actually
    make use of the new length argument, implementing range invalidation.

    Actual file system implementations will follow, except for the file
    systems where the changes are really simple and should not change the
    behaviour in any way. The implementation for truncate_page_range(),
    which will be able to accept page unaligned ranges, will follow as
    well.

    Change-Id: Id47992f86b307985b3215bcf141d56d1849d71df
    Signed-off-by: Lukas Czerner <lczerner@redhat.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Hugh Dickins <hughd@google.com>
    (cherry picked from commit d47992f86b307985b3215bcf141d56d1849d71df)

    f2fs: removed f2fs modifications because of f2fs backports

    Signed-off-by: Mister Oyster <oysterized@gmail.com>
* UPSTREAM: ext4: fix fencepost in s_first_meta_bg validation (Theodore Ts'o, 2017-05-27; 1 file, -1/+1)
    (cherry-picked from commit 2ba3e6e8afc9b6188b471f27cf2b5e3cf34e7af2)

    It is OK for s_first_meta_bg to be equal to the number of block group
    descriptor blocks. (It rarely happens, but it shouldn't cause any
    problems.)

    https://bugzilla.kernel.org/show_bug.cgi?id=194567

    Fixes: 3a4b77cd47bb837b8557595ec7425f281f2ca1fe
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Cc: stable@vger.kernel.org
    Change-Id: Ib414feb50f88dcd42dc846429b81df6c72b28136
* BACKPORT: ext4: validate s_first_meta_bg at mount time  Eryu Guan  2017-05-27  1  -0/+9
    (Cherry-picked from commit 3a4b77cd47bb837b8557595ec7425f281f2ca1fe)

    Ralf Spenneberg reported that he hit a kernel crash when mounting a modified ext4 image. And it turns out that kernel crashed when calculating fs overhead (ext4_calculate_overhead()), this is because the image has very large s_first_meta_bg (debug code shows it's 842150400), and ext4 overruns the memory in count_overhead() when setting bitmap buffer, which is PAGE_SIZE.

    ext4_calculate_overhead():
      buf = get_zeroed_page(GFP_NOFS);  <=== PAGE_SIZE buffer
      blks = count_overhead(sb, i, buf);

    count_overhead():
      for (j = ext4_bg_num_gdb(sb, grp); j > 0; j--) {  <=== j = 842150400
              ext4_set_bit(EXT4_B2C(sbi, s++), buf);    <=== buffer overrun
              count++;
      }

    This can be reproduced easily for me by this script:

      #!/bin/bash
      rm -f fs.img
      mkdir -p /mnt/ext4
      fallocate -l 16M fs.img
      mke2fs -t ext4 -O bigalloc,meta_bg,^resize_inode -F fs.img
      debugfs -w -R "ssv first_meta_bg 842150400" fs.img
      mount -o loop fs.img /mnt/ext4

    Fix it by validating s_first_meta_bg first at mount time, and refusing to mount if its value exceeds the largest possible meta_bg number.

    Reported-by: Ralf Spenneberg <ralf@os-t.de>
    Signed-off-by: Eryu Guan <guaneryu@gmail.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Reviewed-by: Andreas Dilger <adilger@dilger.ca>
    Change-Id: I252fda33d116b044a3e710b79bdd0c7ce2870145
* ANDROID: ext4 crypto: Disables zeroing on truncation when there's no key  Michael Halcrow  2017-05-27  1  -0/+5
    When performing orphan cleanup on mount, ext4 may truncate pages. Truncation as currently implemented may require the encryption key for partial zeroing, and the key isn't necessarily available on mount. Since the userspace tools don't perform the partial zeroing operation anyway, let's just skip doing that in the kernel.

    This patch fixes a BUG_ON() oops.

    Bug: 35209576
    Change-Id: I2527a3f8d2c57d2de5df03fda69ee397f76095d7
    Signed-off-by: Michael Halcrow <mhalcrow@google.com>
* Update defconfig for arm64 to enable ext4 encryption  Theodore Ts'o  2017-05-27  1  -0/+1
    Signed-off-by: "Theodore Ts'o" <tytso@google.com>
    Change-Id: Ia20fb759595bc86f4cf6d8d22b14c6790e099124
* ext4 crypto: fix return value for ext4_es_scan()  Theodore Ts'o  2017-05-27  1  -1/+1
    Between 3.10 and 3.18, the abstraction to scan for objects in the slab cache which can be freed when the system is under memory pressure changed. When I backported the ext4 code from 3.18 to the 3.10 kernel, I didn't get the return value required by the calling conventions for the scan function correct, which could potentially cause the memory reclaimer to loop indefinitely.

    Change-Id: I1712fedf96fd91c911fb4d019d7ef76f6c4c1808
    Signed-off-by: "Theodore Ts'o" <tytso@google.com>
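    To show why the return value matters, here is a toy model of the two calling conventions. It is a sketch under assumptions: all names are hypothetical, and the description of the conventions reflects my understanding that the older single-callback interface reports how many objects remain in the cache, while the newer scan callback reports how many objects were freed. Returning one kind of number where the caller expects the other gives the reclaimer a wrong picture of its progress.

```c
/* Hypothetical cache that only tracks how many objects it holds. */
struct toy_cache {
    unsigned long nr_objects;
};

/*
 * Newer-style scan callback (assumed convention): free up to nr_to_scan
 * objects and return the number actually freed.
 */
unsigned long toy_scan_objects(struct toy_cache *cache,
                               unsigned long nr_to_scan)
{
    unsigned long freed = nr_to_scan < cache->nr_objects
                              ? nr_to_scan
                              : cache->nr_objects;

    cache->nr_objects -= freed;
    return freed;
}

/*
 * Older-style combined callback (assumed convention): nr_to_scan == 0 is
 * a pure query; in every case the return value is the number of objects
 * still left in the cache, not the number freed.
 */
long toy_legacy_shrink(struct toy_cache *cache, unsigned long nr_to_scan)
{
    if (nr_to_scan)
        toy_scan_objects(cache, nr_to_scan);

    return (long)cache->nr_objects;
}
```

    For a cache of 10 objects, asking each function to scan 4 gives different answers: the newer-style callback returns 4 (freed), the older-style one returns 6 (remaining).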
* ext4 crypto: allocate bounce pages using GFP_NOWAIT  Theodore Ts'o  2017-05-27  2  -23/+7
    Previously we allocated bounce pages using a combination of alloc_page() and mempool_alloc() with the __GFP_WAIT bit set. Instead, use mempool_alloc() with GFP_NOWAIT. The mempool_alloc() function will try using alloc_pages() initially, and then only use the mempool reserve of pages if alloc_pages() is unable to fulfill the request.

    This minimizes the impact on the mm layer when we need to do a large amount of writeback of encrypted files, as Jaeguk Kim had reported that under a heavy fio workload on a system with restricted amounts of memory (which, unfortunately, includes many mobile handsets), he had observed the OOM killer getting triggered several times.

    With GFP_NOWAIT, if the mempool_alloc() function fails, we will retry the page writeback at a later time; the function of the mempool is to ensure that we can write back at least 32 pages at a time, so we can more efficiently dispatch I/O under high memory pressure situations. In the future we should make this a tunable so we can determine the best tradeoff between permanently sequestering memory and the ability to quickly launder pages so we can free up memory quickly when necessary.

    Change-Id: I3dbb5eb9a3aa04f40e551338eee5e8d06f352fe8
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
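    The allocation order described above (regular allocator first, reserve second, fail without waiting) can be sketched in user space. This is a toy model under assumptions, not the kernel mempool API: `struct toy_mempool`, the `toy_*` functions, and the `allocator_has_memory` flag that simulates memory pressure are all hypothetical; only the reserve size of 32 pages comes from the message.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_POOL_PAGES 32      /* reserve size mentioned in the message */
#define TOY_PAGE_SIZE  4096    /* assumed page size */

/* Hypothetical reserve of preallocated pages. */
struct toy_mempool {
    void  *reserve[TOY_POOL_PAGES];
    size_t nr_free;
};

/* Fill the reserve up front; returns 0 on success, -1 on failure. */
int toy_mempool_init(struct toy_mempool *pool)
{
    pool->nr_free = 0;
    while (pool->nr_free < TOY_POOL_PAGES) {
        void *page = malloc(TOY_PAGE_SIZE);
        if (!page)
            return -1;
        pool->reserve[pool->nr_free++] = page;
    }
    return 0;
}

/*
 * Non-blocking allocation: try the regular allocator first, fall back to
 * the reserve, and return NULL instead of waiting when both are empty.
 * The caller is expected to retry the writeback later on NULL.
 */
void *toy_alloc_bounce_page(struct toy_mempool *pool, int allocator_has_memory)
{
    if (allocator_has_memory) {
        void *page = malloc(TOY_PAGE_SIZE);
        if (page)
            return page;
    }
    if (pool->nr_free > 0)
        return pool->reserve[--pool->nr_free];
    return NULL;
}

/* Return a page: refill the reserve first, otherwise free it. */
void toy_free_bounce_page(struct toy_mempool *pool, void *page)
{
    if (pool->nr_free < TOY_POOL_PAGES)
        pool->reserve[pool->nr_free++] = page;
    else
        free(page);
}
```

    Under simulated memory pressure the first 32 requests are served from the reserve and the next one fails immediately, which is the point at which the writeback would be retried later rather than blocking in the allocator.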