path: root/drivers/md
Commit message | Author | Age | Files | Lines
* Revert "GCC: Fix up for gcc 5+" | Moyster | 2018-11-30 | 1 | -1/+0
    This reverts commit ff505baaf412985af758d5820cd620ed9f1a7e05.
* Replace <asm/uaccess.h> with <linux/uaccess.h> globally | Linus Torvalds | 2018-11-29 | 1 | -1/+1
    This was entirely automated, using the script by Al:

      PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
      sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
            $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

    to do the replacement at the end of the merge window.

    Requested-by: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Moyster <oysterized@gmail.com>
* GCC: Fix up for gcc 5+ | mydongistiny | 2018-11-29 | 1 | -0/+1
    Signed-off-by: mydongistiny <jaysonedson@gmail.com>
    Signed-off-by: Mister Oyster <oysterized@gmail.com>
* ANDROID: dm verity: add minimum prefetch size | Keun-young Park | 2018-01-05 | 2 | -1/+24
    - For devices like eMMC, reading more hash blocks at a time gives
      better performance.
    - For Android, set the default to 128. For other devices, set it to 1,
      which is the same as now.
    - Saved 300 ms of boot-up time on the tested device.

    bug: 32246564
    Cc: Sami Tolvanen <samitolvanen@google.com>
    Signed-off-by: Keun-young Park <keunyoung@google.com>
* dm mpath: fix stalls when handling invalid ioctls | Hannes Reinecke | 2018-01-02 | 1 | -2/+5
    An invalid ioctl will never be valid, irrespective of whether multipath
    has active paths or not. So for invalid ioctls we do not have to wait
    for multipath to activate any paths, but can rather return an error
    code immediately. This fix resolves numerous instances of:

      udevd[]: worker [] unexpectedly returned with status 0x0100

    that have been seen during testing.

    Signed-off-by: Hannes Reinecke <hare@suse.de>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: Joe Maples <joe@frap129.org>
* md: be careful not lot leak internal curr_resync value into metadata. -- (all) | NeilBrown | 2018-01-02 | 1 | -1/+1
    mddev->curr_resync usually records where the current resync is up to,
    but during the starting phase it has some "magic" values.

     1 - means that the array is trying to start a resync, but has yielded
         to another array which shares physical devices, and also needs to
         start a resync
     2 - means the array is trying to start resync, but has found another
         array which shares physical devices and has already started resync.
     3 - means that resync has commenced, but it is possible that nothing
         has actually been resynced yet.

    It is important that this value not be visible to user-space and
    particularly that it doesn't get written to the metadata, as the resync
    or recovery checkpoint. In part, this is because it may be slightly
    higher than the correct value, though this is very rare. In part,
    because it is not a multiple of 4K, and some devices only support 4K
    aligned accesses.

    There are two places where this value is propagated into either
    ->curr_resync_completed or ->recovery_cp or ->recovery_offset. These
    currently avoid the propagation of values 1 and 2, but will allow 3 to
    leak through.

    Change them to only propagate the value if it is > 3.

    As this can cause an array to fail, the patch is suitable for -stable.

    Cc: stable@vger.kernel.org (v3.7+)
    Reported-by: Viswesh <viswesh.vichu@gmail.com>
    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
    Signed-off-by: Joe Maples <joe@frap129.org>
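The guard described in this message can be modelled in a few lines. This is a user-space sketch, not the md code: `RESYNC_MAGIC_MAX` and `checkpoint_to_record` are hypothetical names chosen for illustration.

```python
# Sketch only: the names below are hypothetical, not the kernel's.
RESYNC_MAGIC_MAX = 3  # values 1..3 are internal start-up states, not offsets


def checkpoint_to_record(curr_resync, previous_checkpoint):
    """Return the checkpoint that is safe to expose or write to metadata."""
    if curr_resync > RESYNC_MAGIC_MAX:
        return curr_resync       # a real resync position
    return previous_checkpoint   # keep the old value; never leak 1, 2 or 3
```

With the threshold at 3 rather than 2, the "resync has commenced" marker can no longer be recorded as if it were a sector offset.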
* dm: take care to copy the space map roots before locking the superblock | Joe Thornber | 2018-01-02 | 2 | -55/+85
    In theory copying the space map root can fail, but in practice it never
    does because we're careful to check what size buffer is needed. But
    make certain we're able to copy the space map roots before locking the
    superblock.

    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Cc: stable@vger.kernel.org # drop dm-era and dm-cache changes as needed
    Signed-off-by: Joe Maples <joe@frap129.org>
* dm thin: grab a virtual cell before looking up the mapping | Joe Thornber | 2018-01-02 | 1 | -4/+12
    Avoids normal IO racing with discard.

    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: Joe Maples <joe@frap129.org>
* dm thin: requeue bios to DM core if no_free_space and in read-only mode | Mike Snitzer | 2018-01-02 | 1 | -6/+20
    Now that we switch the pool to read-only mode when the data device runs
    out of space it causes active writers to get IO errors once we resume
    after resizing the data device.

    If no_free_space is set, save bios to the 'retry_on_resume_list' and
    requeue them on resume (once the data or metadata device may have been
    resized).

    With this patch the resize_io test passes again (on slower storage):

      dmtest run --suite thin-provisioning -n /resize_io/

    Later patches fix some subtle races associated with the pool mode
    transitions done as part of the pool's -ENOSPC handling. These races
    are exposed on fast storage (e.g. PCIe SSD).

    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Acked-by: Joe Thornber <ejt@redhat.com>
    [@nathanchance: fixed conflicts]
    Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
    Signed-off-by: Joe Maples <joe@frap129.org>
* dm cache: fix a lock-inversion | Joe Thornber | 2018-01-02 | 3 | -52/+20
    When suspending a cache the policy is walked and the individual policy
    hints written to the metadata via sync_metadata(). This led to this
    lock order:

      policy->lock
        cache_metadata->root_lock

    When loading the cache target the policy is populated while the
    metadata lock is held:

      cache_metadata->root_lock
        policy->lock

    Fix this potential lock-inversion (ABBA) deadlock in sync_metadata() by
    ensuring the cache_metadata root_lock is held whilst all the hints are
    written, rather than being repeatedly locked while policy->lock is held
    (as was the case with each callout that policy_walk_mappings() made to
    the old save_hint() method).

    Found by turning on the CONFIG_PROVE_LOCKING ("Lock debugging: prove
    locking correctness") build option. However, it is not clear how the
    LOCKDEP reported paths can lead to a deadlock since the two paths,
    suspending a target and loading a target, never occur at the same time.
    But that doesn't mean the same lock-inversion couldn't have occurred
    elsewhere.

    Reported-by: Marian Csontos <mcsontos@redhat.com>
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: Joe Maples <joe@frap129.org>
* dm cache: prevent corruption caused by discard_block_size > cache_block_size | Mike Snitzer | 2018-01-02 | 1 | -34/+3
    If the discard block size is larger than the cache block size we will
    not properly quiesce IO to a region that is about to be discarded. This
    results in a race between a cache migration where no copy is needed,
    and a write to an adjacent cache block that's within the same large
    discard block.

    Work around this by limiting the discard_block_size to
    cache_block_size. Also limit the max_discard_sectors to
    cache_block_size.

    A more comprehensive fix that introduces range locking support in the
    bio_prison and proper quiescing of a discard range that spans multiple
    cache blocks is already in development.

    Reported-by: Morgan Mears <Morgan.Mears@netapp.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Acked-by: Joe Thornber <ejt@redhat.com>
    Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
    Cc: stable@vger.kernel.org
    [@nathanchance: fixed conflicts]
    Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
    Signed-off-by: Joe Maples <joe@frap129.org>
* dm cache: fix race causing dirty blocks to be marked as clean | Anssi Hannula | 2018-01-02 | 1 | -2/+2
    When a writeback or a promotion of a block is completed, the cell of
    that block is removed from the prison, the block is marked as clean,
    and the clear_dirty() callback of the cache policy is called.

    Unfortunately, performing those actions in this order allows an
    incoming new write bio for that block to come in before clearing the
    dirty status is completed and therefore possibly causing one of these
    two scenarios:

    Scenario A:

      Thread 1                      Thread 2
      cell_defer()                  .
      - cell removed from prison    .
      - detained bios queued        .
      .                             incoming write bio
      .                             remapped to cache
      .                             set_dirty() called,
      .                               but block already dirty
      .                               => it does nothing
      clear_dirty()                 .
      - block marked clean          .
      - policy clear_dirty() called .

    Result: Block is marked clean even though it is actually dirty. No
    writeback will occur.

    Scenario B:

      Thread 1                      Thread 2
      cell_defer()                  .
      - cell removed from prison    .
      - detained bios queued        .
      clear_dirty()                 .
      - block marked clean          .
      .                             incoming write bio
      .                             remapped to cache
      .                             set_dirty() called
      .                             - block marked dirty
      .                             - policy set_dirty() called
      - policy clear_dirty() called .

    Result: Block is properly marked as dirty, but policy thinks it is
    clean and therefore never asks us to write it back. This case is
    visible in the "dmsetup status" dirty block count (which normally
    decreases to 0 on a quiet device).

    Fix these issues by calling clear_dirty() before calling cell_defer().
    Incoming bios for that block will then be detained in the cell and
    released only after clear_dirty() has completed, so the race will not
    occur.

    Found by inspecting the code after noticing spurious dirty counts
    (scenario B).

    Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi>
    Acked-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Cc: stable@vger.kernel.org
    [@nathanchance: fixed conflicts]
    Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
    Signed-off-by: Joe Maples <joe@frap129.org>
* dm cache: fix truncation bug when mapping I/O to >2TB fast device | Heinz Mauelshagen | 2018-01-02 | 1 | -2/+3
    When remapping a block to the cache's fast device that is larger than
    2TB we must not truncate the destination sector to 32bits. The 32bit
    temporary result of from_cblock() was being overflowed in
    remap_to_cache() due to the logical left shift.

    Use an intermediate 64bit type to store the 32bit from_cblock() result
    to fix the overflow.

    Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Cc: stable@vger.kernel.org
    [@nathanchance: fixed conflicts since no iters in 3.10]
    Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
    Signed-off-by: Joe Maples <joe@frap129.org>
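The overflow is easy to reproduce outside the kernel. The sketch below models 32-bit arithmetic with a mask; the function names and the 512-sector cache block size are assumptions made for the example, not taken from dm-cache.

```python
MASK32 = 0xFFFFFFFF


def remap_truncated(cblock, shift, offset_in_block):
    """Model of the bug: the shift happens in a 32-bit temporary and wraps."""
    base = (cblock << shift) & MASK32
    return base + offset_in_block


def remap_widened(cblock, shift, offset_in_block):
    """Model of the fix: widen the 32-bit block number before shifting."""
    base = (cblock & MASK32) << shift  # Python ints stand in for a 64-bit type
    return base + offset_in_block


# Assumed geometry: 512-sector cache blocks, so shift = 9.
# Cache block 2**23 starts at sector 2**32, which is exactly 2 TiB in.
wrapped = remap_truncated(2**23, 9, 5)
correct = remap_widened(2**23, 9, 5)
```

In this model `wrapped` lands at sector 5, near the start of the device, while `correct` is sector 2**32 + 5. That gap is the misdirected I/O the commit prevents.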
* dm cache: fix race affecting dirty block count | Anssi Hannula | 2018-01-02 | 1 | -7/+6
    nr_dirty is updated without locking, causing it to drift so that it is
    non-zero (either a small positive integer, or a very large one when an
    underflow occurs) even when there are no actual dirty blocks. This was
    due to a race between the workqueue and map function accessing nr_dirty
    in parallel without proper protection.

    People were seeing under runs due to a race on increment/decrement of
    nr_dirty, see: https://lkml.org/lkml/2014/6/3/648

    Fix this by using an atomic_t for nr_dirty.

    Reported-by: roma1390@gmail.com
    Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi>
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: Joe Maples <joe@frap129.org>
* dm cache: add block sizes and total cache blocks to status output | Mike Snitzer | 2018-01-02 | 1 | -5/+9
    Improve cache_status to emit:

      <metadata block size> <#used metadata blocks>/<#total metadata blocks>
      <cache block size> <#used cache blocks>/<#total cache blocks> ...

    Adding the block sizes allows for easier calculation of the overall
    size of both the metadata and cache devices. Adding <#total cache
    blocks> provides useful context for how much of the cache is used.

    Unfortunately these additions to the status will require updates to
    users' scripts that monitor the cache status. But these changes help
    provide more comprehensive information about the cache device and will
    simplify tools that are being developed to manage dm-cache devices --
    because they won't need to issue 3 operations to cobble together the
    information that we can easily provide via a single status ioctl.

    While updating the status documentation in cache.txt spaces were
    tabify'd.

    Requested-by: Jonathan Brassow <jbrassow@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Acked-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Joe Maples <joe@frap129.org>
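For scripts that need updating, a parser for the leading status fields might look like the sketch below. It assumes the input is only the target-specific part of the status line, in the field order given above; the function and key names are made up for the example, and the trailing fields are passed through untouched.

```python
def parse_cache_status(params):
    """Parse the leading fields of a dm-cache status string (sketch)."""
    fields = params.split()
    md_used, md_total = (int(n) for n in fields[1].split("/"))
    cache_used, cache_total = (int(n) for n in fields[3].split("/"))
    return {
        "metadata_block_size": int(fields[0]),
        "metadata_used": md_used,
        "metadata_total": md_total,
        "cache_block_size": int(fields[2]),
        "cache_used": cache_used,
        "cache_total": cache_total,
        "rest": fields[4:],  # remaining status fields, not interpreted here
    }


def cache_fill_ratio(status):
    """Fraction of cache blocks in use, from one status read."""
    return status["cache_used"] / status["cache_total"]
```

Multiplying a block size by the matching total gives the device size in whatever unit the block size is reported in, which is the calculation the message refers to.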
* dm ioctl: remove double parentheses | Matthias Kaehlcke | 2018-01-02 | 1 | -2/+2
    The extra pair of parentheses is not needed and causes clang to
    generate warnings about the DM_DEV_CREATE_CMD comparison in
    validate_params().

    Also remove another double parentheses that doesn't cause a warning.

    Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Joe Maples <joe@frap129.org>
* ANDROID: dm verity fec: initialize recursion level | Sami Tolvanen | 2017-12-27 | 1 | -0/+1
    Explicitly initialize recursion level to zero at the beginning of each
    I/O operation.

    Bug: 28943429
    Change-Id: I00c612be2b8c22dd5afb65a739551df91cb324fc
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    (cherry picked from commit 32ffb3a22d7fd269b2961323478ece92c06a8334)
* ANDROID: dm verity fec: limit error correction recursion | Sami Tolvanen | 2017-12-27 | 2 | -1/+14
    If verity tree itself is sufficiently corrupted in addition to data
    blocks, it's possible for error correction to end up in a deep
    recursive error correction loop that eventually causes a kernel panic
    as follows:

      [ 14.728962] [<ffffffc0008c1a14>] verity_fec_decode+0xa8/0x138
      [ 14.734691] [<ffffffc0008c3ee0>] verity_verify_level+0x11c/0x180
      [ 14.740681] [<ffffffc0008c482c>] verity_hash_for_block+0x88/0xe0
      [ 14.746671] [<ffffffc0008c1508>] fec_decode_rsb+0x318/0x75c
      [ 14.752226] [<ffffffc0008c1a14>] verity_fec_decode+0xa8/0x138
      [ 14.757956] [<ffffffc0008c3ee0>] verity_verify_level+0x11c/0x180
      [ 14.763944] [<ffffffc0008c482c>] verity_hash_for_block+0x88/0xe0

    This change limits the recursion to a reasonable level during a single
    I/O operation.

    Bug: 28943429
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    Change-Id: I0a7ebff331d259c59a5e03c81918cc1613c3a766
    (cherry picked from commit f4b9e40597e73942d2286a73463c55f26f61bfa7)
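A depth counter kept per I/O operation is one way to bound such a loop. The following is a user-space model of that idea only: `MAX_RECURSION`, `IoState` and `decode` are hypothetical names, and the limit of 4 is an assumption for the example, not a value stated in this message.

```python
MAX_RECURSION = 4  # assumed limit, for illustration only


class IoState:
    """Per-I/O state; the level starts at zero for every new operation."""

    def __init__(self):
        self.level = 0


def decode(io, nested_corrections):
    """Return True if correction succeeds within the recursion budget."""
    if io.level >= MAX_RECURSION:
        return False  # give up instead of recursing until the stack overflows
    io.level += 1
    try:
        if nested_corrections == 0:
            return True
        # Correcting this block needs another (corrupted) block corrected first.
        return decode(io, nested_corrections - 1)
    finally:
        io.level -= 1  # always unwind, on success and on failure
```

A chain needing three nested corrections succeeds here, while a longer chain returns failure to the caller instead of recursing further.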
* ANDROID: dm verity fec: add missing release from fec_ktype | Sami Tolvanen | 2017-12-27 | 1 | -1/+2
    Add a release function to allow destroying the dm-verity device.

    Bug: 27928374
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    Change-Id: Ic0f7c17e4889c5580d70b52d9a709a37165a5747
    (cherry picked from commit 0039ccf47c8f99888f7b71b2a36a68a027fbe357)
* ANDROID: dm: Mounting root as linear device when verity disabled | Badhri Jagan Sridharan | 2017-12-27 | 3 | -23/+113
    This CL adds the android-verity target as a linear dm device when the
    bootloader is unlocked and verity is disabled.

    Bug: 27175947
    Change-Id: Ic41ca4b8908fb2777263799cf3a3e25934d70f18
    Signed-off-by: Badhri Jagan Sridharan <Badhri@google.com>
* ANDROID: dm verity fec: add sysfs attribute fec/corrected | Sami Tolvanen | 2017-12-27 | 2 | -1/+47
    Add a sysfs entry that allows user space to determine whether dm-verity
    has come across correctable errors on the underlying block device.

    Bug: 22655252
    Bug: 27928374
    Change-Id: I80547a2aa944af2fb9ffde002650482877ade31b
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    (cherry picked from commit 7911fad5f0a2cf5afc2215657219a21e6630e001)
* ANDROID: dm: Add android verity target | Badhri Jagan Sridharan | 2017-12-27 | 6 | -8/+903
    This device-mapper target is virtually a VERITY target. It is set up by
    reading the metadata contents piggybacked onto the actual data blocks
    in the block device. The signature of the metadata contents is verified
    against the key included in the system keyring. Upon success, the
    underlying verity target is set up.

    BUG: 27175947
    Change-Id: I7e99644a0960ac8279f02c0158ed20999510ea97
    Signed-off-by: Badhri Jagan Sridharan <Badhri@google.com>
* CHROMIUM: dm: boot time specification of dm= | Will Drewry | 2017-12-27 | 2 | -0/+40
    This is a wrap-up of three patches pending upstream approval. I'm
    bundling them because they are interdependent, and it'll be easier to
    drop it on rebase later.

    1. dm: allow a dm-fs-style device to be shared via dm-ioctl

    Integrates feedback from Alasdair, Mike, and Kiyoshi.

    Two main changes occur here:

    - One function is added which allows for a programmatically created
      mapped device to be inserted into the dm-ioctl hash table. This binds
      the device to a name and, optionally, a uuid, which is needed by udev
      and allows for userspace management of the mapped device.
    - dm_table_complete() was extended to handle all of the final
      functional changes required for the table to be operational once
      called.

    2. init: boot to device-mapper targets without an initr*

    Add a dm= kernel parameter modeled after the md= parameter from
    do_mounts_md. It allows for device-mapper targets to be configured at
    boot time for use early in the boot process (as the root device or
    otherwise). It also replaces /dev/XXX calls with major:minor
    opportunistically.

    The format is dm="name uuid ro,table line 1,table line 2,...". The
    parser expects the comma to be safe to use as a newline substitute but,
    otherwise, uses the normal separator of space. Some attempt has been
    made to make it forgiving of additional spaces (using skip_spaces()).

    A mapped device created during boot will be assigned a minor of 0 and
    may be accessed via /dev/dm-0.

    An example dm-linear root with no uuid may look like:

      root=/dev/dm-0 dm="lroot none ro, 0 4096 linear /dev/ubdb 0, 4096 4096 linear /dev/ubdc 0"

    Once udev is started, /dev/dm-0 will become /dev/mapper/lroot.

    Older upstream threads:
      http://marc.info/?l=dm-devel&m=127429492521964&w=2
      http://marc.info/?l=dm-devel&m=127429499422096&w=2
      http://marc.info/?l=dm-devel&m=127429493922000&w=2

    Latest upstream threads:
      https://patchwork.kernel.org/patch/104859/
      https://patchwork.kernel.org/patch/104860/
      https://patchwork.kernel.org/patch/104861/

    BUG: 27175947
    Signed-off-by: Will Drewry <wad@chromium.org>
    Review URL: http://codereview.chromium.org/2020011
    Change-Id: I92bd53432a11241228d2e5ac89a3b20d19b05a31
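The dm= format described above can be illustrated with a small user-space parser. This is a sketch of the syntax as the message states it, not the kernel's parser: `parse_dm_param` is a hypothetical name, "none" is treated as "no uuid" as in the example, and reading any mode other than "ro" as writable is an assumption.

```python
def parse_dm_param(value):
    """Split a dm= value into its header fields and table lines (sketch)."""
    # Commas stand in for newlines; surrounding spaces are tolerated.
    head, *table = [part.strip() for part in value.split(",")]
    name, uuid, mode = head.split()
    return {
        "name": name,
        "uuid": None if uuid == "none" else uuid,
        "read_only": mode == "ro",  # assumption: anything else is writable
        "table": table,
    }


spec = parse_dm_param(
    "lroot none ro, 0 4096 linear /dev/ubdb 0, 4096 4096 linear /dev/ubdc 0"
)
```

Each entry in `spec["table"]` has the usual "start length target args" shape of a device-mapper table line, here two linear segments of 4096 sectors each.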
* dm verity: add ignore_zero_blocks feature | Sami Tolvanen | 2017-12-27 | 3 | -10/+89
    If ignore_zero_blocks is enabled dm-verity will return zeroes for
    blocks matching a zero hash without validating the content.

    Bug: 21893453
    Change-Id: Ib9552f872bd82b1ba6a090686d2934a9551a3b48
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    (cherry picked from commit 0b7462a60aad0c0819a138608c43998f3c46d6a8)
* dm verity: add support for forward error correction | Sami Tolvanen | 2017-12-27 | 6 | -6/+1050
    Add support for correcting corrupted blocks using Reed-Solomon.

    This code uses RS(255, N) interleaved across data and hash blocks. Each
    error-correcting block covers N bytes evenly distributed across the
    combined total data, so that each byte is a maximum distance away from
    the others. This makes it possible to recover from several consecutive
    corrupted blocks with relatively small space overhead.

    In addition, using verity hashes to locate erasures nearly doubles the
    effectiveness of error correction. Being able to detect corrupted
    blocks also improves performance, because only corrupted blocks need to
    be corrected.

    For a 2 GiB partition, RS(255, 253) (two parity bytes for each 253-byte
    block) can correct up to 16 MiB of consecutive corrupted blocks if
    erasures can be located, and 8 MiB if they cannot, with 16 MiB space
    overhead.

    Bug: 21893453
    Change-Id: Ib0372f49f45127e33bfe6b7182b0d608f56f3c7e
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    (cherry picked from commit a431c56bf1764448c12fd2d545b15466d552460c)
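The figures in the last paragraph follow from the standard Reed-Solomon bound: a codeword with r parity bytes can repair r erasures (known positions) or r/2 errors (unknown positions). A back-of-the-envelope check, ignoring block-size rounding and using a hypothetical helper name:

```python
MIB = 1 << 20
GIB = 1 << 30


def fec_budget(data_bytes, n=255, k=253):
    """Approximate overhead and correctable run length for RS(n, k)."""
    parity_per_codeword = n - k
    codewords = -(-data_bytes // k)  # ceiling division
    overhead = codewords * parity_per_codeword
    return {
        "overhead_bytes": overhead,
        # Interleaving spreads a consecutive run evenly over all codewords,
        # so each codeword may lose `parity_per_codeword` located bytes ...
        "max_located_bytes": overhead,
        # ... or half as many when the positions are unknown.
        "max_unlocated_bytes": overhead // 2,
    }


budget = fec_budget(2 * GIB)
```

For 2 GiB this gives a little over 16 MiB of overhead and located-erasure capacity, and a little over 8 MiB when errors cannot be located, in line with the numbers quoted in the message.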
* dm verity: factor out verity_for_bv_block() | Sami Tolvanen | 2017-12-27 | 2 | -27/+64
    verity_for_bv_block() will be re-used by optional dm-verity object.

    Bug: 21893453
    Change-Id: I82a3e6efdd95a488770a2fea6794befa8f5a35ce
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    (cherry picked from commit cce43a8a989fb5e245f5ca457e39edc22941c42e)
* dm verity: factor out structures and functions useful to separate object | Sami Tolvanen | 2017-12-27 | 2 | -109/+134
    Prepare for an optional verity object to make use of existing dm-verity
    structures and functions.

    Bug: 21893453
    Change-Id: I68b32d2a2ba044b73074410d9c8d916f44fb638d
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    (cherry picked from commit 212032ee8b4123dd001861e87fbde57084c3494e)
* dm verity: move dm-verity.c to dm-verity-target.c | Sami Tolvanen | 2017-12-27 | 2 | -0/+1
    Prepare for extending dm-verity with an optional object. Follows the
    naming convention used by other DM targets (e.g. dm-cache and dm-era).

    Bug: 21893453
    Change-Id: If5e416de81b7f8e7a7e20fb9fcc723af19b8067d
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    (cherry picked from commit 13e0ef92e4308e5541d1abf462bf12d80545a516)
* dm verity: separate function for parsing opt args | Sami Tolvanen | 2017-12-27 | 1 | -28/+43
    Move optional argument parsing into a separate function to make it
    easier to add more of them without making verity_ctr even longer.

    Bug: 21893453
    Change-Id: Iccc8d9de46674dedbcfbd8362a6048562af80be3
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    (cherry picked from commit e7f44b0ea4feabc1db477a8076553bea0969d7d4)
* dm verity: clean up duplicate hashing code | Sami Tolvanen | 2017-12-27 | 1 | -116/+147
    Handle dm-verity salting in one place to simplify the code.

    Bug: 21893453
    Change-Id: I09c5e81f88ba6a3bce0627f80458ad5571c724d0
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    (cherry picked from commit bbdba4c572104388ef687eb75f9655426a76d338)
* dm verity: port upstream changes to 3.10 | Sami Tolvanen | 2017-12-27 | 1 | -29/+73
    Upstream dm-verity has different optional parameters. Port back the
    relevant changes.

    Bug: 21893453
    Change-Id: I5431388e041d6829ad60d2c86dd113210ba6aff7
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    (cherry picked from commit 82cdd95a61c921c3c3063178c272b251573b596f)
* dm-verity: Add modes and emit uevent on corrupted blocks | Sami Tolvanen | 2017-12-27 | 1 | -9/+89
    Add a device specific mode to dm-verity for handling corrupted blocks:

    - DM_VERITY_MODE_EIO is the default behavior, where reading a corrupted
      block results in -EIO.
    - DM_VERITY_MODE_LOGGING only logs corrupted blocks, but does not block
      the read.
    - DM_VERITY_MODE_RESTART calls kernel_restart when a corrupted block is
      discovered.

    Each mode sends a uevent to notify userspace of corruption and allow
    further recovery actions.

    Defaults to previous behavior, other modes can be enabled with an
    optional parameter added to the verity table.

    Change-Id: Ib72ae6ccb865594d28f3553bdcc5a40b1d7af390
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
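The three modes amount to a small dispatch on what to do after the notification is sent. The sketch below is a user-space model of that description: `VerityMode`, `handle_corrupted_block` and the two callbacks are hypothetical stand-ins, and the real RESTART path would not return.

```python
import errno
from enum import Enum


class VerityMode(Enum):
    EIO = "eio"          # default: fail the read
    LOGGING = "logging"  # report, but let the read through
    RESTART = "restart"  # report, then restart the system


def handle_corrupted_block(mode, block, send_uevent, restart):
    """Return the read status for a corrupted block under the given mode."""
    send_uevent(block)  # every mode notifies userspace
    if mode is VerityMode.LOGGING:
        return 0
    if mode is VerityMode.RESTART:
        restart()  # stand-in for kernel_restart
    return -errno.EIO
```

Passing the notifier and the restart action as callbacks keeps the policy decision testable without touching a real device.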
* dm: Stop the dm_request calls in idle | AnilKumar Chimata | 2017-12-27 | 1 | -1/+1
    Even when there is no pending peek work, the dm layer tries to schedule
    the work with a 10 ms delay. This patch fixes the issue by putting back
    the table entry.

    Change-Id: I0e5df117fae74ae37a621862f254220706f0d840
    Signed-off-by: AnilKumar Chimata <anilc@codeaurora.org>
* UPSTREAM: block: disable entropy contributions for nonrot devices | Mike Snitzer | 2017-12-27 | 1 | -0/+1
    (cherry picked from commit b277da0a8a594308e17881f4926879bd5fca2a2d)

    Clear QUEUE_FLAG_ADD_RANDOM in all block drivers that set
    QUEUE_FLAG_NONROT.

    Historically, all block devices have automatically made entropy
    contributions. But as previously stated in commit e2e1a148 ("block: add
    sysfs knob for turning off disk entropy contributions"):

    - On SSD disks, the completion times aren't as random as they are for
      rotational drives. So it's questionable whether they should
      contribute to the random pool in the first place.
    - Calling add_disk_randomness() has a lot of overhead.

    There are more reliable sources for randomness than non-rotational
    block devices. From a security perspective it is better to err on the
    side of caution than to allow entropy contributions from unreliable
    "random" sources.

    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
    Signed-off-by: Mister Oyster <oysterized@gmail.com>
* dm-crypt: remove io_pool | Mikulas Patocka | 2017-12-27 | 1 | -20/+1
    Remove io_pool and _crypt_io_pool because they are unused.

    CRs-fixed: 670391
    Change-Id: I71400ecda66902c56c3af981e8b739f156db1e27
    Signed-off-by: Mikulas Patocka <mpatock@redhat.com>
    Patch-mainline: dm-devel @ 04/05/14, 14:08
    Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
* dm-crypt: sort writes | Mikulas Patocka | 2017-12-27 | 1 | -15/+35
    Write requests are sorted in a red-black tree structure and are
    submitted in the sorted order.

    In theory the sorting should be performed by the underlying disk
    scheduler, however, in practice the disk scheduler accepts and sorts
    only 128 requests. In order to sort more requests, we need to implement
    our own sorting.

    CRs-fixed: 670391
    Change-Id: Iffd9345fa1253f5cf2a556893ed36e08f1ac51aa
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Patch-mainline: dm-devel @ 04/05/14, 14:09
    Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
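The effect of sorting before submission can be shown with a toy queue. Note the swap: the commit uses a red-black tree keyed by sector, while this sketch uses a binary heap from the Python standard library, which gives the same submission order. `WriteQueue` and its methods are hypothetical names.

```python
import heapq
from itertools import count


class WriteQueue:
    """Collect finished writes in any order, hand them out by sector."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker so payloads are never compared

    def add(self, sector, bio):
        heapq.heappush(self._heap, (sector, next(self._seq), bio))

    def drain(self):
        """Yield queued writes in ascending sector order."""
        while self._heap:
            sector, _, bio = heapq.heappop(self._heap)
            yield sector, bio


queue = WriteQueue()
for sector, bio in [(4096, "c"), (8, "a"), (512, "b")]:
    queue.add(sector, bio)  # arrival order reflects encryption completion
ordered = list(queue.drain())
```

Here `ordered` comes out as sectors 8, 512, 4096 regardless of the order in which the encryption workers finished.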
* dm-crypt: offload writes to thread | Mikulas Patocka | 2017-12-27 | 1 | -23/+97
    Submitting write bios directly in the encryption thread caused serious
    performance degradation. On a multiprocessor machine, encryption
    requests finish in a different order than they were submitted in.
    Consequently, write requests would be submitted in a different order
    and it could cause severe performance degradation.

    This patch moves submitting write requests to a separate thread so that
    the requests can be sorted before submitting. Sorting is implemented in
    the next patch.

    Note: it is required that a previous patch "dm-crypt: don't allocate
    pages for a partial request." is applied before applying this patch.
    Without that, this patch could introduce a crash.

    CRs-fixed: 670391
    Change-Id: I886ed2da0ff174d3539ea18e27170d7fd1062680
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Patch-mainline: dm-devel @ 04/05/14, 14:08
    Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
* dm-crypt: avoid deadlock in mempools | Mikulas Patocka | 2017-12-27 | 1 | -5/+36
    This patch fixes a theoretical deadlock introduced in the previous
    patch.

    The function crypt_alloc_buffer may be called concurrently. If we
    allocate from the mempool concurrently, there is a possibility of
    deadlock. For example, if we have a mempool of 256 pages and two
    processes, each wanting 256 pages, allocate from the mempool
    concurrently, it may deadlock in a situation where both processes have
    allocated 128 pages and the mempool is exhausted.

    In order to avoid this scenario, we allocate the pages under a mutex.

    In order to not degrade performance with excessive locking, we try
    non-blocking allocations without a mutex first and if it fails, we fall
    back to a blocking allocation with a mutex.

    CRs-fixed: 670391
    Change-Id: I6c391dece4ba44fe0b2e9b75ea2b9235bf1b525b
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Patch-mainline: dm-devel @ 04/05/14, 14:07
    Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
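The two-phase allocation can be modelled in user space with a semaphore standing in for the mempool. This is a sketch under stated assumptions, not the dm-crypt code: `PagePool` and its methods are hypothetical, pages are just counted, and a request is assumed never to exceed the pool size.

```python
import threading


class PagePool:
    """Counting model of a page pool with a serialized slow path."""

    def __init__(self, pages):
        self._free = threading.Semaphore(pages)
        self._alloc_lock = threading.Lock()  # taken only on the slow path

    def release(self, count):
        for _ in range(count):
            self._free.release()

    def alloc_buffer(self, count):
        """Take `count` pages; report which path satisfied the request."""
        taken = 0
        for _ in range(count):  # fast path: no mutex, never sleeps
            if not self._free.acquire(blocking=False):
                break
            taken += 1
        if taken == count:
            return "fast"
        self.release(taken)  # give back the partial allocation first
        with self._alloc_lock:  # slow path: one blocking allocator at a time
            for _ in range(count):
                self._free.acquire()
        return "slow"
```

Because a blocked allocator holds no pages from its failed attempt and only one allocator may sleep at a time, two callers can no longer end up each holding half of the pool while waiting for the other half.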
* dm-crypt: don't allocate pages for a partial request | Mikulas Patocka | 2017-12-27 | 1 | -110/+29
    This patch changes crypt_alloc_buffer so that it always allocates pages
    for a full request.

    This change enables further simplification and the removal of one
    refcount in the next patches.

    Note: the next patch is needed to fix a theoretical deadlock.

    CRs-fixed: 670391
    Change-Id: I7bcadac8b3450976366c701fceb1fee7cb18df85
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    [joonwoop@codeaurora.org: resolve trivial merge conflicts]
    Patch-mainline: dm-devel @ 04/05/14, 14:07
    Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
* dm-crypt: use per-bio data | Mikulas Patocka | 2017-12-27 | 1 | -14/+26
    This patch changes dm-crypt so that it uses auxiliary data allocated
    with the bio.

    Dm-crypt requires two allocations per request - struct dm_crypt_io and
    struct ablkcipher_request (with other data appended to it). It used
    mempool for the allocation.

    Some requests may require more dm_crypt_ios and ablkcipher_requests,
    however most requests need just one of each of these two structures to
    complete.

    This patch changes it so that the first dm_crypt_io and
    ablkcipher_request are allocated with the bio (using target
    per_bio_data_size option). If the request needs additional values, they
    are allocated from the mempool.

    CRs-fixed: 670391
    Change-Id: I8abc48a021391398f3b35bdd4ac9efbbec3a9fa5
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Patch-mainline: dm-devel @ 04/05/14, 14:05
    Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
* dm-crypt: run in a WQ_HIGHPRI workqueue | Tim Murray | 2017-12-27 | 1 | -1/+3
    Running dm-crypt in a standard workqueue results in IO competing for
    CPU time with standard user apps, which can lead to pipeline bubbles
    and seriously degraded performance. Move to a WQ_HIGHPRI workqueue to
    protect against that.

    bug 25392275
    Change-Id: I589149a31c7b5d322fe2ed5b2476b1f6e3d5ee6f
* dm-crypt: use unbound workqueue for request processing | Mikulas Patocka | 2017-12-27 | 1 | -4/+2
    Use unbound workqueue so that work is automatically balanced between
    available CPUs.

    CRs-fixed: 670391
    Change-Id: I169099d0b5b27535633c9d3aaab2037b5fea6aa9
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    [joonwoop@codeaurora.org: resolve trivial merge conflict]
    Patch-mainline: dm-devel @ 04/05/14, 14:06
    Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
* BACKPORT: dm bufio: don't take the lock in dm_bufio_shrink_count | Mikulas Patocka | 2017-12-10 | 1 | -8/+8
    dm_bufio_shrink_count() is called from do_shrink_slab to find out how
    many freeable objects there are. The reported value doesn't have to be
    precise, so we don't need to take the dm-bufio lock.

    Suggested-by: David Rientjes <rientjes@google.com>
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Bug: 64122284
    Change-Id: Id2c3446e03e865f424be8666b1ee0822b9e33a63
    (cherry picked from commit d12067f428c037b4575aaeb2be00847fc214c24a)
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>
* BACKPORT: dm bufio: avoid sleeping while holding the dm_bufio lockDouglas Anderson2017-12-101-2/+3
| We've seen in-field reports showing _lots_ (18 in one case, 41 in
| another) of tasks all sitting there blocked on:
|
|   mutex_lock+0x4c/0x68
|   dm_bufio_shrink_count+0x38/0x78
|   shrink_slab.part.54.constprop.65+0x100/0x464
|   shrink_zone+0xa8/0x198
|
| In the two cases analyzed, we see one task that looks like this:
|
|   Workqueue: kverityd verity_prefetch_io
|
|   __switch_to+0x9c/0xa8
|   __schedule+0x440/0x6d8
|   schedule+0x94/0xb4
|   schedule_timeout+0x204/0x27c
|   schedule_timeout_uninterruptible+0x44/0x50
|   wait_iff_congested+0x9c/0x1f0
|   shrink_inactive_list+0x3a0/0x4cc
|   shrink_lruvec+0x418/0x5cc
|   shrink_zone+0x88/0x198
|   try_to_free_pages+0x51c/0x588
|   __alloc_pages_nodemask+0x648/0xa88
|   __get_free_pages+0x34/0x7c
|   alloc_buffer+0xa4/0x144
|   __bufio_new+0x84/0x278
|   dm_bufio_prefetch+0x9c/0x154
|   verity_prefetch_io+0xe8/0x10c
|   process_one_work+0x240/0x424
|   worker_thread+0x2fc/0x424
|   kthread+0x10c/0x114
|
| ...and that looks to be the one holding the mutex.
|
| The problem has been reproduced fairly easily:
| 0. Be running Chrome OS w/ verity enabled on the root filesystem
| 1. Pick test patch: http://crosreview.com/412360
| 2. Install launchBalloons.sh and balloon.arm
|    from http://crbug.com/468342
|    ...that's just a memory stress test app.
| 3. On a 4GB rk3399 machine, run
|    nice ./launchBalloons.sh 4 900 100000
|    ...that tries to eat 4 * 900 MB of memory and keep accessing.
| 4. Log in to the Chrome web browser and restore many tabs
|
| With that, I've seen printouts like:
|   DOUG: long bufio 90758 ms
| ...and the stack trace always shows we're in dm_bufio_prefetch().
|
| The problem is that we try to allocate memory with GFP_NOIO while
| we're holding the dm_bufio lock. Instead we should be using
| GFP_NOWAIT. Using GFP_NOIO can cause us to sleep while holding the
| lock and that causes the above problems.
|
| The current behavior explained by David Rientjes:
|
|   It will still try reclaim initially because __GFP_WAIT (or
|   __GFP_KSWAPD_RECLAIM) is set by GFP_NOIO. This is the cause of
|   contention on dm_bufio_lock() that the thread holds. You want to
|   pass GFP_NOWAIT instead of GFP_NOIO to alloc_buffer() when holding a
|   mutex that can be contended by a concurrent slab shrinker (if
|   count_objects didn't use a trylock, this pattern would trivially
|   deadlock).
|
| This change significantly increases responsiveness of the system while
| in this state. It makes a real difference because it unblocks kswapd.
| In the bug report analyzed, kswapd was hung:
|
|   kswapd0 D ffffffc000204fd8 0 72 2 0x00000000
|   Call trace:
|   [<ffffffc000204fd8>] __switch_to+0x9c/0xa8
|   [<ffffffc00090b794>] __schedule+0x440/0x6d8
|   [<ffffffc00090bac0>] schedule+0x94/0xb4
|   [<ffffffc00090be44>] schedule_preempt_disabled+0x28/0x44
|   [<ffffffc00090d900>] __mutex_lock_slowpath+0x120/0x1ac
|   [<ffffffc00090d9d8>] mutex_lock+0x4c/0x68
|   [<ffffffc000708e7c>] dm_bufio_shrink_count+0x38/0x78
|   [<ffffffc00030b268>] shrink_slab.part.54.constprop.65+0x100/0x464
|   [<ffffffc00030dbd8>] shrink_zone+0xa8/0x198
|   [<ffffffc00030e578>] balance_pgdat+0x328/0x508
|   [<ffffffc00030eb7c>] kswapd+0x424/0x51c
|   [<ffffffc00023f06c>] kthread+0x10c/0x114
|   [<ffffffc000203dd0>] ret_from_fork+0x10/0x40
|
| By unblocking kswapd, memory pressure should be reduced.
|
| Suggested-by: David Rientjes <rientjes@google.com>
| Reviewed-by: Guenter Roeck <linux@roeck-us.net>
| Signed-off-by: Douglas Anderson <dianders@chromium.org>
| Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| Bug: 64122284
| Change-Id: I1ce9367c921d7ab07ca9e3d403c95cd0d333915c
| (cherry picked from commit 9ea61cac0b1ad0c09022f39fd97e9b99a2cfc2dc)
| Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
| Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>
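The locking rule in the dm bufio message can be illustrated with a small userspace C model. This is only a sketch under stated assumptions, not the dm-bufio code: `struct toy_cache`, `toy_alloc()`, and the `TOY_NOWAIT` / `TOY_MAY_SLEEP` modes are hypothetical names standing in for the client lock and for GFP_NOWAIT / GFP_NOIO, and "sleeping in reclaim" is simulated by a counter rather than by real blocking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for GFP_NOWAIT and GFP_NOIO. */
enum toy_mode { TOY_NOWAIT, TOY_MAY_SLEEP };

/* Toy cache: "locked" models holding the dm_bufio lock. */
struct toy_cache {
	bool locked;               /* true while the client lock is held */
	size_t free_bufs;          /* buffers available without reclaim */
	unsigned slept_under_lock; /* times we "slept" with the lock held */
};

/*
 * Allocate one buffer. When nothing is free, a TOY_NOWAIT request fails
 * at once; a TOY_MAY_SLEEP request simulates reclaim, and we record the
 * event if it happened while the lock was held.
 */
static void *toy_alloc(struct toy_cache *c, enum toy_mode mode)
{
	assert(c != NULL);

	if (c->free_bufs == 0) {
		if (mode == TOY_NOWAIT)
			return NULL;       /* fail fast, never stall lock waiters */
		if (c->locked)
			c->slept_under_lock++; /* the situation the commit avoids */
		c->free_bufs = 1;          /* pretend reclaim freed one buffer */
	}
	c->free_bufs--;
	return malloc(64);
}
```

In this model a caller that holds the lock and asks with `TOY_NOWAIT` gets `NULL` immediately when memory is tight, so tasks waiting on the lock (the shrinker and kswapd in the traces above) are not stuck behind a sleeping allocator.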
* md: fix super_offset endianness in super_1_rdev_size_change  Jason Yan  2017-11-06  1  -1/+1
| commit 3fb632e40d7667d8bedfabc28850ac06d5493f54 upstream.
|
| The sb->super_offset should be little-endian, but the rdev->sb_start is
| in host byte order, so fix this by adding cpu_to_le64.
|
| Signed-off-by: Jason Yan <yanaijie@huawei.com>
| Signed-off-by: Shaohua Li <shli@fb.com>
| Signed-off-by: Willy Tarreau <w@1wt.eu>
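The byte-order issue can be shown with a portable userspace sketch. The names `toy_cpu_to_le64()`, `struct toy_sb`, and `toy_set_super_offset()` are hypothetical; the kernel's real helper is `cpu_to_le64()`, and the real superblock structure has many more fields. The point is only that an on-disk little-endian field must be written through a conversion, not assigned directly from a host-order value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-disk structure: the field is stored little-endian. */
struct toy_sb {
	uint64_t super_offset; /* little-endian on disk */
};

/* Return a value whose in-memory bytes are the little-endian form of host. */
static uint64_t toy_cpu_to_le64(uint64_t host)
{
	uint8_t bytes[8];
	uint64_t le;
	int i;

	for (i = 0; i < 8; i++)
		bytes[i] = (uint8_t)(host >> (8 * i)); /* least significant first */
	memcpy(&le, bytes, sizeof(le));
	return le;
}

/* Store a host-order sector number into the little-endian field. */
static void toy_set_super_offset(struct toy_sb *sb, uint64_t sb_start)
{
	assert(sb != NULL);
	sb->super_offset = toy_cpu_to_le64(sb_start);
}
```

On a little-endian host the conversion is a no-op, which is why a missing conversion like this one only misbehaves on big-endian machines.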
* md/raid10: submit bio directly to replacement disk  Shaohua Li  2017-11-06  1  -3/+16
| commit 6d399783e9d4e9bd44931501948059d24ad96ff8 upstream.
|
| Commit 57c67df (md/raid10: submit IO from originating thread instead of
| md thread) submits bio directly for normal disks but not for replacement
| disks. There is no reason not to do this for replacement disks as well.
|
| Cc: NeilBrown <neilb@suse.com>
| Signed-off-by: Shaohua Li <shli@fb.com>
| Signed-off-by: Willy Tarreau <w@1wt.eu>
* md/bitmap: disable bitmap_resize for file-backed bitmaps.  NeilBrown  2017-11-06  1  -0/+5
| commit e8a27f836f165c26f867ece7f31eb5c811692319 upstream.
|
| bitmap_resize() does not work for file-backed bitmaps. The buffer_heads
| are allocated and initialized when the bitmap is read from the file, but
| resize doesn't read from the file, it loads from the internal bitmap.
| When it comes time to write the new bitmap, the bh is non-existent and
| we crash.
|
| The common case when growing an array involves making the array larger,
| and that normally means making the bitmap larger. Doing that inside the
| kernel is possible, but would need more code. It is probably easier to
| require people who use file-backed bitmaps to remove them and re-add
| after a reshape.
|
| So this patch disables the resizing of arrays which have file-backed
| bitmaps. This is better than crashing.
|
| Reported-by: Zhilong Liu <zlliu@suse.com>
| Fixes: d60b479d177a ("md/bitmap: add bitmap_resize function to allow bitmap resizing.")
| Cc: stable@vger.kernel.org (v3.5+).
| Signed-off-by: NeilBrown <neilb@suse.com>
| Signed-off-by: Shaohua Li <shli@fb.com>
| Signed-off-by: Willy Tarreau <w@1wt.eu>
* Revert "dm ioctl: prevent stack leak in dm ioctl call"  Jonathan Solnit  2017-08-30  1  -1/+1
| This reverts commit 1d5b6ba1bfe0ce28eca6fa79a74d0088e706e63e.
|
| Bug: 35644370
| Change-Id: I0880d5f11cd22547934a13b7aa564a4102b95aa9
| Signed-off-by: Jonathan Solnit <jsolnit@google.com>
* md linear: fix a race between linear_add() and linear_congested()  colyli@suse.de  2017-07-04  2  -1/+29
| commit 03a9e24ef2aaa5f1f9837356aed79c860521407a upstream.
|
| Recently I received a bug report that on a Linux v3.0 based kernel,
| hot-adding a disk to an md linear device causes a kernel crash in
| linear_congested(). From the crash image analysis, I find that in
| linear_congested(), mddev->raid_disks contains value N, but conf->disks[]
| only has N-1 pointers available. Then a NULL pointer dereference crashes
| the kernel.
|
| There is a race between linear_add() and linear_congested(); the RCU
| code used in these two functions cannot avoid the race. Since Linux v4.0
| the RCU code has been replaced by mddev_suspend(). After checking the
| upstream code, it seems linear_congested() is not called in the
| generic_make_request() code path, so mddev_suspend() cannot prevent it
| from being called. The possible race still exists.
|
| Here I explain how the race still exists in current code. On a machine
| with many CPUs, on one CPU linear_add() is called to add a hard disk to
| an md linear device; at the same time on another CPU, linear_congested()
| is called to detect whether this md linear device is congested before
| issuing an I/O request to it.
|
| Now I use a possible code execution time sequence to demo how the
| possible race happens:
|
| seq  linear_add()              linear_congested()
|  0                             conf=mddev->private
|  1   oldconf=mddev->private
|  2   mddev->raid_disks++
|  3                             for (i=0; i<mddev->raid_disks;i++)
|  4                               bdev_get_queue(conf->disks[i].rdev->bdev)
|  5   mddev->private=newconf
|
| In linear_add(), mddev->raid_disks is increased in time seq 2, and on
| another CPU in linear_congested() the for-loop iterates conf->disks[i] by
| the increased mddev->raid_disks in time seq 3,4. But conf with one more
| element (which is a pointer to struct dev_info type) in conf->disks[] is
| not updated yet, so accessing its structure member in time seq 4 will
| cause a NULL pointer dereference fault.
|
| To fix this race, there are 2 parts of modification in the patch:
| 1) Add 'int raid_disks' in struct linear_conf, as a copy of
|    mddev->raid_disks. It is initialized in linear_conf(), and is always
|    consistent with the number of pointers in 'struct dev_info disks[]'.
|    When iterating conf->disks[] in linear_congested(), use
|    conf->raid_disks instead of mddev->raid_disks in the for-loop; then
|    the NULL pointer dereference will not happen again.
| 2) The RCU protection is restored, and linear_add() uses kfree_rcu() to
|    free the oldconf memory. Because oldconf may still be referenced as
|    mddev->private in linear_congested(), kfree_rcu() makes sure that its
|    memory will not be released until no one uses it any more.
| Also some code comments are added in this patch, to make this
| modification easier to understand.
|
| This patch can be applied to kernels since v4.0 after commit:
| 3be260cc18f8 ("md/linear: remove rcu protections in favour of
| suspend/resume"). But this bug was reported on a Linux v3.0 based kernel;
| people who maintain kernels before Linux v4.0 need to backport this
| patch.
|
| Changelog:
| - v3: add 'int raid_disks' in struct linear_conf, and use kfree_rcu() to
|       replace call_rcu() in linear_add().
| - v2: add RCU stuffs by suggestion from Shaohua and Neil.
| - v1: initial effort.
|
| Signed-off-by: Coly Li <colyli@suse.de>
| Cc: Shaohua Li <shli@fb.com>
| Cc: Neil Brown <neilb@suse.com>
| Signed-off-by: Shaohua Li <shli@fb.com>
| Signed-off-by: Willy Tarreau <w@1wt.eu>
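Part 1 of the md linear fix (keeping the element count inside the same object as the array it describes) can be sketched in plain userspace C. This is a simplified, single-threaded model and not the kernel code: the `toy_` types and functions are hypothetical, the per-disk state is reduced to one flag, and the RCU publication and deferred free (`rcu_assign_pointer()`, `rcu_dereference()`, `kfree_rcu()` in the kernel) are deliberately left out.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_dev_info {
	int congested; /* stand-in for the per-disk queue state */
};

struct toy_linear_conf {
	int raid_disks;              /* copy kept next to the array */
	struct toy_dev_info disks[]; /* exactly raid_disks entries */
};

struct toy_mddev {
	int raid_disks;                  /* may be bumped before conf is swapped */
	struct toy_linear_conf *private; /* current configuration */
};

/* Allocate a zeroed configuration whose count always matches its array. */
static struct toy_linear_conf *toy_conf_alloc(int raid_disks)
{
	struct toy_linear_conf *conf;

	conf = calloc(1, sizeof(*conf) +
			 (size_t)raid_disks * sizeof(struct toy_dev_info));
	if (conf)
		conf->raid_disks = raid_disks;
	return conf;
}

/* Reader: bound the loop by conf->raid_disks, never by mddev->raid_disks. */
static int toy_congested(const struct toy_mddev *mddev)
{
	const struct toy_linear_conf *conf = mddev->private;
	int i, ret = 0;

	assert(conf != NULL);
	for (i = 0; i < conf->raid_disks; i++)
		ret |= conf->disks[i].congested;
	return ret;
}
```

Because the reader takes both the count and the array from one `conf` pointer, it sees either the old pair or the new pair, regardless of when `mddev->raid_disks` is incremented.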
* md:raid1: fix a dead loop when read from a WriteMostly disk  Wei Fang  2017-07-04  1  -1/+1
| commit 816b0acf3deb6d6be5d0519b286fdd4bafade905 upstream.
|
| If first_bad == this_sector when we get the WriteMostly disk in
| read_balance(), a valid disk will be returned with zero max_sectors. This
| leads to an infinite loop in make_request(), and OOM will happen because
| of endless allocation of struct bio.
|
| Since we can't get data from this disk in this case, continue to another
| disk.
|
| Signed-off-by: Wei Fang <fangwei1@huawei.com>
| Signed-off-by: Shaohua Li <shli@fb.com>
| Cc: Julia Lawall <julia.lawall@lip6.fr>
| Signed-off-by: Jiri Slaby <jslaby@suse.cz>
| Signed-off-by: Willy Tarreau <w@1wt.eu>
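The condition described in the raid1 message can be sketched as a small userspace C function. This is an illustration under assumptions, not the kernel's read_balance(): `toy_good_sectors()` and `toy_sector_t` are hypothetical, and `has_bad` is assumed to mean that a bad-block range overlaps the requested read. A return value of 0 means "skip this disk and try another".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t toy_sector_t;

/*
 * Return how many sectors can be read from this disk starting at
 * this_sector, or 0 if the disk must be skipped for this read.
 */
static toy_sector_t toy_good_sectors(toy_sector_t this_sector,
				     toy_sector_t sectors,
				     bool has_bad, toy_sector_t first_bad)
{
	assert(sectors > 0);

	if (!has_bad)
		return sectors;
	/*
	 * A bad range that starts at or before the read leaves nothing
	 * readable. Treating first_bad == this_sector as usable would
	 * return a disk with zero good sectors, so the caller would
	 * retry forever without making progress.
	 */
	if (first_bad <= this_sector)
		return 0;
	return first_bad - this_sector;
}
```

For example, a read of 8 sectors at sector 100 with the first bad block at sector 104 can use 4 sectors from this disk, while a first bad block at sector 100 gives 0 and the disk is skipped.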