aboutsummaryrefslogtreecommitdiff
path: root/include/uapi
Commit message (Collapse)AuthorAgeFilesLines
* tcp: tcp_fragment() should apply sane memory limitsEric Dumazet2019-07-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f070ef2ac66716357066b683fb0baf55f8191a2e upstream. Jonathan Looney reported that a malicious peer can force a sender to fragment its retransmit queue into tiny skbs, inflating memory usage and/or overflow 32bit counters. TCP allows an application to queue up to sk_sndbuf bytes, so we need to give some allowance for non malicious splitting of retransmit queue. A new SNMP counter is added to monitor how many times TCP did not allow to split an skb if the allowance was exceeded. Note that this counter might increase in the case applications use SO_SNDBUF socket option to lower sk_sndbuf. CVE-2019-11478 : tcp_fragment, prevent fragmenting a packet when the socket is already using more than half the allowed space Change-Id: I594a9f68263f774fa6f0824042bc287bba6dc927 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Cc: Bruce Curtis <brucec@netflix.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* partial merge fix - BACKPORT: random: introduce getrandom(2) system callTheodore Ts'o2019-07-071-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Almost clean cherry pick of c6e9d6f38894798696f23c8084ca7edbf16ee895, includes change made by merge 0891ad829d2a0501053703df66029e843e3b8365. The getrandom(2) system call was requested by the LibreSSL Portable developers. It is analoguous to the getentropy(2) system call in OpenBSD. The rationale of this system call is to provide resiliance against file descriptor exhaustion attacks, where the attacker consumes all available file descriptors, forcing the use of the fallback code where /dev/[u]random is not available. Since the fallback code is often not well-tested, it is better to eliminate this potential failure mode entirely. The other feature provided by this new system call is the ability to request randomness from the /dev/urandom entropy pool, but to block until at least 128 bits of entropy has been accumulated in the /dev/urandom entropy pool. Historically, the emphasis in the /dev/urandom development has been to ensure that urandom pool is initialized as quickly as possible after system boot, and preferably before the init scripts start execution. This is because changing /dev/urandom reads to block represents an interface change that could potentially break userspace which is not acceptable. In practice, on most x86 desktop and server systems, in general the entropy pool can be initialized before it is needed (and in modern kernels, we will printk a warning message if not). However, on an embedded system, this may not be the case. And so with this new interface, we can provide the functionality of blocking until the urandom pool has been initialized. Any userspace program which uses this new functionality must take care to assure that if it is used during the boot process, that it will not cause the init scripts or other portions of the system startup to hang indefinitely. SYNOPSIS #include <linux/random.h> int getrandom(void *buf, size_t buflen, unsigned int flags); DESCRIPTION The system call getrandom() fills the buffer pointed to by buf with up to buflen random bytes which can be used to seed user space random number generators (i.e., DRBG's) or for other cryptographic uses. It should not be used for Monte Carlo simulations or other programs/algorithms which are doing probabilistic sampling. If the GRND_RANDOM flags bit is set, then draw from the /dev/random pool instead of the /dev/urandom pool. The /dev/random pool is limited based on the entropy that can be obtained from environmental noise, so if there is insufficient entropy, the requested number of bytes may not be returned. If there is no entropy available at all, getrandom(2) will either block, or return an error with errno set to EAGAIN if the GRND_NONBLOCK bit is set in flags. If the GRND_RANDOM bit is not set, then the /dev/urandom pool will be used. Unlike using read(2) to fetch data from /dev/urandom, if the urandom pool has not been sufficiently initialized, getrandom(2) will block (or return -1 with the errno set to EAGAIN if the GRND_NONBLOCK bit is set in flags). The getentropy(2) system call in OpenBSD can be emulated using the following function: int getentropy(void *buf, size_t buflen) { int ret; if (buflen > 256) goto failure; ret = getrandom(buf, buflen, 0); if (ret < 0) return ret; if (ret == buflen) return 0; failure: errno = EIO; return -1; } RETURN VALUE On success, the number of bytes that was filled in the buf is returned. This may not be all the bytes requested by the caller via buflen if insufficient entropy was present in the /dev/random pool, or if the system call was interrupted by a signal. On error, -1 is returned, and errno is set appropriately. ERRORS EINVAL An invalid flag was passed to getrandom(2) EFAULT buf is outside the accessible address space. EAGAIN The requested entropy was not available, and getentropy(2) would have blocked if the GRND_NONBLOCK flag was not set. EINTR While blocked waiting for entropy, the call was interrupted by a signal handler; see the description of how interrupted read(2) calls on "slow" devices are handled with and without the SA_RESTART flag in the signal(7) man page. NOTES For small requests (buflen <= 256) getrandom(2) will not return EINTR when reading from the urandom pool once the entropy pool has been initialized, and it will return all of the bytes that have been requested. This is the recommended way to use getrandom(2), and is designed for compatibility with OpenBSD's getentropy() system call. However, if you are using GRND_RANDOM, then getrandom(2) may block until the entropy accounting determines that sufficient environmental noise has been gathered such that getrandom(2) will be operating as a NRBG instead of a DRBG for those people who are working in the NIST SP 800-90 regime. Since it may block for a long time, these guarantees do *not* apply. The user may want to interrupt a hanging process using a signal, so blocking until all of the requested bytes are returned would be unfriendly. For this reason, the user of getrandom(2) MUST always check the return value, in case it returns some error, or if fewer bytes than requested was returned. In the case of !GRND_RANDOM and small request, the latter should never happen, but the careful userspace code (and all crypto code should be careful) should check for this anyway! Finally, unless you are doing long-term key generation (and perhaps not even then), you probably shouldn't be using GRND_RANDOM. The cryptographic algorithms used for /dev/urandom are quite conservative, and so should be sufficient for all purposes. The disadvantage of GRND_RANDOM is that it can block, and the increased complexity required to deal with partially fulfilled getrandom(2) requests. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Zach Brown <zab@zabbo.net> Bug: http://b/29621447 Change-Id: I189ba74070dd6d918b0fdf83ff30bb74ec0f7556 (cherry picked from commit 4af712e8df998475736f3e2727701bd31e3751a9)
* xattr: Constify ->name member of "struct xattr".Tetsuo Handa2019-07-061-1/+1
| | | | | | | | | | | | | | | | | Since everybody sets kstrdup()ed constant string to "struct xattr"->name but nobody modifies "struct xattr"->name , we can omit kstrdup() and its failure checking by constifying ->name member of "struct xattr". Change-Id: I84a47af13e3c77b394218cc12ac8901d87b0fd69 Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Joel Becker <jlbec@evilplan.org> [ocfs2] Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Reviewed-by: Paul Moore <paul@paul-moore.com> Tested-by: Paul Moore <paul@paul-moore.com> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
* allow O_TMPFILE to work with O_WRONLYAl Viro2019-05-021-2/+2
| | | | | Change-Id: I90171d1b53a4c35bfa76757ecfdfb6f95330d107 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Safer ABI for O_TMPFILEAl Viro2019-05-021-2/+6
| | | | | | | | | [suggested by Rasmus Villemoes] make O_DIRECTORY | O_RDWR part of O_TMPFILE; that will fail on old kernels in a lot more cases than what I came up with. And make sure O_CREAT doesn't get there... Change-Id: Iaa3c8b487d44515b539150bdb5d0b749b87d3ea2 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* it's still short a few helpers, but infrastructure should be OK now...Al Viro2019-05-021-0/+4
| | | | | Change-Id: I0adb8fe9c5029bad3ac52629003c3b78e9442936 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* FROMLIST: ANDROID: binder: Add BINDER_GET_NODE_INFO_FOR_REF ioctl.Martijn Coenen2018-11-291-0/+10
| | | | | | | | | | | | | | This allows the context manager to retrieve information about nodes that it holds a reference to, such as the current number of references to those nodes. Such information can for example be used to determine whether the servicemanager is the only process holding a reference to a node. This information can then be passed on to the process holding the node, which can in turn decide whether it wants to shut down to reduce resource usage. Signed-off-by: Martijn Coenen <maco@android.com>
* Revert "usb: gadget: f_fs: Increase EP_ALLOC ioctl number"fire8552018-05-161-1/+1
| | | | This reverts commit 4383cfbb3fe503cff5d7384986eb1b1b34133c01.
* v4l2-compat-ioctl32: Add support for private buffersSatish Kodishala2017-12-141-0/+3
| | | | | | | | Add support for copying length and userptr fields from user space private buffers to kernel space and vice versa. Change-Id: Ia7d41aa312544bb0960670af58623b0dc0435a8a Signed-off-by: Satish Kodishala <skodisha@codeaurora.org>
* UPSTREAM: USB: fix out-of-bounds in usb_set_configurationGreg Kroah-Hartman2017-11-181-0/+1
| | | | | | | | | | | | | | | | | | | | commit bd7a3fe770ebd8391d1c7d072ff88e9e76d063eb Andrey Konovalov reported a possible out-of-bounds problem for a USB interface association descriptor. He writes: It seems there's no proper size check of a USB_DT_INTERFACE_ASSOCIATION descriptor. It's only checked that the size is >= 2 in usb_parse_configuration(), so find_iad() might do out-of-bounds access to intf_assoc->bInterfaceCount. And he's right, we don't check for crazy descriptors of this type very well, so resolve this problem. Yet another issue found by syzkaller... Change-Id: I2cc3b5a66d16abd0fc567d69457fc90a45eb12d8 Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* BACKPORT: net: xfrm: support setting an output mark.Lorenzo Colitti2017-10-141-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On systems that use mark-based routing it may be necessary for routing lookups to use marks in order for packets to be routed correctly. An example of such a system is Android, which uses socket marks to route packets via different networks. Currently, routing lookups in tunnel mode always use a mark of zero, making routing incorrect on such systems. This patch adds a new output_mark element to the xfrm state and a corresponding XFRMA_OUTPUT_MARK netlink attribute. The output mark differs from the existing xfrm mark in two ways: 1. The xfrm mark is used to match xfrm policies and states, while the xfrm output mark is used to set the mark (and influence the routing) of the packets emitted by those states. 2. The existing mark is constrained to be a subset of the bits of the originating socket or transformed packet, but the output mark is arbitrary and depends only on the state. The use of a separate mark provides additional flexibility. For example: - A packet subject to two transforms (e.g., transport mode inside tunnel mode) can have two different output marks applied to it, one for the transport mode SA and one for the tunnel mode SA. - On a system where socket marks determine routing, the packets emitted by an IPsec tunnel can be routed based on a mark that is determined by the tunnel, not by the marks of the unencrypted packets. - Support for setting the output marks can be introduced without breaking any existing setups that employ both mark-based routing and xfrm tunnel mode. Simply changing the code to use the xfrm mark for routing output packets could xfrm mark could change behaviour in a way that breaks these setups. If the output mark is unspecified or set to zero, the mark is not set or changed. [backport of upstream 077fbac405bfc6d41419ad6c1725804ad4e9887c] Bug: 63589535 Test: https://android-review.googlesource.com/452776/ passes Tested: make allyesconfig; make -j64 Tested: https://android-review.googlesource.com/452776 Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Change-Id: I76120fba036e21780ced31ad390faf491ea81e52
* f2fs: support project quotaChao Yu2017-10-041-0/+1
| | | | | | | This patch adds to support plain project quota. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
* binder: make FIFO inheritance a per-context optionTim Murray2017-09-161-0/+1
| | | | | | | | | | | | | Add a new ioctl to binder to control whether FIFO inheritance should happen. In particular, hwbinder should inherit FIFO priority from callers, but standard binder threads should not. Test: boots bug 36516194 Signed-off-by: Tim Murray <timmurray@google.com> Change-Id: I8100c4364b7d15d1bf00a8ca5c286e4d4b23ce85
* drivers: merged Android Binder from 4.9Lukas06102017-09-161-13/+76
| | | | | Change-Id: I857ef86b2d502293fb8c37398383dceaa21dd29f Signed-off-by: Mister Oyster <oysterized@gmail.com>
* usb: gadget: f_fs: Increase EP_ALLOC ioctl numberJerry Zhang2017-09-141-1/+1
| | | | | | | | | Prevent conflict with possible new upstream ioctls before it itself is upstreamed. Test: None Change-Id: I10cbc01c25f920a626ea7559e8ca80ee08865333 Signed-off-by: Jerry Zhang <zhangjerry@google.com>
* usb: gadget: f_fs: Add ioctl for allocating endpoint buffers.Jerry Zhang2017-09-141-0/+5
| | | | | | | | | | | This creates an ioctl named FUNCTIONFS_ENDPOINT_ALLOC which will preallocate buffers for a given size. Any reads/writes on that endpoint below that size will use those buffers instead of allocating their own. If the endpoint is not active, the buffer will not be allocated until it becomes active. Change-Id: I4da517620ed913161ea9e21a31f6b92c9a012b44 Signed-off-by: Jerry Zhang <zhangjerry@google.com>
* usb: gadget: f_fs: add ioctl returning ep descriptorRobert Baldyga2017-09-141-0/+6
| | | | | | | | | | | | This patch introduces ioctl named FUNCTIONFS_ENDPOINT_DESC, which returns endpoint descriptor to userspace. It works only if function is active. Signed-off-by: Robert Baldyga <r.baldyga@samsung.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Jerry Zhang <zhangjerry@google.com> Change-Id: I55987bf0c6744327f7763b567b5a2b39c50d18e6
* fscrypt: add support for AES-128-CBCDaniel Walter2017-07-211-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fscrypt provides facilities to use different encryption algorithms which are selectable by userspace when setting the encryption policy. Currently, only AES-256-XTS for file contents and AES-256-CBC-CTS for file names are implemented. This is a clear case of kernel offers the mechanism and userspace selects a policy. Similar to what dm-crypt and ecryptfs have. This patch adds support for using AES-128-CBC for file contents and AES-128-CBC-CTS for file name encryption. To mitigate watermarking attacks, IVs are generated using the ESSIV algorithm. While AES-CBC is actually slightly less secure than AES-XTS from a security point of view, there is more widespread hardware support. Using AES-CBC gives us the acceptable performance while still providing a moderate level of security for persistent storage. Especially low-powered embedded devices with crypto accelerators such as CAAM or CESA often only support AES-CBC. Since using AES-CBC over AES-XTS is basically thought of a last resort, we use AES-128-CBC over AES-256-CBC since it has less encryption rounds and yields noticeable better performance starting from a file size of just a few kB. Signed-off-by: Daniel Walter <dwalter@sigma-star.at> [david@sigma-star.at: addressed review comments] Signed-off-by: David Gstir <david@sigma-star.at> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Conflicts: fs/crypto/crypto.c fs/crypto/fscrypt_private.h fs/crypto/keyinfo.c
* uapi: fix linux/packet_diag.h userspace compilation errorDmitry V. Levin2017-07-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | commit 745cb7f8a5de0805cade3de3991b7a95317c7c73 upstream. Replace MAX_ADDR_LEN with its numeric value to fix the following linux/packet_diag.h userspace compilation error: /usr/include/linux/packet_diag.h:67:17: error: 'MAX_ADDR_LEN' undeclared here (not in a function) __u8 pdmc_addr[MAX_ADDR_LEN]; This is not the first case in the UAPI where the numeric value of MAX_ADDR_LEN is used instead of symbolic one, uapi/linux/if_link.h already does the same: $ grep MAX_ADDR_LEN include/uapi/linux/if_link.h __u8 mac[32]; /* MAX_ADDR_LEN */ There are no UAPI headers besides these two that use MAX_ADDR_LEN. Signed-off-by: Dmitry V. Levin <ldv@altlinux.org> Acked-by: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Willy Tarreau <w@1wt.eu>
* can: raw: raw_setsockopt: limit number of can_filter that can be setMarc Kleine-Budde2017-07-021-0/+1
| | | | | | | | | | | | | | | commit 332b05ca7a438f857c61a3c21a88489a21532364 upstream. This patch adds a check to limit the number of can_filters that can be set via setsockopt on CAN_RAW sockets. Otherwise allocations > MAX_ORDER are not prevented resulting in a warning. Reference: https://lkml.org/lkml/2016/12/2/230 Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Willy Tarreau <w@1wt.eu>
* Revert "net: socket ioctl to reset connections matching local address"Amit Pundir2017-06-241-1/+0
| | | | | | | | | This reverts commit c7c3ec4903d32c60423ee013d96e94602f66042c. Now that SOCK_DESTROY is in place, lets revert SIOCKILLADDR patches. Change-Id: I2dcd833b66c88a48de8978dce9d72ab78f9af549 Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
* binder: merge aosp-common/3.10 binder drivers (uptodate)Mister Oyster2017-06-181-3/+112
|
* vfs: add support for a lazytime mount optionTheodore Ts'o2017-05-291-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | Add a new mount option which enables a new "lazytime" mode. This mode causes atime, mtime, and ctime updates to only be made to the in-memory version of the inode. The on-disk times will only get updated when (a) if the inode needs to be updated for some non-time related change, (b) if userspace calls fsync(), syncfs() or sync(), or (c) just before an undeleted inode is evicted from memory. This is OK according to POSIX because there are no guarantees after a crash unless userspace explicitly requests via a fsync(2) call. For workloads which feature a large number of random write to a preallocated file, the lazytime mount option significantly reduces writes to the inode table. The repeated 4k writes to a single block will result in undesirable stress on flash devices and SMR disk drives. Even on conventional HDD's, the repeated writes to the inode table block will trigger Adjacent Track Interference (ATI) remediation latencies, which very negatively impact long tail latencies --- which is a very big deal for web serving tiers (for example). Google-Bug-Id: 18297052 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* uapi: add new system call ABI codepoints for ext4 3.18 backportTheodore Ts'o2017-05-272-0/+37
| | | | | | | | | | | note: this doesn't guarantee that functionality provided by FALLOC_FL_COLLAPSE_RANGE, FALLOC_FL_ZERO_RANGE, and FIEMAP_FLAG_CACHE to necessarily _work_; it only allows ext4 from 3.18 to *compile*. Fortunately, these are exotic bits of functionality that most people never use. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* net: core: add UID to flows, rules, and routesLorenzo Colitti2017-05-232-0/+16
| | | | | | | | | | | | | | | | | | - Define a new FIB rule attributes, FRA_UID_RANGE, to describe a range of UIDs. - Define a RTA_UID attribute for per-UID route lookups and dumps. - Support passing these attributes to and from userspace via rtnetlink. The value INVALID_UID indicates no UID was specified. - Add a UID field to the flow structures. [Backport of net-next 622ec2c9d52405973c9f1ca5116eb1c393adfc7d] Bug: 16355602 Change-Id: I7e3ab388ed862c4b7e39dc8b0209d977cb1129ac Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>
* Revert "net: core: Support UID-based routing."Lorenzo Colitti2017-05-232-3/+0
| | | | | | | | This reverts commit f6f535d3e0d8da2b5bc3c93690c47485d29e4ce6. Bug: 16355602 Change-Id: I5987e276f5ddbe425ea3bd86861cee0ae22212d9 Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>
* fscrypt: Move key structure and constants to uapiJoe Richey2017-05-211-0/+13
| | | | | | | | | | | | | | | | | | | This commit exposes the necessary constants and structures for a userspace program to pass filesystem encryption keys into the keyring. The fscrypt_key structure was already part of the kernel ABI, this change just makes it so programs no longer have to redeclare these structures (like e4crypt in e2fsprogs currently does). Note that we do not expose the other FS_*_KEY_SIZE constants as they are not necessary. Only XTS is supported for contents_encryption_mode, so currently FS_MAX_KEY_SIZE bytes of key material must always be passed to the kernel. This commit also removes __packed from fscrypt_key as it does not contain any implicit padding and does not refer to an on-disk structure. Signed-off-by: Joe Richey <joerichey@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* Staging: android: binder: Remove support for old 32 bit binder protocol.Arve Hjønnevåg2017-05-071-9/+0
| | | | | | Change-Id: I371072175a298282254a21ea69503b9d75633dc5 Signed-off-by: Arve Hjønnevåg <arve@android.com> Signed-off-by: Mister Oyster <oysterized@gmail.com>
* arm64: ptrace: add NT_ARM_SYSTEM_CALL regsetAKASHI Takahiro2017-04-271-0/+1
| | | | | | | | | | | | | | | | | | | | This regeset is intended to be used to get and set a system call number while tracing. There was some discussion about possible approaches to do so: (1) modify x8 register with ptrace(PTRACE_SETREGSET) indirectly, and update regs->syscallno later on in syscall_trace_enter(), or (2) define a dedicated regset for this purpose as on s390, or (3) support ptrace(PTRACE_SET_SYSCALL) as on arch/arm Thinking of the fact that user_pt_regs doesn't expose 'syscallno' to tracer as well as that secure_computing() expects a changed syscall number, especially case of -1, to be visible before this function returns in syscall_trace_enter(), (1) doesn't work well. We will take (2) since it looks much cleaner. Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com>
* BACKPORT: hw_breakpoint: Allow watchpoint of length 3,5,6 and 7Pratyush Anand2017-04-131-0/+4
| | | | | | | | | | | | | | | | | | | | | | (cherry picked from commit 651be3cb085341a21847e47c694c249c3e1e4e5b) We only support breakpoint/watchpoint of length 1, 2, 4 and 8. If we can support other length as well, then user may watch more data with less number of watchpoints (provided hardware supports it). For example: if we have to watch only 4th, 5th and 6th byte from a 64 bit aligned address, we will have to use two slots to implement it currently. One slot will watch a half word at offset 4 and other a byte at offset 6. If we can have a watchpoint of length 3 then we can watch it with single slot as well. ARM64 hardware does support such functionality, therefore adding these new definitions in generic layer. Signed-off-by: Pratyush Anand <panand@redhat.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Pavel Labath <labath@google.com> [pavel: tools/include/uapi/linux/hw_breakpoint.h is not present in this branch] Change-Id: Ie17ed89ca526e4fddf591bb4e556fdfb55fc2eac Bug: 30919905
* fscrypt: catch up to v4.11-rc1Jaegeuk Kim2017-04-131-0/+15
| | | | | | | | | | | | | | | Keep validate_user_key() due to kasprintf() panic. fscrypt: - skcipher_ -> ablkcipher_ - fs/crypto/bio.c changes f2fs: - fscrypt: use ENOKEY when file cannot be created w/o key - fscrypt: split supp and notsupp declarations into their own headers - fscrypt: make fscrypt_operations.key_prefix a string Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
* fs crypto: move per-file encryption from f2fs tree to fs/cryptoJaegeuk Kim2017-04-131-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the renamed functions moved from the f2fs crypto files. [Backporting to 3.10] - Removed d_is_negative() in fscrypt_d_revalidate(). 1. definitions for per-file encryption used by ext4 and f2fs. 2. crypto.c for encrypt/decrypt functions a. IO preparation: - fscrypt_get_ctx / fscrypt_release_ctx b. before IOs: - fscrypt_encrypt_page - fscrypt_decrypt_page - fscrypt_zeroout_range c. after IOs: - fscrypt_decrypt_bio_pages - fscrypt_pullback_bio_page - fscrypt_restore_control_page 3. policy.c supporting context management. a. For ioctls: - fscrypt_process_policy - fscrypt_get_policy b. For context permission - fscrypt_has_permitted_context - fscrypt_inherit_context 4. keyinfo.c to handle permissions - fscrypt_get_encryption_info - fscrypt_free_encryption_info 5. fname.c to support filename encryption a. general wrapper functions - fscrypt_fname_disk_to_usr - fscrypt_fname_usr_to_disk - fscrypt_setup_filename - fscrypt_free_filename b. specific filename handling functions - fscrypt_fname_alloc_buffer - fscrypt_fname_free_buffer 6. Makefile and Kconfig Cc: Al Viro <viro@ftp.linux.org.uk> Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Ildar Muslukhov <ildarm@google.com> Signed-off-by: Uday Savagaonkar <savagaon@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
* BACKPORT: nl80211: Stop scheduled scan if netlink client disappearsJukka Rissanen2017-04-131-1/+4
| | | | | | | | | | | | | (cherry pick from commit 93a1e86ce10e4898f9ca9cd09d659a8a7780ee5e) An attribute NL80211_ATTR_SOCKET_OWNER can be set by the scan initiator. If present, the attribute will cause the scan to be stopped if the client dies. Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Bug: 25561044 Change-Id: Ibe4a555b29b64b6df1b9ed4cdcd0f05a69416d14
* BACKPORT: cfg80211: allow userspace to take ownership of interfacesJohannes Berg2017-04-131-0/+32
| | | | | | | | | | | | | | | | | | | (cherry pick from commit 78f22b6a3a9254460d23060530b48ae02a9394e3) When dynamically creating interfaces from userspace, e.g. for P2P usage, such interfaces are usually owned by the process that created them, i.e. wpa_supplicant. Should wpa_supplicant crash, such interfaces will often cease operating properly and cause problems on restarting the process. To avoid this problem, introduce an ownership concept for interfaces. If an interface is owned by a netlink socket, then it will be destroyed if the netlink socket is closed for any reason, including if the process it belongs to crashed. This gives us a race-free way to get rid of any such interfaces. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Bug: 25561044 Change-Id: I5a9c8883c5c204ac5d2917ab8492b44daf4b71e7
* net: ipv6: Add sysctl for minimum prefix len acceptable in RIOs.Joel Scherpelz2017-04-132-0/+14
| | | | | | | | | | | | | | | | | | | This commit adds a new sysctl accept_ra_rt_info_min_plen that defines the minimum acceptable prefix length of Route Information Options. The new sysctl is intended to be used together with accept_ra_rt_info_max_plen to configure a range of acceptable prefix lengths. It is useful to prevent misconfigurations from unintentionally blackholing too much of the IPv6 address space (e.g., home routers announcing RIOs for fc00::/7, which is incorrect). [backport of net-next bbea124bc99df968011e76eba105fe964a4eceab] Bug: 33333670 Test: net_test passes Signed-off-by: Joel Scherpelz <jscherpelz@google.com> Acked-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: sysctl to restrict candidate source addressesErik Kline2017-04-131-0/+1
| | | | | | | | | | | | | | | | | | | | Per RFC 6724, section 4, "Candidate Source Addresses": It is RECOMMENDED that the candidate source addresses be the set of unicast addresses assigned to the interface that will be used to send to the destination (the "outgoing" interface). Add a sysctl to enable this behaviour. Signed-off-by: Erik Kline <ek@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> [Simplified back-port of net-next 3985e8a3611a93bb36789f65db862e5700aab65e] Bug: 19470192 Bug: 21832279 Bug: 22464419 Change-Id: Ib74ef945dcabe64215064f15ee1660b6524d65ce
* sdcardfs: Change magic valueDaniel Rosenberg2017-04-111-1/+1
| | | | | | | | | | Sdcardfs uses the same magic value as wrapfs. This should not be the case. As it is entirely in memory, the value can be changed without any loss of compatibility. Change-Id: I24200b805d5e6d32702638be99e47d50d7f2f746 Signed-off-by: Daniel Rosenberg <drosen@google.com>
* fuse: Add support for d_canonical_pathDaniel Rosenberg2017-04-111-0/+1
| | | | | | | | | | | | Allows FUSE to report to inotify that it is acting as a layered filesystem. The userspace component returns a string representing the location of the underlying file. If the string cannot be resolved into a path, the top level path is returned instead. bug: 23904372 Change-Id: Iabdca0bbedfbff59e9c820c58636a68ef9683d9f Signed-off-by: Daniel Rosenberg <drosen@google.com>
* netfilter: xt_socket: add XT_SOCKET_NOWILDCARD flagEric Dumazet2017-04-111-0/+7
| | | | | | | | | | | | | | | | | | | | | | | xt_socket module can be a nice replacement to conntrack module in some cases (SYN filtering for example) But it lacks the ability to match the 3rd packet of TCP handshake (ACK coming from the client). Add a XT_SOCKET_NOWILDCARD flag to disable the wildcard mechanism. The wildcard is the legacy socket match behavior, that ignores LISTEN sockets bound to INADDR_ANY (or ipv6 equivalent) iptables -I INPUT -p tcp --syn -j SYN_CHAIN iptables -I INPUT -m socket --nowildcard -j ACCEPT Change-Id: I5216b9c9b367cb1adbe74d47014a4155f6c39271 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Patrick McHardy <kaber@trash.net> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Harout Hedeshian <harouth@codeaurora.org>
* Initial port of sdcardfsDaniel Campello2017-04-111-0/+2
| | | | Change-Id: I5b5772a2bbff9f3a7dda641644630a7b8afacec0
* wlan: get rid of Meizu's CONFIG_NL80211_FASTSCANMister Oyster2017-04-111-6/+0
|
* net: diag: allow socket bytecode filters to match socket marksLorenzo Colitti2017-04-111-0/+6
| | | | | | | | | | | | | | | | | | | This allows a privileged process to filter by socket mark when dumping sockets via INET_DIAG_BY_FAMILY. This is useful on systems that use mark-based routing such as Android. The ability to filter socket marks requires CAP_NET_ADMIN, which is consistent with other privileged operations allowed by the SOCK_DIAG interface such as the ability to destroy sockets and the ability to inspect BPF filters attached to packet sockets. [backport of net-next a52e95abf772b43c9226e9a72d3c1353903ba96f] Change-Id: If4609026882ef283a619b8bf24c0127f1f18ce6a Tested: https://android-review.googlesource.com/261350 Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* random: Backport driver from 4.1.31Joe Maples2017-04-111-1/+10
| | | | Signed-off-by: Joe Maples <joe@frap129.org>
* net: inet: diag: expose the socket mark to privileged processes.Lorenzo Colitti2017-04-111-1/+8
| | | | | | | | | | | | | | | | | | | This adds the capability for a process that has CAP_NET_ADMIN on a socket to see the socket mark in socket dumps. Commit a52e95abf772 ("net: diag: allow socket bytecode filters to match socket marks") recently gave privileged processes the ability to filter socket dumps based on mark. This patch is complementary: it ensures that the mark is also passed to userspace in the socket's netlink attributes. It is useful for tools like ss which display information about sockets. [backport of net-next d545caca827b65aab557a9e9dcdcf1e5a3823c2d] Change-Id: I0c9708aae5ab8dfa296b8a1e6aecceb2a382415a Tested: https://android-review.googlesource.com/270210 Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: diag: Add support to filter on device indexDavid Ahern2017-04-111-0/+1
| | | | | | | | | | | | Add support to inet_diag facility to filter sockets based on device index. If an interface index is in the filter only sockets bound to that index (sk_bound_dev_if) are returned. [backport of net-next 637c841dd7a5f9bd97b75cbe90b526fa1a52e530] Change-Id: Ib430cfb44f1b3b1a771a561247ee9140737e52fd Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: dhcpv6: remove MTK_DHCPV6C_WIFI featureYingjoe Chen2016-12-192-2/+0
| | | | | | | | | | | | MTK extension MTK_DHCPV6C_WIFI is no longer necessary. Remove option and functionality. This reverts commit ccd52552b0ef ("HPV6: fix HPv6 onfig Error") and 4996bbf5c24b ("DHCPV6:Support DHCPV6 to Assign IPV6 Address") Change-Id: I3a1ea546bd4006546a301e0fc0fed721ae5c507f CR-Id: ALPS02210363 Signed-off-by: Yingjoe Chen <yingjoe.chen@mediatek.com>
* revert "Network: reset onnection by UID"Yingjoe Chen2016-12-191-1/+0
| | | | | | | | This reverts commit 3860aeacee01 ("Network:Reset Connection by UID") Change-Id: I842aec64f563ba9d0b2b6f460b6d683bde30361e CR-Id: ALPS02210363 Signed-off-by: Yingjoe Chen <yingjoe.chen@mediatek.com>
* UPSTREAM: capabilities: ambient capabilitiesAndy Lutomirski2016-11-181-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Credit where credit is due: this idea comes from Christoph Lameter with a lot of valuable input from Serge Hallyn. This patch is heavily based on Christoph's patch. ===== The status quo ===== On Linux, there are a number of capabilities defined by the kernel. To perform various privileged tasks, processes can wield capabilities that they hold. Each task has four capability masks: effective (pE), permitted (pP), inheritable (pI), and a bounding set (X). When the kernel checks for a capability, it checks pE. The other capability masks serve to modify what capabilities can be in pE. Any task can remove capabilities from pE, pP, or pI at any time. If a task has a capability in pP, it can add that capability to pE and/or pI. If a task has CAP_SETPCAP, then it can add any capability to pI, and it can remove capabilities from X. Tasks are not the only things that can have capabilities; files can also have capabilities. A file can have no capabilty information at all [1]. If a file has capability information, then it has a permitted mask (fP) and an inheritable mask (fI) as well as a single effective bit (fE) [2]. File capabilities modify the capabilities of tasks that execve(2) them. A task that successfully calls execve has its capabilities modified for the file ultimately being excecuted (i.e. the binary itself if that binary is ELF or for the interpreter if the binary is a script.) [3] In the capability evolution rules, for each mask Z, pZ represents the old value and pZ' represents the new value. The rules are: pP' = (X & fP) | (pI & fI) pI' = pI pE' = (fE ? pP' : 0) X is unchanged For setuid binaries, fP, fI, and fE are modified by a moderately complicated set of rules that emulate POSIX behavior. Similarly, if euid == 0 or ruid == 0, then fP, fI, and fE are modified differently (primary, fP and fI usually end up being the full set). For nonroot users executing binaries with neither setuid nor file caps, fI and fP are empty and fE is false. As an extra complication, if you execute a process as nonroot and fE is set, then the "secure exec" rules are in effect: AT_SECURE gets set, LD_PRELOAD doesn't work, etc. This is rather messy. We've learned that making any changes is dangerous, though: if a new kernel version allows an unprivileged program to change its security state in a way that persists cross execution of a setuid program or a program with file caps, this persistent state is surprisingly likely to allow setuid or file-capped programs to be exploited for privilege escalation. ===== The problem ===== Capability inheritance is basically useless. If you aren't root and you execute an ordinary binary, fI is zero, so your capabilities have no effect whatsoever on pP'. This means that you can't usefully execute a helper process or a shell command with elevated capabilities if you aren't root. On current kernels, you can sort of work around this by setting fI to the full set for most or all non-setuid executable files. This causes pP' = pI for nonroot, and inheritance works. No one does this because it's a PITA and it isn't even supported on most filesystems. If you try this, you'll discover that every nonroot program ends up with secure exec rules, breaking many things. This is a problem that has bitten many people who have tried to use capabilities for anything useful. ===== The proposed change ===== This patch adds a fifth capability mask called the ambient mask (pA). pA does what most people expect pI to do. pA obeys the invariant that no bit can ever be set in pA if it is not set in both pP and pI. Dropping a bit from pP or pI drops that bit from pA. This ensures that existing programs that try to drop capabilities still do so, with a complication. Because capability inheritance is so broken, setting KEEPCAPS, using setresuid to switch to nonroot uids, and then calling execve effectively drops capabilities. Therefore, setresuid from root to nonroot conditionally clears pA unless SECBIT_NO_SETUID_FIXUP is set. Processes that don't like this can re-add bits to pA afterwards. The capability evolution rules are changed: pA' = (file caps or setuid or setgid ? 0 : pA) pP' = (X & fP) | (pI & fI) | pA' pI' = pI pE' = (fE ? pP' : pA') X is unchanged If you are nonroot but you have a capability, you can add it to pA. If you do so, your children get that capability in pA, pP, and pE. For example, you can set pA = CAP_NET_BIND_SERVICE, and your children can automatically bind low-numbered ports. Hallelujah! Unprivileged users can create user namespaces, map themselves to a nonzero uid, and create both privileged (relative to their namespace) and unprivileged process trees. This is currently more or less impossible. Hallelujah! You cannot use pA to try to subvert a setuid, setgid, or file-capped program: if you execute any such program, pA gets cleared and the resulting evolution rules are unchanged by this patch. Users with nonzero pA are unlikely to unintentionally leak that capability. If they run programs that try to drop privileges, dropping privileges will still work. It's worth noting that the degree of paranoia in this patch could possibly be reduced without causing serious problems. Specifically, if we allowed pA to persist across executing non-pA-aware setuid binaries and across setresuid, then, naively, the only capabilities that could leak as a result would be the capabilities in pA, and any attacker *already* has those capabilities. This would make me nervous, though -- setuid binaries that tried to privilege-separate might fail to do so, and putting CAP_DAC_READ_SEARCH or CAP_DAC_OVERRIDE into pA could have unexpected side effects. (Whether these unexpected side effects would be exploitable is an open question.) I've therefore taken the more paranoid route. We can revisit this later. An alternative would be to require PR_SET_NO_NEW_PRIVS before setting ambient capabilities. I think that this would be annoying and would make granting otherwise unprivileged users minor ambient capabilities (CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much less useful than it is with this patch. ===== Footnotes ===== [1] Files that are missing the "security.capability" xattr or that have unrecognized values for that xattr end up with has_cap set to false. The code that does that appears to be complicated for no good reason. [2] The libcap capability mask parsers and formatters are dangerously misleading and the documentation is flat-out wrong. fE is *not* a mask; it's a single bit. This has probably confused every single person who has tried to use file capabilities. [3] Linux very confusingly processes both the script and the interpreter if applicable, for reasons that elude me. The results from thinking about a script's file capabilities and/or setuid bits are mostly discarded. Preliminary userspace code is here, but it needs updating: https://git.kernel.org/cgit/linux/kernel/git/luto/util-linux-playground.git/commit/?h=cap_ambient&id=7f5afbd175d2 Here is a test program that can be used to verify the functionality (from Christoph): /* * Test program for the ambient capabilities. This program spawns a shell * that allows running processes with a defined set of capabilities. * * (C) 2015 Christoph Lameter <cl@linux.com> * Released under: GPL v3 or later. * * * Compile using: * * gcc -o ambient_test ambient_test.o -lcap-ng * * This program must have the following capabilities to run properly: * Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE * * A command to equip the binary with the right caps is: * * setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test * * * To get a shell with additional caps that can be inherited by other processes: * * ./ambient_test /bin/bash * * * Verifying that it works: * * From the bash spawed by ambient_test run * * cat /proc/$$/status * * and have a look at the capabilities. */ /* * Definitions from the kernel header files. These are going to be removed * when the /usr/include files have these defined. */ static void set_ambient_cap(int cap) { int rc; capng_get_caps_process(); rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap); if (rc) { printf("Cannot add inheritable cap\n"); exit(2); } capng_apply(CAPNG_SELECT_CAPS); /* Note the two 0s at the end. Kernel checks for these */ if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) { perror("Cannot set cap"); exit(1); } } int main(int argc, char **argv) { int rc; set_ambient_cap(CAP_NET_RAW); set_ambient_cap(CAP_NET_ADMIN); set_ambient_cap(CAP_SYS_NICE); printf("Ambient_test forking shell\n"); if (execv(argv[1], argv + 1)) perror("Cannot exec"); return 0; } Signed-off-by: Christoph Lameter <cl@linux.com> # Original author Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Aaron Jones <aaronmdjones@gmail.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Andrew G. Morgan <morgan@kernel.org> Cc: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: Austin S Hemmelgarn <ahferroin7@gmail.com> Cc: Markku Savela <msa@moth.iki.fi> Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: James Morris <james.l.morris@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 58319057b7847667f0c9585b9de0e8932b0fdb08) Bug: 31038224 Change-Id: I88bc5caa782dc6be23dc7e839ff8e11b9a903f8c Signed-off-by: Jorge Lucangeli Obes <jorgelo@google.com>
* net: pkt_sched: PIE AQM schemeVijay Subramanian2016-09-181-0/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Proportional Integral controller Enhanced (PIE) is a scheduler to address the bufferbloat problem. >From the IETF draft below: " Bufferbloat is a phenomenon where excess buffers in the network cause high latency and jitter. As more and more interactive applications (e.g. voice over IP, real time video streaming and financial transactions) run in the Internet, high latency and jitter degrade application performance. There is a pressing need to design intelligent queue management schemes that can control latency and jitter; and hence provide desirable quality of service to users. We present here a lightweight design, PIE(Proportional Integral controller Enhanced) that can effectively control the average queueing latency to a target value. Simulation results, theoretical analysis and Linux testbed results have shown that PIE can ensure low latency and achieve high link utilization under various congestion situations. The design does not require per-packet timestamp, so it incurs very small overhead and is simple enough to implement in both hardware and software. " Many thanks to Dave Taht for extensive feedback, reviews, testing and suggestions. Thanks also to Stephen Hemminger and Eric Dumazet for reviews and suggestions. Naeem Khademi and Dave Taht independently contributed to ECN support. For more information, please see technical paper about PIE in the IEEE Conference on High Performance Switching and Routing 2013. A copy of the paper can be found at ftp://ftpeng.cisco.com/pie/. Please also refer to the IETF draft submission at http://tools.ietf.org/html/draft-pan-tsvwg-pie-00 All relevant code, documents and test scripts and results can be found at ftp://ftpeng.cisco.com/pie/. For problems with the iproute2/tc or Linux kernel code, please contact Vijay Subramanian (vijaynsu@cisco.com or subramanian.vijay@gmail.com) Mythili Prabhu (mysuryan@cisco.com) Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: Mythili Prabhu <mysuryan@cisco.com> CC: Dave Taht <dave.taht@bufferbloat.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdiscTerry Lam2016-09-181-0/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements the first size-based qdisc that attempts to differentiate between small flows and heavy-hitters. The goal is to catch the heavy-hitters and move them to a separate queue with less priority so that bulk traffic does not affect the latency of critical traffic. Currently "less priority" means less weight (2:1 in particular) in a Weighted Deficit Round Robin (WDRR) scheduler. In essence, this patch addresses the "delay-bloat" problem due to bloated buffers. In some systems, large queues may be necessary for obtaining CPU efficiency, or due to the presence of unresponsive traffic like UDP, or just a large number of connections with each having a small amount of outstanding traffic. In these circumstances, HHF aims to reduce the HoL blocking for latency sensitive traffic, while not impacting the queues built up by bulk traffic. HHF can also be used in conjunction with other AQM mechanisms such as CoDel. To capture heavy-hitters, we implement the "multi-stage filter" design in the following paper: C. Estan and G. Varghese, "New Directions in Traffic Measurement and Accounting", in ACM SIGCOMM, 2002. Some configurable qdisc settings through 'tc': - hhf_reset_timeout: period to reset counter values in the multi-stage filter (default 40ms) - hhf_admit_bytes: threshold to classify heavy-hitters (default 128KB) - hhf_evict_timeout: threshold to evict idle heavy-hitters (default 1s) - hhf_non_hh_weight: Weighted Deficit Round Robin (WDRR) weight for non-heavy-hitters (default 2) - hh_flows_limit: max number of heavy-hitter flow entries (default 2048) Note that the ratio between hhf_admit_bytes and hhf_reset_timeout reflects the bandwidth of heavy-hitters that we attempt to capture (25Mbps with the above default settings). The false negative rate (heavy-hitter flows getting away unclassified) is zero by the design of the multi-stage filter algorithm. With 100 heavy-hitter flows, using four hashes and 4000 counters yields a false positive rate (non-heavy-hitters mistakenly classified as heavy-hitters) of less than 1e-4. Signed-off-by: Terry Lam <vtlam@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>