Remove remnants of optimization for > pagesize allocations.
In the past, this allocator seems to have allocated things larger than
a page seperately. Much of this code was removed at some point (perhaps
along with sbrk() used) so remove the rest. Instead, keep allocating in
power-of-two bins up to FIRST_BUCKET_SIZE << (NBUCKETS - 1). If we want
something more efficent, we should use a fancier allocator.
While here, remove some vestages of sbrk() use. Most importantly, don't
try to page align the pagepool since it's always page aligned by mmap().
The symbol version for _mcount was removed 12 years ago in r169525 from
gmon/Symbol.map, to be added to the per-arch Symbol.map. mips was overlooked
in this, so _mcount has no symver. Add it back to where it should have been,
rather than where it would go if it were added today, since we're correcting
a historical mistake.
Additionally, _mcount is getting thrown into .mdebug.abi32 in the llvm80/90
world as it's not getting explicitly thrown into .text, so do this now. This
fixes the libc build that was previously failing due to relocations in
.mdebug.abi32. This is specifically due to the way clang's integrated AS
works and that they emit the .mdebug.abiNN section early in the process. An
LLVM bug has been submitted[0] and an agreement has been made that the
mips backend should switch to .text following .mdebug.abiNN for
compatibility.
Extend uma_reclaim() to permit different reclamation targets.
The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure. This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.
With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets. Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only
items in excess of the estimate are reclaimed. Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme
cases.
Now, when under memory pressure, the page daemon will trim zones
rather than draining them. As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.
Restrict the input domain set in cpuset_setdomain(2) to all_domains.
To permit larger values of MAXMEMDOM, which is currently 8 on amd64,
cpuset_setdomain(2) accepts a mask of size 256. In the kernel, domain
set masks are 64 bits wide, but can only represent a set of MAXMEMDOM
domains due to the use of the ds_order table.
Domain sets passed to cpuset_setdomain(2) are restricted to a subset
of their parent set, which is typically the root set, but before this
happens we modify the input set to exclude empty domains.
domainset_empty_vm() and other code which manipulates domain sets
expect the mask to be a subset of all_domains, so enforce that when
performing validation of cpuset_setdomain(2) parameters.
Reported and tested by: pho
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21477
lldb: shorten thread names to make logs easier to follow
lldb prepends the thread name to log entries, and the existing thread
name for the FreeBSD ProcessMonitor thread was longer than the kernel's
supported thread name length, and so was truncated. This made logs hard
to read, as the truncated thread name ran into the log message. Shorten
"lldb.process.freebsd.operation" to just "freebsd.op" so that logs are
more readable.
gets is unsafe and shouldn't be used (for many years now). Leave it in
the existing symbol version so anything that previously linked aginst it
still runs, but do not allow new software to link against it.
(The compatability/legacy implementation must not be static so that
the symbol and in particular the compat sym gets@FBSD_1.0 make it
into libc.)
PR: 222796 (exp-run)
Reported by: Paul Vixie
Reviewed by: allanjude, cy, eadler, gnn, jhb, kib, ngie (some earlier)
Relnotes: Yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D12298
vfs: stop refing freed mount points in vop_stdgetwritemount
The code used blindly ref based on an unsafely red address and then would
backpedal if necessary. This was safe in terms of memory access since
mounts are type-stable, but made for a potential a bug where the mount
was reused and had the count reset to 0 before this code decreased it.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21411
Improve the handling of state cookie parameters in INIT-ACK chunks.
This fixes problem with parameters indicating a zero length or partial
parameters after an unknown parameter indicating to stop processing. It
also fixes a problem with state cookie parameters after unknown
parametes indicating to stop porcessing.
Thanks to Mark Wodrich from Google for finding two of these issues
by fuzz testing the userland stack and reporting them in
https://github.com/sctplab/usrsctp/issues/355
and
https://github.com/sctplab/usrsctp/issues/352
nullfs: reduce areas protected by vnode interlock in null_lock
Similarly to the other routine stop taking the interlock for the lower
vnode. The interlock for nullfs vnode is still taken to ensure
stability of ->v_data.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21480
posixshm: switch to OBJT_SWAP in advance of other changes
Future changes to posixshm will start tracking writeable mappings in order
to support file sealing. Tracking writeable mappings for an OBJT_DEFAULT
object is complicated as it may be swapped out and converted to an
OBJT_SWAP. One may generically add this tracking for vm_object, but this is
difficult to do without increasing memory footprint of vm_object and blowing
up memory usage by a significant amount.
On the other hand, the swap pager can be expanded to track writeable
mappings without increasing vm_object size. This change is currently in
D21456. Switch over to OBJT_SWAP in advance of the other changes to the
swap pager and posixshm.
ray [Sat, 31 Aug 2019 21:28:06 +0000 (21:28 +0000)]
ARM kernel can get RAM regions three ways:
o from FDT;
o from EFI;
o from Linux Boot API (ATAG).
U-Boot may pass RAM info all that 3 ways simultaneously.
We do select between FDT and EFI, but not for ATAG.
So this is not problem fix, but correctness check.
tuexen [Sat, 31 Aug 2019 08:18:49 +0000 (08:18 +0000)]
Improve the handling of illegal sequence number combinations in received
data chunks. Abort the association if there are data chunks with larger
fragement sequence numbers than the fragement sequence of the last
fragment.
Thanks to Mark Wodrich from Google who found this issue by fuzz testing
the userland stack and reporting this issue in
https://github.com/sctplab/usrsctp/issues/355
sevan [Fri, 30 Aug 2019 21:49:00 +0000 (21:49 +0000)]
Earliest reference to /dev/null I can find is in v4 sh(1) and nulldev in
nsys/ken/subr.c
via TUHS archive
https://minnie.tuhs.org/cgi-bin/utree.pl?file=V4
mjoras [Fri, 30 Aug 2019 20:19:43 +0000 (20:19 +0000)]
Wrap a vlan's parent's if_output in a separate function.
When a vlan interface is created, its if_output is set directly to the
parent interface's if_output. This is fine in the normal case but has an
unfortunate consequence if you end up with a certain combination of vlan
and lagg interfaces.
Consider you have a lagg interface with a single laggport member. When
an interface is added to a lagg its if_output is set to
lagg_port_output, which blackholes traffic from the normal networking
stack but not certain frames from BPF (pseudo_AF_HDRCMPLT). If you now
create a vlan with the laggport member (not the lagg interface) as its
parent, its if_output is set to lagg_port_output as well. While this is
confusing conceptually and likely represents a misconfigured system, it
is not itself a problem. The problem arises when you then remove the
lagg interface. Doing this resets the if_output of the laggport member
back to its original state, but the vlan's if_output is left pointing to
lagg_port_output. This gives rise to the possibility that the system
will panic when e.g. bpf is used to send any frames on the vlan
interface.
Fix this by creating a new function, vlan_output, which simply wraps the
parent's current if_output. That way when the parent's if_output is
restored there is no stale usage of lagg_port_output.
markj [Fri, 30 Aug 2019 19:35:44 +0000 (19:35 +0000)]
Update and clean up the UMA man page.
- Fix warnings from igor and mandoc.
- Provide a brief description of the separation between zones and their
backend slab allocators.
- Document cache zones and secondary zones.
- Document the kernel config options added in r350659.
- Document the uma_zalloc_pcpu() and uma_zfree_pcpu() wrappers.
- Document uma_zone_reserve(), uma_zone_reserve_kva() and
uma_zone_prealloc().
- Document uma_zone_alloc() and uma_zone_freef().
- Add some missing MLINKs and Xrefs.
jhb [Thu, 29 Aug 2019 18:23:38 +0000 (18:23 +0000)]
Simplify bhyve vlapic ESR logic.
The bhyve virtual local APIC uses an instance-global flag to indicate
when an error LVT is being delivered to prevent infinite recursion.
Use a function argument instead to reduce the amount of instance-global
state.
This was inspired by reviewing the bhyve save/restore work, which
saves a copy of the instance-global state for each vlapic.
Smart OS bug: https://smartos.org/bugview/OS-7777
Submitted by: Patrick Mooney
Reviewed by: markj, rgrimes
Obtained from: SmartOS / Joyent
Differential Revision: https://reviews.freebsd.org/D20365
zeising [Thu, 29 Aug 2019 17:17:39 +0000 (17:17 +0000)]
pwm.9 symlink shouldn't be removed
When the pwm.9 manual was removed, a symlink between pwmbus.9 and pwm.9 was
created, but there's an entry in ObsoleteFiles.inc to remove pwn.9, meaning
that on every installation pwm.9 is created, and make delete-old deletes it.
Remove the entry from ObsoleteFiles.inc, the symlink is clearly intentional
and shouldn't be removed.
mav [Thu, 29 Aug 2019 17:02:02 +0000 (17:02 +0000)]
Take proper lock in ses_setphyspath_callback().
XPT_DEV_ADVINFO call should be protected by the lock of the specific
device it is addressed to, not the lock of SES device. In some weird
case, probably with hardware violating standards, it sometimes caused
NULL dereference due to race.
To protect from it further, add lock assertion to *_dev_advinfo().
kib [Thu, 29 Aug 2019 07:50:25 +0000 (07:50 +0000)]
Rework v_object lifecycle for vnodes.
Current implementation of vnode_create_vobject() and
vnode_destroy_vobject() is written so that it prepared to handle the
vm object destruction for live vnode. Practically, no filesystems use
this, except for some remnants that were present in UFS till today.
One of the consequences of that model is that each filesystem must
call vnode_destroy_vobject() in VOP_RECLAIM() or earlier, as result
all of them get rid of the v_object in reclaim.
Move the call to vnode_destroy_vobject() to vgonel() before
VOP_RECLAIM(). This makes v_object stable: either the object is NULL,
or it is valid vm object till the vnode reclamation. Remove code from
vnode_create_vobject() to handle races with the parallel destruction.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21412
kib [Thu, 29 Aug 2019 07:45:23 +0000 (07:45 +0000)]
UFS: stop reusing the vnode for reallocated inode.
In ffs_valloc(), force reclaim existing vnode on inode reuse, instead
of trying to re-initialize the same vnode for new purposes. This is
done in preparation of changes to the vp->v_object lifecycle handling.
A new FFSV_REPLACE flag to ffs_vgetf() directs the function to
vgone(9) the vnode if found in vfs hash, instead of returning it.
Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21412
kib [Thu, 29 Aug 2019 07:25:27 +0000 (07:25 +0000)]
Centralize __pcpu definitions.
Many extern struct pcpu <something>__pcpu declarations were
copied/pasted in sources. The issue is that the definition is MD, but
it cannot be provided by machine/pcpu.h due to actual struct pcpu
defined in sys/pcpu.h later than the inclusion of machine/pcpu.h.
This forced the copying when other code needed direct access to
__pcpu. There is no way around it, due to machine/pcpu.h supplying
part of struct pcpu fields.
To work around the problem, add a new machine/pcpu_aux.h header, which
should fill any needed MD definitions after struct pcpu definition is
completed. This allows to remove copies of __pcpu spread around the
source. Also on x86 it makes it possible to remove work arounds like
OFFSETOF_CURTHREAD or clang specific warnings supressions.
Reported and tested by: lwhsu, bcran
Reviewed by: imp, markj (previous version)
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21418
avg [Thu, 29 Aug 2019 07:19:06 +0000 (07:19 +0000)]
zfs_ioc_snapshot: check user-prop permissions on snapshotted datasets
Previously, the permissions were checked on the pool which was obviously
incorrect.
After this change, zfs_check_userprops() only validates the properties
without any permission checks. The permissions are checked individually
for each snapshotted dataset.
This was also committed to ZoL: zfsonlinux/zfs@e6203d2
karels [Thu, 29 Aug 2019 02:44:18 +0000 (02:44 +0000)]
Fix address annotation in xml output from w
The libxo xml feature of adding an annotation with the "original"
address from the utmpx file if it is different than the final "from"
field was broken by r351379. This was pointed out by the gcc error
that save_p might be used uninitialized. Save the original address
as needed in each entry, don't just use the last one from the previous
loop.
mjg [Wed, 28 Aug 2019 20:34:24 +0000 (20:34 +0000)]
vfs: add VOP_NEED_INACTIVE
vnode usecount drops to 0 all the time (e.g. for directories during path lookup).
When that happens the kernel would always lock the exclusive lock for the vnode
in order to call vinactive(). This blocks other threads who want to use the vnode
for looukp.
vinactive is very rarely needed and can be tested for without the vnode lock held.
This patch gives filesytems an opportunity to do it, sample total wait time for
tmpfs over 500 minutes of poudriere -j 104:
mjg [Wed, 28 Aug 2019 19:40:57 +0000 (19:40 +0000)]
amd64: clean up cpu_switch.S
- LK macro (conditional on SMP for the lock prefix) is unused
- SETLK unnecessarily performs xchg. obtained value is never used and the
implicit lock prefix adds avoidable cost. Barrier provided by it does
not appear to be of any use.
- the lock waited for is almost never blocked, yet the loop starts with
a pause. Move it out of the common case.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19563
mav [Wed, 28 Aug 2019 17:39:46 +0000 (17:39 +0000)]
MFV/ZoL: Fix wrong assertion in libzfs diff error handling
In compare(), all error cases set the error code to EPIPE, so when an
error is set, the correct assertion to make is that the error is EPIPE,
not EINVAL.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@freqlabs.com>
Closes #8743
zfsonlinux/zfs@9dc41a769df164875d974c2431b2453e70e16c41
Submitted by: Ryan Moeller <ryan@freqlabs.com>
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D20118
markj [Wed, 28 Aug 2019 16:08:06 +0000 (16:08 +0000)]
Wire pages in vm_page_grab() when appropriate.
uiomove_object_page() and exec_map_first_page() would previously wire a
page after having grabbed it. Ask vm_page_grab() to perform the wiring
instead: this removes some redundant code, and is cheaper in the case
where the requested page is not resident since the page allocator can be
asked to initialize the page as wired, whereas a separate vm_page_wire()
call requires the page lock.
In vm_imgact_hold_page(), use vm_page_unwire_noq() instead of
vm_page_unwire(PQ_NONE). The latter ensures that the page is dequeued
before returning, but this is unnecessary since vm_page_free() will
trigger a batched dequeue of the page.
asomers [Wed, 28 Aug 2019 04:19:37 +0000 (04:19 +0000)]
fusefs: Fix some bugs regarding the size of the LISTXATTR list
* A small error in r338152 let to the returned size always being exactly
eight bytes too large.
* The FUSE_LISTXATTR operation works like Linux's listxattr(2): if the
caller does not provide enough space, then the server should return ERANGE
rather than return a truncated list. That's true even though in FUSE's
case the kernel doesn't provide space to the client at all; it simply
requests a maximum size for the list. We previously weren't handling the
case where the server returns ERANGE even though the kernel requested as
much size as the server had told us it needs; that can happen due to a
race.
* We also need to ensure that a pathological server that always returns
ERANGE no matter what size we request in FUSE_LISTXATTR won't cause an
infinite loop in the kernel. As of this commit, it will instead cause an
infinite loop that exits and enters the kernel on each iteration, allowing
signals to be processed.
Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21287
jhb [Tue, 27 Aug 2019 21:29:37 +0000 (21:29 +0000)]
Adjust the deprecated warnings for /dev/crypto to be less noisy.
Warn when actual operations are performed instead of when sessions are
created. The /dev/crypto engine in OpenSSL 1.0.x tries to create
sessions for all possible algorithms each time it is initialized
resulting in spurious warnings.
Reported by: Mike Tancsa
MFC after: 3 days
Sponsored by: Chelsio Communications
mjg [Tue, 27 Aug 2019 20:51:17 +0000 (20:51 +0000)]
unionfs: stop passing LK_INTERLOCK to VOP_UNLOCK
This is part of the preparation to remove flags argument from VOP_UNLOCK.
Also has a side effect of fixing stacking on top of nullfs broken by r351472.
Reported by: cy
Sponsored by: The FreeBSD Foundation
manu [Tue, 27 Aug 2019 18:00:01 +0000 (18:00 +0000)]
arm64: rk3399: pinctrl: Add gpio banks and fix iomux
Since r351187 the pinctrl driver need to know the gpio bank as it
directly attach the gpio driver to handle some setup that might
be present in the dts, add the gpio banks table for rk3399.
While here fix some IOMUX definition that prevented to boot
on RK3399 as pinctrl wasn't configured correctly.
manu [Tue, 27 Aug 2019 17:59:09 +0000 (17:59 +0000)]
arm64: rk3328: pinctrl: Add gpio banks and fix iomux
Since r351187 the pinctrl driver need to know the gpio bank as it
directly attach the gpio driver to handle some setup that might
be present in the dts, add the gpio banks table for rk3328.
While here fix some IOMUX definition that prevented to boot
on RK3328 as pinctrl wasn't configured correctly.
mav [Tue, 27 Aug 2019 16:41:06 +0000 (16:41 +0000)]
Always check cam_periph_error() status for ERESTART.
Even if we do not expect retries, we better be sure, since otherwise it
may result in use after free kernel panic. I've noticed that it retries
SCSI_STATUS_BUSY even with SF_NO_RECOVERY | SF_NO_RETRY.
markj [Tue, 27 Aug 2019 14:06:34 +0000 (14:06 +0000)]
Fix several logic issues in domainset_empty_vm().
- Don't add 1 to the result of DOMAINSET_FLS.
- Do not modify domainsets containing only empty domains.
- Always flatten a _PREFER policy to _ROUNDROBIN if the preferred
domain is empty. Previously we were doing this only when ds_cnt > 1.
These bugs could cause hangs during boot if a VM domain is empty.
Tested by: hselasky
Reviewed by: hselasky, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21420
trasz [Tue, 27 Aug 2019 11:46:22 +0000 (11:46 +0000)]
Introduce <sys/qmath.h>, a fixed-point math library from Netflix.
This makes it possible to perform mathematical operations on
fractional values without using floating point. It operates on Q
numbers, which are integer-sized, opaque structures initialized
to hold a chosen number of integer and fractional bits.
For a general description of the Q number system, see the "Fixed Point
Representation & Fractional Math" whitepaper[1]; for the actual
API see the qmath(3) man page.
This is one of dependencies for the upcoming stats(3) framework[2]
that will be applied to the TCP stack in a later commit.
mmel [Tue, 27 Aug 2019 09:20:01 +0000 (09:20 +0000)]
Add support for RK3288 into existing RockChip drivers.
This patch ensures only minimal level of compatibility necessary to boot
on RK3288 based boards. GPIO and pinctrl interaction, missing in current
implementation, will be improved by own patch in the near future.