Omar Sandoval [Tue, 6 Jun 2017 23:45:26 +0000 (16:45 -0700)]
Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64
There are a few places where we pass in a negative num_bytes, so make it
signed for clarity. Also move it up in the file since later patches will
need it there.
Signed-off-by: Omar Sandoval <osandov@fb.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 21 Jun 2017 15:43:24 +0000 (17:43 +0200)]
btrfs: fix validation of XATTR_ITEM dir items
The XATTR_ITEM is a type of a directory item so we use the common
validator helper. Unlike other dir items, it can have data. The way the
name len validation is currently implemented does not reflect that. We'd
have to adjust by the data_len when comparing the read and item limits.
However, this will not work for multi-item xattr dir items.
At the point of btrfs_is_name_len_valid call we don't have access to the
data_len value of the 2nd and 3rd sub-item. So simple btrfs_dir_data_len(leaf,
di) would always return 3, although we'd need to get 6 and 5 respectively to
get the claculations right. (read_end + name_len + data_len vs item_end)
We'd have to also pass data_len externally, which is not point of the
name validation. The last check is supposed to test if there's at least
one dir item space after the one we're processing. I don't think this is
particularly useful, validation of the next item would catch that too.
So the check is removed and we don't weaken the validation. Now tests
btrfs/048, btrfs/053, generic/273 and generic/337 pass.
Su Yue [Tue, 6 Jun 2017 09:57:05 +0000 (17:57 +0800)]
btrfs: Check name_len before read in iterate_dir_item
Since iterate_dir_item checks name_len in its own way,
so use btrfs_is_name_len_valid not 'verify_dir_item' to make more strict
name_len check.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ switched ENAMETOOLONG to EIO ] Signed-off-by: David Sterba <dsterba@suse.com>
Su Yue [Tue, 6 Jun 2017 09:57:02 +0000 (17:57 +0800)]
btrfs: Check name_len on add_inode_ref call path
replay_one_buffer first reads buffers and dispatches items accroding to
the item type.
In this patch, add_inode_ref handles inode_ref and inode_extref.
Then add_inode_ref calls ref_get_fields and extref_get_fields to read
ref/extref name for the first time.
So checking name_len before reading those two is fine.
add_inode_ref also calls inode_in_dir to match ref/extref in parent_dir.
The call graph includes btrfs_match_dir_item_name to read dir_item name
in the parent dir.
Checking first dir_item is not enough. Change it to verify every
dir_item while doing matches.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Su Yue [Tue, 6 Jun 2017 09:57:01 +0000 (17:57 +0800)]
btrfs: Check name_len with boundary in verify dir_item
Originally, verify_dir_item verifies name_len of dir_item with fixed
values but not item boundary.
If corrupted name_len was not bigger than the fixed value, for example
255, the function will think the dir_item is fine. And then reading
beyond boundary will cause crash.
Su Yue [Tue, 6 Jun 2017 09:57:00 +0000 (17:57 +0800)]
btrfs: Introduce btrfs_is_name_len_valid to avoid reading beyond boundary
Introduce function btrfs_is_name_len_valid.
The function compares parameter @name_len with item boundary then
returns true if name_len is valid.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ s/btrfs_leaf_data/BTRFS_LEAF_DATA_OFFSET/ ] Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 15 Jun 2017 14:04:26 +0000 (16:04 +0200)]
btrfs: account as waiting for IO, while waiting fot the flush bio completion
Similar to what submit_bio_wait does, we should account for IO while
waiting for a bio completion. This has marginal visible effects, flush
bio is short-lived.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 6 Jun 2017 15:06:06 +0000 (17:06 +0200)]
btrfs: preallocate device flush bio
For devices that support flushing, we allocate a bio, submit, wait for
it and then free it. The bio allocation does not fail so ENOMEM is not a
problem but we still may unnecessarily stress the allocation subsystem.
Instead, we can allocate the bio at the same time we allocate the device
and reuse it each time we need to flush the barriers. The bio is reset
before each use. Reference counting is simplified to just device
allocation (get) and freeing (put).
The bio used to be submitted through the integrity checker which will
find out that bio has no data attached and call submit_bio.
Status of the bio in flight needs to be tracked separately in case the
device caches get switched off between write and wait.
Filipe Manana [Tue, 13 Jun 2017 13:13:11 +0000 (14:13 +0100)]
Btrfs: incremental send, fix invalid path for unlink commands
An incremental send can contain unlink operations with an invalid target
path when we rename some directory inode A, then rename some file inode B
to the old name of inode A and directory inode A is an ancestor of inode B
in the parent snapshot (but not anymore in the send snapshot).
Consider the following example scenario where this issue happens.
When attempting to apply the corresponding incremental send stream, an
unlink operation contains an invalid path which makes the receiver fail.
The following is verbose output of the btrfs receive command:
receiving snapshot snap2 uuid=7d5450da-a573-e043-a451-ec85f4879f0f (...)
utimes
utimes dir1
utimes dir1/dir2
link dir1/dir3/dir4/file11 -> dir1/dir2/file1
unlink dir1/dir2/file1
utimes dir1/dir2
truncate dir1/dir3/dir4/file11 size=0
utimes dir1/dir3/dir4/file11
rename dir1/dir3 -> o262-7-0
link dir1/dir3 -> o262-7-0/file22
unlink dir1/dir3/file22
ERROR: unlink dir1/dir3/file22 failed. Not a directory
The following steps happen during the computation of the incremental send
stream the lead to this issue:
1) Before we start processing the new and deleted references for inode
260, we compute the full path of the deleted reference
("dir1/dir3/file22") and cache it in the list of deleted references
for our inode.
2) We then start processing the new references for inode 260, for which
there is only one new, located at "dir1/dir3". When processing this
new reference, we check that inode 262, which was not yet processed,
collides with the new reference and because of that we orphanize
inode 262 so its new full path becomes "o262-7-0".
3) After the orphanization of inode 262, we create the new reference for
inode 260 by issuing a link command with a target path of "dir1/dir3"
and a source path of "o262-7-0/file22".
4) We then start processing the deleted references for inode 260, for
which there is only one with the base name of "file22", and issue
an unlink operation containing the target path computed at step 1,
which is wrong because that path no longer exists and should be
replaced with "o262-7-0/file22".
So fix this issue by recomputing the full path of deleted references if
when we processed the new references for an inode we ended up orphanizing
any other inode that is an ancestor of our inode in the parent snapshot.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[ adjusted after prev patch removed fs_path::dir_path and dir_path_len ] Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 7 Jun 2017 10:41:29 +0000 (11:41 +0100)]
Btrfs: send, fix invalid path after renaming and linking file
Currently an incremental snapshot can generate link operations which
contain an invalid target path. Such case happens when in the send
snapshot a file was renamed, a new hard link added for it and some
other inode (with a lower number) got renamed to the former name of
that file. Example:
The following steps happen when computing the incremental send stream:
1) When processing inode 257, inode 258 is orphanized (renamed to
"o258-7-0"), because its current reference has the same name as the
new reference for inode 257;
2) When processing inode 258, we iterate over all its new references,
which have the names "f3" and "f5". The first iteration sees name
"f5" and renames the inode from its orphan name ("o258-7-0") to
"f5", while the second iteration sees the name "f3" and, incorrectly,
issues a link operation with a target name matching the orphan name,
which no longer exists. The first iteration had reset the current
valid path of the inode to "f5", but in the second iteration we lost
it because we found another inode, with a higher number of 259, which
has a reference named "f3" as well, so we orphanized inode 259 and
recomputed the current valid path of inode 258 to its old orphan
name because inode 259 could be an ancestor of inode 258 and therefore
the current valid path could contain the pre-orphanization name of
inode 259. However in this case inode 259 is not an ancestor of inode
258 so the current valid path should not be recomputed.
This makes the receiver fail with the following error:
ERROR: link f3 -> o258-7-0 failed: No such file or directory
So fix this by not recomputing the current valid path for an inode
whenever we find a colliding reference from some not yet processed inode
(inode number higher then the one currently being processed), unless
that other inode is an ancestor of the one we are currently processing.
A test case for fstests will follow soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 30 May 2017 04:29:09 +0000 (05:29 +0100)]
Btrfs: fix invalid extent maps due to hole punching
While punching a hole in a range that is not aligned with the sector size
(currently the same as the page size) we can end up leaving an extent map
in memory with a length that is smaller then the sector size or with a
start offset that is not aligned to the sector size. Both cases are not
expected and can lead to problems. This issue is easily detected
after the patch from commit a7e3b975a0f9 ("Btrfs: fix reported number of
inode blocks"), introduced in kernel 4.12-rc1, in a scenario like the
following for example:
What happens in the above example is the following:
1) When punching the hole, at btrfs_punch_hole(), the variable tail_len
is set to 2048 (as tail_start is 148Kb + 1 and offset + len is 150Kb).
This results in the creation of an extent map with a length of 2Kb
starting at file offset 148Kb, through find_first_non_hole() ->
btrfs_get_extent().
2) The second write (first write after the hole punch operation), sets
the range [50Kb, 152Kb[ to delalloc.
3) The third write, at btrfs_find_new_delalloc_bytes(), sees the extent
map covering the range [148Kb, 150Kb[ and ends up calling
set_extent_bit() for the same range, which results in splitting an
existing extent state record, covering the range [148Kb, 152Kb[ into
two 2Kb extent state records, covering the ranges [148Kb, 150Kb[ and
[150Kb, 152Kb[.
4) Finally at lock_and_cleanup_extent_if_need(), immediately after calling
btrfs_find_new_delalloc_bytes() we clear the delalloc bit from the
range [100Kb, 152Kb[ which results in the btrfs_clear_bit_hook()
callback being invoked against the two 2Kb extent state records that
cover the ranges [148Kb, 150Kb[ and [150Kb, 152Kb[. When called against
the first 2Kb extent state, it calls btrfs_delalloc_release_metadata()
with a length argument of 2048 bytes. That function rounds up the length
to a sector size aligned length, so it ends up considering a length of
4096 bytes, and then calls calc_csum_metadata_size() which results in
decrementing the inode's csum_bytes counter by 4096 bytes, so after
it stays a value of 0 bytes. Then the same happens when
btrfs_clear_bit_hook() is called against the second extent state that
has a length of 2Kb, covering the range [150Kb, 152Kb[, the length is
rounded up to 4096 and calc_csum_metadata_size() ends up being called
to decrement 4096 bytes from the inode's csum_bytes counter, which
at that time has a value of 0, leading to an underflow, which is
exactly what triggers the first warning, at btrfs_destroy_inode().
All the other warnings relate to several space accounting counters
that underflow as well due to similar reasons.
A similar case but where the hole punching operation creates an extent map
with a start offset not aligned to the sector size is the following:
During the unmount operation we get similar traces for the same reasons as
in the first example.
So fix the hole punching operation to make sure it never creates extent
maps with a length that is not aligned to the sector size nor with a start
offset that is not aligned to the sector size, as this breaks all
assumptions and it's a land mine.
Fixes: d77815461f04 ("btrfs: Avoid trucating page or punching hole in a already existed hole.") Cc: <stable@vger.kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Tue, 20 Jun 2017 12:15:26 +0000 (08:15 -0400)]
btrfs: add cond_resched to btrfs_qgroup_trace_leaf_items
On an uncontended system, we can end up hitting soft lockups while
doing replace_path. At the core, and frequently called is
btrfs_qgroup_trace_leaf_items, so it makes sense to add a cond_resched
there.
Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Subtracting the numbers we get a difference of less than a 4kb. Upon
closer inspection it became apparent that mkfs actually rounds down the
size of the device to a multiple of sector size. However, the same
cannot be said for various functions which modify the total size and are
called from btrfs_balance as well as when adding a new device. So this
patch ensures that values being saved into on-disk data structures are
always rounded down to a multiple of sectorsize.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
The device->total_bytes member needs to always be rounded down to sectorsize
so that it corresponds to the value of super->total_bytes. However, there are
multiple places where the setter is fed a value which is not rounded which
can cause a fs to be unmountable due to the check introduced in 99e3ecfcb9f4 ("Btrfs: add more validation checks for superblock"). This patch
implements the getter/setter manually so that in a later patch I can add
necessary code to catch offenders.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 14 Jun 2017 23:30:06 +0000 (01:30 +0200)]
btrfs: obsolete and remove mount option alloc_start
The mount option alloc_start was used in the past for debugging and
stressing the chunk allocator. Not meant to be used by users, so we're
not breaking anybody's setup.
There was some added complexity handling changes of the value and when
it was not same as default. Such code has likely been untested and I
think it's better to remove it.
This patch kills all use of alloc_start, and by doing that also fixes
a bug when alloc_size is set, potentially called from statfs:
in btrfs_calc_avail_data_space, traversing the list in RCU, the RCU
protection is temporarily dropped so btrfs_account_dev_extents_size can
be called and then RCU is locked again! Doing that inside
list_for_each_entry_rcu is just asking for trouble, but unlikely to be
observed in practice.
Anand Jain [Tue, 13 Jun 2017 09:05:40 +0000 (17:05 +0800)]
btrfs: remove redundant null bdev counting during flush submission
There is no extra benefit to count null bdev during the submit loop,
as these null devices will be anyway checked during command
completion device loop just after the submit loop. We are holding the
device_list_mutex, the device->bdev status won't change in between.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Tue, 13 Jun 2017 09:32:29 +0000 (17:32 +0800)]
btrfs: write_dev_flush does not return ENOMEM anymore
Since commit "btrfs: btrfs_io_bio_alloc never fails, skip error handling"
write_dev_flush will not return ENOMEM in the sending part. We do not
need to check for it in the callers.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ updated changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs: compression must free at least one sector size
We already skip storing data where compression does not make the result
at least one byte less. Let's make the logic better and check
that compression frees at least one sector size of bytes, otherwise it's
not that useful.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ changelog updated ] Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 12 Jun 2017 15:29:41 +0000 (17:29 +0200)]
btrfs: sink gfp parameter to btrfs_io_bio_alloc
We can hardcode GFP_NOFS to btrfs_io_bio_alloc, although it means we
change it back from GFP_KERNEL in scrub. I'd rather save a few stack
bytes from not passing the gfp flags in the remaining, more imporatant,
contexts and the bio allocating API now looks more consistent.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 12 Jun 2017 15:29:39 +0000 (17:29 +0200)]
btrfs: add helper to initialize the non-bio part of btrfs_io_bio
We use btrfs_bioset for bios and ask to allocate the entire size of
btrfs_io_bio from btrfs bio_alloc_bioset. The member 'bio' is
initialized but the bytes from 0 to offset of 'bio' are left
uninitialized. Although we initialize some of the members in our
helpers, we should initialize the whole structures.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
A programmer who is trying to implement calling the btrfs SEARCH
or SEARCH_V2 ioctl will probably soon end up reading this struct
definition.
Properly document the input fields to prevent common misconceptions:
1. The search space is linear, not 3 dimensional. The invidual min/max
values for objectid, type and offset cannot be used to filter the
result, they only define the endpoints of an interval.
2. The transaction id (a.k.a. generation) filter applies only on
transaction id of the last COW operation on a whole metadata page, not
on individual items.
Ad 1. The first misunderstanding was helped by the previous misleading
comments on min/max type and offset:
"keys returned will be >= min and <= max".
Ad 2. For example, running btrfs balance will happily cause rewriting of
metadata pages that contain a filesystem tree of a read only subvolume,
causing transids to be increased.
Also, improve descriptions of tree_id and nr_items and add in/out
annotations.
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Fri, 14 Apr 2017 01:11:48 +0000 (18:11 -0700)]
Btrfs: skip checksum verification if IO error occurs
Currently dio read also goes to verify checksum if -EIO has been returned,
although it usually fails on checksum, it's not necessary at all, we could
directly check if there is another copy to read.
And with this, the behavior of dio read is now consistent with that of
buffered read.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ use bool for uptodate ] Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Wed, 17 May 2017 21:42:00 +0000 (15:42 -0600)]
Btrfs: tolerate errors if we have retried successfully
With raid1 profile, dio read isn't tolerating IO errors if read length is
less than the stripe length (64K).
Our bio didn't get split in btrfs_submit_direct_hook() if (dip->flags &
BTRFS_DIO_ORIG_BIO_SUBMITTED) is true and that happens when the read
length is less than 64k. In this case, if the underlying device returns
error somehow, bio->bi_error has recorded that error.
If we could recover the correct data from another copy in profile raid1/10/5/6,
with btrfs_subio_endio_read() returning 0, bio would have the correct data in
its vector, but bio->bi_error is not updated accordingly so that the following
dio_end_io(dio_bio, bio->bi_error) makes directIO think this read has failed.
This fixes the problem by setting bio's error to 0 if a good copy has been
found.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
compressed_bio_alloc is now a trivial wrapper around btrfs_bio_alloc, no
point keeping it. The error handling can be simplified, as we know
btrfs_bio_alloc will never fail.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 2 Jun 2017 15:55:44 +0000 (17:55 +0200)]
btrfs: remove redundant parameters from btrfs_bio_alloc
All callers pass gfp_flags=GFP_NOFS and nr_vecs=BIO_MAX_PAGES.
submit_extent_page adds __GFP_HIGH that does not make a difference in
our case as it allows access to memory reserves but otherwise does not
change the constraints.
David Sterba [Fri, 2 Jun 2017 15:26:26 +0000 (17:26 +0200)]
btrfs: bioset allocations will never fail, adapt our helpers
Christoph pointed out that bio allocations backed by a bioset will never
fail. As we always use a bioset for all bio allocations, we can skip
the error handling. This patch adjusts our low-level helpers, the
cascaded changes to all callers will come next.
CC: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 31 May 2017 15:21:15 +0000 (17:21 +0200)]
btrfs: switch to kvmalloc and GFP_KERNEL in lzo/zlib alloc_workspace
The compression workspace buffers are larger than a page so we use
vmalloc, unconditionally. This is not always necessary as there might be
contiguous memory available.
Let's use the kvmalloc helpers that will try kmalloc first and fallback
to vmalloc. For that they require GFP_KERNEL flags. As we now have the
alloc_workspace calls protected by memalloc_nofs in the critical
contexts, we can safely use GFP_KERNEL.
David Sterba [Wed, 31 May 2017 15:14:56 +0000 (17:14 +0200)]
btrfs: add memalloc_nofs protections around alloc_workspace callback
The workspaces are preallocated at the beginning where we can safely use
GFP_KERNEL, but in some cases the find_workspace might reach the
allocation again, now in a more restricted context when the bios or
pages are being compressed.
To avoid potential lockup when alloc_workspace -> vmalloc would silently
use the GFP_KERNEL, add the memalloc_nofs helpers around the critical
call site.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 31 May 2017 17:32:09 +0000 (19:32 +0200)]
btrfs: use GFP_KERNEL in init_ipath
Now that init_ipath is called either from a safe context or with
memalloc_nofs protection, we can switch to GFP_KERNEL allocations in
init_path and init_data_container.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 31 May 2017 17:21:38 +0000 (19:21 +0200)]
btrfs: scrub: add memalloc_nofs protection around init_ipath
init_ipath is called from a safe ioctl context and from scrub when
printing an error. The protection is added for three reasons:
* init_data_container calls vmalloc and this does not work as expected
in the GFP_NOFS context, so this silently does GFP_KERNEL and might
deadlock in some cases
* keep the context constraint of GFP_NOFS, used by scrub
* we want to use GFP_KERNEL unconditionally inside init_ipath or its
callees
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 31 May 2017 16:40:02 +0000 (18:40 +0200)]
btrfs: send: use kvmalloc in iterate_dir_item
We use a growing buffer for xattrs larger than a page size, at some
point vmalloc is unconditionally used for larger buffers. We can still
try to avoid it using the kvmalloc helper.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Mon, 5 Jun 2017 07:12:31 +0000 (00:12 -0700)]
Btrfs: use memalloc_nofs and kvzalloc() for free space tree bitmaps
First, instead of open-coding the vmalloc() fallback, use the new
kvzalloc() helper. Second, use memalloc_nofs_{save,restore}() instead of
GFP_NOFS, as vmalloc() uses some GFP_KERNEL allocations internally which
could lead to deadlocks.
Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 28 Mar 2017 10:06:05 +0000 (12:06 +0200)]
btrfs: use generic slab for for btrfs_transaction
Observing the number of slab objects of btrfs_transaction, there's just
one active on an almost quiescent filesystem, and the number of objects
goes to about ten when sync is in progress. Then the nubmer goes down to
1. This matches the expectations of the transaction lifetime.
For such use the separate slab cache is not justified, as we do not
reuse objects frequently. For the shortlived transaction, the generic
slab (size 512) should be ok. We can optimistically expect that the 512
slabs are not all used (fragmentation) and there are free slots to take
when we do the allocation, compared to potentially allocating a whole new
page for the separate slab.
We'll lose the stats about the object use, which could be added later if
we really need them.
David Sterba [Tue, 16 May 2017 17:10:32 +0000 (19:10 +0200)]
btrfs: scrub: embed scrub_wr_ctx into scrub context
The structure scrub_wr_ctx is not used anywhere just the scrub context,
we can move the members there. The tgtdev is renamed so it's more clear
that it belongs to the "wr" part.
Yonghong Song [Fri, 12 May 2017 22:07:43 +0000 (15:07 -0700)]
Btrfs: add statx support
Return enhanced file attributes from the btrfs, including:
(1). inode creation time as stx_btime, and
(2). Certain BTRFS_INODE_xxx flags are mapped to stx_attributes flags.
Jeff Layton [Thu, 25 May 2017 10:39:52 +0000 (06:39 -0400)]
btrfs: btrfs_wait_tree_block_writeback can be void return
Nothing checks its return value.
Is it safe to skip checking return value of btrfs_wait_tree_block_writeback?
Liu Bo: I think yes, it's used in walk_log_tree which is called in two
places, free_log_tree and log replay. For free_log_tree, it waits for
any running writeback of the extent buffer under freeing to finish in
case we need to access the eb pointer from page->private, and it's OK to
not check the return value, while for log replay, it's doesn't wait
because wc->wait is not set. So neither cares about the writeback error.
Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
[ added more explanation to changelog, from Liu Bo ] Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 29 May 2017 06:43:43 +0000 (09:43 +0300)]
btrfs: rename btrfs_leaf_data to BTRFS_LEAF_DATA_OFFSET
Commit 5f39d397dfbe ("Btrfs: Create extent_buffer interface
for large blocksizes") refactored btrfs_leaf_data function to take
extent_buffer rather than struct btrfs_leaf. However, as it turns out the
parameter being passed is never used. Furthermore this function no longer
returns the leaf data but rather the offset to it. So rename the function
to BTRFS_LEAF_DATA_OFFSET to make it consistent with other BTRFS_LEAF_*
helpers and turn it into a macro.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ removed () from the macro ] Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 22 May 2017 06:35:50 +0000 (09:35 +0300)]
btrfs: Refactor update_space_info
Following the factoring out of the creation code udpate_space_info can
only be called for already-existing space_info structs. As such it
cannot fail. Remove superfluous error handling and make the function
return void.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 22 May 2017 06:35:49 +0000 (09:35 +0300)]
btrfs: Separate space_info create/update
Currently the struct space_info creation code is intermixed in the
udpate_space_info function. There are well-defined points at which the
we actually want to create brand-new space_info structs (e.g. during
mount of the filesystem as well as sometimes when adding/initialising
new chunks). In such cases update_space_info is called with 0 as the
bytes parameter. All of this makes for spaghetti code.
Fix it by factoring out the creation code in a separate
create_space_info structure. This also allows to simplify the internals.
Also remove BUG_ON from do_alloc_chunk since the callers handle errors.
Furthermore it will make the update_space_info function not fail,
allowing us to remove error handling in callers. This will come in a
follow up patch.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Fri, 19 May 2017 17:39:15 +0000 (11:39 -0600)]
Btrfs: skip commit transaction if we don't have enough pinned bytes
We commit transaction in order to reclaim space from pinned bytes because
it could process delayed refs, and in may_commit_transaction(), we check
first if pinned bytes are enough for the required space, we then check if
that plus bytes reserved for delayed insert are enough for the required
space.
This changes the code to the above logic.
Fixes: b150a4f10d87 ("Btrfs: use a percpu to keep track of possibly pinned bytes") Tested-by: Nikolay Borisov <nborisov@suse.com> Reported-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Wed, 17 May 2017 15:38:36 +0000 (11:38 -0400)]
btrfs: remove root usage from can_overcommit
can_overcommit using the root to determine the allocation profile
is the only use of a root in the call graph below reserve_metadata_bytes.
It turns out that we only need to know whether the allocation is for
the chunk root or not -- and we can pass that around as a bool instead.
This allows us to pull root usage out of the reservation path all the
way up to reserve_metadata_bytes itself, which uses it only to compare
against fs_info->chunk_root to set the bool. In turn, this eliminates
a bunch of races where we use a particular root too early in the mount
process.
Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Wed, 17 May 2017 15:38:35 +0000 (11:38 -0400)]
btrfs: cleanup root usage by btrfs_get_alloc_profile
There are two places where we don't already know what kind of alloc
profile we need before calling btrfs_get_alloc_profile, but we need
access to a root everywhere we call it.
This patch adds helpers for btrfs_{data,metadata,system}_alloc_profile()
and relegates btrfs_system_alloc_profile to a static for use in those
two cases. The next patch will eliminate one of those.
Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 13 Apr 2017 17:11:04 +0000 (19:11 +0200)]
btrfs: remove unused member list from btrfs_end_io_wq
The end io work queue items have been tracked by the work queues since
"Btrfs: Add async worker threads for pre and post IO checksumming"
(8b7128429235d9bd72cfd5e) (2008).
David Sterba [Thu, 13 Apr 2017 17:11:04 +0000 (19:11 +0200)]
btrfs: remove unused member list from async_submit_bio
The list used to track checksums in the early version (2.6.29), but I
was able not pinpoint the commit that stopped using it. Everything
apparently works without it for a long time.
Arnd Bergmann [Thu, 18 May 2017 13:33:29 +0000 (15:33 +0200)]
Btrfs: work around maybe-uninitialized warning
A rewrite of btrfs_submit_direct_hook appears to have introduced a warning:
fs/btrfs/inode.c: In function 'btrfs_submit_direct_hook':
fs/btrfs/inode.c:8467:14: error: 'bio' may be used uninitialized in this function [-Werror=maybe-uninitialized]
Where the 'bio' variable was previously initialized unconditionally, it
is now set in the "while (submit_len > 0)" loop that would never execute
if submit_len is zero.
Assuming this cannot happen in practice, we can avoid the warning
by simply replacing the while{} loop with a do{}while() loop so
the compiler knows that it will always be entered at least once.
Fixes changes introduced in "Btrfs: use bio_clone_bioset_partial to
simplify DIO submit".
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Tue, 16 May 2017 00:20:07 +0000 (17:20 -0700)]
Btrfs: record error if one block has failed to retry
In the nocsum case of dio read endio, it returns immediately if an error
gets returned when repairing, which leaves the rest blocks unrepaired. The
behavior is different from how buffered read endio works in the same case.
This changes it to record error only and go on repairing the rest blocks.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Mon, 15 May 2017 22:33:27 +0000 (15:33 -0700)]
Btrfs: change how we iterate bios in endio
Since dio submit has used bio_clone_fast, the submitted bio may not have a
reliable bi_vcnt, for the bio vector iterations in checksum related
functions, bio->bi_iter is not modified yet and it's safe to use
bio_for_each_segment, while for those bio vector iterations in dio read's
endio, we now save a copy of bvec_iter in struct btrfs_io_bio when cloning
bios and use the helper __bio_for_each_segment with the saved bvec_iter to
access each bvec.
Also for dio reads which don't get split, we also need to save a copy of
bio iterator in btrfs_bio_clone to let __bio_for_each_segments to access
each bvec in dio read's endio. Note that it doesn't affect other calls of
btrfs_bio_clone() because they don't need to use this iterator.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Tue, 16 May 2017 16:51:39 +0000 (09:51 -0700)]
Btrfs: use bio_clone_bioset_partial to simplify DIO submit
Currently when mapping bio to limit bio to a single stripe length, we
split bio by adding page to bio one by one, but later we don't modify
the vector of bio at all, thus we can use bio_clone_fast to use the
original bio vector directly.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Tue, 4 Apr 2017 19:23:25 +0000 (12:23 -0700)]
Btrfs: use bio_clone_fast to clone our bio
For raid1 and raid10, we clone the original bio to the bios which are then
sent to different disks.
Right now we use bio_clone_bioset to create a clone bio with iterating
bi_io_vec to initialize it. This changes it to use bio_clone_fast()
which creates a clone bio but only copies the bi_io_vec pointer
instead of iterating bi_io_vec.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 5 May 2017 15:57:14 +0000 (11:57 -0400)]
btrfs: remove inode argument from repair_io_failure
Once we remove the btree_inode we won't have an inode to pass anymore,
just pass the fs_info directly and the inum since we use that to print
out the repair message.
Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 5 May 2017 15:57:13 +0000 (11:57 -0400)]
Btrfs: replace tree->mapping with tree->private_data
For extent_io tree's we have carried the address_mapping of the inode
around in the io tree in order to pull the inode back out for calling
into various tree ops hooks. This works fine when everything that has
an extent_io_tree has an inode. But we are going to remove the
btree_inode, so we need to change this. Instead just have a generic
void * for private data that we can initialize with, and have all the
tree ops use that instead. This had a lot of cascading changes but
should be relatively straightforward.
Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ minor reordering of the callback prototypes ] Signed-off-by: David Sterba <dsterba@suse.com>
Sargun Dhillon [Thu, 11 May 2017 21:18:03 +0000 (21:18 +0000)]
btrfs: Add quota_override knob into sysfs
This patch adds the read-write attribute quota_override into sysfs.
Any process which has CAP_SYS_RESOURCE can set this flag to on, and
once it is set to true, processes with CAP_SYS_RESOURCE can exceed
the quota.
Signed-off-by: Sargun Dhillon <sargun@sargun.me> Reviewed-by: David Sterba <dsterba@suse.com>
[ minor changelog edits ] Signed-off-by: David Sterba <dsterba@suse.com>
Sargun Dhillon [Thu, 11 May 2017 21:17:33 +0000 (21:17 +0000)]
btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE
This patch introduces the quota override flag to btrfs_fs_info, and a
change to quota limit checking code to temporarily allow for quota to be
overridden for processes with CAP_SYS_RESOURCE.
It's useful for administrative programs, such as log rotation, that may
need to temporarily use more disk space in order to free up a greater
amount of overall disk space without yielding more disk space to the
rest of userland.
Eventually, we may want to add the idea of an operator-specific quota,
operator reserved space, or something else to allow for administrative
override, but this is perhaps the simplest solution.
Signed-off-by: Sargun Dhillon <sargun@sargun.me> Reviewed-by: David Sterba <dsterba@suse.com>
[ minor changelog edits ] Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 11 May 2017 06:17:46 +0000 (09:17 +0300)]
btrfs: Convert fs_info->free_chunk_space to atomic64_t
The ->free_chunk_space variable is used to track the unallocated space
and access to it is protected by a spinlock, which is not used for
anything else. Make the code a bit self-explanatory by switching the
variable to an atomic64_t type and kill the spinlock.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ not a performance critical code, use of atomic type is ok ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Fri, 5 May 2017 23:17:54 +0000 (07:17 +0800)]
btrfs: add framework to handle device flush error as a volume
This adds comments to the flush error handling part of the code, and
hopes to maintain the same logic with a framework which can be used to
handle the errors at the volume level.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Fri, 5 May 2017 02:09:36 +0000 (10:09 +0800)]
btrfs: cleanup unused qgroup trace event
Commit 81fb6f77a026 (btrfs: qgroup: Add new trace point for
qgroup data reserve) added the following events which aren't used.
btrfs__qgroup_data_map
btrfs_qgroup_init_data_rsv_map
btrfs_qgroup_free_data_rsv_map
So remove them.