zone->pages_scanned is a write-intensive cache line during page reclaim
and it's also updated during page free. Move the counter into vmstat to
take advantage of the per-cpu updates and do not update it in the free
paths unless necessary.
On a small UMA machine running tiobench the difference is marginal. On
a 4-node machine the overhead is more noticable. Note that automatic
NUMA balancing was disabled for this test as otherwise the system CPU
overhead is unpredictable.
3.16.0-rc3 3.16.0-rc3 3.16.0-rc3
vanillarearrange-v5 vmstat-v5
User 746.94 759.78 774.56
System 65336.22 58350.98 32847.27
Elapsed 27553.52 27282.02 27415.04
Note that the overhead reduction will vary depending on where exactly
pages are allocated and freed.
The arrangement of struct zone has changed over time and now it has
reached the point where there is some inappropriate sharing going on.
On x86-64 for example
o The zone->node field is shared with the zone lock and zone->node is
accessed frequently from the page allocator due to the fair zone
allocation policy.
o span_seqlock is almost never used by shares a line with free_area
o Some zone statistics share a cache line with the LRU lock so
reclaim-intensive and allocator-intensive workloads can bounce the cache
line on a stat update
This patch rearranges struct zone to put read-only and read-mostly
fields together and then splits the page allocator intensive fields, the
zone statistics and the page reclaim intensive fields into their own
cache lines. Note that the type of lowmem_reserve changes due to the
watermark calculations being signed and avoiding a signed/unsigned
conversion there.
On the test configuration I used the overall size of struct zone shrunk
by one cache line. On smaller machines, this is not likely to be
noticable. However, on a 4-node NUMA machine running tiobench the
system CPU overhead is reduced by this patch.
3.16.0-rc3 3.16.0-rc3
vanillarearrange-v5r9
User 746.94 759.78
System 65336.22 58350.98
Elapsed 27553.52 27282.02
This was formerly the series "Improve sequential read throughput" which
noted some major differences in performance of tiobench since 3.0.
While there are a number of factors, two that dominated were the
introduction of the fair zone allocation policy and changes to CFQ.
The behaviour of fair zone allocation policy makes more sense than
tiobench as a benchmark and CFQ defaults were not changed due to
insufficient benchmarking.
This series is what's left. It's one functional fix to the fair zone
allocation policy when used on NUMA machines and a reduction of overhead
in general. tiobench was used for the comparison despite its flaws as
an IO benchmark as in this case we are primarily interested in the
overhead of page allocator and page reclaim activity.
On UMA, it makes little difference to overhead
3.16.0-rc3 3.16.0-rc3
vanilla lowercost-v5
User 383.61 386.77
System 403.83 401.74
Elapsed 5411.50 5413.11
On a 4-socket NUMA machine it's a bit more noticable
3.16.0-rc3 3.16.0-rc3
vanilla lowercost-v5
User 746.94 802.00
System 65336.22 40852.33
Elapsed 27553.52 27368.46
This patch (of 6):
The LRU insertion and activate tracepoints take PFN as a parameter
forcing the overhead to the caller. Move the overhead to the tracepoint
fast-assign method to ensure the cost is only incurred when the
tracepoint is active.
When memory cgoups are enabled, the code that decides to force to scan
anonymous pages in get_scan_count() compares global values (free,
high_watermark) to a value that is restricted to a memory cgroup (file).
It make the code over-eager to force anon scan.
For instance, it will force anon scan when scanning a memcg that is
mainly populated by anonymous page, even when there is plenty of file
pages to get rid of in others memcgs, even when swappiness == 0. It
breaks user's expectation about swappiness and hurts performance.
This patch makes sure that forced anon scan only happens when there not
enough file pages for the all zone, not just in one random memcg.
[hannes@cmpxchg.org: cleanups] Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Richard Yao reported a month ago that his system have a trouble with
vmap_area_lock contention during performance analysis by /proc/meminfo.
Andrew asked why his analysis checks /proc/meminfo stressfully, but he
didn't answer it.
https://lkml.org/lkml/2014/4/10/416
Although I'm not sure that this is right usage or not, there is a
solution reducing vmap_area_lock contention with no side-effect. That
is just to use rcu list iterator in get_vmalloc_info().
rcu can be used in this function because all RCU protocol is already
respected by writers, since Nick Piggin commit db64fe02258f1 ("mm:
rewrite vmap layer") back in linux-2.6.28
Specifically :
insertions use list_add_rcu(),
deletions use list_del_rcu() and kfree_rcu().
Note the rb tree is not used from rcu reader (it would not be safe),
only the vmap_area_list has full RCU protection.
Note that __purge_vmap_area_lazy() already uses this rcu protection.
: While rcu list traversal over the vmap_area_list is safe, this may
: arrive at different results than the spinlocked version. The rcu list
: traversal version will not be a 'snapshot' of a single, valid instant
: of the entire vmap_area_list, but rather a potential amalgam of
: different list states.
Joonsoo:
: Yes, you are right, but I don't think that we should be strict here.
: Meminfo is already not a 'snapshot' at specific time. While we try to get
: certain stats, the other stats can change. And, although we may arrive at
: different results than the spinlocked version, the difference would not be
: large and would not make serious side-effect.
[edumazet@google.com: add more commit description] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Richard Yao <ryao@gentoo.org> Acked-by: Eric Dumazet <edumazet@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Zhang Yanfei <zhangyanfei.yes@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 71e3aac0724f ("thp: transparent hugepage core") adds
copy_pte_range prototype to huge_mm.h. I'm not sure why (or if) this
function have been used outside of memory.c, but it currently isn't.
This patch makes copy_pte_range() static again.
Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
node") improved the previous khugepaged logic which allocated a
transparent hugepages from the node of the first page being collapsed.
However, it is still possible to collapse pages to remote memory which
may suffer from additional access latency. With the current policy, it
is possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed
remotely if the majority are allocated from that node.
When zone_reclaim_mode is enabled, it means the VM should make every
attempt to allocate locally to prevent NUMA performance degradation. In
this case, we do not want to collapse hugepages to remote nodes that
would suffer from increased access latency. Thus, when
zone_reclaim_mode is enabled, only allow collapsing to nodes with
RECLAIM_DISTANCE or less.
There is no functional change for systems that disable
zone_reclaim_mode.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Bob Liu <bob.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Use ACCESS_ONCE() in handle_pte_fault() when getting the entry or
orig_pte upon which all subsequent decisions and pte_same() tests will
be made.
I have no evidence that its lack is responsible for the mm/filemap.c:202
BUG_ON(page_mapped(page)) in __delete_from_page_cache() found by
trinity, and I am not optimistic that it will fix it. But I have found
no other explanation, and ACCESS_ONCE() here will surely not hurt.
If gcc does re-access the pte before passing it down, then that would be
disastrous for correct page fault handling, and certainly could explain
the page_mapped() BUGs seen (concurrent fault causing page to be mapped
in a second time on top of itself: mapcount 2 for a single pte).
Under shmem swapping load, I sometimes hit the VM_BUG_ON_PAGE(!PageLRU)
in isolate_lru_pages() at mm/vmscan.c:1281!
Commit 2457aec63745 ("mm: non-atomically mark page accessed during page
cache allocation where possible") looks like interrupted work-in-progress.
mm/filemap.c's call to init_page_accessed() is fine, but not mm/shmem.c's
- shmem_write_begin() is clearly wrong to use it after shmem_getpage(),
when the page is always visible in radix_tree, and often already on LRU.
Revert change to shmem_write_begin(), and use init_page_accessed() or
mark_page_accessed() appropriately for SGP_WRITE in shmem_getpage_gfp().
SGP_WRITE also covers shmem_symlink(), which did not mark_page_accessed()
before; but since many other filesystems use [__]page_symlink(), which did
and does mark the page accessed, consider this as rectifying an oversight.
If a page is marked for immediate reclaim then it is moved to the tail of
the LRU list. This occurs when the system is under enough memory pressure
for pages under writeback to reach the end of the LRU but we test for this
using atomic operations on every writeback. This patch uses an optimistic
non-atomic test first. It'll miss some pages in rare cases but the
consequences are not severe enough to warrant such a penalty.
While the function does not dominate profiles during a simple dd test the
cost of it is reduced.
aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after. Once the page is
visible the atomic operations are necessary which is noticable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticable with fast storage. The objective of the patch is to initialse
the accessed information with non-atomic operations before the page is
visible.
The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page. This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
called before the page is visible and can be done non-atomically.
The primary APIs of concern in this care are the following and are used
by most filesystems.
All of them are very similar in detail to the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not. Then
old API is preserved but is basically a thin wrapper around this core
function.
Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job. There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced where as previously it might
have been repromoted. This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change. It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.
The test case used to evaulate this is a simple dd of a large file done
multiple times with the file deleted on each iterations. The size of the
file is 1/10th physical memory to avoid dirty page balancing. In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO. The sync results are expected to be
more stable. The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.
The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts. Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison. As async results were variable
do to writback timings, I'm only reporting the maximum figures. The sync
results were stable enough to make the mean and stddev uninteresting.
The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.
Discarding buffers uses a bunch of atomic operations when discarding
buffers because ...... I can't think of a reason. Use a cmpxchg loop to
clear all the necessary flags. In most (all?) cases this will be a single
atomic operations.
[akpm@linux-foundation.org: move BUFFER_FLAGS_DISCARD into the .c file] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When adding pages to the LRU we clear the active bit unconditionally.
As the page could be reachable from other paths we cannot use unlocked
operations without risk of corruption such as a parallel
mark_page_accessed. This patch tests if is necessary to clear the
active flag before using an atomic operation. This potentially opens a
tiny race when PageActive is checked as mark_page_accessed could be
called after PageActive was checked. The race already exists but this
patch changes it slightly. The consequence is that that the page may be
promoted to the active list that might have been left on the inactive
list before the patch. It's too tiny a race and too marginal a
consequence to always use atomic operations for.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
shmem_getpage_gfp uses an atomic operation to set the SwapBacked field
before it's even added to the LRU or visible. This is unnecessary as what
could it possible race against? Use an unlocked variant.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cold is a bool, make it one. Make the likely case the "if" part of the
block instead of the else as according to the optimisation manual this is
preferred.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
X86 prefers the use of unsigned types for iterators and there is a
tendency to mix whether a signed or unsigned type if used for page order.
This converts a number of sites in mm/page_alloc.c to use unsigned int for
order where possible.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ALLOC_NO_WATERMARK is set in a few cases. Always by kswapd, always for
__GFP_MEMALLOC, sometimes for swap-over-nfs, tasks etc. Each of these
cases are relatively rare events but the ALLOC_NO_WATERMARK check is an
unlikely branch in the fast path. This patch moves the check out of the
fast path and after it has been determined that the watermarks have not
been met. This helps the common fast path at the cost of making the slow
path slower and hitting kswapd with a performance cost. It's a reasonable
tradeoff.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A node/zone index is used to check if pages are compatible for merging
but this happens unconditionally even if the buddy page is not free. Defer
the calculation as long as possible. Ideally we would check the zone boundary
but nodes can overlap.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch exposes the jump_label reference count in preparation for the
next patch. cpusets cares about both the jump_label being enabled and how
many users of the cpusets there currently are.
Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If a zone cannot be used for a dirty page then it gets marked "full" which
is cached in the zlc and later potentially skipped by allocation requests
that have nothing to do with dirty zones.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The zlc is used on NUMA machines to quickly skip over zones that are full.
However it is always updated, even for the first zone scanned when the
zlc might not even be active. As it's a write to a bitmap that
potentially bounces cache line it's deceptively expensive and most
machines will not care. Only update the zlc if it was active.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In mm/swap.c, __lru_cache_add() is exported, but actually there are no
users outside this file.
This patch unexports __lru_cache_add(), and makes it static. It also
exports lru_cache_add_file(), as it is use by cifs and fuse, which can
loaded as modules.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Bob Liu <bob.liu@oracle.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sjoerd Simons [Mon, 19 Jan 2015 22:07:09 +0000 (23:07 +0100)]
mmc: sdhci: Don't signal the sdio irq if it's not setup
[Not needed in newer kernels due to refactoring fixing this issue.]
With 3.14.29 (and older kernels) some of my I.mx6 Sabrelite boards were
crashing with the following oops:
sdhci: Secure Digital Host Controller Interface driver
sdhci: Copyright(c) Pierre Ossman
sdhci-pltfm: SDHCI platform and OF driver helper
sdhci-esdhc-imx 2198000.usdhc: could not get ultra high speed state, work on normal mode
mmc0: no vqmmc regulator found
mmc0: SDHCI controller on 2198000.usdhc [2198000.usdhc] using ADMA
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = c0004000
[00000000] *pgd=00000000
Internal error: Oops: 5 [#1] SMP ARM
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.14.29 #1
task: c08a7120 ti: c089c000 task.ti: c089c000
PC is at wake_up_process+0x8/0x40
LR is at sdhci_irq+0x748/0x9c4
Full boot log can be found at:
http://storage.kernelci.org/stable/v3.14.29/arm-multi_v7_defconfig/lab-collabora/boot-imx6q-sabrelite.html
This happens if the sdhci interrupt status contains SDHCI_INT_CARD_INT,
while the sdio irq was never setup. This patch fixes that in a minimal
way by checking if the sdio irq was setup.
In more recent kernels this bug went away due to refactoring done by
Russel King. So an alternative (potentially better?) fix for this patch
is to cherrypick the following patches from a recent kernel:
18258f7239a6 - genirq: Provide synchronize_hardirq() bf3b5ec66bd0 - mmc: sdio_irq: rework sdio irq handling 41005003bcaf - mmc: sdhci: clean up interrupt handling 781e989cf593 - mmc: sdhci: convert to new SDIO IRQ handling
Signed-off-by: Sjoerd Simons <sjoerd.simons@collabora.co.uk> Cc: Russell King <linux@arm.linux.org.uk> Cc: Tyler Baker <tyler.baker@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit f95499c3030f ("n_tty: Don't wait for buffer work in read() loop")
introduces a race window where a pty master can be signalled that the pty
slave was closed before all the data that the slave wrote is delivered.
Commit f8747d4a466a ("tty: Fix pty master read() after slave closes") fixed the
problem in case of n_tty_read, but the problem still exists for n_tty_poll.
This can be seen by running 'for ((i=0; i<100;i++));do ./test.py ;done'
where test.py is:
import os, select, pty
(pid, pty_fd) = pty.fork()
if pid == 0:
os.write(1, 'This string should be received by parent')
else:
poller = select.epoll()
poller.register( pty_fd, select.EPOLLIN )
ready = poller.poll( 1 * 1000 )
for fd, events in ready:
if not events & select.EPOLLIN:
print 'missed POLLIN event'
else:
print os.read(fd, 100)
poller.close()
The string from the slave is missed several times.
This patch takes the same approach as the fix for read and special cases
this condition for poll.
Tested on 3.16.
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Cc: Peter Hurley <peter@hurleysoftware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The 'max' size passed into the function is measured in number of bits
(KEY_MAX, LED_MAX, etc) so we need to convert it accordingly before trying
to copy the data out, otherwise we will try copying too much and end up
with up with a page fault.
Reported-by: Pavel Machek <pavel@ucw.cz> Reviewed-by: Pavel Machek <pavel@ucw.cz> Reviewed-by: David Herrmann <dh.herrmann@gmail.com> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 93ea02bb8435 ("arch: Clean up asm/barrier.h implementations")
wired generic barrier.h for ARC, but failed to delete the existing file.
In 3.15, due to rcupdate.h updates, this causes a build breakage on ARC:
CC arch/arc/kernel/asm-offsets.s
In file included from include/linux/sched.h:45:0,
from arch/arc/kernel/asm-offsets.c:9:
include/linux/rculist.h: In function __list_add_rcu:
include/linux/rculist.h:54:2: error: implicit declaration of function smp_store_release [-Werror=implicit-function-declaration]
rcu_assign_pointer(list_next_rcu(prev), new);
^
Commit 5d26a105b5a7 ("crypto: prefix module autoloading with "crypto-"")
changed the automatic module loading when requesting crypto algorithms
to prefix all module requests with "crypto-". This requires all crypto
modules to have a crypto specific module alias even if their file name
would otherwise match the requested crypto algorithm.
Even though commit 5d26a105b5a7 added those aliases for a vast amount of
modules, it was missing a few. Add the required MODULE_ALIAS_CRYPTO
annotations to those files to make them get loaded automatically, again.
This fixes, e.g., requesting 'ecb(blowfish-generic)', which used to work
with kernels v3.18 and below.
Also change MODULE_ALIAS() lines to MODULE_ALIAS_CRYPTO(). The former
won't work for crypto modules any more.
This prefixes all crypto module loading with "crypto-" so we never run
the risk of exposing module auto-loading to userspace via a crypto API,
as demonstrated by Mathias Krause:
This commit broke on x86 PV because entries in the generic SWIOTLB are
indexed using (pseudo-)physical address not DMA address and these are
not the same in a x86 PV guest.
Make sure there is enough room for the nfnetlink header in the
netlink messages that are part of the batch. There is a similar
check in netlink_rcv_skb().
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When a key is being garbage collected, it's key->user would get put before
the ->destroy() callback is called, where the key is removed from it's
respective tracking structures.
This leaves a key hanging in a semi-invalid state which leaves a window open
for a different task to try an access key->user. An example is
find_keyring_by_name() which would dereference key->user for a key that is
in the process of being garbage collected (where key->user was freed but
->destroy() wasn't called yet - so it's still present in the linked list).
This would cause either a panic, or corrupt memory.
Commit 0dbc6078c06bc0 ('x86, build, pci: Fix PCI_MSI build on !SMP')
introduced the dependency that X86_UP_APIC is only available when
PCI_MSI is false. This effectively prevents PCI_MSI support on 32bit
UP systems because it disables both APIC and IO-APIC. But APIC support
is architecturally required for PCI_MSI.
The intention of the patch was to enforce APIC support when PCI_MSI is
enabled, but failed to do so.
Remove the !PCI_MSI dependency from X86_UP_APIC and enforce
X86_UP_APIC when PCI_MSI support is enabled on 32bit UP systems.
[ tglx: Massaged changelog ]
Fixes 0dbc6078c06bc0 'x86, build, pci: Fix PCI_MSI build on !SMP' Signed-off-by: Bryan O'Donoghue <pure.logic@nexus-software.ie> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Link: http://lkml.kernel.org/r/1421967529-9037-1-git-send-email-pure.logic@nexus-software.ie Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Strictly speaking, this code was never correct. It should have set
read_exec_only and seg_not_present to 1 to indicate that it wanted
to find a free slot without putting anything there, or it should
have put something sensible in the TLS slot if it wanted to allocate
a TLS entry for real. The actual effect of this code was to
allocate a bogus segment that could be used to exploit espfix.
The set_thread_area hardening patches changed the behavior, causing
set_thread_area to return -EINVAL and crashing the game.
This changes set_thread_area to interpret this as a request to find
a free slot and to leave it empty, which isn't *quite* what the game
expects but should be close enough to keep it working. In
particular, using the code above to allocate two segments will
allocate the same segment both times.
According to FrostbittenKing on Github, this fixes The Witcher 2.
If this somehow still causes problems, we could instead allocate
a limit==0 32-bit data segment, but that seems rather ugly to me.
32-bit programs don't have an lm bit in their ABI, so they can't
reliably cause LDT_empty to return true without resorting to memset.
They shouldn't need to do this.
This should fix a longstanding, if minor, issue in all 64-bit kernels
as well as a potential regression in the TLS hardening code.
On 64-bit, relocation is not required unless the load address gets
changed. Without this, relocations do unexpected things when the kernel
is above 4G.
Reported-by: Baoquan He <bhe@redhat.com> Signed-off-by: Kees Cook <keescook@chromium.org> Tested-by: Thomas D. <whissi@whissi.de> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Junjie Mao <eternal.n08@gmail.com> Cc: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/20150116005146.GA4212@www.outflux.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Many users see this message when booting without knowning that it is
of no importance and that TSC calibration may have succeeded by
another way.
As explained by Paul Bolle in
http://lkml.kernel.org/r/1348488259.1436.22.camel@x61.thuisdomein
"Fast TSC calibration failed" should not be considered as an error
since other calibration methods are being tried afterward. At most,
those send a warning if they fail (not an error). So let's change
the message from error to warning.
[ tglx: Make if pr_info. It's really not important at all ]
Fixes: c767a54ba065 x86/debug: Add KERN_<LEVEL> to bare printks, convert printks to pr_<level> Signed-off-by: Alexandre Demers <alexandre.f.demers@gmail.com> Link: http://lkml.kernel.org/r/1418106470-6906-1-git-send-email-alexandre.f.demers@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
EXYNOS4_MCT_L_MASK is defined as 0xffffff00, so applying this bitmask
produces a number outside the range 0x00 to 0xff, which always results
in execution of the default switch statement.
Obviously this is wrong and git history shows that the bitmask inversion
was incorrectly set during a refactoring of the MCT code.
Fix this by putting the inversion at the correct position again.
Acked-by: Kukjin Kim <kgene.kim@samsung.com> Reported-by: GP Orcullo <kinsamanka@gmail.com> Reviewed-by: Doug Anderson <dianders@chromium.org> Signed-off-by: Tobias Jakobi <tjakobi@math.uni-bielefeld.de> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When changing flags in the CAN drivers ctrlmode the provided new content has to
be checked whether the bits are allowed to be changed. The bits that are to be
changed are given as a bitfield in cm->mask. Therefore checking against
cm->flags is wrong as the content can hold any kind of values.
The iproute2 tool sets the bits in cm->mask and cm->flags depending on the
detected command line options. To be robust against bogus user space
applications additionally sanitize the provided flags with the provided mask.
Cc: Wolfgang Grandegger <wg@grandegger.com> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It really needs to check that src is non-directory *and* use
{un,}lock_two_nodirectories(). As it is, it's trivial to cause
double-lock (ioctl(fd, CIFS_IOC_COPYCHUNK_FILE, fd)) and if the
last argument is an fd of directory, we are asking for trouble
by violating the locking order - all directories go before all
non-directories. If the last argument is an fd of parent
directory, it has 50% odds of locking child before parent,
which will cause AB-BA deadlock if we race with unlink().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On Armada XP, 375 and 38x the MBus window 13 has the remap capability,
like windows 0 to 7. However, the mvebu-mbus driver isn't currently
taking into account this special case, which means that when window 13
is actually used, the remap registers are left to 0, making the device
using this MBus window unavailable.
As a minimal fix for stable, don't use window 13. A full fix will
follow later.
Fixes: fddddb52a6c ("bus: introduce an Marvell EBU MBus driver") Reviewed-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Verify that the frequency value from userspace is valid and makes sense.
Unverified values can cause overflows later on.
Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
[jstultz: Fix up bug for negative values and drop redunent cap check] Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
An unvalidated user input is multiplied by a constant, which can result in
an undefined behaviour for large values. While this is validated later,
we should avoid triggering undefined behaviour.
Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
[jstultz: include trivial milisecond->microsecond correction noticed
by Andy] Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Introduce a new variable to count the number of allocated migration
structures. The existing variable cache->nr_migrations became
overloaded. It was used to:
i) track of the number of migrations in flight for the purposes of
quiescing during suspend.
ii) to estimate the amount of background IO occuring.
Recent discard changes meant that REQ_DISCARD bios are processed with
a migration. Discards are not background IO so nr_migrations was not
incremented. However this could cause quiescing to complete early.
(i) is now handled with a new variable cache->nr_allocated_migrations.
cache->nr_migrations has been renamed cache->nr_io_migrations.
cleanup_migration() is now called free_io_migration(), since it
decrements that variable.
Also, remove the unused cache->next_migration variable that got replaced
with with prealloc_structs a while ago.
Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If a DM table is reloaded with an inactive table when the device is not
suspended (normal procedure for LVM2), then there will be two dm-bufio
objects that can diverge. This can lead to a situation where the
inactive table uses bufio to read metadata at the same time the active
table writes metadata -- resulting in the inactive table having stale
metadata buffers once it is promoted to the active table slot.
Fix this by using reference counting and a global list of cache metadata
objects to ensure there is only one metadata object per metadata device.
Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes a race condition in abort handling that was injected
when multiple interrupt support was added. When only a single
interrupt is present, the adapter guarantees it will send
responses for aborted commands prior to the response for the
abort command itself. With multiple interrupts, these responses
generally come back on different interrupts, so we need to
ensure the abort thread waits until the aborted command is
complete so we don't perform a double completion. This race
condition was being hit frequently in environments which
were triggering command timeouts, which was resulting in
a double completion causing a kernel oops.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com> Reviewed-by: Wendy Xiong <wenxiong@linux.vnet.ibm.com> Tested-by: Wendy Xiong <wenxiong@linux.vnet.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reports against the TL-WDN4800 card indicate that PCI bus reset of this
Atheros device cause system lock-ups and resets. I've also been able to
confirm this behavior on multiple systems. The device never returns from
reset and attempts to access config space of the device after reset result
in hangs. Blacklist bus reset for the device to avoid this issue.
[bhelgaas: This regression appeared in v3.14. Andreas bisected it to 425c1b223dac ("PCI: Add Virtual Channel to save/restore support"), but we
don't understand the mechanism by which that commit affects the reset
path.]
[bhelgaas: changelog, references] Link: http://lkml.kernel.org/r/20140923210318.498dacbd@dualc.maya.org Reported-by: Andreas Hartmann <andihartmann@freenet.de> Tested-by: Andreas Hartmann <andihartmann@freenet.de> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Enable a mechanism for devices to quirk that they do not behave when
doing a PCI bus reset. We require a modest level of spec compliant
behavior in order to do a reset, for instance the device should come
out of reset without throwing errors and PCI config space should be
accessible after reset. This is too much to ask for some devices.
If CONFIG_DEBUG_MUTEXES is set, the mutex->owner field is only cleared
if the mutex debugging is enabled which introduces a race in our
mutex_is_locked_by() - i.e. we may inspect the old owner value before it
is acquired by the new task.
This is the root cause of this error:
# diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
# index 5cf6731..3ef3736 100644
# --- a/kernel/locking/mutex-debug.c
# +++ b/kernel/locking/mutex-debug.c
# @@ -80,13 +80,13 @@ void debug_mutex_unlock(struct mutex *lock)
# DEBUG_LOCKS_WARN_ON(lock->owner != current);
#
# DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
# - mutex_clear_owner(lock);
# }
#
# /*
# * __mutex_slowpath_needs_to_unlock() is explicitly 0 for debug
# * mutexes so that we can do it here after we've verified state.
# */
# + mutex_clear_owner(lock);
# atomic_set(&lock->count, 1);
# }
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=87955 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Compiling SH with gcc-4.8 fails due to the -m32 option not being
supported.
From http://buildd.debian-ports.org/status/fetch.php?pkg=linux&arch=sh4&ver=3.16.7-ckt4-1&stamp=1421425783
CC init/main.o
gcc-4.8: error: unrecognized command line option '-m32'
ld: cannot find init/.tmp_mc_main.o: No such file or directory
objcopy: 'init/.tmp_mx_main.o': No such file
rm: cannot remove 'init/.tmp_mx_main.o': No such file or directory
rm: cannot remove 'init/.tmp_mc_main.o': No such file or directory
It is possible for ata_sff_flush_pio_task() to set ap->hsm_task_state to
HSM_ST_IDLE in between the time __ata_sff_port_intr() checks for HSM_ST_IDLE
and before it calls ata_sff_hsm_move() causing ata_sff_hsm_move() to BUG().
This problem is hard to reproduce making this patch hard to verify, but this
fix will prevent the race.
I have not been able to reproduce the problem, but here is a crash dump from
a 2.6.32 kernel.
On examining the ata port's state, its hsm_task_state field has a value of HSM_ST_IDLE:
Normally, this should not be possible as ata_sff_hsm_move() was called from ata_sff_host_intr(),
which checks hsm_task_state and won't call ata_sff_hsm_move() if it has a HSM_ST_IDLE value.
This patch is to fix two deadlock cases.
Deadlock 1:
CPU #1
pinctrl_register-> pinctrl_get ->
create_pinctrl
(Holding lock pinctrl_maps_mutex)
-> get_pinctrl_dev_from_devname
(Trying to acquire lock pinctrldev_list_mutex)
CPU #0
pinctrl_unregister
(Holding lock pinctrldev_list_mutex)
-> pinctrl_put ->> pinctrl_free ->
pinctrl_dt_free_maps -> pinctrl_unregister_map
(Trying to acquire lock pinctrl_maps_mutex)
Simply to say
CPU#1 is holding lock A and trying to acquire lock B,
CPU#0 is holding lock B and trying to acquire lock A.
Deadlock 2:
CPU #3
pinctrl_register-> pinctrl_get ->
create_pinctrl
(Holding lock pinctrl_maps_mutex)
-> get_pinctrl_dev_from_devname
(Trying to acquire lock pinctrldev_list_mutex)
CPU #2
pinctrl_unregister
(Holding lock pctldev->mutex)
-> pinctrl_put ->> pinctrl_free ->
pinctrl_dt_free_maps -> pinctrl_unregister_map
(Trying to acquire lock pinctrl_maps_mutex)
CPU #0
tegra_gpio_request
(Holding lock pinctrldev_list_mutex)
-> pinctrl_get_device_gpio_range
(Trying to acquire lock pctldev->mutex)
Simply to say
CPU#3 is holding lock A and trying to acquire lock D,
CPU#2 is holding lock B and trying to acquire lock A,
CPU#0 is holding lock D and trying to acquire lock B.
Signed-off-by: Jim Lin <jilin@nvidia.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The gpio device attributes were never destroyed when the gpio was
unexported (or on export failures).
Use device_create_with_groups() to create the default device attributes
of the gpio class device. Note that this also fixes the
attribute-creation race with userspace for these attributes.
Remove contingent attributes in export error path and on unexport.
git commit 37f81fa1f63ad38e16125526bb2769ae0ea8d332
"n_tty: do O_ONLCR translation as a single write"
surfaced a bug in the 3215 device driver. In combination this
broke tab expansion for tty ouput.
The cause is an asymmetry in the behaviour of tty3215_ops->write
vs tty3215_ops->put_char. The put_char function scans for '\t'
but the write function does not.
As the driver has logic for the '\t' expansion remove XTABS
from c_oflag of the initial termios as well.
Reported-by: Stephen Powell <zlinuxman@wowway.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The ccw_device_start in raw3215_start_io can fail. raw3215_try_io
does not check if the request could be started and removes any
pending timer. This can leave the system in a hanging state.
Check for pending request after raw3215_start_io and start a
timer if necessary.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
During file system stress testing on 3.10 and 3.12 based kernels, the
umount command occasionally hung in fsnotify_unmount_inodes in the
section of code:
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
spin_unlock(&inode->i_lock);
continue;
}
As this section of code holds the global inode_sb_list_lock, eventually
the system hangs trying to acquire the lock.
Multiple crash dumps showed:
The inode->i_state == 0x60 and i_count == 0 and i_sb_list would point
back at itself. As this is not the value of list upon entry to the
function, the kernel never exits the loop.
To help narrow down problem, the call to list_del_init in
inode_sb_list_del was changed to list_del. This poisons the pointers in
the i_sb_list and causes a kernel to panic if it transverse a freed
inode.
Subsequent stress testing paniced in fsnotify_unmount_inodes at the
bottom of the list_for_each_entry_safe loop showing next_i had become
free.
We believe the root cause of the problem is that next_i is being freed
during the window of time that the list_for_each_entry_safe loop
temporarily releases inode_sb_list_lock to call fsnotify and
fsnotify_inode_delete.
The code in fsnotify_unmount_inodes attempts to prevent the freeing of
inode and next_i by calling __iget. However, the code doesn't do the
__iget call on next_i
if i_count == 0 or
if i_state & (I_FREEING | I_WILL_FREE)
The patch addresses this issue by advancing next_i in the above two cases
until we either find a next_i which we can __iget or we reach the end of
the list. This makes the handling of next_i more closely match the
handling of the variable "inode."
The time to reproduce the hang is highly variable (from hours to days.) We
ran the stress test on a 3.10 kernel with the proposed patch for a week
without failure.
During list_for_each_entry_safe, next_i is becoming free causing
the loop to never terminate. Advance next_i in those cases where
__iget is not done.
Signed-off-by: Jerry Hoemann <jerry.hoemann@hp.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ken Helias <kenhelias@firemail.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
__netdev_adjacent_dev_insert may add adjust device of different net
namespace, without proper check it leads to emergence of broken
sysfs links from/to devices in another namespace.
Fix: rewrite netdev_adjacent_is_neigh_list macro as a function,
move net_eq check into netdev_adjacent_is_neigh_list.
(thanks David)
related to: 4c75431ac3520631f1d9e74aa88407e6374dbbc4
Signed-off-by: Alexander Fomichev <git.user@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Miquel van Smoorenburg <mikevs@xs4all.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Code manipulating sysfs symlinks on adjacent net_devices(s)
currently doesn't take into account that devices potentially
belong to different namespaces.
This patch trying to fix an issue as follows:
- check for net_ns before creating / deleting symlink.
for now only netdev_adjacent_rename_links and
__netdev_adjacent_dev_remove are affected, afaics
__netdev_adjacent_dev_insert implies both net_devs
belong to the same namespace.
- Drop all existing symlinks to / from all adj_devs before
switching namespace and recreate them just after.
Signed-off-by: Alexander Y. Fomichev <git.user@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Miquel van Smoorenburg <mikevs@xs4all.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We could be reading 8 bytes into a 4 byte buffer here. It seems
harmless but adding a check is the right thing to do and it silences a
static checker warning.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
this was very wrong - mempool_alloc() only guarantees success with GFP_WAIT.
bcache uses GFP_NOWAIT in various other places where we have a fallback,
circuits must've gotten crossed when writing this code or something.
Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Gabriel de Perthuis <g2p.code@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In situations such as bond failover, The new session establishment
implicitly invokes the termination of the old connection.
So, we don't want to wait for the old connection wait_conn to completely
terminate before we accept the new connection and post a login response.
The solution is to deffer the comp_wait completion and the conn_put to
a work so wait_conn will effectively be non-blocking (flush errors are
assumed to come very fast).
We allocate isert_release_wq with WQ_UNBOUND and WQ_UNBOUND_MAX_ACTIVE
to spread the concurrency of release works.
The np listener cm_id will also get ADDR_CHANGE event
upcall (in case it is bound to a specific IP). Handle
it correctly by creating a new cm_id and implicitly
destroy the old one.
Since this is the second event a listener np cm_id may
encounter, we move the np cm_id event handling to a
routine.
Take isert_conn pointer from cm_id->qp->qp_context. This
will allow us to know that the cm_id context is always
the network portal. This will make the cm_id event check
(connection or network portal) more reliable.
In order to avoid a NULL dereference in cma_id->qp->qp_context
we destroy the qp after we destroy the cm_id (and make the
dereference safe). session stablishment/teardown sequences
can happen in parallel, we should take into account that
connected_handler might race with connection teardown flow.
Also, protect isert_conn->conn_device->active_qps decrement
within the error patch during QP creation failure and the
normal teardown path in isert_connect_release().
Squashed:
iser-target: Decrement completion context active_qps in error flow
There is no point in accepting a new CM request only
when we are completely done with the last iscsi login.
Instead we accept immediately, this will also cause the
CM connection to reach connected state and the initiator
is allowed to send the first login. We mark that we got
the initial login and let iscsi layer pick it up when it
gets there.
This reduces the parallel login sequence by a factor of
more then 4 (and more for multi-login) and also prevents
the initiator (who does all logins in parallel) from
giving up on login timeout expiration.
In order to support multiple login requests sequence (CHAP)
we call isert_rx_login_req from isert_rx_completion insead
of letting isert_get_login_rx call it.
Squashed:
iser-target: Use kref_get_unless_zero in connected_handler
iser-target: Acquire conn_mutex when changing connection state
iser-target: Reject connect request in failure path
ISER_CONN_UP state is not sufficient to know if
we should wait for completion of flush errors and
disconnected_handler event.
Instead, split it to 2 states:
- ISER_CONN_UP: Got to CM connected phase, This state
indicates that we need to wait for a CM disconnect
event before going to teardown.
- ISER_CONN_FULL_FEATURE: Got to full feature phase
after we posted login response, This state indicates
that we posted recv buffers and we need to wait for
flush completions before going to teardown.
Also avoid deffering disconnected handler to a work,
and handle it within disconnected handler.
More work here is needed to handle DEVICE_REMOVAL event
correctly (cleanup all resources).
Squashed:
iser-target: Don't deffer disconnected handler to a work
Since commit 0fc4ea701fcf ("Target/iser: Don't put isert_conn inside
disconnected handler") we put the conn kref in isert_wait_conn, so we
need .wait_conn to be invoked also in the error path.
Introduce call to isert_conn_terminate (called under lock)
which transitions the connection state to TERMINATING and calls
rdma_disconnect. If the state is already teminating, just bail
out back (temination started).
Also, make sure to destroy the connection when getting a connect
error event if didn't get to connected (state UP). Same for the
handling of REJECTED and UNREACHABLE cma events.
Squashed:
iscsi-target: Add call to wait_conn in establishment error flow
While looking at hch's recent conversion to drop the MSG_*_TAG
definitions, I noticed a long standing bug in vhost-scsi where
the VIRTIO_SCSI_S_* attribute definitions where incorrectly
being passed directly into target_submit_cmd_map_sgls().
This patch adds the missing virtio-scsi to TCM/SAM task attribute
conversion.
Cc: Christoph Hellwig <hch@lst.de> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
tcm_loop has the I_T nexus associated with the HBA. This causes
commands to become misdirected if the HBA has more than one
target portal group; any command is then being sent to the
first target portal group instead of the correct one.
The nexus needs to be associated with the target portal group
instead.
Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The SCSI command tag is set to the tag assigned from the block
layer, not the SCSI-II tag message. So we need to convert
it into the correct SCSI-II tag message based on the
device flags, not the tag value itself.
Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This commit fixes a race whereby nlmclnt_init() first starts the lockd
daemon, and then calls nlm_bind_host() with the expectation that
nlmsvc_timeout has already been initialised. Unfortunately, there is no
no synchronisation between lockd() and lockd_up() to guarantee that this
is the case.
Fix is to move the initialisation of nlmsvc_timeout into lockd_create_svc
Commit a074335a370e ("x86, um: Mark system call tables readonly") was
supposed to mark the sys_call_table in UML as RO by adding the const,
but it doesn't have the desired effect as it's nevertheless being placed
into the data section since __cacheline_aligned enforces sys_call_table
being placed into .data..cacheline_aligned instead. We need to use
the ____cacheline_aligned version instead to fix this issue.
Before:
$ nm -v arch/x86/um/sys_call_table_64.o | grep -1 "sys_call_table"
U sys_writev 0000000000000000 D sys_call_table 0000000000000000 D syscall_table_size
After:
$ nm -v arch/x86/um/sys_call_table_64.o | grep -1 "sys_call_table"
U sys_writev 0000000000000000 R sys_call_table 0000000000000000 D syscall_table_size
Fixes: a074335a370e ("x86, um: Mark system call tables readonly") Cc: H. Peter Anvin <hpa@zytor.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
futex_atomic_cmpxchg_inatomic() does not work on UML because
it triggers a copy_from_user() in kernel context.
On UML copy_from_user() can only be used if the kernel was called
by a real user space process such that UML can use ptrace()
to fetch the value.
Reported-by: Miklos Szeredi <miklos@szeredi.hu> Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Richard Weinberger <richard@nod.at> Tested-by: Daniel Walter <d.walter@0x90.at> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
"origPtr" is used as an offset into the bd->dbuf[] array. That array is
allocated in start_bunzip() and has "bd->dbufSize" number of elements so
the test here should be >= instead of >.
Later we check "origPtr" again before using it as an offset so I don't
know if this bug can be triggered in real life.
Fixes: bc22c17e12c1 ('bzip2/lzma: library support for gzip, bzip2 and lzma decompression') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Alain Knaff <alain@knaff.lu> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The sh73a0 INTC can't mask interrupts properly most likely due to a
hardware bug. Set the .control_parent flag to delegate masking to the
parent interrupt controller, like was already done for irqpin1.
Without this, accessing the three-axis digital accelerometer ADXL345
on kzm9g through /dev/input/event1 causes an interrupt storm, which
requires a power-cycle to recover from.
This was inspired by a patch for arch/arm/boot/dts/sh73a0.dtsi from
Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>.
The post dividers do not work on i.MX6Q rev T0 1.0 so they must be fixed
to 1. As the table index was wrong, a divider a of 4 could still be
requested which implied the clock not to be set properly. This is the
root cause of the HDMI not working at high resolution on rev T0 1.0 of
the SoC.
On some laptops, keyboard needs to be reset in order to successfully detect
touchpad (e.g., some Gigabyte laptop models with Elantech touchpads).
Without resettin keyboard touchpad pretends to be completely dead.
Based on the original patch by Mateusz Jończyk this version has been
expanded to include DMI based detection & application of the fix
automatically on the affected models of laptops. This has been confirmed to
fix problem by three users already on three different models of laptops.