Steve Capper [Wed, 26 Dec 2012 05:42:13 +0000 (11:12 +0530)]
ARM: mm: Add NUMA support.
This patch adds support for NUMA (running on either discontiguous
or sparse memory).
At the moment, the number of nodes has to be specified on the
command line. One can also, optionally, specify the memory size of
each node (otherwise the memory range is split roughly equally
between nodes).
CPUs can be striped across nodes (cpu number modulo the number of
nodes), or assigned to a node based on their
topology_physical_package_id. So for instance on a TC2, the A7
cores can be grouped together in one node and the A15s grouped
together in another node.
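A minimal sketch of the two placement policies described above (illustrative only; the function names are made up, not the patch's actual code):
    /* Striping: cpu number modulo the number of nodes. */
    static int cpu_to_node_striped(int cpu, int nr_nodes)
    {
            return cpu % nr_nodes;
    }

    /* Topology-based: group CPUs by physical package, so e.g. on TC2
     * the A7 cluster and the A15 cluster land in different nodes. */
    static int cpu_to_node_by_package(int cpu)
    {
            return topology_physical_package_id(cpu);
    }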
Signed-off-by: Steve Capper <steve.capper@arm.com>
Steve Capper [Thu, 29 Nov 2012 15:14:14 +0000 (15:14 +0000)]
ARM: Consider memblocks in mem_init and show_mem.
This is based on Michael Spang's patch [1]; and is my attempt at
applying the feedback from Russell [2].
With discontiguous memory (a requirement for running NUMA on some
systems), membanks may not necessarily be representable as
contiguous blocks of struct page *s. This patch updates the page
scanning code in mem_init and show_mem to consider pages in the
intersection of membanks and memblocks instead.
We can't consider memblocks alone because, under sparse memory
configurations, contiguous physical membanks won't necessarily have
a contiguous memory map (but they may be merged into the same memblock).
Only memory blocks in the "memory" region were considered, as the
"reserved" region was found to always overlap "memory"; all the
memory banks are added with memblock_add (which adds to "memory")
and no instances were found where memory was added to "reserved"
then removed from "memory".
In mem_init we are running on one CPU, and I can't see the
memblocks changing whilst being enumerated.
In show_mem, we can be running on multiple CPUs; whilst the
memblock manipulation functions are annotated as __init, this
doesn't stop memblocks being manipulated during bootup. I can't
see any place where memblocks are removed or merged other than
driver initialisation (memblock_steal) or boot memory
initialisation.
One consequence of using memblocks in show_mem is that we are
unable to define ARCH_DISCARD_MEMBLOCK.
ARM: mm: Transparent huge page support for LPAE systems.
The patch adds support for THP (transparent huge pages) to LPAE systems. When
this feature is enabled, the kernel tries to map anonymous pages as 2MB
sections where possible.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
[steve.capper@arm.com: symbolic constants used, value of PMD_SECT_SPLITTING adjusted, tlbflush.h included in pgtable.h]
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Steve Capper <steve.capper@arm.com>
Steve Capper [Thu, 19 Jul 2012 10:57:45 +0000 (11:57 +0100)]
ARM: mm: HugeTLB support for non-LPAE systems.
Based on Bill Carson's HugeTLB patch, with the big difference being in the way
PTEs are passed back to the memory manager. Rather than store a "Linux Huge
PTE" separately, we make one up on the fly in huge_ptep_get. Also, rather than
consider 16M supersections, we focus solely on 2x1M sections.
To construct a huge PTE on the fly we need additional information (such as the
accessed flag and dirty bit) which we choose to store in the domain bits of the
short section descriptor. In order to use these domain bits for storage, we need
to make ourselves a client for all 16 domains and this is done in head.S.
Storing extra information in the domain bits also makes it a lot easier to
implement Transparent Huge Pages, and some of the code in pgtable-2level.h is
arranged to facilitate THP support in a later patch.
Non-LPAE HugeTLB pages are incompatible with the huge page migration code
(enabled when CONFIG_MEMORY_FAILURE is selected) as that code dereferences PTEs
directly, rather than calling huge_ptep_get and set_huge_pte_at.
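A minimal standalone sketch of the trick (bit positions for the domain field are the short-descriptor ones, bits [8:5]; the flag values and names are illustrative, not the patch's actual definitions):
    #include <stdint.h>

    typedef uint32_t pmdval_t;

    /* The short-descriptor section entry keeps its domain field in
     * bits [8:5]; once head.S makes us a client of all 16 domains,
     * those bits are free for software state. */
    #define SECT_DOM_SHIFT  5
    #define SECT_SW_DIRTY   ((pmdval_t)1 << SECT_DOM_SHIFT)
    #define SECT_SW_YOUNG   ((pmdval_t)2 << SECT_DOM_SHIFT)

    /* huge_ptep_get() can then fabricate a "Linux huge PTE" on the
     * fly instead of storing one separately. */
    #define HPTE_DIRTY      (1u << 0)   /* illustrative flag values */
    #define HPTE_YOUNG      (1u << 1)

    static uint32_t fabricate_huge_pte(pmdval_t sect)
    {
            uint32_t pte = 0;

            if (sect & SECT_SW_DIRTY)
                    pte |= HPTE_DIRTY;
            if (sect & SECT_SW_YOUNG)
                    pte |= HPTE_YOUNG;
            return pte;
    }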
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Steve Capper <steve.capper@arm.com>
This patch adds support for hugetlbfs based on the x86 implementation.
It allows mapping of 2MB sections (see Documentation/vm/hugetlbpage.txt
for usage). The 64K pages configuration is not supported (section size
is 512MB in this case).
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
[steve.capper@arm.com: symbolic constants replace numbers in places. Split up into multiple files, to simplify future non-LPAE support].
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Steve Capper <steve.capper@arm.com>
Steve Capper [Thu, 19 Jul 2012 10:51:50 +0000 (11:51 +0100)]
ARM: mm: Add support for flushing HugeTLB pages.
On ARM we use the __flush_dcache_page function to flush the dcache of pages
when needed; usually when the PG_dcache_clean bit is unset and we are setting a
PTE.
A HugeTLB page is represented as a compound page consisting of an array of
pages. Thus to flush the dcache of a HugeTLB page, one must flush more than a
single page.
This patch modifies __flush_dcache_page such that all constituent pages of a
HugeTLB page are flushed.
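Roughly, the flush becomes a loop over the compound page (a sketch; the real function also deals with highmem and aliasing concerns):
    /* Flush every small page backing the compound (huge) page. */
    static void flush_dcache_compound(struct page *page)
    {
            unsigned int i, nr = 1 << compound_order(page);

            for (i = 0; i < nr; i++)
                    __cpuc_flush_dcache_area(page_address(page + i),
                                             PAGE_SIZE);
    }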
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Steve Capper <steve.capper@arm.com>
This allows us to define L_PTE_PRESENT as (3 << 0) and use this value to
create ptes directly. However, when determining whether a given pte
value is present in the low-level page table accessors, we only need to
check the least significant bit of the descriptor, allowing us to write
faulting, present entries which are required for PROT_NONE mappings.
This patch introduces L_PTE_VALID, which can be used to test whether a
pte should fault, and updates the low-level page table accessors
accordingly.
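In outline (a standalone sketch of the bit layout described above; accessor bodies are illustrative):
    #include <stdint.h>
    #include <stdbool.h>

    typedef uint32_t pteval_t;

    #define L_PTE_VALID     ((pteval_t)1 << 0)  /* hardware checks only this */
    #define L_PTE_PRESENT   ((pteval_t)3 << 0)

    /* Does the hardware fault on this entry? */
    static bool pte_hw_faults(pteval_t pte)
    {
            return !(pte & L_PTE_VALID);
    }

    /* A PROT_NONE entry: still recognisable as present by software,
     * but with the valid bit cleared so any hardware access faults. */
    static pteval_t pte_mk_none(pteval_t pte)
    {
            return pte & ~L_PTE_VALID;
    }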
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Steve Capper <steve.capper@arm.com>
Will Deacon [Tue, 2 Oct 2012 10:18:52 +0000 (11:18 +0100)]
mm: thp: Set the accessed flag for old pages on access fault.
On x86 memory accesses to pages without the ACCESSED flag set result in the
ACCESSED flag being set automatically. With the ARM architecture a page access
fault is raised instead (and it will continue to be raised until the ACCESSED
flag is set for the appropriate PTE/PMD).
For normal memory pages, handle_pte_fault will call pte_mkyoung (effectively
setting the ACCESSED flag). For transparent huge pages, pmd_mkyoung will only
be called for a write fault.
This patch ensures that an access fault on a transparent huge page which
does not result in a CoW updates the access flags for the faulting pmd.
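The handler is essentially of this shape (a sketch modeled on the description above, not the verbatim commit):
    /* On an access (non-write) fault against a huge pmd, just mark
     * it young and let the architecture update its state. */
    void huge_pmd_set_accessed(struct mm_struct *mm,
                               struct vm_area_struct *vma,
                               unsigned long address, pmd_t *pmd,
                               pmd_t orig_pmd, int dirty)
    {
            pmd_t entry;

            spin_lock(&mm->page_table_lock);
            if (unlikely(!pmd_same(*pmd, orig_pmd)))
                    goto unlock;            /* raced with another fault */
            entry = pmd_mkyoung(orig_pmd);
            if (pmdp_set_access_flags(vma, address & HPAGE_PMD_MASK,
                                      pmd, entry, dirty))
                    update_mmu_cache_pmd(vma, address, pmd);
    unlock:
            spin_unlock(&mm->page_table_lock);
    }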
Cc: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Steve Capper <steve.capper@arm.com>
Subash Patel [Mon, 10 Dec 2012 11:50:45 +0000 (17:20 +0530)]
ARM: OF: update coherent_dma_mask
This patch is tested on ARM:exynos5250 with LPAE enabled. The coherent_dma_mask
needs to be set to DMA_BIT_MASK(64), as the dma-mapping APIs check it against
a 64-bit mask.
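Concretely (the platform-device pointer here is illustrative; the macro body is quoted from linux/dma-mapping.h):
    #include <linux/dma-mapping.h>

    /* From linux/dma-mapping.h:
     *   #define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))
     * so DMA_BIT_MASK(64) is an all-ones 64-bit mask. */
    pdev->dev.coherent_dma_mask = DMA_BIT_MASK(64);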
Subash Patel [Wed, 5 Dec 2012 05:07:49 +0000 (10:37 +0530)]
ARM: exynos: add coherent dma mask
This patch adds the coherent_dma_mask to the usb/dwc3 node. This is
needed as a check is performed before allocating any coherent buffer
in the dma-mapping framework.
Subash Patel [Wed, 5 Dec 2012 05:04:19 +0000 (10:34 +0530)]
ARM: exynos: add coherent_dma_mask
This patch adds the coherent_dma_mask for the dw_mmc device.
This is needed as the check is now done in the dma-mapping framework
before allocating the buffers.
Subash Patel [Tue, 4 Dec 2012 12:35:18 +0000 (18:05 +0530)]
NET: eth: ax88796: fixup for LPAE
This patch adds a condition for variables declared with the
resource_size_t type. When LPAE is enabled, these are 64-bit,
and the linker throws an error for the missing __aeabi_uldivmod
support in lib1funcs.s. This patch may be safely reverted
when that support is added.
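The underlying problem, sketched (the block-size constant and helper are made up for illustration):
    #include <asm/div64.h>
    #include <linux/ioport.h>

    #define AX_BLOCK_SIZE 0x20      /* illustrative constant */

    static u32 ax_num_blocks(const struct resource *res)
    {
            u64 len = resource_size(res);   /* 64-bit when LPAE is on */

            /* A plain 'len / AX_BLOCK_SIZE' makes GCC emit a call to
             * __aeabi_uldivmod, which the kernel does not link in;
             * do_div() performs the division without the libcall. */
            do_div(len, AX_BLOCK_SIZE);
            return len;
    }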
Subash Patel [Mon, 3 Dec 2012 09:21:55 +0000 (14:51 +0530)]
ARM: LPAE: accommodate >32-bit addresses for page table base
This patch redefines the early boot time use of the R4 register to steal a few
low order bits (ARCH_PGD_SHIFT bits) on LPAE systems. This allows for up to
38-bit physical addresses.
Signed-off-by: Cyril Chemparathy <cyril@ti.com>
Signed-off-by: Vitaly Andrianov <vitalya@ti.com>
Hand-edited as this patch in eml format doesn't apply due to missing blob data
for arch/arm/include/asm/memory.h
Signed-off-by: Subash Patel <subash.rp@samsung.com>
This patch cleans up the highmem sanity check code by simplifying the range
checks with a pre-calculated size_limit. This patch should otherwise have no
functional impact on behavior.
This patch also removes a redundant (bank->start < vmalloc_limit) check, since
this is already covered by the !highmem condition.
ARM: mm: cleanup checks for membank overlap with vmalloc area
On Keystone platforms, physical memory is entirely outside the 32-bit
addressable range. Therefore, the (bank->start > ULONG_MAX) check below marks
the entire system memory as highmem, and this causes unpleasantness all over.
This patch eliminates the extra bank start check (against ULONG_MAX) by
checking bank->start against the physical address corresponding to vmalloc_min
instead.
In the process, this patch also cleans up parts of the highmem sanity check
code by removing what has now become a redundant check for banks that entirely
overlap with the vmalloc range.
ARM: mm: use physical addresses in highmem sanity checks
This patch modifies the highmem sanity checking code to use physical addresses
instead. This change eliminates the wrap-around problems associated with the
original virtual address based checks, and this simplifies the code a bit.
The one constraint imposed here is that low physical memory must be mapped in
a monotonically increasing fashion if there are multiple banks of memory,
i.e., x < y must imply pa(x) < pa(y).
This patch moves the TTBR1 offset calculation and the T1SZ calculation out
of the TTB setup assembly code. This should not affect functionality in
any way, but improves code readability as well as readability of subsequent
patches in this series.
ARM: LPAE: define ARCH_LOW_ADDRESS_LIMIT for bootmem
This patch adds an architecture defined override for ARCH_LOW_ADDRESS_LIMIT.
On PAE systems, the absence of this override causes bootmem to incorrectly
limit itself to 32-bit addressable physical memory.
This patch modifies the switch_mm() processor functions to use phys_addr_t.
On LPAE systems, we now honor the upper 32-bits of the physical address that
is being passed in, and program these into TTBR as expected.
ARM: LPAE: use phys_addr_t for initrd location and size
This patch fixes the initrd setup code to use phys_addr_t instead of assuming
32-bit addressing. Without this we cannot boot on systems where initrd is
located above the 4G physical address limit.
free_memmap() was mistakenly using the unsigned long type to represent
physical addresses. This breaks on PAE systems where memory could be placed
above the 32-bit addressable limit.
This patch fixes this function to properly use phys_addr_t instead.
This patch fixes the alloc_init_pud() function to use phys_addr_t instead of
unsigned long when passing in the phys argument.
This is an extension to commit 97092e0c56830457af0639f6bd904537a150ea4a (ARM:
pgtable: use phys_addr_t for physical addresses), which applied similar changes
elsewhere in the ARM memory management code.
ARM: LPAE: use signed arithmetic for mask definitions
This patch applies to PAGE_MASK, PMD_MASK, and PGDIR_MASK, where forcing
unsigned long math truncates the mask at 32 bits. This clearly does bad
things on PAE systems.
This patch fixes this problem by defining these masks as signed quantities.
We then rely on sign extension to do the right thing.
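A self-contained demonstration of the difference (uint32_t stands in for the 32-bit unsigned long used on ARM; PMD_SHIFT value as on 2-level ARM):
    #include <stdio.h>
    #include <stdint.h>

    #define PMD_SHIFT 21

    /* 32-bit unsigned math: the mask truncates at 32 bits. */
    #define PMD_MASK_U32 ((uint32_t)~((1U << PMD_SHIFT) - 1))
    /* Signed int math: sign-extends when widened to 64 bits. */
    #define PMD_MASK_S32 (~((1 << PMD_SHIFT) - 1))

    int main(void)
    {
            uint64_t pa = 0x880200000ULL;   /* a >4GB LPAE physical address */

            /* High bits lost: prints 0x80200000. */
            printf("unsigned: %#llx\n",
                   (unsigned long long)(pa & PMD_MASK_U32));
            /* High bits kept: prints 0x880200000. */
            printf("signed:   %#llx\n",
                   (unsigned long long)(pa & PMD_MASK_S32));
            return 0;
    }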
This patch adds support for 64-bit physical addresses in virt_to_phys()
patching. This does not do real 64-bit add/sub, but instead patches in the
upper 32-bits of the phys_offset directly into the output of virt_to_phys.
There is no corresponding change on the phys_to_virt() side, because
computations on the upper 32-bits would be discarded anyway.
This patch adds basic sanity tests to ensure that the instruction patching
results in valid instruction encodings. This is done by verifying the output
of the patch process against a vector of assembler generated instructions at
init time.
The original phys_to_virt/virt_to_phys patching implementation relied on early
patching prior to MMU initialization. On PAE systems running out of >4G
address space, this would have entailed an additional round of patching after
switching over to the high address space.
The approach implemented here conceptually extends the original PHYS_OFFSET
patching implementation with the introduction of "early" patch stubs. Early
patch code is required to be functional out of the box, even before the patch
is applied. This is implemented by inserting functional (but inefficient)
load code into the .runtime.patch.code init section. Having functional code
out of the box then allows us to defer the init time patch application until
later in the init sequence.
In addition to fitting better with our need for physical address-space
switch-over, this implementation should be somewhat more extensible by virtue
of its more readable (and hackable) C implementation. This should prove
useful for other similar init time specialization needs, especially in light
of our multi-platform kernel initiative.
This code has been boot tested in both ARM and Thumb-2 modes on an ARMv7
(Cortex-A8) device.
Note: the obtuse use of stringified symbols in patch_stub() and
early_patch_stub() is intentional. Theoretically this should have been
accomplished with formal operands passed into the asm block, but this requires
the use of the 'c' modifier for instantiating the long (e.g. .long %c0).
However, the 'c' modifier has been found to ICE certain versions of GCC, and
therefore we resort to stringified symbols here.
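The shape of the workaround, roughly (the symbol and macro names here are illustrative, not the patch's actual ones):
    #define __stringify_1(x)  #x
    #define __stringify(x)    __stringify_1(x)

    extern unsigned long __patch_target;    /* hypothetical symbol */

    /* What we would like to write, an "i" constraint plus the 'c'
     * output modifier, ICEs some GCC versions:
     *     asm volatile(".long %c0" : : "i" (&__patch_target));
     * So the symbol name is pasted into the asm string instead: */
    static inline void emit_patch_literal(void)
    {
            asm volatile("b       1f\n\t"
                         ".long   " __stringify(__patch_target) "\n"
                         "1:");
    }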
Signed-off-by: Cyril Chemparathy <cyril@ti.com>
Reviewed-by: Nicolas Pitre <nico@linaro.org>
Florian Fainelli [Mon, 10 Dec 2012 20:25:32 +0000 (12:25 -0800)]
Input: matrix-keymap - provide proper module license
The matrix-keymap module is currently lacking a proper module license,
add one so we don't have this module tainting the entire kernel. This
issue has been present since commit 1932811f426f ("Input: matrix-keymap
- uninline and prepare for device tree support").
1) Netlink socket dumping had several missing verifications and checks.
In particular, address comparisons in the request byte code
interpreter could access past the end of the address in the
inet_request_sock.
Also, address family and address prefix lengths were not validated
properly at all.
This means arbitrary applications can read past the end of certain
kernel data structures.
Fixes from Neal Cardwell.
2) ip_check_defrag() operates in contexts where we're in the process
of, or about to, input the packet into the real protocols
(specifically macvlan and AF_PACKET snooping).
Unfortunately, it does a pskb_may_pull() which can modify the
backing packet data, which is not legal if the SKB is shared. It
very much can be shared in this context.
Deal with the possibility that the SKB is segmented by using
skb_copy_bits().
Fix from Johannes Berg based upon a report by Eric Leblond.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
ipv4: ip_check_defrag must not modify skb before unsharing
inet_diag: validate port comparison byte code to prevent unsafe reads
inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run()
inet_diag: validate byte code to prevent oops in inet_diag_bc_run()
inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
This is a revert of a revert of a revert. In addition, it reverts the
even older i915 change to stop using the __GFP_NO_KSWAPD flag due to the
original commits in linux-next.
It turns out that the original patch really was bogus, and that the
original revert was the correct thing to do after all. We thought we
had fixed the problem, and then reverted the revert, but the problem
really is fundamental: waking up kswapd simply isn't the right thing to
do, and direct reclaim sometimes simply _is_ the right thing to do.
When certain allocations fail, we simply should try some direct reclaim,
and if that fails, fail the allocation. That's the right thing to do
for THP allocations, which can easily fail, and the GPU allocations want
to do that too.
So starting kswapd is sometimes simply wrong, and removing the flag that
said "don't start kswapd" was a mistake. Let's hope we never revisit
this mistake again - and certainly not this many times ;)
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Berg [Sun, 9 Dec 2012 23:41:06 +0000 (23:41 +0000)]
ipv4: ip_check_defrag must not modify skb before unsharing
ip_check_defrag() might be called from af_packet within the
RX path where shared SKBs are used, so it must not modify
the input SKB before it has unshared it for defragmentation.
Use skb_copy_bits() to get the IP header and only pull in
everything later.
The same is true for the other caller in macvlan as it is
called from dev->rx_handler which can also get a shared SKB.
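Roughly what the check becomes (a sketch; the helper name is made up, and the real code also validates the header length before trusting it):
    #include <linux/ip.h>
    #include <linux/skbuff.h>

    /* Peek at the IP header of a possibly-shared skb without writing
     * to it; only unshare/pull once we know we have a fragment. */
    static bool skb_looks_like_ipv4_fragment(const struct sk_buff *skb,
                                             int network_offset)
    {
            struct iphdr iph;

            if (skb_copy_bits(skb, network_offset, &iph, sizeof(iph)) < 0)
                    return false;   /* too short to hold an IP header */
            return ip_is_fragment(&iph);
    }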
Reported-by: Eric Leblond <eric@regit.org>
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are going to reinstate the __GFP_NO_KSWAPD flag that has been
removed, the removal reverted, and then removed again, making this
commit a pointless fixup for a problem that was caused by the removal of
the __GFP_NO_KSWAPD flag.
The thing is, we really don't want to wake up kswapd for THP allocations
(because they fail quite commonly under any kind of memory pressure,
including when there is tons of memory free), and these patches were
just trying to fix up the underlying bug: the original removal of
__GFP_NO_KSWAPD in commit c654345924f7 ("mm: remove __GFP_NO_KSWAPD")
was simply bogus.
Neal Cardwell [Sun, 9 Dec 2012 11:09:54 +0000 (11:09 +0000)]
inet_diag: validate port comparison byte code to prevent unsafe reads
Add logic to verify that a port comparison byte code operation
actually has the second inet_diag_bc_op from which we read the port
for such operations.
Previously the code blindly referenced op[1] without first checking
whether a second inet_diag_bc_op struct could fit there. So a
malicious user could make the kernel read 4 bytes beyond the end of
the bytecode array by claiming to have a whole port comparison byte
code (2 inet_diag_bc_op structs) when in fact the bytecode was not
long enough to hold both.
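The added check has roughly this shape (a sketch based on the description above):
    /* A port comparison is two inet_diag_bc_op structs back to back:
     * the operator itself, then a second op carrying the port value.
     * Refuse the bytecode unless both actually fit. */
    static bool valid_port_comparison(const struct inet_diag_bc_op *op,
                                      int len, int *min_len)
    {
            *min_len += sizeof(struct inet_diag_bc_op);
            return len >= *min_len;
    }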
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neal Cardwell [Sat, 8 Dec 2012 19:43:23 +0000 (19:43 +0000)]
inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run()
Add logic to check the address family of the user-supplied conditional
and the address family of the connection entry. We now do not do
prefix matching of addresses from different address families (AF_INET
vs AF_INET6), except for the previously existing support for having an
IPv4 prefix match an IPv4-mapped IPv6 address (which this commit
maintains as-is).
This change is needed for two reasons:
(1) The addresses are different lengths, so comparing a 128-bit IPv6
prefix match condition to a 32-bit IPv4 connection address can cause
us to unwittingly walk off the end of the IPv4 address and read
garbage or oops.
(2) The IPv4 and IPv6 address spaces are semantically distinct, so a
simple bit-wise comparison of the prefixes is not meaningful, and
would lead to bogus results (except for the IPv4-mapped IPv6 case,
which this commit maintains).
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neal Cardwell [Sat, 8 Dec 2012 19:43:22 +0000 (19:43 +0000)]
inet_diag: validate byte code to prevent oops in inet_diag_bc_run()
Add logic to validate INET_DIAG_BC_S_COND and INET_DIAG_BC_D_COND
operations.
Previously we did not validate the inet_diag_hostcond, address family,
address length, and prefix length. So a malicious user could make the
kernel read beyond the end of the bytecode array by claiming to have a
whole inet_diag_hostcond when the bytecode was not long enough to
contain a whole inet_diag_hostcond of the given address family. Or
they could make the kernel read up to about 27 bytes beyond the end of
a connection address by passing a prefix length that exceeded the
length of addresses of the given family.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neal Cardwell [Sat, 8 Dec 2012 19:43:21 +0000 (19:43 +0000)]
inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
Fix inet_diag to be aware of the fact that AF_INET6 TCP connections
instantiated for IPv4 traffic and in the SYN-RECV state were actually
created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This
means that for such connections inet6_rsk(req) returns a pointer to a
random spot in memory up to roughly 64KB beyond the end of the
request_sock.
With this bug, for a server using AF_INET6 TCP sockets and serving
IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to
inet_diag_fill_req() causing an oops or the export to user space of 16
bytes of kernel memory as a garbage IPv6 address, depending on where
the garbage inet6_rsk(req) pointed.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Weiner [Thu, 6 Dec 2012 20:23:25 +0000 (15:23 -0500)]
mm: vmscan: fix inappropriate zone congestion clearing
commit c702418f8a2f ("mm: vmscan: do not keep kswapd looping forever due
to individual uncompactable zones") removed zone watermark checks from
the compaction code in kswapd but left in the zone congestion clearing,
which now happens unconditionally on higher order reclaim.
This messes up the reclaim throttling logic for zones with
dirty/writeback pages, where zones should only lose their congestion
status when their watermarks have been restored.
Remove the clearing from the zone compaction section entirely. The
preliminary zone check and the reclaim loop in kswapd will clear it if
the zone is considered balanced.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 8 Dec 2012 00:48:39 +0000 (16:48 -0800)]
vfs: fix O_DIRECT read past end of block device
The direct-IO write path already had the i_size checks in mm/filemap.c,
but it turns out the read path did not, and removing the block size
checks in fs/block_dev.c (commit bbec0270bdd8: "blkdev_max_block: make
private to fs/buffer.c") removed the magic "shrink IO to past the end of
the device" code there.
Fix it by truncating the IO to the size of the block device, like the
write path already does.
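The clamp itself is simple (a sketch of the read-side logic, not the verbatim fix):
    #include <linux/fs.h>

    /* Shrink a read so it never extends past the end of the device. */
    static size_t clamp_read_to_bdev(struct block_device *bdev,
                                     loff_t pos, size_t count)
    {
            loff_t size = i_size_read(bdev->bd_inode);

            if (pos >= size)
                    return 0;               /* read starts at/after the end */
            if (count > size - pos)
                    count = size - pos;     /* truncate to the device size */
            return count;
    }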
NOTE! I suspect the write path would be *much* better off doing it this
way in fs/block_dev.c, rather than hidden deep in mm/filemap.c. The
mm/filemap.c code is extremely hard to follow, and has various
conditionals on the target being a block device (ie the flag passed in
to 'generic_write_checks()', along with a conditional update of the
inode timestamp etc).
It is also quite possible that we should treat this whole block device
size as a "s_maxbytes" issue, and try to make the logic even more
generic. However, in the meantime this is the fairly minimal targeted
fix.
Noted by Milan Broz thanks to a regression test for the cryptsetup
reencrypt tool.
Reported-and-tested-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
"Two stragglers:
1) The new code that adds new flushing semantics to GRO can cause SKB
pointer list corruption, manage the lists differently to avoid the
OOPS. Fix from Eric Dumazet.
2) When TCP fast open does a retransmit of data in a SYN-ACK or
similar, we update retransmit state that we shouldn't, triggering a
WARN_ON later. Fix from Yuchung Cheng."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
net: gro: fix possible panic in skb_gro_receive()
tcp: bug fix Fast Open client retransmission
Eric Dumazet [Thu, 6 Dec 2012 13:54:59 +0000 (13:54 +0000)]
net: gro: fix possible panic in skb_gro_receive()
commit 2e71a6f8084e (net: gro: selective flush of packets) added
a bug for skbs using frag_list. This part of the GRO stack is rarely
used, as it needs skbs not using a page fragment for their skb->head.
Most drivers do use a page fragment, but some of them use GFP_KERNEL
allocations for the initial fill of their RX ring buffer.
napi_gro_flush() overwrites skb->prev, which was used for these skbs
to point to the last skb in the frag_list.
Fix this using a separate field in struct napi_gro_cb to point to the
last fragment.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Thu, 6 Dec 2012 08:45:32 +0000 (08:45 +0000)]
tcp: bug fix Fast Open client retransmission
If SYN-ACK partially acks SYN-data, the client retransmits the
remaining data by tcp_retransmit_skb(). This increments lost recovery
state variables like tp->retrans_out in Open state. If loss recovery
happens before the retransmission is acked, it triggers the WARN_ON
check in tcp_fastretrans_alert(). For example: the client sends
SYN-data, gets a SYN-ACK acking only the ISN, retransmits data, sends
another 4 data packets and gets 3 dupacks.
Since the retransmission is not caused by network drop it should not
update the recovery state variables. Further, the server may return a
smaller MSS than the cached MSS used for SYN-data, so the retransmission
needs a loop. Otherwise some data will not be retransmitted until timeout
or other loss recovery events.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mel Gorman [Wed, 5 Dec 2012 22:01:41 +0000 (14:01 -0800)]
tmpfs: fix shared mempolicy leak
This fixes a regression in 3.7-rc, which has since gone into stable.
Commit 00442ad04a5e ("mempolicy: fix a memory corruption by refcount
imbalance in alloc_pages_vma()") changed get_vma_policy() to raise the
refcount on a shmem shared mempolicy; whereas shmem_alloc_page() went
on expecting alloc_page_vma() to drop the refcount it had acquired.
This deserves a rework: but for now fix the leak in shmem_alloc_page().
Hugh: shmem_swapin() did not need a fix, but surely it's clearer to use
the same refcounting there as in shmem_alloc_page(), delete its onstack
mempolicy, and the strange mpol_cond_copy() and __mpol_cond_copy() -
those were invented to let swapin_readahead() make an unknown number of
calls to alloc_pages_vma() with one mempolicy; but since 00442ad04a5e,
alloc_pages_vma() has kept refcount in balance, so now no problem.
Johannes Weiner [Tue, 4 Dec 2012 16:11:31 +0000 (11:11 -0500)]
mm: vmscan: do not keep kswapd looping forever due to individual uncompactable zones
When a zone meets its high watermark and is compactable in case of
higher order allocations, it contributes to the percentage of the node's
memory that is considered balanced.
This requirement, that a node be only partially balanced, came about
when kswapd was desperately trying to balance tiny zones when all bigger
zones in the node had plenty of free memory. Arguably, the same should
apply to compaction: if a significant part of the node is balanced
enough to run compaction, do not get hung up on that tiny zone that
might never get in shape.
When the compaction logic in kswapd is reached, we know that at least
25% of the node's memory is balanced properly for compaction (see
zone_balanced and pgdat_balanced). Remove the individual zone checks
that restart the kswapd cycle.
Otherwise, we may observe more endless looping in kswapd where the
compaction code loops back to reclaim because of a single zone and
reclaim does nothing because the node is considered balanced overall.
Mel Gorman [Thu, 6 Dec 2012 19:01:14 +0000 (19:01 +0000)]
mm: compaction: validate pfn range passed to isolate_freepages_block
Commit 0bf380bc70ec ("mm: compaction: check pfn_valid when entering a
new MAX_ORDER_NR_PAGES block during isolation for migration") added a
check for pfn_valid() when isolating pages for migration as the scanner
does not necessarily start pageblock-aligned.
Since commit c89511ab2f8f ("mm: compaction: Restart compaction from near
where it left off"), the free scanner has the same problem. This patch
makes sure that the pfn range passed to isolate_freepages_block() is
within the same block so that pfn_valid() checks are unnecessary.
In answer to Henrik's wondering why others have not reported this:
reproducing it requires a large enough hole with the right alignment to
have compaction walk into a PFN range with no memmap. Size and
alignment depend on the memory model - 4M for FLATMEM and 128M for
SPARSEMEM on x86. It needs a "lucky" machine.
mmc: sh-mmcif: avoid oops on spurious interrupts (second try)
On some systems, e.g., kzm9g, MMCIF interfaces can produce spurious
interrupts without any active request. To prevent the oops that results
in such cases, don't dereference the mmc request pointer until we have
made sure that we are indeed processing such a request.
Heiko Stübner [Sun, 18 Nov 2012 18:50:05 +0000 (19:50 +0100)]
mmc: sdhci-s3c: fix missing clock for gpio card-detect
2abeb5c5ded2 ("Add clk_(enable/disable) in runtime suspend/resume")
added the capability to stop the clocks when the device is runtime
suspended, but forgot to handle the case of the card-detect using
an external gpio.
Therefore in the case that runtime-pm is enabled, start the io-clock
when a card is inserted and stop it again once it is removed.
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Signed-off-by: Chris Ball <cjb@laptop.org>
Linus Torvalds [Thu, 6 Dec 2012 16:42:13 +0000 (08:42 -0800)]
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS fixes from Ralf Baechle:
"These are the fixes for the N32 syscall bugs found by Al, an
extraneous break that broke detection for R3000 and R3081 processors,
an endless loop processing signals for kernel tasks (x86 received the
same fix a while ago) and a fix for transparent huge page which took
ages to track down because it was so hard to come up with a workable
test case."
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: Fix endless loop when processing signals for kernel tasks
MIPS: R3000/R3081: Fix CPU detection.
MIPS: N32: Fix signalfd4 syscall entry point
MIPS: N32: Fix preadv(2) and pwritev(2) entry points.
MIPS: Avoid mcheck by flushing page range in huge_ptep_set_access_flags()
Linus Torvalds [Thu, 6 Dec 2012 16:39:57 +0000 (08:39 -0800)]
Merge branch 'more-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull build fix from Rusty Russell:
"Tim Gardner <tim.gardner@canonical.com> writes:
> It is $(obj)/oid_registry.o that is dependent on $(obj)/oid_registry_data.c.
> The object file cannot be built until $(obj)/oid_registry_data.c has been
> generated.
>
> A periodic and hard to reproduce parallel build failure is due to
> this incorrect lib/Makefile dependency. The compile error is completely
> disingenuous.
>
> GEN lib/oid_registry_data.c
> Compiling 49 OIDs
> CC lib/oid_registry.o
> gcc: error: lib/oid_registry.c: No such file or directory
> gcc: fatal error: no input files
> compilation terminated.
> make[3]: *** [lib/oid_registry.o] Error 4
I can't reproduce it either. It's completely weird; nothing ever
removes lib/oid_registry.c, so either gcc is giving the wrong message
or it's a weird fs with a very odd race.
But your version is definitely more correct than the previous one,
so..."
* 'more-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
lib/Makefile: Fix oid_registry build dependency
Linus Torvalds [Thu, 6 Dec 2012 16:29:08 +0000 (08:29 -0800)]
Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module signing fixes from Rusty Russell:
"David gave me these a month ago, during my git workflow churn :("
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
ASN.1: Fix an indefinite length skip error
MODSIGN: Don't use enum-type bitfields in module signature info block
Tim Gardner [Tue, 4 Dec 2012 19:52:28 +0000 (12:52 -0700)]
lib/Makefile: Fix oid_registry build dependency
It is $(obj)/oid_registry.o that is dependent on $(obj)/oid_registry_data.c.
The object file cannot be built until $(obj)/oid_registry_data.c has been
generated.
A periodic and hard to reproduce parallel build failure is due to
this incorrect lib/Makefile dependency. The compile error is completely
disingenuous.
GEN lib/oid_registry_data.c
Compiling 49 OIDs
CC lib/oid_registry.o
gcc: error: lib/oid_registry.c: No such file or directory
gcc: fatal error: no input files
compilation terminated.
make[3]: *** [lib/oid_registry.o] Error 4
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
MIPS: Fix endless loop when processing signals for kernel tasks
The problem occurs [1] when a kernel-mode task returns from a system
call with a pending signal.
A real-life scenario is a child of 'khelper' returning from a failed
kernel_execve() in ____call_usermodehelper() [ kernel/kmod.c ].
kernel_execve() fails due to a pending SIGKILL, which is the result of
"kill -9 -1" (at least, busybox's init does it upon reboot).
The loop is as follows:
* syscall_exit_work:
- work_pending: // start_of_the_loop
- work_notifysig:
- do_notify_resume()
- do_signal()
- if (!user_mode(regs)) return;
- resume_userspace // TIF_SIGPENDING is still set
- work_pending // so we call work_pending => goto
// start_of_the_loop
More information can be found in another LKML thread:
http://www.serverphorums.com/read.php?12,457826
[1] The problem was also reproduced on !CONFIG_VM86 x86, and the
following fix was accepted.
David Howells [Mon, 22 Oct 2012 14:05:55 +0000 (15:05 +0100)]
ASN.1: Fix an indefinite length skip error
Fix an error in asn1_find_indefinite_length() whereby small definite length
elements of size 0x7f are incorrectly classified as non-small. Without this
fix, an error will be given as the length of the length will be perceived as
being very much greater than the maximum supported size.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
David Howells [Mon, 22 Oct 2012 14:05:48 +0000 (15:05 +0100)]
MODSIGN: Don't use enum-type bitfields in module signature info block
Don't use enum-type bitfields in the module signature info block as we can't be
certain how the compiler will handle them. As I understand it, it is arch
dependent, and it is possible for the compiler to rearrange them based on
endianness and to insert a byte of padding to pad the three enums out to four
bytes.
Instead use u8 fields for these, which the compiler should emit in the right
order without padding.
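The resulting info block uses plain u8 fields and explicit padding (a sketch matching the description; treat the exact field set as illustrative):
    struct module_signature {
            u8      algo;        /* public-key crypto algorithm (was enum) */
            u8      hash;        /* digest algorithm (was enum) */
            u8      id_type;     /* key identifier type (was enum) */
            u8      signer_len;  /* length of signer's name */
            u8      key_id_len;  /* length of key identifier */
            u8      __pad[3];    /* explicit, not compiler-chosen, padding */
            __be32  sig_len;     /* length of signature data */
    };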
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Thomas Gleixner [Tue, 4 Dec 2012 17:59:34 +0000 (18:59 +0100)]
watchdog: Fix CPU hotplug regression
Norbert reported:
"3.7-rc6 booted with nmi_watchdog=0 fails to suspend to RAM or
offline CPUs. It's reproducable with a KVM guest and physical
system."
The reason is that commit bcd951cf ("watchdog: Use hotplug thread
infrastructure") failed to take this into account. So the cpu offline
code gets stuck in the teardown function because it accesses
non-initialized data structures.
Add a check for watchdog_enabled into that path to cure the issue.
Linus Torvalds [Tue, 4 Dec 2012 17:32:12 +0000 (09:32 -0800)]
Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module fixes from Rusty Russell:
"Module signing build fixes for blackfin and metag"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
modsign: add symbol prefix to certificate list
linux/kernel.h: define SYMBOL_PREFIX
Linus Torvalds [Tue, 4 Dec 2012 17:15:51 +0000 (09:15 -0800)]
Merge tag 'upstream-3.7-rc9' of git://git.infradead.org/linux-ubi
Pull UBI changes from Artem Bityutskiy:
"Fixes for 2 brown-paperbag bugs introduced this merge window by the
fastmap code:
1. The UBI background thread got stuck when a bit-flip happened
because free LEBs were not removed from the "free" tree when we
started using them.
2. I/O debugging checks did not work because we called a sleeping
function in atomic context."
* tag 'upstream-3.7-rc9' of git://git.infradead.org/linux-ubi:
UBI: dont call ubi_self_check_all_ff() in __wl_get_peb()
UBI: remove PEB from free tree in get_peb_for_wl()
Linus Torvalds [Tue, 4 Dec 2012 17:02:45 +0000 (09:02 -0800)]
Merge branch 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
"So, safe fixes my ass.
Commit 8852aac25e79 ("workqueue: mod_delayed_work_on() shouldn't queue
timer on 0 delay") had the side-effect of performing delayed_work
sanity checks even when @delay is 0, which should be fine for any sane
use cases.
Unfortunately, megaraid was being overly ingenious. It seemingly
wanted to use cancel_delayed_work_sync() before cancel_work_sync() was
introduced, but didn't want to waste the space for full delayed_work
as it was only going to use 0 @delay. So, it only allocated space for
struct work_struct and then cast it to struct delayed_work and passed
it into delayed_work functions - truly awesome engineering tradeoff to
save some bytes.
Xiaotian fixed it by making megaraid allocate full delayed_work for
now. It should be converted to use work_struct and cancel_work_sync()
but I think we better do that after 3.7.
I added another commit to change BUG_ON()s in __queue_delayed_work()
to WARN_ON_ONCE()s so that the kernel doesn't crash even if there are
more such abuses."
* 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: convert BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s
megaraid: fix BUG_ON() from incorrect use of delayed work
Ralf Baechle [Tue, 4 Dec 2012 16:40:44 +0000 (17:40 +0100)]
MIPS: N32: Fix preadv(2) and pwritev(2) entry points.
By using the native syscall entry point the kernel was also expecting
64-bit iovec structures.
This has been broken since ddd9e91b71072b8ebe89311c3a44b077defa1756 [preadv/
pwritev: MIPS: Add preadv(2) and pwritev(2) syscalls.], which originally
added these two syscalls. I walked through piles of code, including
libc, and couldn't find anything that would have worked around the issue,
so this changes the API to what it should always have been.
Pull sparc fixes from David Miller:
"Two small fixes for Sparc, nobody uses sparc, so these are low risk :-)
1) Piggyback is too picky about the symbol types that _start and _end
have in the final kernel image, and it thus breaks with newer
binutils. Future proof by getting rid of the symbol type checks.
2) exit_group() should kill register windows on sparc64 the same way
we do for plain exit(). Thanks to Al Viro for spotting this."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc: Fix piggyback with newer binutils.
sparc64: exit_group should kill register windows just like plain exit.
Linus Torvalds [Tue, 4 Dec 2012 16:25:11 +0000 (08:25 -0800)]
vfs: avoid "attempt to access beyond end of device" warnings
The block device access simplification that avoided accessing the (racy)
block size information (commit bbec0270bdd8: "blkdev_max_block: make
private to fs/buffer.c") no longer checks the maximum block size in the
block mapping path.
That was _almost_ as simple as just removing the code entirely, because
the readers and writers all check the size of the device anyway, so
under normal circumstances it "just worked".
However, the block size may be such that the end of the device may
straddle one single buffer_head. At which point we may still want to
access the end of the device, but the buffer we use to access it
partially extends past the end.
The 'bd_set_size()' function intentionally sets the block size to avoid
this, but mounting the device - or setting the block size by hand to
some other value - can modify that block size.
So instead, teach 'submit_bh()' about the special case of the buffer
head straddling the end of the device, and turning such an access into a
smaller IO access, avoiding the problem.
This, btw, also means that unlike before, we can now access the whole
device regardless of device block size setting. So now, even if the
device size is only 512-byte aligned, we can read and write even the
last sector even when having a much bigger block size for accessing the
rest of the device.
So with this, we could now get rid of the 'bd_set_size()' block size
code entirely - resulting in faster IO for the common case - but that
would be a separate patch.
Reported-and-tested-by: Romain Francoise <romain@orebokech.com>
Reported-and-tested-by: Meelis Roos <mroos@linux.ee>
Reported-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Tue, 4 Dec 2012 15:40:39 +0000 (07:40 -0800)]
workqueue: convert BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s
8852aac25e ("workqueue: mod_delayed_work_on() shouldn't queue timer on
0 delay") unexpectedly uncovered a very nasty abuse of delayed_work in
megaraid - it allocated work_struct, casted it to delayed_work and
then pass that into queue_delayed_work().
Previously, this was okay because 0 @delay short-circuited to
queue_work() before doing anything with delayed_work. 8852aac25e
moved 0 @delay test into __queue_delayed_work() after sanity check on
delayed_work making megaraid trigger BUG_ON().
Although megaraid is already fixed by c1d390d8e6 ("megaraid: fix
BUG_ON() from incorrect use of delayed work"), this patch converts
BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s so that such
abusers, if there are more, trigger warning but don't crash the
machine.
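The checks in question, as they now appear inside __queue_delayed_work() (a sketch; timer, dwork and work are that function's locals):
    /* Warn once and carry on, instead of taking the whole machine
     * down on a malformed delayed_work. */
    WARN_ON_ONCE(timer->function != delayed_work_timer_fn ||
                 timer->data != (unsigned long)dwork);
    WARN_ON_ONCE(timer_pending(timer));
    WARN_ON_ONCE(!list_empty(&work->entry));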
David Daney [Mon, 3 Dec 2012 20:44:26 +0000 (12:44 -0800)]
MIPS: Avoid mcheck by flushing page range in huge_ptep_set_access_flags()
Problem:
1) Huge page mapping of anonymous memory is initially invalid. Will be
faulted in by copy-on-write mechanism.
2) Userspace attempts store at the end of the huge mapping.
3) The TLB refill exception handler fills the TLB with a normal (4K sized)
invalid page at the end of the huge mapping virtual address range.
4) Userspace restarted, and re-attempts the store at the end of the
huge mapping.
5) Page from #3 is invalid, we get a fault and go to the hugepage
fault handler. This tries to map a huge page and calls
huge_ptep_set_access_flags() to install the mapping.
6) We just call the generic ptep_set_access_flags() to set up the page
tables, but the flush there assumes a normal (4K sized) page and
only tries to flush the first part of the huge page virtual address
out of the TLB. Since the existing entry from step #3 doesn't
conflict, nothing is flushed.
7) We attempt to load the mapping into the TLB, but because it
conflicts with the entry from step #3, we get a Machine Check
exception.
The fix: Flush the entire range covered by the huge page in
huge_ptep_set_access_flags(), and remove the optimization in
local_flush_tlb_range() so that the flush actually does the correct
thing.
Xiaotian Feng [Tue, 4 Dec 2012 11:33:54 +0000 (19:33 +0800)]
megaraid: fix BUG_ON() from incorrect use of delayed work
megaraid uses INIT_WORK to initialize a hotplug_work, but casts the
hotplug_work from work_struct to delayed_work and calls
schedule_delayed_work() on it. This is very dangerous, as the other
parts of the delayed_work might be kernel memory allocated by others.
With commit 8852aac ("workqueue: mod_delayed_work_on() shouldn't queue
timer on 0 delay"), schedule_delayed_work() will check dwork->timer
before queueing the work even when @delay is 0; this causes the megaraid
code to hit the BUG_ON() in the workqueue code. Change the megaraid
code to use a delayed work.
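The fix, in outline (structure and function names are illustrative, not the driver's actual ones):
    #include <linux/workqueue.h>

    struct megaraid_ctx {                       /* illustrative container */
            struct delayed_work hotplug_work;   /* was: struct work_struct */
    };

    static void hotplug_work_fn(struct work_struct *work)
    {
            /* ... hotplug handling ... */
    }

    static void megaraid_setup(struct megaraid_ctx *ctx)
    {
            INIT_DELAYED_WORK(&ctx->hotplug_work, hotplug_work_fn);
            schedule_delayed_work(&ctx->hotplug_work, 0);  /* 0 delay is fine */
    }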
UBI: dont call ubi_self_check_all_ff() in __wl_get_peb()
As ubi_self_check_all_ff() might sleep, we are not allowed
to call it from atomic context.
For now we call it only from ubi_wl_get_peb().
There are some code paths where it would also make sense,
but these paths are currently atomic and only enabled
when fastmap is used.
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>