  1. Mar 30, 2013
    • net: reorder some fields of net_device · 4c3d5e7b
      Eric Dumazet authored
      
      
      As time passed, some fields were added in net_device, and not
      at sensible offsets.
      
      Let's reorder some fields to reduce the number of cache lines touched in
      the RX path. Fields not used in the data path should be moved out of this
      critical cache line.

      In particular, move broadcast[] to the end of the RX section, as it is
      less used, and Ethernet uses only the beginning of the 32-byte field.
      
      Before the patch:
      
      offsetof(struct net_device,dev_addr)=0x258
      offsetof(struct net_device,rx_handler)=0x2b8
      offsetof(struct net_device,ingress_queue)=0x2c8
      offsetof(struct net_device,broadcast)=0x278
      
      After the patch:
      
      offsetof(struct net_device,dev_addr)=0x280
      offsetof(struct net_device,rx_handler)=0x298
      offsetof(struct net_device,ingress_queue)=0x2a8
      offsetof(struct net_device,broadcast)=0x2b0
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
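      As a rough check of the effect, the sketch below counts how many cache lines the four offsets quoted above fall into. It assumes 64-byte cache lines and a cache-line-aligned start of struct net_device (both are assumptions, not stated in the commit) and looks only at the first byte of each field, which matches the note that Ethernet uses only the beginning of broadcast[]. The helper names are hypothetical.

```c
#include <stddef.h>

/* Assumption: 64-byte cache lines and a cache-line-aligned structure start. */
#define CACHE_LINE_BYTES 64u

/* Index of the cache line holding the byte at offset `off`. */
static unsigned int cache_line_of(size_t off)
{
	return (unsigned int)(off / CACHE_LINE_BYTES);
}

/* Count how many distinct cache lines a set of field offsets touches.
 * Only the first byte of each field is considered. */
static unsigned int distinct_cache_lines(const size_t *offs, size_t n)
{
	unsigned int count = 0;

	for (size_t i = 0; i < n; i++) {
		int seen = 0;

		for (size_t j = 0; j < i; j++)
			if (cache_line_of(offs[j]) == cache_line_of(offs[i]))
				seen = 1;
		if (!seen)
			count++;
	}
	return count;
}

/* Offsets copied from the commit message, in the order
 * dev_addr, rx_handler, ingress_queue, broadcast. */
static const size_t before_offs[] = { 0x258, 0x2b8, 0x2c8, 0x278 };
static const size_t after_offs[]  = { 0x280, 0x298, 0x2a8, 0x2b0 };
```

      Under those assumptions the four fields go from three cache lines (indices 9, 10 and 11) to a single one (index 10).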
    • include/linux: printk is needed in filter.h when CONFIG_BPF_JIT is defined · a691ce7f
      Chen Gang authored
      
      
      For make V=1 EXTRA_CFLAGS=-W ARCH=arm allmodconfig:
          printk is needed when CONFIG_BPF_JIT is defined,
          otherwise the build reports implicit declarations of pr_err and print_hex_dump.
      
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. Mar 29, 2013
    • sh_eth: add R-Car support for real · a3f109bd
      Sergei Shtylyov authored
      
      
      Commit d0418bb7 (net: sh_eth: Add eth support for R8A7779 device) was a
      failed attempt to add support for one of the members of the R-Car SoC
      family, for three reasons:
       - it treated R8A7779 the same as SH7724, except for a rather dirty hack
         adding the ECMR_ELB bit to the mask in sh_eth_set_rate() while not
         removing the ECMR_RTM bit (even though it is reserved in R-Car Ether);
       - it didn't add a new register offset array, even though the offsets
         of all R-Car Ether registers differ by 0x200 from those in the
         closest mapping, SH_ETH_REG_FAST_SH4;
       - some of the registers in this old mapping don't exist on R-Car Ether
         (because of this, SH7724's 'sh_eth_my_cpu_data' structure is not
         adequate for R-Car either).
      Fix all these shortcomings, restoring the SH7724-related section to its
      pristine state...
      
      Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. Mar 28, 2013
    • net: add ETH_P_802_3_MIN · e5c5d22e
      Simon Horman authored
      
      
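      Add a new constant ETH_P_802_3_MIN (0x0600), the minimum value of the
      Ethernet type field that is interpreted as an EtherType. Frames with a
      value at or above it are Ethernet II; frames with a lower value are
      802.3 frames, whose type field carries a length instead.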
      
      Also update all the users of this value that David Miller and
      I could find to use the new constant.
      
      Also correct a bug in util.c. The comparison with ETH_P_802_3_MIN
      should be >= not >.
      
      As suggested by Jesse Gross.
      
      Compile tested only.
      
      Cc: David Miller <davem@davemloft.net>
      Cc: Jesse Gross <jesse@nicira.com>
      Cc: Karsten Keil <isdn@linux-pingi.de>
      Cc: John W. Linville <linville@tuxdriver.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: Bart De Schuymer <bart.de.schuymer@pandora.be>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Marcel Holtmann <marcel@holtmann.org>
      Cc: Gustavo Padovan <gustavo@padovan.org>
      Cc: Johan Hedberg <johan.hedberg@gmail.com>
      Cc: linux-bluetooth@vger.kernel.org
      Cc: netfilter-devel@vger.kernel.org
      Cc: bridge@lists.linux-foundation.org
      Cc: linux-wireless@vger.kernel.org
      Cc: linux1394-devel@lists.sourceforge.net
      Cc: linux-media@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Cc: dev@openvswitch.org
      Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
      Acked-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
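      The boundary check this constant names can be sketched as follows. The sketch redefines the value locally (0x0600) instead of including <linux/if_ether.h>, assumes the 16-bit field has already been converted to host byte order (for example with ntohs()), and uses a hypothetical function name. It uses the >= comparison the commit calls for, so 0x0600 itself counts as an EtherType.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same value as the kernel's ETH_P_802_3_MIN, redefined locally so the
 * sketch does not depend on kernel headers. */
#define SKETCH_ETH_P_802_3_MIN 0x0600u

/* True if the 16-bit field after the source MAC address (host byte order)
 * is an EtherType, i.e. an Ethernet II frame; false if it is an 802.3
 * length field. Note >=, not >: 0x0600 is already an EtherType. */
static bool is_ethernet_ii(uint16_t type_or_len)
{
	return type_or_len >= SKETCH_ETH_P_802_3_MIN;
}
```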
    • tokenring: delete last holdout of CONFIG_TR · f3d40392
      Paul Bolle authored
      
      
      Tokenring support was deleted in v3.5. One last holdout of the macro
      CONFIG_TR escaped that fate. Until now.
      
      Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fix compile error of implicit declaration of skb_probe_transport_header · fbbdb8f0
      Ying Xue authored
      
      
      Commit 40893fd0 (net: switch to use skb_probe_transport_header())
      accidentally introduced a new error. When NET_SKBUFF_DATA_USES_OFFSET is
      not enabled, the following compile error occurs:
      
        CC      net/packet/af_packet.o
        net/packet/af_packet.c: In function ‘packet_sendmsg_spkt’:
        net/packet/af_packet.c:1516:2: error: implicit declaration of function ‘skb_probe_transport_header’ [-Werror=implicit-function-declaration]
        cc1: some warnings being treated as errors
        make[2]: *** [net/packet/af_packet.o] Error 1
        make[1]: *** [net/packet] Error 2
        make: *** [net] Error 2
      
      As skb_probe_transport_header() does not seem to be related to
      NET_SKBUFF_DATA_USES_OFFSET, we should move the definition of
      skb_probe_transport_header() out of the scope of the
      NET_SKBUFF_DATA_USES_OFFSET macro.
      
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. Mar 27, 2013
  5. Mar 26, 2013
    • ipv4: Fix ip-header identification for gso packets. · 330305cc
      Pravin B Shelar authored
      
      
      The IP header ID needs to be incremented even if the IP_DF flag is set.
      This behaviour was changed in commit 490ab081
      (IP_GRE: Fix IP-Identification).

      The following patch fixes it so that the identification is always
      incremented.
      
      Reported-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
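      The rule restored here, one identification value per GSO segment regardless of DF, can be sketched as below. The struct and function are hypothetical simplifications (host byte order, only the two fields that matter), not the kernel's segmentation code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of the IPv4 header fields involved. */
struct seg_iphdr {
	uint16_t id;	/* identification, host byte order for simplicity */
	bool df;	/* IP_DF (don't fragment) flag */
};

/* Give every GSO segment its own identification: first_id for the first
 * segment, then first_id + 1, ... wrapping modulo 2^16. The DF flag is
 * deliberately not consulted. */
static void gso_assign_ids(struct seg_iphdr *segs, size_t nsegs, uint16_t first_id)
{
	for (size_t i = 0; i < nsegs; i++)
		segs[i].id = (uint16_t)(first_id + i);
}
```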
    • netlink: remove duplicated NLMSG_ALIGN · a88b9ce5
      Hong zhi guo authored
      
      
      NLMSG_HDRLEN is already an aligned value. It is meant to be referenced
      directly, without extra alignment.

      The redundant alignment here may confuse API users.
      
      Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
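      To see why a second alignment is redundant: NLMSG_HDRLEN is itself defined as NLMSG_ALIGN(sizeof(struct nlmsghdr)), so aligning it again returns the same value. The sketch copies the relevant <linux/netlink.h> definitions under local SK_ names so it builds without kernel headers.

```c
#include <stdint.h>

/* Local copies of the alignment macros from <linux/netlink.h>. */
#define SK_NLMSG_ALIGNTO	4U
#define SK_NLMSG_ALIGN(len)	(((len) + SK_NLMSG_ALIGNTO - 1) & ~(SK_NLMSG_ALIGNTO - 1))

/* Same layout as struct nlmsghdr: 4 + 2 + 2 + 4 + 4 = 16 bytes. */
struct sk_nlmsghdr {
	uint32_t nlmsg_len;
	uint16_t nlmsg_type;
	uint16_t nlmsg_flags;
	uint32_t nlmsg_seq;
	uint32_t nlmsg_pid;
};

/* Already aligned, so NLMSG_ALIGN(NLMSG_HDRLEN) adds nothing. */
#define SK_NLMSG_HDRLEN	((int)SK_NLMSG_ALIGN(sizeof(struct sk_nlmsghdr)))
```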
    • firewire net, ipv4 arp: Extend hardware address and remove driver-level packet inspection. · 6752c8db
      YOSHIFUJI Hideaki / 吉藤英明 authored
      
      
      Inspection of upper-layer protocols is considered harmful, especially
      if it is about ARP or another stateful upper-layer protocol; the driver
      cannot (and should not) hold their full state.

      The IPv4 over Firewire module used to inspect ARP (in both the sending
      and the receiving path) and record the peer's GUID, max packet size, max
      speed and FIFO address.  This patch removes such inspection by extending
      our "hardware address" definition to include this other information as
      well: max packet size, max speed and FIFO.  By doing this, the neighbour
      module in the networking subsystem can cache them.
      
      Note: As we have started ignoring sspd and max_rec in ARP/NDP, that
            information will not be used in the driver when sending.
      
      When a packet is being sent, the IP layer fills our pseudo header with
      the extended "hardware address", including GUID and FIFO.  The driver
      can look up the node ID (the real but rather volatile low-level address)
      by GUID, and then the module can send the packet to the wire using the
      parameters provided in the extended hardware address.
      
      This approach is realistic because IP over IEEE1394 (RFC2734) and IPv6
      over IEEE1394 (RFC3146) share the same "hardware address" format
      in their address resolution protocols.
      
      Here, extended "hardware address" is defined as follows:
      
      union fwnet_hwaddr {
      	u8 u[16];
      	struct {
      		__be64 uniq_id;		/* EUI-64			*/
      		u8 max_rec;		/* max packet size		*/
      		u8 sspd;		/* max speed			*/
      		__be16 fifo_hi;		/* hi 16bits of FIFO addr	*/
      		__be32 fifo_lo;		/* lo 32bits of FIFO addr	*/
      	} __packed uc;
      };
      
      Note that the hardware address is declared as a union so that we can map
      a full IP address into it when implementing MCAP (Multicast Channel
      Allocation Protocol) for IPv6, but the IP and ARP subsystems do not need
      to know this format in detail.
      
      One difference between original ARP (RFC826) and 1394 ARP (RFC2734)
      is that 1394 ARP Request/Reply do not contain the target hardware address
      field (aka ar$tha).  This difference is handled in the ARP subsystem.
      
      CC: Stephan Gatzka <stephan.gatzka@gmail.com>
      Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
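      A userspace mirror of the union above can check that the packed layout really is 16 bytes (8 + 1 + 1 + 2 + 4), with the FIFO address split at byte offsets 10 and 12. This is a sketch under assumptions: the kernel types are replaced by <stdint.h> types, __packed is spelled as the GCC/Clang attribute, and the union name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of union fwnet_hwaddr. The multi-byte fields are meant
 * to hold big-endian values, as the __be* types in the original indicate. */
union fwnet_hwaddr_sketch {
	uint8_t u[16];
	struct {
		uint64_t uniq_id;	/* EUI-64 */
		uint8_t max_rec;	/* max packet size */
		uint8_t sspd;		/* max speed */
		uint16_t fifo_hi;	/* hi 16 bits of FIFO addr */
		uint32_t fifo_lo;	/* lo 32 bits of FIFO addr */
	} __attribute__((packed)) uc;
};
```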
    • GRE: Refactor GRE tunneling code. · c5441932
      Pravin B Shelar authored
      
      
      The following patch splits the GRE code into generic IP tunneling code
      and GRE-specific code. Common tunneling code is moved to the ip_tunnel
      module, which is written as a generic library that can be used by
      different tunneling implementations.

      The ip_tunnel module contains the following components:
       - packet xmit and rcv generic code. xmit flow looks like
         (gre_xmit/ipip_xmit)->ip_tunnel_xmit->ip_local_out.
       - hash table of all devices.
       - lookup for tunnel devices.
       - control plane operations like device create, destroy, ioctl, netlink
         operations code.
       - registration for tunneling modules, like gre, ipip etc.
       - define single pcpu_tstats dev->tstats.
       - struct tnl_ptk_info added to pass parsed tunnel packet parameters.
      
      The ipip.h header is renamed to ip_tunnel.h.
      
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. Mar 25, 2013
  7. Mar 24, 2013
  8. Mar 22, 2013
    • mm: zone_end_pfn is too small · f9228b20
      Russ Anderson authored
      
      
      Booting with 32 TBytes of memory hits the BUG at mm/page_alloc.c:552!
      (output below).
      
      The key hint is "page 4294967296 outside zone".
      4294967296 = 0x100000000 (bit 32 is set).
      
      The problem is in include/linux/mmzone.h:
      
        static inline unsigned zone_end_pfn(const struct zone *zone)
        {
                return zone->zone_start_pfn + zone->spanned_pages;
        }
      
      zone_end_pfn is "unsigned" (32 bits).  Changing it to "unsigned long"
      (64 bits) fixes the problem.
      
      zone_end_pfn() was added recently in commit 108bcc96 ("mm: add & use
      zone_end_pfn() and zone_spans_pfn()").

      Output from the failure:
      
        No AGP bridge found
        page 4294967296 outside zone [ 4294967296 - 4327469056 ]
        ------------[ cut here ]------------
        kernel BUG at mm/page_alloc.c:552!
        invalid opcode: 0000 [#1] SMP
        Modules linked in:
        CPU 0
        Pid: 0, comm: swapper Not tainted 3.9.0-rc2.dtp+ #10
        RIP: free_one_page+0x382/0x430
        Process swapper (pid: 0, threadinfo ffffffff81942000, task ffffffff81955420)
        Call Trace:
          __free_pages_ok+0x96/0xb0
          __free_pages+0x25/0x50
          __free_pages_bootmem+0x8a/0x8c
          __free_memory_core+0xea/0x131
          free_low_memory_core_early+0x4a/0x98
          free_all_bootmem+0x45/0x47
          mem_init+0x7b/0x14c
          start_kernel+0x216/0x433
          x86_64_start_reservations+0x2a/0x2c
          x86_64_start_kernel+0x144/0x153
        Code: 89 f1 ba 01 00 00 00 31 f6 d3 e2 4c 89 ef e8 66 a4 01 00 e9 2c fe ff ff 0f 0b eb fe 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 eb f3 <0f> 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 0f 0b eb fe 49
      
      Signed-off-by: Russ Anderson <rja@sgi.com>
      Reported-by: George Beshers <gbeshers@sgi.com>
      Acked-by: Hedi Berriche <hedi@sgi.com>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
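      The truncation is easy to reproduce with the zone bounds printed in the log (start PFN 4294967296, end PFN 4327469056, i.e. 32501760 spanned pages). The struct below is a hypothetical stand-in for the two struct zone fields involved, using uint64_t where x86_64 uses unsigned long.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two struct zone fields involved. */
struct zone_sketch {
	uint64_t zone_start_pfn;
	uint64_t spanned_pages;
};

/* Buggy variant: a 32-bit return type drops bit 32 and above. The cast
 * spells out the conversion an "unsigned" return type performs implicitly. */
static uint32_t zone_end_pfn_32(const struct zone_sketch *z)
{
	return (uint32_t)(z->zone_start_pfn + z->spanned_pages);
}

/* Fixed variant: keep the full 64-bit PFN. */
static uint64_t zone_end_pfn_64(const struct zone_sketch *z)
{
	return z->zone_start_pfn + z->spanned_pages;
}
```

      With these inputs the 32-bit variant returns 32501760, an "end" that lies below the zone's own start, which is what the "outside zone" check then trips over.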
    • printk: Provide a wake_up_klogd() off-case · dc72c32e
      Frederic Weisbecker authored
      
      
      wake_up_klogd() is useless when CONFIG_PRINTK=n because neither printk()
      nor printk_sched() is in use and there are actually no waiters on the
      log_wait waitqueue.  It should be a stub in this case for users like
      bust_spinlocks().

      Otherwise this results in the following link error when CONFIG_PRINTK=n
      and CONFIG_IRQ_WORK=n:
      
      	kernel/built-in.o In function `wake_up_klogd':
      	(.text.wake_up_klogd+0xb4): undefined reference to `irq_work_queue'
      
      To fix this, provide an off-case for wake_up_klogd() when
      CONFIG_PRINTK=n.
      
      There is much more from console_unlock() and other console-related code
      in printk.c that should be moved under CONFIG_PRINTK.  But for now,
      focus on a minimal fix, as we have already passed the merge window.
      
      [akpm@linux-foundation.org: include printk.h in bust_spinlocks.c]
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Reported-by: James Hogan <james.hogan@imgtec.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
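      The off-case follows the usual stub pattern: a real prototype when the option is enabled, an empty static inline otherwise, so callers compile and link either way. In this sketch SKETCH_CONFIG_PRINTK is a hypothetical stand-in for CONFIG_PRINTK, and sketch_bust_spinlocks() is a simplified, hypothetical caller rather than the kernel's bust_spinlocks().

```c
#ifdef SKETCH_CONFIG_PRINTK
void wake_up_klogd(void);		/* real implementation lives elsewhere */
#else
/* Option disabled: nothing can be waiting, so the call compiles away and
 * no reference to irq_work_queue() (or anything else) is emitted. */
static inline void wake_up_klogd(void) { }
#endif

/* Hypothetical caller that builds with or without the option. */
static int klogd_wakeups_requested;

static void sketch_bust_spinlocks(int yes)
{
	if (!yes) {
		klogd_wakeups_requested++;
		wake_up_klogd();
	}
}
```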
    • irq_work.h: fix warning when CONFIG_IRQ_WORK=n · fe8d5261
      James Hogan authored
      
      
      A randconfig caught repeated compiler warnings when CONFIG_IRQ_WORK=n
      due to the definition of a non-inline static function in
      <linux/irq_work.h>:
      
        include/linux/irq_work.h +40 : warning: 'irq_work_needs_cpu' defined but not used
      
      Make it inline to suppress the warning.  This is caused by commit
      00b42959 ("irq_work: Don't stop the tick with pending works"), merged
      in v3.9-rc1.
      
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rtnetlink: Remove passing of attributes into rtnl_doit functions · 661d2967
      Thomas Graf authored
      
      
      With decnet converted, we can finally get rid of rta_buf and the
      computations around it. This also gets rid of the minimal header
      length verification, since all message handlers do that explicitly
      anyway.
      
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • decnet: Parse netlink attributes on our own · 58d7d8f9
      Thomas Graf authored
      
      
      decnet is the only subsystem left that relies on the global
      netlink attribute buffer rta_buf. It's a horrible design and we
      want to get rid of it.

      This converts all of decnet to do implicit attribute parsing. It
      also gets rid of the error-prone struct dn_kern_rta.

      Yes, the fib_magic() stuff is not pretty.

      It's compile tested, but I need someone with appropriate hardware
      to test the patch since I don't have access to it.
      
      Cc: linux-decnet-user@lists.sourceforge.net
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mv643xx_eth: convert to use the Marvell Orion MDIO driver · c3a07134
      Florian Fainelli authored
      
      
      This patch converts the Marvell MV643XX ethernet driver to use the
      Marvell Orion MDIO driver. As a result, PowerPC and ARM platforms
      registering the Marvell MV643XX ethernet driver are also updated to
      register a Marvell Orion MDIO driver. This driver deliberately overlaps
      with the Marvell Ethernet shared registers because it uses a subset
      of these shared registers (shared_base + 0x4 to shared_base + 0x84). The
      Ethernet driver is also updated to look up its PHY device using the
      Orion MDIO bus driver.
      
      For ARM and PowerPC we register a single instance of the "mvmdio" driver
      in the system like it used to be done with the use of the "shared_smi"
      platform_data cookie on ARM.
      
      Note that it is safe to register the mvmdio driver only for the "ge00"
      instance of the driver because this "ge00" interface is guaranteed to
      always be explicitly registered by consumers of
      arch/arm/plat-orion/common.c and other instances (ge01, ge10 and ge11)
      were all pointing their shared_smi to ge00. For PowerPC the in-tree
      Device Tree Source files mention only one MV643XX ethernet MAC instance
      so the MDIO bus driver is registered only when id == 0.
      
      Signed-off-by: Florian Fainelli <florian@openwrt.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio: remove obsolete virtqueue_get_queue_index() · 9d0ca6ed
      Rusty Russell authored
      
      
      You can access it directly now, since 3.8 (v3.7-rc1-13-g06ca287,
      'virtio: move queue_index and num_free fields into core struct
      virtqueue.').
      
      Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "KVM: allow host header to be included even for !CONFIG_KVM" · 09a6e1f4
      Marcelo Tosatti authored
      
      
      This reverts commit f445f11e as
      it breaks PPC with CONFIG_KVM=n.
      
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
  9. Mar 21, 2013
    • USB: serial: add modem-status-change wait queue · e5b33dc9
      Johan Hovold authored
      
      
      Add modem-status-change wait queue to struct usb_serial_port that
      subdrivers can use to implement TIOCMIWAIT.
      
      Currently subdrivers use a private wait queue which may already have been
      released when waking up after the device has been disconnected.
      
      Note that we're adding a new wait queue rather than reusing the tty-port
      one as we do not want to get woken up at hangup (yet).
      
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: Johan Hovold <jhovold@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • filter: bpf_jit_comp: refactor and unify BPF JIT image dump output · 79617801
      Daniel Borkmann authored
      
      
      If bpf_jit_enable > 1, then we dump the emitted JIT compiled image
      after creation. Currently, only SPARC and PowerPC have output similar
      to the reference implementation on x86_64. Add a small helper
      function in order to reduce duplicated code and make the dump output
      uniform across the x86_64, SPARC, PPC and ARM architectures (e.g. on ARM,
      flen, pass and proglen are currently not shown, but would be
      interesting to know as well), and also for future BPF JIT
      implementations on other archs.
      
      Cc: Mircea Gherzan <mgherzan@gmail.com>
      Cc: Matt Evans <matt@ozlabs.org>
      Cc: Eric Dumazet <eric.dumazet@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
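      The idea of one shared helper that every JIT calls can be sketched in userspace. The function name, its buffer-based output and the exact line format are assumptions for illustration; only the reported quantities (flen, proglen, pass and the image bytes) come from the message.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical shared dump helper: one header line with the filter length,
 * image length and pass count, then the image as hex, 16 bytes per line.
 * Writes into `buf` instead of the kernel log so the result can be checked.
 * Returns the number of characters written, or -1 if `buf` is too small. */
static int jit_dump_sketch(char *buf, size_t bufsz, unsigned int flen,
			   unsigned int proglen, unsigned int pass,
			   const unsigned char *image)
{
	int n = snprintf(buf, bufsz, "flen=%u proglen=%u pass=%u\n",
			 flen, proglen, pass);

	if (n < 0 || (size_t)n >= bufsz)
		return -1;
	for (unsigned int i = 0; image && i < proglen; i++) {
		char sep = (i % 16 == 15 || i + 1 == proglen) ? '\n' : ' ';
		int m = snprintf(buf + n, bufsz - (size_t)n, "%02x%c",
				 (unsigned int)image[i], sep);

		if (m < 0 || (size_t)m >= bufsz - (size_t)n)
			return -1;
		n += m;
	}
	return n;
}
```

      Each architecture's JIT would then call the same helper after emitting its image, so the output format cannot drift between them.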
    • netlink: Diag core and basic socket info dumping (v2) · eaaa3139
      Andrey Vagin authored
      
      
      netlink_diag can be built as a module, just as is done for
      unix sockets.

      The core dumping message carries the basic info about netlink sockets:
      family, type and protocol, portid, dst_group, dst_portid, state.

      Groups can be received as an optional parameter NETLINK_DIAG_GROUPS.

      Netlink sockets can be filtered by protocol.

      The socket inode number and cookie are reserved for future per-socket
      info retrieval. Per-protocol filtering is also reserved for the future
      by requiring sdiag_protocol to be zero.

      The file /proc/net/netlink doesn't provide enough information for
      dumping netlink sockets. It doesn't provide dst_group, dst_portid or
      groups above 32.
      
      v2: fix NETLINK_DIAG_MAX. Now it's equal to the last constant.
      
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fix *_DIAG_MAX constants · ae5fc987
      Andrey Vagin authored
      
      
      Follow the common pattern and define *_DIAG_MAX like:
      
              [...]
              __XXX_DIAG_MAX,
      };
      
      Because everyone is used to doing:

              struct nlattr *attrs[XXX_DIAG_MAX+1];

              nla_parse([...], XXX_DIAG_MAX, [...]);
      
      Reported-by: Thomas Graf <tgraf@suug.ch>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
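      The message shows only the tail of the enum. A sketch of the full pattern follows, with hypothetical attribute names and assuming the usual companion definition XXX_DIAG_MAX = __XXX_DIAG_MAX - 1 (not shown in the message), so that XXX_DIAG_MAX names the last real attribute and an array of XXX_DIAG_MAX+1 slots covers every attribute type.

```c
/* Hypothetical attribute set following the common pattern: a trailing
 * __*_MAX sentinel, with *_MAX defined as the last real attribute. */
enum {
	SKETCH_DIAG_UNSPEC,
	SKETCH_DIAG_INFO,
	SKETCH_DIAG_GROUPS,
	__SKETCH_DIAG_MAX,
};

#define SKETCH_DIAG_MAX (__SKETCH_DIAG_MAX - 1)

/* One slot per attribute type, index 0 (UNSPEC) included, mirroring
 * "struct nlattr *attrs[XXX_DIAG_MAX+1]". */
static const void *sketch_attrs[SKETCH_DIAG_MAX + 1];
```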
    • tcp: implement RFC5682 F-RTO · e33099f9
      Yuchung Cheng authored
      
      
      This patch implements F-RTO (forward RTO recovery):
      
      When the first retransmission after timeout is acknowledged, F-RTO
      sends new data instead of old data. If the next ACK acknowledges
      some never-retransmitted data, then the timeout was spurious and the
      congestion state is reverted.  Otherwise if the next ACK selectively
      acknowledges the new data, then the timeout was genuine and the
      loss recovery continues. This idea applies to recurring timeouts
      as well. While F-RTO sends different data during timeout recovery,
      it does not (and should not) change the congestion control.
      
      The implementation follows the three steps of the SACK-enhanced algorithm
      (section 3) in RFC5682. Step 1 is in tcp_enter_loss(). Steps 2 and
      3 are in tcp_process_loss().  The basic version is not supported
      because the SACK-enhanced version also works for non-SACK connections.
      
      The new implementation is functionally on par with the old F-RTO
      implementation except the one case where it increases undo events:
      In addition to the RFC algorithm, a spurious timeout may be detected
      without sending data in step 2, as long as the SACK confirms not
      all the original data are dropped. When this happens, the sender
      will undo the cwnd and perhaps enter fast recovery instead. This
      additional check increases the F-RTO undo events by 5x compared
      to the prior implementation on Google Web servers, since the sender
      often does not have new data to send for HTTP.
      
      Note F-RTO may detect spurious timeout before Eifel with timestamps
      does so.
      
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
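      The decision made on the ACK that follows the new data can be modelled as a small classifier. This is a simplification of the prose in the commit message only, not the kernel's tcp_process_loss(); the enum and function names are hypothetical, and details such as the SACK-based undo without new data and the Eifel interaction are left out.

```c
#include <stdbool.h>

/* Verdict on the ACK that arrives after F-RTO has sent new data
 * (instead of old data) following the first post-timeout retransmission. */
enum frto_verdict {
	FRTO_UNDECIDED,		/* nothing conclusive yet */
	FRTO_SPURIOUS,		/* timeout was spurious: revert congestion state */
	FRTO_GENUINE,		/* timeout was genuine: continue loss recovery */
};

/* acks_never_retransmitted: the ACK acknowledges data that was never
 *                           retransmitted.
 * sacks_new_data:           the ACK selectively acknowledges the new data. */
static enum frto_verdict frto_classify_ack(bool acks_never_retransmitted,
					   bool sacks_new_data)
{
	if (acks_never_retransmitted)
		return FRTO_SPURIOUS;
	if (sacks_new_data)
		return FRTO_GENUINE;
	return FRTO_UNDECIDED;
}
```

      Note that the classifier only decides what happened; as the message says, how much to send stays with congestion control.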
    • tcp: refactor F-RTO · 9b44190d
      Yuchung Cheng authored
      
      
      This patch series refactors the F-RTO feature (RFC4138/5682).
      
      This is to simplify the loss recovery processing. Existing F-RTO
      was developed during the experimental stage (RFC4138) and has
      many experimental features.  It takes a separate code path from
      the traditional timeout processing by overloading CA_Disorder
      instead of using CA_Loss state. This complicates CA_Disorder state
      handling because it's also used for handling dubious ACKs and undos.
      While the algorithm in the RFC does not change the congestion control,
      the implementation intercepts congestion control in various places
      (e.g., frto_cwnd in tcp_ack()).
      
      The new code implements newer F-RTO RFC5682 using CA_Loss processing
      path.  F-RTO becomes a small extension in the timeout processing
      and interfaces with congestion control and Eifel undo modules.
      It lets the congestion control module independently determine how much
      to send.  F-RTO only chooses what to send in order to detect
      spurious retransmission. If the timeout is found to be spurious, it invokes
      existing Eifel undo algorithms like DSACK or TCP timestamp based
      detection.
      
      The first patch removes all F-RTO code except sysctl_tcp_frto, which is
      left for the new implementation.  Since CA_EVENT_FRTO is removed, TCP
      westwood now computes ssthresh on regular timeout CA_EVENT_LOSS event.
      
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. Mar 20, 2013
  11. Mar 19, 2013
    • packet: packet fanout rollover during socket overload · 77f65ebd
      Willem de Bruijn authored
      
      
      Changes:
        v3->v2: rebase (no other changes)
                passes selftest
        v2->v1: read f->num_members only once
                fix bug: test rollover mode + flag
      
      Minimize packet drop in a fanout group. If one socket is full,
      roll over packets to another from the group. Maintain flow
      affinity during normal load using an rxhash fanout policy, while
      dispersing unexpected traffic storms that hit a single cpu, such
      as spoofed-source DoS flows. Rollover breaks affinity for flows
      arriving at saturated sockets during those conditions.
      
      The patch adds a fanout policy ROLLOVER that rotates between sockets,
      filling each socket before moving to the next. It also adds a fanout
      flag ROLLOVER. If passed along with any other fanout policy, the
      primary policy is applied until the chosen socket is full. Then,
      rollover selects another socket, to delay packet drop until the
      entire system is saturated.
      
      Probing sockets is not free. Selecting the last used socket, as
      rollover does, is a greedy approach that maximizes the chance of
      success, at the cost of extreme load imbalance. In practice, with
      sufficiently long queues to absorb bursts, sockets are drained in
      parallel and load balance looks uniform in `top`.
      
      To avoid contention, the patch scales counters with the number of
      sockets and accesses them lock-free. Values are bounds checked to
      ensure correctness.
      
      Tested using an application with 9 threads pinned to CPUs, one socket
      per thread and sufficient busywork per packet operation to limit each
      thread to handling 32 Kpps. When sent 500 Kpps of single-stream UDP
      packets, a FANOUT_CPU setup processes 32 Kpps in total without this
      patch and 270 Kpps with the patch. Tested with read() and with a packet
      ring (V1).
      
      Also, passes psock_fanout.c unit test added to selftests.
      
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
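      A minimal userspace model of the rollover-flag behaviour described above: the primary policy picks a socket, and if that socket's queue is full, the following group members are probed in order. Queue capacities, names and the probing order are assumptions for illustration; the kernel's own selection logic is not reproduced.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fanout group member: a receive queue of fixed capacity,
 * counted in packets. */
struct fanout_member {
	unsigned int queued;
	unsigned int capacity;
};

static bool member_full(const struct fanout_member *m)
{
	return m->queued >= m->capacity;
}

/* Deliver one packet. `primary` is the socket chosen by the primary policy
 * (e.g. a flow hash). If it is full, probe the following members in order
 * and use the first with room. Returns the chosen index, or -1 when every
 * member is full and the packet is dropped. */
static int fanout_deliver(struct fanout_member *members, unsigned int num,
			  unsigned int primary)
{
	for (unsigned int i = 0; i < num; i++) {
		unsigned int idx = (primary + i) % num;

		if (!member_full(&members[idx])) {
			members[idx].queued++;
			return (int)idx;
		}
	}
	return -1;
}
```

      With three members of capacity 2 and a single hot flow that always hashes to socket 0, the first six packets are spread over sockets 0, 1 and 2 and only the seventh is dropped; without rollover, everything after the second packet would be dropped.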
    • netfilter: nf_conntrack: speed up module removal path if netns in use · dece40e8
      Vladimir Davydov authored
      
      
      The patch introduces nf_conntrack_cleanup_net_list(), which cleans up
      nf_conntrack for a list of netns and calls synchronize_net() only once
      for them all. This should reduce netns destruction time.
      
      I've measured cleanup time for 1k dummy net ns. Here are the results:
      
       <without the patch>
       # modprobe nf_conntrack
       # time modprobe -r nf_conntrack
      
       real	0m10.337s
       user	0m0.000s
       sys	0m0.376s
      
       <with the patch>
       # modprobe nf_conntrack
       # time modprobe -r nf_conntrack
      
       real    0m5.661s
       user    0m0.000s
       sys     0m0.216s
      
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
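      The saving comes from paying the synchronize_net() wait once per batch instead of once per namespace. A sketch of that batching idea follows, with hypothetical types and a counter standing in for the grace-period wait; it is not nf_conntrack_cleanup_net_list() itself.

```c
#include <stddef.h>

/* Hypothetical per-namespace state. */
struct netns_sketch {
	int conntrack_active;
};

/* Counter standing in for the cost of synchronize_net(), which waits for
 * an RCU grace period. */
static unsigned int grace_periods_waited;

static void sketch_synchronize_net(void)
{
	grace_periods_waited++;
}

/* Batched teardown: unhook every namespace first, wait once for the whole
 * batch, then free. Unbatched, the wait would be paid n times. */
static void sketch_cleanup_net_list(struct netns_sketch *nets, size_t n)
{
	for (size_t i = 0; i < n; i++)
		nets[i].conntrack_active = 0;	/* stop new lookups */

	sketch_synchronize_net();		/* one wait for the whole batch */

	/* ...freeing per-namespace resources would follow here... */
}
```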
    • inet: limit length of fragment queue hash table bucket lists · 5a3da1fe
      Hannes Frederic Sowa authored
      
      
      This patch introduces a constant limit on the length of the fragment
      queue hash table bucket lists. Currently the limit of 128 is chosen
      somewhat arbitrarily and just ensures that we can fill up the fragment
      cache with empty packets up to the default ip_frag_high_thresh limits.
      It is only meant to protect against list iteration eating considerable
      amounts of CPU.
      
      If we reach the maximum length in one hash bucket a warning is printed.
      This is implemented on the caller side of inet_frag_find to distinguish
      between the different users of inet_fragment.c.
      
      I dropped the out-of-memory warning in the IPv4 fragment lookup path,
      because we already get a warning from the slab allocator.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
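      A bounded bucket can be sketched as follows: the lookup walks the chain as before, but tells the caller not to add a new queue once the chain has reached the limit, so no later walk can grow without bound. The struct, function and macro names are hypothetical; only the limit of 128 comes from the message, and printing the warning is left to the caller, as described above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Limit from the commit message; the macro name is hypothetical. */
#define SKETCH_FRAG_BUCKET_MAXDEPTH 128

struct frag_queue_sketch {
	unsigned int key;			/* stands in for the real lookup key */
	struct frag_queue_sketch *next;		/* hash chain */
};

/* Returns the matching queue, or NULL. When nothing matches, *may_create
 * says whether a new queue may still be added to this bucket: only while
 * the chain is shorter than the limit. */
static struct frag_queue_sketch *
frag_bucket_find(struct frag_queue_sketch *head, unsigned int key,
		 bool *may_create)
{
	unsigned int depth = 0;

	for (struct frag_queue_sketch *q = head; q; q = q->next) {
		if (q->key == key) {
			*may_create = false;
			return q;
		}
		depth++;
	}
	*may_create = depth < SKETCH_FRAG_BUCKET_MAXDEPTH;
	return NULL;
}
```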