January 22, 2013

NETDEV WATCHDOG: eth0 transmit queue 0 timed out


Recently I ran into an issue with a web server where the network card would stop working at seemingly random times. Each time I investigated, I found a log entry similar to this:

Jan 17 14:03:37 cweb4 kernel: ------------[ cut here ]------------
Jan 17 14:03:37 cweb4 kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Jan 17 14:03:37 cweb4 kernel: NETDEV WATCHDOG: eth0 (sis190): transmit queue 0 timed out
Jan 17 14:03:37 cweb4 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-279.19.1.el6.x86_64 #1
Jan 17 14:03:37 cweb4 kernel: Call Trace:
Jan 17 14:03:37 cweb4 kernel: <IRQ>  [<ffffffff8106a1e7>] ? warn_slowpath_common+0x87/0xc0
Jan 17 14:03:37 cweb4 kernel: [<ffffffff8101c0fa>] ? intel_pmu_enable_all+0xba/0x160
Jan 17 14:03:37 cweb4 kernel: [<ffffffff8106a2d6>] ? warn_slowpath_fmt+0x46/0x50
Jan 17 14:03:37 cweb4 kernel: [<ffffffff8144792d>] ? dev_watchdog+0x26d/0x280
Jan 17 14:03:37 cweb4 kernel: [<ffffffff814476c0>] ? dev_watchdog+0x0/0x280
Jan 17 14:03:37 cweb4 kernel: [<ffffffff8107d2c7>] ? run_timer_softirq+0x197/0x340
Jan 17 14:03:37 cweb4 kernel: [<ffffffff810a0910>] ? tick_sched_timer+0x0/0xc0
Jan 17 14:03:37 cweb4 kernel: [<ffffffff8102adad>] ? lapic_next_event+0x1d/0x30
Jan 17 14:03:37 cweb4 kernel: [<ffffffff81072991>] ? __do_softirq+0xc1/0x1e0
Jan 17 14:03:37 cweb4 kernel: [<ffffffff81095510>] ? hrtimer_interrupt+0x140/0x250
Jan 17 14:03:37 cweb4 kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
Jan 17 14:03:37 cweb4 kernel: [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
Jan 17 14:03:37 cweb4 kernel: [<ffffffff81072775>] ? irq_exit+0x85/0x90
Jan 17 14:03:37 cweb4 kernel: [<ffffffff814f1fa0>] ? smp_apic_timer_interrupt+0x70/0x9b
Jan 17 14:03:37 cweb4 kernel: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
Jan 17 14:03:37 cweb4 kernel: <EOI>  [<ffffffff81014707>] ? mwait_idle+0x77/0xd0
Jan 17 14:03:37 cweb4 kernel: [<ffffffff814ef7aa>] ? atomic_notifier_call_chain+0x1a/0x20
Jan 17 14:03:37 cweb4 kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
Jan 17 14:03:37 cweb4 kernel: [<ffffffff814e32aa>] ? start_secondary+0x22a/0x26d
Jan 17 14:03:37 cweb4 kernel: ---[ end trace c41b5453a60ecb3a ]---
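
If you want to check how often this is happening, the watchdog line is the one worth searching for. Here is a rough Python sketch that pulls the interface, driver and queue number out of lines like the one above. The function name and regex are my own, and it assumes Python 3 and the exact message format shown in this trace, so treat it as a starting point rather than a robust parser.

```python
import re

# Matches the watchdog message as it appears in the trace above, e.g.
#   NETDEV WATCHDOG: eth0 (sis190): transmit queue 0 timed out
# The pattern is an assumption based on that one sample line.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_timeouts(lines):
    """Return a list of (interface, driver, queue) tuples, one per matching line."""
    hits = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            hits.append(
                (match.group("iface"), match.group("driver"), int(match.group("queue")))
            )
    return hits
```

To run it against a real log, pass it an open file, for example `find_watchdog_timeouts(open("/var/log/messages", errors="replace"))`. That path is the usual syslog location on CentOS 6, but yours may differ.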

Initially I thought I had some sort of driver conflict. The system was running CentOS 6.3 64-bit on an older motherboard. I searched high and low for a BIOS or driver update and found none. I then began to dig into what causes the NETDEV WATCHDOG error and found many posts suggesting that I use the ‘noapic’ option with the kernel. Unfortunately, this was not a viable solution for me.

After diving back into my searches, I was able to piece together that the cause might be Active-State Power Management (ASPM). That led me here: https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Power_Management_Guide/ASPM.html

A quick look at that guide suggested adding ‘pcie_aspm=off’ to the kernel options. This doesn’t cause me any conflicts, so I gave it a shot. The system booted up with no problems and has been running strong ever since.
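
After rebooting, it is worth confirming that the option actually reached the kernel. The running kernel's boot options can be read from /proc/cmdline, and a small Python sketch like the one below checks them for a pcie_aspm setting. The function names are my own and it assumes Python 3. It only tells you what was passed at boot, not what the hardware is doing.

```python
def aspm_boot_setting(cmdline):
    """Return the value of the last pcie_aspm= option in a kernel command line.

    Returns None when the option is not present at all.
    """
    value = None
    for token in cmdline.split():
        if token.startswith("pcie_aspm="):
            value = token.split("=", 1)[1]
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the boot options of the running kernel."""
    with open(path) as f:
        return f.read()
```

Calling `aspm_boot_setting(read_cmdline())` on the fixed machine should give back `off`. If it gives `None`, the option was probably added to the wrong kernel line in the boot loader config.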