January 22, 2013

NETDEV WATCHDOG: eth0 transmit queue 0 timed out


Recently I ran into an issue with a web server where the network card would stop working at seemingly random times. Each time I investigated, I found a log entry similar to this:

Jan 17 14:03:37 cweb4 kernel: ------------[ cut here ]------------
Jan 17 14:03:37 cweb4 kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Jan 17 14:03:37 cweb4 kernel: NETDEV WATCHDOG: eth0 (sis190): transmit queue 0 timed out
Jan 17 14:03:37 cweb4 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-279.19.1.el6.x86_64 #1
Jan 17 14:03:37 cweb4 kernel: Call Trace:
Jan 17 14:03:37 cweb4 kernel: <IRQ>  [<ffffffff8106a1e7>] ? warn_slowpath_common+0x87/0xc0
Jan 17 14:03:37 cweb4 kernel: [<ffffffff8101c0fa>] ? intel_pmu_enable_all+0xba/0x160
Jan 17 14:03:37 cweb4 kernel: [<ffffffff8106a2d6>] ? warn_slowpath_fmt+0x46/0x50
Jan 17 14:03:37 cweb4 kernel: [<ffffffff8144792d>] ? dev_watchdog+0x26d/0x280
Jan 17 14:03:37 cweb4 kernel: [<ffffffff814476c0>] ? dev_watchdog+0x0/0x280
Jan 17 14:03:37 cweb4 kernel: [<ffffffff8107d2c7>] ? run_timer_softirq+0x197/0x340
Jan 17 14:03:37 cweb4 kernel: [<ffffffff810a0910>] ? tick_sched_timer+0x0/0xc0
Jan 17 14:03:37 cweb4 kernel: [<ffffffff8102adad>] ? lapic_next_event+0x1d/0x30
Jan 17 14:03:37 cweb4 kernel: [<ffffffff81072991>] ? __do_softirq+0xc1/0x1e0
Jan 17 14:03:37 cweb4 kernel: [<ffffffff81095510>] ? hrtimer_interrupt+0x140/0x250
Jan 17 14:03:37 cweb4 kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
Jan 17 14:03:37 cweb4 kernel: [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
Jan 17 14:03:37 cweb4 kernel: [<ffffffff81072775>] ? irq_exit+0x85/0x90
Jan 17 14:03:37 cweb4 kernel: [<ffffffff814f1fa0>] ? smp_apic_timer_interrupt+0x70/0x9b
Jan 17 14:03:37 cweb4 kernel: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
Jan 17 14:03:37 cweb4 kernel: <EOI>  [<ffffffff81014707>] ? mwait_idle+0x77/0xd0
Jan 17 14:03:37 cweb4 kernel: [<ffffffff814ef7aa>] ? atomic_notifier_call_chain+0x1a/0x20
Jan 17 14:03:37 cweb4 kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
Jan 17 14:03:37 cweb4 kernel: [<ffffffff814e32aa>] ? start_secondary+0x22a/0x26d
Jan 17 14:03:37 cweb4 kernel: ---[ end trace c41b5453a60ecb3a ]---
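If you want to check how often this happens on your own box, something like the following might help. It is only a rough sketch: it assumes the watchdog line has the same shape as the one in the trace above, and the function name find_watchdog_timeouts is my own invention, not part of any tool.

```python
import re

# Matches the watchdog line as it appears in the trace above, e.g.
# "NETDEV WATCHDOG: eth0 (sis190): transmit queue 0 timed out"
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_timeouts(lines):
    """Return one dict per watchdog timeout line found in `lines`.

    Hypothetical helper written for illustration only.
    """
    events = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match is None:
            continue
        # Text before " kernel: " is the syslog timestamp and hostname.
        # If that marker is absent, the whole line is kept as the prefix.
        prefix = line.split(" kernel: ", 1)[0]
        events.append({
            "prefix": prefix,
            "iface": match.group("iface"),
            "driver": match.group("driver"),
            "queue": int(match.group("queue")),
        })
    return events


if __name__ == "__main__":
    # Two lines copied from the log above.
    sample = [
        "Jan 17 14:03:37 cweb4 kernel: NETDEV WATCHDOG: eth0 (sis190): "
        "transmit queue 0 timed out",
        "Jan 17 14:03:37 cweb4 kernel: Call Trace:",
    ]
    for event in find_watchdog_timeouts(sample):
        print(event)
```

To run it against a real log, pass an open file instead of the sample list, for example open("/var/log/messages"). That path is an assumption about where your syslog writes kernel messages, so adjust it for your setup.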

Initially I thought I had some sort of driver conflict. The system was running CentOS 6.3 64-bit on an older motherboard. I searched high and low for a BIOS or driver update and found none. I then began to dig into what causes the NETDEV WATCHDOG error and found many posts suggesting that I boot the kernel with the ‘noapic’ option. Unfortunately, this was not a viable solution for me.

After diving back into my searches, I was able to piece together that the cause might be Active-State Power Management (ASPM), the PCI Express power-saving feature. That led me here: https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Power_Management_Guide/ASPM.html

A quick read of that guide suggested adding ‘pcie_aspm=off’ to the kernel options. This option doesn’t cause me any conflicts, so I gave it a shot. The system booted up with no problems and has been running strong ever since.
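For anyone who would rather script the change than edit by hand, here is a minimal sketch of what adding the option amounts to. It assumes a GRUB legacy style config (as on CentOS 6), where each boot entry passes its options on a line beginning with "kernel". The function names and the sample entry are hypothetical, and the root device in the sample is made up.

```python
def add_kernel_param(grub_conf_text, param="pcie_aspm=off"):
    """Append `param` to every `kernel` line that does not already have it.

    Hypothetical helper: it only transforms text and never touches disk.
    The result always ends with a single trailing newline.
    """
    out = []
    for line in grub_conf_text.splitlines():
        stripped = line.strip()
        # Only lines whose first word is "kernel" carry boot options.
        if stripped.startswith(("kernel ", "kernel\t")):
            if param not in stripped.split():
                line = line.rstrip() + " " + param
        out.append(line)
    return "\n".join(out) + "\n"


def param_is_active(cmdline_text, param="pcie_aspm=off"):
    """Return True if `param` appears as a whole word in a kernel command line."""
    return param in cmdline_text.split()


if __name__ == "__main__":
    # Made-up boot entry for illustration; the root device is an assumption.
    sample = (
        "title CentOS (2.6.32-279.19.1.el6.x86_64)\n"
        "\troot (hd0,0)\n"
        "\tkernel /vmlinuz-2.6.32-279.19.1.el6.x86_64 ro root=/dev/sda1 quiet\n"
        "\tinitrd /initramfs-2.6.32-279.19.1.el6.x86_64.img\n"
    )
    print(add_kernel_param(sample), end="")
```

After a reboot, feeding the contents of /proc/cmdline to param_is_active is one way to confirm the option actually took effect. Back up your grub config before writing any changes to it.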

4 comments:

Unknown said...
This comment has been removed by a blog administrator.
Unknown said...

I've just come across your post, as I was getting almost the same message whilst we were performing ESD testing on the shielded network cable.
We were deliberately applying 4kV to the shield, which would trigger this message, but maybe it was occurring naturally for you.

Olivier L. said...

I fixed the exact same problem by disabling SMP in the grub conf (adding the "nosmp" parameter). It has worked fine for weeks now.

Olivier L. said...

I tried updating to the latest kernel and removing the "nosmp" option from the grub config, and my server crashed less than 24 hours after that.
I'll give pcie_aspm=off a try.