Network Outage

Summary:

The core WOU network router pair stopped passing traffic at 9:30am on January 14, 2015. Partial network throughput was restored at 12:40pm, and full recovery occurred at 9:00pm the same day.

Timeline:

  • Campus network outage began at approximately 9:30am on January 14, 2015
  • UCS responded immediately and began diagnostics
  • Cisco TAC support was engaged at 10:30am
  • High CPU utilization was identified as an issue on the core campus router pair at 11:00am
  • Call placed to local Cisco representative for additional support at 11:30am
  • Call placed to NERO (the WOU ISP) engineer at 12:30pm
  • NERO diagnostics identified a server that was sending an excessive number of ARP requests to the router. The server was removed from the network at 12:40pm
  • Several networks were pulled out from behind the firewall, allowing network traffic to flow again
  • CPU utilization dropped from 99% to 86% after the server was removed from the network
  • By about 12:50pm, CPU utilization had climbed back to 99% even though the server had not been reconnected to the network
  • Additional Cisco support was provided at about 1:00pm. At this point we had three Cisco engineers on the phone and connected to our router pair via a Webex call.
  • By late afternoon, I requested additional on-site support from Mt. States Networking.
  • A Mt. States engineer was on site by 6:00pm
  • At ~8:15pm, the router's NetFlow process was identified as a contributor to the high CPU utilization. After NetFlow was removed from the router, CPU utilization fell from 99% to 23%
  • All networks were moved behind the firewall and traffic continued to flow properly.
  • The suspect host that was removed in the morning was returned to service and the CPU utilization on the router immediately climbed to 99%
  • The suspect host was removed from the network again
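
The timeline above mentions a server sending an excessive number of ARP requests. As a rough illustration only, the sketch below shows one way such a host could be flagged from captured ARP observations. It is not the method NERO or Cisco used during the outage; the function name, the record format, the MAC addresses, and the threshold are all hypothetical.

```python
from collections import Counter


def flag_arp_talkers(records, window_seconds, max_requests_per_second):
    """Return {mac: rate} for senders whose ARP request rate exceeds the limit.

    records: iterable of (src_mac, arp_op) tuples observed during the window,
             where arp_op is "request" or "reply" (hypothetical format).
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    # Count only ARP requests, grouped by the sender's MAC address.
    counts = Counter(mac for mac, op in records if op == "request")
    return {
        mac: count / window_seconds
        for mac, count in counts.items()
        if count / window_seconds > max_requests_per_second
    }


# Hypothetical sample: one host sends 5000 requests in 10 seconds, another 20.
sample = (
    [("aa:aa:aa:aa:aa:01", "request")] * 5000
    + [("aa:aa:aa:aa:aa:02", "request")] * 20
    + [("aa:aa:aa:aa:aa:02", "reply")] * 5
)
noisy = flag_arp_talkers(sample, window_seconds=10, max_requests_per_second=50)
print(noisy)  # {'aa:aa:aa:aa:aa:01': 500.0}
```

A per-sender request rate is only a first filter. A sensible threshold depends on the size of the network, and as the timeline shows, a noisy host may be one of several contributors to router load.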

Forensics:

  • February 15, 2015
    • Our Unix systems administrator has been reviewing the suspect server's logs and discovered that the server had been compromised. This server runs OpenStack.
    • We know that whoever compromised the server did not gain direct access to it via SSH or Telnet.
    • Forensics work continues…
