- Provide extensive redundancy, with parallel links between all major components (see the network design from Cisco in the picture as an example)
- Aim for 99.999% uptime (that's only five minutes of downtime per year)
- Assume there will be hardware and software failures, and design for sub-second failover to a redundant path when a failure occurs
- Assume there will be incessant security attacks from malware, viruses, Trojan horses, port scans, etc., and build systems to protect the network from these attacks
- Build diversity into the system so that a software bug or virus breakout is contained to one vendor's equipment and doesn't affect the entire system
- Build heterogeneity into the system so that the Internet is not owned or managed by a single company
- When low cost is more important than redundancy, which is true for some situations, avoid using the network for applications that can fail in such a way that people, birds, and turtles die horrible deaths, and hundreds of workers lose their jobs
- Design and build disaster recovery systems, and practice using them
- Continually research better ways to design and build networks for high availability, security, scalability, performance, efficiency, and accuracy, using a top-down approach that puts users before technology
- Move around bits and packets, not huge volumes of oil under tremendous pressure :-)
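The uptime and redundancy goals in the list above come down to simple arithmetic, sketched here in Python. The function names are made up for illustration. The parallel-path formula assumes the redundant paths fail independently, which real networks only approximate, and that is one reason vendor diversity matters.

```python
# Rough availability arithmetic; a sketch, not a planning tool.
# Function names are hypothetical, chosen for this example only.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year for a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def parallel_availability(single_path: float, paths: int) -> float:
    """Availability of `paths` redundant paths, ASSUMING independent failures.

    The system is down only when every path is down at the same time.
    """
    return 1.0 - (1.0 - single_path) ** paths


if __name__ == "__main__":
    for label, a in [("three nines", 0.999),
                     ("four nines", 0.9999),
                     ("five nines", 0.99999)]:
        print(f"{label}: {downtime_minutes_per_year(a):.2f} minutes of downtime per year")

    # Two parallel links, each available 99% of the time
    print(f"two 99% links in parallel: {parallel_availability(0.99, 2):.4%}")
```

Five nines works out to a little over five minutes per year. Under the independence assumption, two mediocre paths in parallel beat one good path, so a shared software bug that takes out both paths at once undoes most of the benefit.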
Tuesday, June 15, 2010
Petroleum Engineers Should Learn from Network Engineers
I'm going to go out on a limb and say that an Internet disaster of the same magnitude as the BP oil spill is unlikely. To hedge my bet, I'm also going to say that if it does happen, the Internet will recover quickly, within hours or even minutes. The disaster won't linger for months like it appears poised to do on the Gulf Coast of the U.S.
I'll add one more caveat. The BP disaster is regional, mostly just affecting the Gulf Coast states. The Internet is global. I think a global Internet outage that lasts more than a few hours is unlikely. A regional outage is more likely, but if it happens, it won't affect the entire Southeast U.S., and recovery will be quick, within hours, especially in metropolitan areas.
What do network engineers do differently than petroleum engineers?
I said that an Internet disaster of the same magnitude as the BP oil spill is unlikely, but I also said I was going out on a limb in saying so. Am I too optimistic? Do we really have enough redundancy and fail-safes? Do we have enough diversity? Do we too often cut corners to save money, as it seems that BP did? If a major Internet outage occurs, will it be caused by a software bug, a hardware failure, or a security breach, and what steps will we take to recover quickly?
What do you think? Please comment. Thank you.