I once authored a COM+ service hosted in MTS for a large ecommerce site. This was the late 1990s, when there was no service-oriented architecture and distributed computing was just becoming mainstream. The component let the ecommerce site save quotes on its quote management system, a mainframe, by calling a specific program exposed at a particular IP address and port. The quote management service could be called by multiple clients, which were essentially the various segments of the online store. The component was developed and tested over a period of three months across various development and testing environments, and was deployed just before the start of the holiday season.

The deployment went as per plan well into the evening. By about 9 PM it was complete, and unit tests run from one segment confirmed that the component worked as expected and was saving quotes on the mainframe. After what had been a long day, the rest of the team and I headed home to catch up on sleep.

At around 2 AM my phone started buzzing with a message indicating that the component was failing to save quotes, with certain calls timing out. Slowly these timeouts escalated to the point where almost a quarter of all calls were timing out. I connected and checked the logs to try to make sense of what was going wrong, but there was not much information in there to indicate any issue. I turned on verbose logging at the risk of taking a small performance hit. The component logs then showed that it had received some of the calls the client team claimed to have made, but a lot of other calls were missing. All the teams were assembled on a bridge, and initially everyone passed the ball around, with the component and client teams each insisting the issue was on the other side. While the client's logs indicated a call had been made to the component at a particular time, the server log showed the call arriving much later or not at all. All this while, customers on the site were unable to save any quotes, and this was major functionality since an enterprise segment's workflow depended on quotes being created and approved before checkout.
As matters escalated, a Sev 1 ticket was cut and incident and problem management were called in. Some of the guys in suits started calling in to check on when it would be resolved. The call to the component was put under a microscope and reviewed to work out why calls were going missing on the server. Similarly, the server component was reviewed for any reason it might not be acknowledging some calls. After various changes the server was rebooted and MTS restarted, but it still did not work. The developers and incident support on the call could not identify or isolate the issue, since the behaviour was very random and only certain calls were getting a delayed response or no response at all. One of the incident management folks suggested bringing in network management. Once engaged, they ran a network trace on some of the calls and, after some time, finally identified a rogue configuration on one of the network switches that was sending calls down the wrong route, causing calls to the server to be lost. Once the switch configuration was corrected, service returned to normal and 100% of calls were responded to. This simple network switch configuration issue took around six hours to solve and resulted in a rather large loss in revenue. It was 10 AM when service was back to normal and the ticket was closed.
Now the million-dollar question: could the development and deployment teams have anticipated this issue? Could it have been resolved earlier? Is there anything in the design that could have averted the issue, or at least enabled the teams to identify the cause faster? Before you jump in to say this could have been identified and resolved easily, remember this was distributed computing in the late 1990s.