Indian Power Grid Meltdown – an information failure

On the 30th and 31st of July 2012 two spectacular failures occurred in the Indian national electricity system, leaving hundreds of millions of people without electricity for much of the day.  A natural reaction in the West is to assume that these failures were caused by badly designed or maintained electrical systems, and this might be true if the failures had originated in the local distribution networks.  In fact, Power Grid Corporation, the transmission operator, is considered to be one of the best run parts of the Indian grid.  It is ironic that it was the “best” part of the system that failed.

The failures occurred during the night, when the primary users are trains, irrigation pumps, air conditioners, and night-running industries.  It appears that the Northern Grid was drawing power from the Western Grid, which had a surplus of generation.  This is a normal operating situation.  Then a tie-line – an interconnection between the Northern and Western Grids – tripped.  This is also not unusual, although in this case it appears that the normal backup procedures also failed.

What then happened, leading over a period of one to two hours to the collapse of both grids, was in large part an information failure.  Although the remaining tie-lines were becoming overloaded after one of them had tripped, key network management signals – the grid frequency and the Availability-Based Tariffs – continued to encourage the transfer of power between the two grids.  As a result the state-level distribution networks continued to draw power.  A significant contributing factor is believed to have been the policy that makes power for irrigation free for farmers: with no price attached to that demand, a key balancing mechanism was absent.  This is a very large, quite modern grid, but it is far from being a Smart Grid.  Two major elements of information engineering were missing here.
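To make the overload arithmetic concrete, here is a minimal sketch of what happens to the surviving tie-lines when one trips while the transfer between the grids stays the same.  The function name, the megawatt figures and the assumption that the surviving lines share the transfer equally are all mine, for illustration only; real flows divide according to network impedances, and none of these numbers come from the Indian grid.

```python
def loading_after_trip(transfer_mw, line_ratings_mw, tripped_index):
    """Return the loading (fraction of rating) of each surviving tie-line.

    Assumption for illustration: the unchanged transfer is shared equally
    among the lines that remain in service.
    """
    surviving = [rating for i, rating in enumerate(line_ratings_mw)
                 if i != tripped_index]
    if not surviving:
        raise ValueError("no tie-lines left in service")
    flow_per_line = transfer_mw / len(surviving)
    return [flow_per_line / rating for rating in surviving]


# Hypothetical figures: 3000 MW over three lines, then the second one trips.
ratings = [1000, 1500, 1500]
before = [3000 / len(ratings) / r for r in ratings]
after = loading_after_trip(3000, ratings, tripped_index=1)
print(before)  # every line at or below its rating
print(after)   # the 1000 MW line is now carrying 1500 MW
```

Nothing in the frequency or tariff signals described above tells the downstream networks that the second list now contains a value well above 1.0.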

First, although the grids are well instrumented, no mechanism existed to aggregate and interpret the quasi-real-time information from thousands of sensors scattered across some millions of square kilometers.  Under normal conditions, the system can enable the operators to identify and correct single, localized failures, but on the edge of many widespread failures, it would have been unclear how to respond.
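As a rough sketch of what "aggregate and interpret" could mean at its very simplest, the following rolls individual line readings up into one record per region and flags regions where a large share of lines is stressed at the same time.  The `Reading` fields, the thresholds and the region names are hypothetical; a real control-room system would work from far richer data.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    region: str
    line_id: str
    loading: float  # fraction of rating; 1.0 means the line is at its limit


def summarize(readings, warn_at=0.9):
    """Roll per-line readings up into one summary record per region."""
    summary = {}
    for r in readings:
        s = summary.setdefault(r.region,
                               {"lines": 0, "stressed": 0, "worst": 0.0})
        s["lines"] += 1
        if r.loading >= warn_at:
            s["stressed"] += 1
        s["worst"] = max(s["worst"], r.loading)
    return summary


def regions_at_risk(summary, max_stressed_share=0.25):
    """Regions where more than the given share of lines is stressed."""
    return sorted(region for region, s in summary.items()
                  if s["stressed"] / s["lines"] > max_stressed_share)


readings = [
    Reading("North", "N-1", 0.95),
    Reading("North", "N-2", 0.50),
    Reading("West", "W-1", 0.40),
    Reading("West", "W-2", 0.30),
]
print(regions_at_risk(summarize(readings)))
```

The point of the design is that the operator sees a handful of regional indicators rather than thousands of raw values, which is exactly what matters when many things go wrong at once.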

Second, although information about the actual system state is collected every five minutes, there is no provision for predicting its future evolution.  Again, in normal circumstances the system evolves gradually and linearly in a way that is understandable to human operators.  But a complex system of systems on the edge of a massive instability is beyond human comprehension.  The operators had no way to foresee the catastrophe, nor any way to simulate the outcomes of the mitigation actions open to them.
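Even the crudest form of look-ahead gives an operator something to act on.  The sketch below fits a straight line through recent five-minute samples of a line's loading and estimates how long until the limit is reached.  The function name and the sample values are hypothetical, and a linear trend is precisely the assumption that breaks down near instability, so this stands in for the idea of prediction rather than for a real network simulation.

```python
def minutes_until_limit(samples, limit=1.0, interval_min=5.0):
    """Estimate minutes from the latest sample until `limit` is crossed.

    Fits a least-squares line through equally spaced samples.  Returns
    None if the trend is flat or falling, and 0.0 if the fitted line is
    already at or above the limit.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sxy / sxx  # change in loading per sampling interval
    if slope <= 0:
        return None
    # Sample index at which the fitted line reaches the limit.
    x_cross = x_mean + (limit - y_mean) / slope
    return max(0.0, (x_cross - (n - 1)) * interval_min)


# Hypothetical loading of one tie-line over the last twenty minutes.
print(minutes_until_limit([0.70, 0.75, 0.80, 0.85]))
```

With these illustrative samples the estimate is roughly a quarter of an hour, which is the kind of warning a five-minute snapshot on its own never provides.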

The result was a domino cascade of shutdowns across the distribution networks.  The local distribution networks in India are famously unreliable, however, so many businesses and even private residences have backup power supplies.  Given that the failures occurred during the night, these consumers may not even have been aware of what had happened for some hours.

Restarting a grid from this kind of failure is also a major challenge.  Each time a breaker is re-closed in the distribution network, it is like a car whose engine is running being jammed into gear without the use of the clutch.  There is no gradual “fade in”.  The breaker closes and your lights come on.  That sudden demand surges back up the distribution and transmission networks and back to the generators.  In India there must have been tens of thousands of such events and again no way to coordinate them to smooth the restart.
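To illustrate what coordinating the restart might look like, here is a minimal sketch that groups feeders into sequential reclosing steps so that no single step asks the generators to pick up more than a fixed block of load.  The feeder names, the loads and the 60 MW step limit are hypothetical, and the sketch ignores the fact that demand immediately after reconnection can exceed the normal load.

```python
def plan_restart(feeder_loads_mw, max_step_mw):
    """Group feeders into reclosing steps of at most `max_step_mw` each.

    Uses a simple first-fit pass over feeders sorted by descending load.
    """
    steps = []
    ordered = sorted(feeder_loads_mw.items(), key=lambda kv: -kv[1])
    for name, load in ordered:
        if load > max_step_mw:
            raise ValueError(f"feeder {name} alone exceeds the step limit")
        for step in steps:
            if step["mw"] + load <= max_step_mw:
                step["feeders"].append(name)
                step["mw"] += load
                break
        else:
            # No existing step has room, so open a new one.
            steps.append({"feeders": [name], "mw": load})
    return steps


# Hypothetical feeders and loads in MW.
plan = plan_restart({"A": 40, "B": 30, "C": 30, "D": 20}, max_step_mw=60)
for number, step in enumerate(plan, start=1):
    print(number, step["feeders"], step["mw"])
```

Each step would be closed only after the generators have settled from the previous one, which is the gradual “fade in” that uncoordinated breaker closures cannot provide.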

Other nations throughout the world with similarly large grids should take heed of these disasters, since with few exceptions – China, Japan, Korea, parts of Europe – their information engineering is no better.  As we put ever increasing stress on our man-made infrastructures and on the natural environment, we are pushing thousands if not millions of such complex systems of systems ever closer to their limits of stability.  At these limits, all that may be required is the proverbial “butterfly flapping its wings” to trigger the kind of meltdown we have seen here.  Indeed the 2003 meltdown in Italy was believed to have been caused by a tree branch in Switzerland brushing against a high-voltage power line.  Worse, in some cases it may never be possible to return the system to its previous, quasi-stable state.

(With many thanks to my colleagues in IBM’s Energy and Utility Industry group.)