Extreme availability with Oracle stretched clusters

Some of my customers have been pushing for more availability in their Oracle database applications. They want to eliminate downtime completely even if they experience a site failure. Whether this is a real business requirement or a technology push, I’m not sure – I guess a bit of both.

Most of these customers have already implemented Oracle RAC (Real Application Clusters), which provides them active/active server clustering for Oracle. If one of the servers in a RAC cluster fails, the others just keep running – no restart or recovery involved. This is a High Availability option typically for local sites.

For Disaster Recovery, most customers have some sort of storage replication (e.g. EMC SRDF/Synchronous or SRDF/Async), or they use Oracle Data Guard, which replicates data at the Oracle database level. This protects against site failures and offers zero or near-zero data loss for committed transactions in Oracle. Non-committed transactions are rolled back during the restart, and this is exactly one of the problems, by the way.

The problem at the business application level appears when an outage occurs: you have to restart (and in many cases, recover) the database at the standby location. The designation “Disaster Recovery” suggests that this would only happen when a severe disaster strikes, but many of us are aware that failovers also have to be performed to recover from other causes, such as broken servers, human error, data communications issues, power failures etc., and therefore happen much more frequently than the name suggests.

Although a failover can be made to work within a few minutes, and no committed transactions need to be lost, the business problem is a bit more severe, because:

  • It requires all application servers to be restarted and/or reconnected to the failover server (which has to be done in the right order and can be quite complex, especially if it involves middleware software such as message buses or service oriented architectures)
  • All running transactions are aborted and have to be restarted
  • Some bad application code (which is rumoured to really exist, although application vendors will not admit this) does not do very well in recovering from those restarts.

Assume the following situation: a batch job processes 100 million transactions, and it takes 12 hours to complete. After 6 hours the database is failed over due to some kind of (logical or physical) disaster, when only half of the transactions have been committed (assuming the batch job commits every so many transactions to avoid running out of rollback space). If you re-run the job after failover, then the half that was already processed gets processed again, causing serious business issues (e.g. sending out double invoices and stuff like that). Of course, these issues should not occur, as application code should be robust enough to recover from restarts without double processing – but this requires strict programming rules for code developers. And so, even well respected applications such as Oracle E-business suite and SAP suffer from such problems sometimes (so I’ve heard).
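To make the "robust enough to recover from restarts" point a bit more concrete, here is a minimal sketch of a restartable batch job. It is only an illustration under my own assumptions: it uses Python's built-in SQLite module as a stand-in for the real database, and the table names (invoices, checkpoint), the job name and the fail_after switch are all hypothetical. The idea is that the checkpoint is written in the same transaction as the work itself, so whatever the database rolls back at failover is exactly what the re-run picks up again.

```python
import sqlite3

def run_batch(conn, item_ids, commit_every=1000, job="billing", fail_after=None):
    """Process item_ids (assumed to be in ascending order) and return the
    number of items processed in this run. Hypothetical schema and names."""
    conn.execute("CREATE TABLE IF NOT EXISTS invoices (item_id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE IF NOT EXISTS checkpoint "
                 "(job TEXT PRIMARY KEY, last_id INTEGER)")
    conn.commit()

    # Find out how far a previous (interrupted) run got.
    row = conn.execute("SELECT last_id FROM checkpoint WHERE job = ?", (job,)).fetchone()
    last_done = row[0] if row else -1

    processed = 0
    for item_id in item_ids:
        if item_id <= last_done:
            continue  # already committed before the failover, do not bill twice
        if fail_after is not None and processed == fail_after:
            conn.rollback()  # mimics the database discarding uncommitted work
            raise RuntimeError("simulated failover")
        # The work and the checkpoint update share one transaction.
        conn.execute("INSERT INTO invoices (item_id) VALUES (?)", (item_id,))
        conn.execute("INSERT OR REPLACE INTO checkpoint (job, last_id) VALUES (?, ?)",
                     (job, item_id))
        processed += 1
        if processed % commit_every == 0:
            conn.commit()
    conn.commit()
    return processed
```

A re-run after the simulated failover skips everything up to the last committed checkpoint and only processes the remainder. Note that this only addresses the double processing; it does nothing about the hours of batch window that are already gone.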

Besides, even if you can restart the database within five minutes with no lost (committed) transactions, you’re now 6 hours behind in the batch run and you might not be able to complete it in the regular batch window (this can be a serious issue for customers).

A real world example: I used to work as a Unix engineer for an investment bank before joining EMC. Investment banks have portfolio management systems where they track all stock transactions and assets. There typically is a nightly batch that calculates the value of the funds (a process called intrinsic value calculation) and this calculation can take a very long time (typically many hours). If you run out of time, then you cannot report the accurate stock prices to the stock exchange – which would really hurt the business – think of having to publish inaccurate stock prices based on partial (manual) calculations, penalties for late publishing and so on.

If a disaster struck at the end of the batch window, you would have to re-run the whole job, even if the whole application stack were restarted in ten minutes or so without data loss. The job would never complete in time for reporting.

Another example: one of my customers was a semiconductor manufacturer, running large logistics and planning systems for their global microchip production plants. It happened a few times that their cluster (protected indeed with EMC SRDF technology and Unix cluster software) went down, and it took about 10 minutes for the database servers to restart. Of course, not a single committed transaction was lost (due to the simplicity and robustness of EMC SRDF).

But it would take up to another hour for the application stack to be restarted and reconnected. The chip manufacturing plants all over the world would not be able to connect to the transaction system during this period, and this could eventually result in microchips being produced with wrong or missing labels or serial numbers – causing the whole batch of chip wafers to be discarded. Any chip produced at any of their worldwide plants had to be thrown in the garbage can after 30 minutes, because the system could no longer find out where it should be delivered, what serial number it had, etc. The total downtime period was more than they could handle – potentially resulting in millions in lost revenue, due to overtime for their employees to fix this, delays in delivery, expensive chips thrown away etc.

As a result of such issues, many customers are asking for solutions that continue processing even when suffering from disasters. Oracle is pushing their Stretched RAC solution, part of their “Maximum Availability Architecture”: Oracle RAC clusters where node A and node B are separated across two data centers (typically campus distance – less than, say, 20 kilometers). This also requires storage to be available (read/write) at both locations – so that if one site failed completely, the other one could continue running, just like in a local cluster with only one enterprise storage subsystem.

The basic concept seems very simple: just “stretch” the cluster across two sites, make sure there is enough bandwidth in between, provide for storage to be available at both locations, and you have the basic ingredients to survive site failures without any downtime at all.

In order to avoid split brain issues or data corruption, this requires a third location as well (the “arbitrator”, sometimes called the “tie breaker”, which decides, during failures, which node is allowed to keep running and which one is aborted). Oracle recommends that customers put a small database server at the third location. The server does not require much performance or disk capacity as it will not process applications. Other options involve having “voting disks” in the third location (using, for example, an NFS fileserver).
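The role of the third location can be illustrated with a small sketch of a generic majority vote. To be clear, this is not how Oracle Clusterware actually implements it; the function name, the one-vote-per-node rule and the "lowest node name wins a tie" rule are my own assumptions, chosen only to show why an odd number of votes lets exactly one side continue.

```python
def surviving_partition(all_nodes, partitions, arbiter_reachable_by):
    """Return the group of nodes allowed to keep running, or None.

    all_nodes: every node configured in the cluster (one vote each).
    partitions: list of sets of nodes that can still talk to each other.
    arbiter_reachable_by: nodes that can still reach the third location.
    Hypothetical model: the arbiter holds one extra vote and hands it to
    a single partition only.
    """
    total_votes = len(all_nodes) + 1  # all nodes plus the arbiter

    # The arbiter favours the largest partition that can reach it;
    # ties are broken by the lowest node name so the choice is deterministic.
    candidates = [p for p in partitions if p & arbiter_reachable_by]
    favoured = min(candidates, key=lambda p: (-len(p), min(p))) if candidates else None

    for p in partitions:
        votes = len(p) + (1 if p is favoured else 0)
        if 2 * votes > total_votes:  # strict majority needed to continue
            return p
    return None  # nobody has a majority, so everybody stops


nodes = {"node_a", "node_b"}

# Interconnect between the sites is cut, both nodes still see the arbiter:
print(surviving_partition(nodes, [{"node_a"}, {"node_b"}], nodes))

# Site A is lost completely, node B still sees the arbiter:
print(surviving_partition(nodes, [{"node_b"}], {"node_b"}))
```

In this model, losing both the interconnect and the third location at the same time leaves no side with a majority, so both nodes stop rather than risk corrupting the data.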

A very important note here, and I will explain that in a bit more detail in a future post: High Availability is not the same as Disaster Recovery. Although some customers try to get the best of both worlds in one solution, there are caveats, and even Oracle makes a clear distinction. A stretched cluster (even across a few kilometers distance) is NOT a disaster recovery solution.

In my next posts we will go into more details on why such a setup is more complicated than you would expect.
