How does Oracle keep data consistent on filesystems?
February 14, 2011
I am sometimes asked how EMC can provide consistent snapshots on storage systems, given that file systems such as ext3 and ufs may keep written data in the file system (server) cache.
As you may be aware, Oracle cannot recover a crashed database if the data in Oracle’s datafiles (the ones holding the actual tables and indexes) is more recent than the data in the redo logs. If both datafiles and redo logs reside on filesystems, the concern is that Oracle would write a commit to the redo log and then the change to the datafile, but with both I/Os staying in cache, the filesystem might flush the two changed blocks to disk in the wrong order. If that were the case, the database would not be able to perform crash recovery after a power failure (or a “shutdown abort” command in Oracle).
It would also prevent EMC from making crash-consistent database copies without using Oracle Hot Backup mode.
However, Oracle opens specific files (especially the redo logs) with a special flag on the open system call, O_SYNC, to make sure every write is flushed to disk directly instead of being kept in the filesystem cache, where it could cause trouble during crash recovery. How a file is opened is independent of the filesystem or mount options.
For more info, look at http://linux.die.net/man/2/open to see how this is documented:
The file is opened for synchronous I/O. Any write()s on the resulting file descriptor will block the calling process until the data has been physically written to the underlying hardware. But see RESTRICTIONS below.
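To make the effect of the flag a bit more concrete, here is a minimal sketch in Python, assuming a Unix-like system where os.O_SYNC is available. The function name append_synchronously and the file name used below are made up for illustration; this is not Oracle's code, just the same open flag applied to an ordinary file.

```python
import os


def append_synchronously(path: str, payload: bytes) -> int:
    """Append payload to path through a descriptor opened with O_SYNC.

    Hypothetical helper for illustration only. With O_SYNC set, each
    os.write() call blocks until the operating system reports the data as
    written to the underlying device, rather than returning as soon as the
    data sits in the filesystem cache.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_SYNC
    fd = os.open(path, flags, 0o600)
    try:
        written = 0
        # os.write() may write fewer bytes than requested, so loop until done.
        while written < len(payload):
            written += os.write(fd, payload[written:])
        return written
    finally:
        os.close(fd)
```

Calling append_synchronously("redo.log", b"commit 1\n") would return only after the record has been handed to the device. A writer that appends its log record this way before touching the corresponding data block never depends on the order in which the filesystem flushes its cache.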
Of course, this only works if writes are not cached somewhere else in the food chain. I consider battery-protected memory in storage subsystems an exception, if done correctly :)
I once spoke to a customer who had questions about this and doubted our consistent split technology. They said they had had trouble with recovery in the past and now questioned it.
Digging deeper, I found that the reason was that they had used IDE drives in their servers with write cache enabled on the physical drives. Guess what happened after a power failure…