🚨 Unit 7: When Things Go Wrong – My Journey into Recovery, Logging, and High Availability

💥 Crashes Happen – Now What?
Until Unit 7, I’d never thought deeply about what happens when a database system fails. But these lessons on recovery mechanisms, log-based undo/redo, and high-availability strategies opened my eyes to the behind-the-scenes magic that keeps our systems resilient.
Imagine a system crash mid-transaction. Without recovery mechanisms, you’d end up with half-transferred money, broken updates, and corrupted data. Thankfully, a DBMS has a well-thought-out system for getting things back on track.
🧠 What Causes Failure?
Here’s what can go wrong in a DBMS:
- Power/system failures
- Transaction failures
- Human errors
- Hardware upgrades
- Security breaches
- Data corruption
- Natural disasters
- Compliance and audit requirements
This unit emphasized one key truth: Even if the DB design is perfect and the locks are flawless, it’s meaningless unless we can recover when things break.
🪵 Logging: The Foundation of Recovery
Databases use a special structure called a log to record every update. It’s like a black box flight recorder for DB operations.
Before any change is made to the actual database, a log record is written to stable storage. This is the foundation of the Write-Ahead Logging (WAL) rule.
📘 Anatomy of a Log Record
A typical update log:
<Ti, Xj, V1, V2>
- Ti: Transaction ID
- Xj: Data item
- V1: Old value
- V2: New value
Other log entries include:
<Ti start>
<Ti commit>
<Ti abort>
Together, these entries tell the complete story of what happened and in what order.
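To make this concrete, here is a minimal sketch in Python, assuming a plain list stands in for stable storage and tuples stand in for the log record shapes above (all names here are my own, not any real DBMS API):

```python
# A toy write-ahead log. Record shapes follow the unit's notation:
# <Ti start>, <Ti, Xj, V1, V2>, <Ti commit>.

log = []  # append-only, oldest record first

def log_start(ti):
    log.append(("start", ti))

def log_update(ti, xj, v1, v2):
    # <Ti, Xj, V1, V2>: V1 (old value) enables undo, V2 (new value) enables redo
    log.append(("update", ti, xj, v1, v2))

def log_commit(ti):
    log.append(("commit", ti))

# T1 transfers 50 from account A to account B:
log_start("T1")
log_update("T1", "A", 1000, 950)
log_update("T1", "B", 2000, 2050)
log_commit("T1")
```

Reading the list top to bottom replays exactly "what happened and in what order" — which is all recovery ever needs.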
The distinction between deferred and immediate database modification was new to me:
- Deferred: No changes happen until commit
- Immediate: Changes may happen during transaction execution
Most recovery algorithms today support immediate modification and rely on logs to fix inconsistencies after a failure.
🔁 Undo and Redo
These are the superheroes of the recovery world.
- Undo(Ti): Rolls back changes made by an uncommitted transaction using its old values
- Redo(Ti): Re-applies changes from committed transactions to ensure durability
👉 Fun fact: Undo doesn’t just delete. It actually creates redo-only log records of the undo action itself to prevent repeated undos in case of another crash.
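Here is a rough sketch of both operations, assuming each update record is a tuple `("update", ti, item, V1, V2)` as in the log anatomy above (the function names are mine, chosen to match the unit's Undo(Ti)/Redo(Ti) notation):

```python
def redo(ti, log, db):
    # Forward scan: re-apply every new value (V2) written by Ti
    for rec in log:
        if rec[0] == "update" and rec[1] == ti:
            db[rec[2]] = rec[4]

def undo(ti, log, db):
    # Backward scan: restore old values (V1). Each undone write is itself
    # logged as a redo-only record, so a crash during undo never rolls
    # the same write back twice.
    for rec in reversed(list(log)):
        if rec[0] == "update" and rec[1] == ti:
            db[rec[2]] = rec[3]
            log.append(("redo-only", ti, rec[2], rec[3]))
    log.append(("abort", ti))

# A crash left T1 half-done: A was updated, B never was
db = {"A": 950, "B": 2000}
log = [("start", "T1"), ("update", "T1", "A", 1000, 950)]
undo("T1", log, db)
print(db["A"])  # 1000 -- the transfer is as if it never started
```

Note how undo appends to the log while it works: that is the "fun fact" above in action.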
⚠️ Crash Recovery Scenarios
- Crash before <Ti commit>: undo Ti
- Crash after <Ti commit>: redo Ti
- Crash mid-transaction: undo incomplete transactions, redo committed ones
The DBMS uses log analysis to classify which transactions need to be redone and which need to be undone.
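The classification rule is simple enough to sketch, assuming the tuple-shaped log records from earlier: a transaction is redone if its commit record made it into the log, and undone if it started but never committed (this is my own toy version of the analysis, not a production algorithm):

```python
def recover(log, db):
    committed = {r[1] for r in log if r[0] == "commit"}
    started = {r[1] for r in log if r[0] == "start"}
    for rec in log:                      # redo pass: scan forward
        if rec[0] == "update" and rec[1] in committed:
            db[rec[2]] = rec[4]          # re-apply new value
    for rec in reversed(log):            # undo pass: scan backward
        if rec[0] == "update" and rec[1] in started - committed:
            db[rec[2]] = rec[3]          # restore old value

log = [("start", "T1"), ("update", "T1", "A", 100, 90),
       ("commit", "T1"),
       ("start", "T2"), ("update", "T2", "B", 50, 75)]  # T2 never committed
db = {"A": 100, "B": 75}   # crash caught B's dirty value on disk
recover(log, db)
print(db)  # {'A': 90, 'B': 50}
```

T1's update survives (durability) and T2's evaporates (atomicity) — both from the same pile of log records.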
🧷 Checkpoints: Like Save Points in a Game
A checkpoint captures the system state so recovery doesn’t need to scan the whole log.
Steps during a checkpoint:
- Flush log records to disk
- Write dirty buffer blocks to disk
- Log a checkpoint record with a list of active transactions
There’s also a “fuzzy checkpoint” which allows transactions to keep running during checkpointing, speeding up performance but making recovery logic a bit more complex.
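A tiny sketch of why this helps, assuming a checkpoint record shaped `("checkpoint", active_list)` that names the transactions in flight (again my own toy representation): only transactions active at the checkpoint, or started after it, can possibly need undo or redo.

```python
def recovery_start(log):
    # Find the newest checkpoint; recovery scans from there, not from 0
    for i in range(len(log) - 1, -1, -1):
        if log[i][0] == "checkpoint":
            return i, set(log[i][1])
    return 0, set()          # no checkpoint yet: scan the whole log

log = [("start", "T1"), ("update", "T1", "A", 1, 2), ("commit", "T1"),
       ("checkpoint", ["T2"]),
       ("start", "T3")]
idx, active = recovery_start(log)
# Everything before idx concerns only finished transactions (except
# those listed in `active`), so T1's records are never re-scanned.
print(idx, active)  # 3 {'T2'}
```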
📥 Buffering and WAL Enforcement
Databases buffer both:
- Log records in memory
- Modified data pages in memory
The Write-Ahead Logging rule ensures:
- Log is written before associated data block is flushed to disk
Two important policies here:
- No-force: modified blocks need not be flushed to disk at commit time
- Steal: blocks modified by a still-uncommitted transaction may be flushed before it commits
Together, they optimize performance but require careful recovery logic.
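Here is a little sketch of how a buffer manager might enforce WAL under steal/no-force, assuming LSNs are simple sequence numbers and lists stand in for disk (the class and method names are hypothetical, for illustration only):

```python
class BufferManager:
    def __init__(self):
        self.log_tail = []     # log records still in memory
        self.stable_log = []   # log records already "on disk"
        self.disk = {}         # page id -> value on disk
        self.dirty = {}        # page id -> (value, LSN of last update)

    def write(self, page, value):
        lsn = len(self.stable_log) + len(self.log_tail)
        self.log_tail.append((lsn, page, value))  # log record first
        self.dirty[page] = (value, lsn)

    def flush_log(self):
        self.stable_log.extend(self.log_tail)
        self.log_tail.clear()

    def flush_page(self, page):
        value, lsn = self.dirty.pop(page)
        # WAL rule: log records up to this page's last LSN must reach
        # stable storage before the page itself may be written out
        if lsn >= len(self.stable_log):
            self.flush_log()
        self.disk[page] = value

bm = BufferManager()
bm.write("P1", "hello")
bm.flush_page("P1")          # steal: flushed before any commit
print(len(bm.stable_log))    # 1 -- the log reached disk first
```

Steal is safe precisely because the forced-out log record carries enough information to undo the early flush.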
🧰 Handling Total Disk Failures
What if the disk itself is lost?
Answer: take periodic full-database dumps to external stable storage (such as tape), then after replacing the disk, redo logged operations from the point of the last dump.
🌐 Remote Backup Systems
This part fascinated me. High availability means you need:
- Remote backup that receives log shipments
- Mechanisms for automatic failover
- Commit protocols like one-safe, two-safe, and very-safe for ensuring durability across locations
Hot-spare configurations mean the backup is always nearly ready to go live.
Other techniques include:
- Replicated databases (not just backup copies)
- Load balancing and application-level failover
🧪 Recovery Algorithms in Action: Meet ARIES
ARIES (Algorithms for Recovery and Isolation Exploiting Semantics), developed at IBM and used in DB2, is the de facto industry standard.
Three steps:
- Analysis – Find out which transactions were active
- Redo – Repeat history to bring the database to its state at the time of the crash
- Undo – Roll back uncommitted transactions
ARIES assumes:
- Write-Ahead Logging
- STEAL and NO-FORCE policies
- Logging during undo to support repeat crash recovery
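The "repeating history" idea in the redo phase can be sketched like this, assuming (as ARIES does) that each page remembers the LSN of the last update applied to it, the pageLSN — my code here is a simplified illustration, not the real algorithm:

```python
def aries_redo(log, pages, page_lsn):
    # Re-apply a logged update only if the page hasn't absorbed it yet:
    # pageLSN >= record LSN means the write already reached the page.
    for lsn, page, value in log:
        if page_lsn.get(page, -1) < lsn:   # page is behind the log
            pages[page] = value
            page_lsn[page] = lsn

pages = {"P1": "old", "P2": "v2"}
page_lsn = {"P1": 0, "P2": 2}              # P2 already holds LSN 2's write
log = [(1, "P1", "v1"), (2, "P2", "v2")]
aries_redo(log, pages, page_lsn)
print(pages)  # {'P1': 'v1', 'P2': 'v2'} -- LSN 2 was skipped, not reapplied
```

The pageLSN check is what makes redo idempotent: crash during recovery, run it again, and nothing is applied twice.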
⚡ Main-Memory Databases: Recovery at Warp Speed
When your whole DB lives in RAM (for speed), recovery becomes trickier:
- Data vanishes if RAM is lost
- Logging and checkpointing are still needed
Optimizations include:
- Skipping index logging (rebuilds fast)
- Keeping undo logs in memory
- Partitioning log recovery across CPU cores
Parallel redo from logs makes recovery blazing fast.
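One way the partitioning idea might look, assuming hash partitioning by data item (a toy sketch of mine): all updates to one item land in the same partition, so per-item order is preserved and each partition can be replayed on its own core.

```python
def partition_log(log, n_workers):
    parts = [[] for _ in range(n_workers)]
    for rec in log:                # rec = (item, new_value)
        # Same item -> same partition, so its updates stay in order
        parts[hash(rec[0]) % n_workers].append(rec)
    return parts

def replay(partition, db):
    for item, value in partition:  # independent of other partitions
        db[item] = value

log = [("A", 1), ("B", 2), ("A", 3)]
db = {}
for part in partition_log(log, 4):
    replay(part, db)               # each call could run on its own core
print(db["A"])  # 3 -- the later update to A still wins
```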
💭 Final Thoughts on Unit 7
This unit made me appreciate how databases not only store and serve data but also fight to survive chaos. From logging every action to shipping logs across continents, the resilience engineered into modern DBMS is phenomenal.
Now I understand that a “COMMIT” isn’t just a word. It’s a promise backed by logs, locks, buffers, backups, and decades of engineering wisdom.