Dealing with Race Conditions: A Practical Example-Golang-php.cn

In your career, you'll encounter Schrödinger's cat problems, situations that sometimes work and sometimes don't. Race conditions are one of these challenges (yes, just one!).

Throughout this blog post, I'll present a real-world example, demonstrate how to reproduce the problem and discuss strategies for handling race conditions using serializable transaction isolation and advisory locks with PostgreSQL.

Inspired by "Designing Data-Intensive Applications," Chapter 7 - Transactions "Weak Isolation Levels"

Github Repository with Practical Example

The application

This application manages on-call shifts for doctors at a hospital. To focus on the race condition problem, let's simplify our scenario. Our app resolves around this single table:

CREATE TABLE shifts (
    id SERIAL PRIMARY KEY,
    doctor_name TEXT NOT NULL,
    shift_id INTEGER NOT NULL,
    on_call BOOLEAN NOT NULL DEFAULT FALSE
);

登录后复制

We have a critical business rule:

Each shift must always have at least one doctor on call.

As you may have guessed, implementing a naive API can lead to race condition scenarios. Consider this hypothetical situation:

Jack and John are both on-call at the hospital during the same shift. At nearly the same time, they decide to request leave. One succeeds, but the other relies on outdated information about how many doctors are on shift. As a result, both end up leaving their shift, breaking the business rule and leaving a specific shift with no doctors on call:

John --BEGIN------doctors on call: 2-------leave on call-----COMMIT--------> (t)
          \                 \                      \             \
           \                 \                      \             \ 
Database ------------------------------------------------------------------> (t)
               /               /                      /              /
              /               /                      /              /
Jack ------BEGIN------doctors on call: 2-----leave on call----COMMIT-------> (t)

登录后复制

Reproducing the problem

The application is a simple API implemented in Golang. Checkout the GitHub repository for instructions on how to run and execute the script to reproduce this race condition scenario. In summary, you'll need to:

Start the server: yarn nx serve hospital-shifts
Run the k6 test to reproduce the race condition scenario: yarn nx test hospital-shifts

The test attempts to call off two doctors simultaneously, hitting endpoints with different approaches: shiftId=1 uses advisory lock, shiftId=2 uses serializable transaction isolation, and shiftId=3 is a naive implementation without concurrency control.

The k6 results will output custom metrics to indicate which shiftId violated the business rule:

     ✓ at least one doctor on call for shiftId=1
     ✓ at least one doctor on call for shiftId=2
     ✗ at least one doctor on call for shiftId=3
      ↳  36% — ✓ 123 / ✗ 217

登录后复制

You'll need tools such as Yarn, Go, K6 and Docker, or you can use DevBox for an easier setup of repository dependencies.

Addressing the Race Condition

The problem occurs when our application makes a decision based on stale data. This can happen if two transactions run almost simultaneously and both try to call off doctors for their shift. One transaction succeeds as expected, but the other, relying on outdated information, also succeeds incorrectly. How can we prevent this undesired behavior? There are a few ways to achieve this, and I'll explore two options backed by PostgreSQL, though similar solutions can be found in other database management systems.

Serializable Transaction Isolation

Serializable Snapshot Isolation automatically detects and prevents anomalies such as the write skew demonstrated by our application.

I won't dive deep into the theory behind transaction isolation, but it is a common topic in many popular database management systems. You can find good materials by searching for snapshot isolation, like this one from the PostgreSQL official documentation on transaction isolation. Additionally, here is the paper that proposed this solution years ago. Talk is cheap, so let's see the code:

First, start the transaction and set the isolation level to Serializable:

    // Init transaction with serializable isolation level
    tx, err := db.BeginTxx(c.Request().Context(),    &sql.TxOptions{
        Isolation: sql.LevelSerializable,
    })

登录后复制

Then, proceed to execute operations. In our case its execute this function:

CREATE OR REPLACE FUNCTION update_on_call_status_with_serializable_isolation(shift_id_to_update INT, doctor_name_to_update TEXT, on_call_to_update BOOLEAN)
RETURNS VOID AS $$
DECLARE
    on_call_count INT;
BEGIN
    -- Check the current number of doctors on call for this shift
    SELECT COUNT(*) INTO on_call_count FROM shifts s WHERE s.shift_id = shift_id_to_update AND s.on_call = TRUE;

    IF on_call_to_update = FALSE AND on_call_count = 1 THEN
        RAISE EXCEPTION '[SerializableIsolation] Cannot set on_call to FALSE. At least one doctor must be on call for this shiftId: %', shift_id_to_update;
    ELSE
        UPDATE shifts s
        SET on_call = on_call_to_update
        WHERE s.shift_id = shift_id_to_update AND s.doctor_name = doctor_name_to_update;
    END IF;

END;
$$ LANGUAGE plpgsql;

登录后复制

Whenever inconsistent scenarios occur due to concurrent execution, the serializable isolation level will allow one transaction to succeed and will automatically rollback the others with this message, so you can safely retry:

ERROR:  could not serialize access due to read/write dependencies among transactions

登录后复制

You can find the complete example in the function updateWithSerializableIsolation.

Advisory Lock

Another way to ensure our business rules are enforced is by explicitly locking the resource for a specific shift. We can achieve this using an Advisory Lock at the transaction level. This type of lock is fully controlled by the application. You can find more information about it here.

It's crucial to note that locks can be applied at both the session and transaction levels. You can explore the various functions available here. In our case, we'll use pg_try_advisory_xact_lock(key bigint) → boolean, which automatically releases the lock after a commit or rollback:

BEGIN;

-- Attempt to acquire advisory lock and handle failure with EXCEPTION
    IF NOT pg_try_advisory_xact_lock(shift_id_to_update) THEN
        RAISE EXCEPTION '[AdvisoryLock] Could not acquire advisory lock for shift_id: %', shift_id_to_update;
    END IF;

-- Perform necessary operations

-- Commit will automatically release the lock
COMMIT;

登录后复制

Here is the complete function used in our application:

-- Function to Manage On Call Status with Advisory Locks, automatic release when the trx commits
CREATE OR REPLACE FUNCTION update_on_call_status_with_advisory_lock(shift_id_to_update INT, doctor_name_to_update TEXT, on_call_to_update BOOLEAN)
RETURNS VOID AS $$
DECLARE
    on_call_count INT;
BEGIN
    -- Attempt to acquire advisory lock and handle failure with NOTICE
    IF NOT pg_try_advisory_xact_lock(shift_id_to_update) THEN
        RAISE EXCEPTION '[AdvisoryLock] Could not acquire advisory lock for shift_id: %', shift_id_to_update;
    END IF;

    -- Check the current number of doctors on call for this shift
    SELECT COUNT(*) INTO on_call_count FROM shifts s WHERE s.shift_id = shift_id_to_update AND s.on_call = TRUE;

    IF on_call_to_update = FALSE AND on_call_count = 1 THEN
        RAISE EXCEPTION '[AdvisoryLock] Cannot set on_call to FALSE. At least one doctor must be on call for this shiftId: %', shift_id_to_update;
    ELSE
        UPDATE shifts s
        SET on_call = on_call_to_update
        WHERE s.shift_id = shift_id_to_update AND s.doctor_name = doctor_name_to_update;
    END IF;
END;
$$ LANGUAGE plpgsql;

登录后复制

You can find the complete example in the function updateWithAdvisoryLock.

Conclusion

Dealing with race conditions, like the write skew scenario we talked about, can be pretty tricky. There's a ton of research and different ways to solve these problems, so definitely check out some papers and articles if you're curious.

These issues can pop up in real-life situations, like when multiple people try to book the same seat at an event or buy the same spot in a theater. They tend to appear randomly and can be hard to figure out, especially if it's your first time dealing with them.

When you run into race conditions, it's important to look into what solution works best for your specific situation. I might do a benchmark in the future to compare different approaches and give you more insights.

I hope this post has been helpful. Remember, there are tools out there to help with these problems, and you're not alone in facing them!