Idempotency

Parent: Systems Design

Imagine a situation where your script fails mid run, well you don’t have to imagine it, its pretty common after all. What happens when the script retries?

Will it:

Skip the process that it should be running, resulting in missing data?
Reprocess what has already been processed, resulting in duplicates?
If multiple scripts run at the same time, will they process the same data twice?

This is the core of idempotency for data engineering. Can your script give the same output reliably when retried multiple times?

In my experience, idempotency is designed into data engineering in the WHERE clause, in which the clause should be scoped in a way where each subsequent run of that query yields the same results.

For instance this is non-idempotent:

SELECT TOP 1000 s3_key
FROM table
WHERE created_date = GETDATE()
ORDER BY created_date DESC

If the table gets new keys between script retries, you will get a different set of keys

And idempotent query looks like this:

-- Claim keys first
UPDATE TOP 1000 table
SET s3_run_id = {transaction_id} -- unique id for that particular script run
WHERE created_date = GETDATE()
ORDER BY created_date DESC
 
-- Select based on the run id
SELECT TOP 1000 s3_key
FROM table
WHERE s3_run_id = {transaction_id}

The transaction_id is one of the general ways to achieve idempotency, it should be generated by the script that is doing this processing.

Re-running the select statement will yield the same keys no matter how many times you run it.
Multiple scripts won’t work on the same keys once its been claimed, eliminating race conditions

Fauzan Acyuto

Explorer

Idempotency

Resources:

Backlinks

Graph View

Table of Contents

Recent Notes

Azure

Azure App Services

Azure Container Instances

Azure Economics

Azure Functions

Azure Monitor

Azure Purview

Azure Queues

Azure Redundancy Level

Azure Resource Manager

Explorer

Idempotency

Related:

Resources:

Backlinks

Graph View

Table of Contents

Recent Notes