Parent: Systems Design
Imagine a situation where your script fails mid run, well you don’t have to imagine it, its pretty common after all. What happens when the script retries?
Will it:
- Skip the process that it should be running, resulting in missing data?
- Reprocess what has already been processed, resulting in duplicates?
- If multiple scripts run at the same time, will they process the same data twice?
This is the core of idempotency for data engineering. Can your script give the same output reliably when retried multiple times?
In my experience, idempotency is designed into data engineering in the WHERE clause, in which the clause should be scoped in a way where each subsequent run of that query yields the same results.
For instance this is non-idempotent:
SELECT TOP 1000 s3_key
FROM table
WHERE created_date = GETDATE()
ORDER BY created_date DESCIf the table gets new keys between script retries, you will get a different set of keys
And idempotent query looks like this:
-- Claim keys first
UPDATE TOP 1000 table
SET s3_run_id = {transaction_id} -- unique id for that particular script run
WHERE created_date = GETDATE()
ORDER BY created_date DESC
-- Select based on the run id
SELECT TOP 1000 s3_key
FROM table
WHERE s3_run_id = {transaction_id}The transaction_id is one of the general ways to achieve idempotency, it should be generated by the script that is doing this processing.
- Re-running the select statement will yield the same keys no matter how many times you run it.
- Multiple scripts won’t work on the same keys once its been claimed, eliminating race conditions