Refund webhook drift: governed reconciliation with deterministic ledger fixes

The problem: refunds get “stuck” when webhooks and the ledger disagree

A common payments-ops failure mode:

  • Your PSP (e.g., card processor) says a refund succeeded (or failed) via webhook.
  • Your internal ledger shows the opposite (or shows nothing).
  • Webhooks arrive late, duplicated, or out of order.
  • Ops teams end up doing “refund archaeology” across dashboards, spreadsheets, and Slack.

This is exactly where AI can help with triage, but AI alone is risky:

  • It can misread context and trigger the wrong financial action (double-refund, wrong customer, wrong amount).
  • It can’t guarantee exactly-once execution under retries and partial failures.

Principle: AI suggests, Autom Mate executes under control.


Proposed pattern: Governed “Refund State Reconciliation” with deterministic execution

End-to-end workflow (copyable design)

1) Trigger

  • Trigger: PSP webhook refund.updated / charge.refunded (or a scheduled sweep every 15 minutes for “pending > X minutes”).
  • Autom Mate trigger type: API/Webhook trigger (event-based)

2) Validation (before any action)

  • Validate payload schema + required fields (refund_id, payment_id, amount, currency, event_created_at).
  • Enforce idempotency:
    • Build a deterministic key: psp_event_id if the PSP provides one, otherwise refund_id + status + amount.
    • Check if this key was already processed (store in your internal DB/ledger or a small “processed-events” table).
  • Reject/stop if:
    • currency mismatch
    • amount mismatch vs original payment (e.g., the refund amount exceeds the original payment amount)
    • refund references unknown payment
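The validation and idempotency checks above can be sketched roughly as follows. This is a minimal Python illustration, not Autom Mate's API: the Payment type, the in-memory payments lookup, and the processed set are hypothetical stand-ins for your payment store and “processed-events” table. It also assumes amounts are integer minor units (cents) and reads “amount mismatch” as “refund exceeds the original payment”.

```python
from dataclasses import dataclass

# Required fields as listed in the validation step.
REQUIRED_FIELDS = ("refund_id", "payment_id", "amount", "currency", "event_created_at")


@dataclass(frozen=True)
class Payment:
    """Hypothetical record of the original payment (amount in minor units)."""
    payment_id: str
    amount: int
    currency: str


def idempotency_key(event: dict) -> str:
    """Prefer the PSP event id; otherwise fall back to refund_id + status + amount."""
    if event.get("psp_event_id"):
        return f"evt:{event['psp_event_id']}"
    return f"ref:{event['refund_id']}:{event.get('status', '')}:{event['amount']}"


def validate(event: dict, payments: dict, processed: set) -> tuple:
    """Return (ok, reason). Nothing is executed here; this only gates later steps."""
    missing = [name for name in REQUIRED_FIELDS if name not in event]
    if missing:
        return False, "missing fields: " + ", ".join(missing)
    payment = payments.get(event["payment_id"])
    if payment is None:
        return False, "unknown payment"
    if event["currency"] != payment.currency:
        return False, "currency mismatch"
    # Assumption: a refund larger than the original payment counts as a mismatch.
    if event["amount"] > payment.amount:
        return False, "amount mismatch"
    if idempotency_key(event) in processed:
        return False, "duplicate ignored"
    return True, "ok"
```

In this sketch the caller adds the key to processed only after the run completes, so a run that fails midway can still be retried with the same event.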

3) AI-assisted triage (suggestion only)

  • If validation passes but states disagree, have AI classify the case:
    • “Webhook duplicate”
    • “Out-of-order event”
    • “Ledger write failed”
    • “PSP says failed; customer expects refund”
    • “High-risk: possible double-refund exposure”
  • Output is a recommendation + confidence, not an action.
  • Keep the AI output in the run log for review.

4) Approvals (human or policy-based)

  • Policy-based auto-approve if all are true:
    • amount <= $50
    • customer is low-risk
    • refund is already marked succeeded at PSP
    • ledger is missing only the final “refund_succeeded” entry
  • Human approval required if any are true:
    • amount > $50
    • customer flagged
    • AI confidence below threshold
    • action would initiate a new refund (not just ledger correction)
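One way to express the approval rules above as a single routing function is sketched below. The Case fields, the 0.8 confidence threshold, and the status string "succeeded" are assumptions for illustration; the $50 limit comes from the policy above, written in cents. Any case that matches neither rule set falls back to human approval, which is a fail-closed choice made for this sketch.

```python
from dataclasses import dataclass

AUTO_APPROVE_LIMIT_CENTS = 5000   # $50, from the policy above
CONFIDENCE_THRESHOLD = 0.8        # hypothetical threshold; tune for your own risk appetite


@dataclass(frozen=True)
class Case:
    """Hypothetical summary of a reconciliation case after validation and triage."""
    amount_cents: int
    customer_low_risk: bool
    customer_flagged: bool
    psp_status: str
    ledger_missing_only_final_entry: bool
    ai_confidence: float
    initiates_new_refund: bool


def approval_route(case: Case) -> str:
    """Return "auto_approve" or "human_approval" for a case."""
    needs_human = (
        case.amount_cents > AUTO_APPROVE_LIMIT_CENTS
        or case.customer_flagged
        or case.ai_confidence < CONFIDENCE_THRESHOLD
        or case.initiates_new_refund
    )
    if needs_human:
        return "human_approval"
    auto_ok = (
        case.customer_low_risk
        and case.psp_status == "succeeded"
        and case.ledger_missing_only_final_entry
    )
    # Fail closed: anything that does not clearly qualify goes to a human.
    return "auto_approve" if auto_ok else "human_approval"
```

Note that the AI confidence only ever pushes a case toward human review; it never widens what can be auto-approved.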

5) Deterministic execution (the important part)

Autom Mate executes only pre-defined steps:

  • Step A (read): Fetch refund status from PSP
    • Integration label: REST/HTTP/Webhook action (PSP API)
  • Step B (read): Fetch internal ledger state
    • Integration label: REST/HTTP/Webhook action (ledger service)
  • Step C (write): If PSP=SUCCEEDED and ledger missing entry → write a compensating ledger event refund_succeeded (no money movement)
    • Integration label: REST/HTTP/Webhook action
  • Step D (write): If PSP=FAILED but ledger shows succeeded → open an exception case + block downstream “refund complete” comms until resolved
    • Integration label: REST/HTTP/Webhook action (case system / ticket)

This keeps execution deterministic: Autom Mate only performs explicit, bounded actions you designed, with retries and error handling.
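To make the “pre-defined steps only” idea concrete, here is a minimal sketch of Steps C and D as a pure decision function. The Action enum, the status strings, and the list-of-event-names ledger representation are hypothetical; in practice the inputs would come from the two read steps (A and B).

```python
from enum import Enum


class Action(Enum):
    """The only outcomes the executor is allowed to choose from."""
    WRITE_REFUND_SUCCEEDED = "write_refund_succeeded"   # Step C: ledger entry, no money movement
    OPEN_EXCEPTION_CASE = "open_exception_case"         # Step D: block "refund complete" comms
    NO_OP = "no_op"


def plan_action(psp_status: str, ledger_events: list) -> Action:
    """Map (PSP state, ledger state) to exactly one bounded action."""
    ledger_succeeded = "refund_succeeded" in ledger_events
    if psp_status == "SUCCEEDED" and not ledger_succeeded:
        return Action.WRITE_REFUND_SUCCEEDED
    if psp_status == "FAILED" and ledger_succeeded:
        return Action.OPEN_EXCEPTION_CASE
    # States already agree, or the combination is not one we designed for.
    return Action.NO_OP
```

Because the function has no side effects and a closed set of outputs, the same inputs always produce the same plan, which is what makes replays and audits straightforward.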

6) Logging

Autom Mate run logs capture:

  • trigger payload
  • validation results
  • AI recommendation + confidence
  • approval decision
  • every API call + response summary
  • final state transition

This supports auditability and post-incident review.
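If you also mirror the run log into your own store, a record shaped like the list above might look like the following. This is an illustrative sketch only: the RunLog class and its field names are hypothetical and do not describe Autom Mate's actual log format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunLog:
    """Hypothetical audit record covering the items listed above."""
    trigger_payload: dict
    validation_result: str
    ai_recommendation: str
    ai_confidence: float
    approval_decision: str
    api_calls: list = field(default_factory=list)
    final_state: str = ""

    def record_call(self, name: str, status: int) -> None:
        """Append a response summary for one API call."""
        self.api_calls.append({"name": name, "status": status})

    def to_json(self) -> str:
        """Serialize with sorted keys so identical runs produce identical text."""
        return json.dumps(asdict(self), sort_keys=True)
```

Storing the AI recommendation next to the approval decision makes it easy to review, after the fact, how often reviewers agreed with the suggestion.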

7) Exception handling + rollback

Use error handling to:

  • retry transient PSP/ledger errors with backoff
  • route to an “Ops review” queue when retries exhausted
  • prevent partial completion (e.g., if ledger write fails, do not send customer notification)

If a compensating ledger write was made incorrectly (rare, but possible), rollback is a new compensating entry (append-only ledger discipline), not deletion.
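The exception-handling rules above could look roughly like this in code. It is a sketch under stated assumptions: TransientError, with_retries, reconcile, and reverse_entry are hypothetical names, the ledger is a plain list standing in for an append-only store, and the backoff schedule (0.5s doubling, 4 attempts) is an arbitrary example.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for retryable PSP/ledger failures (timeouts, 5xx)."""


def with_retries(call, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run call(), retrying TransientError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return call()
        except TransientError:
            if attempt == attempts - 1:
                raise  # retries exhausted; let the caller route to Ops review
            sleep(base_delay * 2 ** attempt)


def reconcile(write_ledger, notify_customer, ops_queue, sleep=time.sleep):
    """Notify the customer only after the ledger write has succeeded."""
    try:
        with_retries(write_ledger, sleep=sleep)
    except TransientError as exc:
        ops_queue.append(f"ledger write failed: {exc}")
        return "ops_review"
    notify_customer()
    return "completed"


def reverse_entry(ledger, entry_id):
    """Roll back by appending a reversal entry; the original is never deleted."""
    original = next(e for e in ledger if e["entry_id"] == entry_id)
    reversal = {
        "entry_id": f"{entry_id}:reversal",
        "type": "reversal",
        "reverses": entry_id,
        "refund_id": original["refund_id"],
    }
    ledger.append(reversal)
    return reversal
```

The ordering in reconcile is the point: the notification sits after the write, so a failed write can never produce a “refund complete” message.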

Why this is a real fintech ops issue (and why governance matters)

Payment systems are asynchronous. Webhooks can be duplicated, delayed, or arrive out of order, and retries can cause accidental double-actions if you don’t enforce idempotency and deterministic execution. (dev.to)


Mini examples

Example 1: Duplicate webhook, safe no-op

  • Webhook arrives twice: refund_id=rf_123, status succeeded.
  • Autom Mate checks idempotency key → second event is already processed.
  • Result: no duplicate ledger write, run is logged as “duplicate ignored”.

Example 2: Ledger missing final state, auto-fix under policy

  • PSP shows refund succeeded 20 minutes ago.
  • Ledger shows refund_initiated but no refund_succeeded.
  • Amount is $18.50, low-risk customer.
  • Policy auto-approves → Autom Mate writes compensating ledger event and closes the exception.

Discussion questions

  • Where do you draw the line between auto-approve vs human approval for refund corrections (amount threshold, customer risk, processor type)?
  • Do you prefer an append-only compensating ledger approach, or do you allow “state overwrite” in your ledger service (and how do you audit it)?