
When Flows Break at Scale: Debugging a Governor Limit Crisis on a 500-User Org

How a 2 AM production outage at a financial services firm revealed cascading Flow recursion, and the bulkification patterns that fixed it for good.

Published November 2025
Read Time 9 min
Performance · Governor Limits · Apex

The 2 AM Call Nobody Wants

My phone lit up at 2:14 AM on a Tuesday. The message from the VP of Operations was terse: "Production is down. Nothing is saving. Users in Asia-Pacific can't process trades." For a financial services firm that handles cross-border transactions across three time zones, "nothing is saving" translates directly into regulatory exposure and lost revenue. Every minute mattered.

I pulled up the org's monitoring dashboard and saw the wreckage immediately. The error logs were flooded with a single exception: System.LimitException: Too many DML statements: 151. But this wasn't a new deployment gone wrong. The same automation had been running for months. What changed was volume: the firm had just onboarded 80 new users as part of a regional expansion, pushing the org from roughly 420 active users to just over 500. A batch process that ran nightly to reconcile transaction records had tipped over the edge.

The cruel thing about governor limits is that they don't degrade gracefully. You don't get a warning at 140 DML statements. You get silence at 150 and a brick wall at 151. Your transaction doesn't slow down—it dies. And in this case, it was taking a nightly reconciliation job with it, which meant downstream Flows that depended on those reconciled records were also failing in a cascade that was filling the debug logs faster than I could read them.

Triage: Isolating the Bottleneck

The first rule of production triage is: stop the bleeding before you diagnose the disease. I immediately disabled the scheduled batch job to prevent further failures from stacking up, then sent a status update to the operations team so the APAC desk knew the timeline. Only then did I open the debug logs.

Salesforce debug logs can be overwhelming. A single transaction can generate tens of thousands of lines. The trick is knowing what to filter for. I set the log levels to APEX_CODE=FINEST and WORKFLOW=FINER, then re-ran the batch process against a subset of 50 records in a sandbox we'd refreshed the previous week. I was looking for the DML_BEGIN and DML_END markers, and specifically where the DML operations were firing relative to the loop structures.

Within minutes, the pattern was obvious. The debug log showed something like this:

15:42:03.112 (112847)|DML_BEGIN|[45]|Op:Update|Type:Transaction__c|Rows:1
15:42:03.118 (118293)|DML_END|[45]
15:42:03.119 (119001)|FLOW_START_INTERVIEWS_BEGIN|1
15:42:03.125 (125440)|FLOW_START_INTERVIEW_BEGIN|Transaction_Reconciliation_Update
15:42:03.198 (198712)|DML_BEGIN|[Flow]|Op:Update|Type:Account|Rows:1
15:42:03.205 (205118)|DML_END|[Flow]
15:42:03.206 (206331)|FLOW_START_INTERVIEWS_BEGIN|1
15:42:03.211 (211894)|FLOW_START_INTERVIEW_BEGIN|Account_Risk_Score_Recalc
15:42:03.289 (289100)|DML_BEGIN|[Flow]|Op:Update|Type:Risk_Assessment__c|Rows:1
15:42:03.295 (295221)|DML_END|[Flow]
15:42:03.296 (296017)|FLOW_START_INTERVIEWS_BEGIN|1
15:42:03.301 (301445)|FLOW_START_INTERVIEW_BEGIN|Risk_Assessment_Notification
15:42:03.378 (378200)|DML_BEGIN|[Flow]|Op:Insert|Type:Task|Rows:1
...

Each DML operation was triggering a record-change Flow, which performed its own DML, which triggered another Flow, which performed another DML. A single record update in the batch job was generating four or five DML operations through this chain. Multiply that by 200 records in the batch scope, and you're looking at 800-1000 DML statements in a single transaction—far beyond the 150-statement limit.

The Hidden Recursion: Flow-Triggered Flows and Cascading DML

Here is the part that makes this story worth telling. The automation wasn't built by one person. It was built by three different admins over eighteen months. Each Flow, in isolation, was perfectly reasonable:

Flow 1: Transaction Reconciliation Update. When a Transaction__c record's status changed to "Reconciled," update the parent Account's Last_Reconciliation_Date__c.

Flow 2: Account Risk Score Recalculation. When an Account's Last_Reconciliation_Date__c changed, recalculate the risk score and update the related Risk_Assessment__c record.

Flow 3: Risk Assessment Notification. When a Risk_Assessment__c record's score crossed a threshold, create a Task for the compliance officer.

None of these Flows were wrong. Each one followed Salesforce best practices in isolation. But composed together, they formed a recursion chain where a single record update cascaded into four separate DML operations across three objects. The original batch job that worked fine with 50 records per batch scope was now processing 200 records (a previous admin had increased the scope size to "improve performance"), and 200 multiplied by 4 DML operations per record equals 800—well past the 150-statement governor limit.

This is the fundamental architectural problem with declarative automation at scale. Flows are incredibly powerful, but they lack the visibility that code provides. When you write an Apex trigger, you can see every callout, every DML statement, every loop in a single file. Flows distribute that logic across multiple automation components, and the interaction effects only reveal themselves under load.

The Fix: Refactoring for Bulkification

The immediate fix was straightforward: reduce the batch scope back to 50 records. That got production running again within the hour. But that was a band-aid. The real fix required rethinking the entire automation chain.
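
For context, the scope size is just the second argument to Database.executeBatch, so a change like this is typically a one-liner wherever the job is scheduled (the batch class name below is a placeholder, not the firm's actual code):

// Placeholder class name standing in for the nightly reconciliation job.
// The second argument caps how many records each execute() call,
// and therefore each transaction, receives.
Database.executeBatch(new TransactionReconciliationBatch(), 50);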

The approach I took was to replace the three cascading Flows with a single, bulkified Apex trigger on Transaction__c that handled all downstream updates in a controlled manner. Here's the before-and-after comparison that illustrates the difference.

Before: The naive trigger pattern (effectively what the Flows were doing per-record)

// ANTI-PATTERN: Per-record DML inside a loop
trigger TransactionTrigger on Transaction__c (after update) {
    for (Transaction__c txn : Trigger.new) {
        Transaction__c oldTxn = Trigger.oldMap.get(txn.Id);

        if (txn.Status__c == 'Reconciled' &&
            oldTxn.Status__c != 'Reconciled') {

            // DML #1: Update parent Account (PER RECORD!)
            Account acc = [SELECT Id, Last_Reconciliation_Date__c
                           FROM Account WHERE Id = :txn.Account__c];
            acc.Last_Reconciliation_Date__c = Date.today();
            update acc;

            // DML #2: Recalculate risk score (PER RECORD!)
            Risk_Assessment__c ra = [SELECT Id, Score__c
                                     FROM Risk_Assessment__c
                                     WHERE Account__c = :acc.Id
                                     LIMIT 1];
            ra.Score__c = RiskCalculator.recalculate(acc.Id);
            update ra;

            // DML #3: Create compliance task if threshold crossed
            if (ra.Score__c > 75) {
                insert new Task(
                    Subject = 'Review High Risk Account',
                    WhatId = acc.Id,
                    OwnerId = ComplianceConfig.getOwnerId()
                );
            }
        }
    }
}
// Result: up to 3 DML statements and 2 SOQL queries PER RECORD = 600 DML for 200 records

After: Bulkified trigger with collected DML

trigger TransactionTrigger on Transaction__c (after update) {
    Set<Id> reconciledAccountIds = new Set<Id>();

    // Step 1: Collect — zero DML, zero SOQL
    for (Transaction__c txn : Trigger.new) {
        Transaction__c oldTxn = Trigger.oldMap.get(txn.Id);
        if (txn.Status__c == 'Reconciled' &&
            oldTxn.Status__c != 'Reconciled') {
            reconciledAccountIds.add(txn.Account__c);
        }
    }

    if (reconciledAccountIds.isEmpty()) return;

    // Step 2: Query in bulk — 1 SOQL
    Map<Id, Account> accounts = new Map<Id, Account>(
        [SELECT Id, Last_Reconciliation_Date__c
         FROM Account WHERE Id IN :reconciledAccountIds]
    );

    // Step 3: Prepare Account updates
    for (Account acc : accounts.values()) {
        acc.Last_Reconciliation_Date__c = Date.today();
    }
    update accounts.values(); // DML #1 — single bulk operation

    // Step 4: Query risk assessments in bulk — 1 SOQL
    List<Risk_Assessment__c> assessments = [
        SELECT Id, Score__c, Account__c
        FROM Risk_Assessment__c
        WHERE Account__c IN :reconciledAccountIds
    ];

    // Step 5: Recalculate scores, collect tasks
    List<Task> tasksToInsert = new List<Task>();
    for (Risk_Assessment__c ra : assessments) {
        ra.Score__c = RiskCalculator.recalculate(ra.Account__c);
        if (ra.Score__c > 75) {
            tasksToInsert.add(new Task(
                Subject = 'Review High Risk Account',
                WhatId = ra.Account__c,
                OwnerId = ComplianceConfig.getOwnerId()
            ));
        }
    }
    update assessments; // DML #2 — single bulk operation

    if (!tasksToInsert.isEmpty()) {
        insert tasksToInsert; // DML #3 — single bulk operation
    }
}
// Result: Exactly 3 DML statements TOTAL regardless of batch size

The difference is stark. The naive pattern executes up to 3 DML statements per record, scaling linearly with volume. The bulkified version executes exactly 3 DML operations per trigger invocation, whether that invocation covers 1 record or the full 200-record chunk. That's the entire philosophy of bulkification in a single example: collect first, operate once.

Key Takeaway

Bulkification is not an optimization—it is a survival strategy. On the Salesforce platform, any automation that performs DML inside a loop is a ticking time bomb. It will work in your sandbox with 5 test records. It will work in production with 50 users. And it will detonate the moment volume crosses a threshold you never tested for.

The Governor Limit Reference You Actually Need

Salesforce publishes a comprehensive list of governor limits in their documentation, but in my experience most architects only internalize the limits they've personally hit. Here are the ones that matter most in real-world, high-volume orgs—the limits I keep pinned to my desk:

DML Statements: 150 per transaction. This is the one that bit us. Every insert, update, delete, or upsert call counts as one statement, regardless of how many records it includes. update listOf200Records counts as 1. Two hundred individual update singleRecord calls count as 200. This distinction is the entire basis of bulkification.
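
If you want to watch that counter yourself, the Limits class exposes it directly. A quick anonymous Apex sketch (run it in a sandbox, since it performs real updates; the object choice is arbitrary):

// One bulk statement versus one statement per record, as reported by Limits
List<Account> accts = [SELECT Id FROM Account LIMIT 10];

update accts;                                  // a single statement, regardless of row count
System.debug(Limits.getDmlStatements());       // 1

for (Account a : accts) {
    update a;                                  // one statement per iteration
}
System.debug(Limits.getDmlStatements());       // 11
System.debug(Limits.getLimitDmlStatements());  // 150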

SOQL Queries: 100 per synchronous transaction, 200 for asynchronous. Queries inside loops are the classic violation. But what catches people off guard is that Flow queries count against this limit too. A Flow with a Get Records element inside a loop will burn through this limit just as fast as a SOQL query inside a for-loop in Apex.

Total records retrieved by SOQL: 50,000 per transaction. This one is subtle. The limit is cumulative across every query in the transaction, so even a single query that returns more than 50,000 rows will hit it. For orgs with large data volumes, this means your queries must be selective. Indexed fields in WHERE clauses are not optional; they are mandatory.

CPU Time: 10,000 ms synchronous, 60,000 ms asynchronous. This is the governor limit that reveals algorithmic inefficiency. If your trigger works correctly but times out, you likely have an O(n^2) pattern hiding somewhere—often a nested loop comparing two lists that should be using a Map for lookups.
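
The fix is usually mechanical: build a Map keyed on the field you are matching, then look records up in constant time instead of rescanning a list. A minimal sketch, assuming you already hold a list of transactions and their related accounts in memory:

// O(n^2): for every transaction, scan the entire account list
for (Transaction__c txn : txns) {
    for (Account acc : accounts) {
        if (acc.Id == txn.Account__c) {
            // ... work with the matching account
        }
    }
}

// O(n): build the map once, then look up by Id
Map<Id, Account> accountsById = new Map<Id, Account>(accounts);
for (Transaction__c txn : txns) {
    Account acc = accountsById.get(txn.Account__c);
    if (acc != null) {
        // ... same work, without the inner loop
    }
}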

Heap Size: 6 MB synchronous, 12 MB asynchronous. This one hits hardest when you're serializing large data structures or processing attachment/file bodies. If you're building a complex in-memory data structure for processing, you need to think about whether you can process in chunks instead.

Future Calls: 50 per transaction. And each future method counts individually. If you're making callouts to external systems inside automation, you need to batch those callouts rather than making one per record.
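
A common workaround is to collect the record IDs in the trigger and hand the whole set to a single future method that makes one consolidated callout. A sketch under that assumption; the class name, Named Credential, and payload format are placeholders:

public class TradeSyncService {
    // One future call and one callout for the whole trigger context,
    // instead of one per record.
    @future(callout=true)
    public static void syncTransactions(Set<Id> transactionIds) {
        List<Transaction__c> txns = [
            SELECT Id, Status__c
            FROM Transaction__c
            WHERE Id IN :transactionIds
        ];

        HttpRequest req = new HttpRequest();
        req.setEndpoint('callout:Reconciliation_API/transactions/sync');
        req.setMethod('POST');
        req.setHeader('Content-Type', 'application/json');
        req.setBody(JSON.serialize(txns));

        HttpResponse res = new Http().send(req);
        // Inspect res.getStatusCode() and handle retries as the integration requires.
    }
}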

Key Takeaway

Governor limits are not arbitrary restrictions—they protect every tenant on the shared Salesforce infrastructure. Designing around them is not about working within constraints; it is about building automation that can scale by an order of magnitude without architectural changes. If your automation works today but breaks at 2x volume, it was never production-ready.

Beyond the Fix: Preventing the Next Crisis

After deploying the refactored trigger (behind a feature flag, with the old Flows deactivated but not deleted), I worked with the team to put safeguards in place so we would never get another 2 AM call for the same class of problem.

Limit consumption monitoring in every test class. Every Apex test class now includes assertions against Limits.getDmlStatements() and Limits.getQueries(). We don't just assert that the code produces the right output—we assert that it does so within a defined performance envelope. If a future developer adds a query inside a loop, the test fails not because the result is wrong, but because the limit consumption exceeds the threshold.
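
In practice the assertion looks something like this; the factory method, field values, and the budget of five statements are placeholders for whatever your team standardizes on:

@isTest
private class TransactionTriggerLimitTest {

    @isTest
    static void reconciliationStaysInsideLimitBudget() {
        // Setup DML here does not count against the limits measured below,
        // because Test.startTest() resets the governor context.
        List<Transaction__c> txns = TestDataFactory.createTransactions(200);

        Test.startTest();
        for (Transaction__c txn : txns) {
            txn.Status__c = 'Reconciled';
        }
        update txns;

        // Assert the performance envelope, not just the functional result.
        System.assert(Limits.getDmlStatements() <= 5,
            'DML budget exceeded: ' + Limits.getDmlStatements());
        System.assert(Limits.getQueries() <= 5,
            'SOQL budget exceeded: ' + Limits.getQueries());
        Test.stopTest();
    }
}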

Volume testing as a deployment gate. We introduced a test data factory that generates records at 200x the expected single-transaction volume. Every trigger and Flow is tested with at least 200 records in the trigger context. This mimics the maximum batch size for Data Loader and API bulk operations and flushes out bulkification failures before they reach production.
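
The factory itself is unglamorous. Here's a trimmed-down sketch using this org's objects; the field values and the single shared parent Account are simplifications:

@isTest
public class TestDataFactory {

    // Builds one parent Account, its Risk_Assessment__c, and N child
    // transactions, so every volume test starts from the same data shape.
    public static List<Transaction__c> createTransactions(Integer count) {
        Account acc = new Account(Name = 'Volume Test Account');
        insert acc;

        insert new Risk_Assessment__c(Account__c = acc.Id, Score__c = 10);

        List<Transaction__c> txns = new List<Transaction__c>();
        for (Integer i = 0; i < count; i++) {
            txns.add(new Transaction__c(
                Account__c = acc.Id,
                Status__c  = 'Pending'
            ));
        }
        insert txns;
        return txns;
    }
}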

Automation inventory. We built a simple spreadsheet (later migrated to a custom metadata type) that catalogs every Flow, Process Builder, trigger, and workflow rule in the org, along with the objects they fire on and the DML operations they perform. When a new automation is proposed, we can trace the potential interaction chain before building anything. This is the kind of documentation that feels bureaucratic until it saves you from a cascading failure at 2 AM.

Flow recursion guards. For the Flows that remained active, we implemented custom Apex actions that check a static variable before proceeding. If the Flow has already fired in the current transaction context, the guard short-circuits the execution. This is the same pattern experienced Apex developers have used with trigger handlers for years, but it requires deliberate implementation in the Flow world.
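
The guard itself is a small invocable Apex class; the Flow calls it as its first element and takes an early exit whenever the action reports that the record was already handled in this transaction. A minimal sketch, assuming the Flow passes in the triggering record's Id:

public class FlowRecursionGuard {
    // Static state lives for the duration of the transaction, which is
    // exactly the scope the guard needs to cover.
    private static Set<Id> processedIds = new Set<Id>();

    @InvocableMethod(label='Should Flow Run?')
    public static List<Boolean> shouldRun(List<Id> recordIds) {
        List<Boolean> results = new List<Boolean>();
        for (Id recordId : recordIds) {
            if (processedIds.contains(recordId)) {
                results.add(false);   // already handled: short-circuit the Flow
            } else {
                processedIds.add(recordId);
                results.add(true);    // first pass: let the Flow proceed
            }
        }
        return results;
    }
}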

The Architecture Lesson

The deeper lesson from this incident isn't about Flows versus Apex. Flows are an excellent tool. The admin who built Flow 1 did exactly the right thing for a single-record use case. The problem was systemic: three independently correct automations composed into a system that nobody had analyzed as a whole.

In traditional software engineering, we call this emergent complexity. Each component passes its unit tests. The integration failures only appear at scale, under load, in production, at 2 AM. The Salesforce platform makes it remarkably easy to build powerful automation declaratively—and that same ease makes it possible for an org to accumulate dozens of interacting automations without anyone holding the architectural blueprint.

If you're a Solution Architect working on a Salesforce org with more than 200 users, I'd recommend three things. First, audit your automation inventory quarterly. Know every Flow, trigger, and Process Builder that fires on each object. Second, test every automation with at least 200 records in the trigger context, because that is the number that separates working-in-sandbox from surviving-in-production. Third, treat governor limits not as ceilings to avoid but as design constraints that inform your architecture from the start.

The org is stable now. The batch job processes 2,000 records nightly with a comfortable margin on every governor limit. The APAC team hasn't had a disruption since. And I keep my phone on vibrate at night—just in case.
