fix(RecordExpander): replace deepcopy with shared reference to prevent O(N²) memory amplification#1009
Conversation
…t O(N²) memory amplification When remain_original_record is enabled, RecordExpander previously deep-copied the entire parent record for every expanded child. For records with large nested arrays (e.g., Stripe invoice events with hundreds of line items), this caused O(N²) memory usage — each of N children got a full copy of the parent containing all N items. Replace copy.deepcopy(parent_record) with a direct reference. The original_record field is read-only in downstream transformations and then removed, making deep copying unnecessary. All expanded siblings now share the same parent reference. Co-Authored-By: bot_apk <apk@cognition.ai>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. 💡 Show Tips and TricksTesting This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1777635782-fix-record-expander-deepcopy#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1777635782-fix-record-expander-deepcopyPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
|
↩️ Triggering Reason: CDK PR is linked to a connector oncall issue, CI is passing or already has local/full pytest evidence, and no AI review marker is present. |
|
Correction: attempted to trigger the CDK This PR remains marked for human review/next-step decision. |
Summary
Replaces
copy.deepcopy(parent_record)with a direct shared reference inRecordExpanderwhenremain_original_record=True.Problem: When expanding records from a nested array (e.g., Stripe invoice events with many line items), the previous implementation deep-copied the entire parent record for every expanded child. For a parent with N children, this created N full copies of the parent (which itself contains all N children), resulting in O(N²) memory usage.
This appears to cause sync stalling for high-volume connectors like source-stripe, where invoice events can contain hundreds of line items. With 10 concurrent workers, the memory amplification is further compounded.
Fix: Since
original_recordis only read from in downstream transformations and then removed (viaRemoveFields), deep-copying is unnecessary. All expanded siblings now share the same parent reference, reducing memory from O(N²) to O(N).Affected code:
airbyte_cdk/sources/declarative/expanders/record_expander.py— both_apply_parent_context(dict items) and the inline non-dict branch.Declarative-First Evaluation
This fix modifies the CDK's
RecordExpandercomponent itself. No declarative YAML alternative exists — theRecordExpanderis the declarative component, and the performance issue is in its Python implementation.Test Coverage
test_record_expander_shares_parent_reference_for_dict_items— verifies all expanded dict records share the same parent reference (identity check withis)test_record_expander_shares_parent_reference_for_non_dict_items— same for non-dict itemstest_record_expander_large_expansion_memory_efficient— simulates a Stripe-like scenario with 500 line items and verifies shared referencesAll 41 tests in
test_dpath_extractor.pypass, including 17 pre-existing RecordExpander tests.Review & Testing Checklist for Human
original_recordmutates it in-place before it is removed. The shared reference is safe as long asoriginal_recordis treated as read-only.invoice_line_itemsandsubscription_itemsincremental streams work correctly after this change (they read fromoriginal_recordthen remove it).Notes
Related to https://github.com/airbytehq/oncall/issues/12136 (Stripe sync stalling after April 15).
The
RecordExpandercomponent was introduced in v7.17.0 via #859. Theremain_original_recordfeature was first used by source-stripe v6.0.0 (airbytehq/airbyte#76095) forinvoice_line_itemsandsubscription_itemsincremental streams.Link to Devin session: https://app.devin.ai/sessions/3e3a626cafe64a60a2ee62fa119d7ce5