fix: validate empty CSV column names and improve mismatch error messages #1010
devin-ai-integration[bot] wants to merge 1 commit into
Co-Authored-By: bot_apk <apk@cognition.ai>
👋 Greetings, Airbyte Team Member!
Here are some helpful tips and reminders for your convenience.

Testing This CDK Version
You can test this version of the CDK using the following:

    # Run the CLI from this branch:
    uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1777652889-fix-csv-empty-column-validation#egg=airbyte-python-cdk[dev]' --help

    # Update a connector to use the CDK from this branch ref:
    cd airbyte-integrations/connectors/source-example
    poe use-cdk-branch devin/1777652889-fix-csv-empty-column-validation
↩️ Triggering Reason: CDK PR is linked to a connector oncall issue, CI is passing or already has local/full pytest evidence, and no AI review marker is present.

Correction: attempted to trigger the CDK This PR remains marked for human review/next-step decision.
Summary
Resolves https://github.com/airbytehq/oncall/issues/12144:
When a CSV file has trailing empty columns (e.g., `col1,col2,col3,,,`), the CDK's `_get_headers()` method silently accepted empty-string column names. These propagated into the discovered schema and catalog, and when the platform deserialized the catalog, Jackson/Kotlin failed on `io.airbyte.config.Field.name` being non-nullable, producing an opaque "Init container error" with `KotlinInvalidNullException`.

This PR fixes the issue with three targeted changes:
1. Validate empty column names in `_get_headers()`: after reading headers from the CSV file, check for empty or whitespace-only column names and raise `AirbyteTracedException` with `failure_type=config_error`. This surfaces the problem during discover so the customer gets a clear, actionable message.
2. Improve existing mismatch error messages: replaced confusing internal-facing messages that referenced "resolved to `None`" with clear user-facing messages:
   - `MISMATCHED_COLUMNS`: "CSV data row contains more columns than the header row defines."
   - `MISMATCHED_ROWS`: "CSV data row contains fewer columns than the header row defines."
3. Preserve specific error context in `parse_records()`: previously, `parse_records()` caught `RecordParseError` and re-raised with a generic `ERROR_PARSING_RECORD` message, discarding the specific column mismatch detail. It now passes through the original exception's message.

Declarative-First Evaluation
N/A. This fix targets the file-based CDK (`csv_parser.py`), not a declarative connector manifest.

Breaking Change Evaluation
Not breaking. No schema, spec, stream, or state changes. This adds a validation guard that raises a `config_error` during discover for malformed CSV headers (which previously caused an opaque platform crash), and improves error message clarity.

Test Coverage
Added 6 new test cases in `unit_tests/sources/file_based/file_types/test_csv_parser.py`:

- `test_get_headers_raises_on_empty_column_names` (parametrized, 4 cases: trailing, middle, leading empty columns, whitespace-only)
- `test_get_headers_accepts_valid_headers`: confirms valid headers still work
- `test_read_data_raises_on_empty_column_names`: end-to-end through `read_data()`
- `test_parse_records_preserves_mismatch_error_detail`: confirms the re-raised error preserves specific mismatch detail

All 60 tests pass locally.
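
For reviewers who want to see the header check in isolation, here is a minimal standalone sketch of the validation described in the Summary. It is not the code from `csv_parser.py`: `read_headers` and `HeaderConfigError` are hypothetical names, and `HeaderConfigError` merely stands in for an `AirbyteTracedException` raised with a config-error failure type. It uses only the standard-library `csv` module.

```python
import csv
import io
from typing import List


class HeaderConfigError(ValueError):
    """Hypothetical stand-in for a config-error style exception."""


def read_headers(csv_text: str) -> List[str]:
    """Return the header row, rejecting empty or whitespace-only column names."""
    reader = csv.reader(io.StringIO(csv_text))
    # An empty input yields no header row; treat that as an empty header list.
    headers = next(reader, [])
    # Collect the zero-based positions of names that are blank after stripping.
    empty_positions = [i for i, name in enumerate(headers) if not name.strip()]
    if empty_positions:
        raise HeaderConfigError(
            "CSV header row contains empty column names at positions "
            f"{empty_positions}"
        )
    return headers
```

Under these assumptions, `read_headers("col1,col2,col3,,,\n")` raises `HeaderConfigError`, as do headers with a leading, middle, or whitespace-only empty name, while `read_headers("col1,col2,col3\n")` returns the three names unchanged.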
Review & Testing Checklist for Human
- `_get_headers()` correctly identifies all edge cases (trailing, middle, leading, whitespace-only)
- `exceptions.py` messages are clear for end-users
- `parse_records()` error re-raise preserves the original detail without losing context

Notes
`ignore_errors_on_fields_mismatch` in the same file

Link to Devin session: https://app.devin.ai/sessions/c0ac93b0ed1a401ba346b7fcc93bc41b