Improve data validation, fetch resilience, and database safety#98
Improve data validation, fetch resilience, and database safety#98
Conversation
…ng, CLI Wire up the previously-unregistered EntityHistory and AddressHistoryJunction repositories so EntityTemporalRepo no longer throws at construction. Switch SQLite to WAL + synchronous=NORMAL so a crash mid-write can no longer corrupt the database. Add closeDb()/closePgPool() to the CLI exit path. Strip the angle brackets from the SEC User-Agent (some EDGAR endpoints 403 on RFC-5322 mailbox forms) and let deployers override it via SEC_USER_AGENT. Make the file-output cache durable: write to a unique tmp path then atomic rename, round-trip Blob → Buffer → Blob correctly, and only swallow ENOENT on read. Cache freshness comment now matches the actual comparison. Add an HTTP retry-with-Retry-After loop and a per-attempt timeout to SecFetchJob so transient 429/5xx and hung TCP connections no longer pin a queue slot. Honor the rate-limiter via the existing CompositeLimiter. Key Form._arrayPaths by the subclass constructor instead of `this.name`, removing a latent collision when two Form subclasses share a description. Centralize the ad-hoc `isBadPersonField` allowlists into one shared `bad-data.ts` registry, and add NaN-safe parseInt helpers in Form_C/Form_D storage. Use the actual US state set to derive country_code in Form 1-A instead of `length === 2`, which silently coerced GB/FR to "US". Implement `db reset --confirm` (delete-all + setupAllDatabases) and clean up the CLI: drop the unused `--concurrency` flag, validate CIK args, fix the inclusive-range math in FetchQuarterlyIndexRangeTask, restore the missing `date` field in FetchCompanyFactsTask's empty-CIK branch, drop the dead null-check in StoreCompanyFactsTask, log per-item failures in the fetchAndStore* glue, and move Sqlite.init() into the preAction hook so help/version don't load the native binding. InvestmentOfferingHistorySchema.accession_number is now 20 chars to match real SEC accession numbers (Postgres would have rejected the old 10). package.json: rename the misnamed `publish` script to `release`, point `bin` at `./dist/sec.js` so it ships with the published tarball, and move concurrently to devDependencies. Delete the leftover `test_any.ts` scratch file. https://claude.ai/code/session_01UEfSvXJcxWC1csqry7FWmJ
There was a problem hiding this comment.
Pull request overview
This PR strengthens SEC data ingestion by centralizing bad/placeholder-field validation, improving fetch robustness (retries + safer caching), and making database operations more durable and operationally manageable (WAL mode, reset/close helpers), along with several parsing and CLI fixes.
Changes:
- Centralized EDGAR “bad person field” detection into a shared module and updated form processors to use it.
- Added retry/backoff and per-attempt timeout handling for SEC fetch jobs; improved file-cache safety via atomic writes and stricter cache-miss error handling.
- Improved operational durability: SQLite WAL/NORMAL pragmas + DB close helper, new
db resetcommand wiring, safer parsing helpers across form processors.
Reviewed changes
Copilot reviewed 29 out of 30 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| test_any.ts | Removed obsolete debug/test file. |
| src/util/db.ts | Switch SQLite pragmas to WAL/NORMAL; add closeDb(). |
| src/types/edgar/bad-data.ts | New shared bad-data detection module and isBadPersonField(). |
| src/task/submissions/fetchAndStoreSubmission.ts | Add warning log on per-item pipeline failures. |
| src/task/index/FetchQuarterlyIndexRangeTask.ts | Fix quarter range enumeration to be inclusive and start-quarter anchored. |
| src/task/facts/StoreCompanyFactsTask.ts | Record “processed” even when facts array is empty. |
| src/task/facts/FetchCompanyFactsTask.ts | Preserve date on early-return path. |
| src/task/facts/fetchAndStoreCompanyFacts.ts | Add warning log on failures. |
| src/storage/investment-offering/InvestmentOfferingHistorySchema.ts | Increase accession number max length. |
| src/sec/forms/Form.ts | Use WeakMap keyed by constructor to avoid array-path collisions. |
| src/sec/forms/exempt-offerings/Form_D.storage.ts | Use shared bad-data detection; add safer numeric/CIK parsing helpers; fix amount parsing. |
| src/sec/forms/exempt-offerings/Form_C.storage.ts | Use shared bad-data detection; add safer portal CIK parsing. |
| src/sec/forms/exempt-offerings/Form_1_A.storage.ts | Add country code resolution logic for phone normalization. |
| src/sec.ts | Add shutdown cleanup (stop queues, close DB/PG pool); move SQLite init out. |
| src/fetch/SecFetchJob.ts | Add retry/backoff, Retry-After support, and per-attempt timeouts. |
| src/fetch/SecFetchFileOutputCache.ts | Atomic cache writes; Blob/Buffer handling fixes; don’t swallow non-ENOENT errors. |
| src/fetch/SecCachedFetchTask.ts | Ensure response_type is set consistently before cache usage. |
| src/config/TestingDI.ts | Register new in-memory repositories for tests. |
| src/config/setupAllDatabases.ts | Include setup for new repositories. |
| src/config/resetAllDatabases.ts | New helper to truncate all repositories (used by CLI reset). |
| src/config/DefaultDI.ts | Register new repositories in default DI. |
| src/config/Constants.ts | Make SEC User-Agent configurable via env var. |
| src/commands/index.ts | Move SQLite initialization into command preAction hook. |
| src/cli/groups/fetch.ts | Validate CIK CLI args via helper. |
| src/cli/groups/db.ts | Implement db reset --confirm using new reset helper. |
| src/cli/GlobalOptions.ts | Remove concurrency flag. |
| src/cli/GlobalOptions.test.ts | Update tests for removed concurrency flag. |
| src/cli/cli.integration.test.ts | Update help output expectations for removed concurrency flag. |
| package.json | Point bin to dist build output; move concurrently to devDependencies; rename publish script. |
| bun.lock | Lockfile updates due to dependency move/version resolution changes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const MAX_FETCH_ATTEMPTS = 4; | ||
| const INITIAL_BACKOFF_MS = 1_000; | ||
| const MAX_BACKOFF_MS = 30_000; |
There was a problem hiding this comment.
Fixed in e5bb21e — MAX_FETCH_ATTEMPTS, INITIAL_BACKOFF_MS, and MAX_BACKOFF_MS are now read from SEC_FETCH_MAX_ATTEMPTS / SEC_FETCH_INITIAL_BACKOFF_MS / SEC_FETCH_MAX_BACKOFF_MS via a shared readPositiveIntEnv helper that validates with /^\d+$/ and falls back to the defaults on missing/invalid input.
Generated by Claude Code
| const DEFAULT_TIMEOUT_MS = Number(process.env.SEC_FETCH_TIMEOUT_MS ?? 60_000); | ||
|
|
There was a problem hiding this comment.
Fixed in e5bb21e — DEFAULT_TIMEOUT_MS now goes through readPositiveIntEnv("SEC_FETCH_TIMEOUT_MS", 60_000), which requires the env var to match /^\d+$/ and falls back to the default otherwise. Non-numeric input no longer silently disables timeouts.
Generated by Claude Code
| const timeoutSignal = | ||
| DEFAULT_TIMEOUT_MS > 0 ? AbortSignal.timeout(DEFAULT_TIMEOUT_MS) : undefined; | ||
| const signal = combineSignals([context.signal, timeoutSignal]); |
There was a problem hiding this comment.
Fixed in e5bb21e — replaced AbortSignal.timeout() with a per-attempt AbortController + setTimeout, and clearTimeout runs in the finally so the timer is released on success or failure. Timers no longer accumulate across attempts.
Generated by Claude Code
|
|
||
| function parseCikSafely(raw: string | number | undefined | null): number { | ||
| if (raw === undefined || raw === null) return 0; | ||
| const parsed = Number.parseInt(String(raw).trim(), 10); |
There was a problem hiding this comment.
Fixed in e5bb21e — parseCikSafely in Form_D now requires the trimmed string to match /^\d+$/ before calling Number(), so "123abc" returns 0 instead of 123.
Generated by Claude Code
| process.exit(1); | ||
| } | ||
| throw e; |
There was a problem hiding this comment.
Fixed in e5bb21e — process.exit(1) is gone; the SecCliConfigurationError branch now sets process.exitCode = 1 and stores the error, the finally runs cleanup, and any non-config error is rethrown after cleanup so shutdown actually executes.
Generated by Claude Code
| async execute(input: Input, context: IJobExecuteContext): Promise<Output> { | ||
| let lastError: unknown; | ||
| for (let attempt = 0; attempt < MAX_FETCH_ATTEMPTS; attempt++) { | ||
| // Per-attempt timeout so a hung TCP connection cannot pin a queue slot | ||
| // forever; respects the caller's abort signal as well. |
There was a problem hiding this comment.
Added in e5bb21e — three new tests in SecFetchJob.test.ts using a loopback Bun.serve: 429 with Retry-After: 0 retries until success on the third attempt, 503 retries at least once before failing, and 404 fails fast on the first attempt. While diagnosing them I also tightened isRetriableError to detect workglow's RetryableJobError by name/retryable flag and honor its retryDate field for Retry-After (the prior instanceof Error check was unreliable across module boundaries).
Generated by Claude Code
| import { runCommand } from "../runCommand"; | ||
|
|
||
| function parseCikArg(value: string): number { | ||
| const parsed = Number.parseInt(value, 10); |
There was a problem hiding this comment.
Fixed in e5bb21e — parseCikArg now requires the trimmed argument to match /^\d+$/ before calling Number(), so "123abc" is rejected with the documented error rather than silently becoming 123.
Generated by Claude Code
| */ | ||
| function parseIntegerOrNull(raw: string | undefined | null): number | null { | ||
| if (raw === undefined || raw === null || raw === "") return null; | ||
| const parsed = Number.parseInt(String(raw).trim(), 10); |
There was a problem hiding this comment.
Fixed in e5bb21e — parseIntegerOrNull now requires the trimmed value to match /^-?\d+$/ before parsing, so "123abc" returns null instead of 123. Also widened the input type to string | number | null | undefined so number inputs (e.g. directly from EDGAR JSON) round-trip correctly.
Generated by Claude Code
| * missing/non-numeric. EDGAR occasionally emits non-digit cruft (whitespace, | ||
| * stray punctuation) that `parseInt` would silently turn into `NaN`. | ||
| */ | ||
| function parseCikSafely(raw: string | undefined | null): number { | ||
| if (!raw) return 0; | ||
| const parsed = Number.parseInt(String(raw).trim(), 10); |
There was a problem hiding this comment.
Fixed in e5bb21e — parseCikSafely in Form_C now requires the trimmed value to match /^\d+$/ before parsing, so "123abc" returns 0 instead of 123.
Generated by Claude Code
| import { US_STATE_CODE_ARRAY } from "../../../storage/address/AddressSchemaCodes"; | ||
| import { CompanyRepo } from "../../../storage/company/CompanyRepo"; | ||
| import { hasCompanyEnding } from "../../../storage/company/CompanyNormalization"; | ||
| import { PersonRepo } from "../../../storage/person/PersonRepo"; | ||
| import { PhoneRepo } from "../../../storage/phone/PhoneRepo"; | ||
|
|
||
| const US_STATE_CODE_SET = new Set<string>(US_STATE_CODE_ARRAY.map(([code]) => code)); | ||
|
|
||
| /** | ||
| * EDGAR's `stateOrCountry` field stores either a 2-char US state code (e.g. | ||
| * "NY") or a 2-char ISO country code (e.g. "GB"). Both are 2 characters wide, | ||
| * so the country can only be inferred from set membership, not from length. | ||
| */ | ||
| function resolveCountryCode(stateOrCountry: string | undefined | null): string | undefined { | ||
| if (!stateOrCountry) return undefined; | ||
| return US_STATE_CODE_SET.has(stateOrCountry) ? "US" : stateOrCountry; |
There was a problem hiding this comment.
Fixed in e5bb21e — resolveCountryCode now trims+uppercases the input, then resolves through COUNTRY_STATE_CODE_ARRAY: US states → "US", SEC codes → ISO via the SEC_CODE_TO_ISO map (e.g. "B3" → "AL", "X0" → "GB"), already-ISO inputs pass through, and unknown codes return undefined so PhoneRepo falls back to its own defaults instead of receiving a SEC code as a regionCode.
Generated by Claude Code
Tighten CIK/integer parsers in Form_C, Form_D, and the CLI fetch group to require /^\d+$/ — parseInt would silently accept "123abc" as 123 and store plausible-but-wrong CIKs. Make SecFetchJob's retry/timeout knobs env-driven (SEC_FETCH_MAX_ATTEMPTS, SEC_FETCH_INITIAL_BACKOFF_MS, SEC_FETCH_MAX_BACKOFF_MS, SEC_FETCH_TIMEOUT_MS) with strict validation that falls back to defaults on missing/invalid values. Replace AbortSignal.timeout() with a cancellable AbortController + setTimeout/clearTimeout pair so timers don't leak in a high-throughput queue. Extract sleepWithAbort() helper that detaches its abort listener on resolve. Detect workglow's RetryableJobError via name/`retryable` flag (instanceof Error is unreliable across module boundaries) and honor `retryDate` for Retry-After. Add unit tests covering 429 + Retry-After retry, 5xx retry, and 404 fail-fast. Resolve Form_1_A country codes through COUNTRY_STATE_CODE_ARRAY so phone parsing receives ISO 3166-1 alpha-2 (the documented contract for PhoneSchema.country_code) rather than SEC's "B3"/"X0"-style codes. In sec.ts, replace process.exit(1) with process.exitCode + rethrow so the shutdown finally block actually runs, and wrap stopQueues/closeDb/closePgPool in Promise.allSettled so a crashing cleanup step can't mask the primary command failure. https://claude.ai/code/session_01UEfSvXJcxWC1csqry7FWmJ
Summary
This PR consolidates bad-data detection logic, adds comprehensive retry handling for SEC API fetches, improves database durability, and fixes several data parsing issues across form processors.
Key Changes
Bad Data Validation
isBadPersonField()from individual form processors (Form_C, Form_D) into a sharedbad-data.tsmodule with an expandedBAD_PERSON_FIELDSsetisBadPersonField()function for consistent validation across all form typesFetch Resilience & Retry Logic
SecFetchJob: Implements exponential backoff with jitter for transient failuresRetry-Afterheaders from server responsesMAX_FETCH_ATTEMPTS,INITIAL_BACKOFF_MS,MAX_BACKOFF_MS, andSEC_FETCH_TIMEOUT_MSenv varFile Cache Atomicity
SecFetchFileOutputCache: Write to temporary file then rename to prevent truncated cache entries and interleaved writes from concurrent processesDatabase Durability
journal_modefrom OFF to WAL (Write-Ahead Logging)synchronousfrom 0 (OFF) to NORMAL for crash-safe writeslocking_mode = EXCLUSIVEto allow concurrent readerscloseDb()function to properly close database connections on shutdownresetAllDatabases()function for CLIdb resetcommand to truncate all repositoriesData Parsing Improvements
parseIntegerOrNull()andparseCikSafely()helpers to handle EDGAR's malformed numeric strings (whitespace, non-digit cruft)totalOfferingAmountandtotalRemainingto use safe integer parsingisBadPersonField()with centralized version; addedparseCikSafely()for portal CIKCLI & Configuration
--concurrencyflag: Workglow now manages concurrency internallySecUserAgentconfigurable viaSEC_USER_AGENTenv var with better documentationcommands/index.tssec.tsto stop task queues and close database/connection poolsBug Fixes & Improvements
_arrayPathsfrom Map to WeakMap keyed by constructor to prevent collisions between Form subclasses with identicalnamepropertiesstartQuarterresponse_typeis consistently set before cache lookupsTesting
--concurrencyflag checkstest_any.tsfileNotable Implementation Details
AbortSignal.timeout()for per-attempt timeouts and `combineSignalshttps://claude.ai/code/session_01UEfSvXJcxWC1csqry7FWmJ