[codex] Strengthen light-mode OOXML shapes/charts and restore print-area contract#129

Merged: harumiWeb merged 7 commits into main from feat/nocom-extract on Apr 22, 2026

@harumiWeb (Owner) commented Apr 22, 2026

Summary

  • restore the accepted light mode print-area contract across extract, process_excel, CLI, and engine export paths
  • strengthen the OOXML rich baseline so mode="light" can extract shapes and charts without requiring COM
  • keep FilterOptions.include_print_areas=None as automatic inclusion, requiring False for explicit suppression
  • isolate malformed OOXML drawing failures to the affected worksheet so healthy sheets retain rich artifacts
  • refresh the related docs and generated model documentation for the corrected contract
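
The tri-state behavior of FilterOptions.include_print_areas listed above can be sketched as follows. This is a minimal illustration and not ExStruct's actual code: FilterOptionsSketch and resolve_include_print_areas are hypothetical names, and the only behavior taken from this PR is that None means automatic inclusion while an explicit False suppresses print areas.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FilterOptionsSketch:
    """Hypothetical stand-in for FilterOptions; only the field discussed here."""

    include_print_areas: Optional[bool] = None


def resolve_include_print_areas(options: FilterOptionsSketch) -> bool:
    """Return True unless the caller explicitly passed False.

    None is treated as "decide automatically", which per this PR means
    print areas are included in every mode, including mode="light".
    """
    if options.include_print_areas is None:
        return True
    return options.include_print_areas
```

Keeping None distinct from False lets the default change over time without breaking callers who opted out explicitly.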

Why

The current branch restores behavior already accepted in ADR-0010 and the published docs. It also promotes OOXML extraction to the baseline rich path for light mode, so shapes and charts can be returned without COM. In addition, it narrows OOXML drawing fallback scope so one broken drawing part does not erase shapes or charts from healthy sheets in the same workbook.

Impact

  • mode="light" keeps print_areas in default structured output
  • mode="light" supports best-effort OOXML shape/chart extraction on healthy .xlsx / .xlsm worksheets without COM
  • print_areas_dir side output remains available on process_excel and CLI paths in light mode
  • best-effort OOXML rich extraction becomes sheet-local instead of workbook-wide when a drawing part is malformed
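
The sheet-local fallback described above can be illustrated with a short sketch. It assumes each worksheet's drawing XML is already available as a string; parse_drawings_per_sheet is a hypothetical helper, not a function from ExStruct, and the real implementation may catch a different set of exceptions.

```python
import logging
import xml.etree.ElementTree as ET

logger = logging.getLogger(__name__)


def parse_drawings_per_sheet(drawing_xml_by_sheet: dict[str, str]) -> dict[str, list[str]]:
    """Parse each sheet's drawing XML independently (hypothetical helper).

    A malformed drawing part yields an empty result for that sheet only;
    sheets that parse cleanly keep their extracted element tags.
    """
    results: dict[str, list[str]] = {}
    for sheet_name, xml_text in drawing_xml_by_sheet.items():
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError as exc:
            # Isolate the failure: warn and continue with the next sheet.
            logger.warning("Skipping drawings for %r: %s", sheet_name, exc)
            results[sheet_name] = []
            continue
        results[sheet_name] = [child.tag for child in root]
    return results
```

Placing the try/except inside the loop, rather than around it, is what turns a workbook-wide failure into a per-sheet one.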

Validation

  • uv run pytest tests/engine/test_engine.py tests/core/test_mode_output.py tests/cli/test_cli.py tests/core/test_ooxml_drawing.py -q
  • uv run python scripts/gen_model_docs.py
  • uv run task precommit-run

Closes #128
Closes #130

Summary by CodeRabbit

  • New Features

    • Light mode now extracts shapes and charts using pure Python (no COM required)
    • LibreOffice mode enriches a Python baseline rather than being the only rich extraction path
    • Print areas now included for all extraction modes by default
  • Bug Fixes

    • OOXML drawing parsing is resilient to malformed worksheet data
    • Light mode preserves extracted shapes/charts even if enrichment fails
  • Documentation

    • Updated extraction mode specifications and behavior contracts
    • New architectural decision documentation for light mode
    • Updated API and CLI guidance
    • Version 0.8.0

@coderabbitai Bot (Contributor) commented Apr 22, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c62032a1-28c2-431d-9489-89e728171f21

📥 Commits

Reviewing files that changed from the base of the PR and between 11057df and e828563.

📒 Files selected for processing (15)
  • CHANGELOG.md
  • README.ja.md
  • README.md
  • dev-docs/architecture/pipeline.md
  • docs/release-notes/v0.8.0.md
  • mkdocs.yml
  • src/exstruct/__init__.py
  • src/exstruct/core/ooxml_drawing.py
  • src/exstruct/core/pipeline.py
  • src/exstruct/errors.py
  • tasks/feature_spec.md
  • tasks/todo.md
  • tests/core/test_mode_output.py
  • tests/core/test_ooxml_drawing.py
  • tests/core/test_pipeline.py

Walkthrough

This release upgrades ExStruct to version 0.8.0, introducing a pure-Python OOXML extraction baseline for light mode that extracts shapes, connectors, and charts without COM/LibreOffice dependencies. It adds a new architectural ADR-0010, updates provenance tracking, restores print-area defaults, and enhances OOXML parsing resilience. Documentation and schemas are updated accordingly.
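
To show why shapes can be read without COM or LibreOffice, here is a minimal sketch that uses only the standard library. It relies on two facts about the OOXML package format: an .xlsx file is a ZIP archive, and drawing parts live under xl/drawings/ and use the SpreadsheetDrawingML namespace. The function count_shapes_in_drawings is a hypothetical helper and does not reflect how OoxmlRichBackend is implemented.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetDrawingML namespace used by xl/drawings/drawingN.xml parts.
XDR_NS = "http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing"


def count_shapes_in_drawings(xlsx_bytes: bytes) -> dict[str, int]:
    """Count <xdr:sp> shape elements per drawing part (illustrative helper)."""
    counts: dict[str, int] = {}
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as archive:
        for name in archive.namelist():
            # Relationship files end in .rels, so they are skipped here.
            if name.startswith("xl/drawings/") and name.endswith(".xml"):
                root = ET.fromstring(archive.read(name))
                counts[name] = len(root.findall(f".//{{{XDR_NS}}}sp"))
    return counts
```

Mapping each drawing part back to its worksheet requires following the relationship files, which this sketch leaves out.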

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| OOXML Rich Backend Implementation | `src/exstruct/core/backends/ooxml_backend.py`, `src/exstruct/core/backends/__init__.py` | Added new OoxmlRichBackend class for non-COM OOXML drawing extraction with per-sheet caching, best-effort shape/chart/connector extraction for light mode, and graceful degradation on malformed XML/missing files (logs warnings instead of failing). |
| OOXML Drawing Parsing & Metrics | `src/exstruct/core/ooxml_drawing.py` | Enhanced drawing parsing with SheetDrawingMetrics dataclass for per-sheet row/column point offsets, improved geometry conversion using worksheet dimensions, prefer_transform_position_when_sized flag for anchor resolution, and per-sheet exception isolation so malformed drawings on one sheet don't erase healthy artifacts from others. |
| Backend Protocol & Implementation Updates | `src/exstruct/core/backends/base.py`, `src/exstruct/core/backends/com_backend.py`, `src/exstruct/core/backends/libreoffice_backend.py` | Updated RichBackend protocol to accept "light" mode in extract_shapes and extract_charts signatures; modified ComRichBackend accordingly; added provenance parameter to _build_shapes_from_ooxml in LibreOfficeBackend to support custom provenance values. |
| Extraction Pipeline & Light Mode | `src/exstruct/core/pipeline.py` | Added _run_light_pipeline() to execute light-mode OOXML extraction; updated resolve_rich_backend to instantiate OoxmlRichBackend for light mode; changed libreoffice pipeline to seed OOXML baseline first and preserve rich artifacts on fallback instead of clearing them; removed chart suppression in light mode. |
| Engine & Extract API | `src/exstruct/__init__.py`, `src/exstruct/engine.py` | Updated extract() docstring to clarify light mode as pure-Python OOXML baseline for .xlsx/.xlsm; changed FilterOptions.include_print_areas=None auto-selection to enable print areas for all modes (not just non-light); updated docstring descriptions for light/libreoffice mode contract. |
| Model Provenance & Schema Enums | `src/exstruct/models/__init__.py`, `schemas/*.json` | Extended provenance field on BaseShape and Chart models, and across all shape-related schema definitions (arrow.json, chart.json, shape.json, smartart.json, sheet.json, workbook.json, print_area_view.json), to include "python_ooxml" alongside existing "excel_com" and "libreoffice_uno". |
| ADR & Architecture Documentation | `dev-docs/adr/ADR-0010-light-mode-as-the-pure-python-rich-ooxml-baseline.md`, `dev-docs/adr/ADR-0001-extraction-mode-boundaries.md`, `dev-docs/adr/README.md`, `dev-docs/adr/decision-map.md`, `dev-docs/adr/index.yaml` | Added new ADR-0010 defining light mode as pure-Python OOXML baseline with best-effort rich artifacts; marked ADR-0001 as superseded; updated ADR indices and decision map to reflect new status and relationships. |
| Extraction & Data Model Specs | `dev-docs/specs/excel-extraction.md`, `dev-docs/specs/data-model.md` | Updated extraction spec to clarify light mode's OOXML-rich pathway, reframed libreoffice as optional enrichment on top of light baseline, documented fallback behavior preserving OOXML artifacts; added provenance metadata notes for shape/chart serialization. |
| README & User Documentation | `README.md`, `README.ja.md`, `docs/api.md`, `docs/cli.md`, `docs/mcp.md` | Updated README sections removing "Choose an Interface" and legacy phase descriptions; clarified light/libreoffice mode contract; updated example images to local assets; synchronized README.ja.md structure with English version; added Extraction Mode Guide to API docs; updated CLI/MCP mode descriptions. |
| CLI & Integration Tests | `tests/cli/test_cli.py`, `tests/core/test_mode_output.py`, `tests/core/test_ooxml_drawing.py`, `tests/core/test_pipeline.py`, `tests/engine/test_engine.py`, `tests/integration/test_integrate_raw_data.py`, `tests/models/test_models_export.py` | Added CLI test for light-mode print-areas output; updated light-mode tests to assert shapes/charts with provenance="python_ooxml" are present instead of empty; added comprehensive OOXML drawing parsing tests covering geometry resolution, per-sheet resilience, and metrics-aware coordinate conversion; updated pipeline fallback tests for OOXML baseline preservation. |
| Task & Internal Documentation | `tasks/feature_spec.md`, `tasks/todo.md`, `tasks/lessons.md` | Added feature spec entries documenting README parity refresh and light-mode print-areas/OOXML resilience scope; recorded task completions with verification steps; added lessons on extraction/export mode contract validation. |
| Formatting & Line-Ending Normalization | `.agents/skills/adr-linter/SKILL.md`, `.agents/skills/adr-reviewer/SKILL.md`, `.agents/skills/exstruct-cli/agents/openai.yaml`, `AGENTS.md`, `CHANGELOG.md`, `CLAUDE.md`, `SECURITY.md`, `dev-docs/agents/*`, `dev-docs/architecture/*`, `dev-docs/testing/test-requirements.md`, `docs/cli.md`, `docs/release-notes/v0.7.0.md`, `mkdocs.yml`, `tests/cli/test_cli_lazy_imports.py`, `tests/cli/test_edit_cli.py`, `src/exstruct/edit/service.py`, `schemas/cell_row.json`, `schemas/chart_series.json`, `schemas/merged_cells.json`, `schemas/print_area.json`, `schemas/smartart_node.json` | Whitespace/line-ending normalization (CRLF conversions, blank-line adjustments) across documentation, configuration, and schema files without semantic content changes. |
| Version & Configuration | `pyproject.toml` | Updated package version from 0.7.1 to 0.8.0. |
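
The "per-sheet row/column point offsets" mentioned for SheetDrawingMetrics suggest a conversion along the following lines. This is a sketch under stated assumptions: SheetMetricsSketch, its field names, and anchor_to_points are hypothetical and are not taken from the PR's code. What is standard is that a drawing anchor gives a zero-based column and row plus offsets in EMU, and that one point equals 12700 EMU.

```python
from dataclasses import dataclass

EMU_PER_POINT = 12700  # 914400 EMU per inch / 72 points per inch


@dataclass(frozen=True)
class SheetMetricsSketch:
    """Hypothetical per-sheet metrics: cumulative offsets in points.

    col_offsets[i] is the left edge of zero-based column i;
    row_offsets[j] is the top edge of zero-based row j.
    """

    col_offsets: tuple[float, ...]
    row_offsets: tuple[float, ...]


def anchor_to_points(
    metrics: SheetMetricsSketch,
    col: int,
    col_off_emu: int,
    row: int,
    row_off_emu: int,
) -> tuple[float, float]:
    """Convert an anchor (cell index plus EMU offset) to (left, top) in points."""
    left = metrics.col_offsets[col] + col_off_emu / EMU_PER_POINT
    top = metrics.row_offsets[row] + row_off_emu / EMU_PER_POINT
    return left, top
```

Using the sheet's own cumulative offsets, instead of a fixed default cell size, keeps positions correct on sheets with resized rows or columns.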

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Pipeline
    participant RichBackend
    participant OOXML
    participant WorkbookBuilder

    Client->>Pipeline: extract(file, mode="light")
    Pipeline->>RichBackend: resolve_rich_backend("light")
    RichBackend->>RichBackend: OoxmlRichBackend(file_path)
    Pipeline->>Pipeline: _run_light_pipeline()
    Pipeline->>RichBackend: extract_shapes(mode="light")
    RichBackend->>OOXML: read_sheet_drawings()
    OOXML-->>RichBackend: SheetDrawingData (shapes, charts, connectors)
    RichBackend-->>Pipeline: ShapeData {provenance="python_ooxml"}
    Pipeline->>RichBackend: extract_charts(mode="light")
    RichBackend->>OOXML: parse_charts_from_drawings()
    OOXML-->>RichBackend: ChartData {provenance="python_ooxml"}
    RichBackend-->>Pipeline: ChartData
    Pipeline->>WorkbookBuilder: build_workbook(include_rich_artifacts=True)
    WorkbookBuilder-->>Client: Workbook with shapes, charts, print_areas
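
The light-mode flow in the diagram above can be sketched in a few lines. All class and function names ending in Sketch are hypothetical stand-ins, not the PR's actual classes. The parts taken from the PR are the three provenance values, the keyword mode="light", and the idea that the resolver returns an OOXML backend for light mode.

```python
from dataclasses import dataclass
from typing import Literal, Protocol

Provenance = Literal["excel_com", "libreoffice_uno", "python_ooxml"]


@dataclass(frozen=True)
class ShapeSketch:
    name: str
    provenance: Provenance


class RichBackendSketch(Protocol):
    def extract_shapes(self, *, mode: str) -> list[ShapeSketch]: ...


class OoxmlBackendSketch:
    """Stand-in backend that tags every shape as parsed by pure Python."""

    def __init__(self, shape_names: list[str]) -> None:
        self._shape_names = shape_names

    def extract_shapes(self, *, mode: str) -> list[ShapeSketch]:
        return [ShapeSketch(name, "python_ooxml") for name in self._shape_names]


def resolve_backend_sketch(mode: str, shape_names: list[str]) -> RichBackendSketch:
    """Pick the backend for a mode; only "light" is modeled in this sketch."""
    if mode == "light":
        return OoxmlBackendSketch(shape_names)
    raise ValueError(f"mode {mode!r} is not covered by this sketch")
```

Recording provenance on each artifact lets consumers tell which backend produced a given shape or chart.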
sequenceDiagram
    participant Client
    participant Pipeline
    participant OoxmlBackend
    participant LibreOfficeBackend
    participant Workbook

    Client->>Pipeline: extract(file, mode="libreoffice")
    Pipeline->>OoxmlBackend: extract_shapes/charts(mode="light")
    OoxmlBackend-->>Pipeline: OOXML baseline shapes/charts
    Pipeline->>LibreOfficeBackend: extract_shapes/charts(mode="libreoffice")
    alt LibreOffice Available
        LibreOfficeBackend-->>Pipeline: Enriched shapes/charts {provenance="libreoffice_uno"}
    else LibreOffice Unavailable
        LibreOfficeBackend-->>Pipeline: Error
        Pipeline->>Pipeline: Preserve OOXML baseline
    end
    Pipeline->>Workbook: build(shapes + charts from either backend)
    Workbook-->>Client: Workbook with best-effort rich artifacts
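
The fallback branch in the second diagram amounts to "seed the baseline first, then try to enrich". A minimal sketch follows, with the assumption that an unavailable LibreOffice runtime surfaces as a RuntimeError. The real pipeline may raise a different exception type, and extract_with_enrichment is a hypothetical name.

```python
from typing import Callable


def extract_with_enrichment(
    baseline: Callable[[], list[str]],
    enrich: Callable[[list[str]], list[str]],
) -> tuple[list[str], str]:
    """Return (artifacts, provenance), keeping the baseline if enrichment fails."""
    # Always compute the pure-Python OOXML baseline before attempting enrichment.
    shapes = baseline()
    try:
        return enrich(shapes), "libreoffice_uno"
    except RuntimeError:
        # Assumed failure type: preserve the baseline instead of clearing it.
        return shapes, "python_ooxml"
```

Because the baseline is computed before the try block, a failed enrichment can never leave the caller with fewer artifacts than light mode would have returned.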

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Add libreoffice extraction mode #76: Adds initial OOXML-rich backend and pipeline integration; directly overlaps with this PR's core OOXML extraction implementation and backend protocol changes.
  • docs: split internal developer docs into dev-docs #91: Introduces ADR-0001 and dev-docs structure; this PR supersedes ADR-0001 with ADR-0010 and updates the same ADR documentation surface.
  • Dev/refactor #23: Implements extraction pipeline and backend refactoring; this PR's light-mode pipeline additions and backend composition build on that foundation.

Suggested labels

enhancement, documentation, architecture, core-extraction

Poem

🐰 A light mode emerges, pure and free,
OOXML drawings parsed with glee,
No COM, no runtime to confess,
Just Python extracting—nothing less!
Shapes and charts bloom from the baseline true,
v0.8.0 delivers what's overdue! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 59.68% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Code changes comprehensively address issue #128 objectives: light-mode print-area contract restored across all paths [`src/exstruct/__init__.py`, `src/exstruct/engine.py`, `src/exstruct/core/pipeline.py`], FilterOptions.include_print_areas defaults corrected, and OOXML drawing failures isolated to individual sheets via SheetDrawingMetrics [`src/exstruct/core/ooxml_drawing.py`]. |
| Out of Scope Changes check | ✅ Passed | Changes are narrowly scoped to light-mode print areas and OOXML drawing resilience. Documentation/ADR updates (ADR-0010 creation, README changes) and version bump (0.7.1→0.8.0) are justified by the contract restoration; no unrelated features added. |
| Description check | ✅ Passed | The PR description is comprehensive and clearly explains the scope, motivation, acceptance criteria, and validation steps. |
| Title check | ✅ Passed | The PR title clearly and concisely describes the main changes: restoring light-mode print-area contract and OOXML drawing resilience, which aligns with the primary objectives outlined in the PR summary. |

@codacy-production Bot commented Apr 22, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

🟢 Metrics 19 complexity · 0 duplication

| Metric | Results |
|---|---|
| Complexity | 19 |
| Duplication | 0 |

@harumiWeb harumiWeb marked this pull request as ready for review April 22, 2026 12:49
@harumiWeb harumiWeb requested a review from Copilot April 22, 2026 12:49

This comment was marked as resolved.

@harumiWeb harumiWeb changed the title from "[codex] Restore light-mode print-area contract and OOXML drawing resilience" to "[codex] Strengthen light-mode OOXML shapes/charts and restore print-area contract" on Apr 22, 2026
devin-ai-integration[bot]

This comment was marked as resolved.

chatgpt-codex-connector[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 45 out of 66 changed files in this pull request and generated no new comments.


@harumiWeb harumiWeb merged commit 91777b0 into main Apr 22, 2026
11 of 12 checks passed