add 'dqx' as test engine for DQT tests within datacontract #1069

Draft
gkoenig wants to merge 3 commits into datacontract:main from gkoenig:features/support-dqx-quality-checks-databricks

Conversation

@gkoenig
Contributor

@gkoenig gkoenig commented Feb 22, 2026

Adds DQX, Databricks' data quality framework, as an alternative test engine to Soda. This requires Databricks as the server_type. Since DQX reads data contracts in ODCS format natively, no conversion is required.
This initial extension offers DQX as test_engine in programmatic mode.
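
The constraint described above (DQX only makes sense when the server type is Databricks) could be enforced with a small guard before any checks run. The following is a minimal sketch, not the PR's implementation: `ENGINE_SERVER_TYPES` and `resolve_engine` are hypothetical names invented for illustration and are not part of datacontract-cli or DQX.

```python
# Hypothetical sketch: none of these names come from datacontract-cli or DQX.
ENGINE_SERVER_TYPES = {
    "soda": None,           # None: no server-type restriction modelled in this sketch
    "dqx": {"databricks"},  # per the PR description, DQX requires a Databricks server
}


def resolve_engine(test_engine: str, server_type: str) -> str:
    """Return the engine name if it can run against the given server type."""
    if test_engine not in ENGINE_SERVER_TYPES:
        raise ValueError(f"unknown test engine: {test_engine!r}")
    allowed = ENGINE_SERVER_TYPES[test_engine]
    if allowed is not None and server_type not in allowed:
        raise ValueError(
            f"test engine {test_engine!r} does not support server type {server_type!r}"
        )
    return test_engine
```

Failing early with a clear message keeps an unsupported engine/server combination from surfacing later as an obscure connection error.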

  • [+] Tests pass
  • [+] ruff format
  • [+] README.md updated (if relevant)
  • [-] CHANGELOG.md entry added

@gkoenig gkoenig marked this pull request as draft February 22, 2026 16:39
@gkoenig
Contributor Author

gkoenig commented Feb 22, 2026

Commit 0c53e35 also adds the DQX test_engine to the CLI mode of datacontract-cli.
To give an idea: running it with a sample data contract containing 8 checks against a Databricks Unity Catalog table with 6 rows (yes, very few indeed) produces:

[image: test run output]

The data contract I used for this test is here:
dqx-test-contract.yaml

@jochenchrist
Contributor

We need to make a strategic decision on whether we want to support datasource-specific engines.
I like the idea, but it may cause inconsistent behavior for checks across different data sources.

@jochenchrist jochenchrist added the waiting-for-decision Waiting for a decision of the maintainers. label Apr 14, 2026
@dmaresma
Contributor

> We need to make a strategic decision on whether we want to support datasource-specific engines. I like the idea, but it may cause inconsistent behavior for checks across different data sources.

The performance of native data quality engines should primarily be supported and optimized by the data platform providers themselves. From my personal experience, I have encountered limitations and issues with Snowflake DMFs. Having the ability to choose alternative engines is therefore important—not only from a technical standpoint, but also by principle, as it aligns with anti‑trust and vendor‑neutrality considerations.
SODA is a solid option, and DQT is also a viable alternative. What matters most is preserving optionality.
ODCS maintains the right level of abstraction, while the CLI gives users the flexibility to decide how they want to integrate and execute data quality checks. This separation is key: it allows innovation and specialization at the engine level without locking users into a single implementation.
Regarding the concern about inconsistent behavior across data sources if we support datasource‑specific engines: I see this as a manageable trade‑off rather than a blocker. In practice, we already deal with feature gaps and behavioral differences between platforms. Clear contracts, documented semantics, and conformance tests can mitigate most of these inconsistencies, while still allowing us to benefit from native optimizations where they make sense.
As a side note, I will be migrating my Snowflake workloads to Databricks in the near future, and data contracts will play an important role in securing and de‑risking that transition. This is a concrete example of why abstraction and portability matter—we already manage platform-specific feature defects today, and data contracts help turn those differences into explicit, controlled decisions rather than hidden risks.

@gkoenig
Contributor Author

gkoenig commented Apr 17, 2026

> We need to make a strategic decision on whether we want to support datasource-specific engines. I like the idea, but it may cause inconsistent behavior for checks across different data sources.

Thanks for your feedback, Jochen.
I fully understand that major changes need thorough consideration. Looking forward to your decision.

@pocelka

pocelka commented May 1, 2026

@gkoenig Although I'm not involved with the team maintaining the CLI, I find your PR quite useful. Of course, it depends on what @jochenchrist decides.

However, I think there is a small problem in the code or in the approach you took. You probably need to exclude specific dataset-level checks that a user might configure in the contract, for example a foreign key check, since it depends on data from another DataFrame. The same may apply to any validation check outside the standard set of checks: for example, how would you deal with a custom check that was registered with DQX? Alternatively, the limitations should be described in README.md.

Overall, this might be super useful for the people who use DQX.
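
One way to picture the exclusion suggested above is to split the contract's checks into those the engine can run on a single DataFrame and those it must skip and report. This is only a sketch under stated assumptions: the dict-based check representation, the check names in `SUPPORTED_CHECKS`, and `partition_checks` are all hypothetical and do not reflect the real ODCS quality schema or DQX rule format.

```python
# Hypothetical check names; real ODCS quality entries and DQX rules look different.
SUPPORTED_CHECKS = {"not_null", "value_in_range", "matches_pattern"}


def partition_checks(checks, supported=SUPPORTED_CHECKS):
    """Split checks into (runnable, skipped) based on their function name."""
    runnable, skipped = [], []
    for check in checks:
        # Anything outside the supported set (e.g. a foreign key check that
        # needs a second DataFrame, or an unregistered custom check) is skipped.
        target = runnable if check["function"] in supported else skipped
        target.append(check)
    return runnable, skipped


checks = [
    {"name": "id_not_null", "function": "not_null"},
    {"name": "order_fk", "function": "foreign_key"},
    {"name": "business_rule", "function": "my_custom_check"},
]
runnable, skipped = partition_checks(checks)
```

Returning the skipped checks, rather than dropping them silently, would let the CLI list them as "not executed" in the test report.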

@jochenchrist
Contributor

We are considering a dqx sync command to convert and sync ODCS to a dqx project, so that dqx would run these tests directly.

datacontract dqx sync

Of course this should include quality checks with custom dqx tests

What do you think?

@pocelka

pocelka commented May 2, 2026

Hi, can you elaborate a little bit more on how the sync command would work?

@gkoenig
Contributor Author

gkoenig commented May 4, 2026

> @gkoenig Although I'm not involved with the team maintaining the CLI, I find your PR quite useful. Of course, it depends on what @jochenchrist decides.
>
> However, I think there is a small problem in the code or in the approach you took. You probably need to exclude specific dataset-level checks that a user might configure in the contract, for example a foreign key check, since it depends on data from another DataFrame. The same may apply to any validation check outside the standard set of checks: for example, how would you deal with a custom check that was registered with DQX? Alternatively, the limitations should be described in README.md.
>
> Overall, this might be super useful for the people who use DQX.

Thanks for your feedback, @pocelka. You are absolutely right, there will be fixes and extensions to work on. The current state of the PR was more of a "kickstart" to figure out whether it will make it to main ;)

@gkoenig
Contributor Author

gkoenig commented May 4, 2026

> We are considering a dqx sync command to convert and sync ODCS to a dqx project, so that dqx would run these tests directly.
>
> datacontract dqx sync
>
> Of course this should include quality checks with custom dqx tests
>
> What do you think?

Thanks for your reply, @jochenchrist.

What do you mean by "sync ODCS to a dqx project"? Creating a small, isolated Python project that takes the ODCS contract and uses the dqx library to run the tests?

@jochenchrist
Contributor

> What do you mean by "sync ODCS to a dqx project"? Creating a small, isolated Python project that takes the ODCS contract and uses the dqx library to run the tests?

Yes, the idea is to convert ODCS to dqx tests and let dqx execute the tests. It should also work with existing Databricks projects.

What does your Databricks data product code structure look like?

@gkoenig
Contributor Author

gkoenig commented May 4, 2026

> > What do you mean by "sync ODCS to a dqx project"? Creating a small, isolated Python project that takes the ODCS contract and uses the dqx library to run the tests?
>
> Yes, the idea is to convert ODCS to dqx tests and let dqx execute the tests. It should also work with existing Databricks projects.
>
> What does your Databricks data product code structure look like?

OK, got it, thanks @jochenchrist.
Actually, in one project we use datacontract-cli to validate the data contract "schema" against Unity Catalog and use the dqx library to run dqt checks, which are defined in dedicated YAML files (since we also want to cover dqt checks for data in motion, before it reaches the "outputport" stage).
Since DQX natively supports ODCS and there is already a method to generate DQX rules from an ODCS-compliant data contract, I am not sure there is a huge benefit in having this conversion in datacontract-cli as well. To execute the checks you need the dqx libraries onboarded anyway, so we could simply use the existing method. Or am I missing something here?

@simonharrer
Contributor

@gkoenig perhaps have a look at the dbt integration, with the new dbt sync command (which will be improved along the way as well). The idea could be to simply have "$ datacontract dqx sync", which would make sure that all tests defined within the data contract are available as dqx tests. There can also be additional dqx tests (they will be merged together). The command would then run all the dqx tests, convert the test results into the test result format of the data contract CLI, and possibly report them back to a system that understands that format.
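
The "merged together" step in the proposal above could look roughly like the sketch below. It is an illustration only: `merge_checks` and the dict-based check shape are hypothetical, and the rule that contract-derived checks win on a name collision is an assumption, since the thread does not specify precedence.

```python
def merge_checks(contract_checks, extra_checks):
    """Merge contract-derived checks with additional, hand-written ones.

    Assumption of this sketch: checks are identified by "name", and a
    contract-derived check replaces an extra check with the same name.
    """
    merged = {check["name"]: check for check in extra_checks}
    merged.update({check["name"]: check for check in contract_checks})
    return list(merged.values())


extra = [
    {"name": "amount_positive", "source": "extra"},
    {"name": "id_not_null", "source": "extra"},
]
from_contract = [
    {"name": "id_not_null", "source": "contract"},
    {"name": "email_format", "source": "contract"},
]
all_checks = merge_checks(from_contract, extra)
```

With this precedence, re-running the sync after a contract change would refresh the contract-derived checks while leaving unrelated hand-written checks untouched.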

Labels

waiting-for-decision Waiting for a decision of the maintainers.

5 participants