[SPARK-56691][PYTHON] Refactor SQL_GROUPED_MAP_PANDAS_ITER_UDF by Yicong-Huang · Pull Request #55675 · apache/spark

Yicong-Huang · 2026-05-04T21:09:51Z

What changes were proposed in this pull request?

Refactor SQL_GROUPED_MAP_PANDAS_ITER_UDF to be self-contained in read_udfs().

Why are the changes needed?

Part of SPARK-55388 (Refactor PythonEvalType processing logic). Making each eval type self-contained in read_udfs() improves readability and makes it easier to reason about the data flow for each eval type independently.
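As a rough illustration of what "self-contained" means here, the toy sketch below gives each eval type one branch that builds its complete processing function, so the data flow for that eval type can be read top to bottom in one place. This is not the code in this PR or in PySpark's worker: `EvalType`, `read_udfs_sketch`, and the list-of-ints `Batch` type are hypothetical stand-ins chosen so the example runs with the standard library alone.

```python
from enum import Enum, auto
from typing import Callable, Iterable, Iterator, List

# Hypothetical stand-ins: a toy eval-type enum and a "batch" modeled as a list of ints.
class EvalType(Enum):
    GROUPED_MAP_ITER = auto()
    SCALAR = auto()

Batch = List[int]


def read_udfs_sketch(eval_type: EvalType, udf: Callable) -> Callable:
    """Return a processing function built entirely inside one branch per eval type."""
    if eval_type is EvalType.GROUPED_MAP_ITER:
        # Everything this eval type needs lives in this branch:
        # iterate over groups, hand each group's batches to the UDF as an
        # iterator, and stream the UDF's output batches back out.
        def run_grouped(groups: Iterable[Iterable[Batch]]) -> Iterator[Batch]:
            for group_batches in groups:
                yield from udf(iter(group_batches))

        return run_grouped

    if eval_type is EvalType.SCALAR:
        # A second, unrelated eval type with its own independent branch.
        def run_scalar(batches: Iterable[Batch]) -> Iterator[Batch]:
            for batch in batches:
                yield udf(batch)

        return run_scalar

    raise ValueError(f"unsupported eval type: {eval_type}")


def identity(batches: Iterator[Batch]) -> Iterator[Batch]:
    # Toy UDF: pass every batch of a group through unchanged.
    yield from batches


run = read_udfs_sketch(EvalType.GROUPED_MAP_ITER, identity)
result = list(run([[[1, 2], [3]], [[4]]]))  # two groups, three batches total
```

The trade-off in this style is some duplication between branches in exchange for not having to trace shared helpers and flags to understand a single eval type.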

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests. No behavior change.

ASV benchmark (GroupedMapPandasIterUDFTimeBench, single run with -a repeat=5):

master: 4b3f8c3796e  vs  PR: 29538fd7980

Time (ms, lower = better)
scenario           udf                   master       PR       diff
sm_grp_few_col     identity_udf            447.4    441.0    -1.43%
sm_grp_few_col     sort_udf                499.5    498.8    -0.14%
sm_grp_few_col     key_identity_udf        449.9    411.8    -8.46%
sm_grp_many_col    identity_udf            358.3    375.5    +4.79%
sm_grp_many_col    sort_udf                378.5    388.7    +2.70%
sm_grp_many_col    key_identity_udf        371.3    341.1    -8.14%
lg_grp_few_col     identity_udf            802.7    791.6    -1.39%
lg_grp_few_col     sort_udf                993.7    949.8    -4.42%
lg_grp_few_col     key_identity_udf        682.4    691.2    +1.30%
lg_grp_many_col    identity_udf            928.7    911.1    -1.89%
lg_grp_many_col    sort_udf               1010.4    963.1    -4.69%
lg_grp_many_col    key_identity_udf        897.8    919.7    +2.44%
mixed_types        identity_udf            446.2    431.3    -3.34%
mixed_types        sort_udf                471.2    450.0    -4.50%
mixed_types        key_identity_udf        399.8    383.4    -4.10%
SUM                                       9137.8   8948.1    -2.08%

The aggregate improved slightly (-2.08%). Per-scenario differences range from -8.46% to +4.79%, which is within run-to-run noise.
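For reference, the diff column and the SUM row can be recomputed from the table, assuming diff is defined as (PR - master) / master. The helper name `pct_diff` is made up for this sketch, and the timings are the rounded values copied from the table, so a recomputed per-row diff may differ from the printed one in the last digit.

```python
def pct_diff(master_ms: float, pr_ms: float) -> float:
    """Relative change of PR vs master in percent (negative = PR is faster)."""
    return (pr_ms - master_ms) / master_ms * 100.0


# (scenario, udf, master_ms, pr_ms), copied from the table above.
rows = [
    ("sm_grp_few_col", "identity_udf", 447.4, 441.0),
    ("sm_grp_few_col", "sort_udf", 499.5, 498.8),
    ("sm_grp_few_col", "key_identity_udf", 449.9, 411.8),
    ("sm_grp_many_col", "identity_udf", 358.3, 375.5),
    ("sm_grp_many_col", "sort_udf", 378.5, 388.7),
    ("sm_grp_many_col", "key_identity_udf", 371.3, 341.1),
    ("lg_grp_few_col", "identity_udf", 802.7, 791.6),
    ("lg_grp_few_col", "sort_udf", 993.7, 949.8),
    ("lg_grp_few_col", "key_identity_udf", 682.4, 691.2),
    ("lg_grp_many_col", "identity_udf", 928.7, 911.1),
    ("lg_grp_many_col", "sort_udf", 1010.4, 963.1),
    ("lg_grp_many_col", "key_identity_udf", 897.8, 919.7),
    ("mixed_types", "identity_udf", 446.2, 431.3),
    ("mixed_types", "sort_udf", 471.2, 450.0),
    ("mixed_types", "key_identity_udf", 399.8, 383.4),
]

total_master = sum(master for _, _, master, _ in rows)
total_pr = sum(pr for _, _, _, pr in rows)
total_diff = pct_diff(total_master, total_pr)

for scenario, udf, master, pr in rows:
    print(f"{scenario:<18} {udf:<18} {pct_diff(master, pr):+6.2f}%")
print(f"{'SUM':<37} {total_diff:+6.2f}%")
```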

Peakmem benchmark (GroupedMapPandasIterUDFPeakmemBench) was essentially flat (SUM -0.02%).

Was this patch authored or co-authored using generative AI tooling?

No.

refactor: extract grouped map pandas iter UDF logic into read_udfs

29538fd

Yicong-Huang wants to merge 1 commit into apache:master from Yicong-Huang:SPARK-56691
