[SPARK-56691][PYTHON] Refactor SQL_GROUPED_MAP_PANDAS_ITER_UDF#55675

Open
Yicong-Huang wants to merge 1 commit into apache:master from Yicong-Huang:SPARK-56691

@Yicong-Huang
Contributor

What changes were proposed in this pull request?

Refactor SQL_GROUPED_MAP_PANDAS_ITER_UDF to be self-contained in read_udfs().

Why are the changes needed?

Part of SPARK-55388 (Refactor PythonEvalType processing logic). Making each eval type self-contained in read_udfs() improves readability and makes it easier to reason about the data flow for each eval type independently.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests. No behavior change.

ASV benchmark (GroupedMapPandasIterUDFTimeBench, single run with -a repeat=5):

master: 4b3f8c3796e  vs  PR: 29538fd7980

Time (ms, lower = better)
scenario           udf                   master       PR       diff
sm_grp_few_col     identity_udf            447.4    441.0    -1.43%
sm_grp_few_col     sort_udf                499.5    498.8    -0.14%
sm_grp_few_col     key_identity_udf        449.9    411.8    -8.46%
sm_grp_many_col    identity_udf            358.3    375.5    +4.79%
sm_grp_many_col    sort_udf                378.5    388.7    +2.70%
sm_grp_many_col    key_identity_udf        371.3    341.1    -8.14%
lg_grp_few_col     identity_udf            802.7    791.6    -1.39%
lg_grp_few_col     sort_udf                993.7    949.8    -4.42%
lg_grp_few_col     key_identity_udf        682.4    691.2    +1.30%
lg_grp_many_col    identity_udf            928.7    911.1    -1.89%
lg_grp_many_col    sort_udf               1010.4    963.1    -4.69%
lg_grp_many_col    key_identity_udf        897.8    919.7    +2.44%
mixed_types        identity_udf            446.2    431.3    -3.34%
mixed_types        sort_udf                471.2    450.0    -4.50%
mixed_types        key_identity_udf        399.8    383.4    -4.10%
SUM                                       9137.8   8948.1    -2.08%

The aggregate time improved slightly (-2.08%). Per-scenario differences range from -8.46% to +4.79% and are within run-to-run noise for a single run.
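For readers who want to recheck the arithmetic, here is a minimal sketch of how the `diff` column and the `SUM` row can be recomputed from the timings in the table. The helper name `pct_diff` is hypothetical and is not part of the benchmark code; the formula `(PR - master) / master` is an assumption that matches the reported values. Because the table shows timings rounded to 0.1 ms, a recomputed per-row diff can differ from the reported one in the last digit.

```python
# Recompute the diff column from the rounded timings in the table.
# pct_diff is a hypothetical helper, not part of the PR or the benchmark suite.

def pct_diff(master_ms: float, pr_ms: float) -> float:
    """Relative change of the PR timing against master, in percent."""
    return (pr_ms - master_ms) / master_ms * 100.0

# (scenario, udf, master ms, PR ms), copied from the table.
rows = [
    ("sm_grp_few_col", "identity_udf", 447.4, 441.0),
    ("sm_grp_few_col", "sort_udf", 499.5, 498.8),
    ("sm_grp_few_col", "key_identity_udf", 449.9, 411.8),
    ("sm_grp_many_col", "identity_udf", 358.3, 375.5),
    ("sm_grp_many_col", "sort_udf", 378.5, 388.7),
    ("sm_grp_many_col", "key_identity_udf", 371.3, 341.1),
    ("lg_grp_few_col", "identity_udf", 802.7, 791.6),
    ("lg_grp_few_col", "sort_udf", 993.7, 949.8),
    ("lg_grp_few_col", "key_identity_udf", 682.4, 691.2),
    ("lg_grp_many_col", "identity_udf", 928.7, 911.1),
    ("lg_grp_many_col", "sort_udf", 1010.4, 963.1),
    ("lg_grp_many_col", "key_identity_udf", 897.8, 919.7),
    ("mixed_types", "identity_udf", 446.2, 431.3),
    ("mixed_types", "sort_udf", 471.2, 450.0),
    ("mixed_types", "key_identity_udf", 399.8, 383.4),
]

# Totals for the SUM row.
sum_master = sum(master for _, _, master, _ in rows)
sum_pr = sum(pr for _, _, _, pr in rows)

for scenario, udf, master, pr in rows:
    print(f"{scenario:<16} {udf:<18} {pct_diff(master, pr):+.2f}%")
print(f"{'SUM':<35} {pct_diff(sum_master, sum_pr):+.2f}%")
```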

Peakmem benchmark (GroupedMapPandasIterUDFPeakmemBench) was essentially flat (SUM -0.02%).
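As background on how such numbers are collected: ASV discovers benchmark methods by name prefix (`time_*` for wall-clock timing, `peakmem_*` for peak memory) and passes each combination of `params` to `setup` and to the benchmark method. The sketch below only illustrates that convention. The class name, parameter grid, and UDF bodies are hypothetical stand-ins written in plain Python, not the actual `GroupedMapPandasIterUDFTimeBench` code, and lists of dicts replace pandas DataFrames so the example runs without Spark or pandas.

```python
from typing import Dict, Iterator, List

# Stand-in for one pandas.DataFrame batch (assumption for illustration only).
Batch = List[Dict[str, int]]


def identity_udf(batches: Iterator[Batch]) -> Iterator[Batch]:
    """Yield every input batch unchanged."""
    for batch in batches:
        yield batch


def sort_udf(batches: Iterator[Batch]) -> Iterator[Batch]:
    """Yield every input batch sorted by the value column."""
    for batch in batches:
        yield sorted(batch, key=lambda row: row["v"])


class GroupedMapIterSketchBench:
    # ASV reads these class attributes to build the parameter grid.
    params = [[10, 1000], ["identity_udf", "sort_udf"]]
    param_names = ["rows_per_group", "udf"]
    # Default repeat count; `asv run -a repeat=5` overrides this attribute.
    repeat = 5

    def setup(self, rows_per_group: int, udf: str) -> None:
        # Build the input once per parameter combination, outside the timed region.
        self.udf = {"identity_udf": identity_udf, "sort_udf": sort_udf}[udf]
        self.groups = [
            [{"k": g, "v": (i * 7919) % rows_per_group} for i in range(rows_per_group)]
            for g in range(8)
        ]

    def _run(self) -> List[Batch]:
        # Apply the iterator-style UDF to each group, one batch per group.
        out: List[Batch] = []
        for group in self.groups:
            out.extend(self.udf(iter([group])))
        return out

    def time_apply(self, rows_per_group: int, udf: str) -> None:
        self._run()

    def peakmem_apply(self, rows_per_group: int, udf: str) -> None:
        self._run()
```

Keeping data construction in `setup` means the timed and memory-tracked methods cover only the UDF application.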

Was this patch authored or co-authored using generative AI tooling?

No.
