Release Notes - ASF JIRA

Release Notes - Spark - Version 3.2.3 - HTML format

Sub-task

[SPARK-38697] - Extend SparkSessionExtensions to inject rules into AQE Optimizer
[SPARK-39200] - Stream is corrupted Exception while fetching the blocks from fallback storage system
[SPARK-39965] - Skip PVC cleanup when driver doesn't own PVCs
[SPARK-40459] - recoverDiskStore should not stop by existing recomputed files
[SPARK-40636] - Fix wrong remained shuffles log in BlockManagerDecommissioner

Bug

[SPARK-8731] - Beeline doesn't work with -e option when started in background
[SPARK-32380] - sparksql cannot access hive table while data in hbase
[SPARK-35542] - Bucketizer created for multiple columns with parameters splitsArray, inputCols and outputCols can not be loaded after saving it.
[SPARK-39184] - ArrayIndexOutOfBoundsException for some date/time sequences in some time-zones
[SPARK-39647] - Block push fails with java.lang.IllegalArgumentException: Active local dirs list has not been updated by any executor registration even when the NodeManager hasn't been restarted
[SPARK-39775] - Regression due to AVRO-2035
[SPARK-39833] - Filtered parquet data frame count() and show() produce inconsistent results when spark.sql.parquet.filterPushdown is true
[SPARK-39835] - Fix EliminateSorts remove global sort below the local sort
[SPARK-39839] - Handle special case of null variable-length Decimal with non-zero offsetAndSize in UnsafeRow structural integrity check
[SPARK-39847] - Race condition related to interruption of task threads while they are in RocksDBLoader.loadLibrary()
[SPARK-39867] - Global limit should not inherit OrderPreservingUnaryNode
[SPARK-39887] - Expression transform error
[SPARK-39900] - Issue with querying dataframe produced by 'binaryFile' format using 'not' operator
[SPARK-39932] - WindowExec should clear the final partition buffer
[SPARK-39952] - SaveIntoDataSourceCommand should recache result relation
[SPARK-39962] - Global aggregation against pandas aggregate UDF does not take the column order into account
[SPARK-39972] - Revert the test case of SPARK-39962 in branch-3.2 and branch-3.1
[SPARK-40002] - Limit improperly pushed down through window using ntile function
[SPARK-40065] - Executor ConfigMap is not mounted if profile is not default
[SPARK-40079] - Add Imputer inputCols validation for empty input case
[SPARK-40089] - Sorting of at least Decimal(20, 2) fails for some values near the max.
[SPARK-40117] - Convert condition to java in DataFrameWriterV2.overwrite
[SPARK-40121] - Initialize projection used for Python UDF
[SPARK-40124] - Update TPCDS v1.4 q32 for Plan Stability tests
[SPARK-40149] - Star expansion after outer join asymmetrically includes joining key
[SPARK-40169] - Fix the issue with Parquet column index and predicate pushdown in Data source V1
[SPARK-40212] - SparkSQL castPartValue does not properly handle byte & short
[SPARK-40218] - GROUPING SETS should preserve the grouping columns
[SPARK-40270] - Make compute.max_rows as None working in DataFrame.style
[SPARK-40280] - Failure to create parquet predicate push down for ints and longs on some valid files
[SPARK-40315] - Non-deterministic hashCode() calculations for ArrayBasedMapData on equal objects
[SPARK-40407] - Repartition of DataFrame can result in severe data skew in some special case
[SPARK-40470] - arrays_zip output unexpected alias column names when using GetMapValue and GetArrayStructFields
[SPARK-40493] - Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
[SPARK-40562] - Add spark.sql.legacy.groupingIdWithAppendedUserGroupBy
[SPARK-40583] - Documentation error in "Integration with Cloud Infrastructures"
[SPARK-40588] - Sorting issue with partitioned-writing and AQE turned on
[SPARK-40612] - On Kubernetes for long running app Spark using an invalid principal to renew the delegation token
[SPARK-40660] - Switch to XORShiftRandom to distribute elements
[SPARK-40829] - STORED AS serde in CREATE TABLE LIKE view does not work
[SPARK-40851] - TimestampFormatter behavior changed when using the latest Java 8/11/17
[SPARK-40869] - KubernetesConf.getResourceNamePrefix creates invalid name prefixes
[SPARK-40874] - Fix broadcasts in Python UDFs when encryption is enabled
[SPARK-40902] - Quick submission of drivers in tests to mesos scheduler results in dropping drivers
[SPARK-40963] - ExtractGenerator sets incorrect nullability in new Project
[SPARK-40987] - Avoid creating a directory when deleting a block, causing DAGScheduler to not work
[SPARK-41035] - Incorrect results or NPE when a literal is reused across distinct aggregations
[SPARK-41091] - Fix Docker release tool for branch-3.2
[SPARK-41188] - Set executorEnv OMP_NUM_THREADS to be spark.task.cpus by default for spark executor JVM processes
[SPARK-41327] - Fix SparkStatusTracker.getExecutorInfos by switch On/OffHeapStorageMemory info
[SPARK-41395] - InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data
[SPARK-41448] - Make consistent MR job IDs in FileBatchWriter and FileFormatWriter
[SPARK-41522] - GA dependencies test failed
[SPARK-41535] - InterpretedUnsafeProjection and InterpretedMutableProjection can corrupt unsafe buffer when used with calendar interval data
[SPARK-41668] - DECODE function returns wrong results when passed NULL

Improvement

[SPARK-38034] - Optimize time complexity and extend applicable cases for TransposeWindow
[SPARK-39831] - R dependencies installation start to fail after devtools_2.4.4 was released
[SPARK-39879] - Reduce local-cluster memory configuration in BroadcastJoinSuite* and HiveSparkSubmitSuite
[SPARK-40022] - YarnClusterSuite should not ABORTED when there is no Python3 environment
[SPARK-40241] - Correct the link of GenericUDTF
[SPARK-40490] - `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload after SPARK-17321
[SPARK-40574] - Add PURGE to DROP TABLE doc
[SPARK-41541] - Fix wrong child call in SQLShuffleWriteMetricsReporter.decRecordsWritten()

Test

[SPARK-40172] - Temporarily disable flaky test cases in ImageFileFormatSuite
[SPARK-40461] - Set upperbound for pyzmq 24.0.0 for Python linter

Task

[SPARK-40213] - Incorrect ASCII value for Latin-1 Supplement characters
[SPARK-40292] - arrays_zip output unexpected alias column names

Dependency upgrade

[SPARK-40801] - Upgrade Apache Commons Text to 1.10

Documentation

[SPARK-40043] - Document DataStreamWriter.toTable and DataStreamReader.table
[SPARK-40983] - Remove Hadoop requirements for zstd mention in Parquet compression codec
