Release Notes - Spark - Version 3.4.3

Sub-task

[SPARK-44495] - Use the latest minikube in K8s IT
[SPARK-45445] - Upgrade snappy to 1.1.10.5
[SPARK-46369] - Remove `kill` link from RELAUNCHING drivers in MasterPage
[SPARK-46400] - When there are corrupted files in the local maven repo, retry to skip this cache
[SPARK-46411] - Change to use bcprov/bcpkix-jdk18on for test
[SPARK-46704] - Fix `MasterPage` to sort `Running Drivers` table by `Duration` column correctly
[SPARK-46747] - Too Many Shared Locks due to PostgresDialect.getTableExistsQuery - LIMIT 1
[SPARK-46817] - Fix `spark-daemon.sh` usage by adding `decommission` command
[SPARK-46888] - Fix `Master` to reject worker kill request if decommission is disabled
[SPARK-47021] - Fix `kvstore` module to have explicit `commons-lang3` test dependency
[SPARK-47111] - Upgrade `PostgreSQL` JDBC driver to 42.7.2 and docker image to 16.2
[SPARK-47368] - Remove inferTimestampNTZ config check in ParquetRowConverter
[SPARK-47370] - Add migration doc: TimestampNTZ type inference on Parquet files
[SPARK-47494] - Add migration doc for the behavior change of Parquet timestamp inference since Spark 3.3
[SPARK-47537] - Use MySQL Connector/J for MySQL DB instead of MariaDB Connector/J
[SPARK-47666] - Fix NPE when reading mysql bit array as LongType
[SPARK-47770] - Fix `GenerateMIMAIgnore.isPackagePrivateModule` to return false instead of failing
[SPARK-47774] - Remove redundant rules from `MimaExcludes`

Bug

[SPARK-45580] - Subquery changes the output schema of the outer query
[SPARK-46092] - Overflow in Parquet row group filter creation causes incorrect results
[SPARK-46189] - Various Pandas functions fail in interpreted mode
[SPARK-46239] - Hide Jetty info
[SPARK-46275] - Protobuf: Permissive mode should return null rather than struct with null fields
[SPARK-46330] - Loading of Spark UI blocks for a long time when HybridStore enabled
[SPARK-46339] - Directory with number name should not be treated as metadata log
[SPARK-46466] - vectorized parquet reader should never do rebase for timestamp ntz
[SPARK-46514] - Fix HiveMetastoreLazyInitializationSuite
[SPARK-46577] - HiveMetastoreLazyInitializationSuite leaks hive's SessionState
[SPARK-46598] - OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column
[SPARK-46700] - count the last spilling for the shuffle disk spilling bytes metric
[SPARK-46763] - ReplaceDeduplicateWithAggregate fails when non-grouping keys have duplicate attributes
[SPARK-46779] - Grouping by subquery with a cached relation can fail
[SPARK-46786] - Fix MountVolumesFeatureStep to use ReadWriteOncePod instead of ReadWriteOnce
[SPARK-46794] - Incorrect results due to inferred predicate from checkpoint with subquery
[SPARK-46855] - Add `sketch` to the dependencies of the `catalyst` module in `module.py`
[SPARK-46861] - Avoid Deadlock in DAGScheduler
[SPARK-46862] - Incorrect count() of a dataframe loaded from CSV datasource
[SPARK-46893] - Remove inline scripts from UI descriptions
[SPARK-46945] - Add `spark.kubernetes.legacy.useReadWriteOnceAccessMode` for old K8s clusters
[SPARK-47063] - CAST long to timestamp has different behavior for codegen vs interpreted
[SPARK-47068] - Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch
[SPARK-47072] - Wrong error message for incorrect ANSI intervals
[SPARK-47085] - Performance issue on thrift API
[SPARK-47125] - Return null if Univocity never triggers parsing
[SPARK-47146] - Possible thread leak when doing sort merge join
[SPARK-47177] - Cached SQL plan does not display final AQE plan in explain string
[SPARK-47196] - Fix `core` module to succeed SBT tests
[SPARK-47236] - Fix `deleteRecursivelyUsingJavaIO` to skip non-existing file
[SPARK-47305] - PruneFilters incorrectly tags isStreaming flag when replacing child of Filter with LocalRelation
[SPARK-47318] - AuthEngine key exchange needs additional KDF round
[SPARK-47385] - Tuple encoder produces wrong results with Option inputs
[SPARK-47434] - Streaming Statistics link redirect causing 302 error
[SPARK-47455] - Fix Resource Handling of `scalaStyleOnCompileConfig` in SparkBuild.scala
[SPARK-47503] - Spark history server fails to display query for cached JDBC relation named in quotes
[SPARK-47521] - Use `Utils.tryWithResource` during reading shuffle data from external storage
[SPARK-47646] - try_to_number fails with NPE for malformed input
[SPARK-47676] - Clean up the removed `VersionsSuite` references
[SPARK-47824] - Nondeterminism in pyspark.pandas.series.asof
[SPARK-47844] - Upgrade ORC to 1.8.7
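
SPARK-47068 above restores the `-1` and `0` cases for `spark.sql.execution.arrow.maxRecordsPerBatch`. The Spark configuration documentation describes a zero or negative value as meaning no limit on the number of records per Arrow batch. The following is a plain-Python sketch of that semantics only, not Spark's implementation; the function name `split_into_batches` is hypothetical.

```python
from typing import Iterator, List, Sequence


def split_into_batches(records: Sequence, max_records_per_batch: int) -> Iterator[List]:
    """Yield lists of at most `max_records_per_batch` records.

    Assumption for illustration: zero or a negative value means "no limit",
    so all records are returned in a single batch.
    """
    if max_records_per_batch <= 0:
        yield list(records)
        return
    for start in range(0, len(records), max_records_per_batch):
        yield list(records[start:start + max_records_per_batch])
```

With a positive limit the records are chunked; with `0` or `-1` they come back as one batch.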

Improvement

[SPARK-45587] - Skip UNIDOC and MIMA in build GitHub Action job
[SPARK-46286] - Document spark.io.compression.zstd.bufferPool.enabled
[SPARK-46425] - Pin the bundler version in CI
[SPARK-47505] - Fix `pyspark-errors` test jobs for branch-3.4
[SPARK-47734] - Fix flaky pyspark.sql.dataframe.DataFrame.writeStream doctest by stopping streaming query

Test

[SPARK-45141] - Pin `pyarrow==12.0.1` in CI
[SPARK-46801] - Do not treat exit 5 as a test failure in Python testing script
[SPARK-47472] - Pin `numpy` to 1.23.5 in `dev/infra/Dockerfile`

Task

[SPARK-46182] - Shuffle data lost on decommissioned executor caused by race condition between lastTaskRunningTime and lastShuffleMigrationTime
[SPARK-46628] - Use SPDX short identifier in `licenses` name
[SPARK-47187] - Fix hive compress output config does not work
[SPARK-47432] - Add `pyarrow` upper bound requirement, `<13.0.0`
[SPARK-47433] - Update PySpark package dependency version ranges
[SPARK-47481] - Fix Python linter
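
SPARK-47432 above adds an upper bound of `<13.0.0` on `pyarrow`, and SPARK-45141 pins `pyarrow==12.0.1` in CI. A minimal sketch of checking a version string against that bound might look like the following. The helper names `parse_version`, `below_upper_bound`, and `installed_pyarrow_ok` are hypothetical and not part of PySpark, and the parser assumes plain numeric versions such as `12.0.1`.

```python
from importlib import metadata

PYARROW_UPPER_BOUND = "13.0.0"  # from SPARK-47432: pyarrow must be < 13.0.0


def parse_version(version: str) -> tuple:
    """Turn a plain numeric version like '12.0.1' into (12, 0, 1)."""
    return tuple(int(part) for part in version.split("."))


def below_upper_bound(version: str, upper: str = PYARROW_UPPER_BOUND) -> bool:
    """Return True when `version` is strictly below `upper`."""
    return parse_version(version) < parse_version(upper)


def installed_pyarrow_ok() -> bool:
    """Check the locally installed pyarrow, if any (hypothetical helper)."""
    try:
        return below_upper_bound(metadata.version("pyarrow"))
    except (metadata.PackageNotFoundError, ValueError):
        # Not installed, or a version string this simple parser cannot read.
        return False
```

For example, `below_upper_bound("12.0.1")` holds for the CI pin, while `below_upper_bound("13.0.0")` does not.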

Dependency upgrade

[SPARK-44393] - Upgrade H2 from 2.1.214 to 2.2.220
