Release Notes - ASF JIRA

Release Notes - Spark - Version 3.3.2 - HTML format

Sub-task

[SPARK-38697] - Extend SparkSessionExtensions to inject rules into AQE Optimizer
[SPARK-40872] - Fallback to original shuffle block when a push-merged shuffle chunk is zero-size
[SPARK-41185] - Remove ARM limitation for YuniKorn from docs
[SPARK-41388] - getReusablePVCs should ignore recently created PVCs in the previous batch
[SPARK-42071] - Register scala.math.Ordering$Reverse to KryoSerializer

Bug

[SPARK-32380] - Spark SQL cannot access Hive table when data is in HBase
[SPARK-39404] - Unable to query _metadata in streaming if getBatch returns multiple logical nodes in the DataFrame
[SPARK-40493] - Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
[SPARK-40588] - Sorting issue with partitioned-writing and AQE turned on
[SPARK-40817] - Remote spark.jars URIs ignored for Spark on Kubernetes in cluster mode
[SPARK-40819] - Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type instead of automatically converting to LongType
[SPARK-40829] - STORED AS serde in CREATE TABLE LIKE view does not work
[SPARK-40851] - TimestampFormatter behavior changed when using the latest Java 8/11/17
[SPARK-40869] - KubernetesConf.getResourceNamePrefix creates invalid name prefixes
[SPARK-40874] - Fix broadcasts in Python UDFs when encryption is enabled
[SPARK-40902] - Quick submission of drivers in tests to mesos scheduler results in dropping drivers
[SPARK-40918] - Mismatch between ParquetFileFormat and FileSourceScanExec in # columns for WSCG.isTooManyFields when using _metadata
[SPARK-40924] - Unhex function works incorrectly when input has uneven number of symbols
[SPARK-40932] - Barrier: messages for allGather will be overridden by the following barrier APIs
[SPARK-40963] - ExtractGenerator sets incorrect nullability in new Project
[SPARK-40987] - Avoid creating a directory when deleting a block, causing DAGScheduler to not work
[SPARK-41035] - Incorrect results or NPE when a literal is reused across distinct aggregations
[SPARK-41118] - to_number/try_to_number throws NullPointerException when format is null
[SPARK-41144] - UnresolvedHint should not cause query failure
[SPARK-41151] - Keep built-in file _metadata column nullable value consistent
[SPARK-41154] - Incorrect relation caching for queries with time travel spec
[SPARK-41162] - Anti-join must not be pushed below aggregation with ambiguous predicates
[SPARK-41187] - [Core] LiveExecutor memory leak in AppStatusListener when ExecutorLost happens
[SPARK-41188] - Set executorEnv OMP_NUM_THREADS to be spark.task.cpus by default for spark executor JVM processes
[SPARK-41202] - Update ORC to 1.7.7
[SPARK-41254] - YarnAllocator.rpIdToYarnResource map is not properly updated
[SPARK-41327] - Fix SparkStatusTracker.getExecutorInfos by switching On/OffHeapStorageMemory info
[SPARK-41339] - RocksDB state store WriteBatch doesn't clean up native memory
[SPARK-41350] - allow simple name access of using join hidden columns after subquery alias
[SPARK-41365] - Stages UI page fails to load for proxy in some yarn versions
[SPARK-41375] - Avoid empty latest KafkaSourceOffset
[SPARK-41376] - Executor netty direct memory check should respect spark.shuffle.io.preferDirectBufs
[SPARK-41379] - Inconsistency of spark session in DataFrame in user function for foreachBatch sink in PySpark
[SPARK-41385] - Replace deprecated `.newInstance()` in K8s module
[SPARK-41395] - InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data
[SPARK-41448] - Make consistent MR job IDs in FileBatchWriter and FileFormatWriter
[SPARK-41458] - Correctly transform the SPI services for Yarn Shuffle Service
[SPARK-41468] - Fix PlanExpression handling in EquivalentExpressions
[SPARK-41522] - GA dependencies test failed
[SPARK-41535] - InterpretedUnsafeProjection and InterpretedMutableProjection can corrupt unsafe buffer when used with calendar interval data
[SPARK-41554] - Decimal.changePrecision produces ArrayIndexOutOfBoundsException
[SPARK-41668] - DECODE function returns wrong results when passed NULL
[SPARK-41732] - Session window: analysis rule "SessionWindowing" does not apply tree-pattern based pruning
[SPARK-41989] - PYARROW_IGNORE_TIMEZONE warning can break application logging setup
[SPARK-42084] - Avoid leaking the qualified-access-only restriction
[SPARK-42090] - Introduce sasl retry count in RetryingBlockTransferor
[SPARK-42134] - Fix getPartitionFiltersAndDataFilters() to handle filters without referenced attributes
[SPARK-42157] - `spark.scheduler.mode=FAIR` should provide FAIR scheduler
[SPARK-42176] - Cast boolean to timestamp fails with ClassCastException
[SPARK-42179] - Upgrade ORC to 1.7.8
[SPARK-42188] - Force SBT protobuf version to match Maven on branch 3.2 and 3.3
[SPARK-42201] - `build/sbt` should allow SBT_OPTS to override JVM memory setting
[SPARK-42222] - Spark 3.3 Backport: SPARK-41344 Reading V2 datasource masks underlying error
[SPARK-42259] - ResolveGroupingAnalytics should take care of Python UDAF
[SPARK-42344] - The default size of the CONFIG_MAP_MAXSIZE should not be greater than 1048576
[SPARK-42346] - distinct(count colname) with UNION ALL causes query analyzer bug
[SPARK-42747] - Fix incorrect internal status of LoR and AFT

New Feature

[SPARK-47717] - Support Hive tables as a streaming source and sink

Improvement

[SPARK-38277] - Clear write batch after RocksDB state store's commit
[SPARK-40886] - Bump Jackson Databind 2.13.4.2
[SPARK-40913] - Pin `pytest==7.1.3`
[SPARK-41031] - Upgrade `org.tukaani:xz` to 1.9
[SPARK-41089] - Relocate Netty native arm64 libs
[SPARK-41360] - Avoid BlockManager re-registration if the executor has been lost
[SPARK-41476] - Prevent `README.md` from triggering CIs
[SPARK-41541] - Fix wrong child call in SQLShuffleWriteMetricsReporter.decRecordsWritten()
[SPARK-41962] - Update the import order of scala package in class SpecificParquetRecordReaderBase
[SPARK-42230] - Improve `lint` job by skipping PySpark and SparkR docs if unchanged

Test

[SPARK-41863] - Skip `flake8` tests if the command is not available
[SPARK-41864] - Fix mypy linter errors
[SPARK-42110] - Reduce the number of repetitions in ParquetDeltaEncodingSuite.`random data test`

Task

[SPARK-41415] - SASL Request Retries
[SPARK-41538] - Metadata column should be appended at the end of project list

Dependency upgrade

[SPARK-40801] - Upgrade Apache Commons Text to 1.10
[SPARK-41030] - Upgrade Apache Ivy to 2.5.1
[SPARK-41686] - Upgrade Apache Ivy to 2.5.1

Question

[SPARK-42977] - Spark SQL: disabling vectorized reading failed

Documentation

[SPARK-40983] - Remove Hadoop requirements for zstd mention in Parquet compression codec
