Release Notes - ASF JIRA

Release Notes - Spark - Version 3.3.3 - HTML format

Bug

[SPARK-37829] - An outer-join using joinWith on DataFrames returns Rows with null fields instead of null values
[SPARK-39399] - proxy-user not working for Spark on k8s in cluster deploy mode
[SPARK-39696] - Uncaught exception in thread executor-heartbeater java.util.ConcurrentModificationException: mutation occurred during iteration
[SPARK-41741] - [SQL] ParquetFilters StringStartsWith push down matching string does not use UTF-8
[SPARK-41952] - Upgrade Parquet to fix off-heap memory leaks in Zstd codec
[SPARK-41958] - Disallow arbitrary custom classpath with proxy user in cluster mode
[SPARK-42286] - Fix internal error for valid CASE WHEN expression with CAST when inserting into a table
[SPARK-42445] - Fix SparkR install.spark function
[SPARK-42462] - Prevent `docker-image-tool.sh` from publishing OCI manifests
[SPARK-42473] - An explicit cast will be needed when INSERT OVERWRITE SELECT UNION ALL
[SPARK-42478] - Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory
[SPARK-42516] - Non-captured session time zone in view creation
[SPARK-42553] - NonReserved keyword "interval" can't be column name
[SPARK-42596] - [YARN] OMP_NUM_THREADS not set to number of executor cores by default
[SPARK-42635] - Several counter-intuitive behaviours in the TimestampAdd expression
[SPARK-42649] - Remove the standard Apache License header from the top of third-party source files
[SPARK-42673] - Make build/mvn build Spark only with the verified maven version
[SPARK-42697] - /api/v1/applications returns 0 for duration
[SPARK-42784] - Fix the problem of incomplete creation of subdirectories in push merged localDir
[SPARK-42785] - [K8S][Core] Running spark-submit without --deploy-mode causes an NPE on Kubernetes
[SPARK-42799] - Update SBT build `xercesImpl` version to match with pom.xml
[SPARK-42906] - Replace a starting digit with `x` in resource name prefix
[SPARK-42922] - Use SecureRandom, instead of Random in security sensitive contexts
[SPARK-42937] - Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled
[SPARK-42967] - Fix SparkListenerTaskStart.stageAttemptId when a task is started after the stage is cancelled
[SPARK-43004] - vendor==vendor typo in ResourceRequest.equals()
[SPARK-43005] - `v is v >= 0` typo in pyspark/pandas/config.py
[SPARK-43050] - Fix construct aggregate expressions by replacing grouping functions
[SPARK-43069] - Use `sbt-eclipse` instead of `sbteclipse-plugin`
[SPARK-43113] - Codegen error when full outer join's bound condition has multiple references to the same stream-side column
[SPARK-43158] - Set upperbound of pandas version in binder integrations
[SPARK-43240] - df.describe() method may return wrong result if the last RDD is RDD[UnsafeRow]
[SPARK-43293] - __qualified_access_only should be ignored in normal columns
[SPARK-43337] - Asc/desc arrow icons for the sorting column are not displayed in the table column
[SPARK-43398] - Executor timeout should be max of idleTimeout, rddTimeout, shuffleTimeout
[SPARK-43541] - Incorrect column resolution on FULL OUTER JOIN with USING
[SPARK-43589] - Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`
[SPARK-43718] - References to a specific side's key in a USING join can have wrong nullability
[SPARK-43719] - Handle missing row.excludedInStages field
[SPARK-43956] - Fix the bug where the column's SQL is not displayed for Percentile[Cont|Disc]
[SPARK-43976] - Handle the case where modifiedConfigs doesn't exist in event logs
[SPARK-44040] - Incorrect result after count distinct
[SPARK-44134] - Can't set resources (GPU/FPGA) to 0 when they are set to positive value in spark-defaults.conf
[SPARK-44142] - Utility to convert python types to spark types compares Python "type" object rather than user's "tpe" for categorical data types
[SPARK-44158] - Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts`
[SPARK-44184] - Remove a wrong doc about ARROW_PRE_0_15_IPC_FORMAT
[SPARK-44215] - Client receives zero number of chunks in merge meta response which doesn't trigger fallback to unmerged blocks
[SPARK-44241] - Setting io.connectionTimeout/connectionCreationTimeout to zero or a negative value causes incessant executor construction/destruction
[SPARK-44251] - Potential for incorrect results or NPE when full outer USING join has null key value
[SPARK-44588] - Migrated shuffle blocks are encrypted multiple times when io.encryption is enabled
[SPARK-44653] - Non-trivial DataFrame unions should not break caching
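
A note on SPARK-43005 above: the title quotes the expression `v is v >= 0`. In Python, comparison operators chain, so this expression is evaluated as `(v is v) and (v >= 0)`. The sketch below is not the code from pyspark/pandas/config.py; the function names are hypothetical and only illustrate the language rule, which suggests the quoted expression gives the same result as a plain `v >= 0` check for ordinary numeric values.

```python
def check_with_typo(v):
    # Hypothetical stand-in for the quoted expression.
    # Python chains comparisons: this is (v is v) and (v >= 0).
    return v is v >= 0


def check_intended(v):
    # Hypothetical stand-in for the presumably intended check.
    return v >= 0


# Both checks agree on negative, zero, and positive inputs.
for value in (-1, 0, 5, 2.5):
    assert check_with_typo(value) == check_intended(value)
```

Because `v is v` is always true, the first half of the chain never changes the outcome; the rewrite is mainly about readability.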

Improvement

[SPARK-40376] - `np.bool` will be deprecated
[SPARK-41660] - Only propagate metadata columns if they are used
[SPARK-42647] - Remove aliases from deprecated numpy data types
[SPARK-42934] - Testing OrcEncryptionSuite using maven is always skipped
[SPARK-43395] - Exclude macOS tar extended metadata in make-distribution.sh

Test

[SPARK-43587] - Run HealthTrackerIntegrationSuite in a dedicated JVM

Documentation

[SPARK-43751] - Document the unbase64 behavior change
