Release Notes - ASF JIRA

Release Notes - Spark - Version 2.4.7 - HTML format

Sub-task

[SPARK-32249] - Run GitHub Actions builds in other branches as well
[SPARK-32367] - Fix typo of parameter in KubernetesTestComponents
[SPARK-32695] - Add 'build' and 'project/build.properties' into cache key of SBT and Zinc

Bug

[SPARK-28818] - FrequentItems applies an incorrect schema to the resulting dataframe when nulls are present
[SPARK-31511] - Make BytesToBytesMap iterator() thread-safe
[SPARK-31703] - Changes made by SPARK-26985 break reading Parquet files on big-endian architectures (AIX + LinuxPPC64)
[SPARK-31854] - Different results of query execution with wholestage codegen on and off
[SPARK-31871] - Display the canvas element icon for sorting column
[SPARK-31903] - toPandas with Arrow enabled doesn't show metrics in Query UI.
[SPARK-31911] - Using S3A staging committer, pending uploads are committed more than once and listed incorrectly in _SUCCESS data
[SPARK-31918] - SparkR CRAN check gives a warning with R 4.0.0 on OSX
[SPARK-31923] - Event log cannot be generated when some internal accumulators use unexpected types
[SPARK-31935] - Hadoop file system config should be effective in data source options
[SPARK-31941] - Handling the exception in SparkUI for getSparkUser method
[SPARK-31967] - Loading jobs UI page takes 40 seconds
[SPARK-31968] - write.partitionBy() creates duplicate subdirectories when user provides duplicate columns
[SPARK-31980] - Spark sequence() fails if start and end of range are identical dates
[SPARK-31997] - Should drop test_udtf table when SingleSessionSuite completed
[SPARK-32000] - Fix the flaky test case for partially launched task in barrier mode
[SPARK-32003] - Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
[SPARK-32024] - Disk usage tracker went negative in HistoryServerDiskManager
[SPARK-32028] - App id link in history summary page points to wrong application attempt
[SPARK-32034] - Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown
[SPARK-32035] - Inconsistent AWS environment variables in documentation
[SPARK-32044] - [SS] 2.4 Kafka continuous processing prints misleading initial offsets log
[SPARK-32098] - Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow
[SPARK-32115] - Incorrect results for SUBSTRING on overflow
[SPARK-32131] - Fix AnalysisException messages at UNION/INTERSECT/EXCEPT/MINUS operations
[SPARK-32167] - Nullability of GetArrayStructFields is incorrect
[SPARK-32214] - The type conversion function generated in makeFromJava for "other" type uses a wrong variable.
[SPARK-32238] - Use Utils.getSimpleName to avoid hitting Malformed class name in ScalaUDF
[SPARK-32280] - AnalysisException thrown when query contains several JOINs
[SPARK-32300] - toPandas with no partitions should work
[SPARK-32344] - Unevaluable expr is set to FIRST/LAST ignoreNullsExpr in distinct aggregates
[SPARK-32364] - Use CaseInsensitiveMap for DataFrameReader/Writer options
[SPARK-32372] - "Resolved attribute(s) XXX missing" after dedup conflict references
[SPARK-32377] - CaseInsensitiveMap should be deterministic for addition
[SPARK-32379] - Docker-based Spark release script should use correct CRAN repo
[SPARK-32556] - Fix release script to URI-encode the user-provided passwords
[SPARK-32609] - Incorrect exchange reuse with DataSourceV2
[SPARK-32625] - Log error message when falling back to interpreter mode
[SPARK-32672] - Data corruption in some cached compressed boolean columns
[SPARK-32693] - Compare two dataframes with same schema except nullable property
[SPARK-32771] - The example of expressions.Aggregator in Javadoc / Scaladoc is wrong
[SPARK-32810] - CSV/JSON data sources should avoid globbing paths when inferring schema
[SPARK-32812] - Run tests script for Python fails in certain environments

Improvement

[SPARK-31860] - Only push release tags on success
[SPARK-31889] - Docker release script does not allocate enough memory to reliably publish
[SPARK-31954] - Delete duplicate test cases in HiveQuerySuite
[SPARK-32073] - Drop R < 3.5 support
[SPARK-32089] - Upgrade R version to 4.0.2 in the release DockerFile
[SPARK-32397] - Snapshot artifacts can have differing timestamps, making it hard to consume
[SPARK-32428] - [EXAMPLES] Make BinaryClassificationMetricsExample consistently print the metrics on driver's stdout
[SPARK-32560] - Improve exception message

Test

[SPARK-31966] - Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[SPARK-32318] - Add a test case to EliminateSortsSuite for protecting ORDER BY in DISTRIBUTE BY

Documentation

[SPARK-32674] - Add suggestion for parallel directory listing in tuning doc
