Release Notes - Spark - Version 2.4.7

Sub-task

  • [SPARK-32249] - Run GitHub Actions builds in other branches as well
  • [SPARK-32367] - Fix typo of parameter in KubernetesTestComponents
  • [SPARK-32695] - Add 'build' and 'project/build.properties' into cache key of SBT and Zinc

Bug

  • [SPARK-28818] - FrequentItems applies an incorrect schema to the resulting dataframe when nulls are present
  • [SPARK-31511] - Make BytesToBytesMap iterator() thread-safe
  • [SPARK-31703] - Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)
  • [SPARK-31854] - Different results of query execution with wholestage codegen on and off
  • [SPARK-31871] - Display the canvas element icon for sorting column
  • [SPARK-31903] - toPandas with Arrow enabled doesn't show metrics in Query UI.
  • [SPARK-31911] - Using S3A staging committer, pending uploads are committed more than once and listed incorrectly in _SUCCESS data
  • [SPARK-31918] - SparkR CRAN check gives a warning with R 4.0.0 on OSX
  • [SPARK-31923] - Event log cannot be generated when some internal accumulators use unexpected types
  • [SPARK-31935] - Hadoop file system config should be effective in data source options
  • [SPARK-31941] - Handling the exception in SparkUI for getSparkUser method
  • [SPARK-31967] - Loading jobs UI page takes 40 seconds
  • [SPARK-31968] - write.partitionBy() creates duplicate subdirectories when user provides duplicate columns
  • [SPARK-31980] - Spark sequence() fails if start and end of range are identical dates
  • [SPARK-31997] - Should drop test_udtf table when SingleSessionSuite completed
  • [SPARK-32000] - Fix the flaky test case for partially launched task in barrier mode
  • [SPARK-32003] - Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
  • [SPARK-32024] - Disk usage tracker went negative in HistoryServerDiskManager
  • [SPARK-32028] - App id link in history summary page points to wrong application attempt
  • [SPARK-32034] - Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown
  • [SPARK-32035] - Inconsistent AWS environment variables in documentation
  • [SPARK-32044] - [SS] 2.4 Kafka continuous processing prints misleading initial offsets log
  • [SPARK-32098] - Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow
  • [SPARK-32115] - Incorrect results for SUBSTRING when overflow
  • [SPARK-32131] - Fix AnalysisException messages at UNION/INTERSECT/EXCEPT/MINUS operations
  • [SPARK-32167] - Nullability of GetArrayStructFields is incorrect
  • [SPARK-32214] - The type conversion function generated in makeFromJava for "other" type uses a wrong variable.
  • [SPARK-32238] - Use Utils.getSimpleName to avoid hitting Malformed class name in ScalaUDF
  • [SPARK-32280] - AnalysisException thrown when query contains several JOINs
  • [SPARK-32300] - toPandas with no partitions should work
  • [SPARK-32344] - Unevaluable expr is set to FIRST/LAST ignoreNullsExpr in distinct aggregates
  • [SPARK-32364] - Use CaseInsensitiveMap for DataFrameReader/Writer options
  • [SPARK-32372] - "Resolved attribute(s) XXX missing" after dedup conflict references
  • [SPARK-32377] - CaseInsensitiveMap should be deterministic for addition
  • [SPARK-32379] - Docker-based Spark release script should use correct CRAN repo
  • [SPARK-32556] - Fix release script to URI-encode the user-provided passwords
  • [SPARK-32609] - Incorrect exchange reuse with DataSourceV2
  • [SPARK-32625] - Log error message when falling back to interpreter mode
  • [SPARK-32672] - Data corruption in some cached compressed boolean columns
  • [SPARK-32693] - Compare two dataframes with same schema except nullable property
  • [SPARK-32771] - The example of expressions.Aggregator in Javadoc / Scaladoc is wrong
  • [SPARK-32810] - CSV/JSON data sources should avoid globbing paths when inferring schema
  • [SPARK-32812] - Run tests script for Python fails in certain environments

Improvement

  • [SPARK-31860] - Only push release tags on success
  • [SPARK-31889] - Docker release script does not allocate enough memory to reliably publish
  • [SPARK-31954] - Delete duplicate test cases in HiveQuerySuite
  • [SPARK-32073] - Drop R < 3.5 support
  • [SPARK-32089] - Upgrade R version to 4.0.2 in the release DockerFile
  • [SPARK-32397] - Snapshot artifacts can have differing timestamps, making it hard to consume
  • [SPARK-32428] - [EXAMPLES] Make BinaryClassificationMetricsExample consistently print the metrics on driver's stdout
  • [SPARK-32560] - Improve exception message

Test

  • [SPARK-31966] - Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
  • [SPARK-32318] - Add a test case to EliminateSortsSuite for protecting ORDER BY in DISTRIBUTE BY

Documentation

  • [SPARK-32674] - Add suggestion for parallel directory listing in tuning doc
