Release Notes - Spark - Version 3.4.2 - HTML format

Sub-task

  • [SPARK-42730] - Update Spark Standalone Mode - Starting a Cluster Manually
  • [SPARK-44641] - SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but conditions unmet
  • [SPARK-44729] - Add canonical links to the PySpark docs page
  • [SPARK-44857] - Fix getBaseURI error in Spark Worker LogPage UI buttons
  • [SPARK-45187] - Fix WorkerPage to use the same pattern for `logPage` urls
  • [SPARK-45652] - SPJ: Handle empty input partitions after dynamic filtering
  • [SPARK-45749] - Fix Spark History Server to sort `Duration` column properly
  • [SPARK-45961] - Document `spark.master.*` configurations
  • [SPARK-46012] - EventLogFileReader should not read rolling logs if appStatus is missing
  • [SPARK-46095] - Document REST API for Spark Standalone Cluster

Bug

  • [SPARK-40154] - PySpark: DataFrame.cache docstring gives wrong storage level
  • [SPARK-42784] - Fix the problem of incomplete creation of subdirectories in push merged localDir
  • [SPARK-43203] - Fix DROP table behavior in session catalog
  • [SPARK-43393] - Sequence expression can overflow
  • [SPARK-44074] - `Logging plan changes for execution` test failed
  • [SPARK-44079] - Json reader crashes when a different schema is present
  • [SPARK-44134] - Can't set resources (GPU/FPGA) to 0 when they are set to positive value in spark-defaults.conf
  • [SPARK-44158] - Remove unused `spark.kubernetes.executor.lostCheck.maxAttempts`
  • [SPARK-44180] - DistributionAndOrderingUtils should apply ResolveTimeZone
  • [SPARK-44184] - Remove a wrong doc about ARROW_PRE_0_15_IPC_FORMAT
  • [SPARK-44215] - Client receives zero chunks in merge meta response, which doesn't trigger fallback to unmerged blocks
  • [SPARK-44241] - Setting io.connectionTimeout/connectionCreationTimeout to zero or a negative value causes executors to be incessantly constructed and destroyed
  • [SPARK-44251] - Potential for incorrect results or NPE when full outer USING join has null key value
  • [SPARK-44313] - Generated column expression validation fails if there is a char/varchar column anywhere in the schema
  • [SPARK-44391] - `url_decode` can fail w/ an internal error
  • [SPARK-44464] - Fix applyInPandasWithStatePythonRunner to output rows that have Null as first column value
  • [SPARK-44494] - K8s-it test failed
  • [SPARK-44513] - Upgrade snappy-java to 1.1.10.3
  • [SPARK-44547] - BlockManagerDecommissioner throws exceptions when migrating RDD cached blocks to fallback storage
  • [SPARK-44581] - ShutdownHookManager gets wrong Hadoop user group information
  • [SPARK-44585] - Fix warning condition in MLLib RankingMetrics ndcgAk
  • [SPARK-44588] - Migrated shuffle blocks are encrypted multiple times when io.encryption is enabled
  • [SPARK-44630] - Revert SPARK-43043 Improve the performance of MapOutputTracker.updateMapOutput
  • [SPARK-44634] - Encoders.bean no longer supports nested beans with type arguments
  • [SPARK-44653] - non-trivial DataFrame unions should not break caching
  • [SPARK-44657] - Incorrect limit handling and config parsing in Arrow collect
  • [SPARK-44670] - Fix the `test_to_excel` tests for python3.7
  • [SPARK-44805] - Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true
  • [SPARK-44813] - The JIRA Python misses our assignee when it searches user again
  • [SPARK-44840] - array_insert() gives wrong results for negative index
  • [SPARK-44843] - flaky test: RocksDBStateStoreStreamingAggregationSuite
  • [SPARK-44846] - PushFoldableIntoBranches in complex grouping expressions may cause bindReference error
  • [SPARK-44854] - Python timedelta to DayTimeIntervalType edge cases bug
  • [SPARK-44871] - Fix PERCENTILE_DISC behaviour
  • [SPARK-44910] - Encoders.bean does not support superclasses with generic type arguments
  • [SPARK-44922] - Disable o.a.p.h.InternalParquetRecordWriter logs for tests to reduce the log volume
  • [SPARK-44925] - K8s default service token file should not be materialized into token
  • [SPARK-44935] - Fix `RELEASE` file to have the correct information in Docker images
  • [SPARK-44940] - Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
  • [SPARK-44973] - Fix ArrayIndexOutOfBoundsException in conv()
  • [SPARK-44990] - CSV conversion performance severely degraded for null fields
  • [SPARK-45054] - HiveExternalCatalog.listPartitions should restore Spark SQL stats
  • [SPARK-45057] - Deadlock caused by rdd replication level of 2
  • [SPARK-45075] - Alter table with invalid default value will not report error
  • [SPARK-45078] - The ArrayInsert function should cast explicitly when the element type does not equal the derived component type
  • [SPARK-45079] - percentile_approx() fails with an internal error on NULL accuracy
  • [SPARK-45081] - Encoders.bean no longer works with read-only properties
  • [SPARK-45100] - reflect() fails with an internal error on NULL class and method
  • [SPARK-45103] - Update ORC to 1.8.5
  • [SPARK-45109] - Fix aes_decrypt and ln in connect
  • [SPARK-45210] - Switch languages consistently across docs for all code snippets (Spark 3.4 and below)
  • [SPARK-45227] - Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an executor process randomly gets stuck
  • [SPARK-45237] - Correct the default value of `spark.history.store.hybridStore.diskBackend` in `monitoring.md`
  • [SPARK-45282] - Join loses records for cached datasets
  • [SPARK-45311] - Encoder fails on many "NoSuchElementException: None.get" since 3.4.x, search for an encoder for a generic type, and since 3.5.x isn't "an expression encoder"
  • [SPARK-45430] - FramelessOffsetWindowFunctionFrame fails when ignore nulls and offset > # of rows
  • [SPARK-45433] - CSV/JSON schema inference reports an error when timestamps do not match the specified timestampFormat and each partition has only one row
  • [SPARK-45473] - Incorrect error message for RoundBase
  • [SPARK-45508] - Add "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED" so Platform can access cleaner on Java 9+
  • [SPARK-45592] - AQE and InMemoryTableScanExec correctness bug
  • [SPARK-45604] - Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT on parquet vectorized reader
  • [SPARK-45670] - SparkSubmit does not support --total-executor-cores when deploying on K8s
  • [SPARK-45678] - Cover BufferReleasingInputStream.available under tryOrFetchFailedException
  • [SPARK-45786] - Inaccurate Decimal multiplication and division results
  • [SPARK-45814] - ArrowConverters.createEmptyArrowBatch may cause memory leak
  • [SPARK-45847] - CliSuite flakiness due to non-sequential guarantee for stdout&stderr
  • [SPARK-45878] - ConcurrentModificationException in CliSuite
  • [SPARK-45884] - Upgrade ORC to 1.8.6
  • [SPARK-45896] - Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]
  • [SPARK-45920] - group by ordinal should be idempotent
  • [SPARK-45935] - Fix RST files link substitutions error
  • [SPARK-45963] - Restore documentation for DSv2 API
  • [SPARK-46006] - YarnAllocator fails to clean targetNumExecutorsPerResourceProfileId after YarnSchedulerBackend calls stop
  • [SPARK-46016] - Fix pandas API support list properly
  • [SPARK-46019] - Fix HiveThriftServer2ListenerSuite and ThriftServerPageSuite to create java.io.tmpdir if it doesn't exist
  • [SPARK-46033] - Fix flaky ArithmeticExpressionSuite
  • [SPARK-46062] - CTE reference node does not inherit the flag `isStreaming` from CTE definition node
  • [SPARK-46064] - EliminateEventTimeWatermark does not consider the fact that isStreaming flag can change for current child during resolution

New Feature

  • [SPARK-45735] - Reenable CatalogTests without Spark Connect

Improvement

  • [SPARK-44206] - Dataset.selectExpr scope Session.active
  • [SPARK-44415] - Upgrade snappy-java to 1.1.10.2
  • [SPARK-44875] - commentor to commenter in merge script
  • [SPARK-44920] - Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient()
  • [SPARK-44929] - Standardize log output for console appender in tests
  • [SPARK-45071] - Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
  • [SPARK-45127] - Exclude README.md from document build
  • [SPARK-45286] - Add back Matomo analytics to release docs
  • [SPARK-45588] - Minor scaladoc improvement in StreamingForeachBatchHelper
  • [SPARK-45640] - Fix flaky ProtobufCatalystDataConversionSuite
  • [SPARK-45751] - The default value of `spark.executor.logs.rolling.maxRetainedFiles` on the official website is incorrect
  • [SPARK-45829] - The default value of `spark.executor.logs.rolling.maxSize` on the official website is incorrect
  • [SPARK-45882] - BroadcastHashJoinExec propagate partitioning should respect CoalescedHashPartitioning

Test

  • [SPARK-44544] - Deduplicate run_python_packaging_tests
  • [SPARK-44553] - Ignoring `connect-check-protos` logic in GA testing
  • [SPARK-44661] - getMapOutputLocation should not throw NPE
  • [SPARK-45568] - WholeStageCodegenSparkSubmitSuite flakiness

Task

Documentation

  • [SPARK-44725] - Document spark.network.timeoutInterval
  • [SPARK-44745] - Document shuffle data recovery from the remounted K8s PVCs
  • [SPARK-44859] - Fix incorrect property name `asyncProgressCheckpointingInterval` in structured streaming doc
