Release Notes - Spark - Version 3.3.2

Sub-task

  • [SPARK-38697] - Extend SparkSessionExtensions to inject rules into AQE Optimizer
  • [SPARK-40872] - Fallback to original shuffle block when a push-merged shuffle chunk is zero-size
  • [SPARK-41185] - Remove ARM limitation for YuniKorn from docs
  • [SPARK-41388] - getReusablePVCs should ignore recently created PVCs in the previous batch
  • [SPARK-42071] - Register scala.math.Ordering$Reverse to KryoSerializer

Bug

  • [SPARK-32380] - Spark SQL cannot access Hive table when data is in HBase
  • [SPARK-39404] - Unable to query _metadata in streaming if getBatch returns multiple logical nodes in the DataFrame
  • [SPARK-40493] - Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
  • [SPARK-40588] - Sorting issue with partitioned-writing and AQE turned on
  • [SPARK-40817] - Remote spark.jars URIs ignored for Spark on Kubernetes in cluster mode
  • [SPARK-40819] - Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type instead of automatically converting to LongType
  • [SPARK-40829] - STORED AS serde in CREATE TABLE LIKE view does not work
  • [SPARK-40851] - TimestampFormatter behavior changed when using the latest Java 8/11/17
  • [SPARK-40869] - KubernetesConf.getResourceNamePrefix creates invalid name prefixes
  • [SPARK-40874] - Fix broadcasts in Python UDFs when encryption is enabled
  • [SPARK-40902] - Quick submission of drivers in tests to Mesos scheduler results in dropping drivers
  • [SPARK-40918] - Mismatch between ParquetFileFormat and FileSourceScanExec in # columns for WSCG.isTooManyFields when using _metadata
  • [SPARK-40924] - Unhex function works incorrectly when input has uneven number of symbols
  • [SPARK-40932] - Barrier: messages for allGather will be overridden by the following barrier APIs
  • [SPARK-40963] - ExtractGenerator sets incorrect nullability in new Project
  • [SPARK-40987] - Avoid creating a directory when deleting a block, causing DAGScheduler to not work
  • [SPARK-41035] - Incorrect results or NPE when a literal is reused across distinct aggregations
  • [SPARK-41118] - to_number/try_to_number throws NullPointerException when format is null
  • [SPARK-41144] - UnresolvedHint should not cause query failure
  • [SPARK-41151] - Keep built-in file _metadata column nullable value consistent
  • [SPARK-41154] - Incorrect relation caching for queries with time travel spec
  • [SPARK-41162] - Anti-join must not be pushed below aggregation with ambiguous predicates
  • [SPARK-41187] - [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happens
  • [SPARK-41188] - Set executorEnv OMP_NUM_THREADS to be spark.task.cpus by default for spark executor JVM processes
  • [SPARK-41202] - Update ORC to 1.7.7
  • [SPARK-41254] - YarnAllocator.rpIdToYarnResource map is not properly updated
  • [SPARK-41327] - Fix SparkStatusTracker.getExecutorInfos by switching On/OffHeapStorageMemory info
  • [SPARK-41339] - RocksDB state store WriteBatch doesn't clean up native memory
  • [SPARK-41350] - Allow simple name access of USING join hidden columns after subquery alias
  • [SPARK-41365] - Stages UI page fails to load for proxy in some YARN versions
  • [SPARK-41375] - Avoid empty latest KafkaSourceOffset
  • [SPARK-41376] - Executor netty direct memory check should respect spark.shuffle.io.preferDirectBufs
  • [SPARK-41379] - Inconsistency of spark session in DataFrame in user function for foreachBatch sink in PySpark
  • [SPARK-41385] - Replace deprecated `.newInstance()` in K8s module
  • [SPARK-41395] - InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data
  • [SPARK-41448] - Make consistent MR job IDs in FileBatchWriter and FileFormatWriter
  • [SPARK-41458] - Correctly transform the SPI services for Yarn Shuffle Service
  • [SPARK-41468] - Fix PlanExpression handling in EquivalentExpressions
  • [SPARK-41522] - GA dependencies test failed
  • [SPARK-41535] - InterpretedUnsafeProjection and InterpretedMutableProjection can corrupt unsafe buffer when used with calendar interval data
  • [SPARK-41554] - Decimal.changePrecision produces ArrayIndexOutOfBoundsException
  • [SPARK-41668] - DECODE function returns wrong results when passed NULL
  • [SPARK-41732] - Session window: analysis rule "SessionWindowing" does not apply tree-pattern based pruning
  • [SPARK-41989] - PYARROW_IGNORE_TIMEZONE warning can break application logging setup
  • [SPARK-42084] - Avoid leaking the qualified-access-only restriction
  • [SPARK-42090] - Introduce SASL retry count in RetryingBlockTransferor
  • [SPARK-42134] - Fix getPartitionFiltersAndDataFilters() to handle filters without referenced attributes
  • [SPARK-42157] - `spark.scheduler.mode=FAIR` should provide FAIR scheduler
  • [SPARK-42176] - Cast boolean to timestamp fails with ClassCastException
  • [SPARK-42179] - Upgrade ORC to 1.7.8
  • [SPARK-42188] - Force SBT protobuf version to match Maven on branch 3.2 and 3.3
  • [SPARK-42201] - `build/sbt` should allow SBT_OPTS to override JVM memory setting
  • [SPARK-42222] - Spark 3.3 Backport: SPARK-41344 Reading V2 datasource masks underlying error
  • [SPARK-42259] - ResolveGroupingAnalytics should take care of Python UDAF
  • [SPARK-42344] - The default size of the CONFIG_MAP_MAXSIZE should not be greater than 1048576
  • [SPARK-42346] - distinct(count colname) with UNION ALL causes query analyzer bug
  • [SPARK-42747] - Fix incorrect internal status of LoR and AFT

New Feature

  • [SPARK-47717] - Support Hive tables as a streaming source and sink

Improvement

  • [SPARK-38277] - Clear write batch after RocksDB state store's commit
  • [SPARK-40886] - Bump Jackson Databind 2.13.4.2
  • [SPARK-40913] - Pin `pytest==7.1.3`
  • [SPARK-41031] - Upgrade `org.tukaani:xz` to 1.9
  • [SPARK-41089] - Relocate Netty native arm64 libs
  • [SPARK-41360] - Avoid BlockManager re-registration if the executor has been lost
  • [SPARK-41476] - Prevent `README.md` from triggering CIs
  • [SPARK-41541] - Fix wrong child call in SQLShuffleWriteMetricsReporter.decRecordsWritten()
  • [SPARK-41962] - Update the import order of scala package in class SpecificParquetRecordReaderBase
  • [SPARK-42230] - Improve `lint` job by skipping PySpark and SparkR docs if unchanged

Test

  • [SPARK-41863] - Skip `flake8` tests if the command is not available
  • [SPARK-41864] - Fix mypy linter errors
  • [SPARK-42110] - Reduce the number of repetitions in ParquetDeltaEncodingSuite.`random data test`

Task

  • [SPARK-41415] - SASL Request Retries
  • [SPARK-41538] - Metadata column should be appended at the end of project list

Documentation

  • [SPARK-40983] - Remove Hadoop requirements for zstd mention in Parquet compression codec
