Release Notes - Spark - Version 3.2.1

Sub-task

  • [SPARK-30789] - Support (IGNORE | RESPECT) NULLS for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE
  • [SPARK-36632] - DivideYMInterval and DivideDTInterval should throw the same exception when dividing by zero
  • [SPARK-36754] - array_intersect should handle Double.NaN and Float.NaN
  • [SPARK-36785] - Fix ps.DataFrame.isin
  • [SPARK-36900] - "SPARK-36464: size returns correct positive number even with over 2GB data" will OOM with JDK 17
  • [SPARK-37023] - Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry
  • [SPARK-37317] - Reduce weights in GaussianMixtureSuite
  • [SPARK-37389] - Check unclosed bracketed comments
  • [SPARK-37442] - In AQE, wrong InMemoryRelation size estimation causes "Cannot broadcast the table that is larger than 8GB: 8 GB" failure
  • [SPARK-37522] - Fix MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction
  • [SPARK-37695] - Skip diagnosis of merged blocks from push-based shuffle
  • [SPARK-37957] - Deterministic flag is not handled for V2 functions
  • [SPARK-38129] - Adaptively enable timeout for BroadcastQueryStageExec

Bug

  • [SPARK-23626] - DAGScheduler blocked due to JobSubmitted event
  • [SPARK-33277] - Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.
  • [SPARK-36464] - Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
  • [SPARK-36717] - Wrong order of variable initialization may lead to incorrect behavior
  • [SPARK-36795] - Explain Formatted has Duplicated Node IDs with InMemoryRelation Present
  • [SPARK-36865] - Add PySpark API document of session_window
  • [SPARK-36905] - Reading Hive view without explicit column names fails in Spark
  • [SPARK-36979] - Add RewriteLateralSubquery rule into nonExcludableRules
  • [SPARK-36993] - Fix json_tuple throwing NPE when fields contain a non-foldable null value
  • [SPARK-37004] - Job cancellation causes py4j errors on Jupyter due to pinned thread mode
  • [SPARK-37046] - Alter view does not preserve column case
  • [SPARK-37049] - executorIdleTimeout is not working for pending pods on K8s
  • [SPARK-37052] - Fix Spark 3.2 so that --verbose can be used with spark-shell
  • [SPARK-37057] - Fix wrong DocSearch facet filter in release-tag.sh
  • [SPARK-37060] - Report driver status does not handle response from backup masters
  • [SPARK-37061] - Custom V2 Metrics uses wrong classname for lookup
  • [SPARK-37064] - Fix outer join returning the wrong max rows when the other side is empty
  • [SPARK-37069] - HiveClientImpl throws NoSuchMethodError: org.apache.hadoop.hive.ql.metadata.Hive.getWithoutRegisterFns
  • [SPARK-37078] - Support old 3-parameter Sink constructors
  • [SPARK-37079] - Fix DataFrameWriterV2.partitionedBy to send the arguments to JVM properly
  • [SPARK-37088] - Python UDF after off-heap vectorized reader can cause crash due to use-after-free in writer thread
  • [SPARK-37089] - ParquetFileFormat registers task completion listeners lazily, causing Python writer thread to segfault when off-heap vectorized reader is enabled
  • [SPARK-37098] - Alter table properties should invalidate cache
  • [SPARK-37117] - Can't read files in one of Parquet encryption modes (external keymaterial)
  • [SPARK-37121] - TestUtils.isPythonVersionAtLeast38 returns incorrect results
  • [SPARK-37147] - MetricsReporter producing NullPointerException when element 'triggerExecution' not present in Map[]
  • [SPARK-37170] - Pin PySpark version installed in the Binder environment for tagged commit
  • [SPARK-37196] - NPE in org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:106)
  • [SPARK-37202] - Temp views don't collect temp functions registered via the catalog API
  • [SPARK-37203] - Fix NotSerializableException when observe with TypedImperativeAggregate
  • [SPARK-37209] - YarnShuffleIntegrationSuite and two other similar cases in `resource-managers` tests fail
  • [SPARK-37217] - The number of dynamic partitions should be checked early when writing to external tables
  • [SPARK-37238] - Upgrade ORC to 1.6.12
  • [SPARK-37252] - Ignore test_memory_limit on non-Linux environment
  • [SPARK-37253] - try_simplify_traceback should not fail when tb_frame.f_lineno is None
  • [SPARK-37260] - PySpark Arrow 3.2.0 docs link invalid
  • [SPARK-37270] - Incorrect result of filter using isNull condition
  • [SPARK-37288] - Backport update pyspark.since annotation to 3.1 and 3.2
  • [SPARK-37302] - Explicitly download the dependencies of guava and jetty-io in test-dependencies.sh
  • [SPARK-37318] - Make FallbackStorageSuite robust in terms of DNS
  • [SPARK-37320] - Delete py_container_checks.zip after the test in DepsTestsSuite finishes
  • [SPARK-37388] - WidthBucket throws NullPointerException in WholeStageCodegenExec
  • [SPARK-37390] - Buggy method retrieval in pyspark.docs.conf.setup
  • [SPARK-37391] - Significant bottleneck introduced by the fix for SPARK-32001
  • [SPARK-37392] - Catalyst optimizer very time-consuming and memory-intensive with some "explode(array)"
  • [SPARK-37451] - Performance improvement regressed String to Decimal cast
  • [SPARK-37452] - Char and Varchar break backward compatibility between v3 and v2
  • [SPARK-37480] - Configurations in docs/running-on-kubernetes.md are not up to date
  • [SPARK-37481] - Disappearance of skipped stages misleads bug hunting
  • [SPARK-37524] - We should drop all tables after testing dynamic partition pruning
  • [SPARK-37534] - Bump dev.ludovic.netlib to 2.2.1
  • [SPARK-37556] - Deserializing the void class fails with Java serialization
  • [SPARK-37577] - ClassCastException: ArrayType cannot be cast to StructType
  • [SPARK-37615] - Upgrade SBT to 1.5.6
  • [SPARK-37633] - Unwrap cast should skip if downcast failed with ansi enabled
  • [SPARK-37654] - Regression - NullPointerException in Row.getSeq when field is null
  • [SPARK-37656] - Upgrade SBT to 1.5.7
  • [SPARK-37659] - Fix FsHistoryProvider race condition between listing and deleting log info
  • [SPARK-37678] - Incorrect annotations in SeriesGroupBy._cleanup_and_return
  • [SPARK-37728] - Reading nested columns with the ORC vectorized reader can cause ArrayIndexOutOfBoundsException
  • [SPARK-37779] - Make ColumnarToRowExec plan canonicalizable after (de)serialization
  • [SPARK-37800] - TreeNode.argString incorrectly formats arguments of type Set[_]
  • [SPARK-37802] - Composite field name like `field name` doesn't work with Aggregate push down
  • [SPARK-37807] - Fix a typo in HttpAuthenticationException message
  • [SPARK-37855] - IllegalStateException when transforming an array inside a nested struct
  • [SPARK-37859] - SQL tables created with JDBC with Spark 3.1 are not readable with 3.2
  • [SPARK-37860] - [BUG] Revert: Fix taskid in the stage page task event timeline
  • [SPARK-37874] - Link to Pandas UDF documentation is broken

New Feature

Improvement

  • [SPARK-34399] - Add file commit time to metrics and show it in the SQL Tab UI
  • [SPARK-35714] - Bug fix for deadlock during executor shutdown
  • [SPARK-36659] - Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config
  • [SPARK-37001] - Disable two-level map for final hash aggregation by default
  • [SPARK-37032] - Remove unusable link in the Spark 3.2.0 docs
  • [SPARK-37058] - Add spark-shell command line unit test
  • [SPARK-37113] - Upgrade Parquet to 1.12.2
  • [SPARK-37134] - Documentation - "Using PySpark Native Features" section is unclear
  • [SPARK-37208] - Support mapping Spark gpu/fpga resource types to custom YARN resource type
  • [SPARK-37214] - Fail query analysis earlier with invalid identifiers
  • [SPARK-37307] - Don't obtain JDBC connection for empty partition
  • [SPARK-37346] - Link the migration guide for Structured Streaming
  • [SPARK-37460] - ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented
  • [SPARK-37505] - Mesos module is missing a log4j.properties file for UTs
  • [SPARK-37513] - date +/- interval with only day-time fields returns a different data type between Spark 3.2 and Spark 3.1
  • [SPARK-37594] - Make UT test("SPARK-34399: Add job commit duration metrics for DataWritingCommand") more stable
  • [SPARK-37705] - Write session time zone in the Parquet file metadata so that rebase can use it instead of JVM timezone
  • [SPARK-37784] - CodeGenerator.addBufferedState() does not properly handle UDTs
  • [SPARK-37959] - Fix the UT of checking norm in KMeans & BiKMeans
  • [SPARK-38639] - Support ignoreCorruptRecord flag so that broken sequence file tables can be queried smoothly

Test

  • [SPARK-37218] - Parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark
  • [SPARK-37322] - `run_scala_tests` should respect test module order
  • [SPARK-37871] - Use python3 instead of python in BaseScriptTransformation tests
  • [SPARK-37987] - Flaky Test: StreamingAggregationSuite.changing schema of state when restarting query - state format version 1

Task

  • [SPARK-37050] - Update conda installation instructions
  • [SPARK-37067] - DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon
  • [SPARK-37446] - Use the invoke method for hive-2.3.9 related APIs
  • [SPARK-37471] - Support nested bracketed comments in spark-sql
  • [SPARK-37497] - Promote ExecutorPods[PollingSnapshot|WatchSnapshot]Source to DeveloperApi

Documentation

  • [SPARK-36791] - Fix spelling mistake in running-on-yarn.md where JHS_POST should be JHS_HOST
  • [SPARK-36939] - Add orphan migration page into list in PySpark documentation
  • [SPARK-37624] - Suppress warnings for live pandas-on-Spark quickstart notebooks
  • [SPARK-37692] - Fix wrong description in sql-migration-guide
  • [SPARK-37818] - Add option for show create table command
