Release Notes - Spark - Version 3.2.2

Sub-task

  • [SPARK-37675] - Prevent overwriting of push shuffle merged files once the shuffle is finalized
  • [SPARK-37735] - Add appId interface to KubernetesConf
  • [SPARK-37866] - Set file.encoding to UTF-8 for SBT tests
  • [SPARK-37995] - TPCDS 1TB q72 fails when spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly is false
  • [SPARK-37998] - Use `rbac.authorization.k8s.io/v1` instead of `v1beta1`
  • [SPARK-38013] - AQE can change bhj to smj if no extra shuffle is introduced
  • [SPARK-38019] - ExecutorMonitor.timedOutExecutors should be deterministic
  • [SPARK-38023] - ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as finished
  • [SPARK-38029] - Support docker-desktop K8S integration test in SBT
  • [SPARK-38030] - Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
  • [SPARK-38048] - Add IntegrationTestBackend.describePods to support all K8s test backends
  • [SPARK-38071] - Support K8s namespace parameter in SBT K8s IT
  • [SPARK-38072] - Support K8s imageTag parameter in SBT K8s IT
  • [SPARK-38081] - Support cloud-backend in K8s IT with SBT
  • [SPARK-38180] - Allow safe up-cast expressions in correlated equality predicates
  • [SPARK-38272] - Use docker-desktop instead of docker-for-desktop for Docker K8S IT deployMode and context name
  • [SPARK-38325] - ANSI mode: avoid potential runtime error in HashJoin.extractKeyExprAt()
  • [SPARK-38363] - Avoid runtime error in Dataset.summary() when ANSI mode is on
  • [SPARK-38392] - Add `spark-` prefix to namespaces and `-driver` suffix to drivers during IT
  • [SPARK-38398] - Add `priorityClassName` integration test case
  • [SPARK-38407] - ANSI Cast: loosen the limitation of casting non-null complex types
  • [SPARK-38430] - Add SBT commands to K8s IT readme
  • [SPARK-38538] - Fix driver environment verification in BasicDriverFeatureStepSuite
  • [SPARK-38787] - Possible correctness issue on stream-stream join when handling edge case
  • [SPARK-38809] - Implement option to skip null values in symmetric hash impl of stream-stream joins
  • [SPARK-39553] - Failed to remove shuffle ${shuffleId} - null when using Scala 2.13
  • [SPARK-39611] - PySpark support numpy 1.23.X

Bug

  • [SPARK-30062] - Add IMMEDIATE statement to the DB2 dialect truncate implementation
  • [SPARK-33206] - Spark Shuffle Index Cache calculates memory usage wrong
  • [SPARK-36553] - KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 was introduced
  • [SPARK-37290] - Exponential planning time in case of non-deterministic function
  • [SPARK-37498] - test_reuse_worker_of_parallelize_range is flaky
  • [SPARK-37544] - sequence over dates with month interval is producing incorrect results
  • [SPARK-37554] - Add PyArrow, pandas and plotly to release Docker image dependencies
  • [SPARK-37643] - when charVarcharAsString is true, char datatype partition table query incorrect
  • [SPARK-37690] - Recursive view `df` detected (cycle: `df` -> `df`)
  • [SPARK-37730] - plot.hist throws AttributeError on pandas=1.3.5
  • [SPARK-37793] - Invalid LocalMergedBlockData causes task hang
  • [SPARK-37865] - Spark should not dedup the groupingExpressions when the first child of Union has duplicate columns
  • [SPARK-37932] - Analyzer can fail when join left side and right side are the same view
  • [SPARK-37963] - Need to update Partition URI after renaming table in InMemoryCatalog
  • [SPARK-37977] - Upgrade ORC to 1.6.13
  • [SPARK-38017] - Fix the API doc for window to say it supports TimestampNTZType too as timeColumn
  • [SPARK-38018] - Fix ColumnVectorUtils.populate to handle CalendarIntervalType correctly
  • [SPARK-38042] - Encoder cannot be found when a tuple component is a type alias for an Array
  • [SPARK-38056] - Structured streaming not working in history server when using LevelDB
  • [SPARK-38073] - Update atexit function to avoid issues with late binding
  • [SPARK-38075] - Hive script transform with order by and limit will return fake rows
  • [SPARK-38120] - HiveExternalCatalog.listPartitions is failing when partition column name is upper case and dot in partition value
  • [SPARK-38151] - Handle `Pacific/Kanton` in DateTimeUtilsSuite
  • [SPARK-38178] - Correct the logic to measure the memory usage of RocksDB
  • [SPARK-38185] - Fix data incorrect if aggregate function is empty
  • [SPARK-38198] - Fix `QueryExecution.debug#toFile` use the passed in `maxFields` when `explainMode` is `CodegenMode`
  • [SPARK-38204] - All state operators are at a risk of inconsistency between state partitioning and operator partitioning
  • [SPARK-38221] - Group by a stream of complex expressions fails
  • [SPARK-38236] - Absolute file paths specified in create/alter table are treated as relative
  • [SPARK-38271] - PoissonSampler may output more rows than MaxRows
  • [SPARK-38273] - decodeUnsafeRows's iterators should close underlying input streams
  • [SPARK-38285] - ClassCastException: GenericArrayData cannot be cast to InternalRow
  • [SPARK-38286] - Union's maxRows and maxRowsPerPartition may overflow
  • [SPARK-38304] - Elt() should return null if index is null under ANSI mode
  • [SPARK-38309] - SHS has incorrect percentiles for shuffle read bytes and shuffle total blocks metrics
  • [SPARK-38320] - (flat)MapGroupsWithState can timeout groups which just received inputs in the same microbatch
  • [SPARK-38333] - DPP causes DataSourceScanExec java.lang.NullPointerException
  • [SPARK-38347] - Nullability propagation in transformUpWithNewOutput
  • [SPARK-38379] - Fix Kubernetes Client mode when mounting persistent volume with storage class
  • [SPARK-38411] - Use UTF-8 when doMergeApplicationListingInternal reads event logs
  • [SPARK-38412] - `from` and `to` are swapped in the StateSchemaCompatibilityChecker
  • [SPARK-38416] - Change day to month
  • [SPARK-38436] - Fix `test_ceil` to test `ceil`
  • [SPARK-38446] - Deadlock between ExecutorClassLoader and FileDownloadCallback caused by Log4j
  • [SPARK-38517] - Fix PySpark documentation generation (missing ipython_genutils)
  • [SPARK-38528] - NullPointerException when selecting a generator in a Stream of aggregate expressions
  • [SPARK-38542] - UnsafeHashedRelation should serialize numKeys out
  • [SPARK-38563] - Upgrade to Py4J 0.10.9.5
  • [SPARK-38579] - Requesting Restful API can cause NullPointerException
  • [SPARK-38587] - Validating new location for rename command should use formatted names
  • [SPARK-38614] - Don't push down limit through window that's using percent_rank
  • [SPARK-38631] - Arbitrary shell command injection via Utils.unpack()
  • [SPARK-38652] - uploadFileUri should preserve file scheme
  • [SPARK-38655] - OffsetWindowFunctionFrameBase cannot find the offset row whose input is not null
  • [SPARK-38677] - pyspark hangs in local mode running rdd map operation
  • [SPARK-38684] - Stream-stream outer join has a possible correctness issue due to weakly read consistent on outer iterators
  • [SPARK-38807] - Error when starting spark shell on Windows system
  • [SPARK-38830] - Warn on corrupted block messages
  • [SPARK-38868] - `assert_true` fails unconditionally after `left_outer` joins
  • [SPARK-38905] - Upgrade ORC to 1.6.14
  • [SPARK-38922] - TaskLocation.apply throws NullPointerException
  • [SPARK-38931] - RocksDB File manager would not create initial dfs directory with unknown number of keys on 1st empty checkpoint
  • [SPARK-38955] - Disable lineSep option in 'from_csv' and 'schema_of_csv'
  • [SPARK-38977] - Fix schema pruning with correlated subqueries
  • [SPARK-38990] - date_trunc and trunc both fail with format from column in inline table
  • [SPARK-38992] - Avoid using bash -c in ShellBasedGroupsMappingProvider
  • [SPARK-39060] - Typo in error messages of decimal overflow
  • [SPARK-39061] - Incorrect results or NPE when using Inline function against an array of dynamically created structs
  • [SPARK-39083] - Fix FsHistoryProvider race condition between update and clean app data
  • [SPARK-39084] - df.rdd.isEmpty() results in unexpected executor failure and JVM crash
  • [SPARK-39104] - Null Pointer Exception on unpersist call
  • [SPARK-39107] - Silent change in regexp_replace's handling of empty strings
  • [SPARK-39258] - Fix `Hide credentials in show create table` after SPARK-35378
  • [SPARK-39259] - Timestamps returned by now() and equivalent functions are not consistent in subqueries
  • [SPARK-39283] - Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter
  • [SPARK-39293] - The accumulator of ArrayAggregate should copy the intermediate result if string, struct, array, or map
  • [SPARK-39340] - DS v2 agg pushdown should allow dots in the name of top-level columns
  • [SPARK-39355] - Single column uses quoted to construct UnresolvedAttribute
  • [SPARK-39376] - Do not output duplicated columns in star expansion of subquery alias of NATURAL/USING JOIN
  • [SPARK-39393] - Parquet data source only supports push-down predicate filters for non-repeated primitive types
  • [SPARK-39419] - When the comparator of ArraySort returns null, it should fail.
  • [SPARK-39421] - Sphinx build fails with "node class 'meta' is already registered, its visitors will be overridden"
  • [SPARK-39422] - SHOW CREATE TABLE should suggest 'AS SERDE' for Hive tables with unsupported serde configurations
  • [SPARK-39437] - normalize plan id separately in PlanStabilitySuite
  • [SPARK-39447] - Only non-broadcast query stage can propagate empty relation
  • [SPARK-39476] - Disable Unwrap cast optimize when casting from Long to Float/ Double or from Integer to Float
  • [SPARK-39496] - Inline eval path cannot handle null structs
  • [SPARK-39505] - Escape log content rendered in UI
  • [SPARK-39543] - The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1
  • [SPARK-39548] - CreateView Command with a window clause query hit a wrong window definition not found issue
  • [SPARK-39551] - Add AQE invalid plan check
  • [SPARK-39570] - inline table should allow expressions with alias
  • [SPARK-39575] - ByteBuffer forget to rewind after get in AvroDeserializer
  • [SPARK-39621] - Make run-tests.py robust by avoiding `rmtree` usage
  • [SPARK-39650] - Streaming Deduplication should not check the schema of "value"
  • [SPARK-39672] - NotExists subquery failed with conflicting attributes
  • [SPARK-39758] - NPE on invalid patterns from the regexp functions
  • [SPARK-40804] - Missing handling a catalog name in destination tables in `RenameTableExec`
  • [SPARK-41336] - BroadcastExchange does not support the execute() code path when AQE is enabled

Improvement

  • [SPARK-36808] - Upgrade Kafka to 2.8.1
  • [SPARK-37670] - Support predicate pushdown and column pruning for de-duped CTEs
  • [SPARK-37891] - Add scalastyle check to disable scala.concurrent.ExecutionContext.Implicits.global
  • [SPARK-37934] - Upgrade Jetty version to 9.4.44
  • [SPARK-38007] - Update K8s doc to recommend K8s 1.20+
  • [SPARK-38046] - Fix KafkaSource/KafkaMicroBatch flaky test due to non-deterministic timing
  • [SPARK-38100] - Remove unused method in `Decimal`
  • [SPARK-38184] - Fix malformatted ExpressionDescription of `decode`
  • [SPARK-38211] - Add SQL migration guide on restoring loose upcast from string
  • [SPARK-38279] - Pin markupsafe to 2.0.1 to fix linter failure
  • [SPARK-38305] - Check existence of file before untarring/zipping
  • [SPARK-38353] - Instrument __enter__ and __exit__ magic methods for pandas API on Spark
  • [SPARK-38487] - Fix docstrings of nlargest/nsmallest of DataFrame
  • [SPARK-38570] - Incorrect DynamicPartitionPruning caused by Literal
  • [SPARK-38816] - Wrong comment in random matrix generator in spark-als algorithm
  • [SPARK-38892] - Fix the UT of schema equal assert
  • [SPARK-38936] - Script transform feed thread should have name
  • [SPARK-39030] - Rename sum to avoid shading the builtin Python function
  • [SPARK-39154] - Remove outdated statements on distributed-sequence default index
  • [SPARK-39174] - Catalogs loading swallows missing classname for ClassNotFoundException
  • [SPARK-39240] - Source and binary releases use different tools to generate hashes for integrity

Test

  • [SPARK-38045] - More strict validation on plan check for stream-stream join unit test
  • [SPARK-38080] - Flaky test: StreamingQueryManagerSuite: 'awaitAnyTermination with timeout and resetTerminated'
  • [SPARK-38084] - Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`
  • [SPARK-38142] - Move ArrowColumnVectorSuite to org.apache.spark.sql.vectorized
  • [SPARK-38297] - Fix mypy failure on DataFrame.to_numpy in pandas API on Spark
  • [SPARK-38786] - Test Bug in StatisticsSuite "change stats after add/drop partition command"
  • [SPARK-38927] - Skip NumPy/Pandas tests in `test_rdd.py` if not available
  • [SPARK-39252] - Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty
  • [SPARK-39273] - Make PandasOnSparkTestCase inherit ReusedSQLTestCase
  • [SPARK-39373] - Recover branch-3.2 build broken by SPARK-39273 and SPARK-39252

Task

  • [SPARK-38122] - Update App Key of DocSearch
  • [SPARK-38144] - Remove unused `spark.storage.safetyFraction` config
  • [SPARK-38189] - Add priority scheduling doc for Spark on K8S
  • [SPARK-38318] - regression when replacing a dataset view
  • [SPARK-39367] - Review and fix issues in Scala/Java API docs of SQL module

Dependency upgrade

  • [SPARK-38303] - Upgrade ansi-regex from 5.0.0 to 5.0.1 in /dev
  • [SPARK-39099] - Add dependencies to Dockerfile for building Spark releases
  • [SPARK-39183] - Upgrade Apache Xerces Java to 2.12.2

Documentation

  • [SPARK-38606] - Update document to make a good guide of multiple versions of the Spark Shuffle Service
  • [SPARK-38629] - Two links beneath Spark SQL Guide/Data Sources do not work properly
  • [SPARK-39032] - Incorrectly formatted examples in pyspark.sql.functions.when
  • [SPARK-39219] - Promote Structured Streaming over Spark Streaming
  • [SPARK-39677] - Wrong args item formatting of the regexp functions
