Sub-task
- [SPARK-37675] - Prevent overwriting of push shuffle merged files once the shuffle is finalized
- [SPARK-37735] - Add appId interface to KubernetesConf
- [SPARK-37866] - Set file.encoding to UTF-8 for SBT tests
- [SPARK-37995] - TPCDS 1TB q72 fails when spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly is false
- [SPARK-37998] - Use `rbac.authorization.k8s.io/v1` instead of `v1beta1`
- [SPARK-38013] - AQE can change bhj to smj if no extra shuffle introduce
- [SPARK-38019] - ExecutorMonitor.timedOutExecutors should be deterministic
- [SPARK-38023] - ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as finished
- [SPARK-38029] - Support docker-desktop K8S integration test in SBT
- [SPARK-38030] - Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
- [SPARK-38048] - Add IntegrationTestBackend.describePods to support all K8s test backends
- [SPARK-38071] - Support K8s namespace parameter in SBT K8s IT
- [SPARK-38072] - Support K8s imageTag parameter in SBT K8s IT
- [SPARK-38081] - Support cloud-backend in K8s IT with SBT
- [SPARK-38180] - Allow safe up-cast expressions in correlated equality predicates
- [SPARK-38272] - Use docker-desktop instead of docker-for-desktop for Docker K8S IT deployMode and context name
- [SPARK-38325] - ANSI mode: avoid potential runtime error in HashJoin.extractKeyExprAt()
- [SPARK-38363] - Avoid runtime error in Dataset.summary() when ANSI mode is on
- [SPARK-38392] - Add `spark-` prefix to namespaces and `-driver` suffix to drivers during IT
- [SPARK-38398] - Add `priorityClassName` integration test case
- [SPARK-38407] - ANSI Cast: loosen the limitation of casting non-null complex types
- [SPARK-38430] - Add SBT commands to K8s IT readme
- [SPARK-38538] - Fix driver environment verification in BasicDriverFeatureStepSuite
- [SPARK-38787] - Possible correctness issue on stream-stream join when handling edge case
- [SPARK-38809] - Implement option to skip null values in symmetric hash impl of stream-stream joins
- [SPARK-39553] - Failed to remove shuffle ${shuffleId} - null when using Scala 2.13
- [SPARK-39611] - PySpark support numpy 1.23.X
Bug
- [SPARK-30062] - Add IMMEDIATE statement to the DB2 dialect truncate implementation
- [SPARK-33206] - Spark Shuffle Index Cache calculates memory usage wrong
- [SPARK-36553] - KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 was introduced
- [SPARK-37290] - Exponential planning time in case of non-deterministic function
- [SPARK-37498] - test_reuse_worker_of_parallelize_range is flaky
- [SPARK-37544] - sequence over dates with month interval is producing incorrect results
- [SPARK-37554] - Add PyArrow, pandas and plotly to release Docker image dependencies
- [SPARK-37643] - when charVarcharAsString is true, char datatype partition table query incorrect
- [SPARK-37690] - Recursive view `df` detected (cycle: `df` -> `df`)
- [SPARK-37730] - plot.hist throws AttributeError on pandas=1.3.5
- [SPARK-37793] - Invalid LocalMergedBlockData cause task hang
- [SPARK-37865] - Spark should not dedup the groupingExpressions when the first child of Union has duplicate columns
- [SPARK-37932] - Analyzer can fail when join left side and right side are the same view
- [SPARK-37963] - Need to update Partition URI after renaming table in InMemoryCatalog
- [SPARK-37977] - Upgrade ORC to 1.6.13
- [SPARK-38017] - Fix the API doc for window to say it supports TimestampNTZType too as timeColumn
- [SPARK-38018] - Fix ColumnVectorUtils.populate to handle CalendarIntervalType correctly
- [SPARK-38042] - Encoder cannot be found when a tuple component is a type alias for an Array
- [SPARK-38056] - Structured streaming not working in history server when using LevelDB
- [SPARK-38073] - Update atexit function to avoid issues with late binding
- [SPARK-38075] - Hive script transform with order by and limit will return fake rows
- [SPARK-38120] - HiveExternalCatalog.listPartitions is failing when partition column name is upper case and dot in partition value
- [SPARK-38151] - Handle `Pacific/Kanton` in DateTimeUtilsSuite
- [SPARK-38178] - Correct the logic to measure the memory usage of RocksDB
- [SPARK-38185] - Fix data incorrect if aggregate function is empty
- [SPARK-38198] - Fix `QueryExecution.debug#toFile` use the passed in `maxFields` when `explainMode` is `CodegenMode`
- [SPARK-38204] - All state operators are at a risk of inconsistency between state partitioning and operator partitioning
- [SPARK-38221] - Group by a stream of complex expressions fails
- [SPARK-38236] - Absolute file paths specified in create/alter table are treated as relative
- [SPARK-38271] - PoissonSampler may output more rows than MaxRows
- [SPARK-38273] - decodeUnsafeRows's iterators should close underlying input streams
- [SPARK-38285] - ClassCastException: GenericArrayData cannot be cast to InternalRow
- [SPARK-38286] - Union's maxRows and maxRowsPerPartition may overflow
- [SPARK-38304] - Elt() should return null if index is null under ANSI mode
- [SPARK-38309] - SHS has incorrect percentiles for shuffle read bytes and shuffle total blocks metrics
- [SPARK-38320] - (flat)MapGroupsWithState can timeout groups which just received inputs in the same microbatch
- [SPARK-38333] - DPP cause DataSourceScanExec java.lang.NullPointerException
- [SPARK-38347] - Nullability propagation in transformUpWithNewOutput
- [SPARK-38379] - Fix Kubernetes Client mode when mounting persistent volume with storage class
- [SPARK-38411] - Use UTF-8 when doMergeApplicationListingInternal reads event logs
- [SPARK-38412] - `from` and `to` is swapped in the StateSchemaCompatibilityChecker
- [SPARK-38416] - Change day to month
- [SPARK-38436] - Fix `test_ceil` to test `ceil`
- [SPARK-38446] - Deadlock between ExecutorClassLoader and FileDownloadCallback caused by Log4j
- [SPARK-38517] - Fix PySpark documentation generation (missing ipython_genutils)
- [SPARK-38528] - NullPointerException when selecting a generator in a Stream of aggregate expressions
- [SPARK-38542] - UnsafeHashedRelation should serialize numKeys out
- [SPARK-38563] - Upgrade to Py4J 0.10.9.5
- [SPARK-38579] - Requesting Restful API can cause NullPointerException
- [SPARK-38587] - Validating new location for rename command should use formatted names
- [SPARK-38614] - Don't push down limit through window that's using percent_rank
- [SPARK-38631] - Arbitrary shell command injection via Utils.unpack()
- [SPARK-38652] - uploadFileUri should preserve file scheme
- [SPARK-38655] - OffsetWindowFunctionFrameBase cannot find the offset row whose input is not null
- [SPARK-38677] - pyspark hangs in local mode running rdd map operation
- [SPARK-38684] - Stream-stream outer join has a possible correctness issue due to weakly read consistent on outer iterators
- [SPARK-38807] - Error when starting spark shell on Windows system
- [SPARK-38830] - Warn on corrupted block messages
- [SPARK-38868] - `assert_true` fails unconditionnaly after `left_outer` joins
- [SPARK-38905] - Upgrade ORC to 1.6.14
- [SPARK-38922] - TaskLocation.apply throw NullPointerException
- [SPARK-38931] - RocksDB File manager would not create initial dfs directory with unknown number of keys on 1st empty checkpoint
- [SPARK-38955] - Disable lineSep option in 'from_csv' and 'schema_of_csv'
- [SPARK-38977] - Fix schema pruning with correlated subqueries
- [SPARK-38990] - date_trunc and trunc both fail with format from column in inline table
- [SPARK-38992] - Avoid using bash -c in ShellBasedGroupsMappingProvider
- [SPARK-39060] - Typo in error messages of decimal overflow
- [SPARK-39061] - Incorrect results or NPE when using Inline function against an array of dynamically created structs
- [SPARK-39083] - Fix FsHistoryProvider race condition between update and clean app data
- [SPARK-39084] - df.rdd.isEmpty() results in unexpected executor failure and JVM crash
- [SPARK-39104] - Null Pointer Exeption on unpersist call
- [SPARK-39107] - Silent change in regexp_replace's handling of empty strings
- [SPARK-39258] - Fix `Hide credentials in show create table` after SPARK-35378
- [SPARK-39259] - Timestamps returned by now() and equivalent functions are not consistent in subqueries
- [SPARK-39283] - Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter
- [SPARK-39293] - The accumulator of ArrayAggregate should copy the intermediate result if string, struct, array, or map
- [SPARK-39340] - DS v2 agg pushdown should allow dots in the name of top-level columns
- [SPARK-39355] - Single column uses quoted to construct UnresolvedAttribute
- [SPARK-39376] - Do not output duplicated columns in star expansion of subquery alias of NATURAL/USING JOIN
- [SPARK-39393] - Parquet data source only supports push-down predicate filters for non-repeated primitive types
- [SPARK-39419] - When the comparator of ArraySort returns null, it should fail.
- [SPARK-39421] - Sphinx build fails with "node class 'meta' is already registered, its visitors will be overridden"
- [SPARK-39422] - SHOW CREATE TABLE should suggest 'AS SERDE' for Hive tables with unsupported serde configurations
- [SPARK-39437] - normalize plan id separately in PlanStabilitySuite
- [SPARK-39447] - Only non-broadcast query stage can propagate empty relation
- [SPARK-39476] - Disable Unwrap cast optimize when casting from Long to Float/ Double or from Integer to Float
- [SPARK-39496] - Inline eval path cannot handle null structs
- [SPARK-39505] - Escape log content rendered in UI
- [SPARK-39543] - The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1
- [SPARK-39548] - CreateView Command with a window clause query hit a wrong window definition not found issue
- [SPARK-39551] - Add AQE invalid plan check
- [SPARK-39570] - inline table should allow expressions with alias
- [SPARK-39575] - ByteBuffer forget to rewind after get in AvroDeserializer
- [SPARK-39621] - Make run-tests.py robust by avoiding `rmtree` usage
- [SPARK-39650] - Streaming Deduplication should not check the schema of "value"
- [SPARK-39672] - NotExists subquery failed with conflicting attributes
- [SPARK-39758] - NPE on invalid patterns from the regexp functions
- [SPARK-40804] - Missing handling a catalog name in destination tables in `RenameTableExec`
- [SPARK-41336] - BroadcastExchange does not support the execute() code path. when AQE enabled
Improvement
- [SPARK-36808] - Upgrade Kafka to 2.8.1
- [SPARK-37670] - Support predicate pushdown and column pruning for de-duped CTEs
- [SPARK-37891] - Add scalastyle check to disable scala.concurrent.ExecutionContext.Implicits.global
- [SPARK-37934] - Upgrade Jetty version to 9.4.44
- [SPARK-38007] - Update K8s doc to recommend K8s 1.20+
- [SPARK-38046] - Fix KafkaSource/KafkaMicroBatch flaky test due to non-deterministic timing
- [SPARK-38100] - Remove unused method in `Decimal`
- [SPARK-38184] - Fix malformatted ExpressionDescription of `decode`
- [SPARK-38211] - Add SQL migration guide on restoring loose upcast from string
- [SPARK-38279] - Pin markupsafe to 2.0.1 fix linter failure
- [SPARK-38305] - Check existence of file before untarring/zipping
- [SPARK-38353] - Instrument __enter__ and __exit__ magic methods for pandas API on Spark
- [SPARK-38487] - Fix docstrings of nlargest/nsmallest of DataFrame
- [SPARK-38570] - Incorrect DynamicPartitionPruning caused by Literal
- [SPARK-38816] - Wrong comment in random matrix generator in spark-als algorithm
- [SPARK-38892] - Fix the UT of schema equal assert
- [SPARK-38936] - Script transform feed thread should have name
- [SPARK-39030] - Rename sum to avoid shading the builtin Python function
- [SPARK-39154] - Remove outdated statements on distributed-sequence default index
- [SPARK-39174] - Catalogs loading swallows missing classname for ClassNotFoundException
- [SPARK-39240] - Source and binary releases using different tool to generates hashes for integrity
Test
- [SPARK-38045] - More strict validation on plan check for stream-stream join unit test
- [SPARK-38080] - Flaky test: StreamingQueryManagerSuite: 'awaitAnyTermination with timeout and resetTerminated'
- [SPARK-38084] - Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`
- [SPARK-38142] - Move ArrowColumnVectorSuite to org.apache.spark.sql.vectorized
- [SPARK-38297] - Fix mypy failure on DataFrame.to_numpy in pandas API on Spark
- [SPARK-38786] - Test Bug in StatisticsSuite "change stats after add/drop partition command"
- [SPARK-38927] - Skip NumPy/Pandas tests in `test_rdd.py` if not available
- [SPARK-39252] - Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty
- [SPARK-39273] - Make PandasOnSparkTestCase inherit ReusedSQLTestCase
- [SPARK-39373] - Recover branch-3.2 build broken by SPARK-39273 and SPARK-39252
Task
- [SPARK-38122] - Update App Key of DocSearch
- [SPARK-38144] - Remove unused `spark.storage.safetyFraction` config
- [SPARK-38189] - Add priority scheduling doc for Spark on K8S
- [SPARK-38318] - regression when replacing a dataset view
- [SPARK-39367] - Review and fix issues in Scala/Java API docs of SQL module
Dependency upgrade
- [SPARK-38303] - Upgrade ansi-regex from 5.0.0 to 5.0.1 in /dev
- [SPARK-39099] - Add dependencies to Dockerfile for building Spark releases
- [SPARK-39183] - Upgrade Apache Xerces Java to 2.12.2
Documentation
- [SPARK-38606] - Update document to make a good guide of multiple versions of the Spark Shuffle Service
- [SPARK-38629] - Two links beneath Spark SQL Guide/Data Sources do not work properly
- [SPARK-39032] - Incorrectly formatted examples in pyspark.sql.functions.when
- [SPARK-39219] - Promote Structured Streaming over Spark Streaming
- [SPARK-39677] - Wrong args item formatting of the regexp functions
Edit/Copy Release Notes
The text area below allows the project release notes to be edited and copied to another document.