Release Notes - Spark - Version 3.5.1

Sub-task

  • [SPARK-41086] - Consolidate SecondArgumentXXX error to INVALID_PARAMETER_VALUE
  • [SPARK-44495] - Use the latest minikube in K8s IT
  • [SPARK-44508] - Add user guide for Python UDTFs
  • [SPARK-44619] - Free up disk space for container jobs
  • [SPARK-44640] - Improve error messages for Python UDTF returning non iterable
  • [SPARK-44742] - Add Spark version drop down to the PySpark doc site
  • [SPARK-45016] - Add missing `try_remote_functions` annotations
  • [SPARK-45187] - Fix WorkerPage to use the same pattern for `logPage` urls
  • [SPARK-45553] - Deprecate assertPandasOnSparkEqual
  • [SPARK-45561] - Convert TINYINT catalyst properly in MySQL Dialect
  • [SPARK-45652] - SPJ: Handle empty input partitions after dynamic filtering
  • [SPARK-45749] - Fix Spark History Server to sort `Duration` column properly
  • [SPARK-45764] - Make code block copyable
  • [SPARK-45934] - Fix `Spark Standalone` documentation table layout
  • [SPARK-45961] - Document `spark.master.*` configurations
  • [SPARK-46012] - EventLogFileReader should not read rolling logs if appStatus is missing
  • [SPARK-46029] - Escape the single quote, _ and % for DS V2 pushdown
  • [SPARK-46095] - Document REST API for Spark Standalone Cluster
  • [SPARK-46369] - Remove `kill` link from RELAUNCHING drivers in MasterPage
  • [SPARK-46400] - When there are corrupted files in the local maven repo, retry to skip this cache
  • [SPARK-46478] - Revert SPARK-43049
  • [SPARK-46704] - Fix `MasterPage` to sort `Running Drivers` table by `Duration` column correctly
  • [SPARK-46747] - Too Many Shared Locks due to PostgresDialect.getTableExistsQuery - LIMIT 1
  • [SPARK-46789] - Add `VolumeSuite` to K8s IT
  • [SPARK-46817] - Fix `spark-daemon.sh` usage by adding `decommission` command
  • [SPARK-46888] - Fix `Master` to reject worker kill request if decommission is disabled
  • [SPARK-47021] - Fix `kvstore` module to have explicit `commons-lang3` test dependency
  • [SPARK-47023] - Upgrade `aircompressor` to 0.26

Bug

  • [SPARK-39910] - DataFrameReader API cannot read files from hadoop archives (.har)
  • [SPARK-40154] - PySpark: DataFrame.cache docstring gives wrong storage level
  • [SPARK-43393] - Sequence expression can overflow
  • [SPARK-44683] - Logging level isn't passed to RocksDB state store provider correctly
  • [SPARK-44805] - Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true
  • [SPARK-44840] - array_insert() gives wrong results for negative index
  • [SPARK-44843] - flaky test: RocksDBStateStoreStreamingAggregationSuite
  • [SPARK-44880] - Remove unnecessary curly braces at the end of the thread locks info
  • [SPARK-44910] - Encoders.bean does not support superclasses with generic type arguments
  • [SPARK-44971] - [BUG Fix] PySpark StreamingQueryProgress fromJson
  • [SPARK-44973] - Fix ArrayIndexOutOfBoundsException in conv()
  • [SPARK-45014] - Clean up fileserver when cleaning up files, jars and archives in SparkContext
  • [SPARK-45057] - Deadlock caused by rdd replication level of 2
  • [SPARK-45072] - Fix Outerscopes for same cell evaluation
  • [SPARK-45075] - Alter table with invalid default value will not report error
  • [SPARK-45078] - The ArrayInsert function should make explicit casting when element type not equals derived component type
  • [SPARK-45081] - Encoders.bean no longer works with read-only properties
  • [SPARK-45098] - Custom jekyll-redirect-from redirect.html template
  • [SPARK-45106] - percentile_cont gets internal error when user input fails runtime replacement's input type check
  • [SPARK-45117] - Implement missing otherCopyArgs for the MultiCommutativeOp expression
  • [SPARK-45124] - Do not use local user ID for Local Relations
  • [SPARK-45132] - Fix IDENTIFIER clause for functions
  • [SPARK-45142] - Specify the range for Spark Connect dependencies in pyspark base image
  • [SPARK-45167] - Python Spark Connect client does not call `releaseAll`
  • [SPARK-45171] - GenerateExec fails to initialize non-deterministic expressions before use
  • [SPARK-45182] - Ignore task completion from old stage after retrying indeterminate stages
  • [SPARK-45205] - Since version 3.2.0, Spark SQL has taken longer to execute "show partitions", probably because of changes introduced by SPARK-35278
  • [SPARK-45211] - Scala 2.13 daily test failed
  • [SPARK-45227] - Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an executor process randomly gets stuck
  • [SPARK-45237] - Correct the default value of `spark.history.store.hybridStore.diskBackend` in `monitoring.md`
  • [SPARK-45255] - Spark connect client failing with java.lang.NoClassDefFoundError
  • [SPARK-45291] - Use unknown query execution id instead of no such app when id is invalid
  • [SPARK-45306] - Make `InMemoryColumnarBenchmark` use AQE-aware utils to collect plans
  • [SPARK-45311] - Encoder fails on many "NoSuchElementException: None.get" since 3.4.x, search for an encoder for a generic type, and since 3.5.x isn't "an expression encoder"
  • [SPARK-45346] - Parquet schema inference should respect case sensitive flag when merging schema
  • [SPARK-45371] - Fix shading problem in Spark Connect
  • [SPARK-45383] - Missing case for RelationTimeTravel in CheckAnalysis
  • [SPARK-45389] - Correct MetaException matching rule on getting partition metadata
  • [SPARK-45424] - Regression in CSV schema inference when timestamps do not match specified timestampFormat
  • [SPARK-45430] - FramelessOffsetWindowFunctionFrame fails when ignore nulls and offset > # of rows
  • [SPARK-45433] - CSV/JSON schema inference when timestamps do not match specified timestampFormat with only one row on each partition report error
  • [SPARK-45449] - Cache Invalidation Issue with JDBC Table
  • [SPARK-45473] - Incorrect error message for RoundBase
  • [SPARK-45484] - Fix the bug that uses incorrect parquet compression codec lz4raw
  • [SPARK-45498] - Followup: Ignore task completion from old stage after retrying indeterminate stages
  • [SPARK-45508] - Add "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED" so Platform can access cleaner on Java 9+
  • [SPARK-45543] - InferWindowGroupLimit causes bug if the other window functions do not have the same window frame as the rank-like functions
  • [SPARK-45580] - Subquery changes the output schema of the outer query
  • [SPARK-45584] - Execution fails when there are subqueries in TakeOrderedAndProjectExec
  • [SPARK-45592] - AQE and InMemoryTableScanExec correctness bug
  • [SPARK-45604] - Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT on parquet vectorized reader
  • [SPARK-45616] - Usages of ParVector are unsafe because it does not propagate ThreadLocals or SparkSession
  • [SPARK-45631] - Broken backward compatibility in PySpark: StreamingQueryListener due to the addition of onQueryIdle
  • [SPARK-45670] - SparkSubmit does not support --total-executor-cores when deploying on K8s
  • [SPARK-45678] - Cover BufferReleasingInputStream.available under tryOrFetchFailedException
  • [SPARK-45786] - Inaccurate Decimal multiplication and division results
  • [SPARK-45791] - Rename `SparkConnectSessionHodlerSuite.scala` to `SparkConnectSessionHolderSuite.scala`
  • [SPARK-45814] - ArrowConverters.createEmptyArrowBatch may cause memory leak
  • [SPARK-45847] - CliSuite flakiness due to non-sequential guarantee for stdout&stderr
  • [SPARK-45878] - ConcurrentModificationException in CliSuite
  • [SPARK-45883] - Upgrade ORC to 1.9.2
  • [SPARK-45896] - Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]
  • [SPARK-45920] - group by ordinal should be idempotent
  • [SPARK-45935] - Fix RST files link substitutions error
  • [SPARK-45943] - DataSourceV2Relation.computeStats throws IllegalStateException in test mode
  • [SPARK-45963] - Restore documentation for DSv2 API
  • [SPARK-46006] - YarnAllocator fails to clean targetNumExecutorsPerResourceProfileId after YarnSchedulerBackend calls stop
  • [SPARK-46014] - Run RocksDBStateStoreStreamingAggregationSuite on a dedicated JVM
  • [SPARK-46016] - Fix pandas API support list properly
  • [SPARK-46019] - Fix HiveThriftServer2ListenerSuite and ThriftServerPageSuite to create java.io.tmpdir if it doesn't exist
  • [SPARK-46033] - Fix flaky ArithmeticExpressionSuite
  • [SPARK-46062] - CTE reference node does not inherit the flag `isStreaming` from CTE definition node
  • [SPARK-46064] - EliminateEventTimeWatermark does not consider the fact that isStreaming flag can change for current child during resolution
  • [SPARK-46092] - Overflow in Parquet row group filter creation causes incorrect results
  • [SPARK-46189] - Various Pandas functions fail in interpreted mode
  • [SPARK-46239] - Hide Jetty info
  • [SPARK-46274] - Range operator computeStats() proper long conversions
  • [SPARK-46275] - Protobuf: Permissive mode should return null rather than struct with null fields
  • [SPARK-46330] - Loading of Spark UI blocks for a long time when HybridStore enabled
  • [SPARK-46339] - Directory with number name should not be treated as metadata log
  • [SPARK-46388] - HiveAnalysis misses pattern guard `query.resolved`
  • [SPARK-46396] - LegacyFastTimestampFormatter.parseOptional should not throw exception
  • [SPARK-46443] - Decimal precision and scale should be decided by JDBC dialect
  • [SPARK-46453] - SessionHolder doesn't throw exceptions from internalError()
  • [SPARK-46464] - Fix the scroll issue of tables when overflow
  • [SPARK-46466] - vectorized parquet reader should never do rebase for timestamp ntz
  • [SPARK-46480] - Fix NPE when table cache task attempt
  • [SPARK-46514] - Fix HiveMetastoreLazyInitializationSuite
  • [SPARK-46535] - NPE when describe extended a column without col stats
  • [SPARK-46546] - Fix the formatting of tables in `running-on-yarn` pages
  • [SPARK-46562] - Remove retrieval of `keytabFile` from `UserGroupInformation` in `HiveAuthFactory`
  • [SPARK-46577] - HiveMetastoreLazyInitializationSuite leaks hive's SessionState
  • [SPARK-46590] - Coalesce partition assert error after skew join optimization
  • [SPARK-46598] - OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column
  • [SPARK-46602] - CREATE VIEW IF NOT EXISTS should never throw `TABLE_OR_VIEW_ALREADY_EXISTS` exception
  • [SPARK-46609] - avoid exponential explosion in PartitioningPreservingUnaryExecNode
  • [SPARK-46640] - RemoveRedundantAliases does not account for SubqueryExpression when removing aliases
  • [SPARK-46663] - Disable memory profiler for pandas UDFs with iterators
  • [SPARK-46676] - dropDuplicatesWithinWatermark throws error on canonicalizing plan
  • [SPARK-46684] - CoGroup.applyInPandas/Arrow should pass arguments properly
  • [SPARK-46700] - count the last spilling for the shuffle disk spilling bytes metric
  • [SPARK-46763] - ReplaceDeduplicateWithAggregate fails when non-grouping keys have duplicate attributes
  • [SPARK-46769] - Refine timestamp related schema inference
  • [SPARK-46779] - Grouping by subquery with a cached relation can fail
  • [SPARK-46786] - Fix MountVolumesFeatureStep to use ReadWriteOncePod instead of ReadWriteOnce
  • [SPARK-46794] - Incorrect results due to inferred predicate from checkpoint with subquery
  • [SPARK-46796] - RocksDB versionID Mismatch in SST files
  • [SPARK-46855] - Add `sketch` to the dependencies of the `catalyst` module in `module.py`
  • [SPARK-46861] - Avoid Deadlock in DAGScheduler
  • [SPARK-46862] - Incorrect count() of a dataframe loaded from CSV datasource
  • [SPARK-46893] - Remove inline scripts from UI descriptions
  • [SPARK-46945] - Add `spark.kubernetes.legacy.useReadWriteOnceAccessMode` for old K8s clusters
  • [SPARK-47019] - AQE dynamic cache partitioning causes SortMergeJoin to result in data loss
  • [SPARK-47022] - Fix `connect/client/jvm` to have explicit `commons-lang3` test dependency
  • [SPARK-47053] - Docker image for release has to bump versions of some python libraries for 3.5.1
  • [SPARK-47206] - Add official image Dockerfile for Apache Spark 3.5.1
  • [SPARK-47759] - Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate time string

New Feature

  • [SPARK-45360] - Initialize spark session builder configuration from SPARK_REMOTE
  • [SPARK-45706] - Make entire Binder build fail fast during setup
  • [SPARK-45735] - Reenable CatalogTests without Spark Connect
  • [SPARK-46732] - Propagate JobArtifactSet to broadcast execution thread
  • [SPARK-47717] - Support Hive tables as a streaming source and sink

Improvement

  • [SPARK-44833] - Spark Connect reattach when initial ExecutePlan didn't reach server doing too eager Reattach
  • [SPARK-44835] - SparkConnect ReattachExecute could raise before ExecutePlan even attaches.
  • [SPARK-45050] - Improve error message for UNKNOWN io.grpc.StatusRuntimeException
  • [SPARK-45071] - Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
  • [SPARK-45127] - Exclude README.md from document build
  • [SPARK-45250] - Support stage level task resource profile for yarn cluster when dynamic allocation disabled
  • [SPARK-45286] - Add back Matomo analytics to release docs
  • [SPARK-45386] - Correctness issue when persisting using StorageLevel.NONE
  • [SPARK-45419] - Avoid reusing rocksdb sst files in a different rocksdb instance by removing file version map entry of larger versions
  • [SPARK-45459] - Remove the last 2 extra spaces in the automatically generated `sql-error-conditions.md` file
  • [SPARK-45475] - Should use DataFrame.foreachPartition instead of RDD.foreachPartition in JdbcUtils
  • [SPARK-45495] - Support stage level task resource profile for k8s cluster when dynamic allocation disabled
  • [SPARK-45532] - Restore codetabs for the Protobuf Data Source Guide
  • [SPARK-45538] - pyspark connect overwrite_partitions bug
  • [SPARK-45588] - Minor scaladoc improvement in StreamingForeachBatchHelper
  • [SPARK-45640] - Fix flaky ProtobufCatalystDataConversionSuite
  • [SPARK-45751] - The default value of `spark.executor.logs.rolling.maxRetainedFiles` on the official website is incorrect
  • [SPARK-45770] - Fix column resolution in DataFrame.drop
  • [SPARK-45829] - The default value of `spark.executor.logs.rolling.maxSize` on the official website is incorrect
  • [SPARK-45882] - BroadcastHashJoinExec propagate partitioning should respect CoalescedHashPartitioning
  • [SPARK-45974] - Add scan.filterAttributes non-empty judgment for RowLevelOperationRuntimeGroupFiltering
  • [SPARK-45975] - STORE_ASSIGNMENT_POLICY should be reset in HiveCompatibilitySuite
  • [SPARK-46170] - Support inject adaptive query post planner strategy rules in SparkSessionExtensions
  • [SPARK-46286] - Document spark.io.compression.zstd.bufferPool.enabled
  • [SPARK-46380] - Replacing current time prior to inline table eval
  • [SPARK-46425] - Pin the bundler version in CI
  • [SPARK-46600] - Move shared code between SqlConf and SqlApiConf to another object
  • [SPARK-46610] - Create table should throw exception when no value for a key in options
  • [SPARK-47978] - Decouple Spark Go Connect Library versioning from Spark versioning

Test

  • [SPARK-45568] - WholeStageCodegenSparkSubmitSuite flakiness
  • [SPARK-45585] - Fix time format and redirection issues in SparkSubmit tests
  • [SPARK-46801] - Do not treat exit 5 as a test failure in Python testing script
  • [SPARK-46953] - Wrap withTable for a test in ResolveDefaultColumnsSuite

Task

  • [SPARK-44872] - Testing reattachable execute
  • [SPARK-45189] - Creating UnresolvedRelation from TableIdentifier should include the catalog field
  • [SPARK-46182] - Shuffle data lost on decommissioned executor caused by race condition between lastTaskRunningTime and lastShuffleMigrationTime
  • [SPARK-46188] - Fix the CSS of Spark doc's generated tables
  • [SPARK-46547] - Fix deadlock issue between maintenance thread and streaming agg physical operators
  • [SPARK-46628] - Use SPDX short identifier in `licenses` name

Documentation

  • [SPARK-44725] - Document spark.network.timeoutInterval
  • [SPARK-45969] - Document configuration change of executor failure tracker
