Sub-task
- [SPARK-24266] - Spark client terminates while driver is still running
- [SPARK-27421] - RuntimeException when querying a view on a partitioned parquet table
- [SPARK-32067] - Use unique ConfigMap name for executor pod template
- [SPARK-32119] - ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars
- [SPARK-32247] - scipy installation fails with PyPy
- [SPARK-32436] - Initialize numNonEmptyBlocks in HighlyCompressedMapStatus.readExternal
- [SPARK-33096] - Use LinkedHashMap instead of Map for newlyCreatedExecutors
- [SPARK-33163] - Check the metadata key 'org.apache.spark.legacyDateTime' in Avro/Parquet files
- [SPARK-33176] - Use 11-jre-slim as default in K8s Dockerfile
- [SPARK-33290] - REFRESH TABLE should invalidate cache even though the table itself may not be cached
- [SPARK-33408] - Use R 3.6.3 in K8s R image
- [SPARK-33435] - DSv2: REFRESH TABLE should invalidate caches
- [SPARK-33464] - Add/remove (un)necessary cache and restructure GitHub Actions yaml
- [SPARK-33524] - Change `InMemoryTable` not to use Tuple.hashCode for `BucketTransform`
- [SPARK-33667] - Respect case sensitivity in V1 SHOW PARTITIONS
- [SPARK-33670] - Verify the partition provider is Hive in v1 SHOW TABLE EXTENDED
- [SPARK-33711] - Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
- [SPARK-33727] - `gpg: keyserver receive failed: No name` during K8s IT
- [SPARK-33732] - Kubernetes integration tests don't work with Minikube 1.9+
- [SPARK-33742] - Throw PartitionsAlreadyExistException from HiveExternalCatalog.createPartitions()
- [SPARK-33788] - Throw NoSuchPartitionsException from HiveExternalCatalog.dropPartitions()
- [SPARK-33822] - TPCDS Q5 fails if spark.sql.adaptive.enabled=true
- [SPARK-33844] - InsertIntoDir failed since query column name contains ',' cause column type and column names size not equal
- [SPARK-33891] - Update dynamic allocation related documents
- [SPARK-33950] - ALTER TABLE .. DROP PARTITION doesn't refresh cache
- [SPARK-33963] - `isCached` return `false` for cached Hive table
- [SPARK-34011] - ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
- [SPARK-34027] - ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
- [SPARK-34055] - ALTER TABLE .. ADD PARTITION doesn't refresh cache
- [SPARK-34060] - ALTER TABLE .. DROP PARTITION uncaches Hive table while updating table stats
- [SPARK-34115] - Long runtime on many environment variables
- [SPARK-34213] - LOAD DATA doesn't refresh v1 table cache
- [SPARK-34262] - ALTER TABLE .. SET LOCATION doesn't refresh v1 table cache
- [SPARK-34359] - add a legacy config to restore the output schema of SHOW DATABASES
- [SPARK-34407] - KubernetesClusterSchedulerBackend.stop should clean up K8s resources
Bug
- [SPARK-27428] - Test "metrics StatsD sink with Timer " fails on BigEndian
- [SPARK-31511] - Make BytesToBytesMap iterator() thread-safe
- [SPARK-31952] - The metric of MemoryBytesSpill is incorrect when doing Aggregate
- [SPARK-32110] - -0.0 vs 0.0 is inconsistent
- [SPARK-32598] - Not able to see driver logs in spark history server in standalone mode
- [SPARK-32635] - When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result
- [SPARK-32638] - WidenSetOperationTypes in subquery attribute missing
- [SPARK-32680] - CTAS with V2 catalog wrongly accessed unresolved query
- [SPARK-32691] - Update commons-crypto to v1.1.0
- [SPARK-32693] - Compare two dataframes with same schema except nullable property
- [SPARK-32715] - Broadcast block pieces may memory leak
- [SPARK-32738] - thread safe endpoints may hang due to fatal error
- [SPARK-32753] - Deduplicating and repartitioning the same column create duplicate rows with AQE
- [SPARK-32761] - Planner error when aggregating multiple distinct Constant columns
- [SPARK-32764] - compare of -0.0 < 0.0 return true
- [SPARK-32767] - Bucket join should work if spark.sql.shuffle.partitions larger than bucket number
- [SPARK-32771] - The example of expressions.Aggregator in Javadoc / Scaladoc is wrong
- [SPARK-32776] - Limit in streaming should not be optimized away by PropagateEmptyRelation
- [SPARK-32779] - Spark/Hive3 interaction potentially causes deadlock
- [SPARK-32785] - interval with dangling part should not result in null
- [SPARK-32788] - non-partitioned table scan should not have partition filter
- [SPARK-32794] - Rare corner case error in micro-batch engine with some stateful queries + no-data-batches + V1 streaming sources
- [SPARK-32810] - CSV/JSON data sources should avoid globbing paths when inferring schema
- [SPARK-32812] - Run tests script for Python fails in certain environments
- [SPARK-32813] - Reading parquet rdd in non columnar mode fails in multithreaded environment
- [SPARK-32815] - Fix LibSVM data source loading error on file paths with glob metacharacters
- [SPARK-32819] - Spark SQL aggregate() fails on nested string arrays
- [SPARK-32823] - Standalone Master UI resources in use wrong
- [SPARK-32824] - The error is confusing when resource .amount not provided
- [SPARK-32832] - Use CaseInsensitiveMap for DataStreamReader/Writer options
- [SPARK-32836] - Fix DataStreamReaderWriterSuite to check writer options correctly
- [SPARK-32840] - Invalid interval value can happen to be just adhesive with the unit
- [SPARK-32845] - Add sinkParameter to check sink options robustly in DataStreamReaderWriterSuite
- [SPARK-32865] - python section in quickstart page doesn't display SPARK_VERSION correctly
- [SPARK-32872] - BytesToBytesMap at MAX_CAPACITY exceeds growth threshold
- [SPARK-32877] - Fix Hive UDF not support decimal type in complex type
- [SPARK-32886] - '.../jobs/undefined' link from "Event Timeline" in jobs page
- [SPARK-32887] - Example command in https://spark.apache.org/docs/latest/sql-ref-syntax-aux-show-table.html to be changed
- [SPARK-32897] - SparkSession.builder.getOrCreate should not show deprecation warning of SQLContext
- [SPARK-32898] - totalExecutorRunTimeMs is too big
- [SPARK-32900] - UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in the input and radix sorting is used.
- [SPARK-32901] - UnsafeExternalSorter may cause a SparkOutOfMemoryError to be thrown while spilling
- [SPARK-32905] - ApplicationMaster fails to receive UpdateDelegationTokens message
- [SPARK-32906] - Struct field names should not change after normalizing floats
- [SPARK-32908] - percentile_approx() returns incorrect results
- [SPARK-32977] - [SQL] JavaDoc on Default Save mode Incorrect
- [SPARK-32996] - Handle Option.empty v1.ExecutorSummary#peakMemoryMetrics
- [SPARK-32999] - TreeNode.nodeName should not throw malformed class name error
- [SPARK-33015] - Compute the current date only once
- [SPARK-33018] - Fix compute statistics issue
- [SPARK-33019] - Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default
- [SPARK-33029] - Standalone mode blacklist executors page UI marks driver as blacklisted
- [SPARK-33035] - Updates the obsoleted entries of attribute mapping in QueryPlan#transformUpWithNewOutput
- [SPARK-33043] - RowMatrix is incompatible with spark.driver.maxResultSize=0
- [SPARK-33065] - Expand the stack size of a thread in a test in LocalityPlacementStrategySuite for Java 11
- [SPARK-33089] - avro format does not propagate Hadoop config from DS options to underlying HDFS file system
- [SPARK-33094] - ORC format does not propagate Hadoop config from DS options to underlying HDFS file system
- [SPARK-33100] - Support parse the sql statements with c-style comments
- [SPARK-33101] - LibSVM format does not propagate Hadoop config from DS options to underlying HDFS file system
- [SPARK-33115] - `kvstore` and `unsafe` doc tasks fail
- [SPARK-33118] - CREATE TEMPORARY TABLE fails with location
- [SPARK-33131] - Fix grouping sets with having clause can not resolve qualified col name
- [SPARK-33134] - Incorrect nested complex JSON fields raise an exception
- [SPARK-33136] - Handling nullability for complex types is broken during resolution of V2 write command
- [SPARK-33146] - Encountering an invalid rolling event log folder prevents loading other applications in SHS
- [SPARK-33183] - Bug in optimizer rule EliminateSorts
- [SPARK-33197] - Changes to spark.sql.analyzer.maxIterations do not take effect at runtime
- [SPARK-33230] - FileOutputWriter jobs have duplicate JobIDs if launched in same second
- [SPARK-33260] - SortExec produces incorrect results if sortOrder is a Stream
- [SPARK-33267] - Query with having null in "in" condition against data source V2 source table supporting push down filter fails with NPE
- [SPARK-33268] - Fix bugs for casting data from/to PythonUserDefinedType
- [SPARK-33277] - Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.
- [SPARK-33284] - In the Storage UI page, clicking any field to sort the table will cause the header content to be lost
- [SPARK-33292] - Make Literal ArrayBasedMapData string representation disambiguous
- [SPARK-33306] - TimezoneID is needed when there cast from Date to String
- [SPARK-33313] - R/run-tests.sh is not compatible with testthat >= 3.0
- [SPARK-33333] - Upgrade Jetty to 9.4.28.v20200408
- [SPARK-33338] - GROUP BY using literal map should not fail
- [SPARK-33339] - Pyspark application will hang due to non Exception
- [SPARK-33358] - Spark SQL CLI command processing loop can't exit when one command fails
- [SPARK-33362] - skipSchemaResolution should still require query to be resolved
- [SPARK-33372] - Fix InSet bucket pruning
- [SPARK-33391] - element_at with CreateArray not respect one based index
- [SPARK-33397] - mistakenly generate markdown to html for available-patterns-for-shs-custom-executor-log-ur
- [SPARK-33398] - AnalysisException when loading a PipelineModel with Spark 3
- [SPARK-33402] - Jobs launched in same second have duplicate MapReduce JobIDs
- [SPARK-33404] - "date_trunc" expression returns incorrect results
- [SPARK-33405] - Upgrade commons-compress to 1.20
- [SPARK-33412] - OverwriteByExpression should resolve its delete condition based on the table relation not the input query
- [SPARK-33417] - Correct the behaviour of query filters in TPCDSQueryBenchmark
- [SPARK-33422] - Incomplete menu item display in documentation
- [SPARK-33438] - set -v couldn't dump all the conf entries
- [SPARK-33439] - Use SERIAL_SBT_TESTS=1 for SQL module like Hive module
- [SPARK-33440] - Spark schedules on updating delegation token with 0 interval under some token provider implementation
- [SPARK-33472] - IllegalArgumentException when applying RemoveRedundantSorts before EnsureRequirements
- [SPARK-33483] - Fix rat exclusion patterns and add a LICENSE
- [SPARK-33557] - spark.storage.blockManagerSlaveTimeoutMs default value does not follow spark.network.timeout value when the latter was changed
- [SPARK-33579] - Executors blank page behind proxy
- [SPARK-33588] - Partition spec in SHOW TABLE EXTENDED doesn't respect `spark.sql.caseSensitive`
- [SPARK-33591] - NULL is recognized as the "null" string in partition specs
- [SPARK-33593] - Vector reader got incorrect data with binary partition value
- [SPARK-33611] - Decode Query parameters of the redirect URL for reverse proxy
- [SPARK-33629] - spark.buffer.size not applied in driver from pyspark
- [SPARK-33631] - Clean up `spark.core.connection.ack.wait.timeout` from `configuration.md`
- [SPARK-33635] - Performance regression in Kafka read
- [SPARK-33677] - LikeSimplification should be skipped if pattern contains any escapeChar
- [SPARK-33681] - Increase K8s IT timeout to 3 minutes
- [SPARK-33725] - Upgrade snappy-java to 1.1.8.2
- [SPARK-33726] - Duplicate field names causes wrong answers during aggregation
- [SPARK-33733] - PullOutNondeterministic should check and collect deterministic field
- [SPARK-33740] - hadoop configs in hive-site.xml can overrides pre-existing hadoop ones
- [SPARK-33749] - Exclude target directory in pycodestyle and flake8
- [SPARK-33756] - BytesToBytesMap's iterator hasNext method should be idempotent.
- [SPARK-33757] - Fix the R dependencies build error on GitHub Actions and AppVeyor
- [SPARK-33774] - "Back to Master" returns 500 error in Standalone cluster
- [SPARK-33786] - Cache's storage level is not respected when a table name is altered.
- [SPARK-33793] - Refactor usage of Executor in ExecutorSuite to ensure proper cleanup
- [SPARK-33813] - JDBC datasource fails when reading spatial datatypes with the MS SQL driver
- [SPARK-33819] - SingleFileEventLogFileReader/RollingEventLogFilesFileReader should be `package private`
- [SPARK-33831] - Update Jetty to 9.4.34
- [SPARK-33841] - Jobs disappear intermittently from the SHS under high load
- [SPARK-33853] - EXPLAIN CODEGEN and BenchmarkQueryTest don't show subquery code
- [SPARK-33867] - java.time.Instant and java.time.LocalDate not handled in org.apache.spark.sql.jdbc.JdbcDialect#compileValue
- [SPARK-33900] - Show shuffle read size / records correctly when only remotebytesread is available
- [SPARK-33931] - Recover GitHub Action
- [SPARK-33935] - Fix CBOs cost function
- [SPARK-33942] - Wrong metrics information in Spark Monitoring Documentation
- [SPARK-34000] - ExecutorAllocationListener threw an exception java.util.NoSuchElementException
- [SPARK-34010] - Use python3 instead of python in SQL documentation build
- [SPARK-34012] - Keep behavior consistent when conf `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide
- [SPARK-34084] - ALTER TABLE .. ADD PARTITION does not update table stats
- [SPARK-34103] - Fix MiMaExcludes by moving SPARK-23429 from 2.4.x to 3.0.x
- [SPARK-34144] - java.time.Instant and java.time.LocalDate not handled when writing to tables
- [SPARK-34154] - Flaky Test: LocalityPlacementStrategySuite.handle large number of containers and tasks (SPARK-18750)
- [SPARK-34187] - Use available offset range obtained during polling when checking offset validation
- [SPARK-34200] - ambiguous column reference should consider attribute availability
- [SPARK-34201] - Fix the test code build error for docker-integration-tests in branch-3.0
- [SPARK-34203] - FileSource table null partition can not be dropped
- [SPARK-34212] - For parquet table, after changing the precision and scale of decimal type in hive, spark reads incorrect value
- [SPARK-34217] - Fix Scala 2.12 release profile
- [SPARK-34221] - In a special scenario, the error message of the stage in the UI page is blank
- [SPARK-34223] - NPE for static partition with null in InsertIntoHadoopFsRelationCommand
- [SPARK-34224] - Ensure all resource opened by `Source.fromXXX` method are closed
- [SPARK-34229] - Avro should read decimal values with the file schema
- [SPARK-34231] - AvroSuite has test failure when run from IDE due to bad loading of resource file
- [SPARK-34232] - redact credentials not working when log slow event enabled
- [SPARK-34260] - UnresolvedException when creating temp view twice
- [SPARK-34268] - The Signature for ConcatWs in Spark SQL Docs Is Inconsistent with the Actual Behavior
- [SPARK-34270] - Combine StateStoreMetrics should not override StateStoreCustomMetric
- [SPARK-34273] - Do not reregister BlockManager when SparkContext is stopped
- [SPARK-34318] - Dataset.colRegex should work with column names and qualifiers which contain newlines
- [SPARK-34319] - Self-join after cogroup applyInPandas fails due to unresolved conflicting attributes
- [SPARK-34327] - Omit inlining passwords during build process.
- [SPARK-34346] - io.file.buffer.size set by spark.buffer.size will override by hive-site.xml may cause perf regression
- [SPARK-34405] - The getMetricsSnapshot method of the PrometheusServlet class has a wrong value
- [SPARK-34442] - Pin Sphinx version to `sphinx<3.5.0`
- [SPARK-39752] - Spark job failed with 10M rows data with Broken pipe error
Improvement
- [SPARK-30821] - Executor pods with multiple containers will not be rescheduled unless all containers fail
- [SPARK-32090] - UserDefinedType.equal() does not have symmetry
- [SPARK-32557] - Logging and Swallowing the Exception Per Entry in History Server
- [SPARK-32718] - remove unnecessary keywords for interval units
- [SPARK-32774] - Don't track docs/.jekyll-cache
- [SPARK-32786] - Improve performance for some slow DPP tests
- [SPARK-32791] - non-partitioned table metric should not have dynamic partition pruning time
- [SPARK-33073] - Improve error handling on Pandas to Arrow conversion failures
- [SPARK-33091] - Avoid using map instead of foreach to avoid potential side effect at callers of OrcUtils.readCatalystSchema
- [SPARK-33123] - Ignore `GitHub Action file` change in Amplab Jenkins
- [SPARK-33156] - Upgrade GithubAction image from 18.04 to 20.04
- [SPARK-33162] - Use pre-built image at GitHub Action PySpark jobs
- [SPARK-33170] - Add SQL config to control fast-fail behavior in FileFormatWriter
- [SPARK-33171] - Mark ParquetV*FilterSuite/ParquetV*SchemaPruningSuite as ExtendedSQLTest
- [SPARK-33189] - Support PyArrow 2.0.0+
- [SPARK-33228] - Don't uncache data when replacing an existing view having the same plan
- [SPARK-33239] - Use pre-built image at GitHub Action SparkR job
- [SPARK-33264] - Add a dedicated page for SQL-on-file in SQL documents
- [SPARK-33371] - Support Python 3.9+ in PySpark
- [SPARK-33535] - export LANG to en_US.UTF-8 in jenkins test script
- [SPARK-33660] - Update Kafka Headers Documentation in Structured Streaming
- [SPARK-33675] - Add GitHub Action job to publish snapshot
- [SPARK-33790] - Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
- [SPARK-34059] - Use for/foreach rather than map to make sure execute it eagerly
- [SPARK-34118] - Replaces filter and check for emptiness with exists or forall
- [SPARK-34153] - Remove unused `getRawTable()` from `HiveExternalCatalog.alterPartitions()`
- [SPARK-34181] - Update build doc help document
- [SPARK-34202] - Let HiveExternalCatalogVersionsSuite can run in orgs internal environment
- [SPARK-34275] - Replaces filter and size with count
- [SPARK-34310] - Replaces map and flatten with flatMap
- [SPARK-34431] - Only load hive-site.xml once
Test
- [SPARK-32688] - LiteralGenerator for float and double does not generate special values
- [SPARK-32747] - Deduplicate configuration set/unset in test_sparkSQL_arrow.R
- [SPARK-32876] - Change default fallback versions in HiveExternalCatalogVersionsSuite
- [SPARK-33021] - Move functions related test cases into test_functions.py
- [SPARK-33051] - Uses setup-r to install R in GitHub Actions build
- [SPARK-33153] - HiveExternalCatalogVersionsSuite fails on Ubuntu 20.04
- [SPARK-33165] - Remove dependencies(scalatest,scalactic) from Benchmark
- [SPARK-33190] - Set upperbound of PyArrow version in GitHub Actions
- [SPARK-33770] - Test failures: ALTER TABLE .. DROP PARTITION tries to delete files out of partition path
- [SPARK-33869] - Have a separate metastore dir for each PySpark test process
Documentation
- [SPARK-32306] - `approx_percentile` in Spark SQL gives incorrect results
- [SPARK-32860] - Encoders::bean doc incorrectly states maps are not supported
- [SPARK-32888] - reading a parallelized rdd with two identical records results in a zero count df when read via spark.read.csv
- [SPARK-33181] - SQL Reference: Run SQL on files directly
- [SPARK-33208] - Update the document of SparkSession#sql
- [SPARK-33246] - Spark SQL null semantics documentation is incorrect
- [SPARK-33451] - change 'spark.sql.adaptive.skewedPartitionThresholdInBytes' to 'spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes'
- [SPARK-33585] - The comment for SQLContext.tables() doesn't mention the `database` column