Sub-task
- [SPARK-41527] - Implement DataFrame.observe
- [SPARK-41818] - Support DataFrameWriter.saveAsTable
- [SPARK-41843] - Implement SparkSession.udf
- [SPARK-42020] - createDataFrame with UDT
- [SPARK-42194] - Allow `columns` parameter when creating DataFrame with Series
- [SPARK-42247] - Standardize `returnType` property of UserDefinedFunction
- [SPARK-42340] - Implement Grouped Map API
- [SPARK-42496] - Introducing Spark Connect on the main page and adding Spark Connect Overview page
- [SPARK-42529] - Support Cube and Rollup
- [SPARK-42541] - Support Pivot with provided pivot column values
- [SPARK-42542] - Support Pivot without providing pivot column values
- [SPARK-42570] - Fix DataFrameReader to use the default source
- [SPARK-42615] - Refactor the AnalyzePlan RPC and add `session.version`
- [SPARK-42679] - createDataFrame doesn't work with non-nullable schema
- [SPARK-42706] - Document the Spark SQL error classes in user-facing documentation
- [SPARK-42731] - Update Spark Configuration
- [SPARK-42733] - df.write.format().save() should support calling with no path or table name
- [SPARK-42743] - Support analyze TimestampNTZ columns
- [SPARK-42755] - Factor literal value conversion out to connect-common
- [SPARK-42756] - Helper function to convert proto literal to value in Python Client
- [SPARK-42765] - Enable importing `pandas_udf` from `pyspark.sql.connect.functions`
- [SPARK-42777] - Support converting TimestampNTZ catalog stats to plan stats
- [SPARK-42796] - Support TimestampNTZ in Cached Batch
- [SPARK-42816] - Increase max message size to 128MB
- [SPARK-42818] - Implement DataFrameReader/Writer.jdbc
- [SPARK-42824] - Provide a clear error message for unsupported JVM attributes
- [SPARK-42826] - Add migration notes for update to supported pandas version
- [SPARK-42848] - Implement DataFrame.registerTempTable
- [SPARK-42911] - Introduce more basic exceptions
- [SPARK-42970] - Reuse pyspark.sql.tests.test_arrow test cases
- [SPARK-43071] - Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation
- [SPARK-43072] - Include TIMESTAMP_NTZ in ANSI Compliance doc
- [SPARK-43098] - Should not handle the COUNT bug when the GROUP BY clause of a correlated scalar subquery is non-empty
- [SPARK-43156] - Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
- [SPARK-43336] - Casting between Timestamp and TimestampNTZ requires timezone
- [SPARK-44018] - Improve the hashCode for Some DS V2 Expression
- [SPARK-44168] - Add Apache Spark 3.4.1 Dockerfiles
Bug
- [SPARK-37829] - An outer-join using joinWith on DataFrames returns Rows with null fields instead of null values
- [SPARK-42290] - Spark Driver hangs on OOM during Broadcast when AQE is enabled
- [SPARK-42553] - NonReserved keyword "interval" can't be used as a column name
- [SPARK-42622] - StackOverflowError reading json that does not conform to schema
- [SPARK-42623] - parameter markers not blocked in DDL
- [SPARK-42635] - Several counter-intuitive behaviours in the TimestampAdd expression
- [SPARK-42644] - Add `hive` dependency to `connect` module
- [SPARK-42649] - Remove the standard Apache License header from the top of third-party source files
- [SPARK-42671] - Fix bug for createDataFrame from complex type schema
- [SPARK-42745] - Improved AliasAwareOutputExpression works with DSv2
- [SPARK-42747] - Fix incorrect internal status of LoR and AFT
- [SPARK-42770] - SQLImplicitsTestSuite test failed with Java 17
- [SPARK-42785] - [K8S][Core] spark-submit without --deploy-mode hits an NPE in the Kubernetes case
- [SPARK-42793] - `connect` module requires `build_profile_flags`
- [SPARK-42799] - Update SBT build `xercesImpl` version to match with pom.xml
- [SPARK-42801] - Fix Flaky ClientE2ETestSuite
- [SPARK-42812] - client_type is missing from AddArtifactsRequest proto message
- [SPARK-42817] - Spark driver logs are filled with "Initializing service data for shuffle service using name"
- [SPARK-42820] - Update ORC to 1.8.3
- [SPARK-42852] - Revert NamedLambdaVariable related changes from EquivalentExpressions
- [SPARK-42899] - DataFrame.to(schema) fails when it contains non-nullable nested field in nullable field
- [SPARK-42906] - Replace a starting digit with `x` in resource name prefix
- [SPARK-42922] - Use SecureRandom, instead of Random in security sensitive contexts
- [SPARK-42937] - Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled
- [SPARK-42957] - `release-build.sh` should not remove SBOM artifacts
- [SPARK-42974] - Restore `Utils#createTempDir` use `ShutdownHookManager.registerShutdownDeleteDir` to cleanup tempDir
- [SPARK-43004] - vendor==vendor typo in ResourceRequest.equals()
- [SPARK-43005] - `v is v >= 0` typo in pyspark/pandas/config.py
- [SPARK-43006] - self.deserialized == self.deserialized typo in StorageLevel __eq__()
- [SPARK-43050] - Fix construct aggregate expressions by replacing grouping functions
- [SPARK-43067] - Error class resource file in Kafka connector is misplaced
- [SPARK-43069] - Use `sbt-eclipse` instead of `sbteclipse-plugin`
- [SPARK-43113] - Codegen error when full outer join's bound condition has multiple references to the same stream-side column
- [SPARK-43125] - Connect Server Can't Handle Exception With Null Message Normally
- [SPARK-43126] - Mark two Hive UDF expressions as stateful
- [SPARK-43141] - Ignore generated Java files in checkstyle
- [SPARK-43157] - TreeNode tags can become corrupted and hang driver when the dataset is cached
- [SPARK-43249] - df.sql() should send metrics back
- [SPARK-43281] - Fix concurrent writer does not update file metrics
- [SPARK-43293] - __qualified_access_only should be ignored in normal columns
- [SPARK-43329] - Driver and executors share the same Kubernetes PVC in Spark 3.4+
- [SPARK-43337] - Asc/desc arrow icons for sorting column does not get displayed in the table column
- [SPARK-43340] - Handle missing stack-trace field in eventlogs
- [SPARK-43342] - Revert SPARK-39006 Show a directional error message for executor PVC dynamic allocation failure
- [SPARK-43373] - Revert [SPARK-39203][SQL] Rewrite table location to absolute URI based on database URI
- [SPARK-43378] - SerializerHelper.deserializeFromChunkedBuffer leaks deserialization streams
- [SPARK-43398] - Executor timeout should be the max of idleTimeout, rddTimeout, and shuffleTimeout
- [SPARK-43404] - Filter current version while reusing sst files for RocksDB state store provider while uploading to DFS to prevent id mismatch
- [SPARK-43425] - Add TimestampNTZType to ColumnarBatchRow
- [SPARK-43441] - makeDotNode should not fail when DeterministicLevel is absent
- [SPARK-43471] - Handle missing hadoopProperties and metricsProperties
- [SPARK-43510] - Spark application hangs when YarnAllocator adds running executors after processing completed containers
- [SPARK-43522] - Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'
- [SPARK-43527] - Fix catalog.listCatalogs in PySpark
- [SPARK-43541] - Incorrect column resolution on FULL OUTER JOIN with USING
- [SPARK-43547] - Update "Supported Pandas API" page to point out the proper pandas docs
- [SPARK-43589] - Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`
- [SPARK-43718] - References to a specific side's key in a USING join can have wrong nullability
- [SPARK-43719] - Handle missing row.excludedInStages field
- [SPARK-43758] - Upgrade snappy-java to 1.1.10.0
- [SPARK-43759] - Expose TimestampNTZType in pyspark.sql.types
- [SPARK-43760] - Incorrect attribute nullability after RewriteCorrelatedScalarSubquery leads to incorrect query results
- [SPARK-43802] - unbase64 and unhex codegen are invalid with failOnError
- [SPARK-43949] - Upgrade Cloudpickle to 2.2.1
- [SPARK-43956] - Fix the bug where the column's SQL is not displayed for Percentile[Cont|Disc]
- [SPARK-43973] - Structured Streaming UI should display failed queries correctly
- [SPARK-43976] - Handle the case where modifiedConfigs doesn't exist in event logs
- [SPARK-44040] - Incorrect result after count distinct
- [SPARK-44053] - Update ORC to 1.8.4
- [SPARK-44136] - StateManager may get materialized in executor instead of driver in FlatMapGroupsWithStateExec
- [SPARK-44142] - Utility to convert python types to spark types compares Python "type" object rather than user's "tpe" for categorical data types
- [SPARK-44383] - Fix the trim logic that didn't handle ASCII control characters correctly
New Feature
- [SPARK-42555] - Add JDBC to DataFrameReader
- [SPARK-42557] - Add Broadcast to functions
- [SPARK-42558] - Implement DataFrameStatFunctions
- [SPARK-42559] - Implement DataFrameNaFunctions
- [SPARK-42560] - Implement ColumnName
- [SPARK-42561] - Add TempView APIs to Dataset
- [SPARK-42564] - Implement Dataset.version and Dataset.time
- [SPARK-42576] - Add 2nd groupBy method to Dataset
- [SPARK-42580] - Add initial typed Dataset APIs
- [SPARK-42581] - Add SparkSession implicits
- [SPARK-42586] - Implement RuntimeConf
- [SPARK-42605] - Implement TypedColumn
- [SPARK-42631] - Support custom extensions in Spark Connect Scala client
- [SPARK-42639] - Add createDataFrame/createDataset to SparkSession
- [SPARK-42691] - Implement Dataset.semanticHash
- [SPARK-42702] - Support parameterized CTE
- [SPARK-47717] - Support Hive tables as a streaming source and sink
Improvement
- [SPARK-42421] - Use the utils to get the switch for dynamic allocation used in local checkpoint
- [SPARK-42519] - Add more WriteTo tests after Scala Client session config is supported
- [SPARK-42533] - SSL support for Scala Client
- [SPARK-42538] - `functions#lit` support more types
- [SPARK-42573] - Enable binary compatibility tests for SparkSession/Dataset/Column/functions
- [SPARK-42575] - Replace `AnyFunSuite` with `ConnectFunSuite` for scala client tests
- [SPARK-42647] - Remove aliases from deprecated numpy data types
- [SPARK-42721] - Add an Interceptor to log RPCs in connect-server
- [SPARK-42754] - Spark 3.4 history server's SQL tab incorrectly groups SQL executions when replaying event logs from Spark 3.3 and earlier
- [SPARK-42757] - Implement textFile for DataFrameReader
- [SPARK-42767] - Add a check condition to start the Connect server with `in-memory` as fallback, and auto-skip tests that strongly depend on Hive
- [SPARK-42778] - QueryStageExec should respect supportsRowBased
- [SPARK-42823] - spark-sql shell supports multipart namespaces for initialization
- [SPARK-42888] - Upgrade GCS connector to 2.2.11
- [SPARK-42894] - Implement cache, persist, unpersist, and storageLevel
- [SPARK-42927] - Make `o.a.spark.util.Iterators#size` as `private[spark]`
- [SPARK-42930] - Change the access scope of `ProtobufSerDe` related implementations to `private[spark]`
- [SPARK-42934] - Testing OrcEncryptionSuite using maven is always skipped
- [SPARK-43284] - _metadata.file_path regression
- [SPARK-43374] - Protobuf Licensed under BSD-3 not BSD-2 clause
- [SPARK-43395] - Exclude macOS tar extended metadata in make-distribution.sh
- [SPARK-43414] - Fix flakiness in Kafka RDD suites due to port binding configuration issue
- [SPARK-43894] - df.cache() not working
Test
- [SPARK-43083] - Mark `*StateStoreSuite` as `ExtendedSQLTest`
- [SPARK-43450] - Add more `_metadata` filter test cases
- [SPARK-43587] - Run HealthTrackerIntegrationSuite in a dedicated JVM
Task
- [SPARK-42467] - Spark Connect Scala Client: GroupBy and Aggregation
- [SPARK-42531] - Scala Client Add Collection Functions
- [SPARK-42544] - Spark Connect Scala Client: support parameterized SQL
- [SPARK-42640] - Remove stale entries from the excluding rules for CompatibilitySuite
- [SPARK-42667] - Spark Connect: newSession API
- [SPARK-42688] - Rename Connect proto Request client_id to session_id
Dependency upgrade
- [SPARK-44070] - Bump snappy-java 1.1.10.1
Documentation
- [SPARK-42773] - Minor grammatical change to "Supports Spark Connect" message
- [SPARK-42797] - Spark Connect - Grammatical improvements to Spark Overview and Spark Connect Overview doc pages
- [SPARK-43139] - Bug in INSERT INTO documentation
- [SPARK-43517] - Add a migration guide for namedtuple monkey patch
- [SPARK-43751] - Document for unbase64 behavior change
- [SPARK-44038] - Update YuniKorn docs with v1.3