Release Notes - Spark - Version 3.4.1

Sub-task

  • [SPARK-41527] - Implement DataFrame.observe
  • [SPARK-41818] - Support DataFrameWriter.saveAsTable
  • [SPARK-41843] - Implement SparkSession.udf
  • [SPARK-42020] - createDataFrame with UDT
  • [SPARK-42194] - Allow `columns` parameter when creating DataFrame with Series.
  • [SPARK-42247] - Standardize `returnType` property of UserDefinedFunction
  • [SPARK-42340] - Implement Grouped Map API
  • [SPARK-42496] - Introducing Spark Connect on the main page and adding Spark Connect Overview page
  • [SPARK-42529] - Support Cube and Rollup
  • [SPARK-42541] - Support Pivot with provided pivot column values
  • [SPARK-42542] - Support Pivot without providing pivot column values
  • [SPARK-42570] - Fix DataFrameReader to use the default source
  • [SPARK-42615] - Refactor the AnalyzePlan RPC and add `session.version`
  • [SPARK-42679] - createDataFrame doesn't work with non-nullable schema.
  • [SPARK-42706] - Document the Spark SQL error classes in user-facing documentation.
  • [SPARK-42731] - Update Spark Configuration
  • [SPARK-42733] - df.write.format().save() should support calling with no path or table name
  • [SPARK-42743] - Support analyze TimestampNTZ columns
  • [SPARK-42755] - Factor literal value conversion out to connect-common
  • [SPARK-42756] - Helper function to convert proto literal to value in Python Client
  • [SPARK-42765] - Enable importing `pandas_udf` from `pyspark.sql.connect.functions`
  • [SPARK-42777] - Support converting TimestampNTZ catalog stats to plan stats
  • [SPARK-42796] - Support TimestampNTZ in Cached Batch
  • [SPARK-42816] - Increase max message size to 128MB
  • [SPARK-42818] - Implement DataFrameReader/Writer.jdbc
  • [SPARK-42824] - Provide a clear error message for unsupported JVM attributes.
  • [SPARK-42826] - Add migration notes for update to supported pandas version.
  • [SPARK-42848] - Implement DataFrame.registerTempTable
  • [SPARK-42911] - Introduce more basic exceptions.
  • [SPARK-42970] - Reuse pyspark.sql.tests.test_arrow test cases
  • [SPARK-43071] - Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation
  • [SPARK-43072] - Include TIMESTAMP_NTZ in ANSI Compliance doc
  • [SPARK-43098] - Should not handle the COUNT bug when the GROUP BY clause of a correlated scalar subquery is non-empty
  • [SPARK-43156] - Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
  • [SPARK-43336] - Casting between Timestamp and TimestampNTZ requires timezone
  • [SPARK-44018] - Improve the hashCode for Some DS V2 Expression
  • [SPARK-44168] - Add Apache Spark 3.4.1 Dockerfiles
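Several of the sub-tasks above (SPARK-42743, SPARK-42777, SPARK-42796, SPARK-43072, SPARK-43336, SPARK-43425) concern the TIMESTAMP_NTZ type, a timestamp without a time zone. As background, the semantic distinction mirrors naive vs. timezone-aware datetimes in plain Python (this is an illustrative sketch, not Spark code):

```python
from datetime import datetime, timezone

# A naive datetime carries no time zone, like Spark's TIMESTAMP_NTZ:
# it denotes a wall-clock reading not tied to any particular instant.
ntz_like = datetime(2023, 6, 1, 12, 0, 0)

# An aware datetime is anchored to an instant, like Spark's TIMESTAMP
# (local-time-zone semantics), which is interpreted relative to a zone.
ltz_like = datetime(2023, 6, 1, 12, 0, 0, tzinfo=timezone.utc)

print(ntz_like.tzinfo)  # None: no zone attached
print(ltz_like.tzinfo)  # UTC

# This is why SPARK-43336 notes that casting between the two types
# requires a time zone: attaching or stripping a zone changes meaning.
converted = ntz_like.replace(tzinfo=timezone.utc)
```

The same reasoning explains why TIMESTAMP_NTZ needed explicit support in caching, column statistics, and columnar batches: the type cannot be silently normalized to an instant without choosing a zone.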

Bug

  • [SPARK-37829] - An outer-join using joinWith on DataFrames returns Rows with null fields instead of null values
  • [SPARK-42290] - Spark Driver hangs on OOM during Broadcast when AQE is enabled
  • [SPARK-42553] - NonReserved keyword "interval" can't be a column name
  • [SPARK-42622] - StackOverflowError reading json that does not conform to schema
  • [SPARK-42623] - parameter markers not blocked in DDL
  • [SPARK-42635] - Several counter-intuitive behaviours in the TimestampAdd expression
  • [SPARK-42644] - Add `hive` dependency to `connect` module
  • [SPARK-42649] - Remove the standard Apache License header from the top of third-party source files
  • [SPARK-42671] - Fix bug for createDataFrame from complex type schema
  • [SPARK-42745] - Improved AliasAwareOutputExpression works with DSv2
  • [SPARK-42747] - Fix incorrect internal status of LoR and AFT
  • [SPARK-42770] - SQLImplicitsTestSuite test failed with Java 17
  • [SPARK-42785] - [K8S][Core] NPE when spark-submit is run without --deploy-mode in the Kubernetes case
  • [SPARK-42793] - `connect` module requires `build_profile_flags`
  • [SPARK-42799] - Update SBT build `xercesImpl` version to match with pom.xml
  • [SPARK-42801] - Fix Flaky ClientE2ETestSuite
  • [SPARK-42812] - client_type is missing from AddArtifactsRequest proto message
  • [SPARK-42817] - Spark driver logs are filled with "Initializing service data for shuffle service using name"
  • [SPARK-42820] - Update ORC to 1.8.3
  • [SPARK-42852] - Revert NamedLambdaVariable related changes from EquivalentExpressions
  • [SPARK-42899] - DataFrame.to(schema) fails when it contains non-nullable nested field in nullable field
  • [SPARK-42906] - Replace a starting digit with `x` in resource name prefix
  • [SPARK-42922] - Use SecureRandom, instead of Random in security sensitive contexts
  • [SPARK-42937] - Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled
  • [SPARK-42957] - `release-build.sh` should not remove SBOM artifacts
  • [SPARK-42974] - Restore `Utils#createTempDir` use `ShutdownHookManager.registerShutdownDeleteDir` to cleanup tempDir
  • [SPARK-43004] - vendor==vendor typo in ResourceRequest.equals()
  • [SPARK-43005] - `v is v >= 0` typo in pyspark/pandas/config.py
  • [SPARK-43006] - self.deserialized == self.deserialized typo in StorageLevel __eq__()
  • [SPARK-43050] - Fix construct aggregate expressions by replacing grouping functions
  • [SPARK-43067] - Error class resource file in Kafka connector is misplaced
  • [SPARK-43069] - Use `sbt-eclipse` instead of `sbteclipse-plugin`
  • [SPARK-43113] - Codegen error when full outer join's bound condition has multiple references to the same stream-side column
  • [SPARK-43125] - Connect Server Can't Handle Exception With Null Message Normally
  • [SPARK-43126] - mark two Hive UDF expressions as stateful
  • [SPARK-43141] - Ignore generated Java files in checkstyle
  • [SPARK-43157] - TreeNode tags can become corrupted and hang driver when the dataset is cached
  • [SPARK-43249] - df.sql() should send metrics back
  • [SPARK-43281] - Fix concurrent writer does not update file metrics
  • [SPARK-43293] - __qualified_access_only should be ignored in normal columns
  • [SPARK-43329] - driver and executors shared same Kubernetes PVC in Spark 3.4+
  • [SPARK-43337] - Asc/desc arrow icons for sorting columns do not get displayed in the table column
  • [SPARK-43340] - Handle missing stack-trace field in eventlogs
  • [SPARK-43342] - Revert SPARK-39006 Show a directional error message for executor PVC dynamic allocation failure
  • [SPARK-43373] - Revert [SPARK-39203][SQL] Rewrite table location to absolute URI based on database URI
  • [SPARK-43378] - SerializerHelper.deserializeFromChunkedBuffer leaks deserialization streams
  • [SPARK-43398] - Executor timeout should be max of idleTimeout rddTimeout shuffleTimeout
  • [SPARK-43404] - Filter current version while reusing sst files for RocksDB state store provider while uploading to DFS to prevent id mismatch
  • [SPARK-43425] - Add TimestampNTZType to ColumnarBatchRow
  • [SPARK-43441] - makeDotNode should not fail when DeterministicLevel is absent
  • [SPARK-43471] - Handle missing hadoopProperties and metricsProperties
  • [SPARK-43510] - Spark application hangs when YarnAllocator adds running executors after processing completed containers
  • [SPARK-43522] - Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'
  • [SPARK-43527] - Fix catalog.listCatalogs in PySpark
  • [SPARK-43541] - Incorrect column resolution on FULL OUTER JOIN with USING
  • [SPARK-43547] - Update "Supported Pandas API" page to point out the proper pandas docs
  • [SPARK-43589] - Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString`
  • [SPARK-43718] - References to a specific side's key in a USING join can have wrong nullability
  • [SPARK-43719] - Handle missing row.excludedInStages field
  • [SPARK-43758] - Upgrade snappy-java to 1.1.10.0
  • [SPARK-43759] - Expose TimestampNTZType in pyspark.sql.types
  • [SPARK-43760] - Incorrect attribute nullability after RewriteCorrelatedScalarSubquery leads to incorrect query results
  • [SPARK-43802] - unbase64 and unhex codegen are invalid with failOnError
  • [SPARK-43949] - Upgrade Cloudpickle to 2.2.1
  • [SPARK-43956] - Fix the bug where the column's SQL isn't displayed for Percentile[Cont|Disc]
  • [SPARK-43973] - Structured Streaming UI should display failed queries correctly
  • [SPARK-43976] - Handle the case where modifiedConfigs doesn't exist in event logs
  • [SPARK-44040] - Incorrect result after count distinct
  • [SPARK-44053] - Update ORC to 1.8.4
  • [SPARK-44136] - StateManager may get materialized in executor instead of driver in FlatMapGroupsWithStateExec
  • [SPARK-44142] - Utility to convert python types to spark types compares Python "type" object rather than user's "tpe" for categorical data types
  • [SPARK-44383] - Fix the trim logic that didn't handle ASCII control characters correctly
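Three of the fixes above (SPARK-43004, SPARK-43005, SPARK-43006) belong to the same family of self-comparison typos, where a value is compared against itself and the check becomes vacuous; SPARK-43005's `v is v >= 0` is the chained-comparison variant. A minimal sketch of the pattern in plain Python (illustrative class names, not the actual Spark code):

```python
class StorageLevelBuggy:
    """Sketch of the SPARK-43006 pattern: __eq__ compares a field to itself."""

    def __init__(self, use_disk, deserialized):
        self.use_disk = use_disk
        self.deserialized = deserialized

    def __eq__(self, other):
        # Bug: `self.deserialized == self.deserialized` is always True,
        # so the `deserialized` flag is silently ignored in equality checks.
        return (self.use_disk == other.use_disk
                and self.deserialized == self.deserialized)


class StorageLevelFixed(StorageLevelBuggy):
    def __eq__(self, other):
        # Fix: compare against the *other* object's field.
        return (self.use_disk == other.use_disk
                and self.deserialized == other.deserialized)


a = StorageLevelBuggy(True, True)
b = StorageLevelBuggy(True, False)
print(a == b)  # True -- the buggy comparison misses the difference

c = StorageLevelFixed(True, True)
d = StorageLevelFixed(True, False)
print(c == d)  # False -- the fixed comparison catches it
```

SPARK-43004 is the same class of bug in Scala (`vendor == vendor` in `ResourceRequest.equals()`); such typos pass compilation and most tests because equality still holds for genuinely equal objects.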

Improvement

  • [SPARK-42421] - Use the utils to get the switch for dynamic allocation used in local checkpoint
  • [SPARK-42519] - Add more WriteTo tests after Scala Client session config is supported
  • [SPARK-42533] - SSL support for Scala Client
  • [SPARK-42538] - `functions#lit` support more types
  • [SPARK-42573] - Enable binary compatibility tests for SparkSession/Dataset/Column/functions
  • [SPARK-42575] - Replace `AnyFunSuite` with `ConnectFunSuite` for scala client tests
  • [SPARK-42647] - Remove aliases from deprecated numpy data types
  • [SPARK-42721] - Add an Interceptor to log RPCs in connect-server
  • [SPARK-42754] - Spark 3.4 history server's SQL tab incorrectly groups SQL executions when replaying event logs from Spark 3.3 and earlier
  • [SPARK-42757] - Implement textFile for DataFrameReader
  • [SPARK-42767] - Add a check condition so the Connect server falls back to `in-memory`, and automatically ignore tests that strongly depend on Hive
  • [SPARK-42778] - QueryStageExec should respect supportsRowBased
  • [SPARK-42823] - spark-sql shell supports multipart namespaces for initialization
  • [SPARK-42888] - Upgrade GCS connector to 2.2.11.
  • [SPARK-42894] - Implement cache, persist, unpersist, and storageLevel
  • [SPARK-42927] - Make `o.a.spark.util.Iterators#size` as `private[spark]`
  • [SPARK-42930] - Change the access scope of `ProtobufSerDe` related implementations to `private[spark]`
  • [SPARK-42934] - Testing OrcEncryptionSuite using maven is always skipped
  • [SPARK-43284] - _metadata.file_path regression
  • [SPARK-43374] - Protobuf Licensed under BSD-3 not BSD-2 clause
  • [SPARK-43395] - Exclude macOS tar extended metadata in make-distribution.sh
  • [SPARK-43414] - Fix flakiness in Kafka RDD suites due to port binding configuration issue
  • [SPARK-43894] - df.cache() not working

Test

  • [SPARK-43083] - Mark `*StateStoreSuite` as `ExtendedSQLTest`
  • [SPARK-43450] - Add more `_metadata` filter test cases
  • [SPARK-43587] - Run HealthTrackerIntegrationSuite in a dedicated JVM

Task

  • [SPARK-42467] - Spark Connect Scala Client: GroupBy and Aggregation
  • [SPARK-42531] - Scala Client Add Collection Functions
  • [SPARK-42544] - Spark Connect Scala Client: support parameterized SQL
  • [SPARK-42640] - Remove stale entries from the excluding rules for CompatibilitySuite
  • [SPARK-42667] - Spark Connect: newSession API
  • [SPARK-42688] - Rename Connect proto Request client_id to session_id

Documentation

  • [SPARK-42773] - Minor grammatical change to "Supports Spark Connect" message
  • [SPARK-42797] - Spark Connect - Grammatical improvements to Spark Overview and Spark Connect Overview doc pages
  • [SPARK-43139] - Bug in INSERT INTO documentation
  • [SPARK-43517] - Add a migration guide for namedtuple monkey patch
  • [SPARK-43751] - Document for unbase64 behavior change
  • [SPARK-44038] - Update YuniKorn docs with v1.3
