Release Notes - Spark - Version 3.1.2

Sub-task

  • [SPARK-33976] - Add a dedicated SQL document page for the TRANSFORM-related functionality
  • [SPARK-34507] - Spark artefacts built against Scala 2.13 incorrectly depend on Scala 2.12
  • [SPARK-34543] - Respect case sensitivity in V1 ALTER TABLE .. SET LOCATION
  • [SPARK-34561] - Cannot drop/add columns from/to a dataset of v2 `DESCRIBE TABLE`
  • [SPARK-34577] - Cannot drop/add columns from/to a dataset of v2 `DESCRIBE NAMESPACE`
  • [SPARK-34630] - Add type hints of pyspark.__version__ and pyspark.sql.Column.contains
  • [SPARK-34682] - Regression in "operating on canonicalized plan" check in CustomShuffleReaderExec
  • [SPARK-34711] - Exercise code-gen enable/disable code paths for SHJ in join test suites
  • [SPARK-34790] - Fail in fetch shuffle blocks in batch when i/o encryption is enabled.
  • [SPARK-34840] - Fix cases of corruption in merged shuffle blocks that are pushed
  • [SPARK-35019] - Improve type hints on pyspark.sql.*
  • [SPARK-35093] - AQE columnar mismatch on exchange reuse
  • [SPARK-35159] - extract doc of hive format
  • [SPARK-35168] - mapred.reduce.tasks should be shuffle.partitions not adaptive.coalescePartitions.initialPartitionNum
  • [SPARK-35431] - Sort elements generated by collect_set in SQLQueryTestSuite

Bug

  • [SPARK-32924] - Web UI sort on duration is wrong
  • [SPARK-33482] - V2 Datasources that extend FileScan preclude exchange reuse
  • [SPARK-34128] - Suppress excessive logging of TTransportExceptions in Spark ThriftServer
  • [SPARK-34225] - Jars or file paths which contain spaces generate FileNotFoundException
  • [SPARK-34361] - Dynamic allocation on K8s kills executors with running tasks
  • [SPARK-34392] - Invalid ID for offset-based ZoneId since Spark 3.0
  • [SPARK-34417] - org.apache.spark.sql.DataFrameNaFunctions.fillMap(values: Seq[(String, Any)]) fails for column name having a dot
  • [SPARK-34436] - DPP support LIKE ANY/ALL
  • [SPARK-34473] - avoid NPE in DataFrameReader.schema(StructType)
  • [SPARK-34490] - table may be resolved as a view if the table is dropped
  • [SPARK-34497] - JDBC connection provider is not removing kerberos credentials from JVM security context
  • [SPARK-34504] - Avoid unnecessary view resolving and remove the `performCheck` flag
  • [SPARK-34515] - Fix NPE if InSet contains null value during getPartitionsByFilter
  • [SPARK-34531] - Remove Experimental API tag in PrometheusServlet
  • [SPARK-34534] - New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues
  • [SPARK-34545] - PySpark Python UDF returns inconsistent results when applying 2 UDFs with different return types to 2 columns together
  • [SPARK-34547] - Resolve using child metadata attributes as fallback
  • [SPARK-34551] - generate-contributors.py, releaseutils.py and translate-contributors.py are broken
  • [SPARK-34555] - Resolve metadata output from DataFrame
  • [SPARK-34556] - Checking duplicate static partition columns doesn't respect case sensitive conf
  • [SPARK-34567] - CreateTableAsSelect should have metrics update too
  • [SPARK-34584] - When inserting into a partitioned table with an illegal partition value, DSv2 behavior differs from others
  • [SPARK-34596] - NewInstance.doGenCode should not throw malformed class name error
  • [SPARK-34599] - INSERT INTO OVERWRITE doesn't support partition columns containing dot for DSv2
  • [SPARK-34607] - NewInstance.resolved should not throw malformed class name error
  • [SPARK-34613] - Fix view does not capture disable hint config
  • [SPARK-34642] - TypeError in Pyspark Linear Regression docs
  • [SPARK-34643] - Use CRAN URL in canonical form
  • [SPARK-34660] - Don't use ParVector with `withExistingConf` which is not thread-safe
  • [SPARK-34674] - Spark app on k8s doesn't terminate without call to sparkContext.stop() method
  • [SPARK-34676] - TableCapabilityCheckSuite should not inherit all tests from AnalysisSuite
  • [SPARK-34681] - Full outer shuffled hash join when building left side produces wrong result
  • [SPARK-34696] - Fix CodegenInterpretedPlanTest to generate correct test cases
  • [SPARK-34697] - Allow DESCRIBE FUNCTION and SHOW FUNCTIONS to explain || (string concatenation operator).
  • [SPARK-34713] - group by CreateStruct with ExtractValue fails analysis
  • [SPARK-34714] - collect_list(struct()) fails when used with GROUP BY
  • [SPARK-34719] - fail if the view query has duplicated column names
  • [SPARK-34723] - Correct parameter type for subexpression elimination under whole-stage
  • [SPARK-34724] - Fix Interpreted evaluation by using getClass.getMethod instead of getDeclaredMethod
  • [SPARK-34727] - Difference in results of casting float to timestamp
  • [SPARK-34731] - ConcurrentModificationException in EventLoggingListener when redacting properties
  • [SPARK-34737] - Discrepancy between TIMESTAMP_SECONDS and cast from float
  • [SPARK-34743] - ExpressionEncoderSuite should use deepEquals when we expect `array of array`
  • [SPARK-34747] - Add virtual operators to the built-in function document.
  • [SPARK-34756] - Fix FileScan equality check
  • [SPARK-34760] - run JavaSQLDataSourceExample failed with Exception in runBasicDataSourceExample().
  • [SPARK-34763] - col(), $"<name>" and df("name") should handle quoted column names properly.
  • [SPARK-34768] - Respect the default input buffer size in Univocity
  • [SPARK-34770] - InMemoryCatalog.tableExists should not fail if database doesn't exist
  • [SPARK-34772] - RebaseDateTime loadRebaseRecords should use Spark classloader instead of context
  • [SPARK-34774] - The `change-scala-version.sh` script does not replace the scala.version property correctly
  • [SPARK-34776] - Catalyst error on certain struct operation (Couldn't find _gen_alias_)
  • [SPARK-34794] - Nested higher-order functions broken in DSL
  • [SPARK-34796] - Codegen compilation error for query with LIMIT operator and without AQE
  • [SPARK-34798] - Fix incorrect join condition
  • [SPARK-34803] - Util methods requiring certain versions of Pandas & PyArrow don't pass through the raised ImportError
  • [SPARK-34811] - Redact fs.s3a.access.key like secret and token
  • [SPARK-34814] - LikeSimplification should handle NULL
  • [SPARK-34820] - K8s Integration test failed (due to libldap installation failed)
  • [SPARK-34829] - transform_values returns identical values when it's used with a udf that returns a reference type
  • [SPARK-34832] - ExternalAppendOnlyUnsafeRowArrayBenchmark can't run with spark-submit
  • [SPARK-34833] - Apply right-padding correctly for correlated subqueries
  • [SPARK-34834] - There is a potential Netty memory leak in TransportResponseHandler.
  • [SPARK-34842] - Corrects the type of date_dim.d_quarter_name in the TPCDS schema
  • [SPARK-34845] - ProcfsMetricsGetter.computeAllMetrics may return partial metrics when some of the child pids' metrics are missing
  • [SPARK-34874] - Recover test reports for failed GA builds
  • [SPARK-34876] - Non-nullable aggregates can return NULL in a correlated subquery
  • [SPARK-34897] - Support reconcile schemas based on index after nested column pruning
  • [SPARK-34900] - Some `spark-submit` commands used to run benchmarks in the user's guide are wrong
  • [SPARK-34909] - conv() does not convert negative inputs to unsigned correctly
  • [SPARK-34926] - PartitionUtils.getPathFragment should handle null value
  • [SPARK-34933] - Remove the description that || and && can be used as logical operators from the document.
  • [SPARK-34939] - Throw fetch failure exception when unable to deserialize broadcasted map statuses
  • [SPARK-34948] - Add ownerReference to executor configmap to fix leakages
  • [SPARK-34949] - Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down
  • [SPARK-34963] - Nested column pruning fails to extract case-insensitive struct field from array
  • [SPARK-34965] - Remove .sbtopts that duplicately sets the default memory
  • [SPARK-34988] - Upgrade Jetty for CVE-2021-28165
  • [SPARK-35004] - Fix Incorrect assertion of "master/worker web ui available behind front-end reverseProxy" in MasterSuite
  • [SPARK-35014] - A foldable expression could not be replaced by an AttributeReference
  • [SPARK-35079] - Transform with udf gives incorrect result
  • [SPARK-35080] - Correlated subqueries with equality predicates can return wrong results
  • [SPARK-35096] - foreachBatch throws ArrayIndexOutOfBoundsException if schema is case insensitive
  • [SPARK-35106] - HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used
  • [SPARK-35117] - UI progress bar no longer highlights in progress tasks
  • [SPARK-35136] - Initial null value of LiveStage.info can lead to NPE
  • [SPARK-35142] - `OneVsRest` classifier uses incorrect data type for `rawPrediction` column
  • [SPARK-35178] - maven autodownload failing
  • [SPARK-35210] - Upgrade Jetty to 9.4.40 to fix ERR_CONNECTION_RESET issue
  • [SPARK-35213] - Corrupt DataFrame for certain withField patterns
  • [SPARK-35226] - JDBC datasources should accept refreshKrb5Config parameter
  • [SPARK-35244] - invoke should throw the original exception
  • [SPARK-35278] - Invoke should find the method with correct number of parameters
  • [SPARK-35288] - StaticInvoke should find the method without exact argument classes match
  • [SPARK-35359] - Inserting data with char/varchar datatype will fail when data length exceeds the length limitation
  • [SPARK-35375] - Use Jinja2 < 3.0.0 for Python linter dependency in GA
  • [SPARK-35381] - Fix lambda variable name issues in nested DataFrame functions in R APIs
  • [SPARK-35382] - Fix lambda variable name issues in nested DataFrame functions in Python APIs
  • [SPARK-35393] - PIP packaging test is skipped in GitHub Actions build
  • [SPARK-35425] - Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the release README.md
  • [SPARK-35458] - ARM CI failed: failed to validate maven sha512
  • [SPARK-35463] - Skip checking checksum on a system that doesn't have `shasum`
  • [SPARK-35482] - case sensitive block manager port key should be used in BasicExecutorFeatureStep
  • [SPARK-35493] - spark.blockManager.port does not work for driver pod
  • [SPARK-36765] - Spark Support for MS Sql JDBC connector with Kerberos/Keytab
  • [SPARK-38208] - 'Column' object is not callable

Improvement

  • [SPARK-34482] - Correct the active SparkSession for streaming query
  • [SPARK-34550] - Skip InSet null value during push filter to Hive metastore
  • [SPARK-34639] - always remove unnecessary Alias in Analyzer.resolveExpression
  • [SPARK-34683] - Update the documents to explain the usage of LIST FILE and LIST JAR in case they take multiple file names
  • [SPARK-34749] - Simplify CreateNamedStruct
  • [SPARK-34752] - Upgrade Jetty to 9.4.37 to fix CVE-2020-27223
  • [SPARK-34762] - Many PR's Scala 2.13 build action failed
  • [SPARK-34766] - Do not capture maven config for views
  • [SPARK-34915] - Cache Maven, SBT and Scala in all jobs that use them
  • [SPARK-34922] - Use better CBO cost function
  • [SPARK-34923] - Metadata output should not always be propagated
  • [SPARK-34940] - Fix minor unit test in BasicWriteTaskStatsTrackerSuite
  • [SPARK-35002] - Fix the java.net.BindException when testing with GitHub Actions
  • [SPARK-35045] - Add an internal option to control input buffer in univocity
  • [SPARK-35087] - Some columns in table `Aggregated Metrics by Executor` of stage-detail page show incorrectly.
  • [SPARK-35127] - When we switch between different stage-detail pages, the entry item in the newly-opened page may be blank.
  • [SPARK-35171] - Declare the markdown package as a dependency of the SparkR package
  • [SPARK-35227] - Replace Bintray with the new repository service for the spark-packages resolver in SparkSubmit
  • [SPARK-35358] - Set maximum Java heap used for release build
  • [SPARK-35373] - Verify checksums of downloaded artifacts in build/mvn
  • [SPARK-35411] - Essential information missing in TreeNode json string

Test

  • [SPARK-24931] - Recover lint-r job in GitHub Actions workflow
  • [SPARK-34604] - Flaky test: TaskContextTestsWithWorkerReuse.test_task_context_correct_with_python_worker_reuse
  • [SPARK-34610] - Fix Python UDF used in GroupedAggPandasUDFTests.
  • [SPARK-34795] - Adds a new job in GitHub Actions to check the output of TPC-DS queries
  • [SPARK-34813] - Remove Scala 2.13 build GitHub Action job from branch-3.1
  • [SPARK-34951] - Recover Python linter (Sphinx build) in GitHub Actions
  • [SPARK-35192] - Port minimal TPC-DS datagen code from databricks/spark-sql-perf
  • [SPARK-35293] - Use the newer dsdgen for TPCDSQueryTestSuite
  • [SPARK-35327] - Filters out the TPC-DS queries that can cause flaky test results
  • [SPARK-35413] - Use the SHA of the latest commit when checking out databricks/tpcds-kit

Task

  • [SPARK-34970] - Redact map-type options in the output of explain()
  • [SPARK-35495] - Change SparkR maintainer for CRAN

Documentation

  • [SPARK-35250] - SQL DataFrameReader unescapedQuoteHandling parameter is misdocumented
  • [SPARK-35405] - Submitting Applications documentation has outdated information about K8s client mode support
