Release Notes - Spark - Version 3.3.0

Sub-task

  • [SPARK-6305] - Add support for log4j 2.x to Spark
  • [SPARK-27442] - ParquetFileFormat fails to read column named with invalid characters
  • [SPARK-27974] - Support ANSI Aggregate Function: array_agg
  • [SPARK-28137] - Data Type Formatting Functions: `to_number`
  • [SPARK-32567] - Code-gen for full outer shuffled hash join
  • [SPARK-32709] - Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1 and 2)
  • [SPARK-32712] - Support writing Hive non-ORC/Parquet bucketed table
  • [SPARK-33701] - Adaptive shuffle merge finalization for push-based shuffle
  • [SPARK-33832] - Add an option in AQE to mitigate skew even if it causes a new shuffle
  • [SPARK-34112] - Upgrade ORC to 1.7.0
  • [SPARK-34183] - DataSource V2: Support required distribution and ordering in SS
  • [SPARK-34332] - Unify v1 and v2 ALTER NAMESPACE .. SET LOCATION tests
  • [SPARK-34544] - pyspark toPandas() should return pd.DataFrame
  • [SPARK-34826] - Adaptive fetch of shuffle mergers for Push based shuffle
  • [SPARK-34863] - Support nested column in Spark Parquet vectorized readers
  • [SPARK-34960] - Aggregate (Min/Max/Count) push down for ORC
  • [SPARK-34980] - Support coalesce partition through union
  • [SPARK-35352] - Add code-gen for full outer sort merge join
  • [SPARK-35437] - Use expressions to filter Hive partitions at client side
  • [SPARK-35496] - Upgrade Scala 2.13 to 2.13.7
  • [SPARK-35663] - Add Timestamp without time zone type
  • [SPARK-35664] - Support java.time.LocalDateTime as an external type of TimestampWithoutTZ type
  • [SPARK-35674] - Test timestamp without time zone in UDF
  • [SPARK-35697] - Test TimestampWithoutTZType as ordered and atomic type
  • [SPARK-35698] - Support casting of timestamp without time zone to strings
  • [SPARK-35711] - Support casting of timestamp without time zone to timestamp type
  • [SPARK-35716] - Support casting of timestamp without time zone to date type
  • [SPARK-35718] - Support casting of Date to timestamp without time zone type
  • [SPARK-35719] - Support type conversion between timestamp and timestamp without time zone type
  • [SPARK-35720] - Support casting of String to timestamp without time zone type
  • [SPARK-35764] - Assign pretty names to TimestampWithoutTZType
  • [SPARK-35785] - Cleanup support for RocksDB instance
  • [SPARK-35839] - New SQL function: to_timestamp_ntz
  • [SPARK-35854] - Improve the error message of to_timestamp_ntz with invalid format pattern
  • [SPARK-35867] - Enable vectorized read for VectorizedPlainValuesReader.readBooleans
  • [SPARK-35889] - Support adding TimestampWithoutTZ with Interval types
  • [SPARK-35895] - Support subtracting Intervals from TimestampWithoutTZ
  • [SPARK-35916] - Support subtraction among Date/Timestamp/TimestampWithoutTZ
  • [SPARK-35925] - Support DayTimeIntervalType in width-bucket function
  • [SPARK-35926] - Support YearMonthIntervalType in width-bucket function
  • [SPARK-35927] - Remove type collection AllTimestampTypes
  • [SPARK-35932] - Support extracting hour/minute/second from timestamp without time zone
  • [SPARK-35953] - Support extracting date fields from timestamp without time zone
  • [SPARK-35963] - Rename TimestampWithoutTZType to TimestampNTZType
  • [SPARK-35968] - Make sure partitions are not too small in AQE partition coalescing
  • [SPARK-35971] - Rename the type name of TimestampNTZType as "timestamp_ntz"
  • [SPARK-35975] - New configuration spark.sql.timestampType for the default timestamp type
  • [SPARK-35977] - Support non-reserved keyword TIMESTAMP_NTZ
  • [SPARK-35978] - Support non-reserved keyword TIMESTAMP_LTZ
  • [SPARK-35979] - Return different timestamp literals based on the default timestamp type
  • [SPARK-35987] - The ANSI flags of Sum and Avg should be kept after being copied
  • [SPARK-36015] - Support TimestampNTZType in the Window spec definition
  • [SPARK-36016] - Support TimestampNTZType in expression ApproxCountDistinctForIntervals
  • [SPARK-36017] - Support TimestampNTZType in expression ApproximatePercentile
  • [SPARK-36037] - Support ANSI SQL LOCALTIMESTAMP datetime value function
  • [SPARK-36043] - Add end-to-end tests with default timestamp type as TIMESTAMP_NTZ
  • [SPARK-36044] - Support TimestampNTZ in functions unix_timestamp/to_unix_timestamp
  • [SPARK-36046] - Support new functions make_timestamp_ntz and make_timestamp_ltz
  • [SPARK-36050] - Spark doesn’t support reading/writing TIMESTAMP_NTZ with ORC
  • [SPARK-36054] - Support group by TimestampNTZ column
  • [SPARK-36055] - Assign pretty SQL string to TimestampNTZ literals
  • [SPARK-36058] - Support replicasets/job API
  • [SPARK-36059] - Add the ability to specify a scheduler
  • [SPARK-36061] - Add `volcano` module and feature step
  • [SPARK-36072] - TO_TIMESTAMP: return different results based on the default timestamp type
  • [SPARK-36075] - Support for specifying executor/driver node selector
  • [SPARK-36083] - make_timestamp: return different result based on the default timestamp type
  • [SPARK-36090] - Support TimestampNTZType in expression Sequence
  • [SPARK-36091] - Support TimestampNTZ type in expression TimeWindow
  • [SPARK-36095] - Group exception messages in core/rdd
  • [SPARK-36097] - Group exception messages in core/scheduler
  • [SPARK-36098] - Group exception messages in core/storage
  • [SPARK-36101] - Group exception messages in core/api
  • [SPARK-36107] - Refactor first set of 20 query execution errors to use error classes
  • [SPARK-36110] - Upgrade SBT to 1.5.5
  • [SPARK-36119] - Add new SQL function to_timestamp_ltz
  • [SPARK-36120] - Support TimestampNTZ type in cache table
  • [SPARK-36135] - Support TimestampNTZ type in file partitioning
  • [SPARK-36139] - Remove Python 3.6 from `pyspark` GitHub Action job
  • [SPARK-36144] - Use Python 3.9 in `run-pip-tests` conda environment
  • [SPARK-36146] - Upgrade Python version from 3.6 to higher version in GitHub linter
  • [SPARK-36152] - Add Scala 2.13 daily build and test GitHub Action job
  • [SPARK-36175] - Support TimestampNTZ in Avro data source
  • [SPARK-36179] - Support TimestampNTZType in SparkGetColumnsOperation
  • [SPARK-36182] - Support TimestampNTZ type in Parquet file source
  • [SPARK-36208] - SparkScriptTransformation should support ANSI interval types
  • [SPARK-36227] - Remove TimestampNTZ type support in Spark 3.2
  • [SPARK-36230] - hasnans for Series of Decimal(`NaN`)
  • [SPARK-36231] - Support arithmetic operations of Series containing Decimal(np.nan)
  • [SPARK-36232] - Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled
  • [SPARK-36255] - FileNotFoundException from the shuffle push can cause the executor to terminate
  • [SPARK-36256] - Upgrade lz4-java to 1.8.0
  • [SPARK-36257] - Update the version of TimestampNTZ-related changes to 3.3.0
  • [SPARK-36332] - Cleanup RemoteBlockPushResolver log messages
  • [SPARK-36336] - Define new exceptions that mix in SparkThrowable for all base exceptions in QueryExecutionErrors
  • [SPARK-36337] - decimal('NaN') is unsupported in net.razorvine.pickle
  • [SPARK-36346] - Support TimestampNTZ type in Orc file source
  • [SPARK-36357] - Support pushdown Timestamp with local time zone for orc
  • [SPARK-36368] - Fix CategoricalOps.astype to follow pandas 1.3
  • [SPARK-36378] - Minor changes to address a few identified server side inefficiencies
  • [SPARK-36396] - Implement DataFrame.cov
  • [SPARK-36399] - Implement DataFrame.combine_first
  • [SPARK-36401] - Implement Series.cov
  • [SPARK-36409] - Splitting test cases from datetime.sql
  • [SPARK-36424] - Support eliminate limits in AQE Optimizer
  • [SPARK-36435] - Implement MultIndex.equal_levels
  • [SPARK-36438] - Support list-like Python objects for Series comparison
  • [SPARK-36490] - Make from_csv/to_csv to handle timestamp_ntz type properly
  • [SPARK-36491] - Make from_json/to_json to handle timestamp_ntz type properly
  • [SPARK-36506] - Improve test coverage for series.py and indexes/*.py.
  • [SPARK-36526] - Add supportsIndex interface
  • [SPARK-36540] - AM should not just finish with Success when disconnected
  • [SPARK-36556] - Add DSV2 filters
  • [SPARK-36587] - Migrate CreateNamespaceStatement to v2 command framework
  • [SPARK-36608] - Support TimestampNTZ in Arrow
  • [SPARK-36609] - Add `errors` argument for `ps.to_numeric`.
  • [SPARK-36615] - SparkContext should register shutdown hook earlier
  • [SPARK-36618] - Support dropping rows of a single-indexed DataFrame
  • [SPARK-36624] - When application killed, sc should not exit with code 0
  • [SPARK-36625] - Support TimestampNTZ in pandas API on Spark
  • [SPARK-36626] - Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
  • [SPARK-36645] - Aggregate (Min/Max/Count) push down for Parquet
  • [SPARK-36646] - Push down group by partition column for Aggregate (Min/Max/Count) for Parquet
  • [SPARK-36647] - Push down filter by partition column for Aggregate (Min/Max/Count) for Parquet
  • [SPARK-36650] - ApplicationMaster shutdown hook should catch timeout exception
  • [SPARK-36652] - AQE dynamic join selection should not apply to non-equi join
  • [SPARK-36653] - Implement Series.__xor__
  • [SPARK-36655] - Add `versionadded` for API added in Spark 3.3.0
  • [SPARK-36656] - CollapseProject should not collapse correlated scalar subqueries
  • [SPARK-36661] - Support TimestampNTZ in Py4J
  • [SPARK-36675] - Support ScriptTransformation for timestamp_ntz
  • [SPARK-36678] - Migrate SHOW TABLES to use V2 command by default
  • [SPARK-36687] - Rename error classes with _ERROR suffix
  • [SPARK-36708] - Support numpy.typing for annotating ArrayType
  • [SPARK-36709] - Support new syntax for specifying index type and name
  • [SPARK-36710] - Support new syntax in function apply APIs
  • [SPARK-36711] - Support multi-index in new syntax
  • [SPARK-36713] - Document new syntax for specifying index type
  • [SPARK-36724] - Support timestamp_ntz as a type of time column for SessionWindow
  • [SPARK-36742] - Fix ps.to_datetime with plurals of keys like years, months, days
  • [SPARK-36746] - Refactor _select_rows_by_iterable in iLocIndexer to use Column.isin
  • [SPARK-36748] - Introduce the 'compute.isin_limit' option
  • [SPARK-36754] - array_intersect should handle Double.NaN and Float.NaN
  • [SPARK-36760] - Add interface SupportsPushDownV2Filters
  • [SPARK-36769] - Improve `filter` of single-indexed DataFrame
  • [SPARK-36771] - Fix `pop` of Categorical Series
  • [SPARK-36778] - Support ILIKE API on Scala(dataframe)
  • [SPARK-36779] - Error when list of data type tuples has len = 1
  • [SPARK-36785] - Fix ps.DataFrame.isin
  • [SPARK-36794] - Ignore duplicated join keys when building relation for SEMI/ANTI shuffle hash join
  • [SPARK-36796] - Make sql/core and dependent modules all UTs pass on Java 17
  • [SPARK-36813] - Implement ps.merge_asof
  • [SPARK-36818] - Fix filtering a Series by a boolean Series
  • [SPARK-36825] - Read/write dataframes with ANSI intervals from/to parquet files
  • [SPARK-36830] - Read/write dataframes with ANSI intervals from/to JSON files
  • [SPARK-36831] - Read/write dataframes with ANSI intervals from/to CSV files
  • [SPARK-36846] - Inline most of type hint files under pyspark/sql/pandas folder
  • [SPARK-36848] - Migrate ShowCurrentNamespaceStatement to v2 command framework
  • [SPARK-36849] - Migrate UseStatement to v2 command framework
  • [SPARK-36850] - Migrate CreateTableStatement to v2 command framework
  • [SPARK-36852] - Test ANSI interval support by the Parquet datasource
  • [SPARK-36854] - Parquet reader fails on load of ANSI interval when off-heap is enabled
  • [SPARK-36866] - Pushdown filters with ANSI interval values to parquet
  • [SPARK-36868] - Migrate CreateFunctionStatement to v2 command framework
  • [SPARK-36871] - Migrate CreateViewStatement to v2 command
  • [SPARK-36879] - Support Parquet v2 data page encodings for the vectorized path
  • [SPARK-36880] - Inline type hints for python/pyspark/sql/functions.py
  • [SPARK-36881] - Inline type hints for python/pyspark/sql/catalog.py
  • [SPARK-36882] - Support ILIKE API on Python
  • [SPARK-36884] - Inline type hints for python/pyspark/sql/session.py
  • [SPARK-36885] - Inline type hints for python/pyspark/sql/dataframe.py
  • [SPARK-36886] - Inline type hints for python/pyspark/sql/context.py
  • [SPARK-36891] - Refactor SpecificParquetRecordReaderBase and add more coverage on vectorized Parquet decoding
  • [SPARK-36895] - Add Create Index syntax support
  • [SPARK-36897] - Replace collections.namedtuple() by typing.NamedTuple
  • [SPARK-36899] - Support ILIKE API on R
  • [SPARK-36900] - "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17
  • [SPARK-36902] - Migrate CreateTableAsSelectStatement to v2 command
  • [SPARK-36906] - Inline type hints for conf.py and observation.py in python/pyspark/sql
  • [SPARK-36910] - Inline type hints for python/pyspark/sql/types.py
  • [SPARK-36913] - Implement createIndex and IndexExists in JDBC (MySQL dialect)
  • [SPARK-36914] - Implement dropIndex and listIndexes in JDBC (MySQL dialect)
  • [SPARK-36920] - Support ANSI intervals by ABS
  • [SPARK-36921] - The DIV function should support ANSI intervals
  • [SPARK-36922] - The SIGN/SIGNUM functions should support ANSI intervals
  • [SPARK-36924] - CAST between ANSI intervals and numerics
  • [SPARK-36927] - Inline type hints for python/pyspark/sql/window.py
  • [SPARK-36928] - Handle ANSI intervals in ColumnarRow, ColumnarBatchRow and ColumnarArray
  • [SPARK-36930] - Support ps.MultiIndex.dtypes
  • [SPARK-36931] - Read/write dataframes with ANSI intervals from/to ORC files
  • [SPARK-36935] - Enhance ParquetSchemaConverter to capture Parquet repetition & definition level
  • [SPARK-36938] - Inline type hints for group.py in python/pyspark/sql
  • [SPARK-36940] - Inline type hints for python/pyspark/sql/avro/functions.py
  • [SPARK-36941] - Check saving of a dataframe with ANSI intervals to a Hive parquet table
  • [SPARK-36942] - Inline type hints for python/pyspark/sql/readwriter.py
  • [SPARK-36944] - Remove unused python/pyspark/sql/__init__.pyi
  • [SPARK-36945] - Inline type hints for python/pyspark/sql/udf.py
  • [SPARK-36946] - Support time for ps.to_datetime
  • [SPARK-36948] - Check CREATE TABLE with ANSI intervals using Hive external catalog and Parquet
  • [SPARK-36949] - Fix CREATE TABLE AS SELECT of ANSI intervals
  • [SPARK-36951] - Inline type hints for python/pyspark/sql/column.py
  • [SPARK-36952] - Inline type hints for python/pyspark/resource/information.py and python/pyspark/resource/profile.py
  • [SPARK-36960] - Pushdown filters with ANSI interval values to ORC
  • [SPARK-36968] - ps.Series.dot raise "matrices are not aligned" if index is not same
  • [SPARK-36969] - Inline type hints for SparkContext
  • [SPARK-36970] - Manually disable format `B` for the `date_format` function for compatibility with Java 8 behavior
  • [SPARK-36977] - Update docs to reflect that Python 3.6 is no longer supported
  • [SPARK-36982] - Migrate SHOW NAMESPACES to use V2 command by default
  • [SPARK-36991] - Inline type hints for spark/python/pyspark/sql/streaming.py
  • [SPARK-37000] - Add type hints to python/pyspark/sql/util.py
  • [SPARK-37008] - WholeStageCodegenSparkSubmitSuite Failed with Java 17
  • [SPARK-37013] - `select format_string('%0$s', 'Hello')` has different behavior on Java 8 and Java 17
  • [SPARK-37014] - Inline type hints for python/pyspark/streaming/context.py
  • [SPARK-37015] - Inline type hints for python/pyspark/streaming/dstream.py
  • [SPARK-37023] - Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry
  • [SPARK-37031] - Unify v1 and v2 DESCRIBE NAMESPACE tests
  • [SPARK-37033] - Inline type hints for python/pyspark/resource/requests.py
  • [SPARK-37038] - Sample push down in DS v2
  • [SPARK-37042] - Inline type hints for kinesis.py and listener.py in python/pyspark/streaming
  • [SPARK-37048] - Clean up inlining type hints under SQL module
  • [SPARK-37056] - Fix unused code in test about history server & MetricsSystem
  • [SPARK-37066] - Improve ORC RecordReader's error message
  • [SPARK-37070] - Pass all UTs in `mllib-local` and `mllib` with Java 17
  • [SPARK-37072] - Pass all UTs in `repl` with Java 17
  • [SPARK-37073] - Pass all UTs in `external/avro` with Java 17
  • [SPARK-37083] - Inline type hints for python/pyspark/accumulators.py
  • [SPARK-37091] - Support Java 17 in SparkR SystemRequirements
  • [SPARK-37095] - Inline type hints for files in python/pyspark/broadcast.py
  • [SPARK-37105] - Pass all UTs in `sql/hive` with Java 17
  • [SPARK-37106] - Pass all UTs in `yarn` with Java 17
  • [SPARK-37107] - Inline type hints for files in python/pyspark/status.py
  • [SPARK-37120] - Add Daily GitHub Action jobs for Java11/17
  • [SPARK-37125] - Support AnsiInterval radix sort
  • [SPARK-37129] - Supplement all micro benchmark results using Java 17
  • [SPARK-37137] - Inline type hints for python/pyspark/conf.py
  • [SPARK-37138] - Support ANSI Interval in functions that support numeric type
  • [SPARK-37139] - Inline type hints for python/pyspark/taskcontext.py and python/pyspark/version.py
  • [SPARK-37140] - Inline type hints for python/pyspark/resultiterable.py
  • [SPARK-37144] - Inline type hints for python/pyspark/file.py
  • [SPARK-37145] - Add KubernetesCustom[Driver/Executor]FeatureConfigStep developer API
  • [SPARK-37146] - Inline type hints for python/pyspark/__init__.py
  • [SPARK-37149] - Improve error messages for arithmetic overflow under ANSI mode
  • [SPARK-37150] - Migrate DESCRIBE NAMESPACE to use V2 command by default
  • [SPARK-37152] - Inline type hints for python/pyspark/context.py
  • [SPARK-37153] - Inline type hints for python/pyspark/profiler.py
  • [SPARK-37154] - Inline type hints for python/pyspark/rdd.py
  • [SPARK-37155] - Inline type hints for python/pyspark/statcounter.py
  • [SPARK-37156] - Inline type hints for python/pyspark/storagelevel.py
  • [SPARK-37157] - Inline type hints for python/pyspark/util.py
  • [SPARK-37159] - Change HiveExternalCatalogVersionsSuite to be able to test with Java 17
  • [SPARK-37161] - RowToColumnConverter support AnsiIntervalType
  • [SPARK-37166] - SPIP: Storage Partitioned Join
  • [SPARK-37168] - Improve error messages for SQL functions and operators under ANSI mode
  • [SPARK-37179] - ANSI mode: Add a config to allow casting between Datetime and Numeric
  • [SPARK-37181] - pyspark.pandas.read_csv() should support latin-1 encoding
  • [SPARK-37188] - pyspark.pandas histogram accepts the title option but does not add a title to the plot
  • [SPARK-37190] - Improve error messages for casting under ANSI mode
  • [SPARK-37192] - Migrate SHOW TBLPROPERTIES to use V2 command by default
  • [SPARK-37195] - Unify v1 and v2 SHOW TBLPROPERTIES tests
  • [SPARK-37200] - Drop index support
  • [SPARK-37212] - Improve the implementation of aggregate pushdown
  • [SPARK-37220] - Do not split input file for Parquet reader with aggregate push down
  • [SPARK-37225] - Read/write dataframes with ANSI intervals from/to Avro files
  • [SPARK-37228] - Implement DataFrame.mapInArrow in Python (see the sketch after this list)
  • [SPARK-37230] - Document DataFrame.mapInArrow
  • [SPARK-37231] - Dynamic writes/reads of ANSI interval partitions
  • [SPARK-37232] - Upgrade ORC to 1.7.1
  • [SPARK-37234] - Inline type hints for python/pyspark/mllib/stat/_statistics.py
  • [SPARK-37235] - Inline type hints for distribution.py and __init__.py in python/pyspark/mllib/stat
  • [SPARK-37236] - Inline type hints for KernelDensity.pyi, test.py in python/pyspark/mllib/stat/
  • [SPARK-37240] - Cannot read partitioned parquet files with ANSI interval partition values
  • [SPARK-37258] - Upgrade kubernetes-client to 5.12.0
  • [SPARK-37261] - Check adding partitions with ANSI intervals
  • [SPARK-37262] - Do not log empty aggregate and group-by in JDBCScan
  • [SPARK-37264] - Exclude hadoop-client-api transitive dependency from orc-core
  • [SPARK-37267] - OptimizeSkewInRebalancePartitions support optimize non-root node
  • [SPARK-37272] - Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon
  • [SPARK-37277] - Support DayTimeIntervalType in Arrow
  • [SPARK-37279] - Support DayTimeIntervalType in createDataFrame/toPandas and Python UDFs
  • [SPARK-37281] - Support DayTimeIntervalType in Py4J
  • [SPARK-37282] - Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon
  • [SPARK-37286] - Move compileAggregates from JDBCRDD to JdbcDialect
  • [SPARK-37291] - PySpark init SparkSession should copy conf to sharedState
  • [SPARK-37293] - Remove explicit GC options from Scala tests
  • [SPARK-37294] - Check inserting of ANSI intervals into a table partitioned by the interval columns
  • [SPARK-37296] - Add missing type hints in python/pyspark/util.py
  • [SPARK-37304] - Check replacing columns with ANSI intervals
  • [SPARK-37310] - Migrate ALTER NAMESPACE ... SET PROPERTIES to use v2 command by default
  • [SPARK-37311] - Migrate ALTER NAMESPACE ... SET LOCATION to use v2 command by default
  • [SPARK-37312] - Add `.java-version` to `.gitignore` and `.rat-excludes`
  • [SPARK-37316] - Add code-gen for existence sort merge join
  • [SPARK-37317] - Reduce weights in GaussianMixtureSuite
  • [SPARK-37319] - Support K8s image building with Java 17
  • [SPARK-37326] - Support TimestampNTZ in CSV data source
  • [SPARK-37328] - SPARK-33832 introduces a bug where OptimizeSkewedJoin may not work since it is applied to the whole plan instead of the new stage plan
  • [SPARK-37330] - Migrate ReplaceTableStatement to v2 command
  • [SPARK-37331] - Add the ability to create resources before driver pod
  • [SPARK-37332] - Check adding of ANSI interval columns to v1/v2 tables
  • [SPARK-37343] - Implement createIndex and IndexExists in JDBC (Postgres dialect)
  • [SPARK-37345] - Add java.security.jgss/sun.security.krb5 to DEFAULT_MODULE_OPTIONS
  • [SPARK-37354] - Make the Java version installed on the container image used by the K8s integration tests with SBT configurable
  • [SPARK-37357] - Add small partition factor for rebalance partitions
  • [SPARK-37360] - Support TimestampNTZ in JSON data source
  • [SPARK-37376] - SPJ: Introduce a new DataSource V2 interface HasPartitionKey
  • [SPARK-37377] - SPJ: Initial implementation of Storage-Partitioned Join
  • [SPARK-37379] - Add tree pattern pruning to CTESubstitution rule
  • [SPARK-37381] - Unify v1 and v2 SHOW CREATE TABLE tests
  • [SPARK-37385] - Add tests for TimestampNTZ and TimestampLTZ for Parquet data source
  • [SPARK-37389] - Check unclosed bracketed comments
  • [SPARK-37397] - Inline type hints for python/pyspark/ml/base.py
  • [SPARK-37398] - Inline type hints for python/pyspark/ml/classification.py
  • [SPARK-37399] - Inline type hints for python/pyspark/ml/common.py
  • [SPARK-37400] - Inline type hints for python/pyspark/mllib/classification.py
  • [SPARK-37401] - Inline type hints for python/pyspark/ml/clustering.py
  • [SPARK-37402] - Inline type hints for python/pyspark/mllib/clustering.py
  • [SPARK-37403] - Inline type hints for python/pyspark/mllib/common.py
  • [SPARK-37404] - Inline type hints for python/pyspark/ml/evaluation.py
  • [SPARK-37405] - Inline type hints for python/pyspark/ml/feature.py
  • [SPARK-37406] - Inline type hints for python/pyspark/ml/fpm.py
  • [SPARK-37407] - Inline type hints for python/pyspark/ml/functions.py
  • [SPARK-37408] - Inline type hints for python/pyspark/ml/image.py
  • [SPARK-37409] - Inline type hints for python/pyspark/ml/pipeline.py
  • [SPARK-37410] - Inline type hints for python/pyspark/ml/recommendation.py
  • [SPARK-37411] - Inline type hints for python/pyspark/ml/regression.py
  • [SPARK-37412] - Inline type hints for python/pyspark/ml/stat.py
  • [SPARK-37413] - Inline type hints for python/pyspark/ml/tree.py
  • [SPARK-37414] - Inline type hints for python/pyspark/ml/tuning.py
  • [SPARK-37415] - Inline type hints for python/pyspark/ml/util.py
  • [SPARK-37416] - Inline type hints for python/pyspark/ml/wrapper.py
  • [SPARK-37417] - Inline type hints for python/pyspark/ml/linalg/__init__.py
  • [SPARK-37418] - Inline type hints for python/pyspark/ml/param/__init__.py
  • [SPARK-37419] - Inline type hints for python/pyspark/ml/param/shared.py
  • [SPARK-37421] - Inline type hints for python/pyspark/mllib/evaluation.py
  • [SPARK-37422] - Inline type hints for python/pyspark/mllib/feature.py
  • [SPARK-37423] - Inline type hints for python/pyspark/mllib/fpm.py
  • [SPARK-37424] - Inline type hints for python/pyspark/mllib/random.py
  • [SPARK-37426] - Inline type hints for python/pyspark/mllib/regression.py
  • [SPARK-37427] - Inline type hints for python/pyspark/mllib/tree.py
  • [SPARK-37428] - Inline type hints for python/pyspark/mllib/util.py
  • [SPARK-37429] - Inline type hints for python/pyspark/mllib/linalg/__init__.py
  • [SPARK-37430] - Inline type hints for python/pyspark/mllib/linalg/distributed.py
  • [SPARK-37438] - ANSI mode: Use store assignment rules for resolving function invocation
  • [SPARK-37442] - In AQE, wrong InMemoryRelation size estimation causes "Cannot broadcast the table that is larger than 8GB: 8 GB" failure
  • [SPARK-37444] - ALTER NAMESPACE ... SET LOCATION should handle empty location consistently across v1 and v2 command
  • [SPARK-37455] - Replace hash with sort aggregate if child is already sorted
  • [SPARK-37456] - CREATE NAMESPACE should qualify location for v2 command
  • [SPARK-37463] - Read/write TIMESTAMP_NTZ from/to ORC using int64
  • [SPARK-37478] - Unify v1 and v2 DROP NAMESPACE tests
  • [SPARK-37479] - Migrate DROP NAMESPACE to use V2 command by default
  • [SPARK-37482] - Skip check monotonic increasing for Series.asof with 'compute.eager_check'
  • [SPARK-37483] - Support push down top N to JDBC data source V2
  • [SPARK-37489] - Skip hasnans check in numops if eager_check is disabled
  • [SPARK-37490] - Show hint if analyzer fails due to ANSI type coercion
  • [SPARK-37494] - Unify v1 and v2 options output of `SHOW CREATE TABLE` command
  • [SPARK-37495] - Skip identical index checking of Series.compare when config 'compute.eager_check' is disabled
  • [SPARK-37501] - CREATE/REPLACE TABLE should qualify location for v2 command
  • [SPARK-37504] - pyspark should not pass all options to session states.
  • [SPARK-37509] - Improve Fallback Storage upload speed by avoiding S3 rate limiter
  • [SPARK-37510] - Support TimedeltaIndex in pandas API on Spark
  • [SPARK-37511] - Introduce TimedeltaIndex to pandas API on Spark
  • [SPARK-37512] - Support TimedeltaIndex creation (from Series/Index) and TimedeltaIndex.astype
  • [SPARK-37522] - Fix MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction
  • [SPARK-37526] - Add Java17 PySpark daily test coverage
  • [SPARK-37527] - Translate more standard aggregate functions for pushdown
  • [SPARK-37529] - Support K8s integration tests for Java 17
  • [SPARK-37533] - New SQL function: try_element_at (see the sketch after this list)
  • [SPARK-37543] - Document Java 17 support
  • [SPARK-37548] - Add Java17 SparkR daily test coverage
  • [SPARK-37557] - Replace object hash with sort aggregate if child is already sorted
  • [SPARK-37563] - Implement days, seconds, microseconds properties of TimedeltaIndex
  • [SPARK-37564] - Support sort aggregate code-gen without grouping keys
  • [SPARK-37576] - Support built-in K8s executor roll plugin
  • [SPARK-37590] - Unify v1 and v2 ALTER NAMESPACE ... SET PROPERTIES tests
  • [SPARK-37613] - Support ANSI Aggregate Function: regr_count
  • [SPARK-37614] - Support ANSI Aggregate Function: regr_avgx & regr_avgy
  • [SPARK-37619] - Upgrade Maven to 3.8.4
  • [SPARK-37620] - Use more precise types for SparkContext Optional fields (i.e. _gateway, _jvm)
  • [SPARK-37622] - Support K8s executor rolling policy
  • [SPARK-37632] - Drop code targetting Python < 3.7
  • [SPARK-37636] - Migrate CREATE NAMESPACE to use v2 command by default
  • [SPARK-37638] - Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark
  • [SPARK-37641] - Support ANSI Aggregate Function: regr_r2
  • [SPARK-37644] - Support datasource v2 complete aggregate pushdown
  • [SPARK-37645] - Spelling error: "labeled" spelled as "labled"
  • [SPARK-37651] - Use existing active Spark session in all places of pandas API on Spark
  • [SPARK-37652] - Support optimize skewed join through union
  • [SPARK-37653] - Upgrade RoaringBitmap to 0.9.23
  • [SPARK-37655] - Add RocksDB Implementation for KVStore
  • [SPARK-37664] - Add InMemoryColumnarBenchmark and StateStoreBasicOperationsBenchmark Java 11/17 result
  • [SPARK-37669] - Remove unnecessary usages of OrderedDict
  • [SPARK-37673] - Implement `ps.timedelta_range` method
  • [SPARK-37675] - Prevent overwriting of push shuffle merged files once the shuffle is finalized
  • [SPARK-37679] - Add a new executor roll policy, FAILED_TASKS
  • [SPARK-37680] - Support RocksDB backend in Spark History Server
  • [SPARK-37684] - Upgrade log4j to 2.17
  • [SPARK-37685] - Make log event immutable for LogAppender
  • [SPARK-37695] - Skip diagnosis of merged blocks from push-based shuffle
  • [SPARK-37699] - Fix failed K8S integration test in SparkConfPropagateSuite
  • [SPARK-37707] - Allow store assignment between TimestampNTZ and Date/Timestamp
  • [SPARK-37709] - Add AVERAGE_DURATION executor roll policy
  • [SPARK-37714] - ANSI mode: allow casting between numeric type and timestamp type
  • [SPARK-37719] - Try to remove `--add-exports` compile option for Java 17
  • [SPARK-37727] - Show ignored confs & hide warnings for conf already set in SparkSession.builder.getOrCreate
  • [SPARK-37729] - SparkSession.setLogLevel not working in Spark Shell
  • [SPARK-37732] - Improve the implementation of JDBCV2Suite
  • [SPARK-37734] - Upgrade h2 from 1.4.195 to 2.0.202
  • [SPARK-37735] - Add appId interface to KubernetesConf
  • [SPARK-37741] - Remove Jenkins badge in README.md
  • [SPARK-37746] - log4j2-defaults.properties is not working since log4j 2 is always initialized by default
  • [SPARK-37755] - Optimize RocksDB KVStore configurations
  • [SPARK-37760] - Upgrade SBT to 1.6.0
  • [SPARK-37768] - Schema pruning for the metadata struct
  • [SPARK-37769] - Filter on the metadata struct
  • [SPARK-37773] - Disable certain doctests of `ps.to_timedelta` for pandas<=1.0.5
  • [SPARK-37774] - Upgrade log4j from 2.17 to 2.17.1
  • [SPARK-37775] - [PYSPARK] Fix mlflow doctest
  • [SPARK-37790] - Upgrade SLF4J to 1.7.32
  • [SPARK-37791] - Use log4j2 in examples.
  • [SPARK-37792] - Spark shell sets log level to INFO by default
  • [SPARK-37794] - Remove log4j bridge api usage
  • [SPARK-37795] - Add a scalastyle rule to ban `org.apache.log4j`
  • [SPARK-37801] - List PyPy3 installed libraries in build_and_test workflow
  • [SPARK-37804] - Unify v1 and v2 CREATE NAMESPACE tests
  • [SPARK-37805] - Refactor TestUtils#configTestLog4j method to use log4j2 api
  • [SPARK-37806] - Support minimum number of tasks per executor before being rolled
  • [SPARK-37819] - Add OUTLIER executor roll policy and use it by default
  • [SPARK-37824] - Document K8s executor rolling configurations
  • [SPARK-37827] - Put some built-in table properties into V1Table.properties to adapt to V2 commands
  • [SPARK-37839] - DS V2 supports partial aggregate push-down AVG
  • [SPARK-37843] - Suppress NoSuchFieldError at setMDCForTask
  • [SPARK-37844] - Remove slf4j-log4j12 dependency from hadoop-minikdc
  • [SPARK-37847] - PushBlockStreamCallback should check isTooLate first to avoid NPE
  • [SPARK-37853] - Clean up deprecation compilation warning related to log4j2
  • [SPARK-37858] - Throw Spark exceptions from AES functions
  • [SPARK-37864] - Support Parquet v2 data page RLE encoding (for Boolean Values) for the vectorized path
  • [SPARK-37866] - Set file.encoding to UTF-8 for SBT tests
  • [SPARK-37867] - Compile aggregate functions of built-in JDBC dialect
  • [SPARK-37870] - Enable Apple Silicon Jenkins CI (Java/Scala/Python/R)
  • [SPARK-37875] - Support ARM64 in Java 17 docker image
  • [SPARK-37878] - Migrate SHOW CREATE TABLE to use v2 command by default
  • [SPARK-37880] - Upgrade Scala to 2.13.8
  • [SPARK-37887] - PySpark shell sets log level to INFO by default
  • [SPARK-37889] - Log4j2 MarkerFilter can not filter unnecessary thrift errors
  • [SPARK-37923] - Generate partition transforms for BucketSpec inside parser
  • [SPARK-37929] - Support cascade mode for `dropNamespace` API
  • [SPARK-37937] - Use error classes in the parsing errors of lateral join
  • [SPARK-37941] - Use error classes in the compilation errors of casting
  • [SPARK-37943] - Use error classes in the compilation errors of grouping
  • [SPARK-37957] - Deterministic flag is not handled for V2 functions
  • [SPARK-37960] - A new framework to represent catalyst expressions in DS v2 APIs
  • [SPARK-37979] - Switch to more generic error classes in AES functions
  • [SPARK-37983] - Backout agg build time metrics from sort aggregate
  • [SPARK-37986] - Support TimestampNTZ radix sort
  • [SPARK-37990] - Support TimestampNTZ in RowToColumnConverter
  • [SPARK-37995] - TPCDS 1TB q72 fails when spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly is false
  • [SPARK-37998] - Use `rbac.authorization.k8s.io/v1` instead of `v1beta1`
  • [SPARK-38001] - Replace the unsupported error classes by `UNSUPPORTED_FEATURE`
  • [SPARK-38013] - AQE can change broadcast hash join to sort merge join if no extra shuffle is introduced
  • [SPARK-38015] - Mark legacy file naming functions as deprecated in FileCommitProtocol
  • [SPARK-38019] - ExecutorMonitor.timedOutExecutors should be deterministic
  • [SPARK-38022] - Use relativePath for K8s remote file test in BasicTestsSuite
  • [SPARK-38023] - ExecutorMonitor.onExecutorRemoved should handle ExecutorDecommission as finished
  • [SPARK-38029] - Support docker-desktop K8S integration test in SBT
  • [SPARK-38030] - Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
  • [SPARK-38047] - Add OUTLIER_NO_FALLBACK executor roll policy
  • [SPARK-38048] - Add IntegrationTestBackend.describePods to support all K8s test backends
  • [SPARK-38049] - Use Java 17 in K8s integration tests
  • [SPARK-38062] - FallbackStorage shouldn't attempt to resolve arbitrary "remote" hostname
  • [SPARK-38071] - Support K8s namespace parameter in SBT K8s IT
  • [SPARK-38072] - Support K8s imageTag parameter in SBT K8s IT
  • [SPARK-38081] - Support cloud-backend in K8s IT with SBT
  • [SPARK-38085] - DataSource V2: Handle DELETE commands for group-based sources
  • [SPARK-38095] - HistoryServerDiskManager.appStorePath should use backend-based extensions
  • [SPARK-38097] - Improve the error for pivoting of unsupported value types
  • [SPARK-38103] - Use error classes in the parsing errors of transform
  • [SPARK-38104] - Use error classes in the parsing errors of windows
  • [SPARK-38105] - Use error classes in the parsing errors of joins
  • [SPARK-38107] - Use error classes in the compilation errors of python/pandas UDFs
  • [SPARK-38112] - Use error classes in the execution errors of date/timestamp handling
  • [SPARK-38113] - Use error classes in the execution errors of pivoting
  • [SPARK-38125] - Use static factory methods instead of the deprecated `Byte/Short/Integer/Long` constructors
  • [SPARK-38126] - Check the whole message of error classes
  • [SPARK-38131] - Keep only user-facing error classes
  • [SPARK-38145] - Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True
  • [SPARK-38155] - Disallow distinct aggregate in lateral subqueries with unsupported correlated predicates
  • [SPARK-38157] - Fix /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite under ANSI mode
  • [SPARK-38159] - Minor refactor of MetadataAttribute unapply method
  • [SPARK-38162] - Optimize one row plan in normal and AQE Optimizer
  • [SPARK-38163] - Preserve the error class of `AnalysisException` while constructing of function builder
  • [SPARK-38164] - New SQL function: try_subtract and try_multiply
  • [SPARK-38176] - ANSI mode: allow implicitly casting String to other simple types
  • [SPARK-38180] - Allow safe up-cast expressions in correlated equality predicates
  • [SPARK-38187] - Support resource reservation (Introduce minCPU/minMemory) with volcano implementations
  • [SPARK-38188] - Support queue scheduling (Introduce queue) with volcano implementations
  • [SPARK-38196] - Refactor framework so that a JDBC dialect can compile expressions its own way
  • [SPARK-38203] - Fix SQLInsertTestSuite and SchemaPruningSuite under ANSI mode
  • [SPARK-38226] - Fix HiveCompatibilitySuite under ANSI mode
  • [SPARK-38228] - Legacy store assignment should not fail on error under ANSI mode
  • [SPARK-38232] - Explain formatted does not collect subqueries under query stage in AQE
  • [SPARK-38241] - Close KubernetesClient in K8S integrations tests
  • [SPARK-38244] - Upgrade kubernetes-client to 5.12.1
  • [SPARK-38246] - Refactor KVUtils and add UTs related to RocksDB
  • [SPARK-38251] - Change Cast.toString as "cast" instead of "ansi_cast" under ANSI mode
  • [SPARK-38268] - Hide the "failOnError" field in the toString method of Abs/CheckOverflow
  • [SPARK-38272] - Use docker-desktop instead of docker-for-desktop for Docker K8S IT deployMode and context name
  • [SPARK-38276] - Add approved TPCDS plan under ANSI mode
  • [SPARK-38281] - Fix AnalysisSuite under ANSI mode
  • [SPARK-38283] - Test invalid datetime parsing under ANSI mode
  • [SPARK-38290] - Fix JsonSuite and ParquetIOSuite under ANSI mode
  • [SPARK-38295] - Fix ArithmeticExpressionSuite under ANSI mode
  • [SPARK-38298] - Fix DataExpressionSuite, NullExpressionsSuite, StringExpressionsSuite, complexTypesSuite, CastSuite under ANSI mode
  • [SPARK-38302] - Use Java 17 in K8S integration tests when setting spark-tgz
  • [SPARK-38306] - Fix ExplainSuite,StatisticsCollectionSuite and StringFunctionsSuite under ANSI mode
  • [SPARK-38307] - Fix ExpressionTypeCheckingSuite and CollectionExpressionsSuite under ANSI mode
  • [SPARK-38311] - Fix DynamicPartitionPruning/BucketedReadSuite/ExpressionInfoSuite under ANSI mode
  • [SPARK-38312] - Use error classes in org.apache.spark.metrics
  • [SPARK-38316] - Fix SQLViewSuite/TriggerAvailableNowSuite/UnwrapCastInBinaryComparisonSuite/UnwrapCastInComparisonEndToEndSuite under ANSI mode
  • [SPARK-38321] - Fix BooleanSimplificationSuite under ANSI mode
  • [SPARK-38325] - ANSI mode: avoid potential runtime error in HashJoin.extractKeyExprAt()
  • [SPARK-38343] - Fix SQLQuerySuite under ANSI mode
  • [SPARK-38352] - Fix DataFrameAggregateSuite/DataFrameSetOperationsSuite/DataFrameWindowFunctionsSuite under ANSI mode
  • [SPARK-38361] - Add factory method getConnection into JDBCDialect.
  • [SPARK-38363] - Avoid runtime error in Dataset.summary() when ANSI mode is on
  • [SPARK-38383] - Support APP_ID and EXECUTOR_ID placeholder in annotations
  • [SPARK-38385] - Improve error messages of 'mismatched input' cases from ANTLR
  • [SPARK-38387] - Support `na_action` and Series input correspondence in `Series.map`
  • [SPARK-38391] - Datasource v2 supports partial topN push-down
  • [SPARK-38392] - Add `spark-` prefix to namespaces and `-driver` suffix to drivers during IT
  • [SPARK-38398] - Add `priorityClassName` integration test case
  • [SPARK-38400] - Enable Series.rename to change index labels
  • [SPARK-38406] - Improve performance of ShufflePartitionsUtil createSkewPartitionSpecs
  • [SPARK-38407] - ANSI Cast: loosen the limitation of casting non-null complex types
  • [SPARK-38410] - Support specifying the initial partition number for rebalance
  • [SPARK-38417] - Remove `Experimental` from `RDD.cleanShuffleDependencies` API
  • [SPARK-38418] - Add PySpark cleanShuffleDependencies API
  • [SPARK-38423] - Support priority scheduling with volcano implementations
  • [SPARK-38430] - Add SBT commands to K8s IT readme
  • [SPARK-38432] - Refactor framework so that a JDBC dialect can compile filters its own way
  • [SPARK-38442] - Fix ConstantFoldingSuite/ColumnExpressionSuite/DataFrameSuite/AdaptiveQueryExecSuite under ANSI mode
  • [SPARK-38450] - Fix HiveQuerySuite/PushFoldableIntoBranchesSuite/TransposeWindowSuite
  • [SPARK-38451] - Fix R tests under ANSI mode
  • [SPARK-38452] - Support pyDockerfile and rDockerfile in SBT K8s IT
  • [SPARK-38453] - Add volcano section to K8s IT README.md
  • [SPARK-38455] - Support driver/executor PodGroup templates
  • [SPARK-38456] - Improve error messages of no viable alternative, extraneous input and missing token
  • [SPARK-38480] - Remove spark.kubernetes.job.queue in favor of spark.kubernetes.driver.podGroupTemplateFile
  • [SPARK-38481] - Substitute Java overflow exception from TIMESTAMPADD by Spark exception
  • [SPARK-38486] - Upgrade the minimum Minikube version to 1.18.0
  • [SPARK-38490] - Add Github action test job for ANSI SQL mode
  • [SPARK-38491] - Support `ignore_index` of `Series.sort_values`
  • [SPARK-38501] - Fix thriftserver test failures under ANSI mode
  • [SPARK-38504] - Can't read TimestampNTZ as TimestampLTZ
  • [SPARK-38508] - Volcano feature doesn't work on EKS graviton instances
  • [SPARK-38511] - Remove priorityClassName propagation in favor of explicit settings
  • [SPARK-38513] - Move custom scheduler-specific configs to under `spark.kubernetes.scheduler.NAME` prefix
  • [SPARK-38515] - Volcano queue is not deleted
  • [SPARK-38518] - Implement `skipna` of `Series.all/Index.all` to exclude NA/null values
  • [SPARK-38519] - AQE throw exception should respect SparkFatalException
  • [SPARK-38524] - Fix Volcano weight to be positive integer and use cpu capability instead
  • [SPARK-38527] - Set the minimum Volcano version
  • [SPARK-38533] - DS V2 aggregate push-down supports project with alias
  • [SPARK-38534] - Disable to_timestamp('366', 'DD') test case
  • [SPARK-38537] - Unify Statefulset* to StatefulSet*
  • [SPARK-38538] - Fix driver environment verification in BasicDriverFeatureStepSuite
  • [SPARK-38544] - Upgrade log4j2 to 2.17.2
  • [SPARK-38548] - New SQL function: try_sum
  • [SPARK-38553] - Bump minimum Volcano version to v1.5.1
  • [SPARK-38560] - If `Sum`, `Count`, or `Any` accompany DISTINCT, partial aggregate push-down cannot be done
  • [SPARK-38561] - Add doc for "Customized Kubernetes Schedulers"
  • [SPARK-38562] - Add doc for Volcano scheduler
  • [SPARK-38589] - New SQL function: try_avg
  • [SPARK-38590] - New SQL function: try_to_binary
  • [SPARK-38616] - Keep track of SQL query text in Catalyst TreeNode
  • [SPARK-38625] - DataSource V2: Add APIs for group-based row-level operations
  • [SPARK-38626] - Make condition in DeleteFromTable required
  • [SPARK-38633] - Support push down Cast to JDBC data source V2
  • [SPARK-38644] - DS V2 topN push-down supports project with alias
  • [SPARK-38676] - Provide query context in runtime error of Add/Subtract/Multiply
  • [SPARK-38698] - Provide query context in runtime error of Divide/Div/Remainder/Pmod
  • [SPARK-38716] - Provide query context in runtime error when a map key does not exist
  • [SPARK-38761] - DS V2 supports push down misc non-aggregate functions
  • [SPARK-38762] - Provide query context in Decimal overflow errors
  • [SPARK-38763] - Pandas API on Spark can't apply lambda to columns
  • [SPARK-38787] - Possible correctness issue on stream-stream join when handling edge case
  • [SPARK-38791] - Output parameter values of error classes in SQL style
  • [SPARK-38809] - Implement option to skip null values in symmetric hash impl of stream-stream joins
  • [SPARK-38813] - Remove TimestampNTZ type support in Spark 3.3
  • [SPARK-38817] - Upgrade kubernetes-client to 5.12.2
  • [SPARK-38828] - Remove TimestampNTZ type Python support in Spark 3.3
  • [SPARK-38829] - New configuration for controlling timestamp inference of Parquet
  • [SPARK-38837] - Implement `dropna` parameter of `SeriesGroupBy.value_counts`
  • [SPARK-38855] - DS V2 supports push down math functions
  • [SPARK-38865] - Update document of JDBC options for pushDownAggregate and pushDownLimit
  • [SPARK-38891] - Skip allocating vectors for repetition & definition levels when possible
  • [SPARK-38908] - Provide query context in runtime error of Casting from String to Number/Date/Timestamp/Boolean
  • [SPARK-38913] - Output identifiers in error messages in SQL style
  • [SPARK-38926] - Output types in error messages in SQL style
  • [SPARK-38949] - Wrap SQL statements by double quotes in error messages
  • [SPARK-38950] - Return Array of Predicate for SupportsPushDownCatalystFilters.pushedFilters
  • [SPARK-38967] - Turn spark.sql.ansi.strictIndexOperator into internal config
  • [SPARK-38996] - Use double quotes for types in error messages
  • [SPARK-38997] - DS V2 aggregate push-down supports group by expressions
  • [SPARK-39007] - Use double quotes for SQL configs in error messages
  • [SPARK-39027] - Output SQL statements in upper case in error messages
  • [SPARK-39040] - Respect NaNvl in EquivalentExpressions for expression elimination
  • [SPARK-39046] - Return an empty context string if TreeNode.origin is wrongly set
  • [SPARK-39087] - Improve error messages: step 1
  • [SPARK-39105] - Add ConditionalExpression trait
  • [SPARK-39106] - Correct conditional expression constant folding
  • [SPARK-39121] - Fix doc format/syntax error
  • [SPARK-39135] - DS V2 aggregate partial push-down should support group by without aggregate functions
  • [SPARK-39157] - H2Dialect should override getJDBCType so that the data type is correct
  • [SPARK-39162] - JDBC dialects should decide which functions can be pushed down
  • [SPARK-39164] - Wrap asserts/illegal state exceptions by the INTERNAL_ERROR exception in actions
  • [SPARK-39165] - Replace sys.error by IllegalStateException in Spark SQL
  • [SPARK-39166] - Provide runtime error query context for Binary Arithmetic when WSCG is off
  • [SPARK-39175] - Provide runtime error query context for Cast when WSCG is off
  • [SPARK-39177] - Provide query context for the map-key-does-not-exist error when WSCG is off
  • [SPARK-39187] - Remove SparkIllegalStateException
  • [SPARK-39190] - Provide query context for decimal precision overflow error when WSCG is off
  • [SPARK-39193] - Speed up Timestamp type inference of the default format in JSON/CSV data sources
  • [SPARK-39208] - Fix query context bugs in decimal overflow under codegen mode
  • [SPARK-39210] - Provide query context of Decimal overflow in AVG when WSCG is off
  • [SPARK-39212] - Use double quotes for values of SQL configs/DS options in error messages
  • [SPARK-39214] - Improve errors related to CAST
  • [SPARK-39229] - Separate query contexts from error-classes.json
  • [SPARK-39234] - Code clean up in SparkThrowableHelper.getMessage
  • [SPARK-39243] - Describe the rules of quoting elements in error messages
  • [SPARK-39255] - Improve error messages: step 2
  • [SPARK-39272] - Increase the start position of query context by 1
  • [SPARK-39322] - Remove `Experimental` from `spark.dynamicAllocation.shuffleTracking.enabled`
  • [SPARK-39327] - ExecutorRollPolicy.ID should treat the ID as a number
  • [SPARK-39346] - Convert asserts/illegal state exception to internal errors on each phase
  • [SPARK-39886] - Disable DEFAULT column SQLConf until implementation is complete
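
The try_* functions above (SPARK-37533, SPARK-38164, SPARK-38548, SPARK-38589) and the ILIKE API (SPARK-36778, SPARK-36882, SPARK-36899) are user-facing enough to warrant a quick illustration. Below is a minimal PySpark sketch, assuming a local session; the app name, sample data, and column names are illustrative assumptions, not taken from the issues:

```python
# Minimal sketch (illustrative): the try_* functions return NULL instead of
# raising a runtime error, which matters under ANSI mode; ILIKE is a
# case-insensitive LIKE. App and column names below are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-330-notes-sketch").getOrCreate()

spark.sql("""
    SELECT
      try_element_at(array(1, 2, 3), 5) AS missing_elem,  -- NULL, no error
      try_multiply(2147483647, 2)       AS no_overflow,   -- NULL on int overflow
      try_sum(v)                        AS safe_sum,      -- 3
      try_avg(v)                        AS safe_avg       -- 1.5
    FROM VALUES (1), (2), (NULL) AS t(v)
""").show()

# ILIKE in SQL and as the Column.ilike method (SPARK-36882).
spark.sql("SELECT 'Spark' ILIKE '_park' AS matched").show()
df = spark.createDataFrame([("Spark",), ("FLINK",)], ["name"])
df.filter(df.name.ilike("spark")).show()  # keeps only the 'Spark' row
```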
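
SPARK-37228 above adds DataFrame.mapInArrow to PySpark. A minimal sketch of its shape, assuming pyarrow is installed; the doubling transform and column name are purely illustrative:

```python
# Minimal sketch (illustrative) of DataFrame.mapInArrow (SPARK-37228): the
# user function receives an iterator of pyarrow.RecordBatch and must yield
# RecordBatches that match the declared output schema.
import pyarrow as pa
import pyarrow.compute as pc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(8).toDF("x")  # one bigint column named x

def double_batches(batches):
    for batch in batches:
        doubled = pc.multiply(batch.column(0), 2)  # double the x column
        yield pa.RecordBatch.from_arrays([doubled], names=["x"])

# The output schema is declared up front, as with mapInPandas.
df.mapInArrow(double_batches, schema="x long").show()
```

Unlike mapInPandas, the user function sees pyarrow.RecordBatch objects directly, skipping the Arrow-to-pandas conversion on each batch.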

Bug

  • [SPARK-8582] - Optimize checkpointing to avoid computing an RDD twice
  • [SPARK-18621] - PySpark SQL types (aka DataFrame schema) have __repr__() with a Scala rather than a Python representation
  • [SPARK-23626] - DAGScheduler blocked due to JobSubmitted event
  • [SPARK-30062] - Add IMMEDIATE statement to the DB2 dialect truncate implementation
  • [SPARK-30537] - toPandas gets wrong dtypes when applied to an empty DataFrame with Arrow enabled
  • [SPARK-32079] - PySpark <> Beam pickling issues for collections.namedtuple
  • [SPARK-33206] - Spark Shuffle Index Cache calculates memory usage incorrectly
  • [SPARK-34521] - spark.createDataFrame does not support Pandas StringDtype extension type
  • [SPARK-34805] - PySpark loses metadata in DataFrame fields when selecting nested columns
  • [SPARK-35011] - False active executor in UI caused by BlockManager re-registration
  • [SPARK-35430] - Investigate the failure of "PVs with local storage" integration test on Docker driver
  • [SPARK-35531] - Cannot insert into a Hive bucketed table created with an upper-case schema
  • [SPARK-35561] - Partition result is incorrect when inserting into a partitioned table with an int partition column
  • [SPARK-35672] - Spark fails to launch executors with very large user classpath lists on YARN
  • [SPARK-35803] - Spark SQL does not support creating views using DataSource v2 based data sources
  • [SPARK-35881] - [SQL] AQE does not support columnar execution for the final query stage
  • [SPARK-35912] - [SQL] JSON read behavior is different depending on the cache setting when nullable is false.
  • [SPARK-35929] - Schema inference of nested structs defaults to map
  • [SPARK-36004] - Update MiMa and audit Scala/Java API changes
  • [SPARK-36007] - Failed to run benchmark in GA
  • [SPARK-36009] - Missing GraphX classes in registerKryoClasses util method
  • [SPARK-36013] - Upgrade Dropwizard Metrics to 4.2.2
  • [SPARK-36014] - Use uuid as app id in kubernetes client mode
  • [SPARK-36026] - Upgrade Kubernetes Client Version to 5.5.0
  • [SPARK-36036] - Regression: Remote blocks stored on disk by BlockManager are not deleted
  • [SPARK-36052] - Introduce pending pod limit for Spark on K8s
  • [SPARK-36122] - Spark does not pass needClientAuth on to Jetty's SSLContextFactory, so mTLS authentication cannot be configured
  • [SPARK-36169] - Make 'spark.sql.sources.disabledJdbcConnProviderList' a static conf (as documented)
  • [SPARK-36211] - type check fails for `F.udf(...).asNonDeterministic()`
  • [SPARK-36237] - SparkUI should bind handler after application started
  • [SPARK-36242] - Ensure the spill file is closed before setting success to true in ExternalSorter.spillMemoryIteratorToDisk
  • [SPARK-36327] - Spark SQL creates the staging directory inside the database directory rather than inside the table directory
  • [SPARK-36341] - In the stage page's 'Aggregated Metrics by Executor' table, the underline shown when hovering over a link is blocked
  • [SPARK-36348] - unexpected Index loaded: pd.Index([10, 20, None], name="x")
  • [SPARK-36358] - Upgrade Kubernetes Client Version to 5.6.0
  • [SPARK-36379] - Null at root level of a JSON array causes the parsing failure (w/ permissive mode)
  • [SPARK-36382] - Remove noisy footer from the summary table for metrics
  • [SPARK-36383] - NullPointerException throws during executor shutdown
  • [SPARK-36389] - Revert the change that accepts negative mapId in ShuffleBlockId
  • [SPARK-36391] - Improve the error message when fetching a chunk throws an NPE
  • [SPARK-36421] - Validate all SQL configs to prevent wrong use of ConfigEntry
  • [SPARK-36433] - Logs should show correct URL of where HistoryServer is started
  • [SPARK-36448] - Exceptions in NoSuchItemException.scala have to be case classes to preserve specific exceptions
  • [SPARK-36488] - "Invalid usage of '*' in expression" error due to the feature of 'quotedRegexColumnNames' in some scenarios.
  • [SPARK-36507] - Remove/Replace missing links to AMP Camp materials from index.md
  • [SPARK-36512] - Fix UISeleniumSuite in sql/hive-thriftserver
  • [SPARK-36553] - KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 was introduced
  • [SPARK-36554] - Error message when using Spark SQL functions directly on DataFrame columns without a select expression
  • [SPARK-36568] - Missed broadcast join in V2 plan
  • [SPARK-36605] - Upgrade Jackson to 2.12.5
  • [SPARK-36627] - Tasks with Java proxy objects fail to deserialize
  • [SPARK-36681] - Fail to load Snappy codec
  • [SPARK-36700] - BlockManager re-registration is broken due to deferred removal of BlockManager
  • [SPARK-36717] - Wrong order of variable initialization may lead to incorrect behavior
  • [SPARK-36733] - Performance issue in SchemaPruning when a struct has many fields
  • [SPARK-36773] - Fix the unit tests that check Parquet compression
  • [SPARK-36798] - When SparkContext is stopped, metrics system should be flushed after listeners have finished processing
  • [SPARK-36804] - Using the verbose parameter in yarn mode would cause application submission failure
  • [SPARK-36806] - Use R 4.0.4 in K8s R image
  • [SPARK-36861] - Partition columns are overly eagerly parsed as dates
  • [SPARK-36867] - Misleading Error Message with Invalid Column and Group By
  • [SPARK-36889] - Respect `spark.sql.parquet.filterPushdown` by explain() for DSv2
  • [SPARK-36896] - Return boolean for `dropTempView` and `dropGlobalTempView`
  • [SPARK-36905] - Reading Hive view without explicit column names fails in Spark
  • [SPARK-36985] - Future typing errors in pyspark.pandas
  • [SPARK-36993] - Fix json_tuple throwing an NPE when fields contain a foldable null value
  • [SPARK-37004] - Job cancellation causes py4j errors on Jupyter due to pinned thread mode
  • [SPARK-37017] - Reduce the scope of synchronized to prevent deadlock.
  • [SPARK-37026] - Ensure the element type of ResolvedRFormula.terms is scala.Seq for Scala 2.13
  • [SPARK-37046] - Alter view does not preserve column case
  • [SPARK-37049] - executorIdleTimeout is not working for pending pods on K8s
  • [SPARK-37052] - Fix Spark 3.2 so that --verbose can be used with spark-shell
  • [SPARK-37057] - Fix wrong DocSearch facet filter in release-tag.sh
  • [SPARK-37059] - Ensure the sort order of the output in the PySpark doctests
  • [SPARK-37060] - Report driver status does not handle response from backup masters
  • [SPARK-37061] - Custom V2 Metrics uses wrong classname for lookup
  • [SPARK-37064] - Fix outer join return the wrong max rows if other side is empty
  • [SPARK-37069] - HiveClientImpl throws NoSuchMethodError: org.apache.hadoop.hive.ql.metadata.Hive.getWithoutRegisterFns
  • [SPARK-37076] - Implement StructType.toString explicitly for Scala 2.13
  • [SPARK-37078] - Support old 3-parameter Sink constructors
  • [SPARK-37079] - Fix DataFrameWriterV2.partitionedBy to send the arguments to JVM properly
  • [SPARK-37086] - Fix the R test of FPGrowthModel for Scala 2.13
  • [SPARK-37088] - Python UDF after off-heap vectorized reader can cause crash due to use-after-free in writer thread
  • [SPARK-37089] - ParquetFileFormat registers task completion listeners lazily, causing Python writer thread to segfault when off-heap vectorized reader is enabled
  • [SPARK-37098] - Alter table properties should invalidate cache
  • [SPARK-37102] - Missing dependencies for hadoop-azure
  • [SPARK-37103] - Switch from Maven to SBT to build Spark on AppVeyor
  • [SPARK-37112] - Fix MiMa failure with Scala 2.13
  • [SPARK-37117] - Can't read files in one of the Parquet encryption modes (external key material)
  • [SPARK-37121] - TestUtils.isPythonVersionAtLeast38 returns incorrect results
  • [SPARK-37135] - Fix some micro-benchmarks that fail to run
  • [SPARK-37141] - WorkerSuite cannot run on Mac OS
  • [SPARK-37143] - Supplement the missing Java 11 benchmark result files
  • [SPARK-37147] - MetricsReporter produces a NullPointerException when the element 'triggerExecution' is not present in the map
  • [SPARK-37170] - Pin PySpark version installed in the Binder environment for tagged commit
  • [SPARK-37191] - Allow merging DecimalTypes with different precision values
  • [SPARK-37196] - NPE in org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:106)
  • [SPARK-37202] - Temp view didn't collect temp functions registered with the catalog API
  • [SPARK-37203] - Fix NotSerializableException when observe with TypedImperativeAggregate
  • [SPARK-37209] - YarnShuffleIntegrationSuite and two other similar cases in `resource-managers` tests failed
  • [SPARK-37217] - The number of dynamic partitions should be checked early when writing to external tables
  • [SPARK-37252] - Ignore test_memory_limit on non-Linux environment
  • [SPARK-37253] - try_simplify_traceback should not fail when tb_frame.f_lineno is None
  • [SPARK-37290] - Exponential planning time in case of non-deterministic function
  • [SPARK-37302] - Explicitly download the dependencies of guava and jetty-io in test-dependencies.sh
  • [SPARK-37308] - Flaky Test: DDLParserSuite.create view -- basic
  • [SPARK-37314] - Upgrade kubernetes-client to 5.10.1
  • [SPARK-37315] - Mitigate ConcurrentModificationException thrown from a test in MLEventSuite
  • [SPARK-37318] - Make FallbackStorageSuite robust in terms of DNS
  • [SPARK-37320] - Delete py_container_checks.zip after the test in DepsTestsSuite finishes
  • [SPARK-37356] - Add fine grained locking to BlockInfoManager
  • [SPARK-37374] - StatCounter should use mergeStats when merging with self.
  • [SPARK-37388] - WidthBucket throws NullPointerException in WholeStageCodegenExec
  • [SPARK-37390] - Buggy method retrieval in pyspark.docs.conf.setup
  • [SPARK-37391] - Significant bottleneck introduced by the fix for SPARK-32001
  • [SPARK-37392] - Catalyst optimizer is very time-consuming and memory-intensive with some "explode(array)" queries
  • [SPARK-37452] - Char and Varchar break backward compatibility between v3 and v2
  • [SPARK-37459] - Upgrade commons-cli to 1.5.0
  • [SPARK-37465] - PySpark tests failing on Pandas 0.23
  • [SPARK-37480] - Configurations in docs/running-on-kubernetes.md are not up to date
  • [SPARK-37481] - Disappearance of skipped stages misleads bug hunting
  • [SPARK-37498] - test_reuse_worker_of_parallelize_range is flaky
  • [SPARK-37524] - We should drop all tables after testing dynamic partition pruning
  • [SPARK-37534] - Bump dev.ludovic.netlib to 2.2.1
  • [SPARK-37544] - sequence over dates with month interval is producing incorrect results
  • [SPARK-37545] - V2 CreateTableAsSelect command should qualify location
  • [SPARK-37546] - V2 ReplaceTableAsSelect command should qualify location
  • [SPARK-37554] - Add PyArrow, pandas and plotly to release Docker image dependencies
  • [SPARK-37556] - Deserializing the void class fails with Java serialization
  • [SPARK-37569] - View Analysis incorrectly marks nested fields as nullable
  • [SPARK-37573] - IsolatedClient fallbackVersion should be the built-in version, not always 2.7.4
  • [SPARK-37575] - null values should be saved as nothing rather than quoted empty Strings "" with default settings
  • [SPARK-37577] - ClassCastException: ArrayType cannot be cast to StructType
  • [SPARK-37585] - DSV2 InputMetrics are not getting updated in a corner case
  • [SPARK-37598] - Pyspark's newAPIHadoopRDD() method fails with ShortWritables
  • [SPARK-37615] - Upgrade SBT to 1.5.6
  • [SPARK-37633] - Unwrap cast should skip if downcast failed with ansi enabled
  • [SPARK-37635] - SHOW TBLPROPERTIES should print the fully qualified table name
  • [SPARK-37643] - When charVarcharAsString is true, queries on char-type partitioned tables return incorrect results
  • [SPARK-37654] - Regression - NullPointerException in Row.getSeq when the field is null
  • [SPARK-37656] - Upgrade SBT to 1.5.7
  • [SPARK-37658] - Skip PIP packaging test if Python version is lower than 3.7
  • [SPARK-37659] - Fix FsHistoryProvider race condition between listing and deleting log info
  • [SPARK-37663] - Mitigate ConcurrentModificationException thrown from tests in SparkContextSuite
  • [SPARK-37668] - 'Index' object has no attribute 'levels' in pyspark.pandas.frame.DataFrame.insert
  • [SPARK-37678] - Incorrect annotations in SeriesGroupBy._cleanup_and_return
  • [SPARK-37690] - Recursive view `df` detected (cycle: `df` -> `df`)
  • [SPARK-37693] - Fix ChildProcAppHandleSuite failed in spark-master-test-maven-hadoop-3.2
  • [SPARK-37694] - Disallow deleting resources in the Spark SQL CLI
  • [SPARK-37703] - Upgrade SBT to 1.5.8
  • [SPARK-37713] - No namespace assigned in Executor Pod ConfigMap
  • [SPARK-37721] - Failed to execute pyspark test in Win WSL
  • [SPARK-37728] - reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException
  • [SPARK-37730] - plot.hist throws AttributeError on pandas=1.3.5
  • [SPARK-37754] - Fix black version in dev/reformat-python
  • [SPARK-37763] - Upgrade jackson to 2.13.1
  • [SPARK-37778] - Upgrade SBT to 1.6.1
  • [SPARK-37779] - Make ColumnarToRowExec plan canonicalizable after (de)serialization
  • [SPARK-37793] - Invalid LocalMergedBlockData causes tasks to hang
  • [SPARK-37800] - TreeNode.argString incorrectly formats arguments of type Set[_]
  • [SPARK-37802] - composite field name like `field name` doesn't work with Aggregate push down
  • [SPARK-37807] - Fix a typo in HttpAuthenticationException message
  • [SPARK-37820] - Replace ApacheCommonBase64 with JavaBase64 for string functions
  • [SPARK-37834] - Reenable length check in Python linter
  • [SPARK-37841] - BasicWriteTaskStatsTracker should not try to get status for a skipped file
  • [SPARK-37846] - TaskContext is used at wrong place in BlockManagerDecommissionIntegrationSuite
  • [SPARK-37855] - IllegalStateException when transforming an array inside a nested struct
  • [SPARK-37859] - SQL tables created with JDBC with Spark 3.1 are not readable with 3.2
  • [SPARK-37860] - [BUG] Revert: Fix taskid in the stage page task event timeline
  • [SPARK-37865] - Spark should not dedup the groupingExpressions when the first child of Union has duplicate columns
  • [SPARK-37874] - Link to Pandas UDF documentation is broken
  • [SPARK-37884] - Upgrade kubernetes-client to 5.10.2
  • [SPARK-37893] - Fix flaky test: AdaptiveQueryExecSuite with Scala 2.13
  • [SPARK-37895] - Error while joining two tables with non-English field names
  • [SPARK-37905] - Make `merge_spark_pr.py` set primary author from the first commit in case of ties
  • [SPARK-37918] - Use the specified constructor when instantiating SessionStateBuilder
  • [SPARK-37920] - Remove tab character and trailing space in pom.xml
  • [SPARK-37932] - Analyzer can fail when join left side and right side are the same view
  • [SPARK-37947] - Cannot use <func>_outer generators in a lateral view
  • [SPARK-37958] - PySpark SparkContext.addFile() does not respect spark.files.overwrite
  • [SPARK-37963] - Need to update Partition URI after renaming table in InMemoryCatalog
  • [SPARK-37972] - Typing incompatibilities with numpy==1.22.x
  • [SPARK-38016] - Fix the API doc for session_window to say it supports TimestampNTZType too as timeColumn
  • [SPARK-38018] - Fix ColumnVectorUtils.populate to handle CalendarIntervalType correctly
  • [SPARK-38042] - Encoder cannot be found when a tuple component is a type alias for an Array
  • [SPARK-38056] - Structured streaming not working in history server when using LevelDB
  • [SPARK-38060] - Inconsistent behavior from JSON option allowNonNumericNumbers
  • [SPARK-38067] - Inconsistent missing values handling in Pandas on Spark to_json
  • [SPARK-38073] - Update atexit function to avoid issues with late binding
  • [SPARK-38075] - Hive script transform with order by and limit will return fake rows
  • [SPARK-38118] - Func(wrong data type) in HAVING clause should throw data mismatch error
  • [SPARK-38120] - HiveExternalCatalog.listPartitions fails when the partition column name is upper case and the partition value contains a dot
  • [SPARK-38124] - Revive HashClusteredDistribution and apply to stream-stream join
  • [SPARK-38130] - array_sort does not allow non-orderable datatypes
  • [SPARK-38132] - Remove NotPropagation
  • [SPARK-38133] - Grouping by timestamp_ntz will sometimes corrupt the results
  • [SPARK-38139] - ml.recommendation.ALS doctests failures
  • [SPARK-38140] - Desc column stats (min, max) for timestamp type is not consistent with the value due to time zone difference
  • [SPARK-38146] - UDAF fails to aggregate TIMESTAMP_NTZ column
  • [SPARK-38151] - Handle `Pacific/Kanton` in DateTimeUtilsSuite
  • [SPARK-38171] - Upgrade ORC to 1.7.3
  • [SPARK-38173] - Quoted column cannot be recognized correctly when quotedRegexColumnNames is true
  • [SPARK-38178] - Correct the logic to measure the memory usage of RocksDB
  • [SPARK-38182] - Fix NoSuchElementException if pushed filter does not contain any references
  • [SPARK-38185] - Fix incorrect data when the aggregate function list is empty
  • [SPARK-38192] - Use try-with-resources in Level/RocksDBSuite.java
  • [SPARK-38198] - Fix `QueryExecution.debug#toFile` use the passed in `maxFields` when `explainMode` is `CodegenMode`
  • [SPARK-38201] - Fix KubernetesUtils#uploadFileToHadoopCompatibleFS use passed in `delSrc` and `overwrite`
  • [SPARK-38204] - All state operators are at a risk of inconsistency between state partitioning and operator partitioning
  • [SPARK-38206] - Relax the requirement of data type comparison for keys in stream-stream join
  • [SPARK-38221] - Group by a stream of complex expressions fails
  • [SPARK-38227] - Apply strict nullability of nested column in time window / session window
  • [SPARK-38236] - Absolute file paths specified in create/alter table are treated as relative
  • [SPARK-38239] - AttributeError: 'LogisticRegressionModel' object has no attribute '_call_java'
  • [SPARK-38243] - Unintended exception thrown in pyspark.ml.LogisticRegression.getThreshold
  • [SPARK-38271] - PoissonSampler may output more rows than MaxRows
  • [SPARK-38273] - decodeUnsafeRows's iterators should close underlying input streams
  • [SPARK-38275] - Consider to include WriteBatch's memory in the memory usage of RocksDB state store
  • [SPARK-38285] - ClassCastException: GenericArrayData cannot be cast to InternalRow
  • [SPARK-38286] - Union's maxRows and maxRowsPerPartition may overflow
  • [SPARK-38304] - Elt() should return null if the index is null under ANSI mode (see the example after this list)
  • [SPARK-38308] - Select of a stream of window expressions fails
  • [SPARK-38309] - SHS has incorrect percentiles for shuffle read bytes and shuffle total blocks metrics
  • [SPARK-38314] - Fail to read parquet files after writing the hidden file metadata in
  • [SPARK-38320] - (flat)MapGroupsWithState can timeout groups which just received inputs in the same microbatch
  • [SPARK-38333] - DPP cause DataSourceScanExec java.lang.NullPointerException
  • [SPARK-38344] - Avoid submitting tasks when there are no requests to push in push-based shuffle
  • [SPARK-38347] - Nullability propagation in transformUpWithNewOutput
  • [SPARK-38355] - Change mktemp() to mkstemp()
  • [SPARK-38357] - StackOverflowError with OR(data filter, partition filter)
  • [SPARK-38394] - Build of Spark SQL against hadoop-3.4.0-SNAPSHOT fails with a bouncycastle classpath error
  • [SPARK-38411] - Use UTF-8 when doMergeApplicationListingInternal reads event logs
  • [SPARK-38412] - `from` and `to` are swapped in the StateSchemaCompatibilityChecker
  • [SPARK-38416] - Change day to month
  • [SPARK-38436] - Fix `test_ceil` to test `ceil`
  • [SPARK-38446] - Deadlock between ExecutorClassLoader and FileDownloadCallback caused by Log4j
  • [SPARK-38458] - Fix always false condition in LogDivertAppender#initLayout
  • [SPARK-38516] - Add log4j-core, log4j-api and log4j-slf4j-impl to classpath if active hadoop-provided
  • [SPARK-38517] - Fix PySpark documentation generation (missing ipython_genutils)
  • [SPARK-38523] - Failure on referring to the corrupt record from CSV
  • [SPARK-38526] - fix misleading function alias name for RuntimeReplaceable
  • [SPARK-38528] - NullPointerException when selecting a generator in a Stream of aggregate expressions
  • [SPARK-38530] - GeneratorNestedColumnAliasing does not work correctly for some expressions
  • [SPARK-38542] - UnsafeHashedRelation should serialize numKeys out
  • [SPARK-38563] - Upgrade to Py4J 0.10.9.5
  • [SPARK-38567] - Enable GitHub Action build_and_test on branch-3.3
  • [SPARK-38579] - Requesting the REST API can cause NullPointerException
  • [SPARK-38583] - to_timestamp should allow numeric types
  • [SPARK-38586] - Trigger notifying workflow in branch-3.3 and other future branches
  • [SPARK-38587] - Validating new location for rename command should use formatted names
  • [SPARK-38600] - Include unit into the sql string of TIMESTAMPADD/DIFF
  • [SPARK-38604] - ceil and floor return different types when called from scala than sql
  • [SPARK-38612] - Fix Inline type hint for duplicated.keep
  • [SPARK-38630] - K8s app name label should start and end with alphanumeric char
  • [SPARK-38631] - Arbitrary shell command injection via Utils.unpack()
  • [SPARK-38640] - NPE when unpersisting a memory-only RDD with RDD fetching from shuffle service enabled
  • [SPARK-38652] - uploadFileUri should preserve file scheme
  • [SPARK-38655] - OffsetWindowFunctionFrameBase cannot find the offset row whose input is not null
  • [SPARK-38665] - upgrade jackson due to CVE-2020-36518
  • [SPARK-38666] - Missing aggregate filter checks
  • [SPARK-38675] - Race condition in BlockInfoManager during unlock
  • [SPARK-38677] - PySpark hangs in local mode when running an RDD map operation
  • [SPARK-38680] - Set upperbound for pandas-stubs in CI
  • [SPARK-38681] - Support nested generic case classes
  • [SPARK-38684] - Stream-stream outer join has a possible correctness issue due to weakly read consistent on outer iterators
  • [SPARK-38696] - Add `commons-collections` back
  • [SPARK-38706] - Use URI in FallbackStorage.copy
  • [SPARK-38776] - Flaky test: ALSSuite.'ALS validate input dataset'
  • [SPARK-38807] - Error when starting spark shell on Windows system
  • [SPARK-38818] - Fix the docs of try_multiply/try_subtract/ANSI cast
  • [SPARK-38823] - Incorrect result of dataset reduceGroups in java
  • [SPARK-38830] - Warn on corrupted block messages
  • [SPARK-38866] - Update ORC to 1.7.4
  • [SPARK-38868] - `assert_true` fails unconditionally after `left_outer` joins
  • [SPARK-38882] - The usage logger attachment logic should handle static methods properly.
  • [SPARK-38889] - Invalid column name while querying bit type column in MSSQL
  • [SPARK-38916] - Tasks not killed caused by race conditions between killTask() and launchTask()
  • [SPARK-38918] - Nested column pruning should filter out attributes that do not belong to the current relation
  • [SPARK-38922] - TaskLocation.apply throws NullPointerException
  • [SPARK-38931] - RocksDB file manager does not create the initial DFS directory when the number of keys is unknown on the first empty checkpoint
  • [SPARK-38941] - Skip RocksDB-based test case in StreamingJoinSuite on Apple Silicon
  • [SPARK-38942] - Skip RocksDB-based test case in FlatMapGroupsWithStateSuite on Apple Silicon
  • [SPARK-38955] - Disable lineSep option in 'from_csv' and 'schema_of_csv'
  • [SPARK-38973] - When push-based shuffle is enabled, a stage may not complete when retried
  • [SPARK-38974] - List functions should only list registered functions in the specified database
  • [SPARK-38977] - Fix schema pruning with correlated subqueries
  • [SPARK-38988] - Pandas API - "PerformanceWarning: DataFrame is highly fragmented." gets printed many times
  • [SPARK-38990] - date_trunc and trunc both fail with format from column in inline table
  • [SPARK-38992] - Avoid using bash -c in ShellBasedGroupsMappingProvider
  • [SPARK-39012] - Spark SQL partition value parsing does not support all data types
  • [SPARK-39015] - SparkRuntimeException when trying to get non-existent key in a map
  • [SPARK-39055] - Fix documentation 404 page
  • [SPARK-39060] - Typo in error messages of decimal overflow
  • [SPARK-39083] - Fix FsHistoryProvider race condition between update and clean app data
  • [SPARK-39084] - df.rdd.isEmpty() results in unexpected executor failure and JVM crash
  • [SPARK-39093] - Dividing interval by integral can result in codegen compilation error
  • [SPARK-39104] - NullPointerException on unpersist call
  • [SPARK-39107] - Silent change in regexp_replace's handling of empty strings
  • [SPARK-39112] - UnsupportedOperationException if spark.sql.ui.explainMode is set to cost
  • [SPARK-39144] - Nested subquery expressions deduplicate relations should be done bottom up
  • [SPARK-39149] - SHOW DATABASES command should not quote database names under legacy mode
  • [SPARK-39216] - Do not collapse projects in CombineUnions if it hasCorrelatedSubquery
  • [SPARK-39218] - Python foreachBatch streaming query cannot be stopped gracefully after pin thread mode is enabled
  • [SPARK-39226] - Fix the precision of the return type of round-like functions
  • [SPARK-39233] - Remove the check for TimestampNTZ output in Analyzer
  • [SPARK-39250] - Upgrade Jackson to 2.13.3
  • [SPARK-39258] - Fix `Hide credentials in show create table` after SPARK-35378
  • [SPARK-39259] - Timestamps returned by now() and equivalent functions are not consistent in subqueries
  • [SPARK-39283] - Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter
  • [SPARK-39286] - Documentation for the decode function has an incorrect reference
  • [SPARK-39293] - The accumulator of ArrayAggregate should copy the intermediate result if string, struct, array, or map
  • [SPARK-39313] - V2ExpressionUtils.toCatalystOrdering should fail if V2Expression can not be translated
  • [SPARK-39341] - KubernetesExecutorBackend should allow IPv6 pod IP
  • [SPARK-39354] - The analysis exception is incorrect
  • [SPARK-39360] - Recover spark.kubernetes.memoryOverheadFactor doc and remove deprecation
  • [SPARK-39376] - Do not output duplicated columns in star expansion of subquery alias of NATURAL/USING JOIN
  • [SPARK-39393] - Parquet data source only supports push-down predicate filters for non-repeated primitive types
  • [SPARK-39411] - Release candidates do not have the correct version for PySpark
  • [SPARK-39412] - IllegalStateException from connector does not work well with error class framework
  • [SPARK-39417] - Handle Null partition values in PartitioningUtils
  • [SPARK-39421] - Sphinx build fails with "node class 'meta' is already registered, its visitors will be overridden"
  • [SPARK-39427] - Disable ANSI intervals in the percentile functions
  • [SPARK-40804] - Missing handling a catalog name in destination tables in `RenameTableExec`
  • [SPARK-45601] - stackoverflow when executing rule ExtractWindowExpressions
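
Many of the fixes above change observable query behavior. As one example, SPARK-38304 (flagged above) makes elt() return NULL for a null index under ANSI mode. A minimal PySpark sketch of the fixed behavior follows; this is a hedged illustration assuming Spark 3.3.0, and the literal arguments are made up for the example:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.ansi.enabled", "true")

    # Under ANSI mode, a NULL index now yields NULL rather than an error.
    spark.sql("SELECT elt(CAST(NULL AS INT), 'scala', 'java') AS r").show()
    # Expected output: a single row whose value r is null.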

Story

  • [SPARK-36642] - Add df.withMetadata: syntax sugar to update the metadata of a DataFrame
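
A minimal PySpark sketch of this sugar (assuming Spark 3.3.0; the column name and metadata payload are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 25)], ["id", "age"])

    # Shorthand for rebuilding the column with an alias that carries metadata.
    df2 = df.withMetadata("age", {"comment": "age in years"})
    print(df2.schema["age"].metadata)  # {'comment': 'age in years'}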

New Feature

  • [SPARK-595] - Document "local-cluster" mode
  • [SPARK-12567] - Add aes_encrypt and aes_decrypt UDFs
  • [SPARK-28955] - Support for LocalDateTime semantics
  • [SPARK-32268] - Bloom Filter Join
  • [SPARK-33772] - Build and Run Spark on Java 17
  • [SPARK-34735] - Add modified configs for SQL execution in UI
  • [SPARK-34755] - Support utils for transforming number formats
  • [SPARK-34806] - Helper class for batch Dataset.observe()
  • [SPARK-35334] - Spark should be more resilient to intermittent K8s flakiness
  • [SPARK-35781] - Support Spark on Apple Silicon on macOS natively on Java 17
  • [SPARK-36194] - Add a logical plan visitor to propagate the distinct attributes
  • [SPARK-36263] - Add Dataset.observe(Observation, Column, Column*) to PySpark
  • [SPARK-36371] - Support raw string literal
  • [SPARK-36425] - PySpark: support CrossValidatorModel get standard deviation of metrics for each paramMap
  • [SPARK-36533] - Allow streaming queries with Trigger.Once to run in multiple batches
  • [SPARK-36674] - Support ILIKE - case-insensitive LIKE (see the sketch after this list)
  • [SPARK-36736] - Support ILIKE (ALL | ANY | SOME) - case-insensitive LIKE
  • [SPARK-37047] - Add overloads for lpad and rpad for BINARY strings
  • [SPARK-37062] - Introduce a new data source for providing consistent set of rows per microbatch
  • [SPARK-37205] - Support mapreduce.job.send-token-conf when starting containers in YARN
  • [SPARK-37207] - Python API does not have isEmpty
  • [SPARK-37219] - support AS OF syntax
  • [SPARK-37375] - Umbrella: Storage Partitioned Join (SPJ)
  • [SPARK-37475] - Add Scale Parameter to Floor and Ceil functions
  • [SPARK-37492] - Optimize Orc test code with withAllNativeOrcReaders
  • [SPARK-37507] - Add the TO_BINARY() function
  • [SPARK-37508] - Add CONTAINS() function
  • [SPARK-37520] - Add the startswith() and endswith() string functions
  • [SPARK-37552] - Add the convert_timezone() function
  • [SPARK-37568] - Support 2-arguments by the convert_timezone() function
  • [SPARK-37582] - Support the binary type by contains()
  • [SPARK-37583] - Support the binary type by startswith() and endswith()
  • [SPARK-37584] - New SQL function: map_contains_key
  • [SPARK-37671] - Support ANSI Aggregation Function of regression
  • [SPARK-37676] - Support ANSI Aggregation Function: percentile_cont
  • [SPARK-37691] - Support ANSI Aggregation Function: percentile_disc
  • [SPARK-37810] - Executor Rolling in Kubernetes environment
  • [SPARK-37863] - Add submitTime for Spark Application
  • [SPARK-37970] - Introduce a new interface on streaming data source to notify the latest seen offset
  • [SPARK-38035] - Add docker tests for built-in JDBC dialects
  • [SPARK-38054] - DS V2 supports list namespaces of MySQL
  • [SPARK-38094] - Parquet: enable matching schema columns by field id
  • [SPARK-38195] - Add the TIMESTAMPADD() function
  • [SPARK-38278] - Add SparkContext.addArchive in PySpark
  • [SPARK-38284] - Add the TIMESTAMPDIFF() function
  • [SPARK-38332] - Add the `DATEADD()` alias for `TIMESTAMPADD()`
  • [SPARK-38345] - Introduce SQL function ARRAY_SIZE
  • [SPARK-38389] - Add the `DATEDIFF` and `DATE_DIFF` aliases for `TIMESTAMPDIFF()`
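
Several of the new SQL functions listed above can be tried directly from SQL. The sketch below is a hedged illustration assuming Spark 3.3.0; all literals, including the 16-character AES key, are invented for the example and not recommendations:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Case-insensitive pattern matching (SPARK-36674).
    spark.sql("SELECT 'Spark' ILIKE 'spark%'").show()

    # New string predicates (SPARK-37508, SPARK-37520).
    spark.sql(
        "SELECT contains('Spark SQL', 'SQL'), "
        "startswith('Spark', 'Sp'), endswith('Spark', 'rk')"
    ).show()

    # Collection helpers (SPARK-37584, SPARK-38345).
    spark.sql(
        "SELECT map_contains_key(map(1, 'a'), 1), array_size(array(1, 2, 3))"
    ).show()

    # Timestamp arithmetic (SPARK-38195, SPARK-38284).
    spark.sql("SELECT timestampadd(HOUR, 2, TIMESTAMP'2022-06-01 00:00:00')").show()

    # AES round trip (SPARK-12567).
    spark.sql(
        "SELECT CAST(aes_decrypt(aes_encrypt('secret', '0000111122223333'), "
        "'0000111122223333') AS STRING)"
    ).show()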

Improvement

  • [SPARK-20384] - supporting value classes over primitives in DataSets
  • [SPARK-27790] - Support ANSI SQL INTERVAL types
  • [SPARK-32797] - Install mypy on the Jenkins CI workers
  • [SPARK-32940] - Collect, first and last should be deterministic aggregate functions
  • [SPARK-32986] - Add bucket scan info in explain output of FileSourceScanExec
  • [SPARK-34079] - Merge non-correlated scalar subqueries
  • [SPARK-34378] - Loosen AvroSerializer validation to allow extra nullable user-provided fields
  • [SPARK-34629] - Python type hints improvement
  • [SPARK-34943] - Upgrade flake8 to 3.8.0 or above in Jenkins
  • [SPARK-35173] - Support adding columns in batch in the PySpark DataFrame
  • [SPARK-35174] - Avoid opening watch when waitAppCompletion is false
  • [SPARK-35221] - Add join hint build side check
  • [SPARK-35320] - from_json cannot parse maps with timestamp as key
  • [SPARK-35442] - Support propagate empty relation through aggregate
  • [SPARK-35460] - invalid `spark.kubernetes.executor.podNamePrefix` causes app to hang
  • [SPARK-35703] - Relax constraint for Spark bucket join and remove HashClusteredDistribution
  • [SPARK-35848] - Spark Bloom Filter, others using treeAggregate can throw OutOfMemoryError
  • [SPARK-35907] - Use Files#createDirectories instead of File#mkdirs
  • [SPARK-35918] - Consolidate logic between AvroSerializer/AvroDeserializer for schema mismatch handling and error messages
  • [SPARK-35956] - Support auto-assigning labels to less important pods (e.g. decommissioning pods)
  • [SPARK-35980] - ThreadAudit test helper should log whether a thread is a Daemon thread
  • [SPARK-35986] - fix pyspark.rdd.RDD.histogram's buckets argument
  • [SPARK-35991] - Add PlanStability suite for TPCH
  • [SPARK-36000] - Support creation and operations of ps.Series/Index with Decimal('NaN')
  • [SPARK-36010] - Upgrade sbt-antlr4 from 0.8.2 to 0.8.3
  • [SPARK-36018] - Some Improvement for Spark Core
  • [SPARK-36038] - Basic speculation metrics at stage level
  • [SPARK-36047] - Replace the handwriting compare methods with static compare methods in Java code
  • [SPARK-36069] - The from_json function should output field name, field type and field value when FAILFAST mode throws an exception
  • [SPARK-36070] - Add time cost info for writing rows out and committing the task.
  • [SPARK-36073] - EquivalentExpressions fixes and improvements
  • [SPARK-36137] - HiveShim always fallback to getAllPartitionsOf regardless of whether directSQL is enabled in remote HMS
  • [SPARK-36147] - [SQL] Log level should be warning if files are not found in BasicWriteStatsTracker
  • [SPARK-36149] - dayofweek documentation for python and R
  • [SPARK-36154] - pyspark documentation doesn't mention week and quarter as valid format arguments to trunc
  • [SPARK-36157] - TimeWindow expression: apply filter before project
  • [SPARK-36158] - pyspark sql/functions documentation for months_between isn't as precise as scala version
  • [SPARK-36160] - pyspark sql/column documentation doesn't always match scala documentation
  • [SPARK-36163] - Propagate correct JDBC properties in JDBC connector provider and add "connectionProvider" option
  • [SPARK-36173] - [CORE] Support getting CPU number in TaskContext
  • [SPARK-36176] - Expose tableExists in pyspark.sql.catalog
  • [SPARK-36183] - Push down limit 1 through Aggregate
  • [SPARK-36207] - Export databaseExists in pyspark.sql.catalog
  • [SPARK-36243] - pyspark catalog.tableExists doesn't work for temporary views
  • [SPARK-36258] - Export functionExists in pyspark catalog
  • [SPARK-36276] - Update maven-checkstyle-plugin to 3.1.2 and checkstyle to 8.43
  • [SPARK-36280] - Remove redundant aliases after RewritePredicateSubquery
  • [SPARK-36319] - Have Observation return Map instead of Row
  • [SPARK-36326] - Use Map.computeIfAbsent to simplify the process of HeapMemoryAllocator.bufferPoolsBySize init new item
  • [SPARK-36334] - Add a new conf to allow K8s API server-side cache for pod listing
  • [SPARK-36351] - Separate partition filters and data filters in PushDownUtils
  • [SPARK-36359] - Coalesce should drop all expressions after the first non-nullable expression
  • [SPARK-36361] - Install coverage in Python 3.9 and PyPy 3 in GitHub Actions image
  • [SPARK-36362] - Omnibus Java code static analyzer warning fixes
  • [SPARK-36373] - DecimalPrecision should only add necessary casts
  • [SPARK-36404] - Support nested columns in ORC vectorized reader for data source v2
  • [SPARK-36405] - Check that error class SQLSTATEs are valid
  • [SPARK-36406] - Do not truncate a failed-write file before deleting it in DiskBlockObjectWriter
  • [SPARK-36407] - Avoid potential integer multiplications overflow risk
  • [SPARK-36410] - Replace anonymous classes with lambda expressions
  • [SPARK-36418] - Use CAST in parsing of dates/timestamps with default pattern
  • [SPARK-36419] - Move final aggregation in RDD.treeAggregate to executor
  • [SPARK-36420] - Use `isEmpty` to improve performance in Pregel's superstep
  • [SPARK-36450] - Remove unused UnresolvedV2Relation
  • [SPARK-36451] - Ivy skips looking for source and doc pom
  • [SPARK-36475] - Add doc about spark.shuffle.service.fetch.rdd.enabled
  • [SPARK-36481] - Expose LogisticRegression.setInitialModel
  • [SPARK-36487] - Improve the executor-exit logging logic
  • [SPARK-36495] - Use Type match to simplify CatalystTypeConverter.toCatalyst
  • [SPARK-36498] - Reorder inner fields of the input query in byName V2 write
  • [SPARK-36502] - Remove jaxb-api from `sql/catalyst` module
  • [SPARK-36503] - Add RowToColumnConverter for BinaryType
  • [SPARK-36536] - Split the JSON/CSV option of datetime format to in read and in write
  • [SPARK-36546] - Make unionByName null-filling behavior work with array of struct columns
  • [SPARK-36550] - Propagate the cause when UDF reflection fails
  • [SPARK-36560] - Deflake PySpark coverage report
  • [SPARK-36566] - Add Spark appname as a label to the executor pods
  • [SPARK-36573] - Add a default value to ORACLE_DOCKER_IMAGE
  • [SPARK-36575] - Executor lost may cause spark stage to hang
  • [SPARK-36576] - Improve range split calculation for Kafka Source minPartitions option
  • [SPARK-36580] - Replace filter and contains with intersect
  • [SPARK-36583] - Upgrade commons-pool2 from 2.6.2 to 2.11.1
  • [SPARK-36602] - Clean up redundant asInstanceOf casts
  • [SPARK-36607] - Support BooleanType in UnwrapCastInBinaryComparison
  • [SPARK-36613] - The return value of the Table.capabilities should use EnumSet instead of HashSet
  • [SPARK-36643] - Add more information in ERROR log while SparkConf is modified when spark.sql.legacy.setCommandRejectsSparkCoreConfs is set
  • [SPARK-36644] - Push down boolean column filter
  • [SPARK-36649] - Support Trigger.AvailableNow on Kafka data source
  • [SPARK-36654] - Drop type ignores from numpy imports
  • [SPARK-36660] - Cotangent is not supported by Dataframe
  • [SPARK-36663] - When the existing field name is a number, an error will be reported when reading the orc file
  • [SPARK-36665] - Add more Not operator optimizations
  • [SPARK-36679] - Remove lz4 hadoop wrapper classes after Hadoop 3.3.2
  • [SPARK-36683] - Support secant and cosecant
  • [SPARK-36688] - Add cot as an R function
  • [SPARK-36689] - Cleanup the deprecated APIs and raise proper warning message.
  • [SPARK-36690] - Clean up deprecated api usage related to commons-pool2
  • [SPARK-36692] - Improve Error statement when requesting thread dump while executor already stopped
  • [SPARK-36703] - Remove the Sort if it is the child of RepartitionByExpression
  • [SPARK-36718] - only collapse projects if we don't duplicate expensive expressions
  • [SPARK-36719] - Supporting Netty Logging at the network layer
  • [SPARK-36721] - Simplify boolean equalities if one side is literal
  • [SPARK-36735] - Adjust overhead of cached relation for DPP
  • [SPARK-36737] - Upgrade commons-io to 2.11.0 and revert change of SPARK-36456
  • [SPARK-36745] - ExtractEquiJoinKeys should return the original predicates on join keys
  • [SPARK-36751] - octet_length/bit_length API is not implemented on Scala/Python/R
  • [SPARK-36797] - Union should resolve nested columns as top-level columns
  • [SPARK-36799] - Pass queryExecution name in CLI
  • [SPARK-36805] - Upgrade kubernetes-client to 5.7.3
  • [SPARK-36808] - Upgrade Kafka to 2.8.1
  • [SPARK-36809] - Remove broadcast for InSubqueryExec used in DPP
  • [SPARK-36814] - Make class ColumnarBatch extendable
  • [SPARK-36821] - Create a test to extend ColumnarBatch
  • [SPARK-36822] - BroadcastNestedLoopJoinExec should use all condition instead of non-equi condition
  • [SPARK-36824] - Add sec and csc as R functions
  • [SPARK-36829] - Refactor null-check code related to collection operations
  • [SPARK-36834] - Namespace log lines in External Shuffle Service
  • [SPARK-36838] - Improve InSet NaN check generated code performance
  • [SPARK-36841] - Provide ANSI syntax `set catalog xxx` to change the current catalog
  • [SPARK-36847] - Explicitly specify error codes when ignoring type hint errors
  • [SPARK-36859] - Upgrade kubernetes-client to 5.8.0
  • [SPARK-36863] - Update dependency manifests for all released artifacts
  • [SPARK-36870] - Introduce INTERNAL_ERROR error class
  • [SPARK-36876] - Support Dynamic Partition pruning for HiveTableScanExec
  • [SPARK-36890] - Use default WebsocketPingInterval for Kubernetes watches
  • [SPARK-36893] - Upgrade Mesos to 1.4.3
  • [SPARK-36894] - RDD.toDF should be synchronized with dispatched variants of SparkSession.createDataFrame
  • [SPARK-36898] - Make the shuffle hash join factor configurable
  • [SPARK-36915] - Pin actions to a full length commit SHA
  • [SPARK-36918] - unionByName shouldn't consider types when comparing structs
  • [SPARK-36933] - Reduce duplication in TaskMemoryManager.acquireExecutionMemory
  • [SPARK-36937] - Change OrcSourceSuite to test both V1 and V2 sources.
  • [SPARK-36943] - Improve error message for missing column
  • [SPARK-36953] - Expose SQL state and error class in PySpark exceptions
  • [SPARK-36961] - Use PEP526 style variable type hints
  • [SPARK-36963] - Add max_by/min_by to sql.functions (see the sketch after this list)
  • [SPARK-36965] - Extend python test runner by logging out the temp output files
  • [SPARK-36967] - Report accurate shuffle block size if it is skewed
  • [SPARK-36972] - Add max_by/min_by API to PySpark
  • [SPARK-36973] - Deduplicate prepare data method for HistogramPlotBase and KdePlotBase
  • [SPARK-36976] - Add max_by/min_by API to SparkR
  • [SPARK-36978] - InferConstraints rule should create IsNotNull constraints on the nested field instead of the root nested type
  • [SPARK-36981] - Upgrade joda-time to 2.10.12
  • [SPARK-36989] - Migrate type hint data tests
  • [SPARK-36992] - Improve byte array sort perf by unifying the getPrefix function of UTF8String and ByteArray
  • [SPARK-36997] - Test type hints against examples
  • [SPARK-37001] - Disable two level of map for final hash aggregation by default
  • [SPARK-37002] - Introduce the 'compute.eager_check' option
  • [SPARK-37003] - Merge INSERT related docs
  • [SPARK-37010] - Remove unnecessary "noqa: F401" comments in pandas-on-Spark
  • [SPARK-37011] - Upgrade flake8 to 3.9.0 or above in Jenkins
  • [SPARK-37022] - Use black as a formatter for the whole PySpark codebase.
  • [SPARK-37025] - Upgrade RoaringBitmap to 0.9.22
  • [SPARK-37032] - Remove unusable link in the Spark 3.2.0 docs
  • [SPARK-37036] - Add util function to raise advice warning for pandas API on Spark.
  • [SPARK-37037] - Improve byte array sort by unifying the compareTo function of UTF8String and ByteArray
  • [SPARK-37041] - Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS
  • [SPARK-37044] - Add Row to __all__ in pyspark.sql.types
  • [SPARK-37058] - Add spark-shell command line unit test
  • [SPARK-37071] - OpenHashMap should be serializable without reference tracking
  • [SPARK-37075] - move UDAF expression building from sql/catalyst to sql/core
  • [SPARK-37077] - Annotations for pyspark.sql.context.SQLContext.createDataFrame are broken
  • [SPARK-37080] - Add benchmark tool guide in pull request template
  • [SPARK-37081] - Upgrade the version of RDBMS and corresponding JDBC drivers used by docker-integration-tests
  • [SPARK-37084] - Set spark.sql.files.openCostInBytes to bytesConf
  • [SPARK-37085] - Missing overloads in functions accepting both varargs and single arg collection
  • [SPARK-37087] - merge three relation resolutions into one
  • [SPARK-37101] - In class ShuffleBlockPusher, use config instead of key
  • [SPARK-37104] - RDD and DStream should be covariant
  • [SPARK-37108] - Expose make_date expression in R
  • [SPARK-37113] - Upgrade Parquet to 1.12.2
  • [SPARK-37115] - Replace HiveClient call with hive shim
  • [SPARK-37118] - Add KMeans distanceMeasure param to PythonMLLibAPI
  • [SPARK-37126] - Support TimestampNTZ in PySpark
  • [SPARK-37133] - Add a config to optionally enforce ANSI reserved keywords
  • [SPARK-37134] - documentation - unclear "Using PySpark Native Features"
  • [SPARK-37151] - Avoid executor state sync attempt fail continuously in a short timeframe
  • [SPARK-37160] - Add a config to optionally disable padding for the char type
  • [SPARK-37164] - Add ExpressionBuilder for functions with complex overloads
  • [SPARK-37165] - Add REPEATABLE in TABLESAMPLE to specify seed
  • [SPARK-37176] - JsonSource's infer should have the same exception handle logic as JacksonParser's parse logic
  • [SPARK-37199] - Add a deterministic field to QueryPlan
  • [SPARK-37206] - Upgrade Avro to 1.11.0
  • [SPARK-37208] - Support mapping Spark gpu/fpga resource types to custom YARN resource type
  • [SPARK-37211] - More descriptions and adding an image to the failure message about enabling GitHub Actions
  • [SPARK-37214] - Fail query analysis earlier with invalid identifiers
  • [SPARK-37221] - The collect-like API in SparkPlan should support columnar output
  • [SPARK-37224] - Optimize write path on RocksDB state store provider
  • [SPARK-37237] - Upgrade kubernetes-client to 5.9.0
  • [SPARK-37239] - Avoid unnecessary `setReplication` in Yarn mode
  • [SPARK-37241] - Upgrade Jackson to 2.13.0
  • [SPARK-37243] - Fix the format of the document
  • [SPARK-37244] - Build and test on Python 3.10
  • [SPARK-37256] - Replace `ScalaObjectMapper` with `ClassTagExtensions` to fix compilation warnings
  • [SPARK-37257] - Update setup.py for Python 3.10
  • [SPARK-37263] - Add PandasAPIOnSparkAdviceWarning class
  • [SPARK-37266] - View text can only be SELECT queries
  • [SPARK-37268] - Remove unused method call in FileScanRDD
  • [SPARK-37273] - Hidden File Metadata Support for Spark SQL
  • [SPARK-37283] - Don't try to store a V1 table which contains ANSI intervals in Hive compatible format
  • [SPARK-37284] - Upgrade Jekyll to 4.2.1
  • [SPARK-37289] - Refactoring: remove the unnecessary function with partitionSchemaOption
  • [SPARK-37292] - Removes outer join if it only has DISTINCT on streamed side with alias
  • [SPARK-37298] - Use unique exprId in RewriteAsOfJoin
  • [SPARK-37300] - TaskSchedulerImpl should ignore task-finished events if the task was already in a finished state
  • [SPARK-37307] - Don't obtain JDBC connection for empty partition
  • [SPARK-37327] - Silence the to_pandas() advice log for internal usage
  • [SPARK-37335] - Clarify output of FPGrowth
  • [SPARK-37336] - Migrate _java2py to SparkSession
  • [SPARK-37337] - Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion
  • [SPARK-37339] - Add `spark-version` label to driver and executor pods
  • [SPARK-37341] - Avoid unnecessary buffer and copy in full outer sort merge join
  • [SPARK-37342] - Upgrade Apache Arrow to 6.0.0
  • [SPARK-37346] - Link the migration guide for Structured Streaming
  • [SPARK-37352] - Silence the `index_col` advice in `to_spark()` for internal usage
  • [SPARK-37369] - Avoid redundant ColumnarToRow transition on InMemoryTableScan
  • [SPARK-37370] - Add SQL configs to control newly added join code-gen in 3.3
  • [SPARK-37371] - UnionExec should support columnar if all children support columnar
  • [SPARK-37372] - Removing redundant label addition
  • [SPARK-37373] - Collect LocalSparkContext worker logs in case of test failure
  • [SPARK-37380] - Miscellaneous Python lint infra cleanup
  • [SPARK-37386] - simplify OptimizeSkewedJoin to not run the cost evaluator
  • [SPARK-37436] - Uses Python's standard string formatter for SQL API in pandas API on Spark
  • [SPARK-37443] - Provide a profiler for Python/Pandas UDFs
  • [SPARK-37447] - Cache LogicalPlan.isStreaming() in a lazy val
  • [SPARK-37450] - Spark SQL reads unnecessary nested fields (another type of pruning case)
  • [SPARK-37454] - support expressions in time travel timestamp
  • [SPARK-37457] - Update cloudpickle to v2.0.0
  • [SPARK-37458] - Remove unnecessary object serialization on foreachBatch
  • [SPARK-37460] - ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented
  • [SPARK-37462] - Avoid unnecessarily calculating the number of outstanding fetch requests and RPCs
  • [SPARK-37464] - SCHEMA and DATABASE should simply be aliases of NAMESPACE
  • [SPARK-37468] - Support ANSI intervals and TimestampNTZ for UnionEstimation
  • [SPARK-37469] - Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI
  • [SPARK-37474] - Migrate SparkR documentation to pkgdown
  • [SPARK-37484] - Replace Get and getOrElse with getOrElse
  • [SPARK-37485] - Replace map whose expressions produce no result with foreach
  • [SPARK-37493] - Expose driver GC time and duration
  • [SPARK-37503] - Improve SparkSession/PySpark SparkSession startup
  • [SPARK-37505] - mesos module is missing log4j.properties file for UT
  • [SPARK-37506] - Change the never changed 'var' to 'val'
  • [SPARK-37513] - date +/- interval with only day-time fields returns different data type between Spark3.2 and Spark3.1
  • [SPARK-37514] - Remove workarounds due to older pandas
  • [SPARK-37516] - Uses Python's standard string formatter for SQL API in PySpark
  • [SPARK-37530] - Spark reads many paths very slowly through newAPIHadoopFile
  • [SPARK-37531] - Use PyArrow 6.0.0 in Python 3.9 tests at GitHub Action job
  • [SPARK-37540] - Detect more unsupported time travel
  • [SPARK-37558] - Improve spark-sql cli command doc
  • [SPARK-37561] - Avoid loading all functions when obtaining hive's DelegationToken
  • [SPARK-37565] - Upgrade mysql-connector-java to 8.0.27
  • [SPARK-37578] - DSV2 is not updating Output Metrics
  • [SPARK-37580] - Reset numFailures when one of task attempts succeeds
  • [SPARK-37586] - Add cipher mode option and set default cipher mode for aes_encrypt and aes_decrypt
  • [SPARK-37591] - Support the GCM mode by aes_encrypt()/aes_decrypt()
  • [SPARK-37592] - Improve performance of JoinSelection
  • [SPARK-37593] - Reduce default page size by LONG_ARRAY_OFFSET if G1GC and ON_HEAP are used
  • [SPARK-37594] - Make UT test("SPARK-34399: Add job commit duration metrics for DataWritingCommand") more stable
  • [SPARK-37600] - Upgrade to Hadoop 3.3.2
  • [SPARK-37611] - Remove upper limit of spark.kubernetes.memoryOverheadFactor
  • [SPARK-37618] - Support cleaning up shuffle blocks from external shuffle service
  • [SPARK-37627] - Add sorted column in BucketTransform
  • [SPARK-37628] - Upgrade Netty from 4.1.68 to 4.1.72
  • [SPARK-37629] - speed up Expression.canonicalized
  • [SPARK-37646] - Avoid touching Scala reflection APIs in the lit function
  • [SPARK-37649] - Switch default index to distributed-sequence by default in pandas API on Spark
  • [SPARK-37657] - Support str and timestamp for (Series|DataFrame).describe()
  • [SPARK-37666] - Set `GCM` as the default mode in `aes_encrypt()`/`aes_decrypt()`
  • [SPARK-37670] - Support predicate pushdown and column pruning for de-duped CTEs
  • [SPARK-37686] - Migrate remaining pyspark.sql.functions to _invoke_* style
  • [SPARK-37688] - ExecutorMonitor should ignore SparkListenerBlockUpdated event if executor was not active
  • [SPARK-37689] - Expand should be supported in PropagateEmptyRelation
  • [SPARK-37698] - Update ORC to 1.7.2
  • [SPARK-37704] - Update mypy in tests to 0.920
  • [SPARK-37710] - Add detailed log message for java.io.IOException occurring on Kryo flow
  • [SPARK-37712] - Requesting YARN cluster metrics is slow and causes unnecessary delay
  • [SPARK-37715] - Remove ojdbc6 dependency and update docker-integration test docs
  • [SPARK-37726] - Add spill size metrics for sort merge join
  • [SPARK-37731] - refactor and cleanup function lookup in Analyzer
  • [SPARK-37737] - Update Black to 21.12.b0
  • [SPARK-37738] - PySpark date_add only accepts an integer as its second parameter
  • [SPARK-37739] - Upgrade Arrow to 6.0.1
  • [SPARK-37747] - Upgrade zstd-jni to 1.5.1-1
  • [SPARK-37753] - Fine tune logic to demote Broadcast hash join in DynamicJoinSelection
  • [SPARK-37756] - Enable matplotlib test for pandas API on Spark
  • [SPARK-37761] - Install matplotlib in Python 3.9 and PyPy 3 in GitHub Actions image
  • [SPARK-37764] - Preserve bucket information when converting metastore relations to data source relations
  • [SPARK-37776] - Upgrade silencer to 1.7.7
  • [SPARK-37777] - update the SQL syntax of SHOW FUNCTIONS
  • [SPARK-37780] - QueryExecutionListener should also support SQLConf
  • [SPARK-37782] - Make DataFrame.transform take the parameters for the function.
  • [SPARK-37783] - Add @tailrec wherever possible
  • [SPARK-37784] - CodeGenerator.addBufferedState() does not properly handle UDTs
  • [SPARK-37785] - Add Utils.isAtExecutor
  • [SPARK-37786] - StreamingQueryListener should also support SQLConf
  • [SPARK-37789] - Add a class to represent general aggregate functions in DS V2
  • [SPARK-37796] - ByteArrayMethods arrayEquals should fast-skip the alignment check on platforms that support unaligned access
  • [SPARK-37803] - Create new benchmarks for struct deserializer improvement.
  • [SPARK-37812] - When deserializing an Orc struct, reuse the result row when possible
  • [SPARK-37822] - SQL function `split` should return an array of non-nullable elements
  • [SPARK-37826] - Use zstd codec name in ORC file names for hive orc impl
  • [SPARK-37828] - Push down filters through RebalancePartitions
  • [SPARK-37831] - Add task partition id in metrics
  • [SPARK-37832] - Orc struct serializer should look up field converters in an array rather than a linked list
  • [SPARK-37833] - Add `precondition` job for skip the main GitHub Action jobs
  • [SPARK-37835] - Fix the comments in SQLQueryTestSuite.scala/ThriftServerQueryTestSuite.scala to be more explicit
  • [SPARK-37836] - Enable more flake8 rules for PEP 8 compliance
  • [SPARK-37837] - Enable black formatter in dev Python scripts
  • [SPARK-37838] - Upgrade scalatestplus artifacts to 3.3.0-SNAP3
  • [SPARK-37850] - Enable flake's E731 rule in PySpark
  • [SPARK-37851] - Mark org.apache.spark.sql.hive.execution as slow tests
  • [SPARK-37852] - Enable flake's E741 rule in PySpark
  • [SPARK-37854] - Use type match to simplify TestUtils#withHttpConnection
  • [SPARK-37862] - RecordBinaryComparator should fast-skip the alignment check on platforms that support unaligned access
  • [SPARK-37869] - Update pytest-mypy-plugins to 1.9.3
  • [SPARK-37876] - Move SpecificParquetRecordReaderBase.listDirectory to TestUtils
  • [SPARK-37879] - Show test report in GitHub Actions builds from PRs
  • [SPARK-37885] - Allow pandas_udf to take type annotations with future annotations enabled
  • [SPARK-37896] - ConstantColumnVector: a column vector with same values
  • [SPARK-37900] - Use SparkMasterRegex.KUBERNETES_REGEX in SecurityManager
  • [SPARK-37901] - Upgrade Netty from 4.1.72 to 4.1.73
  • [SPARK-37902] - Update annotations to resolve issues detected with mypy==0.931
  • [SPARK-37903] - Replace string_typehints with get_type_hints.
  • [SPARK-37904] - Improve RebalancePartitions in rules of Optimizer
  • [SPARK-37909] - Restore F403 checks
  • [SPARK-37915] - Combine unions if there is a project between them
  • [SPARK-37917] - Push down limit 1 for right side of left semi/anti join
  • [SPARK-37922] - Combine to one cast if we can safely up-cast two casts
  • [SPARK-37924] - Sort table properties by key in SHOW CREATE TABLE on VIEW (v1)
  • [SPARK-37928] - Add Parquet Data Page V2 bench scenario to DataSourceReadBenchmark
  • [SPARK-37934] - Upgrade Jetty version to 9.4.44
  • [SPARK-37949] - Improve Rebalance statistics estimation
  • [SPARK-37950] - Take EXTERNAL as a reserved table property
  • [SPARK-37952] - Add missing statements to ALTER TABLE document.
  • [SPARK-37959] - Fix the UT of checking norm in KMeans & BiKMeans
  • [SPARK-37968] - Upgrade commons-collections3 to commons-collections4
  • [SPARK-37974] - Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
  • [SPARK-37984] - Avoid calculating all outstanding requests to improve performance.
  • [SPARK-37992] - Restore mypy version check in dev/lint-python
  • [SPARK-38002] - Upgrade ZSTD-JNI to 1.5.2-1
  • [SPARK-38006] - Clean up duplicated planner logic for window operator
  • [SPARK-38007] - Update K8s doc to recommend K8s 1.20+
  • [SPARK-38008] - Fix the method description of refill
  • [SPARK-38011] - Remove useless and duplicated configuration in ParquetFileFormat buildReader
  • [SPARK-38014] - Add Parquet Data Page V2 test scenario for BuiltInDataSourceWriteBenchmark
  • [SPARK-38021] - Upgrade dropwizard metrics from 4.2.2 to 4.2.7
  • [SPARK-38028] - Expose Arrow Vector from ArrowColumnVector
  • [SPARK-38033] - The structured streaming processing cannot be started because the commitId and offsetId are inconsistent
  • [SPARK-38036] - Refactor `VersionsSuite` as a subclass of `HiveVersionSuite`
  • [SPARK-38046] - Fix KafkaSource/KafkaMicroBatch flaky test due to non-deterministic timing
  • [SPARK-38051] - Update Roxygen reference to 7.1.2
  • [SPARK-38069] - Improve the window calculation in Structured Streaming
  • [SPARK-38076] - Remove redundant null-check that is covered by a later condition
  • [SPARK-38082] - Update minimum numpy version to 1.15
  • [SPARK-38086] - Make ArrowColumnVector Extendable
  • [SPARK-38089] - Show the root cause exception in TestUtils.assertExceptionMsg
  • [SPARK-38096] - Update sbt to 1.6.2
  • [SPARK-38100] - Remove unused method in `Decimal`
  • [SPARK-38121] - Use SparkSession instead of SQLContext inside PySpark
  • [SPARK-38123] - Uniformly use `DataType.catalogString` as the `targetType` of `QueryExecutionErrors#castingCauseOverflowError`
  • [SPARK-38128] - Show full stacktrace in tests by default in PySpark tests
  • [SPARK-38134] - Upgrade Arrow to 7.0.0
  • [SPARK-38138] - Materialize QueryPlan subqueries
  • [SPARK-38147] - Upgrade shapeless to 2.3.7
  • [SPARK-38148] - Do not add dynamic partition pruning if there exists static partition pruning
  • [SPARK-38149] - Upgrade joda-time to 2.10.13
  • [SPARK-38154] - Set up a new GA job to run tests with ANSI mode
  • [SPARK-38175] - Clean up unused parameters in private methods signature
  • [SPARK-38177] - Fix wrong transformExpressions in Optimizer
  • [SPARK-38183] - Show warning when creating pandas-on-Spark session under ANSI mode.
  • [SPARK-38184] - Fix malformed ExpressionDescription of `decode`
  • [SPARK-38186] - Improve the README of Spark docs
  • [SPARK-38191] - The staging directory of a write job only needs to be initialized once in HadoopMapReduceCommitProtocol
  • [SPARK-38194] - Make memory overhead factor configurable
  • [SPARK-38199] - Delete the unused `dataType` specified in the definition of `IntervalColumnAccessor`
  • [SPARK-38211] - Add SQL migration guide on restoring loose upcast from string
  • [SPARK-38214] - No need to filter windows when windowDuration is multiple of slideDuration
  • [SPARK-38216] - When creating a Hive table, fail early if all the columns are partitioned columns
  • [SPARK-38219] - Support ANSI aggregation function percentile_cont as window function
  • [SPARK-38220] - Upgrade `commons-math3` to 3.6.1
  • [SPARK-38225] - Adjust input `format` of function `to_binary`
  • [SPARK-38229] - Shouldn't check temp/external/ifNotExists in visitReplaceTable during parsing
  • [SPARK-38231] - Upgrade commons-text to 1.9
  • [SPARK-38235] - Add test util for testing grouped aggregate pandas UDF.
  • [SPARK-38240] - Improve RuntimeReplaceable and add a guideline for adding new functions
  • [SPARK-38247] - Unify the output of df.explain and "explain" when the plan is a command
  • [SPARK-38249] - Cleanup unused private methods and fields
  • [SPARK-38256] - Upgrade `org.scalatestplus:mockito` to 3.2.11.0
  • [SPARK-38259] - Upgrade netty to 4.1.74
  • [SPARK-38260] - Remove dependence on commons-net
  • [SPARK-38267] - Replace pattern matches on boolean expressions with conditional statements
  • [SPARK-38269] - Clean up redundant type cast
  • [SPARK-38274] - Upgrade junit4 to 4.13.2 and the corresponding junit-interface to 0.13.3
  • [SPARK-38279] - Pin markupsafe to 2.0.1 to fix linter failure
  • [SPARK-38299] - Clean up deprecated usage of `StringBuilder.newBuilder`
  • [SPARK-38300] - Use ByteStreams.toByteArray to simplify fileToString and resourceToBytes in catalyst.util
  • [SPARK-38301] - Remove unused scala-actors dependency
  • [SPARK-38305] - Check existence of file before untarring/zipping
  • [SPARK-38322] - Support showing query stage runtime statistics in formatted explain mode
  • [SPARK-38323] - Support the hidden file metadata in Streaming
  • [SPARK-38337] - Replace `toIterator` with `iterator` for `IterableLike`/`IterableOnce` to cleanup deprecated api usage
  • [SPARK-38338] - Remove test dependency of hamcrest
  • [SPARK-38339] - Upgrade RoaringBitmap to 0.9.25
  • [SPARK-38342] - Clean up deprecated api usage of Ivy
  • [SPARK-38348] - Upgrade tink to 1.6.1
  • [SPARK-38351] - [TESTS] Replace 'abc symbols with Symbol("abc") in tests
  • [SPARK-38353] - Instrument __enter__ and __exit__ magic methods for pandas API on Spark
  • [SPARK-38360] - Introduce an `exists` function for `TreeNode` to eliminate duplicate code patterns
  • [SPARK-38362] - Move eclipse.m2e Maven plugin config in its own profile
  • [SPARK-38378] - ANTLR grammar definition in separate Parser and Lexer files
  • [SPARK-38382] - Refactor migration guide's sentences
  • [SPARK-38384] - Improve error messages of ParseException from ANTLR
  • [SPARK-38393] - Clean up deprecated usage of GenSeq/GenMap
  • [SPARK-38414] - Remove redundant SuppressWarnings
  • [SPARK-38415] - Update histogram_numeric (x, y) result type to make x == input type
  • [SPARK-38424] - Disallow unused casts and ignores
  • [SPARK-38428] - Check the FetchShuffleBlocks message only once to improve iteration in external shuffle service
  • [SPARK-38434] - Correct semantic of CheckAnalysis.getDataTypesAreCompatibleFn method
  • [SPARK-38437] - Dynamic serialization of Java datetime objects to micros/days
  • [SPARK-38443] - Document config STREAMING_SESSION_WINDOW_MERGE_SESSIONS_IN_LOCAL_PARTITION
  • [SPARK-38484] - Move usage logging instrumentation util functions from pandas module to pyspark.util module
  • [SPARK-38487] - Fix docstrings of nlargest/nsmallest of DataFrame
  • [SPARK-38489] - Aggregate.groupOnly should support foldable expressions
  • [SPARK-38499] - Upgrade Jackson to 2.13.2
  • [SPARK-38500] - Add ASF License header to all Service Provider configuration files
  • [SPARK-38509] - Unregister the TIMESTAMPADD/DIFF functions and remove DATE_ADD/DIFF
  • [SPARK-38529] - Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators
  • [SPARK-38535] - Add the datetimeUnit enum to the grammar and use it in TIMESTAMPADD/DIFF
  • [SPARK-38540] - Upgrade compress-lzf from 1.0.3 to 1.1.0
  • [SPARK-38549] - SessionWindowStateStoreRestoreExec should provide numRowsDroppedByWatermark metric
  • [SPARK-38558] - Remove unnecessary casts between IntegerType and IntDecimal
  • [SPARK-38565] - Support Left Semi join in row level runtime filters
  • [SPARK-38570] - Incorrect DynamicPartitionPruning caused by Literal
  • [SPARK-38574] - Enrich Avro data source documentation
  • [SPARK-38593] - Incorporate numRowsDroppedByWatermark metric from SessionWindowStateStoreRestoreExec into StateOperatorProgress
  • [SPARK-38607] - Test result report for ANSI mode
  • [SPARK-38609] - Add PYSPARK_PANDAS_USAGE_LOGGER environment variable as an alias of KOALAS_USAGE_LOGGER
  • [SPARK-38623] - Add more comments and tests for HashShuffleSpec
  • [SPARK-38628] - Complete the copy method in subclasses of InternalRow, ArrayData, and MapData to safely copy their instances.
  • [SPARK-38650] - Better ParseException message for char without length
  • [SPARK-38654] - Show default index type in SQL plans for pandas API on Spark
  • [SPARK-38656] - Show options for Pandas API on Spark in UI
  • [SPARK-38657] - Rename "SQL" to "SQL/DataFrame" in Spark UI
  • [SPARK-38709] - remove trailing $ from function class name in sql-expression-schema.md
  • [SPARK-38710] - use SparkArithmeticException for Arithmetic overflow runtime errors
  • [SPARK-38778] - Replace http with https for project url in pom
  • [SPARK-38796] - Implement the to_number and try_to_number SQL functions according to a new specification
  • [SPARK-38816] - Wrong comment in the random matrix generator in the Spark ALS algorithm
  • [SPARK-38825] - Add a test to cover parquet notIn filter
  • [SPARK-38833] - PySpark applyInPandas should allow to return empty DataFrame without columns
  • [SPARK-38892] - Fix the unit test of the schema equality assertion
  • [SPARK-38924] - Update dataTables to 1.10.25 for security issue
  • [SPARK-38929] - Improve error messages for cast failures in ANSI
  • [SPARK-38936] - Script transform feed thread should have name
  • [SPARK-38939] - Support ALTER TABLE ... DROP [IF EXISTS] COLUMN .. syntax
  • [SPARK-38953] - Document PySpark common exceptions / errors
  • [SPARK-38957] - Use multipartIdentifier for parsing table-valued functions
  • [SPARK-38972] - [SQL] Support <paramName> style for error message parameters
  • [SPARK-39008] - Change ASF as a single author in Spark distribution
  • [SPARK-39030] - Rename sum to avoid shadowing the built-in Python function
  • [SPARK-39049] - Remove unneeded pass
  • [SPARK-39154] - Remove outdated statements on distributed-sequence default index
  • [SPARK-39155] - Access to JVM through passed-in GatewayClient during type conversion
  • [SPARK-39174] - Catalogs loading swallows missing classname for ClassNotFoundException
  • [SPARK-39186] - make skew consistent with pandas
  • [SPARK-39215] - Reduce Py4J calls in pyspark.sql.utils.is_timestamp_ntz_preferred
  • [SPARK-39240] - Source and binary releases use different tools to generate integrity hashes
  • [SPARK-39295] - Improve documentation of pandas API support list.
  • [SPARK-39361] - Stop using Log4J2's extended throwable logging pattern in default logging configurations
  • [SPARK-39392] - Refine ANSI error messages and remove 'To return NULL instead'
  • [SPARK-39633] - Dataframe options for time travel via `timestampAsOf` should respect both formats of specifying timestamp
  • [SPARK-44700] - Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)
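
Two of the API-facing improvements above combine nicely in one example: max_by/min_by in sql.functions (SPARK-36963/SPARK-36972, flagged above) and argument forwarding in DataFrame.transform (SPARK-37782). A minimal PySpark sketch (assuming Spark 3.3.0; the sample data and the add_shift helper are invented for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1), ("b", 3), ("c", 2)], ["name", "score"])

    # max_by/min_by: the value of one column at the row where another
    # column is maximal/minimal ("b" and "a" here).
    df.select(F.max_by("name", "score"), F.min_by("name", "score")).show()

    # DataFrame.transform now forwards extra positional/keyword arguments
    # to the supplied function.
    def add_shift(sdf, value):
        return sdf.withColumn("shifted", F.col("score") + value)

    df.transform(add_shift, 10).show()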

Test

  • [SPARK-29871] - Flaky test: ImageFileFormatTest.test_read_images
  • [SPARK-32391] - Install pydata_sphinx_theme in Jenkins machines
  • [SPARK-32666] - Install ipython and nbsphinx in Jenkins for Binder integration
  • [SPARK-33242] - Install numpydoc in Jenkins machines
  • [SPARK-35345] - Add BloomFilter Benchmark test for Parquet
  • [SPARK-36048] - Wrong HealthTrackerSuite.allExecutorAndHostIds
  • [SPARK-36151] - Enable MiMa for Scala 2.13 artifacts after Spark 3.2.0 release
  • [SPARK-36165] - Fix SQL doc generation in GitHub Action
  • [SPARK-36204] - Deduplicate Scala 2.13 daily build
  • [SPARK-36820] - Disable LZ4 test for Hadoop 2.7 profile
  • [SPARK-36839] - Add daily build with Hadoop 2 profile in GitHub Actions build
  • [SPARK-36883] - Upgrade R version to 4.1.1 in CI images
  • [SPARK-36929] - Remove Unused Method EliminateSubqueryAliasesSuite#assertEquivalent
  • [SPARK-37218] - Parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark
  • [SPARK-37223] - Fix unit test check in JoinHintSuite
  • [SPARK-37322] - `run_scala_tests` should respect test module order
  • [SPARK-37367] - Reenable exception test in DDLParserSuite.create view -- basic
  • [SPARK-37368] - Deflake TPC-DS build
  • [SPARK-37384] - Flaky test: HealthTrackerIntegrationSuite.If preferred node is bad, without excludeOnFailure job will fail
  • [SPARK-37453] - Split TPC-DS build in GitHub Actions
  • [SPARK-37813] - ORC read benchmark should enable vectorization for nested column
  • [SPARK-37823] - Add `is-changed.py` dev script
  • [SPARK-37871] - Use python3 instead of python in BaseScriptTransformation tests
  • [SPARK-37908] - Refactoring on pod label test in BasicFeatureStepSuite
  • [SPARK-37921] - Update OrcReadBenchmark to use Hive ORC reader as the basis
  • [SPARK-37987] - Flaky Test: StreamingAggregationSuite.changing schema of state when restarting query - state format version 1
  • [SPARK-38031] - Update document type conversion for Pandas UDFs (pyarrow 6.0.1, pandas 1.4.0, Python 3.9)
  • [SPARK-38032] - Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL
  • [SPARK-38040] - Enable binary compatibility check for APIs in Catalyst, KVStore and Avro modules
  • [SPARK-38045] - More strict validation on plan check for stream-stream join unit test
  • [SPARK-38080] - Flaky test: StreamingQueryManagerSuite: 'awaitAnyTermination with timeout and resetTerminated'
  • [SPARK-38084] - Support `SKIP_PYTHON` and `SKIP_R` in `run-tests.py`
  • [SPARK-38136] - Update GitHub Action test image
  • [SPARK-38142] - Move ArrowColumnVectorSuite to org.apache.spark.sql.vectorized
  • [SPARK-38297] - Fix mypy failure on DataFrame.to_numpy in pandas API on Spark
  • [SPARK-38532] - Add test case for invalid gapDuration of sessionwindow
  • [SPARK-38780] - PySpark docs build should fail when there are warnings
  • [SPARK-38786] - Test Bug in StatisticsSuite "change stats after add/drop partition command"
  • [SPARK-38800] - Explicitly document the supported pandas version
  • [SPARK-38927] - Skip NumPy/Pandas tests in `test_rdd.py` if not available
  • [SPARK-38928] - Skip Pandas UDF test in `QueryCompilationErrorsSuite` if not available
  • [SPARK-39019] - Use `withTempPath` to clean up temporary data directory after `SPARK-37463: read/write Timestamp ntz to Orc with different time zone`
  • [SPARK-39252] - Flaky Test: pyspark.sql.tests.test_dataframe.DataFrameTests test_df_is_empty
  • [SPARK-39253] - Improve PySpark API reference to be more readable
  • [SPARK-39273] - Make PandasOnSparkTestCase inherit ReusedSQLTestCase
  • [SPARK-39334] - Change to exclude `slf4j-reload4j` for `hadoop-minikdc`
  • [SPARK-39394] - Make the PySpark Structured Streaming page more readable

Wish

  • [SPARK-36611] - Remove unused listener in HiveThriftServer2AppStatusStore
  • [SPARK-37931] - Quote the column name if needed
  • [SPARK-38242] - Sort the SparkSubmit debug output

Task

  • [SPARK-35973] - DataSourceV2: Support SHOW CATALOGS
  • [SPARK-35996] - Setting version to 3.3.0-SNAPSHOT
  • [SPARK-36034] - Incorrect datetime filter when reading Parquet files written in legacy mode
  • [SPARK-36148] - Missing validation of regexp_replace inputs
  • [SPARK-36223] - TPCDSQueryTestSuite should run with different config set
  • [SPARK-36888] - Sha2 with bit_length 512 not being tested
  • [SPARK-36975] - Refactor the Hive client call-collection logic in HiveClientImpl
  • [SPARK-36980] - Support queries with CTEs in INSERT
  • [SPARK-37050] - Update conda installation instructions
  • [SPARK-37067] - DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon
  • [SPARK-37136] - Remove code about Hive built-in functions
  • [SPARK-37437] - Remove unused hive-2.3 profile
  • [SPARK-37445] - Update hadoop-profile
  • [SPARK-37446] - Use the invoke method for hive-2.3.9-related API calls
  • [SPARK-37461] - In yarn-client mode, the client's appId value is null
  • [SPARK-37471] - Support nested bracketed comments in spark-sql
  • [SPARK-37497] - Promote ExecutorPods[PollingSnapshot|WatchSnapshot]Source to DeveloperApi
  • [SPARK-37555] - spark-sql should pass the last unclosed comment to the backend and throw an exception on execution
  • [SPARK-37631] - Code clean up on promoting strings in math functions
  • [SPARK-37716] - Allow LateralJoin node to host non-deterministic expressions when the outer query is a single row relation
  • [SPARK-37724] - ANSI mode: disable ANSI reserved keywords by default
  • [SPARK-37750] - ANSI mode: optionally return null result if element not exists in array/map
  • [SPARK-37766] - Regenerate benchmark results
  • [SPARK-37815] - Fix the github action job "test_report"
  • [SPARK-37817] - Remove unreachable code in complexTypeExtractors.scala
  • [SPARK-37906] - spark-sql should not pass the last simple comment to the backend
  • [SPARK-37907] - StaticInvoke should support ConstantFolding
  • [SPARK-37951] - Refactor ImageFileFormatSuite
  • [SPARK-37965] - Remove check field name when reading/writing existing data in ORC
  • [SPARK-37967] - ConstantFolding/Literal.create should support ObjectType
  • [SPARK-37969] - Hive Serde insert should check schema before execution
  • [SPARK-37985] - Fix flaky test SPARK-37578
  • [SPARK-38003] - Differentiate scalar and table function lookup in LookupFunctions
  • [SPARK-38063] - Support SQL split_part function (see the sketch after this list)
  • [SPARK-38122] - Update App Key of DocSearch
  • [SPARK-38144] - Remove unused `spark.storage.safetyFraction` config
  • [SPARK-38150] - Update comment of RelationConversions
  • [SPARK-38153] - Remove option newlines.topLevelStatements in scalafmt.conf
  • [SPARK-38189] - Add priority scheduling doc for Spark on K8S
  • [SPARK-38197] - Improve error message of BlockManager.fetchRemoteManagedBuffer
  • [SPARK-38215] - InsertIntoHiveDir should support metadata conversion
  • [SPARK-38237] - Introduce a new config to require all cluster keys on Aggregate
  • [SPARK-38318] - Regression when replacing a dataset view
  • [SPARK-38358] - Add migration guide for spark.sql.hive.convertMetastoreInsertDir and spark.sql.hive.convertMetastoreCtas
  • [SPARK-38419] - Remove tab character and trailing space in script
  • [SPARK-38449] - Do not call createTable when ifNotExist=true and the table exists
  • [SPARK-38566] - Revert the parser changes for DEFAULT column support
  • [SPARK-38784] - Upgrade Jetty to 9.4.46
  • [SPARK-39178] - When throwing SparkFatalException, the root cause should be shown too
  • [SPARK-39367] - Review and fix issues in Scala/Java API docs of SQL module
  • [SPARK-39371] - Review and fix issues in Scala/Java API docs of Core module
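
Of the changes above, the new split_part function from SPARK-38063 is simple to demonstrate. A minimal PySpark sketch, assuming Spark 3.3+ and a local session; the sample address string is made up for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # split_part(str, delimiter, partNum) splits str on the delimiter and
    # returns the 1-based partNum-th field; a negative index counts from
    # the end of the string.
    spark.sql("SELECT split_part('192.168.0.1', '.', 1) AS first_octet").show()
    spark.sql("SELECT split_part('192.168.0.1', '.', -1) AS last_octet").show()

    spark.stop()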

Dependency upgrade

  • [SPARK-38287] - Upgrade h2 from 2.0.204 to 2.1.210 in /sql/core
  • [SPARK-38291] - Upgrade postgresql from 42.3.0 to 42.3.3
  • [SPARK-38303] - Upgrade ansi-regex from 5.0.0 to 5.0.1 in /dev
  • [SPARK-39099] - Add dependencies to Dockerfile for building Spark releases
  • [SPARK-39183] - Upgrade Apache Xerces Java to 2.12.2

Question

  • [SPARK-37788] - ColumnOrName vs Column in PySpark Functions module

Umbrella

  • [SPARK-34705] - Add code-gen for all join types of sort merge join
  • [SPARK-36504] - Improve test coverage for pandas API on Spark
  • [SPARK-36707] - Support to specify index type and name in pandas API on Spark
  • [SPARK-37093] - Inline type hints python/pyspark/streaming
  • [SPARK-37094] - Inline type hints for files in python/pyspark
  • [SPARK-37275] - Support ANSI intervals in PySpark
  • [SPARK-37395] - Inline type hint files for files in python/pyspark/ml
  • [SPARK-37396] - Inline type hint files for files in python/pyspark/mllib
  • [SPARK-37814] - Migrating from log4j 1 to log4j 2
  • [SPARK-37886] - Use ComparisonTestBase to reduce redundant test code
  • [SPARK-38396] - Improve K8s Integration Tests

Documentation

  • [SPARK-31907] - Spark SQL functions documentation refers to SQL API documentation without linking to it
  • [SPARK-36377] - Fix documentation in spark-env.sh.template
  • [SPARK-36474] - Mention pandas API on Spark in Spark overview pages
  • [SPARK-37550] - from_json documentation lacks examples for complex types
  • [SPARK-37624] - Suppress warnings for live pandas-on-Spark quickstart notebooks
  • [SPARK-37692] - Fix wrong description in sql-migration-guide
  • [SPARK-37718] - Demo SQL is incorrect
  • [SPARK-37818] - Add option for SHOW CREATE TABLE command
  • [SPARK-37925] - Update document to mention the workaround for YARN-11053
  • [SPARK-38606] - Update document to make a good guide of multiple versions of the Spark Shuffle Service
  • [SPARK-38629] - Two links beneath Spark SQL Guide/Data Sources do not work properly
  • [SPARK-38933] - Add examples of window functions into SQL docs (see the sketch after this list)
  • [SPARK-39001] - Document which options are unsupported in CSV and JSON functions
  • [SPARK-39032] - Incorrectly formatted examples in pyspark.sql.functions.when
  • [SPARK-39219] - Promote Structured Streaming over Spark Streaming
  • [SPARK-39237] - Update the ANSI SQL mode documentation
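
SPARK-38933 adds window-function examples to the SQL docs. The sketch below shows the kind of query those examples cover, a running sum per group; the view name and columns (vals, grp, value) are invented for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    spark.createDataFrame(
        [("a", 1), ("a", 2), ("b", 3)], ["grp", "value"]
    ).createOrReplaceTempView("vals")

    # Running sum within each group, ordered by value.
    spark.sql("""
        SELECT grp, value,
               SUM(value) OVER (PARTITION BY grp ORDER BY value) AS running_sum
        FROM vals
    """).show()

    spark.stop()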
