Release Notes - Spark - Version 3.4.0 - HTML format

Sub-task

  • [SPARK-28330] - ANSI SQL: Top-level <result offset clause> in <query expression> (see the sketch after this list)
  • [SPARK-28516] - Data Type Formatting Functions: `to_char` (see the sketch after this list)
  • [SPARK-30220] - Support Filter expressions that use IN/EXISTS predicate sub-queries
  • [SPARK-30661] - KMeans blockify input vectors
  • [SPARK-30835] - Add support for YARN decommissioning & pre-emption
  • [SPARK-33236] - Enable Push-based shuffle service to store state in NM level DB for work preserving restart
  • [SPARK-33573] - Server side metrics related to push-based shuffle
  • [SPARK-34305] - Unify v1 and v2 ALTER TABLE .. SET SERDE tests
  • [SPARK-36114] - Support subqueries with correlated non-equality predicates
  • [SPARK-36124] - Support set operators to be on correlation paths
  • [SPARK-36511] - Remove ColumnIO once PARQUET-2050 is released in Parquet 1.13
  • [SPARK-36620] - Client side related push-based shuffle metrics
  • [SPARK-37194] - Avoid unnecessary sort in FileFormatWriter when there is no dynamic partition
  • [SPARK-37287] - Pull out dynamic partition and bucket sort from FileFormatWriter
  • [SPARK-37378] - SPJ: Convert V2 Transform expressions into catalyst expressions and load their associated functions from V2 FunctionCatalog
  • [SPARK-37425] - Inline type hints for python/pyspark/mllib/recommendation.py
  • [SPARK-37599] - Unify v1 and v2 ALTER TABLE .. SET LOCATION tests
  • [SPARK-37623] - Support ANSI Aggregate Function: regr_intercept
  • [SPARK-37672] - Support ANSI Aggregate Function: regr_sxx
  • [SPARK-37681] - Support ANSI Aggregate Function: regr_sxy
  • [SPARK-37702] - Support ANSI Aggregate Function: regr_syy
  • [SPARK-37888] - Unify v1 and v2 DESCRIBE TABLE tests
  • [SPARK-37938] - Use error classes in the parsing errors of partitions
  • [SPARK-37939] - Use error classes in the parsing errors of properties
  • [SPARK-37945] - Use error classes in the execution errors of arithmetic ops
  • [SPARK-37982] - Use error classes in the execution errors related to unsupported input type
  • [SPARK-38005] - Support cleaning up merged shuffle files and state from external shuffle service
  • [SPARK-38106] - Use error classes in the parsing errors of functions
  • [SPARK-38108] - Use error classes in the compilation errors of UDF/UDAF
  • [SPARK-38257] - Upgrade rocksdbjni to 7.0.3
  • [SPARK-38270] - SQL CLI AM should keep the same exit code as the client
  • [SPARK-38335] - Parser changes for DEFAULT column support
  • [SPARK-38336] - Catalyst changes for DEFAULT column support
  • [SPARK-38441] - Support string and bool `regex` in `Series.replace`
  • [SPARK-38479] - Add `Series.duplicated` to indicate duplicate Series values.
  • [SPARK-38493] - Improve the test coverage for pyspark/pandas module
  • [SPARK-38496] - Improve the test coverage for pyspark/sql module
  • [SPARK-38552] - Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to resolve ties
  • [SPARK-38576] - Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only
  • [SPARK-38588] - Validate input dataset of ml.classification
  • [SPARK-38608] - Implement `bool_only` parameter of `DataFrame.all` and `DataFrame.any`
  • [SPARK-38669] - Validate input dataset of ml.clustering
  • [SPARK-38678] - Enable RocksDB tests on Apple Silicon on MacOS
  • [SPARK-38686] - Implement `keep` parameter of `(Index/MultiIndex).drop_duplicates`
  • [SPARK-38687] - Use error classes in the compilation errors of generators
  • [SPARK-38688] - Use error classes in the compilation errors of deserializer
  • [SPARK-38689] - Use error classes in the compilation errors of not allowed DESC PARTITION
  • [SPARK-38697] - Extend SparkSessionExtensions to inject rules into AQE Optimizer
  • [SPARK-38700] - Use error classes in the execution errors of save mode
  • [SPARK-38701] - Inline IllegalStateException out from QueryExecutionErrors
  • [SPARK-38704] - Support string `inclusive` parameter of `Series.between`
  • [SPARK-38718] - Test the error class: AMBIGUOUS_FIELD_NAME
  • [SPARK-38720] - Test the error class: CANNOT_CHANGE_DECIMAL_PRECISION
  • [SPARK-38721] - Test the error class: CANNOT_PARSE_DECIMAL
  • [SPARK-38722] - Test the error class: CAST_CAUSES_OVERFLOW
  • [SPARK-38724] - Test the error class: DIVIDE_BY_ZERO
  • [SPARK-38725] - Test the error class: DUPLICATE_KEY
  • [SPARK-38726] - Support `how` parameter of `MultiIndex.dropna`
  • [SPARK-38727] - Test the error class: FAILED_EXECUTE_UDF
  • [SPARK-38728] - Test the error class: FAILED_RENAME_PATH
  • [SPARK-38729] - Test the error class: FAILED_SET_ORIGINAL_PERMISSION_BACK
  • [SPARK-38730] - Move tests for the grouping error classes to QueryCompilationErrorsSuite
  • [SPARK-38731] - Move the tests `GROUPING_SIZE_LIMIT_EXCEEDED` to QueryCompilationErrorsSuite
  • [SPARK-38732] - Test the error class: INCOMPARABLE_PIVOT_COLUMN
  • [SPARK-38733] - Test the error class: INCOMPATIBLE_DATASOURCE_REGISTER
  • [SPARK-38734] - Test the error class: INDEX_OUT_OF_BOUNDS
  • [SPARK-38736] - Test the error classes: INVALID_ARRAY_INDEX*
  • [SPARK-38737] - Test the error classes: INVALID_FIELD_NAME
  • [SPARK-38738] - Test the error class: INVALID_FRACTION_OF_SECOND
  • [SPARK-38739] - Test the error class: INVALID_INPUT_SYNTAX_FOR_NUMERIC_TYPE
  • [SPARK-38740] - Test the error class: INVALID_JSON_SCHEMA_MAPTYPE
  • [SPARK-38741] - Test the error class: MAP_KEY_DOES_NOT_EXIST*
  • [SPARK-38742] - Move the tests `MISSING_COLUMN` to QueryCompilationErrorsSuite
  • [SPARK-38744] - Test the pivot error classes
  • [SPARK-38745] - Move the tests for `NON_PARTITION_COLUMN` to QueryCompilationErrorsSuite
  • [SPARK-38746] - Move the tests for `PARSE_EMPTY_STATEMENT` to QueryParsingErrorsSuite
  • [SPARK-38747] - Move the tests for `PARSE_SYNTAX_ERROR` to QueryParsingErrorsSuite
  • [SPARK-38748] - Test the error class: PIVOT_VALUE_DATA_TYPE_MISMATCH
  • [SPARK-38749] - Test the error class: RENAME_SRC_PATH_NOT_FOUND
  • [SPARK-38750] - Test the error class: SECOND_FUNCTION_ARGUMENT_NOT_INTEGER
  • [SPARK-38751] - Test the error class: UNRECOGNIZED_SQL_TYPE
  • [SPARK-38752] - Test the error class: UNSUPPORTED_DATATYPE
  • [SPARK-38753] - Move the tests for `WRITING_JOB_ABORTED` to QueryExecutionErrorsSuite
  • [SPARK-38765] - Implement `inplace` parameter of `Series.clip`
  • [SPARK-38768] - If the limit can be pushed down and the data source has only one partition, DS V2 should not apply the limit again
  • [SPARK-38774] - Implement Series.autocorr
  • [SPARK-38775] - Clean up validation functions
  • [SPARK-38785] - Implement Series.ewm and DataFrame.ewm (see the sketch after this list)
  • [SPARK-38791] - Output parameter values of error classes in SQL style
  • [SPARK-38793] - Support `return_indexer` parameter of `Index/MultiIndex.sort_values`
  • [SPARK-38795] - Support INSERT INTO user-specified column lists with DEFAULT values (see the sketch after this list)
  • [SPARK-38811] - Support ALTER TABLE ADD COLUMN commands with DEFAULT values
  • [SPARK-38820] - Refresh dtype when astype("category")
  • [SPARK-38821] - test_nsmallest test failed due to pandas 1.4.0-1.4.2 bug
  • [SPARK-38822] - Raise IndexError when insert loc is out of bounds
  • [SPARK-38827] - Improve the test coverage for pyspark/find_spark_home.py
  • [SPARK-38834] - Update the version of TimestampNTZ related changes as 3.4.0
  • [SPARK-38837] - Implement `dropna` parameter of `SeriesGroupBy.value_counts`
  • [SPARK-38838] - Support ALTER TABLE ALTER COLUMN commands with DEFAULT values
  • [SPARK-38840] - Enable spark.sql.parquet.enableNestedColumnVectorizedReader on master branch by default
  • [SPARK-38844] - Implement Series.interpolate and DataFrame.interpolate
  • [SPARK-38854] - Improve the test coverage for pyspark/statcounter.py
  • [SPARK-38857] - Series name should be preserved in Series.mode()
  • [SPARK-38859] - iloc setitem failed due to "Cannot convert * into bool"
  • [SPARK-38863] - Implement `skipna` parameter of `DataFrame.all`
  • [SPARK-38865] - Update document of JDBC options for pushDownAggregate and pushDownLimit
  • [SPARK-38869] - Respect Table capability `ACCEPT_ANY_SCHEMA` in default column resolution
  • [SPARK-38877] - CLONE - Improve the test coverage for pyspark/find_spark_home.py
  • [SPARK-38878] - CLONE - Improve the test coverage for pyspark/statcounter.py
  • [SPARK-38879] - Improve the test coverage for pyspark/rddsampler.py
  • [SPARK-38880] - Implement `numeric_only` parameter of `GroupBy.max/min`
  • [SPARK-38890] - Implement `ignore_index` of `DataFrame.sort_index`.
  • [SPARK-38891] - Skip allocating vectors for repetition & definition levels when possible
  • [SPARK-38894] - Exclude pyspark.cloudpickle in test coverage report
  • [SPARK-38897] - DS V2 supports push down string functions
  • [SPARK-38899] - DS V2 supports push down datetime functions
  • [SPARK-38901] - DS V2 supports push down misc functions
  • [SPARK-38903] - Implement `ignore_index` of `Series.sort_values` and `Series.sort_index`
  • [SPARK-38907] - Implement DataFrame.corrwith
  • [SPARK-38913] - Output identifiers in error messages in SQL style
  • [SPARK-38937] - Support the `limit_direction` parameter in interpolate
  • [SPARK-38938] - Implement `inplace` and `columns` parameters of `Series.drop`
  • [SPARK-38943] - Support ignore_na in EWM
  • [SPARK-38946] - Generate a new dataframe instead of operating in place in setitem
  • [SPARK-38947] - Support Groupby positional indexing
  • [SPARK-38949] - Wrap SQL statements by double quotes in error messages
  • [SPARK-38952] - Implement `numeric_only` of `GroupBy.first` and `GroupBy.last`
  • [SPARK-38959] - DataSource V2: Support runtime group filtering in row-level commands
  • [SPARK-38978] - DS V2 supports push down OFFSET operator
  • [SPARK-38980] - Move error class tests requiring ANSI SQL mode to QueryExecutionAnsiErrorsSuite
  • [SPARK-38982] - test_categories_setter failed due to pandas bug
  • [SPARK-38984] - Allow comparison between TimestampNTZ and Timestamp/Date
  • [SPARK-38986] - Prepend error class tag to error messages
  • [SPARK-38987] - Handle fallback when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is set to true
  • [SPARK-38989] - Implement `ignore_index` of `DataFrame/Series.sample`
  • [SPARK-38993] - Implement DataFrame.boxplot and DataFrame.plot.box
  • [SPARK-38996] - Use double quotes for types in error messages
  • [SPARK-39000] - Convert bools to ints in basic statistical functions of GroupBy objects
  • [SPARK-39006] - Show a directional error message for PVC Dynamic Allocation Failure
  • [SPARK-39007] - Use double quotes for SQL configs in error messages
  • [SPARK-39018] - Add support for YARN decommissioning when ESS is Disabled
  • [SPARK-39028] - Use SparkDateTimeException when casting to datetime types failed
  • [SPARK-39029] - Improve the test coverage for pyspark/broadcast.py
  • [SPARK-39037] - DS V2 Top N push-down supports order by expressions
  • [SPARK-39047] - Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE
  • [SPARK-39053] - test_multi_index_dtypes failed due to index mismatch
  • [SPARK-39054] - GroupByTest failed due to axis Length mismatch
  • [SPARK-39077] - Implement `skipna` of basic statistical functions of DataFrame and Series
  • [SPARK-39078] - Support UPDATE commands with DEFAULT values
  • [SPARK-39081] - Implement DataFrame.resample and Series.resample
  • [SPARK-39085] - Move error message of INCONSISTENT_BEHAVIOR_CROSS_VERSION to the json file
  • [SPARK-39086] - Support UDT in Spark Parquet vectorized reader
  • [SPARK-39087] - Improve error messages: step 1
  • [SPARK-39095] - Adjust `GroupBy.std` to match pandas 1.4
  • [SPARK-39096] - Support MERGE commands with DEFAULT values
  • [SPARK-39097] - Improve the test coverage for pyspark/taskcontext.py
  • [SPARK-39108] - Show hints for try_add/try_subtract/try_multiply in error messages of int/long overflow
  • [SPARK-39109] - Adjust `GroupBy.mean/median` to match pandas 1.4
  • [SPARK-39114] - ml.optim.aggregator avoid re-allocating buffers
  • [SPARK-39121] - Fix doc format/syntax error
  • [SPARK-39139] - DS V2 supports push down DS V2 UDF
  • [SPARK-39143] - Support CSV file scans with DEFAULT values
  • [SPARK-39148] - DS V2 aggregate push down can work with OFFSET or LIMIT
  • [SPARK-39163] - Throw an exception w/ error class for an invalid bucket file
  • [SPARK-39164] - Wrap asserts/illegal state exceptions by the INTERNAL_ERROR exception in actions
  • [SPARK-39165] - Replace sys.error by IllegalStateException in Spark SQL
  • [SPARK-39167] - Throw an exception w/ an error class for multiple rows from a subquery used as an expression
  • [SPARK-39170] - ImportError when creating the pyspark.pandas document "Supported APIs" if the pandas version is too old
  • [SPARK-39179] - Improve the test coverage for pyspark/shuffle.py
  • [SPARK-39187] - Remove SparkIllegalStateException
  • [SPARK-39189] - interpolate supports limit_area
  • [SPARK-39197] - Implement `skipna` parameter of `GroupBy.all`
  • [SPARK-39200] - Stream is corrupted Exception while fetching the blocks from fallback storage system
  • [SPARK-39201] - Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`
  • [SPARK-39211] - Support JSON file scans with default values
  • [SPARK-39214] - Improve errors related to CAST
  • [SPARK-39223] - Implement skew and kurt in Rolling/RollingGroupby/Expanding/ExpandingGroupby
  • [SPARK-39230] - Support ANSI Aggregate Function: regr_slope
  • [SPARK-39234] - Code clean up in SparkThrowableHelper.getMessage
  • [SPARK-39236] - Make CreateTable API and ListTables API compatible
  • [SPARK-39243] - Describe the rules of quoting elements in error messages
  • [SPARK-39246] - Implement Groupby.skew
  • [SPARK-39255] - Improve error messages: step 2
  • [SPARK-39263] - GetTable, TableExists and DatabaseExists
  • [SPARK-39265] - Support Parquet file scans with DEFAULT values
  • [SPARK-39270] - JDBC dialect supports registering dialect specific functions
  • [SPARK-39271] - Upgrade pandas to 1.4.3
  • [SPARK-39285] - Spark should not check field names when reading data
  • [SPARK-39294] - Support Orc file scans with DEFAULT values
  • [SPARK-39309] - '_SubTest' object has no attribute 'elapsed_time'
  • [SPARK-39310] - Rename `required_same_anchor`
  • [SPARK-39314] - Respect ps.concat sort parameter to follow pandas behavior
  • [SPARK-39316] - Merge PromotePrecision and CheckOverflow into decimal binary arithmetic
  • [SPARK-39317] - groupby.apply doctest fails when SPARK_CONF_ARROW_ENABLED is disabled
  • [SPARK-39319] - Make query context as part of SparkThrowable
  • [SPARK-39324] - Log ExecutorDecommission as INFO level in TaskSchedulerImpl
  • [SPARK-39326] - Replace "NaN" with a real "None" value in indexes in doctests
  • [SPARK-39335] - DescribeTableCommand should redact properties
  • [SPARK-39339] - Support TimestampNTZ in JDBC data source
  • [SPARK-39342] - ShowTablePropertiesCommand/ShowTablePropertiesExec should redact properties.
  • [SPARK-39343] - DescribeTableExec should redact properties
  • [SPARK-39346] - Convert asserts/illegal state exception to internal errors on each phase
  • [SPARK-39350] - DescribeNamespace should redact properties
  • [SPARK-39351] - ShowCreateTable should redact properties
  • [SPARK-39359] - Restrict DEFAULT columns to allowlist of supported data source types
  • [SPARK-39383] - Support V2 data sources with DEFAULT values
  • [SPARK-39384] - Compile build-in linear regression aggregate functions for JDBC dialect
  • [SPARK-39385] - Translate linear regression aggregate functions for pushdown
  • [SPARK-39406] - Accept NumPy array in createDataFrame (see the sketch after this list)
  • [SPARK-39413] - Capitalize sql keywords in JDBCV2Suite
  • [SPARK-39425] - Add migration guide for PS behavior changes
  • [SPARK-39432] - element_at(*, 0) does not return INVALID_ARRAY_INDEX_IN_ELEMENT_AT
  • [SPARK-39434] - Provide runtime error query context when array index is out of bound
  • [SPARK-39450] - Reuse PVCs by default
  • [SPARK-39451] - Support casting intervals to integrals in ANSI mode
  • [SPARK-39453] - DS V2 supports push down misc non-aggregate functions(non ANSI)
  • [SPARK-39459] - local*HostName* methods should support IPv6
  • [SPARK-39460] - Fix CoarseGrainedSchedulerBackendSuite to handle fast allocations
  • [SPARK-39461] - Print `SPARK_LOCAL_(HOSTNAME|IP)` in `build/{mvn|sbt}`
  • [SPARK-39464] - Use `Utils.localCanonicalHostName` instead of `localhost` in tests
  • [SPARK-39468] - Improve RpcAddress to add [] to IPv6 if needed
  • [SPARK-39470] - Support cast of ANSI intervals to decimals
  • [SPARK-39479] - DS V2 supports push down math functions(non ANSI)
  • [SPARK-39482] - Add build and test documentation on IPv6
  • [SPARK-39490] - Support `ipFamilyPolicy` and `ipFamilies` in Driver Service
  • [SPARK-39491] - Hadoop 2.7 build fails due to org.apache.hadoop.yarn.api.records.NodeState.DECOMMISSIONING
  • [SPARK-39501] - Propagate `java.net.preferIPv6Addresses=true` in SBT tests
  • [SPARK-39502] - Downgrade scala-maven-plugin to 4.6.1
  • [SPARK-39503] - Add session catalog name for v1 database table and function
  • [SPARK-39506] - CacheTable, isCached, UncacheTable, setCurrentCatalog, currentCatalog, listCatalogs
  • [SPARK-39507] - SocketAuthServer should respect Java IPv6 options
  • [SPARK-39508] - Support IPv6 between JVM and Python Daemon in PySpark
  • [SPARK-39509] - Support DEFAULT_ARTIFACT_REPOSITORY in check-license
  • [SPARK-39514] - LauncherBackendSuite should add java.net.preferIPv6Addresses conf
  • [SPARK-39516] - Set a scheduled build for branch-3.3
  • [SPARK-39517] - Recover branch-3.2 build broken by is-changed.py script missing
  • [SPARK-39519] - Test failure in SPARK-39387 with JDK 11
  • [SPARK-39520] - ExpressionSetSuite test failure with Scala 2.13
  • [SPARK-39521] - Define each workflow for each scheduled job in GitHub Actions
  • [SPARK-39522] - Add Apache Spark infra GA image cache
  • [SPARK-39528] - Use V2 Filter in SupportsRuntimeFiltering
  • [SPARK-39529] - Refactor and merge all related job selection logic into precondition
  • [SPARK-39530] - Fix KafkaTestUtils to support IPv6
  • [SPARK-39542] - Improve YARN client mode to support IPv6
  • [SPARK-39552] - Unify v1 and v2 DESCRIBE TABLE
  • [SPARK-39553] - Failed to remove shuffle ${shuffleId} - null when using Scala 2.13
  • [SPARK-39555] - Make createTable and listTables in the python side support 3-layer-namespace
  • [SPARK-39557] - Support ARRAY, STRUCT, MAP types as DEFAULT values
  • [SPARK-39559] - Support IPv6 in WebUI
  • [SPARK-39561] - Improve SparkContext to propagate `java.net.preferIPv6Addresses`
  • [SPARK-39562] - Make hive-thrift server module passes in IPv6 environment
  • [SPARK-39563] - Use localHostNameForURI in UISuite
  • [SPARK-39566] - Improve YARN cluster mode to support IPv6
  • [SPARK-39571] - Add net-tools to Spark docker files
  • [SPARK-39572] - Fix `test_daemon.py` to support IPv6
  • [SPARK-39574] - Better error message when `ps.Index` is used for DataFrame/Series creation
  • [SPARK-39579] - Make ListFunctions/getFunction/functionExists API compatible
  • [SPARK-39583] - Make RefreshTable be compatible with 3 layer namespace
  • [SPARK-39594] - Improve logs to show addresses in addition to port
  • [SPARK-39597] - Make GetTable, TableExists and DatabaseExists in the python side support 3-layer-namespace
  • [SPARK-39598] - Make *cache*, *catalog* in the python side support 3-layer-namespace
  • [SPARK-39607] - DataSourceV2: Distribution and ordering support V2 function in writing
  • [SPARK-39610] - Add safe.directory for container based job
  • [SPARK-39611] - PySpark support numpy 1.23.X
  • [SPARK-39615] - Make listColumns be compatible with 3 layer namespace
  • [SPARK-39627] - DS V2 pushdown should unify the compile API
  • [SPARK-39629] - Support v2 SHOW FUNCTIONS
  • [SPARK-39641] - Unify v1 and v2 SHOW FUNCTIONS tests
  • [SPARK-39643] - Prohibit subquery expressions in DEFAULT values for now
  • [SPARK-39645] - Make getDatabase and listDatabases compatible with 3 layer namespace
  • [SPARK-39646] - Make setCurrentDatabase compatible with 3 layer namespace
  • [SPARK-39649] - Make listDatabases / getDatabase / listColumns / refreshTable in PySpark support 3-layer-namespace
  • [SPARK-39686] - Disable scheduled builds that do not pass even once
  • [SPARK-39687] - Make sure new catalog methods listed in API reference
  • [SPARK-39688] - getReusablePVCs should handle accounts with no PVC permission
  • [SPARK-39697] - Add REFRESH_DATE flag and use previous cache to build cache image
  • [SPARK-39700] - Update two-parameter listColumns/getTable/getFunction/tableExists/functionExists functions docs to mention limitation
  • [SPARK-39704] - Implement createIndex & dropIndex & IndexExists in JDBC (H2 dialect)
  • [SPARK-39716] - Make currentDatabase/setCurrentDatabase/listCatalogs in SparkR support 3L namespace
  • [SPARK-39718] - Enable base image build in PySpark job
  • [SPARK-39719] - Implement databaseExists/getDatabase in SparkR support 3L namespace
  • [SPARK-39720] - Implement tableExists/getTable in SparkR for 3L namespace
  • [SPARK-39723] - Implement functionExists/getFunc in SparkR for 3L namespace
  • [SPARK-39735] - Enable base image build in lint job and fix sparkr env
  • [SPARK-39736] - Enable base image build in SparkR job
  • [SPARK-39756] - Better error messages for missing pandas scalars
  • [SPARK-39759] - Implement listIndexes in JDBC (H2 dialect)
  • [SPARK-39762] - Support numpy 1.23.0 (Remove numpy<1.23.0 version limit)
  • [SPARK-39772] - namespace should be null when database is null in the old constructors
  • [SPARK-39773] - Update document of JDBC options for pushDownOffset
  • [SPARK-39778] - Improve error messages: step 3
  • [SPARK-39787] - Use error class in the parsing error of function to_timestamp
  • [SPARK-39788] - Rename catalogName to dialectName for JdbcUtils
  • [SPARK-39792] - Add DecimalDivideWithOverflowCheck for decimal average
  • [SPARK-39795] - New SQL function: try_to_timestamp
  • [SPARK-39799] - DataSourceV2: View catalog interface
  • [SPARK-39807] - Respect `Series.concat` sort parameter to follow pandas 1.4.3 behavior
  • [SPARK-39810] - Catalog.tableExists should handle nested namespace
  • [SPARK-39818] - Fix bug in ARRAY, STRUCT, MAP types with DEFAULT values with NULL field(s)
  • [SPARK-39819] - DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions)
  • [SPARK-39827] - add_months() returns a Java error on overflow
  • [SPARK-39828] - Catalog.listTables() should respect currentCatalog
  • [SPARK-39836] - Simplify V2ExpressionBuilder by extracting a common method
  • [SPARK-39844] - Restrict adding DEFAULT columns for existing tables to allowlist of supported data source types
  • [SPARK-39846] - Enable spark.dynamicAllocation.shuffleTracking.enabled by default
  • [SPARK-39852] - Unify v1 and v2 DESCRIBE TABLE tests for columns
  • [SPARK-39859] - Support v2 `DESCRIBE TABLE EXTENDED` for columns
  • [SPARK-39862] - Fix bug in existence DEFAULT value lookups for V2 data sources
  • [SPARK-39884] - KubernetesExecutorBackend should handle IPv6 hostname
  • [SPARK-39889] - Use different error classes for numeric/interval divided by 0
  • [SPARK-39898] - Upgrade kubernetes-client to 5.12.3
  • [SPARK-39899] - Incorrect passing of message parameters in InvalidUDFClassException
  • [SPARK-39905] - Remove checkErrorClass()
  • [SPARK-39907] - Implement axis and skipna of Series.argmin
  • [SPARK-39909] - Organize the check of push down information for JDBCV2Suite
  • [SPARK-39914] - Add DS V2 Filter to V1 Filter conversion
  • [SPARK-39917] - Use different error classes for numeric/interval arithmetic overflow
  • [SPARK-39923] - Put QueryContext into an array instead of an Option
  • [SPARK-39926] - Fix bug in existence DEFAULT value lookups for non-vectorized Parquet scans
  • [SPARK-39928] - Optimize Utils.getIteratorSize for Scala 2.13 refer to IterableOnceOps.size
  • [SPARK-39929] - DS V2 supports push down string functions(non ANSI)
  • [SPARK-39933] - Check query context by checkError()
  • [SPARK-39935] - Switch validateParsingError onto checkError
  • [SPARK-39949] - Principals in KafkaTestUtils should use canonical host name
  • [SPARK-39961] - DS V2 push-down translate Cast if the cast is safe
  • [SPARK-39964] - DS V2 pushdown should unify the translate path
  • [SPARK-39965] - Skip PVC cleanup when driver doesn't own PVCs
  • [SPARK-39966] - Use V2 Filter in SupportsDelete
  • [SPARK-39985] - Test DEFAULT column values with DataFrames
  • [SPARK-39987] - Support PEAK_JVM_(ON|OFF)HEAP_MEMORY executor rolling policy
  • [SPARK-40000] - Add config to toggle whether to automatically add default values for INSERTs without user-specified fields
  • [SPARK-40001] - Add config to make DEFAULT values in JSON tables mutually exclusive with SQLConf.JSON_GENERATOR_IGNORE_NULL_FIELDS
  • [SPARK-40006] - Make pyspark.sql.group examples self-contained
  • [SPARK-40008] - Support casting integrals to intervals in ANSI mode
  • [SPARK-40010] - Make pyspark.sql.window examples self-contained
  • [SPARK-40012] - Make pyspark.sql.dataframe examples self-contained
  • [SPARK-40013] - DS V2 expressions should have the default toString
  • [SPARK-40014] - Support cast of decimals to ANSI intervals
  • [SPARK-40016] - Remove unnecessary TryEval in TrySum
  • [SPARK-40018] - Output SparkThrowable to SQL golden files in JSON format
  • [SPARK-40027] - Make pyspark.sql.streaming.readwriter examples self-contained
  • [SPARK-40029] - Make pyspark.sql.types examples self-contained
  • [SPARK-40041] - Add Document Parameters for pyspark.sql.window
  • [SPARK-40042] - Make pyspark.sql.streaming.query examples self-contained
  • [SPARK-40044] - Incorrect target interval type in cast overflow errors
  • [SPARK-40051] - Make pyspark.sql.catalog examples self-contained
  • [SPARK-40054] - Restore the error handling syntax of try_cast()
  • [SPARK-40055] - listCatalogs should also return spark_catalog even when the spark_catalog implementation is defaultSessionCatalog
  • [SPARK-40060] - Add numberDecommissioningExecutors metric
  • [SPARK-40061] - Document cast of ANSI intervals
  • [SPARK-40064] - Use V2 Filter in SupportsOverwrite
  • [SPARK-40066] - ANSI mode: always return null on invalid access to map column
  • [SPARK-40077] - Make pyspark.context examples self-contained
  • [SPARK-40078] - Make pyspark.sql.column examples self-contained
  • [SPARK-40081] - Add Document Parameters for pyspark.sql.streaming.query
  • [SPARK-40098] - Format error messages in the Thrift Server
  • [SPARK-40102] - Use SparkException instead of IllegalStateException in SparkPlan
  • [SPARK-40107] - Pull out empty2null conversion from FileFormatWriter
  • [SPARK-40109] - New SQL function: get()
  • [SPARK-40111] - Make pyspark.rdd examples self-contained
  • [SPARK-40120] - Make pyspark.sql.readwriter examples self-contained
  • [SPARK-40135] - Support ps.Index in DataFrame creation
  • [SPARK-40136] - Incorrect fragment of query context
  • [SPARK-40138] - Implement DataFrame.mode
  • [SPARK-40142] - Make pyspark.sql.functions examples self-contained
  • [SPARK-40147] - Make pyspark.sql.session examples self-contained
  • [SPARK-40157] - Make pyspark.files examples self-contained
  • [SPARK-40160] - Make pyspark.broadcast examples self-contained
  • [SPARK-40161] - Make Series.mode apply PandasMode
  • [SPARK-40173] - Make pyspark.taskcontext examples self-contained
  • [SPARK-40180] - Format error messages by spark-sql
  • [SPARK-40183] - Use error class NUMERIC_VALUE_OUT_OF_RANGE for overflow in decimal conversion
  • [SPARK-40187] - Add doc for using Apache YuniKorn as a customized scheduler
  • [SPARK-40191] - Make pyspark.resource examples self-contained
  • [SPARK-40196] - Consolidate `lit` function with NumPy scalar in sql and pandas module
  • [SPARK-40198] - Enable spark.storage.decommission.(rdd|shuffle)Blocks.enabled by default
  • [SPARK-40205] - Provide a query context of ELEMENT_AT_BY_INDEX_ZERO
  • [SPARK-40209] - Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE
  • [SPARK-40220] - Don't output the empty map of error message parameters
  • [SPARK-40222] - Numeric try_add/try_divide/try_subtract/try_multiply should throw errors from their children
  • [SPARK-40257] - Remove since usage in streaming/query.py and window.py
  • [SPARK-40260] - Use error classes in the compilation errors of GROUP BY a position
  • [SPARK-40269] - Randomize the orders of peer in BlockManagerDecommissioner
  • [SPARK-40291] - Improve the message for column not in group by clause error
  • [SPARK-40300] - Migrate onto the DATATYPE_MISMATCH error classes
  • [SPARK-40302] - Add YuniKornSuite
  • [SPARK-40304] - Add decomTestTag to K8s Integration Test
  • [SPARK-40305] - Implement Groupby.sem
  • [SPARK-40310] - try_sum() should throw the exceptions from its child
  • [SPARK-40313] - ps.DataFrame(data, index) should support the same anchor
  • [SPARK-40318] - try_avg() should throw the exceptions from its child
  • [SPARK-40324] - Provide a query context of ParseException
  • [SPARK-40330] - Implement `Series.searchsorted`.
  • [SPARK-40332] - Implement `GroupBy.quantile`.
  • [SPARK-40333] - Implement `GroupBy.nth`.
  • [SPARK-40334] - Implement `GroupBy.prod`.
  • [SPARK-40339] - Implement `Expanding.quantile`.
  • [SPARK-40342] - Implement `Rolling.quantile`.
  • [SPARK-40345] - Implement `ExpandingGroupby.quantile`.
  • [SPARK-40348] - Implement `RollingGroupby.quantile`.
  • [SPARK-40356] - Upgrade pandas to 1.4.4
  • [SPARK-40357] - Migrate window type check failures onto error classes
  • [SPARK-40358] - Migrate collection type check failures onto error classes
  • [SPARK-40359] - Migrate JSON type check failures onto error classes
  • [SPARK-40361] - Migrate arithmetic type check failures onto error classes
  • [SPARK-40368] - Migrate Bloom Filter type check failures onto error classes
  • [SPARK-40369] - Migrate the type check failures of calls via reflection onto error classes
  • [SPARK-40370] - Migrate cast type check failures onto error classes
  • [SPARK-40371] - Migrate type check failures of NthValue and NTile onto error classes
  • [SPARK-40372] - Migrate failures of array type checks onto error classes
  • [SPARK-40374] - Migrate type check failures of type creators onto error classes
  • [SPARK-40379] - Propagate decommission executor loss reason during onDisconnect in K8s
  • [SPARK-40386] - Implement `ddof` in `DataFrame.cov`
  • [SPARK-40391] - Test the error class UNSUPPORTED_FEATURE.JDBC_TRANSACTION
  • [SPARK-40393] - Refactor expanding and rolling test for function with input
  • [SPARK-40399] - Make `pearson` correlation in `DataFrame.corr` support missing values and `min_periods`
  • [SPARK-40400] - Pass error message parameters to exceptions as a map
  • [SPARK-40416] - Add error classes for subquery expression CheckAnalysis failures
  • [SPARK-40417] - Use YuniKorn v1.1+
  • [SPARK-40420] - Sort message parameters in the JSON formats
  • [SPARK-40421] - Make `spearman` correlation in `DataFrame.corr` support missing values and `min_periods`
  • [SPARK-40423] - Add explicit YuniKorn queue submission test coverage
  • [SPARK-40426] - Return a map from SparkThrowable.getMessageParameters
  • [SPARK-40432] - Introduce GroupStateImpl and GroupStateTimeout in PySpark
  • [SPARK-40433] - Add toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row
  • [SPARK-40434] - Implement applyInPandasWithState in PySpark
  • [SPARK-40435] - Add test suites for applyInPandasWithState in PySpark
  • [SPARK-40445] - Refactor Resampler
  • [SPARK-40446] - Rename `_MissingPandasXXX` as `MissingPandasXXX`
  • [SPARK-40447] - Implement `kendall` correlation in `DataFrame.corr`
  • [SPARK-40448] - Initial prototype implementation
  • [SPARK-40453] - Improve error handling for GRPC server
  • [SPARK-40454] - Initial DSL framework for protobuf testing
  • [SPARK-40458] - Bump Kubernetes Client Version to 6.1.1
  • [SPARK-40459] - recoverDiskStore should not be stopped by existing recomputed files
  • [SPARK-40473] - Migrate parsing errors onto error classes
  • [SPARK-40479] - Migrate unexpected input type error to an error class
  • [SPARK-40481] - Ignore stage fetch failure caused by decommissioned executor
  • [SPARK-40483] - Add `CONNECT` label
  • [SPARK-40486] - Implement `spearman` and `kendall` in `DataFrame.corrwith`
  • [SPARK-40498] - Implement `kendall` and `min_periods` in `Series.corr`
  • [SPARK-40503] - Add resampling to API references
  • [SPARK-40509] - Construct an example of applyInPandasWithState in examples directory
  • [SPARK-40510] - Implement `ddof` in `Series.cov`
  • [SPARK-40512] - Upgrade pandas to 1.5.0
  • [SPARK-40515] - Add apache/spark-docker repo
  • [SPARK-40516] - Add official image dockerfile for Spark v3.3.0
  • [SPARK-40519] - Add "Publish workflow" to help release apache/spark image
  • [SPARK-40520] - Add a script to generate DOI manifest
  • [SPARK-40528] - Add dockerfile template
  • [SPARK-40529] - Remove `pyspark.pandas.ml`
  • [SPARK-40532] - Python version for UDF should follow the server's version
  • [SPARK-40533] - Extend type support for Spark Connect literals
  • [SPARK-40534] - Extend support for Join Relation
  • [SPARK-40536] - Make Spark Connect port configurable.
  • [SPARK-40537] - Re-enable mypy support
  • [SPARK-40538] - Add missing PySpark functions to Spark Connect
  • [SPARK-40539] - PySpark read API parity for Spark Connect
  • [SPARK-40540] - Migrate compilation errors onto error classes
  • [SPARK-40542] - Make `ddof` in `DataFrame.std` and `Series.std` accept arbitrary integers
  • [SPARK-40543] - Make `ddof` in `DataFrame.var` and `Series.var` accept arbitrary integers
  • [SPARK-40550] - DataSource V2: Handle DELETE commands for delta-based sources
  • [SPARK-40551] - DataSource V2: Add APIs for delta-based row-level operations
  • [SPARK-40554] - Make `ddof` in `DataFrame.sem` and `Series.sem` accept arbitrary integers
  • [SPARK-40557] - Re-generate Spark Connect Python protos
  • [SPARK-40560] - Rename message to messageFormat in the STANDARD format of errors
  • [SPARK-40561] - Implement `min_count` in GroupBy.min
  • [SPARK-40569] - Add smoke test in standalone cluster for spark-docker
  • [SPARK-40571] - Construct a test case to verify fault-tolerance semantic with random python worker failures
  • [SPARK-40573] - Make `ddof` in `GroupBy.std`, `GroupBy.var` and `GroupBy.sem` accept arbitrary integers
  • [SPARK-40577] - Fix CategoricalIndex.append
  • [SPARK-40578] - Fix `IndexesTest.test_to_frame` when pandas 1.5.0
  • [SPARK-40579] - `GroupBy.first` should skip nulls
  • [SPARK-40580] - Update the document for DataFrame.to_orc
  • [SPARK-40587] - SELECT * shouldn't be an empty project list in proto
  • [SPARK-40589] - Fix test for `DataFrame.corr_with` to skip the pandas regression
  • [SPARK-40590] - Fix `ps.read_parquet` when pandas_metadata is True
  • [SPARK-40592] - Implement `min_count` in `GroupBy.max`
  • [SPARK-40593] - protoc-3.21.1-linux-x86_64.exe requires GLIBC_2.14
  • [SPARK-40605] - Connect module should use log4j2.properties to configure test log output as other modules
  • [SPARK-40613] - Update sbt-protoc to 1.0.6
  • [SPARK-40615] - Check unsupported data type when decorrelating subqueries
  • [SPARK-40621] - Implement `numeric_only` and `min_count` in `GroupBy.sum`
  • [SPARK-40631] - Implement `min_count` in `GroupBy.first`
  • [SPARK-40636] - Fix wrong remained shuffles log in BlockManagerDecommissioner
  • [SPARK-40643] - Implement `min_count` in `GroupBy.last`
  • [SPARK-40645] - Throw exception for Collect() and recommend to use toPandas()
  • [SPARK-40663] - Migrate execution errors onto error classes
  • [SPARK-40665] - Avoid embedding Spark Connect in the Apache Spark binary release
  • [SPARK-40671] - Support driver service labels
  • [SPARK-40672] - Run Scala side tests in GitHub Actions
  • [SPARK-40674] - Use unittest's asserts instead of built-in assert
  • [SPARK-40677] - Shade more dependency to be able to run separately
  • [SPARK-40680] - Avoid hardcoded versions in SBT build
  • [SPARK-40687] - Support data masking built-in Function 'mask'
  • [SPARK-40693] - mypy complains accessing the variable defined in the class method
  • [SPARK-40698] - Improve the precision of `product` for integral inputs
  • [SPARK-40699] - Supplement undocumented yarn configuration in documentation
  • [SPARK-40702] - Confusing partition specs in PartitionsAlreadyExistException
  • [SPARK-40707] - Add groupby to connect DSL and test more than one grouping expressions
  • [SPARK-40709] - Supplement undocumented avro configurations in documentation
  • [SPARK-40710] - Supplement undocumented parquet configurations in documentation
  • [SPARK-40713] - Improve SET operation support in the proto and the server
  • [SPARK-40714] - Remove PartitionAlreadyExistsException
  • [SPARK-40717] - Support Column Alias in connect DSL
  • [SPARK-40718] - Replace shaded netty with grpc netty to avoid double shaded dependency.
  • [SPARK-40726] - Supplement undocumented orc configurations in documentation
  • [SPARK-40727] - Add merge_spark_docker_pr.py to help merge commit
  • [SPARK-40729] - spark-shell fails to run with Java 19
  • [SPARK-40733] - ShowCreateTableSuite test failed
  • [SPARK-40737] - Add basic support for DataFrameWriter
  • [SPARK-40743] - StructType should contain a list of StructField and each field should have a name
  • [SPARK-40744] - Make `_reduce_for_stat_function` in `groupby` accept `min_count`
  • [SPARK-40746] - Make Dockerfile build workflow work in apache repo
  • [SPARK-40748] - Migrate type check failures of conditions onto error classes
  • [SPARK-40749] - Migrate type check failures of generators onto error classes
  • [SPARK-40750] - Migrate type check failures of math expressions onto error classes
  • [SPARK-40751] - Migrate type check failures of high order functions onto error classes
  • [SPARK-40752] - Migrate type check failures of misc expressions onto error classes
  • [SPARK-40754] - Add LICENSE and NOTICE for apache/spark-docker
  • [SPARK-40755] - Migrate type check failures of number formatting onto error classes
  • [SPARK-40756] - Migrate type check failures of string expressions onto error classes
  • [SPARK-40757] - Add PULL_REQUEST_TEMPLATE for spark-docker
  • [SPARK-40759] - Migrate type check failures of time window onto error classes
  • [SPARK-40760] - Migrate type check failures of interval expressions onto error classes
  • [SPARK-40761] - Migrate type check failures of percentile expressions onto error classes
  • [SPARK-40762] - Check error classes in ErrorParserSuite
  • [SPARK-40768] - Migrate type check failures of bloom_filter_agg() onto error classes
  • [SPARK-40769] - Migrate type check failures of aggregate expressions onto error classes
  • [SPARK-40773] - Refactor checkCorrelationsInSubquery
  • [SPARK-40774] - Add Sample to proto and DSL
  • [SPARK-40779] - Fix `corrwith` to work properly with a different anchor
  • [SPARK-40780] - Add WHERE to Connect proto and DSL
  • [SPARK-40783] - Enable Spark on K8s integration test for official dockerfiles
  • [SPARK-40784] - Check error classes in DDLParserSuite
  • [SPARK-40785] - Check error classes in ExpressionParserSuite
  • [SPARK-40786] - Check error classes in PlanParserSuite
  • [SPARK-40787] - Check error classes in SparkSqlParserSuite
  • [SPARK-40788] - Check error classes in CreateNamespaceParserSuite
  • [SPARK-40790] - Check error classes in DDL parsing tests
  • [SPARK-40796] - Check the generated python protos in GitHub Actions
  • [SPARK-40799] - Enforce Scalafmt for Spark Connect Module
  • [SPARK-40800] - Always inline expressions in OptimizeOneRowRelationSubquery
  • [SPARK-40805] - Use `spark` username in official image
  • [SPARK-40809] - Add as(alias: String) to connect DSL
  • [SPARK-40810] - Use SparkIllegalArgumentException instead of IllegalArgumentException in CreateDatabaseCommand & AlterDatabaseSetLocationCommand
  • [SPARK-40811] - Use checkError() to intercept ParseException
  • [SPARK-40812] - Add Deduplicate to Connect proto
  • [SPARK-40813] - Add limit and offset to Connect DSL
  • [SPARK-40816] - Python: rename LogicalPlan.collect to LogicalPlan.to_proto
  • [SPARK-40823] - Connect Proto should carry unparsed identifiers
  • [SPARK-40827] - Re-enable the DataFrame.corrwith test after it is fixed in a future pandas release
  • [SPARK-40828] - Drop Python test tables before and after unit tests
  • [SPARK-40832] - Add README for spark-docker
  • [SPARK-40833] - Cleanup apt lists cache in Dockerfile
  • [SPARK-40836] - AnalyzeResult should use struct for schema
  • [SPARK-40839] - [Python] Implement `DataFrame.sample`
  • [SPARK-40845] - Add template support for SPARK_GPG_KEY
  • [SPARK-40852] - Implement `DataFrame.summary`
  • [SPARK-40854] - Change default serialization from 'broken' CSV to Spark DF JSON
  • [SPARK-40856] - Update the error template of WRONG_NUM_PARAMS
  • [SPARK-40857] - Allow configurable gRPC interceptors for Spark Connect
  • [SPARK-40859] - Upgrade action/checkout to v3
  • [SPARK-40860] - Change `set-output` to `GITHUB_EVENT` in spark infra code
  • [SPARK-40862] - Unexpected operators when rewriting scalar subqueries with non-deterministic expressions
  • [SPARK-40864] - Remove pip/setuptools dynamic upgrade
  • [SPARK-40866] - Rename Check Spark repo as Check Spark Docker repo in GA
  • [SPARK-40870] - Upgrade docker actions to cleanup warning
  • [SPARK-40871] - Upgrade actions/script to v6 and fix notify workflow
  • [SPARK-40872] - Fallback to original shuffle block when a push-merged shuffle chunk is zero-size
  • [SPARK-40875] - Add .agg() to Connect DSL
  • [SPARK-40877] - Reimplement `crosstab` with dataframe operations
  • [SPARK-40878] - Pin 'grpcio==1.48.1' and 'protobuf==4.21.6'
  • [SPARK-40879] - Support Join UsingColumns in proto
  • [SPARK-40880] - Reimplement `summary` with dataframe operations
  • [SPARK-40881] - Upgrade actions/cache to v3 and actions/upload-artifact to v3
  • [SPARK-40882] - Upgrade actions/setup-java to v3 with distribution specified
  • [SPARK-40883] - Support Range in Connect proto
  • [SPARK-40888] - Check error classes in HiveQuerySuite
  • [SPARK-40889] - Check error classes in PlanResolutionSuite
  • [SPARK-40890] - Check error classes in DataSourceV2SQLSuite
  • [SPARK-40891] - Check error classes in TableIdentifierParserSuite
  • [SPARK-40896] - Fix doctest for `Index.(isin|isnull|notnull)` to work properly with pandas 1.5
  • [SPARK-40898] - Quote function names in datatype mismatch errors
  • [SPARK-40899] - UserContext should be extensible
  • [SPARK-40900] - Reimplement `frequentItems` with dataframe operations
  • [SPARK-40910] - Replace UnsupportedOperationException with SparkUnsupportedOperationException
  • [SPARK-40914] - Mark internal API to be private[connect]
  • [SPARK-40915] - Improve `on` in Join in Python client
  • [SPARK-40926] - Refactor server side tests to only use DataFrame API
  • [SPARK-40929] - Add official image dockerfile for Spark v3.3.1
  • [SPARK-40930] - Support Collect() in Python client
  • [SPARK-40933] - Reimplement df.stat.{cov, corr} with built-in sql functions
  • [SPARK-40938] - Support Alias for every Relation
  • [SPARK-40941] - Use Java 17 in K8s Dockerfile by default and remove `Dockerfile.java17`
  • [SPARK-40947] - Upgrade pandas to 1.5.1
  • [SPARK-40948] - Introduce new error class: PATH_NOT_FOUND
  • [SPARK-40949] - Implement `DataFrame.sortWithinPartitions`
  • [SPARK-40951] - pyspark-connect tests should be skipped if pandas doesn't exist
  • [SPARK-40953] - Add missing `limit(n)` in DataFrame.head
  • [SPARK-40965] - Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1208
  • [SPARK-40966] - Fix `read_parquet` with `pandas_metadata`
  • [SPARK-40967] - Migrate failAnalysis() onto error classes
  • [SPARK-40970] - Support List[Column] for Join's on argument.
  • [SPARK-40971] - Import more from the connect proto package to avoid calling `proto.` for Connect DSL
  • [SPARK-40973] - Rename _LEGACY_ERROR_TEMP_0055 to UNCLOSED_BRACKETED_COMMENT
  • [SPARK-40975] - Assign a name to the legacy error class _LEGACY_ERROR_TEMP_0021
  • [SPARK-40977] - Complete Support for Union in Python client
  • [SPARK-40978] - Migrate failAnalysis() w/o context onto error classes
  • [SPARK-40979] - Keep removed executor info in decommission state
  • [SPARK-40980] - Support session.sql in Connect DSL
  • [SPARK-40981] - Support session.range in Python client
  • [SPARK-40984] - Replace `FRAME_LESS_OFFSET_WITHOUT_FOLDABLE` with `NON_FOLDABLE_INPUT`
  • [SPARK-40989] - Improve `session.sql` testing coverage in Python client
  • [SPARK-40990] - DataFrame creation from 2d NumPy array with arbitrary columns
  • [SPARK-40992] - Support toDF(columnNames) in Connect DSL
  • [SPARK-40995] - Developer Documentation for Spark Connect
  • [SPARK-40998] - Assign a name to the legacy error class _LEGACY_ERROR_TEMP_0040
  • [SPARK-41001] - Connection string support for Python client (see the sketch after this list)
  • [SPARK-41002] - Compatible `take`, `head` and `first` API in Python client
  • [SPARK-41004] - Check error classes in InterceptorRegistrySuite
  • [SPARK-41005] - Arrow based collect
  • [SPARK-41009] - Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1070
  • [SPARK-41010] - Complete Support for Except and Intersect in Python client
  • [SPARK-41012] - Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
  • [SPARK-41019] - Provide a query context to failAnalysis
  • [SPARK-41020] - Assign a name to the legacy error class _LEGACY_ERROR_TEMP_2440
  • [SPARK-41021] - Test some subclasses of error class DATATYPE_MISMATCH
  • [SPARK-41022] - Test the error class: DEFAULT_DATABASE_NOT_EXISTS, INDEX_ALREADY_EXISTS, INDEX_NOT_FOUND, ROUTINE_NOT_FOUND
  • [SPARK-41026] - Support Repartition in Connect DSL
  • [SPARK-41027] - Use `UNEXPECTED_INPUT_TYPE` instead of `MAP_FROM_ENTRIES_WRONG_TYPE`
  • [SPARK-41034] - Connect DataFrame should require RemoteSparkSession
  • [SPARK-41036] - `columns` API should use `schema` API to avoid data fetching
  • [SPARK-41038] - Rename `MULTI_VALUE_SUBQUERY_ERROR` to `SCALAR_SUBQUERY_TOO_MANY_ROWS`
  • [SPARK-41041] - Integrate _LEGACY_ERROR_TEMP_1279 into TABLE_OR_VIEW_ALREADY_EXISTS
  • [SPARK-41042] - Rename PARSE_CHAR_MISSING_LENGTH to DATA_TYPE_MISSING_SIZE
  • [SPARK-41043] - Assign a name to the legacy error class _LEGACY_ERROR_TEMP_2429
  • [SPARK-41044] - Convert DATATYPE_MISMATCH.UNSPECIFIED_FRAME to INTERNAL_ERROR
  • [SPARK-41046] - Support CreateView in Connect DSL
  • [SPARK-41054] - Support disk-based KVStore in live UI
  • [SPARK-41055] - Rename _LEGACY_ERROR_TEMP_2424 to GROUP_BY_AGGREGATE
  • [SPARK-41058] - Remove unused code in connect
  • [SPARK-41059] - Rename _LEGACY_ERROR_TEMP_2420 to NESTED_AGGREGATE_FUNCTION
  • [SPARK-41061] - Support SelectExpr which apply Projection by expressions in Strings in Connect DSL
  • [SPARK-41062] - Rename UNSUPPORTED_CORRELATED_REFERENCE to CORRELATED_REFERENCE
  • [SPARK-41064] - Implement `DataFrame.crosstab` and `DataFrame.stat.crosstab`
  • [SPARK-41065] - Implement `DataFrame.freqItems` and `DataFrame.stat.freqItems`
  • [SPARK-41066] - Implement `DataFrame.sampleBy` and `DataFrame.stat.sampleBy`
  • [SPARK-41067] - Implement `DataFrame.stat.cov`
  • [SPARK-41068] - Implement `DataFrame.stat.corr`
  • [SPARK-41069] - Implement `DataFrame.approxQuantile` and `DataFrame.stat.approxQuantile`
  • [SPARK-41072] - Convert the internal error about failed stream to user-facing error
  • [SPARK-41077] - Rename `ColumnRef` to `Column` in Python client implementation
  • [SPARK-41078] - DataFrame `withColumnsRenamed` can be implemented through `RenameColumns` proto
  • [SPARK-41095] - Convert unresolved operators to internal errors
  • [SPARK-41098] - Rename GROUP_BY_POS_REFERS_AGG_EXPR to GROUP_BY_POS_AGGREGATE
  • [SPARK-41102] - Merge SparkConnectPlanner and SparkConnectCommandPlanner
  • [SPARK-41103] - Document how to add a new proto field of messages
  • [SPARK-41105] - Adopt `optional` keyword from proto3 which offers `hasXXX` to differentiate if a field is set or unset
  • [SPARK-41108] - Control the max size of arrow batch
  • [SPARK-41109] - Rename the error class _LEGACY_ERROR_TEMP_1216 to INVALID_LIKE_PATTERN
  • [SPARK-41110] - Implement `DataFrame.sparkSession` in Python client
  • [SPARK-41111] - Implement `DataFrame.show`
  • [SPARK-41114] - Support local data for LocalRelation
  • [SPARK-41115] - Add ClientType to proto to indicate which client sends a request
  • [SPARK-41116] - Input relation can be optional for Project in Connect proto
  • [SPARK-41122] - Explain API can support different modes
  • [SPARK-41127] - Implement DataFrame.CreateGlobalView in Python client
  • [SPARK-41128] - Implement `DataFrame.fillna` and `DataFrame.na.fill`
  • [SPARK-41130] - Rename OUT_OF_DECIMAL_TYPE_RANGE to NUMERIC_OUT_OF_SUPPORTED_RANGE
  • [SPARK-41131] - Improve error message for UNRESOLVED_MAP_KEY.WITHOUT_SUGGESTION
  • [SPARK-41133] - Integrate UNSCALED_VALUE_TOO_LARGE_FOR_PRECISION into NUMERIC_VALUE_OUT_OF_RANGE
  • [SPARK-41135] - Rename UNSUPPORTED_EMPTY_LOCATION to INVALID_EMPTY_LOCATION
  • [SPARK-41137] - Rename LATERAL_JOIN_OF_TYPE to INVALID_LATERAL_JOIN_TYPE
  • [SPARK-41139] - Improve error message for PYTHON_UDF_IN_ON_CLAUSE
  • [SPARK-41140] - Assign a name to the legacy error class _LEGACY_ERROR_TEMP_2440
  • [SPARK-41148] - Implement `DataFrame.dropna` and `DataFrame.na.drop`
  • [SPARK-41150] - Document debugging with PySpark memory profiler
  • [SPARK-41157] - Show detailed differences in dataframe comparison
  • [SPARK-41158] - Use `checkError()` to check `DATATYPE_MISMATCH` in `DataFrameFunctionsSuite`
  • [SPARK-41164] - Update relations.proto to follow Connect Proto development guidance
  • [SPARK-41166] - Check errorSubClass of DataTypeMismatch in *ExpressionSuites
  • [SPARK-41169] - Implement `DataFrame.drop`
  • [SPARK-41172] - Migrate the ambiguous ref error to an error class
  • [SPARK-41173] - Move `require()` out from the constructors of string expressions
  • [SPARK-41174] - Propagate an error class to users for invalid `format` of `to_binary()`
  • [SPARK-41175] - Assign a name to the error class _LEGACY_ERROR_TEMP_1078
  • [SPARK-41176] - Assign a name to the error class _LEGACY_ERROR_TEMP_1042
  • [SPARK-41179] - Assign a name to the error class _LEGACY_ERROR_TEMP_1092
  • [SPARK-41180] - Assign an error class to "Cannot parse the data type"
  • [SPARK-41181] - Migrate the map options errors onto error classes
  • [SPARK-41182] - Assign a name to the error class _LEGACY_ERROR_TEMP_1102
  • [SPARK-41196] - Homogenize the protobuf version across server and client
  • [SPARK-41201] - Implement `DataFrame.SelectExpr` in Python client
  • [SPARK-41203] - Dataframe.transform in Python client support
  • [SPARK-41206] - Assign a name to the error class _LEGACY_ERROR_TEMP_1233
  • [SPARK-41212] - Implement `DataFrame.isEmpty`
  • [SPARK-41213] - Implement `DataFrame.__repr__` and `DataFrame.dtypes`
  • [SPARK-41215] - protoc-3.21.9-linux-x86_64.exe requires GLIBC_2.14
  • [SPARK-41216] - Make AnalyzePlan support multiple analysis tasks
  • [SPARK-41217] - Add an error class for failures of built-in function calls
  • [SPARK-41221] - Add the error class INVALID_FORMAT
  • [SPARK-41222] - Unify the typing definitions
  • [SPARK-41225] - Disable unsupported functions
  • [SPARK-41227] - Implement DataFrame cross join
  • [SPARK-41228] - Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION
  • [SPARK-41230] - Remove `str` from Aggregate expression type
  • [SPARK-41232] - High-order function: array_append
  • [SPARK-41234] - High-order function: array_insert
  • [SPARK-41235] - High-order function: array_compact
  • [SPARK-41237] - Assign a name to the error class _LEGACY_ERROR_TEMP_0030
  • [SPARK-41238] - Support more datatypes
  • [SPARK-41243] - Update the protobuf version in README
  • [SPARK-41244] - Introducing a Protobuf serializer for UI data on KV store
  • [SPARK-41250] - DataFrame.to_pandas should not return optional pandas dataframe
  • [SPARK-41253] - Make K8s volcano IT work in Github Action
  • [SPARK-41255] - RemoteSparkSession should be called SparkSession
  • [SPARK-41256] - Implement DataFrame.withColumn(s)
  • [SPARK-41258] - Upgrade spark-docker actions
  • [SPARK-41263] - Upgrade buf to v1.9.0
  • [SPARK-41264] - Make Literal support more datatypes
  • [SPARK-41265] - Check and upgrade buf.build/protocolbuffers/plugins/python to 3.19.5
  • [SPARK-41268] - Refactor "Column" for API Compatibility
  • [SPARK-41269] - Move image matrix into version's workflow
  • [SPARK-41272] - Assign a name to the error class _LEGACY_ERROR_TEMP_2019
  • [SPARK-41278] - Clean up unused QualifiedAttribute in Expression.proto
  • [SPARK-41280] - Implement DataFrame.tail
  • [SPARK-41287] - Add a test workflow to help test image in fork repo
  • [SPARK-41291] - `DataFrame.explain` should print and return None
  • [SPARK-41292] - Window-function support
  • [SPARK-41293] - Code cleanup for assertXXX methods in ExpressionTypeCheckingSuite
  • [SPARK-41295] - Assign a name to the error class _LEGACY_ERROR_TEMP_1105
  • [SPARK-41296] - Assign a name to the error class _LEGACY_ERROR_TEMP_1106
  • [SPARK-41297] - Support string sql expressions in DF.where()
  • [SPARK-41300] - Read.schema is incorrectly read when unset
  • [SPARK-41301] - SparkSession.range should treat end as optional
  • [SPARK-41302] - Assign a name to the error class _LEGACY_ERROR_TEMP_1185
  • [SPARK-41304] - Add missing docs for DataFrame API
  • [SPARK-41306] - Improve Connect Expression proto documentation
  • [SPARK-41308] - Improve `DataFrame.count()`
  • [SPARK-41309] - Assign a name to the error class _LEGACY_ERROR_TEMP_1093
  • [SPARK-41310] - Implement DataFrame.toDF
  • [SPARK-41311] - Rewrite test RENAME_SRC_PATH_NOT_FOUND to trigger the error from user space
  • [SPARK-41312] - Implement DataFrame.withColumnRenamed
  • [SPARK-41314] - Assign a name to the error class _LEGACY_ERROR_TEMP_1094
  • [SPARK-41315] - Implement `DataFrame.replace ` and `DataFrame.na.replace `
  • [SPARK-41317] - PySpark write API for Spark Connect
  • [SPARK-41319] - when-otherwise support
  • [SPARK-41321] - Support target field for UnresolvedStar
  • [SPARK-41325] - Add missing avg() to DF group
  • [SPARK-41326] - Bug in Deduplicate Python transformation
  • [SPARK-41328] - Add logical and string API to Column
  • [SPARK-41329] - Solve circular import between Column and _typing/functions
  • [SPARK-41330] - Improve Documentation for Take, Tail, Limit and Offset
  • [SPARK-41331] - Add orderBy and drop_duplicates
  • [SPARK-41332] - Fix `nullOrdering` in `SortOrder`
  • [SPARK-41333] - Make `Groupby.{min, max, sum, avg, mean}` compatible with PySpark
  • [SPARK-41334] - move SortField from relations.proto to expressions.proto
  • [SPARK-41335] - Support IsNull and IsNotNull in Column
  • [SPARK-41343] - Move FunctionName parsing to server side
  • [SPARK-41345] - Add Hint to Connect Proto
  • [SPARK-41346] - Implement asc and desc methods
  • [SPARK-41347] - Add Cast to Expression proto
  • [SPARK-41348] - Refactor `UnsafeArrayWriterSuite` to check error class
  • [SPARK-41349] - Implement `DataFrame.hint`
  • [SPARK-41351] - Column does not support !=
  • [SPARK-41354] - Implement `DataFrame.repartitionByRange`
  • [SPARK-41357] - Implement math functions
  • [SPARK-41358] - Use `PhysicalDataType` instead of DataType in ColumnVectorUtils
  • [SPARK-41363] - Implement normal functions
  • [SPARK-41364] - Implement `broadcast` function
  • [SPARK-41366] - DF.groupby.agg() API should be compatible
  • [SPARK-41371] - Improve Documentation for Command proto
  • [SPARK-41380] - Implement aggregation functions
  • [SPARK-41381] - Implement count_distinct and sum_distinct functions
  • [SPARK-41382] - Implement `product` function
  • [SPARK-41383] - Implement `DataFrame.cube`
  • [SPARK-41388] - getReusablePVCs should ignore recently created PVCs in the previous batch
  • [SPARK-41389] - Reuse `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1044`
  • [SPARK-41394] - Skip MemoryProfilerTests when pandas is not installed
  • [SPARK-41397] - Implement part of string/binary functions
  • [SPARK-41398] - SPJ: Relax constraints on Storage-Partitioned Join when partition keys after runtime filtering do not match
  • [SPARK-41399] - Refactor column related tests to test_connect_column
  • [SPARK-41403] - Implement DataFrame.describe
  • [SPARK-41406] - Refactor error message for `NUM_COLUMNS_MISMATCH` to make it more generic
  • [SPARK-41407] - Pull out v1 write to WriteFiles
  • [SPARK-41409] - Reuse `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1043`
  • [SPARK-41410] - Support PVC-oriented executor pod allocation
  • [SPARK-41412] - Implement `Cast`
  • [SPARK-41413] - SPJ: Avoid shuffle when partition keys mismatch, but join expressions are compatible
  • [SPARK-41414] - Implement date/timestamp functions
  • [SPARK-41417] - Assign a name to the error class _LEGACY_ERROR_TEMP_0019
  • [SPARK-41420] - Protobuf serializer for ApplicationInfoWrapper
  • [SPARK-41421] - Protobuf serializer for ApplicationEnvironmentInfoWrapper
  • [SPARK-41422] - Protobuf serializer for ExecutorSummaryWrapper
  • [SPARK-41423] - Protobuf serializer for StageDataWrapper
  • [SPARK-41424] - Protobuf serializer for TaskDataWrapper
  • [SPARK-41425] - Protobuf serializer for RDDStorageInfoWrapper
  • [SPARK-41426] - Protobuf serializer for ResourceProfileWrapper
  • [SPARK-41427] - Protobuf serializer for ExecutorStageSummaryWrapper
  • [SPARK-41428] - Protobuf serializer for SpeculationStageSummaryWrapper
  • [SPARK-41429] - Protobuf serializer for RDDOperationGraphWrapper
  • [SPARK-41430] - Protobuf serializer for ProcessSummaryWrapper
  • [SPARK-41431] - Protobuf serializer for SQLExecutionUIData
  • [SPARK-41432] - Protobuf serializer for SparkPlanGraphWrapper
  • [SPARK-41433] - Make Max Arrow BatchSize configurable
  • [SPARK-41434] - Support LambdaFunction expression
  • [SPARK-41435] - Make `curdate()` throw `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1043` when args is not null
  • [SPARK-41436] - Implement `collection` functions: A~C
  • [SPARK-41438] - Implement DataFrame.colRegex
  • [SPARK-41439] - Implement `DataFrame.melt` and `DataFrame.unpivot`
  • [SPARK-41440] - Implement DataFrame.randomSplit
  • [SPARK-41441] - Allow Generate with no required child output to host outer references
  • [SPARK-41443] - Assign a name to the error class _LEGACY_ERROR_TEMP_1061
  • [SPARK-41444] - Implement DataFrameReader.json
  • [SPARK-41445] - Implement DataFrameReader.parquet
  • [SPARK-41446] - Make `createDataFrame` support schema and more input dataset types
  • [SPARK-41453] - Implement DataFrame.subtract
  • [SPARK-41455] - Resolve dtypes inconsistencies of date/timestamp functions
  • [SPARK-41457] - Refactor pandas, pyarrow and grpc check in tests
  • [SPARK-41461] - protoc-3.21.9-linux-x86_64.exe requires GLIBC_2.14
  • [SPARK-41462] - Date and timestamp type can up cast to TimestampNTZ
  • [SPARK-41464] - Implement DataFrame.to
  • [SPARK-41465] - Assign a name to the error class _LEGACY_ERROR_TEMP_1235
  • [SPARK-41470] - SPJ: Spark shouldn't assume InternalRow implements equals and hashCode
  • [SPARK-41472] - Implement the rest of string/binary functions
  • [SPARK-41473] - Implement `functions.format_number`
  • [SPARK-41477] - Correctly infer the datatype of literal integers
  • [SPARK-41478] - Assign a name to the error class _LEGACY_ERROR_TEMP_1234
  • [SPARK-41479] - Add `IPv4 and IPv6` section to K8s document
  • [SPARK-41481] - Reuse `INVALID_TYPED_LITERAL` instead of `_LEGACY_ERROR_TEMP_0020`
  • [SPARK-41484] - Implement `collection` functions: E~M
  • [SPARK-41485] - Unify the environment variable of *_PROTOC_EXEC_PATH
  • [SPARK-41488] - Assign name to _LEGACY_ERROR_TEMP_1176
  • [SPARK-41489] - Assign name to _LEGACY_ERROR_TEMP_2415
  • [SPARK-41490] - Assign name to _LEGACY_ERROR_TEMP_2441
  • [SPARK-41492] - implement MISC function
  • [SPARK-41493] - Make csv functions support options
  • [SPARK-41495] - Implement `collection` functions: P~Z
  • [SPARK-41502] - Upgrade the minimum Minikube version to 1.28.0
  • [SPARK-41503] - Implement Partition Transformation Functions
  • [SPARK-41506] - Refactor LiteralExpression to support DataType
  • [SPARK-41508] - Assign name to _LEGACY_ERROR_TEMP_1179 and unwrap the existing SparkThrowable
  • [SPARK-41513] - Implement an Accumulator to collect per-mapper row count metrics
  • [SPARK-41514] - Add `PVC-oriented executor pod allocation` section and revise config name
  • [SPARK-41518] - Assign a name to the error class _LEGACY_ERROR_TEMP_2422
  • [SPARK-41525] - Improve onNewSnapshots to use unique list of known executor IDs and PVC names
  • [SPARK-41526] - Implement `Column.isin`
  • [SPARK-41528] - Sharing namespace between PySpark and Spark Connect
  • [SPARK-41529] - Implement SparkSession.stop
  • [SPARK-41533] - GRPC Errors on the client should be cleaned up
  • [SPARK-41536] - Remove `Dynamic Resource Allocation` from K8s Future Work
  • [SPARK-41540] - Add `DISK_USED` executor roll policy
  • [SPARK-41542] - Run Coverage report for Spark Connect
  • [SPARK-41543] - Add `TOTAL_SHUFFLE_WRITE` executor roll policy
  • [SPARK-41546] - pyspark_types_to_proto_types should support StructType
  • [SPARK-41548] - Disable ANSI mode in pyspark.sql.tests.connect.test_connect_functions
  • [SPARK-41552] - Upgrade `kubernetes-client` to 6.3.1
  • [SPARK-41565] - Add the error class UNRESOLVED_ROUTINE
  • [SPARK-41568] - Assign name to _LEGACY_ERROR_TEMP_1236
  • [SPARK-41571] - Assign name to _LEGACY_ERROR_TEMP_2310
  • [SPARK-41572] - Assign name to _LEGACY_ERROR_TEMP_2149
  • [SPARK-41573] - Assign name to _LEGACY_ERROR_TEMP_2136
  • [SPARK-41574] - Assign name to _LEGACY_ERROR_TEMP_2009
  • [SPARK-41575] - Assign name to _LEGACY_ERROR_TEMP_2054
  • [SPARK-41576] - Assign name to _LEGACY_ERROR_TEMP_2051
  • [SPARK-41578] - Assign name to _LEGACY_ERROR_TEMP_2141
  • [SPARK-41579] - Assign name to _LEGACY_ERROR_TEMP_1249
  • [SPARK-41580] - Assign name to _LEGACY_ERROR_TEMP_2137
  • [SPARK-41581] - Assign name to _LEGACY_ERROR_TEMP_1230
  • [SPARK-41582] - Reuse `INVALID_TYPED_LITERAL` instead of `_LEGACY_ERROR_TEMP_0022`
  • [SPARK-41583] - Add Spark Connect and protobuf into setup.py with specifying dependencies
  • [SPARK-41586] - Introduce new PySpark package: pyspark.errors
  • [SPARK-41591] - Implement functionality for training a PyTorch file locally
  • [SPARK-41592] - Implement functionality for training a PyTorch file on the executors
  • [SPARK-41593] - Implement logging from the executor nodes
  • [SPARK-41595] - Support generator function explode/explode_outer in the FROM clause
  • [SPARK-41598] - Migrate the errors from `pyspark/sql/functions.py` into error classes
  • [SPARK-41600] - Support Catalog.cacheTable
  • [SPARK-41612] - Support Catalog.isCached
  • [SPARK-41623] - Support Catalog.uncacheTable
  • [SPARK-41629] - Support for protocol extensions
  • [SPARK-41630] - Support lateral column alias in Project code path
  • [SPARK-41631] - Support lateral column alias in Aggregate code path
  • [SPARK-41640] - implement `Window` functions
  • [SPARK-41641] - Implement `Column.over`
  • [SPARK-41643] - Deduplicate docstrings in pyspark.sql.connect.column
  • [SPARK-41644] - Introducing SPI mechanism to make it easy for other modules to register ProtoBufSerializer
  • [SPARK-41645] - Deduplicate docstrings in pyspark.sql.connect.dataframe
  • [SPARK-41647] - Deduplicate docstrings in pyspark.sql.connect.functions
  • [SPARK-41648] - Deduplicate docstrings in pyspark.sql.connect.readwriter
  • [SPARK-41649] - Deduplicate docstrings in pyspark.sql.connect.window
  • [SPARK-41654] - Enable doctests in pyspark.sql.connect.window
  • [SPARK-41655] - Enable doctests in pyspark.sql.connect.column
  • [SPARK-41656] - Enable doctests in pyspark.sql.connect.dataframe
  • [SPARK-41657] - Enable doctests in pyspark.sql.connect.session
  • [SPARK-41659] - Enable doctests in pyspark.sql.connect.readwriter
  • [SPARK-41663] - Implement the rest of Lambda functions
  • [SPARK-41672] - Enable the deprecated functions
  • [SPARK-41673] - Implement `Column.astype`
  • [SPARK-41675] - Make column op support `datetime`
  • [SPARK-41676] - Protobuf serializer for StreamingQueryData
  • [SPARK-41677] - Protobuf serializer for StreamingQueryProgressWrapper
  • [SPARK-41679] - Protobuf serializer for StreamBlockData
  • [SPARK-41680] - Protobuf serializer for CachedQuantile
  • [SPARK-41681] - Factor GroupedData out to group.py
  • [SPARK-41685] - Support optionally using the Protobuf serializer for KVStore in History server
  • [SPARK-41687] - Deduplicate docstrings in pyspark.sql.connect.group
  • [SPARK-41688] - Move Expressions to expressions.py
  • [SPARK-41689] - Enable doctests in pyspark.sql.connect.group
  • [SPARK-41692] - implement `DataFrame.rollup`
  • [SPARK-41693] - Implement `GroupedData.pivot`
  • [SPARK-41694] - Add new config to clean up the `spark.ui.store.path` directory when SparkContext.stop() is called
  • [SPARK-41697] - Enable test_df_show, test_drop, test_dropna, test_toDF_with_schema_string and test_with_columns_renamed
  • [SPARK-41698] - Enable 16 tests that pass
  • [SPARK-41699] - Upgrade buf to v1.11.0
  • [SPARK-41700] - Remove `FunctionBuilder`
  • [SPARK-41701] - Make column op support `decimal`
  • [SPARK-41702] - Add invalid ops
  • [SPARK-41703] - Combine NullType and typed_null
  • [SPARK-41706] - pyspark_types_to_proto_types should support MapType
  • [SPARK-41707] - Implement initial Catalog.* API
  • [SPARK-41708] - Pull v1write information to WriteFiles
  • [SPARK-41709] - Explicitly define `Seq` as `collection.Seq` to reduce `toSeq` calls when creating UI objects from protobuf objects for Scala 2.13
  • [SPARK-41710] - Implement `Column.between`
  • [SPARK-41712] - Migrate the Spark Connect errors into PySpark error framework.
  • [SPARK-41713] - Make CTAS hold a nested execution for data writing
  • [SPARK-41715] - Catch specific exceptions for both Spark Connect and PySpark
  • [SPARK-41716] - Factor pyspark.sql.connect.Catalog._catalog_to_pandas to client.py
  • [SPARK-41717] - Implement the command logic for print and _repr_html_
  • [SPARK-41721] - Enable doctests in pyspark.sql.connect.catalog
  • [SPARK-41722] - Implement time window functions
  • [SPARK-41723] - Implement `sequence` function
  • [SPARK-41724] - Implement `call_udf` function
  • [SPARK-41725] - Remove the workaround of sql(...).collect back in PySpark tests
  • [SPARK-41726] - Remove OptimizedCreateHiveTableAsSelectCommand
  • [SPARK-41728] - Implement `unwrap_udt` function
  • [SPARK-41729] - Assign name to _LEGACY_ERROR_TEMP_0011
  • [SPARK-41731] - Implement the column accessor
  • [SPARK-41734] - Wrap catalog messages into a parent message
  • [SPARK-41736] - pyspark_types_to_proto_types should support ArrayType
  • [SPARK-41737] - Implement `GroupedData.{min, max, avg, sum}`
  • [SPARK-41738] - Client ID should be mixed into SparkSession cache
  • [SPARK-41740] - Implement `Column.name`
  • [SPARK-41742] - Support star in groupBy.agg()
  • [SPARK-41743] - groupBy(...).agg(...).sort does not actually sort the output
  • [SPARK-41744] - Support multiple arguments in groupBy.max(...)
  • [SPARK-41745] - SparkSession.createDataFrame does not respect the column names in the row
  • [SPARK-41746] - SparkSession.createDataFrame does not support nested datatypes
  • [SPARK-41747] - Support multiple arguments in groupBy.avg(...)
  • [SPARK-41748] - Support multiple arguments in groupBy.min(...)
  • [SPARK-41749] - Support multiple arguments in groupBy.sum(...)
  • [SPARK-41751] - Support Column.bitwiseAND, bitwiseOR, bitwiseXOR, eqNullSafe, isNotNull, isNull, isin
  • [SPARK-41754] - Add simple developer guides for UI protobuf serializer
  • [SPARK-41757] - Compatibility of string representation in Column
  • [SPARK-41759] - Use `weakIntern` on string values when creating new objects during deserialization
  • [SPARK-41761] - Fix arithmetic ops: negate, pow
  • [SPARK-41764] - Make the internal string op name consistent with FunctionRegistry
  • [SPARK-41767] - Implement `Column.{withField, dropFields}`
  • [SPARK-41768] - Refactor the definition of the `JobExecutionStatus` enum to follow the code style
  • [SPARK-41770] - eqNullSafe does not support None as its argument
  • [SPARK-41771] - __getitem__ does not work with Column.isin
  • [SPARK-41772] - Enable pyspark.sql.connect.column.Column.withField doctest
  • [SPARK-41773] - Window.partitionBy is not respected with row_number
  • [SPARK-41775] - Implement training functions as input
  • [SPARK-41777] - Add Integration Tests
  • [SPARK-41779] - Make getitem support filter and select
  • [SPARK-41783] - Make column op support None
  • [SPARK-41784] - Add missing `__rmod__`
  • [SPARK-41785] - Implement `GroupedData.mean`
  • [SPARK-41786] - Deduplicate helper functions
  • [SPARK-41789] - Make `createDataFrame` support list of Rows
  • [SPARK-41796] - Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE
  • [SPARK-41797] - Enable test for `array_repeat`
  • [SPARK-41799] - Combine plan-related tests
  • [SPARK-41803] - log() function variations are missing
  • [SPARK-41807] - Remove non-existent error class: UNSUPPORTED_FEATURE.DISTRIBUTE_BY
  • [SPARK-41808] - Make json functions support options
  • [SPARK-41809] - Make json functions support DataType Schema
  • [SPARK-41810] - SparkSession.createDataFrame does not respect the column names in the dictionary
  • [SPARK-41812] - DataFrame.join: ambiguous column
  • [SPARK-41815] - Column.isNull returns nan instead of None
  • [SPARK-41817] - SparkSession.read support reading with schema
  • [SPARK-41821] - Fix DataFrame.describe
  • [SPARK-41824] - Implement DataFrame.explain format to be similar to PySpark
  • [SPARK-41825] - DataFrame.show formatting int as double
  • [SPARK-41827] - DataFrame.groupBy requires all cols be Column or str
  • [SPARK-41828] - Implement creating an empty DataFrame
  • [SPARK-41829] - Implement DataFrame.sort, sortWithinPartitions ordering
  • [SPARK-41830] - Fix DataFrame.sample parameters
  • [SPARK-41831] - DataFrame.transform: Only Column or String can be used for projections
  • [SPARK-41832] - DataFrame.unionByName output is wrong
  • [SPARK-41833] - DataFrame.collect() output parity with pyspark
  • [SPARK-41834] - Implement SparkSession.conf
  • [SPARK-41835] - Implement `transform_keys` function
  • [SPARK-41836] - Implement `transform_values` function
  • [SPARK-41837] - DataFrame.createDataFrame datatype conversion error
  • [SPARK-41838] - DataFrame.show(): fix map printing
  • [SPARK-41840] - DataFrame.show(): 'Column' object is not callable
  • [SPARK-41842] - Support data type Timestamp(NANOSECOND, null)
  • [SPARK-41844] - Implement `intX2` function
  • [SPARK-41845] - Fix `count(expr("*"))` function
  • [SPARK-41846] - DataFrame windowspec functions: unresolved columns
  • [SPARK-41847] - DataFrame mapfield, structlist invalid type
  • [SPARK-41849] - Implement DataFrameReader.text
  • [SPARK-41850] - Fix `isnan` function
  • [SPARK-41851] - Fix `nanvl` function
  • [SPARK-41852] - Fix `pmod` function
  • [SPARK-41855] - `createDataFrame` doesn't handle None/NaN properly
  • [SPARK-41856] - Enable test_freqItems, test_input_files, test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found
  • [SPARK-41857] - Enable test_between_function, test_datetime_functions, test_expr, test_math_functions, test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, test_approxQuantile
  • [SPARK-41862] - Fix a correctness bug in existence DEFAULT value lookups for the Orc data source
  • [SPARK-41866] - Make `createDataFrame` support array
  • [SPARK-41868] - Support data type Duration(NANOSECOND)
  • [SPARK-41869] - DataFrame dropDuplicates should throw error on non-list argument
  • [SPARK-41870] - Handle duplicate columns in `createDataFrame`
  • [SPARK-41871] - DataFrame hint parameter can be str, float or int
  • [SPARK-41872] - Fix DataFrame createDataFrame handling of None
  • [SPARK-41874] - Implement DataFrame `sameSemantics`
  • [SPARK-41875] - Throw proper errors in Dataset.to()
  • [SPARK-41876] - Implement DataFrame `toLocalIterator`
  • [SPARK-41877] - SparkSession.createDataFrame error parity
  • [SPARK-41878] - Add JIRAs or messages for skipped tests
  • [SPARK-41879] - `DataFrame.collect` should support nested types
  • [SPARK-41880] - Function `from_json` should support non-literal expression
  • [SPARK-41881] - `DataFrame.collect` should handle None/NaN properly
  • [SPARK-41882] - Add tests for SQLAppStatusStore with RocksDB Backend
  • [SPARK-41884] - DataFrame `toPandas` parity in return types
  • [SPARK-41886] - `DataFrame.intersect` doctest output has different order
  • [SPARK-41887] - Support DataFrame hint parameter to be list
  • [SPARK-41889] - Attach root cause to invalidPatternError
  • [SPARK-41890] - Reduce `toSeq` in `RDDOperationGraphWrapperSerializer`/`SparkPlanGraphWrapperSerializer` for Scala 2.13
  • [SPARK-41891] - Enable test_add_months_function, test_array_repeat, test_dayofweek, test_first_last_ignorenulls, test_function_parity, test_inline, test_window_time, test_reciprocal_trig_functions
  • [SPARK-41892] - Add JIRAs or messages for skipped messages
  • [SPARK-41895] - Add tests for streaming UI with RocksDB backend
  • [SPARK-41897] - Parity in Error types between pyspark and connect functions
  • [SPARK-41898] - Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument
  • [SPARK-41899] - DataFrame.createDataFrame converting int to bigint
  • [SPARK-41900] - Support data type int8
  • [SPARK-41901] - Parity in String representation of Column
  • [SPARK-41902] - Parity in String representation of higher_order_function's output
  • [SPARK-41903] - Support data type ndarray
  • [SPARK-41905] - Function `slice` should handle string in params
  • [SPARK-41906] - Handle Function `rand()`
  • [SPARK-41907] - Function `sampleby` return parity
  • [SPARK-41921] - Enable doctests in connect.column and connect.functions
  • [SPARK-41923] - Add `DataFrame.writeTo` to the unsupported list
  • [SPARK-41924] - Make StructType support metadata and Implement `DataFrame.withMetadata`
  • [SPARK-41926] - Add Github action test job with RocksDB as UI backend
  • [SPARK-41927] - Add the unsupported list for `GroupedData`
  • [SPARK-41928] - Add the unsupported list for functions
  • [SPARK-41929] - Add function array_compact
  • [SPARK-41933] - Provide local mode that automatically starts the server
  • [SPARK-41934] - Add the unsupported function list for `session`
  • [SPARK-41936] - Make `withMetadata` reuse the `withColumns` proto
  • [SPARK-41939] - Add the unsupported list for catalog functions
  • [SPARK-41944] - Pass configurations when local remote mode is on
  • [SPARK-41945] - Python: connect client lost column data with pyarrow.Table.to_pylist
  • [SPARK-41957] - Enable the doctest for `DataFrame.hint`
  • [SPARK-41959] - Improve v1 writes with empty2null
  • [SPARK-41960] - Assign name to _LEGACY_ERROR_TEMP_1056
  • [SPARK-41961] - Support table-valued functions with LATERAL
  • [SPARK-41963] - Different exception message in DataFrame.unpivot
  • [SPARK-41964] - Add the unsupported function list
  • [SPARK-41968] - Refactor ProtobufSerDe to ProtobufSerDe[T]
  • [SPARK-41973] - Assign name to _LEGACY_ERROR_TEMP_1311
  • [SPARK-41974] - Turn `INCORRECT_END_OFFSET` into `INTERNAL_ERROR`
  • [SPARK-41975] - Improve error message for `INDEX_ALREADY_EXISTS`
  • [SPARK-41976] - Improve error message for `INDEX_NOT_FOUND`
  • [SPARK-41977] - Enable test_generic_hints
  • [SPARK-41978] - SparkSession.range to take float as arguments
  • [SPARK-41980] - Enable test_functions_broadcast
  • [SPARK-41983] - Rename error class: NULL_COMPARISON_RESULT
  • [SPARK-41984] - Rename & improve error message for RESET_PERMISSION_TO_ORIGINAL
  • [SPARK-41988] - Fix map_filter and map_zip_with output order
  • [SPARK-41999] - NPE for bucketed write (ReadwriterTests.test_bucketed_write)
  • [SPARK-42000] - saveAsTable fail to find the default source (ReadwriterTests.test_insert_into)
  • [SPARK-42001] - Unexpected schema set to DefaultSource plan (ReadwriterTests.test_save_and_load)
  • [SPARK-42002] - Implement DataFrameWriterV2 (ReadwriterV2Tests)
  • [SPARK-42004] - Migrate "XX000" sqlState onto `INTERNAL_ERROR`
  • [SPARK-42007] - Reuse pyspark.sql.tests.test_group test cases
  • [SPARK-42008] - Reuse pyspark.sql.tests.test_datasources test cases
  • [SPARK-42009] - Reuse pyspark.sql.tests.test_serde test cases
  • [SPARK-42010] - Reuse pyspark.sql.tests.test_column test cases
  • [SPARK-42011] - Implement DataFrameReader.csv
  • [SPARK-42012] - Implement DataFrameReader.orc
  • [SPARK-42013] - Implement DataFrameReader.text to take multiple paths
  • [SPARK-42014] - Support aware datetimes
  • [SPARK-42016] - Type inconsistency of struct and map when accessing the nested column
  • [SPARK-42019] - Reuse pyspark.sql.tests.test_types test cases
  • [SPARK-42021] - createDataFrame with array.array
  • [SPARK-42022] - createDataFrame should autogenerate missing column names
  • [SPARK-42023] - createDataFrame should coerce types of string false to bool false
  • [SPARK-42026] - Protobuf serializer for AppSummary and PoolData
  • [SPARK-42028] - Support Pandas DF to Spark DF with Nanosecond Timestamps
  • [SPARK-42029] - Distribution build for Spark Connect does not work with Spark Shell
  • [SPARK-42032] - Map data shown in different order
  • [SPARK-42038] - SPJ: Support partially clustered distribution
  • [SPARK-42039] - SPJ: Remove Option in KeyGroupedPartitioning#partitionValues
  • [SPARK-42041] - DataFrameReader should support list of paths
  • [SPARK-42042] - DataFrameReader should support StructType schema
  • [SPARK-42044] - Fix wrong error message for `MUST_AGGREGATE_CORRELATED_SCALAR_SUBQUERY`
  • [SPARK-42045] - Round/Bround should return an error on integral overflow
  • [SPARK-42047] - Literal should support numpy datatypes
  • [SPARK-42048] - Different column name of lit(np.int8)
  • [SPARK-42062] - Enforce scalafmt for connect-common
  • [SPARK-42063] - Register `byte[][]` to KryoSerializer
  • [SPARK-42070] - Change the default value of argument of Mask udf from -1 to NULL
  • [SPARK-42071] - Register scala.math.Ordering$Reverse to KryoSerializer
  • [SPARK-42073] - Enable pyspark.sql.tests.test_types 2 test cases
  • [SPARK-42074] - Enable KryoSerializer in TPCDSQueryBenchmark to enforce SQL class registration
  • [SPARK-42076] - Factor data conversion `arrow -> rows` out to `conversion.py`
  • [SPARK-42077] - Literal should throw TypeError for unsupported DataType
  • [SPARK-42078] - Migrate errors thrown by JVM into PySpark Exception.
  • [SPARK-42079] - Rename proto messages for `toDF` and `withColumnsRenamed`
  • [SPARK-42080] - Add guideline for PySpark errors.
  • [SPARK-42082] - Introduce `PySparkValueError` and `PySparkTypeError`
  • [SPARK-42085] - Make `from_arrow_schema` support nested types
  • [SPARK-42089] - Different result in nested lambda function
  • [SPARK-42095] - Fix gRPC check in tests
  • [SPARK-42097] - Register SerializedLambda and BitSet to KryoSerializer
  • [SPARK-42099] - Make `count(*)` work correctly
  • [SPARK-42100] - Protect null `SQLExecutionUIData#description` in `SQLExecutionUIDataSerializer`
  • [SPARK-42119] - Add built-in table-valued functions inline and inline_outer
  • [SPARK-42120] - Add built-in table-valued function json_tuple
  • [SPARK-42121] - Add built-in table-valued functions posexplode and posexplode_outer
  • [SPARK-42122] - Add built-in table-valued function stack
  • [SPARK-42123] - Include column default values in DESCRIBE output for V1 tables
  • [SPARK-42124] - Scalar Inline Python UDF in Spark Connect
  • [SPARK-42130] - Handle null string values in AccumulableInfo and ProcessSummary
  • [SPARK-42137] - Enable spark.kryo.unsafe by default
  • [SPARK-42138] - Handle null string values in JobData/TaskDataWrapper/ExecutorStageSummaryWrapper
  • [SPARK-42139] - Handle null string values in SQLExecutionUIData/SQLPlanMetric/SparkPlanGraphWrapper
  • [SPARK-42140] - Handle null string values in ApplicationEnvironmentInfoWrapper/ApplicationInfoWrapper
  • [SPARK-42142] - Handle null string values in CachedQuantile/ExecutorSummary/PoolData
  • [SPARK-42143] - Handle null string values in RDDStorageInfo/RDDDataDistribution/RDDPartitionInfo
  • [SPARK-42144] - Handle null string values in StageData/StreamBlockData/StreamingQueryData
  • [SPARK-42146] - Refactor `Utils#setStringField` to make the Maven build pass when the sql module uses this method
  • [SPARK-42148] - Upgrade `kubernetes-client` to 6.4.0
  • [SPARK-42150] - Upgrade Volcano to 1.7.0
  • [SPARK-42153] - Handle null string values in PairStrings/RDDOperationNode/RDDOperationClusterWrapper
  • [SPARK-42154] - Enable Volcano unit tests and integration tests in GitHub Action
  • [SPARK-42164] - Register partitioned-table-related classes to KryoSerializer
  • [SPARK-42173] - IPv6 address mapping can fail with sparse addresses
  • [SPARK-42178] - Handle remaining null string values in ui protobuf serializer and add tests
  • [SPARK-42182] - Make `ReusedConnectTestCase` take Spark configurations
  • [SPARK-42187] - Avoid using RemoteSparkSession.builder.getOrCreate in tests
  • [SPARK-42190] - Support `local[*]` in `spark-submit` in K8s environment
  • [SPARK-42192] - Migrate the `TypeError` from `pyspark/sql/dataframe.py` into `PySparkTypeError`.
  • [SPARK-42197] - Reuse JVM initialization, and separate configuration groups to set in remote local mode
  • [SPARK-42210] - Standardize registered pickled Python UDFs
  • [SPARK-42213] - Failed to test ClientE2ETestSuite with maven
  • [SPARK-42217] - Support lateral column alias in queries with Window
  • [SPARK-42221] - Introduce a new conf for TimestampNTZ schema inference in JSON/CSV
  • [SPARK-42224] - Migrate `TypeError` into error framework for Spark Connect functions
  • [SPARK-42225] - Add `SparkConnectIllegalArgumentException` to handle Spark Connect error precisely.
  • [SPARK-42229] - Migrate SparkCoreErrors into error class
  • [SPARK-42231] - Rename error class: MISSING_STATIC_PARTITION_COLUMN
  • [SPARK-42232] - Rename error class: UNSUPPORTED_FEATURE.JDBC_TRANSACTION
  • [SPARK-42233] - Improve error message for PIVOT_AFTER_GROUP_BY
  • [SPARK-42234] - Rename error class: UNSUPPORTED_FEATURE.REPEATED_PIVOT
  • [SPARK-42236] - Refine `NULLABLE_ARRAY_OR_MAP_ELEMENT`
  • [SPARK-42238] - Introduce `INCOMPATIBLE_JOIN_TYPES`
  • [SPARK-42239] - Integrate MUST_AGGREGATE_CORRELATED_SCALAR_SUBQUERY
  • [SPARK-42243] - Use `spark.sql.inferTimestampNTZInDataSources.enabled` to infer timestamp type on partition columns
  • [SPARK-42244] - Refine error message by using Python types.
  • [SPARK-42249] - Refine HTML strings in error messages
  • [SPARK-42253] - Add test for detecting duplicated error class
  • [SPARK-42254] - Assign name to _LEGACY_ERROR_TEMP_1117
  • [SPARK-42255] - Assign name to _LEGACY_ERROR_TEMP_2430
  • [SPARK-42263] - Implement `spark.catalog.registerFunction`
  • [SPARK-42266] - Local mode should work with IPython
  • [SPARK-42267] - Support left_outer join
  • [SPARK-42268] - Add UserDefinedType in protos
  • [SPARK-42269] - Support complex return types in DDL strings
  • [SPARK-42271] - Reuse UDF test cases under `pyspark.sql.tests`
  • [SPARK-42272] - Use available ephemeral port for Spark Connect server in testing
  • [SPARK-42273] - Skip Spark Connect tests if dependencies are not installed
  • [SPARK-42275] - Avoid using built-in list, dict in static typing
  • [SPARK-42278] - DS V2 pushdown supports JDBC dialects compiling `SortOrder` by themselves
  • [SPARK-42281] - Update Debugging PySpark documents to show error message properly
  • [SPARK-42294] - Include column default values in DESCRIBE output for V2 tables
  • [SPARK-42295] - Tear down the test cleanly
  • [SPARK-42296] - Apply spark.sql.inferTimestampNTZInDataSources.enabled on JDBC data source
  • [SPARK-42297] - Assign name to _LEGACY_ERROR_TEMP_2412
  • [SPARK-42301] - Assign name to _LEGACY_ERROR_TEMP_1129
  • [SPARK-42302] - Assign name to _LEGACY_ERROR_TEMP_2135
  • [SPARK-42303] - Assign name to _LEGACY_ERROR_TEMP_1326
  • [SPARK-42305] - Assign name to _LEGACY_ERROR_TEMP_1229
  • [SPARK-42306] - Assign name to _LEGACY_ERROR_TEMP_1317
  • [SPARK-42310] - Assign name to _LEGACY_ERROR_TEMP_1289
  • [SPARK-42312] - Assign name to _LEGACY_ERROR_TEMP_0042
  • [SPARK-42313] - Assign name to _LEGACY_ERROR_TEMP_1152
  • [SPARK-42314] - Assign name to _LEGACY_ERROR_TEMP_2127
  • [SPARK-42315] - Assign name to _LEGACY_ERROR_TEMP_2092
  • [SPARK-42318] - Assign name to _LEGACY_ERROR_TEMP_2125
  • [SPARK-42319] - Assign name to _LEGACY_ERROR_TEMP_2123
  • [SPARK-42320] - Assign name to _LEGACY_ERROR_TEMP_2188
  • [SPARK-42324] - Assign name to _LEGACY_ERROR_TEMP_1001
  • [SPARK-42326] - Assign name to _LEGACY_ERROR_TEMP_2099
  • [SPARK-42327] - Assign name to _LEGACY_ERROR_TEMP_2177
  • [SPARK-42338] - Different exception in DataFrame.sample
  • [SPARK-42342] - Introduce base hierarchy to exceptions.
  • [SPARK-42343] - Ignore `IOException` in `handleBlockRemovalFailure` if SparkContext is stopped
  • [SPARK-42345] - Rename TimestampNTZ inference conf as spark.sql.sources.timestampNTZTypeInference.enabled
  • [SPARK-42348] - Add SQLSTATE
  • [SPARK-42357] - Log `exitCode` when `SparkContext.stop` starts
  • [SPARK-42363] - Remove session.register_udf
  • [SPARK-42367] - DataFrame.drop should handle duplicated columns properly
  • [SPARK-42371] - Add scripts to start and stop Spark Connect server
  • [SPARK-42378] - Make `DataFrame.select` support `a.*`
  • [SPARK-42381] - `CreateDataFrame` should accept objects
  • [SPARK-42402] - Support parameterized SQL by sql()
  • [SPARK-42408] - Register DoubleType to KryoSerializer
  • [SPARK-42419] - Migrate `TypeError` into error framework for Spark Connect column API.
  • [SPARK-42420] - Register WriteTaskResult, BasicWriteTaskStats, and ExecutedWriteSummary to KryoSerializer
  • [SPARK-42426] - insertInto fails when the column names are different from the table columns
  • [SPARK-42427] - Conv should return an error if the internal conversion overflows
  • [SPARK-42428] - Standardize __repr__ of CommonInlineUserDefinedFunction
  • [SPARK-42430] - Add documentation for TimestampNTZ type
  • [SPARK-42431] - Union should avoid calling `output` before analysis
  • [SPARK-42433] - Add `array_insert` to Connect
  • [SPARK-42434] - `array_append` should accept `Any` value
  • [SPARK-42455] - Rename JDBC option inferTimestampNTZType as preferTimestampNTZ
  • [SPARK-42458] - createDataFrame should support DDL string as schema
  • [SPARK-42459] - Create pyspark.sql.connect.utils to keep common codes
  • [SPARK-42468] - Implement agg by (String, String)*
  • [SPARK-42475] - Getting Started: Live Notebook for Spark Connect
  • [SPARK-42476] - Spark Connect API reference.
  • [SPARK-42481] - Implement agg.{max,min,mean,count,avg,sum}
  • [SPARK-42510] - Implement `DataFrame.mapInPandas`
  • [SPARK-42521] - Add NULL values for INSERT commands with user-specified lists of fewer columns than the target table
  • [SPARK-42522] - Fix DataFrameWriterV2 to find the default source
  • [SPARK-42524] - Upgrade numpy and pandas in the release Dockerfile
  • [SPARK-42532] - Update YuniKorn documentation with v1.2
  • [SPARK-42545] - Remove `experimental` from Volcano docs
  • [SPARK-42568] - SparkConnectStreamHandler should manage configs properly while creating plans.
  • [SPARK-42574] - DataFrame.toPandas should handle duplicated column names
  • [SPARK-42593] - Deprecate & remove the APIs that will be removed in pandas 2.0.
  • [SPARK-42609] - Add tests for grouping() and grouping_id() functions
  • [SPARK-42612] - Enable more parity tests related to functions
  • [SPARK-42630] - Make `parse_data_type` use new proto message `DDLParse`
  • [SPARK-42641] - Upgrade buf to v1.15.0
  • [SPARK-42643] - Register Java (aggregate) user-defined functions
  • [SPARK-42666] - Fix `createDataFrame` to work properly with rows and schema
  • [SPARK-42705] - SparkSession.sql doesn't return values from commands.
  • [SPARK-42707] - Remove experimental warning in developer documentation
  • [SPARK-42710] - Rename FrameMap proto to MapPartitions
  • [SPARK-42723] - Support parsing data type JSON "timestamp_ltz" as TimestampType
  • [SPARK-42724] - Upgrade buf to v1.15.1
  • [SPARK-42725] - Make LiteralExpression support array
  • [SPARK-42726] - Implement `DataFrame.mapInArrow`
  • [SPARK-42739] - Ensure release tag to be pushed to release branch
  • [SPARK-42861] - Review and fix issues in SQL API docs
  • [SPARK-42864] - Review and fix issues in MLlib API docs
  • [SPARK-42865] - Review and fix issues in Streaming API docs
  • [SPARK-42875] - Fix toPandas to handle timezone and map types properly.
  • [SPARK-42889] - Implement cache, persist, unpersist, and storageLevel
  • [SPARK-42893] - Block Arrow Python UDFs
  • [SPARK-42900] - Fix createDataFrame to respect both type inference and column names.
  • [SPARK-42920] - Python UDF with UDT
  • [SPARK-42983] - Fix the error message of createDataFrame from np.array(0)
  • [SPARK-42998] - Fix DataFrame.collect with null struct.
  • [SPARK-43011] - array_insert should fail with 0 index
  • [SPARK-43018] - Fix bug with timestamp literals
  • [SPARK-43085] - Fix bug in column DEFAULT assignment for target tables with multi-part names
  • [SPARK-44681] - Fix issue referencing github.com/apache/spark-connect-go as a Go library

Bug

  • [SPARK-8731] - Beeline doesn't work with -e option when started in background
  • [SPARK-28090] - Spark hangs when an execution plan has many projections on nested structs
  • [SPARK-33782] - Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode
  • [SPARK-34777] - [UI] StagePage input size/records not shown when records greater than zero
  • [SPARK-35084] - [k8s] On Spark 3, jars listed in spark.jars and spark.jars.packages are not added to sparkContext
  • [SPARK-35542] - Bucketizer created for multiple columns with parameters splitsArray, inputCols and outputCols cannot be loaded after saving it
  • [SPARK-35579] - Update to janino 3.1.7 to fix a bug
  • [SPARK-37259] - JDBC read is always going to wrap the query in a select statement
  • [SPARK-38404] - Spark does not find CTE inside nested CTE
  • [SPARK-38488] - Spark doc build does not work on macOS M1
  • [SPARK-38503] - Add warning for getAdditionalPreKubernetesResources on the executor side
  • [SPARK-38510] - Failure fetching JSON representation of Spark plans with Hive UDFs
  • [SPARK-38521] - Throw Exception if overwriting hive partition table with dynamic and staticPartitionOverwriteMode
  • [SPARK-38597] - Enable Spark on K8S integration tests
  • [SPARK-38613] - Fix RemoteBlockPushResolverSuite#testWritingPendingBufsIsAbortedImmediatelyDuringComplete
  • [SPARK-38614] - Don't push down limit through window that's using percent_rank
  • [SPARK-38708] - Upgrade Hive Metastore Client to the 3.1.3 for Hive 3.1
  • [SPARK-38717] - Handle Hive's bucket spec case preserving behaviour
  • [SPARK-38799] - Fix scala license declaration
  • [SPARK-38802] - Support spark.kubernetes.test.(driver|executor)RequestCores
  • [SPARK-38846] - Teradata's Number is converted to either its floor or ceiling value, discarding its fractional part
  • [SPARK-38870] - SparkSession.builder returns a new builder in Scala, but not in Python
  • [SPARK-38898] - Failed to build python docker images due to .cache not found
  • [SPARK-38918] - Nested column pruning should filter out attributes that do not belong to the current relation
  • [SPARK-38956] - Fix FAILED_EXECUTE_UDF test case on Java 17
  • [SPARK-38962] - Fix wrong computeStats at DataSourceV2Relation
  • [SPARK-38969] - Graceful decommissioning on Kubernetes fails / decom script error
  • [SPARK-38994] - Add a Python example of StreamingQueryListener
  • [SPARK-39015] - SparkRuntimeException when trying to get non-existent key in a map
  • [SPARK-39041] - Mapping Spark Query ResultSet/Schema to TRowSet/TTableSchema directly
  • [SPARK-39060] - Typo in error messages of decimal overflow
  • [SPARK-39079] - Catalog name should not contain dot
  • [SPARK-39104] - Null Pointer Exception on unpersist call
  • [SPARK-39184] - ArrayIndexOutOfBoundsException for some date/time sequences in some time-zones
  • [SPARK-39221] - sensitive information is not redacted correctly on thrift job/stage page
  • [SPARK-39242] - AwaitOffset does not wait correctly for at least the expected offset and RateStreamProvider test is flaky
  • [SPARK-39259] - Timestamps returned by now() and equivalent functions are not consistent in subqueries
  • [SPARK-39296] - Replace `Array.toString` with `Array.mkString`
  • [SPARK-39313] - V2ExpressionUtils.toCatalystOrdering should fail if V2Expression can not be translated
  • [SPARK-39338] - Remove dynamic pruning subquery if pruningKey's references is empty
  • [SPARK-39340] - DS v2 agg pushdown should allow dots in the name of top-level columns
  • [SPARK-39347] - Generate wrong time window when (timestamp-startTime) % slideDuration < 0
  • [SPARK-39354] - The analysis exception is incorrect
  • [SPARK-39355] - Single column uses quoted to construct UnresolvedAttribute
  • [SPARK-39391] - Reuse Partitioner Classes
  • [SPARK-39393] - Parquet data source only supports push-down predicate filters for non-repeated primitive types
  • [SPARK-39396] - Spark Thriftserver with LDAP enabled, error using beeline connection: error code 49 - invalid credentials
  • [SPARK-39399] - proxy-user not working for Spark on k8s in cluster deploy mode
  • [SPARK-39400] - spark-sql leaves the Hive resource download dir behind after exit
  • [SPARK-39401] - Replace withView with withTempView in CTEInlineSuite
  • [SPARK-39404] - Unable to query _metadata in streaming if getBatch returns multiple logical nodes in the DataFrame
  • [SPARK-39411] - Release candidates do not have the correct version for PySpark
  • [SPARK-39412] - IllegalStateException from connector does not work well with error class framework
  • [SPARK-39417] - Handle Null partition values in PartitioningUtils
  • [SPARK-39421] - Sphinx build fails with "node class 'meta' is already registered, its visitors will be overridden"
  • [SPARK-39427] - Disable ANSI intervals in the percentile functions
  • [SPARK-39437] - normalize plan id separately in PlanStabilitySuite
  • [SPARK-39444] - Add OptimizeSubqueries into nonExcludableRules list
  • [SPARK-39445] - Remove the window if windowExpressions is empty in column pruning
  • [SPARK-39447] - Only non-broadcast query stage can propagate empty relation
  • [SPARK-39448] - Add ReplaceCTERefWithRepartition into nonExcludableRules list
  • [SPARK-39476] - Disable unwrap-cast optimization when casting from Long to Float/Double or from Integer to Float
  • [SPARK-39493] - Update ORC to 1.7.5
  • [SPARK-39496] - Inline eval path cannot handle null structs
  • [SPARK-39505] - Escape log content rendered in UI
  • [SPARK-39543] - The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1
  • [SPARK-39547] - V2SessionCatalog should not throw NoSuchDatabaseException in loadNamespaceMetadata
  • [SPARK-39548] - CreateView command with a window clause query hits a wrong "window definition not found" issue
  • [SPARK-39551] - Add AQE invalid plan check
  • [SPARK-39570] - inline table should allow expressions with alias
  • [SPARK-39575] - ByteBuffer forget to rewind after get in AvroDeserializer
  • [SPARK-39582] - "Since <version>" docs on array_agg are incorrect
  • [SPARK-39596] - Running the `Linters, licenses, dependencies and documentation generation` GitHub Actions job failed
  • [SPARK-39601] - AllocationFailure should not be treated as exitCausedByApp when driver is shutting down
  • [SPARK-39612] - The DataFrame returned by exceptAll() can no longer perform operations such as count() or isEmpty() without throwing an exception
  • [SPARK-39614] - K8s pod name follows `DNS Subdomain Names` rule
  • [SPARK-39620] - History server page and API are using inconsistent conditions to filter running applications
  • [SPARK-39621] - Make run-tests.py robust by avoiding `rmtree` usage
  • [SPARK-39622] - ParquetIOSuite fails intermittently on master branch
  • [SPARK-39647] - Block push fails with java.lang.IllegalArgumentException: Active local dirs list has not been updated by any executor registration even when the NodeManager hasn't been restarted
  • [SPARK-39648] - Fix type hints of `like`, `rlike`, `ilike` of Column
  • [SPARK-39650] - Streaming Deduplication should not check the schema of "value"
  • [SPARK-39672] - NotExists subquery failed with conflicting attributes
  • [SPARK-39696] - Uncaught exception in thread executor-heartbeater java.util.ConcurrentModificationException: mutation occurred during iteration
  • [SPARK-39703] - Mima complains with Scala 2.13 in the master branch
  • [SPARK-39714] - Resolve pyspark mypy part tests.
  • [SPARK-39731] - Correctness issue when parsing dates with yyyyMMdd format in CSV and JSON
  • [SPARK-39743] - Unable to set zstd compression level while writing parquet files
  • [SPARK-39758] - NPE on invalid patterns from the regexp functions
  • [SPARK-39761] - Add Apache Spark images info in running-on-kubernetes doc
  • [SPARK-39775] - Regression due to AVRO-2035
  • [SPARK-39776] - Join's verbose string doesn't contain JoinType
  • [SPARK-39783] - Column backticks are misplaced in the AnalysisException [UNRESOLVED_COLUMN] error message when using field with "."
  • [SPARK-39829] - Upgrade log4j2 to 2.18.0
  • [SPARK-39830] - Add a test case to read ORC table that requires type promotion
  • [SPARK-39833] - Filtered parquet data frame count() and show() produce inconsistent results when spark.sql.parquet.filterPushdown is true
  • [SPARK-39835] - Fix EliminateSorts remove global sort below the local sort
  • [SPARK-39839] - Handle special case of null variable-length Decimal with non-zero offsetAndSize in UnsafeRow structural integrity check
  • [SPARK-39847] - Race condition related to interruption of task threads while they are in RocksDBLoader.loadLibrary()
  • [SPARK-39848] - Upgrade Kafka to 3.2.1
  • [SPARK-39857] - V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate
  • [SPARK-39867] - Global limit should not inherit OrderPreservingUnaryNode
  • [SPARK-39880] - V2 SHOW FUNCTIONS command should print qualified function name like v1
  • [SPARK-39887] - Expression transform error
  • [SPARK-39895] - pyspark drop doesn't accept *cols
  • [SPARK-39896] - The structural integrity of the plan is broken after UnwrapCastInBinaryComparison
  • [SPARK-39900] - Issue with querying dataframe produced by 'binaryFile' format using 'not' operator
  • [SPARK-39915] - Dataset.repartition(N) may not create N partitions
  • [SPARK-39932] - WindowExec should clear the final partition buffer
  • [SPARK-39936] - Spark View creation with hyphens in column-type names fails
  • [SPARK-39939] - shift() function needs to support periods=0
  • [SPARK-39940] - Batch query cannot read the updates from streaming query if streaming query writes to the catalog table via DSv1 sink
  • [SPARK-39943] - Upgrade rocksdbjni to 7.4.4
  • [SPARK-39945] - Upgrade sbt-mima-plugin to 1.1.0
  • [SPARK-39952] - SaveIntoDataSourceCommand should recache result relation
  • [SPARK-39962] - Global aggregation against pandas aggregate UDF does not take the column order into account
  • [SPARK-39974] - Create separate static image tag for infra cache
  • [SPARK-39976] - NULL check in ArrayIntersect adds extraneous null from first param
  • [SPARK-39980] - Change infra image to static tag
  • [SPARK-39981] - CheckOverflowInTableInsert returns exception rather than throwing it
  • [SPARK-39988] - LevelDBIterator not closed after use in `RemoteBlockPushResolver`, `YarnShuffleService` and `ExternalShuffleBlockResolver`
  • [SPARK-40002] - Limit improperly pushed down through window using ntile function
  • [SPARK-40036] - LevelDB/RocksDBIterator.next should return false after iterator or db close
  • [SPARK-40045] - The order of filtering predicates is not reasonable
  • [SPARK-40052] - Handle direct byte buffers in VectorizedDeltaBinaryPackedReader
  • [SPARK-40057] - Cleanup "<BLANKLINE>" in doctest
  • [SPARK-40079] - Add Imputer inputCols validation for empty input case
  • [SPARK-40089] - Sorting of at least Decimal(20, 2) fails for some values near the max.
  • [SPARK-40094] - Send TaskEnd event when task failed with NotSerializableException or TaskOutputFileAlreadyExistException to release executors for dynamic allocation
  • [SPARK-40096] - Finalize shuffle merge slow due to connection creation fails
  • [SPARK-40114] - Arrow 9.0.0 support with SparkR
  • [SPARK-40117] - Convert condition to java in DataFrameWriterV2.overwrite
  • [SPARK-40121] - Initialize projection used for Python UDF
  • [SPARK-40124] - Update TPCDS v1.4 q32 for Plan Stability tests
  • [SPARK-40132] - MultilayerPerceptronClassifier rawPredictionCol param missing from setParams
  • [SPARK-40134] - Update ORC to 1.7.6
  • [SPARK-40149] - Star expansion after outer join asymmetrically includes joining key
  • [SPARK-40151] - Fix return type for new median(interval) function
  • [SPARK-40152] - Codegen compilation error when using split_part
  • [SPARK-40156] - url_decode() exposes a Java error
  • [SPARK-40168] - Handle FileNotFoundException when shuffle file deleted in decommissioner
  • [SPARK-40169] - Fix the issue with Parquet column index and predicate pushdown in Data source V1
  • [SPARK-40202] - Allow a dictionary in SparkSession.config in PySpark
  • [SPARK-40212] - SparkSQL castPartValue does not properly handle byte & short
  • [SPARK-40218] - GROUPING SETS should preserve the grouping columns
  • [SPARK-40245] - Fix FileScan equality check when partition or data filter columns are not read
  • [SPARK-40247] - Fix BitSet equality check
  • [SPARK-40261] - DirectTaskResult meta should not be counted into result size
  • [SPARK-40270] - Make compute.max_rows as None work in DataFrame.style
  • [SPARK-40280] - Failure to create parquet predicate push down for ints and longs on some valid files
  • [SPARK-40295] - Allow v2 functions with literal args in write distribution and ordering
  • [SPARK-40297] - CTE outer reference nested in CTE main body cannot be resolved
  • [SPARK-40303] - The performance will be worse after codegen
  • [SPARK-40314] - Add inline Scala and Python bindings
  • [SPARK-40315] - Non-deterministic hashCode() calculations for ArrayBasedMapData on equal objects
  • [SPARK-40320] - When the Executor plugin fails to initialize, the Executor shows as active but never accepts tasks, as if it were hung
  • [SPARK-40322] - Fix all dead links
  • [SPARK-40323] - Update ORC to 1.8.0
  • [SPARK-40380] - Constant-folding of InvokeLike should not result in non-serializable result
  • [SPARK-40385] - Classes with companion object constructor fails interpreted path
  • [SPARK-40403] - Negative size in error message when unsafe array is too big
  • [SPARK-40407] - Repartition of DataFrame can result in severe data skew in some special case
  • [SPARK-40429] - Only set KeyGroupedPartitioning when the referenced column is in the output
  • [SPARK-40440] - Fix wrong reference and content in PS windows related doc
  • [SPARK-40460] - Streaming metrics is zero when select _metadata
  • [SPARK-40468] - Column pruning is not handled correctly in CSV when _corrupt_record is used
  • [SPARK-40470] - arrays_zip output unexpected alias column names when using GetMapValue and GetArrayStructFields
  • [SPARK-40480] - Remove push-based shuffle data after query finished
  • [SPARK-40482] - Revert SPARK-24544 Print actual failure cause when look up function failed
  • [SPARK-40492] - Perform maintenance of StateStore instances when they become inactive
  • [SPARK-40496] - Configs to control "enableDateTimeParsingFallback" are incorrectly swapped
  • [SPARK-40508] - Treat unknown partitioning as UnknownPartitioning
  • [SPARK-40521] - PartitionsAlreadyExistException in Hive V1 Command V1 reports all partitions instead of the conflicting partition
  • [SPARK-40535] - NPE from observe of collect_list
  • [SPARK-40562] - Add spark.sql.legacy.groupingIdWithAppendedUserGroupBy
  • [SPARK-40563] - Error in WHERE clause when a SQL CASE executes the ELSE branch
  • [SPARK-40565] - Non-deterministic filters shouldn't get pushed to V2 file sources
  • [SPARK-40583] - Documentation error in "Integration with Cloud Infrastructures"
  • [SPARK-40612] - On Kubernetes, for long-running apps, Spark uses an invalid principal to renew the delegation token
  • [SPARK-40617] - Assertion failed in ExecutorMetricsPoller "task count shouldn't below 0"
  • [SPARK-40618] - Bug in MergeScalarSubqueries rule attempting to merge nested subquery with parent
  • [SPARK-40622] - Result of a single task in collect() must fit in 2GB
  • [SPARK-40635] - Scala 2.12 + Hadoop 2 + JDK 8 Daily Test failed
  • [SPARK-40660] - Switch to XORShiftRandom to distribute elements
  • [SPARK-40670] - NPE in applyInPandasWithState when the input schema has "non-nullable" column(s)
  • [SPARK-40694] - Add permission for label GitHub Action job
  • [SPARK-40695] - Add permission for notify and status update job
  • [SPARK-40696] - Add permission for infra image
  • [SPARK-40703] - Performance regression for joins in Spark 3.3 vs Spark 3.2
  • [SPARK-40705] - Issue with spark converting Row to Json using Scala 2.13
  • [SPARK-40738] - spark-shell fails with "bad array subscript" in cygwin or msys bash session
  • [SPARK-40739] - "sbt packageBin" fails in cygwin or other windows bash session
  • [SPARK-40753] - Fix bug in test case for catalog directory operation
  • [SPARK-40771] - Estimated size in log message can overflow Int
  • [SPARK-40775] - V2 file scans have duplicative descriptions
  • [SPARK-40798] - Alter partition should verify value
  • [SPARK-40806] - Typo fix: CREATE TABLE -> REPLACE TABLE
  • [SPARK-40815] - SymlinkTextInputFormat returns incorrect result due to enabled spark.hadoopRDD.ignoreEmptySplits
  • [SPARK-40817] - Remote spark.jars URIs ignored for Spark on Kubernetes in cluster mode
  • [SPARK-40819] - Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type instead of automatically converting to LongType
  • [SPARK-40829] - STORED AS serde in CREATE TABLE LIKE view does not work
  • [SPARK-40838] - Upgrade infra base image to focal-20220922
  • [SPARK-40851] - TimestampFormatter behavior changed when using the latest Java 8/11/17
  • [SPARK-40858] - Cleanup github action warning
  • [SPARK-40867] - Flaky test ProtobufCatalystDataConversionSuite
  • [SPARK-40869] - KubernetesConf.getResourceNamePrefix creates invalid name prefixes
  • [SPARK-40874] - Fix broadcasts in Python UDFs when encryption is enabled
  • [SPARK-40901] - Unable to store Spark Driver logs with Absolute Hadoop based URI FS Path
  • [SPARK-40902] - Quick submission of drivers in tests to mesos scheduler results in dropping drivers
  • [SPARK-40906] - `Mode` should copy keys before inserting into Map
  • [SPARK-40907] - `PandasMode` should copy keys before inserting into Map
  • [SPARK-40924] - Unhex function works incorrectly when input has uneven number of symbols
  • [SPARK-40932] - Barrier: messages for allGather will be overridden by the following barrier APIs
  • [SPARK-40944] - Relax ordering constraint for CREATE TABLE column options
  • [SPARK-40963] - ExtractGenerator sets incorrect nullability in new Project
  • [SPARK-40969] - Unable to download spark 3.3.0 tarball after 3.3.1 release in spark-docker
  • [SPARK-40987] - Avoid creating a directory when deleting a block, causing DAGScheduler to not work
  • [SPARK-40999] - Hints on subqueries are not properly propagated
  • [SPARK-41003] - BHJ LeftAnti does not update numOutputRows when codegen is disabled
  • [SPARK-41007] - BigInteger Serialization doesn't work with JavaBean Encoder
  • [SPARK-41008] - Isotonic regression result differs from sklearn implementation
  • [SPARK-41015] - Failure of ProtobufCatalystDataConversionSuite.scala
  • [SPARK-41035] - Incorrect results or NPE when a literal is reused across distinct aggregations
  • [SPARK-41040] - Self-union streaming query may fail when using readStream.table
  • [SPARK-41047] - Remove legacy example of round function with negative scale
  • [SPARK-41049] - Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions
  • [SPARK-41056] - Fix new R_LIBS_SITE behavior introduced in R 4.2
  • [SPARK-41093] - Remove netty-tcnative-classes from Spark dependencyList
  • [SPARK-41118] - to_number/try_to_number throws NullPointerException when format is null
  • [SPARK-41136] - Shorten graceful shutdown time of ExecutorPodsSnapshotsStoreImpl to prevent blocking shutdown process
  • [SPARK-41144] - UnresolvedHint should not cause query failure
  • [SPARK-41149] - Fix `SparkSession.builder.config` to support bool
  • [SPARK-41151] - Keep built-in file _metadata column nullable value consistent
  • [SPARK-41154] - Incorrect relation caching for queries with time travel spec
  • [SPARK-41162] - Anti-join must not be pushed below aggregation with ambiguous predicates
  • [SPARK-41165] - Arrow collect should factor in failures
  • [SPARK-41177] - maven test `protobuf` module failed
  • [SPARK-41178] - fix parser rule precedence between JOIN and comma
  • [SPARK-41184] - Fill NA tests are flaky
  • [SPARK-41186] - Fix doctest for new mlflow version
  • [SPARK-41187] - [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happens
  • [SPARK-41188] - Set executorEnv OMP_NUM_THREADS to be spark.task.cpus by default for spark executor JVM processes
  • [SPARK-41189] - Add an environment to switch on and off namedtuple hack
  • [SPARK-41192] - Task finished before speculative task scheduled leads to holding idle executors
  • [SPARK-41193] - Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultCollectingSuite` as default
  • [SPARK-41198] - Streaming query metrics is broken with CTE
  • [SPARK-41199] - Streaming query metrics is broken with mixed-up usage of DSv1 streaming source and DSv2 streaming source
  • [SPARK-41219] - Regression in IntegralDivide returning null instead of 0
  • [SPARK-41254] - YarnAllocator.rpIdToYarnResource map is not properly updated
  • [SPARK-41261] - applyInPandasWithState can produce incorrect key value in user function for timed out state
  • [SPARK-41313] - AM shutdown hook fails with IllegalStateException if AM crashes on startup (recurrence of SPARK-3900)
  • [SPARK-41327] - Fix SparkStatusTracker.getExecutorInfos by switch On/OffHeapStorageMemory info
  • [SPARK-41339] - RocksDB state store WriteBatch doesn't clean up native memory
  • [SPARK-41344] - Reading V2 datasource masks underlying error
  • [SPARK-41350] - allow simple name access of using join hidden columns after subquery alias
  • [SPARK-41365] - Stages UI page fails to load for proxy in some yarn versions
  • [SPARK-41374] - Update ORC to 1.8.1
  • [SPARK-41375] - Avoid empty latest KafkaSourceOffset
  • [SPARK-41376] - Executor netty direct memory check should respect spark.shuffle.io.preferDirectBufs
  • [SPARK-41377] - Fix spark-version-info.properties not found on Windows
  • [SPARK-41379] - Inconsistency of spark session in DataFrame in user function for foreachBatch sink in PySpark
  • [SPARK-41385] - Replace deprecated `.newInstance()` in K8s module
  • [SPARK-41395] - InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data
  • [SPARK-41411] - Multi-Stateful Operator watermark support bug fix
  • [SPARK-41437] - Do not optimize the input query twice for v1 write fallback
  • [SPARK-41448] - Make consistent MR job IDs in FileBatchWriter and FileFormatWriter
  • [SPARK-41452] - to_char throws NullPointerException when format is null
  • [SPARK-41458] - Correctly transform the SPI services for Yarn Shuffle Service
  • [SPARK-41468] - Fix PlanExpression handling in EquivalentExpressions
  • [SPARK-41475] - Fix lint-scala command error
  • [SPARK-41522] - GA dependencies test failed
  • [SPARK-41535] - InterpretedUnsafeProjection and InterpretedMutableProjection can corrupt unsafe buffer when used with calendar interval data
  • [SPARK-41539] - stats and constraints in LogicalRDD may not be in sync with output attributes
  • [SPARK-41554] - Decimal.changePrecision produces ArrayIndexOutOfBoundsException
  • [SPARK-41668] - DECODE function returns wrong results when passed NULL
  • [SPARK-41683] - Spark UI: In jobs API, numActiveStages can be negative in some cases
  • [SPARK-41732] - Session window: analysis rule "SessionWindowing" does not apply tree-pattern based pruning
  • [SPARK-41733] - Session window: analysis rule "ResolveWindowTime" does not apply tree-pattern based pruning
  • [SPARK-41735] - Any SparkThrowable (with an error class) not in error-classes.json is masked in SQLExecution.withNewExecutionId and end-user will see "org.apache.spark.SparkException: [INTERNAL_ERROR]"
  • [SPARK-41741] - [SQL] ParquetFilters StringStartsWith push down matching string does not use UTF-8
  • [SPARK-41780] - `regexp_replace('', '[a\\\\d]{0, 2}', 'x')` causes an internal error
  • [SPARK-41790] - Set TRANSFORM reader and writer's format correctly
  • [SPARK-41792] - Shuffle merge finalization removes the wrong finalization state from the DB
  • [SPARK-41793] - Incorrect result for window frames defined by a range clause on large decimals
  • [SPARK-41804] - InterpretedUnsafeProjection doesn't properly handle an array of UDTs
  • [SPARK-41848] - Tasks are over-scheduled with TaskResourceProfile
  • [SPARK-41858] - Fix ORC reader perf regression due to DEFAULT value feature
  • [SPARK-41859] - CreateHiveTableAsSelectCommand should set the overwrite flag correctly
  • [SPARK-41894] - sql/core module mvn clean failed
  • [SPARK-41896] - Filtering by row_index always returns empty results
  • [SPARK-41912] - Subquery should not validate CTE
  • [SPARK-41914] - Sorting issue with partitioned-writing and planned write optimization disabled
  • [SPARK-41937] - SparkR datetime column compare with Sys.time() throws error in R (>= 4.2.0)
  • [SPARK-41947] - Update the contents of error class guidelines
  • [SPARK-41948] - Fix NPE for error classes: CANNOT_PARSE_JSON_FIELD
  • [SPARK-41952] - Upgrade Parquet to fix off-heap memory leaks in Zstd codec
  • [SPARK-41958] - Disallow arbitrary custom classpath with proxy user in cluster mode
  • [SPARK-41982] - When the inserted partition type is of string type, a value like `dt=01` will be converted to `dt=1`
  • [SPARK-41985] - Centralize more column resolution rules
  • [SPARK-41989] - PYARROW_IGNORE_TIMEZONE warning can break application logging setup
  • [SPARK-41990] - Filtering by composite field name like `field name` doesn't work with pushDownPredicate = true
  • [SPARK-41991] - Interpreted mode subexpression elimination can throw exception during insert
  • [SPARK-42046] - Add `connect-client-jvm` to connect module
  • [SPARK-42057] - Avoid losing exception info in Protobuf errors
  • [SPARK-42059] - Update ORC to 1.8.2
  • [SPARK-42061] - Mark Expressions that have state as stateful
  • [SPARK-42066] - The DATATYPE_MISMATCH error class contains inappropriate and duplicating subclasses
  • [SPARK-42084] - Avoid leaking the qualified-access-only restriction
  • [SPARK-42088] - Running python3 setup.py sdist on windows reports a permission error
  • [SPARK-42090] - Introduce sasl retry count in RetryingBlockTransferor
  • [SPARK-42109] - Upgrade Kafka to 3.3.2
  • [SPARK-42112] - Add null check before `ContinuousWriteRDD#compute` method close dataWriter
  • [SPARK-42113] - Upgrade pandas to 1.5.3
  • [SPARK-42115] - Push down limit through Python UDFs
  • [SPARK-42134] - Fix getPartitionFiltersAndDataFilters() to handle filters without referenced attributes
  • [SPARK-42156] - Support client-side retries in Spark Connect Python client
  • [SPARK-42157] - `spark.scheduler.mode=FAIR` should provide FAIR scheduler
  • [SPARK-42162] - Memory usage on executors increased drastically for a complex query with large number of addition operations
  • [SPARK-42163] - Schema pruning fails on non-foldable array index or map key
  • [SPARK-42171] - Fix `pyspark-errors` module and enable it in GitHub Action
  • [SPARK-42174] - Use scikit-learn instead of sklearn
  • [SPARK-42176] - Cast boolean to timestamp fails with ClassCastException
  • [SPARK-42177] - Change master to branch-3.4 in GitHub Actions
  • [SPARK-42186] - Make SparkR able to stop properly when the connection is timed-out
  • [SPARK-42196] - Typo in StreamingQuery.scala
  • [SPARK-42201] - `build/sbt` should allow SBT_OPTS to override JVM memory setting
  • [SPARK-42228] - connect-client-jvm module should shade and relocate grpc
  • [SPARK-42241] - Correct the condition for `SparkConnectServerUtils#findSparkConnectJar` to find the correct connect server jar for maven
  • [SPARK-42242] - Upgrade snappy-java to 1.1.9.1
  • [SPARK-42250] - predict_batch_udf with float fails when the batch consists of a single value
  • [SPARK-42259] - ResolveGroupingAnalytics should take care of Python UDAF
  • [SPARK-42274] - Upgrade `compress-lzf` to 1.1.2
  • [SPARK-42276] - Add ServicesResourceTransformer to connect server module shade configuration
  • [SPARK-42286] - Fix internal error for valid CASE WHEN expression with CAST when inserting into a table
  • [SPARK-42331] - Fix metadata col that cannot be resolved
  • [SPARK-42344] - The default value of CONFIG_MAP_MAXSIZE should not be greater than 1048576
  • [SPARK-42346] - distinct(count colname) with UNION ALL causes query analyzer bug
  • [SPARK-42384] - Mask function's generated code does not handle null input
  • [SPARK-42401] - Incorrect results or NPE when inserting null value into array using array_insert/array_append
  • [SPARK-42403] - JsonProtocol should handle null JSON strings
  • [SPARK-42406] - [PROTOBUF] Recursive field handling is incompatible with delta
  • [SPARK-42410] - Support Scala 2.12/2.13 tests in connect module
  • [SPARK-42416] - Dataset operations should not resolve the analyzed logical plan again
  • [SPARK-42444] - DataFrame.drop should handle multi columns properly
  • [SPARK-42445] - Fix SparkR install.spark function
  • [SPARK-42448] - Spark SQL shell prompts wrong database info
  • [SPARK-42462] - Prevent `docker-image-tool.sh` from publishing OCI manifests
  • [SPARK-42478] - Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory
  • [SPARK-42515] - ClientE2ETestSuite local test failed
  • [SPARK-42516] - Non-captured session time zone in view creation
  • [SPARK-42534] - Fix DB2 Limit clause
  • [SPARK-42547] - Make PySpark work with Python 3.7
  • [SPARK-42596] - [YARN] OMP_NUM_THREADS not set to number of executor cores by default
  • [SPARK-42600] - currentDatabase Shall use NamespaceHelper instead of MultipartIdentifierHelper
  • [SPARK-42608] - Use full column names for inner fields in resolution errors
  • [SPARK-42611] - Insert char/varchar length checks for inner fields during resolution
  • [SPARK-42616] - SparkSQLCLIDriver shall only close started hive sessionState
  • [SPARK-42655] - Incorrect ambiguous column reference error
  • [SPARK-42665] - `simple udf` test failed using Maven
  • [SPARK-42673] - Make build/mvn build Spark only with the verified maven version
  • [SPARK-42677] - Fix the invalid tests for broadcast hint
  • [SPARK-42697] - /api/v1/applications returns 0 for duration
  • [SPARK-42700] - Add h2 as test dependency of connect-server module
  • [SPARK-42709] - Do not rely on __file__
  • [SPARK-42851] - EquivalentExpressions methods need to be consistently guarded by supportedExpression
  • [SPARK-42928] - Make resolvePersistentFunction synchronized
  • [SPARK-42936] - Unresolved having at the end of analysis when using LCA with a having clause that can be resolved directly by its child Aggregate
  • [SPARK-42967] - Fix SparkListenerTaskStart.stageAttemptId when a task is started after the stage is cancelled
  • [SPARK-42971] - When processing the WorkDirCleanup event, if appDirs is empty, should print workdir
  • [SPARK-43041] - Restore constructors of exceptions for compatibility in connector API
  • [SPARK-43158] - Set upperbound of pandas version in binder integrations
  • [SPARK-43538] - Spark Homebrew Formulae currently depend on non-officially-supported Java 20

Epic

  • [SPARK-32082] - Project Zen: Improving Python usability
  • [SPARK-40653] - Protobuf Support in Structured Streaming
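
  A minimal, hypothetical sketch of the Protobuf support tracked by SPARK-40653 above (the from_protobuf/to_protobuf functions themselves appear under Improvement as SPARK-40654, SPARK-40655 and SPARK-40657). It assumes the spark-protobuf package is on the classpath and that ./events.desc is a compiled descriptor set containing an Event message with id and name fields; the file name, message name and field names are made up for illustration.

      # Hypothetical round trip through the Protobuf functions added in 3.4.
      # Assumption: ./events.desc is a compiled descriptor set with a message
      # named "Event" holding `id` (int64) and `name` (string) fields.
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import struct
      from pyspark.sql.protobuf.functions import from_protobuf, to_protobuf

      spark = SparkSession.builder.appName("protobuf-sketch").getOrCreate()

      df = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "name"])

      # Encode a struct column into Protobuf binary ...
      encoded = df.select(
          to_protobuf(struct("id", "name"), "Event", descFilePath="./events.desc").alias("value")
      )

      # ... and decode it back into a struct, as a streaming pipeline would.
      decoded = encoded.select(
          from_protobuf("value", "Event", descFilePath="./events.desc").alias("event")
      )
      decoded.select("event.id", "event.name").show()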

Story

  • [SPARK-40211] - Allow executeTake() / collectLimit's number of starting partitions to be customized

New Feature

  • [SPARK-27561] - Support "lateral column alias references" to allow column aliases to be used within SELECT clauses
  • [SPARK-30641] - Project Matrix: Linear Models revisit and refactor
  • [SPARK-35662] - Support Timestamp without time zone data type
  • [SPARK-37568] - Support 2 arguments in the convert_timezone() function
  • [SPARK-37671] - Support ANSI Aggregation Function of regression
  • [SPARK-38591] - Add sortWithinGroups to KeyValueGroupedDataset
  • [SPARK-38647] - Add SupportsReportOrdering mix in interface for Scan
  • [SPARK-38864] - Unpivot / melt function for Dataset API
  • [SPARK-38904] - Low cost DataFrame schema swap util
  • [SPARK-39057] - Offset could work without Limit
  • [SPARK-39071] - Add unwrap_udt function for unwrapping UserDefinedType columns
  • [SPARK-39159] - Add new Dataset API for Offset
  • [SPARK-39168] - Consider all values in a python list when inferring schema
  • [SPARK-39305] - Implement the EQUAL_NULL function
  • [SPARK-39306] - support scalar subquery in time travel
  • [SPARK-39320] - Add the MEDIAN() function
  • [SPARK-39457] - Support IPv6-only environment
  • [SPARK-39567] - Support ANSI intervals in the percentile functions
  • [SPARK-39618] - Add the REGEXP_COUNT function
  • [SPARK-39625] - add Dataset.to(StructType)
  • [SPARK-39695] - Add the REGEXP_SUBSTR function
  • [SPARK-39741] - Support url encode/decode as built-in function
  • [SPARK-39744] - Add the REGEXP_INSTR function
  • [SPARK-39808] - Support aggregate function MODE
  • [SPARK-39876] - Unpivot / melt function for SQL
  • [SPARK-39877] - Unpivot / melt function for PySpark (see the sketch after this list)
  • [SPARK-40003] - Add median to PySpark
  • [SPARK-40007] - Add Mode to PySpark
  • [SPARK-40015] - Add sc.listArchives and sc.listFiles to PySpark
  • [SPARK-40087] - Support multiple Column drop in R
  • [SPARK-40264] - Add helper function for DL model inference in pyspark.ml.functions
  • [SPARK-40281] - Memory Profiler on Executors
  • [SPARK-40530] - Add error-related developer APIs
  • [SPARK-40585] - Support double-quoted identifiers
  • [SPARK-40849] - Async log purge
  • [SPARK-40956] - SQL Equivalent for Dataframe overwrite command
  • [SPARK-40957] - Add in memory cache in HDFSMetadataLog
  • [SPARK-41183] - Add an extension API to do plan normalization for caching
  • [SPARK-41195] - Support PIVOT/UNPIVOT with join children
  • [SPARK-41271] - Parameterized SQL
  • [SPARK-41290] - Support GENERATED ALWAYS AS syntax in create/replace table to create a generated column
  • [SPARK-41323] - Support CURRENT_SCHEMA() as alias for CURRENT_DATABASE()
  • [SPARK-41378] - Support Column Stats in DS V2
  • [SPARK-41515] - PVC-oriented executor pod allocation
  • [SPARK-41635] - GROUP BY ALL
  • [SPARK-41637] - ORDER BY ALL
  • [SPARK-41666] - Support parameterized SQL in PySpark
  • [SPARK-42477] - python: accept user_agent in spark connect's connection string
  • [SPARK-42556] - Dataset.colregex should link a plan_id when it only matches a single column.
  • [SPARK-42610] - Add implicit encoders to SQLImplicits
  • [SPARK-42614] - Make all constructors private[sql]
  • [SPARK-42632] - Fix scala paths in tests
  • [SPARK-42637] - Add SparkSession.stop
  • [SPARK-42680] - Create the helper function withSQLConf for connect's test
  • [SPARK-42690] - Implement CSV/JSON parsing functions
  • [SPARK-42884] - Add Ammonite REPL support
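
  Several of the user-facing additions above compose naturally. The following is a minimal PySpark sketch on a toy in-memory DataFrame (all column and view names are made up) of unpivot/melt (SPARK-38864, SPARK-39877), the MEDIAN() aggregate (SPARK-39320) and GROUP BY ALL / ORDER BY ALL (SPARK-41635, SPARK-41637).

      # Toy data: one row per id with quarterly sales in separate columns.
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("spark-3-4-new-features").getOrCreate()

      wide = spark.createDataFrame(
          [(1, 10.0, 12.0), (2, 11.0, 14.0), (3, 9.0, 14.0)],
          ["id", "q1_sales", "q2_sales"],
      )

      # unpivot/melt (SPARK-39877): turn the quarter columns into (quarter, sales) rows.
      long_df = wide.unpivot(
          ids=["id"],
          values=["q1_sales", "q2_sales"],
          variableColumnName="quarter",
          valueColumnName="sales",
      )
      long_df.createOrReplaceTempView("sales_long")

      # GROUP BY ALL groups by every non-aggregate item in the SELECT list (SPARK-41635),
      # ORDER BY ALL orders by the whole SELECT list (SPARK-41637), and MEDIAN() is one
      # of the new aggregate functions (SPARK-39320).
      spark.sql("""
          SELECT quarter, median(sales) AS median_sales
          FROM sales_long
          GROUP BY ALL
          ORDER BY ALL
      """).show()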

Improvement

  • [SPARK-25050] - Handle more than two types in avro union types when writing avro files
  • [SPARK-29260] - Enable supported Hive metastore versions once it supports altering database location
  • [SPARK-32170] - Improve the speculation for the inefficient tasks by the task metrics.
  • [SPARK-33605] - Add gcs-connector to hadoop-cloud module
  • [SPARK-33753] - Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
  • [SPARK-34265] - Instrument Python UDF execution using SQL Metrics
  • [SPARK-34659] - Web UI does not correctly get appId
  • [SPARK-34927] - Support TPCDSQueryBenchmark in Benchmarks
  • [SPARK-35242] - Support change catalog default database for spark
  • [SPARK-35739] - [Spark SQL] Add Java-compatible Dataset.join overloads
  • [SPARK-35743] - Improve Parquet vectorized reader
  • [SPARK-36259] - Expose localtimestamp in pyspark.sql.functions
  • [SPARK-36462] - Allow Spark on Kube to operate without polling or watchers
  • [SPARK-36664] - Log time spent waiting for cluster resources
  • [SPARK-36837] - Upgrade Kafka to 3.1.0
  • [SPARK-37348] - PySpark pmod function
  • [SPARK-37523] - Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified
  • [SPARK-37825] - Make spark beeline be able to handle javaOpts
  • [SPARK-37956] - Add Java and Python examples to the Parquet encryption feature documentation
  • [SPARK-37961] - override maxRows/maxRowsPerPartition for some logical operators
  • [SPARK-37980] - Extend METADATA column to support row indices for file based data sources
  • [SPARK-38034] - Optimize time complexity and extend applicable cases for TransposeWindow
  • [SPARK-38098] - Add support for ArrayType of nested StructType to arrow-based conversion
  • [SPARK-38194] - Make memory overhead factor configurable
  • [SPARK-38277] - Clear write batch after RocksDB state store's commit
  • [SPARK-38334] - Implement support for DEFAULT values for columns in tables
  • [SPARK-38349] - No need to filter events when session window gapDuration is greater than 0
  • [SPARK-38522] - Strengthen the contract on iterator method in StateStore
  • [SPARK-38541] - Upgrade netty to 4.1.75
  • [SPARK-38545] - Upgrade scala-maven-plugin from 4.4.0 to 4.5.6
  • [SPARK-38555] - Avoid contention and get or create clientPools quickly in the TransportClientFactory
  • [SPARK-38564] - Support collecting metrics from streaming sinks
  • [SPARK-38568] - Upgrade ZSTD-JNI to 1.5.2-2
  • [SPARK-38569] - external top-level directory is problematic for bazel
  • [SPARK-38573] - Support Auto Partition Statistics Collection
  • [SPARK-38575] - Deduplicate branch specification in GitHub Actions workflow
  • [SPARK-38582] - Add KubernetesUtils.buildEnvVars(WithFieldRef)? utility functions
  • [SPARK-38584] - Unify the data validation
  • [SPARK-38585] - Simplify the code of TreeNode.clone()
  • [SPARK-38593] - Incorporate numRowsDroppedByWatermark metric from SessionWindowStateStoreRestoreExec into StateOperatorProgress
  • [SPARK-38594] - Change to use `NettyUtils` to create `EventLoop` and `ChannelClass` in RBackend
  • [SPARK-38611] - Use `assertThrows` instead of handwriting `intercept` method in `CatalogLoadingSuite`
  • [SPARK-38619] - Clean up Junit api usage in scalatest
  • [SPARK-38620] - Replace `value.formatted(formatString)` with `formatString.format(value)` to clean up compilation warning
  • [SPARK-38622] - Upgrade jersey to 2.35
  • [SPARK-38624] - Reduce UnsafeProjection.create call times when Percentile function serializes the aggregation buffer object
  • [SPARK-38635] - Remove duplicate log for spark ApplicationMaster
  • [SPARK-38641] - Get rid of invalid configuration elements in mvn_scalafmt in main pom.xml
  • [SPARK-38646] - Pull a trait out for Python functions
  • [SPARK-38660] - PySpark DeprecationWarning: distutils Version classes are deprecated
  • [SPARK-38661] - [TESTS] Replace 'abc & Symbol("abc") symbols with $"abc" in tests
  • [SPARK-38670] - Add offset commit time to streaming query listener
  • [SPARK-38671] - Publish snapshot from branch-3.3
  • [SPARK-38673] - Replace java assert with Junit api in Java UTs
  • [SPARK-38674] - Remove useless deduplicate in SubqueryBroadcastExec
  • [SPARK-38679] - Expose the number partitions in a stage to TaskContext
  • [SPARK-38683] - It is unnecessary to release the ShuffleManagedBufferIterator or ShuffleChunkManagedBufferIterator or ManagedBufferIterator buffers when the client channel's connection is terminated
  • [SPARK-38694] - Simplify Java UT code with Junit `assertThrows`
  • [SPARK-38711] - Refactor pyspark.sql.streaming module
  • [SPARK-38713] - Change spark.sessionstate.conf.getConf/setConf operation to spark.conf.get/set
  • [SPARK-38756] - Clean up useless security configs in `TransportConf`
  • [SPARK-38757] - Update the Oracle docker image version used for test and integration
  • [SPARK-38759] - Add StreamingQueryListener support in PySpark
  • [SPARK-38760] - Implement DataFrame.observe in PySpark
  • [SPARK-38767] - Support ignoreCorruptFiles and ignoreMissingFiles in Data Source options
  • [SPARK-38770] - Remove renameMainAppResource from baseDriverContainer
  • [SPARK-38772] - Formatting the log plan in AdaptiveSparkPlanExec
  • [SPARK-38779] - Unify the pushed operator checking between FileSource test suite and JDBC test suite
  • [SPARK-38797] - Runtime Filter support pushdown through window
  • [SPARK-38798] - Make `spark.file.transferTo` as an `ConfigEntry`
  • [SPARK-38803] - Set minio cpu to 250m (0.25) in K8s IT
  • [SPARK-38804] - Add StreamingQueryManager.removeListener in PySpark
  • [SPARK-38826] - dropFieldIfAllNull option does not work for empty JSON struct
  • [SPARK-38832] - Remove unnecessary distinct in aggregate expression by distinctKeys
  • [SPARK-38835] - Refactor FsHistoryProviderSuite to test rocks db
  • [SPARK-38836] - Increase the performance of ExpressionSet
  • [SPARK-38841] - Enable Bloom filter join by default
  • [SPARK-38847] - Introduce a `viewToSeq` function for `KVUtils`
  • [SPARK-38848] - Replace all `@Test(expected = XXException)` with assertThrows
  • [SPARK-38850] - Upgrade Kafka to 3.2.0
  • [SPARK-38851] - Refactor `HistoryServerSuite` to add UTs for RocksDB
  • [SPARK-38881] - PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that is already supported in the Scala/Java APIs
  • [SPARK-38885] - Upgrade netty to 4.1.76
  • [SPARK-38886] - Remove outer join if aggregate functions are duplicate agnostic on streamed side
  • [SPARK-38888] - Add `RocksDBProvider` similar to `LevelDBProvider`
  • [SPARK-38896] - Use tryWithResource to recycling KVStoreIterator
  • [SPARK-38909] - Encapsulate LevelDB used by ExternalShuffleBlockResolver and YarnShuffleService as LocalDB
  • [SPARK-38914] - Allow user to insert specified columns into insertable view
  • [SPARK-38921] - Use k8s-client to create queue resource in Volcano IT
  • [SPARK-38929] - Improve error messages for cast failures in ANSI
  • [SPARK-38940] - Test Series' anchor frame for in-place updates on Series
  • [SPARK-38966] - Fix CI for fork branches in-sync with upstream master
  • [SPARK-38968] - remove hadoopConf from KerberosConfDriverFeatureStep
  • [SPARK-38970] - Skip build-and-test workflow on forks when scheduled
  • [SPARK-38971] - Test anchor frame for in-place `Series.rename_axis`
  • [SPARK-38979] - Improve error log readability in OrcUtils.requestedColumnIds
  • [SPARK-38985] - Support sub-error-class for UNSUPPORTED_FEATURE et al
  • [SPARK-38999] - Refactor DataSourceScanExec code to
  • [SPARK-39002] - StringEndsWith/Contains support push down to Parquet so that we can leverage dictionary filter
  • [SPARK-39014] - Respect ignoreMissingFiles from Data Source options in InMemoryFileIndex
  • [SPARK-39016] - Fix compilation warnings related to "`enum` will become a keyword in Scala 3"
  • [SPARK-39038] - Skip reporting test results if triggering workflow was skipped
  • [SPARK-39042] - Use `Map.values()` instead of `Map.entrySet()` in scenarios that do not use `keys`
  • [SPARK-39050] - Convert UNSUPPORTED_OPERATION to UNSUPPORTED_FEATURE
  • [SPARK-39051] - Minor refactoring of `python/pyspark/sql/pandas/conversion.py`
  • [SPARK-39052] - Support Char in Literal.create
  • [SPARK-39062] - Add Standalone backend support for Stage Level Scheduling
  • [SPARK-39067] - Upgrade scala-maven-plugin to 4.6.1
  • [SPARK-39068] - Make thriftserver and sparksql-cli support in-memory catalog
  • [SPARK-39073] - Keep rowCount after hive table partition pruning if table only have hive statistics
  • [SPARK-39102] - Replace the usage of guava's Files.createTempDir() with java.nio.file.Files.createTempDirectory()
  • [SPARK-39111] - Mark overriden methods with `@override` annotation
  • [SPARK-39113] - rename self to cls in python/pyspark/mllib/clustering.py
  • [SPARK-39116] - Replace double negation in exists with forall
  • [SPARK-39119] - Upgrade to Hadoop 3.3.3
  • [SPARK-39123] - Upgrade `org.scalatestplus:mockito` to 3.2.12.0
  • [SPARK-39124] - Upgrade rocksdbjni to 7.1.2
  • [SPARK-39133] - Mention log level setting in PYSPARK_JVM_STACKTRACE_ENABLED
  • [SPARK-39134] - Add custom metric of skipped null values for stream join operator
  • [SPARK-39137] - Use slice instead of take and drop
  • [SPARK-39138] - Add ANSI general value specification and function -user
  • [SPARK-39146] - The singleton Jackson ObjectMapper should be preferred
  • [SPARK-39147] - Code simplification, use count() instead of filter().size, etc.
  • [SPARK-39152] - StreamCorruptedException cause job failure for disk persisted RDD
  • [SPARK-39156] - Remove ParquetLogRedirector usage from ParquetFileFormat
  • [SPARK-39160] - Remove workaround for ARROW-1948
  • [SPARK-39161] - Upgrade rocksdbjni to 7.2.2
  • [SPARK-39171] - Unify the Cast expression
  • [SPARK-39172] - Remove outer join if all output comes from the streamed side and the buffered side join keys are unique
  • [SPARK-39180] - Simplify the planning of limit and offset
  • [SPARK-39182] - Upgrade to Arrow 8.0.0
  • [SPARK-39186] - make skew consistent with pandas
  • [SPARK-39192] - make pandas-on-spark's kurt consistent with pandas
  • [SPARK-39196] - Replace getOrElse(null) with orNull
  • [SPARK-39204] - Replace `Utils.createTempDir` related methods with JavaUtils
  • [SPARK-39205] - Add `PANDAS API ON SPARK` label
  • [SPARK-39213] - Create ANY_VALUE aggregate function
  • [SPARK-39217] - Make DPP support pruning sides that have a Union
  • [SPARK-39225] - Support spark.history.fs.update.batchSize
  • [SPARK-39231] - Change to use `ConstantColumnVector` to store partition columns in `VectorizedParquetRecordReader`
  • [SPARK-39235] - Make Catalog API be compatible with 3-layer-namespace
  • [SPARK-39248] - Decimal divide much slower than multiply
  • [SPARK-39251] - Simplify MultiLike if remainPatterns is empty
  • [SPARK-39254] - Upgrade ZSTD-JNI to 1.5.2-3
  • [SPARK-39256] - Reduce multiple file attribute calls of JavaUtils#deleteRecursivelyUsingJavaIO
  • [SPARK-39260] - Use `Reader.getSchema` instead of `Reader.getTypes`
  • [SPARK-39261] - Improve newline formatting for error messages
  • [SPARK-39262] - Correct the behavior of creating DataFrame from an RDD
  • [SPARK-39266] - Cleanup unused spark.rpc.numRetries and spark.rpc.retry.wait configs
  • [SPARK-39267] - Clean up dsl unnecessary symbol
  • [SPARK-39277] - Make Optimizer extends SQLConfHelper
  • [SPARK-39282] - Replace If-Else branch with bitwise operators in roundNumberOfBytesToNearestWord
  • [SPARK-39295] - Improve documentation of pandas API support list.
  • [SPARK-39298] - Change to use `seq.indices` when constructing ranges
  • [SPARK-39299] - Series.autocorr use SQL.corr to avoid conversion to vector
  • [SPARK-39301] - Leverage LocalRelation in createDataFrame with Arrow optimization
  • [SPARK-39308] - Upgrade parquet to 1.12.3
  • [SPARK-39312] - Use Parquet in predicate for Spark In filter
  • [SPARK-39318] - Remove tpch-plan-stability WithStats golden files
  • [SPARK-39321] - Refactor TryCast to use RuntimeReplaceable
  • [SPARK-39323] - Hide empty `taskResourceAssignments` from INFO log
  • [SPARK-39325] - Improve MapOutputTracker convertMapStatuses performance
  • [SPARK-39332] - Upgrade RoaringBitmap to 0.9.28
  • [SPARK-39333] - Change to use `foreach` when `map` produce no result
  • [SPARK-39349] - Add a CheckError() method to SparkFunSuite
  • [SPARK-39368] - Move RewritePredicateSubquery into InjectRuntimeFilter
  • [SPARK-39374] - Improve error message for user specified column list
  • [SPARK-39377] - Normalize expr ids in ListQuery and Exists expressions
  • [SPARK-39381] - Make vectorized ORC columnar writer batch size configurable
  • [SPARK-39387] - Upgrade hive-storage-api to 2.7.3
  • [SPARK-39388] - Reuse orcSchema when push down Orc predicates
  • [SPARK-39390] - Hide and optimize `viewAcls`/`viewAclsGroups`/`modifyAcls`/`modifyAclsGroups` from INFO log
  • [SPARK-39392] - Refine ANSI error messages and remove 'To return NULL instead'
  • [SPARK-39397] - Relax AliasAwareOutputExpression to support alias with expression
  • [SPARK-39409] - Upgrade scala-maven-plugin to 4.6.2
  • [SPARK-39414] - Upgrade Scala to 2.12.16
  • [SPARK-39428] - use code block for `Coalesce Hints for SQL Queries`
  • [SPARK-39439] - Suppress error log for in-progress event log not found
  • [SPARK-39440] - Add a config to disable event timeline
  • [SPARK-39441] - Speed up DeduplicateRelations
  • [SPARK-39443] - Improve docstring of pyspark.sql.functions.col/first
  • [SPARK-39446] - Add relevance score for nDCG evaluation in MLLIB
  • [SPARK-39449] - Propagate empty relation through Window
  • [SPARK-39456] - Fix broken function links in the auto-generated pandas API support list documentation.
  • [SPARK-39466] - Clean `core/temp-secrets/` after executing `SecurityManagerSuite`
  • [SPARK-39469] - Infer date type for CSV schema inference
  • [SPARK-39488] - Simplify the error handling of TempResolvedColumn
  • [SPARK-39489] - Improve EventLoggingListener and ReplayListener performance by replacing Json4S ASTs with Jackson trees
  • [SPARK-39492] - Rework MISSING_COLUMN error class
  • [SPARK-39497] - Improve the analysis exception of missing map key column
  • [SPARK-39511] - Push limit 1 to right side if join type is LeftSemiOrAnti and join condition is empty
  • [SPARK-39512] - Document the Spark Docker container release process
  • [SPARK-39533] - Deprecate scoreLabelsWeight in BinaryClassificationMetrics
  • [SPARK-39534] - Series.argmax only needs single pass
  • [SPARK-39538] - Convert CaseInsensitiveStringMap#logger to static
  • [SPARK-39545] - Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance
  • [SPARK-39546] - Support ports definition in executor pod template
  • [SPARK-39564] - Expose the information of catalog table to the logical plan in streaming query
  • [SPARK-39576] - Support GitHub Actions generate benchmark results using Scala 2.13
  • [SPARK-39591] - SPIP: Asynchronous Offset Management in Structured Streaming
  • [SPARK-39595] - Upgrade rocksdbjni to 7.3.1
  • [SPARK-39599] - Upgrade maven to 3.8.6
  • [SPARK-39606] - Use child stats to estimate order operator
  • [SPARK-39613] - Upgrade shapeless to 2.3.9
  • [SPARK-39616] - Upgrade Breeze to 2.0
  • [SPARK-39626] - Upgrade RoaringBitmap from 0.9.28 to 0.9.30
  • [SPARK-39633] - Dataframe options for time travel via `timestampAsOf` should respect both formats of specifying timestamp
  • [SPARK-39635] - Custom driver metrics for Datasource v2
  • [SPARK-39636] - Fix multiple small bugs in JsonProtocol, impacting StorageLevel and Task/Executor resource requests
  • [SPARK-39638] - Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`
  • [SPARK-39651] - Prune filter condition if compare with rand is deterministic
  • [SPARK-39653] - Remove `ColumnVectorUtils#populate(WritableColumnVector, InternalRow, int) ` method
  • [SPARK-39657] - YARN AM client should call the non-static setTokensConf method
  • [SPARK-39661] - Avoid creating unnecessary SLF4J Logger
  • [SPARK-39662] - Upgrade HtmlUnit and its related artifacts from 2.50.0 to 2.62.0.
  • [SPARK-39666] - Use UnsafeProjection.create to respect `spark.sql.codegen.factoryMode` in ExpressionEncoder
  • [SPARK-39667] - Add another workaround when there is not enough memory to build and broadcast the table
  • [SPARK-39675] - Switch 'spark.sql.codegen.factoryMode' configuration from testing purpose to internal purpose
  • [SPARK-39676] - Add task partition id for Task assertEquals method in JsonProtocolSuite
  • [SPARK-39679] - TakeOrderedAndProjectExec should respect child output ordering
  • [SPARK-39689] - Support 2-chars lineSep in CSV datasource
  • [SPARK-39691] - Supplement `MapStatusesConvertBenchmark` result generated by Java 11 and 17
  • [SPARK-39693] - `tpcds-1g-gen` shouldn't execute if the benchmark GA run does not specify TPCDSQueryBenchmark
  • [SPARK-39694] - Update `${sbtProject}/test:runMain` to `${sbtProject}/Test/runMain`
  • [SPARK-39699] - Make CollapseProject smarter about collection creation expressions
  • [SPARK-39702] - Reduce memory overhead of TransportCipher$EncryptedMessage's byteRawChannel buffer
  • [SPARK-39706] - Set missing column with defaultValue as constant in `ParquetColumnVector`
  • [SPARK-39713] - ANSI mode: add suggestion of using try_element_at for INVALID_ARRAY_INDEX error
  • [SPARK-39724] - Remove duplicate `.setAccessible(true)` in `kvstore.KVTypeInfo`
  • [SPARK-39727] - Upgrade joda-time from 2.10.13 to 2.10.14
  • [SPARK-39728] - Test for parity of SQL functions between Python and JVM DataFrame API's
  • [SPARK-39733] - Add map_contains_key to pyspark.sql.functions
  • [SPARK-39734] - Add call_udf to pyspark.sql.functions
  • [SPARK-39739] - Upgrade sbt to 1.7.0
  • [SPARK-39748] - Include the origin logical plan for LogicalRDD if it comes from DataFrame
  • [SPARK-39749] - ANSI SQL mode: use plain string representation on casting Decimal to String
  • [SPARK-39751] - Better naming for hash aggregate key probing metric
  • [SPARK-39754] - Remove unused import or unnecessary {}
  • [SPARK-39755] - Improve LocalDirsFeatureStep to randomize local directories
  • [SPARK-39757] - Upgrade sbt from 1.7.0 to 1.7.1
  • [SPARK-39760] - Support Varchar in PySpark
  • [SPARK-39764] - Make PhysicalOperation the same as ScanOperation
  • [SPARK-39767] - Remove UnresolvedDBObjectName and add UnresolvedIdentifier
  • [SPARK-39784] - Put Literal values on the right side of the data source filter after translating Catalyst Expression to data source filter
  • [SPARK-39785] - Use setBufferedIo instead of withBufferedIo to cleanup log4j2 deprecated api usage
  • [SPARK-39789] - Remove unused method and redundant throw exception declare
  • [SPARK-39798] - Simplify `GenericArrayData` constructor implementation
  • [SPARK-39803] - Use commons-text LevenshteinDistance instead of commons-lang3 `StringUtils.getLevenshteinDistance`
  • [SPARK-39806] - Queries accessing METADATA struct crash on partitioned tables
  • [SPARK-39809] - Support CharType in PySpark
  • [SPARK-39812] - Simplify code to construct AggregateExpression with toAggregateExpression
  • [SPARK-39823] - add DataFrame.as(StructType) in PySpark
  • [SPARK-39831] - R dependencies installation start to fail after devtools_2.4.4 was released
  • [SPARK-39832] - regexp_replace should support column arguments
  • [SPARK-39834] - Include the origin stats and constraints for LogicalRDD if it comes from DataFrame
  • [SPARK-39840] - Factor PythonArrowInput out as a symmetry to PythonArrowOutput
  • [SPARK-39849] - Dataset.as(StructType) fills missing new columns with null value
  • [SPARK-39851] - Improve join stats estimation if one side can keep uniqueness
  • [SPARK-39853] - Support stage level schedule for standalone cluster when dynamic allocation is disabled
  • [SPARK-39858] - Remove unnecessary AliasHelper or PredicateHelper for some rules
  • [SPARK-39860] - More expressions should extend Predicate
  • [SPARK-39863] - Upgrade Hadoop to 3.3.4
  • [SPARK-39864] - ExecutionListenerManager's registration of the ExecutionListenerBus should be lazy
  • [SPARK-39868] - StageFailed event should attach with the root cause
  • [SPARK-39870] - Add flag to run-tests.py to retain the test output.
  • [SPARK-39872] - HeapByteBuffer#get(int) is a hotspot path when using BytePackerForLong#unpack8Values with ByteBuffer input API
  • [SPARK-39873] - Remove OptimizeLimitZero and merge it into EliminateLimits
  • [SPARK-39875] - The method in a final class should not be declared as protected
  • [SPARK-39879] - Reduce local-cluster memory configuration in BroadcastJoinSuite* and HiveSparkSubmitSuite
  • [SPARK-39881] - Python Lint does not actually check for `black` formatter
  • [SPARK-39882] - Upgrade rocksdbjni to 7.4.3
  • [SPARK-39883] - Add DataFrame function parity check
  • [SPARK-39890] - Make TakeOrderedAndProjectExec inherit AliasAwareOutputOrdering
  • [SPARK-39891] - Bump h2 to 2.1.214
  • [SPARK-39902] - Add Scan details to spark plan scan node in SparkUI
  • [SPARK-39904] - Rename inferDate to preferDate and fix an issue when inferring schema
  • [SPARK-39906] - Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'
  • [SPARK-39911] - Optimize global Sort to RepartitionByExpression
  • [SPARK-39912] - Refine CatalogImpl
  • [SPARK-39913] - Upgrade Arrow to 9.0.0
  • [SPARK-39925] - Add array_sort(column, comparator) overload to DataFrame operations
  • [SPARK-39944] - Upgrade dropwizard metrics to 4.2.10
  • [SPARK-39947] - Upgrade jersey to 2.36
  • [SPARK-39948] - Exclude hive-vector-code-gen dependency
  • [SPARK-39951] - Support columnar batches with nested fields in Parquet V2
  • [SPARK-39954] - Upgrade ASM to 9.3
  • [SPARK-39955] - Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages
  • [SPARK-39957] - Delay onDisconnected to enable Driver receives ExecutorExitCode
  • [SPARK-39958] - Add warning log when unable to load custom metric object
  • [SPARK-39960] - Upgrade mysql-connector-java to 8.0.30
  • [SPARK-39963] - Simplify the implementation of SimplifyCasts
  • [SPARK-39973] - Avoid noisy warnings logs when spark.scheduler.listenerbus.metrics.maxListenerClassesTimed = 0
  • [SPARK-39975] - Upgrade rocksdbjni to 7.4.5
  • [SPARK-39977] - Remove unnecessary guava exclusion from jackson-module-scala
  • [SPARK-39982] - StructType.fromJson method missing documentation
  • [SPARK-39983] - Should not cache unserialized broadcast relations on the driver
  • [SPARK-39986] - Better example for Co-grouped Map
  • [SPARK-39989] - Support estimate column statistics if it is foldable expression
  • [SPARK-39991] - AQE should use available column statistics from completed query stages
  • [SPARK-40004] - Redundant `LevelDB.get` in `RemoteBlockPushResolver`
  • [SPARK-40009] - Add missing doc string info to DataFrame API
  • [SPARK-40019] - Refactor comment of ArrayType
  • [SPARK-40020] - centralize the code of qualifying identifiers in SessionCatalog
  • [SPARK-40022] - YarnClusterSuite should not be ABORTED when there is no Python3 environment
  • [SPARK-40030] - Upgrade scala-maven-plugin to 4.7.1
  • [SPARK-40033] - Nested schema pruning support through element_at
  • [SPARK-40039] - Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
  • [SPARK-40040] - Push local limit to both sides if join condition is empty
  • [SPARK-40050] - Enhance EliminateSorts to support removing sorts via LocalLimit
  • [SPARK-40053] - HiveExternalCatalogVersionsSuite will test all Spark versions and abort when Python 2.7 is used
  • [SPARK-40056] - Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
  • [SPARK-40058] - Avoid filter twice in HadoopFSUtils
  • [SPARK-40067] - Add table name to Spark plan node in SparkUI
  • [SPARK-40071] - Update plugins to latest versions
  • [SPARK-40072] - MAVEN_OPTS in make-distributions.sh is different from one specified in pom.xml
  • [SPARK-40073] - Should Use `connector/${moduleName}` instead of `external/${moduleName}`
  • [SPARK-40084] - Upgrade Py4J from 0.10.9.5 to 0.10.9.7
  • [SPARK-40085] - use INTERNAL_ERROR error class instead of IllegalStateException to indicate bugs
  • [SPARK-40086] - Improve AliasAwareOutputPartitioning to take all aliases into account
  • [SPARK-40095] - sc.uiWebUrl should not throw exception when webui is disabled
  • [SPARK-40105] - Improve repartition in ReplaceCTERefWithRepartition
  • [SPARK-40106] - Task failure handlers should always run if the task failed
  • [SPARK-40112] - Improve the TO_BINARY() function
  • [SPARK-40113] - Refactor ParquetScanBuilder DataSourceV2 interface implementation
  • [SPARK-40128] - Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader
  • [SPARK-40145] - Create infra image when cutting down branches
  • [SPARK-40146] - Simplify the codegen of getting map value
  • [SPARK-40153] - Unify the logic of resolve functions and table-valued functions
  • [SPARK-40162] - Upgrade RoaringBitmap from 0.9.30 to 0.9.31
  • [SPARK-40163] - [SPARK][SQL] feat: SparkSession.config(Map)
  • [SPARK-40165] - Update test plugins to latest versions
  • [SPARK-40166] - Add array_sort(column, comparator) to PySpark
  • [SPARK-40167] - Add array_sort(column, comparator) to SparkR
  • [SPARK-40175] - Converting Tuple2 to Scala Map via `.toMap` is slow
  • [SPARK-40185] - Remove column suggestion when the candidate list is empty for unresolved column/attribute/map key
  • [SPARK-40192] - Remove redundant groupby
  • [SPARK-40194] - SPLIT function on empty regex should truncate trailing empty string.
  • [SPARK-40197] - Replace query plan with context for MULTI_VALUE_SUBQUERY_ERROR
  • [SPARK-40207] - Specify the column name when the data type is not supported by datasource
  • [SPARK-40214] - Add `get` to dataframe functions
  • [SPARK-40215] - Add SQL configs to control CSV/JSON date and timestamp parsing behaviour
  • [SPARK-40216] - Extract common `prepareWrite` method for `ParquetFileFormat` and `ParquetWrite` to eliminate duplicate code
  • [SPARK-40219] - resolved view plan should hold the schema to avoid redundant lookup
  • [SPARK-40224] - Make ObjectHashAggregateExec release memory eagerly when fallback to sort-based
  • [SPARK-40225] - PySpark rdd.takeOrdered should check num and numPartitions
  • [SPARK-40228] - Don't simplify multiLike if child is not attribute
  • [SPARK-40234] - Clean only MDC items set by Spark
  • [SPARK-40235] - Use interruptible lock instead of synchronized in Executor.updateDependencies()
  • [SPARK-40239] - Remove duplicated 'fraction' validation in RDD.sample
  • [SPARK-40240] - PySpark rdd.takeSample should validate `num > maxSampleSize` at first
  • [SPARK-40241] - Correct the link of GenericUDTF
  • [SPARK-40243] - Enhance Hive UDF support documentation
  • [SPARK-40248] - Use larger number of bits to build bloom filter
  • [SPARK-40251] - Upgrade dev.ludovic.netlib from 2.2.1 to 3.0.2
  • [SPARK-40252] - Replace `Stream.collect(Collectors.joining(delimiter))` with `StringJoiner` Api
  • [SPARK-40254] - Upgrade netty from 4.1.77 to 4.1.80
  • [SPARK-40256] - Switch base image from openjdk to eclipse-temurin
  • [SPARK-40276] - reduce the result size of RDD.takeOrdered
  • [SPARK-40283] - Update mima's previousSparkVersion to 3.3.0
  • [SPARK-40285] - Simplify the roundTo[Numeric] for Decimal
  • [SPARK-40293] - Make the V2 table error message more meaningful
  • [SPARK-40301] - Add parameter validation in pyspark.rdd
  • [SPARK-40308] - str_to_map should accept non-foldable delimiter arguments
  • [SPARK-40311] - Introduce withColumnsRenamed
  • [SPARK-40312] - Add missing configuration documentation in Spark History Server
  • [SPARK-40321] - Upgrade rocksdbjni to 7.5.3
  • [SPARK-40352] - Add function aliases: len, datepart, dateadd, date_diff and curdate
  • [SPARK-40360] - Convert some DDL exception to new error framework
  • [SPARK-40365] - Bump ANTLR runtime version from 4.8 to 4.9.3
  • [SPARK-40376] - `np.bool` will be deprecated
  • [SPARK-40382] - Reduce projections in Expand when multiple distinct aggregations have semantically equivalent children
  • [SPARK-40383] - Pin mypy ==0.920 in dev/requirements.txt
  • [SPARK-40387] - Improve the implementation of Spark Decimal
  • [SPARK-40396] - Update scalatest and scalatestplus to use latest version
  • [SPARK-40397] - Migrate selenium-java from 3.1 to 4.2 and upgrade org.scalatestplus:selenium to 3.2.13.0
  • [SPARK-40398] - Use Loop instead of Arrays.stream api
  • [SPARK-40401] - Remove the support of deprecated `spark.akka.*` config
  • [SPARK-40404] - Fix the wrong description related to `spark.shuffle.service.db` in the document
  • [SPARK-40406] - The default logging should go to stderr
  • [SPARK-40411] - Refactor FlatMapGroupsWithStateExec to have a parent trait
  • [SPARK-40414] - Fix PythonArrowInput and PythonArrowOutput to be more generic to handle complicated type/data
  • [SPARK-40419] - Integrate Grouped Aggregate Pandas UDFs into *.sql test cases
  • [SPARK-40424] - Refactor ChromeUIHistoryServerSuite to test rocksdb
  • [SPARK-40425] - DROP TABLE does not need to do table lookup
  • [SPARK-40428] - Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources during abnormal shutdown
  • [SPARK-40436] - Upgrade Scala to 2.12.17
  • [SPARK-40456] - PartitionIterator.hasNext should be cheap to call repeatedly
  • [SPARK-40463] - Update gpg's keyserver
  • [SPARK-40466] - Improve the error message if the DSv2 source is disabled but DSv1 streaming source is not available
  • [SPARK-40471] - Upgrade RoaringBitmap to 0.9.32
  • [SPARK-40474] - Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps
  • [SPARK-40476] - Reduce the shuffle size of ALS
  • [SPARK-40478] - Add create datasource table options docs
  • [SPARK-40484] - Upgrade log4j2 to 2.19.0
  • [SPARK-40487] - Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel
  • [SPARK-40488] - Do not wrap exceptions thrown in FileFormatWriter.write with SparkException
  • [SPARK-40490] - `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload after SPARK-17321
  • [SPARK-40494] - Optimize the performance of `keys.zipWithIndex.toMap` code pattern
  • [SPARK-40500] - Use `pd.items` instead of `pd.iteritems`
  • [SPARK-40501] - Add PushProjectionThroughLimit for Optimizer
  • [SPARK-40511] - Upgrade slf4j to 2.x
  • [SPARK-40527] - Keep struct field names or map keys in CreateStruct
  • [SPARK-40531] - Upgrade zstd-jni from 1.5.2-3 to 1.5.2-4
  • [SPARK-40544] - The file size of `sql/hive/target/unit-tests.log` is too big
  • [SPARK-40545] - SparkSQLEnvSuite failed to clean the `spark_derby` directory after execution
  • [SPARK-40547] - Fix dead links in sparkr-vignettes.Rmd
  • [SPARK-40548] - Upgrade rocksdbjni from 7.5.3 to 7.6.0
  • [SPARK-40556] - Unpersist the intermediate datasets cached in AttachDistributedSequenceExec
  • [SPARK-40574] - Add PURGE to DROP TABLE doc
  • [SPARK-40575] - Add badges for PySpark downloads
  • [SPARK-40595] - Improve error message for unused CTE relations
  • [SPARK-40599] - Add multiTransform methods to TreeNode to generate alternatives
  • [SPARK-40601] - Improve error when cogrouping groups with mismatching key sizes
  • [SPARK-40604] - Verify the temporary column names in PS
  • [SPARK-40606] - Eliminate `to_pandas` warnings in test
  • [SPARK-40607] - Remove redundant string interpolator operations
  • [SPARK-40611] - Improve the performance for setInterval & getInterval of UnsafeRow
  • [SPARK-40619] - HivePartitionFilteringSuites test aborted due to `java.lang.OutOfMemoryError: Metaspace`
  • [SPARK-40620] - Deduplication of WorkerOffer build in CoarseGrainedSchedulerBackend
  • [SPARK-40628] - Do not push complex left semi/anti join condition through project
  • [SPARK-40633] - Upgrade janino to 3.1.9
  • [SPARK-40634] - Upgrade jodatime to 2.11.2
  • [SPARK-40639] - Upgrade sbt from 1.7.1 to 1.7.2
  • [SPARK-40640] - SparkHadoopUtil to set origin of hadoop/hive config options
  • [SPARK-40646] - Fix returning partial results in JSON data source and JSON functions
  • [SPARK-40648] - Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module
  • [SPARK-40654] - Protobuf support MVP with descriptor files
  • [SPARK-40655] - Protobuf functions in Python
  • [SPARK-40657] - Add support for compiled classes (Java classes)
  • [SPARK-40661] - Upgrade `jetty-http` from 9.4.48.v20220622 to 9.4.49.v20220914
  • [SPARK-40667] - Refactor File Data Source Options
  • [SPARK-40675] - Supplement missing spark configuration in documentation
  • [SPARK-40676] - Upgrade scalatest related test dependencies to 3.2.14
  • [SPARK-40697] - Add read-side char/varchar handling to cover external data files
  • [SPARK-40711] - Add spill size metrics for window
  • [SPARK-40712] - upgrade sbt-assembly plugin to 1.2.0
  • [SPARK-40724] - Simplify `corr` with method `inline`
  • [SPARK-40725] - Add mypy-protobuf to requirements
  • [SPARK-40728] - Upgrade ASM to 9.4
  • [SPARK-40735] - Consistently invoke bash with /usr/bin/env bash in scripts to make code more portable
  • [SPARK-40740] - Improve listFunctions in SessionCatalog
  • [SPARK-40742] - Java compilation warnings related to generic type
  • [SPARK-40745] - Reduce the shuffle size of ALS in mllib
  • [SPARK-40765] - Optimize redundant fs operations in `CommandUtils#calculateSingleLocationSize#getPathSize` method
  • [SPARK-40766] - Upgrade the guava defined in `plugins.sbt` to `31.0.1-jre`
  • [SPARK-40772] - Improve spark.sql.adaptive.skewJoin.skewedPartitionFactor to support float values
  • [SPARK-40776] - Add documentation (similar to Avro functions).
  • [SPARK-40777] - Use error classes for Protobuf exceptions
  • [SPARK-40778] - Make HeartbeatReceiver as an IsolatedRpcEndpoint
  • [SPARK-40782] - Upgrade Jackson-databind to 2.13.4.1
  • [SPARK-40794] - Upgrade Netty from 4.1.80 to 4.1.84
  • [SPARK-40795] - Exclude redundant jars from spark-protobuf-assembly jar
  • [SPARK-40797] - Force grouped import onto single line with Scalafmt
  • [SPARK-40803] - LZ4CompressionCodec looks up configuration on each stream creation
  • [SPARK-40821] - Introduce window_time function to extract event time from the window column
  • [SPARK-40826] - Add additional checkpoint rename file check
  • [SPARK-40834] - Use SparkListenerSQLExecutionEnd to track final SQL status in UI
  • [SPARK-40843] - Clean up deprecated api usage in SparkThrowableSuite
  • [SPARK-40846] - GA test failed with Java 8u352
  • [SPARK-40853] - Pin mypy-protobuf==3.3.0
  • [SPARK-40863] - Upgrade dropwizard metrics from 4.2.10 to 4.2.12
  • [SPARK-40865] - Upgrade jodatime to 2.12.0
  • [SPARK-40886] - Bump Jackson Databind 2.13.4.2
  • [SPARK-40892] - Loosen the requirement of window_time rule - allow multiple window_time calls
  • [SPARK-40895] - Upgrade Arrow to 10.0.0
  • [SPARK-40897] - Add missing PySpark APIs to References
  • [SPARK-40904] - Support zsh in K8s `entrypoint.sh`
  • [SPARK-40905] - Upgrade rocksdbjni to 7.7.3
  • [SPARK-40913] - Pin `pytest==7.1.3`
  • [SPARK-40919] - Bad case of `AnalysisTest#assertAnalysisErrorClass` when `expectedMessageParameters.size between [2, 4]`
  • [SPARK-40921] - Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO command
  • [SPARK-40925] - Fix late record filtering to support chaining of stateful operators
  • [SPARK-40935] - Upgrade ZSTD-JNI to 1.5.2-5
  • [SPARK-40936] - Refactor `AnalysisTest#assertAnalysisErrorClass` by reusing the `SparkFunSuite#checkError`
  • [SPARK-40940] - Fix the unsupported ops checker to allow chaining of stateful operators
  • [SPARK-40943] - Make MSCK optional in MSCK REPAIR TABLE commands
  • [SPARK-40950] - isRemoteAddressMaxedOut performance overhead on scala 2.13
  • [SPARK-40976] - Upgrade sbt to 1.7.3
  • [SPARK-40985] - Upgrade RoaringBitmap to 0.9.35
  • [SPARK-40991] - Update cloudpickle to v2.2.0
  • [SPARK-40996] - Upgrade `sbt-checkstyle-plugin` to 4.0.0
  • [SPARK-41017] - Support column pruning with multiple nondeterministic Filters
  • [SPARK-41023] - Upgrade Jackson to 2.14.0
  • [SPARK-41024] - Upgrade scala-maven-plugin to 4.7.2
  • [SPARK-41029] - Optimize the use of `GenericArrayData` constructor for Scala 2.13
  • [SPARK-41031] - Upgrade `org.tukaani:xz` to 1.9
  • [SPARK-41039] - Upgrade `scala-parallel-collections` to 1.0.4 for Scala 2.13
  • [SPARK-41045] - Pre-compute to eliminate ScalaReflection calls after deserializer is created
  • [SPARK-41048] - Improve output partitioning and ordering with AQE cache
  • [SPARK-41050] - Upgrade scalafmt from 3.5.9 to 3.6.1
  • [SPARK-41051] - Optimize ProcfsMetrics file acquisition
  • [SPARK-41071] - Metaspace OOM when running dev/make-distribution.sh locally
  • [SPARK-41074] - Add option `--upgrade` in dependency installation command
  • [SPARK-41087] - Make `build/mvn` use the same JAVA_OPTS as `dev/make-distribution.sh`
  • [SPARK-41089] - Relocate Netty native arm64 libs
  • [SPARK-41090] - Enhance Dataset.createTempView testing coverage for db_name.view_name
  • [SPARK-41092] - Do not use identifier to match interval units
  • [SPARK-41096] - Support reading parquet FIXED_LEN_BYTE_ARRAY type
  • [SPARK-41097] - Remove redundant collection conversion for Scala 2.13
  • [SPARK-41106] - Reduce collection conversion when create AttributeMap
  • [SPARK-41112] - RuntimeFilter should apply ColumnPruning eagerly with in-subquery filter
  • [SPARK-41113] - Upgrade sbt to 1.8.0
  • [SPARK-41120] - Upgrade joda-time from 2.12.0 to 2.12.1
  • [SPARK-41121] - Upgrade sbt-assembly from 1.2.0 to 2.0.0
  • [SPARK-41123] - Upgrade mysql-connector-java from 8.0.30 to 8.0.31
  • [SPARK-41126] - `entrypoint.sh` should use its WORKDIR instead of `/tmp` directory
  • [SPARK-41134] - improve error message of internal errors
  • [SPARK-41153] - Log migrated shuffle data size and migration time
  • [SPARK-41155] - Add error message to SchemaColumnConvertNotSupportedException
  • [SPARK-41161] - Upgrade `scala-parser-combinators` to 2.1.1
  • [SPARK-41167] - Optimize LikeSimplification rule to improve multi like performance
  • [SPARK-41194] - Add log4j2.properties for testing to the protobuf module
  • [SPARK-41197] - Upgrade Kafka to 3.3.1
  • [SPARK-41209] - Improve PySpark type inference in _merge_type method
  • [SPARK-41211] - Upgrade ZooKeeper to 3.6.3
  • [SPARK-41223] - Upgrade slf4j to 2.0.4
  • [SPARK-41226] - Refactor Spark types by introducing physical types
  • [SPARK-41239] - Upgrade Jackson to 2.14.1
  • [SPARK-41248] - Add config flag to control the behavior of JSON partial results parsing in SPARK-40646
  • [SPARK-41251] - Upgrade pandas from 1.5.1 to 1.5.2
  • [SPARK-41252] - Upgrade arrow from 10.0.0 to 10.0.1
  • [SPARK-41260] - Cast NumPy instances to Python primitive types in GroupState update
  • [SPARK-41267] - Add unpivot / melt to SparkR
  • [SPARK-41273] - Update plugins to latest versions
  • [SPARK-41275] - Upgrade pickle to 1.3
  • [SPARK-41276] - Optimize constructor use of `StructType`
  • [SPARK-41316] - Add @tailrec wherever possible
  • [SPARK-41338] - resolve outer references and normal columns in the same analyzer batch
  • [SPARK-41355] - Workaround hive table name validation issue
  • [SPARK-41360] - Avoid BlockManager re-registration if the executor has been lost
  • [SPARK-41369] - Refactor connect directory structure
  • [SPARK-41373] - Rename CAST_WITH_FUN_SUGGESTION to CAST_WITH_FUNC_SUGGESTION
  • [SPARK-41387] - Add assertion on end offset range for Kafka data source with Trigger.AvailableNow
  • [SPARK-41390] - Update the script used to generate register function in UDFRegistration
  • [SPARK-41393] - Upgrade slf4j to 2.0.5
  • [SPARK-41402] - Override nodeName of StringDecode
  • [SPARK-41404] - Support `ColumnarBatchSuite#testRandomRows` to test more primitive dataType
  • [SPARK-41405] - centralize the column resolution logic
  • [SPARK-41408] - Upgrade scala-maven-plugin to 4.8.0
  • [SPARK-41442] - Only update SQLMetric value if merging with valid metric
  • [SPARK-41447] - Reduce the number of doMergeApplicationListing invocations
  • [SPARK-41450] - PySpark built from master code raise error "java.lang.ClassNotFoundException: org.eclipse.jetty.server.Handler"
  • [SPARK-41454] - Support Python 3.11
  • [SPARK-41456] - Improve the performance of try_cast
  • [SPARK-41460] - Introduce IsolatedThreadSafeRpcEndpoint to extend IsolatedRpcEndpoint
  • [SPARK-41463] - Ensure error class (and subclass) names contain only capital letters, numbers and underscores
  • [SPARK-41466] - Change Scala Style configuration to catch AnyFunSuite instead of FunSuite
  • [SPARK-41467] - Upgrade httpclient from 4.5.13 to 4.5.14
  • [SPARK-41469] - Task rerun on decommissioned executor can be avoided if shuffle data has migrated
  • [SPARK-41474] - Exclude proto files from spark-protobuf
  • [SPARK-41476] - Prevent `README.md` from triggering CIs
  • [SPARK-41482] - Upgrade dropwizard metrics 4.2.13
  • [SPARK-41491] - Update postgres docker image to 15.1
  • [SPARK-41509] - Delay execution hash until after aggregation for semi-join runtime filter.
  • [SPARK-41511] - LongToUnsafeRowMap support ignoresDuplicatedKey
  • [SPARK-41520] - Split AND_OR TreePattern to separate AND and OR TreePatterns
  • [SPARK-41523] - `protoc-jar-maven-plugin` should uniformly use `protoc-jar-maven-plugin.version` as the version
  • [SPARK-41524] - Expose SQL confs and extraOptions separately in o.a.s.sql.execution.streaming.state.RocksDBConf
  • [SPARK-41530] - Rename MedianHeap to PercentileMap and support percentile
  • [SPARK-41534] - Setup initial client module for Spark Connect
  • [SPARK-41541] - Fix wrong child call in SQLShuffleWriteMetricsReporter.decRecordsWritten()
  • [SPARK-41544] - Upgrade `versions-maven-plugin` to 2.14.1
  • [SPARK-41553] - Fix the documentation for num_files
  • [SPARK-41561] - Upgrade slf4j related dependencies from 2.0.5 to 2.0.6
  • [SPARK-41562] - Upgrade joda-time from 2.12.1 to 2.12.2
  • [SPARK-41567] - Move configuration of `versions-maven-plugin` to parent pom
  • [SPARK-41569] - Upgrade rocksdbjni to 7.8.3
  • [SPARK-41584] - Upgrade RoaringBitmap to 0.9.36
  • [SPARK-41587] - Upgrade org.scalatestplus:selenium-4-4 to org.scalatestplus:selenium-4-7
  • [SPARK-41588] - Make "Rule id not found" error message more actionable
  • [SPARK-41660] - only propagate metadata columns if they are used
  • [SPARK-41669] - Speed up CollapseProject for wide tables
  • [SPARK-41704] - Upgrade `sbt-assembly` from 2.0.0 to 2.1.0
  • [SPARK-41711] - Upgrade protobuf-java to 3.21.12
  • [SPARK-41714] - Update maven-checkstyle-plugin from 3.1.2 to 3.2.0
  • [SPARK-41719] - Spark SSLOptions sub settings should be set only when ssl is enabled
  • [SPARK-41720] - Rename UnresolvedFunc to UnresolvedFunctionName
  • [SPARK-41750] - Upgrade dev.ludovic.netlib to 3.0.3
  • [SPARK-41760] - Enforce scalafmt for Spark Connect Client module
  • [SPARK-41778] - Add an alias "reduce" to ArrayAggregate
  • [SPARK-41787] - Upgrade silencer from 1.7.10 to 1.7.12
  • [SPARK-41791] - Create distinct metadata attributes for metadata that is constant or file and metadata that is generated during the scan
  • [SPARK-41798] - Upgrade hive-storage-api to 2.8.1
  • [SPARK-41800] - Upgrade commons-compress to 1.22
  • [SPARK-41802] - Upgrade Apache httpcore to 4.4.16
  • [SPARK-41805] - Reuse expressions in WindowSpecDefinition
  • [SPARK-41806] - Use AppendData.byName for SQL INSERT INTO by name for DSV2 and block ambiguous queries with static partition columns
  • [SPARK-41822] - Setup Scala/JVM Client Connection
  • [SPARK-41860] - Make AvroScanBuilder and JsonScanBuilder case classes
  • [SPARK-41861] - Make v2 ScanBuilders' build() return typed scan
  • [SPARK-41883] - Upgrade dropwizard metrics 4.2.15
  • [SPARK-41893] - Publish SBOM artifacts
  • [SPARK-41925] - Enable spark.sql.orc.enableNestedColumnVectorizedReader by default
  • [SPARK-41938] - Upgrade sbt from 1.8.0 to 1.8.2
  • [SPARK-41941] - Upgrade scalatest related test dependencies to 3.2.15
  • [SPARK-41943] - Use Java API to create files and grant permissions in DiskBlockManager
  • [SPARK-41949] - Make stage scheduling support local-cluster mode
  • [SPARK-41962] - Update the import order of scala package in class SpecificParquetRecordReaderBase
  • [SPARK-41965] - Add DataFrameWriterV2 to PySpark API references
  • [SPARK-41966] - Add `CharType` and `TimestampNTZType` to PySpark API references
  • [SPARK-41970] - Introduce SparkPath to address paths and URIs
  • [SPARK-41986] - Introduce shuffle on SinglePartition
  • [SPARK-41994] - Harden SQLSTATE usage for error classes
  • [SPARK-42031] - Clean up remove methods that do not need override
  • [SPARK-42037] - Rename AMPLAB_ to SPARK_ prefix in build environment variables
  • [SPARK-42043] - Basic Scala Client Result Implementation
  • [SPARK-42049] - Improve AliasAwareOutputExpression
  • [SPARK-42055] - Upgrade scalatest-maven-plugin from 2.1.0 to 2.2.0
  • [SPARK-42056] - Add missing options for Protobuf functions.
  • [SPARK-42058] - Harden SQLSTATE usage for error classes (2)
  • [SPARK-42065] - Remove duplicated test_freqItems
  • [SPARK-42067] - Upgrade buf from 1.11.0 to 1.12.0
  • [SPARK-42081] - improve the plan change validation
  • [SPARK-42083] - Make (Executor|StatefulSet)PodsAllocator extendable
  • [SPARK-42086] - Sort test cases in SQLQueryTestSuite
  • [SPARK-42091] - Upgrade jetty to 9.4.50.v20221201
  • [SPARK-42092] - Upgrade RoaringBitmap to 0.9.38
  • [SPARK-42096] - Code cleanup for connect module
  • [SPARK-42106] - [Pyspark] Hide parameters when re-printing user provided remote URL in REPL
  • [SPARK-42108] - Make Analyzer transform Count(*) into Count(1)
  • [SPARK-42111] - Mark Orc*FilterSuite/OrcV*SchemaPruningSuite as ExtendedSQLTest
  • [SPARK-42114] - Add uniform parquet encryption test case
  • [SPARK-42116] - Mark ColumnarBatchSuite as ExtendedSQLTest
  • [SPARK-42129] - Upgrade rocksdbjni to 7.9.2
  • [SPARK-42133] - Add basic Dataset API methods to Spark Connect Scala Client
  • [SPARK-42149] - Remove the env `SPARK_USE_CONC_INCR_GC` used to enable CMS GC for Yarn AM
  • [SPARK-42152] - Use `_` instead of `-` in `shadedPattern` for relocation package name
  • [SPARK-42161] - Upgrade Arrow to 11.0.0
  • [SPARK-42166] - Make `docker-image-tool.sh` usage message up-to-date
  • [SPARK-42167] - Improve GitHub Action `lint` job to stop on failures earlier
  • [SPARK-42172] - Compatibility check for Scala Client
  • [SPARK-42180] - Update `SCALA_VERSION` in `_config.yml`
  • [SPARK-42202] - Scala Client E2E test should stop the server gracefully
  • [SPARK-42220] - Upgrade buf from 1.12.0 to 1.13.1
  • [SPARK-42230] - Improve `lint` job by skipping PySpark and SparkR docs if unchanged
  • [SPARK-42237] - change binary to unsupported dataType in csv format
  • [SPARK-42277] - Use ROCKSDB for spark.history.store.hybridStore.diskBackend by default
  • [SPARK-42283] - Add Simple Scala UDFs to Scala/JVM Client
  • [SPARK-42287] - Optimize the packaging strategy of connect client module
  • [SPARK-42333] - Change log level to debug when fetching result set from SparkExecuteStatementOperation
  • [SPARK-42334] - Make sure connect client assembly and sql package are built before running client tests - SBT
  • [SPARK-42354] - Upgrade Jackson to 2.14.2
  • [SPARK-42372] - Improve performance of HiveGenericUDTF by making inputProjection instantiate once
  • [SPARK-42390] - Upgrade buf from 1.13.1 to 1.14.0
  • [SPARK-42394] - Fix the usage information of bin/spark-sql --help
  • [SPARK-42398] - refine default column value framework
  • [SPARK-42422] - Upgrade `maven-shade-plugin` to 3.4.1
  • [SPARK-42423] - Add metadata column file block start and length
  • [SPARK-42429] - IntelliJ Build issue: value getArgument is not a member of org.mockito.invocation.InvocationOnMock
  • [SPARK-42436] - Improve multiTransform to generate alternatives dynamically
  • [SPARK-42457] - Scala Client Session Read API
  • [SPARK-42480] - Improve the performance of drop partitions
  • [SPARK-42482] - Scala client Write API V1
  • [SPARK-42514] - Scala Client add partition transforms functions
  • [SPARK-42518] - Scala client Write API V2
  • [SPARK-42526] - Add Classifier.getNumClasses back
  • [SPARK-42527] - Scala Client add Window functions
  • [SPARK-42543] - Specify protocol for UDF artifact transfer in JVM/Scala client
  • [SPARK-42548] - Add ReferenceAllColumns to skip rewriting attributes
  • [SPARK-42599] - Make `CompatibilitySuite` a tool like `dev/mima`
  • [SPARK-42653] - Artifact transfer from Scala/JVM client to Server
  • [SPARK-42656] - Spark Connect Scala Client Shell Script
  • [SPARK-42675] - Should clean up temp view after test
  • [SPARK-42684] - v2 catalog should not allow column default value by default
  • [SPARK-42712] - Improve docstring of mapInPandas and mapInArrow
  • [SPARK-42722] - Python Connect def schema() should not cache the schema
  • [SPARK-42895] - ValueError when invoking any session operations on a stopped Spark session
  • [SPARK-42904] - Char/Varchar Support for JDBC Catalog
  • [SPARK-42908] - Raise RuntimeError if SparkContext is not initialized when parsing DDL-formatted type strings
  • [SPARK-42917] - Correct getUpdateColumnNullabilityQuery for DerbyDialect
  • [SPARK-42946] - Sensitive data could still be exposed by variable substitution
  • [SPARK-43009] - Parameterized sql() with constants
  • [SPARK-43075] - Change gRPC to grpcio when it is not installed.
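
For [SPARK-43009] above, the parameterized sql() API accepts plain Python constants for named parameters. A minimal PySpark sketch; the query, parameter name, and values are illustrative assumptions, not taken from the ticket:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Named markers such as :minId are bound from the args dict; Python constants
    # are converted to SQL literals by the parameterized sql() API.
    df = spark.sql(
        "SELECT id FROM range(10) WHERE id > :minId",
        args={"minId": 7},
    )
    df.show()  # rows with id 8 and 9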

Test

  • [SPARK-38755] - Add file to address missing pandas general functions
  • [SPARK-38786] - Test Bug in StatisticsSuite "change stats after add/drop partition command"
  • [SPARK-38893] - Test SourceProgress in PySpark
  • [SPARK-38920] - Add ORC blockSize tests to BloomFilterBenchmark
  • [SPARK-38923] - Regenerate benchmark results
  • [SPARK-38944] - Close `NioBufferedFileInputStream` opened by `ExternalAppendOnlyUnsafeRowArraySuite`
  • [SPARK-38948] - `DiskRowQueue` leak in `PythonForeachWriterSuite`
  • [SPARK-39034] - Add tests for options from `to_json` and `from_json`.
  • [SPARK-39035] - Add tests for options from `to_csv` and `from_csv`.
  • [SPARK-39117] - Do not include number of functions in sql-expression-schema.md
  • [SPARK-39181] - SessionCatalog.reset should not drop temp functions twice
  • [SPARK-39253] - Improve PySpark API reference to be more readable
  • [SPARK-39331] - Flaky test: StreamingListenerTests.test_listener_events
  • [SPARK-39369] - Use JAVA_OPTS for AppVeyor build to increase the memory properly
  • [SPARK-39372] - Support R 4.2.0 in SparkR
  • [SPARK-39394] - Improve the PySpark structured streaming page to be more readable
  • [SPARK-39463] - Use UUID for test database location in JavaJdbcRDDSuite
  • [SPARK-39477] - Remove "Number of queries" info from the golden files of SQLQueryTestSuite
  • [SPARK-39495] - Support SPARK_TEST_HIVE_CLIENT_VERSIONS for HiveClientVersions
  • [SPARK-39584] - Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
  • [SPARK-39604] - Missing UT for DerbyDialect's getCatalystType
  • [SPARK-39631] - Update FilterPushdownBenchmark results
  • [SPARK-39663] - Missing UT for MysqlDialect's listIndexes
  • [SPARK-39701] - Move withSecretFile to SparkFunSuite to reuse
  • [SPARK-39711] - Remove redundant trait: BeforeAndAfterAll & BeforeAndAfterEach & Logging
  • [SPARK-39826] - Bump scalatest-maven-plugin to 2.1.0
  • [SPARK-39856] - Avoid OOM in TPC-DS build with SMJ
  • [SPARK-39869] - Fix flaky hive - slow tests because of out-of-memory
  • [SPARK-39874] - Deflake BroadcastJoinSuite*
  • [SPARK-39959] - Recover SparkR CRAN check in GitHub Actions CI
  • [SPARK-40116] - Remove Arrow in AppVeyor for now
  • [SPARK-40133] - Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is true
  • [SPARK-40172] - Temporarily disable flaky test cases in ImageFileFormatSuite
  • [SPARK-40203] - Add test cases for Spark Decimal
  • [SPARK-40229] - Re-enable excel I/O test for pandas API on Spark.
  • [SPARK-40265] - Fix the inconsistent behavior for Index.intersection.
  • [SPARK-40271] - Support list type for pyspark.sql.functions.lit
  • [SPARK-40273] - Fix the documents "Contributing and Maintaining Type Hints".
  • [SPARK-40410] - Migrate trait QueryErrorsSuiteBase into SparkFunSuite
  • [SPARK-40461] - Set upper bound for pyzmq 24.0.0 for Python linter
  • [SPARK-40495] - Add additional tests to StreamingSessionWindowSuite
  • [SPARK-40669] - Parameterize InMemoryColumnarBenchmark
  • [SPARK-40682] - Set spark.driver.maxResultSize to 3g in SqlBasedBenchmark
  • [SPARK-40789] - Separate tests under pyspark.sql.tests
  • [SPARK-40903] - Avoid reordering decimal Add for canonicalization
  • [SPARK-40968] - Fix some wrong/misleading comments in DAGSchedulerSuite
  • [SPARK-41486] - Upgrade MySQL docker image to 8.0.31 to support arm64
  • [SPARK-41504] - Update R version to 4.1.2 in Dockerfile comment
  • [SPARK-41558] - Disable Coverage in python.pyspark.tests.test_memory_profiler
  • [SPARK-41559] - Reenable Codecov report in the scheduled job
  • [SPARK-41753] - Add tests for ArrayZip to check the result size and nullability.
  • [SPARK-41774] - Remove def test_vectorized_udf_unsupported_types
  • [SPARK-41782] - Regenerate benchmark results
  • [SPARK-41854] - Automatically reformat/check python/setup.py
  • [SPARK-41863] - Skip `flake8` tests if the command is not available
  • [SPARK-41864] - Fix mypy linter errors
  • [SPARK-41996] - KafkaMicroBatchV2SourceSuite failed for topic partitions unavailable test due to kafka operations taking longer
  • [SPARK-42087] - Use `--no-same-owner` when HiveExternalCatalogVersionsSuite untars.
  • [SPARK-42110] - Reduce the number of repetition in ParquetDeltaEncodingSuite.`random data test`
  • [SPARK-42181] - Skip `torch` tests when torch is not installed
  • [SPARK-42183] - Exclude pyspark.ml.torch.tests in MyPy tests
  • [SPARK-42279] - Simplify `pyspark.pandas.tests.test_resample`
  • [SPARK-42282] - Split 'pyspark.pandas.tests.test_groupby'
  • [SPARK-42341] - Fix JoinSelectionHelperSuite and PlanStabilitySuite to use explicit broadcast threshold
  • [SPARK-42364] - Split 'pyspark.pandas.tests.test_dataframe'
  • [SPARK-42365] - Split 'pyspark.pandas.tests.test_ops_on_diff_frames'
  • [SPARK-42368] - Ignore SparkRemoteFileTest K8s IT test case in GitHub Action
  • [SPARK-42474] - Add extraJVMOptions JVM GC option K8s test cases
  • [SPARK-42507] - Simplify ORC schema merging conflict error check
  • [SPARK-42587] - Use wrapper versions for SBT and Maven in `connect` module tests
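
For [SPARK-40271] above, pyspark.sql.functions.lit gained support for Python lists. A minimal PySpark sketch; the column name is an illustrative assumption:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = SparkSession.builder.getOrCreate()

    # Passing a Python list to lit() produces an array literal column.
    df = spark.range(1).select(lit([1, 2, 3]).alias("xs"))
    df.show()  # prints a single row containing [1, 2, 3]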

Task

  • [SPARK-28764] - Remove unnecessary writePartitionedFile method from ExternalSorter
  • [SPARK-35208] - Add docs for LATERAL subqueries
  • [SPARK-38181] - Update comments for KafkaDataConsumer
  • [SPARK-38289] - Refactor SQL CLI exit code related code
  • [SPARK-38550] - Use a disk-based store to save more information in live UI to help debug
  • [SPARK-38572] - Setting version to 3.4.0-SNAPSHOT
  • [SPARK-38651] - Writing out empty or nested empty schemas in Datasource should be configurable
  • [SPARK-38705] - Use function identifier in create and drop function command
  • [SPARK-38910] - Clean up sparkStaging dir before unregister()
  • [SPARK-39110] - Add metrics properties to Environment page
  • [SPARK-39178] - When throwing SparkFatalException, the root cause should be shown too.
  • [SPARK-39195] - Spark OutputCommitCoordinator should abort stage when committed file is not consistent with task status
  • [SPARK-39224] - Lower general ProcfsMetricsGetter error log levels except /proc/ lookup error
  • [SPARK-39244] - Use `--no-echo` instead of `--slave` in R 4.0
  • [SPARK-39264] - Add explicit type checks and casting for awaitOffset fix
  • [SPARK-39781] - Add support for configuring max_open_files through RocksDB state store provider
  • [SPARK-39805] - Deprecate Trigger.Once and Promote Trigger.AvailableNow
  • [SPARK-39861] - Deprecate Python 3.7 Support
  • [SPARK-39918] - Replace the wording "un-comparable" with "incomparable"
  • [SPARK-40213] - Incorrect ASCII value for Latin-1 Supplement characters
  • [SPARK-40292] - arrays_zip output unexpected alias column names
  • [SPARK-40319] - Remove duplicated query execution error method for PARSE_DATETIME_BY_NEW_PARSER
  • [SPARK-40389] - Decimals can't upcast as integral types if the cast can overflow
  • [SPARK-40467] - Split FlatMapGroupsWithState down to multiple test suites
  • [SPARK-40491] - Remove too old TODO for JdbcRDD
  • [SPARK-40651] - Drop Hadoop2 binary distribution from release process
  • [SPARK-40844] - Flip the default value of Kafka offset fetching config
  • [SPARK-41101] - Add messageClassName support for pyspark-protobuf
  • [SPARK-41224] - Optimize Arrow collect to stream the result from server to client
  • [SPARK-41247] - Unify the protobuf versions in Spark connect and protobuf connector
  • [SPARK-41249] - Add acceptance test for self-union on streaming query
  • [SPARK-41396] - Oneof field support and recursive fields
  • [SPARK-41415] - SASL Request Retries
  • [SPARK-41499] - Upgrade protobuf version to 3.21.11
  • [SPARK-41538] - Metadata column should be appended at the end of project list
  • [SPARK-41639] - Remove ScalaReflectionLock
  • [SPARK-41690] - Introduce AgnosticEncoders
  • [SPARK-41752] - UI improvement for nested SQL executions
  • [SPARK-41853] - Use Map in place of SortedMap for ErrorClassesJsonReader
  • [SPARK-41930] - Remove `branch-3.1` from publish_snapshot job
  • [SPARK-41972] - Fix flaky test in StreamingQueryStatusListenerSuite
  • [SPARK-41993] - Move RowEncoder to AgnosticEncoders
  • [SPARK-42003] - Reduce duplicate code in ResolveGroupByAll
  • [SPARK-42075] - Deprecate DStream API
  • [SPARK-42093] - Move JavaTypeInference to AgnosticEncoders
  • [SPARK-42105] - Document work (Release note & Guide doc) for SPARK-40925
  • [SPARK-42284] - Make sure Connect Server assembly jar is available before we run Scala Client tests
  • [SPARK-42377] - Test Framework for Connect Scala Client
  • [SPARK-42440] - Implement First batch of Dataset APIs
  • [SPARK-42441] - Scala Client - Implement Column API
  • [SPARK-42453] - Implement function max in Scala client
  • [SPARK-42460] - E2E test should clean-up results
  • [SPARK-42461] - Scala Client - Initial Set of Functions
  • [SPARK-42464] - Fix 2.13 build errors caused by explain output changes and udfs.
  • [SPARK-42465] - ProtoToPlanTestSuite should analyze its input plans
  • [SPARK-42495] - Scala Client: Add 2nd batch of functions
  • [SPARK-42512] - Scala Client: Add 3rd batch of functions
  • [SPARK-42520] - Spark Connect Scala Client: Window
  • [SPARK-42569] - Throw unsupported exceptions for non-supported API
  • [SPARK-42624] - Reorganize imports in test_functions
  • [SPARK-42876] - DataType's physicalDataType should be private[sql]
  • [SPARK-42878] - Named Table should support options
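
For [SPARK-39805] above, Trigger.Once is deprecated in favor of Trigger.AvailableNow. A minimal PySpark sketch; the rate source and console sink are illustrative assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # availableNow=True replaces the deprecated once=True: it processes all data
    # available when the query starts, possibly across several micro-batches, then stops.
    query = (
        spark.readStream.format("rate").load()
        .writeStream
        .format("console")
        .trigger(availableNow=True)
        .start()
    )
    query.awaitTermination()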

Dependency upgrade

  • [SPARK-39099] - Add dependencies to Dockerfile for building Spark releases
  • [SPARK-39125] - Upgrade netty and netty-tcnative
  • [SPARK-39183] - Upgrade Apache Xerces Java to 2.12.2
  • [SPARK-39540] - Upgrade mysql-connector-java to 8.0.29
  • [SPARK-39725] - Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622
  • [SPARK-39927] - Upgrade Avro to version 1.11.1
  • [SPARK-39992] - Upgrade slf4j to 1.7.36
  • [SPARK-39996] - Upgrade postgresql to 42.5.0
  • [SPARK-40037] - Upgrade com.google.crypto.tink:tink from 1.6.1 to 1.7.0
  • [SPARK-40326] - upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4
  • [SPARK-40522] - Upgrade Apache Kafka from 3.2.1 to 3.2.3
  • [SPARK-40552] - Upgrade protobuf-python from 4.21.5 to 4.21.6
  • [SPARK-40801] - Upgrade Apache Commons Text to 1.10
  • [SPARK-40884] - Upgrade fabric8io - kubernetes-client to 6.2.0
  • [SPARK-41030] - Upgrade Apache Ivy to 2.5.1
  • [SPARK-41076] - Upgrade protobuf-java to 3.21.9
  • [SPARK-41240] - Upgrade Protobuf from 3.19.4 to 3.19.5
  • [SPARK-41245] - Upgrade postgresql from 42.5.0 to 42.5.1
  • [SPARK-41566] - Upgrade netty from 4.1.84.Final to 4.1.86.Final
  • [SPARK-41634] - Upgrade minimatch to 3.1.2
  • [SPARK-42218] - Upgrade netty to version 4.1.87.Final
  • [SPARK-42362] - Upgrade kubernetes-client from 6.4.0 to 6.4.1

Umbrella

  • [SPARK-39515] - Improve/recover scheduled jobs in GitHub Actions
  • [SPARK-40576] - Support pandas 1.5.x.
  • [SPARK-41053] - Better Spark UI scalability and Driver stability for large applications
  • [SPARK-41283] - Feature parity: Functions API in Spark Connect
  • [SPARK-41550] - Dynamic Allocation on K8S GA
  • [SPARK-41594] - Support table-valued generator functions in the FROM clause
  • [SPARK-41597] - Improve PySpark errors
  • [SPARK-41642] - Deduplicate docstrings in Python Spark Connect
  • [SPARK-42339] - Improve Kryo Serializer Support
  • [SPARK-42802] - Customized K8s Scheduler GA

Documentation

  • [SPARK-38581] - List of supported pandas APIs for pandas API on Spark docs.
  • [SPARK-38961] - Enhance to automatically generate the pandas API support list
  • [SPARK-39001] - Document which options are unsupported in CSV and JSON functions
  • [SPARK-39577] - Add SQL reference for built-in functions
  • [SPARK-39677] - Wrong args item formatting of the regexp functions
  • [SPARK-39707] - Add SQL reference for aggregate functions
  • [SPARK-39737] - PERCENTILE_CONT and PERCENTILE_DISC should support aggregate filter
  • [SPARK-39777] - Remove Hive bucketing incompatibility doc
  • [SPARK-39780] - Add an additional usage example for the map_zip_with function
  • [SPARK-39968] - Update K8s doc to recommend K8s 1.22+
  • [SPARK-40028] - Add binary examples for string expressions
  • [SPARK-40043] - Document DataStreamWriter.toTable and DataStreamReader.table
  • [SPARK-40266] - Corrected console output in quick-start - Datatype Integer instead of Long
  • [SPARK-40279] - Document spark.yarn.report.interval
  • [SPARK-40922] - pyspark.pandas.read_csv supports reading multiple files, but that is undocumented
  • [SPARK-40983] - Remove Hadoop requirements for zstd mention in Parquet compression codec
  • [SPARK-40994] - Add code example for JDBC data source with partitionColumn
  • [SPARK-41014] - Improve documentation and typing of applyInPandas for groupby and cogroup
  • [SPARK-41596] - Document the new feature "Async Progress Tracking" to Structured Streaming guide doc
  • [SPARK-41951] - Update SQL migration guide and documentations
  • [SPARK-42405] - Better documentation of array_insert function
  • [SPARK-42418] - Updating PySpark documentation to support new users better
  • [SPARK-42446] - Updating PySpark documentation to enhance usability
  • [SPARK-42456] - Consolidating the PySpark version upgrade note pages into a single page to make it easier to read
  • [SPARK-42530] - Remove Hadoop 2 from PySpark installation guide
  • [SPARK-42592] - Document SS guide doc for supporting multiple stateful operators (especially chained aggregations)
  • [SPARK-42628] - Add a migration note for bloom filter join
  • [SPARK-42713] - Add '__getattr__' and '__getitem__' of DataFrame and Column to API reference
  • [SPARK-42903] - Avoid documenting None as a return value in docstring
  • [SPARK-42924] - Clarify the comment of parameterized SQL args
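
For [SPARK-40994] above, a JDBC read can be parallelized with partitionColumn. A minimal PySpark sketch; the URL, table, column, and bounds are placeholder assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # partitionColumn/lowerBound/upperBound/numPartitions split the scan into parallel
    # range queries; the bounds only shape the partition stride, they do not filter rows.
    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db.example.com:5432/shop")  # placeholder
        .option("dbtable", "orders")                                  # placeholder
        .option("partitionColumn", "order_id")
        .option("lowerBound", "1")
        .option("upperBound", "1000000")
        .option("numPartitions", "8")
        .load()
    )
    print(df.rdd.getNumPartitions())  # expected: 8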
