Release Notes - Spark - Version 2.4.1

Sub-task

  • [SPARK-25883] - Override method `prettyName` in `from_avro`/`to_avro`
  • [SPARK-26010] - SparkR vignette fails on CRAN on Java 11
  • [SPARK-26327] - Metrics in FileSourceScanExec are not updated correctly when relation.partitionSchema is set
  • [SPARK-26402] - Accessing nested fields with different cases in case insensitive mode

Bug

  • [SPARK-22148] - TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled
  • [SPARK-23458] - Flaky test: OrcQuerySuite
  • [SPARK-24553] - Job UI redirect causing http 302 error
  • [SPARK-24669] - Managed table was not cleared of path after drop database cascade
  • [SPARK-24687] - NoClassDefFoundError thrown during task serialization causes the job to hang
  • [SPARK-25451] - Stages page doesn't show the right number of the total tasks
  • [SPARK-25767] - Error reported in Spark logs when using the org.apache.spark:spark-sql_2.11:2.3.2 Java library
  • [SPARK-25786] - If ByteBuffer.hasArray is false, it will throw UnsupportedOperationException for Kryo
  • [SPARK-25827] - Replicating a block > 2gb with encryption fails
  • [SPARK-25837] - Web UI does not respect spark.ui.retainedJobs in some instances
  • [SPARK-25863] - java.lang.UnsupportedOperationException: empty.max at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
  • [SPARK-25866] - Update KMeans formatVersion
  • [SPARK-25906] - spark-shell cannot handle `-i` option correctly
  • [SPARK-25909] - Error in documentation: number of cluster managers
  • [SPARK-25918] - LOAD DATA LOCAL INPATH should handle a relative path
  • [SPARK-25921] - Python worker reuse causes Barrier tasks to run without BarrierTaskContext
  • [SPARK-25922] - [K8] Spark Driver/Executor "spark-app-selector" label mismatch
  • [SPARK-25930] - Fix scala version string detection when maven-help-plugin is not pre-installed
  • [SPARK-25934] - Mesos: SPARK_CONF_DIR should not be propagated by spark-submit
  • [SPARK-25979] - Window function: allow parentheses around window reference
  • [SPARK-25988] - Keep names unchanged when deduplicating the column names in Analyzer
  • [SPARK-25992] - Accumulators giving KeyError in pyspark
  • [SPARK-26011] - pyspark app with "spark.jars.packages" config does not work
  • [SPARK-26019] - pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
  • [SPARK-26048] - Flume connector for Spark 2.4 does not exist in Maven repository
  • [SPARK-26057] - Table joining is broken in Spark 2.4
  • [SPARK-26078] - WHERE .. IN fails to filter rows when used in combination with UNION
  • [SPARK-26079] - Flaky test: StreamingQueryListenersConfSuite
  • [SPARK-26080] - Unable to run worker.py on Windows
  • [SPARK-26082] - Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler
  • [SPARK-26084] - AggregateExpression.references fails on unresolved expression trees
  • [SPARK-26092] - Use CheckpointFileManager to write the streaming metadata file
  • [SPARK-26100] - [History server] Jobs table and Aggregate metrics table show fewer tasks than expected
  • [SPARK-26109] - Duration in the task summary metrics table and the task table are different
  • [SPARK-26114] - Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions
  • [SPARK-26119] - Task metrics summary in the stage page should contain only successful tasks metrics
  • [SPARK-26137] - Linux file separator is hard coded in DependencyUtils used in deploy process
  • [SPARK-26147] - Python UDFs in join condition fail even when using columns from only one side of join
  • [SPARK-26181] - the `hasMinMaxStats` method of `ColumnStatsMap` is not correct
  • [SPARK-26184] - Last updated time is not getting updated in the History Server UI
  • [SPARK-26186] - In-progress applications whose last updated time is less than the cleaning interval are removed during log cleaning
  • [SPARK-26188] - Spark 2.4.0 Partitioning behavior breaks backwards compatibility
  • [SPARK-26198] - Serializing null values in Metadata throws NPE
  • [SPARK-26201] - python broadcast.value on driver fails with disk encryption enabled
  • [SPARK-26211] - Fix InSet for binary, and struct and array with null.
  • [SPARK-26219] - Executor summary is not updated for failed jobs in history server UI
  • [SPARK-26228] - OOM issue encountered when computing Gramian matrix
  • [SPARK-26233] - Incorrect decimal value with java beans and first/last/max... functions
  • [SPARK-26256] - Add proper labels when deleting pods
  • [SPARK-26265] - deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator
  • [SPARK-26267] - Kafka source may reprocess data
  • [SPARK-26269] - YarnAllocator should have the same blacklist behaviour as YARN to maximize use of cluster resources
  • [SPARK-26307] - Fix CTAS when INSERTing into a partitioned table using Hive serde
  • [SPARK-26315] - auto cast threshold from Integer to Float in approxSimilarityJoin of BucketedRandomProjectionLSHModel
  • [SPARK-26351] - Documented formula of precision at k does not match the actual code
  • [SPARK-26352] - Join reordering should not change the order of output attributes
  • [SPARK-26355] - Add a workaround for PyArrow 0.11.
  • [SPARK-26366] - Except with transform regression
  • [SPARK-26370] - Fix resolution of higher-order function for the same identifier.
  • [SPARK-26379] - Use dummy TimeZoneId for CurrentTimestamp to avoid UnresolvedException in CurrentBatchTimestamp
  • [SPARK-26382] - prefix sorter should handle -0.0
  • [SPARK-26394] - Annotation error for Utils.timeStringAsMs
  • [SPARK-26422] - Unable to disable Hive support in SparkR when Hadoop version is unsupported
  • [SPARK-26426] - ExpressionInfo related unit tests fail in Windows
  • [SPARK-26427] - Upgrade Apache ORC to 1.5.4
  • [SPARK-26444] - Stage color doesn't change with its status
  • [SPARK-26496] - Avoid using Random.nextString in StreamingInnerJoinSuite
  • [SPARK-26501] - Unexpected override of exitFn in SparkSubmitSuite
  • [SPARK-26537] - update the release scripts to point to gitbox
  • [SPARK-26538] - Postgres numeric array support
  • [SPARK-26545] - Fix typo in EqualNullSafe's truth table comment
  • [SPARK-26551] - Selecting one complex field and having is null predicate on another complex field can cause error
  • [SPARK-26554] - Update `release-util.sh` to avoid GitBox fake 200 headers
  • [SPARK-26559] - ML image can't work with numpy versions prior to 1.9
  • [SPARK-26571] - Update Hive Serde mapping with canonical name of Parquet and Orc FileFormat
  • [SPARK-26572] - Join on distinct column with monotonically_increasing_id produces wrong output
  • [SPARK-26576] - Broadcast hint not applied to partitioned table
  • [SPARK-26583] - Add `paranamer` dependency to `core` module
  • [SPARK-26586] - Streaming queries should have isolated SparkSessions and confs
  • [SPARK-26606] - parameters passed in extraJavaOptions are not being picked up
  • [SPARK-26615] - Fixing transport server/client resource leaks in the core unittests
  • [SPARK-26629] - Error with multiple file stream in a query + restart on a batch that has no data for one file stream
  • [SPARK-26638] - Pyspark vector classes always return error for unary negation
  • [SPARK-26665] - BlockTransferService.fetchBlockSync may hang forever
  • [SPARK-26677] - Incorrect results of not(eqNullSafe) when data read from Parquet file
  • [SPARK-26680] - StackOverflowError if Stream passed to groupBy
  • [SPARK-26682] - Task attempt ID collision causes lost data
  • [SPARK-26706] - Fix Cast$mayTruncate for bytes
  • [SPARK-26708] - Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan
  • [SPARK-26709] - OptimizeMetadataOnlyQuery does not correctly handle files with zero records
  • [SPARK-26718] - Fixed integer overflow in SS kafka rateLimit calculation
  • [SPARK-26726] - Synchronize the amount of memory used by the broadcast variable to the UI display
  • [SPARK-26732] - Flaky test: SparkContextInfoSuite.getRDDStorageInfo only reports on RDDs that actually persist data
  • [SPARK-26734] - StackOverflowError on WAL serialization caused by large receivedBlockQueue
  • [SPARK-26740] - Statistics for date and timestamp columns depend on system time zone
  • [SPARK-26745] - Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines
  • [SPARK-26751] - HiveSessionImpl might have a memory leak since Operations are not closed properly
  • [SPARK-26757] - GraphX EdgeRDDImpl and VertexRDDImpl `count` method cannot handle empty RDDs
  • [SPARK-26758] - Idle Executors are not getting killed after spark.dynamicAllocation.executorIdleTimeout value
  • [SPARK-26806] - EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
  • [SPARK-26859] - Fix field writer index bug in non-vectorized ORC deserializer
  • [SPARK-26864] - Query may return incorrect result when python udf is used as a join condition and the udf uses attributes from both legs of left semi join.
  • [SPARK-26873] - FileFormatWriter creates inconsistent MR job IDs
  • [SPARK-26927] - Race condition may cause dynamic allocation to stop working
  • [SPARK-26950] - Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values
  • [SPARK-26990] - Difference in handling of mixed-case partition column names after SPARK-26188
  • [SPARK-27019] - Spark UI's SQL tab shows inconsistent values
  • [SPARK-27065] - avoid more than one active task set managers for a stage
  • [SPARK-27078] - Reading a Hive materialized view throws MatchError
  • [SPARK-27080] - Reading a Parquet file with merged metastore schema should compare schema fields in uniform case
  • [SPARK-27094] - Thread interrupt being swallowed while launching executors in YarnAllocator
  • [SPARK-27097] - Avoid embedding platform-dependent offsets literally in whole-stage generated code
  • [SPARK-27107] - Spark SQL Job failing because of Kryo buffer overflow with ORC
  • [SPARK-27111] - A continuous query may fail with InterruptedException when the Kafka consumer temporarily has 0 partitions
  • [SPARK-27112] - Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting
  • [SPARK-27134] - array_distinct function does not work correctly with columns containing array of array
  • [SPARK-27160] - Incorrect Literal Casting of DecimalType in OrcFilters
  • [SPARK-27165] - Upgrade Apache ORC to 1.5.5
  • [SPARK-27178] - k8s test failing due to missing nss library in dockerfile
  • [SPARK-27198] - Heartbeat interval mismatch in driver and executor

New Feature

  • [SPARK-25635] - Support selective direct encoding in native ORC write
  • [SPARK-26118] - Make Jetty's requestHeaderSize configurable in Spark
  • [SPARK-26605] - New executors failing with expired tokens in client mode
  • [SPARK-26910] - Re-release SparkR to CRAN

Improvement

  • [SPARK-25023] - Clarify Spark security documentation
  • [SPARK-25778] - WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due to lack of access to tmpDir from $PWD to HDFS
  • [SPARK-25904] - Avoid allocating arrays too large for JVMs
  • [SPARK-26266] - Update to Scala 2.12.8
  • [SPARK-26316] - Partially revert SPARK-21052 (Add hash map metrics to join) because of the perf degradation in TPC-DS
  • [SPARK-26392] - Cancel pending allocate requests by taking locality preference into account
  • [SPARK-26409] - SQLConf should be serializable in test sessions
  • [SPARK-26604] - Register channel for stream request
  • [SPARK-26633] - Add ExecutorClassLoader.getResourceAsStream
  • [SPARK-27046] - Remove SPARK-19185 related references from documentation since it's resolved

Test

  • [SPARK-25899] - Flaky test: CoarseGrainedSchedulerBackendSuite.compute max number of concurrent tasks can be launched
  • [SPARK-26029] - Bump previousSparkVersion in MimaBuild.scala to be 2.3.0
  • [SPARK-26042] - KafkaContinuousSourceTopicDeletionSuite may hang forever
  • [SPARK-26069] - Flaky test: RpcIntegrationSuite.sendRpcWithStreamFailures
  • [SPARK-26120] - Fix a streaming query leak in Structured Streaming R tests

Task

  • [SPARK-26607] - Remove Spark 2.2.x testing from HiveExternalCatalogVersionsSuite
  • [SPARK-26897] - Update Spark 2.3.x testing from HiveExternalCatalogVersionsSuite
  • [SPARK-27274] - Refer to Scala 2.12 in docs; deprecate Scala 2.11 support in 2.4.1

Dependency upgrade

  • [SPARK-26742] - Bump Kubernetes Client Version to 4.1.2

Documentation

  • [SPARK-25933] - Fix pstats reference for spark.python.profile.dump in configuration.md
  • [SPARK-26207] - add PowerIterationClustering (PIC) doc in 2.4 branch
  • [SPARK-26932] - Add a warning for Hive 2.1.1 ORC reader issue
