Release Notes - Spark - Version 2.4.5

Sub-task

  • [SPARK-28683] - Upgrade Scala to 2.12.10
  • [SPARK-29203] - Reduce shuffle partitions in SQLQueryTestSuite
  • [SPARK-29708] - Different answers in aggregates of duplicate grouping sets
  • [SPARK-30269] - Should use old partition stats to decide whether to update stats when analyzing partition

Bug

  • [SPARK-17398] - Failed to query on external JSON partitioned table
  • [SPARK-21287] - Cannot use Int.MIN_VALUE as Spark SQL fetchsize
  • [SPARK-21492] - Memory leak in SortMergeJoin
  • [SPARK-22955] - Error generating jobs when Stopping JobGenerator gracefully
  • [SPARK-23435] - R tests should support latest testthat
  • [SPARK-23519] - Create View Commands Fails with The view output (col1,col1) contains duplicate column name
  • [SPARK-24152] - SparkR CRAN feasibility check server problem
  • [SPARK-24663] - Flaky test: StreamingContextSuite "stop slow receiver gracefully"
  • [SPARK-24666] - Word2Vec generates infinity vectors when numIterations is large
  • [SPARK-25277] - YARN applicationMaster metrics should not register static and JVM metrics
  • [SPARK-25753] - binaryFiles broken for small files
  • [SPARK-25903] - Flaky test: BarrierTaskContextSuite.throw exception on barrier() call timeout
  • [SPARK-26499] - JdbcUtils.makeGetter does not handle ByteType
  • [SPARK-26560] - Repeating select on udf function throws analysis exception - function not registered
  • [SPARK-26713] - PipedRDD may hold stdin writer and stdout reader threads even if the task is finished
  • [SPARK-26985] - Test "access only some column of the all of columns " fails on big endian
  • [SPARK-26989] - Flaky test: DAGSchedulerSuite.Barrier task failures from the same stage attempt don't trigger multiple stage retries
  • [SPARK-27558] - NPE in TaskCompletionListener due to Spark OOM in UnsafeExternalSorter causing tasks to hang
  • [SPARK-27812] - Kubernetes client imports non-daemon thread which blocks JVM exit
  • [SPARK-28599] - Fix `Execution Time` and `Duration` column sorting for ThriftServerSessionPage
  • [SPARK-28709] - Memory leaks after stopping of StreamingContext
  • [SPARK-28749] - Fix PySpark tests not to require kafka-0-8 in branch-2.4
  • [SPARK-28778] - Shuffle jobs fail due to incorrect advertised address when running in virtual network
  • [SPARK-28903] - Fix AWS JDK version conflict that breaks Pyspark Kinesis tests
  • [SPARK-28906] - `bin/spark-submit --version` shows incorrect info
  • [SPARK-28912] - MatchError exception in CheckpointWriteHandler
  • [SPARK-28917] - Jobs can hang because of race of RDD.dependencies
  • [SPARK-28921] - Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10, 1.12.10, 1.11.10)
  • [SPARK-28939] - SQL configurations are not always propagated
  • [SPARK-28963] - Fall back to archive.apache.org to download Maven if mirrors don't have requested version
  • [SPARK-29042] - Sampling-based RDD with unordered input should be INDETERMINATE
  • [SPARK-29045] - Test failed due to table already exists in SQLMetricsSuite
  • [SPARK-29046] - Possible NPE on SQLConf.get when SparkContext is stopping in another thread
  • [SPARK-29053] - Sort does not work on some columns
  • [SPARK-29055] - Spark UI storage memory increasing over time
  • [SPARK-29101] - CSV datasource returns incorrect .count() from file with malformed records
  • [SPARK-29177] - Zombie tasks prevent executor from releasing when task exceeds maxResultSize
  • [SPARK-29186] - SubqueryAlias name value is null in Spark 2.4.3 Logical plan.
  • [SPARK-29213] - Make it consistent when get notnull output and generate null checks in FilterExec
  • [SPARK-29229] - Change the additional remote repository in IsolatedClientLoader to Google mirror
  • [SPARK-29240] - PySpark 2.4 about sql function 'element_at' param 'extraction'
  • [SPARK-29244] - ArrayIndexOutOfBoundsException on TaskCompletionListener during releasing of memory blocks
  • [SPARK-29450] - [SS] In streaming aggregation, metric for output rows is not measured in append mode
  • [SPARK-29494] - ArrayOutOfBoundsException when converting from string to timestamp
  • [SPARK-29498] - CatalogTable to HiveTable should not change the table's ownership
  • [SPARK-29556] - Avoid including path in error response from REST submission server
  • [SPARK-29560] - Add typesafe bintray repo for sbt-mima-plugin
  • [SPARK-29578] - JDK 1.8.0_232 timezone updates cause "Kwajalein" test failures again
  • [SPARK-29604] - SessionState is initialized with isolated classloader for Hive if spark.sql.hive.metastore.jars is being set
  • [SPARK-29637] - SHS Endpoint /applications/<app_id>/jobs/ doesn't include description
  • [SPARK-29647] - Use Python 3.7 in GitHub Action to recover lint-python
  • [SPARK-29651] - Incorrect parsing of interval seconds fraction
  • [SPARK-29666] - Release script fails to publish release under dry run mode
  • [SPARK-29682] - Failure when resolving conflicting references in Join:
  • [SPARK-29743] - sample should set needCopyResult to true if its child is
  • [SPARK-29758] - json_tuple truncates fields
  • [SPARK-29781] - Override SBT Jackson-databind dependency like Maven
  • [SPARK-29796] - `HiveExternalCatalogVersionsSuite` should ignore preview release
  • [SPARK-29850] - sort-merge-join an empty table should not memory leak
  • [SPARK-29875] - Avoid using deprecated pyarrow.open_stream API in Spark 2.4.x
  • [SPARK-29890] - Unable to fill na with 0 with duplicate columns
  • [SPARK-29904] - Parse timestamps in microsecond precision by JSON/CSV datasources
  • [SPARK-29918] - RecordBinaryComparator should check endianness when compared by long
  • [SPARK-29932] - lint-r should do non-zero exit in case of errors
  • [SPARK-29949] - JSON/CSV formats timestamps incorrectly
  • [SPARK-29970] - open/close state is not preserved for Timelineview
  • [SPARK-29971] - Multiple possible buffer leaks in TransportFrameDecoder and TransportCipher
  • [SPARK-30030] - Use RegexChecker instead of TokenChecker to check `org.apache.commons.lang.`
  • [SPARK-30050] - analyze table and rename table should not erase the bucketing metadata at hive side
  • [SPARK-30065] - Unable to drop na with duplicate columns
  • [SPARK-30082] - Zeros are being treated as NaNs
  • [SPARK-30129] - New auth engine does not keep client ID in TransportClient after auth
  • [SPARK-30198] - BytesToBytesMap does not grow internal long array as expected
  • [SPARK-30225] - "Stream is corrupted at" exception on reading disk-spilled data of a shuffle operation
  • [SPARK-30238] - hive partition pruning can only support string and integral types
  • [SPARK-30246] - Spark on Yarn External Shuffle Service Memory Leak
  • [SPARK-30263] - Don't log values of ignored non-Spark properties
  • [SPARK-30274] - Avoid BytesToBytesMap lookup hang forever when holding keys reaching max capacity
  • [SPARK-30285] - Fix deadlock between LiveListenerBus#stop and AsyncEventQueue#removeListenerOnError
  • [SPARK-30310] - SparkUncaughtExceptionHandler halts running process unexpectedly
  • [SPARK-30312] - Preserve path permission when truncate table
  • [SPARK-30325] - markPartitionCompleted causes inconsistent task status
  • [SPARK-30333] - Bump jackson-databind to 2.6.7.3
  • [SPARK-30447] - Constant propagation nullability issue
  • [SPARK-30450] - Exclude .git folder for python linter
  • [SPARK-30458] - The Executor Computing Time in Time Line of Stage Page is Wrong
  • [SPARK-30489] - Make build delete pyspark.zip file properly
  • [SPARK-30512] - Use a dedicated boss event group loop in the netty pipeline for external shuffle service
  • [SPARK-30553] - Fix structured-streaming java example error
  • [SPARK-30556] - Copy sparkContext.localproperties to child thread in SubqueryExec.executionContext
  • [SPARK-30572] - Add a fallback Maven repository
  • [SPARK-30633] - Codegen fails when xxHash seed is not an integer
  • [SPARK-30645] - collect() support Unicode characters tests fail on Windows
  • [SPARK-30704] - Use jekyll-redirect-from 0.15.0 instead of the latest

Improvement

  • [SPARK-19147] - Gracefully handle error in task after executor is stopped
  • [SPARK-25392] - [Spark Job History] Inconsistent behaviour for pool details in Spark web UI and history server page
  • [SPARK-26003] - Improve performance in SQLAppStatusListener
  • [SPARK-27122] - YARN test failures in Java 9+
  • [SPARK-27460] - Running slowest test suites in their own forked JVMs for higher parallelism
  • [SPARK-28678] - Specify that start index is 1-based in docstring of pyspark.sql.functions.slice
  • [SPARK-28938] - Move to supported OpenJDK docker image for Kubernetes
  • [SPARK-29011] - Upgrade netty-all to 4.1.39-Final
  • [SPARK-29075] - Add enforcer rule to ban duplicated pom dependency
  • [SPARK-29087] - Use DelegatingServletContextHandler to avoid CCE
  • [SPARK-29159] - Increase ReservedCodeCacheSize to 1G
  • [SPARK-29165] - Set log level of log generated code as ERROR in case of compile error on generated code in UT
  • [SPARK-29247] - HiveClientImpl may log sensitive information
  • [SPARK-29410] - Update Commons BeanUtils to 1.9.4
  • [SPARK-29677] - Upgrade Kinesis Client
  • [SPARK-29820] - Use GitHub Action Cache for `./.m2/repository`
  • [SPARK-29964] - lintr GitHub Action failed due to buggy GnuPG
  • [SPARK-30318] - Bump jetty to 9.3.27.v20190418
  • [SPARK-30339] - Avoid failing twice in function lookup
  • [SPARK-30410] - Calculating size of table having large number of partitions causes flooding logs
  • [SPARK-30601] - Add a Google Maven Central as a primary repository
  • [SPARK-30630] - Deprecate numTrees in GBT at 2.4.5 and remove it at 3.0.0

Test

  • [SPARK-23197] - Flaky test: spark.streaming.ReceiverSuite."receiver_life_cycle"
  • [SPARK-29104] - Fix Flaky Test - PipedRDDSuite. stdin_writer_thread_should_be_exited_when_task_is_finished
  • [SPARK-29286] - UnicodeDecodeError raised when running python tests on arm instance
  • [SPARK-30637] - Upgrade testthat on Jenkins workers to 2.0.0

Task

  • [SPARK-28951] - Add release announce template
  • [SPARK-29073] - Add GitHub Action to branch-2.4 for `Scala-2.11 / Scala-2.12` build
  • [SPARK-29201] - Add Hadoop 2.6 combination to GitHub Action
  • [SPARK-29445] - Bump netty-all from 4.1.39.Final to 4.1.42.Final

Documentation

  • [SPARK-28650] - Fix the guarantee of ForeachWriter
  • [SPARK-28977] - JDBC Dataframe Reader Doc Doesn't Match JDBC Data Source Page
  • [SPARK-29367] - pandas udf not working with latest pyarrow release (0.15.0)
  • [SPARK-29790] - Add notes about port being required for Kubernetes API URL when set as master
  • [SPARK-30236] - Clarify date and time patterns supported by date_format
  • [SPARK-30478] - Fix memory package doc
