Release Notes - Spark - Version 2.4.4

Sub-task

  • [SPARK-27441] - Add read/write tests to Hive serde tables

Bug

  • [SPARK-21882] - OutputMetrics doesn't count written bytes correctly in the saveAsHadoopDataset function
  • [SPARK-24285] - Flaky test: ContinuousSuite.query without test harness
  • [SPARK-25139] - PythonRunner#WriterThread released block after TaskRunner finally block which invokes BlockManager#releaseAllLocksForTask
  • [SPARK-26038] - Decimal toScalaBigInt/toJavaBigInteger not work for decimals not fitting in long
  • [SPARK-26045] - Error in the Spark 2.4 release package with the spark-avro_2.11 dependency
  • [SPARK-26152] - Synchronize Worker Cleanup with Worker Shutdown
  • [SPARK-26555] - Thread safety issue causes createDataset to fail with misleading errors
  • [SPARK-26812] - PushProjectionThroughUnion nullability issue
  • [SPARK-26895] - When running spark 2.3 as a proxy user (--proxy-user), SparkSubmit fails to resolve globs owned by target user
  • [SPARK-26995] - Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when using snappy
  • [SPARK-27018] - Checkpointed RDD deleted prematurely when using GBTClassifier
  • [SPARK-27100] - Use `Array` instead of `Seq` in `FilePartition` to prevent StackOverflowError
  • [SPARK-27159] - Update MsSqlServer dialect handling of BLOB type
  • [SPARK-27234] - Continuous Streaming does not support python UDFs
  • [SPARK-27298] - Dataset except operation gives different results (dataset count) on Spark 2.3.0 Windows and Linux environments
  • [SPARK-27330] - ForeachWriter is not being closed once a batch is aborted
  • [SPARK-27347] - Fix supervised driver retry logic when agent crashes/restarts
  • [SPARK-27416] - UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines have different Oops size
  • [SPARK-27485] - EnsureRequirements.reorder should handle duplicate expressions gracefully
  • [SPARK-27577] - Wrong thresholds selected by BinaryClassificationMetrics when downsampling
  • [SPARK-27596] - The JDBC 'query' option doesn't work for Oracle database
  • [SPARK-27621] - Calling transform() method on a LinearRegressionModel throws NoSuchElementException
  • [SPARK-27624] - Fix CalendarInterval to show an empty interval correctly
  • [SPARK-27626] - Fix `docker-image-tool.sh` to be robust in non-bash shell env
  • [SPARK-27657] - ml.util.Instrumentation.logFailure doesn't log error message
  • [SPARK-27671] - Fix error when casting from a nested null in a struct
  • [SPARK-27711] - InputFileBlockHolder should be unset at the end of tasks
  • [SPARK-27735] - Interval string in upper case is not supported in Trigger
  • [SPARK-27781] - Tried to access method org.apache.avro.specific.SpecificData.<init>()V
  • [SPARK-27798] - ConvertToLocalRelation should tolerate expression reusing output object
  • [SPARK-27858] - Fix for avro deserialization on union types with multiple non-null types
  • [SPARK-27863] - Metadata files and temporary files should not be counted as data files
  • [SPARK-27869] - Redact sensitive information in System Properties from UI
  • [SPARK-27873] - Csv reader, adding a corrupt record column causes error if enforceSchema=false
  • [SPARK-27907] - HiveUDAF should return NULL in case of 0 rows
  • [SPARK-27917] - Semantic equals of CaseWhen is failing with case sensitivity of column Names
  • [SPARK-27992] - PySpark socket server should sync with JVM connection thread future
  • [SPARK-28015] - Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats
  • [SPARK-28025] - HDFSBackedStateStoreProvider should not leak .crc files
  • [SPARK-28058] - Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
  • [SPARK-28081] - word2vec 'large' count value too low for very large corpora
  • [SPARK-28153] - Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)
  • [SPARK-28156] - Join plan sometimes does not use cached query
  • [SPARK-28157] - Make SHS clear KVStore LogInfo for the blacklisted entries
  • [SPARK-28160] - TransportClient.sendRpcSync may hang forever
  • [SPARK-28164] - Usage description does not match shell scripts
  • [SPARK-28302] - SparkLauncher: The process cannot access the file because it is being used by another process
  • [SPARK-28308] - CalendarInterval sub-second part should be padded before parsing
  • [SPARK-28371] - Parquet "starts with" filter is not null-safe
  • [SPARK-28404] - Fix negative timeout value in RateStreamContinuousPartitionReader
  • [SPARK-28430] - Some stage table rows render wrong number of columns if tasks are missing metrics
  • [SPARK-28468] - Upgrade pip to fix `sphinx` install error
  • [SPARK-28489] - KafkaOffsetRangeCalculator.getRanges may drop offsets
  • [SPARK-28582] - PySpark daemon fails to exit when receiving SIGTERM on Python 3.7
  • [SPARK-28606] - Update CRAN key to recover docker image generation
  • [SPARK-28638] - Task summary metrics are wrong when there are running tasks
  • [SPARK-28642] - Hide credentials in show create table
  • [SPARK-28647] - Recover additional metric feature and remove additional-metrics.js
  • [SPARK-28699] - Cache an indeterminate RDD could lead to incorrect result while stage rerun
  • [SPARK-28766] - Fix CRAN incoming feasibility warning on invalid URL
  • [SPARK-28775] - DateTimeUtilsSuite fails for JDKs using the tzdata2018i or newer timezone database
  • [SPARK-28780] - Delete the incorrect setWeightCol method in LinearSVCModel
  • [SPARK-28844] - Fix typo in SQLConf FILE_COMRESSION_FACTOR
  • [SPARK-28868] - Specify Jekyll version to 3.8.6 in release docker image
  • [SPARK-29414] - HasOutputCol param isSet() property is not preserved after persistence
  • [SPARK-29773] - Unable to process empty ORC files in Hive Table using Spark SQL
  • [SPARK-31604] - java.lang.IllegalArgumentException: Frame length should be positive
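Several of the parser fixes above share a theme of lenient parsing silently accepting partial input; SPARK-28015, for example, makes `stringToDate()` reject strings with trailing characters instead of parsing only a prefix. The sketch below is an illustration of that strict-parsing behaviour in plain Python, not Spark's actual implementation:

```python
from datetime import date

def strict_to_date(s: str) -> date:
    """Parse 'yyyy', 'yyyy-[m]m', or 'yyyy-[m]m-[d]d', consuming the
    entire input; any trailing garbage raises instead of being ignored."""
    parts = s.strip().split("-")
    if not 1 <= len(parts) <= 3 or not all(p.isdigit() for p in parts):
        # e.g. '2019-01-01xyz' must fail rather than parse as 2019-01-01
        raise ValueError(f"invalid date string: {s!r}")
    year = int(parts[0])
    month = int(parts[1]) if len(parts) > 1 else 1
    day = int(parts[2]) if len(parts) > 2 else 1
    return date(year, month, day)
```

For example, `strict_to_date("2019-7")` yields `date(2019, 7, 1)`, while `strict_to_date("2019-01-01xyz")` raises `ValueError`.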

New Feature

  • [SPARK-35197] - Accumulators Explore Page on Spark UI on History Server

Improvement

  • [SPARK-24898] - Adding spark.checkpoint.compress to the docs
  • [SPARK-26192] - MesosClusterScheduler reads options from dispatcher conf instead of submission conf
  • [SPARK-27672] - Add since info to string expressions
  • [SPARK-27673] - Add since info to random, regex, null expressions
  • [SPARK-27771] - Add SQL description for grouping functions (cube, rollup, grouping and grouping_id)
  • [SPARK-27794] - Use secure URLs for downloading CRAN artifacts
  • [SPARK-27973] - Streaming sample DirectKafkaWordCount should mention GroupId in usage
  • [SPARK-28154] - Fix double caching in GMM
  • [SPARK-28170] - DenseVector .toArray() and .values documentation do not specify they are aliases
  • [SPARK-28378] - Remove usage of cgi.escape
  • [SPARK-28421] - SparseVector.apply performance optimization
  • [SPARK-28496] - Use branch name instead of tag during dry-run
  • [SPARK-28545] - Add the hash map size to the directional log of ObjectAggregationIterator
  • [SPARK-28564] - Access history application defaults to the last attempt id
  • [SPARK-28649] - Git Ignore does not ignore python/.eggs
  • [SPARK-28713] - Bump checkstyle from 8.14 to 8.23
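To illustrate the kind of change SPARK-28421 describes: MLlib's `SparseVector` stores a sorted `indices` array with a parallel `values` array, so element lookup can use binary search rather than a linear scan. A minimal Python sketch of that idea, not MLlib's actual code:

```python
import bisect

def sparse_apply(size, indices, values, i):
    """Return element i of a sparse vector stored as a sorted `indices`
    array with matching `values`; positions not listed are 0.0."""
    if not 0 <= i < size:
        raise IndexError(i)
    # O(log nnz) binary search over the sorted index array
    pos = bisect.bisect_left(indices, i)
    if pos < len(indices) and indices[pos] == i:
        return values[pos]
    return 0.0
```

For example, `sparse_apply(5, [1, 3], [2.0, 4.0], 3)` returns `4.0`, and lookups at unlisted positions return `0.0` without scanning the whole index array.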

Test

  • [SPARK-24352] - Flaky test: StandaloneDynamicAllocationSuite
  • [SPARK-27168] - Add docker integration test for MsSql Server
  • [SPARK-28031] - Improve or remove doctest on over function of Column
  • [SPARK-28247] - Flaky test: "query without test harness" in ContinuousSuite
  • [SPARK-28261] - Flaky test: org.apache.spark.network.TransportClientFactorySuite.reuseClientsUpToConfigVariable
  • [SPARK-28335] - Flaky test: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.offset recovery from kafka
  • [SPARK-28357] - Fix Flaky Test - FileAppenderSuite.rolling file appender - size-based rolling compressed
  • [SPARK-28361] - Test equality of generated code with id in class name
  • [SPARK-28418] - Flaky Test: pyspark.sql.tests.test_dataframe: test_query_execution_listener_on_collect
  • [SPARK-28535] - Flaky test: JobCancellationSuite."interruptible iterator of shuffle reader"
  • [SPARK-28881] - toPandas with Arrow should not return a DataFrame when the result size exceeds `spark.driver.maxResultSize`
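SPARK-28361 concerns comparing generated code whose class names embed a compiler-assigned numeric id, which makes textually different outputs semantically equal. A hedged sketch of the normalization idea in Python; the `GeneratedClass$NNN` pattern here is an assumption for illustration, not Spark's exact naming scheme:

```python
import re

def normalize_generated(code: str) -> str:
    """Replace compiler-assigned numeric ids (e.g. 'GeneratedClass$1234')
    with a placeholder so two codegen outputs can be compared for
    semantic equality regardless of the id they received."""
    return re.sub(r"(GeneratedClass\$)\d+", r"\1#", code)
```

With this, `normalize_generated("GeneratedClass$12.apply()")` and `normalize_generated("GeneratedClass$99.apply()")` compare equal.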

Umbrella

  • [SPARK-27726] - Performance of InMemoryStore suffers under load

Documentation

  • [SPARK-27800] - Example for xor function has a wrong answer
  • [SPARK-28464] - Document kafka minPartitions option in "Structured Streaming + Kafka Integration Guide"
  • [SPARK-28609] - Fix broken styles/links and make up-to-date
  • [SPARK-28777] - PySpark SQL function "format_string" has the wrong parameters in its docstring
  • [SPARK-28871] - Some code snippets in 'Policy for handling multiple watermarks' do not render correctly
