Sub-task
- [SPARK-27441] - Add read/write tests to Hive serde tables
Bug
- [SPARK-21882] - OutputMetrics doesn't count written bytes correctly in the saveAsHadoopDataset function
- [SPARK-24285] - Flaky test: ContinuousSuite.query without test harness
- [SPARK-25139] - PythonRunner#WriterThread released block after the TaskRunner finally block, which invokes BlockManager#releaseAllLocksForTask
- [SPARK-26038] - Decimal toScalaBigInt/toJavaBigInteger not work for decimals not fitting in long
- [SPARK-26045] - Error in the Spark 2.4 release package with the spark-avro_2.11 dependency
- [SPARK-26152] - Synchronize Worker Cleanup with Worker Shutdown
- [SPARK-26555] - Thread safety issue causes createDataset to fail with misleading errors
- [SPARK-26812] - PushProjectionThroughUnion nullability issue
- [SPARK-26895] - When running spark 2.3 as a proxy user (--proxy-user), SparkSubmit fails to resolve globs owned by target user
- [SPARK-26995] - Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when using snappy
- [SPARK-27018] - Checkpointed RDD deleted prematurely when using GBTClassifier
- [SPARK-27100] - Use `Array` instead of `Seq` in `FilePartition` to prevent StackOverflowError
- [SPARK-27159] - Update MsSqlServer dialect handling of BLOB type
- [SPARK-27234] - Continuous Streaming does not support python UDFs
- [SPARK-27298] - Dataset except operation gives different results (dataset count) on Spark 2.3.0 Windows and Linux environments
- [SPARK-27330] - ForeachWriter is not being closed once a batch is aborted
- [SPARK-27347] - Fix supervised driver retry logic when agent crashes/restarts
- [SPARK-27416] - UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines have different Oops size
- [SPARK-27485] - EnsureRequirements.reorder should handle duplicate expressions gracefully
- [SPARK-27577] - Wrong thresholds selected by BinaryClassificationMetrics when downsampling
- [SPARK-27596] - The JDBC 'query' option doesn't work for Oracle database
- [SPARK-27621] - Calling transform() method on a LinearRegressionModel throws NoSuchElementException
- [SPARK-27624] - Fix CalendarInterval to show an empty interval correctly
- [SPARK-27626] - Fix `docker-image-tool.sh` to be robust in non-bash shell env
- [SPARK-27657] - ml.util.Instrumentation.logFailure doesn't log error message
- [SPARK-27671] - Fix error when casting from a nested null in a struct
- [SPARK-27711] - InputFileBlockHolder should be unset at the end of tasks
- [SPARK-27735] - Interval string in upper case is not supported in Trigger
- [SPARK-27781] - Tried to access method org.apache.avro.specific.SpecificData.<init>()V
- [SPARK-27798] - ConvertToLocalRelation should tolerate expression reusing output object
- [SPARK-27858] - Fix for avro deserialization on union types with multiple non-null types
- [SPARK-27863] - Metadata files and temporary files should not be counted as data files
- [SPARK-27869] - Redact sensitive information in System Properties from UI
- [SPARK-27873] - CSV reader: adding a corrupt record column causes an error if enforceSchema=false
- [SPARK-27907] - HiveUDAF should return NULL in case of 0 rows
- [SPARK-27917] - Semantic equals of CaseWhen is failing with case sensitivity of column Names
- [SPARK-27992] - PySpark socket server should sync with JVM connection thread future
- [SPARK-28015] - Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats
- [SPARK-28025] - HDFSBackedStateStoreProvider should not leak .crc files
- [SPARK-28058] - Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
- [SPARK-28081] - word2vec 'large' count value too low for very large corpora
- [SPARK-28153] - Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)
- [SPARK-28156] - Join plan sometimes does not use cached query
- [SPARK-28157] - Make SHS clear KVStore LogInfo for the blacklisted entries
- [SPARK-28160] - TransportClient.sendRpcSync may hang forever
- [SPARK-28164] - Usage description does not match shell scripts
- [SPARK-28302] - SparkLauncher: The process cannot access the file because it is being used by another process
- [SPARK-28308] - CalendarInterval sub-second part should be padded before parsing
- [SPARK-28371] - Parquet "starts with" filter is not null-safe
- [SPARK-28404] - Fix negative timeout value in RateStreamContinuousPartitionReader
- [SPARK-28430] - Some stage table rows render the wrong number of columns if tasks are missing metrics
- [SPARK-28468] - Upgrade pip to fix `sphinx` install error
- [SPARK-28489] - KafkaOffsetRangeCalculator.getRanges may drop offsets
- [SPARK-28582] - Pyspark daemon exit failed when receive SIGTERM on py3.7
- [SPARK-28606] - Update CRAN key to recover docker image generation
- [SPARK-28638] - Task summary metrics are wrong when there are running tasks
- [SPARK-28642] - Hide credentials in show create table
- [SPARK-28647] - Recover additional metric feature and remove additional-metrics.js
- [SPARK-28699] - Cache an indeterminate RDD could lead to incorrect result while stage rerun
- [SPARK-28766] - Fix CRAN incoming feasibility warning on invalid URL
- [SPARK-28775] - DateTimeUtilsSuite fails for JDKs using the tzdata2018i or newer timezone database
- [SPARK-28780] - Delete the incorrect setWeightCol method in LinearSVCModel
- [SPARK-28844] - Fix typo in SQLConf FILE_COMRESSION_FACTOR
- [SPARK-28868] - Specify Jekyll version to 3.8.6 in release docker image
- [SPARK-29414] - HasOutputCol param isSet() property is not preserved after persistence
- [SPARK-29773] - Unable to process empty ORC files in Hive Table using Spark SQL
- [SPARK-31604] - java.lang.IllegalArgumentException: Frame length should be positive
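Several of the fixes above concern subtle parsing edge cases. SPARK-28308, for instance, pads the sub-second part of an interval string before parsing it. The actual fix lives in Spark's Java/Scala CalendarInterval parser; the following is only a minimal Python sketch of the padding idea, with a hypothetical helper name:

```python
def parse_subsecond_micros(frac: str) -> int:
    """Convert the fractional-second digits of an interval string
    to microseconds.

    Right-padding to 6 digits ensures that "0.5" seconds reads as
    500000 microseconds; parsing the raw digits would yield 5.
    """
    return int(frac.ljust(6, "0")[:6])

print(parse_subsecond_micros("5"))       # 500000 (i.e. 0.5 s)
print(parse_subsecond_micros("000123"))  # 123    (i.e. 0.000123 s)
```

Without the padding step, `interval '0.5' second` and `interval '0.000005' second` would parse to the same value, which is the bug the patch addresses.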
New Feature
- [SPARK-35197] - Accumulators Explore Page on Spark UI on History Server
Improvement
- [SPARK-24898] - Adding spark.checkpoint.compress to the docs
- [SPARK-26192] - MesosClusterScheduler reads options from dispatcher conf instead of submission conf
- [SPARK-27672] - Add since info to string expressions
- [SPARK-27673] - Add since info to random, regex, null expressions
- [SPARK-27771] - Add SQL description for grouping functions (cube, rollup, grouping and grouping_id)
- [SPARK-27794] - Use secure URLs for downloading CRAN artifacts
- [SPARK-27973] - Streaming sample DirectKafkaWordCount should mention GroupId in usage
- [SPARK-28154] - GMM fix double caching
- [SPARK-28170] - DenseVector .toArray() and .values documentation do not specify they are aliases
- [SPARK-28378] - Remove usage of cgi.escape
- [SPARK-28421] - SparseVector.apply performance optimization
- [SPARK-28496] - Use branch name instead of tag during dry-run
- [SPARK-28545] - Add the hash map size to the directional log of ObjectAggregationIterator
- [SPARK-28564] - Access history application defaults to the last attempt id
- [SPARK-28649] - Git Ignore does not ignore python/.eggs
- [SPARK-28713] - Bump checkstyle from 8.14 to 8.23
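Among the improvements, SPARK-28421 speeds up `SparseVector.apply`. The patch itself is in Scala; the sketch below is only an illustration of the general approach, assuming (as in MLlib's SparseVector) that the index array is kept sorted, so element lookup can use a binary search instead of a linear scan:

```python
import bisect

def sparse_apply(indices, values, i):
    # indices is sorted, so a binary search locates i in
    # O(log nnz) time instead of O(nnz) for a linear scan.
    pos = bisect.bisect_left(indices, i)
    if pos < len(indices) and indices[pos] == i:
        return values[pos]
    return 0.0

# Sparse form of [0, 3.0, 0, 0, 7.5]
print(sparse_apply([1, 4], [3.0, 7.5], 4))  # 7.5
print(sparse_apply([1, 4], [3.0, 7.5], 2))  # 0.0
```

For vectors with many non-zeros, repeated `apply` calls benefit substantially from the logarithmic lookup.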
Test
- [SPARK-24352] - Flaky test: StandaloneDynamicAllocationSuite
- [SPARK-27168] - Add docker integration test for MsSql Server
- [SPARK-28031] - Improve or remove doctest on over function of Column
- [SPARK-28247] - Flaky test: "query without test harness" in ContinuousSuite
- [SPARK-28261] - Flaky test: org.apache.spark.network.TransportClientFactorySuite.reuseClientsUpToConfigVariable
- [SPARK-28335] - Flaky test: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.offset recovery from kafka
- [SPARK-28357] - Fix Flaky Test - FileAppenderSuite.rolling file appender - size-based rolling compressed
- [SPARK-28361] - Test equality of generated code with id in class name
- [SPARK-28418] - Flaky Test: pyspark.sql.tests.test_dataframe: test_query_execution_listener_on_collect
- [SPARK-28535] - Flaky test: JobCancellationSuite."interruptible iterator of shuffle reader"
- [SPARK-28881] - toPandas with Arrow should not return a DataFrame when the result size exceeds `spark.driver.maxResultSize`
Umbrella
- [SPARK-27726] - Performance of InMemoryStore suffers under load
Documentation
- [SPARK-27800] - Example for xor function has a wrong answer
- [SPARK-28464] - Document kafka minPartitions option in "Structured Streaming + Kafka Integration Guide"
- [SPARK-28609] - Fix broken styles/links and make up-to-date
- [SPARK-28777] - Pyspark sql function "format_string" has the wrong parameters in doc string
- [SPARK-28871] - Some code snippets in 'Policy for handling multiple watermarks' do not render properly