Release Notes - Spark - Version 2.4.0

Sub-task

  • [SPARK-6236] - Support caching blocks larger than 2G
  • [SPARK-6237] - Support uploading blocks > 2GB as a stream
  • [SPARK-10884] - Support prediction on single instance for regression and classification related models
  • [SPARK-11239] - PMML export for ML linear regression
  • [SPARK-12850] - Support bucket pruning (predicate pushdown for bucketed tables)
  • [SPARK-14376] - spark.ml parity for trees
  • [SPARK-14540] - Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
  • [SPARK-17091] - Convert IN predicate to equivalent Parquet filter
  • [SPARK-19826] - spark.ml Python API for PIC
  • [SPARK-20114] - spark.ml parity for sequential pattern mining - PrefixSpan
  • [SPARK-21088] - CrossValidator, TrainValidationSplit should collect all models when fitting: Python API
  • [SPARK-21898] - Feature parity for KolmogorovSmirnovTest in MLlib
  • [SPARK-22187] - Update unsaferow format for saved state such that we can set timeouts when state is null
  • [SPARK-22239] - User-defined window functions with pandas udf (unbounded window)
  • [SPARK-22274] - User-defined aggregation functions with pandas udf
  • [SPARK-22362] - Add unit test for Window Aggregate Functions
  • [SPARK-22624] - Expose range partitioning shuffle introduced by SPARK-22614
  • [SPARK-23011] - Support alternative function form with group aggregate pandas UDF
  • [SPARK-23030] - Decrease memory consumption with toPandas() collection using Arrow
  • [SPARK-23046] - Have RFormula include VectorSizeHint in pipeline
  • [SPARK-23096] - Migrate rate source to v2
  • [SPARK-23097] - Migrate text socket source to v2
  • [SPARK-23099] - Migrate foreach sink
  • [SPARK-23120] - Add PMML pipeline export support to PySpark
  • [SPARK-23203] - DataSourceV2 should use immutable trees.
  • [SPARK-23323] - DataSourceV2 should use the output commit coordinator.
  • [SPARK-23325] - DataSourceV2 readers should always produce InternalRow.
  • [SPARK-23341] - DataSourceOptions should handle path and table names to avoid confusion.
  • [SPARK-23344] - Add KMeans distanceMeasure param to PySpark
  • [SPARK-23352] - Explicitly specify supported types in Pandas UDFs
  • [SPARK-23362] - Migrate Kafka microbatch source to v2
  • [SPARK-23380] - Adds a conf for Arrow fallback in toPandas/createDataFrame with Pandas DataFrame
  • [SPARK-23401] - Improve test cases for all supported types and unsupported types
  • [SPARK-23418] - DataSourceV2 should not allow userSpecifiedSchema without ReadSupportWithSchema
  • [SPARK-23491] - continuous symptom
  • [SPARK-23503] - continuous execution should sequence committed epochs
  • [SPARK-23555] - Add BinaryType support for Arrow in PySpark
  • [SPARK-23559] - add epoch ID to data writer factory
  • [SPARK-23577] - Supports line separator for text datasource
  • [SPARK-23581] - Add an interpreted version of GenerateUnsafeProjection
  • [SPARK-23582] - Add interpreted execution to StaticInvoke expression
  • [SPARK-23583] - Add interpreted execution to Invoke expression
  • [SPARK-23584] - Add interpreted execution to NewInstance expression
  • [SPARK-23585] - Add interpreted execution for UnwrapOption expression
  • [SPARK-23586] - Add interpreted execution for WrapOption expression
  • [SPARK-23587] - Add interpreted execution for MapObjects expression
  • [SPARK-23588] - Add interpreted execution for CatalystToExternalMap expression
  • [SPARK-23589] - Add interpreted execution for ExternalMapToCatalyst expression
  • [SPARK-23590] - Add interpreted execution for CreateExternalRow expression
  • [SPARK-23591] - Add interpreted execution for EncodeUsingSerializer expression
  • [SPARK-23592] - Add interpreted execution for DecodeUsingSerializer expression
  • [SPARK-23593] - Add interpreted execution for InitializeJavaBean expression
  • [SPARK-23594] - Add interpreted execution for GetExternalRowField expression
  • [SPARK-23595] - Add interpreted execution for ValidateExternalType expression
  • [SPARK-23596] - Modify Dataset test harness to include interpreted execution
  • [SPARK-23597] - Audit Spark SQL code base for non-interpreted expressions
  • [SPARK-23611] - Extend ExpressionEvalHelper harness to also test failures
  • [SPARK-23615] - Add maxDF Parameter to Python CountVectorizer
  • [SPARK-23633] - Update Pandas UDFs section in sql-programming-guide
  • [SPARK-23687] - Add MemoryStream
  • [SPARK-23688] - Refactor tests away from rate source
  • [SPARK-23690] - VectorAssembler should have handleInvalid to handle columns with null values
  • [SPARK-23706] - spark.conf.get(value, default=None) should produce None in PySpark
  • [SPARK-23711] - Add fallback to interpreted execution logic
  • [SPARK-23713] - Clean-up UnsafeWriter classes
  • [SPARK-23723] - New encoding option for json datasource
  • [SPARK-23724] - Custom record separator for jsons in charsets different from UTF-8
  • [SPARK-23727] - Support DATE predicate push down in parquet
  • [SPARK-23736] - High-order function: concat(array1, array2, ..., arrayN) → array
  • [SPARK-23747] - Add EpochCoordinator unit tests
  • [SPARK-23748] - Support select from temp tables
  • [SPARK-23762] - UTF8StringBuilder uses MemoryBlock
  • [SPARK-23765] - Supports line separator for json datasource
  • [SPARK-23783] - Add new generic export trait for ML pipelines
  • [SPARK-23807] - Add Hadoop 3 profile with relevant POM fix ups
  • [SPARK-23821] - High-order function: flatten(x) → array
  • [SPARK-23826] - TestHiveSparkSession should set default session
  • [SPARK-23847] - Add asc_nulls_first, asc_nulls_last to PySpark
  • [SPARK-23859] - Initial PR for Instrumentation improvements: UUID and logging levels
  • [SPARK-23864] - Add Unsafe* copy methods to UnsafeWriter
  • [SPARK-23870] - Forward RFormula handleInvalid Param to VectorAssembler
  • [SPARK-23871] - add python api for VectorAssembler handleInvalid
  • [SPARK-23900] - format_number udf should take user specified format as argument
  • [SPARK-23902] - Provide an option in months_between UDF to disable rounding-off
  • [SPARK-23903] - Add support for date extract
  • [SPARK-23905] - Add UDF weekday
  • [SPARK-23908] - High-order function: transform(array<T>, function<T, U>) → array<U>
  • [SPARK-23909] - High-order function: filter(array<T>, function<T, boolean>) → array<T>
  • [SPARK-23911] - High-order function: aggregate(array<T>, initialState S, inputFunction<S, T, S>, outputFunction<S, R>) → R
  • [SPARK-23912] - High-order function: array_distinct(x) → array
  • [SPARK-23913] - High-order function: array_intersect(x, y) → array
  • [SPARK-23914] - High-order function: array_union(x, y) → array
  • [SPARK-23915] - High-order function: array_except(x, y) → array
  • [SPARK-23916] - High-order function: array_join(x, delimiter, null_replacement) → varchar
  • [SPARK-23917] - High-order function: array_max(x) → x
  • [SPARK-23918] - High-order function: array_min(x) → x
  • [SPARK-23919] - High-order function: array_position(x, element) → bigint
  • [SPARK-23920] - High-order function: array_remove(x, element) → array
  • [SPARK-23921] - High-order function: array_sort(x) → array
  • [SPARK-23922] - High-order function: arrays_overlap(x, y) → boolean
  • [SPARK-23923] - High-order function: cardinality(x) → bigint
  • [SPARK-23924] - High-order function: element_at
  • [SPARK-23925] - High-order function: repeat(element, count) → array
  • [SPARK-23926] - High-order function: reverse(x) → array
  • [SPARK-23927] - High-order function: sequence
  • [SPARK-23928] - High-order function: shuffle(x) → array
  • [SPARK-23930] - High-order function: slice(x, start, length) → array
  • [SPARK-23931] - High-order function: array_zip(array1, array2[, ...]) → array<row>
  • [SPARK-23932] - High-order function: zip_with(array<T>, array<U>, function<T, U, R>) → array<R>
  • [SPARK-23933] - High-order function: map(array<K>, array<V>) → map<K,V>
  • [SPARK-23934] - High-order function: map_from_entries(array<row<K, V>>) → map<K,V>
  • [SPARK-23936] - High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>
  • [SPARK-23942] - PySpark's collect doesn't trigger QueryExecutionListener
  • [SPARK-23990] - Instruments logging improvements - ML regression package
  • [SPARK-24026] - spark.ml Scala/Java API for PIC
  • [SPARK-24038] - refactor continuous write exec to its own class
  • [SPARK-24039] - remove restarting iterators hack
  • [SPARK-24040] - support single partition aggregates
  • [SPARK-24054] - Add array_position function / element_at functions
  • [SPARK-24069] - Add array_max / array_min functions
  • [SPARK-24070] - TPC-DS Performance Tests for Parquet 1.10.0 Upgrade
  • [SPARK-24071] - Micro-benchmark of Parquet Filter Pushdown
  • [SPARK-24073] - DataSourceV2: Rename DataReaderFactory to InputPartition.
  • [SPARK-24115] - improve instrumentation for spark.ml.tuning
  • [SPARK-24119] - Add interpreted execution to SortPrefix expression
  • [SPARK-24132] - Instrumentation improvement for classification
  • [SPARK-24146] - spark.ml parity for sequential pattern mining - PrefixSpan: Python API
  • [SPARK-24155] - Instrumentation improvement for clustering
  • [SPARK-24157] - Enable no-data micro batches for streaming aggregation and deduplication
  • [SPARK-24158] - Enable no-data micro batches for streaming joins
  • [SPARK-24159] - Enable no-data micro batches for streaming mapGroupsWithState
  • [SPARK-24185] - add flatten function
  • [SPARK-24186] - add array_reverse and concat
  • [SPARK-24187] - add array_join
  • [SPARK-24197] - add array_sort function
  • [SPARK-24198] - add slice function
  • [SPARK-24234] - create the bottom-of-task RDD with row buffer
  • [SPARK-24235] - create the top-of-task RDD sending rows to the remote buffer
  • [SPARK-24251] - DataSourceV2: Add AppendData logical operation
  • [SPARK-24290] - Instrumentation Improvement: add logNamedValue taking Array types
  • [SPARK-24296] - Support replicating blocks larger than 2 GB
  • [SPARK-24297] - Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB
  • [SPARK-24307] - Support sending messages over 2GB from memory
  • [SPARK-24310] - Instrumentation for frequent pattern mining
  • [SPARK-24324] - Pandas Grouped Map UserDefinedFunction mixes column labels
  • [SPARK-24325] - Tests for Hadoop's LinesReader
  • [SPARK-24331] - Add arrays_overlap / array_repeat / map_entries
  • [SPARK-24334] - Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory allocator
  • [SPARK-24386] - implement continuous processing coalesce(1)
  • [SPARK-24418] - Upgrade to Scala 2.11.12
  • [SPARK-24419] - Upgrade SBT to 0.13.17 with Scala 2.10.7
  • [SPARK-24420] - Upgrade ASM to 6.x to support JDK9+
  • [SPARK-24439] - Add distanceMeasure to BisectingKMeans in PySpark
  • [SPARK-24478] - DataSourceV2 should push filters and projection at physical plan conversion
  • [SPARK-24535] - Fix java version parsing in SparkR on Windows
  • [SPARK-24537] - Add array_remove / array_zip / map_from_arrays / array_distinct
  • [SPARK-24549] - Support DecimalType push down to the parquet data sources
  • [SPARK-24624] - Can not mix vectorized and non-vectorized UDFs
  • [SPARK-24638] - StringStartsWith support push down
  • [SPARK-24706] - Support ByteType and ShortType pushdown to parquet
  • [SPARK-24716] - Refactor ParquetFilters
  • [SPARK-24718] - Timestamp support pushdown to parquet data source
  • [SPARK-24771] - Upgrade AVRO version from 1.7.7 to 1.8.2
  • [SPARK-24772] - support reading AVRO logical types - Date
  • [SPARK-24773] - support reading AVRO logical types - Timestamp with different precisions
  • [SPARK-24774] - support reading AVRO logical types - Decimal
  • [SPARK-24776] - AVRO unit test: use SQLTestUtils and Replace deprecated methods
  • [SPARK-24777] - Add write benchmark for AVRO
  • [SPARK-24800] - Refactor Avro Serializer and Deserializer
  • [SPARK-24805] - Don't ignore files without .avro extension by default
  • [SPARK-24810] - Fix paths to resource files in AvroSuite
  • [SPARK-24811] - Add function `from_avro` and `to_avro`
  • [SPARK-24836] - New option - ignoreExtension
  • [SPARK-24854] - Gather all options into AvroOptions
  • [SPARK-24876] - Simplify schema serialization
  • [SPARK-24881] - New options - compression and compressionLevel
  • [SPARK-24883] - Remove implicit class AvroDataFrameWriter/AvroDataFrameReader
  • [SPARK-24887] - Use SerializableConfiguration in Spark util
  • [SPARK-24924] - Add mapping for built-in Avro data source
  • [SPARK-24967] - Use internal.Logging instead for logging
  • [SPARK-24971] - remove SupportsDeprecatedScanRow
  • [SPARK-24976] - Allow None for Decimal type conversion (specific to PyArrow 0.9.0)
  • [SPARK-24990] - merge ReadSupport and ReadSupportWithSchema
  • [SPARK-24991] - use InternalRow in DataSourceWriter
  • [SPARK-25002] - Avro: revise the output record namespace
  • [SPARK-25007] - Add array_intersect / array_except / array_union / array_shuffle to SparkR
  • [SPARK-25029] - Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
  • [SPARK-25044] - Address translation of LMF closure primitive args to Object in Scala 2.12
  • [SPARK-25047] - Can't assign SerializedLambda to scala.Function1 in deserialization of BucketedRandomProjectionLSHModel
  • [SPARK-25068] - High-order function: exists(array<T>, function<T, boolean>) → boolean
  • [SPARK-25099] - Generate Avro Binary files in test suite
  • [SPARK-25104] - Validate user specified output schema
  • [SPARK-25127] - DataSourceV2: Remove SupportsPushDownCatalystFilters
  • [SPARK-25133] - Documentation: AVRO data source guide
  • [SPARK-25160] - Remove sql configuration spark.sql.avro.outputTimestampType
  • [SPARK-25179] - Document the features that require Pyarrow 0.10
  • [SPARK-25207] - Case-insensitive field resolution for filter pushdown when reading Parquet
  • [SPARK-25256] - Plan mismatch errors in Hive tests in 2.12
  • [SPARK-25298] - spark-tools build failure for Scala 2.12
  • [SPARK-25304] - enable HiveSparkSubmitSuite SPARK-8489 test for Scala 2.12
  • [SPARK-25320] - ML, Graph 2.4 QA: API: Binary incompatible changes
  • [SPARK-25321] - ML, Graph 2.4 QA: API: New Scala APIs, docs
  • [SPARK-25324] - ML 2.4 QA: API: Java compatibility, docs
  • [SPARK-25328] - Add an example for having two columns as the grouping key in group aggregate pandas UDF
  • [SPARK-25337] - HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)
  • [SPARK-25460] - DataSourceV2: Structured Streaming does not respect SessionConfigSupport
  • [SPARK-25572] - SparkR tests failed on CRAN on Java 10
  • [SPARK-25601] - Register Grouped aggregate UDF Vectorized UDFs for SQL Statement
  • [SPARK-25690] - Analyzer rule "HandleNullInputsForUDF" does not stabilize and can be applied infinitely
  • [SPARK-25718] - Detect recursive reference in Avro schema and throw exception
  • [SPARK-25842] - Deprecate APIs introduced in SPARK-21608

Bug

  • [SPARK-6951] - History server slow startup if the event log directory is large
  • [SPARK-10878] - Race condition when resolving Maven coordinates via Ivy
  • [SPARK-15125] - CSV data source recognizes empty quoted strings in the input as null.
  • [SPARK-15750] - Constructing FPGrowth fails when no numPartitions specified in pyspark
  • [SPARK-16451] - Spark-shell / pyspark should finish gracefully when "SaslException: GSS initiate failed" is hit
  • [SPARK-17088] - IsolatedClientLoader fails to load Hive client when sharesHadoopClasses is false
  • [SPARK-17147] - Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
  • [SPARK-17166] - CTAS lost table properties after conversion to data source tables.
  • [SPARK-17756] - java.lang.ClassCastException when using cartesian with DStream.transform
  • [SPARK-17916] - CSV data source treats empty string as null no matter what nullValue option is
  • [SPARK-18371] - Spark Streaming backpressure bug - generates a batch with large number of records
  • [SPARK-18630] - PySpark ML memory leak
  • [SPARK-19181] - SparkListenerSuite.local metrics fails when average executorDeserializeTime is too short.
  • [SPARK-19185] - ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
  • [SPARK-19613] - Flaky test: StateStoreRDDSuite
  • [SPARK-20947] - Encoding/decoding issue in PySpark pipe implementation
  • [SPARK-21168] - KafkaRDD should always set kafka clientId.
  • [SPARK-21402] - Fix java array of structs deserialization
  • [SPARK-21479] - Outer join filter pushdown in null supplying table when condition is on one of the joined columns
  • [SPARK-21525] - ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL
  • [SPARK-21673] - Spark local directory is not set correctly
  • [SPARK-21685] - Params isSet in scala Transformer triggered by _setDefault in pyspark
  • [SPARK-21743] - top-most limit should not cause memory leak
  • [SPARK-21811] - Inconsistency when finding the widest common type of a combination of DateType, StringType, and NumericType
  • [SPARK-21896] - Stack Overflow when window function nested inside aggregate function
  • [SPARK-21945] - pyspark --py-files doesn't work in yarn client mode
  • [SPARK-22151] - PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
  • [SPARK-22279] - Turn on spark.sql.hive.convertMetastoreOrc by default
  • [SPARK-22297] - Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts conf"
  • [SPARK-22357] - SparkContext.binaryFiles ignore minPartitions parameter
  • [SPARK-22371] - dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
  • [SPARK-22384] - Refine partition pruning when attribute is wrapped in Cast
  • [SPARK-22430] - Unknown tag warnings when building R docs with Roxygen 6.0.1
  • [SPARK-22577] - executor page blacklist status should update with TaskSet level blacklisting
  • [SPARK-22606] - There may be two or more tasks in one executor will use the same kafka consumer at the same time, then it will throw an exception: "KafkaConsumer is not safe for multi-threaded access"
  • [SPARK-22676] - Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true
  • [SPARK-22713] - OOM caused by the memory contention and memory leak in TaskMemoryManager
  • [SPARK-22809] - pyspark is sensitive to imports with dots
  • [SPARK-22949] - Reduce memory requirement for TrainValidationSplit
  • [SPARK-22968] - java.lang.IllegalStateException: No current assignment for partition kssh-2
  • [SPARK-22974] - CountVectorModel does not attach attributes to output column
  • [SPARK-23004] - Structured Streaming raises "IllegalStateException: Cannot remove after already committed or aborted"
  • [SPARK-23007] - Add schema evolution test suite for file-based data sources
  • [SPARK-23020] - Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
  • [SPARK-23028] - Bump master branch version to 2.4.0-SNAPSHOT
  • [SPARK-23038] - Update docker/spark-test (JDK/OS)
  • [SPARK-23042] - Use OneHotEncoderModel to encode labels in MultilayerPerceptronClassifier
  • [SPARK-23044] - merge script has bug when assigning jiras to non-contributors
  • [SPARK-23059] - Correct some improper with view related method usage
  • [SPARK-23088] - History server not showing incomplete/running applications
  • [SPARK-23094] - Json Readers choose wrong encoding when bad records are present and fail
  • [SPARK-23152] - Invalid guard condition in org.apache.spark.ml.classification.Classifier
  • [SPARK-23173] - from_json can produce nulls for fields which are marked as non-nullable
  • [SPARK-23189] - reflect stage level blacklisting on executor tab
  • [SPARK-23200] - Reset configuration when restarting from checkpoints
  • [SPARK-23240] - PythonWorkerFactory issues unhelpful message when pyspark.daemon produces bogus stdout
  • [SPARK-23243] - Shuffle+Repartition on an RDD could lead to incorrect answers
  • [SPARK-23271] - Parquet output contains only "_SUCCESS" file after empty DataFrame saving
  • [SPARK-23288] - Incorrect number of written records in structured streaming
  • [SPARK-23291] - SparkR: substr: In SparkR dataframe, starting and ending position arguments in "substr" give wrong result when the position is greater than 1
  • [SPARK-23306] - Race condition in TaskMemoryManager
  • [SPARK-23340] - Upgrade Apache ORC to 1.4.3
  • [SPARK-23355] - convertMetastore should not ignore table properties
  • [SPARK-23361] - Driver restart fails if it happens after 7 days from app submission
  • [SPARK-23365] - DynamicAllocation with failure in straggler task can lead to a hung spark job
  • [SPARK-23377] - Bucketizer with multiple columns persistence bug
  • [SPARK-23394] - Storage info's Cached Partitions doesn't consider the replications (but sc.getRDDStorageInfo does)
  • [SPARK-23405] - The task will hang up when a small table left semi join a big table
  • [SPARK-23406] - Stream-stream self joins do not work
  • [SPARK-23408] - Flaky test: StreamingOuterJoinSuite.left outer early state exclusion on right
  • [SPARK-23415] - BufferHolderSparkSubmitSuite is flaky
  • [SPARK-23416] - Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false
  • [SPARK-23417] - pyspark tests give wrong sbt instructions
  • [SPARK-23425] - load data for hdfs file path with wild card usage is not working properly
  • [SPARK-23433] - java.lang.IllegalStateException: more than one active taskSet for stage
  • [SPARK-23434] - Spark should not warn `metadata directory` for a HDFS file path
  • [SPARK-23436] - Incorrect Date column Inference in partition discovery
  • [SPARK-23438] - DStreams could lose blocks with WAL enabled when driver crashes
  • [SPARK-23449] - Extra java options lose order in Docker context
  • [SPARK-23457] - Register task completion listeners first for ParquetFileFormat
  • [SPARK-23459] - Improve the error message when unknown column is specified in partition columns
  • [SPARK-23461] - vignettes should include model predictions for some ML models
  • [SPARK-23462] - Improve the error message in `StructType`
  • [SPARK-23476] - Spark will not start in local mode with authentication on
  • [SPARK-23486] - LookupFunctions should not check the same function name more than once
  • [SPARK-23489] - Flaky Test: HiveExternalCatalogVersionsSuite
  • [SPARK-23490] - Check storage.locationUri with existing table in CreateTable
  • [SPARK-23496] - Locality of coalesced partitions can be severely skewed by the order of input partitions
  • [SPARK-23508] - blockManagerIdCache in BlockManagerId may cause oom
  • [SPARK-23514] - Replace spark.sparkContext.hadoopConfiguration by spark.sessionState.newHadoopConf()
  • [SPARK-23522] - pyspark should always use sys.exit rather than exit
  • [SPARK-23523] - Incorrect result caused by the rule OptimizeMetadataOnlyQuery
  • [SPARK-23524] - Big local shuffle blocks should not be checked for corruption.
  • [SPARK-23525] - ALTER TABLE CHANGE COLUMN COMMENT doesn't work for external hive table
  • [SPARK-23547] - Cleanup the .pipeout file when the Hive Session closed
  • [SPARK-23549] - Spark SQL unexpected behavior when comparing timestamp to date
  • [SPARK-23551] - Exclude `hadoop-mapreduce-client-core` dependency from `orc-mapreduce`
  • [SPARK-23569] - pandas_udf does not work with type-annotated python functions
  • [SPARK-23570] - Add Spark-2.3 in HiveExternalCatalogVersionsSuite
  • [SPARK-23574] - SinglePartition in data source V2 scan
  • [SPARK-23598] - WholeStageCodegen can lead to IllegalAccessError calling append for HashAggregateExec
  • [SPARK-23599] - The UUID() expression is too non-deterministic
  • [SPARK-23602] - PrintToStderr should behave the same in interpreted mode
  • [SPARK-23608] - SHS needs synchronization between attachSparkUI and detachSparkUI functions
  • [SPARK-23614] - Union produces incorrect results when caching is used
  • [SPARK-23618] - docker-image-tool.sh Fails While Building Image
  • [SPARK-23620] - Split thread dump lines by using the br tag
  • [SPARK-23623] - Avoid concurrent use of cached KafkaConsumer in CachedKafkaConsumer (kafka-0-10-sql)
  • [SPARK-23630] - Spark-on-YARN missing user customizations of hadoop config
  • [SPARK-23635] - Spark executor env variable is overwritten by same name AM env variable
  • [SPARK-23636] - [SPARK 2.2] | Kafka Consumer | KafkaUtils.createRDD throws Exception - java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access
  • [SPARK-23637] - Yarn might allocate more resource if a same executor is killed multiple times.
  • [SPARK-23639] - SparkSQL CLI fails to talk to Kerberized metastore when using proxy user
  • [SPARK-23640] - Hadoop config may override spark config
  • [SPARK-23649] - CSV schema inferring fails on some UTF-8 chars
  • [SPARK-23658] - InProcessAppHandle uses the wrong class in getLogger
  • [SPARK-23660] - Yarn throws exception in cluster mode when the application is small
  • [SPARK-23663] - Spark Streaming Kafka 010 fails with "java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access"
  • [SPARK-23666] - Undeterministic column name with UDFs
  • [SPARK-23670] - Memory leak of SparkPlanGraphWrapper in sparkUI
  • [SPARK-23671] - SHS is ignoring number of replay threads
  • [SPARK-23679] - uiWebUrl shows improper URL when running on YARN
  • [SPARK-23680] - entrypoint.sh does not accept arbitrary UIDs, returning as an error
  • [SPARK-23682] - Memory issue with Spark structured streaming
  • [SPARK-23697] - Accumulators of Spark 1.x no longer work with Spark 2.x
  • [SPARK-23698] - Spark code contains numerous undefined names in Python 3
  • [SPARK-23729] - Glob resolution breaks remote naming of files/archives
  • [SPARK-23731] - FileSourceScanExec throws NullPointerException in subexpression elimination
  • [SPARK-23732] - Broken link to scala source code in Spark Scala api Scaladoc
  • [SPARK-23743] - IsolatedClientLoader.isSharedClass returns an unintended result against `slf4j` keyword
  • [SPARK-23754] - StopIteration exception in Python UDF results in partial result
  • [SPARK-23759] - Unable to bind Spark UI to specific host name / IP
  • [SPARK-23760] - CodegenContext.withSubExprEliminationExprs should save/restore CSE state correctly
  • [SPARK-23775] - Flaky test: DataFrameRangeSuite
  • [SPARK-23778] - SparkContext.emptyRDD confuses SparkContext.union
  • [SPARK-23780] - Failed to use googleVis library with new SparkR
  • [SPARK-23785] - LauncherBackend doesn't check state of connection before setting state
  • [SPARK-23786] - CSV schema validation - column names are not checked
  • [SPARK-23787] - SparkSubmitSuite::"download remote resource if it is not supported by yarn" fails on Hadoop 2.9
  • [SPARK-23788] - Race condition in StreamingQuerySuite
  • [SPARK-23794] - UUID() should be stateful
  • [SPARK-23799] - [CBO] FilterEstimation.evaluateInSet produces division by zero in a case of empty table with analyzed statistics
  • [SPARK-23802] - PropagateEmptyRelation can leave query plan in unresolved state
  • [SPARK-23806] - Broadcast.unpersist can cause fatal exception when used with dynamic allocation
  • [SPARK-23808] - Test spark sessions should set default session
  • [SPARK-23809] - Active SparkSession should be set by getOrCreate
  • [SPARK-23815] - Spark writer dynamic partition overwrite mode fails to write output on multi level partition
  • [SPARK-23816] - FetchFailedException when killing speculative task
  • [SPARK-23823] - ResolveReferences loses correct origin
  • [SPARK-23825] - [K8s] Spark pods should request memory + memoryOverhead as resources
  • [SPARK-23827] - StreamingJoinExec should ensure that input data is partitioned into specific number of partitions
  • [SPARK-23829] - spark-sql-kafka source in spark 2.3 causes reading stream failure frequently
  • [SPARK-23834] - Flaky test: LauncherServerSuite.testAppHandleDisconnect
  • [SPARK-23835] - When Dataset.as converts column from nullable to non-nullable type, null Doubles are converted silently to -1
  • [SPARK-23850] - We should not redact username|user|url from UI by default
  • [SPARK-23852] - Parquet MR bug can lead to incorrect SQL results
  • [SPARK-23853] - Skip doctests which require hive support built in PySpark
  • [SPARK-23857] - In mesos cluster mode spark submit requires the keytab to be available on the local file system.
  • [SPARK-23868] - Fix scala.MatchError in literals.sql.out
  • [SPARK-23882] - Is UTF8StringSuite.writeToOutputStreamUnderflow() supported?
  • [SPARK-23888] - speculative task should not run on a given host where another attempt is already running on
  • [SPARK-23893] - Possible overflow in long = int * int
  • [SPARK-23941] - Mesos task failed on specific spark app name
  • [SPARK-23951] - Use java classes in ExprValue and simplify a bunch of stuff
  • [SPARK-23971] - Should not leak Spark sessions across test suites
  • [SPARK-23975] - Allow Clustering to take Arrays of Double as input features
  • [SPARK-23976] - UTF8String.concat() or ByteArray.concat() may allocate shorter structure.
  • [SPARK-23986] - CompileException when using too many avg aggregation after joining
  • [SPARK-23989] - When using `SortShuffleWriter`, the data will be overwritten
  • [SPARK-23991] - data loss when allocateBlocksToBatch
  • [SPARK-23997] - Configurable max number of buckets
  • [SPARK-24002] - Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes
  • [SPARK-24007] - EqualNullSafe for FloatType and DoubleType might generate a wrong result by codegen.
  • [SPARK-24012] - Union of map and other compatible column
  • [SPARK-24013] - ApproximatePercentile grinds to a halt on sorted input.
  • [SPARK-24021] - Fix bug in BlacklistTracker's updateBlacklistForFetchFailure
  • [SPARK-24022] - Flaky test: SparkContextSuite
  • [SPARK-24033] - LAG Window function broken in Spark 2.3
  • [SPARK-24043] - InterpretedPredicate.eval fails if expression tree contains Nondeterministic expressions
  • [SPARK-24050] - StreamingQuery does not calculate input / processing rates in some cases
  • [SPARK-24056] - Make consumer creation lazy in Kafka source for Structured streaming
  • [SPARK-24061] - [SS]TypedFilter is not supported in Continuous Processing
  • [SPARK-24062] - SASL encryption does not work in ThriftServer
  • [SPARK-24068] - CSV schema inferring doesn't work for compressed files
  • [SPARK-24076] - very bad performance when shuffle.partition = 8192
  • [SPARK-24085] - Scalar subquery error
  • [SPARK-24104] - SQLAppStatusListener overwrites metrics onDriverAccumUpdates instead of updating them
  • [SPARK-24107] - ChunkedByteBuffer.writeFully method has not reset the limit value
  • [SPARK-24108] - ChunkedByteBuffer.writeFully method has not reset the limit value
  • [SPARK-24110] - Avoid calling UGI loginUserFromKeytab in ThriftServer
  • [SPARK-24123] - Fix a flaky test `DateTimeUtilsSuite.monthsBetween`
  • [SPARK-24133] - Reading Parquet files containing large strings can fail with java.lang.ArrayIndexOutOfBoundsException
  • [SPARK-24137] - [K8s] Mount temporary directories in emptydir volumes
  • [SPARK-24141] - Fix bug in CoarseGrainedSchedulerBackend.killExecutors
  • [SPARK-24143] - filter empty blocks when convert mapstatus to (blockId, size) pair
  • [SPARK-24151] - CURRENT_DATE, CURRENT_TIMESTAMP incorrectly resolved as column names when caseSensitive is enabled
  • [SPARK-24165] - UDF within when().otherwise() raises NullPointerException
  • [SPARK-24166] - InMemoryTableScanExec should not access SQLConf at executor side
  • [SPARK-24167] - ParquetFilters should not access SQLConf at executor side
  • [SPARK-24168] - WindowExec should not access SQLConf at executor side
  • [SPARK-24169] - JsonToStructs should not access SQLConf at executor side
  • [SPARK-24190] - lineSep shouldn't be required in JSON write
  • [SPARK-24195] - sc.addFile for local:/ path is broken
  • [SPARK-24214] - StreamingRelationV2/StreamingExecutionRelation/ContinuousExecutionRelation.toJSON should not fail
  • [SPARK-24216] - Spark TypedAggregateExpression uses getSimpleName that is not safe in scala
  • [SPARK-24228] - Fix the lint error
  • [SPARK-24230] - Parquet 1.10 upgrade has errors in the vectorized reader
  • [SPARK-24241] - Do not fail fast when dynamic resource allocation enabled with 0 executor
  • [SPARK-24255] - Require Java 8 in SparkR description
  • [SPARK-24257] - LongToUnsafeRowMap may calculate the new size incorrectly
  • [SPARK-24259] - ArrayWriter for Arrow produces wrong output
  • [SPARK-24263] - SparkR java check breaks on openjdk
  • [SPARK-24276] - semanticHash() returns different values for semantically the same IS IN
  • [SPARK-24294] - Throw SparkException when OOM in BroadcastExchangeExec
  • [SPARK-24300] - generateLDAData in ml.cluster.LDASuite didn't set seed correctly
  • [SPARK-24309] - AsyncEventQueue should handle an interrupt from a Listener
  • [SPARK-24313] - Collection functions interpreted execution doesn't work with complex types
  • [SPARK-24319] - run-example can not print usage
  • [SPARK-24322] - Upgrade Apache ORC to 1.4.4
  • [SPARK-24341] - Codegen compile error from predicate subquery
  • [SPARK-24348] - scala.MatchError in the "element_at" expression
  • [SPARK-24350] - ClassCastException in "array_position" function
  • [SPARK-24351] - offsetLog/commitLog purge thresholdBatchId should be computed with current committed epoch but not currentBatchId in CP mode
  • [SPARK-24364] - Files deletion after globbing may fail StructuredStreaming jobs
  • [SPARK-24368] - Flaky tests: org.apache.spark.sql.execution.datasources.csv.UnivocityParserSuite
  • [SPARK-24369] - A bug when having multiple distinct aggregations
  • [SPARK-24373] - "df.cache() df.count()" no longer eagerly caches data when the analyzed plans are different after re-analyzing the plans
  • [SPARK-24377] - Make --py-files work in non pyspark application
  • [SPARK-24380] - argument quoting/escaping broken in mesos cluster scheduler
  • [SPARK-24384] - spark-submit --py-files with .py files doesn't work in client mode before context initialization
  • [SPARK-24385] - Trivially-true EqualNullSafe should be handled like EqualTo in Dataset.join
  • [SPARK-24391] - from_json should support arrays of primitives, and more generally all JSON
  • [SPARK-24414] - Stages page doesn't show all task attempts when failures
  • [SPARK-24415] - Stage page aggregated executor metrics wrong when failures
  • [SPARK-24416] - Update configuration definition for spark.blacklist.killBlacklistedExecutors
  • [SPARK-24446] - Library path with special characters breaks Spark on YARN
  • [SPARK-24452] - long = int*int or long = int+int may cause overflow.
  • [SPARK-24453] - Fix error recovering from the failure in a no-data batch
  • [SPARK-24466] - TextSocketMicroBatchReader no longer works with nc utility
  • [SPARK-24468] - DecimalType `adjustPrecisionScale` might fail when scale is negative
  • [SPARK-24488] - Analyzer throws when generator is aliased multiple times
  • [SPARK-24495] - SortMergeJoin with duplicate keys wrong results
  • [SPARK-24500] - UnsupportedOperationException when trying to execute Union plan with Stream of children
  • [SPARK-24506] - spark.ui.filters not applied to /sqlserver/ url
  • [SPARK-24520] - Double braces in link
  • [SPARK-24526] - Spaces in the build dir causes failures in the build/mvn script
  • [SPARK-24530] - Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken
  • [SPARK-24531] - HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version
  • [SPARK-24536] - Query with nonsensical LIMIT hits AssertionError
  • [SPARK-24548] - JavaPairRDD to Dataset<Row> in SPARK generates ambiguous results
  • [SPARK-24552] - Task attempt numbers are reused when stages are retried
  • [SPARK-24553] - Job UI redirect causing http 302 error
  • [SPARK-24556] - ReusedExchange should rewrite output partitioning also when child's partitioning is RangePartitioning
  • [SPARK-24563] - Allow running PySpark shell without Hive
  • [SPARK-24569] - Spark Aggregator with output type Option[Boolean] creates column of type Row
  • [SPARK-24573] - SBT Java checkstyle affecting the build
  • [SPARK-24578] - Reading remote cache block behavior changes and causes timeout issue
  • [SPARK-24583] - Wrong schema type in InsertIntoDataSourceCommand
  • [SPARK-24588] - StreamingSymmetricHashJoinExec should require HashClusteredPartitioning from children
  • [SPARK-24589] - OutputCommitCoordinator may allow duplicate commits
  • [SPARK-24594] - Introduce metrics for YARN executor allocation problems
  • [SPARK-24598] - Spark SQL: Datatype overflow conditions give incorrect result
  • [SPARK-24603] - Typo in comments
  • [SPARK-24610] - wholeTextFiles broken for small files
  • [SPARK-24613] - Cache with UDF could not be matched with subsequent dependent caches
  • [SPARK-24633] - arrays_zip function's code generator splits input processing incorrectly
  • [SPARK-24645] - Skip parsing when csvColumnPruning enabled and partitions scanned only
  • [SPARK-24648] - SQLMetrics counters are not thread safe
  • [SPARK-24653] - Flaky test "JoinSuite.test SortMergeJoin (with spill)"
  • [SPARK-24659] - GenericArrayData.equals should respect element type differences
  • [SPARK-24660] - SHS does not properly show errors when downloading logs
  • [SPARK-24676] - Project required data from parsed data when csvColumnPruning disabled
  • [SPARK-24677] - TaskSetManager not updating successfulTaskDurations for old stage attempts
  • [SPARK-24681] - Cannot create a view from a table when a nested column name contains ':'
  • [SPARK-24694] - Integration tests pass only one app argument
  • [SPARK-24698] - In Pyspark's ML, an Identifiable's UID has 20 random characters rather than the 12 mentioned in the documentation.
  • [SPARK-24699] - Watermark / Append mode should work with Trigger.Once
  • [SPARK-24704] - The order of stages in the DAG graph is incorrect
  • [SPARK-24705] - Spark.sql.adaptive.enabled=true is enabled and self-join query
  • [SPARK-24711] - Integration tests will not work with exclude/include tags
  • [SPARK-24713] - AppMaster of spark streaming kafka OOM if there are hundreds of topics consumed
  • [SPARK-24715] - sbt build brings a wrong jline version
  • [SPARK-24717] - Split out min retain version of state for memory in HDFSBackedStateStoreProvider
  • [SPARK-24721] - Failed to use PythonUDF with literal inputs in filter with data sources
  • [SPARK-24734] - Fix containsNull of Concat for array type.
  • [SPARK-24739] - PySpark does not work with Python 3.7.0
  • [SPARK-24742] - Field Metadata raises NullPointerException in hashCode method
  • [SPARK-24743] - Update the JavaDirectKafkaWordCount example to support the new API of Kafka
  • [SPARK-24749] - Cannot filter array<struct> with named_struct
  • [SPARK-24754] - Minhash integer overflow
  • [SPARK-24755] - Executor loss can cause task to not be resubmitted
  • [SPARK-24781] - Using a reference from Dataset in Filter/Sort might not work.
  • [SPARK-24787] - Events being dropped at an alarming rate due to hsync being slow for eventLogging
  • [SPARK-24788] - RelationalGroupedDataset.toString throws errors when grouping by UnresolvedAttribute
  • [SPARK-24804] - There are duplicate words in the title in the DatasetSuite
  • [SPARK-24809] - Serializing LongHashedRelation in executor may result in data error
  • [SPARK-24812] - Last Access Time in the table description is not valid
  • [SPARK-24813] - HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
  • [SPARK-24829] - In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or spark-sql
  • [SPARK-24846] - Stabilize expression canonicalization
  • [SPARK-24850] - Query plan string representation grows exponentially on queries with recursive cached datasets
  • [SPARK-24870] - Cache can't work normally if there are case letters in SQL
  • [SPARK-24873] - Add a switch to suppress frequent interaction reports with YARN
  • [SPARK-24878] - Fix reverse function for array type of primitive type containing null.
  • [SPARK-24879] - NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`
  • [SPARK-24880] - Fix the group id for spark-kubernetes-integration-tests
  • [SPARK-24889] - dataset.unpersist() doesn't update storage memory stats
  • [SPARK-24891] - Fix HandleNullInputsForUDF rule
  • [SPARK-24895] - Spark 2.4.0 Snapshot artifacts have broken metadata due to mismatched filenames
  • [SPARK-24896] - Uuid expression should produce different values in each execution under streaming query
  • [SPARK-24908] - [R] remove spaces to make lintr happy
  • [SPARK-24909] - Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts
  • [SPARK-24911] - SHOW CREATE TABLE drops escaping of nested column names
  • [SPARK-24919] - Scala linter rule for sparkContext.hadoopConfiguration
  • [SPARK-24927] - The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files
  • [SPARK-24934] - Complex type and binary type in in-memory partition pruning does not work due to missing upper/lower bounds cases
  • [SPARK-24937] - Datasource partition table should load empty static partitions
  • [SPARK-24948] - SHS wrongly filters some applications due to permission check
  • [SPARK-24950] - scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
  • [SPARK-24957] - Decimal arithmetic can lead to wrong values using codegen
  • [SPARK-24963] - Integration tests will fail if they run in a namespace not being the default
  • [SPARK-24966] - Fix the precedence rule for set operations.
  • [SPARK-24972] - PivotFirst could not handle pivot columns of complex types
  • [SPARK-24981] - ShutdownHook timeout causes a succeeded job to fail when SparkContext stop() is not called by the user program
  • [SPARK-24987] - Kafka Cached Consumer Leaking File Descriptors
  • [SPARK-24997] - Support MINUS ALL
  • [SPARK-25004] - Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS
  • [SPARK-25005] - Structured streaming doesn't support kafka transaction (creating empty offset with abort & markers)
  • [SPARK-25009] - Standalone Cluster mode application submit is not working
  • [SPARK-25010] - Rand/Randn should produce different values for each execution in streaming query
  • [SPARK-25011] - Add PrefixSpan to __all__ in fpm.py
  • [SPARK-25019] - The published spark sql pom does not exclude the normal version of orc-core
  • [SPARK-25021] - Add spark.executor.pyspark.memory support to Kubernetes
  • [SPARK-25028] - AnalyzePartitionCommand failed with NPE if value is null
  • [SPARK-25031] - The schema of MapType can not be printed correctly
  • [SPARK-25033] - Bump Apache commons.{httpclient, httpcore}
  • [SPARK-25036] - Scala 2.12 issues: Compilation error with sbt
  • [SPARK-25041] - genjavadoc-plugin_0.10 is not found with sbt in scala-2.12
  • [SPARK-25046] - Alter View can execute sql like "ALTER VIEW ... AS INSERT INTO"
  • [SPARK-25058] - Use Block.isEmpty/nonEmpty to check whether the code is empty or not.
  • [SPARK-25072] - PySpark custom Row class can be given extra parameters
  • [SPARK-25076] - SQLConf should not be retrieved from a stopped SparkSession
  • [SPARK-25081] - Nested spill in ShuffleExternalSorter may access a released memory page
  • [SPARK-25084] - "distribute by" on multiple columns may lead to codegen issue
  • [SPARK-25090] - java.lang.ClassCastException when using a CrossValidator
  • [SPARK-25092] - Add RewriteExceptAll, RewriteIntersectAll and RewriteCorrelatedScalarSubquery in the list of nonExcludableRules
  • [SPARK-25096] - Loosen nullability if the cast is force-nullable.
  • [SPARK-25114] - RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE
  • [SPARK-25116] - Fix the "exit code 1" error when terminating Kafka tests
  • [SPARK-25124] - VectorSizeHint.size is buggy, breaking streaming pipeline
  • [SPARK-25126] - avoid creating OrcFile.Reader for all orc files
  • [SPARK-25132] - Case-insensitive field resolution when reading from Parquet
  • [SPARK-25134] - Csv column pruning with checking of headers throws incorrect error
  • [SPARK-25137] - `NumberFormatException` when starting spark-shell from Mac terminal
  • [SPARK-25149] - Personalized PageRank raises an error if vertexIDs are > MaxInt
  • [SPARK-25159] - json schema inference should only trigger one job
  • [SPARK-25161] - Fix several bugs in failure handling of barrier execution mode
  • [SPARK-25163] - Flaky test: o.a.s.util.collection.ExternalAppendOnlyMapSuite.spilling with compression
  • [SPARK-25164] - Parquet reader builds entire list of columns once for each column
  • [SPARK-25167] - Minor fixes for R sql tests (tests that fail in development environment)
  • [SPARK-25174] - ApplicationMaster suspends when unregistering itself from RM with extreme large diagnostic message
  • [SPARK-25175] - Field resolution should fail if there's ambiguity for ORC native reader
  • [SPARK-25176] - Kryo fails to serialize a parametrised type hierarchy
  • [SPARK-25181] - Block Manager master and slave thread pools are unbounded
  • [SPARK-25183] - Spark HiveServer2 registers shutdown hook with JVM, not ShutdownHookManager; race conditions can arise
  • [SPARK-25204] - rate source test is flaky
  • [SPARK-25205] - typo in spark.network.crypto.keyFactoryIteration
  • [SPARK-25206] - wrong records are returned when Hive metastore schema and parquet schema are in different letter cases
  • [SPARK-25214] - Kafka v2 source may return duplicated records when `failOnDataLoss` is `false`
  • [SPARK-25218] - Potential resource leaks in TransportServer and SocketAuthHelper
  • [SPARK-25221] - [DEPLOY] Consistent trailing whitespace treatment of conf values
  • [SPARK-25231] - Running a Large Job with Speculation On Causes Executor Heartbeats to Time Out on Driver
  • [SPARK-25237] - FileScanRdd's inputMetrics is wrong when selecting from a datasource table with limit
  • [SPARK-25240] - A deadlock in ALTER TABLE RECOVER PARTITIONS
  • [SPARK-25264] - Fix comma-delineated arguments passed into PythonRunner and RRunner
  • [SPARK-25266] - Fix memory leak in Barrier Execution Mode
  • [SPARK-25268] - runParallelPersonalizedPageRank throws serialization Exception
  • [SPARK-25278] - Number of output rows metric of union of views is multiplied by their occurrences
  • [SPARK-25283] - A deadlock in UnionRDD
  • [SPARK-25288] - Kafka transaction tests are flaky
  • [SPARK-25289] - ChiSqSelector max on empty collection
  • [SPARK-25291] - Flakiness of tests in terms of executor memory (SecretsTestSuite)
  • [SPARK-25295] - Pod name conflicts in client mode, if previous submission was not a clean shutdown.
  • [SPARK-25306] - Avoid skewed filter trees to speed up `createFilter` in ORC
  • [SPARK-25307] - ArraySort function may return an error in the code generation phase.
  • [SPARK-25308] - ArrayContains function may return an error in the code generation phase.
  • [SPARK-25310] - ArraysOverlap may throw a CompileException
  • [SPARK-25313] - Fix regression in FileFormatWriter output schema
  • [SPARK-25314] - Invalid PythonUDF - requires attributes from more than one child - in "on" join condition
  • [SPARK-25317] - MemoryBlock performance regression
  • [SPARK-25330] - Permission issue after upgrade hadoop version to 2.7.7
  • [SPARK-25352] - Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold
  • [SPARK-25357] - Add metadata to SparkPlanInfo to dump more information like file path to event log
  • [SPARK-25363] - Schema pruning doesn't work if nested column is used in where clause
  • [SPARK-25368] - Incorrect constraint inference returns wrong result
  • [SPARK-25371] - Vector Assembler with no input columns leads to opaque error
  • [SPARK-25387] - Malformed CSV causes NPE
  • [SPARK-25389] - INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields
  • [SPARK-25398] - Minor bugs from comparing unrelated types
  • [SPARK-25399] - Reusing execution threads from continuous processing for microbatch streaming can result in correctness issues
  • [SPARK-25402] - Null handling in BooleanSimplification
  • [SPARK-25406] - Incorrect usage of withSQLConf method in Parquet schema pruning test suite masks failing tests
  • [SPARK-25416] - ArrayPosition function may return incorrect result when right expression is implicitly downcasted.
  • [SPARK-25417] - ArrayContains function may return incorrect result when right expression is implicitly downcasted.
  • [SPARK-25425] - Extra options must overwrite sessions options
  • [SPARK-25427] - Add BloomFilter creation test cases
  • [SPARK-25431] - Fix function examples and unify the format of the example results.
  • [SPARK-25438] - Fix FilterPushdownBenchmark to use the same memory assumption
  • [SPARK-25439] - TPCHQuerySuite customer.c_nationkey should be bigint instead of string
  • [SPARK-25443] - fix issues when building docs with release scripts in docker
  • [SPARK-25450] - PushProjectThroughUnion rule uses the same exprId for project expressions in each Union child, causing mistakes in constant propagation
  • [SPARK-25471] - Fix tests for Python 3.6 with Pandas 0.23+
  • [SPARK-25495] - FetchedData.reset doesn't reset _nextOffsetInFetchedData and _offsetAfterPoll
  • [SPARK-25502] - [Spark Job History] Empty Page when page number exceeds the retainedTask size
  • [SPARK-25503] - [Spark Job History] Total task message in stage page is ambiguous
  • [SPARK-25505] - The output order of grouping columns in Pivot is different from the input order
  • [SPARK-25509] - SHS V2 cannot be enabled on Windows, because POSIX permissions are not supported.
  • [SPARK-25519] - ArrayRemove function may return incorrect result when right expression is implicitly downcasted.
  • [SPARK-25521] - Job id showing null when insert into command Job is finished.
  • [SPARK-25522] - Improve type promotion for input arguments of elementAt function
  • [SPARK-25533] - Inconsistent message for Completed Jobs in the JobUI, when there are failed jobs, compared to spark2.2
  • [SPARK-25536] - executorSource.METRIC read wrong record in Executor.scala Line444
  • [SPARK-25538] - incorrect row counts after distinct()
  • [SPARK-25542] - Flaky test: OpenHashMapSuite
  • [SPARK-25543] - Confusing log messages at DEBUG level, in K8s mode.
  • [SPARK-25546] - RDDInfo uses SparkEnv before it may have been initialized
  • [SPARK-25568] - Continue to update the remaining accumulators when failing to update one accumulator
  • [SPARK-25570] - Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
  • [SPARK-25578] - Update to Scala 2.12.7
  • [SPARK-25579] - Use quoted attribute names if needed in pushed ORC predicates
  • [SPARK-25591] - PySpark Accumulators with multiple PythonUDFs
  • [SPARK-25602] - SparkPlan.getByteArrayRdd should not consume the input when not necessary
  • [SPARK-25636] - spark-submit swallows the failure reason when there is an error connecting to master
  • [SPARK-25644] - Fix java foreachBatch API
  • [SPARK-25646] - docker-image-tool.sh doesn't work on developer build
  • [SPARK-25660] - Impossible to use the backward slash as the CSV fields delimiter
  • [SPARK-25669] - Check CSV header only when it exists
  • [SPARK-25671] - Build external/spark-ganglia-lgpl in Jenkins Test
  • [SPARK-25674] - If the records are incremented by more than 1 at a time, the number of bytes might rarely ever get updated
  • [SPARK-25677] - Configuring zstd compression in JDBC throws IllegalArgumentException
  • [SPARK-25697] - In-progress application throws an error in the UI when zstd compression is enabled
  • [SPARK-25704] - Replication of > 2GB block fails due to bad config default
  • [SPARK-25708] - HAVING without GROUP BY means global aggregate
  • [SPARK-25714] - Null Handling in the Optimizer rule BooleanSimplification
  • [SPARK-25726] - Flaky test: SaveIntoDataSourceCommandSuite.`simpleString is redacted`
  • [SPARK-25727] - makeCopy failed in InMemoryRelation
  • [SPARK-25738] - LOAD DATA INPATH doesn't work if hdfs conf includes port
  • [SPARK-25741] - Long URLs are not rendered properly in web UI
  • [SPARK-25768] - Constant argument expecting Hive UDAFs doesn't work
  • [SPARK-25793] - Loading model bug in BisectingKMeans
  • [SPARK-25795] - Fix CSV SparkR SQL Example
  • [SPARK-25797] - Views created via 2.1 cannot be read via 2.2+
  • [SPARK-25799] - DataSourceApiV2 scan reuse does not respect options
  • [SPARK-25801] - pandas_udf grouped_map fails with input dataframe with more than 255 columns
  • [SPARK-25803] - The -n option to docker-image-tool.sh causes other options to be ignored
  • [SPARK-25816] - Functions does not resolve Columns correctly
  • [SPARK-25822] - Fix a race condition when releasing a Python worker
  • [SPARK-25832] - remove newly added map related functions
  • [SPARK-25835] - Propagate scala 2.12 profile in k8s integration tests
  • [SPARK-25840] - `make-distribution.sh` should not fail due to missing LICENSE-binary
  • [SPARK-25854] - mvn helper script always exits w/1, causing mvn builds to fail
  • [SPARK-26612] - Speculation kill causing finished stage recomputed
  • [SPARK-26614] - Speculation kill might cause job failure
  • [SPARK-26802] - CVE-2018-11760: Apache Spark local privilege escalation vulnerability
  • [SPARK-28626] - Spark leaves unencrypted data on local disk, even with encryption turned on (CVE-2019-10099)
  • [SPARK-34381] - c

Epic

  • [SPARK-24374] - SPIP: Support Barrier Execution Mode in Apache Spark

Story

  • [SPARK-24124] - Spark history server should create spark.history.store.path and set permissions properly
  • [SPARK-24852] - Have spark.ml training use updated `Instrumentation` APIs.
  • [SPARK-25234] - SparkR:::parallelize doesn't handle integer overflow properly
  • [SPARK-25248] - Audit barrier APIs for Spark 2.4
  • [SPARK-25345] - Deprecate readImages APIs from ImageSchema
  • [SPARK-25347] - Document image data source in doc site

New Feature

  • [SPARK-10697] - Lift Calculation in Association Rule mining
  • [SPARK-14682] - Provide evaluateEachIteration method or equivalent for spark.ml GBTs
  • [SPARK-15064] - Locale support in StopWordsRemover
  • [SPARK-15784] - Add Power Iteration Clustering to spark.ml
  • [SPARK-19480] - Higher order functions in SQL
  • [SPARK-21274] - Implement EXCEPT ALL and INTERSECT ALL
  • [SPARK-22119] - Add cosine distance to KMeans
  • [SPARK-22880] - Add option to cascade jdbc truncate if database supports this (PostgreSQL and Oracle)
  • [SPARK-23010] - Add integration testing for Kubernetes backend into the apache/spark repository
  • [SPARK-23146] - Support client mode for Kubernetes cluster backend
  • [SPARK-23235] - Add executor Threaddump to api
  • [SPARK-23541] - Allow Kafka source to read data with greater parallelism than the number of topic-partitions
  • [SPARK-23751] - Kolmogorov-Smirnov test Python API in pyspark.ml
  • [SPARK-23846] - samplingRatio for schema inferring of CSV datasource
  • [SPARK-23856] - Spark jdbc setQueryTimeout option
  • [SPARK-23948] - Trigger mapstage's job listener in submitMissingTasks
  • [SPARK-23984] - PySpark Bindings for K8S
  • [SPARK-24027] - Support MapType(StringType, DataType) as root type by from_json
  • [SPARK-24193] - Sort by disk when number of limit is big in TakeOrderedAndProjectExec
  • [SPARK-24231] - Python API: Provide evaluateEachIteration method or equivalent for spark.ml GBTs
  • [SPARK-24232] - Allow referring to kubernetes secrets as env variable
  • [SPARK-24288] - Enable preventing predicate pushdown
  • [SPARK-24371] - Added isInCollection in DataFrame API for Scala and Java.
  • [SPARK-24372] - Create script for preparing RCs
  • [SPARK-24396] - Add Structured Streaming ForeachWriter for python
  • [SPARK-24397] - Add TaskContext.getLocalProperties in Python
  • [SPARK-24411] - Adding native Java tests for `isInCollection`
  • [SPARK-24412] - Adding docs about automagical type casting in `isin` and `isInCollection` APIs
  • [SPARK-24433] - R Bindings for K8S
  • [SPARK-24435] - Support user-supplied YAML that can be merged with k8s pod descriptions
  • [SPARK-24465] - LSHModel should support Structured Streaming for transform
  • [SPARK-24479] - Register StreamingQueryListener in Spark Conf
  • [SPARK-24499] - Split the page of sql-programming-guide.html to multiple separate pages
  • [SPARK-24542] - Hive UDF series UDFXPathXXXX allow users to pass carefully crafted XML to access arbitrary files
  • [SPARK-24662] - Structured Streaming should support LIMIT
  • [SPARK-24730] - Add policy to choose max as global watermark when streaming query has multiple watermarks
  • [SPARK-24768] - Have a built-in AVRO data source implementation
  • [SPARK-24795] - Implement barrier execution mode
  • [SPARK-24802] - Optimization Rule Exclusion
  • [SPARK-24817] - Implement BarrierTaskContext.barrier()
  • [SPARK-24819] - Fail fast when there are not enough slots to launch the barrier stage on job submitted
  • [SPARK-24820] - Fail fast when submitted job contains PartitionPruningRDD in a barrier stage
  • [SPARK-24821] - Fail fast when submitted job computes on a subset of all the partitions for a barrier stage
  • [SPARK-24822] - Python support for barrier execution mode
  • [SPARK-24918] - Executor Plugin API
  • [SPARK-25468] - Highlight current page index in the history server

Improvement

  • [SPARK-3159] - Check for reducible DecisionTree
  • [SPARK-4502] - Spark SQL reads unnecessary nested fields from Parquet
  • [SPARK-7132] - Add fit with validation set to spark.ml GBT
  • [SPARK-9312] - The OneVsRest model does not provide rawPrediction
  • [SPARK-11630] - ClosureCleaner incorrectly warns for class based closures
  • [SPARK-13343] - speculative tasks that didn't commit shouldn't be marked as success
  • [SPARK-14712] - spark.ml LogisticRegressionModel.toString should summarize model
  • [SPARK-15009] - PySpark CountVectorizerModel should be able to construct from vocabulary list
  • [SPARK-16406] - Reference resolution for large number of columns should be faster
  • [SPARK-16501] - spark.mesos.secret exposed on UI and command line
  • [SPARK-16617] - Upgrade to Avro 1.8.x
  • [SPARK-16630] - Blacklist a node if executors won't launch on it.
  • [SPARK-18057] - Update structured streaming kafka from 0.10.0.1 to 2.0.0
  • [SPARK-18230] - MatrixFactorizationModel.recommendProducts throws NoSuchElement exception when the user does not exist
  • [SPARK-19018] - spark csv writer charset support
  • [SPARK-19602] - Unable to query using the fully qualified column name of the form ( <DBNAME>.<TABLENAME>.<COLUMNNAME>)
  • [SPARK-19724] - create a managed table with an existing default location should throw an exception
  • [SPARK-19947] - RFormulaModel always throws Exception on transforming data with NULL or Unseen labels
  • [SPARK-20087] - Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd listeners
  • [SPARK-20168] - Enable kinesis to start stream from Initial position specified by a timestamp
  • [SPARK-20538] - Dataset.reduce operator should use withNewExecutionId (as foreach or foreachPartition)
  • [SPARK-20659] - Remove StorageStatus, or make it private.
  • [SPARK-20937] - Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide
  • [SPARK-21318] - The exception message thrown by `lookupFunction` is ambiguous.
  • [SPARK-21590] - Structured Streaming window start time should support negative values to adjust time zone
  • [SPARK-21687] - Spark SQL should set createTime for Hive partition
  • [SPARK-21741] - Python API for DataFrame-based multivariate summarizer
  • [SPARK-21783] - Turn on ORC filter push-down by default
  • [SPARK-21860] - Improve memory reuse for heap memory in `HeapMemoryAllocator`
  • [SPARK-21960] - Spark Streaming Dynamic Allocation should respect spark.executor.instances
  • [SPARK-22068] - Reduce the duplicate code between putIteratorAsValues and putIteratorAsBytes
  • [SPARK-22144] - ExchangeCoordinator will not combine the partitions of an 0 sized pre-shuffle
  • [SPARK-22210] - Online LDA variationalTopicInference should use random seed to have stable behavior
  • [SPARK-22219] - Refactor "spark.sql.codegen.comments"
  • [SPARK-22269] - Java style checks should be run in Jenkins
  • [SPARK-22666] - Spark datasource for image format
  • [SPARK-22683] - DynamicAllocation wastes resources by allocating containers that will barely be used
  • [SPARK-22751] - Improve ML RandomForest shuffle performance
  • [SPARK-22814] - JDBC support date/timestamp type as partitionColumn
  • [SPARK-22839] - Refactor Kubernetes code for configuring driver/executor pods to use consistent and cleaner abstraction
  • [SPARK-22856] - Add wrapper for codegen output and nullability
  • [SPARK-22941] - Allow SparkSubmit to throw exceptions instead of exiting / printing errors.
  • [SPARK-22959] - Configuration to select the modules for daemon and worker in PySpark
  • [SPARK-23012] - Support for predicate pushdown and partition pruning when left joining large Hive tables
  • [SPARK-23024] - Spark UI tables should support hide and show features when they contain many records.
  • [SPARK-23031] - Merge script should allow arbitrary assignees
  • [SPARK-23034] - Display tablename for `HiveTableScan` node in UI
  • [SPARK-23040] - BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator or ordering is specified
  • [SPARK-23043] - Upgrade json4s-jackson to 3.5.3
  • [SPARK-23085] - API parity for mllib.linalg.Vectors.sparse
  • [SPARK-23159] - Update Cloudpickle to match version 0.4.3
  • [SPARK-23161] - Add missing APIs to Python GBTClassifier
  • [SPARK-23162] - PySpark ML LinearRegressionSummary missing r2adj
  • [SPARK-23166] - Add maxDF Parameter to CountVectorizer
  • [SPARK-23167] - Update TPCDS queries from v1.4 to v2.7 (latest)
  • [SPARK-23174] - Fix pep8 to latest official version
  • [SPARK-23188] - Make vectorized columnar reader batch size configurable
  • [SPARK-23202] - Add new API in DataSourceWriter: onDataWriterCommit
  • [SPARK-23217] - Add cosine distance measure to ClusteringEvaluator
  • [SPARK-23228] - Able to track Python create SparkSession in JVM
  • [SPARK-23247] - combines Unsafe operations and statistics operations in Scan Data Source
  • [SPARK-23253] - Only write shuffle temporary index file when there is not an existing one
  • [SPARK-23259] - Clean up legacy code around hive external catalog
  • [SPARK-23285] - Allow spark.executor.cores to be fractional
  • [SPARK-23295] - Exclude Warning message when generating versions in make-distribution.sh
  • [SPARK-23303] - improve the explain result for data source v2 relations
  • [SPARK-23318] - FP-growth: WARN FPGrowth: Input data is not cached
  • [SPARK-23336] - Upgrade snappy-java to 1.1.7.1
  • [SPARK-23359] - Adds an alias 'names' of 'fieldNames' in Scala's StructType
  • [SPARK-23366] - Improve hot reading path in ReadAheadInputStream
  • [SPARK-23372] - Writing empty struct in parquet fails during execution. It should fail earlier during analysis.
  • [SPARK-23375] - Optimizer should remove unneeded Sort
  • [SPARK-23378] - move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl
  • [SPARK-23379] - remove redundant metastore access if the current database name is the same
  • [SPARK-23382] - Spark Streaming UI tables need hide and show features for when a table has very many records
  • [SPARK-23383] - Make a distribution should exit with usage while detecting wrong options
  • [SPARK-23389] - When the shuffle dependency specifies aggregation, and `dependency.mapSideCombine=false`, we should be able to use serialized sorting.
  • [SPARK-23412] - Add cosine distance measure to BisectingKMeans
  • [SPARK-23424] - Add codegenStageId in comment
  • [SPARK-23445] - ColumnStat refactoring
  • [SPARK-23447] - Cleanup codegen template for Literal
  • [SPARK-23455] - Default Params in ML should be saved separately
  • [SPARK-23456] - Turn on `native` ORC implementation by default
  • [SPARK-23466] - Remove redundant null checks in generated Java code by GenerateUnsafeProjection
  • [SPARK-23500] - Filters on named_structs could be pushed into scans
  • [SPARK-23510] - Support read data from Hive 2.2 and Hive 2.3 metastore
  • [SPARK-23518] - Avoid metastore access when users only want to read and store data frames
  • [SPARK-23528] - Add numIter to ClusteringSummary
  • [SPARK-23529] - Specify hostpath volume and mount the volume in Spark driver and executor pods in Kubernetes
  • [SPARK-23538] - Simplify SSL configuration for https client
  • [SPARK-23550] - Cleanup unused / redundant methods in Utils object
  • [SPARK-23553] - Tests should not assume the default value of `spark.sql.sources.default`
  • [SPARK-23562] - RFormula handleInvalid should handle invalid values in non-string columns.
  • [SPARK-23564] - The optimized logical plan for left anti join should be further optimized
  • [SPARK-23565] - Improved error message for when the number of sources for a query changes
  • [SPARK-23568] - Silhouette should get number of features from metadata if available
  • [SPARK-23572] - Update security.md to cover new features
  • [SPARK-23573] - Create linter rule to prevent misuse of SparkContext.hadoopConfiguration in SQL modules
  • [SPARK-23604] - ParquetInteroperabilityTest timestamp test should use Statistics.hasNonNullValue
  • [SPARK-23624] - Revise doc of method pushFilters
  • [SPARK-23627] - Provide isEmpty() function in Dataset
  • [SPARK-23628] - WholeStageCodegen can generate methods with too many params
  • [SPARK-23644] - SHS with proxy doesn't show applications
  • [SPARK-23645] - pandas_udf can not be called with keyword arguments
  • [SPARK-23654] - Cut jets3t as a dependency of spark-core
  • [SPARK-23656] - Assertion in XXH64Suite.testKnownByteArrayInputs() is not performed on big endian platform
  • [SPARK-23672] - Document Support returning lists in Arrow UDFs
  • [SPARK-23675] - Title add spark logo, use spark logo image
  • [SPARK-23683] - FileCommitProtocol.instantiate to require 3-arg constructor for dynamic partition overwrite
  • [SPARK-23691] - Use sql_conf util in PySpark tests where possible
  • [SPARK-23695] - Confusing error message for PySpark's Kinesis tests when its jar is missing but enabled
  • [SPARK-23699] - PySpark should raise same Error when Arrow fallback is disabled
  • [SPARK-23700] - Cleanup unused imports
  • [SPARK-23708] - Comment of ShutdownHookManager.addShutdownHook is incorrect
  • [SPARK-23769] - Remove unnecessary scalastyle check disabling
  • [SPARK-23770] - Expose repartitionByRange in SparkR
  • [SPARK-23772] - Provide an option to ignore column of all null values or empty map/array during JSON schema inference
  • [SPARK-23776] - pyspark-sql tests should display build instructions when components are missing
  • [SPARK-23803] - Support bucket pruning to optimize filtering on a bucketed column
  • [SPARK-23820] - Allow the long form of call sites to be recorded in the log
  • [SPARK-23822] - Improve error message for Parquet schema mismatches
  • [SPARK-23828] - PySpark StringIndexerModel should have constructor from labels
  • [SPARK-23830] - Spark on YARN in cluster deploy mode fails with NullPointerException when a Spark application is a Scala class, not an object
  • [SPARK-23838] - SparkUI: Running SQL query displayed as "completed" in SQL tab
  • [SPARK-23841] - NodeIdCache should unpersist the last cached nodeIdsForInstances
  • [SPARK-23861] - Clarify behavior of default window frame boundaries with and without orderBy clause
  • [SPARK-23867] - com.codahale.metrics.Counter output in log message has no toString method
  • [SPARK-23873] - Use accessors in interpreted LambdaVariable
  • [SPARK-23874] - Upgrade apache/arrow to 0.10.0
  • [SPARK-23875] - Create IndexedSeq wrapper for ArrayData
  • [SPARK-23877] - Metadata-only queries do not push down filter conditions
  • [SPARK-23880] - Table cache should be lazy and not trigger any job
  • [SPARK-23892] - Improve coverage and fix lint error in UTF8String-related Suite
  • [SPARK-23896] - Improve PartitioningAwareFileIndex
  • [SPARK-23944] - Add Param set functions to LSHModel types
  • [SPARK-23947] - Add hashUTF8String convenience method to hasher classes
  • [SPARK-23956] - Use effective RPC port in AM registration
  • [SPARK-23957] - Sorts in subqueries are redundant and can be removed
  • [SPARK-23960] - Mark HashAggregateExec.bufVars as transient
  • [SPARK-23962] - Flaky tests from SQLMetricsTestUtils.currentExecutionIds
  • [SPARK-23963] - Queries on text-based Hive tables grow disproportionately slower as the number of columns increases
  • [SPARK-23966] - Refactoring all checkpoint file writing logic in a common interface
  • [SPARK-23972] - Upgrade to Parquet 1.10
  • [SPARK-23973] - Remove consecutive sorts
  • [SPARK-23979] - MultiAlias should not be a CodegenFallback
  • [SPARK-24003] - Add support to provide spark.executor.extraJavaOptions in terms of App Id and/or Executor Id's
  • [SPARK-24005] - Remove usage of Scala’s parallel collection
  • [SPARK-24014] - Add onStreamingStarted method to StreamingListener
  • [SPARK-24017] - Refactor ExternalCatalog to be an interface
  • [SPARK-24024] - Fix deviance calculations in GLM to handle corner cases
  • [SPARK-24029] - Set "reuse address" flag on listen sockets
  • [SPARK-24035] - SQL syntax for Pivot
  • [SPARK-24057] - put the real data type in the AssertionError message
  • [SPARK-24058] - Default Params in ML should be saved separately: Python API
  • [SPARK-24072] - clearly define pushed filters
  • [SPARK-24083] - Diagnostics message for uncaught exceptions should include the stacktrace
  • [SPARK-24094] - Change description strings of v2 streaming sources to reflect the change
  • [SPARK-24111] - Add TPCDS v2.7 (latest) queries in TPCDSQueryBenchmark
  • [SPARK-24117] - Unified the getSizePerRow
  • [SPARK-24121] - The API for handling expression code generation in expression codegen
  • [SPARK-24126] - PySpark tests leave a lot of garbage in /tmp
  • [SPARK-24127] - Support text socket source in continuous mode
  • [SPARK-24128] - Mention spark.sql.crossJoin.enabled in implicit cartesian product error msg
  • [SPARK-24129] - Add option to pass --build-arg's to docker-image-tool.sh
  • [SPARK-24131] - Add majorMinorVersion API to PySpark for determining Spark versions
  • [SPARK-24136] - MemoryStreamDataReader.next should skip sleeping if record is available
  • [SPARK-24149] - Automatic namespaces discovery in HDFS federation
  • [SPARK-24156] - Enable no-data micro batches for more eager streaming state clean up
  • [SPARK-24160] - ShuffleBlockFetcherIterator should fail if it receives zero-size blocks
  • [SPARK-24161] - Enable debug package feature on structured streaming
  • [SPARK-24172] - we should not apply operator pushdown to data source v2 many times
  • [SPARK-24181] - Better error message for writing sorted data
  • [SPARK-24182] - Improve error message for client mode when AM fails
  • [SPARK-24188] - /api/v1/version not working
  • [SPARK-24204] - Verify a write schema in Json/Orc/ParquetFileFormat
  • [SPARK-24206] - Improve DataSource benchmark code for read and pushdown
  • [SPARK-24209] - 0 configuration Knox gateway support in SHS
  • [SPARK-24215] - Implement eager evaluation for DataFrame APIs
  • [SPARK-24242] - RangeExec should have correct outputOrdering
  • [SPARK-24244] - Parse only required columns of CSV file
  • [SPARK-24246] - Improve AnalysisException by setting the cause when it's available
  • [SPARK-24248] - [K8S] Use the Kubernetes cluster as the backing store for the state of pods
  • [SPARK-24250] - support accessing SQLConf inside tasks
  • [SPARK-24262] - Fix typo in UDF error message
  • [SPARK-24268] - DataType in error messages are not coherent
  • [SPARK-24275] - Revise doc comments in InputPartition
  • [SPARK-24277] - Code clean up in SQL module: HadoopMapReduceCommitProtocol/FileFormatWriter
  • [SPARK-24303] - Update cloudpickle to v0.4.4
  • [SPARK-24305] - Avoid serialization of private fields in new collection expressions
  • [SPARK-24308] - Handle DataReaderFactory to InputPartition renames in left over classes
  • [SPARK-24312] - Upgrade to 2.3.3 for Hive Metastore Client 2.3
  • [SPARK-24321] - Extract common code from Divide/Remainder to a base trait
  • [SPARK-24326] - Add local:// scheme support for the app jar in mesos cluster mode
  • [SPARK-24327] - Verify and normalize a partition column name based on the JDBC resolved schema
  • [SPARK-24329] - Remove comments filtering before parsing of CSV files
  • [SPARK-24330] - Refactor ExecuteWriteTask in FileFormatWriter with DataWriter(V2)
  • [SPARK-24332] - Fix places reading 'spark.network.timeout' as milliseconds
  • [SPARK-24337] - Improve the error message for invalid SQL conf value
  • [SPARK-24339] - spark sql can not prune column in transform/map/reduce query
  • [SPARK-24356] - Duplicate strings in File.path managed by FileSegmentManagedBuffer
  • [SPARK-24361] - Polish code block manipulation API
  • [SPARK-24365] - Add data source write benchmark
  • [SPARK-24366] - Improve error message for Catalyst type converters
  • [SPARK-24367] - Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY
  • [SPARK-24381] - Improve Unit Test Coverage of NOT IN subqueries
  • [SPARK-24408] - Move abs function to math_funcs group
  • [SPARK-24423] - Add a new option `query` for JDBC sources
  • [SPARK-24424] - Support ANSI-SQL compliant syntax for GROUPING SET
  • [SPARK-24428] - Remove unused code and fix any related doc in K8s module
  • [SPARK-24441] - Expose total estimated size of states in HDFSBackedStateStoreProvider
  • [SPARK-24454] - ml.image doesn't have __all__ explicitly defined
  • [SPARK-24455] - fix typo in TaskSchedulerImpl's comments
  • [SPARK-24470] - RestSubmissionClient to be robust against 404 & non json responses
  • [SPARK-24477] - Import submodules under pyspark.ml by default
  • [SPARK-24485] - Measure and log elapsed time for filesystem operations in HDFSBackedStateStoreProvider
  • [SPARK-24490] - Use WebUI.addStaticHandler in web UIs
  • [SPARK-24505] - Convert strings in codegen to blocks: Cast and BoundAttribute
  • [SPARK-24518] - Using Hadoop credential provider API to store password
  • [SPARK-24519] - MapStatus has 2000 hardcoded
  • [SPARK-24525] - Provide an option to limit MemorySink memory usage
  • [SPARK-24534] - Add a way to bypass entrypoint.sh script if no spark cmd is passed
  • [SPARK-24543] - Support any DataType as DDL string for from_json's schema
  • [SPARK-24547] - Spark on K8s docker-image-tool.sh improvements
  • [SPARK-24551] - Add Integration tests for Secrets
  • [SPARK-24555] - logNumExamples in KMeans/BiKM/GMM/AFT/NB
  • [SPARK-24557] - ClusteringEvaluator support array input
  • [SPARK-24558] - Driver prints the wrong info in the log when the executor which holds cacheBlock is IDLE. Time-out value displayed is not as per configuration value.
  • [SPARK-24565] - Add API in Structured Streaming for exposing output rows of each microbatch as a DataFrame
  • [SPARK-24566] - Fix spark.storage.blockManagerSlaveTimeoutMs default config
  • [SPARK-24571] - Support literals with values of the Char type
  • [SPARK-24574] - improve array_contains function of the sql component to deal with Column type
  • [SPARK-24575] - Prohibit window expressions inside WHERE and HAVING clauses
  • [SPARK-24576] - Upgrade Apache ORC to 1.5.2
  • [SPARK-24596] - Non-cascading Cache Invalidation
  • [SPARK-24605] - size(null) should return null
  • [SPARK-24609] - PySpark/SparkR doc doesn't explain RandomForestClassifier.featureSubsetStrategy well
  • [SPARK-24614] - PySpark - Fix SyntaxWarning on tests.py
  • [SPARK-24626] - Parallelize size calculation in Analyze Table command
  • [SPARK-24635] - Remove Blocks class
  • [SPARK-24636] - Type Coercion of Arrays for array_join Function
  • [SPARK-24637] - Add metrics regarding state and watermark to dropwizard metrics
  • [SPARK-24646] - Support wildcard '*' for spark.yarn.dist.forceDownloadSchemes
  • [SPARK-24658] - Remove workaround for ANTLR bug
  • [SPARK-24665] - Add SQLConf in PySpark to manage all sql configs
  • [SPARK-24673] - scala sql function from_utc_timestamp second argument could be Column instead of String
  • [SPARK-24675] - Rename table: validate existence of new location
  • [SPARK-24678] - We should use 'PROCESS_LOCAL' first for Spark-Streaming
  • [SPARK-24683] - SparkLauncher.NO_RESOURCE doesn't work with Java applications
  • [SPARK-24685] - Adjust release scripts to build all versions for older releases
  • [SPARK-24688] - Clarify comments about LabeledPoint as (label, feature) pair rather than (feature, label)
  • [SPARK-24691] - Add new API `supportDataType` in FileFormat
  • [SPARK-24692] - Improve FilterPushdownBenchmark
  • [SPARK-24696] - ColumnPruning rule fails to remove extra Project
  • [SPARK-24697] - Fix the reported start offsets in streaming query progress
  • [SPARK-24709] - Inferring schema from JSON string literal
  • [SPARK-24722] - Column-based API for pivoting
  • [SPARK-24727] - The cache 100 in CodeGenerator is too small for streaming
  • [SPARK-24732] - Type coercion between MapTypes.
  • [SPARK-24737] - Type coercion between StructTypes.
  • [SPARK-24747] - Make spark.ml.util.Instrumentation class more flexible
  • [SPARK-24757] - Improve error message for broadcast timeouts
  • [SPARK-24759] - No reordering keys for broadcast hash join
  • [SPARK-24761] - Check modifiability of config parameters
  • [SPARK-24763] - Remove redundant key data from value in streaming aggregation
  • [SPARK-24782] - Simplify conf access in expressions
  • [SPARK-24785] - Making sure REPL prints Spark UI info and then Welcome message
  • [SPARK-24790] - Allow complex aggregate expressions in Pivot
  • [SPARK-24801] - Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory
  • [SPARK-24807] - Adding files/jars twice: output a warning and add a note
  • [SPARK-24849] - Convert StructType to DDL string
  • [SPARK-24858] - Avoid unnecessary parquet footer reads
  • [SPARK-24860] - Expose dynamic partition overwrite per write operation
  • [SPARK-24865] - Remove AnalysisBarrier
  • [SPARK-24868] - add sequence function in Python
  • [SPARK-24871] - Refactor Concat and MapConcat to avoid creating concatenator object for each row.
  • [SPARK-24890] - Short circuiting the `if` condition when `trueValue` and `falseValue` are the same
  • [SPARK-24893] - Remove the entire CaseWhen if all the outputs are semantically equivalent
  • [SPARK-24926] - Ensure numCores is used consistently in all netty configuration (driver and executors)
  • [SPARK-24929] - Merge script swallows KeyboardInterrupt
  • [SPARK-24940] - Coalesce and Repartition Hint for SQL Queries
  • [SPARK-24943] - Convert a SQL Struct to StructType
  • [SPARK-24945] - Switch to uniVocity >= 2.7.2
  • [SPARK-24951] - Table valued functions should throw AnalysisException instead of IllegalArgumentException
  • [SPARK-24952] - Support LZMA2 compression by Avro datasource
  • [SPARK-24954] - Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled
  • [SPARK-24956] - Upgrade maven from 3.3.9 to 3.5.4
  • [SPARK-24960] - k8s: explicitly expose ports on driver container
  • [SPARK-24962] - refactor CodeGenerator.createUnsafeArray
  • [SPARK-24978] - Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity of fast aggregation.
  • [SPARK-24979] - add AnalysisHelper#resolveOperatorsUp
  • [SPARK-24982] - UDAF resolution should not throw java.lang.AssertionError
  • [SPARK-24992] - spark should randomize yarn local dir selection
  • [SPARK-24993] - Make Avro fast again
  • [SPARK-24996] - Use DSL to simplify DeclarativeAggregate
  • [SPARK-24999] - Reduce unnecessary 'new' memory operations
  • [SPARK-25001] - Fix build miscellaneous warnings
  • [SPARK-25018] - Use `Co-Authored-By` git trailer in `merge_spark_pr.py`
  • [SPARK-25025] - Remove the default value of isAll in INTERSECT/EXCEPT
  • [SPARK-25043] - spark-sql should print the appId and master on startup
  • [SPARK-25045] - Make `RDDBarrier.mapPartitions` similar to `RDD.mapPartitions`
  • [SPARK-25069] - Use UnsafeAlignedOffset to keep the entire record 8-byte aligned, as is done in UnsafeExternalSorter
  • [SPARK-25073] - Spark-submit on YARN: When yarn.nodemanager.resource.memory-mb and/or yarn.scheduler.maximum-allocation-mb is insufficient, Spark always reports an error asking to adjust yarn.scheduler.maximum-allocation-mb
  • [SPARK-25077] - Delete unused variable in WindowExec
  • [SPARK-25088] - Rest Server default & doc updates
  • [SPARK-25093] - CodeFormatter could avoid creating regex object again and again
  • [SPARK-25105] - Importing all of pyspark.sql.functions should bring PandasUDFType in as well
  • [SPARK-25108] - Dataset.show() generates incorrect padding for Unicode Character
  • [SPARK-25111] - increment kinesis client/producer lib versions & aws-sdk to match
  • [SPARK-25113] - Add logging to CodeGenerator when any generated method's bytecode size goes above HugeMethodLimit
  • [SPARK-25115] - Eliminate extra memory copy done when a ByteBuf is used that is backed by > 1 ByteBuffer.
  • [SPARK-25117] - Add EXCEPT ALL and INTERSECT ALL support in R.
  • [SPARK-25122] - Deduplication of supports equals code
  • [SPARK-25140] - Add optional logging to UnsafeProjection.create when it falls back to interpreted mode
  • [SPARK-25142] - Add error messages when Python worker could not open socket in `_load_from_socket`.
  • [SPARK-25170] - Add Task Metrics description to the documentation
  • [SPARK-25178] - Directly ship the StructType objects of the keySchema / valueSchema for xxxHashMapGenerator
  • [SPARK-25208] - Loosen Cast.forceNullable for DecimalType.
  • [SPARK-25209] - Optimization in Dataset.apply for DataFrames
  • [SPARK-25212] - Support Filter in ConvertToLocalRelation
  • [SPARK-25228] - Add executor CPU Time metric
  • [SPARK-25233] - Give the user the option of specifying a fixed minimum message per partition per batch when using kafka direct API with backpressure
  • [SPARK-25235] - Merge the REPL code in Scala 2.11 and 2.12 branches
  • [SPARK-25241] - Configurable empty values when reading/writing CSV files
  • [SPARK-25252] - Support arrays of any types in to_json
  • [SPARK-25253] - Refactor pyspark connection & authentication
  • [SPARK-25260] - Fix namespace handling in SchemaConverters.toAvroType
  • [SPARK-25261] - Standardize the default units of spark.driver|executor.memory
  • [SPARK-25275] - require membership in wheel to run 'su' (in dockerfiles)
  • [SPARK-25286] - Remove dangerous parmap
  • [SPARK-25287] - Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py
  • [SPARK-25300] - Unified the configuration parameter `spark.shuffle.service.enabled`
  • [SPARK-25318] - Add exception handling when wrapping the input stream during the fetch or stage retry in response to a corrupted block
  • [SPARK-25335] - Skip Zinc downloading if it's installed in the system
  • [SPARK-25375] - Reenable qualified perm. function checks in UDFSuite
  • [SPARK-25384] - Clarify fromJsonForceNullableSchema will be removed in Spark 3.0
  • [SPARK-25400] - Increase timeouts in schedulerIntegrationSuite
  • [SPARK-25445] - publish a scala 2.12 build with Spark 2.4
  • [SPARK-25469] - Eval methods of Concat, Reverse and ElementAt should use pattern matching only once
  • [SPARK-25639] - Add documentation on foreachBatch, and multiple watermark policy
  • [SPARK-25754] - Change CDN for MathJax
  • [SPARK-25859] - add scala/java/python example and doc for PrefixSpan

Test

  • [SPARK-16139] - Audit tests for leaked threads
  • [SPARK-22882] - ML test for StructuredStreaming: spark.ml.classification
  • [SPARK-22883] - ML test for StructuredStreaming: spark.ml.feature, A-M
  • [SPARK-22884] - ML test for StructuredStreaming: spark.ml.clustering
  • [SPARK-22885] - ML test for StructuredStreaming: spark.ml.tuning
  • [SPARK-22886] - ML test for StructuredStreaming: spark.ml.recommendation
  • [SPARK-22915] - ML test for StructuredStreaming: spark.ml.feature, N-Z
  • [SPARK-23169] - Run lintr on the changes of lint-r script and .lintr configuration
  • [SPARK-23392] - Add some test case for images feature
  • [SPARK-23849] - Tests for the samplingRatio option of json schema inferring
  • [SPARK-23881] - Flaky test: JobCancellationSuite."interruptible iterator of shuffle reader"
  • [SPARK-24044] - Explicitly print out skipped tests from unittest module
  • [SPARK-24502] - flaky test: UnsafeRowSerializerSuite
  • [SPARK-24521] - Fix ineffective test in CachedTableSuite
  • [SPARK-24562] - Allow running same tests with multiple configs in SQLQueryTestSuite
  • [SPARK-24564] - Add test suite for RecordBinaryComparator
  • [SPARK-24740] - PySpark tests do not pass with NumPy 0.14.x+
  • [SPARK-24840] - do not use dummy filter to switch codegen on/off
  • [SPARK-24861] - create corrected temp directories in RateSourceSuite
  • [SPARK-24886] - Increase Jenkins build time
  • [SPARK-25141] - Modify tests for higher-order functions to check bind method.
  • [SPARK-25184] - Flaky test: FlatMapGroupsWithState "streaming with processing time timeout"
  • [SPARK-25238] - Lint-Python: Upgrading to the current version of pycodestyle fails
  • [SPARK-25249] - Add a unit test for OpenHashMap
  • [SPARK-25267] - Disable ConvertToLocalRelation in the test cases of sql/core and sql/hive
  • [SPARK-25290] - BytesToBytesMapOnHeapSuite randomizedStressTest can cause OutOfMemoryError
  • [SPARK-25296] - Create ExplainSuite
  • [SPARK-25422] - flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated (encryption = on) (with replication as stream)
  • [SPARK-25453] - OracleIntegrationSuite IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
  • [SPARK-25456] - PythonForeachWriterSuite failing
  • [SPARK-25673] - Remove Travis CI which enables Java lint check
  • [SPARK-25736] - add tests to verify the behavior of multi-column count
  • [SPARK-25805] - Flaky test: DataFrameSuite.SPARK-25159 unittest failure

Wish

  • [SPARK-23131] - Kryo raises StackOverflow during serializing GLR model
  • [SPARK-25258] - Upgrade kryo package to version 4.0.2

Task

  • [SPARK-20220] - Add thrift scheduling pool config in scheduling docs
  • [SPARK-23092] - Migrate MemoryStream to DataSource V2
  • [SPARK-23451] - Deprecate KMeans computeCost
  • [SPARK-23501] - Refactor AllStagesPage in order to avoid redundant code
  • [SPARK-23533] - Add support for changing ContinuousDataReader's startOffset
  • [SPARK-23601] - Remove .md5 files from release
  • [SPARK-24392] - Mark pandas_udf as Experimental
  • [SPARK-24533] - typesafe has rebranded to lightbend. change the build/mvn endpoint from downloads.typesafe.com to downloads.lightbend.com
  • [SPARK-24654] - Update, fix LICENSE and NOTICE, and specialize for source vs binary
  • [SPARK-25063] - Rename class KnowNotNull to KnownNotNull
  • [SPARK-25095] - Python support for BarrierTaskContext
  • [SPARK-25213] - DataSourceV2 doesn't seem to produce unsafe rows
  • [SPARK-25336] - Revert SPARK-24863 and SPARK-24748
  • [SPARK-25836] - (Temporarily) disable automatic build/test of kubernetes-integration-tests

Dependency upgrade

  • [SPARK-20395] - Update Scala to 2.11.11 and zinc to 0.3.15
  • [SPARK-23509] - Upgrade commons-net from 2.2 to 3.1

Request

  • [SPARK-21607] - Can dropTempView function add a param like dropTempView(viewName: String, dropSelfOnly: Boolean)

Documentation

  • [SPARK-21261] - SparkSQL regexpExpressions example
  • [SPARK-23231] - Add doc for string indexer ordering to user guide (also to RFormula guide)
  • [SPARK-23254] - Add user guide entry for DataFrame multivariate summary
  • [SPARK-23256] - Add columnSchema method to PySpark image reader
  • [SPARK-23329] - Update the function descriptions with the arguments and returned values of the trigonometric functions
  • [SPARK-23566] - Argument name fix
  • [SPARK-23642] - isZero scaladoc for LongAccumulator describes wrong method
  • [SPARK-23792] - Documentation improvements for datetime functions
  • [SPARK-24134] - A missing full-stop in doc "Tuning Spark"
  • [SPARK-24171] - Update comments for non-deterministic functions
  • [SPARK-24191] - Scala example code for Power Iteration Clustering in Spark ML examples
  • [SPARK-24224] - Java example code for Power Iteration Clustering in spark.ml
  • [SPARK-24378] - Incorrect examples for date_trunc function in spark 2.3.0
  • [SPARK-24444] - Improve pandas_udf GROUPED_MAP docs to explain column assignment
  • [SPARK-24507] - Description in "Level of Parallelism in Data Receiving" section of Spark Streaming Programming Guide is not relevant for the recent Kafka direct approach
  • [SPARK-24628] - Typos of the example code in docs/mllib-data-types.md
  • [SPARK-25082] - Documentation for Spark Function expm1 is incomplete
  • [SPARK-25273] - How to install testthat v1.0.2
  • [SPARK-25583] - Add newly added History server related configurations in the documentation
  • [SPARK-25656] - Add an example section about how to use Parquet/ORC library options
