Release Notes - Spark - Version 2.3.1

Sub-task

  • [SPARK-23706] - spark.conf.get(value, default=None) should produce None in PySpark
  • [SPARK-23748] - Support select from temp tables
  • [SPARK-23942] - PySpark's collect doesn't trigger QueryExecutionListener
  • [SPARK-24334] - Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory allocator
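
SPARK-23706 above turns on telling "no default supplied" apart from an explicit default of None. The following is a minimal sketch of the usual sentinel pattern for that distinction; it is not PySpark's actual code, and the names `_NO_VALUE` and `Conf` are hypothetical stand-ins chosen for illustration.

```python
# Hypothetical sentinel: a unique object that no caller can pass by accident.
_NO_VALUE = object()


class Conf:
    """Toy key/value config (an assumption for illustration, not a Spark class)."""

    def __init__(self, settings=None):
        self._settings = dict(settings or {})

    def get(self, key, default=_NO_VALUE):
        if key in self._settings:
            return self._settings[key]
        if default is _NO_VALUE:
            # No default was supplied at all, so a missing key is an error.
            raise KeyError(key)
        # An explicit default, including None, is returned as-is.
        return default
```

With this shape, `Conf().get("missing", default=None)` yields None, while `Conf().get("missing")` raises, which is the behavior the issue title asks for.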

Bug

  • [SPARK-10878] - Race condition when resolving Maven coordinates via Ivy
  • [SPARK-19181] - SparkListenerSuite.local metrics fails when average executorDeserializeTime is too short.
  • [SPARK-19613] - Flaky test: StateStoreRDDSuite
  • [SPARK-21945] - pyspark --py-files doesn't work in yarn client mode
  • [SPARK-22371] - dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
  • [SPARK-23004] - Structured Streaming raises "IllegalStateException: Cannot remove after already committed or aborted"
  • [SPARK-23020] - Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
  • [SPARK-23173] - from_json can produce nulls for fields which are marked as non-nullable
  • [SPARK-23288] - Incorrect number of written records in structured streaming
  • [SPARK-23291] - SparkR: substr: In a SparkR DataFrame, the starting and ending position arguments in "substr" give wrong results when the position is greater than 1
  • [SPARK-23340] - Upgrade Apache ORC to 1.4.3
  • [SPARK-23365] - DynamicAllocation with failure in straggler task can lead to a hung spark job
  • [SPARK-23406] - Stream-stream self-joins do not work
  • [SPARK-23433] - java.lang.IllegalStateException: more than one active taskSet for stage
  • [SPARK-23434] - Spark should not warn `metadata directory` for a HDFS file path
  • [SPARK-23436] - Incorrect Date column Inference in partition discovery
  • [SPARK-23438] - DStreams could lose blocks with WAL enabled when driver crashes
  • [SPARK-23448] - DataFrame returns wrong result when column doesn't respect datatype
  • [SPARK-23449] - Extra java options lose order in Docker context
  • [SPARK-23457] - Register task completion listeners first for ParquetFileFormat
  • [SPARK-23462] - Improve the error message in `StructType`
  • [SPARK-23489] - Flaky Test: HiveExternalCatalogVersionsSuite
  • [SPARK-23490] - Check storage.locationUri with existing table in CreateTable
  • [SPARK-23508] - blockManagerIdCache in BlockManagerId may cause oom
  • [SPARK-23517] - Make pyspark.util._exception_message produce the trace from Java side for Py4JJavaError
  • [SPARK-23523] - Incorrect result caused by the rule OptimizeMetadataOnlyQuery
  • [SPARK-23524] - Big local shuffle blocks should not be checked for corruption.
  • [SPARK-23525] - ALTER TABLE CHANGE COLUMN COMMENT doesn't work for external hive table
  • [SPARK-23551] - Exclude `hadoop-mapreduce-client-core` dependency from `orc-mapreduce`
  • [SPARK-23569] - pandas_udf does not work with type-annotated python functions
  • [SPARK-23570] - Add Spark-2.3 in HiveExternalCatalogVersionsSuite
  • [SPARK-23598] - WholeStageCodegen can lead to IllegalAccessError calling append for HashAggregateExec
  • [SPARK-23599] - The UUID() expression is too non-deterministic
  • [SPARK-23608] - SHS needs synchronization between attachSparkUI and detachSparkUI functions
  • [SPARK-23614] - Union produces incorrect results when caching is used
  • [SPARK-23623] - Avoid concurrent use of cached KafkaConsumer in CachedKafkaConsumer (kafka-0-10-sql)
  • [SPARK-23630] - Spark-on-YARN missing user customizations of hadoop config
  • [SPARK-23637] - YARN might allocate more resources if the same executor is killed multiple times.
  • [SPARK-23639] - SparkSQL CLI fails to talk to Kerberized metastore when using proxy user
  • [SPARK-23649] - CSV schema inferring fails on some UTF-8 chars
  • [SPARK-23658] - InProcessAppHandle uses the wrong class in getLogger
  • [SPARK-23660] - Yarn throws exception in cluster mode when the application is small
  • [SPARK-23670] - Memory leak of SparkPlanGraphWrapper in sparkUI
  • [SPARK-23671] - SHS is ignoring number of replay threads
  • [SPARK-23697] - Accumulators of Spark 1.x no longer work with Spark 2.x
  • [SPARK-23728] - ML test with expected exceptions testing streaming fails on 2.3
  • [SPARK-23729] - Glob resolution breaks remote naming of files/archives
  • [SPARK-23734] - InvalidSchemaException While Saving ALSModel
  • [SPARK-23754] - StopIteration exception in Python UDF results in partial result
  • [SPARK-23759] - Unable to bind Spark UI to specific host name / IP
  • [SPARK-23760] - CodegenContext.withSubExprEliminationExprs should save/restore CSE state correctly
  • [SPARK-23775] - Flaky test: DataFrameRangeSuite
  • [SPARK-23780] - Failed to use googleVis library with new SparkR
  • [SPARK-23788] - Race condition in StreamingQuerySuite
  • [SPARK-23802] - PropagateEmptyRelation can leave query plan in unresolved state
  • [SPARK-23806] - Broadcast.unpersist can cause fatal exception when used with dynamic allocation
  • [SPARK-23808] - Test spark sessions should set default session
  • [SPARK-23809] - Active SparkSession should be set by getOrCreate
  • [SPARK-23815] - Spark writer dynamic partition overwrite mode fails to write output on multi level partition
  • [SPARK-23816] - FetchFailedException when killing speculative task
  • [SPARK-23823] - ResolveReferences loses correct origin
  • [SPARK-23827] - StreamingJoinExec should ensure that input data is partitioned into specific number of partitions
  • [SPARK-23835] - When Dataset.as converts column from nullable to non-nullable type, null Doubles are converted silently to -1
  • [SPARK-23850] - We should not redact username|user|url from UI by default
  • [SPARK-23852] - Parquet MR bug can lead to incorrect SQL results
  • [SPARK-23853] - Skip doctests which require hive support built in PySpark
  • [SPARK-23868] - Fix scala.MatchError in literals.sql.out
  • [SPARK-23941] - Mesos task failed on specific spark app name
  • [SPARK-23971] - Should not leak Spark sessions across test suites
  • [SPARK-23986] - CompileException when using too many avg aggregation after joining
  • [SPARK-23989] - When using `SortShuffleWriter`, the data will be overwritten
  • [SPARK-23991] - data loss when allocateBlocksToBatch
  • [SPARK-24002] - Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes
  • [SPARK-24007] - EqualNullSafe for FloatType and DoubleType might generate a wrong result by codegen.
  • [SPARK-24021] - Fix bug in BlacklistTracker's updateBlacklistForFetchFailure
  • [SPARK-24022] - Flaky test: SparkContextSuite
  • [SPARK-24033] - LAG Window function broken in Spark 2.3
  • [SPARK-24062] - SASL encryption does not work in ThriftServer
  • [SPARK-24067] - Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction))
  • [SPARK-24068] - CSV schema inferring doesn't work for compressed files
  • [SPARK-24085] - Scalar subquery error
  • [SPARK-24104] - SQLAppStatusListener overwrites metrics onDriverAccumUpdates instead of updating them
  • [SPARK-24107] - ChunkedByteBuffer.writeFully method has not reset the limit value
  • [SPARK-24133] - Reading Parquet files containing large strings can fail with java.lang.ArrayIndexOutOfBoundsException
  • [SPARK-24166] - InMemoryTableScanExec should not access SQLConf at executor side
  • [SPARK-24168] - WindowExec should not access SQLConf at executor side
  • [SPARK-24169] - JsonToStructs should not access SQLConf at executor side
  • [SPARK-24214] - StreamingRelationV2/StreamingExecutionRelation/ContinuousExecutionRelation.toJSON should not fail
  • [SPARK-24230] - With Parquet 1.10 upgrade has errors in the vectorized reader
  • [SPARK-24255] - Require Java 8 in SparkR description
  • [SPARK-24257] - LongToUnsafeRowMap may calculate the new size incorrectly
  • [SPARK-24259] - ArrayWriter for Arrow produces wrong output
  • [SPARK-24263] - SparkR java check breaks on openjdk
  • [SPARK-24309] - AsyncEventQueue should handle an interrupt from a Listener
  • [SPARK-24313] - Collection functions interpreted execution doesn't work with complex types
  • [SPARK-24322] - Upgrade Apache ORC to 1.4.4
  • [SPARK-24364] - Files deletion after globbing may fail StructuredStreaming jobs
  • [SPARK-24373] - "df.cache() df.count()" no longer eagerly caches data when the analyzed plans are different after re-analyzing the plans
  • [SPARK-24384] - spark-submit --py-files with .py files doesn't work in client mode before context initialization
  • [SPARK-24399] - Reused Exchange is used where it should not be
  • [SPARK-24414] - Stages page doesn't show all task attempts when there are failures
  • [SPARK-26612] - Speculation kill causes finished stage to be recomputed
  • [SPARK-26614] - Speculation kill might cause job failure
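
The partial-result failure mode named in SPARK-23754 above can be reproduced in plain Python, without Spark: an exception of type StopIteration escaping a user function inside an iterator pipeline is read as "end of data". This is a sketch under that assumption, not Spark's fix; `first_two_only` and `fail_loudly` are hypothetical names.

```python
def first_two_only(x):
    # Hypothetical user function that leaks StopIteration for x >= 3.
    if x >= 3:
        raise StopIteration
    return x * 10


def fail_loudly(f):
    """Wrap f so a leaked StopIteration surfaces as an error, not end-of-data."""
    def wrapper(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except StopIteration as exc:
            raise RuntimeError("StopIteration raised inside user function") from exc
    return wrapper


# map() lets the StopIteration escape from its __next__, so list() treats it
# as normal exhaustion and the remaining inputs are silently dropped.
truncated = list(map(first_two_only, [1, 2, 3, 4]))
print(truncated)  # [10, 20]
```

Running the same inputs through `map(fail_loudly(first_two_only), ...)` raises RuntimeError instead of returning a shortened list, so the truncation can no longer pass unnoticed.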

New Feature

  • [SPARK-23948] - Trigger mapstage's job listener in submitMissingTasks
  • [SPARK-24465] - LSHModel should support Structured Streaming for transform

Improvement

  • [SPARK-23040] - BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator or ordering is specified
  • [SPARK-23553] - Tests should not assume the default value of `spark.sql.sources.default`
  • [SPARK-23624] - Revise doc of method pushFilters
  • [SPARK-23628] - WholeStageCodegen can generate methods with too many params
  • [SPARK-23644] - SHS with proxy doesn't show applications
  • [SPARK-23645] - pandas_udf can not be called with keyword arguments
  • [SPARK-23691] - Use sql_conf util in PySpark tests where possible
  • [SPARK-23695] - Confusing error message for PySpark's Kinesis tests when its jar is missing but enabled
  • [SPARK-23769] - Remove unnecessary scalastyle check disabling
  • [SPARK-23822] - Improve error message for Parquet schema mismatches
  • [SPARK-23838] - SparkUI: Running SQL query displayed as "completed" in SQL tab
  • [SPARK-23867] - com.codahale.metrics.Counter output in log message has no toString method
  • [SPARK-23962] - Flaky tests from SQLMetricsTestUtils.currentExecutionIds
  • [SPARK-23963] - Queries on text-based Hive tables grow disproportionately slower as the number of columns increases
  • [SPARK-24014] - Add onStreamingStarted method to StreamingListener
  • [SPARK-24128] - Mention spark.sql.crossJoin.enabled in implicit cartesian product error msg
  • [SPARK-24188] - /api/v1/version not working
  • [SPARK-24246] - Improve AnalysisException by setting the cause when it's available
  • [SPARK-24262] - Fix typo in UDF error message

Test

  • [SPARK-22882] - ML test for StructuredStreaming: spark.ml.classification
  • [SPARK-22883] - ML test for StructuredStreaming: spark.ml.feature, A-M
  • [SPARK-22915] - ML test for StructuredStreaming: spark.ml.feature, N-Z
  • [SPARK-23881] - Flaky test: JobCancellationSuite."interruptible iterator of shuffle reader"

Task

Request

  • [SPARK-28295] - Is there a way of getting feature names from pyspark.ml.regression GeneralizedLinearRegression?

Documentation

  • [SPARK-23329] - Update the function descriptions with the arguments and returned values of the trigonometric functions
  • [SPARK-23642] - isZero scaladoc for LongAccumulator describes wrong method
  • [SPARK-24378] - Incorrect examples for date_trunc function in spark 2.3.0
  • [SPARK-24444] - Improve pandas_udf GROUPED_MAP docs to explain column assignment
