Release Notes - Spark - Version 2.3.0

Sub-task

  • [SPARK-9104] - expose network layer memory usage
  • [SPARK-10365] - Support Parquet logical type TIMESTAMP_MICROS
  • [SPARK-11034] - Launcher: add support for monitoring Mesos apps
  • [SPARK-11035] - Launcher: allow apps to be launched in-process
  • [SPARK-12375] - VectorIndexer: allow unknown categories
  • [SPARK-13534] - Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas
  • [SPARK-13969] - Extend input format that feature hashing can handle
  • [SPARK-14280] - Update change-version.sh and pom.xml to add Scala 2.12 profiles
  • [SPARK-14650] - Compile Spark REPL for Scala 2.12
  • [SPARK-14878] - Support Trim characters in the string trim function
  • [SPARK-17074] - generate equi-height histogram for column
  • [SPARK-17139] - Add model summary for MultinomialLogisticRegression
  • [SPARK-17642] - Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
  • [SPARK-17729] - Enable creating hive bucketed tables
  • [SPARK-18016] - Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
  • [SPARK-18294] - Implement commit protocol to support `mapred` package's committer
  • [SPARK-19165] - UserDefinedFunction should verify call arguments and provide readable exception in case of mismatch
  • [SPARK-19357] - Parallel Model Evaluation for ML Tuning: Scala
  • [SPARK-19634] - Feature parity for descriptive statistics in MLlib
  • [SPARK-19762] - Implement aggregator/loss function hierarchy and apply to linear regression
  • [SPARK-19791] - Add doc and example for fpgrowth
  • [SPARK-20396] - groupBy().apply() with pandas udf in pyspark
  • [SPARK-20417] - Move error reporting for subquery from Analyzer to CheckAnalysis
  • [SPARK-20585] - R generic hint support
  • [SPARK-20641] - Key-value store abstraction and implementation for storing application data
  • [SPARK-20642] - Use key-value store to keep History Server application listing
  • [SPARK-20643] - Implement listener for saving application status data in key-value store
  • [SPARK-20644] - Hook up Spark UI to the new key-value store backend
  • [SPARK-20645] - Make Environment page use new app state store
  • [SPARK-20646] - Make Executors page use new app state store
  • [SPARK-20647] - Make the Storage page use new app state store
  • [SPARK-20648] - Make Jobs and Stages pages use the new app state store
  • [SPARK-20649] - Simplify REST API class hierarchy
  • [SPARK-20650] - Remove JobProgressListener (and other unneeded classes)
  • [SPARK-20652] - Make SQL UI use new app state store
  • [SPARK-20653] - Add auto-cleanup of old elements to the new app state store
  • [SPARK-20654] - Add controls for how much disk the SHS can use
  • [SPARK-20655] - In-memory key-value store implementation
  • [SPARK-20657] - Speed up Stage page
  • [SPARK-20664] - Remove stale applications from SHS listing
  • [SPARK-20727] - Skip SparkR tests when missing Hadoop winutils on CRAN windows machines
  • [SPARK-20748] - Built-in SQL Function Support - CH[A]R
  • [SPARK-20749] - Built-in SQL Function Support - all variants of LEN[GTH]
  • [SPARK-20750] - Built-in SQL Function Support - REPLACE
  • [SPARK-20751] - Built-in SQL Function Support - COT
  • [SPARK-20754] - Add Function Alias For MOD/TRUNCT/POSITION
  • [SPARK-20770] - Improve ColumnStats
  • [SPARK-20783] - Enhance ColumnVector to support compressed representation
  • [SPARK-20791] - Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame
  • [SPARK-20822] - Generate code to get value from CachedBatchColumnVector in ColumnarBatch
  • [SPARK-20881] - Clearly document the mechanism to choose between two sources of statistics
  • [SPARK-20909] - Built-in SQL Function Support - DAYOFWEEK
  • [SPARK-20910] - Built-in SQL Function Support - UUID
  • [SPARK-20931] - Built-in SQL Function ABS support string type
  • [SPARK-20948] - Built-in SQL Function UnaryMinus/UnaryPositive support string type
  • [SPARK-20961] - generalize the dictionary in ColumnVector
  • [SPARK-20962] - Support subquery column aliases in FROM clause
  • [SPARK-20963] - Support column aliases for aliased relation in FROM clause
  • [SPARK-20988] - Convert logistic regression to new aggregator framework
  • [SPARK-21007] - Add SQL function - RIGHT && LEFT
  • [SPARK-21031] - Add `alterTableStats` to store spark's stats and let `alterTable` keep existing stats
  • [SPARK-21046] - simplify the array offset and length in ColumnVector
  • [SPARK-21047] - Add test suites for complicated cases in ColumnarBatchSuite
  • [SPARK-21051] - Add hash map metrics to aggregate
  • [SPARK-21052] - Add hash map metrics to join
  • [SPARK-21083] - Store zero size and row count after analyzing empty table
  • [SPARK-21087] - CrossValidator, TrainValidationSplit should collect all models when fitting: Scala API
  • [SPARK-21127] - Update statistics after data changing commands
  • [SPARK-21180] - Remove conf from stats functions since now we have conf in LogicalPlan
  • [SPARK-21190] - SPIP: Vectorized UDFs in Python
  • [SPARK-21205] - pmod(number, 0) should be null
  • [SPARK-21213] - Support collecting partition-level statistics: rowCount and sizeInBytes
  • [SPARK-21237] - Invalidate stats once table data is changed
  • [SPARK-21322] - support histogram in filter cardinality estimation
  • [SPARK-21324] - Improve statistics test suites
  • [SPARK-21375] - Add date and timestamp support to ArrowConverters for toPandas() collection
  • [SPARK-21440] - Refactor ArrowConverters and add ArrayType and StructType support.
  • [SPARK-21456] - Make the driver failover_timeout configurable (Mesos cluster mode)
  • [SPARK-21552] - Add decimal type support to ArrowWriter.
  • [SPARK-21625] - Add incompatible Hive UDF describe to DOC
  • [SPARK-21654] - Complement predicates expression description
  • [SPARK-21671] - Move kvstore package to util.kvstore, add annotations
  • [SPARK-21720] - Filter predicate with many conditions throws stackoverflow error
  • [SPARK-21778] - Simpler Dataset.sample API in Scala / Java
  • [SPARK-21779] - Simpler Dataset.sample API in Python
  • [SPARK-21780] - Simpler Dataset.sample API in R
  • [SPARK-21805] - disable R vignettes code on Windows
  • [SPARK-21893] - Put Kafka 0.8 behind a profile
  • [SPARK-21895] - Support changing database in HiveClient
  • [SPARK-21934] - Expose Netty memory usage via Metrics System
  • [SPARK-21984] - Use histogram stats in join estimation
  • [SPARK-22026] - data source v2 write path
  • [SPARK-22032] - Speed up StructType.fromInternal
  • [SPARK-22053] - Implement stream-stream inner join in Append mode
  • [SPARK-22078] - clarify exception behaviors for all data source v2 interfaces
  • [SPARK-22086] - Add expression description for CASE WHEN
  • [SPARK-22087] - Clear remaining compile errors for 2.12; resolve most warnings
  • [SPARK-22100] - Make percentile_approx support date/timestamp type and change the output type to be the same as input type
  • [SPARK-22128] - Update paranamer to 2.8 to avoid BytecodeReadingParanamer ArrayIndexOutOfBoundsException with Scala 2.12 + Java 8 lambda
  • [SPARK-22136] - Implement stream-stream outer joins in append mode
  • [SPARK-22197] - push down operators to data source before planning
  • [SPARK-22221] - Add User Documentation for Working with Arrow in Spark
  • [SPARK-22226] - splitExpression can create too many method calls (generating a Constant Pool limit error)
  • [SPARK-22278] - Expose current event time watermark and current processing time in GroupState
  • [SPARK-22285] - Change implementation of ApproxCountDistinctForIntervals to TypedImperativeAggregate
  • [SPARK-22310] - Refactor join estimation to incorporate estimation logic for different kinds of statistics
  • [SPARK-22322] - Update FutureAction for compatibility with Scala 2.12 future
  • [SPARK-22324] - Upgrade Arrow to version 0.8.0 and upgrade Netty to 4.1.17
  • [SPARK-22344] - Prevent R CMD check from using /tmp
  • [SPARK-22361] - Add unit test for Window Frames
  • [SPARK-22363] - Add unit test for Window spilling
  • [SPARK-22387] - propagate session configs to data source read/write options
  • [SPARK-22389] - partitioning reporting
  • [SPARK-22392] - columnar reader interface
  • [SPARK-22400] - rename some APIs and classes to make their meaning clearer
  • [SPARK-22409] - Add function type argument to pandas_udf
  • [SPARK-22452] - DataSourceV2Options should have getInt, getBoolean, etc.
  • [SPARK-22475] - show histogram in DESC COLUMN command
  • [SPARK-22483] - Exposing java.nio bufferedPool memory metrics to metrics system
  • [SPARK-22494] - Coalesce and AtLeastNNonNulls can cause 64KB JVM bytecode limit exception
  • [SPARK-22498] - 64KB JVM bytecode limit problem with concat
  • [SPARK-22499] - 64KB JVM bytecode limit problem with least and greatest
  • [SPARK-22500] - 64KB JVM bytecode limit problem with cast
  • [SPARK-22501] - 64KB JVM bytecode limit problem with in
  • [SPARK-22508] - 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()
  • [SPARK-22514] - move ColumnVector.Array and ColumnarBatch.Row to individual files
  • [SPARK-22515] - Estimate relation size based on numRows * rowSize
  • [SPARK-22529] - Relation stats should be consistent with other plans based on cbo config
  • [SPARK-22530] - Add ArrayType Support for working with Pandas and Arrow
  • [SPARK-22542] - remove unused features in ColumnarBatch
  • [SPARK-22543] - fix java 64kb compile error for deeply nested expressions
  • [SPARK-22549] - 64KB JVM bytecode limit problem with concat_ws
  • [SPARK-22550] - 64KB JVM bytecode limit problem with elt
  • [SPARK-22570] - Create a lot of global variables to reuse an object in generated code
  • [SPARK-22602] - remove ColumnVector#loadBytes
  • [SPARK-22603] - 64KB JVM bytecode limit problem with FormatString
  • [SPARK-22604] - remove the get address methods from ColumnVector
  • [SPARK-22626] - Wrong Hive table statistics may trigger OOM if CBO is enabled
  • [SPARK-22643] - ColumnarArray should be an immutable view
  • [SPARK-22646] - Spark on Kubernetes - basic submission client
  • [SPARK-22648] - Documentation for Kubernetes Scheduler Backend
  • [SPARK-22652] - remove set methods in ColumnarRow
  • [SPARK-22669] - Avoid unnecessary function calls in code generation
  • [SPARK-22693] - Avoid the generation of useless mutable states in complexTypeCreator and predicates
  • [SPARK-22695] - Avoid the generation of useless mutable states by scalaUDF
  • [SPARK-22696] - Avoid the generation of useless mutable states by objects functions
  • [SPARK-22699] - Avoid the generation of useless mutable states by GenerateSafeProjection
  • [SPARK-22703] - ColumnarRow should be an immutable view
  • [SPARK-22716] - Avoid the creation of mutable states in addReferenceObj
  • [SPARK-22732] - Add DataSourceV2 streaming APIs
  • [SPARK-22733] - refactor StreamExecution for extensibility
  • [SPARK-22745] - read partition stats from Hive
  • [SPARK-22746] - Avoid the generation of useless mutable states by SortMergeJoin
  • [SPARK-22750] - Introduce reusable mutable states
  • [SPARK-22757] - Init-container in the driver/executor pods for downloading remote dependencies
  • [SPARK-22762] - Basic tests for IfCoercion and CaseWhenCoercion
  • [SPARK-22772] - elt should use splitExpressionsWithCurrentInputs to split expression codes
  • [SPARK-22775] - move dictionary related APIs from ColumnVector to WritableColumnVector
  • [SPARK-22785] - rename ColumnVector.anyNullsSet to hasNull
  • [SPARK-22789] - Add ContinuousExecution for continuous processing queries
  • [SPARK-22807] - Change configuration options to use "container" instead of "docker"
  • [SPARK-22816] - Basic tests for PromoteStrings and InConversion
  • [SPARK-22821] - Basic tests for WidenSetOperationTypes, BooleanEquality, StackCoercion and Division
  • [SPARK-22822] - Basic tests for WindowFrameCoercion and DecimalPrecision
  • [SPARK-22829] - Add new built-in function date_trunc()
  • [SPARK-22845] - Modify spark.kubernetes.allocation.batch.delay to take time instead of int
  • [SPARK-22848] - Avoid the generation of useless mutable states by Stack function
  • [SPARK-22890] - Basic tests for DateTimeOperations
  • [SPARK-22892] - Simplify some estimation logic by using double instead of decimal
  • [SPARK-22904] - Basic tests for decimal operations and string cast
  • [SPARK-22908] - add basic continuous kafka source
  • [SPARK-22909] - Move Structured Streaming v2 APIs to streaming package
  • [SPARK-22912] - Support v2 streaming sources and sinks in MicroBatchExecution
  • [SPARK-22917] - Should not try to generate histogram for empty/null columns
  • [SPARK-22930] - Improve the description of Vectorized UDFs for non-deterministic cases
  • [SPARK-22978] - Register Scalar Vectorized UDFs for SQL Statement
  • [SPARK-22980] - Using pandas_udf when inputs are not Pandas's Series or DataFrame
  • [SPARK-23033] - disable task-level retry for continuous execution
  • [SPARK-23045] - Have RFormula use OneHotEncoderEstimator
  • [SPARK-23046] - Have RFormula include VectorSizeHint in pipeline
  • [SPARK-23047] - Change MapVector to NullableMapVector in ArrowColumnVector
  • [SPARK-23052] - Migrate Microbatch ConsoleSink to v2
  • [SPARK-23063] - Changes to publish the spark-kubernetes package
  • [SPARK-23064] - Add documentation for stream-stream joins
  • [SPARK-23093] - don't modify run id
  • [SPARK-23107] - ML, Graph 2.3 QA: API: New Scala APIs, docs
  • [SPARK-23108] - ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
  • [SPARK-23110] - ML 2.3 QA: API: Java compatibility, docs
  • [SPARK-23111] - ML, Graph 2.3 QA: Update user guide for new features & APIs
  • [SPARK-23112] - ML, Graph 2.3 QA: Programming guide update and migration guide
  • [SPARK-23116] - SparkR 2.3 QA: Update user guide for new features & APIs
  • [SPARK-23118] - SparkR 2.3 QA: Programming guide, migration guide, vignettes updates
  • [SPARK-23137] - spark.kubernetes.executor.podNamePrefix is ignored
  • [SPARK-23196] - Unify continuous and microbatch V2 sinks
  • [SPARK-23218] - simplify ColumnVector.getArray
  • [SPARK-23219] - Rename ReadTask to DataReaderFactory
  • [SPARK-23260] - remove V2 from the class name of data source reader/writer
  • [SPARK-23261] - Rename Pandas UDFs
  • [SPARK-23262] - mix-in interface should extend the interface it aimed to mix in
  • [SPARK-23268] - Reorganize packages in data source V2
  • [SPARK-23272] - add calendar interval type support to ColumnVector
  • [SPARK-23280] - add map type support to ColumnVector
  • [SPARK-23314] - Pandas grouped udf on dataset with timestamp column error
  • [SPARK-23334] - Fix pandas_udf with return type StringType() to handle str type properly in Python 2.
  • [SPARK-23352] - Explicitly specify supported types in Pandas UDFs
  • [SPARK-23446] - Explicitly check supported types in toPandas
  • [SPARK-24077] - Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`

Bug

  • [SPARK-3151] - DiskStore attempts to map any size BlockId without checking MappedByteBuffer limit
  • [SPARK-3577] - Add task metric to report spill time
  • [SPARK-3685] - Spark's local dir should accept only local paths
  • [SPARK-5484] - Pregel should checkpoint periodically to avoid StackOverflowError
  • [SPARK-9825] - Spark overwrites remote cluster "final" properties with local config
  • [SPARK-10719] - SQLImplicits.rddToDataFrameHolder is not thread safe when using Scala 2.10
  • [SPARK-11334] - numRunningTasks can't be less than 0, or it will affect executor allocation
  • [SPARK-12552] - Recovered driver's resource is not counted in the Master
  • [SPARK-12559] - Cluster mode doesn't work with --packages
  • [SPARK-12717] - pyspark broadcast fails when using multiple threads
  • [SPARK-13669] - Job will always fail in the external shuffle service unavailable situation
  • [SPARK-13757] - support quoted column names in schema string at types.py#_parse_datatype_string
  • [SPARK-13933] - hadoop-2.7 profile's curator version should be 2.7.1
  • [SPARK-13983] - HiveThriftServer2 can not get "--hiveconf" or "--hivevar" variables since 1.6 version (both multi-session and single session)
  • [SPARK-14034] - Converting to Dataset causes wrong order and values in nested array of documents
  • [SPARK-14228] - Lost executor of RPC disassociated, and occurs exception: Could not find CoarseGrainedScheduler or it has been stopped
  • [SPARK-14387] - Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc
  • [SPARK-14408] - Update RDD.treeAggregate not to use reduce
  • [SPARK-14657] - RFormula output wrong features when formula w/o intercept
  • [SPARK-15243] - Binarizer.explainParam(u"...") raises ValueError
  • [SPARK-15474] - ORC data source fails to write and read back empty dataframe
  • [SPARK-16167] - RowEncoder should preserve array/map type nullability.
  • [SPARK-16542] - bugs about types that result in an array of null when creating dataframe using python
  • [SPARK-16548] - java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data
  • [SPARK-16605] - Spark2.0 cannot "select" data from a table stored as an orc file which has been created by hive while hive or spark1.6 supports
  • [SPARK-16628] - OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if metastore schema does not match schema stored in ORC files
  • [SPARK-16986] - "Started" time, "Completed" time and "Last Updated" time in history server UI are not user local time
  • [SPARK-17029] - Dataset toJSON goes through RDD form instead of transforming dataset itself
  • [SPARK-17047] - Spark 2 cannot create table when CLUSTERED.
  • [SPARK-17284] - Remove statistics-related table properties from SHOW CREATE TABLE
  • [SPARK-17321] - YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
  • [SPARK-17410] - Move Hive-generated Stats Info to HiveClientImpl
  • [SPARK-17528] - data should be copied properly before saving into InternalRow
  • [SPARK-17742] - Spark Launcher does not get failed state in Listener
  • [SPARK-17788] - RangePartitioner results in few very large tasks and many small to empty tasks
  • [SPARK-17851] - Make sure all test sqls in catalyst pass checkAnalysis
  • [SPARK-17902] - collect() ignores stringsAsFactors
  • [SPARK-17914] - Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp
  • [SPARK-17920] - HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url
  • [SPARK-18004] - DataFrame filter Predicate push-down fails for Oracle Timestamp type columns
  • [SPARK-18061] - Spark Thriftserver needs to create SPNego principal
  • [SPARK-18355] - Spark SQL fails to read data from a ORC hive table that has a new column added to it
  • [SPARK-18394] - Executing the same query twice in a row results in CodeGenerator cache misses
  • [SPARK-18608] - Spark ML algorithms that check RDD cache level for internal caching double-cache data
  • [SPARK-18646] - ExecutorClassLoader for spark-shell does not honor spark.executor.userClassPathFirst
  • [SPARK-18935] - Use Mesos "Dynamic Reservation" resource for Spark
  • [SPARK-18950] - Report conflicting fields when merging two StructTypes.
  • [SPARK-19109] - ORC metadata section can sometimes exceed protobuf message size limit
  • [SPARK-19122] - Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order
  • [SPARK-19326] - Speculated task attempts do not get launched in few scenarios
  • [SPARK-19372] - Code generation for Filter predicate including many OR conditions exceeds JVM method size limit
  • [SPARK-19451] - rangeBetween method should accept Long value as boundary
  • [SPARK-19471] - A confusing NullPointerException when creating table
  • [SPARK-19531] - History server doesn't refresh jobs for long-life apps like thriftserver
  • [SPARK-19580] - Support for avro.schema.url while writing to hive table
  • [SPARK-19644] - Memory leak in Spark Streaming (Encoder/Scala Reflection)
  • [SPARK-19688] - Spark on Yarn Credentials File set to different application directory
  • [SPARK-19726] - Failed to insert null timestamp value to mysql using spark jdbc
  • [SPARK-19753] - Remove all shuffle files on a host in case of slave lost or fetch failure
  • [SPARK-19809] - NullPointerException on zero-size ORC file
  • [SPARK-19812] - YARN shuffle service fails to relocate recovery DB across NFS directories
  • [SPARK-19824] - Standalone master JSON not showing cores for running applications
  • [SPARK-19900] - [Standalone] Master registers application again when driver relaunched
  • [SPARK-19910] - `stack` should not reject NULL values due to type mismatch
  • [SPARK-20025] - Driver fail over will not work, if SPARK_LOCAL* env is set.
  • [SPARK-20065] - Empty output files created for aggregation query in append mode
  • [SPARK-20079] - Re registration of AM hangs spark cluster in yarn-client mode
  • [SPARK-20098] - DataType's typeName method returns with 'StructF' in case of StructField
  • [SPARK-20140] - Remove hardcoded kinesis retry wait and max retries
  • [SPARK-20205] - DAGScheduler posts SparkListenerStageSubmitted before updating stage
  • [SPARK-20213] - DataFrameWriter operations do not show up in SQL tab
  • [SPARK-20256] - Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir
  • [SPARK-20288] - Improve BasicSchedulerIntegrationSuite "multi-stage job"
  • [SPARK-20311] - SQL "range(N) as alias" or "range(N) alias" doesn't work
  • [SPARK-20312] - query optimizer calls udf with null values when it doesn't expect them
  • [SPARK-20329] - Resolution error when HAVING clause uses GROUP BY expression that involves implicit type coercion
  • [SPARK-20333] - Fix HashPartitioner in DAGSchedulerSuite
  • [SPARK-20338] - Spaces in spark.eventLog.dir are not correctly handled
  • [SPARK-20341] - Support BigInteger values > 19 precision
  • [SPARK-20342] - DAGScheduler sends SparkListenerTaskEnd before updating task's accumulators
  • [SPARK-20345] - Fix STS error handling logic on HiveSQLException
  • [SPARK-20356] - Spark sql group by returns incorrect results after join + distinct transformations
  • [SPARK-20359] - Catalyst EliminateOuterJoin optimization can cause NPE
  • [SPARK-20365] - Not so accurate classpath format for AM and Containers
  • [SPARK-20367] - Spark silently escapes partition column names
  • [SPARK-20380] - describe table not showing updated table comment after alter operation
  • [SPARK-20412] - NullPointerException in places expecting non-optional partitionSpec.
  • [SPARK-20427] - Issue with Spark interpreting Oracle datatype NUMBER
  • [SPARK-20439] - Catalog.listTables() depends on all libraries used to create tables
  • [SPARK-20451] - Filter out nested mapType datatypes from sort order in randomSplit
  • [SPARK-20453] - Bump master branch version to 2.3.0-SNAPSHOT
  • [SPARK-20466] - HadoopRDD#addLocalConfiguration throws NPE
  • [SPARK-20541] - SparkR SS should support awaitTermination without timeout
  • [SPARK-20543] - R should skip long running or non-essential tests when running on CRAN
  • [SPARK-20565] - Improve the error message for unsupported JDBC types
  • [SPARK-20569] - RuntimeReplaceable functions accept invalid third parameter
  • [SPARK-20586] - Add deterministic to ScalaUDF
  • [SPARK-20591] - Succeeded tasks num not equal in job page and job detail page on spark web ui when speculative task(s) exist
  • [SPARK-20605] - Deprecate not used AM and executor port configuration
  • [SPARK-20609] - Running the SortShuffleSuite unit tests leaves residual spark_* system directory
  • [SPARK-20613] - Double quotes in Windows batch script
  • [SPARK-20626] - Fix SparkR test warning on Windows with timestamp time zone
  • [SPARK-20633] - FileFormatWriter wrap the FetchFailedException which breaks job's failover
  • [SPARK-20640] - Make rpc timeout and retry for shuffle registration configurable
  • [SPARK-20689] - python doctest leaking bucketed table
  • [SPARK-20690] - Subqueries in FROM should have alias names
  • [SPARK-20704] - CRAN test should run single threaded
  • [SPARK-20706] - Spark-shell not overriding method/variable definition
  • [SPARK-20708] - Make `addExclusionRules` up-to-date
  • [SPARK-20713] - Speculative task that got CommitDenied exception shows up as failed
  • [SPARK-20719] - Support LIMIT ALL
  • [SPARK-20756] - yarn-shuffle jar has references to unshaded guava and contains scala classes
  • [SPARK-20786] - Improve how ceil and floor handle unexpected values
  • [SPARK-20815] - NullPointerException in RPackageUtils#checkManifestForR
  • [SPARK-20832] - Standalone master should explicitly inform drivers of worker deaths and invalidate external shuffle service outputs
  • [SPARK-20865] - caching dataset throws "Queries with streaming sources must be executed with writeStream.start()"
  • [SPARK-20873] - Improve the error message for unsupported Column Type
  • [SPARK-20876] - If the input parameter for ceil or floor is of float type, the result is not what we expected
  • [SPARK-20898] - spark.blacklist.killBlacklistedExecutors doesn't work in YARN
  • [SPARK-20904] - Task failures during shutdown cause problems with preempted executors
  • [SPARK-20906] - Constrained Logistic Regression for SparkR
  • [SPARK-20914] - Javadoc contains code that is invalid
  • [SPARK-20916] - Improve error message for unaliased subqueries in FROM clause
  • [SPARK-20918] - Use FunctionIdentifier as function identifiers in FunctionRegistry
  • [SPARK-20922] - Unsafe deserialization in Spark LauncherConnection
  • [SPARK-20923] - TaskMetrics._updatedBlockStatuses uses a lot of memory
  • [SPARK-20926] - Exposure to Guava libraries by directly accessing tableRelationCache in SessionCatalog caused failures
  • [SPARK-20935] - A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating StreamingContext.
  • [SPARK-20945] - NoSuchElementException key not found in TaskSchedulerImpl
  • [SPARK-20976] - Unify Error Messages for FAILFAST mode.
  • [SPARK-20978] - CSV emits NPE when the number of tokens is less than given schema and corrupt column is given
  • [SPARK-20989] - Fail to start multiple workers on one host if external shuffle service is enabled in standalone mode
  • [SPARK-20991] - BROADCAST_TIMEOUT conf should be a timeoutConf
  • [SPARK-20997] - spark-submit's --driver-cores marked as "YARN-only" but listed under "Spark standalone with cluster deploy mode only"
  • [SPARK-21033] - fix the potential OOM in UnsafeExternalSorter
  • [SPARK-21041] - With whole-stage codegen, SparkSession.range()'s behavior is inconsistent with SparkContext.range()
  • [SPARK-21050] - ml word2vec write has overflow issue in calculating numPartitions
  • [SPARK-21055] - Support grouping__id
  • [SPARK-21057] - Do not use a PascalDistribution in countApprox
  • [SPARK-21064] - Fix the default value bug in NettyBlockTransferServiceSuite
  • [SPARK-21066] - LibSVM load just one input file
  • [SPARK-21093] - Multiple gapply execution occasionally failed in SparkR
  • [SPARK-21101] - Error running Hive temporary UDTF on latest Spark 2.2
  • [SPARK-21102] - Refresh command is too aggressive in parsing
  • [SPARK-21112] - ALTER TABLE SET TBLPROPERTIES should not overwrite COMMENT
  • [SPARK-21119] - unset table properties should keep the table comment
  • [SPARK-21124] - Wrong user shown in UI when using kerberos
  • [SPARK-21138] - Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different
  • [SPARK-21145] - Restarted queries reuse same StateStoreProvider, causing multiple concurrent tasks to update same StateStore
  • [SPARK-21147] - the schema of socket/rate source can not be set.
  • [SPARK-21163] - DataFrame.toPandas should respect the data type
  • [SPARK-21165] - Fail to write into partitioned hive table due to attribute reference not working with cast on partition column
  • [SPARK-21167] - Path is not decoded correctly when reading output of FileSink
  • [SPARK-21170] - Utils.tryWithSafeFinallyAndFailureCallbacks throws IllegalArgumentException: Self-suppression not permitted
  • [SPARK-21181] - Suppress memory leak errors reported by netty
  • [SPARK-21188] - releaseAllLocksForTask should synchronize the whole method
  • [SPARK-21204] - RuntimeException with Set and Case Class in Spark 2.1.1
  • [SPARK-21216] - Streaming DataFrames fail to join with Hive tables
  • [SPARK-21219] - Task retry occurs on same executor due to race condition with blacklisting
  • [SPARK-21223] - Thread-safety issue in FsHistoryProvider
  • [SPARK-21225] - Decrease the memory used by variable 'tasks' in function resourceOffers
  • [SPARK-21228] - InSet incorrect handling of structs
  • [SPARK-21248] - Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite.assign from specific offsets (failOnDataLoss: true)
  • [SPARK-21254] - History UI: Taking over 1 minute for initial page display
  • [SPARK-21255] - NPE when creating encoder for enum
  • [SPARK-21263] - NumberFormatException is not thrown while converting an invalid string to float/double
  • [SPARK-21264] - Omitting columns with 'how' specified in join in PySpark throws NPE
  • [SPARK-21271] - UnsafeRow.hashCode assertion when sizeInBytes not multiple of 8
  • [SPARK-21272] - SortMergeJoin LeftAnti does not update numOutputRows
  • [SPARK-21278] - Upgrade to Py4J 0.10.6
  • [SPARK-21281] - cannot create empty typed array column
  • [SPARK-21283] - FileOutputStream should be created as append mode
  • [SPARK-21284] - rename SessionCatalog.registerFunction parameter name
  • [SPARK-21300] - ExternalMapToCatalyst should null-check map key prior to converting to internal value.
  • [SPARK-21306] - OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier
  • [SPARK-21312] - UnsafeRow writeToStream has incorrect offsetInByteArray calculation for non-zero offset
  • [SPARK-21319] - UnsafeExternalRowSorter.RowComparator memory leak
  • [SPARK-21327] - ArrayConstructor should handle an array of typecode 'l' as long rather than int in Python 2.
  • [SPARK-21330] - Bad partitioning does not allow to read a JDBC table with extreme values on the partition column
  • [SPARK-21332] - Incorrect result type inferred for some decimal expressions
  • [SPARK-21333] - joinWith documents and analysis allow invalid join types
  • [SPARK-21335] - support un-aliased subquery
  • [SPARK-21338] - AggregatedDialect doesn't override isCascadingTruncateTable() method
  • [SPARK-21339] - spark-shell --packages option does not add jars to classpath on windows
  • [SPARK-21342] - Fix DownloadCallback to work well with RetryingBlockFetcher
  • [SPARK-21343] - Refine the document for spark.reducer.maxReqSizeShuffleToMem
  • [SPARK-21345] - SparkSessionBuilderSuite should clean up stopped sessions
  • [SPARK-21350] - Fix the error message when the number of arguments is wrong when invoking a UDF
  • [SPARK-21354] - INPUT FILE related functions do not support more than one sources
  • [SPARK-21357] - FileInputDStream does not remove out-of-date RDDs
  • [SPARK-21369] - Don't use Scala classes in external shuffle service
  • [SPARK-21374] - Reading globbed paths from S3 into DF doesn't work if filesystem caching is disabled
  • [SPARK-21376] - Token is not renewed in yarn client process in cluster mode
  • [SPARK-21377] - Jars specified with --jars or --packages are not added into AM's system classpath
  • [SPARK-21383] - YARN can allocate too many executors
  • [SPARK-21384] - Spark 2.2 + YARN without spark.yarn.jars / spark.yarn.archive fails
  • [SPARK-21394] - Reviving broken callable objects in UDF in PySpark
  • [SPARK-21400] - Spark shouldn't ignore user defined output committer in append mode
  • [SPARK-21403] - Cluster mode doesn't work with --packages [Mesos]
  • [SPARK-21411] - Failed to get new HDFS delegation tokens in AMCredentialRenewer
  • [SPARK-21414] - Buffer in SlidingWindowFunctionFrame could be big though window is small
  • [SPARK-21418] - NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true
  • [SPARK-21422] - Depend on Apache ORC 1.4.0
  • [SPARK-21428] - CliSessionState is never recognized because of IsolatedClientLoader
  • [SPARK-21432] - Reviving broken partial functions in UDF in PySpark
  • [SPARK-21439] - Cannot use Spark with Python ABCmeta (exception from cloudpickle)
  • [SPARK-21441] - Incorrect Codegen in SortMergeJoinExec results in failures in some cases
  • [SPARK-21444] - Fetch failure due to node reboot causes job failure
  • [SPARK-21445] - NotSerializableException thrown by UTF8String.IntWrapper
  • [SPARK-21446] - [SQL] JDBC Postgres fetchsize parameter ignored again
  • [SPARK-21447] - Spark history server fails to render compressed inprogress history file in some cases.
  • [SPARK-21451] - HiveConf in SparkSQLCLIDriver doesn't respect spark.hadoop.some.hive.variables
  • [SPARK-21457] - ExternalCatalog.listPartitions should correctly handle partition values with dot
  • [SPARK-21462] - Add batchId to the json of StreamingQueryProgress
  • [SPARK-21463] - Output of StructuredStreaming tables don't respect user specified schema when reading back the table
  • [SPARK-21490] - SparkLauncher may fail to redirect streams
  • [SPARK-21494] - Spark 2.2.0 AES encryption not working with External shuffle
  • [SPARK-21498] - Quick start: one Python demo has a bug in its code
  • [SPARK-21501] - Spark shuffle index cache size should be memory based
  • [SPARK-21502] - --supervise causing frameworkId conflicts in mesos cluster mode
  • [SPARK-21503] - Spark UI shows incorrect task status for a killed Executor Process
  • [SPARK-21508] - Documentation on 'Spark Streaming Custom Receivers' has error in example code
  • [SPARK-21512] - DatasetCacheSuite needs to execute unpersist after executing persist
  • [SPARK-21516] - overriding afterEach() in DatasetCacheSuite must call super.afterEach()
  • [SPARK-21522] - Flaky test: LauncherServerSuite.testStreamFiltering
  • [SPARK-21523] - Fix bug where strong Wolfe line search `init` parameter loses effectiveness
  • [SPARK-21534] - PickleException when creating dataframe from python row with empty bytearray
  • [SPARK-21541] - Spark Logs show incorrect job status for a job that does not create SparkContext
  • [SPARK-21546] - dropDuplicates with watermark yields RuntimeException due to binding failure
  • [SPARK-21549] - Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs
  • [SPARK-21551] - pyspark's collect fails when getaddrinfo is too slow
  • [SPARK-21555] - GROUP BY doesn't work with expressions with NVL and nested objects
  • [SPARK-21563] - Race condition when serializing TaskDescriptions and adding jars
  • [SPARK-21565] - aggregate query fails with watermark on eventTime but works with watermark on timestamp column generated by current_timestamp
  • [SPARK-21567] - Dataset with Tuple of type alias throws error
  • [SPARK-21568] - ConsoleProgressBar should only be enabled in shells
  • [SPARK-21571] - Spark history server leaves incomplete or unreadable history files around forever.
  • [SPARK-21580] - A bug with `Group by ordinal`
  • [SPARK-21585] - Application Master marking application status as Failed for Client Mode
  • [SPARK-21587] - Filter pushdown for EventTime Watermark Operator
  • [SPARK-21588] - SQLContext.getConf(key, null) should return null, but it throws NPE
  • [SPARK-21593] - Fix broken configuration page
  • [SPARK-21595] - introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 breaks existing workflow
  • [SPARK-21596] - Audit the places calling HDFSMetadataLog.get
  • [SPARK-21597] - Avg event time calculated in progress may be wrong
  • [SPARK-21599] - Collecting column statistics for datasource tables may fail with java.util.NoSuchElementException
  • [SPARK-21605] - Let IntelliJ IDEA correctly detect Language level and Target byte code version
  • [SPARK-21610] - Corrupt records are not handled properly when creating a dataframe from a file
  • [SPARK-21615] - Fix broken redirect in collaborative filtering docs to databricks training repo
  • [SPARK-21617] - ALTER TABLE...ADD COLUMNS broken in Hive 2.1 for DS tables
  • [SPARK-21621] - Reset numRecordsWritten after DiskBlockObjectWriter.commitAndGet called
  • [SPARK-21637] - `hive.metastore.warehouse` in --hiveconf is not respected
  • [SPARK-21638] - Warning message of RF is not accurate
  • [SPARK-21642] - Use FQDN for DRIVER_HOST_ADDRESS instead of ip address
  • [SPARK-21644] - LocalLimit.maxRows is defined incorrectly
  • [SPARK-21647] - SortMergeJoin failed when using CROSS
  • [SPARK-21648] - Confusing assert failure in JDBC source when users misspell the option `partitionColumn`
  • [SPARK-21652] - Optimizer cannot reach a fixed point on certain queries
  • [SPARK-21656] - spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them
  • [SPARK-21657] - Spark has exponential time complexity to explode(array of structs)
  • [SPARK-21677] - json_tuple throws NullPointerException when column is null as string type.
  • [SPARK-21681] - MLOR does not work correctly when featureStd contains zero
  • [SPARK-21696] - State Store can't handle corrupted snapshots
  • [SPARK-21714] - SparkSubmit in Yarn Client mode downloads remote files and then reuploads them again
  • [SPARK-21721] - Memory leak in org.apache.spark.sql.hive.execution.InsertIntoHiveTable
  • [SPARK-21723] - Can't write LibSVM - key not found: numFeatures
  • [SPARK-21727] - Operating on an ArrayType in a SparkR DataFrame throws error
  • [SPARK-21738] - Thriftserver doesn't cancel jobs when session is closed
  • [SPARK-21739] - timestamp partition would fail in v2.2.0
  • [SPARK-21753] - running pi example with pypy on spark fails to serialize
  • [SPARK-21759] - In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery
  • [SPARK-21762] - FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new file isn't yet visible
  • [SPARK-21766] - DataFrame toPandas() raises ValueError with nullable int columns
  • [SPARK-21767] - Add Decimal Test For Avro in VersionSuite
  • [SPARK-21782] - Repartition creates skews when numPartitions is a power of 2
  • [SPARK-21786] - The 'spark.sql.parquet.compression.codec' configuration doesn't take effect on tables with partition field(s)
  • [SPARK-21788] - Handle more exceptions when stopping a streaming query
  • [SPARK-21791] - ORC should support column names with dot
  • [SPARK-21793] - Correct validateAndTransformSchema in GaussianMixture and AFTSurvivalRegression
  • [SPARK-21798] - No config to replace deprecated SPARK_CLASSPATH config for launching daemons like History Server
  • [SPARK-21801] - SparkR unit test randomly fail on trees
  • [SPARK-21804] - json_tuple returns null values within repeated columns except the first one
  • [SPARK-21818] - MultivariateOnlineSummarizer.variance generates negative result
  • [SPARK-21826] - outer broadcast hash join should not throw NPE
  • [SPARK-21830] - Bump the dependency of ANTLR to version 4.7
  • [SPARK-21831] - Remove `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite
  • [SPARK-21832] - Merge SQLBuilderTest into ExpressionSQLBuilderSuite
  • [SPARK-21834] - Incorrect executor request in case of dynamic allocation
  • [SPARK-21835] - RewritePredicateSubquery should not produce unresolved query plans
  • [SPARK-21837] - UserDefinedTypeSuite local UDFs not actually testing what it intends
  • [SPARK-21845] - Make codegen fallback of expressions configurable
  • [SPARK-21877] - Windows command script can not handle quotes in parameter
  • [SPARK-21880] - [Spark UI] In the SQL table page, modify jobs trace information
  • [SPARK-21890] - ObtainCredentials does not pass creds to addDelegationTokens
  • [SPARK-21904] - Rename tempTables to tempViews in SessionCatalog
  • [SPARK-21907] - NullPointerException in UnsafeExternalSorter.spill()
  • [SPARK-21912] - ORC/Parquet table should not create invalid column names
  • [SPARK-21913] - `withDatabase` should drop database with CASCADE
  • [SPARK-21917] - Remote http(s) resources are not supported in YARN mode
  • [SPARK-21922] - When an executor fails and task metrics have not been sent to the driver, the status will always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'
  • [SPARK-21924] - Bug in Structured Streaming Documentation
  • [SPARK-21928] - ClassNotFoundException for custom Kryo registrator class during serde in netty threads
  • [SPARK-21929] - Support `ALTER TABLE table_name ADD COLUMNS(..)` for ORC data source
  • [SPARK-21941] - Stop storing unused attemptId in SQLTaskMetrics
  • [SPARK-21946] - Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table`
  • [SPARK-21947] - monotonically_increasing_id doesn't work in Structured Streaming
  • [SPARK-21950] - pyspark.sql.tests.SQLTests2 should stop SparkContext.
  • [SPARK-21953] - Show both memory and disk bytes spilled if either is present
  • [SPARK-21954] - JacksonUtils should verify MapType's value type instead of key type
  • [SPARK-21958] - Attempting to save large Word2Vec model hangs driver in constant GC.
  • [SPARK-21969] - CommandUtils.updateTableStats should call refreshTable
  • [SPARK-21977] - SinglePartition optimizations break certain Streaming Stateful Aggregation requirements
  • [SPARK-21979] - Improve QueryPlanConstraints framework
  • [SPARK-21980] - References in grouping functions should be indexed with resolver
  • [SPARK-21985] - PySpark PairDeserializer is broken for double-zipped RDDs
  • [SPARK-21987] - Spark 2.3 cannot read 2.2 event logs
  • [SPARK-21991] - [LAUNCHER] LauncherServer acceptConnections thread sometimes dies if machine has very high load
  • [SPARK-21996] - Streaming ignores files with spaces in the file names
  • [SPARK-21998] - SortMergeJoinExec did not calculate its outputOrdering correctly during physical planning
  • [SPARK-22017] - watermark evaluation with multi-input stream operators is unspecified
  • [SPARK-22018] - Catalyst Optimizer does not preserve top-level metadata while collapsing projects
  • [SPARK-22030] - GraphiteSink fails to re-connect to Graphite instances behind an ELB or any other auto-scaled LB
  • [SPARK-22033] - BufferHolder, other size checks should account for the specific VM array size limitations
  • [SPARK-22036] - BigDecimal multiplication sometimes returns null
  • [SPARK-22042] - ReorderJoinPredicates can break when child's partitioning is not decided
  • [SPARK-22047] - HiveExternalCatalogVersionsSuite is Flaky on Jenkins
  • [SPARK-22052] - Incorrect Metric assigned in MetricsReporter.scala
  • [SPARK-22060] - CrossValidator/TrainValidationSplit parallelism param persist/load bug
  • [SPARK-22062] - BlockManager does not account for memory consumed by remote fetches
  • [SPARK-22067] - ArrowWriter StringWriter not using position of ByteBuffer holding data
  • [SPARK-22071] - Improve release build scripts to check correct JAVA version is being used for build
  • [SPARK-22074] - Task killed by other attempt task should not be resubmitted
  • [SPARK-22076] - Expand.projections should not be a Stream
  • [SPARK-22083] - When dropping multiple blocks to disk, Spark should release all locks on a failure
  • [SPARK-22088] - Incorrect scalastyle comment causes wrong styles in stringExpressions
  • [SPARK-22092] - Reallocation in OffHeapColumnVector.reserveInternal corrupts array data
  • [SPARK-22093] - UtilsSuite "resolveURIs with multiple paths" test always cancelled
  • [SPARK-22094] - processAllAvailable should not block forever when a query is stopped
  • [SPARK-22097] - Request an accurate memory after we unrolled the block
  • [SPARK-22107] - "as" should be "alias" in python quick start documentation
  • [SPARK-22109] - Reading tables partitioned by columns that look like timestamps has inconsistent schema inference
  • [SPARK-22129] - Spark release scripts ignore the GPG_KEY and always sign with your default key
  • [SPARK-22135] - metrics in spark-dispatcher not being registered properly
  • [SPARK-22141] - Propagate empty relation before checking Cartesian products
  • [SPARK-22143] - OffHeapColumnVector may leak memory
  • [SPARK-22145] - Issues with driver re-starting on mesos (supervise)
  • [SPARK-22146] - FileNotFoundException while reading ORC files containing '%'
  • [SPARK-22158] - convertMetastore should not ignore storage properties
  • [SPARK-22159] - spark.sql.execution.arrow.enable and spark.sql.codegen.aggregate.map.twolevel.enable -> enabled
  • [SPARK-22162] - Executors and the driver use inconsistent Job IDs during the new RDD commit protocol
  • [SPARK-22165] - Type conflicts between dates, timestamps and date in partition column
  • [SPARK-22167] - Spark Packaging w/R distro issues
  • [SPARK-22169] - support byte length literal as identifier
  • [SPARK-22171] - Describe Table Extended Failed when Table Owner is Empty
  • [SPARK-22172] - Worker hangs when the external shuffle service port is already in use
  • [SPARK-22176] - Dataset.show(Int.MaxValue) hits integer overflows
  • [SPARK-22178] - Refresh Table does not refresh the underlying tables of the persistent view
  • [SPARK-22206] - gapply in R can't work on empty grouping columns
  • [SPARK-22209] - PySpark does not recognize imports from submodules
  • [SPARK-22211] - LimitPushDown optimization for FullOuterJoin generates wrong results
  • [SPARK-22218] - spark shuffle services fails to update secret on application re-attempts
  • [SPARK-22222] - Fix the ARRAY_MAX in BufferHolder and add a test
  • [SPARK-22223] - ObjectHashAggregate introduces unnecessary shuffle
  • [SPARK-22224] - Override toString of KeyValueGroupedDataset & RelationalGroupedDataset
  • [SPARK-22227] - DiskBlockManager.getAllBlocks could fail if called during shuffle
  • [SPARK-22230] - agg(last('attr)) gives weird results for streaming
  • [SPARK-22238] - EnsureStatefulOpPartitioning shouldn't ask for the child RDD before planning is completed
  • [SPARK-22243] - streaming job failed to restart from checkpoint
  • [SPARK-22249] - UnsupportedOperationException: empty.reduceLeft when caching a dataframe
  • [SPARK-22251] - Metric "aggregate time" is incorrect when codegen is off
  • [SPARK-22252] - FileFormatWriter should respect the input query schema
  • [SPARK-22254] - clean up the implementation of `growToSize` in CompactBuffer
  • [SPARK-22257] - Reserve all non-deterministic expressions in ExpressionSet.
  • [SPARK-22267] - Spark SQL incorrectly reads ORC file when column order is different
  • [SPARK-22271] - Describe results in "null" for the value of "mean" of a numeric variable
  • [SPARK-22273] - Fix key/value schema field names in HashMapGenerators.
  • [SPARK-22280] - Improve StatisticsSuite to test `convertMetastore` properly
  • [SPARK-22281] - Handle R method breaking signature changes
  • [SPARK-22284] - Code of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
  • [SPARK-22287] - SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher
  • [SPARK-22289] - Cannot save LogisticRegressionModel with bounds on coefficients
  • [SPARK-22290] - Starting second context in same JVM fails to get new Hive delegation token
  • [SPARK-22291] - Postgresql UUID[] to Cassandra: Conversion Error
  • [SPARK-22300] - Update ORC to 1.4.1
  • [SPARK-22303] - Getting java.sql.SQLException: Unsupported type 101 for BINARY_DOUBLE
  • [SPARK-22305] - HDFSBackedStateStoreProvider fails with StackOverflowException when attempting to recover state
  • [SPARK-22306] - INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
  • [SPARK-22319] - SparkSubmit calls getFileStatus before calling loginUserFromKeytab
  • [SPARK-22326] - Remove unnecessary hashCode and equals methods
  • [SPARK-22327] - R CRAN check fails on non-latest branches
  • [SPARK-22328] - ClosureCleaner misses referenced superclass fields, gives them null values
  • [SPARK-22330] - Linear containsKey operation for serialized maps.
  • [SPARK-22332] - NaiveBayes unit test occasionally fails
  • [SPARK-22333] - ColumnReference should get higher priority than timeFunctionCall(CURRENT_DATE, CURRENT_TIMESTAMP)
  • [SPARK-22349] - In on-heap mode, when allocating memory from pool, we should fill memory with `MEMORY_DEBUG_FILL_CLEAN_VALUE`
  • [SPARK-22355] - Dataset.collect is not threadsafe
  • [SPARK-22356] - data source table should support overlapped columns between data and partition schema
  • [SPARK-22370] - Config values should be captured in Driver.
  • [SPARK-22373] - Intermittent NullPointerException in org.codehaus.janino.IClass.isAssignableFrom
  • [SPARK-22375] - Test script can fail if eggs are installed by setup.py during test process
  • [SPARK-22376] - run-tests.py fails at exec-sbt if run with Python 3
  • [SPARK-22377] - Maven nightly snapshot jenkins jobs are broken on multiple workers due to lsof
  • [SPARK-22393] - spark-shell can't find imported types in class constructors, extends clause
  • [SPARK-22395] - Fix the behavior of timestamp values for Pandas to respect session timezone
  • [SPARK-22396] - Unresolved operator InsertIntoDir for Hive format when Hive Support is not enabled
  • [SPARK-22403] - StructuredKafkaWordCount example fails in YARN cluster mode
  • [SPARK-22410] - Excessive spill for Pyspark UDF when a row has shrunk
  • [SPARK-22417] - createDataFrame from a pandas.DataFrame reads datetime64 values as longs
  • [SPARK-22429] - Streaming checkpointing code does not retry after failure due to NullPointerException
  • [SPARK-22431] - Creating Permanent view with illegal type
  • [SPARK-22437] - jdbc write fails to set default mode
  • [SPARK-22442] - Schema generated by Product Encoder doesn't match case class field name when using non-standard characters
  • [SPARK-22443] - AggregatedDialect doesn't override quoteIdentifier and other methods in JdbcDialects
  • [SPARK-22446] - Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
  • [SPARK-22454] - ExternalShuffleClient.close() should check null
  • [SPARK-22462] - SQL metrics missing after foreach operation on dataframe
  • [SPARK-22463] - Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to distributed archive
  • [SPARK-22464] - <=> is not supported by Hive metastore partition predicate pushdown
  • [SPARK-22465] - Cogroup of two disproportionate RDDs could lead into 2G limit BUG
  • [SPARK-22466] - SPARK_CONF_DIR is not set by Spark's launch scripts with default value
  • [SPARK-22469] - Accuracy problem in comparison with string and numeric
  • [SPARK-22472] - Datasets generate random values for null primitive types
  • [SPARK-22479] - SaveIntoDataSourceCommand logs jdbc credentials
  • [SPARK-22484] - PySpark DataFrame.write.csv(quote="") uses nullchar as quote
  • [SPARK-22487] - No usages of HIVE_EXECUTION_VERSION found in whole spark project
  • [SPARK-22488] - The view resolution in the SparkSession internal table() API
  • [SPARK-22489] - Shouldn't change broadcast join buildSide if user clearly specified
  • [SPARK-22495] - Fix setup of SPARK_HOME variable on Windows
  • [SPARK-22511] - Update maven central repo address
  • [SPARK-22516] - CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character
  • [SPARK-22525] - Spark download page doesn't update package name based on package type
  • [SPARK-22533] - SparkConfigProvider does not handle deprecated config keys
  • [SPARK-22535] - PythonRunner.MonitorThread should give the task a little time to finish before killing the python worker
  • [SPARK-22538] - SQLTransformer.transform(inputDataFrame) uncaches inputDataFrame
  • [SPARK-22540] - HighlyCompressedMapStatus's avgSize is incorrect
  • [SPARK-22544] - FileStreamSource should use its own hadoop conf to call globPathIfNecessary
  • [SPARK-22548] - Incorrect nested AND expression pushed down to JDBC data source
  • [SPARK-22557] - Use ThreadSignaler explicitly
  • [SPARK-22559] - history server: handle exception on opening corrupted listing.ldb
  • [SPARK-22572] - spark-shell does not re-initialize on :replay
  • [SPARK-22574] - Wrong request causing Spark Dispatcher going inactive
  • [SPARK-22583] - First delegation token renewal time is not 75% of renewal time in Mesos
  • [SPARK-22585] - Url encoding of jar path expected?
  • [SPARK-22587] - Spark job fails if fs.defaultFS and application jar are different url
  • [SPARK-22591] - GenerateOrdering shouldn't change ctx.INPUT_ROW
  • [SPARK-22605] - OutputMetrics empty for DataFrame writes
  • [SPARK-22607] - Set large stack size consistently for tests to avoid StackOverflowError
  • [SPARK-22615] - Handle more cases in PropagateEmptyRelation
  • [SPARK-22618] - RDD.unpersist can cause fatal exception when used with dynamic allocation
  • [SPARK-22635] - FileNotFoundException again while reading ORC files containing special characters
  • [SPARK-22637] - CatalogImpl.refresh() has quadratic complexity for a view
  • [SPARK-22642] - the createdTempDir will not be deleted if an exception occurs
  • [SPARK-22651] - Calling ImageSchema.readImages initiates multiple Hive clients
  • [SPARK-22653] - executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap is null
  • [SPARK-22654] - Retry download of Spark from ASF mirror in HiveExternalCatalogVersionsSuite
  • [SPARK-22655] - Fail task instead of complete task silently in PythonRunner during shutdown
  • [SPARK-22662] - Failed to prune columns after rewriting predicate subquery
  • [SPARK-22668] - CodegenContext.splitExpressions() creates incorrect results with global variable arguments
  • [SPARK-22681] - Accumulator should only be updated once for each task in result stage
  • [SPARK-22686] - DROP TABLE IF EXISTS should not show AnalysisException
  • [SPARK-22700] - Bucketizer.transform incorrectly drops row containing NaN
  • [SPARK-22710] - ConfigBuilder.fallbackConf doesn't trigger onCreate function
  • [SPARK-22712] - Use `buildReaderWithPartitionValues` in native OrcFileFormat
  • [SPARK-22721] - BytesToBytesMap peak memory usage not accurate after reset()
  • [SPARK-22759] - Filters can be combined iff both are deterministic
  • [SPARK-22764] - Flaky test: SparkContextSuite "Cancelling stages/jobs with custom reasons"
  • [SPARK-22777] - Docker container built for Kubernetes doesn't allow running entrypoint.sh
  • [SPARK-22778] - Kubernetes scheduler at master failing to run applications successfully
  • [SPARK-22779] - ConfigEntry's default value should actually be a value
  • [SPARK-22788] - HdfsUtils.getOutputStream uses non-existent Hadoop conf "hdfs.append.support"
  • [SPARK-22791] - Redact Output of Explain
  • [SPARK-22793] - Memory leak in Spark Thrift Server
  • [SPARK-22811] - pyspark.ml.tests is missing a py4j import.
  • [SPARK-22813] - run-tests.py fails when /usr/sbin/lsof does not exist
  • [SPARK-22815] - Keep PromotePrecision in Optimized Plans
  • [SPARK-22817] - Use fixed testthat version for SparkR tests in AppVeyor
  • [SPARK-22818] - csv escape of quote escape
  • [SPARK-22819] - Download page - updating package type does nothing
  • [SPARK-22824] - Spark Structured Streaming Source trait breaking change
  • [SPARK-22825] - Incorrect results of Casting Array to String
  • [SPARK-22827] - Avoid throwing OutOfMemoryError in case of exception in spill
  • [SPARK-22834] - Make insert commands have real children to fix UI issues
  • [SPARK-22836] - Executors page is not showing driver logs links
  • [SPARK-22837] - Session timeout checker does not work in SessionManager
  • [SPARK-22843] - R localCheckpoint API
  • [SPARK-22846] - table's owner property in hive metastore is null
  • [SPARK-22849] - ivy.retrieve pattern should also consider `classifier`
  • [SPARK-22850] - Executor page in SHS does not show driver
  • [SPARK-22852] - sbt publishLocal fails due to -Xlint:unchecked flag passed to javadoc
  • [SPARK-22854] - AppStatusListener should get Spark version by SparkListenerLogStart
  • [SPARK-22855] - Sbt publishLocal under scala 2.12 fails due to invalid javadoc comments in tags package
  • [SPARK-22861] - SQLAppStatusListener should track all stages in multi-job executions
  • [SPARK-22862] - Docs on lazy elimination of columns missing from an encoder.
  • [SPARK-22864] - Flaky test: ExecutorAllocationManagerSuite "cancel pending executors when no longer needed"
  • [SPARK-22866] - Kubernetes dockerfile path needs update
  • [SPARK-22875] - Assembly build fails for a high user id
  • [SPARK-22889] - CRAN checks can fail if older Spark install exists
  • [SPARK-22891] - NullPointerException when using UDF
  • [SPARK-22899] - OneVsRestModel transform on streaming data failed.
  • [SPARK-22901] - Add non-deterministic to Python UDF
  • [SPARK-22905] - Fix ChiSqSelectorModel, GaussianMixtureModel save implementation for Row order issues
  • [SPARK-22916] - shouldn't bias towards build right if user does not specify
  • [SPARK-22920] - R sql functions for current_date, current_timestamp, rtrim/ltrim/trim with trimString
  • [SPARK-22924] - R DataFrame API for sortWithinPartitions
  • [SPARK-22932] - Refactor AnalysisContext
  • [SPARK-22933] - R Structured Streaming API for withWatermark, trigger, partitionBy
  • [SPARK-22934] - Make optional clauses order insensitive for CREATE TABLE SQL statement
  • [SPARK-22940] - Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't have wget installed
  • [SPARK-22946] - Recursive withColumn calls cause "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
  • [SPARK-22948] - "SparkPodInitContainer" shouldn't be in "rest" package
  • [SPARK-22949] - Reduce memory requirement for TrainValidationSplit
  • [SPARK-22950] - user classpath first causes no class found exception
  • [SPARK-22951] - count() after dropDuplicates() on emptyDataFrame returns incorrect value
  • [SPARK-22953] - Duplicated secret volumes in Spark pods when init-containers are used
  • [SPARK-22956] - Union Stream Failover Cause `IllegalStateException`
  • [SPARK-22957] - ApproxQuantile breaks if the number of rows exceeds MaxInt
  • [SPARK-22961] - Constant columns no longer picked as constraints in 2.3
  • [SPARK-22962] - Kubernetes app fails if local files are used
  • [SPARK-22967] - VersionSuite failed on Windows caused by Windows format path
  • [SPARK-22972] - Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.hive.orc.
  • [SPARK-22973] - Incorrect results of casting Map to String
  • [SPARK-22975] - MetricsReporter producing NullPointerException when there was no progress reported
  • [SPARK-22976] - Worker cleanup can remove running driver directories
  • [SPARK-22977] - DataFrameWriter operations do not show details in SQL tab
  • [SPARK-22981] - Incorrect results of casting Struct to String
  • [SPARK-22982] - Remove unsafe asynchronous close() call from FileDownloadChannel
  • [SPARK-22983] - Don't push filters beneath aggregates with empty grouping expressions
  • [SPARK-22984] - Fix incorrect bitmap copying and offset shifting in GenerateUnsafeRowJoiner
  • [SPARK-22985] - Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen
  • [SPARK-22986] - Avoid instantiating multiple instances of broadcast variables
  • [SPARK-22990] - Fix method isFairScheduler in JobsTab and StagesTab
  • [SPARK-22992] - Remove assumption of cluster domain in Kubernetes mode
  • [SPARK-22998] - Value for SPARK_MOUNTED_CLASSPATH in executor pods is not set
  • [SPARK-23000] - Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3
  • [SPARK-23001] - NullPointerException when running desc database
  • [SPARK-23009] - PySpark should not assume Pandas cols are a basestring type
  • [SPARK-23018] - PySpark createDataFrame causes Pandas warning of assignment to a copy of a reference
  • [SPARK-23019] - Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD
  • [SPARK-23021] - AnalysisBarrier should not cut off the explain output for Parsed Logical Plan
  • [SPARK-23023] - Incorrect results of printing Array/Map/Struct in showString
  • [SPARK-23025] - DataSet with scala.Null causes Exception
  • [SPARK-23035] - Fix improper information of TempTableAlreadyExistsException
  • [SPARK-23037] - RFormula should not use deprecated OneHotEncoder and should include VectorSizeHint in pipeline
  • [SPARK-23038] - Update docker/spark-test (JDK/OS)
  • [SPARK-23049] - `spark.sql.files.ignoreCorruptFiles` should work for ORC files
  • [SPARK-23051] - job description in Spark UI is broken
  • [SPARK-23053] - taskBinarySerialization and task partitions calculate in DagScheduler.submitMissingTasks should keep the same RDD checkpoint status
  • [SPARK-23054] - Incorrect results of casting UserDefinedType to String
  • [SPARK-23055] - KafkaContinuousSourceSuite Kafka column types test failing
  • [SPARK-23065] - R API doc empty in Spark 2.3.0 RC1
  • [SPARK-23079] - Fix query constraints propagation with aliases
  • [SPARK-23080] - Improve error message for built-in functions
  • [SPARK-23087] - CheckCartesianProduct too restrictive when condition is constant folded to false/null
  • [SPARK-23089] - "Unable to create operation log session directory" when parent directory not present
  • [SPARK-23095] - Decorrelation of scalar subquery fails with java.util.NoSuchElementException.
  • [SPARK-23103] - LevelDB store not iterating correctly when indexed value has negative value
  • [SPARK-23119] - Fix API annotation in DataSource V2 for streaming
  • [SPARK-23121] - When the Spark Streaming app is running for a period of time, the page is incorrectly reported when accessing '/jobs/' or '/jobs/job/?id=13' and the UI cannot be accessed.
  • [SPARK-23133] - Spark options are not passed to the Executor in Docker context
  • [SPARK-23135] - Accumulators don't show up properly in the Stages page anymore
  • [SPARK-23140] - DataSourceV2Strategy is missing in HiveSessionStateBuilder
  • [SPARK-23147] - Stage page will throw exception when there's no complete tasks
  • [SPARK-23148] - spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces
  • [SPARK-23157] - withColumn fails for a column that is a result of mapped DataSet
  • [SPARK-23177] - PySpark parameter-less UDFs raise exception if applied after distinct
  • [SPARK-23184] - All jobs page is broken when some stage is missing
  • [SPARK-23186] - Initialize DriverManager first before loading Drivers
  • [SPARK-23192] - Hint is lost after using cached data
  • [SPARK-23198] - Fix KafkaContinuousSourceStressForDontFailOnDataLossSuite to test ContinuousExecution
  • [SPARK-23205] - ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images
  • [SPARK-23207] - Shuffle+Repartition on a DataFrame could lead to incorrect answers
  • [SPARK-23208] - GenArrayData produces illegal code
  • [SPARK-23209] - HiveDelegationTokenProvider throws an exception if Hive jars are not on the classpath
  • [SPARK-23214] - cached data should not carry extra hint info
  • [SPARK-23220] - broadcast hint not applied in a streaming left anti join
  • [SPARK-23222] - Flaky test: DataFrameRangeSuite
  • [SPARK-23223] - Stacking dataset transforms performs poorly
  • [SPARK-23230] - When hive.default.fileformat is another file type, creating a textfile table causes a serde error
  • [SPARK-23233] - asNondeterministic in Python UDF not being set when the UDF is called at least once
  • [SPARK-23242] - Don't run tests in KafkaSourceSuiteBase twice
  • [SPARK-23245] - KafkaContinuousSourceSuite may hang forever
  • [SPARK-23250] - Typo in JavaDoc/ScalaDoc for DataFrameWriter
  • [SPARK-23267] - Increase spark.sql.codegen.hugeMethodLimit to 65535
  • [SPARK-23274] - ReplaceExceptWithFilter fails on dataframes filtered on same column
  • [SPARK-23275] - hive/tests have been failing when run locally on the laptop (Mac) with OOM
  • [SPARK-23281] - Query produces results in incorrect order when a composite order by clause refers to both original columns and aliases
  • [SPARK-23289] - OneForOneBlockFetcher.DownloadCallback.onData may write just a part of data
  • [SPARK-23290] - inadvertent change in handling of DateType when converting to pandas dataframe
  • [SPARK-23293] - data source v2 self join fails
  • [SPARK-23301] - data source v2 column pruning with arbitrary expressions is broken
  • [SPARK-23307] - Spark UI should sort jobs/stages by the completed timestamp before cleaning them up
  • [SPARK-23310] - Perf regression introduced by SPARK-21113
  • [SPARK-23315] - failed to get output from canonicalized data source v2 related plans
  • [SPARK-23316] - AnalysisException after max iteration reached for IN query
  • [SPARK-23326] - "Scheduler Delay" of a task is confusing
  • [SPARK-23330] - Spark UI SQL executions page throws NPE
  • [SPARK-23345] - Flaky test: FileBasedDataSourceSuite
  • [SPARK-23348] - append data using saveAsTable should adjust the data types
  • [SPARK-23358] - When the number of partitions is greater than 2^28, the result is incorrect
  • [SPARK-23360] - SparkSession.createDataFrame timestamps can be incorrect with non-Arrow codepath
  • [SPARK-23376] - creating UnsafeKVExternalSorter with BytesToBytesMap may fail
  • [SPARK-23377] - Bucketizer with multiple columns persistence bug
  • [SPARK-23384] - When no incomplete (completed) applications are found, the last updated time is not formatted and the client local time zone is not shown in the history server web UI
  • [SPARK-23387] - Backport assertPandasEqual to branch-2.3.
  • [SPARK-23388] - Support for Parquet Binary DecimalType in VectorizedColumnReader
  • [SPARK-23391] - It may lead to overflow for some integer multiplication
  • [SPARK-23394] - Storage info's Cached Partitions doesn't consider the replications (but sc.getRDDStorageInfo does)
  • [SPARK-23399] - Register a task completion listener first for OrcColumnarBatchReader
  • [SPARK-23400] - Add the extra constructors for ScalaUDF
  • [SPARK-23413] - Sorting tasks by Host / Executor ID on the Stage page does not work
  • [SPARK-23419] - data source v2 write path should re-throw interruption exceptions directly
  • [SPARK-23421] - Document the behavior change in SPARK-22356
  • [SPARK-23422] - YarnShuffleIntegrationSuite failure when SPARK_PREPEND_CLASSES set to 1
  • [SPARK-23468] - Failure to authenticate with old shuffle service
  • [SPARK-23470] - org.apache.spark.ui.jobs.ApiHelper.lastStageNameAndDescription is too slow
  • [SPARK-23475] - The "stages" page doesn't show any completed stages
  • [SPARK-23481] - The job page shows wrong stages when some of stages are evicted
  • [SPARK-23484] - Fix possible race condition in KafkaContinuousReader
  • [SPARK-24401] - Aggregate on Decimal Types does not work
  • [SPARK-25523] - Multi thread execute sparkSession.read().jdbc(url, table, properties) problem
  • [SPARK-27191] - union of dataframes depends on order of the columns in 2.4.0

New Feature

  • [SPARK-3181] - Add Robust Regression Algorithm with Huber Estimator
  • [SPARK-4131] - Support "Writing data into the filesystem from queries"
  • [SPARK-12139] - REGEX Column Specification for Hive Queries
  • [SPARK-14516] - Clustering evaluator
  • [SPARK-15689] - Data source API v2
  • [SPARK-15767] - Decision Tree Regression wrapper in SparkR
  • [SPARK-16026] - Cost-based Optimizer Framework
  • [SPARK-16060] - Vectorized ORC reader
  • [SPARK-16742] - Kerberos support for Spark on Mesos
  • [SPARK-17025] - Cannot persist PySpark ML Pipeline model that includes custom Transformer
  • [SPARK-18710] - Add offset to GeneralizedLinearRegression models
  • [SPARK-18791] - Stream-Stream Joins
  • [SPARK-19489] - Stable serialization format for external & native code integration
  • [SPARK-19507] - pyspark.sql.types._verify_type() exceptions too broad to debug collections or nested data
  • [SPARK-19606] - Support constraints in spark-dispatcher
  • [SPARK-20090] - Add StructType.fieldNames to Python API
  • [SPARK-20542] - Add an API into Bucketizer that can bin a lot of columns all at once
  • [SPARK-20601] - Python API Changes for Constrained Logistic Regression Params
  • [SPARK-20703] - Add an operator for writing data out
  • [SPARK-20812] - Add Mesos Secrets support to the spark dispatcher
  • [SPARK-20863] - Add metrics/instrumentation to LiveListenerBus
  • [SPARK-20892] - Add SQL trunc function to SparkR
  • [SPARK-20899] - PySpark supports stringIndexerOrderType in RFormula
  • [SPARK-20917] - SparkR supports string encoding consistent with R
  • [SPARK-20953] - Add hash map metrics to aggregate and join
  • [SPARK-20960] - make ColumnVector public
  • [SPARK-20979] - Add a rate source to generate values for tests and benchmark
  • [SPARK-21000] - Add Mesos labels support to the Spark Dispatcher
  • [SPARK-21027] - Parallel One vs. Rest Classifier
  • [SPARK-21043] - Add unionByName API to Dataset
  • [SPARK-21092] - Wire SQLConf in logical plan and expressions
  • [SPARK-21208] - Ability to "setLocalProperty" from sc, in sparkR
  • [SPARK-21221] - CrossValidator and TrainValidationSplit Persist Nested Estimators such as OneVsRest
  • [SPARK-21310] - Add offset to PySpark GLM
  • [SPARK-21421] - Add the query id as a local property to allow source and sink using it
  • [SPARK-21468] - FeatureHasher Python API
  • [SPARK-21499] - Support creating persistent function for Spark UDAF(UserDefinedAggregateFunction)
  • [SPARK-21519] - Add an option to the JDBC data source to initialize the environment of the remote database session
  • [SPARK-21542] - Helper functions for custom Python Persistence
  • [SPARK-21633] - Unary Transformer in Python
  • [SPARK-21726] - Check for structural integrity of the plan in QO in test mode
  • [SPARK-21777] - Simpler Dataset.sample API
  • [SPARK-21840] - Allow multiple SparkSubmit invocations in same JVM without polluting system properties
  • [SPARK-21842] - Support Kerberos ticket renewal and creation in Mesos
  • [SPARK-21854] - Python interface for MLOR summary
  • [SPARK-21856] - Update Python API for MultilayerPerceptronClassifierModel
  • [SPARK-21911] - Parallel Model Evaluation for ML Tuning: PySpark
  • [SPARK-22131] - Add Mesos Secrets Support to the Mesos Driver
  • [SPARK-22160] - Allow changing sample points per partition in range shuffle exchange
  • [SPARK-22181] - ReplaceExceptWithFilter if one or both of the datasets are fully derived out of Filters from a same parent
  • [SPARK-22456] - Add new function dayofweek
  • [SPARK-22521] - VectorIndexerModel support handle unseen categories via handleInvalid: Python API
  • [SPARK-22734] - VectorSizeHint Python API
  • [SPARK-22781] - Support creating streaming dataset with ORC files
  • [SPARK-23008] - OnehotEncoderEstimator python API

Improvement

  • [SPARK-7481] - Add spark-hadoop-cloud module to pull in object store support
  • [SPARK-9221] - Support IntervalType in Range Frame
  • [SPARK-10216] - Avoid creating empty files during overwrite into Hive table with group by query
  • [SPARK-10655] - Enhance DB2 dialect to handle XML, DECIMAL, and DECFLOAT
  • [SPARK-10931] - PySpark ML Models should contain Param values
  • [SPARK-11574] - Spark should support StatsD sink out of box
  • [SPARK-12664] - Expose probability, rawPrediction in MultilayerPerceptronClassificationModel
  • [SPARK-13030] - Change OneHotEncoder to Estimator
  • [SPARK-13041] - Add a driver history ui link and a mesos sandbox link on the dispatcher's ui page for each driver
  • [SPARK-13656] - Delete spark.sql.parquet.cacheMetadata
  • [SPARK-13846] - VectorIndexer output on unknown feature should be more descriptive
  • [SPARK-13947] - The error message from using an invalid table reference is not clear
  • [SPARK-14371] - OnlineLDAOptimizer should not collect stats for each doc in mini-batch to driver
  • [SPARK-14659] - OneHotEncoder support drop first category alphabetically in the encoded vector
  • [SPARK-14932] - Allow DataFrame.replace() to replace values with None
  • [SPARK-15648] - add TeradataDialect
  • [SPARK-16019] - Eliminate unexpected delay during spark on yarn job launch
  • [SPARK-16496] - Add wholetext as option for reading text in SQL.
  • [SPARK-16931] - PySpark access to data-frame bucketing api
  • [SPARK-16957] - Use weighted midpoints for split values.
  • [SPARK-17006] - WithColumn Performance Degrades with Number of Invocations
  • [SPARK-17310] - Disable Parquet's record-by-record filter in normal parquet reader and do it in Spark-side
  • [SPARK-17414] - Set type is not supported for creating data frames
  • [SPARK-17701] - Refactor DataSourceScanExec so its sameResult call does not compare strings
  • [SPARK-17924] - Consolidate streaming and batch write path
  • [SPARK-18136] - Make PySpark pip install work on Windows
  • [SPARK-18540] - Wholestage code-gen for ORC Hive tables
  • [SPARK-18619] - Make QuantileDiscretizer/Bucketizer/StringIndexer inherit from HasHandleInvalid
  • [SPARK-18623] - Add `returnNullable` to `StaticInvoke` and modify it to handle properly.
  • [SPARK-18838] - High latency of event processing for large jobs
  • [SPARK-18891] - Support for specific collection types
  • [SPARK-19112] - add codec for ZStandard
  • [SPARK-19159] - PySpark UDF API improvements
  • [SPARK-19236] - Add createOrReplaceGlobalTempView
  • [SPARK-19270] - Add summary table to GLM summary
  • [SPARK-19285] - Java - Provide user-defined function of 0 arguments (UDF0)
  • [SPARK-19358] - LiveListenerBus shall log the event name when dropping them due to a fully filled queue
  • [SPARK-19439] - PySpark's registerJavaFunction Should Support UDAFs
  • [SPARK-19552] - Upgrade Netty version to 4.1.x final
  • [SPARK-19558] - Provide a config option to attach QueryExecutionListener to SparkSession
  • [SPARK-19732] - DataFrame.fillna() does not work for bools in PySpark
  • [SPARK-19759] - ALSModel.predict on Dataframes : potential optimization by not using blas
  • [SPARK-19852] - StringIndexer.setHandleInvalid should have another option 'new': Python API and docs
  • [SPARK-19866] - Add local version of Word2Vec findSynonyms for spark.ml: Python API
  • [SPARK-19878] - Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
  • [SPARK-19937] - Collect metrics of block sizes when shuffle.
  • [SPARK-19951] - Add string concatenate operator || to Spark SQL
  • [SPARK-19975] - Add map_keys and map_values functions to Python
  • [SPARK-20014] - Optimize mergeSpillsWithFileStream method
  • [SPARK-20055] - Documentation for CSV datasets in SQL programming guide
  • [SPARK-20073] - Unexpected Cartesian product when using eqNullSafe in join with a derived table
  • [SPARK-20101] - Use OffHeapColumnVector when "spark.sql.columnVector.offheap.enable" is set to "true"
  • [SPARK-20109] - Need a way to convert from IndexedRowMatrix to Dense Block Matrices
  • [SPARK-20199] - GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter
  • [SPARK-20236] - Overwrite a partitioned data source table should only overwrite related partitions
  • [SPARK-20290] - PySpark Column should provide eqNullSafe
  • [SPARK-20307] - SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer
  • [SPARK-20331] - Broaden support for Hive partition pruning predicate pushdown
  • [SPARK-20350] - Apply Complementation Laws during boolean expression simplification
  • [SPARK-20355] - Display Spark version on history page
  • [SPARK-20371] - R wrappers for collect_list and collect_set
  • [SPARK-20375] - R wrappers for array and map
  • [SPARK-20376] - Make StateStoreProvider pluggable
  • [SPARK-20379] - Allow setting SSL-related passwords through env variables
  • [SPARK-20383] - Spark SQL does not support creating a function with the keywords 'OR REPLACE' and 'IF NOT EXISTS'
  • [SPARK-20392] - Slow performance when calling fit on ML pipeline for dataset with many columns but few rows
  • [SPARK-20416] - Column names inconsistent for UDFs in SQL vs Dataset
  • [SPARK-20425] - Support an extended display mode to print a column data per line
  • [SPARK-20431] - Support a DDL-formatted string in DataFrameReader.schema
  • [SPARK-20433] - Update jackson-databind to 2.6.7.1
  • [SPARK-20437] - R wrappers for rollup and cube
  • [SPARK-20438] - R wrappers for split and repeat
  • [SPARK-20460] - Make it more consistent to handle column name duplication
  • [SPARK-20463] - Add support for IS [NOT] DISTINCT FROM to SPARK SQL
  • [SPARK-20484] - Add documentation to ALS code
  • [SPARK-20490] - Add eqNullSafe, not and ! to SparkR
  • [SPARK-20493] - De-duplicate parsing logic for DDL-like type strings in R
  • [SPARK-20495] - Add StorageLevel to cacheTable API
  • [SPARK-20498] - RandomForestRegressionModel should expose getMaxDepth in PySpark
  • [SPARK-20519] - When the input parameter is null, a runtime exception may occur
  • [SPARK-20532] - SparkR should provide grouping and grouping_id
  • [SPARK-20533] - SparkR Wrappers Model should be private and value should be lazy
  • [SPARK-20535] - R wrappers for explode_outer and posexplode_outer
  • [SPARK-20544] - R wrapper for input_file_name
  • [SPARK-20550] - R wrappers for Dataset.alias
  • [SPARK-20557] - JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE
  • [SPARK-20566] - ColumnVector should support `appendFloats` for array
  • [SPARK-20599] - ConsoleSink should work with write (batch)
  • [SPARK-20614] - Use the same log4j configuration with Jenkins in AppVeyor
  • [SPARK-20619] - StringIndexer supports multiple ways of label ordering
  • [SPARK-20639] - Add single argument support for to_timestamp in SQL
  • [SPARK-20668] - Modify ScalaUDF to handle nullability.
  • [SPARK-20670] - Simplify FPGrowth transform
  • [SPARK-20679] - Let ML ALS recommend for a subset of users/items
  • [SPARK-20682] - Add new ORCFileFormat based on Apache ORC
  • [SPARK-20715] - MapStatuses shouldn't be redundantly stored in both ShuffleMapStage and MapOutputTracker
  • [SPARK-20720] - 'Executor Summary' should show the exact number, 'Removed Executors' should display the specific number, in the Application Page
  • [SPARK-20726] - R wrapper for SQL broadcast
  • [SPARK-20728] - Make ORCFileFormat configurable between sql/hive and sql/core
  • [SPARK-20730] - Add a new Optimizer rule to combine nested Concats
  • [SPARK-20736] - PySpark StringIndexer supports StringOrderType
  • [SPARK-20775] - from_json should also have an API where the schema is specified with a string
  • [SPARK-20779] - The ASF header placed in an incorrect location in some files
  • [SPARK-20785] - Spark should provide jump links and add (count) in the SQL web ui.
  • [SPARK-20806] - Launcher: redundant check for Spark lib dir
  • [SPARK-20830] - PySpark wrappers for explode_outer and posexplode_outer
  • [SPARK-20835] - It should exit directly when the --total-executor-cores parameter is set to less than 0 when submitting an application
  • [SPARK-20841] - Support table column aliases in FROM clause
  • [SPARK-20842] - Upgrade to 1.2.2 for Hive Metastore Client 1.2
  • [SPARK-20849] - Document R DecisionTree
  • [SPARK-20861] - Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators
  • [SPARK-20871] - Only log Janino code in debug mode
  • [SPARK-20875] - Spark should print the log when the directory has been deleted
  • [SPARK-20883] - Improve StateStore APIs for efficiency
  • [SPARK-20886] - HadoopMapReduceCommitProtocol to fail with message if FileOutputCommitter.getWorkPath==null
  • [SPARK-20887] - support alternative keys in ConfigBuilder
  • [SPARK-20894] - Error while checkpointing to HDFS
  • [SPARK-20930] - Destroy broadcasted centers after computing cost
  • [SPARK-20936] - Missing an important case in the test of resolveURI
  • [SPARK-20946] - Do not update conf for existing SparkContext in SparkSession.getOrCreate
  • [SPARK-20950] - Add a new config for diskWriteBufferSize, which was hard-coded before
  • [SPARK-20966] - Table data is not sorted by startTime time desc, time is not formatted and redundant code in JDBC/ODBC Server page.
  • [SPARK-20972] - rename HintInfo.isBroadcastable to broadcast
  • [SPARK-20981] - Add --repositories equivalent configuration for Spark
  • [SPARK-20985] - Improve KryoSerializerResizableOutputSuite
  • [SPARK-20994] - Alleviate memory pressure in StreamManager
  • [SPARK-20995] - 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions.
  • [SPARK-21012] - Support glob path for resources adding to Spark
  • [SPARK-21039] - Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter
  • [SPARK-21060] - CSS style of the paging function is wrong in the executor page
  • [SPARK-21070] - Pick up cloudpickle upgrades from cloudpickle python module
  • [SPARK-21091] - Move constraint code into QueryPlanConstraints
  • [SPARK-21100] - Add summary method as alternative to describe that gives quartiles similar to Pandas
  • [SPARK-21103] - QueryPlanConstraints should be part of LogicalPlan
  • [SPARK-21110] - Structs should be usable in inequality filters
  • [SPARK-21113] - Support for read ahead input stream to amortize disk IO cost in the Spill reader
  • [SPARK-21115] - If the cores left are fewer than coresPerExecutor, the cores left will not be allocated, so this should not be checked in every schedule
  • [SPARK-21125] - PySpark context missing function to set Job Description.
  • [SPARK-21135] - On the history server page, duration of incomplete applications should be hidden instead of showing up as 0
  • [SPARK-21137] - Spark reads many small files slowly off local filesystem
  • [SPARK-21142] - spark-streaming-kafka-0-10 has too fat dependency on kafka
  • [SPARK-21146] - Master/Worker should handle and shutdown when any thread gets UncaughtException
  • [SPARK-21149] - Add job description API for R
  • [SPARK-21153] - Time windowing for tumbling windows can use a project instead of expand + filter
  • [SPARK-21155] - Add (? running tasks) into Spark UI progress
  • [SPARK-21164] - Remove isTableSample from Sample and isGenerated from Alias and AttributeReference
  • [SPARK-21174] - Validate sampling fraction in logical operator level
  • [SPARK-21175] - shuffle service should reject fetch requests if there are already many requests in progress
  • [SPARK-21189] - Handle unknown error codes in Jenkins rather than leaving incomplete comment in PRs
  • [SPARK-21192] - Preserve State Store provider class configuration across StreamingQuery restarts
  • [SPARK-21193] - Specify Pandas version in setup.py
  • [SPARK-21196] - Split codegen info of query plan into sequence
  • [SPARK-21217] - Support ColumnVector.Array.to<type>Array()
  • [SPARK-21222] - Move elimination of Distinct clause from analyzer to optimizer
  • [SPARK-21229] - remove QueryPlan.preCanonicalized
  • [SPARK-21238] - allow nested SQL execution
  • [SPARK-21240] - Fix code style for constructing and stopping a SparkContext in UT
  • [SPARK-21243] - Limit the number of maps in a single shuffle fetch
  • [SPARK-21247] - Type comparison should respect case-sensitive SQL conf
  • [SPARK-21250] - Add a url in the table of 'Running Executors' in worker page to visit job page
  • [SPARK-21256] - Add WithSQLConf to Catalyst Test
  • [SPARK-21260] - Remove the unused OutputFakerExec
  • [SPARK-21266] - Support schema as a DDL-formatted string in dapply/gapply/from_json
  • [SPARK-21267] - Improvements to the Structured Streaming programming guide
  • [SPARK-21268] - Move center calculations to a distributed map in KMeans
  • [SPARK-21273] - Decouple stats propagation from logical plan
  • [SPARK-21275] - Update GLM test to use supportedFamilyNames
  • [SPARK-21276] - Update lz4-java to remove custom LZ4BlockInputStream
  • [SPARK-21285] - VectorAssembler should report the column name when data type used is not supported
  • [SPARK-21295] - Confusing error message for missing references
  • [SPARK-21296] - Avoid per-record type dispatch in PySpark createDataFrame schema verification
  • [SPARK-21297] - Add count in 'JDBC/ODBC Server' page.
  • [SPARK-21304] - remove unnecessary isNull variable for collection related encoder expressions
  • [SPARK-21305] - The BKM (best known methods) of using native BLAS to improve ML/MLLIB performance
  • [SPARK-21308] - Remove SQLConf parameters from the optimizer
  • [SPARK-21313] - ConsoleSink's string representation
  • [SPARK-21315] - Skip some spill files when generateIterator(startIndex) in ExternalAppendOnlyUnsafeRowArray.
  • [SPARK-21321] - Spark very verbose on shutdown confusing users
  • [SPARK-21323] - Rename sql.catalyst.plans.logical.statsEstimation.Range to ValueInterval
  • [SPARK-21326] - Use TextFileFormat in implementation of LibSVMFileFormat
  • [SPARK-21329] - Make EventTimeWatermarkExec explicitly UnaryExecNode
  • [SPARK-21358] - Argument of repartitionAndSortWithinPartitions in PySpark
  • [SPARK-21365] - Deduplicate logics parsing DDL-like type definition
  • [SPARK-21373] - Update Jetty to 9.3.20.v20170531
  • [SPARK-21381] - SparkR: pass on setHandleInvalid for classification algorithms
  • [SPARK-21382] - The note about Scala 2.10 in building-spark.md is wrong.
  • [SPARK-21388] - GBT inherit from HasStepSize & LinearSVC/Binarizer from HasThreshold
  • [SPARK-21396] - Spark Hive Thriftserver doesn't return UDT field
  • [SPARK-21401] - add poll function for BoundedPriorityQueue
  • [SPARK-21408] - Default RPC dispatcher thread pool size too large for small executors
  • [SPARK-21409] - Expose state store memory usage in SQL metrics and progress updates
  • [SPARK-21410] - In RangePartitioner(partitions: Int, rdd: RDD[]), RangePartitioner.numPartitions is wrong if the number of elements in RDD (rdd.count()) is less than number of partitions (partitions in constructor).
  • [SPARK-21415] - Triage scapegoat warnings, part 1
  • [SPARK-21434] - Add PySpark pip documentation
  • [SPARK-21435] - Empty files should be skipped while writing to file
  • [SPARK-21472] - Introduce ArrowColumnVector as a reader for Arrow vectors.
  • [SPARK-21475] - Change to use NIO's Files API for external shuffle service
  • [SPARK-21477] - Mark LocalTableScanExec's input data transient
  • [SPARK-21491] - Performance enhancement: eliminate creation of intermediate collections
  • [SPARK-21504] - Add spark version info in table metadata
  • [SPARK-21506] - The description of "spark.executor.cores" may be not correct
  • [SPARK-21513] - SQL to_json should support all column types
  • [SPARK-21517] - Fetching local data via block manager causes OOM
  • [SPARK-21524] - ValidatorParamsSuiteHelpers generates wrong temp files
  • [SPARK-21527] - Use buffer limit in order to take advantage of JAVA NIO Util's buffercache
  • [SPARK-21530] - Update description of spark.shuffle.maxChunksBeingTransferred
  • [SPARK-21538] - Attribute resolution inconsistency in Dataset API
  • [SPARK-21544] - Test jar of some module should not install or deploy twice
  • [SPARK-21553] - Add the description of the default value of master parameter in the spark-shell
  • [SPARK-21566] - Python method for summary
  • [SPARK-21575] - Eliminate needless synchronization in java-R serialization
  • [SPARK-21578] - Add JavaSparkContextSuite
  • [SPARK-21583] - Create a ColumnarBatch with ArrowColumnVectors for row based iteration
  • [SPARK-21584] - Update R method for summary to call new implementation
  • [SPARK-21589] - Add documents about unsupported functions in Hive UDF/UDTF/UDAF
  • [SPARK-21592] - Skip maven-compiler-plugin main and test compilations in Maven build
  • [SPARK-21602] - Add map_keys and map_values functions to R
  • [SPARK-21603] - Whole-stage codegen will be much slower than with whole-stage codegen disabled when the function is too long
  • [SPARK-21604] - If the object extends Logging, remove the var LOG, which is useless
  • [SPARK-21608] - Window rangeBetween() API should allow literal boundary
  • [SPARK-21611] - Error class name for log in several classes.
  • [SPARK-21619] - Fail the execution of canonicalized plans explicitly
  • [SPARK-21622] - Support Offset in SparkR
  • [SPARK-21623] - Comments on parentStats in ml/tree/impl/DTStatsAggregator.scala are wrong
  • [SPARK-21634] - Change OneRowRelation from a case object to case class
  • [SPARK-21640] - Method mode with String parameters within DataFrameWriter is error prone
  • [SPARK-21661] - SparkSQL can't merge load table from Hadoop
  • [SPARK-21665] - Need to close resources after use
  • [SPARK-21667] - ConsoleSink should not fail streaming query with checkpointLocation option
  • [SPARK-21669] - Internal API for collecting metrics/stats during FileFormatWriter jobs
  • [SPARK-21672] - Remove SHS-specific application / attempt data structures
  • [SPARK-21675] - Add a navigation bar at the bottom of the Details for Stage Page
  • [SPARK-21680] - ML/MLLIB Vector compressed optimization
  • [SPARK-21694] - Support Mesos CNI network labels
  • [SPARK-21701] - Add TCP send/rcv buffer size support for RPC client
  • [SPARK-21709] - use sbt 0.13.16 and update sbt plugins
  • [SPARK-21717] - Decouple the generated codes of consuming rows in operators under whole-stage codegen
  • [SPARK-21718] - Heavy log of type: "Skipping partition based on stats ..."
  • [SPARK-21728] - Allow SparkSubmit to use logging
  • [SPARK-21732] - Lazily init hive metastore client
  • [SPARK-21745] - Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce WritableColumnVector.
  • [SPARK-21751] - CodeGenerator.splitExpressions counts code size more precisely
  • [SPARK-21756] - Add JSON option to allow unquoted control characters
  • [SPARK-21765] - Ensure all leaf nodes that are derived from streaming sources have isStreaming=true
  • [SPARK-21769] - Add a table option for Hive-serde tables to make Spark always respect schemas inferred by Spark SQL
  • [SPARK-21770] - ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions
  • [SPARK-21771] - SparkSQLEnv creates a useless meta hive client
  • [SPARK-21773] - Should Install mkdocs if missing in the path in SQL documentation build
  • [SPARK-21781] - Modify DataSourceScanExec to use concrete ColumnVector type.
  • [SPARK-21787] - Support for pushing down filters for DateType in native OrcFileFormat
  • [SPARK-21789] - Remove obsolete codes for parsing abstract schema strings
  • [SPARK-21803] - Remove the HiveDDLCommandSuite
  • [SPARK-21806] - BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading
  • [SPARK-21807] - The getAliasedConstraints function in LogicalPlan will take a long time when number of expressions is greater than 100
  • [SPARK-21813] - [core] Modify TaskMemoryManager.MAXIMUM_PAGE_SIZE_BYTES comments
  • [SPARK-21839] - Support SQL config for ORC compression
  • [SPARK-21862] - Add overflow check in PCA
  • [SPARK-21865] - simplify the distribution semantic of Spark SQL
  • [SPARK-21866] - SPIP: Image support in Spark
  • [SPARK-21871] - Check actual bytecode size when compiling generated code
  • [SPARK-21873] - CachedKafkaConsumer throws NonLocalReturnControl during fetching from Kafka
  • [SPARK-21875] - Jenkins passes Java code that violates ./dev/lint-java
  • [SPARK-21878] - Create SQLMetricsTestUtils
  • [SPARK-21886] - Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD logical operator
  • [SPARK-21891] - Add TBLPROPERTIES to DDL statement: CREATE TABLE USING
  • [SPARK-21897] - Add unionByName API to DataFrame in Python and R
  • [SPARK-21901] - Define toString for StateOperatorProgress
  • [SPARK-21902] - BlockManager.doPut will hide the actual exception when an exception is thrown in the finally block
  • [SPARK-21903] - Upgrade scalastyle to 1.0.0
  • [SPARK-21923] - Avoid calling reserveUnrollMemoryForThisTask for every record
  • [SPARK-21963] - Created temp file should be deleted after use
  • [SPARK-21967] - org.apache.spark.unsafe.types.UTF8String#compareTo Should Compare 8 Bytes at a Time for Better Performance
  • [SPARK-21970] - Do a Project Wide Sweep for Redundant Throws Declarations
  • [SPARK-21973] - Add a new option to filter queries to run in TPCDSQueryBenchmark
  • [SPARK-21975] - Histogram support in cost-based optimizer
  • [SPARK-21981] - Python API for ClusteringEvaluator
  • [SPARK-21983] - Fix ANTLR 4.7 deprecations
  • [SPARK-21988] - Add default stats to StreamingRelation and StreamingExecutionRelation
  • [SPARK-22001] - ImputerModel can do withColumn for all input columns at one pass
  • [SPARK-22002] - Reading a JDBC table with a custom schema should support specifying partial fields
  • [SPARK-22003] - vectorized reader does not work with UDF when the column is array
  • [SPARK-22009] - Use treeAggregate to improve some algorithms
  • [SPARK-22043] - Python profile, show_profiles() and dump_profiles(), should throw an error with a better message
  • [SPARK-22049] - Confusing behavior of from_utc_timestamp and to_utc_timestamp
  • [SPARK-22050] - Allow BlockUpdated events to be optionally logged to the event log
  • [SPARK-22058] - the BufferedInputStream will not be closed if an exception occurs
  • [SPARK-22066] - Update checkstyle to 8.2, enable it, fix violations
  • [SPARK-22072] - Allow the same shell params to be used for all of the different steps in release-build
  • [SPARK-22075] - GBTs forgot to unpersist datasets cached by Checkpointer
  • [SPARK-22099] - The 'job ids' list style needs to be changed in the SQL page.
  • [SPARK-22103] - Move HashAggregateExec parent consume to a separate function in codegen
  • [SPARK-22106] - Remove support for 0-parameter pandas_udfs
  • [SPARK-22112] - Add missing method to pyspark api: spark.read.csv(Dataset<String>)
  • [SPARK-22120] - TestHiveSparkSession.reset() should clean out Hive warehouse directory
  • [SPARK-22122] - Respect WITH clauses to count input rows in TPCDSQueryBenchmark
  • [SPARK-22123] - Add latest failure reason for task set blacklist
  • [SPARK-22124] - Sample and Limit should also defer input evaluation under codegen
  • [SPARK-22125] - Enable Arrow Stream format for vectorized UDF.
  • [SPARK-22130] - UTF8String.trim() inefficiently scans all white-space string twice.
  • [SPARK-22133] - Document Mesos reject offer duration configurations
  • [SPARK-22138] - Allow retry during release-build
  • [SPARK-22142] - Move Flume support behind a profile
  • [SPARK-22147] - BlockId.hashCode allocates a StringBuilder/String on each call
  • [SPARK-22156] - Word2Vec: incorrect learning rate update equation when numIterations > 1
  • [SPARK-22170] - Broadcast join holds an extra copy of rows in driver memory
  • [SPARK-22173] - Table CSS style needs to be adjusted in History Page and in Executors Page.
  • [SPARK-22188] - Add defense against Cross-Site Scripting, MIME-sniffing and MitM attack
  • [SPARK-22190] - Add Spark executor task metrics to Dropwizard metrics
  • [SPARK-22193] - SortMergeJoinExec: typo correction
  • [SPARK-22203] - Add job description for file listing Spark jobs
  • [SPARK-22208] - Improve percentile_approx by not rounding up targetError and starting from index 0
  • [SPARK-22214] - Refactor the list hive partitions code
  • [SPARK-22217] - ParquetFileFormat to support arbitrary OutputCommitters
  • [SPARK-22233] - filter out empty InputSplit in HadoopRDD
  • [SPARK-22247] - Hive partition filter very slow
  • [SPARK-22263] - Refactor deterministic as lazy value
  • [SPARK-22266] - The same aggregate function was evaluated multiple times
  • [SPARK-22268] - Fix java style errors
  • [SPARK-22282] - Rename OrcRelation to OrcFileFormat and remove ORC_COMPRESSION
  • [SPARK-22294] - Reset spark.driver.bindAddress when starting a Checkpoint
  • [SPARK-22301] - Add rule to Optimizer for In with empty list of values
  • [SPARK-22302] - Remove manual backports for subprocess.check_output and check_call
  • [SPARK-22308] - Support unit tests of spark code using ScalaTest using suites other than FunSuite
  • [SPARK-22313] - Mark/print deprecation warnings as DeprecationWarning for deprecated APIs
  • [SPARK-22315] - Check for version match between R package and JVM
  • [SPARK-22346] - Update VectorAssembler to work with Structured Streaming
  • [SPARK-22348] - The table cache providing ColumnarBatch should also do partition batch pruning
  • [SPARK-22366] - Support ignoreMissingFiles flag parallel to ignoreCorruptFiles
  • [SPARK-22372] - Make YARN client extend SparkApplication
  • [SPARK-22378] - Redundant nullcheck is generated for extracting value in complex types
  • [SPARK-22379] - Reduce duplication setUpClass and tearDownClass in PySpark SQL tests
  • [SPARK-22385] - MapObjects should not access list element by index
  • [SPARK-22397] - Add multiple column support to QuantileDiscretizer
  • [SPARK-22405] - Enrich the event information and add new event of ExternalCatalogEvent
  • [SPARK-22407] - Add rdd id column on storage page to speed up navigating
  • [SPARK-22408] - RelationalGroupedDataset's distinct pivot value calculation launches unnecessary stages
  • [SPARK-22422] - Add Adjusted R2 to RegressionMetrics
  • [SPARK-22445] - move CodegenContext.copyResult to CodegenSupport
  • [SPARK-22450] - Safely register class for mllib
  • [SPARK-22476] - Add new function dayofweek in R
  • [SPARK-22496] - beeline display operation log
  • [SPARK-22519] - Remove unnecessary stagingDirPath null check in ApplicationMaster.cleanupStagingDir()
  • [SPARK-22520] - Support code generation also for complex CASE WHEN
  • [SPARK-22537] - Aggregation of map output statistics on driver faces single point bottleneck
  • [SPARK-22554] - Add a config to control if PySpark should use daemon or not
  • [SPARK-22566] - Better error message for `_merge_type` in Pandas to Spark DF conversion
  • [SPARK-22569] - Clean up caller of splitExpressions and addMutableState
  • [SPARK-22592] - cleanup filter converting for hive
  • [SPARK-22596] - set ctx.currentVars in CodegenSupport.consume
  • [SPARK-22597] - Add spark-sql script for Windows users
  • [SPARK-22608] - Avoid code duplication regarding CodeGeneration.splitExpressions()
  • [SPARK-22614] - Expose range partitioning shuffle
  • [SPARK-22617] - make splitExpressions extract current input of the context
  • [SPARK-22638] - Use a separate query for StreamingQueryListenerBus
  • [SPARK-22649] - localCheckpoint support in Dataset API
  • [SPARK-22660] - Use position() and limit() to fix ambiguity issue in scala-2.12
  • [SPARK-22665] - Dataset API: .repartition() inconsistency / issue
  • [SPARK-22667] - Fix model-specific optimization support for ML tuning: Python API
  • [SPARK-22673] - InMemoryRelation should utilize on-disk table stats whenever possible
  • [SPARK-22675] - Refactoring PropagateTypes in TypeCoercion
  • [SPARK-22677] - cleanup whole stage codegen for hash aggregate
  • [SPARK-22682] - HashExpression does not need to create global variables
  • [SPARK-22688] - Upgrade Janino version to 3.0.8
  • [SPARK-22690] - Imputer inherit HasOutputCols
  • [SPARK-22692] - Reduce the number of generated mutable states
  • [SPARK-22701] - add ctx.splitExpressionsWithCurrentInputs
  • [SPARK-22704] - Reduce # of mutable variables in Least and greatest
  • [SPARK-22705] - Reduce # of mutable variables in Case, Coalesce, and In
  • [SPARK-22707] - Optimize CrossValidator memory occupation by models in fitting
  • [SPARK-22719] - refactor ConstantPropagation
  • [SPARK-22729] - Add getTruncateQuery to JdbcDialect
  • [SPARK-22753] - Get rid of dataSource.writeAndRead
  • [SPARK-22754] - Check spark.executor.heartbeatInterval setting in case of ExecutorLost
  • [SPARK-22763] - SHS: Ignore unknown events and parse through the file
  • [SPARK-22767] - use ctx.addReferenceObj in InSet and ScalaUDF
  • [SPARK-22771] - SQL concat for binary
  • [SPARK-22774] - Add compilation check for generated code in TPCDSQuerySuite
  • [SPARK-22786] - only use AppStatusPlugin in history server
  • [SPARK-22790] - add a configurable factor to describe HadoopFsRelation's size
  • [SPARK-22799] - Bucketizer should throw exception if single- and multi-column params are both set
  • [SPARK-22801] - Allow FeatureHasher to specify numeric columns to treat as categorical
  • [SPARK-22810] - PySpark supports LinearRegression with huber loss
  • [SPARK-22830] - Scala Coding style has been improved in Spark Examples
  • [SPARK-22832] - BisectingKMeans unpersist unused datasets
  • [SPARK-22833] - [Examples] Improvements made at SparkHive Example with Scala
  • [SPARK-22844] - R date_trunc API
  • [SPARK-22847] - Remove the duplicate code in AppStatusListener while assigning schedulingPool for stage
  • [SPARK-22870] - Dynamic allocation should allow 0 idle time
  • [SPARK-22874] - Modify checking pandas version to use LooseVersion.
  • [SPARK-22893] - Unified the data type mismatch message
  • [SPARK-22894] - DateTimeOperations should accept SQL like string type
  • [SPARK-22895] - Push down the deterministic predicates that are after the first non-deterministic
  • [SPARK-22896] - Improvement in String interpolation
  • [SPARK-22897] - Expose stageAttemptId in TaskContext
  • [SPARK-22914] - Subbing for spark.history.ui.port does not resolve by default
  • [SPARK-22919] - Bump Apache httpclient versions
  • [SPARK-22921] - Merge script should prompt for assigning jiras
  • [SPARK-22922] - Python API for fitMultiple
  • [SPARK-22937] - SQL elt for binary inputs
  • [SPARK-22939] - Support Spark UDF in registerFunction
  • [SPARK-22944] - improve FoldablePropagation
  • [SPARK-22945] - add java UDF APIs in the functions object
  • [SPARK-22952] - Deprecate stageAttemptId in favour of stageAttemptNumber
  • [SPARK-22960] - Make build-push-docker-images.sh more dev-friendly
  • [SPARK-22979] - Avoid per-record type dispatch in Python data conversion (EvaluatePython.fromJava)
  • [SPARK-22994] - Require a single container image for Spark-on-K8S
  • [SPARK-22997] - Add additional defenses against use of freed MemoryBlocks
  • [SPARK-22999] - 'show databases like command' can remove the like keyword
  • [SPARK-23005] - Improve RDD.take on small number of partitions
  • [SPARK-23029] - Doc spark.shuffle.file.buffer units are kb when no units specified
  • [SPARK-23032] - Add a per-query codegenStageId to WholeStageCodegenExec
  • [SPARK-23036] - Add withGlobalTempView for testing
  • [SPARK-23062] - EXCEPT documentation should make it clear that it's EXCEPT DISTINCT
  • [SPARK-23081] - Add colRegex API to PySpark
  • [SPARK-23090] - polish ColumnVector
  • [SPARK-23091] - Incorrect unit test for approxQuantile
  • [SPARK-23122] - Deprecate register* for UDFs in SQLContext and Catalog in PySpark
  • [SPARK-23129] - Lazy init DiskMapIterator#deserializeStream to reduce memory usage when ExternalAppendOnlyMap spill too many times
  • [SPARK-23141] - Support data type string as a returnType for registerJavaFunction.
  • [SPARK-23142] - Add documentation for Continuous Processing
  • [SPARK-23143] - Add Python support for continuous trigger
  • [SPARK-23144] - Add console sink for continuous queries
  • [SPARK-23149] - polish ColumnarBatch
  • [SPARK-23170] - Dump the statistics of effective runs of analyzer and optimizer rules
  • [SPARK-23199] - Remove repetition from group expressions in Aggregate
  • [SPARK-23238] - Externalize SQLConf spark.sql.execution.arrow.enabled
  • [SPARK-23248] - Relocate module docstrings to the top in PySpark examples
  • [SPARK-23249] - Improve partition bin-filling algorithm to have less skew and fewer partitions
  • [SPARK-23276] - Enable UDT tests in (Hive)OrcHadoopFsRelationSuite
  • [SPARK-23279] - Avoid triggering distributed job for Console sink
  • [SPARK-23284] - Document several get API of ColumnVector's behavior when accessing null slot
  • [SPARK-23296] - Diagnostics message for user code exceptions should include the stacktrace
  • [SPARK-23305] - Test `spark.sql.files.ignoreMissingFiles` for all file-based data sources
  • [SPARK-23312] - add a config to turn off vectorized cache reader
  • [SPARK-23317] - rename ContinuousReader.setOffset to setStartOffset
  • [SPARK-23454] - Add Trigger information to the Structured Streaming programming guide
  • [SPARK-23617] - Register a Function without params with Spark SQL Java API
  • [SPARK-23993] - Support DESC FORMATTED table_name column_name
  • [SPARK-24328] - Fix scala.MatchError in literals.sql.out
  • [SPARK-26542] - Support the coordinator to determine post-shuffle partitions more reasonably

Test

  • [SPARK-19662] - Add Fair Scheduler Unit Test coverage for different build cases
  • [SPARK-20518] - Supplement the new blockidsuite unit tests
  • [SPARK-20571] - Flaky SparkR StructuredStreaming tests
  • [SPARK-20607] - Add new unit tests to ShuffleSuite
  • [SPARK-20957] - Flaky Test: o.a.s.sql.streaming.StreamingQueryManagerSuite listing
  • [SPARK-21006] - Create rpcEnv and run later needs shutdown and awaitTermination
  • [SPARK-21128] - Running R tests multiple times failed due to pre-existing "spark-warehouse" / "metastore_db"
  • [SPARK-21286] - [spark core UT] Modify an error for unit test
  • [SPARK-21370] - Avoid doing anything on HDFSBackedStateStore.abort() when there are no updates to commit
  • [SPARK-21464] - Minimize deprecation warnings caused by ProcessingTime class
  • [SPARK-21573] - Tests failing with run-tests.py SyntaxError occasionally in Jenkins
  • [SPARK-21663] - MapOutputTrackerSuite case test("remote fetch below max RPC message size") should call stop
  • [SPARK-21693] - AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
  • [SPARK-21729] - Generic test for ProbabilisticClassifier to ensure consistent output columns
  • [SPARK-21764] - Tests failures on Windows: resources not being closed and incorrect paths
  • [SPARK-21843] - testNameNote should be "(minNumPostShufflePartitions: " + numPartitions + ")" in ExchangeCoordinatorSuite
  • [SPARK-21936] - backward compatibility test framework for HiveExternalCatalog
  • [SPARK-21949] - Tables created in unit tests should be dropped after use
  • [SPARK-21982] - Set Locale to US in order to pass UtilsSuite when your jvm Locale is not US
  • [SPARK-22140] - Add a test suite for TPCDS queries
  • [SPARK-22161] - Add Impala-modified TPC-DS queries
  • [SPARK-22418] - Add test cases for NULL Handling
  • [SPARK-22423] - Scala test source files like TestHiveSingleton.scala should be in scala source root
  • [SPARK-22595] - flaky test: CastSuite.SPARK-22500: cast for struct should not generate codes beyond 64KB
  • [SPARK-22644] - Make ML testsuite support StructuredStreaming test
  • [SPARK-22787] - Add a TPCH query suite
  • [SPARK-22800] - Add a SSB query suite
  • [SPARK-22881] - ML test for StructuredStreaming: spark.ml.regression
  • [SPARK-22938] - Assert that SQLConf.get is accessed only on the driver.
  • [SPARK-23072] - Add a Unicode schema test for file-based data sources
  • [SPARK-23132] - Run ml.image doctests in tests
  • [SPARK-23300] - Print out if Pandas and PyArrow are installed or not in tests
  • [SPARK-23311] - add FilterFunction test case for test CombineTypedFilters
  • [SPARK-23319] - Skip PySpark tests for old Pandas and old PyArrow

Task

  • [SPARK-12297] - Add work-around for Parquet/Hive int96 timestamp bug.
  • [SPARK-19810] - Remove support for Scala 2.10
  • [SPARK-20434] - Move Hadoop delegation token code from yarn to core
  • [SPARK-21366] - Add sql test for window functions
  • [SPARK-21699] - Remove unused getTableOption in ExternalCatalog
  • [SPARK-21731] - Upgrade scalastyle to 0.9
  • [SPARK-21848] - Create trait to identify user-defined functions
  • [SPARK-21939] - Use TimeLimits instead of Timeouts
  • [SPARK-22153] - Rename ShuffleExchange -> ShuffleExchangeExec
  • [SPARK-22416] - Move OrcOptions from `sql/hive` to `sql/core`
  • [SPARK-22473] - Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date
  • [SPARK-22485] - Use `exclude[Problem]` instead of `excludePackage` in MiMa
  • [SPARK-22634] - Update Bouncy castle dependency
  • [SPARK-22672] - Refactor ORC Tests
  • [SPARK-23104] - Document that kubernetes is still "experimental"
  • [SPARK-23426] - Use `hive` ORC impl and disable PPD for Spark 2.3.0

Dependency upgrade

Brainstorming

  • [SPARK-7146] - Should ML sharedParams be a public API?

Umbrella

  • [SPARK-18085] - SPIP: Better History Server scalability for many / large applications
  • [SPARK-20746] - Built-in SQL Function Improvement
  • [SPARK-21926] - Compatibility between ML Transformers and Structured Streaming
  • [SPARK-22820] - Spark 2.3 SQL API audit
  • [SPARK-23105] - Spark MLlib, GraphX 2.3 QA umbrella

New JIRA Project

  • [SPARK-20758] - Add Constant propagation optimization

Documentation

  • [SPARK-20015] - Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example
  • [SPARK-20132] - Add documentation for column string functions
  • [SPARK-20192] - SparkR 2.2.0 migration guide, release note
  • [SPARK-20442] - Fill up documentations for functions in Column API in PySpark
  • [SPARK-20448] - Document how FileInputDStream works with object storage
  • [SPARK-20456] - Add examples for functions collection for pyspark
  • [SPARK-20477] - Document R bisecting k-means in R programming guide
  • [SPARK-20478] - Document LinearSVC in R programming guide
  • [SPARK-20855] - Update the Spark kinesis docs to use the KinesisInputDStream builder instead of deprecated KinesisUtils
  • [SPARK-20858] - Document ListenerBus event queue size property
  • [SPARK-20889] - SparkR grouped documentation for Column methods
  • [SPARK-20992] - Link to Nomad scheduler backend in docs
  • [SPARK-21042] - Document Dataset.union is resolution by position, not name
  • [SPARK-21069] - Add rate source to programming guide
  • [SPARK-21123] - Options for file stream source are in a wrong table
  • [SPARK-21292] - R document Catalog function metadata refresh
  • [SPARK-21293] - R document update structured streaming
  • [SPARK-21469] - Add doc and example for FeatureHasher
  • [SPARK-21485] - API Documentation for Spark SQL functions
  • [SPARK-21616] - SparkR 2.3.0 migration guide, release note
  • [SPARK-21712] - Clarify PySpark Column.substr() type checking error message
  • [SPARK-21724] - Missing since information in the documentation of date functions
  • [SPARK-21925] - Update trigger interval documentation in docs with behavior change in Spark 2.2
  • [SPARK-21976] - Fix wrong doc about Mean Absolute Error
  • [SPARK-22110] - Enhance function description trim string function
  • [SPARK-22335] - Union for DataSet uses column order instead of types for union
  • [SPARK-22369] - PySpark: Document methods of spark.catalog interface
  • [SPARK-22399] - reference in mllib-clustering.html is out of date
  • [SPARK-22412] - Fix incorrect comment in DataSourceScanExec
  • [SPARK-22428] - Document spark properties for configuring the ContextCleaner
  • [SPARK-22490] - PySpark doc has misleading string for SparkSession.builder
  • [SPARK-22541] - Dataframes: applying multiple filters one after another using udfs and accumulators results in faulty accumulators
  • [SPARK-22735] - Add VectorSizeHint to ML features documentation
  • [SPARK-22993] - checkpointInterval param doc should be clearer
  • [SPARK-23048] - Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator
  • [SPARK-23069] - R doc for describe missing text
  • [SPARK-23127] - Update FeatureHasher user guide for catCols parameter
  • [SPARK-23138] - Add user guide example for multiclass logistic regression summary
  • [SPARK-23154] - Document backwards compatibility guarantees for ML persistence
  • [SPARK-23163] - Sync Python ML API docs with Scala
  • [SPARK-23313] - Add a migration guide for ORC
  • [SPARK-23327] - Update the description of three external API or functions
