Release Notes - Spark - Version 2.4.8

Sub-task

  • [SPARK-24266] - Spark client terminates while driver is still running
  • [SPARK-27421] - RuntimeException when querying a view on a partitioned parquet table
  • [SPARK-30894] - The nullability of Size function should not depend on SQLConf.get
  • [SPARK-32247] - scipy installation fails with PyPy
  • [SPARK-33096] - Use LinkedHashMap instead of Map for newlyCreatedExecutors
  • [SPARK-33290] - REFRESH TABLE should invalidate cache even though the table itself may not be cached
  • [SPARK-33464] - Add/remove (un)necessary cache and restructure GitHub Actions yaml
  • [SPARK-33667] - Respect case sensitivity in V1 SHOW PARTITIONS
  • [SPARK-33670] - Verify the partition provider is Hive in v1 SHOW TABLE EXTENDED
  • [SPARK-33732] - Kubernetes integration tests don't work with Minikube 1.9+
  • [SPARK-33742] - Throw PartitionsAlreadyExistException from HiveExternalCatalog.createPartitions()
  • [SPARK-33788] - Throw NoSuchPartitionsException from HiveExternalCatalog.dropPartitions()
  • [SPARK-33911] - Update SQL migration guide about changes in HiveClientImpl
  • [SPARK-34407] - KubernetesClusterSchedulerBackend.stop should clean up K8s resources
  • [SPARK-34507] - Spark artefacts built against Scala 2.13 incorrectly depend on Scala 2.12

Bug

  • [SPARK-25271] - Creating parquet table with all the column null throws exception
  • [SPARK-26625] - spark.redaction.regex should include oauthToken
  • [SPARK-26645] - CSV infer schema bug infers decimal(9,-1)
  • [SPARK-27575] - Spark overwrites existing value of spark.yarn.dist.* instead of merging value
  • [SPARK-27872] - Driver and executors use a different service account breaking pull secrets
  • [SPARK-29574] - spark with user provided hadoop doesn't work on kubernetes
  • [SPARK-30201] - HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT
  • [SPARK-30228] - Update zstd-jni to 1.4.4-3
  • [SPARK-31882] - DAG-viz is not rendered correctly with pagination.
  • [SPARK-32635] - When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result
  • [SPARK-32708] - Query optimization fails to reuse exchange with DataSourceV2
  • [SPARK-32715] - Broadcast block pieces may memory leak
  • [SPARK-32738] - thread safe endpoints may hang due to fatal error
  • [SPARK-32794] - Rare corner case error in micro-batch engine with some stateful queries + no-data-batches + V1 streaming sources
  • [SPARK-32815] - Fix LibSVM data source loading error on file paths with glob metacharacters
  • [SPARK-32832] - Use CaseInsensitiveMap for DataStreamReader/Writer options
  • [SPARK-32836] - Fix DataStreamReaderWriterSuite to check writer options correctly
  • [SPARK-32845] - Add sinkParameter to check sink options robustly in DataStreamReaderWriterSuite
  • [SPARK-32865] - python section in quickstart page doesn't display SPARK_VERSION correctly
  • [SPARK-32872] - BytesToBytesMap at MAX_CAPACITY exceeds growth threshold
  • [SPARK-32886] - '.../jobs/undefined' link from "Event Timeline" in jobs page
  • [SPARK-32898] - totalExecutorRunTimeMs is too big
  • [SPARK-32900] - UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in the input and radix sorting is used.
  • [SPARK-32901] - UnsafeExternalSorter may cause a SparkOutOfMemoryError to be thrown while spilling
  • [SPARK-32908] - percentile_approx() returns incorrect results
  • [SPARK-32924] - Web UI sort on duration is wrong
  • [SPARK-32999] - TreeNode.nodeName should not throw malformed class name error
  • [SPARK-33094] - ORC format does not propagate Hadoop config from DS options to underlying HDFS file system
  • [SPARK-33101] - LibSVM format does not propagate Hadoop config from DS options to underlying HDFS file system
  • [SPARK-33131] - Fix grouping sets with having clause can not resolve qualified col name
  • [SPARK-33136] - Handling nullability for complex types is broken during resolution of V2 write command
  • [SPARK-33183] - Bug in optimizer rule EliminateSorts
  • [SPARK-33217] - Set upper bound of Pandas and PyArrow version in GitHub Actions in branch-2.4
  • [SPARK-33230] - FileOutputWriter jobs have duplicate JobIDs if launched in same second
  • [SPARK-33268] - Fix bugs for casting data from/to PythonUserDefinedType
  • [SPARK-33277] - Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.
  • [SPARK-33292] - Make Literal ArrayBasedMapData string representation unambiguous
  • [SPARK-33313] - R/run-tests.sh is not compatible with testthat >= 3.0
  • [SPARK-33333] - Upgrade Jetty to 9.4.28.v20200408
  • [SPARK-33338] - GROUP BY using literal map should not fail
  • [SPARK-33339] - Pyspark application will hang due to non Exception
  • [SPARK-33372] - Fix InSet bucket pruning
  • [SPARK-33405] - Upgrade commons-compress to 1.20
  • [SPARK-33417] - Correct the behaviour of query filters in TPCDSQueryBenchmark
  • [SPARK-33472] - IllegalArgumentException when applying RemoveRedundantSorts before EnsureRequirements
  • [SPARK-33483] - Fix rat exclusion patterns and add a LICENSE
  • [SPARK-33588] - Partition spec in SHOW TABLE EXTENDED doesn't respect `spark.sql.caseSensitive`
  • [SPARK-33593] - Vector reader got incorrect data with binary partition value
  • [SPARK-33631] - Clean up `spark.core.connection.ack.wait.timeout` from `configuration.md`
  • [SPARK-33681] - Increase K8s IT timeout to 3 minutes
  • [SPARK-33725] - Upgrade snappy-java to 1.1.8.2
  • [SPARK-33726] - Duplicate field names causes wrong answers during aggregation
  • [SPARK-33733] - PullOutNondeterministic should check and collect deterministic field
  • [SPARK-33749] - Exclude target directory in pycodestyle and flake8
  • [SPARK-33756] - BytesToBytesMap's iterator hasNext method should be idempotent.
  • [SPARK-33757] - Fix the R dependencies build error on GitHub Actions and AppVeyor
  • [SPARK-33831] - Update Jetty to 9.4.34
  • [SPARK-34012] - Keep behavior consistent when conf `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide
  • [SPARK-34125] - Make EventLoggingListener.codecMap thread-safe
  • [SPARK-34187] - Use available offset range obtained during polling when checking offset validation
  • [SPARK-34212] - For parquet table, after changing the precision and scale of decimal type in hive, spark reads incorrect value
  • [SPARK-34229] - Avro should read decimal values with the file schema
  • [SPARK-34231] - AvroSuite has test failure when run from IDE due to bad loading of resource file
  • [SPARK-34260] - UnresolvedException when creating temp view twice
  • [SPARK-34268] - The Signature for ConcatWs in Spark SQL Docs Is Inconsistent with the Actual Behavior
  • [SPARK-34270] - Combine StateStoreMetrics should not override StateStoreCustomMetric
  • [SPARK-34273] - Do not reregister BlockManager when SparkContext is stopped
  • [SPARK-34318] - Dataset.colRegex should work with column names and qualifiers which contain newlines
  • [SPARK-34327] - Omit inlining passwords during build process.
  • [SPARK-34449] - Upgrade Jetty to fix CVE-2020-27218
  • [SPARK-34596] - NewInstance.doGenCode should not throw malformed class name error
  • [SPARK-34607] - NewInstance.resolved should not throw malformed class name error
  • [SPARK-34672] - Fix docker file for creating release
  • [SPARK-34696] - Fix CodegenInterpretedPlanTest to generate correct test cases
  • [SPARK-34703] - Fix pyspark test when using sort_values on Pandas
  • [SPARK-34719] - fail if the view query has duplicated column names
  • [SPARK-34724] - Fix Interpreted evaluation by using getClass.getMethod instead of getDeclaredMethod
  • [SPARK-34726] - Fix collectToPython timeouts
  • [SPARK-34743] - ExpressionEncoderSuite should use deepEquals when we expect `array of array`
  • [SPARK-34774] - The `change-scala-version.sh` script does not replace the scala.version property correctly
  • [SPARK-34776] - Catalyst error on certain struct operation (Couldn't find _gen_alias_)
  • [SPARK-34811] - Redact fs.s3a.access.key like secret and token
  • [SPARK-34855] - SparkContext - avoid using local lazy val
  • [SPARK-34874] - Recover test reports for failed GA builds
  • [SPARK-34876] - Non-nullable aggregates can return NULL in a correlated subquery
  • [SPARK-34909] - conv() does not convert negative inputs to unsigned correctly
  • [SPARK-34939] - Throw fetch failure exception when unable to deserialize broadcasted map statuses
  • [SPARK-34963] - Nested column pruning fails to extract case-insensitive struct field from array
  • [SPARK-34988] - Upgrade Jetty for CVE-2021-28165
  • [SPARK-34994] - Fix git error when pushing the tag after release script succeeds
  • [SPARK-35080] - Correlated subqueries with equality predicates can return wrong results
  • [SPARK-35278] - Invoke should find the method with correct number of parameters
  • [SPARK-35288] - StaticInvoke should find the method without exact argument classes match

Improvement

  • [SPARK-31225] - Override `sql` method for OuterReference
  • [SPARK-31807] - Use python 3 style in release-build.sh
  • [SPARK-32090] - UserDefinedType.equal() does not have symmetry
  • [SPARK-33123] - Ignore `GitHub Action file` change in Amplab Jenkins
  • [SPARK-33156] - Upgrade GithubAction image from 18.04 to 20.04
  • [SPARK-33228] - Don't uncache data when replacing an existing view having the same plan
  • [SPARK-33535] - export LANG to en_US.UTF-8 in jenkins test script
  • [SPARK-33675] - Add GitHub Action job to publish snapshot
  • [SPARK-34059] - Use for/foreach rather than map to make sure it executes eagerly
  • [SPARK-34118] - Replaces filter and check for emptiness with exists or forall
  • [SPARK-34153] - Remove unused `getRawTable()` from `HiveExternalCatalog.alterPartitions()`
  • [SPARK-34275] - Replaces filter and size with count
  • [SPARK-34310] - Replaces map and flatten with flatMap
  • [SPARK-35227] - Replace Bintray with the new repository service for the spark-packages resolver in SparkSubmit

Test

  • [SPARK-24931] - Recover lint-r job in GitHub Actions workflow
  • [SPARK-26646] - Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
  • [SPARK-33051] - Uses setup-r to install R in GitHub Actions build
  • [SPARK-33770] - Test failures: ALTER TABLE .. DROP PARTITION tries to delete files out of partition path
  • [SPARK-33869] - Have a separate metastore dir for each PySpark test process

Task

  • [SPARK-35233] - Switch from bintray to scala.jfrog.io for SBT download in branch 2.4 and 3.0

Documentation

  • [SPARK-32306] - `approx_percentile` in Spark SQL gives incorrect results
  • [SPARK-32888] - reading a parallelized rdd with two identical records results in a zero count df when read via spark.read.csv
  • [SPARK-33585] - The comment for SQLContext.tables() doesn't mention the `database` column
