Release Notes - Nutch - Version 1.18

Sub-task

  • [NUTCH-2671] - Upgrade ant ivy library
  • [NUTCH-2672] - Ant build erroneously installs *-test.jar instead of *.jar for target "nightly"
  • [NUTCH-2803] - Rename property http.robot.rules.whitelist
  • [NUTCH-2805] - Rename plugin urlfilter-domainblacklist
  • [NUTCH-2809] - Upgrade any23 plugin dependency to 2.4
  • [NUTCH-2816] - Add Spotbugs target to ant build
  • [NUTCH-2817] - Avoid check for equality of URL path and file part using ==/!=
  • [NUTCH-2829] - Fix ant target "clean-cache"
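
To illustrate the kind of problem NUTCH-2817 refers to: in Java, ==/!= on strings compares object identity, not contents, so comparing a URL's path and file part that way can give misleading results. The following is a minimal sketch, not the actual Nutch code; the class name UrlPartCompare and the method fileDiffersFromPath are hypothetical and only show the content-based comparison with equals().

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical helper, for illustration only (not part of Nutch).
public class UrlPartCompare {

  /** Returns true if the URL's file part (path plus query) differs from its path. */
  static boolean fileDiffersFromPath(String urlString) {
    try {
      URL url = URI.create(urlString).toURL();
      String path = url.getPath(); // e.g. "/a/b"
      String file = url.getFile(); // e.g. "/a/b?x=1"
      // equals() compares the characters; path != file would only
      // tell whether both variables point to the same String object.
      return !path.equals(file);
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(fileDiffersFromPath("http://example.com/a/b?x=1"));
    System.out.println(fileDiffersFromPath("http://example.com/a/b"));
  }
}
```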

Bug

  • [NUTCH-2669] - Reliable solution for javax.ws packaging.type
  • [NUTCH-2697] - Upgrade Ivy to fix the issue of an unset packaging.type property
  • [NUTCH-2801] - RobotsRulesParser command-line checker to use http.robots.agents as fall-back
  • [NUTCH-2810] - FreeGenerator to actually apply configured number of fetch lists
  • [NUTCH-2813] - MoreIndexingFilter - can't parse erroneous date - 2019-07-03T10:28:14
  • [NUTCH-2814] - HttpDateFormat's internal time zone may change after parsing a date
  • [NUTCH-2818] - Ant build: upgrade Apache Rat report task
  • [NUTCH-2823] - IllegalStateException in IndexWriters.describe() when validating url param for SolrIndexer
  • [NUTCH-2824] - urlnormalizer-basic to unescape percent-encoded host names
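
As background to NUTCH-2814: a java.text.DateFormat may overwrite the time zone of its internal calendar while parsing, so a shared formatter can behave differently afterwards. The sketch below shows one way to avoid that class of problem using the immutable java.time formatter for HTTP dates. It is an assumption-laden illustration, not the fix applied in Nutch; the class name HttpDateSketch and its parse method are hypothetical.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical example class, not Nutch's HttpDateFormat.
public class HttpDateSketch {

  // DateTimeFormatter instances are immutable and thread-safe,
  // so parsing cannot change the formatter's state.
  private static final DateTimeFormatter HTTP_DATE =
      DateTimeFormatter.RFC_1123_DATE_TIME;

  /** Parses an RFC 1123 date as used in HTTP headers into an Instant. */
  static Instant parse(String httpDate) {
    return ZonedDateTime.parse(httpDate, HTTP_DATE).toInstant();
  }

  public static void main(String[] args) {
    // Both strings denote the same point in time.
    System.out.println(parse("Wed, 03 Jul 2019 10:28:14 GMT"));
    System.out.println(parse("Wed, 03 Jul 2019 12:28:14 +0200"));
  }
}
```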

Improvement

  • [NUTCH-1190] - MoreIndexingFilter refactor: move date formats used to parse "lastModified" to a config file
  • [NUTCH-2582] - Set pool size of XML SAX parsers used for MIME detection in Tika 1.19
  • [NUTCH-2730] - SitemapProcessor to treat sitemap URLs as Set instead of List
  • [NUTCH-2782] - protocol-http / lib-http: support TLSv1.3
  • [NUTCH-2796] - Upgrade to crawler-commons 1.1
  • [NUTCH-2799] - Add .asf.yaml file
  • [NUTCH-2833] - Upgrade to Tika 1.25
  • [NUTCH-2835] - Upgrade commons-jexl from 2 to 3
  • [NUTCH-2836] - Upgrade various commons dependencies
  • [NUTCH-2837] - Update multiple dependencies
  • [NUTCH-2841] - Upgrade xercesImpl dependency
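
The idea behind NUTCH-2730, treating sitemap URLs as a Set instead of a List, can be sketched as follows. This is only an illustration of the data-structure choice, not the SitemapProcessor code; the class name SitemapUrlSet and the method uniqueSitemaps are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical example class, not part of Nutch.
public class SitemapUrlSet {

  /** Collapses duplicate sitemap URLs while keeping first-seen order. */
  static Set<String> uniqueSitemaps(List<String> urls) {
    // A Set holds each URL at most once, so a sitemap announced
    // several times would be processed only once.
    return new LinkedHashSet<>(urls);
  }

  public static void main(String[] args) {
    List<String> found = Arrays.asList(
        "https://example.com/sitemap.xml",
        "https://example.com/news-sitemap.xml",
        "https://example.com/sitemap.xml");
    System.out.println(uniqueSitemaps(found));
  }
}
```

A LinkedHashSet is used here only to keep the output order predictable; a plain HashSet would deduplicate just as well.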

Wish

  • [NUTCH-2834] - Deduplication mode via command line in crawl script

Task
