Release Notes - Nutch - Version 1.20

Sub-task

  • [NUTCH-2596] - Upgrade from org.mortbay.jetty to org.eclipse.jetty
  • [NUTCH-2852] - Method invokes System.exit(...) 9 bugs
  • [NUTCH-2972] - Javadoc build fails using JDK 17
  • [NUTCH-3007] - Fix impossible casts

Bug

  • [NUTCH-2634] - Some links marked as "nofollow" are followed anyway.
  • [NUTCH-2820] - Review sample files used in any23 unit tests
  • [NUTCH-2924] - Generate maxCount expr evaluated only once
  • [NUTCH-2937] - parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
  • [NUTCH-2973] - Single domain names (e.g. https://localnet) can't be crawled - filtering fails
  • [NUTCH-2974] - Ant build fails with "Unparseable date" on certain platforms
  • [NUTCH-2979] - Upgrade Commons Text to 1.10.0
  • [NUTCH-2982] - Generator: parameter for URL normalization not passed forward
  • [NUTCH-2985] - Disable plugin urlfilter-validator by default
  • [NUTCH-2992] - Fetcher: always block fetch queues when exceptions threshold is reached
  • [NUTCH-3000] - protocol-selenium returns only the body, strips off the <head/> element
  • [NUTCH-3001] - protocol-selenium requires Content-Type header
  • [NUTCH-3002] - Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive
  • [NUTCH-3008] - indexer-elastic: downgrade to ES 7.10.2 to address licensing issues
  • [NUTCH-3012] - SegmentReader when dumping with option -recode: NPE on unparsed documents
  • [NUTCH-3027] - Trivial resource leak patch in DomainSuffixes.java
  • [NUTCH-3035] - Update license and notice file for release of 1.20

New Feature

  • [NUTCH-2832] - Create tutorial on sending Nutch logs to Elasticsearch
  • [NUTCH-2888] - Selenium Protocol: Support for Selenium 4
  • [NUTCH-2920] - Implement an indexer-opensearch plugin
  • [NUTCH-2991] - Support HTTP/S Header Authorization for Solr connections
  • [NUTCH-3029] - Host specific max. and min. intervals in adaptive scheduler

Improvement

  • [NUTCH-2853] - bin/nutch: remove deprecated commands solrindex, solrdedup, solrclean
  • [NUTCH-2883] - Provide means to run server as a persistent service in Docker container
  • [NUTCH-2897] - Do not suppress deprecated API warnings
  • [NUTCH-2961] - Upgrade dependencies of parsefilter-naivebayes
  • [NUTCH-2980] - Upgrade Selenium Java to 4.7.2
  • [NUTCH-2983] - nutch-default.xml improvements
  • [NUTCH-2990] - HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
  • [NUTCH-2993] - ScoringDepth plugin to skip depth check based on URL Pattern
  • [NUTCH-2995] - Upgrade to crawler-commons 1.4
  • [NUTCH-2996] - Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)
  • [NUTCH-2997] - Add Override annotations where applicable
  • [NUTCH-3004] - Avoid NPE in HttpResponse
  • [NUTCH-3005] - Upgrade selenium as needed
  • [NUTCH-3009] - Upgrade to Hadoop 3.3.6
  • [NUTCH-3010] - Injector: count unique number of injected URLs
  • [NUTCH-3011] - HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)
  • [NUTCH-3013] - Employ commons-lang3's StopWatch to simplify timing logic
  • [NUTCH-3014] - Standardize Job names
  • [NUTCH-3015] - Add more CI steps to GitHub master-build.yml
  • [NUTCH-3017] - Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
  • [NUTCH-3025] - urlfilter-fast to filter based on the length of the URL
  • [NUTCH-3031] - ProtocolFactory host mapper to support domains
  • [NUTCH-3032] - Indexing plugin as an adapter for end user's own POJO instances
  • [NUTCH-3036] - Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

Task

  • [NUTCH-2959] - Upgrade to Apache Tika 2.9.0
  • [NUTCH-2977] - Support for showing dependency tree
  • [NUTCH-2978] - Move to slf4j2 and remove log4j1 and reload4j
  • [NUTCH-2984] - Drop test proxy server and benchmark tool
  • [NUTCH-2989] - Can't have username/pw AND https on elastic-indexer?!
  • [NUTCH-2998] - Remove the Any23 plugin
  • [NUTCH-2999] - Update Lucene version to latest 8.x
  • [NUTCH-3016] - Upgrade Apache Ivy to 2.5.2
  • [NUTCH-3019] - Upgrade to Apache Tika 2.9.1
  • [NUTCH-3020] - ParseSegment should check for protocol's flags for truncation
  • [NUTCH-3024] - Remove flaky 'dependency check' target
  • [NUTCH-3033] - Upgrade Ivy to v2.5.2
  • [NUTCH-3037] - Upgrade org.apache.kafka:kafka_2.12 to v3.7.0
  • [NUTCH-3038] - Address issues discovered during 1.20 release management dryrun
