Release Notes - Nutch - Version 2.3.1

Sub-task

Bug

  • [NUTCH-1572] - Nutch 2.x should use o.a.g.mem.store.MemStore for testing
  • [NUTCH-1679] - UpdateDb using batchId, link may override crawled page.
  • [NUTCH-1893] - Parse-tika fails to parse feed files
  • [NUTCH-1922] - DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches
  • [NUTCH-2009] - Fetcher does not work with batchID
  • [NUTCH-2019] - ClassPathException sending topN argument for /job/create using Nutch 2.x RESTApi
  • [NUTCH-2028] - java.lang.IllegalArgumentException: can't serialize class org.apache.avro.util.Utf8
  • [NUTCH-2029] - Mark.checkMark returns empty string when null is expected with mongodb storage
  • [NUTCH-2042] - parse-html increase chunk size used to detect charset
  • [NUTCH-2045] - index-basic incorrect assignment of next fetch time (page.getFetchTime()) as page fetch time
  • [NUTCH-2080] - Eclipse compilation issue
  • [NUTCH-2094] - Stopping and Restarting a crawl has issues in the Web UI
  • [NUTCH-2101] - Upgrade Nutch 2.X to Hadoop 2.5.1
  • [NUTCH-2130] - copyField rawcontent creates error within schema.xml
  • [NUTCH-2143] - GeneratorJob ignores batch id passed as argument
  • [NUTCH-2168] - Parse-tika fails to retrieve parser
  • [NUTCH-2377] - Nutch can't parse relative links

New Feature

  • [NUTCH-1900] - DockerFile for Nutch 2.x
  • [NUTCH-1941] - Optional rolling http.agent.name's
  • [NUTCH-1944] - Add raw content to indexes
  • [NUTCH-2105] - Update Nutch Cassandra Dockerfile to work with Gora Nutch 2.3.1

Improvement

  • [NUTCH-1062] - Migrate BasicURLNormalizer from Apache ORO to java.util.regex
  • [NUTCH-1286] - Refactoring/reimplementing crawling API (NutchApp)
  • [NUTCH-1920] - Upgrade Nutch to use Java 1.7
  • [NUTCH-1925] - Upgrade Tika to version 1.7
  • [NUTCH-1946] - Upgrade to Gora 0.6.1
  • [NUTCH-1981] - Upgrade icu4j
  • [NUTCH-1990] - Use URI.normalize() in BasicURLNormalizer
  • [NUTCH-1994] - Upgrade to Apache Tika 1.8
  • [NUTCH-2018] - Ensure that the Docker containers for Nutch 2.X are part of the Release Management Documentation
  • [NUTCH-2050] - Upgrade HBase and Hadoop versioning on 2.X HBase Docker
  • [NUTCH-2077] - Upgrade to Tika 1.10
  • [NUTCH-2082] - Upgrade to Apache Tika 1.10
  • [NUTCH-2107] - plugin.xml to validate against plugin.dtd
  • [NUTCH-2169] - Integrate index-html into Nutch build

Task

  • [NUTCH-1886] - Review and update default.properties
  • [NUTCH-1936] - GSoC 2015 - Move Nutch to Hadoop 2.X
