Release Notes - Nutch - Version 1.11

Sub-task

  • [NUTCH-2015] - Make FetchNodeDb optional (off by default) if NutchServer is not used
  • [NUTCH-2031] - Create Admin End point for Nutch 1.x REST service
  • [NUTCH-2037] - Job endpoint to support Indexing from the REST API
  • [NUTCH-2066] - Parameterize Generate REST endpoint
  • [NUTCH-2090] - Refactor Seed Resource in REST API
  • [NUTCH-2092] - Unit Test for NutchServer
  • [NUTCH-2099] - Refactoring the REST endpoints for integration with webui
  • [NUTCH-2128] - Refactor configuration end point
  • [NUTCH-2167] - Backport TableUtil from 2.x for URL reversing

Bug

  • [NUTCH-1247] - CrawlDatum.retries should be int
  • [NUTCH-1692] - SegmentReader broken in distributed mode
  • [NUTCH-1711] - Normalizer does not encode exclamation mark
  • [NUTCH-1873] - Solr IndexWriter/Job to report number of docs indexed.
  • [NUTCH-1905] - Nutch index tool should be resilient to segments that don't have crawl_* data
  • [NUTCH-1911] - Improve DomainStatistics tool command line parsing
  • [NUTCH-2000] - Link inversion fails with .locked already exists.
  • [NUTCH-2007] - add test libs to classpath of bin/nutch junit
  • [NUTCH-2013] - Fetcher: missing logs "fetching ..." on stdout
  • [NUTCH-2014] - Fetcher hang-up on completion
  • [NUTCH-2017] - Remove debug log from MimeUtil
  • [NUTCH-2041] - indexer fails if linkdb is missing
  • [NUTCH-2059] - protocol-httpclient, protocol-http unit test errors on Jenkins
  • [NUTCH-2063] - Add -mimeStats flag to FileDumper tool
  • [NUTCH-2064] - URLNormalizer basic to encode reserved chars and decode non-reserved chars
  • [NUTCH-2072] - Deflate encoding support is broken when http.content.limit is set to -1
  • [NUTCH-2084] - Track changes in input dirs for SegmentMerger
  • [NUTCH-2093] - Indexing filters have no signature in CrawlDatum if crawled via FreeGenerator
  • [NUTCH-2098] - Add null SeedUrl constructor
  • [NUTCH-2106] - Runtime to contain Selenium and dependencies only once
  • [NUTCH-2119] - Eclipse shows build path errors on building Nutch
  • [NUTCH-2121] - Update javadoc link for Hadoop 2.4.0 in default.properties
  • [NUTCH-2123] - Seed List REST API returns Text but headers indicate/require JSON
  • [NUTCH-2124] - redirect following same link again and again, max redirect exceed and went db_gone
  • [NUTCH-2142] - Nutch File Dump - FileNotFoundException (Invalid Argument) Error
  • [NUTCH-2146] - hashCode on the Outlink class
  • [NUTCH-2154] - Nutch REST API (DB) suffering NullPointerException
  • [NUTCH-2159] - Ensure that all WebApp files are copied into generated artifacts for 1.X Webapp
  • [NUTCH-2165] - FileDumper Util hard codes part-# folder name
  • [NUTCH-2173] - String.join in FileDumper breaks the build
  • [NUTCH-2176] - Clean up of log4j.properties
  • [NUTCH-2177] - Generator produces only one partition even in distributed mode

New Feature

  • [NUTCH-208] - http: proxy exception list:
  • [NUTCH-1040] - Backport REST-API from 2.0
  • [NUTCH-1517] - CloudSearch indexer
  • [NUTCH-1785] - Ability to index raw content
  • [NUTCH-1800] - Documentation for Nutch 1.X REST API
  • [NUTCH-1913] - LinkDB to implement db.ignore.external.links
  • [NUTCH-1980] - Jexl expressions for CrawlDbReader
  • [NUTCH-2021] - Use protocol-selenium to Capture Screenshots of the Page as it is Fetched
  • [NUTCH-2027] - seed list REST endpoint for Nutch 1.10
  • [NUTCH-2038] - Naive Bayes classifier based html Parse filter (for filtering outlinks)
  • [NUTCH-2039] - Relevance based scoring filter
  • [NUTCH-2086] - Nutch 1.X Webui
  • [NUTCH-2148] - Review and update mapred --> mapreduce config params in crawl script
  • [NUTCH-2149] - REST endpoint to read Nutch sequence files

Improvement

  • [NUTCH-1486] - Upgrade to Solr 4.10.2
  • [NUTCH-1684] - ParseMeta to be added before fetch schedulers are run
  • [NUTCH-1697] - SegmentMerger to implement Tool
  • [NUTCH-1934] - Refactor Fetcher in trunk
  • [NUTCH-1948] - Make the Selenium remote web driver specification, configuration and selection available via a Factory-type mechanism
  • [NUTCH-1988] - Make nested output directory dump optional
  • [NUTCH-1995] - Add support for wildcard to http.robot.rules.whitelist
  • [NUTCH-1998] - Add support for user-defined file extension to CommonCrawlDataDumper
  • [NUTCH-2004] - ParseChecker does not handle redirects
  • [NUTCH-2006] - IndexingFiltersChecker to take custom metadata as input
  • [NUTCH-2008] - IndexerMapReduce to use single instance of NutchIndexAction for deletions
  • [NUTCH-2036] - Adding some continuous crawl goodies to the crawl script
  • [NUTCH-2048] - parse-tika: fix dependencies in plugin.xml
  • [NUTCH-2049] - Upgrade Trunk to Hadoop > 2.4 stable
  • [NUTCH-2052] - Enhance index-static to allow configurable delimiters
  • [NUTCH-2058] - Indexer plugin that allows RegEx replacements on the NutchDocument field values
  • [NUTCH-2062] - Add Plugin for interacting with Selenium WebDriver
  • [NUTCH-2069] - Ignore external links based on domain
  • [NUTCH-2077] - Upgrade to Tika 1.10
  • [NUTCH-2082] - Upgrade to Apache Tika 1.10
  • [NUTCH-2083] - Implement functionality to shadow nutch-selenium-grid-plugin from Mo Omer
  • [NUTCH-2088] - Add Optional Execution to Interactive Selenium Handlers
  • [NUTCH-2096] - Explicitly indicate browser binary to use when selecting selenium remote option in config
  • [NUTCH-2102] - WARC Exporter
  • [NUTCH-2104] - Add documentation to the protocol-selenium plugin Readme file re: selenium grid implementation
  • [NUTCH-2107] - plugin.xml to validate against plugin.dtd
  • [NUTCH-2115] - Add total counts to dump stats
  • [NUTCH-2117] - NutchServer CLI Option for CMD_PORT is incorrect and should be CMD_HOST
  • [NUTCH-2129] - Track Protocol Status in Crawl Datum
  • [NUTCH-2139] - Basic plugin to index inlinks and outlinks
  • [NUTCH-2141] - Change the InteractiveSelenium plugin handler Interface to return page content
  • [NUTCH-2150] - Add ProtocolStatus Utility
  • [NUTCH-2155] - Create a "crawl completeness" utility
  • [NUTCH-2160] - Upgrade Selenium Java to 2.48.2
  • [NUTCH-2166] - Add reverse URL format to dump tool
  • [NUTCH-2175] - Typos in property descriptions in nutch-default.xml

Wish

  • [NUTCH-2016] - Remove unused class OldFetcher
  • [NUTCH-2022] - Investigate better documentation for the Nutch REST APIs

Task
