Sub-task
- [NUTCH-2015] - Make FetchNodeDb optional (off by default) if NutchServer is not used
- [NUTCH-2031] - Create Admin End point for Nutch 1.x REST service
- [NUTCH-2037] - Job endpoint to support Indexing from the REST API
- [NUTCH-2066] - Parameterize Generate REST endpoint
- [NUTCH-2090] - Refactor Seed Resource in REST API
- [NUTCH-2092] - Unit Test for NutchServer
- [NUTCH-2099] - Refactoring the REST endpoints for integration with webui
- [NUTCH-2128] - Refactor configuration end point
- [NUTCH-2167] - Backport TableUtil from 2.x for URL reversing
Bug
- [NUTCH-1247] - CrawlDatum.retries should be int
- [NUTCH-1692] - SegmentReader broken in distributed mode
- [NUTCH-1711] - Normalizer does not encode exclamation mark
- [NUTCH-1873] - Solr IndexWriter/Job to report number of docs indexed.
- [NUTCH-1905] - Nutch index tool should be resilient to segments that don't have crawl_* data
- [NUTCH-1911] - Improve DomainStatistics tool command line parsing
- [NUTCH-2000] - Link inversion fails with .locked already exists.
- [NUTCH-2007] - add test libs to classpath of bin/nutch junit
- [NUTCH-2013] - Fetcher: missing logs "fetching ..." on stdout
- [NUTCH-2014] - Fetcher hang-up on completion
- [NUTCH-2017] - Remove debug log from MimeUtil
- [NUTCH-2041] - indexer fails if linkdb is missing
- [NUTCH-2059] - protocol-httpclient, protocol-http unit test errors on Jenkins
- [NUTCH-2063] - Add -mimeStats flag to FileDumper tool
- [NUTCH-2064] - URLNormalizer basic to encode reserved chars and decode non-reserved chars
- [NUTCH-2072] - Deflate encoding support is broken when http.content.limit is set to -1
- [NUTCH-2084] - Track changes in input dirs for SegmentMerger
- [NUTCH-2093] - Indexing filters have no signature in CrawlDatum if crawled via FreeGenerator
- [NUTCH-2098] - Add null SeedUrl constructor
- [NUTCH-2106] - Runtime to contain Selenium and dependencies only once
- [NUTCH-2119] - Eclipse shows build path errors on building Nutch
- [NUTCH-2121] - Update javadoc link for Hadoop 2.4.0 in default.properties
- [NUTCH-2123] - Seed List REST API returns Text but headers indicate/require JSON
- [NUTCH-2124] - redirect following same link again and again, max redirect exceeded and went db_gone
- [NUTCH-2142] - Nutch File Dump - FileNotFoundException (Invalid Argument) Error
- [NUTCH-2146] - hashCode on the Outlink class
- [NUTCH-2154] - Nutch REST API (DB) suffering NullPointerException
- [NUTCH-2159] - Ensure that all WebApp files are copied into generated artifacts for 1.X Webapp
- [NUTCH-2165] - FileDumper Util hard codes part-# folder name
- [NUTCH-2173] - String.join in FileDumper breaks the build
- [NUTCH-2176] - Clean up of log4j.properties
- [NUTCH-2177] - Generator produces only one partition even in distributed mode
New Feature
- [NUTCH-208] - http: proxy exception list:
- [NUTCH-1040] - Backport REST-API from 2.0
- [NUTCH-1517] - CloudSearch indexer
- [NUTCH-1785] - Ability to index raw content
- [NUTCH-1800] - Documentation for Nutch 1.X REST API
- [NUTCH-1913] - LinkDB to implement db.ignore.external.links
- [NUTCH-1980] - Jexl expressions for CrawlDbReader
- [NUTCH-2021] - Use protocol-selenium to Capture Screenshots of the Page as it is Fetched
- [NUTCH-2027] - seed list REST endpoint for Nutch 1.10
- [NUTCH-2038] - Naive Bayes classifier based html Parse filter (for filtering outlinks)
- [NUTCH-2039] - Relevance based scoring filter
- [NUTCH-2086] - Nutch 1.X Webui
- [NUTCH-2148] - Review and update mapred --> mapreduce config params in crawl script
- [NUTCH-2149] - REST endpoint to read Nutch sequence files
Improvement
- [NUTCH-1486] - Upgrade to Solr 4.10.2
- [NUTCH-1684] - ParseMeta to be added before fetch schedulers are run
- [NUTCH-1697] - SegmentMerger to implement Tool
- [NUTCH-1934] - Refactor Fetcher in trunk
- [NUTCH-1948] - Make the Selenium remote web driver specification, configuration and selection available via a Factory-type mechanism
- [NUTCH-1988] - Make nested output directory dump optional
- [NUTCH-1995] - Add support for wildcard to http.robot.rules.whitelist
- [NUTCH-1998] - Add support for user-defined file extension to CommonCrawlDataDumper
- [NUTCH-2004] - ParseChecker does not handle redirects
- [NUTCH-2006] - IndexingFiltersChecker to take custom metadata as input
- [NUTCH-2008] - IndexerMapReduce to use single instance of NutchIndexAction for deletions
- [NUTCH-2036] - Adding some continuous crawl goodies to the crawl script
- [NUTCH-2048] - parse-tika: fix dependencies in plugin.xml
- [NUTCH-2049] - Upgrade Trunk to Hadoop > 2.4 stable
- [NUTCH-2052] - Enhance index-static to allow configurable delimiters
- [NUTCH-2058] - Indexer plugin that allows RegEx replacements on the NutchDocument field values
- [NUTCH-2062] - Add Plugin for interacting with Selenium WebDriver
- [NUTCH-2069] - Ignore external links based on domain
- [NUTCH-2077] - Upgrade to Tika 1.10
- [NUTCH-2082] - Upgrade to Apache Tika 1.10
- [NUTCH-2083] - Implement functionality to shadow nutch-selenium-grid-plugin from Mo Omer
- [NUTCH-2088] - Add Optional Execution to Interactive Selenium Handlers
- [NUTCH-2096] - Explicitly indicate browser binary to use when selecting selenium remote option in config
- [NUTCH-2102] - WARC Exporter
- [NUTCH-2104] - Add documentation to the protocol-selenium plugin Readme file re: selenium grid implementation
- [NUTCH-2107] - plugin.xml to validate against plugin.dtd
- [NUTCH-2115] - Add total counts to dump stats
- [NUTCH-2117] - NutchServer CLI Option for CMD_PORT is incorrect and should be CMD_HOST
- [NUTCH-2129] - Track Protocol Status in Crawl Datum
- [NUTCH-2139] - Basic plugin to index inlinks and outlinks
- [NUTCH-2141] - Change the InteractiveSelenium plugin handler Interface to return page content
- [NUTCH-2150] - Add ProtocolStatus Utility
- [NUTCH-2155] - Create a "crawl completeness" utility
- [NUTCH-2160] - Upgrade Selenium Java to 2.48.2
- [NUTCH-2166] - Add reverse URL format to dump tool
- [NUTCH-2175] - Typos in property descriptions in nutch-default.xml
Wish
- [NUTCH-2016] - Remove unused class OldFetcher
- [NUTCH-2022] - Investigate better documentation for the Nutch REST APIs
Task
- [NUTCH-1936] - GSoC 2015 - Move Nutch to Hadoop 2.X
- [NUTCH-2085] - Upgrade Guava
- [NUTCH-2120] - Remove MapWritable from trunk codebase
- [NUTCH-2158] - Upgrade to Tika 1.11