Sub-task
- [NUTCH-2250] - CommonCrawlDumper: Invalid format + skipped parts
Bug
- [NUTCH-2042] - parse-html increase chunk size used to detect charset
- [NUTCH-2180] - FileDumper dumps data, but breaks midway on corrupt segments
- [NUTCH-2189] - Domain filter must deactivate if no rules are present
- [NUTCH-2203] - Suffix URL filter can't handle trailing/leading whitespace
- [NUTCH-2206] - Provide example scoring.similarity.stopword.file
- [NUTCH-2213] - CommonCrawlDataDumper saves gzipped body in extracted form
- [NUTCH-2223] - Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection
- [NUTCH-2224] - Average bytes/second calculated incorrectly in fetcher
- [NUTCH-2225] - Parsed time calculated incorrectly
- [NUTCH-2228] - Plugin index-replace unit test broken on Java 8
- [NUTCH-2232] - DeduplicationJob should decode URLs before length is compared
- [NUTCH-2241] - Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
- [NUTCH-2256] - Inconsistent log level practice
New Feature
- [NUTCH-961] - Expose Tika's boilerpipe support
- [NUTCH-1325] - HostDB for Nutch
- [NUTCH-2144] - Plugin to override db.ignore.external to exempt interesting external domain URLs
- [NUTCH-2190] - Protocol normalizer
- [NUTCH-2191] - Add protocol-htmlunit
- [NUTCH-2194] - Run IndexingFilterChecker as simple Telnet server
- [NUTCH-2219] - Criteria order to be configurable in DeduplicationJob
- [NUTCH-2227] - RegexParseFilter
- [NUTCH-2245] - Developed the NGram Model on the existing Unigram Cosine Similarity Model
Improvement
- [NUTCH-1233] - Rely on Tika for outlink extraction
- [NUTCH-1712] - Use MultipleInputs in Injector to make it a single mapreduce job
- [NUTCH-2172] - index-more: document format of contenttype-mapping.txt
- [NUTCH-2178] - DeduplicationJob to optionally group on host or domain
- [NUTCH-2182] - Make reverseUrlDirs file dumper option hash the URL for consistency
- [NUTCH-2183] - Improvement to SegmentChecker for skipping non-segments present in segments directory
- [NUTCH-2187] - Change FileDumper SHAs to all uppercase
- [NUTCH-2195] - IndexingFilterChecker to optionally follow N redirects
- [NUTCH-2196] - IndexingFilterChecker to optionally normalize
- [NUTCH-2197] - Add solr5 solrcloud indexer support
- [NUTCH-2204] - Remove junit lib from runtime
- [NUTCH-2218] - Switch CrawlCompletion arg parsing to Commons CLI
- [NUTCH-2221] - Introduce db.ignore.internal.links to FetcherThread
- [NUTCH-2229] - Allow Jexl expressions on CrawlDatum's fixed attributes
- [NUTCH-2231] - Jexl support in generator job
- [NUTCH-2252] - Allow phantomjs as a browser for selenium options
- [NUTCH-2263] - Support for mingram and maxgram at Unigram Cosine Similarity Model
Task
- [NUTCH-2201] - Remove loops program from webgraph package
- [NUTCH-2211] - Filter and normalizer checkers missing in bin/nutch
- [NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.*