Release Notes - Nutch - Version 1.10 - HTML format

Sub-task

  • [NUTCH-1164] - Write JUnit tests for protocol-http
  • [NUTCH-1218] - Improve trunk API documentation
  • [NUTCH-1878] - urlnormalizer-regex to keep third slash in file:///path/index.html
  • [NUTCH-1879] - Regex URL normalizer should remove multiple slashes after file: protocol
  • [NUTCH-1880] - URLUtil should not add additional slashes for file URLs
  • [NUTCH-1885] - Protocol-file should treat symbolic links as redirects
  • [NUTCH-1966] - Configuration endpoint for 1x REST API
  • [NUTCH-1970] - Pretty print JSON output in config resource
  • [NUTCH-1973] - Job Administration end point for the REST service

Bug

  • [NUTCH-1483] - Can't crawl filesystem with protocol-file plugin
  • [NUTCH-1592] - TikaParser can uppercase the element names while generating the DOM
  • [NUTCH-1755] - Project name bug in build.xml
  • [NUTCH-1771] - Solrindex fails if a segment is corrupted or incomplete
  • [NUTCH-1825] - protocol-http may hang for certain web pages
  • [NUTCH-1826] - indexchecker fails if solr.server.url not configured
  • [NUTCH-1828] - bin/crawl : incorrect handling of nutch errors
  • [NUTCH-1829] - Generator : unable to distinguish real errors
  • [NUTCH-1832] - Make Nutch work without an indexer
  • [NUTCH-1835] - Nutch's Solr schema doesn't work with Solr 4.9 because of the RealTimeGet handler
  • [NUTCH-1844] - testresources/testcrawl not referenced anywhere in code
  • [NUTCH-1854] - ./bin/crawl fails with a parsing fetcher
  • [NUTCH-1864] - Bug in indexchecker CLI parsing and invocation of index-solr plugin by default
  • [NUTCH-1865] - Enable use of SNAPSHOT's with Nutch Ivy dependency management
  • [NUTCH-1866] - ant eclipse target should not delete runtime
  • [NUTCH-1874] - FileDumper comment typos
  • [NUTCH-1877] - Suffix URL filter to ignore query string by default
  • [NUTCH-1881] - ant target resolve-default to keep test libs
  • [NUTCH-1882] - ant eclipse target to add output path to src/test
  • [NUTCH-1884] - NullPointerException in parsechecker and indexchecker with symlinks in file URL
  • [NUTCH-1890] - Major Typo in Documentation for Integrating Nutch and Solr
  • [NUTCH-1893] - Parse-tika fails to parse feed files
  • [NUTCH-1897] - Easier debugging of plugin XML errors
  • [NUTCH-1904] - Schema for Solr4 doesn't include _version_ field
  • [NUTCH-1906] - Typo in CrawlDbReader command line help
  • [NUTCH-1911] - Improve DomainStatistics tool command line parsing
  • [NUTCH-1912] - Dump tool -mimetype parameter needs to be optional to prevent NPE
  • [NUTCH-1916] - Apache Nutch CXF-based REST services
  • [NUTCH-1918] - TikaParser specifies a default namespace when generating DOM
  • [NUTCH-1919] - Getting timeout when server returns Content-Length: 0
  • [NUTCH-1921] - Optionally disable HTTP if-modified-since header
  • [NUTCH-1937] - Error: Could not find or load main class bin.crawl
  • [NUTCH-1939] - Fetcher fails to follow redirects
  • [NUTCH-1950] - File name too long when bin/nutch dump
  • [NUTCH-1954] - FilenameTooLong error appears in CommonCrawlDumper
  • [NUTCH-1957] - FileDumper output file name collisions
  • [NUTCH-1963] - CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked
  • [NUTCH-1964] - tmp directory not cleaned up after using commoncrawldump tool
  • [NUTCH-1967] - Possible SIooBE in MimeAdaptiveFetchSchedule
  • [NUTCH-1968] - File Name too long issue of DumpFileUtil.java file
  • [NUTCH-1974] - keyPrefix option for CommonCrawlDataDumper tool
  • [NUTCH-1977] - commoncrawldump java heap space
  • [NUTCH-1978] - solrindex will fail when indexing corrupted segments
  • [NUTCH-1983] - CommonCrawlDumper and FileDumper don't dump correct JSON
  • [NUTCH-1991] - Tika mime detection not using Nutch supplied tika-mimetypes.xml for content based detection
  • [NUTCH-2001] - SubCollection Field Name incorrect in nutch-default.xml

New Feature

  • [NUTCH-827] - HTTP POST Authentication
  • [NUTCH-1323] - AjaxNormalizer
  • [NUTCH-1526] - Create SegmentContentDumperTool for easily extracting out file contents from SegmentDirs
  • [NUTCH-1660] - Index filter for Page's latitude and longitude
  • [NUTCH-1693] - TextMD5Signature computed on textual content
  • [NUTCH-1857] - readb -dump -format csv should use comma
  • [NUTCH-1927] - Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
  • [NUTCH-1933] - nutch-selenium plugin
  • [NUTCH-1941] - Optional rolling http.agent.name's
  • [NUTCH-1949] - Dump out the Nutch data into the Common Crawl format
  • [NUTCH-1969] - URL Normalizer properly handling slashes
  • [NUTCH-1976] - Allow Users to Set Hostname for Server

Improvement

  • [NUTCH-865] - Format source code in unique style
  • [NUTCH-881] - Good quality documentation for Nutch
  • [NUTCH-1062] - Migrate BasicURLNormalizer from Apache ORO to java.util.regex
  • [NUTCH-1409] - Remove deprecated properties db.{default,max}.fetch.interval, generate.max.per.host.by.ip
  • [NUTCH-1724] - LinkDBReader to support regex output filtering
  • [NUTCH-1775] - IndexingFilter: document origin of passed CrawlDatum
  • [NUTCH-1823] - Upgrade to elasticsearch 1.4.1
  • [NUTCH-1833] - Include version number within nutch binary usage statement
  • [NUTCH-1839] - Improve WebGraph CLI parsing
  • [NUTCH-1853] - Add commented out WebGraph executions to ./bin/crawl
  • [NUTCH-1867] - CrawlDbReader: use setFloat to pass min score
  • [NUTCH-1868] - Document and improve CLI for FileDumper tool
  • [NUTCH-1869] - Add a flag to -mimeType flag to FileDumper
  • [NUTCH-1875] - Add 'version' field to Solr schema as required by new Solr servers
  • [NUTCH-1876] - Upgrade to Crawler Commons 0.5
  • [NUTCH-1883] - bin/crawl: use function to run bin/nutch and check exit value
  • [NUTCH-1887] - Specify HTMLMapper to use in TikaParser
  • [NUTCH-1889] - Store all values from Tika metadata in Nutch metadata
  • [NUTCH-1920] - Upgrade Nutch to use Java 1.7
  • [NUTCH-1925] - Upgrade Tika to version 1.7
  • [NUTCH-1928] - Indexing filter of documents by the MIME type
  • [NUTCH-1959] - Improving CommonCrawlFormat implementations
  • [NUTCH-1962] - Need to have mimetype-filter.txt file available by default
  • [NUTCH-1972] - Dockerfile for Nutch 1.x
  • [NUTCH-1975] - New configuration for CommonCrawlDataDumper tool
  • [NUTCH-1979] - CrawlDbReader to implement Tool
  • [NUTCH-1981] - Upgrade icu4j
  • [NUTCH-1985] - Adding a main() method to the MimeTypeIndexingFilter
  • [NUTCH-1986] - Clarify Elastic Search Indexer Plugin Settings
  • [NUTCH-1987] - Make bin/crawl indexer agnostic
  • [NUTCH-1988] - Make nested output directory dump optional
  • [NUTCH-1989] - Handling invalid URLs in CommonCrawlDataDumper
  • [NUTCH-1990] - Use URI.normalise() in BasicURLNormalizer
  • [NUTCH-1994] - Upgrade to Apache Tika 1.8
  • [NUTCH-1996] - Make protocol-selenium README part of plugin
  • [NUTCH-1997] - Add CBOR "magic header" to CommonCrawlDataDumper output
  • [NUTCH-2136] - Implement a different version of Naive Bayes Parse Filter

Test

  • [NUTCH-1960] - JUnit test for dump method of CommonCrawlDataDumper

Task
