Sub-task
- [NUTCH-1164] - Write JUnit tests for protocol-http
- [NUTCH-1218] - Improve trunk API documentation
- [NUTCH-1878] - urlnormalizer-regex to keep third slash in file:///path/index.html
- [NUTCH-1879] - Regex URL normalizer should remove multiple slashes after file: protocol
- [NUTCH-1880] - URLUtil should not add additional slashes for file URLs
- [NUTCH-1885] - Protocol-file should treat symbolic links as redirects
- [NUTCH-1966] - Configuration endpoint for 1x REST API
- [NUTCH-1970] - Pretty print JSON output in config resource
- [NUTCH-1973] - Job Administration end point for the REST service
Bug
- [NUTCH-1483] - Can't crawl filesystem with protocol-file plugin
- [NUTCH-1592] - TikaParser can uppercase the element names while generating the DOM
- [NUTCH-1755] - Project name bug in build.xml
- [NUTCH-1771] - Solrindex fails if a segment is corrupted or incomplete
- [NUTCH-1825] - protocol-http may hang for certain web pages
- [NUTCH-1826] - indexchecker fails if solr.server.url not configured
- [NUTCH-1828] - bin/crawl : incorrect handling of nutch errors
- [NUTCH-1829] - Generator : unable to distinguish real errors
- [NUTCH-1832] - Make Nutch work without an indexer
- [NUTCH-1835] - Nutch's Solr schema doesn't work with Solr 4.9 because of the RealTimeGet handler
- [NUTCH-1844] - testresources/testcrawl not referenced anywhere in code
- [NUTCH-1854] - ./bin/crawl fails with a parsing fetcher
- [NUTCH-1864] - Bug in indexchecker CLI parsing and invocation of index-solr plugin by default
- [NUTCH-1865] - Enable use of SNAPSHOT's with Nutch Ivy dependency management
- [NUTCH-1866] - ant eclipse target should not delete runtime
- [NUTCH-1874] - FileDumper comment typos
- [NUTCH-1877] - Suffix URL filter to ignore query string by default
- [NUTCH-1881] - ant target resolve-default to keep test libs
- [NUTCH-1882] - ant eclipse target to add output path to src/test
- [NUTCH-1884] - NullPointerException in parsechecker and indexchecker with symlinks in file URL
- [NUTCH-1890] - Major Typo in Documentation for Integrating Nutch and Solr
- [NUTCH-1893] - Parse-tika fails to parse feed files
- [NUTCH-1897] - Easier debugging of plugin XML errors
- [NUTCH-1904] - Schema for Solr4 doesn't include _version_ field
- [NUTCH-1906] - Typo in CrawlDbReader command line help
- [NUTCH-1911] - Improve DomainStatistics tool command line parsing
- [NUTCH-1912] - Dump tool -mimetype parameter needs to be optional to prevent NPE
- [NUTCH-1916] - Apache Nutch CXF-based REST services
- [NUTCH-1918] - TikaParser specifies a default namespace when generating DOM
- [NUTCH-1919] - Getting timeout when server returns Content-Length: 0
- [NUTCH-1921] - Optionally disable HTTP if-modified-since header
- [NUTCH-1937] - Error: Could not find or load main class bin.crawl
- [NUTCH-1939] - Fetcher fails to follow redirects
- [NUTCH-1950] - File name too long when bin/nutch dump
- [NUTCH-1954] - FilenameTooLong error appears in CommonCrawlDumper
- [NUTCH-1957] - FileDumper output file name collisions
- [NUTCH-1963] - CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked
- [NUTCH-1964] - tmp directory not cleaned up after using commoncrawldump tool
- [NUTCH-1967] - Possible SIooBE in MimeAdaptiveFetchSchedule
- [NUTCH-1968] - File Name too long issue of DumpFileUtil.java file
- [NUTCH-1974] - keyPrefix option for CommonCrawlDataDumper tool
- [NUTCH-1977] - commoncrawldump java heap space
- [NUTCH-1978] - solrindex will fail when indexing corrupted segments
- [NUTCH-1983] - CommonCrawlDumper and FileDumper don't dump correct JSON
- [NUTCH-1991] - Tika mime detection not using Nutch supplied tika-mimetypes.xml for content based detection
- [NUTCH-2001] - SubCollection Field Name incorrect in nutch-default.xml
New Feature
- [NUTCH-827] - HTTP POST Authentication
- [NUTCH-1323] - AjaxNormalizer
- [NUTCH-1526] - Create SegmentContentDumperTool for easily extracting out file contents from SegmentDirs
- [NUTCH-1660] - Index filter for Page's latitude and longitude
- [NUTCH-1693] - TextMD5Signature computed on textual content
- [NUTCH-1857] - readb -dump -format csv should use comma
- [NUTCH-1927] - Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
- [NUTCH-1933] - nutch-selenium plugin
- [NUTCH-1941] - Optional rolling http.agent.name's
- [NUTCH-1949] - Dump out the Nutch data into the Common Crawl format
- [NUTCH-1969] - URL Normalizer properly handling slashes
- [NUTCH-1976] - Allow Users to Set Hostname for Server
Improvement
- [NUTCH-865] - Format source code in unique style
- [NUTCH-881] - Good quality documentation for Nutch
- [NUTCH-1062] - Migrate BasicURLNormalizer from Apache ORO to java.util.regex
- [NUTCH-1409] - Remove deprecated properties db.{default,max}.fetch.interval, generate.max.per.host.by.ip
- [NUTCH-1724] - LinkDBReader to support regex output filtering
- [NUTCH-1775] - IndexingFilter: document origin of passed CrawlDatum
- [NUTCH-1823] - Upgrade to elasticsearch 1.4.1
- [NUTCH-1833] - Include version number within nutch binary usage statement
- [NUTCH-1839] - Improve WebGraph CLI parsing
- [NUTCH-1853] - Add commented out WebGraph executions to ./bin/crawl
- [NUTCH-1867] - CrawlDbReader: use setFloat to pass min score
- [NUTCH-1868] - Document and improve CLI for FileDumper tool
- [NUTCH-1869] - Add a flag to -mimeType flag to FileDumper
- [NUTCH-1875] - Add 'version' field to Solr schema as required by new Solr servers
- [NUTCH-1876] - Upgrade to Crawler Commons 0.5
- [NUTCH-1883] - bin/crawl: use function to run bin/nutch and check exit value
- [NUTCH-1887] - Specify HTMLMapper to use in TikaParser
- [NUTCH-1889] - Store all values from Tika metadata in Nutch metadata
- [NUTCH-1920] - Upgrade Nutch to use Java 1.7
- [NUTCH-1925] - Upgrade Tika to version 1.7
- [NUTCH-1928] - Indexing filter of documents by the MIME type
- [NUTCH-1959] - Improving CommonCrawlFormat implementations
- [NUTCH-1962] - Need to have mimetype-filter.txt file available by default
- [NUTCH-1972] - Dockerfile for Nutch 1.x
- [NUTCH-1975] - New configuration for CommonCrawlDataDumper tool
- [NUTCH-1979] - CrawlDbReader to implement Tool
- [NUTCH-1981] - Upgrade icu4j
- [NUTCH-1985] - Adding a main() method to the MimeTypeIndexingFilter
- [NUTCH-1986] - Clarify Elastic Search Indexer Plugin Settings
- [NUTCH-1987] - Make bin/crawl indexer agnostic
- [NUTCH-1988] - Make nested output directory dump optional
- [NUTCH-1989] - Handling invalid URLs in CommonCrawlDataDumper
- [NUTCH-1990] - Use URI.normalise() in BasicURLNormalizer
- [NUTCH-1994] - Upgrade to Apache Tika 1.8
- [NUTCH-1996] - Make protocol-selenium README part of plugin
- [NUTCH-1997] - Add CBOR "magic header" to CommonCrawlDataDumper output
- [NUTCH-2136] - Implement a different version of Naive Bayes Parse Filter
Test
- [NUTCH-1960] - JUnit test for dump method of CommonCrawlDataDumper
Task
- [NUTCH-1837] - Upgrade to Tika 1.6
- [NUTCH-1886] - Review and update default.properties
- [NUTCH-1955] - ByteWritable missing in NutchWritable
- [NUTCH-1956] - Members to be public in URLCrawlDatum