Release Notes - Nutch - Version 1.10 - HTML format

Sub-task

[NUTCH-1164] - Write JUnit tests for protocol-http
[NUTCH-1218] - Improve trunk API documentation
[NUTCH-1878] - urlnormalizer-regex to keep third slash in file:///path/index.html
[NUTCH-1879] - Regex URL normalizer should remove multiple slashes after file: protocol
[NUTCH-1880] - URLUtil should not add additional slashes for file URLs
[NUTCH-1885] - Protocol-file should treat symbolic links as redirects
[NUTCH-1966] - Configuration endpoint for 1x REST API
[NUTCH-1970] - Pretty print JSON output in config resource
[NUTCH-1973] - Job Administration end point for the REST service

Bug

[NUTCH-1483] - Can't crawl filesystem with protocol-file plugin
[NUTCH-1592] - TikaParser can uppercase the element names while generating the DOM
[NUTCH-1755] - Project name bug in build.xml
[NUTCH-1771] - Solrindex fails if a segment is corrupted or incomplete
[NUTCH-1825] - protocol-http may hang for certain web pages
[NUTCH-1826] - indexchecker fails if solr.server.url not configured
[NUTCH-1828] - bin/crawl : incorrect handling of nutch errors
[NUTCH-1829] - Generator : unable to distinguish real errors
[NUTCH-1832] - Make Nutch work without an indexer
[NUTCH-1835] - Nutch's Solr schema doesn't work with Solr 4.9 because of the RealTimeGet handler
[NUTCH-1844] - testresources/testcrawl not referenced anywhere in code
[NUTCH-1854] - ./bin/crawl fails with a parsing fetcher
[NUTCH-1864] - Bug in indexchecker CLI parsing and invocation of index-solr plugin by default
[NUTCH-1865] - Enable use of SNAPSHOT's with Nutch Ivy dependency management
[NUTCH-1866] - ant eclipse target should not delete runtime
[NUTCH-1874] - FileDumper comment typos
[NUTCH-1877] - Suffix URL filter to ignore query string by default
[NUTCH-1881] - ant target resolve-default to keep test libs
[NUTCH-1882] - ant eclipse target to add output path to src/test
[NUTCH-1884] - NullPointerException in parsechecker and indexchecker with symlinks in file URL
[NUTCH-1890] - Major Typo in Documentation for Integrating Nutch and Solr
[NUTCH-1893] - Parse-tika fails to parse feed files
[NUTCH-1897] - Easier debugging of plugin XML errors
[NUTCH-1904] - Schema for Solr4 doesn't include _version_ field
[NUTCH-1906] - Typo in CrawlDbReader command line help
[NUTCH-1911] - Improve DomainStatistics tool command line parsing
[NUTCH-1912] - Dump tool -mimetype parameter needs to be optional to prevent NPE
[NUTCH-1916] - Apache Nutch CXF-based REST services
[NUTCH-1918] - TikaParser specifies a default namespace when generating DOM
[NUTCH-1919] - Getting timeout when server returns Content-Length: 0
[NUTCH-1921] - Optionally disable HTTP if-modified-since header
[NUTCH-1937] - Error: Could not find or load main class bin.crawl
[NUTCH-1939] - Fetcher fails to follow redirects
[NUTCH-1950] - File name too long when bin/nutch dump
[NUTCH-1954] - FilenameTooLong error appears in CommonCrawlDumper
[NUTCH-1957] - FileDumper output file name collisions
[NUTCH-1963] - CommonCrawlDataDumper is too long (> 100 bytes) when -gzip option invoked
[NUTCH-1964] - tmp directory not cleaned up after using commoncrawldump tool
[NUTCH-1967] - Possible SIooBE in MimeAdaptiveFetchSchedule
[NUTCH-1968] - File Name too long issue of DumpFileUtil.java file
[NUTCH-1974] - keyPrefix option for CommonCrawlDataDumper tool
[NUTCH-1977] - commoncrawldump java heap space
[NUTCH-1978] - solrindex will fail when indexing corrupted segments
[NUTCH-1983] - CommonCrawlDumper and FileDumper don't dump correct JSON
[NUTCH-1991] - Tika mime detection not using Nutch supplied tika-mimetypes.xml for content based detection
[NUTCH-2001] - SubCollection Field Name incorrect in nutch-default.xml

New Feature

[NUTCH-827] - HTTP POST Authentication
[NUTCH-1323] - AjaxNormalizer
[NUTCH-1526] - Create SegmentContentDumperTool for easily extracting out file contents from SegmentDirs
[NUTCH-1660] - Index filter for Page's latitude and longitude
[NUTCH-1693] - TextMD5Signature computed on textual content
[NUTCH-1857] - readb -dump -format csv should use comma
[NUTCH-1927] - Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
[NUTCH-1933] - nutch-selenium plugin
[NUTCH-1941] - Optional rolling http.agent.name's
[NUTCH-1949] - Dump out the Nutch data into the Common Crawl format
[NUTCH-1969] - URL Normalizer properly handling slashes
[NUTCH-1976] - Allow Users to Set Hostname for Server

Improvement

[NUTCH-865] - Format source code in unique style
[NUTCH-881] - Good quality documentation for Nutch
[NUTCH-1062] - Migrate BasicURLNormalizer from Apache ORO to java.util.regex
[NUTCH-1409] - Remove deprecated properties db.{default,max}.fetch.interval, generate.max.per.host.by.ip
[NUTCH-1724] - LinkDBReader to support regex output filtering
[NUTCH-1775] - IndexingFilter: document origin of passed CrawlDatum
[NUTCH-1823] - Upgrade to elasticsearch 1.4.1
[NUTCH-1833] - Include version number within nutch binary usage statement
[NUTCH-1839] - Improve WebGraph CLI parsing
[NUTCH-1853] - Add commented out WebGraph executions to ./bin/crawl
[NUTCH-1867] - CrawlDbReader: use setFloat to pass min score
[NUTCH-1868] - Document and improve CLI for FileDumper tool
[NUTCH-1869] - Add a flag to -mimeType flag to FileDumper
[NUTCH-1875] - Add 'version' field to Solr schema as required by new Solr servers
[NUTCH-1876] - Upgrade to Crawler Commons 0.5
[NUTCH-1883] - bin/crawl: use function to run bin/nutch and check exit value
[NUTCH-1887] - Specify HTMLMapper to use in TikaParser
[NUTCH-1889] - Store all values from Tika metadata in Nutch metadata
[NUTCH-1920] - Upgrade Nutch to use Java 1.7
[NUTCH-1925] - Upgrade Tika to version 1.7
[NUTCH-1928] - Indexing filter of documents by the MIME type
[NUTCH-1959] - Improving CommonCrawlFormat implementations
[NUTCH-1962] - Need to have mimetype-filter.txt file available by default
[NUTCH-1972] - Dockerfile for Nutch 1.x
[NUTCH-1975] - New configuration for CommonCrawlDataDumper tool
[NUTCH-1979] - CrawlDbReader to implement Tool
[NUTCH-1981] - Upgrade icu4j
[NUTCH-1985] - Adding a main() method to the MimeTypeIndexingFilter
[NUTCH-1986] - Clarify Elastic Search Indexer Plugin Settings
[NUTCH-1987] - Make bin/crawl indexer agnostic
[NUTCH-1988] - Make nested output directory dump optional
[NUTCH-1989] - Handling invalid URLs in CommonCrawlDataDumper
[NUTCH-1990] - Use URI.normalize() in BasicURLNormalizer
[NUTCH-1994] - Upgrade to Apache Tika 1.8
[NUTCH-1996] - Make protocol-selenium README part of plugin
[NUTCH-1997] - Add CBOR "magic header" to CommonCrawlDataDumper output
[NUTCH-2136] - Implement a different version of Naive Bayes Parse Filter

Test

[NUTCH-1960] - JUnit test for dump method of CommonCrawlDataDumper

Task

[NUTCH-1837] - Upgrade to Tika 1.6
[NUTCH-1886] - Review and update default.properties
[NUTCH-1955] - ByteWritable missing in NutchWritable
[NUTCH-1956] - Members to be public in URLCrawlDatum
