Release Notes - Nutch - Version 1.15

Sub-task

[NUTCH-1223] - Migrate WebGraph to MapReduce API
[NUTCH-1224] - Migrate FreeGenerator to MapReduce API
[NUTCH-1226] - Migrate CrawlDbReader to MapReduce API
[NUTCH-2152] - CommonCrawl dump via Service endpoint
[NUTCH-2555] - URL normalization problem: path not starting with a '/'
[NUTCH-2556] - protocol-http makes invalid HTTP/1.0 requests
[NUTCH-2557] - protocol-http fails to follow redirections when an HTTP response body is invalid
[NUTCH-2558] - protocol-http cannot handle a missing HTTP status line
[NUTCH-2559] - protocol-http cannot handle colons after the HTTP status code
[NUTCH-2560] - protocol-http throws an error when an http header spans over multiple lines
[NUTCH-2561] - protocol-http can be made to read arbitrarily large HTTP responses
[NUTCH-2562] - protocol-http fails to read large chunked HTTP responses
[NUTCH-2563] - HTTP header spellchecking issues
[NUTCH-2575] - protocol-http does not respect the maximum content-size for chunked responses
[NUTCH-2622] - Unbundle LGPL-licensed jars from binary release
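Several of the protocol-http items above (NUTCH-2561, NUTCH-2575) concern bounding how much of a response body is read. The following is a minimal sketch of that general idea, not the plugin's actual code: the class `CappedReadDemo` and the method `readCapped` are hypothetical names introduced for illustration, and the sketch works on any `InputStream` rather than on Nutch's HTTP response classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Nutch: illustrates a size-capped read.
class CappedReadDemo {

    // Reads at most maxBytes from the stream; anything beyond the cap is left unread.
    static byte[] readCapped(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int remaining = maxBytes;
        while (remaining > 0) {
            // Never request more than the remaining budget.
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n == -1) {
                break; // end of stream reached before the cap
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = new byte[10_000]; // stand-in for a large response body
        byte[] kept = readCapped(new ByteArrayInputStream(body), 1024);
        System.out.println("kept " + kept.length + " of " + body.length + " bytes");
    }
}
```

The cap is applied per read call, so it holds no matter how the underlying stream delivers its data, whether in one piece or in many chunks.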

Bug

[NUTCH-1993] - Nutch does not use backup parsers
[NUTCH-2071] - A parser failure on a single document may fail crawling job if parser.timeout=-1
[NUTCH-2145] - parse/index checker fail to fetch valid percent-encoded URLs
[NUTCH-2161] - Interrupted failed and/or killed tasks fail to clean up temp directories in HDFS
[NUTCH-2273] - Selenium and InteractiveSelenium Do Not Support HTTPS
[NUTCH-2310] - Protocol-Selenium does not support HTTPS protocol
[NUTCH-2321] - Indexing filter checker leaks threads
[NUTCH-2324] - Issue in setting default linkdb path
[NUTCH-2447] - Work-around SSLProtocolException: handshake alert: unrecognized_name
[NUTCH-2454] - REST API fix for usage of hostdb in generator
[NUTCH-2461] - Generate passes the data to when maxCount == 0
[NUTCH-2466] - Sitemap processor to follow redirects
[NUTCH-2467] - Sitemap type field can be null
[NUTCH-2485] - ParserFactory swallows exception
[NUTCH-2486] - Compiler Warning: Unchecked / unsafe operations in MimeTypeIndexingFilter
[NUTCH-2489] - Dependency collision with lucene-analyzers-common in scoring-similarity plugin
[NUTCH-2490] - Sitemap processing: Sitemap index files not working
[NUTCH-2494] - Fetcher: java.lang.IllegalArgumentException: Wrong FS: s3
[NUTCH-2499] - Elastic REST Indexer: Duplicate values
[NUTCH-2505] - nutch does not delete the .locked file, when the generator partition got an exception
[NUTCH-2508] - Misleading documentation about http.proxy.exception.list
[NUTCH-2509] - Inconsistent behavior in SitemapProcessor
[NUTCH-2513] - ant eclipse target fails with "protocol switch unsafe"
[NUTCH-2517] - mergesegs corrupts segment data
[NUTCH-2518] - Must check return value of job.waitForCompletion()
[NUTCH-2520] - Wrong Accept-Charset sent when http.accept.charset is not defined
[NUTCH-2521] - SitemapProcessor to use property sitemap.redir.max
[NUTCH-2523] - UpdateHostDB blocks usage of plugins unintentionally
[NUTCH-2524] - bin/crawl: fix check for HostDb in distributed mode
[NUTCH-2533] - Injector: NullPointerException if seed URL dir contains non-file entries
[NUTCH-2535] - CrawlDbReader -stats: ClassCastException
[NUTCH-2544] - Nutch 1.15 no longer compatible with AWS EMR and S3
[NUTCH-2547] - urlnormalizer-basic fails on special characters in path/query
[NUTCH-2549] - protocol-http does not behave the same as browsers
[NUTCH-2550] - Fetcher fails to follow redirects
[NUTCH-2551] - NullPointerException in generator
[NUTCH-2552] - CrawlDbReader -topN fails
[NUTCH-2553] - Fetcher not to modify URLs to be fetched
[NUTCH-2554] - parserchecker can't fetch some URLs
[NUTCH-2565] - MergeDB incorrectly handles unfetched CrawlDatums
[NUTCH-2568] - Caught exception is immediately rethrown
[NUTCH-2569] - ClassNotFoundException when running in (pseudo-)distributed mode
[NUTCH-2570] - Deduplication job fails to install deduplicated CrawlDb
[NUTCH-2571] - SegmentReader -list fails to read segment
[NUTCH-2572] - HostDb: updatehostdb does not set values
[NUTCH-2574] - Generator: hostCount >= maxCount comparison wrong
[NUTCH-2581] - Caching of redirected robots.txt may overwrite correct robots.txt rules
[NUTCH-2589] - HTML redirections are not followed when using parse-tika
[NUTCH-2590] - SegmentReader -get fails
[NUTCH-2592] - Fetcher to log reason of failed fetches
[NUTCH-2593] - Single mode doesn't work in RabbitMQ indexer
[NUTCH-2597] - NPE in updatehostdb
[NUTCH-2601] - Elasticsearch Rest and Amazon CloudSearch have the same implementation class in indexer-writers.xml
[NUTCH-2607] - ParserChecker should call ScoringFilters.passScoreAfterParsing() on all parses
[NUTCH-2609] - urlnormalizer-basic to normalize path of file: URLs
[NUTCH-2614] - NPE in CrawlDbReader -stats on empty CrawlDb
[NUTCH-2616] - Review routing of deletions by Exchange component
[NUTCH-2618] - protocol-okhttp not to use http.timeout for max duration to fetch document
[NUTCH-2620] - urlfilter-validator incorrectly assumes that top-level domains are not longer than 4 characters
[NUTCH-2624] - protocol-okhttp resource leak
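To illustrate the kind of problem named in NUTCH-2620, the sketch below contrasts a host pattern that caps the top-level domain at four letters with a relaxed one. Neither regular expression is taken from urlfilter-validator; both patterns and the class name `TldLengthDemo` are assumptions made for this example only.

```java
import java.util.regex.Pattern;

// Hypothetical demo, not the plugin's code: shows why a 4-letter TLD cap rejects valid hosts.
class TldLengthDemo {

    // Assumed restrictive pattern: the final label (TLD) may have 2 to 4 letters.
    static final Pattern SHORT_TLD =
        Pattern.compile("^(?:[a-z0-9-]+\\.)+[a-z]{2,4}$");

    // Assumed relaxed pattern: the TLD may have up to 63 letters, the DNS label limit.
    static final Pattern LONG_TLD =
        Pattern.compile("^(?:[a-z0-9-]+\\.)+[a-z]{2,63}$");

    static boolean acceptsShort(String host) {
        return SHORT_TLD.matcher(host).matches();
    }

    static boolean acceptsLong(String host) {
        return LONG_TLD.matcher(host).matches();
    }

    public static void main(String[] args) {
        String[] hosts = {"example.com", "example.museum", "example.technology"};
        for (String host : hosts) {
            System.out.println(host
                + " short=" + acceptsShort(host)
                + " long=" + acceptsLong(host));
        }
    }
}
```

With these assumed patterns, `example.com` passes both checks, while `example.museum` and `example.technology` pass only the relaxed one.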

New Feature

[NUTCH-1129] - Any23 Nutch plugin
[NUTCH-1541] - Indexer plugin to write CSV
[NUTCH-2412] - Exchange component for indexing job
[NUTCH-2492] - Add more configuration parameters to crawl script

Improvement

[NUTCH-1106] - Options to skip url's based on length
[NUTCH-1480] - SolrIndexer to write to multiple servers.
[NUTCH-2012] - Merge parsechecker and indexchecker
[NUTCH-2375] - Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
[NUTCH-2390] - No documentation on pluggable indexing
[NUTCH-2411] - Index-metadata to support indexing multiple values for a field
[NUTCH-2416] - Fetcher to log thread ID
[NUTCH-2432] - Protocol httpclient to disable cookies if http.enable.cookie.header is false
[NUTCH-2441] - ARG_SEGMENT usage
[NUTCH-2491] - Integrate sitemap processing and HostDB into crawl script
[NUTCH-2493] - Add configuration parameter for sitemap processing to crawler script
[NUTCH-2497] - Elastic REST Indexer: Allow multiple hosts
[NUTCH-2502] - Any23 Plugin: Add Content-Type filtering
[NUTCH-2503] - Add option to run tests for a single plugin
[NUTCH-2510] - Crawl script modification. HostDb : generate, optional usage and description
[NUTCH-2516] - Hadoop imports use wildcards
[NUTCH-2519] - Log mapreduce job counters in local mode
[NUTCH-2526] - NPE in scoring-opic when indexing document without CrawlDb datum
[NUTCH-2527] - URL filter: provide rules to exclude localhost and private address spaces
[NUTCH-2530] - Rename property db.max.anchor.length > linkdb.max.anchor.length
[NUTCH-2534] - CrawlDbReader -stats: make score quantiles configurable
[NUTCH-2539] - Not correct naming of db.url.filters and db.url.normalizers in nutch-default.xml
[NUTCH-2543] - readdb & readlinkdb to implement AbstractChecker
[NUTCH-2545] - Upgrade to Any23 2.2
[NUTCH-2566] - Fix exception log messages
[NUTCH-2576] - HTTP protocol plugin based on okhttp
[NUTCH-2577] - protocol-selenium can't handle https
[NUTCH-2578] - Avoid lock by MimeUtil in constructor of protocol.Content
[NUTCH-2579] - Fetcher to use parsed URL to call ProtocolFactory.getProtocol(url)
[NUTCH-2580] - Improvements for Rabbitmq support
[NUTCH-2583] - Upgrading Nutch's dependencies
[NUTCH-2584] - Upgrade parse-tika to use Tika 1.18
[NUTCH-2594] - Documentation for indexer plugins
[NUTCH-2595] - Upgrade crawler-commons dependency to 0.10
[NUTCH-2600] - Refactoring indexer-solr
[NUTCH-2611] - Add line-breaks when parsing HTML block-level elements
[NUTCH-2617] - Disable Exchange component by default
[NUTCH-2619] - protocol-okhttp: allow to keep partially fetched docs as truncated

Task

[NUTCH-1219] - Upgrade all jobs to new MapReduce API
[NUTCH-1228] - Change mapred.task.timeout to mapreduce.task.timeout in fetcher
