Release Notes - Nutch - Version 2.3 - HTML format

Sub-task

[NUTCH-1124] - JUnit test for scoring-opic
[NUTCH-1125] - JUnit test for tld
[NUTCH-1164] - Write JUnit tests for protocol-http
[NUTCH-1170] - Write JUnit tests for urlfilter-validator
[NUTCH-1655] - Indexer Plugin for Elastic Search
[NUTCH-1878] - urlnormalizer-regex to keep third slash in file:///path/index.html
[NUTCH-1879] - Regex URL normalizer should remove multiple slashes after file: protocol
[NUTCH-1880] - URLUtil should not add additional slashes for file URLs
[NUTCH-1885] - Protocol-file should treat symbolic links as redirects

Bug

[NUTCH-356] - Plugin repository cache can lead to memory leak
[NUTCH-385] - Improve description of thread related configuration for Fetcher
[NUTCH-797] - URL not properly constructed when link target begins with a "?"
[NUTCH-911] - recrawls file protocol causes Errors/Exceptions when actually not modified or gone
[NUTCH-970] - Injector job crashes with MySQL with table collation set to utf8_general_ci
[NUTCH-992] - SolrDedup is broken in 2.x
[NUTCH-1182] - fetcher to log hung threads
[NUTCH-1253] - Incompatible neko and xerces versions
[NUTCH-1329] - parser does not extract outlinks to external web sites
[NUTCH-1410] - impact of a map-reduce problem
[NUTCH-1473] - Column length too big for column 'text' (max = 21845); use BLOB or TEXT instead
[NUTCH-1481] - When using MySQL as storage unicode characters within URLS cause nutch to fail
[NUTCH-1483] - Can't crawl filesystem with protocol-file plugin
[NUTCH-1490] - Data Truncation exceptions when using mysql
[NUTCH-1549] - Fix deprecated use of Tika MimeType API in o.a.n.util.MimeUtil
[NUTCH-1562] - Order of execution for scoring filters
[NUTCH-1566] - bin/nutch to allow whitespace in paths
[NUTCH-1579] - NPE when using solr indexing
[NUTCH-1587] - misspelled property "threshold" in conf/log4j.properties
[NUTCH-1588] - Port NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again to 2.x
[NUTCH-1603] - ZIP parser complains about truncated PDF file
[NUTCH-1604] - ProtocolFactory not thread-safe
[NUTCH-1605] - mime type detector recognizes xlsx as zip file
[NUTCH-1610] - Can't run individual unit tests for plugins in nutch 2.x
[NUTCH-1613] - Timeouts in protocol-httpclient when crawling same host with >2 threads and added cookie strings for both http protocols
[NUTCH-1618] - Turn speculative execution off for Fetching
[NUTCH-1621] - Deprecated class o.a.n.crawl.Crawler is still in code base
[NUTCH-1624] - Typo in WebTableReader line 486
[NUTCH-1633] - slf4j is provided by hadoop and should not be included in the job file.
[NUTCH-1634] - readdb -stats show the result twice
[NUTCH-1650] - Adaptive Fetch Scheduler interval Wrong Set
[NUTCH-1651] - modifiedTime and prevmodifiedTime never set
[NUTCH-1657] - ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION never set in HTMLParser
[NUTCH-1667] - Updatedb always ignore batchId
[NUTCH-1671] - indexchecker to add digest field
[NUTCH-1672] - Inlinks are added twice in DbUpdateReducer
[NUTCH-1673] - Title isn't reset in MoreIndexingFilter
[NUTCH-1677] - ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION are not set in Parse HTML
[NUTCH-1685] - URLUtil.toUNICODE fails on IDNs
[NUTCH-1699] - Tika Parser - Image Parse Bug
[NUTCH-1708] - use same id when indexing and deleting redirects
[NUTCH-1715] - RobotRulesParser adds additional '*' to the robots name
[NUTCH-1716] - RobotRulesParser adds extra '*' to the robots name
[NUTCH-1718] - redefine http.robots.agent as "additional agent names"
[NUTCH-1719] - DomainStatistics fails in 2.x because URL is not unreversed
[NUTCH-1720] - Duplicate lines in HttpBase.java
[NUTCH-1725] - CleaningJob's reducer does not commit deleted docs.
[NUTCH-1727] - Configurable length for Tlds
[NUTCH-1728] - indexer-solr plugin does not delete docs from solr
[NUTCH-1733] - parse-html to support HTML5 charset definitions
[NUTCH-1736] - Can't fetch page if http response header contains Transfer-Encoding: chunked
[NUTCH-1738] - Expose number of URLs generated per batch in GeneratorJob
[NUTCH-1751] - Empty anchors should not index
[NUTCH-1752] - cache robots.txt rules per protocol:host:port
[NUTCH-1753] - Eclipse dependency problem for 2.x
[NUTCH-1755] - Project name bug in build.xml
[NUTCH-1759] - Upgrade to Crawler Commons 0.4
[NUTCH-1761] - Crawl script fails to find job file if not started from inside bin dir
[NUTCH-1767] - remove special treatment of "params" in relative links
[NUTCH-1773] - Solr Indexer fails
[NUTCH-1774] - Crawling from REST API giving NullPointerException
[NUTCH-1776] - Log incorrect plugin.folder file path
[NUTCH-1778] - Generator not logging number of URLs in batch correctly
[NUTCH-1780] - ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file
[NUTCH-1784] - modifiedTime and prevmodifiedTime never set
[NUTCH-1788] - Tika may return multiple values for Title on PDF's
[NUTCH-1796] - Ensure Gora object builders are used as opposed to empty constructors.
[NUTCH-1798] - Crawl script not calling index command correctly
[NUTCH-1811] - bin/nutch junit to use junit 4 test runner
[NUTCH-1819] - Check for batchId input in GeneratorJob#run
[NUTCH-1820] - remove field "orig" which duplicates "id"
[NUTCH-1825] - protocol-http may hang for certain web pages
[NUTCH-1828] - bin/crawl : incorrect handling of nutch errors
[NUTCH-1829] - Generator : unable to distinguish real errors
[NUTCH-1832] - Make Nutch work without an indexer
[NUTCH-1834] - GeneratorMapper behavior depends on log level
[NUTCH-1845] - Nutch cannot save inlinks
[NUTCH-1848] - Bug in DashboardPage.html instances counter
[NUTCH-1865] - Enable use of SNAPSHOT's with Nutch Ivy dependency management
[NUTCH-1866] - ant eclipse target should not delete runtime
[NUTCH-1877] - Suffix URL filter to ignore query string by default
[NUTCH-1882] - ant eclipse target to add output path to src/test
[NUTCH-1891] - Can't run nutch2.3-snapshot on hadoop2.4.0 using gora0.5 and mongodb as backend datastore
[NUTCH-1899] - upgrade restlet lib to prevent build failure
[NUTCH-1903] - Resolve-default failed with branch 2.x
[NUTCH-1907] - Incorrect output of Outlinks to Hosts within HostDbUpdateReducer

New Feature

[NUTCH-929] - Create a REST-based admin UI for Nutch
[NUTCH-1360] - Support the storing of IP address connected to when web crawling
[NUTCH-1590] - [SECURITY] Frame injection vulnerability in published Javadoc
[NUTCH-1693] - TextMD5Signature computed on textual content
[NUTCH-1856] - Document webpage.avsc and host.avsc

Improvement

[NUTCH-841] - Create a Wicket-based Web Application for Nutch
[NUTCH-945] - Indexing to multiple SOLR Servers
[NUTCH-1294] - IndexClean job with solr implementation.
[NUTCH-1409] - Remove deprecated properties db.{default,max}.fetch.interval, generate.max.per.host.by.ip
[NUTCH-1413] - Record response time
[NUTCH-1478] - Parse-metatags and index-metadata plugin for Nutch 2.x series
[NUTCH-1497] - Better default gora-sql-mapping.xml with larger field sizes for MySQL
[NUTCH-1513] - Support Robots.txt for Ftp urls
[NUTCH-1556] - enabling updatedb to accept batchId
[NUTCH-1568] - port pluggable indexing architecture to 2.x
[NUTCH-1595] - Upgrade to Tika 1.4
[NUTCH-1599] - Obtain consensus on new description of Nutch
[NUTCH-1619] - Writes Dmoz Description and Title information to db with snippet argument
[NUTCH-1629] - there is no need to fail on empty lines in seed file when injecting.
[NUTCH-1631] - Display Document Count Added To Solr Server
[NUTCH-1632] - add batchId argument for DbUpdaterJob
[NUTCH-1641] - Log timings for main jobs
[NUTCH-1674] - Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index
[NUTCH-1710] - Add gora package logging to log4j.properties
[NUTCH-1714] - Nutch 2.x upgrade to Gora 0.4
[NUTCH-1721] - Upgrade to Crawler commons 0.3
[NUTCH-1731] - Better cmd line parsing for NutchServer
[NUTCH-1743] - parsechecker to show outlinks
[NUTCH-1768] - Upgrade to ElasticSearch 1.1.0
[NUTCH-1769] - REST API refactoring
[NUTCH-1781] - Update gora-*-mapping.xml and gora.properties to reflect Gora 0.4
[NUTCH-1782] - NodeWalker to return current node
[NUTCH-1787] - update and complete API doc overview page
[NUTCH-1797] - remove unused package o.a.n.html
[NUTCH-1823] - Upgrade to elasticsearch 1.4.1
[NUTCH-1827] - Port NUTCH-1467 and NUTCH-1561 to 2.x
[NUTCH-1843] - Upgrade to Gora 0.5
[NUTCH-1851] - Add/Update wiki pages for NutchServer and WebApp
[NUTCH-1876] - Upgrade to Crawler Commons 0.5
[NUTCH-1883] - bin/crawl: use function to run bin/nutch and check exit value
[NUTCH-1888] - Specify HTMLMapper to use in TikaParser

Test

[NUTCH-1645] - Junit Test Case for Adaptive Fetch Schedule class

Task

[NUTCH-1696] - Enable use of (Gora) SNAPSHOT dependencies
[NUTCH-1700] - Remove deprecated code in src/plugin/creativecommons/build.xml
[NUTCH-1779] - Apply formatting to the code
[NUTCH-1789] - Migrate Nutch site to Apache CMS
[NUTCH-1817] - Remove pom.xml from source
[NUTCH-1837] - Upgrade to Tika 1.6
[NUTCH-1859] - Make Nutch webapp port configurable
