Release Notes - Nutch - Version 2.3

Sub-task

  • [NUTCH-1124] - JUnit test for scoring-opic
  • [NUTCH-1125] - JUnit test for tld
  • [NUTCH-1164] - Write JUnit tests for protocol-http
  • [NUTCH-1170] - Write JUnit tests for urlfilter-validator
  • [NUTCH-1655] - Indexer Plugin for Elastic Search
  • [NUTCH-1878] - urlnormalizer-regex to keep third slash in file:///path/index.html
  • [NUTCH-1879] - Regex URL normalizer should remove multiple slashes after file: protocol
  • [NUTCH-1880] - URLUtil should not add additional slashes for file URLs
  • [NUTCH-1885] - Protocol-file should treat symbolic links as redirects

Bug

  • [NUTCH-356] - Plugin repository cache can lead to memory leak
  • [NUTCH-385] - Improve description of thread related configuration for Fetcher
  • [NUTCH-797] - URL not properly constructed when link target begins with a "?"
  • [NUTCH-911] - recrawls file protocol causes Errors/Exceptions when actually not modified or gone
  • [NUTCH-970] - Injector job crashes with MySQL with table collation set to utf8_general_ci
  • [NUTCH-992] - SolrDedup is broken in 2.x
  • [NUTCH-1182] - fetcher to log hung threads
  • [NUTCH-1253] - Incompatible neko and xerces versions
  • [NUTCH-1329] - parser does not extract outlinks to external web sites
  • [NUTCH-1410] - impact of a map-reduce problem
  • [NUTCH-1473] - Column length too big for column 'text' (max = 21845); use BLOB or TEXT instead
  • [NUTCH-1481] - When using MySQL as storage unicode characters within URLS cause nutch to fail
  • [NUTCH-1483] - Can't crawl filesystem with protocol-file plugin
  • [NUTCH-1490] - Data Truncation exceptions when using mysql
  • [NUTCH-1549] - Fix deprecated use of Tika MimeType API in o.a.n.util.MimeUtil
  • [NUTCH-1562] - Order of execution for scoring filters
  • [NUTCH-1566] - bin/nutch to allow whitespace in paths
  • [NUTCH-1579] - NPE when using solr indexing
  • [NUTCH-1587] - misspelled property "threshold" in conf/log4j.properties
  • [NUTCH-1588] - Port NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again to 2.x
  • [NUTCH-1603] - ZIP parser complains about truncated PDF file
  • [NUTCH-1604] - ProtocolFactory not thread-safe
  • [NUTCH-1605] - mime type detector recognizes xlsx as zip file
  • [NUTCH-1610] - Can't run individual unit tests for plugins in nutch 2.x
  • [NUTCH-1613] - Timeouts in protocol-httpclient when crawling same host with >2 threads and added cookie strings for both http protocols
  • [NUTCH-1618] - Turn speculative execution off for Fetching
  • [NUTCH-1621] - Deprecated class o.a.n.crawl.Crawler is still in code base
  • [NUTCH-1624] - Typo in WebTableReader line 486
  • [NUTCH-1633] - slf4j is provided by hadoop and should not be included in the job file.
  • [NUTCH-1634] - readdb -stats show the result twice
  • [NUTCH-1650] - Adaptive Fetch Scheduler interval Wrong Set
  • [NUTCH-1651] - modifiedTime and prevmodifiedTime never set
  • [NUTCH-1657] - ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION never set in HTMLParser
  • [NUTCH-1667] - Updatedb always ignore batchId
  • [NUTCH-1671] - indexchecker to add digest field
  • [NUTCH-1672] - Inlinks are added twice in DbUpdateReducer
  • [NUTCH-1673] - Title isn't reset in MoreIndexingFilter
  • [NUTCH-1677] - ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION are not set in Parse HTML
  • [NUTCH-1685] - URLUtil.toUNICODE fails on IDNs
  • [NUTCH-1699] - Tika Parser - Image Parse Bug
  • [NUTCH-1708] - use same id when indexing and deleting redirects
  • [NUTCH-1715] - RobotRulesParser adds additional '*' to the robots name
  • [NUTCH-1716] - RobotRulesParser adds extra '*' to the robots name
  • [NUTCH-1718] - redefine http.robots.agent as "additional agent names"
  • [NUTCH-1719] - DomainStatistics fails in 2.x because URL is not unreversed
  • [NUTCH-1720] - Duplicate lines in HttpBase.java
  • [NUTCH-1725] - CleaningJob's reducer does not commit deleted docs.
  • [NUTCH-1727] - Configurable length for Tlds
  • [NUTCH-1728] - indexer-solr plugin does not delete docs from solr
  • [NUTCH-1733] - parse-html to support HTML5 charset definitions
  • [NUTCH-1736] - Can't fetch page if http response header contains Transfer-Encoding:chunked
  • [NUTCH-1738] - Expose number of URLs generated per batch in GeneratorJob
  • [NUTCH-1751] - Empty anchors should not index
  • [NUTCH-1752] - cache robots.txt rules per protocol:host:port
  • [NUTCH-1753] - Eclipse dependency problem for 2.x
  • [NUTCH-1755] - Project name bug in build.xml
  • [NUTCH-1759] - Upgrade to Crawler Commons 0.4
  • [NUTCH-1761] - Crawl script fails to find job file if not started from inside bin dir
  • [NUTCH-1767] - remove special treatment of "params" in relative links
  • [NUTCH-1773] - Solr Indexer fails
  • [NUTCH-1774] - Crawling from REST API giving NullPointerException
  • [NUTCH-1776] - Log incorrect plugin.folder file path
  • [NUTCH-1778] - Generator not logging number of URLs in batch correctly
  • [NUTCH-1780] - ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file
  • [NUTCH-1784] - modifiedTime and prevmodifiedTime never set
  • [NUTCH-1788] - Tika may return multiple values for Title on PDF's
  • [NUTCH-1796] - Ensure Gora object builders are used as opposed to empty constructors.
  • [NUTCH-1798] - Crawl script not calling index command correctly
  • [NUTCH-1811] - bin/nutch junit to use junit 4 test runner
  • [NUTCH-1819] - Check for batchId input in GeneratorJob#run
  • [NUTCH-1820] - remove field "orig" which duplicates "id"
  • [NUTCH-1825] - protocol-http may hang for certain web pages
  • [NUTCH-1828] - bin/crawl : incorrect handling of nutch errors
  • [NUTCH-1829] - Generator : unable to distinguish real errors
  • [NUTCH-1832] - Make Nutch work without an indexer
  • [NUTCH-1834] - GeneratorMapper behavior depends on log level
  • [NUTCH-1845] - Nutch cannot save inlinks
  • [NUTCH-1848] - Bug in DashboardPage.html instances counter
  • [NUTCH-1865] - Enable use of SNAPSHOT's with Nutch Ivy dependency management
  • [NUTCH-1866] - ant eclipse target should not delete runtime
  • [NUTCH-1877] - Suffix URL filter to ignore query string by default
  • [NUTCH-1882] - ant eclipse target to add output path to src/test
  • [NUTCH-1891] - Can't run nutch2.3-snapshot on hadoop2.4.0 using gora0.5 and mongodb as backend datastore
  • [NUTCH-1899] - upgrade restlet lib to prevent build failure
  • [NUTCH-1903] - Resolve-default failed with branch 2.x
  • [NUTCH-1907] - Incorrect output of Outlinks to Hosts within HostDbUpdateReducer

New Feature

  • [NUTCH-929] - Create a REST-based admin UI for Nutch
  • [NUTCH-1360] - Support the storing of IP address connected to when web crawling
  • [NUTCH-1590] - [SECURITY] Frame injection vulnerability in published Javadoc
  • [NUTCH-1693] - TextMD5Signature computed on textual content
  • [NUTCH-1856] - Document webpage.avsc and host.avsc

Improvement

  • [NUTCH-841] - Create a Wicket-based Web Application for Nutch
  • [NUTCH-945] - Indexing to multiple SOLR Servers
  • [NUTCH-1294] - IndexClean job with solr implementation.
  • [NUTCH-1409] - Remove deprecated properties db.{default,max}.fetch.interval, generate.max.per.host.by.ip
  • [NUTCH-1413] - Record response time
  • [NUTCH-1478] - Parse-metatags and index-metadata plugin for Nutch 2.x series
  • [NUTCH-1497] - Better default gora-sql-mapping.xml with larger field sizes for MySQL
  • [NUTCH-1513] - Support Robots.txt for Ftp urls
  • [NUTCH-1556] - enabling updatedb to accept batchId
  • [NUTCH-1568] - port pluggable indexing architecture to 2.x
  • [NUTCH-1595] - Upgrade to Tika 1.4
  • [NUTCH-1599] - Obtain consensus on new description of Nutch
  • [NUTCH-1619] - Writes Dmoz Description and Title information to db with snippet argument
  • [NUTCH-1629] - there is no need to fail on empty lines in seed file when injecting.
  • [NUTCH-1631] - Display Document Count Added To Solr Server
  • [NUTCH-1632] - add batchId argument for DbUpdaterJob
  • [NUTCH-1641] - Log timings for main jobs
  • [NUTCH-1674] - Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index
  • [NUTCH-1710] - Add gora package logging to log4j.properties
  • [NUTCH-1714] - Nutch 2.x upgrade to Gora 0.4
  • [NUTCH-1721] - Upgrade to Crawler commons 0.3
  • [NUTCH-1731] - Better cmd line parsing for NutchServer
  • [NUTCH-1743] - parsechecker to show outlinks
  • [NUTCH-1768] - Upgrade to ElasticSearch 1.1.0
  • [NUTCH-1769] - REST API refactoring
  • [NUTCH-1781] - Update gora-*-mapping.xml and gora.properties to reflect Gora 0.4
  • [NUTCH-1782] - NodeWalker to return current node
  • [NUTCH-1787] - update and complete API doc overview page
  • [NUTCH-1797] - remove unused package o.a.n.html
  • [NUTCH-1823] - Upgrade to elasticsearch 1.4.1
  • [NUTCH-1827] - Port NUTCH-1467 and NUTCH-1561 to 2.x
  • [NUTCH-1843] - Upgrade to Gora 0.5
  • [NUTCH-1851] - Add/Update wiki pages for NutchServer and WebApp
  • [NUTCH-1876] - Upgrade to Crawler Commons 0.5
  • [NUTCH-1883] - bin/crawl: use function to run bin/nutch and check exit value
  • [NUTCH-1888] - Specify HTMLMapper to use in TikaParser

Test

  • [NUTCH-1645] - Junit Test Case for Adaptive Fetch Schedule class

Task

  • [NUTCH-1696] - Enable use of (Gora) SNAPSHOT dependencies
  • [NUTCH-1700] - Remove deprecated code in src/plugin/creativecommons/build.xml
  • [NUTCH-1779] - Apply formatting to the code
  • [NUTCH-1789] - Migrate Nutch site to Apache CMS
  • [NUTCH-1817] - Remove pom.xml from source
  • [NUTCH-1837] - Upgrade to Tika 1.6
  • [NUTCH-1859] - Make Nutch webapp port configurable
