Release Notes - Nutch - Version 2.2

Sub-task

  • [NUTCH-1094] - create comprehensive documentation for Nutchgora branch
  • [NUTCH-1273] - Fix [deprecation] javac warnings
  • [NUTCH-1274] - Fix [cast] javac warnings
  • [NUTCH-1275] - Fix [unchecked] javac warnings
  • [NUTCH-1277] - Fix [fallthrough] javac warnings

Bug

  • [NUTCH-342] - Nutch commands log to nutch/logs/hadoop.logs by default
  • [NUTCH-706] - Url regex normalizer: default pattern for session id removal not to match "newsId"
  • [NUTCH-829] - duplicate hadoop temp files
  • [NUTCH-891] - Nutch build should not depend on unversioned local deps
  • [NUTCH-956] - solrindex issues
  • [NUTCH-1042] - fetcher.max.crawl.delay property not taken into account correctly when set to -1
  • [NUTCH-1344] - BasicURLNormalizer to normalize https same as http
  • [NUTCH-1390] - readdb -url $url throws NPE with gora-cassandra
  • [NUTCH-1393] - Display consistent usage of GeneratorJob with 1.X
  • [NUTCH-1418] - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
  • [NUTCH-1447] - Nutch 2.x with Cloudera CDH 4 get Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
  • [NUTCH-1455] - RobotRulesParser to match multi-word user-agent names
  • [NUTCH-1479] - nutch readhostdb and updatehostdb do not work with MySQL
  • [NUTCH-1484] - TableUtil unreverseURL fails on file:// URLs
  • [NUTCH-1491] - UTF-8 non-character codepoints in title
  • [NUTCH-1493] - Error adding field 'contentLength'='' during solrindex using index-more
  • [NUTCH-1496] - ParserJob logs skipped urls with level info
  • [NUTCH-1503] - Configuration properties not in sync between FetcherReducer and nutch-default.xml
  • [NUTCH-1516] - Nutch 2.x pom.xml out of sync with ivy.xml
  • [NUTCH-1523] - Upgrade solr-solrj dependency to 4.1.0
  • [NUTCH-1532] - Replace 'segment' mapping field with batchId
  • [NUTCH-1533] - Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and setBatchId() accessors in o.a.n.storage.WebPage
  • [NUTCH-1536] - Ant build file has hardcoded conf dir location
  • [NUTCH-1540] - Add Gora buffered read and write maximum limits to nutch-default.xml configuration.
  • [NUTCH-1542] - adddays param for generator not present in 2.x
  • [NUTCH-1547] - BasicIndexingFilter - Problem to index full title
  • [NUTCH-1551] - Improve WebTableReader field order and display batchId
  • [NUTCH-1554] - org.apache.nutch.net.protocols.HttpDateFormat should NOT be Locale.US aware
  • [NUTCH-1563] - FetchSchedule#getFields is never used by GeneratorJob
  • [NUTCH-1565] - Proper downloads page for Nutch
  • [NUTCH-1576] - Need to keep hotStore.flush() exception catching
  • [NUTCH-1681] - In URLUtil.java, toUNICODE method does not work correctly

New Feature

  • [NUTCH-427] - protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implementation.
  • [NUTCH-1038] - Port IndexingFiltersChecker to 2.0
  • [NUTCH-1284] - Add site fetcher.max.crawl.delay as log output by default.
  • [NUTCH-1518] - session cookies support

Improvement

  • [NUTCH-213] - checkstyle
  • [NUTCH-346] - Improve readability of logs/hadoop.log
  • [NUTCH-431] - Move plugin specific properties out of nutch-site.xml and into specific conf files for plugins
  • [NUTCH-449] - Format of junit output should be configurable
  • [NUTCH-789] - Improvements to Tika parser
  • [NUTCH-842] - AutoGenerate WebPage code
  • [NUTCH-1249] - Resolve all issues flagged up by adding javac -Xlint argument
  • [NUTCH-1369] - Improve ParserChecker in Nutchgora
  • [NUTCH-1370] - Expose exact number of urls injected @runtime
  • [NUTCH-1389] - parsechecker and indexchecker to report truncated content
  • [NUTCH-1394] - backport NUTCH-1232 Remove site field from index-basic
  • [NUTCH-1419] - parsechecker and indexchecker to report protocol status
  • [NUTCH-1421] - RegexURLNormalizer to only skip rules with invalid patterns
  • [NUTCH-1433] - Upgrade to Tika 1.2
  • [NUTCH-1451] - Upgrade automaton jar to 1.11-8
  • [NUTCH-1471] - make explicit which datastore urls are injected to
  • [NUTCH-1488] - bin/nutch to run junit from any directory
  • [NUTCH-1501] - Harmonize behavior of parsechecker and indexchecker
  • [NUTCH-1510] - Upgrade to Hadoop 1.1.1
  • [NUTCH-1514] - Phase out the deprecated configuration properties (if possible)
  • [NUTCH-1550] - xercesImpl and xmlParserAPIs (org.apache.xml) packages and classes only used in three Nutch classes
  • [NUTCH-1569] - Upgrade 2.x to Gora 0.3
  • [NUTCH-1573] - Upgrade to most recent JUnit 4.x to improve test flexibility
  • [NUTCH-1575] - support solr authentication in nutch 2.x
  • [NUTCH-1577] - Add target for creating eclipse project

Test

  • [NUTCH-1453] - Substantiate tests for IndexingFilters

Task

  • [NUTCH-1031] - Delegate parsing of robots.txt to crawler-commons
  • [NUTCH-1087] - Deprecate crawl command and replace with example script
  • [NUTCH-1545] - capture batchId and remove references to segments in 2.x crawl script.
