Release Notes - Nutch - Version 1.7 - HTML format

Bug

  • [NUTCH-342] - Nutch commands log to nutch/logs/hadoop.log by default
  • [NUTCH-802] - Problems managing outlinks with large url length
  • [NUTCH-813] - Repetitive crawl 403 status page
  • [NUTCH-829] - duplicate hadoop temp files
  • [NUTCH-956] - solrindex issues
  • [NUTCH-1039] - Fetcher fails for pages without content-length header
  • [NUTCH-1042] - Fetcher.max.crawl.delay property not taken into account correctly when set to -1
  • [NUTCH-1053] - Parsing of RSS feeds fails
  • [NUTCH-1245] - URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
  • [NUTCH-1334] - NPE in FetcherOutputFormat
  • [NUTCH-1418] - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
  • [NUTCH-1455] - RobotRulesParser to match multi-word user-agent names
  • [NUTCH-1475] - Index-More Plugin -- A better fall back value for date field
  • [NUTCH-1494] - RSS feed plugin seems broken
  • [NUTCH-1500] - bin/crawl fails on step solrindex with wrong path to segment
  • [NUTCH-1509] - Implement read/write in NutchField
  • [NUTCH-1523] - Upgrade solr-solrj dependency to 4.1.0
  • [NUTCH-1527] - Port nutch-elasticsearch-indexer to Nutch
  • [NUTCH-1536] - Ant build file has hardcoded conf dir location
  • [NUTCH-1547] - BasicIndexingFilter - Problem to index full title
  • [NUTCH-1554] - org.apache.nutch.net.protocols.HttpDateFormat should NOT be Locale.US aware
  • [NUTCH-1565] - Proper downloads page for Nutch
  • [NUTCH-1658] - Nutch mangles seed URLs and then reports on the mangled ones
  • [NUTCH-1744] - FTP Issue when entering Passive mode

New Feature

  • [NUTCH-427] - protocol-smb: protocol plugin implementing the CIFS/SMB protocol. This plugin allows Nutch to crawl Microsoft Windows shares remotely using a CIFS/SMB protocol implementation.
  • [NUTCH-737] - urlnormalizer-unalias plugin
  • [NUTCH-1047] - Pluggable indexing backends
  • [NUTCH-1284] - Add site fetcher.max.crawl.delay as log output by default.
  • [NUTCH-1331] - limit crawler to defined depth
  • [NUTCH-1499] - Usage of multiple ipv4 addresses and network cards on fetcher machines

Improvement

  • [NUTCH-213] - checkstyle
  • [NUTCH-346] - Improve readability of logs/hadoop.log
  • [NUTCH-431] - Move plugin specific properties out of nutch-site.xml and into specific conf files for plugins
  • [NUTCH-449] - Format of junit output should be configurable
  • [NUTCH-789] - Improvements to Tika parser
  • [NUTCH-1183] - Summary task for adding command line usage instructions to webgraph classes
  • [NUTCH-1249] - Resolve all issues flagged up by adding javac -Xlint argument
  • [NUTCH-1389] - parsechecker and indexchecker to report truncated content
  • [NUTCH-1419] - parsechecker and indexchecker to report protocol status
  • [NUTCH-1420] - Get rid of the dreaded �
  • [NUTCH-1506] - Add UPDATE action to NutchIndexAction
  • [NUTCH-1507] - Remove FetcherOutput
  • [NUTCH-1510] - Upgrade to Hadoop 1.1.1
  • [NUTCH-1514] - Phase out the deprecated configuration properties (if possible)
  • [NUTCH-1550] - xercesImpl and xmlParserAPIs (org.apache.xml) packages and classes only used in three Nutch classes
  • [NUTCH-1560] - index-metadata to add all values of multivalued metadata
  • [NUTCH-1573] - Upgrade to most recent JUnit 4.x to improve test flexibility
  • [NUTCH-1577] - Add target for creating eclipse project
  • [NUTCH-1583] - Headings does not support multiValued headings
  • [NUTCH-1585] - Ensure duplicate tags do not exist in microformat-reltag tag set.

Test

  • [NUTCH-1453] - Substantiate tests for IndexingFilters
