Sub-task
- [NUTCH-1118] - JUnit test for index-basic
- [NUTCH-1119] - JUnit test for index-static
- [NUTCH-1127] - JUnit test for urlfilter-validator
- [NUTCH-1273] - Fix [deprecation] javac warnings
- [NUTCH-1274] - Fix [cast] javac warnings
- [NUTCH-1275] - Fix [unchecked] javac warnings
- [NUTCH-1277] - Fix [fallthrough] javac warnings
Bug
- [NUTCH-342] - Nutch commands log to nutch/logs/hadoop.logs by default
- [NUTCH-802] - Problems managing outlinks with large url length
- [NUTCH-813] - Repetitive crawl 403 status page
- [NUTCH-829] - duplicate hadoop temp files
- [NUTCH-956] - solrindex issues
- [NUTCH-1039] - Fetcher fails for pages without content-length header
- [NUTCH-1042] - Fetcher.max.crawl.delay property not taken into account correctly when set to -1
- [NUTCH-1053] - Parsing of RSS feeds fails
- [NUTCH-1245] - URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
- [NUTCH-1334] - NPE in FetcherOutputFormat
- [NUTCH-1418] - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
- [NUTCH-1455] - RobotRulesParser to match multi-word user-agent names
- [NUTCH-1475] - Index-More Plugin -- A better fall back value for date field
- [NUTCH-1494] - RSS feed plugin seems broken
- [NUTCH-1500] - bin/crawl fails on step solrindex with wrong path to segment
- [NUTCH-1509] - Implement read/write in NutchField
- [NUTCH-1523] - Upgrade solr-solrj dependency to 4.1.0
- [NUTCH-1527] - Port nutch-elasticsearch-indexer to Nutch
- [NUTCH-1536] - Ant build file has hardcoded conf dir location
- [NUTCH-1547] - BasicIndexingFilter - Problem to index full title
- [NUTCH-1554] - org.apache.nutch.net.protocols.HttpDateFormat should NOT be Locale.US aware
- [NUTCH-1565] - Proper downloads page for Nutch
- [NUTCH-1658] - Nutch mangles seed URLs and then reports on the mangled ones
- [NUTCH-1744] - FTP Issue when entering Passive mode
New Feature
- [NUTCH-427] - protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implementation.
- [NUTCH-737] - urlnormalizer-unalias plugin
- [NUTCH-1047] - Pluggable indexing backends
- [NUTCH-1284] - Add site fetcher.max.crawl.delay as log output by default.
- [NUTCH-1331] - limit crawler to defined depth
- [NUTCH-1499] - Usage of multiple ipv4 addresses and network cards on fetcher machines
Improvement
- [NUTCH-213] - checkstyle
- [NUTCH-346] - Improve readability of logs/hadoop.log
- [NUTCH-431] - Move plugin specific properties out of nutch-site.xml and into specific conf files for plugins
- [NUTCH-449] - Format of junit output should be configurable
- [NUTCH-789] - Improvements to Tika parser
- [NUTCH-1183] - Summary task for adding command line usage instructions to webgraph classes
- [NUTCH-1249] - Resolve all issues flagged up by adding javac -Xlint argument
- [NUTCH-1389] - parsechecker and indexchecker to report truncated content
- [NUTCH-1419] - parsechecker and indexchecker to report protocol status
- [NUTCH-1420] - Get rid of the dreaded �
- [NUTCH-1506] - Add UPDATE action to NutchIndexAction
- [NUTCH-1507] - Remove FetcherOutput
- [NUTCH-1510] - Upgrade to Hadoop 1.1.1
- [NUTCH-1514] - Phase out the deprecated configuration properties (if possible)
- [NUTCH-1550] - xercesImpl and xmlParserAPIs (org.apache.xml) packages and classes only used in three Nutch classes
- [NUTCH-1560] - index-metadata to add all values of multivalued metadata
- [NUTCH-1573] - Upgrade to most recent JUnit 4.x to improve test flexibility
- [NUTCH-1577] - Add target for creating eclipse project
- [NUTCH-1583] - Headings does not support multiValued headings
- [NUTCH-1585] - Ensure duplicate tags do not exist in microformat-reltag tag set.
Test
- [NUTCH-1453] - Substantiate tests for IndexingFilters
Task
- [NUTCH-1031] - Delegate parsing of robots.txt to crawler-commons
- [NUTCH-1522] - Upgrade to Tika 1.3
- [NUTCH-1578] - Upgrade to Hadoop 1.2.0