Sub-task
- [NUTCH-1094] - create comprehensive documentation for Nutchgora branch
- [NUTCH-1273] - Fix [deprecation] javac warnings
- [NUTCH-1274] - Fix [cast] javac warnings
- [NUTCH-1275] - Fix [unchecked] javac warnings
- [NUTCH-1277] - Fix [fallthrough] javac warnings
Bug
- [NUTCH-342] - Nutch commands log to nutch/logs/hadoop.logs by default
- [NUTCH-706] - Url regex normalizer: default pattern for session id removal not to match "newsId"
- [NUTCH-829] - duplicate hadoop temp files
- [NUTCH-891] - Nutch build should not depend on unversioned local deps
- [NUTCH-956] - solrindex issues
- [NUTCH-1042] - fetcher.max.crawl.delay property not taken into account correctly when set to -1
- [NUTCH-1344] - BasicURLNormalizer to normalize https same as http
- [NUTCH-1390] - readdb -url $url throws NPE with gora-cassandra
- [NUTCH-1393] - Display consistent usage of GeneratorJob with 1.X
- [NUTCH-1418] - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
- [NUTCH-1447] - Nutch 2.x with Cloudera CDH 4 get Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
- [NUTCH-1455] - RobotRulesParser to match multi-word user-agent names
- [NUTCH-1479] - nutch readhostdb and updatehostdb do not work with MySQL
- [NUTCH-1484] - TableUtil unreverseURL fails on file:// URLs
- [NUTCH-1491] - UTF-8 non-character codepoints in title
- [NUTCH-1493] - Error adding field 'contentLength'='' during solrindex using index-more
- [NUTCH-1496] - ParserJob logs skipped urls with level info
- [NUTCH-1503] - Configuration properties not in sync between FetcherReducer and nutch-default.xml
- [NUTCH-1516] - Nutch 2.x pom.xml out of sync with ivy.xml
- [NUTCH-1523] - Upgrade solr-solrj dependency to 4.1.0
- [NUTCH-1532] - Replace 'segment' mapping field with batchId
- [NUTCH-1533] - Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and setBatchId() accessors in o.a.n.storage.WebPage
- [NUTCH-1536] - Ant build file has hardcoded conf dir location
- [NUTCH-1540] - Add Gora buffered read and write maximum limits to nutch-default.xml configuration.
- [NUTCH-1542] - adddays param for generator not present in 2.x
- [NUTCH-1547] - BasicIndexingFilter - Problem to index full title
- [NUTCH-1551] - Improve WebTableReader field order and display batchId
- [NUTCH-1554] - org.apache.nutch.net.protocols.HttpDateFormat should NOT be Locale.US aware
- [NUTCH-1563] - FetchSchedule#getFields is never used by GeneratorJob
- [NUTCH-1565] - Proper downloads page for Nutch
- [NUTCH-1576] - Need to keep hotStore.flush() exception catching
- [NUTCH-1681] - In URLUtil.java, toUNICODE method does not work correctly
New Feature
- [NUTCH-427] - protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implementation.
- [NUTCH-1038] - Port IndexingFiltersChecker to 2.0
- [NUTCH-1284] - Add site fetcher.max.crawl.delay as log output by default.
- [NUTCH-1518] - session cookies support
Improvement
- [NUTCH-213] - checkstyle
- [NUTCH-346] - Improve readability of logs/hadoop.log
- [NUTCH-431] - Move plugin specific properties out of nutch-site.xml and into specific conf files for plugins
- [NUTCH-449] - Format of junit output should be configurable
- [NUTCH-789] - Improvements to Tika parser
- [NUTCH-842] - AutoGenerate WebPage code
- [NUTCH-1249] - Resolve all issues flagged up by adding javac -Xlint argument
- [NUTCH-1369] - Improve ParserChecker in Nutchgora
- [NUTCH-1370] - Expose exact number of urls injected @runtime
- [NUTCH-1389] - parsechecker and indexchecker to report truncated content
- [NUTCH-1394] - backport NUTCH-1232 Remove site field from index-basic
- [NUTCH-1419] - parsechecker and indexchecker to report protocol status
- [NUTCH-1421] - RegexURLNormalizer to only skip rules with invalid patterns
- [NUTCH-1433] - Upgrade to Tika 1.2
- [NUTCH-1451] - Upgrade automaton jar to 1.11-8
- [NUTCH-1471] - make explicit which datastore urls are injected to
- [NUTCH-1488] - bin/nutch to run junit from any directory
- [NUTCH-1501] - Harmonize behavior of parsechecker and indexchecker
- [NUTCH-1510] - Upgrade to Hadoop 1.1.1
- [NUTCH-1514] - Phase out the deprecated configuration properties (if possible)
- [NUTCH-1550] - xercesImpl and xmlParserAPIs (org.apache.xml) packages and classes only used in three Nutch classes
- [NUTCH-1569] - Upgrade 2.x to Gora 0.3
- [NUTCH-1573] - Upgrade to most recent JUnit 4.x to improve test flexibility
- [NUTCH-1575] - support solr authentication in nutch 2.x
- [NUTCH-1577] - Add target for creating eclipse project
Test
- [NUTCH-1453] - Substantiate tests for IndexingFilters
Task
- [NUTCH-1031] - Delegate parsing of robots.txt to crawler-commons
- [NUTCH-1087] - Deprecate crawl command and replace with example script
- [NUTCH-1545] - capture batchId and remove references to segments in 2.x crawl script.