Sub-task
- [NUTCH-1802] - Move TestbedProxy to test environment
- [NUTCH-1803] - Put test dependencies in a separate lib dir
- [NUTCH-1804] - Move JUnit dependency to test scope
Bug
- [NUTCH-385] - Improve description of thread related configuration for Fetcher
- [NUTCH-578] - URL fetched with 403 is generated over and over again
- [NUTCH-926] - Redirections from META tag don't get filtered
- [NUTCH-1100] - SolrDedup broken
- [NUTCH-1182] - fetcher to log hung threads
- [NUTCH-1422] - bypass signature comparison when a document is redirected
- [NUTCH-1454] - parsing chm failed
- [NUTCH-1467] - nutch 1.5.1 not able to parse multiValued metatags
- [NUTCH-1521] - CrawlDbFilter passes null url to urlNormalizers
- [NUTCH-1566] - bin/nutch to allow whitespace in paths
- [NUTCH-1603] - ZIP parser complains about truncated PDF file
- [NUTCH-1605] - mime type detector recognizes xlsx as zip file
- [NUTCH-1613] - Timeouts in protocol-httpclient when crawling same host with >2 threads and added cookie strings for both http protocols
- [NUTCH-1633] - slf4j is provided by hadoop and should not be included in the job file
- [NUTCH-1647] - protocol-http throws 'unzipBestEffort returned null' for redirected pages
- [NUTCH-1671] - indexchecker to add digest field
- [NUTCH-1681] - In URLUtil.java, toUNICODE method does not work correctly
- [NUTCH-1708] - use same id when indexing and deleting redirects
- [NUTCH-1718] - redefine http.robots.agent as "additional agent names"
- [NUTCH-1720] - Duplicate lines in HttpBase.java
- [NUTCH-1733] - parse-html to support HTML5 charset definitions
- [NUTCH-1736] - Can't fetch page if http response header contains Transfer-Encoding:chunked
- [NUTCH-1752] - cache robots.txt rules per protocol:host:port
- [NUTCH-1759] - Upgrade to Crawler Commons 0.4
- [NUTCH-1761] - Crawl script fails to find job file if not started from inside bin dir
- [NUTCH-1764] - readdb to show command-line help if no action (-stats, -dump, etc.) given
- [NUTCH-1766] - Generator to unlock crawldb and remove tempdir if generate job fails
- [NUTCH-1767] - remove special treatment of "params" in relative links
- [NUTCH-1776] - Log incorrect plugin.folder file path
- [NUTCH-1786] - CrawlDb should follow db.url.normalizers and db.url.filters
- [NUTCH-1793] - HttpRobotRulesParser not configured properly => "http.robots.403.allow" property is not read
- [NUTCH-1801] - Improve handling of test dependencies in ANT+Ivy
- [NUTCH-1811] - bin/nutch junit to use junit 4 test runner
- [NUTCH-1818] - Add deps-test-compile task for building plugins
New Feature
- [NUTCH-207] - Bandwidth target for fetcher rather than a thread count
- [NUTCH-1327] - QueryStringNormalizer
- [NUTCH-1590] - [SECURITY] Frame injection vulnerability in published Javadoc
- [NUTCH-1871] - Generic xsl parser plugin
Improvement
- [NUTCH-1502] - Test for CrawlDatum state transitions
- [NUTCH-1561] - improve usability of parse-metatags and index-metadata
- [NUTCH-1676] - Add rudimentary SSL support to protocol-http
- [NUTCH-1735] - code dedup fetcher queue redirects
- [NUTCH-1737] - Upgrade to recent JUnit 4.x
- [NUTCH-1745] - Upgrade to ElasticSearch 1.1.0
- [NUTCH-1747] - Use AtomicInteger as semaphore in Fetcher
- [NUTCH-1757] - ParserChecker to take custom metadata as input
- [NUTCH-1758] - IndexChecker to send document to IndexWriters
- [NUTCH-1772] - Injector does not need merging if no pre-existing crawldb
- [NUTCH-1782] - NodeWalker to return current node
- [NUTCH-1787] - update and complete API doc overview page
- [NUTCH-1794] - IndexingFilterChecker to optionally dumpText
- [NUTCH-1799] - ANT Eclipse task discovers all plugin jars automatically
Test
- [NUTCH-1645] - JUnit Test Case for Adaptive Fetch Schedule class
Task
- [NUTCH-1700] - Remove deprecated code in src/plugin/creativecommons/build.xml
- [NUTCH-1789] - Migrate Nutch site to Apache CMS
- [NUTCH-1817] - Remove pom.xml from source