TIKA-2272 | Add CRC32 option to DigestingParser | Unassigned | Jason (at Wshrdryr) | Open | Unresolved
TIKA-2168 | Incorrect <a> and <p> parsing in PdfParser | Unassigned | Sara Miller | Resolved | Implemented
TIKA-2151 | Imposed Write Limit Causes Lost Data With Pdfs | Unassigned | Josh Cummings | Resolved | Duplicate
TIKA-2123 | CommonsDigester calculates wrong hashes on large files | Unassigned | Yahav Amsalem | Resolved | Fixed
TIKA-2122 | Extract all email headers from Outlook .msg files into Metadata | Unassigned | Chris Knott | Resolved | Fixed
TIKA-2113 | Upgrade metadata-extractor to 2.9.1 | Unassigned | Tim Allison | Resolved | Fixed
TIKA-2101 | Don't use MAPIMessage.close() | Unassigned | Tim Allison | Resolved | Fixed
TIKA-2098 | Tika.parseToString() with maxLength doesn't work correctly for PDF files | Tim Allison | Alexander Kazakov | Resolved | Fixed
TIKA-2097 | Fix NPE in mbox parser | Unassigned | Tim Allison | Resolved | Fixed
TIKA-2095 | Include version of Tika in tika-server's GREETING | Unassigned | Tim Allison | Resolved | Fixed
TIKA-2093 | Add hOCR output type to the TesseractOCRParser | Tim Allison | Eric Pugh | Resolved | Fixed
TIKA-2082 | Upgrade to PDFBox 2.0.3 | Unassigned | Luís Filipe Nassif | Closed | Duplicate
TIKA-2081 | Add back 'fileUrl' functionality to TikaJAXRS Server subject to security controls | Tim Allison | John Dougrez-Lewis | Resolved | Fixed
TIKA-2078 | Account for potentially multiple runs within a hyperlink in DOCX | Tim Allison | Tim Allison | Resolved | Fixed
TIKA-2069 | Extract Macro text from Microsoft Office documents | Unassigned | Jeff Swindle | Resolved | Fixed
TIKA-2068 | RTFParser crashes with NullPointerException | Unassigned | Nam-Quang Tran | Resolved | Duplicate
TIKA-2067 | Upgrade maven plugin versions | Unassigned | Tim Allison | Resolved | Fixed
TIKA-2066 | Upgrade commons-io to 2.5 | Unassigned | Tim Allison | Resolved | Fixed
TIKA-2065 | Upgrade forbiddenapis to 2.2 | Unassigned | Tim Allison | Resolved | Fixed
TIKA-2058 | Memory Leak in Tika version 1.13 when parsing millions of files | Unassigned | Tim Barrett | Resolved | Fixed
TIKA-2057 | Extract PDF DocInfo fields into separate metadata fields | Tim Allison | John Haynes | Resolved | Fixed
TIKA-2055 | Exception on parsing .docx file | Unassigned | Sebastian Iturra | Resolved | Fixed
TIKA-2051 | Upgrade to PDFBox 2.0.3 when available | Tim Allison | Tim Allison | Closed | Fixed
TIKA-2048 | Add space for <br/> elements in MSWord 2003XML | Tim Allison | Tim Allison | Resolved | Fixed
TIKA-2047 | TXTParser overwrites mime type/masks types that are subtype of text | Tim Allison | Tim Allison | Resolved | Fixed
TIKA-2045 | TIKA crashes / runs out of memory on simple PDF | Unassigned | Egbert | Resolved | Fixed
TIKA-2042 | MBOX file detected wrongly as text/html | Unassigned | Vjeran Marcinko | Resolved | Fixed
TIKA-2041 | Charset detection doesn't appear to be thread-safe | Tim Allison | Tim Allison | Resolved | Fixed
TIKA-2040 | OOM when parsing a corrupted CHM | Tim Allison | Luís Filipe Nassif | Resolved | Fixed
TIKA-2039 | Upgrade jackcess to 2.1.4 | Tim Allison | Tim Allison | Resolved | Fixed
TIKA-2037 | Problems with email attachments | Unassigned | Eli Trucco | Resolved | Fixed
TIKA-2031 | Update Tesseract OCR Parser | Chris A. Mattmann | Zarana Parekh | Resolved | Fixed
TIKA-2026 | Handle OLE 2.0 embedded non-Office document in PPT/X and XLSX | Tim Allison | Tim Allison | Resolved | Fixed
TIKA-2025 | Extraction of long sequences of digits from Excel spreadsheets using Tika 1.13 doesn’t yield the expected results | Tim Allison | Aeham Abushwashi | Resolved | Fixed
TIKA-2024 | Extract original filename/path when possible | Unassigned | Tim Allison | Resolved | Fixed
TIKA-2022 | Add applefile parser | Tim Allison | Tim Allison | Resolved | Fixed
TIKA-2021 | Improving accuracy of Tesseract parser for Serial Number and Part Number (Numeric) Extraction | Chris A. Mattmann | Zarana Parekh | Resolved | Fixed
TIKA-2019 | WordMLParser and SpreadsheetMLParser incorrectly concatenate tokens with ToTextHandler | Unassigned | Tim Allison | Resolved | Fixed
TIKA-2015 | MAPIMessage String fileName constructor leaves file open | Unassigned | Tim Barrett | Resolved | Fixed
TIKA-2013 | Upgrade to POI 3.15 when available | Tim Allison | Tim Allison | Resolved | Fixed
TIKA-2011 | Add mime detection for Endnote Import File (PRONOM: fmt/328) | Unassigned | Tim Allison | Resolved | Fixed
TIKA-2009 | Add magic for djvu | Unassigned | Tim Allison | Resolved | Fixed
TIKA-2008 | Add mime detection (and parser?) for MSOffice Owner File (PRONOM fmt/473) | Unassigned | Tim Allison | Resolved | Fixed
TIKA-2006 | Add magic for vCalendar and iCalendar | Unassigned | Tim Allison | Resolved | Fixed
TIKA-2004 | Add mime detection for Windows Media Metafile, PRONOM: application/x-puid-fmt-584 | Unassigned | Tim Allison | Resolved | Fixed
TIKA-1999 | org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58) | Tim Allison | Egbert | Resolved | Fixed
TIKA-1996 | Upgrade to PDFBox 2.0.2 when available | Tim Allison | Tim Allison | Closed | Fixed
TIKA-1994 | Integrate OCR with PDFParser | Tim Allison | Tim Allison | Resolved | Fixed
TIKA-1993 | Image Recognition with Tika | Chris A. Mattmann | Thamme Gowda | Resolved | Fixed
TIKA-1990 | Broken .jpg inline image from .pdf files | Tim Allison | Kukushkin Alexander | Resolved | Fixed
TIKA-1989 | Weird sentence in website | Unassigned | Tilman Hausherr | Resolved | Fixed
TIKA-1986 | TIKA-1508: support parser parameters with type (int, double, etc) in configuration XML file | Chris A. Mattmann | Thamme Gowda | Resolved | Fixed
TIKA-1980 | HTML head tags found after first script not parsed by HtmlParser (regression) | Tim Allison | Joseph Naegele | Resolved | Fixed
TIKA-1979 | Issue message when server mode has started | Tim Allison | Matthias Pigulla | Resolved | Fixed
TIKA-1978 | Invocation of java.net.URL.equals(Object), which blocks to do domain name resolution, in org.apache.tika.parser.geo.topic.GeoParser.initialize(URL) | Lewis John McGibbney | Lewis John McGibbney | Resolved | Fixed
TIKA-1977 | RFC822Parser 'adds' dc:title causing rare exceptions if > 1 'subject' | Unassigned | Tim Allison | Resolved | Fixed
TIKA-1976 | Add more robust date parsing fallbacks for RFC822 parser | Unassigned | Tim Allison | Resolved | Fixed
TIKA-1971 | Email saved as .eml with no body not detected as rfc822, while same email saved as plain txt is. | Unassigned | Philipp Steinkrueger | Resolved | Fixed
TIKA-1970 | Date not extracted from email saved as plain txt | Unassigned | Philipp Steinkrueger | Resolved | Fixed
TIKA-1958 | Add mime detection and lightweight parsers for Office 2003 Word and Excel formats | Tim Allison | Tim Allison | Resolved | Fixed
TIKA-1938 | HtmlParser drops <script> elements found inside <head> | Kenneth William Krugler | Joseph Naegele | Resolved | Fixed
TIKA-1928 | Filename detection misses when a # is in a filename | Unassigned | Jean Coudon | Resolved | Fixed
TIKA-1925 | Composite External Parser like Exiftool fails to run on Windows. | Chris A. Mattmann | Nilay Chheda | Resolved | Won't Fix
TIKA-1513 | Add mime detection and parsing for dbf files | Tim Allison | Tim Allison | Resolved | Fixed
TIKA-1267 | Improve Mbox file detection | Unassigned | Luís Filipe Nassif | Closed | Fixed
TIKA-1255 | WordExtractor - bold hyperlink not closed properly | Tim Allison | Alan Hunter | Resolved | Fixed