ASF JIRA

Tika
1.14
Key descending
1-66 of 66 as at: 26/Apr/24 10:42
Type Patch Info Key Summary Assignee Reporter Priority Status Resolution Created Updated Due Development
Improvement TIKA-2272

Add CRC32 option to DigestingParser

Unassigned Jason (at Wshrdryr) Minor Open Unresolved  
Bug TIKA-2168

Incorrect <a> and <p> parsing in PdfParser

Unassigned Sara Miller Major Resolved Implemented  
Bug TIKA-2151

Imposed Write Limit Causes Lost Data With Pdfs

Unassigned Josh Cummings Critical Resolved Duplicate  
Bug TIKA-2123

CommonsDigester calculates wrong hashes on large files

Unassigned Yahav Amsalem Major Resolved Fixed  
Improvement TIKA-2122

Extract all email headers from Outlook .msg files into Metadata

Unassigned Chris Knott Minor Resolved Fixed  
Improvement TIKA-2113

Upgrade metadata-extractor to 2.9.1

Unassigned Tim Allison Trivial Resolved Fixed  
Bug TIKA-2101

Don't use MAPIMessage.close()

Unassigned Tim Allison Minor Resolved Fixed  
Bug TIKA-2098

Tika.parseToString() with maxLength doesn't work correctly for PDF files

Tim Allison Alexander Kazakov Major Resolved Fixed  
Bug TIKA-2097

Fix NPE in mbox parser

Unassigned Tim Allison Trivial Resolved Fixed  
Improvement TIKA-2095

Include version of Tika in tika-server's GREETING

Unassigned Tim Allison Trivial Resolved Fixed  
Improvement TIKA-2093

Add hOCR output type to the TesseractOCRParser

Tim Allison Eric Pugh Major Resolved Fixed  
Improvement TIKA-2082

Upgrade to PDFBox 2.0.3

Unassigned Luís Filipe Nassif Major Closed Duplicate  
Task TIKA-2081

Add back 'fileUrl' functionality to TikaJAXRS Server subject to security controls

Tim Allison John Dougrez-Lewis Minor Resolved Fixed  
Bug TIKA-2078

Account for potentially multiple runs within a hyperlink in DOCX

Tim Allison Tim Allison Minor Resolved Fixed  
Improvement TIKA-2069

Extract Macro text from Microsoft Office documents

Unassigned Jeff Swindle Major Resolved Fixed  
Bug TIKA-2068

RTFParser crashes with NullPointerException

Unassigned Nam-Quang Tran Major Resolved Duplicate  
Task TIKA-2067

Upgrade maven plugin versions

Unassigned Tim Allison Trivial Resolved Fixed  
Task TIKA-2066

Upgrade commons-io to 2.5

Unassigned Tim Allison Trivial Resolved Fixed  
Task TIKA-2065

Upgrade forbiddenapis to 2.2

Unassigned Tim Allison Trivial Resolved Fixed  
Bug TIKA-2058

Memory Leak in Tika version 1.13 when parsing millions of files

Unassigned Tim Barrett Major Resolved Fixed  
Improvement TIKA-2057

Extract PDF DocInfo fields into separate metadata fields

Tim Allison John Haynes Minor Resolved Fixed  
Bug TIKA-2055

Exception on parsing .docx file

Unassigned Sebastian Iturra Critical Resolved Fixed  
Improvement TIKA-2051

Upgrade to PDFBox 2.0.3 when available

Tim Allison Tim Allison Minor Closed Fixed  
Bug TIKA-2048

Add space for <br/> elements in MSWord 2003XML

Tim Allison Tim Allison Trivial Resolved Fixed  
Bug TIKA-2047

TXTParser overwrites mime type/masks types that are subtype of text

Tim Allison Tim Allison Minor Resolved Fixed  
Bug TIKA-2045

TIKA crashes / runs out of memory on simple PDF

Unassigned Egbert Major Resolved Fixed  
Bug TIKA-2042

MBOX file detected wrongly as text/html

Unassigned Vjeran Marcinko Major Resolved Fixed  
Bug TIKA-2041

Charset detection doesn't appear to be thread-safe

Tim Allison Tim Allison Major Resolved Fixed  
Bug TIKA-2040

OOM when parsing a corrupted CHM

Tim Allison Luís Filipe Nassif Major Resolved Fixed  
Improvement TIKA-2039

Upgrade jackcess to 2.1.4

Tim Allison Tim Allison Trivial Resolved Fixed  
Bug TIKA-2037

Problems with email attachments

Unassigned Eli Trucco Minor Resolved Fixed  
Improvement TIKA-2031

Update Tesseract OCR Parser

Chris A. Mattmann Zarana Parekh Blocker Resolved Fixed  
Bug TIKA-2026

Handle OLE 2.0 embedded non-Office document in PPT/X and XLSX

Tim Allison Tim Allison Major Resolved Fixed  
Bug TIKA-2025

Extraction of long sequences of digits from Excel spreadsheets using Tika 1.13 doesn’t yield the expected results

Tim Allison Aeham Abushwashi Major Resolved Fixed  
Improvement TIKA-2024

Extract original filename/path when possible

Unassigned Tim Allison Major Resolved Fixed  
Improvement TIKA-2022

Add applefile parser

Tim Allison Tim Allison Trivial Resolved Fixed  
Improvement TIKA-2021

Improving accuracy of Tesseract parser for Serial Number and Part Number (Numeric) Extraction

Chris A. Mattmann Zarana Parekh Major Resolved Fixed  
Bug TIKA-2019

WordMLParser and SpreadsheetMLParser incorrectly concatenate tokens with ToTextHandler

Unassigned Tim Allison Major Resolved Fixed  
Bug TIKA-2015

MAPIMessage String fileName constructor leaves file open

Unassigned Tim Barrett Major Resolved Fixed  
Improvement TIKA-2013

Upgrade to POI 3.15 when available

Tim Allison Tim Allison Minor Resolved Fixed  
Improvement TIKA-2011

Add mime detection for Endnote Import File (PRONOM: fmt/328)

Unassigned Tim Allison Trivial Resolved Fixed  
Improvement TIKA-2009

Add magic for djvu

Unassigned Tim Allison Trivial Resolved Fixed  
Improvement TIKA-2008

Add mime detection (and parser?) for MSOffice Owner File (PRONOM fmt/473)

Unassigned Tim Allison Trivial Resolved Fixed  
Improvement TIKA-2006

Add magic for vCalendar and iCalendar

Unassigned Tim Allison Minor Resolved Fixed  
Improvement TIKA-2004

Add mime detection for Windows Media Metafile, PRONOM: application/x-puid-fmt-584

Unassigned Tim Allison Trivial Resolved Fixed  
Bug TIKA-1999

org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)

Tim Allison Egbert Major Resolved Fixed  
Improvement TIKA-1996

Upgrade to PDFBox 2.0.2 when available

Tim Allison Tim Allison Minor Closed Fixed  
Improvement TIKA-1994

Integrate OCR with PDFParser

Tim Allison Tim Allison Major Resolved Fixed  
New Feature TIKA-1993

Image Recognition with Tika

Chris A. Mattmann Thamme Gowda Major Resolved Fixed  
Bug TIKA-1990

Broken .jpg inline image from .pdf files

Tim Allison Kukushkin Alexander Major Resolved Fixed  
Bug TIKA-1989

Weird sentence in website

Unassigned Tilman Hausherr Major Resolved Fixed  
Sub-task TIKA-1986

TIKA-1508 support parser parameters with type (int, double, etc) in configuration XML file

Chris A. Mattmann Thamme Gowda Major Resolved Fixed  
Bug TIKA-1980

HTML head tags found after first script not parsed by HtmlParser (regression)

Tim Allison Joseph Naegele Major Resolved Fixed  
Improvement TIKA-1979

Issue message when server mode has started

Tim Allison Matthias Pigulla Trivial Resolved Fixed  
Bug TIKA-1978

Invocation of java.net.URL.equals(Object), which blocks to do domain name resolution, in org.apache.tika.parser.geo.topic.GeoParser.initialize(URL)

Lewis John McGibbney Lewis John McGibbney Critical Resolved Fixed  
Improvement TIKA-1977

RFC822Parser 'adds' dc:title causing rare exceptions if > 1 'subject'

Unassigned Tim Allison Trivial Resolved Fixed  
Improvement TIKA-1976

Add more robust date parsing fallbacks for RFC822 parser

Unassigned Tim Allison Minor Resolved Fixed  
Bug TIKA-1971

Email saved as .eml with no body not detected as rfc822, while same email saved as plain txt is.

Unassigned Philipp Steinkrueger Minor Resolved Fixed  
Bug TIKA-1970

Date not extracted from email saved as plain txt

Unassigned Philipp Steinkrueger Minor Resolved Fixed  
Improvement TIKA-1958

Add mime detection and lightweight parsers for Office 2003 Word and Excel formats

Tim Allison Tim Allison Minor Resolved Fixed  
Bug TIKA-1938

HtmlParser drops <script> elements found inside <head>

Kenneth William Krugler Joseph Naegele Major Resolved Fixed  
Bug TIKA-1928

Filename detection misses when a # is in a filename

Unassigned Jean Coudon Minor Resolved Fixed  
Bug TIKA-1925

Composite External Parser like Exiftool fails to run on Windows.

Chris A. Mattmann Nilay Chheda Major Resolved Won't Fix  
Improvement TIKA-1513

Add mime detection and parsing for dbf files

Tim Allison Tim Allison Minor Resolved Fixed  
Improvement TIKA-1267

Improve Mbox file detection

Unassigned Luís Filipe Nassif Minor Closed Fixed  
Bug TIKA-1255

WordExtractor - bold hyperlink not closed properly

Tim Allison Alan Hunter Minor Resolved Fixed