|
|
TIKA-3029
|
Extract information from PPT formats, including tables and image content
|
Unassigned
|
aashika
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-2723
|
Issue with parsing .mht container
|
Unassigned
|
Ghenadie
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-2521
|
SAX-based docx/pptx should start a new line before second paragraph within a cell
|
Unassigned
|
Tim Allison
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2519
|
Issue parsing multiple CHM files concurrently
|
Unassigned
|
Eamonn Saunders
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2516
|
Upgrade CXF version to > 3.0.13
|
Unassigned
|
Julian Reschke
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2512
|
Add underline and strikethrough to SAX-based docx/pptx parsers
|
Unassigned
|
Tim Allison
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2511
|
Slowness parsing SQLite database file
|
Unassigned
|
Eamonn Saunders
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2510
|
Embedded MP3 file in PPTX document no longer identified
|
Tim Allison
|
Eamonn Saunders
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2506
|
Null pointer in tika-dl test on Windows
|
Bob Paulin
|
Bob Paulin
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2504
|
TIKA-2499
Upgrade or remove plexus-utils
|
Unassigned
|
Tim Allison
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2503
|
TIKA-2499
Try to upgrade httpclient to >=4.5.3
|
Tim Allison
|
Tim Allison
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2502
|
TIKA-2499
Upgrade OpenNLP to 1.8.3
|
Unassigned
|
Tim Allison
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2501
|
TIKA-2499
Upgrade jackson to 2.9.2
|
Unassigned
|
Tim Allison
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2499
|
Sonatype Nexus Auditor is reporting that Tika 1.13 is using a number of vulnerable third-party components.
|
Tim Allison
|
Abhijit Rajwade
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2497
|
Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser
|
Unassigned
|
Advokat
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2492
|
Remove pdfdebugger from tika
|
Unassigned
|
Tilman Hausherr
|
|
Closed |
Fixed
|
|
|
|
|
|
|
TIKA-2491
|
Cannot use TikaConfig
|
Unassigned
|
Markus Jelsma
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2490
|
Turn off stderr warnings in Tika-app
|
Tim Allison
|
Tim Allison
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2489
|
Upgrade to PDFBox 2.0.8
|
Unassigned
|
Tim Allison
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2486
|
Upgrade metadata-extractor to 2.10.1
|
Unassigned
|
Julian Reschke
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2485
|
EncodingDetectors markLimits to be configurable
|
Tim Allison
|
Markus Jelsma
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2483
|
Using PackageParser in ForkParser causes NPE
|
Unassigned
|
TzeKai Lee
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2478
|
RFC822 includes redundant copies of the text
|
Tim Allison
|
Robert Letzler
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2476
|
Metadata.toString always returns a trailing space
|
Sergey Beryozkin
|
Sergey Beryozkin
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2472
|
Implement Metadata.hashCode
|
Sergey Beryozkin
|
Sergey Beryozkin
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2470
|
Illegal reflective Access -- more cleanup for Java 9
|
Unassigned
|
Tim Allison
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2469
|
False positives with x-ms-owner detection
|
Tim Allison
|
Luís Filipe Nassif
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2466
|
Remove JAXB usage
|
Unassigned
|
Robert Munteanu
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2465
|
Add explicit unit tests for xxe
|
Unassigned
|
Tim Allison
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2464
|
No PIL found while running the docker image 'InceptionVideoRestDockerfile'
|
Chris A. Mattmann
|
Aman R Mathur
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2459
|
Missing text in .doc file (but can be extracted by POI)
|
Unassigned
|
Dustin Spicuzza
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2456
|
Emails extracted from MBOX not detected as rfc822
|
Unassigned
|
Luís Filipe Nassif
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2455
|
Flag in metadata for alternative email bodies
|
Unassigned
|
Matthew Caruana Galizia
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2454
|
Emails extracted from PSTs detected as unexpected file types
|
Unassigned
|
Matthew Caruana Galizia
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2451
|
Detect image frame counts for tiff files
|
Unassigned
|
Mike Cantrell
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2450
|
OfficeParser.parse called for zero-byte file with .doc extension
|
Unassigned
|
Matthew Caruana Galizia
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2449
|
Enabling extraction of standard references from text
|
Giuseppe Totaro
|
Giuseppe Totaro
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2448
|
Handle phonetic strings in the SAX docx parser
|
Unassigned
|
Tim Allison
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2447
|
PSDParser creates unnecessary large byte array and discards it
|
Unassigned
|
Jan Burkhardt
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2445
|
Windows BAT / CMD detection
|
Unassigned
|
Nick Burch
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2442
|
Non-terminal interactive form fields not handled recursively
|
Unassigned
|
Christopher Creutzig
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2440
|
Phonetic strings handling for multilingual environments.
|
Unassigned
|
Takahiro Ochi
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2439
|
Avoid NullPointerException in org.apache.tika.langdetect.OptimaizeLangDetector if models haven't been loaded
|
Unassigned
|
Karl-Philipp Richter
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2438
|
Test failure at OOXMLParserTest.testBigIntegersWGeneralFormat:1350->TikaTest.assertContains:102
|
Unassigned
|
Karl-Philipp Richter
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2435
|
docx parser missing content when multiple body sections
|
Unassigned
|
Tim Allison
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2433
|
Tika 1.16 - NullPointerException after update - Asking for help
|
Unassigned
|
Karl Buchta
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2431
|
Upgrade to PDFBox 2.0.7
|
Tim Allison
|
Tim Allison
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2430
|
Add at least dev test capability to run Tika against fuzzed files
|
Tim Allison
|
Tim Allison
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2429
|
Upgrade to POI 3.17-final when available
|
Unassigned
|
Tim Allison
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2428
|
EMFParser loops forever with corrupted files
|
Unassigned
|
Luís Filipe Nassif
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2426
|
Fix locale-dependent test in xlsb unit test
|
Unassigned
|
Tim Allison
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2402
|
TIKA-2398
Support all image formats in Object Recognition REST Parser
|
Chris A. Mattmann
|
Thejan Wijesinghe
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2400
|
TIKA-2398
Standardizing current Object Recognition REST parsers
|
Chris A. Mattmann
|
Thejan Wijesinghe
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2389
|
Warn log level is pretty strong for missing JBIG2ImageReader
|
Unassigned
|
Thomas Mortagne
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2385
|
Tesseract OCR rotation.py not run
|
Dave Meikle
|
Peter Weiss
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2369
|
Define a clean Recogniser interface: for objects from binary data; and for text classification
|
Chris A. Mattmann
|
Chris A. Mattmann
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-2355
|
Cache trained model while running ObjectRecognition server from Docker builds
|
Chris A. Mattmann
|
Madhav Sharan
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2347
|
Underlined text is not decorated as such when extracting from word documents
|
Dave Meikle
|
Stuart Hendren
|
|
Closed |
Fixed
|
|
|
|
|
|
|
TIKA-2346
|
Allow Office format parsers to exclude parsing shapes
|
Unassigned
|
Nick Burch
|
|
Reopened |
Unresolved
|
|
|
|
|
|
|
TIKA-2340
|
Add explicit deps to tika-parsers which are currently used from transitive scope
|
Konstantin Gribov
|
Konstantin Gribov
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-2332
|
Output SNOMED codes for CUIs in CTAKES output?
|
Chris A. Mattmann
|
Dillon Welch
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2312
|
[Mp3Parser] expose fields from ID3TagsAndAudio
|
Unassigned
|
Łukasz Ozimek
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-2262
|
Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types
|
Chris A. Mattmann
|
Thamme Gowda
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-2034
|
Upgrade XMPCore to 5.1.3
|
Tim Allison
|
Tim Allison
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-1988
|
Age Detection Tika Recogniser
|
Chris A. Mattmann
|
Madhav Sharan
|
|
Reopened |
Unresolved
|
|
|
|
|
|
|
TIKA-1953
|
tika-server NullPointerException while processing rtfs
|
Tim Allison
|
Ravi
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-1952
|
Access Date is getting modified while capturing the MetaData information using AutoDetectParser
|
Unassigned
|
RameshKalidindi
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1840
|
No way to link slide notes to slide in PPT output.
|
Chris A. Mattmann
|
Sam H
|
|
Reopened |
Unresolved
|
|
|
|
|
|
|
TIKA-1829
|
org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92) NPE
|
Tim Allison
|
frank
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1800
|
MediaType#parse does not decode escaped special characters
|
Unassigned
|
Roberto Benedetti
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1788
|
message/rfc822 parser doesn't identify attachment filenames from Content-Disposition header
|
Tim Allison
|
Sergey Tsalkov
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-1738
|
ForkClient does not always delete temporary bootstrap jar
|
Unassigned
|
Yaniv Kunda
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1724
|
Create parser for .obo file format.
|
Lewis John McGibbney
|
Lewis John McGibbney
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1705
|
Update ASM dependency to 5.0.4
|
Dave Meikle
|
Uwe Schindler
|
|
Reopened |
Unresolved
|
|
|
|
|
|
|
TIKA-1697
|
Parser Implementation for AkomaNtoso Legal XML Documents
|
Lewis John McGibbney
|
Lewis John McGibbney
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1688
|
Tika Version in Metadata
|
Unassigned
|
Paul Ramirez
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1674
|
Add example to show how to extract embedded files
|
Unassigned
|
Tim Allison
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1672
|
Integrate tika-java7 component
|
Unassigned
|
Tyler Bui-Palsulich
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1640
|
Make ExternalParser support aliases for key names in extracted metadata
|
Chris A. Mattmann
|
Chris A. Mattmann
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1616
|
Tika Parser for GIBS Metadata
|
Lewis John McGibbney
|
Lewis John McGibbney
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1609
|
Leverage Google's LibPhonenumber for enhanced phone number extraction and metadata modeling
|
Lewis John McGibbney
|
Lewis John McGibbney
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1607
|
TIKA-2085
Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
|
Lewis John McGibbney
|
Lewis John McGibbney
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1598
|
Parser Implementation for Streaming Video
|
Lewis John McGibbney
|
Lewis John McGibbney
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1577
|
NetCDF Data Extraction
|
Ann Burgess
|
Ann Burgess
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1540
|
New Tika plugin for image based feature extraction using computer vision techniques
|
Lewis John McGibbney
|
Aashish Chaudhary
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1518
|
Docker with Tika Server
|
Dave Meikle
|
Paul Ramirez
|
|
Reopened |
Unresolved
|
|
|
|
|
|
|
TIKA-1505
|
chmparser breaks down when extracting from file of CHM format v3
|
Unassigned
|
Bin Hawking
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1465
|
Implement extraction of non-global variables from netCDF3 and netCDF4
|
Lewis John McGibbney
|
Lewis John McGibbney
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1456
|
Visual Sentiment API parser
|
Chris A. Mattmann
|
Chris A. Mattmann
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1454
|
Extracting as HTML loses links in xlsx, ppt, and pptx files
|
Tim Allison
|
Chris Bryant
|
|
Resolved |
Fixed
|
|
|
|
|
|
|
TIKA-1425
|
Automatic batching of Microsoft service calls
|
Lewis John McGibbney
|
Lewis John McGibbney
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1417
|
Create Extract Embedded Images from PDFs Example
|
Unassigned
|
Tyler Bui-Palsulich
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1395
|
TIKA-1390
Create embedded image extraction example
|
Unassigned
|
Tyler Bui-Palsulich
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1390
|
Create tika-example module
|
Unassigned
|
Tyler Bui-Palsulich
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1379
|
Error in Tika().detect for XML files with XAdES signature
|
Unassigned
|
Alessandro De Angelis
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1367
|
Tika documentation should list tika-parsers parser dependencies
|
Unassigned
|
Sergey Beryozkin
|
|
Resolved |
Invalid
|
|
|
|
|
|
|
TIKA-1366
|
Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse
|
Unassigned
|
Sergey Beryozkin
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1329
|
TIKA-1390
Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataParser
|
Unassigned
|
Tim Allison
|
|
Reopened |
Unresolved
|
|
|
|
|
|
|
TIKA-1328
|
Translate Metadata and Content
|
Unassigned
|
Tyler Bui-Palsulich
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1318
|
Use of Deprecated Word6Extractor.getParagraphText() Method
|
Unassigned
|
Tyler Bui-Palsulich
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1308
|
Support in-memory parse mode (don't create temp file) to support running Tika in GAE
|
Unassigned
|
jefferyyuan
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1295
|
Make some Dublin Core items multi-valued
|
Tim Allison
|
Tim Allison
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1276
|
Missing embedded dependencies in tika-bundle
|
Unassigned
|
Rupert Westenthaler
|
|
Reopened |
Unresolved
|
|
|
|
|
|
|
TIKA-1220
|
Parser implementation for IFC files
|
Lewis John McGibbney
|
Lewis John McGibbney
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1208
|
TIKA-1207
Migrate Any23 mime contributions to Tika
|
Unassigned
|
Lewis John McGibbney
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1108
|
Represent individual slides in pptx
|
Unassigned
|
Daniel Bonniot de Ruisselet
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-1059
|
Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
|
Unassigned
|
Ray Gauss II
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-988
|
We don't extract a placeholder for a Word document embedded in an Excel document
|
Unassigned
|
Michael McCandless
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-987
|
Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted
|
Unassigned
|
Michael McCandless
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-985
|
Support for HTML5 elements
|
Unassigned
|
Markus Jelsma
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-980
|
MicrodataContentHandler for Apache Tika
|
Kenneth William Krugler
|
Markus Jelsma
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-894
|
Add webapp mode for Tika Server, simplifies deployment
|
Unassigned
|
Graham Charters
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-891
|
Use POST in addition to PUT on method calls in tika-server
|
Chris A. Mattmann
|
Chris A. Mattmann
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-819
|
Make Option to Exclude Embedded Files' Text for Text Content
|
Unassigned
|
Albert L.
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-776
|
ExifTool Embedder
|
Chris A. Mattmann
|
Ray Gauss II
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-774
|
ExifTool Parser
|
Chris A. Mattmann
|
Ray Gauss II
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-715
|
Some parsers produce non-well-formed XHTML SAX events
|
Unassigned
|
Michael McCandless
|
|
Open |
Unresolved
|
|
|
|
|
|
|
TIKA-539
|
Encoding detection is too biased by encoding in meta tag
|
Kenneth William Krugler
|
Reinhard Pötz
|
|
Reopened |
Unresolved
|
|
|
|
|