Extracting Content from Files within a Zip Archive Using Apache Tika
Problem:
Develop a Java program that reads the contents of files stored within a zip archive utilizing Apache Tika. The zip archive contains various file formats (such as txt, pdf, and docx).
Solution:
To achieve the desired functionality, follow these steps:
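Parse the Zip Archive: Open the archive with a ZipInputStream and iterate over its entries.
Invoke Apache Tika: Pass the stream of each matching entry (txt, pdf, docx) to an AutoDetectParser, which detects the file format itself.
Extract and Convert Content: Collect the parsed body text of each entry as a string.
Consolidate Extracted Content: Append the text of every entry to a single buffer.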
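The archive-walking step needs only the JDK, so it can be tried in isolation before Tika is involved. The following is a minimal sketch under stated assumptions: the class ZipEntryWalker, its methods, and the sample entry names are hypothetical, and each entry is simply decoded as UTF-8 text where the real program would hand the entry stream to a Tika parser.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration; not part of Tika or the original program.
class ZipEntryWalker {

    // Returns entry name -> entry bytes decoded as UTF-8, in archive order.
    static Map<String, String> readEntries(InputStream archive) throws IOException {
        Map<String, String> contents = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no content
                }
                // read() returns -1 at the end of the current entry, not of the whole archive
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                contents.put(entry.getName(),
                        new String(buffer.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        return contents;
    }

    // Builds a small in-memory archive so the sketch runs without any file on disk.
    static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("a.txt"));
            zip.write("alpha".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("docs/b.txt"));
            zip.write("beta".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        readEntries(new ByteArrayInputStream(sampleArchive()))
                .forEach((name, text) -> System.out.println(name + " -> " + text));
    }
}
```

Because the ZipInputStream reports end-of-stream at the end of each entry, whatever consumes it (here a byte copy, in the full program a parser) sees one file at a time without the entry being unpacked to disk.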
Code Snippet (Modified):
<code class="java">import java.io.File;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class ImprovedZipExtractor {
    public static void main(String[] args) {
        List<String> tempString = new ArrayList<>(); // text of each parsed file
        StringBuilder sbf = new StringBuilder();     // combined text of all files
        // Backslashes must be escaped inside a Java string literal
        File file = new File("C:\\Users\\xxx\\Desktop\\abc.zip");
        Parser parser = new AutoDetectParser();

        // try-with-resources closes both streams, even if parsing fails
        try (InputStream input = new FileInputStream(file);
             ZipInputStream zip = new ZipInputStream(input)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.endsWith(".txt") || name.endsWith(".pdf") || name.endsWith(".docx")) {
                    System.out.println("entry=" + name + " " + entry.getSize());
                    // Fresh handler and metadata per entry; -1 disables the default write limit
                    BodyContentHandler handler = new BodyContentHandler(-1);
                    Metadata metadata = new Metadata();
                    // Shield the zip stream so that a parser cannot close it before the next entry
                    InputStream entryStream = new FilterInputStream(zip) {
                        @Override
                        public void close() {
                            // intentionally left open; the try block closes the archive
                        }
                    };
                    parser.parse(entryStream, handler, metadata, new ParseContext());
                    tempString.add(handler.toString());
                }
            }
            for (String text : tempString) {
                System.out.println("Apache Tika - Converted input string : " + text);
                sbf.append(text);
            }
            System.out.println("Final text from all the three files " + sbf);
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
    }
}</code>
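Note: Each entry needs its own content handler so that the text of one file is not overwritten by, or mixed with, the text of the next. The combined buffer sbf should be declared once, outside the loop, and appended to after each file is parsed so that it ends up holding the concatenated content of all files.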