How Do I Read Content from Multiple File Types Within a Zip Archive Using Apache Tika?

Mary-Kate Olsen

Published: 2024-10-28 21:20:30

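Reading Content from Files Within a Zip Using Apache Tika

Task:

You want to write a Java program that uses Apache Tika to extract and read the contents of several files inside a zip archive. Specifically, the zip file contains a mix of text, PDF, and docx files.

Solution: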

import java.io.File;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class ZipContentExtractor {

    public static void main(String[] args) throws IOException, SAXException, TikaException {
        // Backslashes must be escaped inside a Java string literal
        File zipFile = new File("C:\\Users\\xxx\\Desktop\\abc.zip");

        try (ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream(zipFile))) {
            ZipEntry entry;
            while ((entry = zipInputStream.getNextEntry()) != null) {
                String name = entry.getName();
                // Choose a parser and a label by file extension; skip everything else
                Parser parser;
                String label;
                if (name.endsWith(".txt")) {
                    parser = new AutoDetectParser();
                    label = "TXT";
                } else if (name.endsWith(".pdf")) {
                    parser = new PDFParser();
                    label = "PDF";
                } else if (name.endsWith(".docx")) {
                    parser = new OOXMLParser();
                    label = "DOCX";
                } else {
                    continue;
                }
                // Shield the archive stream so that a parser cannot close it between entries
                InputStream entryStream = new FilterInputStream(zipInputStream) {
                    @Override
                    public void close() {
                        // Intentionally empty: try-with-resources closes the archive itself
                    }
                };
                // Collect the body text for every type (the PDF branch previously printed only
                // a metadata field); -1 lifts BodyContentHandler's default write limit
                BodyContentHandler textHandler = new BodyContentHandler(-1);
                parser.parse(entryStream, textHandler, new Metadata(), new ParseContext());
                System.out.println(label + " file content: " + textHandler.toString());
            }
        }
    }
}
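
One point worth illustrating separately is what happens when a single ZipInputStream is shared across entries: if any consumer closes the stream it was handed, the archive is closed too, and the next getNextEntry() call fails. The following is a minimal JDK-only sketch of a guard against that, not Tika code. The class name ZipEntryWalk, the sample entry names, and consumeAndClose (a hypothetical stand-in for a parser that closes its input) are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipEntryWalk {

    /** Builds a small in-memory zip so the sketch needs no files on disk. */
    public static byte[] buildSampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            addEntry(zip, "a.txt", "hello");
            addEntry(zip, "skip.bin", "ignored");
            addEntry(zip, "b.txt", "world");
        }
        return bytes.toByteArray();
    }

    private static void addEntry(ZipOutputStream zip, String name, String text) throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        zip.write(text.getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
    }

    /** Hypothetical stand-in for a parser: reads to the end of the entry, then closes its input. */
    public static String consumeAndClose(InputStream in) throws IOException {
        try (InputStream stream = in) {
            return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    /** Reads every .txt entry, shielding the archive stream from close() calls. */
    public static List<String> readTextEntries(byte[] zipBytes) throws IOException {
        List<String> results = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().endsWith(".txt")) {
                    continue; // getNextEntry() skips the unread remainder of this entry
                }
                // Reads pass through to the archive; close() is swallowed
                InputStream shielded = new FilterInputStream(zip) {
                    @Override
                    public void close() {
                        // Intentionally empty
                    }
                };
                results.add(entry.getName() + "=" + consumeAndClose(shielded));
            }
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        readTextEntries(buildSampleZip()).forEach(System.out::println);
    }
}
```

The wrapper forwards reads to the archive and ignores close(), so each consumer sees only the bytes of the current entry and the loop can move on to the next one. It needs Java 9 or later because of readAllBytes().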

Explanation: