Java: A Technical Guide to PDF Text Extraction

拾荒的小海螺 2024-08-21


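1. Overview

As more and more information goes digital, PDF has become one of the main formats for distributing documents. In many scenarios, such as data migration, content analysis, and information retrieval, we need to extract the text from PDF files. Java has several libraries for working with PDFs, of which PDFBox and iText are the two most commonly used.

In this post we look at several ways to extract text from a PDF, weigh the strengths and weaknesses of each, and discuss which approach suits which scenario.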

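2. Preparation

Before you start, you need the following:

Baidu developer account: register on the Baidu AI Open Platform and create an application to obtain an API Key and a Secret Key.

Java development environment: make sure a JDK and an IDE such as IntelliJ IDEA or Eclipse are set up.

Dependencies: Baidu provides an official Java SDK, or you can call the API directly with HttpClient (the approach used in this post).

Add the Maven dependencies: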

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.7</version>
</dependency>
<dependency>
    <groupId>org.apache.directory.studio</groupId>
    <artifactId>org.apache.commons.codec</artifactId>
    <version>1.8</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.83</version>
</dependency>
<!-- spring-boot-actuator -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>3.0.2</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>5.2.3</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>5.2.3</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.5.0</version>
</dependency>
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>font-asian</artifactId>
    <version>7.1.16</version>
</dependency>
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>kernel</artifactId>
    <version>7.1.16</version>
</dependency>
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>io</artifactId>
    <version>7.1.16</version>
</dependency>
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>layout</artifactId>
    <version>7.1.16</version>
</dependency>
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>forms</artifactId>
    <version>7.1.16</version>
</dependency>
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>pdfa</artifactId>
    <version>7.1.16</version>
</dependency>
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.8.0</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-collections4</artifactId>
    <version>4.4</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.13</version>
</dependency>

3. Parsing with PDFBox

PDFBox can parse a PDF file and extract its text. Once the text is extracted, you can go through it line by line and write your own logic to look for a given string.

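import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class PDFReader {
    public static void main(String[] args) {
        String filePath = "path/to/your/pdf-file.pdf";
        String keyword = "target text";  // the text to search for

        // PDFBox 3.x (the version declared in the Maven dependencies) loads
        // documents through Loader.loadPDF; PDDocument.load(File) is the 2.x API.
        try (PDDocument document = Loader.loadPDF(new File(filePath))) {
            PDFTextStripper pdfStripper = new PDFTextStripper();

            // Extract one page at a time so the page number of each hit is known
            for (int page = 1; page <= document.getNumberOfPages(); page++) {
                pdfStripper.setStartPage(page);
                pdfStripper.setEndPage(page);
                String text = pdfStripper.getText(document);

                // Split the page text into lines (\R matches any line break)
                String[] lines = text.split("\\R");
                for (String line : lines) {
                    if (line.contains(keyword)) {
                        System.out.println("Found keyword on page " + page + ": " + keyword);
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}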
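
If you also want to know which lines matched, the search itself can live in a small helper that only deals with strings. The following is a minimal sketch, not part of the original code: the class name KeywordLineFinder and the method findLines are made up for illustration, and the input is assumed to be text that PDFBox has already extracted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the original article): returns the 1-based
// numbers of the lines in already-extracted text that contain a keyword.
final class KeywordLineFinder {

    static List<Integer> findLines(String text, String keyword) {
        List<Integer> hits = new ArrayList<>();
        // "\\R" matches any line break (\n, \r\n, \r), so Windows-style
        // line endings do not leave a stray '\r' at the end of each line.
        String[] lines = text.split("\\R");
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].contains(keyword)) {
                hits.add(i + 1); // report line numbers starting at 1
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "Title page\r\nDrawing No. A-101\r\nNotes\r\nDrawing No. B-202";
        System.out.println(findLines(sample, "Drawing No."));
    }
}
```

Keeping the search separate from the PDF loading makes it easy to try out on plain strings before pointing it at real documents.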

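4. Parsing with Tesseract OCR

Rendering the PDF pages as images and running Tesseract OCR on them is an effective way to handle PDFs with complex layouts or irregular tables. The steps in Java are as follows:

Render the PDF pages as images: use PDFBox to render each page to an image.

Recognize the text with Tesseract OCR: run Tesseract on each image and collect the text.

Find the keyword and extract the information: search the OCR text for a keyword (such as "图号", the drawing number) and extract the value in the adjacent cell.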

// Spring MVC controller method. Besides the Spring types it uses
// org.apache.pdfbox.Loader, org.apache.pdfbox.rendering.PDFRenderer,
// net.sourceforge.tess4j.ITesseract / Tesseract, and the Apache POI
// Workbook / Sheet / Row / XSSFWorkbook classes.
@PostMapping("/pdf2Excel")
public ResponseEntity<String> pdf2Excel(@RequestParam("keyword") String keyword, @RequestParam("file") MultipartFile file) throws IOException {
    if (file.isEmpty()) {
        return new ResponseEntity<>("File is empty", HttpStatus.BAD_REQUEST);
    }
    String basePath = System.getProperty("java.io.tmpdir");
    // Build paths with File instead of hard-coded "\\" so they work on any OS
    File directory = new File(basePath, "images");
    if (!directory.exists()) {
        directory.mkdirs();
    }
    File convFile = new File(basePath, file.getOriginalFilename());
    file.transferTo(convFile);

    // Output location on the author's machine; adjust for your environment
    String excelPath = "C:\\Users\\WIN10\\Desktop\\fsdownload\\excel\\MapData.xlsx";
    File excelFile = new File(excelPath);
    if (!excelFile.exists()) {
        Files.createFile(excelFile.toPath());
    }
    Map<String, Object> objectMap = new HashMap<>();

    try (PDDocument document = Loader.loadPDF(convFile)) {
        PDFRenderer pdfRenderer = new PDFRenderer(document);
        int numberOfPages = document.getNumberOfPages();

        ITesseract instance = new Tesseract();
        instance.setDatapath("D:\\soft\\Tesseract-OCR\\tessdata"); // path to Tesseract's tessdata directory
        instance.setLanguage("chi_sim"); // recognition language (Simplified Chinese)

        for (int pageIndex = 0; pageIndex < numberOfPages; pageIndex++) {
            BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(pageIndex, 144);
            int currentPageNum = pageIndex + 1;
            File imageFile = new File(directory, "page_" + currentPageNum + ".jpg");
            ImageIO.write(bufferedImage, "jpg", imageFile);

            // Run OCR on the page image
            String result = instance.doOCR(imageFile);
            // processOCRResult is the author's own helper (not shown here);
            // it is expected to find the keyword and store the value in objectMap
            processOCRResult(result, objectMap, currentPageNum);
            System.out.println("Page " + currentPageNum + " converted to image.");
        }
        System.out.println("Collecting drawing numbers: done");
        System.out.println("Converting data to Excel: start");

        // try-with-resources closes the workbook and the output stream
        try (Workbook workbook = new XSSFWorkbook();
             FileOutputStream fileOut = new FileOutputStream(excelFile)) {
            Sheet sheet = workbook.createSheet("Map Data");

            // Header row
            Row headerRow = sheet.createRow(0);
            headerRow.createCell(0).setCellValue("Key");
            headerRow.createCell(1).setCellValue("Value");

            // Data rows
            int rowNum = 1;
            for (Map.Entry<String, Object> entry : objectMap.entrySet()) {
                Row row = sheet.createRow(rowNum++);
                row.createCell(0).setCellValue(entry.getKey());
                row.createCell(1).setCellValue(String.valueOf(entry.getValue()));
            }

            // Auto-size the columns
            sheet.autoSizeColumn(0);
            sheet.autoSizeColumn(1);

            workbook.write(fileOut);
        }

        return ResponseEntity.ok()
                .body("Data converted to Excel successfully");
    } catch (Exception e) {
        return new ResponseEntity<>("File upload error: " + e.getMessage(), HttpStatus.INTERNAL_SERVER_ERROR);
    }
}
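
The endpoint above calls processOCRResult, but the article does not show its body. As a rough idea of what the keyword step could look like, here is a minimal sketch. It is an assumption, not the author's implementation: it supposes the value sits on the same OCR line as the keyword, separated by optional whitespace or a colon, and the class name, the extra keyword parameter, and the "page_N" map key are all invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the article's processOCRResult helper.
final class OcrKeywordExtractor {

    // Assumption: the value follows the keyword on the same OCR line,
    // optionally separated by whitespace or a colon (ASCII ':' or the
    // full-width colon, written here as the escape \uFF1A).
    static void processOcrResult(String ocrText, String keyword,
                                 Map<String, Object> out, int pageNum) {
        Pattern pattern = Pattern.compile(Pattern.quote(keyword) + "[\\s:\uFF1A]*(\\S+)");
        for (String line : ocrText.split("\\R")) {
            Matcher matcher = pattern.matcher(line);
            if (matcher.find()) {
                // Key format is illustrative; keep only the first hit per page
                out.put("page_" + pageNum, matcher.group(1));
                return;
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> result = new HashMap<>();
        processOcrResult("Title block\nDrawing No.: A-101 rev2", "Drawing No.", result, 1);
        System.out.println(result);
    }
}
```

Real OCR output from a table is often noisier than this: the value may land on the next line or be split by stray characters, so treat the pattern as a starting point to adapt to your documents.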

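5. Parsing with Baidu OCR

Optical character recognition (OCR) is now widely used in image processing, document management, and similar fields. Baidu's OCR API is powerful and easy to use, and lets developers extract text from images quickly. This section shows how to call the Baidu OCR API from Java to extract text from page images.

5.1 Getting an Access Token

Before calling the OCR API you need an Access Token. This is usually done when the application starts, and the token is only valid for a limited time.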

public Map<String, Object> token() throws Exception {
    Map<String, Object> map = new HashMap<>();

    HttpPost httpPost = new HttpPost("https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials");
    // The body below is form-encoded, so no JSON Content-Type header is set;
    // UrlEncodedFormEntity supplies application/x-www-form-urlencoded itself.
    httpPost.addHeader("Accept", "application/json");

    // POST parameters
    List<NameValuePair> formparams = new ArrayList<>();
    formparams.add(new BasicNameValuePair("client_id", API_KEY));
    formparams.add(new BasicNameValuePair("client_secret", SECRET_KEY));
    httpPost.setEntity(new UrlEncodedFormEntity(formparams, "UTF-8"));  // form body, UTF-8 encoded

    // Send the request; try-with-resources closes the client and the response
    try (CloseableHttpClient httpClient = HttpClients.createDefault();
         CloseableHttpResponse response = httpClient.execute(httpPost)) {
        String responseString = EntityUtils.toString(response.getEntity(), "UTF-8");
        JSONObject obj = JSON.parseObject(responseString);
        int statusCode = response.getStatusLine().getStatusCode();
        if (statusCode == 200) {
            map.put("access_token", obj.getString("access_token"));
            map.put("refresh_token", obj.getString("refresh_token"));
            map.put("expires_in", obj.getIntValue("expires_in"));
        } else {
            map.put("error", obj.getString("error"));
            map.put("error_description", obj.getString("error_description"));
        }
        map.put("stateCode", statusCode);
    }
    return map;
}
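
Because the token is only valid for a limited time (the expires_in value in the response), it can be worth caching it instead of requesting a new one for every PDF. The following is a minimal sketch of such a cache. Everything in it (CachedAccessToken, Token, the safety margin) is hypothetical and not part of the original code; the Supplier stands in for a call to the token() method above.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Hypothetical cache around a token request; names are illustrative only.
final class CachedAccessToken {

    // Token value plus its lifetime in seconds (the "expires_in" field).
    static final class Token {
        final String value;
        final long expiresInSeconds;

        Token(String value, long expiresInSeconds) {
            this.value = value;
            this.expiresInSeconds = expiresInSeconds;
        }
    }

    private final Supplier<Token> fetcher;   // performs the real HTTP request
    private final Clock clock;               // injectable so tests can control time
    private final Duration safetyMargin;     // refresh this long before real expiry
    private String cached;
    private Instant expiresAt = Instant.MIN;

    CachedAccessToken(Supplier<Token> fetcher, Clock clock, Duration safetyMargin) {
        this.fetcher = fetcher;
        this.clock = clock;
        this.safetyMargin = safetyMargin;
    }

    // Returns the cached token, fetching a new one if none is cached
    // or the cached one has reached its (margin-adjusted) expiry.
    synchronized String get() {
        Instant now = clock.instant();
        if (cached == null || !now.isBefore(expiresAt)) {
            Token token = fetcher.get();
            cached = token.value;
            expiresAt = now.plusSeconds(token.expiresInSeconds).minus(safetyMargin);
        }
        return cached;
    }

    public static void main(String[] args) {
        // A fake fetcher replaces the real token() call in this demo
        CachedAccessToken cache = new CachedAccessToken(
                () -> new Token("demo-token", 3600),
                Clock.systemUTC(),
                Duration.ofMinutes(5));
        System.out.println(cache.get());
    }
}
```

In the real application the Supplier would wrap token() and read access_token and expires_in from the returned map.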

5.2 Calling the Baidu OCR API

With the Access Token in hand, we can call the Baidu OCR API. We send the image data in a POST request and receive the recognition result.

public Map<String, Object> accurate(String token, String image) throws Exception {
    Map<String, Object> map = new HashMap<>();
    HttpPost httpPost = new HttpPost("https://aip.baidubce.com/rest/2.0/ocr/v1/accurate_basic?access_token=" + token + "&language_type=CHN_ENG&detect_direction=false&paragraph=false&probability=false");
    httpPost.addHeader("Content-Type", "application/x-www-form-urlencoded");
    httpPost.addHeader("Accept", "application/json");

    // POST parameters: the Base64-encoded image
    List<NameValuePair> formparams = new ArrayList<>();
    formparams.add(new BasicNameValuePair("image", image));
    httpPost.setEntity(new UrlEncodedFormEntity(formparams, "UTF-8"));  // form body, UTF-8 encoded

    // Send the request; try-with-resources closes the client and the response
    try (CloseableHttpClient httpClient = HttpClients.createDefault();
         CloseableHttpResponse response = httpClient.execute(httpPost)) {
        String responseString = EntityUtils.toString(response.getEntity(), "UTF-8");
        JSONObject obj = JSON.parseObject(responseString);
        String errorCode = obj.getString("error_code");
        if (Objects.nonNull(errorCode)) {
            // The API reports failures through error_code / error_msg in the body
            map.put("stateCode", 500);
            map.put("error_code", errorCode);
            map.put("error_msg", obj.getString("error_msg"));
        } else {
            map.put("stateCode", 200);
            map.put("words_result", obj.getJSONArray("words_result"));
            map.put("words_result_num", obj.getString("words_result_num"));
            map.put("log_id", obj.getString("log_id"));
        }
    }
    return map;
}
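
The image parameter of accurate is a Base64 string. The endpoint in the next section produces it with a helper called getFileContentAsBase64, whose body the article does not show. A minimal sketch of what such a helper might look like, using only the JDK, is given below; the class name ImageBase64 is made up, and the meaning of the boolean flag (URL-encode the result or not) is an assumption based on how it is called.

```java
import java.io.IOException;
import java.net.URLEncoder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Base64;

// Hypothetical implementation of the helper used by the endpoint code.
final class ImageBase64 {

    // Reads a file and returns its content as Base64.
    // urlEncode is assumed to mean "also percent-encode the result".
    static String getFileContentAsBase64(String path, boolean urlEncode) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(path));
        String base64 = Base64.getEncoder().encodeToString(bytes);
        // UnsupportedEncodingException is an IOException, so it is covered
        return urlEncode ? URLEncoder.encode(base64, "UTF-8") : base64;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sample", ".bin");
        Files.write(tmp, new byte[] {1, 2, 3});
        System.out.println(getFileContentAsBase64(tmp.toString(), false));
        Files.delete(tmp);
    }
}
```

Passing false makes sense when the string goes into a UrlEncodedFormEntity, as in accurate above, because the entity already URL-encodes each form field; encoding it twice would corrupt the image data.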

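5.3 Writing the Recognized Text to Excel

The Baidu OCR API returns its result as JSON. We can parse it with a JSON library (fastjson in the code here; Gson would work as well), pull out the recognized text, and write it to an Excel file.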

@PostMapping("/pdf2Excel")
public ResponseEntity<byte[]> pdf2Excel(@RequestParam("keyword") String keyword, @RequestParam("file") MultipartFile file) throws Exception {
    if (file.isEmpty()) {
        return new ResponseEntity<>("File is empty".getBytes(), HttpStatus.BAD_REQUEST);
    }
    String basePath = System.getProperty("java.io.tmpdir");
    // Build paths with File instead of hard-coded "\\" so they work on any OS
    File directory = new File(basePath, "images");
    if (!directory.exists()) {
        directory.mkdirs();
    }
    File convFile = new File(basePath, file.getOriginalFilename());
    file.transferTo(convFile);

    Map<String, Object> tokenMap = ocrAPIFactory.token();
    int stateCode = (int) tokenMap.get("stateCode");
    if (stateCode != 200) {
        // Report the failure with an error status rather than 200 OK
        return new ResponseEntity<>("ERROR".getBytes(), HttpStatus.INTERNAL_SERVER_ERROR);
    }
    Map<Integer, String> objectMap = new HashMap<>();
    String accessToken = String.valueOf(tokenMap.get("access_token"));
    System.out.println("Collecting drawing numbers: start");
    try (PDDocument document = Loader.loadPDF(convFile)) {
        PDFRenderer pdfRenderer = new PDFRenderer(document);
        int numberOfPages = document.getNumberOfPages();
        for (int pageIndex = 0; pageIndex < numberOfPages; pageIndex++) {
            BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(pageIndex, 144);
            int currentPageNum = pageIndex + 1;

            File imageFile = new File(directory, "page_" + currentPageNum + ".jpg");
            ImageIO.write(bufferedImage, "jpg", imageFile);

            // Helper not shown in this post: returns the file content as Base64
            String imgStr = ocrAPIFactory.getFileContentAsBase64(imageFile.getPath(), false);

            Map<String, Object> fileMap = ocrAPIFactory.accurate(accessToken, imgStr);
            stateCode = (int) fileMap.get("stateCode");
            if (stateCode != 200) {
                System.out.println("Baidu OCR request failed: " + fileMap.get("error_msg"));
                return new ResponseEntity<>("ERROR".getBytes(), HttpStatus.INTERNAL_SERVER_ERROR);
            }

            JSONArray wordsResults = (JSONArray) fileMap.get("words_result");
            if (Objects.nonNull(wordsResults)) {
                // Author's helper (not shown): stores the drawing number found on this page
                processOCRResult(wordsResults, currentPageNum, objectMap);
            }
            System.out.println("Page " + currentPageNum + " converted to image.");
        }

        System.out.println("Collecting drawing numbers: done");
        System.out.println("Converting data to Excel: start");

        // Sort by page number
        Map<Integer, String> sortedMap = objectMap.entrySet()
                .stream() // turn the map entries into a Stream
                .sorted(Map.Entry.comparingByKey()) // sort by key (the page number)
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (oldValue, newValue) -> oldValue, // merge strategy for duplicate keys
                        LinkedHashMap::new  // a Map implementation that keeps insertion order
                ));

        // Write the workbook into an in-memory byte stream
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        try (Workbook workbook = new XSSFWorkbook()) {
            Sheet sheet = workbook.createSheet("Map Data");

            // Header row
            Row headerRow = sheet.createRow(0);
            headerRow.createCell(0).setCellValue("Page");
            headerRow.createCell(1).setCellValue("Drawing No.");

            // Data rows
            int rowNum = 1;
            for (Map.Entry<Integer, String> entry : sortedMap.entrySet()) {
                Row row = sheet.createRow(rowNum++);
                row.createCell(0).setCellValue(Objects.nonNull(entry.getKey()) ? entry.getKey().toString() : "0");
                row.createCell(1).setCellValue(Objects.nonNull(entry.getValue()) ? entry.getValue() : "");
            }

            // Auto-size the columns
            sheet.autoSizeColumn(0);
            sheet.autoSizeColumn(1);

            workbook.write(outputStream);
        } catch (IOException e) {
            e.printStackTrace();
            return ResponseEntity.status(500).build();
        }

        // Build the HTTP response
        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_OCTET_STREAM);
        headers.setContentDispositionFormData("attachment", "MapData.xlsx");
        System.out.println("Converting data to Excel: done");

        // Delete the temporary image directory and everything in it
        try {
            Files.walkFileTree(directory.toPath(), new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                    Files.delete(file);
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                    Files.delete(dir);
                    return FileVisitResult.CONTINUE;
                }
            });
            System.out.println("Directory deleted successfully.");
        } catch (IOException e) {
            e.printStackTrace();
        }

        return ResponseEntity.ok()
                .headers(headers)
                .body(outputStream.toByteArray());
    } catch (Exception e) {
        System.out.println("Error while converting data to Excel: " + e);
        return new ResponseEntity<>("File upload error: ".getBytes(), HttpStatus.INTERNAL_SERVER_ERROR);
    }
}
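
As a side note on the page-sorting step in the code above: since the keys are plain page numbers, a TreeMap gives the same ordering with less code. This is only a sketch of an alternative, not what the original code does; the class and method names are invented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical alternative to the stream-based sort used in the endpoint.
final class PageOrder {

    // A TreeMap keeps its entries sorted by key, so copying the unsorted
    // map into one yields the pages in ascending order.
    static Map<Integer, String> sortByPage(Map<Integer, String> unsorted) {
        return new TreeMap<>(unsorted);
    }

    public static void main(String[] args) {
        Map<Integer, String> pages = new HashMap<>();
        pages.put(3, "C-303");
        pages.put(1, "A-101");
        pages.put(2, "B-202");
        System.out.println(sortByPage(pages));
    }
}
```

Collecting the results directly into a TreeMap inside the page loop would make the separate sorting step unnecessary.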

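6. Conclusion

In Java, extracting text from a PDF is straightforward with PDFBox. PDFBox suits simple documents; for documents with a complex structure, OCR is the better tool. When choosing between them, weigh your project requirements, the complexity of the documents, and your performance needs.

This post covered the basic ways to extract text from a PDF and showed how to deal with complex document structures. I hope it helps with your project! If you have any questions or suggestions, feel free to leave a comment.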

