Apache-PDFBox

Apache-PDFBox
https://pdfbox.apache.org/

Extracting PDF body text without headers and footers

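PDFTextStripper.setSortByPosition(true) makes the stripper sort text into visual page order (top to bottom, left to right).
A PDF writer may emit the text of a page in any order, for example all bold text first and the regular text afterwards.
For performance reasons, PDFTextStripper does not sort text by its position on the page by default; to get text in reading order, call setSortByPosition(true).
Note: do not enable setSortByPosition blindly. Once sorting is on, watermark text can be sorted into the middle of the body text and scramble it.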
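
To make the difference between write order and visual order concrete, here is a minimal JDK-only sketch. It is a simplified model, not PDFBox's actual comparator: the Fragment class, the sample coordinates, and the "|" separator are all hypothetical and exist only for illustration. It also shows the watermark caveat: a watermark written last ends up between two body lines once everything is sorted by position.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class SortByPositionSketch {
    /** Hypothetical stand-in for a positioned run of text; not a PDFBox type. */
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge
        final double y; // distance from the top edge (grows downward)

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Fragments in the order an imaginary PDF writer emitted them. */
    static List<Fragment> sample() {
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("world", 120, 100));      // bold run, written first
        fragments.add(new Fragment("Hello", 50, 100));       // regular run on the same line
        fragments.add(new Fragment("second line", 50, 120)); // next body line
        fragments.add(new Fragment("DRAFT", 80, 110));       // watermark, written last
        return fragments;
    }

    /** Joins the fragments in the order given (the default, unsorted behaviour). */
    static String inWriteOrder(List<Fragment> fragments) {
        return fragments.stream().map(f -> f.text).collect(Collectors.joining("|"));
    }

    /** Joins the fragments top to bottom, then left to right. */
    static String inVisualOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.<Fragment>comparingDouble(f -> f.y)
                .thenComparingDouble(f -> f.x));
        return inWriteOrder(sorted);
    }

    public static void main(String[] args) {
        List<Fragment> fragments = sample();
        System.out.println(inWriteOrder(fragments));  // world|Hello|second line|DRAFT
        System.out.println(inVisualOrder(fragments)); // Hello|world|DRAFT|second line
    }
}
```

In write order the watermark stays at the end, where it is easy to drop; in visual order it lands inside the body text, which is why sorting should be enabled only after checking the output on real documents.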

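import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripperByArea;
import org.junit.jupiter.api.Test; // JUnit 5 assumed; use org.junit.Test for JUnit 4

class PdfBoxTextExtractionTest {
    @Test
    public void readWithoutHeaderFooter() throws IOException {
        String file = "mybook.pdf";
        StringBuilder result = new StringBuilder();
        // Fraction of the page height taken by the header
        float headerRate = 0.1F;
        // Fraction of the page height taken by the footer
        float footerRate = 0.03F;
        // Loader.loadPDF is the PDFBox 3.x entry point; errors now fail the test instead of being swallowed
        try (PDDocument document = Loader.loadPDF(new File(file))) {
            PDFTextStripperByArea regionStripper = new PDFTextStripperByArea();
            regionStripper.setSortByPosition(true); // sort text into visual page order (top to bottom, left to right)
            for (int pageIndex = 0; pageIndex < document.getNumberOfPages(); pageIndex++) {
                var page = document.getPage(pageIndex);
                float pageHeight = page.getMediaBox().getHeight();
                float pageWidth = page.getMediaBox().getWidth();
                // Compute the crop region per page: the top headerRate share is the header,
                // the bottom footerRate share is the footer, everything in between is body text
                Rectangle2D bodyArea = new Rectangle2D.Double(
                        0,
                        pageHeight * headerRate,  // skip the header
                        pageWidth,
                        pageHeight - pageHeight * headerRate - pageHeight * footerRate   // body height
                );
                regionStripper.addRegion("mainContent", bodyArea);
                regionStripper.extractRegions(page);
                String cleanText = regionStripper.getTextForRegion("mainContent");
                result.append(cleanText);
            }
        }
        System.out.println(result);
    }
}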
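
The region arithmetic in the test above can be pulled into a small helper and checked on its own, without opening a PDF. The following is a sketch under stated assumptions: the class name BodyRegionSketch, the method bodyArea, and the argument validation are hypothetical additions, and y is measured from the top of the page, as the test above treats it.

```java
import java.awt.geom.Rectangle2D;

class BodyRegionSketch {
    /**
     * Returns the rectangle left after cutting a header strip off the top and
     * a footer strip off the bottom. Rates are fractions of the page height.
     */
    static Rectangle2D bodyArea(float pageWidth, float pageHeight,
                                float headerRate, float footerRate) {
        // Hypothetical guard: the two strips must leave some body area
        if (headerRate < 0 || footerRate < 0 || headerRate + footerRate >= 1) {
            throw new IllegalArgumentException("header and footer rates must be >= 0 and sum to < 1");
        }
        double top = (double) pageHeight * headerRate;       // height of the header strip
        double bottom = (double) pageHeight * footerRate;    // height of the footer strip
        return new Rectangle2D.Double(0, top, pageWidth, pageHeight - top - bottom);
    }

    public static void main(String[] args) {
        // A 600 x 800 page with a 10% header and a 3% footer
        Rectangle2D area = bodyArea(600f, 800f, 0.1f, 0.03f);
        System.out.println(area);
    }
}
```

For a 600 x 800 page with the rates used above, the body region starts about 80 units below the top edge and is about 696 units tall. The rates themselves still have to be tuned per document, since header and footer sizes vary between PDFs.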

tabula-java: extracting tables from PDF

Extracting tables from PDF files (tabula || paddle-pp-structure)
https://xie.infoq.cn/article/6fa3e4ffab86202c22677ddd5

https://github.com/tabulapdf/tabula-java?tab=readme-ov-file
