pdf-ingest

Name: pdf-ingest
Author: gyy0592

Community

Dual-channel PDF reader: text + images

Education & Research #pdf #document-processing #text-extraction #pdftotext #poppler-utils #page-images

Version: 1.0.0

System Documentation

What problem does it solve?

PDFs often lose critical visual information when converted to plain text: equations, subscripts, tables, and figures can be mangled or removed, making automated analysis unreliable. This skill preserves both a searchable text channel and faithful page images so downstream processing can quickly scan prose while verifying any visual content against an exact image.

Core Features & Use Cases

Dual-channel extraction: produces a layout-preserving text.txt plus one PNG per page, so the visual content of every page is retained alongside the text.
Reliable tooling: uses pdftotext -layout to keep multi-column layouts intact and pdftoppm to render page images at a configurable DPI.
Practical scenarios: ideal for paper reading, report parsing, form extraction, and any task where formulas, tables, or figures must be confirmed visually before use.
Failure handling guidance: detects missing tools, missing files, and scanned-image PDFs (text may be empty but images remain usable).
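
The features above can be sketched in Python. This is a minimal illustration, not the skill's actual implementation: the function names (build_commands, preflight, looks_scanned) and the "page" image prefix are hypothetical, and only the tool names, the -layout flag, the text.txt file name, and the configurable DPI come from the description above. The -png and -r options are standard pdftoppm flags.

```python
import shutil
from pathlib import Path

# Both tools ship with poppler-utils.
REQUIRED_TOOLS = ("pdftotext", "pdftoppm")


def build_commands(pdf_path, out_dir, dpi=150):
    """Return the argv lists for the text channel and the image channel."""
    pdf = Path(pdf_path)
    out = Path(out_dir)
    # Text channel: -layout keeps the physical column layout.
    text_cmd = ["pdftotext", "-layout", str(pdf), str(out / "text.txt")]
    # Image channel: one PNG per page at the requested resolution.
    # The "page" prefix is an assumption made for this sketch.
    image_cmd = ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(out / "page")]
    return text_cmd, image_cmd


def preflight(pdf_path, which=shutil.which):
    """Collect problems that would make ingestion fail before running anything."""
    problems = []
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"missing tool: {tool}")
    if not Path(pdf_path).is_file():
        problems.append(f"missing file: {pdf_path}")
    return problems


def looks_scanned(text):
    """Heuristic: a scanned-image PDF yields only whitespace or form feeds."""
    return not text.strip()
```

Each command list can be passed to subprocess.run(cmd, check=True) once preflight returns no problems. When looks_scanned is true for the extracted text, the page images are still usable on their own.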

Quick Start

Ingest paper.pdf at 150 DPI and return the path to the generated paper_temp directory along with the counts of text lines and page images.
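
The summary this prompt asks for could be computed roughly as follows. This is a sketch under stated assumptions: the summarize helper is hypothetical, and it assumes the output directory holds text.txt and the page PNGs side by side, as in the paper_temp example.

```python
from pathlib import Path


def summarize(out_dir):
    """Report the output directory, the text line count, and the page image count."""
    out = Path(out_dir)
    text_file = out / "text.txt"
    # A missing text.txt is treated as an empty text channel.
    if text_file.is_file():
        text = text_file.read_text(encoding="utf-8", errors="replace")
    else:
        text = ""
    return {
        "directory": str(out),
        "text_lines": text.count("\n"),  # newline count, same as wc -l
        "page_images": len(list(out.glob("*.png"))),
    }
```

A text_lines value of 0 together with a nonzero page_images count is the pattern expected for a scanned-image PDF.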
