pi-ceo-docparser
Community
Deterministically parse docs into structure.
Education & Research
#metadata, #table extraction, #pdf parsing, #research pipeline, #docx extraction, #deterministic ingestion, #page citations
Author: CleanExpo
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
It removes the manual burden of converting PDFs, DOCX files, or plain text into consistent, page-cited, structured content that downstream workflows can reliably consume.
Core Features & Use Cases
- Deterministic extraction into ParsedDoc: Produces a ParsedDoc containing full text, page-separated content, tables (DOCX), and metadata such as title, without any LLM calls.
- Fail-soft dependency handling: Tries PyMuPDF first for PDFs, falls back to pypdf if needed, supports DOCX via python-docx, and always supports TXT with the stdlib; failures populate doc.error instead of raising.
- Page-number citation preservation: Keeps page indices so later research or marketing pipelines can cite sources like "p. 3" precisely.
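The fallback chain and fail-soft behaviour described above can be sketched roughly as follows. This is an illustrative approximation, not the skill's actual source. The `parse` function, the `_pdf_pages` helper, the `path` field, and the choice to split TXT files on form feeds are all assumptions. Only the names `ParsedDoc`, `text`, `pages`, `tables`, `title`, and `error` come from the description.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ParsedDoc:
    # Hypothetical shape: the real ParsedDoc may carry more or different fields.
    path: str
    text: str = ""
    pages: list[str] = field(default_factory=list)
    tables: list[list[list[str]]] = field(default_factory=list)
    title: str | None = None
    error: str | None = None


def _pdf_pages(path: Path) -> list[str]:
    """Try PyMuPDF first, then fall back to pypdf (assumed helper)."""
    try:
        import fitz  # PyMuPDF
    except ImportError:
        fitz = None
    if fitz is not None:
        try:
            with fitz.open(str(path)) as pdf:
                return [page.get_text() for page in pdf]
        except Exception:
            pass  # PyMuPDF could not read the file; try pypdf next
    from pypdf import PdfReader

    return [(page.extract_text() or "") for page in PdfReader(str(path)).pages]


def parse(path_str: str) -> ParsedDoc:
    """Parse a PDF, DOCX, or text file; record failures in doc.error."""
    path = Path(path_str)
    doc = ParsedDoc(path=path_str)
    try:
        suffix = path.suffix.lower()
        if suffix == ".pdf":
            doc.pages = _pdf_pages(path)
        elif suffix == ".docx":
            import docx  # python-docx

            d = docx.Document(str(path))
            # DOCX has no fixed pagination, so this sketch uses a single page.
            doc.pages = ["\n".join(p.text for p in d.paragraphs)]
            doc.tables = [
                [[cell.text for cell in row.cells] for row in table.rows]
                for table in d.tables
            ]
            doc.title = d.core_properties.title or None
        else:
            # Assumption: treat form feeds as page breaks in plain text.
            raw = path.read_text(encoding="utf-8", errors="replace")
            doc.pages = raw.split("\f")
        doc.text = "\n".join(doc.pages)
    except Exception as exc:
        # Fail soft: missing libraries and unreadable files end up here.
        doc.error = f"{type(exc).__name__}: {exc}"
    return doc
```

Because every failure path ends in `doc.error`, a caller can check that one field instead of wrapping each call in its own try/except.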
Quick Start
Ask your pipeline to parse the document at a local path like /path/to/customer-interview.pdf into a ParsedDoc and then read doc.text and doc.pages for downstream analysis.
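As a rough illustration of the downstream step, the sketch below turns `doc.pages` into "p. N" citations. It assumes `doc.pages` is an ordered list of page strings and that citations are 1-based. The `cite` helper and the stand-in `ParsedDoc` class are hypothetical and exist only to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParsedDoc:
    # Minimal stand-in for the skill's ParsedDoc (assumed shape).
    text: str = ""
    pages: List[str] = field(default_factory=list)
    error: Optional[str] = None


def cite(doc: ParsedDoc, phrase: str) -> List[str]:
    """Return 'p. N' labels for each page containing phrase (case-insensitive)."""
    if doc.error:
        return []  # nothing reliable to cite if parsing failed
    needle = phrase.lower()
    return [
        f"p. {number}"
        for number, page in enumerate(doc.pages, start=1)
        if needle in page.lower()
    ]


pages = ["Introductions.", "Pricing was confusing.", "Back to pricing again."]
doc = ParsedDoc(text="\n".join(pages), pages=pages)
print(cite(doc, "pricing"))  # prints ['p. 2', 'p. 3']
```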
Dependency Matrix
Required Modules
None required
Components
Standard package
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: pi-ceo-docparser
Download link: https://github.com/CleanExpo/Pi-Dev-Ops/archive/main.zip#pi-ceo-docparser
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.