pi-ceo-docparser

Community

Deterministically parses documents into structured content.

Author: CleanExpo
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

It removes the manual burden of converting PDFs, DOCX files, or plain text into consistent, page-cited, structured content that downstream workflows can reliably consume.

Core Features & Use Cases

  • Deterministic extraction into ParsedDoc: Produces a ParsedDoc containing full text, page-separated content, tables (DOCX), and metadata such as title, without any LLM calls.
  • Fail-soft dependency handling: For PDFs, tries PyMuPDF first and falls back to pypdf if needed; reads DOCX via python-docx; and always supports TXT through the standard library. Failures populate doc.error instead of raising.
  • Page-number citation preservation: Keeps page indices so later research or marketing pipelines can cite sources like “p. 3” precisely.
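The features above can be illustrated with a minimal sketch. This is not the skill's actual source: the function name parse, the helper _pdf_pages, the exact ParsedDoc field layout, and the choice to split TXT files on form-feed characters are all assumptions made for illustration. Only the fallback order (PyMuPDF, then pypdf), the DOCX and TXT support, and the doc.error behavior come from the description.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ParsedDoc:
    """Hypothetical shape; the real ParsedDoc may differ."""
    text: str = ""
    pages: list[str] = field(default_factory=list)
    tables: list[list[list[str]]] = field(default_factory=list)  # DOCX only
    title: str | None = None
    error: str | None = None


def _pdf_pages(path: Path) -> list[str]:
    """Try PyMuPDF first; on any failure, fall back to pypdf."""
    try:
        import fitz  # PyMuPDF

        pdf = fitz.open(str(path))
        try:
            return [page.get_text() for page in pdf]
        finally:
            pdf.close()
    except Exception:
        from pypdf import PdfReader

        reader = PdfReader(str(path))
        return [page.extract_text() or "" for page in reader.pages]


def parse(path_str: str) -> ParsedDoc:
    """Parse a PDF, DOCX, or TXT file without raising (hypothetical entry point)."""
    doc = ParsedDoc()
    path = Path(path_str)
    try:
        suffix = path.suffix.lower()
        if suffix == ".pdf":
            doc.pages = _pdf_pages(path)
        elif suffix == ".docx":
            import docx  # python-docx

            word = docx.Document(str(path))
            # DOCX has no fixed pagination, so the body is kept as one page here.
            doc.pages = ["\n".join(p.text for p in word.paragraphs)]
            doc.tables = [
                [[cell.text for cell in row.cells] for row in table.rows]
                for table in word.tables
            ]
            doc.title = word.core_properties.title or None
        elif suffix == ".txt":
            # Assumption: form feeds mark page breaks in plain text.
            doc.pages = path.read_text(encoding="utf-8").split("\f")
        else:
            raise ValueError(f"unsupported file type: {suffix!r}")
        doc.text = "\n".join(doc.pages)
    except Exception as exc:
        # Fail-soft: record the problem on the result instead of raising.
        doc.error = f"{type(exc).__name__}: {exc}"
    return doc
```

Because every failure path ends in doc.error, a caller can check that one field rather than wrapping each call in its own try/except.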

Quick Start

Ask your pipeline to parse the document at a local path like /path/to/customer-interview.pdf into a ParsedDoc and then read doc.text and doc.pages for downstream analysis.
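As a rough illustration of the downstream analysis step, the sketch below turns a list of page strings (such as doc.pages) into "p. N" citations for a search phrase. The helper name cite_pages is hypothetical, and the sketch assumes pages are stored in order and cited with 1-based numbers; the skill's own indexing convention may differ.

```python
from __future__ import annotations


def cite_pages(pages: list[str], phrase: str) -> list[str]:
    """Return 'p. N' citations for every page that mentions the phrase.

    Assumes pages[0] is page 1; matching is case-insensitive.
    """
    needle = phrase.casefold()
    return [
        f"p. {number}"
        for number, page in enumerate(pages, start=1)
        if needle in page.casefold()
    ]


# Example with stand-in page content in place of a real doc.pages list.
sample_pages = ["Intro", "Pricing feedback", "More PRICING notes"]
print(cite_pages(sample_pages, "pricing"))
```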

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: pi-ceo-docparser
Download link: https://github.com/CleanExpo/Pi-Dev-Ops/archive/main.zip#pi-ceo-docparser

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
