converter-pdf

Community

Turn judicial PDFs into searchable text

Author: georgemarmelstein
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

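This Skill converts judicial PDF documents into plain text (TXT), using OCR for scanned files. It also removes the noise typically added by PJe (Processo Judicial Eletrônico, Brazil's electronic court filing system) so that the extracted text is usable for downstream analysis.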
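As a rough illustration of the conversion step, the sketch below tries the text layer of a PDF first and falls back to OCR when almost no text comes back, which is the usual sign of a scanned file. It is not the Skill's actual script: the function names, the 50-characters-per-page threshold, the 300 dpi setting, and the `por` (Portuguese) Tesseract language are all assumptions made for this example. Only the library calls themselves (`pdfplumber.open`, `pdf2image.convert_from_path`, `pytesseract.image_to_string`) come from the listed dependencies.

```python
from typing import Callable, List


def needs_ocr(page_texts: List[str], min_chars_per_page: int = 50) -> bool:
    """Guess whether a PDF is scanned: very little text per page on average.

    The threshold is an assumption for this sketch, not a value from the Skill.
    """
    if not page_texts:
        return True
    total = sum(len(text.strip()) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page


def extract_digital(pdf_path: str) -> List[str]:
    """Read the embedded text layer, one string per page."""
    import pdfplumber  # imported lazily so the helpers above work without it

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None for pages without a text layer
        return [page.extract_text() or "" for page in pdf.pages]


def extract_ocr(pdf_path: str, lang: str = "por", dpi: int = 300) -> List[str]:
    """Render each page to an image and run Tesseract on it."""
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)
    return [pytesseract.image_to_string(image, lang=lang) for image in images]


def extract_pages(
    pdf_path: str,
    digital: Callable[[str], List[str]] = extract_digital,
    ocr: Callable[[str], List[str]] = extract_ocr,
    min_chars_per_page: int = 50,
) -> List[str]:
    """Try the fast digital path, then fall back to OCR if it yields too little."""
    try:
        pages = digital(pdf_path)
    except Exception:
        pages = []  # unreadable text layer: treat as scanned
    if needs_ocr(pages, min_chars_per_page):
        return ocr(pdf_path)
    return pages
```

Passing the two extractors in as parameters keeps the fallback decision separate from the libraries, so the logic can be exercised without Tesseract or Poppler installed. Returning one string per page is what makes a page-preserving TXT output possible later on.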

Core Features & Use Cases

  • OCR-based conversion (default): Uses OCR (Tesseract) for scanned judicial PDFs to produce page-preserving, cleaned TXT output.
  • Digital extraction mode: Supports a faster path for native digital PDFs using pdfplumber, with fallback to OCR when needed.
  • PJe-specific text cleaning & metrics: Removes recurring headers/footers and PJe pollution patterns, and reports quality stats such as pages, character counts, and reduction percentage.
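The cleaning and metrics feature can be pictured with a minimal sketch like the one below. It drops lines that repeat on most pages (likely headers and footers), drops lines matching a short list of noise patterns, and reports page count, character counts, and the reduction percentage. The two regular expressions, the 60% repetition threshold, and all function and key names are hypothetical and chosen only for illustration; the Skill's real rules may differ.

```python
import re
from collections import Counter
from typing import Dict, List, Set, Tuple

# Hypothetical noise patterns, for illustration only.
NOISE_PATTERNS = [
    re.compile(r"^Assinado eletronicamente por", re.IGNORECASE),
    re.compile(r"^Num\.\s*\d+\s*-\s*P[áa]g\.\s*\d+", re.IGNORECASE),
]


def find_repeated_lines(pages: List[str], min_share: float = 0.6) -> Set[str]:
    """Return non-empty lines that appear on at least min_share of the pages."""
    if len(pages) < 2:
        return set()  # repetition is meaningless for a single page
    counts: Counter = Counter()
    for page in pages:
        # count each distinct line once per page
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = min_share * len(pages)
    return {line for line, seen in counts.items() if seen >= threshold}


def clean_pages(
    pages: List[str], min_share: float = 0.6
) -> Tuple[List[str], Dict[str, float]]:
    """Remove repeated and noisy lines page by page, and compute quality stats."""
    repeated = find_repeated_lines(pages, min_share)
    cleaned: List[str] = []
    for page in pages:
        kept = []
        for line in page.splitlines():
            stripped = line.strip()
            if stripped in repeated:
                continue  # recurring header or footer
            if any(pattern.search(stripped) for pattern in NOISE_PATTERNS):
                continue  # known noise line
            kept.append(line)
        cleaned.append("\n".join(kept).strip())

    chars_before = sum(len(page) for page in pages)
    chars_after = sum(len(page) for page in cleaned)
    reduction = 100 * (chars_before - chars_after) / chars_before if chars_before else 0.0
    stats = {
        "pages": len(pages),
        "chars_before": chars_before,
        "chars_after": chars_after,
        "reduction_pct": round(reduction, 1),
    }
    return cleaned, stats
```

Working on a list of per-page strings keeps page boundaries intact, and the reduction percentage is simply the share of characters removed relative to the raw extraction.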

Quick Start

Ask the Skill to convert a judicial PDF into TXT by running the existing script in OCR mode and saving the results to your chosen output directory.

Dependency Matrix

Required Modules

pdfplumber, pdf2image, pytesseract, PyPDF2

Components

scripts, references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: converter-pdf
Download link: https://github.com/georgemarmelstein/sistema-marmelstein/archive/main.zip#converter-pdf

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.