extract-from-pdfs

Author: brunoasm

Community

Turn PDFs into structured data for analysis.

Education & Research #workflow #pdf #Claude #metadata #pipeline #data-extraction

Version: 1.0.0

Installs: 0

System Documentation

What problem does it solve?

This Skill turns large collections of scientific PDFs into structured data ready for meta-analyses, systematic reviews, and database creation, automating extraction, validation, enrichment, and export.

Core Features & Use Cases

Organize metadata from BibTeX, RIS, directories, or DOI lists
Filter papers by abstract, using Claude or a local model, to focus on relevant literature
Extract structured data from full PDFs using Claude's vision capabilities
Repair and validate outputs, enrich with external databases, and export to Python, R, CSV, Excel, or SQLite
Use case: conduct a rapid systematic review across hundreds of papers
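
As a rough illustration of the export feature listed above, the sketch below writes a few extracted records to CSV and SQLite using only the Python standard library. The record fields (paper_id, species, sample_size), the table name, and the helper functions are hypothetical and are not taken from the Skill; the Skill's own exporter is 06_export_database.py and may work quite differently.

```python
import csv
import sqlite3

# Hypothetical extracted records; the field names are illustrative only.
RECORDS = [
    {"paper_id": "paper-001", "species": "Apis mellifera", "sample_size": 120},
    {"paper_id": "paper-002", "species": "Bombus terrestris", "sample_size": 45},
]


def export_csv(records, path):
    """Write a list of flat dicts to a CSV file with a header row."""
    fields = list(records[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(records)


def export_sqlite(records, path, table="extractions"):
    """Write a list of flat dicts to a single SQLite table."""
    fields = list(records[0].keys())
    columns = ", ".join(f'"{name}"' for name in fields)
    placeholders = ", ".join("?" for _ in fields)
    conn = sqlite3.connect(path)
    try:
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})',
            [tuple(record[name] for name in fields) for record in records],
        )
        conn.commit()
    finally:
        conn.close()
```

Both helpers assume every record shares the keys of the first one; nested extraction output would need flattening before it could be written this way.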

Quick Start

Prepare your metadata and extraction schema, then run the six pipeline scripts in sequence, as outlined in the repository workflow: 01_organize_metadata.py, 02_filter_abstracts.py, 03_extract_from_pdfs.py, 04_repair_json.py, 05_validate_with_apis.py, and 06_export_database.py. Example files are provided in assets.
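
The run order above can be sketched as a small driver script. This is only an illustration under assumptions: it supposes the six scripts live in a scripts directory (the page lists "scripts" as a component but does not give paths), and it does not know what command-line arguments each script expects, so the extra_args mapping is a hypothetical placeholder to be filled in from the repository workflow. The functions build_commands and run_pipeline are not part of the Skill.

```python
import subprocess
import sys
from pathlib import Path

# Script names as listed in the Quick Start, in execution order.
PIPELINE_STEPS = [
    "01_organize_metadata.py",
    "02_filter_abstracts.py",
    "03_extract_from_pdfs.py",
    "04_repair_json.py",
    "05_validate_with_apis.py",
    "06_export_database.py",
]


def build_commands(scripts_dir="scripts", extra_args=None):
    """Return one command (a list of strings) per pipeline step.

    extra_args maps a script name to a list of arguments for it; the real
    arguments are not documented on this page, so it is empty by default.
    """
    extra_args = extra_args or {}
    return [
        [sys.executable, str(Path(scripts_dir) / step), *extra_args.get(step, [])]
        for step in PIPELINE_STEPS
    ]


def run_pipeline(scripts_dir="scripts", extra_args=None, dry_run=True):
    """Print each command; run it too unless dry_run is set."""
    for command in build_commands(scripts_dir, extra_args):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Defaulting to a dry run keeps the sketch safe to try: it only prints the commands until the paths and arguments have been confirmed against the repository.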

Dependency Matrix

Required Modules

anthropic>=0.40.0
pybtex>=0.24.0
rispy>=0.6.0
json-repair>=0.25.0
jsonschema>=4.20.0
pandas>=2.0.0
openpyxl>=3.1.0
pyreadr>=0.5.0
requests>=2.31.0
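
One way to confirm that an environment meets these minimums is to compare them with the installed versions. The sketch below does this with importlib.metadata from the standard library. It is a simplification: the helpers parse_requirement, version_tuple, and check_requirements are hypothetical, and the version comparison looks only at the first three numeric components, so pre-release or otherwise unusual version strings may be misjudged.

```python
import re
from importlib import metadata

# Minimum versions as listed under Required Modules.
REQUIREMENTS = [
    "anthropic>=0.40.0",
    "pybtex>=0.24.0",
    "rispy>=0.6.0",
    "json-repair>=0.25.0",
    "jsonschema>=4.20.0",
    "pandas>=2.0.0",
    "openpyxl>=3.1.0",
    "pyreadr>=0.5.0",
    "requests>=2.31.0",
]


def parse_requirement(spec):
    """Split 'name>=version' into (name, version)."""
    name, _, minimum = spec.partition(">=")
    return name, minimum


def version_tuple(version):
    """Crude comparison key: the first three numeric parts, zero-padded."""
    parts = [int(p) for p in re.findall(r"\d+", version)[:3]]
    return tuple(parts + [0] * (3 - len(parts)))


def check_requirements(requirements=REQUIREMENTS):
    """Return a dict mapping each package name to 'ok', 'missing', or a note."""
    report = {}
    for spec in requirements:
        name, minimum = parse_requirement(spec)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if version_tuple(installed) >= version_tuple(minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for package, status in check_requirements().items():
        print(f"{package}: {status}")
```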

Components

scripts
references
assets