scientific-papers-to-dataset
Community
Extract datasets from academic papers
System Documentation
What problem does it solve?
Many research questions require structured experimental data that exists only in scattered form across academic papers and PDFs. Finding, downloading, filtering, and extracting these results by hand is slow and error-prone. This Skill automates discovery, PDF retrieval, relevance filtering, data extraction, and citation traversal so users can assemble reproducible datasets from the literature.
Core Features & Use Cases
- Automated Paper Discovery: Query OpenAlex to find seed works and batch-fetch metadata and IDs.
- Robust PDF Retrieval: Attempt PDF downloads from OpenAlex locations, bioRxiv, and Unpaywall with rate limiting and fallbacks.
- Relevance Filtering & Extraction Pipeline: Use a queue-based BFS workflow with relevance checks and a thinking-model-driven extractor to produce per-paper JSON outputs.
- Use Case: Create a dataset of compound toxicity measurements by searching OpenAlex, downloading accessible PDFs, extracting experimental values into structured JSON, and expanding via cited and citing works.
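The queue-based BFS workflow described above can be sketched roughly as follows. This is a minimal illustration, not the Skill's actual implementation: the function `crawl` and its callable parameters (`is_relevant`, `extract`, `neighbors`) are hypothetical names standing in for the relevance check, the model-driven extractor, and the cited/citing lookup. The sketch also assumes that papers judged irrelevant are not expanded further.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def crawl(
    seeds: Iterable[str],
    is_relevant: Callable[[str], bool],
    extract: Callable[[str], dict],
    neighbors: Callable[[str], Iterable[str]],
    max_papers: int = 100,
) -> Dict[str, dict]:
    """Breadth-first traversal of a citation graph.

    All callables are hypothetical stand-ins (assumptions), not the Skill's API:
      is_relevant(work_id) -> whether the paper passes the relevance filter
      extract(work_id)     -> per-paper structured record (a JSON-like dict)
      neighbors(work_id)   -> IDs of cited and citing works
    """
    # dict.fromkeys removes duplicate seeds while keeping their order.
    queue = deque(dict.fromkeys(seeds))
    seen = set(queue)
    results: Dict[str, dict] = {}
    processed = 0

    while queue and processed < max_papers:
        work_id = queue.popleft()  # FIFO order gives breadth-first traversal
        processed += 1
        if not is_relevant(work_id):
            continue  # assumption: irrelevant papers are not expanded
        results[work_id] = extract(work_id)
        for next_id in neighbors(work_id):
            if next_id not in seen:
                seen.add(next_id)
                queue.append(next_id)
    return results
```

Passing the three steps in as callables keeps the traversal logic testable without network access or a model; the `max_papers` cap is an assumed safeguard against unbounded expansion through citations.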
Quick Start
Create a new project by describing the dataset you want, run the initial OpenAlex search to seed the queue, then process the queue to download PDFs, filter relevance, and extract structured JSON data.
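As an illustration of the seeding step, the sketch below builds an OpenAlex works search URL and pulls work IDs out of a response payload. It assumes the public OpenAlex endpoint `https://api.openalex.org/works` with its `search`, `per-page`, and `mailto` query parameters, and a response containing a `results` list whose items carry an `id` field. The helper names `build_search_url` and `seed_ids` are hypothetical and do not come from the Skill itself; the HTTP request is left to whichever client the project uses.

```python
from typing import List, Optional
from urllib.parse import urlencode

# Public OpenAlex works endpoint (assumed to be what the Skill queries).
OPENALEX_WORKS = "https://api.openalex.org/works"


def build_search_url(query: str, per_page: int = 25, mailto: Optional[str] = None) -> str:
    """Return a works search URL for the given free-text query."""
    params = {"search": query, "per-page": per_page}
    if mailto:
        # Supplying a contact email is how OpenAlex identifies polite-pool callers.
        params["mailto"] = mailto
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


def seed_ids(payload: dict) -> List[str]:
    """Collect work IDs from a decoded OpenAlex response to seed the queue."""
    return [item["id"] for item in payload.get("results", []) if "id" in item]
```

The IDs returned by `seed_ids` would form the initial queue; each is then downloaded, checked for relevance, and passed to the extractor.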
💻 Claude Code Installation
Recommended: Let Claude install it automatically. Copy and paste the text below into Claude Code.
Please help me install this Skill:
Name: scientific-papers-to-dataset
Download link: https://github.com/eamag/papers2dataset/archive/main.zip#scientific-papers-to-dataset
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.