scientific-papers-to-dataset

Community

Extract datasets from academic papers

Author: eamag
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

Many research questions require structured experimental data that exists only in scattered form across academic papers and PDFs. Manually finding, downloading, filtering, and extracting these results is slow and error-prone. This Skill automates discovery, PDF retrieval, relevance filtering, data extraction, and citation traversal so users can assemble reproducible datasets from the literature.

Core Features & Use Cases

  • Automated Paper Discovery: Query OpenAlex to find seed works and batch-fetch metadata and IDs.
  • Robust PDF Retrieval: Attempt PDF downloads from OpenAlex locations, bioRxiv, and Unpaywall with rate limiting and fallbacks.
  • Relevance Filtering & Extraction Pipeline: Use a queue-based BFS workflow with relevance checks and a thinking-model-driven extractor to produce per-paper JSON outputs.
  • Use Case: Create a dataset of compound toxicity measurements by searching OpenAlex, downloading accessible PDFs, extracting experimental values into structured JSON, and expanding via cited and citing works.
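The queue-based BFS workflow described above can be sketched as follows. This is a minimal illustration, not the Skill's actual code: `process_queue`, its parameters, and the three callables (`is_relevant`, `extract`, `neighbors`) are hypothetical names, and the assumption that irrelevant papers are not expanded further is a design choice made here for the example.

```python
from collections import deque
from typing import Callable, Iterable


def process_queue(
    seeds: Iterable[str],
    is_relevant: Callable[[str], bool],
    extract: Callable[[str], dict],
    neighbors: Callable[[str], Iterable[str]],
    max_papers: int = 100,
) -> dict[str, dict]:
    """Breadth-first traversal over papers (all names here are hypothetical).

    seeds       -- initial paper IDs, e.g. from an OpenAlex search
    is_relevant -- relevance check for one paper
    extract     -- returns the structured (JSON-like) output for one paper
    neighbors   -- returns IDs of cited and citing works for one paper
    max_papers  -- upper bound on how many papers are dequeued
    """
    queue: deque[str] = deque()
    seen: set[str] = set()
    for paper_id in seeds:
        if paper_id not in seen:  # drop duplicate seeds
            seen.add(paper_id)
            queue.append(paper_id)

    results: dict[str, dict] = {}
    processed = 0
    while queue and processed < max_papers:
        paper_id = queue.popleft()  # FIFO order gives breadth-first expansion
        processed += 1
        if not is_relevant(paper_id):
            continue  # assumption: irrelevant papers are not expanded
        results[paper_id] = extract(paper_id)
        for next_id in neighbors(paper_id):
            if next_id not in seen:  # never enqueue the same paper twice
                seen.add(next_id)
                queue.append(next_id)
    return results
```

Passing the relevance check, extractor, and citation lookup as callables keeps the traversal testable without network access; in a real run they would wrap the PDF download, the model-driven extraction, and the citation lookups.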

Quick Start

Create a new project by describing the dataset you want, run the initial OpenAlex search to seed the queue, then process the queue to download PDFs, filter relevance, and extract structured JSON data.
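The seeding step could look roughly like the sketch below. It assumes the public OpenAlex works endpoint with its `search` and `per-page` query parameters, and it assumes the response carries a `results` list whose items have `id`, `title`, `doi`, and `best_oa_location.pdf_url` fields; check these against the OpenAlex documentation before relying on them. The helper names `build_search_url` and `seeds_from_response` are hypothetical and are not taken from the Skill's scripts.

```python
from urllib.parse import urlencode

# Assumed endpoint of the public OpenAlex API.
OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_search_url(query: str, per_page: int = 25) -> str:
    """Build a works-search URL (hypothetical helper)."""
    params = {"search": query, "per-page": per_page}
    return f"{OPENALEX_WORKS_URL}?{urlencode(params)}"


def seeds_from_response(payload: dict) -> list[dict]:
    """Turn a decoded works response into queue seeds (hypothetical helper).

    Only the fields needed to seed the queue are kept; a missing or null
    open-access location simply yields pdf_url=None.
    """
    seeds = []
    for work in payload.get("results", []):
        location = work.get("best_oa_location") or {}
        seeds.append(
            {
                "id": work.get("id"),
                "title": work.get("title"),
                "doi": work.get("doi"),
                "pdf_url": location.get("pdf_url"),
            }
        )
    return seeds
```

With the required `httpx` module, the payload could be fetched with something like `httpx.get(build_search_url("compound toxicity")).json()`, and the resulting seed IDs pushed onto the processing queue.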

Dependency Matrix

Required Modules

httpx

Components

scripts, references

💻 Claude Code Installation

Recommended: Let Claude install the Skill automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: scientific-papers-to-dataset
Download link: https://github.com/eamag/papers2dataset/archive/main.zip#scientific-papers-to-dataset

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
