Journal Research
An automated pipeline that scrapes eight sport psychology journals, classifies articles by topic, and maintains a searchable catalog going back to 2015.
Situation
A sport psychology consultant needed to stay current across eight academic journals spanning multiple publishers, each with a different publishing schedule (quarterly, bimonthly, or continuous). The field is active, and relevant articles publish constantly, but finding them was manual work.
Complication
Manually checking each journal's website every month was tedious and easy to skip. Relevant articles were getting missed. But beyond tracking, the real problem was filtering: most published articles fall outside the consultant's four focus areas. Without classification, every new batch required manually scanning titles and abstracts to find the two or three that actually mattered.
What I Did
I built a two-phase pipeline. Phase one queries the CrossRef API for article metadata (title, authors, DOI, abstract, volume, issue) filtered by journal ISSN and date range. Phase two enriches records missing abstracts by visiting the publisher page directly with Playwright and extracting the text.
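Phase one can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`buildCrossrefUrl`, `toRecord`, `needsAbstract`, `fetchPage`) and the record field names are assumptions, and any ISSN passed in would come from the publishing schedule. It relies on CrossRef's `/journals/{issn}/works` route with the `from-pub-date` and `until-pub-date` filters, and on the global `fetch` available in Node 18+.

```javascript
// Sketch of phase one: build a CrossRef query and normalize one result.
// All function and field names here are illustrative assumptions.

const CROSSREF_BASE = "https://api.crossref.org";

// Build the works URL for one journal and date window (dates as YYYY-MM-DD).
function buildCrossrefUrl(issn, fromDate, untilDate, rows = 100) {
  const url = new URL(`${CROSSREF_BASE}/journals/${issn}/works`);
  url.searchParams.set("filter", `from-pub-date:${fromDate},until-pub-date:${untilDate}`);
  url.searchParams.set("rows", String(rows));
  url.searchParams.set("cursor", "*"); // ask CrossRef for a cursor to page through results
  return url.toString();
}

// Flatten a CrossRef work item into the fields the catalog stores.
function toRecord(item) {
  const authors = (item.author || [])
    .map((a) => [a.given, a.family].filter(Boolean).join(" "))
    .join("; ");
  // CrossRef abstracts, when present, arrive as JATS XML; strip the tags.
  const abstract = (item.abstract || "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  return {
    doi: item.DOI,
    title: (item.title || [])[0] || "", // CrossRef returns titles as an array
    authors,
    volume: item.volume || "",
    issue: item.issue || "",
    abstract,
  };
}

// Records with no abstract are the ones phase two would visit with Playwright.
function needsAbstract(record) {
  return record.abstract === "";
}

// Fetch one page of results and normalize each item.
async function fetchPage(url) {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`CrossRef request failed: ${res.status}`);
  const body = await res.json();
  return body.message.items.map(toRecord);
}
```

Keeping URL construction and record normalization as pure functions means they can be checked without touching the network; only `fetchPage` performs I/O.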
A separate classification tool processes each article against four practitioner-defined topics using the Gemini API with structured JSON output. Each article gets a relevance flag per topic. The classifier checkpoints every 20 articles and supports resume-on-failure.
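The checkpoint-and-resume behavior could look something like the sketch below. It is an assumption-laden illustration rather than the tool's real implementation: `classifyAll`, `validateFlags`, the `T1` to `T4` keys, and the `save` callback are hypothetical names, and the Gemini call itself is left out and passed in as `classifyOne` so the loop stays independent of any particular SDK.

```javascript
// Sketch of a classifier loop that checkpoints every N articles and can resume.
// Names and the shape of the flags object are illustrative assumptions.

const TOPICS = ["T1", "T2", "T3", "T4"];

// Reject model output that is missing a topic or uses a non-boolean flag.
function validateFlags(flags) {
  for (const topic of TOPICS) {
    if (typeof flags[topic] !== "boolean") {
      throw new Error(`Missing or non-boolean flag for ${topic}`);
    }
  }
  return flags;
}

// articles:    array of { doi, ... }
// classifyOne: async (article) => { T1: bool, T2: bool, T3: bool, T4: bool }
// done:        results loaded from an earlier checkpoint, keyed by DOI
// save:        callback that persists the results so far (for example to a file)
async function classifyAll(articles, classifyOne, { done = {}, save, every = 20 } = {}) {
  const results = { ...done };
  let sinceSave = 0;
  for (const article of articles) {
    if (results[article.doi]) continue; // already classified in an earlier run
    const flags = await classifyOne(article); // may throw on API errors
    results[article.doi] = validateFlags(flags);
    sinceSave += 1;
    if (sinceSave === every) {
      await save(results); // checkpoint
      sinceSave = 0;
    }
  }
  await save(results); // final write once every article is done
  return results;
}
```

If a run fails partway through, at most `every - 1` classifications are lost; passing the last saved results back in as `done` skips everything already classified.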
Everything runs as Node.js CLI tools with configurable date ranges. A publishing calendar tracks which journals publish when, and a status checker shows a full-year calendar view of what has been scraped and what is pending.
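To illustrate how a calendar view might be derived from the publishing schedule, here is a small sketch. The mapping from frequency to expected months, the `issn:YYYY-MM` key format, and the status labels are all assumptions made for this example; the actual layout of `publishing-schedule.json` is not shown in this write-up.

```javascript
// Sketch of a full-year status row for one journal.
// The month mapping below is an assumed convention, not a publisher fact.

const MONTHS_BY_FREQUENCY = {
  quarterly: [3, 6, 9, 12],
  bimonthly: [2, 4, 6, 8, 10, 12],
  continuous: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
};

// journal: { issn, frequency }
// scraped: Set of "issn:YYYY-MM" keys for months already scraped
// today:   Date used to separate overdue months from future ones
function yearStatus(journal, year, scraped, today) {
  const expected = MONTHS_BY_FREQUENCY[journal.frequency];
  if (!expected) throw new Error(`Unknown frequency: ${journal.frequency}`);
  const currentYear = today.getUTCFullYear();
  const currentMonth = today.getUTCMonth() + 1; // getUTCMonth() is zero-based
  const row = [];
  for (let month = 1; month <= 12; month++) {
    if (!expected.includes(month)) {
      row.push("-"); // no issue expected this month
      continue;
    }
    const key = `${journal.issn}:${year}-${String(month).padStart(2, "0")}`;
    const isFuture = year > currentYear || (year === currentYear && month > currentMonth);
    row.push(scraped.has(key) ? "scraped" : isFuture ? "upcoming" : "pending");
  }
  return row;
}
```

Passing `today` in as a parameter, rather than calling `new Date()` inside the function, keeps the output reproducible for any given date.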
journal-research/
├── journal-scraper.js          Phase 1: CrossRef + Phase 2: Playwright
├── classify-articles.js        Gemini-powered topic classification
│   └── --resume                Checkpoint recovery
├── check-status.js             Calendar view + next-check dates
├── publishing-schedule.json    8 journals, ISSNs, frequencies
└── data/
    ├── journal database.csv    Main DB (2015-present, 3,000+ rows)
    └── journal_classified.csv  Classified output
Classification Topics:
T1: Program Creation & Evaluation
T2: Practitioners & Professional Practice
T3: Student-Athletes & University Sport
T4: Coaches & Sport Psychology
Result
3,000+ articles cataloged and classified across 4 topic areas. The consultant gets a filtered list of relevant articles, not a raw feed requiring manual review. The pipeline runs monthly on the publication calendar. The journal tracking that used to take hours now happens in the background.
8 journals: Tracked across 4 publishers
3,000+ articles: Cataloged from 2015 to present
4 topic classifiers: Multi-label with negative detection
Monthly automation: Calendar-driven scrape scheduling
Checkpoint recovery: Resume-on-failure for long runs
3 CLI tools: Scraper, classifier, status checker
Tech Stack
Scraping
- CrossRef API
- Playwright
- Chromium
- HTTP polling
Classification
- Gemini API
- Multi-label detection
- JSON schema enforcement
Data
- CSV pipelines
- Publishing calendar
- Checkpoint/resume
Infrastructure
- Node.js CLI
- Cron scheduling
- Git-tracked output