Journal Research

An automated pipeline that scrapes eight sport psychology journals, classifies articles by topic, and maintains a searchable catalog going back to 2015.

Situation

A sport psychology consultant needed to stay current across eight academic journals spanning multiple publishers, each with a different publishing schedule (quarterly, bimonthly, or continuous). The field is active and new articles appear constantly, but finding the relevant ones was manual work.

Complication

Manually checking each journal's website every month was tedious and easy to skip. Relevant articles were getting missed. But beyond tracking, the real problem was filtering: most published articles fall outside the consultant's four focus areas. Without classification, every new batch required manually scanning titles and abstracts to find the two or three that actually mattered.

What I Did

I built a two-phase pipeline. Phase one queries the CrossRef API for article metadata (title, authors, DOI, abstract, volume, issue) filtered by journal ISSN and date range. Phase two enriches records missing abstracts by visiting the publisher page directly with Playwright and extracting the text.
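
As a sketch of the phase-one query (assuming Node 18+ with global `fetch`; `crossrefWorksUrl` and `fetchWorks` are illustrative helpers, not the actual code, though `filter`, `select`, `from-pub-date`, and `until-pub-date` are real CrossRef API parameters):

```javascript
// Build a CrossRef /journals/{issn}/works URL filtered by publication date.
function crossrefWorksUrl(issn, fromDate, untilDate, rows = 100) {
  const filter = `from-pub-date:${fromDate},until-pub-date:${untilDate}`;
  const select = 'DOI,title,author,abstract,volume,issue';
  return `https://api.crossref.org/journals/${issn}/works` +
         `?filter=${filter}&select=${select}&rows=${rows}`;
}

// Fetch one page of article metadata for a journal (Node 18+ global fetch).
async function fetchWorks(issn, fromDate, untilDate) {
  const res = await fetch(crossrefWorksUrl(issn, fromDate, untilDate));
  if (!res.ok) throw new Error(`CrossRef responded ${res.status}`);
  const body = await res.json();
  return body.message.items; // array of article metadata records
}
```

Records that come back without an abstract are the ones phase two revisits with Playwright.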

A separate classification tool processes each article against four practitioner-defined topics using the Gemini API with structured JSON output. Each article gets a relevance flag per topic. The classifier checkpoints every 20 articles and supports resume-on-failure.

Everything runs as Node.js CLI tools with configurable date ranges. A publishing calendar tracks which journals publish when, and a status checker shows a full-year calendar view of what has been scraped and what is pending.
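
The shape of the calendar file might look like this (field names are assumptions, and the entries below are placeholders, not the actual eight journals or their ISSNs):

```json
{
  "journals": [
    {
      "name": "Example Journal of Sport Psychology",
      "issn": "0000-0000",
      "frequency": "quarterly",
      "publishMonths": [1, 4, 7, 10]
    },
    {
      "name": "Example Applied Sport Psychology Review",
      "issn": "0000-0000",
      "frequency": "continuous"
    }
  ]
}
```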

journal-research/
├── journal-scraper.js         Phase 1: CrossRef + Phase 2: Playwright
├── classify-articles.js       Gemini-powered topic classification
│   └── --resume               Checkpoint recovery
├── check-status.js            Calendar view + next-check dates
├── publishing-schedule.json   8 journals, ISSNs, frequencies
└── data/
    ├── journal database.csv        Main DB (2015-present, 3,000+ rows)
    └── journal_classified.csv      Classified output
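
The next-check dates that check-status.js reports might reduce to a computation like this (a sketch; the function name and month-based model are assumptions):

```javascript
// Given a journal's publishing frequency and the last month it was scraped
// (1-12), return the next month that should be checked. The three cadences
// mirror the quarterly / bimonthly / continuous schedules tracked in
// publishing-schedule.json.
function nextCheckMonth(frequency, lastCheckedMonth) {
  const step = { continuous: 1, bimonthly: 2, quarterly: 3 }[frequency];
  if (!step) throw new Error(`unknown frequency: ${frequency}`);
  return ((lastCheckedMonth - 1 + step) % 12) + 1; // wraps December -> January
}
```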

Classification Topics:
  T1: Program Creation & Evaluation
  T2: Practitioners & Professional Practice
  T3: Student-Athletes & University Sport
  T4: Coaches & Sport Psychology

Result

3,000+ articles cataloged and classified across 4 topic areas. The consultant gets a filtered list of relevant articles instead of a raw feed requiring manual review. The pipeline runs monthly, driven by the publishing calendar, and journal tracking that used to take hours now happens in the background.

  • 8 journals: tracked across 4 publishers
  • 3,000+ articles: cataloged from 2015 to present
  • 4 topic classifiers: multi-label with negative detection
  • Monthly automation: calendar-driven scrape scheduling
  • Checkpoint recovery: resume-on-failure for long runs
  • 3 CLI tools: scraper, classifier, status checker

Tech Stack

Scraping

  • CrossRef API
  • Playwright
  • Chromium
  • HTTP polling

Classification

  • Gemini API
  • Multi-label detection
  • JSON schema enforcement

Data

  • CSV pipelines
  • Publishing calendar
  • Checkpoint/resume

Infrastructure

  • Node.js CLI
  • Cron scheduling
  • Git-tracked output