
P0 Actionable Fixes - What to Do

Date: November 27, 2025

Status: ACTIONABLE


Summary: What's Broken and What's Fixable

| Tool | Problem | Fixable? | How |
|------|---------|----------|-----|
| BioRxiv | API has NO search endpoint | NO | Replace with Europe PMC |
| PubMed | No query preprocessing | YES | Add query cleaner |
| ClinicalTrials | No filters applied | YES | Add filter params |
| Magentic Framework | Nothing wrong | N/A | Already working |

FIX 1: Replace BioRxiv with Europe PMC (30 min)

Why BioRxiv Can't Be Fixed

The bioRxiv API exposes papers only through a date-range listing endpoint:

https://api.biorxiv.org/details/{server}/{date-range}/{cursor}/json

This returns papers by date, not by keyword. There is NO search endpoint.

Proof: I queried medrxiv/2024-01-01/2024-01-02 and got:

  • "Global risk of Plasmodium falciparum" (malaria)
  • "Multiple Endocrine Neoplasia in India"
  • "Acupuncture for Acute Musculoskeletal Pain"

None of these are about Long COVID because the API doesn't search.
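To make the limitation concrete, here is a minimal sketch of what "searching" bioRxiv would have to look like: build a date-window URL, then match keywords on the client. `details_url` and `filter_titles` are hypothetical helpers written for illustration, not part of the codebase, and the sample titles are the three quoted above.

```python
"""Sketch: the bioRxiv details endpoint is addressed by date window, not by keyword."""

API_ROOT = "https://api.biorxiv.org/details"


def details_url(server: str, start: str, end: str, cursor: int = 0) -> str:
    """Build a details URL for one server and date window (hypothetical helper)."""
    return f"{API_ROOT}/{server}/{start}/{end}/{cursor}/json"


def filter_titles(titles: list[str], keyword: str) -> list[str]:
    """Client-side keyword match: the only 'search' possible on a date listing."""
    needle = keyword.lower()
    return [t for t in titles if needle in t.lower()]


# Titles quoted above from the medrxiv/2024-01-01/2024-01-02 window.
SAMPLE_TITLES = [
    "Global risk of Plasmodium falciparum",
    "Multiple Endocrine Neoplasia in India",
    "Acupuncture for Acute Musculoskeletal Pain",
]

if __name__ == "__main__":
    print(details_url("medrxiv", "2024-01-01", "2024-01-02"))
    print(filter_titles(SAMPLE_TITLES, "covid"))  # no title mentions COVID
```

Covering a topic this way would mean paging through every date window and filtering locally, which is why replacing the tool is more practical than patching it.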

Europe PMC Has Search + Preprints

curl "https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=long+covid+treatment&resultType=core&pageSize=3&format=json"

Returns 283,058 results including:

  • "Long COVID Treatment No Silver Bullets, Only a Few Bronze BBs" ✅
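The same request can be assembled in Python with only the standard library, which is handy for checking parameters before wiring up the tool. This is a sketch: `build_search_url` and its `preprints_only` flag are hypothetical, and the `SRC:PPR` filter is an assumption about Europe PMC's query syntax for preprint records that should be verified against the live API.

```python
"""Sketch: build the Europe PMC search URL used in the curl example."""
from urllib.parse import urlencode

BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(query: str, page_size: int = 3, preprints_only: bool = False) -> str:
    """Return the full search URL for a query (hypothetical helper)."""
    # Assumption: "SRC:PPR" restricts results to preprint records.
    if preprints_only:
        query = f"({query}) AND SRC:PPR"
    params = {
        "query": query,
        "resultType": "core",
        "pageSize": page_size,
        "format": "json",
    }
    # urlencode turns spaces into "+", matching the curl command above.
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("long covid treatment"))
    print(build_search_url("long covid treatment", preprints_only=True))
```

With the default arguments the function reproduces the URL from the curl command exactly.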

The Fix

Replace src/tools/biorxiv.py with src/tools/europepmc.py:

"""Europe PMC preprint and paper search tool."""

import httpx
from src.utils.models import Citation, Evidence

class EuropePMCTool:
    """Search Europe PMC for papers and preprints."""

    BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

    @property
    def name(self) -> str:
        return "europepmc"

    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        """Search Europe PMC (includes preprints from bioRxiv/medRxiv)."""
        params = {
            "query": query,
            "resultType": "core",
            "pageSize": max_results,
            "format": "json",
        }

        async with httpx.AsyncClient(timeout=30.0) as client:
            response = await client.get(self.BASE_URL, params=params)
            response.raise_for_status()

            data = response.json()
            results = data.get("resultList", {}).get("result", [])

            return [self._to_evidence(r) for r in results]

    def _to_evidence(self, result: dict) -> Evidence:
        """Convert Europe PMC result to Evidence."""
        title = result.get("title", "Untitled")
        abstract = result.get("abstractText", "No abstract")
        doi = result.get("doi", "")
        pub_year = result.get("pubYear", "Unknown")
        source = result.get("source", "europepmc")

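        # Mark preprints. Europe PMC uses source "PPR" for preprint records;
        # the pubType casing is not guaranteed, so compare case-insensitively.
        pub_types = result.get("pubTypeList", {}).get("pubType", [])
        is_preprint = source == "PPR" or any(
            str(t).lower() == "preprint" for t in pub_types
        )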

        content = f"{'[PREPRINT] ' if is_preprint else ''}{abstract[:1800]}"

        return Evidence(
            content=content,
            citation=Citation(
                source="europepmc" if not is_preprint else "preprint",
                title=title[:500],
                url=f"https://doi.org/{doi}" if doi else "",
                date=str(pub_year),
            ),
            relevance=0.75 if is_preprint else 0.9,
        )

FIX 2: Add PubMed Query Preprocessing (1 hour)

Current Problem

User enters: What medications show promise for Long COVID?

PubMed receives: What medications show promise for Long COVID?

The query is passed through unchanged, so question words such as "what" and "show" pollute the search.

The Fix

Add src/tools/query_utils.py:

"""Query preprocessing utilities."""

import re

# Question words to remove
QUESTION_WORDS = {
    "what", "which", "how", "why", "when", "where", "who",
    "is", "are", "can", "could", "would", "should", "do", "does",
    "show", "promise", "help", "treat", "cure",
}

# Medical synonyms to expand
SYNONYMS = {
    "long covid": ["long COVID", "PASC", "post-COVID syndrome", "post-acute sequelae"],
    "alzheimer": ["Alzheimer's disease", "AD", "Alzheimer dementia"],
    # Keep the original term in its own expansion so it is not dropped from the query
    "cancer": ["cancer", "neoplasm", "tumor", "malignancy", "carcinoma"],
}

def preprocess_pubmed_query(raw_query: str) -> str:
    """Convert natural language to cleaner PubMed query."""
    # Lowercase
    query = raw_query.lower()

    # Remove question marks
    query = query.replace("?", "")

    # Remove question words
    words = query.split()
    words = [w for w in words if w not in QUESTION_WORDS]
    query = " ".join(words)

    # Expand synonyms
    for term, expansions in SYNONYMS.items():
        if term in query:
            # Add OR clause
            expansion = " OR ".join([f'"{e}"' for e in expansions])
            query = query.replace(term, f"({expansion})")

    return query.strip()
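The function above relies on plain substring replacement, so a term like "cancer" would also match inside "anticancer", and punctuation other than "?" stays attached to words. A more defensive variant could look like the sketch below. It is not the project's implementation: `preprocess_query`, `STOP_WORDS`, and the trimmed `SYNONYMS` table are illustrative names, and the narrower stop list (which keeps content verbs like "treat" and drops "for") is an assumption about what is desirable.

```python
"""Sketch: a more defensive variant of the query preprocessor."""
import re

# Assumption: a narrower stop list that keeps content verbs such as "treat".
STOP_WORDS = {
    "what", "which", "how", "why", "when", "where", "who",
    "is", "are", "can", "could", "would", "should", "do", "does",
    "show", "promise", "for",
}

SYNONYMS = {
    "long covid": ["long COVID", "PASC", "post-COVID syndrome", "post-acute sequelae"],
    "cancer": ["neoplasm", "tumor", "malignancy", "carcinoma"],
}

# One alternation over all terms, longest first, matched on word boundaries.
_TERM_RE = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in sorted(SYNONYMS, key=len, reverse=True)) + r")\b"
)


def _expand(match: re.Match) -> str:
    """Replace one matched term with a quoted OR group that still contains the term."""
    term = match.group(1)
    options = list(SYNONYMS[term])
    if term not in (o.lower() for o in options):
        options.insert(0, term)
    return "(" + " OR ".join(f'"{o}"' for o in options) + ")"


def preprocess_query(raw_query: str) -> str:
    """Lowercase, strip punctuation and stop words, then expand synonyms once."""
    words = re.findall(r"[a-z0-9'-]+", raw_query.lower())
    kept = " ".join(w for w in words if w not in STOP_WORDS)
    # A single regex pass means text inside an inserted expansion is never re-expanded.
    return _TERM_RE.sub(_expand, kept)


if __name__ == "__main__":
    print(preprocess_query("What medications show promise for Long COVID?"))
    print(preprocess_query("anticancer drugs"))  # left untouched by the word-boundary match
```

The word-boundary regex and the single substitution pass are the two design choices that matter here; the stop list and synonym table can be tuned independently.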

Then update src/tools/pubmed.py:

from src.tools.query_utils import preprocess_pubmed_query

async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
    # Preprocess query
    clean_query = preprocess_pubmed_query(query)

    search_params = self._build_params(
        db="pubmed",
        term=clean_query,  # Use cleaned query
        retmax=max_results,
        sort="relevance",
    )
    # ... rest unchanged

FIX 3: Add ClinicalTrials.gov Filters (30 min)

Current Problem

The tool returns ALL trials, including withdrawn, terminated, and observational studies.

The Fix

The API supports filter.overallStatus and other filters. Update src/tools/clinicaltrials.py:

async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
    params: dict[str, str | int] = {
        "query.term": query,
        "pageSize": min(max_results, 100),
        "fields": "|".join(self.FIELDS),
        # ADD THESE FILTERS:
        "filter.overallStatus": "COMPLETED|RECRUITING|ACTIVE_NOT_RECRUITING",
        # Only interventional studies (not observational)
        "aggFilters": "studyType:int",
    }
    # ... rest unchanged

Note: I tested the API and it does support filtering, but the exact syntax differs slightly from what is shown above. Verify the parameter names and values against the API docs before merging.
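Until the server-side syntax is confirmed, the same rule can be applied on the client after the response arrives. The sketch below assumes records follow the ClinicalTrials.gov v2 layout, with the status under `protocolSection.statusModule.overallStatus` and the study type under `protocolSection.designModule.studyType`; those paths would also need to be present in `FIELDS`. `keep_study`, `filter_studies`, and the fixture records are hypothetical.

```python
"""Sketch: client-side fallback for the status and study-type filters."""

ALLOWED_STATUSES = {"COMPLETED", "RECRUITING", "ACTIVE_NOT_RECRUITING"}


def keep_study(study: dict) -> bool:
    """Return True for interventional studies in an allowed status.

    Assumption: the record uses the v2 nesting described in the lead-in.
    """
    protocol = study.get("protocolSection", {})
    status = protocol.get("statusModule", {}).get("overallStatus")
    study_type = protocol.get("designModule", {}).get("studyType")
    return status in ALLOWED_STATUSES and study_type == "INTERVENTIONAL"


def filter_studies(studies: list[dict]) -> list[dict]:
    """Drop withdrawn, terminated, and observational studies."""
    return [s for s in studies if keep_study(s)]


def make_fixture(status: str, study_type: str) -> dict:
    """Build a made-up record for testing; not real trial data."""
    return {
        "protocolSection": {
            "statusModule": {"overallStatus": status},
            "designModule": {"studyType": study_type},
        }
    }


if __name__ == "__main__":
    fixtures = [
        make_fixture("COMPLETED", "INTERVENTIONAL"),
        make_fixture("WITHDRAWN", "INTERVENTIONAL"),
        make_fixture("RECRUITING", "OBSERVATIONAL"),
    ]
    print(len(filter_studies(fixtures)))  # only the first fixture passes
```

Filtering on the client costs extra bandwidth, so it works best as a safety net alongside the server-side parameters rather than as a replacement for them.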


What NOT to Change

Microsoft Agent Framework - WORKING

I verified:

from agent_framework import MagenticBuilder, ChatAgent
from agent_framework.openai import OpenAIChatClient
# All imports OK

# MagenticOrchestrator is this project's own wrapper class (its import is not shown here)
orchestrator = MagenticOrchestrator(max_rounds=2)
workflow = orchestrator._build_workflow()
# Workflow built successfully

The Magentic agents are correctly wired:

  • SearchAgent → GPT-5.1 ✅
  • JudgeAgent → GPT-5.1 ✅
  • HypothesisAgent → GPT-5.1 ✅
  • ReportAgent → GPT-5.1 ✅

The framework is fine. The tools it calls are broken.


Priority Order

  1. Replace BioRxiv → Immediate, fundamental
  2. Add PubMed preprocessing → High impact, easy
  3. Add ClinicalTrials filters → Medium impact, easy

Test After Fixes

# Test Europe PMC
uv run python -c "
import asyncio
from src.tools.europepmc import EuropePMCTool
tool = EuropePMCTool()
results = asyncio.run(tool.search('long covid treatment', 3))
for r in results:
    print(r.citation.title)
"

# Test PubMed with preprocessing
uv run python -c "
from src.tools.query_utils import preprocess_pubmed_query
q = 'What medications show promise for Long COVID?'
print(preprocess_pubmed_query(q))
# Should output: medications for (\"long COVID\" OR \"PASC\" OR \"post-COVID syndrome\" OR \"post-acute sequelae\")
"

After These Fixes

The Magentic workflow will then run as follows:

  1. SearchAgent calls search_pubmed("long COVID treatment") → Gets RELEVANT papers
  2. SearchAgent calls search_preprints("long COVID treatment") → Gets RELEVANT preprints via Europe PMC
  3. SearchAgent calls search_clinical_trials("long COVID") → Gets INTERVENTIONAL trials only
  4. JudgeAgent evaluates GOOD evidence
  5. HypothesisAgent generates hypotheses from GOOD evidence
  6. ReportAgent synthesizes a GOOD report

The framework will work once we feed it good data.