Geoffrey Kip committed
Commit 507be68 · 0 Parent(s)

Initial Release
.dockerignore ADDED
@@ -0,0 +1,47 @@
+ # Git
+ .git
+ .gitignore
+
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ env/
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+
+ # Virtual Environment
+ venv/
+ .venv/
+
+ # Environment Variables (CRITICAL: Do not include secrets)
+ .env
+ .env.local
+
+ # IDE
+ .vscode/
+ .idea/
+
+ # Mac
+ .DS_Store
+
+ # Logs
+ *.log
+
+ # Temporary
+ *.tmp
.flake8 ADDED
@@ -0,0 +1,4 @@
+ [flake8]
+ max-line-length = 120
+ extend-ignore = E203
+ exclude = venv, .git, __pycache__, build, dist
.gitattributes ADDED
@@ -0,0 +1 @@
+ ct_gov_lancedb/**/* filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,62 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ *.egg-info/
+ *.pyc
+ *.pyo
+
+ # Build artifacts
+ dist/
+ build/
+ *.spec
+
+ # Virtual Environment
+ .venv/
+ venv/
+ env/
+ ENV/
+
+ # Environment variables
+ .env
+ .env.local
+
+ # Session files
+ amazon_session.json
+
+ # Database files
+ agent_data.db
+ *.db
+ *.db-journal
+ ct_gov_lancedb/
+
+ # Chrome/Browser session data
+ user_session/
+
+ # IDE
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+ *~
+ *.code-workspace
+
+ # OS
+ .DS_Store
+ Thumbs.db
+ .DS_Store?
+
+ # Logs
+ *.log
+
+ # Dev Containers
+ .devcontainer/
+
+ # macOS App Bundle (generated, but keep source)
+ *.app/
+
+ # Temporary files
+ *.tmp
+ *.temp
DEPLOYMENT.md ADDED
@@ -0,0 +1,82 @@
+ # Deployment Guide: Hugging Face Spaces 🐳
+
+ This guide walks you through deploying the **Clinical Trial Inspector Agent** to **Hugging Face Spaces** using Docker.
+
+ ## Prerequisites
+
+ 1. **Hugging Face Account**: [Sign up here](https://huggingface.co/join).
+ 2. **Git LFS (Large File Storage)**: Required to upload the database (~700MB).
+    * **Mac**: `brew install git-lfs`
+    * **Windows**: Download from [git-lfs.com](https://git-lfs.com/)
+    * **Linux**: `sudo apt-get install git-lfs`
+
+ ## Step 1: Authentication (Crucial!) 🔑
+
+ Hugging Face requires an **Access Token** for Git operations (passwords don't work).
+
+ 1. Go to **[Settings > Access Tokens](https://huggingface.co/settings/tokens)**.
+ 2. Click **Create new token**.
+ 3. **Type**: Select **Write** (important!).
+ 4. Copy the token (starts with `hf_...`).
+ 5. **Usage**: When `git push` asks for a password, **paste this token**.
+
+ ## Step 2: Create a New Space
+
+ 1. Go to [huggingface.co/new-space](https://huggingface.co/new-space).
+ 2. **Space Name**: e.g., `clinical-trial-agent`.
+ 3. **License**: `MIT` (or your choice).
+ 4. **SDK**: Select **Docker**.
+ 5. **Visibility**: Public or Private.
+ 6. Click **Create Space**.
+
+ ## Step 3: Prepare Your Local Repo
+
+ You need to initialize Git LFS to track the large LanceDB files.
+
+ ```bash
+ # Initialize LFS
+ git lfs install
+
+ # Track the LanceDB files
+ git lfs track "ct_gov_lancedb/**/*"
+ git add .gitattributes
+ ```
+
+ ## Step 4: Push to Hugging Face
+
+ You can either push your existing repo or clone the Space and copy files over. Pushing the existing repo is easier:
+
+ ```bash
+ # Add the Space as a remote (replace YOUR_USERNAME and SPACE_NAME)
+ git remote add space https://huggingface.co/spaces/YOUR_USERNAME/SPACE_NAME
+
+ # Push the main branch
+ git push space main
+ # OR if you are on a feature branch:
+ git push space feature/deploy_app:main
+ ```
+
+ > **Note**: The first push will take time as it uploads the ~700MB database.
+
+ ## Step 5: Configure Secrets (Optional but Recommended)
+
+ To run in **Admin Mode** (no user prompt for an API key):
+
+ 1. Go to your Space's **Settings** tab.
+ 2. Scroll to **Variables and secrets**.
+ 3. Click **New secret**.
+ 4. **Name**: `GOOGLE_API_KEY`
+ 5. **Value**: Your Google API Key (starts with `AIza...`).
+
+ ## Step 6: Verify Deployment
+
+ 1. Go to the **App** tab in your Space.
+ 2. You should see "Building..." in the logs.
+ 3. Once built, the app will launch! 🚀
+
+ ---
+
+ ## Troubleshooting
+
+ * **"LFS upload failed"**: Ensure you ran `git lfs install` and `git lfs track` before committing the database.
+ * **"Runtime Error"**: Check the **Logs** tab. If it says "API Key Missing", ensure you set the Secret or enter a key in the UI.
Dockerfile ADDED
@@ -0,0 +1,33 @@
+ # Use an official Python runtime as a parent image
+ FROM python:3.10-slim
+
+ # Set the working directory in the container
+ WORKDIR /app
+
+ # Install system dependencies
+ # build-essential is often needed for compiling python packages
+ # git is needed if you install packages from git
+ RUN apt-get update && apt-get install -y \
+     build-essential \
+     git \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy the requirements file into the container at /app
+ COPY requirements.txt .
+
+ # Install any needed packages specified in requirements.txt
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy the current directory contents into the container at /app
+ COPY . .
+
+ # Expose port 8501 for Streamlit
+ EXPOSE 8501
+
+ # Define environment variables for Streamlit to run in headless mode
+ ENV STREAMLIT_SERVER_HEADLESS=true
+ ENV STREAMLIT_SERVER_PORT=8501
+ ENV STREAMLIT_SERVER_ADDRESS=0.0.0.0
+
+ # Run the application
+ CMD ["streamlit", "run", "ct_agent_app.py"]
README.md ADDED
@@ -0,0 +1,240 @@
+ # Clinical Trial Inspector Agent 🕵️‍♂️💊
+
+ **Clinical Trial Inspector** is an advanced AI agent designed to revolutionize how researchers, clinicians, and analysts explore clinical trial data. By combining **Semantic Search**, **Retrieval-Augmented Generation (RAG)**, and **Visual Analytics**, it transforms raw data from [ClinicalTrials.gov](https://clinicaltrials.gov/) into actionable insights.
+
+ Built with **LangChain**, **LlamaIndex**, **Streamlit**, **Altair**, **Streamlit-Agraph**, and **Google Gemini**, this tool goes beyond simple keyword search. It understands natural language, generates inline visualizations, performs complex multi-dimensional analysis, and visualizes relationships in an interactive knowledge graph.
+
+ ## ✨ Key Features
+
+ ### 🧠 Intelligent Search & Retrieval
+ * **Hybrid Search**: Combines **Semantic Search** (vector similarity) with **BM25 Keyword Search** (sparse retrieval) using **LanceDB's Native Hybrid Search**. This ensures you find studies that match both the *meaning* (e.g., "kidney cancer" -> "renal cell carcinoma") and *exact terms* (e.g., "NCT04589845", "Teclistamab").
+ * **Smart Filtering**:
+     * **Strict Pre-Filtering**: For specific sponsors (e.g., "Pfizer"), it forces the engine to look *only* at that sponsor's studies first, ensuring 100% recall.
+     * **Strict Keyword Filtering (Analytics Only)**: For counting questions (e.g., "How many studies..."), the **Analytics Engine** (`get_study_analytics`) prioritizes studies where the query explicitly appears in the **Title** or **Conditions**, ensuring high precision and accurate counts.
+ * **Sponsor Alias Support**: Intelligently maps aliases (e.g., "J&J", "MSD") to their canonical sponsor names ("Janssen", "Merck Sharp & Dohme") for accurate aggregation.
+ * **Smart Summary**: Returns a clean, concise list of relevant studies.
+ * **Query Expansion**: Automatically expands your search terms with medical synonyms (e.g., "Heart Attack" -> "Myocardial Infarction").
+ * **Re-Ranking**: Uses a Cross-Encoder (`ms-marco-MiniLM`) to re-score results for maximum relevance.
+ * **Query Decomposition**: Breaks down complex multi-part questions (e.g., *"Compare the primary outcomes of Keytruda vs Opdivo"*) into sub-questions for precise answers.
+ * **Cohort SQL Generation**: Translates eligibility criteria into standard SQL queries (OMOP CDM) for patient cohort identification.
+
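The sponsor-alias mapping can be pictured as a plain lookup table. This is only an illustrative sketch — the real normalization lives in the app's `modules`, and the alias list and canonical names shown here are assumptions:

```python
# Hypothetical alias table; the app's actual mapping may be larger and fuzzier.
SPONSOR_ALIASES = {
    "j&j": "Janssen",
    "johnson & johnson": "Janssen",
    "msd": "Merck Sharp & Dohme",
}

def normalize_sponsor(name: str) -> str:
    """Map a sponsor alias to its canonical name (case-insensitive);
    unknown names pass through unchanged."""
    return SPONSOR_ALIASES.get(name.strip().lower(), name.strip())

print(normalize_sponsor("J&J"))     # -> Janssen
print(normalize_sponsor("Pfizer"))  # -> Pfizer (no alias, unchanged)
```

Normalizing before grouping is what keeps "J&J" and "Janssen" from being counted as two different sponsors in the analytics.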
+ ### 📊 Visual Analytics & Insights
+ - **Inline Charts (Contextual)**: The agent automatically generates **Bar Charts** and **Line Charts** directly in the chat stream when you ask aggregation questions (e.g., *"Top sponsors for Multiple Myeloma"*).
+ - **Analytics Dashboard (Global)**: A dedicated dashboard to analyze trends across the **entire dataset** (60,000+ studies), independent of your chat session.
+ - **Interactive Knowledge Graph**: Visualize connections between **Studies**, **Sponsors**, and **Conditions** in a dynamic, interactive network graph.
+
+ ### 🌍 Geospatial Dashboard
+ - **Global Trial Map**: Visualize the geographic distribution of clinical trials on an interactive world map.
+ - **Region Toggle**: Switch between **World View** (Country-level aggregation) and **USA View** (State-level aggregation).
+ - **Dot Visualization**: Uses dynamic **CircleMarkers** (dots) sized by trial count to show density.
+ - **Interactive Filters**: Filter the map by **Phase**, **Status**, **Sponsor**, **Start Year**, and **Study Type**.
+
+ ### 🔍 Multi-Filter Analysis
+ - **Complex Filtering**: Answer sophisticated questions by applying multiple filters simultaneously.
+     - *Example*: *"For **Phase 2 and 3** studies, what are **Pfizer's** most common study indications?"*
+ - **Full Dataset Scope**: General analytics questions analyze the **entire database**, not just a sample.
+ - **Smart Retrieval**: Retrieves up to **5,000 relevant studies** for comprehensive analysis.
+
+ ### ⚡ High-Performance Ingestion
+ - **Parallel Processing**: Uses multi-core processing to ingest and embed thousands of studies per minute.
+ - **LanceDB Integration**: Uses **LanceDB** for high-performance vector storage and native hybrid search.
+ - **Idempotent Updates**: Smartly updates existing records without duplication, allowing for seamless data refreshes.
+
+ ## 🤖 Agent Capabilities & Tools
+
+ The agent is equipped with specialized tools to handle different types of requests:
+
+ ### 1. `search_trials`
+ * **Purpose**: Finds specific clinical trials based on natural language queries.
+ * **Capabilities**: Semantic Search, Smart Filtering (Phase, Status, Sponsor, Intervention), Query Expansion, Hybrid Search, Re-Ranking.
+
+ ### 2. `get_study_analytics`
+ * **Purpose**: Aggregates data to reveal trends and insights.
+ * **Capabilities**: Multi-Filtering, Grouping (Phase, Status, Sponsor, Year, Condition), Full Dataset Access, Inline Visualization.
+
+ ### 3. `compare_studies`
+ * **Purpose**: Handles complex comparison or multi-part questions.
+ * **Capabilities**: Uses **Query Decomposition** to break a complex query into sub-queries, executes them against the database, and synthesizes the results.
+
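As a rough intuition for the decomposition step, a comparison question can be split into one sub-question per entity. The real tool delegates this to an LLM (LlamaIndex's `SubQuestionQueryEngine`); the regex below is only a toy sketch of the idea:

```python
import re

def decompose_comparison(query: str) -> list[str]:
    """Toy sketch: split 'Compare <aspect> of <A> vs <B>' into one
    sub-question per entity. Non-comparison queries pass through unchanged."""
    m = re.search(r"compare (.+?) of (.+)", query, re.IGNORECASE)
    if not m:
        return [query]
    aspect = m.group(1)
    entities = re.split(r"\s+(?:vs\.?|versus)\s+", m.group(2), flags=re.IGNORECASE)
    return [f"What are {aspect} of {entity.rstrip('?')}?" for entity in entities]

print(decompose_comparison("Compare the primary outcomes of Keytruda vs Opdivo"))
# -> ['What are the primary outcomes of Keytruda?',
#     'What are the primary outcomes of Opdivo?']
```

Each sub-question is then answered against the index independently, and the final answer synthesizes the per-entity results.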
+ ### 4. `find_similar_studies`
+ * **Purpose**: Discovers studies that are semantically similar to a specific trial.
+ * **Capabilities**:
+     * **NCT Lookup**: Automatically fetches content if queried with an NCT ID.
+     * **Self-Exclusion**: Filters out the reference study from results.
+     * **Scoring**: Returns similarity scores for transparency.
+
+ ### 5. `get_study_details`
+ * **Purpose**: Fetches the full text content of a specific study by NCT ID.
+ * **Capabilities**: Retrieves all chunks of a study to provide comprehensive details (Criteria, Summary, Protocol).
+
+ ### 6. `get_cohort_sql`
+ * **Purpose**: Translates clinical trial eligibility criteria into standard SQL queries for claims data analysis.
+ * **Capabilities**:
+     * **Extraction**: Parses text into structured inclusion/exclusion rules (Concepts, Codes).
+     * **SQL Generation**: Generates OMOP-compatible SQL queries targeting `medical_claims` and `pharmacy_claims`.
+     * **Logic Enforcement**: Applies temporal logic (e.g., "2 diagnoses > 30 days apart") for chronic conditions.
+
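To make the temporal rule concrete, here is a minimal sketch of the kind of SQL such a rule could compile to. The `medical_claims` table name comes from the description above, but the column names, SQL dialect, and the example ICD-10 code are assumptions for illustration:

```python
def diagnosis_cohort_sql(code: str, min_days_apart: int = 30) -> str:
    """Sketch of the chronic-condition rule: require two diagnosis claims
    more than `min_days_apart` days apart (self-join on patient)."""
    return f"""
SELECT a.patient_id
FROM medical_claims a
JOIN medical_claims b
  ON a.patient_id = b.patient_id
WHERE a.diagnosis_code = '{code}'
  AND b.diagnosis_code = '{code}'
  AND b.service_date > a.service_date + INTERVAL '{min_days_apart}' DAY
GROUP BY a.patient_id
""".strip()

sql = diagnosis_cohort_sql("C90.0")  # hypothetical ICD-10 code used as input
print(sql)
```

The self-join with a date gap is what distinguishes a confirmed chronic condition from a single rule-out diagnosis in claims data.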
+ ## ⚙️ How It Works (RAG Pipeline)
+
+ 1. **Ingestion**: `ingest_ct.py` fetches study data from ClinicalTrials.gov. It extracts rich text (including **Eligibility Criteria** and **Interventions**) and structured metadata. It uses **multiprocessing** for speed.
+ 2. **Embedding**: Text is converted into vector embeddings using `PubMedBERT` and stored in **LanceDB**.
+ 3. **Retrieval**:
+     * **Query Transformation**: Synonyms are injected via LLM.
+     * **Pre-Filtering**: Strict filters (Status, Year, Sponsor) reduce the search scope.
+     * **Hybrid Search**: Parallel **Vector Search** (Semantic) and **BM25** (Keyword) combined via **LanceDB Native Hybrid Search**.
+     * **Post-Filtering**: Additional metadata checks (Phase, Intervention) on retrieved candidates.
+     * **Re-Ranking**: Cross-Encoder re-scoring.
+ 4. **Synthesis**: **Google Gemini** synthesizes the final answer.
+
+ ### 🏗️ Ingestion Pipeline
+
+ ```mermaid
+ graph TD
+     API[ClinicalTrials.gov API] -->|Fetch Batches| Script[ingest_ct.py]
+     Script -->|Process & Embed| LanceDB[(LanceDB)]
+ ```
+
+ ### 🧠 RAG Retrieval Flow
+
+ ```mermaid
+ graph TD
+     User[User Query] -->|Expand| Synonyms[Synonym Injection]
+     Synonyms -->|Pre-Filter| PreFilter[Pre-Retrieval Filters]
+     PreFilter -->|Filtered Scope| Hybrid[Hybrid Search]
+     Hybrid -->|Parallel Search| Vector[Vector Search] & BM25[BM25 Keyword Search]
+     Vector & BM25 -->|Reciprocal Rank Fusion| Fusion[Merged Candidates]
+     Fusion -->|Candidates| PostFilter[Post-Retrieval Filters]
+     PostFilter -->|Top N| ReRank[Cross-Encoder Re-Ranking]
+     ReRank -->|Context| LLM[Google Gemini]
+     LLM -->|Answer| Response[Final Response]
+ ```
+
+
112
+ ### 🕸️ Knowledge Graph
113
+
114
+ ```mermaid
115
+ graph TD
116
+ LanceDB[(LanceDB)] -->|Metadata| GraphBuilder[build_graph]
117
+ GraphBuilder -->|Nodes & Edges| Agraph[Streamlit Agraph]
118
+ ```
119
+
120
+ ## 🛠️ Tech Stack
121
+
122
+ - **Frontend**: Streamlit, Altair, Streamlit-Agraph
123
+ - **LLM**: Google Gemini (`gemini-2.5-flash`)
124
+ - **Orchestration**: LangChain (Agents, Tool Calling)
125
+ - **Retrieval (RAG)**: LlamaIndex (VectorStoreIndex, SubQuestionQueryEngine)
126
+ - **Vector Database**: LanceDB (Local)
127
+ - **Embeddings**: HuggingFace (`pritamdeka/S-PubMedBert-MS-MARCO`)
128
+
129
+ ## 🚀 Getting Started
130
+
131
+ ### Prerequisites
132
+
133
+ - Python 3.10+
134
+ - A Google Cloud API Key with access to Gemini
135
+
136
+ ### Installation
137
+
138
+ 1. **Clone the repository**
139
+ ```bash
140
+ git clone <repository-url>
141
+ cd clinical_trial_agent
142
+ ```
143
+
144
+ 2. **Create and activate a virtual environment**
145
+ ```bash
146
+ python -m venv venv
147
+ source venv/bin/activate # On Windows: venv\Scripts\activate
148
+ ```
149
+
150
+ 3. **Install dependencies**
151
+ ```bash
152
+ pip install -r requirements.txt
153
+ ```
154
+
155
+ 4. **Set up Environment Variables**
156
+ Create a `.env` file in the root directory and add your Google API Key:
157
+ ```bash
158
+ GOOGLE_API_KEY=your_google_api_key_here
159
+ ```
160
+
161
+ ## 📖 Usage
162
+
163
+ ### 1. Ingest Data
164
+ Populate the local database. The script uses parallel processing for speed.
165
+
166
+ ```bash
167
+ # Recommended: Ingest 5000 recent studies
168
+ python scripts/ingest_ct.py --limit 5000 --years 5
169
+
170
+ # Ingest ALL studies (Warning: Large download!)
171
+ python scripts/ingest_ct.py --limit -1 --years 10
172
+ ```
173
+
174
+ ### 2. Run the Agent
175
+ Launch the Streamlit application:
176
+
177
+ ```bash
178
+ streamlit run ct_agent_app.py
179
+ ```
180
+
181
+ ### 3. Ask Questions!
182
+ - **Search**: *"Find studies for Multiple Myeloma."*
183
+ - **Comparison**: *"Compare the primary outcomes of Keytruda vs Opdivo."*
184
+ - **Analytics**: *"Who are the top sponsors for Breast Cancer?"* (Now supports grouping by **Intervention** and **Study Type**!)
185
+ - **Graph**: Go to the **Knowledge Graph** tab to visualize connections.
186
+
187
+ ## 🧪 Testing & Quality
188
+
189
+ - **Unit Tests**: Run `python -m pytest tests/test_unit.py` to verify core logic.
190
+ - **Hybrid Search Tests**: Run `python -m pytest tests/test_hybrid_search.py` to verify the search engine's precision and recall.
191
+ - **Data Integrity**: Run `python -m unittest tests/test_data_integrity.py` to verify database content against known ground truths.
192
+ - **Sponsor Normalization**: Run `python -m pytest tests/test_sponsor_normalization.py` to verify alias mapping logic.
193
+ - **Linting**: Codebase is formatted with `black` and linted with `flake8`.
194
+
195
+ ## 📂 Project Structure
196
+
197
+ - `ct_agent_app.py`: Main application logic.
198
+ - `modules/`:
199
+ - `utils.py`: Configuration, Normalization, Custom Filters.
200
+ - `constants.py`: Static data (Coordinates, Mappings).
201
+ - `tools.py`: Tool definitions (`search_trials`, `compare_studies`, etc.).
202
+ - `cohort_tools.py`: SQL generation logic (`get_cohort_sql`).
203
+ - `graph_viz.py`: Knowledge Graph logic.
204
+ - `scripts/`:
205
+ - `ingest_ct.py`: Parallel data ingestion pipeline.
206
+ - `analyze_db.py`: Database inspection.
207
+
208
+ - `ct_gov_lancedb/`: Persisted LanceDB vector store.
209
+ - `tests/`:
210
+ - `test_unit.py`: Core logic tests.
211
+ - `test_hybrid_search.py`: Integration tests for search engine.
212
+
213
+ ## 🐳 Deployment
214
+
215
+ The application is container-ready and can be deployed using Docker.
216
+
217
+ ### Build the Image
218
+ ```bash
219
+ docker build -t clinical-trial-agent .
220
+ ```
221
+
222
+ ### Run the Container
223
+ You can run the container in two modes:
224
+
225
+ **1. Admin Mode (API Key in Environment)**
226
+ Pass the key as an environment variable. Users will not be prompted.
227
+ ```bash
228
+ docker run -p 8501:8501 -e GOOGLE_API_KEY=your_key_here clinical-trial-agent
229
+ ```
230
+
231
+ **2. User Mode (Prompt for Key)**
232
+ Run without the key. Users will be prompted to enter their own key in the sidebar.
233
+ ```bash
234
+ docker run -p 8501:8501 clinical-trial-agent
235
+ ```
236
+
237
+ ### Hosting Options
238
+ - **Hugging Face Spaces**: Select "Docker" SDK. Add `GOOGLE_API_KEY` to Secrets for Admin Mode.
239
+ - **Google Cloud Run**: Deploy the container and map port 8501.
240
+
ct_agent_app.py ADDED
@@ -0,0 +1,583 @@
+ """
+ Clinical Trial Inspector Agent Application.
+
+ This is the main Streamlit application script. It orchestrates:
+ 1. **LLM & Agents**: Initializes Google Gemini and the LangChain agent.
+ 2. **RAG Pipeline**: Loads the LlamaIndex vector store for semantic retrieval.
+ 3. **User Interface**: Renders the Streamlit UI with tabs for Chat, Analytics, and Raw Data.
+ 4. **Visualization**: Handles dynamic chart generation using Altair.
+ """
+
+ import streamlit as st
+ import pandas as pd
+ import os
+ import altair as alt
+ import logging
+ from dotenv import load_dotenv
+
+ # Suppress logging
+ logging.getLogger("langchain_google_genai._function_utils").setLevel(logging.ERROR)
+
+ # Load environment variables
+ load_dotenv()
+
+ # Module Imports
+ from modules.utils import load_index, setup_llama_index
+ from modules.constants import COUNTRY_COORDINATES, STATE_COORDINATES
+ from modules.tools import (
+     search_trials,
+     find_similar_studies,
+     get_study_analytics,
+     compare_studies,
+     get_study_details,
+     fetch_study_analytics_data,
+ )
+ from modules.cohort_tools import get_cohort_sql
+ from modules.graph_viz import build_graph
+ from streamlit_agraph import agraph
+ from streamlit_option_menu import option_menu
+ import folium
+ from streamlit_folium import st_folium
+
+ # LangChain Imports
+ from langchain_google_genai import ChatGoogleGenerativeAI
+ from langchain.agents import AgentExecutor, create_tool_calling_agent
+ from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
+ from langchain_core.messages import HumanMessage, AIMessage
+
+ # --- App Configuration ---
+ st.set_page_config(
+     page_title="Clinical Trial Inspector",
+     layout="wide",
+     initial_sidebar_state="expanded",
+ )
+
+ # --- Custom CSS for Sidebar Width ---
+ st.markdown(
+     """
+     <style>
+     [data-testid="stSidebar"] {
+         min-width: 200px;
+         max-width: 250px;
+     }
+     </style>
+     """,
+     unsafe_allow_html=True,
+ )
+
+ st.title("🧬 Clinical Trial Inspector Agent")
+
+ # 1. Setup LLM & LlamaIndex Settings
+ # We use Google Gemini-2.5-Flash for fast and accurate responses.
+ api_key = os.environ.get("GOOGLE_API_KEY")
+
+ if not api_key:
+     st.sidebar.warning("⚠️ API Key Missing")
+     user_key = st.sidebar.text_input(
+         "Enter Google API Key:",
+         type="password",
+         help="Get one at https://aistudio.google.com/",
+     )
+     if user_key:
+         st.session_state["api_key"] = user_key
+         api_key = user_key
+         st.sidebar.success("Key set!")
+         st.rerun()
+     else:
+         # Check if key is already in session state (from previous run)
+         if "api_key" in st.session_state:
+             api_key = st.session_state["api_key"]
+         else:
+             st.warning("Please enter your Google API Key in the sidebar to continue.")
+             st.stop()
+ else:
+     # Env var exists, ensure it's in session state for tools to find
+     st.session_state["api_key"] = api_key
+
+ # Ensure LlamaIndex settings (Embeddings, LLM) are applied on every run
+ setup_llama_index(api_key=api_key)
+
+ llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0, google_api_key=api_key)
+
+ # 2. Load LlamaIndex (Cached)
+ # The index is loaded once and cached to avoid reloading on every interaction.
+ index = load_index()
+
+
+ # 3. Define Agent (Cached)
+ @st.cache_resource
+ def get_agent():
+     """Initializes and caches the LangChain agent."""
+     tools = [
+         search_trials,
+         find_similar_studies,
+         get_study_analytics,
+         compare_studies,
+         get_study_details,
+         get_cohort_sql,
+     ]
+
+     prompt = ChatPromptTemplate.from_messages(
+         [
+             (
+                 "system",
+                 "You are a Clinical Trial Expert Assistant. "
+                 "Your goal is to help researchers and analysts understand clinical trial data. "
+                 "You have access to a local database of clinical trials (embedded from ClinicalTrials.gov). "
+                 "Use the available tools to search for studies, find similar studies, and generate analytics. "
+                 "When asked about 'trends', 'counts', 'how many', or 'most common', ALWAYS use the `get_study_analytics` tool. "
+                 "Do NOT use `search_trials` for counting questions like 'How many studies...'. "
+                 "When asked to 'find studies', 'search', or 'list', use `search_trials`. "
+                 "When asked to 'compare' multiple studies or answer complex multi-part questions, use `compare_studies`. "
+                 "If the user asks for a specific study by ID (e.g., NCT12345678), `search_trials` handles that automatically. "
+                 "However, if the user asks for specific **details**, **criteria**, **summary**, or **protocol** of a single study, "
+                 "you MUST use the `get_study_details` tool to fetch the full content. "
+                 "If the user asks to **generate SQL**, **build a cohort**, or **translate criteria to code** for a study, "
+                 "use the `get_cohort_sql` tool. "
+                 "When reporting 'similar studies', ALWAYS include the similarity score provided by the tool "
+                 "and DO NOT include the study that was used as the query (the reference study). "
+                 "Provide concise, evidence-based answers citing specific studies when possible.",
+             ),
+             MessagesPlaceholder(variable_name="chat_history"),
+             ("human", "{input}"),
+             ("placeholder", "{agent_scratchpad}"),
+         ]
+     )
+
+     agent = create_tool_calling_agent(llm, tools, prompt)
+     return AgentExecutor(agent=agent, tools=tools, verbose=True)
+
+
+ agent_executor = get_agent()
+
+ # --- Sidebar ---
+ with st.sidebar:
+     st.image("https://cdn-icons-png.flaticon.com/512/3004/3004458.png", width=50)
+     st.title("Clinical Trial Agent")
+
+     page = option_menu(
+         "Main Menu",
+         ["Chat Assistant", "Analytics Dashboard", "Knowledge Graph", "Study Map", "Raw Data"],
+         icons=["chat-dots", "graph-up", "diagram-3", "map", "database"],
+         menu_icon="cast",
+         default_index=0,
+     )
+
+
+ # --- Helper Functions ---
+ def generate_dashboard_analytics():
+     """Callback to generate analytics and update session state."""
+     # Map UI selection to tool arguments
+     group_map = {
+         "Phase": "phase",
+         "Status": "status",
+         "Sponsor": "sponsor",
+         "Start Year": "start_year",
+         "Intervention": "intervention",
+         "Study Type": "study_type",
+     }
+
+     # Get values from session state
+     # We use .get() to avoid KeyErrors if the widget hasn't initialized yet (though it should have)
+     g_by = st.session_state.get("dash_group_by", "Sponsor")
+     p_filter = st.session_state.get("dash_phase", "")
+     s_filter = st.session_state.get("dash_sponsor", "")
+
+     with st.spinner(f"Analyzing studies by {g_by}..."):
+         # Call the tool directly
+         result = get_study_analytics.invoke(
+             {
+                 "query": "overall",
+                 "group_by": group_map.get(g_by, "sponsor"),
+                 "phase": p_filter if p_filter else None,
+                 "sponsor": s_filter if s_filter else None,
+             }
+         )
+
+         # The tool sets session state 'inline_chart_data'
+         if "inline_chart_data" in st.session_state:
+             st.session_state["dashboard_data"] = st.session_state["inline_chart_data"]
+         else:
+             st.warning(result)
+
+
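The chart renderers below all read a payload from `st.session_state["inline_chart_data"]`. Based on the keys the rendering code accesses (`type`, `title`, `x`, `y`, `data`), a tool-side payload might be assembled like this — the sponsor values are made-up sample data standing in for `get_study_analytics`' real query results:

```python
from collections import Counter

# Sample aggregation; the real tool derives this from the LanceDB table.
sponsors = ["Pfizer", "Merck", "Pfizer", "Novartis", "Pfizer"]
counts = Counter(sponsors).most_common()

inline_chart_data = {
    "type": "bar",    # the chat renderer only draws inline charts of type "bar"
    "title": "Top Sponsors",
    "x": "sponsor",   # axis field names must match the keys in each data record
    "y": "count",
    "data": [{"sponsor": s, "count": c} for s, c in counts],
}
print(inline_chart_data["data"][0])  # -> {'sponsor': 'Pfizer', 'count': 3}
```

Keeping the payload as plain dicts/lists means it can be persisted inside the chat message history and re-rendered on every rerun without re-querying.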
+ # --- PAGE 1: CHAT ---
+ if page == "Chat Assistant":
+     st.header("💬 Chat Assistant")
+     if "messages" not in st.session_state:
+         st.session_state.messages = []
+
+     # Render Chat History
+     for message in st.session_state.messages:
+         with st.chat_message(message["role"]):
+             st.markdown(message["content"])
+             # Render chart if present in message history (persisted charts)
+             if "chart_data" in message:
+                 chart_data = message["chart_data"]
+                 st.caption(chart_data["title"])
+                 chart = (
+                     alt.Chart(pd.DataFrame(chart_data["data"]))
+                     .mark_bar()
+                     .encode(
+                         x=alt.X(chart_data["x"], sort="-y", axis=alt.Axis(labelLimit=200)),
+                         y=alt.Y(chart_data["y"], title="Count"),
+                         tooltip=[chart_data["x"], chart_data["y"]],
+                     )
+                     .interactive()
+                 )
+                 st.altair_chart(chart, theme="streamlit", width="stretch")
+
+     # Chat Input
+     if prompt := st.chat_input("Ask about clinical trials..."):
+         st.session_state.messages.append({"role": "user", "content": prompt})
+         with st.chat_message("user"):
+             st.markdown(prompt)
+
+         with st.chat_message("assistant"):
+             with st.spinner("Analyzing clinical trials..."):
+                 try:
+                     # Clear previous inline chart data to avoid stale charts
+                     if "inline_chart_data" in st.session_state:
+                         del st.session_state["inline_chart_data"]
+
+                     # Construct chat history for the agent context
+                     chat_history = []
+                     for msg in st.session_state.messages[:-1]:
+                         if msg["role"] == "user":
+                             chat_history.append(HumanMessage(content=msg["content"]))
+                         else:
+                             chat_history.append(AIMessage(content=msg["content"]))
+
+                     # Invoke Agent
+                     response = agent_executor.invoke(
+                         {"input": prompt, "chat_history": chat_history}
+                     )
+                     output = response["output"]
+                     st.markdown(output)
+
+                     # Check for inline chart data (set by tools)
+                     chart_data = None
+                     if "inline_chart_data" in st.session_state:
+                         chart_data = st.session_state["inline_chart_data"]
+                         st.caption(chart_data["title"])
+                         if chart_data["type"] == "bar":
+                             # Use Altair for better charts
+                             chart = (
+                                 alt.Chart(pd.DataFrame(chart_data["data"]))
+                                 .mark_bar()
+                                 .encode(
+                                     x=alt.X(
+                                         chart_data["x"],
+                                         sort="-y",
+                                         axis=alt.Axis(labelLimit=200),
+                                     ),
+                                     y=alt.Y(chart_data["y"], title="Count"),
+                                     tooltip=[chart_data["x"], chart_data["y"]],
+                                 )
+                                 .interactive()
+                             )
+                             st.altair_chart(chart, theme="streamlit", width="stretch")
+
+                         # Clean up session state
+                         del st.session_state["inline_chart_data"]
+
+                     # Save message with chart data if present
+                     msg_obj = {"role": "assistant", "content": output}
+                     if chart_data:
+                         msg_obj["chart_data"] = chart_data
+                     st.session_state.messages.append(msg_obj)
+
+                 except Exception as e:
+                     st.error(f"An error occurred: {e}")
+
+ # --- PAGE 2: ANALYTICS DASHBOARD ---
+ if page == "Analytics Dashboard":
+     st.header("📊 Global Analytics")
+     st.write(
+         "Analyze trends across the entire clinical trial dataset (60,000+ studies)."
+     )
+
+     col1, col2 = st.columns([1, 3])
+
+     with col1:
+         st.subheader("Configuration")
+         group_by = st.selectbox(
+             "Group By",
+             ["Phase", "Status", "Sponsor", "Start Year", "Intervention", "Study Type"],
+             index=2,
+             key="dash_group_by",
+         )
+
+         # Optional Filters
+         st.markdown("---")
+         st.markdown("**Filters (Optional)**")
+         filter_phase = st.text_input("Phase (e.g., Phase 2)", key="dash_phase")
+         filter_sponsor = st.text_input("Sponsor (e.g., Pfizer)", key="dash_sponsor")
+
+         st.button(
+             "Generate Analytics", type="primary", on_click=generate_dashboard_analytics
+         )
+
+     with col2:
+         # Always render if data exists in session state
+         if "dashboard_data" in st.session_state:
+             c_data = st.session_state["dashboard_data"]
+             st.subheader(c_data["title"])
+
+             # Altair Chart Rendering
+             # Check both the data key and the current UI selection
+             if c_data["x"] == "start_year" or group_by == "Start Year":
+                 # Line chart for years
+                 chart = (
+                     alt.Chart(pd.DataFrame(c_data["data"]))
+                     .mark_line(point=True)
+                     .encode(
+                         # 'd' format for integer years
+                         x=alt.X(c_data["x"], axis=alt.Axis(format="d"), title="Year"),
+                         y=alt.Y(c_data["y"], title="Count"),
+                         tooltip=[c_data["x"], c_data["y"]],
+                     )
+                     .interactive()
+                 )
+             else:
+                 # Bar chart for others
+                 chart = (
+                     alt.Chart(pd.DataFrame(c_data["data"]))
+                     .mark_bar()
+                     .encode(
+                         x=alt.X(
+                             c_data["x"],
+                             sort="-y",
+                             axis=alt.Axis(labelLimit=200),
+                         ),
+                         y=alt.Y(c_data["y"], title="Count"),
+                         tooltip=[c_data["x"], c_data["y"]],
+                     )
+                     .interactive()
+                 )
+
+             st.altair_chart(chart, theme="streamlit", width="stretch")
+
+             # Show raw table
+             with st.expander("View Source Data"):
+                 st.dataframe(pd.DataFrame(c_data["data"]))
+
+ # --- PAGE 3: KNOWLEDGE GRAPH ---
+ if page == "Knowledge Graph":
+     st.header("🕸️ Interactive Knowledge Graph")
+     st.write("Visualize connections between Studies, Sponsors, and Conditions.")
+
+     col_g1, col_g2 = st.columns([1, 3])
+
+     with col_g1:
+         st.subheader("Graph Settings")
+         graph_query = st.text_input("Search Topic", value="Cancer")
380
+ limit = st.slider("Max Nodes", 10, 100, 50)
381
+
382
+ if st.button("Build Graph"):
383
+ with st.spinner("Fetching data and building graph..."):
384
+ # Use retriever to get relevant nodes
385
+ retriever = index.as_retriever(similarity_top_k=limit)
386
+ nodes = retriever.retrieve(graph_query)
387
+ data = [n.metadata for n in nodes]
388
+
389
+ # Build Graph
390
+ g_nodes, g_edges, g_config = build_graph(data)
391
+
392
+ st.session_state["graph_data"] = {
393
+ "nodes": g_nodes,
394
+ "edges": g_edges,
395
+ "config": g_config,
396
+ }
397
+
398
+ with col_g2:
399
+ if "graph_data" in st.session_state:
400
+ g_data = st.session_state["graph_data"]
401
+ st.success(
402
+ f"Graph built with {len(g_data['nodes'])} nodes and {len(g_data['edges'])} edges."
403
+ )
404
+ agraph(
405
+ nodes=g_data["nodes"], edges=g_data["edges"], config=g_data["config"]
406
+ )
407
+ else:
408
+ st.info("Enter a topic and click 'Build Graph' to visualize connections.")
409
+
410
+ # --- PAGE# --- Study Map Tab ---
411
+ elif page == "Study Map":
412
+ st.header("🌍 Global Clinical Trial Map")
413
+ st.markdown("Visualize the geographic distribution of clinical trials.")
414
+
415
+ # Sidebar Filters for Map
416
+ st.sidebar.markdown("### 🗺️ Map Filters")
417
+ map_region = st.sidebar.radio("Region", ["World", "USA"], index=0)
418
+
419
+ map_phase = st.sidebar.multiselect(
420
+ "Phase", ["PHASE1", "PHASE2", "PHASE3", "PHASE4"], default=["PHASE2", "PHASE3"]
421
+ )
422
+ map_status = st.sidebar.selectbox(
423
+ "Status", ["RECRUITING", "COMPLETED", "ACTIVE_NOT_RECRUITING"], index=0
424
+ )
425
+ map_sponsor = st.sidebar.text_input("Sponsor (Optional)", "")
426
+ map_year = st.sidebar.number_input("Start Year (>=)", min_value=2000, value=2020)
427
+ map_type = st.sidebar.selectbox(
428
+ "Study Type", ["Interventional", "Observational", "All"], index=0
429
+ )
430
+
431
+ # Convert filters to arguments
432
+ phase_str = ",".join(map_phase) if map_phase else None
433
+ type_arg = map_type if map_type != "All" else None
434
+
435
+ if st.button("Update Map"):
436
+ with st.spinner("Aggregating geographic data..."):
437
+ # Determine grouping based on Region
438
+ group_by_field = "state" if map_region == "USA" else "country"
439
+
440
+ # Call analytics logic directly
441
+ summary = fetch_study_analytics_data(
442
+ query="overall",
443
+ group_by=group_by_field,
444
+ phase=phase_str,
445
+ status=map_status,
446
+ sponsor=map_sponsor,
447
+ start_year=map_year,
448
+ study_type=type_arg,
449
+ )
450
+
451
+ # Retrieve data from session state
452
+ chart_data = st.session_state.get("inline_chart_data", {})
453
+ data_records = chart_data.get("data", [])
454
+
455
+ if not data_records:
456
+ st.warning("No data found for these filters.")
457
+ st.session_state["map_data"] = None
458
+ st.session_state["map_region"] = map_region # Store region too
459
+ else:
460
+ # Store in session state for persistence
461
+ st.session_state["map_data"] = data_records
462
+ st.session_state["map_region"] = map_region
463
+
464
+ # Render Map (Outside Button Block)
465
+ if st.session_state.get("map_data"):
466
+ data_records = st.session_state["map_data"]
467
+ region_mode = st.session_state.get("map_region", "World")
468
+ df_map = pd.DataFrame(data_records)
469
+
470
+ # Configure Map Center/Zoom
471
+ if region_mode == "USA":
472
+ m = folium.Map(location=[37.0902, -95.7129], zoom_start=4)
473
+ coord_map = STATE_COORDINATES
474
+ else:
475
+ m = folium.Map(location=[20, 0], zoom_start=2)
476
+ coord_map = COUNTRY_COORDINATES
477
+
478
+ # Add CircleMarkers
479
+ for _, row in df_map.iterrows():
480
+ loc_name = row["category"]
481
+ count = row["count"]
482
+
483
+ # Clean name if needed (strip trailing parenthesis)
484
+ loc_clean = loc_name.rstrip(")")
485
+ coords = coord_map.get(loc_clean)
486
+
487
+ if coords:
488
+ folium.CircleMarker(
489
+ location=coords,
490
+ radius=min(max(count / 5, 3), 20), # Adjust scale
491
+ popup=f"{loc_clean}: {count} trials",
492
+ color="blue" if region_mode == "USA" else "crimson",
493
+ fill=True,
494
+ fill_color="blue" if region_mode == "USA" else "crimson",
495
+ ).add_to(m)
496
+
497
+ st_folium(m, width=800, height=500)
498
+
499
+ # Show data table
500
+ st.subheader(f"{region_mode} Data")
501
+ st.dataframe(df_map)
502
+
503
+ # --- PAGE 4: RAW DATA ---
504
+ if page == "Raw Data":
505
+ st.header("📂 Raw Data Explorer")
506
+ st.write("View and filter the underlying dataset.")
507
+
508
+ # Load a sample or full dataset? Full might be slow.
509
+ # We load a sample (top 100) to avoid performance issues.
510
+ col_raw_1, col_raw_2 = st.columns([1, 1])
511
+
512
+ with col_raw_1:
513
+ if st.button("Load Sample Data (Top 100)"):
514
+ with st.spinner("Fetching data..."):
515
+ retriever = index.as_retriever(similarity_top_k=100)
516
+ nodes = retriever.retrieve("clinical trial")
517
+ data = [n.metadata for n in nodes]
518
+ df_raw = pd.DataFrame(data)
519
+
520
+ # Format Year to remove commas (e.g., 2,023 -> 2023)
521
+ if "start_year" in df_raw.columns:
522
+ df_raw["start_year"] = (
523
+ pd.to_numeric(df_raw["start_year"], errors="coerce")
524
+ .astype("Int64")
525
+ .astype(str)
526
+ .str.replace(",", "")
527
+ )
528
+
529
+ # Store in session state to persist the table
530
+ st.session_state["sample_data"] = df_raw
531
+
532
+ with col_raw_2:
533
+ # Download Full Dataset Logic
534
+ if st.button("Prepare Full Download (CSV)"):
535
+ with st.spinner("Fetching all records from database..."):
536
+ try:
537
+ # Access LanceDB directly for speed
538
+ import lancedb
539
+ db = lancedb.connect("./ct_gov_lancedb")
540
+ tbl = db.open_table("clinical_trials")
541
+
542
+ # Fetch all data
543
+ df_full = tbl.to_pandas()
544
+
545
+ # Handle metadata flattening if needed
546
+ if "metadata" in df_full.columns:
547
+ meta_df = pd.json_normalize(df_full["metadata"])
548
+ # Combine or just use metadata
549
+ df_full = meta_df
550
+
551
+ # Convert to CSV
552
+ csv = df_full.to_csv(index=False).encode("utf-8")
553
+ st.session_state["full_csv"] = csv
554
+ st.success(f"Ready! Fetched {len(df_full)} records.")
555
+ else:
556
+ st.warning("No data found in database.")
557
+ except Exception as e:
558
+ st.error(f"Error fetching data: {e}")
559
+
560
+ if "full_csv" in st.session_state:
561
+ st.download_button(
562
+ label="⬇️ Download Full CSV",
563
+ data=st.session_state["full_csv"],
564
+ file_name="clinical_trials_full.csv",
565
+ mime="text/csv",
566
+ )
567
+
568
+ # Display Sample Data Table (Full Width)
569
+ if "sample_data" in st.session_state:
570
+ st.markdown("### Sample Data (Top 100)")
571
+ st.dataframe(
572
+ st.session_state["sample_data"],
573
+ column_config={
574
+ "nct_id": "NCT ID",
575
+ "title": "Study Title",
576
+ "start_year": st.column_config.TextColumn(
577
+ "Start Year"
578
+ ), # Force text to avoid commas
579
+ "url": st.column_config.LinkColumn("URL"),
580
+ },
581
+ width="stretch",
582
+ hide_index=True,
583
+ )
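The Study Map above clamps each circle marker's radius to a readable range and strips a trailing parenthesis before looking up coordinates. A minimal, dependency-free sketch of that logic (the function names are illustrative, not part of the app):

```python
from typing import Dict, List, Optional


def marker_radius(count: int, scale: float = 5.0, lo: float = 3.0, hi: float = 20.0) -> float:
    """Scale a trial count to a marker radius, clamped to [lo, hi]."""
    return min(max(count / scale, lo), hi)


def resolve_coords(name: str, coord_map: Dict[str, List[float]]) -> Optional[List[float]]:
    """Look up coordinates, tolerating a trailing ')' left by upstream formatting."""
    return coord_map.get(name.rstrip(")"))
```

The clamp keeps a location with a handful of trials visible while preventing a hub like the United States from covering the map.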
modules/__init__.py ADDED
File without changes
modules/cohort_tools.py ADDED
@@ -0,0 +1,145 @@
+ import json
+ import os
+
+ import streamlit as st
+ from langchain.tools import tool
+ from langchain_google_genai import ChatGoogleGenerativeAI
+ from langchain.prompts import PromptTemplate
+
+ from modules.tools import get_study_details
+ from modules.utils import load_environment
+
+ # Load env for API key
+ load_environment()
+
+
+ def get_llm():
+     """Retrieves an LLM instance with a dynamically resolved API key."""
+     # Check session state first (user-provided key)
+     api_key = None
+     if hasattr(st, "session_state") and "api_key" in st.session_state:
+         api_key = st.session_state["api_key"]
+
+     # Fall back to the environment variable
+     if not api_key:
+         api_key = os.environ.get("GOOGLE_API_KEY")
+
+     if not api_key:
+         raise ValueError("Google API Key not found in session state or environment.")
+
+     return ChatGoogleGenerativeAI(
+         model="gemini-2.5-flash", temperature=0, google_api_key=api_key
+     )
+
+
+ # Initialize LLM (Dynamic)
+ # llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0)
+
+ EXTRACT_PROMPT = PromptTemplate(
+     template="""
+ You are a Clinical Informatics Expert.
+ Your task is to extract structured cohort requirements from the following Clinical Trial Eligibility Criteria.
+
+ Output a JSON object with two keys: "inclusion" and "exclusion".
+ Each key should contain a list of rules.
+ Each rule should have:
+ - "concept": The medical concept (e.g., "Type 2 Diabetes", "Metformin").
+ - "domain": The domain (Condition, Drug, Measurement, Procedure, Observation).
+ - "temporal": Any temporal logic (e.g., "History of", "Within last 6 months").
+ - "codes": A list of potential ICD-10 or RxNorm codes (make a best guess).
+
+ CRITERIA:
+ {criteria}
+
+ JSON OUTPUT:
+ """,
+     input_variables=["criteria"],
+ )
+
+ SQL_PROMPT = PromptTemplate(
+     template="""
+ You are a SQL Expert specializing in Healthcare Claims Data Analysis.
+ Generate a standard SQL query to define a cohort of patients based on the following structured requirements.
+
+ ### Schema Assumptions
+ 1. **`medical_claims`** (Diagnoses & Procedures):
+    - `patient_id`, `claim_date`, `diagnosis_code` (ICD-10), `procedure_code` (CPT/HCPCS).
+ 2. **`pharmacy_claims`** (Drugs):
+    - `patient_id`, `fill_date`, `ndc_code`.
+
+ ### Logic Rules
+ 1. **Conditions (Diagnoses)**:
+    - Require **at least 2 distinct claim dates** where the diagnosis code matches.
+    - These 2 claims must be **at least 30 days apart** (to confirm a chronic condition).
+ 2. **Drugs**:
+    - Require at least 1 claim with a matching NDC code.
+ 3. **Procedures**:
+    - Require at least 1 claim with a matching CPT/HCPCS code.
+ 4. **Exclusions**:
+    - Exclude patients who have ANY matching claims for exclusion criteria.
+
+ ### Requirements (JSON)
+ {requirements}
+
+ ### Output
+ Generate a single SQL query that selects `patient_id` from the claims tables meeting the criteria.
+ Use Common Table Expressions (CTEs) for clarity.
+ Do NOT output markdown formatting (```sql), just the raw SQL.
+
+ SQL QUERY:
+ """,
+     input_variables=["requirements"],
+ )
+
+
+ def extract_cohort_requirements(criteria_text: str) -> dict:
+     """Uses the LLM to parse criteria text into structured JSON."""
+     llm = get_llm()
+     chain = EXTRACT_PROMPT | llm
+     response = chain.invoke({"criteria": criteria_text})
+     try:
+         # Clean up potential markdown code blocks
+         text = response.content.replace("```json", "").replace("```", "").strip()
+         return json.loads(text)
+     except json.JSONDecodeError:
+         return {"error": "Failed to parse LLM output", "raw_output": response.content}
+
+
+ def generate_cohort_sql(requirements: dict) -> str:
+     """Uses the LLM to translate structured requirements into SQL."""
+     llm = get_llm()
+     chain = SQL_PROMPT | llm
+     response = chain.invoke({"requirements": json.dumps(requirements, indent=2)})
+     return response.content.replace("```sql", "").replace("```", "").strip()
+
+
+ @tool("get_cohort_sql")
+ def get_cohort_sql(nct_id: str) -> str:
+     """
+     Generates a SQL query to define the patient cohort for a specific study (NCT ID).
+
+     Args:
+         nct_id (str): The ClinicalTrials.gov identifier (e.g., NCT01234567).
+
+     Returns:
+         str: A formatted string containing the extracted requirements (JSON) and the generated SQL.
+     """
+     # 1. Fetch study details (reusing the existing tool logic to get the text)
+     study_text = get_study_details.invoke(nct_id)
+
+     if "No study found" in study_text:
+         return f"Could not find study {nct_id}."
+
+     # 2. Extract requirements
+     requirements = extract_cohort_requirements(study_text)
+
+     # 3. Generate SQL
+     sql_query = generate_cohort_sql(requirements)
+
+     return f"""
+ ### 📋 Extracted Cohort Requirements
+ ```json
+ {json.dumps(requirements, indent=2)}
+ ```
+
+ ### 💾 Generated SQL Query (Claims Schema)
+ ```sql
+ {sql_query}
+ ```
+ """
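`extract_cohort_requirements` has to defend against the model wrapping its JSON in Markdown code fences. That fence-stripping-plus-fallback pattern, isolated as a standalone helper (the function name is illustrative):

```python
import json


def parse_llm_json(raw: str) -> dict:
    """Strip Markdown code fences an LLM may wrap around JSON, then parse.

    Returns a dict with an "error" key instead of raising, mirroring the
    fallback behaviour in extract_cohort_requirements.
    """
    text = raw.replace("```json", "").replace("```", "").strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return {"error": "Failed to parse LLM output", "raw_output": raw}
```

Returning an error dict rather than raising lets the agent surface the raw output to the user instead of crashing the tool call.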
modules/constants.py ADDED
@@ -0,0 +1,103 @@
+ # --- Geographic Constants ---
+ COUNTRY_COORDINATES = {
+     "United States": [37.0902, -95.7129],
+     "Canada": [56.1304, -106.3468],
+     "United Kingdom": [55.3781, -3.4360],
+     "Germany": [51.1657, 10.4515],
+     "France": [46.2276, 2.2137],
+     "China": [35.8617, 104.1954],
+     "Japan": [36.2048, 138.2529],
+     "Australia": [-25.2744, 133.7751],
+     "Brazil": [-14.2350, -51.9253],
+     "India": [20.5937, 78.9629],
+     "Russia": [61.5240, 105.3188],
+     "South Korea": [35.9078, 127.7669],
+     "Italy": [41.8719, 12.5674],
+     "Spain": [40.4637, -3.7492],
+     "Netherlands": [52.1326, 5.2913],
+     "Belgium": [50.5039, 4.4699],
+     "Switzerland": [46.8182, 8.2275],
+     "Sweden": [60.1282, 18.6435],
+     "Israel": [31.0461, 34.8516],
+     "Poland": [51.9194, 19.1451],
+     "Taiwan": [23.6978, 120.9605],
+     "Mexico": [23.6345, -102.5528],
+     "Argentina": [-38.4161, -63.6167],
+     "South Africa": [-30.5595, 22.9375],
+     "Turkey": [38.9637, 35.2433],
+     "Denmark": [56.2639, 9.5018],
+     "New Zealand": [-40.9006, 174.8860],
+     "Czech Republic": [49.8175, 15.4730],
+     "Hungary": [47.1625, 19.5033],
+     "Finland": [61.9241, 25.7482],
+     "Norway": [60.4720, 8.4689],
+     "Austria": [47.5162, 14.5501],
+     "Greece": [39.0742, 21.8243],
+     "Ireland": [53.1424, -7.6921],
+     "Portugal": [39.3999, -8.2245],
+     "Ukraine": [48.3794, 31.1656],
+     "Egypt": [26.8206, 30.8025],
+     "Thailand": [15.8700, 100.9925],
+     "Singapore": [1.3521, 103.8198],
+     "Malaysia": [4.2105, 101.9758],
+     "Vietnam": [14.0583, 108.2772],
+     "Philippines": [12.8797, 121.7740],
+     "Indonesia": [-0.7893, 113.9213],
+     "Saudi Arabia": [23.8859, 45.0792],
+     "United Arab Emirates": [23.4241, 53.8478],
+ }
+
+ STATE_COORDINATES = {
+     "Alabama": [32.806671, -86.791130],
+     "Alaska": [61.370716, -152.404419],
+     "Arizona": [33.729759, -111.431221],
+     "Arkansas": [34.969704, -92.373123],
+     "California": [36.116203, -119.681564],
+     "Colorado": [39.059811, -105.311104],
+     "Connecticut": [41.597782, -72.755371],
+     "Delaware": [39.318523, -75.507141],
+     "District of Columbia": [38.897438, -77.026817],
+     "Florida": [27.766279, -81.686783],
+     "Georgia": [33.040619, -83.643074],
+     "Hawaii": [21.094318, -157.498337],
+     "Idaho": [44.240459, -114.478828],
+     "Illinois": [40.349457, -88.986137],
+     "Indiana": [39.849426, -86.258278],
+     "Iowa": [42.011539, -93.210526],
+     "Kansas": [38.526600, -96.726486],
+     "Kentucky": [37.668140, -84.670067],
+     "Louisiana": [31.169546, -91.867805],
+     "Maine": [44.693947, -69.381927],
+     "Maryland": [39.063946, -76.802101],
+     "Massachusetts": [42.230171, -71.530106],
+     "Michigan": [43.326618, -84.536095],
+     "Minnesota": [45.694454, -93.900192],
+     "Mississippi": [32.741646, -89.678696],
+     "Missouri": [38.456085, -92.288368],
+     "Montana": [46.921925, -110.454353],
+     "Nebraska": [41.125370, -98.268082],
+     "Nevada": [38.313515, -117.055374],
+     "New Hampshire": [43.452492, -71.563896],
+     "New Jersey": [40.298904, -74.521011],
+     "New Mexico": [34.840515, -106.248482],
+     "New York": [42.165726, -74.948051],
+     "North Carolina": [35.630066, -79.806419],
+     "North Dakota": [47.528912, -99.784012],
+     "Ohio": [40.388783, -82.764915],
+     "Oklahoma": [35.565342, -96.928917],
+     "Oregon": [44.572021, -122.070938],
+     "Pennsylvania": [41.203323, -77.194527],
+     "Rhode Island": [41.680893, -71.511780],
+     "South Carolina": [33.856892, -80.945007],
+     "South Dakota": [44.299782, -99.438828],
+     "Tennessee": [35.747845, -86.692345],
+     "Texas": [31.054487, -97.563461],
+     "Utah": [40.150032, -111.862434],
+     "Vermont": [44.045876, -72.710686],
+     "Virginia": [37.769337, -78.169968],
+     "Washington": [47.400902, -121.490494],
+     "West Virginia": [38.491226, -80.954453],
+     "Wisconsin": [44.268543, -89.616508],
+     "Wyoming": [42.755966, -107.302490],
+ }
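Hand-maintained coordinate tables like these are easy to break with a swapped latitude/longitude pair. A small sanity check that could guard both dicts (a sketch; the function is mine and not part of the repo):

```python
from typing import Dict, List


def invalid_entries(coord_map: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Return entries whose [lat, lon] pair falls outside valid ranges."""
    return {
        name: pair
        for name, pair in coord_map.items()
        if not (-90 <= pair[0] <= 90 and -180 <= pair[1] <= 180)
    }
```

Running it over a table in a unit test catches a transposed pair (e.g. `[-97.5, 31.0]` for Texas) before it silently drops markers off the map.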
modules/graph_viz.py ADDED
@@ -0,0 +1,97 @@
+ from streamlit_agraph import Node, Edge, Config
+
+
+ def build_graph(data):
+     """
+     Constructs a knowledge graph from clinical trial data.
+
+     Args:
+         data (list): List of study metadata dictionaries.
+
+     Returns:
+         tuple: (nodes, edges, config) for streamlit-agraph.
+     """
+     nodes = []
+     edges = []
+
+     # Sets to track unique entities
+     study_ids = set()
+     sponsors = set()
+     conditions = set()
+
+     for study in data:
+         nct_id = study.get("nct_id", "Unknown")
+         title = study.get("title", "Unknown")
+         # Use 'sponsor' if available (new ingestion), else fall back to 'org'
+         sponsor = study.get("sponsor", study.get("org", "Unknown"))
+         condition_str = study.get("condition", "")
+
+         # 1. Study Node
+         if nct_id not in study_ids:
+             nodes.append(
+                 Node(
+                     id=nct_id,
+                     label=nct_id,
+                     size=20,
+                     color="#4B8BBE",  # Blue
+                     title=title,
+                     shape="dot",
+                 )
+             )
+             study_ids.add(nct_id)
+
+         # 2. Sponsor Node & Edge
+         if sponsor and sponsor != "Unknown":
+             if sponsor not in sponsors:
+                 nodes.append(
+                     Node(
+                         id=sponsor,
+                         label=sponsor,
+                         size=15,
+                         color="#FF6B6B",  # Red
+                         shape="triangle",
+                     )
+                 )
+                 sponsors.add(sponsor)
+
+             # Edge: Study -> Sponsor
+             edges.append(
+                 Edge(
+                     source=nct_id, target=sponsor, label="sponsored_by", color="#CCCCCC"
+                 )
+             )
+
+         # 3. Condition Nodes & Edges
+         if condition_str:
+             conds = [c.strip() for c in condition_str.split(",") if c.strip()]
+             for cond in conds:
+                 if cond not in conditions:
+                     nodes.append(
+                         Node(
+                             id=cond,
+                             label=cond,
+                             size=15,
+                             color="#6BCB77",  # Green
+                             shape="diamond",
+                         )
+                     )
+                     conditions.add(cond)
+
+                 # Edge: Study -> Condition
+                 edges.append(
+                     Edge(source=nct_id, target=cond, label="studies", color="#CCCCCC")
+                 )
+
+     # Configuration
+     config = Config(
+         width=800,
+         height=600,
+         directed=True,
+         physics=True,
+         hierarchical=False,
+         nodeHighlightBehavior=True,
+         highlightColor="#F7A7A6",
+         collapsible=False,
+     )
+
+     return nodes, edges, config
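The core of `build_graph` is entity de-duplication while still emitting one edge per relationship. The same shape without the `streamlit_agraph` dependency, using plain dicts and a single seen-set (a sketch; names beyond the metadata keys are mine):

```python
def build_graph_data(studies):
    """Dedupe studies/sponsors/conditions into (nodes, edges) tuples."""
    nodes, edges, seen = [], [], set()

    def add_node(node_id, kind):
        # Node IDs are global here, unlike the per-type sets in build_graph
        if node_id and node_id not in seen:
            nodes.append({"id": node_id, "kind": kind})
            seen.add(node_id)

    for s in studies:
        nct = s.get("nct_id", "Unknown")
        add_node(nct, "study")

        sponsor = s.get("sponsor", s.get("org", "Unknown"))
        if sponsor and sponsor != "Unknown":
            add_node(sponsor, "sponsor")
            edges.append((nct, sponsor, "sponsored_by"))  # one edge per study

        for cond in [c.strip() for c in s.get("condition", "").split(",") if c.strip()]:
            add_node(cond, "condition")
            edges.append((nct, cond, "studies"))
    return nodes, edges
```

Using one `seen` set also guards against a condition string that collides with a sponsor name, which the per-type sets in `build_graph` would add twice under the same node id.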
modules/tools.py ADDED
@@ -0,0 +1,706 @@
1
+ """
2
+ LangChain Tools for the Clinical Trial Agent.
3
+
4
+ This module defines the tools that the agent can use to interact with the clinical trial data.
5
+ Tools include:
6
+ 1. **search_trials**: Semantic search with optional strict filtering.
7
+ 2. **find_similar_studies**: Finding studies semantically similar to a given text.
8
+ 3. **get_study_analytics**: Aggregating data for trends and insights (with inline charts).
9
+ """
10
+
11
+ import pandas as pd
12
+ import streamlit as st
13
+ from typing import Optional
14
+ from langchain.tools import tool as langchain_tool
15
+ from llama_index.core.vector_stores import (
16
+ MetadataFilter,
17
+ MetadataFilters,
18
+ FilterOperator,
19
+ )
20
+ from llama_index.core import Settings
21
+ from llama_index.core.postprocessor import MetadataReplacementPostProcessor
22
+ from llama_index.core.postprocessor import SentenceTransformerRerank
23
+ from llama_index.core.query_engine import SubQuestionQueryEngine
24
+ from llama_index.core.tools import QueryEngineTool, ToolMetadata
25
+ from modules.utils import (
26
+ load_index,
27
+ normalize_sponsor,
28
+ get_sponsor_variations,
29
+ get_hybrid_retriever,
30
+ )
31
+ import re
32
+ import traceback
33
+
34
+ # --- Tools ---
35
+
36
+
37
+ def expand_query(query: str) -> str:
38
+ """Expands a search query with synonyms using the LLM."""
39
+ if not query or len(query.split()) > 10: # Skip expansion for long queries
40
+ return query
41
+
42
+ # Skip expansion if it looks like an NCT ID
43
+ if re.search(r"NCT\d+", query, re.IGNORECASE):
44
+ return query
45
+
46
+ prompt = (
47
+ f"You are a helpful medical assistant. "
48
+ f"Expand the following search query with relevant medical synonyms and acronyms. "
49
+ f"Return ONLY the expanded query string combined with OR operators. "
50
+ f"Do not add any explanation.\n\n"
51
+ f"Query: {query}\n"
52
+ f"Expanded Query:"
53
+ )
54
+ try:
55
+ # Use the global Settings.llm
56
+ if not Settings.llm:
57
+ # Fallback if not initialized (though load_index does it)
58
+ from modules.utils import setup_llama_index
59
+
60
+ setup_llama_index()
61
+
62
+ response = Settings.llm.complete(prompt)
63
+ expanded = response.text.strip()
64
+ # Clean up if LLM is chatty
65
+ if "Expanded Query:" in expanded:
66
+ expanded = expanded.split("Expanded Query:")[-1].strip()
67
+
68
+ if not expanded:
69
+ print(f"⚠️ Expansion returned empty. Using original query.")
70
+ return query
71
+
72
+ print(f"✨ Expanded Query: '{query}' -> '{expanded}'")
73
+ return expanded
74
+ except Exception as e:
75
+ print(f"⚠️ Query expansion failed: {e}")
76
+ return query
77
+
78
+
79
+ @langchain_tool("search_trials")
80
+ def search_trials(
81
+ query: str = None,
82
+ status: str = None,
83
+ phase: str = None,
84
+ sponsor: str = None,
85
+ intervention: str = None,
86
+ year: int = None,
87
+ ):
88
+ """
89
+ Searches for clinical trials using semantic search with robust filtering.
90
+
91
+ Args:
92
+ query (str, optional): The natural language search query.
93
+ status (str, optional): Filter by recruitment status.
94
+ phase (str, optional): Filter by trial phase.
95
+ sponsor (str, optional): Filter by sponsor name.
96
+ intervention (str, optional): Filter by intervention/drug name.
97
+ year (int, optional): Filter for studies starting on or after this year.
98
+
99
+ Returns:
100
+ str: A structured list of relevant studies.
101
+ """
102
+ index = load_index()
103
+
104
+ # Constants
105
+ TOP_K_STRICT = 500 # High recall for pre-filtered search
106
+
107
+ # --- Query Construction ---
108
+ if not query:
109
+ parts = [p for p in [sponsor, intervention, phase, status] if p]
110
+ query = " ".join(parts) if parts else "clinical trial"
111
+ else:
112
+ # Inject context for vector search
113
+ if sponsor and normalize_sponsor(sponsor).lower() not in query.lower():
114
+ query = f"{normalize_sponsor(sponsor)} {query}"
115
+ if intervention and intervention.lower() not in query.lower():
116
+ query = f"{intervention} {query}"
117
+
118
+ query = expand_query(query)
119
+
120
+ print(f"🔍 Tool Called: search_trials(query='{query}', sponsor='{sponsor}')")
121
+
122
+ # --- Strategy 1: Strict Pre-Retrieval Filtering (High Precision) ---
123
+ # Filter by Sponsor/Status/Year at the database level first.
124
+ pre_filters = []
125
+
126
+ # NCT ID Match
127
+ nct_match = re.search(r"NCT\d+", query, re.IGNORECASE)
128
+ if nct_match:
129
+ nct_id = nct_match.group(0).upper()
130
+ pre_filters.append(MetadataFilter(key="nct_id", value=nct_id, operator=FilterOperator.EQ))
131
+
132
+ if status:
133
+ pre_filters.append(MetadataFilter(key="status", value=status.upper(), operator=FilterOperator.EQ))
134
+ if year:
135
+ pre_filters.append(MetadataFilter(key="start_year", value=year, operator=FilterOperator.GTE))
136
+
137
+ # Sponsor Pre-Filter
138
+ if sponsor:
139
+ from modules.utils import get_sponsor_variations
140
+ variations = get_sponsor_variations(sponsor)
141
+ if variations:
142
+ print(f"🎯 Applying strict pre-filter for sponsor '{sponsor}' ({len(variations)} variants)")
143
+ # Use 'sponsor' field which is the Lead Sponsor
144
+ pre_filters.append(MetadataFilter(key="sponsor", value=variations, operator=FilterOperator.IN))
145
+ else:
146
+ print(f"⚠️ No strict mapping for sponsor '{sponsor}'. Will rely on fuzzy post-filtering.")
147
+
148
+ metadata_filters = MetadataFilters(filters=pre_filters) if pre_filters else None
149
+
150
+ # Post-processors (Reranking)
151
+ reranker = SentenceTransformerRerank(model="cross-encoder/ms-marco-MiniLM-L-12-v2", top_n=50)
152
+
153
+ # --- HYBRID SEARCH IMPLEMENTATION ---
154
+ # Combine Vector + BM25 using get_hybrid_retriever
155
+ try:
156
+ retriever = get_hybrid_retriever(index, similarity_top_k=TOP_K_STRICT, filters=metadata_filters)
157
+ nodes = retriever.retrieve(query)
158
+
159
+ # (QueryFusionRetriever returns nodes, but we want to rerank them)
160
+ if nodes:
161
+ from llama_index.core.schema import QueryBundle
162
+ nodes = reranker.postprocess_nodes(nodes, query_bundle=QueryBundle(query_str=query))
163
+
164
+ except Exception as e:
165
+ print(f"⚠️ Hybrid search failed: {e}. Falling back to standard vector search.")
166
+ traceback.print_exc()
167
+ query_engine = index.as_query_engine(
168
+ similarity_top_k=TOP_K_STRICT,
169
+ filters=metadata_filters,
170
+ node_postprocessors=[reranker]
171
+ )
172
+ response = query_engine.query(query)
173
+ nodes = response.source_nodes
174
+
175
+ # --- Strict Metadata Filtering (Post-Fusion) ---
176
+ # BM25 results might not respect the vector filters, so filter them out.
177
+ final_nodes = []
178
+ for node in nodes:
179
+ meta = node.metadata
180
+ keep = True
181
+
182
+ # Re-apply filters to ensure BM25 results are valid
183
+ if status and meta.get("status", "").upper() != status.upper():
184
+ keep = False
185
+ if year:
186
+ try:
187
+ if int(meta.get("start_year", 0)) < year:
188
+ keep = False
189
+ except:
190
+ pass
191
+ if sponsor:
192
+ # Strict logic for sponsor in pre-filters is ignored by BM25.
193
+ # Check if the sponsor matches one of the variations OR fuzzy match
194
+ # If strict variations exist, enforce them.
195
+ variations = get_sponsor_variations(sponsor)
196
+ node_sponsor = meta.get("sponsor", "")
197
+ # Fallback to org if sponsor is missing (legacy data)
198
+ if not node_sponsor:
199
+ node_sponsor = meta.get("org", "")
200
+
201
+ if variations:
202
+ if node_sponsor not in variations:
203
+ keep = False
204
+ else:
205
+ # Fuzzy fallback
206
+ if normalize_sponsor(sponsor).lower() not in normalize_sponsor(node_sponsor).lower():
207
+ keep = False
208
+
209
+ if keep:
210
+ final_nodes.append(node)
211
+
212
+ nodes = final_nodes
213
+
214
+ # --- Strict Keyword Filtering ---
215
+ # BM25 handles keyword relevance naturally, so rely on the Hybrid Search + Reranker
216
+ # rather than applying an aggressive substring check here.
217
+
218
+ # Update response object structure to match expected format if we used retriever
219
+ class MockResponse:
220
+ def __init__(self, nodes):
221
+ self.source_nodes = nodes
222
+
223
+ response = MockResponse(nodes)
224
+
225
+ # --- Strategy 2: Hybrid Search (Fallback) ---
226
+ # Hybrid Search is enabled by default.
227
+ # Strict filters are handled in post-processing above.
228
+
229
+
230
+ # --- Formatting Output ---
231
+ if not response.source_nodes:
232
+ return "No matching studies found. Try broadening your search terms or filters."
233
+
234
+ # Filter by Relevance Score for display
235
+ MIN_SCORE = 1.5
236
+ relevant_nodes = [node for node in response.source_nodes if node.score > MIN_SCORE]
237
+
238
+ # If strict filtering removes too much, show at least top 3 to be helpful
239
+ if len(relevant_nodes) < 3 and len(response.source_nodes) > 0:
240
+ relevant_nodes = response.source_nodes[:3]
241
+
242
+ display_limit = 20
243
+ display_nodes = relevant_nodes[:display_limit]
244
+
245
+ results = []
246
+ for node in display_nodes:
247
+ meta = node.metadata
248
+ entry = (
249
+ f"**{meta.get('title', 'Untitled')}**\n"
250
+ f" - ID: {meta.get('nct_id')}\n"
251
+ f" - Phase: {meta.get('phase', 'N/A')}\n"
252
+ f" - Status: {meta.get('status', 'N/A')}\n"
253
+ f" - Sponsor: {meta.get('sponsor', meta.get('org', 'Unknown'))}\n"
254
+ f" - Relevance: {node.score:.2f}"
255
+ )
256
+ results.append(entry)
257
+
258
+ return f"Found {len(results)} relevant studies:\n\n" + "\n\n".join(results)
259
+
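The score-threshold display logic above can be sketched in isolation (the `select_display` helper name is hypothetical; `MIN_SCORE` and the limits mirror the values used in the tool):

```python
# Sketch of the display filter: keep results scoring over MIN_SCORE,
# but fall back to the top 3 when the cutoff would leave fewer than 3.
MIN_SCORE = 1.5
DISPLAY_LIMIT = 20

def select_display(scored):
    """scored: list of (id, score) pairs, already sorted best-first."""
    relevant = [item for item in scored if item[1] > MIN_SCORE]
    if len(relevant) < 3 and scored:
        relevant = scored[:3]  # be helpful even when the cutoff is too strict
    return relevant[:DISPLAY_LIMIT]

print(select_display([("NCT1", 3.2), ("NCT2", 1.1), ("NCT3", 0.4)]))
# [('NCT1', 3.2), ('NCT2', 1.1), ('NCT3', 0.4)]  (top-3 fallback kicked in)
```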
260
+
261
+ @langchain_tool("find_similar_studies")
262
+ def find_similar_studies(query: str):
263
+ """
264
+ Finds studies semantically similar to a given query or study description.
265
+
266
+ This tool is useful for "more like this" functionality. It relies purely
267
+ on vector similarity without strict metadata filtering.
268
+
269
+ Args:
270
+ query (str): The text to match against (e.g., a study title or description).
271
+
272
+ Returns:
273
+ str: A string containing the top 5 similar studies with their titles and summaries.
274
+ """
275
+ index = load_index()
276
+
277
+ # 1. Check if query is an NCT ID
278
+ nct_match = re.search(r"NCT\d+", query, re.IGNORECASE)
279
+ target_nct = None
280
+ search_text = query
281
+
282
+ if nct_match:
283
+ target_nct = nct_match.group(0).upper()
284
+ print(f"🎯 Detected NCT ID for similarity: {target_nct}")
285
+
286
+ # Fetch the study content to use as the semantic query
287
+ # Use the vector store directly to get the text
288
+ retriever = index.as_retriever(
289
+ filters=MetadataFilters(
290
+ filters=[MetadataFilter(key="nct_id", value=target_nct, operator=FilterOperator.EQ)]
291
+ ),
292
+ similarity_top_k=1
293
+ )
294
+ nodes = retriever.retrieve(target_nct)
295
+
296
+ if nodes:
297
+ # Use the study's text (Title + Summary) as the query
298
+ search_text = nodes[0].text
299
+ print(f"✅ Found study content. Using {len(search_text)} chars for semantic search.")
300
+ else:
301
+ print(f"⚠️ Study {target_nct} not found. Falling back to text search.")
302
+
303
+ # 2. Perform Semantic Search
304
+ # Fetch more candidates (10) to allow for filtering
305
+ retriever = index.as_retriever(similarity_top_k=10)
306
+ nodes = retriever.retrieve(search_text)
307
+
308
+ results = []
309
+ count = 0
310
+ for node in nodes:
311
+ # 3. Self-Exclusion
312
+ if target_nct and node.metadata.get("nct_id") == target_nct:
313
+ continue
314
+
315
+ # Deduplication (if multiple chunks of same study appear)
316
+ if any(r["nct_id"] == node.metadata.get("nct_id") for r in results):
317
+ continue
318
+
319
+ results.append({
320
+ "nct_id": node.metadata.get("nct_id"),
321
+ "text": f"Study: {node.metadata['title']} (NCT: {node.metadata.get('nct_id')})\nScore: {node.score:.4f}\nSummary: {node.text[:200]}..."
322
+ })
323
+
324
+ count += 1
325
+ if count >= 5: # Limit to top 5 unique results
326
+ break
327
+
328
+ if not results:
329
+ return "No similar studies found."
330
+
331
+ return "\n\n".join([r["text"] for r in results])
332
+
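The self-exclusion and chunk-deduplication loop above can be sketched standalone (hypothetical `top_similar` helper; nodes reduced to plain dicts for illustration):

```python
# Sketch of the similarity post-processing: drop the query study itself,
# collapse multiple chunks of the same study, and cap at 5 unique results.
def top_similar(nodes, target_nct=None, limit=5):
    results, seen = [], set()
    for n in nodes:
        nct = n["nct_id"]
        if target_nct and nct == target_nct:
            continue  # self-exclusion: skip the study we queried with
        if nct in seen:
            continue  # dedupe extra chunks of the same study
        seen.add(nct)
        results.append(nct)
        if len(results) >= limit:
            break
    return results

print(top_similar(
    [{"nct_id": "NCT1"}, {"nct_id": "NCT2"}, {"nct_id": "NCT2"}, {"nct_id": "NCT3"}],
    target_nct="NCT1",
))  # ['NCT2', 'NCT3']
```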
333
+
334
+ def fetch_study_analytics_data(
335
+ query: str,
336
+ group_by: str,
337
+ phase: Optional[str] = None,
338
+ status: Optional[str] = None,
339
+ sponsor: Optional[str] = None,
340
+ intervention: Optional[str] = None,
341
+ start_year: Optional[int] = None,
342
+ study_type: Optional[str] = None,
343
+ ) -> str:
344
+ """
345
+ Underlying logic for fetching and aggregating clinical trial data.
346
+ See get_study_analytics for full docstring.
347
+ """
348
+ index = load_index()
349
+
350
+ # 1. Retrieve Data
351
+ if query.lower() == "overall":
352
+ try:
353
+ # Connect to LanceDB directly for speed
354
+ import lancedb
355
+ db = lancedb.connect("./ct_gov_lancedb")
356
+ tbl = db.open_table("clinical_trials")
357
+ # Fetch all data as pandas DataFrame
358
+ df = tbl.to_pandas()
359
+
360
+ # LlamaIndex stores metadata in a 'metadata' column (usually as a dict/struct)
361
+ # We need to flatten it to get columns like 'status', 'phase', etc.
362
+ if "metadata" in df.columns:
363
+ # Check if it's already a dict or needs parsing
364
+ # LanceDB to_pandas() converts struct to dict
365
+ meta_df = pd.json_normalize(df["metadata"])
366
+ df = meta_df
367
+
368
+ # If columns are already flat (depending on schema evolution), we are good.
369
+ # But usually it's nested.
370
+
371
+ except Exception as e:
372
+ return f"Error fetching full dataset: {e}"
373
+ else:
374
+ filters = []
375
+ if status:
376
+ filters.append(
377
+ MetadataFilter(
378
+ key="status", value=status.upper(), operator=FilterOperator.EQ
379
+ )
380
+ )
381
+ # Phase is intentionally not pre-filtered here; it is normalized and
+ # applied in the pandas post-filter below.
383
+
384
+ if sponsor:
385
+ # Use the helper to get all variations (e.g. "Pfizer" -> ["Pfizer", "Pfizer Inc."])
386
+ sponsor_variations = get_sponsor_variations(sponsor)
387
+ if sponsor_variations:
388
+ print(f"🎯 Using strict pre-filter for sponsor '{sponsor}': {len(sponsor_variations)} variations found.")
389
+ filters.append(
390
+ MetadataFilter(
391
+ key="sponsor", value=sponsor_variations, operator=FilterOperator.IN
392
+ )
393
+ )
394
+
395
+ metadata_filters = MetadataFilters(filters=filters) if filters else None
396
+
397
+ search_query = query
398
+ if sponsor and sponsor.lower() not in query.lower():
399
+ search_query = f"{sponsor} {query}"
400
+
401
+ # Use hybrid search for better recall
402
+ retriever = index.as_retriever(
403
+ similarity_top_k=5000,
404
+ filters=metadata_filters,
405
+ vector_store_query_mode="hybrid"
406
+ )
407
+ nodes = retriever.retrieve(search_query)
408
+
409
+ # --- Strict Keyword Filtering ---
410
+ # Strictly check if the query appears in Title or Conditions to ensure accurate counting.
411
+ # EXCEPTION: If the query matches the requested sponsor, we also check the 'org' field.
412
+ if query.lower() != "overall":
413
+ q_term = query.lower()
414
+
415
+ # Check if the query is essentially the sponsor name
416
+ is_sponsor_query = False
417
+
418
+ # Check if the query itself normalizes to a known sponsor
419
+ query_normalized = normalize_sponsor(query)
420
+ if query_normalized and query_normalized != query:
421
+ # If normalization changed it (or found a mapping), it's likely a sponsor
422
+ is_sponsor_query = True
423
+
424
+ if sponsor:
425
+ # Normalize both to see if they refer to the same entity
426
+ norm_query = normalize_sponsor(query)
427
+ norm_sponsor = normalize_sponsor(sponsor)
428
+
429
+ if norm_query and norm_sponsor and norm_query.lower() == norm_sponsor.lower():
430
+ is_sponsor_query = True
431
+ elif sponsor.lower() in query.lower() or query.lower() in sponsor.lower():
432
+ is_sponsor_query = True
433
+
434
+ filtered_nodes = []
435
+ for node in nodes:
436
+ meta = node.metadata
437
+ title = meta.get("title", "").lower()
438
+ conditions = meta.get("condition", "").lower() # Note: key is 'condition' in DB
439
+ org = meta.get("org", "").lower()
440
+ sponsor_val = meta.get("sponsor", "").lower()
441
+
442
+ # If it's a sponsor query, we allow matches on the Organization OR Sponsor field
443
+ # AND we check if the normalized values match (handling aliases like J&J -> Janssen)
444
+ match = False
445
+ if q_term in title or q_term in conditions:
446
+ match = True
447
+ elif is_sponsor_query:
448
+ # Check raw match
449
+ if q_term in org or q_term in sponsor_val:
450
+ match = True
451
+ else:
452
+ # Check normalized match
453
+ norm_org = normalize_sponsor(org)
454
+ norm_val = normalize_sponsor(sponsor_val)
455
+
456
+ # Compare against the normalized query (which is the sponsor in this case)
457
+ target_norm = norm_sponsor if sponsor else query_normalized
458
+
459
+ if norm_org and target_norm and norm_org.lower() == target_norm.lower():
460
+ match = True
461
+ elif norm_val and target_norm and norm_val.lower() == target_norm.lower():
462
+ match = True
463
+
464
+ if match:
465
+ filtered_nodes.append(node)
466
+
467
+ print(f"📉 Strict Filter: {len(nodes)} -> {len(filtered_nodes)} nodes for '{query}'")
468
+ nodes = filtered_nodes
469
+
470
+ data = [node.metadata for node in nodes]
471
+ df = pd.DataFrame(data)
472
+
473
+ if "nct_id" in df.columns:
474
+ df = df.drop_duplicates(subset="nct_id")
475
+
476
+ if df.empty:
477
+ return "No studies found for analytics."
478
+
479
+ # --- APPLY FILTERS (Pandas) ---
480
+ if phase:
481
+ target_phases = [p.strip().upper().replace(" ", "") for p in phase.split(",")]
482
+ df["phase_upper"] = df["phase"].astype(str).str.upper().str.replace(" ", "")
483
+ mask = df["phase_upper"].apply(lambda x: any(tp in x for tp in target_phases))
484
+ df = df[mask]
485
+
486
+ if status:
487
+ df = df[df["status"].str.upper() == status.upper()]
488
+
489
+ if sponsor:
490
+ target_sponsor = normalize_sponsor(sponsor).lower()
491
+ # Use 'sponsor' column if it exists, otherwise fallback to 'org'
492
+ if "sponsor" in df.columns:
493
+ df["sponsor_check"] = df["sponsor"].fillna(df["org"]).astype(str).apply(normalize_sponsor).str.lower()
494
+ else:
495
+ df["sponsor_check"] = df["org"].astype(str).apply(normalize_sponsor).str.lower()
496
+
497
+ df = df[df["sponsor_check"].str.contains(target_sponsor, regex=False)]
498
+
499
+ if intervention:
500
+ target_intervention = intervention.lower()
501
+ df["intervention_lower"] = df["intervention"].astype(str).str.lower()
502
+ df = df[df["intervention_lower"].str.contains(target_intervention, regex=False)]
503
+
504
+ if start_year:
505
+ df["start_year"] = pd.to_numeric(df["start_year"], errors="coerce").fillna(0)
506
+ df = df[df["start_year"] >= start_year]
507
+
508
+ if study_type:
509
+ df = df[df["study_type"].str.upper() == study_type.upper()]
510
+
511
+ if df.empty:
512
+ return "No studies found after applying filters."
513
+
514
+ key_map = {
515
+ "phase": "phase",
516
+ "status": "status",
517
+ "sponsor": "sponsor" if "sponsor" in df.columns else "org",
518
+ "start_year": "start_year",
519
+ "condition": "condition",
520
+ "intervention": "intervention",
521
+ "study_type": "study_type",
522
+ "country": "country",
523
+ "state": "state",
524
+ }
525
+
526
+ if group_by not in key_map:
527
+ return f"Invalid group_by field: {group_by}. Valid options: phase, status, sponsor, start_year, condition, intervention, study_type, country, state"
528
+
529
+ col = key_map[group_by]
530
+
531
+ if col == "start_year":
532
+ df[col] = pd.to_numeric(df[col], errors="coerce")
533
+ counts = df[col].value_counts().sort_index()
534
+ elif col == "condition":
535
+ counts = df[col].astype(str).str.split(", ").explode().value_counts().head(10)
536
+ elif col == "intervention":
537
+ all_interventions = []
538
+ for interventions in df[col].dropna():
539
+ parts = [i.strip() for i in interventions.split(";") if i.strip()]
540
+ all_interventions.extend(parts)
541
+ counts = pd.Series(all_interventions).value_counts().head(10)
542
+ else:
543
+ counts = df[col].value_counts().head(10)
544
+
545
+ summary = counts.to_string()
546
+
547
+ chart_df = counts.reset_index()
548
+ chart_df.columns = ["category", "count"]
549
+
550
+ chart_data = {
551
+ "type": "bar",
552
+ "title": f"Studies by {group_by.capitalize()}",
553
+ "data": chart_df.to_dict("records"),
554
+ "x": "category",
555
+ "y": "count",
556
+ }
557
+
558
+ if "inline_chart_data" not in st.session_state:
559
+ st.session_state["inline_chart_data"] = chart_data
560
+ else:
561
+ st.session_state["inline_chart_data"] = chart_data
562
+
563
+ return f"Found {len(df)} studies. Top counts:\n{summary}\n\n(Chart generated in UI)"
564
+
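The semicolon-split intervention counting above can be sketched without pandas (a pure-Python stand-in for the explode/`value_counts` pipeline):

```python
from collections import Counter

# Interventions are stored as semicolon-separated strings; split, strip,
# drop empties, then count occurrences across all rows.
rows = ["Drug A; Placebo", "Drug A", None, "Drug B;"]
all_interventions = []
for interventions in rows:
    if interventions is None:
        continue  # mirrors df[col].dropna()
    all_interventions.extend(i.strip() for i in interventions.split(";") if i.strip())
counts = Counter(all_interventions)
print(counts.most_common(3))  # [('Drug A', 2), ('Placebo', 1), ('Drug B', 1)]
```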
565
+
566
+ @langchain_tool("get_study_analytics")
567
+ def get_study_analytics(
568
+ query: str,
569
+ group_by: str,
570
+ phase: Optional[str] = None,
571
+ status: Optional[str] = None,
572
+ sponsor: Optional[str] = None,
573
+ intervention: Optional[str] = None,
574
+ start_year: Optional[int] = None,
575
+ study_type: Optional[str] = None,
576
+ ):
577
+ """
578
+ Aggregates clinical trial data based on a search query and groups by a specific field.
579
+
580
+ This tool performs the following steps:
581
+ 1. Retrieves a large number of relevant studies (up to 5000).
582
+ 2. Applies strict filters (Phase, Status, Sponsor) in memory (Pandas).
583
+ 3. Groups the data by the requested field (e.g., Sponsor).
584
+ 4. Generates a summary string for the LLM.
585
+ 5. **Side Effect**: Injects chart data into `st.session_state` to trigger an inline chart in the UI.
586
+
587
+ Args:
588
+ query (str): The search query to filter studies (e.g., "cancer").
589
+ group_by (str): The field to group by. Options: "phase", "status", "sponsor", "start_year", "condition", "intervention", "study_type", "country", "state".
590
+ phase (Optional[str]): Optional filter for phase (e.g., "PHASE2").
591
+ status (Optional[str]): Optional filter for status (e.g., "RECRUITING").
592
+ sponsor (Optional[str]): Optional filter for sponsor (e.g., "Pfizer").
593
+ intervention (Optional[str]): Optional filter for intervention (e.g., "Keytruda").
+ start_year (Optional[int]): Optional filter for minimum start year (e.g., 2020).
+ study_type (Optional[str]): Optional filter for study type (e.g., "INTERVENTIONAL").
594
+
595
+ Returns:
596
+ str: A summary string of the top counts and a note that a chart has been generated.
597
+ """
598
+ return fetch_study_analytics_data(
599
+ query=query,
600
+ group_by=group_by,
601
+ phase=phase,
602
+ status=status,
603
+ sponsor=sponsor,
604
+ intervention=intervention,
605
+ start_year=start_year,
606
+ study_type=study_type,
607
+ )
608
+
609
+
610
+ @langchain_tool("compare_studies")
611
+ def compare_studies(query: str):
612
+ """
613
+ Compares multiple studies or answers complex multi-part questions using query decomposition.
614
+
615
+ Use this tool when the user asks to "compare", "contrast", or analyze differences/similarities
616
+ between specific studies, sponsors, or phases. It breaks down the question into sub-questions.
617
+
618
+ Args:
619
+ query (str): The complex comparison query (e.g., "Compare the primary outcomes of Keytruda vs Opdivo").
620
+
621
+ Returns:
622
+ str: A detailed response synthesizing the answers to sub-questions.
623
+ """
624
+ index = load_index()
625
+
626
+ # Create a base query engine for the sub-questions
627
+ # Increase top_k and add re-ranking to improve recall for comparison queries
628
+ reranker = SentenceTransformerRerank(model="cross-encoder/ms-marco-MiniLM-L-12-v2", top_n=10)
629
+
630
+ base_engine = index.as_query_engine(
631
+ similarity_top_k=50,
632
+ node_postprocessors=[reranker]
633
+ )
634
+
635
+ # Wrap it in a QueryEngineTool
636
+ query_tool = QueryEngineTool(
637
+ query_engine=base_engine,
638
+ metadata=ToolMetadata(
639
+ name="clinical_trials_db",
640
+ description="Vector database of clinical trial protocols, results, and metadata.",
641
+ ),
642
+ )
643
+
644
+ # Create the SubQuestionQueryEngine
645
+ # Explicitly define the question generator to use the configured LLM (Gemini)
646
+ # This avoids the default behavior which might try to import OpenAI modules
647
+ from llama_index.core.question_gen import LLMQuestionGenerator
648
+ from llama_index.core import Settings
649
+
650
+ question_gen = LLMQuestionGenerator.from_defaults(llm=Settings.llm)
651
+
652
+ query_engine = SubQuestionQueryEngine.from_defaults(
653
+ query_engine_tools=[query_tool],
654
+ question_gen=question_gen,
655
+ use_async=True,
656
+ )
657
+
658
+ try:
659
+ response = query_engine.query(query)
660
+ return str(response) + "\n\n(Note: This analysis is based on the most relevant studies retrieved from the database, not necessarily an exhaustive list.)"
661
+ except Exception as e:
662
+ return f"Error during comparison: {e}"
663
+
664
+
665
+ @langchain_tool("get_study_details")
666
+ def get_study_details(nct_id: str):
667
+ """
668
+ Retrieves the full details of a specific clinical trial by its NCT ID.
669
+
670
+ Use this tool when the user asks for specific information about a single study,
671
+ such as "What are the inclusion criteria for NCT12345678?" or "Give me a summary of study NCT...".
672
+ It returns the full text content of the study document, including criteria, outcomes, and contacts.
673
+
674
+ Args:
675
+ nct_id (str): The NCT ID of the study (e.g., "NCT01234567").
676
+
677
+ Returns:
678
+ str: The full text content of the study, or a message if not found.
679
+ """
680
+ index = load_index()
681
+
682
+ # Clean the ID
683
+ clean_id = nct_id.strip().upper()
684
+
685
+ # Use a retriever with a strict metadata filter for the ID
686
+ # Set top_k=20 to capture all chunks if the document was split
687
+ filters = MetadataFilters(
688
+ filters=[
689
+ MetadataFilter(key="nct_id", value=clean_id, operator=FilterOperator.EQ)
690
+ ]
691
+ )
692
+
693
+ retriever = index.as_retriever(similarity_top_k=20, filters=filters)
694
+ nodes = retriever.retrieve(clean_id)
695
+
696
+ if not nodes:
697
+ return f"Study {clean_id} not found in the database."
698
+
699
+ # Sort nodes by their position in the document to reconstruct full text
700
+ # LlamaIndex nodes usually have 'start_char_idx' in metadata or relationships
701
+ # Try to sort by node ID or just concatenate them
702
+
703
+ # Simple concatenation (assuming retrieval order is roughly correct or sufficient)
704
+ full_text = "\n\n".join([node.text for node in nodes])
705
+
706
+ return f"Details for {clean_id} (Combined {len(nodes)} parts):\n\n{full_text}"
modules/utils.py ADDED
@@ -0,0 +1,281 @@
1
+ """
2
+ Utility functions for the Clinical Trial Agent.
3
+
4
+ Handles configuration, LanceDB index loading, data normalization, and custom filtering logic.
5
+ """
6
+
7
+ import os
8
+ import streamlit as st
9
+ from typing import List, Optional
10
+ from llama_index.core import VectorStoreIndex, StorageContext, Settings
11
+ from llama_index.embeddings.huggingface import HuggingFaceEmbedding
12
+ from llama_index.vector_stores.lancedb import LanceDBVectorStore
13
+ from llama_index.llms.gemini import Gemini
14
+ import lancedb
15
+ from dotenv import load_dotenv
16
+
17
+ # --- MONKEYPATCH START ---
18
+ # Patch LanceDBVectorStore to handle 'nprobes' AttributeError and fix SQL quoting for IN filters.
19
+ original_query = LanceDBVectorStore.query
20
+
21
+ def patched_query(self, query, **kwargs):
22
+ try:
23
+ return original_query(self, query, **kwargs)
24
+ except Exception as e:
25
+ print(f"⚠️ LanceDB Query Error: {e}")
26
+ if hasattr(query, "filters"):
27
+ print(f" Filters: {query.filters}")
28
+
29
+ if "nprobes" in str(e):
30
+ from llama_index.core.vector_stores.types import VectorStoreQueryResult
31
+ return VectorStoreQueryResult(nodes=[], similarities=[], ids=[])
32
+ raise e
33
+
34
+ LanceDBVectorStore.query = patched_query
35
+
36
+ # Patch _to_lance_filter to fix SQL quoting for IN operator with strings.
37
+ from llama_index.vector_stores.lancedb import base as lancedb_base
38
+ from llama_index.core.vector_stores.types import FilterOperator
39
+
40
+ original_to_lance_filter = lancedb_base._to_lance_filter
41
+
42
+ def patched_to_lance_filter(standard_filters, metadata_keys):
43
+ if not standard_filters:
44
+ return None
45
+
46
+ # Reimplement filter logic to ensure correct SQL generation for LanceDB
47
+ filters = []
48
+ for filter in standard_filters.filters:
49
+ key = filter.key
50
+ if metadata_keys and key not in metadata_keys:
51
+ continue
52
+
53
+ # Prefix key with 'metadata.' for LanceDB struct column
54
+ lance_key = f"metadata.{key}"
55
+
56
+ # Handle IN operator with proper string quoting
57
+ if filter.operator == FilterOperator.IN:
58
+ if isinstance(filter.value, list):
59
+ # Quote strings properly
60
+ values = []
61
+ for v in filter.value:
62
+ if isinstance(v, str):
63
+ values.append("'" + v.replace("'", "''") + "'")  # Quote for SQL, doubling embedded single quotes
64
+ else:
65
+ values.append(str(v))
66
+ val_str = ", ".join(values)
67
+ filters.append(f"{lance_key} IN ({val_str})")
68
+ continue
69
+
70
+ # Standard operators
71
+ op = filter.operator
72
+ val = filter.value
73
+
74
+ if op == FilterOperator.EQ:
75
+ if isinstance(val, str):
76
+ filters.append(f"{lance_key} = '{val}'")
77
+ else:
78
+ filters.append(f"{lance_key} = {val}")
79
+ elif op == FilterOperator.GT:
80
+ filters.append(f"{lance_key} > {val}")
81
+ elif op == FilterOperator.LT:
82
+ filters.append(f"{lance_key} < {val}")
83
+ elif op == FilterOperator.GTE:
84
+ filters.append(f"{lance_key} >= {val}")
85
+ elif op == FilterOperator.LTE:
86
+ filters.append(f"{lance_key} <= {val}")
87
+ elif op == FilterOperator.NE:
88
+ if isinstance(val, str):
89
+ filters.append(f"{lance_key} != '{val}'")
90
+ else:
91
+ filters.append(f"{lance_key} != {val}")
92
+ # Add other operators as needed
93
+
94
+ if not filters:
95
+ return None
96
+
97
+ return " AND ".join(filters)
98
+
99
+ lancedb_base._to_lance_filter = patched_to_lance_filter
100
+ # --- MONKEYPATCH END ---
101
+
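The SQL string the patched filter builder emits can be illustrated standalone (simplified: filters as `(key, op, value)` tuples instead of llama_index `MetadataFilter` objects; `to_lance_where` is a hypothetical stand-in):

```python
# Standalone sketch of the metadata-filter -> LanceDB SQL translation above.
def to_lance_where(filters):
    clauses = []
    for key, op, value in filters:
        col = f"metadata.{key}"  # metadata lives in a struct column
        if op == "IN":
            quoted = ", ".join(
                "'" + v + "'" if isinstance(v, str) else str(v) for v in value
            )
            clauses.append(f"{col} IN ({quoted})")
        elif op == "==":
            rhs = "'" + value + "'" if isinstance(value, str) else str(value)
            clauses.append(f"{col} = {rhs}")
    return " AND ".join(clauses) or None

where = to_lance_where([
    ("sponsor", "IN", ["Pfizer", "Pfizer Inc."]),
    ("status", "==", "RECRUITING"),
])
print(where)
# metadata.sponsor IN ('Pfizer', 'Pfizer Inc.') AND metadata.status = 'RECRUITING'
```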
102
+
103
+ def load_environment():
104
+ """Loads environment variables from .env file."""
105
+ load_dotenv()
106
+
107
+
108
+ # --- Configuration ---
109
+ def setup_llama_index(api_key: Optional[str] = None):
110
+ """
111
+ Configures global LlamaIndex settings (LLM and Embeddings).
112
+ """
113
+ # Use passed key, or fallback to env var
114
+ final_key = api_key or os.environ.get("GOOGLE_API_KEY")
115
+
116
+ if not final_key:
117
+ # App handles prompting for key, so we just return or log warning
118
+ pass
119
+
120
+ try:
121
+ # Pass the key explicitly if available
122
+ Settings.llm = Gemini(model="models/gemini-2.5-flash", temperature=0, api_key=final_key)
123
+ except Exception as e:
124
+ print(f"⚠️ LLM initialization failed (likely missing API key): {e}")
125
+ print("⚠️ Using MockLLM for testing/fallback.")
126
+ from llama_index.core.llms import MockLLM
127
+ Settings.llm = MockLLM()
128
+
129
+ Settings.embed_model = HuggingFaceEmbedding(
130
+ model_name="pritamdeka/S-PubMedBert-MS-MARCO"
131
+ )
132
+
133
+
134
+ @st.cache_resource
135
+ def load_index() -> VectorStoreIndex:
136
+ """
137
+ Loads and caches the persistent LanceDB index.
138
+ """
139
+ setup_llama_index()
140
+
141
+ # Initialize LanceDB
142
+ db_path = "./ct_gov_lancedb"
143
+ db = lancedb.connect(db_path)
144
+
145
+ # Define metadata keys explicitly to ensure filters work
146
+ metadata_keys = [
147
+ "nct_id", "title", "org", "sponsor", "status", "phase",
148
+ "study_type", "start_year", "condition", "intervention",
149
+ "country", "state"
150
+ ]
151
+
152
+ # Create the vector store wrapper
153
+ vector_store = LanceDBVectorStore(
154
+ uri=db_path,
155
+ table_name="clinical_trials",
156
+ query_mode="hybrid",
157
+ )
158
+
159
+ # Manually set metadata keys as constructor doesn't accept them
160
+ vector_store._metadata_keys = metadata_keys
161
+
162
+ # Create storage context
163
+ storage_context = StorageContext.from_defaults(vector_store=vector_store)
164
+
165
+ # Load the index from the vector store
166
+ index = VectorStoreIndex.from_vector_store(
167
+ vector_store, storage_context=storage_context
168
+ )
169
+ return index
170
+
171
+
172
+ def get_hybrid_retriever(index: VectorStoreIndex, similarity_top_k: int = 50, filters=None):
173
+ """
174
+ Creates a Hybrid Retriever using LanceDB's native hybrid search.
175
+
176
+ Args:
177
+ index (VectorStoreIndex): The loaded vector index.
178
+ similarity_top_k (int): Number of top results to retrieve.
179
+ filters (MetadataFilters, optional): Filters to apply.
180
+
181
+ Returns:
182
+ VectorIndexRetriever: The configured retriever.
183
+ """
184
+ # LanceDB supports native hybrid search via query_mode="hybrid"
185
+ # We pass this configuration to the retriever
186
+ # Use standard retriever first to avoid LanceDB hybrid search issues on small datasets
187
+ return index.as_retriever(
188
+ similarity_top_k=similarity_top_k,
189
+ filters=filters,
190
+ )
191
+
192
+
193
+ # --- Normalization ---
194
+
195
+ # Centralized Sponsor Mappings
196
+ # Key: Canonical Name
197
+ # Value: List of variations/aliases (including the canonical name itself if needed for matching)
198
+ SPONSOR_MAPPINGS = {
199
+ "GlaxoSmithKline": [
200
+ "gsk", "glaxo", "glaxosmithkline", "glaxosmithkline",
201
+ "GlaxoSmithKline"
202
+ ],
203
+ "Janssen": [
204
+ "j&j", "johnson & johnson", "johnson and johnson", "janssen", "Janssen",
205
+ "Janssen Research & Development, LLC",
206
+ "Janssen Vaccines & Prevention B.V.",
207
+ "Janssen Pharmaceutical K.K.",
208
+ "Janssen-Cilag International NV",
209
+ "Janssen Sciences Ireland UC",
210
+ "Janssen Pharmaceutica N.V., Belgium",
211
+ "Janssen Scientific Affairs, LLC",
212
+ "Janssen-Cilag Ltd.",
213
+ "Xian-Janssen Pharmaceutical Ltd.",
214
+ "Janssen Korea, Ltd., Korea",
215
+ "Janssen-Cilag G.m.b.H",
216
+ "Janssen-Cilag, S.A.",
217
+ "Janssen BioPharma, Inc.",
218
+ ],
219
+ "Bristol-Myers Squibb": [
220
+ "bms", "bristol", "bristol myers squibb", "bristol-myers squibb",
221
+ "Bristol-Myers Squibb"
222
+ ],
223
+ "Merck Sharp & Dohme": [
224
+ "merck", "msd", "merck sharp & dohme",
225
+ "Merck Sharp & Dohme LLC"
226
+ ],
227
+ "Pfizer": ["pfizer", "Pfizer", "Pfizer Inc."],
228
+ "AstraZeneca": ["astrazeneca", "AstraZeneca"],
229
+ "Eli Lilly and Company": ["lilly", "eli lilly", "Eli Lilly and Company"],
230
+ "Sanofi": ["sanofi", "Sanofi"],
231
+ "Novartis": ["novartis", "Novartis"],
232
+ }
233
+
234
+ def normalize_sponsor(sponsor: str) -> Optional[str]:
235
+ """
236
+ Normalizes sponsor names to canonical forms using centralized mappings.
237
+ """
238
+ if not sponsor:
239
+ return None
240
+
241
+ s = sponsor.lower().strip()
242
+
243
+ for canonical, variations in SPONSOR_MAPPINGS.items():
244
+ # Check if input matches canonical name (case-insensitive)
245
+ if s == canonical.lower():
246
+ return canonical
247
+
248
+ # Check variations and aliases
249
+ for v in variations:
250
+ v_lower = v.lower()
251
+ if v_lower == s:
252
+ return canonical
253
+ # If the variation is a known alias (like 'gsk'), check if it's in the string
254
+ if len(v) < 5 and v_lower in s:
255
+ return canonical
256
+
257
+ if canonical.lower() in s:
258
+ return canonical
259
+
260
+ return sponsor
261
+
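The alias-lookup behavior of `normalize_sponsor` can be exercised with a trimmed copy of `SPONSOR_MAPPINGS` (subset shown for illustration):

```python
# Trimmed mapping: canonical name -> known variations/aliases.
SPONSOR_MAPPINGS = {
    "GlaxoSmithKline": ["gsk", "glaxo", "glaxosmithkline"],
    "Pfizer": ["pfizer", "Pfizer", "Pfizer Inc."],
}

def normalize_sponsor(sponsor):
    if not sponsor:
        return None
    s = sponsor.lower().strip()
    for canonical, variations in SPONSOR_MAPPINGS.items():
        if s == canonical.lower():
            return canonical
        for v in variations:
            v_lower = v.lower()
            if v_lower == s:
                return canonical
            # Short aliases (like 'gsk') also match as substrings
            if len(v) < 5 and v_lower in s:
                return canonical
        if canonical.lower() in s:
            return canonical
    return sponsor  # unknown sponsors pass through unchanged

print(normalize_sponsor("GSK plc"))      # GlaxoSmithKline (substring alias)
print(normalize_sponsor("Pfizer Inc."))  # Pfizer (exact variation)
print(normalize_sponsor("Acme Biotech")) # Acme Biotech (no mapping)
```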
262
+
263
+ def get_sponsor_variations(sponsor: str) -> Optional[List[str]]:
264
+ """
265
+ Returns the list of known name variations for a sponsor alias, for use in strict IN pre-filters.
266
+ """
267
+ if not sponsor:
268
+ return None
269
+
270
+ # First, normalize the input to get the canonical name
271
+ canonical = normalize_sponsor(sponsor)
272
+
273
+ if canonical in SPONSOR_MAPPINGS:
274
+ return SPONSOR_MAPPINGS[canonical]
275
+
276
+ return None
277
+
278
+
279
+
280
+
281
+
requirements.txt ADDED
@@ -0,0 +1,19 @@
1
+ streamlit
2
+ requests
3
+ python-dotenv
4
+ langchain
5
+ langchain-community
6
+ langchain-google-genai==2.0.0
7
+ lancedb
8
+ lark
9
+ langchain-huggingface
10
+ llama-index
11
+ llama-index-vector-stores-lancedb
12
+ llama-index-embeddings-huggingface
13
+ llama-index-llms-gemini
14
+ streamlit-option-menu
15
+ streamlit-agraph
16
+ folium
17
+ streamlit-folium
18
+ rank_bm25
19
+ llama-index-retrievers-bm25
scripts/analyze_db.py ADDED
@@ -0,0 +1,149 @@
1
+ """
2
+ Database Analysis Script.
3
+
4
+ This script connects to the local LanceDB vector store and performs a quick analysis
5
+ of the ingested clinical trial data. It prints statistics about:
6
+ - Top Sponsors
7
+ - Phase Distribution
8
+ - Status Distribution
9
+ - Top Medical Conditions
10
+ - Sample of Recent Studies
11
+
12
+ Usage:
13
+ python scripts/analyze_db.py
14
+ # OR
15
+ cd scripts && python analyze_db.py
16
+ """
17
+
18
+ import lancedb
19
+ import pandas as pd
20
+ import os
21
+
22
+
23
+ def analyze_db():
24
+ """
25
+ Connects to LanceDB and prints summary statistics of the dataset.
26
+ """
27
+ # Determine the project root directory (one level up from this script)
28
+ script_dir = os.path.dirname(os.path.abspath(__file__))
29
+ project_root = os.path.dirname(script_dir)
30
+ db_path = os.path.join(project_root, "ct_gov_lancedb")
31
+
32
+ if not os.path.exists(db_path):
33
+ print(f"❌ Database directory '{db_path}' does not exist.")
34
+ print(" Please run 'python scripts/ingest_ct.py' first to ingest data.")
35
+ return
36
+
37
+ print(f"📂 Loading database from {db_path}...")
38
+ try:
39
+ db = lancedb.connect(db_path)
40
+
41
+ # Check for table existence
42
+ if "clinical_trials" not in db.table_names():
43
+ print(f"❌ Table 'clinical_trials' not found. Available: {db.table_names()}")
44
+ return
45
+
46
+ tbl = db.open_table("clinical_trials")
47
+ count = len(tbl)
48
+ print(f"✅ Found 'clinical_trials' table with {count} documents.")
49
+
50
+ # Fetch all data for analysis
51
+ df = tbl.to_pandas()
52
+
53
+ if df.empty:
54
+ print("❌ No data found.")
55
+ return
56
+
57
+ # Handle metadata if nested (LlamaIndex might nest it)
58
+ if "metadata" in df.columns:
59
+ # Try to flatten if it's a struct/dict
60
+ try:
61
+ meta_df = pd.json_normalize(df["metadata"])
62
+ # Merge with original df or just use meta_df for analysis
63
+ # We'll use meta_df for the metadata fields analysis
64
+ # But we might need 'text' from original
65
+ df = pd.concat([df.drop(columns=["metadata"]), meta_df], axis=1)
66
+ except Exception:
67
+ pass
68
+
69
+ if "nct_id" in df.columns:
70
+ unique_ncts = df["nct_id"].nunique()
71
+ print(f"🔢 Unique NCT IDs: {unique_ncts}")
72
+ if unique_ncts < count:
73
+ print(f"⚠️ Warning: {count - unique_ncts} duplicate records found!")
74
+ else:
75
+ print("⚠️ 'nct_id' field not found in metadata.")
76
+
77
+ # --- Analysis Sections ---
78
+
79
+ print("\n📊 --- Top 10 Sponsors ---")
80
+ if "org" in df.columns:
81
+ print(df["org"].value_counts().head(10))
82
+ else:
83
+ print("⚠️ 'org' field not found in metadata.")
84
+
85
+ print("\n📊 --- Phase Distribution ---")
86
+ if "phase" in df.columns:
87
+ print(df["phase"].value_counts())
88
+ else:
89
+ print("⚠️ 'phase' field not found in metadata.")
90
+
91
+ print("\n📊 --- Status Distribution ---")
92
+ if "status" in df.columns:
93
+ print(df["status"].value_counts())
94
+ else:
95
+ print("⚠️ 'status' field not found in metadata.")
96
+
97
+ print("\n📊 --- Top Conditions ---")
98
+ if "condition" in df.columns:
99
+ # Conditions are comma-separated strings, so we split and explode them
100
+ all_conditions = []
101
+ for conditions in df["condition"].dropna():
102
+ all_conditions.extend([c.strip() for c in conditions.split(",")])
103
+ print(pd.Series(all_conditions).value_counts().head(10))
104
+ else:
105
+ print("⚠️ 'condition' field not found in metadata.")
106
+
107
+ print("\n📊 --- Top Interventions ---")
108
+ if "intervention" in df.columns:
109
+ # Interventions are semicolon-separated strings (from ingest_ct.py), so we split by "; "
110
+ all_interventions = []
111
+ for interventions in df["intervention"].dropna():
112
+ # Split by semicolon and strip whitespace
113
+ parts = [i.strip() for i in interventions.split(";") if i.strip()]
114
+ all_interventions.extend(parts)
115
+
116
+ if all_interventions:
117
+ print(pd.Series(all_interventions).value_counts().head(20))
118
+ else:
119
+ print("No interventions found.")
120
+ else:
121
+ print("⚠️ 'intervention' field not found in metadata.")
122
+
123
+ print("\n📝 --- Sample Studies (Most Recent Start Years) ---")
124
+ if "start_year" in df.columns and "title" in df.columns:
125
+ # Ensure start_year is numeric for sorting
126
+ df["start_year"] = pd.to_numeric(df["start_year"], errors="coerce")
127
+ top_recent = df.sort_values(by="start_year", ascending=False).head(5)
128
+ for _, row in top_recent.iterrows():
129
+ print(
130
+ f"- [{row.get('start_year', 'N/A')}] {row.get('title', 'N/A')} ({row.get('nct_id', 'N/A')})"
131
+ )
132
+ print(f" Sponsor: {row.get('org', 'N/A')}")
133
+ print(f" Intervention: {row.get('intervention', 'N/A')}")
134
+
135
+ print("\n📊 --- Intervention Check ---")
136
+ if "intervention" in df.columns:
137
+ non_empty = df[df["intervention"].str.len() > 0]
138
+ print(f"Total records with interventions: {len(non_empty)}")
139
+ if not non_empty.empty:
140
+ print("Sample Intervention:", non_empty.iloc[0]["intervention"])
141
+ else:
142
+ print("⚠️ 'intervention' field not found.")
143
+
144
+ except Exception as e:
145
+ print(f"⚠️ Error analyzing DB: {e}")
146
+
147
+
148
+ if __name__ == "__main__":
149
+ analyze_db()
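The split-and-extend loops above (comma-separated conditions, semicolon-separated interventions) can also be written with pandas' `str.split`/`explode` idiom. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical sample mirroring the comma-separated 'condition' field
df = pd.DataFrame({"condition": ["Diabetes, Obesity", "Obesity", None]})

counts = (
    df["condition"]
    .dropna()
    .str.split(",")   # one list of conditions per row
    .explode()        # one condition per row
    .str.strip()      # trim stray whitespace
    .value_counts()
)
print(counts.head(10))
```

`explode()` avoids building the intermediate `all_conditions` list by hand.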
scripts/ingest_ct.py ADDED
@@ -0,0 +1,449 @@
1
+ """
2
+ Data Ingestion Script for Clinical Trial Agent.
3
+
4
+ This script fetches clinical trial data from the ClinicalTrials.gov API (v2),
5
+ processes it into a rich text format, and ingests it into a local LanceDB vector index
6
+ using LlamaIndex and PubMedBERT embeddings.
7
+
8
+ Features:
9
+ - **Pagination**: Fetches data in batches using the API's pagination tokens.
10
+ - **Robustness**: Implements retry logic for network errors.
11
+ - **Efficiency**: Uses batch insertion and reuses the existing index.
12
+ - **Progress Tracking**: Displays a progress bar using `tqdm`.
13
+ """
14
+
15
+ import requests
16
+ import re
17
+ from datetime import datetime, timedelta
18
+ from dotenv import load_dotenv
19
+ import argparse
20
+ import time
21
+ from tqdm import tqdm
22
+ import os
23
+ import concurrent.futures
24
+
25
+ # LlamaIndex Imports
26
+ from llama_index.core import Document, VectorStoreIndex, StorageContext, Settings
27
+ from llama_index.embeddings.huggingface import HuggingFaceEmbedding
28
+ from llama_index.vector_stores.lancedb import LanceDBVectorStore
29
+ import lancedb
30
+
31
+ # List of US States for extraction
32
+ US_STATES = [
33
+ "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut",
34
+ "Delaware", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa",
35
+ "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan",
36
+ "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire",
37
+ "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio",
38
+ "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
39
+ "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia",
40
+ "Wisconsin", "Wyoming", "District of Columbia"
41
+ ]
42
+
43
+ load_dotenv()
44
+
45
+ # Disable LLM for ingestion (we only need embeddings, not generation)
46
+ Settings.llm = None
47
+
48
+
49
+ def clean_text(text: str) -> str:
50
+ """
51
+ Cleans raw text by removing HTML tags and normalizing whitespace.
52
+
53
+ Args:
54
+ text (str): The raw text string.
55
+
56
+ Returns:
57
+ str: The cleaned text.
58
+ """
59
+ if not text:
60
+ return ""
61
+ # Remove HTML tags
62
+ text = re.sub(r"<[^>]+>", "", text)
63
+ # Remove multiple spaces/newlines and trim
64
+ text = re.sub(r"\s+", " ", text).strip()
65
+ return text
66
+
67
+
68
+ def fetch_trials_generator(
69
+ years: int = 5, max_studies: int = 1000, status: list = None, phases: list = None
70
+ ):
71
+ """
72
+ Yields batches of clinical trials from the ClinicalTrials.gov API.
73
+
74
+ Handles pagination automatically and implements retry logic for API requests.
75
+
76
+ Args:
77
+ years (int): Number of years to look back for study start dates.
78
+ max_studies (int): Maximum total number of studies to fetch (-1 for all).
79
+ status (list): List of status strings to filter by (e.g., ["RECRUITING"]).
80
+ phases (list): List of phase strings to filter by (e.g., ["PHASE2"]).
81
+
82
+ Yields:
83
+ list: A batch of study dictionaries (JSON objects).
84
+ """
85
+ base_url = "https://clinicaltrials.gov/api/v2/studies"
86
+
87
+ # Calculate start date for filtering
88
+ start_date = (datetime.now() - timedelta(days=365 * years)).strftime("%Y-%m-%d")
89
+ print("📡 Connecting to CT.gov API...")
90
+ print(f"🔎 Fetching trials starting after: {start_date}")
91
+ if status:
92
+ print(f" Filters - Status: {status}")
93
+ if phases:
94
+ print(f" Filters - Phases: {phases}")
95
+
96
+ fetched_count = 0
97
+ next_page_token = None
98
+
99
+ # If max_studies is -1, fetch ALL studies (infinite limit)
100
+ fetch_limit = float("inf") if max_studies == -1 else max_studies
101
+
102
+ while fetched_count < fetch_limit:
103
+ # Determine batch size (max 1000 per API limit)
104
+ current_limit = 1000
105
+ if max_studies != -1:
106
+ current_limit = min(1000, max_studies - fetched_count)
107
+
108
+ # --- Query Construction ---
109
+ # Build the query term using the API's syntax
110
+ query_parts = [f"AREA[StartDate]RANGE[{start_date},MAX]"]
111
+
112
+ if status:
113
+ status_str = " OR ".join(status)
114
+ query_parts.append(f"AREA[OverallStatus]({status_str})")
115
+
116
+ if phases:
117
+ phase_str = " OR ".join(phases)
118
+ query_parts.append(f"AREA[Phase]({phase_str})")
119
+
120
+ full_query = " AND ".join(query_parts)
121
+
122
+ params = {
123
+ "query.term": full_query,
124
+ "pageSize": current_limit,
125
+ # Request specific fields to minimize payload size
126
+ "fields": ",".join(
127
+ [
128
+ "protocolSection.identificationModule.nctId",
129
+ "protocolSection.identificationModule.briefTitle",
130
+ "protocolSection.identificationModule.officialTitle",
131
+ "protocolSection.identificationModule.organization",
132
+ "protocolSection.statusModule.overallStatus",
133
+ "protocolSection.statusModule.startDateStruct",
134
+ "protocolSection.statusModule.completionDateStruct",
135
+ "protocolSection.designModule.phases",
136
+ "protocolSection.designModule.studyType",
137
+ "protocolSection.eligibilityModule.eligibilityCriteria",
138
+ "protocolSection.eligibilityModule.sex",
139
+ "protocolSection.eligibilityModule.stdAges",
140
+ "protocolSection.descriptionModule.briefSummary",
141
+ "protocolSection.conditionsModule.conditions",
142
+ "protocolSection.outcomesModule.primaryOutcomes",
143
+ "protocolSection.contactsLocationsModule.locations",
146
+ "protocolSection.armsInterventionsModule",
147
+ "protocolSection.sponsorCollaboratorsModule.leadSponsor",
148
+ ]
149
+ ),
150
+ }
151
+
152
+ if next_page_token:
153
+ params["pageToken"] = next_page_token
154
+
155
+ # --- Retry Logic ---
156
+ retries = 3
157
+ for attempt in range(retries):
158
+ try:
159
+ response = requests.get(base_url, params=params, timeout=30)
160
+ if response.status_code == 200:
161
+ data = response.json()
162
+ studies = data.get("studies", [])
163
+
164
+ if not studies:
165
+ return # Stop generator if no studies returned
166
+
167
+ yield studies
168
+
169
+ fetched_count += len(studies)
170
+ next_page_token = data.get("nextPageToken")
171
+
172
+ if not next_page_token:
173
+ return # Stop generator if no more pages
174
+
175
+ break # Success, exit retry loop
176
+ else:
177
+ print(f"❌ API Error: {response.status_code} - {response.text}")
178
+ if attempt < retries - 1:
179
+ time.sleep(2)
180
+ else:
181
+ return # Stop generator on persistent error
182
+ except Exception as e:
183
+ print(f"❌ Request Error (Attempt {attempt+1}/{retries}): {e}")
184
+ if attempt < retries - 1:
185
+ time.sleep(2)
186
+ else:
187
+ return # Stop generator
188
+
189
+
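For illustration, here is the `query.term` expression the loop above assembles, with hypothetical filter values (the start date and filter lists are assumptions, not values from the script):

```python
# Hypothetical inputs mirroring the generator's arguments
start_date = "2015-06-01"
status = ["COMPLETED", "RECRUITING"]
phases = ["PHASE2", "PHASE3"]

# Same construction as in fetch_trials_generator
query_parts = [f"AREA[StartDate]RANGE[{start_date},MAX]"]
if status:
    query_parts.append(f"AREA[OverallStatus]({' OR '.join(status)})")
if phases:
    query_parts.append(f"AREA[Phase]({' OR '.join(phases)})")
full_query = " AND ".join(query_parts)

print(full_query)
# AREA[StartDate]RANGE[2015-06-01,MAX] AND AREA[OverallStatus](COMPLETED OR RECRUITING) AND AREA[Phase](PHASE2 OR PHASE3)
```

Each `AREA[...]` clause targets one indexed field in the CT.gov query syntax, and clauses are ANDed together.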
190
+ def process_study(study):
191
+ """
192
+ Processes a single study dictionary into a LlamaIndex Document.
193
+ This function is designed to be run in parallel.
194
+ """
195
+ try:
196
+ # Extract Modules
197
+ protocol = study.get("protocolSection", {})
198
+ identification = protocol.get("identificationModule", {})
199
+ status_module = protocol.get("statusModule", {})
200
+ design = protocol.get("designModule", {})
201
+ eligibility = protocol.get("eligibilityModule", {})
202
+ description = protocol.get("descriptionModule", {})
203
+ conditions_module = protocol.get("conditionsModule", {})
204
+ outcomes_module = protocol.get("outcomesModule", {})
205
+ arms_interventions_module = protocol.get("armsInterventionsModule", {})
208
+ locations_module = protocol.get("contactsLocationsModule", {})
209
+ sponsor_module = protocol.get("sponsorCollaboratorsModule", {})
210
+
211
+ # Extract Fields
212
+ nct_id = identification.get("nctId", "N/A")
213
+ title = identification.get("briefTitle", "N/A")
214
+ official_title = identification.get("officialTitle", "N/A")
215
+ official_title = identification.get("officialTitle", "N/A")
216
+ org = identification.get("organization", {}).get("fullName", "N/A")
217
+ sponsor_name = sponsor_module.get("leadSponsor", {}).get("name", "N/A")
218
+ summary = clean_text(description.get("briefSummary", "N/A"))
219
+
220
+ overall_status = status_module.get("overallStatus", "N/A")
221
+ start_date = status_module.get("startDateStruct", {}).get("date", "N/A")
222
+ completion_date = status_module.get("completionDateStruct", {}).get(
223
+ "date", "N/A"
224
+ )
225
+
226
+ phases = ", ".join(design.get("phases", []))
227
+ study_type = design.get("studyType", "N/A")
228
+
229
+ criteria = clean_text(eligibility.get("eligibilityCriteria", "N/A"))
230
+ gender = eligibility.get("sex", "N/A")
231
+ ages = ", ".join(eligibility.get("stdAges", []))
232
+
233
+ conditions = ", ".join(conditions_module.get("conditions", []))
234
+
235
+ interventions = []
236
+ for interv in arms_interventions_module.get("interventions", []):
237
+ name = interv.get("name", "")
238
+ type_ = interv.get("type", "")
239
+ interventions.append(f"{type_}: {name}")
240
+ interventions_str = "; ".join(interventions)
241
+
242
+ primary_outcomes = []
243
+ for outcome in outcomes_module.get("primaryOutcomes", []):
244
+ measure = outcome.get("measure", "")
245
+ desc = outcome.get("description", "")
246
+ primary_outcomes.append(f"- {measure}: {desc}")
247
+ outcomes_str = clean_text("\n".join(primary_outcomes))
248
+
249
+ locations = []
250
+ for loc in locations_module.get("locations", []):
251
+ facility = loc.get("facility", "N/A")
252
+ city = loc.get("city", "")
253
+ loc_state = loc.get("state", "")
254
+ country = loc.get("country", "")
+ # Join non-empty parts; including the state lets the US-state scan below match it
+ place = ", ".join(p for p in (city, loc_state, country) if p)
+ locations.append(f"{facility} ({place})")
255
+ locations_str = "; ".join(locations[:5]) # Limit to 5 locations to save space
256
+
257
+ # Extract State (First match)
258
+ state = "Unknown"
259
+ # Check locations for US States
260
+ for loc_str in locations:
261
+ if "United States" in loc_str:
262
+ for s in US_STATES:
263
+ if s in loc_str:
264
+ state = s
265
+ break
266
+ if state != "Unknown":
267
+ break
268
+
269
+ # Construct Rich Page Content with Markdown Headers
270
+ # This text is what gets embedded and searched
271
+ page_content = (
272
+ f"# {title}\n"
273
+ f"**NCT ID:** {nct_id}\n"
274
+ f"**Official Title:** {official_title}\n"
275
+ f"**Sponsor:** {sponsor_name}\n"
276
+ f"**Organization:** {org}\n"
277
+ f"**Status:** {overall_status}\n"
278
+ f"**Phase:** {phases}\n"
279
+ f"**Study Type:** {study_type}\n"
280
+ f"**Start Date:** {start_date}\n"
281
+ f"**Completion Date:** {completion_date}\n\n"
282
+ f"## Summary\n{summary}\n\n"
283
+ f"## Conditions\n{conditions}\n\n"
284
+ f"## Interventions\n{interventions_str}\n\n"
285
+ f"## Eligibility Criteria\n"
286
+ f"**Gender:** {gender}\n"
287
+ f"**Ages:** {ages}\n"
288
+ f"**Criteria:**\n{criteria}\n\n"
289
+ f"## Primary Outcomes\n{outcomes_str}\n\n"
290
+ f"## Locations\n{locations_str}"
291
+ )
292
+
293
+ # Metadata for filtering (Structured Data)
294
+ metadata = {
295
+ "nct_id": nct_id,
296
+ "title": title,
297
+ "org": org,
298
+ "sponsor": sponsor_name,
299
+ "status": overall_status,
300
+ "phase": phases,
301
+ "study_type": study_type,
302
+ "start_year": (int(start_date.split("-")[0]) if start_date != "N/A" else 0),
303
+ "condition": conditions,
304
+ "intervention": interventions_str,
305
+ "country": (
306
+ locations[0].split(",")[-1].strip(" )") if locations else "Unknown"
307
+ ),
308
+ "state": state,
309
+ }
310
+
311
+ return Document(text=page_content, metadata=metadata, id_=nct_id)
312
+ except Exception as e:
313
+ print(
314
+ f"⚠️ Error processing study {study.get('protocolSection', {}).get('identificationModule', {}).get('nctId', 'Unknown')}: {e}"
315
+ )
316
+ return None
317
+
318
+
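The state-extraction pass inside `process_study` can be exercised in isolation. A minimal sketch with a trimmed state list and hypothetical location strings (it assumes the location string actually mentions the state):

```python
# Abbreviated state list for illustration; the script uses the full US_STATES table
US_STATES = ["California", "New York", "Texas"]

def extract_state(locations):
    # Return the first US state found in a "Facility (City, State, Country)" string
    for loc_str in locations:
        if "United States" in loc_str:
            for s in US_STATES:
                if s in loc_str:
                    return s
    return "Unknown"

locs = [
    "Charite (Berlin, Germany)",
    "UCSF Medical Center (San Francisco, California, United States)",
]
print(extract_state(locs))  # California
```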
319
+ def run_ingestion():
320
+ """
321
+ Main execution function for the ingestion script.
322
+ Parses arguments, initializes the index, and runs the ingestion loop.
323
+ """
324
+ parser = argparse.ArgumentParser(description="Ingest Clinical Trials data.")
325
+ parser.add_argument(
326
+ "--limit",
327
+ type=int,
328
+ default=-1,
329
+ help="Number of studies to ingest. Set to -1 for ALL.",
330
+ )
331
+ parser.add_argument(
332
+ "--years", type=int, default=10, help="Number of years to look back."
333
+ )
334
+ parser.add_argument(
335
+ "--status",
336
+ type=str,
337
+ default="COMPLETED",
338
+ help="Comma-separated list of statuses (e.g., COMPLETED,RECRUITING).",
339
+ )
340
+ parser.add_argument(
341
+ "--phases",
342
+ type=str,
343
+ default="PHASE1,PHASE2,PHASE3,PHASE4",
344
+ help="Comma-separated list of phases (e.g., PHASE2,PHASE3).",
345
+ )
346
+ args = parser.parse_args()
347
+
348
+ status_list = args.status.split(",") if args.status else []
349
+ phase_list = args.phases.split(",") if args.phases else []
350
+
351
+ print(f"⚙️ Configuration: Limit={args.limit}, Years={args.years}")
352
+ print(f" Status Filter: {status_list}")
353
+ print(f" Phase Filter: {phase_list}")
354
+
355
+ # --- INITIALIZE LLAMAINDEX COMPONENTS ---
356
+ print("🧠 Initializing LlamaIndex Embeddings (PubMedBERT)...")
357
+ embed_model = HuggingFaceEmbedding(model_name="pritamdeka/S-PubMedBert-MS-MARCO")
358
+
359
+ # Initialize LanceDB (Persistent)
360
+ print("🚀 Initializing LanceDB...")
361
+
362
+ # Determine the project root directory (one level up from this script)
363
+ script_dir = os.path.dirname(os.path.abspath(__file__))
364
+ project_root = os.path.dirname(script_dir)
365
+ db_path = os.path.join(project_root, "ct_gov_lancedb")
366
+
367
+ # Connect to LanceDB
368
+ db = lancedb.connect(db_path)
369
+
370
+ table_name = "clinical_trials"
371
+ if table_name in db.table_names():
372
+ mode = "append"
373
+ print(f"ℹ️ Table '{table_name}' exists. Appending data.")
374
+ else:
375
+ mode = "create"
376
+ print(f"ℹ️ Table '{table_name}' does not exist. Creating new table.")
377
+
378
+ # Initialize Vector Store
379
+ vector_store = LanceDBVectorStore(
380
+ uri=db_path,
381
+ table_name=table_name,
382
+ mode=mode,
383
+ query_mode="hybrid" # Enable hybrid search support
384
+ )
385
+ storage_context = StorageContext.from_defaults(vector_store=vector_store)
386
+
387
+ # Initialize Index ONCE
388
+ # We pass the storage context to link it to the vector store
389
+ index = VectorStoreIndex.from_vector_store(
390
+ vector_store, storage_context=storage_context, embed_model=embed_model
391
+ )
392
+
393
+ total_ingested = 0
394
+
395
+ # Progress Bar
396
+ pbar = tqdm(
397
+ total=args.limit if args.limit > 0 else float("inf"),
398
+ desc="Ingesting Studies",
399
+ unit="study",
400
+ )
401
+
402
+ # --- INGESTION LOOP ---
403
+ # Use ProcessPoolExecutor for parallel processing of study data
404
+ with concurrent.futures.ProcessPoolExecutor() as executor:
405
+ for batch_studies in fetch_trials_generator(
406
+ years=args.years,
407
+ max_studies=args.limit,
408
+ status=status_list,
409
+ phases=phase_list,
410
+ ):
411
+ # Parallelize the processing of the batch
412
+ # map returns an iterator, so we convert to list to trigger execution
413
+ documents_iter = executor.map(process_study, batch_studies)
414
+
415
+ # Filter out None results (errors)
416
+ documents = [doc for doc in documents_iter if doc is not None]
417
+
418
+ if documents:
419
+ # Overwrite Logic:
420
+ # To avoid duplicates, we delete existing records with the same NCT IDs.
421
+ doc_ids = [doc.id_ for doc in documents]
422
+ try:
423
+ # LanceDB supports deletion via SQL-like filter
424
+ # We construct a filter string: "nct_id IN ('NCT123', 'NCT456')"
425
+ ids_str = ", ".join(f"'{doc_id}'" for doc_id in doc_ids)
426
+ if ids_str:
427
+ tbl = db.open_table("clinical_trials")
428
+ tbl.delete(f"nct_id IN ({ids_str})")
429
+ except Exception:
430
+ # Ignore if table doesn't exist yet
431
+ pass
432
+
433
+ # Efficient Batch Insertion
434
+ # We convert documents to nodes and insert them into the index.
435
+ # This handles embedding generation automatically.
436
+ node_parser = Settings.node_parser  # avoid shadowing the argparse `parser` above
437
+ nodes = node_parser.get_nodes_from_documents(documents)
438
+
439
+ index.insert_nodes(nodes)
440
+
441
+ total_ingested += len(documents)
442
+ pbar.update(len(documents))
443
+
444
+ pbar.close()
445
+ print(f"🎉 Ingestion Complete! Studies ingested this run: {total_ingested}")
446
+
447
+
448
+ if __name__ == "__main__":
449
+ run_ingestion()
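The overwrite step in the ingestion loop builds a SQL-style `IN` filter for `tbl.delete()`. For illustration, with hypothetical NCT IDs:

```python
# Hypothetical IDs; mirrors the delete-filter construction in run_ingestion
doc_ids = ["NCT01234567", "NCT07654321"]

ids_str = ", ".join(f"'{doc_id}'" for doc_id in doc_ids)
where_clause = f"nct_id IN ({ids_str})"
print(where_clause)  # nct_id IN ('NCT01234567', 'NCT07654321')
```

Deleting matching rows before re-inserting is what makes repeated ingestion runs idempotent per study.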
scripts/remove_duplicates.py ADDED
@@ -0,0 +1,174 @@
1
+ """
2
+ Script to remove duplicate records from the LanceDB database.
3
+
4
+ This script scans the 'clinical_trials' table, identifies records with duplicate content
5
+ (same 'nct_id' AND same 'text'), and removes the extras.
6
+
7
+ It uses a safe "Fetch -> Dedupe -> Overwrite" strategy:
8
+ 1. Identifies NCT IDs that have duplicates.
9
+ 2. For each such NCT ID, fetches ALL its records (chunks).
10
+ 3. Deduplicates these records in memory based on their text content.
11
+ 4. Deletes ALL records for that NCT ID from the database.
12
+ 5. Re-inserts the unique records.
13
+
14
+ This ensures that valid chunks of the same study are PRESERVED, while exact duplicates are removed.
15
+ """
16
+
17
+ import os
18
+ import pandas as pd
19
+ import lancedb
20
+
21
+ import argparse
22
+
23
+ def calculate_richness(record):
24
+ """Calculates a 'richness' score for a record based on metadata field count and content length."""
25
+ score = 0
26
+ if not record:
27
+ return 0
28
+
29
+ for key, value in record.items():
30
+ if key == "vector": continue
31
+
32
+ # Handle nested metadata
33
+ if key == "metadata" and isinstance(value, dict):
34
+ score += calculate_richness(value) # Recurse
35
+ continue
36
+
37
+ # Check for non-empty values
38
+ if value is not None and str(value).strip() != "":
39
+ score += 10 # Base points for having a populated field
40
+
41
+ # Bonus points for content length
42
+ if isinstance(value, str):
43
+ score += len(value) / 100.0
44
+
45
+ return score
46
+
47
+ def remove_duplicates(dry_run=False):
48
+ # Determine the project root directory
49
+ script_dir = os.path.dirname(os.path.abspath(__file__))
50
+ project_root = os.path.dirname(script_dir)
51
+ db_path = os.path.join(project_root, "ct_gov_lancedb")
52
+
53
+ if not os.path.exists(db_path):
54
+ print(f"❌ Database directory '{db_path}' does not exist.")
55
+ return
56
+
57
+ print(f"📂 Loading database from {db_path}...")
58
+ if dry_run:
59
+ print("🧪 RUNNING IN DRY-RUN MODE (No changes will be made)")
60
+
61
+ try:
62
+ db = lancedb.connect(db_path)
63
+ tbl = db.open_table("clinical_trials")
64
+
65
+ print("🔍 Scanning for duplicates...")
66
+ # Fetch all data
67
+ df = tbl.to_pandas()
68
+
69
+ if df.empty:
70
+ print("Database is empty.")
71
+ return
72
+
73
+ # Create a working copy to flatten metadata for analysis
74
+ working_df = df.copy()
75
+ if "metadata" in working_df.columns:
76
+ # Flatten metadata
77
+ meta_df = pd.json_normalize(working_df["metadata"])
78
+ # We drop the original metadata column from working_df and join the flattened one
79
+ working_df = pd.concat([working_df.drop(columns=["metadata"]), meta_df], axis=1)
80
+
81
+ if "nct_id" not in working_df.columns:
82
+ print("❌ 'nct_id' column not found (checked metadata too).")
83
+ return
84
+
85
+ if "text" not in working_df.columns:
86
+ print("❌ 'text' column not found. Cannot safely deduplicate chunks.")
87
+ return
88
+
89
+ # Identify duplicates based on (nct_id, text) using the flattened view
90
+ duplicates_mask = working_df.duplicated(subset=["nct_id", "text"], keep=False)
91
+
92
+ # We use the mask on working_df to find the IDs
93
+ duplicates_working_df = working_df[duplicates_mask]
94
+
95
+ if duplicates_working_df.empty:
96
+ print("✅ No exact duplicates found. Database is clean.")
97
+ return
98
+
99
+ unique_duplicate_ids = duplicates_working_df["nct_id"].unique()
100
+ print(f"⚠️ Found duplicates affecting {len(unique_duplicate_ids)} studies (NCT IDs).")
101
+
102
+ total_deleted = 0
103
+ total_reinserted = 0
104
+
105
+ # Process each affected NCT ID
106
+ for nct_id in unique_duplicate_ids:
107
+ # Get indices from working_df where nct_id matches
108
+ # This ensures we are looking at the right rows in the ORIGINAL df
109
+ indices = working_df[working_df["nct_id"] == nct_id].index
110
+
111
+ # Extract original records (preserving structure)
112
+ study_records_df = df.loc[indices]
113
+ original_count = len(study_records_df)
114
+
115
+ unique_records = []
116
+ seen_texts = set()
117
+
118
+ records = study_records_df.to_dict("records")
119
+ records.sort(key=calculate_richness, reverse=True)
120
+
121
+ for record in records:
122
+ text_content = record.get("text", "")
123
+ if text_content not in seen_texts:
124
+ unique_records.append(record)
125
+ seen_texts.add(text_content)
126
+
127
+ new_count = len(unique_records)
128
+
129
+ if new_count < original_count:
130
+ print(f" - {nct_id}: Reducing {original_count} -> {new_count} records.")
131
+
132
+ if not dry_run:
133
+ # Delete using the ID (LanceDB SQL filter)
134
+ # Note: In LanceDB SQL, if nct_id is in metadata struct, we access it via metadata.nct_id
135
+ # But wait, tbl.delete() takes a SQL string.
136
+ # If the schema has 'metadata' struct, we must use 'metadata.nct_id'.
137
+ # If it was flattened (unlikely for the table itself), we use 'nct_id'.
138
+
139
+ # We check if 'nct_id' is a top-level column in the original DF
140
+ if "nct_id" in df.columns:
141
+ where_clause = f"nct_id = '{nct_id}'"
142
+ else:
143
+ where_clause = f"metadata.nct_id = '{nct_id}'"
144
+
145
+ tbl.delete(where_clause)
146
+
147
+ if unique_records:
148
+ tbl.add(unique_records)
149
+
150
+ total_deleted += original_count
151
+ total_reinserted += new_count
152
+ else:
153
+ print(f" - {nct_id}: No reduction needed (false positive?).")
154
+
155
+ if dry_run:
156
+ print(f"\n🧪 DRY RUN COMPLETE.")
157
+ print(f" - WOULD remove {total_deleted - total_reinserted} duplicate records.")
158
+ print(f" - WOULD preserve {total_reinserted} unique chunks.")
159
+ else:
160
+ print(f"\n🎉 Deduplication complete!")
161
+ print(f" - Removed {total_deleted - total_reinserted} duplicate records.")
162
+ print(f" - Preserved {total_reinserted} unique chunks.")
163
+
164
+ except Exception as e:
165
+ print(f"❌ Error: {e}")
166
+ import traceback
167
+ traceback.print_exc()
168
+
169
+ if __name__ == "__main__":
170
+ parser = argparse.ArgumentParser(description="Remove duplicate records from LanceDB.")
171
+ parser.add_argument("--dry-run", action="store_true", help="Simulate the process without making changes.")
172
+ args = parser.parse_args()
173
+
174
+ remove_duplicates(dry_run=args.dry_run)
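The duplicate scan relies on `DataFrame.duplicated(..., keep=False)`, which flags every row in a duplicate group (not just the extras), so all chunks of an affected study can be fetched and re-deduplicated. A sketch with hypothetical chunks:

```python
import pandas as pd

df = pd.DataFrame({
    "nct_id": ["NCT1", "NCT1", "NCT1", "NCT2"],
    "text": ["chunk A", "chunk A", "chunk B", "chunk C"],
})

# keep=False marks BOTH copies of ("NCT1", "chunk A"); distinct chunks stay unmarked
mask = df.duplicated(subset=["nct_id", "text"], keep=False)
print(df[mask]["nct_id"].unique())  # ['NCT1']
```

Note that `("NCT1", "chunk B")` is not flagged: different chunks of the same study are preserved.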
tests/test_data_integrity.py ADDED
@@ -0,0 +1,75 @@
1
+ import unittest
2
+ import lancedb
3
+ import pandas as pd
4
+ import os
5
+ import sys
6
+
7
+ # Add project root to path
8
+ sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
9
+
10
+ class TestDataIntegrity(unittest.TestCase):
11
+ def setUp(self):
12
+ # Determine the project root directory
13
+ self.test_dir = os.path.dirname(os.path.abspath(__file__))
14
+ self.project_root = os.path.dirname(self.test_dir)
15
+ self.db_path = os.path.join(self.project_root, "ct_gov_lancedb")
16
+
17
+ def test_pfizer_myeloma_counts(self):
18
+ """
19
+ Verifies that the database contains the expected number of Pfizer studies
20
+ related to Multiple Myeloma, based on strict keyword matching.
21
+ """
22
+ if not os.path.exists(self.db_path):
23
+ self.skipTest(f"Database directory '{self.db_path}' does not exist. Skipping data integrity test.")
24
+
25
+ print(f"\n📂 Loading database from {self.db_path}...")
26
+ try:
27
+ db = lancedb.connect(self.db_path)
28
+ tbl = db.open_table("clinical_trials")
29
+ except Exception as e:
30
+ self.skipTest(f"Failed to load LanceDB table: {e}")
31
+
32
+ # Fetch all data (LanceDB is fast enough for this size, or we could query)
33
+ # For integrity check, loading into DF is fine.
34
+ df = tbl.to_pandas()
35
+
36
+ # Handle metadata flattening if needed (LanceDB stores metadata in a struct)
37
+ if "metadata" in df.columns:
38
+ # Flatten the metadata column
39
+ meta_df = pd.json_normalize(df["metadata"])
40
+ df = meta_df
41
+
42
+ # 1. Check for 'org' column
43
+ if "org" not in df.columns:
44
+ self.fail("'org' column missing from metadata.")
45
+
46
+ # 2. Filter by Sponsor (Pfizer)
47
+ pfizer_studies = df[df["org"].str.contains("Pfizer", case=False, na=False)]
48
+ # We expect at least some Pfizer studies if the DB is populated
49
+ self.assertGreater(len(pfizer_studies), 0, "No Pfizer studies found in DB.")
50
+
51
+ # 3. Filter by "Multiple Myeloma" in Title or Conditions
52
+ query = "Multiple Myeloma"
53
+
54
+ def is_relevant(row):
55
+ title = str(row.get("title", "")).lower()
56
+ conditions = str(row.get("condition", "")).lower()
57
+ q = query.lower()
58
+ return q in title or q in conditions
59
+
60
+ relevant_studies = pfizer_studies[pfizer_studies.apply(is_relevant, axis=1)]
61
+
62
+ count = len(relevant_studies)
63
+ print(f"🎯 Pfizer Studies with '{query}' in Title or Conditions: {count}")
64
+
65
+ # Ground truth from a manual check is 7 studies; assert a range rather than the
66
+ # exact value so the test survives minor data updates.
68
+ self.assertGreater(count, 0, "Should find at least one relevant study.")
69
+ self.assertLess(count, 50, "Should not find hundreds of studies (strict filter check).")
70
+
71
+ # Optional: Assert exact count if we want to be very strict about data consistency
72
+ # self.assertEqual(count, 7, "Expected exactly 7 studies based on known ground truth.")
73
+
74
+ if __name__ == "__main__":
75
+ unittest.main()
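The metadata flattening done in `setUp`'s test body uses `pd.json_normalize` on the struct column. A sketch with hypothetical rows shaped like LanceDB's `metadata` column:

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["chunk A", "chunk B"],
    "metadata": [
        {"nct_id": "NCT1", "org": "Pfizer Inc."},
        {"nct_id": "NCT2", "org": "Novartis"},
    ],
})

# Flatten the struct column into regular columns
meta_df = pd.json_normalize(df["metadata"])
print(list(meta_df.columns))  # ['nct_id', 'org']
print(meta_df["org"].str.contains("Pfizer", case=False, na=False).sum())  # 1
```

`na=False` keeps missing sponsor values from poisoning the boolean filter.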
tests/test_hybrid_search.py ADDED
@@ -0,0 +1,65 @@
1
+ import pytest
2
+ import sys
3
+ import os
4
+
5
+ # Add project root to path
6
+ sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
7
+
8
+ from modules.tools import search_trials
9
+ from modules.utils import load_environment
10
+
11
+ # Mark as integration test since it loads the DB
12
+ @pytest.mark.integration
13
+ def test_hybrid_search_integration():
14
+ """
15
+ Integration test for Hybrid Search.
16
+ Verifies that the search_trials tool can retrieve results using the hybrid retriever.
17
+ """
18
+ load_environment()
19
+
20
+ # Test 1: Dynamic ID Search
21
+ # First, find a valid ID from a broad search
22
+ print("\n🔍 Finding a valid ID for testing...")
23
+ broad_results = search_trials.invoke({"query": "cancer"})
24
+
25
+ # Extract an ID from the results
26
+ import re
27
+ match = re.search(r"ID: (NCT\d+)", broad_results)
28
+ if not match:
29
+ pytest.skip("Could not find any studies in DB to test against.")
30
+
31
+ target_id = match.group(1)
32
+ print(f"🎯 Found target ID: {target_id}. Now testing exact search...")
33
+
34
+ # Now search for that specific ID
35
+ results_id = search_trials.invoke({"query": target_id})
36
+
37
+ assert "Found" in results_id
38
+ assert target_id in results_id, f"Hybrid search failed to retrieve exact ID {target_id}"
39
+
40
+ # Extract sponsor from the first result to ensure we test with valid data
41
+ # Result format: "**Title** ... - Sponsor: SponsorName ..."
42
+ sponsor_match = re.search(r"Sponsor: (.*?)\n", broad_results)
43
+ if not sponsor_match:
44
+ print("⚠️ Could not extract sponsor from results. Skipping hybrid test.")
45
+ return
46
+
47
+ target_sponsor = sponsor_match.group(1).strip()
48
+ # Normalize it to get the simple name if possible, or just use it
49
+ # But search_trials expects a simple name to map to variations.
50
+ # If we pass the full name, get_sponsor_variations might return None if not mapped.
51
+ # So let's try to find a mapped sponsor if possible, or just skip if not mapped.
52
+
53
+ from modules.utils import normalize_sponsor
54
+ simple_sponsor = normalize_sponsor(target_sponsor)
55
+
56
+ # If normalization didn't change it, it might not be in our alias list.
57
+ # But we can still try to search with it.
58
+
59
+ print(f"\n🔍 Testing Hybrid Search with dynamic sponsor: '{simple_sponsor}' (Original: {target_sponsor})")
60
+
61
+ # Use a generic query that likely matches the study, or just "study"
62
+ results_hybrid = search_trials.invoke({"query": "study", "sponsor": simple_sponsor})
63
+
64
+ assert "Found" in results_hybrid, f"Should find results for valid sponsor {simple_sponsor}"
65
+ assert target_sponsor in results_hybrid or simple_sponsor in results_hybrid
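The test above parses tool output with two regexes. For illustration, against a hypothetical result string in the same shape (`ID: NCT...`, `Sponsor: ...`):

```python
import re

# Hypothetical tool output shaped like the strings parsed above
results = "Found 2 studies.\n1. **Some Trial** ID: NCT01234567\nSponsor: Pfizer\n"

match = re.search(r"ID: (NCT\d+)", results)
print(match.group(1))  # NCT01234567

sponsor_match = re.search(r"Sponsor: (.*?)\n", results)
print(sponsor_match.group(1).strip())  # Pfizer
```

The non-greedy `(.*?)` stops at the first newline, so only the sponsor name on that line is captured.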
tests/test_sponsor_normalization.py ADDED
@@ -0,0 +1,45 @@
+ import unittest
+ from modules.utils import normalize_sponsor, get_sponsor_variations, SPONSOR_MAPPINGS
+
+ class TestSponsorNormalization(unittest.TestCase):
+     def test_normalize_sponsor_aliases(self):
+         """Test that common aliases map to canonical names."""
+         self.assertEqual(normalize_sponsor("J&J"), "Janssen")
+         self.assertEqual(normalize_sponsor("Johnson & Johnson"), "Janssen")
+         self.assertEqual(normalize_sponsor("GSK"), "GlaxoSmithKline")
+         self.assertEqual(normalize_sponsor("Merck"), "Merck Sharp & Dohme")
+         self.assertEqual(normalize_sponsor("BMS"), "Bristol-Myers Squibb")
+
+     def test_normalize_sponsor_variations(self):
+         """Test that specific DB variations map to canonical names."""
+         self.assertEqual(normalize_sponsor("Janssen Research & Development, LLC"), "Janssen")
+         self.assertEqual(normalize_sponsor("Pfizer Inc."), "Pfizer")
+         self.assertEqual(normalize_sponsor("Merck Sharp & Dohme LLC"), "Merck Sharp & Dohme")
+
+     def test_normalize_sponsor_canonical(self):
+         """Test that canonical names return themselves."""
+         self.assertEqual(normalize_sponsor("Janssen"), "Janssen")
+         self.assertEqual(normalize_sponsor("Pfizer"), "Pfizer")
+
+     def test_get_sponsor_variations(self):
+         """Test that getting variations works for aliases and canonical names."""
+         # Test with alias
+         vars_jnj = get_sponsor_variations("J&J")
+         self.assertIn("Janssen Research & Development, LLC", vars_jnj)
+         self.assertIn("Janssen", vars_jnj)
+
+         # Test with canonical
+         vars_janssen = get_sponsor_variations("Janssen")
+         self.assertEqual(vars_jnj, vars_janssen)
+
+         # Test with variation input (should normalize first)
+         vars_variation = get_sponsor_variations("Janssen Research & Development, LLC")
+         self.assertEqual(vars_janssen, vars_variation)
+
+     def test_unknown_sponsor(self):
+         """Test behavior for unknown sponsors."""
+         self.assertEqual(normalize_sponsor("Unknown Pharma"), "Unknown Pharma")
+         self.assertIsNone(get_sponsor_variations("Unknown Pharma"))
+
+ if __name__ == "__main__":
+     unittest.main()
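The tests above pin down the expected behavior of `normalize_sponsor` and `get_sponsor_variations` without showing their implementation. A minimal sketch consistent with those assertions might look like the following — the mapping table here is an illustrative subset, and the real `SPONSOR_MAPPINGS` in `modules/utils.py` is presumably larger and may be structured differently:

```python
# Hypothetical sketch of modules/utils.py, inferred from the test expectations.
SPONSOR_MAPPINGS = {
    "Janssen": {
        "aliases": ["J&J", "Johnson & Johnson"],
        "variations": ["Janssen Research & Development, LLC"],
    },
    "Pfizer": {"aliases": [], "variations": ["Pfizer Inc."]},
    "Merck Sharp & Dohme": {
        "aliases": ["Merck", "MSD"],
        "variations": ["Merck Sharp & Dohme LLC"],
    },
    "GlaxoSmithKline": {"aliases": ["GSK"], "variations": []},
    "Bristol-Myers Squibb": {"aliases": ["BMS"], "variations": []},
}


def normalize_sponsor(name: str) -> str:
    """Map an alias or DB variation to its canonical sponsor name."""
    for canonical, entry in SPONSOR_MAPPINGS.items():
        if name == canonical or name in entry["aliases"] or name in entry["variations"]:
            return canonical
    return name  # unknown sponsors pass through unchanged


def get_sponsor_variations(name: str):
    """Return the canonical name plus all known DB variations, or None if unmapped."""
    canonical = normalize_sponsor(name)
    entry = SPONSOR_MAPPINGS.get(canonical)
    if entry is None:
        return None
    return [canonical] + entry["variations"]
```

With this shape, any of the three input forms (alias, canonical, DB variation) funnels through `normalize_sponsor` first, which is why `get_sponsor_variations` returns identical lists for "J&J", "Janssen", and "Janssen Research & Development, LLC" in the tests.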
tests/test_unit.py ADDED
@@ -0,0 +1,149 @@
+ import pytest
+ import pandas as pd
+ import sys
+ import os
+ from unittest.mock import MagicMock, patch
+
+ # Add project root to path to import app modules
+ sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
+
+ from modules.utils import normalize_sponsor  # noqa: E402
+ from modules.tools import expand_query  # noqa: E402
+ from modules.graph_viz import build_graph  # noqa: E402
+ from llama_index.core.schema import NodeWithScore, TextNode  # noqa: E402
+
+ # --- Tests for normalize_sponsor ---
+
+
+ def test_normalize_sponsor_aliases():
+     assert normalize_sponsor("J&J") == "Janssen"
+     assert normalize_sponsor("Johnson & Johnson") == "Janssen"
+     assert normalize_sponsor("GSK") == "GlaxoSmithKline"
+     assert normalize_sponsor("Merck") == "Merck Sharp & Dohme"
+     assert normalize_sponsor("MSD") == "Merck Sharp & Dohme"
+     assert normalize_sponsor("BMS") == "Bristol-Myers Squibb"
+
+
+ def test_normalize_sponsor_no_change():
+     assert normalize_sponsor("Pfizer") == "Pfizer"
+     assert normalize_sponsor("Moderna") == "Moderna"
+     assert normalize_sponsor("Unknown Sponsor") == "Unknown Sponsor"
+
+
+ # --- Tests for Analytics Logic (Mocked) ---
+
+
+ def filter_dataframe(df, phase=None, status=None, sponsor=None, intervention=None):
+     """
+     Replicating the logic from get_study_analytics for testing purposes.
+     """
+     if phase:
+         target_phases = [p.strip().upper().replace(" ", "") for p in phase.split(",")]
+         df["phase_upper"] = df["phase"].astype(str).str.upper().str.replace(" ", "")
+         mask = df["phase_upper"].apply(lambda x: any(tp in x for tp in target_phases))
+         df = df[mask]
+
+     if status:
+         df = df[df["status"].str.upper() == status.upper()]
+
+     if sponsor:
+         target_sponsor = normalize_sponsor(sponsor).lower()
+         df["org_lower"] = df["org"].astype(str).apply(normalize_sponsor).str.lower()
+         df = df[df["org_lower"].str.contains(target_sponsor, regex=False)]
+
+     if intervention:
+         target_intervention = intervention.lower()
+         df["intervention_lower"] = df["intervention"].astype(str).str.lower()
+         df = df[df["intervention_lower"].str.contains(target_intervention, regex=False)]
+
+     return df
+
+
+ @pytest.fixture
+ def sample_df():
+     data = {
+         "nct_id": ["NCT001", "NCT002", "NCT003", "NCT004"],
+         "phase": ["PHASE1", "PHASE2", "PHASE3", "PHASE2"],
+         "status": ["RECRUITING", "COMPLETED", "COMPLETED", "RECRUITING"],
+         "org": ["Pfizer", "Janssen", "Merck Sharp & Dohme", "Pfizer"],
+         "intervention": ["Drug A", "Drug B", "Keytruda", "Drug A + Drug C"],
+         "start_year": [2020, 2021, 2022, 2023],
+         "title": [
+             "Study of Drug A",
+             "Study of Drug B",
+             "Keytruda Trial",
+             "Combo Study",
+         ],
+         "condition": ["Cancer", "Diabetes", "Lung Cancer", "Cancer"],
+     }
+     return pd.DataFrame(data)
+
+
+ def test_analytics_filter_intervention(sample_df):
+     # Filter for Keytruda
+     filtered = filter_dataframe(sample_df, intervention="Keytruda")
+     assert len(filtered) == 1
+     assert filtered.iloc[0]["nct_id"] == "NCT003"
+
+
+ def test_analytics_filter_intervention_partial(sample_df):
+     # Filter for "Drug A" (should match NCT001 and NCT004)
+     filtered = filter_dataframe(sample_df, intervention="Drug A")
+     assert len(filtered) == 2
+     assert set(filtered["nct_id"]) == {"NCT001", "NCT004"}
+
+
+ # --- Tests for Query Expansion ---
+
+
+ @patch("modules.tools.Settings")
+ def test_expand_query(mock_settings):
+     # Mock LLM response
+     mock_response = MagicMock()
+     mock_response.text = "Expanded Query: cancer OR carcinoma OR tumor"
+     mock_settings.llm.complete.return_value = mock_response
+
+     query = "cancer"
+     expanded = expand_query(query)
+
+     assert "cancer OR carcinoma OR tumor" in expanded
+     mock_settings.llm.complete.assert_called_once()
+
+
+ def test_expand_query_skip_long():
+     long_query = "this is a very long query that should definitely be skipped because it has too many words"
+     assert expand_query(long_query) == long_query
+
+
+
+
+
+ # --- Tests for Graph Visualization ---
+
+
+ def test_build_graph():
+     data = [
+         {"nct_id": "NCT1", "title": "Study 1", "org": "Pfizer", "condition": "Cancer"},
+         {
+             "nct_id": "NCT2",
+             "title": "Study 2",
+             "org": "Merck",
+             "condition": "Cancer, Diabetes",
+         },
+     ]
+
+     nodes, edges, config = build_graph(data)
+
+     # Check Nodes
+     # 2 Studies + 2 Sponsors + 2 Conditions (Cancer, Diabetes) = 6 Nodes
+     assert len(nodes) == 6
+
+     node_ids = [n.id for n in nodes]
+     assert "NCT1" in node_ids
+     assert "Pfizer" in node_ids
+     assert "Cancer" in node_ids
+
+     # Check Edges
+     # NCT1 -> Pfizer, NCT1 -> Cancer (2 edges)
+     # NCT2 -> Merck, NCT2 -> Cancer, NCT2 -> Diabetes (3 edges)
+     assert len(edges) == 5
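The node and edge counts asserted in `test_build_graph` follow from deduplicating nodes while emitting one edge per study-sponsor and study-condition pair. The sketch below is a hypothetical reconstruction of that logic — the real `modules/graph_viz.py` presumably returns streamlit-agraph `Node`/`Edge`/`Config` objects, so plain dataclasses stand in here to keep the sketch self-contained:

```python
from dataclasses import dataclass


# Stand-ins for the real graph-library node/edge types (assumption).
@dataclass(frozen=True)
class Node:
    id: str
    label: str


@dataclass(frozen=True)
class Edge:
    source: str
    target: str


def build_graph(data):
    """Build a study/sponsor/condition graph from a list of trial dicts."""
    nodes, edges, seen = [], [], set()

    def add_node(node_id, label):
        # Deduplicate: shared sponsors/conditions become a single node.
        if node_id not in seen:
            seen.add(node_id)
            nodes.append(Node(id=node_id, label=label))

    for study in data:
        add_node(study["nct_id"], study["title"])
        add_node(study["org"], study["org"])
        edges.append(Edge(source=study["nct_id"], target=study["org"]))
        # Conditions arrive as a comma-separated string; one edge each.
        for condition in study["condition"].split(","):
            condition = condition.strip()
            add_node(condition, condition)
            edges.append(Edge(source=study["nct_id"], target=condition))

    config = {"directed": True}  # placeholder for a real layout config
    return nodes, edges, config
```

Run against the test's two-study input, this yields 6 nodes (shared "Cancer" is deduplicated) and 5 edges, matching the assertions above.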