
Conversation

Collaborator

@MathieuCayssol MathieuCayssol commented Nov 27, 2025

Thank you for your Pull Request! We have developed this task checklist to help with the final steps of the process. Completing the below tasks helps to ensure our reviewers can maximize their time on your blog post.

Please check off each taskbox as an acknowledgment that you completed the task or check off that it is not relevant to your Pull Request. This checklist is part of the GitHub Actions workflows and the Pull Request will not be merged into the main branch until you have checked off each task.

  • Place Closes #<insert_issue_number> into the beginning of your Pull Request Title (Use Edit button in top-right if you need to update), and make sure the corresponding issue is linked in the Development section on the right hand side
  • Ensure your new post folder is of the form "posts/zzz_DO_NOT_EDIT_<your post title>". This is so that the post date can be auto-updated upon the merge into main.
  • Run the script from CICD.R line by line to first check the spelling in your post and then to make sure your code is compatible with our code-style. Address any incongruences by following the instructions in the file!
  • Choose (possibly several) tag(s) or categories from the current list: c("Metadata", "SDTM", "ADaM", "TLG", "Shiny", "Python", "Community", "Conferences", "Submissions", "Technical", "DEI") for your blog post. If you cannot find anything that fits your blog post, propose a new tag to the maintainers! Note: if you use a tag not from this list, the "Check Post Tags" CICD pipeline will error. We occasionally tidy up all tags for consistency.
  • Add a short description for your blog post in the description field at the top of the markdown document.
  • Blog post is short, personalized, reproducible and readable
  • Add a disclaimer at the top of your post, if appropriate (example: Disclaimer
    This blog contains opinions that are of the authors alone and do not necessarily reflect the strategy of their respective organizations.)
  • Address all merge conflicts and resolve appropriately
  • Assign two of us (@bms63, @manciniedoardo, @StefanThoma, @kaz462) as reviewers in the PR.
  • Pat yourself on the back for a job well done! Much love to your accomplishment!

Collaborator

@StefanThoma StefanThoma left a comment

Great post! I've added some comments :)


## Why this matters for pharmaverse programmers

Clinical reporting follows repeatable patterns—ADaM derivations, TFL layouts, and QC checks—but examples are scattered across templates and repositories. Templates cover most standard outputs; however, for non-standard requests or bespoke TFLs, keyword search often misses intent (e.g., “derive ABLFL in ADLB using a pre-dose baseline rule and analysis-visit windows”) because variable names and structures vary by study. CSA translates that intent into semantic retrieval, surfacing relevant, provenance-linked code snippets even when the wording doesn’t match exactly—so you can review, trust, and reuse.
Collaborator

What does provenance-linked mean in that context?

Collaborator

Linked to original content?

Collaborator

Lots of "–"

Collaborator Author

Clinical reporting is built on repeatable patterns for ADaM derivations, TFLs, and QC, but finding these patterns is a significant challenge. Relevant code is often scattered, and traditional keyword searches fail for non-standard logic because the underlying variable names and structural conventions differ from study to study. CSA directly addresses this root cause by using semantic retrieval to find code based on its conceptual intent, not just matching keywords (e.g., a plain-English query such as "derive ABLFL in ADLB using a pre-dose baseline rule and analysis-visit windows"). This means programmers can reliably discover, review, and reuse relevant patterns from across all repositories, saving substantial time and ensuring greater consistency in their work.


<img width="1015" height="622" alt="Image" src="https://github.com/user-attachments/assets/254c9966-663e-4d21-950c-9dede4939b31" />

Behind the scenes, we index code from our repositories, split it into coherent chunks, and generate short summaries that describe what each piece of code does, its inputs and outputs, and, for TLG programs, details such as titles and footnotes. We embed the summaries, store the vectors, and retain rich metadata such as programming language, file name, repository, and study description. The interface displays both the results and the parameters used so retrieval stays transparent and audit-friendly.
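For readers less familiar with the NLP vocabulary: "embedding" a summary means converting it into a numeric vector so that texts with similar meaning land close together, and those vectors are what the database stores and searches. A minimal sketch of what one index entry could look like, using a hashed bag-of-words as a stand-in for a real embedding model (all names and the sample chunk below are illustrative, not the actual CSA implementation):

```python
import math

def toy_embed(text, dim=64):
    # Toy stand-in for a real embedding model: a hashed bag-of-words.
    # A production pipeline would call an LLM embedding API instead.
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# One entry per code chunk: the summary's embedding plus rich metadata.
index = []

def add_chunk(code, summary, metadata):
    index.append({
        "embedding": toy_embed(summary),  # the stored vector
        "summary": summary,
        "code": code,
        "metadata": metadata,  # language, file name, repo, study description
    })

add_chunk(
    code='adsl <- derive_var_extreme_flag(adsl, new_var = ABLFL, ...)',
    summary="Derives the baseline flag ABLFL in ADSL from the last pre-dose record",
    metadata={"language": "R", "file": "adsl.R",
              "repo": "study-abc", "study": "phase III oncology"},
)
```

The summary, not the raw code, is what gets embedded, which is why a query phrased in plain English can still land on the right snippet.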
Collaborator

considering the audience, maybe you could add a short explanation on:

"embed the summaries, store the vectors".

For people not knowing about how NLP works, this is confusing.

Collaborator

Would be great to have a screenshot (or other visualisation) of what this looks like.
I.e. table where one column is the semantic summary, one is the code, one is the metadata, (and one could be something like a truncated version of the embeddings, just so people can get a glimpse of what that looks like under the hood, but unsure if this makes sense.)

Collaborator

Maybe silly question: Is this image staying there?
Usually we add images to the repo to ensure that it stays for future renders.

## How it works

### Understand the ask
We transform the user's question into a semantic search query + metadata filters.
Collaborator

Is each of these tasks done with an LLM? Or how is this being transformed?

Collaborator Author

@MathieuCayssol MathieuCayssol Dec 25, 2025

It's done by an agent (so a few LLM calls)

We transform the user's question into a semantic search query + metadata filters.
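As a toy illustration of this step: the real system uses a few LLM calls inside an agent, so the hard-coded vocabularies and rules below are purely hypothetical stand-ins for what the agent infers.

```python
# Toy stand-in for the "understand the ask" step. The real system uses an
# LLM agent; the hard-coded vocabularies here are purely illustrative.
KNOWN_DATASETS = {"adsl", "adae", "adlb"}
KNOWN_PHASES = ["phase iii", "phase ii", "phase i"]  # longest first

def understand_ask(question):
    q = question.lower()
    filters = {}
    for name in KNOWN_DATASETS:
        if name in q:
            filters["program_name"] = name
    # Check longer phase names first, since "phase iii" contains "phase i".
    for phase in KNOWN_PHASES:
        if phase in q:
            filters["study_description"] = phase
            break
    return {"semantic_query": question, "filters": filters}

ask = understand_ask("derive ABLFL in ADSL for a phase III study")
# ask["filters"] -> {"program_name": "adsl", "study_description": "phase iii"}
```

The question itself becomes the semantic query, while dataset and study hints become structured filters for the retrieval step.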

### Retrieve with context
We perform a semantic search over the vector database and apply substring matching based on the metadata filters. Summaries help match intent ("derive AVALC from PARAMCD") even if the snippet uses different variable names. The metadata filters help to narrow down the results to the most relevant snippets (e.g., adsl in the program name or phase III in the study description).
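A minimal sketch of what this retrieval step could look like, with hand-made three-dimensional vectors standing in for real embeddings (the entries and field names are illustrative only, not the production schema):

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)

def search(query_vec, index, filters, top_k=3):
    # Keep only chunks whose metadata contains every filter value
    # (substring match), then rank by similarity to the query embedding.
    hits = [
        entry for entry in index
        if all(value in entry["metadata"].get(key, "")
               for key, value in filters.items())
    ]
    hits.sort(key=lambda e: cosine(query_vec, e["embedding"]), reverse=True)
    return hits[:top_k]

# Tiny hand-made index for illustration.
index = [
    {"embedding": [1.0, 0.0, 0.0],
     "summary": "derive AVALC from PARAMCD",
     "metadata": {"file": "adlb.R", "study": "phase III oncology"}},
    {"embedding": [0.0, 1.0, 0.0],
     "summary": "demographics table by treatment arm",
     "metadata": {"file": "t_dm.R", "study": "phase II asthma"}},
]

results = search([0.9, 0.1, 0.0], index, filters={"file": "adlb"})
# results[0]["summary"] -> "derive AVALC from PARAMCD"
```

Filtering before ranking is what lets a vague semantic match get pinned down to the right dataset or study.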
Collaborator

Please add more information on the database here.
Also add which branches you are considering (I think only main and devel, right?), and how that design impacts the cost (e.g. fewer updates required) and the quality (QC'ed code).

Collaborator Author

I added the part concerning the branch for pulling the code. For the details related to the data pipeline, should I link the PHUSE paper? It gives a lot of details: https://phuse.s3.eu-central-1.amazonaws.com/Archive/2025/Connect/EU/Hamburg/PAP_ML15.pdf

Collaborator

Yes that would be nice


## What we built

CSA is a focused agent inside our Clinical Analysis Assistant. A user asks a question; we create an embedding of that query, apply optional metadata filters, search the vector database, and return the most relevant code chunks alongside their origins.
Collaborator

Cover also the scope: Is this for R code only, or do we do the same for SAS?

Collaborator Author

Both, but I'm not sure if it's meaningful to mention SAS for the pharmaverse blog

Collaborator

I think it's relevant, as it may appeal to a wider audience if it applies to both.


## What programmers can do today

With CSA, programmers can look up ADaM and TLG scripts by intent, such as *"an ADSL baseline flag with visit windowing"* or *"an AVALC mapping that treats missing values explicitly"*. They can locate TLG programs described in plain English (for instance, a table that splits columns by treatment arm and summarizes AVAL with mean and standard deviation) and then adapt a proven layout. They can also compare similar implementations across repositories to converge on a standard approach, while maintaining confidence through clear provenance back to the source.
Collaborator

Is there a way to tell whether code has been used in production yet?

Collaborator Author

Nope


## Metrics and usage

Over the observation window, the Code Search Assistant (CSA) handled 791 questions from 45 unique users, averaging 42 ± 14 weekly conversations and 92 ± 43 weekly questions. Satisfaction signals were positive: the broader Clinical Analysis Assistant (CAA) scored 3.39/5 for usefulness (n = 174), while the CSA specifically was rated at ~4/5, indicating strong early traction among active users.
Collaborator

What is the observation window in this case?

Collaborator

Can this be updated for a specific period?

Collaborator Author

Added: 15th of September 2025 - 19th of October 2025


## Limitations & what’s next

Today, CSA focuses on fast, trustworthy discovery. We are working on tighter IDE integration via MCP (the Model Context Protocol) so you can search directly from your editor. We also want to enable better study awareness so results are automatically filtered to the context of your study and clinical programming code. Finally, we are refining the data pipeline to continuously index changes without any manual effort and adding SDTM programs as well. These steps aim to make retrieval not only smarter but also more seamlessly woven into day-to-day programming.
Collaborator

What is MCP?

Collaborator

Study awareness is great. This could be done by ranking the results based on closeness to the study people are working on. e.g. same indication, same TA, or something.

Collaborator

Would be great to include some information about the data summary creation, how often it is done, approximate cost for how much data, why you chose to do it that way, instead of asking the full LLM to cover everything.

Collaborator Author

For update frequency, I would need to ask George as he's gonna take over the Clinical Analysis Assistant. I would expect from every month to every quarter

Collaborator Author

@MathieuCayssol MathieuCayssol Dec 25, 2025

For the cost, ~$100 for 100K summarized code chunks. Then, for the updates, it depends on how often the programs are changing.

Collaborator

@StefanThoma StefanThoma left a comment

LGTM




### Understand the ask
We transform the user's question into a semantic search query + metadata filters.
We transform the user's question into a semantic search query + metadata filters using the OpenAI Agents SDK.
Collaborator

Could you link here to more information for openai agent SDK?
https://openai.github.io/openai-agents-python/


### Retrieve with context
We perform a semantic search over the vector database and apply substring matching based on the metadata filters. Summaries help match intent ("derive AVALC from PARAMCD") even if the snippet uses different variable names. The metadata filters help to narrow down the results to the most relevant snippets (e.g., adsl in the program name or phase III in the study description).
We perform a semantic search over the vector database and apply substring matching based on the metadata filters. Summaries help match intent ("derive AVALC from PARAMCD") even if the snippet uses different variable names. The metadata filters help to narrow down the results to the most relevant snippets (e.g., adsl in the program name or phase III in the study description). The code is pulled from all the clinical programming repositories from the `devel` branch.
Collaborator

Suggested change
We perform a semantic search over the vector database and apply substring matching based on the metadata filters. Summaries help match intent ("derive AVALC from PARAMCD") even if the snippet uses different variable names. The metadata filters help to narrow down the results to the most relevant snippets (e.g., adsl in the program name or phase III in the study description). The code is pulled from all the clinical programming repositories from the `devel` branch.
We perform a semantic search over the vector database and apply substring matching based on the metadata filters. Summaries help match intent ("derive AVALC from PARAMCD") even if the snippet uses different variable names. The metadata filters help to narrow down the results to the most relevant snippets (e.g., adsl in the program name or phase III in the study description). The code is pulled from all the clinical programming repositories from the `devel` branch. Code on the `devel` branch has generally been QC'ed in our setup, which is a nice feature for code reuse.

@StefanThoma StefanThoma self-requested a review January 2, 2026 13:05