#312 feat: push blog article first draft #332
base: main
Conversation
StefanThoma
left a comment
Great post! I've added some comments :)
...nd__key.../beyond__keywords:__how__semantic__search_is__unlocking__clinical__code__reuse.qmd
## Why this matters for pharmaverse programmers

Clinical reporting follows repeatable patterns—ADaM derivations, TFL layouts, and QC checks—but examples are scattered across templates and repositories. Templates cover most standard outputs; however, for non-standard requests or bespoke TFLs, keyword search often misses intent (e.g., “derive ABLFL in ADLB using a pre-dose baseline rule and analysis-visit windows”) because variable names and structures vary by study. CSA translates that intent into semantic retrieval, surfacing relevant, provenance-linked code snippets even when the wording doesn’t match exactly—so you can review, trust, and reuse.
What does provenance-linked mean in that context?
Linked to original content?
Lots of "–"
Clinical reporting is built on repeatable patterns for ADaM derivations, TFLs, and QC, but finding these patterns is a significant challenge. Relevant code is often scattered, and traditional keyword searches fail for non-standard logic because the underlying variable names and structural conventions differ from study to study. CSA directly addresses this root cause by using semantic retrieval to find code based on its conceptual intent, not just matching keywords (e.g., a plain-English query such as "derive ABLFL in ADLB using a pre-dose baseline rule and analysis-visit windows"). This means programmers can reliably discover, review, and reuse relevant patterns from across all repositories, saving substantial time and ensuring greater consistency in their work.
<img width="1015" height="622" alt="Image" src="https://github.com/user-attachments/assets/254c9966-663e-4d21-950c-9dede4939b31" />
Behind the scenes, we index code from our repositories, split it into coherent chunks, and generate short summaries that describe what each piece of code does, its inputs and outputs, and, for TLG programs, details such as titles and footnotes. We embed the summaries, store the vectors, and retain rich metadata such as programming language, file name, repository, and study description. The interface displays both the results and the parameters used so retrieval stays transparent and audit-friendly.
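As a rough illustration of that indexing flow (all names here are ours, and the embedding is a toy stand-in: the real pipeline uses LLM-generated summaries and a proper embedding model):

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 8) -> list[float]:
    # Toy stand-in for an embedding model: hash each token into a bucket
    # of a fixed-size vector, then L2-normalize.
    vec = [0.0] * dim
    for tok in text.lower().split():
        bucket = int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def index_chunk(code: str, summary: str, metadata: dict) -> dict:
    # One vector-store record per code chunk: the summary is what gets
    # embedded; the raw code and rich metadata travel alongside the vector.
    return {
        "embedding": toy_embed(summary),
        "code": code,
        "summary": summary,
        "metadata": metadata,
    }

store = [
    index_chunk(
        code='adsl <- adsl |> mutate(ABLFL = if_else(ADT <= TRTSDT, "Y", NA))',
        summary="Derives the baseline flag ABLFL in ADSL from the last pre-dose record",
        metadata={"language": "R", "file": "adsl.R",
                  "repo": "study-xyz", "study": "phase III oncology"},
    )
]
```

The `metadata` fields sketched here (language, file name, repository, study description) are the kind of attributes the substring filters operate on.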
Considering the audience, maybe you could add a short explanation of "embed the summaries, store the vectors". For people not knowing how NLP works, this is confusing.
Would be great to have a screenshot (or other visualisation) of what this looks like.
I.e. a table where one column is the semantic summary, one is the code, one is the metadata (and one could be something like a truncated version of the embeddings, just so people can get a glimpse of what that looks like under the hood, but unsure if this makes sense).
Maybe silly question: Is this image staying there?
Usually we add images to the repo to ensure that it stays for future renders.
## How it works

### Understand the ask
We transform the user's question into a semantic search query + metadata filters.
Is each of these tasks done with an LLM? Or how is this being transformed?
It's done by an agent (so a few LLM calls)
### Retrieve with context
We perform a semantic search over the vector database + apply substring matching based on the metadata filters. Summaries help match intent (“derive AVALC from PARAMCD”) even if the snippet uses different variable names. The metadata filters help to narrow down the results to the most relevant snippets (e.g., adsl in the program name or phase III in the study description).
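A minimal, self-contained sketch of this retrieval step (illustrative only; `search`, the record layout, and the two-dimensional embeddings are our stand-ins, not the production vector database):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)

def search(query_vec, filters, records, top_k=3):
    # Drop records whose metadata does not contain every filter substring,
    # then rank the survivors by similarity to the query embedding.
    def passes(meta):
        return all(needle.lower() in str(meta.get(field, "")).lower()
                   for field, needle in filters.items())
    hits = [r for r in records if passes(r["metadata"])]
    hits.sort(key=lambda r: cosine(query_vec, r["embedding"]), reverse=True)
    return hits[:top_k]

records = [
    {"embedding": [1.0, 0.0], "code": "# derive ABLFL in ADSL",
     "metadata": {"file": "adsl.R", "study": "phase III"}},
    {"embedding": [0.0, 1.0], "code": "# AE summary table",
     "metadata": {"file": "adae.R", "study": "phase II"}},
]
top = search([0.9, 0.1], {"file": "adsl"}, records)  # only the ADSL chunk survives the filter
```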
Please add more information on the database here.
Also add which branches you are considering (I think only main and devel, right?), and how that design impacts the cost (e.g. fewer updates required) and the quality (qced code).
I added the part concerning the branch for pulling the code. For the details related to the data pipeline, should I link the PHUSE paper? It gives a lot of details: https://phuse.s3.eu-central-1.amazonaws.com/Archive/2025/Connect/EU/Hamburg/PAP_ML15.pdf
Yes that would be nice
...nd__key.../beyond__keywords:__how__semantic__search_is__unlocking__clinical__code__reuse.qmd
## What we built

CSA is a focused agent inside our Clinical Analysis Assistant. A user asks a question; we create an embedding of that query, apply optional metadata filters, search the vector database, and return the most relevant code chunks alongside their origins.
Cover also the scope: Is this for R code only, or do we do the same for SAS?
Both, but I'm not sure if it's meaningful to mention SAS for the pharmaverse blog
I think it's relevant, as it may appeal to a wider audience if it applies to both.
## What programmers can do today

With CSA, programmers can look up ADaM and TLG scripts by intent, such as *"asking for an ADSL baseline flag with visit windowing"* or *"an AVALC mapping that treats missing values explicitly"*. They can locate TLG programs described in plain English (for instance, a table that splits columns by treatment arm and summarizes AVAL with mean and standard deviation) and then adapt a proven layout. They can also compare similar implementations across repositories to converge on a standard approach while maintaining confidence through clear provenance back to source.
Is there a way to tell whether code has been used in production yet?
Nope
## Metrics and usage

Over the observation window, the Code Search Assistant (CSA) handled 791 questions from 45 unique users, averaging 42 ± 14 weekly conversations and 92 ± 43 weekly questions. Satisfaction signals were positive: the broader Clinical Analysis Assistant (CAA) scored 3.39/5 for usefulness (n = 174), while the CSA specifically was rated at ~4/5, indicating strong early traction among active users.
What is the observation window in this case?
Can this be updated for a specific period?
Added: 15th of September 2025 - 19th of October 2025
## Limitations & what’s next

Today, CSA focuses on fast, trustworthy discovery. We are working on tighter IDE integration via MCP so you can search directly from your editor. We also want to enable better study awareness so results are automatically filtered to the context of your study and clinical programming code. Finally, we are refining the data pipeline to continuously index changes without any manual effort and adding SDTM programs as well. These steps aim to make retrieval not only smarter but also more seamlessly woven into day-to-day programming.
What is MCP?
Study awareness is great. This could be done by ranking the results based on closeness to the study people are working on. e.g. same indication, same TA, or something.
Would be great to include some information about the data summary creation, how often it is done, approximate cost for how much data, why you chose to do it that way, instead of asking the full LLM to cover everything.
For update frequency, I would need to ask George as he's gonna take over the Clinical Analysis Assistant. I would expect from every month to every quarter
For the cost, ~$100 for 100K summarized code chunks. Then, for the updates, it depends on how often the programs are changing.
StefanThoma
left a comment
LGTM
### Understand the ask
We transform the user's question into a semantic search query + metadata filters using the openai agent SDK.
Could you link here to more information for openai agent SDK?
https://openai.github.io/openai-agents-python/
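To picture what this "understand the ask" step produces, here is a rough sketch of the target structure; the agent call itself (e.g. via the openai-agents SDK) is stubbed with hand-written rules, and `SearchRequest`, `understand_the_ask`, and the filter field names are illustrative, not the production API:

```python
from dataclasses import dataclass, field

@dataclass
class SearchRequest:
    # Structured output expected from the agent: a rephrased semantic
    # query plus optional metadata filters (field name -> substring).
    semantic_query: str
    filters: dict = field(default_factory=dict)

def understand_the_ask(question: str) -> SearchRequest:
    # Stub for the agent call; in the real assistant an LLM agent emits
    # this structure after reading the user's question.
    filters = {}
    lowered = question.lower()
    if "adsl" in lowered:
        filters["program_name"] = "adsl"
    if "phase iii" in lowered:
        filters["study_description"] = "phase III"
    return SearchRequest(semantic_query=question, filters=filters)

req = understand_the_ask("Find the ADSL baseline flag derivation for our phase III study")
```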
### Retrieve with context
We perform a semantic search over the vector database and apply substring matching based on the metadata filters. Summaries help match intent (“derive AVALC from PARAMCD”) even if the snippet uses different variable names. The metadata filters help to narrow down the results to the most relevant snippets (e.g., adsl in the program name or phase III in the study description). The code is pulled from all the clinical programming repositories from the `devel` branch.
Suggested change: append "Code on the `devel` branch has generally been QC'ed in our setup, which is a nice feature for code reuse."
Thank you for your Pull Request! We have developed this task checklist to help with the final steps of the process. Completing the below tasks helps to ensure our reviewers can maximize their time on your blog post.
Please check off each taskbox as an acknowledgment that you completed the task, or check off that it is not relevant to your Pull Request. This checklist is part of the GitHub Action workflows, and the Pull Request will not be merged into the main branch until you have checked off each task.

- [ ] Name your file "posts/zzz_DO_NOT_EDIT_<your post title>". This is so that the post date can be auto-updated upon the merge into main.
- [ ] Run CICD.R line by line to first check the spelling in your post and then to make sure your code is compatible with our code-style. Address any incongruences by following the instructions in the file!
- [ ] Select tag(s) or categories from the current list: c("Metadata", "SDTM", "ADaM", "TLG", "Shiny", "Python", "Community", "Conferences", "Submissions", "Technical", "DEI") for your blog post. If you cannot find anything that fits your blog post, propose a new tag to the maintainers! Note: if you use a tag not from this list, the "Check Post Tags" CICD pipeline will error. We occasionally tidy up all tags for consistency.
- [ ] Add a description field at the top of the markdown document.
- [ ] Note the disclaimer: "This blog contains opinions that are of the authors alone and do not necessarily reflect the strategy of their respective organizations."