#312 feat: push blog article first draft #332
base: main
Conversation
StefanThoma
left a comment
Great post! I've added some comments :)
...nd__key.../beyond__keywords:__how__semantic__search_is__unlocking__clinical__code__reuse.qmd
## Why this matters for pharmaverse programmers

Clinical reporting follows repeatable patterns—ADaM derivations, TFL layouts, and QC checks—but examples are scattered across templates and repositories. Templates cover most standard outputs; however, for non-standard requests or bespoke TFLs, keyword search often misses intent (e.g., “derive ABLFL in ADLB using a pre-dose baseline rule and analysis-visit windows”) because variable names and structures vary by study. CSA translates that intent into semantic retrieval, surfacing relevant, provenance-linked code snippets even when the wording doesn’t match exactly—so you can review, trust, and reuse.
What does provenance-linked mean in that context?
Linked to original content?
Lots of "–"
Clinical reporting is built on repeatable patterns for ADaM derivations, TFLs, and QC, but finding these patterns is a significant challenge. Relevant code is often scattered, and traditional keyword searches fail for non-standard logic because the underlying variable names and structural conventions differ from study to study. CSA directly addresses this root cause by using semantic retrieval to find code based on its conceptual intent, not just matching keywords (e.g., a plain-English query such as "derive ABLFL in ADLB using a pre-dose baseline rule and analysis-visit windows"). This means programmers can reliably discover, review, and reuse relevant patterns from across all repositories, saving substantial time and ensuring greater consistency in their work.
<img width="1015" height="622" alt="Image" src="https://github.com/user-attachments/assets/254c9966-663e-4d21-950c-9dede4939b31" />
Behind the scenes, we index code from our repositories, split it into coherent chunks, and generate short summaries that describe what each piece of code does, its inputs and outputs, and, for TLG programs, details such as titles and footnotes. We embed the summaries, store the vectors, and retain rich metadata such as programming language, file name, repository, and study description. The interface displays both the results and the parameters used so retrieval stays transparent and audit-friendly.
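As a rough illustration of that indexing flow (all names here are ours, and the embedding is a toy stand-in: the real pipeline uses LLM-generated summaries and a proper embedding model):

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 8) -> list[float]:
    # Toy stand-in for an embedding model: hash each token into a bucket
    # of a fixed-size vector, then L2-normalize.
    vec = [0.0] * dim
    for tok in text.lower().split():
        bucket = int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def index_chunk(code: str, summary: str, metadata: dict) -> dict:
    # One vector-store record per code chunk: the summary is what gets
    # embedded; the raw code and rich metadata travel alongside the vector.
    return {
        "embedding": toy_embed(summary),
        "code": code,
        "summary": summary,
        "metadata": metadata,
    }

store = [
    index_chunk(
        code='adsl <- adsl |> mutate(ABLFL = if_else(ADT <= TRTSDT, "Y", NA))',
        summary="Derives the baseline flag ABLFL in ADSL from the last pre-dose record",
        metadata={"language": "R", "file": "adsl.R",
                  "repo": "study-xyz", "study": "phase III oncology"},
    )
]
```

The `metadata` fields sketched here (language, file name, repository, study description) are the kind of attributes the substring filters operate on.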
Considering the audience, maybe you could add a short explanation of "embed the summaries, store the vectors". For people not knowing how NLP works, this is confusing.
Would be great to have a screenshot (or other visualisation) of what this looks like.
I.e. a table where one column is the semantic summary, one is the code, one is the metadata (and one could be something like a truncated version of the embeddings, just so people can get a glimpse of what that looks like under the hood, but unsure if this makes sense).
Maybe silly question: Is this image staying there?
Usually we add images to the repo to ensure that it stays for future renders.
## How it works

### Understand the ask
We transform the user's question into a semantic search query + metadata filters.
Is each of these tasks done with an LLM? Or how is this being transformed?
It's done by an agent (so a few LLM calls)
### Retrieve with context
We perform a semantic search over the vector database + apply substring matching based on the metadata filters. Summaries help match intent (“derive AVALC from PARAMCD”) even if the snippet uses different variable names. The metadata filters help to narrow down the results to the most relevant snippets (e.g., adsl in the program name or phase III in the study description).
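A minimal, self-contained sketch of this retrieval step (illustrative only; `search`, the record layout, and the two-dimensional embeddings are our stand-ins, not the production vector database):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)

def search(query_vec, filters, records, top_k=3):
    # Drop records whose metadata does not contain every filter substring,
    # then rank the survivors by similarity to the query embedding.
    def passes(meta):
        return all(needle.lower() in str(meta.get(field, "")).lower()
                   for field, needle in filters.items())
    hits = [r for r in records if passes(r["metadata"])]
    hits.sort(key=lambda r: cosine(query_vec, r["embedding"]), reverse=True)
    return hits[:top_k]

records = [
    {"embedding": [1.0, 0.0], "code": "# derive ABLFL in ADSL",
     "metadata": {"file": "adsl.R", "study": "phase III"}},
    {"embedding": [0.0, 1.0], "code": "# AE summary table",
     "metadata": {"file": "adae.R", "study": "phase II"}},
]
top = search([0.9, 0.1], {"file": "adsl"}, records)  # only the ADSL chunk survives the filter
```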
Please add more information on the database here.
Also add which branches you are considering (I think only main and devel, right?), and how that design impacts the cost (e.g. fewer updates required) and the quality (qced code).
I added the part concerning the branch for pulling the code. For the details related to the data pipeline, should I link the PHUSE paper? It gives a lot of details: https://phuse.s3.eu-central-1.amazonaws.com/Archive/2025/Connect/EU/Hamburg/PAP_ML15.pdf
Yes that would be nice
...nd__key.../beyond__keywords:__how__semantic__search_is__unlocking__clinical__code__reuse.qmd
## What we built

CSA is a focused agent inside our Clinical Analysis Assistant. A user asks a question; we create an embedding of that query, apply optional metadata filters, search the vector database, and return the most relevant code chunks alongside their origins.
Cover also the scope: Is this for R code only, or do we do the same for SAS?
Both, but I'm not sure if it's meaningful to mention SAS for the pharmaverse blog
I think it's relevant, as it may appeal to a wider audience if it applies to both.
## What programmers can do today

With CSA, programmers can look up ADaM and TLG scripts by intent, such as *"asking for an ADSL baseline flag with visit windowing"* or *"an AVALC mapping that treats missing values explicitly"*. They can locate TLG programs described in plain English (for instance, a table that splits columns by treatment arm and summarizes AVAL with mean and standard deviation) and then adapt a proven layout. They can also compare similar implementations across repositories to converge on a standard approach while maintaining confidence through clear provenance back to source.
Is there a way to tell whether code has been used in production yet?
Nope
## Metrics and usage

Over the observation window, the Code Search Assistant (CSA) handled 791 questions from 45 unique users, averaging 42 ± 14 weekly conversations and 92 ± 43 weekly questions. Satisfaction signals were positive: the broader Clinical Analysis Assistant (CAA) scored 3.39/5 for usefulness (n = 174), while the CSA specifically was rated at ~4/5, indicating strong early traction among active users.
What is the observation window in this case?
Can this be updated for a specific period?
Added: 15th of September 2025 - 19th of October 2025
## Limitations & what’s next

Today, CSA focuses on fast, trustworthy discovery. We are working on tighter IDE integration via MCP so you can search directly from your editor. We also want to enable better study awareness so results are automatically filtered to the context of your study and clinical programming code. Finally, we are refining the data pipeline to continuously index changes without any manual effort and adding SDTM programs as well. These steps aim to make retrieval not only smarter but also more seamlessly woven into day-to-day programming.
What is MCP?
Study awareness is great. This could be done by ranking the results based on closeness to the study people are working on. e.g. same indication, same TA, or something.
Would be great to include some information about the data summary creation, how often it is done, approximate cost for how much data, why you chose to do it that way, instead of asking the full LLM to cover everything.
For update frequency, I would need to ask George as he's gonna take over the Clinical Analysis Assistant. I would expect from every month to every quarter
For the cost, ~$100 for 100K summarized code chunks. Then, for the updates, it depends on how often the programs are changing.
StefanThoma
left a comment
LGTM
### Understand the ask
We transform the user's question into a semantic search query + metadata filters using the openai agent SDK.
Could you link here to more information for openai agent SDK?
https://openai.github.io/openai-agents-python/
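To picture what this "understand the ask" step produces, here is a rough sketch of the target structure; the agent call itself (e.g. via the openai-agents SDK) is stubbed with hand-written rules, and `SearchRequest`, `understand_the_ask`, and the filter field names are illustrative, not the production API:

```python
from dataclasses import dataclass, field

@dataclass
class SearchRequest:
    # Structured output expected from the agent: a rephrased semantic
    # query plus optional metadata filters (field name -> substring).
    semantic_query: str
    filters: dict = field(default_factory=dict)

def understand_the_ask(question: str) -> SearchRequest:
    # Stub for the agent call; in the real assistant an LLM agent emits
    # this structure after reading the user's question.
    filters = {}
    lowered = question.lower()
    if "adsl" in lowered:
        filters["program_name"] = "adsl"
    if "phase iii" in lowered:
        filters["study_description"] = "phase III"
    return SearchRequest(semantic_query=question, filters=filters)

req = understand_the_ask("Find the ADSL baseline flag derivation for our phase III study")
```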
### Retrieve with context
We perform a semantic search over the vector database and apply substring matching based on the metadata filters. Summaries help match intent (“derive AVALC from PARAMCD”) even if the snippet uses different variable names. The metadata filters help to narrow down the results to the most relevant snippets (e.g., adsl in the program name or phase III in the study description). The code is pulled from all the clinical programming repositories from the `devel` branch.
Suggested change: append "Code on the `devel` branch has generally been QC'ed in our setup, which is a nice feature for code reuse."
Thank you for your Pull Request! We have developed this task checklist to help with the final steps of the process. Completing the below tasks helps to ensure our reviewers can maximize their time on your blog post.
Please check off each taskbox as an acknowledgment that you completed the task, or check off that it is not relevant to your Pull Request. This checklist is part of the GitHub Action workflows, and the Pull Request will not be merged into the main branch until you have checked off each task.

- [ ] Name your file "posts/zzz_DO_NOT_EDIT_<your post title>". This is so that the post date can be auto-updated upon the merge into main.
- [ ] Run CICD.R line by line to first check the spelling in your post and then to make sure your code is compatible with our code-style. Address any incongruences by following the instructions in the file!
- [ ] Select tag(s) or categories from the current list: c("Metadata", "SDTM", "ADaM", "TLG", "Shiny", "Python", "Community", "Conferences", "Submissions", "Technical", "DEI") for your blog post. If you cannot find anything that fits your blog post, propose a new tag to the maintainers! Note: if you use a tag not from this list, the "Check Post Tags" CICD pipeline will error. We occasionally tidy up all tags for consistency.
- [ ] Add a description field at the top of the markdown document.
- [ ] Note the disclaimer: "This blog contains opinions that are of the authors alone and do not necessarily reflect the strategy of their respective organizations."