Conversation

@jjttkk jjttkk commented Nov 2, 2025

Prompt optimization for TLR

in Praktikum: Werkzeuge für agile Modellierung - Summer Term 2025

Table of Contents

  • Introduction
  • Design and Implementation of the LiSSA-Prompt-Optimizer
  • Performance-Evaluation
  • Conclusion and Future Work

Introduction

Task Description

The task of this project was to extend the LiSSA framework with the ability to optimize its prompts automatically. LiSSA
is a tool for trace link recovery (TLR) that relies on Large Language Models (LLMs) to classify relations between
software artifacts such as requirements, code, documentation, and models. The quality of the results depends heavily on
the phrasing of the prompts given to the model, but manually writing and testing different prompts until the best one is
found is very time-consuming.

The main goal was therefore to design and implement an automatic prompt optimization mechanism that iteratively improves
prompts based on evaluation results. This required a thorough analysis not only of the LiSSA framework but also of
various existing prompt optimizers, followed by the implementation of new classes and adaptations to existing LiSSA
components so that the optimizer integrates seamlessly into the framework.

LiSSA

LiSSA (Language-based Support for Software Architecture) is an extensible framework for trace link recovery and related
tasks. It integrates artifact providers, preprocessors, embedding creators, classifiers, and result aggregators into a
configurable pipeline. A configuration is provided as a JSON file, specifying the components and their parameters. LiSSA
evaluates trace link candidates against a gold standard and reports metrics such as precision, recall, and F1-score.

Precision measures the accuracy of positive predictions, defined as the ratio of correctly predicted positive examples
(true positives, TP) to all examples predicted as positive (TP plus false positives, FP; i.e., the retrieved elements).
Recall measures how well the model identifies actual positive examples, defined as the ratio of correctly predicted
positive examples (TP) to all actual positive examples (TP plus false negatives, FN; i.e., the relevant elements).
The F1-score is the harmonic mean of precision and recall, providing a single metric that balances both. It is useful
whenever a trade-off between precision and recall is needed, especially on imbalanced datasets.


Figure 1: Overview of the classification of relevant elements and retrieved elements for a better understanding of precision, recall, and F1 score.
  • Precision: $\text{Precision} = \frac{TP}{TP + FP}$
  • Recall: $\text{Recall} = \frac{TP}{TP + FN}$
  • F1-score: $F1 = \frac{2}{\text{Precision}^{-1} + \text{Recall}^{-1}} = \frac{2 \times TP}{2 \times TP + FP + FN}$
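
The three metrics can be computed directly from the raw counts; the following is a minimal sketch (not taken from the
LiSSA codebase), guarding against division by zero:

```java
// Minimal sketch (not from the LiSSA codebase) computing the three metrics
// from raw TP/FP/FN counts, guarding against division by zero.
public class TraceLinkMetrics {

    public static double precision(int tp, int fp) {
        return (tp + fp) == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    public static double recall(int tp, int fn) {
        return (tp + fn) == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    public static double f1(int tp, int fp, int fn) {
        // Equivalent to the harmonic mean of precision and recall.
        return (2 * tp + fp + fn) == 0 ? 0.0 : (2.0 * tp) / (2 * tp + fp + fn);
    }
}
```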

Exemplary Prompt-Optimizers

In order to gain a better understanding of prompt optimizers, several prompt optimizers were analyzed and evaluated in
terms of their advantages, disadvantages, purpose, and how useful they could be in a framework such as LISSA.
Promptimizer (automatic prompt engineering). Other strategies here
include zero-shot, few-shot, and chain of thought.
With Promptoptimizer, the focus is more on reducing costs by
reducing the number of tokens that the LLM receives as
input. This approach was not pursued further with LiSSA initially, but it is interesting for future development in terms
of cost reduction and resource conservation.

The prompt strategies that were used are:

  • Zero-shot: no examples; quick responses for simple tasks.
  • Few-shot: multiple examples; more robust and nuanced results.
  • Chain of thought: explicit step-by-step reasoning for complex problems.
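
As an illustration, the three strategies might look like this for the TLR classification task; these are hypothetical
examples, not the actual LiSSA templates:

```java
// Illustrative sketches of the three strategies; these are NOT the actual
// LiSSA templates, only hypothetical examples of each style.
public class PromptStrategies {

    // Zero-shot: task description only, no examples.
    static final String ZERO_SHOT = """
            Are the following two requirements related? Answer "yes" or "no".
            """;

    // Few-shot: the task plus a handful of labeled examples.
    static final String FEW_SHOT = """
            Are the following two requirements related? Answer "yes" or "no".
            Example 1: "The system shall encrypt data." / "Data must use AES-256." -> yes
            Example 2: "The system shall encrypt data." / "The UI shall use a dark theme." -> no
            """;

    // Chain of thought: ask the model to reason before answering.
    static final String CHAIN_OF_THOUGHT = """
            Are the following two requirements related?
            First, describe the goal of each requirement.
            Then compare the goals step by step and answer "yes" or "no".
            """;
}
```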

Design and Implementation of the LiSSA-Prompt-Optimizer

Model

As mentioned above, the prompt optimizer for LiSSA is based on
the Promptimizer. First, LiSSA receives training
data and an initial prompt as usual, processes them, and returns results. These results are then compared, based on
their F1-score, to a predefined baseline, i.e., a target value for the F1-score. If the results are good enough
or the maximum number of iterations has been reached, the optimal prompt has been found. If not, an optimization prompt
is generated that contains the previous prompt as well as examples of true positives, false positives, and false
negatives (current distribution: 1 TP, 2 FP, 2 FN) from LiSSA's results, implementing the few-shot strategy.
This prompt is given to the optimizer LLM, which in turn generates an optimized prompt for LiSSA. LiSSA then starts the
next iteration.
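
The loop described above can be sketched roughly as follows; `runLissa`, `buildMetaPrompt`, and `askOptimizerLlm` are
hypothetical stand-ins for the evaluation run and the two LLM interactions, not the actual PromptOptimizer API:

```java
// Sketch of the control loop described above (not the actual PromptOptimizer
// code). runLissa, buildMetaPrompt, and askOptimizerLlm are hypothetical
// stand-ins for the evaluation run and the two LLM interactions.
public class OptimizationLoopSketch {

    public static String optimize(String initialPrompt, double targetF1, int maxIterations) {
        String bestPrompt = initialPrompt;
        double bestF1 = Double.NEGATIVE_INFINITY;
        String currentPrompt = initialPrompt;

        for (int i = 0; i < maxIterations; i++) {
            double f1 = runLissa(currentPrompt); // evaluate against the gold standard
            if (f1 > bestF1) {
                bestF1 = f1;
                bestPrompt = currentPrompt;
            }
            if (bestF1 >= targetF1) {
                break; // target reached: stop early
            }
            // Build the meta-prompt (previous prompt + TP/FP/FN examples)
            // and ask the optimizer LLM for an improved prompt.
            currentPrompt = askOptimizerLlm(buildMetaPrompt(currentPrompt, f1));
        }
        return bestPrompt;
    }

    // Stub standing in for a full LiSSA evaluation run.
    static double runLissa(String prompt) {
        return Math.min(0.6, 0.40 + prompt.length() / 1000.0);
    }

    static String buildMetaPrompt(String lastPrompt, double lastF1) {
        return "Improve this prompt (F1=" + lastF1 + "): " + lastPrompt;
    }

    // Stub standing in for the optimizer LLM call.
    static String askOptimizerLlm(String metaPrompt) {
        return metaPrompt + " Consider the context and intent of both requirements.";
    }
}
```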


Figure 2: Model for the LiSSA prompt optimizer.

In addition to the design of this model, the following design objectives were established:

  • Iteration control: The optimizer stops early if a target F1-score is reached, preventing unnecessary runs.
  • Separation of concerns: Optimization logic and prompt generation are split. This makes it easier to exchange or
    extend strategies.
  • Logging: Each iteration logs F1 values, generated prompts, and the best prompt. This
    ensures transparency and comparability of runs for future evaluation.
  • Minimally invasive: Changes to LiSSA's existing classes are kept to a minimum. LiSSA is treated as a black box as
    far as possible, and the new package should exist largely separately from the rest of the project.

Components


Figure 3: Class diagram of the implementation.

Package edu.kit.kastel.sdq.lissa.cli.command

OptimizeCommand.java (new) This CLI subcommand enables users to run the prompt optimizer directly from the command line. It accepts a configuration file path, a maximum iteration count, and a target F1-score as parameters. The command initializes the cache directory from the configuration and then delegates control to the PromptOptimizer.

Package edu.kit.kastel.sdq.lissa.ratlr.optimizer (new)

PromptOptimizer.java (new) This is the core component of the optimization process. It coordinates repeated evaluation runs, tracks the best prompt and F1-score, and decides when to stop iterating depending on the F1-score and the number of iterations. It logs the F1-scores, the prompts sent to the optimizer LLM and to LiSSA, as well as the best result in dedicated files in the "logs/" directory. When further improvement is needed, it calls the PromptWriter to generate a new prompt. The optimizer ensures that optimized configurations are saved in a separate "optimized/" directory to preserve reproducibility.
PromptWriter.java (new) The PromptWriter is responsible for generating new prompts using an external LLM (via the ChatLanguageModelProvider). It constructs a meta-prompt that describes the task and provides the last template, the last F1-score, and representative classification results. The LLM responds with a new LiSSA prompt, which is then returned for use in the next iteration with LiSSA.
PromptGenerationResult.java (new) A data holder that contains two strings: the meta-prompt for the optimizer LLM and the prompt for LiSSA.
ClassificationResultsManager.java (new) This class supports saving and loading detailed classification results. It allows the PromptOptimizer to feed realistic examples back to the LLM during prompt improvement. This feedback loop helps generate prompts that directly address observed weaknesses or strengths.
DetailedClassificationResult.java (new) Represents a single classification outcome, including its type (TP, FP, FN) and the IDs and contents of its source and target artifacts. These objects provide the building blocks for the ClassificationResultsManager to generate useful examples for the prompt optimization process.
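
The data shape and the example selection described for these two classes could be sketched as follows; field and
method names here are assumptions, not the actual class definitions from the optimizer package:

```java
import java.util.List;

// Sketch of the data shape described above; field and method names are
// assumptions, not the actual class definitions from the optimizer package.
public class ClassificationExamples {

    enum Category { TP, FP, FN }

    // One classification outcome, including the artifacts involved.
    record DetailedResult(Category category, String sourceId, String sourceContent,
                          String targetId, String targetContent) { }

    // Select at most n examples of one category for the meta-prompt
    // (the optimizer currently uses a 1 TP / 2 FP / 2 FN distribution).
    static List<DetailedResult> pick(List<DetailedResult> all, Category category, int n) {
        return all.stream().filter(r -> r.category() == category).limit(n).toList();
    }
}
```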

Package edu.kit.kastel.sdq.lissa.ratlr

Evaluation.java The Evaluation class represents a complete LiSSA evaluation run, coordinating all steps of the trace link recovery pipeline. For the optimizer, the class was slightly extended to hook into the pipeline and allow saving detailed classification results. These results are written into JSON files and later reused by the PromptWriter to build improved prompts.
Statistics.java The Statistics class provides utilities for evaluating results against a gold standard. The main change was to make the method getTraceLinksFromGoldStandard(...) public instead of package-private. This modification allows external components, particularly the PromptOptimizer, to directly retrieve the gold standard trace links. These are needed to compute F1-scores outside the standard reporting workflow. By exposing this method, the optimizer can compare predicted trace links with the ground truth independently of the usual reporting, enabling fine-grained control of the optimization loop.

Package edu.kit.kastel.sdq.lissa.ratlr.configuration

Configuration.java The record was extended with a configuration for the optimizer and the method withReplacedClassifier, enabling the optimizer to substitute only the classifier (and its prompt) while leaving the rest of the configuration unchanged. This method is important for creating optimized configurations efficiently.
ModuleConfiguration.java Extended with methods to retrieve the ModuleConfiguration as a hash code, a string, or a map of its arguments.
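
The idea behind withReplacedClassifier can be illustrated with a simplified record; the field names here are
assumptions and do not match the real Configuration record exactly:

```java
import java.util.Map;

// Simplified sketch of the idea behind withReplacedClassifier: a record that
// copies itself with only the classifier module swapped. The field names are
// assumptions, not the real Configuration record.
public class ConfigSketch {

    record ModuleConfig(String name, Map<String, String> arguments) { }

    record Config(ModuleConfig artifactProvider, ModuleConfig classifier) {
        // All other components stay untouched; only the classifier changes.
        Config withReplacedClassifier(ModuleConfig newClassifier) {
            return new Config(artifactProvider, newClassifier);
        }
    }
}
```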

Package edu.kit.kastel.sdq.lissa.ratlr.classifier

Classifier.java Extended with getter and setter methods for the prompt.
SimpleClassifier.java Extends Classifier and implements the getter and setter methods.

Package edu.kit.kastel.sdq.lissa.ratlr.utils

LogWriter.java (new) A utility class to standardize file logging within the PromptOptimizer. It writes logs to the "logs/" directory, handling both appending and overwriting. By centralizing this functionality, consistent and reliable logging across all iterations is ensured.
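
A minimal version of such centralized logging might look as follows; since the actual LogWriter API is not shown in
the report, the method and path names here are assumptions:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal sketch of centralized logging as described above; the real
// LogWriter API is not shown in the report, so these names are assumptions.
public class LogWriterSketch {

    private static final Path LOG_DIR = Path.of("logs");

    // Writes one line to logs/<fileName>, either appending or overwriting.
    public static void write(String fileName, String line, boolean append) {
        try {
            Files.createDirectories(LOG_DIR);
            OpenOption[] options = append
                    ? new OpenOption[] {StandardOpenOption.CREATE, StandardOpenOption.APPEND}
                    : new OpenOption[] {StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                            StandardOpenOption.TRUNCATE_EXISTING};
            Files.writeString(LOG_DIR.resolve(fileName), line + System.lineSeparator(), options);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```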

Performance-Evaluation

The goal of the performance evaluation was to measure F1-scores across multiple optimization runs and analyze their
progression over time. The central question was whether a global peak exists, after which no further improvements can be
expected. Identifying such a peak would allow future optimization runs to be restricted to a sensible maximum number of
iterations, thereby saving both computational resources and costs.

In the first test series, five runs were executed with ten iterations each. The results indicated that improvements
were still visible in the last and second-to-last iterations. At the same time, two runs already achieved F1-scores
above 0.5 in the second iteration, a considerable improvement over the F1-score of about 0.4394 achieved by the
initial prompt without any optimization.


Figure 4: F1-score over 10 iterations for 5 runs; a tabular representation of the data set can be found in [dataset_testseries_1.md](dataset_testseries_1.md)

This provided a promising first impression, but no clear global peak could be identified at this stage. Consequently,
the evaluation was extended to a second test series with up to fifty iterations.
The extended evaluation revealed a clearer picture. Although once again no global maximum was reached, it became
apparent that the F1-scores did not improve any further after the 23rd iteration and that the values stabilized
relatively early on overall. Interestingly, the best value of all runs in the second test series was
already reached in iteration 11 (in run 4). This suggests that performing a significantly larger number of iterations
is not meaningful, as the F1-score plateaus early and the additional computational effort is not justified by marginal
or non-existent improvements.


Figure 5: F1-score over 50 iterations for 5 runs; a tabular representation of the data set can be found in [dataset_testseries_2.md](dataset_testseries_2.md)

The best prompt of each run in the test series can be found in testSeriesPrompts.md. The
F1-scores were extracted from the f1_log.txt file and paired with the prompts LiSSA worked with in the respective
iteration from the promptsToLissa_log.txt.

One run in particular (run 5) showed a distinct anomaly. After reaching its best value in iteration 23, the F1-score
dropped sharply, even falling below the LiSSA score without optimization (0.4394), and it did not recover in later
iterations. A closer comparison with other runs revealed that this run generated a higher number of negative rules (
instructions on what LiSSA should classify as false). In contrast, run 4 emphasized positive rules (clarifications on
what LiSSA should classify as true), which resulted in much better performance. In run 5, LiSSA became overly
cautious, which led to a collapse in recall and, as a consequence, a sharp decline in the overall F1-score.
This shows that using more positive examples when generating prompts for LiSSA could be a valid option in future
implementations.

Another important observation was that even very early iterations already provided notable improvements. For instance,
in the second iteration of the first test series, in runs 3 and 4, the F1-score increased from 0.4394 to 0.5045
simply by adding the sentence:

Consider the context and intent of both requirements. Are they addressing the same functionality, feature, or goal
within the software development process?

This highlights that even small, carefully designed prompt changes can produce significant performance gains without
requiring long optimization cycles.

Conclusion and Future Work

The integration of a prompt optimization mechanism into LiSSA demonstrated that automatic refinement of classification
prompts can significantly improve trace link recovery performance. Even with a limited number of iterations, F1-scores
increased notably compared to the baseline configuration. At the same time, the evaluation results showed that
improvements plateau after a certain point, which supports the idea of restricting optimization runs to avoid
unnecessary computational effort. Overall, the optimizer successfully extends LiSSA by introducing an iterative feedback
loop that adapts prompts based on observed classification outcomes.

Looking ahead, several extensions and enhancements are possible.
At present, the optimizer has been evaluated with requirement-to-requirement datasets. A reasonable
next step would be to extend the approach to other artifact combinations, such as requirements-to-code or
documentation-to-code. This would
broaden the applicability of the optimizer and demonstrate its usefulness in more diverse traceability scenarios.
Also, integrating additional language models (e.g., larger GPT models or open-source alternatives such as LLaMA or
Mistral) would increase flexibility and allow balancing quality, performance, and cost depending on the specific use
case. This would also enable systematic comparisons between different LLMs in terms of trace link classification
quality. In addition, an implementation that uses more positive examples when generating the prompt for LiSSA could be
a promising modification, as indicated by the comparison of runs 4 and 5 from the second test series.
At the moment, optimization progress is logged primarily through textual files. Adding an automated visualization
component, for example generating plots of F1-scores over iterations for each optimization-run, would make the
optimization process more transparent. Such visualizations would help users quickly assess whether further iterations
are worthwhile and identify anomalies such as sudden performance drops. Additionally, the logs could be enhanced with
timestamps so that prompts and F1-scores can be compared more easily across logs, even when several runs
have already been completed. Inspired by related work on prompt optimization (see the
Promptoptimizer), future work could focus more on reducing
token usage. Since LLM calls are both resource- and cost-sensitive, strategies such as prompt compression,
token-efficient formulations, or early stopping mechanisms could further improve the practicality of the optimizer.
Combining quality improvements with cost reduction would make the system more attractive for real-world applications.

In summary, the current work lays the foundation for prompt optimization within LiSSA, showing both its potential and
some of its limitations. By extending the optimizer to new artifact types, adding support for further models, adding
visualization tools, and focusing more on resource efficiency, the approach can be made more versatile and impactful
in the future.

@dfuchss dfuchss (Member) commented Nov 4, 2025

@jjttkk could you just change the code please :) The caches & datasets are not so important for the PR :)

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR integrates a prompt optimization system into the LiSSA framework, enabling automatic iterative improvement of classification prompts used for trace link recovery. The implementation is based on the Promptimizer approach and uses an LLM to generate improved prompts based on classification results (TP, FP, FN examples) from previous iterations.

Key changes:

  • Added a new optimizer package with core optimization logic including PromptOptimizer, PromptWriter, and classification results management
  • Removed the ContextStore pattern throughout the codebase, simplifying component initialization
  • Added CLI support for running optimization via the new OptimizeCommand

Reviewed Changes

Copilot reviewed 23 out of 24 changed files in this pull request and generated 12 comments.

File Description
src/main/java/edu/kit/kastel/sdq/lissa/ratlr/optimizer/PromptOptimizer.java Core optimizer coordinating iterative prompt refinement with F1-score tracking and early stopping
src/main/java/edu/kit/kastel/sdq/lissa/ratlr/optimizer/PromptWriter.java Generates improved prompts by sending meta-prompts with classification examples to an LLM
src/main/java/edu/kit/kastel/sdq/lissa/ratlr/optimizer/ClassificationResultsManager.java Manages storage and retrieval of detailed classification results (TP/FP/FN) for prompt improvement
src/main/java/edu/kit/kastel/sdq/lissa/ratlr/optimizer/DetailedClassificationResult.java Data model for individual classification outcomes with category and artifact content
src/main/java/edu/kit/kastel/sdq/lissa/ratlr/optimizer/PromptGenerationResult.java Container for both the meta-prompt and the generated LiSSA prompt
src/main/java/edu/kit/kastel/sdq/lissa/ratlr/utils/LogWriter.java Utility for standardized file logging in the logs directory
src/main/java/edu/kit/kastel/sdq/lissa/ratlr/utils/Environment.java Modified to log errors instead of throwing exceptions for missing env variables
src/main/java/edu/kit/kastel/sdq/lissa/ratlr/configuration/Configuration.java Added optimizer configuration field and withReplacedClassifier() method for generating optimized configs
src/main/java/edu/kit/kastel/sdq/lissa/ratlr/configuration/ModuleConfiguration.java Added arguments() accessor and removed unused argumentAsDouble methods
src/main/java/edu/kit/kastel/sdq/lissa/ratlr/classifier/Classifier.java Removed ContextStore dependency and added getPrompt()/setPrompt() methods
src/main/java/edu/kit/kastel/sdq/lissa/ratlr/classifier/SimpleClassifier.java Implemented prompt getter/setter and modified cache key generation
src/main/java/edu/kit/kastel/sdq/lissa/ratlr/Evaluation.java Added detailed classification results saving after each evaluation run
src/main/java/edu/kit/kastel/sdq/lissa/ratlr/Statistics.java Made getTraceLinksFromGoldStandard() public for optimizer access
src/main/java/edu/kit/kastel/sdq/lissa/cli/command/OptimizeCommand.java New CLI command for running prompt optimization with configurable iterations and target F1
src/main/java/edu/kit/kastel/sdq/lissa/cli/MainCLI.java Registered OptimizeCommand and added debugging code
src/test/java/edu/kit/kastel/sdq/lissa/ratlr/optimizer/PromptOptimizerTest.java Integration tests for optimizer creation and F1 improvement validation
.gitignore Added entries for cache, configs, and datasets directories
Comments suppressed due to low confidence (1)

src/main/java/edu/kit/kastel/sdq/lissa/ratlr/configuration/Configuration.java:120

                for (var classifier : group) {


List<Pair<Element, Element>> tasks = new ArrayList<>();

for (var source : sourceStore.getAllElements(true)) {
var targetCandidates = targetStore.findSimilar(source.second());
Copilot AI commented Nov 13, 2025

Incorrect method call in createClassificationTasks. Line 53 calls targetStore.findSimilar(source.second()) but should call targetStore.findSimilar(source) based on the pattern used elsewhere in the codebase. The source variable is already a Pair<String, Element> from getAllElements(true), so calling .second() would extract just the Element, losing the identifier information that may be needed by findSimilar.

Suggested change
var targetCandidates = targetStore.findSimilar(source.second());
var targetCandidates = targetStore.findSimilar(source);

Comment on lines 221 to 222
(traceLinkIdPostprocessor != null ? traceLinkIdPostprocessor
: new ModuleConfiguration("TraceLinkIdPostprocessor", Map.of()))
Copilot AI commented Nov 13, 2025

Missing return statement documentation. The method withReplacedClassifier at line 221 creates and returns a TraceLinkIdPostprocessor with an empty Map when traceLinkIdPostprocessor is null, but the return statement logic is embedded in a ternary operator without clear documentation of this fallback behavior. This could lead to unexpected behavior if callers assume the original value is preserved.

You are optimizing a classification task.
A model called "LiSSA" will be given a high-level requirement and a low-level requirement.\
"LiSSA" must decide whether these requirements are related or not. \
"LiSSA" must answer with "yes" or "no".\\
Copilot AI commented Nov 13, 2025

The escape sequence \\ at line 69 appears to be incorrect. The line uses a backslash before a newline in a text block, which would escape the newline character rather than creating one. This should likely be just a regular newline without the escape, or if intentional, needs clarification in comments.

Suggested change
"LiSSA" must answer with "yes" or "no".\\
"LiSSA" must answer with "yes" or "no".

private static final @Nullable Dotenv DOTENV = load();

private Environment() {
throw new IllegalAccessError("Utility class");
Copilot AI commented Nov 13, 2025

The constructor declares an IllegalAccessError but the more appropriate exception type would be UnsupportedOperationException for utility classes with private constructors that should never be instantiated. IllegalAccessError is typically used for class loading and access control violations at runtime.

Suggested change
throw new IllegalAccessError("Utility class");
throw new UnsupportedOperationException("Utility class");

Comment on lines +64 to +65
* <li>Logs an error if the variable is not found</li>
* <li>Returns the value (which may be null, despite the method name)</li>
Copilot AI commented Nov 13, 2025

The documentation states "Returns the value (which may be null, despite the method name)" but this contradicts the expected behavior of a method named getenvNonNull. This creates confusion for API consumers about whether null checking is required. The documentation should align with either the method throwing an exception or being renamed.

}

/**
*
Copilot AI commented Nov 13, 2025

Empty Javadoc line without any content. Line 246 has an empty Javadoc comment /** followed by just a newline and *. This should either contain documentation or be removed entirely.

Suggested change
*

Comment on lines +214 to +218
/**
* Saves detailed classification results for trace links to disk.
* This includes generating JSON results and merging source/target elements
* to produce TP, FP, FN categorizations for further analysis.
*/
Copilot AI commented Nov 13, 2025

The comment block starting at line 214 uses a multi-line comment (/** ... */) instead of inline code comments. This Javadoc-style comment is placed inside a method body, which is unconventional. Either convert it to regular inline comments (//) or move the documentation to a proper location.

Suggested change
/**
* Saves detailed classification results for trace links to disk.
* This includes generating JSON results and merging source/target elements
* to produce TP, FP, FN categorizations for further analysis.
*/
// Saves detailed classification results for trace links to disk.
// This includes generating JSON results and merging source/target elements
// to produce TP, FP, FN categorizations for further analysis.

From the examples, extract general rules \
and incorporate them into the new prompt so that LiSSA classifies better.\
Your output must ONLY be the final prompt text, no explanations or notes. \
If you refer to LiSSA directly, call it "you" since it does not know its name.\\
Copilot AI commented Nov 13, 2025

Inconsistent escape sequences in text block. Line 106 uses \\ before a newline which escapes it, but line 70 in the same text block structure does the same. This creates inconsistency with the rest of the text block formatting where newlines are not escaped. Review whether these escaped newlines are intentional or should be regular newlines.

"--target-f1", "0.40"
};
}
System.out.println("ARGS: " + Arrays.toString(args));
Copilot AI commented Nov 13, 2025

Debug print statement should be removed from production code. The System.out.println("ARGS: " + Arrays.toString(args)); statement at line 47 is a debugging leftover that should not be in the final codebase. Use proper logging instead or remove it entirely.

Suggested change
System.out.println("ARGS: " + Arrays.toString(args));

Comment on lines 73 to 75
double lastF1Score = Double.parseDouble(
lastF1String.substring("Best F1: ".length()).replace(',', '.')
);
Copilot AI commented Nov 13, 2025

Potential uncaught 'java.lang.NumberFormatException'.

Suggested change
double lastF1Score = Double.parseDouble(
lastF1String.substring("Best F1: ".length()).replace(',', '.')
);
double lastF1Score;
try {
lastF1Score = Double.parseDouble(
lastF1String.substring("Best F1: ".length()).replace(',', '.')
);
} catch (NumberFormatException e) {
org.junit.jupiter.api.Assertions.fail("Failed to parse F1 score from log: '" + lastF1String + "'", e);
return;
}
