Enable Unrestricted LLM Output with Parsing + Add One-Shot Anomaly Detection Support #39
Conversation
sarahmish
left a comment
Thanks @scherkao31! Great features to add! I have a few minor comments.
...ves/jsons/sigllm.primitives.prompting.timeseries_preprocessing.rolling_window_sequences.json
sigllm/primitives/transformation.py
```python
if normal:
    # Handle as single time series (window_size, 1)
    return _as_string(X)
else:
    # Handle as multiple windows (num_windows, window_size, 1)
    results = list(map(_as_string, X))
    return np.array(results)
```
I'm wondering if we should have it more generalized, regardless of `normal` or not.

```python
if X.ndim < 2:
    return _as_string(X)
else:
    ...
```

What do you think?
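The dimensionality-based dispatch suggested above could be sketched like this. This is a minimal sketch, not the repo's implementation: `_as_string` here is a hypothetical stand-in for the existing helper, and it assumes a single series arrives as a 1-D array after squeezing the trailing axis.

```python
import numpy as np


def _as_string(x):
    # Hypothetical stand-in: render one window as a comma-separated string.
    return ','.join(str(v) for v in np.asarray(x).flatten())


def format_as_string(X):
    # Dispatch on dimensionality instead of a normal/single flag:
    # a 1-D input is a single series, anything higher is a batch of windows.
    if X.ndim < 2:
        return _as_string(X)

    results = list(map(_as_string, X))
    return np.array(results)
```

With this shape check, the caller no longer needs to pass a flag describing the input, which removes one hyperparameter from the primitive.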
```python
def parse_anomaly_response(X):
    """Parse a list of lists of LLM responses to extract anomaly values and format them as strings.

    Args:
        X (List[List[str]]): List of lists of response texts from the LLM in the format
            "Answer: no anomalies" or "Answer: [val1, val2, ..., valN]"
    Returns:
        List[List[str]]: List of lists of parsed responses where each element is either
            "val1,val2,...,valN" if anomalies are found,
            or empty string if no anomalies are present
    """
```
reformat docstrings to keep the arg name and description on separate lines:

```python
"""Extract anomalies from LLM response.

Parse a list of lists of LLM responses to extract anomaly
values and format them as strings.

Args:
    X (List[List[str]]):
        List of lists of response texts from the LLM in the format
        "Answer: no anomalies" or "Answer: [val1, val2, ..., valN]".

Returns:
    List[List[str]]:
        List of lists of parsed responses where each element is either
        "val1,val2,...,valN" if anomalies are found, or empty string if
        no anomalies are present.
"""
```
The first line of the docstring should be shorter. I already have a description mentioned in my comment above.
sigllm/primitives/transformation.py
```python
        or empty string if no anomalies are present
    """


def parse_single_response(text: str) -> str:
```
I would remove type checking here since it's a private function. I also recommend removing comments that are not necessary.
For private functions, we normally make them start with `_`, so that would be:

```python
def _parse_single_response(text):
```

You still have `str` typing in your implementation.
sarahmish
left a comment
Thanks @scherkao31 for making the changes! Some previous comments were not addressed so I kept them as unresolved.
In addition, I want to note:
- `mistral_detector` should be modified to set `restrict_tokens` to True.
- the PR is missing the hyperparameter settings for MSL
- do you plan to include hyperparameters for Yahoo?
sigllm/data.py
| """Load the CSV with the given name from S3. | ||
|
|
||
| If the CSV has never been loaded before, it will be downloaded | ||
| from the [d3-ai-orion bucket](https://d3-ai-orion.s3.amazonaws.com) or |
update d3-ai-orion to sintel-sigllm
sigllm/data.py
```python
    directory, and then returned.

    Otherwise, if it has been downloaded and cached before, it will be directly
    loaded from the `orion/data` folder without contacting S3.
```
update orion/data to sigllm/data
sigllm/data.py
```python
    If a `test_size` value is given, the data will be split in two parts
    without altering its order, making the second one proportionally as
    big as the given value.
```
The `test_size` argument doesn't make sense in `load_normal`; I would remove it. Please make sure the code, docstrings, and function arguments are not using this variable.
sigllm/data.py
```python
        use_timestamps (bool):
            If True, start and end are interpreted as timestamps.
            If False, start and end are interpreted as row indices.
```
I would not make this a function argument, but rather I would check if start or end are of integer or timestamp type. See comment below.
sigllm/data.py
```python
    # Handle slicing if start or end is specified
    if start is not None or end is not None:
        if use_timestamps:
            # If start and end are timestamps
            mask = (data['timestamp'] >= start) & (data['timestamp'] <= end)
            data = data[mask]
        else:
            # If start and end are indices
            data = data.iloc[start:end]
```
A couple of things:
- we don't know the exact name of the timestamp column, so you should use the variable.
- the user typically is not keen on specifying so many parameters. We can check if `start` or `end` are integers, then we can use them to slice. Otherwise, assume that they are timestamps.
- check that the edge cases `start=None` while `end` has a value, and vice versa, work.
```python
if start is not None or end is not None:
    if isinstance(start, int) or isinstance(end, int):
        # If start and end are indices
        data = data.iloc[start:end]
    else:
        # If start and end are timestamps
        mask = (data[timestamp_column] >= start) & (data[timestamp_column] <= end)
        data = data[mask]
```
sigllm/primitives/transformation.py
```python
        or empty string if no anomalies are present
    """


def parse_single_response(text: str) -> str:
```
you still have str typing in your implementation
```python
def parse_anomaly_response(X):
    """Parse a list of lists of LLM responses to extract anomaly values and format them as strings.

    Args:
        X (List[List[str]]): List of lists of response texts from the LLM in the format
            "Answer: no anomalies" or "Answer: [val1, val2, ..., valN]"
    Returns:
        List[List[str]]: List of lists of parsed responses where each element is either
            "val1,val2,...,valN" if anomalies are found,
            or empty string if no anomalies are present
    """
```
The first line of the docstring should be shorter. I already have a description mentioned in my comment above.
```python
        Additional padding token to forecast to reduce short horizon predictions.
        Default to `0`.
    restrict_tokens (bool):
        Whether to restrict tokens or not. Default to `True`.
```
I still don't see a change to mistral_detector
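The requested change amounts to overriding the `restrict_tokens` hyperparameter for the `mistral_detector` pipeline. A minimal sketch of such an override is shown below; the primitive key is a placeholder assumption, only the `restrict_tokens` flag itself comes from this discussion.

```python
# Hypothetical hyperparameter override for the mistral_detector pipeline.
# The primitive path used as the key is an assumption, not the repo's
# actual primitive name; only `restrict_tokens` is taken from the review.
hyperparameters = {
    'sigllm.primitives.prompting.huggingface.HF#1': {
        'restrict_tokens': True,
    },
}
```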
| "user_message": "Below is a [SEQUENCE], please return the anomalies in that sequence in [RESPONSE]. Only return the numbers. [SEQUENCE]" | ||
| "system_message": "You are an expert in time series analysis. Your task is to detect anomalies in time series data.", | ||
| "user_message": "Below is a [SEQUENCE], please return the anomalies in that sequence in [RESPONSE]. Only return the numbers. [SEQUENCE]", | ||
| "user_message_2": "Below is a [SEQUENCE], analyze the following time series and identify any anomalies. If you find anomalies, provide their values in the format [first_anomaly, ..., last_anomaly]. If no anomalies are found, respond with 'no anomalies'. Be concise, do not write code, do not permorm any calculations, just give your answers as told.: [SEQUENCE]", |
typo permorm -> perform
…tection Support (#39)

* Core Changed to get normal behavior in pipeline
* Transformation changed
* anomalies.py changed
* Hugginface.py changed : no restrictions token, and also has normal as input if 1-shot
* Timeseries preprocessing.py
* jsons files added for primitives
* pipelines 0shot and 1shot added
* add boolean for restrict_tokens in HF
* good messages.json for prompt
* Added load_normal in sigllm.data
* Fixed load_normal in sigllm.data
* Fixed lint format
* Fixed lint format Ruff
* Fixed from review Sarah
* Fixed lint format after working on Sarah's reviews
* Dataset prompter parameters
* .jons removed from input names in 1_shot pipeline.json
* fix PR issues & add unittests
* add unittests for parse_anomaly_response
* remove unused functions
* add new functionality tests
* update ubuntu image
* change normal->single
* fix lint
* swap normal -> single

Co-authored-by: Salim Cherkaoui <salim31@dai-desk32.lids.mit.edu>
Co-authored-by: Sarah Alnegheimish <sarahalnegheimish@gmail.com>
* init benchmark
* fix details
* fix lint
* add benchmark tests
* paper benchmark results (#41)
* Enable Unrestricted LLM Output with Parsing + Add One-Shot Anomaly Detection Support (#39)
* support load_normal

Co-authored-by: scherkao31 <93837850+scherkao31@users.noreply.github.com>
Co-authored-by: Salim Cherkaoui <salim31@dai-desk32.lids.mit.edu>
This PR introduces several changes: