StreamDFP

This repository is based on the open-source StreamDFP project and extends it with an LLM-enhanced workflow for root-cause extraction, rule fusion, and model-level policy evaluation in disk failure prediction.

The current codebase keeps both the upstream Python + Java prediction pipeline and the extension work added on top of it. It is organized for research reproduction rather than as a minimal library package. Source code, experiment scripts, and result notes are kept together; large datasets, logs, generated caches, and local demo bundles are ignored by default so the repository can be uploaded to GitHub without dragging in runtime artifacts.

If you publish this repository separately on GitHub, prefer a name such as StreamDFP-LLM-Extension instead of the bare StreamDFP, so the upstream relationship stays clear.

Upstream Attribution

This project builds on the open-source StreamDFP framework:

Upstream repository: https://github.com/shujiehan/StreamDFP

The work in this repository focuses on extending StreamDFP with an LLM-enhanced pipeline for semantic root-cause extraction, rule blending, fallback control, and model-level policy evaluation.

Highlights

Upstream StreamDFP pipeline for HDD/SSD failure prediction with preprocessing in Python and training/simulation in Java.
StreamDFP-based LLM-enhanced framework (framework_v1) for Phase1 window summarization, Phase2 root-cause extraction, and Phase3 policy grid evaluation.
Extension modules for model-level policy registry, rule blending, fallback control, and multi-disk experiment summaries.
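
To make the Phase1 idea concrete, here is a minimal sketch of turning one sliding window of SMART readings into a short text summary. It is not the code in llm/window_to_text.py: the function name summarize_window, the input shape (attribute name mapped to a non-empty list of readings), and the output wording are all assumptions chosen for illustration.

```python
from typing import Dict, List


def summarize_window(disk_id: str, window: Dict[str, List[float]]) -> str:
    """Render one sliding window as a sentence, one clause per attribute.

    Hypothetical helper; the real Phase1 summary format may differ.
    Assumes every attribute has at least one reading.
    """
    clauses = []
    for name, values in window.items():
        first, last = values[0], values[-1]
        if last == first:
            clauses.append(f"{name} stayed at {last}")
        else:
            trend = "rose" if last > first else "fell"
            clauses.append(f"{name} {trend} from {first} to {last}")
    length = len(next(iter(window.values())))
    return f"Disk {disk_id}, {length}-sample window: " + "; ".join(clauses) + "."


# Made-up readings; the attribute names only imitate SMART column naming.
example = {"smart_5_raw": [0, 0, 8], "smart_187_raw": [2, 2, 2]}
summary = summarize_window("disk-001", example)
print(summary)
```

A text form like this is what a later extraction step could hand to an LLM; comparing only the first and last reading is a deliberate simplification.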

Repository Layout

StreamDFP/
├── pyloader/          # Python preprocessing, feature extraction, labeling, sample generation
├── simulate/          # Java simulation and prediction entry points
├── moa/               # MOA dependency source tree used by the Java pipeline
├── llm/               # LLM prompts, extraction logic, event mappings, contracts, tests
├── scripts/           # Phase2/Phase3 orchestration, watchers, probes, reproducibility helpers
├── docs/              # Experiment notes, summaries, comparison tables, metric reports
├── parse.py           # Parse simulation outputs into metric tables
└── run_*.sh           # Legacy example launchers for baseline experiments

Detailed directory notes are in docs/REPOSITORY_LAYOUT.md. Documentation entry points are indexed in docs/README.md.

Main Workflows

1. Classic StreamDFP Pipeline

Generate train/test samples with pyloader/run.py or the pyloader/run_*_loader.sh helpers.
Train and simulate with the Java entry point simulate.Simulate in simulate/.
Parse metrics with parse.py.
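
The last step reduces simulation output to metric tables. The exact columns produced by parse.py are not described here, so the following is only a sketch of how metrics commonly reported for disk failure prediction (precision, recall, false alarm rate) follow from confusion counts. The function name and the example counts are hypothetical.

```python
def failure_prediction_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute precision, recall, and false alarm rate from confusion counts.

    A zero denominator yields 0.0 instead of raising ZeroDivisionError.
    """
    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),         # predicted failures that were real
        "recall": ratio(tp, tp + fn),            # real failures that were predicted
        "false_alarm_rate": ratio(fp, fp + tn),  # healthy disks flagged as failing
    }


# Illustrative counts only, not results from this repository.
metrics = failure_prediction_metrics(tp=8, fp=2, tn=988, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```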

Relevant files:

2. LLM-Enhanced Framework (`framework_v1`)

Convert sliding windows into textual summaries with llm/window_to_text.py.
Run offline LLM extraction with llm/llm_offline_extract.py.
Build cache variants and evaluate them through the Phase3 grid scripts.
Merge per-model results into model-level policy decisions (llm_enabled vs fallback).
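
The final merge step can be pictured as a per-model comparison between the baseline and the LLM-enhanced run. The sketch below is an assumption about how such a decision could be made, not the repository's actual policy logic: the ModelResult schema, the thresholds, and the model names are hypothetical, and only the two labels llm_enabled and fallback come from the text above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ModelResult:
    """Metrics for one disk model under one configuration (hypothetical schema)."""
    recall: float
    false_alarm_rate: float


def decide_policy(baseline: ModelResult, with_llm: ModelResult,
                  min_recall_gain: float = 0.0,
                  max_far_increase: float = 0.0) -> str:
    """Pick "llm_enabled" only when recall improves and false alarms do not grow too much."""
    recall_gain = with_llm.recall - baseline.recall
    far_increase = with_llm.false_alarm_rate - baseline.false_alarm_rate
    if recall_gain > min_recall_gain and far_increase <= max_far_increase:
        return "llm_enabled"
    return "fallback"


def build_registry(results: Dict[str, Dict[str, ModelResult]]) -> Dict[str, str]:
    """Map each disk model to its policy label."""
    return {
        model: decide_policy(runs["baseline"], runs["llm"])
        for model, runs in results.items()
    }


# Made-up numbers for two placeholder disk models.
registry = build_registry({
    "model_a": {"baseline": ModelResult(0.60, 0.02), "llm": ModelResult(0.70, 0.02)},
    "model_b": {"baseline": ModelResult(0.80, 0.01), "llm": ModelResult(0.78, 0.01)},
})
print(registry)
```

Defaulting to fallback whenever the LLM variant does not clearly help keeps the baseline behavior as the safe choice; the thresholds would need tuning against real experiment summaries.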

Relevant files:

Environment

Minimum runtime dependencies:

Python 3
numpy, pandas
Java JDK 8
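
A quick way to confirm these dependencies are visible before running anything is a small check script. This is a sketch rather than a file shipped with the repository; the function name check_environment is hypothetical, and it only tests that a java executable is on PATH, not that it is JDK 8.

```python
import importlib.util
import shutil
import sys


def check_environment(modules=("numpy", "pandas"), executables=("java",)) -> dict:
    """Report which of the listed dependencies are visible from this interpreter."""
    report = {"python3": sys.version_info.major == 3}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH.
        report[exe] = shutil.which(exe) is not None
    return report


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```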

Optional LLM runtime:

vllm for GPU-backed Phase2 extraction
Qwen-family model weights downloaded locally from HuggingFace or ModelScope

Public repo environment files:

environment-public.yml
requirements-public.txt
requirements-llm-public.txt

The public reproducibility walkthrough is in docs/PUBLIC_REPRODUCIBILITY.md.

Data and Models

Raw datasets and downloaded model weights do not need to be committed to this repository.

Public HDD data typically comes from Backblaze SMART records.
Public SSD experiments can use Alibaba SSD SMART datasets.
Local datasets under data/ are ignored by .gitignore.
Qwen checkpoints are best kept in local model directories outside the repo.
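
Before feeding a downloaded file into the pipeline, it can help to inspect it. The sketch below counts failure rows per disk model in a Backblaze-style daily CSV, which carries columns such as date, serial_number, model, capacity_bytes, and failure. It is independent of pyloader, the function name is hypothetical, and the rows are made up.

```python
import csv
import io
from collections import Counter


def failures_per_model(csv_file) -> Counter:
    """Count rows with failure == 1 per disk model in a Backblaze-style daily CSV."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        if row["failure"] == "1":
            counts[row["model"]] += 1
    return counts


# Tiny in-memory sample standing in for a real daily file under data/.
sample = io.StringIO(
    "date,serial_number,model,capacity_bytes,failure,smart_5_raw\n"
    "2015-01-01,SN1,MODEL_A,4000787030016,0,0\n"
    "2015-01-01,SN2,MODEL_A,4000787030016,1,16\n"
    "2015-01-01,SN3,MODEL_B,4000787030016,0,0\n"
)
failure_counts = failures_per_model(sample)
print(dict(failure_counts))
```

For a file on disk, pass an open handle instead, for example open(path, newline="") with a path of your choosing.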

GitHub Upload Notes

The repository is set up for publishing on GitHub: source code and experiment docs stay tracked, while the following classes of files are ignored:

raw datasets and local backups
logs, generated caches, and temporary JSONL files
training/test output folders under pyloader/
local demo bundles and compressed share packages
Java build outputs and notebook checkpoints

Before pushing, run git status and stage only the code and docs you intend to publish.

Contact

Original project contact from the upstream README:

Shujie Han (shujiehan@pku.edu.cn)

