
Commit 2c55e6e

Merge pull request #35 from wwhenxuan/master
Fixed a bug in the ARIMA model caused by linear operations.
2 parents c5e3592 + 996ca13 commit 2c55e6e

6 files changed

Lines changed: 417 additions & 141 deletions

README.md

Lines changed: 4 additions & 3 deletions
@@ -18,17 +18,18 @@ This method allows for the unrestricted creation of high-quality time series dat
 
 ### 🔥 News
 
+**[Feb. 2026]** Since all stationary time series can be obtained by exciting a linear time-invariant system with white noise, we propose [a learnable series generation method](https://github.com/wwhenxuan/S2Generator/blob/main/s2generator/simulator/arima.py) based on the ARIMA model. This method ensures the generated series is highly similar to the inputs in autocorrelation and power spectrum density.
+
 **[Sep. 2025]** Our paper "Synthetic Series-Symbol Data Generation for Time Series Foundation Models" has been accepted by **NeurIPS 2025**, where **[*SymTime*](https://arxiv.org/abs/2502.15466)** pre-trained on the $S^2$ synthetic dataset achieved SOTA results in fine-tuning of forecasting, classification, imputation and anomaly detection tasks.
 
 ## 🚀 Installation <a id="Installation"></a>
 
-We have highly encapsulated the algorithm and uploaded the code to PyPI. Users can download the code through `pip`.
-
+We have highly encapsulated the algorithm and uploaded the code to PyPI:
 ~~~
 pip install s2generator
 ~~~
 
-We only used [`NumPy`](https://numpy.org/), [`Scipy`](https://scipy.org/) and [`matplotlib`](https://matplotlib.org/) when developing the project.
+We used [`NumPy`](https://numpy.org/), [`Pandas`](https://pandas.pydata.org/), and [`Scipy`](https://scipy.org/) to build the data science environment, [`Matplotlib`](https://matplotlib.org/) for data visualization, and [`Statsmodels`](https://www.statsmodels.org/stable/index.html) for time series analysis and statistical processing.
 
 ## ✨ Usage
 
examples/19-arma_simulator.ipynb

Lines changed: 25 additions & 24 deletions
Large diffs are not rendered by default.
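
The README entry in this commit rests on the idea that a stationary series can be produced by driving a linear time-invariant filter with white noise. The following is a minimal sketch of that idea, not the repository's implementation: an AR(1) filter with an assumed coefficient `phi = 0.7` is fed Gaussian white noise. The names `simulate_ar1` and `lag1_autocorr` are illustrative only.

```python
import numpy as np


def simulate_ar1(phi, n, seed=0):
    """Filter Gaussian white noise through x[t] = phi * x[t-1] + e[t]."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n)  # the white-noise excitation
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + noise[t]
    return x


def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    centered = x - x.mean()
    return float(np.dot(centered[:-1], centered[1:]) / np.dot(centered, centered))


series = simulate_ar1(phi=0.7, n=20000)
rho1 = lag1_autocorr(series)
print(f"lag-1 autocorrelation: {rho1:.3f}")
```

For an AR(1) process the theoretical lag-1 autocorrelation equals `phi`, so on a long series the printed estimate should land close to 0.7.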

s2generator/simulator/arima.py

Lines changed: 78 additions & 5 deletions
@@ -15,7 +15,8 @@
 from statsmodels.tsa.api import acf, pacf
 from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
 
-from s2generator.utils._tools import eacf_rlike, plot_shapiro_wilk
+from s2generator.utils._tools import eacf_rlike
+from s2generator.utils.visualization import plot_shapiro_wilk
 
 import warnings
 
@@ -32,6 +33,17 @@ class ARIMASimulator(object):
 
     Based on these two points, we can use the ARIMA model to generate non-stationary time series data.
     Compared to previous data generation methods, we can further fit the statistical characteristics of real time series data through the ARIMA model, thereby generating more realistic time series data.
+
+    Since this generation method involves the fitting and training of the ARIMA model, linear operations may trigger exceptions such as `LinAlgError`, resulting in generation failure.
+    This issue is generally related to the input time series data and the order of the ARIMA model. We have investigated the common input data problems as follows:
+
+    1. The data is completely constant (variance = 0);
+    2. The length of the input time series is too short;
+    3. There are obvious extreme values or outliers in the input sequence after standardization;
+    4. An excessively high order setting (p,q) leads to matrix dimension mismatch or singularity.
+
+    In addition, the `ARIMA` implementation in `statsmodels` has limited ability to handle certain ill-conditioned matrices (e.g., nearly singular matrices).
+    Even if the data appears normal, LU decomposition may still fail due to floating-point precision issues.
     """
 
     def __init__(
@@ -41,10 +53,17 @@ def __init__(
         max_q: int = 5,
         signif: float = 0.05,
         not_white_alarm: bool = True,
+        revin: bool = True,
         random_state: Optional[int] = 42,
     ) -> None:
         """
-        :param order: A tuple specifying the (p, d, q) order of the ARIMA model.
+        :param max_p: Maximum AR order (p) to consider when fitting the ARIMA model.
+        :param max_d: Maximum differencing order (d) to consider when fitting the ARIMA model.
+        :param max_q: Maximum MA order (q) to consider when fitting the ARIMA model.
+        :param signif: Significance level for the ADF test to determine stationarity.
+        :param not_white_alarm: Whether to issue a warning when the residuals of the fitted model are not white noise.
+        :param revin: Should reversible normalization be performed on time series data?
+        :param random_state: Random state for reproducibility when generating new time series data.
         """
         self.max_p = max_p
         self.max_d = max_d
@@ -56,6 +75,12 @@ def __init__(
         # Whether to issue a warning when residuals are not white noise
         self.not_white_alarm = not_white_alarm
 
+        # Should reversible normalization be performed on time series data?
+        # If True, the generated time series data will be normalized to have zero mean and unit variance,
+        # and the original mean and variance will be recorded for potential inverse transformation.
+        self.revin = revin
+        self.mean, self.std = None, None
+
         # Record the parameters of the model fit
         self.d_order = None
         self.p_order, self.q_order = None, None
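
The `revin` option introduced here records the mean and standard deviation so that a generated series can be mapped back to the original scale. A minimal NumPy sketch of that round trip follows. `normalize` and `denormalize` are hypothetical helper names, not methods of `ARIMASimulator`, and the input data is synthetic.

```python
import numpy as np


def normalize(series):
    """Standardize a series and return the statistics needed to undo it."""
    mean, std = series.mean(), series.std()
    return (series - mean) / std, mean, std


def denormalize(series, mean, std):
    """Map a standardized series back to the original scale."""
    return series * std + mean


rng = np.random.default_rng(0)
raw = 50.0 + 3.0 * rng.standard_normal(500)  # synthetic input, for illustration only

scaled, mean, std = normalize(raw)
restored = denormalize(scaled, mean, std)

print(round(float(scaled.mean()), 6), round(float(scaled.std()), 6))
print(bool(np.allclose(restored, raw)))
```

One detail worth keeping in mind: `numpy.ndarray.std` defaults to `ddof=0`, while `pandas.Series.std` defaults to `ddof=1`. The round trip is exact either way, as long as the same statistics are used in both directions.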
@@ -82,6 +107,13 @@ def fit(
         # Check the input time series data
         time_series = self.check_inputs(time_series=time_series)
 
+        # Optionally reverse the time series data to generate data in reverse order
+        if self.revin:
+            self.mean, self.std = time_series.mean(), time_series.std()
+            time_series = (
+                time_series - self.mean
+            ) / self.std  # Normalize the time series data
+
         # First, difference the time series to make it stationary
         stationary_series, self.d_order = self.diff_stationary(time_series=time_series)
 
@@ -103,8 +135,9 @@ def fit(
 
         # Perform residual diagnosis
         mean_p_value, is_white = self.residual_diagnosis(signif=self.signif)
+
         if not is_white and self.not_white_alarm:
-            print(
+            raise ValueError(
                 f"Warning: Model residuals may not be white noise (mean p-value={mean_p_value:.4f} < significance level={self.signif}), please re-evaluate the model order or parameters."
             )
 
@@ -132,7 +165,35 @@ def transform(
             ),
         )
 
-        return generated_series.values.T
+        return (
+            generated_series.values.T * self.std + self.mean
+            if self.revin
+            else generated_series.values.T
+        )
+
+    @property
+    def param_names(self) -> List[str]:
+        """Return the names of the parameters in the fitted ARIMA model."""
+        if not hasattr(self, "model"):
+            raise ValueError("The model must be fitted before calling param_names.")
+
+        return self.model.param_names
+
+    @property
+    def params(self) -> Union[np.ndarray, pd.Series]:
+        """Return the parameter values of the fitted ARIMA model."""
+        if not hasattr(self, "model"):
+            raise ValueError("The model must be fitted before calling params.")
+
+        return self.model.params
+
+    @property
+    def param_items(self) -> List[Tuple[str, float]]:
+        """Return a list of (parameter name, parameter value) tuples for the fitted ARIMA model."""
+        if not hasattr(self, "model"):
+            raise ValueError("The model must be fitted before calling param_items.")
+
+        return list(zip(self.param_names, self.params))
 
     def check_inputs(self, time_series: Union[pd.Series, np.ndarray]) -> pd.Series:
         """
@@ -163,6 +224,19 @@ def check_inputs(self, time_series: Union[pd.Series, np.ndarray]) -> pd.Series:
         if len(time_series) < 10:
             raise ValueError("Input time series must have at least 10 data points.")
 
+        # Check if the time series contains NaN values
+        if pd.isnull(time_series).any():
+            raise ValueError("Input time series must not contain NaN values.")
+
+        # std = np.std(time_series)
+        std = np.std(time_series)
+        if (
+            std < 1e-8
+        ):  # A very small threshold to check if the variance is effectively zero
+            raise ValueError(
+                "The time series variance is 0 (all values are the same), making it impossible to fit the ARIMA model."
+            )
+
         return pd.Series(time_series)
 
     def select_arma_order(
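
To see how the checks added to `check_inputs` behave on concrete inputs, here is a standalone sketch that mirrors the three conditions visible in the hunk above (minimum length, no NaN values, non-negligible standard deviation). `validate_series` and `rejection_reason` are hypothetical names, and the sketch uses NumPy only instead of the pandas calls in the actual method.

```python
import numpy as np


def validate_series(values, min_length=10, min_std=1e-8):
    """Raise ValueError for inputs an ARIMA fit is unlikely to handle."""
    arr = np.asarray(values, dtype=float)
    if len(arr) < min_length:
        raise ValueError("too short")
    if np.isnan(arr).any():
        raise ValueError("contains NaN")
    if np.std(arr) < min_std:
        raise ValueError("effectively constant")
    return arr


def rejection_reason(values):
    """Return the rejection message, or None when the input passes."""
    try:
        validate_series(values)
    except ValueError as exc:
        return str(exc)
    return None


candidates = {
    "short": [1.0, 2.0, 3.0],
    "with_nan": [float("nan")] + [float(i) for i in range(12)],
    "constant": [5.0] * 20,
    "valid": [float(i % 7) for i in range(30)],
}
reasons = {name: rejection_reason(vals) for name, vals in candidates.items()}
print(reasons)
```

Rejecting such inputs before fitting turns an opaque numerical failure inside the optimizer into an error message that names the actual problem.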
@@ -192,7 +266,6 @@ def select_arma_order(
                     continue
                 try:
                     # Fit ARMA model
-                    # FIXME: Consider using the EACF method to select the optimal (p,q) combination?
                     model = ARIMA(stationary_series, order=(p, 0, q))
                     results = model.fit()
                     if results.aic < best_aic:
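
The docstring added in this commit notes that fitting can raise `LinAlgError` for some inputs and orders. One possible way to keep an AIC-based order search running past such failures is sketched below, under stated assumptions. `select_order` and `toy_fit` are hypothetical: `toy_fit` stands in for a real `ARIMA(...).fit()` call and returns a made-up AIC, and skipping the (0, 0) order is an assumption rather than something the diff shows.

```python
import numpy as np


def select_order(series, fit_fn, max_p=2, max_q=2):
    """Grid-search (p, q) by AIC, skipping orders whose fit raises an error."""
    best_order, best_aic = None, np.inf
    failures = []
    for p in range(max_p + 1):
        for q in range(max_q + 1):
            if p == 0 and q == 0:
                continue  # assumed: a model with no AR or MA terms is not a candidate
            try:
                aic = fit_fn(series, p, q)
            except (np.linalg.LinAlgError, ValueError) as exc:
                # Record the failure and move on to the next candidate order
                failures.append(((p, q), type(exc).__name__))
                continue
            if aic < best_aic:
                best_order, best_aic = (p, q), aic
    return best_order, best_aic, failures


def toy_fit(series, p, q):
    """Stand-in for a real fit: fails on high orders, otherwise returns a fake AIC."""
    if p + q >= 4:
        raise np.linalg.LinAlgError("singular matrix")
    return float(len(series)) + (p - 1) ** 2 + (q - 1) ** 2


order, aic, failed = select_order(np.arange(20.0), toy_fit)
print(order, aic, failed)
```

Collecting the failed orders rather than silently discarding them makes it easier to tell an ill-conditioned input from an order setting that is simply too high.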

s2generator/utils/__init__.py

Lines changed: 12 additions & 10 deletions
@@ -21,7 +21,6 @@
     "generate_arma_samples",
     "generate_nonstationary_sine",
     "eacf_rlike",
-    "plot_shapiro_wilk",
     "fft",
     "fftshift",
     "ifft",
@@ -39,14 +38,12 @@
     "exponential_smoothing",
     "smooth_show_info",
     "MovingDecomp",
+    "plot_series",
+    "plot_symbol",
+    "plot_shapiro_wilk",
+    "plot_simulator_statistics",
 ]
 
-# # Visualization the time series data in S2
-# from .visualization import plot_series
-#
-# # Visualization the Symbol data in S2
-# from .visualization import plot_symbol
-
 # Transform the symbol from string to latex
 from .print_symbol import symbol_to_markdown
 
@@ -71,9 +68,6 @@
 # The EACF function to determine the order of ARMA model
 from ._tools import eacf_rlike
 
-# The Shapiro-Wilk test for normality of the residuals
-from ._tools import plot_shapiro_wilk
-
 # Print the Generation Status
 from ._print_status import PrintStatus
 
@@ -101,3 +95,11 @@
 
 # The Seasonal-Trend decomposition using LOESS (STL)
 from ._decomposition import STL, STLResult
+
+# The Shapiro-Wilk test for normality of the residuals
+from .visualization import (
+    plot_series,
+    plot_symbol,
+    plot_shapiro_wilk,
+    plot_simulator_statistics,
+)

s2generator/utils/_tools.py

Lines changed: 0 additions & 98 deletions
@@ -30,7 +30,6 @@
     "generate_arma_samples",
     "generate_nonstationary_sine",
     "eacf_rlike",
-    "plot_shapiro_wilk",
 ]
 
 import os
@@ -41,7 +40,6 @@
 from numpy import fft as np_fft
 
 import pandas as pd
-from matplotlib import pyplot as plt
 
 from typing import Optional, Dict, Union, Tuple
 
@@ -529,99 +527,3 @@ def eacf_rlike(
     )
 
     return eacf_matrix, threshold, eacf_df
-
-
-def plot_shapiro_wilk(
-    residuals: np.ndarray,
-    bins: int = 13,
-    dpi: int = 500,
-    figsize: Tuple[int, int] = (12, 5),
-) -> Tuple[plt.Figure, float, float]:
-    """
-    Plot the Shapiro-Wilk test for normality of the residuals.
-    This method generates a Q-Q plot to visually assess whether the residuals
-    of the fitted ARIMA model follow a normal distribution.
-
-    :param residuals: Residuals from the fitted ARIMA model.
-    :param bins: Number of bins for the histogram of residuals.
-    :param dpi: Dots per inch (resolution) for the generated plot.
-    :param figsize: Figure size for the generated plot.
-    :return: A tuple containing the matplotlib Figure object, the Shapiro-Wilk statistic, and the p-value.
-    """
-    # Ensure the model has been fitted and the residuals have been calculated.
-    if residuals is None:
-        raise ValueError("Residuals must be provided before calling plot_shapiro_wilk.")
-
-    # Convert residuals to a numpy array for consistency
-    residuals = np.asarray(residuals)
-
-    # Import necessary libraries
-    from statsmodels.graphics.gofplots import qqplot
-    from scipy.stats import shapiro
-
-    # import seaborn as sns
-    # sns.set_theme(style="ticks")
-
-    # Perform Shapiro-Wilk normality test
-    stat, p_value = shapiro(residuals)
-
-    # Create visualization figure
-    fig, ax = plt.subplots(1, 2, figsize=figsize, dpi=dpi)
-    fig.subplots_adjust(wspace=0.16)
-
-    # Plot histogram of the fitted residuals
-    ax[0].hist(residuals, bins=bins, alpha=1, color="w", edgecolor="k", lw=1.2)
-
-    # Plot Q-Q plot for normality test
-    qqplot(
-        residuals,
-        line="s",
-        ax=ax[1],
-        markerfacecolor="white",
-        markeredgecolor="k",
-        markersize=7.5,
-    )
-    for line in ax[1].get_lines():
-        if line.get_linestyle() == "-":
-            line.set_color("#DC143C")
-            line.set_linewidth(2.1)
-
-    # Set titles and labels
-    ax[0].grid(which="major", color="gray", linestyle="--", lw=0.5, alpha=0.8)
-    ax[1].grid(which="major", color="gray", linestyle="--", lw=0.5, alpha=0.8)
-    ax[0].set_xlabel("Standard Residual", fontsize=12.5)
-    ax[0].set_ylabel("Frequency", fontsize=12.5)
-    ax[1].set_xlabel("Theoretical Quantiles", fontsize=12.5)
-    ax[1].set_ylabel("Sample Quantiles", fontsize=12.5)
-
-    # Annotate the plots with statistics
-    mean = np.round(np.mean(residuals), 4)
-    std = np.round(np.std(residuals), 4)
-    stat = np.round(stat, 4)
-    p_value = np.round(p_value, 4)
-
-    # Set the text annotations for the mean and std on the histogram
-    ax[0].text(
-        0.05,
-        0.95,
-        f"$\mu$ = {mean}\n$\sigma$ = {std}",
-        transform=ax[0].transAxes,
-        verticalalignment="top",
-        horizontalalignment="left",
-        fontsize=13.5,
-        color="k",
-    )
-
-    # Set the text annotations for the Shapiro-Wilk test on the Q-Q plot
-    ax[1].text(
-        0.05,
-        0.95,
-        f"$W$ = {stat}\n$p$ = {p_value}",
-        transform=ax[1].transAxes,
-        verticalalignment="top",
-        horizontalalignment="left",
-        fontsize=13.5,
-        color="k",
-    )
-
-    return fig, stat, p_value
