diff --git a/documentation/ADRs/002_evaluation_strategy.md b/documentation/ADRs/002_evaluation_strategy.md
index 0990d84..e693ded 100644
--- a/documentation/ADRs/002_evaluation_strategy.md
+++ b/documentation/ADRs/002_evaluation_strategy.md
@@ -4,81 +4,99 @@
 |---------------------|-------------------|
 | Subject | Evaluation Strategy |
 | ADR Number | 002 |
-| Status | Accepted|
-| Author | Mihai, Xiaolong|
-| Date | 31.10.2024 |
+| Status | Proposed |
+| Author | Xiaolong, Mihai |
+| Date | 16.07.2025 |

 ## Context
-The primary output of VIEWS is a panel of forecasts, which consist of temporal sequences of predictions for each observation at the corresponding level of analysis (LOA). These forecasts span a defined forecasting window and can represent values such as predicted fatalities, probabilities, quantiles, or sample vectors.
+To ensure reliable and realistic model performance assessment, our forecasting framework supports both **offline** and **online** evaluation strategies. These strategies serve complementary purposes: offline evaluation simulates the forecasting process retrospectively, while online evaluation assesses actual deployed forecasts against observed data.

-The machine learning models generating these predictions are trained on historical time-series data, supplemented by covariates that may be either known in advance (future features) or only available for training data (shifted features). Given the variety of models used, evaluation routines must remain model- and paradigm-agnostic to ensure consistency across different methodologies.
+Both strategies are designed to work with time-series predictions and support multi-step forecast horizons, ensuring robustness across temporal scales and use cases.

 ## Decision
-The evaluation strategy must be structured to assess predictive performance comprehensively.
+We adopt a dual evaluation approach consisting of:
+1. **Offline Evaluation:** Evaluating a model's performance on historical data, before deployment.
+
+2. **Online Evaluation:** The ongoing process of evaluating a deployed model's performance as new, real-world data becomes available.
+
 ### Points of Definition:
-1. *Time*: All time points and horizons mentioned below are in **outcome space**, also known as $Y$-space : this means that they refer to the time point of the (forecasted or observed) outcome. This is especially important for such models where the feature-space and outcome-space are shifted and refer to different time points.
+- **Rolling-Origin Holdout:** A robust backtesting strategy that simulates a real-world forecasting scenario by generating forecasts from multiple, rolling time origins.

-2. *Temporal resolution*: The temporal resolution of VIEWS is the calendar-month. These are referred in VIEWS by an ordinal (Julian) month identifier (`month_id`) which is a serial numeric identifier with a reference epoch (month 0) of December 1979. For control purposes, January 2024 is month 529. VIEWS does not define behavior and does not have the ability to store data prior to the reference epoch (with negative `month_id`). Conflict history data, which marks the earliest possible meaningful start of the training time-series, is available from `month_id==109`.
+- **Forecast Steps:** The time increment between predictions within a **sequence** of forecasts (further referred to as steps).

-3. *Forecasting Steps* (further referred to as steps) is defined as the 1-indexed number of months from the start of a forecast time-series.
+- **Sequence:** An ordered set of data points indexed by time.

-### General Evaluation Strategy
+### Diagram
 ![path](../img/approach.png)

-The general evaluation strategy involves *training* one model on a time-series that goes up to the training horizon $H_0$. This sequence is then used to predict a number of sequences (time-series). The first such sequence goes from $H_{0+1}$ to $H_{0+36}$, thus containing 36 forecasted values -- i.e. 36 months. The next one goes from $H_{0+2}$ to $H_{0+37}$. This is repeated until we reach a constant stop-point $k$ such that the last sequence forecasted is $H_{0+k+1}$ to $H_{0+k+36}$.
+### Offline Evaluation
+We adopt a **rolling-origin holdout evaluation strategy** for all offline (backtesting) evaluations.

-Normally, it is up to the modeller whether the model performs *expanding window* or *rolling window* evaluation, since *how* prediction is carried out all evaluations are of the *expanding window forecasting* type, i.e. the training window.
+The offline evaluation strategy proceeds as follows:
+1. **A single model** is trained on historical data up to the training cutoff $H_0$.
+2. Using this trained model object, a forecast is generated for the next **36 months**:
+   - Sequence 1: $H_{0+1}$ → $H_{0+36}$
+3. The origin is then rolled forward by one month, and another forecast is generated:
+   - Sequence 2: $H_{0+2}$ → $H_{0+37}$
+4. This process continues until a fixed number of sequences **k** is reached.
+5. In our standardized offline evaluation, **12 forecast sequences** are used (i.e., k = 12).

-#### Live evaluation
+It is important to note that **offline evaluation is not a true forecast**. Instead, it is a simulation using historical data from the **Validation Partition** to approximate forecasting performance under realistic, rolling deployment conditions. (See [ADR TBD] for the data partitioning strategy.)

-For **live** evaluation, we suggest doing this in the same way as has been done for VIEWS2020/FCDO (_confirm with HH and/or Mike_), i.e. predict to k=12, resulting in *12* time series over a prediction window of *48* months. We call this the evaluation partition end $H_{e,live}$. This gives a prediction horizon of 48 months, thus $H_{47}$ in our notation.
-Note that this is **not** the final version of online evaluation.
+### Online Evaluation
+Online evaluation reflects **true forecasting** and is based on the **Forecasting Partition**.

-#### Offline evaluation
+Suppose the latest available data point is $H_{36}$. Over time, the system would have generated the following forecast sequences:
+- Sequence 1: forecast for $H_{1}$ → $H_{36}$, generated at time **t = 0**
+- Sequence 2: forecast for $H_{2}$ → $H_{37}$, generated at **t = 1**
+- ...
+- Sequence 36: forecast for $H_{36}$ → $H_{71}$, generated at **t = 35**

-For **offline** model evaluation, we suggest doing this in a way that simulates production over a longer time-span. For this, a new model is trained at every **twelve** months interval, thus resetting $H_0$ at months $H_{0+0}, H_{0+12}, H_{0+24}, \dots H_{0+12r}$ where $12r=H_e$.
+At time $H_{36}$, we evaluate all forecasts made for $H_{36}$, i.e., the predictions from each of these 36 sequences are compared to the true value observed at $H_{36}$.

-The default way is to set $H_{e_eval}$ to 48 months, meaning we only train the model once at $H_0$. This will result in **12** time series. We call it **standard** evaluation.
+This provides a comprehensive view of how well the deployed model performs across multiple forecast origins and steps.
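+
+To make the month bookkeeping concrete, the sketch below enumerates the rolling-origin sequences described above. It is illustrative only: `rolling_origin_sequences` and the constants are hypothetical names, not part of the `views-evaluation` API, and the cutoff month in the example is arbitrary.
+
+```python
+FORECAST_HORIZON = 36   # months per forecast sequence
+NUM_SEQUENCES = 12      # k in the standardized offline evaluation
+
+def rolling_origin_sequences(h0: int,
+                             horizon: int = FORECAST_HORIZON,
+                             k: int = NUM_SEQUENCES) -> list[list[int]]:
+    """Return the k forecast sequences (as lists of month_ids) produced by a single
+    model trained up to the cutoff month h0, rolling the origin forward by one month."""
+    return [list(range(h0 + 1 + i, h0 + 1 + i + horizon)) for i in range(k)]
+
+# With a training cutoff at month_id 492, sequence 1 covers months 493-528,
+# sequence 2 covers 494-529, ..., sequence 12 covers 504-539.
+seqs = rolling_origin_sequences(h0=492)
+assert seqs[0] == list(range(493, 529)) and seqs[-1] == list(range(504, 540))
+
+# Online evaluation of an observed month m: every sequence containing m contributes
+# one prediction for it, each made at a different (1-indexed) forecast step.
+m = 528
+steps_for_m = [seq.index(m) + 1 for seq in seqs if m in seq]   # [36, 35, ..., 25]
+```
+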
-We also propose the following practical approaches:
+## Consequences

-1. A **long** evaluation where we set $H_{e_eval}$ to 72 months. This will result in *36* predicted time-series.
-
-2. A **complete** evaluation system, the longest one, where we set $H_0$ at 36 months of data (157 for models depending on UCDP GED), and iterate until the end of data (currently, the final $H_0$ will be 529).
+**Positive Effects:**
+- Reflects realistic deployment and monitoring conditions.

-For comparability and abstraction of seasonality (which is inherent in both the DGP as well as the conflict data we rely on, due to their definition), $H_0$ should always be December or June (this also adds convenience).
+- Allows for evaluation across multiple forecast origins and time horizons.

-The three approaches have trade-offs besides increasing computational complexity. Since conflict is not a stationary process, evaluation carried for long time-periods will prefer models that predict whatever stationary components exist in the DGP (and thus in the time-series). For example these may include salient factors such GDP, HDI, infant mortality etc.. Evaluation on such very long time-spans may substantially penalize models that predict more current event, due shorter term causes that were not so salient in the past. Examples of these may be the change in the taboo on inter-state war after 2014 and 2022 with Russia invading Ukraine.
+- Improves robustness by capturing temporal variation in model performance.

+**Negative Effects:**
+- Requires careful alignment of sequences and forecast windows.

-## Consequences
+- May introduce computational overhead due to repeated evaluation across multiple origins.

-**Positive Effects:**
-- Standardized evaluation across models, ensuring comparability.
+- Models must be capable of generalizing across slightly shifted input windows.

-- Clear separation of live and offline evaluation, facilitating both operational monitoring and research validation.
-**Negative Effects:**
-- Increased computational demands for long and complete evaluations.
+## Rationale
+The dual evaluation setup strikes a balance between experimentation and real-world monitoring:

-- Potential complexity in managing multiple evaluation strategies.
+- **Offline evaluation** provides a controlled and reproducible environment for backtesting.
+- **Online evaluation** reflects actual model behavior in production.
+
+For further technical details:
+- See [ADR 004 – Evaluation Input Schema](https://github.com/views-platform/views-evaluation/blob/main/documentation/ADRs/004_evaluation_input_schema.md)
+- See [ADR 003 – Metric Calculation](https://github.com/views-platform/views-evaluation/blob/main/documentation/ADRs/003_metric_calculation.md)

-- Additional infrastructure requirements.

-## Rationale
-By structuring evaluation routines to be agnostic of the modeling approach, the framework ensures consistency in assessing predictive performance. Using multiple evaluation methodologies balances computational feasibility with robustness in performance assessment.

 ### Considerations
-- Computational cost vs. granularity of evaluation results.
+- Sequence length (currently 36 months) may need to be adjusted for different use cases (e.g., quarterly or annual models).
+
+- The number of sequences (k) can be tuned depending on evaluation budget or forecast range.

-- Trade-offs between short-term and long-term predictive performance.
+- Consider future support for probabilistic or uncertainty-aware forecasts in the same rolling evaluation framework.

-- Ensuring reproducibility and scalability of evaluation routines.

 ## Feedback and Suggestions
diff --git a/documentation/ADRs/004_evaluation_input_schema.md b/documentation/ADRs/004_evaluation_input_schema.md
new file mode 100644
index 0000000..52499d7
--- /dev/null
+++ b/documentation/ADRs/004_evaluation_input_schema.md
@@ -0,0 +1,55 @@
+# Evaluation Input Schema
+
+| ADR Info | Details |
+|---------------------|-------------------|
+| Subject | Evaluation Input Schema |
+| ADR Number | 004 |
+| Status | Proposed |
+| Author | Xiaolong |
+| Date | 16.06.2025 |
+
+## Context
+In our modeling pipeline, a consistent and flexible evaluation framework is essential for comparing model performance.
+
+
+## Decision
+
+We adopt the `views-evaluation` package to standardize the evaluation of model predictions. The core component of this package is the `EvaluationManager` class, which is initialized with a **list of evaluation metrics**.
+
+The `evaluate` method accepts the following inputs:
+1. A DataFrame of actual values,
+2. A list of prediction DataFrames,
+3. The target variable name,
+4. The model config.
+
+Both the actual and prediction DataFrames must use a multi-index of `(month_id, country_id/priogrid_gid)` and contain a column for the target variable. In the actuals DataFrame, this column must be named exactly as the target. In each prediction DataFrame, the predicted column must be named `f'pred_{target}'`.
+
+The number of prediction DataFrames is flexible. However, the standard practice is to evaluate **12 sequences**. When more than two predictions are provided, the evaluation will behave similarly to a **rolling origin evaluation** with a **fixed holdout size of 1**. For further reference, see [ADR 002](https://github.com/views-platform/views-evaluation/blob/main/documentation/ADRs/002_evaluation_strategy.md) on the rolling origin methodology.
+
+The class automatically determines the evaluation type (point or uncertainty) and aligns `month_id` values between the actuals and each prediction. By default, the evaluation is performed **month-wise**, **step-wise**, and **time-series-wise** (more information in [ADR 003](https://github.com/views-platform/views-evaluation/blob/main/documentation/ADRs/003_metric_calculation.md)). A minimal usage sketch is provided further below.
+
+
+## Consequences
+
+**Positive Effects:**
+
+- Standardized evaluation across all models.
+
+**Negative Effects:**
+
+- Requires strict adherence to index and column naming conventions.
+
+## Rationale
+
+Using the `views-evaluation` package enforces consistency and reproducibility in model evaluation. The built-in support for rolling origin evaluation reflects a realistic scenario for time-series forecasting where the model is updated or evaluated sequentially. Its flexible design aligns with our workflow, where multiple prediction sets across multiple horizons are common.
+
+
+### Considerations
+
+- Other evaluation types, such as correlation matrices, may be requested in the future. These might not be compatible with the current architecture or evaluation strategy of the `views-evaluation` package.
+
+- Consider accepting `config` as the single input instead of separate `target` and `steps` arguments. This would improve consistency, because these parameters are already defined in the config, and it would allow for more flexible or partial evaluation workflows (e.g., when only one or two evaluation strategies are desired).
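+
+The sketch below mirrors the quickstart notebook and illustrates the expected input schema. The index values, numbers, and metric list are made up for illustration; the target name follows the `lr_`/`ln_`/`lx_` prefix convention expected by the package.
+
+```python
+import pandas as pd
+from views_evaluation.evaluation.evaluation_manager import EvaluationManager
+
+# Actuals: MultiIndex (month_id, country_id) with a column named exactly as the target.
+actual_idx = pd.MultiIndex.from_product([[529, 530, 531], [1, 2]], names=["month_id", "country_id"])
+df_actual = pd.DataFrame({"lr_target": [0.0, 1.0, 2.0, 0.0, 3.0, 1.0]}, index=actual_idx)
+
+# Two prediction sequences (rolling origins); each prediction column must be named f"pred_{target}".
+idx_1 = pd.MultiIndex.from_product([[529, 530], [1, 2]], names=["month_id", "country_id"])
+idx_2 = pd.MultiIndex.from_product([[530, 531], [1, 2]], names=["month_id", "country_id"])
+dfs_point = [
+    pd.DataFrame({"pred_lr_target": [0.5, 1.2, 1.8, 0.1]}, index=idx_1),
+    pd.DataFrame({"pred_lr_target": [2.1, 0.2, 2.9, 1.1]}, index=idx_2),
+]
+
+evaluation_manager = EvaluationManager(["MSE", "RMSLE", "CRPS"])
+config = {"steps": [1, 2]}  # steps reported in the step-wise breakdown
+results = evaluation_manager.evaluate(df_actual, dfs_point, target="lr_target", config=config)
+# `results` holds the "month", "time_series", and "step" breakdowns.
+```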
+
+## Feedback and Suggestions
+Any feedback or suggestions are welcome.
+
diff --git a/documentation/ADRs/005_evaluation_output_schema.md b/documentation/ADRs/005_evaluation_output_schema.md
new file mode 100644
index 0000000..e9d463c
--- /dev/null
+++ b/documentation/ADRs/005_evaluation_output_schema.md
@@ -0,0 +1,110 @@
+# Evaluation Output Schema
+
+| ADR Info | Details |
+|---------------------|-------------------|
+| Subject | Evaluation Output Schema |
+| ADR Number | 005 |
+| Status | Proposed |
+| Author | Xiaolong |
+| Date | 16.06.2025 |
+
+## Context
+As part of our model evaluation workflow, we generate comprehensive reports summarizing model performance across a range of metrics and time periods. These reports are intended primarily for comparing ensemble models against their constituent models and baselines.
+
+## Decision
+
+We define a standard output schema for model evaluation reports using two formats:
+
+1. **JSON file** – machine-readable output storing structured evaluation data.
+2. **HTML file** – human-readable report with charts, tables, and summaries.
+
+These files are stored in the `reports/` directory for each model within `views-models`.
+
+To prevent a circular dependency between `views-evaluation` and `views-pipeline-core`, the `views-evaluation` package returns the evaluation dictionary, and `views-pipeline-core` is then responsible for saving it as a JSON file.
+
+### Schema Overview (JSON)
+Each report follows a standardized JSON structure that includes:
+````
+{
+    "Target": "target",
+    "Forecast Type": "point",
+    "Level of Analysis": "cm",
+    "Data Partition": "validation",
+    "Training Period": [121,492],
+    "Testing Period": [493,540],
+    "Forecast Horizon": 36,
+    "Number of Rolling Origins": 12,
+    "Evaluation Results": [
+        {
+            "Type": "Ensemble",
+            "Model Name": "ensemble_model",
+            "MSE": mse_e,
+            "MSLE": msle_e,
+            "mean prediction": mp_e
+        },
+        {
+            "Type": "Constituent",
+            "Model Name": "constitute_a",
+            "MSE": mse_a,
+            "MSLE": msle_a,
+            "mean prediction": mp_a
+        },
+        {
+            "Type": "Constituent",
+            "Model Name": "constitute_b",
+            "MSE": mse_b,
+            "MSLE": msle_b,
+            "mean prediction": mp_b
+        }
+        ...
+    ]
+}
+````
+Here, the lowercase placeholders (e.g., `mse_e`, `msle_a`, `mp_b`) stand for the metric values computed for the ensemble and each of its constituent models.
+
+The output file is named according to the following naming convention:
+```
+eval_validation_{conflict_type}_{timestamp}.json
+```
+
+
+## Consequences
+
+**Positive Effects:**
+
+- Avoids circular dependency between `views-evaluation` and `views-pipeline-core`.
+
+- Provides consistent input for both HTML rendering and potential downstream systems (e.g., dashboards, APIs).
+
+- Facilitates modularity and separation of concerns.
+
+
+**Negative Effects:**
+
+- Requires tight coordination between the two packages to maintain schema compatibility.
+
+- Some redundancy between evaluation and report generation may occur.
+
+- May require schema migrations as new report sections are added.
+
+
+## Rationale
+
+Saving reports within `views-pipeline-core` ensures full control over rendering, formatting, and contextual customization (e.g., comparing different model families). By letting `views-evaluation` focus strictly on metrics and alignment logic, we maintain cleaner package boundaries.
+
+
+### Considerations
+
+- This schema may evolve as we introduce new types of evaluation (e.g., correlation matrix).
+
+- Reports are currently only generated for **ensemble models**, as comparison against constituent models is the primary use case.
+
+- Future extensibility (e.g., visual version diffs) should be considered when evolving the format.
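+
+To make the handover between the two packages concrete, a minimal sketch of the saving step on the `views-pipeline-core` side could look as follows. `save_eval_report` is a hypothetical helper and the timestamp format is an assumption; only the directory layout and file-name convention come from this ADR.
+
+```python
+import json
+from datetime import datetime
+from pathlib import Path
+
+def save_eval_report(eval_report: dict, reports_dir: str, conflict_type: str) -> Path:
+    """Persist the evaluation dictionary returned by views-evaluation as JSON,
+    following the eval_validation_{conflict_type}_{timestamp}.json convention."""
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
+    out_path = Path(reports_dir) / f"eval_validation_{conflict_type}_{timestamp}.json"
+    out_path.parent.mkdir(parents=True, exist_ok=True)
+    with open(out_path, "w") as f:
+        json.dump(eval_report, f, indent=2)
+    return out_path
+```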
+ + + +## Feedback and Suggestions +Any feedback or suggestion is welcomed + diff --git a/examples/quickstart.ipynb b/examples/quickstart.ipynb index 084dba4..c7246f1 100644 --- a/examples/quickstart.ipynb +++ b/examples/quickstart.ipynb @@ -53,13 +53,13 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", - "from views_evaluation.evaluation.evaluation_manager import EvaluationManager" + "from views_evaluation.evaluation.evaluation_manager import EvaluationManager\n" ] }, { @@ -73,7 +73,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ @@ -93,7 +93,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -109,7 +109,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -145,7 +145,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ @@ -162,7 +162,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 12, "metadata": {}, "outputs": [ { @@ -176,13 +176,13 @@ } ], "source": [ - "steps = [1, 2]\n", - "point_evaluation_results = evaluation_manager.evaluate(df_actual, dfs_point, target='lr_target', steps=steps)" + "config = {\"steps\": [1, 2]}\n", + "point_evaluation_results = evaluation_manager.evaluate(df_actual, dfs_point, target='lr_target', config=config)" ] }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 13, "metadata": {}, "outputs": [ { @@ -200,7 +200,7 @@ " ts01 0.420849 2.0)" ] }, - "execution_count": 20, + "execution_count": 13, "metadata": {}, "output_type": "execute_result" } @@ -218,32 +218,26 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "Metric RMSLE is not a default metric, skipping...\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ + "Metric RMSLE is not a default metric, skipping...\n", "Metric RMSLE is not a default metric, skipping...\n", "Metric RMSLE is not a default metric, skipping...\n" ] } ], "source": [ - "uncertainty_evaluation_results = evaluation_manager.evaluate(df_actual, dfs_uncertainty, target='lr_target', steps=steps)" + "uncertainty_evaluation_results = evaluation_manager.evaluate(df_actual, dfs_uncertainty, target='lr_target', config=config)" ] }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 16, "metadata": {}, "outputs": [ { @@ -261,7 +255,7 @@ " ts01 3.611111 107.8)" ] }, - "execution_count": 24, + "execution_count": 16, "metadata": {}, "output_type": "execute_result" } @@ -279,27 +273,35 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 19, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Metric MIS is not a default metric, skipping...\n" + ] + } + ], "source": [ "# Get the evaluation type, i.e., uncertainty or point\n", "actual = EvaluationManager.transform_data(\n", - " EvaluationManager.convert_to_arrays(df_actual), 'lr_target'\n", + " EvaluationManager.convert_to_array(df_actual, \"lr_target\"), 'lr_target'\n", " )\n", "predictions = [\n", " EvaluationManager.transform_data(\n", - " EvaluationManager.convert_to_arrays(pred), f\"pred_lr_target\"\n", + " 
EvaluationManager.convert_to_array(pred, f\"pred_lr_target\"), f\"pred_lr_target\"\n", " )\n", " for pred in dfs_point\n", "]\n", - "is_uncertainty = EvaluationManager.get_evaluation_type(predictions)\n", + "is_uncertainty = EvaluationManager.get_evaluation_type(predictions, 'pred_lr_target')\n", "month_point_evaluation_results = evaluation_manager.month_wise_evaluation(actual, predictions, target='lr_target', is_uncertainty=is_uncertainty)" ] }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 20, "metadata": {}, "outputs": [ { @@ -317,6 +319,27 @@ "print(month_point_evaluation_results[1])" ] }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'step01': PointEvaluationMetrics(MSE=None, MSLE=None, RMSLE=0.18203984406117593, CRPS=0.5, AP=None, EMD=None, SD=None, pEMDiv=None, Pearson=None, Variogram=None, y_hat_bar=None),\n", + " 'step02': PointEvaluationMetrics(MSE=None, MSLE=None, RMSLE=0.636311445241193, CRPS=3.5, AP=None, EMD=None, SD=None, pEMDiv=None, Pearson=None, Variogram=None, y_hat_bar=None)}" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "point_evaluation_results['step'][0]" + ] + }, { "cell_type": "code", "execution_count": null, diff --git a/tests/test_evaluation_manager.py b/tests/test_evaluation_manager.py index 46aec9c..3c3f807 100644 --- a/tests/test_evaluation_manager.py +++ b/tests/test_evaluation_manager.py @@ -60,14 +60,14 @@ def mock_actual(): }, index=index, ) - return EvaluationManager.convert_to_arrays(df) + return EvaluationManager.convert_to_array(df, "target") @pytest.fixture def mock_point_predictions(mock_index): df1 = pd.DataFrame({"pred_target": [1.0, 3.0, 5.0, 7.0, 9.0, 7.0]}, index=mock_index[0]) df2 = pd.DataFrame({"pred_target": [2.0, 4.0, 6.0, 8.0, 10.0, 8.0]}, index=mock_index[1]) - return [EvaluationManager.convert_to_arrays(df1), EvaluationManager.convert_to_arrays(df2)] + return [EvaluationManager.convert_to_array(df1, "pred_target"), EvaluationManager.convert_to_array(df2, "pred_target")] @pytest.fixture @@ -98,7 +98,7 @@ def mock_uncertainty_predictions(mock_index): }, index=mock_index[1], ) - return [EvaluationManager.convert_to_arrays(df1), EvaluationManager.convert_to_arrays(df2)] + return [EvaluationManager.convert_to_array(df1, "pred_target"), EvaluationManager.convert_to_array(df2, "pred_target")] def test_validate_dataframes_valid_type(mock_point_predictions): @@ -120,14 +120,14 @@ def test_get_evaluation_type(): pd.DataFrame({'pred_target': [[1.0, 2.0], [3.0, 4.0]]}), pd.DataFrame({'pred_target': [[5.0, 6.0], [7.0, 8.0]]}), ] - assert EvaluationManager.get_evaluation_type(predictions_uncertainty) == True + assert EvaluationManager.get_evaluation_type(predictions_uncertainty, "pred_target") == True # Test case 2: All DataFrames for point evaluation predictions_point = [ pd.DataFrame({'pred_target': [[1.0], [2.0]]}), pd.DataFrame({'pred_target': [[3.0], [4.0]]}), ] - assert EvaluationManager.get_evaluation_type(predictions_point) == False + assert EvaluationManager.get_evaluation_type(predictions_point, "pred_target") == False # Test case 3: Mixed evaluation types predictions_mixed = [ @@ -135,14 +135,14 @@ def test_get_evaluation_type(): pd.DataFrame({'pred_target': [[5.0], [6.0]]}), ] with pytest.raises(ValueError): - EvaluationManager.get_evaluation_type(predictions_mixed) + EvaluationManager.get_evaluation_type(predictions_mixed, "pred_target") # Test case 4: Single element lists 
predictions_single_element = [ pd.DataFrame({'pred_target': [[1.0], [2.0]]}), pd.DataFrame({'pred_target': [[3.0], [4.0]]}), ] - assert EvaluationManager.get_evaluation_type(predictions_single_element) == False + assert EvaluationManager.get_evaluation_type(predictions_single_element, "pred_target") == False def test_match_actual_pred_point( @@ -171,44 +171,44 @@ def test_match_actual_pred_point( def test_split_dfs_by_step(mock_point_predictions, mock_uncertainty_predictions): df_splitted_point = [ - EvaluationManager.convert_to_arrays(pd.DataFrame( + EvaluationManager.convert_to_array(pd.DataFrame( {"pred_target": [[1.0], [3.0], [2.0], [4.0]]}, index=pd.MultiIndex.from_tuples( [(100, 1), (100, 2), (101, 1), (101, 2)], names=["month", "country"] ), - )), - EvaluationManager.convert_to_arrays(pd.DataFrame( + ), "pred_target"), + EvaluationManager.convert_to_array(pd.DataFrame( {"pred_target": [[5.0], [7.0], [6.0], [8.0]]}, index=pd.MultiIndex.from_tuples( [(101, 1), (101, 2), (102, 1), (102, 2)], names=["month", "country"] ), - )), - EvaluationManager.convert_to_arrays(pd.DataFrame( + ), "pred_target"), + EvaluationManager.convert_to_array(pd.DataFrame( {"pred_target": [[9.0], [7.0], [10.0], [8.0]]}, index=pd.MultiIndex.from_tuples( [(102, 1), (102, 2), (103, 1), (103, 2)], names=["month", "country"] ), - )), + ), "pred_target"), ] df_splitted_uncertainty = [ - EvaluationManager.convert_to_arrays(pd.DataFrame( + EvaluationManager.convert_to_array(pd.DataFrame( {"pred_target": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 6.0, 8.0], [5.0, 7.0, 9.0]]}, index=pd.MultiIndex.from_tuples( [(100, 1), (100, 2), (101, 1), (101, 2)], names=["month", "country"] ), - )), - EvaluationManager.convert_to_arrays(pd.DataFrame( + ), "pred_target"), + EvaluationManager.convert_to_array(pd.DataFrame( {"pred_target": [[3.0, 4.0, 5.0], [4.0, 5.0, 6.0], [6.0, 8.0, 10.0], [7.0, 9.0, 11.0]]}, index=pd.MultiIndex.from_tuples( [(101, 1), (101, 2), (102, 1), (102, 2)], names=["month", "country"] ), - )), - EvaluationManager.convert_to_arrays(pd.DataFrame( + ), "pred_target"), + EvaluationManager.convert_to_array(pd.DataFrame( {"pred_target": [[5.0, 6.0, 7.0], [6.0, 7.0, 8.0], [8.0, 10.0, 12.0], [9.0, 11.0, 13.0]]}, index=pd.MultiIndex.from_tuples( [(102, 1), (102, 2), (103, 1), (103, 2)], names=["month", "country"] ), - )), + ), "pred_target"), ] df_splitted_point_test = EvaluationManager._split_dfs_by_step( mock_point_predictions diff --git a/tests/test_metric_calculators.py b/tests/test_metric_calculators.py index 1ee54f1..31872a2 100644 --- a/tests/test_metric_calculators.py +++ b/tests/test_metric_calculators.py @@ -2,6 +2,7 @@ import pandas as pd import numpy as np from views_evaluation.evaluation.metric_calculators import ( + calculate_mse, calculate_rmsle, calculate_crps, calculate_ap, @@ -39,15 +40,29 @@ def sample_uncertainty_data(): return actual, pred -def test_calculate_rmsle(sample_data): +def test_calculate_mse(sample_data): + """Test MSE calculation.""" + actual, pred = sample_data + result = calculate_mse(actual, pred, 'target') + assert isinstance(result, float) + assert result >= 0 + +def test_calculate_rmsle_point(sample_data): """Test RMSLE calculation.""" actual, pred = sample_data result = calculate_rmsle(actual, pred, 'target') assert isinstance(result, float) assert result >= 0 +def test_calculate_crps_point(sample_data): + """Test CRPS calculation.""" + actual, pred = sample_data + result = calculate_crps(actual, pred, 'target') + assert isinstance(result, float) + assert result >= 0 + -def 
test_calculate_crps(sample_uncertainty_data): +def test_calculate_crps_uncertainty(sample_uncertainty_data): """Test CRPS calculation.""" actual, pred = sample_uncertainty_data result = calculate_crps(actual, pred, 'target') @@ -79,7 +94,7 @@ def test_calculate_pearson(sample_data): assert -1 <= result <= 1 -def test_calculate_coverage(sample_uncertainty_data): +def test_calculate_coverage_uncertainty(sample_uncertainty_data): """Test Coverage calculation.""" actual, pred = sample_uncertainty_data result = calculate_coverage(actual, pred, 'target') @@ -87,7 +102,7 @@ def test_calculate_coverage(sample_uncertainty_data): assert 0 <= result <= 1 -def test_calculate_ignorance_score(sample_uncertainty_data): +def test_calculate_ignorance_score_uncertainty(sample_uncertainty_data): """Test Ignorance Score calculation.""" actual, pred = sample_uncertainty_data result = calculate_ignorance_score(actual, pred, 'target') @@ -95,7 +110,7 @@ def test_calculate_ignorance_score(sample_uncertainty_data): assert result >= 0 -def test_calculate_mis(sample_uncertainty_data): +def test_calculate_mis_uncertainty(sample_uncertainty_data): """Test Mean Interval Score calculation.""" actual, pred = sample_uncertainty_data result = calculate_mean_interval_score(actual, pred, 'target') diff --git a/views_evaluation/evaluation/__init__.py b/views_evaluation/evaluation/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/views_evaluation/evaluation/evaluation_manager.py b/views_evaluation/evaluation/evaluation_manager.py index 9b0f859..6f8371d 100644 --- a/views_evaluation/evaluation/evaluation_manager.py +++ b/views_evaluation/evaluation/evaluation_manager.py @@ -3,6 +3,7 @@ import pandas as pd import numpy as np from views_evaluation.evaluation.metrics import ( + BaseEvaluationMetrics, PointEvaluationMetrics, UncertaintyEvaluationMetrics, ) @@ -33,36 +34,39 @@ def __init__(self, metrics_list: list): self.uncertainty_metric_functions = UNCERTAINTY_METRIC_FUNCTIONS @staticmethod - def transform_data(df: pd.DataFrame, target: str) -> pd.DataFrame: + def transform_data(df: pd.DataFrame, target: str | list[str]) -> pd.DataFrame: """ - Transform the data to normal distribution. + Transform the data. 
""" - if target.startswith("ln") or target.startswith("pred_ln"): - df[[target]] = df[[target]].applymap( - lambda x: ( - np.exp(x) - 1 - if isinstance(x, (list, np.ndarray)) - else np.exp(x) - 1 + if isinstance(target, str): + target = [target] + for t in target: + if t.startswith("ln") or t.startswith("pred_ln"): + df[[t]] = df[[t]].applymap( + lambda x: ( + np.exp(x) - 1 + if isinstance(x, (list, np.ndarray)) + else np.exp(x) - 1 + ) ) - ) - elif target.startswith("lx") or target.startswith("pred_lx"): - df[[target]] = df[[target]].applymap( - lambda x: ( - np.exp(x) - np.exp(100) - if isinstance(x, (list, np.ndarray)) - else np.exp(x) - np.exp(100) + elif t.startswith("lx") or t.startswith("pred_lx"): + df[[t]] = df[[t]].applymap( + lambda x: ( + np.exp(x) - np.exp(100) + if isinstance(x, (list, np.ndarray)) + else np.exp(x) - np.exp(100) + ) ) - ) - elif target.startswith("lr") or target.startswith("pred_lr"): - df[[target]] = df[[target]].applymap( - lambda x: x if isinstance(x, (list, np.ndarray)) else x - ) - else: - raise ValueError(f"Target {target} is not a valid target") + elif t.startswith("lr") or t.startswith("pred_lr"): + df[[t]] = df[[t]].applymap( + lambda x: x if isinstance(x, (list, np.ndarray)) else x + ) + else: + raise ValueError(f"Target {t} is not a valid target") return df @staticmethod - def convert_to_arrays(df: pd.DataFrame) -> pd.DataFrame: + def convert_to_array(df: pd.DataFrame, target: str | list[str]) -> pd.DataFrame: """ Convert columns in a DataFrame to numpy arrays. @@ -73,14 +77,35 @@ def convert_to_arrays(df: pd.DataFrame) -> pd.DataFrame: pd.DataFrame: A new DataFrame with columns converted to numpy arrays. """ converted = df.copy() - for col in converted.columns: - converted[col] = converted[col].apply( - lambda x: np.array(x) if isinstance(x, list) else np.array([x]) + if isinstance(target, str): + target = [target] + + for t in target: + converted[t] = converted[t].apply( + lambda x: ( + x + if isinstance(x, np.ndarray) + else (np.array(x) if isinstance(x, list) else np.array([x])) + ) + ) + return converted + + @staticmethod + def convert_to_scalar(df: pd.DataFrame, target: str | list[str]) -> pd.DataFrame: + """ + Convert columns in a DataFrame to scalar values by taking the mean of the list. + """ + converted = df.copy() + if isinstance(target, str): + target = [target] + for t in target: + converted[t] = converted[t].apply( + lambda x: np.mean(x) if isinstance(x, (list, np.ndarray)) else x ) return converted @staticmethod - def get_evaluation_type(predictions: List[pd.DataFrame]) -> bool: + def get_evaluation_type(predictions: List[pd.DataFrame], target: str) -> bool: """ Validates the values in each DataFrame in the list. The return value indicates whether all DataFrames are for uncertainty evaluation. @@ -101,12 +126,12 @@ def get_evaluation_type(predictions: List[pd.DataFrame]) -> bool: uncertainty_length = None for df in predictions: - for value in df.values.flatten(): + for value in df[target].values.flatten(): if not (isinstance(value, np.ndarray) or isinstance(value, list)): raise ValueError( "All values must be lists or numpy arrays. Convert the data." ) - + if len(value) > 1: is_uncertainty = True # For uncertainty evaluation, check that all lists have the same length @@ -171,11 +196,18 @@ def _match_actual_pred( - matched_pred: pd.DataFrame aligned with actual. 
""" actual_target = actual[[target]] - aligned_actual, aligned_pred = actual_target.align(pred, join="inner") - matched_actual = aligned_actual.reindex(index=aligned_pred.index) - matched_actual[[target]] = actual_target + common_indices = actual_target.index.intersection(pred.index) + matched_pred = pred[pred.index.isin(common_indices)].copy() + + # Create matched_actual by reindexing actual_target to match pred's index structure + # This will duplicate rows in actual where pred has duplicate indices + matched_actual = actual_target.reindex(matched_pred.index) + + matched_actual = matched_actual.sort_index() + matched_pred = matched_pred.sort_index() + + return matched_actual, matched_pred - return matched_actual.sort_index(), pred.sort_index() @staticmethod def _split_dfs_by_step(dfs: list) -> list: @@ -208,6 +240,24 @@ def _split_dfs_by_step(dfs: list) -> list: return result_dfs + def _process_data( + self, actual: pd.DataFrame, predictions: List[pd.DataFrame], target: str + ): + """ + Process the data for evaluation. + """ + actual = EvaluationManager.transform_data( + EvaluationManager.convert_to_array(actual, target), target + ) + predictions = [ + EvaluationManager.transform_data( + EvaluationManager.convert_to_array(pred, f"pred_{target}"), + f"pred_{target}", + ) + for pred in predictions + ] + return actual, predictions + def step_wise_evaluation( self, actual: pd.DataFrame, @@ -254,7 +304,9 @@ def step_wise_evaluation( ) evaluation_dict[f"step{str(step).zfill(2)}"].__setattr__( metric, - metric_functions[metric](matched_actual, matched_pred, target, **kwargs), + metric_functions[metric]( + matched_actual, matched_pred, target, **kwargs + ), ) else: logger.warning(f"Metric {metric} is not a default metric, skipping...") @@ -307,7 +359,9 @@ def time_series_wise_evaluation( ) evaluation_dict[f"ts{str(i).zfill(2)}"].__setattr__( metric, - metric_functions[metric](matched_actual, matched_pred, target, **kwargs), + metric_functions[metric]( + matched_actual, matched_pred, target, **kwargs + ), ) else: logger.warning(f"Metric {metric} is not a default metric, skipping...") @@ -339,8 +393,8 @@ def month_wise_evaluation( """ pred_concat = pd.concat(predictions) month_range = pred_concat.index.get_level_values(0).unique() - month_start = month_range.min() - month_end = month_range.max() + month_start = int(month_range.min()) + month_end = int(month_range.max()) if is_uncertainty: evaluation_dict = ( @@ -366,8 +420,8 @@ def month_wise_evaluation( level=matched_pred.index.names[0] ).apply( lambda df: metric_functions[metric]( - matched_actual.loc[df.index, [target]], - matched_pred.loc[df.index, [f"pred_{target}"]], + matched_actual.loc[df.index.unique(), [target]], + matched_pred.loc[df.index.unique(), [f"pred_{target}"]], target, **kwargs, ) @@ -390,7 +444,7 @@ def evaluate( actual: pd.DataFrame, predictions: List[pd.DataFrame], target: str, - steps: List[int], + config: dict, **kwargs, ): """ @@ -400,36 +454,145 @@ def evaluate( actual (pd.DataFrame): The actual values. predictions (List[pd.DataFrame]): A list of DataFrames containing the predictions. target (str): The target column in the actual DataFrame. - steps (List[int]): The steps to evaluate. - + config (dict): The configuration dictionary. 
""" - EvaluationManager.validate_predictions(predictions, target) - actual = EvaluationManager.transform_data( - EvaluationManager.convert_to_arrays(actual), target + self.actual, self.predictions = self._process_data(actual, predictions, target) + self.is_uncertainty = EvaluationManager.get_evaluation_type( + self.predictions, f"pred_{target}" ) - predictions = [ - EvaluationManager.transform_data( - EvaluationManager.convert_to_arrays(pred), f"pred_{target}" - ) - for pred in predictions - ] - is_uncertainty = EvaluationManager.get_evaluation_type(predictions) evaluation_results = {} evaluation_results["month"] = self.month_wise_evaluation( - actual, predictions, target, is_uncertainty, **kwargs + self.actual, self.predictions, target, self.is_uncertainty, **kwargs ) evaluation_results["time_series"] = self.time_series_wise_evaluation( - actual, predictions, target, is_uncertainty, **kwargs + self.actual, self.predictions, target, self.is_uncertainty, **kwargs ) evaluation_results["step"] = self.step_wise_evaluation( - actual, - predictions, + self.actual, + self.predictions, target, - steps, - is_uncertainty, + config["steps"], + self.is_uncertainty, **kwargs, ) return evaluation_results + + @staticmethod + def filter_step_wise_evaluation( + step_wise_evaluation_results: dict, + filter_steps: list[int] = [1, 3, 6, 12, 36], + ): + """ + Filter step-wise evaluation results to include only specific steps. + + Args: + step_wise_evaluation_results (dict): The step-wise evaluation results containing evaluation dict and DataFrame. + filter_steps (list[int]): List of step numbers to include in the filtered results. Defaults to [1, 3, 6, 12, 36]. + + Returns: + dict: A dictionary containing the filtered evaluation dictionary and DataFrame for the selected steps. + """ + step_wise_evaluation_dict = step_wise_evaluation_results[0] + step_wise_evaluation_df = step_wise_evaluation_results[1] + + selected_keys = [f"step{str(step).zfill(2)}" for step in filter_steps] + + filtered_evaluation_dict = { + key: step_wise_evaluation_dict[key] + for key in selected_keys + if key in step_wise_evaluation_dict + } + + filtered_evaluation_df = step_wise_evaluation_df.loc[ + step_wise_evaluation_df.index.isin(selected_keys) + ] + + return (filtered_evaluation_dict, filtered_evaluation_df) + + @staticmethod + def aggregate_month_wise_evaluation( + month_wise_evaluation_results: dict, + aggregation_period: int = 6, + aggregation_type: str = "mean", + ): + """ + Aggregate month-wise evaluation results by grouping months into periods and applying aggregation. + + Args: + month_wise_evaluation_results (dict): The month-wise evaluation results containing evaluation dict and DataFrame. + aggregation_period (int): Number of months to group together for aggregation. + aggregation_type (str): Type of aggregation to apply. + Returns: + dict: A dictionary containing the aggregated evaluation dictionary and DataFrame. + """ + month_wise_evaluation_dict = month_wise_evaluation_results[0] + month_wise_evaluation_df = month_wise_evaluation_results[1] + + available_months = [ + int(month.replace("month", "")) for month in month_wise_evaluation_df.index + ] + available_months.sort() + + if len(available_months) < aggregation_period: + raise ValueError( + f"Not enough months to aggregate. 
Available months: {available_months}, aggregation period: {aggregation_period}" + ) + + aggregated_dict = {} + aggregated_data = [] + + for i in range(0, len(available_months), aggregation_period): + period_months = available_months[i : i + aggregation_period] + period_start = period_months[0] + period_end = period_months[-1] + period_key = f"month_{period_start}_{period_end}" + + period_metrics = [] + for month in period_months: + month_key = f"month{month}" + if month_key in month_wise_evaluation_dict: + period_metrics.append(month_wise_evaluation_dict[month_key]) + + if period_metrics: + aggregated_metrics = {} + for metric_name in period_metrics[0].__annotations__.keys(): + metric_values = [ + getattr(metric, metric_name) + for metric in period_metrics + if getattr(metric, metric_name) is not None + ] + + if metric_values: + if aggregation_type == "mean": + aggregated_value = np.mean(metric_values) + elif aggregation_type == "median": + aggregated_value = np.median(metric_values) + else: + raise ValueError( + f"Unsupported aggregation type: {aggregation_type}" + ) + + aggregated_metrics[metric_name] = aggregated_value + else: + aggregated_metrics[metric_name] = None + + if hasattr(period_metrics[0], "__class__"): + aggregated_eval_metrics = period_metrics[0].__class__( + **aggregated_metrics + ) + else: + aggregated_eval_metrics = aggregated_metrics + + aggregated_dict[period_key] = aggregated_eval_metrics + + aggregated_data.append({"month_id": period_key, **aggregated_metrics}) + + if aggregated_data: + aggregated_df = BaseEvaluationMetrics.evaluation_dict_to_dataframe( + aggregated_dict + ) + + return (aggregated_dict, aggregated_df) diff --git a/views_evaluation/evaluation/metric_calculators.py b/views_evaluation/evaluation/metric_calculators.py index 02d775f..1bc7523 100644 --- a/views_evaluation/evaluation/metric_calculators.py +++ b/views_evaluation/evaluation/metric_calculators.py @@ -1,15 +1,62 @@ -from typing import List, Dict, Tuple, Optional from collections import Counter import pandas as pd import numpy as np import properscoring as ps from sklearn.metrics import ( root_mean_squared_log_error, + mean_squared_error, + mean_squared_log_error, average_precision_score, ) from scipy.stats import wasserstein_distance, pearsonr +def calculate_mse( + matched_actual: pd.DataFrame, matched_pred: pd.DataFrame, target: str +) -> float: + """ + Calculate Mean Square Error for each prediction. + + Args: + matched_actual (pd.DataFrame): DataFrame containing actual values + matched_pred (pd.DataFrame): DataFrame containing predictions + target (str): The target column name + + Returns: + float: Average MSE score + """ + actual_values = np.concatenate(matched_actual[target].values) + pred_values = np.concatenate(matched_pred[f"pred_{target}"].values) + + actual_expanded = np.repeat( + actual_values, [len(x) for x in matched_pred[f"pred_{target}"]] + ) + + return mean_squared_error(actual_expanded, pred_values) + + +def calculate_msle( + matched_actual: pd.DataFrame, matched_pred: pd.DataFrame, target: str +) -> float: + """ + Calculate Mean Squared Logarithmic Error (MSLE) for each prediction. 
+ + Args: + matched_actual (pd.DataFrame): DataFrame containing actual values + matched_pred (pd.DataFrame): DataFrame containing predictions + target (str): The target column name + + Returns: + float: Average MSLE score + """ + actual_values = np.concatenate(matched_actual[target].values) + pred_values = np.concatenate(matched_pred[f"pred_{target}"].values) + actual_expanded = np.repeat( + actual_values, [len(x) for x in matched_pred[f"pred_{target}"]] + ) + return mean_squared_log_error(actual_expanded, pred_values) + + def calculate_rmsle( matched_actual: pd.DataFrame, matched_pred: pd.DataFrame, target: str ) -> float: @@ -334,9 +381,11 @@ def calculate_ignorance_score( def digitize_minus_one(x, edges): return np.digitize(x, edges, right=False) - 1 - def _calculate_ignorance_score(predictions, observed, n): - c = Counter(predictions) - prob = c[observed] / n + def _calculate_ignorance_score(predictions, observed, n, all_bins): + # Initialize each bin with 1 (Laplace smoothing) + c = Counter({bin_idx: 1 for bin_idx in all_bins}) + c.update(predictions) + prob = c[observed] / sum(c.values()) return -np.log2(prob) scores = [] @@ -353,13 +402,24 @@ def _calculate_ignorance_score(predictions, observed, n): binned_preds = np.concatenate([binned_preds, synthetic]) n = len(binned_preds) - score = _calculate_ignorance_score(binned_preds, binned_obs, n) + score = _calculate_ignorance_score(binned_preds, binned_obs, n, synthetic) scores.append(score) return np.mean(scores) +def calculate_mean_prediction( + matched_actual: pd.DataFrame, matched_pred: pd.DataFrame, target: str +) -> float: + """ + Calculate the mean prediction. + """ + all_preds = np.concatenate([np.asarray(v).flatten() for v in matched_pred[f"pred_{target}"]]) + return np.mean(all_preds) + POINT_METRIC_FUNCTIONS = { + "MSE": calculate_mse, + "MSLE": calculate_msle, "RMSLE": calculate_rmsle, "CRPS": calculate_crps, "AP": calculate_ap, @@ -368,6 +428,7 @@ def _calculate_ignorance_score(predictions, observed, n): "pEMDiv": calculate_pEMDiv, "Pearson": calculate_pearson, "Variogram": calculate_variogram, + "y_hat_bar": calculate_mean_prediction, } UNCERTAINTY_METRIC_FUNCTIONS = { @@ -377,4 +438,6 @@ def _calculate_ignorance_score(predictions, observed, n): "Brier": calculate_brier, "Jeffreys": calculate_jeffreys, "Coverage": calculate_coverage, + "pEMDiv": calculate_pEMDiv, + "y_hat_bar": calculate_mean_prediction, } diff --git a/views_evaluation/evaluation/metrics.py b/views_evaluation/evaluation/metrics.py index 36b2cb5..a7dcf33 100644 --- a/views_evaluation/evaluation/metrics.py +++ b/views_evaluation/evaluation/metrics.py @@ -118,6 +118,8 @@ class PointEvaluationMetrics(BaseEvaluationMetrics): Variogram (Optional[float]): Variogram. 
""" + MSE: Optional[float] = None + MSLE: Optional[float] = None RMSLE: Optional[float] = None CRPS: Optional[float] = None AP: Optional[float] = None @@ -126,6 +128,7 @@ class PointEvaluationMetrics(BaseEvaluationMetrics): pEMDiv: Optional[float] = None Pearson: Optional[float] = None Variogram: Optional[float] = None + y_hat_bar: Optional[float] = None @dataclass @@ -140,7 +143,9 @@ class UncertaintyEvaluationMetrics(BaseEvaluationMetrics): CRPS: Optional[float] = None MIS: Optional[float] = None Ignorance: Optional[float] = None + Coverage: Optional[float] = None + pEMDiv: Optional[float] = None Brier: Optional[float] = None Jeffreys: Optional[float] = None - Coverage: Optional[float] = None + y_hat_bar: Optional[float] = None \ No newline at end of file diff --git a/views_evaluation/reports/__init__.py b/views_evaluation/reports/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/views_evaluation/reports/generator.py b/views_evaluation/reports/generator.py new file mode 100644 index 0000000..30feeab --- /dev/null +++ b/views_evaluation/reports/generator.py @@ -0,0 +1,72 @@ +import numpy as np +import pandas as pd + + +class EvalReportGenerator: + """Generate evaluation reports for ensemble or single model forecasts.""" + + def __init__(self, config: dict, target: str, conflict_type: str): + self.config = config + self.target = target + self.conflict_type = conflict_type + self.level = config.get("level") + self.run_type = config.get("run_type") + self.eval_type = config.get("eval_type") + self.is_ensemble = True if "models" in config else False + self.eval_report = {} + + def generate_eval_report_dict(self, df_preds: list[pd.DataFrame], df_eval_ts: pd.DataFrame): + """Return a dictionary with evaluation report data.""" + self.eval_report = { + "Target": self.target, + "Forecast Type": self._forecast_type(df_preds), + "Level of Analysis": self.level, + "Data Partition": self.run_type, + "Training Period": self._partition("train"), + "Testing Period": self._partition("test"), + "Forecast Horizon": len(self.config.get("steps", [])), + "Number of Rolling Origins": len(df_preds), + "Evaluation Results": [] + } + + self.eval_report["Evaluation Results"].append( + self._single_result( + "Ensemble" if self.is_ensemble else "Model", + self.config["name"], + df_eval_ts, + ) + ) + return self.eval_report + + def update_ensemble_eval_report(self, model_name, df_eval_ts: pd.DataFrame): + self.eval_report["Evaluation Results"].append( + self._single_result( + "Constituent", + model_name, + df_eval_ts, + ) + ) + return self.eval_report + + def _forecast_type(self, df_preds: list[pd.DataFrame]): + from views_evaluation.evaluation.evaluation_manager import EvaluationManager + arr = [EvaluationManager.convert_to_array(df_pred, f"pred_{self.target}") for df_pred in df_preds] + return "point" if not EvaluationManager.get_evaluation_type(arr, f"pred_{self.target}") else "uncertainty" + + def _partition(self, key: str): + return self.config[self.run_type][key] + + def _single_result(self, model_type: str, model_name: str, df_eval_ts: pd.DataFrame): + mse = df_eval_ts["MSE"].mean() + msle = df_eval_ts["MSLE"].mean() + mean_pred = df_eval_ts["y_hat_bar"].mean() + + return { + "Type": model_type, + "Model Name": model_name, + "MSE": mse, + "MSLE": msle, + r"$\bar{\hat{y}}$": mean_pred + } + +