From ea40f8fe808a61d7280ef530f0bf6f538a78b7a9 Mon Sep 17 00:00:00 2001
From: Nikos Livathinos <nli@zurich.ibm.com>
Date: Tue, 9 Dec 2025 17:27:23 +0100
Subject: [PATCH 1/4] docs: Objectives, confusion matrix for the multi-label
 pixel evaluations

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---
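The cell definition in the "Confusion matrix" section of this patch can be illustrated with a minimal sketch. It assumes a simplified single-label setting (each pixel carries exactly one class index, Background at index 0) and the same taxonomy on both sides; the function name `single_label_confusion` and the toy class indices are hypothetical and are not taken from the code base.

```python
import numpy as np


def single_label_confusion(lr1: np.ndarray, lr2: np.ndarray, n: int) -> np.ndarray:
    """Count the pixels that have class i in lr1 (rows) and class j in lr2 (columns)."""
    if lr1.shape != lr2.shape:
        raise ValueError("both layout resolutions must cover the same page")
    # Encode every (i, j) pair as one flat index, then count the occurrences.
    flat = lr1.astype(np.int64).ravel() * n + lr2.astype(np.int64).ravel()
    return np.bincount(flat, minlength=n * n).reshape(n, n)


# Toy 2x3 page: index 0 is the Background, classes 1 and 2 are hypothetical.
lr1 = np.array([[0, 1, 1], [0, 2, 2]])
lr2 = np.array([[0, 1, 2], [0, 2, 2]])
cm = single_label_confusion(lr1, lr2, n=3)
```

Here the only off-diagonal count is cell (1, 2): one pixel has class 1 in LR1 and class 2 in LR2. The multi-label case replaces the integer counts with fractional contributions.
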
 docs/multi_label_pixel_layout_evaluations.md | 63 ++++++++++++++++++++
 1 file changed, 63 insertions(+)
 create mode 100644 docs/multi_label_pixel_layout_evaluations.md

diff --git a/docs/multi_label_pixel_layout_evaluations.md b/docs/multi_label_pixel_layout_evaluations.md
new file mode 100644
index 00000000..71daf73b
--- /dev/null
+++ b/docs/multi_label_pixel_layout_evaluations.md
@@ -0,0 +1,63 @@
+# Multi-label pixel layout evaluations
+
+## Objectives
+
+We want to compute metrices for the multi-label document layout analysis task.
+Each document page undergoes a layout resolution, where each detected object is assigned a bounding box and one or many classes.
+The ground truth contains the bounding box and one object class, although in a generalized version the ground truth can also assign multiple classes for the same object.
+Everything which is not classified is considered to be the *Background*.
+
+We want to evaluate 2 sets of layout resolutions against each other.
+This can be either the ground truth layout resolutions against the prediction layout resolutions, or 2 predictions against each other.
+We name those layout resolutions as LR1 and LR2.
+
+We also want to solve this evaluation task under the following conditions:
+
+- The evaluations take place at the pixel level.
+- The evaluation of each document page produces a square confusion matrix [n, n], which is the basis for computing:
+  - Document-level confusion matrix.
+  - Recall/Precision/F1 matrices per page and document.
+  - Recall/Precision/F1 vectors per class.
+  - Collapsed recall/precision/F1 matrices which contain only the background and the non-background classes.
+
+Additionally we have the following freedoms:
+
+- We do not require the predictions to contain any confidence scores but only bounding boxes and object classes.
+- The two evaluated layout resolutions are free to use any classification labels.
+
+
+## Confusion matrix
+
+The rows of the matrix correspond to the first layout resolution (ground truth or prediction A) and the columns to the second layout resolution.
+
+Each cell (i, j) is the number of pixels that correspond to class i according to the first layout resolution (e.g. ground truth) and to class j according to the second layout resolution.
+
+The exact structure of the confusion matrix and the evaluation metrics that can be derived from it depend on the number of classes in the two layout resolutions.
+More specifically we distinguish two cases:
+- Case A: Both layout resolutions use the same classification classes.
+- Case B: When the classes differ across the layout resolutions.
+
+|                                           | Same classes in LR1/LR2| Different classes in LR1/LR2           |
+|---------------------------------------- --|------------------------|----------------------------------------|
+|Rows represent                             | LR1 (e.g. GT)          | LR1 (e.g. GT, predictions A)           |
+|Columns represent                          | LR2 (e.g. predictions) | LR2 (e.g. predictions B)               |
+|Rows/Columns indices                       | background - classes   | background - classes LR1 - classes LR2 |
+|Matrix structure when perfect match        | diagonal               | block                                  |
+|Location of mis-predictions/mis-matches    | off-diagonal           |                                        |
+|Recall/Precision/F1 matrices               | yes                    | yes                                    |
+|Background/class-collapsed R/P/F1 matrices | yes                    | yes                                    |
+|Recall/Precision/F1 detailed class vectors | yes                    | no                                     |
+|Recall/Precision/F1 collapsed class vectors| yes                    | yes                                    |
+|
+
+The background is always in index 0.
+
+
+## Binary representation of the Layout Resolution
+
+
+## Multi-label classification confusion matrix
+
+
+## Computation Optimizations
+

From 2581eca8b269da75541a38fca79fd1b34b5f137b Mon Sep 17 00:00:00 2001
From: Nikos Livathinos <nli@zurich.ibm.com>
Date: Wed, 10 Dec 2025 12:47:30 +0100
Subject: [PATCH 2/4] docs: Documentation for the Multi-label pixel evaluation:
 Computation of the confusion matrix and derivatives

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---
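The derivation of the recall, precision and F1 matrices described in this patch can be sketched as follows. This is only an illustration under stated assumptions: the helper names `safe_divide` and `derive_metrics` are hypothetical, and mapping empty rows or columns to 0 is a choice of this sketch, not something the document specifies.

```python
import numpy as np


def safe_divide(num, den):
    """Element-wise num / den that yields 0 wherever den is 0."""
    num = np.asarray(num, dtype=np.float64)
    den = np.broadcast_to(np.asarray(den, dtype=np.float64), num.shape)
    out = np.zeros_like(num)
    np.divide(num, den, out=out, where=den != 0)
    return out


def derive_metrics(cm):
    """Return the (recall, precision, F1) matrices of a confusion matrix."""
    cm = np.asarray(cm, dtype=np.float64)
    recall = safe_divide(cm, cm.sum(axis=1, keepdims=True))     # divide by row sums
    precision = safe_divide(cm, cm.sum(axis=0, keepdims=True))  # divide by column sums
    f1 = safe_divide(2 * precision * recall, precision + recall)
    return recall, precision, f1


recall, precision, f1 = derive_metrics([[8, 2], [1, 9]])
# Rows and columns without any mass produce zeros instead of NaN.
empty_recall, empty_precision, empty_f1 = derive_metrics([[0, 0], [0, 5]])
```

The diagonals of `recall` and `precision` are the per-class vectors. In the 2x2 example, `f1[0, 0]` equals 16/19, which matches the familiar 2TP / (2TP + FP + FN).
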
 docs/multi_label_pixel_layout_evaluations.md | 61 ++++++++++++++++++--
 1 file changed, 56 insertions(+), 5 deletions(-)

diff --git a/docs/multi_label_pixel_layout_evaluations.md b/docs/multi_label_pixel_layout_evaluations.md
index 71daf73b..0cf2236b 100644
--- a/docs/multi_label_pixel_layout_evaluations.md
+++ b/docs/multi_label_pixel_layout_evaluations.md
@@ -26,7 +26,7 @@ Additionally we have the following freedoms:
 - The two evaluated layout resolutions are free to use any classification labels.
 
 
-## Confusion matrix
+## Confusion matrix structure
 
 The rows of the matrix correspond to the first layout resolution (ground truth or prediction A) and the columns to the second layout resolution.
 
@@ -37,27 +37,78 @@ More specifically we distinguish two cases:
 - Case A: Both layout resolutions use the same classification classes.
 - Case B: When the classes differ across the layout resolutions.
 
+The following table provides some insight on the properties of the confusion matrix and the derived metrics on each case:
+
 |                                           | Same classes in LR1/LR2| Different classes in LR1/LR2           |
 |---------------------------------------- --|------------------------|----------------------------------------|
 |Rows represent                             | LR1 (e.g. GT)          | LR1 (e.g. GT, predictions A)           |
 |Columns represent                          | LR2 (e.g. predictions) | LR2 (e.g. predictions B)               |
 |Rows/Columns indices                       | background - classes   | background - classes LR1 - classes LR2 |
+|Background class row/column                | (0, 0)                 | (0, 0)                                 |
 |Matrix structure when perfect match        | diagonal               | block                                  |
 |Location of mis-predictions/mis-matches    | off-diagonal           |                                        |
 |Recall/Precision/F1 matrices               | yes                    | yes                                    |
 |Background/class-collapsed R/P/F1 matrices | yes                    | yes                                    |
 |Recall/Precision/F1 detailed class vectors | yes                    | no                                     |
 |Recall/Precision/F1 collapsed class vectors| yes                    | yes                                    |
-|
+|                                           |                        |                                        |
 
-The background is always in index 0.
+Table 1: Confusion matrix and derivatives configuration across label-set consistency cases
 
 
-## Binary representation of the Layout Resolution
+## Computation of the confusion matrix and derivatives
+
+The computation of the multi-label classification matrix is based on the papers:
+[Multi-Label Classifier Performance Evaluation with Confusion Matrix](https://csitcp.org/paper/10/108csit01.pdf)
+[Comments on "MLCM: Multi-Label Confusion Matrix"](https://www.academia.edu/121504684/Comments_on_MLCM_Multi_Label_Confusion_Matrix)
+
+The papers describe how to build the confusion matrix for the multi-label classification problem under the assumptions:
+- The rows represent the ground truth and the columns the predictions.
+- Both ground-truth and predictions use the same classes.
+- The ground truth may assign more than one classes to the same object.
+
+A _contribution matrix_ is computed for each pair of ground-truth / prediction samples and the sum of them is the _confusion matrix_ of the entire dataset.
+
+Each contribution matrix is computed according to an algorithm that distinguishes 4 cases:
+
+Case 1: Prediction and GT are a perfect match.
+Case 2: Prediction is a superset of the GT classes (over-prediction).
+Case 3: Prediction is a subset of the GT classes (under-prediction).
+Case 4: Prediction and GT have some partial overlap and some diff (diff-prediction).
+
+For each of those cases the contributions to the confusion matrix can be seen as "gains" that go to the diagonal cells and "penalties" that go to the off-diagonal cells.
+In case 1 the contributions are only gains and their value equals to the count of detections.
+For the other cases the gains have been penalized by the mis-predictions and both gains and penalties have fractional values.
+For example in case of "over-prediction", if the classifier has predicted 3 classes (a, b, c) and the ground truth is (a, b),
+the contribution is a gain of 2/3 for the diagonal cells (a, a), (b, b) because 2 out of 3 predictions are correct
+and a penalty of 1/3 for the off-diagonal cells (a, c) and (b, c) because the prediction c is wrong.
 
+The contribution matrix for each dataset sample has the following properties:
+- All rows without ground truth and all columns without predictions are zero.
+- The sum of each non-zero row is 1.
+- The sum of all cells equals the number of GT classes for that sample.
 
-## Multi-label classification confusion matrix
+Dividing the dataset-wide confusion matrix by each row-sum gives us the _recall matrix_
+and dividing by each column-sum provides the _precision matrix_.
+The diagonal of the recall/precision matrices are the recall/precision vectors for the classification classes.
+
+The _F1 matrix_ is the harmonic mean of the precision (P) and recall (R) matrices and is computed as (2 * P * R) / (P + R).
+
+We compute a contribution matrix for each page pixel according to the previous algorithm.
+Summing up the pixel-level contributions gives the confusion matrix for each page
+and the sum of all page-level confusion matrices provides the confusion matrix for the entire dataset.
+
+Additionally we compute 2x2 "abstractions" of the page and dataset level confusion matrices
+<!-- TODO -->
+
+
+## Binary representation of the Layout Resolution
 
 
 ## Computation Optimizations
 
+<!-- TODO
+- Compression
+- Vectorization
+-->
+

From 7bf0bb031a8583e284e0345a0009d609abab8c42 Mon Sep 17 00:00:00 2001
From: Nikos Livathinos <nli@zurich.ibm.com>
Date: Wed, 10 Dec 2025 16:49:34 +0100
Subject: [PATCH 3/4] docs: First version of the
 multi_label_pixel_layout_evaluations.md

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---
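The collapse into a 2x2 matrix that Table 2 of this patch describes can be sketched as below. The function name `collapse_confusion` and the example numbers are hypothetical; the only assumption carried over from the document is that the Background sits at index 0.

```python
import numpy as np


def collapse_confusion(cm):
    """Collapse an [n, n] confusion matrix to Background vs. non-Background."""
    cm = np.asarray(cm)
    return np.array([
        [cm[0, 0], cm[0, 1:].sum()],          # Background row
        [cm[1:, 0].sum(), cm[1:, 1:].sum()],  # all non-Background rows merged
    ])


# Hypothetical 3-class confusion matrix (Background plus two classes).
cm = np.array([[50, 2, 3],
               [1, 20, 4],
               [0, 5, 15]])
collapsed = collapse_confusion(cm)
```

The collapse preserves the total mass of the matrix, so the collapsed recall/precision/F1 can be derived from it in the same way as from the full matrix.
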
 docs/multi_label_pixel_layout_evaluations.md | 41 +++++++++++++++-----
 1 file changed, 32 insertions(+), 9 deletions(-)

diff --git a/docs/multi_label_pixel_layout_evaluations.md b/docs/multi_label_pixel_layout_evaluations.md
index 0cf2236b..f49c22fa 100644
--- a/docs/multi_label_pixel_layout_evaluations.md
+++ b/docs/multi_label_pixel_layout_evaluations.md
@@ -39,8 +39,9 @@ More specifically we distinguish two cases:
 
 The following table provides some insight on the properties of the confusion matrix and the derived metrics on each case:
 
+
 |                                           | Same classes in LR1/LR2| Different classes in LR1/LR2           |
-|---------------------------------------- --|------------------------|----------------------------------------|
+|-------------------------------------------|------------------------|----------------------------------------|
 |Rows represent                             | LR1 (e.g. GT)          | LR1 (e.g. GT, predictions A)           |
 |Columns represent                          | LR2 (e.g. predictions) | LR2 (e.g. predictions B)               |
 |Rows/Columns indices                       | background - classes   | background - classes LR1 - classes LR2 |
@@ -53,6 +54,7 @@ The following table provides some insight on the properties of the confusion mat
 |Recall/Precision/F1 collapsed class vectors| yes                    | yes                                    |
 |                                           |                        |                                        |
 
+
 Table 1: Confusion matrix and derivatives configuration across label-set consistency cases
 
 
@@ -98,17 +100,38 @@ We compute a contribution matrix for each page pixel according to the previous a
 Summing up the pixel-level contributions gives the confusion matrix for each page
 and the sum of all page-level confusion matrices provides the confusion matrix for the entire dataset.
 
-Additionally we compute 2x2 "abstractions" of the page and dataset level confusion matrices
-<!-- TODO -->
+Additionally we compute 2x2 "abstractions" of the confusion matrices that contain only the
+"Background" and the non-Background classes collapsed as one:
+
+
+|                | Background | non-Background |
+|----------------|------------|----------------|
+| Background     | cell(0,0)  | sum(0, 1:)     |
+| non-Background | sum(1:, 0) | sum(1:, 1:)    |
+
+
+Table 2: Collapsed matrix computed for Background and non-Background classes
+
+The collapsed confusion matrix and its derivatives, collapsed recall/precision/F1,
+allow the evaluation across layout resolutions with incompatible classes.
+
 
+## Implementation
 
-## Binary representation of the Layout Resolution
+We use a bit‑packed encoding to represent multi‑label layout resolutions for up to 63 classes plus the Background class.
+Each pixel is stored as a single 64‑bit unsigned integer; the i‑th class is encoded by setting bit i.
+The background occupies bit 0.
 
+This compact representation enables a vectorized implementation using numpy bitwise and linear algebra operations.
+Because these operations are vectorized, many pixel-level contribution matrices can be computed at once.
 
-## Computation Optimizations
+Each pair of binary page layout representations is then compressed by counting the distinct pixel-pairs.
+Only the contribution matrices of the unique pixel-pairs need to be computed.
+The page-level confusion matrix is obtained as the sum of the computed contribution matrices,
+each weighted by the number of occurrences of its unique pixel-pair.
+Because the number of unique pixel‑pairs is significantly less than the total number of pixels,
+this approach substantially reduces the computational cost.
 
-<!-- TODO
-- Compression
-- Vectorization
--->
+Finally, since pages are independent, the computation of each page‑level confusion matrix can be
+also parallelized.
 

From cb28df734e5a2fa965fbbabecf0ec8a920ecf0dd Mon Sep 17 00:00:00 2001
From: Nikos Livathinos <nli@zurich.ibm.com>
Date: Wed, 10 Dec 2025 17:32:17 +0100
Subject: [PATCH 4/4] docs: Improve multi_label_pixel_layout_evaluations.md.
 More TODOs

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---
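The bit-packed encoding and the pixel-pair compression described in the Implementation section can be illustrated with a small sketch. The item format `(x0, y0, x1, y1, class_ids)` with exclusive end coordinates, the function names `rasterize` and `unique_pixel_pairs`, and the toy pages are assumptions made for this example; they are not the actual implementation.

```python
import numpy as np

BACKGROUND_BIT = np.uint64(1)  # bit 0 marks the Background


def rasterize(height, width, items):
    """Bit-pack items given as (x0, y0, x1, y1, class_ids) into a uint64 page."""
    page = np.zeros((height, width), dtype=np.uint64)
    for x0, y0, x1, y1, class_ids in items:
        mask = np.uint64(0)
        for c in class_ids:  # class ids are assumed to lie in 1..63
            mask |= np.uint64(1) << np.uint64(c)
        page[y0:y1, x0:x1] |= mask  # overlapping items accumulate their bits
    page[page == 0] = BACKGROUND_BIT  # unclassified pixels are Background
    return page


def unique_pixel_pairs(page1, page2):
    """Return the distinct (LR1, LR2) pixel values and how often each occurs."""
    pairs = np.stack([page1.ravel(), page2.ravel()], axis=1)
    return np.unique(pairs, axis=0, return_counts=True)


page1 = rasterize(2, 4, [(0, 0, 2, 2, [1])])     # a 2x2 item of class 1
page2 = rasterize(2, 4, [(0, 0, 3, 1, [1, 2])])  # a 3x1 item of classes 1 and 2
pairs, counts = unique_pixel_pairs(page1, page2)
```

For these toy pages the 8 pixels reduce to 4 unique pixel-pairs, so only 4 contribution matrices would need to be computed and weighted by `counts`.
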
 docs/multi_label_pixel_layout_evaluations.md | 95 ++++++++++++--------
 1 file changed, 58 insertions(+), 37 deletions(-)

diff --git a/docs/multi_label_pixel_layout_evaluations.md b/docs/multi_label_pixel_layout_evaluations.md
index f49c22fa..3fd46c20 100644
--- a/docs/multi_label_pixel_layout_evaluations.md
+++ b/docs/multi_label_pixel_layout_evaluations.md
@@ -2,13 +2,13 @@
 
 ## Objectives
 
-We want to compute metrices for the multi-label document layout analysis task.
-Each document page undergoes a layout resolution, where each detected object is assigned a bounding box and one or many classes.
-The ground truth contains the bounding box and one object class, although in a generalized version the ground truth can also assign multiple classes for the same object.
+We want to evaluate the multi-label document layout analysis task.
+The layout resolution for each document page consists of a bounding box and one or more classes for each detected item.
+The ground truth contains the bounding box and one class, although a generalized version of the ground truth can also assign multiple classes to each item.
 Everything which is not classified is considered to be the *Background*.
 
-We want to evaluate 2 sets of layout resolutions against each other.
-This can be either the ground truth layout resolutions against the prediction layout resolutions, or 2 predictions against each other.
+We want to evaluate two sets of layout resolutions against each other.
+This can be either the ground truth versus a model prediction, or one model prediction versus another.
 We name those layout resolutions as LR1 and LR2.
 
 We also want to solve this evaluation task under the following conditions:
@@ -23,48 +23,57 @@ We also want to solve this evaluation task under the following conditions:
 Additionally we have the following freedoms:
 
 - We do not require the predictions to contain any confidence scores but only bounding boxes and object classes.
-- The two evaluated layout resolutions are free to use any classification labels.
+- The two evaluated layout resolutions are free to use any classification taxonomy, and the two taxonomies may differ.
 
 
 ## Confusion matrix structure
 
 The rows of the matrix correspond to the first layout resolution (ground truth or prediction A) and the columns to the second layout resolution.
 
-Each cell (i, j) is the number of pixels that correspond to class i according to the first layout resolution (e.g. ground truth) and to class j according to the second layout resolution.
+Each cell (i, j) is the number of pixels that have been assigned to class i according to the first layout resolution (e.g. ground truth)
+and to class j according to the second layout resolution.
 
-The exact structure of the confusion matrix and the evaluation metrics that can be derived from it depend on the number of classes in the two layout resolutions.
+The structure of the confusion matrix depends on the classification taxonomies used by the two layout resolutions.
 More specifically we distinguish two cases:
-- Case A: Both layout resolutions use the same classification classes.
-- Case B: When the classes differ across the layout resolutions.
+- Case A: Both layout resolutions use the same classification taxonomy.
+- Case B: The taxonomies differ across the layout resolutions.
 
-The following table provides some insight on the properties of the confusion matrix and the derived metrics on each case:
+<!--------------------------------------------------------------------------------------------->
+TODO: Make an illustration to show the differences in the confusion matrix structures
 
+The following table summarizes the properties of the confusion matrix and the derived metrics for each case:
 
-|                                           | Same classes in LR1/LR2| Different classes in LR1/LR2           |
-|-------------------------------------------|------------------------|----------------------------------------|
-|Rows represent                             | LR1 (e.g. GT)          | LR1 (e.g. GT, predictions A)           |
-|Columns represent                          | LR2 (e.g. predictions) | LR2 (e.g. predictions B)               |
-|Rows/Columns indices                       | background - classes   | background - classes LR1 - classes LR2 |
-|Background class row/column                | (0, 0)                 | (0, 0)                                 |
-|Matrix structure when perfect match        | diagonal               | block                                  |
-|Location of mis-predictions/mis-matches    | off-diagonal           |                                        |
-|Recall/Precision/F1 matrices               | yes                    | yes                                    |
-|Background/class-collapsed R/P/F1 matrices | yes                    | yes                                    |
-|Recall/Precision/F1 detailed class vectors | yes                    | no                                     |
-|Recall/Precision/F1 collapsed class vectors| yes                    | yes                                    |
-|                                           |                        |                                        |
 
+|                                           | Same class taxonomy    | Different class taxonomies        |
+|-------------------------------------------|------------------------|-----------------------------------|
+|Confusion matrix rows represent            | LR1 (e.g. GT)          | LR1 (e.g. GT, predictions A)      |
+|Confusion matrix columns represent         | LR2 (e.g. predictions) | LR2 (e.g. predictions B)          |
+|Row/column index of the Background class   | (0, 0)                 | (0, 0)                            |
+|Rows/Columns after the Background class    | common taxonomy        | LR1 taxonomy, then LR2 taxonomy   |
+|Matrix structure when perfect match        | diagonal               | block                             |
+|Location of mis-predictions/mis-matches    | off-diagonal           |                                   |
+|Recall/Precision/F1 matrices               | yes                    | yes                               |
+|Background/class-collapsed R/P/F1 matrices | yes                    | yes                               |
+|Recall/Precision/F1 detailed class vectors | yes                    | no                                |
+|Recall/Precision/F1 collapsed class vectors| yes                    | yes                               |
+|                                           |                        |                                   |
 
-Table 1: Confusion matrix and derivatives configuration across label-set consistency cases
 
+Table 1: Properties of the confusion matrix and its derivatives across different taxonomy schemes
 
-## Computation of the confusion matrix and derivatives
+
+<!--------------------------------------------------------------------------------------------->
+
+
+## Computation of the confusion matrix and its derivatives
 
 The computation of the multi-label classification matrix is based on the papers:
-[Multi-Label Classifier Performance Evaluation with Confusion Matrix](https://csitcp.org/paper/10/108csit01.pdf)
-[Comments on "MLCM: Multi-Label Confusion Matrix"](https://www.academia.edu/121504684/Comments_on_MLCM_Multi_Label_Confusion_Matrix)
+
+- [Multi-Label Classifier Performance Evaluation with Confusion Matrix.](https://csitcp.org/paper/10/108csit01.pdf)
+- [Comments on "MLCM: Multi-Label Confusion Matrix".](https://www.academia.edu/121504684/Comments_on_MLCM_Multi_Label_Confusion_Matrix)
 
 The papers describe how to build the confusion matrix for the multi-label classification problem under the assumptions:
+
 - The rows represent the ground truth and the columns the predictions.
 - Both ground-truth and predictions use the same classes.
 - The ground truth may assign more than one classes to the same object.
@@ -73,13 +82,13 @@ A _contribution matrix_ is computed for each pair of ground-truth / prediction s
 
 Each contribution matrix is computed according to an algorithm that distinguishes 4 cases:
 
-Case 1: Prediction and GT are a perfect match.
-Case 2: Prediction is a superset of the GT classes (over-prediction).
-Case 3: Prediction is a subset of the GT classes (under-prediction).
-Case 4: Prediction and GT have some partial overlap and some diff (diff-prediction).
+- Case 1: Prediction and GT are a perfect match.
+- Case 2: Prediction is a superset of the GT classes (over-prediction).
+- Case 3: Prediction is a subset of the GT classes (under-prediction).
+- Case 4: Prediction and GT classes partially overlap, and each side also has classes missing from the other (diff-prediction).
 
 For each of those cases the contributions to the confusion matrix can be seen as "gains" that go to the diagonal cells and "penalties" that go to the off-diagonal cells.
-In case 1 the contributions are only gains and their value equals to the count of detections.
+In case 1 the contributions are only gains and their value equals the number of page items.
 For the other cases the gains have been penalized by the mis-predictions and both gains and penalties have fractional values.
 For example in case of "over-prediction", if the classifier has predicted 3 classes (a, b, c) and the ground truth is (a, b),
 the contribution is a gain of 2/3 for the diagonal cells (a, a), (b, b) because 2 out of 3 predictions are correct
@@ -96,13 +105,18 @@ The diagonal of the recall/precision matrices are the recall/precision vectors f
 
 The _F1 matrix_ is the harmonic mean of the precision (P) and recall (R) matrices and is computed as (2 * P * R) / (P + R).
 
-We compute a contribution matrix for each page pixel according to the previous algorithm.
-Summing up the pixel-level contributions gives the confusion matrix for each page
+
+## Pixel-level multi-label confusion matrix
+
+We treat each page pixel as a dataset sample and compute its contribution matrix according to the previous algorithm.
+Summing up the pixel-level contributions provides the confusion matrix for each page
 and the sum of all page-level confusion matrices provides the confusion matrix for the entire dataset.
 
 Additionally we compute 2x2 "abstractions" of the confusion matrices that contain only the
 "Background" and the non-Background classes collapsed as one:
 
+TODO: Make an illustration to show how the confusion matrix is collapsed
+
 
 |                | Background | non-Background |
 |----------------|------------|----------------|
@@ -112,12 +126,14 @@ Additionally we compute 2x2 "abstractions" of the confusion matrices that contai
 
 Table 2: Collapsed matrix computed for Background and non-Background classes
 
-The collapsed confusion matrix and its derivatives, collapsed recall/precision/F1,
-allow the evaluation across layout resolutions with incompatible classes.
+The collapsed confusion matrix and its derivatives (collapsed recall/precision/F1)
+allow the evaluation across layout resolutions with different class taxonomies.
 
 
 ## Implementation
 
+TODO: Make an illustration of how the bit-packed encoding works.
+
 We use a bit‑packed encoding to represent multi‑label layout resolutions for up to 63 classes plus the Background class.
 Each pixel is stored as a single 64‑bit unsigned integer; the i‑th class is encoded by setting bit i.
 The background occupies bit 0.
@@ -135,3 +151,8 @@ Because the number of unique pixel‑pairs is significantly less than the total
 Finally, since pages are independent, the computation of each page‑level confusion matrix can be
 also parallelized.
 
+
+## Discussion
+
+TODO 
+