Multi-sentence ROUGE-L scores #7

@pltrdy

Hi,

I've been working with ROUGE for a while and I'm still not sure how to implement ROUGE-L correctly.
Both your implementation and the one I'm using (in Python) implement the summary-level ROUGE-LCS score as described in the paper. The problem is that the resulting scores aren't close to the official ones (i.e. those produced by the perl script).
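For reference, here is roughly what I understand the paper's summary-level formulation to be: take the union of the LCS matches between each reference sentence and every candidate sentence, then normalize by the total token counts. This is only a minimal sketch (token lists as input, no stemming, no stopword removal, and none of ROUGE-1.5.5's hit clipping), so it is not guaranteed to reproduce the official numbers:

def lcs_table(a, b):
    # standard dynamic-programming table for the LCS of two token lists
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp

def lcs_positions(ref_sent, cand_sent):
    # backtrack the table and return the set of token positions in ref_sent
    # that take part in one LCS with cand_sent
    dp = lcs_table(ref_sent, cand_sent)
    i, j = len(ref_sent), len(cand_sent)
    hits = set()
    while i > 0 and j > 0:
        if ref_sent[i - 1] == cand_sent[j - 1]:
            hits.add(i - 1)
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return hits

def rouge_l_summary(ref_sents, cand_sents, beta=1.0):
    # summary-level ROUGE-L with union LCS, following Lin (2004);
    # ref_sents / cand_sents are lists of token lists (one list per sentence)
    m = sum(len(s) for s in ref_sents)   # tokens in the reference
    n = sum(len(s) for s in cand_sents)  # tokens in the candidate
    union_hits = 0
    for ref_sent in ref_sents:
        hits = set()
        for cand_sent in cand_sents:
            hits |= lcs_positions(ref_sent, cand_sent)
        union_hits += len(hits)
    r = union_hits / m if m else 0.0
    p = union_hits / n if n else 0.0
    if r == 0.0 or p == 0.0:
        return {"p": p, "r": r, "f": 0.0}
    f = (1 + beta ** 2) * p * r / (r + beta ** 2 * p)
    return {"p": p, "r": r, "f": f}

The beta = 1 F-score here is my own assumption; the official script's precision/recall weighting depends on how it is invoked.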

Example

Ref:

brendan @entity8 is under pressure following @entity11 semi-final defeat . but the @entity10 boss says he will bounce back despite the criticism . @entity10 owners @entity9 maintain @entity8 wo n't be sacked . @entity13 hopes @entity18 commits his future to the @entity23 .

Summary:

brendan @entity8 insists he is the man to guide @entity10 to success . brendan @entity8 has not been rattled by the intensity of the criticism . @entity10 manager is under pressure following the semi-final defeat by @entity12 last sunday .

Experiment

  • Official scores: I'm actually using Python wrappers (files2rouge, which uses pyrouge) rather than calling the perl script directly. I validated those wrappers by scoring a few prediction/reference pairs and getting exactly the same numbers as the perl script (a rough sketch of the pyrouge call is given after the score listings below).
---------------------------------------------
1 ROUGE-1 Average_R: 0.43902 (95%-conf.int. 0.43902 - 0.43902)
1 ROUGE-1 Average_P: 0.47368 (95%-conf.int. 0.47368 - 0.47368)
1 ROUGE-1 Average_F: 0.45569 (95%-conf.int. 0.45569 - 0.45569)
---------------------------------------------
1 ROUGE-2 Average_R: 0.20000 (95%-conf.int. 0.20000 - 0.20000)
1 ROUGE-2 Average_P: 0.21622 (95%-conf.int. 0.21622 - 0.21622)
1 ROUGE-2 Average_F: 0.20779 (95%-conf.int. 0.20779 - 0.20779)
---------------------------------------------
1 ROUGE-L Average_R: 0.41463 (95%-conf.int. 0.41463 - 0.41463)
1 ROUGE-L Average_P: 0.44737 (95%-conf.int. 0.44737 - 0.44737)
1 ROUGE-L Average_F: 0.43038 (95%-conf.int. 0.43038 - 0.43038)
  • Other implementations:
{
  "rouge-1": {
    "f": 0.43076922582721894,
    "p": 0.4827586206896552,
    "r": 0.3888888888888889
  },
  "rouge-2": {
    "f": 0.19999999501250013,
    "p": 0.21052631578947367,
    "r": 0.19047619047619047
  },
  "rouge-l": {
    "f": 0.048830315339539125,
    "p": 0.0507936507936508,
    "r": 0.047249907715024
  }
}
> rougeL(hyp, ref);
0.07972913936216687
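
For completeness, the pyrouge call behind the official scores above looks roughly like this; the paths and filename patterns are placeholders, not my actual setup:

from pyrouge import Rouge155

r = Rouge155()
r.system_dir = "path/to/predictions"          # one candidate summary per file
r.model_dir = "path/to/references"            # one reference summary per file
r.system_filename_pattern = r"pred.(\d+).txt"
r.model_filename_pattern = "ref.#ID#.txt"

output = r.convert_and_evaluate()             # runs ROUGE-1.5.5.pl under the hood
print(output)
scores = r.output_to_dict(output)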

The differences in the ROUGE-1 and ROUGE-2 scores don't really bother me, but the ROUGE-L gap is an order of magnitude, which makes me think we're not computing the LCS the way the official script does (see the comparison sketch below).
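
To make that suspicion concrete, here is the comparison I would run with the rouge_l_summary sketch from above, plus a "flat" variant that treats each text as a single token sequence. The sentence splitting and tokenization below are naive assumptions (split on " . ", then on whitespace), and ref_text / cand_text are just the example pair quoted earlier:

def rouge_l_flat(ref_tokens, cand_tokens):
    # single LCS over the whole texts, no per-sentence union
    hits = len(lcs_positions(ref_tokens, cand_tokens))
    r = hits / len(ref_tokens)
    p = hits / len(cand_tokens)
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return {"p": p, "r": r, "f": f}

def split_sents(text):
    # naive: the example texts separate sentences with " . "
    return [s.split() for s in text.split(" . ") if s.strip()]

ref_sents = split_sents(ref_text)    # ref_text / cand_text hold the Ref / Summary above
cand_sents = split_sents(cand_text)

print(rouge_l_summary(ref_sents, cand_sents))                 # union LCS, as in the paper
print(rouge_l_flat(sum(ref_sents, []), sum(cand_sents, [])))  # one LCS over the flattened texts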
