Create documentation for the HF pipelines.  #41

@avidale

Description

We have a recently created huggingface_pipelines directory with some nice code, but no obvious examples of how to use it.

One could create a documentation page that explains the purpose of the pipelines and illustrates, with code, how they can be applied.

An example task would be to use the FLORES dataset (https://huggingface.co/datasets/facebook/flores) to compare the quality of translation from various source languages into a single target language (e.g. English or Spanish).

Motivation for the task

A typical way to evaluate SONAR models for a particular language is to encode a dataset of sentences and then decode it back into the same language (reconstruction) or into another language (translation). The generated texts are then compared with the reference texts using numeric scores such as BLEU (from the sacrebleu package).
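For intuition about what such a score measures, here is a minimal pure-Python sketch of a BLEU-like metric (clipped n-gram precision combined with a brevity penalty) for a single sentence pair. It is only an illustration; the actual documentation should use the sacrebleu package, which handles tokenization, smoothing, and corpus-level aggregation properly.

```python
from collections import Counter
import math

def ngram_precision(hyp, ref, n):
    """Fraction of hypothesis n-grams that also appear in the reference (clipped counts)."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matched = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
    total = sum(hyp_ngrams.values())
    return matched / total if total else 0.0

def toy_bleu(hypothesis, reference, max_n=4):
    """Geometric mean of 1..max_n n-gram precisions, times a brevity penalty.

    A simplified single-sentence illustration only; use sacrebleu for real evaluation.
    """
    hyp, ref = hypothesis.split(), reference.split()
    precisions = [ngram_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this toy version
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return 100 * bp * geo_mean

print(toy_bleu("the quick brown fox jumps", "the quick brown fox jumps"))  # perfect match: 100.0
```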

We want to use this task as an opportunity to learn more about the pipelines, which are a kind of glue connecting the models to the data (e.g. by batching the data before feeding it to the models).
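The batching mentioned above is the simplest example of that glue. A minimal sketch of such a helper (the encoder.predict call is hypothetical, standing in for whatever model the pipeline wraps):

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch

# Feeding sentences to an encoder 32 at a time might look like:
# for batch in batched(sentences, 32):
#     embeddings = encoder.predict(batch)  # hypothetical model call
```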

How to approach

All or most of the code elements probably already exist somewhere in the repo; the goal is to put them together with the new Hugging Face pipeline, using segmentation, encoding, decoding, and BLEU computation.
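The shape of that composition can be sketched generically: each stage maps a list of items to a list of items, and the pipeline is just their chain. The stage names below (segment, lowercase) are hypothetical stand-ins, not the actual classes under huggingface_pipelines/.

```python
from functools import reduce

def compose(*stages):
    """Chain pipeline stages left to right; each stage maps a list to a list."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)

# Hypothetical stages mirroring the steps in this issue (segmentation, then a
# trivial transformation standing in for encode/decode):
segment = lambda texts: [s for t in texts for s in t.split(". ")]
lowercase = lambda texts: [t.lower() for t in texts]

pipeline = compose(segment, lowercase)
print(pipeline(["Hello World. GOOD Morning"]))  # ['hello world', 'good morning']
```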

A good entry point might be the tests (e.g. https://github.com/facebookresearch/SONAR/blob/main/tests/unit_tests/huggingface_pipelines/text.py), which illustrate some of the potential use cases of the HF pipeline.

Metadata

Assignees: No one assigned

Labels: documentation (Improvements or additions to documentation), good first issue (Good for newcomers)