
More modular alternative to ChoiceLayer #714

@patrick-wilken

#649 is currently pending because we don't want to extend ChoiceLayer with even more special cases.

Quote from #649 (comment)

> In general, we always prefer if we have RETURNN flexible enough such that the user can implement things in the config. We want to keep RETURNN simple, and not extend more and more. (See e.g. SelfAttentionLayer for another such bad example, and #391 how this was resolved.)

So I thought a bit about how this could be done for ChoiceLayer. It implements beam pruning and sets the SearchChoices, which are used for beam score accumulation and backtracking. "Extending" it would therefore mean implementing an alternative way to select the beam entries and/or an alternative way to calculate the beam scores. Prefix decoding from #649 is one example; other examples are things already implemented as special cases in ChoiceLayer: cheating, sampling (for inference), scheduled sampling, etc.

An important difference from #391 is that here we manipulate the beam, and we want to hide that from the user/network definition as much as possible. For example, (I assume) we don't want a layer that explicitly calculates the accumulated beam scores. However, to implement the features mentioned above we have to operate on the beam dimension to some degree, which normally is not touched by the layers.

What I came up with so far to re-implement the standard functionality of ChoiceLayer is:

  1. a BeamPruneIndicesLayer (naming is hard... 😅), which gets the scores for the current step via its source layer, accesses the accumulated beam scores via get_search_choices().beam_scores and calculates the top-k combined scores, but, in contrast to ChoiceLayer, does not set SearchChoices itself. Instead, it has an output of shape (batch, beam_size, 2) containing (src_beam, label) tuples, i.e. it only returns the indices needed to gather the new beam.
  2. a ConstructBeamLayer (or maybe SetSearchChoicesLayer?), which is the layer that owns the SearchChoices. It gets the output of BeamPruneIndicesLayer and also the scores as input layers and sets the beam scores and src_beams of the SearchChoices according to its inputs. Its output would be the new beam of labels. (A config sketch using both layers follows below this list.)
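
For concreteness, the standard search setup might then look roughly like this inside the decoder config. This is only a sketch: the layer class names "prune_beam_indices" and "construct_beam" and their options are placeholders, nothing here exists in RETURNN yet.

```python
# Hypothetical config fragment inside the recurrent decoder unit; this would
# replace the usual
#   "output": {"class": "choice", "from": "output_prob", "target": "classes",
#              "beam_size": 12, "initial_output": 0}
"output_prob": {"class": "softmax", "from": "readout", "target": "classes"},

# 1. Combine the step scores with the accumulated beam scores
#    (get_search_choices().beam_scores) and take the top-k.
#    Output shape (batch, beam_size, 2) with (src_beam, label) tuples.
"beam_indices": {"class": "prune_beam_indices", "from": "output_prob",
                 "beam_size": 12},

# 2. Owns the SearchChoices: sets beam_scores and src_beams from its inputs
#    and outputs the new beam of labels.
"output": {"class": "construct_beam", "from": ["beam_indices", "output_prob"],
           "target": "classes", "initial_output": 0},
```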

Custom functionality can then be implemented by manipulating the scores and beam indices before feeding them into the ConstructBeamLayer.
For prefix decoding, for example, the beam indices from BeamPruneIndicesLayer would first go through a SwitchLayer that has the prefix labels as a second input (extended with src_beam=0), with the condition being whether the prefix has ended.
For cheating, one could replace the last entry in the output of BeamPruneIndicesLayer with (beam_size - 1, golden_label), etc.
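
As a rough sketch for the prefix-decoding case (again, "prune_beam_indices" and "construct_beam" are the hypothetical layers from above, and "prefix_indices" / "within_prefix" stand for layers that would have to be built from the given prefix, details omitted):

```python
# Hypothetical sketch: while the forced prefix has not ended, the pruned
# (src_beam, label) indices are overridden with the prefix label and src_beam=0.
"beam_indices": {"class": "prune_beam_indices", "from": "output_prob",
                 "beam_size": 12},
# "prefix_indices": (src_beam=0, prefix_label) tuples for the current step
# "within_prefix": bool, whether the current step is still inside the prefix
"final_indices": {"class": "switch", "condition": "within_prefix",
                  "true_from": "prefix_indices", "false_from": "beam_indices"},
"output": {"class": "construct_beam", "from": ["final_indices", "output_prob"],
           "target": "classes", "initial_output": 0},
```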

Note that the output of BeamPruneIndicesLayer has no beam; instead, its second dimension contains a kind of preliminary beam that is treated as a feature dimension. This might be pretty unintuitive. An alternative that keeps the beam as part of the batch dimension would be to create zeros of shape (batch * beam, dim) (same as the input scores) and then mark the positions of the top-k scores (inside the hidden beam dim) with integers from 1 to beam_size. But this is much less efficient and probably not really more intuitive.
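
To make the difference between the two representations concrete, here is a small NumPy toy example (shapes and numbers are arbitrary, just for illustration):

```python
import numpy as np

# batch=1, beam=2 current hypotheses, dim=4 labels, beam_size=2 entries to keep
scores = np.array([[0.1, 0.7, 0.1, 0.1],    # hypothesis 0
                   [0.2, 0.1, 0.6, 0.1]])   # hypothesis 1; shape (batch * beam, dim)

flat = scores.reshape(1, -1)                        # merge the hidden beam dim
top_k = np.argsort(flat, axis=-1)[:, ::-1][:, :2]   # positions of the 2 best scores
src_beam, label = np.divmod(top_k, scores.shape[-1])

# Proposed output: shape (batch, beam_size, 2) with (src_beam, label) tuples
compact = np.stack([src_beam, label], axis=-1)      # [[[0, 1], [1, 2]]]

# Alternative: zeros of shape (batch * beam, dim), top-k positions marked 1..beam_size
dense = np.zeros((1, scores.size), dtype=np.int32)
dense[np.arange(1)[:, None], top_k] = np.arange(1, 3)
dense = dense.reshape(scores.shape)                 # same size as the scores, mostly zeros
```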


Would something like that be worth implementing?
