
More modular alternative to ChoiceLayer #714

@patrick-wilken

#649 is currently pending because we don't want to extend ChoiceLayer with even more special cases.

Quote from #649 (comment)

> In general, we always prefer if we have RETURNN flexible enough such that the user can implement things in the config. We want to keep RETURNN simple, and not extend more and more. (See e.g. SelfAttentionLayer for another such bad example, and #391 how this was resolved.)

So I thought a bit about how this could be done for ChoiceLayer. It implements beam pruning and sets the SearchChoices, which are used for beam score accumulation and backtracking. "Extending" it would therefore mean implementing an alternative way to select the beam entries and/or an alternative way to calculate the beam scores. Prefix decoding from #649 is one example; other examples are things already implemented as special cases in ChoiceLayer: cheating, sampling (for inference), scheduled sampling, etc.

An important difference from #391 is that here we manipulate the beam, and we want to hide that from the user/network definition as much as possible. For example, (I assume) we don't want a layer that explicitly calculates the accumulated beam scores. However, to implement the features mentioned above we have to operate on the beam dimension to some degree, which normally is not touched by the layers.

What I came up with so far to re-implement the standard functionality of ChoiceLayer is:

  1. a BeamPruneIndicesLayer (naming is hard... 😅), which gets the scores for the current step via its source layer, accesses the accumulated beam scores via get_search_choices().beam_scores and calculates the top-k combined scores, but, in contrast to ChoiceLayer, does not set SearchChoices itself. Instead, it has an output of shape (batch, beam_size, 2) containing (src_beam, label) tuples, i.e. it only returns the indices needed to gather the new beam.
  2. a ConstructBeamLayer (or maybe SetSearchChoicesLayer?), which is the layer that owns the SearchChoices. It gets the output of BeamPruneIndicesLayer and also the scores as input layers and sets the beam scores and src_beams of the SearchChoices according to its inputs. Its output would be the new beam of labels. (A config sketch using both layers follows below this list.)
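
For concreteness, the standard search setup might then look roughly like this inside the decoder config. This is only a sketch: the layer class names "prune_beam_indices" and "construct_beam" and their options are placeholders, nothing here exists in RETURNN yet.

```python
# Hypothetical config fragment inside the recurrent decoder unit; this would
# replace the usual
#   "output": {"class": "choice", "from": "output_prob", "target": "classes",
#              "beam_size": 12, "initial_output": 0}
"output_prob": {"class": "softmax", "from": "readout", "target": "classes"},

# 1. Combine the step scores with the accumulated beam scores
#    (get_search_choices().beam_scores) and take the top-k.
#    Output shape (batch, beam_size, 2) with (src_beam, label) tuples.
"beam_indices": {"class": "prune_beam_indices", "from": "output_prob",
                 "beam_size": 12},

# 2. Owns the SearchChoices: sets beam_scores and src_beams from its inputs
#    and outputs the new beam of labels.
"output": {"class": "construct_beam", "from": ["beam_indices", "output_prob"],
           "target": "classes", "initial_output": 0},
```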

Custom functionality can then be implemented by manipulating the scores and beam indices before feeding them into the ConstructBeamLayer.
For prefix decoding, for example, the beam indices from BeamPruneIndicesLayer would first go through a SwitchLayer that has the prefix labels as a second input (extended with src_beam=0), with the condition being whether the prefix has ended.
For cheating, one could replace the last entry in the output of BeamPruneIndicesLayer with (beam_size - 1, golden_label), etc.
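
As a rough sketch for the prefix-decoding case (again, "prune_beam_indices" and "construct_beam" are the hypothetical layers from above, and "prefix_indices" / "within_prefix" stand for layers that would have to be built from the given prefix, details omitted):

```python
# Hypothetical sketch: while the forced prefix has not ended, the pruned
# (src_beam, label) indices are overridden with the prefix label and src_beam=0.
"beam_indices": {"class": "prune_beam_indices", "from": "output_prob",
                 "beam_size": 12},
# "prefix_indices": (src_beam=0, prefix_label) tuples for the current step
# "within_prefix": bool, whether the current step is still inside the prefix
"final_indices": {"class": "switch", "condition": "within_prefix",
                  "true_from": "prefix_indices", "false_from": "beam_indices"},
"output": {"class": "construct_beam", "from": ["final_indices", "output_prob"],
           "target": "classes", "initial_output": 0},
```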

Note that the output of BeamPruneIndicesLayer has no beam; instead, its second dimension contains a kind of preliminary beam that is treated as a feature dimension. This might be pretty unintuitive. An alternative that keeps the beam as part of the batch dimension would be to create zeros of shape (batch * beam, dim) (same as the input scores) and then mark the positions of the top-k scores (inside the hidden beam dim) with integers from 1 to beam_size. But this is much less efficient and probably not really more intuitive.
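
To make the difference between the two representations concrete, here is a small NumPy toy example (shapes and numbers are arbitrary, just for illustration):

```python
import numpy as np

# batch=1, beam=2 current hypotheses, dim=4 labels, beam_size=2 entries to keep
scores = np.array([[0.1, 0.7, 0.1, 0.1],    # hypothesis 0
                   [0.2, 0.1, 0.6, 0.1]])   # hypothesis 1; shape (batch * beam, dim)

flat = scores.reshape(1, -1)                        # merge the hidden beam dim
top_k = np.argsort(flat, axis=-1)[:, ::-1][:, :2]   # positions of the 2 best scores
src_beam, label = np.divmod(top_k, scores.shape[-1])

# Proposed output: shape (batch, beam_size, 2) with (src_beam, label) tuples
compact = np.stack([src_beam, label], axis=-1)      # [[[0, 1], [1, 2]]]

# Alternative: zeros of shape (batch * beam, dim), top-k positions marked 1..beam_size
dense = np.zeros((1, scores.size), dtype=np.int32)
dense[np.arange(1)[:, None], top_k] = np.arange(1, 3)
dense = dense.reshape(scores.shape)                 # same size as the scores, mostly zeros
```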


Would something like that be worth implementing?
