espnet2.asr package

espnet2.asr.__init__

espnet2.asr.ctc

class espnet2.asr.ctc.CTC(odim: int, encoder_output_size: int, dropout_rate: float = 0.0, ctc_type: str = 'builtin', reduce: bool = True, ignore_nan_grad: bool = True)[source]

Bases: torch.nn.modules.module.Module

CTC module.

Parameters
  • odim – dimension of outputs

  • encoder_output_size – number of encoder projection units

  • dropout_rate – dropout rate (0.0 ~ 1.0)

  • ctc_type – builtin or warpctc

  • reduce – reduce the CTC loss into a scalar

argmax(hs_pad)[source]

argmax of frame activations

Parameters

hs_pad (torch.Tensor) – 3d tensor (B, Tmax, eprojs)

Returns

argmax applied 2d tensor (B, Tmax)

Return type

torch.Tensor

forward(hs_pad, hlens, ys_pad, ys_lens)[source]

Calculate CTC loss.

Parameters
  • hs_pad – batch of padded hidden state sequences (B, Tmax, D)

  • hlens – batch of lengths of hidden state sequences (B)

  • ys_pad – batch of padded character id sequence tensor (B, Lmax)

  • ys_lens – batch of lengths of character sequence (B)

log_softmax(hs_pad)[source]

log_softmax of frame activations

Parameters

hs_pad (Tensor) – 3d tensor (B, Tmax, eprojs)

Returns

log softmax applied 3d tensor (B, Tmax, odim)

Return type

torch.Tensor

loss_fn(th_pred, th_target, th_ilen, th_olen) → torch.Tensor[source]
softmax(hs_pad)[source]

softmax of frame activations

Parameters

hs_pad (Tensor) – 3d tensor (B, Tmax, eprojs)

Returns

softmax applied 3d tensor (B, Tmax, odim)

Return type

torch.Tensor
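
A minimal usage sketch for this module: construct a CTC head on top of random encoder outputs and compute the loss and frame-level posteriors. All sizes and token ids below are dummy values chosen for illustration.

    import torch
    from espnet2.asr.ctc import CTC

    odim, eprojs = 50, 256                           # vocabulary size, encoder projection units
    ctc = CTC(odim=odim, encoder_output_size=eprojs)

    hs_pad = torch.randn(4, 100, eprojs)             # (B, Tmax, eprojs)
    hlens = torch.full((4,), 100, dtype=torch.long)
    ys_pad = torch.randint(1, odim, (4, 20))         # (B, Lmax) target token ids
    ys_lens = torch.full((4,), 20, dtype=torch.long)

    loss = ctc(hs_pad, hlens, ys_pad, ys_lens)       # scalar CTC loss (reduce=True)
    log_probs = ctc.log_softmax(hs_pad)              # (B, Tmax, odim)
    best_paths = ctc.argmax(hs_pad)                  # (B, Tmax) frame-wise argmax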

espnet2.asr.espnet_model

class espnet2.asr.espnet_model.ESPnetASRModel(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, ctc: espnet2.asr.ctc.CTC, joint_network: Optional[torch.nn.modules.module.Module], ctc_weight: float = 0.5, interctc_weight: float = 0.0, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', extract_feats_in_collect_stats: bool = True)[source]

Bases: espnet2.train.abs_espnet_model.AbsESPnetModel

CTC-attention hybrid Encoder-Decoder model

batchify_nll(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor, batch_size: int = 100)[source]

Compute negative log likelihood (nll) from the transformer decoder.

To avoid OOM, this function splits the input into batches, calls nll for each batch, and combines the results.

Parameters
  • encoder_out – (Batch, Length, Dim)

  • encoder_out_lens – (Batch,)

  • ys_pad – (Batch, Length)

  • ys_pad_lens – (Batch,)

  • batch_size – int, the number of samples each batch contains when computing nll; you may change this to avoid OOM or to increase GPU memory usage

collect_feats(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor) → Dict[str, torch.Tensor][source]
encode(speech: torch.Tensor, speech_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Frontend + Encoder. Note that this method is used by asr_inference.py

Parameters
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch, )

forward(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Frontend + Encoder + Decoder + Calc loss

Parameters
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch, )

  • text – (Batch, Length)

  • text_lengths – (Batch,)

nll(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor) → torch.Tensor[source]

Compute negative log likelihood (nll) from the transformer decoder.

Normally, this function is called in batchify_nll.

Parameters
  • encoder_out – (Batch, Length, Dim)

  • encoder_out_lens – (Batch,)

  • ys_pad – (Batch, Length)

  • ys_pad_lens – (Batch,)
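
The shapes expected by encode() and forward() can be summarized with the sketch below, assuming `model` is an already-built ESPnetASRModel (e.g. restored by a training or inference script); the waveforms and token ids are dummy values.

    import torch

    # model: an ESPnetASRModel instance built elsewhere (assumption)
    speech = torch.randn(2, 16000)                       # (Batch, Length, ...)
    speech_lengths = torch.tensor([16000, 12000])        # (Batch,)
    text = torch.randint(1, model.vocab_size, (2, 10))   # (Batch, Length)
    text_lengths = torch.tensor([10, 7])                 # (Batch,)

    # Frontend + Encoder only (the path used by asr_inference.py)
    encoder_out, encoder_out_lens = model.encode(speech, speech_lengths)

    # Full forward pass: returns (loss, stats dict, batch-size weight)
    loss, stats, weight = model(speech, speech_lengths, text, text_lengths)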

espnet2.asr.maskctc_model

class espnet2.asr.maskctc_model.MaskCTCInference(asr_model: espnet2.asr.maskctc_model.MaskCTCModel, n_iterations: int, threshold_probability: float)[source]

Bases: torch.nn.modules.module.Module

Mask-CTC-based non-autoregressive inference

Initialize Mask-CTC inference

forward(enc_out: torch.Tensor) → List[espnet.nets.beam_search.Hypothesis][source]

Perform Mask-CTC inference

ids2text(ids: List[int])[source]
class espnet2.asr.maskctc_model.MaskCTCModel(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: espnet2.asr.decoder.mlm_decoder.MLMDecoder, ctc: espnet2.asr.ctc.CTC, joint_network: Optional[torch.nn.modules.module.Module] = None, ctc_weight: float = 0.5, interctc_weight: float = 0.0, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', sym_mask: str = '<mask>', extract_feats_in_collect_stats: bool = True)[source]

Bases: espnet2.asr.espnet_model.ESPnetASRModel

Hybrid CTC/Masked LM Encoder-Decoder model (Mask-CTC)

batchify_nll(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor, batch_size: int = 100)[source]

Compute negative log likelihood (nll) from the transformer decoder.

To avoid OOM, this function splits the input into batches, calls nll for each batch, and combines the results.

Parameters
  • encoder_out – (Batch, Length, Dim)

  • encoder_out_lens – (Batch,)

  • ys_pad – (Batch, Length)

  • ys_pad_lens – (Batch,)

  • batch_size – int, the number of samples each batch contains when computing nll; you may change this to avoid OOM or to increase GPU memory usage

forward(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Frontend + Encoder + Decoder + Calc loss

Parameters
  • speech – (Batch, Length, …)

  • speech_lengths – (Batch, )

  • text – (Batch, Length)

  • text_lengths – (Batch,)

nll(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor) → torch.Tensor[source]

Compute negative log likelihood (nll) from the transformer decoder.

Normally, this function is called in batchify_nll.

Parameters
  • encoder_out – (Batch, Length, Dim)

  • encoder_out_lens – (Batch,)

  • ys_pad – (Batch, Length)

  • ys_pad_lens – (Batch,)

espnet2.asr.decoder.__init__

espnet2.asr.decoder.abs_decoder

class espnet2.asr.decoder.abs_decoder.AbsDecoder[source]

Bases: torch.nn.modules.module.Module, espnet.nets.scorer_interface.ScorerInterface, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
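
A minimal sketch of a custom decoder that satisfies this interface; the embedding and linear layers are purely illustrative and ignore the encoder output.

    from typing import Tuple

    import torch
    from espnet2.asr.decoder.abs_decoder import AbsDecoder


    class ToyDecoder(AbsDecoder):
        def __init__(self, vocab_size: int, encoder_output_size: int):
            super().__init__()
            self.embed = torch.nn.Embedding(vocab_size, encoder_output_size)
            self.output = torch.nn.Linear(encoder_output_size, vocab_size)

        def forward(
            self,
            hs_pad: torch.Tensor,      # (batch, maxlen_in, feat)
            hlens: torch.Tensor,       # (batch,)
            ys_in_pad: torch.Tensor,   # (batch, maxlen_out)
            ys_in_lens: torch.Tensor,  # (batch,)
        ) -> Tuple[torch.Tensor, torch.Tensor]:
            # Score each target position independently of the encoder output.
            x = self.output(self.embed(ys_in_pad))   # (batch, maxlen_out, vocab)
            return x, ys_in_lens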

espnet2.asr.decoder.mlm_decoder

Masked LM Decoder definition.

class espnet2.asr.decoder.mlm_decoder.MLMDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False)[source]

Bases: espnet2.asr.decoder.abs_decoder.AbsDecoder

forward(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward decoder.

Parameters
  • hs_pad – encoded memory, float32 (batch, maxlen_in, feat)

  • hlens – (batch)

  • ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed”; otherwise, an input tensor (batch, maxlen_out, #mels)

  • ys_in_lens – (batch)

Returns

tuple containing:

x: decoded token score before softmax (batch, maxlen_out, token) if use_output_layer is True

olens: (batch,)

Return type

(tuple)

espnet2.asr.decoder.rnn_decoder

class espnet2.asr.decoder.rnn_decoder.RNNDecoder(vocab_size: int, encoder_output_size: int, rnn_type: str = 'lstm', num_layers: int = 1, hidden_size: int = 320, sampling_probability: float = 0.0, dropout: float = 0.0, context_residual: bool = False, replace_sos: bool = False, num_encs: int = 1, att_conf: dict = {'aconv_chans': 10, 'aconv_filts': 100, 'adim': 320, 'aheads': 4, 'atype': 'location', 'awin': 5, 'han_conv_chans': -1, 'han_conv_filts': 100, 'han_dim': 320, 'han_heads': 4, 'han_mode': False, 'han_type': None, 'han_win': 5, 'num_att': 1, 'num_encs': 1})[source]

Bases: espnet2.asr.decoder.abs_decoder.AbsDecoder

forward(hs_pad, hlens, ys_in_pad, ys_in_lens, strm_idx=0)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

init_state(x)[source]

Get an initial state for decoding (optional).

Parameters

x (torch.Tensor) – The encoded feature tensor

Returns: initial state

rnn_forward(ey, z_list, c_list, z_prev, c_prev)[source]
score(yseq, state, x)[source]

Score new token (required).

Parameters
  • yseq (torch.Tensor) – 1D torch.int64 prefix tokens.

  • state – Scorer state for prefix tokens

  • x (torch.Tensor) – The encoder feature that generates ys.

Returns

Tuple of

scores for the next token with shape (n_vocab,) and the next state for ys

Return type

tuple[torch.Tensor, Any]

zero_state(hs_pad)[source]
espnet2.asr.decoder.rnn_decoder.build_attention_list(eprojs: int, dunits: int, atype: str = 'location', num_att: int = 1, num_encs: int = 1, aheads: int = 4, adim: int = 320, awin: int = 5, aconv_chans: int = 10, aconv_filts: int = 100, han_mode: bool = False, han_type=None, han_heads: int = 4, han_dim: int = 320, han_conv_chans: int = -1, han_conv_filts: int = 100, han_win: int = 5)[source]

espnet2.asr.decoder.transformer_decoder

Decoder definition.

class espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder(vocab_size: int, encoder_output_size: int, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True)[source]

Bases: espnet2.asr.decoder.abs_decoder.AbsDecoder, espnet.nets.scorer_interface.BatchScorerInterface

Base class of the Transformer decoder module.

Parameters
  • vocab_size – output dim

  • encoder_output_size – dimension of attention

  • attention_heads – the number of heads of multi head attention

  • linear_units – the number of units of position-wise feed forward

  • num_blocks – the number of decoder blocks

  • dropout_rate – dropout rate

  • self_attention_dropout_rate – dropout rate for attention

  • input_layer – input layer type

  • use_output_layer – whether to use output layer

  • pos_enc_class – PositionalEncoding or ScaledPositionalEncoding

  • normalize_before – whether to use layer_norm before the first block

  • concat_after – whether to concat the attention layer’s input and output. If True, an additional linear layer will be applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer will be applied, i.e. x -> x + att(x)

batch_score(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]

Score new token batch.

Parameters
  • ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).

  • states (List[Any]) – Scorer states for prefix tokens.

  • xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).

Returns

Tuple of

batchified scores for the next token with shape (n_batch, n_vocab) and the next state list for ys.

Return type

tuple[torch.Tensor, List[Any]]

forward(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward decoder.

Parameters
  • hs_pad – encoded memory, float32 (batch, maxlen_in, feat)

  • hlens – (batch)

  • ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed”; otherwise, an input tensor (batch, maxlen_out, #mels)

  • ys_in_lens – (batch)

Returns

tuple containing:

x: decoded token score before softmax (batch, maxlen_out, token) if use_output_layer is True

olens: (batch,)

Return type

(tuple)

forward_one_step(tgt: torch.Tensor, tgt_mask: torch.Tensor, memory: torch.Tensor, cache: List[torch.Tensor] = None) → Tuple[torch.Tensor, List[torch.Tensor]][source]

Forward one step.

Parameters
  • tgt – input token ids, int64 (batch, maxlen_out)

  • tgt_mask – input token mask, (batch, maxlen_out) dtype=torch.uint8 in PyTorch 1.2- dtype=torch.bool in PyTorch 1.2+ (include 1.2)

  • memory – encoded memory, float32 (batch, maxlen_in, feat)

  • cache – cached output list of (batch, max_time_out-1, size)

Returns

NN output value and cache per self.decoders. y.shape is (batch, maxlen_out, token)

Return type

y, cache
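
A hedged sketch of how forward_one_step is typically used for incremental greedy decoding: the hypothesis grows one token at a time and `cache` is passed back in so only the newest position is recomputed. The decoder sizes, the dummy `memory`, and the sos id are illustrative assumptions.

    import torch
    from espnet.nets.pytorch_backend.transformer.mask import subsequent_mask
    from espnet2.asr.decoder.transformer_decoder import TransformerDecoder

    vocab_size, adim = 50, 256
    decoder = TransformerDecoder(vocab_size=vocab_size, encoder_output_size=adim)
    decoder.eval()
    memory = torch.randn(1, 30, adim)             # dummy encoder output (batch, maxlen_in, feat)

    sos, max_len = 1, 10
    hyp = torch.tensor([[sos]], dtype=torch.long)
    cache = None
    with torch.no_grad():
        for _ in range(max_len):
            tgt_mask = subsequent_mask(hyp.size(1)).unsqueeze(0)   # (1, ylen, ylen)
            y, cache = decoder.forward_one_step(hyp, tgt_mask, memory, cache=cache)
            # The last axis holds vocabulary scores; keep only the newest position.
            scores = y.reshape(y.size(0), -1, y.size(-1))[:, -1]
            hyp = torch.cat([hyp, scores.argmax(dim=-1, keepdim=True)], dim=1)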

score(ys, state, x)[source]

Score.

class espnet2.asr.decoder.transformer_decoder.DynamicConvolution2DTransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]

Bases: espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder

class espnet2.asr.decoder.transformer_decoder.DynamicConvolutionTransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]

Bases: espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder

class espnet2.asr.decoder.transformer_decoder.LightweightConvolution2DTransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]

Bases: espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder

class espnet2.asr.decoder.transformer_decoder.LightweightConvolutionTransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]

Bases: espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder

class espnet2.asr.decoder.transformer_decoder.TransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False)[source]

Bases: espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
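
A minimal sketch of building a small TransformerDecoder and running a teacher-forced forward pass; all sizes, lengths, and token ids are dummy values following the shape conventions documented for BaseTransformerDecoder.forward above.

    import torch
    from espnet2.asr.decoder.transformer_decoder import TransformerDecoder

    vocab_size, adim = 50, 256
    decoder = TransformerDecoder(
        vocab_size=vocab_size,
        encoder_output_size=adim,
        attention_heads=4,
        linear_units=1024,
        num_blocks=2,
    )

    hs_pad = torch.randn(2, 30, adim)                  # encoded memory (batch, maxlen_in, feat)
    hlens = torch.tensor([30, 25])
    ys_in_pad = torch.randint(1, vocab_size, (2, 12))  # int64 token ids (batch, maxlen_out)
    ys_in_lens = torch.tensor([12, 9])

    x, olens = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)
    # x: token scores before softmax, (batch, maxlen_out, vocab_size); olens: (batch,)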

espnet2.asr.encoder.__init__

espnet2.asr.encoder.abs_encoder

class espnet2.asr.encoder.abs_encoder.AbsEncoder[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract output_size() → int[source]

espnet2.asr.encoder.conformer_encoder

Conformer encoder definition.

class espnet2.asr.encoder.conformer_encoder.ConformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'rel_pos', selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, cnn_module_kernel: int = 31, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False, stochastic_depth_rate: Union[float, List[float]] = 0.0)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

Conformer encoder module.

Parameters
  • input_size (int) – Input dimension.

  • output_size (int) – Dimension of attention.

  • attention_heads (int) – The number of heads of multi head attention.

  • linear_units (int) – The number of units of position-wise feed forward.

  • num_blocks (int) – The number of decoder blocks.

  • dropout_rate (float) – Dropout rate.

  • attention_dropout_rate (float) – Dropout rate in attention.

  • positional_dropout_rate (float) – Dropout rate after adding positional encoding.

  • input_layer (Union[str, torch.nn.Module]) – Input layer type.

  • normalize_before (bool) – Whether to use layer_norm before the first block.

  • concat_after (bool) – Whether to concat the attention layer’s input and output. If True, an additional linear layer will be applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer will be applied, i.e. x -> x + att(x)

  • positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.

  • positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.

  • rel_pos_type (str) – Whether to use the latest relative positional encoding or the legacy one. The legacy relative positional encoding will be deprecated in the future. More Details can be found in https://github.com/espnet/espnet/pull/2816.

  • encoder_pos_enc_layer_type (str) – Encoder positional encoding layer type.

  • encoder_attn_layer_type (str) – Encoder attention layer type.

  • activation_type (str) – Encoder activation function type.

  • macaron_style (bool) – Whether to use macaron style for positionwise layer.

  • use_cnn_module (bool) – Whether to use convolution module.

  • zero_triu (bool) – Whether to zero the upper triangular part of attention matrix.

  • cnn_module_kernel (int) – Kernel size of convolution module.

  • padding_idx (int) – Padding idx for input_layer=embed.

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Calculate forward propagation.

Parameters
  • xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).

  • ilens (torch.Tensor) – Input length (#batch).

  • prev_states (torch.Tensor) – Not to be used now.

Returns

Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.

Return type

torch.Tensor

output_size() → int[source]
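
A minimal sketch of running the Conformer encoder on random filterbank-like features; with the default input_layer="conv2d" the time axis is subsampled by a factor of 4, so the output lengths are shorter than ilens. All sizes are dummy values.

    import torch
    from espnet2.asr.encoder.conformer_encoder import ConformerEncoder

    encoder = ConformerEncoder(input_size=80, output_size=256, num_blocks=2)

    xs_pad = torch.randn(2, 200, 80)     # (#batch, L, input_size)
    ilens = torch.tensor([200, 180])     # (#batch,)

    ys_pad, olens, _ = encoder(xs_pad, ilens)
    # ys_pad: roughly (2, 49, 256) after conv2d subsampling; olens: subsampled lengths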

espnet2.asr.encoder.contextual_block_conformer_encoder

Created on Sat Aug 21 17:27:16 2021.

@author: Keqi Deng (UCAS)

class espnet2.asr.encoder.contextual_block_conformer_encoder.ContextualBlockConformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.StreamPositionalEncoding'>, selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, cnn_module_kernel: int = 31, padding_idx: int = -1, block_size: int = 40, hop_size: int = 16, look_ahead: int = 16, init_average: bool = True, ctx_pos_enc: bool = True)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

Contextual Block Conformer encoder module.

Parameters
  • input_size – input dim

  • output_size – dimension of attention

  • attention_heads – the number of heads of multi head attention

  • linear_units – the number of units of position-wise feed forward

  • num_blocks – the number of decoder blocks

  • dropout_rate – dropout rate

  • attention_dropout_rate – dropout rate in attention

  • positional_dropout_rate – dropout rate after adding positional encoding

  • input_layer – input layer type

  • pos_enc_class – PositionalEncoding or ScaledPositionalEncoding

  • normalize_before – whether to use layer_norm before the first block

  • concat_after – whether to concat the attention layer’s input and output. If True, an additional linear layer will be applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer will be applied, i.e. x -> x + att(x)

  • positionwise_layer_type – linear or conv1d

  • positionwise_conv_kernel_size – kernel size of positionwise conv1d layer

  • padding_idx – padding_idx for input_layer=embed

  • block_size – block size for contextual block processing

  • hop_size – hop size for block processing

  • look_ahead – look-ahead size for block processing

  • init_average – whether to use average as the initial context (otherwise max values)

  • ctx_pos_enc – whether to apply positional encoding to the context vectors

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final=True, infer_mode=False) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Embed positions in tensor.

Parameters
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

  • infer_mode – whether to be used for inference. This is used to distinguish between forward_train (train and validate) and forward_infer (decode).

Returns

position embedded tensor and mask

forward_infer(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final: bool = True) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Embed positions in tensor.

Parameters
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns

position embedded tensor and mask

forward_train(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Embed positions in tensor.

Parameters
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns

position embedded tensor and mask

output_size() → int[source]

espnet2.asr.encoder.contextual_block_transformer_encoder

Encoder definition.

class espnet2.asr.encoder.contextual_block_transformer_encoder.ContextualBlockTransformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.StreamPositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, padding_idx: int = -1, block_size: int = 40, hop_size: int = 16, look_ahead: int = 16, init_average: bool = True, ctx_pos_enc: bool = True)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

Contextual Block Transformer encoder module.

Details in Tsunoo et al. “Transformer ASR with contextual block processing” (https://arxiv.org/abs/1910.07204)

Parameters
  • input_size – input dim

  • output_size – dimension of attention

  • attention_heads – the number of heads of multi head attention

  • linear_units – the number of units of position-wise feed forward

  • num_blocks – the number of encoder blocks

  • dropout_rate – dropout rate

  • attention_dropout_rate – dropout rate in attention

  • positional_dropout_rate – dropout rate after adding positional encoding

  • input_layer – input layer type

  • pos_enc_class – PositionalEncoding or ScaledPositionalEncoding

  • normalize_before – whether to use layer_norm before the first block

  • concat_after – whether to concat the attention layer’s input and output. If True, an additional linear layer will be applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer will be applied, i.e. x -> x + att(x)

  • positionwise_layer_type – linear or conv1d

  • positionwise_conv_kernel_size – kernel size of positionwise conv1d layer

  • padding_idx – padding_idx for input_layer=embed

  • block_size – block size for contextual block processing

  • hop_size – hop size for block processing

  • look_ahead – look-ahead size for block processing

  • init_average – whether to use average as the initial context (otherwise max values)

  • ctx_pos_enc – whether to apply positional encoding to the context vectors

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final=True, infer_mode=False) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Embed positions in tensor.

Parameters
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

  • infer_mode – whether to be used for inference. This is used to distinguish between forward_train (train and validate) and forward_infer (decode).

Returns

position embedded tensor and mask

forward_infer(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final: bool = True) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Embed positions in tensor.

Parameters
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns

position embedded tensor and mask

forward_train(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Embed positions in tensor.

Parameters
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns

position embedded tensor and mask

output_size() → int[source]

espnet2.asr.encoder.hubert_encoder

Encoder definition.

class espnet2.asr.encoder.hubert_encoder.FairseqHubertEncoder(input_size: int, hubert_url: str = './', hubert_dir_path: str = './', output_size: int = 256, normalize_before: bool = False, freeze_finetune_updates: int = 0, dropout_rate: float = 0.0, activation_dropout: float = 0.1, attention_dropout: float = 0.0, mask_length: int = 10, mask_prob: float = 0.75, mask_selection: str = 'static', mask_other: int = 0, apply_mask: bool = True, mask_channel_length: int = 64, mask_channel_prob: float = 0.5, mask_channel_other: int = 0, mask_channel_selection: str = 'static', layerdrop: float = 0.1, feature_grad_mult: float = 0.0)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

FairSeq Hubert encoder module, used for loading pretrained weights and fine-tuning

Parameters
  • input_size – input dim

  • hubert_url – url to Hubert pretrained model

  • hubert_dir_path – directory to download the Hubert pretrained model.

  • output_size – dimension of attention

  • normalize_before – whether to use layer_norm before the first block

  • freeze_finetune_updates – number of steps during which all layers except the output layer are frozen before tuning the whole model (necessary to prevent overfitting).

  • dropout_rate – dropout rate

  • activation_dropout – dropout rate in activation function

  • attention_dropout – dropout rate in attention

Hubert specific Args:

Please refer to: https://github.com/pytorch/fairseq/blob/master/fairseq/models/hubert/hubert.py

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Forward Hubert ASR Encoder.

Parameters
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns

position embedded tensor and mask

output_size() → int[source]
reload_pretrained_parameters()[source]
class espnet2.asr.encoder.hubert_encoder.FairseqHubertPretrainEncoder(input_size: int = 1, output_size: int = 1024, linear_units: int = 1024, attention_heads: int = 12, num_blocks: int = 12, dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0, activation_dropout_rate: float = 0.0, hubert_dict: str = './dict.txt', label_rate: int = 100, checkpoint_activations: bool = False, sample_rate: int = 16000, use_amp: bool = False, **kwargs)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

FairSeq Hubert pretrain encoder module, used only for the pretraining stage

Parameters
  • input_size – input dim

  • output_size – dimension of attention

  • linear_units – dimension of feedforward layers

  • attention_heads – the number of heads of multi head attention

  • num_blocks – the number of encoder blocks

  • dropout_rate – dropout rate

  • attention_dropout_rate – dropout rate in attention

  • hubert_dict – target dictionary for Hubert pretraining

  • label_rate – label frame rate. -1 for sequence label

  • sample_rate – target sample rate.

  • use_amp – whether to use automatic mixed precision

  • normalize_before – whether to use layer_norm before the first block

cast_mask_emb()[source]
forward(xs_pad: torch.Tensor, ilens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_length: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Forward Hubert Pretrain Encoder.

Parameters
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns

position embedded tensor and mask

output_size() → int[source]
reload_pretrained_parameters()[source]
espnet2.asr.encoder.hubert_encoder.download_hubert(model_url, dir_path)[source]

espnet2.asr.encoder.longformer_encoder

Conformer encoder definition.

class espnet2.asr.encoder.longformer_encoder.LongformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'abs_pos', selfattention_layer_type: str = 'lf_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, cnn_module_kernel: int = 31, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False, attention_windows: list = [100, 100, 100, 100, 100, 100], attention_dilation: list = [1, 1, 1, 1, 1, 1], attention_mode: str = 'sliding_chunks')[source]

Bases: espnet2.asr.encoder.conformer_encoder.ConformerEncoder

Longformer SA Conformer encoder module.

Parameters
  • input_size (int) – Input dimension.

  • output_size (int) – Dimension of attention.

  • attention_heads (int) – The number of heads of multi head attention.

  • linear_units (int) – The number of units of position-wise feed forward.

  • num_blocks (int) – The number of decoder blocks.

  • dropout_rate (float) – Dropout rate.

  • attention_dropout_rate (float) – Dropout rate in attention.

  • positional_dropout_rate (float) – Dropout rate after adding positional encoding.

  • input_layer (Union[str, torch.nn.Module]) – Input layer type.

  • normalize_before (bool) – Whether to use layer_norm before the first block.

  • concat_after (bool) – Whether to concat the attention layer’s input and output. If True, an additional linear layer will be applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer will be applied, i.e. x -> x + att(x)

  • positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.

  • positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.

  • rel_pos_type (str) – Whether to use the latest relative positional encoding or the legacy one. The legacy relative positional encoding will be deprecated in the future. More Details can be found in https://github.com/espnet/espnet/pull/2816.

  • encoder_pos_enc_layer_type (str) – Encoder positional encoding layer type.

  • encoder_attn_layer_type (str) – Encoder attention layer type.

  • activation_type (str) – Encoder activation function type.

  • macaron_style (bool) – Whether to use macaron style for positionwise layer.

  • use_cnn_module (bool) – Whether to use convolution module.

  • zero_triu (bool) – Whether to zero the upper triangular part of attention matrix.

  • cnn_module_kernel (int) – Kernel size of convolution module.

  • padding_idx (int) – Padding idx for input_layer=embed.

  • attention_windows (list) – Layer-wise attention window sizes for longformer self-attn

  • attention_dilation (list) – Layer-wise attention dilation sizes for longformer self-attn

  • attention_mode (str) – Implementation for longformer self-attn. Default is “sliding_chunks”; choose ‘n2’, ‘tvm’, or ‘sliding_chunks’. More details in https://github.com/allenai/longformer

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Calculate forward propagation.

Parameters
  • xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).

  • ilens (torch.Tensor) – Input length (#batch).

  • prev_states (torch.Tensor) – Not to be used now.

Returns

Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.

Return type

torch.Tensor

output_size() → int[source]

espnet2.asr.encoder.rnn_encoder

class espnet2.asr.encoder.rnn_encoder.RNNEncoder(input_size: int, rnn_type: str = 'lstm', bidirectional: bool = True, use_projection: bool = True, num_layers: int = 4, hidden_size: int = 320, output_size: int = 320, dropout: float = 0.0, subsample: Optional[Sequence[int]] = (2, 2, 1, 1))[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

RNNEncoder class.

Parameters
  • input_size – The number of expected features in the input

  • output_size – The number of output features

  • hidden_size – The number of hidden features

  • bidirectional – If True becomes a bidirectional LSTM

  • use_projection – Use projection layer or not

  • num_layers – Number of recurrent layers

  • dropout – dropout probability

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size() → int[source]
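
A minimal sketch of the RNN encoder with the default subsample schedule (2, 2, 1, 1), which reduces the time axis by a factor of 4 over the first two layers; all sizes are dummy values.

    import torch
    from espnet2.asr.encoder.rnn_encoder import RNNEncoder

    encoder = RNNEncoder(input_size=80, hidden_size=320, output_size=320, num_layers=4)

    xs_pad = torch.randn(2, 200, 80)     # (B, L, input_size)
    ilens = torch.tensor([200, 160])

    ys_pad, olens, states = encoder(xs_pad, ilens)
    # ys_pad: roughly (2, 50, 320); olens: roughly [50, 40] after 4x subsampling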

espnet2.asr.encoder.transformer_encoder

Transformer encoder definition.

class espnet2.asr.encoder.transformer_encoder.TransformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

Transformer encoder module.

Parameters
  • input_size – input dim

  • output_size – dimension of attention

  • attention_heads – the number of heads of multi head attention

  • linear_units – the number of units of position-wise feed forward

  • num_blocks – the number of decoder blocks

  • dropout_rate – dropout rate

  • attention_dropout_rate – dropout rate in attention

  • positional_dropout_rate – dropout rate after adding positional encoding

  • input_layer – input layer type

  • pos_enc_class – PositionalEncoding or ScaledPositionalEncoding

  • normalize_before – whether to use layer_norm before the first block

  • concat_after – whether to concat the attention layer’s input and output. If True, an additional linear layer will be applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer will be applied, i.e. x -> x + att(x)

  • positionwise_layer_type – linear or conv1d

  • positionwise_conv_kernel_size – kernel size of positionwise conv1d layer

  • padding_idx – padding_idx for input_layer=embed

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Embed positions in tensor.

Parameters
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns

position embedded tensor and mask

output_size() → int[source]

espnet2.asr.encoder.vgg_rnn_encoder

class espnet2.asr.encoder.vgg_rnn_encoder.VGGRNNEncoder(input_size: int, rnn_type: str = 'lstm', bidirectional: bool = True, use_projection: bool = True, num_layers: int = 4, hidden_size: int = 320, output_size: int = 320, dropout: float = 0.0, in_channel: int = 1)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

VGGRNNEncoder class.

Parameters
  • input_size – The number of expected features in the input

  • bidirectional – If True becomes a bidirectional LSTM

  • use_projection – Use projection layer or not

  • num_layers – Number of recurrent layers

  • hidden_size – The number of hidden features

  • output_size – The number of output features

  • dropout – dropout probability

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size() → int[source]

espnet2.asr.encoder.wav2vec2_encoder

Encoder definition.

class espnet2.asr.encoder.wav2vec2_encoder.FairSeqWav2Vec2Encoder(input_size: int, w2v_url: str, w2v_dir_path: str = './', output_size: int = 256, normalize_before: bool = False, freeze_finetune_updates: int = 0)[source]

Bases: espnet2.asr.encoder.abs_encoder.AbsEncoder

FairSeq Wav2Vec2 encoder module.

Parameters
  • input_size – input dim

  • output_size – dimension of attention

  • w2v_url – url to Wav2Vec2.0 pretrained model

  • w2v_dir_path – directory to download the Wav2Vec2.0 pretrained model.

  • normalize_before – whether to use layer_norm before the first block

  • finetune_last_n_layers – last n layers to be fine-tuned in Wav2Vec2.0; 0 means to fine-tune every layer if freeze_w2v=False.

forward(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Forward FairSeqWav2Vec2 Encoder.

Parameters
  • xs_pad – input tensor (B, L, D)

  • ilens – input length (B)

  • prev_states – Not to be used now.

Returns

position embedded tensor and mask

output_size() → int[source]
reload_pretrained_parameters()[source]
espnet2.asr.encoder.wav2vec2_encoder.download_w2v(model_url, dir_path)[source]

espnet2.asr.frontend.__init__

espnet2.asr.frontend.abs_frontend

class espnet2.asr.frontend.abs_frontend.AbsFrontend[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract output_size() → int[source]

espnet2.asr.frontend.default

class espnet2.asr.frontend.default.DefaultFrontend(fs: Union[int, str] = 16000, n_fft: int = 512, win_length: int = None, hop_length: int = 128, window: Optional[str] = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True, n_mels: int = 80, fmin: int = None, fmax: int = None, htk: bool = False, frontend_conf: Optional[dict] = {'badim': 320, 'bdropout_rate': 0.0, 'blayers': 3, 'bnmask': 2, 'bprojs': 320, 'btype': 'blstmp', 'bunits': 300, 'delay': 3, 'ref_channel': -1, 'taps': 5, 'use_beamformer': False, 'use_dnn_mask_for_wpe': True, 'use_wpe': False, 'wdropout_rate': 0.0, 'wlayers': 3, 'wprojs': 320, 'wtype': 'blstmp', 'wunits': 300}, apply_stft: bool = True)[source]

Bases: espnet2.asr.frontend.abs_frontend.AbsFrontend

Conventional frontend structure for ASR.

Stft -> WPE -> MVDR-Beamformer -> Power-spec -> Mel-Fbank -> CMVN

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size() → int[source]
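
A minimal sketch of extracting log-mel features from raw waveforms with the default pipeline; WPE and beamforming stay disabled with the default frontend_conf, so the effective path is Stft -> Power-spec -> Mel-Fbank. All sizes are dummy values.

    import torch
    from espnet2.asr.frontend.default import DefaultFrontend

    frontend = DefaultFrontend(fs=16000, n_fft=512, hop_length=128, n_mels=80)

    speech = torch.randn(2, 16000)                 # about 1 s of audio per utterance
    speech_lengths = torch.tensor([16000, 12000])

    feats, feats_lengths = frontend(speech, speech_lengths)
    # feats: (batch, frames, n_mels), e.g. roughly (2, 126, 80) here
    # frontend.output_size() == 80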

espnet2.asr.frontend.fused

class espnet2.asr.frontend.fused.FusedFrontends(frontends=None, align_method='linear_projection', proj_dim=100, fs=16000)[source]

Bases: espnet2.asr.frontend.abs_frontend.AbsFrontend

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size() → int[source]

espnet2.asr.frontend.s3prl

class espnet2.asr.frontend.s3prl.S3prlFrontend(fs: Union[int, str] = 16000, frontend_conf: Optional[dict] = {'badim': 320, 'bdropout_rate': 0.0, 'blayers': 3, 'bnmask': 2, 'bprojs': 320, 'btype': 'blstmp', 'bunits': 300, 'delay': 3, 'ref_channel': -1, 'taps': 5, 'use_beamformer': False, 'use_dnn_mask_for_wpe': True, 'use_wpe': False, 'wdropout_rate': 0.0, 'wlayers': 3, 'wprojs': 320, 'wtype': 'blstmp', 'wunits': 300}, download_dir: str = None, multilayer_feature: bool = False)[source]

Bases: espnet2.asr.frontend.abs_frontend.AbsFrontend

Speech Pretrained Representation frontend structure for ASR.

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

output_size() → int[source]
reload_pretrained_parameters()[source]
espnet2.asr.frontend.s3prl.base_s3prl_setup(args)[source]

espnet2.asr.frontend.windowing

Sliding Window for raw audio input data.

class espnet2.asr.frontend.windowing.SlidingWindow(win_length: int = 400, hop_length: int = 160, channels: int = 1, padding: int = None, fs=None)[source]

Bases: espnet2.asr.frontend.abs_frontend.AbsFrontend

Sliding Window.

Provides a sliding window over a batched continuous raw audio tensor. Optionally, provides padding (Currently not implemented). Combine this module with a pre-encoder compatible with raw audio data, for example Sinc convolutions.

Known issues: the output length is calculated incorrectly if the audio is shorter than win_length. WARNING: trailing values are discarded; padding is not implemented yet. There is currently no additional window function applied to input values.

Initialize.

Parameters
  • win_length – Length of frame.

  • hop_length – Relative starting point of next frame.

  • channels – Number of input channels.

  • padding – Padding (placeholder, currently not implemented).

  • fs – Sampling rate (placeholder for compatibility, not used).

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Apply a sliding window on the input.

Parameters
  • input – Input (B, T, C*D) or (B, T*C*D), with D=C=1.

  • input_lengths – Input lengths within batch.

Returns

Output with dimensions (B, T, C, D), with D=win_length. Tensor: Output lengths within batch.

Return type

Tensor

output_size() → int[source]

Return output length of feature dimension D, i.e. the window length.
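
A minimal sketch of the framing behaviour: with win_length=400 and hop_length=160, a 16000-sample utterance yields floor((16000 - 400) / 160) + 1 = 98 frames of raw audio. All sizes are dummy values.

    import torch
    from espnet2.asr.frontend.windowing import SlidingWindow

    frontend = SlidingWindow(win_length=400, hop_length=160, channels=1)

    speech = torch.randn(2, 16000)                # (B, T) raw audio with C = D = 1
    speech_lengths = torch.tensor([16000, 8000])

    frames, frame_lengths = frontend(speech, speech_lengths)
    # frames: (2, 98, 1, 400) = (B, T, C, D); frontend.output_size() == 400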

espnet2.asr.postencoder.__init__

espnet2.asr.postencoder.abs_postencoder

class espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract output_size() → int[source]

espnet2.asr.postencoder.hugging_face_transformers_postencoder

Hugging Face Transformers PostEncoder.

class espnet2.asr.postencoder.hugging_face_transformers_postencoder.HuggingFaceTransformersPostEncoder(input_size: int, model_name_or_path: str)[source]

Bases: espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder

Hugging Face Transformers PostEncoder.

Initialize the module.

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward.

output_size() → int[source]

Get the output size.

reload_pretrained_parameters()[source]

espnet2.asr.preencoder.__init__

espnet2.asr.preencoder.abs_preencoder

class espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract output_size() → int[source]

espnet2.asr.preencoder.linear

Linear Projection.

class espnet2.asr.preencoder.linear.LinearProjection(input_size: int, output_size: int)[source]

Bases: espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder

Linear Projection Preencoder.

Initialize the module.

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward.

output_size() → int[source]

Get the output size.

espnet2.asr.preencoder.sinc

Sinc convolutions for raw audio input.

class espnet2.asr.preencoder.sinc.LightweightSincConvs(fs: Union[int, str, float] = 16000, in_channels: int = 1, out_channels: int = 256, activation_type: str = 'leakyrelu', dropout_type: str = 'dropout', windowing_type: str = 'hamming', scale_type: str = 'mel')[source]

Bases: espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder

Lightweight Sinc Convolutions.

Instead of using precomputed features, end-to-end speech recognition can also be done directly from raw audio using sinc convolutions, as described in “Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions” by Kürzinger et al. https://arxiv.org/abs/2010.07597

To use Sinc convolutions in your model instead of the default f-bank frontend, set this module as your pre-encoder with preencoder: sinc and use the input of the sliding window frontend with frontend: sliding_window in your yaml configuration file, so that the process flow is:

Frontend (SlidingWindow) -> SpecAug -> Normalization -> Pre-encoder (LightweightSincConvs) -> Encoder -> Decoder

Note that this method also performs data augmentation in time domain (vs. in spectral domain in the default frontend). Use plot_sinc_filters.py to visualize the learned Sinc filters.

Initialize the module.

Parameters
  • fs – Sample rate.

  • in_channels – Number of input channels.

  • out_channels – Number of output channels (for each input channel).

  • activation_type – Choice of activation function.

  • dropout_type – Choice of dropout function.

  • windowing_type – Choice of windowing function.

  • scale_type – Choice of filter-bank initialization scale.

espnet_initialization_fn()[source]

Initialize sinc filters with filterbank values.

forward(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Apply Lightweight Sinc Convolutions.

The input shall be formatted as (B, T, C_in, D_in) with B as batch size, T as time dimension, C_in as channels, and D_in as feature dimension.

The output will then be (B, T, C_out*D_out) with C_out and D_out as output dimensions.

The current module structure only handles D_in=400, so that D_out=1. Remark for the multichannel case: C_out is the number of out_channels given at initialization, multiplied by C_in.
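
A hedged sketch of the process flow described above: a SlidingWindow frontend produces (B, T, C_in, D_in=400) frames of raw audio, which this pre-encoder maps to (B, T, C_out*D_out) features; SpecAug and normalization are omitted for brevity, and all sizes are dummy values.

    import torch
    from espnet2.asr.frontend.windowing import SlidingWindow
    from espnet2.asr.preencoder.sinc import LightweightSincConvs

    frontend = SlidingWindow(win_length=400, hop_length=160, channels=1)
    preencoder = LightweightSincConvs(fs=16000, in_channels=1, out_channels=256)

    speech = torch.randn(2, 16000)
    speech_lengths = torch.tensor([16000, 16000])

    frames, frame_lengths = frontend(speech, speech_lengths)   # (2, 98, 1, 400)
    feats, feats_lengths = preencoder(frames, frame_lengths)   # (2, 98, 256)
    # preencoder.output_size() == in_channels * out_channels == 256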

gen_lsc_block(in_channels: int, out_channels: int, depthwise_kernel_size: int = 9, depthwise_stride: int = 1, depthwise_groups=None, pointwise_groups=0, dropout_probability: float = 0.15, avgpool=False)[source]

Generate a convolutional block for Lightweight Sinc convolutions.

Each block consists of either a depthwise or a depthwise-separable convolution, together with dropout, a (batch-)normalization layer, and an optional average-pooling layer.

Parameters
  • in_channels – Number of input channels.

  • out_channels – Number of output channels.

  • depthwise_kernel_size – Kernel size of the depthwise convolution.

  • depthwise_stride – Stride of the depthwise convolution.

  • depthwise_groups – Number of groups of the depthwise convolution.

  • pointwise_groups – Number of groups of the pointwise convolution.

  • dropout_probability – Dropout probability in the block.

  • avgpool – If True, an AvgPool layer is inserted.

Returns

Neural network building block.

Return type

torch.nn.Sequential

output_size() → int[source]

Get the output size.

class espnet2.asr.preencoder.sinc.SpatialDropout(dropout_probability: float = 0.15, shape: Union[tuple, list, None] = None)[source]

Bases: torch.nn.modules.module.Module

Spatial dropout module.

Apply dropout to entire channels of input tensors with shape (B, C, D)

Initialize.

Parameters
  • dropout_probability – Dropout probability.

  • shape (tuple, list) – Shape of input tensors.

forward(x: torch.Tensor) → torch.Tensor[source]

Forward of spatial dropout module.

espnet2.asr.specaug.__init__

espnet2.asr.specaug.abs_specaug

class espnet2.asr.specaug.abs_specaug.AbsSpecAug[source]

Bases: torch.nn.modules.module.Module

Abstract class for spectrogram augmentation

The process-flow:

Frontend -> SpecAug -> Normalization -> Encoder -> Decoder

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x: torch.Tensor, x_lengths: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.asr.specaug.specaug

SpecAugment module.

class espnet2.asr.specaug.specaug.SpecAug(apply_time_warp: bool = True, time_warp_window: int = 5, time_warp_mode: str = 'bicubic', apply_freq_mask: bool = True, freq_mask_width_range: Union[int, Sequence[int]] = (0, 20), num_freq_mask: int = 2, apply_time_mask: bool = True, time_mask_width_range: Union[int, Sequence[int], None] = None, time_mask_width_ratio_range: Union[float, Sequence[float], None] = None, num_time_mask: int = 2)[source]

Bases: espnet2.asr.specaug.abs_specaug.AbsSpecAug

Implementation of SpecAug.

Reference:

Daniel S. Park et al. “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition”

Warning

When running on CUDA, time_warp is not reproducible because of torch.nn.functional.interpolate.
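
Example (a usage sketch with the documented defaults; passing time_mask_width_range explicitly is an assumption, since the constructor leaves both time-mask width options unset by default):

    import torch
    from espnet2.asr.specaug.specaug import SpecAug

    specaug = SpecAug(
        apply_time_warp=True,
        time_warp_window=5,
        apply_freq_mask=True,
        freq_mask_width_range=(0, 20),
        num_freq_mask=2,
        apply_time_mask=True,
        time_mask_width_range=(0, 40),   # assumption: choose one time-mask width option
        num_time_mask=2,
    )

    feats = torch.randn(4, 300, 80)                   # (Batch, Time, Freq)
    feat_lengths = torch.tensor([300, 280, 250, 200])
    feats_aug, feat_lengths = specaug(feats, feat_lengths)
    print(feats_aug.shape)                            # torch.Size([4, 300, 80])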

forward(x, x_lengths=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.asr.transducer.__init__

espnet2.asr.transducer.beam_search_transducer

Search algorithms for Transducer models.

class espnet2.asr.transducer.beam_search_transducer.BeamSearchTransducer(decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, joint_network: espnet2.asr.transducer.joint_network.JointNetwork, beam_size: int, lm: torch.nn.modules.module.Module = None, lm_weight: float = 0.1, search_type: str = 'default', max_sym_exp: int = 2, u_max: int = 50, nstep: int = 1, prefix_alpha: int = 1, expansion_gamma: int = 2.3, expansion_beta: int = 2, score_norm: bool = True, nbest: int = 1)[source]

Bases: object

Beam search implementation for Transducer.

Initialize Transducer search module.

Parameters
  • decoder – Decoder module.

  • joint_network – Joint network module.

  • beam_size – Beam size.

  • lm – LM class.

  • lm_weight – LM weight for soft fusion.

  • search_type – Search algorithm to use during inference.

  • max_sym_exp – Number of maximum symbol expansions at each time step. (TSD)

  • u_max – Maximum output sequence length. (ALSD)

  • nstep – Number of maximum expansion steps at each time step. (NSC/mAES)

  • prefix_alpha – Maximum prefix length in prefix search. (NSC/mAES)

  • expansion_beta – Number of additional candidates for expanded hypotheses selection. (mAES)

  • expansion_gamma – Allowed logp difference for prune-by-value method. (mAES)

  • score_norm – Normalize final scores by length. (“default”)

  • nbest – Number of final hypotheses.
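
Example (a usage sketch; `decoder`, `joint_network`, and the encoder output `enc_out` of shape (T, D_enc) are assumed to exist, and the search object is assumed to be invoked directly on the encoder output):

    from espnet2.asr.transducer.beam_search_transducer import BeamSearchTransducer

    beam_search = BeamSearchTransducer(
        decoder=decoder,              # assumed: an AbsDecoder instance
        joint_network=joint_network,  # assumed: a JointNetwork instance
        beam_size=10,
        search_type="default",        # or "tsd", "alsd", "nsc", "maes"
        score_norm=True,
        nbest=1,
    )

    nbest_hyps = beam_search(enc_out)   # enc_out: (T, D_enc)
    best = nbest_hyps[0]
    print(best.score, best.yseq)        # score and token ID sequence of the 1-best hypothesis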

align_length_sync_decoding(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]

Alignment-length synchronous beam search implementation.

Based on https://ieeexplore.ieee.org/document/9053040

Parameters

enc_out – Encoder output sequences. (T, D)

Returns

N-best hypothesis.

Return type

nbest_hyps

Beam search implementation.

Modified from https://arxiv.org/pdf/1211.3711.pdf

Parameters

enc_out – Encoder output sequence. (T, D)

Returns

N-best hypothesis.

Return type

nbest_hyps

Greedy search implementation.

Parameters

enc_out – Encoder output sequence. (T, D_enc)

Returns

1-best hypotheses.

Return type

hyp

Modified Adaptive Expansion Search (mAES) implementation.

Based on/modified from https://ieeexplore.ieee.org/document/9250505 and NSC.

Parameters

enc_out – Encoder output sequence. (T, D_enc)

Returns

N-best hypothesis.

Return type

nbest_hyps

N-step constrained beam search implementation.

Based on/Modified from https://arxiv.org/pdf/2002.03577.pdf. Please reference ESPnet (b-flo, PR #2444) for any usage outside ESPnet until further modifications.

Parameters

enc_out – Encoder output sequence. (T, D_enc)

Returns

N-best hypothesis.

Return type

nbest_hyps

Prefix search for NSC and mAES strategies.

Based on https://arxiv.org/pdf/1211.3711.pdf

sort_nbest(hyps: Union[List[espnet2.asr.transducer.beam_search_transducer.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis]]) → Union[List[espnet2.asr.transducer.beam_search_transducer.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis]][source]

Sort hypotheses by score or score given sequence length.

Parameters

hyps – Hypothesis.

Returns

Sorted hypothesis.

Return type

hyps

time_sync_decoding(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]

Time synchronous beam search implementation.

Based on https://ieeexplore.ieee.org/document/9053040

Parameters

enc_out – Encoder output sequence. (T, D)

Returns

N-best hypothesis.

Return type

nbest_hyps

class espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis(score: float, yseq: List[int], dec_state: Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]], torch.Tensor], lm_state: Union[Dict[str, Any], List[Any]] = None, dec_out: List[torch.Tensor] = None, lm_scores: torch.Tensor = None)[source]

Bases: espnet2.asr.transducer.beam_search_transducer.Hypothesis

Extended hypothesis definition for NSC beam search and mAES.

dec_out = None

lm_scores = None

class espnet2.asr.transducer.beam_search_transducer.Hypothesis(score: float, yseq: List[int], dec_state: Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]], torch.Tensor], lm_state: Union[Dict[str, Any], List[Any]] = None)[source]

Bases: object

Default hypothesis definition for Transducer search algorithms.

lm_state = None

espnet2.asr.transducer.error_calculator

Error Calculator module for Transducer.

class espnet2.asr.transducer.error_calculator.ErrorCalculatorTransducer(decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, joint_network: torch.nn.modules.module.Module, token_list: List[int], sym_space: str, sym_blank: str, report_cer: bool = False, report_wer: bool = False)[source]

Bases: object

Calculate CER and WER for transducer models.

Parameters
  • decoder – Decoder module.

  • token_list – List of tokens.

  • sym_space – Space symbol.

  • sym_blank – Blank symbol.

  • report_cer – Whether to compute CER.

  • report_wer – Whether to compute WER.

Construct an ErrorCalculatorTransducer.
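
Example (a usage sketch under the assumption that the calculator is called directly on batched encoder outputs and padded targets and returns a (cer, wer) pair; `decoder`, `joint_network`, `token_list`, `encoder_out` (B, T, D_enc), and `target` (B, L) are assumed to exist):

    from espnet2.asr.transducer.error_calculator import ErrorCalculatorTransducer

    error_calc = ErrorCalculatorTransducer(
        decoder=decoder,
        joint_network=joint_network,
        token_list=token_list,
        sym_space="<space>",
        sym_blank="<blank>",
        report_cer=True,
        report_wer=True,
    )

    cer, wer = error_calc(encoder_out, target)   # assumed call signature
    print(f"CER={cer:.3f} WER={wer:.3f}")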

calculate_cer(char_pred: torch.Tensor, char_target: torch.Tensor) → float[source]

Calculate sentence-level CER score.

Parameters
  • char_pred – Prediction character sequences. (B, ?)

  • char_target – Target character sequences. (B, ?)

Returns

Average sentence-level CER score.

calculate_wer(char_pred: torch.Tensor, char_target: torch.Tensor) → float[source]

Calculate sentence-level WER score.

Parameters
  • char_pred – Prediction character sequences. (B, ?)

  • char_target – Target character sequences. (B, ?)

Returns

Average sentence-level WER score.

convert_to_char(pred: torch.Tensor, target: torch.Tensor) → Tuple[List, List][source]

Convert label ID sequences to character sequences.

Parameters
  • pred – Prediction label ID sequences. (B, U)

  • target – Target label ID sequences. (B, L)

Returns

Prediction character sequences. (B, ?) char_target: Target character sequences. (B, ?)

Return type

char_pred

espnet2.asr.transducer.joint_network

Transducer joint network implementation.

class espnet2.asr.transducer.joint_network.JointNetwork(joint_output_size: int, encoder_output_size: int, decoder_output_size: int, joint_space_size: int = 256, joint_activation_type: str = 'tanh')[source]

Bases: torch.nn.modules.module.Module

Transducer joint network module.

Parameters
  • joint_output_size – Joint network output dimension

  • encoder_output_size – Encoder output dimension.

  • decoder_output_size – Decoder output dimension.

  • joint_space_size – Dimension of joint space.

  • joint_activation_type – Type of activation for joint network.

Joint network initializer.

forward(enc_out: torch.Tensor, dec_out: torch.Tensor) → torch.Tensor[source]

Joint computation of encoder and decoder hidden state sequences.

Parameters
  • enc_out – Expanded encoder output state sequences (B, T, 1, D_enc)

  • dec_out – Expanded decoder output state sequences (B, 1, U, D_dec)

Returns

Joint output state sequences. (B, T, U, D_out)

Return type

joint_out
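
Example (a shape sketch of the broadcasted joint computation, using the constructor arguments documented above):

    import torch
    from espnet2.asr.transducer.joint_network import JointNetwork

    B, T, U, D_enc, D_dec, V = 4, 120, 25, 256, 320, 500

    joint_network = JointNetwork(
        joint_output_size=V,
        encoder_output_size=D_enc,
        decoder_output_size=D_dec,
        joint_space_size=256,
        joint_activation_type="tanh",
    )

    enc_out = torch.randn(B, T, 1, D_enc)   # expanded encoder state sequences
    dec_out = torch.randn(B, 1, U, D_dec)   # expanded decoder state sequences
    joint_out = joint_network(enc_out, dec_out)
    print(joint_out.shape)                  # torch.Size([4, 120, 25, 500])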

espnet2.asr.transducer.transducer_decoder

(RNN-)Transducer decoder definition.

class espnet2.asr.transducer.transducer_decoder.TransducerDecoder(vocab_size: int, rnn_type: str = 'lstm', num_layers: int = 1, hidden_size: int = 320, dropout: float = 0.0, dropout_embed: float = 0.0, embed_pad: int = 0)[source]

Bases: espnet2.asr.decoder.abs_decoder.AbsDecoder

(RNN-)Transducer decoder module.

Parameters
  • vocab_size – Output dimension.

  • rnn_type – (RNN-)Decoder layer type.

  • num_layers – Number of decoder layers.

  • hidden_size – Number of decoder units per layer.

  • dropout – Dropout rate for decoder layers.

  • dropout_embed – Dropout rate for embedding layer.

  • embed_pad – Embed/Blank symbol ID.
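
Example (a usage sketch: embed and encode a batch of padded label ID sequences with the prediction network; the observed output shape is (B, U, D_dec)):

    import torch
    from espnet2.asr.transducer.transducer_decoder import TransducerDecoder

    decoder = TransducerDecoder(
        vocab_size=500,
        rnn_type="lstm",
        num_layers=1,
        hidden_size=320,
        dropout=0.0,
        embed_pad=0,   # blank/padding symbol ID
    )

    labels = torch.randint(1, 500, (4, 25))   # (B, L) label IDs
    dec_out = decoder(labels)
    print(dec_out.shape)                      # (B, U, D_dec) = torch.Size([4, 25, 320])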

batch_score(hyps: Union[List[espnet2.asr.transducer.beam_search_transducer.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis]], dec_states: Tuple[torch.Tensor, Optional[torch.Tensor]], cache: Dict[str, Any], use_lm: bool) → Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor], torch.Tensor][source]

One-step forward hypotheses.

Parameters
  • hyps – Hypotheses.

  • states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))

  • cache – Pairs of (dec_out, dec_states) for each label sequences. (keys)

  • use_lm – Whether to compute label ID sequences for LM.

Returns

Decoder output sequences. (B, D_dec) dec_states: Decoder hidden states. ((N, B, D_dec), (N, B, D_dec)) lm_labels: Label ID sequences for LM. (B,)

Return type

dec_out

create_batch_states(states: Tuple[torch.Tensor, Optional[torch.Tensor]], new_states: List[Tuple[torch.Tensor, Optional[torch.Tensor]]], check_list: Optional[List] = None) → List[Tuple[torch.Tensor, Optional[torch.Tensor]]][source]

Create decoder hidden states.

Parameters
  • states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))

  • new_states – Decoder hidden states. [N x ((1, D_dec), (1, D_dec))]

Returns

Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))

Return type

states

forward(labels: torch.Tensor) → torch.Tensor[source]

Encode source label sequences.

Parameters

labels – Label ID sequences. (B, L)

Returns

Decoder output sequences. (B, U, D_dec)

Return type

dec_out

init_state(batch_size: int) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Initialize decoder states.

Parameters

batch_size – Batch size.

Returns

Initial decoder hidden states. ((N, B, D_dec), (N, B, D_dec))

rnn_forward(sequence: torch.Tensor, state: Tuple[torch.Tensor, Optional[torch.Tensor]]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]]][source]

Encode source label sequences.

Parameters
  • sequence – RNN input sequences. (B, D_emb)

  • state – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))

Returns

RNN output sequences. (B, D_dec) (h_next, c_next): Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))

Return type

sequence

score(hyp: espnet2.asr.transducer.beam_search_transducer.Hypothesis, cache: Dict[str, Any]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]], torch.Tensor][source]

One-step forward hypothesis.

Parameters
  • hyp – Hypothesis.

  • cache – Pairs of (dec_out, state) for each label sequence. (key)

Returns

Decoder output sequence. (1, D_dec) new_state: Decoder hidden states. ((N, 1, D_dec), (N, 1, D_dec)) label: Label ID for LM. (1,)

Return type

dec_out

select_state(states: Tuple[torch.Tensor, Optional[torch.Tensor]], idx: int) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Get specified ID state from decoder hidden states.

Parameters
  • states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))

  • idx – State ID to extract.

Returns

Decoder hidden state for given ID.

((N, 1, D_dec), (N, 1, D_dec))

set_device(device: torch.device)[source]

Set GPU device to use.

Parameters

device – Device ID.

espnet2.asr.transducer.utils

Utility functions for Transducer models.

espnet2.asr.transducer.utils.get_transducer_task_io(labels: torch.Tensor, encoder_out_lens: torch.Tensor, ignore_id: int = -1, blank_id: int = 0)[source]

Get Transducer loss I/O.

Parameters
  • labels – Label ID sequences. (B, L)

  • encoder_out_lens – Encoder output lengths. (B,)

  • ignore_id – Padding symbol ID.

  • blank_id – Blank symbol ID.

Returns

Decoder inputs. (B, U) target: Target label ID sequences. (B, U) t_len: Time lengths. (B,) u_len: Label lengths. (B,)

Return type

decoder_in
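
Example (a usage sketch; torchaudio's rnnt_loss is used here only to illustrate how the four outputs line up with a transducer loss and is an assumption, not necessarily the loss backend ESPnet uses):

    import torch
    from torchaudio.functional import rnnt_loss
    from espnet2.asr.transducer.utils import get_transducer_task_io

    labels = torch.tensor([[3, 5, 7, -1], [2, 4, -1, -1]])   # (B, L), padded with ignore_id
    encoder_out_lens = torch.tensor([50, 42])                # (B,)

    decoder_in, target, t_len, u_len = get_transducer_task_io(
        labels, encoder_out_lens, ignore_id=-1, blank_id=0
    )
    # decoder_in: blank-prefixed label sequences fed to the decoder
    # target:     label sequences consumed by the loss
    # t_len/u_len: per-utterance time and label lengths

    # joint_out would normally come from JointNetwork; random values stand in here.
    joint_out = torch.randn(2, 50, target.size(1) + 1, 500)
    loss = rnnt_loss(joint_out, target.int(), t_len.int(), u_len.int(), blank=0)
    print(loss)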