espnet.nets package¶
espnet.nets.asr_interface¶
ASR Interface module.
-
class
espnet.nets.asr_interface.
ASRInterface
[source]¶ Bases:
object
ASR Interface for ESPnet model implementation.
-
property
attention_plot_class
¶ Get attention plot class.
-
classmethod
build
(idim: int, odim: int, **kwargs)[source]¶ Initialize this class with python-level args.
- Parameters
idim (int) – The number of an input feature dim.
odim (int) – The number of output vocab.
- Returns
A new instance of ASRInterface.
- Return type
ASRInterface
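Examples
A minimal usage sketch. MyASR and its eunits argument are hypothetical, and it is assumed here that build packs the keyword arguments into an argument namespace and forwards them to the (idim, odim, args) constructor convention used by the ESPnet E2E models:

import argparse
from espnet.nets.asr_interface import ASRInterface

class MyASR(ASRInterface):  # hypothetical subclass for illustration only
    def __init__(self, idim, odim, args):
        self.idim, self.odim = idim, odim
        self.eunits = args.eunits  # assumed keyword argument

model = MyASR.build(83, 52, eunits=320)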
-
calculate_all_attentions
(xs, ilens, ys)[source]¶ Calculate attention.
- Parameters
xs (list) – list of padded input sequences [(T1, idim), (T2, idim), …]
ilens (ndarray) – batch of lengths of input sequences (B)
ys (list) – list of character id sequence tensor [(L1), (L2), (L3), …]
- Returns
attention weights (B, Lmax, Tmax)
- Return type
float ndarray
-
encode
(feat)[source]¶ Encode feature in beam_search (optional).
- Parameters
x (numpy.ndarray) – input feature (T, D)
- Returns
encoded feature (T, D)
- Return type
torch.Tensor for pytorch, chainer.Variable for chainer
-
forward
(xs, ilens, ys)[source]¶ Compute loss for training.
- Parameters
xs – For pytorch, batch of padded source sequences torch.Tensor (B, Tmax, idim); for chainer, list of source sequences chainer.Variable
ilens – batch of lengths of source sequences (B); for pytorch, torch.Tensor; for chainer, list of int
ys – For pytorch, batch of padded target sequences torch.Tensor (B, Lmax); for chainer, list of target sequences chainer.Variable
- Returns
loss value
- Return type
torch.Tensor for pytorch, chainer.Variable for chainer
-
recognize
(x, recog_args, char_list=None, rnnlm=None)[source]¶ Recognize x for evaluation.
- Parameters
x (ndarray) – input acoustic feature (B, T, D) or (T, D)
recog_args (namespace) – argument namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns
N-best decoding results
- Return type
list
-
recognize_batch
(x, recog_args, char_list=None, rnnlm=None)[source]¶ Beam search implementation for batch.
- Parameters
x (torch.Tensor) – encoder hidden state sequences (B, Tmax, Henc)
recog_args (namespace) – argument namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns
N-best decoding results
- Return type
list
-
scorers
()[source]¶ Get scorers for beam_search (optional).
- Returns
dict of ScorerInterface objects
- Return type
dict[str, ScorerInterface]
espnet.nets.beam_search¶
Beam search module.
-
class
espnet.nets.beam_search.
BeamSearch
(scorers: Dict[str, espnet.nets.scorer_interface.ScorerInterface], weights: Dict[str, float], beam_size: int, vocab_size: int, sos: int, eos: int, token_list: List[str] = None, pre_beam_ratio: float = 1.5, pre_beam_score_key: str = 'decoder')[source]¶ Bases:
torch.nn.modules.module.Module
Beam search implementation.
Initialize beam search.
- Parameters
scorers (dict[str, ScorerInterface]) – Dict of decoder modules e.g., Decoder, CTCPrefixScorer, LM The scorer will be ignored if it is None
weights (dict[str, float]) – Dict of weights for each scorers The scorer will be ignored if its weight is 0
beam_size (int) – The number of hypotheses kept during search
vocab_size (int) – The number of vocabulary
sos (int) – Start of sequence id
eos (int) – End of sequence id
token_list (list[str]) – List of tokens for debug log
pre_beam_score_key (str) – key of scores to perform pre-beam search
pre_beam_ratio (float) – beam size in the pre-beam search will be int(pre_beam_ratio * beam_size)
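Examples
A minimal construction sketch with a toy scorer; UniformScorer below is an assumption for illustration, not part of ESPnet:

import math
import torch
from espnet.nets.beam_search import BeamSearch
from espnet.nets.scorer_interface import ScorerInterface

class UniformScorer(ScorerInterface):
    # Toy scorer: uniform log-probability over the vocabulary.
    def __init__(self, n_vocab):
        self.n_vocab = n_vocab

    def score(self, y, state, x):
        return torch.full((self.n_vocab,), -math.log(self.n_vocab)), None

token_list = ["<blank>", "a", "b", "<eos>"]
bs = BeamSearch(
    scorers={"decoder": UniformScorer(len(token_list))},
    weights={"decoder": 1.0},
    beam_size=2,
    vocab_size=len(token_list),
    sos=3,
    eos=3,
    token_list=token_list,
)
x = torch.randn(10, 8)          # encoded feature (T, D)
nbest = bs(x, maxlenratio=0.5)  # list of Hypothesis, best score first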
-
static
append_token
(xs: torch.Tensor, x: int) → torch.Tensor[source]¶ Append new token to prefix tokens.
- Parameters
xs (torch.Tensor) – The prefix token
x (int) – The new token to append
- Returns
New tensor containing xs + [x] with xs.dtype and xs.device
- Return type
torch.Tensor
-
forward
(x: torch.Tensor, maxlenratio: float = 0.0, minlenratio: float = 0.0) → List[espnet.nets.beam_search.Hypothesis][source]¶ Perform beam search.
- Parameters
x (torch.Tensor) – Encoded speech feature (T, D)
maxlenratio (float) – Input length ratio to obtain max output length. If maxlenratio=0.0 (default), it uses an end-detect function to automatically find maximum hypothesis lengths
minlenratio (float) – Input length ratio to obtain min output length.
- Returns
N-best decoding results
- Return type
list[Hypothesis]
-
init_hyp
(x: torch.Tensor) → espnet.nets.beam_search.Hypothesis[source]¶ Get an initial hypothesis.
- Parameters
x (torch.Tensor) – The encoder output feature
- Returns
The initial hypothesis.
- Return type
Hypothesis
-
main_beam
(weighted_scores: torch.Tensor, ids: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Compute topk full token ids and partial token ids.
- Parameters
weighted_scores (torch.Tensor) – The weighted sum scores for each token. Its shape is (self.n_vocab,).
ids (torch.Tensor) – The partial token ids to compute topk
- Returns
- The topk full token ids and partial token ids.
Their shapes are (self.beam_size,)
- Return type
Tuple[torch.Tensor, torch.Tensor]
-
static
merge_scores
(hyp: espnet.nets.beam_search.Hypothesis, scores: Dict[str, torch.Tensor], idx: int, part_scores: Dict[str, torch.Tensor], part_idx: int) → Dict[str, torch.Tensor][source]¶ Merge scores for new hypothesis.
- Parameters
hyp (Hypothesis) – The previous hypothesis of prefix tokens
scores (Dict[str, torch.Tensor]) – scores by self.full_scorers
idx (int) – The new token id
part_scores (Dict[str, torch.Tensor]) – scores of partial tokens by self.part_scorers
part_idx (int) – The new token id for part_scores
- Returns
- The new score dict.
Its keys are names of self.full_scorers and self.part_scorers. Its values are scalar tensors by the scorers.
- Return type
Dict[str, torch.Tensor]
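Conceptually, merging adds the score of the newly selected token to the running total of each scorer. A sketch of that accumulation (illustrative, not the actual implementation):

def merge_scores_sketch(prev_scores, scores, idx, part_scores, part_idx):
    new_scores = {}
    for k, v in scores.items():       # full scorers: take column idx
        new_scores[k] = prev_scores[k] + v[idx]
    for k, v in part_scores.items():  # partial scorers: take column part_idx
        new_scores[k] = prev_scores[k] + v[part_idx]
    return new_scores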
-
merge_states
(states: Any, part_states: Any, part_idx: int) → Any[source]¶ Merge states for new hypothesis.
- Parameters
states – states of self.full_scorers
part_states – states of self.part_scorers
part_idx (int) – The new token id for part_scores
- Returns
- The new state dict.
Its keys are names of self.full_scorers and self.part_scorers. Its values are states of the scorers.
- Return type
Dict[str, Any]
-
post_process
(i: int, maxlen: int, maxlenratio: float, running_hyps: List[espnet.nets.beam_search.Hypothesis], ended_hyps: List[espnet.nets.beam_search.Hypothesis]) → List[espnet.nets.beam_search.Hypothesis][source]¶ Perform post-processing of beam search iterations.
- Parameters
i (int) – The length of hypothesis tokens.
maxlen (int) – The maximum length of tokens in beam search.
maxlenratio (float) – The maximum length ratio in beam search.
running_hyps (List[Hypothesis]) – The running hypotheses in beam search.
ended_hyps (List[Hypothesis]) – The ended hypotheses in beam search.
- Returns
The new running hypotheses.
- Return type
List[Hypothesis]
-
pre_beam
(scores: Dict[str, torch.Tensor], device: torch.device) → torch.Tensor[source]¶ Compute topk token ids for self.part_scorers.
- Parameters
scores (Dict[str, torch.Tensor]) – The score dict of hyp that has string keys of self.full_scorers and tensor score values; its shape is (self.n_vocab,).
device (torch.device) – The device to compute topk
- Returns
The partial token ids for self.part_scorers
- Return type
torch.Tensor
-
score
(hyp: espnet.nets.beam_search.Hypothesis, x: torch.Tensor) → Tuple[Dict[str, torch.Tensor], Dict[str, Any]][source]¶ Score new hypothesis by self.full_scorers.
- Parameters
hyp (Hypothesis) – Hypothesis with prefix tokens to score
x (torch.Tensor) – Corresponding input feature
- Returns
- Tuple of
score dict of hyp that has string keys of self.full_scorers and tensor score values of shape: (self.n_vocab,), and state dict that has string keys and state values of self.full_scorers
- Return type
Tuple[Dict[str, torch.Tensor], Dict[str, Any]]
-
score_partial
(hyp: espnet.nets.beam_search.Hypothesis, ids: torch.Tensor, x: torch.Tensor) → Tuple[Dict[str, torch.Tensor], Dict[str, Any]][source]¶ Score new hypothesis by self.part_scorers.
- Parameters
hyp (Hypothesis) – Hypothesis with prefix tokens to score
ids (torch.Tensor) – 1D tensor of new partial tokens to score
x (torch.Tensor) – Corresponding input feature
- Returns
- Tuple of
score dict of hyp that has string keys of self.part_scorers and tensor score values of shape: (len(ids),), and state dict that has string keys and state values of self.part_scorers
- Return type
Tuple[Dict[str, torch.Tensor], Dict[str, Any]]
-
class
espnet.nets.beam_search.
Hypothesis
[source]¶ Bases:
tuple
Hypothesis data type.
Create new instance of Hypothesis(yseq, score, scores, states)
-
property
score
¶ Alias for field number 1
-
property
scores
¶ Alias for field number 2
-
property
states
¶ Alias for field number 3
-
property
yseq
¶ Alias for field number 0
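Examples
Since Hypothesis is a named tuple, its fields are read directly; the values below are illustrative:

import torch
from espnet.nets.beam_search import Hypothesis

hyp = Hypothesis(
    yseq=torch.tensor([2, 5, 7]),            # prefix token ids
    score=torch.tensor(-3.2),                # accumulated weighted score
    scores={"decoder": torch.tensor(-3.2)},  # per-scorer totals
    states={"decoder": None},                # per-scorer states
)
print(hyp.yseq, float(hyp.score))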
-
espnet.nets.beam_search.
beam_search
(x: torch.Tensor, sos: int, eos: int, beam_size: int, vocab_size: int, scorers: Dict[str, espnet.nets.scorer_interface.ScorerInterface], weights: Dict[str, float], token_list: List[str] = None, maxlenratio: float = 0.0, minlenratio: float = 0.0, pre_beam_ratio: float = 1.5, pre_beam_score_key: str = 'decoder') → list[source]¶ Perform beam search with scorers.
- Parameters
x (torch.Tensor) – Encoded speech feature (T, D)
sos (int) – Start of sequence id
eos (int) – End of sequence id
beam_size (int) – The number of hypotheses kept during search
vocab_size (int) – The number of vocabulary
scorers (dict[str, ScorerInterface]) – Dict of decoder modules e.g., Decoder, CTCPrefixScorer, LM The scorer will be ignored if it is None
weights (dict[str, float]) – Dict of weights for each scorers The scorer will be ignored if its weight is 0
token_list (list[str]) – List of tokens for debug log
maxlenratio (float) – Input length ratio to obtain max output length. If maxlenratio=0.0 (default), it uses an end-detect function to automatically find maximum hypothesis lengths
minlenratio (float) – Input length ratio to obtain min output length.
pre_beam_score_key (str) – key of scores to perform pre-beam search
pre_beam_ratio (float) – beam size in the pre-beam search will be int(pre_beam_ratio * beam_size)
- Returns
N-best decoding results
- Return type
list
espnet.nets.ctc_prefix_score¶
-
class
espnet.nets.ctc_prefix_score.
CTCPrefixScore
(x, blank, eos, xp)[source]¶ Bases:
object
Compute CTC label sequence scores
which is based on Algorithm 2 in WATANABE et al. “HYBRID CTC/ATTENTION ARCHITECTURE FOR END-TO-END SPEECH RECOGNITION,” but extended to efficiently compute the probabilities of multiple labels simultaneously
-
class
espnet.nets.ctc_prefix_score.
CTCPrefixScoreTH
(x, xlens, blank, eos, beam, scoring_ratio=1.5, margin=0)[source]¶ Bases:
object
Batch processing of CTCPrefixScore
which is based on Algorithm 2 in WATANABE et al. “HYBRID CTC/ATTENTION ARCHITECTURE FOR END-TO-END SPEECH RECOGNITION,” but extended to efficiently compute the probabilities of multiple labels simultaneously
Construct CTC prefix scorer
- Parameters
x (torch.Tensor) – input label posterior sequences (B, T, O)
xlens (torch.Tensor) – input lengths (B,)
blank (int) – blank label id
eos (int) – end-of-sequence id
beam (int) – beam size
scoring_ratio (float) – ratio of #scored hypos to beam size
margin (int) – margin parameter for windowing (0 means no windowing)
espnet.nets.e2e_asr_common¶
-
class
espnet.nets.e2e_asr_common.
ErrorCalculator
(char_list, sym_space, sym_blank, report_cer=False, report_wer=False)[source]¶ Bases:
object
Calculate CER and WER for E2E_ASR and CTC models during training.
- Parameters
y_hats – numpy array with predicted text
y_pads – numpy array with true (target) text
char_list –
sym_space –
sym_blank –
espnet.nets.lm_interface¶
Language model interface.
-
class
espnet.nets.lm_interface.
LMInterface
[source]¶ Bases:
espnet.nets.scorer_interface.ScorerInterface
LM Interface for ESPnet model implementation.
-
forward
(x, t)[source]¶ Compute LM loss value from buffer sequences.
- Parameters
x (torch.Tensor) – Input ids. (batch, len)
t (torch.Tensor) – Target ids. (batch, len)
- Returns
- Tuple of
loss to backward (scalar), negative log-likelihood of t: -log p(t) (scalar) and the number of elements in x (scalar)
- Return type
tuple[torch.Tensor, torch.Tensor, torch.Tensor]
Notes
The last two return values are used in perplexity: p(t)^{-1/n} = exp(-log p(t) / n)
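For example, perplexity follows directly from the returned triple; lm, x, and t below are assumptions (a trained LMInterface model and (batch, len) id tensors):

import math

loss, nll, count = lm(x, t)
ppl = math.exp(float(nll) / float(count))  # p(t)^{-1/n} = exp(-log p(t) / n)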
-
espnet.nets.mt_interface¶
-
class
espnet.nets.mt_interface.
MTInterface
[source]¶ Bases:
object
MT Interface for ESPnet model implementation.
-
property
attention_plot_class
¶ Get attention plot class.
-
calculate_all_attentions
(xs, ilens, ys)[source]¶ Calculate attention.
- Parameters
xs (list) – list of padded input sequences [(T1, idim), (T2, idim), …]
ilens (ndarray) – batch of lengths of input sequences (B)
ys (list) – list of character id sequence tensor [(L1), (L2), (L3), …]
- Returns
attention weights (B, Lmax, Tmax)
- Return type
float ndarray
-
forward
(xs, ilens, ys)[source]¶ Compute loss for training.
- Parameters
xs – For pytorch, batch of padded source sequences torch.Tensor (B, Tmax, idim); for chainer, list of source sequences chainer.Variable
ilens – batch of lengths of source sequences (B); for pytorch, torch.Tensor; for chainer, list of int
ys – For pytorch, batch of padded target sequences torch.Tensor (B, Lmax); for chainer, list of target sequences chainer.Variable
- Returns
loss value
- Return type
torch.Tensor for pytorch, chainer.Variable for chainer
-
translate
(x, trans_args, char_list=None, rnnlm=None)[source]¶ Translate x for evaluation.
- Parameters
x (ndarray) – input acoustic feature (B, T, D) or (T, D)
trans_args (namespace) – argument namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns
N-best decoding results
- Return type
list
espnet.nets.scorer_interface¶
Scorer interface module.
-
class
espnet.nets.scorer_interface.
PartialScorerInterface
[source]¶ Bases:
espnet.nets.scorer_interface.ScorerInterface
Partial scorer interface for beam search.
The partial scorer performs scoring after the non-partial scorers have finished scoring, and receives pre-pruned next tokens to score, because it is too heavy to score all the tokens.
Examples
- Prefix search for connectionist-temporal-classification models
-
score_partial
(y: torch.Tensor, next_tokens: torch.Tensor, state: Any, x: torch.Tensor) → Tuple[torch.Tensor, Any][source]¶ Score new token (required).
- Parameters
y (torch.Tensor) – 1D prefix token
next_tokens (torch.Tensor) – torch.int64 tensor of next tokens to score
state – decoder state for prefix tokens
x (torch.Tensor) – The encoder feature that generates ys
- Returns
- Tuple of a score tensor for y that has shape (len(next_tokens),)
and next state for ys
- Return type
tuple[torch.Tensor, Any]
-
class
espnet.nets.scorer_interface.
ScorerInterface
[source]¶ Bases:
object
Scorer interface for beam search.
The scorer performs scoring of all tokens in the vocabulary.
Examples
- Search heuristics
- Decoder networks of the sequence-to-sequence models
espnet.nets.pytorch_backend.transformer.decoder.Decoder
espnet.nets.pytorch_backend.rnn.decoders.Decoder
-
final_score
(state: Any) → float[source]¶ Score eos (optional).
- Parameters
state – Scorer state for prefix tokens
- Returns
final score
- Return type
float
-
init_state
(x: torch.Tensor) → Any[source]¶ Get an initial state for decoding (optional).
- Parameters
x (torch.Tensor) – The encoded feature tensor
Returns: initial state
-
score
(y: torch.Tensor, state: Any, x: torch.Tensor) → Tuple[torch.Tensor, Any][source]¶ Score new token (required).
- Parameters
y (torch.Tensor) – 1D torch.int64 prefix tokens.
state – Scorer state for prefix tokens
x (torch.Tensor) – The encoder feature that generates ys.
- Returns
- Tuple of
scores for next token that has a shape of (n_vocab) and next state for ys
- Return type
tuple[torch.Tensor, Any]
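As an illustration, a hypothetical scorer that gives every candidate token the same constant bonus (a common heuristic against the short-hypothesis bias; not asserted to be part of this module):

import torch
from espnet.nets.scorer_interface import ScorerInterface

class LengthBonus(ScorerInterface):
    # Hypothetical scorer: one point per emitted token; weighting
    # happens in the beam search.
    def __init__(self, n_vocab):
        self.n_vocab = n_vocab

    def score(self, y, state, x):
        return torch.ones(self.n_vocab, dtype=x.dtype, device=x.device), None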
espnet.nets.tts_interface¶
TTS Interface related modules.
-
class
espnet.nets.tts_interface.
Reporter
(**links)[source]¶ Bases:
chainer.link.Chain
Reporter module.
-
class
espnet.nets.tts_interface.
TTSInterface
[source]¶ Bases:
object
TTS Interface for ESPnet model implementation.
Initialize TTS module.
-
property
attention_plot_class
¶ Plot attention weights.
-
property
base_plot_keys
¶ Return base key names to plot during training.
The keys should match what chainer.reporter reports. If you add the key loss, the reporter will report main/loss and validation/main/loss values. Also, loss.png will be created as a figure visualizing main/loss and validation/main/loss values.
- Returns
Base keys to plot during training.
- Return type
list[str]
-
calculate_all_attentions
(*args, **kwargs)[source]¶ Calculate TTS attention weights.
- Returns
Batch of attention weights (B, Lmax, Tmax).
- Return type
Tensor
espnet.nets.chainer_backend.ctc¶
-
class
espnet.nets.chainer_backend.ctc.
CTC
(odim, eprojs, dropout_rate)[source]¶ Bases:
chainer.link.Chain
Chainer implementation of CTC layer.
- Parameters
odim (int) – The output dimension.
eprojs (int | None) – Dimension of input vectors from encoder.
dropout_rate (float) – Dropout rate.
-
class
espnet.nets.chainer_backend.ctc.
WarpCTC
(odim, eprojs, dropout_rate)[source]¶ Bases:
chainer.link.Chain
Chainer implementation of warp-ctc layer.
- Parameters
odim (int) – The output dimension.
eprojs (int | None) – Dimension of input vectors from encoder.
dropout_rate (float) – Dropout rate.
espnet.nets.chainer_backend.deterministic_embed_id¶
-
class
espnet.nets.chainer_backend.deterministic_embed_id.
EmbedID
(in_size, out_size, initialW=None, ignore_label=None)[source]¶ Bases:
chainer.link.Link
Efficient linear layer for one-hot input.
This is a link that wraps the embed_id() function. This link holds the ID (word) embedding matrix W as a parameter.
- Parameters
in_size (int) – Number of different identifiers (a.k.a. vocabulary size).
out_size (int) – Output dimension.
initialW (Initializer) – Initializer to initialize the weight.
ignore_label (int) – If ignore_label is an int value, i-th column of return value is filled with 0.
See also
embed_id()
-
W
¶ Embedding parameter matrix.
- Type
Variable
Examples
>>> W = np.array([[0, 0, 0],
...               [1, 1, 1],
...               [2, 2, 2]]).astype('f')
>>> W
array([[ 0.,  0.,  0.],
       [ 1.,  1.,  1.],
       [ 2.,  2.,  2.]], dtype=float32)
>>> l = L.EmbedID(W.shape[0], W.shape[1], initialW=W)
>>> x = np.array([2, 1]).astype('i')
>>> x
array([2, 1], dtype=int32)
>>> y = l(x)
>>> y.data
array([[ 2.,  2.,  2.],
       [ 1.,  1.,  1.]], dtype=float32)
-
ignore_label
= None¶
-
class
espnet.nets.chainer_backend.deterministic_embed_id.
EmbedIDFunction
(ignore_label=None)[source]¶ Bases:
chainer.function_node.FunctionNode
-
backward
(indexes, grad_outputs)[source]¶ Computes gradients w.r.t. specified inputs given output gradients.
This method is used to compute one step of the backpropagation corresponding to the forward computation of this function node. Given the gradients w.r.t. output variables, this method computes the gradients w.r.t. specified input variables. Note that this method does not need to compute any input gradients not specified by target_input_indexes.
Unlike Function.backward(), gradients are given as Variable objects and this method itself has to return input gradients as Variable objects. It enables the function node to return the input gradients with the full computational history, in which case it supports differentiable backpropagation or higher-order differentiation.
The default implementation returns Nones, which means the function is not differentiable.
- Parameters
target_input_indexes (tuple of int) – Sorted indices of the input variables w.r.t. which the gradients are required. It is guaranteed that this tuple contains at least one element.
grad_outputs (tuple of Variable) – Gradients w.r.t. the output variables. If the gradient w.r.t. an output variable is not given, the corresponding element is None.
- Returns
Tuple of variables that represent the gradients w.r.t. specified input variables. The length of the tuple can be the same as either len(target_input_indexes) or the number of inputs. In the latter case, the elements not specified by target_input_indexes will be discarded.
See also
backward_accumulate()
provides an alternative interface that allows you to implement the backward computation fused with the gradient accumulation.
-
check_type_forward
(in_types)[source]¶ Checks types of input data before forward propagation.
This method is called before forward() and validates the types of input variables using the type checking utilities.
- Parameters
in_types (TypeInfoTuple) – The type information of input variables for forward().
-
forward
(inputs)[source]¶ Computes the output arrays from the input arrays.
It delegates the procedure to forward_cpu() or forward_gpu() by default. Which of them this method selects is determined by the type of input arrays. Implementations of FunctionNode must implement either CPU/GPU methods or this method.
- Parameters
inputs – Tuple of input array(s).
- Returns
Tuple of output array(s).
Warning
Implementations of FunctionNode must take care that the return value must be a tuple even if it returns only one array.
-
-
class
espnet.nets.chainer_backend.deterministic_embed_id.
EmbedIDGrad
(w_shape, ignore_label=None)[source]¶ Bases:
chainer.function_node.FunctionNode
-
backward
(indexes, grads)[source]¶ Computes gradients w.r.t. specified inputs given output gradients.
This method is used to compute one step of the backpropagation corresponding to the forward computation of this function node. Given the gradients w.r.t. output variables, this method computes the gradients w.r.t. specified input variables. Note that this method does not need to compute any input gradients not specified by target_input_indexes.
Unlike Function.backward(), gradients are given as Variable objects and this method itself has to return input gradients as Variable objects. It enables the function node to return the input gradients with the full computational history, in which case it supports differentiable backpropagation or higher-order differentiation.
The default implementation returns Nones, which means the function is not differentiable.
- Parameters
target_input_indexes (tuple of int) – Sorted indices of the input variables w.r.t. which the gradients are required. It is guaranteed that this tuple contains at least one element.
grad_outputs (tuple of Variable) – Gradients w.r.t. the output variables. If the gradient w.r.t. an output variable is not given, the corresponding element is None.
- Returns
Tuple of variables that represent the gradients w.r.t. specified input variables. The length of the tuple can be the same as either len(target_input_indexes) or the number of inputs. In the latter case, the elements not specified by target_input_indexes will be discarded.
See also
backward_accumulate()
provides an alternative interface that allows you to implement the backward computation fused with the gradient accumulation.
-
forward
(inputs)[source]¶ Computes the output arrays from the input arrays.
It delegates the procedure to forward_cpu() or forward_gpu() by default. Which of them this method selects is determined by the type of input arrays. Implementations of FunctionNode must implement either CPU/GPU methods or this method.
- Parameters
inputs – Tuple of input array(s).
- Returns
Tuple of output array(s).
Warning
Implementations of FunctionNode must take care that the return value must be a tuple even if it returns only one array.
-
-
espnet.nets.chainer_backend.deterministic_embed_id.
embed_id
(x, W, ignore_label=None)[source]¶ Efficient linear function for one-hot input.
This function implements so-called word embeddings. It takes two arguments: a set of IDs (words) x in a \(B\)-dimensional integer vector, and a set of all ID (word) embeddings W in a \(V \times d\) float32 matrix. It outputs a \(B \times d\) matrix whose i-th column is the x[i]-th column of W. This function is only differentiable on the input W.
- Parameters
x (chainer.Variable | np.ndarray) – Batch vectors of IDs. Each element must be signed integer.
W (chainer.Variable | np.ndarray) – Distributed representation of each ID (a.k.a. word embeddings).
ignore_label (int) – If ignore_label is an int value, i-th column of return value is filled with 0.
- Returns
Embedded variable.
- Return type
chainer.Variable
See also
EmbedID
Examples
>>> x = np.array([2, 1]).astype('i')
>>> x
array([2, 1], dtype=int32)
>>> W = np.array([[0, 0, 0],
...               [1, 1, 1],
...               [2, 2, 2]]).astype('f')
>>> W
array([[ 0.,  0.,  0.],
       [ 1.,  1.,  1.],
       [ 2.,  2.,  2.]], dtype=float32)
>>> F.embed_id(x, W).data
array([[ 2.,  2.,  2.],
       [ 1.,  1.,  1.]], dtype=float32)
>>> F.embed_id(x, W, ignore_label=1).data
array([[ 2.,  2.,  2.],
       [ 0.,  0.,  0.]], dtype=float32)
espnet.nets.chainer_backend.e2e_asr¶
-
class
espnet.nets.chainer_backend.e2e_asr.
E2E
(idim, odim, args, flag_return=True)[source]¶ Bases:
espnet.nets.asr_interface.ASRInterface
,chainer.link.Chain
E2E module for chainer backend.
- Parameters
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
args (parser.args) – Training config.
flag_return (bool) – If True, train() would return additional metrics in addition to the training loss.
-
calculate_all_attentions
(xs, ilens, ys)[source]¶ E2E attention calculation.
- Parameters
xs (List) – List of padded input sequences. [(T1, idim), (T2, idim), …]
ilens (np.ndarray) – Batch of lengths of input sequences. (B)
ys (List) – List of character id sequence tensor. [(L1), (L2), (L3), …]
- Returns
Attention weights. (B, Lmax, Tmax)
- Return type
float np.ndarray
-
forward
(xs, ilens, ys)[source]¶ E2E forward propagation.
- Parameters
xs (chainer.Variable) – Batch of padded character ids. (B, Tmax)
ilens (chainer.Variable) – Batch of length of each input batch. (B,)
ys (chainer.Variable) – Batch of padded target features. (B, Lmax, odim)
- Returns
Loss calculated by attention and CTC loss. float (optional): CTC loss. float (optional): Attention loss. float (optional): Accuracy.
- Return type
float
-
recognize
(x, recog_args, char_list, rnnlm=None)[source]¶ E2E greedy/beam search.
- Parameters
x (chainer.Variable) – Input tensor for recognition.
recog_args (parser.args) – Arguments of config file.
char_list (List[str]) – List of characters.
rnnlm (Module) – RNNLM module defined at espnet.lm.chainer_backend.lm.
- Returns
Result of recognition.
- Return type
List[Dict[str, Any]]
espnet.nets.chainer_backend.e2e_asr_transformer¶
-
class
espnet.nets.chainer_backend.e2e_asr_transformer.
E2E
(idim, odim, args, ignore_id=-1, flag_return=True)[source]¶ Bases:
espnet.nets.asr_interface.ASRInterface
,chainer.link.Chain
E2E module.
- Parameters
idim (int) – Dimension of inputs.
odim (int) – Dimension of outputs.
args (Namespace) – Training config.
flag_return (bool) – If True, then the return value of forward() will be a tuple of (loss, loss_ctc, loss_att, acc)
-
property
attention_plot_class
¶ Get attention plot class.
-
calculate_all_attentions
(xs, ilens, ys)[source]¶ E2E attention calculation.
- Parameters
xs (List) – List of padded input sequences. [(T1, idim), (T2, idim), …]
ilens (ndarray) – Batch of lengths of input sequences. (B)
ys (List) – List of character id sequence tensor. [(L1), (L2), (L3), …]
- Returns
Attention weights. (B, Lmax, Tmax)
- Return type
float ndarray
-
forward
(xs, ilens, ys_pad, calculate_attentions=False)[source]¶ E2E forward propagation.
- Parameters
xs (chainer.Variable) – Batch of padded character ids. (B, Tmax)
ilens (chainer.Variable) – Batch of length of each input batch. (B,)
ys (chainer.Variable) – Batch of padded target features. (B, Lmax, odim)
calculate_attentions (bool) – If true, return value is the output of encoder.
- Returns
Training loss. float (optional): Training loss for CTC. float (optional): Training loss for attention. float (optional): Accuracy. chainer.Variable (optional): Output of the encoder.
- Return type
float
-
recognize
(x_block, recog_args, char_list=None, rnnlm=None)[source]¶ E2E beam search.
- Parameters
x (ndarray) – Input acoustic feature (B, T, D) or (T, D).
recog_args (Namespace) – Argument namespace containing options.
char_list (List[str]) – List of characters.
rnnlm (torch.nn.Module) – Language model module defined at espnet.lm.chainer_backend.lm.
- Returns
N-best decoding results.
- Return type
List
espnet.nets.chainer_backend.nets_utils¶
-
espnet.nets.chainer_backend.nets_utils.
linear_tensor
(linear, x)[source]¶ Apply linear matrix operation only for the last dimension of a tensor.
- Parameters
linear (Link) – Linear link. (M x N matrix)
x (chainer.Variable) – Tensor. (D_1 x D_2 x … x M matrix)
- Returns
Tensor. (D_1 x D_2 x … x N matrix)
- Return type
chainer.Variable
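A sketch of the equivalent computation: fold the leading dimensions, apply the link, and unfold again (shapes here are illustrative):

import numpy as np
import chainer.functions as F
import chainer.links as L

linear = L.Linear(4, 3)                          # M=4 -> N=3
x = np.random.randn(2, 5, 4).astype(np.float32)  # (D_1, D_2, M)
y = F.reshape(linear(F.reshape(x, (-1, 4))), (2, 5, 3))  # (D_1, D_2, N)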
espnet.nets.chainer_backend.rnn.attentions¶
-
class
espnet.nets.chainer_backend.rnn.attentions.
AttDot
(eprojs, dunits, att_dim)[source]¶ Bases:
chainer.link.Chain
Compute attention based on dot product.
- Parameters
eprojs (int | None) – Dimension of input vectors from encoder.
dunits (int | None) – Dimension of input vectors for decoder.
att_dim (int) – Dimension of input vectors for attention.
-
class
espnet.nets.chainer_backend.rnn.attentions.
AttLoc
(eprojs, dunits, att_dim, aconv_chans, aconv_filts)[source]¶ Bases:
chainer.link.Chain
Compute location-based attention.
- Parameters
eprojs (int | None) – Dimension of input vectors from encoder.
dunits (int | None) – Dimension of input vectors for decoder.
att_dim (int) – Dimension of input vectors for attention.
aconv_chans (int) – Number of channels of output arrays from convolutional layer.
aconv_filts (int) – Size of filters of convolutional layer.
espnet.nets.chainer_backend.rnn.decoders¶
-
class
espnet.nets.chainer_backend.rnn.decoders.
Decoder
(eprojs, odim, dtype, dlayers, dunits, sos, eos, att, verbose=0, char_list=None, labeldist=None, lsm_weight=0.0, sampling_probability=0.0)[source]¶ Bases:
chainer.link.Chain
Decoder layer.
- Parameters
eprojs (int) – Dimension of input variables from encoder.
odim (int) – The output dimension.
dtype (str) – Decoder type.
dlayers (int) – Number of layers for decoder.
dunits (int) – Dimension of input vector of decoder.
sos (int) – Number to indicate the start of sequences.
eos (int) – Number to indicate the end of sequences.
att (Module) – Attention module defined at espnet.nets.chainer_backend.rnn.attentions.
verbose (int) – Verbosity level.
char_list (List[str]) – List of all characters.
labeldist (numpy.array) – Label distribution computed from transcript label counts.
lsm_weight (float) – Weight to use when calculating the training loss.
sampling_probability (float) – Threshold for scheduled sampling.
-
calculate_all_attentions
(hs, ys)[source]¶ Calculate all attention weights.
- Parameters
hs (list of chainer.Variable | N-dimensional array) – Input variable from encoder.
ys (list of chainer.Variable | N-dimensional array) – Input variable of decoder.
- Returns
List of attention weights.
- Return type
chainer.Variable
-
recognize_beam
(h, lpz, recog_args, char_list, rnnlm=None)[source]¶ Beam search implementation.
- Parameters
h (chainer.Variable) – One of the output from the encoder.
lpz (chainer.Variable | None) – Result of net propagation.
recog_args (Namespace) – The argument.
char_list (List[str]) – List of all characters.
rnnlm (Module) – RNNLM module. Defined at espnet.lm.chainer_backend.lm.
- Returns
Result of recognition.
- Return type
List[Dict[str,Any]]
-
espnet.nets.chainer_backend.rnn.decoders.
decoder_for
(args, odim, sos, eos, att, labeldist)[source]¶ Return the decoding layer corresponding to the args.
- Parameters
args (Namespace) – The program arguments.
odim (int) – The output dimension.
sos (int) – Number to indicate the start of sequences.
eos (int) – Number to indicate the end of sequences.
att (Module) – Attention module defined at espnet.nets.chainer_backend.attentions.
labeldist (numpy.array) – Label distribution computed from transcript label counts.
- Returns
The decoder module.
- Return type
chainer.Chain
espnet.nets.chainer_backend.rnn.encoders¶
-
class
espnet.nets.chainer_backend.rnn.encoders.
Encoder
(etype, idim, elayers, eunits, eprojs, subsample, dropout, in_channel=1)[source]¶ Bases:
chainer.link.Chain
Encoder network class.
- Parameters
etype (str) – Type of encoder network.
idim (int) – Number of dimensions of encoder network.
elayers (int) – Number of layers of encoder network.
eunits (int) – Number of lstm units of encoder network.
eprojs (int) – Number of projection units of encoder network.
subsample (np.ndarray) – Subsampling factors, e.g., 1_2_2_2_1.
dropout (float) – Dropout rate.
in_channel (int) – Number of input channels.
-
class
espnet.nets.chainer_backend.rnn.encoders.
RNN
(idim, elayers, cdim, hdim, dropout, typ='lstm')[source]¶ Bases:
chainer.link.Chain
RNN Module.
- Parameters
idim (int) – Dimension of the input.
elayers (int) – Number of encoder layers.
cdim (int) – Number of rnn units.
hdim (int) – Number of projection units.
dropout (float) – Dropout rate.
typ (str) – RNN type.
-
class
espnet.nets.chainer_backend.rnn.encoders.
RNNP
(idim, elayers, cdim, hdim, subsample, dropout, typ='blstm')[source]¶ Bases:
chainer.link.Chain
RNN with projection layer module.
- Parameters
idim (int) – Dimension of inputs.
elayers (int) – Number of encoder layers.
cdim (int) – Number of rnn units. (doubled to cdim * 2 if bidirectional)
hdim (int) – Number of projection units.
subsample (np.ndarray) – List used to subsample the input array.
dropout (float) – Dropout rate.
typ (str) – The RNN type.
espnet.nets.chainer_backend.rnn.training¶
-
class
espnet.nets.chainer_backend.rnn.training.
CustomConverter
(subsampling_factor=1)[source]¶ Bases:
object
Custom Converter.
- Parameters
subsampling_factor (int) – The subsampling factor.
-
class
espnet.nets.chainer_backend.rnn.training.
CustomParallelUpdater
(train_iters, optimizer, converter, devices, accum_grad=1)[source]¶ Bases:
chainer.training.updaters.multiprocess_parallel_updater.MultiprocessParallelUpdater
Custom Parallel Updater for chainer.
Defines the main update routine.
- Parameters
train_iter (iterator | dict[str, iterator]) – Dataset iterator for the training dataset. It can also be a dictionary that maps strings to iterators. If this is just an iterator, then the iterator is registered by the name 'main'.
optimizer (optimizer | dict[str, optimizer]) – Optimizer to update parameters. It can also be a dictionary that maps strings to optimizers. If this is just an optimizer, then the optimizer is registered by the name 'main'.
converter (espnet.asr.chainer_backend.asr.CustomConverter) – Converter function to build input arrays. Each batch extracted by the main iterator and the device option are passed to this function. chainer.dataset.concat_examples() is used by default.
device (torch.device) – Device to which the training data is sent. A negative value indicates the host memory (CPU).
accum_grad (int) – The number of gradient accumulation steps. If set to 2, the network parameters will be updated once every two iterations, i.e., the effective batch size will be doubled.
-
class
espnet.nets.chainer_backend.rnn.training.
CustomUpdater
(train_iter, optimizer, converter, device, accum_grad=1)[source]¶ Bases:
chainer.training.updaters.standard_updater.StandardUpdater
Custom updater for chainer.
- Parameters
train_iter (iterator | dict[str, iterator]) – Dataset iterator for the training dataset. It can also be a dictionary that maps strings to iterators. If this is just an iterator, then the iterator is registered by the name 'main'.
optimizer (optimizer | dict[str, optimizer]) – Optimizer to update parameters. It can also be a dictionary that maps strings to optimizers. If this is just an optimizer, then the optimizer is registered by the name 'main'.
converter (espnet.asr.chainer_backend.asr.CustomConverter) – Converter function to build input arrays. Each batch extracted by the main iterator and the device option are passed to this function. chainer.dataset.concat_examples() is used by default.
device (int or dict) – The destination device info to send variables. In the case of CPU or single GPU, device=-1 or 0, respectively. In the case of multi-GPU, device={'main': 0, 'sub_1': 1, …}.
accum_grad (int) – The number of gradient accumulation steps. If set to 2, the network parameters will be updated once every two iterations, i.e., the effective batch size will be doubled.
espnet.nets.chainer_backend.transformer.attention¶
-
class
espnet.nets.chainer_backend.transformer.attention.
MultiHeadAttention
(n_units, h=8, dropout=0.1, initialW=None, initial_bias=None)[source]¶ Bases:
chainer.link.Chain
Multi Head Attention Layer.
- Parameters
n_units (int) – Number of input units.
h (int) – Number of attention heads.
dropout (float) – Dropout rate.
initialW – Initializer to initialize the weight.
initial_bias – Initializer to initialize the bias.
espnet.nets.chainer_backend.transformer.decoder¶
-
class
espnet.nets.chainer_backend.transformer.decoder.
Decoder
(odim, args, initialW=None, initial_bias=None)[source]¶ Bases:
chainer.link.Chain
Decoder layer.
- Parameters
odim (int) – The output dimension.
n_layers (int) – Number of decoder layers.
n_units (int) – Number of attention units.
d_units (int) – Dimension of input vector of decoder.
h (int) – Number of attention heads.
dropout (float) – Dropout rate.
initialW (Initializer) – Initializer to initialize the weight.
initial_bias (Initializer) – Initializer to initialize the bias.
-
forward
(e, yy_mask, source, xy_mask)[source]¶ Definition of the decoder layer.
- Parameters
e (chainer.Variable) – Input variable to the decoder from the encoder.
yy_mask (chainer.Variable) – Attention mask considering ys as the source and target block.
source (List) – Input sequences padded with sos using the pad_sequence method.
xy_mask (chainer.Variable) – Attention mask considering ys and xs as the source/target block.
- Returns
Decoder layer.
- Return type
chainer.Chain
espnet.nets.chainer_backend.transformer.decoder_layer¶
espnet.nets.chainer_backend.transformer.embedding¶
espnet.nets.chainer_backend.transformer.encoder¶
-
class
espnet.nets.chainer_backend.transformer.encoder.
Encoder
(idim, args, initialW=None, initial_bias=None)[source]¶ Bases:
chainer.link.Chain
Encoder.
- Parameters
input_type (str) – Sampling type. input_type must be 'conv2d' or 'linear' currently.
idim (int) – Dimension of inputs.
n_layers (int) – Number of encoder layers.
n_units (int) – Number of input/output dimension of a FeedForward layer.
d_units (int) – Number of units of hidden layer in a FeedForward layer.
h (int) – Number of attention heads.
dropout (float) – Dropout rate.
espnet.nets.chainer_backend.transformer.encoder_layer¶
espnet.nets.chainer_backend.transformer.label_smoothing_loss¶
espnet.nets.chainer_backend.transformer.layer_norm¶
espnet.nets.chainer_backend.transformer.plot¶
-
class
espnet.nets.chainer_backend.transformer.plot.
PlotAttentionReport
(att_vis_fn, data, outdir, converter, transform, device, reverse=False, ikey='input', iaxis=0, okey='output', oaxis=0)[source]¶
-
espnet.nets.chainer_backend.transformer.plot.
plot_multi_head_attention
(data, attn_dict, outdir, suffix='png', savefn=<function savefig>)[source]¶ Plot multi-head attentions.
- Parameters
data (dict) – utts info from json file
attn_dict (dict[str, torch.Tensor]) – multi-head attention dict. Values should be torch.Tensor (head, input_length, output_length)
outdir (str) – dir to save fig
suffix (str) – filename suffix including image type (e.g., png)
savefn – function to save
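A sketch of rendering one such dict entry with matplotlib under the shape convention above; the data and filename are stand-ins, and this mirrors, but is not, the actual implementation:

import numpy as np
import matplotlib.pyplot as plt

att = np.random.rand(4, 30, 20)  # (head, input_length, output_length)
fig, axes = plt.subplots(1, att.shape[0], figsize=(16, 4))
for h, ax in enumerate(axes):
    ax.imshow(att[h], aspect="auto")
    ax.set_title("head %d" % h)
fig.savefig("utt1_att.png")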
espnet.nets.chainer_backend.transformer.positionwise_feed_forward¶
espnet.nets.chainer_backend.transformer.subsampling¶
espnet.nets.chainer_backend.transformer.training¶
-
class
espnet.nets.chainer_backend.transformer.training.
CustomConverter
(subsampling_factor=1)[source]¶ Bases:
object
Custom Converter.
- Parameters
subsampling_factor (int) – The subsampling factor.
-
class
espnet.nets.chainer_backend.transformer.training.
CustomParallelUpdater
(train_iters, optimizer, converter, devices, accum_grad=1)[source]¶ Bases:
chainer.training.updaters.multiprocess_parallel_updater.MultiprocessParallelUpdater
Custom Parallel Updater for chainer.
Defines the main update routine.
- Parameters
train_iter (iterator | dict[str, iterator]) – Dataset iterator for the training dataset. It can also be a dictionary that maps strings to iterators. If this is just an iterator, then the iterator is registered by the name 'main'.
optimizer (optimizer | dict[str, optimizer]) – Optimizer to update parameters. It can also be a dictionary that maps strings to optimizers. If this is just an optimizer, then the optimizer is registered by the name 'main'.
converter (espnet.asr.chainer_backend.asr.CustomConverter) – Converter function to build input arrays. Each batch extracted by the main iterator and the device option are passed to this function. chainer.dataset.concat_examples() is used by default.
device (torch.device) – Device to which the training data is sent. A negative value indicates the host memory (CPU).
accum_grad (int) – The number of gradient accumulation steps. If set to 2, the network parameters will be updated once every two iterations, i.e., the effective batch size will be doubled.
-
class
espnet.nets.chainer_backend.transformer.training.
CustomUpdater
(train_iter, optimizer, converter, device, accum_grad=1)[source]¶ Bases:
chainer.training.updaters.standard_updater.StandardUpdater
Custom updater for chainer.
- Parameters
train_iter (iterator | dict[str, iterator]) – Dataset iterator for the training dataset. It can also be a dictionary that maps strings to iterators. If this is just an iterator, then the iterator is registered by the name 'main'.
optimizer (optimizer | dict[str, optimizer]) – Optimizer to update parameters. It can also be a dictionary that maps strings to optimizers. If this is just an optimizer, then the optimizer is registered by the name 'main'.
converter (espnet.asr.chainer_backend.asr.CustomConverter) – Converter function to build input arrays. Each batch extracted by the main iterator and the device option are passed to this function. chainer.dataset.concat_examples() is used by default.
device (int or dict) – The destination device info to send variables. In the case of CPU or single GPU, device=-1 or 0, respectively. In the case of multi-GPU, device={'main': 0, 'sub_1': 1, …}.
accum_grad (int) – The number of gradient accumulation steps. If set to 2, the network parameters will be updated once every two iterations, i.e., the effective batch size will be doubled.
-
class
espnet.nets.chainer_backend.transformer.training.
VaswaniRule
(attr, d, warmup_steps=4000, init=None, target=None, optimizer=None, scale=1.0)[source]¶ Bases:
chainer.training.extension.Extension
Trainer extension to shift an optimizer attribute magically by Vaswani.
- Parameters
attr (str) – Name of the attribute to shift.
rate (float) – Rate of the exponential shift. This value is multiplied to the attribute at each call.
init (float) – Initial value of the attribute. If it is None, the extension extracts the attribute at the first call and uses it as the initial value.
target (float) – Target value of the attribute. If the attribute reaches this value, the shift stops.
optimizer (Optimizer) – Target optimizer to adjust the attribute. If it is None, the main optimizer of the updater is used.
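The learning-rate schedule from Vaswani et al., “Attention Is All You Need,” that gives this extension its name is, as a sketch (parameter values are illustrative):

def vaswani_lr(step, d=256, warmup_steps=4000, scale=1.0):
    # lr = scale * d^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    return scale * d ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)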
-
initialize
(trainer)[source]¶ Initializes the trainer state.
This method is called before entering the training loop. An extension that modifies the state of Trainer can override this method to initialize it.
When the trainer has been restored from a snapshot, this method has to recover an appropriate part of the state of the trainer.
For example, the ExponentialShift extension changes the optimizer's hyperparameter at each invocation. Note that the hyperparameter is not saved to the snapshot; it is the responsibility of the extension to recover the hyperparameter. The ExponentialShift extension recovers it in its initialize method if it has been loaded from a snapshot, or just sets the initial value otherwise.
- Parameters
trainer (Trainer) – Trainer object that runs the training loop.
espnet.nets.pytorch_backend.ctc¶
-
class
espnet.nets.pytorch_backend.ctc.
CTC
(odim, eprojs, dropout_rate, ctc_type='warpctc', reduce=True)[source]¶ Bases:
torch.nn.modules.module.Module
CTC module
- Parameters
odim (int) – dimension of outputs
eprojs (int) – number of encoder projection units
dropout_rate (float) – dropout rate (0.0 ~ 1.0)
ctc_type (str) – builtin or warpctc
reduce (bool) – reduce the CTC loss into a scalar
-
argmax
(hs_pad)[source]¶ argmax of frame activations
- Parameters
hs_pad (torch.Tensor) – 3d tensor (B, Tmax, eprojs)
- Returns
argmax-applied 2D tensor (B, Tmax)
- Return type
torch.Tensor
-
forward
(hs_pad, hlens, ys_pad)[source]¶ CTC forward
- Parameters
hs_pad (torch.Tensor) – batch of padded hidden state sequences (B, Tmax, D)
hlens (torch.Tensor) – batch of lengths of hidden state sequences (B)
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)
- Returns
ctc loss value
- Return type
torch.Tensor
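An illustrative call under the shapes documented above; the tensors are random stand-ins, and -1 is assumed to be the padding id:

import torch
from espnet.nets.pytorch_backend.ctc import CTC

ctc = CTC(odim=5, eprojs=8, dropout_rate=0.0, ctc_type="builtin")
hs_pad = torch.randn(2, 10, 8)                  # (B, Tmax, eprojs)
hlens = torch.tensor([10, 7])                   # (B,)
ys_pad = torch.tensor([[1, 2, 3], [1, 2, -1]])  # (B, Lmax), -1 = pad
loss = ctc(hs_pad, hlens, ys_pad)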
-
espnet.nets.pytorch_backend.ctc.
ctc_for
(args, odim, reduce=True)[source]¶ Returns the CTC module for the given args and output dimension
- Parameters
args (Namespace) – the program args
odim (int) – the output dimension
reduce (bool) – return the CTC loss in a scalar
- Returns
the corresponding CTC module
espnet.nets.pytorch_backend.e2e_asr¶
-
class
espnet.nets.pytorch_backend.e2e_asr.
E2E
(idim, odim, args)[source]¶ Bases:
espnet.nets.asr_interface.ASRInterface
,torch.nn.modules.module.Module
E2E module
- Parameters
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
-
calculate_all_attentions
(xs_pad, ilens, ys_pad)[source]¶ E2E attention calculation
- Parameters
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)
- Returns
attention weights with the following shape, 1) multi-head case => attention weights (B, H, Lmax, Tmax), 2) other case => attention weights (B, Lmax, Tmax).
- Return type
float ndarray
-
encode
(x)[source]¶ Encode feature in beam_search (optional).
- Parameters
x (numpy.ndarray) – input feature (T, D)
- Returns
encoded feature (T, D)
- Return type
torch.Tensor for pytorch, chainer.Variable for chainer
-
enhance
(xs)[source]¶ Forwarding only the frontend stage
- Parameters
xs (ndarray) – input acoustic feature (T, C, F)
-
forward
(xs_pad, ilens, ys_pad)[source]¶ E2E forward
- Parameters
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)
- Returns
loss value
- Return type
torch.Tensor
-
init_like_chainer
()[source]¶ Initialize weight like chainer
Chainer basically uses the LeCun way: W ~ Normal(0, fan_in ** -0.5), b = 0. PyTorch basically uses W, b ~ Uniform(-fan_in ** -0.5, fan_in ** -0.5).
However, there are two exceptions as far as I know: EmbedID.W ~ Normal(0, 1), and LSTM.upward.b[forget_gate_range] = 1 (but not used in NStepLSTM).
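A sketch of the chainer-style (LeCun) initialization described above for a single PyTorch linear layer (illustrative, not the exact routine):

import math
import torch

def lecun_init_(layer: torch.nn.Linear) -> None:
    fan_in = layer.weight.size(1)
    # W ~ Normal(0, fan_in ** -0.5), b = 0
    torch.nn.init.normal_(layer.weight, mean=0.0, std=math.sqrt(1.0 / fan_in))
    torch.nn.init.zeros_(layer.bias)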
-
recognize
(x, recog_args, char_list, rnnlm=None)[source]¶ E2E beam search
- Parameters
x (ndarray) – input acoustic feature (T, D)
recog_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns
N-best decoding results
- Return type
list
-
recognize_batch
(xs, recog_args, char_list, rnnlm=None)[source]¶ E2E beam search
- Parameters
xs (list) – list of input acoustic feature arrays [(T_1, D), (T_2, D), …]
recog_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns
N-best decoding results
- Return type
list
-
scorers
()[source]¶ Get scorers for beam_search (optional).
- Returns
dict of ScorerInterface objects
- Return type
dict[str, ScorerInterface]
espnet.nets.pytorch_backend.e2e_asr_mix¶
-
class
espnet.nets.pytorch_backend.e2e_asr_mix.
E2E
(idim, odim, args)[source]¶ Bases:
espnet.nets.asr_interface.ASRInterface
,torch.nn.modules.module.Module
E2E module
- Parameters
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
-
calculate_all_attentions
(xs_pad, ilens, ys_pad_sd)[source]¶ E2E attention calculation
- Parameters
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad_sd (torch.Tensor) – batch of padded character id sequence tensor (B, num_spkrs, Lmax)
- Returns
attention weights with the following shape, 1) multi-head case => attention weights (B, H, Lmax, Tmax), 2) other case => attention weights (B, Lmax, Tmax).
- Return type
float ndarray
-
forward
(xs_pad, ilens, ys_pad_sd)[source]¶ E2E forward
- Parameters
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad_sd (torch.Tensor) – batch of padded character id sequence tensor (B, num_spkrs, Lmax)
- Returns
ctc loss value
- Return type
torch.Tensor
- Returns
attention loss value
- Return type
torch.Tensor
- Returns
accuracy in attention decoder
- Return type
float
-
init_like_chainer
()[source]¶ Initialize weight like chainer
Chainer basically uses the LeCun way: W ~ Normal(0, fan_in ** -0.5), b = 0. PyTorch basically uses W, b ~ Uniform(-fan_in ** -0.5, fan_in ** -0.5).
However, there are two exceptions as far as I know: EmbedID.W ~ Normal(0, 1), and LSTM.upward.b[forget_gate_range] = 1 (but not used in NStepLSTM).
-
recognize
(x, recog_args, char_list, rnnlm=None)[source]¶ E2E beam search
- Parameters
x (ndarray) – input acoustic feature (T, D)
recog_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns
N-best decoding results
- Return type
list
-
recognize_batch
(xs, recog_args, char_list, rnnlm=None)[source]¶ E2E beam search
- Parameters
xs (ndarray) – input acoustic feature (T, D)
recog_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns
N-best decoding results
- Return type
list
-
class
espnet.nets.pytorch_backend.e2e_asr_mix.
Encoder
(etype, idim, elayers_sd, elayers_rec, eunits, eprojs, subsample, dropout, num_spkrs=2, in_channel=1)[source]¶ Bases:
torch.nn.modules.module.Module
Encoder module
- Parameters
etype (str) – type of encoder network
idim (int) – number of dimensions of encoder network
elayers_sd (int) – number of layers of speaker differentiate part in encoder network
elayers_rec (int) – number of layers of shared recognition part in encoder network
eunits (int) – number of lstm units of encoder network
eprojs (int) – number of projection units of encoder network
subsample (np.ndarray) – list of subsampling numbers
dropout (float) – dropout rate
in_channel (int) – number of input channels
num_spkrs (int) – number of speakers
-
class
espnet.nets.pytorch_backend.e2e_asr_mix.
PIT
(num_spkrs)[source]¶ Bases:
object
Permutation Invariant Training (PIT) module
- Parameters
num_spkrs (int) – number of speakers for PIT process (2 or 3)
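The idea of PIT is to evaluate the loss under every assignment of hypotheses to references and keep the minimum. A self-contained sketch of that reduction (illustrative, not the actual ESPnet implementation):

from itertools import permutations

import torch

def pit_loss(pairwise: torch.Tensor) -> torch.Tensor:
    # pairwise[i, j] = loss of hypothesis i against reference j
    n = pairwise.size(0)
    perm_losses = [sum(pairwise[i, p[i]] for i in range(n)) / n
                   for p in permutations(range(n))]
    return torch.stack(perm_losses).min()

loss = pit_loss(torch.rand(2, 2))  # num_spkrs = 2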
espnet.nets.pytorch_backend.e2e_asr_transducer¶
-
class
espnet.nets.pytorch_backend.e2e_asr_transducer.
E2E
(idim, odim, args)[source]¶ Bases:
espnet.nets.asr_interface.ASRInterface
,torch.nn.modules.module.Module
E2E module
- Parameters
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (namespace) – argument Namespace containing options
-
calculate_all_attentions
(xs_pad, ilens, ys_pad)[source]¶ E2E attention calculation
- Parameters
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)
- Returns
- attention weights with the following shape,
multi-head case => attention weights (B, H, Lmax, Tmax),
other case => attention weights (B, Lmax, Tmax).
- Return type
att_ws (ndarray)
-
enhance
(xs)[source]¶ Forwarding only the frontend stage
- Parameters
xs (ndarray) – input acoustic feature (T, C, F)
- Returns
enhanced (ndarray), mask (torch.Tensor), and ilens (torch.Tensor): batch of lengths of input sequences (B)
- Return type
tuple
-
forward
(xs_pad, ilens, ys_pad)[source]¶ E2E forward
- Parameters
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)
- Returns
transducer loss value
- Return type
loss (torch.Tensor)
-
init_like_chainer
()[source]¶ Initialize weight like chainer
Chainer basically uses the LeCun way: W ~ Normal(0, fan_in ** -0.5), b = 0. PyTorch basically uses W, b ~ Uniform(-fan_in ** -0.5, fan_in ** -0.5).
However, there are two exceptions as far as I know: EmbedID.W ~ Normal(0, 1), and LSTM.upward.b[forget_gate_range] = 1 (but not used in NStepLSTM).
-
recognize
(x, recog_args, char_list, rnnlm=None)[source]¶ E2E recognize
- Parameters
x (ndarray) – input acoustic feature (T, D)
recog_args (namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns
n-best decoding results
- Return type
y (list)
espnet.nets.pytorch_backend.e2e_asr_transformer¶
-
class
espnet.nets.pytorch_backend.e2e_asr_transformer.
E2E
(idim, odim, args, ignore_id=-1)[source]¶ Bases:
espnet.nets.asr_interface.ASRInterface
,torch.nn.modules.module.Module
-
property
attention_plot_class
¶ Get attention plot class.
-
calculate_all_attentions
(xs_pad, ilens, ys_pad)[source]¶ E2E attention calculation
- Parameters
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)
- Returns
attention weights with the following shape, 1) multi-head case => attention weights (B, H, Lmax, Tmax), 2) other case => attention weights (B, Lmax, Tmax).
- Return type
float ndarray
-
encode
(feat)[source]¶ Encode feature in beam_search (optional).
- Parameters
x (numpy.ndarray) – input feature (T, D)
- Returns
encoded feature (T, D)
- Return type
torch.Tensor for pytorch, chainer.Variable for chainer
-
forward
(xs_pad, ilens, ys_pad)[source]¶ E2E forward
- Parameters
xs_pad (torch.Tensor) – batch of padded source sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of source sequences (B)
ys_pad (torch.Tensor) – batch of padded target sequences (B, Lmax)
- Returns
CTC loss value
- Return type
torch.Tensor
- Returns
attention loss value
- Return type
torch.Tensor
- Returns
accuracy in attention decoder
- Return type
float
-
recognize
(feat, recog_args, char_list=None, rnnlm=None, use_jit=False)[source]¶ Recognize feat.
- Parameters
x (ndarray) – input acoustic feature (B, T, D) or (T, D)
recog_args (namespace) – argument namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns
N-best decoding results
- Return type
list
TODO(karita): do not recompute previous attention for faster decoding
-
scorers
()[source]¶ Get scorers for beam_search (optional).
- Returns
dict of ScorerInterface objects
- Return type
dict[str, ScorerInterface]
espnet.nets.pytorch_backend.e2e_mt¶
-
class
espnet.nets.pytorch_backend.e2e_mt.
E2E
(idim, odim, args)[source]¶ Bases:
espnet.nets.mt_interface.MTInterface
,torch.nn.modules.module.Module
E2E module
- Parameters
idim (int) – dimension of inputs
odim (int) – dimension of outputs
args (Namespace) – argument Namespace containing options
-
calculate_all_attentions
(xs_pad, ilens, ys_pad)[source]¶ E2E attention calculation
- Parameters
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)
- Returns
attention weights with the following shape, 1) multi-head case => attention weights (B, H, Lmax, Tmax), 2) other case => attention weights (B, Lmax, Tmax).
- Return type
float ndarray
-
forward
(xs_pad, ilens, ys_pad)[source]¶ E2E forward
- Parameters
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)
- Returns
loss value
- Return type
torch.Tensor
-
init_like_chainer
()[source]¶ Initialize weights like chainer.
Chainer basically uses the LeCun way: W ~ Normal(0, fan_in ** -0.5), b = 0. PyTorch basically uses W, b ~ Uniform(-fan_in ** -0.5, fan_in ** -0.5).
However, there are two exceptions as far as I know:
- EmbedID.W ~ Normal(0, 1)
- LSTM.upward.b[forget_gate_range] = 1 (but not used in NStepLSTM)
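As a rough illustration of the convention above, a minimal sketch of a chainer-style (LeCun) initializer for a PyTorch module might look as follows; the helper name and the dimension-based dispatch are illustrative, not ESPnet's exact code:

import math

import torch


def lecun_normal_init_parameters(module: torch.nn.Module) -> None:
    """Chainer-style initialization sketch: W ~ Normal(0, fan_in ** -0.5), b = 0."""
    for p in module.parameters():
        data = p.data
        if data.dim() == 1:
            # bias vectors are zeroed
            data.zero_()
        elif data.dim() == 2:
            # linear weights (out, in): std = fan_in ** -0.5
            n = data.size(1)
            data.normal_(0, 1.0 / math.sqrt(n))
        elif data.dim() in (3, 4):
            # conv weights (out, in, *kernel): fan_in = in_channels * prod(kernel)
            n = data.size(1)
            for k in data.size()[2:]:
                n *= k
            data.normal_(0, 1.0 / math.sqrt(n))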
-
target_lang_biasing_train
(xs_pad, ilens, ys_pad)[source]¶ Replace <sos> with target language IDs for multilingual MT during training.
- Parameters
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
- Returns
source text without language IDs
- Return type
torch.Tensor
- Returns
target text without language IDs
- Return type
torch.Tensor
- Returns
target language IDs
- Return type
torch.Tensor (B, 1)
-
translate
(x, trans_args, char_list, rnnlm=None)[source]¶ E2E beam search
- Parameters
x (ndarray) – input source text feature (T, D)
trans_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns
N-best decoding results
- Return type
list
-
translate_batch
(xs, trans_args, char_list, rnnlm=None)[source]¶ E2E beam search
- Parameters
xs (list) – list of input source text feature arrays [(T_1, D), (T_2, D), …]
trans_args (Namespace) – argument Namespace containing options
char_list (list) – list of characters
rnnlm (torch.nn.Module) – language model module
- Returns
N-best decoding results
- Return type
list
espnet.nets.pytorch_backend.e2e_tts_fastspeech¶
FastSpeech related modules.
-
class
espnet.nets.pytorch_backend.e2e_tts_fastspeech.
FeedForwardTransformer
(idim, odim, args=None)[source]¶ Bases:
espnet.nets.tts_interface.TTSInterface
,torch.nn.modules.module.Module
Feed Forward Transformer for TTS a.k.a. FastSpeech.
This is a module of FastSpeech, feed-forward Transformer with duration predictor described in FastSpeech: Fast, Robust and Controllable Text to Speech, which does not require any auto-regressive processing during inference, resulting in fast decoding compared with auto-regressive Transformer.
Initialize feed-forward Transformer module.
- Parameters
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
args (Namespace, optional) –
elayers (int): Number of encoder layers.
eunits (int): Number of encoder hidden units.
adim (int): Number of attention transformation dimensions.
aheads (int): Number of heads for multi head attention.
dlayers (int): Number of decoder layers.
dunits (int): Number of decoder hidden units.
use_scaled_pos_enc (bool): Whether to use trainable scaled positional encoding.
encoder_normalize_before (bool): Whether to perform layer normalization before encoder block.
decoder_normalize_before (bool): Whether to perform layer normalization before decoder block.
encoder_concat_after (bool): Whether to concatenate attention layer’s input and output in encoder.
decoder_concat_after (bool): Whether to concatenate attention layer’s input and output in decoder.
duration_predictor_layers (int): Number of duration predictor layers.
duration_predictor_chans (int): Number of duration predictor channels.
duration_predictor_kernel_size (int): Kernel size of duration predictor.
spk_embed_dim (int): Number of speaker embedding dimensions.
spk_embed_integration_type: How to integrate speaker embedding.
teacher_model (str): Teacher auto-regressive transformer model path.
reduction_factor (int): Reduction factor.
transformer_init (float): How to initialize transformer parameters.
transformer_lr (float): Initial value of learning rate.
transformer_warmup_steps (int): Optimizer warmup steps.
transformer_enc_dropout_rate (float): Dropout rate in encoder except attention & positional encoding.
transformer_enc_positional_dropout_rate (float): Dropout rate after encoder positional encoding.
transformer_enc_attn_dropout_rate (float): Dropout rate in encoder self-attention module.
transformer_dec_dropout_rate (float): Dropout rate in decoder except attention & positional encoding.
transformer_dec_positional_dropout_rate (float): Dropout rate after decoder positional encoding.
transformer_dec_attn_dropout_rate (float): Dropout rate in decoder self-attention module.
transformer_enc_dec_attn_dropout_rate (float): Dropout rate in encoder-decoder attention module.
use_masking (bool): Whether to use masking in calculation of loss.
transfer_encoder_from_teacher: Whether to transfer encoder using teacher encoder parameters.
transferred_encoder_module: Encoder module to be initialized using teacher parameters.
-
property
attention_plot_class
¶ Return plot class for attention weight plot.
-
property
base_plot_keys
¶ Return base key names to plot during training. Keys should match what chainer.reporter reports.
If you add the key loss, the reporter will report main/loss and validation/main/loss values. Also, loss.png will be created as a figure visualizing main/loss and validation/main/loss values.
- Returns
List of strings which are base keys to plot during training.
- Return type
list
-
calculate_all_attentions
(xs, ilens, ys, olens, spembs=None, *args, **kwargs)[source]¶ Calculate all of the attention weights.
- Parameters
xs (Tensor) – Batch of padded character ids (B, Tmax).
ilens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
olens (LongTensor) – Batch of the lengths of each target (B,).
spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
- Returns
Dict of attention weights and outputs.
- Return type
dict
-
forward
(xs, ilens, ys, olens, spembs=None, *args, **kwargs)[source]¶ Calculate forward propagation.
- Parameters
xs (Tensor) – Batch of padded character ids (B, Tmax).
ilens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
olens (LongTensor) – Batch of the lengths of each target (B,).
spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
- Returns
Loss value.
- Return type
Tensor
-
inference
(x, inference_args, spemb=None, *args, **kwargs)[source]¶ Generate the sequence of features given the sequences of characters.
- Parameters
x (Tensor) – Input sequence of characters (T,).
inference_args (Namespace) – Dummy for compatibility.
spemb (Tensor, optional) – Speaker embedding vector (spk_embed_dim).
- Returns
Output sequence of features (1, L, odim). None: Dummy for compatibility. None: Dummy for compatibility.
- Return type
Tensor
espnet.nets.pytorch_backend.e2e_tts_tacotron2¶
Tacotron 2 related modules.
-
class
espnet.nets.pytorch_backend.e2e_tts_tacotron2.
GuidedAttentionLoss
(sigma=0.4, alpha=1.0, reset_always=True)[source]¶ Bases:
torch.nn.modules.module.Module
Guided attention loss function module.
This module calculates the guided attention loss described in Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention, which forces the attention to be diagonal.
Initialize guided attention loss module.
- Parameters
sigma (float, optional) – Standard deviation to control how close attention to a diagonal.
alpha (float, optional) – Scaling coefficient (lambda).
reset_always (bool, optional) – Whether to always reset masks.
-
forward
(att_ws, ilens, olens)[source]¶ Calculate forward propagation.
- Parameters
att_ws (Tensor) – Batch of attention weights (B, T_max_out, T_max_in).
ilens (LongTensor) – Batch of input lengths (B,).
olens (LongTensor) – Batch of output lengths (B,).
- Returns
Guided attention loss value.
- Return type
Tensor
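For intuition, a minimal sketch of the guided attention weight matrix from the referenced paper is given below; the helper name is an assumption, not ESPnet's exact implementation. The loss is roughly the mean of these weights multiplied elementwise with the attention matrix, so diagonal attention incurs almost no penalty:

import torch


def guided_attention_mask(ilen: int, olen: int, sigma: float = 0.4) -> torch.Tensor:
    """W[t, n] = 1 - exp(-(n / ilen - t / olen)**2 / (2 * sigma**2)), shape (olen, ilen)."""
    grid_t, grid_n = torch.meshgrid(
        torch.arange(olen).float(), torch.arange(ilen).float()
    )
    # near zero on the diagonal, grows as attention drifts off it
    return 1.0 - torch.exp(
        -((grid_n / ilen - grid_t / olen) ** 2) / (2 * sigma ** 2)
    )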
-
class
espnet.nets.pytorch_backend.e2e_tts_tacotron2.
Tacotron2
(idim, odim, args=None)[source]¶ Bases:
espnet.nets.tts_interface.TTSInterface
,torch.nn.modules.module.Module
Tacotron2 module for end-to-end text-to-speech (E2E-TTS).
This is a module of Spectrogram prediction network in Tacotron2 described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, which converts the sequence of characters into the sequence of Mel-filterbanks.
Initialize Tacotron2 module.
- Parameters
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
args (Namespace, optional) –
spk_embed_dim (int): Dimension of the speaker embedding.
embed_dim (int): Dimension of character embedding.
elayers (int): The number of encoder blstm layers.
eunits (int): The number of encoder blstm units.
econv_layers (int): The number of encoder conv layers.
econv_filts (int): The filter size of encoder conv layers.
econv_chans (int): The number of encoder conv filter channels.
dlayers (int): The number of decoder lstm layers.
dunits (int): The number of decoder lstm units.
prenet_layers (int): The number of prenet layers.
prenet_units (int): The number of prenet units.
postnet_layers (int): The number of postnet layers.
postnet_filts (int): The filter size of postnet layers.
postnet_chans (int): The number of postnet filter channels.
output_activation (str): The name of the activation function for outputs.
adim (int): The number of dimensions of the MLP in attention.
aconv_chans (int): The number of attention conv filter channels.
aconv_filts (int): The filter size of attention conv layers.
cumulate_att_w (bool): Whether to cumulate previous attention weight.
use_batch_norm (bool): Whether to use batch normalization.
use_concate (bool): Whether to concatenate encoder embedding with decoder lstm outputs.
dropout_rate (float): Dropout rate.
zoneout_rate (float): Zoneout rate.
reduction_factor (int): Reduction factor.
spk_embed_dim (int): Number of speaker embedding dimensions.
spc_dim (int): Number of spectrogram embedding dimensions (only for use_cbhg=True).
use_cbhg (bool): Whether to use CBHG module.
cbhg_conv_bank_layers (int): The number of convolutional banks in CBHG.
cbhg_conv_bank_chans (int): The number of channels of convolutional bank in CBHG.
cbhg_proj_filts (int): The filter size of the projection layer in CBHG.
cbhg_proj_chans (int): The number of channels of projection layer in CBHG.
cbhg_highway_layers (int): The number of layers of highway network in CBHG.
cbhg_highway_units (int): The number of units of highway network in CBHG.
cbhg_gru_units (int): The number of units of GRU in CBHG.
use_masking (bool): Whether to mask padded part in loss calculation.
bce_pos_weight (float): Weight of positive sample of stop token (only for use_masking=True).
use_guided_attn_loss (bool): Whether to use guided attention loss.
guided_attn_loss_sigma (float): Sigma in guided attention loss.
guided_attn_loss_lambda (float): Lambda in guided attention loss.
-
property
base_plot_keys
¶ Return base key names to plot during training. Keys should match what chainer.reporter reports.
If you add the key loss, the reporter will report main/loss and validation/main/loss values. Also, loss.png will be created as a figure visualizing main/loss and validation/main/loss values.
- Returns
List of strings which are base keys to plot during training.
- Return type
list
-
calculate_all_attentions
(xs, ilens, ys, spembs=None, *args, **kwargs)[source]¶ Calculate all of the attention weights.
- Parameters
xs (Tensor) – Batch of padded character ids (B, Tmax).
ilens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
olens (LongTensor) – Batch of the lengths of each target (B,).
spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
- Returns
Batch of attention weights (B, Lmax, Tmax).
- Return type
numpy.ndarray
-
forward
(xs, ilens, ys, labels, olens, spembs=None, spcs=None, *args, **kwargs)[source]¶ Calculate forward propagation.
- Parameters
xs (Tensor) – Batch of padded character ids (B, Tmax).
ilens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
olens (LongTensor) – Batch of the lengths of each target (B,).
spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
spcs (Tensor, optional) – Batch of groundtruth spectrograms (B, Lmax, spc_dim).
- Returns
Loss value.
- Return type
Tensor
-
inference
(x, inference_args, spemb=None, *args, **kwargs)[source]¶ Generate the sequence of features given the sequences of characters.
- Parameters
x (Tensor) – Input sequence of characters (T,).
inference_args (Namespace) –
threshold (float): Threshold in inference.
minlenratio (float): Minimum length ratio in inference.
maxlenratio (float): Maximum length ratio in inference.
spemb (Tensor, optional) – Speaker embedding vector (spk_embed_dim).
- Returns
Output sequence of features (L, odim). Tensor: Output sequence of stop probabilities (L,). Tensor: Attention weights (L, T).
- Return type
Tensor
-
class
espnet.nets.pytorch_backend.e2e_tts_tacotron2.
Tacotron2Loss
(use_masking=True, bce_pos_weight=20.0)[source]¶ Bases:
torch.nn.modules.module.Module
Loss function module for Tacotron2.
Initialize Tacotron2 loss module.
- Parameters
use_masking (bool) – Whether to mask padded part in loss calculation.
bce_pos_weight (float) – Weight of positive sample of stop token.
-
forward
(after_outs, before_outs, logits, ys, labels, olens)[source]¶ Calculate forward propagation.
- Parameters
after_outs (Tensor) – Batch of outputs after postnets (B, Lmax, odim).
before_outs (Tensor) – Batch of outputs before postnets (B, Lmax, odim).
logits (Tensor) – Batch of stop logits (B, Lmax).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
labels (LongTensor) – Batch of the sequences of stop token labels (B, Lmax).
olens (LongTensor) – Batch of the lengths of each target (B,).
- Returns
L1 loss value. Tensor: Mean square error loss value. Tensor: Binary cross entropy loss value.
- Return type
Tensor
espnet.nets.pytorch_backend.e2e_tts_transformer¶
TTS-Transformer related modules.
-
class
espnet.nets.pytorch_backend.e2e_tts_transformer.
GuidedMultiHeadAttentionLoss
(sigma=0.4, alpha=1.0, reset_always=True)[source]¶ Bases:
espnet.nets.pytorch_backend.e2e_tts_tacotron2.GuidedAttentionLoss
Guided attention loss function module for multi head attention.
- Parameters
sigma (float, optional) – Standard deviation to control how close attention to a diagonal.
alpha (float, optional) – Scaling coefficient (lambda).
reset_always (bool, optional) – Whether to always reset masks.
Initialize guided attention loss module.
- Parameters
sigma (float, optional) – Standard deviation to control how close attention to a diagonal.
alpha (float, optional) – Scaling coefficient (lambda).
reset_always (bool, optional) – Whether to always reset masks.
-
forward
(att_ws, ilens, olens)[source]¶ Calculate forward propagation.
- Parameters
att_ws (Tensor) – Batch of multi head attention weights (B, H, T_max_out, T_max_in).
ilens (LongTensor) – Batch of input lengths (B,).
olens (LongTensor) – Batch of output lengths (B,).
- Returns
Guided attention loss value.
- Return type
Tensor
-
class
espnet.nets.pytorch_backend.e2e_tts_transformer.
TTSPlot
(att_vis_fn, data, outdir, converter, transform, device, reverse=False, ikey='input', iaxis=0, okey='output', oaxis=0)[source]¶ Bases:
espnet.nets.pytorch_backend.transformer.plot.PlotAttentionReport
Attention plot module for TTS-Transformer.
-
plotfn
(data, attn_dict, outdir, suffix='png', savefn=None)[source]¶ Plot multi head attentions.
- Parameters
data (dict) – Utts info from json file.
attn_dict (dict) – Multi head attention dict. Values should be numpy.ndarray (H, L, T)
outdir (str) – Directory name to save figures.
suffix (str) – Filename suffix including image type (e.g., png).
savefn (function) – Function to save figures.
-
-
class
espnet.nets.pytorch_backend.e2e_tts_transformer.
Transformer
(idim, odim, args=None)[source]¶ Bases:
espnet.nets.tts_interface.TTSInterface
,torch.nn.modules.module.Module
Text-to-Speech Transformer module.
This is a module of text-to-speech Transformer described in Neural Speech Synthesis with Transformer Network, which converts the sequence of characters or phonemes into the sequence of Mel-filterbanks.
Initialize TTS-Transformer module.
- Parameters
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
args (Namespace, optional) –
embed_dim (int): Dimension of character embedding.
eprenet_conv_layers (int): Number of encoder prenet convolution layers.
eprenet_conv_chans (int): Number of encoder prenet convolution channels.
eprenet_conv_filts (int): Filter size of encoder prenet convolution.
dprenet_layers (int): Number of decoder prenet layers.
dprenet_units (int): Number of decoder prenet hidden units.
elayers (int): Number of encoder layers.
eunits (int): Number of encoder hidden units.
adim (int): Number of attention transformation dimensions.
aheads (int): Number of heads for multi head attention.
dlayers (int): Number of decoder layers.
dunits (int): Number of decoder hidden units.
postnet_layers (int): Number of postnet layers.
postnet_chans (int): Number of postnet channels.
postnet_filts (int): Filter size of postnet.
use_scaled_pos_enc (bool): Whether to use trainable scaled positional encoding.
use_batch_norm (bool): Whether to use batch normalization in encoder prenet.
encoder_normalize_before (bool): Whether to perform layer normalization before encoder block.
decoder_normalize_before (bool): Whether to perform layer normalization before decoder block.
encoder_concat_after (bool): Whether to concatenate attention layer’s input and output in encoder.
decoder_concat_after (bool): Whether to concatenate attention layer’s input and output in decoder.
reduction_factor (int): Reduction factor.
spk_embed_dim (int): Number of speaker embedding dimensions.
spk_embed_integration_type: How to integrate speaker embedding.
transformer_init (float): How to initialize transformer parameters.
transformer_lr (float): Initial value of learning rate.
transformer_warmup_steps (int): Optimizer warmup steps.
transformer_enc_dropout_rate (float): Dropout rate in encoder except attention & positional encoding.
transformer_enc_positional_dropout_rate (float): Dropout rate after encoder positional encoding.
transformer_enc_attn_dropout_rate (float): Dropout rate in encoder self-attention module.
transformer_dec_dropout_rate (float): Dropout rate in decoder except attention & positional encoding.
transformer_dec_positional_dropout_rate (float): Dropout rate after decoder positional encoding.
transformer_dec_attn_dropout_rate (float): Dropout rate in decoder self-attention module.
transformer_enc_dec_attn_dropout_rate (float): Dropout rate in encoder-decoder attention module.
eprenet_dropout_rate (float): Dropout rate in encoder prenet.
dprenet_dropout_rate (float): Dropout rate in decoder prenet.
postnet_dropout_rate (float): Dropout rate in postnet.
use_masking (bool): Whether to use masking in calculation of loss.
bce_pos_weight (float): Positive sample weight in bce calculation (only for use_masking=true).
loss_type (str): How to calculate loss.
use_guided_attn_loss (bool): Whether to use guided attention loss.
num_heads_applied_guided_attn (int): Number of heads in each layer to apply guided attention loss.
num_layers_applied_guided_attn (int): Number of layers to apply guided attention loss.
modules_applied_guided_attn (list): List of module names to apply guided attention loss.
guided_attn_loss_sigma (float): Sigma in guided attention loss.
guided_attn_loss_lambda (float): Lambda in guided attention loss.
-
property
attention_plot_class
¶ Return plot class for attention weight plot.
-
property
base_plot_keys
¶ Return base key names to plot during training. Keys should match what chainer.reporter reports.
If you add the key loss, the reporter will report main/loss and validation/main/loss values. Also, loss.png will be created as a figure visualizing main/loss and validation/main/loss values.
- Returns
List of strings which are base keys to plot during training.
- Return type
list
-
calculate_all_attentions
(xs, ilens, ys, olens, spembs=None, skip_output=False, keep_tensor=False, *args, **kwargs)[source]¶ Calculate all of the attention weights.
- Parameters
xs (Tensor) – Batch of padded character ids (B, Tmax).
ilens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
olens (LongTensor) – Batch of the lengths of each target (B,).
spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
skip_output (bool, optional) – Whether to skip calculating the final output.
keep_tensor (bool, optional) – Whether to keep original tensor.
- Returns
Dict of attention weights and outputs.
- Return type
dict
-
forward
(xs, ilens, ys, labels, olens, spembs=None, *args, **kwargs)[source]¶ Calculate forward propagation.
- Parameters
xs (Tensor) – Batch of padded character ids (B, Tmax).
ilens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
olens (LongTensor) – Batch of the lengths of each target (B,).
spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
- Returns
Loss value.
- Return type
Tensor
-
inference
(x, inference_args, spemb=None, *args, **kwargs)[source]¶ Generate the sequence of features given the sequences of characters.
- Parameters
x (Tensor) – Input sequence of characters (T,).
inference_args (Namespace) –
threshold (float): Threshold in inference.
minlenratio (float): Minimum length ratio in inference.
maxlenratio (float): Maximum length ratio in inference.
spemb (Tensor, optional) – Speaker embedding vector (spk_embed_dim).
- Returns
Output sequence of features (L, odim). Tensor: Output sequence of stop probabilities (L,). Tensor: Encoder-decoder (source) attention weights (#layers, #heads, L, T).
- Return type
Tensor
-
class
espnet.nets.pytorch_backend.e2e_tts_transformer.
TransformerLoss
(use_masking=True, bce_pos_weight=5.0)[source]¶ Bases:
torch.nn.modules.module.Module
Loss function module for TTS-Transformer.
Initialize Transformer loss module.
- Parameters
use_masking (bool, optional) – Whether to mask padded part in loss calculation.
bce_pos_weight (float, optional) – Weight of positive sample of stop token (only for use_masking=True).
-
forward
(after_outs, before_outs, logits, ys, labels, olens)[source]¶ Calculate forward propagation.
- Parameters
after_outs (Tensor) – Batch of outputs after postnets (B, Lmax, odim).
before_outs (Tensor) – Batch of outputs before postnets (B, Lmax, odim).
logits (Tensor) – Batch of stop logits (B, Lmax).
ys (Tensor) – Batch of padded target features (B, Lmax, odim).
labels (LongTensor) – Batch of the sequences of stop token labels (B, Lmax).
olens (LongTensor) – Batch of the lengths of each target (B,).
- Returns
L1 loss value. Tensor: Mean square error loss value. Tensor: Binary cross entropy loss value.
- Return type
Tensor
espnet.nets.pytorch_backend.nets_utils¶
Network related utility tools.
-
espnet.nets.pytorch_backend.nets_utils.
make_non_pad_mask
(lengths, xs=None, length_dim=-1)[source]¶ Make mask tensor containing indices of non-padded part.
- Parameters
lengths (LongTensor or List) – Batch of lengths (B,).
xs (Tensor, optional) – The reference tensor. If set, masks will be the same shape as this tensor.
length_dim (int, optional) – Dimension indicator of the above tensor. See the example.
- Returns
Mask tensor containing indices of the non-padded part.
- Return type
ByteTensor
Examples
With only lengths.
>>> lengths = [5, 3, 2]
>>> make_non_pad_mask(lengths)
masks = [[1, 1, 1, 1, 1],
         [1, 1, 1, 0, 0],
         [1, 1, 0, 0, 0]]
With the reference tensor.
>>> xs = torch.zeros((3, 2, 4))
>>> make_non_pad_mask(lengths, xs)
tensor([[[1, 1, 1, 1], [1, 1, 1, 1]],
        [[1, 1, 1, 0], [1, 1, 1, 0]],
        [[1, 1, 0, 0], [1, 1, 0, 0]]], dtype=torch.uint8)
>>> xs = torch.zeros((3, 2, 6))
>>> make_non_pad_mask(lengths, xs)
tensor([[[1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 0]],
        [[1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0]],
        [[1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0]]], dtype=torch.uint8)
With the reference tensor and dimension indicator.
>>> xs = torch.zeros((3, 6, 6))
>>> make_non_pad_mask(lengths, xs, 1)
tensor([[[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0]],
        [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]],
        [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]],
       dtype=torch.uint8)
>>> make_non_pad_mask(lengths, xs, 2)
tensor([[[1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 0],
         [1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 0]],
        [[1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0],
         [1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0]],
        [[1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0],
         [1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0]]],
       dtype=torch.uint8)
-
espnet.nets.pytorch_backend.nets_utils.
make_pad_mask
(lengths, xs=None, length_dim=-1)[source]¶ Make mask tensor containing indices of padded part.
- Parameters
lengths (LongTensor or List) – Batch of lengths (B,).
xs (Tensor, optional) – The reference tensor. If set, masks will be the same shape as this tensor.
length_dim (int, optional) – Dimension indicator of the above tensor. See the example.
- Returns
Mask tensor containing indices of padded part.
- Return type
Tensor
Examples
With only lengths.
>>> lengths = [5, 3, 2]
>>> make_pad_mask(lengths)
masks = [[0, 0, 0, 0, 0],
         [0, 0, 0, 1, 1],
         [0, 0, 1, 1, 1]]
With the reference tensor.
>>> xs = torch.zeros((3, 2, 4))
>>> make_pad_mask(lengths, xs)
tensor([[[0, 0, 0, 0], [0, 0, 0, 0]],
        [[0, 0, 0, 1], [0, 0, 0, 1]],
        [[0, 0, 1, 1], [0, 0, 1, 1]]], dtype=torch.uint8)
>>> xs = torch.zeros((3, 2, 6))
>>> make_pad_mask(lengths, xs)
tensor([[[0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 1]],
        [[0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1]],
        [[0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 1, 1]]], dtype=torch.uint8)
With the reference tensor and dimension indicator.
>>> xs = torch.zeros((3, 6, 6))
>>> make_pad_mask(lengths, xs, 1)
tensor([[[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]],
        [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]],
        [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]],
       dtype=torch.uint8)
>>> make_pad_mask(lengths, xs, 2)
tensor([[[0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 1]],
        [[0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1]],
        [[0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 1, 1],
         [0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 1, 1]]],
       dtype=torch.uint8)
-
espnet.nets.pytorch_backend.nets_utils.
mask_by_length
(xs, lengths, fill=0)[source]¶ Mask tensor according to length.
- Parameters
xs (Tensor) – Batch of input tensor (B, *).
lengths (LongTensor or List) – Batch of lengths (B,).
fill (int or float) – Value to fill masked part.
- Returns
Batch of masked input tensor (B, *).
- Return type
Tensor
Examples
>>> x = torch.arange(5).repeat(3, 1) + 1
>>> x
tensor([[1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5]])
>>> lengths = [5, 3, 2]
>>> mask_by_length(x, lengths)
tensor([[1, 2, 3, 4, 5],
        [1, 2, 3, 0, 0],
        [1, 2, 0, 0, 0]])
-
espnet.nets.pytorch_backend.nets_utils.
pad_list
(xs, pad_value)[source]¶ Perform padding for the list of tensors.
- Parameters
xs (List) – List of Tensors [(T_1, *), (T_2, *), …, (T_B, *)].
pad_value (float) – Value for padding.
- Returns
Padded tensor (B, Tmax, *).
- Return type
Tensor
Examples
>>> x = [torch.ones(4), torch.ones(2), torch.ones(1)]
>>> x
[tensor([1., 1., 1., 1.]), tensor([1., 1.]), tensor([1.])]
>>> pad_list(x, 0)
tensor([[1., 1., 1., 1.],
        [1., 1., 0., 0.],
        [1., 0., 0., 0.]])
-
espnet.nets.pytorch_backend.nets_utils.
th_accuracy
(pad_outputs, pad_targets, ignore_label)[source]¶ Calculate accuracy.
- Parameters
pad_outputs (Tensor) – Prediction tensors (B * Lmax, D).
pad_targets (LongTensor) – Target label tensors (B, Lmax, D).
ignore_label (int) – Ignore label id.
- Returns
Accuracy value (0.0 - 1.0).
- Return type
float
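A minimal sketch of such a masked accuracy, assuming targets of shape (B, Lmax), might look like this (not necessarily ESPnet's exact implementation):

import torch


def masked_accuracy(pad_outputs: torch.Tensor, pad_targets: torch.Tensor,
                    ignore_label: int) -> float:
    """Accuracy of argmax predictions, skipping positions labeled ignore_label."""
    # reshape flat predictions (B * Lmax, D) into (B, Lmax) argmax indices
    pad_pred = pad_outputs.view(
        pad_targets.size(0), pad_targets.size(1), pad_outputs.size(1)
    ).argmax(2)
    mask = pad_targets != ignore_label
    correct = (pad_pred.masked_select(mask) == pad_targets.masked_select(mask)).sum()
    return float(correct) / float(mask.sum())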
-
espnet.nets.pytorch_backend.nets_utils.
to_device
(m, x)[source]¶ Send tensor into the device of the module.
- Parameters
m (torch.nn.Module) – Torch module.
x (Tensor) – Torch tensor.
- Returns
Torch tensor located in the same place as torch module.
- Return type
Tensor
-
espnet.nets.pytorch_backend.nets_utils.
to_torch_tensor
(x)[source]¶ Change to torch.Tensor or ComplexTensor from numpy.ndarray.
- Parameters
x – Inputs. It should be one of numpy.ndarray, Tensor, ComplexTensor, and dict.
- Returns
Type converted inputs.
- Return type
Tensor or ComplexTensor
Examples
>>> xs = np.ones(3, dtype=np.float32)
>>> to_torch_tensor(xs)
tensor([1., 1., 1.])
>>> xs = torch.ones(3, 4, 5)
>>> assert to_torch_tensor(xs) is xs
>>> xs = {'real': xs, 'imag': xs}
>>> to_torch_tensor(xs)
ComplexTensor(
    Real: tensor([1., 1., 1.])
    Imag: tensor([1., 1., 1.]))
espnet.nets.pytorch_backend.wavenet¶
This code is based on https://github.com/kan-bayashi/PytorchWaveNetVocoder.
-
class
espnet.nets.pytorch_backend.wavenet.
CausalConv1d
(in_channels, out_channels, kernel_size, dilation=1, bias=True)[source]¶ Bases:
torch.nn.modules.module.Module
1D dilated causal convolution.
-
class
espnet.nets.pytorch_backend.wavenet.
OneHot
(depth)[source]¶ Bases:
torch.nn.modules.module.Module
Convert to one-hot vector.
- Parameters
depth (int) – Dimension of one-hot vector.
-
class
espnet.nets.pytorch_backend.wavenet.
UpSampling
(upsampling_factor, bias=True)[source]¶ Bases:
torch.nn.modules.module.Module
Upsampling layer with deconvolution.
- Parameters
upsampling_factor (int) – Upsampling factor.
-
class
espnet.nets.pytorch_backend.wavenet.
WaveNet
(n_quantize=256, n_aux=28, n_resch=512, n_skipch=256, dilation_depth=10, dilation_repeat=3, kernel_size=2, upsampling_factor=0)[source]¶ Bases:
torch.nn.modules.module.Module
Conditional wavenet.
- Parameters
n_quantize (int) – Number of quantization levels.
n_aux (int) – Number of aux feature dimensions.
n_resch (int) – Number of filter channels for residual blocks.
n_skipch (int) – Number of filter channels for skip connections.
dilation_depth (int) – Depth of dilation (e.g., if set to 10, max dilation = 2^(10-1)).
dilation_repeat (int) – Number of dilation repeats.
kernel_size (int) – Filter size of dilated causal convolution.
upsampling_factor (int) – Upsampling factor.
-
forward
(x, h)[source]¶ Calculate forward propagation.
- Parameters
x (LongTensor) – Quantized input waveform tensor with the shape (B, T).
h (Tensor) – Auxiliary feature tensor with the shape (B, n_aux, T).
- Returns
Logits with the shape (B, T, n_quantize).
- Return type
Tensor
-
generate
(x, h, n_samples, interval=None, mode='sampling')[source]¶ Generate a waveform with the fast generation algorithm.
This generation is based on the Fast WaveNet Generation Algorithm.
- Parameters
x (LongTensor) – Initial waveform tensor with the shape (T,).
h (Tensor) – Auxiliary feature tensor with the shape (n_samples + T, n_aux).
n_samples (int) – Number of samples to be generated.
interval (int, optional) – Log interval.
mode (str, optional) – “sampling” or “argmax”.
- Returns
Generated quantized waveform (n_samples).
- Return type
ndarray
-
espnet.nets.pytorch_backend.wavenet.
decode_mu_law
(y, mu=256)[source]¶ Perform mu-law decoding.
- Parameters
y (ndarray) – Quantized audio signal with the range from 0 to mu - 1.
mu (int) – Quantized level.
- Returns
Audio signal with the range from -1 to 1.
- Return type
ndarray
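As a reference, a small numpy sketch of the standard mu-law expansion (the inverse of mu-law companding) is given below; the function name is illustrative:

import numpy as np


def mu_law_decode(y: np.ndarray, mu: int = 256) -> np.ndarray:
    """Map quantized codes in [0, mu - 1] back to a waveform in [-1, 1]."""
    mu = mu - 1
    # rescale the integer codes to [-1, 1]
    f = 2.0 * y.astype(np.float32) / mu - 1.0
    # invert the companding curve: x = sign(f) / mu * ((1 + mu)^|f| - 1)
    return np.sign(f) / mu * ((1.0 + mu) ** np.abs(f) - 1.0)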
espnet.nets.pytorch_backend.fastspeech.duration_calculator¶
Duration calculator related modules.
-
class
espnet.nets.pytorch_backend.fastspeech.duration_calculator.
DurationCalculator
(teacher_model)[source]¶ Bases:
torch.nn.modules.module.Module
Duration calculator module for FastSpeech.
Initialize duration calculator module.
- Parameters
teacher_model (e2e_tts_transformer.Transformer) – Pretrained auto-regressive Transformer.
-
forward
(xs, ilens, ys, olens, spembs=None)[source]¶ Calculate forward propagation.
- Parameters
xs (Tensor) – Batch of the padded sequences of character ids (B, Tmax).
ilens (Tensor) – Batch of lengths of each input sequence (B,).
ys (Tensor) – Batch of the padded sequence of target features (B, Lmax, odim).
olens (Tensor) – Batch of lengths of each output sequence (B,).
spembs (Tensor, optional) – Batch of speaker embedding vectors (B, spk_embed_dim).
- Returns
Batch of durations (B, Tmax).
- Return type
Tensor
espnet.nets.pytorch_backend.fastspeech.duration_predictor¶
Duration predictor related modules.
-
class
espnet.nets.pytorch_backend.fastspeech.duration_predictor.
DurationPredictor
(idim, n_layers=2, n_chans=384, kernel_size=3, dropout_rate=0.1, offset=1.0)[source]¶ Bases:
torch.nn.modules.module.Module
Duration predictor module.
This is a module of the duration predictor described in FastSpeech: Fast, Robust and Controllable Text to Speech. The duration predictor predicts the duration of each frame in the log domain from the hidden embeddings of the encoder.
Note
The calculation domain of the outputs differs between forward and inference: in forward, the outputs are calculated in the log domain, while in inference they are calculated in the linear domain, as sketched below.
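A sketch of this domain handling (tensor values and the offset are illustrative only):

import torch

offset = 1.0
# training: ground-truth durations are compared in the log domain
dur_target = torch.tensor([1.0, 2.0, 4.0])
log_target = torch.log(dur_target + offset)
# inference: log-domain predictions are mapped back to linear durations
log_pred = torch.tensor([0.7, 1.1, 1.6])
dur_pred = torch.clamp(torch.round(torch.exp(log_pred) - offset), min=0).long()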
Initialize duration predictor module.
- Parameters
idim (int) – Input dimension.
n_layers (int, optional) – Number of convolutional layers.
n_chans (int, optional) – Number of channels of convolutional layers.
kernel_size (int, optional) – Kernel size of convolutional layers.
dropout_rate (float, optional) – Dropout rate.
offset (float, optional) – Offset value to avoid nan in log domain.
-
class
espnet.nets.pytorch_backend.fastspeech.duration_predictor.
DurationPredictorLoss
(offset=1.0)[source]¶ Bases:
torch.nn.modules.module.Module
Loss function module for duration predictor.
The loss value is calculated in the log domain to make it Gaussian.
Initialize duration predictor loss module.
- Parameters
offset (float, optional) – Offset value to avoid nan in log domain.
-
forward
(outputs, targets)[source]¶ Calculate forward propagation.
- Parameters
outputs (Tensor) – Batch of prediction durations in log domain (B, T)
targets (LongTensor) – Batch of groundtruth durations in linear domain (B, T)
- Returns
Mean squared error loss value.
- Return type
Tensor
Note
outputs are in the log domain while targets are in the linear domain.
espnet.nets.pytorch_backend.fastspeech.length_regulator¶
Length regulator related modules.
-
class
espnet.nets.pytorch_backend.fastspeech.length_regulator.
LengthRegulator
(pad_value=0.0)[source]¶ Bases:
torch.nn.modules.module.Module
Length regulator module for feed-forward Transformer.
This is a module of length regulator described in FastSpeech: Fast, Robust and Controllable Text to Speech. The length regulator expands char or phoneme-level embedding features to frame-level by repeating each feature based on the corresponding predicted durations.
Initialize length regulator module.
- Parameters
pad_value (float, optional) – Value used for padding.
-
forward
(xs, ds, ilens, alpha=1.0)[source]¶ Calculate forward propagation.
- Parameters
xs (Tensor) – Batch of sequences of char or phoneme embeddings (B, Tmax, D).
ds (LongTensor) – Batch of durations of each frame (B, T).
ilens (LongTensor) – Batch of input lengths (B,).
alpha (float, optional) – Alpha value to control speed of speech.
- Returns
replicated input tensor based on durations (B, T*, D).
- Return type
Tensor
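The core of the length regulator can be sketched for a single unpadded sequence as follows; the helper name and shapes are assumptions, not ESPnet's exact code:

import torch


def expand_by_durations(x: torch.Tensor, d: torch.Tensor,
                        alpha: float = 1.0) -> torch.Tensor:
    """Repeat each embedding x[t] of shape (T, D) d[t] times, giving (sum(d), D)."""
    if alpha != 1.0:
        # scaling the durations controls the speed of the synthesized speech
        d = torch.round(d.float() * alpha).long()
    return torch.repeat_interleave(x, d, dim=0)

# e.g. x of shape (3, D) with d = tensor([2, 1, 3]) yields a (6, D) tensor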
espnet.nets.pytorch_backend.frontends.beamformer¶
-
espnet.nets.pytorch_backend.frontends.beamformer.
apply_beamforming_vector
(beamform_vector: torch_complex.tensor.ComplexTensor, mix: torch_complex.tensor.ComplexTensor) → torch_complex.tensor.ComplexTensor[source]¶
-
espnet.nets.pytorch_backend.frontends.beamformer.
get_mvdr_vector
(psd_s: torch_complex.tensor.ComplexTensor, psd_n: torch_complex.tensor.ComplexTensor, reference_vector: torch.Tensor, eps: float = 1e-15) → torch_complex.tensor.ComplexTensor[source]¶ Return the MVDR (Minimum Variance Distortionless Response) vector:
h = (Npsd^-1 @ Spsd) / (Tr(Npsd^-1 @ Spsd)) @ u
- Reference:
On optimal frequency-domain multichannel linear filtering for noise reduction; M. Souden et al., 2010; https://ieeexplore.ieee.org/document/5089420
- Parameters
psd_s (ComplexTensor) – (…, F, C, C)
psd_n (ComplexTensor) – (…, F, C, C)
reference_vector (torch.Tensor) – (…, C)
eps (float) –
- Returns
(…, F, C)
- Return type
beamform_vector (ComplexTensor)
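The formula above can be sketched in plain numpy for a single frequency bin (a per-bin illustration, not the batched ComplexTensor implementation):

import numpy as np


def mvdr_vector(psd_s: np.ndarray, psd_n: np.ndarray, u: np.ndarray) -> np.ndarray:
    """psd_s, psd_n: (C, C) complex speech/noise PSDs; u: (C,) one-hot reference."""
    numerator = np.linalg.solve(psd_n, psd_s)  # Npsd^-1 @ Spsd
    ws = numerator / np.trace(numerator)       # normalize by the trace
    return ws @ u                              # (C,) beamforming vector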
-
espnet.nets.pytorch_backend.frontends.beamformer.
get_power_spectral_density_matrix
(xs: torch_complex.tensor.ComplexTensor, mask: torch.Tensor, normalization=True, eps: float = 1e-15) → torch_complex.tensor.ComplexTensor[source]¶ Return the cross-channel power spectral density (PSD) matrix.
- Parameters
xs (ComplexTensor) – (…, F, C, T)
mask (torch.Tensor) – (…, F, C, T)
normalization (bool) –
eps (float) –
- Returns
psd (ComplexTensor): (…, F, C, C)
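A per-bin numpy sketch of such a mask-weighted PSD estimate (the actual function is batched over frequency and operates on ComplexTensor):

import numpy as np


def psd_matrix(xs: np.ndarray, mask: np.ndarray, eps: float = 1e-15) -> np.ndarray:
    """xs: (C, T) complex STFT; mask: (T,) per-frame weights in [0, 1]."""
    mask = mask / np.maximum(mask.sum(), eps)  # normalization over time
    # sum_t mask[t] * xs[:, t] xs[:, t]^H -> (C, C)
    return np.einsum("t,ct,dt->cd", mask, xs, xs.conj())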
espnet.nets.pytorch_backend.frontends.dnn_beamformer¶
-
class
espnet.nets.pytorch_backend.frontends.dnn_beamformer.
AttentionReference
(bidim, att_dim)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(psd_in: torch_complex.tensor.ComplexTensor, ilens: torch.LongTensor, scaling: float = 2.0) → Tuple[torch.Tensor, torch.LongTensor][source]¶ The forward function
- Parameters
psd_in (ComplexTensor) – (B, F, C, C)
ilens (torch.Tensor) – (B,)
scaling (float) –
- Returns
(B, C) ilens (torch.Tensor): (B,)
- Return type
u (torch.Tensor)
-
-
class
espnet.nets.pytorch_backend.frontends.dnn_beamformer.
DNN_Beamformer
(bidim, btype='blstmp', blayers=3, bunits=300, bprojs=320, dropout_rate=0.0, badim=320, ref_channel: int = -1, beamformer_type='mvdr')[source]¶ Bases:
torch.nn.modules.module.Module
DNN mask-based beamformer
- Citation:
Multichannel End-to-end Speech Recognition; T. Ochiai et al., 2017; https://arxiv.org/abs/1703.04783
-
forward
(data: torch_complex.tensor.ComplexTensor, ilens: torch.LongTensor) → Tuple[torch_complex.tensor.ComplexTensor, torch.LongTensor, torch_complex.tensor.ComplexTensor][source]¶ The forward function
- Notation:
B: Batch C: Channel T: Time or Sequence length F: Freq
- Parameters
data (ComplexTensor) – (B, T, C, F)
ilens (torch.Tensor) – (B,)
- Returns
(B, T, F) ilens (torch.Tensor): (B,)
- Return type
enhanced (ComplexTensor)
espnet.nets.pytorch_backend.frontends.dnn_wpe¶
-
class
espnet.nets.pytorch_backend.frontends.dnn_wpe.
DNN_WPE
(wtype: str = 'blstmp', widim: int = 257, wlayers: int = 3, wunits: int = 300, wprojs: int = 320, dropout_rate: float = 0.0, taps: int = 5, delay: int = 3, use_dnn_mask: bool = True, iterations: int = 1, normalization: bool = False)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(data: torch_complex.tensor.ComplexTensor, ilens: torch.LongTensor) → Tuple[torch_complex.tensor.ComplexTensor, torch.LongTensor, torch_complex.tensor.ComplexTensor][source]¶ The forward function
- Notation:
B: Batch C: Channel T: Time or Sequence length F: Freq or Some dimension of the feature vector
- Parameters
data – (B, C, T, F)
ilens – (B,)
- Returns
(B, C, T, F) ilens: (B,)
- Return type
data
-
espnet.nets.pytorch_backend.frontends.feature_transform¶
-
class
espnet.nets.pytorch_backend.frontends.feature_transform.
FeatureTransform
(fs: int = 16000, n_fft: int = 512, n_mels: int = 80, fmin: float = 0.0, fmax: float = None, stats_file: str = None, apply_uttmvn: bool = True, uttmvn_norm_means: bool = True, uttmvn_norm_vars: bool = False)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(x: torch_complex.tensor.ComplexTensor, ilens: Union[torch.LongTensor, numpy.ndarray, List[int]]) → Tuple[torch.Tensor, torch.LongTensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
espnet.nets.pytorch_backend.frontends.feature_transform.
GlobalMVN
(stats_file: str, norm_means: bool = True, norm_vars: bool = True, eps: float = 1e-20)[source]¶ Bases:
torch.nn.modules.module.Module
Apply global mean and variance normalization
- Parameters
stats_file (str) – npy file of a 1-dim array or a text file. From the first element to the {(len(array) - 1) / 2}-th element are treated as the sum of features, the rest excluding the last element are treated as the sum of the squared features, and the last element equals the number of samples.
std_floor (float) –
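Assuming the layout described above, the statistics could be decoded roughly as follows (the file name and variable names are illustrative):

import numpy as np

stats = np.load("stats.npy")  # hypothetical 1-dim stats array
count = stats[-1]
dim = (len(stats) - 1) // 2
mean = stats[:dim] / count
# variance from the sum of squares; floored to keep the std positive
var = np.maximum(stats[dim:2 * dim] / count - mean ** 2, 1e-20)
std = np.sqrt(var)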
-
extra_repr
()[source]¶ Set the extra representation of the module
To print customized extra information, you should reimplement this method in your own modules. Both single-line and multi-line strings are acceptable.
-
forward
(x: torch.Tensor, ilens: torch.LongTensor) → Tuple[torch.Tensor, torch.LongTensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet.nets.pytorch_backend.frontends.feature_transform.
LogMel
(fs: int = 16000, n_fft: int = 512, n_mels: int = 80, fmin: float = None, fmax: float = None, htk: bool = False, norm=1)[source]¶ Bases:
torch.nn.modules.module.Module
Convert STFT to fbank feats
The arguments are the same as those of librosa.filters.mel.
- Parameters
fs – number > 0 [scalar] sampling rate of the incoming signal
n_fft – int > 0 [scalar] number of FFT components
n_mels – int > 0 [scalar] number of Mel bands to generate
fmin – float >= 0 [scalar] lowest frequency (in Hz)
fmax – float >= 0 [scalar] highest frequency (in Hz). If None, use fmax = fs / 2.0
htk – use HTK formula instead of Slaney
norm – {None, 1, np.inf} [scalar] if 1, divide the triangular mel weights by the width of the mel band (area normalization). Otherwise, leave all the triangles aiming for a peak value of 1.0
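A rough usage sketch of the same mel matrix applied to an STFT power spectrum (the shapes and the dummy input are assumptions; recent librosa requires keyword arguments):

import librosa
import numpy as np
import torch

# (n_mels, n_fft // 2 + 1) mel matrix, transposed for a right matmul
melmat = librosa.filters.mel(sr=16000, n_fft=512, n_mels=80)
melmat = torch.from_numpy(melmat.T.astype(np.float32))  # (257, 80)
stft_power = torch.rand(1, 100, 257)  # dummy (B, T, F) power spectrum
logmel = torch.log(torch.clamp(stft_power @ melmat, min=1e-10))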
-
extra_repr
()[source]¶ Set the extra representation of the module
To print customized extra information, you should reimplement this method in your own modules. Both single-line and multi-line strings are acceptable.
-
forward
(feat: torch.Tensor, ilens: torch.LongTensor) → Tuple[torch.Tensor, torch.LongTensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet.nets.pytorch_backend.frontends.feature_transform.
UtteranceMVN
(norm_means: bool = True, norm_vars: bool = False, eps: float = 1e-20)[source]¶ Bases:
torch.nn.modules.module.Module
-
extra_repr
()[source]¶ Set the extra representation of the module
To print customized extra information, you should reimplement this method in your own modules. Both single-line and multi-line strings are acceptable.
-
forward
(x: torch.Tensor, ilens: torch.LongTensor) → Tuple[torch.Tensor, torch.LongTensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
espnet.nets.pytorch_backend.frontends.feature_transform.
utterance_mvn
(x: torch.Tensor, ilens: torch.LongTensor, norm_means: bool = True, norm_vars: bool = False, eps: float = 1e-20) → Tuple[torch.Tensor, torch.LongTensor][source]¶ Apply utterance mean and variance normalization
- Parameters
x – (B, T, D), assumed zero padded
ilens – (B,)
norm_means –
norm_vars –
eps –
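A minimal sketch of this masked utterance-level normalization, assuming x is zero-padded and ilens holds the valid lengths:

import torch


def utterance_mvn_sketch(x: torch.Tensor, ilens: torch.LongTensor,
                         norm_vars: bool = False, eps: float = 1e-20) -> torch.Tensor:
    """Normalize each utterance in x (B, T, D) by its own mean (and variance)."""
    B, T, _ = x.size()
    # mask of valid frames, (B, T, 1)
    mask = (torch.arange(T)[None, :] < ilens[:, None]).unsqueeze(-1).to(x.dtype)
    n = ilens.to(x.dtype)[:, None, None]
    mean = (x * mask).sum(dim=1, keepdim=True) / n
    x = (x - mean) * mask  # keep the padded part at zero
    if norm_vars:
        var = (x ** 2).sum(dim=1, keepdim=True) / n
        x = x / (var.sqrt() + eps)
    return x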
espnet.nets.pytorch_backend.frontends.frontend¶
-
class
espnet.nets.pytorch_backend.frontends.frontend.
Frontend
(idim: int, use_wpe: bool = False, wtype: str = 'blstmp', wlayers: int = 3, wunits: int = 300, wprojs: int = 320, wdropout_rate: float = 0.0, taps: int = 5, delay: int = 3, use_dnn_mask_for_wpe: bool = True, use_beamformer: bool = False, btype: str = 'blstmp', blayers: int = 3, bunits: int = 300, bprojs: int = 320, badim: int = 320, ref_channel: int = -1, bdropout_rate=0.0)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(x: torch_complex.tensor.ComplexTensor, ilens: Union[torch.LongTensor, numpy.ndarray, List[int]]) → Tuple[torch_complex.tensor.ComplexTensor, torch.LongTensor, Optional[torch_complex.tensor.ComplexTensor]][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
-
espnet.nets.pytorch_backend.frontends.mask_estimator¶
-
class
espnet.nets.pytorch_backend.frontends.mask_estimator.
MaskEstimator
(type, idim, layers, units, projs, dropout, nmask=1)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(xs: torch_complex.tensor.ComplexTensor, ilens: torch.LongTensor) → Tuple[Tuple[torch.Tensor, ...], torch.LongTensor][source]¶ The forward function
- Parameters
xs – (B, F, C, T)
ilens – (B,)
- Returns
The hidden vector (B, F, C, T) masks: A tuple of the masks. (B, F, C, T) ilens: (B,)
- Return type
hs (torch.Tensor)
-
espnet.nets.pytorch_backend.lm.default¶
Default Recurrent Neural Network Language Model in lm_train.py.
-
class
espnet.nets.pytorch_backend.lm.default.
ClassifierWithState
(predictor, lossfun=CrossEntropyLoss(), label_key=-1)[source]¶ Bases:
torch.nn.modules.module.Module
A wrapper for pytorch RNNLM.
Initialize class.
- Parameters
predictor (torch.nn.Module) – The RNNLM
lossfun (function) – The loss function to use
label_key (int or str) – Key to specify the ground-truth labels in the input minibatch
-
final
(state)[source]¶ Predict final log probabilities for given state using the predictor.
- Parameters
state – The state
- Returns
The final log probabilities
- Return type
torch.Tensor
-
forward
(state, *args, **kwargs)[source]¶ Compute the loss value for an input and label pair.
Notes
It also computes accuracy and stores it to the attribute. When label_key is int, the corresponding element in args is treated as the ground-truth labels; when it is str, the element in kwargs is used. All elements of args and kwargs except the ground-truth labels are features, which are fed to the predictor and compared with the ground-truth labels.
- Parameters
state (torch.Tensor) – The LM state
args (list[torch.Tensor]) – Input minibatch
kwargs (dict[torch.Tensor]) – Input minibatch
- Returns
loss value
- Return type
torch.Tensor
-
-
class
espnet.nets.pytorch_backend.lm.default.
DefaultRNNLM
(n_vocab, args)[source]¶ Bases:
espnet.nets.lm_interface.LMInterface
,torch.nn.modules.module.Module
Default RNNLM for LMInterface Implementation.
Note
PyTorch seems to have a memory leak when one GPU computes this after data parallel. If parallel GPUs compute this, it seems to be fine. See also https://github.com/espnet/espnet/issues/1075
Initialize class.
- Parameters
n_vocab (int) – The size of the vocabulary
args (argparse.Namespace) – configurations. see py:method:add_arguments
-
final_score
(state)[source]¶ Score eos.
- Parameters
state – Scorer state for prefix tokens
- Returns
final score
- Return type
float
-
forward
(x, t)[source]¶ Compute LM loss value from buffer sequences.
- Parameters
x (torch.Tensor) – Input ids. (batch, len)
t (torch.Tensor) – Target ids. (batch, len)
- Returns
- Tuple of
loss to backward (scalar), negative log-likelihood of t: -log p(t) (scalar) and the number of elements in x (scalar)
- Return type
tuple[torch.Tensor, torch.Tensor, torch.Tensor]
Notes
The last two return values are used in perplexity: p(t)^{-1/n} = exp(-log p(t) / n)
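For example, perplexity can be recovered from the second and third return values (dummy numbers, illustrative only):

import torch

# loss, nll, count = model(x, t)  # hypothetical call
nll = torch.tensor(230.0)    # -log p(t), summed over the scored tokens
count = torch.tensor(100.0)  # number of scored tokens n
ppl = torch.exp(nll / count)  # p(t)^{-1/n}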
-
score
(y, state, x)[source]¶ Score new token.
- Parameters
y (torch.Tensor) – 1D torch.int64 prefix tokens.
state – Scorer state for prefix tokens
x (torch.Tensor) – 2D encoder feature that generates ys.
- Returns
- Tuple of
torch.float32 scores for next token (n_vocab) and next state for ys
- Return type
tuple[torch.Tensor, Any]
-
class
espnet.nets.pytorch_backend.lm.default.
RNNLM
(n_vocab, n_layers, n_units, typ='lstm', dropout_rate=0.5)[source]¶ Bases:
torch.nn.modules.module.Module
A pytorch RNNLM.
Initialize class.
- Parameters
n_vocab (int) – The size of the vocabulary
n_layers (int) – The number of layers to create
n_units (int) – The number of units per layer
typ (str) – The RNN type
espnet.nets.pytorch_backend.lm.seq_rnn¶
Sequential implementation of Recurrent Neural Network Language Model.
-
class
espnet.nets.pytorch_backend.lm.seq_rnn.
SequentialRNNLM
(n_vocab, args)[source]¶ Bases:
espnet.nets.lm_interface.LMInterface
,torch.nn.modules.module.Module
Sequential RNNLM.
Initialize class.
- Parameters
n_vocab (int) – The size of the vocabulary
args (argparse.Namespace) – configurations. see py:method:add_arguments
-
forward
(x, t)[source]¶ Compute LM loss value from buffer sequences.
- Parameters
x (torch.Tensor) – Input ids. (batch, len)
t (torch.Tensor) – Target ids. (batch, len)
- Returns
- Tuple of
loss to backward (scalar), negative log-likelihood of t: -log p(t) (scalar) and the number of elements in x (scalar)
- Return type
tuple[torch.Tensor, torch.Tensor, torch.Tensor]
Notes
The last two return values are used in perplexity: p(t)^{-1/n} = exp(-log p(t) / n)
-
init_state
(x)[source]¶ Get an initial state for decoding.
- Parameters
x (torch.Tensor) – The encoded feature tensor
Returns: initial state
-
score
(y, state, x)[source]¶ Score new token.
- Parameters
y (torch.Tensor) – 1D torch.int64 prefix tokens.
state – Scorer state for prefix tokens
x (torch.Tensor) – 2D encoder feature that generates ys.
- Returns
- Tuple of
torch.float32 scores for next token (n_vocab) and next state for ys
- Return type
tuple[torch.Tensor, Any]
espnet.nets.pytorch_backend.lm.transformer¶
Transformer language model.
-
class
espnet.nets.pytorch_backend.lm.transformer.
TransformerLM
(n_vocab, args)[source]¶ Bases:
torch.nn.modules.module.Module
,espnet.nets.lm_interface.LMInterface
Transformer language model.
Initialize class.
- Parameters
n_vocab (int) – The size of the vocabulary
args (argparse.Namespace) – configurations. see py:method:add_arguments
-
forward
(x: torch.Tensor, t: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Compute LM loss value from buffer sequences.
- Parameters
x (torch.Tensor) – Input ids. (batch, len)
t (torch.Tensor) – Target ids. (batch, len)
- Returns
- Tuple of
loss to backward (scalar), negative log-likelihood of t: -log p(t) (scalar) and the number of elements in x (scalar)
- Return type
tuple[torch.Tensor, torch.Tensor, torch.Tensor]
Notes
The last two return values are used in perplexity: p(t)^{-1/n} = exp(-log p(t) / n)
-
score
(y: torch.Tensor, state: Any, x: torch.Tensor) → Tuple[torch.Tensor, Any][source]¶ Score new token.
- Parameters
y (torch.Tensor) – 1D torch.int64 prefix tokens.
state – Scorer state for prefix tokens
x (torch.Tensor) – encoder feature that generates ys.
- Returns
- Tuple of
torch.float32 scores for next token (n_vocab) and next state for ys
- Return type
tuple[torch.Tensor, Any]
espnet.nets.pytorch_backend.rnn.attentions¶
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttAdd
(eprojs, dunits, att_dim)[source]¶ Bases:
torch.nn.modules.module.Module
Additive attention
- Parameters
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
att_dim (int) – attention dimension
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0)[source]¶ AttAdd forward
- Parameters
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev (torch.Tensor) – dummy (not used)
scaling (float) – scaling parameter before applying softmax
- Returns
attention weighted encoder state (B, D_enc)
- Return type
torch.Tensor
- Returns
previous attention weights (B x T_max)
- Return type
torch.Tensor
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttCov
(eprojs, dunits, att_dim)[source]¶ Bases:
torch.nn.modules.module.Module
Coverage mechanism attention
- Reference: Get To The Point: Summarization with Pointer-Generator Network
- Parameters
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
att_dim (int) – attention dimension
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev_list, scaling=2.0)[source]¶ AttCov forward
- Parameters
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev_list (list) – list of previous attention weight
scaling (float) – scaling parameter before applying softmax
- Returns
attention weighted encoder state (B, D_enc)
- Return type
torch.Tensor
- Returns
list of previous attention weights
- Return type
list
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttCovLoc
(eprojs, dunits, att_dim, aconv_chans, aconv_filts)[source]¶ Bases:
torch.nn.modules.module.Module
Coverage mechanism location aware attention
This attention is a combination of coverage and location-aware attentions.
- Parameters
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
att_dim (int) – attention dimension
aconv_chans (int) – # channels of attention convolution
aconv_filts (int) – filter size of attention convolution
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev_list, scaling=2.0)[source]¶ AttCovLoc forward
- Parameters
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev_list (list) – list of previous attention weight
scaling (float) – scaling parameter before applying softmax
- Returns
attention weighted encoder state (B, D_enc)
- Return type
torch.Tensor
- Returns
list of previous attention weights
- Return type
list
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttDot
(eprojs, dunits, att_dim)[source]¶ Bases:
torch.nn.modules.module.Module
Dot product attention
- Parameters
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
att_dim (int) – attention dimension
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0)[source]¶ AttDot forward
- Parameters
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – dummy (not used)
att_prev (torch.Tensor) – dummy (not used)
scaling (float) – scaling parameter before applying softmax
- Returns
attention weighted encoder state (B, D_enc)
- Return type
torch.Tensor
- Returns
previous attention weight (B x T_max)
- Return type
torch.Tensor
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttForward
(eprojs, dunits, att_dim, aconv_chans, aconv_filts)[source]¶ Bases:
torch.nn.modules.module.Module
Forward attention
- Reference: Forward attention in sequence-to-sequence acoustic modeling for speech synthesis
- Parameters
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
att_dim (int) – attention dimension
aconv_chans (int) – # channels of attention convolution
aconv_filts (int) – filter size of attention convolution
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=1.0)[source]¶ AttForward forward
- Parameters
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev (torch.Tensor) – attention weights of previous step
scaling (float) – scaling parameter before applying softmax
- Returns
attention weighted encoder state (B, D_enc)
- Return type
torch.Tensor
- Returns
previous attention weights (B x T_max)
- Return type
torch.Tensor
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttForwardTA
(eunits, dunits, att_dim, aconv_chans, aconv_filts, odim)[source]¶ Bases:
torch.nn.modules.module.Module
Forward attention with transition agent
- Reference: Forward attention in sequence-to-sequence acoustic modeling for speech synthesis
- Parameters
eunits (int) – # units of encoder
dunits (int) – # units of decoder
att_dim (int) – attention dimension
aconv_chans (int) – # channels of attention convolution
aconv_filts (int) – filter size of attention convolution
odim (int) – output dimension
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev, out_prev, scaling=1.0)[source]¶ AttForwardTA forward
- Parameters
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B, Tmax, eunits)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B, dunits)
att_prev (torch.Tensor) – attention weights of previous step
out_prev (torch.Tensor) – decoder outputs of previous step (B, odim)
scaling (float) – scaling parameter before applying softmax
- Returns
attention weighted encoder state (B, dunits)
- Return type
torch.Tensor
- Returns
previous attention weights (B, Tmax)
- Return type
torch.Tensor
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttLoc
(eprojs, dunits, att_dim, aconv_chans, aconv_filts)[source]¶ Bases:
torch.nn.modules.module.Module
Location-aware attention
- Reference: Attention-Based Models for Speech Recognition
- Parameters
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
att_dim (int) – attention dimension
aconv_chans (int) – # channels of attention convolution
aconv_filts (int) – filter size of attention convolution
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0)[source]¶ AttLoc forward
- Parameters
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev (torch.Tensor) – previous attention weight (B x T_max)
scaling (float) – scaling parameter before applying softmax
- Returns
attention weighted encoder state (B, D_enc)
- Return type
torch.Tensor
- Returns
previous attention weights (B x T_max)
- Return type
torch.Tensor
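The forward methods above are meant to be called once per decoder step, threading the returned weights back in as att_prev. A minimal sketch with random features (passing att_prev=None on the first step to get uniform initial weights, and calling reset() between utterances, follow the ESPnet implementation but should be verified against the source):

import torch
from espnet.nets.pytorch_backend.rnn.attentions import AttLoc

B, T_max, D_enc, D_dec = 2, 50, 320, 300
att = AttLoc(eprojs=D_enc, dunits=D_dec, att_dim=320, aconv_chans=10, aconv_filts=100)

enc_hs_pad = torch.randn(B, T_max, D_enc)  # padded encoder states (B, T_max, D_enc)
enc_hs_len = [50, 42]                      # true length of each utterance
dec_z = torch.zeros(B, D_dec)              # decoder state at the first step

att.reset()      # clear cached projections between utterances
att_prev = None  # None lets the module initialize uniform weights
for _ in range(3):
    att_c, att_prev = att(enc_hs_pad, enc_hs_len, dec_z, att_prev)
    # att_c: (B, D_enc) context vector; att_prev: (B, T_max) weights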
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttLoc2D
(eprojs, dunits, att_dim, att_win, aconv_chans, aconv_filts)[source]¶ Bases:
torch.nn.modules.module.Module
2D location-aware attention
This attention is an extended version of location-aware attention. It takes into account not only the attention weights of the previous frame but also those of earlier frames.
- Parameters
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
att_dim (int) – attention dimension
aconv_chans (int) – # channels of attention convolution
aconv_filts (int) – filter size of attention convolution
att_win (int) – attention window size (default=5)
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0)[source]¶ AttLoc2D forward
- Parameters
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev (torch.Tensor) – previous attention weight (B x att_win x T_max)
scaling (float) – scaling parameter before applying softmax
- Returns
attention weighted encoder state (B, D_enc)
- Return type
torch.Tensor
- Returns
previous attention weights (B x att_win x T_max)
- Return type
torch.Tensor
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttLocRec
(eprojs, dunits, att_dim, aconv_chans, aconv_filts)[source]¶ Bases:
torch.nn.modules.module.Module
Location-aware recurrent attention
This attention is an extended version of location-aware attention. With the use of an RNN, it takes the effect of the history of attention weights into account.
- Parameters
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
att_dim (int) – attention dimension
aconv_chans (int) – # channels of attention convolution
aconv_filts (int) – filter size of attention convolution
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev_states, scaling=2.0)[source]¶ AttLocRec forward
- Parameters
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev_states (tuple) – previous attention weight and lstm states ((B, T_max), ((B, att_dim), (B, att_dim)))
scaling (float) – scaling parameter before applying softmax
- Returns
attention weighted encoder state (B, D_enc)
- Return type
torch.Tensor
- Returns
previous attention weights and lstm states (w, (hx, cx)) ((B, T_max), ((B, att_dim), (B, att_dim)))
- Return type
tuple
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttMultiHeadAdd
(eprojs, dunits, aheads, att_dim_k, att_dim_v)[source]¶ Bases:
torch.nn.modules.module.Module
Multi head additive attention
- Reference: Attention is all you need
This attention is multi head attention using additive attention for each head.
- Parameters
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
aheads (int) – # heads of multi head attention
att_dim_k (int) – dimension k in multi head attention
att_dim_v (int) – dimension v in multi head attention
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev)[source]¶ AttMultiHeadAdd forward
- Parameters
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev (torch.Tensor) – dummy (not used)
- Returns
attention weighted encoder state (B, D_enc)
- Return type
torch.Tensor
- Returns
list of previous attention weight (B x T_max) * aheads
- Return type
list
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttMultiHeadDot
(eprojs, dunits, aheads, att_dim_k, att_dim_v)[source]¶ Bases:
torch.nn.modules.module.Module
Multi head dot product attention
- Reference: Attention is all you need
- Parameters
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
aheads (int) – # heads of multi head attention
att_dim_k (int) – dimension k in multi head attention
att_dim_v (int) – dimension v in multi head attention
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev)[source]¶ AttMultiHeadDot forward
- Parameters
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev (torch.Tensor) – dummy (not used)
- Returns
attention weighted encoder state (B x D_enc)
- Return type
torch.Tensor
- Returns
list of previous attention weight (B x T_max) * aheads
- Return type
list
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttMultiHeadLoc
(eprojs, dunits, aheads, att_dim_k, att_dim_v, aconv_chans, aconv_filts)[source]¶ Bases:
torch.nn.modules.module.Module
Multi head location based attention
- Reference: Attention is all you need
This attention is multi head attention using location-aware attention for each head.
- Parameters
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
aheads (int) – # heads of multi head attention
att_dim_k (int) – dimension k in multi head attention
att_dim_v (int) – dimension v in multi head attention
aconv_chans (int) – # channels of attention convolution
aconv_filts (int) – filter size of attention convolution
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0)[source]¶ AttMultiHeadLoc forward
- Parameters
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev (torch.Tensor) – list of previous attention weight (B x T_max) * aheads
scaling (float) – scaling parameter before applying softmax
- Returns
attention weighted encoder state (B x D_enc)
- Return type
torch.Tensor
- Returns
list of previous attention weight (B x T_max) * aheads
- Return type
list
-
class
espnet.nets.pytorch_backend.rnn.attentions.
AttMultiHeadMultiResLoc
(eprojs, dunits, aheads, att_dim_k, att_dim_v, aconv_chans, aconv_filts)[source]¶ Bases:
torch.nn.modules.module.Module
Multi head multi resolution location based attention
- Reference: Attention is all you need
This attention is multi head attention using location-aware attention for each head. Furthermore, it uses a different filter size for each head.
- Parameters
eprojs (int) – # projection-units of encoder
dunits (int) – # units of decoder
aheads (int) – # heads of multi head attention
att_dim_k (int) – dimension k in multi head attention
att_dim_v (int) – dimension v in multi head attention
aconv_chans (int) – maximum # channels of attention convolution; each head uses #ch = aconv_chans * (head + 1) / aheads, e.g., aheads=4, aconv_chans=100 => 25, 50, 75, 100 per head (see the sketch after this parameter list)
aconv_filts (int) – filter size of attention convolution
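The per-head formula documented for aconv_chans works out as follows (pure arithmetic; integer division is assumed):

aheads, aconv_chans = 4, 100
per_head = [aconv_chans * (head + 1) // aheads for head in range(aheads)]
print(per_head)  # [25, 50, 75, 100]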
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev)[source]¶ AttMultiHeadMultiResLoc forward
- Parameters
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B x T_max x D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – decoder hidden state (B x D_dec)
att_prev (torch.Tensor) – list of previous attention weight (B x T_max) * aheads
- Returns
attention weighted encoder state (B x D_enc)
- Return type
torch.Tensor
- Returns
list of previous attention weight (B x T_max) * aheads
- Return type
list
-
class
espnet.nets.pytorch_backend.rnn.attentions.
NoAtt
[source]¶ Bases:
torch.nn.modules.module.Module
No attention
-
forward
(enc_hs_pad, enc_hs_len, dec_z, att_prev)[source]¶ NoAtt forward
- Parameters
enc_hs_pad (torch.Tensor) – padded encoder hidden state (B, T_max, D_enc)
enc_hs_len (list) – padded encoder hidden state length (B)
dec_z (torch.Tensor) – dummy (not used)
att_prev (torch.Tensor) – dummy (not used)
- Returns
attention weighted encoder state (B, D_enc)
- Return type
torch.Tensor
- Returns
previous attention weights
- Return type
torch.Tensor
-
-
espnet.nets.pytorch_backend.rnn.attentions.
att_for
(args, num_att=1)[source]¶ Instantiates an attention module given the program arguments
- Parameters
args (Namespace) – The arguments
num_att (int) – number of attention modules (in multi-speaker case, it can be 2 or more)
- Returns
The attention module
- Return type
torch.nn.Module
-
espnet.nets.pytorch_backend.rnn.attentions.
att_to_numpy
(att_ws, att)[source]¶ Converts attention weights to a numpy array given the attention
- Parameters
att_ws (list) – The attention weights
att (torch.nn.Module) – The attention
- Return type
np.ndarray
- Returns
The numpy array of the attention weights
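A minimal sketch of att_to_numpy, assuming weights collected from an AttLoc instance over three decoder steps; the expected (B, Lmax, Tmax) output shape is an assumption based on the single-head case:

import torch
from espnet.nets.pytorch_backend.rnn.attentions import AttLoc, att_to_numpy

att = AttLoc(eprojs=320, dunits=300, att_dim=320, aconv_chans=10, aconv_filts=100)
# Hypothetical per-step weights of shape (B, T_max), as AttLoc returns them.
att_ws = [torch.softmax(torch.randn(2, 50), dim=-1) for _ in range(3)]

att_ws_np = att_to_numpy(att_ws, att)
print(att_ws_np.shape)  # expected: (2, 3, 50), i.e. (B, Lmax, Tmax)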
espnet.nets.pytorch_backend.rnn.decoders¶
-
class
espnet.nets.pytorch_backend.rnn.decoders.
Decoder
(eprojs, odim, dtype, dlayers, dunits, sos, eos, att, verbose=0, char_list=None, labeldist=None, lsm_weight=0.0, sampling_probability=0.0, dropout=0.0, context_residual=False, replace_sos=False)[source]¶ Bases:
torch.nn.modules.module.Module
,espnet.nets.scorer_interface.ScorerInterface
Decoder module
- Parameters
eprojs (int) – # encoder projection units
odim (int) – dimension of outputs
dtype (str) – gru or lstm
dlayers (int) – # decoder layers
dunits (int) – # decoder units
sos (int) – start of sequence symbol id
eos (int) – end of sequence symbol id
att (torch.nn.Module) – attention module
verbose (int) – verbose level
char_list (list) – list of character strings
labeldist (ndarray) – distribution of label smoothing
lsm_weight (float) – label smoothing weight
sampling_probability (float) – scheduled sampling probability
dropout (float) – dropout rate
context_residual (bool) – if True, use context vector for token generation
replace_sos (bool) – used for multilingual (speech/text) translation
-
calculate_all_attentions
(hs_pad, hlen, ys_pad, strm_idx=0, tgt_lang_ids=None)[source]¶ Calculate all of the attentions
- Parameters
hs_pad (torch.Tensor) – batch of padded hidden state sequences (B, Tmax, D)
hlen (torch.Tensor) – batch of lengths of hidden state sequences (B)
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)
strm_idx (int) – stream index for parallel speaker attention in multi-speaker case
tgt_lang_ids (torch.Tensor) – batch of target language id tensor (B, 1)
- Returns
attention weights with the following shape, 1) multi-head case => attention weights (B, H, Lmax, Tmax), 2) other case => attention weights (B, Lmax, Tmax).
- Return type
float ndarray
-
forward
(hs_pad, hlens, ys_pad, strm_idx=0, tgt_lang_ids=None)[source]¶ Decoder forward
- Parameters
hs_pad (torch.Tensor) – batch of padded hidden state sequences (B, Tmax, D)
hlens (torch.Tensor) – batch of lengths of hidden state sequences (B)
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)
strm_idx (int) – stream index indicates the index of decoding stream.
tgt_lang_ids (torch.Tensor) – batch of target language id tensor (B, 1)
- Returns
attention loss value
- Return type
torch.Tensor
- Returns
accuracy
- Return type
float
-
init_state
(x)[source]¶ Get an initial state for decoding (optional).
- Parameters
x (torch.Tensor) – The encoded feature tensor
Returns: initial state
-
recognize_beam
(h, lpz, recog_args, char_list, rnnlm=None, strm_idx=0)[source]¶ beam search implementation
- Parameters
h (torch.Tensor) – encoder hidden state (T, eprojs)
lpz (torch.Tensor) – ctc log softmax output (T, odim)
recog_args (Namespace) – argument Namespace containing options
char_list – list of character strings
rnnlm (torch.nn.Module) – language module
strm_idx (int) – stream index for speaker parallel attention in multi-speaker case
- Returns
N-best decoding results
- Return type
list of dicts
-
recognize_beam_batch
(h, hlens, lpz, recog_args, char_list, rnnlm=None, normalize_score=True, strm_idx=0, tgt_lang_ids=None)[source]¶
-
score
(yseq, state, x)[source]¶ Score new token (required).
- Parameters
yseq (torch.Tensor) – 1D torch.int64 prefix tokens.
state – Scorer state for prefix tokens
x (torch.Tensor) – The encoder feature that generates yseq.
- Returns
- Tuple of
scores for next token that has a shape of (n_vocab) and next state for ys
- Return type
tuple[torch.Tensor, Any]
espnet.nets.pytorch_backend.rnn.decoders_transducer¶
Transducer and transducer with attention implementation for training and decoding.
-
class
espnet.nets.pytorch_backend.rnn.decoders_transducer.
DecoderRNNT
(eprojs, odim, dtype, dlayers, dunits, blank, embed_dim, joint_dim, dropout=0.0, dropout_embed=0.0, rnnt_type='warp-transducer')[source]¶ Bases:
torch.nn.modules.module.Module
RNN-T Decoder module.
- Parameters
eprojs (int) – # encoder projection units
odim (int) – dimension of outputs
dtype (str) – gru or lstm
dlayers (int) – # prediction layers
dunits (int) – # prediction units
blank (int) – blank symbol id
embed_dim (int) – dimension of embeddings
joint_dim (int) – dimension of joint space
dropout (float) – dropout rate
dropout_embed (float) – embedding dropout rate
rnnt_type (str) – type of rnn-t implementation
Transducer initializer.
-
forward
(hs_pad, hlens, ys_pad)[source]¶ Forward function for transducer.
- Parameters
hs_pad (torch.Tensor) – batch of padded hidden state sequences (B, Tmax, D)
hlens (torch.Tensor) – batch of lengths of hidden state sequences (B)
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)
- Returns
rnnt loss value
- Return type
loss (float)
-
joint
(h_enc, h_dec)[source]¶ Joint computation of z.
- Parameters
h_enc (torch.Tensor) – batch of expanded hidden state (B, T, 1, Henc)
h_dec (torch.Tensor) – batch of expanded hidden state (B, 1, U, Hdec)
- Returns
output (B, T, U, odim)
- Return type
z (torch.Tensor)
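The (B, T, 1, Henc) and (B, 1, U, Hdec) shapes broadcast against each other so that every encoder frame is combined with every prediction step. A shape-only sketch, with hypothetical projection layers standing in for the module's internals:

import torch

B, T, U, Henc, Hdec, joint_dim, odim = 2, 50, 10, 320, 300, 256, 52
h_enc = torch.randn(B, T, 1, Henc)  # encoder states expanded over labels
h_dec = torch.randn(B, 1, U, Hdec)  # prediction states expanded over time

lin_enc = torch.nn.Linear(Henc, joint_dim)  # hypothetical projections
lin_dec = torch.nn.Linear(Hdec, joint_dim)
lin_out = torch.nn.Linear(joint_dim, odim)

z = lin_out(torch.tanh(lin_enc(h_enc) + lin_dec(h_dec)))
print(z.shape)  # torch.Size([2, 50, 10, 52]) -> (B, T, U, odim)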
-
recognize
(h, recog_args)[source]¶ Greedy search implementation.
- Parameters
h (torch.Tensor) – encoder hidden state sequences (Tmax, Henc)
recog_args (Namespace) – argument Namespace containing options
- Returns
1-best decoding results
- Return type
hyp (list of dicts)
-
recognize_beam
(h, recog_args, rnnlm=None)[source]¶ Beam search implementation.
- Parameters
h (torch.Tensor) – encoder hidden state sequences (Tmax, Henc)
recog_args (Namespace) – argument Namespace containing options
rnnlm (torch.nn.Module) – language module
- Returns
n-best decoding results
- Return type
nbest_hyps (list of dicts)
-
rnn_forward
(ey, dstate)[source]¶ RNN forward.
- Parameters
ey (torch.Tensor) – batch of input features (B, Lmax, Emb_dim)
dstate (list) – list of L input hidden and cell state (1, B, Hdec)
- Returns
batch of output features (B, Lmax, Hdec). dstate (list): list of L output hidden and cell states (1, B, Hdec)
- Return type
output (torch.Tensor)
-
class
espnet.nets.pytorch_backend.rnn.decoders_transducer.
DecoderRNNTAtt
(eprojs, odim, dtype, dlayers, dunits, blank, att, embed_dim, joint_dim, dropout=0.0, dropout_embed=0.0, rnnt_type='warp-transducer')[source]¶ Bases:
torch.nn.modules.module.Module
RNNT-Att Decoder module.
- Parameters
eprojs (int) – # encoder projection units
odim (int) – dimension of outputs
dtype (str) – gru or lstm
dlayers (int) – # decoder layers
dunits (int) – # decoder units
blank (int) – blank symbol id
att (torch.nn.Module) – attention module
embed_dim (int) – dimension of embeddings
joint_dim (int) – dimension of joint space
dropout (float) – dropout rate
dropout_embed (float) – embedding dropout rate
rnnt_type (str) – type of rnnt implementation
Transducer with attention initializer.
-
calculate_all_attentions
(hs_pad, hlens, ys_pad)[source]¶ Calculate all of the attentions.
- Parameters
hs_pad (torch.Tensor) – batch of padded hidden state sequences (B, Tmax, D)
hlens (torch.Tensor) – batch of lengths of hidden state sequences (B)
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)
- Returns
- attention weights with the following shape,
multi-head case => attention weights (B, H, Lmax, Tmax),
other case => attention weights (B, Lmax, Tmax).
- Return type
att_ws (ndarray)
-
forward
(hs_pad, hlens, ys_pad)[source]¶ Forward function for transducer with attention.
- Parameters
hs_pad (torch.Tensor) – batch of padded hidden state sequences (B, Tmax, D)
hlens (torch.Tensor) – batch of lengths of hidden state sequences (B)
ys_pad (torch.Tensor) – batch of padded character id sequence tensor (B, Lmax)
- Returns
rnnt-att loss value
- Return type
loss (torch.Tensor)
-
joint
(h_enc, h_dec)[source]¶ Joint computation of z.
- Parameters
h_enc (torch.Tensor) – batch of expanded hidden state (B, T, 1, Henc)
h_dec (torch.Tensor) – batch of expanded hidden state (B, 1, U, Hdec)
- Returns
output (B, T, U, odim)
- Return type
z (torch.Tensor)
-
recognize
(h, recog_args)[source]¶ Greedy search implementation.
- Parameters
h (torch.Tensor) – encoder hidden state sequences (Tmax, Henc)
recog_args (Namespace) – argument Namespace containing options
- Returns
1-best decoding results
- Return type
hyp (list of dicts)
-
recognize_beam
(h, recog_args, rnnlm=None)[source]¶ Beam search recognition.
- Parameters
h (torch.Tensor) – encoder hidden state sequences (Tmax, Henc)
recog_args (Namespace) – argument Namespace containing options
rnnlm (torch.nn.Module) – language module
- Returns
N-best decoding results
- Return type
nbest_hyps (list of dicts)
-
rnn_forward
(ey, dstate)[source]¶ RNN forward.
- Parameters
ey (torch.Tensor) – batch of input features (B, (Emb_dim + Eprojs))
dstate (list) – list of L input hidden and cell state (B, Hdec)
- Returns
decoder output for one step (B, Hdec). dstate (list): list of L output hidden and cell states (B, Hdec)
- Return type
y (torch.Tensor)
espnet.nets.pytorch_backend.rnn.encoders¶
-
class
espnet.nets.pytorch_backend.rnn.encoders.
Encoder
(etype, idim, elayers, eunits, eprojs, subsample, dropout, in_channel=1)[source]¶ Bases:
torch.nn.modules.module.Module
Encoder module
- Parameters
etype (str) – type of encoder network
idim (int) – number of dimensions of encoder network
elayers (int) – number of layers of encoder network
eunits (int) – number of lstm units of encoder network
eprojs (int) – number of projection units of encoder network
subsample (np.ndarray) – list of subsampling numbers
dropout (float) – dropout rate
in_channel (int) – number of input channels
-
forward
(xs_pad, ilens, prev_states=None)[source]¶ Encoder forward
- Parameters
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, D)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
prev_states (torch.Tensor) – batch of previous encoder hidden states (?, …)
- Returns
batch of hidden state sequences (B, Tmax, eprojs)
- Return type
torch.Tensor
-
class
espnet.nets.pytorch_backend.rnn.encoders.
RNN
(idim, elayers, cdim, hdim, dropout, typ='blstm')[source]¶ Bases:
torch.nn.modules.module.Module
RNN module
- Parameters
idim (int) – dimension of inputs
elayers (int) – number of encoder layers
cdim (int) – number of rnn units (resulting in cdim * 2 if bidirectional)
hdim (int) – number of final projection units
dropout (float) – dropout rate
typ (str) – The RNN type
-
forward
(xs_pad, ilens, prev_state=None)[source]¶ RNN forward
- Parameters
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, D)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
prev_state (torch.Tensor) – batch of previous RNN states
- Returns
batch of hidden state sequences (B, Tmax, eprojs)
- Return type
torch.Tensor
-
class
espnet.nets.pytorch_backend.rnn.encoders.
RNNP
(idim, elayers, cdim, hdim, subsample, dropout, typ='blstm')[source]¶ Bases:
torch.nn.modules.module.Module
RNN with projection layer module
- Parameters
idim (int) – dimension of inputs
elayers (int) – number of encoder layers
cdim (int) – number of rnn units (resulting in cdim * 2 if bidirectional)
hdim (int) – number of projection units
subsample (np.ndarray) – list of subsampling numbers
dropout (float) – dropout rate
typ (str) – The RNN type
-
forward
(xs_pad, ilens, prev_state=None)[source]¶ RNNP forward
- Parameters
xs_pad (torch.Tensor) – batch of padded input sequences (B, Tmax, idim)
ilens (torch.Tensor) – batch of lengths of input sequences (B)
prev_state (torch.Tensor) – batch of previous RNN states
- Returns
batch of hidden state sequences (B, Tmax, hdim)
- Return type
torch.Tensor
espnet.nets.pytorch_backend.streaming.segment¶
espnet.nets.pytorch_backend.streaming.window¶
-
class
espnet.nets.pytorch_backend.streaming.window.
WindowStreamingE2E
(e2e, recog_args, rnnlm=None)[source]¶ Bases:
object
WindowStreamingE2E constructor.
- Parameters
e2e (E2E) – E2E ASR object
recog_args – arguments for “recognize” method of E2E
-
decode_with_attention_offline
()[source]¶ Run the attention decoder offline.
Works even if the previous layers (encoder and CTC decoder) were run in online mode. This method should be run after all the audio has been consumed. It is used mostly to compare the results between the offline and online implementations of the previous layers.
espnet.nets.pytorch_backend.tacotron2.cbhg¶
CBHG related modules.
-
class
espnet.nets.pytorch_backend.tacotron2.cbhg.
CBHG
(idim, odim, conv_bank_layers=8, conv_bank_chans=128, conv_proj_filts=3, conv_proj_chans=256, highway_layers=4, highway_units=128, gru_units=256)[source]¶ Bases:
torch.nn.modules.module.Module
CBHG module to convert log Mel-filterbanks to linear spectrogram.
This is a module of CBHG introduced in Tacotron: Towards End-to-End Speech Synthesis. The CBHG converts the sequence of log Mel-filterbanks into a linear spectrogram.
Initialize CBHG module.
- Parameters
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
conv_bank_layers (int, optional) – The number of convolution bank layers.
conv_bank_chans (int, optional) – The number of channels in convolution bank.
conv_proj_filts (int, optional) – Kernel size of convolutional projection layer.
conv_proj_chans (int, optional) – The number of channels in convolutional projection layer.
highway_layers (int, optional) – The number of highway network layers.
highway_units (int, optional) – The number of highway network units.
gru_units (int, optional) – The number of GRU units (for both directions).
-
forward
(xs, ilens)[source]¶ Calculate forward propagation.
- Parameters
xs (Tensor) – Batch of the padded sequences of inputs (B, Tmax, idim).
ilens (LongTensor) – Batch of lengths of each input sequence (B,).
- Returns
Batch of the padded sequence of outputs (B, Tmax, odim). LongTensor: Batch of lengths of each output sequence (B,).
- Return type
Tensor
-
class
espnet.nets.pytorch_backend.tacotron2.cbhg.
CBHGLoss
(use_masking=True)[source]¶ Bases:
torch.nn.modules.module.Module
Loss function module for CBHG.
Initialize CBHG loss module.
- Parameters
use_masking (bool) – Whether to mask padded part in loss calculation.
-
forward
(cbhg_outs, spcs, olens)[source]¶ Calculate forward propagation.
- Parameters
cbhg_outs (Tensor) – Batch of CBHG outputs (B, Lmax, spc_dim).
spcs (Tensor) – Batch of groundtruth of spectrogram (B, Lmax, spc_dim).
olens (LongTensor) – Batch of the lengths of each sequence (B,).
- Returns
L1 loss value. Tensor: Mean square error loss value.
- Return type
Tensor
-
class
espnet.nets.pytorch_backend.tacotron2.cbhg.
HighwayNet
(idim)[source]¶ Bases:
torch.nn.modules.module.Module
Highway Network module.
This is a module of Highway Network introduced in Highway Networks.
Initialize Highway Network module.
- Parameters
idim (int) – Dimension of the inputs.
espnet.nets.pytorch_backend.tacotron2.decoder¶
Tacotron2 decoder related modules.
-
class
espnet.nets.pytorch_backend.tacotron2.decoder.
Decoder
(idim, odim, att, dlayers=2, dunits=1024, prenet_layers=2, prenet_units=256, postnet_layers=5, postnet_chans=512, postnet_filts=5, output_activation_fn=None, cumulate_att_w=True, use_batch_norm=True, use_concate=True, dropout_rate=0.5, zoneout_rate=0.1, reduction_factor=1)[source]¶ Bases:
torch.nn.modules.module.Module
Decoder module of Spectrogram prediction network.
This is a module of the decoder of the Spectrogram prediction network in Tacotron2, which is described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. The decoder generates the sequence of features from the sequence of hidden states.
Initialize Tacotron2 decoder module.
- Parameters
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
att (torch.nn.Module) – Instance of attention class.
dlayers (int, optional) – The number of decoder lstm layers.
dunits (int, optional) – The number of decoder lstm units.
prenet_layers (int, optional) – The number of prenet layers.
prenet_units (int, optional) – The number of prenet units.
postnet_layers (int, optional) – The number of postnet layers.
postnet_filts (int, optional) – Filter size of postnet layers.
postnet_chans (int, optional) – The number of postnet filter channels.
output_activation_fn (torch.nn.Module, optional) – Activation function for outputs.
cumulate_att_w (bool, optional) – Whether to cumulate previous attention weight.
use_batch_norm (bool, optional) – Whether to use batch normalization.
use_concate (bool, optional) – Whether to concatenate encoder embedding with decoder lstm outputs.
dropout_rate (float, optional) – Dropout rate.
zoneout_rate (float, optional) – Zoneout rate.
reduction_factor (int, optional) – Reduction factor.
-
calculate_all_attentions
(hs, hlens, ys)[source]¶ Calculate all of the attention weights.
- Parameters
hs (Tensor) – Batch of the sequences of padded hidden states (B, Tmax, idim).
hlens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of the sequences of padded target features (B, Lmax, odim).
- Returns
Batch of attention weights (B, Lmax, Tmax).
- Return type
numpy.ndarray
Note
This computation is performed in teacher-forcing manner.
-
forward
(hs, hlens, ys)[source]¶ Calculate forward propagation.
- Parameters
hs (Tensor) – Batch of the sequences of padded hidden states (B, Tmax, idim).
hlens (LongTensor) – Batch of lengths of each input batch (B,).
ys (Tensor) – Batch of the sequences of padded target features (B, Lmax, odim).
- Returns
Batch of output tensors after postnet (B, Lmax, odim). Tensor: Batch of output tensors before postnet (B, Lmax, odim). Tensor: Batch of logits of stop prediction (B, Lmax). Tensor: Batch of attention weights (B, Lmax, Tmax).
- Return type
Tensor
Note
This computation is performed in teacher-forcing manner.
-
inference
(h, threshold=0.5, minlenratio=0.0, maxlenratio=10.0)[source]¶ Generate the sequence of features given the sequences of characters.
- Parameters
h (Tensor) – Input sequence of encoder hidden states (T, C).
threshold (float, optional) – Threshold to stop generation.
minlenratio (float, optional) – Minimum length ratio. If set to 1.0 and the length of input is 10, the minimum length of outputs will be 10 * 1 = 10.
maxlenratio (float, optional) – Maximum length ratio. If set to 10.0 and the length of input is 10, the maximum length of outputs will be 10 * 10 = 100.
- Returns
Output sequence of features (L, odim). Tensor: Output sequence of stop probabilities (L,). Tensor: Attention weights (L, T).
- Return type
Tensor
Note
This computation is performed in auto-regressive manner.
-
class
espnet.nets.pytorch_backend.tacotron2.decoder.
Postnet
(idim, odim, n_layers=5, n_chans=512, n_filts=5, dropout_rate=0.5, use_batch_norm=True)[source]¶ Bases:
torch.nn.modules.module.Module
Postnet module for Spectrogram prediction network.
This is a module of the Postnet in the Spectrogram prediction network, which is described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. The Postnet refines the predicted Mel-filterbank of the decoder, which helps to compensate for the detailed structure of the spectrogram.
Initialize postnet module.
- Parameters
idim (int) – Dimension of the inputs.
odim (int) – Dimension of the outputs.
n_layers (int, optional) – The number of layers.
n_filts (int, optional) – Filter size of the layers.
n_chans (int, optional) – The number of filter channels.
use_batch_norm (bool, optional) – Whether to use batch normalization.
dropout_rate (float, optional) – Dropout rate.
-
class
espnet.nets.pytorch_backend.tacotron2.decoder.
Prenet
(idim, n_layers=2, n_units=256, dropout_rate=0.5)[source]¶ Bases:
torch.nn.modules.module.Module
Prenet module for decoder of Spectrogram prediction network.
This is a module of the Prenet in the decoder of the Spectrogram prediction network, which is described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. The Prenet performs a nonlinear conversion of the inputs before they are fed to the auto-regressive LSTM, which helps to learn diagonal attentions.
Note
This module always applies dropout even in evaluation. See details in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.
Initialize prenet module.
- Parameters
idim (int) – Dimension of the inputs.
n_layers (int, optional) – The number of prenet layers.
n_units (int, optional) – The number of prenet units.
-
class
espnet.nets.pytorch_backend.tacotron2.decoder.
ZoneOutCell
(cell, zoneout_rate=0.1)[source]¶ Bases:
torch.nn.modules.module.Module
ZoneOut Cell module.
This is a module of zoneout described in Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. This code is modified from eladhoffer/seq2seq.pytorch.
Examples
>>> lstm = torch.nn.LSTMCell(16, 32)
>>> lstm = ZoneOutCell(lstm, 0.5)
Initialize zone out cell module.
- Parameters
cell (torch.nn.Module) – Pytorch recurrent cell module, e.g., torch.nn.LSTMCell.
zoneout_rate (float, optional) – Probability of zoneout from 0.0 to 1.0.
-
forward
(inputs, hidden)[source]¶ Calculate forward propagation.
- Parameters
inputs (Tensor) – Batch of input tensor (B, input_size).
hidden (tuple) –
Tensor: Batch of initial hidden states (B, hidden_size).
Tensor: Batch of initial cell states (B, hidden_size).
- Returns
Tensor: Batch of next hidden states (B, hidden_size).
Tensor: Batch of next cell states (B, hidden_size).
- Return type
tuple
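A sketch of the zoneout rule the cell applies to its hidden and cell states: during training each unit keeps its previous value with probability zoneout_rate, and at evaluation the expectation is used instead (this mirrors the cited paper; the module's exact evaluation-time behavior should be checked in the source):

import torch

def zoneout(h_prev, h_next, zoneout_rate=0.1, training=True):
    # Training: each unit keeps its previous value with prob. zoneout_rate.
    if training:
        mask = torch.bernoulli(torch.full_like(h_prev, zoneout_rate))
        return mask * h_prev + (1 - mask) * h_next
    # Evaluation: use the expected value instead of sampling.
    return zoneout_rate * h_prev + (1 - zoneout_rate) * h_next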
espnet.nets.pytorch_backend.tacotron2.encoder¶
Tacotron2 encoder related modules.
-
class
espnet.nets.pytorch_backend.tacotron2.encoder.
Encoder
(idim, embed_dim=512, elayers=1, eunits=512, econv_layers=3, econv_chans=512, econv_filts=5, use_batch_norm=True, use_residual=False, dropout_rate=0.5, padding_idx=0)[source]¶ Bases:
torch.nn.modules.module.Module
Encoder module of Spectrogram prediction network.
This is a module of the encoder of the Spectrogram prediction network in Tacotron2, which is described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. This is the encoder which converts the sequence of characters into the sequence of hidden states.
Initialize Tacotron2 encoder module.
- Parameters
idim (int) – Dimension of the inputs.
embed_dim (int, optional) – Dimension of character embedding.
elayers (int, optional) – The number of encoder blstm layers.
eunits (int, optional) – The number of encoder blstm units.
econv_layers (int, optional) – The number of encoder conv layers.
econv_filts (int, optional) – Filter size of encoder conv layers.
econv_chans (int, optional) – The number of encoder conv channels.
use_batch_norm (bool, optional) – Whether to use batch normalization.
use_residual (bool, optional) – Whether to use residual connection.
dropout_rate (float, optional) – Dropout rate.
-
forward
(xs, ilens=None)[source]¶ Calculate forward propagation.
- Parameters
xs (Tensor) – Batch of the padded sequence of character ids (B, Tmax). Padded value should be 0.
ilens (LongTensor) – Batch of lengths of each input batch (B,).
- Returns
Batch of the sequences of encoder states (B, Tmax, eunits). LongTensor: Batch of lengths of each sequence (B,)
- Return type
Tensor
espnet.nets.pytorch_backend.transformer.attention¶
-
class
espnet.nets.pytorch_backend.transformer.attention.
MultiHeadedAttention
(n_head, n_feat, dropout_rate)[source]¶ Bases:
torch.nn.modules.module.Module
Multi-Head Attention layer
- Parameters
n_head (int) – the number of heads
n_feat (int) – the number of features
dropout_rate (float) – dropout rate
-
forward
(query, key, value, mask)[source]¶ Compute ‘Scaled Dot Product Attention’
- Parameters
query (torch.Tensor) – (batch, time1, size)
key (torch.Tensor) – (batch, time2, size)
value (torch.Tensor) – (batch, time2, size)
mask (torch.Tensor) – (batch, time1, time2)
- Returns
attention-transformed value (batch, time1, d_model) weighted by the query-key attention (batch, head, time1, time2)
- Return type
torch.Tensor
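The core of the layer is standard scaled dot-product attention; a self-contained sketch over already-projected per-head tensors (the module additionally performs the head split/merge projections and dropout, which are omitted here):

import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, head, time, d_k); mask broadcastable to the scores.
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attn = torch.softmax(scores, dim=-1)      # (batch, head, time1, time2)
    return torch.matmul(attn, v)              # (batch, head, time1, d_k)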
espnet.nets.pytorch_backend.transformer.decoder¶
-
class
espnet.nets.pytorch_backend.transformer.decoder.
Decoder
(odim, attention_dim=256, attention_heads=4, linear_units=2048, num_blocks=6, dropout_rate=0.1, positional_dropout_rate=0.1, self_attention_dropout_rate=0.0, src_attention_dropout_rate=0.0, input_layer='embed', use_output_layer=True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before=True, concat_after=False)[source]¶ Bases:
espnet.nets.scorer_interface.ScorerInterface
,torch.nn.modules.module.Module
Transformer decoder module
- Parameters
odim (int) – output dim
attention_dim (int) – dimension of attention
attention_heads (int) – the number of heads of multi head attention
linear_units (int) – the number of units of position-wise feed forward
num_blocks (int) – the number of decoder blocks
dropout_rate (float) – dropout rate
attention_dropout_rate (float) – dropout rate for attention
input_layer (str or torch.nn.Module) – input layer type
use_output_layer (bool) – whether to use output layer
pos_enc_class (class) – PositionalEncoding or ScaledPositionalEncoding
normalize_before (bool) – whether to use layer_norm before the first block
concat_after (bool) – whether to concat attention layer’s input and output. If True, an additional linear layer is applied, i.e., x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e., x -> x + att(x)
-
forward
(tgt, tgt_mask, memory, memory_mask)[source]¶ forward decoder
- Parameters
tgt (torch.Tensor) – input token ids, int64 (batch, maxlen_out) if input_layer == “embed” input tensor (batch, maxlen_out, #mels) in the other cases
tgt_mask (torch.Tensor) – input token mask, uint8 (batch, maxlen_out)
memory (torch.Tensor) – encoded memory, float32 (batch, maxlen_in, feat)
memory_mask (torch.Tensor) – encoded memory mask, uint8 (batch, maxlen_in)
- Return x
decoded token score before softmax (batch, maxlen_out, token) if use_output_layer is True, final block outputs (batch, maxlen_out, attention_dim) in the other cases
- Return type
torch.Tensor
- Return tgt_mask
score mask before softmax (batch, maxlen_out)
- Return type
torch.Tensor
-
recognize
(tgt, tgt_mask, memory)[source]¶ recognize one step
- Parameters
tgt (torch.Tensor) – input token ids, int64 (batch, maxlen_out)
tgt_mask (torch.Tensor) – input token mask, uint8 (batch, maxlen_out)
memory (torch.Tensor) – encoded memory, float32 (batch, maxlen_in, feat)
- Return x
decoded token score before softmax (batch, maxlen_out, token)
- Return type
torch.Tensor
-
score
(ys, state, x)[source]¶ Score new token (required).
- Parameters
ys (torch.Tensor) – 1D torch.int64 prefix tokens.
state – Scorer state for prefix tokens
x (torch.Tensor) – The encoder feature that generates ys.
- Returns
- Tuple of
scores for next token that has a shape of (n_vocab) and next state for ys
- Return type
tuple[torch.Tensor, Any]
espnet.nets.pytorch_backend.transformer.decoder_layer¶
-
class
espnet.nets.pytorch_backend.transformer.decoder_layer.
DecoderLayer
(size, self_attn, src_attn, feed_forward, dropout_rate, normalize_before=True, concat_after=False)[source]¶ Bases:
torch.nn.modules.module.Module
Single decoder layer module
- Parameters
size (int) – input dim
self_attn (espnet.nets.pytorch_backend.transformer.attention.MultiHeadedAttention) – self attention module
src_attn (espnet.nets.pytorch_backend.transformer.attention.MultiHeadedAttention) – source attention module
feed_forward (espnet.nets.pytorch_backend.transformer.positionwise_feed_forward.PositionwiseFeedForward) – feed forward layer module
dropout_rate (float) – dropout rate
normalize_before (bool) – whether to use layer_norm before the first block
concat_after (bool) – whether to concat attention layer’s input and output. If True, an additional linear layer is applied, i.e., x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e., x -> x + att(x)
-
forward
(tgt, tgt_mask, memory, memory_mask)[source]¶ Compute decoded features
- Parameters
tgt (torch.Tensor) – decoded previous target features (batch, max_time_out, size)
tgt_mask (torch.Tensor) – mask for x (batch, max_time_out)
memory (torch.Tensor) – encoded source features (batch, max_time_in, size)
memory_mask (torch.Tensor) – mask for memory (batch, max_time_in)
espnet.nets.pytorch_backend.transformer.embedding¶
Positional Encoding Module.
-
class
espnet.nets.pytorch_backend.transformer.embedding.
PositionalEncoding
(d_model, dropout_rate, max_len=5000)[source]¶ Bases:
torch.nn.modules.module.Module
Positional encoding.
Initialize class.
- Parameters
d_model (int) – embedding dim
dropout_rate (float) – dropout rate
max_len (int) – maximum input length
-
class
espnet.nets.pytorch_backend.transformer.embedding.
ScaledPositionalEncoding
(d_model, dropout_rate, max_len=5000)[source]¶ Bases:
espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding
Scaled positional encoding module.
See also: Sec. 3.2 https://arxiv.org/pdf/1809.08895.pdf
Initialize class.
- Parameters
d_model (int) – embedding dim
dropout_rate (float) – dropout rate
max_len (int) – maximum input length
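Both classes build the usual sinusoidal table; a sketch of its construction (the modules also scale the embeddings and apply dropout, and ScaledPositionalEncoding learns the scale, all omitted here):

import math
import torch

def sinusoidal_table(max_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(...)
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # (max_len, d_model), added to embeddings position-wise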
espnet.nets.pytorch_backend.transformer.encoder¶
-
class
espnet.nets.pytorch_backend.transformer.encoder.
Encoder
(idim, attention_dim=256, attention_heads=4, linear_units=2048, num_blocks=6, dropout_rate=0.1, positional_dropout_rate=0.1, attention_dropout_rate=0.0, input_layer='conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before=True, concat_after=False, positionwise_layer_type='linear', positionwise_conv_kernel_size=1, padding_idx=-1)[source]¶ Bases:
torch.nn.modules.module.Module
Transformer encoder module
- Parameters
idim (int) – input dim
attention_dim (int) – dimension of attention
attention_heads (int) – the number of heads of multi head attention
linear_units (int) – the number of units of position-wise feed forward
num_blocks (int) – the number of encoder blocks
dropout_rate (float) – dropout rate
attention_dropout_rate (float) – dropout rate in attention
positional_dropout_rate (float) – dropout rate after adding positional encoding
input_layer (str or torch.nn.Module) – input layer type
pos_enc_class (class) – PositionalEncoding or ScaledPositionalEncoding
normalize_before (bool) – whether to use layer_norm before the first block
concat_after (bool) – whether to concat attention layer’s input and output. If True, an additional linear layer is applied, i.e., x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e., x -> x + att(x)
positionwise_layer_type (str) – linear or conv1d
positionwise_conv_kernel_size (int) – kernel size of positionwise conv1d layer
padding_idx (int) – padding_idx for input_layer=embed
espnet.nets.pytorch_backend.transformer.encoder_layer¶
-
class
espnet.nets.pytorch_backend.transformer.encoder_layer.
EncoderLayer
(size, self_attn, feed_forward, dropout_rate, normalize_before=True, concat_after=False)[source]¶ Bases:
torch.nn.modules.module.Module
Encoder layer module
- Parameters
size (int) – input dim
self_attn (espnet.nets.pytorch_backend.transformer.attention.MultiHeadedAttention) – self attention module
feed_forward (espnet.nets.pytorch_backend.transformer.positionwise_feed_forward.PositionwiseFeedForward) – feed forward module
dropout_rate (float) – dropout rate
normalize_before (bool) – whether to use layer_norm before the first block
concat_after (bool) – whether to concat attention layer’s input and output. If True, an additional linear layer is applied, i.e., x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e., x -> x + att(x)
espnet.nets.pytorch_backend.transformer.initializer¶
espnet.nets.pytorch_backend.transformer.label_smoothing_loss¶
-
class
espnet.nets.pytorch_backend.transformer.label_smoothing_loss.
LabelSmoothingLoss
(size, padding_idx, smoothing, normalize_length=False, criterion=KLDivLoss())[source]¶ Bases:
torch.nn.modules.module.Module
Label-smoothing loss
- Parameters
size (int) – the number of classes
padding_idx (int) – ignored class id
smoothing (float) – smoothing rate (0.0 means the conventional CE)
normalize_length (bool) – normalize loss by sequence length if True
criterion (torch.nn.Module) – loss function to be smoothed
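A sketch of the smoothed target distribution the loss compares against log-softmax outputs: the gold class receives 1 - smoothing, the remaining classes share smoothing / (size - 1), and padded positions are zeroed out (a sketch of the construction, not the module's exact code):

import torch

def smooth_targets(target, size, smoothing=0.1, padding_idx=-1):
    # target: (batch,) int64 class ids; returns (batch, size) distributions.
    true_dist = torch.full((target.size(0), size), smoothing / (size - 1))
    ignore = target == padding_idx
    safe_target = target.masked_fill(ignore, 0)
    true_dist.scatter_(1, safe_target.unsqueeze(1), 1.0 - smoothing)
    true_dist.masked_fill_(ignore.unsqueeze(1), 0.0)  # drop padded rows
    return true_dist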
espnet.nets.pytorch_backend.transformer.layer_norm¶
espnet.nets.pytorch_backend.transformer.mask¶
-
espnet.nets.pytorch_backend.transformer.mask.
subsequent_mask
(size, device='cpu', dtype=torch.uint8)[source]¶ Create mask for subsequent steps (1, size, size)
- Parameters
size (int) – size of mask
device (str) – “cpu” or “cuda” or torch.Tensor.device
dtype (torch.dtype) – result dtype
- Return type
torch.Tensor
>>> subsequent_mask(3)
[[1, 0, 0],
 [1, 1, 0],
 [1, 1, 1]]
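In decoder self-attention the causal mask is typically combined with a padding mask; a sketch of that combination (the 0 padding id and the exact broadcast pattern are illustrative assumptions):

import torch
from espnet.nets.pytorch_backend.transformer.mask import subsequent_mask

ys_pad = torch.tensor([[7, 3, 9], [5, 2, 0]])   # 0 assumed to be padding
pad_mask = (ys_pad != 0).unsqueeze(-2)          # (batch, 1, Lmax)
causal = subsequent_mask(ys_pad.size(-1)).bool()
tgt_mask = pad_mask & causal                    # broadcasts to (batch, Lmax, Lmax)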
espnet.nets.pytorch_backend.transformer.multi_layer_conv¶
-
class
espnet.nets.pytorch_backend.transformer.multi_layer_conv.
MultiLayeredConv1d
(in_chans, hidden_chans, kernel_size, dropout_rate)[source]¶ Bases:
torch.nn.modules.module.Module
Multi-layered conv1d for Transformer block.
This is a module of multi-layered conv1d designed to replace the position-wise feed-forward network in the Transformer block, which is introduced in FastSpeech: Fast, Robust and Controllable Text to Speech.
- Parameters
in_chans (int) – Number of input channels.
hidden_chans (int) – Number of hidden channels.
kernel_size (int) – Kernel size of conv1d.
dropout_rate (float) – Dropout rate.
espnet.nets.pytorch_backend.transformer.optimizer¶
espnet.nets.pytorch_backend.transformer.plot¶
-
class
espnet.nets.pytorch_backend.transformer.plot.
PlotAttentionReport
(att_vis_fn, data, outdir, converter, transform, device, reverse=False, ikey='input', iaxis=0, okey='output', oaxis=0)[source]¶
-
espnet.nets.pytorch_backend.transformer.plot.
plot_multi_head_attention
(data, attn_dict, outdir, suffix='png', savefn=<function savefig>)[source]¶ Plot multi head attentions
- Parameters
data (dict) – utts info from json file
attn_dict (dict[str, torch.Tensor]) – multi head attention dict; values should be torch.Tensor (head, input_length, output_length)
outdir (str) – dir to save fig
suffix (str) – filename suffix including image type (e.g., png)
savefn – function to save
espnet.nets.pytorch_backend.transformer.positionwise_feed_forward¶
-
class
espnet.nets.pytorch_backend.transformer.positionwise_feed_forward.
PositionwiseFeedForward
(idim, hidden_units, dropout_rate)[source]¶ Bases:
torch.nn.modules.module.Module
Positionwise feed forward
- Parameters
idim (int) – input dimension
hidden_units (int) – number of hidden units
dropout_rate (float) – dropout rate
-
forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet.nets.pytorch_backend.transformer.repeat¶
-
class
espnet.nets.pytorch_backend.transformer.repeat.
MultiSequential
(*args)[source]¶ Bases:
torch.nn.modules.container.Sequential
Multi-input multi-output torch.nn.Sequential
-
forward
(*args)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
espnet.nets.pytorch_backend.transformer.subsampling¶
espnet.nets.scorers.ctc¶
ScorerInterface implementation for CTC.
-
class
espnet.nets.scorers.ctc.
CTCPrefixScorer
(ctc: torch.nn.modules.module.Module, eos: int)[source]¶ Bases:
espnet.nets.scorer_interface.PartialScorerInterface
Decoder interface wrapper for CTCPrefixScore.
Initialize class.
- Parameters
ctc (torch.nn.Module) – The CTC implementation. For example,
espnet.nets.pytorch_backend.ctc.CTC
eos (int) – The end-of-sequence id.
-
init_state
(x: torch.Tensor)[source]¶ Get an initial state for decoding.
- Parameters
x (torch.Tensor) – The encoded feature tensor
Returns: initial state
-
score_partial
(y, ids, state, x)[source]¶ Score new token.
- Parameters
y (torch.Tensor) – 1D prefix token
ids (torch.Tensor) – torch.int64 next tokens to score
state – decoder state for prefix tokens
x (torch.Tensor) – 2D encoder feature that generates ys
- Returns
Tuple of a score tensor for y that has a shape of (len(ids),) and next state for ys
- Return type
tuple[torch.Tensor, Any]
espnet.nets.scorers.length_bonus¶
Length bonus module.
-
class
espnet.nets.scorers.length_bonus.
LengthBonus
(n_vocab: int)[source]¶ Bases:
espnet.nets.scorer_interface.ScorerInterface
Length bonus in beam search.
Initialize class.
- Parameters
n_vocab (int) – The number of tokens in vocabulary for beam search
-
score
(y, state, x)[source]¶ Score new token.
- Parameters
y (torch.Tensor) – 1D torch.int64 prefix tokens.
state – Scorer state for prefix tokens
x (torch.Tensor) – 2D encoder feature that generates ys.
- Returns
- Tuple of
torch.float32 scores for next token (n_vocab) and None
- Return type
tuple[torch.Tensor, Any]
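A minimal usage sketch: the bonus scorer assigns a constant score to every candidate token, which the beam search then scales by the weight registered for this scorer (the shapes follow the documented interface; the exact values are an assumption):

import torch
from espnet.nets.scorers.length_bonus import LengthBonus

bonus = LengthBonus(n_vocab=52)
y = torch.tensor([1, 24, 30])     # 1D int64 prefix tokens
x = torch.randn(83, 256)          # 2D encoder feature (T, D)
scores, state = bonus.score(y, None, x)
print(scores.shape)               # expected: torch.Size([52])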