espnet2.enh package

espnet2.enh.__init__

espnet2.enh.abs_enh

class espnet2.enh.abs_enh.AbsEnhancement[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, collections.OrderedDict][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract forward_rawwav(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, collections.OrderedDict][source]

espnet2.enh.espnet_model

Enhancement model module.

class espnet2.enh.espnet_model.ESPnetEnhancementModel(encoder: espnet2.enh.encoder.abs_encoder.AbsEncoder, separator: espnet2.enh.separator.abs_separator.AbsSeparator, decoder: espnet2.enh.decoder.abs_decoder.AbsDecoder, loss_wrappers: List[espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper], stft_consistency: bool = False, loss_type: str = 'mask_mse', mask_type: Optional[str] = None)[source]

Bases: espnet2.train.abs_espnet_model.AbsESPnetModel

Speech enhancement or separation frontend model.

collect_feats(speech_mix: torch.Tensor, speech_mix_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]
forward(speech_mix: torch.Tensor, speech_mix_lengths: torch.Tensor = None, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]

Frontend + Encoder + Decoder + Calc loss

Parameters
  • speech_mix – (Batch, samples) or (Batch, samples, channels)

  • speech_ref – (Batch, num_speaker, samples) or (Batch, num_speaker, samples, channels)

  • speech_mix_lengths – (Batch,); default is None for the chunk iterator, because the chunk iterator does not return speech_lengths (see espnet2/iterators/chunk_iter_factory.py)
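
Below is a minimal construction sketch showing how the encoder, separator, decoder, and loss wrappers documented in this package fit together. The hyperparameters and the speech_ref1 keyword are illustrative assumptions, not values prescribed by this API; check the recipe configs for the actual settings.

    import torch
    from espnet2.enh.encoder.stft_encoder import STFTEncoder
    from espnet2.enh.decoder.stft_decoder import STFTDecoder
    from espnet2.enh.separator.conformer_separator import ConformerSeparator
    from espnet2.enh.loss.criterions.time_domain import SISNRLoss
    from espnet2.enh.loss.wrappers.pit_solver import PITSolver
    from espnet2.enh.espnet_model import ESPnetEnhancementModel

    encoder = STFTEncoder(n_fft=512, hop_length=128)
    separator = ConformerSeparator(input_dim=encoder.output_dim, num_spk=1)
    decoder = STFTDecoder(n_fft=512, hop_length=128)
    model = ESPnetEnhancementModel(
        encoder=encoder,
        separator=separator,
        decoder=decoder,
        loss_wrappers=[PITSolver(criterion=SISNRLoss())],
    )

    speech_mix = torch.randn(2, 16000)           # (Batch, samples)
    speech_ref = torch.randn(2, 16000)           # reference signal for the single speaker
    lengths = torch.LongTensor([16000, 16000])   # (Batch,)
    # the reference is passed via **kwargs; the "speech_ref1" key is an assumption here
    loss, stats, weight = model(speech_mix, speech_mix_lengths=lengths, speech_ref1=speech_ref)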

espnet2.enh.decoder.__init__

espnet2.enh.decoder.abs_decoder

class espnet2.enh.decoder.abs_decoder.AbsDecoder[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.enh.decoder.conv_decoder

class espnet2.enh.decoder.conv_decoder.ConvDecoder(channel: int, kernel_size: int, stride: int)[source]

Bases: espnet2.enh.decoder.abs_decoder.AbsDecoder

Transposed Convolutional decoder for speech enhancement and separation

forward(input: torch.Tensor, ilens: torch.Tensor)[source]

Forward.

Parameters
  • input (torch.Tensor) – spectrum [Batch, T, F]

  • ilens (torch.Tensor) – input lengths [Batch]

espnet2.enh.decoder.null_decoder

class espnet2.enh.decoder.null_decoder.NullDecoder[source]

Bases: espnet2.enh.decoder.abs_decoder.AbsDecoder

Null decoder; returns the input arguments unchanged.

forward(input: torch.Tensor, ilens: torch.Tensor)[source]

Forward. The input is expected to already be a waveform.

Parameters
  • input (torch.Tensor) – wav [Batch, sample]

  • ilens (torch.Tensor) – input lengths [Batch]

espnet2.enh.decoder.stft_decoder

class espnet2.enh.decoder.stft_decoder.STFTDecoder(n_fft: int = 512, win_length: int = None, hop_length: int = 128, window='hann', center: bool = True, normalized: bool = False, onesided: bool = True)[source]

Bases: espnet2.enh.decoder.abs_decoder.AbsDecoder

STFT decoder for speech enhancement and separation

forward(input: torch_complex.tensor.ComplexTensor, ilens: torch.Tensor)[source]

Forward.

Parameters
  • input (ComplexTensor) – spectrum [Batch, T, F]

  • ilens (torch.Tensor) – input lengths [Batch]

espnet2.enh.encoder.__init__

espnet2.enh.encoder.abs_encoder

class espnet2.enh.encoder.abs_encoder.AbsEncoder[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract property output_dim

espnet2.enh.encoder.conv_encoder

class espnet2.enh.encoder.conv_encoder.ConvEncoder(channel: int, kernel_size: int, stride: int)[source]

Bases: espnet2.enh.encoder.abs_encoder.AbsEncoder

Convolutional encoder for speech enhancement and separation

forward(input: torch.Tensor, ilens: torch.Tensor)[source]

Forward.

Parameters
  • input (torch.Tensor) – mixed speech [Batch, sample]

  • ilens (torch.Tensor) – input lengths [Batch]

Returns

mixed feature after encoder [Batch, flens, channel]

Return type

feature (torch.Tensor)

property output_dim
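
For orientation, a small round trip pairing ConvEncoder with the ConvDecoder documented earlier (the channel/kernel/stride values are arbitrary examples):

    import torch
    from espnet2.enh.encoder.conv_encoder import ConvEncoder
    from espnet2.enh.decoder.conv_decoder import ConvDecoder

    encoder = ConvEncoder(channel=256, kernel_size=20, stride=10)
    decoder = ConvDecoder(channel=256, kernel_size=20, stride=10)

    wav = torch.randn(4, 16000)               # (Batch, sample)
    ilens = torch.LongTensor([16000] * 4)     # (Batch,)
    feature, flens = encoder(wav, ilens)      # (Batch, flens, channel)
    wav_hat, olens = decoder(feature, flens)  # back to (Batch, sample)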

espnet2.enh.encoder.null_encoder

class espnet2.enh.encoder.null_encoder.NullEncoder[source]

Bases: espnet2.enh.encoder.abs_encoder.AbsEncoder

Null encoder.

forward(input: torch.Tensor, ilens: torch.Tensor)[source]

Forward.

Parameters
  • input (torch.Tensor) – mixed speech [Batch, sample]

  • ilens (torch.Tensor) – input lengths [Batch]

property output_dim

espnet2.enh.encoder.stft_encoder

class espnet2.enh.encoder.stft_encoder.STFTEncoder(n_fft: int = 512, win_length: int = None, hop_length: int = 128, window='hann', center: bool = True, normalized: bool = False, onesided: bool = True, use_builtin_complex: bool = True)[source]

Bases: espnet2.enh.encoder.abs_encoder.AbsEncoder

STFT encoder for speech enhancement and separation

forward(input: torch.Tensor, ilens: torch.Tensor)[source]

Forward.

Parameters
  • input (torch.Tensor) – mixed speech [Batch, sample]

  • ilens (torch.Tensor) – input lengths [Batch]

property output_dim
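
A short analysis/synthesis sketch pairing STFTEncoder with the STFTDecoder documented earlier (default settings; whether the spectrum is a builtin complex tensor or a ComplexTensor depends on use_builtin_complex and the PyTorch version):

    import torch
    from espnet2.enh.encoder.stft_encoder import STFTEncoder
    from espnet2.enh.decoder.stft_decoder import STFTDecoder

    encoder = STFTEncoder(n_fft=512, hop_length=128)
    decoder = STFTDecoder(n_fft=512, hop_length=128)

    wav = torch.randn(2, 16000)               # (Batch, sample)
    ilens = torch.LongTensor([16000, 12000])  # (Batch,)
    spec, flens = encoder(wav, ilens)         # complex spectrum (Batch, T, F)
    wav_hat, olens = decoder(spec, flens)     # back to the time domain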

espnet2.enh.layers.__init__

espnet2.enh.layers.beamformer

Beamformer module.

espnet2.enh.layers.beamformer.apply_beamforming_vector(beamform_vector: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], mix: Union[torch.Tensor, torch_complex.tensor.ComplexTensor]) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]
espnet2.enh.layers.beamformer.blind_analytic_normalization(ws, psd_noise, eps=1e-08)[source]

Blind analytic normalization (BAN) for post-filtering

Parameters
  • ws (torch.complex64/ComplexTensor) – beamformer vector (…, F, C)

  • psd_noise (torch.complex64/ComplexTensor) – noise PSD matrix (…, F, C, C)

  • eps (float) –

Returns

normalized beamformer vector (…, F)

Return type

ws_ban (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.generalized_eigenvalue_decomposition(a: torch.Tensor, b: torch.Tensor, eps=1e-06)[source]

Solves the generalized eigenvalue decomposition through Cholesky decomposition.

ported from https://github.com/asteroid-team/asteroid/blob/master/asteroid/dsp/beamforming.py#L464

a @ e_vec = e_val * b @ e_vec

Cholesky decomposition on b:
    b = L @ L^H, where L is a lower triangular matrix

Let C = L^-1 @ a @ L^-H; it is Hermitian.
    => C @ y = lambda * y
    => e_vec = L^-H @ y

Reference: https://www.netlib.org/lapack/lug/node54.html

Parameters
  • a – A complex Hermitian or real symmetric matrix whose eigenvalues and eigenvectors will be computed. (…, C, C)

  • b – A complex Hermitian or real symmetric definite positive matrix. (…, C, C)

Returns

generalized eigenvalues (ascending order)
e_vec: generalized eigenvectors

Return type

e_val
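
A small numerical illustration of the Cholesky-based reduction described above, written in plain PyTorch (not the library code itself):

    import torch

    C = 4
    a = torch.randn(C, C, dtype=torch.float64)
    a = a + a.T                                           # real symmetric
    b = torch.randn(C, C, dtype=torch.float64)
    b = b @ b.T + C * torch.eye(C, dtype=torch.float64)   # symmetric positive definite

    L = torch.linalg.cholesky(b)                  # b = L @ L^H
    Linv = torch.linalg.inv(L)
    Cmat = Linv @ a @ Linv.T                      # C = L^-1 @ a @ L^-H, Hermitian
    e_val, y = torch.linalg.eigh(Cmat)            # standard problem: C @ y = lambda * y
    e_vec = Linv.T @ y                            # e_vec = L^-H @ y

    # check the generalized eigenvalue relation a @ e_vec = e_val * (b @ e_vec)
    assert torch.allclose(a @ e_vec, (b @ e_vec) * e_val)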

espnet2.enh.layers.beamformer.get_WPD_filter(Phi: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], Rf: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], reference_vector: torch.Tensor, use_torch_solver: bool = True, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

Return the WPD vector.

WPD is the Weighted Power minimization Distortionless response convolutional beamformer. It is computed as follows:

h = (Rf^-1 @ Phi_{xx}) / tr[(Rf^-1) @ Phi_{xx}] @ u

Reference:

T. Nakatani and K. Kinoshita, “A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation,” in IEEE Signal Processing Letters, vol. 26, no. 6, pp. 903-907, June 2019, doi: 10.1109/LSP.2019.2911179. https://ieeexplore.ieee.org/document/8691481

Parameters
  • Phi (torch.complex64/ComplexTensor) – (B, F, (btaps+1) * C, (btaps+1) * C) is the PSD of zero-padded speech [x^T(t,f) 0 … 0]^T.

  • Rf (torch.complex64/ComplexTensor) – (B, F, (btaps+1) * C, (btaps+1) * C) is the power normalized spatio-temporal covariance matrix.

  • reference_vector (torch.Tensor) – (B, (btaps+1) * C) is the reference_vector.

  • use_torch_solver (bool) – Whether to use solve instead of inverse

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns

(B, F, (btaps + 1) * C)

Return type

filter_matrix (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.get_WPD_filter_v2(Phi: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], Rf: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], reference_vector: torch.Tensor, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

Return the WPD vector (v2).

This implementation is more efficient than get_WPD_filter as

it skips unnecessary computation with zeros.

Parameters
  • Phi (torch.complex64/ComplexTensor) – (B, F, C, C) is speech PSD.

  • Rf (torch.complex64/ComplexTensor) – (B, F, (btaps+1) * C, (btaps+1) * C) is the power normalized spatio-temporal covariance matrix.

  • reference_vector (torch.Tensor) – (B, C) is the reference_vector.

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns

(B, F, (btaps+1) * C)

Return type

filter_matrix (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.get_WPD_filter_with_rtf(psd_observed_bar: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_speech: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_noise: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], iterations: int = 3, reference_vector: Union[int, torch.Tensor, None] = None, normalize_ref_channel: Optional[int] = None, use_torch_solver: bool = True, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-15) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

Return the WPD vector calculated with RTF.

WPD is the Weighted Power minimization Distortionless response convolutional beamformer. It is computed as follows:

h = (Rf^-1 @ vbar) / (vbar^H @ R^-1 @ vbar)

Reference:

T. Nakatani and K. Kinoshita, “A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation,” in IEEE Signal Processing Letters, vol. 26, no. 6, pp. 903-907, June 2019, doi: 10.1109/LSP.2019.2911179. https://ieeexplore.ieee.org/document/8691481

Parameters
  • psd_observed_bar (torch.complex64/ComplexTensor) – stacked observation covariance matrix

  • psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)

  • psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)

  • iterations (int) – number of iterations in power method

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • normalize_ref_channel (int) – reference channel for normalizing the RTF

  • use_torch_solver (bool) – Whether to use solve instead of inverse

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns

(…, F, C)

Return type

beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.get_covariances(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], inverse_power: torch.Tensor, bdelay: int, btaps: int, get_vector: bool = False) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]
Calculates the power-normalized spatio-temporal covariance matrix of the framed signal.

Parameters
  • Y – Complex STFT signal with shape (B, F, C, T)

  • inverse_power – Weighting factor with shape (B, F, T)

Returns

Correlation matrix: (B, F, (btaps+1) * C, (btaps+1) * C)
Correlation vector: (B, F, btaps + 1, C, C)

Return type

Correlation matrix

espnet2.enh.layers.beamformer.get_gev_vector(psd_noise: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_speech: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], mode='power', reference_vector: Union[int, torch.Tensor] = 0, iterations: int = 3, use_torch_solver: bool = True, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

Return the generalized eigenvalue (GEV) beamformer vector:

psd_speech @ h = lambda * psd_noise @ h

Reference:

Blind acoustic beamforming based on generalized eigenvalue decomposition; E. Warsitz and R. Haeb-Umbach, 2007.

Parameters
  • psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)

  • psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)

  • mode (str) – one of (“power”, “evd”) “power”: power method “evd”: eigenvalue decomposition (only for torch builtin complex tensors)

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • iterations (int) – number of iterations in power method

  • use_torch_solver (bool) – Whether to use solve instead of inverse

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns

(…, F, C)

Return type

beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.get_lcmv_vector_with_rtf(psd_n: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], rtf_mat: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], reference_vector: Union[int, torch.Tensor, None] = None, use_torch_solver: bool = True, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]
Return the LCMV (Linearly Constrained Minimum Variance) vector calculated with RTF:

h = (Npsd^-1 @ rtf_mat) @ (rtf_mat^H @ Npsd^-1 @ rtf_mat)^-1 @ p

Reference:

H. L. Van Trees, “Optimum array processing: Part IV of detection, estimation, and modulation theory,” John Wiley & Sons, 2004. (Chapter 6.7)

Parameters
  • psd_n (torch.complex64/ComplexTensor) – observation/noise covariance matrix (…, F, C, C)

  • rtf_mat (torch.complex64/ComplexTensor) – RTF matrix (…, F, C, num_spk)

  • reference_vector (torch.Tensor or int) – (…, num_spk) or scalar

  • use_torch_solver (bool) – Whether to use solve instead of inverse

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns

(…, F, C)

Return type

beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.get_mvdr_vector(psd_s, psd_n, reference_vector: torch.Tensor, use_torch_solver: bool = True, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]

Return the MVDR (Minimum Variance Distortionless Response) vector:

h = (Npsd^-1 @ Spsd) / (Tr(Npsd^-1 @ Spsd)) @ u

Reference:

On optimal frequency-domain multichannel linear filtering for noise reduction; M. Souden et al., 2010; https://ieeexplore.ieee.org/document/5089420

Parameters
  • psd_s (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)

  • psd_n (torch.complex64/ComplexTensor) – observation/noise covariance matrix (…, F, C, C)

  • reference_vector (torch.Tensor) – (…, C)

  • use_torch_solver (bool) – Whether to use solve instead of inverse

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns

(…, F, C)

Return type

beamform_vector (torch.complex64/ComplexTensor)
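
A hedged end-to-end sketch of mask-based MVDR beamforming with the helpers in this module (the masks below are random placeholders; in practice they come from a mask estimator):

    import torch
    from espnet2.enh.layers.beamformer import (
        apply_beamforming_vector,
        get_mvdr_vector,
        get_power_spectral_density_matrix,
    )

    B, F, C, T = 2, 257, 4, 100
    obs = torch.randn(B, F, C, T, dtype=torch.complex64)   # observed STFT (..., F, C, T)
    mask_speech = torch.rand(B, F, C, T)                    # placeholder speech mask
    mask_noise = 1.0 - mask_speech                          # placeholder noise mask

    psd_s = get_power_spectral_density_matrix(obs, mask_speech)  # (..., F, C, C)
    psd_n = get_power_spectral_density_matrix(obs, mask_noise)   # (..., F, C, C)
    u = torch.zeros(B, C)
    u[:, 0] = 1.0                                           # one-hot reference channel
    ws = get_mvdr_vector(psd_s, psd_n, u)                   # (..., F, C)
    enhanced = apply_beamforming_vector(ws, obs)            # (..., F, T)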

espnet2.enh.layers.beamformer.get_mvdr_vector_with_rtf(psd_n: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_speech: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_noise: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], iterations: int = 3, reference_vector: Union[int, torch.Tensor, None] = None, normalize_ref_channel: Optional[int] = None, use_torch_solver: bool = True, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]
Return the MVDR (Minimum Variance Distortionless Response) vector calculated with RTF:

h = (Npsd^-1 @ rtf) / (rtf^H @ Npsd^-1 @ rtf)

Reference:

On optimal frequency-domain multichannel linear filtering for noise reduction; M. Souden et al., 2010; https://ieeexplore.ieee.org/document/5089420

Parameters
  • psd_n (torch.complex64/ComplexTensor) – observation/noise covariance matrix (…, F, C, C)

  • psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)

  • psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)

  • iterations (int) – number of iterations in power method

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • normalize_ref_channel (int) – reference channel for normalizing the RTF

  • use_torch_solver (bool) – Whether to use solve instead of inverse

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns

(…, F, C)

Return type

beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.get_mwf_vector(psd_s, psd_n, reference_vector: Union[torch.Tensor, int], use_torch_solver: bool = True, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]

Return the MWF (Multi-channel Wiener Filter) vector:

h = (Npsd^-1 @ Spsd) @ u

Parameters
  • psd_s (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)

  • psd_n (torch.complex64/ComplexTensor) – power-normalized observation covariance matrix (…, F, C, C)

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • use_torch_solver (bool) – Whether to use solve instead of inverse

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns

(…, F, C)

Return type

beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.get_power_spectral_density_matrix(xs, mask, normalization=True, reduction='mean', eps: float = 1e-15)[source]

Return cross-channel power spectral density (PSD) matrix

Parameters
  • xs (torch.complex64/ComplexTensor) – (…, F, C, T)

  • reduction (str) – “mean” or “median”

  • mask (torch.Tensor) – (…, F, C, T)

  • normalization (bool) –

  • eps (float) –

Returns

psd (torch.complex64/ComplexTensor): (…, F, C, C)

espnet2.enh.layers.beamformer.get_rank1_mwf_vector(psd_speech, psd_noise, reference_vector: Union[torch.Tensor, int], denoising_weight: float = 1.0, approx_low_rank_psd_speech: bool = False, iterations: int = 3, use_torch_solver: bool = True, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]

Return the R1-MWF (Rank-1 Multi-channel Wiener Filter) vector

h = (Npsd^-1 @ Spsd) / (mu + Tr(Npsd^-1 @ Spsd)) @ u

Reference:

[1] Rank-1 constrained multichannel Wiener filter for speech recognition in noisy environments; Z. Wang et al., 2018; https://hal.inria.fr/hal-01634449/document
[2] Low-rank approximation based multichannel Wiener filter algorithms for noise reduction with application in cochlear implants; R. Serizel, 2014; https://ieeexplore.ieee.org/document/6730918

Parameters
  • psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)

  • psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • denoising_weight (float) – a trade-off parameter between noise reduction and speech distortion. A larger value leads to more noise reduction at the expense of more speech distortion. When denoising_weight = 0, it corresponds to MVDR beamformer.

  • approx_low_rank_psd_speech (bool) – whether to replace original input psd_speech with its low-rank approximation as in [1]

  • iterations (int) – number of iterations in power method, only used when approx_low_rank_psd_speech = True

  • use_torch_solver (bool) – Whether to use solve instead of inverse

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns

(…, F, C)

Return type

beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.get_rtf(psd_speech, psd_noise, mode='power', reference_vector: Union[int, torch.Tensor] = 0, iterations: int = 3, use_torch_solver: bool = True)[source]

Calculate the relative transfer function (RTF)

Algorithm of power method:
  1. rtf = reference_vector

  2. for i in range(iterations):

    rtf = (psd_noise^-1 @ psd_speech) @ rtf
    rtf = rtf / ||rtf||_2  # this normalization can be skipped

  3. rtf = psd_noise @ rtf

  4. rtf = rtf / rtf[…, ref_channel, :]

Note: step 4 (normalization at the reference channel) is not performed here.

Parameters
  • psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)

  • psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)

  • mode (str) – one of (“power”, “evd”) “power”: power method “evd”: eigenvalue decomposition

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • iterations (int) – number of iterations in power method

  • use_torch_solver (bool) – Whether to use solve instead of inverse

Returns

(…, F, C, 1)

Return type

rtf (torch.complex64/ComplexTensor)
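
The power method above, written out as a plain PyTorch sketch for native complex tensors (illustration only; not the library implementation):

    import torch

    def rtf_power_method(psd_speech, psd_noise, iterations=3, ref_channel=0):
        """psd_speech, psd_noise: (..., C, C) complex covariance matrices."""
        # 1) start from a one-hot reference vector
        rtf = torch.zeros(psd_speech.shape[:-1] + (1,), dtype=psd_speech.dtype)
        rtf[..., ref_channel, 0] = 1.0
        phi = torch.linalg.solve(psd_noise, psd_speech)   # psd_noise^-1 @ psd_speech
        # 2) power iterations
        for _ in range(iterations):
            rtf = phi @ rtf
            rtf = rtf / rtf.norm(dim=-2, keepdim=True)    # optional normalization
        # 3) map back through the noise covariance
        return psd_noise @ rtf                            # (..., C, 1)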

espnet2.enh.layers.beamformer.get_rtf_matrix(psd_speeches, psd_noises, diagonal_loading: bool = True, ref_channel: int = 0, rtf_iterations: int = 3, use_torch_solver: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]

Calculate the RTF matrix, with each column being the relative transfer function of the corresponding source.

espnet2.enh.layers.beamformer.get_sdw_mwf_vector(psd_speech, psd_noise, reference_vector: Union[torch.Tensor, int], denoising_weight: float = 1.0, approx_low_rank_psd_speech: bool = False, iterations: int = 3, use_torch_solver: bool = True, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]

Return the SDW-MWF (Speech Distortion Weighted Multi-channel Wiener Filter) vector

h = (Spsd + mu * Npsd)^-1 @ Spsd @ u

Reference:

[1] Spatially pre-processed speech distortion weighted multi-channel Wiener filtering for noise reduction; A. Spriet et al., 2004; https://dl.acm.org/doi/abs/10.1016/j.sigpro.2004.07.028
[2] Rank-1 constrained multichannel Wiener filter for speech recognition in noisy environments; Z. Wang et al., 2018; https://hal.inria.fr/hal-01634449/document
[3] Low-rank approximation based multichannel Wiener filter algorithms for noise reduction with application in cochlear implants; R. Serizel, 2014; https://ieeexplore.ieee.org/document/6730918

Parameters
  • psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)

  • psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)

  • reference_vector (torch.Tensor or int) – (…, C) or scalar

  • denoising_weight (float) – a trade-off parameter between noise reduction and speech distortion. A larger value leads to more noise reduction at the expense of more speech distortion. The plain MWF is obtained with denoising_weight = 1 (by default).

  • approx_low_rank_psd_speech (bool) – whether to replace original input psd_speech with its low-rank approximation as in [2]

  • iterations (int) – number of iterations in power method, only used when approx_low_rank_psd_speech = True

  • use_torch_solver (bool) – Whether to use solve instead of inverse

  • diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n

  • diag_eps (float) –

  • eps (float) –

Returns

(…, F, C)

Return type

beamform_vector (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.gev_phase_correction(vector)[source]

Phase correction to reduce distortions due to phase inconsistencies.

ported from https://github.com/fgnt/nn-gev/blob/master/fgnt/beamforming.py#L169

Parameters

vector – Beamforming vector with shape (…, F, C)

Returns

Phase corrected beamforming vectors

Return type

w

espnet2.enh.layers.beamformer.perform_WPD_filtering(filter_matrix: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], bdelay: int, btaps: int) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

Perform WPD filtering.

Parameters
  • filter_matrix – Filter matrix (B, F, (btaps + 1) * C)

  • Y – Complex STFT signal with shape (B, F, C, T)

Returns

(B, F, T)

Return type

enhanced (torch.complex64/ComplexTensor)

espnet2.enh.layers.beamformer.prepare_beamformer_stats(signal, masks_speech, mask_noise, powers=None, beamformer_type='mvdr', bdelay=3, btaps=5, eps=1e-06)[source]

Prepare necessary statistics for constructing the specified beamformer.

Parameters
  • signal (torch.complex64/ComplexTensor) – (…, F, C, T)

  • masks_speech (List[torch.Tensor]) – (…, F, C, T) masks for all speech sources

  • mask_noise (torch.Tensor) – (…, F, C, T) noise mask

  • powers (List[torch.Tensor]) – powers for all speech sources (…, F, T) used for wMPDR or WPD beamformers

  • beamformer_type (str) – one of the pre-defined beamformer types

  • bdelay (int) – delay factor, used for the WPD beamformer

  • btaps (int) – number of filter taps, used for the WPD beamformer

  • eps (float) – tiny constant

Returns

A dictionary containing all necessary statistics, e.g. “psd_n”, “psd_speech”, “psd_distortion”.

Note:
  • When masks_speech is a tensor or a single-element list, all returned statistics are tensors;

  • When masks_speech is a multi-element list, some returned statistics can be a list, e.g., “psd_n” for MVDR, “psd_speech” and “psd_distortion”.

Return type

beamformer_stats (dict)

espnet2.enh.layers.beamformer.signal_framing(signal: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], frame_length: int, frame_step: int, bdelay: int, do_padding: bool = False, pad_value: int = 0, indices: List = None) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

Expand signal into several frames, with each frame of length frame_length.

Parameters
  • signal – (…, T)

  • frame_length – length of each segment

  • frame_step – step for selecting frames

  • bdelay – delay for WPD

  • do_padding – whether or not to pad the input signal at the beginning of the time dimension

  • pad_value – value to fill in the padding

Returns

if do_padding: (…, T, frame_length) else: (…, T - bdelay - frame_length + 2, frame_length)

Return type

torch.Tensor

espnet2.enh.layers.beamformer.tik_reg(mat, reg: float = 1e-08, eps: float = 1e-08)[source]

Perform Tikhonov regularization (only modifying real part).

Parameters
  • mat (torch.complex64/ComplexTensor) – input matrix (…, C, C)

  • reg (float) – regularization factor

  • eps (float) –

Returns

regularized matrix (…, C, C)

Return type

ret (torch.complex64/ComplexTensor)

espnet2.enh.layers.complex_utils

Complex tensor utilities shared by the beamformer modules.

espnet2.enh.layers.complex_utils.cat(seq: Sequence[Union[torch_complex.tensor.ComplexTensor, torch.Tensor]], *args, **kwargs)[source]
espnet2.enh.layers.complex_utils.complex_norm(c: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], dim=-1, keepdim=False) → torch.Tensor[source]
espnet2.enh.layers.complex_utils.einsum(equation, *operands)[source]
espnet2.enh.layers.complex_utils.inverse(c: Union[torch.Tensor, torch_complex.tensor.ComplexTensor]) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]
espnet2.enh.layers.complex_utils.is_complex(c)[source]
espnet2.enh.layers.complex_utils.is_torch_complex_tensor(c)[source]
espnet2.enh.layers.complex_utils.matmul(a: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], b: Union[torch.Tensor, torch_complex.tensor.ComplexTensor]) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]
espnet2.enh.layers.complex_utils.new_complex_like(ref: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], real_imag: Tuple[torch.Tensor, torch.Tensor])[source]
espnet2.enh.layers.complex_utils.reverse(a: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], dim=0)[source]
espnet2.enh.layers.complex_utils.solve(b: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], a: Union[torch.Tensor, torch_complex.tensor.ComplexTensor])[source]

Solve the linear equation ax = b.
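
Note the argument order (b first, then a). A quick sketch assuming native complex tensors:

    import torch
    from espnet2.enh.layers.complex_utils import matmul, solve

    a = torch.randn(3, 4, 4, dtype=torch.complex64)  # batch of coefficient matrices
    b = torch.randn(3, 4, 1, dtype=torch.complex64)  # right-hand sides
    x = solve(b, a)                                  # x such that a @ x = b
    assert torch.allclose(matmul(a, x), b, atol=1e-4)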

espnet2.enh.layers.complex_utils.stack(seq: Sequence[Union[torch_complex.tensor.ComplexTensor, torch.Tensor]], *args, **kwargs)[source]
espnet2.enh.layers.complex_utils.to_double(c)[source]
espnet2.enh.layers.complex_utils.to_float(c)[source]
espnet2.enh.layers.complex_utils.trace(a: Union[torch.Tensor, torch_complex.tensor.ComplexTensor])[source]

espnet2.enh.layers.complexnn

class espnet2.enh.layers.complexnn.ComplexBatchNorm(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, complex_axis=1)[source]

Bases: torch.nn.modules.module.Module

extra_repr()[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(inputs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

reset_parameters()[source]
reset_running_stats()[source]
class espnet2.enh.layers.complexnn.ComplexConv2d(in_channels, out_channels, kernel_size=(1, 1), stride=(1, 1), padding=(0, 0), dilation=1, groups=1, causal=True, complex_axis=1)[source]

Bases: torch.nn.modules.module.Module

ComplexConv2d.

in_channels: real+imag
out_channels: real+imag
kernel_size: kernel size in [D, T] for input [B, C, D, T]
padding: padding in [D, T] for input [B, C, D, T]
causal: if causal, pad only the left side of the time dimension; otherwise pad both sides

forward(inputs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.enh.layers.complexnn.ComplexConvTranspose2d(in_channels, out_channels, kernel_size=(1, 1), stride=(1, 1), padding=(0, 0), output_padding=(0, 0), causal=False, complex_axis=1, groups=1)[source]

Bases: torch.nn.modules.module.Module

ComplexConvTranspose2d.

in_channels: real+imag
out_channels: real+imag

forward(inputs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.enh.layers.complexnn.NavieComplexLSTM(input_size, hidden_size, projection_dim=None, bidirectional=False, batch_first=False)[source]

Bases: torch.nn.modules.module.Module

flatten_parameters()[source]
forward(inputs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.enh.layers.complexnn.complex_cat(inputs, axis)[source]

espnet2.enh.layers.dnn_beamformer

DNN beamformer module.

class espnet2.enh.layers.dnn_beamformer.AttentionReference(bidim, att_dim, eps=1e-06)[source]

Bases: torch.nn.modules.module.Module

forward(psd_in: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor, scaling: float = 2.0) → Tuple[torch.Tensor, torch.LongTensor][source]

Attention-based reference forward function.

Parameters
  • psd_in (torch.complex64/ComplexTensor) – (B, F, C, C)

  • ilens (torch.Tensor) – (B,)

  • scaling (float) –

Returns

(B, C)
ilens (torch.Tensor): (B,)

Return type

u (torch.Tensor)

class espnet2.enh.layers.dnn_beamformer.DNN_Beamformer(bidim, btype: str = 'blstmp', blayers: int = 3, bunits: int = 300, bprojs: int = 320, num_spk: int = 1, use_noise_mask: bool = True, nonlinear: str = 'sigmoid', dropout_rate: float = 0.0, badim: int = 320, ref_channel: int = -1, beamformer_type: str = 'mvdr_souden', rtf_iterations: int = 2, mwf_mu: float = 1.0, eps: float = 1e-06, diagonal_loading: bool = True, diag_eps: float = 1e-07, mask_flooring: bool = False, flooring_thres: float = 1e-06, use_torch_solver: bool = True, btaps: int = 5, bdelay: int = 3)[source]

Bases: torch.nn.modules.module.Module

DNN mask based Beamformer.

Citation:

Multichannel End-to-end Speech Recognition; T. Ochiai et al., 2017; http://proceedings.mlr.press/v70/ochiai17a/ochiai17a.pdf

apply_beamforming(data, ilens, psd_n, psd_speech, psd_distortion=None, rtf_mat=None, spk=0)[source]

Beamforming with the provided statistics.

Parameters
  • data (torch.complex64/ComplexTensor) – (B, F, C, T)

  • ilens (torch.Tensor) – (B,)

  • psd_n (torch.complex64/ComplexTensor) – Noise covariance matrix for MVDR (B, F, C, C) Observation covariance matrix for MPDR/wMPDR (B, F, C, C) Stacked observation covariance for WPD (B,F,(btaps+1)*C,(btaps+1)*C)

  • psd_speech (torch.complex64/ComplexTensor) – Speech covariance matrix (B, F, C, C)

  • psd_distortion (torch.complex64/ComplexTensor) – Noise covariance matrix (B, F, C, C)

  • rtf_mat (torch.complex64/ComplexTensor) – RTF matrix (B, F, C, num_spk)

  • spk (int) – speaker index

Returns

(B, F, T)
ws (torch.complex64/ComplexTensor): (B, F) or (B, F, (btaps+1)*C)

Return type

enhanced (torch.complex64/ComplexTensor)

forward(data: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor, powers: Optional[List[torch.Tensor]] = None, oracle_masks: Optional[List[torch.Tensor]] = None) → Tuple[Union[torch.Tensor, torch_complex.tensor.ComplexTensor], torch.LongTensor, torch.Tensor][source]

DNN_Beamformer forward function.

Notation:

B: Batch; C: Channel; T: Time or Sequence length; F: Freq

Parameters
  • data (torch.complex64/ComplexTensor) – (B, T, C, F)

  • ilens (torch.Tensor) – (B,)

  • powers (List[torch.Tensor] or None) – used for wMPDR or WPD (B, F, T)

  • oracle_masks (List[torch.Tensor] or None) – oracle masks (B, F, C, T) if not None, oracle_masks will be used instead of self.mask

Returns

(B, T, F)
ilens (torch.Tensor): (B,)
masks (torch.Tensor): (B, T, C, F)

Return type

enhanced (torch.complex64/ComplexTensor)

predict_mask(data: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor) → Tuple[Tuple[torch.Tensor, ...], torch.LongTensor][source]

Predict masks for beamforming.

Parameters
  • data (torch.complex64/ComplexTensor) – (B, T, C, F), double precision

  • ilens (torch.Tensor) – (B,)

Returns

(B, T, C, F)
ilens (torch.Tensor): (B,)

Return type

masks (torch.Tensor)

espnet2.enh.layers.dnn_wpe

class espnet2.enh.layers.dnn_wpe.DNN_WPE(wtype: str = 'blstmp', widim: int = 257, wlayers: int = 3, wunits: int = 300, wprojs: int = 320, dropout_rate: float = 0.0, taps: int = 5, delay: int = 3, use_dnn_mask: bool = True, nmask: int = 1, nonlinear: str = 'sigmoid', iterations: int = 1, normalization: bool = False, eps: float = 1e-06, diagonal_loading: bool = True, diag_eps: float = 1e-07, mask_flooring: bool = False, flooring_thres: float = 1e-06, use_torch_solver: bool = True)[source]

Bases: torch.nn.modules.module.Module

forward(data: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor) → Tuple[Union[torch.Tensor, torch_complex.tensor.ComplexTensor], torch.LongTensor, Union[torch.Tensor, torch_complex.tensor.ComplexTensor]][source]

DNN_WPE forward function.

Notation:

B: Batch; C: Channel; T: Time or Sequence length; F: Freq or some dimension of the feature vector

Parameters
  • data – (B, T, C, F)

  • ilens – (B,)

Returns

(B, T, C, F)
ilens: (B,)
masks (torch.Tensor or List[torch.Tensor]): (B, T, C, F)
power (List[torch.Tensor]): (B, F, T)

Return type

enhanced (torch.Tensor or List[torch.Tensor])

predict_mask(data: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor) → Tuple[torch.Tensor, torch.LongTensor][source]

Predict mask for WPE dereverberation.

Parameters
  • data (torch.complex64/ComplexTensor) – (B, T, C, F), double precision

  • ilens (torch.Tensor) – (B,)

Returns

(B, T, C, F)
ilens (torch.Tensor): (B,)

Return type

masks (torch.Tensor or List[torch.Tensor])

espnet2.enh.layers.dprnn

class espnet2.enh.layers.dprnn.DPRNN(rnn_type, input_size, hidden_size, output_size, dropout=0, num_layers=1, bidirectional=True)[source]

Bases: torch.nn.modules.module.Module

Deep dual-path RNN.

Parameters
  • rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.

  • input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).

  • hidden_size – int, dimension of the hidden state.

  • output_size – int, dimension of the output size.

  • dropout – float, dropout ratio. Default is 0.

  • num_layers – int, number of stacked RNN layers. Default is 1.

  • bidirectional – bool, whether the RNN layers are bidirectional. Default is True.

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.enh.layers.dprnn.SingleRNN(rnn_type, input_size, hidden_size, dropout=0, bidirectional=False)[source]

Bases: torch.nn.modules.module.Module

Container module for a single RNN layer.

Parameters
  • rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.

  • input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).

  • hidden_size – int, dimension of the hidden state.

  • dropout – float, dropout ratio. Default is 0.

  • bidirectional – bool, whether the RNN layers are bidirectional. Default is False.

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

espnet2.enh.layers.dprnn.merge_feature(input, rest)[source]
espnet2.enh.layers.dprnn.split_feature(input, segment_size)[source]

espnet2.enh.layers.mask_estimator

class espnet2.enh.layers.mask_estimator.MaskEstimator(type, idim, layers, units, projs, dropout, nmask=1, nonlinear='sigmoid')[source]

Bases: torch.nn.modules.module.Module

forward(xs: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor) → Tuple[Tuple[torch.Tensor, ...], torch.LongTensor][source]

Mask estimator forward function.

Parameters
  • xs – (B, F, C, T)

  • ilens – (B,)

Returns

The hidden vector (B, F, C, T)
masks: A tuple of the masks. (B, F, C, T)
ilens: (B,)

Return type

hs (torch.Tensor)

espnet2.enh.layers.skim

class espnet2.enh.layers.skim.MemLSTM(hidden_size, dropout=0.0, bidirectional=False, mem_type='hc', norm_type='cLN')[source]

Bases: torch.nn.modules.module.Module

the Mem-LSTM of SkiM

Parameters
  • hidden_size – int, dimension of the hidden state.

  • dropout – float, dropout ratio. Default is 0.

  • bidirectional – bool, whether the LSTM layers are bidirectional. Default is False.

  • mem_type – ‘hc’, ‘h’, ‘c’ or ‘id’. It controls whether the hidden (or cell) state of SegLSTM will be processed by MemLSTM. In ‘id’ mode, both the hidden and cell states will be identically returned.

  • norm_type – gLN, cLN. cLN is for causal implementation.

extra_repr() → str[source]

Set the extra representation of the module

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(hc, S)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.enh.layers.skim.SegLSTM(input_size, hidden_size, dropout=0.0, bidirectional=False, norm_type='cLN')[source]

Bases: torch.nn.modules.module.Module

the Seg-LSTM of SkiM

Parameters
  • input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).

  • hidden_size – int, dimension of the hidden state.

  • dropout – float, dropout ratio. Default is 0.

  • bidirectional – bool, whether the LSTM layers are bidirectional. Default is False.

  • norm_type – gLN, cLN. cLN is for causal implementation.

forward(input, hc)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class espnet2.enh.layers.skim.SkiM(input_size, hidden_size, output_size, dropout=0.0, num_blocks=2, segment_size=20, bidirectional=True, mem_type='hc', norm_type='gLN', seg_overlap=False)[source]

Bases: torch.nn.modules.module.Module

Skipping Memory Net

Parameters
  • input_size – int, dimension of the input feature. Input shape should be (batch, length, input_size)

  • hidden_size – int, dimension of the hidden state.

  • output_size – int, dimension of the output size.

  • dropout – float, dropout ratio. Default is 0.

  • num_blocks – number of basic SkiM blocks

  • segment_size – segmentation size for splitting long features

  • bidirectional – bool, whether the RNN layers are bidirectional.

  • mem_type – ‘hc’, ‘h’, ‘c’, ‘id’ or None. It controls whether the hidden (or cell) state of SegLSTM will be processed by MemLSTM. In ‘id’ mode, both the hidden and cell states will be identically returned. When mem_type is None, the MemLSTM will be removed.

  • norm_type – gLN, cLN. cLN is for causal implementation.

  • seg_overlap – bool, whether the segmentation will reserve 50% overlap for adjacent segments. Default is False.

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
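
A short usage sketch (sizes are illustrative; the output shape is assumed to mirror the input length with output_size features):

    import torch
    from espnet2.enh.layers.skim import SkiM

    model = SkiM(input_size=64, hidden_size=128, output_size=64,
                 num_blocks=2, segment_size=20, mem_type='hc')
    x = torch.randn(4, 300, 64)   # (batch, length, input_size)
    y = model(x)                  # assumed (batch, length, output_size)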

espnet2.enh.layers.tcn

class espnet2.enh.layers.tcn.ChannelwiseLayerNorm(channel_size, shape='BDT')[source]

Bases: torch.nn.modules.module.Module

Channel-wise Layer Normalization (cLN).

forward(y)[source]

Forward.

Parameters

y – [M, N, K], M is batch size, N is channel size, K is length

Returns

[M, N, K]

Return type

cLN_y

reset_parameters()[source]
class espnet2.enh.layers.tcn.Chomp1d(chomp_size)[source]

Bases: torch.nn.modules.module.Module

To ensure the output length is the same as the input.

forward(x)[source]

Forward.

Parameters

x – [M, H, Kpad]

Returns

[M, H, K]

class espnet2.enh.layers.tcn.DepthwiseSeparableConv(in_channels, out_channels, kernel_size, stride, padding, dilation, norm_type='gLN', causal=False)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Forward.

Parameters

x – [M, H, K]

Returns

[M, B, K]

Return type

result

class espnet2.enh.layers.tcn.GlobalLayerNorm(channel_size, shape='BDT')[source]

Bases: torch.nn.modules.module.Module

Global Layer Normalization (gLN).

forward(y)[source]

Forward.

Parameters

y – [M, N, K], M is batch size, N is channel size, K is length

Returns

[M, N, K]

Return type

gLN_y

reset_parameters()[source]
class espnet2.enh.layers.tcn.TemporalBlock(in_channels, out_channels, kernel_size, stride, padding, dilation, norm_type='gLN', causal=False)[source]

Bases: torch.nn.modules.module.Module

forward(x)[source]

Forward.

Parameters

x – [M, B, K]

Returns

[M, B, K]

class espnet2.enh.layers.tcn.TemporalConvNet(N, B, H, P, X, R, C, norm_type='gLN', causal=False, mask_nonlinear='relu')[source]

Bases: torch.nn.modules.module.Module

Basic module of TasNet.

Parameters
  • N – Number of filters in autoencoder

  • B – Number of channels in bottleneck 1 * 1-conv block

  • H – Number of channels in convolutional blocks

  • P – Kernel size in convolutional blocks

  • X – Number of convolutional blocks in each repeat

  • R – Number of repeats

  • C – Number of speakers

  • norm_type – BN, gLN, cLN

  • causal – causal or non-causal

  • mask_nonlinear – which non-linear function to use to generate the mask

forward(mixture_w)[source]

Keep this API the same as TasNet.

Parameters

mixture_w – [M, N, K], M is batch size

Returns

[M, C, N, K]

Return type

est_mask
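
A small sketch with illustrative hyperparameters following the naming above:

    import torch
    from espnet2.enh.layers.tcn import TemporalConvNet

    tcn = TemporalConvNet(N=256, B=64, H=128, P=3, X=8, R=3, C=2)
    mixture_w = torch.randn(4, 256, 500)   # [M, N, K], M is batch size
    est_mask = tcn(mixture_w)              # [M, C, N, K] -> (4, 2, 256, 500)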

espnet2.enh.layers.tcn.check_nonlinear(nolinear_type)[source]
espnet2.enh.layers.tcn.choose_norm(norm_type, channel_size, shape='BDT')[source]

The input of normalization will be (M, C, K), where M is batch size, C is channel size, and K is sequence length.
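
For example (norm types as listed for TemporalConvNet above):

    import torch
    from espnet2.enh.layers.tcn import choose_norm

    norm = choose_norm('gLN', channel_size=256)  # also 'cLN' or 'BN'
    x = torch.randn(4, 256, 500)                 # (M, C, K)
    y = norm(x)                                  # normalized, same shape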

espnet2.enh.layers.wpe

espnet2.enh.layers.wpe.get_correlations(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], inverse_power: torch.Tensor, taps, delay) → Tuple[Union[torch.Tensor, torch_complex.tensor.ComplexTensor], Union[torch.Tensor, torch_complex.tensor.ComplexTensor]][source]

Calculates weighted correlations of a window of length taps

Parameters
  • Y – Complex-valued STFT signal with shape (F, C, T)

  • inverse_power – Weighting factor with shape (F, T)

  • taps (int) – Length of the correlation window

  • delay (int) – Delay for the weighting factor

Returns

Correlation matrix of shape (F, taps*C, taps*C)
Correlation vector of shape (F, taps, C, C)

espnet2.enh.layers.wpe.get_filter_matrix_conj(correlation_matrix: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], correlation_vector: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], eps: float = 1e-10) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

Calculate (conjugate) filter matrix based on correlations for one freq.

Parameters
  • correlation_matrix – Correlation matrix (F, taps * C, taps * C)

  • correlation_vector – Correlation vector (F, taps, C, C)

  • eps

Returns

(F, taps, C, C)

Return type

filter_matrix_conj (torch.complex/ComplexTensor)

espnet2.enh.layers.wpe.get_power(signal, dim=-2) → torch.Tensor[source]

Calculates power for signal

Parameters
  • signal – Single frequency signal with shape (F, C, T).

  • dim – reduce_mean dimension

Returns

Power with shape (F, T)

espnet2.enh.layers.wpe.is_torch_1_9_plus = True

WPE pytorch version, ported from https://github.com/fgnt/nara_wpe. Many functions are not tested enough.

espnet2.enh.layers.wpe.perform_filter_operation(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], filter_matrix_conj: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], taps, delay) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]
Parameters
  • Y – Complex-valued STFT signal of shape (F, C, T)

  • filter_matrix_conj – conjugate filter matrix (see get_filter_matrix_conj)

espnet2.enh.layers.wpe.signal_framing(signal: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], frame_length: int, frame_step: int, pad_value=0) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

Expands signal into frames of frame_length.

Parameters

signal – (B * F, D, T)

Returns

(B * F, D, T, W)

Return type

torch.Tensor

espnet2.enh.layers.wpe.wpe(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], taps=10, delay=3, iterations=3) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

WPE

Parameters
  • Y – Complex valued STFT signal with shape (F, C, T)

  • taps – Number of filter taps

  • delay – Delay as a guard interval, such that X does not become zero.

  • iterations

Returns

(F, C, T)

Return type

enhanced
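
A minimal usage sketch, assuming a PyTorch version with native complex support (see is_torch_1_9_plus above):

    import torch
    from espnet2.enh.layers.wpe import wpe

    F, C, T = 257, 4, 200
    Y = torch.randn(F, C, T, dtype=torch.complex64)     # complex STFT (F, C, T)
    enhanced = wpe(Y, taps=10, delay=3, iterations=3)   # dereverberated STFT (F, C, T)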

espnet2.enh.layers.wpe.wpe_one_iteration(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], power: torch.Tensor, taps: int = 10, delay: int = 3, eps: float = 1e-10, inverse_power: bool = True) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]

WPE for one iteration

Parameters
  • Y – Complex valued STFT signal with shape (…, C, T)

  • power – (…, T)

  • taps – Number of filter taps

  • delay – Delay as a guard interval, such that X does not become zero.

  • eps

  • inverse_power (bool) –

Returns

(…, C, T)

Return type

enhanced

espnet2.enh.loss.__init__

espnet2.enh.loss.criterions.__init__

espnet2.enh.loss.criterions.abs_loss

class espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(ref, inf) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

property name

espnet2.enh.loss.criterions.tf_domain

class espnet2.enh.loss.criterions.tf_domain.FrequencyDomainL1(compute_on_mask=False, mask_type='IBM')[source]

Bases: espnet2.enh.loss.criterions.tf_domain.FrequencyDomainLoss

property compute_on_mask
forward(ref, inf) → torch.Tensor[source]

time-frequency L1 loss.

Parameters
  • ref – (Batch, T, F) or (Batch, T, C, F)

  • inf – (Batch, T, F) or (Batch, T, C, F)

Returns

(Batch,)

Return type

loss

property mask_type
property name
class espnet2.enh.loss.criterions.tf_domain.FrequencyDomainLoss[source]

Bases: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract property compute_on_mask
create_mask_label(mix_spec, ref_spec)[source]
abstract property mask_type
class espnet2.enh.loss.criterions.tf_domain.FrequencyDomainMSE(compute_on_mask=False, mask_type='IBM')[source]

Bases: espnet2.enh.loss.criterions.tf_domain.FrequencyDomainLoss

property compute_on_mask
forward(ref, inf) → torch.Tensor[source]

time-frequency MSE loss.

Parameters
  • ref – (Batch, T, F) or (Batch, T, C, F)

  • inf – (Batch, T, F) or (Batch, T, C, F)

Returns

(Batch,)

Return type

loss

property mask_type
property name
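
A minimal sketch of the spectrum case (compute_on_mask=False), with random tensors standing in for reference and estimated spectra:

    import torch
    from espnet2.enh.loss.criterions.tf_domain import FrequencyDomainMSE

    criterion = FrequencyDomainMSE(compute_on_mask=False)
    ref = torch.randn(4, 100, 257)   # (Batch, T, F)
    inf = torch.randn(4, 100, 257)   # (Batch, T, F)
    loss = criterion(ref, inf)       # (Batch,)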

espnet2.enh.loss.criterions.time_domain

class espnet2.enh.loss.criterions.time_domain.CISDRLoss(filter_length=512)[source]

Bases: espnet2.enh.loss.criterions.time_domain.TimeDomainLoss

CI-SDR loss

Reference:

Convolutive Transfer Function Invariant SDR Training Criteria for Multi-Channel Reverberant Speech Separation; C. Boeddeker et al., 2021; https://arxiv.org/abs/2011.15003

Parameters
  • ref – (Batch, samples)

  • inf – (Batch, samples)

  • filter_length (int) – length of the time-invariant filter that allows slight distortion via filtering

Returns

(Batch,)

Return type

loss

forward(ref: torch.Tensor, inf: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

property name
class espnet2.enh.loss.criterions.time_domain.SISNRLoss(eps=1.1920928955078125e-07)[source]

Bases: espnet2.enh.loss.criterions.time_domain.TimeDomainLoss

forward(ref: torch.Tensor, inf: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

property name
class espnet2.enh.loss.criterions.time_domain.SNRLoss(eps=1.1920928955078125e-07)[source]

Bases: espnet2.enh.loss.criterions.time_domain.TimeDomainLoss

forward(ref: torch.Tensor, inf: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

property name
class espnet2.enh.loss.criterions.time_domain.TimeDomainLoss[source]

Bases: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

espnet2.enh.loss.wrappers.__init__

espnet2.enh.loss.wrappers.abs_wrapper

class espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(ref: List, inf: List, others: Dict) → Tuple[torch.Tensor, Dict, Dict][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

weight = 1.0

espnet2.enh.loss.wrappers.fixed_order

class espnet2.enh.loss.wrappers.fixed_order.FixedOrderSolver(criterion: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, weight=1.0)[source]

Bases: espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper

forward(ref, inf, others={})[source]

A naive fixed-order solver.

Parameters
  • ref (List[torch.Tensor]) – [(batch, …), …] x n_spk

  • inf (List[torch.Tensor]) – [(batch, …), …]

Returns

(torch.Tensor): minimum loss with the best permutation
stats: dict, for collecting training status
others: reserved

Return type

loss

espnet2.enh.loss.wrappers.pit_solver

class espnet2.enh.loss.wrappers.pit_solver.PITSolver(criterion: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, weight=1.0, independent_perm=True)[source]

Bases: espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper

forward(ref, inf, others={})[source]

Permutation invariant training solver.

Parameters
  • ref (List[torch.Tensor]) – [(batch, …), …] x n_spk

  • inf (List[torch.Tensor]) – [(batch, …), …]

Returns

(torch.Tensor): minimum loss with the best permutation
stats: dict, for collecting training status
others: dict; in this PIT solver, the permutation order will be returned

Return type

loss
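
A small sketch combining a time-domain criterion with the PIT wrapper for a two-speaker case (random tensors stand in for reference and separated signals; the "perm" key name is an assumption, the docs above only state that the permutation order is returned):

    import torch
    from espnet2.enh.loss.criterions.time_domain import SISNRLoss
    from espnet2.enh.loss.wrappers.pit_solver import PITSolver

    solver = PITSolver(criterion=SISNRLoss(), weight=1.0)
    ref = [torch.randn(4, 16000), torch.randn(4, 16000)]  # [(batch, ...)] x n_spk
    inf = [torch.randn(4, 16000), torch.randn(4, 16000)]  # [(batch, ...)] x n_spk
    loss, stats, others = solver(ref, inf)
    perm = others["perm"]  # assumed key for the chosen permutation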

espnet2.enh.separator.__init__

espnet2.enh.separator.abs_separator

class espnet2.enh.separator.abs_separator.AbsSeparator[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Initializes internal Module state, shared by both nn.Module and ScriptModule.

abstract forward(input: torch.Tensor, ilens: torch.Tensor) → Tuple[Tuple[torch.Tensor], torch.Tensor, collections.OrderedDict][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract property num_spk

espnet2.enh.separator.asteroid_models

class espnet2.enh.separator.asteroid_models.AsteroidModel_Converter(encoder_output_dim: int, model_name: str, num_spk: int, pretrained_path: str = '', loss_type: str = 'si_snr', **model_related_kwargs)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

A class to convert models from Asteroid into an AbsSeparator.

forward(input: torch.Tensor, ilens: torch.Tensor = None)[source]

Forward pass of the wrapped Asteroid model.

Parameters
  • input (torch.Tensor) – Raw Waveforms [B, T]

  • ilens (torch.Tensor) – input lengths [B]

Returns
  • estimated waveforms (List[torch.Tensor]) – [(B, T), …]

  • ilens (torch.Tensor) – (B,)

  • others (OrderedDict) – predicted data, e.g. masks:
    ‘mask_spk1’: torch.Tensor(Batch, T),
    ‘mask_spk2’: torch.Tensor(Batch, T),
    …
    ‘mask_spkn’: torch.Tensor(Batch, T)

forward_rawwav(input: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Output with waveforms.

property num_spk

espnet2.enh.separator.conformer_separator

class espnet2.enh.separator.conformer_separator.ConformerSeparator(input_dim: int, num_spk: int = 2, adim: int = 384, aheads: int = 4, layers: int = 6, linear_units: int = 1536, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, normalize_before: bool = False, concat_after: bool = False, dropout_rate: float = 0.1, input_layer: str = 'linear', positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.1, nonlinear: str = 'relu', conformer_pos_enc_layer_type: str = 'rel_pos', conformer_self_attn_layer_type: str = 'rel_selfattn', conformer_activation_type: str = 'swish', use_macaron_style_in_conformer: bool = True, use_cnn_in_conformer: bool = True, conformer_enc_kernel_size: int = 7, padding_idx: int = -1)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

Conformer separator.

Parameters
  • input_dim – input feature dimension

  • num_spk – number of speakers

  • adim (int) – Dimension of attention.

  • aheads (int) – The number of heads of multi head attention.

  • linear_units (int) – The number of units of position-wise feed forward.

  • layers (int) – The number of transformer blocks.

  • dropout_rate (float) – Dropout rate.

  • input_layer (Union[str, torch.nn.Module]) – Input layer type.

  • attention_dropout_rate (float) – Dropout rate in attention.

  • positional_dropout_rate (float) – Dropout rate after adding positional encoding.

  • normalize_before (bool) – Whether to use layer_norm before the first block.

  • concat_after (bool) – Whether to concatenate the attention layer’s input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e. x -> x + att(x).

  • conformer_pos_enc_layer_type (str) – Encoder positional encoding layer type.

  • conformer_self_attn_layer_type (str) – Encoder attention layer type.

  • conformer_activation_type (str) – Encoder activation function type.

  • positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.

  • positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.

  • use_macaron_style_in_conformer (bool) – Whether to use macaron style for positionwise layer.

  • use_cnn_in_conformer (bool) – Whether to use convolution module.

  • conformer_enc_kernel_size (int) – Kernel size of the convolution module.

  • padding_idx (int) – Padding idx for input_layer=embed.

  • nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]

  • ilens (torch.Tensor) – input lengths [Batch]

Returns
  • masked (List[Union[torch.Tensor, ComplexTensor]]) – [(B, T, N), …]

  • ilens (torch.Tensor) – (B,)

  • others (OrderedDict) – predicted data, e.g. masks:
    ‘mask_spk1’: torch.Tensor(Batch, Frames, Freq),
    ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq),
    …
    ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq)

property num_spk
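
Example (a minimal sketch, assuming espnet2 and torch are installed; the feature dimension and the reduced model sizes below are arbitrary illustrative choices):

>>> import torch
>>> from espnet2.enh.separator.conformer_separator import ConformerSeparator
>>> separator = ConformerSeparator(input_dim=128, num_spk=2, adim=64, layers=2, linear_units=256)
>>> feats = torch.randn(4, 100, 128)                 # encoded features (B, T, N)
>>> ilens = torch.full((4,), 100, dtype=torch.long)  # input lengths (B,)
>>> masked, olens, others = separator(feats, ilens)  # masks in others['mask_spk1'], ...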

espnet2.enh.separator.dccrn_separator

class espnet2.enh.separator.dccrn_separator.DCCRNSeparator(input_dim: int, num_spk: int = 1, rnn_layer: int = 2, rnn_units: int = 256, masking_mode: str = 'E', use_clstm: bool = True, bidirectional: bool = False, use_cbn: bool = False, kernel_size: int = 5, kernel_num: List[int] = [32, 64, 128, 256, 256, 256], use_builtin_complex: bool = True, use_noise_mask: bool = False)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

DCCRN separator.

Parameters
  • input_dim (int) – input dimension.

  • num_spk (int, optional) – number of speakers. Defaults to 1.

  • rnn_layer (int, optional) – number of lstm layers in the crn. Defaults to 2.

  • rnn_units (int, optional) – number of RNN units. Defaults to 256.

  • masking_mode (str, optional) – usage of the estimated mask. Defaults to “E”.

  • use_clstm (bool, optional) – whether to use a complex LSTM. Defaults to True.

  • bidirectional (bool, optional) – whether to use a BLSTM. Defaults to False.

  • use_cbn (bool, optional) – whether to use complex batch normalization. Defaults to False.

  • kernel_size (int, optional) – convolution kernel size. Defaults to 5.

  • kernel_num (list, optional) – output dimension of each layer of the encoder.

  • use_builtin_complex (bool, optional) – torch.complex if True, else ComplexTensor.

  • use_noise_mask (bool, optional) – whether to estimate the mask of noise.

apply_masks(masks: List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], real: torch.Tensor, imag: torch.Tensor)[source]

Apply the estimated masks to the noisy spectrum.

Parameters
  • masks – est_masks, [(B, T, F), …]

  • real (torch.Tensor) – real part of the noisy spectrum, (B, F, T)

  • imag (torch.Tensor) – imag part of the noisy spectrum, (B, F, T)

Returns
  • masked (List[Union[torch.Tensor, ComplexTensor]]) – [(B, T, F), …]

create_masks(mask_tensor: torch.Tensor)[source]

Create the estimated mask for each speaker.

Parameters

mask_tensor (torch.Tensor) – output of the decoder, shape (B, 2*num_spk, F-1, T)

flatten_parameters()[source]

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, F]

  • ilens (torch.Tensor) – input lengths [Batch]

Returns
  • masked (List[Union[torch.Tensor, ComplexTensor]]) – [(B, T, F), …]

  • ilens (torch.Tensor) – (B,)

  • others (OrderedDict) – predicted data, e.g. masks:
    ‘mask_spk1’: torch.Tensor(Batch, Frames, Freq),
    ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq),
    …
    ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq)

property num_spk
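
Construction sketch (assuming espnet2 is installed; input_dim=257 corresponds to a 512-point STFT and is only an illustrative choice). forward() then expects a complex spectrum of shape [B, T, F]:

>>> from espnet2.enh.separator.dccrn_separator import DCCRNSeparator
>>> separator = DCCRNSeparator(input_dim=257, num_spk=1)
>>> separator.num_spk
1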

espnet2.enh.separator.dprnn_separator

class espnet2.enh.separator.dprnn_separator.DPRNNSeparator(input_dim: int, rnn_type: str = 'lstm', bidirectional: bool = True, num_spk: int = 2, nonlinear: str = 'relu', layer: int = 3, unit: int = 512, segment_size: int = 20, dropout: float = 0.0)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

Dual-Path RNN (DPRNN) Separator

Parameters
  • input_dim – input feature dimension

  • rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.

  • bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.

  • num_spk – number of speakers

  • nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’

  • layer – int, number of stacked RNN layers. Default is 3.

  • unit – int, dimension of the hidden state.

  • segment_size – dual-path segment size

  • dropout – float, dropout ratio. Default is 0.

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]

  • ilens (torch.Tensor) – input lengths [Batch]

Returns
  • masked (List[Union[torch.Tensor, ComplexTensor]]) – [(B, T, N), …]

  • ilens (torch.Tensor) – (B,)

  • others (OrderedDict) – predicted data, e.g. masks:
    ‘mask_spk1’: torch.Tensor(Batch, Frames, Freq),
    ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq),
    …
    ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq)

property num_spk
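
Example (a minimal sketch, assuming espnet2 and torch are installed; the feature dimension, lengths, and hidden size below are arbitrary):

>>> import torch
>>> from espnet2.enh.separator.dprnn_separator import DPRNNSeparator
>>> separator = DPRNNSeparator(input_dim=64, num_spk=2, unit=128, segment_size=20)
>>> feats = torch.randn(4, 200, 64)                  # encoded features (B, T, N)
>>> ilens = torch.full((4,), 200, dtype=torch.long)  # input lengths (B,)
>>> masked, olens, others = separator(feats, ilens)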

espnet2.enh.separator.neural_beamformer

class espnet2.enh.separator.neural_beamformer.NeuralBeamformer(input_dim: int, num_spk: int = 1, loss_type: str = 'mask_mse', use_wpe: bool = False, wnet_type: str = 'blstmp', wlayers: int = 3, wunits: int = 300, wprojs: int = 320, wdropout_rate: float = 0.0, taps: int = 5, delay: int = 3, use_dnn_mask_for_wpe: bool = True, wnonlinear: str = 'crelu', multi_source_wpe: bool = True, wnormalization: bool = False, use_beamformer: bool = True, bnet_type: str = 'blstmp', blayers: int = 3, bunits: int = 300, bprojs: int = 320, badim: int = 320, ref_channel: int = -1, use_noise_mask: bool = True, bnonlinear: str = 'sigmoid', beamformer_type: str = 'mvdr_souden', rtf_iterations: int = 2, bdropout_rate: float = 0.0, shared_power: bool = True, diagonal_loading: bool = True, diag_eps_wpe: float = 1e-07, diag_eps_bf: float = 1e-07, mask_flooring: bool = False, flooring_thres_wpe: float = 1e-06, flooring_thres_bf: float = 1e-06, use_torch_solver: bool = True)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters
  • input (torch.complex64/ComplexTensor) – mixed speech [Batch, Frames, Channel, Freq]

  • ilens (torch.Tensor) – input lengths [Batch]

Returns
  • enhanced speech (List[torch.complex64/ComplexTensor]) – single-channel enhanced spectra

  • output lengths

  • other predicted data (OrderedDict) –
    ‘dereverb1’: ComplexTensor(Batch, Frames, Channel, Freq),
    ‘mask_dereverb1’: torch.Tensor(Batch, Frames, Channel, Freq),
    ‘mask_noise1’: torch.Tensor(Batch, Frames, Channel, Freq),
    ‘mask_spk1’: torch.Tensor(Batch, Frames, Channel, Freq),
    ‘mask_spk2’: torch.Tensor(Batch, Frames, Channel, Freq),
    …
    ‘mask_spkn’: torch.Tensor(Batch, Frames, Channel, Freq)

property num_spk
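
Construction sketch (assuming espnet2 is installed; input_dim=257 corresponds to a 512-point STFT and is only an illustrative choice). forward() then expects a multi-channel complex spectrum of shape [Batch, Frames, Channel, Freq]:

>>> from espnet2.enh.separator.neural_beamformer import NeuralBeamformer
>>> separator = NeuralBeamformer(input_dim=257, num_spk=1, use_wpe=False, use_beamformer=True)
>>> separator.num_spk
1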

espnet2.enh.separator.rnn_separator

class espnet2.enh.separator.rnn_separator.RNNSeparator(input_dim: int, rnn_type: str = 'blstm', num_spk: int = 2, nonlinear: str = 'sigmoid', layer: int = 3, unit: int = 512, dropout: float = 0.0)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

RNN Separator

Parameters
  • input_dim – input feature dimension

  • rnn_type – string, select from ‘blstm’, ‘lstm’ etc.

  • num_spk – number of speakers

  • nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’

  • layer – int, number of stacked RNN layers. Default is 3.

  • unit – int, dimension of the hidden state.

  • dropout – float, dropout ratio. Default is 0.

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]

  • ilens (torch.Tensor) – input lengths [Batch]

Returns
  • masked (List[Union[torch.Tensor, ComplexTensor]]) – [(B, T, N), …]

  • ilens (torch.Tensor) – (B,)

  • others (OrderedDict) – predicted data, e.g. masks:
    ‘mask_spk1’: torch.Tensor(Batch, Frames, Freq),
    ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq),
    …
    ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq)

property num_spk
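
Example (a minimal sketch, assuming espnet2 and torch are installed; the feature dimension and hidden size below are arbitrary):

>>> import torch
>>> from espnet2.enh.separator.rnn_separator import RNNSeparator
>>> separator = RNNSeparator(input_dim=129, rnn_type="blstm", num_spk=2, unit=128)
>>> feats = torch.randn(4, 100, 129)                 # encoded features (B, T, N)
>>> ilens = torch.full((4,), 100, dtype=torch.long)  # input lengths (B,)
>>> masked, olens, others = separator(feats, ilens)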

espnet2.enh.separator.skim_separator

class espnet2.enh.separator.skim_separator.SkiMSeparator(input_dim: int, causal: bool = True, num_spk: int = 2, nonlinear: str = 'relu', layer: int = 3, unit: int = 512, segment_size: int = 20, dropout: float = 0.0, mem_type: str = 'hc', seg_overlap: bool = False)[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

Skipping Memory (SkiM) Separator

Parameters
  • input_dim – input feature dimension

  • causal – bool, whether the system is causal.

  • num_spk – number of target speakers.

  • nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’

  • layer – int, number of SkiM blocks. Default is 3.

  • unit – int, dimension of the hidden state.

  • segment_size – segmentation size for splitting long features

  • dropout – float, dropout ratio. Default is 0.

  • mem_type – ‘hc’, ‘h’, ‘c’, ‘id’ or None. It controls whether the hidden (or cell) state of SegLSTM will be processed by MemLSTM. In ‘id’ mode, both the hidden and cell states will be identically returned. When mem_type is None, the MemLSTM will be removed.

  • seg_overlap – Bool, whether the segmentation will reserve 50% overlap for adjacent segments. Default is False.

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]

  • ilens (torch.Tensor) – input lengths [Batch]

Returns
  • masked (List[Union[torch.Tensor, ComplexTensor]]) – [(B, T, N), …]

  • ilens (torch.Tensor) – (B,)

  • others (OrderedDict) – predicted data, e.g. masks:
    ‘mask_spk1’: torch.Tensor(Batch, Frames, Freq),
    ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq),
    …
    ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq)

property num_spk
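
Example (a minimal sketch, assuming espnet2 and torch are installed; the feature dimension, hidden size, and segment size below are arbitrary):

>>> import torch
>>> from espnet2.enh.separator.skim_separator import SkiMSeparator
>>> separator = SkiMSeparator(input_dim=64, causal=True, num_spk=2, unit=128, segment_size=20)
>>> feats = torch.randn(4, 200, 64)                  # encoded features (B, T, N)
>>> ilens = torch.full((4,), 200, dtype=torch.long)  # input lengths (B,)
>>> masked, olens, others = separator(feats, ilens)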

espnet2.enh.separator.tcn_separator

class espnet2.enh.separator.tcn_separator.TCNSeparator(input_dim: int, num_spk: int = 2, layer: int = 8, stack: int = 3, bottleneck_dim: int = 128, hidden_dim: int = 512, kernel: int = 3, causal: bool = False, norm_type: str = 'gLN', nonlinear: str = 'relu')[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

Temporal Convolution Separator

Parameters
  • input_dim – input feature dimension

  • num_spk – number of speakers

  • layer – int, number of layers in each stack.

  • stack – int, number of stacks

  • bottleneck_dim – bottleneck dimension

  • hidden_dim – number of convolution channels

  • kernel – int, kernel size.

  • causal – bool, default False.

  • norm_type – str, choose from ‘BN’, ‘gLN’, ‘cLN’

  • nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]

  • ilens (torch.Tensor) – input lengths [Batch]

Returns
  • masked (List[Union[torch.Tensor, ComplexTensor]]) – [(B, T, N), …]

  • ilens (torch.Tensor) – (B,)

  • others (OrderedDict) – predicted data, e.g. masks:
    ‘mask_spk1’: torch.Tensor(Batch, Frames, Freq),
    ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq),
    …
    ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq)

property num_spk
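
Example (a minimal sketch, assuming espnet2 and torch are installed; the feature dimension and the reduced layer/stack counts below are arbitrary):

>>> import torch
>>> from espnet2.enh.separator.tcn_separator import TCNSeparator
>>> separator = TCNSeparator(input_dim=256, num_spk=2, layer=4, stack=2)
>>> feats = torch.randn(4, 300, 256)                 # encoded features (B, T, N)
>>> ilens = torch.full((4,), 300, dtype=torch.long)  # input lengths (B,)
>>> masked, olens, others = separator(feats, ilens)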

espnet2.enh.separator.transformer_separator

class espnet2.enh.separator.transformer_separator.TransformerSeparator(input_dim: int, num_spk: int = 2, adim: int = 384, aheads: int = 4, layers: int = 6, linear_units: int = 1536, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, normalize_before: bool = False, concat_after: bool = False, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.1, use_scaled_pos_enc: bool = True, nonlinear: str = 'relu')[source]

Bases: espnet2.enh.separator.abs_separator.AbsSeparator

Transformer separator.

Parameters
  • input_dim – input feature dimension

  • num_spk – number of speakers

  • adim (int) – Dimension of attention.

  • aheads (int) – The number of heads of multi head attention.

  • linear_units (int) – The number of units of position-wise feed forward.

  • layers (int) – The number of transformer blocks.

  • dropout_rate (float) – Dropout rate.

  • attention_dropout_rate (float) – Dropout rate in attention.

  • positional_dropout_rate (float) – Dropout rate after adding positional encoding.

  • normalize_before (bool) – Whether to use layer_norm before the first block.

  • concat_after (bool) – Whether to concatenate the attention layer’s input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e. x -> x + att(x).

  • positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.

  • positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.

  • use_scaled_pos_enc (bool) – use scaled positional encoding or not

  • nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’

forward(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]

Forward.

Parameters
  • input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]

  • ilens (torch.Tensor) – input lengths [Batch]

Returns
  • masked (List[Union[torch.Tensor, ComplexTensor]]) – [(B, T, N), …]

  • ilens (torch.Tensor) – (B,)

  • others (OrderedDict) – predicted data, e.g. masks:
    ‘mask_spk1’: torch.Tensor(Batch, Frames, Freq),
    ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq),
    …
    ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq)

property num_spk
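
Example (a minimal sketch, assuming espnet2 and torch are installed; the feature dimension and the reduced model sizes below are arbitrary illustrative choices):

>>> import torch
>>> from espnet2.enh.separator.transformer_separator import TransformerSeparator
>>> separator = TransformerSeparator(input_dim=128, num_spk=2, adim=64, layers=2, linear_units=256)
>>> feats = torch.randn(4, 100, 128)                 # encoded features (B, T, N)
>>> ilens = torch.full((4,), 100, dtype=torch.long)  # input lengths (B,)
>>> masked, olens, others = separator(feats, ilens)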