espnet2.bin package¶
espnet2.bin.__init__¶
espnet2.bin.aggregate_stats_dirs¶
espnet2.bin.asr_align¶
espnet2.bin.asr_inference¶
class espnet2.bin.asr_inference.Speech2Text(asr_train_config: Union[pathlib.Path, str] = None, asr_model_file: Union[pathlib.Path, str] = None, transducer_conf: dict = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ctc_weight: float = 0.5, lm_weight: float = 1.0, ngram_weight: float = 0.9, penalty: float = 0.0, nbest: int = 1, streaming: bool = False)[source]¶
Bases: object

Speech2Text class

Examples
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]¶
Build a Speech2Text instance from a pretrained model.

- Parameters
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns
Speech2Text instance.
- Return type
Speech2Text
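Keyword arguments passed to from_pretrained are, as the signature suggests, forwarded to the Speech2Text constructor, so decoding options can be set at load time. A minimal sketch; the model tag below is a placeholder, not a real espnet_model_zoo tag:

>>> speech2text = Speech2Text.from_pretrained(
>>>     model_tag="<espnet_model_zoo tag>",  # placeholder tag
>>>     beam_size=10,
>>>     ctc_weight=0.3,
>>> )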
espnet2.bin.asr_inference.inference(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], asr_train_config: Optional[str], asr_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, transducer_conf: Optional[dict], streaming: bool)[source]¶
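All of the arguments above are required when calling inference directly from Python (the command-line entry point fills them from flags). A sketch with illustrative values, decoding from local config and model files; the paths and the ("speech", "sound") data triple are illustrative, not prescribed:

>>> from espnet2.bin.asr_inference import inference
>>> inference(
>>>     output_dir="exp/decode",
>>>     maxlenratio=0.0,
>>>     minlenratio=0.0,
>>>     batch_size=1,
>>>     dtype="float32",
>>>     beam_size=20,
>>>     ngpu=0,
>>>     seed=0,
>>>     ctc_weight=0.5,
>>>     lm_weight=1.0,
>>>     ngram_weight=0.9,
>>>     penalty=0.0,
>>>     nbest=1,
>>>     num_workers=1,
>>>     log_level="INFO",
>>>     data_path_and_name_and_type=[("dump/raw/test/wav.scp", "speech", "sound")],
>>>     key_file=None,
>>>     asr_train_config="exp/asr_train/config.yaml",
>>>     asr_model_file="exp/asr_train/valid.acc.ave.pth",
>>>     lm_train_config=None,
>>>     lm_file=None,
>>>     word_lm_train_config=None,
>>>     word_lm_file=None,
>>>     ngram_file=None,
>>>     model_tag=None,
>>>     token_type=None,
>>>     bpemodel=None,
>>>     allow_variable_data_keys=False,
>>>     transducer_conf=None,
>>>     streaming=False,
>>> )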
espnet2.bin.asr_inference_k2¶
espnet2.bin.asr_inference_maskctc¶
class espnet2.bin.asr_inference_maskctc.Speech2Text(asr_train_config: Union[pathlib.Path, str], asr_model_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', batch_size: int = 1, dtype: str = 'float32', maskctc_n_iterations: int = 10, maskctc_threshold_probability: float = 0.99)[source]¶
Bases: object

Speech2Text class

Examples
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
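The Mask-CTC specific options trade refinement quality against the number of decoding iterations. A minimal sketch using only arguments from the constructor signature above, with illustrative values:

>>> speech2text = Speech2Text(
>>>     "asr_config.yml",
>>>     "asr.pth",
>>>     maskctc_n_iterations=5,
>>>     maskctc_threshold_probability=0.95,
>>> )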
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]¶
Build a Speech2Text instance from a pretrained model.

- Parameters
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns
Speech2Text instance.
- Return type
Speech2Text
espnet2.bin.asr_inference_maskctc.inference(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], asr_train_config: str, asr_model_file: str, model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, maskctc_n_iterations: int, maskctc_threshold_probability: float)[source]¶
espnet2.bin.asr_inference_streaming¶
class espnet2.bin.asr_inference_streaming.Speech2TextStreaming(asr_train_config: Union[pathlib.Path, str], asr_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ctc_weight: float = 0.5, lm_weight: float = 1.0, penalty: float = 0.0, nbest: int = 1, disable_repetition_detection=False, decoder_text_length_limit=0, encoded_feat_length_limit=0)[source]¶
Bases: object

Speech2TextStreaming class

Details in “Streaming Transformer ASR with Blockwise Synchronous Beam Search” (https://arxiv.org/abs/2006.14941)

Examples
>>> import soundfile
>>> speech2text = Speech2TextStreaming("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
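For simulated streaming, audio can be fed in fixed-size chunks rather than as a whole utterance (compare the sim_chunk_length argument of inference below). A sketch of that pattern; the is_final keyword marking the last chunk is an assumption here, so check the method signature for the exact name:

>>> import soundfile
>>> speech2text = Speech2TextStreaming("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> chunk = 2048  # illustrative chunk size in samples
>>> for start in range(0, len(audio), chunk):
>>>     results = speech2text(
>>>         audio[start:start + chunk],
>>>         is_final=start + chunk >= len(audio),
>>>     )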
espnet2.bin.asr_inference_streaming.inference(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], asr_train_config: str, asr_model_file: str, lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, sim_chunk_length: int, disable_repetition_detection: bool, encoded_feat_length_limit: int, decoder_text_length_limit: int)[source]¶
espnet2.bin.asr_train¶
espnet2.bin.diar_inference¶
class espnet2.bin.diar_inference.DiarizeSpeech(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, segment_size: Optional[float] = None, normalize_segment_scale: bool = False, show_progressbar: bool = False, num_spk: Optional[int] = None, device: str = 'cpu', dtype: str = 'float32')[source]¶
Bases: object

DiarizeSpeech class

Examples
>>> import soundfile
>>> diarization = DiarizeSpeech("diar_config.yaml", "diar.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> diarization(audio)
[(spk_id, start, end), (spk_id2, start2, end2)]
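If the number of speakers is known in advance, it can be fixed at construction time via num_spk (from the signature above). A minimal sketch:

>>> diarization = DiarizeSpeech("diar_config.yaml", "diar.pth", num_spk=2)
>>> audio, rate = soundfile.read("speech.wav")
>>> diarization(audio)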
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]¶
Build a DiarizeSpeech instance from a pretrained model.

- Parameters
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns
DiarizeSpeech instance.
- Return type
DiarizeSpeech
espnet2.bin.diar_inference.inference(output_dir: str, batch_size: int, dtype: str, fs: int, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], allow_variable_data_keys: bool, segment_size: Optional[float], show_progressbar: bool, num_spk: Optional[int])[source]¶
espnet2.bin.diar_train¶
espnet2.bin.enh_inference¶
class espnet2.bin.enh_inference.SeparateSpeech(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, segment_size: Optional[float] = None, hop_size: Optional[float] = None, normalize_segment_scale: bool = False, show_progressbar: bool = False, ref_channel: Optional[int] = None, normalize_output_wav: bool = False, device: str = 'cpu', dtype: str = 'float32')[source]¶
Bases: object

SeparateSpeech class

Examples
>>> import soundfile
>>> separate_speech = SeparateSpeech("enh_config.yml", "enh.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> separate_speech(audio)
[separated_audio1, separated_audio2, ...]
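A sketch of saving each separated stream to disk. It assumes the returned list holds batched waveforms of shape (Batch, Nsamples) at the input rate, matching the shapes used by cal_permumation below, and that __call__ accepts an fs keyword as the inference function does; both are assumptions to verify against the actual method:

>>> import soundfile
>>> separate_speech = SeparateSpeech("enh_config.yml", "enh.pth")
>>> audio, rate = soundfile.read("mixture.wav")
>>> waves = separate_speech(audio[None, :], fs=rate)  # assumed (Batch, Nsamples) input and fs keyword
>>> for spk, wav in enumerate(waves):
>>>     soundfile.write(f"separated_spk{spk}.wav", wav[0], rate)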
cal_permumation(ref_wavs, enh_wavs, criterion='si_snr')[source]¶
Calculate the permutation between separated streams in two adjacent segments.

- Parameters
ref_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]
enh_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]
criterion (str) – one of (“si_snr”, “mse”, “corr”)
- Returns
permutation for enh_wavs (Batch, num_spk)
- Return type
perm (torch.Tensor)
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]¶
Build a SeparateSpeech instance from a pretrained model.

- Parameters
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns
SeparateSpeech instance.
- Return type
SeparateSpeech
espnet2.bin.enh_inference.inference(output_dir: str, batch_size: int, dtype: str, fs: int, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], allow_variable_data_keys: bool, segment_size: Optional[float], hop_size: Optional[float], normalize_segment_scale: bool, show_progressbar: bool, ref_channel: Optional[int], normalize_output_wav: bool)[source]¶
espnet2.bin.enh_scoring¶
espnet2.bin.enh_train¶
espnet2.bin.gan_tts_train¶
espnet2.bin.hubert_train¶
espnet2.bin.launch¶
espnet2.bin.lm_calc_perplexity¶
espnet2.bin.lm_calc_perplexity.calc_perplexity(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], log_base: Optional[float], allow_variable_data_keys: bool)[source]¶
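calc_perplexity can be driven directly from Python; every argument name below comes from the signature above, while the paths and settings are illustrative:

>>> from espnet2.bin.lm_calc_perplexity import calc_perplexity
>>> calc_perplexity(
>>>     output_dir="exp/lm_ppl",
>>>     batch_size=1,
>>>     dtype="float32",
>>>     ngpu=0,
>>>     seed=0,
>>>     num_workers=1,
>>>     log_level="INFO",
>>>     data_path_and_name_and_type=[("dump/raw/test/text", "text", "text")],
>>>     key_file=None,
>>>     train_config="exp/lm_train/config.yaml",
>>>     model_file="exp/lm_train/valid.loss.best.pth",
>>>     log_base=10.0,
>>>     allow_variable_data_keys=False,
>>> )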
espnet2.bin.lm_train¶
espnet2.bin.mt_inference¶
class espnet2.bin.mt_inference.Text2Text(mt_train_config: Union[pathlib.Path, str] = None, mt_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, lm_weight: float = 1.0, ngram_weight: float = 0.9, penalty: float = 0.0, nbest: int = 1)[source]¶
Bases: object

Text2Text class

Examples
>>> text2text = Text2Text("mt_config.yml", "mt.pth")
>>> text2text(src_text)
[(text, token, token_int, hypothesis object), ...]
espnet2.bin.mt_inference.inference(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, lm_weight: float, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], mt_train_config: Optional[str], mt_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool)[source]¶
espnet2.bin.mt_train¶
espnet2.bin.pack¶
class espnet2.bin.pack.ASRPackedContents[source]¶
Bases: espnet2.bin.pack.PackedContents
files = ['asr_model_file', 'lm_file']¶
yaml_files = ['asr_train_config', 'lm_train_config']¶

class espnet2.bin.pack.DiarPackedContents[source]¶
Bases: espnet2.bin.pack.PackedContents
files = ['model_file']¶
yaml_files = ['train_config']¶

class espnet2.bin.pack.EnhPackedContents[source]¶
Bases: espnet2.bin.pack.PackedContents
files = ['model_file']¶
yaml_files = ['train_config']¶

class espnet2.bin.pack.STPackedContents[source]¶
Bases: espnet2.bin.pack.PackedContents
files = ['st_model_file']¶
yaml_files = ['st_train_config']¶

class espnet2.bin.pack.TTSPackedContents[source]¶
Bases: espnet2.bin.pack.PackedContents
files = ['model_file']¶
yaml_files = ['train_config']¶
espnet2.bin.split_scps¶
espnet2.bin.st_inference¶
class espnet2.bin.st_inference.Speech2Text(st_train_config: Union[pathlib.Path, str] = None, st_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, lm_weight: float = 1.0, ngram_weight: float = 0.9, penalty: float = 0.0, nbest: int = 1)[source]¶
Bases: object

Speech2Text class

Examples
>>> import soundfile
>>> speech2text = Speech2Text("st_config.yml", "st.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]¶
Build a Speech2Text instance from a pretrained model.

- Parameters
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
- Returns
Speech2Text instance.
- Return type
Speech2Text
espnet2.bin.st_inference.inference(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, lm_weight: float, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], st_train_config: Optional[str], st_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool)[source]¶
espnet2.bin.st_train¶
espnet2.bin.tokenize_text¶
espnet2.bin.tokenize_text.field2slice(field: Optional[str]) → slice[source]¶
Convert a field string to a slice.

Note that the field string accepts 1-based integers.

Examples
>>> field2slice("1-")
slice(0, None, None)
>>> field2slice("1-3")
slice(0, 3, None)
>>> field2slice("-3")
slice(None, 3, None)
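The resulting slice picks 1-based column ranges out of a split line. A sketch of that usage; applying the slice to whitespace-split tokens is an assumption based on the convention above:

>>> tokens = "uttid hello world again".split()
>>> tokens[field2slice("2-3")]
['hello', 'world']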
espnet2.bin.tokenize_text.tokenize(input: str, output: str, field: Optional[str], delimiter: Optional[str], token_type: str, space_symbol: str, non_linguistic_symbols: Optional[str], bpemodel: Optional[str], log_level: str, write_vocabulary: bool, vocabulary_size: int, remove_non_linguistic_symbols: bool, cutoff: int, add_symbol: List[str], cleaner: Optional[str], g2p: Optional[str])[source]¶
espnet2.bin.tts_inference¶
Script to run inference of a text-to-speech model.
class espnet2.bin.tts_inference.Text2Speech(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_teacher_forcing: bool = False, use_att_constraint: bool = False, backward_window: int = 1, forward_window: int = 3, speed_control_alpha: float = 1.0, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, vocoder_config: Union[pathlib.Path, str] = None, vocoder_file: Union[pathlib.Path, str] = None, dtype: str = 'float32', device: str = 'cpu', seed: int = 777, always_fix_seed: bool = False)[source]¶
Bases: object

Text2Speech class.

Examples
>>> from espnet2.bin.tts_inference import Text2Speech
>>> # Case 1: Load the local model and use Griffin-Lim vocoder
>>> text2speech = Text2Speech(
>>>     train_config="/path/to/config.yml",
>>>     model_file="/path/to/model.pth",
>>> )
>>> # Case 2: Load the local model and the pretrained vocoder
>>> text2speech = Text2Speech.from_pretrained(
>>>     train_config="/path/to/config.yml",
>>>     model_file="/path/to/model.pth",
>>>     vocoder_tag="kan-bayashi/ljspeech_tacotron2",
>>> )
>>> # Case 3: Load the pretrained model and use Griffin-Lim vocoder
>>> text2speech = Text2Speech.from_pretrained(
>>>     model_tag="kan-bayashi/ljspeech_tacotron2",
>>> )
>>> # Case 4: Load the pretrained model and the pretrained vocoder
>>> text2speech = Text2Speech.from_pretrained(
>>>     model_tag="kan-bayashi/ljspeech_tacotron2",
>>>     vocoder_tag="parallel_wavegan/ljspeech_parallel_wavegan.v1",
>>> )
>>> # Run inference and save as wav file
>>> import soundfile as sf
>>> wav = text2speech("Hello, World")["wav"]
>>> sf.write("out.wav", wav.numpy(), text2speech.fs, "PCM_16")
Initialize Text2Speech module.
static from_pretrained(model_tag: Optional[str] = None, vocoder_tag: Optional[str] = None, **kwargs)[source]¶
Build a Text2Speech instance from a pretrained model.

- Parameters
model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.
vocoder_tag (Optional[str]) – Vocoder tag of the pretrained vocoders. Currently, the tags of parallel_wavegan are supported, which should start with the prefix “parallel_wavegan/”.
- Returns
Text2Speech instance.
- Return type
Text2Speech
property fs¶
Return the sampling rate.

property use_lids¶
Return whether lid is needed in the inference.

property use_sids¶
Return whether sid is needed in the inference.

property use_speech¶
Return whether speech is needed in the inference.

property use_spembs¶
Return whether spemb is needed in the inference.
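These properties let a caller assemble only the inputs a given model needs before running inference. A minimal sketch, assuming the call keywords spembs and sids match the corresponding properties (the array shapes and values are illustrative):

>>> import numpy as np
>>> kwargs = {}
>>> if text2speech.use_spembs:
>>>     kwargs["spembs"] = np.zeros(192, dtype=np.float32)  # illustrative speaker embedding
>>> if text2speech.use_sids:
>>>     kwargs["sids"] = np.array([1])  # illustrative speaker id
>>> wav = text2speech("Hello, World", **kwargs)["wav"]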
espnet2.bin.tts_inference.inference(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], threshold: float, minlenratio: float, maxlenratio: float, use_teacher_forcing: bool, use_att_constraint: bool, backward_window: int, forward_window: int, speed_control_alpha: float, noise_scale: float, noise_scale_dur: float, always_fix_seed: bool, allow_variable_data_keys: bool, vocoder_config: Optional[str], vocoder_file: Optional[str], vocoder_tag: Optional[str])[source]¶
Run text-to-speech inference.