core tools

ESPnet provides several command-line tools for training and evaluating neural networks (NN) under espnet/bin:

  • asr_enhance.py: Enhance noisy speech for speech recognition

  • asr_recog.py: Transcribe text from speech using a speech recognition model on one CPU or GPU

  • asr_train.py: Train an automatic speech recognition (ASR) model on one CPU, one or multiple GPUs

  • lm_train.py: Train a new language model on one CPU or one GPU

  • mt_recog.py: Translate text using a machine translation (MT) model on one CPU or GPU

  • mt_train.py: Train a neural machine translation (NMT) model on one CPU, one or multiple GPUs

  • tts_decode.py: Synthesize speech from text using a TTS model on one CPU

  • tts_train.py: Train a new text-to-speech (TTS) model on one CPU, one or multiple GPUs

asr_enhance.py

Enhance noisy speech for speech recognition

usage: asr_enhance.py [-h] [--config CONFIG] [--config2 CONFIG2]
                      [--config3 CONFIG3] [--ngpu NGPU]
                      [--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
                      [--seed SEED] [--verbose VERBOSE]
                      [--batchsize BATCHSIZE]
                      [--preprocess-conf PREPROCESS_CONF]
                      [--recog-json RECOG_JSON] --model MODEL
                      [--model-conf MODEL_CONF]
                      [--enh-wspecifier ENH_WSPECIFIER]
                      [--enh-filetype {mat,hdf5,sound.hdf5,sound}] [--fs FS]
                      [--keep-length KEEP_LENGTH] [--image-dir IMAGE_DIR]
                      [--num-images NUM_IMAGES] [--apply-istft APPLY_ISTFT]
                      [--istft-win-length ISTFT_WIN_LENGTH]
                      [--istft-n-shift ISTFT_N_SHIFT]
                      [--istft-window ISTFT_WINDOW]

Named Arguments

--config

config file path

--config2

second config file path that overwrites the settings in --config.

--config3

third config file path that overwrites the settings in --config and --config2.
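
The override order described for --config, --config2 and --config3 can be pictured as a left-to-right dictionary merge. This is only a sketch under assumptions: the dictionaries stand in for parsed config files, and merge_configs is a hypothetical helper, not an ESPnet function.

```python
def merge_configs(*configs):
    """Merge settings left to right; later configs overwrite earlier ones."""
    merged = {}
    for cfg in configs:
        merged.update(cfg)
    return merged


# Hypothetical contents of the files given to --config, --config2, --config3.
base = {"backend": "chainer", "ngpu": 0, "batchsize": 1}
second = {"backend": "pytorch"}
third = {"ngpu": 1}

settings = merge_configs(base, second, third)
print(settings)
```

Keys that appear only in the first file survive unchanged, which is why a small second or third file is enough to adjust one or two settings.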

--ngpu

Number of GPUs

Default: 0

--backend

Possible choices: chainer, pytorch

Backend library

Default: “chainer”

--debugmode

Debugmode

Default: 1

--seed

Random seed

Default: 1

--verbose, -V

Verbose option

Default: 1

--batchsize

Batch size for beam search (0: means no batch processing)

Default: 1

--preprocess-conf

The configuration file for the pre-processing

--recog-json

Filename of recognition data (json)

--model

Model file parameters to read

--model-conf

Model config file

--enh-wspecifier

Specify how the enhanced speech is written, e.g. ark,scp:outdir,wav.scp

--enh-filetype

Possible choices: mat, hdf5, sound.hdf5, sound

Specify the file format for enhanced speech. “mat” is the matrix format in Kaldi

Default: “sound”

--fs

The sample frequency

Default: 16000

--keep-length

Adjust the output length to match with the input for enhanced speech

Default: True

--image-dir

The directory in which the images are saved.

--num-images

The number of image files to be saved. If negative, all samples are saved.

Default: 20

--apply-istft

Apply istft to the output from the network

Default: True

--istft-win-length

The window length for istft. This option is ignored if stft is found in the preprocess-conf

Default: 512

--istft-n-shift

The shift length (hop size) for istft. This option is ignored if stft is found in the preprocess-conf

Default: 256

--istft-window

The window type for istft. This option is ignored if stft is found in the preprocess-conf

Default: “hann”

asr_recog.py

Transcribe text from speech using a speech recognition model on one CPU or GPU

usage: asr_recog.py [-h] [--config CONFIG] [--config2 CONFIG2]
                    [--config3 CONFIG3] [--ngpu NGPU]
                    [--dtype {float16,float32,float64}]
                    [--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
                    [--seed SEED] [--verbose VERBOSE] [--batchsize BATCHSIZE]
                    [--preprocess-conf PREPROCESS_CONF] [--api {v1,v2}]
                    [--recog-json RECOG_JSON] --result-label RESULT_LABEL
                    --model MODEL [--model-conf MODEL_CONF]
                    [--num-spkrs {1,2}] [--nbest NBEST]
                    [--beam-size BEAM_SIZE] [--penalty PENALTY]
                    [--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
                    [--ctc-weight CTC_WEIGHT]
                    [--ctc-window-margin CTC_WINDOW_MARGIN]
                    [--score-norm-transducer [SCORE_NORM_TRANSDUCER]]
                    [--rnnlm RNNLM] [--rnnlm-conf RNNLM_CONF]
                    [--word-rnnlm WORD_RNNLM]
                    [--word-rnnlm-conf WORD_RNNLM_CONF]
                    [--word-dict WORD_DICT] [--lm-weight LM_WEIGHT]
                    [--streaming-mode {window,segment}]
                    [--streaming-window STREAMING_WINDOW]
                    [--streaming-min-blank-dur STREAMING_MIN_BLANK_DUR]
                    [--streaming-onset-margin STREAMING_ONSET_MARGIN]
                    [--streaming-offset-margin STREAMING_OFFSET_MARGIN]
                    [--tgt-lang TGT_LANG]
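
As a rough illustration of the usage line above, the following sketch assembles an asr_recog.py command from Python. Only flags listed in the usage are used; --result-label and --model are the two options shown without brackets, so they are always passed. The file names and the helper build_recog_command are hypothetical.

```python
import shlex


def build_recog_command(recog_json, result_label, model,
                        ngpu=0, backend="pytorch", beam_size=1):
    """Return an argv list for asr_recog.py using documented flags only."""
    return [
        "asr_recog.py",
        "--ngpu", str(ngpu),
        "--backend", backend,
        "--recog-json", recog_json,
        "--result-label", result_label,  # required
        "--model", model,                # required
        "--beam-size", str(beam_size),
    ]


# Hypothetical paths, chosen for illustration only.
cmd = build_recog_command("data.json", "result.json", "model.best", beam_size=20)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the command as a list rather than a single string keeps paths with spaces intact if the list is later handed to subprocess.run.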

Named Arguments

--config

Config file path

--config2

Second config file path that overwrites the settings in --config

--config3

Third config file path that overwrites the settings in --config and --config2

--ngpu

Number of GPUs

Default: 0

--dtype

Possible choices: float16, float32, float64

Float precision (only available in --api v2)

Default: “float32”

--backend

Possible choices: chainer, pytorch

Backend library

Default: “chainer”

--debugmode

Debugmode

Default: 1

--seed

Random seed

Default: 1

--verbose, -V

Verbose option

Default: 1

--batchsize

Batch size for beam search (0: means no batch processing)

Default: 1

--preprocess-conf

The configuration file for the pre-processing

--api

Possible choices: v1, v2

Beam search API.

v1: Default API. It only supports the ASRInterface.recognize method and DefaultRNNLM.

v2: Experimental API. It supports any model that implements ScorerInterface.

Default: “v1”

--recog-json

Filename of recognition data (json)

--result-label

Filename of result label data (json)

--model

Model file parameters to read

--model-conf

Model config file

--num-spkrs

Possible choices: 1, 2

Number of speakers in the speech

Default: 1

--nbest

Output N-best hypotheses

Default: 1

--beam-size

Beam size

Default: 1

--penalty

Insertion penalty

Default: 0.0

--maxlenratio

Input length ratio to obtain max output length. If maxlenratio=0.0 (default), it uses an end-detect function to automatically find maximum hypothesis lengths

Default: 0.0

--minlenratio

Input length ratio to obtain min output length

Default: 0.0
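
The two length ratios can be read as multipliers on the input length. The sketch below makes that concrete. It is an assumption about the arithmetic, not a quotation of the decoder: in particular, falling back to the input length when maxlenratio is 0.0 stands in for the end-detect behaviour, and length_bounds is a hypothetical helper.

```python
def length_bounds(input_length, maxlenratio=0.0, minlenratio=0.0):
    """Return (min_len, max_len) for the output hypothesis length."""
    if maxlenratio == 0.0:
        # Assumption: use the input length as the hard upper bound and
        # let end detection stop the search earlier.
        max_len = input_length
    else:
        max_len = max(1, int(maxlenratio * input_length))
    min_len = int(minlenratio * input_length)
    return min_len, max_len


print(length_bounds(200, maxlenratio=0.5, minlenratio=0.25))  # (50, 100)
print(length_bounds(200))                                     # (0, 200)
```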

--ctc-weight

CTC weight in joint decoding

Default: 0.0
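
One common reading of a CTC weight in joint CTC/attention decoding is a linear interpolation of the two log-probabilities. This reference does not spell out the formula, so the interpolation below is an assumption, and joint_score is a hypothetical name.

```python
def joint_score(att_logp, ctc_logp, ctc_weight=0.0):
    """Interpolate attention and CTC log-probabilities for one hypothesis."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must be in [0, 1]")
    return (1.0 - ctc_weight) * att_logp + ctc_weight * ctc_logp


print(joint_score(-1.0, -3.0, ctc_weight=0.5))  # -2.0
print(joint_score(-1.0, -3.0))                  # -1.0 (attention only)
```

With the default weight of 0.0 the CTC score has no effect, which matches the option's default of attention-only decoding.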

--ctc-window-margin

Use CTC window with margin parameter to accelerate CTC/attention decoding, especially on GPU. A smaller margin makes decoding faster but may increase search errors. If margin=0 (default), this function is disabled

Default: 0

--score-norm-transducer

Normalize transducer scores by length

Default: True

--rnnlm

RNNLM model file to read

--rnnlm-conf

RNNLM model config file to read

--word-rnnlm

Word RNNLM model file to read

--word-rnnlm-conf

Word RNNLM model config file to read

--word-dict

Word list to read

--lm-weight

RNNLM weight

Default: 0.1

--streaming-mode

Possible choices: window, segment

Use streaming recognizer for inference.

--batchsize must be set to 0 to enable this mode

--streaming-window

Window size

Default: 10

--streaming-min-blank-dur

Minimum blank duration threshold

Default: 10

--streaming-onset-margin

Onset margin

Default: 1

--streaming-offset-margin

Offset margin

Default: 1

--tgt-lang

target language ID (e.g., <en>, <de>, <fr> etc.)

Default: False

asr_train.py

Train an automatic speech recognition (ASR) model on one CPU, one or multiple GPUs

usage: asr_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
                    [--config3 CONFIG3] [--ngpu NGPU]
                    [--train-dtype {float16,float32,float64,O0,O1,O2,O3}]
                    [--backend {chainer,pytorch}] --outdir OUTDIR
                    [--debugmode DEBUGMODE] --dict DICT [--seed SEED]
                    [--debugdir DEBUGDIR] [--resume [RESUME]]
                    [--minibatches MINIBATCHES] [--verbose VERBOSE]
                    [--tensorboard-dir [TENSORBOARD_DIR]]
                    [--report-interval-iters REPORT_INTERVAL_ITERS]
                    [--train-json TRAIN_JSON] [--valid-json VALID_JSON]
                    [--model-module MODEL_MODULE]
                    [--ctc_type {builtin,warpctc}] [--mtlalpha MTLALPHA]
                    [--lsm-type [{,unigram}]] [--lsm-weight LSM_WEIGHT]
                    [--report-cer] [--report-wer] [--nbest NBEST]
                    [--beam-size BEAM_SIZE] [--penalty PENALTY]
                    [--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
                    [--ctc-weight CTC_WEIGHT] [--rnnlm RNNLM]
                    [--rnnlm-conf RNNLM_CONF] [--lm-weight LM_WEIGHT]
                    [--sym-space SYM_SPACE] [--sym-blank SYM_BLANK]
                    [--sortagrad [SORTAGRAD]]
                    [--batch-count {auto,seq,bin,frame}]
                    [--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
                    [--batch-frames-in BATCH_FRAMES_IN]
                    [--batch-frames-out BATCH_FRAMES_OUT]
                    [--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
                    [--maxlen-out ML] [--n-iter-processes N_ITER_PROCESSES]
                    [--preprocess-conf [PREPROCESS_CONF]]
                    [--opt {adadelta,adam,noam}] [--accum-grad ACCUM_GRAD]
                    [--eps EPS] [--eps-decay EPS_DECAY]
                    [--weight-decay WEIGHT_DECAY] [--criterion {loss,acc}]
                    [--threshold THRESHOLD] [--epochs EPOCHS]
                    [--early-stop-criterion [EARLY_STOP_CRITERION]]
                    [--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
                    [--num-save-attention NUM_SAVE_ATTENTION]
                    [--grad-noise GRAD_NOISE] [--num-spkrs {1,2}] [--spa]
                    [--elayers-sd ELAYERS_SD]
                    [--context-residual [CONTEXT_RESIDUAL]]
                    [--replace-sos [REPLACE_SOS]] [--enc-init ENC_INIT]
                    [--enc-init-mods ENC_INIT_MODS] [--dec-init DEC_INIT]
                    [--dec-init-mods DEC_INIT_MODS]
                    [--use-frontend USE_FRONTEND] [--use-wpe USE_WPE]
                    [--wtype {lstm,blstm,lstmp,blstmp,vgglstmp,vggblstmp,vgglstm,vggblstm,gru,bgru,grup,bgrup,vgggrup,vggbgrup,vgggru,vggbgru}]
                    [--wlayers WLAYERS] [--wunits WUNITS] [--wprojs WPROJS]
                    [--wdropout-rate WDROPOUT_RATE] [--wpe-taps WPE_TAPS]
                    [--wpe-delay WPE_DELAY]
                    [--use-dnn-mask-for-wpe USE_DNN_MASK_FOR_WPE]
                    [--use-beamformer USE_BEAMFORMER]
                    [--btype {lstm,blstm,lstmp,blstmp,vgglstmp,vggblstmp,vgglstm,vggblstm,gru,bgru,grup,bgrup,vgggrup,vggbgrup,vgggru,vggbgru}]
                    [--blayers BLAYERS] [--bunits BUNITS] [--bprojs BPROJS]
                    [--badim BADIM] [--ref-channel REF_CHANNEL]
                    [--bdropout-rate BDROPOUT_RATE] [--stats-file STATS_FILE]
                    [--apply-uttmvn APPLY_UTTMVN]
                    [--uttmvn-norm-means UTTMVN_NORM_MEANS]
                    [--uttmvn-norm-vars UTTMVN_NORM_VARS]
                    [--fbank-fs FBANK_FS] [--n-mels N_MELS]
                    [--fbank-fmin FBANK_FMIN] [--fbank-fmax FBANK_FMAX]

Named Arguments

--config

config file path

--config2

second config file path that overwrites the settings in --config.

--config3

third config file path that overwrites the settings in --config and --config2.

--ngpu

Number of GPUs. If not given, use all visible devices

--train-dtype

Possible choices: float16, float32, float64, O0, O1, O2, O3

Data type for training (only pytorch backend). O0,O1,.. flags require apex. See https://nvidia.github.io/apex/amp.html#opt-levels

Default: “float32”

--backend

Possible choices: chainer, pytorch

Backend library

Default: “chainer”

--outdir

Output directory

--debugmode

Debugmode

Default: 1

--dict

Dictionary

--seed

Random seed

Default: 1

--debugdir

Output directory for debugging

--resume, -r

Resume the training from snapshot

Default: “”

--minibatches, -N

Process only N minibatches (for debug)

Default: -1

--verbose, -V

Verbose option

Default: 0

--tensorboard-dir

Tensorboard log dir path

--report-interval-iters

Report interval iterations

Default: 100

--train-json

Filename of train label data (json)

--valid-json

Filename of validation label data (json)

--model-module

Module that defines the model (default: espnet.nets.xxx_backend.e2e_asr:E2E)

--ctc_type

Possible choices: builtin, warpctc

Type of CTC implementation to calculate loss.

Default: “warpctc”

--mtlalpha

Multitask learning coefficient, alpha: alpha*ctc_loss + (1-alpha)*att_loss

Default: 0.5
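
The formula given for --mtlalpha can be written out directly. The sketch below uses plain floats for the two losses; multitask_loss is a hypothetical helper name.

```python
def multitask_loss(ctc_loss, att_loss, mtlalpha=0.5):
    """Combine losses as alpha * ctc_loss + (1 - alpha) * att_loss."""
    if not 0.0 <= mtlalpha <= 1.0:
        raise ValueError("mtlalpha must be in [0, 1]")
    return mtlalpha * ctc_loss + (1.0 - mtlalpha) * att_loss


print(multitask_loss(2.0, 4.0))                # 3.0
print(multitask_loss(2.0, 4.0, mtlalpha=0.0))  # 4.0 (attention loss only)
print(multitask_loss(2.0, 4.0, mtlalpha=1.0))  # 2.0 (CTC loss only)
```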

--lsm-type

Possible choices: , unigram

Apply label smoothing with a specified distribution type

Default: “”

--lsm-weight

Label smoothing weight

Default: 0.0

--report-cer

Compute CER on development set

Default: False

--report-wer

Compute WER on development set

Default: False

--nbest

Output N-best hypotheses

Default: 1

--beam-size

Beam size

Default: 4

--penalty

Insertion penalty

Default: 0.0

--maxlenratio

Input length ratio to obtain max output length. If maxlenratio=0.0 (default), it uses an end-detect function to automatically find maximum hypothesis lengths

Default: 0.0

--minlenratio

Input length ratio to obtain min output length

Default: 0.0

--ctc-weight

CTC weight in joint decoding

Default: 0.3

--rnnlm

RNNLM model file to read

--rnnlm-conf

RNNLM model config file to read

--lm-weight

RNNLM weight.

Default: 0.1

--sym-space

Space symbol

Default: “<space>”

--sym-blank

Blank symbol

Default: “<blank>”

--sortagrad

How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs

Default: 0

--batch-count

Possible choices: auto, seq, bin, frame

How to count batch_size. The default (auto) infers the counting method from the other batch arguments.

Default: “auto”

--batch-size, --batch-seqs, -b

Maximum seqs in a minibatch (0 to disable)

Default: 0

--batch-bins

Maximum bins in a minibatch (0 to disable)

Default: 0

--batch-frames-in

Maximum input frames in a minibatch (0 to disable)

Default: 0

--batch-frames-out

Maximum output frames in a minibatch (0 to disable)

Default: 0

--batch-frames-inout

Maximum input+output frames in a minibatch (0 to disable)

Default: 0

--maxlen-in, --batch-seq-maxlen-in

When --batch-count=seq, batch size is reduced if the input sequence length > ML.

Default: 800

--maxlen-out, --batch-seq-maxlen-out

When --batch-count=seq, batch size is reduced if the output sequence length > ML

Default: 150
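
The reference says only that the batch size is reduced when a sequence exceeds ML; it does not give the rule. The sketch below assumes a proportional reduction, where the batch shrinks by how many times the longest sequence exceeds either limit. The function name and the min_batch_size parameter are hypothetical.

```python
def reduced_batch_size(batch_size, ilen, olen,
                       maxlen_in=800, maxlen_out=150, min_batch_size=1):
    """Shrink batch_size for long sequences (assumed proportional rule)."""
    # How many whole multiples of the limits the sequence lengths reach.
    factor = max(ilen // maxlen_in, olen // maxlen_out)
    return max(min_batch_size, batch_size // (1 + factor))


print(reduced_batch_size(30, ilen=700, olen=100))   # 30, within both limits
print(reduced_batch_size(30, ilen=1700, olen=100))  # 10, input is 2x the limit
print(reduced_batch_size(30, ilen=100, olen=450))   # 7, output is 3x the limit
```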

--n-iter-processes

Number of processes of iterator

Default: 0

--preprocess-conf

The configuration file for the pre-processing

--opt

Possible choices: adadelta, adam, noam

Optimizer

Default: “adadelta”

--accum-grad

Number of gradient accumulation steps

Default: 1

--eps

Epsilon constant for optimizer

Default: 1e-08

--eps-decay

Decaying ratio of epsilon

Default: 0.01

--weight-decay

Weight decay ratio

Default: 0.0

--criterion

Possible choices: loss, acc

Criterion to perform epsilon decay

Default: “acc”

--threshold

Threshold to stop iteration

Default: 0.0001

--epochs, -e

Maximum number of epochs

Default: 30

--early-stop-criterion

Value to monitor to trigger an early stopping of the training

Default: “validation/main/acc”

--patience

Number of epochs to wait without improvement before stopping the training

Default: 3
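
How --patience interacts with the monitored value can be sketched as a simple counter over per-epoch scores. This is an illustration under assumptions: the scores are made up, higher is treated as better (as for an accuracy criterion), and stop_epoch is a hypothetical helper.

```python
def stop_epoch(scores, patience=3, higher_is_better=True):
    """Return the 1-based epoch at which training would stop, or None."""
    best = None
    waited = 0
    for epoch, score in enumerate(scores, start=1):
        if best is None:
            improved = True
        elif higher_is_better:
            improved = score > best
        else:
            improved = score < best
        if improved:
            best = score
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


# Accuracy peaks at epoch 2, then fails to improve for three epochs.
print(stop_epoch([0.50, 0.60, 0.59, 0.58, 0.57, 0.70]))  # 5
```

For a loss-based criterion, pass higher_is_better=False so that a decrease counts as an improvement.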

--grad-clip

Gradient norm threshold to clip

Default: 5
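
Clipping by gradient norm usually means rescaling the whole gradient when its L2 norm exceeds the threshold. The pure-Python sketch below shows that idea on a flat list of numbers; it is not the backend implementation, and clip_by_global_norm is a hypothetical name.

```python
import math


def clip_by_global_norm(grads, threshold=5.0):
    """Rescale grads so that their L2 norm does not exceed threshold."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= threshold:
        return list(grads)
    scale = threshold / norm
    return [g * scale for g in grads]


print(clip_by_global_norm([3.0, 4.0]))  # [3.0, 4.0], norm 5 is within the limit
print(clip_by_global_norm([6.0, 8.0]))  # [3.0, 4.0], norm 10 is halved
```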

--num-save-attention

Number of samples of attention to be saved

Default: 3

--grad-noise

Flag to enable noise injection into gradients during training

Default: False

--num-spkrs

Possible choices: 1, 2

Number of speakers in the speech.

Default: 1

--spa

Enable speaker parallel attention.

Default: False

--elayers-sd

Number of encoder layers for the speaker-differentiating part (multi-speaker ASR mode only)

Default: 4

--context-residual

Flag to enable the context vector residual in the decoder network

Default: False

--replace-sos

Replace <sos> in the decoder with a target language ID (the first token in the target sequence)

Default: False

--enc-init

Pre-trained ASR model to initialize encoder.

--enc-init-mods

List of encoder modules to initialize, separated by a comma.

Default: enc.enc.

--dec-init

Pre-trained ASR, MT or LM model to initialize decoder.

--dec-init-mods

List of decoder modules to initialize, separated by a comma.

Default: att., dec.

--use-frontend

Flag to enable the frontend system.

Default: False

--use-wpe

Apply Weighted Prediction Error

Default: False

--wtype

Possible choices: lstm, blstm, lstmp, blstmp, vgglstmp, vggblstmp, vgglstm, vggblstm, gru, bgru, grup, bgrup, vgggrup, vggbgrup, vgggru, vggbgru

Type of encoder network architecture of the mask estimator for WPE.

Default: “blstmp”

--wlayers

Default: 2

--wunits

Default: 300

--wprojs

Default: 300

--wdropout-rate

Default: 0.0

--wpe-taps

Default: 5

--wpe-delay

Default: 3

--use-dnn-mask-for-wpe

Use DNN to estimate the power spectrogram. This option is experimental.

Default: False

--use-beamformer

Default: True

--btype

Possible choices: lstm, blstm, lstmp, blstmp, vgglstmp, vggblstmp, vgglstm, vggblstm, gru, bgru, grup, bgrup, vgggrup, vggbgrup, vgggru, vggbgru

Type of encoder network architecture of the mask estimator for Beamformer.

Default: “blstmp”

--blayers

Default: 2

--bunits

Default: 300

--bprojs

Default: 300

--badim

Default: 320

--ref-channel

The reference channel used for beamformer. By default, the channel is estimated by DNN.

Default: -1

--bdropout-rate

Default: 0.0

--stats-file

The stats file for the feature normalization

--apply-uttmvn

Apply utterance level mean variance normalization.

Default: True

--uttmvn-norm-means

Default: True

--uttmvn-norm-vars

Default: False

--fbank-fs

The sample frequency used for the mel-fbank creation.

Default: 16000

--n-mels

The number of mel-frequency bins.

Default: 80

--fbank-fmin

Default: 0.0

--fbank-fmax

lm_train.py

Train a new language model on one CPU or one GPU

usage: lm_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
                   [--config3 CONFIG3] [--ngpu NGPU]
                   [--train-dtype {float16,float32,float64,O0,O1,O2,O3}]
                   [--backend {chainer,pytorch}] --outdir OUTDIR
                   [--debugmode DEBUGMODE] --dict DICT [--seed SEED]
                   [--resume [RESUME]] [--verbose VERBOSE]
                   [--tensorboard-dir [TENSORBOARD_DIR]]
                   [--report-interval-iters REPORT_INTERVAL_ITERS]
                   --train-label TRAIN_LABEL --valid-label VALID_LABEL
                   [--test-label TEST_LABEL] [--dump-hdf5-path DUMP_HDF5_PATH]
                   [--opt {sgd,adam}] [--sortagrad [SORTAGRAD]]
                   [--batchsize BATCHSIZE] [--epoch EPOCH]
                   [--early-stop-criterion [EARLY_STOP_CRITERION]]
                   [--patience [PATIENCE]] [--gradclip GRADCLIP]
                   [--maxlen MAXLEN] [--model-module MODEL_MODULE]

Named Arguments

--config

config file path

--config2

second config file path that overwrites the settings in --config.

--config3

third config file path that overwrites the settings in --config and --config2.

--ngpu

Number of GPUs. If not given, use all visible devices

--train-dtype

Possible choices: float16, float32, float64, O0, O1, O2, O3

Data type for training (only pytorch backend). O0,O1,.. flags require apex. See https://nvidia.github.io/apex/amp.html#opt-levels

Default: “float32”

--backend

Possible choices: chainer, pytorch

Backend library

Default: “chainer”

--outdir

Output directory

--debugmode

Debugmode

Default: 1

--dict

Dictionary

--seed

Random seed

Default: 1

--resume, -r

Resume the training from snapshot

Default: “”

--verbose, -V

Verbose option

Default: 0

--tensorboard-dir

Tensorboard log dir path

--report-interval-iters

Report interval iterations

Default: 100

--train-label

Filename of train label data

--valid-label

Filename of validation label data

--test-label

Filename of test label data

--dump-hdf5-path

Path to dump a preprocessed dataset as hdf5

--opt

Possible choices: sgd, adam

Optimizer

Default: “sgd”

--sortagrad

How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs

Default: 0

--batchsize, -b

Number of examples in each mini-batch

Default: 300

--epoch, -e

Number of sweeps over the dataset to train

Default: 20

--early-stop-criterion

Value to monitor to trigger an early stopping of the training

Default: “validation/main/loss”

--patience

Number of epochs to wait without improvement before stopping the training

Default: 3

--gradclip, -c

Gradient norm threshold to clip

Default: 5

--maxlen

Batch size is reduced if the input sequence length exceeds MAXLEN

Default: 40

--model-module

Module that defines the model (default: espnet.nets.xxx_backend.lm.default:DefaultRNNLM)

Default: “default”

mt_recog.py

Translate text using a machine translation (MT) model on one CPU or GPU

usage: mt_recog.py [-h] [--config CONFIG] [--config2 CONFIG2]
                   [--config3 CONFIG3] [--ngpu NGPU]
                   [--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
                   [--seed SEED] [--verbose VERBOSE] [--batchsize BATCHSIZE]
                   [--preprocess-conf PREPROCESS_CONF]
                   [--recog-json RECOG_JSON] --result-label RESULT_LABEL
                   --model MODEL [--model-conf MODEL_CONF] [--nbest NBEST]
                   [--beam-size BEAM_SIZE] [--penalty PENALTY]
                   [--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
                   [--ctc-weight CTC_WEIGHT] [--rnnlm RNNLM]
                   [--rnnlm-conf RNNLM_CONF] [--lm-weight LM_WEIGHT]
                   [--tgt-lang TGT_LANG]

Named Arguments

--config

Config file path

--config2

Second config file path that overwrites the settings in --config

--config3

Third config file path that overwrites the settings in --config and --config2

--ngpu

Number of GPUs

Default: 0

--backend

Possible choices: chainer, pytorch

Backend library

Default: “chainer”

--debugmode

Debugmode

Default: 1

--seed

Random seed

Default: 1

--verbose, -V

Verbose option

Default: 1

--batchsize

Batch size for beam search (0: means no batch processing)

Default: 1

--preprocess-conf

The configuration file for the pre-processing

--recog-json

Filename of recognition data (json)

--result-label

Filename of result label data (json)

--model

Model file parameters to read

--model-conf

Model config file

--nbest

Output N-best hypotheses

Default: 1

--beam-size

Beam size

Default: 1

--penalty

Insertion penalty

Default: 0.1

--maxlenratio

Input length ratio to obtain max output length. If maxlenratio=0.0, it uses an end-detect function to automatically find maximum hypothesis lengths

Default: 3.0

--minlenratio

Input length ratio to obtain min output length

Default: 0.0

--ctc-weight

dummy

Default: 0.0

--rnnlm

RNNLM model file to read

--rnnlm-conf

RNNLM model config file to read

--lm-weight

RNNLM weight

Default: 0.0

--tgt-lang

target language ID (e.g., <en>, <de>, <fr> etc.)

Default: False

mt_train.py

Train a neural machine translation (NMT) model on one CPU, one or multiple GPUs

usage: mt_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
                   [--config3 CONFIG3] [--ngpu NGPU]
                   [--backend {chainer,pytorch}] --outdir OUTDIR
                   [--debugmode DEBUGMODE] --dict-tgt DICT_TGT
                   [--dict-src [DICT_SRC]] [--seed SEED] [--debugdir DEBUGDIR]
                   [--resume [RESUME]] [--minibatches MINIBATCHES]
                   [--verbose VERBOSE] [--tensorboard-dir [TENSORBOARD_DIR]]
                   [--report-interval-iters REPORT_INTERVAL_ITERS]
                   [--train-json TRAIN_JSON] [--valid-json VALID_JSON]
                   [--model-module MODEL_MODULE]
                   [--etype {lstm,blstm,lstmp,blstmp,gru,bgru,grup,bgrup}]
                   [--elayers ELAYERS] [--eunits EUNITS] [--eprojs EPROJS]
                   [--subsample SUBSAMPLE]
                   [--atype {noatt,dot,add,location,coverage,coverage_location,location2d,location_recurrent,multi_head_dot,multi_head_add,multi_head_loc,multi_head_multi_res_loc}]
                   [--adim ADIM] [--awin AWIN] [--aheads AHEADS]
                   [--aconv-chans ACONV_CHANS] [--aconv-filts ACONV_FILTS]
                   [--dtype {lstm,gru}] [--dlayers DLAYERS] [--dunits DUNITS]
                   [--lsm-type [{,unigram}]] [--lsm-weight LSM_WEIGHT]
                   [--sampling-probability SAMPLING_PROBABILITY]
                   [--nbest NBEST] [--beam-size BEAM_SIZE] [--penalty PENALTY]
                   [--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
                   [--rnnlm RNNLM] [--rnnlm-conf RNNLM_CONF]
                   [--lm-weight LM_WEIGHT] [--sym-space SYM_SPACE]
                   [--sym-blank SYM_BLANK] [--dropout-rate DROPOUT_RATE]
                   [--dropout-rate-decoder DROPOUT_RATE_DECODER]
                   [--sortagrad [SORTAGRAD]]
                   [--batch-count {auto,seq,bin,frame}]
                   [--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
                   [--batch-frames-in BATCH_FRAMES_IN]
                   [--batch-frames-out BATCH_FRAMES_OUT]
                   [--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
                   [--maxlen-out ML] [--n-iter-processes N_ITER_PROCESSES]
                   [--preprocess-conf PREPROCESS_CONF]
                   [--opt {adadelta,adam,noam}] [--accum-grad ACCUM_GRAD]
                   [--eps EPS] [--eps-decay EPS_DECAY]
                   [--weight-decay WEIGHT_DECAY] [--criterion {loss,acc}]
                   [--threshold THRESHOLD] [--epochs EPOCHS]
                   [--early-stop-criterion [EARLY_STOP_CRITERION]]
                   [--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
                   [--num-save-attention NUM_SAVE_ATTENTION]
                   [--context-residual [CONTEXT_RESIDUAL]]
                   [--replace-sos [REPLACE_SOS]]

Named Arguments

--config

config file path

--config2

second config file path that overwrites the settings in --config.

--config3

third config file path that overwrites the settings in --config and --config2.

--ngpu

Number of GPUs

Default: 0

--backend

Possible choices: chainer, pytorch

Backend library

Default: “chainer”

--outdir

Output directory

--debugmode

Debugmode

Default: 1

--dict-tgt

Dictionary for target language

--dict-src

Dictionary for source language. By default, the dictionary is shared between the source and target languages.

Default: “”

--seed

Random seed

Default: 1

--debugdir

Output directory for debugging

--resume, -r

Resume the training from snapshot

Default: “”

--minibatches, -N

Process only N minibatches (for debug)

Default: -1

--verbose, -V

Verbose option

Default: 0

--tensorboard-dir

Tensorboard log dir path

--report-interval-iters

Report interval iterations

Default: 100

--train-json

Filename of train label data (json)

--valid-json

Filename of validation label data (json)

--model-module

Module that defines the model (default: espnet.nets.xxx_backend.e2e_mt:E2E)

--etype

Possible choices: lstm, blstm, lstmp, blstmp, gru, bgru, grup, bgrup

Type of encoder network architecture (VGG is not supported for NMT)

Default: “blstmp”

--elayers

Number of encoder layers

Default: 4

--eunits, -u

Number of encoder hidden units

Default: 1024

--eprojs

Number of encoder projection units

Default: 1024

--subsample

Subsampling of input frames: x_y_z means subsample every x frames at the 1st layer, every y frames at the 2nd layer, etc.

Default: “1”
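
The x_y_z notation for --subsample is an underscore-separated list of per-layer factors. A minimal parsing sketch follows; parse_subsample is a hypothetical helper, and the validation rule (factors must be at least 1) is an assumption.

```python
def parse_subsample(spec):
    """Split an 'x_y_z' string into per-layer subsampling factors."""
    factors = [int(part) for part in spec.split("_")]
    if any(f < 1 for f in factors):
        raise ValueError("subsampling factors must be >= 1")
    return factors


print(parse_subsample("1_2_2"))  # [1, 2, 2]
print(parse_subsample("1"))      # [1], the default: no subsampling
```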

--atype

Possible choices: noatt, dot, add, location, coverage, coverage_location, location2d, location_recurrent, multi_head_dot, multi_head_add, multi_head_loc, multi_head_multi_res_loc

Type of attention architecture

Default: “dot”

--adim

Number of attention transformation dimensions

Default: 1024

--awin

Window size for location2d attention

Default: 5

--aheads

Number of heads for multi head attention

Default: 4

--aconv-chans

Number of attention convolution channels (negative value indicates no location-aware attention)

Default: -1

--aconv-filts

Number of attention convolution filters (negative value indicates no location-aware attention)

Default: 100

--dtype

Possible choices: lstm, gru

Type of decoder network architecture

Default: “lstm”

--dlayers

Number of decoder layers

Default: 1

--dunits

Number of decoder hidden units

Default: 1024

--lsm-type

Possible choices: , unigram

Apply label smoothing with a specified distribution type

Default: “”

--lsm-weight

Label smoothing weight

Default: 0.0

--sampling-probability

Ratio of predicted labels fed back to decoder

Default: 0.0

--nbest

Output N-best hypotheses

Default: 1

--beam-size

Beam size

Default: 4

--penalty

Insertion penalty

Default: 0.0

--maxlenratio

Input length ratio to obtain max output length. If maxlenratio=0.0 (default), it uses an end-detect function to automatically find maximum hypothesis lengths

Default: 0.0

--minlenratio

Input length ratio to obtain min output length

Default: 0.0

--rnnlm

RNNLM model file to read

--rnnlm-conf

RNNLM model config file to read

--lm-weight

RNNLM weight.

Default: 0.0

--sym-space

Space symbol

Default: “<space>”

--sym-blank

Blank symbol

Default: “<blank>”

--dropout-rate

Dropout rate for the encoder

Default: 0.0

--dropout-rate-decoder

Dropout rate for the decoder

Default: 0.0

--sortagrad

How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs

Default: 0

--batch-count

Possible choices: auto, seq, bin, frame

How to count batch_size. The default (auto) infers the counting method from the other batch arguments.

Default: “auto”

--batch-size, --batch-seqs, -b

Maximum seqs in a minibatch (0 to disable)

Default: 0

--batch-bins

Maximum bins in a minibatch (0 to disable)

Default: 0

--batch-frames-in

Maximum input frames in a minibatch (0 to disable)

Default: 0

--batch-frames-out

Maximum output frames in a minibatch (0 to disable)

Default: 0

--batch-frames-inout

Maximum input+output frames in a minibatch (0 to disable)

Default: 0

--maxlen-in, --batch-seq-maxlen-in

When --batch-count=seq, batch size is reduced if the input sequence length > ML.

Default: 100

--maxlen-out, --batch-seq-maxlen-out

When --batch-count=seq, batch size is reduced if the output sequence length > ML

Default: 100

--n-iter-processes

Number of processes of iterator

Default: 0

--preprocess-conf

The configuration file for the pre-processing

--opt

Possible choices: adadelta, adam, noam

Optimizer

Default: “adadelta”

--accum-grad

Number of gradient accumulation steps

Default: 1

--eps

Epsilon constant for optimizer

Default: 1e-08

--eps-decay

Decaying ratio of epsilon

Default: 0.01

--weight-decay

Weight decay ratio

Default: 0.0

--criterion

Possible choices: loss, acc

Criterion to perform epsilon decay

Default: “acc”

--threshold

Threshold to stop iteration

Default: 0.0001

--epochs, -e

Maximum number of epochs

Default: 30

--early-stop-criterion

Value to monitor to trigger an early stopping of the training

Default: “validation/main/acc”

--patience

Number of epochs to wait without improvement before stopping the training

Default: 3

--grad-clip

Gradient norm threshold to clip

Default: 5

--num-save-attention

Number of samples of attention to be saved

Default: 3

--context-residual

Default: “”

--replace-sos

Replace <sos> in the decoder with a target language ID (the first token in the target sequence)

Default: False

tts_decode.py

Synthesize speech from text using a TTS model on one CPU

usage: tts_decode.py [-h] [--config CONFIG] [--config2 CONFIG2]
                     [--config3 CONFIG3] [--ngpu NGPU]
                     [--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
                     [--seed SEED] --out OUT [--verbose VERBOSE]
                     [--preprocess-conf PREPROCESS_CONF] --json JSON --model
                     MODEL [--model-conf MODEL_CONF]
                     [--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
                     [--threshold THRESHOLD]

Named Arguments

--config

config file path

--config2

second config file path that overwrites the settings in --config.

--config3

third config file path that overwrites the settings in --config and --config2.
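The override order of --config, --config2 and --config3 can be pictured as successive dictionary updates, where later files win. This is only a sketch: merge_configs and the setting values are hypothetical, and the tool's actual config-file parsing is not shown here.

```python
def merge_configs(*configs):
    """Merge settings so that later configs overwrite earlier ones."""
    merged = {}
    for config in configs:
        merged.update(config)
    return merged


# Hypothetical contents of the three config files.
config = {"backend": "pytorch", "maxlenratio": 5.0, "threshold": 0.5}
config2 = {"maxlenratio": 10.0}
config3 = {"threshold": 0.7}

print(merge_configs(config, config2, config3))
```

A layered setup like this lets a base file hold shared settings while small secondary files adjust individual values per experiment.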

--ngpu

Number of GPUs

Default: 0

--backend

Possible choices: chainer, pytorch

Backend library

Default: “pytorch”

--debugmode

Debug mode

Default: 1

--seed

Random seed

Default: 1

--out

Output filename

--verbose, -V

Verbose option

Default: 0

--preprocess-conf

The configuration file for the pre-processing

--json

Filename of train label data (json)

--model

Model file parameters to read

--model-conf

Model config file

--maxlenratio

Maximum length ratio in decoding

Default: 5

--minlenratio

Minimum length ratio in decoding

Default: 0

--threshold

Threshold value in decoding

Default: 0.5
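To show how --maxlenratio, --minlenratio and --threshold might interact, the sketch below decides how many output frames a decoder emits. It assumes a Tacotron 2-style decoder that predicts a stop probability at every step, and that both ratios are multiplied by the input length. The function name and the precise stopping rule are assumptions for illustration.

```python
def decode_length(stop_probs, input_length,
                  maxlenratio=5.0, minlenratio=0.0, threshold=0.5):
    """Return the number of frames generated before decoding stops."""
    maxlen = int(input_length * maxlenratio)
    minlen = int(input_length * minlenratio)
    for step, prob in enumerate(stop_probs, start=1):
        if step >= maxlen:
            # Hard upper limit, regardless of the stop probability.
            return step
        if step >= minlen and prob >= threshold:
            # Stop prediction accepted once the minimum length is reached.
            return step
    # Ran out of predictions (only possible in this toy setting).
    return min(len(stop_probs), maxlen)


probs = [0.1, 0.2, 0.7, 0.9]  # hypothetical per-step stop probabilities
print(decode_length(probs, input_length=2))
print(decode_length(probs, input_length=2, minlenratio=2.0))
print(decode_length(probs, input_length=2, maxlenratio=1.0))
```

Under these assumptions, a lower threshold ends synthesis earlier, while the two ratios guard against outputs that are implausibly short or that never terminate.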

tts_train.py

Train a new text-to-speech (TTS) model on one CPU, one or multiple GPUs

usage: tts_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
                    [--config3 CONFIG3] [--ngpu NGPU]
                    [--backend {chainer,pytorch}] --outdir OUTDIR
                    [--debugmode DEBUGMODE] [--seed SEED] [--resume [RESUME]]
                    [--minibatches MINIBATCHES] [--verbose VERBOSE]
                    [--tensorboard-dir [TENSORBOARD_DIR]]
                    [--save-interval-epochs SAVE_INTERVAL_EPOCHS]
                    [--report-interval-iters REPORT_INTERVAL_ITERS]
                    --train-json TRAIN_JSON --valid-json VALID_JSON
                    [--model-module MODEL_MODULE] [--sortagrad [SORTAGRAD]]
                    [--batch-sort-key [{shuffle,output,input}]]
                    [--batch-count {auto,seq,bin,frame}]
                    [--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
                    [--batch-frames-in BATCH_FRAMES_IN]
                    [--batch-frames-out BATCH_FRAMES_OUT]
                    [--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
                    [--maxlen-out ML]
                    [--num-iter-processes NUM_ITER_PROCESSES]
                    [--preprocess-conf PREPROCESS_CONF]
                    [--use-speaker-embedding USE_SPEAKER_EMBEDDING]
                    [--use-second-target USE_SECOND_TARGET]
                    [--opt {adam,noam}] [--accum-grad ACCUM_GRAD] [--lr LR]
                    [--eps EPS] [--weight-decay WEIGHT_DECAY]
                    [--epochs EPOCHS]
                    [--early-stop-criterion [EARLY_STOP_CRITERION]]
                    [--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
                    [--num-save-attention NUM_SAVE_ATTENTION]
                    [--keep-all-data-on-mem KEEP_ALL_DATA_ON_MEM]
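As a usage illustration, the snippet below assembles a tts_train.py command line containing the three required options (--outdir, --train-json, --valid-json) plus a few optional ones from the usage above. The helper name and all paths are hypothetical placeholders, and the command is only printed, not executed.

```python
import shlex


def build_tts_train_command(outdir, train_json, valid_json, ngpu=1, extra=()):
    """Build an argument list for tts_train.py (paths are placeholders)."""
    cmd = [
        "tts_train.py",
        "--ngpu", str(ngpu),
        "--outdir", outdir,
        "--train-json", train_json,
        "--valid-json", valid_json,
    ]
    cmd.extend(extra)  # any further options listed in the usage
    return cmd


command = build_tts_train_command(
    "exp/tts_demo/results",   # hypothetical output directory
    "dump/train/data.json",   # hypothetical training json
    "dump/dev/data.json",     # hypothetical validation json
    extra=["--batch-size", "32", "--epochs", "30"],
)
print(shlex.join(command))
```

The resulting list could be handed to subprocess.run in an environment where the ESPnet tools are on the PATH.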

Named Arguments

--config

config file path

--config2

second config file path that overwrites the settings in --config.

--config3

third config file path that overwrites the settings in --config and --config2.

--ngpu

Number of GPUs. If not given, use all visible devices

--backend

Possible choices: chainer, pytorch

Backend library

Default: “pytorch”

--outdir

Output directory

--debugmode

Debug mode

Default: 1

--seed

Random seed

Default: 1

--resume, -r

Resume the training from snapshot

Default: “”

--minibatches, -N

Process only N minibatches (for debug)

Default: -1

--verbose, -V

Verbose option

Default: 0

--tensorboard-dir

Tensorboard log directory path

--save-interval-epochs

Interval, in epochs, between saved snapshots

Default: 1

--report-interval-iters

Interval, in iterations, between progress reports

Default: 100

--train-json

Filename of training json

--valid-json

Filename of validation json

--model-module

Module that defines the model

Default: “espnet.nets.pytorch_backend.e2e_tts_tacotron2:Tacotron2”

--sortagrad

How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs

Default: 0
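SortaGrad is commonly described as presenting utterances in order of length during the first epochs before switching to the normal order. The sketch below illustrates the three cases of the --sortagrad value. It assumes epochs are counted from 0, that sorting is shortest first, and that the normal order is a random shuffle; the function name is made up for this example.

```python
import random


def epoch_order(lengths, epoch, sortagrad=0, seed=1):
    """Return utterance indices for one epoch (illustrative only)."""
    indices = list(range(len(lengths)))
    # -1 means every epoch is sorted; N > 0 means only the first N epochs.
    use_sorted = sortagrad == -1 or epoch < sortagrad
    if use_sorted:
        return sorted(indices, key=lambda i: lengths[i])
    # Otherwise fall back to a reproducible shuffle.
    random.Random(seed + epoch).shuffle(indices)
    return indices


lengths = [30, 10, 20]  # hypothetical utterance lengths in frames
print(epoch_order(lengths, epoch=0, sortagrad=1))   # sorted by length
print(epoch_order(lengths, epoch=5, sortagrad=-1))  # sorted in every epoch
```

With sortagrad=0, no epoch satisfies the condition, so the order is always shuffled in this sketch.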

--batch-sort-key

Possible choices: shuffle, output, input

Batch sorting key. “shuffle” only works with --batch-count “seq”.

Default: “shuffle”

--batch-count

Possible choices: auto, seq, bin, frame

How to count the batch size. The default (auto) infers the counting method from the other batch arguments.

Default: “auto”

--batch-size, --batch-seqs, -b

Maximum seqs in a minibatch (0 to disable)

Default: 0

--batch-bins

Maximum bins in a minibatch (0 to disable)

Default: 0

--batch-frames-in

Maximum input frames in a minibatch (0 to disable)

Default: 0

--batch-frames-out

Maximum output frames in a minibatch (0 to disable)

Default: 0

--batch-frames-inout

Maximum input+output frames in a minibatch (0 to disable)

Default: 0

--maxlen-in, --batch-seq-maxlen-in

When --batch-count=seq, the batch size is reduced if the input sequence length exceeds ML.

Default: 100

--maxlen-out, --batch-seq-maxlen-out

When --batch-count=seq, the batch size is reduced if the output sequence length exceeds ML.

Default: 200

--num-iter-processes

Number of processes used by the data iterator

Default: 0

--preprocess-conf

The configuration file for the pre-processing

--use-speaker-embedding

Whether to use speaker embedding

Default: False

--use-second-target

Whether to use second target

Default: False

--opt

Possible choices: adam, noam

Optimizer

Default: “adam”

--accum-grad

Number of gradient accumulation steps

Default: 1

--lr

Learning rate for optimizer

Default: 0.001

--eps

Epsilon for optimizer

Default: 1e-06

--weight-decay

Weight decay coefficient for optimizer

Default: 1e-06

--epochs, -e

Maximum number of epochs

Default: 30

--early-stop-criterion

Value to monitor to trigger an early stopping of the training

Default: “validation/main/loss”

--patience

Number of epochs to wait without improvement before stopping the training

Default: 3

--grad-clip

Gradient norm threshold to clip

Default: 1

--num-save-attention

Number of samples of attention to be saved

Default: 5

--keep-all-data-on-mem

Whether to keep all data on memory

Default: False