core tools¶
ESPnet provides several command-line tools for training and evaluating neural networks (NN) under espnet/bin:
asr_enhance.py: Enhance noisy speech for speech recognition
asr_recog.py: Transcribe text from speech using a speech recognition model on one CPU or GPU
asr_train.py: Train an automatic speech recognition (ASR) model on one CPU, one or multiple GPUs
lm_train.py: Train a new language model on one CPU or one GPU
mt_recog.py: Translate text using a machine translation (MT) model on one CPU or GPU
mt_train.py: Train a neural machine translation (NMT) model on one CPU, one or multiple GPUs
tts_decode.py: Synthesize speech from text using a TTS model on one CPU
tts_train.py: Train a new text-to-speech (TTS) model on one CPU, one or multiple GPUs
asr_enhance.py¶
Enhance noisy speech for speech recognition
usage: asr_enhance.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU]
[--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
[--seed SEED] [--verbose VERBOSE]
[--batchsize BATCHSIZE]
[--preprocess-conf PREPROCESS_CONF]
[--recog-json RECOG_JSON] --model MODEL
[--model-conf MODEL_CONF]
[--enh-wspecifier ENH_WSPECIFIER]
[--enh-filetype {mat,hdf5,sound.hdf5,sound}] [--fs FS]
[--keep-length KEEP_LENGTH] [--image-dir IMAGE_DIR]
[--num-images NUM_IMAGES] [--apply-istft APPLY_ISTFT]
[--istft-win-length ISTFT_WIN_LENGTH]
[--istft-n-shift ISTFT_N_SHIFT]
[--istft-window ISTFT_WINDOW]
Named Arguments¶
- --config
config file path
- --config2
second config file path that overwrites the settings in --config.
- --config3
third config file path that overwrites the settings in --config and --config2.
- --ngpu
Number of GPUs
Default: 0
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “chainer”
- --debugmode
Debugmode
Default: 1
- --seed
Random seed
Default: 1
- --verbose, -V
Verbose option
Default: 1
- --batchsize
Batch size for beam search (0 means no batch processing)
Default: 1
- --preprocess-conf
The configuration file for the pre-processing
- --recog-json
Filename of recognition data (json)
- --model
Model file parameters to read
- --model-conf
Model config file
- --enh-wspecifier
Specify how the enhanced speech is written out, e.g. ark,scp:outdir,wav.scp
- --enh-filetype
Possible choices: mat, hdf5, sound.hdf5, sound
Specify the file format for enhanced speech. “mat” is the matrix format in Kaldi
Default: “sound”
- --fs
The sample frequency
Default: 16000
- --keep-length
Adjust the output length to match with the input for enhanced speech
Default: True
- --image-dir
The directory saving the images.
- --num-images
The number of images files to be saved. If negative, all samples are to be saved.
Default: 20
- --apply-istft
Apply istft to the output from the network
Default: True
- --istft-win-length
The window length for istft. This option is ignored if stft is found in the preprocess-conf
Default: 512
- --istft-n-shift
The shift length (hop size) for istft. This option is ignored if stft is found in the preprocess-conf
Default: 256
- --istft-window
The window type for istft. This option is ignored if stft is found in the preprocess-conf
Default: “hann”
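The layering of --config, --config2, and --config3 described above can be pictured as successive dictionary updates. The sketch below only illustrates that precedence under this assumption; `merge_configs` and the sample settings are hypothetical and are not part of ESPnet, which resolves these files through its own argument parser.

```python
def merge_configs(*layers):
    """Merge setting dictionaries; later layers overwrite earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Hypothetical contents of three config files (keys mirror option names).
config = {"backend": "chainer", "fs": 16000, "istft-window": "hann"}
config2 = {"backend": "pytorch"}  # overwrites the value from --config
config3 = {"fs": 8000}            # overwrites --config and --config2

effective = merge_configs(config, config2, config3)
print(effective)
```

Settings that appear only in the first file ("istft-window" here) survive unchanged, while each later file wins for the keys it defines.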
asr_recog.py¶
Transcribe text from speech using a speech recognition model on one CPU or GPU
usage: asr_recog.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU]
[--dtype {float16,float32,float64}]
[--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
[--seed SEED] [--verbose VERBOSE] [--batchsize BATCHSIZE]
[--preprocess-conf PREPROCESS_CONF] [--api {v1,v2}]
[--recog-json RECOG_JSON] --result-label RESULT_LABEL
--model MODEL [--model-conf MODEL_CONF]
[--num-spkrs {1,2}] [--nbest NBEST]
[--beam-size BEAM_SIZE] [--penalty PENALTY]
[--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
[--ctc-weight CTC_WEIGHT]
[--ctc-window-margin CTC_WINDOW_MARGIN]
[--score-norm-transducer [SCORE_NORM_TRANSDUCER]]
[--rnnlm RNNLM] [--rnnlm-conf RNNLM_CONF]
[--word-rnnlm WORD_RNNLM]
[--word-rnnlm-conf WORD_RNNLM_CONF]
[--word-dict WORD_DICT] [--lm-weight LM_WEIGHT]
[--streaming-mode {window,segment}]
[--streaming-window STREAMING_WINDOW]
[--streaming-min-blank-dur STREAMING_MIN_BLANK_DUR]
[--streaming-onset-margin STREAMING_ONSET_MARGIN]
[--streaming-offset-margin STREAMING_OFFSET_MARGIN]
[--tgt-lang TGT_LANG]
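The usage synopsis above marks --result-label and --model as the only required options. As a rough illustration, the following sketch assembles such a command line in Python without running it. All file names are hypothetical placeholders, and only flags that appear in the synopsis are used.

```python
import shlex

# Hypothetical file names; every flag below is taken from the usage synopsis.
cmd = [
    "asr_recog.py",
    "--backend", "pytorch",
    "--ngpu", "0",
    "--recog-json", "data.json",
    "--result-label", "result.json",  # required
    "--model", "model.acc.best",      # required
    "--beam-size", "10",
    "--ctc-weight", "0.3",
]

# shlex.join quotes arguments where needed, so the string is shell-safe.
command_line = shlex.join(cmd)
print(command_line)
```

Keeping the arguments in a list makes it easy to pass them to `subprocess.run(cmd)` later instead of going through a shell.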
Named Arguments¶
- --config
Config file path
- --config2
Second config file path that overwrites the settings in --config
- --config3
Third config file path that overwrites the settings in --config and --config2
- --ngpu
Number of GPUs
Default: 0
- --dtype
Possible choices: float16, float32, float64
Float precision (only available in --api v2)
Default: “float32”
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “chainer”
- --debugmode
Debugmode
Default: 1
- --seed
Random seed
Default: 1
- --verbose, -V
Verbose option
Default: 1
- --batchsize
Batch size for beam search (0 means no batch processing)
Default: 1
- --preprocess-conf
The configuration file for the pre-processing
- --api
Possible choices: v1, v2
Beam search API. v1: default API; it only supports the ASRInterface.recognize method and DefaultRNNLM. v2: experimental API; it supports any model that implements ScorerInterface.
Default: “v1”
- --recog-json
Filename of recognition data (json)
- --result-label
Filename of result label data (json)
- --model
Model file parameters to read
- --model-conf
Model config file
- --num-spkrs
Possible choices: 1, 2
Number of speakers in the speech
Default: 1
- --nbest
Output N-best hypotheses
Default: 1
- --beam-size
Beam size
Default: 1
- --penalty
Insertion penalty
Default: 0.0
- --maxlenratio
Input length ratio to obtain max output length. If maxlenratio=0.0 (default), it uses an end-detect function to automatically find maximum hypothesis lengths
Default: 0.0
- --minlenratio
Input length ratio to obtain min output length
Default: 0.0
- --ctc-weight
CTC weight in joint decoding
Default: 0.0
- --ctc-window-margin
Use CTC window with margin parameter to accelerate CTC/attention decoding, especially on GPU. A smaller margin makes decoding faster, but may increase search errors. If margin=0 (default), this function is disabled
Default: 0
- --score-norm-transducer
Normalize transducer scores by length
Default: True
- --rnnlm
RNNLM model file to read
- --rnnlm-conf
RNNLM model config file to read
- --word-rnnlm
Word RNNLM model file to read
- --word-rnnlm-conf
Word RNNLM model config file to read
- --word-dict
Word list to read
- --lm-weight
RNNLM weight
Default: 0.1
- --streaming-mode
Possible choices: window, segment
Use streaming recognizer for inference. --batchsize must be set to 0 to enable this mode
- --streaming-window
Window size
Default: 10
- --streaming-min-blank-dur
Minimum blank duration threshold
Default: 10
- --streaming-onset-margin
Onset margin
Default: 1
- --streaming-offset-margin
Offset margin
Default: 1
- --tgt-lang
target language ID (e.g., <en>, <de>, <fr> etc.)
Default: False
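The --ctc-weight, --lm-weight, and --penalty options above each weight one term of a hypothesis score during beam search. A common formulation of joint CTC/attention decoding interpolates the log-probabilities as sketched below. This is an illustration under that assumption, not ESPnet's exact code, and `joint_score` and its arguments are hypothetical names.

```python
def joint_score(att_logp, ctc_logp, lm_logp, length,
                ctc_weight=0.0, lm_weight=0.1, penalty=0.0):
    """Combine per-hypothesis log-probabilities into one beam-search score.

    Assumed form: (1 - w) * attention + w * CTC + lm_weight * LM,
    plus an insertion bonus proportional to the hypothesis length.
    """
    score = (1.0 - ctc_weight) * att_logp + ctc_weight * ctc_logp
    score += lm_weight * lm_logp
    score += penalty * length
    return score


# With the documented defaults, the CTC term is switched off.
s_default = joint_score(-4.0, -6.0, -10.0, length=5)
# With ctc_weight=0.5, attention and CTC contribute equally.
s_joint = joint_score(-4.0, -6.0, -10.0, length=5, ctc_weight=0.5)
print(s_default, s_joint)
```

A positive penalty favors longer hypotheses, which counteracts the tendency of summed log-probabilities to prefer short outputs.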
asr_train.py¶
Train an automatic speech recognition (ASR) model on one CPU, one or multiple GPUs
usage: asr_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU]
[--train-dtype {float16,float32,float64,O0,O1,O2,O3}]
[--backend {chainer,pytorch}] --outdir OUTDIR
[--debugmode DEBUGMODE] --dict DICT [--seed SEED]
[--debugdir DEBUGDIR] [--resume [RESUME]]
[--minibatches MINIBATCHES] [--verbose VERBOSE]
[--tensorboard-dir [TENSORBOARD_DIR]]
[--report-interval-iters REPORT_INTERVAL_ITERS]
[--train-json TRAIN_JSON] [--valid-json VALID_JSON]
[--model-module MODEL_MODULE]
[--ctc_type {builtin,warpctc}] [--mtlalpha MTLALPHA]
[--lsm-type [{,unigram}]] [--lsm-weight LSM_WEIGHT]
[--report-cer] [--report-wer] [--nbest NBEST]
[--beam-size BEAM_SIZE] [--penalty PENALTY]
[--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
[--ctc-weight CTC_WEIGHT] [--rnnlm RNNLM]
[--rnnlm-conf RNNLM_CONF] [--lm-weight LM_WEIGHT]
[--sym-space SYM_SPACE] [--sym-blank SYM_BLANK]
[--sortagrad [SORTAGRAD]]
[--batch-count {auto,seq,bin,frame}]
[--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
[--batch-frames-in BATCH_FRAMES_IN]
[--batch-frames-out BATCH_FRAMES_OUT]
[--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
[--maxlen-out ML] [--n-iter-processes N_ITER_PROCESSES]
[--preprocess-conf [PREPROCESS_CONF]]
[--opt {adadelta,adam,noam}] [--accum-grad ACCUM_GRAD]
[--eps EPS] [--eps-decay EPS_DECAY]
[--weight-decay WEIGHT_DECAY] [--criterion {loss,acc}]
[--threshold THRESHOLD] [--epochs EPOCHS]
[--early-stop-criterion [EARLY_STOP_CRITERION]]
[--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
[--num-save-attention NUM_SAVE_ATTENTION]
[--grad-noise GRAD_NOISE] [--num-spkrs {1,2}] [--spa]
[--elayers-sd ELAYERS_SD]
[--context-residual [CONTEXT_RESIDUAL]]
[--replace-sos [REPLACE_SOS]] [--enc-init ENC_INIT]
[--enc-init-mods ENC_INIT_MODS] [--dec-init DEC_INIT]
[--dec-init-mods DEC_INIT_MODS]
[--use-frontend USE_FRONTEND] [--use-wpe USE_WPE]
[--wtype {lstm,blstm,lstmp,blstmp,vgglstmp,vggblstmp,vgglstm,vggblstm,gru,bgru,grup,bgrup,vgggrup,vggbgrup,vgggru,vggbgru}]
[--wlayers WLAYERS] [--wunits WUNITS] [--wprojs WPROJS]
[--wdropout-rate WDROPOUT_RATE] [--wpe-taps WPE_TAPS]
[--wpe-delay WPE_DELAY]
[--use-dnn-mask-for-wpe USE_DNN_MASK_FOR_WPE]
[--use-beamformer USE_BEAMFORMER]
[--btype {lstm,blstm,lstmp,blstmp,vgglstmp,vggblstmp,vgglstm,vggblstm,gru,bgru,grup,bgrup,vgggrup,vggbgrup,vgggru,vggbgru}]
[--blayers BLAYERS] [--bunits BUNITS] [--bprojs BPROJS]
[--badim BADIM] [--ref-channel REF_CHANNEL]
[--bdropout-rate BDROPOUT_RATE] [--stats-file STATS_FILE]
[--apply-uttmvn APPLY_UTTMVN]
[--uttmvn-norm-means UTTMVN_NORM_MEANS]
[--uttmvn-norm-vars UTTMVN_NORM_VARS]
[--fbank-fs FBANK_FS] [--n-mels N_MELS]
[--fbank-fmin FBANK_FMIN] [--fbank-fmax FBANK_FMAX]
Named Arguments¶
- --config
config file path
- --config2
second config file path that overwrites the settings in --config.
- --config3
third config file path that overwrites the settings in --config and --config2.
- --ngpu
Number of GPUs. If not given, use all visible devices
- --train-dtype
Possible choices: float16, float32, float64, O0, O1, O2, O3
Data type for training (only pytorch backend). The O0, O1, ... flags require apex. See https://nvidia.github.io/apex/amp.html#opt-levels
Default: “float32”
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “chainer”
- --outdir
Output directory
- --debugmode
Debugmode
Default: 1
- --dict
Dictionary
- --seed
Random seed
Default: 1
- --debugdir
Output directory for debugging
- --resume, -r
Resume the training from snapshot
Default: “”
- --minibatches, -N
Process only N minibatches (for debug)
Default: -1
- --verbose, -V
Verbose option
Default: 0
- --tensorboard-dir
Tensorboard log dir path
- --report-interval-iters
Report interval iterations
Default: 100
- --train-json
Filename of train label data (json)
- --valid-json
Filename of validation label data (json)
- --model-module
Module that defines the model (default: espnet.nets.xxx_backend.e2e_asr:E2E)
- --ctc_type
Possible choices: builtin, warpctc
Type of CTC implementation to calculate loss.
Default: “warpctc”
- --mtlalpha
Multitask learning coefficient, alpha: alpha*ctc_loss + (1-alpha)*att_loss
Default: 0.5
- --lsm-type
Possible choices: , unigram
Apply label smoothing with a specified distribution type
Default: “”
- --lsm-weight
Label smoothing weight
Default: 0.0
- --report-cer
Compute CER on development set
Default: False
- --report-wer
Compute WER on development set
Default: False
- --nbest
Output N-best hypotheses
Default: 1
- --beam-size
Beam size
Default: 4
- --penalty
Insertion penalty
Default: 0.0
- --maxlenratio
Input length ratio to obtain max output length. If maxlenratio=0.0 (default), it uses an end-detect function to automatically find maximum hypothesis lengths
Default: 0.0
- --minlenratio
Input length ratio to obtain min output length
Default: 0.0
- --ctc-weight
CTC weight in joint decoding
Default: 0.3
- --rnnlm
RNNLM model file to read
- --rnnlm-conf
RNNLM model config file to read
- --lm-weight
RNNLM weight.
Default: 0.1
- --sym-space
Space symbol
Default: “<space>”
- --sym-blank
Blank symbol
Default: “<blank>”
- --sortagrad
How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs
Default: 0
- --batch-count
Possible choices: auto, seq, bin, frame
How to count the batch size. The default (auto) infers the counting method from the given arguments.
Default: “auto”
- --batch-size, --batch-seqs, -b
Maximum seqs in a minibatch (0 to disable)
Default: 0
- --batch-bins
Maximum bins in a minibatch (0 to disable)
Default: 0
- --batch-frames-in
Maximum input frames in a minibatch (0 to disable)
Default: 0
- --batch-frames-out
Maximum output frames in a minibatch (0 to disable)
Default: 0
- --batch-frames-inout
Maximum input+output frames in a minibatch (0 to disable)
Default: 0
- --maxlen-in, --batch-seq-maxlen-in
When --batch-count=seq, batch size is reduced if the input sequence length > ML.
Default: 800
- --maxlen-out, --batch-seq-maxlen-out
When --batch-count=seq, batch size is reduced if the output sequence length > ML
Default: 150
- --n-iter-processes
Number of processes of iterator
Default: 0
- --preprocess-conf
The configuration file for the pre-processing
- --opt
Possible choices: adadelta, adam, noam
Optimizer
Default: “adadelta”
- --accum-grad
Number of gradient accumulation steps
Default: 1
- --eps
Epsilon constant for optimizer
Default: 1e-08
- --eps-decay
Decaying ratio of epsilon
Default: 0.01
- --weight-decay
Weight decay ratio
Default: 0.0
- --criterion
Possible choices: loss, acc
Criterion to perform epsilon decay
Default: “acc”
- --threshold
Threshold to stop iteration
Default: 0.0001
- --epochs, -e
Maximum number of epochs
Default: 30
- --early-stop-criterion
Value to monitor to trigger an early stopping of the training
Default: “validation/main/acc”
- --patience
Number of epochs to wait without improvement before stopping the training
Default: 3
- --grad-clip
Gradient norm threshold to clip
Default: 5
- --num-save-attention
Number of samples of attention to be saved
Default: 3
- --grad-noise
Flag to enable noise injection into gradients during training
Default: False
- --num-spkrs
Possible choices: 1, 2
Number of speakers in the speech.
Default: 1
- --spa
Enable speaker parallel attention.
Default: False
- --elayers-sd
Number of encoder layers for the speaker-differentiating part (multi-speaker ASR mode only)
Default: 4
- --context-residual
Flag to enable the context vector residual in the decoder network
Default: False
- --replace-sos
Replace <sos> in the decoder with a target language ID (the first token in the target sequence)
Default: False
- --enc-init
Pre-trained ASR model to initialize encoder.
- --enc-init-mods
List of encoder modules to initialize, separated by a comma.
Default: enc.enc.
- --dec-init
Pre-trained ASR, MT or LM model to initialize decoder.
- --dec-init-mods
List of decoder modules to initialize, separated by a comma.
Default: att., dec.
- --use-frontend
Flag to enable the frontend system.
Default: False
- --use-wpe
Apply Weighted Prediction Error
Default: False
- --wtype
Possible choices: lstm, blstm, lstmp, blstmp, vgglstmp, vggblstmp, vgglstm, vggblstm, gru, bgru, grup, bgrup, vgggrup, vggbgrup, vgggru, vggbgru
Type of encoder network architecture of the mask estimator for WPE.
Default: “blstmp”
- --wlayers
Default: 2
- --wunits
Default: 300
- --wprojs
Default: 300
- --wdropout-rate
Default: 0.0
- --wpe-taps
Default: 5
- --wpe-delay
Default: 3
- --use-dnn-mask-for-wpe
Use DNN to estimate the power spectrogram. This option is experimental.
Default: False
- --use-beamformer
Default: True
- --btype
Possible choices: lstm, blstm, lstmp, blstmp, vgglstmp, vggblstmp, vgglstm, vggblstm, gru, bgru, grup, bgrup, vgggrup, vggbgrup, vgggru, vggbgru
Type of encoder network architecture of the mask estimator for Beamformer.
Default: “blstmp”
- --blayers
Default: 2
- --bunits
Default: 300
- --bprojs
Default: 300
- --badim
Default: 320
- --ref-channel
The reference channel used for beamformer. By default, the channel is estimated by DNN.
Default: -1
- --bdropout-rate
Default: 0.0
- --stats-file
The stats file for the feature normalization
- --apply-uttmvn
Apply utterance level mean variance normalization.
Default: True
- --uttmvn-norm-means
Default: True
- --uttmvn-norm-vars
Default: False
- --fbank-fs
The sample frequency used for the mel-fbank creation.
Default: 16000
- --n-mels
The number of mel-frequency bins.
Default: 80
- --fbank-fmin
Default: 0.0
- --fbank-fmax
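The --mtlalpha option above interpolates between the CTC loss and the attention loss as alpha*ctc_loss + (1-alpha)*att_loss. The following minimal sketch evaluates that formula on plain numbers. The function name `multitask_loss` and the range check on alpha are assumptions made for illustration.

```python
def multitask_loss(ctc_loss, att_loss, mtlalpha=0.5):
    """Return mtlalpha * ctc_loss + (1 - mtlalpha) * att_loss."""
    # Assumption: alpha is an interpolation weight and so lies in [0, 1].
    if not 0.0 <= mtlalpha <= 1.0:
        raise ValueError("mtlalpha must be in [0, 1]")
    return mtlalpha * ctc_loss + (1.0 - mtlalpha) * att_loss


balanced = multitask_loss(2.0, 4.0)                 # default alpha = 0.5
att_only = multitask_loss(2.0, 4.0, mtlalpha=0.0)   # pure attention loss
ctc_only = multitask_loss(2.0, 4.0, mtlalpha=1.0)   # pure CTC loss
print(balanced, att_only, ctc_only)
```

At the two extremes the formula reduces to a single objective, so alpha=0.0 trains on the attention loss alone and alpha=1.0 on the CTC loss alone.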
lm_train.py¶
Train a new language model on one CPU or one GPU
usage: lm_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU]
[--train-dtype {float16,float32,float64,O0,O1,O2,O3}]
[--backend {chainer,pytorch}] --outdir OUTDIR
[--debugmode DEBUGMODE] --dict DICT [--seed SEED]
[--resume [RESUME]] [--verbose VERBOSE]
[--tensorboard-dir [TENSORBOARD_DIR]]
[--report-interval-iters REPORT_INTERVAL_ITERS]
--train-label TRAIN_LABEL --valid-label VALID_LABEL
[--test-label TEST_LABEL] [--dump-hdf5-path DUMP_HDF5_PATH]
[--opt {sgd,adam}] [--sortagrad [SORTAGRAD]]
[--batchsize BATCHSIZE] [--epoch EPOCH]
[--early-stop-criterion [EARLY_STOP_CRITERION]]
[--patience [PATIENCE]] [--gradclip GRADCLIP]
[--maxlen MAXLEN] [--model-module MODEL_MODULE]
Named Arguments¶
- --config
config file path
- --config2
second config file path that overwrites the settings in --config.
- --config3
third config file path that overwrites the settings in --config and --config2.
- --ngpu
Number of GPUs. If not given, use all visible devices
- --train-dtype
Possible choices: float16, float32, float64, O0, O1, O2, O3
Data type for training (only pytorch backend). The O0, O1, ... flags require apex. See https://nvidia.github.io/apex/amp.html#opt-levels
Default: “float32”
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “chainer”
- --outdir
Output directory
- --debugmode
Debugmode
Default: 1
- --dict
Dictionary
- --seed
Random seed
Default: 1
- --resume, -r
Resume the training from snapshot
Default: “”
- --verbose, -V
Verbose option
Default: 0
- --tensorboard-dir
Tensorboard log dir path
- --report-interval-iters
Report interval iterations
Default: 100
- --train-label
Filename of train label data
- --valid-label
Filename of validation label data
- --test-label
Filename of test label data
- --dump-hdf5-path
Path to dump a preprocessed dataset as hdf5
- --opt
Possible choices: sgd, adam
Optimizer
Default: “sgd”
- --sortagrad
How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs
Default: 0
- --batchsize, -b
Number of examples in each mini-batch
Default: 300
- --epoch, -e
Number of sweeps over the dataset to train
Default: 20
- --early-stop-criterion
Value to monitor to trigger an early stopping of the training
Default: “validation/main/loss”
- --patience
Number of epochs to wait without improvement before stopping the training
Default: 3
- --gradclip, -c
Gradient norm threshold to clip
Default: 5
- --maxlen
Batch size is reduced if the input sequence length > MAXLEN
Default: 40
- --model-module
Module that defines the model (default: espnet.nets.xxx_backend.lm.default:DefaultRNNLM)
Default: “default”
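The --early-stop-criterion and --patience options above describe stopping once the monitored value has not improved for a number of epochs. The sketch below shows one plausible reading of that rule for a loss that should decrease. The function `epoch_of_early_stop` and the sample losses are hypothetical, and the exact trigger used by the trainer may differ.

```python
def epoch_of_early_stop(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    waited = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


# Hypothetical validation losses, one per epoch.
losses = [5.0, 4.2, 4.0, 4.1, 4.05, 4.3, 3.9]
stop_epoch = epoch_of_early_stop(losses, patience=3)
print(stop_epoch)
```

With a larger patience the same run would continue long enough to reach the improved loss in the final epoch, which is the trade-off this option controls.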
mt_recog.py¶
Translate text using a machine translation (MT) model on one CPU or GPU
usage: mt_recog.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU]
[--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
[--seed SEED] [--verbose VERBOSE] [--batchsize BATCHSIZE]
[--preprocess-conf PREPROCESS_CONF]
[--recog-json RECOG_JSON] --result-label RESULT_LABEL
--model MODEL [--model-conf MODEL_CONF] [--nbest NBEST]
[--beam-size BEAM_SIZE] [--penalty PENALTY]
[--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
[--ctc-weight CTC_WEIGHT] [--rnnlm RNNLM]
[--rnnlm-conf RNNLM_CONF] [--lm-weight LM_WEIGHT]
[--tgt-lang TGT_LANG]
Named Arguments¶
- --config
Config file path
- --config2
Second config file path that overwrites the settings in --config
- --config3
Third config file path that overwrites the settings in --config and --config2
- --ngpu
Number of GPUs
Default: 0
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “chainer”
- --debugmode
Debugmode
Default: 1
- --seed
Random seed
Default: 1
- --verbose, -V
Verbose option
Default: 1
- --batchsize
Batch size for beam search (0 means no batch processing)
Default: 1
- --preprocess-conf
The configuration file for the pre-processing
- --recog-json
Filename of recognition data (json)
- --result-label
Filename of result label data (json)
- --model
Model file parameters to read
- --model-conf
Model config file
- --nbest
Output N-best hypotheses
Default: 1
- --beam-size
Beam size
Default: 1
- --penalty
Insertion penalty
Default: 0.1
- --maxlenratio
Input length ratio to obtain max output length. If maxlenratio=0.0, it uses an end-detect function to automatically find maximum hypothesis lengths
Default: 3.0
- --minlenratio
Input length ratio to obtain min output length
Default: 0.0
- --ctc-weight
dummy
Default: 0.0
- --rnnlm
RNNLM model file to read
- --rnnlm-conf
RNNLM model config file to read
- --lm-weight
RNNLM weight
Default: 0.0
- --tgt-lang
target language ID (e.g., <en>, <de>, <fr> etc.)
Default: False
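The --maxlenratio and --minlenratio options above bound the output length relative to the input length. The sketch below shows how such ratios could translate into integer bounds. The function `length_bounds` is hypothetical, and capping the search at the input length when maxlenratio=0.0 is an assumption standing in for the end-detect behavior.

```python
def length_bounds(input_len, maxlenratio=3.0, minlenratio=0.0):
    """Return (minlen, maxlen) for a hypothesis, given the input length."""
    if maxlenratio == 0.0:
        # Assumption: with end detection, the input length caps the search.
        maxlen = input_len
    else:
        maxlen = max(1, int(maxlenratio * input_len))
    minlen = int(minlenratio * input_len)
    return minlen, maxlen


default_bounds = length_bounds(20)                       # ratios 3.0 and 0.0
end_detect_bounds = length_bounds(20, maxlenratio=0.0, minlenratio=0.5)
print(default_bounds, end_detect_bounds)
```

Truncating to integers means very short inputs can yield a minimum length of zero, so the `max(1, ...)` guard keeps at least one output token allowed.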
mt_train.py¶
Train a neural machine translation (NMT) model on one CPU, one or multiple GPUs
usage: mt_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU]
[--backend {chainer,pytorch}] --outdir OUTDIR
[--debugmode DEBUGMODE] --dict-tgt DICT_TGT
[--dict-src [DICT_SRC]] [--seed SEED] [--debugdir DEBUGDIR]
[--resume [RESUME]] [--minibatches MINIBATCHES]
[--verbose VERBOSE] [--tensorboard-dir [TENSORBOARD_DIR]]
[--report-interval-iters REPORT_INTERVAL_ITERS]
[--train-json TRAIN_JSON] [--valid-json VALID_JSON]
[--model-module MODEL_MODULE]
[--etype {lstm,blstm,lstmp,blstmp,gru,bgru,grup,bgrup}]
[--elayers ELAYERS] [--eunits EUNITS] [--eprojs EPROJS]
[--subsample SUBSAMPLE]
[--atype {noatt,dot,add,location,coverage,coverage_location,location2d,location_recurrent,multi_head_dot,multi_head_add,multi_head_loc,multi_head_multi_res_loc}]
[--adim ADIM] [--awin AWIN] [--aheads AHEADS]
[--aconv-chans ACONV_CHANS] [--aconv-filts ACONV_FILTS]
[--dtype {lstm,gru}] [--dlayers DLAYERS] [--dunits DUNITS]
[--lsm-type [{,unigram}]] [--lsm-weight LSM_WEIGHT]
[--sampling-probability SAMPLING_PROBABILITY]
[--nbest NBEST] [--beam-size BEAM_SIZE] [--penalty PENALTY]
[--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
[--rnnlm RNNLM] [--rnnlm-conf RNNLM_CONF]
[--lm-weight LM_WEIGHT] [--sym-space SYM_SPACE]
[--sym-blank SYM_BLANK] [--dropout-rate DROPOUT_RATE]
[--dropout-rate-decoder DROPOUT_RATE_DECODER]
[--sortagrad [SORTAGRAD]]
[--batch-count {auto,seq,bin,frame}]
[--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
[--batch-frames-in BATCH_FRAMES_IN]
[--batch-frames-out BATCH_FRAMES_OUT]
[--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
[--maxlen-out ML] [--n-iter-processes N_ITER_PROCESSES]
[--preprocess-conf PREPROCESS_CONF]
[--opt {adadelta,adam,noam}] [--accum-grad ACCUM_GRAD]
[--eps EPS] [--eps-decay EPS_DECAY]
[--weight-decay WEIGHT_DECAY] [--criterion {loss,acc}]
[--threshold THRESHOLD] [--epochs EPOCHS]
[--early-stop-criterion [EARLY_STOP_CRITERION]]
[--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
[--num-save-attention NUM_SAVE_ATTENTION]
[--context-residual [CONTEXT_RESIDUAL]]
[--replace-sos [REPLACE_SOS]]
Named Arguments¶
- --config
config file path
- --config2
second config file path that overwrites the settings in --config.
- --config3
third config file path that overwrites the settings in --config and --config2.
- --ngpu
Number of GPUs
Default: 0
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “chainer”
- --outdir
Output directory
- --debugmode
Debugmode
Default: 1
- --dict-tgt
Dictionary for target language
- --dict-src
Dictionary for source language. By default, the dictionary is shared between the source and target languages.
Default: “”
- --seed
Random seed
Default: 1
- --debugdir
Output directory for debugging
- --resume, -r
Resume the training from snapshot
Default: “”
- --minibatches, -N
Process only N minibatches (for debug)
Default: -1
- --verbose, -V
Verbose option
Default: 0
- --tensorboard-dir
Tensorboard log dir path
- --report-interval-iters
Report interval iterations
Default: 100
- --train-json
Filename of train label data (json)
- --valid-json
Filename of validation label data (json)
- --model-module
Module that defines the model (default: espnet.nets.xxx_backend.e2e_mt:E2E)
- --etype
Possible choices: lstm, blstm, lstmp, blstmp, gru, bgru, grup, bgrup
Type of encoder network architecture (VGG is not supported for NMT)
Default: “blstmp”
- --elayers
Number of encoder layers
Default: 4
- --eunits, -u
Number of encoder hidden units
Default: 1024
- --eprojs
Number of encoder projection units
Default: 1024
- --subsample
Subsample input frames. x_y_z means subsample every x frames at the 1st layer, every y frames at the 2nd layer, etc.
Default: “1”
- --atype
Possible choices: noatt, dot, add, location, coverage, coverage_location, location2d, location_recurrent, multi_head_dot, multi_head_add, multi_head_loc, multi_head_multi_res_loc
Type of attention architecture
Default: “dot”
- --adim
Number of attention transformation dimensions
Default: 1024
- --awin
Window size for location2d attention
Default: 5
- --aheads
Number of heads for multi head attention
Default: 4
- --aconv-chans
Number of attention convolution channels (negative value indicates no location-aware attention)
Default: -1
- --aconv-filts
Number of attention convolution filters (negative value indicates no location-aware attention)
Default: 100
- --dtype
Possible choices: lstm, gru
Type of decoder network architecture
Default: “lstm”
- --dlayers
Number of decoder layers
Default: 1
- --dunits
Number of decoder hidden units
Default: 1024
- --lsm-type
Possible choices: , unigram
Apply label smoothing with a specified distribution type
Default: “”
- --lsm-weight
Label smoothing weight
Default: 0.0
- --sampling-probability
Ratio of predicted labels fed back to decoder
Default: 0.0
- --nbest
Output N-best hypotheses
Default: 1
- --beam-size
Beam size
Default: 4
- --penalty
Insertion penalty
Default: 0.0
- --maxlenratio
Input length ratio to obtain max output length. If maxlenratio=0.0 (default), it uses an end-detect function to automatically find maximum hypothesis lengths
Default: 0.0
- --minlenratio
Input length ratio to obtain min output length
Default: 0.0
- --rnnlm
RNNLM model file to read
- --rnnlm-conf
RNNLM model config file to read
- --lm-weight
RNNLM weight.
Default: 0.0
- --sym-space
Space symbol
Default: “<space>”
- --sym-blank
Blank symbol
Default: “<blank>”
- --dropout-rate
Dropout rate for the encoder
Default: 0.0
- --dropout-rate-decoder
Dropout rate for the decoder
Default: 0.0
- --sortagrad
How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs
Default: 0
- --batch-count
Possible choices: auto, seq, bin, frame
How to count the batch size. The default (auto) infers the counting method from the given arguments.
Default: “auto”
- --batch-size, --batch-seqs, -b
Maximum seqs in a minibatch (0 to disable)
Default: 0
- --batch-bins
Maximum bins in a minibatch (0 to disable)
Default: 0
- --batch-frames-in
Maximum input frames in a minibatch (0 to disable)
Default: 0
- --batch-frames-out
Maximum output frames in a minibatch (0 to disable)
Default: 0
- --batch-frames-inout
Maximum input+output frames in a minibatch (0 to disable)
Default: 0
- --maxlen-in, --batch-seq-maxlen-in
When --batch-count=seq, batch size is reduced if the input sequence length > ML.
Default: 100
- --maxlen-out, --batch-seq-maxlen-out
When --batch-count=seq, batch size is reduced if the output sequence length > ML
Default: 100
- --n-iter-processes
Number of processes of iterator
Default: 0
- --preprocess-conf
The configuration file for the pre-processing
- --opt
Possible choices: adadelta, adam, noam
Optimizer
Default: “adadelta”
- --accum-grad
Number of gradient accumulation steps
Default: 1
- --eps
Epsilon constant for optimizer
Default: 1e-08
- --eps-decay
Decaying ratio of epsilon
Default: 0.01
- --weight-decay
Weight decay ratio
Default: 0.0
- --criterion
Possible choices: loss, acc
Criterion to perform epsilon decay
Default: “acc”
- --threshold
Threshold to stop iteration
Default: 0.0001
- --epochs, -e
Maximum number of epochs
Default: 30
- --early-stop-criterion
Value to monitor to trigger an early stopping of the training
Default: “validation/main/acc”
- --patience
Number of epochs to wait without improvement before stopping the training
Default: 3
- --grad-clip
Gradient norm threshold to clip
Default: 5
- --num-save-attention
Number of samples of attention to be saved
Default: 3
- --context-residual
Default: “”
- --replace-sos
Replace <sos> in the decoder with a target language ID (the first token in the target sequence)
Default: False
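The --subsample option above encodes one subsampling factor per encoder layer in an underscore-separated string such as x_y_z. The sketch below parses such a string and counts the frames left after each layer. Both helper functions are hypothetical, and padding missing layers with a factor of 1 is an assumption.

```python
def parse_subsample(spec, n_layers):
    """Turn a string like '1_2_2' into one integer factor per layer."""
    factors = [int(x) for x in spec.split("_")]
    # Assumption: layers without an explicit factor are not subsampled.
    factors += [1] * (n_layers - len(factors))
    return factors[:n_layers]


def output_frames(n_frames, factors):
    """Count frames left after keeping every f-th frame at each layer."""
    for f in factors:
        n_frames = len(range(0, n_frames, f))
    return n_frames


factors = parse_subsample("1_2_2", n_layers=4)
remaining = output_frames(100, factors)
print(factors, remaining)
```

The default value "1" therefore corresponds to no subsampling at any layer, which suits text input where every token matters.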
tts_decode.py¶
Synthesize speech from text using a TTS model on one CPU
usage: tts_decode.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU]
[--backend {chainer,pytorch}] [--debugmode DEBUGMODE]
[--seed SEED] --out OUT [--verbose VERBOSE]
[--preprocess-conf PREPROCESS_CONF] --json JSON --model
MODEL [--model-conf MODEL_CONF]
[--maxlenratio MAXLENRATIO] [--minlenratio MINLENRATIO]
[--threshold THRESHOLD]
Named Arguments¶
- --config
config file path
- --config2
second config file path that overwrites the settings in --config.
- --config3
third config file path that overwrites the settings in --config and --config2.
- --ngpu
Number of GPUs
Default: 0
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “pytorch”
- --debugmode
Debugmode
Default: 1
- --seed
Random seed
Default: 1
- --out
Output filename
- --verbose, -V
Verbose option
Default: 0
- --preprocess-conf
The configuration file for the pre-processing
- --json
Filename of train label data (json)
- --model
Model file parameters to read
- --model-conf
Model config file
- --maxlenratio
Maximum length ratio in decoding
Default: 5
- --minlenratio
Minimum length ratio in decoding
Default: 0
- --threshold
Threshold value in decoding
Default: 0.5
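The --maxlenratio, --minlenratio, and --threshold options above jointly decide when synthesis stops. The sketch below assumes a decoder that emits a stop probability per output frame, as Tacotron 2 style models do. The function `frames_until_stop` and the sample probabilities are hypothetical and do not reproduce the tool's actual loop.

```python
def frames_until_stop(stop_probs, input_len,
                      maxlenratio=5, minlenratio=0, threshold=0.5):
    """Return how many frames are generated before decoding stops."""
    maxlen = int(input_len * maxlenratio)
    minlen = int(input_len * minlenratio)
    for n_frames, prob in enumerate(stop_probs, start=1):
        if n_frames >= maxlen:
            return n_frames  # hard upper bound reached
        if n_frames >= minlen and prob >= threshold:
            return n_frames  # stop probability crossed the threshold
    return min(len(stop_probs), maxlen)


# Hypothetical per-frame stop probabilities for an input of 2 tokens.
probs = [0.1, 0.2, 0.7, 0.9]
n_default = frames_until_stop(probs, input_len=2)
n_min = frames_until_stop(probs, input_len=2, minlenratio=2)
print(n_default, n_min)
```

Raising the threshold makes the model less eager to stop, while the two ratios guard against outputs that are implausibly short or run on indefinitely.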
tts_train.py¶
Train a new text-to-speech (TTS) model on one CPU, one or multiple GPUs
usage: tts_train.py [-h] [--config CONFIG] [--config2 CONFIG2]
[--config3 CONFIG3] [--ngpu NGPU]
[--backend {chainer,pytorch}] --outdir OUTDIR
[--debugmode DEBUGMODE] [--seed SEED] [--resume [RESUME]]
[--minibatches MINIBATCHES] [--verbose VERBOSE]
[--tensorboard-dir [TENSORBOARD_DIR]]
[--save-interval-epochs SAVE_INTERVAL_EPOCHS]
[--report-interval-iters REPORT_INTERVAL_ITERS]
--train-json TRAIN_JSON --valid-json VALID_JSON
[--model-module MODEL_MODULE] [--sortagrad [SORTAGRAD]]
[--batch-sort-key [{shuffle,output,input}]]
[--batch-count {auto,seq,bin,frame}]
[--batch-size BATCH_SIZE] [--batch-bins BATCH_BINS]
[--batch-frames-in BATCH_FRAMES_IN]
[--batch-frames-out BATCH_FRAMES_OUT]
[--batch-frames-inout BATCH_FRAMES_INOUT] [--maxlen-in ML]
[--maxlen-out ML]
[--num-iter-processes NUM_ITER_PROCESSES]
[--preprocess-conf PREPROCESS_CONF]
[--use-speaker-embedding USE_SPEAKER_EMBEDDING]
[--use-second-target USE_SECOND_TARGET]
[--opt {adam,noam}] [--accum-grad ACCUM_GRAD] [--lr LR]
[--eps EPS] [--weight-decay WEIGHT_DECAY]
[--epochs EPOCHS]
[--early-stop-criterion [EARLY_STOP_CRITERION]]
[--patience [PATIENCE]] [--grad-clip GRAD_CLIP]
[--num-save-attention NUM_SAVE_ATTENTION]
[--keep-all-data-on-mem KEEP_ALL_DATA_ON_MEM]
Named Arguments¶
- --config
config file path
- --config2
second config file path that overwrites the settings in --config.
- --config3
third config file path that overwrites the settings in --config and --config2.
- --ngpu
Number of GPUs. If not given, use all visible devices
- --backend
Possible choices: chainer, pytorch
Backend library
Default: “pytorch”
- --outdir
Output directory
- --debugmode
Debugmode
Default: 1
- --seed
Random seed
Default: 1
- --resume, -r
Resume the training from snapshot
Default: “”
- --minibatches, -N
Process only N minibatches (for debug)
Default: -1
- --verbose, -V
Verbose option
Default: 0
- --tensorboard-dir
Tensorboard log directory path
- --save-interval-epochs
Save interval epochs
Default: 1
- --report-interval-iters
Report interval iterations
Default: 100
- --train-json
Filename of training json
- --valid-json
Filename of validation json
- --model-module
Module that defines the model
Default: “espnet.nets.pytorch_backend.e2e_tts_tacotron2:Tacotron2”
- --sortagrad
How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs
Default: 0
- --batch-sort-key
Possible choices: shuffle, output, input
Batch sorting key. “shuffle” only works with --batch-count “seq”.
Default: “shuffle”
- --batch-count
Possible choices: auto, seq, bin, frame
How to count the batch size. The default (auto) infers the counting method from the given arguments.
Default: “auto”
- --batch-size, --batch-seqs, -b
Maximum seqs in a minibatch (0 to disable)
Default: 0
- --batch-bins
Maximum bins in a minibatch (0 to disable)
Default: 0
- --batch-frames-in
Maximum input frames in a minibatch (0 to disable)
Default: 0
- --batch-frames-out
Maximum output frames in a minibatch (0 to disable)
Default: 0
- --batch-frames-inout
Maximum input+output frames in a minibatch (0 to disable)
Default: 0
- --maxlen-in, --batch-seq-maxlen-in
When --batch-count=seq, batch size is reduced if the input sequence length > ML.
Default: 100
- --maxlen-out, --batch-seq-maxlen-out
When --batch-count=seq, batch size is reduced if the output sequence length > ML
Default: 200
- --num-iter-processes
Number of processes of iterator
Default: 0
- --preprocess-conf
The configuration file for the pre-processing
- --use-speaker-embedding
Whether to use speaker embedding
Default: False
- --use-second-target
Whether to use second target
Default: False
- --opt
Possible choices: adam, noam
Optimizer
Default: “adam”
- --accum-grad
Number of gradient accumulation steps
Default: 1
- --lr
Learning rate for optimizer
Default: 0.001
- --eps
Epsilon for optimizer
Default: 1e-06
- --weight-decay
Weight decay coefficient for optimizer
Default: 1e-06
- --epochs, -e
Maximum number of epochs
Default: 30
- --early-stop-criterion
Value to monitor to trigger an early stopping of the training
Default: “validation/main/loss”
- --patience
Number of epochs to wait without improvement before stopping the training
Default: 3
- --grad-clip
Gradient norm threshold to clip
Default: 1
- --num-save-attention
Number of samples of attention to be saved
Default: 5
- --keep-all-data-on-mem
Whether to keep all data on memory
Default: False
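The --grad-clip option above sets a gradient norm threshold. Clipping by norm usually means rescaling all gradients by a common factor when their overall L2 norm exceeds the threshold. The sketch below shows that idea on a flat list of numbers. The function `clip_by_global_norm` is hypothetical and simplified compared with what a deep learning framework does.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(clipped, norm_before, norm_after)
```

Because every component is scaled by the same factor, the direction of the update is preserved and only its magnitude is limited.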