core tools (espnet2)¶

ESPnet2 provides several command-line tools for training and evaluating neural networks (NN) under espnet2/bin:

aggregate_stats_dirs.py¶

usage: aggregate_stats_dirs.py [-h]
                               [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                               [--skip_sum_stats] [--input_dir INPUT_DIR]
                               --output_dir OUTPUT_DIR

Aggregate statistics directories into one directory

optional arguments:
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --skip_sum_stats      Skip computing the sum of statistics. (default: False)
  --input_dir INPUT_DIR
                        Input directories (default: None)
  --output_dir OUTPUT_DIR
                        Output directory (default: None)

asr_inference.py¶

usage: asr_inference.py [-h] [--config CONFIG]
                        [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                        --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                        [--dtype {float16,float32,float64}]
                        [--num_workers NUM_WORKERS]
                        --data_path_and_name_and_type
                        DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                        [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                        [--asr_train_config ASR_TRAIN_CONFIG]
                        [--asr_model_file ASR_MODEL_FILE]
                        [--lm_train_config LM_TRAIN_CONFIG]
                        [--lm_file LM_FILE]
                        [--word_lm_train_config WORD_LM_TRAIN_CONFIG]
                        [--word_lm_file WORD_LM_FILE]
                        [--ngram_file NGRAM_FILE] [--model_tag MODEL_TAG]
                        [--batch_size BATCH_SIZE] [--nbest NBEST]
                        [--beam_size BEAM_SIZE] [--penalty PENALTY]
                        [--maxlenratio MAXLENRATIO]
                        [--minlenratio MINLENRATIO] [--ctc_weight CTC_WEIGHT]
                        [--lm_weight LM_WEIGHT] [--ngram_weight NGRAM_WEIGHT]
                        [--streaming STREAMING]
                        [--transducer_conf TRANSDUCER_CONF]
                        [--token_type {char,bpe,None}] [--bpemodel BPEMODEL]

ASR Decoding

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --asr_train_config ASR_TRAIN_CONFIG
                        ASR training configuration (default: None)
  --asr_model_file ASR_MODEL_FILE
                        ASR model parameter file (default: None)
  --lm_train_config LM_TRAIN_CONFIG
                        LM training configuration (default: None)
  --lm_file LM_FILE     LM parameter file (default: None)
  --word_lm_train_config WORD_LM_TRAIN_CONFIG
                        Word LM training configuration (default: None)
  --word_lm_file WORD_LM_FILE
                        Word LM parameter file (default: None)
  --ngram_file NGRAM_FILE
                        N-gram parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If specify this option,
                        *_train_config and *_file will be overwritten
                        (default: None)

Beam-search related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --nbest NBEST         Output N-best hypotheses (default: 1)
  --beam_size BEAM_SIZE
                        Beam size (default: 20)
  --penalty PENALTY     Insertion penalty (default: 0.0)
  --maxlenratio MAXLENRATIO
                        Input length ratio to obtain max output length. If
                        maxlenratio=0.0 (default), it uses a end-detect
                        function to automatically find maximum hypothesis
                        lengths.If maxlenratio<0.0, its absolute value is
                        interpretedas a constant max output length (default:
                        0.0)
  --minlenratio MINLENRATIO
                        Input length ratio to obtain min output length
                        (default: 0.0)
  --ctc_weight CTC_WEIGHT
                        CTC weight in joint decoding (default: 0.5)
  --lm_weight LM_WEIGHT
                        RNNLM weight (default: 1.0)
  --ngram_weight NGRAM_WEIGHT
                        ngram weight (default: 0.9)
  --streaming STREAMING
  --transducer_conf TRANSDUCER_CONF
                        The keyword arguments for transducer beam search.
                        (default: None)

Text converter related:
  --token_type {char,bpe,None}
                        The token type for ASR model. If not given, refers
                        from the training args (default: None)
  --bpemodel BPEMODEL   The model path of sentencepiece. If not given, refers
                        from the training args (default: None)

asr_inference_streaming.py¶

usage: asr_inference_streaming.py [-h] [--config CONFIG]
                                  [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                                  --output_dir OUTPUT_DIR [--ngpu NGPU]
                                  [--seed SEED]
                                  [--dtype {float16,float32,float64}]
                                  [--num_workers NUM_WORKERS]
                                  --data_path_and_name_and_type
                                  DATA_PATH_AND_NAME_AND_TYPE
                                  [--key_file KEY_FILE]
                                  [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                                  [--sim_chunk_length SIM_CHUNK_LENGTH]
                                  --asr_train_config ASR_TRAIN_CONFIG
                                  --asr_model_file ASR_MODEL_FILE
                                  [--lm_train_config LM_TRAIN_CONFIG]
                                  [--lm_file LM_FILE]
                                  [--word_lm_train_config WORD_LM_TRAIN_CONFIG]
                                  [--word_lm_file WORD_LM_FILE]
                                  [--batch_size BATCH_SIZE] [--nbest NBEST]
                                  [--beam_size BEAM_SIZE] [--penalty PENALTY]
                                  [--maxlenratio MAXLENRATIO]
                                  [--minlenratio MINLENRATIO]
                                  [--ctc_weight CTC_WEIGHT]
                                  [--lm_weight LM_WEIGHT]
                                  [--disable_repetition_detection DISABLE_REPETITION_DETECTION]
                                  [--encoded_feat_length_limit ENCODED_FEAT_LENGTH_LIMIT]
                                  [--decoder_text_length_limit DECODER_TEXT_LENGTH_LIMIT]
                                  [--token_type {char,bpe,None}]
                                  [--bpemodel BPEMODEL]

ASR Decoding

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
  --sim_chunk_length SIM_CHUNK_LENGTH
                        The length of one chunk, to which speech will be
                        divided for evalution of streaming processing.
                        (default: 0)

The model configuration related:
  --asr_train_config ASR_TRAIN_CONFIG
  --asr_model_file ASR_MODEL_FILE
  --lm_train_config LM_TRAIN_CONFIG
  --lm_file LM_FILE
  --word_lm_train_config WORD_LM_TRAIN_CONFIG
  --word_lm_file WORD_LM_FILE

Beam-search related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --nbest NBEST         Output N-best hypotheses (default: 1)
  --beam_size BEAM_SIZE
                        Beam size (default: 20)
  --penalty PENALTY     Insertion penalty (default: 0.0)
  --maxlenratio MAXLENRATIO
                        Input length ratio to obtain max output length. If
                        maxlenratio=0.0 (default), it uses a end-detect
                        function to automatically find maximum hypothesis
                        lengths (default: 0.0)
  --minlenratio MINLENRATIO
                        Input length ratio to obtain min output length
                        (default: 0.0)
  --ctc_weight CTC_WEIGHT
                        CTC weight in joint decoding (default: 0.5)
  --lm_weight LM_WEIGHT
                        RNNLM weight (default: 1.0)
  --disable_repetition_detection DISABLE_REPETITION_DETECTION
  --encoded_feat_length_limit ENCODED_FEAT_LENGTH_LIMIT
                        Limit the lengths of the encoded featureto input to
                        the decoder. (default: 0)
  --decoder_text_length_limit DECODER_TEXT_LENGTH_LIMIT
                        Limit the lengths of the textto input to the decoder.
                        (default: 0)

Text converter related:
  --token_type {char,bpe,None}
                        The token type for ASR model. If not given, refers
                        from the training args (default: None)
  --bpemodel BPEMODEL   The model path of sentencepiece. If not given, refers
                        from the training args (default: None)

asr_train.py¶

usage: asr_train.py [-h] [--config CONFIG] [--print_config]
                    [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                    [--dry_run DRY_RUN]
                    [--iterator_type {sequence,chunk,task,none}]
                    [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                    [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                    [--dist_backend DIST_BACKEND]
                    [--dist_init_method DIST_INIT_METHOD]
                    [--dist_world_size DIST_WORLD_SIZE]
                    [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                    [--dist_master_addr DIST_MASTER_ADDR]
                    [--dist_master_port DIST_MASTER_PORT]
                    [--dist_launcher {slurm,mpi,None}]
                    [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                    [--unused_parameters UNUSED_PARAMETERS]
                    [--sharded_ddp SHARDED_DDP]
                    [--cudnn_enabled CUDNN_ENABLED]
                    [--cudnn_benchmark CUDNN_BENCHMARK]
                    [--cudnn_deterministic CUDNN_DETERMINISTIC]
                    [--collect_stats COLLECT_STATS]
                    [--write_collected_feats WRITE_COLLECTED_FEATS]
                    [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                    [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                    [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                    [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                    [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                    [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                    [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                    [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                    [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                    [--train_dtype {float16,float32,float64}]
                    [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                    [--use_matplotlib USE_MATPLOTLIB]
                    [--use_tensorboard USE_TENSORBOARD]
                    [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                    [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                    [--wandb_name WANDB_NAME]
                    [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                    [--detect_anomaly DETECT_ANOMALY]
                    [--pretrain_path PRETRAIN_PATH]
                    [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                    [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                    [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                    [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                    [--batch_size BATCH_SIZE]
                    [--valid_batch_size VALID_BATCH_SIZE]
                    [--batch_bins BATCH_BINS]
                    [--valid_batch_bins VALID_BATCH_BINS]
                    [--train_shape_file TRAIN_SHAPE_FILE]
                    [--valid_shape_file VALID_SHAPE_FILE]
                    [--batch_type {unsorted,sorted,folded,length,numel}]
                    [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                    [--fold_length FOLD_LENGTH]
                    [--sort_in_batch {descending,ascending}]
                    [--sort_batch {descending,ascending}]
                    [--multiple_iterator MULTIPLE_ITERATOR]
                    [--chunk_length CHUNK_LENGTH]
                    [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                    [--num_cache_chunks NUM_CACHE_CHUNKS]
                    [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                    [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                    [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                    [--max_cache_size MAX_CACHE_SIZE]
                    [--max_cache_fd MAX_CACHE_FD]
                    [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                    [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}]
                    [--optim_conf OPTIM_CONF]
                    [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                    [--scheduler_conf SCHEDULER_CONF]
                    [--token_list TOKEN_LIST]
                    [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                    [--input_size INPUT_SIZE] [--ctc_conf CTC_CONF]
                    [--joint_net_conf JOINT_NET_CONF]
                    [--use_preprocessor USE_PREPROCESSOR]
                    [--token_type {bpe,char,word,phn}] [--bpemodel BPEMODEL]
                    [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                    [--cleaner {None,tacotron,jaconv,vietnamese}]
                    [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,g2pk,g2pk_no_space,korean_jaso,korean_jaso_no_space}]
                    [--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
                    [--rir_scp RIR_SCP] [--rir_apply_prob RIR_APPLY_PROB]
                    [--noise_scp NOISE_SCP]
                    [--noise_apply_prob NOISE_APPLY_PROB]
                    [--noise_db_range NOISE_DB_RANGE]
                    [--frontend {default,sliding_window,s3prl,fused}]
                    [--frontend_conf FRONTEND_CONF] [--specaug {specaug,None}]
                    [--specaug_conf SPECAUG_CONF]
                    [--normalize {global_mvn,utterance_mvn,None}]
                    [--normalize_conf NORMALIZE_CONF]
                    [--model {espnet,maskctc}] [--model_conf MODEL_CONF]
                    [--preencoder {sinc,linear,None}]
                    [--preencoder_conf PREENCODER_CONF]
                    [--encoder {conformer,transformer,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,longformer}]
                    [--encoder_conf ENCODER_CONF]
                    [--postencoder {hugging_face_transformers,None}]
                    [--postencoder_conf POSTENCODER_CONF]
                    [--decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,transducer,mlm}]
                    [--decoder_conf DECODER_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,g2pk,g2pk_no_space,korean_jaso,korean_jaso_no_space}
                        Specify g2p method if --token_type=phn (default: None)
  --speech_volume_normalize SPEECH_VOLUME_NORMALIZE
                        Scale the maximum amplitude to the given value. (default: None)
  --rir_scp RIR_SCP     The file path of rir scp file. (default: None)
  --rir_apply_prob RIR_APPLY_PROB
                        THe probability for applying RIR convolution. (default: 1.0)
  --noise_scp NOISE_SCP
                        The file path of noise scp file. (default: None)
  --noise_apply_prob NOISE_APPLY_PROB
                        The probability applying Noise adding. (default: 1.0)
  --noise_db_range NOISE_DB_RANGE
                        The range of noise decibel level. (default: 13_15)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number images to plot the outputs from attention. This option makes sense only when attention-based model. We can also disable the attention plot by setting it 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed trainingThis value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel  (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Perform on "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number epoch to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair referring the phase, "train" or "valid",and the criterion name. The mode specifying "min" or "max" can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging of early stopping. Give a pair referring the phase, "train" or "valid",the criterion name and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging of the best model. Give a pair referring the phase, "train" or "valid",the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of the used p-norm for gradient clip. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Just only iterating data loading without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if checkpoint is existing (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every the number iterations in each epochs at the training phase. If None is given, it is decided according the number of training samples automatically . (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsoleted (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization.e.g.
                          # Load all parameters  --init_param some/where/model.pth
                          # Load only decoder parameters  --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed  --init_param some/where/model.pth:decoder:decoder:decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has nothing in particular feature and just creates mini-batches which has constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which describes each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The fist column is referred, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input  in order to make each sample in a mini-batch has close length. This sampler requires a text file which describes the length for each sample

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of feature dimensions is referred, so 'shape_file' can be also used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is referred to the largest length of samples in the mini-batch. This samples requires length information as same as SortedBatchSampler

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches which have same number of 'bins' as possible counting by the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length for each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of feature dimensions is referred, so 'shape_file' can be also used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches which have same number of 'bins' as possible counting by the total number of elements of each feature instead of the length. Thus this sampler requires the full information of the dimension of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batches by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'.If multiple numbers separated by command are given, one of them is selected randomly for each samples. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than the all chunk_lengths, the sample is discarded.  (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it's less than 1, allows the overlapping and if bigger than 1, there are some gaps between each chunk. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle in the specified number of chunks and generate mini-batches More larger this value, more randomness can be obtained. (default: 1024)

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words splitted by comma. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types which supported by sndfile wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file in which is written a sequence of interger numbers separated by space.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file in which is written a sequence of interger numbers separated by comma.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file in which is written a sequence of float numbers separated by space.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file in which is written a sequence of float numbers separated by comma.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        A HDF5 file which contains arrays at the first level or the second level.   >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        rttm file loader, currently support for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>    SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>    SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>    END     file1 <NA> 4023 <NA> <NA> <NA> <NA>   ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow the arbitrary keys for mini-batch with ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, the 5 percent size of --max_cache_size (default: None)

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --token_list TOKEN_LIST
                        A text mapping int-id to token (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --input_size INPUT_SIZE
                        The number of input dimension of the feature (default: None)
  --ctc_conf CTC_CONF   The keyword arguments for CTC class. (default: {'dropout_rate': 0.0, 'ctc_type': 'builtin', 'reduce': True, 'ignore_nan_grad': True})
  --joint_net_conf JOINT_NET_CONF
                        The keyword arguments for joint network class. (default: None)

  Preprocess related

  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word,phn}
                        The text will be tokenized in the specified level token (default: bpe)
  --bpemodel BPEMODEL   The model file of sentencepiece (default: None)
  --frontend {default,sliding_window,s3prl,fused}
                        The frontend type (default: default)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: utterance_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --model {espnet,maskctc}
                        The model type (default: espnet)
  --model_conf MODEL_CONF
                        The keyword arguments for model (default: {})
  --preencoder {sinc,linear,None}
                        The preencoder type (default: None)
  --preencoder_conf PREENCODER_CONF
                        The keyword arguments for preencoder (default: {})
  --encoder {conformer,transformer,contextual_block_transformer,contextual_block_conformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain,longformer}
                        The encoder type (default: rnn)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --postencoder {hugging_face_transformers,None}
                        The postencoder type (default: None)
  --postencoder_conf POSTENCODER_CONF
                        The keyword arguments for postencoder (default: {})
  --decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn,transducer,mlm}
                        The decoder type (default: rnn)
  --decoder_conf DECODER_CONF
                        The keyword arguments for decoder (default: {})

diar_inference.py¶

usage: diar_inference.py [-h] [--config CONFIG]
                         [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                         --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                         [--dtype {float16,float32,float64}] [--fs FS]
                         [--num_workers NUM_WORKERS]
                         --data_path_and_name_and_type
                         DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                         [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                         [--train_config TRAIN_CONFIG]
                         [--model_file MODEL_FILE] [--model_tag MODEL_TAG]
                         [--batch_size BATCH_SIZE]
                         [--segment_size SEGMENT_SIZE]
                         [--show_progressbar SHOW_PROGRESSBAR]
                         [--num_spk NUM_SPK]

Speaker Diarization inference

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --fs FS               Sampling rate (default: 8000)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --train_config TRAIN_CONFIG
                        Diarization training configuration (default: None)
  --model_file MODEL_FILE
                        Diarization model parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If specify this option,
                        train_config and model_file will be overwritten
                        (default: None)

Data loading related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)

Diarize speech related:
  --segment_size SEGMENT_SIZE
                        Segment length in seconds for segment-wise speaker
                        diarization (default: None)
  --show_progressbar SHOW_PROGRESSBAR
                        Whether to show a progress bar when performing
                        segment-wise speaker diarization (default: False)
  --num_spk NUM_SPK     Predetermined number of speakers for inference
                        (default: None)

diar_train.py¶

usage: diar_train.py [-h] [--config CONFIG] [--print_config]
                     [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                     [--dry_run DRY_RUN]
                     [--iterator_type {sequence,chunk,task,none}]
                     [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                     [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                     [--dist_backend DIST_BACKEND]
                     [--dist_init_method DIST_INIT_METHOD]
                     [--dist_world_size DIST_WORLD_SIZE]
                     [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                     [--dist_master_addr DIST_MASTER_ADDR]
                     [--dist_master_port DIST_MASTER_PORT]
                     [--dist_launcher {slurm,mpi,None}]
                     [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                     [--unused_parameters UNUSED_PARAMETERS]
                     [--sharded_ddp SHARDED_DDP]
                     [--cudnn_enabled CUDNN_ENABLED]
                     [--cudnn_benchmark CUDNN_BENCHMARK]
                     [--cudnn_deterministic CUDNN_DETERMINISTIC]
                     [--collect_stats COLLECT_STATS]
                     [--write_collected_feats WRITE_COLLECTED_FEATS]
                     [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                     [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                     [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                     [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                     [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                     [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                     [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                     [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                     [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                     [--train_dtype {float16,float32,float64}]
                     [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                     [--use_matplotlib USE_MATPLOTLIB]
                     [--use_tensorboard USE_TENSORBOARD]
                     [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                     [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                     [--wandb_name WANDB_NAME]
                     [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                     [--detect_anomaly DETECT_ANOMALY]
                     [--pretrain_path PRETRAIN_PATH]
                     [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                     [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                     [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                     [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                     [--batch_size BATCH_SIZE]
                     [--valid_batch_size VALID_BATCH_SIZE]
                     [--batch_bins BATCH_BINS]
                     [--valid_batch_bins VALID_BATCH_BINS]
                     [--train_shape_file TRAIN_SHAPE_FILE]
                     [--valid_shape_file VALID_SHAPE_FILE]
                     [--batch_type {unsorted,sorted,folded,length,numel}]
                     [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                     [--fold_length FOLD_LENGTH]
                     [--sort_in_batch {descending,ascending}]
                     [--sort_batch {descending,ascending}]
                     [--multiple_iterator MULTIPLE_ITERATOR]
                     [--chunk_length CHUNK_LENGTH]
                     [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                     [--num_cache_chunks NUM_CACHE_CHUNKS]
                     [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                     [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                     [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                     [--max_cache_size MAX_CACHE_SIZE]
                     [--max_cache_fd MAX_CACHE_FD]
                     [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                     [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}]
                     [--optim_conf OPTIM_CONF]
                     [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                     [--scheduler_conf SCHEDULER_CONF] [--num_spk NUM_SPK]
                     [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                     [--input_size INPUT_SIZE] [--model_conf MODEL_CONF]
                     [--use_preprocessor USE_PREPROCESSOR]
                     [--frontend {default,sliding_window,s3prl}]
                     [--frontend_conf FRONTEND_CONF]
                     [--specaug {specaug,None}] [--specaug_conf SPECAUG_CONF]
                     [--normalize {global_mvn,utterance_mvn,None}]
                     [--normalize_conf NORMALIZE_CONF]
                     [--encoder {conformer,transformer,rnn}]
                     [--encoder_conf ENCODER_CONF] [--decoder {linear}]
                     [--decoder_conf DECODER_CONF]
                     [--label_aggregator {label_aggregator}]
                     [--label_aggregator_conf LABEL_AGGREGATOR_CONF]
                     [--attractor {rnn,None}]
                     [--attractor_conf ATTRACTOR_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number images to plot the outputs from attention. This option makes sense only when attention-based model. We can also disable the attention plot by setting it 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed trainingThis value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel  (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Perform on "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number epoch to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair referring the phase, "train" or "valid",and the criterion name. The mode specifying "min" or "max" can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging of early stopping. Give a pair referring the phase, "train" or "valid",the criterion name and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging of the best model. Give a pair referring the phase, "train" or "valid",the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of the used p-norm for gradient clip. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Just only iterating data loading without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if checkpoint is existing (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every the number iterations in each epochs at the training phase. If None is given, it is decided according the number of training samples automatically . (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsoleted (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization.e.g.
                          # Load all parameters  --init_param some/where/model.pth
                          # Load only decoder parameters  --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed  --init_param some/where/model.pth:decoder:decoder:decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has nothing in particular feature and just creates mini-batches which has constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which describes each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The fist column is referred, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input  in order to make each sample in a mini-batch has close length. This sampler requires a text file which describes the length for each sample

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of feature dimensions is referred, so 'shape_file' can be also used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is referred to the largest length of samples in the mini-batch. This samples requires length information as same as SortedBatchSampler

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches which have same number of 'bins' as possible counting by the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length for each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of feature dimensions is referred, so 'shape_file' can be also used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches which have same number of 'bins' as possible counting by the total number of elements of each feature instead of the length. Thus this sampler requires the full information of the dimension of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batches by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'.If multiple numbers separated by command are given, one of them is selected randomly for each samples. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than the all chunk_lengths, the sample is discarded.  (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it's less than 1, allows the overlapping and if bigger than 1, there are some gaps between each chunk. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle in the specified number of chunks and generate mini-batches More larger this value, more randomness can be obtained. (default: 1024)

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words splitted by comma. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types which supported by sndfile wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file in which is written a sequence of interger numbers separated by space.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file in which is written a sequence of interger numbers separated by comma.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file in which is written a sequence of float numbers separated by space.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file in which is written a sequence of float numbers separated by comma.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        A HDF5 file which contains arrays at the first level or the second level.   >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        rttm file loader, currently support for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>    SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>    SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>    END     file1 <NA> 4023 <NA> <NA> <NA> <NA>   ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow the arbitrary keys for mini-batch with ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, the 5 percent size of --max_cache_size (default: None)

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --num_spk NUM_SPK     The number fo speakers (for each recording) used in system training (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --input_size INPUT_SIZE
                        The number of input dimension of the feature (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {'attractor_weight': 1.0})

  Preprocess related

  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --frontend {default,sliding_window,s3prl}
                        The frontend type (default: default)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: utterance_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --encoder {conformer,transformer,rnn}
                        The encoder type (default: rnn)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --decoder {linear}    The decoder type (default: linear)
  --decoder_conf DECODER_CONF
                        The keyword arguments for decoder (default: {})
  --label_aggregator {label_aggregator}
                        The label_aggregator type (default: label_aggregator)
  --label_aggregator_conf LABEL_AGGREGATOR_CONF
                        The keyword arguments for label_aggregator (default: {})
  --attractor {rnn,None}
                        The attractor type (default: None)
  --attractor_conf ATTRACTOR_CONF
                        The keyword arguments for attractor (default: {})

enh_inference.py¶

usage: enh_inference.py [-h] [--config CONFIG]
                        [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                        --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                        [--dtype {float16,float32,float64}] [--fs FS]
                        [--num_workers NUM_WORKERS]
                        --data_path_and_name_and_type
                        DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                        [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                        [--normalize_output_wav NORMALIZE_OUTPUT_WAV]
                        [--train_config TRAIN_CONFIG]
                        [--model_file MODEL_FILE] [--model_tag MODEL_TAG]
                        [--batch_size BATCH_SIZE]
                        [--segment_size SEGMENT_SIZE] [--hop_size HOP_SIZE]
                        [--normalize_segment_scale NORMALIZE_SEGMENT_SCALE]
                        [--show_progressbar SHOW_PROGRESSBAR]
                        [--ref_channel REF_CHANNEL]

Frontend inference

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --fs FS               Sampling rate (default: 8000)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

Output data related:
  --normalize_output_wav NORMALIZE_OUTPUT_WAV
                        Whether to normalize the predicted wav to [-1~1]
                        (default: False)

The model configuration related:
  --train_config TRAIN_CONFIG
                        Training configuration file (default: None)
  --model_file MODEL_FILE
                        Model parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If specify this option,
                        train_config and model_file will be overwritten
                        (default: None)

Data loading related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)

SeparateSpeech related:
  --segment_size SEGMENT_SIZE
                        Segment length in seconds for segment-wise speech
                        enhancement/separation (default: None)
  --hop_size HOP_SIZE   Hop length in seconds for segment-wise speech
                        enhancement/separation (default: None)
  --normalize_segment_scale NORMALIZE_SEGMENT_SCALE
                        Whether to normalize the energy of the separated
                        streams in each segment (default: False)
  --show_progressbar SHOW_PROGRESSBAR
                        Whether to show a progress bar when performing
                        segment-wise speech enhancement/separation (default:
                        False)
  --ref_channel REF_CHANNEL
                        If not None, this will overwrite the ref_channel
                        defined in the separator module (for multi-channel
                        speech processing) (default: None)

enh_scoring.py¶

usage: enh_scoring.py [-h] [--config CONFIG]
                      [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                      --output_dir OUTPUT_DIR
                      [--dtype {float16,float32,float64}] --ref_scp REF_SCP
                      --inf_scp INF_SCP [--key_file KEY_FILE]
                      [--ref_channel REF_CHANNEL]

Frontend inference

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --dtype {float16,float32,float64}
                        Data type (default: float32)

Input data related:
  --ref_scp REF_SCP
  --inf_scp INF_SCP
  --key_file KEY_FILE
  --ref_channel REF_CHANNEL

enh_train.py¶

usage: enh_train.py [-h] [--config CONFIG] [--print_config]
                    [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                    [--dry_run DRY_RUN]
                    [--iterator_type {sequence,chunk,task,none}]
                    [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                    [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                    [--dist_backend DIST_BACKEND]
                    [--dist_init_method DIST_INIT_METHOD]
                    [--dist_world_size DIST_WORLD_SIZE]
                    [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                    [--dist_master_addr DIST_MASTER_ADDR]
                    [--dist_master_port DIST_MASTER_PORT]
                    [--dist_launcher {slurm,mpi,None}]
                    [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                    [--unused_parameters UNUSED_PARAMETERS]
                    [--sharded_ddp SHARDED_DDP]
                    [--cudnn_enabled CUDNN_ENABLED]
                    [--cudnn_benchmark CUDNN_BENCHMARK]
                    [--cudnn_deterministic CUDNN_DETERMINISTIC]
                    [--collect_stats COLLECT_STATS]
                    [--write_collected_feats WRITE_COLLECTED_FEATS]
                    [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                    [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                    [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                    [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                    [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                    [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                    [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                    [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                    [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                    [--train_dtype {float16,float32,float64}]
                    [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                    [--use_matplotlib USE_MATPLOTLIB]
                    [--use_tensorboard USE_TENSORBOARD]
                    [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                    [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                    [--wandb_name WANDB_NAME]
                    [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                    [--detect_anomaly DETECT_ANOMALY]
                    [--pretrain_path PRETRAIN_PATH]
                    [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                    [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                    [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                    [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                    [--batch_size BATCH_SIZE]
                    [--valid_batch_size VALID_BATCH_SIZE]
                    [--batch_bins BATCH_BINS]
                    [--valid_batch_bins VALID_BATCH_BINS]
                    [--train_shape_file TRAIN_SHAPE_FILE]
                    [--valid_shape_file VALID_SHAPE_FILE]
                    [--batch_type {unsorted,sorted,folded,length,numel}]
                    [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                    [--fold_length FOLD_LENGTH]
                    [--sort_in_batch {descending,ascending}]
                    [--sort_batch {descending,ascending}]
                    [--multiple_iterator MULTIPLE_ITERATOR]
                    [--chunk_length CHUNK_LENGTH]
                    [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                    [--num_cache_chunks NUM_CACHE_CHUNKS]
                    [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                    [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                    [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                    [--max_cache_size MAX_CACHE_SIZE]
                    [--max_cache_fd MAX_CACHE_FD]
                    [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                    [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}]
                    [--optim_conf OPTIM_CONF]
                    [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                    [--scheduler_conf SCHEDULER_CONF]
                    [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                    [--model_conf MODEL_CONF] [--criterions CRITERIONS]
                    [--use_preprocessor USE_PREPROCESSOR]
                    [--encoder {stft,conv,same}] [--encoder_conf ENCODER_CONF]
                    [--separator {rnn,skim,tcn,dprnn,dccrn,transformer,conformer,wpe_beamformer,asteroid}]
                    [--separator_conf SEPARATOR_CONF]
                    [--decoder {stft,conv,same}] [--decoder_conf DECODER_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number images to plot the outputs from attention. This option makes sense only when attention-based model. We can also disable the attention plot by setting it 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed trainingThis value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel  (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Perform on "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number epoch to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair referring the phase, "train" or "valid",and the criterion name. The mode specifying "min" or "max" can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging of early stopping. Give a pair referring the phase, "train" or "valid",the criterion name and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging of the best model. Give a pair referring the phase, "train" or "valid",the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of the used p-norm for gradient clip. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Just only iterating data loading without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if checkpoint is existing (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every the number iterations in each epochs at the training phase. If None is given, it is decided according the number of training samples automatically . (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsoleted (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization.e.g.
                          # Load all parameters  --init_param some/where/model.pth
                          # Load only decoder parameters  --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed  --init_param some/where/model.pth:decoder:decoder:decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has nothing in particular feature and just creates mini-batches which has constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which describes each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The fist column is referred, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input  in order to make each sample in a mini-batch has close length. This sampler requires a text file which describes the length for each sample

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of feature dimensions is referred, so 'shape_file' can be also used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is referred to the largest length of samples in the mini-batch. This samples requires length information as same as SortedBatchSampler

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches which have same number of 'bins' as possible counting by the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length for each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of feature dimensions is referred, so 'shape_file' can be also used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches which have same number of 'bins' as possible counting by the total number of elements of each feature instead of the length. Thus this sampler requires the full information of the dimension of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batches by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'.If multiple numbers separated by command are given, one of them is selected randomly for each samples. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than the all chunk_lengths, the sample is discarded.  (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it's less than 1, allows the overlapping and if bigger than 1, there are some gaps between each chunk. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle in the specified number of chunks and generate mini-batches More larger this value, more randomness can be obtained. (default: 1024)

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words splitted by comma. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types which supported by sndfile wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file in which is written a sequence of interger numbers separated by space.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file in which is written a sequence of interger numbers separated by comma.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file in which is written a sequence of float numbers separated by space.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file in which is written a sequence of float numbers separated by comma.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        A HDF5 file which contains arrays at the first level or the second level.   >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        rttm file loader, currently support for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>    SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>    SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>    END     file1 <NA> 4023 <NA> <NA> <NA> <NA>   ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow the arbitrary keys for mini-batch with ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, the 5 percent size of --max_cache_size (default: None)

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {'stft_consistency': False, 'loss_type': 'mask_mse', 'mask_type': None})
  --criterions CRITERIONS
                        The criterions binded with the loss wrappers. (default: [{'name': 'si_snr', 'conf': {}, 'wrapper': 'fixed_order', 'wrapper_conf': {}}])

  Preprocess related

  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: False)
  --encoder {stft,conv,same}
                        The encoder type (default: stft)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --separator {rnn,skim,tcn,dprnn,dccrn,transformer,conformer,wpe_beamformer,asteroid}
                        The separator type (default: rnn)
  --separator_conf SEPARATOR_CONF
                        The keyword arguments for separator (default: {})
  --decoder {stft,conv,same}
                        The decoder type (default: stft)
  --decoder_conf DECODER_CONF
                        The keyword arguments for decoder (default: {})

gan_tts_train.py¶

usage: gan_tts_train.py [-h] [--config CONFIG] [--print_config]
                        [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                        [--dry_run DRY_RUN]
                        [--iterator_type {sequence,chunk,task,none}]
                        [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                        [--num_workers NUM_WORKERS]
                        [--num_att_plot NUM_ATT_PLOT]
                        [--dist_backend DIST_BACKEND]
                        [--dist_init_method DIST_INIT_METHOD]
                        [--dist_world_size DIST_WORLD_SIZE]
                        [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                        [--dist_master_addr DIST_MASTER_ADDR]
                        [--dist_master_port DIST_MASTER_PORT]
                        [--dist_launcher {slurm,mpi,None}]
                        [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                        [--unused_parameters UNUSED_PARAMETERS]
                        [--sharded_ddp SHARDED_DDP]
                        [--cudnn_enabled CUDNN_ENABLED]
                        [--cudnn_benchmark CUDNN_BENCHMARK]
                        [--cudnn_deterministic CUDNN_DETERMINISTIC]
                        [--collect_stats COLLECT_STATS]
                        [--write_collected_feats WRITE_COLLECTED_FEATS]
                        [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                        [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                        [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                        [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                        [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                        [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                        [--grad_clip GRAD_CLIP]
                        [--grad_clip_type GRAD_CLIP_TYPE]
                        [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                        [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                        [--train_dtype {float16,float32,float64}]
                        [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                        [--use_matplotlib USE_MATPLOTLIB]
                        [--use_tensorboard USE_TENSORBOARD]
                        [--use_wandb USE_WANDB]
                        [--wandb_project WANDB_PROJECT] [--wandb_id WANDB_ID]
                        [--wandb_entity WANDB_ENTITY]
                        [--wandb_name WANDB_NAME]
                        [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                        [--detect_anomaly DETECT_ANOMALY]
                        [--pretrain_path PRETRAIN_PATH]
                        [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                        [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                        [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                        [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                        [--batch_size BATCH_SIZE]
                        [--valid_batch_size VALID_BATCH_SIZE]
                        [--batch_bins BATCH_BINS]
                        [--valid_batch_bins VALID_BATCH_BINS]
                        [--train_shape_file TRAIN_SHAPE_FILE]
                        [--valid_shape_file VALID_SHAPE_FILE]
                        [--batch_type {unsorted,sorted,folded,length,numel}]
                        [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                        [--fold_length FOLD_LENGTH]
                        [--sort_in_batch {descending,ascending}]
                        [--sort_batch {descending,ascending}]
                        [--multiple_iterator MULTIPLE_ITERATOR]
                        [--chunk_length CHUNK_LENGTH]
                        [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                        [--num_cache_chunks NUM_CACHE_CHUNKS]
                        [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                        [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                        [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                        [--max_cache_size MAX_CACHE_SIZE]
                        [--max_cache_fd MAX_CACHE_FD]
                        [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                        [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}]
                        [--optim_conf OPTIM_CONF]
                        [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                        [--scheduler_conf SCHEDULER_CONF]
                        [--optim2 {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}]
                        [--optim2_conf OPTIM2_CONF]
                        [--scheduler2 {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                        [--scheduler2_conf SCHEDULER2_CONF]
                        [--generator_first GENERATOR_FIRST]
                        [--token_list TOKEN_LIST] [--odim ODIM]
                        [--model_conf MODEL_CONF]
                        [--use_preprocessor USE_PREPROCESSOR]
                        [--token_type {bpe,char,word,phn}]
                        [--bpemodel BPEMODEL]
                        [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                        [--cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}]
                        [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,g2pk,g2pk_no_space,korean_jaso,korean_jaso_no_space}]
                        [--feats_extract {fbank,log_spectrogram,linear_spectrogram}]
                        [--feats_extract_conf FEATS_EXTRACT_CONF]
                        [--normalize {global_mvn,utterance_mvn,None}]
                        [--normalize_conf NORMALIZE_CONF]
                        [--tts {vits,joint_text2wav}] [--tts_conf TTS_CONF]
                        [--pitch_extract {dio,None}]
                        [--pitch_extract_conf PITCH_EXTRACT_CONF]
                        [--pitch_normalize {global_mvn,utterance_mvn,None}]
                        [--pitch_normalize_conf PITCH_NORMALIZE_CONF]
                        [--energy_extract {energy,None}]
                        [--energy_extract_conf ENERGY_EXTRACT_CONF]
                        [--energy_normalize {global_mvn,utterance_mvn,None}]
                        [--energy_normalize_conf ENERGY_NORMALIZE_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --generator_first GENERATOR_FIRST
                        Whether to update generator first. (default: False)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,g2pk,g2pk_no_space,korean_jaso,korean_jaso_no_space}
                        Specify g2p method if --token_type=phn (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number images to plot the outputs from attention. This option makes sense only when attention-based model. We can also disable the attention plot by setting it 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed trainingThis value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel  (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Perform on "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number epoch to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair referring the phase, "train" or "valid",and the criterion name. The mode specifying "min" or "max" can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging of early stopping. Give a pair referring the phase, "train" or "valid",the criterion name and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging of the best model. Give a pair referring the phase, "train" or "valid",the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of the used p-norm for gradient clip. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Just only iterating data loading without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if checkpoint is existing (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every the number iterations in each epochs at the training phase. If None is given, it is decided according the number of training samples automatically . (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsoleted (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization.e.g.
                          # Load all parameters  --init_param some/where/model.pth
                          # Load only decoder parameters  --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed  --init_param some/where/model.pth:decoder:decoder:decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has nothing in particular feature and just creates mini-batches which has constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which describes each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The fist column is referred, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input  in order to make each sample in a mini-batch has close length. This sampler requires a text file which describes the length for each sample

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of feature dimensions is referred, so 'shape_file' can be also used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is referred to the largest length of samples in the mini-batch. This samples requires length information as same as SortedBatchSampler

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches which have same number of 'bins' as possible counting by the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length for each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of feature dimensions is referred, so 'shape_file' can be also used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches which have same number of 'bins' as possible counting by the total number of elements of each feature instead of the length. Thus this sampler requires the full information of the dimension of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batches by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'.If multiple numbers separated by command are given, one of them is selected randomly for each samples. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than the all chunk_lengths, the sample is discarded.  (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it's less than 1, allows the overlapping and if bigger than 1, there are some gaps between each chunk. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle in the specified number of chunks and generate mini-batches More larger this value, more randomness can be obtained. (default: 1024)

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words splitted by comma. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types which supported by sndfile wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file in which is written a sequence of interger numbers separated by space.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file in which is written a sequence of interger numbers separated by comma.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file in which is written a sequence of float numbers separated by space.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file in which is written a sequence of float numbers separated by comma.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        A HDF5 file which contains arrays at the first level or the second level.   >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        rttm file loader, currently support for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>    SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>    SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>    END     file1 <NA> 4023 <NA> <NA> <NA> <NA>   ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow the arbitrary keys for mini-batch with ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, the 5 percent size of --max_cache_size (default: None)

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})
  --optim2 {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}
                        The optimizer type (default: adadelta)
  --optim2_conf OPTIM2_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler2 {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler2_conf SCHEDULER2_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --token_list TOKEN_LIST
                        A text mapping int-id to token (default: None)
  --odim ODIM           The number of dimension of output feature (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {})

  Preprocess related

  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word,phn}
                        The text will be tokenized in the specified level token (default: phn)
  --bpemodel BPEMODEL   The model file of sentencepiece (default: None)
  --feats_extract {fbank,log_spectrogram,linear_spectrogram}
                        The feats_extract type (default: linear_spectrogram)
  --feats_extract_conf FEATS_EXTRACT_CONF
                        The keyword arguments for feats_extract (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: None)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --tts {vits,joint_text2wav}
                        The tts type (default: vits)
  --tts_conf TTS_CONF   The keyword arguments for tts (default: {})
  --pitch_extract {dio,None}
                        The pitch_extract type (default: None)
  --pitch_extract_conf PITCH_EXTRACT_CONF
                        The keyword arguments for pitch_extract (default: {})
  --pitch_normalize {global_mvn,utterance_mvn,None}
                        The pitch_normalize type (default: None)
  --pitch_normalize_conf PITCH_NORMALIZE_CONF
                        The keyword arguments for pitch_normalize (default: {})
  --energy_extract {energy,None}
                        The energy_extract type (default: None)
  --energy_extract_conf ENERGY_EXTRACT_CONF
                        The keyword arguments for energy_extract (default: {})
  --energy_normalize {global_mvn,utterance_mvn,None}
                        The energy_normalize type (default: None)
  --energy_normalize_conf ENERGY_NORMALIZE_CONF
                        The keyword arguments for energy_normalize (default: {})

hubert_train.py¶

usage: hubert_train.py [-h] [--config CONFIG] [--print_config]
                       [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                       [--dry_run DRY_RUN]
                       [--iterator_type {sequence,chunk,task,none}]
                       [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                       [--num_workers NUM_WORKERS]
                       [--num_att_plot NUM_ATT_PLOT]
                       [--dist_backend DIST_BACKEND]
                       [--dist_init_method DIST_INIT_METHOD]
                       [--dist_world_size DIST_WORLD_SIZE]
                       [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                       [--dist_master_addr DIST_MASTER_ADDR]
                       [--dist_master_port DIST_MASTER_PORT]
                       [--dist_launcher {slurm,mpi,None}]
                       [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                       [--unused_parameters UNUSED_PARAMETERS]
                       [--sharded_ddp SHARDED_DDP]
                       [--cudnn_enabled CUDNN_ENABLED]
                       [--cudnn_benchmark CUDNN_BENCHMARK]
                       [--cudnn_deterministic CUDNN_DETERMINISTIC]
                       [--collect_stats COLLECT_STATS]
                       [--write_collected_feats WRITE_COLLECTED_FEATS]
                       [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                       [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                       [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                       [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                       [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                       [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                       [--grad_clip GRAD_CLIP]
                       [--grad_clip_type GRAD_CLIP_TYPE]
                       [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                       [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                       [--train_dtype {float16,float32,float64}]
                       [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                       [--use_matplotlib USE_MATPLOTLIB]
                       [--use_tensorboard USE_TENSORBOARD]
                       [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                       [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                       [--wandb_name WANDB_NAME]
                       [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                       [--detect_anomaly DETECT_ANOMALY]
                       [--pretrain_path PRETRAIN_PATH]
                       [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                       [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                       [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                       [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                       [--batch_size BATCH_SIZE]
                       [--valid_batch_size VALID_BATCH_SIZE]
                       [--batch_bins BATCH_BINS]
                       [--valid_batch_bins VALID_BATCH_BINS]
                       [--train_shape_file TRAIN_SHAPE_FILE]
                       [--valid_shape_file VALID_SHAPE_FILE]
                       [--batch_type {unsorted,sorted,folded,length,numel}]
                       [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                       [--fold_length FOLD_LENGTH]
                       [--sort_in_batch {descending,ascending}]
                       [--sort_batch {descending,ascending}]
                       [--multiple_iterator MULTIPLE_ITERATOR]
                       [--chunk_length CHUNK_LENGTH]
                       [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                       [--num_cache_chunks NUM_CACHE_CHUNKS]
                       [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                       [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                       [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                       [--max_cache_size MAX_CACHE_SIZE]
                       [--max_cache_fd MAX_CACHE_FD]
                       [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                       [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}]
                       [--optim_conf OPTIM_CONF]
                       [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                       [--scheduler_conf SCHEDULER_CONF]
                       [--token_list TOKEN_LIST]
                       [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                       [--input_size INPUT_SIZE] [--model_conf MODEL_CONF]
                       [--use_preprocessor USE_PREPROCESSOR]
                       [--token_type {bpe,char,word,phn}]
                       [--bpemodel BPEMODEL]
                       [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                       [--cleaner {None,tacotron,jaconv,vietnamese}]
                       [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,g2pk,g2pk_no_space,korean_jaso,korean_jaso_no_space}]
                       [--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
                       [--rir_scp RIR_SCP] [--rir_apply_prob RIR_APPLY_PROB]
                       [--noise_scp NOISE_SCP]
                       [--noise_apply_prob NOISE_APPLY_PROB]
                       [--noise_db_range NOISE_DB_RANGE]
                       [--pred_masked_weight PRED_MASKED_WEIGHT]
                       [--pred_nomask_weight PRED_NOMASK_WEIGHT]
                       [--loss_weights LOSS_WEIGHTS]
                       [--hubert_dict HUBERT_DICT]
                       [--frontend {default,sliding_window}]
                       [--frontend_conf FRONTEND_CONF]
                       [--specaug {specaug,None}]
                       [--specaug_conf SPECAUG_CONF]
                       [--normalize {global_mvn,utterance_mvn,None}]
                       [--normalize_conf NORMALIZE_CONF]
                       [--preencoder {sinc,None}]
                       [--preencoder_conf PREENCODER_CONF]
                       [--encoder {hubert_pretrain}]
                       [--encoder_conf ENCODER_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,g2pk,g2pk_no_space,korean_jaso,korean_jaso_no_space}
                        Specify g2p method if --token_type=phn (default: None)
  --speech_volume_normalize SPEECH_VOLUME_NORMALIZE
                        Scale the maximum amplitude to the given value. (default: None)
  --rir_scp RIR_SCP     The file path of rir scp file. (default: None)
  --rir_apply_prob RIR_APPLY_PROB
                        THe probability for applying RIR convolution. (default: 1.0)
  --noise_scp NOISE_SCP
                        The file path of noise scp file. (default: None)
  --noise_apply_prob NOISE_APPLY_PROB
                        The probability applying Noise adding. (default: 1.0)
  --noise_db_range NOISE_DB_RANGE
                        The range of noise decibel level. (default: 13_15)
  --pred_masked_weight PRED_MASKED_WEIGHT
                        weight for predictive loss for masked frames (default: 1.0)
  --pred_nomask_weight PRED_NOMASK_WEIGHT
                        weight for predictive loss for unmasked frames (default: 0.0)
  --loss_weights LOSS_WEIGHTS
                        weights for additional loss terms (not first one) (default: 0.0)
  --hubert_dict HUBERT_DICT
                        word-based target dictionary for Hubert pretraining stage (default: ./dict.txt)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number images to plot the outputs from attention. This option makes sense only when attention-based model. We can also disable the attention plot by setting it 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed trainingThis value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel  (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Perform on "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number epoch to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair referring the phase, "train" or "valid",and the criterion name. The mode specifying "min" or "max" can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging of early stopping. Give a pair referring the phase, "train" or "valid",the criterion name and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging of the best model. Give a pair referring the phase, "train" or "valid",the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of the used p-norm for gradient clip. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Just only iterating data loading without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if checkpoint is existing (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every the number iterations in each epochs at the training phase. If None is given, it is decided according the number of training samples automatically . (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsoleted (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization.e.g.
                          # Load all parameters  --init_param some/where/model.pth
                          # Load only decoder parameters  --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed  --init_param some/where/model.pth:decoder:decoder:decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has nothing in particular feature and just creates mini-batches which has constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which describes each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The fist column is referred, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input  in order to make each sample in a mini-batch has close length. This sampler requires a text file which describes the length for each sample

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of feature dimensions is referred, so 'shape_file' can be also used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is referred to the largest length of samples in the mini-batch. This samples requires length information as same as SortedBatchSampler

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches which have same number of 'bins' as possible counting by the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length for each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of feature dimensions is referred, so 'shape_file' can be also used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches which have same number of 'bins' as possible counting by the total number of elements of each feature instead of the length. Thus this sampler requires the full information of the dimension of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batches by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'.If multiple numbers separated by command are given, one of them is selected randomly for each samples. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than the all chunk_lengths, the sample is discarded.  (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it's less than 1, allows the overlapping and if bigger than 1, there are some gaps between each chunk. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle in the specified number of chunks and generate mini-batches More larger this value, more randomness can be obtained. (default: 1024)

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words splitted by comma. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types which supported by sndfile wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file in which is written a sequence of interger numbers separated by space.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file in which is written a sequence of interger numbers separated by comma.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file in which is written a sequence of float numbers separated by space.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file in which is written a sequence of float numbers separated by comma.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        A HDF5 file which contains arrays at the first level or the second level.   >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        rttm file loader, currently support for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>    SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>    SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>    END     file1 <NA> 4023 <NA> <NA> <NA> <NA>   ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow the arbitrary keys for mini-batch with ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, the 5 percent size of --max_cache_size (default: None)

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --token_list TOKEN_LIST
                        A text mapping int-id to token (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --input_size INPUT_SIZE
                        The number of input dimension of the feature (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {'ignore_id': -1, 'lsm_weight': 0.0, 'length_normalized_loss': False, 'report_cer': False, 'report_wer': False, 'sym_space': '<space>', 'sym_blank': '<blank>', 'pred_masked_weight': 1.0, 'pred_nomask_weight': 0.0, 'loss_weights': 0.0})

  Preprocess related

  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word,phn}
                        The text will be tokenized in the specified level token (default: bpe)
  --bpemodel BPEMODEL   The model file of sentencepiece (default: None)
  --frontend {default,sliding_window}
                        The frontend type (default: default)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: utterance_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --preencoder {sinc,None}
                        The preencoder type (default: None)
  --preencoder_conf PREENCODER_CONF
                        The keyword arguments for preencoder (default: {})
  --encoder {hubert_pretrain}
                        The encoder type (default: hubert_pretrain)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})

launch.py¶

usage: launch.py [-h] [--cmd CMD] [--log LOG]
                 [--max_num_log_files MAX_NUM_LOG_FILES] [--ngpu NGPU]
                 [--num_nodes NUM_NODES | --host HOST] [--envfile ENVFILE]
                 [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                 [--master_port MASTER_PORT] [--master_addr MASTER_ADDR]
                 [--init_file_prefix INIT_FILE_PREFIX]
                 args [args ...]

Launch distributed process with appropriate options.

positional arguments:
  args

optional arguments:
  --cmd CMD             The path of cmd script of Kaldi: run.pl. queue.pl, or
                        slurm.pl (default: utils/run.pl)
  --log LOG             The path of log file used by cmd (default: run.log)
  --max_num_log_files MAX_NUM_LOG_FILES
                        The maximum number of log-files to be kept (default:
                        1000)
  --ngpu NGPU           The number of GPUs per node (default: 1)
  --num_nodes NUM_NODES
                        The number of nodes (default: 1)
  --host HOST           Directly specify the host names. The job are submitted
                        via SSH. Multiple host names can be specified by
                        splitting by comma. e.g. host1,host2 You can also the
                        device id after the host name with ':'. e.g.
                        host1:0:2:3,host2:0:2. If the device ids are specified
                        in this way, the value of --ngpu is ignored. (default:
                        None)
  --envfile ENVFILE     Source the shell script before executing command. This
                        option is used when --host is specified. (default:
                        path.sh)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Distributed method is used when single-node mode.
                        (default: True)
  --master_port MASTER_PORT
                        Specify the port number of masterMaster is a host
                        machine has RANK0 process. (default: None)
  --master_addr MASTER_ADDR
                        Specify the address s of master. Master is a host
                        machine has RANK0 process. (default: None)
  --init_file_prefix INIT_FILE_PREFIX
                        The file name prefix for init_file, which is used for
                        'Shared-file system initialization'. This option is
                        used when --port is not specified (default:
                        .dist_init_)

lm_calc_perplexity.py¶

usage: lm_calc_perplexity.py [-h] [--config CONFIG]
                             [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                             --output_dir OUTPUT_DIR [--ngpu NGPU]
                             [--seed SEED] [--dtype {float16,float32,float64}]
                             [--num_workers NUM_WORKERS]
                             [--batch_size BATCH_SIZE] [--log_base LOG_BASE]
                             --data_path_and_name_and_type
                             DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                             [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                             [--train_config TRAIN_CONFIG]
                             [--model_file MODEL_FILE]

Calc perplexity

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --log_base LOG_BASE   The base of logarithm for Perplexity. If None,
                        napier's constant is used. (default: None)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --train_config TRAIN_CONFIG
  --model_file MODEL_FILE

lm_train.py¶

usage: lm_train.py [-h] [--config CONFIG] [--print_config]
                   [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                   [--dry_run DRY_RUN]
                   [--iterator_type {sequence,chunk,task,none}]
                   [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                   [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                   [--dist_backend DIST_BACKEND]
                   [--dist_init_method DIST_INIT_METHOD]
                   [--dist_world_size DIST_WORLD_SIZE] [--dist_rank DIST_RANK]
                   [--local_rank LOCAL_RANK]
                   [--dist_master_addr DIST_MASTER_ADDR]
                   [--dist_master_port DIST_MASTER_PORT]
                   [--dist_launcher {slurm,mpi,None}]
                   [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                   [--unused_parameters UNUSED_PARAMETERS]
                   [--sharded_ddp SHARDED_DDP] [--cudnn_enabled CUDNN_ENABLED]
                   [--cudnn_benchmark CUDNN_BENCHMARK]
                   [--cudnn_deterministic CUDNN_DETERMINISTIC]
                   [--collect_stats COLLECT_STATS]
                   [--write_collected_feats WRITE_COLLECTED_FEATS]
                   [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                   [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                   [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                   [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                   [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                   [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                   [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                   [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                   [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                   [--train_dtype {float16,float32,float64}]
                   [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                   [--use_matplotlib USE_MATPLOTLIB]
                   [--use_tensorboard USE_TENSORBOARD] [--use_wandb USE_WANDB]
                   [--wandb_project WANDB_PROJECT] [--wandb_id WANDB_ID]
                   [--wandb_entity WANDB_ENTITY] [--wandb_name WANDB_NAME]
                   [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                   [--detect_anomaly DETECT_ANOMALY]
                   [--pretrain_path PRETRAIN_PATH]
                   [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                   [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                   [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                   [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                   [--batch_size BATCH_SIZE]
                   [--valid_batch_size VALID_BATCH_SIZE]
                   [--batch_bins BATCH_BINS]
                   [--valid_batch_bins VALID_BATCH_BINS]
                   [--train_shape_file TRAIN_SHAPE_FILE]
                   [--valid_shape_file VALID_SHAPE_FILE]
                   [--batch_type {unsorted,sorted,folded,length,numel}]
                   [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                   [--fold_length FOLD_LENGTH]
                   [--sort_in_batch {descending,ascending}]
                   [--sort_batch {descending,ascending}]
                   [--multiple_iterator MULTIPLE_ITERATOR]
                   [--chunk_length CHUNK_LENGTH]
                   [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                   [--num_cache_chunks NUM_CACHE_CHUNKS]
                   [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                   [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                   [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                   [--max_cache_size MAX_CACHE_SIZE]
                   [--max_cache_fd MAX_CACHE_FD]
                   [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                   [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}]
                   [--optim_conf OPTIM_CONF]
                   [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                   [--scheduler_conf SCHEDULER_CONF] [--token_list TOKEN_LIST]
                   [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                   [--model_conf MODEL_CONF]
                   [--use_preprocessor USE_PREPROCESSOR]
                   [--token_type {bpe,char,word}] [--bpemodel BPEMODEL]
                   [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                   [--cleaner {None,tacotron,jaconv,vietnamese}]
                   [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,g2pk,g2pk_no_space,korean_jaso,korean_jaso_no_space}]
                   [--lm {seq_rnn,transformer}] [--lm_conf LM_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,g2pk,g2pk_no_space,korean_jaso,korean_jaso_no_space}
                        Specify g2p method if --token_type=phn (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number images to plot the outputs from attention. This option makes sense only when attention-based model. We can also disable the attention plot by setting it 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed trainingThis value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel  (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Perform on "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number epoch to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair referring the phase, "train" or "valid",and the criterion name. The mode specifying "min" or "max" can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging of early stopping. Give a pair referring the phase, "train" or "valid",the criterion name and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging of the best model. Give a pair referring the phase, "train" or "valid",the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of the used p-norm for gradient clip. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Just only iterating data loading without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if checkpoint is existing (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every the number iterations in each epochs at the training phase. If None is given, it is decided according the number of training samples automatically . (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsoleted (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization.e.g.
                          # Load all parameters  --init_param some/where/model.pth
                          # Load only decoder parameters  --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed  --init_param some/where/model.pth:decoder:decoder:decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has nothing in particular feature and just creates mini-batches which has constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which describes each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The fist column is referred, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input  in order to make each sample in a mini-batch has close length. This sampler requires a text file which describes the length for each sample

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of feature dimensions is referred, so 'shape_file' can be also used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is referred to the largest length of samples in the mini-batch. This samples requires length information as same as SortedBatchSampler

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches which have same number of 'bins' as possible counting by the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length for each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of feature dimensions is referred, so 'shape_file' can be also used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches which have same number of 'bins' as possible counting by the total number of elements of each feature instead of the length. Thus this sampler requires the full information of the dimension of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batches by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'.If multiple numbers separated by command are given, one of them is selected randomly for each samples. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than the all chunk_lengths, the sample is discarded.  (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it's less than 1, allows the overlapping and if bigger than 1, there are some gaps between each chunk. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle in the specified number of chunks and generate mini-batches More larger this value, more randomness can be obtained. (default: 1024)

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words splitted by comma. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types which supported by sndfile wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file in which is written a sequence of interger numbers separated by space.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file in which is written a sequence of interger numbers separated by comma.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file in which is written a sequence of float numbers separated by space.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file in which is written a sequence of float numbers separated by comma.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        A HDF5 file which contains arrays at the first level or the second level.   >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        rttm file loader, currently support for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>    SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>    SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>    END     file1 <NA> 4023 <NA> <NA> <NA> <NA>   ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow the arbitrary keys for mini-batch with ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, the 5 percent size of --max_cache_size (default: None)

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --token_list TOKEN_LIST
                        A text mapping int-id to token (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {'ignore_id': 0})

  Preprocess related

  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word}
  --bpemodel BPEMODEL   The model file fo sentencepiece (default: None)
  --lm {seq_rnn,transformer}
                        The lm type (default: seq_rnn)
  --lm_conf LM_CONF     The keyword arguments for lm (default: {})

mt_inference.py¶

usage: mt_inference.py [-h] [--config CONFIG]
                       [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                       --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                       [--dtype {float16,float32,float64}]
                       [--num_workers NUM_WORKERS]
                       --data_path_and_name_and_type
                       DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                       [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                       [--mt_train_config MT_TRAIN_CONFIG]
                       [--mt_model_file MT_MODEL_FILE]
                       [--lm_train_config LM_TRAIN_CONFIG] [--lm_file LM_FILE]
                       [--word_lm_train_config WORD_LM_TRAIN_CONFIG]
                       [--word_lm_file WORD_LM_FILE] [--ngram_file NGRAM_FILE]
                       [--model_tag MODEL_TAG] [--batch_size BATCH_SIZE]
                       [--nbest NBEST] [--beam_size BEAM_SIZE]
                       [--penalty PENALTY] [--maxlenratio MAXLENRATIO]
                       [--minlenratio MINLENRATIO] [--lm_weight LM_WEIGHT]
                       [--ngram_weight NGRAM_WEIGHT]
                       [--token_type {char,bpe,None}] [--bpemodel BPEMODEL]

MT Decoding

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --mt_train_config MT_TRAIN_CONFIG
                        ST training configuration (default: None)
  --mt_model_file MT_MODEL_FILE
                        MT model parameter file (default: None)
  --lm_train_config LM_TRAIN_CONFIG
                        LM training configuration (default: None)
  --lm_file LM_FILE     LM parameter file (default: None)
  --word_lm_train_config WORD_LM_TRAIN_CONFIG
                        Word LM training configuration (default: None)
  --word_lm_file WORD_LM_FILE
                        Word LM parameter file (default: None)
  --ngram_file NGRAM_FILE
                        N-gram parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If specify this option,
                        *_train_config and *_file will be overwritten
                        (default: None)

Beam-search related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --nbest NBEST         Output N-best hypotheses (default: 1)
  --beam_size BEAM_SIZE
                        Beam size (default: 20)
  --penalty PENALTY     Insertion penalty (default: 0.0)
  --maxlenratio MAXLENRATIO
                        Input length ratio to obtain max output length. If
                        maxlenratio=0.0 (default), it uses a end-detect
                        function to automatically find maximum hypothesis
                        lengths.If maxlenratio<0.0, its absolute value is
                        interpretedas a constant max output length (default:
                        0.0)
  --minlenratio MINLENRATIO
                        Input length ratio to obtain min output length
                        (default: 0.0)
  --lm_weight LM_WEIGHT
                        RNNLM weight (default: 1.0)
  --ngram_weight NGRAM_WEIGHT
                        ngram weight (default: 0.9)

Text converter related:
  --token_type {char,bpe,None}
                        The token type for ST model. If not given, refers from
                        the training args (default: None)
  --bpemodel BPEMODEL   The model path of sentencepiece. If not given, refers
                        from the training args (default: None)

mt_train.py¶

usage: mt_train.py [-h] [--config CONFIG] [--print_config]
                   [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                   [--dry_run DRY_RUN]
                   [--iterator_type {sequence,chunk,task,none}]
                   [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                   [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                   [--dist_backend DIST_BACKEND]
                   [--dist_init_method DIST_INIT_METHOD]
                   [--dist_world_size DIST_WORLD_SIZE] [--dist_rank DIST_RANK]
                   [--local_rank LOCAL_RANK]
                   [--dist_master_addr DIST_MASTER_ADDR]
                   [--dist_master_port DIST_MASTER_PORT]
                   [--dist_launcher {slurm,mpi,None}]
                   [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                   [--unused_parameters UNUSED_PARAMETERS]
                   [--sharded_ddp SHARDED_DDP] [--cudnn_enabled CUDNN_ENABLED]
                   [--cudnn_benchmark CUDNN_BENCHMARK]
                   [--cudnn_deterministic CUDNN_DETERMINISTIC]
                   [--collect_stats COLLECT_STATS]
                   [--write_collected_feats WRITE_COLLECTED_FEATS]
                   [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                   [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                   [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                   [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                   [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                   [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                   [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                   [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                   [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                   [--train_dtype {float16,float32,float64}]
                   [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                   [--use_matplotlib USE_MATPLOTLIB]
                   [--use_tensorboard USE_TENSORBOARD] [--use_wandb USE_WANDB]
                   [--wandb_project WANDB_PROJECT] [--wandb_id WANDB_ID]
                   [--wandb_entity WANDB_ENTITY] [--wandb_name WANDB_NAME]
                   [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                   [--detect_anomaly DETECT_ANOMALY]
                   [--pretrain_path PRETRAIN_PATH]
                   [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                   [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                   [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                   [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                   [--batch_size BATCH_SIZE]
                   [--valid_batch_size VALID_BATCH_SIZE]
                   [--batch_bins BATCH_BINS]
                   [--valid_batch_bins VALID_BATCH_BINS]
                   [--train_shape_file TRAIN_SHAPE_FILE]
                   [--valid_shape_file VALID_SHAPE_FILE]
                   [--batch_type {unsorted,sorted,folded,length,numel}]
                   [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                   [--fold_length FOLD_LENGTH]
                   [--sort_in_batch {descending,ascending}]
                   [--sort_batch {descending,ascending}]
                   [--multiple_iterator MULTIPLE_ITERATOR]
                   [--chunk_length CHUNK_LENGTH]
                   [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                   [--num_cache_chunks NUM_CACHE_CHUNKS]
                   [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                   [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                   [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                   [--max_cache_size MAX_CACHE_SIZE]
                   [--max_cache_fd MAX_CACHE_FD]
                   [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                   [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}]
                   [--optim_conf OPTIM_CONF]
                   [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                   [--scheduler_conf SCHEDULER_CONF] [--token_list TOKEN_LIST]
                   [--src_token_list SRC_TOKEN_LIST]
                   [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                   [--input_size INPUT_SIZE] [--model_conf MODEL_CONF]
                   [--use_preprocessor USE_PREPROCESSOR]
                   [--token_type {bpe,char,word,phn}]
                   [--src_token_type {bpe,char,word,phn}]
                   [--bpemodel BPEMODEL] [--src_bpemodel SRC_BPEMODEL]
                   [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                   [--cleaner {None,tacotron,jaconv,vietnamese}]
                   [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,g2pk,g2pk_no_space,korean_jaso,korean_jaso_no_space}]
                   [--frontend {embed}] [--frontend_conf FRONTEND_CONF]
                   [--preencoder {sinc,linear,None}]
                   [--preencoder_conf PREENCODER_CONF]
                   [--encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn}]
                   [--encoder_conf ENCODER_CONF]
                   [--postencoder {hugging_face_transformers,None}]
                   [--postencoder_conf POSTENCODER_CONF]
                   [--decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}]
                   [--decoder_conf DECODER_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,g2pk,g2pk_no_space,korean_jaso,korean_jaso_no_space}
                        Specify g2p method if --token_type=phn (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number images to plot the outputs from attention. This option makes sense only when attention-based model. We can also disable the attention plot by setting it 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed trainingThis value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel  (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Perform on "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number epoch to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair referring the phase, "train" or "valid",and the criterion name. The mode specifying "min" or "max" can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging of early stopping. Give a pair referring the phase, "train" or "valid",the criterion name and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging of the best model. Give a pair referring the phase, "train" or "valid",the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of the used p-norm for gradient clip. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Just only iterating data loading without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if checkpoint is existing (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every the number iterations in each epochs at the training phase. If None is given, it is decided according the number of training samples automatically . (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsoleted (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization.e.g.
                          # Load all parameters  --init_param some/where/model.pth
                          # Load only decoder parameters  --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed  --init_param some/where/model.pth:decoder:decoder:decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has nothing in particular feature and just creates mini-batches which has constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which describes each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The fist column is referred, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input  in order to make each sample in a mini-batch has close length. This sampler requires a text file which describes the length for each sample

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of feature dimensions is referred, so 'shape_file' can be also used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is referred to the largest length of samples in the mini-batch. This samples requires length information as same as SortedBatchSampler

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches which have same number of 'bins' as possible counting by the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length for each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of feature dimensions is referred, so 'shape_file' can be also used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches which have same number of 'bins' as possible counting by the total number of elements of each feature instead of the length. Thus this sampler requires the full information of the dimension of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batches by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'.If multiple numbers separated by command are given, one of them is selected randomly for each samples. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than the all chunk_lengths, the sample is discarded.  (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it's less than 1, allows the overlapping and if bigger than 1, there are some gaps between each chunk. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle in the specified number of chunks and generate mini-batches More larger this value, more randomness can be obtained. (default: 1024)

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words splitted by comma. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types which supported by sndfile wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file in which is written a sequence of interger numbers separated by space.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file in which is written a sequence of interger numbers separated by comma.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file in which is written a sequence of float numbers separated by space.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file in which is written a sequence of float numbers separated by comma.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        A HDF5 file which contains arrays at the first level or the second level.   >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        rttm file loader, currently support for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>    SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>    SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>    END     file1 <NA> 4023 <NA> <NA> <NA> <NA>   ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow the arbitrary keys for mini-batch with ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, the 5 percent size of --max_cache_size (default: None)

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --token_list TOKEN_LIST
                        A text mapping int-id to token (for target language) (default: None)
  --src_token_list SRC_TOKEN_LIST
                        A text mapping int-id to token (for source language) (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --input_size INPUT_SIZE
                        The number of input dimension of the feature (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {'src_vocab_size': 0, 'src_token_list': [], 'ignore_id': -1, 'lsm_weight': 0.0, 'length_normalized_loss': False, 'report_bleu': True, 'sym_space': '<space>', 'sym_blank': '<blank>', 'extract_feats_in_collect_stats': True, 'share_decoder_input_output_embed': False, 'share_encoder_decoder_input_embed': False})

  Preprocess related

  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word,phn}
                        The target text will be tokenized in the specified level token (default: bpe)
  --src_token_type {bpe,char,word,phn}
                        The source text will be tokenized in the specified level token (default: bpe)
  --bpemodel BPEMODEL   The model file of sentencepiece (for target language) (default: None)
  --src_bpemodel SRC_BPEMODEL
                        The model file of sentencepiece (for source language) (default: None)
  --frontend {embed}    The frontend type (default: embed)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --preencoder {sinc,linear,None}
                        The preencoder type (default: None)
  --preencoder_conf PREENCODER_CONF
                        The keyword arguments for preencoder (default: {})
  --encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn}
                        The encoder type (default: rnn)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --postencoder {hugging_face_transformers,None}
                        The postencoder type (default: None)
  --postencoder_conf POSTENCODER_CONF
                        The keyword arguments for postencoder (default: {})
  --decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}
                        The decoder type (default: rnn)
  --decoder_conf DECODER_CONF
                        The keyword arguments for decoder (default: {})

pack.py¶

usage: pack.py [-h] {asr,st,tts,enh,diar} ...

Pack input files to archive format

positional arguments:
  {asr,st,tts,enh,diar}

optional arguments:

split_scps.py¶

usage: split_scps.py [-h]
                     [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                     --scps SCPS [SCPS ...] [--names NAMES [NAMES ...]]
                     [--num_splits NUM_SPLITS] --output_dir OUTPUT_DIR

Split scp files

optional arguments:
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --scps SCPS [SCPS ...]
                        Input texts (default: None)
  --names NAMES [NAMES ...]
                        Output names for each files (default: None)
  --num_splits NUM_SPLITS
                        Split number (default: None)
  --output_dir OUTPUT_DIR
                        Output directory (default: None)

st_inference.py¶

usage: st_inference.py [-h] [--config CONFIG]
                       [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                       --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                       [--dtype {float16,float32,float64}]
                       [--num_workers NUM_WORKERS]
                       --data_path_and_name_and_type
                       DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                       [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                       [--st_train_config ST_TRAIN_CONFIG]
                       [--st_model_file ST_MODEL_FILE]
                       [--lm_train_config LM_TRAIN_CONFIG] [--lm_file LM_FILE]
                       [--word_lm_train_config WORD_LM_TRAIN_CONFIG]
                       [--word_lm_file WORD_LM_FILE] [--ngram_file NGRAM_FILE]
                       [--model_tag MODEL_TAG] [--batch_size BATCH_SIZE]
                       [--nbest NBEST] [--beam_size BEAM_SIZE]
                       [--penalty PENALTY] [--maxlenratio MAXLENRATIO]
                       [--minlenratio MINLENRATIO] [--lm_weight LM_WEIGHT]
                       [--ngram_weight NGRAM_WEIGHT]
                       [--token_type {char,bpe,None}] [--bpemodel BPEMODEL]

ST Decoding

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --st_train_config ST_TRAIN_CONFIG
                        ST training configuration (default: None)
  --st_model_file ST_MODEL_FILE
                        ST model parameter file (default: None)
  --lm_train_config LM_TRAIN_CONFIG
                        LM training configuration (default: None)
  --lm_file LM_FILE     LM parameter file (default: None)
  --word_lm_train_config WORD_LM_TRAIN_CONFIG
                        Word LM training configuration (default: None)
  --word_lm_file WORD_LM_FILE
                        Word LM parameter file (default: None)
  --ngram_file NGRAM_FILE
                        N-gram parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If specify this option,
                        *_train_config and *_file will be overwritten
                        (default: None)

Beam-search related:
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --nbest NBEST         Output N-best hypotheses (default: 1)
  --beam_size BEAM_SIZE
                        Beam size (default: 20)
  --penalty PENALTY     Insertion penalty (default: 0.0)
  --maxlenratio MAXLENRATIO
                        Input length ratio to obtain max output length. If
                        maxlenratio=0.0 (default), it uses a end-detect
                        function to automatically find maximum hypothesis
                        lengths.If maxlenratio<0.0, its absolute value is
                        interpretedas a constant max output length (default:
                        0.0)
  --minlenratio MINLENRATIO
                        Input length ratio to obtain min output length
                        (default: 0.0)
  --lm_weight LM_WEIGHT
                        RNNLM weight (default: 1.0)
  --ngram_weight NGRAM_WEIGHT
                        ngram weight (default: 0.9)

Text converter related:
  --token_type {char,bpe,None}
                        The token type for ST model. If not given, refers from
                        the training args (default: None)
  --bpemodel BPEMODEL   The model path of sentencepiece. If not given, refers
                        from the training args (default: None)

st_train.py¶

usage: st_train.py [-h] [--config CONFIG] [--print_config]
                   [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                   [--dry_run DRY_RUN]
                   [--iterator_type {sequence,chunk,task,none}]
                   [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                   [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                   [--dist_backend DIST_BACKEND]
                   [--dist_init_method DIST_INIT_METHOD]
                   [--dist_world_size DIST_WORLD_SIZE] [--dist_rank DIST_RANK]
                   [--local_rank LOCAL_RANK]
                   [--dist_master_addr DIST_MASTER_ADDR]
                   [--dist_master_port DIST_MASTER_PORT]
                   [--dist_launcher {slurm,mpi,None}]
                   [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                   [--unused_parameters UNUSED_PARAMETERS]
                   [--sharded_ddp SHARDED_DDP] [--cudnn_enabled CUDNN_ENABLED]
                   [--cudnn_benchmark CUDNN_BENCHMARK]
                   [--cudnn_deterministic CUDNN_DETERMINISTIC]
                   [--collect_stats COLLECT_STATS]
                   [--write_collected_feats WRITE_COLLECTED_FEATS]
                   [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                   [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                   [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                   [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                   [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                   [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                   [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                   [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                   [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                   [--train_dtype {float16,float32,float64}]
                   [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                   [--use_matplotlib USE_MATPLOTLIB]
                   [--use_tensorboard USE_TENSORBOARD] [--use_wandb USE_WANDB]
                   [--wandb_project WANDB_PROJECT] [--wandb_id WANDB_ID]
                   [--wandb_entity WANDB_ENTITY] [--wandb_name WANDB_NAME]
                   [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                   [--detect_anomaly DETECT_ANOMALY]
                   [--pretrain_path PRETRAIN_PATH]
                   [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                   [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                   [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                   [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                   [--batch_size BATCH_SIZE]
                   [--valid_batch_size VALID_BATCH_SIZE]
                   [--batch_bins BATCH_BINS]
                   [--valid_batch_bins VALID_BATCH_BINS]
                   [--train_shape_file TRAIN_SHAPE_FILE]
                   [--valid_shape_file VALID_SHAPE_FILE]
                   [--batch_type {unsorted,sorted,folded,length,numel}]
                   [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                   [--fold_length FOLD_LENGTH]
                   [--sort_in_batch {descending,ascending}]
                   [--sort_batch {descending,ascending}]
                   [--multiple_iterator MULTIPLE_ITERATOR]
                   [--chunk_length CHUNK_LENGTH]
                   [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                   [--num_cache_chunks NUM_CACHE_CHUNKS]
                   [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                   [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                   [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                   [--max_cache_size MAX_CACHE_SIZE]
                   [--max_cache_fd MAX_CACHE_FD]
                   [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                   [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}]
                   [--optim_conf OPTIM_CONF]
                   [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                   [--scheduler_conf SCHEDULER_CONF] [--token_list TOKEN_LIST]
                   [--src_token_list SRC_TOKEN_LIST]
                   [--init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}]
                   [--input_size INPUT_SIZE] [--ctc_conf CTC_CONF]
                   [--model_conf MODEL_CONF]
                   [--use_preprocessor USE_PREPROCESSOR]
                   [--token_type {bpe,char,word,phn}]
                   [--src_token_type {bpe,char,word,phn}]
                   [--bpemodel BPEMODEL] [--src_bpemodel SRC_BPEMODEL]
                   [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                   [--cleaner {None,tacotron,jaconv,vietnamese}]
                   [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,g2pk,g2pk_no_space,korean_jaso,korean_jaso_no_space}]
                   [--speech_volume_normalize SPEECH_VOLUME_NORMALIZE]
                   [--rir_scp RIR_SCP] [--rir_apply_prob RIR_APPLY_PROB]
                   [--noise_scp NOISE_SCP]
                   [--noise_apply_prob NOISE_APPLY_PROB]
                   [--noise_db_range NOISE_DB_RANGE]
                   [--frontend {default,sliding_window,s3prl}]
                   [--frontend_conf FRONTEND_CONF] [--specaug {specaug,None}]
                   [--specaug_conf SPECAUG_CONF]
                   [--normalize {global_mvn,utterance_mvn,None}]
                   [--normalize_conf NORMALIZE_CONF]
                   [--preencoder {sinc,linear,None}]
                   [--preencoder_conf PREENCODER_CONF]
                   [--encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain}]
                   [--encoder_conf ENCODER_CONF]
                   [--postencoder {hugging_face_transformers,None}]
                   [--postencoder_conf POSTENCODER_CONF]
                   [--decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}]
                   [--decoder_conf DECODER_CONF]
                   [--extra_asr_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}]
                   [--extra_asr_decoder_conf EXTRA_ASR_DECODER_CONF]
                   [--extra_mt_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}]
                   [--extra_mt_decoder_conf EXTRA_MT_DECODER_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,g2pk,g2pk_no_space,korean_jaso,korean_jaso_no_space}
                        Specify g2p method if --token_type=phn (default: None)
  --speech_volume_normalize SPEECH_VOLUME_NORMALIZE
                        Scale the maximum amplitude to the given value. (default: None)
  --rir_scp RIR_SCP     The file path of rir scp file. (default: None)
  --rir_apply_prob RIR_APPLY_PROB
                        THe probability for applying RIR convolution. (default: 1.0)
  --noise_scp NOISE_SCP
                        The file path of noise scp file. (default: None)
  --noise_apply_prob NOISE_APPLY_PROB
                        The probability applying Noise adding. (default: 1.0)
  --noise_db_range NOISE_DB_RANGE
                        The range of noise decibel level. (default: 13_15)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number images to plot the outputs from attention. This option makes sense only when attention-based model. We can also disable the attention plot by setting it 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed trainingThis value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel  (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Perform on "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number epoch to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair referring the phase, "train" or "valid",and the criterion name. The mode specifying "min" or "max" can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging of early stopping. Give a pair referring the phase, "train" or "valid",the criterion name and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging of the best model. Give a pair referring the phase, "train" or "valid",the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of the used p-norm for gradient clip. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Just only iterating data loading without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if checkpoint is existing (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every the number iterations in each epochs at the training phase. If None is given, it is decided according the number of training samples automatically . (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsoleted (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization.e.g.
                          # Load all parameters  --init_param some/where/model.pth
                          # Load only decoder parameters  --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed  --init_param some/where/model.pth:decoder:decoder:decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has nothing in particular feature and just creates mini-batches which has constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which describes each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The fist column is referred, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input  in order to make each sample in a mini-batch has close length. This sampler requires a text file which describes the length for each sample

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of feature dimensions is referred, so 'shape_file' can be also used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is referred to the largest length of samples in the mini-batch. This samples requires length information as same as SortedBatchSampler

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches which have same number of 'bins' as possible counting by the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length for each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of feature dimensions is referred, so 'shape_file' can be also used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches which have same number of 'bins' as possible counting by the total number of elements of each feature instead of the length. Thus this sampler requires the full information of the dimension of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batches by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'.If multiple numbers separated by command are given, one of them is selected randomly for each samples. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than the all chunk_lengths, the sample is discarded.  (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it's less than 1, allows the overlapping and if bigger than 1, there are some gaps between each chunk. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle in the specified number of chunks and generate mini-batches More larger this value, more randomness can be obtained. (default: 1024)

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words splitted by comma. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types which supported by sndfile wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file in which is written a sequence of interger numbers separated by space.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file in which is written a sequence of interger numbers separated by comma.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file in which is written a sequence of float numbers separated by space.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file in which is written a sequence of float numbers separated by comma.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        A HDF5 file which contains arrays at the first level or the second level.   >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        rttm file loader, currently support for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>    SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>    SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>    END     file1 <NA> 4023 <NA> <NA> <NA> <NA>   ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow the arbitrary keys for mini-batch with ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, the 5 percent size of --max_cache_size (default: None)

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --token_list TOKEN_LIST
                        A text mapping int-id to token (for target language) (default: None)
  --src_token_list SRC_TOKEN_LIST
                        A text mapping int-id to token (for source language) (default: None)
  --init {chainer,xavier_uniform,xavier_normal,kaiming_uniform,kaiming_normal,None}
                        The initialization method (default: None)
  --input_size INPUT_SIZE
                        The number of input dimension of the feature (default: None)
  --ctc_conf CTC_CONF   The keyword arguments for CTC class. (default: {'dropout_rate': 0.0, 'ctc_type': 'builtin', 'reduce': True, 'ignore_nan_grad': True})
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {'src_vocab_size': 0, 'src_token_list': [], 'asr_weight': 0.0, 'mt_weight': 0.0, 'mtlalpha': 0.0, 'ignore_id': -1, 'lsm_weight': 0.0, 'length_normalized_loss': False, 'report_cer': True, 'report_wer': True, 'report_bleu': True, 'sym_space': '<space>', 'sym_blank': '<blank>', 'extract_feats_in_collect_stats': True})

  Preprocess related

  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word,phn}
                        The target text will be tokenized in the specified level token (default: bpe)
  --src_token_type {bpe,char,word,phn}
                        The source text will be tokenized in the specified level token (default: bpe)
  --bpemodel BPEMODEL   The model file of sentencepiece (for target language) (default: None)
  --src_bpemodel SRC_BPEMODEL
                        The model file of sentencepiece (for source language) (default: None)
  --frontend {default,sliding_window,s3prl}
                        The frontend type (default: default)
  --frontend_conf FRONTEND_CONF
                        The keyword arguments for frontend (default: {})
  --specaug {specaug,None}
                        The specaug type (default: None)
  --specaug_conf SPECAUG_CONF
                        The keyword arguments for specaug (default: {})
  --normalize {global_mvn,utterance_mvn,None}
                        The normalize type (default: utterance_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --preencoder {sinc,linear,None}
                        The preencoder type (default: None)
  --preencoder_conf PREENCODER_CONF
                        The keyword arguments for preencoder (default: {})
  --encoder {conformer,transformer,contextual_block_transformer,vgg_rnn,rnn,wav2vec2,hubert,hubert_pretrain}
                        The encoder type (default: rnn)
  --encoder_conf ENCODER_CONF
                        The keyword arguments for encoder (default: {})
  --postencoder {hugging_face_transformers,None}
                        The postencoder type (default: None)
  --postencoder_conf POSTENCODER_CONF
                        The keyword arguments for postencoder (default: {})
  --decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}
                        The decoder type (default: rnn)
  --decoder_conf DECODER_CONF
                        The keyword arguments for decoder (default: {})
  --extra_asr_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}
                        The extra_asr_decoder type (default: rnn)
  --extra_asr_decoder_conf EXTRA_ASR_DECODER_CONF
                        The keyword arguments for extra_asr_decoder (default: {})
  --extra_mt_decoder {transformer,lightweight_conv,lightweight_conv2d,dynamic_conv,dynamic_conv2d,rnn}
                        The extra_mt_decoder type (default: rnn)
  --extra_mt_decoder_conf EXTRA_MT_DECODER_CONF
                        The keyword arguments for extra_mt_decoder (default: {})

tokenize_text.py¶

usage: tokenize_text.py [-h]
                        [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                        --input INPUT --output OUTPUT [--field FIELD]
                        [--token_type {char,bpe,word,phn}]
                        [--delimiter DELIMITER] [--space_symbol SPACE_SYMBOL]
                        [--bpemodel BPEMODEL]
                        [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                        [--remove_non_linguistic_symbols REMOVE_NON_LINGUISTIC_SYMBOLS]
                        [--cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}]
                        [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,g2pk,g2pk_no_space,korean_jaso,korean_jaso_no_space}]
                        [--write_vocabulary WRITE_VOCABULARY]
                        [--vocabulary_size VOCABULARY_SIZE] [--cutoff CUTOFF]
                        [--add_symbol ADD_SYMBOL]

Tokenize texts

optional arguments:
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --input INPUT, -i INPUT
                        Input text. - indicates sys.stdin (default: None)
  --output OUTPUT, -o OUTPUT
                        Output text. - indicates sys.stdout (default: None)
  --field FIELD, -f FIELD
                        The target columns of the input text as 1-based
                        integer. e.g 2- (default: None)
  --token_type {char,bpe,word,phn}, -t {char,bpe,word,phn}
                        Token type (default: char)
  --delimiter DELIMITER, -d DELIMITER
                        The delimiter (default: None)
  --space_symbol SPACE_SYMBOL
                        The space symbol (default: <space>)
  --bpemodel BPEMODEL   The bpemodel file path (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --remove_non_linguistic_symbols REMOVE_NON_LINGUISTIC_SYMBOLS
                        Remove non-language-symbols from tokens (default:
                        False)
  --cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,g2pk,g2pk_no_space,korean_jaso,korean_jaso_no_space}
                        Specify g2p method if --token_type=phn (default: None)

write_vocabulary mode related:
  --write_vocabulary WRITE_VOCABULARY
                        Write tokens list instead of tokenized text per line
                        (default: False)
  --vocabulary_size VOCABULARY_SIZE
                        Vocabulary size (default: 0)
  --cutoff CUTOFF       cut-off frequency used for write-vocabulary mode
                        (default: 0)
  --add_symbol ADD_SYMBOL
                        Append symbol e.g. --add_symbol '<blank>:0'
                        --add_symbol '<unk>:1' (default: [])

tts_inference.py¶

usage: tts_inference.py [-h] [--config CONFIG]
                        [--log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                        --output_dir OUTPUT_DIR [--ngpu NGPU] [--seed SEED]
                        [--dtype {float16,float32,float64}]
                        [--num_workers NUM_WORKERS] [--batch_size BATCH_SIZE]
                        --data_path_and_name_and_type
                        DATA_PATH_AND_NAME_AND_TYPE [--key_file KEY_FILE]
                        [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                        [--train_config TRAIN_CONFIG]
                        [--model_file MODEL_FILE] [--model_tag MODEL_TAG]
                        [--maxlenratio MAXLENRATIO]
                        [--minlenratio MINLENRATIO] [--threshold THRESHOLD]
                        [--use_att_constraint USE_ATT_CONSTRAINT]
                        [--backward_window BACKWARD_WINDOW]
                        [--forward_window FORWARD_WINDOW]
                        [--use_teacher_forcing USE_TEACHER_FORCING]
                        [--speed_control_alpha SPEED_CONTROL_ALPHA]
                        [--noise_scale NOISE_SCALE]
                        [--noise_scale_dur NOISE_SCALE_DUR]
                        [--always_fix_seed ALWAYS_FIX_SEED]
                        [--vocoder_config VOCODER_CONFIG]
                        [--vocoder_file VOCODER_FILE]
                        [--vocoder_tag VOCODER_TAG]

TTS inference

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --log_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --output_dir OUTPUT_DIR
                        The path of output directory (default: None)
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --dtype {float16,float32,float64}
                        Data type (default: float32)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --batch_size BATCH_SIZE
                        The batch size for inference (default: 1)
  --speed_control_alpha SPEED_CONTROL_ALPHA
                        Alpha in FastSpeech to change the speed of generated
                        speech (default: 1.0)
  --noise_scale NOISE_SCALE
                        Noise scale parameter for the flow in vits (default:
                        0.667)
  --noise_scale_dur NOISE_SCALE_DUR
                        Noise scale parameter for the stochastic duration
                        predictor in vits (default: 0.8)

Input data related:
  --data_path_and_name_and_type DATA_PATH_AND_NAME_AND_TYPE
  --key_file KEY_FILE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS

The model configuration related:
  --train_config TRAIN_CONFIG
                        Training configuration file (default: None)
  --model_file MODEL_FILE
                        Model parameter file (default: None)
  --model_tag MODEL_TAG
                        Pretrained model tag. If specify this option,
                        train_config and model_file will be overwritten
                        (default: None)

Decoding related:
  --maxlenratio MAXLENRATIO
                        Maximum length ratio in decoding (default: 10.0)
  --minlenratio MINLENRATIO
                        Minimum length ratio in decoding (default: 0.0)
  --threshold THRESHOLD
                        Threshold value in decoding (default: 0.5)
  --use_att_constraint USE_ATT_CONSTRAINT
                        Whether to use attention constraint (default: False)
  --backward_window BACKWARD_WINDOW
                        Backward window value in attention constraint
                        (default: 1)
  --forward_window FORWARD_WINDOW
                        Forward window value in attention constraint (default:
                        3)
  --use_teacher_forcing USE_TEACHER_FORCING
                        Whether to use teacher forcing (default: False)
  --always_fix_seed ALWAYS_FIX_SEED
                        Whether to always fix seed (default: False)

Vocoder related:
  --vocoder_config VOCODER_CONFIG
                        Vocoder configuration file (default: None)
  --vocoder_file VOCODER_FILE
                        Vocoder parameter file (default: None)
  --vocoder_tag VOCODER_TAG
                        Pretrained vocoder tag. If specify this option,
                        vocoder_config and vocoder_file will be overwritten
                        (default: None)

tts_train.py¶

usage: tts_train.py [-h] [--config CONFIG] [--print_config]
                    [--log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}]
                    [--dry_run DRY_RUN]
                    [--iterator_type {sequence,chunk,task,none}]
                    [--output_dir OUTPUT_DIR] [--ngpu NGPU] [--seed SEED]
                    [--num_workers NUM_WORKERS] [--num_att_plot NUM_ATT_PLOT]
                    [--dist_backend DIST_BACKEND]
                    [--dist_init_method DIST_INIT_METHOD]
                    [--dist_world_size DIST_WORLD_SIZE]
                    [--dist_rank DIST_RANK] [--local_rank LOCAL_RANK]
                    [--dist_master_addr DIST_MASTER_ADDR]
                    [--dist_master_port DIST_MASTER_PORT]
                    [--dist_launcher {slurm,mpi,None}]
                    [--multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED]
                    [--unused_parameters UNUSED_PARAMETERS]
                    [--sharded_ddp SHARDED_DDP]
                    [--cudnn_enabled CUDNN_ENABLED]
                    [--cudnn_benchmark CUDNN_BENCHMARK]
                    [--cudnn_deterministic CUDNN_DETERMINISTIC]
                    [--collect_stats COLLECT_STATS]
                    [--write_collected_feats WRITE_COLLECTED_FEATS]
                    [--max_epoch MAX_EPOCH] [--patience PATIENCE]
                    [--val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION]
                    [--early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION]
                    [--best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]]
                    [--keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]]
                    [--nbest_averaging_interval NBEST_AVERAGING_INTERVAL]
                    [--grad_clip GRAD_CLIP] [--grad_clip_type GRAD_CLIP_TYPE]
                    [--grad_noise GRAD_NOISE] [--accum_grad ACCUM_GRAD]
                    [--no_forward_run NO_FORWARD_RUN] [--resume RESUME]
                    [--train_dtype {float16,float32,float64}]
                    [--use_amp USE_AMP] [--log_interval LOG_INTERVAL]
                    [--use_matplotlib USE_MATPLOTLIB]
                    [--use_tensorboard USE_TENSORBOARD]
                    [--use_wandb USE_WANDB] [--wandb_project WANDB_PROJECT]
                    [--wandb_id WANDB_ID] [--wandb_entity WANDB_ENTITY]
                    [--wandb_name WANDB_NAME]
                    [--wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL]
                    [--detect_anomaly DETECT_ANOMALY]
                    [--pretrain_path PRETRAIN_PATH]
                    [--init_param [INIT_PARAM [INIT_PARAM ...]]]
                    [--ignore_init_mismatch IGNORE_INIT_MISMATCH]
                    [--freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]]
                    [--num_iters_per_epoch NUM_ITERS_PER_EPOCH]
                    [--batch_size BATCH_SIZE]
                    [--valid_batch_size VALID_BATCH_SIZE]
                    [--batch_bins BATCH_BINS]
                    [--valid_batch_bins VALID_BATCH_BINS]
                    [--train_shape_file TRAIN_SHAPE_FILE]
                    [--valid_shape_file VALID_SHAPE_FILE]
                    [--batch_type {unsorted,sorted,folded,length,numel}]
                    [--valid_batch_type {unsorted,sorted,folded,length,numel,None}]
                    [--fold_length FOLD_LENGTH]
                    [--sort_in_batch {descending,ascending}]
                    [--sort_batch {descending,ascending}]
                    [--multiple_iterator MULTIPLE_ITERATOR]
                    [--chunk_length CHUNK_LENGTH]
                    [--chunk_shift_ratio CHUNK_SHIFT_RATIO]
                    [--num_cache_chunks NUM_CACHE_CHUNKS]
                    [--train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE]
                    [--valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE]
                    [--allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS]
                    [--max_cache_size MAX_CACHE_SIZE]
                    [--max_cache_fd MAX_CACHE_FD]
                    [--valid_max_cache_size VALID_MAX_CACHE_SIZE]
                    [--optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}]
                    [--optim_conf OPTIM_CONF]
                    [--scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}]
                    [--scheduler_conf SCHEDULER_CONF]
                    [--token_list TOKEN_LIST] [--odim ODIM]
                    [--model_conf MODEL_CONF]
                    [--use_preprocessor USE_PREPROCESSOR]
                    [--token_type {bpe,char,word,phn}] [--bpemodel BPEMODEL]
                    [--non_linguistic_symbols NON_LINGUISTIC_SYMBOLS]
                    [--cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}]
                    [--g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,g2pk,g2pk_no_space,korean_jaso,korean_jaso_no_space}]
                    [--feats_extract {fbank,spectrogram,linear_spectrogram}]
                    [--feats_extract_conf FEATS_EXTRACT_CONF]
                    [--normalize {global_mvn,None}]
                    [--normalize_conf NORMALIZE_CONF]
                    [--tts {tacotron2,transformer,fastspeech,fastspeech2,vits,joint_text2wav}]
                    [--tts_conf TTS_CONF] [--pitch_extract {dio,None}]
                    [--pitch_extract_conf PITCH_EXTRACT_CONF]
                    [--pitch_normalize {global_mvn,None}]
                    [--pitch_normalize_conf PITCH_NORMALIZE_CONF]
                    [--energy_extract {energy,None}]
                    [--energy_extract_conf ENERGY_EXTRACT_CONF]
                    [--energy_normalize {global_mvn,None}]
                    [--energy_normalize_conf ENERGY_NORMALIZE_CONF]

base parser

optional arguments:
  --config CONFIG       Give config file in yaml format (default: None)
  --non_linguistic_symbols NON_LINGUISTIC_SYMBOLS
                        non_linguistic_symbols file path (default: None)
  --cleaner {None,tacotron,jaconv,vietnamese,korean_cleaner}
                        Apply text cleaning (default: None)
  --g2p {None,g2p_en,g2p_en_no_space,pyopenjtalk,pyopenjtalk_kana,pyopenjtalk_accent,pyopenjtalk_accent_with_pause,pyopenjtalk_prosody,pypinyin_g2p,pypinyin_g2p_phone,espeak_ng_arabic,espeak_ng_german,espeak_ng_french,espeak_ng_spanish,espeak_ng_russian,espeak_ng_greek,espeak_ng_finnish,espeak_ng_hungarian,espeak_ng_dutch,espeak_ng_english_us_vits,espeak_ng_hindi,g2pk,g2pk_no_space,korean_jaso,korean_jaso_no_space}
                        Specify g2p method if --token_type=phn (default: None)

Common configuration:
  --print_config        Print the config file and exit (default: False)
  --log_level {ERROR,WARNING,INFO,DEBUG,NOTSET}
                        The verbose level of logging (default: INFO)
  --dry_run DRY_RUN     Perform process without training (default: False)
  --iterator_type {sequence,chunk,task,none}
                        Specify iterator type (default: sequence)
  --output_dir OUTPUT_DIR
  --ngpu NGPU           The number of gpus. 0 indicates CPU mode (default: 0)
  --seed SEED           Random seed (default: 0)
  --num_workers NUM_WORKERS
                        The number of workers used for DataLoader (default: 1)
  --num_att_plot NUM_ATT_PLOT
                        The number images to plot the outputs from attention. This option makes sense only when attention-based model. We can also disable the attention plot by setting it 0 (default: 3)

distributed training related:
  --dist_backend DIST_BACKEND
                        distributed backend (default: nccl)
  --dist_init_method DIST_INIT_METHOD
                        if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are referred. (default: env://)
  --dist_world_size DIST_WORLD_SIZE
                        number of nodes for distributed training (default: None)
  --dist_rank DIST_RANK
                        node rank for distributed training (default: None)
  --local_rank LOCAL_RANK
                        local rank for distributed training. This option is used if --multiprocessing_distributed=false (default: None)
  --dist_master_addr DIST_MASTER_ADDR
                        The master address for distributed training. This value is used when dist_init_method == 'env://' (default: None)
  --dist_master_port DIST_MASTER_PORT
                        The master port for distributed trainingThis value is used when dist_init_method == 'env://' (default: None)
  --dist_launcher {slurm,mpi,None}
                        The launcher type for distributed training (default: None)
  --multiprocessing_distributed MULTIPROCESSING_DISTRIBUTED
                        Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single node or multi node data parallel training (default: False)
  --unused_parameters UNUSED_PARAMETERS
                        Whether to use the find_unused_parameters in torch.nn.parallel.DistributedDataParallel  (default: False)
  --sharded_ddp SHARDED_DDP
                        Enable sharded training provided by fairscale (default: False)

cudnn mode related:
  --cudnn_enabled CUDNN_ENABLED
                        Enable CUDNN (default: True)
  --cudnn_benchmark CUDNN_BENCHMARK
                        Enable cudnn-benchmark mode (default: False)
  --cudnn_deterministic CUDNN_DETERMINISTIC
                        Enable cudnn-deterministic mode (default: True)

collect stats mode related:
  --collect_stats COLLECT_STATS
                        Perform on "collect stats" mode (default: False)
  --write_collected_feats WRITE_COLLECTED_FEATS
                        Write the output features from the model when "collect stats" mode (default: False)

Trainer related:
  --max_epoch MAX_EPOCH
                        The maximum number epoch to train (default: 40)
  --patience PATIENCE   Number of epochs to wait without improvement before stopping the training (default: None)
  --val_scheduler_criterion VAL_SCHEDULER_CRITERION VAL_SCHEDULER_CRITERION
                        The criterion used for the value given to the lr scheduler. Give a pair referring the phase, "train" or "valid",and the criterion name. The mode specifying "min" or "max" can be changed by --scheduler_conf (default: ('valid', 'loss'))
  --early_stopping_criterion EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION EARLY_STOPPING_CRITERION
                        The criterion used for judging of early stopping. Give a pair referring the phase, "train" or "valid",the criterion name and the mode, "min" or "max", e.g. "acc,max". (default: ('valid', 'loss', 'min'))
  --best_model_criterion BEST_MODEL_CRITERION [BEST_MODEL_CRITERION ...]
                        The criterion used for judging of the best model. Give a pair referring the phase, "train" or "valid",the criterion name, and the mode, "min" or "max", e.g. "acc,max". (default: [('train', 'loss', 'min'), ('valid', 'loss', 'min'), ('train', 'acc', 'max'), ('valid', 'acc', 'max')])
  --keep_nbest_models KEEP_NBEST_MODELS [KEEP_NBEST_MODELS ...]
                        Remove previous snapshots excluding the n-best scored epochs (default: [10])
  --nbest_averaging_interval NBEST_AVERAGING_INTERVAL
                        The epoch interval to apply model averaging and save nbest models (default: 0)
  --grad_clip GRAD_CLIP
                        Gradient norm threshold to clip (default: 5.0)
  --grad_clip_type GRAD_CLIP_TYPE
                        The type of the used p-norm for gradient clip. Can be inf (default: 2.0)
  --grad_noise GRAD_NOISE
                        The flag to switch to use noise injection to gradients during training (default: False)
  --accum_grad ACCUM_GRAD
                        The number of gradient accumulation (default: 1)
  --no_forward_run NO_FORWARD_RUN
                        Just only iterating data loading without model forwarding and training (default: False)
  --resume RESUME       Enable resuming if checkpoint is existing (default: False)
  --train_dtype {float16,float32,float64}
                        Data type for training. (default: float32)
  --use_amp USE_AMP     Enable Automatic Mixed Precision. This feature requires pytorch>=1.6 (default: False)
  --log_interval LOG_INTERVAL
                        Show the logs every the number iterations in each epochs at the training phase. If None is given, it is decided according the number of training samples automatically . (default: None)
  --use_matplotlib USE_MATPLOTLIB
                        Enable matplotlib logging (default: True)
  --use_tensorboard USE_TENSORBOARD
                        Enable tensorboard logging (default: True)
  --use_wandb USE_WANDB
                        Enable wandb logging (default: False)
  --wandb_project WANDB_PROJECT
                        Specify wandb project (default: None)
  --wandb_id WANDB_ID   Specify wandb id (default: None)
  --wandb_entity WANDB_ENTITY
                        Specify wandb entity (default: None)
  --wandb_name WANDB_NAME
                        Specify wandb run name (default: None)
  --wandb_model_log_interval WANDB_MODEL_LOG_INTERVAL
                        Set the model log period (default: -1)
  --detect_anomaly DETECT_ANOMALY
                        Set torch.autograd.set_detect_anomaly (default: False)

Pretraining model related:
  --pretrain_path PRETRAIN_PATH
                        This option is obsoleted (default: None)
  --init_param [INIT_PARAM [INIT_PARAM ...]]
                        Specify the file path used for initialization of parameters. The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', where file_path is the model file path, src_key specifies the key of model states to be used in the model file, dst_key specifies the attribute of the model to be initialized, and exclude_keys excludes keys of model states for the initialization.e.g.
                          # Load all parameters  --init_param some/where/model.pth
                          # Load only decoder parameters  --init_param some/where/model.pth:decoder:decoder
                          # Load only decoder parameters excluding decoder.embed  --init_param some/where/model.pth:decoder:decoder:decoder.embed
                          --init_param some/where/model.pth:decoder:decoder:decoder.embed
                         (default: [])
  --ignore_init_mismatch IGNORE_INIT_MISMATCH
                        Ignore size mismatch when loading pre-trained model (default: False)
  --freeze_param [FREEZE_PARAM [FREEZE_PARAM ...]]
                        Freeze parameters (default: [])

BatchSampler related:
  --num_iters_per_epoch NUM_ITERS_PER_EPOCH
                        Restrict the number of iterations for training per epoch (default: None)
  --batch_size BATCH_SIZE
                        The mini-batch size used for training. Used if batch_type='unsorted', 'sorted', or 'folded'. (default: 20)
  --valid_batch_size VALID_BATCH_SIZE
                        If not given, the value of --batch_size is used (default: None)
  --batch_bins BATCH_BINS
                        The number of batch bins. Used if batch_type='length' or 'numel' (default: 1000000)
  --valid_batch_bins VALID_BATCH_BINS
                        If not given, the value of --batch_bins is used (default: None)
  --train_shape_file TRAIN_SHAPE_FILE
  --valid_shape_file VALID_SHAPE_FILE

Sequence iterator related:
  --batch_type {unsorted,sorted,folded,length,numel}
                        "unsorted":
                        UnsortedBatchSampler has nothing in particular feature and just creates mini-batches which has constant batch_size. This sampler doesn't require any length information for each feature. 'key_file' is just a text file which describes each sample name.

                            utterance_id_a
                            utterance_id_b
                            utterance_id_c

                        The fist column is referred, so 'shape file' can be used, too.

                            utterance_id_a 100,80
                            utterance_id_b 400,80
                            utterance_id_c 512,80

                        "sorted":
                        SortedBatchSampler sorts samples by the length of the first input  in order to make each sample in a mini-batch has close length. This sampler requires a text file which describes the length for each sample

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of feature dimensions is referred, so 'shape_file' can be also used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "folded":
                        FoldedBatchSampler supports variable batch_size. The batch_size is decided by
                            batch_size = base_batch_size // (L // fold_length)
                        L is referred to the largest length of samples in the mini-batch. This samples requires length information as same as SortedBatchSampler

                        "length":
                        LengthBatchSampler supports variable batch_size. This sampler makes mini-batches which have same number of 'bins' as possible counting by the total lengths of each feature in the mini-batch. This sampler requires a text file which describes the length for each sample.

                            utterance_id_a 1000
                            utterance_id_b 1453
                            utterance_id_c 1241

                        The first element of feature dimensions is referred, so 'shape_file' can be also used.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                        "numel":
                        NumElementsBatchSampler supports variable batch_size. Just like LengthBatchSampler, this sampler makes mini-batches which have same number of 'bins' as possible counting by the total number of elements of each feature instead of the length. Thus this sampler requires the full information of the dimension of the features.

                            utterance_id_a 1000,80
                            utterance_id_b 1453,80
                            utterance_id_c 1241,80

                         (default: folded)
  --valid_batch_type {unsorted,sorted,folded,length,numel,None}
                        If not given, the value of --batch_type is used (default: None)
  --fold_length FOLD_LENGTH
  --sort_in_batch {descending,ascending}
                        Sort the samples in each mini-batches by the sample lengths. To enable this, "shape_file" must have the length information. (default: descending)
  --sort_batch {descending,ascending}
                        Sort mini-batches by the sample lengths (default: descending)
  --multiple_iterator MULTIPLE_ITERATOR
                        Use multiple iterator mode (default: False)

Chunk iterator related:
  --chunk_length CHUNK_LENGTH
                        Specify chunk length. e.g. '300', '300,400,500', or '300-400'.If multiple numbers separated by command are given, one of them is selected randomly for each samples. If two numbers are given with '-', it indicates the range of the choices. Note that if the sequence length is shorter than the all chunk_lengths, the sample is discarded.  (default: 500)
  --chunk_shift_ratio CHUNK_SHIFT_RATIO
                        Specify the shift width of chunks. If it's less than 1, allows the overlapping and if bigger than 1, there are some gaps between each chunk. (default: 0.5)
  --num_cache_chunks NUM_CACHE_CHUNKS
                        Shuffle in the specified number of chunks and generate mini-batches More larger this value, more randomness can be obtained. (default: 1024)

Dataset related:
  --train_data_path_and_name_and_type TRAIN_DATA_PATH_AND_NAME_AND_TYPE
                        Give three words splitted by comma. It's used for the training data. e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. The first value, some/path/a.scp, indicates the file path, and the second, foo, is the key name used for the mini-batch data, and the last, sound, decides the file type. This option is repeatable, so you can input any number of features for your task. Supported file types are as follows:

                        "sound":
                        Audio format types which supported by sndfile wav, flac, etc.

                           utterance_id_a a.wav
                           utterance_id_b b.wav
                           ...

                        "kaldi_ark":
                        Kaldi-ark file type.

                           utterance_id_A /some/where/a.ark:123
                           utterance_id_B /some/where/a.ark:456
                           ...

                        "npy":
                        Npy file format.

                           utterance_id_A /some/where/a.npy
                           utterance_id_B /some/where/b.npy
                           ...

                        "text_int":
                        A text file in which is written a sequence of interger numbers separated by space.

                           utterance_id_A 12 0 1 3
                           utterance_id_B 3 3 1
                           ...

                        "csv_int":
                        A text file in which is written a sequence of interger numbers separated by comma.

                           utterance_id_A 100,80
                           utterance_id_B 143,80
                           ...

                        "text_float":
                        A text file in which is written a sequence of float numbers separated by space.

                           utterance_id_A 12. 3.1 3.4 4.4
                           utterance_id_B 3. 3.12 1.1
                           ...

                        "csv_float":
                        A text file in which is written a sequence of float numbers separated by comma.

                           utterance_id_A 12.,3.1,3.4,4.4
                           utterance_id_B 3.,3.12,1.1
                           ...

                        "text":
                        Return text as is. The text must be converted to ndarray by 'preprocess'.

                           utterance_id_A hello world
                           utterance_id_B foo bar
                           ...

                        "hdf5":
                        A HDF5 file which contains arrays at the first level or the second level.   >>> f = h5py.File('file.h5')
                           >>> array1 = f['utterance_id_A']
                           >>> array2 = f['utterance_id_B']


                        "rand_float":
                        Generate random float-ndarray which has the given shapes in the file.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rand_int_\d+_\d+":
                        e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given shapes in the path. Give the lower and upper value by the file type. e.g. rand_int_0_10 -> Generate integers from 0 to 10.

                           utterance_id_A 3,4
                           utterance_id_B 10,4
                           ...

                        "rttm":
                        rttm file loader, currently support for speaker diarization

                            SPEAKER file1 1 0 1023 <NA> <NA> spk1 <NA>    SPEAKER file1 2 4000 3023 <NA> <NA> spk2 <NA>    SPEAKER file1 3 500 4023 <NA> <NA> spk1 <NA>    END     file1 <NA> 4023 <NA> <NA> <NA> <NA>   ...

                         (default: [])
  --valid_data_path_and_name_and_type VALID_DATA_PATH_AND_NAME_AND_TYPE
  --allow_variable_data_keys ALLOW_VARIABLE_DATA_KEYS
                        Allow the arbitrary keys for mini-batch with ignoring the task requirements (default: False)
  --max_cache_size MAX_CACHE_SIZE
                        The maximum cache size for data loader. e.g. 10MB, 20GB. (default: 0.0)
  --max_cache_fd MAX_CACHE_FD
                        The maximum number of file descriptors to be kept as opened for ark files. This feature is only valid when data type is 'kaldi_ark'. (default: 32)
  --valid_max_cache_size VALID_MAX_CACHE_SIZE
                        The maximum cache size for validation data loader. e.g. 10MB, 20GB. If None, the 5 percent size of --max_cache_size (default: None)

Optimizer related:
  --optim {adam,adamw,sgd,adadelta,adagrad,adamax,asgd,lbfgs,rmsprop,rprop,radam}
                        The optimizer type (default: adadelta)
  --optim_conf OPTIM_CONF
                        The keyword arguments for optimizer (default: {})
  --scheduler {reducelronplateau,lambdalr,steplr,multisteplr,exponentiallr,cosineannealinglr,noamlr,warmuplr,cycliclr,onecyclelr,cosineannealingwarmrestarts,None}
                        The lr scheduler type (default: None)
  --scheduler_conf SCHEDULER_CONF
                        The keyword arguments for lr scheduler (default: {})

  Task related

  --token_list TOKEN_LIST
                        A text mapping int-id to token (default: None)
  --odim ODIM           The number of dimension of output feature (default: None)
  --model_conf MODEL_CONF
                        The keyword arguments for model class. (default: {})

  Preprocess related

  --use_preprocessor USE_PREPROCESSOR
                        Apply preprocessing to data or not (default: True)
  --token_type {bpe,char,word,phn}
                        The text will be tokenized in the specified level token (default: phn)
  --bpemodel BPEMODEL   The model file of sentencepiece (default: None)
  --feats_extract {fbank,spectrogram,linear_spectrogram}
                        The feats_extract type (default: fbank)
  --feats_extract_conf FEATS_EXTRACT_CONF
                        The keyword arguments for feats_extract (default: {})
  --normalize {global_mvn,None}
                        The normalize type (default: global_mvn)
  --normalize_conf NORMALIZE_CONF
                        The keyword arguments for normalize (default: {})
  --tts {tacotron2,transformer,fastspeech,fastspeech2,vits,joint_text2wav}
                        The tts type (default: tacotron2)
  --tts_conf TTS_CONF   The keyword arguments for tts (default: {})
  --pitch_extract {dio,None}
                        The pitch_extract type (default: None)
  --pitch_extract_conf PITCH_EXTRACT_CONF
                        The keyword arguments for pitch_extract (default: {})
  --pitch_normalize {global_mvn,None}
                        The pitch_normalize type (default: None)
  --pitch_normalize_conf PITCH_NORMALIZE_CONF
                        The keyword arguments for pitch_normalize (default: {})
  --energy_extract {energy,None}
                        The energy_extract type (default: None)
  --energy_extract_conf ENERGY_EXTRACT_CONF
                        The keyword arguments for energy_extract (default: {})
  --energy_normalize {global_mvn,None}
                        The energy_normalize type (default: None)
  --energy_normalize_conf ENERGY_NORMALIZE_CONF
                        The keyword arguments for energy_normalize (default: {})