WARNING:root:NaN or Inf found in input tensor. #54

@VladC12

Description

GPU: GTX 1060 6 GB

```
❯ python train.py -c config.json -p train_config.output_directory=outdir
train_config.output_directory=outdir
output_directory=outdir
{'train_config': {'output_directory': 'outdir', 'epochs': 10000000, 'learning_rate': 0.0001, 'weight_decay': 1e-06, 'sigma': 1.0, 'iters_per_checkpoint': 5000, 'batch_size': 1, 'seed': 1234, 'checkpoint_path': '', 'ignore_layers': [], 'include_layers': ['speaker', 'encoder', 'embedding'], 'warmstart_checkpoint_path': '', 'with_tensorboard': True, 'fp16_run': False}, 'data_config': {'training_files': 'filelists/ljs_audiopaths_text_sid_train_filelist.txt', 'validation_files': 'filelists/ljs_audiopaths_text_sid_val_filelist.txt', 'text_cleaners': ['flowtron_cleaners'], 'p_arpabet': 0.5, 'cmudict_path': 'data/cmudict_dictionary', 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'mel_fmin': 0.0, 'mel_fmax': 8000.0, 'max_wav_value': 32768.0}, 'dist_config': {'dist_backend': 'nccl', 'dist_url': 'tcp://localhost:54321'}, 'model_config': {'n_speakers': 1, 'n_speaker_dim': 128, 'n_text': 185, 'n_text_dim': 512, 'n_flows': 2, 'n_mel_channels': 80, 'n_attn_channels': 640, 'n_hidden': 1024, 'n_lstm_layers': 2, 'mel_encoder_n_hidden': 512, 'n_components': 0, 'mean_scale': 0.0, 'fixed_gaussian': True, 'dummy_speaker_embedding': False, 'use_gate_layer': True}}

got rank 0 and world size 1 ...
Flowtron(
(speaker_embedding): Embedding(1, 128)
(embedding): Embedding(185, 512)
(flows): ModuleList(
(0): AR_Step(
(conv): Conv1d(1024, 160, kernel_size=(1,), stride=(1,))
(lstm): LSTM(1664, 1024, num_layers=2)
(attention_lstm): LSTM(80, 1024)
(attention_layer): Attention(
(softmax): Softmax(dim=2)
(query): LinearNorm(
(linear_layer): Linear(in_features=1024, out_features=640, bias=False)
)
(key): LinearNorm(
(linear_layer): Linear(in_features=640, out_features=640, bias=False)
)
(value): LinearNorm(
(linear_layer): Linear(in_features=640, out_features=640, bias=False)
)
(v): LinearNorm(
(linear_layer): Linear(in_features=640, out_features=1, bias=False)
)
)
(dense_layer): DenseLayer(
(layers): ModuleList(
(0): LinearNorm(
(linear_layer): Linear(in_features=1024, out_features=1024, bias=True)
)
(1): LinearNorm(
(linear_layer): Linear(in_features=1024, out_features=1024, bias=True)
)
)
)
)
(1): AR_Back_Step(
(ar_step): AR_Step(
(conv): Conv1d(1024, 160, kernel_size=(1,), stride=(1,))
(lstm): LSTM(1664, 1024, num_layers=2)
(attention_lstm): LSTM(80, 1024)
(attention_layer): Attention(
(softmax): Softmax(dim=2)
(query): LinearNorm(
(linear_layer): Linear(in_features=1024, out_features=640, bias=False)
)
(key): LinearNorm(
(linear_layer): Linear(in_features=640, out_features=640, bias=False)
)
(value): LinearNorm(
(linear_layer): Linear(in_features=640, out_features=640, bias=False)
)
(v): LinearNorm(
(linear_layer): Linear(in_features=640, out_features=1, bias=False)
)
)
(dense_layer): DenseLayer(
(layers): ModuleList(
(0): LinearNorm(
(linear_layer): Linear(in_features=1024, out_features=1024, bias=True)
)
(1): LinearNorm(
(linear_layer): Linear(in_features=1024, out_features=1024, bias=True)
)
)
)
(gate_layer): LinearNorm(
(linear_layer): Linear(in_features=1664, out_features=1, bias=True)
)
)
)
)
(encoder): Encoder(
(convolutions): ModuleList(
(0): Sequential(
(0): ConvNorm(
(conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
)
(1): InstanceNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
)
(1): Sequential(
(0): ConvNorm(
(conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
)
(1): InstanceNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
)
(2): Sequential(
(0): ConvNorm(
(conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
)
(1): InstanceNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
)
)
(lstm): LSTM(512, 256, batch_first=True, bidirectional=True)
)
)
Number of speakers : 1
output directory outdir
Epoch: 0
C:\AI_Research_Project\flowtron\data.py:40: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ..\torch\csrc\utils\tensor_numpy.cpp:141.)
return torch.from_numpy(data).float(), sampling_rate
C:\AI_Research_Project\flowtron\flowtron.py:373: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at ..\aten\src\ATen\native\cuda\LegacyDefinitions.cpp:19.)
self.score_mask_value)
0: nan
WARNING:root:NaN or Inf found in input tensor.
C:\AI_Research_Project\flowtron\data.py:40: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ..\torch\csrc\utils\tensor_numpy.cpp:141.)
return torch.from_numpy(data).float(), sampling_rate
Mean None
LogVar None
Prob None
Validation loss 0: nan
WARNING:root:NaN or Inf found in input tensor.
Saving model and optimizer state at iteration 0 to outdir/model_0
1: nan
WARNING:root:NaN or Inf found in input tensor.
2: nan
```
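
As a side note, both UserWarnings in the log come with their own one-line fixes. Below is a small self-contained sketch of the two patterns they are warning about (plain PyTorch calls, not flowtron's actual code):

```python
import numpy as np
import torch

# First warning: torch.from_numpy() on a read-only NumPy array. Copying the
# array first gives the tensor its own writeable memory (cf. data.py line 40).
data = np.zeros(4, dtype=np.float32)
data.setflags(write=False)                     # simulate the read-only buffer
audio = torch.from_numpy(data.copy()).float()  # no warning, safe to modify

# Second warning: masked_fill_ with a torch.uint8 mask is deprecated. Casting
# the mask to bool removes the warning (cf. flowtron.py line 373).
scores = torch.randn(1, 3, 5)
mask = torch.zeros(1, 3, 5, dtype=torch.uint8)
mask[..., -1] = 1
scores.masked_fill_(mask.bool(), float("-inf"))
```

Both are most likely unrelated to the NaN itself; they just clutter the log.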

And the `NaN or Inf` warning keeps repeating for every iteration.
EDIT: It just finished doing its thing.

After `319: nan` it crashed:

```
319: nan
WARNING:root:NaN or Inf found in input tensor.
Traceback (most recent call last):
  File "train.py", line 300, in <module>
    train(n_gpus, rank, **train_config)
  File "train.py", line 238, in train
    loss.backward()
  File "C:\Users\vladc\anaconda3\lib\site-packages\torch\tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Users\vladc\anaconda3\lib\site-packages\torch\autograd\__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: transform: failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure
```
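
The crash itself looks like a downstream symptom of backpropagating a loss that is already NaN, so the useful step is to find where the NaN first appears. Below is a generic sketch of how the training step could be instrumented; the loop and call signatures are placeholders rather than flowtron's actual train.py code, and only `loss.backward()` corresponds to the line in the traceback:

```python
import os
import torch

# Report asynchronous CUDA errors at the kernel that failed, rather than at a
# later sync point (must be set before the first CUDA call in the process).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Ask autograd to point at the forward op that produced a NaN/Inf gradient.
torch.autograd.set_detect_anomaly(True)

def guarded_step(model, batch, optimizer, criterion):
    """One training step that refuses to backprop a non-finite loss."""
    optimizer.zero_grad()
    output = model(*batch)      # placeholder call signature
    loss = criterion(output)    # placeholder: flowtron's criterion differs
    if not torch.isfinite(loss):
        print(f"non-finite loss ({loss.item()}), skipping this step")
        return None
    loss.backward()
    optimizer.step()
    return loss.item()
```

Anomaly detection slows training down noticeably, so it is only worth enabling while hunting for the first non-finite value.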
