Some results are presented below. In these models, 120 frames were fed into the network and 25 frames were predicted. After that, the sliding window was shifted forward by 1 frame and the process was repeated. The 25 predicted frames correspond approximately to one word. However, the quality of a prediction is difficult to judge from a single word. Therefore, several predictions were composed into a longer sequence.
The composition was done as follows: from a predicted sequence, the frame at position `offset` was stored. Then the sliding window was shifted by 1 frame and the next frame at position `offset` was saved.
The prediction of course becomes more difficult as `offset` grows. For example, with `offset=25`, 24 frames must first be predicted and only the 25th predicted frame is stored. With a smaller `offset`, fewer frames have to be predicted and thus the task becomes easier.
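The composition procedure described above can be sketched roughly as follows. This is only an illustrative sketch, not the code used for the experiments: `compose_prediction`, `predict_fn` and `linear_continuation` are hypothetical names, `predict_fn` stands in for the trained network, and the assumption that frames are stored as a NumPy array of shape `(n_frames, n_mels)` is mine. Only the window sizes (120 input frames, 25 predicted frames, shift of 1) come from the text.

```python
import numpy as np

N_INPUT = 120  # frames fed into the network (from the text)
N_PRED = 25    # frames predicted per window (from the text)


def compose_prediction(frames, predict_fn, offset, n_input=N_INPUT, n_pred=N_PRED):
    """Keep one predicted frame per window position and stack them.

    frames:     array of shape (n_frames, n_mels), assumed layout
    predict_fn: hypothetical stand-in for the model; maps an
                (n_input, n_mels) window to (n_pred, n_mels) frames
    offset:     1-based position of the predicted frame that is kept
    """
    if not 1 <= offset <= n_pred:
        raise ValueError("offset must lie between 1 and n_pred")
    n_windows = len(frames) - n_input - offset + 1
    if n_windows < 1:
        raise ValueError("sequence too short for this offset")

    composed = []
    for start in range(n_windows):  # the window is shifted by 1 frame each step
        window = frames[start:start + n_input]
        predicted = predict_fn(window)
        # The kept frame lines up with ground-truth frame
        # start + n_input + offset - 1.
        composed.append(predicted[offset - 1])
    return np.stack(composed)


def linear_continuation(window, n_pred=N_PRED):
    """Toy predictor for demonstration only: continues the last frame linearly."""
    steps = np.arange(1, n_pred + 1).reshape(-1, 1)
    return window[-1] + steps


# Toy data: frame i simply holds the value i (one "mel" bin).
frames = np.arange(200, dtype=float).reshape(-1, 1)
composed = compose_prediction(frames, linear_continuation, offset=25)
```

With this layout, the composed sequence for a given `offset` can be compared against the ground truth starting at frame `n_input + offset - 1`. A larger `offset` leaves fewer composed frames for the same input length and asks the model to look further ahead.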
The results are presented below. Only the ground truth and the prediction are shown in the tables; the segment that was fed into the network is not visible.
Ground truth audio (resynthesized Mel-spectrogram):
offset | Ground Truth spectrogram | Predicted Spectrogram | Predicted Audio |
---|---|---|---|
1 | |||
5 | |||
10 | |||
20 | |||
25 |
This model overfitted on the test data, but for some sentences (such as this one) its predictions are particularly good.
Ground truth audio (resynthesized Mel-spectrogram):
offset | Ground Truth spectrogram | Predicted Spectrogram | Predicted Audio |
---|---|---|---|
1 | |||
5 | |||
10 | |||
20 | |||
25 |
Ground truth audio (resynthesized Mel-spectrogram):
offset | Ground Truth spectrogram | Predicted Spectrogram | Predicted Audio |
---|---|---|---|
1 | |||
5 | |||
10 | |||
20 | |||
25 |
This model overfitted on the test data, but for some sentences (such as this one) its predictions are particularly good.
Ground truth audio (resynthesized Mel-spectrogram):
offset | Ground Truth spectrogram | Predicted Spectrogram | Predicted Audio |
---|---|---|---|
1 | |||
5 | |||
10 | |||
20 | |||
25 |