This AI Turns Lyrics Into Fully Synced Song and Dance Performances | HackerNoon
Briefly

The proposed model generates joint vocal and whole-body motion from textual inputs and is evaluated with several metrics. Mean Opinion Score (MOS) measures vocal naturalness, while Fréchet Inception Distance (FID) assesses motion quality. For face realism, vertex MSE and L1 difference are used. Beat Consistency (BC) measures the synchrony between motion and vocals. Baselines include state-of-the-art methods such as DiffSinger for vocal generation and T2M-GPT for motion generation. Results are reported on the RapVerse benchmark using an 85%/7.5%/7.5% train/val/test split.
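A common formulation of beat-based synchrony (borrowed from prior motion-generation work and assumed here purely as an illustration, not taken from the paper) matches each vocal beat to its nearest motion beat and scores the offset with a Gaussian kernel. The sketch below follows that assumption; the beat-extraction step, the `sigma` tolerance, and the example timestamps are all hypothetical.

```python
import numpy as np

def beat_consistency(motion_beats, vocal_beats, sigma=0.1):
    """Score how well motion beats line up with vocal beats (both in
    seconds): each vocal beat is matched to its nearest motion beat and
    scored with a Gaussian kernel, then the scores are averaged."""
    motion_beats = np.asarray(motion_beats, dtype=float)
    scores = [np.exp(-np.min((motion_beats - t) ** 2) / (2 * sigma ** 2))
              for t in vocal_beats]
    return float(np.mean(scores))

# Well-aligned beats score close to 1.0; misaligned beats drop toward 0.
print(beat_consistency([0.50, 1.00, 1.50], [0.52, 1.01, 1.48]))
```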
To evaluate the generation quality of singing vocals, we use the Mean Opinion Score (MOS) to gauge the naturalness of the synthesized vocals.
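MOS is a subjective listening test: human raters score each synthesized clip for naturalness on a fixed scale (commonly 1 to 5), and the scores are averaged. A minimal sketch of that averaging, with a purely hypothetical rating matrix:

```python
import numpy as np

# Hypothetical ratings: rows = listeners, columns = generated clips,
# each entry a 1-5 naturalness score.
ratings = np.array([
    [4, 5, 3],
    [4, 4, 4],
    [5, 4, 3],
])

mos_per_clip = ratings.mean(axis=0)  # average over listeners
overall_mos = ratings.mean()         # single MOS for the whole system
print(mos_per_clip, overall_mos)
```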
For motion synthesis, we evaluate the generation quality of body and hand gestures and the realism of the face, using metrics such as Fréchet Inception Distance (FID) and vertex MSE.
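Vertex MSE and L1 difference compare predicted and ground-truth face-mesh vertices frame by frame. A rough sketch, assuming `(T, V, 3)` arrays of vertex positions (the mesh topology and any alignment or normalization steps are not specified here):

```python
import numpy as np

def face_vertex_errors(pred_vertices, gt_vertices):
    """pred_vertices, gt_vertices: (T, V, 3) arrays of face-mesh vertex
    positions over T frames. Returns MSE and mean absolute (L1) error,
    averaged over frames, vertices, and coordinate axes."""
    diff = pred_vertices - gt_vertices
    return float(np.mean(diff ** 2)), float(np.mean(np.abs(diff)))
```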
For gesture generation, we use FID computed from a feature extractor to measure the distance between the feature distributions of generated and real motions.
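FID fits a Gaussian to the extracted features of real and generated motions and takes the Fréchet distance between the two. The sketch below uses the standard closed-form expression; the motion feature extractor that produces the feature matrices is assumed and not shown:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, gen_feats):
    """Frechet distance between Gaussians fit to feature matrices
    (rows = samples, columns = feature dimensions)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    cov_mean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(cov_mean):  # discard tiny imaginary parts from sqrtm
        cov_mean = cov_mean.real
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(cov_r + cov_g - 2 * cov_mean))
```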
We report all the results on RapVerse with an 85%/7.5%/7.5% train/val/test split, comparing our model against state-of-the-art methods.
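As a rough illustration only (the benchmark's official split definition takes precedence, and the seed and rounding here are arbitrary), an 85%/7.5%/7.5% index split could be produced like this:

```python
import numpy as np

def split_indices(n, seed=0):
    """Randomly partition n sample indices into 85% train,
    7.5% validation, and 7.5% test."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train, n_val = int(0.85 * n), int(0.075 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```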