One of the fastest-improving areas of artificial intelligence video is lip-sync: making an AI character look as though it is actually speaking the words it says.
There are a number of companies offering lip-syncing, including Pika Labs, Sync Labs and character-based platforms like HeyGen and Synthesia. The latter two are potentially the best examples of lip-syncing I’ve seen, but they are focused more on avatars than animation.
For this story, I’ve focused on platforms working in the AI video space, rather than avatar creation. Kling and Runway are the most similar, offering full video creation platforms with lip-sync as a feature. Hedra is currently focused on the character, but it’s building a wider steerable video model that starts with the character. So I’ve picked those three for this test.
Designing the battle
I planned a five-round competition between the trio of models: three rounds using an image I’ve given them and two using their own image/video generation capabilities. (I’ll explain at the end how many rounds I actually ran.)
We will use the same image with each tool but use their own built-in voices and the same monologue script. I’ve focused on 10-second snippets even though Hedra can go up to a minute. This is to keep consistency across all three models.
Hedra works slightly differently to Kling and Runway. The latter two begin with a video and map lip movement within the video; Hedra begins with an image. The final results are similar.
Round 1: The Static Face Test
This should be the easiest. We’ve given Midjourney the prompt: “A neutral, close-up portrait of a person with minimal expression, well-lit in a natural studio setting, showing a front-facing view of the face. The background is a soft, blurred color gradient with no distractions. Skin tones should be natural, and the character should look calm, with no notable emotion.”
We’ve then picked a custom voice from each of the three models and decided to make it say “Hello, welcome to the future of AI video generation. I don’t really exist but can still speak to you thanks to the wonders of lip-synching”.
This first test should have taken 20 minutes to run even with the added complexity of lip-syncing, but Kling, as good as it is in terms of visual and motion realism, is by far the slowest AI video model. Runway, thanks to Turbo, is near real-time, and Hedra is only animating an image, so it’s quick.
This was a close round between Hedra, with the more realistic voice and mouth movement, and Kling, with the more impressive movement. I wasn’t convinced by the flickering in Kling’s clip, so I’m giving it to Hedra on this occasion.
Round 2: The Expression Challenge
In this test, we’ve got an ultra close-up image made in Midjourney: “A close-up portrait of a person with an expressive, happy face, showing teeth in a wide smile. The lighting is bright and warm, creating a cheerful and energetic mood. The background is a soft, light pastel colour that doesn’t distract from the facial expression.”
Each of the three models was asked to say the phrase: “Life can be odd sometimes, but it is a good odd, a happy way of being. Something to smile about.” This will test the ability to capture emotional context.
All three were nightmarish renders. It is clear that if you want a good lip-sync, you should start with a closed mouth. I can’t crown a true winner, but I will reluctantly give it to Hedra for the least horrific mouth movement.
Round 3: The Action Scene
Finally, we’re going to see how well each contender can animate the lips of someone mid-conversation and not facing the camera directly. We use the Midjourney prompt: “A mid-action shot of a person slightly turned to the side, speaking with a hand raised as if gesturing during an intense conversation. The face shows determination and focus. The background is a dynamic, slightly blurred urban street scene, with movement to suggest the person is speaking while in motion.”
I’ve given the character the script: “So I told him if he wants to buy the car he’ll have to come back with a better price. Never heard from him again.”
None of them was perfect, but I think Hedra and Runway did a better job than Kling. Overall, I think Runway took this round for the most realistic lip-sync.
The winner: Hedra
I had originally planned five rounds, but Kling took so long to generate each video that it was impossible to complete them all in the time available. The last two tests were going to be of the text-to-video capabilities, without the starting image, but the results were too inconsistent to be viable.
Hedra’s Character-2 came out on top, and to some degree that was inevitable. It starts with an image and animates it, whereas the other two have to map the mouth movement within a video and sync the lips to the sound. Of the video models, I think Kling was better overall, but this was a first-past-the-post test, so technically Runway came in second.
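For anyone who wants to check the maths, here is a minimal sketch of that first-past-the-post tally. It uses only the round winners reported above; the variable names are my own, and the ranking simply counts round wins without weighting how close each round was.

```python
from collections import Counter

# Round winners as reported in this article (rounds 1 to 3).
round_winners = ["Hedra", "Hedra", "Runway"]
contenders = ["Hedra", "Kling", "Runway"]

# First past the post: one point per round won, nothing for a close second.
tally = Counter(round_winners)
ranking = sorted(contenders, key=lambda name: -tally[name])

for name in ranking:
    print(f"{name}: {tally[name]} round(s) won")
```

Counting this way puts Runway ahead of Kling purely because it won a round, which is why the ranking differs from my overall impression of the two video models.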
If I were to repeat this experiment, I’d use external sounds to create more consistency, always use generated images and carry out a wider range of tests. I just wish Kling were quicker.