AI video conversation agents can now look back at you

Every aspect of AI feels like it’s advanced a decade in the past year, and in the whirlwind of new releases and features, you may have missed something important: interactive video chatbots that can see, hear and converse with you in real time.

Heck, I’m over here still impressed with how good ChatGPT’s advanced voice mode is. But multimodal AIs are proving to be just as capable of expressing themselves in audio and video as they are in the written word.

A couple of years ago, the idea of having a video chat with a fairly lifelike, photorealistic AI would have seemed ridiculously futuristic. Yet here we are in 2024: I’ve just spoken to five of them, and I’m now so conditioned to immediately accept new cutting-edge technologies that it already feels completely normal.

The Good

I’m actually about to complain about their current shortcomings. But first, let’s talk about what they do well. Speed, for me, has to be at the top of the list here. You talk and the video avatar responds with no more delay or latency than you’d get with a voice chat — about 600 milliseconds, according to Tavus. I’ve actually had plenty of video chats with other real people where there’s been more lag than this.

The avatars look and sound great, and of course their conversational abilities make last-generation assistants like Siri and Alexa feel like black-and-white TVs.

Early video conversation AIs are already pretty incredible

Tavus

It’s truly amazing how much these models can do in real time – never mind the conversation, the voice, the body language – they can also look back at you through your laptop or device camera, take stock of your surroundings and incorporate them into the conversation.

“I see you’ve got some guitars and keyboards behind you, Loz,” AI agent ‘Carter’ tells me. “And those sound absorbing panels on the ceiling… Looks like you have some serious music production space there, love those creative vibes!”

These agents can be given personalities, memories, scenarios, habits, tasks, boundaries, interaction goals, scripts and access to all the information they need to do their jobs – jobs like automated sales, customer service, information assistants, whatever human-facing tasks can be done through a video chat interface.

They can converse comfortably in a variety of languages, without losing the essential tone of their voices. They can appear in a variety of environments: walking down the street, driving a car, hanging out in a coffee shop or sitting in any office you can think of.

And they can look and sound just like you. A single two-minute video upload is all Tavus needs to capture your appearance and voice, which it then transforms into a programmable “digital twin” conversational agent that is your own spitting image.

The Bad

These things are still very early versions of what’s coming down the pipe at us. The Carter bot doesn’t always get his lips in perfect sync with his voice. His facial expressions don’t always land in the right places. He shimmers a little; the eyes seem to shift position on his head from time to time, and the video or audio sometimes stutters to reveal his digital nature.

And, as with ChatGPT, the conversation is still a little stilted. You have to take turns, and if you stop and think too long in the middle of a sentence, he’ll start responding when a human would (ideally) give you some more space. AIs haven’t yet mastered the art of gently interrupting, prompting, that sort of thing.

It doesn’t matter. The speed at which this technology is developing is truly astounding. In a few months, Carter will be old news, and all of these gaps will close quickly. Most of the world only learned about ChatGPT last year – now you’re watching AI, and it’s watching you, in real-time video conversations.

Choose from standard AI agents, or build one yourself with just a two-minute selfie video

Tavus

The Ugly

In fact, part of what this thing needs to do to improve is to get better at reading body language, which could help it tell the difference between, say, someone who is lagging, someone who is thinking, and someone who has finished their sentence.

And then, of course, it must learn to adapt its own body language in response to yours, and to advance its goals in communication.

And for those of you who have followed my thoughts on AI over the last few years, this is where you’ll start to see some of the scary potential. Pardon me as we venture into the realm of speculation – but the rapid convergence of technologies in this space makes some things pretty clear to me.

In April, a study found that text-based AIs were already about 82% more persuasive than humans – and at the same time we started to see the first emotionally intelligent AI chat services, able to read the tone of your voice and respond to the emotional content as well as the words.

Oh, and here’s some light reading if you’re wondering how much an AI might be able to learn about you from your body language… Back in 2021, a research review absolutely convinced me by describing all the things AI could tell about a person just by tracking their eye movements.

Eye tracking devices can see much more than just what you choose to look at and infer a huge amount of sensitive information

So when I look at Carter looking back at me, I’m amazed at the progress and blown away by the technology, but I also see the embryonic form of history’s most powerful tool of persuasion. This one might just beat religion, friends.

With just a short piece of video, a scammer can have an agent video call you as your own mother, and cold-read you like no human expert ever could, constantly monitoring your facial expressions, tone of voice, and body language to keep track of whether or not you’re falling for the scam. If you start to pull back, it can notice almost before you do, and start using all sorts of distraction or refocusing techniques to bypass objections, create a sense of urgency, and move you toward its end goal, whatever that is.

That’s just the criminal side of things … Imagine trying to get a refund when the customer service agent you’re talking to is a master conversational tactician, a superhuman body language expert and a tone of voice analyst all rolled into one. Imagine how hard the sell will be when you’re talking to a galaxy-brained super salesman who can read you like a book.

That’s not to mention how effective these things will be as disinformation vectors, virtual girlfriends, divisive political tools… Maybe even police detectives or interrogators. They will be incredibly persuasive in one-on-one interactions, weaponizing our built-in physical tendencies to make our bodies betray us. The balance of power here is going to be incredibly one-sided, if they can just keep us on the line.

On the positive side, they will be incredible therapists, doctors, assistants, coaches, mentors, trainers, teachers and probably friends. But it will be more important than ever to remember the basic truth: if you don’t own an AI, someone else does, and it works for them first and you second. So be very careful what you choose to disclose, and only deal with companies you trust…

… or not. There may be no real way to protect yourself from this. We as a species may just have to adapt to a new reality.

You can have a two-minute demo chat with Carter yourself at the Tavus website. Tell him I said hi.

Oh, and you can take a look at what HeyGen is doing in this space too if you want to see some similar options, although I was less impressed with HeyGen’s demos.

Source: Tavus