The Voice-Activated AI Workflow

The Voice-Activated Workflow works differently; both the AI and Human leverage the strengths of their own brain-type. But this style of working is not for everyone. It might speak to younger portions of our working community and help make us old people seem more innovative than we actually are. A quick overview of what it is and how it works.

AI Professional – Fiona Passantino, early September 2024

The Tease

In the summer of 2024, OpenAI introduced a new type of AI-Human workflow with an audio-based Advanced Voice Mode, released to a select group of users for field testing. The company press release video featured a small group of OpenAI executives talking to their phones in a casual, focused way on puffy sofas, and listening to what the phone said back.

The first big improvement from previous voice-activated models was the issue of latency; this describes the time it takes for a model to process a Human command and respond with an output. In Human conversation, this takes place quickly, sometimes layering over one another. The Omni rollout uses a combination of on-device mini-models to handle fast and easy queries with cloud-based processing for the more complex issues that require the full force of a trillion-parameter model that can time-release the response on the fly.

The Omni model was the first global release showing off this merged capability and was able to demonstrate near-Human conversational response time and smoother transitions between tasks. The result is a vastly upgraded version of Siri or Alexa that is surprisingly fun to talk to and able to handle complex verbal prompts. It’s fast enough to interrupt mid-sentence and update your request. It can sing, whisper, use multiple voices, even celebrities and cartoon characters[i]. The speech is laced with Human-like emotions, little giggles, outbursts, laughter, a bit of flirtation and even sarcasm.

Like most Western models, the voice AI is best at English but can adeptly switch between multiple languages during the same conversation; GPT-4o was built with 45 languages[ii]. Two phones with Advanced Voice Mode can have an oddly fluent conversation with one another, translating between French, Spanish or German as needed.

Like all chatbots, Voice Activated AI is not immune to hallucinations. It makes mistakes, forgets a request on occasion and generates random output for no apparent reason. But it’s laugh-out-loud fun to work with and can at times be eerie. Having a chatty, friendly companion alongside you during a long, intense day working from home with no Humans in sight is nice if you need a quick calculation, idea, recommendation or inspiration.

How the Workflow Works

In the typical workflow, all the phases of development – ideation, outlining, roughing, refinement, fleshing, editing, cutting back, combing and release – are text-oriented. You type out your query, copy-paste the AI components into your workspace and write. All interactions between Man and Machine are delivered and received by keyboard.

The Voice-Activated Workflow works differently. It’s a dynamic collaboration where both the AI and the Human contribute meaningfully in a structured and iterative process, each leveraging the strengths of its own brain-type: algorithmic or meat-based. It happens by voice.

Before you even put your first brainstorming queries into a prompt window, you are already telling the model that you might be feeling a tad unmotivated this morning. But you have this assignment, the deadline is tight, and you’re going to need to produce a brilliant, 500-word piece of thought leadership, ghost-writing for your boss, on the benefits of IoTs in Smart Cities. You will need to name drop the title of his new book within the first 150 words and close with a call to action to attend his annual industry customer conference.

The initial phases of ideation, brainstorming, and approach planning is all done by voice, just like you might with a helpful colleague who happens to know a whole lot about the particular subject. The AI might suggest specific ideas or prompts to refine the concept to help you gather your thoughts and truly crystallize your intentions.

Typical audio prompts for a project like this might be…

What’s the best way to structure the article?

Can you help me outline the key points in order of interest?

What about background information; what does the average person need to know about Smart Cities in order to understand his thesis?

What are some of the most exciting recent developments with Smart Cities in the past two years?

What are three projects I should be researching that combine IoTs and Smart Cities?

Can you tell me a few jokes about Smart Cities to break the ice for the reader?

What about titles for a piece like this, including a few subtitles?

As you talk and gather the AI brain’s perspective on the fly, you are also writing. You still query the browser-based bot in the same way, writing your prompt to get the first few lines of “starter” text. But now you have an instant sounding board for fast feedback to help you through small bottleneck, both phycological and content based.

Dude, what’s a better word for “realism”?

Are there too many adjectives in this description?

Which part is the strongest, and which should I delete?

Why are Smart Cities so interesting to us?

How do I define “Internet of Things” for the general public?

Ok, listen to this opening paragraph; what am I missing?

Is this sentence too obvious? Which one should I delete?

What would make this opening line stronger?

Does my own opinion come through too strong in this paragraph?

Can you give me a better summary for the abstract?

Voice-Activation Pros and Cons

What’s good about this new approach? You now have two AI brains working for you at the same time, the written and the spoken. Like most rollouts, the functionality does not match the dream when road-testing these applications. While latency is lower, and the back-and-forth is faster, sometimes you need to say something out loud to hear it clearly.

Not to be underestimated, the entertainment value; if your Voice-Activated AI is so odd, creepy and unexpectedly funny that it makes you laugh, then your motivation and workflow has already improved. Joy and laughter at work is a huge lubricant for the working machine that we Humans often overlook. We are emotional creatures, after all, making the majority of our decisions based on our gut and our feelings. Even financial ones; about 90% of our financial decisions are made by our emotions, explained away by our logical brains only after the fact[iii].

The Voice Activated Workflow is not for everyone. Done right, it can freeing; the Human can focus on creative thinking and storytelling while the AI handles drafting, editing and formatting, jumping back and forth between the voice and the text-based prompt. Or the Human, empty of creative ideas at that moment, can lean on the AI brain to get the thoughts flowing by voice, so the written prompts have a slightly different twist.

For many, particularly the older-school writers, the addition of voice adds a layer of input that only adds distraction and focus-splitting to the Human, who is used to going it alone. The voice is irritating, not comprehending. The jokes aren’t funny, and the advice is bad. Mostly because we can forget that it’s an AI and not a well-meaning Human intern sitting beside you. Voice-Activated AI still needs structured queries to do its best work, just like a text prompt. But we can easily just… forget. Simply because we are not used to interacting with a bit of programming in this way.

Like all things AI, this functionality will get better. One can already start to see the rough outlines of a new relationship with our devices for both home and work; there might be a permanently open stream of endless AI-Human conversation. This may save us having to ask our partners where the butter is; good AI vision may help keep us focused, aware and fact-based (barring hallucinations, of course).

Need help with AI Integration?

Reach out to me for advice – I have a few nice tricks up my sleeve to help guide you on your way, as well as a few “insiders’ links” I can share to get you that free trial version you need to get started.

No eyeballs to read or watch? Just listen.

Listen on APPLE
Listen on SPOTIFY

Working Humans is a bi-monthly podcast focusing on the AI and Human connection at work.

About Fiona Passantino

Fiona helps empower working Humans with AI integration, leadership and communication. Maximizing connection, engagement and creativity for more joy and inspiration into the workplace. A passionate keynote speaker, trainer, facilitator and coach, she is a prolific content producer, host of the podcast “Working Humans” and award-winning author of the “Comic Books for Executives” series. Her latest book is “The AI-Powered Professional”.

[i] O’Dinnell (2024) “OpenAI’s new GPT-4o lets people interact using voice or video in the same model” MIT Technology Review. Accessed August 15, 2024. https://www.technologyreview.com/2024/05/13/1092358/openais-new-gpt-4o-model-lets-people-interact-using-voice-or-video-in-the-same-model/

[ii] Rogers (2024) “I Used ChatGPT’s Advanced Voice Mode. It’s Fun, and Just a Bit Creepy” Wired Magazine. Accessed August 14, 2024. https://www.wired.com/story/chatgpt-advanced-voice-mode-first-impressions/

[iii] Singletary (2024) “Nobel laureate Daniel Kahneman taught us that money isn’t always about math” The Washington Post. Accessed August 18. https://www.washingtonpost.com/business/2024/03/29/daniel-kahneman-behavioral-economics/

Inspired. Engaged. AI-Powered.

The Voice-Activated AI Workflow

No eyeballs to read or watch? Just listen.

About Fiona Passantino

1 Comment

Submit a Comment Cancel reply

Recent Posts

Recent Comments

Where AI + Human meet.