OpenAI Brings Synthetic Speech and Image Understanding to ChatGPT
Making the app more appealing could also make it harder for other companies to compete with OpenAI in the race to provide powerful artificial intelligence (AI) engines. Feeding audio and visual data into the machine learning models behind ChatGPT may also advance OpenAI's long-term vision of creating more human-like intelligence.
ChatGPT's new voice generation technology, developed in-house, also opens new opportunities for OpenAI to license the technology to others. Spotify plans to use OpenAI's speech synthesis algorithms to pilot a feature that translates podcasts into multiple languages while mimicking the original podcaster's voice.
The ability to build a synthetic voice from just a few seconds of audio also opens the door to a variety of problematic use cases. "These capabilities also present new risks, such as the potential for malicious actors to impersonate public figures or commit fraud," the company says in a blog post announcing the new features. For precisely that reason, OpenAI says, the model isn't available for broad use: it will be much more tightly controlled, restricted to specific use cases and partnerships.
A Multimodal Approach: Image, Voice, and Text Input
The image search, meanwhile, is a bit like Google Lens. You take a photo of something you are interested in, and ChatGPT will try to work out what you are asking about and respond accordingly. You can also use the app's drawing tool to help make your query clear, or speak or type questions to go along with the image. You can prompt the bot and refine its answer as you go, rather than running one search after another to find the right result (a rough sketch of that kind of iterative, image-plus-text exchange appears below). This is similar to what Google is doing with multimodal search.
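To make that interaction pattern concrete, here is a minimal sketch of an iterative image-plus-text query. It assumes access to OpenAI's public Python SDK and its chat completions API with image input; the model name ("gpt-4o"), the file name, and the example questions are illustrative assumptions, not details from the announcement.

import base64

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Encode a local photo as a data URL so it can travel with the question.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What plant is this, and is it safe for cats?"},
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
    ],
}]

reply = client.chat.completions.create(model="gpt-4o", messages=messages)
print(reply.choices[0].message.content)

# Refine within the same conversation instead of starting a fresh search.
messages.append({"role": "assistant", "content": reply.choices[0].message.content})
messages.append({"role": "user", "content": "How much light does it need?"})
followup = client.chat.completions.create(model="gpt-4o", messages=messages)
print(followup.choices[0].message.content)

The point of the sketch is the loop: the image stays in the conversation's context, so each follow-up question refines the previous answer rather than starting over.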
OpenAI's language models that power its chatbot, including the most recent, GPT-4, were created using vast amounts of text collected from various sources around the web. Just as animal and human intelligence draw on diverse types of sensory data, creating a more advanced artificial intelligence may require other types of data too, such as audio and visual input.
Multimodal models, which accept video, image, and voice input as well as text, are the new direction in artificial intelligence. "From a model performance standpoint, intuitively we would expect multimodal models to outperform models trained on a single modality," says Trevor Darrell, a professor at UC Berkeley and a cofounder of Prompt AI, a startup working on combining natural language with image generation and manipulation. A model built only on language, no matter how powerful, will learn only language.