Anthropic’s Claude 3.5 Sonnet Can Look at a Screen and Use a Computer Like a Person
The updated Claude 3.5 Sonnet shows wide-ranging improvements on industry benchmarks, with particularly strong gains in agentic coding and tool use. Its score on SWE-bench Verified improves from 33.4% to 49.0%, higher than all publicly available models. On TAU-bench, an agentic tool-use benchmark, it improves from 62.6% to 69.2% in the retail domain and from 36.6% to 41.6% in the more challenging airline domain.
Anthropic says the new Claude 3.5 Sonnet delivers these benchmark gains while being offered to customers at the same price and speed as its predecessor.
This version of Claude has also been instructed to steer clear of certain sensitive activities: Anthropic describes “measures to monitor when Claude is asked to engage in election-related activity, as well as systems for nudging Claude away from activities like generating and posting content on social media, registering web domains, or interacting with government websites.”
There are many actions that people routinely do with computers (dragging, zooming, and so on) that Claude can’t yet attempt. The “flipbook” nature of Claude’s view of the screen—taking screenshots and piecing them together, rather than observing a more granular video stream—means that it can miss short-lived actions or notifications.
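The “flipbook” behavior described above can be sketched as a simple loop. Everything here is an illustrative stand-in, not Anthropic’s actual API: the point is that the agent observes one still frame at a time, so anything that appears and disappears between screenshots can be missed.

```python
# Hypothetical sketch of a screenshot-driven ("flipbook") agent loop.
# All function names and the stubbed model below are illustrative;
# the real system sends screenshots to the model and executes the
# mouse/keyboard actions it returns.

def take_screenshot() -> bytes:
    """Stub: capture the current screen as a still image."""
    return b"<png bytes>"

def ask_model_for_action(screenshot: bytes, goal: str) -> dict:
    """Stub: the model inspects one frame and picks the next action."""
    return {"type": "click", "x": 120, "y": 240}

def execute(action: dict) -> None:
    """Stub: replay the chosen action with an input-injection tool."""
    pass

def run_agent(goal: str, max_steps: int = 3) -> list[dict]:
    actions = []
    for _ in range(max_steps):
        frame = take_screenshot()                    # one still frame, not video,
        action = ask_model_for_action(frame, goal)   # so brief popups can be missed
        execute(action)
        actions.append(action)
    return actions

print(len(run_agent("open calendar")))  # 3
```

Because the loop only samples the screen at discrete moments, a notification that flashes between two screenshots is simply invisible to the model.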
Computer use can be cumbersome and error-prone, according to Anthropic. The company says it is releasing the capability early to gather feedback from developers, and expects it to improve rapidly over time.
Microsoft’s Copilot Vision feature and OpenAI’s desktop app for ChatGPT have demonstrated that their AI tools can see what is on a computer screen, but neither company has released a tool that can click around and perform tasks for you like this. Rabbit promised similar capabilities for its R1 but has yet to deliver them.
Anthropic has added a new capability to its Claude 3.5 Sonnet AI model that lets it control a computer by looking at the screen, moving the cursor, clicking buttons, and typing text. The new feature, called “computer use,” is available through the API and allows developers to direct Claude to use a computer the way a person does.
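For developers, exposing the capability comes down to declaring a virtual display to the model in an API request. The sketch below builds such a payload as a plain dictionary; the tool type string and display parameters reflect Anthropic’s computer-use beta as documented at launch and should be treated as assumptions that may differ in your SDK version.

```python
# Sketch of a "computer use" request payload, assuming Anthropic's
# computer-use beta (tool type "computer_20241022"). The dict is built
# locally; actually sending it requires the anthropic SDK and an API key.

def build_computer_use_request(task: str, width: int = 1024, height: int = 768) -> dict:
    """Assemble a Messages API-style payload that gives Claude a virtual display."""
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "tools": [
            {
                "type": "computer_20241022",   # beta tool type (assumed)
                "name": "computer",
                "display_width_px": width,     # resolution of the screenshots Claude sees
                "display_height_px": height,
                "display_number": 1,
            }
        ],
        "messages": [{"role": "user", "content": task}],
    }

payload = build_computer_use_request("Open the calendar and create an event.")
print(payload["tools"][0]["type"])  # computer_20241022
```

In practice the developer’s own harness must still capture screenshots and execute the click, scroll, and keystroke actions the model requests; the API only mediates the conversation.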
It took a while for people to get used to the idea of a computer that does things for them. The next leap into the unknown may involve trusting artificial intelligence to take over our computers, too.
“I think we’re going to enter into a new era where a model can use all of the tools that you use as a person to get tasks done,” says Jared Kaplan, chief science officer at Anthropic and an associate professor at Johns Hopkins University.
Kaplan showed WIRED a prerecorded demo in which an “agentic,” or tool-using, version of Claude had been asked to help plan an outing to see the sunrise at the Golden Gate Bridge with a friend. Claude used the Chrome web browser and a calendar app to find the ideal viewing spot and the optimal time to be there, then created an event and shared it with a friend. It stopped short of further steps, such as working out the fastest route to get there.
In the second demo, Claude was asked to build a website to promote itself. In a surreal moment, the model entered a text prompt into its own web interface to generate the necessary code. It then used Visual Studio Code, a popular code editor developed by Microsoft, to write a simple website, and opened a terminal to spin up a web server to test the site. The result was a decent, 1990s-themed landing page for the AI model. When asked to fix a problem on the page, the model returned to the editor and deleted the offending code.
Mike Krieger, chief product officer at Anthropic, says the company hopes that so-called AI agents will automate routine office tasks and free people up to be more productive in other areas. “What would you do if you got rid of a bunch of hours of copy and pasting or whatever you end up doing?” he says. “I’d try to play more guitar.”
Anthropic is making the agentic abilities available through its application programming interface (API) for Claude 3.5 Sonnet, its most powerful multimodal large language model, starting today. The company also announced Claude 3.5 Haiku, a new and improved version of its smaller model.