Apple has developed a new AI system called Ferret-UI 2 that can read and control apps across iPhones, iPads, Android devices, web browsers, and Apple TV.
On a UI element recognition test, the system scored 89.73, well above GPT-4o’s 77.73. It also improves markedly on previous versions in basic tasks such as text and button recognition, as well as in more complex operations.
Understanding user intent
Ferret-UI 2 aims to understand user intent rather than relying on specific click coordinates. When given a command such as “confirm input,” the system can identify the appropriate button without requiring precise location data. Apple’s research team used GPT-4o’s visual capabilities to generate high-quality training data that helps the system better understand how UI elements relate to each other spatially.
Ferret-UI 2 uses an adaptive architecture that recognizes UI elements across platforms. It includes an algorithm that automatically balances image resolution against processing requirements for each platform. According to the researchers, this approach “combines both the preservation of information and the efficiency of local encoding.”
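To make the resolution-versus-cost trade-off concrete, here is a minimal sketch of how a screenshot could be split into a grid of fixed-size tiles under a tile budget. This is an illustration only, not Apple's algorithm: the function name `choose_grid`, the tile size of 336 pixels, the budget `max_cells`, and the cost formula are all assumptions made for this example.

```python
from math import log


def choose_grid(width, height, tile=336, max_cells=16):
    """Pick (cols, rows) for splitting a screenshot into tile x tile cells.

    Hypothetical illustration: the cost terms below are assumptions,
    not the method described by the Ferret-UI 2 researchers.
    """
    best = None
    for cols in range(1, max_cells + 1):
        # Only consider grids that stay within the tile budget.
        for rows in range(1, max_cells // cols + 1):
            # How much the grid's aspect ratio deviates from the screenshot's.
            aspect_cost = abs(log((cols / rows) / (width / height)))
            # How much the screenshot must be scaled up or down to fill the grid.
            scale_cost = abs(log((cols * rows * tile * tile) / (width * height)))
            cost = aspect_cost + scale_cost
            if best is None or cost < best[0]:
                best = (cost, cols, rows)
    return best[1], best[2]


# A 672 x 336 landscape screenshot fits exactly into two tiles side by side.
print(choose_grid(672, 336))  # (2, 1)
```

A smaller `max_cells` caps the number of tiles to encode and so trades detail for lower processing cost, which is the kind of per-platform balance the paragraph above describes.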
Testing showed strong cross-platform performance: models trained on iPhone data reached 68% accuracy on iPad and 71% on Android devices. However, transfer between mobile devices and TV or web interfaces was weaker, which the researchers attribute to differences in screen layouts.
Microsoft releases UI understanding tool as open source
Apple’s efforts come as other companies develop their own UI-understanding AI systems. Anthropic recently released an updated Claude 3.5 Sonnet that can interact with user interfaces. Meanwhile, Microsoft released OmniParser, an open source tool that converts screen content into structured data for the same purpose.
Apple also recently announced CAMPHOR, a framework that uses specialized AI agents coordinated by a master inference agent to handle complex tasks. Combined with Ferret-UI 2, this technology could enable voice assistants like Siri to carry out complex tasks, such as finding and booking a specific restaurant, by navigating apps and the web using only voice commands.