Apple releases an open-source AI model for monocular depth estimation

October 7, 2024


Apple has released several open-source artificial intelligence (AI) models this year. These are usually small language models designed for a specific task. Adding to the list, the Cupertino-based tech giant has now released a new AI model called Depth Pro. It is a vision model that can generate a depth map from a single (monocular) image. This technology is useful for 3D texture generation, augmented reality (AR) and more. The researchers behind the project claim that the depth maps the model generates are better than those produced using multiple cameras.

Apple releases the Depth Pro AI model

Depth estimation is an important process in 3D modeling, as well as in several other technologies such as AR, autonomous driving systems, robotics and more. The human eye is a complex lens system that can accurately gauge the depth of objects even when viewing them from a single vantage point. Cameras, however, are not very good at it. An image taken with a single camera renders the scene in two dimensions, taking depth out of the equation.

For technologies where the depth of an object plays an important role, multiple cameras are used. However, modeling objects this way can be time-consuming and labor-intensive. Instead, in a research paper titled “Depth Pro: Sharp Monocular Metric Depth in Less Than a Second,” Apple described how it used a vision-based AI model to generate zero-shot depth maps from monocular images of objects.
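For context on why multi-camera setups can recover depth at all, here is a minimal sketch of textbook stereo triangulation, where depth equals focal length times camera spacing divided by disparity (how far a point shifts between the two views). It assumes two parallel, rectified cameras. The function name and the rig numbers are hypothetical and are not taken from Apple's paper.

```python
def stereo_depth_m(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth of a point seen by two parallel, rectified cameras: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical rig: 1000 px focal length, cameras 0.1 m apart.
near = stereo_depth_m(1000.0, 0.1, 50.0)  # large shift between views: close object
far = stereo_depth_m(1000.0, 0.1, 5.0)    # small shift between views: distant object
print(f"near: {near:.1f} m, far: {far:.1f} m")
```

A single image offers no disparity to measure, which is why a monocular model has to infer depth from visual cues instead.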

How the Depth Pro AI model generates depth maps
Photo credit: Apple

To develop the AI model, the researchers used a Vision Transformer (ViT)-based architecture. The ViT encoders work at a base resolution of 384 x 384, while the model's input and processing resolution is fixed at 1536 x 1536, a multiple of that base, giving the AI model more room to capture fine details.
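The two resolutions quoted above differ by a factor of four per side. The sketch below only illustrates that arithmetic by listing the 384 x 384 boxes that cover a 1536 x 1536 image. The non-overlapping grid and the `tile_boxes` helper are assumptions made for illustration and should not be read as Apple's actual patching scheme.

```python
def tile_boxes(height: int, width: int, tile: int) -> list[tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) boxes for non-overlapping square tiles."""
    if height % tile or width % tile:
        raise ValueError("image sides must be multiples of the tile size")
    return [
        (top, left, top + tile, left + tile)
        for top in range(0, height, tile)
        for left in range(0, width, tile)
    ]


boxes = tile_boxes(1536, 1536, 384)
print(len(boxes))  # 16 tiles: a 4 x 4 grid
print(boxes[0])    # (0, 0, 384, 384)
print(boxes[-1])   # (1152, 1152, 1536, 1536)
```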

In the pre-print version of the paper, currently published on the preprint server arXiv, the researchers claim that the AI model can accurately generate depth maps of visually complex objects such as a cage, a furry cat’s body and whiskers, and more. Generation is said to take under a second. The open-source AI model weights are currently hosted in a GitHub repository. Interested individuals can run inference with the model on a single GPU.
