Mistral’s first multimodal AI model debuts with ‘Computer Vision’ function
Mistral on Wednesday released its first multimodal artificial intelligence (AI) model, called Pixtral 12B. The AI company, known for its open-source large language models (LLMs), has also made the latest AI model available on GitHub and Hugging Face for users to download and test. Notably, Pixtral, despite being multimodal, can only process images using computer vision technology and answer questions about them. Two special encoders have been added for this functionality; it cannot generate images like Stable Diffusion models or Midjourney’s Generative Adversarial Networks (GANs).
Mistral releases Pixtral 12B
Mistral’s official account on X (formerly known as Twitter), which is known for its minimalist announcements, released the AI model in a after by sharing the magnetlink. The total file size of Pixtral 12B is 24GB and requires an NPU-capable PC or one with a powerful GPU to run the model.
The Pixtral 12B comes with 12 billion parameters and is built using the company’s existing Nemo 12B AI model. Mistral points out that users will also need the Gaussian Error Linear Unit (GeLU) as the vision adapter and 2D Rotary Position Embedding (RoPE) as the vision encoder.
Specifically, users can upload image files or URLs to the Pixtral 12B and it should be able to answer questions about the image, such as identifying objects, counting the number of objects, and sharing additional information. Since it’s built on Nemo, the model will also be adept at completing all typical text-based tasks.
A Reddit user placed an image over the Pixtral 12B benchmark scores, and it appears that the LLM outperforms Claude-3 Haiku and Phi-3 Vision in multimodal capabilities on the ChartQA bench. It also outperforms both rival AI models on the Massive Multitask Language Understanding (MMLU) bench for multimodal knowledge and reasoning.
Citing a company spokesperson, TechCrunch reports that the Mistral AI model can be refined and used under an Apache 2.0 license. This means that the outputs of the model can be used without restrictions for personal or commercial use. Furthermore, Sophia Yang, the Head of Developer Relations at Mistral, clarified in a after that Pixtral 12B will soon be available on Le Chat and Le Platforme.
For now, users can directly download the AI model using the magnet link provided by the company. Alternatively, the model weights are also hosted about Hugging Face and GitHub mentions.