Alibaba recently released Qwen2-VL, the latest model in its vision-language series.
The new model can chat via camera, play card games, and control mobile phones and robots by acting as an agent.
It is available in three versions:
- the open-source 2 billion parameter model,
- the open-source 7 billion parameter model, and
- the more advanced 72 billion parameter model, accessible via API.
The 72 billion parameter model of Qwen2-VL achieved state-of-the-art (SOTA) visual understanding across 20 benchmarks.
According to the release blog, the new model demonstrates a significant edge in document understanding.
“Overall, our 72B model showcases top-tier performance across most metrics, often surpassing even closed-source models like GPT-4o and Claude 3.5-Sonnet,”
the blog post read.
Qwen2-VL performs exceptionally well in benchmarks like
- MathVista (for math reasoning),
- DocVQA (for document understanding), and
- RealWorldQA (for answering real-world questions using visual information).
What It Can Do
a. It can analyse videos longer than 20 minutes, provide detailed summaries, and answer questions about the content.
b. It can also function as a control agent, operating devices like mobile phones and robots using visual cues and text commands.
c. It can recognise and understand text in images across multiple languages, including European languages, Japanese, Korean, and Arabic.
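For readers who want to try these capabilities, the open-weight checkpoints can be loaded through Hugging Face transformers. The sketch below shows a minimal image question-answering call against the 7B instruct model; the class and helper names (Qwen2VLForConditionalGeneration, qwen_vl_utils.process_vision_info) follow the published model card, and the image path and prompt are placeholder assumptions.

```python
# Minimal sketch: image question answering with Qwen2-VL-7B-Instruct via
# Hugging Face transformers. Verify names against the current release.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper distributed with the model card

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# A chat-style request mixing an image (placeholder path) with a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/receipt.jpg"},  # placeholder image
            {"type": "text", "text": "What is the total amount on this receipt?"},
        ],
    }
]

# Build the prompt and pack the image/video tensors the way the processor expects.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate, then strip the prompt tokens from the output before decoding.
output_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```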
Architectural Upgrades
The key architectural improvements in Qwen2-VL include:
1. The implementation of Naive Dynamic Resolution support.
The model can adapt to and process images of different sizes and clarity.
Binyuan Hui, the creator of OpenDevin and a core maintainer at Qwen, explained in a statement:
“Unlike its predecessor, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens. This ensures consistency between the model input and the inherent information in images,”
He said that this approach more closely mimics human visual perception, allowing the model to process images of any clarity or size.
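To make the idea concrete, here is a rough, hypothetical sketch of how a dynamic visual token count can scale with image size. The 14-pixel patch size and 2x2 patch merging are taken from the public Qwen2-VL description, and the rounding is simplified for illustration; this is not the model's actual preprocessing code.

```python
import math

# Conceptual illustration of "Naive Dynamic Resolution": instead of resizing
# every image to one fixed resolution, the image keeps roughly its native size
# and is cut into a variable number of patches, so the visual token count
# scales with the image rather than being fixed by a preset grid.
#
# PATCH and MERGE follow the publicly described Qwen2-VL setup
# (14x14 ViT patches, 2x2 patch merging); treat them as assumptions here.
PATCH = 14
MERGE = 2

def visual_token_count(height: int, width: int) -> int:
    """Rough token count for an image of arbitrary resolution."""
    # Round each side up to a multiple of the merged-patch size (28 px),
    # then count how many merged patches the image produces.
    unit = PATCH * MERGE
    h = math.ceil(height / unit) * unit
    w = math.ceil(width / unit) * unit
    return (h // unit) * (w // unit)

# Different input sizes map to different numbers of visual tokens.
for h, w in [(224, 224), (480, 640), (1080, 1920)]:
    print(f"{h}x{w} image -> ~{visual_token_count(h, w)} visual tokens")
```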
2. The introduction of Multimodal Rotary Position Embedding (M-ROPE).
“By deconstructing the original rotary embedding into three parts representing temporal and spatial (height and width) information, M-ROPE enables LLM to concurrently capture and integrate 1D textual, 2D visual, and 3D video positional information,” Hui added.
This technique enables the model to understand and integrate text, image and video data. “Data is All You Need!” said Hui.
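The snippet below is a conceptual sketch of the M-ROPE idea: building three-component (temporal, height, width) position indices for text, image, and video tokens. The shapes and indexing scheme are illustrative assumptions, not the exact Qwen2-VL implementation.

```python
import numpy as np

# Conceptual sketch of Multimodal Rotary Position Embedding (M-ROPE):
# the single 1D rotary position index is split into three components,
# (temporal, height, width), so text, image, and video tokens each carry
# the positional structure they actually have.

def mrope_position_ids(kind: str, length: int = 0, grid=(0, 0), frames: int = 1):
    """Return a (3, num_tokens) array of position ids: rows are (t, h, w)."""
    if kind == "text":
        # Text is purely 1D: all three components advance together.
        idx = np.arange(length)
        return np.stack([idx, idx, idx])
    if kind == "image":
        # An image is a 2D grid: constant time, varying height and width.
        H, W = grid
        h = np.repeat(np.arange(H), W)
        w = np.tile(np.arange(W), H)
        t = np.zeros(H * W, dtype=int)
        return np.stack([t, h, w])
    if kind == "video":
        # A video adds a temporal axis on top of the 2D grid of each frame.
        H, W = grid
        h = np.tile(np.repeat(np.arange(H), W), frames)
        w = np.tile(np.tile(np.arange(W), H), frames)
        t = np.repeat(np.arange(frames), H * W)
        return np.stack([t, h, w])
    raise ValueError(f"unknown token kind: {kind}")

print(mrope_position_ids("text", length=4))
print(mrope_position_ids("image", grid=(2, 3)))
print(mrope_position_ids("video", grid=(2, 2), frames=2))
```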
Use Cases
Among the many use cases, a few are worth mentioning.
William J.B. Mattingly, a digital nomad on X, recently praised this development.
He called it his new favourite Handwritten Text Recognition (HTR) model after using it to convert handwritten text into digital format.
Ashutosh Shrivastava, another user on X, used the model to solve a calculus problem and reported that it produced correct results.
This points to its usefulness in problem solving.
The update is available on Hugging Face.