Alibaba Releases Qwen2-VL

Alibaba recently released Qwen2-VL, the latest model in its vision-language series. 

The new model can chat via camera, play card games, and control mobile phones and robots by acting as an agent.

It is available in three versions: 

  • the open-source 2 billion parameter model, 
  • the open-source 7 billion parameter model, and 
  • the more advanced 72 billion parameter model, accessible via API. 

The advanced 72 billion parameter model of Qwen2-VL achieved SOTA visual understanding across 20 benchmarks.

According to the blog post, the new model demonstrates a significant edge in document understanding.

“Overall, our 72B model showcases top-tier performance across most metrics, often surpassing even closed-source models like GPT-4o and Claude 3.5-Sonnet,” the blog reads.

Qwen2-VL performs exceptionally well in benchmarks like 

  • MathVista (for math reasoning), 
  • DocVQA (for document understanding), and 
  • RealWorldQA (for answering real-world questions using visual information).

What It Can Do

a. It can analyse videos longer than 20 minutes, provide detailed summaries, and answer questions about the content. 

b. It can also function as a control agent, operating devices like mobile phones and robots using visual cues and text commands.

c. It can recognise and understand text in images across multiple languages, including European languages, Japanese, Korean, and Arabic.

Architectural Upgrades

Key architectural improvements in Qwen2-VL include:

1. The implementation of Naive Dynamic Resolution support.

The model can adapt to and process images of different sizes and clarity.

Binyuan Hui, the creator of OpenDevin and a core maintainer at Qwen, said in his statement:

“Unlike its predecessor, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens. This ensures consistency between the model input and the inherent information in images.”

He said that this approach more closely mimics human visual perception, allowing the model to process images of any clarity or size.
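To make the idea concrete, here is a simplified sketch (not Qwen2-VL's actual implementation) of how a dynamic-resolution scheme maps an image of arbitrary size to a variable number of visual tokens; the 14-pixel patch size and 2×2 patch merging used below are assumptions for illustration only:

```python
# Simplified sketch of dynamic resolution: the visual token count grows with
# the image, instead of every image being squashed to one fixed size.
# Patch size (14 px) and 2x2 patch merging are illustrative assumptions.
def visual_token_count(height: int, width: int,
                       patch: int = 14, merge: int = 2) -> int:
    h_patches = -(-height // patch)   # ceil division: whole patches per side
    w_patches = -(-width // patch)
    # Each visual token covers a merge x merge block of patches.
    return (h_patches // merge) * (w_patches // merge)

print(visual_token_count(224, 224))     # small thumbnail -> 64 tokens
print(visual_token_count(1080, 1920))   # full-HD frame  -> 2691 tokens
```

A small thumbnail and a large photo thus get very different token budgets, rather than both being resized to a single canonical resolution.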

2. The introduction of Multimodal Rotary Position Embedding (M-ROPE).

“By deconstructing the original rotary embedding into three parts representing temporal and spatial (height and width) information, M-ROPE enables LLM to concurrently capture and integrate 1D textual, 2D visual, and 3D video positional information,” Hui added.

This technique enables the model to understand and integrate text, image and video data. “Data is All You Need!” said Hui.
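As a rough illustration of that idea (a conceptual sketch only; the embedding size, base frequency, and dimension split below are assumed, not Qwen2-VL's actual configuration), each token's rotary angles can be built from three positional indices instead of one:

```python
import numpy as np

# Conceptual sketch of M-ROPE: each token carries a (temporal, height, width)
# position, and the rotary frequency bands are split across the three axes.
# dim and base here are illustrative assumptions.
def mrope_angles(t: int, h: int, w: int, dim: int = 96, base: float = 10000.0):
    third = dim // 3
    inv_freq = 1.0 / (base ** (np.arange(0, third, 2) / third))
    # One group of rotation angles per positional axis.
    return np.concatenate([t * inv_freq, h * inv_freq, w * inv_freq])

# Text tokens repeat the same index on all axes (reducing to ordinary 1D RoPE);
# image patches vary height/width; video patches additionally vary time.
text_token  = mrope_angles(t=5, h=5, w=5)
image_patch = mrope_angles(t=0, h=3, w=7)
video_patch = mrope_angles(t=12, h=3, w=7)
```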

Use Cases

Among the many use cases, a few are worth mentioning.

William J.B. Mattingly, a digital nomad on X, recently praised this development.

He called it his new favorite Handwritten Text Recognition (HTR) model after using it to convert handwritten text into digital format.

Ashutosh Shrivastava, a user on X, also used the model to solve a calculus problem and reported successful results.

This further illustrates its usefulness in problem solving.

The update is available on Hugging Face.
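For readers who want to try it, the following is a minimal sketch of querying the open 7B checkpoint with an image through the Hugging Face transformers integration. The model ID and processor calls follow the pattern documented on the model card, but verify them against the current documentation; the image file and question are placeholders.

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Ask a question about a local image ("receipt.png" is a placeholder path).
image = Image.open("receipt.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What is the total amount on this receipt?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```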
