Vision?

#16

by erichartford - opened 3 days ago

With tough competition like Kimi-K2.7-Code, MiniMax M3, and Nex N2 Pro all offering Vision, it's hard to pick a non-vision model.

LuoChen2008

3 days ago

MiMo-V2.5 is a multimodal LLM that supports image, video, and audio input, while Mimo-V2.5-pro is purely text-based...
One workaround is to OCR the picture with mimo-v2.5 and then process the extracted text with mimo-v2.5-pro.
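
A minimal sketch of that two-stage workaround, for illustration only. The function `two_stage_answer` and the two callables `describe_image` and `answer_text` are hypothetical names, not part of any MiMo API; in practice each callable would wrap whatever client you use to reach the multimodal model and the text-only model respectively.

```python
from typing import Callable


def two_stage_answer(
    image_bytes: bytes,
    question: str,
    describe_image: Callable[[bytes], str],
    answer_text: Callable[[str], str],
) -> str:
    """Route an image question through two models.

    describe_image: hypothetical wrapper around the multimodal model;
                    takes raw image bytes, returns the text it reads (OCR).
    answer_text:    hypothetical wrapper around the text-only model;
                    takes a prompt string, returns the model's reply.
    """
    # Stage 1: the vision-capable model turns the image into text.
    transcript = describe_image(image_bytes)

    # Stage 2: the text-only model sees only the transcript and the question.
    prompt = (
        "Text extracted from the image:\n"
        f"{transcript}\n\n"
        f"Question: {question}"
    )
    return answer_text(prompt)


if __name__ == "__main__":
    # Stub callables stand in for real model calls.
    reply = two_stage_answer(
        b"fake-image-bytes",
        "What is the total?",
        describe_image=lambda img: "Total: 42",
        answer_text=lambda prompt: prompt,
    )
    print(reply)
```

Note that anything the first stage fails to transcribe (layout, charts, colors) never reaches the second model, which is the main limitation of this approach compared with native vision support.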

ehartford

3 days ago

That's precisely what I'm suggesting - that mimo v2.5 pro should have had a ViT.

It's a relatively small feature addition that has a huge impact on usefulness for many use cases.
