Vision?

#16
by erichartford - opened

With tough competition like Kimi-K2.7-Code, MiniMax M3, and Nex N2 Pro all offering Vision, it's hard to pick a non-vision model.

MiMo-V2.5 is a multimodal LLM that supports image, video, and audio input, while MiMo-V2.5-Pro is purely text-based...
One solution is to OCR the picture with MiMo-V2.5 and then process the extracted text with MiMo-V2.5-Pro.
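
A rough sketch of that two-stage workaround, assuming you already have some way to call each model. The `transcribe` and `reason` callables, and the stand-in functions below, are hypothetical placeholders for illustration, not a real MiMo API:

```python
from typing import Callable


def two_stage_answer(
    image: bytes,
    question: str,
    transcribe: Callable[[bytes], str],  # hypothetical wrapper around the multimodal model
    reason: Callable[[str], str],        # hypothetical wrapper around the text-only model
) -> str:
    """Stage 1 turns the image into text; stage 2 answers from that text alone."""
    extracted = transcribe(image)
    prompt = (
        "Text extracted from an image:\n"
        f"{extracted}\n\n"
        f"Question: {question}"
    )
    return reason(prompt)


# Stand-ins so the sketch runs without any model; replace them with real client calls.
def fake_transcribe(image: bytes) -> str:
    return f"<{len(image)} bytes of image transcribed>"


def fake_reason(prompt: str) -> str:
    return prompt.upper()


if __name__ == "__main__":
    print(two_stage_answer(b"abc", "What is shown?", fake_transcribe, fake_reason))
```

The text-only model only ever sees what stage 1 wrote down, so anything the transcription leaves out is unavailable to it.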

That's precisely what I'm suggesting - that MiMo-V2.5-Pro should have had a ViT.

It's a relatively small feature addition that has a huge impact on usefulness for many use cases.
