Tsinghua KEG Lab and Zhipu AI jointly launched CogAgent, a large image understanding model
Bit News — Tsinghua University's KEG Lab and Zhipu AI have jointly released CogAgent, a new-generation large image understanding model. Built on the previously released CogVLM, CogAgent acts as a visual GUI agent: it perceives the GUI directly through the visual modality rather than through text, giving it a more comprehensive and direct view of the interface for planning and decision-making. CogAgent reportedly accepts high-resolution image input up to 1120×1120 and supports visual question answering, visual grounding, and GUI-agent capabilities, taking first place in generalist performance on nine classic image understanding benchmarks, including VQAv2, STVQA, DocVQA, TextVQA, MM-Vet, and POPE.
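For readers who want to try the model, a minimal sketch of querying a CogAgent-style checkpoint with a GUI screenshot is shown below. The repository id THUDM/cogagent-chat-hf, the Vicuna tokenizer, and the build_conversation_input_ids helper follow the Hugging Face release pattern of the CogVLM model family and should be treated as assumptions rather than details confirmed by this article.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

# Assumed repo id and tokenizer from the CogVLM-family Hugging Face
# release; not confirmed by the article itself.
MODEL_ID = "THUDM/cogagent-chat-hf"
TOKENIZER_ID = "lmsys/vicuna-7b-v1.5"

tokenizer = LlamaTokenizer.from_pretrained(TOKENIZER_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,  # model code ships with the checkpoint
).to("cuda").eval()

# CogAgent accepts high-resolution screenshots (up to 1120x1120 per the article).
image = Image.open("screenshot.png").convert("RGB")
query = "What steps do I need to take to search for CogAgent?(with grounding)"

# build_conversation_input_ids is a helper defined in the checkpoint's
# remote code (assumption: same interface as the CogVLM release).
inputs = model.build_conversation_input_ids(
    tokenizer, query=query, history=[], images=[image]
)
batch = {
    "input_ids": inputs["input_ids"].unsqueeze(0).to("cuda"),
    "token_type_ids": inputs["token_type_ids"].unsqueeze(0).to("cuda"),
    "attention_mask": inputs["attention_mask"].unsqueeze(0).to("cuda"),
    "images": [[inputs["images"][0].to("cuda").to(torch.bfloat16)]],
}
# CogAgent adds a high-resolution image branch on top of CogVLM,
# fed separately as "cross_images" in this release pattern.
if inputs.get("cross_images"):
    batch["cross_images"] = [[inputs["cross_images"][0].to("cuda").to(torch.bfloat16)]]

with torch.no_grad():
    out = model.generate(**batch, max_length=2048, do_sample=False)
    out = out[:, batch["input_ids"].shape[1]:]  # keep only the generated tokens
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The "(with grounding)" suffix in the prompt follows the released demo's convention for asking the model to return bounding boxes for the GUI elements it refers to; plain visual questions work without it.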