GPT-4V: A New Multimodal AI Model by OpenAI


OpenAI has announced GPT-4V, a multimodal variant of GPT-4 that can process both text and image inputs. The accompanying system card offers insight into the model's capabilities and limitations and into the mitigations implemented to support responsible use.
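For readers curious what a text-plus-image request might look like in practice, here is a minimal sketch assuming GPT-4V is served through OpenAI's Chat Completions API. The model identifier ("gpt-4-vision-preview") and the image URL are illustrative assumptions, not details taken from the system card.

```python
# A minimal sketch of a multimodal request, assuming GPT-4V is exposed
# through OpenAI's Chat Completions API. The model name and image URL
# below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                # A single user turn can mix text and image parts.
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/figure.png"},
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
```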

GPT-4V’s Capabilities and Limitations:

GPT-4V has shown proficiency in understanding complex images, including specialized imagery from scientific publications. It can critically assess claims of novel scientific discoveries and even identify dangerous compounds or poisonous foods. However, the model has limitations: it can combine unrelated text components, overlook mathematical symbols, and fail to recognize information that appears in an image. It can also make factual errors and hallucinate details that are not present.

The model’s ability to provide medical advice has also been tested, revealing inconsistent interpretations: it sometimes gives accurate answers to a question and, at other times, gives wrong answers to the same question. Given these limitations, GPT-4V is not considered fit to perform any medical function.

Mitigations and Safety Measures:

OpenAI has implemented several mitigations to address potential risks. For instance, the model refuses in some instances to answer questions about hate symbols and extremist content, although this refusal behavior can be inconsistent and at times contextually inappropriate. OpenAI has added refusals for certain kinds of harmful generations but acknowledges that this remains a difficult problem to solve.

The model also inherits transfer benefits from the model-level and system-level safety mitigations already deployed in GPT-4, and some safety measures implemented for DALL·E have proved beneficial in addressing potential multimodal risks in GPT-4V.
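To illustrate what a system-level safeguard can look like from a developer's perspective, the sketch below screens incoming text with OpenAI's Moderation endpoint before forwarding a request to the model. This is a hypothetical example of layered mitigations, not the internal mechanism described in the system card.

```python
# A hedged illustration of a system-level mitigation: screening a user's
# text prompt with OpenAI's Moderation endpoint before forwarding it.
# This mirrors the idea of layered safeguards; it is not OpenAI's
# internal implementation.
from openai import OpenAI

client = OpenAI()

def screened_prompt(prompt: str) -> str | None:
    """Return the prompt if it passes moderation, else None (refuse)."""
    result = client.moderations.create(input=prompt)
    if result.results[0].flagged:
        return None  # refuse to forward flagged input
    return prompt

user_text = "Describe the landmark in this photo."
if screened_prompt(user_text) is None:
    print("Request refused by safety screening.")
else:
    print("Request forwarded to the multimodal model.")
```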

Future Steps:

OpenAI plans to invest further in fundamental questions about the behaviors the model should or should not engage in. These include whether the model should be allowed to infer gender, race, or emotions from images of people, and whether accessibility for visually impaired users warrants special consideration. OpenAI also plans to improve performance for languages spoken by users around the world and to enhance image-recognition capabilities relevant to a global audience.

In conclusion, while GPT-4V presents exciting opportunities, it also poses novel challenges. OpenAI’s deployment preparation has focused on assessing and mitigating risks related to images of people, biased outputs, and the model’s capability jumps in high-risk domains such as medicine and science.
