Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper, we analyze the latest model, GPT-4V(ision), to deepen the understanding of LMMs. The analysis focuses on the intriguing tasks that GPT-4V can perform, containing test samples to probe the quality and genericity of GPT-4V's capabilities, its supported inputs and working modes, and the effective ways to prompt the model. In our approach to exploring GPT-4V, we curate and organize a collection of carefully designed qualitative samples spanning a variety of domains and tasks. Observations from these samples demonstrate that GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities together make GPT-4V a powerful multimodal generalist system. Furthermore, GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods such as visual referring prompting. We conclude the report with in-depth discussions on the emerging application scenarios and the future research directions for GPT-4V-based systems. We hope that this preliminary exploration will inspire future research on the next-generation multimodal task formulation, new ways to exploit and enhance LMMs to solve real-world problems, and gaining better understanding of multimodal foundation models.

It's an interesting, long paper, but giving it a detailed read really bothers me. A good example is section 3.1 demonstrating "condition on good performance" - it shows how different prompts arrive at different results and how the "you are an expert in counting things" prompt gives the correct result. Great! Except this is one image - it doesn't demonstrate any generality of this prompt being superior - for all I know they just got lucky. That is, it gives an overview of what might work, but I would have liked to see more robust tests over dozens of samples in the same category.

And section 9.8 "video watching" is being oversold. In the video of "things that Asian people do for no reason", it's mostly just reading the captions with basic scene info added (the only non-textual image read I see is "two girls"). The fifth scene "shows a girl wrapping pants around her neck to see if it fits", which isn't even shown in the frame - that's just the text. Likewise, most of the captions in the Washington state parks example aren't even based on the image (there are no hiking trails shown in the Pilchuck or North Cascades image).

There was a combination of thoughtful tests, and lazy ones. E.g., there were a lot of queries of the images which almost certainly were asking GPT-4 to lean on crystallized knowledge ("Which of these oceans does the prime meridian intersect?") in ways that IMO were very unclear as to whether much visual "thinking" was going on at all. Or, "here are pictures of famous landmarks, look at the very thoughtful descriptions GPT-4V is producing!" The authors could have tried to probe how much of this was coming from "knowledge" versus "inference" via picture manipulation, but they didn't.
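As a rough illustration of the kind of robustness check the comment asks for, here is a minimal sketch that scores a few prompt variants over a batch of counting images rather than a single example. It assumes the OpenAI Python client; the model name, image files, ground-truth counts, and the crude answer-extraction helper are all placeholders for illustration, not anything from the paper.

```python
# Hypothetical sketch: run the same counting question under several system prompts
# over many images, instead of judging prompt quality from a single example.
# Model name, image paths, and ground-truth counts below are placeholders.
import base64
import re
from openai import OpenAI

client = OpenAI()

PROMPTS = {
    "plain": "Count the apples in the image.",
    "expert": "You are an expert in counting things. Count the apples in the image.",
}

# (image_path, true_count) pairs -- dozens of samples per category, as suggested above.
SAMPLES = [("apples_01.jpg", 7), ("apples_02.jpg", 12)]  # ... extend to a real set

def ask(prompt: str, image_path: str) -> str:
    """Send one text+image query and return the model's reply."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def first_int(text: str) -> int | None:
    # Crude answer extraction: take the first integer in the reply.
    m = re.search(r"\d+", text)
    return int(m.group()) if m else None

for name, prompt in PROMPTS.items():
    correct = sum(first_int(ask(prompt, path)) == truth for path, truth in SAMPLES)
    print(f"{name}: {correct}/{len(SAMPLES)} correct")
```

With dozens of samples per category, a comparison like this would show whether the "expert" framing actually helps on average or whether the single reported win was luck.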