#vision-language-models

from Nature
4 days ago

Multimodal learning with next-token prediction for large multimodal models - Nature

Since AlexNet (ref. 5), deep learning has replaced heuristic hand-crafted features by unifying feature learning with deep neural networks. Later, Transformers (ref. 6) and GPT-3 (ref. 1) further advanced sequence learning at scale, unifying structured tasks such as natural language processing. However, multimodal learning, spanning modalities such as images, video and text, has remained fragmented, relying on separate diffusion-based generation or compositional vision-language pipelines with many hand-crafted designs.
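The paper's core move is to collapse all modalities into a single token stream trained with one next-token objective. A minimal sketch of that idea in PyTorch, assuming a discrete vision tokenizer whose codes share a vocabulary with text tokens; the vocabulary split, dimensions, and class name below are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class TinyMultimodalLM(nn.Module):
    """Causal Transformer over a shared text+image token vocabulary.
    Positional embeddings are omitted for brevity."""
    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.head(h)

# One sequence, one objective: text token ids followed by discrete image
# token ids (vision codebook assumed to occupy ids 50_000..58_191).
text_ids  = torch.randint(0, 50_000, (1, 16))
image_ids = torch.randint(50_000, 58_192, (1, 64))
seq = torch.cat([text_ids, image_ids], dim=1)

model = TinyMultimodalLM(vocab_size=58_192)
logits = model(seq[:, :-1])  # predict token t+1 from tokens <= t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1))
```

Because both modalities share the same loss, generating text from an image or an image from text reduces to continuing the sequence, with no separate diffusion or pipeline stage.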
Artificial intelligence
from ZDNET
3 weeks ago

Nvidia's physical AI models clear the way for next-gen robots - here's what's new

Nvidia released open Cosmos and GR00T physical-AI models to accelerate robot development, enabling realistic world understanding, simulation, and reasoning while reducing pretraining effort.
Artificial intelligence
from TechCrunch
2 months ago

Nvidia announces new open AI models and tools for autonomous driving research | TechCrunch

Nvidia released Alpamayo-R1, an open vision-language reasoning model plus Cosmos Cookbook resources to accelerate level-4 autonomous driving and physical AI development.
Wearables
from ZDNET
4 months ago

These Halo smart glasses just got a major memory boost, thanks to Liquid AI

Brilliant Labs will integrate Liquid AI's vision–language foundation models into Halo AI smart glasses to improve real-time scene understanding and agentic memory.
Artificial intelligence
from Computerworld
4 months ago

Microsoft researchers develop new tech for video AI agents

Microsoft researchers are developing MindJourney, a framework in which a video-generation world model lets a VLM explore a 3D space by imagining its surroundings and candidate movements before reasoning about them.
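Read as an algorithm, the article describes a test-time exploration loop: the world model imagines what the agent would see after each candidate move, and the VLM judges which imagined view best helps answer the current question. A hedged pseudocode sketch; every API here (`world_model.imagine`, `vlm.score`) is hypothetical:

```python
def explore_step(observation, question, world_model, vlm, actions):
    """Pick the move whose imagined view the VLM finds most informative."""
    best_action, best_score = None, float("-inf")
    for action in actions:  # e.g. move forward, turn left, turn right
        # Video-generation world model predicts the post-action view (assumed API).
        imagined_view = world_model.imagine(observation, action)
        # VLM rates how useful that view is for the question (assumed API).
        score = vlm.score(imagined_view, question)
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```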
Philosophy
from The Register
5 months ago

Vision AI models see optical illusions when none exist

Vision-language models such as GPT-5 describe ordinary images as well-known optical illusions even when no illusion is present, echoing a human-like cognitive bias toward familiar patterns over what is actually shown.
Artificial intelligence
from HackerNoon
2 years ago

Researchers Push Vision-Language Models to Grapple with Metaphors, Idioms, and Sarcasm | HackerNoon

The V-FLUTE dataset benchmarks how well vision-language models understand figurative language such as metaphors, idioms, and sarcasm.
Artificial intelligence
from HackerNoon
2 years ago

Can AI Understand a Joke? New Dataset Tests Bots on Metaphors, Sarcasm, and Humor | HackerNoon

Large AI models struggle with figurative language, whose meaning is implicit rather than literal.
#idefics2
Bootstrapping
from HackerNoon

The Artistry Behind Efficient AI Conversations | HackerNoon

The cross-attention architecture outperforms fully autoregressive models on vision-language tasks, though at a higher computational cost.
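The two architectures differ in where image features enter the language model, which is where the cost difference comes from. A minimal sketch of the contrast with illustrative shapes and module names (not the Idefics2 code):

```python
import torch
import torch.nn as nn

d = 512
text = torch.randn(1, 32, d)  # hidden states for 32 text tokens
img  = torch.randn(1, 64, d)  # vision-encoder outputs for 64 patches

# (a) Fully autoregressive: project image features into "visual tokens"
# and prepend them, so the unmodified language model's self-attention
# fuses the modalities over one longer sequence.
proj = nn.Linear(d, d)
fused_ar = torch.cat([proj(img), text], dim=1)  # shape (1, 96, d)

# (b) Cross-attention: extra layers interleaved with the language model
# let text queries attend to image keys/values; these layers add
# parameters and per-layer compute.
xattn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
fused_x, _ = xattn(query=text, key=img, value=img)  # shape (1, 32, d)
```

The cross-attention variant pays for its added layers on every forward pass, which is the higher computational cost the summary refers to; the fully autoregressive variant keeps the language model unchanged but lengthens its input.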
#machine-learning
Artificial intelligence
from PyImageSearch
8 months ago

Content Moderation via Zero Shot Learning with Qwen 2.5 - PyImageSearch

The growth of user-generated content makes moderation an increasingly complex problem for digital platforms.
Qwen 2.5's multimodal understanding enables zero-shot moderation: the policy is supplied as a prompt rather than learned by a trained classifier.
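Zero-shot here means prompting a general vision-language model with the moderation policy instead of fine-tuning on labeled data. A minimal sketch following the usage pattern documented for Qwen2.5-VL in Hugging Face Transformers; the model ID, file name, label set, and prompt wording are assumptions, not the article's exact setup:

```python
# Zero-shot image moderation by prompting Qwen2.5-VL (assumed model ID).
# Requires: pip install transformers qwen-vl-utils (recent Transformers).
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# The moderation policy lives entirely in the prompt -- no fine-tuning.
messages = [{"role": "user", "content": [
    {"type": "image", "image": "user_upload.jpg"},  # hypothetical file
    {"type": "text", "text": (
        "Classify this image for content moderation. Answer with exactly "
        "one label: SAFE, VIOLENCE, ADULT, or HATE.")},
]}]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=8)
label = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(label)  # e.g. "SAFE"
```

Constraining the answer to a fixed label set keeps the free-form model usable as a classifier; swapping the label list in the prompt changes the policy without retraining.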