260316 X (Twitter) Roundup
RT by @hwchase17: Hear Harrison at the [[NVIDIA]] GTC Keynote Pre-Game on Monday, Ma
Hear Harrison at the NVIDIA GTC Keynote Pre-Game on Monday, March 16 (8-11 AM PT) during the session “The Agentic AI Inflection Point.”
He’ll be joined by:
Peter Steinberger, OpenClaw
Sam Rodriques, Edison Scientific
Vincent Weisser, Prime Intellect
📅 Add it to your calendar: nvidia.com/gtc/pregame/
Source: https://nitter.net/LangChain/status/2033239787394441237#m
RT by @ylecun: If Trump doesn't like your coverage of the war, his FCC will pull
If Trump doesn't like your coverage of the war, his FCC will pull your broadcast license.
That is flagrantly unconstitutional.
Brendan Carr (@BrendanCarrFCC)
Broadcasters that are running hoaxes and news distortions - also known as the fake news - have a chance now to correct course before their license renewals come up.
The law is clear. Broadcasters must operate in the public interest, and they will lose their licenses if they do not.
And frankly, changing course is in their own business interests, since trust in legacy media has now fallen to an all-time low of just 9% and their ratings are disasters.
The American people have subsidized broadcasters to the tune of billions of dollars by providing free access to the nation’s airwaves.
It is very important to bring trust back into media, which has earned itself the label of fake news.
When a political candidate is able to win a landslide election victory in the face of hoaxes and distortions, there is something very wrong. It means the public has lost faith and confidence in the media. And we can’t allow that to happen.
Time for change!
Source: https://nitter.net/CAgovernor/status/2032875355879690534#m
RT by @ylecun: Latent world models learn differentiable dynamics in a learned re
Latent world models learn differentiable dynamics in a learned representation space, which should make planning as simple as gradient descent.
But it almost never works.
What I mean is, at test time, you can treat the action sequence as learnable parameters, roll out the frozen world model, measure how far the predicted final state is from the goal, and backprop through the entire unrolled chain to optimize actions directly. Yet many of the systems that work (Dreamer, TD-MPC2, DINO-WM) abandon this and fall back to sampling-based search instead.
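The test-time procedure above can be sketched end to end. A toy version, using a stand-in linear system z_{t+1} = A z_t + B a_t in place of a learned world model (all sizes, constants, and the linear dynamics themselves are my illustrative assumptions, not the paper's setup):

```python
import numpy as np

# Gradient-based planning through a frozen world model: treat the action
# sequence as learnable parameters, roll out, measure terminal error,
# backprop through the unrolled chain. Linear stand-in dynamics for clarity.
d, T = 4, 10                          # latent dim, planning horizon
A, B = np.eye(d) * 0.9, np.eye(d) * 0.5
z0, goal = np.zeros(d), np.ones(d)

actions = np.zeros((T, d))            # the action sequence is the only "parameter"
lr = 0.3
for _ in range(200):
    zs = [z0]                         # forward rollout through the frozen model
    for t in range(T):
        zs.append(A @ zs[-1] + B @ actions[t])
    lam = 2.0 * (zs[-1] - goal)       # gradient of terminal squared error
    grads = np.zeros_like(actions)
    for t in reversed(range(T)):      # adjoint recursion = backprop through time
        grads[t] = B.T @ lam
        lam = A.T @ lam
    actions -= lr * grads

z = z0                                # evaluate the optimized plan
for t in range(T):
    z = A @ z + B @ actions[t]
final_dist = np.linalg.norm(z - goal)
```

On this convex stand-in a single gradient loop reaches the goal; the paper's point is that with a real learned latent space the same loop gets stuck.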
That's why I really like this new paper by @yingwww_, @ylecun, and @mengyer, which gives a clean diagnosis of why, and a principled fix.
The reason everyone abandons gradient descent on actions is that the planning objective is highly non-convex in the learned latent space. So instead most systems use CEM (cross-entropy method) or MPPI (model predictive path integral), both derivative-free.
CEM samples batches of action sequences, evaluates them by rolling out the world model, keeps the top-k, and refits the sampling distribution.
MPPI does something similar but weights trajectories by exponentiated negative cost instead of hard elite selection.
These work when gradients are unreliable but the compute cost is substantial — hundreds of candidate rollouts per planning step vs a single forward-backward pass.
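For comparison, a minimal CEM planner on the same kind of terminal-cost objective. The "world model" is again an illustrative linear stand-in, and the population sizes are arbitrary:

```python
import numpy as np

# Cross-entropy method (CEM): sample action sequences, roll each out,
# keep the top-k elites, refit the sampling distribution, repeat.
rng = np.random.default_rng(0)
d, T = 4, 10
A, B = np.eye(d) * 0.9, np.eye(d) * 0.5
z0, goal = np.zeros(d), np.ones(d)

def rollout_cost(actions):               # actions: (T, d)
    z = z0
    for t in range(T):
        z = A @ z + B @ actions[t]
    return np.sum((z - goal) ** 2)       # terminal squared error

mu, sigma = np.zeros((T, d)), np.ones((T, d))
n_samples, n_elite = 200, 20
for _ in range(20):
    cands = mu + sigma * rng.normal(size=(n_samples, T, d))
    costs = np.array([rollout_cost(c) for c in cands])
    elites = cands[np.argsort(costs)[:n_elite]]   # hard elite selection
    mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6

best_cost = rollout_cost(mu)
```

MPPI would replace the hard top-k step with exponentially weighted averaging over all candidates; note the hundreds of rollouts per refit versus one forward-backward pass above.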
This paper asks what exactly makes the latent planning landscape so hostile to gradients and what you can do about it.
The diagnosis. Their baseline is DINO-WM, a JEPA-style world model with a ViT predictor planning in frozen DINOv2 feature space, minimizing terminal MSE between predicted and goal embeddings. The problem is that DINOv2 latent trajectories are highly curved (when you use MSE as the planning cost, you're implicitly assuming Euclidean distance approximates geodesic distance along feasible transitions).
For curved trajectories this breaks badly: gradient-based planners get trapped, and straight-line distances in embedding space misrepresent actual reachability.
The fix draws from the perceptual straightening hypothesis in neuroscience — the idea that biological visual systems transform complex video into internally straighter representations. So they add a curvature regularizer during world model training.
Given consecutive encoded states
z_t, z_{t+1}, z_{t+2},
define velocity vectors as
v_t = z_{t+1} - z_t
measure curvature as the cosine similarity between consecutive velocities, and minimize
L_curv = 1 - cos(v_t, v_{t+1}).
Total loss is then
L_pred + λ * L_curv
with stop-gradient on the target branch to prevent collapse.
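A minimal sketch of the curvature term as defined above (the eps guard and the example trajectories are my additions):

```python
import numpy as np

def curvature_loss(zs, eps=1e-8):
    """Mean of 1 - cos(v_t, v_{t+1}) over a trajectory zs of shape (T, d), T >= 3."""
    v = np.diff(zs, axis=0)              # velocities v_t = z_{t+1} - z_t
    v0, v1 = v[:-1], v[1:]               # consecutive velocity pairs
    cos = np.sum(v0 * v1, axis=1) / (
        np.linalg.norm(v0, axis=1) * np.linalg.norm(v1, axis=1) + eps)
    return float(np.mean(1.0 - cos))     # 0 for perfectly straight trajectories

straight = np.arange(12.0).reshape(4, 3)                # collinear latents
bent = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])   # a 90-degree turn
```

During training this would be added to the prediction loss as L_pred + λ * curvature_loss, with the stop-gradient applied on the target branch as described.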
The theory backs this up cleanly — they prove that reducing curvature directly bounds how well-conditioned the planning optimization is — straighter latent trajectories guarantee faster convergence of gradient descent over longer horizons.
Worth noting that even without the curvature loss, training the encoder with a prediction objective alone produces some "implicit straightening" — the JEPA loss naturally favors representations whose temporal evolution is predictable. Explicit regularization simply pushes this much further.
Empirical results across four 2D goal-reaching environments are consistently strong. Open-loop success improves by 20-50%, and gradient descent with straightening matches or beats CEM at a fraction of the compute.
The most convincing evidence is the distance heatmaps: after straightening, latent Euclidean distance closely matches the shortest distance between states, even though the model was trained only on suboptimal random trajectories.
What I find interesting beyond the specific method is that the planning algorithm didn't change. The dynamics model didn't change. A single regularization term on the embedding geometry turned gradient descent from unreliable to competitive with sampling methods.
The field has largely treated representation learning and planning as separate concerns — learn good features, then figure out how to plan in them.
This paper makes a concrete case that the representation geometry is itself the bottleneck.
This connects to a broader pattern in ML. When optimization fails, the instinct is to fix the optimizer (better search, more samples, adaptive schedules). But often the real lever is the shape of the space you're optimizing in.
Same principle shows up in RL post-training where reward landscape shaping matters as much as the algorithm itself.
Shape the space so simple optimization works, rather than building complex optimization to handle a bad space.
Their paper:
arxiv.org/abs/2603.12231
Source: https://arxiv.org/abs/2603.12231
RT by @ylecun: What is a good latent space for world modeling and planning? 🤔
What is a good latent space for world modeling and planning? 🤔
Inspired by the perceptual straightening hypothesis in human vision, we introduce temporal straightening to improve representation learning for latent planning.
📑: agenticlearning.ai/temporal-…

Source: https://nitter.net/yingwww_/status/2032499275687489994#m
RT by @ylecun: @arxiv is recruiting a CEO who will lead the organization as it
@arxiv is recruiting a CEO who will lead the organization as it becomes an independent non-profit. This is an exciting opportunity to improve the infrastructure for open science! The job advert is here: jobs.chronicle.com/job/37961…
Source: https://nitter.net/tdietterich/status/2032118223345500659#m
RT by @ylecun: AMI Labs just raised $1.03B. World Labs raised $1B a few weeks ea
AMI Labs just raised $1.03B. World Labs raised $1B a few weeks earlier. Both are betting on world models.
But almost nobody means the same thing by that term.
Here are, in my view, five categories of world models.
---
1. Joint Embedding Predictive Architecture (JEPA)
Representatives: AMI Labs (@ylecun), V-JEPA 2
The central bet here is that pixel reconstruction alone is an inefficient objective for learning the abstractions needed for physical understanding. LeCun has been saying this for years — predicting every pixel of the future is intractable in any stochastic environment. JEPA sidesteps this by predicting in a learned latent space instead.
Concretely, JEPA trains an encoder that maps video patches to representations, then a predictor that forecasts masked regions in that representation space — not in pixel space.
This is a crucial design choice.
A generative model that reconstructs pixels is forced to commit to low-level details (exact texture, lighting, leaf position) that are inherently unpredictable. By operating on abstract embeddings, JEPA can capture "the ball will fall off the table" without having to hallucinate every frame of it falling.
V-JEPA 2 is the clearest large-scale proof point so far. It's a 1.2B-parameter model pre-trained on 1M+ hours of video via self-supervised masked prediction — no labels, no text. The second training stage is where it gets interesting: just 62 hours of robot data from the DROID dataset is enough to produce an action-conditioned world model that supports zero-shot planning. The robot generates candidate action sequences, rolls them forward through the world model, and picks the one whose predicted outcome best matches a goal image. This works on objects and environments never seen during training.
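The planning loop described here is essentially random shooting. A hedged sketch, with a linear stand-in for the learned predictor (candidate counts, dimensions, and the dynamics are all illustrative assumptions):

```python
import numpy as np

# Sample candidate action sequences, roll each through a frozen world
# model, keep the one whose predicted final state is closest to the
# goal embedding (standing in for the encoded goal image).
rng = np.random.default_rng(0)
d, T, n_cand = 8, 5, 512                 # latent dim, horizon, candidates
A = np.eye(d) * 0.95
B = rng.normal(size=(d, d)) * 0.2
z0 = np.zeros(d)
z_goal = rng.normal(size=d)              # embedding of the "goal image"

def final_state(actions):                # actions: (T, d)
    z = z0
    for a in actions:
        z = A @ z + B @ a
    return z

cands = rng.normal(size=(n_cand, T, d))  # candidate action sequences
dists = np.array([np.sum((final_state(c) - z_goal) ** 2) for c in cands])
best_plan = cands[np.argmin(dists)]      # execute this plan (or just its first action)
```

In practice this runs inside a receding-horizon loop: execute the first action, observe, re-plan.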
The data efficiency is the real technical headline. 62 hours is almost nothing. It suggests that self-supervised pre-training on diverse video can bootstrap enough physical prior knowledge that very little domain-specific data is needed downstream. That's a strong argument for the JEPA design — if your representations are good enough, you don't need to brute-force every task from scratch.
AMI Labs is LeCun's effort to push this beyond research. They're targeting healthcare and robotics first, which makes sense given JEPA's strength in physical reasoning with limited data. But this is a long-horizon bet — their CEO has openly said commercial products could be years away.
---
2. Spatial Intelligence (3D World Models)
Representative: World Labs (@drfeifei)
Where JEPA asks "what will happen next," Fei-Fei Li's approach asks "what does the world look like in 3D, and how can I build it?"
The thesis is that true understanding requires explicit spatial structure — geometry, depth, persistence, and the ability to re-observe a scene from novel viewpoints — not just temporal prediction.
This is a different bet from JEPA: rather than learning abstract dynamics, you learn a structured 3D representation of the environment that you can manipulate directly.
Their product Marble generates persistent 3D environments from images, text, video, or 3D layouts. "Persistent" is the key word — unlike a video generation model that produces a linear sequence of frames, Marble's outputs are actual 3D scenes with spatial coherence. You can orbit the camera, edit objects, export meshes. This puts it closer to a 3D creation tool than to a predictive model, which is deliberate.
For context, this builds on a lineage of neural 3D representation work (NeRFs, 3D Gaussian Splatting) but pushes toward generation rather than reconstruction. Instead of capturing a real scene from multi-view photos, Marble synthesizes plausible new scenes from sparse inputs. The challenge is maintaining physical plausibility — consistent geometry, reasonable lighting, sensible occlusion — across a generated world that never existed.
---
3. Learned Simulation (Generative Video + Latent-Space RL)
Representatives: Google DeepMind (Genie 3, Dreamer V3/V4), Runway GWM-1
This category groups two lineages that are rapidly converging: generative video models that learn to simulate interactive worlds, and RL agents that learn world models to train policies in imagination.
The video generation lineage. DeepMind's Genie 3 is the purest version — text prompt in, navigable environment out, 24 fps at 720p, with interaction staying coherent for a few minutes. Rather than relying on an explicit hand-built simulator, it learns interactive dynamics from data. The key architectural property is autoregressive generation conditioned on user actions: each frame is generated based on all previous frames plus the current input (move left, look up, etc.). This means the model must maintain an implicit spatial memory — turn away from a tree and turn back, and it needs to still be there. DeepMind reports this spatial memory extending up to about a minute, which is impressive but still far from what you'd need for sustained agent training.
Runway's GWM-1 takes a similar foundation — autoregressive frame prediction built on Gen-4.5 — but splits into three products: Worlds, Robotics, and Avatars. The split into Worlds / Avatars / Robotics suggests the practical generality problem is still being decomposed by action space and use case.
The RL lineage. The Dreamer series has the longer intellectual history. The core idea is clean: learn a latent dynamics model from observations, then roll out imagined trajectories in latent space and optimize a policy via backpropagation through the model's predictions. The agent never needs to interact with the real environment during policy learning.
Dreamer V3 was the first AI to get diamonds in Minecraft without human data. Dreamer 4 did the same purely offline — no environment interaction at all. Architecturally, Dreamer 4 moves from Dreamer’s earlier recurrent-style lineage to a more scalable transformer-based world-model recipe, and introduced "shortcut forcing" — a training objective that lets the model jump from noisy to clean predictions in just 4 steps instead of the 64 typical in diffusion models. This is what makes real-time inference on a single H100 possible.
These two sub-lineages used to feel distinct: video generation produces visual environments, while RL world models produce trained policies.
But Dreamer 4 blurred the line — humans can now play inside its world model interactively, and Genie 3 is being used to train DeepMind's SIMA agents.
The convergence point is that both need the same thing: a model that can accurately simulate how actions affect environments over extended horizons.
The open question for this whole category is one LeCun keeps raising: does learning to generate pixels that look physically correct actually mean the model understands physics? Or is it pattern-matching appearance? Dreamer 4's ability to get diamonds in Minecraft from pure imagination is a strong empirical counterpoint, but it's also a game with discrete, learnable mechanics — the real world is messier.
---
4. Physical AI Infrastructure (Simulation Platform)
Representative: NVIDIA Cosmos
NVIDIA's play: don't build the world model; build the platform everyone else uses to build theirs.
Cosmos launched at CES January 2025 and covers the full stack — data curation pipeline (process 20M hours of video in 14 days on Blackwell, vs. 3+ years on CPU), a visual tokenizer with 8x better compression than prior SOTA, model training via NeMo, and deployment through NIM microservices.
The pre-trained world foundation models are trained on 9,000 trillion tokens from 20M hours of real-world video spanning driving, industrial, robotics, and human activity data.
They come in two architecture families: diffusion-based (operating on continuous latent tokens) and autoregressive transformer-based (next-token prediction on discretized tokens). Both can be fine-tuned for specific domains.
Three model families sit on top of this.
Predict generates future video states from text, image, or video inputs — essentially video forecasting that can be post-trained for specific robot or driving scenarios.
Transfer handles sim-to-real domain adaptation, which is one of the persistent headaches in physical AI — your model works great in simulation but breaks in the real world due to visual and dynamics gaps.
Reason (added at GTC 2025) brings chain-of-thought reasoning over physical scenes — spatiotemporal awareness, causal understanding of interactions, video Q&A.
---
5. Active Inference
Representative: VERSES AI (Karl Friston)
This is the outlier on the list — not from the deep learning tradition at all, but from computational neuroscience.
Karl Friston's Free Energy Principle says intelligent systems continuously generate predictions about their environment and act to minimize surprise (technically: variational free energy, an upper bound on surprise).
Where standard RL is usually framed around reward maximization, active inference frames behavior as minimizing variational / expected free energy, which blends goal-directed preferences with epistemic value. This leads to natural exploration behavior: the agent is drawn to situations where it's uncertain, because resolving uncertainty reduces free energy.
VERSES built AXIOM (Active eXpanding Inference with Object-centric Models) on this foundation.
The architecture is fundamentally different from neural network world models. Instead of learning a monolithic function approximator, AXIOM maintains a structured generative model where each entity in the environment is a discrete object with typed attributes and relations.
Inference is Bayesian — beliefs are probability distributions that get updated via message passing, not gradient descent. This makes it interpretable (you can inspect what the agent believes about each object), compositional (add a new object type without retraining), and extremely data-efficient.
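The contrast with gradient descent can be made concrete with a toy belief update: a discrete posterior over two hypotheses about an object, updated in closed form as observations arrive. This illustrates inference-as-belief-updating generally, not AXIOM's actual message-passing scheme; all names and numbers are invented:

```python
import numpy as np

def bayes_update(prior, likelihood):
    """Exact posterior over K discrete hypotheses: prior * p(obs | h), renormalized."""
    post = prior * likelihood
    return post / post.sum()

# Two hypotheses about one object: "static" vs "moving".
belief = np.array([0.5, 0.5])
# Invented observation model: p(obs | hypothesis).
lik = {"displaced": np.array([0.1, 0.8]),
       "still":     np.array([0.9, 0.2])}

for obs in ["displaced", "displaced", "still"]:
    belief = bayes_update(belief, lik[obs])
# belief now strongly favors "moving"; the update is closed-form, no gradients.
```

Because beliefs are explicit distributions, you can inspect them at any step, which is the interpretability property claimed above.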
In their robotics work, they've shown a hierarchical multi-agent setup where each joint of a robot arm is its own active inference agent. The joint-level agents handle local motor control while higher-level agents handle task planning, all coordinating through shared beliefs in a hierarchy. The whole system adapts in real time to unfamiliar environments without retraining — you move the target object and the agent re-plans immediately, because it's doing online inference, not executing a fixed policy.
They shipped a commercial product (Genius) in April 2025, and the AXIOM benchmarks against RL baselines are competitive on standard control tasks while using orders of magnitude less data.
---
imo, these five categories aren't really competing — they're solving different sub-problems.
JEPA compresses physical understanding.
Spatial intelligence reconstructs 3D structure.
Learned simulation trains agents through generated experience.
NVIDIA provides the picks and shovels.
Active inference offers a fundamentally different computational theory of intelligence.
My guess is the lines between them blur fast.
Source: https://nitter.net/zhuokaiz/status/2032201769053212682#m
If you're going to be at GTC this week... come say hi. We'll be there talking about open source agents!
🎬 Mon AM — GTC Live Pregame: "The Agentic AI Inflection Point" with @samrodriques @steipete @vincentweisser @saranormous @Alfred_Lin
🎤 Mon 4pm — "Open, Trusted, and Observable: Deploying AI Agents at Enterprise Scale" talk
🗣️ Wed 12:30pm — Panel moderated by Jensen Huang on "Open Models: Where We Are and Where We're Headed" alongside @AravSrinivas @MiraMurati @arthurmensch @michaeltruell and more
see ya there!

Source: https://nitter.net/hwchase17/status/2033310532984520969#m
RT by @hwchase17: Come pregame the NVIDIA GTC 2026 keynote with GTC Live—a front
Come pregame the NVIDIA GTC 2026 keynote with GTC Live—a front-row seat to what's next in AI. 🧵
👇 Here are the chapters unpacking the future of AI:
Source: https://nitter.net/NVIDIAAI/status/2032588957968375871#m
RT by @hwchase17: 💫 New LangChain Academy Course: Building Reliable Agents 💫
💫 New LangChain Academy Course: Building Reliable Agents 💫
Shipping agents to production is hard. Traditional software is deterministic – when something breaks, you check the logs and fix the code. But agents rely on non-deterministic models.
Add multi-step reasoning, tool use, and real user traffic, and building reliable agents becomes far more complex than traditional system design.
The goal of this course is to teach you how to take an agent from first run to production-ready system through iterative cycles of improvement.
You’ll learn how to do this with LangSmith, our agent engineering platform for observing, evaluating, and deploying agents.
Source: https://nitter.net/LangChain/status/2033269622103793917#m
R to @jerryjliu0: If you’re interested in the GLM-OCR technical report, it’s her
If you’re interested in the GLM-OCR technical report, it’s here: arxiv.org/pdf/2603.10910
We’re constantly benchmarking models of all sizes to push the frontiers on accuracy/cost for document parsing. If you have general document OCR needs, come check out LlamaParse: cloud.llamaindex.ai/?utm_sou…
Source: https://arxiv.org/pdf/2603.10910
Zhipu AI released the GLM-OCR technical report yesterday. A model that tops OmniDocBench V1.5 with a 94.62 score - with only 0.9B params!
I give them credit where credit is due: we are genuinely excited about any research that pushes the frontier of document parsing at sub-1B scale.
Between GLM-OCR, dots.ocr, PaddleOCR, and DeepSeek, small doc parsing models are getting really good really quickly 📈

David Hendrickson (@TeksEdge)
🚨 Want to parse complex PDFs with SOTA accuracy, 100% locally? 📄🔍
At just 0.9B parameters, you can drop GLM-OCR straight into LM Studio and run it on almost any machine! 🥔
🧠 0.9B total parameters
💾 Runs on < 1.5GB VRAM (or ~1GB quantized!)
💸 Zero API costs
🔒 Total data privacy
Desktop document AI is officially here. 💻⚡
Source: https://nitter.net/jerryjliu0/status/2033245944045801485#m
Related notes
- [[NVIDIA]]
- [[260315_reddit]] — similar keywords
- [[260314_arxiv]] — similar keywords
- [[260315_tg]] — similar keywords
- [[260315_hn]] — similar keywords
- [[260319_reddit]] — similar keywords
- [[260317_x]] — similar keywords
- [[260318_x]] — similar keywords
- [[260318_arxiv]] — similar keywords
- [[260317_arxiv]] — similar keywords
- [[260316_reddit]] — similar keywords