virtual-insanity

260318 arxiv (27 papers)

seedling aggregate 2026-03-18

260318 arxiv collection

Mixture-of-Depths Attention

Original abstract

Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-e

Source: https://arxiv.org/abs/2603.15619v1
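The mechanism as described (each head attends both to the current layer's sequence KV pairs and to KV pairs cached from preceding layers at the same position) can be sketched for a single query. This is a toy illustration under assumed shapes; `moda_attention` and its argument names are hypothetical, not the paper's API:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moda_attention(q, kv_current, kv_depth):
    """Toy single-head mixture-of-depths attention (illustrative only).

    q          : (d,) query for the current token
    kv_current : list of (key, value) pairs from the current layer's sequence
    kv_depth   : list of (key, value) pairs taken from preceding layers at the
                 same token position (the hypothesized "depth" axis)
    """
    pairs = kv_current + kv_depth
    ks = np.stack([k for k, _ in pairs])      # (n, d)
    vs = np.stack([v for _, v in pairs])      # (n, d)
    scores = ks @ q / np.sqrt(q.shape[0])     # scaled dot-product scores, (n,)
    weights = softmax(scores)                 # one softmax over both KV sources
    return weights @ vs                       # (d,) attended output
```

Note the single softmax over the concatenated sequence and depth entries, so the head allocates probability mass across layers as well as positions.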


Mechanistic Origin of Moral Indifference in Language Models

Original abstract

Existing behavioral alignment techniques for Large Language Models (LLMs) often neglect the discrepancy between surface compliance and internal unaligned representations, leaving LLMs vulnerable to long-tail risks. More crucially, we posit that LLMs possess an inherent state of moral indifference due to compressing distinct moral concepts into uniform probability distributions. We verify and remedy this indifference in LLMs' latent representations, utilizing 251k moral vectors constructed upon P

Source: https://arxiv.org/abs/2603.15615v1


Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning

Original abstract

Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion where the model produces trivial tests for easy rewards, yet black-box restriction yields generic tests that miss impleme

Source: https://arxiv.org/abs/2603.15611v1


SmartSearch: How Ranking Beats Structure for Conversational Memory Retrieval

Original abstract

Recent conversational memory systems invest heavily in LLM-based structuring at ingestion time and learned retrieval policies at query time. We show that neither is necessary. SmartSearch retrieves from raw, unstructured conversation history using a fully deterministic pipeline: NER-weighted substring matching for recall, rule-based entity discovery for multi-hop expansion, and a CrossEncoder+ColBERT rank fusion stage -- the only learned component -- running on CPU in ~650ms. Oracle analysis on

Source: https://arxiv.org/abs/2603.15599v1
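The recall stage (NER-weighted substring matching) and the final rank-fusion stage can be sketched as below. The entity weights and the convex score fusion are stand-ins for details the excerpt does not give; both function names are hypothetical:

```python
from typing import Dict, List

def substring_score(query_entities: Dict[str, float], doc: str) -> float:
    """NER-weighted substring recall: each recognised query entity that appears
    verbatim in the document contributes its weight. The weighting scheme here
    is an assumption standing in for the paper's NER weighting."""
    doc_lower = doc.lower()
    return sum(w for ent, w in query_entities.items() if ent.lower() in doc_lower)

def fuse_ranks(scores_a: List[float], scores_b: List[float], alpha: float = 0.5) -> List[float]:
    """Convex fusion of two min-max-normalised score lists, standing in for
    the CrossEncoder+ColBERT rank-fusion stage described in the abstract."""
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    return [alpha * a + (1 - alpha) * b for a, b in zip(norm(scores_a), norm(scores_b))]
```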


Emotion is Not Just a Label: Latent Emotional Factors in LLM Processing

Original abstract

Large language models are routinely deployed on text that varies widely in emotional tone, yet their reasoning behavior is typically evaluated without accounting for emotion as a source of representational variation. Prior work has largely treated emotion as a prediction target, for example in sentiment analysis or emotion classification. In contrast, we study emotion as a latent factor that shapes how models attend to and reason over text. We analyze how emotional tone systematically alters att

Source: https://arxiv.org/abs/2603.09205v2


OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training

Original abstract

Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the progress of the broader research community in developing and innovating within this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search age

Source: https://arxiv.org/abs/2603.15594v1


SemBench: A Benchmark for Semantic Query Processing Engines

Original abstract

We present a benchmark targeting a novel class of systems: semantic query processing engines. Those systems rely inherently on generative and reasoning capabilities of state-of-the-art large language models (LLMs). They extend SQL with semantic operators, configured by natural language instructions, that are evaluated via LLMs and enable users to perform various operations on multimodal data. Our benchmark introduces diversity across three key dimensions: scenarios, modalities, and operators.

Source: https://arxiv.org/abs/2511.01716v2
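A semantic operator in this style is a relational operator whose predicate is delegated to an LLM via a natural-language instruction. A toy `sem_filter` (name and interface assumed, not SemBench's API) illustrates the shape, with the model call abstracted behind a callable:

```python
from typing import Callable, Iterable, List

def sem_filter(rows: Iterable[dict], instruction: str,
               llm: Callable[[str], bool]) -> List[dict]:
    """Toy semantic-filter operator: keep rows an LLM judges to satisfy the
    natural-language instruction. `llm` is a stand-in callable; a real engine
    would dispatch the rendered prompt to an actual model."""
    return [row for row in rows if llm(f"{instruction}\nRow: {row}")]
```

A usage sketch with a keyword stub in place of a model:

```python
rows = [{"review": "great battery life"}, {"review": "screen cracked on day one"}]
positives = sem_filter(rows, "Is this review positive?", lambda prompt: "great" in prompt)
```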


Effective Distillation to Hybrid xLSTM Architectures

Original abstract

There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set out the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher on sets of tasks. To this end, we introduce an effective distillation

Source: https://arxiv.org/abs/2603.15590v1
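A plausible reading of "tolerance-corrected Win-and-Tie rate" is the fraction of tasks on which the student beats the teacher or ties with it within an absolute tolerance. The formula below is that reading, not necessarily the paper's exact definition:

```python
def win_and_tie_rate(student_scores, teacher_scores, tol=0.01):
    """Fraction of tasks where the student's score is at least the teacher's
    minus `tol` (a win, or a tie within tolerance). This is a guess at the
    abstract's 'tolerance-corrected Win-and-Tie rate', not a confirmed formula."""
    assert len(student_scores) == len(teacher_scores)
    hits = sum(1 for s, t in zip(student_scores, teacher_scores) if s >= t - tol)
    return hits / len(student_scores)
```

Under this reading, "lossless" distillation would mean the rate stays at (or near) 1.0 across the task sets.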


NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks

Original abstract

We introduce NerVE, a unified eigenspectral framework for understanding how feed-forward networks (FFNs) in large language models (LLMs) organize and regulate information flow in high-dimensional latent space. Despite FFNs dominating the parameter budget, their high-dimensional dynamics remain poorly understood. NerVE addresses this gap through lightweight, memory-efficient tracking of eigenspectrum dynamics via four complementary metrics: Spectral Entropy (dispersion), Participation Ratio (effe

Source: https://arxiv.org/abs/2603.06922v2
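The first two metrics named in the excerpt, Spectral Entropy and Participation Ratio, have standard definitions over an eigenvalue spectrum, sketched here. Whether NerVE uses exactly these normalizations is an assumption:

```python
import numpy as np

def spectral_entropy(eigvals):
    """Shannon entropy of the normalised (absolute) eigenvalue spectrum.
    High entropy = dispersed spectrum; zero = all mass on one direction."""
    p = np.abs(eigvals) / np.abs(eigvals).sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def participation_ratio(eigvals):
    """Effective number of active eigen-directions: (sum lam)^2 / sum(lam^2).
    Equals n for a uniform spectrum of length n, and 1 for a rank-one spectrum."""
    lam = np.abs(eigvals)
    return float(lam.sum() ** 2 / (lam ** 2).sum())
```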


Mamba-3: Improved Sequence Modeling using State Space Principles

Original abstract

Scaling inference-time compute has emerged as an important driver of LLM performance, making inference efficiency a central focus of model design alongside model quality. While the current Transformer-based models deliver strong model quality, their quadratic compute and linear memory make inference expensive. This has spurred the development of sub-quadratic models with reduced linear compute and constant memory requirements. However, many recent linear models trade off model quality and capabi

Source: https://arxiv.org/abs/2603.15569v1


Computational Concept of the Psyche

Original abstract

This article presents an overview of approaches to modeling the human psyche in the context of constructing an artificial one. Based on this overview, a concept of cognitive architecture is proposed, in which the psyche is viewed as the operating system of a living or artificial subject, comprising a space of states, including the state of needs that determine the meaning of a subject's being in relation to stimuli from the external world, and intelligence as a decision-making system regarding a

Source: https://arxiv.org/abs/2603.15586v1


Lore: Repurposing Git Commit Messages as a Structured Knowledge Protocol for AI

Original abstract

As AI coding agents become both primary producers and consumers of source code, the software industry faces an accelerating loss of institutional knowledge. Each commit captures a code diff but discards the reasoning behind it - the constraints, rejected alternatives, and forward-looking context that shaped the decision. I term this discarded reasoning the Decision Shadow. This paper proposes Lore, a lightweight protocol that restructures commit messages - using native git trailers - into self-c

Source: https://arxiv.org/abs/2603.15566v1
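Git trailers are `Key: value` lines in the final paragraph of a commit message, so a minimal reader for a Lore-style commit could look like this. The trailer keys in the test and docstring are invented examples, not the Lore specification:

```python
def parse_trailers(commit_message: str) -> dict:
    """Parse git-style trailers: 'Key: value' lines in the message's final
    paragraph. Trailer names such as 'Decision' or 'Rejected-Alternative' are
    hypothetical illustrations, not keys defined by the Lore protocol."""
    paragraphs = commit_message.strip().split("\n\n")
    trailers = {}
    for line in paragraphs[-1].splitlines():
        if ": " in line:
            key, value = line.split(": ", 1)
            trailers[key.strip()] = value.strip()
    return trailers
```

This sketch skips edge cases that `git interpret-trailers` handles (multi-line values, configurable separators), but shows why trailers are a natural carrier for machine-readable decision context: they survive in history and need no new storage.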


The PokeAgent Challenge: Competitive and Long-Context Learning at Scale

Original abstract

We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategi

Source: https://arxiv.org/abs/2603.15563v1


InterveneBench: Benchmarking LLMs for Intervention Reasoning and Causal Study Design

Original abstract

Causal inference in social science relies on end-to-end, intervention-centered research-design reasoning grounded in real-world policy interventions, but current benchmarks fail to evaluate this capability of large language models (LLMs). We present InterveneBench, a benchmark designed to assess such reasoning in realistic social settings. Each instance in InterveneBench is derived from an empirical social science study and requires models to reason about policy interventions and identification

Source: https://arxiv.org/abs/2603.15542v1


Are Dilemmas and Conflicts in LLM Alignment Solvable? A View from Priority Graph

Original abstract

As Large Language Models (LLMs) become more powerful and autonomous, they increasingly face conflicts and dilemmas in many scenarios. We first summarize and taxonomize these diverse conflicts. Then, we model the LLM's preferences to make different choices as a priority graph, where instructions and values are nodes, and the edges represent context-specific priorities determined by the model's output distribution. This graph reveals that a unified stable LLM alignment is very challenging, because

Source: https://arxiv.org/abs/2603.15527v1
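One concrete way such a graph can reveal unsolvable alignment is a priority cycle (A is preferred over B, B over C, C over A), which plain depth-first search detects. The dict encoding below is an assumption for illustration, not the paper's construction:

```python
def find_cycle(priority_edges):
    """Return one priority cycle as a list of nodes, or None if acyclic.
    priority_edges: dict mapping each node (an instruction or value) to the
    set of nodes it takes priority over (assumed encoding)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in priority_edges}

    def dfs(node, path):
        color[node] = GRAY                      # on the current DFS path
        for nxt in priority_edges.get(node, ()):
            if color.get(nxt, WHITE) == GRAY:   # back-edge: cycle closed
                return path + [nxt]
            if color.get(nxt, WHITE) == WHITE:
                found = dfs(nxt, path + [nxt])
                if found:
                    return found
        color[node] = BLACK                     # fully explored
        return None

    for n in priority_edges:
        if color[n] == WHITE:
            cycle = dfs(n, [n])
            if cycle:
                return cycle
    return None
```

A cycle means no single total ordering of the nodes can satisfy every context-specific priority at once, which is one reading of why a unified stable alignment is hard.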


Beyond the Covariance Trap: Unlocking Generalization in Same-Subject Knowledge Editing

Original abstract

While locate-then-edit knowledge editing efficiently updates knowledge encoded within Large Language Models (LLMs), a critical generalization failure mode emerges in the practical same-subject knowledge editing scenario: models fail to recall the updated knowledge when following user instructions, despite successfully recalling it in the original edited form. This paper identifies the geometric root of this generalization collapse as a fundamental conflict where the inner activation drifts induc

Source: https://arxiv.org/abs/2603.15518v1


Agentic workflow enables the recovery of critical materials from complex feedstocks

Original abstract

We present a multi-agentic workflow for critical materials recovery that deploys a series of AI agents and automated instruments to recover critical materials from produced water and magnet leachates. This approach achieves selective precipitation from real-world feedstocks using simple chemicals, accelerating the development of efficient, adaptable, and scalable separations to a timeline of days, rather than months and years.

Source: https://arxiv.org/abs/2603.15491v1


Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis

Original abstract

Agent applications are increasingly adopted to automate workflows across diverse tasks. However, due to the heterogeneous domains they operate in, it is challenging to create a scalable evaluation framework. Prior works each employ their own methods to determine task success, such as database lookups, regex match, etc., adding complexity to the development of a unified agent evaluation approach. Moreover, they do not systematically account for the user's role or expertise in the interaction, pr

Source: https://arxiv.org/abs/2603.15483v1


Imagine-then-Plan: Agent Learning from Adaptive Lookahead with World Models

Original abstract

Recent advances in world models have shown promise for modeling future dynamics of environmental states, enabling agents to reason and act without accessing real environments. Current methods mainly perform single-step or fixed-horizon rollouts, leaving their potential for complex task planning under-exploited. We propose Imagine-then-Plan (ITP), a unified framework for agent learning via lookahead imagination, where an agent's policy model interacts with the learned world model, yieldi

Source: https://arxiv.org/abs/2601.08955v2
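The core loop (a policy acting inside a learned world model instead of the real environment) can be sketched as a rollout; ITP's adaptive-horizon logic is beyond what the excerpt shows, so a fixed horizon stands in for it. All interfaces below are assumptions:

```python
def imagine_rollout(state, policy, world_model, horizon):
    """Lookahead imagination sketch: roll the policy inside a learned world
    model for `horizon` steps, never touching the real environment.
    policy(state) -> action; world_model(state, action) -> (next_state, reward).
    These signatures are assumed for illustration, not ITP's API."""
    trajectory = []
    for _ in range(horizon):
        action = policy(state)
        state, reward = world_model(state, action)
        trajectory.append((action, reward))
    return trajectory, state
```

The imagined trajectory can then feed planning or policy learning, which is where an adaptive scheme would decide how deep each rollout should go.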


Robust and Computationally Efficient Linear Contextual Bandits under Adversarial Corruption and Heavy-Tailed Noise

Original abstract

We study linear contextual bandits under adversarial corruption and heavy-tailed noise with finite $(1+\epsilon)$-th moments for some $\epsilon \in (0,1]$. Existing work that addresses both adversarial corruption and heavy-tailed noise relies on a finite variance (i.e., finite second-moment) assumption and suffers from computational inefficiency. We propose a computationally efficient algorithm based on online mirror descent that achieves robustness to both adversarial corruption and heavy-tailed noise. While

Source: https://arxiv.org/abs/2603.15596v1
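A standard device in this setting is to truncate rewards at a threshold that grows with time, calibrated to the finite $(1+\epsilon)$-th moment assumption. The sketch below shows that general technique only; the paper's specific estimator and threshold are not given in the excerpt:

```python
import numpy as np

def truncate_reward(r, t, eps):
    """Clip a possibly heavy-tailed reward observed at round t to the range
    [-b, b] with b = t**(1/(1+eps)), the standard truncation level when only
    the (1+eps)-th moment is finite. Illustrative, not the paper's estimator."""
    b = t ** (1.0 / (1.0 + eps))
    return float(np.clip(r, -b, b))
```

Intuitively, smaller `eps` (heavier tails) forces a faster-growing threshold, trading a little bias for bounded variance in the clipped rewards.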


Do Metrics for Counterfactual Explanations Align with User Perception?

Original abstract

Explainability is widely regarded as essential for trustworthy artificial intelligence systems. However, the metrics commonly used to evaluate counterfactual explanations are algorithmic evaluation metrics that are rarely validated against human judgments of explanation quality. This raises the question of whether such metrics meaningfully reflect user perceptions. We address this question through an empirical study that directly compares algorithmic evaluation metrics with human judgments acros

Source: https://arxiv.org/abs/2603.15607v1


Optimizing Task Completion Time Updates Using POMDPs

Original abstract

Managing announced task completion times is a fundamental control problem in project management. While extensive research exists on estimating task durations and task scheduling, the problem of when and how to update completion times communicated to stakeholders remains understudied. Organizations must balance announcement accuracy against the costs of frequent timeline updates, which can erode stakeholder trust and trigger costly replanning. Despite the prevalence of this problem, current appro

Source: https://arxiv.org/abs/2603.12340v2


From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning

Original abstract

Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We levera

Source: https://arxiv.org/abs/2603.15600v1


AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer

Original abstract

Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as conflating acoustically distinct sounds under coarse labels, and textual ambiguity in describing micro-acoustic features. These bottlenecks make it difficult to perform fine-grained sound synthesis using text-controlled modes. To address these limitations, we propose AC-

Source: https://arxiv.org/abs/2603.15597v1


HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification

Original abstract

Can AI make progress on important, unsolved mathematical problems? Large language models are now capable of sophisticated mathematical and scientific reasoning, but whether they can perform novel research is still widely debated and underexplored. We introduce HorizonMath, a benchmark of over 100 predominantly unsolved problems spanning 8 domains in computational and applied mathematics, paired with an open-source evaluation framework for automated verification. Our benchmark targets a class of

Source: https://arxiv.org/abs/2603.15617v1


Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code

Original abstract

Magnetic resonance imaging (MRI) is a powerful and versatile imaging technique, offering a wide spectrum of information about the anatomy by employing different acquisition modalities. However, in the clinical workflow, it is impractical to collect all relevant modalities due to the scan time and cost constraints. Virtual full-stack scanning aims to impute missing MRI modalities from available but incomplete acquisitions, offering a cost-efficient solution to enhance data completeness and clinic

Source: https://arxiv.org/abs/2501.18328v3


Understanding Reasoning in LLMs through Strategic Information Allocation under Uncertainty

Original abstract

LLMs often exhibit Aha moments during reasoning, such as apparent self-correction following tokens like "Wait," yet their underlying mechanisms remain unclear. We introduce an information-theoretic framework that decomposes reasoning into procedural information and epistemic verbalization - the explicit externalization of uncertainty that supports downstream control actions. We show that purely procedural reasoning can become informationally stagnant, whereas epistemic verbalization enables cont

Source: https://arxiv.org/abs/2603.15500v1


Related notes

  • [[260324_arxiv]]
  • [[260314_arxiv]] — similar keywords
  • [[260317_arxiv]] — similar keywords
  • [[260316_x]] — similar keywords