research-notes

Research Papers List 2026 (active)

#ai-safety

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs : alphaxiv
Frontier Models are Capable of In-context Scheming : alphaxiv
Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors : alphaxiv
Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents : alphaxiv
Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios : alphaxiv
Our evaluation of Claude Mythos Preview’s cyber capabilities | AISI Work : AISI Blog
Tell me about yourself: LLMs are aware of their learned behavior : slides
Steering Evaluation-Aware Language Models to Act Like They Are Deployed : arxiv
When can we trust untrusted monitoring? A safety case sketch across collusion strategies : alphaxiv
AI Control: Improving Safety Despite Intentional Subversion : alphaxiv
SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks : alphaxiv
Large Language Model Reasoning Failures : alphaxiv
Training large language models on narrow tasks can lead to broad misalignment : Nature slide
Toward a Science of AI Agent Reliability : alphaxiv
Continuation of Measuring AI Ability to Complete Long Tasks : alphaxiv (Related to "Measuring AI Ability to Complete Long Software Tasks")
Measuring AI Ability to Complete Long Tasks : METR Blog
Against the METR graph : slides
RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts : alphaxiv
HCAST: Human-Calibrated Autonomy Software Tasks : alphaxiv
Meta-RL Induces Exploration in Language Agents : alphaxiv
Scaling Up Active Testing to Large Language Models : alphaxiv
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities : alphaxiv
RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents : alphaxiv
UK AISI Align Evaluation Case-Study : AISI Reports

Reading paper notes

Current paper reading list

2026

Poolside Laguna M.1/XS.2 Technical Report
Memory in the Age of AI Agents
Less is More: Tiny Recursive Networks
Opus 4.7 = 4T params? Incompressible Knowledge Probes
DeepSeek V4 Pro/Flash
Self-Distilled RLVR paper
Agents of Chaos paper
MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
Composer 2 + Claude Code Leaks
Moonshot Attention Residuals + Cursor Composer 2
Karpathy Autoresearch, ShinkaEvolve, Text-to-LoRA
Moltbook analysis + Persona Selection
Discovering Multiagent Learning Algorithms with Large Language Models
Midtraining Bridges Pretraining and Posttraining Distributions
Rubric Based RL survey + Alec Radford Generative Meta-Model
LLaDA 2.1 + RL via Self-Distillation
Kimi K2.5 Tech Report + Alec Radford on Data Filtering
Recursive Language Models, Meta Confucius Code Agent
Anthropic's Assistant Axis: situating and stabilizing the character of large language models
LTX-2: Efficient Joint Audio-Visual Foundation Model
Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits
Show-o2: Improved Native Unified Multimodal Models

2025

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
NVIDIA Nemotron 3 + 3 Nano
PLUM: Adapting Pre-trained Language Models for Industrial-scale Generative Recommendations: Semantic IDs v2
DeepSeek v3.2, DeepSeekMath v2
RF-DETR: Realtime Neural Arch Search for Realtime Detection Transformers
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Kimi K2 Thinking + Kimi Linear + Q&A
Scaling LLMs for Next-Generation Single-Cell Analysis
Real-Time Detection of Hallucinated Entities in Long-Form Generation
Thinky LoRA + DeepSeek OCR
RLVR paper roundup!
Training scientific reasoning LLMs with biological world models as soft verifiers
Veo 3 + DeepSeek 3.2
Meta SuperIntelligence papers: Bootstrapping, CaTARE
InternVL3.5 & Vision in GPT-OSS
How much do Language Models Memorize?
Hierarchical Reasoning Models
RecSys with Generative Retrieval (RQ-VAE)
GLM 4.5: Agentic Coding, Reasoning Foundation Model
Qwen Image: SOTA text rendering + 4o-imagegen-level Editing Open Weights MMDiT
GPT-OSS: OpenAI's Open AI
Anthropic Fellows: Subliminal Learning & Inverse Scaling
Kimi K2 Tech Report
Muon, MuonClip, Kimi K2 from Moonshot AI
Magistral Reasoning
Evals for long-context Q&A + Ernie 4.5 Technical Report
Reflect, Retry, Reward: Self-Improving LLMs
Claude 4 + Self Adapting Language Models
Apple: The Illusion of Thinking + WWDC25 Foundation Models
Gemini Diffusion & Diffusion Models Survey
RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
Llama 1/2/3/4 by Hand
Phi-4 Reasoning
Survey: Long Context, Leaderboard Illusion, Reasoning Economy
Advances and Challenges in Foundation Agents
Anthropic: Tracing the thoughts of an LLM
Autoregressive Image Generation Survey
RecSys and search in the age of LLMs

#agents #ai #ai-safety #discussions #eval #hallucination #interpretability #llm #llm-challenges #research