Subramanyam Sahoo

Independent AI Safety Researcher

Cuttack, Odisha, India  ·  sahoo2vec@gmail.com

AI Safety Researcher specializing in alignment science and governance, with 2.5+ years of academic research experience and a proven publication record. I work on building AI systems that remain reliably aligned under adversarial conditions — through mechanistic interpretability, adversarial self-play, and governance frameworks that address institutional failure, not just model failure. NIT Hamirpur gold medalist. MARS 4.0 fellow, Cambridge AI Safety Hub.

20+ Publications · 7 Hackathons · $16.5K Research Funding · Gold Medal, NIT Hamirpur
Fellowships & Positions
Apr 2026
CORDA Democracy Fellowship — Open Democracy Institute
Ongoing research on Integrity Disclosures for Generative AI in Democratic Information Environments
Feb 2025–
AI Policy Fellow (Remote) — UC Berkeley (BASIS Fellowship)
Conducting AI governance research for the Berkeley AI Safety Initiative
Dec 2025–Feb 2026
MARS 4.0 — Cambridge AI Safety Hub
Mentorship for Alignment Research Students · Submitted one paper on RL agents to RLC 2026
Aug–Dec 2025
E-SOAR — EleutherAI
Summer of Open AI Research · Prompt Optimization for Verifiable Hallucination Reduction
Apr 2025–
Independent Contractor — Outlier AI
Designing synthetic datasets for RL-style post-training and evaluation under controlled task distributions
Summer 2025
Harvard Technical AI Safety & Harvard AI Policy Fellowships
Dual fellowships awarded for Summer 2025
Jul–Oct 2025
Mentor (Remote) — Paragon Policy Fellowship
AI Policy and Technical AI Governance (TAIG) research
Research Funding
AIM Intelligence AI Safety Compute Grant
South Korea · Apr–Jun 2026 · PI · Long-horizon stability & regression risk
USD 10,000
Martian — Research Grant
Nov 2025–Feb 2026 · PI · Mechanistic interpretability, model analysis, and dissemination
USD 6,000
Apart Research — Research Grant
Oct 2025 · Pilot experiments and preliminary analyses
USD 500
Recent Highlights
  • ICLR 2026 — AI for Peace Workshop — Oral Presentation · Dial E for Ethical Enforcement
  • Accepted: AIM Intelligence AI Safety Compute Grant, South Korea — PI (USD 10,000)
  • Harvard Technical AI Safety & Harvard AI Policy Fellowships — awarded Summer 2025
  • Apart Lab Studio Internship — accepted following Martian Mechanistic Interpretability Hackathon project
  • CBRN AI Risk Research Sprint — 3rd Prize · Molecules Under Watch
  • Featured in Bloomberg
  • Intelligence Symbiosis Manifesto — signatory
  • Y Combinator Startup School 2026 — accepted, Bangalore, India (April 18, 2026)
Under Review — 2026
ICML 2026
Subramanyam Sahoo
Feb 2026
Accepted — ICLR 2026
ICLR 2026
Subramanyam Sahoo
AI for Peace Workshop · Feb 2026
ICLR 2026
Subramanyam Sahoo
Post-AGI Science and Society Workshop · Feb 2026
ICLR 2026
Subramanyam Sahoo
Agents in the Wild: Safety, Security, and Beyond · Feb 2026
ICLR 2026
Subramanyam Sahoo
AI with Recursive Self-Improvement Workshop · Feb 2026
ICLR 2026
Subramanyam Sahoo
I Can't Believe It's Not Better Workshop · Feb 2026
ICLR 2026
Subramanyam Sahoo
Latent & Implicit Thinking Workshop · Feb 2026
Accepted — AAAI & NeurIPS 2025–2026
AAAI 2026
Subramanyam Sahoo
Logical and Symbolic Reasoning in Language Models · Nov 2025
NeurIPS 2025
Subramanyam Sahoo
Embodied and Safe-Assured Robotic Systems Workshop · Nov 2025
NeurIPS 2025
Subramanyam Sahoo
Socially Responsible and Trustworthy Foundation Models · Nov 2025
NeurIPS 2025
Subramanyam Sahoo
Algorithmic Collective Action Workshop · Sep 2025
NeurIPS 2025
Subramanyam Sahoo
ARLET Workshop · Sep 2025
ICVGIP 2025
Subramanyam Sahoo
Indian Conference on Computer Vision, Graphics and Image Processing · Oct 2025
Thesis
M.Tech 2024
NIT Hamirpur · Advised by Dr. Kamlesh Dutta
XAI and TreeSHAP · O(n²) → O(n log n) · EEG datasets · Computational Neuroscience
B.Tech 2020
Parala Maharaja Engineering College · Advised by Dr. Debasis Mohapatra
KNN, Decision Trees, SVMs · Competitive accuracy
Open-Source Projects
GitHub
Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement
Python · PyTorch · FOMAML meta-learning · Differentiable mortality operator · EWC baselines
Framework
AdverSplay-GRPO
Adversarial self-play framework for sycophancy reduction in LLMs
Dual LoRA adapters · Frozen Qwen3-32B base · Group Relative Policy Optimization · Lambda Labs GH200 / A100 · 89.3% sycophancy reduction
HuggingFace
Model checkpoints, datasets, and evaluation artifacts
Alignment experiments · LoRA adapters · Reproducible benchmarks
Apart Research
Geometric Fingerprints of Deceptive Alignment in Code Language Models
AI Control Hackathon 2026 · Deceptive alignment detection · Backdoored LLMs
Apart Research
Task-specific vulnerabilities and exploitable failure modes
Apart x Martian Hackathon · Led to Apart Lab Studio internship acceptance
Technologies
  • Programming & ML: Python, JAX, PyTorch, NumPy, Pandas, scikit-learn, OpenAI Gym
  • Systems & Tools: CUDA, Docker, Git, LaTeX, VS Code
  • Compute: Lambda Labs GH200 and A100, Modal cloud compute
  • Models: Qwen3-32B, Qwen3-14B, GPT-2 Large, Llama-2-13B, RoBERTa
Research Hackathons & Competitions
Mar 2026
AI Control Hackathon 2026 — Apart Research
Remote · Led to Apart Lab Studio internship acceptance
Feb 2026
The Technical AI Governance Challenge — Apart Research
Remote
Jan 2026
AI Manipulation Hackathon — Apart Research
Remote
Nov 2025
Defensive Acceleration Hackathon — Apart Research
Remote · Honeypots, Sparse Autoencoders, Adversarial Probes
Nov 2025
The AI Forecasting Hackathon — Apart Research
Remote
Sep 2025
CBRN AI Risks Research Sprint — Apart Research
Remote · Biosecurity · Multi-modal AI
Jun 2025
Apart x Martian Mechanistic Router Interpretability Hackathon
Remote · Led to USD 6,000 Martian research grant and Apart Lab Studio internship
Invited Talks & Presentations
Feb 2026
Dial E for Ethical Enforcement: Institutional Veto Power as a Governance Primitive
Oct 2024
Odisha AI Conference 2024 — Invited Speaker on AI Safety and Alignment
Virtual event hosted in the USA · 5 October 2024
2024
ACM India Summer School — Interpretable AI, IIT Madras
Led group project on mechanistic interpretability · Responsible and Safe AI track · Selected from 2,000+ applicants
2024
Live Q&A with Prof. David Krueger — University of Cambridge
Asked two questions on AI safety during the public session
2024
Live Q&A with Dr. Sudarsan Padmanabhan — IIT Madras
Asked a question on AI alignment and governance
2023
Orientation for B.Tech AI/ML Students — NIT Kurukshetra
Invited talk on AI safety and research directions
2023
Seminar on Recent Trends in AI — Dept. of CSE, NIT Hamirpur
LLM advances and safety implications
Collaborators
Sussex
Prof. Fernando Rosas — University of Sussex
Belief geometry in deep RL · MARS 4.0, Cambridge AI Safety Hub · RLC 2026 joint submission
Industry
Amirali Abdullah — Thoughtworks
Mechanistic interpretability · Specification gaming · ACL 2026 joint submission
EleutherAI
E-SOAR Research Mentee — EleutherAI
Open-source alignment research · Aug–Dec 2025
Coalition
EVALEVAL Coalition
Invited by Irene Solaiman (Hugging Face) and Anka Reuel (Stanford) · Science of evaluations
Apart
Accepted following Martian Mechanistic Interpretability Hackathon project
Camp
AI Safety Camp 2026 — 11th Edition
AI Control track · Jan–Apr 2026 · Invited by Justin Shenk
Open to Collaboration

Seeking research contractor roles and collaborations in:

LLM Post-training · Reinforcement Learning · Mechanistic Interpretability · AI Control · Adversarial Robustness · Cooperative AI · Science of Evaluations · AI Policy & Governance · Autonomous AI Agents
Teaching Assistant & Lab Supervisor — NIT Hamirpur (Aug 2022 – Jul 2024)
Sem 4
CS-661 Deep Learning — Teaching Assistant & Project Supervisor
Dual Degree CS · 4th Year · 8th Semester
Sem 4
CS-664 Deep Learning & Data Analytics Lab
Lab management and supervision
Sem 4
CS-326 Computer Networks Lab — Teaching & Lab Assistant
B.Tech CS · 3rd Year · 6th Semester
Sem 4
CS-429 Major Project Stage 2 — Assistant Supervisor
B.Tech CS · 4th Year · Guided 10+ students
Sem 3
CS-652 Machine Learning — Teaching Assistant & Project Supervisor
Dual Degree CS · 4th Year · 7th Semester
Sem 3
CS-651 Artificial Intelligence — Curriculum & Assessment Development
Curriculum content and assessment materials
Sem 3
CS-315 Database Management Systems Lab — Teaching & Lab Assistant
B.Tech CS · 3rd Year · 5th Semester
Sem 3
CS-419 Major Project Stage 1 — Assistant Supervisor
B.Tech CS · 4th Year
Sem 2
CS-101 Computer Programming — Teaching Assistant
B.Tech EE · 1st Year · 2nd Semester
Sem 1
CS-102 Computer Programming Lab — Teaching & Lab Assistant
B.Tech CS · 1st Year · 1st Semester
2023
Organized AMRIT-2023 and MINDS-2023 Conferences — NIT Hamirpur
Core organizing committee · Mentored students on project planning and career development
Certificates
Mar–May 2026
Cooperative AI Fundamentals — Cooperative AI Foundation
Jan–Feb 2026
Technical AI Safety Fundamentals — BlueDot Impact
Nov–Dec 2025
Biosecurity Fundamentals — BlueDot Impact
May–Sep 2025
AI Agents and Law — Vista Institute for AI Policy
Jan–May 2025
Advanced Large Language Model Agents (MOOC) — UC Berkeley, Google DeepMind
Feb–May 2025
AI Safety, Ethics, and Society — Center for AI Safety (CAIS)
Peer Review
  • ACL 2026 — EvalEval Workshop — Reviewer
  • ACL 2026 — TrustNLP Workshop — Reviewer
  • ICLR 2026 — AIWILD Workshop — Reviewer
  • ICLR 2026 — P-AGI Workshop — Reviewer
  • ICLR 2026 — RSI Workshop — Reviewer
  • ICLR 2026 — SPOT Workshop — Reviewer
  • ICLR 2026 — Sci4DL Workshop — Reviewer
  • ICML 2026 — EIML Workshop — Reviewer
  • ICML 2026 — TAIGR Workshop — Reviewer
  • COLM 2026 Conference — Reviewer
  • EVALEVAL Coalition — Science of Evaluations — Active Member · Invited by Irene Solaiman (Hugging Face) & Anka Reuel (Stanford)
Volunteer
Jan–Apr 2026
AI Safety Camp — 11th Edition
AI Control track
Jul–Oct 2025
Mentor — Paragon AI Policy Fellowship
AI Policy and Technical AI Governance
May–Jun 2025
Mentor — LatinX AI Club, 2025 Edition
Academic Achievements
  • Gold Medalist — M.Tech CSE (AI), NIT Hamirpur (Oct 2024) · Summa Cum Laude · Batch Topper · CGPA 9.38/10
  • B.Tech with Honours — Computer Science and Engineering · Ranked Top 5 · CGPA 8.67/10
  • Harvard Technical AI Safety Fellowship — awarded Summer 2025
  • Harvard AI Policy Fellowship — awarded Summer 2025
  • Berkeley AI Safety Initiative (BASIS) Fellowship — awarded for AI Governance research
  • ICLR 2026 — Oral Presentation, AI for Peace Workshop
  • CBRN AI Risk Research Sprint — 3rd Prize, Apart Research (Sep 2025)
  • Apart Lab Studio Internship — accepted, following Martian Hackathon project
  • Y Combinator Startup School 2026 — accepted, Bangalore, India (April 18, 2026)
  • Full fee waiver — "Harms and Risks of AI in the Military" workshop, Mila - Quebec AI Institute, Montreal, Canada (2024)
  • Climate Change AI Summer School — Mila, Quebec (2024)
  • ACM India Summer School 2024 — "Responsible and Safe AI" at IIT Madras, selected from 2,000+ applicants
  • ACM India Summer School 2024 — "Generative AI for Text" at IIT Gandhinagar, selected from 1,700+ applicants
Notable Interactions
  • Discussed "Weak to Strong Generalization" with Stephen Casper — Algorithmic Alignment Group, MIT EECS (IIT Madras)
  • Live Q&A with Prof. David Krueger, University of Cambridge — asked two questions on AI safety
  • Live Q&A with Prof. Mausam, IIT Delhi — IIT Gandhinagar
  • Live Q&A with Dr. Sudarsan Padmanabhan — IIT Madras
Languages & Interests
  • English — Full professional proficiency
  • Odia — Native proficiency
  • Hindi — Limited working proficiency
  • Sanskrit — Limited working proficiency
  • Research interests: Large Language Models, AI Governance, AI Safety & Alignment
  • Hobby: Listening to critically acclaimed podcasts