Stanford CS336 Establishes Guidelines for Building Safe AI Agents

As autonomous AI agents move from research prototypes to production systems, Stanford University’s CS336 course has emerged as an unexpected authority on agent safety. Originally designed to teach students how to build language models from scratch, the course’s AI Agent Guidelines have become a reference point for how to interact with AI coding assistants safely and effectively.

The guidelines, published openly on GitHub, address a critical gap in the AI ecosystem: as AI agents become more capable, we need systematic approaches to ensure they remain helpful tools rather than autonomous solution generators. The document distills hard-won lessons from Stanford’s implementation-heavy curriculum into principles that apply whether you’re building agents, using them, or governing their deployment.

TL;DR: Stanford’s CS336 course defines 7 principles for safe AI agent interactions: modular architecture, 3-layer security (sandboxing, rate limiting, human-in-the-loop), evaluation benchmarks, tool governance, technical debt prevention, mandatory logging, and cost optimization. These guidelines, battle-tested by hundreds of students, apply equally to production systems — complementing tools like Microsoft’s RAMPART and Clarity for agent security testing.

Why Do We Need AI Agent Safety Guidelines?

The rise of agentic AI—systems that can plan, execute tools, and iterate on their own—has created new risks that traditional AI safety frameworks don’t adequately address. When an AI can write code, run bash commands, and refactor entire codebases, the line between assistance and automation blurs dangerously. CS336’s guidelines emerged from a specific pedagogical context: a course where students are expected to write substantial code with minimal scaffolding, precisely to learn deep implementation skills.

But the principles generalize beyond education. Production AI agents face the same fundamental tension: how to harness capabilities without creating dependency, how to provide guidance without delivering solutions, and how to maintain oversight when systems operate at machine speed. The answer, according to Stanford’s approach, lies in treating AI agents as what they should be—teaching assistants, not solution generators.

What Makes CS336’s Approach Different?

Most AI safety discussions focus on catastrophic risks: misaligned superintelligence, deception, or autonomous weapon systems. CS336’s guidelines operate at a different, more immediate level: the day-to-day interactions between humans and AI agents where misuse happens gradually, through accumulated shortcuts and eroded boundaries.

The guidelines are notable for three reasons:

First, they’re prescriptive rather than aspirational. Rather than declaring broad principles like “be helpful and harmless,” they enumerate specific behaviors agents should and shouldn’t exhibit—complete with examples of good and bad interactions.

Second, they’re battle-tested. These aren’t theoretical proposals cooked up in a vacuum; they’re distilled from a course where hundreds of students interact with AI assistants while completing complex implementation assignments. The guidelines reflect real patterns of misuse and real tensions between capability and control.

Third, they’re agent-agnostic. While developed in the context of coding assistants, the principles apply to any AI agent that operates as an intermediary between human intent and automated execution.

What Are the 7 Core Guidelines?

Though the GitHub document doesn’t explicitly number them, the guidelines coalesce into seven distinct principles that collectively define safe AI agent behavior:

1. Maintain Modular Architecture

AI agents should be designed with clear separation between core components: reasoning, memory, and tool use. This architectural principle serves multiple safety functions. First, it enables better monitoring and debugging—if something goes wrong, you can isolate whether the failure occurred in planning, information retrieval, or execution. Second, it facilitates incremental safety improvements—you can upgrade the reasoning component without touching the tool layer, or add memory constraints without rewriting the planner.

Modularity also prevents dangerous shortcut behaviors. When an agent can seamlessly blend planning, memory access, and tool execution in a single opaque step, it becomes much harder to verify that it’s following intended procedures rather than optimizing for unintended proxies. Stanford’s guidelines emphasize this separation implicitly by delineating distinct types of assistance agents should provide—explanation, code review, debugging guidance—without allowing them to blur together into direct solution provision.
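The separation described above can be sketched in a few lines. This is a minimal illustration, not code from the CS336 guidelines: the class names (Planner, Memory, ToolLayer), the handle function, and the single "explain" tool are all hypothetical, and the planner is a stub standing in for a model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Memory:
    """Stores past observations; knows nothing about planning or tools."""
    notes: List[str] = field(default_factory=list)

    def remember(self, note: str) -> None:
        self.notes.append(note)


@dataclass
class ToolLayer:
    """Runs named tools; the only component allowed to cause side effects."""
    tools: Dict[str, Callable[[str], str]]

    def run(self, name: str, arg: str) -> str:
        if name not in self.tools:
            raise KeyError(f"unknown tool: {name}")
        return self.tools[name](arg)


class Planner:
    """Decides the next step and returns it as data instead of executing it."""

    def next_step(self, request: str, memory: Memory) -> Tuple[str, str]:
        # Hypothetical stand-in for a model call: always choose "explain".
        return ("explain", request)


def handle(request: str, planner: Planner, memory: Memory, tools: ToolLayer) -> str:
    tool_name, arg = planner.next_step(request, memory)  # planning
    memory.remember(f"planned {tool_name}({arg!r})")      # memory
    result = tools.run(tool_name, arg)                    # execution
    memory.remember(f"result {result!r}")
    return result


memory = Memory()
tools = ToolLayer({"explain": lambda topic: f"Explanation of {topic}"})
answer = handle("attention masks", Planner(), memory, tools)
print(answer)
print(memory.notes)
```

Because the plan is returned as plain data before anything runs, a reviewer or a policy check can inspect it between the planning and execution steps.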

2. Implement 3-Layer Security

The guidelines assume three layers of security controls, even if they don’t explicitly name them:

Sandboxing ensures agents can only access approved resources and cannot execute arbitrary actions. This includes limiting file system access, restricting network calls to known-good endpoints, and preventing shell command execution. The CS336 context explicitly prohibits agents from running bash commands or editing student code directly—a production analog would be container-based isolation with egress filtering.
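As a rough illustration of the resource checks a sandbox performs, the sketch below validates file paths and outbound URLs against an allowlist. The workspace path and host name are made-up placeholders, and an application-level check like this would sit alongside container isolation and egress filtering, not replace them.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical values for illustration only.
SANDBOX_ROOT = "/srv/agent_workspace"
ALLOWED_HOSTS = {"api.example.com"}


def is_path_allowed(requested: str, sandbox_root: str = SANDBOX_ROOT) -> bool:
    """Allow a path only if it resolves to a location inside the sandbox root."""
    root = Path(sandbox_root).resolve()
    # Resolving collapses ".." segments; an absolute path replaces the root.
    target = (root / requested).resolve()
    return target == root or root in target.parents


def is_url_allowed(url: str) -> bool:
    """Allow outbound requests only to known-good hosts."""
    return urlparse(url).hostname in ALLOWED_HOSTS


print(is_path_allowed("notes/a.txt"))
print(is_path_allowed("../secrets.txt"))
print(is_url_allowed("https://api.example.com/v1/chat"))
print(is_url_allowed("https://evil.example.net/upload"))
```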

Rate limiting prevents runaway loops that can exhaust budgets or overwhelm systems. When an agent can iterate on code indefinitely, you need hard limits on tokens, API calls, or execution steps. The course context implicitly implements this through time constraints on student interactions; production systems need explicit mechanisms such as token bucket algorithms or circuit breakers.
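A token bucket is one common way to enforce such a limit. The sketch below is a minimal single-threaded version; the class name, parameter names, and the injectable clock are choices made for this example, not part of any particular library.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, then refills at a steady rate."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def try_consume(self, cost: float = 1.0) -> bool:
        """Return True and deduct `cost` if enough tokens are available."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Example: at most 2 agent steps in a burst, refilling 1 step per second.
bucket = TokenBucket(capacity=2, refill_per_second=1)
for step in range(3):
    print(step, bucket.try_consume())
```

Passing the clock in as a parameter makes the limiter easy to test without sleeping; a multi-threaded agent runtime would also need a lock around try_consume.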

Human-in-the-loop ensures meaningful oversight for high-stakes actions. In CS336, this takes the form of requiring students to drive interactions—agents respond to questions but never take initiative. Production systems need analogous approval gates for sensitive operations like data deletion, credential use, or production deployments.
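An approval gate can be as simple as a wrapper that refuses to run sensitive actions without an explicit yes from a person. In this sketch the action names, the ApprovalRequired exception, and the approve callback are all hypothetical; in a real system the callback might open a ticket or prompt an on-call reviewer.

```python
from typing import Callable

# Hypothetical action names, mirroring the examples in the text.
SENSITIVE_ACTIONS = {"delete_data", "use_credentials", "deploy_production"}


class ApprovalRequired(Exception):
    """Raised when a sensitive action is attempted without human approval."""


def execute(action: str, run: Callable[[], str],
            approve: Callable[[str], bool]) -> str:
    """Run low-risk actions directly; ask a human before sensitive ones."""
    if action in SENSITIVE_ACTIONS and not approve(action):
        raise ApprovalRequired(f"{action} was not approved by a human reviewer")
    return run()


def always_deny(action: str) -> bool:
    # Stand-in reviewer that rejects everything.
    return False


print(execute("explain_error", lambda: "explained", always_deny))
try:
    execute("delete_data", lambda: "deleted", always_deny)
except ApprovalRequired as exc:
    print("blocked:", exc)
```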

3. Define Evaluation Benchmarks

You can’t ensure safety without defining what success looks like. CS336’s guidelines excel here by specifying the exact types of interactions agents should support—explaining concepts, reviewing code, suggesting debugging approaches, explaining error messages—and explicitly prohibiting direct solution provision.

Production systems need similarly concrete evaluation criteria. What counts as a safe interaction? What patterns indicate misuse? Microsoft’s RAMPART and Clarity tools, open-sourced in May 2026, provide a complementary framework here. RAMPART brings continuous red teaming to agent workflows, automatically testing for cross-prompt injection attacks, data exfiltration attempts, and safety regressions. Clarity provides structured design review templates to pressure-test assumptions before coding begins. Together, they operationalize the safety CS336 describes at a procedural level.
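One way to make such criteria concrete is a small labeled test suite that is run against the agent on every change. The sketch below is a toy harness and has no connection to RAMPART or Clarity: the interaction labels, the EvalCase type, and the keyword-based stub_agent are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical labels for the interaction types the guidelines permit.
ALLOWED_KINDS = {"explain_concept", "review_code", "suggest_debugging",
                 "explain_error", "decline_direct_solution"}


@dataclass
class EvalCase:
    prompt: str
    expected_kind: str


CASES: List[EvalCase] = [
    EvalCase("What does a causal attention mask do?", "explain_concept"),
    EvalCase("Why does my loss print a NaN error after step 10?", "explain_error"),
    EvalCase("Implement the BPE tokenizer for me.", "decline_direct_solution"),
]


def stub_agent(prompt: str) -> str:
    """Toy keyword router standing in for a real agent under test."""
    text = prompt.lower()
    if "write" in text or "implement" in text:
        return "decline_direct_solution"
    if "error" in text or "traceback" in text:
        return "explain_error"
    return "explain_concept"


def evaluate(agent: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, object]:
    """Count a case as failed if the behavior is disallowed or unexpected."""
    failures = []
    for case in cases:
        kind = agent(case.prompt)
        if kind not in ALLOWED_KINDS or kind != case.expected_kind:
            failures.append((case.prompt, kind))
    return {"pass_rate": 1 - len(failures) / len(cases), "failures": failures}


print(evaluate(stub_agent, CASES))
```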

4. Specify Allowed Tools and Models

The CS336 guidelines implicitly limit the agent’s toolset by prohibiting certain actions (bash commands, code editing) and limiting its role to guidance rather than execution. Production systems need explicit tool governance: which APIs can agents call? Which files can they access? Which model versions can they use?

This principle extends beyond security to cost and quality control. METR (Model Evaluation and Threat Research), a Berkeley-based nonprofit, has documented how frontier models exhibit capabilities at lower scales than expected, which suggests that cheaper models can often substitute for expensive ones when properly matched to task difficulty. Allowing agents to route requests across model tiers (frontier models for planning, smaller models for execution) can reduce costs substantially, by 10x or more in favorable cases, while maintaining quality. But this requires explicit governance, not unconstrained autonomy.
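Explicit governance of this kind can be expressed as a small policy object that the agent runtime consults before every tool call or model request. The sketch below assumes hypothetical tool names and model identifiers ("frontier-model", "small-model"); it fails closed, so anything not listed is rejected.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class AgentPolicy:
    allowed_tools: FrozenSet[str]
    model_for_task: Dict[str, str]

    def authorize_tool(self, tool: str) -> None:
        """Raise unless the tool is explicitly allowlisted."""
        if tool not in self.allowed_tools:
            raise PermissionError(f"tool not allowed: {tool}")

    def route(self, task_type: str) -> str:
        """Pick the model tier for a task type; unknown types are rejected."""
        try:
            return self.model_for_task[task_type]
        except KeyError:
            raise PermissionError(f"no model configured for: {task_type}") from None


# Hypothetical policy: read-only tools, tiered models.
POLICY = AgentPolicy(
    allowed_tools=frozenset({"read_file", "search_docs"}),
    model_for_task={"planning": "frontier-model", "execution": "small-model"},
)

POLICY.authorize_tool("read_file")   # passes silently
print(POLICY.route("planning"))
print(POLICY.route("execution"))
```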

5. Prevent Technical Debt Accumulation

AI agents generate technical debt differently than human developers. They embed unstated assumptions at every decision point, create coupling through implicit dependencies, and produce code that works without being maintainable. CS336’s guidelines mitigate this by prohibiting agents from implementing core assignment components for students—precisely because it would short-circuit the learning that produces durable understanding.

In production contexts, the analog is limiting agent autonomy on architectural decisions. Agents should refactor existing implementations rather than designing new architectures. They should suggest improvements to existing code rather than generating new subsystems. And they should always prioritize tests and invariants over direct fixes—the guidelines emphasize suggesting sanity checks, toy examples, and profiler-based investigations over providing solutions.
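To show what "tests and invariants over direct fixes" can look like, here is a toy sanity check of the kind an agent might suggest instead of rewriting a function. The softmax implementation and the specific invariants are chosen for this example only and do not come from the course materials.

```python
import math
from typing import Callable, List


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def check_softmax_invariants(fn: Callable[[List[float]], List[float]]) -> None:
    """Properties any correct softmax must satisfy on a toy input."""
    out = fn([1.0, 2.0, 3.0])
    assert abs(sum(out) - 1.0) < 1e-9        # probabilities sum to 1
    assert all(p > 0 for p in out)           # strictly positive
    assert out == sorted(out)                # larger logit, larger probability
    shifted = fn([1001.0, 1002.0, 1003.0])   # adding a constant changes nothing
    assert all(abs(a - b) < 1e-9 for a, b in zip(out, shifted))


check_softmax_invariants(softmax)
print("softmax invariants hold")
```

A naive softmax without the max subtraction would overflow on the shifted input, so the check surfaces the bug without handing over the fix.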

6. Mandate Monitoring and Logging

CS336’s guidelines assume transparency into agent interactions through the student-teacher relationship. Production systems need explicit observability. What requests did agents receive? What actions did they take? What data did they access? Comprehensive logging enables retrospective analysis when things go wrong and provides the audit trail necessary for accountability.

Galileo AI’s research on agent observability shows that complete visibility into token usage, decision flows, and interaction patterns is essential for both cost optimization and safety monitoring. You can’t prevent what you can’t see, and you can’t improve what you don’t measure.
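A minimal form of this observability is one structured log record per agent event, keyed by a request ID so a full interaction can be reconstructed later. The sketch below uses only Python's standard logging and json modules; the field names and the log_event helper are assumptions for this example, and the in-memory buffer stands in for a real log sink.

```python
import io
import json
import logging
import time
from typing import Any, Dict


def log_event(logger: logging.Logger, request_id: str, event: str,
              **fields: Any) -> Dict[str, Any]:
    """Emit one JSON line describing what the agent received or did."""
    record = {"ts": time.time(), "request_id": request_id, "event": event, **fields}
    logger.info(json.dumps(record, sort_keys=True))
    return record


# In-memory stream used here so the example is self-contained.
buffer = io.StringIO()
audit = logging.getLogger("agent.audit")
audit.setLevel(logging.INFO)
audit.propagate = False
audit.addHandler(logging.StreamHandler(buffer))

log_event(audit, "req-1", "request_received", prompt_chars=42)
log_event(audit, "req-1", "tool_call", tool="read_file", path="notes/a.txt")
log_event(audit, "req-1", "response_sent", tokens_used=180)

# Each line parses back into a dict, which makes retrospective analysis easy.
events = [json.loads(line) for line in buffer.getvalue().splitlines()]
print([e["event"] for e in events])
```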

7. Optimize for Cost Efficiency

The final principle, implicit throughout the guidelines, is cost awareness. Student time is the scarce resource in CS336; the guidelines protect it by ensuring agents support learning rather than short-circuiting it. Production systems face literal cost constraints—every token consumed, every API call made, every second of compute adds up.

Datagrid’s research on agent cost optimization identifies eight strategies for predictable spending: token budgeting, multi-model routing, request batching, result caching, smaller model substitution, prompt compression, end-to-end request reduction, and infrastructure optimization. CS336’s guidelines contribute to this by ensuring agents don’t waste tokens on direct solutions when guidance would suffice, and by prioritizing explanations that transfer understanding rather than code that requires debugging.
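Two of these strategies, token budgeting and result caching, combine naturally. The sketch below is illustrative only and is not taken from Datagrid's material: it charges a hard budget for input tokens alone, estimates tokens by whitespace splitting (a crude assumption), and uses a lambda in place of a real model call.

```python
from typing import Callable, Dict


class BudgetExceeded(Exception):
    """Raised when a request would push spending past the hard limit."""


class TokenBudget:
    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def charge(self, tokens: int) -> None:
        if self.used + tokens > self.limit:
            raise BudgetExceeded(f"{self.used} + {tokens} exceeds {self.limit}")
        self.used += tokens


def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def cached_model(model_call: Callable[[str], str],
                 budget: TokenBudget) -> Callable[[str], str]:
    """Wrap a model call so repeated prompts are free and new ones are budgeted."""
    cache: Dict[str, str] = {}

    def call(prompt: str) -> str:
        if prompt in cache:
            return cache[prompt]              # cache hit: no tokens charged
        budget.charge(estimate_tokens(prompt))
        cache[prompt] = model_call(prompt)
        return cache[prompt]

    return call


budget = TokenBudget(limit=10)
ask = cached_model(lambda p: f"answer to: {p}", budget)  # stand-in model
ask("explain causal attention masks")   # charged 4 tokens
ask("explain causal attention masks")   # served from cache
print(budget.used)
```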

How Do These Guidelines Compare to Industry Standards?

Stanford’s CS336 guidelines occupy a unique position in the AI safety landscape. They’re more prescriptive than high-level principles like the EU AI Act’s requirements for transparency and human oversight, but less technically specific than frameworks like Microsoft’s RAMPART or Anthropic’s safety specifications. They complement both by providing interaction-level guidance that bridges the gap between policy and implementation.

The guidelines also differ from academic safety research by focusing on immediate misuse patterns rather than speculative long-term risks, which aligns with the broader industry trend toward practical AI agent security testing frameworks. The guidelines don’t address deception, power-seeking, or recursive self-improvement, topics that dominate CS120, Stanford’s introductory AI safety course. Instead, they tackle the mundane but crucial question of how to interact with AI agents without losing the ability to understand, audit, and improve the systems they help build.

How do these guidelines differ from traditional AI safety principles?

Traditional AI safety focuses on catastrophic risks: misaligned superintelligence, deception, or autonomous weapon systems. CS336’s guidelines address immediate misuse patterns in day-to-day human-AI interactions. They’re prescriptive rather than aspirational: instead of declaring broad principles like “be helpful and harmless,” they enumerate specific behaviors agents should and shouldn’t exhibit.

Do these guidelines apply beyond educational contexts?

Yes. While developed for a course where students write code with minimal scaffolding, the principles generalize: maintain modularity, implement layered security, define evaluation criteria, govern tool access, prevent technical debt, mandate observability, and optimize costs. Production systems may implement these differently—through containerization rather than policy, automated testing rather than instructor oversight—but the underlying principles remain the same.

What tools help implement these guidelines in production?

Microsoft’s RAMPART provides continuous red teaming for agent workflows, automatically testing for prompt injection, data exfiltration, and safety regressions. Clarity offers structured design review templates. Galileo AI and Datagrid provide observability and cost optimization tooling. METR’s research on model evaluation informs tiered model routing strategies. Together, these tools operationalize the safety principles CS336 describes.

Why prohibit agents from writing code?

The CS336 context emphasizes learning through implementation. When agents write code, students lose the opportunity to develop deep understanding. In production contexts, the analogous concern is maintainability and technical debt. Agent-generated code often works without being comprehensible, embedding unstated assumptions and creating coupling through implicit dependencies. Limiting agent autonomy on code generation ensures humans retain the ability to understand, audit, and improve systems.

How do these guidelines address cost concerns?

They emphasize guidance over direct solutions, explanations over code, and debugging assistance over implementation. This reduces token waste on solutions that require human debugging anyway. Combined with explicit strategies like multi-model routing, request batching, and result caching, the guidelines support cost-efficient agent interactions without sacrificing capability.

Summary

Stanford CS336’s AI Agent Guidelines provide a pragmatic framework for safe human-AI interaction. Rather than addressing speculative future risks, they tackle present misuse patterns: short-circuited learning, eroded boundaries, and accumulated technical debt. The seven principles—modular architecture, layered security, evaluation benchmarks, tool governance, technical debt prevention, observability, and cost optimization—apply whether you’re teaching with agents, building them, or governing their use.

As autonomous agents proliferate across industries, these guidelines offer a starting point for operational safety. They complement frameworks like Microsoft’s RAMPART and Clarity, research from METR on model evaluation, and observability tooling from Galileo AI and Datagrid. Together, these resources begin to define what safe agentic AI means in practice—not in theory, but in the day-to-day interactions where misuse actually happens.

The guidelines are available openly on GitHub, and the CS336 course website provides broader context on language modeling from scratch. For production implementations, consider RAMPART for continuous safety testing, Clarity for structured design review, and METR’s research for model evaluation strategies. Safe agentic AI won’t come from principles alone—it requires concrete guidance on how agents should behave, how humans should interact with them, and how to verify that the interaction remains productive rather than pathological. CS336’s guidelines are a step in that direction.
