Day 1 · March 24
AI Agents
Agent intelligence is the inference layer. It's how the current context window generates the desired outcome, and how that window updates its environment to prime the next one. The system prompt architecture, the decision logic, the operating workflow. The reasoning that turns a pile of context and tools into useful, trustworthy action, and then leaves the world in a better state for the next run.
Think of an AI agent as a stack with four layers. Intelligence sits second from the top: it expects a stable platform underneath and uses it to create amazing user experiences above. Everything about how tokens inform outcomes. The new programming.
The boundary between intelligence and capabilities is where most failures happen. Intelligence decides what to do; capabilities determine whether the platform can execute it. When the agent reasons about using a tool, drafts a plan, or sequences multi-step actions, that's intelligence. When the tool actually runs, when the API call fires, when the runtime handles retries and timeouts, that's capabilities. The harness.
Intelligence expects a stable surface. It shouldn't need to worry about whether a tool is available, whether auth works, whether the sandbox is running. It just needs to know: I can call this, it will work, here's what comes back. The better that contract is, the more intelligence can focus on what actually matters: making good decisions about what to do and when.
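One way to picture that contract is a thin wrapper where every tool call returns a structured result instead of raising: the reasoning layer sees "here's what came back" or "here's what failed," never a raw transport error. A minimal sketch (the `ToolSurface` and `ToolResult` names are illustrative, not from any particular framework):

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ToolResult:
    ok: bool
    value: Any = None
    error: str = ""

class ToolSurface:
    """Stable contract between intelligence and capabilities:
    every call returns a ToolResult, never raises."""
    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs) -> ToolResult:
        fn = self._tools.get(name)
        if fn is None:
            return ToolResult(ok=False, error=f"unknown tool: {name}")
        try:
            return ToolResult(ok=True, value=fn(**kwargs))
        except Exception as e:  # capabilities absorb failures
            return ToolResult(ok=False, error=str(e))

surface = ToolSurface()
surface.register("add", lambda a, b: a + b)
print(surface.call("add", a=2, b=3).value)  # 5
print(surface.call("missing").error)        # unknown tool: missing
```

The point of the sketch: intelligence only ever inspects `ok`, `value`, and `error` — retries, timeouts, and auth live below the line.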
Above, UX and product shape what the user sees. Intelligence is invisible to the user. They experience the outcome, not the reasoning. The job of intelligence is to make the reasoning so good that the product team has a powerful engine to build on top of.
Day 2 · March 25
What Agents Can Actually Do
There's a gap between what people think agents can do and what they reliably do. Most of the discourse lives at the extremes: either "AI will replace everything" or "it's just autocomplete." The truth is more interesting. Agents today are genuinely powerful at specific things, emerging at others, and honestly bad at a few more. Knowing the difference is the whole game.
These are the things agents do well enough that you can trust them in production. Not perfect — but consistently useful, with failure modes you can design around.
Research & Synthesis
Give an agent a pile of documents, a messy inbox, or a research question, and it will produce a structured summary faster and more thoroughly than you would. Not because it's smarter — because it doesn't get bored, doesn't skim, and doesn't forget the thing on page 47.
Examples: Summarize 30 investor emails. Extract action items from a week of Slack. Research a company before a meeting. Synthesize user feedback into themes.
Drafting & Editing
The drafting sweet spot: agent writes 80%, human edits 20%. Works for emails, docs, proposals, updates. The key is giving enough context — tone, audience, constraints — so the draft lands close enough to be worth editing rather than rewriting.
Examples: Draft a board update from bullet points. Rewrite a support article for clarity. Turn meeting notes into a follow-up email. Adapt a template to a specific case.
Data Transformation
Agents are remarkably good at taking unstructured data and giving it shape. CSV cleanup, format conversion, extraction from natural language, categorization. Anywhere you'd normally write a script or do it manually, an agent can often handle it directly.
Examples: Parse 200 resumes into a structured spreadsheet. Extract dates and amounts from contract PDFs. Normalize messy CRM data. Convert between formats.
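For the simplest transformations, the "script you'd normally write" is small. A sketch of the dates-and-amounts case, assuming ISO-style dates and dollar amounts (real contract text is messier; an agent would handle the variation this regex can't):

```python
import re

def extract_dates_and_amounts(text: str) -> dict:
    """Pull ISO-style dates and dollar amounts out of free text."""
    dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)
    amounts = [float(m.replace(",", ""))
               for m in re.findall(r"\$([\d,]+(?:\.\d{2})?)", text)]
    return {"dates": dates, "amounts": amounts}

clause = "Payment of $12,500.00 is due 2025-04-01; renewal on 2026-04-01."
print(extract_dates_and_amounts(clause))
# {'dates': ['2025-04-01', '2026-04-01'], 'amounts': [12500.0]}
```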
Workflow Execution
When the steps are defined and the tools are available, agents execute workflows reliably. The magic is combining steps across services — pulling from one system, transforming, pushing to another — without the human needing to context-switch between tools.
Examples: Process new applicants: screen resume → draft assessment → schedule interview. Morning briefing: pull calendar + emails + Slack → synthesize → deliver summary.
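The shape underneath all of these is the same: named steps run in order, each output feeding the next, with a stall reported by step name rather than lost mid-chain. A sketch with a hypothetical applicant pipeline (the step logic here is a stand-in, not a real screening rule):

```python
from typing import Any, Callable

def run_workflow(steps: list[tuple[str, Callable[[Any], Any]]], payload: Any) -> Any:
    """Execute named steps in order, passing each output to the next.
    Stops and reports the failing step instead of losing the thread."""
    for name, step in steps:
        try:
            payload = step(payload)
        except Exception as e:
            raise RuntimeError(f"workflow stalled at step '{name}': {e}") from e
    return payload

# Hypothetical applicant pipeline: screen -> assess -> schedule
pipeline = [
    ("screen",   lambda r: {**r, "passed": "python" in r["skills"]}),
    ("assess",   lambda r: {**r, "note": "strong fit" if r["passed"] else "pass"}),
    ("schedule", lambda r: {**r, "slot": "Tue 10:00" if r["passed"] else None}),
]
result = run_workflow(pipeline, {"name": "Ada", "skills": ["python", "sql"]})
print(result["slot"])  # Tue 10:00
```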
These work sometimes, with the right scaffolding. Not reliable enough for "set and forget," but powerful enough that you should be learning them now.
Multi-Step Reasoning
Agents can decompose complex goals into steps and execute them in sequence. The catch: they're good at the decomposition but sometimes lose the thread during execution. Long chains of dependent actions still need checkpoints and human review.
Monitoring & Proactive Action
Set an agent to watch for conditions — an important email, a calendar conflict, a metric threshold — and take action when they trigger. Works well for defined triggers. Less reliable when "what to watch for" requires judgment.
Taste & Preference Learning
With enough examples, agents develop a model of your preferences: writing tone, decision patterns, priorities. Early but real. After 50+ interactions, the drafts start sounding like you. After 200, they anticipate your preferences before you state them.
Honesty about limits builds trust. These are the things people expect agents to do but that current systems genuinely struggle with.
Individual capabilities are useful. Combined capabilities are transformative. An agent that can read email is a filter. An agent that can read email and check your calendar and draft responses and knows your priorities? That's a chief of staff.
The real power lives not in any single capability, but in the compound value of capabilities working together over time. Each integration added, each preference learned, each workflow packaged makes every other capability more powerful. The graph of connections matters more than any individual node.
The question isn't "what can an agent do?" It's "what can an agent do that knows everything you know?"
Day 3 · March 26
Context Engineering
Every major AI lab is racing to build the most intelligent model. They'll keep getting smarter — Claude, GPT, Gemini, the next hundred models. Intelligence will commoditize. It's happening already. So what doesn't commoditize? The context you give it to reason over.
Two identical models — same weights, same capabilities — will produce vastly different results depending on what context they're given. One has your emails, calendar, Slack, meeting history, preferences, and work patterns. The other has nothing. Same intelligence. Completely different outcomes.
The model is the engine. Context is the fuel. The best agent systems aren't the ones with the smartest models — they're the ones where any model becomes dramatically more useful because of the context provided.
After 90 days with a well-built agent, users can't go back to a blank chat. Not because the agent is smarter. Because the agent knows them.
Context is everything the model needs to make a good decision that isn't built into its weights. It's broader than people think.
Immediate
What the user just asked. The document they're looking at. The email they're replying to. This is the obvious layer — what's in front of you right now.
Session
Previous turns in this session. What's been discussed, decided, discarded. The thread of reasoning that led here.
Personal
Preferences, style, priorities, past decisions. How you like emails written. Which meetings you care about. What "urgent" means to you specifically. This layer accumulates over weeks and months.
Organizational
Team structure, project status, company terminology, internal processes. Who's responsible for what. The institutional knowledge that lives in Slack threads and people's heads.
Temporal
Calendar events, deadlines, recent communications. Context has a clock. What mattered yesterday might not matter today.
Relational
The relationship with the recipient. Tone calibration — how you talk to your investor vs. your teammate vs. your friend. Every communication has an audience.
Context engineering is different from prompt engineering because context compounds. Every interaction teaches the system something. Every email processed, every preference expressed, every correction made — it all accumulates.
Day 1: the agent is smart but generic. Day 30: it knows your style, your recurring meetings, your top priorities. Day 90: it anticipates what you need before you ask.
Intelligence you can swap — switch from Claude to GPT tomorrow, the capabilities are similar. Context you can't swap. It's yours, built over time, specific to your world.
Context windows are finite. Even as they grow — 128K, 1M, 10M tokens — they'll never hold everything. And bigger windows don't solve the problem; they shift it. With 1M tokens, the question isn't "will it fit?" but "will the model attend to the right parts?"
This makes context engineering a curation problem. You can't dump everything in. You have to decide: what goes in this window, for this task, at this moment? What's relevant? What's noise? What should be summarized vs. included verbatim?
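The curation decision can be made concrete as a budget problem: score candidate items for relevance, then fill the window highest-relevance first. A deliberately simple greedy sketch (real systems would use retrieval scores and summarization, not hand-assigned numbers):

```python
def curate_context(items: list[tuple[float, int, str]], budget_tokens: int) -> list[str]:
    """Greedy curation: highest-relevance items first, until the budget is spent.
    items: list of (relevance, tokens, text)."""
    chosen, used = [], 0
    for relevance, tokens, text in sorted(items, reverse=True):
        if used + tokens <= budget_tokens:
            chosen.append(text)
            used += tokens
    return chosen

items = [
    (0.9, 400,  "the email being replied to"),
    (0.7, 900,  "summary of last week's thread"),
    (0.4, 1500, "full CRM history"),
    (0.8, 300,  "recipient tone preferences"),
]
print(curate_context(items, budget_tokens=1700))
```

The full CRM history loses its slot not because it's irrelevant but because its token cost crowds out three higher-value items — exactly the "what's noise?" judgment made explicit.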
The best context engineering is invisible. The user doesn't think about what's in the window. They just notice that the agent seems to understand.
Day 4 · March 27
Prompting Strategy
A prompt is a program written in natural language. It has intent (what you want), constraints (what you don't want), and context (what the model needs to know). The difference between a good prompt and a bad one isn't cleverness — it's clarity.
"Do exactly this, in this order, with these constraints."
High control, low flexibility. Best for defined workflows, data transformation. The agent is an executor.
"Here's the goal and context. Figure out the approach."
Low control, high flexibility. Best for research, exploration, creative work. The agent is a partner.
Most production prompting lives in the middle: guided autonomy. You set the boundaries, define the goal, provide the context — then let the model choose how to get there within those constraints.
The best system prompts describe a role and a way of thinking, not a list of rules for every situation. "You are a careful editor who values clarity over cleverness" generalizes better than 50 editing rules. Rules are brittle. Mental models are flexible.
A mediocre prompt with excellent context will outperform a brilliant prompt with no context, every time. Before optimizing your instructions, ask: does the model have everything it needs to make a good decision?
"Write a good email" is almost useless. "Write a 3-paragraph email to an investor, professional but warm, acknowledging their concern about burn rate and redirecting to our runway" gives the model something to work with. Include examples when possible — they're worth more than paragraphs of instruction.
System prompts, user prompts, and context serve different purposes. The system prompt defines character and constraints. User messages carry the immediate task. Context provides information. When these get tangled, contradictions emerge.
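Keeping the three untangled can be as mechanical as never letting them share a slot. A sketch using the chat-message shape common to LLM APIs (the `<context>` tags are one convention for labeling background information, not a requirement):

```python
def build_messages(system: str, context: str, task: str) -> list[dict]:
    """Keep the layers separate: character in the system prompt,
    background information labeled as context, the immediate task last."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"<context>\n{context}\n</context>\n\n{task}"},
    ]

messages = build_messages(
    system="You are a careful editor who values clarity over cleverness.",
    context="Audience: investors. Tone: professional but warm.",
    task="Draft a 3-paragraph update from the bullet points below.",
)
print(messages[0]["role"])  # system
```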
Every prompt has a failure mode. "If you're not confident, say so and explain why" is worth more than 10 lines of happy-path instructions.
Prompt development is empirical, not theoretical. Keep a test suite of inputs and expected outputs. When you change something, run the suite. Intuition helps you write the first draft. Evidence helps you write the tenth.
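A test suite for prompts doesn't need infrastructure to start — a list of inputs and a check per input is enough. A minimal sketch, with a trivial stand-in where a real model call would go:

```python
from typing import Callable

def run_suite(generate: Callable[[str], str],
              cases: list[tuple[str, Callable[[str], bool]]]) -> dict:
    """Run each input through the current prompt and check the output.
    generate: callable(input) -> output; cases: list of (input, check)."""
    failures = [inp for inp, check in cases if not check(generate(inp))]
    return {"passed": len(cases) - len(failures), "failed": failures}

# Hypothetical stand-in for a model call
generate = lambda inp: f"Subject: {inp.title()}"

cases = [
    ("quarterly update", lambda out: out.startswith("Subject:")),
    ("board meeting",    lambda out: "board" in out.lower()),
]
print(run_suite(generate, cases))  # {'passed': 2, 'failed': []}
```

Swap the stand-in for the real prompt-plus-model call, and rerunning the suite after every change is one line.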
Every additional instruction competes for attention in the context window. The model can follow 5 clear rules better than 50 overlapping ones. The exception: examples. One good example is worth a paragraph of explanation.
The instruction novel
10,000 tokens of instructions that the model can't possibly follow all at once. Distill to principles, use examples for edge cases, and accept that some judgment calls are better left to the model.
Contradictory constraints
"Be concise but thorough. Be creative but stick to the facts." Every contradiction forces the model to choose. When you catch yourself writing "but," that's a signal to clarify your actual priority.
Negative instructions
"Don't use jargon" is processed less reliably than "use plain language that a smart non-expert would understand." Tell the model what to do, not what to avoid.
Context starvation
Brilliant instructions with no context. "Write the perfect follow-up email" without including the original email, the relationship context, or the desired outcome.
Day 5 · March 28
Skills & Recipes
A skill is a reusable unit of intelligence. It packages domain knowledge, decision logic, and workflow steps into something an agent can execute reliably. Skills are how intelligence becomes tangible — how "make good decisions" turns into "handle this specific kind of work, well, every time."
You do something manually — process a batch of applicants, prep for a recurring meeting, triage an inbox. The agent watches. After seeing you do it two or three times, it recognizes the pattern and offers to automate it.
The first time cost you 20 minutes. Every subsequent time costs you 10 seconds of review. That's the value proposition: intelligence that learns from your behavior and packages it for reuse.
Do something twice, teach it once, never do it again.
1. Discover
Every skill starts as a repeated action. Discovery is about recognizing that a human is doing something an agent could learn.
2. Package
Turn the pattern into a skill: define the trigger, the inputs, the steps, the decision points, the output format. Capture not just what someone does but why.
3. Ship
Skills need to be discoverable — surfaced at the right moment, described clearly, easy to activate. The best skill is one the user doesn't know exists until it appears exactly when they need it.
4. Iterate
Every execution is feedback. Skills should get better over time — not just from explicit improvement but from implicit learning through use.
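The package step above has a natural data shape: trigger, steps, output format, plus a counter for the iterate step. A sketch of what a packaged skill might look like (the `Skill` structure and the inbox-triage example are illustrative, not a real product's schema):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Skill:
    """A packaged unit of work: trigger, steps, output format."""
    name: str
    trigger: Callable[[dict], bool]      # when should this surface?
    steps: list                          # the workflow, in order
    output_format: str = "summary"
    runs: int = 0                        # feedback counter for iteration

    def execute(self, payload: dict) -> dict:
        for step in self.steps:
            payload = step(payload)
        self.runs += 1
        return payload

triage = Skill(
    name="inbox-triage",
    trigger=lambda e: e.get("unread", 0) > 20,
    steps=[lambda p: {**p, "urgent": [m for m in p["mail"] if "asap" in m]}],
)
if triage.trigger({"unread": 42}):
    print(triage.execute({"mail": ["reply asap", "newsletter"]})["urgent"])
```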
Day 6 · March 31
Evals & Measurement
Without measurement, every prompt change is a coin flip. Evals are the difference between engineering and guessing. They turn intelligence work from an art into a science — or at least a craft with feedback loops.
Decision Quality
Given the context and the task, did the agent make the decision a competent human would make? Build golden sets: curated examples where you know what the right answer is, and score against them.
Task Completion
Binary but important. Did the email get drafted? Did the meeting get scheduled? Track completion rate and identify where tasks stall.
Faithfulness
Did it use the information provided or hallucinate? Did it follow the constraints or ignore them? Faithfulness to context is the most measurable aspect of intelligence.
User Trust Signals
The ultimate eval: did the user send the draft as-is, edit it heavily, or throw it away? Every user interaction is an implicit evaluation.
Calibration
When the agent says "I'm confident," is it right? Calibration — the alignment between stated confidence and actual accuracy — is one of the most underrated metrics.
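Calibration is also cheap to measure: bucket outcomes by stated confidence and compare each bucket's accuracy to its confidence. A sketch of a simple unweighted variant of expected calibration error (production versions typically weight buckets by count):

```python
def calibration_gap(records: list[tuple[float, bool]]) -> float:
    """records: list of (stated_confidence 0..1, was_correct).
    Returns the mean |confidence - accuracy| across confidence buckets."""
    buckets: dict[float, list[bool]] = {}
    for conf, correct in records:
        b = round(conf, 1)                       # 0.0, 0.1, ..., 1.0
        buckets.setdefault(b, []).append(correct)
    gaps = [abs(b - sum(v) / len(v)) for b, v in buckets.items()]
    return sum(gaps) / len(gaps)

records = [(0.9, True), (0.9, True), (0.9, False), (0.5, True), (0.5, False)]
print(round(calibration_gap(records), 3))  # 0.117
```

Here the agent is overconfident at 0.9 (it's right only two times in three) and perfectly calibrated at 0.5, so the mean gap is about 0.117.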
Curated input/output pairs where you've defined what "good" looks like. Start small — 20-30 cases. Expand as you discover edge cases. Every bug report is a potential golden set entry.
Before you change anything, run the suite. After, run it again. Did the thing you fixed actually improve? Did anything else break? Table stakes in software engineering. Should be table stakes in intelligence work.
Same input, two different prompts. Which scores higher against your golden set? Which gets more user accepts? Controlled comparison beats gut feeling.
Record the full reasoning path. When something goes wrong, the trace tells you why. Without traces, debugging intelligence is archaeology.
User corrections are the highest-signal eval data. The diff between original and edited version tells you exactly where the agent got it wrong. Collect these systematically.
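Collecting that diff systematically is a few lines with the standard library. A sketch using Python's `difflib`:

```python
import difflib

def correction_signal(original: str, edited: str) -> list[str]:
    """Extract what the user changed -- the highest-signal eval data."""
    diff = difflib.unified_diff(
        original.splitlines(), edited.splitlines(), lineterm="")
    return [line for line in diff
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))]

draft  = "Hi team,\nThe launch is delayed.\nBest, Sam"
edited = "Hi team,\nThe launch moves to Friday.\nBest, Sam"
print(correction_signal(draft, edited))
# ['-The launch is delayed.', '+The launch moves to Friday.']
```

Aggregated over hundreds of drafts, these removed/added pairs show exactly which phrasings the agent keeps getting wrong.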
The measure of intelligence isn't what the system prompt says. It's whether the user trusted the output enough to act on it.
Intelligence isn't measured by the complexity of the system prompt; it's measured by the quality of the decisions that come out the other end. Every prompt change, every workflow tweak, every new heuristic gets tested against one question: did the agent make a better decision?