
A.I. Manifesto & Workflow Walkthrough


The following is adapted from a talk I gave at Stripe to the Risk Engineering Organization.

I am Jesse Spevack and I’m on the Agentic Sellers team. Thank you all for coming. I’ve been at Stripe for 4.5 years. I maintain the AI Vibes weekly newsletter, which I started almost a year ago. At the time I started that, I was on the Differentiated Onboarding team and had spent the previous nearly four years in the Risk Organization. 

I’ve been watching the AI space very closely. I want to explain what I’ve seen, what I’ve used myself, what I found to be effective, what I think is coming, and how I think about AI more broadly. This is going to be part manifesto, part workflow walkthrough. I’m not going to do a live demo because I don’t think watching me chat with Claude Code makes for compelling live content. I’ll start with my argument about where we are and the second half of this talk will walk through problems I’ve come across when using agents and then the solutions that have worked for me.

I want to share a little secret with you all. I haven’t written any code this year. Maybe that doesn’t come as a surprise. After all, Claude Code and Opus 4.5 and now 4.6 are having a moment. Many of you have realized that Opus is a completely different animal from the autocomplete we were messing with two or three years ago.

My Claude pilling goes further back than winter break. I don’t precisely know when I stopped writing code by hand. I can tell you that the vast majority of code I’ve merged since late 2024 was not hand written. Likely all the code I’ve merged in the past six months has been completely generated with AI.

At the end of 2024 I went on a sidequest and worked with the experimental projects team on a Google Sheets Stripe plugin. The project was Google Apps Script and TypeScript, and I distinctly remember using our web-based LLM chat as a key part of my workflow. The project was small enough and the LLM’s context window was large enough that I could paste the entire code base in, get updates out, run the updated code, and repeat.

Obviously, copying and pasting code between one’s editor and an LLM chat window is a nightmarish workflow.

At around the same time the Cursor pilot launched and I did not get in. I DM’d the PM multiple times to no avail. Outside of work I used my education budget to subscribe to Windsurf, which I thought was going to be a cool hipster Cursor - kind of like a Brooklyn IDE. I worked on side projects that I never would have attempted. I built familiarity with the models.

It was at this time I realized that right now is the best time to be a developer. Even though the models were not as good as they are now, they did provide me with the courage to attempt things I’d really never have tried in the past. LLMs aim to please, so when you say, ‘do you think we could build…’ they’ll say ‘that’s an awesome idea.’ Some call this sycophancy, and you have to watch very carefully for that particular failure mode, but there is a part of it that can be your own anti-imposter syndrome cheerleader, which for me personally has been game changing.

I ran out of tokens on Windsurf so I subscribed to Cursor. Then I ran out of tokens on Cursor. For a few months I’d start with Windsurf and then move to Cursor and back again, just rotating between wherever I had tokens. I just want to pause here and note how wild this is. In the previous nine years I switched IDEs once. In the past year I’ve switched IDEs three or four times.

I remember when Claude Code came out. I had a background job bug which I was about to give up on. I gave the problem to Claude Code and it solved it. When it was done it printed that the conversation had cost me $4.17. This was absolutely wild. It honestly reminded me of going to arcades as a kid. Money in. Fun out.

Around that time, at Stripe we got access to Goose and then a Claude Code Pilot and then finally Cursor.

In June 2025 I did not want everyone to know how much I was using coding agents. It felt like cheating. I felt like real engineers looked down on this sort of thing. At the same time, AI 2027 came out, and it hit me really hard. This was a blog post that used rigorous quantitative estimates to predict advances in AI and its effects on the economy, geopolitics, and the existence of our species. I had of course read AI books by Bostrom and Tegmark, but this was making predictions about the very near future rooted in quantitative analysis. I had a bit of a professional crisis.

I don’t know if this career will exist in 10 years. I’ve gone through one career transition already. As some of you know, I taught at a public school in the South Bronx a long time ago and I’ve been an assistant principal of a very good public high school in Manhattan and a not quite as good middle school in Denver.

I think software development will exist at the end of this year, but it is going to get weird. And I’m not sure I have another career change in me. So remember when I said that this is the best time to be a developer? It also might be the last time.

So what has changed for me? Why am I now ready to share my thinking and how I use AI at Stripe? I think a year ago I thought of AI as a secret that would give me an edge in my work. My mindset has changed. This is a team sport and we are a team in contention for the championship. And I hate to lose.

If I legitimately care about our users, then sharing how I work with all of you is a way to truly put users first.

I’m not a professional athlete, but I suspect teammates share workout techniques, insights gained from studying the tape, and I also want to believe that there is a lot more discussion of these kinds of trade secrets as you move up the hierarchy of competitiveness. Like if you are in the Seahawks locker room those guys are communicating and being fully transparent on process with each other. I’m actually a lifelong Jets fan. Maybe if we are in the Jets locker room it's different. The culture is broken. In some ways I think of this call like we are studying tape and insights we all share will help us perform better on Sunday. I think our culture is strong. This is a team sport.

Let’s zoom out a little. One of the things I’ve heard Boris Cherny, the creator of Claude Code, say is that AI is like the printing press, which was invented in Europe in the 1400s. For the previous thousand plus years, literacy rates were very low. The vast majority of people could not spell their own names. Reading and even more so writing was something only specialized craftspeople, known as scribes, could do.

It makes me imagine a medieval monastery with a few monks hunched over their sheets of vellum, meticulously illuminating the text. They took pride in their craft. It was sacred. In any monastery there would have to be a monk who had the best handwriting. They may have been commissioned by or worked for a lord. Perhaps the lord could not write themselves. Is this starting to sound familiar? Does it remind you of the startup landscape for the past 10 or 15 years?

We are the monks and scribes. We have made our careers in the age of the computer scribe. Some of you have particularly good handwriting. Software engineers work for founders, many of whom (ours excluded, I very much appreciate the wunderkind boy scout super genius Collison vibe) are essentially illiterate when it comes to technical systems. Literacy is widespread in our society, but computational literacy is not. That is, until now.

Once the printing press arrived, the price of creating writing dropped and literacy rates rose over the next several centuries. The status of the scribe with the best handwriting changed. Actually, their job changed. And over time, reading and writing, which had been the exclusive purview of the scribe, extended to 99% of the population in today’s developed nations.

I think we are in the midst of the printing press moment of computational literacy. Even if the models don’t get any better, we are already seeing non-technical Stripes build things that would have taken teams of us computer scribes in the past. I don’t see this trend slowing down.

As a side note, I was on a Zoom call brainstorming some UI work my team is doing, and our product manager live vibe coded some UI elements during the call. It was one of the more impressive things I’ve seen. This is going to become more and more common.

This is broadly how I think about where our industry is at. To restate it more clearly: we are in a time of faster-than-normal change, and it is hard to predict where this will go.

Now maybe some of you, especially the scribes with good handwriting, are still skeptical about the efficacy of these tools. My theory, which I have no data to back up, is that some of the best programmers wrote off the utility of LLMs when the tools first hit the scene in 2024, and agentic tools in 2025, because these tools slowed them down. We are all busy and it is really hard to take that productivity hit.

The models have gotten better every couple of months since then. Now the frontier models are very smart. The way I think about Opus 4.6 is that it is my teammate with absolutely perfect code writing ability in every language. That is insane. I used to think I was pretty good at Ruby. I’m nothing compared to a modern LLM, even in my primary language. But with its syntactical perfection, it suffers from memory loss, an over-eagerness to please that shades into sycophancy, and an inability to learn.

If you haven’t used the current frontier models in anger yet, you need to. One way to think about it is that like many new tools it really sucks at first. When I learned Vim, it was awful. It was really awful for about a week and pretty awful for a second week. But now it speeds me up. If you tally how much time I’ve saved hjkl-ing and cw-ing around text since I learned vim, it definitely adds up to more than the two weeks of struggle.

This is the job we have. We learn new tools. Learning new tools is usually hard and uncomfortable. And to some extent I think that is true about the LLMs. Hopefully I have convinced you to lean into the discomfort, because persevering through the discomfort of unknowing is a developer’s superpower, and it is the job. Since you are here, you are probably already convinced, so I’m not sure how persuasively I need to make the point. I think two months ago all of this would have been met far more skeptically.

Let’s talk about what will happen once you’ve started to use the LLMs in anger. I want to share the problems I ran into and how I thought about solving them. And then I want to talk about the problems I’m still having - both practical and existential.

First I think it is important to consult some experts. Andrej Karpathy has a 3.5-hour YouTube video on how LLMs work. I strongly recommend it. I also just had a conversation with Claude and asked it questions about how it works. For example, I wanted to better understand how LLMs read text. So I asked Claude, ‘what is your experience when reading text?’ That was an illuminating conversation that I remember really clearly. Before it, I was very fixated on getting text in an order that made sense for me, with instructions and headings. But the LLM does not read left to right, top to bottom. It essentially reads all at once, in parallel, so it gets the gestalt of the text in an instant.

Side note, how we tricked grains of sand to do this is breathtaking. As another aside, the LLM will tell you what it can do and how to do it, just ask it to teach you.

There are a lot of AI influencers out there. I think most of them are not good. A few are really compelling though. Ben Guo just gave a tech talk last week, and based on his work at Stripe and how he built Zo Computer, I take what he says pretty seriously. Another good source is, of course, his Lordship Simon Willison.

You are going to need something to practice on. You may be the kind of person who needs something high stakes as a forcing function, or you might be like me and want to try out these tools on lower stakes work to find the sharp edges with a safety net. Trying this out does not mean one prompt. That is the equivalent of opening vim, not knowing how to get into insert mode, and deciding vim is bad. You actually have to stick with it, just like vim.

I think a KTLO task, run ticket or some sort of code migration is a good place to start. Just to be clear, I’m suggesting KTLO not because the tools are only able to do this kind of work, but because it might be an easier entry point into learning the tools for some of us. I’ve done real project work with Claude Code. And like I said, all my code has been generated for quite some time.

My first prompts were not good. Thankfully the models have gotten a lot better at understanding my half baked ideas. I commonly will ask the LLM to ‘help me write a prompt for a coding agent to…’ The way to get better at prompting is to do more prompting. Experiment, read other people’s prompts, and then do more prompting again.

Let’s say we have a KTLO task to try out. I’m going to spin up a new devbox. I have a setup script which does a few things. It clones my dotfiles repo to the devbox. My dotfiles are where I have my tool setup, aliases, nvim, tmux, and terminal config. It helps me set the devbox up how I like it. It also contains a lot of my system level claude files, which you can update in your own dot-claude directory. My advice for dotfiles is to find a Stripe whose game you respect and see what their dotfiles look like. You can also look at trailhead for the dotfile setup script shim. Finally you can open Claude Code in pay server and ask Claude to survey all the setup scripts.
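My actual script is personal, but a minimal sketch of a devbox setup script along these lines might look like the following. The repo URLs, paths, and function names are all illustrative, not my real setup:

```shell
#!/usr/bin/env bash
# Hypothetical devbox setup sketch. Repo URLs and paths are illustrative.
set -eu

clone_or_update() {
  # Clone a repo if it is missing, otherwise fast-forward to the latest.
  url="$1"; dest="$2"
  if [ -d "$dest/.git" ]; then
    git -C "$dest" pull --ff-only
  else
    git clone "$url" "$dest"
  fi
}

setup_dotfiles() {
  clone_or_update "git@github.com:me/dotfiles.git" "$HOME/dotfiles"
  # Symlink configs into place: tmux, nvim, terminal, and system-level Claude files.
  ln -sf "$HOME/dotfiles/tmux.conf" "$HOME/.tmux.conf"
  mkdir -p "$HOME/.claude"
  ln -sf "$HOME/dotfiles/claude/CLAUDE.md" "$HOME/.claude/CLAUDE.md"
}

setup_agent_team() {
  # The directory I end up launching Claude Code sessions from.
  clone_or_update "git@github.com:me/agent-team.git" "$HOME/agent-team"
}
```

The point is not these particular lines; it is that the script runs unattended on every fresh devbox so the environment is identical each time.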

My setup script also clones my agent-team repo to the devbox. This is the directory I end up working out of, but I’ll explain why once I’ve covered some of the problems I’ve run into.

The order of this section of the talk is: a problem I encountered, then the solution I’ve landed on. Going back: the problem was finding experts, and I landed on Simon Willison, Ben Guo, and Karpathy, to name a few. The problem was getting my configuration, including my Claude files, onto my devbox, so I wrote a setup script that automatically clones my dotfiles onto each new devbox.

The problem now is that I want to get better at using coding agents, so the solution is to find a low stakes way to practice. At this point I have a keep the lights on (KTLO) task and a devbox. Now I’m going to run tmux. tmux allows me to shut my computer down completely but retain the working session on the devbox. This helps solve the problem of dropping a Claude Code session because of a meeting. With tmux you can pick up from where you left off. I only started using tmux earnestly this year.

Problem: how do I avoid losing valuable ongoing chats with Claude Code? Solution: tmux. Problem: tmux is hard; how do I learn it? Solution: ask Claude. I asked Claude Code to survey everyone’s tmux setups in the dotfiles folder in pay-server. Then it designed one specially for me. Then Claude taught me how to use the tmux setup.
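The config Claude designed for me is personal, but an illustrative ~/.tmux.conf sketch with the kind of quality-of-life settings it suggested might look like this (the keybindings here are examples, not my actual setup):

```
# Illustrative ~/.tmux.conf sketch - not my actual config.
set -g mouse on               # click to switch panes, scroll through history
set -g history-limit 100000   # keep plenty of scrollback for long Claude sessions
set -g base-index 1           # number windows from 1
bind | split-window -h        # easier-to-remember split keys
bind - split-window -v
# The core workflow: `tmux new -s work` to start a named session,
# prefix + d to detach, `tmux attach -t work` to pick up where you left off.
```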

Just as a tangent, as I was learning tmux, I’d keep a Claude Code session going from my dotfiles directory, where my tmux config lives. If I wasn’t sure how to do something, I’d just ask that Claude session, which could then read my config. If I didn’t like the answer, like the shortcut wasn’t easy to reach or remember, I’d ask Claude to change my config.

Now we have tmux open on a new devbox. The next problem is having to go back and forth granting permissions to Claude. The solution is to alias claude --dangerously-skip-permissions to lfg.
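Concretely, the alias is one line in your shell config (the name lfg is just my choice):

```shell
# In ~/.zshrc or ~/.bashrc: skip Claude Code's permission prompts.
# This disables all confirmation guardrails, so it is only sensible
# inside a disposable devbox where the blast radius is contained.
alias lfg='claude --dangerously-skip-permissions'
```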

Ok, so we have our Claude Code session going, our KTLO task, and tmux. Now the problem is how to write good prompts. When I would just prompt, ‘please complete this KTLO task,’ I would not get great results - probably a lot of us have experienced this. I have found better results by separating planning from implementation. At first I might prompt Claude Code to come up with a plan. Or I might say, ‘let’s come up with a plan together,’ or I might reach for a brainstorm command. I forked my particular version of this from Superpowers, which is a publicly available Claude plugin. This command, which lives in my .claude/commands directory, conducts a structured interview to get Claude and me on the same page regarding an idea I’m having.

You can achieve the same result by asking Claude to interview you to figure out what you want. The thing that is really effective about this command is that it tells Claude to ask me questions, one at a time and preferably multiple choice until it has all the information it needs to proceed.
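A Claude Code slash command is just a markdown file. I won’t reproduce my fork here, but a stripped-down sketch of a brainstorm command along these lines, saved as something like .claude/commands/brainstorm.md, might read as follows ($ARGUMENTS is Claude Code’s placeholder for whatever you type after the command; the wording is illustrative):

```
Interview me about the following idea: $ARGUMENTS

Rules:
- Ask exactly one question at a time.
- Prefer multiple choice questions so I can answer quickly.
- Keep interviewing until you have enough to write a short PRD.
- Then summarize our shared understanding back to me and wait for my sign-off.
```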

The next problem is context. We’ve all heard about context management at this point. I think we should stop thinking about story points and t-shirt sizes. When we are breaking work down, which has always been and continues to be a critical software engineering skill, I think about context. Is this work something that can be accomplished in a single context? If yes, we have an atomic piece of work. If not, I might try to break it down further.

The next problem I’ve run into is when work goes beyond a single context - like planning the work versus implementing the work - how can I hand the second Claude Code session the work from the first? I typically will ask Claude to write the plan as markdown, which the new session will have access to. In the past I’ve had Claude write down a markdown plan with steps and places to keep track of status. This was good at first, but over time I found that Claude often would forget to update the markdown. I’ll get into that problem shortly.

A plan might not be an atomic unit of work. One way to break it down further is to split the what and the how. At the end of the brainstorm, I might ask Claude to create a product requirements doc (PRD) from our conversation. Then I’ll open a second session and ask it to turn the PRD into an implementation plan. Then I’ll open a third session to carry out the implementation.

This is a pattern I think works pretty well. Brainstorm to create a markdown PRD, turn the markdown PRD into a markdown implementation plan which includes task tracking, and then implement the plan and record status in the plan in a fresh session. That’s breaking the software development life cycle into chunks that can fit in the agent’s context.
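I do these steps in interactive sessions, but the shape of the pipeline can be sketched with Claude Code’s print mode, where claude -p runs a single prompt in a fresh context and exits. The file names here are illustrative:

```
# Illustrative sketch of the PRD -> plan -> implement pipeline.
# Each invocation is a fresh context; the markdown files are the handoff.
claude -p "Read docs/prd.md and write an implementation plan with a
  task checklist to docs/plan.md. Do not write any code yet."
claude -p "Read docs/plan.md, implement the first unchecked task,
  run the tests, and check the task off in docs/plan.md."
claude -p "Review the current changeset against docs/prd.md. Assume
  every change is wrong and work backwards. Write findings to docs/review.md."
```

The fresh context per step is the point: each session gets only the artifact it needs, not the accumulated conversation.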

Verification is one of the hardest parts of this process. If the changeset is too large, it is going to be pretty hard to verify and I have to back up and create a PRD or plan with a narrower scope. Another technique I use is to separate testing and implementation in the plan. I can then have one session write the tests and then a second session do the implementation that makes the tests pass. Separating the implementation and testing and having the agents actually do TDD creates guardrails for the agent and this enables the agent to work for a longer period of time without your intervention.

Continuing on the problem of verification, the next thing I’ll do is have Claude check its work. In a new session I’ll ask it to review the current changeset. Since it is a new session, you are working with a clean context; it doesn’t know why it made a particular change, because that was a totally separate instance. It is the movie Memento and Guy Pearce’s memory has just reset. This is all very weird.

You can provide Claude with various lenses through which to conduct its review. My two favorite lenses: first, a skeptical review, where I instruct Claude to assume everything in the PR is wrong and work backwards from that assumption; second, a prior-art review, based on the idea that most of our work involves copying pre-existing patterns. I’ll ask Claude to review the code and find prior art for all the changes. Any change without prior art is highly suspicious.

Then I’ll have Claude aggregate the results of each of these review lenses and do a meta review. While this is typically a good starting point, it does not guarantee anything. The most rigorous verification I do when it comes to review is ask Claude to present the code changes to me one hunk at a time, essentially mimicking the GitHub PR workflow. For each hunk I have Claude provide an explanation, and then I’ll either ask a question about it, ask for a change, or move on to the next hunk.

Another problem is how can I prevent my agent from making the same mistake twice. At the end of a chat at any of these stages, especially one that didn’t really go as well as I would have liked, I’ll ask the model how the chat could have gone better. Sometimes this will result in an update to CLAUDE.md or a new skill or command.

The full cycle is plan, implement, review. I started running this on a loop where I’d feed the review back into the planning step, generate a new implementation step, and then review again, and repeat.

Then I started thinking about running multiple pieces of work in parallel. There is a whole new set of orchestration problems that I ran into when running multiple agents. This is where my agent-team repository comes into play and this is why I launch my Claude Code sessions from that directory.

The first problem I ran into when handling multiple pieces of work simultaneously is keeping track of the PRDs and plans. I mentioned that markdown as a tracker breaks down at a certain point unless you can ensure that Claude consistently keeps it up to date. To solve this problem I use two tracking mechanisms. In my agent-team repository I have beads and briefs. A brief is a PRD, implementation plan, and tracker all in one. Beads are JIRAs for agents. They are synced via git and can be organized hierarchically with tasks, subtasks, and dependencies. This is Steve Yegge’s project and it is very solid and also insane.

Another problem when handling multiple pieces of work is that the agents sometimes battle each other and start committing each other’s work. That’s bad. The solution I’ve landed on is to use one devbox per piece of work. On smaller personal projects I’ve used git worktrees, but devboxes are such an incredible piece of infrastructure that it seems silly not to use them for this.

Going back to my agent team, what problem does this solve? The agent-team repo has individual agent definitions in the .claude/agents directory. The agent definitions describe different roles I want the agents to take on as they work with me. The main role is the tech lead, which solves the problem of keeping track of multiple agents. It is my orchestrator and it communicates with me.

I have scouts, researchers, and pathfinders for figuring out how things work in our code. I have an implementer and tester for generating and verifying code. And I have a standard reviewer, skeptical reviewer, and pedantic reviewer for looking over the team’s work.

I probably got carried away with the anthropomorphization of these agents and I might pare that back in the future. I am skeptical that it is important to describe what makes a good tech lead in an agent definition. I think the utility here goes back to context management. Each agent has its own context window. By having a set of agents I can call out to, I can divide work into appropriately sized tasks that stay within the context window. This is a separation of concerns, just like in any type of system design. The researcher can specialize in finding documentation and files, and then report the result to the tech lead, who can then write a plan with that information. The plan can be handed off to a tester and implementer and then to a team of reviewers.
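For reference, a Claude Code agent definition is a markdown file with YAML frontmatter, stored in .claude/agents. A stripped-down sketch of something like my skeptical reviewer might look like this (the name and wording are illustrative, not my actual definition):

```
---
name: skeptical-reviewer
description: Reviews a changeset assuming every change is wrong until proven otherwise.
tools: Read, Grep, Glob
---
You are a skeptical code reviewer. Assume every hunk in the current
changeset is incorrect and work backwards: what would have to be true
for this change to be right? Look for prior art in the surrounding
codebase and flag any change that has none. Report your findings one
hunk at a time, ordered by severity.
```

Note the restricted tool list: a reviewer that can only read and search cannot accidentally commit over another agent’s work.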

Another problem is that no matter how hard I try to manage context, I always fail. The solution I’ve landed on for now is to hand off work to a new team when I’m at a good stopping point. What is a good stopping point? I don’t actually know. But when I’m at one - ideally before I get to two thirds of the context window, which you can check with the /context command - I’ll run a command that has the agent update the brief with the current state of things. Then it outputs a handoff prompt. I copy the prompt, which has the filepath to the brief and some notes on how to get started. I can then, somewhat guiltily, close the session, type lfg to start a new session, and paste the prompt in for my new, fresh tech lead.

That’s the end of the walkthrough portion of this talk. These are the problems I’ve encountered and the solutions I’ve landed on. There are other problems I’ve encountered, I just haven’t found good solutions yet. These problems include having a backlog of agent-ready work so I can just constantly have these things running. It’s not that I don’t have an endless amount of work, it’s that I have to figure out how to get that work in a shape that is ready to be handed to my agent team. I suspect that we are going to start having meetings where instead of commenting on docs, we are going to be commenting on and workshopping prompts for our agents.

Another problem I don’t have a good solution for yet is how do we keep the agents running for long periods of time. I have gotten research agents to run for over an hour without stopping and producing useful results. But to get agents to run for long periods of time they need ways to automatically manage context and perhaps more importantly verify their progress and results.

I want to close by talking about where I think this is going. I think increasingly our jobs will not be about writing code, but about creating systems that enable the agents to write code. Those systems, I think, will continue to better address some of the problems I’ve tried to solve, like context management, setup scripts, and verification processes. We have to solve these problems because if we fast forward a few weeks or months from now and we have thousands of developers, each with dozens of agents, all writing code at once, something is going to go wrong. On top of that, we want to be the fastest company, to grow the GDP of the internet, and to build trust in the financial infrastructure that powers that growth. All of that requires us to continuously improve our practice, and that means engaging with the LLMs.

All of this is scary and weird. But it is also exciting. I think we all feel the buzz. So manage your context. And enjoy this because it is the best time to be a developer.