Harness engineering: Leveraging Codex in an agent-first world

(openai.com)

98 points | by pramodbiligiri 1 day ago

25 comments

thelucent 1 minute ago
This might work only if you have “infinite” compute and infinite tokens.
As someone who used the $20 plan, I find this purely agentic approach impossible: I’d hit the limit fast and end up with less to show for it.
What I found works incredibly well is to provide human-written code as a reference and ask it to extend it. So I scaffold the entire thing, architect it, and write a few code samples (controllers, services, models, components, database schema, how auth works, etc.) so the LLM gets a head start on its attention (pun intended).
I usually write a stub with a lot of detail on how to implement it, something like higher-abstraction pseudocode, then ask the LLM to implement it.
When it fails, it is often better to undo all the changes, adjust the stub so it covers what failed before, and try again.
Or commit the changes, start a fresh context, and address only what went wrong.
-
Whenever I try this agentic from-scratch approach, I always end up disappointed, both with the outcome and with the limit I hit before an hour has even passed.
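For illustration, the stub-first workflow described above might look like the following Python sketch. Everything in it (the `Order` model, the `apply_discount` name, the coupon rules in the docstring) is hypothetical and not taken from the comment; it only shows the shape of a human-written signature plus detailed pseudocode, with the body left for the LLM to fill in.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Order:
    """Hypothetical hand-written model that serves as reference code."""
    subtotal_cents: int
    coupon_code: Optional[str] = None


def apply_discount(order: Order) -> int:
    """Return the order total in cents after any coupon discount.

    Implementation notes for the LLM (higher-abstraction pseudocode):
      1. If order.coupon_code is None, return order.subtotal_cents unchanged.
      2. Look the code up in a COUPONS mapping of code -> percent off.
      3. Unknown codes raise ValueError; do not silently ignore them.
      4. Round the discounted total down to a whole cent; never go below 0.
    """
    # Deliberately unimplemented: the stub is the prompt.
    raise NotImplementedError("stub: implement the steps in the docstring")
```

If the generated body misses a case, the retry described above amounts to adding a numbered step to the docstring and regenerating from a clean state.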
drivebyhooting 7 minutes ago
I wish these breathless blog posts would actually try to be more didactic.
For example, actually doing a walkthrough of how to set up these allegedly super powered workflows and concrete demonstrations.
I’m not an AI skeptic. Rather, I don’t want to miss out on any actual superpowers.
yurimo 55 minutes ago
What I still can't understand is why a massive amount of generated code is a flex. I don't feel that software has gotten a lot better in the past 3 years, only sloppier. It's surprising to me that people who know about reward hacking choose a simple objective like lines of code generated as a signal for quality. I'd argue you have to optimize for as few lines generated as possible, while the secondary optimization should be readability for humans. I suspect it's not seen as a problem by providers because more lines generated means more tokens used and hence more billed to customers.
And if I am working on an existing codebase, then isn't a good commit often a negative sum between added and removed lines? I don't want to bloat my codebase but make it more polished and elegant. After reading that, I wonder if what they have done could have been accomplished on a far smaller LoC budget.
- crdrost 10 minutes ago
  Yeah, so all of personal computing (text editing, SVG antialiasing, etc.) fits in 20,000 LOC (VPRI's STEPS project), so a million lines of code is 50 reïnventions of personal computing. BUT: it is unlikely that humans would have solved this problem in 20 kLOC. Sussman said “we really don't know how to compute!” as his talk title, and LLMs had to ossify some pre-existing voice as the forever programming habitus, and it chose a persona that doesn't know how to program (because we don't), and now we are stuck with it. Claude is our tickets, our implementations, our documentation... And if you tell it “hey the node role should not have those permissions, that should be a service account” it will happily do the right thing, but it has no intrinsic sense of taste, and the error message it's trying to clear just says “the node role doesn't have that permission”, and the system prompt says “keep it short, stupid”, and graybeards might be our last bulwark.
- dumbdumb125 14 minutes ago
  It's a huge flex if the alternative is no code at all. Reward hacking aside, LOC resonates with me in the sense that I've seen 10+ projects to fruition that wouldn't have even begun without an agentic harness and an LLM.
  It's like the difference between doing stock price predictions with binary "up" or "down" histories and trying to figure out how to normalize actual price histories (basically impossible). The binary work gives a well-defined signal.
Frannky 6 minutes ago
I started using ChatGPT for functions and checking, then for single-file changes and checking, now for multiple changes and checking. I am at a point where the only changes I correct are architectural. So it may become smarter to learn to watch only the architectural direction while multiple agents work, test, and commit, both against unit tests and against live deployment.
bko 3 hours ago
> We had weeks to ship what ended up being a million lines of code... Five months later, the repository contains on the order of a million lines of code across application logic, infrastructure, tooling, documentation, and internal developer utilities. Over that period, roughly 1,500 pull requests have been opened and merged with a small team of just three engineers driving Codex. This translates to an average throughput of 3.5 PRs per engineer per day, and surprisingly the throughput has increased as the team has grown to now seven engineers. Importantly, this wasn’t output for output’s sake: the product has been used by hundreds of users internally, including daily internal power users.
That's an insane level of throughput. What's a good baseline? Prior to agentic coding, what was the typical number of PRs engineers were expected to push? Maybe 2-10?
Do people feel the software has gotten better in the last 6 months? The number of engineers is probably the same, so we should expect maybe a 5x faster cycle in major software apps, but I don't see it. The AI apps do change very fast, but given it's a very new field, I'd expect as much. But outside of that, I don't see it.
- torben-friis 2 hours ago
  Here's a fun one: Firefox lists its current count at about 2.5M LOC, from roughly 1M commits over the years.
  You end up with about 3 lines added per commit, which is not ridiculous when you consider that most would be edits rather than full additions.
  Here, we have 1500 PRs and 1M LOC, which is about 650 added LOC per PR. Remember, not 650 lines total in the PR, but +650 balance after additions-removals.
  Fun questions for attentive readers:
  - What does a project growing at a rate of one full firefox-codebase worth of LOC per year look like, a decade down the line?
  - What does the line count say about the verbosity of the tool, and what does it say about outcomes that the purpose of the project isn't clearly disclosed?
  - Do we have reasons to care about LOC in a world where we don't write code manually? What happens to token usage numbers when the codebase is significantly larger?
  - If it was confirmed that LLM usage blows up your line count, what's the implication for codebases that want to return to manual coding after months of usage? (Say, because the tool gets expensive).
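The back-of-the-envelope numbers in this comment can be reproduced with a few lines of Python. The inputs are the rounded figures quoted in the thread (2.5M LOC over about 1M commits for Firefox; about 1M LOC over 1,500 PRs in five months for the article's repository), so the results are only as precise as those figures; the function name is made up for this sketch.

```python
def net_lines_per_change(total_loc: int, changes: int) -> float:
    """Average net lines of code added per commit or PR."""
    return total_loc / changes


# Rounded figures as quoted in the thread, not measured values.
firefox = net_lines_per_change(2_500_000, 1_000_000)
article_repo = net_lines_per_change(1_000_000, 1_500)

# Extrapolate five months of growth to a full year.
loc_per_year = 1_000_000 * 12 / 5

print(f"Firefox: {firefox:.1f} net LOC per commit")         # Firefox: 2.5 net LOC per commit
print(f"Article repo: {article_repo:.0f} net LOC per PR")   # Article repo: 667 net LOC per PR
print(f"Ratio: {article_repo / firefox:.0f}x")              # Ratio: 267x
print(f"Growth: {loc_per_year:,.0f} LOC per year")          # Growth: 2,400,000 LOC per year
```

The exact quotient is closer to 667 than to the quoted 650, and the yearly extrapolation of 2.4M LOC is what makes the "one full Firefox codebase per year" comparison hold.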
  - therealdrag0 12 minutes ago
    Does the Firefox LOC include ALL forms of text: infrastructure (which Firefox doesn’t have), documentation, developer scripts, tests, etc.? How is the test coverage of Firefox?
  - CleanCoder 1 hour ago
    When I got to the 1M LOC I involuntarily paused feeling like this must be satire.
- krackers 2 hours ago
  They never specified what exactly the product was, without which it's impossible to judge the post.
  For some reason most of the uses of "agents" are to build yet other AI products, it's turtles all the way down. Maybe that says more about the field of harnesses than it does about the power of "agents".
  - theptip 43 minutes ago
    There is a sense in which it doesn’t matter at all; many of the limitations of agents in large codebases are just the context management challenges. So proving that you can cohere and progress at O(1m) is a useful scale observation. “Can I use agents in my 1m line codebase?”
    There is of course another sense in which the output quality is the only thing that matters. “Can I use agents to build a 1m line codebase that I want to maintain going forward.”
    I take this as being exclusively a tech demo of the former. Quality (feature velocity, bugs, scalability) is not demonstrated.
  - becomevocal 2 hours ago
    Feels like the active discovery going on is trying to understand what is computer vs what is AI, for every product.
    Agents help a ton with the discovery, but the act of building a product needs a deeper level of thought and validation to make it actually better than what came before. So IMO what you see is people still learning what needs to be understood and crafted first hand to make a product better (including economics)
    We’ll get there if more of us try
- techblueberry 30 minutes ago
  I’ve been vibe coding a lot over the past year or so, and I think I’m going to stop. In fact, I sort of want to challenge myself to see whether I can go back to the fork in the road with the old Copilot autocomplete workflow and really maximize that. Be in the driver's seat for most of the code being written, but find ways to use AI to really enhance the flow state and remove blockers. Tools only, minimal actual code generation.
- jakolaptu 1 hour ago
  It is likely better because AI agents make access to domain knowledge easier. However, I would wager that the problem is people don’t remember the code well. The problems are going to be long-term as the pace of change increases.
  If you think about it, successful products rely on designing well-thought-out experiences and on customer discovery (see all the Forward-Deployed Engineer job listings at OpenAI), so code velocity becomes somewhat irrelevant.
  If you’re solving the right problem and you’ve got a good team then competitive advantage comes from somewhere OUTSIDE of code velocity.
  The more important question I think is does faster code yield more value long-term? At the moment, it’s like yeah we do 3.5 pull requests per day.
  I’m thinking, great, good for you. You could also combine three pull requests into one and then you’re doing 1 per day. This is quantitative data that doesn’t really mean anything tangible.
- aleqs 2 hours ago
  > should expect maybe 5x faster cycle in major software apps
  To what end and what would that even look like though? Enshittifying everything at maximum speed? The apps/platforms I use regularly - GitHub, Spotify, Google maps (just to name a few), have gotten noticeably shittier in recent times.
  - linsomniac 1 hour ago
    >GitHub, Spotify, Google maps (just to name a few), have gotten noticeably shittier in recent times.
    What if AI lets you create new versions of those tools, but without the enshitification?
    I say that being in the "soaking" stage of using AI to rebuild a shitty software project in 70KLOC over about 2 weeks of spare time, so this may not be as theoretical as you might think.
    - aleqs 29 minutes ago
      Oh I definitely agree that AI can and will help create great software.
      It's just that creating great software isn't really the SV/VC/big tech business model or main goal.
    - NoraCodes 33 minutes ago
      > What if AI lets you create new versions of those tools, but without the enshitification?
      I'm not sure I fully understand what you're saying here. Isn't the value of these tools almost entirely independent of their actual software? That is, we have many good open source, self-hostable forges (Forgejo, sr.ht, etc.), lots of great music player software (Jellyfin, Symphonium, etc.), and decent maps software (OsmAnd and Organic Maps). People use GitHub, Spotify, and Google Maps -- perhaps even _put up_ with their often bad/glitchy software -- because of network effects (all three) and content/licensing partnerships (Spotify/GMaps). That proprietary data isn't something AI can help you with, right?
      - linsomniac 21 minutes ago
        It really depends on the use-case. For example, my most starred github repo is a tool to convert Spotify playlists to YouTube Music (that was done pre-AI). Github depends on what issues you have with it, what your use case is, and whether you can leverage some of the network effects via API from the github source. Maps, same story.
      - nostrademons 31 minutes ago
        AI coders are great for making scrapers, possibly because AI companies use their own tools to make an awful lot of scrapers.
- dchftcs 45 minutes ago
  This is a lot tamer than what Claude Code's team claims tbf.
- Aperocky 2 hours ago
  > ended up being a million lines of code
  This almost reeks of "I've never cleaned up our code base because there is too much code, and didn't even bother having agents/LLMs clean it up".
  You almost never need a million lines of code, and that includes your software, infra, testing and operational tools. You didn't ship the Linux kernel in 3 weeks and you know it. The code is already spaghetti; it achieves the basic functions OK, but it will get harder and harder to simplify, untangle and maintain.
  - bombcar 2 hours ago
    Even the linux kernel doesn't need millions of lines of code; most of the actual LOC is device drivers, and you don't need all of them, you just need the ones for the devices you have.
    - Chu4eeno 1 hour ago
      And Linux maintainers are actively pushing to radically cut down on the LOC by eliminating drivers etc.
  - girvo 2 hours ago
    Yeah I cannot see how "we shipped 1 million lines of code in three weeks" is... something to be proud of haha
  - faustin 52 minutes ago
    They directly address routine code cleanup and regularly paying down technical debt near the end of the article.
varenc 2 hours ago
digression:
It's interesting this was submitted to HN over 15 times since it was published in February: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
But this is the only submission that's had any traction. Since the content is nearly the same for all submissions, it highlights how getting to the front page can be a bit random. (Though this is the only one that capitalized 'Leveraging', so maybe that's the secret.)
murat124 2 hours ago
The other day I came across a video showing workers in an e-vape factory. They pick up a bunch of e-vapes from the conveyor belt (six per bunch, I think), stick them in their mouths and vigorously vape all of them for about 5 seconds, then test the next bunch. Humans reviewing hundreds of lines of change in a PR written by AI is not very different.
- prakashn27 10 minutes ago
  Very true. If a PR has 1000 lines, I would check only a handful of them and leave the rest to the test suite.
zatkin 1 hour ago
I worry most about blindspots with this kind of approach. Let's say that this repository goes on for years, at which point the docs folder is several MB in size. Would Codex be able to think outside of the box? Or would the aggregate of the Markdown content fundamentally cover enough ground to prevent it from thinking of novel new approaches to existing problems?
- therealdrag0 9 minutes ago
  It’s not a self-coding machine. There is a human in the loop; they even added MORE engineers to the team for this project! 7 engineers should be able to collaborate with the AI to find good solutions to problems.
- esikich 1 hour ago
  You tell it to update the docs, not append. I've done the same thing with a readme in the root with links to the docs. After every commit, before the push, I have my agent "update all relevant and related docs, add or remove what's needed" or something to that effect. And it works remarkably well. I also have an append-only change log it's supposed to add to. Between that, good commit messages, and comprehensive testing, I've built a homebrew OS, and updating it is remarkably smooth. It runs a homebrew FTP and HTTP server and can run Wolfenstein. Working on DOOM right now. Close, but sound has been difficult.
  https://github.com/ESikich/smallos
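The comment does not say how the append-only rule for the change log is enforced. As one possible sketch, assuming the log is a plain text file whose old and new contents are both available before a push, a prefix check in Python is enough to catch an agent rewriting earlier entries. The function name and the sample entries below are invented for illustration.

```python
def is_append_only(old_text: str, new_text: str) -> bool:
    """Return True if new_text keeps all of old_text intact at the start."""
    return new_text.startswith(old_text)


# Toy data: a previous log, a valid appended version, and a rewritten one.
before = "entry 1: initial boot loader\n"
appended = before + "entry 2: add FTP server\n"
rewritten = "entry 1: boot loader (edited)\nentry 2: add FTP server\n"

print(is_append_only(before, appended))   # True
print(is_append_only(before, rewritten))  # False
```

A check like this could run alongside the "update the docs" step, so that regular docs stay freely editable while the log alone is held to the stricter rule.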
  - vibcdingenjoyer 36 minutes ago
    Yep. You’ve got to have it update the docs. After a few sessions, if I forget to request this, opus starts rehashing the same tasks and finds that they are complete - and sometimes still won’t update those docs unless I ask.
    Another tip is to condense the doc files into the minimal required. Sometimes I’ll end up with 5 to 6 floating around in various states of staleness. Condensing to 2-3 and removing completed tasks seems to help a lot
shepherdjerred 1 hour ago
This mirrors exactly what I have been doing.
- Give Claude/Codex a way to verify its own work (browser, smoke tests, e2e tests, high-fidelity local environment)
- Keep all context (issue tracking, docs, ideas, plans, worklogs) in-repo (https://github.com/shepherdjerred/monorepo/tree/main/package...)
- Give Claude/Codex access to observability (Grafana, Prometheus, Tempo, PagerDuty)
- Have Claude/Codex follow good engineering guidelines like fail-fast, type safety, parse at boundaries
I haven't yet been able to achieve full autonomy due to cost and CI load on my homelab.
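To make the "fail-fast, type safety, parse at boundaries" guideline from the list above concrete, here is a minimal Python sketch. The `Alert` type and its fields are hypothetical and not taken from the linked repository; the idea is that untrusted input is validated once where it enters the system and converted into a typed value, so code further in never handles raw dictionaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical typed value used everywhere past the boundary."""
    name: str
    severity: int  # 1 (low) to 5 (critical)


def parse_alert(raw: dict) -> Alert:
    """Parse untrusted input once, at the boundary, and fail fast on bad data."""
    name = raw.get("name")
    severity = raw.get("severity")
    if not isinstance(name, str) or not name:
        raise ValueError(f"name must be a non-empty string, got {name!r}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(severity, bool) or not isinstance(severity, int):
        raise ValueError(f"severity must be an integer, got {severity!r}")
    if not 1 <= severity <= 5:
        raise ValueError(f"severity must be between 1 and 5, got {severity}")
    return Alert(name=name, severity=severity)


alert = parse_alert({"name": "disk-full", "severity": 4})
print(alert)  # Alert(name='disk-full', severity=4)
```

An agent that has to satisfy this kind of boundary gets an immediate, specific error to act on instead of a failure several calls later.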
- para_parolu 50 minutes ago
  Does it yield good results? I found that instead of docs it’s easier just to ask the AI to read the code. I feel like docs are the same as comments in code: they become outdated fast.
  - shepherdjerred 42 minutes ago
    I don't really use "docs" for documentation. I've prompted Claude/Codex to always write a "log" and save it in-repo to track what it did and why.
    I've found this to be really helpful, e.g. "you did this last week, and now some other thing is happening" or "you tried this approach before to solve alert X but it didn't work" -- except it can discover this itself.
    https://github.com/shepherdjerred/monorepo/tree/main/package...
    I've also used it to store TODOs and plans. For example I might want to explore some idea and defer it for later, or some weekend have it execute on some tech debt I've put off. One last use case is asking "what did I work on in the last 2-3 weeks, is it healthy, and what additional quality checks can/should I do; is there any follow-up work?"
faangguyindia 2 hours ago
Codex updates usually appear every few hours (I am not saying this is how often they're published, but that's my perception as a user). Often I update Codex just to see a new update within an hour or so.
Many times those updates are not properly tested. For example, in one update the model selector got completely changed.
Then the next hotfix was pushed, which restored the original.
- dawnerd 2 hours ago
  Who needs a QA team when you can just test on users and iterate instantly /s
andai 1 hour ago
> To drive a PR to completion, we instruct Codex to review its own changes locally, request additional specific agent reviews both locally and in the cloud, respond to any human or agent given feedback, and iterate in a loop until all agent reviewers are satisfied (effectively this is a Ralph Wiggum Loop ).
https://ghuntley.com/loop/
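The review loop in the quoted passage can be sketched in a few lines of Python. This is not Codex's implementation: the reviewer and revise callables are stand-ins for agent or human reviewers, all names are hypothetical, and the round limit is an added assumption so the loop cannot run forever.

```python
from typing import Callable, List, Optional, Tuple

# A reviewer returns None when satisfied, or a feedback string otherwise.
Reviewer = Callable[[str], Optional[str]]


def iterate_until_approved(
    change: str,
    reviewers: List[Reviewer],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 10,
) -> Tuple[str, int]:
    """Revise a change until every reviewer is satisfied or rounds run out."""
    for round_number in range(1, max_rounds + 1):
        feedback = [f for f in (review(change) for review in reviewers) if f is not None]
        if not feedback:
            return change, round_number
        change = revise(change, feedback)
    raise RuntimeError(f"reviewers still unsatisfied after {max_rounds} rounds")


# Toy reviewers: each one asks for a keyword it does not find in the change.
wants_tests: Reviewer = lambda c: None if "tests" in c else "tests"
wants_docs: Reviewer = lambda c: None if "docs" in c else "docs"
add_requested = lambda c, feedback: c + " " + " ".join(feedback)

result = iterate_until_approved("feature", [wants_tests, wants_docs], add_requested)
print(result)  # ('feature tests docs', 2)
```

Capping the rounds matters in practice for the token-limit concerns raised elsewhere in this thread, since an unbounded loop spends budget whenever reviewers disagree.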
charintstr 30 minutes ago
I am at a major company that is essentially vibe coding. I’ve shipped about 100k LoC this entire half and am toward top 10% of my team. I find it likely that either
A. The code is absolute garbage and is speed for speed’s sake
B. They’re using an internal model that is a generation beyond GPT 5.5
I say this because we’ve attempted to do something similar using the latest gen Claude models and a significantly larger team. The code is probably along the lines of millions LoC but is an absolute mess because of vibing. There’s a price you pay for speed
- therealdrag0 6 minutes ago
  I like how they said they were spending 20% of their time addressing slop. Sounds like they’ve tried to automate the slop correction but it’s a good honest reminder.
  Additionally it’s an internal tool, which is likely much more amenable to slop.
Aperocky 1 hour ago
1 million lines of code aside, I feel like anyone who seriously thought about this would eventually run their own harness.
Just like .vimrc and .zshrc, the harness "code" itself can be easy and personal, provided that it's built on a working, existing construct such as tmux.
angrydev 2 hours ago
Published Feb 11, 2026
- ukuina 57 minutes ago
  Might as well be 2025.
darepublic 2 hours ago
Codex pushed an update that made my old threads inaccessible. It takes a million lines to put out a half-baked CRUD app?
jonmoore 1 hour ago
This would be much more convincing if the repos, issue trackers, etc. were accessible.
bronny1989 1 hour ago
why do you have “weeks” to ship what would take “months”?
- andai 1 hour ago
  > We intentionally chose this constraint so we would build what was necessary to increase engineering velocity by orders of magnitude.
  - robotresearcher 1 hour ago
    I guess orders of magnitude ain’t what they used to be.
rfw300 2 hours ago
I understand that they’ve written zero lines of code for this application, but would it kill them to write a few lines of the blog post by hand?
Forcing readers to wade through an unceasing string of LLM clichés demonstrates the opposite of the point you’re trying to make—that the consumers of your work are worse off because you exercised no human judgment in creating it.
drchaim 2 hours ago
But this is almost what we have been doing for the last 3-5 months, isn’t it?
- wilsonnb3 2 hours ago
  Article is from February so that tracks
- fbrncci 2 hours ago
  Well to a lot of people this is still a foreign concept.
Sarkie 2 hours ago
I would never dare put that in production
EnPissant 38 minutes ago
> Over the past five months, our team has been running an experiment: building and shipping an internal beta of a software product with 0 lines of manually-written code.
This is such a common thing among software engineers nowadays that I was very surprised that OpenAI would open with that line as if it were mind blowing.
But then I saw it was published in February and OP is just reposting it to farm karma.
apical_dendrite 1 hour ago
I wonder why we as engineers aren't protesting AI in the same way that artists and people in film and television are. This post should instill the same terror that visual artists feel.
If you're a more senior person in tech, this post is effectively saying that a large portion of your skillset is about to become completely worthless. This goes beyond the skills involved in writing the code. Everything that you've learned over years about how to determine whether code is good or bad, and what practices make an engineering team effective is not just obsolete, it's fundamentally counter-productive because it assumes a slow, human-centric process that requires you to actually review and understand the code. Even your ability to mentor junior engineers is now obsolete, because all that experience you've built up is now worthless to them.
If this is the approach the industry takes, particularly when combined with a lack of interest in quality from the business (and let's face it, consumers have shown us that they're happy to pay for cheap crap), it's hard to see much of a future for software engineers. You don't need thousands of people with deep technical expertise, you need a handful of manager-types, who will focus on defining product and business requirements and configuring how the AI gets enough context to implement the requirements.
Maybe, if we're extremely lucky, there's so much demand for software that total employment doesn't fall off a cliff, but the nature of the work will change so much that many older, more expensive engineers will become unemployable. Those who remain will have to accept that the skills they spent decades developing are now worthless, that younger engineers no longer respect or listen to them, and that the business no longer sees them as experts worthy of respect but as old fogies who grew up in a different world.
Joe Biden liked to say that a job is more than just a paycheck, it's part of your identity and your sense of self-worth. We're all very used to a certain level of respect (and commensurate remuneration). If you don't think that's true, compare how a software engineer is treated to how a warehouse worker is treated. What happens when we lose that?
- briHass 27 minutes ago
  It's the other way around, unfortunately. The senior engineers will still be useful for architecture and infrastructure considerations, as well as guiding the agents. It's the junior engineers that get nailed, because there's little incentive to hire one when a LLM does a better job immediately and costs less.
- linsomniac 56 minutes ago
  >a large portion of your skillset is about to become completely worthless
  I'm not convinced of that.
  I watched a video of an architect using AI to create architectural drawings. It became very clear to me that he has a lot of skills and terminology that helped him produce something very specific, in a few minutes. I've been working on some home improvement stuff including a studio/shed and I've struggled to produce even something simple (currently trying to get a conversation packet on the roof trusses to take to the permit department to get started). Even with my high school architecture class.
  After watching that, I wonder how much of what I'm doing with AI that looks easy is because I have deep technical knowledge, plus 3 years of heavy work with AI.
knicholes 2 hours ago
Everyone is criticizing the number of lines of code and the lack of attention that must certainly have been applied to generate that code and push it into production. What is being ignored is this awesome prompt that is almost certainly better than having no agents.md or plans.md or whatever you've come up with, to add validation steps for committed changes. You're still free to look at your code, the changes, and ask the agent to clean up. Try it. It's really nice.