Time is my limiting factor, especially on personal projects. To me, this makes any multiplying effect valuable.
When I consider it against my other hobbies, $100 is pretty reasonable for a month of supply. That being said, I wouldn’t do it every month. Just the months I need it.
The same thing can be said about running Opus's output through Opus again.
It's often not that a different model is better (well, it still has to be a good model). It's that the different chat has a different objective - and will identify different things.
My (admittedly one person's anecdotal) experience has been that when I ask Codex and Claude to each make a plan/fix and then ask them both to review the results, they both agree that Codex's version is better quality. This is on a 140K LOC codebase with an unreasonable amount of time spent on rules (lint, format, commit, etc.), on specifying coding patterns, on documenting per-workspace README.md files, etc.
That's a fair point and yet I deeply believe Codex is better here. After finishing a big task, I used two fresh instances of Claude and Codex to review it. Codex finds more issues in ~9 out of 10 cases.
While I prefer the way Claude speaks and writes code, there is no doubt that whatever Codex does is more thorough.
Every time Claude Code finishes a task, I have it do a full review of its own work against a very detailed plan, and it catches many things it didn’t see before. It works well and it’s part of the refinement process. We all know it’s almost never a 100% hit on the first try with big chunks of generated code.
If your customer base is so broad that you can't define a clear outcome for your niche, your company probably isn't focused enough. Especially for a startup.
Take, for instance, a customer support agent that is supposed to resolve tickets. Assume it resolves around 30% of tickets by an objective measure. Do you think that cannot be captured and agreed upon by both sides?
Already, today, human customer support agents' performance is measured in ticket resolution, and the Goodhart's Law consequences of that are trivially visible to anyone who's ever tried to get a ticket actually resolved, as opposed to simply marked "resolved" in a ticketing system somewhere…
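To make that gap concrete, here's a toy sketch (every field and name here is hypothetical, not any vendor's actual schema) of the difference between the metric that gets reported and the outcome the customer experiences:

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    marked_resolved: bool      # status the agent/vendor sets
    customer_confirmed: bool   # did the customer agree it was fixed?
    reopened_within_30d: bool  # did it come straight back?

def billed_resolution_rate(tickets: list[Ticket]) -> float:
    """What the vendor is likely to report (and bill on)."""
    return sum(t.marked_resolved for t in tickets) / len(tickets)

def actual_resolution_rate(tickets: list[Ticket]) -> float:
    """What the buyer actually cares about."""
    good = [t for t in tickets
            if t.marked_resolved and t.customer_confirmed and not t.reopened_within_30d]
    return len(good) / len(tickets)

tickets = [
    Ticket(True, True, False),   # genuinely resolved
    Ticket(True, False, False),  # marked resolved, customer disagrees
    Ticket(True, False, True),   # marked resolved, came right back
]
print(billed_resolution_rate(tickets))  # 1.0
print(actual_resolution_rate(tickets))  # ~0.33
```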
Interesting. Let's take the case of infra spend on AWS. Amazon says you made 100k serverless invocations and charges you for them. How do you trust them on that?
The comparison doesn't quite hold because AWS is a utility; they aren't an arbiter of quality. Amazon charges for a serverless call regardless of whether your code worked or crashed. You pay for the effort (compute), which is verifiable and binary.
Once you shift to billing for outcomes like "resolutions," the vendor switches from a utility provider to the judge and jury of their own performance. At scale, that creates a "fox guarding the henhouse" dynamic. The friction of auditing those outcomes to ensure they aren't just Goodharted metrics eventually offsets the simplicity the model promises. Frankly, I just cannot and will not trust the judgment of tech companies who evangelize their own LLM outputs.
How do you verify AWS charges? By inspecting logs? There goes the arbiter.
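For what it's worth, the mechanics of the check are easy; the trust problem is that the numbers still come from AWS. A minimal boto3 sketch (function name and date range are made up) that pulls CloudWatch's own invocation count to compare against the bill:

```python
# Cross-check a Lambda bill against CloudWatch's invocation metric.
# Note the catch: this "independent" count is still produced by AWS.
from datetime import datetime, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Invocations",
    Dimensions=[{"Name": "FunctionName", "Value": "my-serverless-fn"}],  # hypothetical
    StartTime=datetime(2024, 5, 1, tzinfo=timezone.utc),
    EndTime=datetime(2024, 6, 1, tzinfo=timezone.utc),
    Period=24 * 3600,  # daily buckets
    Statistics=["Sum"],
)

invocations = sum(dp["Sum"] for dp in resp["Datapoints"])
print(f"CloudWatch reports ~{int(invocations)} invocations for the billing period")
```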
I get the binary part. So the biggest difference is the subjective component of the outcome? However, a tech provider - especially an agent provider - has to bring the subjective down to a quantitative metric when selling. If that cannot be done, I am not sure what we are going to be buying from agent builders/providers.
I hate to be so negative, but one of the biggest problems junior engineers face is that they don't know how to make sense of or prioritize the glut of new-to-them information to make decisions. It's not helpful to have an AI reduce the search space, because they still can't narrow down the last step effectively (or possibly independently).
There are junior engineers who seem to inherently have this skill. They might still be poor at finding all the necessary information, but when they do, they can make the final, critical decision. Now, with AI, they've largely eliminated the search problem, so they can focus more on decision making.
The problem is that it's extremely hard to identify who is which type. It's also a skill that senior-level devs have generally figured out.
So, one of the main reasons it needs to look like a truck is that it needs a truck-like structure to be compatible with basically all of the aftermarket parts.
I want a truck with flat bed rails so I can put a cap on it. It needs to have a proper frame under the bed so it’s not bending with point loads.
I need a bed that’s a separate piece from the cab so there’s flex for uneven grades.
The difference is what actually powers the wheels. A hybrid is still primarily ICE. An EREV is electric motors (with the ICE just charging the batteries).
I literally couldn’t think of a better truck than an EREV. Give me an ICE engine that can haul my trailer into the boondocks, knowing I just need a gas station nearby, but that can also power my trailer off the battery.
A compromised laptop should always be treated as fully compromised. However, you can take steps that drastically reduce the likelihood of bad things happening before you can react (e.g. disable accounts/rotate keys).
Further, you can take actions that inherently limit a compromise's ability to actually cause impact. Not needing to store certain things on the machine at all is a great start.
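As one concrete example of the "rotate keys" step, for AWS access keys a sketch might look like this (the IAM user name is hypothetical, and issuing replacements would happen from a trusted machine afterwards):

```python
# Deactivate every AWS access key for the IAM user tied to the compromised laptop.
# "laptop-user" is a placeholder; replacement keys get issued from a clean machine later.
import boto3

iam = boto3.client("iam")
USER = "laptop-user"

for key in iam.list_access_keys(UserName=USER)["AccessKeyMetadata"]:
    iam.update_access_key(
        UserName=USER,
        AccessKeyId=key["AccessKeyId"],
        Status="Inactive",
    )
    print("Deactivated", key["AccessKeyId"])
```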
Our sales teams hear the "we'll just build it internally" or "we can just throw it into an LLM" objection all of the time.
Yes, certain parts of our product are indeed just lightweight wrappers around an LLM. What you're paying for is the 99% of other stuff that's (1) extremely hard to do (and probably non-obvious), (2) an endless supply of "routine" work that still takes time, or (3) an SLA/support that's more than "random dev isn't on PTO".
No, because it is never a credible bluff. You would not be having the conversation if it were.
In fact, having sold stuff: if a lead says this, it is a huge red flag for me that I probably don't want to do business with them, because they are probably a "vampire customer".
LLMs can write surprisingly decent code a few hundred lines at a time, but they absolutely can't write coherent hundred-thousand-line or larger programs.
I ended up doing a similar thing when I was a contractor. Just a really long note file where I'd track everything I was doing.
Relatedly, I find all of the todo/task management apps to be utterly overwhelming for my personal tasks. I'm so tired of all of the task apps adding way too much complexity.
All I want is:
* Something that's available on all of my devices.
* Can be ordered by sections:
  * Triage
  * Now
  * Today
  * Tomorrow
  * Soon
  * Eventually
  * Whenever (when-never)
* Lets me add a task without thinking (default to triage) - see the rough sketch below
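Here's roughly the entire feature set I'm describing, as a sketch (nothing here is a real app; cross-device sync is the genuinely hard part and is ignored):

```python
# The whole "app": fixed, ordered sections, and adding a task with no section
# argument drops it into Triage so capture takes zero thought.
SECTIONS = ["Triage", "Now", "Today", "Tomorrow", "Soon", "Eventually", "Whenever"]

tasks: dict[str, list[str]] = {section: [] for section in SECTIONS}

def add(task: str, section: str = "Triage") -> None:
    """Default to Triage so adding a task requires no decisions."""
    tasks[section].append(task)

def show() -> None:
    for section in SECTIONS:  # always in the same order
        if tasks[section]:
            print(section)
            for item in tasks[section]:
                print("  -", item)

add("renew passport")               # lands in Triage
add("fix the bike brakes", "Soon")
show()
```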