Time is my limiting factor, especially on personal projects. To me, this makes any multiplying effect valuable.
When I consider it against my other hobbies, $100 is pretty reasonable for a month of supply. That being said, I wouldn’t do it every month. Just the months I need it.
The same thing can be said about running Opus's output through Opus again.
It's often not that a different model is better (well, it still has to be a good model). It's that the different chat has a different objective - and will identify different things.
My (admittedly one person's anecdotal) experience has been that when I ask Codex and Claude to each make a plan/fix and then ask them both to review the results, they both agree that Codex's version is better quality. This is on a 140K LOC codebase with an unreasonable amount of time spent on rules (lint, format, commit, etc.), on specifying coding patterns, on documenting per-workspace README.md files, etc.
That's a fair point and yet I deeply believe Codex is better here. After finishing a big task, I used two fresh instances of Claude and Codex to review it. Codex finds more issues in ~9 out of 10 cases.
While I prefer the way Claude speaks and writes code, there is no doubt that whatever Codex does is more thorough.
Every time Claude Code finishes a task, I have it do a full review of its own work against a very detailed plan, and it catches many things it didn’t see before. It works well and it’s part of the refinement process. We all know it’s almost never a 100% hit on the first try with big chunks of generated code.
If your customer base is so broad that you can't define a clear outcome for your niche, your company probably isn't focused enough. Especially for a startup.
Take, for instance, a customer support agent that is supposed to resolve tickets. Assume it resolves around 30% of tickets by an objective measure. Do you think that cannot be captured and agreed upon by both sides?
Already, today, human customer support agents' performance is measured in ticket resolution, and the Goodhart's Law consequences of that are trivially visible to anyone who's ever tried to get a ticket actually resolved, as opposed to simply marked "resolved" in a ticketing system somewhere…
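To make that gap concrete, here's a toy sketch (every field and name here is hypothetical, not any vendor's actual schema) of the difference between the metric that gets reported and the outcome the customer experiences:

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    marked_resolved: bool      # status the agent/vendor sets
    customer_confirmed: bool   # did the customer agree it was fixed?
    reopened_within_30d: bool  # did it come straight back?

def billed_resolution_rate(tickets: list[Ticket]) -> float:
    """What the vendor is likely to report (and bill on)."""
    return sum(t.marked_resolved for t in tickets) / len(tickets)

def actual_resolution_rate(tickets: list[Ticket]) -> float:
    """What the buyer actually cares about."""
    good = [t for t in tickets
            if t.marked_resolved and t.customer_confirmed and not t.reopened_within_30d]
    return len(good) / len(tickets)

tickets = [
    Ticket(True, True, False),   # genuinely resolved
    Ticket(True, False, False),  # marked resolved, customer disagrees
    Ticket(True, False, True),   # marked resolved, came right back
]
print(billed_resolution_rate(tickets))  # 1.0
print(actual_resolution_rate(tickets))  # ~0.33
```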
Interesting. Let's take the case of infra spend on AWS. Amazon says you made 100k serverless invocations and charges you for them. How do you trust them on that?
The comparison doesn't quite hold because AWS is a utility; they aren't an arbiter of quality. Amazon charges for a serverless call regardless of whether your code worked or crashed. You pay for the effort (compute), which is verifiable and binary.
Once you shift to billing for outcomes like "resolutions," the vendor switches from a utility provider to the judge and jury of their own performance. At scale, that creates a "fox guarding the henhouse" dynamic. The friction of auditing those outcomes to ensure they aren't just Goodharted metrics eventually offsets the simplicity the model promises. Frankly, I just cannot and will not trust the judgment of tech companies who evangelize their own LLM outputs.
How do you verify AWS charges? By inspecting logs? There goes the arbiter.
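For what it's worth, the mechanics of the check are easy; the trust problem is that the numbers still come from AWS. A minimal boto3 sketch (function name and date range are made up) that pulls CloudWatch's own invocation count to compare against the bill:

```python
# Cross-check a Lambda bill against CloudWatch's invocation metric.
# Note the catch: this "independent" count is still produced by AWS.
from datetime import datetime, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Invocations",
    Dimensions=[{"Name": "FunctionName", "Value": "my-serverless-fn"}],  # hypothetical
    StartTime=datetime(2024, 5, 1, tzinfo=timezone.utc),
    EndTime=datetime(2024, 6, 1, tzinfo=timezone.utc),
    Period=24 * 3600,  # daily buckets
    Statistics=["Sum"],
)

invocations = sum(dp["Sum"] for dp in resp["Datapoints"])
print(f"CloudWatch reports ~{int(invocations)} invocations for the billing period")
```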
I get the binary part. So the biggest difference is the subjective component of the outcome? However, a tech provider - especially an agent provider - has to bring the subjective down to a quantitative metric when selling. If that cannot be done, I am not sure what we are going to be buying from agent builders/providers.
I hate to be so negative, but one of the biggest problems junior engineers face is that they don't know how to make sense of or prioritize the glut of new-to-them information to make decisions. It's not helpful to have an AI reduce the search space, because they still can't narrow down the last step effectively (or possibly independently).
There are junior engineers who seem to inherently have this skill. They might still be poor at finding all the necessary information, but when they do, they can make the final, critical decision. Now, with AI, they've largely eliminated the search problem, so they can focus more on decision making.
The problem is that it's extremely hard to identify who is which type. It's also a skill that senior-level devs have generally figured out.
So, one of the main reasons it needs to look like a truck is that it needs a truck-like structure to be compatible with basically all of the aftermarket parts.
I want a truck with flat bed rails so I can put a cap on it. It needs to have a proper frame under the bed so it’s not bending with point loads.
I need a bed that’s a separate piece from the cab so there’s flex for uneven grades.
The difference is what actually powers the wheels. A hybrid is still primarily ICE. An EREV is electric motors (with the ICE just charging the batteries).
I literally couldn’t think of a better truck than an EREV. Give me an ICE engine that can haul my trailer into the boondocks, knowing I just need a gas station nearby, but that can also power my trailer off the battery.
A compromised laptop should always be treated as fully compromised. However, you can take steps that drastically reduce the likelihood of bad things happening before you can react (e.g. disable accounts/rotate keys).
Further, you can take actions that inherently limit a compromise's ability to actually cause impact. Not needing to store certain things on the machine at all is a great start.
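As one concrete example of the "rotate keys" step, for AWS access keys a sketch might look like this (the IAM user name is hypothetical, and issuing replacements would happen from a trusted machine afterwards):

```python
# Deactivate every AWS access key for the IAM user tied to the compromised laptop.
# "laptop-user" is a placeholder; replacement keys get issued from a clean machine later.
import boto3

iam = boto3.client("iam")
USER = "laptop-user"

for key in iam.list_access_keys(UserName=USER)["AccessKeyMetadata"]:
    iam.update_access_key(
        UserName=USER,
        AccessKeyId=key["AccessKeyId"],
        Status="Inactive",
    )
    print("Deactivated", key["AccessKeyId"])
```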
Our sales teams hear the "we'll just build it internally" or "we can just throw it into an LLM" objection all of the time.
Yes, certain parts of our product are indeed just lightweight wrappers around an LLM. What you're paying for is the 99% of other stuff that's (1) extremely hard to do (and probably non-obvious), (2) an endless supply of "routine" work that still takes time, or (3) an SLA/support that's more than "random dev isn't on PTO".
No, because it is never a credible bluff. You would not be having the conversation if it were.
In fact, having sold stuff: if a lead says this, it is a huge red flag for me that I probably don't want to do business with them, because they are probably a "vampire customer".
LLMs can write surprisingly decent code a few hundred lines at a time, but they absolutely can't write coherent hundred-thousand-line or larger programs.
I ended up doing a similar thing when I was a contractor. Just a really long note file where I'd track everything I was doing.
Relatedly, I find all of the todo/task management apps to be utterly overwhelming for my personal tasks. I'm so tired of all of the task apps adding way too much complexity.
All I want is:
* Something that's available on all of my devices.
* Can be ordered by sections:
  * Triage
  * Now
  * Today
  * Tomorrow
  * Soon
  * Eventually
  * Whenever (when-never)
* Lets me add a task without thinking (default to triage) - see the rough sketch below
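Here's roughly the entire feature set I'm describing, as a sketch (nothing here is a real app; cross-device sync is the genuinely hard part and is ignored):

```python
# The whole "app": fixed, ordered sections, and adding a task with no section
# argument drops it into Triage so capture takes zero thought.
SECTIONS = ["Triage", "Now", "Today", "Tomorrow", "Soon", "Eventually", "Whenever"]

tasks: dict[str, list[str]] = {section: [] for section in SECTIONS}

def add(task: str, section: str = "Triage") -> None:
    """Default to Triage so adding a task requires no decisions."""
    tasks[section].append(task)

def show() -> None:
    for section in SECTIONS:  # always in the same order
        if tasks[section]:
            print(section)
            for item in tasks[section]:
                print("  -", item)

add("renew passport")               # lands in Triage
add("fix the bike brakes", "Soon")
show()
```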