> The replies are all a variation of: "You're using it wrong"
I don't know what you are trying to say with your post. I mean, if two people feed their prompts to an agent and one of them reaches their goals while the other fails to achieve anything, would it be outlandish to suggest that one of them is using it right and the other is using it wrong? Or do you expect the output to not reflect the input at all?
> I expect the $500 billion magic machine to be magic. Especially after all the explicit threats to me and my friends' livelihoods.
That's a problem you are creating for yourself by believing in magical nonsense.
Meanwhile, the rest of the world is gradually learning how to use the tool to simplify their work, be it helping onboard onto projects, doing ad-hoc code reviews, serving as a sparring partner, helping with design work, and yes, even creating complete projects from scratch.
In my experience it depends on which way the wind is blowing, random chance, and a lot of luck.
For example, I was working on the same kind of change across a few dozen files. The prompt input didn't change, the work didn't change, but the "AI" got it wrong as often as it got it right. So was I "using it wrong" or was the "AI" doing it wrong half the time? I tried several "AI" offerings and they all had similar results. Ultimately, the "AI" wasted as much time as it saved me.
And yours is also "you are using it wrong", in spirit.
Are they doing the same thing? Are they trying to achieve the same goals, but failing because one of them lacks some skill?
One person may need something very basic, like a script to batch-rename their files, while another may be attempting a massive refactoring.
And while the former succeeds, the latter fails. Is it only because someone doesn't know how to use agentic AI, or because agentic AI is simply lacking?
And some more variations that, in my anecdotal experience, make or break the agentic experience:
* strictness of the result - a personal blog entry vs a complex migration to restructure the production database of a large, critical system
* team constraints - style guides, peer review, linting, test requirements, TDD, etc
* language, frameworks - a quick Node.js app vs a Java monolith, e.g.
* legacy - a 12+ year old Django app vs a greenfield Rust microservice
* context - complex, historical, nonsensical business constraints and flows vs a simple CRUD action
* body of prior examples - a simple CRUD TODO in PHP or JS, done a million times, vs an event-sourced, hexagonally architected cryptographic signing system for government data
It's also bad at integration between modules. I have not yet tried to solve this by giving it documentation for both modules.
The model used also mattered. For understanding Java code it was great.
I first ask it to generate a detailed prompt by giving it a high-level prompt, then use the detailed prompt to execute the task (rough sketch below).
Java code up to about 20k LOC is fine; otherwise the context becomes too big, so you have to go module by module.
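A minimal sketch of that two-step flow, assuming the OpenAI Python client and a placeholder model name (swap in whatever agent or model you actually use):

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, model: str = "gpt-4o") -> str:  # model name is a placeholder
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def run_two_stage(high_level_goal: str) -> str:
    # Step 1: turn the vague, high-level goal into a detailed task prompt.
    detailed_prompt = ask(
        "Write a detailed, step-by-step prompt for the following task. "
        "List the files to touch, the constraints, and the acceptance criteria.\n\n"
        + high_level_goal
    )
    # Review and correct `detailed_prompt` by hand before spending tokens on it.
    # Step 2: execute the detailed prompt.
    return ask(detailed_prompt)
```

The value of splitting it up is that you get to read and fix the intermediate prompt before anything is executed.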
I believe that to have a real discussion, someone has to take an open-source code example and then show where it doesn't work. Other people can then discuss and decide.
> It's also bad at integration between modules. I have not yet tried to solve this by giving it documentation for both modules.
You should first draft the interface and roll out coverage with automated tests, and then prompt your way into filling in the implementation. If you just post a vague prompt about how you want multiple modules working together, odds are the output won't meet the implicit constraints.
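To make that concrete, a minimal sketch with a made-up `RateLimiter` interface (not from this thread): you draft the contract and the tests by hand, and leave the implementation as the hole the agent has to fill.

```python
from abc import ABC, abstractmethod

class RateLimiter(ABC):
    """Hand-drafted interface: the contract between the modules."""

    @abstractmethod
    def allow(self, key: str) -> bool:
        """Return True if the caller identified by `key` may proceed."""

def make_limiter(max_per_minute: int) -> RateLimiter:
    # Deliberately unimplemented: this is the part you prompt the agent to fill in.
    raise NotImplementedError

# Tests written up front encode the implicit constraints a prose prompt would miss.
def test_allows_up_to_the_limit():
    limiter = make_limiter(max_per_minute=2)
    assert limiter.allow("alice") is True
    assert limiter.allow("alice") is True

def test_blocks_past_the_limit():
    limiter = make_limiter(max_per_minute=2)
    limiter.allow("alice")
    limiter.allow("alice")
    assert limiter.allow("alice") is False
```

The prompt then becomes "make these tests pass without changing them", which is a much tighter constraint than a prose description of how the modules should work together.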
Of course the output reflects the input. That's why it's a bad idea to let the LLM run in a loop without constraints. It's simple maths: if each step is 99% accurate, after 5 steps you're at about 95%, after 10 steps at about 90%, and after 100 steps at about 36%.
For LLMs to be effective, you (or something else) need to constantly find the errors and fix them.
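The arithmetic, for anyone who wants to check it, is just independent 99%-accurate steps compounding:

```python
# Chance that an unchecked chain of n steps is still fully correct,
# assuming each step is independently 99% accurate.
for n in (1, 5, 10, 100):
    print(n, round(0.99 ** n, 3))
# 1 0.99
# 5 0.951
# 10 0.904
# 100 0.366
```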
I've had good experience getting a different LLM to perform a technical review, then feeding that back to the primary LLM but telling it to evaluate the feedback rather than just blindly accepting it.
You still have to have a hand on the wheel, but it helps a fair bit.
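Roughly, the loop looks like this (sketch only, using the OpenAI Python client with placeholder model names; any two models would do):

```python
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def cross_review(draft: str) -> str:
    # A second model produces a critical review of the primary model's output.
    review = ask("gpt-4o-mini", "Do a critical technical review of this output:\n\n" + draft)
    # The primary model is told to *evaluate* the review, not apply it blindly.
    return ask(
        "gpt-4o",
        "A reviewer left the feedback below on your earlier output. "
        "Evaluate each point on its merits, apply only the ones that are "
        "actually correct, and say explicitly which ones you reject and why.\n\n"
        f"Your output:\n{draft}\n\nReview:\n{review}",
    )
```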
I've seen LLMs catch and fix their own mistakes and literally tell me they were wrong and that they are fixing their own error. The analogy is therefore not accurate, as the error rate can actually decrease over time.
If we assume that each action has a 99% success rate and, when it fails, a 20% chance of recovery, then if the math here by Gemini 2.5 Pro is correct, the system will tend towards a 95% chance of success.
===
In equilibrium, the probability of leaving the Success state must equal the probability of entering it.
(Probability of being in S) * (Chance of leaving S) = (Probability of being in F) * (Chance of leaving F)
Let P(S) be the probability of being in Success and P(F) be the probability of being in Failure.
P(S) * 0.01 = P(F) * 0.20
Since P(S) + P(F) = 1, we can say P(F) = 1 - P(S). Substituting that in:
P(S) * 0.01 = (1 - P(S)) * 0.20, so 0.21 * P(S) = 0.20, which gives P(S) = 0.20 / 0.21 ≈ 0.952.
===
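If you don't trust the model's algebra, the same equilibrium falls out of just iterating the two-state chain (same assumed numbers: a 1% chance of breaking per step, a 20% chance of recovering once broken):

```python
p_success = 1.0
for _ in range(1000):
    # Stay successful with probability 0.99; recover from failure with probability 0.20.
    p_success = p_success * 0.99 + (1 - p_success) * 0.20
print(round(p_success, 3))   # 0.952, i.e. 0.20 / 0.21
```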
That's math based on arbitrary initial assumptions. There are numbers that work.
All this math is useless. Use your brain. The entire point I’m communicating is that it’s not a given that it must become less accurate. There are multiple open possibilities here and scenarios that can occur.
Doing random math here as if you're dropping the mic is just pointless. It doesn't do anything. It's like making up a cosmological constant and saying "the universe is collapsing, look at my math".