ChatGPT can get worse over time, Stanford study finds

mellosouls · on Sept 5, 2023

Paywalled article covering a study already discussed:

https://news.ycombinator.com/item?id=36781015

dang · on Sept 5, 2023

Thanks! Macroexpanded:

How is ChatGPT's behavior changing over time? - https://news.ycombinator.com/item?id=36781015 - July 2023 (178 comments)

archo · on Sept 5, 2023

https://archive.is/Kty4A

_nrnx · on Sept 5, 2023

I remember there is a study about the alignment cost. Basically the more restrictions and limit you put on a model, the worse its general performance becomes. Things like a ban on violence, race, or any other sensitive topics effectively throttle or change how the model "reason" or connect information within its network of parameters and result in degraded capacity.

I wonder if this is the reason behind all of this.

Edit: the study: https://arxiv.org/pdf/2308.13449.pdf

RationPhantoms · on Sept 5, 2023

How much of it is OpenAI/Microsoft curtailing the compute being used to generate responses?

practice9 · on Sept 5, 2023

The accuracy loss is more consistent with some kind of quantization of the model(-s) behind the scenes than the alignment gone wrong. Quantization to serve more users faster, on same amount or less of compute.

arrowsmith · on Sept 5, 2023

Sorry, what does quantization mean here?

iamjackg · on Sept 5, 2023

Reducing the precision of the weights from high precision floating points to either lower precision floats or even integers. You'd think it would greatly reduce the performance of a model, but in most cases the decline in quality is extremely tolerable compared to the reduction in memory/processing requirements.

mlboss · on Sept 5, 2023

It means using less number of bits to store float values. This reduces the memory/compute requirement at the cost of making model less precise.

imdsm · on Sept 5, 2023

Reducing the precision of the parameters — result being less memory intensive

mov_eax_ecx · on Sept 5, 2023

How can i locate this study?. I think you are misrepresenting something.

In the gpt4 paper they specifically address this, and find that "Averaged across all exams, the base model achieves a score of 73.7% while the RLHF model achieves a score of 74.0%, suggesting that post-training does not substantially alter base model capability."

nicce · on Sept 5, 2023

The problem with these studies is that we really still don’t know. Nobody can replicate the papers of OpenAI.

_nrnx · on Sept 5, 2023

Found it, it is a pretty recent paper.

https://arxiv.org/pdf/2308.13449.pdf

adamsb6 · on Sept 5, 2023

Given the homogeneity of responses on taboo subjects, there's probably something exogenous to the model at work.

dalore · on Sept 5, 2023

It feels the same thing happens with humans.

NoMoreNicksLeft · on Sept 5, 2023

[flagged]

DonaldPShimoda · on Sept 5, 2023

What a dumb take.

With no limitations in place, people who do not understand the natural limitations of language models will turn to them for advice on topics for which they are unqualified to respond. The most obvious example that comes to mind for me is medical advice: people will ask, e.g., ChatGPT to diagnose a complex medical issue, and the system (being unable to understand or reason) will give objectively bad advice in an authoritative manner. Responses of this nature should be prevented. Leaving the system without safeguards is irresponsible.

Similarly, prompts that engage with social constructs will provide responses that reflect biases due to the biases inherent in the training data, but an unrestricted system will respond in a matter-of-fact way that may conflate the opinions on which the model was trained with objective fact. To not curtail such responses is also irresponsible.

AuryGlenz · on Sept 5, 2023

A disclaimer for medical advice would work just as well without hamstringing the model into uselessness. Doctors give objectively bad advice in an authoritative manner all the time. ChatGPT is a great way to get a second opinion, and at least could be really nice for truly rare conditions that doctors have a hard time diagnosing.

Also, don't confuse "biases in the training data" with "facts I don't like." It seems people often do.

DonaldPShimoda · on Sept 5, 2023

> A disclaimer for medical advice would work just as well without hamstringing the model into uselessness.

I disagree. A lot of people skip disclaimers all the time, and when it comes to things like this, I think engineers of these systems have a duty to consider the possible consequences.

> Doctors give objectively bad advice in an authoritative manner all the time.

That doesn't mean we ought to automate the process.

> ChatGPT is a great way to get a second opinion, and at least could be really nice for truly rare conditions that doctors have a hard time diagnosing.

I don't think it's a great way to get a second opinion, and I think it's especially a bad idea to try to use it for diagnosing rare conditions. By the statistical nature of the language model, the more rare the condition, the less likely the model will predict a string of words that accurately diagnoses it.

> Also, don't confuse "biases in the training data" with "facts I don't like." It seems people often do.

The problem is not the actual facts but the inferences drawn from those facts when they are used to statistically generate responses by a language model.

AuryGlenz · on Sept 6, 2023

When I was a teenager, I was incredibly tired all of the time and at one point my mother and I noticed that my feet were shrinking. I went to my GP, and he ran a few basic blood tests and then did the medical equivalent of a shrug. I was put on antidepressants and he guessed that my feet were possibly only seemingly shrinking if my arches were getting higher.

I just asked ChatGPT what should be tested/what could be the cause of the fatigue and shrinking feet. It suggested a testosterone test (along with a few others), and that was indeed the reason. It also suggested some relevant specialists that might have figured it out.

I didn't find out until nearly 10 years after that appointment when I decided to really push for some more testing, after doing far more research than just typing it in to ChatGPT. I had to structure my whole life around my fatigue and didn't achieve what I could have. I also lost two inches of height as a teenager, in addition to my feet shrinking and my hands staying small.

I wish I had ChatGPT then, and it's surely helping people now.

NoMoreNicksLeft · on Sept 5, 2023

It would be interesting to know just what is filtered with hidden prompts. For instance, a person might harm themselves inadvertently when asking ChatGPT for advice on pouring concrete or doing small engine repair... but we both doubt very strongly that these are filtered meaningfully. Even if you wouldn't admit it.

Very little effort is made to filter such things. Instead, most of what is filtered is that which people find morally objectionable. With some certainty and though I haven't checked, if we were to ask it how I might go about buying a slave in Oman, it would filter that.

The unrestricted ChatGPT is a public relations nightmare, and we can't have it speculating on how Hitler might have succeeded with slightly different strategies. It is entirely about political correctness. Which wouldn't even be all that bad, really... if it also didn't fuck up responses about how to assemble Ikea furniture or the pros and cons of tilling for vegetable gardens.

You're being disingenuous claiming that this is about preventing it from harming people with bad medical advice. And you know you're being disingenuous. And everyone knows it too.

DonaldPShimoda · on Sept 5, 2023

> You're being disingenuous claiming that this is about preventing it from harming people with bad medical advice. And you know you're being disingenuous. And everyone knows it too.

I don't think I'm being disingenuous. I didn't claim that this was only about medical advice or anything of the sort. I was just giving some examples of things that I think would obviously require some sort of filtering, because the parent comment seemed to me to be suggesting that either (a) all filtering is due to "political correctness" or (b, a lesser claim) filtering language models in general is the wrong course of action. My point was to illustrate that (in my opinion) it is necessary to filter the responses given by language models, at least to some extent. I chose medical advice because that seems like something I would imagine practically all reasonable people would agree about.

> Even if you wouldn't admit it.

I'm not sure why you think I wouldn't admit that.

The filtering added to systems like ChatGPT seems to me to be about "How much face can we save with the least amount of effort?" I suppose they chose topics that were either obviously problematic and worth filtering or else potentially controversial and easy to filter.

The trade-off there is that there will be a lot of false positives: topics and prompts that are actually not objectionable but confuse the system due to insufficiently sophisticated filtration techniques. This is where your instructions/gardening responses get hindered.

> It is entirely about political correctness.

I disagree with the "entirely" bit. I think much of the filtering is probably due to a desire to avoid controversy, but some of it (maybe a lot? I guess we don't know) is also surely due to an actual need to prevent people from taking advice from a glorified text-prediction system. I don't think this position is particularly crazy, nor do I have been disingenuous in the slightest. I think your (mis)characterization of that is actually rather ironic.

delusional · on Sept 5, 2023

That's not at all transferable.

rafaelero · on Sept 5, 2023

The original GPT-4 gets 95% accuracy in my own benchmark (a multiple choice questions dataset I have built for a client in law-related subjects). Using the latest version (0613), it drops to 85%. So yeah, it is also my experience that the performance has degraded.

NikolaNovak · on Sept 5, 2023

* had to shorten the title to fit HN size. No editorializing intended

* original study: https://arxiv.org/pdf/2307.09009.pdf

* thought it's interesting as we hear a lot of people's experience, but this seems structured and well tracked set of specific scenarios

dang · on Sept 5, 2023

Your title was fine ("Study shows over time, ChatGPT lost accuracy") but as the article's own HTML doc title was short enough, we can just use it.

JimDabell · on Sept 5, 2023

I’m not sure about accuracy, but I have noticed that it’s become quite a bit more forgetful. When GPT-4 was new, I could start talking about a technical topic, then drill down into the details and refine / expand upon the subject.

These days, it seems to be a lot more likely to forget key details from earlier in the conversation. We’re not talking about chats that are anywhere near hitting the token limit. I have to keep reminding it about things when I didn’t have to before.

capableweb · on Sept 5, 2023

I'm fairly sure this is because of the longer context window they've probably enabled for GPT4 on ChatGPT, I'm seeing the same thing.

When you use the longest context window, it tends to place the most focus on the first and last message, for whatever reason. So if you instead of continuing the conversation, you start it from the beginning but with the additional information gained from the previous one in the initial message, it does a lot better.

nafizh · on Sept 5, 2023

Can 100% confirm this. You have to keep reminding it what I said just 2-3 hops ago.

nforgerit · on Sept 5, 2023

My favorite ironical spin of the LLM hypetrain is that our predecessors spent so much time in getting everything 100% correct from the ground up and parts of the current generation seem to decide "nah, screw it, mostly correct is fine".

And of course, I'm horrified by bad decisions politicians and managers seem to make.

wongarsu · on Sept 5, 2023

People keep saying they want computers to have "human level intelligence". I'm not quite sure what exactly that would entail, but if there's one thing I know about humans it's that they are performing jobs "mostly correct" and make a lot of mistakes.

capableweb · on Sept 5, 2023

> our predecessors spent so much time in getting everything 100% correct from the ground up

What predecessors are you talking about here? The internet is barely held together with duct tape and servers running only because of minute by minute injection of WD50 straight into the fan bearings. There is so much cruft everywhere that is being relied on daily in production systems that most people lost count by now on how much horrible code and setups they've seen in the life.

jwestbury · on Sept 6, 2023

Back in 2014, I left my job as a one-man IT department and started working at AWS. Pretty quickly, I found myself joking that I used to know how the Internet worked, but now I didn't know how it keeps working.

I spent over five years at AWS, and have worked at other major tech companies since. I continue to not know how the Internet keeps working -- it's basically always on fire, and held together only because of a lot of people who are actively firefighting.

quadrature · on Sept 5, 2023

>"nah, screw it, mostly correct is fine"

This isn't new, a big chunk of systems you use everyday are based on probabilistic models and heuristics that work most of the time.

LLMs reach very broadly and make the trade off very apparent.

Mathnerd314 · on Sept 5, 2023

The study itself is somewhat flawed: https://www.aisnakeoil.com/p/is-gpt-4-getting-worse-over-tim...

empath-nirvana · on Sept 5, 2023

I don't think "answering math questions" is a good or interesting test of LLMs, especially prime number checks. Whether it ends up with the correct answer for any given question is largely down to luck. It simply cannot do basic arithmetic reliably, even with chain of thought. It's also, even if its performance at that task improves, a terrible waste of computational resources. Nobody would ever use it in a real world application -- at best you'd ask it to write a program that can determine it for you, something which it is reliably good at.

ZunarJ5 · on Sept 5, 2023

For what it's worth, the article hits the nail on the head with my observations as well re: the model no longer being able to reason and go through the step by step processes mentioned. I have asked it to do several really simple tasks recently (e.g. list all headers in a link with exact parameters on how they are formatted), which it used to be able to do with ease, had it repeat back to me all instructions and parameters in order then it would completely ignore those instructions and do random things. I am using it less and less because of this. Maybe I just need agents?

capableweb · on Sept 5, 2023

I personally haven't experienced the same, ChatGPT and GPT4 via API seems mostly the same as before. Here is an example system prompt I'm using for some programming tasks:

> Repeat what I've set as the requirements in other words to ensure you understand it. Describe concisely at least two different approaches you could take to solve the problem while explaining the reasoning for why it's a useful approach. Chose the best approach, then think through how the problem could be solved step-by-step. Finally implement each step and provide a full solution.

Yesterday I used that to build (iteratively) a Rust CLI that downloads the last X HN comments have group-counts them by domain. Ended up being ~100 lines that GPT wrote by itself without any errors along the way.

DonaldPShimoda · on Sept 5, 2023

> the model no longer being able to reason

The model was never able to reason; it was only able to generate responses that masqueraded as reasoning. Thorough investigations of its "reasoning process" reveal this at every stage of its development.

whatshisface · on Sept 5, 2023

Okay, then they're saying it's no longer able to masquerade as reasoning.

DonaldPShimoda · on Sept 5, 2023

I can appreciate that, but in the context of LLMs where people are frequently drumming up discourse based on faulty beliefs about the systems' capabilities, I think it is important to be clear.

KingOfCoders · on Sept 5, 2023

I have a lot of problems with getting the title of an URL. It worked, no longer does.

capableweb · on Sept 5, 2023

> I have a lot of problems with getting the title of an URL. It worked, no longer does.

What do you mean exactly? Would be fun to try to replicate.

You give ChatGPT a URL and expect it to be able to get the <title> tag from the HTML? Or you're pasting the whole HTML document?

colordrops · on Sept 5, 2023

One of the use cases of gpt is that it can handle multifaceted questions. Math may be part of an answer that is not purely a math question.

dr_dshiv · on Sept 5, 2023

> Nobody would ever use it in a real world application

Math tutoring is a very legit application. It will be an incredible learning tool — but it needs better logical reasoning abilities with arithmetic. It seems like an easily solvable problem, given that math problems and answers are easy to scale into massive datasets.

capableweb · on Sept 5, 2023

Tutoring makes sense, arithmetics doesn't. Instead, what you want, is to include something like "If you need to evaluate formulas, embed them in a Markdown code block with the language set to 'math' which will be executed for you, and the results will be provided as the next message" to the system prompt, and write code around it to handle the execution.

Simplified example:

> system-prompt message

> user message: teach me linear algebra

> assistant: Sure, here's how [...] ```math 1 + 1```

> user message (but generated by application): ```1 + 1 = 2```

> assistant: and since 1 + 1 = 2, [...]

LLMs are great at generating text, but not at math, so make something else do the math for the LLM and then it can focus on what it does best.

albert_e · on Sept 5, 2023

Rather than longish term trends -- let me share very short term fluctuations I saw in ChatGPT Plus.

It seemed like there is some throttling / capacity management happening behind the scenes. For very similar level of complexity and work tasks, during peak time (US day time) I felt the responses lost context or ran into error loops more often.

Similarly working with plugins (Notable in my case) seemed to work fine in off peak hours but got more prone to losing the place, forgetting the default project/notebook, or even in one case simply limiting itself to adding code cells but not executing them -- and instead asking me to go and run the cell and report back the results. At better times it would run the cell, parse any errors, correct the code, and run it again till it gets it right. The amount of code documentation and markdown inserted also seemed to vary wildly. I am not sure houw much of this is tweaking in plugin configuration by Noteable though.

hot_gril · on Sept 5, 2023

From the beginning, the only thing my friends and I have used ChatGPT for is getting it to say things it's not supposed to, like offensive jokes or answers to ridiculous moral dilemmas. It's gotten much worse at that lately, so we've stopped using it.

rvz · on Sept 5, 2023

AI snake-oil usually degrades and wears off quicker over time when it is approaching its expiry date as more people realize that it hallucinates often without transparent explanation; study finds.

Unexplainable black-box stochastic parrots are no different to this no matter how cleverly packaged they are by AI bros.

Zambyte · on Sept 5, 2023

It's interesting to me that people use the word "hallucinate" to describe when an LLM provides incorrect information, when it is doing the exact same process as when it happens to get things right: predicting a reasonably likely next token.

It's "hallucinating" in the same way y=x^2 will predict the next point in a curve.

willsmith72 · on Sept 5, 2023

Do you honestly believe it's snake-oil? I can't believe people still think this, nearly everyone I've talked about it with has been able to find an amazing use for AI now.

ftxbro · on Sept 5, 2023

the idea of AI skepticism or AI snake-oil has shifted, from meaning that it's ineffective to meaning that it's so effective that it's bad for society

ftxbro · on Sept 5, 2023

don't forget to say that it's a blurry jpeg of the web, and therefore has fake artefacts and is a degraded form of the original

GaggiX · on Sept 5, 2023

I'm surprised he didn't mention that it's a glorified Markov chain.

ftxbro · on Sept 5, 2023

i'm as hype for ai as anyone but mathematically it is a markov chain; a trained llm model is a function that gives a probability distribution over token n+1 conditional on the preceding n tokens.

GaggiX · on Sept 5, 2023

Well, on Wikipedia you can read "Random process independent of past history" and "a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event." honestly I would not consider the entire context window to be part of a single event, since the model generates tokens one at a time, separately.

ftxbro · on Sept 5, 2023

> honestly I would not consider the entire context window to be part of a single event, since the model generates tokens one at a time, separately.

OK but traditional small language model Markov chains have context windows too, and they also generate tokens one at a time, separately. Maybe you are arguing that there is a qualitative difference between an eight token context window and an eight thousand token context window. I guess that would be fair enough.

edgarvaldes · on Sept 5, 2023

On the other hand, I would love to see the good stories about ChatGPT. Not the quick prototype that is common to find in the blog posts, but a real use case where ChatGPT was the differentiator for a non-simple task. HN is a great place to expect such histories.

willsmith72 · on Sept 5, 2023

I use gpt4 all day every day for programming and software architecture work, and I know I'm not alone.

suoduandao3 · on Sept 5, 2023

I'm reminded of how much better international real estate listings were in 2011. Used to be I could look at every real estate listing in Greece with a simple google search. I'm sure those sites still exist, but damned if I can find them through the noise of people advertising some more profitable subset.

I hope AI doesn't follow that trend, but I can't see why it wouldn't...

hot_gril · on Sept 5, 2023

Programming queries in particular have been ruined by SEO on Google. Like, search "longest TEXT in postgres" and enjoy the featured snippet's wildly incorrect answer (65KB) from some cloud whatever platform site, far above Postgres's official docs plainly stating the correct answer (1GB). Also "is kotlin a superset of java," Google's answer is yes according to a random blog.

This crap has even made its way onto Instagram. I've been suggested images of AI-generated figurines resembling Clash of Clans with awkwardly phrased captions about learning to code. "LEARNING: WHAT IS ARRAY" with a picture of an angry lumberjack.

slig · on Sept 5, 2023

Google Search was orders of magnitude better 10 years ago.

Someone · on Sept 5, 2023

I’m not sure about that. Its search results were better, but I think the fraction of hits of reasonable quality was higher, too.

Maybe, Google Search had gotten better, but not as fast as its task has gotten harder.

aoitiawj · on Sept 5, 2023

Almost all applications do this now. Even when I search logged out and in "in cognito mode", Google shows me the same repetitive ads and curated search results over and over. I'll search on two other peoples' computers and get two entirely different sets of results. The last time I used dating apps, several of them only showed me people of similar race and appearance. Like, I'm an extreme minority in a city of several million people and several different apps showed me a few hundred people with similar coloring and said there was no one else around. It's ridiculous and as long as the only motive is to consolidate wealth in the hands of a small number of owners, it will never change. This is exactly what capitalism delivers.

GaggiX · on Sept 5, 2023

To be honest I have used ChatGPT since day one it was out and in my experience it only improved, in particular the previous models were more restricted and they will complain more about what you asked, a possible comparison is Lama 2 70B compared to ChatGPT today.

Edit: I'm mostly talking about ChatGPT based on GPT-3.5

beebmam · on Sept 5, 2023

I observed the same as you, but for GPT-4.

IceHegel · on Sept 5, 2023

Though it generally seems to have gotten worse, I've found it's gotten better at programing in Swift. It used to be really bad, now it's somewhat useable.

tmaly · on Sept 5, 2023

I think it is really hard to say definitively.

There would need to be a better set of tests.

But the March 14, 2023 model is still available on the API.

Further testing is needed.

jwr · on Sept 5, 2023

Article doesn't seem to be available on the internet?

capableweb · on Sept 5, 2023

It is on my internet! https://arxiv.org/pdf/2307.09009.pdf

phillipcarter · on Sept 5, 2023

Awaiting the substack post that:

- Can't reproduce the issue that this paywalled article links to

- Finds that the methodology used to assess "accuracy" isn't actually checking accuracy, but is checking for a specific output, and that output generally does still produce the right value, just in a different format

- Summarizes by saying that ChatGPT is neither the second coming of christ nor horribly inaccurate, just something in between

NikolaNovak · on Sept 5, 2023

Is it pay walled? I don't have any subscription and just came across it semi randomly. I've also linked to the underlying study In a comment if thay helps.

phillipcarter · on Sept 5, 2023

Yeah, the article is paywalled unfortunately. Thanks for linking to the study.

r053bud · on Sept 5, 2023

Well Sam Altman assured us nothing has changed with the model, so case closed!