For one month beginning on October 5, I ran an experiment: Every day, I asked ChatGPT 5 (more precisely, its “Extended Thinking” version) to find an error in “Today’s featured article”. In 28 of these 31 featured articles (90%), ChatGPT identified what I considered a valid error, often several. I have so far corrected 35 such errors.


A tool that gives at least 40% wrong answers, used to find 90% errors?
But we don’t know what the false positive rate is either. How many submissions were blocked that shouldn’t have been? It seems like you don’t even have a way to find that metric out unless somebody complained about it.
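To put rough numbers on that (purely hypothetical, none of this is in the post): the only figure reported is accepted flags, so precision, accepted / (accepted + rejected), can’t be computed from what’s published. A minimal sketch, assuming a made-up rejection count:

```python
# Precision of the error-finding workflow: accepted flags / all flags.
# 'accepted' matches the 35 corrections the post reports;
# 'rejected' is a made-up placeholder -- the post never reports it.

accepted = 35   # suggestions the author judged to be real errors
rejected = 25   # HYPOTHETICAL: suggestions the author dismissed

precision = accepted / (accepted + rejected)
print(f"precision = {precision:.2f}")  # 0.58 under these assumed numbers
```

Until something like the rejected count is published, any precision figure is guesswork.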
“90% errors” isn’t accurate. It’s not that 90% of all facts in Wikipedia are wrong; 90% of the featured articles contained at least one error, so the articles were still mostly correct.
And featured articles are usually quite long. As an example, today’s featured article is on a type of crab: it runs over 3,700 words, with 129 references and 30-something books in the bibliography.
It’s neither unreasonable nor surprising to be able to find a single error in articles that complex.
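A quick back-of-the-envelope sketch shows why (the claim counts and per-claim error rates below are assumptions for illustration, not measurements): if an article makes n independent factual claims and each is wrong with small probability p, the chance of at least one error is 1 - (1 - p)^n, which climbs toward 90% very quickly:

```python
# Chance that an article with n independent factual claims,
# each wrong with probability p, contains at least one error.
# All n and p values below are illustrative assumptions.

def p_at_least_one_error(n: int, p: float) -> float:
    return 1 - (1 - p) ** n

for n, p in [(500, 0.001), (500, 0.005), (1000, 0.002), (2000, 0.001)]:
    print(f"n={n:4d} claims, per-claim error rate {p:.3f} "
          f"-> P(at least one error) = {p_at_least_one_error(n, p):.2f}")
```

Even a per-claim error rate of half a percent across 500 claims puts the odds of at least one error above 90%, so the headline result is pretty much what you’d expect.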
Bias needs to be reinforced!