Today (September 2, 2025) the stackoverflow.ai experiment gets a substantial update.
The origin story
stackoverflow.ai, also linked as "AI Assist" in the left navigation, was launched as a beta on July 9, 2025. Its goals included:
- A new way to get started on Stack Overflow. The tool can help developers get unblocked instantly with answers to their technical problems, while helping them learn along the way and providing a path into the community.
- A familiar, natural language experience that anyone who has interacted with genAI chatbots would expect, but further enriched with clear connections to trusted and verified Stack Overflow knowledge.
- A user-friendly interface with conversational search and discovery.
- A path, when the genAI tool isn’t providing the solution they need, to bring their question to the Stack Overflow community via the latest question asking experience, including Staging Ground.
stackoverflow.ai was built as "LLM-first":
- Submit the user’s query to an LLM and display the LLM’s response to the user
- Analyze the response from the LLM and search SO & SE for relevant content
This resolved the two issues from our past RAG-exclusive approach (irrelevant results and a lack of results), and we're seeing diverse usage of stackoverflow.ai: from traditional technical searches (help with error messages, how to build certain functions, what code snippets do), to comparing different approaches and libraries, to asking for help architecting and structuring apps, to learning about different libraries and concepts.
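For concreteness, that original flow amounted to roughly the following sketch (the helpers are stand-ins for illustration, not the actual implementation):

```python
# Hypothetical sketch of the original "LLM-first" flow described above.
# The helpers are stand-ins for illustration only, not Stack Exchange's actual code.

def call_llm(prompt: str) -> str:
    """Stand-in for a call to a hosted LLM."""
    return f"<LLM answer for: {prompt}>"

def search_network(text: str) -> list[str]:
    """Stand-in for a search across SO & SE for content related to the given text."""
    return ["https://stackoverflow.com/search?q=" + text[:40]]

def llm_first_answer(user_query: str) -> dict:
    # 1. Submit the user's query to the LLM and display the LLM's response.
    llm_response = call_llm(user_query)
    # 2. Analyze that response and search SO & SE for relevant content,
    #    so links are attached after the fact rather than grounding the answer.
    related_posts = search_network(llm_response)
    return {"answer": llm_response, "related": related_posts}
```

Because the links are attached after the response is generated, nothing in this flow ties the answer text back to specific posts.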
The community identified a significant issue in that this was not providing appropriate attribution to Stack creators. This was a consequence of the "LLM-first" approach where the response was not rooted in source content and LLMs cannot return attribution reliably.
What’s changed?
We shifted to a hybrid approach that helps developers get answers instantly and learn along the way, and that provides a path into the largest community of technology enthusiasts.
A response is created using multiple steps via RAG + multiple rounds of LLM processing (a rough illustrative sketch follows the list below):
- When a user searches, the query is executed across SO & SE for relevant content, and a re-ranker is applied to the results.
- Relevant quotes are pulled from the top results from SO & SE, including attribution.
- We created an AI Agent to act as an "answer auditor": it reads the user’s search, the quotes from SO & SE content, and analyzes for correctness and comprehensiveness in order to supplement it with knowledge from the LLM.
- If the search does not find any relevant content from SO & SE, the AI Agent is instructed to answer the user’s question as best it can using the LLM.
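For readers who want a concrete picture, here is a rough sketch of that flow. It is illustrative only: the helper names, data shapes, and the single hard-coded result are assumptions, not the production implementation.

```python
# Illustrative sketch of the hybrid RAG + LLM flow described above.
# All helpers are stand-ins; names, scores, and thresholds are assumptions.

def search_and_rerank(query: str) -> list[dict]:
    """Stand-in: retrieve candidate SO & SE posts, then re-rank them by relevance."""
    candidates = [
        {"url": "https://stackoverflow.com/a/12345", "author": "example_user",
         "license": "CC BY-SA 4.0", "text": "Example answer text.", "score": 0.91},
    ]
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

def extract_quotes(results: list[dict]) -> list[dict]:
    """Stand-in: pull short quotes from the top results, keeping attribution."""
    return [{"quote": r["text"], "url": r["url"], "author": r["author"],
             "license": r["license"]} for r in results[:3]]

def audit_and_compose(query: str, quotes: list[dict]) -> str:
    """Stand-in for the "answer auditor" agent: check the quotes for correctness and
    comprehensiveness, and supplement with LLM knowledge only where gaps remain."""
    if not quotes:
        # Fallback: no relevant network content found, so answer with the LLM alone.
        return f"[LLM-only answer to: {query}]"
    cited = "\n".join(f'"{q["quote"]}" (source: {q["url"]}, {q["license"]})' for q in quotes)
    return f"{cited}\n[LLM-supplied context filling any remaining gaps]"

def answer(query: str) -> str:
    results = search_and_rerank(query)
    quotes = extract_quotes(results)
    return audit_and_compose(query, quotes)
```

The key difference from the original flow is that retrieval happens first, so attribution travels with the quotes rather than being reconstructed afterwards.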
The interface design and answer presentation have been updated to make this source distinction clear to users, including which parts of the answer come from SO & SE and which come from the LLM. A research study with network moderators was a key part of developing this design, so many thanks to those who participated.
Our goal with this update is for stackoverflow.ai to be different from other AI tools: it prioritizes trusted, community-verified knowledge before other sources, and the LLM then fills in any knowledge gaps to provide a complete answer with clear sources. There is also still a path into the community to get more help or dive deeper (this feature is coming soon).
In an upcoming Stack Overflow Podcast episode, we’ll be taking a deeper dive into how we developed this approach (we’ll add the link here when that becomes available). At the bottom of this post, you’ll see some visuals of a query and response. Here are some details to go with those images.
Figure A (Initial response) - Sources are now presented up top, expanded by default, so you can see where the answer comes from without having to click. Licensing information is displayed as well. The response itself uses direct quotes from trusted community contributions, with every piece of linked content traceable to its origin.
Figure B (Scrolled lower, with hover) - The inline citations are more than just footnotes. Hover over them to see a popover with a clear link back to the source (this feature is coming soon). If you still don’t find what you need, we’ll be offering an "ask the community" pathway, linking to the question-asking flow.
What’s next?
We know that some in the community will be concerned that this is still not enough to showcase the full scope of the human community behind this information. In this new world, human-centered sources of knowledge are obscured, and this is one step toward counteracting that. The old front door was Google, and while that was not without its challenges, at least the search results page had clear links to the sources. We’re working now to build a new front door that meets new expectations and still provides a way in.
We're confident in this direction based on what we're seeing so far. The majority of queries are technical, so these are the types of users we want to capture. There is more positive feedback than negative coming back within the product. The demographic also shows that it's a different set of users than stackoverflow.com, so there are good signs here for acquiring new community members over time.
We'll continue to work on creating clear roads into the community — for example, we expect to iterate on the design of that "ask the community" pathway. That said, there are probably limits to what this interface can showcase while remaining focused on the user's specific need of the moment.
Upcoming work in the near term will be focused on:
- accuracy
- consistency
- context retention
- loading time
- determining the best way to connect users to the question asking flow
  - Figure B shows the intended first iteration of this path, with a link above the input field, not present in today's release
Please continue to provide feedback here and, as you try out the experience, through the thumbs-up/down and free-text options as well.
Figure A: Initial response
Figure B: Scrolled lower, with hover
- zcoop98 (09/02/2025): I like this a lot. I suspect, as you acknowledge, that there will still be legitimate objections and concerns, but this is a massive step forward in terms of both actually preserving content attribution and also tangibly demonstrating that it's indeed a priority internally. At first glance, this looks pretty cool!
- Kevin B (09/02/2025): Not quite sure I'm following this whole... front-door analogy. Is this meant to compete with Google as an entry point to the network, or... what? A network-wide search is certainly a useful use case for this, if it works well, but I don't see why this would in any way be an alternative to Google or any other internet-wide RAG-based search tool, given it only searches the Stack network. That's not to say it's not useful, more... I don't get the analogy. Unless you're creating this with the plan that devs will skip the internet and search on stackoverflow.ai instead, which seems odd to me.
- JonathanZ (09/02/2025): I like that y'all are moving in this direction, and am standing by to see how well it works.
- starball (09/02/2025): Semantic/vector-based search in general is a direction of experimentation I'm interested in and happy to see. Better search would be good for everyone. (Not sure if the current main-site(s) search bar uses anything like that?)
- NoDataDumpNoContribution (09/03/2025): @starball But is it better search? For my quick examples the shown linked questions still aren't very relevant. And the summary is as long as the original answer, different from it, and not more attributed. Where is the difference to before exactly? To me it looks mostly like a rearrangement (which is fine) but no additional attribution, or am I missing something? And it takes longer.
- MisterMiyagi (09/03/2025): "The community identified a significant issue in that this was not providing appropriate attribution to Stack creators." This is a pretty cynical thing to say. SE Inc. itself was going on and on about how significant attribution is, and it was absolutely obvious SE.AI V1 didn't attribute, as well as that the technology itself has trouble with attribution. If you folks didn't realise this, you were doing an incredibly terrible job. Styling this as taking in feedback from the community is rather tone-deaf.
- Lundin (09/03/2025): Bug: the window doesn't seem to scroll down any longer after you write a prompt; you have to scroll it manually each time to see the AI reply. (At least on Firefox.)
- starball (09/03/2025): @NoData I said essentially nothing in terms of personal judgement of the usefulness of stackoverflow.ai. All I said is that I want to see better search, and I'm interested in seeing the company try semantic/vector-based search.
- Kevin B (09/03/2025): Has this actually updated? I'm not receiving any RAG results at all... for something that definitely has an answer on SO. I even left clues for what language I was using and why, and it moved on to an entirely different language...
- Ash Zade (Staff, 09/03/2025): @KevinB can you share what you're searching for? At minimum, I would expect to see source cards. We're going to be deploying an improvement to the quoting within the response, which will be derived from the source cards.
- V2Blast (09/03/2025): @MisterMiyagi: On the flip side, the community did point this out. Better they acknowledge the community's criticism, rather than making it seem as if they came to this decision on their own when they initially released the feature in a way that didn't preserve attribution.
- Kevin B (09/03/2025): @AshZade I was asking why cfflush wasn't outputting to my web browser, with a cfml-tagged code block and a note that it's ColdFusion in the text. Unfortunately I don't have that ID, but I do have another that actually did return RAG results; then my follow-up resulted in an LLM response that was... questionable. f70b3c14-22cd-4cb4-be13-78845af7365c
- Ash Zade (Staff, 09/03/2025): @KevinB thanks for sharing.
- Ramhound (09/03/2025): Your screenshots have clear hallucinations in them...
- terdon (09/04/2025): Please define the acronyms you use. What does "RAG" mean?
12 Answers
> If the search does not find any relevant content from SO & SE, the AI Agent is instructed to answer the user's question as best it can using the LLM.
Why?
Instead of trying to generate content using an LLM, which would have highly questionable accuracy, why not take this as an opportunity to direct people to specific communities where they can ask their questions? This must be done with great care to ensure that each site's norms are considered. However, encouraging people to find a site, read the expectations, and ask a good question that would help them and future users would be beneficial over the long-term.
I think this would also address the disconnect where there are plenty of communities that don't want LLM content to be posted on them. It can be a bit confusing to users why there would be an LLM-powered tool making content in some places but it's not allowed in others.
- zcoop98 (09/02/2025): This feels like a reasonable objection, but I don't know if it's as simple as it sounds. Network Q&A has an explicitly different stated aim than an LLM front-end, and I wager that a user who looks to an LLM first is not likely to be satisfied by waiting days or more for a technical answer from real humans. The dissonance in user experience between SO.AI and SO proper is pretty large... I worry how successful "encouraging people to find a site, read the expectations, and ask a good question" is for someone who seems likely to be looking for a quick answer.
- Thomas Owens (09/02/2025): @zcoop98 If a user is looking for an LLM first, they wouldn't be using this tool. They'd just go to Gemini or ChatGPT or whatever. It seems like this is not going to attract the right people or encourage the right behaviors from people who do find it.
- Ash Zade (Staff, 09/02/2025): "Why not take this as an opportunity to direct people to specific communities where they can ask their questions?" We will be doing this with different "off ramps" based on what the user is asking, the response, and their feedback. For example, if we don't find any content on the network, we'll encourage them to explore communities related to their search and post there.
- Thomas Owens (09/02/2025): @AshZade OK, but why bother trying to make LLM content that is pretty terrible instead of a placeholder until you get to that? Although the tool can't even find relevant answers, so even the first part of finding, summarizing, and returning existing things on the network seems a bit broken. But creating LLM content when existing content isn't found seems wrong.
- Ash Zade (Staff, 09/03/2025): @ThomasOwens can you say more about the "seems wrong" part? As mentioned in the post, we're continually working on the RAG portion to retrieve the most relevant content. The examples posted in other answers help us understand how to weigh results.
- Thomas Owens (09/03/2025): @AshZade It makes no sense why you would fall back to LLM-generated content, given that (1) LLM-generated content is inconsistent with the network mission and (2) LLM-generated content often contains problems (ranging from low quality to outright incorrect data or errors). There could be value in improving the search, and I think using RAG and other AI technologies can be helpful there - I'm a huge fan of what Google is doing with Web Guide, for example. But I'd rather see improved search and then gateways to asking without LLM content.
- Ash Zade (Staff, 09/03/2025): @ThomasOwens The answer goes back to the goals of this initiative. We know the potential issues with LLM content, which is why it's a fallback, but we also know that even when users know LLM answers can be wrong, they prefer the experience to traditional search and hunting for answers. We want to improve the "dead-end" chatbot experience, where users are taken down the wrong path, by giving them paths to the community to ask and get answers in those situations. We're also not making a trade-off of "improve search OR work on AI". We're doing both.
- Thomas Owens (09/03/2025): @AshZade, I also have problems with AI summarization in general. Any time summarization is applied, people are less likely to click through and read the source material. This means that an answer seeker is only getting a subset of answers chosen by a black-box algorithm instead of being on the same page with all the content, selecting a sort order, being able to skim everything, and getting the nuances from any answer they choose to read. Although there are some improvements in the attribution, there are still unsolved problems.
- Thomas Owens (09/03/2025): @AshZade Improving search, especially cross-network search, is a good goal. Working on helping people ask good questions in the right places is also a good goal. But people who want LLM answers will go to LLMs. This initiative should focus on what the network is actually good at - surfacing verified and validated human-generated content and connecting answer-seekers with answerers. There's no need for a chatbot or LLM-generated content anywhere on the network.
- Ash Zade (Staff, 09/03/2025): @ThomasOwens I think we agree on this part, "But people who want LLM answers will go to LLMs", and that supports one of our goals, "A familiar, natural language experience that anyone who has interacted with genAI chatbots would expect, but further enriched with clear connections to trusted and verified Stack Overflow knowledge." If a significant cohort of users are looking for this experience, we're making a bet that we can accommodate and convert them to SO users.
- Thomas Owens (09/03/2025): @AshZade Do you have evidence that "a significant cohort of users" are actually looking for this experience? If so, what defines that cohort? Are they actually qualified to understand what an LLM is and does, along with the problems and risks of such a tool? I'd suspect that most people who use the LLM chat interfaces are either forced to (and would prefer a better interface) or don't understand the potential problems and risks, but that could be my bubble of being around people who understand the problems and risks to begin with and don't like the chat interfaces.
- Ash Zade (Staff, 09/03/2025): @ThomasOwens we've done a lot of research, plus our latest Dev Survey findings show that AI adoption is up. I'm not sure about the "forced to" part. Can you say more about that?
- Thomas Owens (09/03/2025): @AshZade AI adoption is up because some people are forced. Some companies are monitoring the use of AI tools for which they have licenses and expect employees to use them frequently. It's also force-fed in applications and hard to turn off - do a random Google search and see AI put right in your face without asking. Just because people are using AI doesn't mean they necessarily want to be using it. But let's say that people do want AI tools. That doesn't mean they want chatbot-style interfaces. Google's Web Guide, for example, shows a more traditional search interface backed by AI. (1/3)
- Thomas Owens (09/03/2025): Unobtrusive AI is much more powerful than AI that's shoved into my face. I have problems with things like Google's AI Overview because it shoves something I didn't ask for in my face. I have problems with Google's AI Overview and stackoverflow.ai because they rely on summarization of numerous sources and pull eyes away from original content. But I also have had good luck with some of GitHub Copilot's tools, or the Copilot functionality in Teams that summarizes a specific meeting, or Gemini summarizing a single document in Google Drive, or Google's Web Guide, because they are less obtrusive. (2/3)
- Thomas Owens (09/03/2025): There are ways to get unobtrusive AI in the network. Improving search would be one, but taking a page from Google Web Guide instead of ChatGPT. From a moderation perspective, casting automatic flags on posts or comments that may be rude. From a curation perspective, highlighting a post that may likely be closed on a site. Maybe putting posts into review queues based on categorization. All of this can be AI-powered, but it's not obtrusive or harmful. The AI sits quietly in the background doing stuff that helps me without boldly announcing it's AI. (3/3)
I just tried it, and it was outright terrible.
It completely loses context when you ask a follow-up, which is kind of one of the things that made the first LLMs so impressive. My original question was about C#; when I asked for a way to do this without using a library, it gave me some nonsense about Python. I tried it a second time, asking how to define string literals in C#; when I then asked for array literals in the follow-up chat, I got back JavaScript.
It links to SO posts inline, but uses essentially random text for it, which results in links that no sane person would write. This is part of a response I got from it:
"If you want to define a connection string as string literal in your C# code, you need to either duplicate the backslash: string connection = "Data source=.\myserver;Initial Catalog=myDataBase;User ID=sa;Password=myPass" or you need to use a "here string" (the leading @) - then a single backslash will do: string connection = @"Data source=.\myserver;Initial Catalog=myDataBase;User ID=sa;Password=myPass"." This illustrates how using the @ modifier simplifies the definition of strings that contain backslashes, making your code cleaner and easier to read.
And in terms of general usability, the lack of history is almost a dealbreaker. And I know it's not for privacy, as you save all input anyway.
The "Import chat" modal also can't be dismissed if something in the background goes wrong; no idea what, but it just hangs in that case.
It forces a "Quick Answer", "Explanation", "Tips/Alternatives" structure on answers that makes no sense at all in many cases.
- Ash Zade (Staff, 09/03/2025): The response is rigid right now, but we are going to add conditional outputs depending on the type of query (e.g. not everything needs tips or code examples). The lack of history is a choice we made in terms of scope. It's on our roadmap for sure, including authentication so we can implement personalization.
- Mad Scientist (09/03/2025): @AshZade if it doesn't have a history/context, it should not look like a chat interface. This is entirely unexpected behaviour if you have ever used any of the other chat-based AI tools. It's like talking to an AI with dementia that doesn't remember what you told it and what it said a minute ago.
- Ash Zade (Staff, 09/03/2025): Oh, you meant within-chat history; I thought you meant conversation history. I agree, and we're working on adding within-chat context ASAP. With RAG, there are additional complexities.
- tkruse (09/03/2025): It's not ready without that. It just should not show a follow-up text area and submit button if there is no follow-up. Otherwise, no user of any chat will understand the behavior.
Now...on to some feedback about the implementation. It's pretty bad.
This is conversation ID 012c493e-b3c7-40ca-86fc-49b565a7a12a.
I decided to take my highest-upvoted answer on Software Engineering as a basis, especially since I know this is a topic that has been discussed on multiple sites in the network. I ask:

> What do I do when my manager asks me for an estimate?
I get back something that doesn't look anything like the kind of answer that I'd expect. First, it took a long time to generate anything. And what I got back was an AI-generated answer because "no matching posts were found on the Stack Exchange Network".
However, I get back three links to a Workplace answer.
[Screenshot: stackoverflow.ai - What do I do when my manager asks me for an estimate?]
Asking follow-ups, like prompting for what software engineers have to say (with the hope of getting to one of the many Software Engineering questions on the subject), leads to irrelevant output, similar to what Mad Scientist pointed out.
- MisterMiyagi (09/03/2025): Wait, why is there a bash "script" in the answer? Oo
- VLAZ (09/03/2025): @MisterMiyagi it's an instruction to bash your head. Since it's not specified, the choice of surface to bash it against is yours.
- Staff (09/03/2025): Bash is included because the prompt is biased toward technical questions (data shows 90% of usage is technical so far). We're working on more dynamic responses, both in structure (not everything needs an explanation) and formatting.
Since the AI Assist is on all sites, I gave a basic, confused-student neurophysiology question a try.
Okay, bonus points for finding one of my own answers on the site, but also I've written other answers that are probably more pertinent to the question asked (which, to be fair, I purposely made vague and unclear).
The other answer it drew from is...well, it's not good. I started out complaining about what the AI produced, but a lot of it seems down to this bad answer. It had 4 upvotes, but it got almost everything about the question wrong, down to "positive and negative charges repel each other". So, good use of AI for me I guess, pointing me to old answers on our sites that are in very bad need of repair, and I can't really blame the thing for getting something wrong that our own library has wrong, but I can wish it would have found one of the places we have it right instead!
Also I'm a bit suspicious of that bash script...
Edit: I asked it the same question and it gave me a new answer, re-using the bad answer as a reference but also a different, third one. And I got a new bash script to try out:
```bash
# Example of potassium ion movement
K+ exits the cell through potassium channels, increasing negativity inside the cell.
```
The third time, it decided to rely on all three of the SE answers it had previously found. It also gave me yet another bash script, this time earlier in the response:
```bash
# Example of potassium ion movement
K+ out of cell -> increased negativity inside cell
```
I would really prefer one of Yaakov's haikus.
- Mad Scientist (09/02/2025): Have you tried executing the script and measuring the potassium ions in your computer?
- Bryan Krause (09/02/2025): @MadScientist I wasn't sure how to do that, so I asked SO.AI: "To measure potassium ions in your computer, you would typically need specialized hardware and software, as standard computers do not have the capability to measure ion concentrations directly. However, if you're looking to execute a bash script for a related task, you can create a script that interfaces with such hardware or processes data from experiments. Here's a simple example of how you might structure a bash script:"
- Bryan Krause (09/02/2025): `#!/bin/bash # Example script to process potassium ion data # Define the input data file input_file="potassium_data.txt" # Process the data awk '{sum += $1} END {print "Average potassium ions:", sum/NR}' $input_file`
- Kevin B (09/02/2025): This is somewhat expected; if it's limited to only searching in-network, it can only provide in-network existing solutions... I imagine the search aspect could be tailored to prefer higher-scoring content, but score isn't always a useful metric either.
- Bryan Krause (09/02/2025): @KevinB Yeah, if we're touting the value of the human network, we can't exactly be upset when the fault is in the human network. But it's also not a magic tool.
- l4mpi (09/04/2025): @BryanKrause the fault is not in the human network; the fault is in the stupid software that decides to respond to the query even though the response makes zero sense. Which is of course to be expected from a stochastic process that has no understanding of any of the "training" material it takes in, or of the queries and responses for that matter.
stackoverflow.ai cannot be used for programming simply because it's too old and outdated:
Me:

> What's the date of your training data?

ChatGPT "stackoverflow.ai":

> My training data includes information up until October 2023. This means that any developments or changes occurring after that date are not reflected in my responses.
The training data is 2 years old!
So for example it doesn't know anything about the final release of C23 or C++23 because these were released one year later in October 2024. When asked about those standards, it starts lying and hallucinating.
This can't be used for an ever-changing trade like programming; please remove it from Stack Overflow. Those who want to chat with an AI can go do that at chatgpt.com, where they can get better, more up-to-date information than here.
- fyrepenguin (09/03/2025): If it had a way to identify "no, this is too new"/"not in the training data" it would be better - there are a decent number of use cases where staying with the newest version of standards/languages isn't feasible, but it really shouldn't be hallucinating when asked about newer standards.
- Lundin (09/03/2025): @fyrepenguin It is not just out of touch with programming but with reality as a whole. Ask it about the next US presidential election: "The next U.S. presidential election is scheduled for November 5, 2024. As for the candidates, while the official nominations will be determined closer to the election date, prominent figures from both major parties are already emerging. For the Democratic Party, President Joe Biden is expected to run for re-election. On the Republican side, former President Donald Trump is a leading candidate."
- Lundin (09/03/2025): Then ask it about Kamala Harris and the acid trips begin: "Kamala Harris did not run for President in the 2024 election. /--/ Therefore, any claims about her running and losing to Donald Trump are inaccurate." Notably, the AI knows that the presidential election had not yet happened based on its training data, yet it chose to lie about a future it knows nothing about.
- Ash Zade (Staff, 09/03/2025): Thanks for posting this. One of the many variables is the different models we use for each of the 4 steps in the RAG + LLM approach. The newer models performed more slowly (+1 minute), so we made a trade-off decision to use the fastest models. We're not done optimizing this and are continually working on optimizing for speed, consistency, and accuracy.
- Lundin (09/03/2025): @AshZade I would prefer a slower AI over a bats*** crazy, lying AI, but that's just me...
- wizzwizz4 (09/03/2025): @AshZade It shouldn't take a very large model at all to do RAG – but you'd want one designed for the job, trained on general RAG tasks without baking in any domain knowledge. (That means not starting from any "train on all teh thingz" foundation models – which means you probably don't have the resources in-house to do this properly.)
- Ash Zade (Staff, 09/03/2025): @wizzwizz4 that's right. I think the +LLM part is the one where the model selection impacts data recency and accuracy, where we fill in gaps or don't have content on SO & SE.
- wizzwizz4 (09/03/2025): @AshZade When I last seriously worked on this problem (almost two years ago), I didn't find a trick to avoid writing millions of RAG examples by hand. Though, now I'm revisiting it, maybe generating the I/O pairs from a synthetic "facts" database and some sentence templates (like SCIgen) would work well enough. Regarding "filling in gaps": we really want those gaps identified, and asked as questions, so high-quality answers can be written and made available to future readers, and this is independent of whether the +LLM part can answer questions accurately (which it doesn't do reliably).
The ProLLM link could use some improvement
There's a ProLLM Benchmarks sidebar entry. This seems interesting: I'll click it.
- It opens in a new tab, so good job there.
- Unfortunately, it's `href="http://prollm.ai/"`. Can it be HTTPS, please?
- There's an info icon in a separate box next to it. I'd expect it to explain what ProLLM is; however, it does nothing, regardless of whether I click it or hover (Firefox 142.0, Linux, adblocker disabled).
- Clicking on the main button takes me to the ProLLM homepage. Under the heading, it reads "[...] We collaborate with industry leaders and data providers, like StackOverflow [...]". There should be a space in "Stack Overflow".
- While I like dark mode as an option, Firefox detects accessibility issues, including that the `*` in the "Email *" form field has poor contrast in dark mode.
- The Subscribe button, while seemingly helpful, goes to the contact form described as: "Please, briefly describe your use case and motivation. We'll get back to you with details on how we can add your benchmark." This creates potential ambiguity as to the correct way to subscribe.
- There's a login page for ProLLM spaces. When you enter an invalid password, the amusing error message of "Invalid password. Please never try again." has contrast issues in dark mode.
- Perhaps more interestingly for the login page... there's only a password box! Not a username one. Could you share a bit more about how that works?
- Staff (09/03/2025): Just wanted to thank you for all this feedback!
- talex (09/04/2025): "Please never try again." is a work of genius. Start using it right now.
Any use of an LLM as a fallback option means that you are providing non-attributed content, making this whole thing an insult. Has anyone coined an LLM parallel for greenwashing? Because that's what this is. If you actually care about attribution then don't use LLMs at any point in the process.
The current implementation looks quite bad, but there's a lot to praise in this announcement. It also seems I have many thoughts about it. (My kingdom for `<details>`/`<summary>`!)
You've clearly set out the original goals of the project.
- Onboarding new users who have questions. This involves:
  - Distinguishing between novel questions, and those already answered on the network, so that they can be handled differently.
    - This has been the goal of quite a lot of design decisions in the past, especially as regards the Ask Question flow. A chat interface has the potential to do a much better job, especially if we have systems that can identify when a question is novel.
    - A system that can identify when questions are novel could be repurposed in other ways, such as duplicate detection. However, unless you're constructing an auxiliary database (à la Wikidata or Wikifunctions), the low-hanging fruit for duplicate detection can be accomplished better and more easily by a domain expert using a basic search engine.
    - Novice askers often require more support than experienced askers, and different genres of question require different templates. A chat interface combining fact-finding (à la the Ask Wizard) and FAQs, perhaps with some rules to catch common errors (like a real-time Staging Ground lite), could cover some bases that the current form-based Ask Wizard doesn't.
  - Presenting users with existing material, in a form where they understand that it solves their problems.
    - A system that provides users with incorrect information is worse than useless: it's actively harmful. Proper attribution allows us to remove that information from the database (see Bryan Krause's answer), and – more importantly – prevents the AI agent from confabulating new and exciting errors. (Skeptics Stack Exchange can handle the same inaccurate claim repeated widely, but would not be able to cope with many different inaccurate claims repeated a few times each.)
  - Imparting expertise to users, so they need less hand-holding in future.
    - To use an example from programming: many newbies don't really get that variable names are functionally irrelevant, nor how completely the computer ignores comments and style choices, so if an example looks too different from their code, they can't interpret it. This skill can be learned, but some people need a bit of a push.
    - This is teaching, and therefore hard. I'd be tempted to declare this out of scope, although there are ways that a chat interface could help with this: see, for example, Rust error codes (which are conceptually a dialogue between teacher and student – see E0562 or E0565). Future versions of stackoverflow.ai could do this kind of thing.
    - Next-token prediction systems are particularly bad at teaching, because they do not possess the requisite ability to model human psychology. This is a skill that precious few humans possess – although many teachers who don't have this skill can still get good outcomes by using and adapting the work of those who do (which is a skill in itself).
    - Y'know what is good at teaching, in text form? Books! (And written explanations, more generally.) A good book can explain things as well as, or even better than, a teacher, especially when you start getting deep into a topic (where not much is fundamentals any more, and readers who don't immediately understand an explanation can usually work it out themselves). But finding good books is quite hard. And Stack Exchange is a sort of library...
    - Stack Exchange is not currently well-suited for beginner questions. When people ask a question that's already been answered, we usually close it as a duplicate (and rightly so!), so encouraging such users to post new questions is (as it stands) the wrong approach. However, beginners often require things to be explained in multiple ways, before it clicks. Even if one question has multiple answers from different perspectives, the UI isn't particularly suited for that.
    - I suspect that Q&A pairs aren't the right way to represent beginner-help: instead, it should be more like a decision tree, where we try to identify what misunderstandings a user has, and address them. Handling this manually gets quite old, since most people have the same few misconceptions: a computer could handle this part. But, some people have rarer misconceptions: these could be directed to the community, and then worked into the decision tree once addressed.
    - As far as getting the rarer misconceptions addressed, it might be possible to shoe-horn this into the existing Q&A system, by changing the duplicate system. (Duplicates that can remain open? Or perhaps a policy change would suffice, if we can reliably ensure that the different misconceptions are clear in a question's body.)
- Imitating ChatGPTs' interfaces, for familiarity.
  - I'm not sure why "conversational search and discovery" has an additional list item, since this seems to me like the same thing. (Functional specification versus implementation?)
- Competing with ChatGPTs, by being more useful.
  - I think focusing on differentiation, and playing to our strengths (not competing with theirs), is key here: I'm really glad you're moving in this direction. An OverflowAI that was just a ChatGPT was, I think, a huge mistake.
You've finally acknowledged that LLM output is neither attributed, nor really attributable. Although,

> LLMs cannot return attribution reliably

GPT models cannot return attribution at all. I'm still trying to wrap my head around what attribution would even mean for GPT output. Next-token generative language models compress the space of prose in a way that makes low-frequency provenance information rather difficult to preserve, even in principle – and while high-frequency / local provenance information could in principle be preserved, the GPT architecture doesn't even try to preserve it. (I expect quantisation-like schemes could reduce high-frequency provenance overhead to manageable levels in the final model, but I think you'd have to do something clever to train an attributing model without a factor-of-a-billion overhead.)

Embedding all posts (or, all paragraphs?) on the network into a vector space with useful similarity properties would cut the provenance overhead from exponential (i.e., linear space) to linear (i.e., constant space). This scheme only allows you to train a language model to fake provenance quite well, which isn't attribution either: that's essentially just a search algorithm. (We're back where we started: I don't expect this to be better than more traditional search algorithms.)
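(For illustration only: a toy sketch of the kind of nearest-neighbour lookup this paragraph describes. The bag-of-words "embedding" below is a stand-in for a trained text encoder, and the point stands that this is a search algorithm, not attribution.)

```python
# Toy sketch of "provenance via nearest-neighbour search" in an embedding space.
# The bag-of-words embedding is a stand-in for a real trained text encoder.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy "embedding": a bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def nearest_source(generated_text: str, corpus: dict[str, str]) -> str:
    """Return the URL of the most similar post: faked provenance, i.e. just search."""
    query = embed(generated_text)
    return max(corpus, key=lambda url: cosine(query, embed(corpus[url])))
```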
> analyzes for correctness and comprehensiveness in order to supplement it with knowledge from the LLM

There is no "knowledge from the LLM". That knowledge is always from somewhere else. (The rare exceptions, novel valid connections between ideas that the language model has made, are drowned out by the novel invalid connections that the language model has made: philosophically, I'd argue that this is not knowledge.) Maybe you still don't quite get it, yet.
Your implementation is still deficient:

> A response is created using multiple steps via RAG + multiple rounds of LLM processing

> We created an AI Agent to act as an "answer auditor": it reads the user's search, the quotes from SO & SE content, and analyzes for correctness and comprehensiveness
You're using the generative model as a "god of the gaps". Anything you don't (yet) know how to do properly, you're giving to the language model. And while the LLM introduces significant problems, I cannot find it in me to be upset about this approach: if something's worth making, it's worth making badly. Where you aren't familiar with the existing techniques for producing chat-like interfaces (and there is copious literature on the subject), filling in the gaps with what you have to hand... kinda makes sense?
But all the criticisms that the phrase "god of the gaps" was originally coined to describe apply to this approach just as well. There are better ways to fill in these gaps, and I hope you'll take them just as soon as you know what they are.
You've identified some ways people are using stackoverflow.ai. These include:

- traditional technical searches
  - help with error messages,
  - how to build certain functions,
  - what code snippets do
- comparing different approaches and libraries
- asking for help architecting and structuring apps
- learning about different libraries and concepts.

> The majority of queries are technical
This is extremely valuable information: you can use it as phase 1 of a Wizard of Oz design. However, I don't think you benefit much from keeping the information secret, since only a few players are currently positioned to take advantage of it, and they're all better able to gather it than you are.
Letting us at this dataset (redacted, of course) for mostly-manual perusal would let us construct expert systems, which could be chained together à la DuckDuckHack. Imagine a system like Alexa, but with the cohesiveness (and limited scope) of the Linux kernel. Making something like this work, and work well, requires identifying the low-hanging fruit: it's one great big application of the 80/20 rule.
> The demographic also shows that it's a different set of users than stackoverflow.com, so there are good signs here for acquiring new community members over time.
This doesn't follow. We've long known that most users never actively interact with the site (this is a good thing, for much the same reason that many readers are not authors). There's no reason to believe you can – or, more pertinently, should – be "acquiring" them as community members. (As users, maybe: that's your choice whether to push them to register accounts, so long as it doesn't hurt the community.)
This answer only deals with the attribution part. This announcement seems to indicate that the latest iteration of the AI assistant now supports attribution. I don't think it does. I think that, regarding attribution, nothing fundamental has changed, just a few rearrangements. That's why I'm confused and not sure what you really mean here by "rebuilt for attribution". I think this is wrong.
The LLM part is the same as before: no attribution at all. It just appears less often. The (more or less, or even not) relevant-answers part, with links to the answers, is also the same. It only appears earlier and is additionally summarized. The summaries seem to have links back to the answers, but these links are already contained in the linked answers. All in all, no additional attribution information is given in this iteration of the assistant compared to the previous version.
Finally we hear "...LLMs cannot return attribution reliably...", which is a departure from previous statements, but otherwise the only difference is that quotes from answers are additionally taken and referenced. One could argue that this increases the amount of attribution, but one could also argue that it does nothing to solve the general problem that LLMs cannot return attribution reliably, unless, for example, you stop using them. Are you willing to do that? I guess not.
And finally: I would not give attribution by turning text spanning multiple lines into a link. Rather, use quote styling and put the source information below. Something like:

> This is a relevant quote from somewhere.
>
> Source: link to the original answer and its author
- MisterMiyagi (09/04/2025): Yeah, the quotes+links are more a diversion than proper attribution. This is like a robber putting a few bought trinkets on display amidst all their stolen stuff – just because the trinkets are not stolen doesn't mean all the rest is suddenly legit as well.
Some observations:
- It's doing a better job of searching. Asking 'What's supposed to happen if I put multiple "style" attributes into an HTML tag?' now produces an answer that's pretty much a straight quote (with attribution) of this answer. Asking about the origin of "posh" (a reliable landmine for LLMs) gets sourced quotes from English Language & Usage and English Language Learners.
- It still gives legal and medical advice, tested by asking questions that had been closed on the "Law" and "Medical Sciences" sites.
- It still answers "boat-programming" questions such as 'What's some good background music for vibe-coding while on a sailboat?'.
- It still can't land a rocket on the Moon.
- Staff (09/03/2025): Thanks for sharing. Now I know to add "boat-programming" tests to my list.
```python
from telegram import Bot
import time

TOKEN = "7871645105:AAETb7wVYnRYBLs0gRIBTqTgwm0Tq6hJWdQ"
CHAT_ID = "@AZBER_Channel"  # your channel's username

bot = Bot(token=TOKEN)
message = "سلاممممم 🚀"  # "Hellooooo 🚀"

while True:
    try:
        bot.send_message(chat_id=CHAT_ID, text=message)
        print("پیام ارسال شد!")  # "Message sent!"
        time.sleep(2)  # delay between messages to reduce the risk of being rate-limited
    except Exception as e:
        print("خطا:", e)  # "Error:", e
        time.sleep(5)
```
First off, regarding the product itself: the crappy LLM can still print the n-word on my screen, print the rest of the response, and then realize it said something it shouldn't and replace the output with "Sorry, I can't answer that". Whoever had the idea to render first and then filter afterwards should be removed from any future product design meetings ASAP. Additionally, as already mentioned in other answers, this iteration seems to have zero context - asking something and then asking a followup led to a response that had nothing to do with the first query. And of course, it still hallucinates more than Terence McKenna. So from my PoV it's nowhere near being ready for public release.
But never mind all of that pesky feedback about content quality issues which SE will ignore or dismiss anyways. More importantly, by admitting that the previous version "was not providing appropriate attribution", SE just officially admitted that SE was running a service that was not conforming to the CC-license terms, and was also a clear break of the company promises regarding LLM products and attribution ("attribution is non-negotiable", "all products based on models that consume public Stack data must provide attribution"). This was the overwhelming feedback directly after the initial launch and the only official response then was a gaslighting answer by Rosie about how important attribution is for SE - apparently it wasn't important enough to be considered before launching, or to un-launch it. Yet SE has not even provided any sort of apology for this behaviour and for running a product for 10 weeks which SE knew to be insufficiently attributed and (according to Rosie) unethical.
And actually, I don't need a hollow apology filled with empty statements and promises that will be enthusiastically broken the next time the company chases another hype topic. If SE wants to restore at least a tiny part of my trust in this company, then please treat this the same way you would treat a serious security incident and provide a full breakdown of how this could happen, why feedback about it was ignored, what retraining measures or other consequences will be enacted for the responsible personnel (starting with the managers who pushed this project), and what processes will be put in place to ensure this never happens again. Of course I'm not holding my breath for any of this, what I expect to happen is that SE will stay the course and pretend it wasn't a big deal or that this somehow doesn't matter as it's only an "experiment" (irrelevant as it's publicly accessible), and then ask itself why meta responds negatively to staff posts...
(Note that I'm not judging if the current iteration of the product actually provides proper attribution, as I've not yet tested that in any form. But what is clear is that the previous version did not do so.)
- terdon (09/04/2025): You have a couple of fair points in here. Sadly, your choice to write them in such a snarky way makes this post a difficult read and easy to dismiss.
- l4mpi (09/04/2025): @terdon looking at the precedents over the past few years, SE will dismiss most feedback anyway, regardless of how politely or snarkily it is written (see e.g. the original LLM announcement, or the recent colors thing). In which case I see no reason to sugarcoat anything.