The honest answer to "what can generative AI safely handle in banking support" is: more than a bank's risk team is comfortable with, and less than a vendor's pitch deck claims. The interesting work is in the gap between those two, because that gap is where the actual deployment decisions live.
Generative AI is genuinely capable in a banking context. It can read a customer's question, understand it, and answer in plain language from the bank's own knowledge, including questions no one scripted. The reason banks are cautious anyway is the same reason the capability is valuable: the model produces fluent answers whether or not they are correct, and in banking a fluent wrong answer is about someone's money. Safety is the entire game.
The thing that makes banking different
In most industries, a confidently wrong AI answer is an annoyance. In banking it can be a regulatory event.
The Consumer Financial Protection Bureau studied chatbots in consumer finance and was direct about the risk. It found that chatbots can give inaccurate information, struggle to recognize or resolve disputes, and that chatbots which prevent access to live human support "can lead to law violations, diminished service, and other harms." The same report estimated that over 98 million people, roughly 37% of the U.S. population, interacted with a bank's chatbot in 2022, so the scrutiny applies to a very large surface area.
That reframes the design question. The goal is the highest automation rate the bank can defend in an audit, which sits below the rate a vendor will quote. A bot that answers everything and is sometimes wrong is more dangerous than one that answers less and routes the rest. Where that defensible rate lands is also what drives the economics of conversational AI in banking.
What generative AI can safely handle
The safe envelope in banking is the set of contacts where the answer is grounded in something the bank owns and a wrong answer is low-stakes or easy to catch.
Informational questions about products and policy. Fees, limits, supported features, how a product works, what a statement line means. The answers live in the bank's documents, and the model's job is retrieval and plain-language explanation.
Account and transaction lookups, after identity verification. Balance, recent transactions, payment and transfer status. These are factual reads from the bank's systems, which keeps the model honest as long as it is pulling real data rather than recalling.
Routine self-service actions. Card activation, card locking, PIN requests, login and password resets. Bounded actions with defined outcomes, and the core of automating tier-1 banking support.
Guided navigation and triage. Understanding what the customer actually needs, gathering the relevant context, and either resolving it or routing it to the right place with that context attached.
Drafting and summarizing for agents. Summarizing a conversation history, drafting a reply for a human to approve, surfacing the right policy. The human reviews before anything reaches the customer, which keeps it safe.
The common thread is grounding. Every safe use ties the answer to a current document or a live system read, and the model explains or retrieves rather than reasons its way to a conclusion about the customer's money.
What it should not handle on its own
The unsafe envelope is defined by judgment, irreversibility, and regulatory weight.
Payment disputes, account closures, hardship and collections negotiations, lending and credit decisions, fraud investigations beyond a first-line card lock, and anything resembling financial advice all belong with a human. These require judgment a model should not exercise, carry consequences a wrong answer cannot undo, or sit inside regulatory frameworks that assume an accountable person. The model can gather context and hand off cleanly. It should not be the decision-maker.
The CFPB specifically flagged that chatbots often cannot even recognize that a customer is raising a dispute, let alone resolve one. So disputes are a clear hand-off, and the model's useful contribution is recognizing the dispute fast and routing it with everything the customer has already said.
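A clean hand-off can be pictured as a small structured record that travels with the customer. The sketch below is illustrative only: the `Handoff` fields, the `build_handoff` helper, and the queue names are hypothetical and not taken from any particular system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Handoff:
    queue: str                 # where the contact is routed
    reason: str                # why the model stepped aside
    identity_verified: bool    # so the agent does not re-verify needlessly
    transcript: List[str] = field(default_factory=list)  # what the customer already said


# Hypothetical mapping from hand-off reason to a human queue.
QUEUES = {
    "payment_dispute": "disputes",
    "fraud_investigation": "fraud_ops",
    "hardship": "collections_specialists",
}


def build_handoff(reason: str, transcript: List[str], identity_verified: bool) -> Handoff:
    """Package the conversation so the human starts with full context."""
    return Handoff(
        queue=QUEUES.get(reason, "general_support"),  # unknown reasons go to a general queue
        reason=reason,
        identity_verified=identity_verified,
        transcript=list(transcript),  # copy, so later chat turns do not mutate the record
    )
```

The point of the record is that the agent never has to ask the customer to repeat the dispute they already described.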
What generative AI can safely handle in banking, and what it can't
Grouped by grounding and stakes. Based on the safe/unsafe envelopes in this article; no per-row metrics implied.
Safe to handle:
- Informational Q&A on products & policy (fees, limits, features)
- Account & transaction lookups (after identity verification)
- Routine self-service actions (card activation/lock, PIN, password reset)
- Guided navigation & triage
- Drafting & summarizing for agents (human approves)
Hand off to a human:
- Payment disputes
- Account closures
- Hardship & collections negotiation
- Lending & credit decisions
- Fraud investigation beyond first-line card lock
- Anything resembling financial advice
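The split described above can be written down as an explicit routing policy rather than left to the model's discretion. The following is a minimal sketch under stated assumptions: the intent labels are hypothetical, and the step that classifies a message into an intent is assumed to exist elsewhere.

```python
from enum import Enum


class Route(Enum):
    AUTOMATE = "automate"          # model may answer or act
    VERIFY_FIRST = "verify_first"  # model may answer only after identity verification
    HUMAN = "human"                # model gathers context and hands off


# Hypothetical intent labels; a real deployment would use the bank's own taxonomy
# and its own verification requirements for each self-service action.
POLICY = {
    "product_policy_question": Route.AUTOMATE,
    "self_service_action": Route.AUTOMATE,
    "navigation_triage": Route.AUTOMATE,
    "account_lookup": Route.VERIFY_FIRST,
    "payment_dispute": Route.HUMAN,
    "account_closure": Route.HUMAN,
    "hardship_or_collections": Route.HUMAN,
    "lending_or_credit_decision": Route.HUMAN,
    "fraud_investigation": Route.HUMAN,
    "financial_advice": Route.HUMAN,
}


def next_step(intent: str, identity_verified: bool) -> str:
    """Return 'answer', 'verify_identity', or 'handoff' for a classified intent."""
    route = POLICY.get(intent, Route.HUMAN)  # anything unrecognized goes to a person
    if route is Route.HUMAN:
        return "handoff"
    if route is Route.VERIFY_FIRST and not identity_verified:
        return "verify_identity"
    return "answer"
```

Defaulting unknown intents to a human is the design choice that matters here: the policy fails toward routing, not toward answering.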
Why hallucination is the controlling factor
The reason "safely handle" needs this much care is that generative models are confidently wrong at rates that would alarm anyone outside the field. Stanford's RegLab tested leading large language models on specific, verifiable legal questions, a high-stakes domain, and found they hallucinated between 69% and 88% of the time. Banking questions have the same profile: precise, verifiable, and consequential.
A raw model pointed at customers is unsafe, and the controls that make it safe are specific and known.
- Grounding. Answers come from the bank's own current knowledge and live system data, so the model retrieves facts rather than generating them.
- Source constraint. The model is held to what it can cite or look up, which collapses the space for invention.
- Conservative accuracy. When confidence drops, the model hands off to a human instead of producing a plausible guess. This is the single most important control in banking.
- Verification gates. No account data before identity is verified to the bank's standard.
- Audit logging. Every automated answer is captured, because a bank cannot defend a response it did not record.
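Taken together, the controls above amount to a gate that every drafted answer passes through before it reaches the customer. The sketch below is one way to picture that gate, not a reference implementation: the `Draft` fields, the `CONFIDENCE_FLOOR` value, and the assumption that the answering system exposes a confidence score between 0 and 1 are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Dict, List

CONFIDENCE_FLOOR = 0.8  # hypothetical threshold; a bank would set and tune its own
HANDOFF_REPLY = "I'll connect you with someone who can help."
VERIFY_REPLY = "Before I can look at your account, I need to verify your identity."


@dataclass
class Draft:
    text: str                 # the model's proposed answer
    sources: List[str]        # ids of the documents or system reads it relied on
    confidence: float         # assumed score in [0, 1] from the answering system
    needs_account_data: bool  # True if the answer uses customer account data


def decide(draft: Draft, identity_verified: bool, audit_log: List[Dict]) -> str:
    """Apply the gates in order and record the outcome, whatever it is."""
    if draft.needs_account_data and not identity_verified:
        outcome, reply = "verify_identity", VERIFY_REPLY       # verification gate
    elif not draft.sources:
        outcome, reply = "handoff_no_source", HANDOFF_REPLY    # source constraint
    elif draft.confidence < CONFIDENCE_FLOOR:
        outcome, reply = "handoff_low_confidence", HANDOFF_REPLY  # conservative accuracy
    else:
        outcome, reply = "answered", draft.text                # grounded answer

    # Audit logging: every decision is captured, including hand-offs.
    audit_log.append({
        "ts": time.time(),
        "outcome": outcome,
        "reply": reply,
        "sources": list(draft.sources),
        "confidence": draft.confidence,
    })
    return reply
```

In this arrangement the hand-off is an ordinary return value, so declining to answer is as easy for the system as answering.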
This is the posture Open.cx is built around: it ingests the raw banking knowledge directly, answers from it, and routes to a person the moment confidence drops. Because it bills per resolution and treats escalations as free, there is no incentive to over-answer the cases that should be handed off. The principle holds whatever the vendor: in banking, "I'll connect you with someone who can help" is a correct answer, and a model that reaches for it when uncertain is safer than one that reaches for a guess.
The numbers behind the caution
- 69% to 88%: rate at which raw LLMs hallucinated on specific, verifiable questions (Stanford RegLab)
- Over 98 million: U.S. consumers who used a bank chatbot in 2022 (~37% of population)
- Controls that make it safe: grounding, source constraint, conservative accuracy, verification gates, audit logging
Sources: Stanford RegLab, “Hallucinating Law” (2024); CFPB Chatbots in Consumer Finance (2023).
Accepting a lower ceiling on purpose
The hardest idea for a team optimizing automation rate is that the safe ceiling is lower than the achievable one, and that this is the right call.
A conservative model resolves fewer contacts than an aggressive one. The contacts it declines are the uncertain and the sensitive, which are exactly the ones where being wrong is expensive in a bank. So the lower resolution rate is buying down the regulatory and reputational risk that the higher rate would expose. The right metric to watch is the handoff rate alongside resolution and CSAT: a healthy banking deployment escalates the hard cases deliberately, and a handoff rate falling while CSAT falls means the model is answering things it should route.
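The warning sign described above, a handoff rate falling while CSAT falls, is simple enough to check mechanically. This is a minimal sketch assuming per-period counts are available; the `Period` fields and the CSAT scale are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Period:
    contacts: int     # total contacts in the period
    resolved: int     # contacts resolved without a human
    handed_off: int   # contacts routed to a human
    csat: float       # average satisfaction score (scale is an assumption)


def handoff_rate(p: Period) -> float:
    return p.handed_off / p.contacts if p.contacts else 0.0


def resolution_rate(p: Period) -> float:
    return p.resolved / p.contacts if p.contacts else 0.0


def over_answering_signal(previous: Period, current: Period) -> bool:
    """True when the handoff rate and CSAT both fell from one period to the next."""
    return (
        handoff_rate(current) < handoff_rate(previous)
        and current.csat < previous.csat
    )


# Example: automation went up, satisfaction went down.
last_month = Period(contacts=1000, resolved=600, handed_off=400, csat=4.5)
this_month = Period(contacts=1000, resolved=750, handed_off=250, csat=4.1)
flag = over_answering_signal(last_month, this_month)
```

A real monitor would likely want a tolerance band so that small month-to-month noise does not trip the flag; the sketch compares raw values only.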
Banks that get this right treat generative AI as a capable front line for the routine and a disciplined router for everything else. The technology is good enough that the temptation is to give it more. The discipline is knowing where its confidence stops being trustworthy, and building the system so it stops there too.