Human-AI Enterprise Teaming Is A Hard Technical Problem—That Requires Going Beyond LLMs
This post was originally published on my substack.
The new paradigm for managing work will be tools tailor-built for human-AI workflows and collaboration.
The future of work is human-AI teams. This is already abundantly evident within engineering teams; writing code with AI is a fundamentally different world.
But even as programming begins to transform, what about the managers of these technical teams — who own team-level, project-level, company-level and long-term outcomes? What does the new paradigm of managing-with-AI look like?
The next layer of enterprise intelligence will go beyond the engineering layer of building software—to the business layer of managing outcomes, where projects, priorities and resources come together in service of common objectives. This is where, ultimately, the potential for highest business impact lies, and where the highest leverage usecases are.
Multi-agent systems (MAS) that are autonomous, goal-driven, and adaptive in real time can 10x the best managers by going beyond surface-level automations: by becoming embedded in real-world workflows, solving a wide range of practical problems as they arise, and co-owning the team's goals.
At least that is the vision.
We are far from this vision. For all the talk of AGI, not only do we not have a MAS that can demonstrate system-level intelligence in enterprise settings, we have yet to see fully-realized autonomous agents for individual usecases that have been on the radar for years, e.g. a personal assistant that can independently handle email, scheduling, and to-do lists. The strongest signal that this old problem remains important and unsolved? YC included it in their 'request for startups' usecases for Summer 2025.
Within the enterprise, long-term adoption and trust will be driven by demonstrable ROI.
How well can agentic AI co-manage team-level outcomes?
How reliably can it deliver on high-value usecases for managers?
Can it learn quickly and continuously, and get smarter fast?
This post makes 2 key points:
Effective human-AI team collaboration to manage long-term enterprise outcomes is a hard technical problem.
LLMs are ill-suited for this problem, because their architecture is incompatible with logical reasoning.
The solution requires elements of classical AI—reasoning, planning, inference, deep reinforcement learning and memory—outside the limiting architecture of LLMs.
The future of work is Human-AI teams co-managing long-term outcomes.
Teams where humans and AI share work and collaborate to solve a wide range of problems will become standard. But there will be a ceiling on AI’s utility unless it can reason well enough to co-own long-term outcomes.
The fully-realized vision for Human-AI teams requires building AI that can collaborate with human teams of various sizes to do what humans do—understand, own and work sequentially towards team-level and project-level goals.
This is not an incremental productivity boost or a way to make a business process more efficient. It is a new world where AI can learn how to make our goals its own goals.
This is a transformational AI capability.
If you’re the manager of a technical team, this means AI can be a true collaborator. Delegate the delegation to it, let it flag early signs of emerging risk, get its input on how your team can meet goals more consistently (or more reliably, or faster), let it build and manage new workflows, trust that it will remove blockers and escalate the right things to the right people at the right time, or proactively come to you when something—anything—in the web of your teams, workflows, goals, projects, priorities, deadlines needs your attention.
This is the real promise of autonomous AI. This moves the needle. This makes a dent in how often projects succeed, how consistently teams hit goals, how reliably work gets done. This affects the bottom line; the cost of failed projects is high.
This is the modern enterprise.
This is also a very hard technical problem. More on this below.
The shift from Generative AI to Agentic AI has seismic implications. It allows us to build full-stack intelligence for enterprise.
The shift from generative to agentic is a systems-level shift.
Generative AI is best suited for usecases where generating content—text, video, audio—is the core deliverable e.g. writing code, doing research, creating images, product documentation, summarization, knowledge work that requires deep or broad searches, and so on. Anything where ROI is delivered via parsing and producing content, in various modalities e.g. natural language, image, audio, video. It is best served by prompting in natural language, making it passive.
Agentic AI goes beyond that. While the first wave of AI agents was built on top of LLMs and most still serve generative usecases (and sure, you can call anything an agent, and build "thin" agents on top of LLMs), agentic AI is not limited to LLMs or generative AI. Agentic AI can, and should, be built atop various machine learning models, depending on the usecase.
This is precisely why agentic AI opens up an entire set of new usecases we can now solve for: specifically ones where reasoning, learning, goal-seeking behavior, and autonomous execution are required to build system-level, full-stack, context-rich intelligence.
In other words, the real promise of agentic AI is closer to the original vision of AI—systems that can understand, reason, and generalize in new environments, and with incomplete information.
5 reasons to go beyond LLMs to build autonomous enterprise AI.
As we seek to build autonomous AI systems—that can make decisions and act intelligently in a range of enterprise usecases—token prediction is not sufficient for the reasoning capabilities required for problem-solving and decision-making.
Enterprise usecases involve collaborating with teams, owning projects and management-heavy workloads. They consist primarily of multi-step, real-world tasks and workflows ill-suited for LLMs, e.g. decision-making with incomplete information, multi-step reasoning that requires contextual awareness and mental models, and autonomous execution in novel situations.
To build agents that can not only talk to us, but think and act, we need to build beyond LLMs.
An analysis of the 5 intrinsic limitations of LLMs that make them a poor candidate for building autonomous AI for enterprise needs:
LLMs are stochastic language models.
They’re designed first and foremost to parse and generate human-like, natural language. The transformer architecture underpinning them predicts text, based on patterns learned from extensive training data that contains human-generated content.
The core limitation of stochastic modeling is precisely that it is stochastic, i.e. it has randomness as a feature, not a bug. This randomness is the fundamental reason it can capture the nuance, ambiguity and variation of human language. It is not deterministic and does not follow logical rules. It generates outputs by probabilistic prediction, often yielding different responses to the same input because of the randomness of the sampling process. This also leads to the infamous "hallucination" problem. And this randomness makes LLMs incompatible with usecases that require logical reasoning. See next point.
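Before moving on, a minimal sketch of what that sampling step means in practice. The vocabulary and logits below are made up for illustration, not taken from any real model; the point is only that the same input can yield different outputs, because the model draws from a probability distribution rather than applying a deterministic rule.

```python
import numpy as np

# Hypothetical next-token logits a model might assign after the prompt
# "The project deadline is" -- illustrative numbers, not from any real LLM.
vocab = ["Friday", "tight", "slipping", "negotiable"]
logits = np.array([2.0, 1.5, 1.2, 0.3])

rng = np.random.default_rng()

def sample_next_token(logits, temperature=0.8):
    """Sample one token from a temperature-scaled softmax distribution."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs = probs / probs.sum()
    return vocab[rng.choice(len(vocab), p=probs)]

# The same input, sampled five times, can yield different continuations.
for _ in range(5):
    print(sample_next_token(logits))
```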
LLMs are intrinsically inconsistent with the type of reasoning required for complex problem solving or decision making.
LLMs frequently struggle with: (i) causal reasoning (understanding cause-effect relationships), (ii) counterfactual thinking (considering alternative paths and outcomes), (iii) inductive reasoning (generalizing from specific observations to broader rules), and (iv) deductive reasoning (drawing conclusions that necessarily follow from given premises).
The above are core to problem solving and decision-making, which are rendered suboptimal at best and impossible at worst without them. Problem solving requires evaluating potential solutions, drawing conclusions, and applying logical thinking to find solutions. Decision-making involves evaluating multiple paths, and choosing the optimal one based on goals and the information available. Long-term consequences must often be weighed.
LLMs' probabilistic architecture is incompatible with deterministic reasoning models, which are built on logical structure and rule-based systems. E.g. in deductive reasoning, if the premises are true, the conclusion is guaranteed to be true. Reasoning models have logical operations (and, or, not) at their core; they make inferences, identify logical fallacies, and draw conclusions. This is simply not what LLMs are built to do; natural language systems are probabilistic. LLMs' "reasoning" relies heavily on specific patterns encountered during training, and is thus fragile. LLMs can identify patterns and predict the next word, but fail at looking at the entirety of a situation or considering its downstream effects (reinforcement learning excels at this; more on that later).
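To make the contrast concrete, here is a minimal sketch of a deterministic, rule-based deduction step. It is a hypothetical toy forward-chainer, not any particular reasoning engine, and the facts and rules are invented for illustration: given true premises and a rule, the conclusion follows every single time, with no sampling involved.

```python
# Toy forward-chaining deduction: if the premises hold, the conclusion
# is guaranteed -- the same input always yields the same output.
facts = {"project_blocked", "blocker_unassigned"}

# Each rule: (set of antecedent facts, fact to conclude if they all hold).
rules = [
    ({"project_blocked", "blocker_unassigned"}, "needs_escalation"),
    ({"needs_escalation"}, "notify_manager"),
]

def forward_chain(facts, rules):
    """Apply rules repeatedly until no new facts can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, conclusion in rules:
            if antecedents <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

print(forward_chain(facts, rules))
# Always the same result, including 'needs_escalation' and 'notify_manager'.
```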
LLMs are known to be bad at performing in new situations.
They do well when they encounter problems or patterns they’ve seen in training data, but in the absence of a logical core designed for various types of reasoning, they get “stuck” when they see something new. They have limited ability to maintain context over time and extended interactions. Instead, they are primarily limited to the provided input, the historic data they’re trained on (and at times, user-driven fine-tuning).
LLMs do not build conceptual models to understand what they do.
An example of this is their poor performance on math. Not only do they struggle, but when slight changes are made in the wording of the math problem, they show massive performance and output variations. This is because they don’t understand (even basic) math at all. All they ‘understand’ about numbers is where they’re most likely to occur in a sequence or sentence, based on their training data. When a model tells you 12+8 is 20, it is not doing the math, and doesn’t know what addition is — it has simply seen sufficient examples in training data that 12+8 is 20.
LLMs are not built for state management—and hence ill-suited for goal-seeking systems.
Making progress towards a goal requires maintaining state. Often, it requires performing search on a space with high state complexity, maintaining heuristics to estimate how close a given state is to a goal state, and so on. LLMs are not built for this — again both due to their stochastic nature, and due to their limited ability to handle complex computations.
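As a concrete illustration of what "maintaining state and searching toward a goal" involves, here is a minimal greedy best-first search sketch over a toy task-dependency space. The tasks, dependencies and heuristic are hypothetical; the point is the structure a goal-seeking system needs: explicit states, an explicit goal test, and a heuristic estimating distance to the goal.

```python
import heapq
from itertools import count

# Toy state space: each state is the set of completed tasks; the goal
# is to have every task done. Purely illustrative.
ALL_TASKS = frozenset({"design", "implement", "review", "deploy"})
DEPENDENCIES = {"implement": {"design"}, "review": {"implement"}, "deploy": {"review"}}

def successors(done):
    """Tasks whose dependencies are already satisfied."""
    for task in ALL_TASKS - done:
        if DEPENDENCIES.get(task, set()) <= done:
            yield done | {task}

def heuristic(done):
    """Estimate of distance to the goal: number of tasks still incomplete."""
    return len(ALL_TASKS - done)

def best_first_search(start=frozenset()):
    """Greedy best-first search: expand the state that looks closest to the goal."""
    tie = count()  # tie-breaker so heapq never compares states directly
    frontier = [(heuristic(start), next(tie), start, [start])]
    seen = set()
    while frontier:
        _, _, state, path = heapq.heappop(frontier)
        if state == ALL_TASKS:
            return path
        if state in seen:
            continue
        seen.add(state)
        for nxt in successors(state):
            heapq.heappush(frontier, (heuristic(nxt), next(tie), nxt, path + [nxt]))
    return None

for step in best_first_search():
    print(sorted(step))
```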
What we see above is that LLMs are not just bad at, but architecturally and computationally incompatible with the modeling required to excel at:
Reasoning and logical thinking
Complex problem solving
Decision-making
Responding in novel situations, and
Goal-centric systems
Such modeling requires logical operations, deterministic outcomes, conceptual models and state management.
LLMs are incredible at what they were actually designed to do: generate plausible, human-like (if not always factually correct) language. Stochastic language modeling is optimized to do exactly what it does; it's great at it.
But it should not be force-fitted to usecases it is not designed for, or generalized to represent “AI”. In other words, if LLMs are poor at something, that doesn’t mean AI is poor at it. LLMs are one type of AI modeling available to us.
When we ask LLMs to operate outside the world they’re trained to perform in—by building agents on top of LLMs and asking them to take on a range of tasks that are not language-centric, not compatible with stochastic sampling, not probabilistic in nature—they become brittle.
So what is all this talk of ‘reasoning’ and reinforcement learning for LLM-powered agents?
On the topic of reasoning, a small tangent to address what current “reasoning-enhanced” and “chain-of-thought” models by OpenAI are about. ChatGPT uses RLHF (Reinforcement Learning from Human Feedback).
A few thoughts on this, but the TLDR is that this "reasoning" is still for the purpose of understanding and generating language and dialogue, text summarization, and machine translation, not executing on multi-step, enterprise-grade workstreams and long-term goals. The issues of inference, reliability and hallucination are mitigated but not eliminated.
RLHF, and RL in the context of LLMs is different from RL used in other machine learning models—it is optimized for NLP tasks. The end goal is still an effective language model, but trained to better follow human instructions in natural language, and to optimize long-term or human-centered goals rather than simple token prediction.
Traditional LLM training uses Supervised Learning (SL), where training datasets are labeled. RLHF attempts to augment this by adding a feedback loop from the (human) user. In technical terms, this is still a supervised signal; the supervision simply comes from a live user instead of the historic dataset, with the purpose of "preference-tuning" the model.
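To make "preference-tuning" concrete, here is a minimal sketch of the idea behind the reward-modeling step in RLHF: a toy linear reward model trained with a Bradley-Terry-style pairwise loss. The feature vectors and data are hypothetical stand-ins, not any lab's actual pipeline; the model simply learns to score the response a human preferred above the one they rejected.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature vectors for (response A, response B) pairs a human compared.
pairs = [(rng.normal(size=4), rng.normal(size=4)) for _ in range(200)]
# Pretend humans systematically prefer responses whose first feature is higher.
pairs = [(a, b) if a[0] > b[0] else (b, a) for a, b in pairs]

w = np.zeros(4)   # linear reward model: reward(x) = w @ x
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Bradley-Terry-style objective: maximize P(preferred beats rejected)
# = sigmoid(reward(preferred) - reward(rejected)).
for _ in range(100):
    grad = np.zeros_like(w)
    for preferred, rejected in pairs:
        margin = w @ preferred - w @ rejected
        grad += (1.0 - sigmoid(margin)) * (preferred - rejected)
    w += lr * grad / len(pairs)

# The learned reward model now scores human-preferred responses higher;
# in RLHF this score would then guide further tuning of the language model.
correct = sum((w @ p) > (w @ r) for p, r in pairs)
print(f"reward model agrees with human preference on {correct}/{len(pairs)} pairs")
```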
The most interesting example of attempting to use "pure RL" came from DeepSeek-R1-Zero. While their RL process was broadly similar to OpenAI's RLHF pipeline, the critical difference was that they applied RL without supervised fine-tuning (SFT) as a preliminary step, and still achieved strong reasoning results.
Reasoning within the scope of language is still reasoning, but it is for the purpose of producing more relevant, more useful language, and it remains confined by the computational structure of an LLM: still reliant on a stochastic sampling process, still probabilistic in nature. These attempts do improve LLM performance significantly. In the training phase, LLMs can be trained on formally verifiable tasks, such as math problems or coding, to help produce more correct chains of thought (CoT). But during inference, LLMs remain fundamentally challenged because their architecture is (and has to be) probabilistic. The self-learning loop relies on the same stochastic sampling and modeling, and there is no deterministic verification step at inference time (as there is in pure RL), which means it does not yield dependable results for goal-seeking systems in the long term.
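A "formally verifiable task" here simply means the training loop can check an answer deterministically. A minimal sketch of such a reward check, hypothetical and not DeepSeek's or OpenAI's actual implementation:

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Deterministic reward for a math-style task: 1.0 if the final answer
    matches the known correct answer exactly, else 0.0. No human judgment,
    no sampling -- the check itself is not probabilistic."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

# During training, sampled chains of thought that end in a verified answer
# get reinforced; at inference time, no such check exists.
print(verifiable_reward("20", "20"))   # 1.0
print(verifiable_reward("21", "20"))   # 0.0
```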
Deep Reinforcement Learning (RL)—outside the limitations of LLM architecture—is well-suited for systems of management.
Systems of management are systems of decision-making and problem solving. Learning models that are goal-seeking and decision-making—such as RL—are well-suited for these systems and their usecases. RL shines in real-time decision-making and learning tasks.
3 interesting preambles about RL relevant to our enterprise usecase:
Reinforcement learning is different from both supervised and unsupervised learning; it falls into neither category. Even though the naming implies that those two categories cover all learning models, they do not.
Of all the forms of machine learning, reinforcement learning is closest to the learning that humans and other animals do. Many core RL algorithms were originally inspired by biological learning systems. There's the classic example from "Reinforcement Learning: An Introduction", Richard Sutton and Andrew Barto's seminal book, of this kind of learning in action: "A gazelle calf struggles to its feet minutes after being born. Half an hour later it is running at 20 miles per hour."
RL agents—like humans—learn successful strategies that result in the greatest long-term "rewards". They construct their own knowledge directly from environmental inputs. This is key, because the biggest determinant of the value AI agents can add in enterprise usecases is how smart they are at the outset, and how quickly they get smarter. Essentially, it is a deep learning (and training) problem. The learning mechanism for the baby gazelle above, for example, relies both on the intelligence it comes "seeded" with, and on its capacity to learn, reason and adapt quickly by interacting with its environment, trying new things, and assessing results.
Sutton and Barto's book highlights 3 defining features of RL that make it an incredible candidate for our enterprise vision of building full-stack intelligence.
A key feature of RL is that it assesses the entirety of a problem.
In contrast to approaches that focus on isolated subproblems without understanding how they fit into the larger picture, the RL approach is complete: it explicitly assesses the entirety of the problem (a goal-seeking agent interacting with an uncertain environment).
RL starts with a goal-seeking, interactive agent.
It has explicit goals—explicit in the sense that the agent can judge progress toward its goal based on what it can sense directly. It senses and interacts with signals from its environment, and can make decisions to take actions that influence the environment.
RL agents are optimized for execution in novel situations, with incomplete information.
In fact, it is assumed in RL that the agent will need to operate despite significant uncertainty about the environment it is in. In other words, it is built to plan and act under uncertainty. To do this, RL agents can build models of their environment that mimic its behavior. These can be used for planning, i.e. deciding on a course of action by considering possible future situations before they are actually experienced.
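A minimal sketch of these ideas in code: a toy tabular Q-learning agent on a tiny hypothetical "project pipeline" (the states, actions and rewards are invented for illustration, not a production system). The agent has an explicit goal state, acts under uncertainty, and learns from reward which actions bring it closer to the goal.

```python
import random

random.seed(0)

# Toy environment: a project moves through stages 0..4; stage 4 is "shipped".
# Action 0 = keep working (slow, reliable); action 1 = escalate (faster, but can stall).
N_STATES, GOAL = 5, 4
ACTIONS = [0, 1]

def step(state, action):
    """Stochastic environment: returns (next_state, reward)."""
    if action == 0:
        nxt = min(state + 1, GOAL)
    else:  # escalation usually jumps ahead, sometimes stalls the project
        nxt = min(state + 2, GOAL) if random.random() < 0.6 else state
    reward = 10.0 if nxt == GOAL else -1.0   # time spent until shipped is costly
    return nxt, reward

# Q-learning: learn the long-term value of each action in each state.
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.1, 0.95, 0.1

for _ in range(5000):
    state = 0
    while state != GOAL:
        if random.random() < epsilon:
            action = random.choice(ACTIONS)          # explore
        else:
            action = max(ACTIONS, key=lambda a: Q[state][a])  # exploit
        nxt, reward = step(state, action)
        # Update toward reward plus discounted value of the best next action.
        Q[state][action] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][action])
        state = nxt

for s in range(GOAL):
    best = max(ACTIONS, key=lambda a: Q[s][a])
    print(f"stage {s}: prefer {'escalate' if best else 'keep working'}")
```

The point of the sketch is the shape of the loop: explicit state, explicit goal, actions whose long-term consequences are learned from interaction rather than predicted from text patterns.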
Deep Reinforcement Learning is a type of RL that uses deep neural networks as function approximators, which is what allows it to scale to the large, complex state spaces found in real-world environments.
We can create new, dynamic training data for RL agents—by turning everyday work and human interactions into structured data.
At the training level, using AI to manage work requires transforming everyday work into structured data for AI models.
Human interactions and workflows—Slack conversations, decisions made over Zoom, email threads, GitHub pull requests and code reviews, RFCs, decision logs—can be turned into structured data for AI models, to train agents. Static, historic company data like product documentation or team resources remains critical, but is insufficient for building full-stack intelligence in an environment where priorities and goals are changing rapidly.
Instead, dynamic workflows, communications and conversations that cannot be "uploaded" in a format consumable by an AI model, but that carry rich, essential context for direction and strategy, can be transformed into data that AI models can consume and agents can train on.
Creating this new, dynamic training data enables agents to learn by interacting with their environment and mimic the real-time, experience-based, continuous learning that humans do.
This can be a major strategic advantage in the context of demonstrable ROI.
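A minimal sketch of what "turning everyday work into structured data" could look like. The schema, field names and keyword rule below are hypothetical, not a prescribed format; the point is that a raw Slack-style message becomes a typed event that downstream models and agents can consume.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class WorkEvent:
    """One structured record distilled from an everyday interaction."""
    source: str        # e.g. "slack", "github", "zoom"
    actor: str         # who said or did it
    event_type: str    # e.g. "decision", "blocker", "status_update"
    project: str       # which project or goal it relates to
    summary: str       # short natural-language gist
    timestamp: str     # ISO-8601, UTC

def structure_slack_message(channel: str, author: str, text: str) -> WorkEvent:
    """Toy extraction: in practice the classification would itself be
    model-driven; a keyword check stands in for it here."""
    event_type = "blocker" if "blocked" in text.lower() else "status_update"
    return WorkEvent(
        source="slack",
        actor=author,
        event_type=event_type,
        project=channel.lstrip("#"),
        summary=text[:140],
        timestamp=datetime.now(timezone.utc).isoformat(),
    )

event = structure_slack_message("#checkout-revamp", "maya",
                                "Still blocked on the payments API review.")
print(asdict(event))   # ready to log, aggregate, or feed into training
```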
We can model AI teams after human teams—generalists and specialists working together towards common goals.
Agents working in isolation, or looking at narrow usecases and understanding little else, struggle to move the needle on project-wide, company-wide outcomes. But generalist agents don't add deep, customized value. What we need is both kinds of agents collaborating and sharing data with each other.
Much like human teams, we need specialist and generalist agents working together—as part of a team—to add the highest value.
A team of vertical, specialist agents reports to and is managed by a generalist, org-level “super agent” that has the 360 on an org similar to the way a manager does. The former share data and findings with the latter to build collective, system-level, layered intelligence.
Together, they form a goal-seeking system that can co-own and co-manage outcomes for, and in collaboration with, human teams.
Such a system has the intelligence to close the loop on usecases, e.g. if it follows up on a delegated task and gets no response, what does it do? If a blocker stalls progress, or an issue is uncovered, whom does it report to? It works and collaborates like a human team member.
This is full-stack, autonomous intelligence for the enterprise.
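A minimal sketch of the specialist/generalist structure described above. The class names, statuses, threshold and escalation rule are all hypothetical; the point is the shape of the loop: specialists report findings upward, and an org-level agent decides what to log, follow up on, or escalate to a human.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """A single observation a specialist agent passes up to the org-level agent."""
    specialist: str   # which specialist produced it
    task: str
    status: str       # "on_track", "no_response", "blocked"
    follow_ups: int   # how many times the specialist has already followed up

class OrgAgent:
    """Generalist 'super agent': aggregates findings and closes the loop."""

    def __init__(self, escalation_threshold: int = 2):
        self.escalation_threshold = escalation_threshold

    def handle(self, finding: Finding) -> str:
        if finding.status == "on_track":
            return f"log: {finding.task} is on track"
        if finding.status == "no_response" and finding.follow_ups < self.escalation_threshold:
            return f"action: ask {finding.specialist} to follow up again on {finding.task}"
        # Blocked work, or repeated silence, goes to a human with context attached.
        return f"escalate: {finding.task} ({finding.status}) -> notify project owner"

org_agent = OrgAgent()
findings = [
    Finding("code-review-agent", "payments API review", "no_response", follow_ups=1),
    Finding("sprint-agent", "checkout revamp milestone", "blocked", follow_ups=0),
    Finding("docs-agent", "release notes draft", "on_track", follow_ups=0),
]
for f in findings:
    print(org_agent.handle(f))
```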
Building effective, long-term human-AI collaboration for enterprise-grade work is a very hard technical problem.
This is a world where you can tell AI to go and “unblock the team” so a key project or goal is met, and it knows exactly what that means, what to do, who to talk to, which issues to pay attention to, what to do if someone doesn’t respond, when to escalate, and to whom.
It is a world where you can ask AI whether a high-priority project is "at risk of falling off track" and it can give you an accurate assessment as well as a smart recommendation for how to get things back on track. It can assess the current trajectory of a project, look at the history of past sprints and where common bottlenecks emerged, and consider human factors like which teammates are overstretched or which project is dangerously close to scope creep.
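As a narrow illustration of just the predictive slice of this, here is a minimal sketch of how such an assessment might combine signals. The signal names, weights and threshold are hypothetical; a real system would learn them from the team's own history rather than hard-code them.

```python
def project_risk_score(signals: dict) -> float:
    """Weighted combination of hypothetical risk signals, each scaled 0..1."""
    weights = {
        "behind_schedule": 0.35,         # slip vs. planned trajectory
        "historical_bottlenecks": 0.20,  # how often similar sprints stalled
        "team_overstretch": 0.25,        # load on the people involved
        "scope_creep": 0.20,             # growth in scope since kickoff
    }
    return sum(weights[k] * signals.get(k, 0.0) for k in weights)

signals = {
    "behind_schedule": 0.6,
    "historical_bottlenecks": 0.4,
    "team_overstretch": 0.8,
    "scope_creep": 0.3,
}
score = project_risk_score(signals)
print(f"risk score: {score:.2f} -> {'at risk' if score > 0.5 else 'on track'}")
```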
Capabilities like these add up to full-stack intelligence. Beyond execution intelligence (which most of us have now seen examples of), full-stack intelligence layers in predictive intelligence, corrective intelligence and retrospective intelligence.
This is AI that becomes an indispensable part of teams’ daily workflows by adding value that is deep, context-rich and enduring.