Large language models are undeniably smart, but why do they still “hallucinate” at the most critical moments?
A risk-control model spits out a plausible regulation that simply doesn’t exist. An internal knowledge Q&A returns a “close enough” but inaccurate compliance explanation. A technical document search pulls in an outdated version without warning…
This is not about the model being subpar — it’s about model inference lacking factual grounding.
Today, we are officially releasing the Yueli Knowledge Computation Engine (Yueli KGM Computing) — a dynamic scheduling middleware that combines a self-hosted inference orchestration layer with a compatibility gateway, open-sourced under the MIT license on GitHub. Its mission: to provide a deterministic knowledge anchor for trustworthy LLM reasoning.
Core Proposition: Boundaries First, Uniqueness Second
Yueli KGM Computing (hereafter KGM) is not yet another LLM application framework, nor is it meant to replace the vLLM or LangChain you are already using. Its positioning is crystal clear:
Self-hosted native inference + inference orchestration and compatibility gateway. You can self-deploy models suitable for enterprise private deployment scenarios and perform local inference; you can also orchestrate a mix of local inference and cloud MaaS services, with KGM acting as the gateway and orchestration layer, offloading the primary compute to external inference services. Furthermore, you can extend and develop on top of the open-source codebase to define your own enterprise scheduling routes and workflows.
In one sentence:
yueli-kgm-computing is the knowledge infrastructure layer for enterprise AI applications, making LLMs more trustworthy and reliable.
KGM is purpose-built for the following scenarios:
- Enterprise intelligent application development and algorithm service foundation
- Structured extraction of private enterprise data and automated knowledge graph construction
- Anchoring LLM inference to knowledge graph fact nodes (reducing hallucinations, increasing traceability)
- Unified semantic computation for multimodal content
- Providing standardized knowledge APIs for enterprise AI applications, supporting full-stack private deployment
- Unified encapsulation, offering data audit cost control and data security assurance
What’s Delivered: Four Clear Capability Lines
Once installed, @haxitag/yueli-kgm-computing gives you four things:
① Dual-Protocol HTTP Surface
Within the same process, two API sets are exposed simultaneously: OpenAI-compatible (/v1/chat/completions) and Anthropic-compatible (/v1/messages). Tool semantics are bidirectionally mapped — OpenAI’s tool_calls and Anthropic’s tool_use are automatically converted at the gateway layer.
For the business side, this means: one Base URL, two industry protocols, and zero client-side awareness of upstream differences.
② KGM Extension: Orchestration Knobs on the Same Request Body
In a standard OpenAI/Anthropic request body, you can optionally carry a kgm field that serves as a “progressive enhancement switch”. When omitted, KGM operates in passthrough mode (directly proxying the upstream SSE). When orchestration signals are present, it automatically switches to bridge streaming mode (KGM assembles SSE segments, injecting intermediate semantics from knowledge graphs, retrieval, tools, etc.).
This is a diversion approach, not an either/or choice — traffic that doesn’t need orchestration pays no orchestration cost.
③ Managed Runtime Control Plane
Artifact pulling, runtime lifecycle management, and inference-related metrics are all brought under unified management. KGM knows where each model artifact is, what state it’s in, and which runtime it runs on — you get an operable, observable control plane, not just a forwarding router.
④ In-Process Native Inference Engine
KGM includes its own NativeRuntimeEngine, capable of performing tensor forward pass and decoding within the same process. It’s important to note that this is fundamentally different from “replacing a vLLM cluster.” KGM honestly documents a four-tier capability boundary (A/B/C/D) in its docs, clearly indicating which paths are production-suitable and which are for regression validation.
Architects and developers: first connect an external engine to establish a passthrough baseline, then evaluate whether you need the in-process Native engine for target models.
Orchestration Core: How Cognitive Augmentation Works
KGM’s main execution pipeline breaks down “cognitive augmentation” into configurable, observable modules:
Context Management: runs memory retrieval, graph queries, and conversation history retrieval in parallel. Stable parts are cached, dynamic parts are incrementally updated — highly effective in multi-turn dialog scenarios.
Memory Management: a separated short-term/long-term memory system, written and retrieved via API, implicitly triggered within the ContextBuilder path.
Knowledge Graph Augmentation: triggered via kgm.graph.enabled=true, injects graph sub-query results into the context before inference so that retrieval results carry “contextual relationships” rather than relying solely on similarity.
Tool Orchestration: server-side multi-turn execution compatible, parses intent and executes tool calls, with responses carrying an audit trail. It also supports delegating tool execution to an external sandbox, enabling a “tool gating” design.
Multi-Provider Access: Unified Management of 30+ Mainstream LLM Providers
KGM covers 30+ mainstream LLM providers through LlmProviderFactory — from OpenAI, DeepSeek, Anthropic Claude, and Google Gemini to Alibaba Cloud Bailian, Volcano Ark, Zhipu GLM, Baidu Qianfan, and on-premise options like Ollama, vLLM, SGLang, and LM Studio. Switching is done with a single environment variable.
When multi-routing strategy is enabled, you can implement auditable routing rules through declarative JSON configuration — for instance, “sensitive tasks → intranet Ollama / complex reasoning → vLLM / long-context → OpenRouter.”
Production Deployment: Enterprise-Grade Engineering Reliability
KGM already delivers production-grade capabilities:
- Structured Logging: JSON format, automatic sensitive data masking
- Unified Error Handling: custom error types, stack traces hidden in production
- Circuit Breaker: circuit breaker pattern for external service calls, with monitorable state
- Database: SQLite for development, PostgreSQL recommended for production
- Observability: Prometheus-compatible metrics (latency, time-to-first-token, tokens per second, KV cache memory usage, queue depth, etc.)
- Graceful Shutdown: SIGTERM/SIGINT handling, completing in-flight requests
A minimal production startup takes just 5 minutes, and a Web Playground is included for managing skills, MCP connectors, and output templates.
Division of Labor with the Open-Source Ecosystem: Not Competition, but Layering
The most common question from technical teams — “I’m already using LangChain/LlamaIndex/vLLM. Do I still need KGM?” — has a clear answer: they operate at different layers.
| Dimension | LangChain | LlamaIndex | vLLM | Yueli KGM |
|---|---|---|---|---|
| Primary Positioning | LLM app framework | Data retrieval framework | High-perf inference engine | Inference orchestration + compatibility gateway |
| Unified Dual Protocol | Requires DIY | Requires DIY | Not provided | Native dual protocol, bidirectional tool mapping |
| Diversion Design | None | None | N/A | Native diversion — no orchestration, no added cost |
| Managed Control Plane | None | None | Standalone service | Native control plane |
| Knowledge Graph-constrained Reasoning | Needs custom integration | Basic support | None | Native KGM, deep integration |
Recommended composition patterns:
- vLLM/SGLang as the compute backbone + KGM as the protocol compatibility and orchestration layer
- LangChain/LlamaIndex as the application logic layer + KGM as the underlying unified HTTP entry point
- Dify or BotFactory as the low-code workflow layer + KGM for model routing and key management
Uniqueness with Engineering Honesty: Six Points, Fully Documented
Summarized from the repository’s capabilities.md:
① Single self-hosted surface unifying two industry protocols: reduces dual-stack maintenance costs
② “Diversion” rather than “either/or”: passthrough for non-orchestrated requests avoids unnecessary traffic rewriting
③ KGM extension as a progressive switch: supports an integration path of “proxy first, augment later”
④ Managed Runtime + cross-format recognition: oriented toward model asset governance, not just HTTP pass-through
⑤ Honest Native layered narrative (A→D): reduces the industry misconception that “parsing a config means it can run”
⑥ Operable: Prometheus metrics and automatic route auditing give the gateway layer SRE-grade observability
Value Propositions for Different Audiences
Enterprise IT Decision-Makers and Architects
KGM doesn’t solve “make the model smarter” — it solves “make enterprise AI infrastructure governable, auditable, and replaceable.”
Any LLM provider can be switched via an environment variable. Any business application only needs to interface with KGM’s unified API. This is an engineering path to reduce vendor lock-in risk and build an evolvable AI infrastructure.
Suggested evaluation path: deploy KGM for a single scenario (internal knowledge Q&A or API unification), establish a /metrics and passthrough baseline, verify observability and routing capabilities, and then decide whether to enable KGM extensions for orchestration-enhanced phases.
Enterprise Service Technical Teams and Software Engineers
KGM’s core engineering design philosophy: configuration-driven, not code-driven. Multi-provider routing is a JSON rule. Skills and MCP connectors are Playground configurations. A great deal of “dirty work” is abstracted into declarative configurations, eliminating the need to reinvent the wheel for every project.
MaaS Providers and Cloud Compute Vendors
KGM’s ProviderType registry already covers 30+ vendors — zero-cost out-of-the-box integration. KGM’s declarative routing, key management, circuit breaker, and Prometheus metrics mean customers can manage multi-cloud compute allocation within a single observable control plane.
As a middleware layer for protocol compatibility, orchestration routing, and control plane, KGM is open-sourced under MIT, available as an NPM package and source code. It supports low-complexity integration across different tech stacks including Golang, Python, and Rust, with unrestricted production deployment and modification rights.
The enterprise service solutions built by the HaxiTAG team are also integrated on top of yueli-kgm-computing. We welcome peers, partners, and talented developers to build upon KGM for private deployments, integration services, and industry solutions.
Final Words
Integrating AI into enterprise applications and production systems is no longer about “who has the strongest model,” but “who can make models work stably, trustworthily, and auditably in real business scenarios.”
yueli-kgm-computing’s answer: use the determinism of knowledge graphs to constrain the probabilistic nature of large language models.
This is not a minor technical patch — it’s the essential path for enterprise AI to move from “the lab” to “production.”