Architecture

h2oGPTe Architecture Diagram

Overview

h2oGPTe is an enterprise RAG (Retrieval-Augmented Generation) platform that provides secure document processing, conversational AI, agentic workflows, and deep research capabilities. The platform enables autonomous AI agents to perform complex multi-step tasks, conduct thorough research across large document collections, and execute sophisticated reasoning chains. The architecture follows a microservices pattern with PostgreSQL for data storage, Redis for caching, MinIO for object storage, and Keycloak for authentication. In the architecture diagram, green boxes indicate GPU-accelerated services.

Data Flow Patterns

The h2oGPTe platform implements several key data flow patterns that enable efficient document processing, intelligent retrieval, and real-time AI interactions. These patterns are designed to optimize performance, ensure data consistency, and provide seamless user experiences across different use cases.

1. Document Ingestion Flow

The document ingestion pipeline handles the complete lifecycle of document processing, from initial upload to final indexing. This flow supports multiple document formats (PDF, Word, Excel, images, etc.) and automatically extracts structured and unstructured content while preserving document layout and formatting. The pipeline includes intelligent chunking strategies, metadata extraction, and multi-modal content understanding.
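
To make the chunking stage concrete, here is a minimal sketch of a fixed-size chunker with overlap; the Chunk type, names, and parameter values are illustrative, not h2oGPTe's internal API (the platform's actual strategies are more intelligent than this baseline):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str
    offset: int  # character offset into the source document

def chunk_text(doc_id: str, text: str, size: int = 1000, overlap: int = 200) -> list[Chunk]:
    """Fixed-size chunking with overlap, a common baseline strategy."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(Chunk(doc_id, text[start:start + size], start))
        start += size - overlap
    return chunks

# upload -> parse -> chunk -> embed -> index
extracted_text = "..."  # produced by the parsing stage in the real pipeline
for chunk in chunk_text("doc-42", extracted_text):
    pass  # embed chunk.text and write the vector plus metadata to the index
```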

2. RAG Query Flow

The Retrieval-Augmented Generation (RAG) query flow combines semantic search with generative AI to provide accurate, contextual responses. This pattern leverages both vector similarity and lexical search to find relevant content, then uses advanced prompt engineering to generate responses that are grounded in your organization's data. The flow includes caching mechanisms for performance optimization and streaming for real-time user interaction.
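
A hybrid retrieve-then-generate pass can be sketched as follows; the Hit type and the injected vector_search, lexical_search, and llm_complete callables are hypothetical stand-ins for the VEX and h2oGPT calls described later on this page:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Hit:
    chunk_id: str
    text: str
    score: float

def answer(question: str,
           vector_search: Callable[[str, int], list[Hit]],
           lexical_search: Callable[[str, int], list[Hit]],
           llm_complete: Callable[[str], str],
           k: int = 5) -> str:
    dense = vector_search(question, k)    # embedding-similarity hits
    sparse = lexical_search(question, k)  # keyword/full-text hits
    merged = {h.chunk_id: h for h in dense + sparse}  # de-duplicate overlap
    top = sorted(merged.values(), key=lambda h: h.score, reverse=True)[:k]
    context = "\n\n".join(h.text for h in top)
    prompt = (f"Answer using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return llm_complete(prompt)  # streamed token-by-token in practice
```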

3. Agent Execution Flow

The agent execution flow enables autonomous AI agents to perform complex, multi-step tasks using a variety of tools and reasoning capabilities. Agents can plan their approach, execute tools iteratively, and adapt their strategy based on intermediate results. This pattern is essential for workflows that require decision-making, data analysis, or integration with external systems.
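
A schematic plan/act/observe loop illustrates the iterative pattern; the llm_decide helper and the tool callables are hypothetical, shown only to convey the shape of the flow:

```python
def run_agent(task: str, tools: dict, llm_decide, max_steps: int = 10):
    """Iteratively ask the LLM for the next action until it finishes."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        decision = llm_decide(history)          # choose a tool or finish
        if decision["action"] == "finish":
            return decision["answer"]
        tool = tools[decision["action"]]
        observation = tool(**decision["args"])  # execute the chosen tool
        history.append(f"{decision['action']} -> {observation}")  # adapt next step
    return "Step limit reached"
```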

Component Responsibilities

Each component in the h2oGPTe architecture has specific responsibilities and capabilities designed to work together as a cohesive system. The platform's modular design allows for independent scaling, maintenance, and enhancement of individual services while maintaining overall system integrity.

Note: Green boxes indicate GPU-accelerated services that leverage specialized hardware for compute-intensive AI operations.

h2oGPT Service

The h2oGPT service acts as the LLM abstraction layer for all AI text generation operations in h2oGPTe.

  • LLM Routing: All LLM requests from h2oGPTe go to h2oGPT at configurable endpoints (resolved as in the sketch after this list):
    • H2OGPTE_CORE_LLM_ADDRESS: Primary LLM endpoint
    • H2OGPTE_CORE_OPENAI_ADDRESS: OpenAI-compatible API endpoint
    • H2OGPTE_CORE_AGENT_SHARED_ADDRESS: Shared agent service endpoint
    • H2OGPTE_CORE_AGENT_ISOLATED_ADDRESS: Isolated agent service endpoint
    • H2OGPTE_CORE_LITE_LLM_ADDRESS: LiteLLM proxy endpoint
  • Deployment Flexibility:
    • Internal deployment: when h2oGPT runs within the cluster/network, it provides access to multiple configured LLMs
    • External deployment: h2oGPTe points to an external h2oGPT instance with pre-configured LLM choices
  • LLM Abstraction: Abstracts different LLM providers (vLLM, text-generation-inference, Replicate, Azure, OpenAI, AWS Bedrock, H2O MLOps)
  • Prompt Engineering: Handles prompt optimization and template management for different LLMs
  • API Support: Provides both text completion and chat APIs with custom context and prompts
  • Document Processing: Map/reduce API built on LangChain for document processing workflows
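
A minimal sketch of how a client might resolve these endpoints from the environment; the default URLs are illustrative (the internal ports match the container breakdown later on this page), not shipped values:

```python
import os

ENDPOINTS = {
    "llm":            os.environ.get("H2OGPTE_CORE_LLM_ADDRESS", "http://h2ogpt-openai:5000"),
    "openai":         os.environ.get("H2OGPTE_CORE_OPENAI_ADDRESS", "http://h2ogpt-openai:5000/v1"),
    "agent_shared":   os.environ.get("H2OGPTE_CORE_AGENT_SHARED_ADDRESS", "http://h2ogpt-agent-shared:5004"),
    "agent_isolated": os.environ.get("H2OGPTE_CORE_AGENT_ISOLATED_ADDRESS", "http://h2ogpt-agent-isolated:5006"),
    "litellm":        os.environ.get("H2OGPTE_CORE_LITE_LLM_ADDRESS", "http://h2ogpt-openai:5020"),
}
```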

Frontend (UI)

The React-based frontend provides a modern, responsive user interface built with TypeScript and Tailwind CSS. It serves as the primary interaction point for users, offering an intuitive experience for both technical and non-technical users.

  • User Interface: Interactive chat interface with markdown rendering, document viewer with highlighting, collection management dashboard, and admin panels
  • Real-time Communication: WebSocket-based streaming for live AI responses, progress indicators for long-running operations, and collaborative features
  • File Management: Drag-and-drop file upload with progress tracking, batch document processing, and preview and download capabilities
  • Configuration: User preferences and settings management, theme customization, API key management, and workspace configuration

Mux (API Gateway)

The Mux gateway, written in Go, serves as the API gateway and authentication layer for all client requests.

  • Authentication: OIDC integration with Keycloak, JWT token validation, session management, guest user support with device fingerprinting
  • Authorization: Role-based access control (RBAC), API key management, license validation
  • Database Integration: PostgreSQL connection pooling with trusted/untrusted user separation, row-level security enforcement
  • Request Routing: HTTP/WebSocket routing to backend services, Redis pub/sub for multi-instance synchronization
  • File Operations: File upload/download handling, streaming support for large documents
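
As a Python illustration of the JWKS-based validation Mux performs (Mux itself is written in Go), here is a minimal sketch using the PyJWT library; the Keycloak realm name and audience are placeholders:

```python
import jwt
from jwt import PyJWKClient

# Keycloak publishes its signing keys at a JWKS endpoint; realm and
# audience below are placeholders for a real deployment's values.
JWKS_URL = "http://keycloak:8080/realms/h2ogpte/protocol/openid-connect/certs"
jwks = PyJWKClient(JWKS_URL)

def validate(token: str) -> dict:
    key = jwks.get_signing_key_from_jwt(token)  # selects the key by `kid`
    return jwt.decode(token, key.key, algorithms=["RS256"], audience="h2ogpte")
```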

Core Service

The Core service, implemented in Python, acts as the central orchestrator for document processing and LLM interactions.

  • Orchestration: Coordinates document processing workflows and service interactions
  • LLM Management: Handles requests to the h2oGPT service for LLM interactions
  • Configuration: Manages system-wide settings and environment configurations
  • Encryption: Provides encryption/decryption services for sensitive data
  • File Server: Built-in file serving capabilities for document access

VEX Service

The VEX service provides vector search and indexing capabilities with support for multiple backends.

  • Vector Search: Similarity search using embeddings, HNSW algorithm for internal backend
  • Full-text Search: Text search capabilities alongside vector search
  • Backend Support: Internal backends (HNSW, SQLite); external backends (Elasticsearch, Milvus, Qdrant, Redis)
  • Document Processing: Chunking strategies, embedding generation
  • FastAPI Interface: REST API built with FastAPI and uvicorn
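
The HNSW behavior of the internal backend can be demonstrated with the standalone hnswlib library; this is an analogy for what VEX does internally, not its actual code, and the dimensions and parameters are illustrative:

```python
import hnswlib
import numpy as np

dim, n = 384, 10_000  # e.g., sentence-embedding vectors
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)  # graph build params

vectors = np.random.rand(n, dim).astype(np.float32)
index.add_items(vectors, ids=np.arange(n))

index.set_ef(64)  # query-time recall/latency trade-off
labels, distances = index.knn_query(np.random.rand(dim).astype(np.float32), k=5)
```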

Crawl Service

The Crawl service handles document ingestion from various enterprise sources.

  • Connectors: SharePoint (on-premises and Online), Azure Blob Storage, Google Cloud Storage, AWS S3, local file system
  • Document Processing: Document parsing and metadata extraction, with integration into the Parse service for text extraction
  • Chunking: Document chunking for vector indexing
  • Ingestion Pipeline: Batch and incremental ingestion support
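
A hypothetical incremental-crawl sketch for the S3 connector using boto3; the ingest callable and the seen-set bookkeeping stand in for the real parse/chunk/index pipeline:

```python
import boto3

s3 = boto3.client("s3")

def crawl_bucket(bucket: str, prefix: str, seen: set, ingest) -> None:
    """List objects and hand new or changed ones to the ingest callable."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key, etag = obj["Key"], obj["ETag"]
            if (key, etag) in seen:  # unchanged since the last crawl
                continue
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            ingest(key, body)        # parse -> chunk -> index
            seen.add((key, etag))
```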

Chat Service

The Chat service manages conversation sessions and RAG pipeline execution.

  • Session Management: Chat session creation and tracking, conversation history storage in PostgreSQL
  • RAG Pipeline: Document retrieval from VEX service, context injection for LLM prompts
  • Real-time Communication: WebSocket support for streaming responses
  • Integration: Coordinates with Core service for LLM interactions
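
A hypothetical streaming client shows the WebSocket pattern; the URL and message shape are illustrative, not the documented h2oGPTe wire protocol:

```python
import asyncio
import json
import websockets

async def stream_chat(session_id: str, question: str) -> None:
    # Requires a running server; URL and payload fields are placeholders.
    async with websockets.connect("ws://localhost:8080/ws") as ws:
        await ws.send(json.dumps({"session_id": session_id, "message": question}))
        async for frame in ws:  # tokens arrive as they are generated
            event = json.loads(frame)
            print(event.get("delta", ""), end="", flush=True)
            if event.get("done"):
                break

asyncio.run(stream_chat("demo-session", "Summarize the Q3 report."))
```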

Parse Service

The Parse service (integrated within Crawl) provides document parsing and text extraction.

  • Format Support: PDF, Word, Excel, PowerPoint, HTML, images, and various text formats
  • OCR Capabilities: Optical character recognition for scanned documents, multi-language support
  • Text Extraction: Structured text extraction from documents, metadata extraction
  • Integration: Embedded within the crawl service pipeline
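
For the scanned-document path, a minimal OCR sketch with Tesseract (via pytesseract) shows how multi-language extraction works in principle; the file name and language packs are placeholders:

```python
from PIL import Image
import pytesseract

# Tesseract extracts text from a page image; installed language packs
# ("eng+deu" here) enable multi-language OCR.
page = Image.open("scanned_page.png")
text = pytesseract.image_to_string(page, lang="eng+deu")
print(text)
```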

Models Service

The Models service provides GPU-accelerated model serving capabilities.

  • GPU Support: CUDA support for GPU acceleration, configurable GPU device allocation
  • Resource Management: Memory limits and resource constraints, container-based deployment
  • Integration: Works with external h2oGPT service for LLM capabilities
  • PII Detection: Built-in PII detection and redaction capabilities
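
A small PyTorch sketch shows what CUDA_VISIBLE_DEVICES-style GPU allocation looks like from inside a container:

```python
import os
import torch

# CUDA_VISIBLE_DEVICES restricts which GPUs the process can see at all.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"cuda:{i} ->", torch.cuda.get_device_name(i))
else:
    print("No GPU visible; falling back to CPU.")
```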

Deployment Architecture

The h2oGPTe platform is designed for flexible deployment across various environments, from single-node development setups to large-scale production clusters. The architecture supports containerized deployment using Docker and Kubernetes, with comprehensive configuration management and monitoring capabilities.

Container Architecture

The platform utilizes a containerized microservices architecture that ensures consistency across environments and simplifies deployment and scaling. Each service runs in its own container with clearly defined interfaces and dependencies.

Deployment Options

Docker Compose Deployment

  • Services: h2ogpte-app (Python services), h2ogpte-mux (Go gateway), h2ogpte-ui (React frontend)
  • Infrastructure: PostgreSQL, Redis, MinIO, Keycloak
  • Optional Services: vLLM server for local LLM hosting, Milvus/Elasticsearch for vector search

Kubernetes Deployment

  • Helm Charts: Available for production deployments
  • Multi-instance Support: Redis pub/sub for service synchronization

Configuration Management

  • Environment Variables: Extensive configuration via environment variables
  • Settings Service: Centralized configuration management
  • Feature Flags: Runtime feature toggling support
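
A sketch of environment-variable-driven configuration; H2OGPT_WORKERS and H2OGPT_BROWSER_COOKIES are documented later on this page, while the env_flag helper is hypothetical:

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Interpret common truthy strings as a boolean feature flag."""
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

WORKERS = int(os.environ.get("H2OGPT_WORKERS", "1"))
BROWSER_COOKIES = env_flag("H2OGPT_BROWSER_COOKIES")  # feature-flag-style toggle
```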

Security Architecture

Security is built into every layer of the h2oGPTe platform, implementing defense-in-depth strategies to protect sensitive data and ensure compliance with enterprise security requirements.

Authentication and Identity Management

  • OIDC/OAuth2: Full OpenID Connect and OAuth 2.0 implementation for enterprise SSO
  • JWT Tokens: Token-based authentication with JWKS validation and key rotation
  • OAuth2 Flows: Authorization code flow with PKCE, token refresh, and token exchange
  • Session Management: Redis-backed session storage with automatic token refresh
  • API Keys: Database-stored API keys for programmatic access
  • Guest Users: Device fingerprinting for anonymous access
  • Enterprise IdP Support: Integration with Okta, Azure AD, Keycloak, and other OIDC providers

Authorization and Access Control

  • PostgreSQL RLS: Row-level security policies for data isolation
  • RBAC System: Database-backed roles and permissions
  • User Types: Regular users, admin users, guest users
  • Trusted/Untrusted Connections: Separate database connection pools based on trust level
  • Audit Logging: Database-level audit trails
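
The trusted/untrusted split can be sketched with two psycopg2 connection pools; the DSNs, role names, and table are hypothetical, with the untrusted pool connecting as a low-privilege role subject to row-level security:

```python
from psycopg2 import pool

trusted = pool.SimpleConnectionPool(1, 10, dsn="dbname=h2ogpte user=service_role")
untrusted = pool.SimpleConnectionPool(1, 50, dsn="dbname=h2ogpte user=app_user")

conn = untrusted.getconn()
try:
    with conn.cursor() as cur:
        cur.execute("SELECT id, name FROM collections")  # RLS policies filter rows here
        rows = cur.fetchall()
finally:
    untrusted.putconn(conn)
```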

Data Security

  • Encryption: Configurable encryption for sensitive data
  • PII Detection: Built-in PII detection and redaction
  • Secure Cookies: SameSite policies and secure flag
  • Object Storage: MinIO with bucket-level access controls

Infrastructure Security

  • Container Security: Docker-based isolation
  • Service Communication: Internal service authentication
  • Rate Limiting: Built-in rate limiting and cost controls
  • License Validation: Enterprise license enforcement
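
A token-bucket sketch conveys the rate-limiting idea; the capacity and refill rate are illustrative, not h2oGPTe's configured limits:

```python
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = capacity
        self.refill = refill_per_sec
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(capacity=10, refill_per_sec=2)  # roughly 2 requests/second
print([bucket.allow() for _ in range(12)])           # the last calls are throttled
```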

Storage Architecture

  • PostgreSQL: Primary database with 150+ migrations and extensive stored procedures
  • Redis: Caching layer and pub/sub messaging
  • MinIO Buckets: Documents, collections, user data, shared data, agent tools
  • Cloud Storage Support: S3, Azure Blob, Google Cloud Storage integration
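
Uploading a document to a MinIO bucket with the official Python client might look like this; the endpoint, credentials, and bucket/object names are placeholders:

```python
from minio import Minio

client = Minio("minio:9000", access_key="...", secret_key="...", secure=False)

with open("report.pdf", "rb") as f:
    # length=-1 with a part_size enables a streamed multipart upload.
    client.put_object("documents", "user-42/report.pdf", f,
                      length=-1, part_size=10 * 1024 * 1024)
```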

h2oGPT Services Architecture

When h2oGPT is deployed as part of the h2oGPTe stack, it runs as a distributed set of services that handle different aspects of LLM operations. This microservices architecture allows for better resource allocation, scalability, and fault tolerance.

Docker Container Service Breakdown

h2ogpt-openai Container

Container Role: Primary LLM interface and request routing hub

Internal Services:

  • OpenAI API Server (:5000): Main LLM endpoint, OpenAI-compatible API
  • LiteLLM Proxy (:5020): Multi-provider LLM routing (OpenAI, Azure, Bedrock, etc.)
  • Gradio UI (:7860): Web interface for direct LLM interaction

Resource Profile: CPU-optimized, no GPU requirements

Memory Limit: 32GB
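
Because the OpenAI API server on :5000 is OpenAI-compatible, the stock OpenAI Python client can talk to it; the /v1 path, API key, and model name below are illustrative assumptions:

```python
from openai import OpenAI

# Base URL path, key, and model name are placeholders for your
# deployment's configuration.
client = OpenAI(base_url="http://h2ogpt-openai:5000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="<configured-model-name>",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
```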

h2ogpt-function Container

Container Role: GPU-accelerated compute and specialized AI functions

Internal Services:

  • Function Server (:5002): Tool execution, function calls, GPU-intensive tasks
  • Function OpenAI API (:5005): Specialized API for speech-to-text (STT) and image generation
  • Function Gradio UI (:7860 in-container, mapped to :7861 on the host): Web interface for function-specific operations

Resource Profile: GPU-accelerated with NVIDIA runtime support

Memory Limit: 32GB

GPU Configuration: Configurable via CUDA_VISIBLE_DEVICES

h2ogpt-agent-shared Container

Container Role: Multi-worker agent execution environment

Internal Services:

  • Agent Server (:5004): Multi-worker agent operations with configurable worker count

Resource Configuration:

  • Memory Limit: 64GB
  • Worker Count: Configurable via ${H2OGPT_WORKERS}
  • Docker-out-of-Docker support for AutoGen via /var/run/docker.sock mount

Communication Pattern:

  • Connects to h2ogpt-openai:5000 for LLM requests
  • Connects to h2ogpt-openai:5020 for multi-provider routing
  • Connects to h2ogpt-function:5002 for tool execution
  • Connects to h2ogpt-function:5005 for STT/ImageGen operations
  • Uses IMAGEGEN_OPENAI_BASE_URL and STT_OPENAI_BASE_URL for specialized AI functions

h2ogpt-agent-isolated Container

Container Role: Single-tenant isolated agent execution

Internal Services:

  • Agent Server (:5006): Single-request agent with strict isolation

Resource Configuration:

  • Memory Limit: 64GB
  • Worker Count: Fixed at 1 (H2OGPT_AGENT_WORKERS: "1")
  • Docker-out-of-Docker support for AutoGen via /var/run/docker.sock mount

Isolation Configuration:

  • NUM_CONCURRENT_REQUESTS: "1" enforces single request processing
  • CONCURRENT_REQUESTS_BEHAVIOR: "reject" prevents request queuing
  • Dedicated resource allocation ensures no interference between requests
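
The reject-rather-than-queue semantics of these two settings can be sketched with a non-blocking semaphore check:

```python
import asyncio

slot = asyncio.Semaphore(1)  # NUM_CONCURRENT_REQUESTS = 1

async def handle(request_id: str) -> str:
    if slot.locked():                 # CONCURRENT_REQUESTS_BEHAVIOR = reject
        return f"{request_id}: rejected, agent busy"
    async with slot:
        await asyncio.sleep(1)        # stand-in for agent work
        return f"{request_id}: done"

async def main():
    # The second concurrent request is rejected instead of queued.
    print(await asyncio.gather(handle("a"), handle("b")))

asyncio.run(main())
```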

Communication Pattern: Same as agent-shared but with guaranteed isolation

Network Architecture

All h2oGPT services communicate within the h2ogpt-network bridge network, enabling:

  • Service Discovery: Docker DNS resolution between containers
  • Internal APIs: Direct service-to-service communication on internal ports
  • Load Balancing: Automatic failover and request distribution
  • Security: Isolated network segment with controlled external access

Resource Allocation and Deployment Configuration

Memory Limits

  • h2ogpt-openai: 32GB memory limit (CPU-optimized for API routing)
  • h2ogpt-function: 32GB memory limit (GPU-accelerated for compute tasks)
  • h2ogpt-agent-shared: 64GB memory limit (multi-worker agent processing)
  • h2ogpt-agent-isolated: 64GB memory limit (single-tenant isolation)

Worker Configuration

  • OpenAI Service: Configurable workers via ${H2OGPT_WORKERS}
  • Function Service: Fixed single worker (H2OGPT_FUNCTION_SERVER_WORKERS: "1")
  • Agent Shared: Configurable workers via ${H2OGPT_WORKERS}
  • Agent Isolated: Fixed single worker (H2OGPT_AGENT_WORKERS: "1")

Special Capabilities

  • Browser Integration: Agent containers support browser cookies (H2OGPT_BROWSER_COOKIES: "1")
  • GPU Access: Function container configurable via CUDA_VISIBLE_DEVICES environment variable
