Knowledge bases provide agents with access to domain-specific information, documents, and organizational knowledge through semantic search.

Available Knowledge Base Types

Milvus

Vector Database: Production-ready vector database with semantic search, hybrid search, and query expansion.

UKnow

Cloud Storage Search: Search documents in SharePoint, OneDrive, Google Drive, and Confluence via the UKnow API.

Milvus Knowledge Base

Milvus is a vector database that enables semantic search over your documents using dense vector embeddings. Knowledge bases support customizable chunking strategies, embedding models, and structured data processing.

Basic Configuration

agents:
  - name: "support_agent"
    agent_type: "llm_agent"
    
    knowledge_bases:
      - name: "company_docs"
        knowledge_base_type: "milvus"
        id: "kb_abc123"
        config:
          top_k: 10
          search_ef: 64
          strategy: "dense"

Creating with API

Knowledge bases are created via the API with full control over processing configuration. See Create Knowledge Base API for complete documentation.

Instance Fields

name
string
required
Unique identifier for this knowledge base instance. Must contain only alphanumeric characters, hyphens, and underscores.
knowledge_base_type
string
required
Knowledge base type: "milvus" or "uknow"
enabled
boolean
default:"true"
Toggle the knowledge base on or off without removing it from configuration
id
string
KnowledgeBase model ID for database lookup (required for Milvus to resolve collection info)
config
object
default:"{}"
Adapter-specific retrieval configuration (see below)
description
string
Human-readable description of the knowledge base
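
The character rule for name can be checked locally before a configuration is submitted. The following is a minimal sketch, assuming the rule means ASCII letters, digits, hyphens, and underscores with at least one character; is_valid_kb_name is a hypothetical helper and not part of the platform.

```python
import re

# Assumed reading of the rule: one or more ASCII letters, digits,
# hyphens, or underscores, and nothing else.
_KB_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_kb_name(name: str) -> bool:
    """Return True if the name uses only the permitted characters."""
    return _KB_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_kb_name("company_docs"))
print(is_valid_kb_name("company docs"))
```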

Milvus Retrieval Configuration

The config object controls how documents are retrieved from Milvus:
config.top_k
integer
default:"10"
Number of most relevant results to return. Range: 1-1000
config.strategy
string
default:"dense"
Search strategy for retrieval. Options:
  • dense - Vector-based semantic search only (default, fastest)
  • hybrid - Combines dense vector + BM25 keyword search
  • bm25 - BM25 keyword-based search only
  • hybrid_rrf - Hybrid with Reciprocal Rank Fusion for improved ranking
config.search_ef
integer
HNSW search parameter controlling the accuracy vs. speed trade-off. Higher values are more accurate but slower.
config.score_threshold
float
Minimum similarity score threshold. Range: 0.0-1.0
Only results above this score are returned.
config.offset
integer
default:"0"
Number of results to skip (for pagination)
config.embedding_provider
string
default:"mistral"
Embedding provider for vectorization: "mistral", "azure_openai", or "telekom_otc"
config.embedding_model
string
default:"mistral-embed"
Embedding model name
config.query_model
string
LLM model for query expansion. Must be one of the available LLM models.
config.max_num_query_expansions
integer
default:"0"
Number of expanded queries to generate (0-5). When greater than 0, uses the query_model to generate variations of the original query for improved recall.
config.filter_expression
string
Raw Milvus expression for pre-search filtering. Supports Milvus-native operators like ARRAY_CONTAINS.
config.filters
object
Haystack-style metadata filters for document retrieval.
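
The documented ranges and options above can be checked before a configuration is deployed. This is a sketch only: validate_retrieval_config is a hypothetical helper, it covers just the constraints listed in this section, and the platform may enforce additional rules.

```python
ALLOWED_STRATEGIES = {"dense", "hybrid", "bm25", "hybrid_rrf"}


def validate_retrieval_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means every check here passed."""
    problems: list[str] = []

    top_k = config.get("top_k", 10)  # documented default: 10
    if not 1 <= top_k <= 1000:
        problems.append(f"top_k must be in 1-1000, got {top_k}")

    strategy = config.get("strategy", "dense")  # documented default: dense
    if strategy not in ALLOWED_STRATEGIES:
        problems.append(f"unknown strategy: {strategy!r}")

    threshold = config.get("score_threshold")  # optional, no documented default
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        problems.append(f"score_threshold must be in 0.0-1.0, got {threshold}")

    expansions = config.get("max_num_query_expansions", 0)  # documented default: 0
    if not 0 <= expansions <= 5:
        problems.append(f"max_num_query_expansions must be in 0-5, got {expansions}")

    return problems


print(validate_retrieval_config({"top_k": 0, "strategy": "fuzzy"}))
```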

Usage Examples

Basic Knowledge Base Agent

agents:
  - name: "support_agent"
    agent_type: "llm_agent"
    
    knowledge_bases:
      - name: "support_kb"
        knowledge_base_type: "milvus"
        id: "kb_support_001"
        config:
          top_k: 5
    
    system_prompt: |
      You are a customer support agent with access to our support documentation.
      
      Always search the knowledge base first for answers to customer questions.
      Provide accurate information based on our documentation.

Hybrid Search with RRF Fusion

agents:
  - name: "research_agent"
    agent_type: "react_agent"
    
    tools:
      - "tavily_search"
    
    knowledge_bases:
      - name: "research_kb"
        knowledge_base_type: "milvus"
        id: "kb_research_001"
        config:
          strategy: "hybrid_rrf"  # Combines dense + BM25 with RRF
          top_k: 20
          search_ef: 128
          score_threshold: 0.7
    
    system_prompt: |
      You are a research assistant with access to internal research papers.
      
      Search the knowledge base for relevant research before using web search.
      Prioritize high-quality, relevant results.
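
For intuition about the hybrid_rrf strategy, Reciprocal Rank Fusion gives each document a score of 1 / (k + rank) in every ranked list it appears in and sums those scores across lists. The sketch below illustrates the general technique only. The constant k = 60 is a commonly used value and an assumption here; the value and implementation used by the platform are not documented in this section.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs (each ordered best-first) into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]  # hypothetical dense (vector) ranking
bm25_hits = ["b", "c", "a"]   # hypothetical BM25 (keyword) ranking
print(reciprocal_rank_fusion([dense_hits, bm25_hits]))  # ['b', 'a', 'c']
```

Document "b" wins because it ranks near the top of both lists, even though it is first in only one of them.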

Query Expansion

agents:
  - name: "advanced_support_agent"
    agent_type: "llm_agent"
    
    knowledge_bases:
      - name: "product_kb"
        knowledge_base_type: "milvus"
        id: "kb_product_001"
        config:
          strategy: "dense"
          top_k: 15
          search_ef: 96
          
          # Query expansion for better recall
          query_model: "gpt-4o"
          max_num_query_expansions: 3
    
    system_prompt: |
      You are an expert product support agent.
      Use query expansion to find relevant documentation.

UKnow Knowledge Base

agents:
  - name: "document_agent"
    agent_type: "llm_agent"
    
    knowledge_bases:
      - name: "sharepoint_docs"
        knowledge_base_type: "uknow"
        config:
          username: "user@company.com"
          drive_key: "SP"
          drive_ids: ["drive_abc123"]
          search_options:
            search_type: "similarity"
            fetch_k: 10
    
    system_prompt: |
      You have access to company SharePoint documents.
      Search the knowledge base to find relevant information.

Multiple Knowledge Bases

agents:
  - name: "comprehensive_agent"
    agent_type: "llm_agent"
    
    knowledge_bases:
      - name: "technical_docs"
        knowledge_base_type: "milvus"
        id: "kb_tech_001"
        config:
          top_k: 10
      
      - name: "company_policies"
        knowledge_base_type: "milvus"
        id: "kb_policy_001"
        config:
          top_k: 5
    
    system_prompt: |
      You have access to multiple knowledge bases:
      - Technical documentation
      - Company policies
      
      Search the appropriate knowledge base based on the question type.

Chunking Strategies

Knowledge bases support three chunking strategies configured during creation:

Recursive (Default)

Splits text using multiple separators in order (paragraphs → sentences → words). Best for general documents.
Parameters:
  • strategy: "recursive"
  • chunk_size: Size in words/characters (default: 500)
  • chunk_overlap: Overlap between chunks (default: 50)
  • split_by: "word" | "char" (only these two options)
  • recursive_separators: Array of separators to try in order
Example:
{
  "strategy": "recursive",
  "chunk_size": 500,
  "chunk_overlap": 50,
  "split_by": "word",
  "recursive_separators": ["\n\n", "\n", ". ", " "]
}
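
To illustrate how trying separators in order works, here is a simplified sketch of recursive splitting measured in words. It is not the platform's implementation: recursive_split is a hypothetical function, and chunk_overlap is left out to keep the idea visible.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text into pieces of at most chunk_size words, trying separators in order."""
    if len(text.split()) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut on word boundaries.
        words = text.split()
        return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

    head, rest = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for part in text.split(head):
        candidate = f"{current}{head}{part}" if current else part
        if len(candidate.split()) <= chunk_size:
            current = candidate  # keep merging small parts into one chunk
            continue
        if current:
            chunks.append(current)
        if len(part.split()) > chunk_size:
            # The part alone is still too large: retry with finer separators.
            chunks.extend(recursive_split(part, chunk_size, rest))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks


sample = "one two three.\n\nfour five six seven eight."
print(recursive_split(sample, 3, ["\n\n", "\n", ". ", " "]))
# ['one two three.', 'four five six', 'seven eight.']
```

The first paragraph fits in one chunk, so it is kept whole; the second is too long and is split again at the next separator that actually divides it.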

Hierarchical

Creates multi-level chunks preserving document structure. Ideal for academic papers and structured documents.
Parameters:
  • strategy: "hierarchical"
  • hierarchical_block_sizes: Descending array of block sizes (e.g., [700, 350, 150])
  • chunk_overlap: Overlap between chunks
  • split_by: "word" | "sentence"
Example:
{
  "strategy": "hierarchical",
  "chunk_overlap": 70,
  "split_by": "word",
  "hierarchical_block_sizes": [700, 350, 150]
}
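
One way to picture multi-level chunks is as a tree in which each block at one level is split into smaller blocks at the next. The sketch below is an assumption about the general idea, not the platform's algorithm: hierarchical_blocks is a hypothetical function, it splits by word only, and it ignores chunk_overlap.

```python
def hierarchical_blocks(words: list[str], block_sizes: list[int]) -> list[dict]:
    """Split words into nested blocks; each block records its level and parent index."""
    if block_sizes != sorted(block_sizes, reverse=True):
        raise ValueError("block sizes must be in descending order")

    # Level 0 is the whole document and is the root of the tree.
    blocks: list[dict] = [{"level": 0, "parent": None, "words": words}]
    frontier = [0]  # indices of the blocks to split at the next level
    for level, size in enumerate(block_sizes, start=1):
        next_frontier = []
        for parent_idx in frontier:
            parent_words = blocks[parent_idx]["words"]
            for start in range(0, len(parent_words), size):
                blocks.append({
                    "level": level,
                    "parent": parent_idx,
                    "words": parent_words[start:start + size],
                })
                next_frontier.append(len(blocks) - 1)
        frontier = next_frontier
    return blocks


doc_words = [f"w{i}" for i in range(10)]
for block in hierarchical_blocks(doc_words, [4, 2]):
    print(block["level"], block["parent"], len(block["words"]))
```

Because every small block points at its parent, a retriever can match on a small block and still recover the larger surrounding context.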

Fixed

Simple fixed-size chunks. Fastest processing for straightforward documents.
Parameters:
  • strategy: "fixed"
  • chunk_size: Fixed chunk size
  • chunk_overlap: Overlap between chunks
  • split_by: "word" | "sentence"
Example:
{
  "strategy": "fixed",
  "chunk_size": 500,
  "chunk_overlap": 50,
  "split_by": "word"
}
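
The interaction between chunk_size and chunk_overlap can be seen in a small sliding-window sketch. fixed_chunks is a hypothetical function that splits by word; it is meant to illustrate the parameters and may differ from how the platform handles edge cases.

```python
def fixed_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into word windows of chunk_size that share chunk_overlap words."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


print(fixed_chunks("a b c d e f g", chunk_size=4, chunk_overlap=2))
# ['a b c d', 'c d e f', 'e f g']
```

Each chunk repeats the last two words of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.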

Structured Data Processing

CSV and Excel files (.csv, .xlsx, .xls) support specialized processing:

Configuration

{
  "document_config": {
    "strategy": "recursive",
    "chunk_size": 500,
    "chunk_overlap": 50,
    "split_by": "word"
  },
  "structured_config": {
    "rows_per_batch": 10,
    "table_format": "csv",
    "csv_content_column": "text",
    "csv_conversion_mode": "row"
  }
}

Structured Config Parameters

rows_per_batch
integer
default:"10"
Number of rows to combine into one searchable chunk (1-20).
Lower values = more precise retrieval, slower ingestion
Higher values = faster ingestion, broader context
table_format
string
default:"csv"
Output format for table data. Options:
  • csv - Comma-separated values
  • markdown - Markdown table format
csv_content_column
string
default:"text"
Column name containing the main text content (CSV only). Required for row mode processing.
csv_conversion_mode
string
default:"row"
Processing mode for CSV files. Options:
  • row - One document per row (precise retrieval)
  • file - One document per file (holistic context)
use_streaming
boolean
default:"false"
If enabled, uses openpyxl's streaming mode for Excel processing to avoid loading entire sheets into memory. Recommended for Excel files larger than 50MB.
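
To make rows_per_batch and table_format more concrete, the sketch below groups data rows into batches and renders each batch as CSV or as a Markdown table. batch_rows is a hypothetical function, and repeating the header row in every chunk is an assumption made so that each chunk is readable on its own; the platform's actual output may differ.

```python
import csv
import io


def batch_rows(csv_text: str, rows_per_batch: int = 10, table_format: str = "csv") -> list[str]:
    """Group data rows into batches and render each batch with the header row."""
    if not 1 <= rows_per_batch <= 20:
        raise ValueError("rows_per_batch must be in 1-20")
    if table_format not in ("csv", "markdown"):
        raise ValueError("table_format must be 'csv' or 'markdown'")

    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, data = rows[0], rows[1:]

    chunks: list[str] = []
    for start in range(0, len(data), rows_per_batch):
        batch = data[start:start + rows_per_batch]
        if table_format == "markdown":
            lines = ["| " + " | ".join(header) + " |",
                     "| " + " | ".join("---" for _ in header) + " |"]
            lines += ["| " + " | ".join(row) + " |" for row in batch]
            chunks.append("\n".join(lines))
        else:
            buffer = io.StringIO()
            csv.writer(buffer, lineterminator="\n").writerows([header, *batch])
            chunks.append(buffer.getvalue().rstrip("\n"))
    return chunks


table = "id,text\n1,a\n2,b\n3,c\n"
print(batch_rows(table, rows_per_batch=2))
# ['id,text\n1,a\n2,b', 'id,text\n3,c']
```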

Embedding Providers

Choose your embedding model during knowledge base creation:

Azure OpenAI

{
  "provider": "azure_openai",
  "model": "text-embedding-ada-002",
  "dimensions": 1536
}
Available model:
  • text-embedding-ada-002 (1536 dimensions)

Mistral AI

{
  "provider": "mistral",
  "model": "mistral-embed",
  "dimensions": 1024
}
Available model:
  • mistral-embed (1024 dimensions)

Telekom OTC

{
  "provider": "telekom_otc",
  "model": "text-embedding-bge-m3",
  "dimensions": 1024
}
Available models:
  • text-embedding-bge-m3 (1024 dimensions) - BGE multilingual model
  • jina-embeddings-v2-base-de (768 dimensions) - German-optimized
  • jina-embeddings-v2-base-code (768 dimensions) - Code-optimized
  • tsi-embedding-colqwen2-2b-v1 (1024 dimensions) - TSI ColQwen2
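
Since each model has a fixed dimension, a lookup table avoids mismatched provider, model, and dimensions values. The sketch below only restates the models listed above; embedding_config is a hypothetical helper, and the returned dictionary mirrors the JSON shape shown in this section.

```python
# (provider, model) -> dimensions, copied from the lists in this section.
EMBEDDING_DIMENSIONS = {
    ("azure_openai", "text-embedding-ada-002"): 1536,
    ("mistral", "mistral-embed"): 1024,
    ("telekom_otc", "text-embedding-bge-m3"): 1024,
    ("telekom_otc", "jina-embeddings-v2-base-de"): 768,
    ("telekom_otc", "jina-embeddings-v2-base-code"): 768,
    ("telekom_otc", "tsi-embedding-colqwen2-2b-v1"): 1024,
}


def embedding_config(provider: str, model: str) -> dict:
    """Build an embedding config, rejecting provider/model pairs not listed above."""
    try:
        dimensions = EMBEDDING_DIMENSIONS[(provider, model)]
    except KeyError:
        raise ValueError(f"unsupported combination: {provider}/{model}") from None
    return {"provider": provider, "model": model, "dimensions": dimensions}


print(embedding_config("telekom_otc", "jina-embeddings-v2-base-de"))
```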

Best Practices

Chunking

General Documents: Use the recursive strategy with 500-word chunks and a 50-word overlap.
Code Files: Use smaller chunks (300 words) with recursive separators ["\n\n", "\n", " "].
Research Papers: Use the hierarchical strategy with block sizes [700, 350, 150] to preserve structure.
CSV/Excel Data: Start with rows_per_batch: 10 and adjust based on row size and retrieval precision needs.
Chunk overlap must be less than chunk size (or smallest block size for hierarchical). The API will reject invalid configurations.
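
The overlap rule can be checked client-side before calling the API. The following sketch assumes the configuration keys shown in the chunking examples; validate_chunking is a hypothetical helper, and the API remains the authority on what it accepts.

```python
def validate_chunking(config: dict) -> None:
    """Raise ValueError if chunk_overlap is not smaller than the relevant chunk size."""
    overlap = config["chunk_overlap"]
    if config["strategy"] == "hierarchical":
        # For hierarchical chunking the smallest block size is the limit.
        limit = min(config["hierarchical_block_sizes"])
    else:
        limit = config["chunk_size"]
    if overlap >= limit:
        raise ValueError(f"chunk_overlap ({overlap}) must be less than {limit}")


# Passes: 70 is below the smallest block size, 150.
validate_chunking({
    "strategy": "hierarchical",
    "chunk_overlap": 70,
    "hierarchical_block_sizes": [700, 350, 150],
})
```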

Retrieval

Top K Selection: Start with top_k: 5-10 for most use cases. Increase if you need more context.
Search Accuracy: Use search_ef: 64 for balanced performance. Increase to 128+ for higher accuracy needs.
Metric Type: Use COSINE for most semantic search applications as it’s normalized and works well with embeddings.
Score Threshold: Set a score_threshold (e.g., 0.7) to filter out low-relevance results.
Higher search_ef values improve accuracy but increase query latency. Balance based on your performance requirements.
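
As background on why COSINE is described as normalized: cosine similarity divides the dot product by both vector lengths, so only the direction of the embeddings matters, not their magnitude. The sketch below is a plain-Python illustration of the metric itself, not of how Milvus computes it internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Same direction, different magnitude: the similarity is still 1.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
# Orthogonal vectors: the similarity is 0.0.
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```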

Next Steps

Agent Capabilities

Explore all agent capabilities including tools and streaming

Agent Types

Learn about different agent types