longkeyy/knowledgesdk

KnowledgeSDK

KnowledgeSDK is a Go library for building and managing vector-based knowledge bases with semantic search. It provides tools for document management, chunking, embedding generation, and semantic search.

Features

  • Knowledge base management (create, read, update, delete)
  • Document handling with automatic content extraction
  • Text chunking for efficient storage and retrieval
  • Vector embedding generation
  • Semantic search with similarity scoring
  • PostgreSQL-based vector storage with efficient indexing
  • Support for various file formats via Apache Tika integration

Configuration

SDK Configuration

type Config struct {
 // Database configuration
 DBHost string
 DBPort int
 DBName string
 DBUser string
 DBPassword string
 // Vector embedding service configuration
 APIKey string
 BaseURL string // Compatible with different model services
 EmbeddingModel string // e.g. "text-embedding-ada-002"
}
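
The README does not show how the database fields are combined into a connection. As a minimal sketch, assuming a PostgreSQL keyword/value connection string, the fields could be joined as below. `buildDSN` is a hypothetical helper for illustration, not part of the SDK.

```go
package main

import "fmt"

// Config mirrors the struct documented above.
type Config struct {
	DBHost         string
	DBPort         int
	DBName         string
	DBUser         string
	DBPassword     string
	APIKey         string
	BaseURL        string
	EmbeddingModel string
}

// buildDSN is a hypothetical helper (not an SDK function) that joins the
// database fields into a PostgreSQL keyword/value connection string.
func buildDSN(c Config) string {
	return fmt.Sprintf("host=%s port=%d user=%s password=%s dbname=%s",
		c.DBHost, c.DBPort, c.DBUser, c.DBPassword, c.DBName)
}

func main() {
	cfg := Config{
		DBHost:         "localhost",
		DBPort:         5432,
		DBName:         "knowledge",
		DBUser:         "postgres",
		DBPassword:     "secret",
		EmbeddingModel: "text-embedding-ada-002",
	}
	fmt.Println(buildDSN(cfg))
}
```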

Chunk Configuration

type ChunkConfig struct {
 ChunkSize int // Maximum number of characters per chunk
 Overlap int // Number of overlapping characters between adjacent chunks
}
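
To illustrate how `ChunkSize` and `Overlap` interact, here is a minimal sketch of character-based chunking with overlap. `chunkText` is a hypothetical helper, and the SDK's actual splitter may behave differently (for example, it may respect sentence boundaries).

```go
package main

import "fmt"

// ChunkConfig mirrors the struct documented above.
type ChunkConfig struct {
	ChunkSize int // Maximum number of characters per chunk
	Overlap   int // Number of overlapping characters between adjacent chunks
}

// chunkText is a hypothetical helper, not the SDK's splitter. It cuts text
// into windows of at most ChunkSize characters, where each window starts
// ChunkSize-Overlap characters after the previous one.
func chunkText(text string, cfg ChunkConfig) []string {
	runes := []rune(text) // count characters, not bytes
	if cfg.ChunkSize <= 0 || len(runes) == 0 {
		return nil
	}
	step := cfg.ChunkSize - cfg.Overlap
	if cfg.Overlap < 0 || step <= 0 {
		step = cfg.ChunkSize // ignore an invalid overlap
	}
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + cfg.ChunkSize
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break
		}
	}
	return chunks
}

func main() {
	// Prints [abcd defg ghij]: adjacent chunks share one character.
	fmt.Println(chunkText("abcdefghij", ChunkConfig{ChunkSize: 4, Overlap: 1}))
}
```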

Search Parameters

type SearchParams struct {
 Query string // Query text to search for
 TopK int // Number of results to return
 SimilarityThreshold float64 // Minimum similarity score (0-1)
 CreatorID string // Creator ID for filtering results (optional)
 KBID string // Knowledge base ID to limit search scope (optional)
}

Tika Configuration

type TikaConfig struct {
 URL string // Tika server URL, e.g., "http://localhost:9998"
}
// DefaultTikaConfig returns default Tika configuration with URL set to "http://localhost:9998"

API Reference

Initialization

NewKnowledgeSDK

Creates a new SDK instance with the provided configuration.

  • Parameters:
    • config Config: Configuration for database and embedding service
  • Returns:
    • *KnowledgeSDK: SDK instance
    • error: Error if initialization fails

Knowledge Base Management

CreateKnowledgeBase

Creates a new knowledge base.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kb *KnowledgeBase: Knowledge base object with fields:
      • Name: Knowledge base name
      • Description: Knowledge base description
      • ModelID: Large model identifier (optional)
      • Temperature: Model temperature parameter, controls randomness (optional, default 0.7)
      • RigorousPrompt: Rigorous answer prompt template (optional)
      • EnableRigorousAnswer: Whether to enable rigorous answer mode (optional, default false)
      • ChunkSize: Document chunk size in characters (optional, default 1000)
      • Overlap: Overlap between adjacent chunks (optional, default 50)
      • TopK: Maximum number of related chunks to retrieve (optional, default 5)
      • SimilarityThreshold: Similarity threshold (optional, default 0.6)
      • SystemPromptTemplate: System prompt template (optional)
      • MaxReferenceLength: Maximum reference knowledge length (optional, default 3000)
      • CreatorID: ID of the knowledge base creator (optional)
  • Returns:
    • *KnowledgeBase: Created knowledge base
    • error: Error if creation fails

GetKnowledgeBase

Retrieves a knowledge base by ID.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbID string: ID of the knowledge base
  • Returns:
    • *KnowledgeBase: Retrieved knowledge base
    • error: Error if retrieval fails

ListKnowledgeBases

Lists all knowledge bases.

  • Parameters:
    • ctx context.Context: Context for the operation
  • Returns:
    • []KnowledgeBase: List of knowledge bases
    • error: Error if listing fails

ListKnowledgeBasesByCreatorID

Lists all knowledge bases created by the given creator.

  • Parameters:
    • ctx context.Context: Context for the operation
    • creatorID string: ID of the creator
  • Returns:
    • []KnowledgeBase: List of knowledge bases
    • error: Error if listing fails

ListKnowledgeBasesByIDs

Retrieves multiple knowledge bases by their IDs.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbIDs []string: List of knowledge base IDs
  • Returns:
    • []KnowledgeBase: List of knowledge bases
    • error: Error if retrieval fails

UpdateKnowledgeBase

Updates all properties of a knowledge base.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kb *KnowledgeBase: Knowledge base object with updated fields
  • Returns:
    • *KnowledgeBase: Updated knowledge base
    • error: Error if update fails

DeleteKnowledgeBase

Deletes a knowledge base and all its documents.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbID string: ID of the knowledge base
  • Returns:
    • error: Error if deletion fails

ListKnowledgeBaseDocuments

Lists all documents in a knowledge base.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbID string: ID of the knowledge base
  • Returns:
    • []Document: List of documents
    • error: Error if listing fails

ListKnowledgeBaseDocumentsPaginated

Lists documents in a knowledge base with pagination, sorting, and keyword filtering for document names.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbID string: ID of the knowledge base
    • keyword string: Keyword for filtering document names (use empty string for no filtering)
    • page int: Page number (starting from 1)
    • pageSize int: Number of documents per page
    • orderBy string: Sorting criteria (e.g., "uploaded_at DESC")
    • creatorID string: ID of the creator (optional, for filtering)
  • Returns:
    • []Document: List of documents
    • int64: Total number of documents in the knowledge base matching the filter criteria
    • error: Error if listing fails
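
Because `page` starts at 1 and the call returns the total match count, callers usually need two small calculations: the row offset for a page and the number of pages. The sketch below shows both. `pageOffset` and `totalPages` are hypothetical helpers, not SDK functions.

```go
package main

import "fmt"

// pageOffset is a hypothetical helper that converts a 1-based page number
// into a row offset. Pages below 1 are treated as page 1.
func pageOffset(page, pageSize int) int {
	if page < 1 {
		page = 1
	}
	return (page - 1) * pageSize
}

// totalPages is a hypothetical helper that rounds the total match count up
// to whole pages.
func totalPages(total int64, pageSize int) int64 {
	if pageSize <= 0 {
		return 0
	}
	size := int64(pageSize)
	return (total + size - 1) / size
}

func main() {
	fmt.Println(pageOffset(3, 20))  // 40
	fmt.Println(totalPages(45, 20)) // 3
}
```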

SearchKnowledgeBasesByName

Searches knowledge bases by name.

  • Parameters:
    • ctx context.Context: Context for the operation
    • name string: Name keyword to search for
  • Returns:
    • []KnowledgeBase: List of matching knowledge bases
    • error: Error if search fails

SearchKnowledgeBasesByDescription

Searches knowledge bases by description.

  • Parameters:
    • ctx context.Context: Context for the operation
    • description string: Description keyword to search for
  • Returns:
    • []KnowledgeBase: List of matching knowledge bases
    • error: Error if search fails

SearchKnowledgeBasesByKeyword

Searches knowledge bases by keyword, matching against both name and description.

  • Parameters:
    • ctx context.Context: Context for the operation
    • keyword string: Keyword to search for
  • Returns:
    • []KnowledgeBase: List of matching knowledge bases
    • error: Error if search fails

SearchKnowledgeBasesAdvanced

Performs an advanced search on knowledge bases using multiple criteria.

  • Parameters:
    • ctx context.Context: Context for the operation
    • params KnowledgeBaseSearchParams: Search parameters including:
      • Keyword: Keyword to search in name and description (optional)
      • Name: Name keyword (optional)
      • Description: Description keyword (optional)
      • ModelID: Model ID for exact matching (optional)
      • CreatorID: Creator ID for filtering (optional)
      • Page: Page number (starting from 1)
      • PageSize: Number of items per page
      • OrderBy: Sorting criteria (e.g., "created_at DESC")
  • Returns:
    • []KnowledgeBase: List of matching knowledge bases
    • int64: Total number of matching knowledge bases
    • error: Error if search fails

Document Management

AddDocument

Adds a text document to a knowledge base and immediately chunks it.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbID string: ID of the knowledge base
    • name string: Document name
    • content string: Document content
    • chunkConfig ChunkConfig: Chunking configuration
  • Returns:
    • *Document: Added document
    • error: Error if addition fails

AddDocumentWithMetadata

Adds a document with metadata to a knowledge base and chunks it.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbID string: ID of the knowledge base
    • name string: Document name
    • content string: Document content
    • contentType string: Content MIME type
    • metadata map[string]string: Document metadata
    • chunkConfig ChunkConfig: Chunking configuration
  • Returns:
    • *Document: Added document
    • error: Error if addition fails

GetDocument

Retrieves a document by ID.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • *Document: Retrieved document
    • error: Error if retrieval fails

GetDocumentWithChunks

Retrieves a document with its chunks.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • *Document: Retrieved document with chunks
    • error: Error if retrieval fails

DeleteDocument

Deletes a document and its chunks.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if deletion fails

UpdateDocumentContent

Updates a document's content and re-chunks it.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
    • newContent string: New document content
    • chunkConfig ChunkConfig: Chunking configuration
  • Returns:
    • error: Error if update fails

GetDocumentMetadata

Retrieves a document's metadata.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • map[string]string: Document metadata
    • error: Error if retrieval fails

File Management

AddFile

Adds a file to a knowledge base, extracts its content, and chunks it.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbID string: ID of the knowledge base
    • fileName string: Name of the file
    • fileData []byte: File data
    • tikaConfig TikaConfig: Apache Tika configuration
    • chunkConfig ChunkConfig: Chunking configuration
  • Returns:
    • *Document: Added document
    • error: Error if addition fails

AddFileFromReader

Adds a file from an io.Reader to a knowledge base.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbID string: ID of the knowledge base
    • fileName string: Name of the file
    • reader io.Reader: File data reader
    • tikaConfig TikaConfig: Apache Tika configuration
    • chunkConfig ChunkConfig: Chunking configuration
  • Returns:
    • *Document: Added document
    • error: Error if addition fails

AddFileFromMultipart

Adds a file from an HTTP multipart upload to a knowledge base.

  • Parameters:
    • ctx context.Context: Context for the operation
    • kbID string: ID of the knowledge base
    • file *multipart.FileHeader: Uploaded file
    • tikaConfig TikaConfig: Apache Tika configuration
    • chunkConfig ChunkConfig: Chunking configuration
  • Returns:
    • *Document: Added document
    • error: Error if addition fails

ExtractFileContent

Extracts content and metadata from a file using Apache Tika.

  • Parameters:
    • ctx context.Context: Context for the operation
    • fileName string: Name of the file
    • fileData []byte: File data
    • tikaConfig TikaConfig: Apache Tika configuration
  • Returns:
    • *FileContent: Extracted content and metadata
    • error: Error if extraction fails
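
The README does not show how the SDK talks to Tika. As a rough sketch, assuming a standard Tika server whose `/tika` endpoint accepts a `PUT` with the file body and returns extracted text, a request could be built as follows. `tikaEndpoint` and `newTikaRequest` are hypothetical helpers. `DefaultTikaConfig` is re-declared here only to keep the example self-contained.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
	"strings"
)

// TikaConfig mirrors the struct documented above.
type TikaConfig struct {
	URL string
}

// DefaultTikaConfig follows the documented default URL.
func DefaultTikaConfig() TikaConfig {
	return TikaConfig{URL: "http://localhost:9998"}
}

// tikaEndpoint is a hypothetical helper that appends the /tika path,
// tolerating a trailing slash in the configured URL.
func tikaEndpoint(cfg TikaConfig) string {
	return strings.TrimRight(cfg.URL, "/") + "/tika"
}

// newTikaRequest is a hypothetical helper that builds (but does not send)
// a plain-text extraction request for the given file bytes.
func newTikaRequest(ctx context.Context, cfg TikaConfig, fileData []byte) (*http.Request, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPut, tikaEndpoint(cfg), bytes.NewReader(fileData))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "text/plain")
	return req, nil
}

func main() {
	req, err := newTikaRequest(context.Background(), DefaultTikaConfig(), []byte("hello"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Passing the caller's `ctx` into the request lets a cancelled or timed-out operation abort the HTTP call as well.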

ExtractFileContentFromReader

Extracts content and metadata from a file using io.Reader.

  • Parameters:
    • ctx context.Context: Context for the operation
    • fileName string: Name of the file
    • reader io.Reader: File data reader
    • tikaConfig TikaConfig: Apache Tika configuration
  • Returns:
    • *FileContent: Extracted content and metadata
    • error: Error if extraction fails

ExtractFileContentFromMultipart

Extracts content and metadata from an HTTP multipart uploaded file.

  • Parameters:
    • ctx context.Context: Context for the operation
    • file *multipart.FileHeader: Uploaded file
    • tikaConfig TikaConfig: Apache Tika configuration
  • Returns:
    • *FileContent: Extracted content and metadata
    • error: Error if extraction fails

ExtractFileContentFromURL

Extracts content and metadata from a file at a given URL.

  • Parameters:
    • ctx context.Context: Context for the operation
    • fileURL string: URL of the file
    • tikaConfig TikaConfig: Apache Tika configuration
  • Returns:
    • *FileContent: Extracted content and metadata
    • error: Error if extraction fails

GetFileMetadata

Extracts metadata from a file without storing it.

  • Parameters:
    • ctx context.Context: Context for the operation
    • fileName string: Name of the file
    • fileData []byte: File data
    • tikaConfig TikaConfig: Apache Tika configuration
  • Returns:
    • map[string]string: File metadata
    • error: Error if extraction fails

GetFileMetadataFromReader

Extracts metadata from a file using io.Reader without storing it.

  • Parameters:
    • ctx context.Context: Context for the operation
    • fileName string: Name of the file
    • reader io.Reader: File data reader
    • tikaConfig TikaConfig: Apache Tika configuration
  • Returns:
    • map[string]string: File metadata
    • error: Error if extraction fails

GetFileMetadataFromMultipart

Extracts metadata from an HTTP multipart uploaded file without storing it.

  • Parameters:
    • ctx context.Context: Context for the operation
    • file *multipart.FileHeader: Uploaded file
    • tikaConfig TikaConfig: Apache Tika configuration
  • Returns:
    • map[string]string: File metadata
    • error: Error if extraction fails

Search

Search

Performs vector similarity search.

  • Parameters:
    • ctx context.Context: Context for the operation
    • params SearchParams: Search parameters including:
      • Query: Search query text
      • TopK: Maximum number of results to return
      • SimilarityThreshold: Minimum similarity score (0-1)
      • CreatorID: Creator ID for filtering results (optional)
      • KBID: Knowledge base ID to limit search scope (optional)
  • Returns:
    • []SearchResult: Search results
    • error: Error if search fails
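
To make `TopK` and `SimilarityThreshold` concrete, the sketch below ranks a few in-memory vectors against a query. It assumes cosine similarity, which the README does not confirm, and it runs in memory rather than in PostgreSQL. `cosine`, `rank`, and `scored` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// scored pairs a chunk identifier with its similarity score.
type scored struct {
	ID    string
	Score float64
}

// cosine returns the cosine similarity of two equal-length vectors,
// or 0 when the lengths differ or either vector is all zeros.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// rank keeps chunks scoring at least threshold, sorts them by descending
// score, and truncates the result to topK entries.
func rank(query []float32, chunks map[string][]float32, topK int, threshold float64) []scored {
	var out []scored
	for id, vec := range chunks {
		if s := cosine(query, vec); s >= threshold {
			out = append(out, scored{ID: id, Score: s})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	if topK >= 0 && len(out) > topK {
		out = out[:topK]
	}
	return out
}

func main() {
	chunks := map[string][]float32{
		"a": {1, 0},
		"b": {0.6, 0.8},
		"c": {0, 1},
	}
	// "c" is orthogonal to the query and falls below the 0.5 threshold.
	for _, r := range rank([]float32{1, 0}, chunks, 5, 0.5) {
		fmt.Printf("%s %.2f\n", r.ID, r.Score)
	}
}
```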

FullTextSearch

Performs traditional full-text search.

  • Parameters:
    • ctx context.Context: Context for the operation
    • query string: Search query
    • limit int: Maximum number of results
    • creatorID string: Creator ID for filtering results (optional)
    • kbID string: Knowledge base ID to limit search scope (optional)
  • Returns:
    • []SearchResult: Search results
    • error: Error if search fails

HybridSearch

Performs hybrid search (vector + full-text).

  • Parameters:
    • ctx context.Context: Context for the operation
    • params SearchParams: Search parameters including:
      • Query: Search query text
      • TopK: Maximum number of results to return
      • SimilarityThreshold: Minimum similarity score (0-1)
      • CreatorID: Creator ID for filtering results (optional)
      • KBID: Knowledge base ID to limit search scope (optional)
  • Returns:
    • []SearchResult: Search results
    • error: Error if search fails

Embedding Generation

GenerateEmbedding

Generates a vector embedding for text.

  • Parameters:
    • ctx context.Context: Context for the operation
    • text string: Text to embed
  • Returns:
    • []float32: Vector embedding
    • error: Error if generation fails

BatchGenerateEmbeddings

Generates vector embeddings for multiple texts in batch.

  • Parameters:
    • ctx context.Context: Context for the operation
    • texts []string: Texts to embed
  • Returns:
    • [][]float32: Vector embeddings
    • error: Error if generation fails

GetEmbeddingStatus

Retrieves the status of embedding generation.

  • Parameters:
    • ctx context.Context: Context for the operation
  • Returns:
    • *ChunkStatus: Status information
    • error: Error if retrieval fails

UpdateChunkEmbedding

Updates a chunk's vector embedding.

  • Parameters:
    • ctx context.Context: Context for the operation
    • chunk *Chunk: Chunk to update
    • embedding []float32: Vector embedding
  • Returns:
    • error: Error if update fails

BatchUpdateChunkEmbeddings

Updates multiple chunks' vector embeddings in batch.

  • Parameters:
    • ctx context.Context: Context for the operation
    • chunks []Chunk: Chunks to update
    • embeddings [][]float32: Vector embeddings
  • Returns:
    • error: Error if update fails

GetPendingChunks

Retrieves chunks pending embedding generation.

  • Parameters:
    • ctx context.Context: Context for the operation
    • limit int: Maximum number of chunks
  • Returns:
    • []Chunk: Pending chunks
    • error: Error if retrieval fails
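
The three batch-oriented calls above (`GetPendingChunks`, `BatchGenerateEmbeddings`, `BatchUpdateChunkEmbeddings`) suggest a simple backfill loop. The sketch below wires them together through a local interface using the documented signatures. It assumes these are methods on the SDK instance and that `Chunk` has a `Content` field, neither of which the README confirms. `fakeAPI` is a stand-in so the example runs without a database.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Chunk is a stand-in; the Content field name is an assumption.
type Chunk struct {
	Content string
}

// embeddingAPI lists the documented calls this sketch relies on.
type embeddingAPI interface {
	GetPendingChunks(ctx context.Context, limit int) ([]Chunk, error)
	BatchGenerateEmbeddings(ctx context.Context, texts []string) ([][]float32, error)
	BatchUpdateChunkEmbeddings(ctx context.Context, chunks []Chunk, embeddings [][]float32) error
}

// embedPending is a hypothetical helper that processes one batch of pending
// chunks and returns how many were embedded.
func embedPending(ctx context.Context, api embeddingAPI, limit int) (int, error) {
	chunks, err := api.GetPendingChunks(ctx, limit)
	if err != nil || len(chunks) == 0 {
		return 0, err
	}
	texts := make([]string, len(chunks))
	for i, c := range chunks {
		texts[i] = c.Content
	}
	vecs, err := api.BatchGenerateEmbeddings(ctx, texts)
	if err != nil {
		return 0, err
	}
	if len(vecs) != len(chunks) {
		return 0, errors.New("embedding count does not match chunk count")
	}
	if err := api.BatchUpdateChunkEmbeddings(ctx, chunks, vecs); err != nil {
		return 0, err
	}
	return len(chunks), nil
}

// fakeAPI is an in-memory stand-in for the SDK, used only for this demo.
type fakeAPI struct {
	pending []Chunk
	stored  int
}

func (f *fakeAPI) GetPendingChunks(_ context.Context, limit int) ([]Chunk, error) {
	if limit < 0 {
		limit = 0
	}
	if limit > len(f.pending) {
		limit = len(f.pending)
	}
	return f.pending[:limit], nil
}

func (f *fakeAPI) BatchGenerateEmbeddings(_ context.Context, texts []string) ([][]float32, error) {
	out := make([][]float32, len(texts))
	for i, t := range texts {
		out[i] = []float32{float32(len(t))} // dummy one-dimensional vector
	}
	return out, nil
}

func (f *fakeAPI) BatchUpdateChunkEmbeddings(_ context.Context, chunks []Chunk, _ [][]float32) error {
	f.stored += len(chunks)
	return nil
}

// runDemo embeds at most two of three pending chunks.
func runDemo() (int, error) {
	api := &fakeAPI{pending: []Chunk{{"alpha"}, {"beta"}, {"gamma"}}}
	return embedPending(context.Background(), api, 2)
}

func main() {
	n, err := runDemo()
	fmt.Println(n, err)
}
```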

Document Status Management

UpdateDocumentStatus

Updates a document's status.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
    • status string: New status
  • Returns:
    • error: Error if update fails

MarkDocumentAsUploadSuccessful

Marks a document as successfully uploaded.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if marking fails

MarkDocumentAsUploadFailed

Marks a document as failed during upload.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if marking fails

MarkDocumentAsExtractSuccessful

Marks a document's content extraction as successful.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if marking fails

MarkDocumentAsExtractFailed

Marks a document as failed during content extraction.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if marking fails

MarkDocumentAsSplitSuccessful

Marks a document as successfully chunked.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if marking fails

MarkDocumentAsSplitFailed

Marks a document as failed during chunking.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if marking fails

MarkDocumentAsIndexSuccessful

Marks a document as successfully indexed.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if marking fails

MarkDocumentAsIndexFailed

Marks a document as failed during indexing.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if marking fails

IsDocumentReadyForExtract

Checks if a document is ready for content extraction.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • bool: True if ready, false otherwise
    • error: Error if check fails

IsDocumentReadyForSplit

Checks if a document is ready for chunking.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • bool: True if ready, false otherwise
    • error: Error if check fails

IsDocumentReadyForIndex

Checks if a document is ready for indexing.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • bool: True if ready, false otherwise
    • error: Error if check fails

GetDocumentsInStatus

Retrieves documents with a specific status.

  • Parameters:
    • ctx context.Context: Context for the operation
    • status string: Status to filter by
    • limit int: Maximum number of documents
  • Returns:
    • []Document: Documents with the specified status
    • error: Error if retrieval fails

GetDocumentsForExtract

Retrieves documents waiting for content extraction.

  • Parameters:
    • ctx context.Context: Context for the operation
    • limit int: Maximum number of documents
  • Returns:
    • []Document: Documents waiting for content extraction
    • error: Error if retrieval fails

GetDocumentsForSplit

Retrieves documents waiting for chunking.

  • Parameters:
    • ctx context.Context: Context for the operation
    • limit int: Maximum number of documents
  • Returns:
    • []Document: Documents waiting for chunking
    • error: Error if retrieval fails

GetDocumentsForIndex

Retrieves documents waiting for indexing.

  • Parameters:
    • ctx context.Context: Context for the operation
    • limit int: Maximum number of documents
  • Returns:
    • []Document: Documents waiting for indexing
    • error: Error if retrieval fails

CheckDocumentIndexStatus

Checks if all chunks of a document are indexed.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • bool: True if all chunks are indexed, false otherwise
    • error: Error if check fails

UpdateDocumentIndexStatus

Updates a document's index status based on its chunks.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
  • Returns:
    • error: Error if update fails

Chunk Management

UpdateDocumentChunks

Updates multiple document chunks.

  • Parameters:
    • ctx context.Context: Context for the operation
    • chunks []Chunk: Chunks to update
  • Returns:
    • error: Error if update fails

GetChunksNeedingIndex

Retrieves chunks that need indexing.

  • Parameters:
    • ctx context.Context: Context for the operation
    • limit int: Maximum number of chunks
  • Returns:
    • []Chunk: Chunks needing indexing
    • error: Error if retrieval fails

MarkChunkAsIndexed

Marks a chunk as indexed.

  • Parameters:
    • ctx context.Context: Context for the operation
    • docID string: Document ID
    • chunkIndex int: Chunk index
  • Returns:
    • error: Error if marking fails

CompareChunkContent

Compares the content of two chunks.

  • Parameters:
    • chunk1 *Chunk: First chunk
    • chunk2 *Chunk: Second chunk
  • Returns:
    • bool: True if content is identical, false otherwise

Utility Methods

GetDB

Retrieves the GORM database connection.

  • Returns:
    • *gorm.DB: Database connection

GetOpenAIClient

Retrieves the OpenAI client.

  • Returns:
    • *openai.Client: OpenAI client

GetEmbeddingModel

Retrieves the current embedding model name.

  • Returns:
    • string: Embedding model name

GetModelDimension

Retrieves the model vector dimension.

  • Returns:
    • int: Vector dimension

EmbeddingToPgVector

Converts a vector embedding to PostgreSQL vector format.

  • Parameters:
    • embedding []float32: Vector embedding
  • Returns:
    • string: PostgreSQL vector format
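
For reference, pgvector accepts vectors as bracketed, comma-separated text such as `[0.1,0.25,-3]`. The sketch below shows one way to produce that form. `toPgVector` is a hypothetical helper, and the SDK's own `EmbeddingToPgVector` may format numbers differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// toPgVector is a hypothetical helper (not the SDK function) that renders an
// embedding as pgvector text input, e.g. "[0.1,0.25,-3]".
func toPgVector(embedding []float32) string {
	parts := make([]string, len(embedding))
	for i, v := range embedding {
		// Shortest decimal form that round-trips as a float32.
		parts[i] = strconv.FormatFloat(float64(v), 'f', -1, 32)
	}
	return "[" + strings.Join(parts, ",") + "]"
}

func main() {
	fmt.Println(toPgVector([]float32{0.1, 0.25, -3})) // [0.1,0.25,-3]
}
```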

DefaultTikaConfig

Returns the default Tika configuration, with the URL set to "http://localhost:9998".

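Testing

Test Environment Setup

  1. Create the test database:
make setup-test-db
  2. Run all tests:
make test
  3. Run the search tests:
make test-search
  4. Run the knowledge base tests:
make test-kb
  5. Clean up test data:
make clean-test-db

Automated Test Script

Use the provided test script to run the complete test workflow:

# Run the full test suite (including setup and cleanup)
./scripts/test_search.sh --cleanup
# Set up the test environment only
./scripts/test_search.sh --setup-only
# Clean up test data only
./scripts/test_search.sh --cleanup-only

Test Data Isolation

To keep tests reliable, all tests isolate their data:

  • Unique identifiers prevent test data collisions
  • Data created by each test is cleaned up automatically afterward
  • Using a dedicated test database is recommended
  • See the Testing Best Practices Guide for details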

Constants

Document Status Constants

  • DocStatusUploadFailed: Upload failed
  • DocStatusUploadSuccess: Upload successful, waiting for content extraction
  • DocStatusExtractFailed: Content extraction failed
  • DocStatusExtractSuccess: Content extraction successful, waiting for chunking
  • DocStatusSplitFailed: Chunking failed
  • DocStatusSplitSuccess: Chunking successful, waiting for indexing
  • DocStatusIndexFailed: Indexing failed
  • DocStatusIndexSuccess: Indexing successful
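
Read together, the constants describe a pipeline of upload, extraction, chunking, and indexing, where each success status means the document is waiting for the next stage. The sketch below encodes that reading. The constant names come from the list above, but their string values here are placeholders (the README does not give them), and `nextStage` is a hypothetical helper.

```go
package main

import "fmt"

// Constant names follow the README; the string values are placeholders,
// since the real values are not documented.
const (
	DocStatusUploadFailed   = "upload_failed"
	DocStatusUploadSuccess  = "upload_success"
	DocStatusExtractFailed  = "extract_failed"
	DocStatusExtractSuccess = "extract_success"
	DocStatusSplitFailed    = "split_failed"
	DocStatusSplitSuccess   = "split_success"
	DocStatusIndexFailed    = "index_failed"
	DocStatusIndexSuccess   = "index_success"
)

// nextStage is a hypothetical helper that reports which stage a document in
// the given status is waiting for. Failed and fully indexed documents have
// no next stage.
func nextStage(status string) (string, bool) {
	switch status {
	case DocStatusUploadSuccess:
		return "extract", true
	case DocStatusExtractSuccess:
		return "split", true
	case DocStatusSplitSuccess:
		return "index", true
	default:
		return "", false
	}
}

func main() {
	for _, s := range []string{DocStatusUploadSuccess, DocStatusSplitSuccess, DocStatusIndexFailed} {
		stage, ok := nextStage(s)
		fmt.Println(s, stage, ok)
	}
}
```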
