1. Project Overview
Project Name: Proof DB
Type: Historical Evidence Retrieval System (RAG-oriented backend)
This project is a backend-centric system designed to manage, index, and retrieve historical evidence (documents, archives, OCR text) with strong emphasis on:
- Evidence traceability
- Chunk-level retrieval
- Hybrid search (full-text + vector)
- Citation reconstruction
Unlike generic RAG systems, this project treats "evidence" as first-class structured objects, not just text.
2. Core Concept
The system is divided into three conceptual layers:
- Proof DB → Data layer (MySQL + OpenSearch + Vector)
- Archive Cask → Frontend interface (not part of this task)
- Few-shot Engine → OCR (external, not part of this task)
Current scope: Proof DB only
3. System Architecture (Backend Focus)
The backend follows a modular service architecture (not microservices yet, but clearly separated layers):
Components:
- Ingestion Layer
  - Accepts raw Markdown archive documents
  - Pre-processes Markdown page markers such as `<!-- DOCMASTER:PAGE 0001 -->`
  - Splits documents into page-bounded vector chunks
  - Keeps list-style archive records and their `COMMENT` blocks together where possible
  - Extracts metadata, including page numbers
  - Enqueues missing archive metadata such as title, year, author, tags, and summary for async LLM enrichment
- Storage Layer
  - MySQL → metadata, relations
  - OpenSearch → full-text index
  - Vector DB → embeddings (can be OpenSearch kNN or Qdrant)
- Retrieval Layer
  - Full-text search (BM25)
  - Vector search (embedding similarity)
  - Hybrid search (fusion)
- Evidence Layer
  - Maps chunk → page → article
  - Provides page-level citation traceability
👉 This is a classic layered backend design (server + database + API working in concert).
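The ingestion layer's page-marker splitting can be sketched as follows. This is an illustrative Python sketch (the production stack is PHP); the marker format comes from the document, while function names, the paragraph-merging heuristic, and the `max_chars` limit are assumptions:

```python
import re

# Page markers look like: <!-- DOCMASTER:PAGE 0001 -->
PAGE_MARKER = re.compile(r"<!--\s*DOCMASTER:PAGE\s+(\d{4})\s*-->")

def split_into_pages(markdown: str) -> list[tuple[int, str]]:
    """Split a raw Markdown archive into (page_number, content) pairs."""
    parts = PAGE_MARKER.split(markdown)
    # parts = [preamble, "0001", content_1, "0002", content_2, ...]
    pages = []
    for i in range(1, len(parts) - 1, 2):
        pages.append((int(parts[i]), parts[i + 1].strip()))
    return pages

def chunk_pages(pages, max_chars=1200):
    """Merge consecutive paragraphs into chunks, never crossing a page boundary."""
    chunks = []
    for page_number, content in pages:
        buf = ""
        for para in content.split("\n\n"):
            if buf and len(buf) + len(para) > max_chars:
                chunks.append({"page_start": page_number,
                               "page_end": page_number, "text": buf})
                buf = ""
            buf = (buf + "\n\n" + para).strip()
        if buf:
            chunks.append({"page_start": page_number,
                           "page_end": page_number, "text": buf})
    return chunks
```

Because each chunk is built within a single page, `page_start` and `page_end` stay equal here; the schema allows a range for variants that merge short trailing blocks.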
4. Tech Stack
Backend Framework
- PHP 8+
- Webman (HTTP API)
- Workerman (async workers / background jobs)
Database
- MySQL (relational metadata)
Search Engine
- OpenSearch
  - Full-text search (BM25)
  - Optional vector search (kNN)
Vector Layer
- Option A: OpenSearch kNN
- Option B: Qdrant (preferred if scaling)
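For Option A, a minimal OpenSearch index mapping might look like the sketch below. The field names mirror the data model in this document; the dimension and HNSW parameters are assumptions and must match the chosen embedding model:

```json
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "chunk_uid":   { "type": "keyword" },
      "archive_uid": { "type": "keyword" },
      "page_start":  { "type": "integer" },
      "page_end":    { "type": "integer" },
      "text":        { "type": "text" },
      "embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": { "name": "hnsw", "engine": "lucene", "space_type": "cosinesimil" }
      }
    }
  }
}
```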
Data Flow Tools
- Custom chunking logic (PHP)
- Embedding via external API / local model
- Metadata enrichment via Redis queue + OpenAI-compatible chat completion API
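The enrichment path above can be sketched as a worker loop. This is a Python illustration (the real workers run under Workerman in PHP); the queue and the LLM call are injected so only the control flow is shown, and the prompt wording and field list beyond title/year/author/tags/summary are assumptions:

```python
import json

ENRICH_FIELDS = ("title", "year", "author", "tags", "summary")

def build_enrichment_prompt(archive_uid: str, excerpt: str, missing: list) -> list:
    """Chat messages asking the LLM to fill the missing metadata fields as JSON."""
    return [
        {"role": "system",
         "content": "You extract archive metadata. Reply with a single JSON object."},
        {"role": "user",
         "content": f"Archive {archive_uid}. Missing fields: {', '.join(missing)}."
                    f"\n\nExcerpt:\n{excerpt}"},
    ]

def run_enrichment_worker(pop_job, call_llm, save_metadata):
    """Drain the queue: pop a job, ask the LLM, persist the parsed fields.

    pop_job       -> dict or None (e.g. a Redis BRPOP wrapper)
    call_llm      -> str, the assistant message from an OpenAI-compatible API
    save_metadata -> persists {field: value} for an archive_uid
    """
    while (job := pop_job()) is not None:
        missing = [f for f in ENRICH_FIELDS if not job.get(f)]
        if not missing:
            continue
        messages = build_enrichment_prompt(job["archive_uid"], job["excerpt"], missing)
        fields = json.loads(call_llm(messages))
        save_metadata(job["archive_uid"],
                      {k: v for k, v in fields.items() if k in missing})
```

Injecting the queue and LLM keeps the worker testable and makes swapping Redis for another broker a one-line change at the call site.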
5. Data Model (CRITICAL)
Core Entities
Archive
├── archive_uid (ULID)
├── title
├── summary
├── source
└── metadata
Page
├── page_number
├── block_count
├── chunk_count
└── content_length
PageBlock (internal import structure)
├── block_uid
├── archive_uid
├── page_number
└── content
Chunk
├── chunk_uid (archive_uid + sequence + short uid)
├── page_start
├── page_end
├── text
└── embedding_ref
Key Principle
- archive_uid is the archive-level core ID, a ULID
- chunk_uid is the chunk-level core ID, formatted as `{archive_uid}_{chunk_index}_{short_uid}`
- MySQL, OpenSearch, and the vector DB are all keyed on archive_uid and chunk_uid
- page_number is the key field for locating evidence
- A chunk is the unit of vectorization and retrieval recall, not the unit of precise citation
- Evidence only needs to be located to a page number, so a chunk may merge across paragraphs but must never cross a page boundary
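The chunk_uid scheme can be sketched in a few lines. This is a Python illustration (the backend is PHP); the three-part format comes from the document, while the random-hex short uid and its length are assumptions:

```python
import secrets

def make_chunk_uid(archive_uid: str, chunk_index: int, short_len: int = 6) -> str:
    """Build a chunk_uid as {archive_uid}_{chunk_index}_{short_uid}.

    archive_uid is a ULID (26 chars, no underscores), so the three parts
    can always be recovered with rsplit("_", 2). The short uid here is a
    random hex suffix; the document only fixes the three-part format.
    """
    short_uid = secrets.token_hex((short_len + 1) // 2)[:short_len]
    return f"{archive_uid}_{chunk_index}_{short_uid}"
```

Because ULIDs contain no underscores, `uid.rsplit("_", 2)` cleanly splits a chunk_uid back into archive_uid, index, and suffix.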
6. Search Design
Full-text (OpenSearch)
- Indexed at chunk level
- Supports:
  - keyword match
  - phrase match
Vector Search
- embedding similarity
Hybrid Search
- BM25 + vector fusion
- rerank stage
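The fusion step can be sketched with Reciprocal Rank Fusion, a common BM25-plus-vector fusion method (the document does not name a specific algorithm, so RRF is an assumption; the rerank stage is omitted here). A Python illustration:

```python
def rrf_fuse(bm25_hits: list, vector_hits: list, k: int = 60) -> list:
    """Reciprocal Rank Fusion over two ranked lists of chunk_uids.

    score(d) = sum over lists of 1 / (k + rank(d)); k = 60 is a common default.
    """
    scores = {}
    for hits in (bm25_hits, vector_hits):
        for rank, chunk_uid in enumerate(hits, start=1):
            scores[chunk_uid] = scores.get(chunk_uid, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs only ranks, not scores, so it sidesteps normalizing BM25 scores against cosine similarities; a cross-encoder rerank can then be applied to the fused top-N.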
7. API Design (First Phase)
Ingestion
POST /api/articles/import
Retrieval
POST /api/search/fulltext
POST /api/search/vector
POST /api/search/hybrid
Evidence
GET /api/chunks/{chunk_uid}
GET /api/evidence/{chunk_uid}
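Request and response shapes for these routes are not yet specified; a hypothetical sketch in Python (field names like `query`, `top_k`, `archive_title` are assumptions, only the routes come from this document):

```python
import json

def hybrid_search_request(query: str, top_k: int = 10) -> str:
    """Hypothetical JSON body for POST /api/search/hybrid."""
    return json.dumps({"query": query, "top_k": top_k})

def evidence_citation(chunk: dict) -> str:
    """Reconstruct a page-level citation from a chunk record,
    as GET /api/evidence/{chunk_uid} might return it."""
    pages = (str(chunk["page_start"]) if chunk["page_start"] == chunk["page_end"]
             else f'{chunk["page_start"]}-{chunk["page_end"]}')
    return f'{chunk["archive_title"]}, p. {pages}'
```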
8. Design Philosophy (IMPORTANT)
- Evidence > Text
- Chunk > Document
- Traceability > Raw Retrieval
- Hybrid Search by default
9. Non-goals (IMPORTANT)
- No frontend (Archive Cask handled later)
- No OCR (Few-shot Engine external)
- No heavy microservices (keep a simple modular architecture first)
- Proof DB ≠ storage
- Proof DB = retrieval + meaning + traceability