## 1. Project Overview

**Project Name:** Proof DB

**Type:** Historical Evidence Retrieval System (RAG-oriented backend)

This project is a **backend-centric system** designed to manage, index, and retrieve historical evidence (documents, archives, OCR text) with strong emphasis on:

* Evidence traceability
* Chunk-level retrieval
* Hybrid search (full-text + vector)
* Citation reconstruction

Unlike generic RAG systems, this project treats **"evidence" as first-class structured objects**, not just text.

---

## 2. Core Concept

The system is divided into three conceptual layers:

* **Proof DB** → Data layer (MySQL + OpenSearch + Vector)
* **Archive Cask** → Frontend interface (not part of this task)
* **Few-shot Engine** → OCR (external, not part of this task)

Current scope: **Proof DB only**

---

## 3. System Architecture (Backend Focus)

The backend follows a **modular service architecture** (not microservices yet, but with clearly separated layers):

### Components:

1. **Ingestion Layer**

   * Accepts raw Markdown archive documents
   * Pre-processes Markdown page markers such as ``
   * Splits documents into page-bounded vector chunks
   * Keeps list-style archive records and their `COMMENT` blocks together where possible
   * Extracts metadata, including page numbers
   * Enqueues missing archive metadata such as title, year, author, tags, and summary for async LLM enrichment

2. **Storage Layer**

   * MySQL → metadata, relations
   * OpenSearch → full-text index
   * Vector DB → embeddings (can be OpenSearch kNN or Qdrant)

3. **Retrieval Layer**

   * Full-text search (BM25)
   * Vector search (embedding similarity)
   * Hybrid search (fusion)

4. **Evidence Layer**

   * Maps chunk → page → article
   * Provides page-level citation traceability

👉 This is a typical layered backend architecture design (server, database, and API working together) ([DEV Community][1])

---

## 4. Tech Stack

### Backend Framework

* PHP 8+
* Webman (HTTP API)
* Workerman (async workers / background jobs)

### Database

* MySQL (relational metadata)

### Search Engine

* OpenSearch

  * Full-text search (BM25)
  * Optional vector search (kNN)

### Vector Layer

* Option A: OpenSearch kNN
* Option B: Qdrant (preferred if scaling)

### Data Flow Tools

* Custom chunking logic (PHP)
* Embedding via external API / local model
* Metadata enrichment via Redis queue + OpenAI-compatible chat completion API

---

## 5. Data Model (CRITICAL)

### Core Entities

```text
Archive
 ├── archive_uid (ULID)
 ├── title
 ├── summary
 ├── source
 └── metadata

Page
 ├── page_number
 ├── block_count
 ├── chunk_count
 └── content_length

PageBlock (internal import structure)
 ├── block_uid
 ├── archive_uid
 ├── page_number
 └── content

Chunk
 ├── chunk_uid (archive_uid + sequence + short uid)
 ├── page_start
 ├── page_end
 ├── text
 └── embedding_ref
```

### Key Principle

* **`archive_uid` is the core archive-level ID and uses ULID**
* **`chunk_uid` is the core chunk-level ID, in the format `{archive_uid}_{chunk_index}_{short_uid}`**
* MySQL, OpenSearch, and the Vector DB are all keyed on `archive_uid` and `chunk_uid`
* **`page_number` is the key field for locating evidence**
* A chunk is the unit of vectorization and retrieval recall, not the unit of precise citation
* Evidence only needs to be located down to the page number, so a chunk may merge several paragraphs but must not cross a page boundary

---

## 6. Search Design

### Full-text (OpenSearch)

* Indexed at chunk level
* Supports:

  * keyword match
  * phrase match

### Vector Search

* embedding similarity

### Hybrid Search

* BM25 + vector fusion
* rerank stage

---

## 7. API Design (First Phase)

### Ingestion

```http
POST /api/articles/import
```

---

### Retrieval

```http
POST /api/search/fulltext
POST /api/search/vector
POST /api/search/hybrid
```

---

### Evidence

```http
GET /api/chunks/{chunk_uid}
GET /api/evidence/{chunk_uid}
```

---

## 8. Design Philosophy (IMPORTANT)

* Evidence > Text
* Chunk > Document
* Traceability > Raw Retrieval
* Hybrid Search by default

---

## 9. Non-goals (IMPORTANT)

* No frontend (Archive Cask handled later)
* No OCR (Few-shot Engine external)
* No heavy microservices (keep simple modular architecture first)
* Proof DB ≠ storage
* Proof DB = retrieval + meaning + traceability

[1]: https://dev.to/tomjohnson3/understanding-backend-architecture-ljb?utm_source=chatgpt.com "Understanding Backend Architecture"
[2]: https://exodata.io/what-is-a-tech-stack-how-to-architect-a-modern-scalable-technology-stack/?utm_source=chatgpt.com "How to Build a Tech Stack That Scales [2026] | Exodata"
[3]: https://medium.com/%40hanxuyang0826/roadmap-to-backend-programming-master-architectural-patterns-c763c9194414?utm_source=chatgpt.com "Roadmap to Backend Programming Master: Architectural ..."
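
---

## Appendix: Page-bounded chunking sketch

The data model in section 5 requires that a chunk may merge several paragraphs but never cross a page, and that every chunk carries a `chunk_uid` of the form `{archive_uid}_{chunk_index}_{short_uid}`. The following is a minimal sketch of that rule, written in Python purely for illustration (the production chunker is expected to be PHP). It assumes blocks arrive already tagged with a page number, as in the `PageBlock` structure. The names `chunk_blocks`, `make_chunk_uid`, and `max_chars`, the 4-digit zero padding, and the random 8-character hex suffix are all assumptions made for this example, not part of the specification.

```python
import secrets
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageBlock:
    """One block of text that already belongs to a single page."""
    page_number: int
    content: str


@dataclass
class Chunk:
    chunk_uid: str
    page_start: int
    page_end: int
    text: str


def make_chunk_uid(archive_uid: str, chunk_index: int) -> str:
    # Format from section 5: {archive_uid}_{chunk_index}_{short_uid}.
    # Assumption: 4-digit zero padding and an 8-character random hex suffix.
    return f"{archive_uid}_{chunk_index:04d}_{secrets.token_hex(4)}"


def chunk_blocks(archive_uid: str, blocks: List[PageBlock],
                 max_chars: int = 800) -> List[Chunk]:
    """Merge consecutive blocks into chunks without crossing a page boundary."""
    chunks: List[Chunk] = []
    buffer: List[str] = []
    current_page: Optional[int] = None

    def flush() -> None:
        # Emit the buffered blocks as one chunk bound to the current page.
        if buffer and current_page is not None:
            chunks.append(Chunk(
                chunk_uid=make_chunk_uid(archive_uid, len(chunks)),
                page_start=current_page,
                page_end=current_page,
                text="\n\n".join(buffer),
            ))
            buffer.clear()

    for block in blocks:
        size = sum(len(part) for part in buffer) + len(block.content)
        # Close the open chunk when the page changes or the size budget is exceeded.
        if buffer and (block.page_number != current_page or size > max_chars):
            flush()
        current_page = block.page_number
        buffer.append(block.content)
    flush()
    return chunks


if __name__ == "__main__":
    # The archive_uid below is only a sample ULID-shaped value.
    demo = [
        PageBlock(1, "Record A"),
        PageBlock(1, "COMMENT on A"),
        PageBlock(2, "Record B"),
    ]
    for chunk in chunk_blocks("01ARZ3NDEKTSV4RRFFQ69G5FAV", demo):
        print(chunk.chunk_uid, chunk.page_start, repr(chunk.text))
```

Because `page_start` and `page_end` are always equal under this rule, the evidence endpoint can resolve a chunk to a single page without inspecting the chunk text. Keeping a record and its `COMMENT` block together falls out of the merge step as long as both fit within the size budget on the same page.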