1. Project Overview
Project Name: Proof DB
Type: Historical Evidence Retrieval System (RAG-oriented backend)
This project is a backend-centric system designed to manage, index, and retrieve historical evidence (documents, archives, OCR text) with strong emphasis on:
- Evidence traceability
- Chunk-level retrieval
- Hybrid search (full-text + vector)
- Citation reconstruction
Unlike generic RAG systems, this project treats "evidence" as first-class structured objects, not just text.
2. Core Concept
The system is divided into three conceptual layers:
- Proof DB → Data layer (MySQL + OpenSearch + Vector)
- Archive Cask → Frontend interface (not part of this task)
- Few-shot Engine → OCR (external, not part of this task)
Current scope: Proof DB only
3. System Architecture (Backend Focus)
The backend follows a modular service architecture (not microservices yet, but clearly separated layers):
Components:
- Ingestion Layer
  - Accepts raw Markdown archive documents
  - Pre-processes Markdown page markers such as `<!-- DOCMASTER:PAGE 0001 -->`
  - Splits documents into page-bounded vector chunks
  - Keeps list-style archive records and their `COMMENT` blocks together where possible
  - Extracts metadata, including page numbers
  - Enqueues missing archive metadata such as title, year, author, tags, and summary for async LLM enrichment
- Storage Layer
  - MySQL → metadata, relations
  - OpenSearch → full-text index
  - Vector DB → embeddings (can be OpenSearch kNN or Qdrant)
- Retrieval Layer
  - Full-text search (BM25)
  - Vector search (embedding similarity)
  - Hybrid search (fusion)
- Evidence Layer
  - Maps chunk → page → article
  - Provides page-level citation traceability
👉 This is a classic layered backend design (server + database + API working in concert).
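The ingestion layer's page-marker splitting can be sketched as follows. This is an illustrative Python sketch (the production stack is PHP); the marker format comes from the document, while function names, the paragraph-merging heuristic, and the `max_chars` limit are assumptions:

```python
import re

# Page markers look like: <!-- DOCMASTER:PAGE 0001 -->
PAGE_MARKER = re.compile(r"<!--\s*DOCMASTER:PAGE\s+(\d{4})\s*-->")

def split_into_pages(markdown: str) -> list[tuple[int, str]]:
    """Split a raw Markdown archive into (page_number, content) pairs."""
    parts = PAGE_MARKER.split(markdown)
    # parts = [preamble, "0001", content_1, "0002", content_2, ...]
    pages = []
    for i in range(1, len(parts) - 1, 2):
        pages.append((int(parts[i]), parts[i + 1].strip()))
    return pages

def chunk_pages(pages, max_chars=1200):
    """Merge consecutive paragraphs into chunks, never crossing a page boundary."""
    chunks = []
    for page_number, content in pages:
        buf = ""
        for para in content.split("\n\n"):
            if buf and len(buf) + len(para) > max_chars:
                chunks.append({"page_start": page_number,
                               "page_end": page_number, "text": buf})
                buf = ""
            buf = (buf + "\n\n" + para).strip()
        if buf:
            chunks.append({"page_start": page_number,
                           "page_end": page_number, "text": buf})
    return chunks
```

Because each chunk is built within a single page, `page_start` and `page_end` stay equal here; the schema allows a range for variants that merge short trailing blocks.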
4. Tech Stack
Backend Framework
- PHP 8+
- Webman (HTTP API)
- Workerman (async workers / background jobs)
Database
- MySQL (relational metadata)
Search Engine
- OpenSearch
  - Full-text search (BM25)
  - Optional vector search (kNN)
Vector Layer
- Option A: OpenSearch kNN
- Option B: Qdrant (preferred if scaling)
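For Option A, a minimal OpenSearch index mapping might look like the sketch below. The field names mirror the data model in this document; the dimension and HNSW parameters are assumptions and must match the chosen embedding model:

```json
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "chunk_uid":   { "type": "keyword" },
      "archive_uid": { "type": "keyword" },
      "page_start":  { "type": "integer" },
      "page_end":    { "type": "integer" },
      "text":        { "type": "text" },
      "embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": { "name": "hnsw", "engine": "lucene", "space_type": "cosinesimil" }
      }
    }
  }
}
```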
Data Flow Tools
- Custom chunking logic (PHP)
- Embedding via external API / local model
- Metadata enrichment via Redis queue + OpenAI-compatible chat completion API
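The enrichment path above can be sketched as a worker loop. This is a Python illustration (the real workers run under Workerman in PHP); the queue and the LLM call are injected so only the control flow is shown, and the prompt wording and field list beyond title/year/author/tags/summary are assumptions:

```python
import json

ENRICH_FIELDS = ("title", "year", "author", "tags", "summary")

def build_enrichment_prompt(archive_uid: str, excerpt: str, missing: list) -> list:
    """Chat messages asking the LLM to fill the missing metadata fields as JSON."""
    return [
        {"role": "system",
         "content": "You extract archive metadata. Reply with a single JSON object."},
        {"role": "user",
         "content": f"Archive {archive_uid}. Missing fields: {', '.join(missing)}."
                    f"\n\nExcerpt:\n{excerpt}"},
    ]

def run_enrichment_worker(pop_job, call_llm, save_metadata):
    """Drain the queue: pop a job, ask the LLM, persist the parsed fields.

    pop_job       -> dict or None (e.g. a Redis BRPOP wrapper)
    call_llm      -> str, the assistant message from an OpenAI-compatible API
    save_metadata -> persists {field: value} for an archive_uid
    """
    while (job := pop_job()) is not None:
        missing = [f for f in ENRICH_FIELDS if not job.get(f)]
        if not missing:
            continue
        messages = build_enrichment_prompt(job["archive_uid"], job["excerpt"], missing)
        fields = json.loads(call_llm(messages))
        save_metadata(job["archive_uid"],
                      {k: v for k, v in fields.items() if k in missing})
```

Injecting the queue and LLM keeps the worker testable and makes swapping Redis for another broker a one-line change at the call site.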
5. Data Model (CRITICAL)
Core Entities
Archive
├── archive_uid (ULID)
├── title
├── summary
├── source
└── metadata
Page
├── page_number
├── block_count
├── chunk_count
└── content_length
PageBlock (internal import structure)
├── block_uid
├── archive_uid
├── page_number
└── content
Chunk
├── chunk_uid (archive_uid + sequence + short uid)
├── page_start
├── page_end
├── text
└── embedding_ref
Key Principle
- archive_uid is the archive-level core ID, a ULID
- chunk_uid is the chunk-level core ID, formatted as `{archive_uid}_{chunk_index}_{short_uid}`
- MySQL, OpenSearch, and the vector DB are all keyed on archive_uid and chunk_uid
- page_number is the key field for locating evidence
- A chunk is the unit of vectorization and retrieval recall, not the unit of precise citation
- Evidence only needs to be located to a page number, so a chunk may merge across paragraphs but must never cross a page boundary
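The chunk_uid scheme can be sketched in a few lines. This is a Python illustration (the backend is PHP); the three-part format comes from the document, while the random-hex short uid and its length are assumptions:

```python
import secrets

def make_chunk_uid(archive_uid: str, chunk_index: int, short_len: int = 6) -> str:
    """Build a chunk_uid as {archive_uid}_{chunk_index}_{short_uid}.

    archive_uid is a ULID (26 chars, no underscores), so the three parts
    can always be recovered with rsplit("_", 2). The short uid here is a
    random hex suffix; the document only fixes the three-part format.
    """
    short_uid = secrets.token_hex((short_len + 1) // 2)[:short_len]
    return f"{archive_uid}_{chunk_index}_{short_uid}"
```

Because ULIDs contain no underscores, `uid.rsplit("_", 2)` cleanly splits a chunk_uid back into archive_uid, index, and suffix.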
6. Search Design
Full-text (OpenSearch)
- Indexed at chunk level
- Supports:
  - keyword match
  - phrase match
Vector Search
- embedding similarity
Hybrid Search
- BM25 + vector fusion
- rerank stage
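The fusion step can be sketched with Reciprocal Rank Fusion, a common BM25-plus-vector fusion method (the document does not name a specific algorithm, so RRF is an assumption; the rerank stage is omitted here). A Python illustration:

```python
def rrf_fuse(bm25_hits: list, vector_hits: list, k: int = 60) -> list:
    """Reciprocal Rank Fusion over two ranked lists of chunk_uids.

    score(d) = sum over lists of 1 / (k + rank(d)); k = 60 is a common default.
    """
    scores = {}
    for hits in (bm25_hits, vector_hits):
        for rank, chunk_uid in enumerate(hits, start=1):
            scores[chunk_uid] = scores.get(chunk_uid, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs only ranks, not scores, so it sidesteps normalizing BM25 scores against cosine similarities; a cross-encoder rerank can then be applied to the fused top-N.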
7. API Design (First Phase)
Ingestion
POST /api/articles/import
Retrieval
POST /api/search/fulltext
POST /api/search/vector
POST /api/search/hybrid
Evidence
GET /api/chunks/{chunk_uid}
GET /api/evidence/{chunk_uid}
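Request and response shapes for these routes are not yet specified; a hypothetical sketch in Python (field names like `query`, `top_k`, `archive_title` are assumptions, only the routes come from this document):

```python
import json

def hybrid_search_request(query: str, top_k: int = 10) -> str:
    """Hypothetical JSON body for POST /api/search/hybrid."""
    return json.dumps({"query": query, "top_k": top_k})

def evidence_citation(chunk: dict) -> str:
    """Reconstruct a page-level citation from a chunk record,
    as GET /api/evidence/{chunk_uid} might return it."""
    pages = (str(chunk["page_start"]) if chunk["page_start"] == chunk["page_end"]
             else f'{chunk["page_start"]}-{chunk["page_end"]}')
    return f'{chunk["archive_title"]}, p. {pages}'
```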
8. Design Philosophy (IMPORTANT)
- Evidence > Text
- Chunk > Document
- Traceability > Raw Retrieval
- Hybrid Search by default
9. Non-goals (IMPORTANT)
- No frontend (Archive Cask handled later)
- No OCR (Few-shot Engine external)
- No heavy microservices (keep a simple modular architecture first)
- Proof DB ≠ storage
- Proof DB = retrieval + meaning + traceability