proofdb/readme.md

## 1. Project Overview

**Project Name:** Proof DB
**Type:** Historical Evidence Retrieval System (RAG-oriented backend)

This project is a **backend-centric system** designed to manage, index, and retrieve historical evidence (documents, archives, OCR text) with strong emphasis on:

* Evidence traceability
* Chunk-level retrieval
* Hybrid search (full-text + vector)
* Citation reconstruction

Unlike generic RAG systems, this project treats **"evidence" as first-class structured objects**, not just text.

---

## 2. Core Concept

The system is divided into three conceptual layers:

* **Proof DB** → Data layer (MySQL + OpenSearch + Vector)
* **Archive Cask** → Frontend interface (not part of this task)
* **Few-shot Engine** → OCR (external, not part of this task)

Current scope: **Proof DB only**

---

## 3. System Architecture (Backend Focus)

The backend follows a **modular service architecture** (not microservices yet, but clearly separated layers):

### Components:

1. **Ingestion Layer**

   * Accepts raw Markdown archive documents
   * Pre-processes Markdown page markers such as `<!-- DOCMASTER:PAGE 0001 -->`
   * Splits documents into page-bounded vector chunks
   * Keeps list-style archive records and their `COMMENT` blocks together where possible
   * Extracts metadata, including page numbers
   * Enqueues missing archive metadata such as title, year, author, tags, and summary for async LLM enrichment

2. **Storage Layer**

   * MySQL → metadata, relations
   * OpenSearch → full-text index
   * Vector DB → embeddings (can be OpenSearch kNN or Qdrant)

3. **Retrieval Layer**

   * Full-text search (BM25)
   * Vector search (embedding similarity)
   * Hybrid search (fusion)

4. **Evidence Layer**

   * Maps chunk → page → article
   * Provides page-level citation traceability

👉 这是典型 backend architecture 分层设计（server + database + API协同） ([DEV Community][1])

---

## 4. Tech Stack

### Backend Framework

* PHP 8+
* Webman (HTTP API)
* Workerman (async workers / background jobs)

### Database

* MySQL (relational metadata)

### Search Engine

* OpenSearch

  * Full-text search (BM25)
  * Optional vector search (kNN)

### Vector Layer

* Option A: OpenSearch kNN
* Option B: Qdrant (preferred if scaling)

### Data Flow Tools

* Custom chunking logic (PHP)
* Embedding via external API / local model
* Metadata enrichment via Redis queue + OpenAI-compatible chat completion API

---

## 5. Data Model (CRITICAL)

### Core Entities

```text
Archive
 ├── archive_uid (ULID)
 ├── title
 ├── summary
 ├── source
 └── metadata

Page
 ├── page_number
 ├── block_count
 ├── chunk_count
 └── content_length

PageBlock (internal import structure)
 ├── block_uid
 ├── archive_uid
 ├── page_number
 └── content

Chunk
 ├── chunk_uid (archive_uid + sequence + short uid)
 ├── page_start
 ├── page_end
 ├── text
 ├── embedding_ref
```

### Key Principle

* **archive_uid 是档案级核心 ID，使用 ULID**
* **chunk_uid 是 chunk 级核心 ID，格式为 `{archive_uid}_{chunk_index}_{short_uid}`**
* MySQL / OpenSearch / Vector DB 全部围绕 `archive_uid` 和 `chunk_uid`
* **page_number 是证据定位的关键字段**
* Chunk 是向量化和检索召回单位，不是精确 citation 单位
* 证据定位只需要定位到页码，因此 chunk 可以跨段落合并，但不能跨页

---

## 6. Search Design

### Full-text (OpenSearch)

* Indexed at chunk level
* Supports:

  * keyword match
  * phrase match

### Vector Search

* embedding similarity

### Hybrid Search

* BM25 + vector fusion
* rerank stage


---

## 7. API Design (First Phase)

### Ingestion

```http
POST /api/articles/import
```

---

### Retrieval

```http
POST /api/search/fulltext
POST /api/search/vector
POST /api/search/hybrid
```

---

### Evidence

```http
GET /api/chunks/{chunk_uid}
GET /api/evidence/{chunk_uid}
```

---

## 8. Design Philosophy (IMPORTANT)

* Evidence > Text
* Chunk > Document
* Traceability > Raw Retrieval
* Hybrid Search by default

---

## 9. Non-goals (IMPORTANT)

* No frontend (Archive Cask handled later)
* No OCR (Few-shot Engine external)
* No heavy microservices (keep simple modular architecture first)


* Proof DB ≠ storage
* Proof DB = retrieval + meaning + traceability


[1]: https://dev.to/tomjohnson3/understanding-backend-architecture-ljb?utm_source=chatgpt.com "Understanding Backend Architecture"
[2]: https://exodata.io/what-is-a-tech-stack-how-to-architect-a-modern-scalable-technology-stack/?utm_source=chatgpt.com "How to Build a Tech Stack That Scales [2026] | Exodata"
[3]: https://medium.com/%40hanxuyang0826/roadmap-to-backend-programming-master-architectural-patterns-c763c9194414?utm_source=chatgpt.com "Roadmap to Backend Programming Master: Architectural ..."