proofdb/readme.md
2026-05-01 23:40:14 +08:00

## 1. Project Overview
**Project Name:** Proof DB

**Type:** Historical Evidence Retrieval System (RAG-oriented backend)

This project is a **backend-centric system** designed to manage, index, and retrieve historical evidence (documents, archives, OCR text), with strong emphasis on:
* Evidence traceability
* Chunk-level retrieval
* Hybrid search (full-text + vector)
* Citation reconstruction

Unlike generic RAG systems, this project treats **"evidence" as first-class structured objects**, not just text.

---
## 2. Core Concept
The system is divided into three conceptual layers:

* **Proof DB** → Data layer (MySQL + OpenSearch + vector store)
* **Archive Cask** → Frontend interface (not part of this task)
* **Few-shot Engine** → OCR (external, not part of this task)

Current scope: **Proof DB only**

---
## 3. System Architecture (Backend Focus)
The backend follows a **modular service architecture** (not microservices yet, but clearly separated layers):
### Components:
1. **Ingestion Layer**
   * Accepts raw Markdown archive documents
   * Pre-processes Markdown page markers such as `<!-- DOCMASTER:PAGE 0001 -->`
   * Splits documents into page-bounded vector chunks
   * Keeps list-style archive records and their `COMMENT` blocks together where possible
   * Extracts metadata, including page numbers
   * Enqueues missing archive metadata such as title, year, author, tags, and summary for async LLM enrichment
2. **Storage Layer**
   * MySQL → metadata, relations
   * OpenSearch → full-text index
   * Vector DB → embeddings (can be OpenSearch kNN or Qdrant)
3. **Retrieval Layer**
   * Full-text search (BM25)
   * Vector search (embedding similarity)
   * Hybrid search (fusion)
4. **Evidence Layer**
   * Maps chunk → page → article
   * Provides page-level citation traceability
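The ingestion steps above (splitting on page markers, then building page-bounded chunks) can be sketched roughly as follows. This is an illustrative sketch in Python, not the project's PHP implementation: the function names, the `max_chars` limit, and the paragraph-merging rule are assumptions, and the handling of `COMMENT` blocks is left out.

```python
import re
from dataclasses import dataclass

# Marker format taken from the README: <!-- DOCMASTER:PAGE 0001 -->
PAGE_MARKER = re.compile(r"<!--\s*DOCMASTER:PAGE\s+(\d+)\s*-->")


@dataclass
class Chunk:
    page_number: int
    text: str


def split_pages(markdown: str) -> list[tuple[int, str]]:
    """Return (page_number, page_text) pairs in document order."""
    parts = PAGE_MARKER.split(markdown)
    # re.split with one capture group yields:
    # [text_before_first_marker, page_no, page_text, page_no, page_text, ...]
    pages = []
    for i in range(1, len(parts) - 1, 2):
        pages.append((int(parts[i]), parts[i + 1].strip()))
    return pages


def chunk_pages(pages: list[tuple[int, str]], max_chars: int = 800) -> list[Chunk]:
    """Merge paragraphs into chunks that never cross a page boundary."""
    chunks = []
    for page_number, text in pages:
        buffer = ""
        for paragraph in filter(None, (p.strip() for p in text.split("\n\n"))):
            if buffer and len(buffer) + len(paragraph) + 2 > max_chars:
                # Hypothetical size limit reached: close the current chunk.
                chunks.append(Chunk(page_number, buffer))
                buffer = paragraph
            else:
                buffer = f"{buffer}\n\n{paragraph}" if buffer else paragraph
        if buffer:  # always flush at the page boundary
            chunks.append(Chunk(page_number, buffer))
    return chunks
```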

This is a typical layered backend architecture design, in which the server, database, and API work together ([DEV Community][1]).

---
## 4. Tech Stack
### Backend Framework
* PHP 8+
* Webman (HTTP API)
* Workerman (async workers / background jobs)
### Database
* MySQL (relational metadata)
### Search Engine
* OpenSearch
  * Full-text search (BM25)
  * Optional vector search (kNN)
### Vector Layer
* Option A: OpenSearch kNN
* Option B: Qdrant (preferred if scaling)
### Data Flow Tools
* Custom chunking logic (PHP)
* Embedding via external API / local model
* Metadata enrichment via Redis queue + OpenAI-compatible chat completion API
---
## 5. Data Model (CRITICAL)
### Core Entities
```text
Archive
├── archive_uid (ULID)
├── title
├── summary
├── source
└── metadata
Page
├── page_number
├── block_count
├── chunk_count
└── content_length
PageBlock (internal import structure)
├── block_uid
├── archive_uid
├── page_number
└── content
Chunk
├── chunk_uid (archive_uid + sequence + short uid)
├── page_start
├── page_end
├── text
└── embedding_ref
```
### Key Principle
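* **archive_uid is the core archive-level ID and uses a ULID**
* **chunk_uid is the core chunk-level ID, in the format `{archive_uid}_{chunk_index}_{short_uid}`**
* MySQL, OpenSearch, and the vector DB are all keyed on `archive_uid` and `chunk_uid`
* **page_number is the key field for locating evidence**
* A chunk is the unit of vectorization and retrieval recall, not the unit of exact citation
* Evidence only needs to be located down to the page, so a chunk may merge across paragraphs but must not cross a page boundary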
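As a small illustration of the `{archive_uid}_{chunk_index}_{short_uid}` format, the sketch below builds and parses a `chunk_uid` (Python for illustration only). The zero-padded index and the 8-character hex `short_uid` are assumptions; the README fixes only the order of the three parts. Parsing from the right relies on ULIDs containing no underscores.

```python
import secrets


def make_chunk_uid(archive_uid: str, chunk_index: int) -> str:
    """Build a chunk_uid as {archive_uid}_{chunk_index}_{short_uid}."""
    short_uid = secrets.token_hex(4)  # assumption: 8 hex characters
    return f"{archive_uid}_{chunk_index:04d}_{short_uid}"  # assumption: 4-digit padding


def parse_chunk_uid(chunk_uid: str) -> tuple[str, int, str]:
    """Split a chunk_uid back into (archive_uid, chunk_index, short_uid)."""
    archive_uid, chunk_index, short_uid = chunk_uid.rsplit("_", 2)
    return archive_uid, int(chunk_index), short_uid
```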
---
## 6. Search Design
### Full-text (OpenSearch)
* Indexed at chunk level
* Supports:
  * keyword match
  * phrase match
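The two match types map naturally onto the OpenSearch query DSL's `match` and `match_phrase` queries. The sketch below builds the request bodies only (Python for illustration); the field name `text` is taken from the Chunk entity, and whether the real index uses that name is an assumption.

```python
def keyword_query(terms: str, size: int = 10) -> dict:
    """Match any analyzed term in the chunk text (scored by BM25 by default)."""
    return {"size": size, "query": {"match": {"text": terms}}}


def phrase_query(phrase: str, size: int = 10) -> dict:
    """Match the terms adjacent and in order, as a phrase."""
    return {"size": size, "query": {"match_phrase": {"text": phrase}}}
```

Either body would be sent to the chunk index's `_search` endpoint; the index name is not specified in this document.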
### Vector Search
* embedding similarity
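To make "embedding similarity" concrete, here is a brute-force top-k sketch using cosine similarity (Python for illustration). Cosine is an assumption, since the metric is not specified here, and a real deployment would delegate this to OpenSearch kNN or Qdrant rather than scanning every vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], embeddings: dict[str, list[float]], k: int = 3):
    """Return the k (chunk_uid, score) pairs most similar to the query vector."""
    scored = [(uid, cosine_similarity(query, vec)) for uid, vec in embeddings.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```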
### Hybrid Search
* BM25 + vector fusion
* rerank stage
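The fusion method is not named here. One common choice is reciprocal rank fusion (RRF), sketched below in Python as an assumption rather than the project's chosen method. It combines the BM25 and vector result lists by rank only, so the two score scales never need to be normalized. The constant `k = 60` is the conventional default, and the rerank stage is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk_uids; each hit adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_uid in enumerate(ranking, start=1):
            scores[chunk_uid] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```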
---
## 7. API Design (First Phase)
### Ingestion
```http
POST /api/articles/import
```
---
### Retrieval
```http
POST /api/search/fulltext
POST /api/search/vector
POST /api/search/hybrid
```
---
### Evidence
```http
GET /api/chunks/{chunk_uid}
GET /api/evidence/{chunk_uid}
```
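One way a handler for `GET /api/evidence/{chunk_uid}` might assemble a page-level citation is sketched below (Python for illustration). The `Citation` shape and the in-memory `chunks` and `archives` lookups are hypothetical stand-ins for MySQL queries. The sketch relies on two rules from the data model: `chunk_uid` starts with `archive_uid`, and a chunk never crosses a page.

```python
from dataclasses import dataclass


@dataclass
class Citation:  # hypothetical response shape
    archive_uid: str
    title: str
    page_number: int
    chunk_uid: str


def resolve_evidence(chunk_uid: str, chunks: dict, archives: dict) -> Citation:
    """Map chunk -> page -> archive and return a page-level citation."""
    chunk = chunks[chunk_uid]
    # chunk_uid is {archive_uid}_{chunk_index}_{short_uid}
    archive_uid = chunk_uid.rsplit("_", 2)[0]
    archive = archives[archive_uid]
    # Chunks never cross pages, so page_start alone identifies the page.
    return Citation(
        archive_uid=archive_uid,
        title=archive["title"],
        page_number=chunk["page_start"],
        chunk_uid=chunk_uid,
    )
```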
---
## 8. Design Philosophy (IMPORTANT)
* Evidence > Text
* Chunk > Document
* Traceability > Raw Retrieval
* Hybrid Search by default
---
## 9. Non-goals (IMPORTANT)
* No frontend (Archive Cask handled later)
* No OCR (Few-shot Engine external)
* No heavy microservices (keep simple modular architecture first)
* Proof DB ≠ storage
* Proof DB = retrieval + meaning + traceability

[1]: https://dev.to/tomjohnson3/understanding-backend-architecture-ljb?utm_source=chatgpt.com "Understanding Backend Architecture"
[2]: https://exodata.io/what-is-a-tech-stack-how-to-architect-a-modern-scalable-technology-stack/?utm_source=chatgpt.com "How to Build a Tech Stack That Scales [2026] | Exodata"
[3]: https://medium.com/%40hanxuyang0826/roadmap-to-backend-programming-master-architectural-patterns-c763c9194414?utm_source=chatgpt.com "Roadmap to Backend Programming Master: Architectural ..."