## 1. Project Overview

**Project Name:** Proof DB

**Type:** Historical Evidence Retrieval System (RAG-oriented backend)

This project is a **backend-centric system** designed to manage, index, and retrieve historical evidence (documents, archives, OCR text) with a strong emphasis on:

* Evidence traceability
* Chunk-level retrieval
* Hybrid search (full-text + vector)
* Citation reconstruction

Unlike generic RAG systems, this project treats **"evidence" as first-class structured objects**, not just text.

---
## 2. Core Concept

The system is divided into three conceptual layers:

* **Proof DB** → Data layer (MySQL + OpenSearch + Vector)
* **Archive Cask** → Frontend interface (not part of this task)
* **Few-shot Engine** → OCR (external, not part of this task)

Current scope: **Proof DB only**

---
## 3. System Architecture (Backend Focus)

The backend follows a **modular service architecture** (not microservices yet, but clearly separated layers):

### Components

1. **Ingestion Layer**

   * Accepts raw Markdown archive documents
   * Pre-processes Markdown page markers such as `<!-- DOCMASTER:PAGE 0001 -->`
   * Splits documents into page-bounded vector chunks
   * Keeps list-style archive records and their `COMMENT` blocks together where possible
   * Extracts metadata, including page numbers
   * Enqueues missing archive metadata such as title, year, author, tags, and summary for async LLM enrichment
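The page-marker splitting step can be sketched as follows (Python is used here for brevity even though the ingestion code is PHP; `split_pages` and its return shape are illustrative assumptions, not the project's actual API):

```python
import re

# Marker format is taken from the ingestion notes above; everything else is a sketch.
PAGE_MARKER = re.compile(r"<!--\s*DOCMASTER:PAGE\s+(\d{4})\s*-->")

def split_pages(markdown: str) -> dict:
    """Return {page_number: page_text} for one raw archive document."""
    pages = {}
    parts = PAGE_MARKER.split(markdown)
    # After splitting on the capture group, parts alternates:
    # [preamble, "0001", page text, "0002", page text, ...]
    for i in range(1, len(parts) - 1, 2):
        pages[int(parts[i])] = parts[i + 1].strip()
    return pages
```

Chunks produced downstream must then stay within a single page's text, per the page-bounded chunking rule above.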
2. **Storage Layer**

   * MySQL → metadata, relations
   * OpenSearch → full-text index
   * Vector DB → embeddings (can be OpenSearch kNN or Qdrant)
3. **Retrieval Layer**

   * Full-text search (BM25)
   * Vector search (embedding similarity)
   * Hybrid search (fusion)
4. **Evidence Layer**

   * Maps chunk → page → article
   * Provides page-level citation traceability
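Resolving a citation from a chunk can then be a single join. A sketch only, assuming hypothetical `chunks` and `archives` tables keyed by the IDs described in the data model section:

```sql
-- Sketch: table and column names are assumptions, not the final schema.
SELECT c.chunk_uid,
       c.page_start,
       c.page_end,
       a.archive_uid,
       a.title,
       a.source
FROM chunks c
JOIN archives a ON a.archive_uid = c.archive_uid
WHERE c.chunk_uid = ?;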
👉 This is a typical layered backend architecture design (server + database + API working together) ([DEV Community][1])

---
## 4. Tech Stack

### Backend Framework

* PHP 8+
* Webman (HTTP API)
* Workerman (async workers / background jobs)
### Database

* MySQL (relational metadata)
### Search Engine

* OpenSearch

  * Full-text search (BM25)
  * Optional vector search (kNN)
### Vector Layer

* Option A: OpenSearch kNN
* Option B: Qdrant (preferred if scaling)
### Data Flow Tools

* Custom chunking logic (PHP)
* Embedding via external API / local model
* Metadata enrichment via Redis queue + OpenAI-compatible chat completion API
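As an illustration, an enrichment job pushed onto the Redis queue might look like the following. The field names and model choice are assumptions; the doc only fixes the enrichment targets (title, year, author, tags, summary):

```json
{
  "job": "archive.enrich",
  "archive_uid": "01J9ZK3V8Q5N4X2M7C6B1A0T9E",
  "missing_fields": ["title", "year", "author", "tags", "summary"],
  "attempts": 0
}
```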
---
## 5. Data Model (CRITICAL)

### Core Entities

```text
Archive
├── archive_uid (ULID)
├── title
├── summary
├── source
└── metadata

Page
├── page_number
├── block_count
├── chunk_count
└── content_length

PageBlock (internal import structure)
├── block_uid
├── archive_uid
├── page_number
└── content

Chunk
├── chunk_uid (archive_uid + sequence + short uid)
├── page_start
├── page_end
├── text
└── embedding_ref
```
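A minimal MySQL sketch of the archive- and chunk-level tables, assuming InnoDB and the ID conventions above (column set, sizes, and names are illustrative, not final):

```sql
-- Sketch only: the Page / PageBlock tables are omitted for brevity.
CREATE TABLE archives (
    archive_uid CHAR(26) PRIMARY KEY,        -- ULID
    title       VARCHAR(512) NULL,
    summary     TEXT NULL,
    source      VARCHAR(255) NULL,
    metadata    JSON NULL
) ENGINE = InnoDB;

CREATE TABLE chunks (
    chunk_uid     VARCHAR(64) PRIMARY KEY,   -- {archive_uid}_{chunk_index}_{short_uid}
    archive_uid   CHAR(26) NOT NULL,
    page_start    INT UNSIGNED NOT NULL,
    page_end      INT UNSIGNED NOT NULL,
    text          MEDIUMTEXT NOT NULL,
    embedding_ref VARCHAR(128) NULL,
    KEY idx_archive_page (archive_uid, page_start),
    CONSTRAINT fk_chunk_archive FOREIGN KEY (archive_uid)
        REFERENCES archives (archive_uid)
) ENGINE = InnoDB;
```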
### Key Principle

* **archive_uid is the archive-level core ID and uses ULID**
* **chunk_uid is the chunk-level core ID, in the format `{archive_uid}_{chunk_index}_{short_uid}`**
* MySQL / OpenSearch / Vector DB are all keyed around `archive_uid` and `chunk_uid`
* **page_number is the key field for locating evidence**
* A chunk is the unit of vectorization and retrieval recall, not the unit of precise citation
* Evidence only needs to be located down to the page number, so a chunk may merge across paragraphs but must never cross a page boundary
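The `chunk_uid` convention can be illustrated with a small helper (a Python sketch; the length and charset of the short uid are assumptions, only the three-part format is fixed above):

```python
import secrets
from typing import Optional

def make_chunk_uid(archive_uid: str, chunk_index: int,
                   short_uid: Optional[str] = None) -> str:
    """Build a chunk ID in the {archive_uid}_{chunk_index}_{short_uid} format."""
    if short_uid is None:
        # 8 hex chars; the exact length is an assumption.
        short_uid = secrets.token_hex(4)
    return f"{archive_uid}_{chunk_index}_{short_uid}"
```

Because ULIDs contain no underscores, the three parts can be recovered unambiguously with `rsplit("_", 2)`.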
---
## 6. Search Design

### Full-text (OpenSearch)

* Indexed at chunk level
* Supports:

  * keyword match
  * phrase match
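A chunk-level OpenSearch index mapping might look like this. This is a sketch: analyzer defaults are assumed, the embedding dimension is a placeholder, and the `knn_vector` field only applies if Option A from the vector layer is chosen:

```json
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "chunk_uid":   { "type": "keyword" },
      "archive_uid": { "type": "keyword" },
      "page_start":  { "type": "integer" },
      "page_end":    { "type": "integer" },
      "text":        { "type": "text" },
      "embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": { "name": "hnsw", "engine": "lucene", "space_type": "cosinesimil" }
      }
    }
  }
}
```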
### Vector Search

* Embedding similarity
### Hybrid Search

* BM25 + vector fusion
* Rerank stage
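One common choice for the fusion step is reciprocal rank fusion (RRF). A minimal sketch follows; note that using RRF at all is an assumption (the doc only says "fusion"), and `k = 60` is the usual default constant:

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of chunk_uids via reciprocal rank fusion.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in;
    a higher total score ranks first.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_uid in enumerate(ranking, start=1):
            scores[chunk_uid] = scores.get(chunk_uid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs only ranks, not raw scores, so BM25 and cosine-similarity results can be fused without score normalization; a cross-encoder rerank stage can then reorder the fused top-k.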
---
## 7. API Design (First Phase)

### Ingestion

```http
POST /api/articles/import
```
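A request to the import endpoint might look like the following (the body fields are assumptions; only the route is fixed above):

```http
POST /api/articles/import
Content-Type: application/json

{
  "source": "municipal-archive-1952",
  "markdown": "<!-- DOCMASTER:PAGE 0001 -->\n...page text...",
  "metadata": { "collection": "district-records" }
}
```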
---
### Retrieval

```http
POST /api/search/fulltext
POST /api/search/vector
POST /api/search/hybrid
```
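All three search routes can share one request shape; a sketched example (field names and the `top_k` parameter are assumptions):

```http
POST /api/search/hybrid
Content-Type: application/json

{
  "query": "land deed transfer 1952",
  "top_k": 10
}
```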
---
### Evidence

```http
GET /api/chunks/{chunk_uid}
GET /api/evidence/{chunk_uid}
```
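An evidence response should carry enough to rebuild a page-level citation. A sketched shape (field names and the citation string format are assumptions):

```json
{
  "chunk_uid": "01J9ZK3V8Q5N4X2M7C6B1A0T9E_3_a1b2c3d4",
  "archive": {
    "archive_uid": "01J9ZK3V8Q5N4X2M7C6B1A0T9E",
    "title": "District Land Records, 1952",
    "source": "municipal-archive-1952"
  },
  "page_start": 14,
  "page_end": 14,
  "citation": "District Land Records, 1952, p. 14"
}
```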
---
## 8. Design Philosophy (IMPORTANT)

* Evidence > Text
* Chunk > Document
* Traceability > Raw Retrieval
* Hybrid Search by default

---
## 9. Non-goals (IMPORTANT)

* No frontend (Archive Cask handled later)
* No OCR (Few-shot Engine external)
* No heavy microservices (keep a simple modular architecture first)

In short:

* Proof DB ≠ storage
* Proof DB = retrieval + meaning + traceability
[1]: https://dev.to/tomjohnson3/understanding-backend-architecture-ljb "Understanding Backend Architecture"
[2]: https://exodata.io/what-is-a-tech-stack-how-to-architect-a-modern-scalable-technology-stack/ "How to Build a Tech Stack That Scales [2026] | Exodata"
[3]: https://medium.com/%40hanxuyang0826/roadmap-to-backend-programming-master-architectural-patterns-c763c9194414 "Roadmap to Backend Programming Master: Architectural ..."