Go to file
2026-05-01 23:40:14 +08:00
apidoc 暂存 2026-05-01 23:40:14 +08:00
app 暂存 2026-05-01 23:40:14 +08:00
config 暂存 2026-05-01 23:40:14 +08:00
public 暂存 2026-05-01 23:40:14 +08:00
runtime 暂存 2026-05-01 23:40:14 +08:00
scripts 暂存 2026-05-01 23:40:14 +08:00
support 暂存 2026-05-01 23:40:14 +08:00
test 暂存 2026-05-01 23:40:14 +08:00
vendor 暂存 2026-05-01 23:40:14 +08:00
webman 暂存 2026-05-01 23:40:14 +08:00
.codex 暂存 2026-05-01 23:40:14 +08:00
.env 暂存 2026-05-01 23:40:14 +08:00
ark.txt 暂存 2026-05-01 23:40:14 +08:00
composer.json 暂存 2026-05-01 23:40:14 +08:00
composer.lock 暂存 2026-05-01 23:40:14 +08:00
docker-compose.yml 暂存 2026-05-01 23:40:14 +08:00
Dockerfile 暂存 2026-05-01 23:40:14 +08:00
LICENSE 暂存 2026-05-01 23:40:14 +08:00
readme.md 暂存 2026-05-01 23:40:14 +08:00
start.php 暂存 2026-05-01 23:40:14 +08:00
windows.bat 暂存 2026-05-01 23:40:14 +08:00
windows.php 暂存 2026-05-01 23:40:14 +08:00

1. Project Overview

Project Name: Proof DB Type: Historical Evidence Retrieval System (RAG-oriented backend)

This project is a backend-centric system designed to manage, index, and retrieve historical evidence (documents, archives, OCR text) with strong emphasis on:

  • Evidence traceability
  • Chunk-level retrieval
  • Hybrid search (full-text + vector)
  • Citation reconstruction

Unlike generic RAG systems, this project treats "evidence" as first-class structured objects, not just text.


2. Core Concept

The system is divided into three conceptual layers:

  • Proof DB → Data layer (MySQL + OpenSearch + Vector)
  • Archive Cask → Frontend interface (not part of this task)
  • Few-shot Engine → OCR (external, not part of this task)

Current scope: Proof DB only


3. System Architecture (Backend Focus)

The backend follows a modular service architecture (not microservices yet, but clearly separated layers):

Components:

  1. Ingestion Layer

    • Accepts raw Markdown archive documents
    • Pre-processes Markdown page markers such as <!-- DOCMASTER:PAGE 0001 -->
    • Splits documents into page-bounded vector chunks
    • Keeps list-style archive records and their COMMENT blocks together where possible
    • Extracts metadata, including page numbers
    • Enqueues missing archive metadata such as title, year, author, tags, and summary for async LLM enrichment
  2. Storage Layer

    • MySQL → metadata, relations
    • OpenSearch → full-text index
    • Vector DB → embeddings (can be OpenSearch kNN or Qdrant)
  3. Retrieval Layer

    • Full-text search (BM25)
    • Vector search (embedding similarity)
    • Hybrid search (fusion)
  4. Evidence Layer

    • Maps chunk → page → article
    • Provides page-level citation traceability

👉 这是典型 backend architecture 分层设计server + database + API协同 (DEV Community)


4. Tech Stack

Backend Framework

  • PHP 8+
  • Webman (HTTP API)
  • Workerman (async workers / background jobs)

Database

  • MySQL (relational metadata)

Search Engine

  • OpenSearch

    • Full-text search (BM25)
    • Optional vector search (kNN)

Vector Layer

  • Option A: OpenSearch kNN
  • Option B: Qdrant (preferred if scaling)

Data Flow Tools

  • Custom chunking logic (PHP)
  • Embedding via external API / local model
  • Metadata enrichment via Redis queue + OpenAI-compatible chat completion API

5. Data Model (CRITICAL)

Core Entities

Archive
 ├── archive_uid (ULID)
 ├── title
 ├── summary
 ├── source
 └── metadata

Page
 ├── page_number
 ├── block_count
 ├── chunk_count
 └── content_length

PageBlock (internal import structure)
 ├── block_uid
 ├── archive_uid
 ├── page_number
 └── content

Chunk
 ├── chunk_uid (archive_uid + sequence + short uid)
 ├── page_start
 ├── page_end
 ├── text
 ├── embedding_ref

Key Principle

  • archive_uid 是档案级核心 ID使用 ULID
  • chunk_uid 是 chunk 级核心 ID格式为 {archive_uid}_{chunk_index}_{short_uid}
  • MySQL / OpenSearch / Vector DB 全部围绕 archive_uidchunk_uid
  • page_number 是证据定位的关键字段
  • Chunk 是向量化和检索召回单位,不是精确 citation 单位
  • 证据定位只需要定位到页码,因此 chunk 可以跨段落合并,但不能跨页

6. Search Design

Full-text (OpenSearch)

  • Indexed at chunk level

  • Supports:

    • keyword match
    • phrase match
  • embedding similarity
  • BM25 + vector fusion
  • rerank stage

7. API Design (First Phase)

Ingestion

POST /api/articles/import

Retrieval

POST /api/search/fulltext
POST /api/search/vector
POST /api/search/hybrid

Evidence

GET /api/chunks/{chunk_uid}
GET /api/evidence/{chunk_uid}

8. Design Philosophy (IMPORTANT)

  • Evidence > Text
  • Chunk > Document
  • Traceability > Raw Retrieval
  • Hybrid Search by default

9. Non-goals (IMPORTANT)

  • No frontend (Archive Cask handled later)

  • No OCR (Few-shot Engine external)

  • No heavy microservices (keep simple modular architecture first)

  • Proof DB ≠ storage

  • Proof DB = retrieval + meaning + traceability