
1. Project Overview

Project Name: Proof DB
Type: Historical Evidence Retrieval System (RAG-oriented backend)

This project is a backend-centric system designed to manage, index, and retrieve historical evidence (documents, archives, OCR text) with strong emphasis on:

  • Evidence traceability
  • Chunk-level retrieval
  • Hybrid search (full-text + vector)
  • Citation reconstruction

Unlike generic RAG systems, this project treats "evidence" as first-class structured objects, not just text.


2. Core Concept

The system is divided into three conceptual layers:

  • Proof DB → Data layer (PostgreSQL + OpenSearch + Vector)
  • Archive Cask → Frontend interface (not part of this task)
  • Few-shot Engine → OCR (external, not part of this task)

Current scope: Proof DB only


3. System Architecture (Backend Focus)

The backend follows a modular service architecture (not microservices yet, but clearly separated layers):

Components:

  1. Ingestion Layer

    • Accepts raw Markdown archive documents
    • Pre-processes Markdown page markers such as <!-- DOCMASTER:PAGE 0001 -->
    • Splits documents into page-bounded vector chunks
    • Keeps list-style archive records and their COMMENT blocks together where possible
    • Extracts metadata, including page numbers
    • Enqueues missing archive metadata such as title, year, author, tags, and summary for async LLM enrichment
  2. Storage Layer

    • PostgreSQL → metadata, relations
    • OpenSearch → full-text index
    • Vector DB → embeddings (can be OpenSearch kNN or Qdrant)
  3. Retrieval Layer

    • Full-text search (BM25)
    • Vector search (embedding similarity)
    • Hybrid search (fusion)
  4. Evidence Layer

    • Maps chunk → page → article
    • Provides page-level citation traceability
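
The ingestion split step described above can be sketched as follows. This is a minimal illustration of splitting on DOCMASTER page markers with a single-page fallback; the function name and exact regex are assumptions, not the project's actual implementation.

```php
<?php

/**
 * Split raw Markdown into [page_number => content] using markers like
 * <!-- DOCMASTER:PAGE 0001 -->. Falls back to treating the whole document
 * as page 1 when no markers are present (sketch only; not the real parser).
 */
function splitPages(string $markdown): array
{
    $parts = preg_split(
        '/<!--\s*DOCMASTER:PAGE\s+(\d+)\s*-->/',
        $markdown,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    // No markers found: single-page fallback.
    if (count($parts) === 1) {
        return [1 => trim($markdown)];
    }

    $pages = [];
    // $parts[0] is any preamble before the first marker; afterwards the
    // array alternates between captured page numbers and page bodies.
    for ($i = 1; $i + 1 < count($parts); $i += 2) {
        $pages[(int) $parts[$i]] = trim($parts[$i + 1]);
    }
    return $pages;
}
```

Page-bounded chunking then operates per entry of the returned map, which is what guarantees chunks never cross a page boundary.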

👉 This is a classic layered backend architecture: server, database, and API working in concert (DEV Community)


4. Tech Stack

Backend Framework

  • PHP 8+
  • Webman (HTTP API)
  • Workerman (async workers / background jobs)

Database

  • PostgreSQL (relational metadata)

Search Engine

  • OpenSearch

    • Full-text search (BM25)
    • Optional vector search (kNN)

Vector Layer

  • Option A: OpenSearch kNN
  • Option B: Qdrant (preferred if scaling)

Data Flow Tools

  • Custom chunking logic (PHP)
  • Embedding via external API / local model
  • Metadata enrichment via Redis queue + OpenAI-compatible chat completion API
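
The Redis-queued enrichment flow above can be sketched with a hypothetical job payload. The field names here are illustrative, not the project's actual queue schema, and the push call assumes the phpredis extension.

```php
<?php

// Build a metadata-enrichment job as a JSON string. All field names are
// assumptions for illustration; the real job shape lives in the project code.
function buildEnrichmentJob(string $archiveUid, array $missingFields): string
{
    return json_encode([
        'archive_uid' => $archiveUid,
        'missing'     => $missingFields,   // e.g. ['title', 'year', 'tags']
        'attempts'    => 0,
        'enqueued_at' => time(),
    ], JSON_THROW_ON_ERROR);
}

// With phpredis, the producer-side push would look something like:
// $redis->lPush('proofdb:ai_metadata:pending', buildEnrichmentJob($uid, $missing));
```

The ai_metadata Workerman process would pop such jobs, call the OpenAI-compatible chat API, and write the filled fields back to the archive row.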

5. Data Model (CRITICAL)

Core Entities

Archive
 ├── archive_uid (ULID)
 ├── title
 ├── summary
 ├── source
 └── metadata

Page
 ├── page_number
 ├── block_count
 ├── chunk_count
 └── content_length

PageBlock (internal import structure)
 ├── block_uid
 ├── archive_uid
 ├── page_number
 └── content

Chunk
 ├── chunk_uid (archive_uid + sequence + short uid)
 ├── page_start
 ├── page_end
 ├── text
 └── embedding_ref
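
A possible PostgreSQL shape for the two persisted entities, consistent with the ULID / JSONB / BIGSERIAL / TIMESTAMPTZ choices noted in section 10. This is a sketch: column names beyond those listed in the trees above are assumptions, not the project's actual schema.

```sql
-- Sketch only: archives and chunks, keyed by archive_uid / chunk_uid.
CREATE TABLE IF NOT EXISTS archives (
    id          BIGSERIAL PRIMARY KEY,
    archive_uid CHAR(26) NOT NULL UNIQUE,       -- ULID
    title       TEXT,
    summary     TEXT,
    source      TEXT,
    metadata    JSONB NOT NULL DEFAULT '{}',
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE IF NOT EXISTS chunks (
    id            BIGSERIAL PRIMARY KEY,
    chunk_uid     TEXT NOT NULL UNIQUE,         -- {archive_uid}_{chunk_index}_{short_uid}
    archive_uid   CHAR(26) NOT NULL REFERENCES archives (archive_uid),
    page_start    INT NOT NULL,
    page_end      INT NOT NULL,
    text          TEXT NOT NULL,
    embedding_ref TEXT                          -- set once vectors exist
);

CREATE INDEX IF NOT EXISTS idx_chunks_archive ON chunks (archive_uid);
```

Pages and page blocks are currently summarized in import output only (see section 10), so they are intentionally absent here.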

Key Principle

  • archive_uid is the archive-level core ID and uses ULID
  • chunk_uid is the chunk-level core ID, formatted as {archive_uid}_{chunk_index}_{short_uid}
  • PostgreSQL / OpenSearch / Vector DB all key off archive_uid and chunk_uid
  • page_number is the key field for evidence localization
  • A chunk is the unit of vectorization and retrieval recall, not the unit of precise citation
  • Evidence only needs to be located to the page level, so a chunk may merge across paragraphs but must never cross a page boundary
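
The documented chunk_uid format can be illustrated with a small builder. The zero-padding of the index and the 6-hex-char short uid are assumptions for readability, not the project's actual generator.

```php
<?php

// Build a chunk_uid following the documented {archive_uid}_{chunk_index}_{short_uid}
// format. Padding width and short-uid scheme are illustrative assumptions.
function makeChunkUid(string $archiveUid, int $chunkIndex, ?string $shortUid = null): string
{
    $shortUid ??= bin2hex(random_bytes(3)); // 6 hex chars when not supplied
    return sprintf('%s_%04d_%s', $archiveUid, $chunkIndex, $shortUid);
}
```

Because the archive_uid is embedded in every chunk_uid, any store (PostgreSQL, OpenSearch, the vector index) can recover the parent archive from a chunk identifier alone.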

6. Search Design

Full-text (OpenSearch)

  • Indexed at chunk level
  • Supports:

    • keyword match
    • phrase match

Vector search

  • embedding similarity

Hybrid search

  • BM25 + vector fusion
  • rerank stage
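
One common way to fuse BM25 and vector rankings is Reciprocal Rank Fusion (RRF). The sketch below is a generic illustration of that technique, not the project's implementation, and would sit before any rerank stage.

```php
<?php

// Reciprocal Rank Fusion sketch: merge two ranked lists of chunk_uids into
// one fused ranking. k = 60 is the conventional RRF smoothing constant.
function rrfFuse(array $bm25Ranked, array $vectorRanked, int $k = 60): array
{
    $scores = [];
    foreach ([$bm25Ranked, $vectorRanked] as $ranked) {
        foreach ($ranked as $rank => $chunkUid) {   // $rank is 0-based
            $scores[$chunkUid] = ($scores[$chunkUid] ?? 0.0) + 1.0 / ($k + $rank + 1);
        }
    }
    arsort($scores);   // highest fused score first
    return array_keys($scores);
}
```

A chunk that appears near the top of both lists outranks one that is top of only a single list, which is usually the desired behavior for hybrid retrieval.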

7. API Design (First Phase)

Ingestion

POST /api/articles/import

Retrieval

POST /api/search/fulltext
POST /api/search/vector
POST /api/search/hybrid

Evidence

GET /api/chunks/{chunk_uid}
GET /api/evidence/{chunk_uid}
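
Since the search endpoints are not yet implemented (section 10), the payload below is only a plausible request shape for the hybrid endpoint, with all field names hypothetical:

```json
{
  "query": "1947 land reform directive",
  "top_k": 10,
  "rerank": true
}
```

A matching hypothetical response would return fused chunk hits with page-level citation fields, in line with the evidence-first design:

```json
{
  "hits": [
    {
      "chunk_uid": "01HZX2J9K4Q8N6W3R5T7V9Y1BC_0003_a1b2c3",
      "archive_uid": "01HZX2J9K4Q8N6W3R5T7V9Y1BC",
      "page_start": 4,
      "page_end": 4,
      "score": 0.0325
    }
  ]
}
```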

8. Design Philosophy (IMPORTANT)

  • Evidence > Text
  • Chunk > Document
  • Traceability > Raw Retrieval
  • Hybrid Search by default

9. Non-goals (IMPORTANT)

  • No frontend (Archive Cask handled later)

  • No OCR (Few-shot Engine external)

  • No heavy microservices (keep simple modular architecture first)

  • Proof DB ≠ storage

  • Proof DB = retrieval + meaning + traceability


10. StepMapToDo

Code review date: 2026-05-03
Scope reviewed: app/, config/, scripts/, apidoc/, project root runtime/deploy files. vendor/ is treated as third-party dependency code and not counted as project implementation.
Database decision: PostgreSQL is the project database contract. The default Webman Dockerfile is out of scope for this StepMap.

Done

  • Webman backend skeleton is present and listens on 0.0.0.0:8787.
  • Import API route is registered: POST /api/articles/import.
  • Import controller supports multipart Markdown upload, raw Markdown body, and JSON body.
  • Archive import service can normalize payloads, infer fallback title/source, validate inputs, parse DOCMASTER Markdown page markers, and fall back to single-page Markdown when markers are absent.
  • Page-level parsing and chunking are implemented with page-bounded chunks. Current chunking does not intentionally cross page boundaries.
  • Noise filtering exists for common archive/OCR boilerplate such as page numbers, classification headers, and declassification footer lines.
  • List-style policy records and following COMMENT blocks are kept together where the local pattern matches.
  • archive_uid uses ULID and chunk_uid follows {archive_uid}_{chunk_index}_{short_uid}.
  • Runtime import snapshot writing is implemented under runtime/proofdb/imports/{import_uid}.json.
  • Relational persistence is implemented through ArchiveRepository::saveImport(), including archives and chunks writes.
  • PostgreSQL is the selected relational database, matching current pgsql, JSONB, BIGSERIAL, and TIMESTAMPTZ implementation.
  • PostgreSQL setup script exists for creating archives and chunks tables plus indexes.
  • Async AI metadata queue exists on Redis with pending, delayed, failed, retry, and error keys.
  • ai_metadata Workerman process is registered and can consume Redis jobs.
  • OpenAI-compatible chat client exists for metadata enrichment.
  • Metadata enrichment service can request/fill title, year, author, tags, and summary when LLM config is available.
  • LLM retry helper exists for retryable HTTP/provider errors.
  • Import API documentation exists in apidoc/importapi.md.

Partially Done

  • Archive/Page/Chunk model is partly persisted: archives and chunks tables exist, but pages/page blocks are only summarized in import output and snapshots, not stored as first-class relational tables.
  • embedding_status, embedding_ref, and embedding_model fields exist, but no embedding generation or vector index write path exists yet.
  • Import response exposes page summaries and chunk IDs, but there is no read API yet to fetch archive, page, or chunk records after import.
  • AI metadata enrichment updates the archive row, but import-time response only reports the queue state; clients need a follow-up API or polling path to observe completed enrichment.
  • API documentation still contains an outdated "后续接入 MySQL" ("MySQL to be integrated later") phrase; update it to PostgreSQL to match the database decision.
  • Database and Redis credentials are hard-coded in config files; move them to environment variables before production use.

Not Done

  • OpenSearch integration is not implemented.
  • Full-text indexing of chunks is not implemented.
  • Full-text search API is not implemented: POST /api/search/fulltext.
  • Embedding API/client for vector generation is not implemented.
  • Vector database integration is not implemented, neither OpenSearch kNN nor Qdrant.
  • Vector search API is not implemented: POST /api/search/vector.
  • Hybrid search fusion/rerank is not implemented: POST /api/search/hybrid.
  • Evidence reconstruction API is not implemented: GET /api/evidence/{chunk_uid}.
  • Chunk detail API is not implemented: GET /api/chunks/{chunk_uid}.
  • Page-level citation reconstruction is not implemented beyond storing page_start and page_end on chunks.
  • OpenSearch/Vector schema, index mappings, and migration/setup scripts are not present.
  • Background worker for embedding pending chunks is not present.
  • Reindex/re-embed maintenance commands are not present.
  • Request validation is handwritten in the service; no dedicated validator classes or reusable validation layer are present.
  • Automated tests for Markdown parsing, chunking, import persistence, queue behavior, and metadata enrichment are not present.
  • API authentication, rate limiting, and admin controls are not present.
  • Observability for import/search/enrichment jobs is minimal; no structured job metrics or admin status endpoints are present.
  • Default index page/view still uses Webman starter content and is not Proof DB specific.

Next Build Order

  1. Normalize remaining API documentation wording from MySQL to PostgreSQL.
  2. Add read APIs for archives/chunks/evidence so imported data can be verified without reading snapshots or the database directly.
  3. Add focused tests for DOCMASTER page parsing, noise filtering, comment coalescing, chunk UID stability, and repository persistence.
  4. Implement embedding generation worker and persist embedding_ref/embedding_model.
  5. Add OpenSearch full-text indexing and POST /api/search/fulltext.
  6. Add vector backend choice and POST /api/search/vector.
  7. Implement hybrid fusion/rerank and citation-oriented evidence reconstruction.