proofdb

laysense/proofdb

enoch ed70a140a2 暂存 (temporary save)

2026-05-08 00:05:51 +08:00

readme.md

1. Project Overview

Project Name: Proof DB
Type: Historical Evidence Retrieval System (RAG-oriented backend)

This project is a backend-centric system designed to manage, index, and retrieve historical evidence (documents, archives, OCR text) with strong emphasis on:

Evidence traceability
Chunk-level retrieval
Hybrid search (full-text + vector)
Citation reconstruction

Unlike generic RAG systems, this project treats "evidence" as first-class structured objects, not just text.

2. Core Concept

The system is divided into three conceptual layers:

Proof DB → Data layer (PostgreSQL + OpenSearch + Vector)
Archive Cask → Frontend interface (not part of this task)
Few-shot Engine → OCR (external, not part of this task)

Current scope: Proof DB only

3. System Architecture (Backend Focus)

The backend follows a modular service architecture (not microservices yet, but clearly separated layers):

Components:

Ingestion Layer
- Accepts raw Markdown archive documents
- Pre-processes DOCMASTER Markdown page markers
- Splits documents into page-bounded vector chunks
- Keeps list-style archive records and their COMMENT blocks together where possible
- Extracts metadata, including page numbers
- Enqueues missing archive metadata such as title, year, author, tags, and summary for async LLM enrichment
Storage Layer
- PostgreSQL → metadata, relations
- OpenSearch → full-text index
- Vector DB → embeddings (can be OpenSearch kNN or Qdrant)
Retrieval Layer
- Full-text search (BM25)
- Vector search (embedding similarity)
- Hybrid search (fusion)
Evidence Layer
- Maps chunk → page → article
- Provides page-level citation traceability

This is a typical layered backend architecture (server, database, and API working together). (DEV Community)

4. Tech Stack

Backend Framework

PHP 8+
Webman (HTTP API)
Workerman (async workers / background jobs)

Database

PostgreSQL (relational metadata)

Search Engine

OpenSearch
- Full-text search (BM25)
- Optional vector search (kNN)

Vector Layer

Option A: OpenSearch kNN
Option B: Qdrant (preferred if scaling)

Data Flow Tools

Custom chunking logic (PHP)
Embedding via external API / local model
Metadata enrichment via Redis queue + OpenAI-compatible chat completion API

5. Data Model (CRITICAL)

Core Entities

Archive
 ├── archive_uid (ULID)
 ├── title
 ├── summary
 ├── source
 └── metadata

Page
 ├── page_number
 ├── block_count
 ├── chunk_count
 └── content_length

PageBlock (internal import structure)
 ├── block_uid
 ├── archive_uid
 ├── page_number
 └── content

Chunk
 ├── chunk_uid (archive_uid + sequence + short uid)
 ├── page_start
 ├── page_end
 ├── text
 └── embedding_ref

Key Principle

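archive_uid is the core archive-level ID and uses ULID.
chunk_uid is the core chunk-level ID, in the format {archive_uid}_{chunk_index}_{short_uid}.
PostgreSQL, OpenSearch, and the vector DB are all keyed on archive_uid and chunk_uid.
page_number is the key field for locating evidence.
A chunk is the unit of vectorization and retrieval recall, not the unit of exact citation.
Evidence only needs to be located down to the page, so a chunk may merge several paragraphs but must not cross a page boundary.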
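
The page-bounded chunking rule and the chunk_uid format above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's PHP implementation. The `max_chars` budget, the five-digit random `short_uid`, and the 1-based `chunk_index` are assumptions modeled on the example UID shown elsewhere in this README; the function and class names are hypothetical.

```python
import secrets
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_uid: str
    page_start: int
    page_end: int
    text: str


def short_uid(length=5):
    # Hypothetical suffix scheme: random decimal digits.
    return "".join(secrets.choice("0123456789") for _ in range(length))


def make_chunk(archive_uid, index, page_number, paragraphs):
    # chunk_uid follows {archive_uid}_{chunk_index}_{short_uid}.
    uid = f"{archive_uid}_{index}_{short_uid()}"
    # page_start == page_end because a chunk never leaves its page.
    return Chunk(uid, page_number, page_number, "\n\n".join(paragraphs))


def chunk_pages(archive_uid, pages, max_chars=1200):
    """Merge paragraphs into chunks, one page at a time.

    `pages` maps page_number -> list of paragraph strings.
    """
    chunks = []
    for page_number in sorted(pages):
        buffer, size = [], 0
        for paragraph in pages[page_number]:
            # Flush before the buffer would exceed the size budget.
            if buffer and size + len(paragraph) > max_chars:
                chunks.append(
                    make_chunk(archive_uid, len(chunks) + 1, page_number, buffer)
                )
                buffer, size = [], 0
            buffer.append(paragraph)
            size += len(paragraph)
        # Always flush at the end of the page so no chunk spans two pages.
        if buffer:
            chunks.append(
                make_chunk(archive_uid, len(chunks) + 1, page_number, buffer)
            )
    return chunks
```

Flushing at every page end keeps the page number on each chunk unambiguous, which is what makes page-level citation possible without storing paragraph offsets.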

6. Search Design

Full-text (OpenSearch)

Indexed at chunk level
Supports:
- keyword match
- phrase match

Vector Search

embedding similarity
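
As a rough illustration of embedding-similarity search on OpenSearch, the sketch below builds an index body with a knn_vector field and a matching k-NN query body. It is Python for brevity, not the project's PHP code. The 2048 dimension and the `embedding` and `text` field names come from the StepMap notes in this README; the `keyword` fields and the function names are assumptions.

```python
EMBEDDING_DIM = 2048  # dimension reported for the proofdb_chunks index


def build_index_body(dim=EMBEDDING_DIM):
    """Index body with a BM25 text field and a k-NN vector field."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "chunk_uid": {"type": "keyword"},    # assumed field type
                "archive_uid": {"type": "keyword"},  # assumed field type
                "text": {"type": "text"},
                "embedding": {"type": "knn_vector", "dimension": dim},
            }
        },
    }


def build_knn_query(vector, k=10, dim=EMBEDDING_DIM):
    """Query body that asks OpenSearch for the k nearest chunk vectors."""
    if len(vector) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(vector)}")
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": list(vector), "k": k}}},
    }
```

Checking the vector length before sending the query catches a mismatch between the embedding model's output size and the index mapping early, instead of surfacing it as a search-time error.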

Hybrid Search

BM25 + vector fusion
rerank stage
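
The StepMap below states that hybrid search uses Reciprocal Rank Fusion over full-text and vector candidates. A minimal Python sketch of that fusion step follows; it is illustrative only and not the project's PHP code. The constant `k=60` is the value commonly used for RRF and is an assumption here, as is the tie-break on chunk_uid.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk_uids into one scored list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_uid in enumerate(ranking, start=1):
            scores[chunk_uid] = scores.get(chunk_uid, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by chunk_uid for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Because RRF only looks at ranks, it does not need BM25 scores and vector similarities to be on the same scale, and a chunk found by only one of the two searches still receives a score.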

7. API Design (First Phase)

Ingestion

POST /api/articles/import

Retrieval

POST /api/search/fulltext
POST /api/search/vector
POST /api/search/hybrid

Evidence

GET /api/chunks/{chunk_uid}
GET /api/evidence/{chunk_uid}
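
To show roughly what the evidence endpoint adds on top of the chunk endpoint, here is a small Python sketch that turns a chunk row into a page label, a citation string, and a quote. It is not the project's PHP code. The label and citation formats, the `quote_chars` limit, and all names are hypothetical, since the actual response format is not documented in this README.

```python
from dataclasses import dataclass


@dataclass
class ChunkRow:
    chunk_uid: str
    archive_title: str
    page_start: int
    page_end: int
    text: str


def page_label(page_start, page_end):
    # Chunks are page-bounded, so the single-page branch is the usual case.
    if page_start == page_end:
        return f"p. {page_start}"
    return f"pp. {page_start}-{page_end}"


def build_evidence(chunk, quote_chars=200):
    """Map chunk -> page -> archive into one citation record."""
    label = page_label(chunk.page_start, chunk.page_end)
    quote = chunk.text.strip()
    if len(quote) > quote_chars:
        quote = quote[:quote_chars].rstrip() + "..."
    return {
        "chunk_uid": chunk.chunk_uid,
        "page_label": label,
        "citation": f"{chunk.archive_title}, {label}",
        "quote": quote,
    }
```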

8. Design Philosophy (IMPORTANT)

Evidence > Text
Chunk > Document
Traceability > Raw Retrieval
Hybrid Search by default

9. Non-goals (IMPORTANT)

No frontend (Archive Cask handled later)
No OCR (Few-shot Engine external)
No heavy microservices (keep simple modular architecture first)
Proof DB ≠ storage
Proof DB = retrieval + meaning + traceability

10. StepMapToDo

Code review date: 2026-05-03
Scope reviewed: app/, config/, scripts/, apidoc/, and the project-root runtime/deploy files. vendor/ is treated as third-party dependency code and is not counted as project implementation.
Database decision: PostgreSQL is the project database contract.
The default Webman Dockerfile is out of scope for this StepMap.

Done

Webman backend skeleton is present and listens on 0.0.0.0:8787.
Import API route is registered: POST /api/articles/import.
Import controller supports multipart Markdown upload, raw Markdown body, and JSON body.
Archive import service can normalize payloads, infer fallback title/source, validate inputs, parse DOCMASTER Markdown page markers, and fall back to single-page Markdown when markers are absent.
Page-level parsing and chunking are implemented with page-bounded chunks. Current chunking does not intentionally cross page boundaries.
Noise filtering exists for common archive/OCR boilerplate such as page numbers, classification headers, and declassification footer lines.
List-style policy records and following COMMENT blocks are kept together where the local pattern matches.
archive_uid uses ULID and chunk_uid follows {archive_uid}_{chunk_index}_{short_uid}.
Runtime import snapshot writing is implemented under runtime/proofdb/imports/{import_uid}.json.
Relational persistence is implemented through ArchiveRepository::saveImport(), including archives and chunks writes.
Minimal admin entry frontend exists: landing page with Archive Cask redirect and admin login, plus a session-backed admin dashboard shell.
Admin dashboard now includes archives-table management, OpenSearch status, admin-user management, APIDOC viewing, and a whitelist-based maintenance-script terminal.
PostgreSQL is the selected relational database, matching current pgsql, JSONB, BIGSERIAL, and TIMESTAMPTZ implementation.
PostgreSQL setup script exists for creating archives and chunks tables plus indexes.
Admin user bootstrap script exists for creating admin_users and seeding/updating an admin account.
Async AI metadata queue exists on Redis with pending, delayed, failed, retry, and error keys.
ai_metadata Workerman process is registered and can consume Redis jobs.
OpenAI-compatible chat client exists for metadata enrichment.
Metadata enrichment service can request/fill title, year, author, tags, and summary when LLM config is available.
LLM retry helper exists for retryable HTTP/provider errors.
Import API documentation exists in apidoc/importapi.md.
BigModel/Zhipu embedding-3 client is implemented and verified with a live 256-dimension smoke test.
Generic async task queue/process foundation exists: one DB dispatcher process plus one Redis worker process.
OpenSearch client factory is implemented and supports passwordless local OpenSearch when security is disabled.
OpenSearch proofdb_chunks hybrid index mapping exists with BM25 text fields and a 2048-dimension knn_vector embedding field.
OpenSearch search-index task handler is implemented and writes embedded chunks through bulk upsert.
Archive-level summary metadata is written into OpenSearch chunk documents and participates in BM25 search alongside text, title, and other metadata fields.
End-to-end embedding-to-OpenSearch smoke test passed for 14 chunks: all are embedding_status=embedded, search_index_status=indexed, and OpenSearch documents contain 2048-dimension vectors.
Full-text search service, route, controller, and external API documentation are implemented for POST /api/search/fulltext.
Full-text OpenSearch smoke test passed with query="policy documents", returning 12 total hits from indexed chunks.
Vector search service, route, controller, and external API documentation are implemented for POST /api/search/vector.
Vector OpenSearch smoke test passed with English and Chinese queries. Chinese query 伊拉克入侵科威特与沙漠风暴 correctly recalled the Iraq/Kuwait/Desert Storm chunk as top hit.
Hybrid search service, route, controller, and external API documentation are implemented for POST /api/search/hybrid using Reciprocal Rank Fusion over full-text and vector candidates.
Hybrid smoke tests passed: English query combines fulltext/vector ranks, and Chinese query falls back to vector recall with the Iraq/Kuwait/Desert Storm chunk as top hit.
Hybrid search supports ai=true: the original query is used for vector search, while the full-text query is rewritten into BM25 keywords through the existing OpenAI-compatible LLM chat path. Keyword generation has a shorter timeout and falls back to the original query on failure.
Chunk detail API and evidence API are implemented with external documentation: GET /api/chunks/{chunk_uid} and GET /api/evidence/{chunk_uid}.
Archive detail API is implemented with external documentation: GET /api/archives/{archive_uid}.
Archive chunk-list and archive evidence-list APIs are implemented with external documentation: GET /api/archives/{archive_uid}/chunks and GET /api/archives/{archive_uid}/evidence.
Evidence smoke test passed for 01KQHVREB6XPYF604RVZAP9NNY_1_39003, returning page label, citation string, and chunk quote.
Historical archives.content can now be repaired with php scripts/backfill_archive_content.php, using normalized raw when available and ordered chunk text as fallback.
OpenSearch repair/reindex maintenance script exists: php scripts/reindex_opensearch.php, with optional --archive_uid=... targeting.
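
The ai=true hybrid path listed above (original query for vector search, LLM-rewritten keywords for BM25, fallback to the original query on failure) can be sketched as follows. This is illustrative Python, not the project's PHP code. The `rewrite` callable stands in for the OpenAI-compatible chat call, and the 3-second timeout and all names are assumptions.

```python
def rewrite_for_bm25(query, rewrite, timeout_seconds=3.0):
    """Return LLM-generated keywords, or the original query on any failure.

    `rewrite` is a hypothetical callable (query, timeout_seconds) -> str.
    """
    try:
        keywords = rewrite(query, timeout_seconds)
    except Exception:
        # Timeouts and provider errors must not fail the search request.
        return query
    keywords = (keywords or "").strip()
    return keywords or query


def plan_hybrid_queries(query, ai, rewrite):
    """Decide which query string each half of the hybrid search receives."""
    fulltext_query = rewrite_for_bm25(query, rewrite) if ai else query
    # The vector side always embeds the user's original wording.
    return {"vector_query": query, "fulltext_query": fulltext_query}
```

Keeping the original query on the vector side preserves its full meaning for the embedding model, while the keyword rewrite only affects the BM25 side, where exact terms matter.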

Partially Done

Archive/Page/Chunk model is partly persisted: archives and chunks tables exist, but pages/page blocks are only summarized in import output and snapshots, not stored as first-class relational tables.
embedding_status, embedding_ref, embedding_model, embedding_error, and embedding_updated_at fields exist; embedding generation into PostgreSQL JSONB, OpenSearch vector indexing, and vector retrieval API are all implemented.
search_index_status, search_index_error, and search_index_updated_at fields exist and are used by the generic task dispatcher/worker.
Import response exposes page summaries and chunk IDs. Archive-level and chunk-level read APIs now exist, but there is still no first-class page record API because pages are not stored as relational rows yet.
AI metadata enrichment updates the archive row, but import-time response only reports the queue state; clients need a follow-up API or polling path to observe completed enrichment.
Database and Redis credentials are hard-coded in config files; move them to environment variables before production use.

Async Task Contract

The search/vector pipeline should use two generic background processes instead of one process per task family:

ProofDbTaskDispatcher
  -> periodically scans PostgreSQL for unfinished work
  -> marks eligible rows as queued
  -> pushes normalized task payloads into Redis

ProofDbTaskWorker
  -> consumes Redis task payloads
  -> dispatches by task_type to handlers
  -> updates PostgreSQL status after success/failure

Task payload shape:

{
  "task_type": "search_index",
  "target_type": "archive",
  "target_uid": "01...",
  "attempt": 1
}

Initial task types:

search_index: enqueue records where search_index_status != indexed; handler writes chunks to OpenSearch.
embedding: enqueue records where embedding_status in pending, queued, failed_retryable; handler calls BigModel/Zhipu embedding-3 and writes embedding references.

Redis tasks may be duplicated or lost; PostgreSQL status is the recovery source of truth. Task handlers must be idempotent around archive_uid / chunk_uid.
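
The contract above can be illustrated with an in-memory Python sketch in which a dict stands in for PostgreSQL, a deque for Redis, and another dict for OpenSearch. It is not the project's PHP code, and every name in it is hypothetical. It follows the rule "enqueue where search_index_status != indexed" literally, so duplicate tasks are expected and the handler has to tolerate them.

```python
from collections import deque


def dispatch(rows, queue):
    """Scan the status table and enqueue every row that is not indexed yet."""
    pushed = 0
    for uid, row in rows.items():
        if row["search_index_status"] == "indexed":
            continue
        row["search_index_status"] = "queued"
        queue.append({
            "task_type": "search_index",
            "target_type": "archive",
            "target_uid": uid,
            "attempt": 1,
        })
        pushed += 1
    return pushed


def index_archive(uid, rows, index):
    # Upsert keyed by uid: running it twice leaves the same document.
    index[uid] = rows[uid]["text"]


HANDLERS = {"search_index": index_archive}


def work_once(rows, queue, index):
    """Consume one task, run its handler, and record the outcome."""
    if not queue:
        return False
    task = queue.popleft()
    row = rows[task["target_uid"]]
    try:
        HANDLERS[task["task_type"]](task["target_uid"], rows, index)
    except Exception as error:
        row["search_index_status"] = "failed_retryable"
        row["search_index_error"] = str(error)
    else:
        row["search_index_status"] = "indexed"
        row["search_index_error"] = None
    return True
```

If the queue is wiped, the next dispatcher pass re-enqueues everything that is not marked indexed, which is why the status column, not the queue, is the recovery source of truth.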

Not Done

Page-level citation reconstruction is not implemented beyond storing page_start and page_end on chunks.
Re-embed maintenance command is not present.
Request validation is handwritten in the service; no dedicated validator classes or reusable validation layer are present.
Automated tests for Markdown parsing, chunking, import persistence, queue behavior, and metadata enrichment are not present.
Public API authentication and rate limiting are not present. Minimal admin login/session controls are now present for the maintenance frontend.
Observability for import/search/enrichment jobs is still minimal; the admin panel now exposes coarse status endpoints, but there are no historical metrics, tracing, or alerting pipelines yet.
Default landing page is replaced with a Proof DB-specific admin entry surface instead of the Webman starter content.

Future Optimizations

Extend full-text search from single query string over multi_match fields to multi-query bool search, for example queries: ["Iraq Kuwait", "Desert Storm", "policy documents"] mapped to OpenSearch bool.should.
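
A possible shape for that multi-query request body is sketched below in Python. It is illustrative only and not the project's PHP code. The field list is an assumption based on the fields named in the Done list (text, title, summary), and `minimum_should_match: 1` is one reasonable choice rather than a documented decision.

```python
def build_multi_query(queries, fields=("text", "title", "summary"), size=10):
    """Map several query strings onto bool.should multi_match clauses."""
    cleaned = [q.strip() for q in queries if q and q.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty query is required")
    should = [
        {"multi_match": {"query": q, "fields": list(fields)}}
        for q in cleaned
    ]
    return {
        "size": size,
        # A chunk must match at least one query; more matches score higher.
        "query": {"bool": {"should": should, "minimum_should_match": 1}},
    }
```

With the example from the text, `build_multi_query(["Iraq Kuwait", "Desert Storm", "policy documents"])` yields three should clauses, so chunks matching several of the queries rank above chunks matching only one.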

Next Build Order

Normalize remaining API documentation wording from MySQL to PostgreSQL.
Add read APIs for archives/chunks/evidence so imported data can be verified without reading snapshots or the database directly.
Add focused tests for DOCMASTER page parsing, noise filtering, comment coalescing, chunk UID stability, and repository persistence.
Add async task foundation: task statuses, Redis task payload format, generic DB dispatcher process, and generic Redis worker process. (Done for embedding and OpenSearch indexing)
Improve page-level citation reconstruction beyond chunk page range metadata.