1. Project Overview
Project Name: Proof DB
Type: Historical Evidence Retrieval System (RAG-oriented backend)
This project is a backend-centric system designed to manage, index, and retrieve historical evidence (documents, archives, OCR text) with strong emphasis on:
- Evidence traceability
- Chunk-level retrieval
- Hybrid search (full-text + vector)
- Citation reconstruction
Unlike generic RAG systems, this project treats "evidence" as first-class structured objects, not just text.
2. Core Concept
The system is divided into three conceptual layers:
- Proof DB → Data layer (PostgreSQL + OpenSearch + Vector)
- Archive Cask → Frontend interface (not part of this task)
- Few-shot Engine → OCR (external, not part of this task)
Current scope: Proof DB only
3. System Architecture (Backend Focus)
The backend follows a modular service architecture (not microservices yet, but clearly separated layers):
Components:
- Ingestion Layer
  - Accepts raw Markdown archive documents
  - Pre-processes Markdown page markers such as `<!-- DOCMASTER:PAGE 0001 -->`
  - Splits documents into page-bounded vector chunks
  - Keeps list-style archive records and their `COMMENT` blocks together where possible
  - Extracts metadata, including page numbers
  - Enqueues missing archive metadata such as title, year, author, tags, and summary for async LLM enrichment
- Storage Layer
  - PostgreSQL → metadata, relations
  - OpenSearch → full-text index
  - Vector DB → embeddings (can be OpenSearch kNN or Qdrant)
- Retrieval Layer
  - Full-text search (BM25)
  - Vector search (embedding similarity)
  - Hybrid search (fusion)
- Evidence Layer
  - Maps chunk → page → article
  - Provides page-level citation traceability
👉 This is a typical layered backend architecture, with server, database, and API working together (DEV Community)
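The page-marker step of the Ingestion Layer can be sketched as follows. This is an illustrative Python sketch, not the project's PHP implementation: the function name `split_pages`, the tolerance for extra whitespace inside the marker, and the choice of page number 1 for marker-less documents are all assumptions. Text that appears before the first marker is dropped here for simplicity.

```python
import re

# Marker shape taken from the example `<!-- DOCMASTER:PAGE 0001 -->`;
# accepting any digit count and flexible whitespace is an assumption.
PAGE_MARKER = re.compile(r"<!--\s*DOCMASTER:PAGE\s+(\d+)\s*-->")


def split_pages(markdown: str) -> list[tuple[int, str]]:
    """Return (page_number, page_text) pairs in document order."""
    matches = list(PAGE_MARKER.finditer(markdown))
    if not matches:
        # No markers: fall back to a single page (page number 1 is assumed).
        return [(1, markdown.strip())]
    pages = []
    for i, match in enumerate(matches):
        # A page runs from the end of its marker to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        pages.append((int(match.group(1)), markdown[match.end():end].strip()))
    return pages
```

Keeping the page number attached to each text span at this stage is what lets later chunking carry `page_start` and `page_end` without re-parsing the source.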
4. Tech Stack
Backend Framework
- PHP 8+
- Webman (HTTP API)
- Workerman (async workers / background jobs)
Database
- PostgreSQL (relational metadata)
Search Engine
- OpenSearch
  - Full-text search (BM25)
  - Optional vector search (kNN)
Vector Layer
- Option A: OpenSearch kNN
- Option B: Qdrant (preferred if scaling)
Data Flow Tools
- Custom chunking logic (PHP)
- Embedding via external API / local model
- Metadata enrichment via Redis queue + OpenAI-compatible chat completion API
5. Data Model (CRITICAL)
Core Entities
Archive
├── archive_uid (ULID)
├── title
├── summary
├── source
└── metadata
Page
├── page_number
├── block_count
├── chunk_count
└── content_length
PageBlock (internal import structure)
├── block_uid
├── archive_uid
├── page_number
└── content
Chunk
├── chunk_uid (archive_uid + sequence + short uid)
├── page_start
├── page_end
├── text
└── embedding_ref
Key Principle
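- `archive_uid` is the archive-level core ID and uses ULID.
- `chunk_uid` is the chunk-level core ID, formatted as `{archive_uid}_{chunk_index}_{short_uid}`.
- PostgreSQL, OpenSearch, and the Vector DB are all keyed on `archive_uid` and `chunk_uid`.
- `page_number` is the key field for locating evidence.
- A chunk is the unit of vectorization and retrieval recall, not the unit of exact citation.
- Evidence only needs to be located to the page, so a chunk may merge across paragraphs but must not cross a page boundary.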
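The page-boundary rule and the `chunk_uid` format can be illustrated with a small Python sketch (the project itself is PHP). The names `make_chunk_uid` and `chunk_pages`, the `max_chars` limit, the paragraph separator, and the random 5-digit `short_uid` are assumptions made for illustration; only the overall `{archive_uid}_{chunk_index}_{short_uid}` shape comes from the text above.

```python
import secrets


def make_chunk_uid(archive_uid: str, chunk_index: int) -> str:
    # Assumption: short_uid is a random 5-digit suffix.
    short_uid = f"{secrets.randbelow(100000):05d}"
    return f"{archive_uid}_{chunk_index}_{short_uid}"


def chunk_pages(archive_uid: str, pages: list[tuple[int, str]],
                max_chars: int = 1200) -> list[dict]:
    """Merge paragraphs into chunks of up to max_chars, never across pages."""
    chunks: list[dict] = []

    def emit(page_number: int, text: str) -> None:
        chunks.append({
            "chunk_uid": make_chunk_uid(archive_uid, len(chunks) + 1),
            "page_start": page_number,
            "page_end": page_number,
            "text": text,
        })

    for page_number, page_text in pages:
        buffer = ""
        for para in (p.strip() for p in page_text.split("\n\n")):
            if not para:
                continue
            candidate = f"{buffer}\n\n{para}" if buffer else para
            if buffer and len(candidate) > max_chars:
                emit(page_number, buffer)  # buffer is full: close this chunk
                buffer = para
            else:
                buffer = candidate
        if buffer:
            emit(page_number, buffer)  # flush at page end so no chunk crosses pages
    return chunks
```

Because the buffer is flushed whenever a page ends, every chunk in this sketch has `page_start == page_end`, which is the strictest reading of the rule.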
6. Search Design
Full-text (OpenSearch)
- Indexed at chunk level
- Supports:
  - keyword match
  - phrase match
Vector Search
- embedding similarity
Hybrid Search
- BM25 + vector fusion
- rerank stage
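The StepMap below names Reciprocal Rank Fusion as the fusion method for hybrid search. A minimal Python sketch of RRF follows (illustrative only; the project code is PHP). The function name and the constant `k = 60` are assumptions: 60 is a commonly used default, and the value used by the project is not stated in this document.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk_uids; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_uid in enumerate(ranking, start=1):
            scores[chunk_uid] = scores.get(chunk_uid, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by uid for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

RRF uses only ranks, so BM25 scores and vector similarities never need to be normalized onto a common scale. A chunk found by only one retriever still receives a score, which is consistent with the fallback to vector recall described for Chinese queries.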
7. API Design (First Phase)
Ingestion
POST /api/articles/import
Retrieval
POST /api/search/fulltext
POST /api/search/vector
POST /api/search/hybrid
Evidence
GET /api/chunks/{chunk_uid}
GET /api/evidence/{chunk_uid}
8. Design Philosophy (IMPORTANT)
- Evidence > Text
- Chunk > Document
- Traceability > Raw Retrieval
- Hybrid Search by default
9. Non-goals (IMPORTANT)
- No frontend (Archive Cask handled later)
- No OCR (Few-shot Engine external)
- No heavy microservices (keep simple modular architecture first)
- Proof DB ≠ storage
- Proof DB = retrieval + meaning + traceability
10. StepMapToDo
Code review date: 2026-05-03
Scope reviewed: `app/`, `config/`, `scripts/`, `apidoc/`, project root runtime/deploy files. `vendor/` is treated as third-party dependency code and not counted as project implementation.
Database decision: PostgreSQL is the project database contract. The default Webman Dockerfile is out of scope for this StepMap.
Done
- Webman backend skeleton is present and listens on `0.0.0.0:8787`.
- Import API route is registered: `POST /api/articles/import`.
- Import controller supports multipart Markdown upload, raw Markdown body, and JSON body.
- Archive import service can normalize payloads, infer fallback title/source, validate inputs, parse DOCMASTER Markdown page markers, and fall back to single-page Markdown when markers are absent.
- Page-level parsing and chunking are implemented with page-bounded chunks. Current chunking does not intentionally cross page boundaries.
- Noise filtering exists for common archive/OCR boilerplate such as page numbers, classification headers, and declassification footer lines.
- List-style policy records and following `COMMENT` blocks are kept together where the local pattern matches.
- `archive_uid` uses ULID and `chunk_uid` follows `{archive_uid}_{chunk_index}_{short_uid}`.
- Runtime import snapshot writing is implemented under `runtime/proofdb/imports/{import_uid}.json`.
- Relational persistence is implemented through `ArchiveRepository::saveImport()`, including `archives` and `chunks` writes.
- Minimal admin entry frontend exists: landing page with Archive Cask redirect and admin login, plus a session-backed admin dashboard shell.
- Admin dashboard now includes archives-table management, OpenSearch status, admin-user management, APIDOC viewing, and a whitelist-based maintenance-script terminal.
- PostgreSQL is the selected relational database, matching current `pgsql`, JSONB, `BIGSERIAL`, and `TIMESTAMPTZ` implementation.
- PostgreSQL setup script exists for creating `archives` and `chunks` tables plus indexes.
- Admin user bootstrap script exists for creating `admin_users` and seeding/updating an admin account.
- Async AI metadata queue exists on Redis with pending, delayed, failed, retry, and error keys.
- `ai_metadata` Workerman process is registered and can consume Redis jobs.
- OpenAI-compatible chat client exists for metadata enrichment.
- Metadata enrichment service can request/fill `title`, `year`, `author`, `tags`, and `summary` when LLM config is available.
- LLM retry helper exists for retryable HTTP/provider errors.
- Import API documentation exists in `apidoc/importapi.md`.
- BigModel/Zhipu `embedding-3` client is implemented and verified with a live 256-dimension smoke test.
- Generic async task queue/process foundation exists: one DB dispatcher process plus one Redis worker process.
- OpenSearch client factory is implemented and supports passwordless local OpenSearch when security is disabled.
- OpenSearch `proofdb_chunks` hybrid index mapping exists with BM25 text fields and a 2048-dimension `knn_vector` embedding field.
- OpenSearch search-index task handler is implemented and writes embedded chunks through bulk upsert.
- Archive-level `summary` metadata is written into OpenSearch chunk documents and participates in BM25 search alongside `text`, `title`, and other metadata fields.
- End-to-end embedding-to-OpenSearch smoke test passed for 14 chunks: all are `embedding_status=embedded`, `search_index_status=indexed`, and OpenSearch documents contain 2048-dimension vectors.
- Full-text search service, route, controller, and external API documentation are implemented for `POST /api/search/fulltext`.
- Full-text OpenSearch smoke test passed with `query="policy documents"`, returning 12 total hits from indexed chunks.
- Vector search service, route, controller, and external API documentation are implemented for `POST /api/search/vector`.
- Vector OpenSearch smoke test passed with English and Chinese queries. Chinese query `伊拉克入侵科威特与沙漠风暴` correctly recalled the Iraq/Kuwait/Desert Storm chunk as top hit.
- Hybrid search service, route, controller, and external API documentation are implemented for `POST /api/search/hybrid` using Reciprocal Rank Fusion over full-text and vector candidates.
- Hybrid smoke tests passed: English query combines fulltext/vector ranks, and Chinese query falls back to vector recall with the Iraq/Kuwait/Desert Storm chunk as top hit.
- Hybrid search supports `ai=true`: the original query is used for vector search, while the full-text query is rewritten into BM25 keywords through the existing OpenAI-compatible LLM chat path. Keyword generation has a shorter timeout and falls back to the original query on failure.
- Chunk detail API and evidence API are implemented with external documentation: `GET /api/chunks/{chunk_uid}` and `GET /api/evidence/{chunk_uid}`.
- Archive detail API is implemented with external documentation: `GET /api/archives/{archive_uid}`.
- Archive chunk-list and archive evidence-list APIs are implemented with external documentation: `GET /api/archives/{archive_uid}/chunks` and `GET /api/archives/{archive_uid}/evidence`.
- Evidence smoke test passed for `01KQHVREB6XPYF604RVZAP9NNY_1_39003`, returning page label, citation string, and chunk quote.
- Historical `archives.content` can now be repaired with `php scripts/backfill_archive_content.php`, using normalized `raw` when available and ordered chunk text as fallback.
- OpenSearch repair/reindex maintenance script exists: `php scripts/reindex_opensearch.php`, with optional `--archive_uid=...` targeting.
Partially Done
- Archive/Page/Chunk model is partly persisted: `archives` and `chunks` tables exist, but pages/page blocks are only summarized in import output and snapshots, not stored as first-class relational tables.
- `embedding_status`, `embedding_ref`, `embedding_model`, `embedding_error`, and `embedding_updated_at` fields exist; embedding generation into PostgreSQL JSONB, OpenSearch vector indexing, and vector retrieval API are all implemented.
- `search_index_status`, `search_index_error`, and `search_index_updated_at` fields exist and are used by the generic task dispatcher/worker.
- Import response exposes page summaries and chunk IDs. Archive-level and chunk-level read APIs now exist, but there is still no first-class page record API because pages are not stored as relational rows yet.
- AI metadata enrichment updates the archive row, but import-time response only reports the queue state; clients need a follow-up API or polling path to observe completed enrichment.
- Database and Redis credentials are hard-coded in config files; move them to environment variables before production use.
Async Task Contract
The search/vector pipeline should use two generic background processes instead of one process per task family:
ProofDbTaskDispatcher
-> periodically scans PostgreSQL for unfinished work
-> marks eligible rows as queued
-> pushes normalized task payloads into Redis
ProofDbTaskWorker
-> consumes Redis task payloads
-> dispatches by task_type to handlers
-> updates PostgreSQL status after success/failure
Task payload shape:
{
  "task_type": "search_index",
  "target_type": "archive",
  "target_uid": "01...",
  "attempt": 1
}
Initial task types:
- `search_index`: enqueue records where `search_index_status != indexed`; handler writes chunks to OpenSearch.
- `embedding`: enqueue records where `embedding_status in pending, queued, failed_retryable`; handler calls BigModel/Zhipu `embedding-3` and writes embedding references.
Redis tasks may be duplicated or lost; PostgreSQL status is the recovery source of truth. Task handlers must be idempotent around archive_uid / chunk_uid.
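The contract above can be sketched in memory to show why duplicate delivery is harmless when handlers are idempotent. This is an illustrative Python sketch, not the project's PHP processes: a `deque` stands in for Redis, plain dicts stand in for PostgreSQL rows and the OpenSearch index, and the function names and the `failed` status value are hypothetical.

```python
from collections import deque


def dispatch(rows: dict, queue: deque) -> None:
    """Dispatcher: scan rows and enqueue every row that is not yet indexed."""
    for uid, row in rows.items():
        if row["search_index_status"] != "indexed":
            row["search_index_status"] = "queued"
            queue.append({"task_type": "search_index", "target_type": "archive",
                          "target_uid": uid, "attempt": 1})


def handle_search_index(task: dict, rows: dict, index: dict) -> None:
    # Upsert keyed by uid: running the same task twice leaves the same state.
    index[task["target_uid"]] = {"indexed": True}
    rows[task["target_uid"]]["search_index_status"] = "indexed"


HANDLERS = {"search_index": handle_search_index}


def work(queue: deque, rows: dict, index: dict) -> None:
    """Worker: consume payloads and dispatch by task_type."""
    while queue:
        task = queue.popleft()
        handler = HANDLERS.get(task["task_type"])
        if handler is None:
            continue  # unknown task types are skipped in this sketch
        try:
            handler(task, rows, index)
        except Exception as exc:
            # "failed" is a hypothetical status name for this sketch.
            row = rows[task["target_uid"]]
            row["search_index_status"] = "failed"
            row["search_index_error"] = str(exc)
```

Running `dispatch` twice before the worker drains the queue produces two payloads for the same archive, yet the index ends up with a single entry. A lost payload is recovered the same way: the row stays non-indexed, so the next scan enqueues it again.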
Not Done
- Page-level citation reconstruction is not implemented beyond storing `page_start` and `page_end` on chunks.
- Re-embed maintenance command is not present.
- Request validation is handwritten in the service; no dedicated validator classes or reusable validation layer are present.
- Automated tests for Markdown parsing, chunking, import persistence, queue behavior, and metadata enrichment are not present.
- Public API authentication and rate limiting are not present. Minimal admin login/session controls are now present for the maintenance frontend.
- Observability for import/search/enrichment jobs is still minimal; the admin panel now exposes coarse status endpoints, but there are no historical metrics, tracing, or alerting pipelines yet.
- Default landing page is replaced with a Proof DB-specific admin entry surface instead of the Webman starter content.
Future Optimizations
- Extend full-text search from single `query` string over `multi_match` fields to multi-query bool search, for example `queries: ["Iraq Kuwait", "Desert Storm", "policy documents"]` mapped to OpenSearch `bool.should`.
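One possible shape for that request body, sketched in Python for illustration (the project builds queries in PHP). The function name `build_multi_query` is hypothetical, and the field list and `minimum_should_match: 1` are assumptions; `text`, `title`, and `summary` are the fields this document names as participating in BM25 search.

```python
def build_multi_query(queries: list[str],
                      fields: tuple[str, ...] = ("text", "title", "summary")) -> dict:
    """Map each query string to one multi_match clause under bool.should."""
    should = [
        {"multi_match": {"query": q.strip(), "fields": list(fields)}}
        for q in queries
        if q.strip()  # ignore empty or whitespace-only entries
    ]
    # At least one clause must match; more matching clauses score higher.
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}
```

For the example above, `build_multi_query(["Iraq Kuwait", "Desert Storm", "policy documents"])` yields three `should` clauses, so a chunk matching several of the phrasings ranks above one matching only a single phrasing.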
Next Build Order
- Normalize remaining API documentation wording from MySQL to PostgreSQL.
- Add read APIs for archives/chunks/evidence so imported data can be verified without reading snapshots or the database directly.
- Add focused tests for DOCMASTER page parsing, noise filtering, comment coalescing, chunk UID stability, and repository persistence.
- Add async task foundation: task statuses, Redis task payload format, generic DB dispatcher process, and generic Redis worker process. (Done for embedding and OpenSearch indexing)
- Improve page-level citation reconstruction beyond chunk page range metadata.