proofdb/readme.md
2026-05-07 01:40:58 +08:00

## 1. Project Overview
**Project Name:** Proof DB
**Type:** Historical Evidence Retrieval System (RAG-oriented backend)
This project is a **backend-centric system** designed to manage, index, and retrieve historical evidence (documents, archives, OCR text) with strong emphasis on:
* Evidence traceability
* Chunk-level retrieval
* Hybrid search (full-text + vector)
* Citation reconstruction
Unlike generic RAG systems, this project treats **"evidence" as first-class structured objects**, not just text.
---
## 2. Core Concept
The system is divided into three conceptual layers:
* **Proof DB** → Data layer (PostgreSQL + OpenSearch + Vector)
* **Archive Cask** → Frontend interface (not part of this task)
* **Few-shot Engine** → OCR (external, not part of this task)
Current scope: **Proof DB only**
---
## 3. System Architecture (Backend Focus)
The backend follows a **modular service architecture** (not microservices yet, but clearly separated layers):
### Components:
1. **Ingestion Layer**
   * Accepts raw Markdown archive documents
   * Pre-processes Markdown page markers such as `<!-- DOCMASTER:PAGE 0001 -->`
   * Splits documents into page-bounded vector chunks
   * Keeps list-style archive records and their `COMMENT` blocks together where possible
   * Extracts metadata, including page numbers
   * Enqueues missing archive metadata such as title, year, author, tags, and summary for async LLM enrichment
2. **Storage Layer**
   * PostgreSQL → metadata, relations
   * OpenSearch → full-text index
   * Vector DB → embeddings (can be OpenSearch kNN or Qdrant)
3. **Retrieval Layer**
   * Full-text search (BM25)
   * Vector search (embedding similarity)
   * Hybrid search (fusion)
4. **Evidence Layer**
   * Maps chunk → page → article
   * Provides page-level citation traceability

This is a typical layered backend architecture, in which the server, database, and API work together ([DEV Community][1]).
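
The page-marker step of the ingestion layer can be sketched as follows. This is a minimal illustration in Python rather than the project's PHP. The marker format comes from this README, while the function name `split_pages`, the regex details, and the choice to drop text that precedes the first marker are assumptions.

```python
import re

# Marker format taken from this README; whitespace tolerance is an assumption.
PAGE_MARKER = re.compile(r"<!--\s*DOCMASTER:PAGE\s+(\d+)\s*-->")


def split_pages(markdown: str) -> list[dict]:
    """Split Markdown into pages keyed by DOCMASTER page markers.

    Falls back to a single page numbered 1 when no marker is present.
    Text before the first marker is dropped (an assumption of this sketch).
    """
    matches = list(PAGE_MARKER.finditer(markdown))
    if not matches:
        return [{"page_number": 1, "content": markdown.strip()}]

    pages = []
    for i, match in enumerate(matches):
        start = match.end()
        # A page runs until the next marker, or to the end of the document.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        pages.append({
            "page_number": int(match.group(1)),
            "content": markdown[start:end].strip(),
        })
    return pages
```

Keeping the page number on every page record is what later lets each chunk carry `page_start` and `page_end`.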
---
## 4. Tech Stack
### Backend Framework
* PHP 8+
* Webman (HTTP API)
* Workerman (async workers / background jobs)
### Database
* PostgreSQL (relational metadata)
### Search Engine
* OpenSearch
  * Full-text search (BM25)
  * Optional vector search (kNN)
### Vector Layer
* Option A: OpenSearch kNN
* Option B: Qdrant (preferred if scaling)
### Data Flow Tools
* Custom chunking logic (PHP)
* Embedding via external API / local model
* Metadata enrichment via Redis queue + OpenAI-compatible chat completion API
---
## 5. Data Model (CRITICAL)
### Core Entities
```text
Archive
├── archive_uid (ULID)
├── title
├── summary
├── source
└── metadata
Page
├── page_number
├── block_count
├── chunk_count
└── content_length
PageBlock (internal import structure)
├── block_uid
├── archive_uid
├── page_number
└── content
Chunk
├── chunk_uid (archive_uid + sequence + short uid)
├── page_start
├── page_end
├── text
└── embedding_ref
```
### Key Principle
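* **`archive_uid` is the core archive-level ID and uses ULID**
* **`chunk_uid` is the core chunk-level ID, in the format `{archive_uid}_{chunk_index}_{short_uid}`**
* PostgreSQL, OpenSearch, and the vector DB are all keyed on `archive_uid` and `chunk_uid`
* **`page_number` is the key field for locating evidence**
* A chunk is the unit of vectorization and retrieval recall, not the unit of precise citation
* Evidence only needs to be located down to the page, so a chunk may merge several paragraphs but must never cross a page boundary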
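
The page-bounded rule and the `chunk_uid` layout can be illustrated with a short Python sketch (not the project's PHP code). Only the `{archive_uid}_{chunk_index}_{short_uid}` layout and the "never cross a page" rule come from this README. Deriving `short_uid` from a SHA-1 of the text, starting `chunk_index` at 0, splitting on blank lines, and the `max_chars` budget are all assumptions.

```python
import hashlib


def make_chunk_uid(archive_uid: str, chunk_index: int, text: str) -> str:
    # Layout is from the README; hashing the text for short_uid is an assumption.
    short_uid = hashlib.sha1(text.encode("utf-8")).hexdigest()[:8]
    return f"{archive_uid}_{chunk_index}_{short_uid}"


def _emit(archive_uid: str, index: int, page_number: int, paragraphs: list[str]) -> dict:
    text = "\n\n".join(paragraphs)
    return {
        "chunk_uid": make_chunk_uid(archive_uid, index, text),
        # A chunk never spans pages, so both bounds are the same page.
        "page_start": page_number,
        "page_end": page_number,
        "text": text,
    }


def chunk_pages(archive_uid: str, pages: list[dict], max_chars: int = 1200) -> list[dict]:
    """Merge paragraphs into chunks, flushing at every page boundary."""
    chunks: list[dict] = []
    for page in pages:
        buffer: list[str] = []
        size = 0
        paragraphs = [p.strip() for p in page["content"].split("\n\n") if p.strip()]
        for paragraph in paragraphs:
            # Flush when adding this paragraph would exceed the budget.
            if buffer and size + len(paragraph) > max_chars:
                chunks.append(_emit(archive_uid, len(chunks), page["page_number"], buffer))
                buffer, size = [], 0
            buffer.append(paragraph)
            size += len(paragraph)
        if buffer:
            # Always flush at the end of a page, even if the budget is not used up.
            chunks.append(_emit(archive_uid, len(chunks), page["page_number"], buffer))
    return chunks
```

Because every chunk is flushed at the end of its page, a citation derived from `page_start` is always exact at page level, even when the chunk merges several paragraphs.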
---
## 6. Search Design
### Full-text (OpenSearch)
* Indexed at chunk level
* Supports:
  * keyword match
  * phrase match
### Vector Search
* embedding similarity
### Hybrid Search
* BM25 + vector fusion
* rerank stage
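
The StepMap in section 10 notes that the hybrid endpoint fuses candidates with Reciprocal Rank Fusion. A minimal Python sketch of that fusion step is shown here. The function name and the constant `k = 60` (a commonly used default) are assumptions, not values taken from the project.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk_uids; each list is ordered best-first.

    Every appearance of a chunk contributes 1 / (k + rank), with rank starting at 1.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_uid in enumerate(ranking, start=1):
            scores[chunk_uid] = scores.get(chunk_uid, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk_uid for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A chunk that appears in both the BM25 list and the vector list accumulates two contributions, so it tends to outrank chunks found by only one method. If the full-text list is empty, the result is simply the vector ranking.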
---
## 7. API Design (First Phase)
### Ingestion
```http
POST /api/articles/import
```
---
### Retrieval
```http
POST /api/search/fulltext
POST /api/search/vector
POST /api/search/hybrid
```
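
A client call to these endpoints might be built as in the Python sketch below. The port `8787`, the `query` parameter, and the `ai` flag for hybrid search are mentioned in this README. Sending them as a JSON body, the host `127.0.0.1`, and the helper name `build_search_request` are assumptions; the authoritative request format is in the project's API documentation.

```python
import json
import urllib.request

# Assumption: a local instance listening on the port named in this README.
BASE_URL = "http://127.0.0.1:8787"


def build_search_request(mode: str, query: str, ai: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST request for one of the three search modes."""
    if mode not in {"fulltext", "vector", "hybrid"}:
        raise ValueError(f"unknown search mode: {mode}")
    body: dict = {"query": query}
    # The README describes ai=true only for hybrid search.
    if ai and mode == "hybrid":
        body["ai"] = True
    return urllib.request.Request(
        url=f"{BASE_URL}/api/search/{mode}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request can then be sent with `urllib.request.urlopen(request)` against a running instance.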
---
### Evidence
```http
GET /api/chunks/{chunk_uid}
GET /api/evidence/{chunk_uid}
```
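
One possible shape for the evidence lookup is sketched below in Python. The StepMap lists this endpoint as not yet implemented, so everything here is hypothetical: the function names, the response fields, and the citation format. Only `chunk_uid`, `archive_uid`, `page_start`, `page_end`, `title`, and `source` are taken from the data model in section 5.

```python
def archive_uid_from_chunk_uid(chunk_uid: str) -> str:
    # Layout {archive_uid}_{chunk_index}_{short_uid}; assumes short_uid has no underscore.
    archive_uid, _index, _short_uid = chunk_uid.rsplit("_", 2)
    return archive_uid


def build_evidence(archive: dict, chunk: dict) -> dict:
    """Map chunk -> page -> archive and attach a page-level citation string."""
    if archive_uid_from_chunk_uid(chunk["chunk_uid"]) != archive["archive_uid"]:
        raise ValueError("chunk does not belong to this archive")
    start, end = chunk["page_start"], chunk["page_end"]
    label = f"p. {start}" if start == end else f"pp. {start}-{end}"
    return {
        "chunk_uid": chunk["chunk_uid"],
        "archive_uid": archive["archive_uid"],
        "title": archive["title"],
        "source": archive["source"],
        "pages": list(range(start, end + 1)),
        "citation": f"{archive['title']}, {label}",
        "text": chunk["text"],
    }
```

Since the archive ID is embedded in the chunk ID, the lookup needs only the `chunk_uid` from the URL to find both records.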
---
## 8. Design Philosophy (IMPORTANT)
* Evidence > Text
* Chunk > Document
* Traceability > Raw Retrieval
* Hybrid Search by default
---
## 9. Non-goals (IMPORTANT)
* No frontend (Archive Cask handled later)
* No OCR (Few-shot Engine external)
* No heavy microservices (keep simple modular architecture first)
* Proof DB ≠ storage
* Proof DB = retrieval + meaning + traceability

[1]: https://dev.to/tomjohnson3/understanding-backend-architecture-ljb "Understanding Backend Architecture"
[2]: https://exodata.io/what-is-a-tech-stack-how-to-architect-a-modern-scalable-technology-stack/ "How to Build a Tech Stack That Scales [2026] | Exodata"
[3]: https://medium.com/%40hanxuyang0826/roadmap-to-backend-programming-master-architectural-patterns-c763c9194414 "Roadmap to Backend Programming Master: Architectural ..."
---
## 10. StepMap To-Do
> Code review date: 2026-05-03
> Scope reviewed: `app/`, `config/`, `scripts/`, `apidoc/`, project root runtime/deploy files. `vendor/` is treated as third-party dependency code and not counted as project implementation.
> Database decision: PostgreSQL is the project database contract. The default Webman Dockerfile is out of scope for this StepMap.
### Done
- [x] Webman backend skeleton is present and listens on `0.0.0.0:8787`.
- [x] Import API route is registered: `POST /api/articles/import`.
- [x] Import controller supports multipart Markdown upload, raw Markdown body, and JSON body.
- [x] Archive import service can normalize payloads, infer fallback title/source, validate inputs, parse DOCMASTER Markdown page markers, and fall back to single-page Markdown when markers are absent.
- [x] Page-level parsing and chunking are implemented with page-bounded chunks. Current chunking does not intentionally cross page boundaries.
- [x] Noise filtering exists for common archive/OCR boilerplate such as page numbers, classification headers, and declassification footer lines.
- [x] List-style policy records and following `COMMENT` blocks are kept together where the local pattern matches.
- [x] `archive_uid` uses ULID and `chunk_uid` follows `{archive_uid}_{chunk_index}_{short_uid}`.
- [x] Runtime import snapshot writing is implemented under `runtime/proofdb/imports/{import_uid}.json`.
- [x] Relational persistence is implemented through `ArchiveRepository::saveImport()`, including `archives` and `chunks` writes.
- [x] PostgreSQL is the selected relational database, matching current `pgsql`, JSONB, `BIGSERIAL`, and `TIMESTAMPTZ` implementation.
- [x] PostgreSQL setup script exists for creating `archives` and `chunks` tables plus indexes.
- [x] Async AI metadata queue exists on Redis with pending, delayed, failed, retry, and error keys.
- [x] `ai_metadata` Workerman process is registered and can consume Redis jobs.
- [x] OpenAI-compatible chat client exists for metadata enrichment.
- [x] Metadata enrichment service can request/fill `title`, `year`, `author`, `tags`, and `summary` when LLM config is available.
- [x] LLM retry helper exists for retryable HTTP/provider errors.
- [x] Import API documentation exists in `apidoc/importapi.md`.
- [x] BigModel/Zhipu `embedding-3` client is implemented and verified with a live 256-dimension smoke test.
- [x] Generic async task queue/process foundation exists: one DB dispatcher process plus one Redis worker process.
- [x] OpenSearch client factory is implemented and supports passwordless local OpenSearch when security is disabled.
- [x] OpenSearch `proofdb_chunks` hybrid index mapping exists with BM25 text fields and a 2048-dimension `knn_vector` embedding field.
- [x] OpenSearch search-index task handler is implemented and writes embedded chunks through bulk upsert.
- [x] End-to-end embedding-to-OpenSearch smoke test passed for 14 chunks: all are `embedding_status=embedded`, `search_index_status=indexed`, and OpenSearch documents contain 2048-dimension vectors.
- [x] Full-text search service, route, controller, and external API documentation are implemented for `POST /api/search/fulltext`.
- [x] Full-text OpenSearch smoke test passed with `query="policy documents"`, returning 12 total hits from indexed chunks.
- [x] Vector search service, route, controller, and external API documentation are implemented for `POST /api/search/vector`.
- [x] Vector OpenSearch smoke test passed with English and Chinese queries. Chinese query `伊拉克入侵科威特与沙漠风暴` correctly recalled the Iraq/Kuwait/Desert Storm chunk as top hit.
- [x] Hybrid search service, route, controller, and external API documentation are implemented for `POST /api/search/hybrid` using Reciprocal Rank Fusion over full-text and vector candidates.
- [x] Hybrid smoke tests passed: English query combines fulltext/vector ranks, and Chinese query falls back to vector recall with the Iraq/Kuwait/Desert Storm chunk as top hit.
- [x] Hybrid search supports `ai=true`: the original query is used for vector search, while the full-text query is rewritten into BM25 keywords through the existing OpenAI-compatible LLM chat path. Keyword generation has a shorter timeout and falls back to the original query on failure.
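
The `ai=true` behaviour in the last item can be sketched in Python as follows. The split between the original query for vector search and a rewritten query for BM25, plus the fallback on failure, come from this README. The function name, the `rewrite` callable standing in for the LLM chat path, and the 3-second timeout are assumptions.

```python
from typing import Callable


def plan_hybrid_queries(
    query: str,
    ai: bool,
    rewrite: Callable[[str, float], str],
    timeout_seconds: float = 3.0,  # assumption: "shorter timeout" is not quantified
) -> dict:
    """Choose the query text for each leg of a hybrid search.

    `rewrite(query, timeout_seconds)` is a hypothetical stand-in for the LLM
    keyword generator; it may raise on timeout or provider errors.
    """
    fulltext_query = query
    if ai:
        try:
            keywords = rewrite(query, timeout_seconds).strip()
            if keywords:
                fulltext_query = keywords
        except Exception:
            # Any failure falls back to the original query, as the README describes.
            fulltext_query = query
    # The vector leg always embeds the original query.
    return {"vector_query": query, "fulltext_query": fulltext_query}
```

Keeping the vector leg on the original text means a failed or poor rewrite can only affect the BM25 candidates.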
### Partially Done
- [ ] Archive/Page/Chunk model is partly persisted: `archives` and `chunks` tables exist, but pages/page blocks are only summarized in import output and snapshots, not stored as first-class relational tables.
- [ ] `embedding_status`, `embedding_ref`, `embedding_model`, `embedding_error`, and `embedding_updated_at` fields exist; embedding generation into PostgreSQL JSONB and OpenSearch vector indexing are implemented, and vector retrieval is exposed through `POST /api/search/vector` (see Done).
- [ ] `search_index_status`, `search_index_error`, and `search_index_updated_at` fields exist and are used by the generic task dispatcher/worker.
- [ ] Import response exposes page summaries and chunk IDs, but there is no read API yet to fetch archive, page, or chunk records after import.
- [ ] AI metadata enrichment updates the archive row, but import-time response only reports the queue state; clients need a follow-up API or polling path to observe completed enrichment.
- [ ] Database and Redis credentials are hard-coded in config files; move them to environment variables before production use.
### Async Task Contract
The search/vector pipeline should use two generic background processes instead of one process per task family:
```text
ProofDbTaskDispatcher
  -> periodically scans PostgreSQL for unfinished work
  -> marks eligible rows as queued
  -> pushes normalized task payloads into Redis
ProofDbTaskWorker
  -> consumes Redis task payloads
  -> dispatches by task_type to handlers
  -> updates PostgreSQL status after success/failure
```
Task payload shape:
```json
{
  "task_type": "search_index",
  "target_type": "archive",
  "target_uid": "01...",
  "attempt": 1
}
```
Initial task types:
- `search_index`: enqueue records where `search_index_status != indexed`; handler writes chunks to OpenSearch.
- `embedding`: enqueue records where `embedding_status` is one of `pending`, `queued`, or `failed_retryable`; handler calls BigModel/Zhipu `embedding-3` and writes embedding references.
Redis tasks may be duplicated or lost; PostgreSQL status is the recovery source of truth. Task handlers must be idempotent around `archive_uid` / `chunk_uid`.
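
The contract above can be illustrated with an in-memory Python sketch, where a dict stands in for PostgreSQL rows, a `deque` for the Redis queue, and another dict for the OpenSearch index. The payload fields and the dispatcher/worker split come from this section. The function names, the `attempts` counter, and reusing `failed_retryable` as a search-index status are assumptions.

```python
from collections import deque


def dispatch(rows: dict[str, dict], queue: deque) -> int:
    """Scan rows for unfinished work, mark them queued, and push task payloads."""
    pushed = 0
    for uid, row in rows.items():
        if row["search_index_status"] != "indexed":
            row["search_index_status"] = "queued"
            row["attempts"] = row.get("attempts", 0) + 1
            queue.append({
                "task_type": "search_index",
                "target_type": "archive",
                "target_uid": uid,
                "attempt": row["attempts"],
            })
            pushed += 1
    return pushed


def handle_search_index(task: dict, rows: dict[str, dict], index: dict[str, str]) -> None:
    # Idempotent upsert keyed by uid: running it twice leaves the same state.
    index[task["target_uid"]] = rows[task["target_uid"]]["text"]


HANDLERS = {"search_index": handle_search_index}


def work(rows: dict[str, dict], queue: deque, index: dict[str, str]) -> None:
    """Consume tasks, dispatch by task_type, and record the outcome on the row."""
    while queue:
        task = queue.popleft()
        row = rows[task["target_uid"]]
        try:
            HANDLERS[task["task_type"]](task, rows, index)
            row["search_index_status"] = "indexed"
        except Exception as exc:
            row["search_index_status"] = "failed_retryable"
            row["search_index_error"] = str(exc)
```

Because the dispatcher rescans every row that is not `indexed`, a task lost from the queue is simply pushed again on the next scan, and a duplicated task is harmless since the handler is an upsert.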
### Not Done
- [ ] Evidence reconstruction API is not implemented: `GET /api/evidence/{chunk_uid}`.
- [ ] Chunk detail API is not implemented: `GET /api/chunks/{chunk_uid}`.
- [ ] Page-level citation reconstruction is not implemented beyond storing `page_start` and `page_end` on chunks.
- [ ] Reindex/re-embed maintenance commands are not present.
- [ ] Reindex maintenance should detect/recover OpenSearch index loss or stale `search_index_status=indexed` rows when the index has been recreated.
- [ ] Request validation is handwritten in the service; no dedicated validator classes or reusable validation layer are present.
- [ ] Automated tests for Markdown parsing, chunking, import persistence, queue behavior, and metadata enrichment are not present.
- [ ] API authentication, rate limiting, and admin controls are not present.
- [ ] Observability for import/search/enrichment jobs is minimal; no structured job metrics or admin status endpoints are present.
- [ ] Default index page/view still uses Webman starter content and is not Proof DB specific.
### Future Optimizations
- [ ] Extend full-text search from single `query` string over `multi_match` fields to multi-query bool search, for example `queries: ["Iraq Kuwait", "Desert Storm", "policy documents"]` mapped to OpenSearch `bool.should`.
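
The proposed mapping to `bool.should` could look like the Python sketch below, which only builds the OpenSearch query body as a dict. The `queries` input and the use of `multi_match` inside `bool.should` follow the item above. The function name, the example field names, and `minimum_should_match: 1` are assumptions.

```python
def build_multi_query(queries: list[str], fields: list[str]) -> dict:
    """Build an OpenSearch query body with one multi_match clause per query string."""
    cleaned = [q.strip() for q in queries if q.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty query is required")
    return {
        "query": {
            "bool": {
                "should": [
                    {"multi_match": {"query": q, "fields": fields}}
                    for q in cleaned
                ],
                # Assumption: a chunk must match at least one of the queries.
                "minimum_should_match": 1,
            }
        }
    }
```

For example, `build_multi_query(["Iraq Kuwait", "Desert Storm", "policy documents"], ["text", "title"])` yields three `should` clauses, so chunks matching more of the queries score higher. The field names `text` and `title` are placeholders for whatever the `proofdb_chunks` mapping actually defines.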
### Next Build Order
1. Normalize remaining API documentation wording from MySQL to PostgreSQL.
2. Add read APIs for archives/chunks/evidence so imported data can be verified without reading snapshots or the database directly.
3. Add focused tests for DOCMASTER page parsing, noise filtering, comment coalescing, chunk UID stability, and repository persistence.
4. Add async task foundation: task statuses, Redis task payload format, generic DB dispatcher process, and generic Redis worker process. (Done for embedding and OpenSearch indexing)
5. Add chunk detail API and evidence reconstruction API.