# Search API

## Overview

Proof DB's search endpoints are built on the OpenSearch `proofdb_chunks` index. This document covers full-text, vector, and hybrid search. The retrieval unit is the chunk; each result includes archive metadata, the page range, and the chunk text, which supports later evidence reconstruction.

Each chunk document in OpenSearch contains both:

- Full-text fields such as `text`, used for BM25 retrieval.
- `embedding`, a 2048-dimensional vector field, used for vector and hybrid retrieval.

## Full-text search

```http
POST /api/search/fulltext
```

### Request format

`Content-Type: application/json`

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `query` | string | Yes | Search keywords or phrase |
| `limit` | int | No | Number of results to return; default `10`, maximum `50` |
| `filters` | object | No | Filter conditions |
| `filters.archive_uid` | string | No | Search within a single archive only |
| `filters.chunk_uid` | string | No | Search within a single chunk only |
| `filters.source` | string | No | Exact match on source |
| `filters.author` | string | No | Exact match on author |
| `filters.series` | string | No | Exact match on series |
| `filters.year` | int | No | Exact match on year |
| `filters.tags` | string\|array | No | Match one or more tags |

### Request examples

```bash
curl -X POST http://127.0.0.1:8787/api/search/fulltext \
  -H 'Content-Type: application/json' \
  --data '{
    "query": "policy documents",
    "limit": 5
  }'
```

With filters:

```bash
curl -X POST http://127.0.0.1:8787/api/search/fulltext \
  -H 'Content-Type: application/json' \
  --data '{
    "query": "Iraq Kuwait",
    "limit": 10,
    "filters": {
      "year": 1992,
      "tags": ["NSD 76"]
    }
  }'
```

### Success response

Status code:

```http
200 OK
```

Example response:

```json
{
  "code": 0,
  "message": "Full-text search completed.",
  "data": {
    "mode": "fulltext",
    "query": "policy documents",
    "limit": 5,
    "filters": [],
    "total": 1,
    "hits": [
      {
        "score": 12.34,
        "chunk_uid": "01KQHVREB6XPYF604RVZAP9NNY_1_39003",
        "archive_uid": "01KQHVREB6XPYF604RVZAP9NNY",
        "chunk_index": 1,
        "page_start": 1,
        "page_end": 1,
        "title": "NSD 76 Disposition of NSC Policy Documents",
        "source": "archive://nsc/nsd-76",
        "author": "Brent Scowcroft",
        "year": 1992,
        "series": null,
        "tags": ["NSD 76", "政策文件"],
        "text": "chunk text...",
        "embedding_model": "embedding-3",
        "embedding_dimensions": 2048
      }
    ]
  }
}
```

### Error responses

#### Malformed JSON

Status code:

```http
400 Bad Request
```

```json
{
  "code": 400,
  "message": "Invalid JSON body.",
  "errors": {
    "body": "Syntax error"
  }
}
```

#### Missing query

Status code:

```http
422 Unprocessable Entity
```

```json
{
  "code": 422,
  "message": "Search request validation failed.",
  "errors": {
    "query": "query is required."
  }
}
```

#### Search failure

Status code:

```http
500 Internal Server Error
```

```json
{
  "code": 500,
  "message": "Full-text search failed.",
  "errors": {
    "search": "error message"
  }
}
```

## Vector search

```http
POST /api/search/vector
```

### Request format

`Content-Type: application/json`

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `query` | string | Yes | Search text. The system first calls Zhipu `embedding-3` to convert it into a 2048-dimensional vector |
| `limit` | int | No | Number of results to return; default `10`, maximum `50` |
| `k` | int | No | Number of OpenSearch kNN candidates; defaults to `limit`, maximum `50` |
| `filters` | object | No | Filter conditions, same as full-text search |

### Request examples

```bash
curl -X POST http://127.0.0.1:8787/api/search/vector \
  -H 'Content-Type: application/json' \
  --data '{
    "query": "Iraq invasion and Desert Storm",
    "limit": 5,
    "k": 10
  }'
```

Chinese queries can also be submitted to vector search:

```bash
curl -X POST http://127.0.0.1:8787/api/search/vector \
  -H 'Content-Type: application/json' \
  --data '{
    "query": "伊拉克入侵科威特与沙漠风暴",
    "limit": 5
  }'
```

### Success response

Status code:

```http
200 OK
```

Example response:

```json
{
  "code": 0,
  "message": "Vector search completed.",
  "data": {
    "mode": "vector",
    "query": "Iraq invasion and Desert Storm",
    "limit": 5,
    "k": 10,
    "filters": [],
    "embedding_model": "embedding-3",
    "embedding_dimensions": 2048,
    "total": 5,
    "hits": [
      {
        "score": 0.91,
        "chunk_uid": "01KQHVREB6XPYF604RVZAP9NNY_14_97554",
        "archive_uid": "01KQHVREB6XPYF604RVZAP9NNY",
        "chunk_index": 14,
        "page_start": 8,
        "page_end": 8,
        "title": "NSD 76 Disposition of NSC Policy Documents",
        "source": "archive://nsc/nsd-76",
        "author": "Brent Scowcroft",
        "year": 1992,
        "series": null,
        "tags": ["NSD 76", "政策文件"],
        "text": "chunk text...",
        "embedding_model": "embedding-3",
        "embedding_dimensions": 2048
      }
    ]
  }
}
```

### Error responses

Error responses use the same format as full-text search. Common errors:

- Malformed JSON: `400 Bad Request`
- Missing `query`: `422 Unprocessable Entity`
- Embedding API or OpenSearch query failure: `500 Internal Server Error`

## Hybrid search

```http
POST /api/search/hybrid
```

### Overview

Hybrid search runs the following on the same `query`:

1. BM25 full-text search.
2. 2048-dimensional vector kNN search.
3. Merging and re-ranking of the two result lists with Reciprocal Rank Fusion (RRF).

The first version does not apply an additional reranker. RRF does not compare BM25 scores and vector scores directly; it fuses results by their rank in each list, which makes it a suitable, stable hybrid baseline.

### Request format

`Content-Type: application/json`

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `query` | string | Yes | Search text |
| `limit` | int | No | Number of results in the final response; default `10`, maximum `50` |
| `candidate_limit` | int | No | Number of candidates recalled by each branch; default `max(limit * 3, 20)`, maximum `50` |
| `rrf_k` | int | No | RRF smoothing parameter; default `60` |
| `filters` | object | No | Filter conditions, same as full-text search |
| `ai` | bool | No | Default `false`. When `true`, the system first calls the existing LLM chat channel to rewrite the original query into BM25 keywords; full-text search uses the AI keywords, while vector search still uses the original query |

If AI keyword generation fails or times out, the system falls back to the original `query` for full-text search and returns the error message in `keywords.error` in the response. Vector search is not affected.

### Request examples

```bash
curl -X POST http://127.0.0.1:8787/api/search/hybrid \
  -H 'Content-Type: application/json' \
  --data '{
    "query": "Iraq invasion and Desert Storm",
    "limit": 5,
    "candidate_limit": 20
  }'
```

Chinese query:

```bash
curl -X POST http://127.0.0.1:8787/api/search/hybrid \
  -H 'Content-Type: application/json' \
  --data '{
    "query": "伊拉克入侵科威特与沙漠风暴",
    "limit": 5,
    "ai": true
  }'
```

### Success response

Status code:

```http
200 OK
```

Example response:

```json
{
  "code": 0,
  "message": "Hybrid search completed.",
  "data": {
    "mode": "hybrid",
    "query": "Iraq invasion and Desert Storm",
    "limit": 5,
    "candidate_limit": 20,
    "rrf_k": 60,
    "filters": [],
    "ai": true,
    "fulltext_query": "Iraq Kuwait invasion Desert Storm",
    "vector_query": "Iraq invasion and Desert Storm",
    "keywords": {
      "enabled": true,
      "attempted": true,
      "error": null,
      "keywords": ["Iraq", "Kuwait", "invasion", "Desert Storm"],
      "query": "Iraq Kuwait invasion Desert Storm",
      "model": "glm-4.7-flash"
    },
    "total": 10,
    "sources": {
      "fulltext_total": 1,
      "vector_total": 20,
      "fulltext_hits": 1,
      "vector_hits": 20
    },
    "hits": [
      {
        "score": 4.13,
        "hybrid_score": 0.0325,
        "rank_sources": {
          "fulltext": { "rank": 1, "score": 4.13, "rrf": 0.0163934426 },
          "vector": { "rank": 1, "score": 0.79, "rrf": 0.0163934426 }
        },
        "chunk_uid": "01KQHVREB6XPYF604RVZAP9NNY_14_97554",
        "archive_uid": "01KQHVREB6XPYF604RVZAP9NNY",
        "page_start": 8,
        "page_end": 8,
        "text": "chunk text..."
      }
    ]
  }
}
```

### Error responses

Error responses use the same format as full-text search. Common errors:

- Malformed JSON: `400 Bad Request`
- Missing `query`: `422 Unprocessable Entity`
- Embedding API, full-text search, or vector search failure: `500 Internal Server Error`

## Upcoming endpoints

The following capabilities are not yet implemented:

```http
GET /api/chunks/{chunk_uid}
GET /api/evidence/{chunk_uid}
```