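# Search API

## Overview

Proof DB's search endpoints are built on the OpenSearch `proofdb_chunks` index. The current version implements full-text search. The unit of retrieval is a chunk, and each result includes archive metadata, the page range, and the chunk text, which makes later evidence reconstruction easier.

Each chunk document in OpenSearch contains:

- Full-text fields such as `text`, used for BM25 retrieval.
- `summary`, the archive summary field, which takes part in full-text retrieval and is also returned with search results.
- `embedding`, a 2048-dimensional vector field, intended for vector / hybrid retrieval.
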
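
As a rough illustration of the chunk document described above, the following sketch models it as a Python dataclass. The class name `ChunkDocument` and the helper `has_valid_embedding` are hypothetical; the field names are taken from the response examples in this document, and the actual index mapping may contain more fields.

```python
from dataclasses import dataclass, field
from typing import List, Optional

EMBEDDING_DIMENSIONS = 2048  # vector size stated in the overview


@dataclass
class ChunkDocument:
    """Illustrative shape of one chunk document; not the real index mapping."""

    chunk_uid: str
    archive_uid: str
    text: str      # full-text field, searched with BM25
    summary: str   # archive-level summary, searchable and returned with hits
    embedding: List[float] = field(default_factory=list)  # vector field
    page_start: Optional[int] = None
    page_end: Optional[int] = None

    def has_valid_embedding(self) -> bool:
        # The vector is only usable for kNN search if it has the expected size.
        return len(self.embedding) == EMBEDDING_DIMENSIONS


doc = ChunkDocument(
    chunk_uid="01KQHVREB6XPYF604RVZAP9NNY_1_39003",
    archive_uid="01KQHVREB6XPYF604RVZAP9NNY",
    text="chunk text...",
    summary="Summary text...",
    embedding=[0.0] * EMBEDDING_DIMENSIONS,
    page_start=1,
    page_end=1,
)
print(doc.has_valid_embedding())
```
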
## Full-Text Search

```http
POST /api/search/fulltext
```

### Request Format

`Content-Type: application/json`

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `query` | string | Yes | Search keywords or phrase |
| `limit` | int | No | Number of results to return; default `10`, maximum `50` |
| `filters` | object | No | Filter conditions |
| `filters.archive_uid` | string | No | Search within a single archive only |
| `filters.chunk_uid` | string | No | Search within a single chunk only |
| `filters.source` | string | No | Exact match on source |
| `filters.author` | string | No | Exact match on author |
| `filters.series` | string | No | Exact match on series |
| `filters.year` | int | No | Exact match on year |
| `filters.tags` | string\|array | No | Match one or more tags |

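
The request fields above can be assembled on the client side roughly as follows. This is a minimal sketch: `build_fulltext_request` is a hypothetical helper, and clamping `limit` in the client is an assumption, since the document does not say whether the server clamps or rejects values above `50`.

```python
import json
from typing import Any, Dict, List, Optional, Union

DEFAULT_LIMIT = 10
MAX_LIMIT = 50


def build_fulltext_request(
    query: str,
    limit: int = DEFAULT_LIMIT,
    year: Optional[int] = None,
    tags: Union[str, List[str], None] = None,
) -> Dict[str, Any]:
    """Build a JSON-serializable body for POST /api/search/fulltext."""
    if not query.strip():
        raise ValueError("query is required")
    # Client-side clamping is an assumption; the server enforces its own limits.
    body: Dict[str, Any] = {"query": query, "limit": max(1, min(limit, MAX_LIMIT))}
    filters: Dict[str, Any] = {}
    if year is not None:
        filters["year"] = year
    if tags is not None:
        # The API accepts a string or an array; normalize to a list here.
        filters["tags"] = [tags] if isinstance(tags, str) else list(tags)
    if filters:
        body["filters"] = filters
    return body


print(json.dumps(build_fulltext_request("Iraq Kuwait", limit=100, year=1992, tags="NSD 76")))
```

Only `year` and `tags` are shown to keep the sketch short; the other `filters.*` fields would be added the same way.
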
### Request Examples

```bash
curl -X POST <APIdomain>/api/search/fulltext \
  -H 'Content-Type: application/json' \
  --data '{
    "query": "policy documents",
    "limit": 5
  }'
```

With filters:

```bash
curl -X POST <APIdomain>/api/search/fulltext \
  -H 'Content-Type: application/json' \
  --data '{
    "query": "Iraq Kuwait",
    "limit": 10,
    "filters": {
      "year": 1992,
      "tags": ["NSD 76"]
    }
  }'
```

### Successful Response

Status code:

```http
200 OK
```

Example response:

```json
{
  "code": 0,
  "message": "Full-text search completed.",
  "data": {
    "mode": "fulltext",
    "query": "policy documents",
    "limit": 5,
    "filters": [],
    "total": 1,
    "hits": [
      {
        "score": 12.34,
        "chunk_uid": "01KQHVREB6XPYF604RVZAP9NNY_1_39003",
        "archive_uid": "01KQHVREB6XPYF604RVZAP9NNY",
        "chunk_index": 1,
        "page_start": 1,
        "page_end": 1,
        "title": "NSD 76 Disposition of NSC Policy Documents",
        "summary": "Summary text...",
        "source": "archive://nsc/nsd-76",
        "author": "Brent Scowcroft",
        "year": 1992,
        "series": null,
        "tags": ["NSD 76", "政策文件"],
        "text": "chunk text...",
        "embedding_model": "embedding-3",
        "embedding_dimensions": 2048
      }
    ]
  }
}
```

Notes:

- `hits` is the array of results returned by this request.
- `total` is the total number of matches for the current full-text query.
- Full-text search currently matches across `text`, `title`, `summary`, `source`, `author`, `series`, and `tags`.

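
Because each hit carries archive metadata and a page range, a client can turn the response into short citations for evidence reconstruction. The sketch below assumes the response shape shown in the example above; `format_citations` is a hypothetical helper, not part of the API.

```python
from typing import Any, Dict, List


def format_citations(response: Dict[str, Any]) -> List[str]:
    """Turn search hits into short citation strings with page ranges."""
    if response.get("code") != 0:
        raise ValueError(f"search failed: {response.get('message')}")
    citations = []
    for hit in response["data"]["hits"]:
        start, end = hit["page_start"], hit["page_end"]
        # Single-page chunks get "p. N", multi-page chunks get "pp. N-M".
        pages = f"p. {start}" if start == end else f"pp. {start}-{end}"
        citations.append(f"{hit['title']} ({pages}) [{hit['chunk_uid']}]")
    return citations


sample_response = {
    "code": 0,
    "message": "Full-text search completed.",
    "data": {
        "total": 1,
        "hits": [
            {
                "chunk_uid": "01KQHVREB6XPYF604RVZAP9NNY_1_39003",
                "page_start": 1,
                "page_end": 1,
                "title": "NSD 76 Disposition of NSC Policy Documents",
            }
        ],
    },
}
for line in format_citations(sample_response):
    print(line)
```
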
### Error Responses

#### Malformed JSON

Status code:

```http
400 Bad Request
```

```json
{
  "code": 400,
  "message": "Invalid JSON body.",
  "errors": {
    "body": "Syntax error"
  }
}
```

#### Missing query

Status code:

```http
422 Unprocessable Entity
```

```json
{
  "code": 422,
  "message": "Search request validation failed.",
  "errors": {
    "query": "query is required."
  }
}
```

#### Search Failure

Status code:

```http
500 Internal Server Error
```

```json
{
  "code": 500,
  "message": "Full-text search failed.",
  "errors": {
    "search": "error message"
  }
}
```

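
A client might map the three error shapes above to an actionable message along these lines. This is a sketch only: `describe_error` is a hypothetical helper, and the suggested reactions (fix the body, fix the fields, retry later) are assumptions rather than documented behavior.

```python
from typing import Any, Dict


def describe_error(status: int, body: Dict[str, Any]) -> str:
    """Summarize an error response using its status code and `errors` map."""
    errors = body.get("errors") or {}
    detail = "; ".join(f"{name}: {text}" for name, text in errors.items())
    if status == 400:
        kind = "malformed JSON, fix the request body"
    elif status == 422:
        kind = "validation failed, fix the listed fields"
    elif status >= 500:
        kind = "server-side failure, retrying later may help"
    else:
        kind = "unexpected status"
    return f"{status} {body.get('message', '')} ({kind}) {detail}".strip()


print(describe_error(422, {
    "code": 422,
    "message": "Search request validation failed.",
    "errors": {"query": "query is required."},
}))
```
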
## Vector Search

```http
POST /api/search/vector
```

### Request Format

`Content-Type: application/json`

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `query` | string | Yes | Search text. The system first calls Zhipu `embedding-3` to convert it into a 2048-dimensional vector |
| `limit` | int | No | Number of results to return; default `10`, maximum `50` |
| `k` | int | No | OpenSearch kNN candidate count; defaults to `limit`, maximum `50` |
| `filters` | object | No | Filter conditions, same as full-text search |

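
The relationship between `limit` and `k` in the table above can be expressed as a small resolver. `resolve_vector_params` is a hypothetical name, and clamping out-of-range values (rather than rejecting them) is an assumption about how a client might behave.

```python
from typing import Dict, Optional

DEFAULT_LIMIT = 10
MAX_LIMIT = 50
MAX_K = 50


def resolve_vector_params(limit: Optional[int] = None, k: Optional[int] = None) -> Dict[str, int]:
    """Apply the documented defaults: limit defaults to 10, k defaults to limit."""
    limit = DEFAULT_LIMIT if limit is None else min(max(limit, 1), MAX_LIMIT)
    k = limit if k is None else min(max(k, 1), MAX_K)
    return {"limit": limit, "k": k}


print(resolve_vector_params(limit=5))
print(resolve_vector_params(limit=5, k=10))
```
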
### Request Examples

```bash
curl -X POST <APIdomain>/api/search/vector \
  -H 'Content-Type: application/json' \
  --data '{
    "query": "Iraq invasion and Desert Storm",
    "limit": 5,
    "k": 10
  }'
```

Chinese queries can also be submitted to vector search:

```bash
curl -X POST <APIdomain>/api/search/vector \
  -H 'Content-Type: application/json' \
  --data '{
    "query": "伊拉克入侵科威特与沙漠风暴",
    "limit": 5
  }'
```

### Successful Response

Status code:

```http
200 OK
```

Example response:

```json
{
  "code": 0,
  "message": "Vector search completed.",
  "data": {
    "mode": "vector",
    "query": "Iraq invasion and Desert Storm",
    "limit": 5,
    "k": 10,
    "filters": [],
    "embedding_model": "embedding-3",
    "embedding_dimensions": 2048,
    "total": 5,
    "hits": [
      {
        "score": 0.91,
        "chunk_uid": "01KQHVREB6XPYF604RVZAP9NNY_14_97554",
        "archive_uid": "01KQHVREB6XPYF604RVZAP9NNY",
        "chunk_index": 14,
        "page_start": 8,
        "page_end": 8,
        "title": "NSD 76 Disposition of NSC Policy Documents",
        "summary": "Summary text...",
        "source": "archive://nsc/nsd-76",
        "author": "Brent Scowcroft",
        "year": 1992,
        "series": null,
        "tags": ["NSD 76", "政策文件"],
        "text": "chunk text...",
        "embedding_model": "embedding-3",
        "embedding_dimensions": 2048
      }
    ]
  }
}
```

Notes:

- `hits` is the array of results returned by this request.
- `total` is the total number of candidates returned by the current vector query.
- `embedding_dimensions` is the dimensionality of this request's query embedding, not an index-wide dimension statistic.

### Error Responses

The error response format is the same as for full-text search. Common errors include:

- Malformed JSON: `400 Bad Request`
- Missing `query`: `422 Unprocessable Entity`
- Embedding API or OpenSearch query failure: `500 Internal Server Error`

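## Hybrid Search

```http
POST /api/search/hybrid
```

### Overview

Hybrid search runs the following steps on the same `query`:

1. BM25 full-text search.
2. 2048-dimensional vector kNN search.
3. Merging and ranking with Reciprocal Rank Fusion (RRF).

The first version does not apply an additional reranker. RRF does not compare BM25 scores with vector scores directly; it fuses the results by their rank in each of the two lists, which makes it a suitable, stable hybrid baseline.
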
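
The fusion step can be sketched as follows, assuming the standard RRF formula in which each list contributes `1 / (rrf_k + rank)` for a document at 1-based position `rank`. The function `rrf_fuse` and the short IDs are illustrative; the server's actual implementation is not shown in this document.

```python
from typing import Dict, List, Tuple


def rrf_fuse(fulltext: List[str], vector: List[str], rrf_k: int = 60) -> List[Tuple[str, float]]:
    """Fuse two ranked lists of chunk IDs with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = {}
    for ranking in (fulltext, vector):
        for rank, uid in enumerate(ranking, start=1):
            # Only the rank matters; raw BM25 and vector scores are never compared.
            scores[uid] = scores.get(uid, 0.0) + 1.0 / (rrf_k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


fused = rrf_fuse(["a", "b"], ["a", "c"])
for uid, score in fused:
    print(uid, round(score, 10))
```

With `rrf_k = 60`, a chunk ranked first in one list contributes `1 / 61`, roughly `0.0163934426`. A chunk ranked first in both lists therefore scores about `0.0328`.
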
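### Request Format

`Content-Type: application/json`

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `query` | string | Yes | Search text |
| `limit` | int | No | Number of results in the final response; default `10`, maximum `50` |
| `candidate_limit` | int | No | Number of candidates recalled from each path; default `max(limit * 3, 20)`, maximum `50` |
| `rrf_k` | int | No | RRF smoothing parameter; default `60` |
| `filters` | object | No | Filter conditions, same as full-text search |
| `ai` | bool | No | Defaults to `false`. When `true`, the system first uses the existing LLM chat channel to rewrite the original query into BM25 keywords; full-text search uses the AI keywords, while vector search still uses the original query |
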
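
The default for `candidate_limit` depends on `limit`, which the following sketch makes concrete. `resolve_hybrid_params` is a hypothetical helper, and applying the maximum of `50` to the computed default as well as to explicit values is an assumption.

```python
from typing import Dict, Optional

MAX_LIMIT = 50
MAX_CANDIDATE_LIMIT = 50


def resolve_hybrid_params(
    limit: int = 10,
    candidate_limit: Optional[int] = None,
    rrf_k: int = 60,
) -> Dict[str, int]:
    """Apply the documented defaults for a hybrid search request."""
    limit = min(max(limit, 1), MAX_LIMIT)
    if candidate_limit is None:
        # Documented default: three times the limit, but at least 20.
        candidate_limit = max(limit * 3, 20)
    # Assumed: the cap applies to the default too, not only to explicit values.
    candidate_limit = min(max(candidate_limit, 1), MAX_CANDIDATE_LIMIT)
    return {"limit": limit, "candidate_limit": candidate_limit, "rrf_k": rrf_k}


print(resolve_hybrid_params(limit=5))
print(resolve_hybrid_params(limit=10))
print(resolve_hybrid_params(limit=50))
```
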
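If AI keyword generation fails or times out, the system falls back to the original `query` for full-text search and returns the error message in `keywords.error` of the response; vector search is unaffected.
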
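
The fallback behavior can be sketched as below. `plan_queries` and the `rewrite` callable are hypothetical: `rewrite` stands in for the LLM chat channel, and the returned dictionary only mirrors the `fulltext_query`, `vector_query`, and `keywords` fields that appear in the response example.

```python
from typing import Any, Callable, Dict


def plan_queries(query: str, ai: bool, rewrite: Callable[[str], str]) -> Dict[str, Any]:
    """Choose the query for each retrieval path, falling back if rewriting fails."""
    keywords: Dict[str, Any] = {"enabled": ai, "attempted": False, "error": None, "query": None}
    fulltext_query = query
    if ai:
        keywords["attempted"] = True
        try:
            fulltext_query = rewrite(query)
            keywords["query"] = fulltext_query
        except Exception as exc:  # includes timeouts raised by the rewriter
            # Fall back to the original query and report the error.
            keywords["error"] = str(exc)
            fulltext_query = query
    # Vector search always uses the original query.
    return {"fulltext_query": fulltext_query, "vector_query": query, "keywords": keywords}


def failing_rewrite(text: str) -> str:
    raise TimeoutError("keyword generation timed out")


plan = plan_queries("Iraq invasion and Desert Storm", ai=True, rewrite=failing_rewrite)
print(plan["fulltext_query"], "|", plan["keywords"]["error"])
```
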
### Request Examples

```bash
curl -X POST <APIdomain>/api/search/hybrid \
  -H 'Content-Type: application/json' \
  --data '{
    "query": "Iraq invasion and Desert Storm",
    "limit": 5,
    "candidate_limit": 20
  }'
```

Chinese query with AI keyword rewriting enabled:

```bash
curl -X POST <APIdomain>/api/search/hybrid \
  -H 'Content-Type: application/json' \
  --data '{
    "query": "伊拉克入侵科威特与沙漠风暴",
    "limit": 5,
    "ai": true
  }'
```

### Successful Response

Status code:

```http
200 OK
```

Example response:

```json
{
  "code": 0,
  "message": "Hybrid search completed.",
  "data": {
    "mode": "hybrid",
    "query": "Iraq invasion and Desert Storm",
    "limit": 5,
    "candidate_limit": 20,
    "rrf_k": 60,
    "filters": [],
    "ai": true,
    "fulltext_query": "Iraq Kuwait invasion Desert Storm",
    "vector_query": "Iraq invasion and Desert Storm",
    "keywords": {
      "enabled": true,
      "attempted": true,
      "error": null,
      "keywords": ["Iraq", "Kuwait", "invasion", "Desert Storm"],
      "query": "Iraq Kuwait invasion Desert Storm",
      "model": "glm-4.7-flash"
    },
    "total": 10,
    "sources": {
      "fulltext_total": 1,
      "vector_total": 20,
      "fulltext_hits": 1,
      "vector_hits": 20
    },
    "hits": [
      {
        "score": 4.13,
        "hybrid_score": 0.0328,
        "rank_sources": {
          "fulltext": {
            "rank": 1,
            "score": 4.13,
            "rrf": 0.0163934426
          },
          "vector": {
            "rank": 1,
            "score": 0.79,
            "rrf": 0.0163934426
          }
        },
        "chunk_uid": "01KQHVREB6XPYF604RVZAP9NNY_14_97554",
        "archive_uid": "01KQHVREB6XPYF604RVZAP9NNY",
        "page_start": 8,
        "page_end": 8,
        "summary": "Summary text...",
        "text": "chunk text..."
      }
    ]
  }
}
```

Notes:

- `hits` is the result array after fused ranking.
- `total` is the total number of candidates after fusion.
- `sources.fulltext_total` and `sources.vector_total` are the raw counts from each of the two retrieval paths.
- `rank_sources` shows a result's rank in the fulltext and vector lists and the RRF contribution from each.
- `summary` comes from archive-level summary metadata; it is not a summary generated per chunk.

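
To see why a given hit ranked where it did, a client can read `rank_sources` as in this sketch. `explain_hit` is a hypothetical helper, and treating a missing `fulltext` or `vector` entry as "not retrieved by that path" is an assumption, since the example only shows a hit found by both paths.

```python
from typing import Any, Dict


def explain_hit(hit: Dict[str, Any]) -> str:
    """Describe each retrieval path's rank and RRF contribution for one hit."""
    parts = []
    rank_sources = hit.get("rank_sources") or {}
    for source in ("fulltext", "vector"):
        info = rank_sources.get(source)
        if info is None:
            # Assumed meaning: this path did not return the chunk.
            parts.append(f"{source}: not retrieved")
        else:
            parts.append(f"{source}: rank {info['rank']}, rrf {info['rrf']:.4f}")
    return f"{hit['chunk_uid']} -> " + "; ".join(parts)


sample_hit = {
    "chunk_uid": "01KQHVREB6XPYF604RVZAP9NNY_14_97554",
    "rank_sources": {
        "fulltext": {"rank": 1, "score": 4.13, "rrf": 0.0163934426},
        "vector": {"rank": 1, "score": 0.79, "rrf": 0.0163934426},
    },
}
print(explain_hit(sample_hit))
```
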
### Error Responses

The error response format is the same as for full-text search. Common errors include:

- Malformed JSON: `400 Bad Request`
- Missing `query`: `422 Unprocessable Entity`
- Embedding API, full-text search, or vector search failure: `500 Internal Server Error`

## Related Endpoints

The evidence viewing endpoint that accompanies search results is documented in:

- [evidenceapi.md](/www/proofdb/apidoc/evidenceapi.md)