RexUniNLU Chinese NLU Tutorial: Feeding Extraction Results into Elasticsearch to Build a Retrieval System - 深圳市維司達科技有限公司

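1. Introduction

Have you ever faced this situation: you have a pile of unstructured text and want to build an intelligent retrieval system quickly, but you cannot extract the key information reliably? This tutorial addresses that problem.

RexUniNLU is a zero-shot, general-purpose natural language understanding model built on DeBERTa and developed by Alibaba DAMO Academy. It extracts key information from text automatically, with no annotated data required up front. This article walks step by step through importing RexUniNLU's extraction results into Elasticsearch to build an intelligent retrieval system.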

2. Prerequisites

2.1 Environment

Before you start, prepare the following:

A deployed RexUniNLU service (the CSDN 星图 image can be used for quick deployment)
Elasticsearch 7.x or 8.x
Python 3.7+
Python libraries: requests, elasticsearch

2.2 Quick Test of RexUniNLU

First confirm that the RexUniNLU service is running. Send a simple NER request:

    import requests

    url = "http://localhost:8000/ner"
    data = {
        "text": "阿里巴巴总部位于杭州，马云是创始人",
        # Python has no null; use None, which requests serializes to JSON null
        "schema": {"组织机构": None, "人物": None, "地点": None}
    }
    response = requests.post(url, json=data)
    print(response.json())

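The output should contain the extracted entities. If it does, we can move on.

3. Data Extraction Workflow

3.1 Designing the Schema

The schema determines which kinds of information RexUniNLU extracts. Design it around your business needs: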

    news_schema = {
        "人物": None,       # person
        "组织机构": None,   # organization
        "地点": None,       # location
        "时间": None,       # time
        "事件": None        # event
    }
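Since every value in a schema of this form is empty, the dictionary can be generated from a plain list of labels. The following is a minimal sketch under that assumption; `build_schema` is a hypothetical helper name and not part of RexUniNLU. It also prints the serialized payload to show that Python's `None` is what becomes `null` in the JSON request.

```python
import json


def build_schema(labels):
    """Return a flat schema dict in which every label maps to None (JSON null).

    Hypothetical helper for illustration; not part of RexUniNLU.
    Duplicate labels collapse and first-seen order is preserved.
    """
    return {label: None for label in labels}


news_schema = build_schema(["人物", "组织机构", "地点", "时间", "事件"])

# ensure_ascii=False keeps the Chinese labels readable in the printed payload
payload = json.dumps({"text": "示例文本", "schema": news_schema}, ensure_ascii=False)
print(payload)
```

Keeping the labels in one list also makes it easier to reuse them later, for example when defining the Elasticsearch mapping.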

3.2 Batch Processing

Suppose we have a text file, articles.txt, with one article per line. We can process it in a batch:

    import requests

    def process_texts(file_path, schema):
        """Run NER on every non-empty line of the file and collect the results."""
        results = []
        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                text = line.strip()
                if not text:
                    continue  # skip blank lines
                data = {"text": text, "schema": schema}
                response = requests.post("http://localhost:8000/ner", json=data)
                # Non-200 responses are skipped silently
                if response.status_code == 200:
                    results.append({
                        "original_text": text,
                        "entities": response.json().get("抽取实体", {})
                    })
        return results
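The loop above drops any line whose request fails, so failures go unnoticed. One possible variation, sketched here under stated assumptions, records failures and sets a request timeout. The names `extract_via_http`, `process_lines`, and `fake_extract` are hypothetical, and the entity shape returned by the fake extractor is invented purely so the example runs without a live service.

```python
def extract_via_http(text, schema, url="http://localhost:8000/ner", timeout=10):
    """Call the NER endpoint used in this tutorial and return the entity dict."""
    import requests  # imported lazily so the batching logic can be tested offline

    response = requests.post(url, json={"text": text, "schema": schema}, timeout=timeout)
    response.raise_for_status()  # turn HTTP error statuses into exceptions
    return response.json().get("抽取实体", {})


def process_lines(lines, schema, extract=extract_via_http):
    """Extract entities line by line; return (results, failures)."""
    results, failures = [], []
    for line_no, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # skip blank lines
        try:
            entities = extract(text, schema)
        except Exception as exc:  # broad on purpose: a sketch that never aborts the batch
            failures.append({"line": line_no, "error": str(exc)})
            continue
        results.append({"original_text": text, "entities": entities})
    return results, failures


def fake_extract(text, schema):
    """Stand-in extractor (assumed output shape) so the demo needs no server."""
    if "bad" in text:
        raise RuntimeError("simulated failure")
    return {"人物": ["马云"]}


results, failures = process_lines(
    ["马云创办了阿里巴巴\n", "\n", "bad line\n"], {"人物": None}, extract=fake_extract
)
print(len(results), "ok,", len(failures), "failed")
```

With a real file, `process_lines(open("articles.txt", encoding="utf-8"), news_schema)` would use the HTTP extractor by default.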

4. Elasticsearch Integration

4.1 Creating the Index

First, create an Elasticsearch index suited to storing the extraction results:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    index_mapping = {
        "mappings": {
            "properties": {
                # Full article text, analyzed for full-text search
                "content": {"type": "text"},
                # Extracted entities, one keyword field per schema label
                "entities": {
                    "type": "nested",
                    "properties": {
                        "人物": {"type": "keyword"},
                        "组织机构": {"type": "keyword"},
                        "地点": {"type": "keyword"},
                        "时间": {"type": "keyword"},
                        "事件": {"type": "keyword"}
                    }
                }
            }
        }
    }

    # create() fails if the index already exists
    es.indices.create(index="news_articles", body=index_mapping)
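The entity fields in the mapping mirror the schema labels one to one, so the mapping can be derived from the schema instead of being typed twice. This is a sketch of that idea; `build_index_mapping` is a hypothetical helper, and it assumes every label should be stored as a `keyword` field, as in the mapping above.

```python
def build_index_mapping(schema):
    """Build an index mapping with one keyword field per schema label.

    Hypothetical helper; assumes all entity types are exact-match keywords.
    """
    entity_fields = {label: {"type": "keyword"} for label in schema}
    return {
        "mappings": {
            "properties": {
                "content": {"type": "text"},
                "entities": {"type": "nested", "properties": entity_fields},
            }
        }
    }


news_schema = {"人物": None, "组织机构": None, "地点": None, "时间": None, "事件": None}
index_mapping = build_index_mapping(news_schema)
print(sorted(index_mapping["mappings"]["properties"]["entities"]["properties"]))
```

The returned dictionary can be passed to `es.indices.create` in place of the hand-written `index_mapping`.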

4.2 Importing the Data

Import the RexUniNLU results into Elasticsearch:

    def index_documents(es, documents, index_name):
        """Index each processed document, using its 1-based position as the ID."""
        for i, doc in enumerate(documents):
            es.index(
                index=index_name,
                id=i+1,
                body={
                    "content": doc["original_text"],
                    "entities": doc["entities"]
                }
            )
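Indexing one document per request becomes slow for large files. A common alternative is the bulk helper from the `elasticsearch` Python client. The sketch below keeps the same document shape and IDs as `index_documents`; the function names `build_bulk_actions` and `bulk_index` are hypothetical, and the client import is deferred so the action generator can be inspected without a running cluster.

```python
def build_bulk_actions(documents, index_name):
    """Yield one bulk action per document, mirroring index_documents above."""
    for i, doc in enumerate(documents, start=1):
        yield {
            "_index": index_name,
            "_id": i,
            "_source": {
                "content": doc["original_text"],
                "entities": doc["entities"],
            },
        }


def bulk_index(es, documents, index_name):
    """Send all documents through elasticsearch.helpers.bulk."""
    from elasticsearch import helpers  # deferred: only needed when actually indexing

    # helpers.bulk returns (number of successes, list of errors)
    return helpers.bulk(es, build_bulk_actions(documents, index_name))


# Sample input; the entity shape here is an assumption for illustration
sample_docs = [
    {"original_text": "马云创办了阿里巴巴", "entities": {"人物": ["马云"]}},
    {"original_text": "阿里巴巴总部位于杭州", "entities": {"地点": ["杭州"]}},
]
actions = list(build_bulk_actions(sample_docs, "news_articles"))
print(len(actions), actions[0]["_id"])
```

Calling `bulk_index(es, results, "news_articles")` would then replace the per-document loop.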

5. Building the Retrieval System

5.1 Basic Full-Text Search

We can now run a simple full-text search:

    def search_content(es, query, index_name):
        """Full-text search against the content field."""
        body = {
            "query": {
                "match": {
                    "content": query
                }
            }
        }
        return es.search(index=index_name, body=body)
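A search page usually needs pagination and highlighted snippets as well. The sketch below extends the same `match` query with the standard `from`, `size`, and `highlight` request fields; the helper name `build_search_body` and its default page size are assumptions made for illustration.

```python
def build_search_body(query, page=1, size=10):
    """Build a paginated full-text query body with highlighting on content."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {
        "from": (page - 1) * size,  # offset of the first hit on this page
        "size": size,
        "query": {"match": {"content": query}},
        "highlight": {"fields": {"content": {}}},
    }


body = build_search_body("杭州", page=2, size=5)
print(body["from"], body["size"])
```

The body can be passed to `es.search(index=index_name, body=body)` in the same way as in `search_content`.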

5.2 Entity-Filtered Search

Searching by extracted entity is more powerful:

    def search_by_entity(es, entity_type, entity_value, index_name):
        """Find documents whose extracted entities of the given type match the value."""
        body = {
            "query": {
                "nested": {
                    "path": "entities",
                    "query": {
                        "bool": {
                            "must": [
                                {"match": {f"entities.{entity_type}": entity_value}}
                            ]
                        }
                    }
                }
            }
        }
        return es.search(index=index_name, body=body)
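The two search styles can also be combined: score documents by keyword relevance, then restrict them to those that carry a given entity. The following sketch builds such a query body; `build_keyword_entity_query` is a hypothetical name, and the entity condition is placed in a `filter` clause on the assumption that it should narrow the results without affecting the relevance score.

```python
def build_keyword_entity_query(keyword, entity_type, entity_value):
    """Full-text match on content, filtered by an exact entity value."""
    return {
        "query": {
            "bool": {
                # Scored clause: relevance comes from the article text
                "must": [{"match": {"content": keyword}}],
                # Unscored clause: the document must carry this entity
                "filter": [
                    {
                        "nested": {
                            "path": "entities",
                            "query": {
                                "term": {f"entities.{entity_type}": entity_value}
                            },
                        }
                    }
                ],
            }
        }
    }


body = build_keyword_entity_query("创始人", "人物", "马云")
print(body["query"]["bool"]["filter"][0]["nested"]["path"])
```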

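6. Practical Examples

6.1 News Retrieval System

Suppose we have built a news retrieval system in which users can:

Search for news containing specific keywords
Find all news involving a specific person
Look up all events that happened at a given location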

    # Find all news that mentions "马云"
    results = search_by_entity(es, "人物", "马云", "news_articles")
    for hit in results["hits"]["hits"]:
        print(hit["_source"]["content"][:100] + "...")
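The loop above appends "..." even when the article is shorter than 100 characters. A small formatting helper can handle that case and keep the document ID and score alongside each snippet. This is a sketch only; `summarize_hits` is a hypothetical name, and the sample response is hand-written in the standard `hits.hits` layout so the example runs without a cluster.

```python
def summarize_hits(response, max_len=100):
    """Turn a search response into a list of {id, score, snippet} dicts."""
    summaries = []
    for hit in response["hits"]["hits"]:
        content = hit["_source"]["content"]
        # Add an ellipsis only when the text was actually truncated
        snippet = content if len(content) <= max_len else content[:max_len] + "..."
        summaries.append({"id": hit["_id"], "score": hit["_score"], "snippet": snippet})
    return summaries


# Hand-written sample in the shape of an Elasticsearch search response
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "_score": 1.2, "_source": {"content": "马云是阿里巴巴的创始人"}},
            {"_id": "2", "_score": 0.8, "_source": {"content": "新" * 150}},
        ]
    }
}
for item in summarize_hits(sample_response):
    print(item["id"], item["score"], item["snippet"])
```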

6.2 E-Commerce Review Analysis

Another use case is analyzing e-commerce reviews:

    review_schema = {
        "产品名称": None,   # product name
        "产品特性": None,   # product feature
        "评价观点": None,   # opinion
        "情感倾向": None    # sentiment polarity
    }

    # Find all reviews with a positive opinion on the "电池" (battery) feature
    body = {
        "query": {
            "bool": {
                "must": [
                    {"nested": {
                        "path": "entities",
                        "query": {"match": {"entities.产品特性": "电池"}}
                    }},
                    {"nested": {
                        "path": "entities",
                        "query": {"match": {"entities.情感倾向": "正面"}}
                    }}
                ]
            }
        }
    }
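In the query above, the two `nested` clauses are evaluated independently, so a review qualifies as long as some nested object mentions the feature and some nested object carries the sentiment. If each opinion were indexed as its own nested object holding both fields, putting both conditions inside a single `nested` clause would require them to match within the same object. The sketch below shows that variant; the per-opinion document layout is an assumption that the source does not confirm, and `build_feature_sentiment_query` is a hypothetical name.

```python
def build_feature_sentiment_query(feature, sentiment):
    """Require feature and sentiment to match inside the same nested object.

    Assumes each nested `entities` object holds one opinion with both
    `产品特性` (feature) and `情感倾向` (sentiment) fields.
    """
    return {
        "query": {
            "nested": {
                "path": "entities",
                "query": {
                    "bool": {
                        "must": [
                            {"term": {"entities.产品特性": feature}},
                            {"term": {"entities.情感倾向": sentiment}},
                        ]
                    }
                },
            }
        }
    }


body = build_feature_sentiment_query("电池", "正面")
print(len(body["query"]["nested"]["query"]["bool"]["must"]))
```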