13 Transformers - Using Pipelines for Natural Language Processing

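Contents

Natural language processing
- Text classification
- Zero-shot text classification
- `token` classification
- Question answering
- Table question answering
- Summarization
- Translation
- Text generation
- Text-to-text generation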

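Transformers is an open-source library that provides state-of-the-art (SoTA) pretrained models and tooling for inference and training across natural language processing, computer vision, audio, and multimodal tasks. It aims to be fast and easy to use, so that anyone can start learning or building with transformer models. The library covers not only Transformer models but also non-Transformer models, such as modern convolutional networks for computer vision tasks.

Natural language processing

NLP tasks are among the most common task types, because text is such a natural way for us to communicate. To get text into a format a model can recognize, it has to be tokenized: the text is split into individual words or subwords (tokens), and those tokens are then converted into numbers. A piece of text can therefore be represented as a sequence of numbers, and once you have that sequence, you can feed it to a model to solve all kinds of NLP tasks.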
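To make the text-to-numbers step concrete, here is a toy sketch in plain Python. It is not the tokenizer that Transformers uses (real tokenizers use learned subword vocabularies); the whitespace splitting, the `[UNK]` entry, and the function names `build_vocab` and `encode` are assumptions made purely for illustration.

```python
def build_vocab(texts):
    """Give every distinct lowercase word an integer id, in order of first appearance."""
    vocab = {"[UNK]": 0}  # id 0 is reserved for words not in the vocabulary
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Split text into tokens, then map each token to its id (unknown tokens get 0)."""
    tokens = text.lower().split()
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    return tokens, ids


vocab = build_vocab(["hugging face is the best thing since sliced bread"])
tokens, ids = encode("Hugging Face is the best", vocab)
print(tokens)
print(ids)
```

The list of ids is what a model would actually receive; the original words are only recoverable through the vocabulary.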

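Text classification

Like classification tasks in any modality, text classification labels a piece of text (a sentence, a paragraph, or a whole document) with a label from a predefined set of classes. It has many practical applications, including:

- Sentiment analysis: label text according to some polarity, such as positive or negative, which can support decision-making in fields like politics, finance, and marketing.
- Content classification: label text according to its topic (weather, sports, finance, and so on) to help organize and filter information in news and social media feeds.

The task identifier for text classification is text-classification; the example below uses its alias, sentiment-analysis.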

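    from transformers import pipeline

    # No model is given, so the pipeline falls back to its default model for this task.
    classifier = pipeline(task="sentiment-analysis")
    preds = classifier("Hugging Face is the best thing since sliced bread!")
    # Keep only the label and the score, rounded to four decimal places.
    preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
    print(preds)

Result: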

[{'score': 0.9991, 'label': 'POSITIVE'}]
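The score in the result is a probability. As a rough sketch of where such a number comes from, the snippet below turns raw class scores (logits) into probabilities with a softmax and picks the most likely label. This assumes softmax post-processing, which is the usual choice for multi-class classifiers; the logits are made up for illustration and do not come from a real model run, and `to_prediction` is a hypothetical helper name.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits, labels):
    """Return the most likely label together with its rounded probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": round(probs[best], 4)}


# Hypothetical logits for the two sentiment classes.
prediction = to_prediction([-3.5, 3.5], ["NEGATIVE", "POSITIVE"])
print(prediction)
```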

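Zero-shot text classification

Zero-shot classification is an NLP task in which the input text is assigned to one of a set of candidate labels that is only supplied at inference time; the model has not been trained on those labels.

The task identifier for zero-shot text classification is zero-shot-classification.

First, instantiate the pipeline and pass in the model to use, facebook/bart-large-mnli: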

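    from transformers import pipeline

    classify = pipeline(task="zero-shot-classification", model="facebook/bart-large-mnli")

When the chosen model already declares which task it handles, the task identifier can be omitted and the pipeline infers it from the model. Now use the classify instance to run inference: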

    output = classify(
        "I have a problem with my iphone that needs to be resolved asap!!",
        candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],
    )
    print(output)

Output:

{
    "sequence": "I have a problem with my iphone that needs to be resolved asap!!",
    "labels": ["urgent", "not urgent", "phone", "tablet", "computer"],
    "scores": [0.5036360025405884, 0.4787988066673279, 0.012600637972354889, 0.02655780641362071, 0.023087705485522747]
}
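The labels and scores come back as two parallel lists, so a small amount of post-processing is usually needed before acting on the result. The sketch below pairs them up, drops low-scoring labels, and sorts the rest. The result dictionary is copied from the output above; `rank_labels` and its `threshold` parameter are hypothetical names introduced for this example, not part of the library.

```python
# The result dictionary shown above, copied verbatim.
output = {
    "sequence": "I have a problem with my iphone that needs to be resolved asap!!",
    "labels": ["urgent", "not urgent", "phone", "tablet", "computer"],
    "scores": [
        0.5036360025405884,
        0.4787988066673279,
        0.012600637972354889,
        0.02655780641362071,
        0.023087705485522747,
    ],
}


def rank_labels(result, threshold=0.0):
    """Pair labels with scores, drop pairs below the threshold, and sort best first."""
    pairs = zip(result["labels"], result["scores"])
    kept = [(label, score) for label, score in pairs if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


ranked = rank_labels(output, threshold=0.1)
for label, score in ranked:
    print(f"{label}: {score:.4f}")
```

With a threshold of 0.1 only the two urgency labels survive, which shows how close the model's decision between them is.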

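`token` classification

In any NLP task, text is preprocessed by splitting the sequence into individual words or subwords, which are called tokens. Token classification assigns each token a label from a predefined set of classes.

Two common types of token classification are:

- Named entity recognition (NER): label a token according to an entity category such as organization, person, location, or date. NER is especially popular in biomedical settings, where it can label genes, proteins, and drug names.
- Part-of-speech tagging (POS): label a token according to its part of speech, such as noun, verb, or adjective. POS is useful for helping translation systems understand how two identical words differ grammatically (bank as a noun versus bank as a verb).

The example below uses named entity recognition, selected by passing ner as the task; ner is an alias of the token-classification task identifier.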

    from transformers import pipeline

    classifier = pipeline(task="ner")
    preds = classifier("Hugging Face is a French company based in New York City.")
    # Round the score and keep the fields of interest for each tagged token.
    result = [
        {
            "entity": pred["entity"],
            "score": round(pred["score"], 4),
            "index": pred["index"],
            "word": pred["word"],
            "start": pred["start"],
            "end": pred["end"],
        }
        for pred in preds
    ]
    print(*result, sep="\n")

Result:

{'entity': 'I-ORG', 'score': 0.9968, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9293, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9763, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-MISC', 'score': 0.9983, 'index': 6, 'word': 'French', 'start': 18, 'end': 24}
{'entity': 'I-LOC', 'score': 0.999, 'index': 10, 'word': 'New', 'start': 42, 'end': 45}
{'entity': 'I-LOC', 'score': 0.9987, 'index': 11, 'word': 'York', 'start': 46, 'end': 50}
{'entity': 'I-LOC', 'score': 0.9992, 'index': 12, 'word': 'City', 'start': 51, 'end': 55}
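The result is per token, so "Hugging Face" arrives as three separate pieces (Hu, ##gging, Face). The sketch below merges neighboring tokens of the same entity type back into whole entities by using the start and end character offsets. The pipeline has its own `aggregation_strategy` option for this purpose; the code here is only a simplified stand-in, and `group_entities` and its one-character gap rule are assumptions made for illustration.

```python
text = "Hugging Face is a French company based in New York City."

# Entity label and character offsets taken from the result above.
tokens = [
    {"entity": "I-ORG", "start": 0, "end": 2},
    {"entity": "I-ORG", "start": 2, "end": 7},
    {"entity": "I-ORG", "start": 8, "end": 12},
    {"entity": "I-MISC", "start": 18, "end": 24},
    {"entity": "I-LOC", "start": 42, "end": 45},
    {"entity": "I-LOC", "start": 46, "end": 50},
    {"entity": "I-LOC", "start": 51, "end": 55},
]


def group_entities(text, tokens):
    """Merge adjacent tokens that share an entity type into a single span."""
    groups = []
    for tok in tokens:
        label = tok["entity"].split("-")[-1]  # "I-ORG" -> "ORG"
        last = groups[-1] if groups else None
        # Same type and at most one character (a space) apart: extend the current span.
        if last and last["label"] == label and tok["start"] - last["end"] <= 1:
            last["end"] = tok["end"]
        else:
            groups.append({"label": label, "start": tok["start"], "end": tok["end"]})
    for group in groups:
        group["text"] = text[group["start"]:group["end"]]
    return groups


entities = group_entities(text, tokens)
for entity in entities:
    print(entity["label"], entity["text"])
```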

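Question answering

Question answering is another token-level task. It returns an answer to a question, sometimes with context (open-domain) and sometimes without context (closed-domain). This is what happens whenever we ask a virtual assistant something, such as whether a restaurant is open. It can also provide customer or technical support and help search engines retrieve the relevant information you asked for.

There are two common ways to produce an answer:
- Extractive: given a question and some context, the model extracts a span of text from the context as the answer.
- Abstractive: given a question and some context, the answer is generated from the question and context. This approach is handled by `Text2TextGenerationPipeline` rather than the `QuestionAnsweringPipeline` shown below.

The task identifier for question answering is question-answering.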

    from transformers import pipeline

    question_answerer = pipeline(task="question-answering")
    preds = question_answerer(
        question="What is the name of the repository?",
        context="The name of the repository is huggingface/transformers",
    )
    # start and end are character offsets of the answer within the context.
    print(
        f"score: {round(preds['score'], 4)}, start: {preds['start']}, end: {preds['end']}, answer: {preds['answer']}"
    )

Result:

score: 0.9327, start: 30, end: 54, answer: huggingface/transformers
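Because the answer is extracted rather than generated, start and end are character offsets into the context string. The short sketch below checks this against the result above and uses the offsets to mark the answer inside the context; `highlight` is a hypothetical helper written for this example.

```python
context = "The name of the repository is huggingface/transformers"
# Values copied from the result above.
preds = {"score": 0.9327, "start": 30, "end": 54, "answer": "huggingface/transformers"}

# Slicing the context with the offsets recovers the answer text.
span = context[preds["start"]:preds["end"]]
print(span == preds["answer"])


def highlight(context, start, end):
    """Wrap the answer span in square brackets so it stands out in the context."""
    return f"{context[:start]}[{context[start:end]}]{context[end:]}"


marked = highlight(context, preds["start"], preds["end"])
print(marked)
```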

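Table question answering

Table question answering answers a question using the information in a given table.

Typical applications:

- Automated customer service systems
- Intelligent search engines
- Data visualization tools
- Enterprise knowledge graph construction
- Automatic extraction from scientific literature

The task identifier for table question answering is table-question-answering.

First, instantiate the pipeline, specifying the table-question-answering task and the model google/tapas-base-finetuned-wtq: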

    from transformers import pipeline

    qa = pipeline(task="table-question-answering", model="google/tapas-base-finetuned-wtq")

Then define a table of data as the input to analyze:

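    # Each key is a column name; every cell is a string, including the numeric ones.
    table = {
        "Repository": ["Transformers", "Datasets", "Tokenizers"],
        "Stars": ["36542", "4512", "3934"],
        "Contributors": ["651", "77", "34"],
        "Programming language": ["Python", "Python", "Rust, Python and NodeJS"],
    }

Finally, use the qa instance to run the query and print the output: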

    output = qa(query="How many stars does the transformers repository have?", table=table)
    print(output)

Result:

{"answer": "AVERAGE > 36542", "coordinates": [(0, 1)], "cells": ["36542"], "aggregator": "AVERAGE"}
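The answer is built from two parts: the table cells the model selected and an aggregation applied to them. The sketch below reproduces that last step by hand, assuming each coordinate is a (row, column) pair and that the aggregator is one of NONE, SUM, AVERAGE, or COUNT. `cells_at` and `aggregate` are hypothetical helpers for illustration, not library functions.

```python
table = {
    "Repository": ["Transformers", "Datasets", "Tokenizers"],
    "Stars": ["36542", "4512", "3934"],
    "Contributors": ["651", "77", "34"],
    "Programming language": ["Python", "Python", "Rust, Python and NodeJS"],
}


def cells_at(table, coordinates):
    """Look up cells by (row, column) index; dict keys keep their insertion order."""
    columns = list(table)
    return [table[columns[col]][row] for row, col in coordinates]


def aggregate(cells, aggregator):
    """Apply the named aggregation to the selected cells (assumed aggregator names)."""
    if aggregator == "COUNT":
        return len(cells)
    if aggregator in ("SUM", "AVERAGE"):
        values = [float(cell) for cell in cells]
        total = sum(values)
        return total if aggregator == "SUM" else total / len(values)
    return cells  # "NONE": return the selected cells unchanged


selected = cells_at(table, [(0, 1)])
print(selected)
print(aggregate(selected, "AVERAGE"))
```

Averaging a single cell returns that cell's value, which is why the answer above is the star count of the first row.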

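Summarization

Summarization creates a shorter version of a longer text while preserving as much of the original document's meaning as possible. It is a sequence-to-sequence task: it outputs a text sequence that is shorter than the input. Many long-form documents can be summarized to help readers quickly grasp the main points. Bills, legal and financial documents, patents, and scientific papers are examples of documents that can be summarized to save readers time and serve as a reading aid.

Like question answering, summarization comes in two types:

- Extractive: identify and extract the most important sentences from the original text.
- Abstractive: generate the target summary from the original text, possibly including new words that are not in the input document; `SummarizationPipeline` uses the abstractive approach.

The task identifier for summarization is summarization.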

    from transformers import pipeline

    summarizer = pipeline(task="summarization")
    # Adjacent string literals are joined into one input text.
    text = (
        "In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, "
        "replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. "
        "For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. "
        "On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. "
        "In the former task our best model outperforms even all previously reported ensembles."
    )
    print(summarizer(text))

Result:

[{'summary_text': ' The Transformer is the first sequence transduction model based entirely on attention . It replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention . For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers .'}]
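The pipeline above is abstractive. For contrast, here is a toy sketch of the extractive approach: score each sentence by how frequent its words are in the whole text and keep the top-scoring sentences in their original order. This is a deliberately crude heuristic with no stop-word handling, chosen only to show what "extracting sentences" means; `extractive_summary` is a hypothetical function, not something Transformers provides.

```python
import re
from collections import Counter


def extractive_summary(text, max_sentences=1):
    """Return the highest-scoring sentences of text, kept in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


doc = "Transformers use attention. Attention lets models weigh tokens. Bread is tasty."
summary = extractive_summary(doc, max_sentences=1)
print(summary)
```

Every sentence in the output appears verbatim in the input, which is the defining difference from the abstractive result shown above.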

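Translation

Translation converts a sequence of text from one language into another. It helps people from different backgrounds communicate with each other, helps content reach a wider audience, and can even serve as a learning tool for picking up a new language. Like summarization, translation is a sequence-to-sequence task: the model receives an input sequence and returns a target output sequence.

The task identifier for translation is translation_xx_to_yy or translation, where xx and yy stand for the source and target language codes (for example, translation_en_to_fr).

In the early days, translation models mostly handled a single language pair, but more recently there has been growing interest in multilingual models that can translate between many languages.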

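    from transformers import pipeline

    # The input text carries the T5 task prefix "translate English to French: ".
    text = "translate English to French: Hugging Face is a community-based open-source platform for machine learning."
    # The xx and yy placeholders are replaced with real language codes: English to French.
    translator = pipeline(task="translation_en_to_fr", model="google-t5/t5-small")
    print(translator(text))

Result: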

[{'translation_text': "Hugging Face est une tribune communautaire de l'apprentissage des machines."}]

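Text generation

Text generation is the task of predicting words in a sequence of text. It has become a very popular NLP task because a pretrained language model can be fine-tuned for many other downstream tasks. Lately there has been a lot of interest in large language models (LLMs), which demonstrate zero-shot or few-shot learning. This means the model can solve tasks it was never explicitly trained to do. Language models can be used to generate fluent and convincing text, but care is needed because the text may not always be accurate.

There are two types of language modeling:

Causal: the model's objective is to predict the next token in a sequence, and future tokens are masked. The task identifier for this mode is text-generation.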

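    from transformers import pipeline

    prompt = "Hugging Face is a community-based open-source platform for machine learning."
    generator = pipeline(task="text-generation")
    # The model continues the prompt; print the generated continuation.
    print(generator(prompt))

Masked: the model's objective is to predict a masked token in a sequence, with full access to all the other tokens in the sequence. The task identifier for this mode is fill-mask.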

    from transformers import pipeline

    text = "Hugging Face is a community-based open-source <mask> for machine learning."
    fill_mask = pipeline(task="fill-mask")
    # top_k=1 returns only the single most likely replacement for <mask>.
    preds = fill_mask(text, top_k=1)
    preds = [
        {
            "score": round(pred["score"], 4),
            "token": pred["token"],
            "token_str": pred["token_str"],
            "sequence": pred["sequence"],
        }
        for pred in preds
    ]
    print(preds)

Result:

[{'score': 0.224, 'token': 3944, 'token_str': ' tool', 'sequence': 'Hugging Face is a community-based open-source tool for machine learning.'}]
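The difference between the two modes comes down to which positions each token is allowed to see. The sketch below builds the two visibility patterns as plain nested lists, where 1 means "position i may look at position j". It is a simplified illustration of the idea; the function names are assumptions, and real models apply such masks inside their attention layers.

```python
def causal_mask(n):
    """Causal: position i sees positions 0..i only, so future tokens are hidden."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def full_mask(n):
    """Masked language modeling: every position sees every other position."""
    return [[1] * n for _ in range(n)]


tokens = ["Hugging", "Face", "is", "open-source"]
for name, mask in (("causal", causal_mask(len(tokens))), ("masked", full_mask(len(tokens)))):
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:<12} {row}")
```

In the causal pattern the first token sees only itself, while in the masked pattern context from both the left and the right is available when predicting the hidden token.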

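Text-to-text generation

Text-to-text generation and text generation are both subfields of NLP, but they differ in focus and use cases. Text generation mainly refers to producing new text automatically, for example news reports, product descriptions, or chatbot dialogue; it typically relies on a deep learning language model trained to produce new text from an input condition or prompt. Text-to-text generation mainly refers to transforming one piece of text into another, for example machine translation, summarization, or style transfer; it typically relies on sequence-to-sequence or Transformer models that produce new text from the input text. In short, text generation focuses on creating text, whereas text-to-text generation focuses on converting one text into another.

The task identifier for text-to-text generation is text2text-generation.

First, instantiate the pipeline, specifying the text2text-generation task and the google/flan-t5-small model: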

    from transformers import pipeline

    generator = pipeline(task="text2text-generation", model="google/flan-t5-small")

Then run inference and print the result:

    # The instruction ("Translate to German:") is part of the input text itself.
    output = generator("Translate to German: My name is Arthur")
    print(output)

Result:

[{"generated_text": "Meinen Name ist Arthur."}]
