NLTK Installation and Usage

2023-06-12 21:19 | Source: compiled from the web

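NLTK Installation and Usage: Printing the Part of Speech of Text

Contents: 1. Installation; 2. Examples; 3. POS tags and their meanings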

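NLTK stands for "Natural Language Toolkit". It is a Python library for natural language processing (NLP) that provides a wide range of tools and resources for processing, analyzing, manipulating, and understanding human language data.

NLTK is an open-source project intended to support NLP research and development. Its features include text processing, part-of-speech (POS) tagging, tokenization, syntactic parsing, semantic analysis, corpus management, word vectors, and machine learning. It also ships a large collection of corpora, lexicons, and language datasets that can be used to train and evaluate NLP models.

With NLTK, developers can process and analyze text data and build NLP applications such as text classification, information extraction, machine translation, and question answering. It is also widely used in academia and education for teaching and researching NLP.

In short, NLTK is a capable and approachable Python library for working with human language data.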

1. Installation

Recommended tutorial 1: NLTK library installation tutorial (detailed version) (https://blog.csdn.net/weixin_51327281/article/details/127700781)

Recommended tutorial 2: How to install NLTK (https://blog.csdn.net/weixin_47822556/article/details/114434233)

In short:

pip install nltk

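Then run the following in a script to download the tagger model:

import nltk
nltk.download('averaged_perceptron_tagger')

Depending on the NLTK version, the tagger may ask for a differently named resource (for example 'averaged_perceptron_tagger_eng' in recent releases). The LookupError raised by pos_tag names the package to download.

2. Examples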

demo1: Extract all nouns

To tag the part of speech of every word in a list and keep only the nouns, you can use a natural language processing tool such as NLTK (Natural Language Toolkit), which provides POS tagging.

The following example uses NLTK to tag the words and filter out the nouns:

# Nouns
import nltk
from nltk.tag import pos_tag

nltk.download('averaged_perceptron_tagger')

def filter_nouns(word_list):
    # Tag every word in the list with its part of speech
    tagged_words = pos_tag(word_list)
    # Keep the words whose tag starts with 'N' (the noun tags)
    nouns = [word for word, pos in tagged_words if pos.startswith('N')]
    return nouns

word_list = ['As', 'a', 'history', 'of', 'Custer', ',', 'this', "insn't",
             'even', 'close', '(', 'Custer', 'dies', 'to', 'help', 'the',
             'indians', '?', 'I']
nouns = filter_nouns(word_list)
print(nouns)

Output:

['history', 'Custer', "insn't", 'Custer', 'indians']

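POS tagging is not perfect and sometimes assigns the wrong tag. Here, for example, "insn't" (a misspelling of "isn't" in the input) is tagged as a noun. You may therefore need to verify and post-process the output to get an accurate list of nouns.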
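One possible verification pass can be sketched as follows. It works on the noun list printed by demo1 and needs no NLTK calls. The helper names `looks_like_word` and `clean_nouns` are hypothetical, and the "letters only" rule is an assumption chosen for illustration; real data may call for a different rule.

```python
# Hypothetical post-processing helpers; the "letters only" rule is an
# assumption made for this sketch, not an NLTK feature.

def looks_like_word(token):
    """Return True if the token consists only of letters (hyphens allowed)."""
    return token.replace('-', '').isalpha()

def clean_nouns(nouns):
    """Drop suspicious tokens and duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for noun in nouns:
        if looks_like_word(noun) and noun not in seen:
            seen.add(noun)
            result.append(noun)
    return result

# The noun list printed by demo1
nouns = ['history', 'Custer', "insn't", 'Custer', 'indians']
cleaned = clean_nouns(nouns)
print(cleaned)  # ['history', 'Custer', 'indians']
```

A rule this simple also discards legitimate tokens that contain apostrophes or digits, so treat it as a starting point rather than a general filter.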

demo2: Get the part of speech of every word

# Part of speech of every word
import nltk
from nltk.tag import pos_tag

nltk.download('averaged_perceptron_tagger')

def tag_pos(words):
    # Return a list of (word, tag) tuples
    tagged_words = pos_tag(words)
    return tagged_words

word_list = ['As', 'a', 'history', 'of', 'Custer', ',', 'this', "insn't",
             'even', 'close', '(', 'Custer', 'dies', 'to', 'help', 'the',
             'indians', '?', 'I']
tagged_words = tag_pos(word_list)
print(tagged_words)

Output:

[('As', 'IN'), ('a', 'DT'), ('history', 'NN'), ('of', 'IN'), ('Custer', 'NNP'), (',', ','), ('this', 'DT'), ("insn't", 'NN'), ('even', 'RB'), ('close', 'RB'), ('(', '('), ('Custer', 'NNP'), ('dies', 'VBZ'), ('to', 'TO'), ('help', 'VB'), ('the', 'DT'), ('indians', 'NNS'), ('?', '.'), ('I', 'PRP')]

In NLTK, the tags returned by pos_tag follow the Penn Treebank tag set. In this tag set, every noun tag starts with an uppercase "N".
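To see how the tags are distributed, the tagged output can be summarized with the standard library alone. The sketch below copies the demo2 output as plain data, so it runs without NLTK; the variable names `tag_counts` and `nouns_by_tag` are chosen here for illustration.

```python
from collections import Counter, defaultdict

# The (word, tag) pairs printed by demo2, copied as plain data
tagged_words = [
    ('As', 'IN'), ('a', 'DT'), ('history', 'NN'), ('of', 'IN'),
    ('Custer', 'NNP'), (',', ','), ('this', 'DT'), ("insn't", 'NN'),
    ('even', 'RB'), ('close', 'RB'), ('(', '('), ('Custer', 'NNP'),
    ('dies', 'VBZ'), ('to', 'TO'), ('help', 'VB'), ('the', 'DT'),
    ('indians', 'NNS'), ('?', '.'), ('I', 'PRP'),
]

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in tagged_words)

# Group the words whose tag starts with 'N' by their exact noun tag
nouns_by_tag = defaultdict(list)
for word, tag in tagged_words:
    if tag.startswith('N'):
        nouns_by_tag[tag].append(word)

print(tag_counts.most_common(3))
print(dict(nouns_by_tag))
```

Grouping by the exact tag keeps common nouns (NN, NNS) apart from proper nouns (NNP), which the single `startswith('N')` test lumps together.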

demo3: Extract proper nouns

To pick out proper nouns, check whether the tag returned by pos_tag is "NNP" or "NNPS".

# Proper nouns
import nltk
from nltk.tag import pos_tag

nltk.download('averaged_perceptron_tagger')

def filter_proper_nouns(words):
    tagged_words = pos_tag(words)
    # Keep the words tagged as singular or plural proper nouns
    proper_nouns = [word for word, pos in tagged_words
                    if pos.startswith('NNP') or pos.startswith('NNPS')]
    return proper_nouns

word_list = ['As', 'a', 'history', 'of', 'Custer', ',', 'this', "isn't",
             'even', 'close', '(', 'Custer', 'dies', 'to', 'help', 'the',
             'indians', '?', 'I']
proper_nouns = filter_proper_nouns(word_list)
print(proper_nouns)

Output:

['Custer', 'Custer']

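In the code above, pos.startswith('NNP') checks whether the tag starts with "NNP", the tag for singular proper nouns such as names of people and places. Likewise, pos.startswith('NNPS') checks for "NNPS", the tag for plural proper nouns. Because "NNPS" itself starts with "NNP", the first check already matches both tags, so the second is redundant but harmless. Together, the two conditions select the proper nouns in the list.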
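An alternative is to match the two tags exactly with a set-membership test. The sketch below is one way to write it, not the original author's method. It operates on already-tagged pairs so it runs without NLTK; `PROPER_NOUN_TAGS` and `filter_proper_nouns_tagged` are hypothetical names, and the ('Alps', 'NNPS') pair is hand-written for illustration rather than produced by the tagger.

```python
# Exact tags for singular and plural proper nouns in the Penn Treebank set
PROPER_NOUN_TAGS = {'NNP', 'NNPS'}

def filter_proper_nouns_tagged(tagged_words):
    """Return the words whose tag is exactly NNP or NNPS."""
    return [word for word, tag in tagged_words if tag in PROPER_NOUN_TAGS]

# Hand-written sample; ('Alps', 'NNPS') is an illustrative assumption,
# not real tagger output.
sample = [
    ('Custer', 'NNP'), ('dies', 'VBZ'), ('the', 'DT'),
    ('indians', 'NNS'), ('Alps', 'NNPS'),
]

print(filter_proper_nouns_tagged(sample))  # ['Custer', 'Alps']
print('NNPS'.startswith('NNP'))            # True
```

The exact match states the intent directly and would not pick up any other tag that happened to begin with "NNP".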

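3. POS tags and their meanings

All three demos above obtain their tags by calling pos_tag(words).

pos_tag(words) returns a list of (word, tag) tuples, one for each input word.

NLTK's default English tagger uses the Penn Treebank tag set. Some common tags and their meanings: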

CC: Coordinating conjunction
CD: Cardinal number
DT: Determiner
EX: Existential there
FW: Foreign word
IN: Preposition or subordinating conjunction
JJ: Adjective
JJR: Adjective, comparative
JJS: Adjective, superlative
LS: List item marker
MD: Modal
NN: Noun, singular or mass
NNS: Noun, plural
NNP: Proper noun, singular
NNPS: Proper noun, plural
PDT: Predeterminer
POS: Possessive ending
PRP: Personal pronoun
PRP$: Possessive pronoun
RB: Adverb
RBR: Adverb, comparative
RBS: Adverb, superlative
RP: Particle
SYM: Symbol
TO: to
UH: Interjection
VB: Verb, base form
VBD: Verb, past tense
VBG: Verb, gerund or present participle
VBN: Verb, past participle
VBP: Verb, non-3rd person singular present
VBZ: Verb, 3rd person singular present
WDT: Wh-determiner
WP: Wh-pronoun
WP$: Possessive wh-pronoun
WRB: Wh-adverb

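These tags tell you the part of speech of each word. Punctuation receives tags of its own: in the demo2 output, ',' is tagged ',' and '?' is tagged '.'. Tag sets differ between corpora and tasks, so the labels you see may vary with the tagger or corpus you use.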
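The tag table above can also be kept as a small lookup so that tagged output prints with readable descriptions. This is a minimal sketch: `TAG_MEANINGS` holds only a subset of the table, and `describe_tag` is a hypothetical helper, not part of NLTK.

```python
# A subset of the tag table above; extend it with the remaining tags as needed
TAG_MEANINGS = {
    'DT': 'Determiner',
    'IN': 'Preposition or subordinating conjunction',
    'NN': 'Noun, singular or mass',
    'NNS': 'Noun, plural',
    'NNP': 'Proper noun, singular',
    'PRP': 'Personal pronoun',
    'RB': 'Adverb',
    'TO': 'to',
    'VB': 'Verb, base form',
    'VBZ': 'Verb, 3rd person singular present',
}

def describe_tag(tag):
    """Return the description for a tag, or a fallback for unlisted tags."""
    return TAG_MEANINGS.get(tag, 'not in this lookup')

# A few (word, tag) pairs taken from the demo2 output
for word, tag in [('Custer', 'NNP'), ('dies', 'VBZ'), ('?', '.')]:
    print(f'{word}\t{tag}\t{describe_tag(tag)}')
```

NLTK also provides nltk.help.upenn_tagset, which prints tag definitions once the corresponding tagset resource has been downloaded.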
