1. Install textrank4zh
pip install textrank4zh
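Since pip also installs dependencies alongside textrank4zh, it can be useful to record which versions actually ended up on the machine. The snippet below is a minimal sketch, assuming Python 3.8 or later (where importlib.metadata is in the standard library); the helper name installed_version is made up for illustration.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages that appear in this walkthrough; a missing one prints as None.
for name in ("textrank4zh", "networkx", "jieba"):
    print(name, installed_version(name))
```

If networkx shows up as a 3.x release, the first problem described under "Common problems" is likely to occur.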
2. Test
from textrank4zh import TextRank4Keyword, TextRank4Sentence
text = "过去两天,国内生成式人工智能服务领域热闹极了:阿里云推出“通义千问”大模型;商汤科技“日日新”、昆仑万维“天工”大模型、有赞“加我智能”在同一天发布;360基于大模型开发的人工智能产品矩阵“360智脑”率先落地搜索场景……再加上百度已发布的“文心一言”,国内互联网巨头们在3月许下的诺言正在一一兑现。"
tr4w = TextRank4Keyword()
tr4w.analyze(text=text, lower=True, window=2, vertex_source="all_filters")
print('关键词:')  # "Keywords:"
for item in tr4w.get_keywords(20, word_min_len=2):
    print(item.word, item.weight)
print()
print('关键短语:')  # "Key phrases:"
for phrase in tr4w.get_keyphrases(keywords_num=20, min_occur_num=2):
    print(phrase)
tr4s = TextRank4Sentence()
tr4s.analyze(text=text, lower=True, source='all_filters')
print()
print('摘要:')  # "Summary:"
for item in tr4s.get_key_sentences(num=3):
    # index is the sentence's position in the text; weight is its score
    print(item.index, item.weight, item.sentence)
Finally, the output is as follows:
C:\workspace>python create_tag.py
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\osystem\AppData\Local\Temp\jieba.cache
Loading model cost 0.837 seconds.
Prefix dict has been built successfully.
关键词:
人工智能 0.06974899614411177
推出 0.050498553906766885
科技 0.050498553906766885
搜索 0.050498553906766885
服务 0.0382373567280689
产品 0.0382373567280689
模型 0.036220973395951234
开发 0.03540654295435084
发布 0.03460207612456747
加上 0.03460207612456747
百度 0.03460207612456747
文心 0.03460207612456747
互联网 0.03460207612456747
巨头 0.03460207612456747
诺言 0.03460207612456747
正在 0.03460207612456747
通义 0.026653837233467786
商汤 0.026653837233467786
日日 0.026653837233467786
落地 0.026653837233467786
关键短语:
摘要:
1 0.2610907441134144 商汤科技“日日新”、昆仑万维“天工”大模型、有赞“加我智能”在同一天发布
0 0.25544560910201897 过去两天,国内生成式人工智能服务领域热闹极了:阿里云推出“通义千问”大模型
2 0.2480067018897252 360基于大模型开发的人工智能产品矩阵“360智脑”率先落地搜索场景
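To give a feel for where keyword weights like the ones above come from, here is a rough pure-Python sketch of the TextRank idea: link words that appear within a sliding window, then run PageRank over that graph. It is not textrank4zh's own code (the library delegates to networkx, as the tracebacks further down show). The function names and the tiny pre-segmented word list are assumptions made for illustration.

```python
from collections import defaultdict


def cooccurrence_graph(words, window=2):
    """Link distinct words that fall inside the same window of `window` words."""
    graph = defaultdict(set)
    for i, word in enumerate(words):
        # window=2 links only adjacent words; larger windows reach further.
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != word:
                graph[word].add(words[j])
                graph[words[j]].add(word)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on an undirected graph."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Each neighbour shares its rank equally among its own neighbours.
            incoming = sum(rank[m] / len(graph[m]) for m in graph[n])
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# Hypothetical, already-segmented input (the real library segments with jieba).
words = ["大", "模型", "发布", "人工智能", "模型", "开发", "人工智能", "产品"]
weights = pagerank(cooccurrence_graph(words, window=2))
for word in sorted(weights, key=weights.get, reverse=True):
    print(word, weights[word])
```

Words that co-occur with many different neighbours end up with the largest weights, which matches the intuition behind the ranking printed above.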
Common problems:
1. AttributeError: module 'networkx' has no attribute 'from_numpy_matrix'
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\osystem\AppData\Local\Temp\jieba.cache
Loading model cost 0.805 seconds.
Prefix dict has been built successfully.
Traceback (most recent call last):
File "create_tag.py", line 14, in <module>
tr4w.analyze(text=text, lower=True, window=2, vertex_source="all_filters") # in py2, text must be a utf8-encoded str or a unicode object; in py3, it must be utf8-encoded bytes or a str object
File "C:\Users\osystem\AppData\Local\Programs\Python\Python38\lib\site-packages\textrank4zh\TextRank4Keyword.py", line 93, in analyze
self.keywords = util.sort_words(_vertex_source, _edge_source, window = window, pagerank_config = pagerank_config)
File "C:\Users\osystem\AppData\Local\Programs\Python\Python38\lib\site-packages\textrank4zh\util.py", line 160, in sort_words
nx_graph = nx.from_numpy_matrix(graph)
AttributeError: module 'networkx' has no attribute 'from_numpy_matrix'
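When textrank4zh was installed in step 1, pip automatically pulled in networkx 3.1, but the version that works with textrank4zh is 1.9.1 (from_numpy_matrix no longer exists in networkx 3.x). The fix is to roll networkx back: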
C:\workspace\>pip3 install networkx==1.9.1
Collecting networkx==1.9.1
Downloading networkx-1.9.1-py2.py3-none-any.whl (1.2 MB)
---------------------------------------- 1.2/1.2 MB 2.2 MB/s eta 0:00:00
Collecting decorator>=3.4.0 (from networkx==1.9.1)
Downloading decorator-5.1.1-py3-none-any.whl (9.1 kB)
Installing collected packages: decorator, networkx
Attempting uninstall: networkx
Found existing installation: networkx 3.1
Uninstalling networkx-3.1:
Successfully uninstalled networkx-3.1
Successfully installed decorator-5.1.1 networkx-1.9.1
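For readers who would rather keep a recent networkx, one possible alternative to the downgrade is to alias the missing function at runtime. The sketch below rests on an assumption that the steps in this post did not verify: that networkx's from_numpy_array (available since networkx 2.0) is an adequate stand-in for how textrank4zh uses from_numpy_matrix. The helper name alias_missing_attr is made up for illustration.

```python
def alias_missing_attr(module, missing, replacement):
    """Bind module.<missing> to module.<replacement> if <missing> is absent.

    Returns True when an alias was added, False when nothing needed changing.
    """
    if hasattr(module, missing):
        return False
    if not hasattr(module, replacement):
        raise AttributeError(
            "%r has neither %r nor %r" % (module, missing, replacement)
        )
    setattr(module, missing, getattr(module, replacement))
    return True


try:
    import networkx as nx
except ImportError:
    nx = None  # networkx missing or not importable on this interpreter

if nx is not None:
    # Assumption: from_numpy_array can replace from_numpy_matrix here.
    alias_missing_attr(nx, "from_numpy_matrix", "from_numpy_array")
```

This has to run before tr4w.analyze(...) is called, because the library looks up nx.from_numpy_matrix at call time. If it does not behave as hoped, the downgrade shown above remains the approach this post actually tested.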
2. ImportError: cannot import name 'escape' from 'cgi'
Traceback (most recent call last):
File "create_tag.py", line 9, in <module>
from textrank4zh import TextRank4Keyword, TextRank4Sentence
File "C:\Users\osystem\AppData\Local\Programs\Python\Python38\lib\site-packages\textrank4zh\__init__.py", line 3, in <module>
from .TextRank4Keyword import TextRank4Keyword
File "C:\Users\osystem\AppData\Local\Programs\Python\Python38\lib\site-packages\textrank4zh\TextRank4Keyword.py", line 10, in <module>
import networkx as nx
File "C:\Users\osystem\AppData\Local\Programs\Python\Python38\lib\site-packages\networkx\__init__.py", line 76, in <module>
import networkx.readwrite
File "C:\Users\osystem\AppData\Local\Programs\Python\Python38\lib\site-packages\networkx\readwrite\__init__.py", line 14, in <module>
from networkx.readwrite.gml import *
File "C:\Users\osystem\AppData\Local\Programs\Python\Python38\lib\site-packages\networkx\readwrite\gml.py", line 39, in <module>
from cgi import escape
ImportError: cannot import name 'escape' from 'cgi' (C:\Users\osystem\AppData\Local\Programs\Python\Python38\lib\cgi.py)
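After the rollback, running the script fails again with a new error: networkx 1.9.1 imports escape from cgi, which Python 3.8 no longer provides. To fix it, change the import in gml.py (its full path appears in the traceback above, on the line that mentions gml.py).
Replace "from cgi import escape" in that file with "from html import escape".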
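The same edit can be scripted instead of done by hand. The following is a minimal sketch, assuming gml.py is readable as UTF-8 and that a backup copy next to the original is acceptable; the function name patch_escape_import is made up for illustration.

```python
import shutil
from pathlib import Path

OLD_IMPORT = "from cgi import escape"
NEW_IMPORT = "from html import escape"


def patch_escape_import(path):
    """Swap the cgi import for the html one; return True if the file changed."""
    path = Path(path)
    source = path.read_text(encoding="utf-8")
    if OLD_IMPORT not in source:
        return False  # already patched, or a different networkx version
    # Keep a copy of the original next to it, e.g. gml.py.bak.
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    path.write_text(source.replace(OLD_IMPORT, NEW_IMPORT), encoding="utf-8")
    return True


# Example call, using the path from the traceback above (adjust to your machine):
# patch_escape_import(
#     r"C:\Users\osystem\AppData\Local\Programs\Python\Python38"
#     r"\lib\site-packages\networkx\readwrite\gml.py"
# )
```

One difference worth knowing: html.escape also escapes quote characters by default, whereas the old cgi.escape did not unless asked to.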