Ex_treme's blog.

Article Search Engine: Integrating the TF-IDF Algorithm

2018/04/03

Article Search Engine (Part 3)

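This chapter integrates the TF-IDF algorithm into SAS; we call the result the TF-IDF retrieval model. The user types keywords into the search box, separated by spaces, and the form is POSTed to the Django backend. Once the backend has read the keywords, it scores the matching articles with TF-IDF and returns the five articles with the highest TF-IDF values.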
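For reference, the score that the `tfidf` function later in this post computes can be written roughly as follows. The symbols are notation introduced here, not taken from the original code: $K$ is the set of keywords, $N$ the total number of articles, $n_k$ the number of articles matched for keyword $k$, and $|d|$ the number of tokens in article $d$.

```latex
\mathrm{tf}(k, d) = \frac{\mathrm{count}(k, d)}{|d|}, \qquad
\mathrm{idf}(k) = \ln\frac{N}{n_k + 1}, \qquad
\mathrm{score}(d) = \sum_{k \in K} \mathrm{tf}(k, d)\,\mathrm{idf}(k)
```

The $+1$ in the denominator is a smoothing term; the logarithm is the natural logarithm because the code uses `math.log` with a single argument.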

Integrating the TF-IDF Algorithm

HTML for the search form

<div class="inputArea">
    <form method="post" action="{% url 'index' %}">
        <input name="keyword" class="searchInput" placeholder="Enter search keywords">
        {% csrf_token %}<!-- CSRF token: protects the form against cross-site request forgery -->
    </form>
</div>

POST and GET handlers

from django import forms
from django.shortcuts import render
from django.views.generic import View

from utils.ArticleTools import initdata
from utils.tfidf import tfidf


# Validates the submitted form data
class ArticleForm(forms.Form):
    # At least one character is required; the field name must match the input's name attribute
    keyword = forms.CharField(required=True, min_length=1)


class ArticleView(View):
    def get(self, request):
        return render(request, 'index.html')

    def post(self, request):
        article_form = ArticleForm(request.POST)
        if article_form.is_valid():
            # initdata()  # populates ArticleModel from the data directory; left commented out here
            word = article_form.cleaned_data['keyword']
            # split() with no argument ignores repeated spaces, so no empty keywords are produced
            keylist = word.split()
            # reslist holds up to five (article id, TF-IDF score) pairs
            reslist = tfidf(keylist)
        return render(request, 'index.html')
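One detail worth checking in the keyword handling is how the search string is split. The sketch below compares `split(' ')` with `split()`; `parse_keywords` is a hypothetical helper introduced only for illustration and is not part of the original project.

```python
def parse_keywords(raw):
    """Split a raw search string into keywords (hypothetical helper)."""
    # str.split() with no argument splits on any run of whitespace
    # and never yields empty strings.
    return raw.split()


# Splitting on a single space keeps empty strings for repeated or trailing spaces.
print("django  tfidf ".split(" "))       # ['django', '', 'tfidf', '']
print(parse_keywords("django  tfidf "))  # ['django', 'tfidf']
```

This matters because an empty keyword passed to an `icontains` lookup would match every stored article, so filtering out empty strings before calling `tfidf` avoids needless work.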

Article generator

import hashlib
import json
import os

import jieba

from article.models import ArticleModel


def getStopword():
    # Load the stop-word file; stop words are separated by newlines
    stopwordPath = os.getcwd() + '/apps/utils/stop_words.txt'
    with open(stopwordPath, 'r') as f:
        stopwordlist = f.read().replace(' ', '').replace('\n', ' ').split(' ')
    # Also treat newlines and spaces as stop words
    stopwordlist.append('\n')
    stopwordlist.append(' ')
    return stopwordlist


def initdata(filedir=os.getcwd() + '/data'):
    stopwordlist = getStopword()
    for path in os.listdir(filedir):
        filepath = filedir + '/' + path
        # MD5 of the raw file contents
        with open(filepath, 'rb') as f:
            m = hashlib.md5()
            m.update(f.read())
            filemd5 = m.hexdigest()
        # Segment the text with jieba and drop stop words
        with open(filepath, 'r') as f:
            text = f.read()
        fencelist = []
        for token in jieba.cut(text):
            if token in stopwordlist:
                continue
            fencelist.append(token)

        # Skip files that have no tokens left after filtering
        fenceJson = json.dumps(fencelist, ensure_ascii=False)
        if fenceJson == '[]':
            continue
        # Store the file name, path, MD5 and token list (as JSON)
        article = ArticleModel()
        article.file_name = path
        article.file_path = filepath
        article.file_md5 = filemd5
        article.file_fence = fenceJson
        article.save()
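The two core steps of the generator, hashing the file contents and filtering stop words, can be tried without Django or jieba. The following is a minimal sketch using only the standard library; the helper names and the sample tokens are hypothetical, and the token list merely stands in for the output of `jieba.cut`.

```python
import hashlib


def md5_of_bytes(data):
    """Return the hex MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stop-word collection."""
    stopset = set(stopwords)  # set membership tests are faster than list scans
    return [t for t in tokens if t not in stopset]


# Hypothetical tokens, standing in for the output of jieba.cut(text)
tokens = ["搜索", "的", "引擎", " ", "算法"]
stopwords = ["的", " ", "\n"]

print(remove_stopwords(tokens, stopwords))  # ['搜索', '引擎', '算法']
print(md5_of_bytes(b"hello"))               # 5d41402abc4b2a76b9719d911017c592
```

Converting the stop-word list to a set is a design choice that only affects lookup speed; the filtered result is the same as with a list.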

TF-IDF algorithm:

import json
import math

from article.models import ArticleModel


def tfidf(keylist):
    # For each keyword, find the articles whose stored token list contains it
    reskeylist = []
    for key in keylist:
        articles = ArticleModel.objects.filter(file_fence__icontains=key)
        reskeylist.append(articles)

    # Start every matching article with a score of 0
    keyDict = dict()
    for articles in reskeylist:
        for article in articles:
            keyDict[int(article.id)] = 0

    # Total number of articles
    allFileNum = ArticleModel.objects.count()

    for key, articles in zip(keylist, reskeylist):
        # Number of matching articles, plus 1 as a smoothing term
        fileNum = len(articles) + 1
        idf = math.log(allFileNum / fileNum)

        for article in articles:
            fenjson = json.loads(article.file_fence)
            wordnum = len(fenjson)          # total tokens in the article
            keynum = fenjson.count(key)     # occurrences of the keyword
            tf = keynum / wordnum
            keyDict[int(article.id)] += tf * idf

    # Sort by score, highest first, and keep the top five
    reslist = sorted(keyDict.items(), key=lambda x: x[1], reverse=True)
    return reslist[0:5]
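To experiment with the scoring without a database, the same idea can be sketched over in-memory token lists. This is an illustration under stated assumptions rather than the project's code: `rank` and the sample `docs` are hypothetical, and a document counts as matching only when the keyword is an exact token, whereas the `icontains` lookup above matches substrings of the stored JSON.

```python
import math


def rank(keywords, docs, top_n=5):
    """Rank documents by their summed TF-IDF score over the keywords.

    docs maps a document id to its list of tokens.
    """
    total = len(docs)
    scores = {}
    for key in keywords:
        # Documents that contain the keyword as an exact token
        matching = [doc_id for doc_id, tokens in docs.items() if key in tokens]
        # Same smoothing as above: add 1 to the matching-document count
        idf = math.log(total / (len(matching) + 1))
        for doc_id in matching:
            tokens = docs[doc_id]
            tf = tokens.count(key) / len(tokens)
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ordered[:top_n]


# Hypothetical sample data
docs = {
    1: ["search", "engine", "search", "algorithm"],
    2: ["search", "article", "list", "page"],
    3: ["weather", "forecast"],
    4: ["article", "generator"],
}

print([doc_id for doc_id, _ in rank(["search"], docs)])  # [1, 2]
```

Note that with this smoothing a keyword that matches every document gets a negative IDF, because N / (N + 1) is less than 1; on a very small corpus that can push otherwise relevant articles down the ranking.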