静态站实现全站搜索

2022-08-25 • 技术/ 设计 • ----字 • ---

注：本文中的项目已更新重构，前往新的文章以查看更快速高效的v2.0版本

前言

全站搜索这一功能我想加入到我的博客中不是一年两年的事了。但因自己现在弃用Hexo转而自己做博客，这两年搜索这个功能就一直未能实现。
最近自己偶然有新想法，就给实现了。效果还不错，现在搭配Github Actions使用，可以实现新文章自动索引，实现了自动化。

效果

搜索示例

要想充分体验，还是自己去试试的好。转到Articles索引页

实现方式

数据生成

要想在静态页搜索，就要自己创建索引。这里使用python来创建一个JSON，存储全站文章信息：

                    # -*- coding: utf-8 -*-
                    ## 使用有问题请到github.com/ravelloh/RPageSearch提ISSUE
                    ### Author: RavelloH
                    #### LICENCE: MIT
                    ##### RPageSearch
                    import os
                    from bs4 import BeautifulSoup
                    
                    ## 设置目标
                    target = './articles/' # 目录位置
                    layers = 1 # 遍历层数
                    targettype = 'html' # 文件后缀名(只支持html)
                    
                    main_structure_head='{"articles":['
                    main_structure_end=']}'
                    inner_structure_1='{"title":"'
                    inner_structure_2='","path":"'
                    inner_structure_3='","time":"'
                    inner_structure_4='","text":"'
                    inner_structure_5='"}'
                    
                    
                    ## 打开目标目录
                    targetfile = []
                    for i in os.listdir(target):
                        if '.' not in i:
                            for k in os.listdir(target +i):
                                if targettype in k:
                                    targetfile.append(target + i + '/' + k)
                    
                    ## 按时间顺序排序
                    targetfilenum = []
                    for i in targetfile:
                        targetfilenum.append(i[11:19])
                    targetfilenum.sort(reverse=True)
                    targetfile=[]
                    for i in targetfilenum:
                        targetfile.append('./articles/'+str(i)+'/index.html')
                        
                    ## 解析重构目标文件
                    inner_structure_cache=[]
                    inner_structure_text=''
                    for i in targetfile:
                        inner_structure_text = ''    
                        with open(i,'r') as f:
                            filecontent = BeautifulSoup(f.read(),'html.parser')
                            textlist = filecontent.find_all(name='p')
                            title = filecontent.find_all(name='h2')
                            titlelen=len(title)
                        length = len(textlist)
                        for j in range(length):
                            inner_structure_text=inner_structure_text+textlist[j].get_text()
                        time = i[-19:-11]
                        time = time[0:4]+'-'+time[4:6]+'-'+time[6:8]
                        title = title[titlelen-1]
                        path = i[1:][:-10]
                        inner_structure_text=inner_structure_text.replace(' ','').replace('\n','').replace('"','&quot;').replace('\\','')
                        inner_structure_all = inner_structure_1 + str(title.get_text()) + inner_structure_2 + str(path) + inner_structure_3 + str(time) + inner_structure_4 + inner_structure_text + inner_structure_5
                        inner_structure_cache.append(inner_structure_all)
                    
                    ## 重构完整JSON
                    main_structure = main_structure_head
                    for i in inner_structure_cache:
                        main_structure = main_structure + i + ','
                    main_structure = main_structure[:-1] + main_structure_end
                    total_str = 'var SearchResult = "' + main_structure.replace('"','\\"') + '"'
                    print(total_str)
                    # 写入JSON#文件
                    with open('./js/searchdata.js','w+') as #1:
                        f1.write(total_str)

上述代码实现了将articles目录下所有文件夹中以.html后缀结尾的文件中的p标签中文字提取出来，并顺便提取h2的文章标题。
不过因为python中直接使用os.dirlist扫出的文件名是乱序，为方便后续排序还需要按照时间顺序排序，其中因为我的文章存储方式是以时间排序的，如这篇文章的存储结构就是/articles/20220825/inxex.html，因为时间可以直接从文件夹中读出，时间排序比较方便，7行就搞定了，如果是其余方式也同理。
上述代码运行后，得出的应该是类似于如下结构的json:

                    {
                        "articles":[{
                            "title":"文章标题","path":"相对路径","time":"更新时间","text":"所有正文"
                        }, {
                            "title":"文章标题","path":"相对路径","time":"更新时间","text":"所有正文"
                        }, {
                            "title":"文章标题","path":"相对路径","time":"更新时间","text":"所有正文"
                        }]
                    }

这样全站搜索的json就生成完成了，为了方便引用，上述代码最后会将这个json改为js格式，并转义"字符。
这样，就可以在后续处理搜索时直接引用js，其中json存储在变量SearchResult中。

搜索处理

有了json，搜索就只需在前端实现了。这样可以脱离服务器的限制，唯一限制速度的是访客的设备性能。
但因为这里只是简单的字符串搜索，性能需求并不大，下面我写的代码虽说并没有做到极限优化，但也通过多层次搜索降低了一部分运算量，可以做到实时搜索输入数据。
以下是HTML与JavaScript代码。当然，这跟我博客上的不一样，博客上还加入了一些css过渡之类的，不过本篇重点也不是css，如果有需求可自行到博客articles页F12看看。博客源代码在github，见此。

                    <div class='searchbox'></span>
                        <form class="searchbox" onSubmit="return check();" autocomplete="off">
                            <input type="text" placeholder="从所有文章内检索..." name="search" oninput="searchtext()" onpropertychange="searchtext()">
                            <button type="button" id='searchbutton'><span class="i_mini ri:search-line"></span></button>
                            <div class="resultlist" id="resultlist">
                                <i>- 搜索 -</i><hr><p align="center">
                                    输入关键词以在文章标题及正文中查询
                                </p>
                                <hr><a href="https://github.com/ravelloh/RPageSearch">Search powered by RavelloH's RPageSearch</a>
                            </div>
                        </form>
                    </div>

                    let input = document.querySelector("input[type='text']");
                    let result = document.getElementById('resultlist')
                    let button = document.getElementById('searchbutton')
                    
                    obj = JSON.parse(SearchResult);
                    function searchtext() {
                        result.innerHTML = input.value;
                        if (input.value == '') {
                            result.innerHTML = '<i>- 搜索 -</i><hr>'+'<p align="center">输入关键词以在文章标题及正文中查询</p><hr>'
                        }
                    
                        // 标题搜索
                        resultcount = 0;
                        resultstr = '';
                        var resulttitlecache = new Array()
                        for (i = 0; i < obj.articles.length; i++) {
                            if (obj.articles[i]['title'].includes(input.value) == true) {
                                resulttitlecache.unshift(obj.articles[i]['title'])
                                resultcount++;
                            }
                        }
                    
                        // 标题搜索结果展示
                        if (resultcount !== 0 && resultcount !== obj.articles.length) {
                            for (i = 0; i < resulttitlecache.length; i++) {
                                for (j = 0; j < obj.articles.length; j++) {
                                    if (obj.articles[j]['title'] == resulttitlecache[i]) {
                                        titlesearchresult = '<h4><a href="'+obj.articles[j]["path"]+'" class="resulttitle">'+obj.articles[j]['title'].replace(new RegExp(input.value, 'g'), '<mark>'+input.value+'</mark>')+'</a></h4><em>-标题匹配</em><p class="showbox">'+obj.articles[j]['text'].substring(0, 100)+'</p>'
                                        resultstr = titlesearchresult + '<hr>' + resultstr
                                    }
                                }
                                result.innerHTML = '<i>"'+input.value+'"</i><hr>'+resultstr;
                            }
                        }
                    
                        // 正文搜索
                        var resulttextcache = new Array()
                        for (i = 0; i < obj.articles.length; i++) {
                            if (obj.articles[i]['text'].includes(input.value) == true) {
                                resulttextcache.unshift(obj.articles[i]['text'])
                                resultcount++;
                            }
                        }
                    
                        // 正文搜索结果计数
                        var targetname = new Array()
                        var targetscore = new Array()
                        if (resulttextcache.length !== 0 && input.value !== '') {
                            for (i = 0; i < resulttextcache.length; i++) {
                                for (j = 0; j < obj.articles.length; j++) {
                                    if (obj.articles[j]['text'] == resulttextcache[i]) {
                                        targetname.unshift(obj.articles[j]['title'])
                                        targetscore.unshift(obj.articles[j]['text'].match(RegExp(input.value, 'gim')).length)
                                    }
                                }
                            }
                        }
                    
                        //排序相关选项
                        var targetscorecache = targetscore.concat([]);
                        var resultfortext = '';
                        var textsearchresult = ''
                        targetscorecache.sort(function(a, b) {
                            return b-a
                        })
                        for (i = 0; i < targetscorecache.length; i++) {
                            for (j = 0; j < targetscore.length; j++) {
                                if (targetscorecache[i] == targetscore[j]) {
                                    console.log('文章排序:'+targetname[j])
                                    for (k = 0; k < obj.articles.length; k++) {
                                        if (obj.articles[k]['title'] == targetname[j]) {
                                            // 确认选区
                                            textorder = obj.articles[k]['text'].indexOf(input.value) -15;
                                            while (textorder < 0) {
                                                textorder++
                                            }
                    
                                            resultfortext = '<h4><a href="'+obj.articles[k]["path"]+'" class="resulttitle">'+obj.articles[k]['title']+'</a></h4><em>-'+targetscorecache[i]+'个结果</em><p class="showbox">...'+obj.articles[k]['text'].substring(textorder, textorder+100).replace(new RegExp(input.value, 'g'), '<mark>'+input.value+'</mark>')+'</p>'
                                            textsearchresult = textsearchresult + '<hr>' + resultfortext;
                                        }
                                    }
                                }
                            }
                        }
                    
                        // 无效结果安排
                        if (resultcount !== obj.articles.length) {
                            if (resultcount == 0) {
                                result.innerHTML = '<i>"'+input.value+'"</i><hr><p align="center">没有找到结果</p>'
                            }
                        }
                        // 整合
                        result.innerHTML = result.innerHTML.substring(0, result.innerHTML.length-4)+textsearchresult.substring(0, textsearchresult.length-4)+'<hr><a href="https://github.com/ravelloh/RPageSearch" class="tr">Search powered by RavelloH\'s RPageSearch</a>'
                    }

当然，上述代码比复杂，接下来我会分步来说。

代码分析

前3行没什么好说的，找到对应的元素，方便后期处理。

第5行从生成的json内获取数据，之后从第六行开始是主函数，搭配html表单使用，效果是当输入时实时搜索。

7-10行判断输入框是否为空，若空替换为默认提示词。

12-21行遍历json中存储的标题，查找是否有相关字词，若有，为resultcount+1，并将完整标题加入到列表resulttitlecache中

23-34行用于展示搜索结果，条件是resultcount不为0（要能找得到结果才继续，节省运算量）且不为json中文章总数（若输入一些字符后全部删除，默认输入了""，所有文章都会有反馈。），里面是两个遍历组合，第一个遍历resulttitlecache中的有结果的标题，第二个遍历总数据，搜索到契合后即可确认总数据的路径、时间、正文、标题信息，然后整合存储到resultstr中。这样因为我们在python生成数据时就是按时间顺序排列的，在这里遍历也是时间最近的结果优先。同时，第一层遍历将搜索到的resultstr中的信息整合到html中展示结果的result元素(一个id为resultlist的div)中。这里也用js替换高亮了信息里面的关键字，会给匹配的字符添加到一个mark标签内（默认黄底白字，可以用css改，比如我的博客就改成了蓝底另外加了个圆角），另外也会截取正文前100字符用于展示，但是因为字符宽度不一显示起来不整齐，要解决可以用css限制行数，css我会放在文章下面。

36-43行与12-21行相似，将搜索匹配的结果存储在resulttextcache中，不在赘述。

45-57行也类似于23-34行，为两个嵌套的遍历，不同的是这个不写入，而是记录搜索结果中含关键词的总数，方便后续排序。具体做法是创建两个列表targetname和targetscore，targetname记录resulttextcache中的正文内容，targetscore记录对应含关键词的数量。这两个是一一对应的关系，比如targetname的第二项所含的关键词总数就等于targetscore第二项记录的数字。

59-84行花了大功夫排序，首先创建一个targetscore的备份targetscorecache，但是要深拷贝^{*关乎js数组知识，若只是简单的var a=b则a
b共享存储位置，换句话说就是a变b也变，对一维数组可以简单的使用a=b.concat([])来解决}，之后以数组sort方式排序，此时targetscorecache为降序排列，targetscore保持原顺序不变，之后套三层遍历~~（应该还能继续优化，但我懒得做了，又不是不能用，性能影响也不大）~~前两层嵌套和之前搜索匹配的23-34、45-57相同，不过这次匹配的是targetscorecache与targetscore，之后再从总数据到json种找到正文匹配的文章，获取到对应的path、title、time和text，然后是类似于23-34到写入过程，这里值得注意有三点，一是这次高亮的是text中的结果，2是这里也会用上作为排序依据的targetscore，在标题下方展示相应的结果数（见最上方用于展示的图片），三是这里生成的html代码不直接写入而是存储到变量textsearchresult中，方便最后整合的时候删掉多余的hr标签。

86-91给上面接底，如果啥也没找到就替换为相应的文本。

最后92-93用来整合、合并标题搜索与正文搜索的内容，顺便加上个powered by的后缀，这里你也可以替换为你自己的文字，或者干脆连前面的hr标签也一起删掉，但我还是希望能留下个注释标明出处。

上述就是js进行本地运算的过程，至于若想达到和我博客相同的效果，css是必不可少的，这里简单罗列一下css及相关的js。

                    <!-- CSS -->
                    form.searchbox input[type=text] {
                        color: #c6c9ce;
                        height: 30px;
                        padding: 5px;
                        font-size: 12px;
                        float: left;
                        width: 80%;
                        background: #000000;
                        border: 1px solid #1e1e1e;
                        border-radius: 10px 0px 0px 10px;
                    }
                    
                    form.searchbox button {
                        height: 30px;
                        float: left;
                        width: 20%;
                        padding: 5px;
                        background: #1e1e1e;
                        color: white;
                        font-size: 12px;
                        border: none;
                        cursor: pointer;
                        border: 1px solid #1e1e1e;
                        border-radius: 0px 10px 10px 0px;
                        text-align: center;
                        line-height: 10px;
                        margin: auto;
                    }
                    
                    form.searchbox button:hover {
                        background: #0b7dda;
                        transition: background 0.2s;
                    }
                    
                    form.searchbox::after {
                        content: "";
                        clear: both;
                        display: table;
                    }
                    .resultlist {
                        top: -20px;
                        height: 0;
                        width: 100%;
                        transition: height 0.4s;
                        background: #000000;
                        color: #c6c9ce;
                    }
                    .resultlist#active {
                        border-radius: 10px;
                        border: 1px solid #1e1e1e;
                        height: 40%;
                    }
                    .resultlist * {
                        margin: 2px;
                    }
                    .resulttitle {
                        color: #ffffff;
                        font-size: 1em;
                    }
                    
                    #info {
                        opacity: 1;
                        transition: opacity 0.4s;
                    }
                    #hidden {
                        opacity: 0;
                    }
                    
                    .fc {
                        text-align: center;
                    }
                    .title {
                        max-width: 45%;
                    }
                    .tr {
                        text-align: right
                    }
                    mark {
                        background-color: #0b7dda;
                        border-radius: 4px;
                        color: #fff;
                        margin: 0;
                    }
                    .showbox {
                        display: -webkit-box;
                        -webkit-box-orient: vertical;
                        -webkit-line-clamp: 2;
                        overflow: hidden;
                        margin: 2px !important
                    }

                    // JS 添加到刚才所有JS的最上方
                    function check() {
                        return false;
                    }
                    
                    function maxfor(arr) {
                        var len = arr.length;
                        var max = -Infinity;
                        while (len--) {
                            if (arr[len] > max) {
                                max = arr[len];
                            }
                        }
                        return max;
                    }
                    
                    // 下面的和刚才的代码有部分重复，替换就行
                    let input = document.querySelector("input[type='text']");
                    let result = document.getElementById('resultlist')
                    let infos = document.getElementById('info')
                    let button = document.getElementById('searchbutton')
                    input.addEventListener("focus", e => {
                        result.style.height = '40%'
                        result.style.overflow = 'auto'
                        result.style.padding = '3px'
                        infos.id = 'hidden'
                    })
                    input.addEventListener("blur", e => {
                        result.style.height = '0'
                        result.style.overflow = 'hidden'
                        result.style.padding = '0'
                        infos.id = 'info'
                    })
                    // 下方应为obj的定义

后言

上述就是目前这个版本搜索的实现，理论上来说这个和前面一篇文章一样，内容都是关于伪动态的，因为搜索这个功能到我写这篇文章为止，都是先存数据库，然后在搜索时拿去数据库比对，之后返回结果，像我这样把搜索过程全搬到用户端的估计全互联网我还是第一人。不过这样也有利有弊，最大的利是节省了服务器，而弊端就是搜索速度依靠用户端设备算力，另外就是在内容太多时需要先将json下载到本地才能搜索（这是必要的，但是可以通过预加载等方式提前这个过程加快）
做了这几篇文章，可以简单归纳伪动态：用其他脚本处理整合数据，前端用js处理。这里整合的载体是json，前面的EverydayNews载体是固定的文件夹结构，PSGameSpider则是混合，将数据直接写入定时更新生成的html的js部分数组里，读取则是靠固定文件夹结构。
说回主角，我把它在Github立了个项，放在Github@RavelloH/RPageSearch中，现在没有时间不会再去更新它，但是以后~~(也可能是很久以后)~~会陆续升级一下，加入模糊搜索等功能。

原创内容使用知识共享署名-非商业性使用-相同方式共享 4.0 (CC BY-NC-ND 4.0) 协议授权。转载请注明出处。

上一篇
正在加载中...

下一篇
正在加载中...

正在加载评论...

RavelloH's Blog

LOAD ing . . .

静态站实现全站搜索

前言

效果

搜索示例

实现方式

数据生成

搜索处理

代码分析

后言

RavelloH's

BLOG

00:00