This repository was archived by the owner on Mar 10, 2023. It is now read-only.

Remove duplicate content #9

@gricn


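Because of a mistake in the update strategy of an earlier version, some articles were crawled more than once.

Taking "国家卫生健康委员会医师资格考试委员会公告" (Announcement of the Medical Licensing Examination Committee of the National Health Commission) as an example, the same article appears 4 times.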

Search interface
image

View in Kibana
image

Could the repository be deduplicated along the lines of the following references?
https://github.com/alexander-marquardt/deduplicate-elasticsearch/blob/master/deduplicate-elaticsearch.py
https://stackoverflow.com/questions/29886477/how-to-remove-duplicate-search-result-in-elasticsearch
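
As a rough illustration of the kind of deduplication being asked about, here is a minimal sketch that groups documents by a hash of the fields that define "the same article" and collects the IDs of the extra copies. It is not taken from this repository: the field names `title` and `url` are assumptions (the index mapping is not shown in this issue), and `hits` is assumed to be a list of Elasticsearch-style hits with `_id` and `_source` keys, however they are fetched from the index.

```python
import hashlib
from collections import defaultdict

# Hypothetical field names; the real index mapping is not shown in this issue.
KEY_FIELDS = ["title", "url"]


def fingerprint(source, key_fields=KEY_FIELDS):
    """Hash the values of the key fields so that identical articles collide."""
    # Join with a separator that is unlikely to occur inside the field values.
    joined = "\x1f".join(str(source.get(field, "")) for field in key_fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def find_duplicate_ids(hits, key_fields=KEY_FIELDS):
    """Return the _id of every document after the first one in each group."""
    groups = defaultdict(list)
    for hit in hits:
        groups[fingerprint(hit["_source"], key_fields)].append(hit["_id"])

    to_delete = []
    for ids in groups.values():
        to_delete.extend(ids[1:])  # keep the first copy, drop the rest
    return to_delete
```

The returned IDs could then be passed to whatever delete mechanism the project uses. Keeping the first copy in each group is an arbitrary choice made for this sketch; keeping the most recently crawled copy may fit better if the documents carry a timestamp.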

Metadata

Labels: bug