暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

Kettle集成Neo4j插件轻松构建医疗知识图谱的场景实践

梯度科技 2021-09-27
2780

导读:本文由梯度科技大数据技术专家刘伟平撰写,通过构建医疗知识图谱的小实验探索知识图谱模型和关系型数据库模型之间的本质关系,不同于以往采用RDF/OWL定义知识语义的方式,本文采用kettle的Neo4j插件通过零编码从0到1实现知识图谱的本体建模及实体关系抽取,并初步提出通用知识图谱平台产品化的解决方案。


01

环境准备


软件环境


Linux环境:MariaDB-5.5、neo4j-community-3.5.14

Windows环境:kettle-8.0、插件Neo4JOutput-4.1.4


插件集成


插件下载地址:

https://github.com/knowbi/knowbi-pentaho-pdi-neo4j-output/releases/


下载完成后将文件解压放在$KETTLE_HOME\plugins\目录下,安装插件完成。启动Kettle,发现多出6个Neo4j相关的插件,支持直接写入、建模写入,cypher读取/写入,csv导入,切分graph等操作,本次只介绍其中的“Neo4j Output”和“Neo4J Graph Output”插件的使用,其他插件感兴趣的可自行研究。



02

知识图谱建模


本次测试以医疗知识图谱为例,数据来源:https://github.com/liuhuanyong/QASystemOnMedicalKG/blob/master/data/medical.json,采样100条数据作为测试。

我们选用protege作为建模工具,这部分可参考知识图谱建模的相关文档,这里不展开介绍,仅以表格形式简化模型表达。

知识图谱实体类型


实体类型

中文含义

举例

Check

诊断检查项目

支气管造影;关节镜检查

Department

医疗科目

整形美容科;烧伤科

Disease

疾病

血栓闭塞性脉管炎;胸降主动脉动脉瘤

Drug

药品

京万红痔疮膏;布林佐胺滴眼液

Food

食物

番茄冲菜牛肉丸汤;竹笋炖羊肉

Producer

在售药品

通药制药青霉素V钾片;青阳醋酸地塞米松片

Symptom

疾病症状

乳腺组织肥厚;脑实质深部出血


知识图谱实体关系类型


实体关系类型

中文含义

举例

belongs_to

属于

<妇科,属于,妇产科>

common_drug

疾病常用药品

<阳强,常用,甲磺酸酚妥拉明分散片>

do_eat

疾病宜吃食物

<胸椎骨折,宜吃,黑鱼>

drugs_of

药品在售药品

<青霉素V钾片,在售,通药制药青霉素V钾片>

need_check

疾病所需检查

<单侧肺气肿,所需检查,支气管造影>

no_eat

疾病忌吃食物

<唇病,忌吃,杏仁>

recommand_drug

疾病推荐药品

<混合痔,推荐用药,京万红痔疮膏>

recommand_eat

疾病推荐食谱

<鞘膜积液,推荐食谱,番茄冲菜牛肉丸汤>

has_symptom

疾病症状

<早期乳腺癌,疾病症状,乳腺组织肥厚>

acompany_with

疾病并发疾病

<下肢交通静脉瓣膜关闭不全,并发疾病,血栓闭塞性脉管炎>


知识图谱属性类型


属性类型

中文含义

举例

name

疾病名称

喘息样支气管炎

desc

疾病简介

又称哮喘性支气管炎...

cause

疾病病因

常见的有合胞病毒等...

prevent

预防措施

注意家族与患儿自身过敏史...

cure_lasttime

治疗周期

6-12个月

cure_way

治疗方式

"药物治疗","支持性治疗"

cured_prob

治愈概率

95%

easy_get

疾病易感人群

无特定的人群

03

数据准备


将json数据导入mysql


将json数据用kettle导入到mysql中,转换见下图:


数据预处理


需要明确的几个概念:关系型数据库中实体表的一行,在neo4j中表示为一个node;关系型数据库中的外键映射关系或者jion后的宽表字段对应关系,可以理解为neo4j中的relationship。

预处理的逻辑如下:
    truncate table relationship;
    -- recommand_eat
    insert into relationship(rel_type,rel_type_name,disease_id,food_id,disease_name,food_name)
    select 'recommand_eat' as rel_type , '推荐食谱' as rel_type_name,b.id,c.id,b.name,c.name
    from (SELECT t.name,
    replace(
    replace(
    replace(substring_index(substring_index(t.recommand_eat,',', b.help_topic_id + 1), ',', -1),'\"',''),
    '[',''
    )
    ,']'
    ,''
    )


    as recommand_eat
    FROM medical t join mysql.help_topic b ON b.help_topic_id < (LENGTH(t.recommand_eat) - LENGTH(REPLACE(t.recommand_eat, ',', '')) + 1)) a
    INNER JOIN disease b on a.name=b.name
    INNER JOIN food c on a.recommand_eat = c.name;


    -- not_eat
    insert into relationship(rel_type,rel_type_name,disease_id,food_id,disease_name,food_name)
    select 'not_eat' as rel_type ,'忌吃' as rel_type_name,b.id,c.id,b.name,c.name
    from (SELECT t.name,
    replace(
    replace(
    replace(substring_index(substring_index(t.not_eat,',', b.help_topic_id + 1), ',', -1),'\"',''),
    '[',''
    )
    ,']'
    ,''
    )


    as not_eat
    FROM medical t join mysql.help_topic b ON b.help_topic_id < (LENGTH(t.not_eat) - LENGTH(REPLACE(t.not_eat, ',', '')) + 1)) a
    INNER JOIN disease b on a.name=b.name
    INNER JOIN food c on a.not_eat = c.name;


    -- do_eat
    insert into relationship(rel_type,rel_type_name,disease_id,food_id,disease_name,food_name)
    select 'do_eat' as rel_type ,'宜吃' as rel_type_name,b.id,c.id,b.name,c.name
    from (SELECT t.name,
    replace(
    replace(
    replace(substring_index(substring_index(t.eat,',', b.help_topic_id + 1), ',', -1),'\"',''),
    '[',''
    )
    ,']'
    ,''
    )


    as eat
    FROM medical t join mysql.help_topic b ON b.help_topic_id < (LENGTH(t.eat) - LENGTH(REPLACE(t.eat, ',', '')) + 1)) a
    INNER JOIN disease b on a.name=b.name
    INNER JOIN food c on a.eat = c.name;




    -- common_drug
    delete from relationship where rel_type = 'common_drug';
    insert into relationship(rel_type,rel_type_name,disease_id,drug_id,disease_name,drug_name)
    select 'common_drug' as rel_type ,'常用药品' as rel_type_name,b.id,c.id,b.name,c.name
    from (SELECT t.name,
    replace(
    replace(
    replace(substring_index(substring_index(t.common_drug,',', b.help_topic_id + 1), ',', -1),'\"',''),
    '[',''
    )
    ,']'
    ,''
    )


    as common_drug
    FROM medical t join mysql.help_topic b ON b.help_topic_id < (LENGTH(t.common_drug) - LENGTH(REPLACE(t.common_drug, ',', '')) + 1)) a
    INNER JOIN disease b on a.name=b.name
    INNER JOIN drug c on a.common_drug = c.name;




    -- recommand_drug
    delete from relationship where rel_type = 'recommand_drug';
    insert into relationship(rel_type,rel_type_name,disease_id,drug_id,disease_name,drug_name)
    select 'recommand_drug' as rel_type,'好评药品' as rel_type_name ,b.id,c.id,b.name,c.name
    from (SELECT t.name,
    replace(
    replace(
    replace(substring_index(substring_index(t.recommand_drug,',', b.help_topic_id + 1), ',', -1),'\"',''),
    '[',''
    )
    ,']'
    ,''
    )


    as recommand_drug
    FROM medical t join mysql.help_topic b ON b.help_topic_id < (LENGTH(t.recommand_drug) - LENGTH(REPLACE(t.recommand_drug, ',', '')) + 1)) a
    INNER JOIN disease b on a.name=b.name
    INNER JOIN drug c on a.recommand_drug = c.name;




    -- need_check
    delete from relationship where rel_type = 'need_check';
    insert into relationship(rel_type,rel_type_name,disease_id,check_id,disease_name,check_name)
    select 'need_check' as rel_type,'诊断检查' as rel_type_name ,b.id,c.id,b.name,c.name
    from (SELECT t.name,
    replace(
    replace(
    replace(substring_index(substring_index(t.`check`,',', b.help_topic_id + 1), ',', -1),'\"',''),
    '[',''
    )
    ,']'
    ,''
    )
    as `check`
    FROM medical t join mysql.help_topic b ON b.help_topic_id < (LENGTH(t.`check`) - LENGTH(REPLACE(t.`check`, ',', '')) + 1)) a
    INNER JOIN disease b on a.name=b.name
    INNER JOIN `check` c on a.`check` = c.name;




    -- has_symptom
    delete from relationship where rel_type = 'has_symptom';
    insert into relationship(rel_type,rel_type_name,disease_id,symptom_id,disease_name,symptom_name)
    select 'has_symptom' as rel_type ,'症状' as rel_type_name, b.id,c.id,b.name,c.name
    from (SELECT t.name,
    replace(
    replace(
    replace(substring_index(substring_index(t.`symptom`,',', b.help_topic_id + 1), ',', -1),'\"',''),
    '[',''
    )
    ,']'
    ,''
    )
    as `symptom`
    FROM medical t join mysql.help_topic b ON b.help_topic_id < (LENGTH(t.`symptom`) - LENGTH(REPLACE(t.`symptom`, ',', '')) + 1)) a
    INNER JOIN disease b on a.name=b.name
    INNER JOIN `symptom` c on a.`symptom` = c.name;




    -- belongs_to
    delete from relationship where rel_type = 'belongs_to';
    insert into relationship(rel_type,rel_type_name,disease_id,department_id,parent_department_id,disease_name,department_name,parent_department_name)
    select 'belongs_to' as rel_type ,'属于' as rel_type_name,b.id,c.id,d.id,b.name,c.name,d.name
    from (SELECT t.name,
    replace(
    replace(
    replace(substring_index(t.`cure_department`,',', 1),'\"',''),
    '[',''
    )
    ,']'
    ,''
    )
    as `department`,
    replace(
    replace(
    replace(substring_index(t.`cure_department`,',',-1),'\"',''),
    '[',''
    )
    ,']'
    ,''
    )
    as `parent_department`
    FROM medical t ) a
    INNER JOIN disease b on a.name=b.name
    INNER JOIN `department` c on a.`department` = c.name
    left join `department` d on a.`parent_department` = d.name;


    -- acompany_with
    delete from relationship where rel_type = 'acompany_with';
    insert into relationship(rel_type,rel_type_name,disease_id,acompany_disease_id,disease_name,acompany_disease_name)
    select 'acompany_with' as rel_type ,'并发症' as rel_type_name,b.id,c.id,b.name,c.name
    from (SELECT t.name,
    replace(
    replace(
    replace(substring_index(substring_index(t.`acompany`,',', b.help_topic_id + 1), ',', -1),'\"',''),
    '[',''
    )
    ,']'
    ,''
    )
    as `acompany`
    FROM medical t join mysql.help_topic b ON b.help_topic_id < (LENGTH(t.`acompany`) - LENGTH(REPLACE(t.`acompany`, ',', '')) + 1)) a
    INNER JOIN disease b on a.name=b.name
    INNER JOIN `disease` c on a.`acompany` = c.name;


    -- drugs_of 生产药品
    delete from relationship where rel_type = 'drugs_of';
    insert into relationship(rel_type,rel_type_name,producer_id,drug_id,producer_name,drug_name)
    select rel_type,'生产药品' as rel_type_name, producer_id ,drug_id,max(producer_name) , max(drug_name) from (
    select 'drugs_of' as rel_type ,b.id as producer_id,c.id as drug_id,b.name as producer_name,c.name as drug_name
    from (
    select substring_index(tmp.drug_detail,'(',1) as producer,replace(substring_index(tmp.drug_detail,'(',-1),')','') as drug_name
    FROM(
    SELECT t.name,
    replace(
    replace(
    replace(substring_index(substring_index(t.`drug_detail`,',', b.help_topic_id + 1), ',', -1),'\"',''),
    '[',''
    )
    ,']'
    ,''
    )
    as `drug_detail`
    FROM medical t join mysql.help_topic b ON b.help_topic_id < (LENGTH(t.`drug_detail`) - LENGTH(REPLACE(t.`drug_detail`, ',', '')) + 1)) tmp
    ) as a
    INNER JOIN producer b on a.producer=b.name
    INNER JOIN `drug` c on a.`drug_name` = c.name) result
    GROUP BY rel_type,producer_id,drug_id

    为了更好理解,建议将每一种关系单独定义为一张宽表(例如疾病和饮食的关系)。这里为了测试方便,作者将多种关系数据处理组合成一张宽表,处理后的数据定义如下:


    部分数据如下:


    04

    知识任务抽取


    Neo4J Output插件导入


    整个医疗图谱的抽取任务:


    这里还是以疾病-食物的关系为例,配置好mysql和neo4j的连接。详细的转换如下:


    主要有三部分值得注意的地方:From Node、To Node、Relationship。

    From Node:



    To Node:



    Relationship:



    Neo4J Graph Output插件建模增量导入


    医疗图谱的增量抽取任务:



    表输入中添加增量条件:



    建模:



    Nodes配置:



    Relationships配置:



    模型建成后,可以导出为json:



    配置流字段与模型属性的Mapping:



    Mapping结果:



    05

    效果展示



    06

    产品化之路


    通过这次构建知识图谱的小实验,我们可以大胆发散思维,如果企业研发一个通用知识图谱平台产品应该具备哪些功能要素?


    首先,它应该具备本体建模功能,对于行业知识图谱一般采用自上而下的构建方式,定义本体类型与属性、关系类型与属性。


    其次,它应该具备异构数据知识抽取功能,针对结构化数据和非结构化数据(以文本形式居多)支持傻瓜式建立知识抽取任务,结构化数据知识抽取模块抽象出数据源方便快速对接各种关系型数据库;非结构化数据知识抽取模块支持对上传的文件做人工标注,通过自然语言处理技术进行有监督训练,提取出文本中包含的三元组。再建立源数据模型与知识图谱模型之间的映射,包括实体映射,实体与属性映射、关系映射、关系属性映射,实现知识自动抽取到知识图谱中。


    知识检索功能也是必不可少的,支撑图探索、路径分析等应用。


    平台还可以提供一个知识智能问答功能,它能够自动识别问题类型,从该租户具备知识搜索权限的知识库中定位最适合的知识图谱,从该图谱中检索知识并融合成人类容易理解的答案。当然,与其他产品一样,用户权限管理、日志监控、系统管理等一些基础功能也是必备的。


    目前,梯度科技始终致力于以技术赋能业务,降低企业技术应用门槛,打造可信的容器、微服务、DevOps、边缘计算、多云管理和大数据等PaaS技术平台,为客户提供云数智一体化综合解决方案,助力客户数智化转型发展。


    文章转载自梯度科技,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

    评论