导读:本文由梯度科技大数据技术专家刘伟平撰写,通过构建医疗知识图谱的小实验探索知识图谱模型和关系型数据库模型之间的本质关系,不同于以往采用RDF/OWL定义知识语义的方式,本文采用kettle的Neo4j插件通过零编码从0到1实现知识图谱的本体建模及实体关系抽取,并初步提出通用知识图谱平台产品化的解决方案。
01
环境准备
软件环境
Linux环境:MariaDB-5.5、neo4j-community-3.5.14
Windows环境:kettle-8.0、插件Neo4JOutput-4.1.4
插件集成
插件下载地址:
https://github.com/knowbi/knowbi-pentaho-pdi-neo4j-output/releases/
下载完成后将文件解压放在$KETTLE_HOME\plugins\目录下,安装插件完成。启动Kettle,发现多出6个Neo4j相关的插件,支持直接写入、建模写入,cypher读取/写入,csv导入,切分graph等操作,本次只介绍其中的“Neo4j Output”和“Neo4J Graph Output”插件的使用,其他插件感兴趣的可自行研究。

02
知识图谱建模
知识图谱实体类型
实体类型 | 中文含义 | 举例 |
Check | 诊断检查项目 | 支气管造影;关节镜检查 |
Department | 医疗科目 | 整形美容科;烧伤科 |
Disease | 疾病 | 血栓闭塞性脉管炎;胸降主动脉动脉瘤 |
Drug | 药品 | 京万红痔疮膏;布林佐胺滴眼液 |
Food | 食物 | 番茄冲菜牛肉丸汤;竹笋炖羊肉 |
Producer | 在售药品 | 通药制药青霉素V钾片;青阳醋酸地塞米松片 |
Symptom | 疾病症状 | 乳腺组织肥厚;脑实质深部出血 |
知识图谱实体关系类型
实体关系类型 | 中文含义 | 举例 |
belongs_to | 属于 | <妇科,属于,妇产科> |
common_drug | 疾病常用药品 | <阳强,常用,甲磺酸酚妥拉明分散片> |
do_eat | 疾病宜吃食物 | <胸椎骨折,宜吃,黑鱼> |
drugs_of | 药品在售药品 | <青霉素V钾片,在售,通药制药青霉素V钾片> |
need_check | 疾病所需检查 | <单侧肺气肿,所需检查,支气管造影> |
no_eat | 疾病忌吃食物 | <唇病,忌吃,杏仁> |
recommand_drug | 疾病推荐药品 | <混合痔,推荐用药,京万红痔疮膏> |
recommand_eat | 疾病推荐食谱 | <鞘膜积液,推荐食谱,番茄冲菜牛肉丸汤> |
has_symptom | 疾病症状 | <早期乳腺癌,疾病症状,乳腺组织肥厚> |
acompany_with | 疾病并发疾病 | <下肢交通静脉瓣膜关闭不全,并发疾病,血栓闭塞性脉管炎> |
知识图谱属性类型
属性类型 | 中文含义 | 举例 |
name | 疾病名称 | 喘息样支气管炎 |
desc | 疾病简介 | 又称哮喘性支气管炎... |
cause | 疾病病因 | 常见的有合胞病毒等... |
prevent | 预防措施 | 注意家族与患儿自身过敏史... |
cure_lasttime | 治疗周期 | 6-12个月 |
cure_way | 治疗方式 | "药物治疗","支持性治疗" |
cured_prob | 治愈概率 | 95% |
easy_get | 疾病易感人群 | 无特定的人群 |
03
数据准备
将json数据导入mysql

数据预处理
truncate table relationship;-- recommand_eatinsert into relationship(rel_type,rel_type_name,disease_id,food_id,disease_name,food_name)select 'recommand_eat' as rel_type , '推荐食谱' as rel_type_name,b.id,c.id,b.name,c.namefrom (SELECT t.name,replace(replace(replace(substring_index(substring_index(t.recommand_eat,',', b.help_topic_id + 1), ',', -1),'\"',''),'[',''),']','')as recommand_eatFROM medical t join mysql.help_topic b ON b.help_topic_id < (LENGTH(t.recommand_eat) - LENGTH(REPLACE(t.recommand_eat, ',', '')) + 1)) aINNER JOIN disease b on a.name=b.nameINNER JOIN food c on a.recommand_eat = c.name;-- not_eatinsert into relationship(rel_type,rel_type_name,disease_id,food_id,disease_name,food_name)select 'not_eat' as rel_type ,'忌吃' as rel_type_name,b.id,c.id,b.name,c.namefrom (SELECT t.name,replace(replace(replace(substring_index(substring_index(t.not_eat,',', b.help_topic_id + 1), ',', -1),'\"',''),'[',''),']','')as not_eatFROM medical t join mysql.help_topic b ON b.help_topic_id < (LENGTH(t.not_eat) - LENGTH(REPLACE(t.not_eat, ',', '')) + 1)) aINNER JOIN disease b on a.name=b.nameINNER JOIN food c on a.not_eat = c.name;-- do_eatinsert into relationship(rel_type,rel_type_name,disease_id,food_id,disease_name,food_name)select 'do_eat' as rel_type ,'宜吃' as rel_type_name,b.id,c.id,b.name,c.namefrom (SELECT t.name,replace(replace(replace(substring_index(substring_index(t.eat,',', b.help_topic_id + 1), ',', -1),'\"',''),'[',''),']','')as eatFROM medical t join mysql.help_topic b ON b.help_topic_id < (LENGTH(t.eat) - LENGTH(REPLACE(t.eat, ',', '')) + 1)) aINNER JOIN disease b on a.name=b.nameINNER JOIN food c on a.eat = c.name;-- common_drugdelete from relationship where rel_type = 'common_drug';insert into relationship(rel_type,rel_type_name,disease_id,drug_id,disease_name,drug_name)select 'common_drug' as rel_type ,'常用药品' as rel_type_name,b.id,c.id,b.name,c.namefrom (SELECT t.name,replace(replace(replace(substring_index(substring_index(t.common_drug,',', b.help_topic_id + 1), ',', -1),'\"',''),'[',''),']','')as common_drugFROM medical t join mysql.help_topic b ON b.help_topic_id < (LENGTH(t.common_drug) - LENGTH(REPLACE(t.common_drug, ',', '')) + 1)) aINNER JOIN disease b on a.name=b.nameINNER JOIN drug c on a.common_drug = c.name;-- recommand_drugdelete from relationship where rel_type = 'recommand_drug';insert into relationship(rel_type,rel_type_name,disease_id,drug_id,disease_name,drug_name)select 'recommand_drug' as rel_type,'好评药品' as rel_type_name ,b.id,c.id,b.name,c.namefrom (SELECT t.name,replace(replace(replace(substring_index(substring_index(t.recommand_drug,',', b.help_topic_id + 1), ',', -1),'\"',''),'[',''),']','')as recommand_drugFROM medical t join mysql.help_topic b ON b.help_topic_id < (LENGTH(t.recommand_drug) - LENGTH(REPLACE(t.recommand_drug, ',', '')) + 1)) aINNER JOIN disease b on a.name=b.nameINNER JOIN drug c on a.recommand_drug = c.name;-- need_checkdelete from relationship where rel_type = 'need_check';insert into relationship(rel_type,rel_type_name,disease_id,check_id,disease_name,check_name)select 'need_check' as rel_type,'诊断检查' as rel_type_name ,b.id,c.id,b.name,c.namefrom (SELECT t.name,replace(replace(replace(substring_index(substring_index(t.`check`,',', b.help_topic_id + 1), ',', -1),'\"',''),'[',''),']','')as `check`FROM medical t join mysql.help_topic b ON b.help_topic_id < (LENGTH(t.`check`) - LENGTH(REPLACE(t.`check`, ',', '')) + 1)) aINNER JOIN disease b on a.name=b.nameINNER JOIN `check` c on a.`check` = c.name;-- has_symptomdelete from relationship where rel_type = 'has_symptom';insert into relationship(rel_type,rel_type_name,disease_id,symptom_id,disease_name,symptom_name)select 'has_symptom' as rel_type ,'症状' as rel_type_name, b.id,c.id,b.name,c.namefrom (SELECT t.name,replace(replace(replace(substring_index(substring_index(t.`symptom`,',', b.help_topic_id + 1), ',', -1),'\"',''),'[',''),']','')as `symptom`FROM medical t join mysql.help_topic b ON b.help_topic_id < (LENGTH(t.`symptom`) - LENGTH(REPLACE(t.`symptom`, ',', '')) + 1)) aINNER JOIN disease b on a.name=b.nameINNER JOIN `symptom` c on a.`symptom` = c.name;-- belongs_todelete from relationship where rel_type = 'belongs_to';insert into relationship(rel_type,rel_type_name,disease_id,department_id,parent_department_id,disease_name,department_name,parent_department_name)select 'belongs_to' as rel_type ,'属于' as rel_type_name,b.id,c.id,d.id,b.name,c.name,d.namefrom (SELECT t.name,replace(replace(replace(substring_index(t.`cure_department`,',', 1),'\"',''),'[',''),']','')as `department`,replace(replace(replace(substring_index(t.`cure_department`,',',-1),'\"',''),'[',''),']','')as `parent_department`FROM medical t ) aINNER JOIN disease b on a.name=b.nameINNER JOIN `department` c on a.`department` = c.nameleft join `department` d on a.`parent_department` = d.name;-- acompany_withdelete from relationship where rel_type = 'acompany_with';insert into relationship(rel_type,rel_type_name,disease_id,acompany_disease_id,disease_name,acompany_disease_name)select 'acompany_with' as rel_type ,'并发症' as rel_type_name,b.id,c.id,b.name,c.namefrom (SELECT t.name,replace(replace(replace(substring_index(substring_index(t.`acompany`,',', b.help_topic_id + 1), ',', -1),'\"',''),'[',''),']','')as `acompany`FROM medical t join mysql.help_topic b ON b.help_topic_id < (LENGTH(t.`acompany`) - LENGTH(REPLACE(t.`acompany`, ',', '')) + 1)) aINNER JOIN disease b on a.name=b.nameINNER JOIN `disease` c on a.`acompany` = c.name;-- drugs_of 生产药品delete from relationship where rel_type = 'drugs_of';insert into relationship(rel_type,rel_type_name,producer_id,drug_id,producer_name,drug_name)select rel_type,'生产药品' as rel_type_name, producer_id ,drug_id,max(producer_name) , max(drug_name) from (select 'drugs_of' as rel_type ,b.id as producer_id,c.id as drug_id,b.name as producer_name,c.name as drug_namefrom (select substring_index(tmp.drug_detail,'(',1) as producer,replace(substring_index(tmp.drug_detail,'(',-1),')','') as drug_nameFROM(SELECT t.name,replace(replace(replace(substring_index(substring_index(t.`drug_detail`,',', b.help_topic_id + 1), ',', -1),'\"',''),'[',''),']','')as `drug_detail`FROM medical t join mysql.help_topic b ON b.help_topic_id < (LENGTH(t.`drug_detail`) - LENGTH(REPLACE(t.`drug_detail`, ',', '')) + 1)) tmp) as aINNER JOIN producer b on a.producer=b.nameINNER JOIN `drug` c on a.`drug_name` = c.name) resultGROUP BY rel_type,producer_id,drug_id


04
知识任务抽取
Neo4J Output插件导入


主要有三部分值得注意的地方:From Node、To Node、Relationship。
From Node:

To Node:

Relationship:

Neo4J Graph Output插件建模增量导入
医疗图谱的增量抽取任务:

表输入中添加增量条件:

建模:

Nodes配置:

Relationships配置:

模型建成后,可以导出为json:

配置流字段与模型属性的Mapping:

Mapping结果:

05
效果展示

06
产品化之路
通过这次构建知识图谱的小实验,我们可以大胆发散思维,如果企业研发一个通用知识图谱平台产品应该具备哪些功能要素?
首先,它应该具备本体建模功能,对于行业知识图谱一般采用自上而下的构建方式,定义本体类型与属性、关系类型与属性。
其次,它应该具备异构数据知识抽取功能,针对结构化数据和非结构化数据(以文本形式居多)支持傻瓜式建立知识抽取任务,结构化数据知识抽取模块抽象出数据源方便快速对接各种关系型数据库;非结构化数据知识抽取模块支持对上传的文件做人工标注,通过自然语言处理技术进行有监督训练,提取出文本中包含的三元组。再建立源数据模型与知识图谱模型之间的映射,包括实体映射,实体与属性映射、关系映射、关系属性映射,实现知识自动抽取到知识图谱中。
知识检索功能也是必不可少的,支撑图探索、路径分析等应用。
平台还可以提供一个知识智能问答功能,它能够自动识别问题类型,从该租户具备知识搜索权限的知识库中定位最适合的知识图谱,从该图谱中检索知识并融合成人类容易理解的答案。当然,与其他产品一样,用户权限管理、日志监控、系统管理等一些基础功能也是必备的。
目前,梯度科技始终致力于以技术赋能业务,降低企业技术应用门槛,打造可信的容器、微服务、DevOps、边缘计算、多云管理和大数据等PaaS技术平台,为客户提供云数智一体化综合解决方案,助力客户数智化转型发展。





