
Journal of Software (软件学报), ISSN 1000-9825, CODEN RUXUEW, E-mail: jos@iscas.ac.cn

Journal of Software,2019,30(5):1342−1358 [doi: 10.13328/j.cnki.jos.005722] http://www.jos.org.cn

API Misuse Bug Detection Based on Deep Learning ∗

WANG Xin (汪昕) 1,2, CHEN Chi (陈驰) 1,2, ZHAO Yi-Fan (赵逸凡) 1,2, PENG Xin (彭鑫) 1,2, ZHAO Wen-Yun (赵文耘) 1,2

1 (Software School, Fudan University, Shanghai 201203, China)

2 (Shanghai Key Laboratory of Data Science (Fudan University), Shanghai 201203, China)

Corresponding author: PENG Xin, E-mail: pengxin@fudan.edu.cn

Abstract: Developers often use application programming interfaces (APIs) to reuse existing software frameworks and class libraries. Because of the complexity of the APIs themselves and the lack of documentation, developers often misuse APIs, which leads to code defects. Automatically detecting API misuse defects requires API usage specifications against which API usage code can be checked. However, specifications suitable for automatic detection are difficult to obtain, and writing and maintaining them manually is costly. To address this problem, this study applies the recurrent neural network model from deep learning to learning API usage specifications and detecting API misuse defects. Based on a large body of open source Java code, training samples of API usage specifications are constructed by static analysis and used to train a recurrent neural network that learns the specifications. On this basis, context-based statement prediction is performed on API usage code, and potential API misuse defects are found by comparing the predictions with the actual code. The approach is implemented and evaluated experimentally on Java cryptography APIs and their usage code. The results show that the approach can, to a certain extent, detect API misuse defects automatically.

Key words: API misuse; usage specification; bug detection; deep learning

Chinese Library Classification number: TP311

Citation format (Chinese): 汪昕,陈驰,赵逸凡,彭鑫,赵文耘.基于深度学习的 API 误用缺陷检测.软件学报,2019,30(5):1342−1358. http://www.jos.org.cn/1000-9825/5722.htm

Citation format (English): Wang X, Chen C, Zhao YF, Peng X, Zhao WY. API misuse bug detection based on deep learning. Ruan Jian Xue Bao/Journal of Software, 2019,30(5):1342−1358 (in Chinese). http://www.jos.org.cn/1000-9825/5722.htm

∗ Foundation item: National Key Research and Development Program of China (2016YFB1000801)

Recommended by Prof. SHEN Fu-Rao (申富饶) and Assoc. Prof. LI Ge (李戈), guest editors of the special issue on new technologies for intelligent software.

Received 2018-08-31; Revised 2018-10-31, 2018-12-14; Accepted 2019-02-03

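During software development, developers often use application programming interfaces (APIs) to reuse existing software frameworks and class libraries, which saves development time and improves development efficiency. However, because of the complexity of the APIs themselves, missing documentation, or the developers' own oversights [1−4], developers often misuse APIs. API misuse takes many forms [5], such as redundant API calls, missing API calls, wrong API call arguments, missing precondition checks, and ignored exception handling. In real projects, these misuses frequently lead to code defects such as functional errors, performance problems, and security vulnerabilities [5,6]. For example, when a file stream API is used, omitting the final call that closes the stream leaks the memory and other resources held by the stream.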
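The missing-call case can be made concrete with a small illustration. The paper's subject is Java, so the following is only a Python analogue of the file-stream example; the function names `write_without_close`, `write_with_close`, and `demo` are hypothetical and exist purely for this sketch.

```python
import os
import tempfile


def write_without_close(path, text):
    """Misuse: the stream is opened and written to, but never closed."""
    stream = open(path, "w", encoding="utf-8")
    stream.write(text)
    return stream  # returned only so the caller can inspect its state


def write_with_close(path, text):
    """Correct use: the with-statement closes the stream on every path."""
    with open(path, "w", encoding="utf-8") as stream:
        stream.write(text)
    return stream


def demo():
    """Return the `closed` flag of the stream left behind by each variant."""
    with tempfile.TemporaryDirectory() as folder:
        leaked = write_without_close(os.path.join(folder, "a.txt"), "x")
        closed = write_with_close(os.path.join(folder, "b.txt"), "x")
        states = (leaked.closed, closed.closed)
        leaked.close()  # clean up so the temporary directory can be removed
    return states
```

Calling `demo()` shows that the first variant leaves its stream open while the second does not, which is the kind of omitted call a misuse detector is meant to flag.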

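Automatically detecting API misuse defects requires API usage specifications [7] against which API usage code is checked. However, specifications suitable for automatic detection are difficult to obtain, and writing and maintaining them manually is costly. Related work [8−10] uses data mining, statistical language models, and similar methods to automatically mine or learn API usage specifications for defect detection, but suffers from problems such as limited synthesis capability.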

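This paper applies the recurrent neural network model from deep learning to learning API usage specifications and detecting API misuse defects. Based on a large body of open source Java code, training samples of API usage specifications are constructed by static analysis and used to build a recurrent neural network structure that learns the specifications. On this basis, the paper performs context-based statement prediction on API usage code and finds potential API misuse defects by comparing the predictions with the actual code. The approach is implemented and evaluated experimentally on Java cryptography APIs and their usage code. The results show that the approach can, to a certain extent, detect API misuse defects automatically.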
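The predict-then-compare idea can be sketched in a few lines. This is not the paper's model: a simple successor-frequency table stands in for the recurrent neural network, the call sequences are invented for illustration, and the names `train`, `predict`, `detect`, and `END` are hypothetical. Only the comparison step mirrors the description above.

```python
from collections import Counter, defaultdict

END = "<end>"  # assumed marker for the end of a call sequence


def train(sequences):
    """Count which call follows each call in the training sequences."""
    successors = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:] + [END]):
            successors[prev][nxt] += 1
    return successors


def predict(successors, prev, top_k=1):
    """Return the top_k most frequent successors of prev."""
    return [call for call, _ in successors[prev].most_common(top_k)]


def detect(successors, seq, top_k=1):
    """Report positions where the actual next call is not among the predictions."""
    findings = []
    for i, (prev, actual) in enumerate(zip(seq, seq[1:] + [END])):
        expected = predict(successors, prev, top_k)
        if expected and actual not in expected:
            findings.append((i + 1, actual, expected))
    return findings


# Invented training data: call sequences as static analysis might extract them.
corpus = [
    ["Cipher.getInstance", "Cipher.init", "Cipher.doFinal"],
    ["Cipher.getInstance", "Cipher.init", "Cipher.doFinal"],
    ["Cipher.getInstance", "Cipher.init", "Cipher.update", "Cipher.doFinal"],
]
model = train(corpus)

# Code under inspection: the Cipher.init call is missing.
suspect = ["Cipher.getInstance", "Cipher.doFinal"]
print(detect(model, suspect))
```

Each finding records the position, the call actually present, and the calls the model expected instead. A learned sequence model would replace the frequency table, but the mismatch check would keep the same shape.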

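Section 1 describes the deep learning background used in this paper. Section 2 reviews related work. Section 3 presents the proposed approach and its implementation. Section 4 describes the design and the analysis of results for the two experiments in this paper: the deep learning model training experiment and the API misuse defect detection experiment. Finally, Section 5 concludes the paper and discusses future work.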

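1 Background

Deep learning (DL) is a class of machine learning algorithms that extract complex features from data through successive layers of nonlinear processing units and solve problems using the resulting combined features [11].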

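1.1 Recurrent Neural Networks

Recurrent neural networks (RNNs) are an important branch of deep learning. By exploiting the temporal information in data and building deep representations of its semantic information, RNNs have achieved breakthroughs in processing and predicting sequence data and perform well in speech recognition, language modeling, machine translation, and other tasks [12−15].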
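To show how an RNN carries temporal information forward, here is a minimal sketch of the usual textbook recurrence h_t = tanh(W_xh x_t + W_hh h_{t-1} + b) in plain Python. The function names, the weight names `w_xh` and `w_hh`, and the choice of tanh are assumptions of this sketch, not details taken from the paper.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, b):
    """One recurrence: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    h_next = []
    for j in range(len(b)):
        total = b[j]
        total += sum(w_xh[j][i] * x[i] for i in range(len(x)))
        total += sum(w_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_next.append(math.tanh(total))
    return h_next


def rnn_forward(inputs, h0, w_xh, w_hh, b):
    """Run the recurrence over a sequence and return every hidden state."""
    states, h = [], h0
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh, b)
        states.append(h)
    return states


# One hidden unit; the second input is zero, so the second state
# depends only on what the first step stored in the hidden state.
states = rnn_forward([[1.0], [0.0]], [0.0], [[1.0]], [[1.0]], [0.0])
```

Because the second input is zero, the second hidden state is driven entirely by the first, which is the sense in which the state value preserves history.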

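1.2 Long Short-Term Memory Networks

An RNN uses a state value to retain historical information across iterations and uses this temporal information to inform the current decision. However, a simple RNN suffers from the long-term dependencies problem. The long short-term memory (LSTM) network, proposed by Hochreiter and Schmidhuber in 1997 [16], carries out forward propagation through its gate computations [16], forgetting historical information and updating its state with the input information, which effectively resolves the long-term dependency problem.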
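For reference, one step of a commonly used LSTM formulation can be written as follows. This is the usual textbook form, given here as a sketch; the notation (weights W, biases b, sigmoid σ, element-wise product ⊙, and the concatenation [h_{t-1}, x_t]) is an assumption and may differ from the paper's own.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

In this form, the forget gate f_t scales down the previous cell state (forgetting history), while the input gate i_t controls how much of the candidate state enters the cell (the state update).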

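1.3 Deep Recurrent Neural Networks

A deep recurrent neural network (deep RNN) is a variant of the RNN. It extends the shallow recurrent body with multiple recurrent layers and feeds the output of each RNN layer into the next layer as input for further processing and abstraction, which strengthens the RNN's ability to extract high-dimensional abstract information from the input [12].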
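The layer stacking described above can be sketched as follows. This is a minimal illustration with plain tanh layers; the names `layer_step` and `deep_rnn_forward`, the zero initial states, and the tuple layout of `layers` are assumptions of this sketch rather than the paper's implementation.

```python
import math


def layer_step(x, h_prev, w_in, w_rec, bias):
    """One recurrent layer: h = tanh(W_in x + W_rec h_prev + bias)."""
    return [
        math.tanh(
            bias[j]
            + sum(w * v for w, v in zip(w_in[j], x))
            + sum(w * v for w, v in zip(w_rec[j], h_prev))
        )
        for j in range(len(bias))
    ]


def deep_rnn_forward(inputs, layers):
    """Feed each layer's output at time t into the layer above it.

    layers is a list of (w_in, w_rec, bias) tuples, bottom layer first.
    Returns the top layer's hidden state at every time step.
    """
    states = [[0.0] * len(bias) for _, _, bias in layers]
    outputs = []
    for x in inputs:
        signal = x
        for idx, (w_in, w_rec, bias) in enumerate(layers):
            states[idx] = layer_step(signal, states[idx], w_in, w_rec, bias)
            signal = states[idx]  # this layer's output is the next layer's input
        outputs.append(signal)
    return outputs


# Two stacked one-unit layers with no recurrent weight, for easy checking.
one_unit = ([[1.0]], [[0.0]], [0.0])
outputs = deep_rnn_forward([[0.5]], [one_unit, one_unit])
```

With two layers, the input passes through two nonlinear transformations per time step, which is the extra abstraction that stacking is intended to provide.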

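1.4 Optimization of the Neural Network Model

1.4.1 Loss function

The loss function defines how the quality of a deep learning neural network model is measured and what its optimization objective is. This paper applies the neural network model to a classification problem, using the outputs of N neurons to represent the predicted probabilities of N different API calls. A loss function commonly used for classification is cross entropy, which expresses the distance between two probability distributions and can be used to measure the distance between the probability distribution of the correct answer and that of the predicted answer. During the training process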
