LNCS 13148

Hong Shen · Yingpeng Sang · Yong Zhang · Nong Xiao · Hamid R. Arabnia · Geoffrey Fox · Ajay Gupta · Manu Malek (Eds.)

Parallel and Distributed Computing, Applications and Technologies 22nd International Conference, PDCAT 2021 Guangzhou, China, December 17–19, 2021 Proceedings

Lecture Notes in Computer Science Founding Editors Gerhard Goos Karlsruhe Institute of Technology, Karlsruhe, Germany Juris Hartmanis Cornell University, Ithaca, NY, USA

Editorial Board Members Elisa Bertino Purdue University, West Lafayette, IN, USA Wen Gao Peking University, Beijing, China Bernhard Steffen TU Dortmund University, Dortmund, Germany Gerhard Woeginger RWTH Aachen, Aachen, Germany Moti Yung Columbia University, New York, NY, USA

More information about this subseries at https://link.springer.com/bookseries/7407

Editors Hong Shen Sun Yat-sen University Guangzhou, Guangdong, China

Yingpeng Sang Sun Yat-sen University Guangzhou, China

Yong Zhang Shenzhen Institute of Advanced Technology Shenzhen, China

Nong Xiao Sun Yat-sen University Guangzhou, China

Hamid R. Arabnia University of Georgia Athens, GA, USA

Geoffrey Fox University of Utah Salt Lake City, USA

Ajay Gupta Western Michigan University Kalamazoo, MI, USA

Manu Malek Stevens Institute of Technology Hoboken, NJ, USA

ISSN 0302-9743 | ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-030-96771-0 | ISBN 978-3-030-96772-7 (eBook)
https://doi.org/10.1007/978-3-030-96772-7
LNCS Sublibrary: SL1 – Theoretical Computer Science and General Issues

© Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

It is our great pleasure to introduce this collection of research papers presented at the 22nd International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT 2021). PDCAT is a major forum for scientists, engineers, and practitioners throughout the world to present the latest research, results, ideas, developments, and applications in all areas of parallel and distributed computing. The conference started in Hong Kong in 2000, and PDCAT 2021 took place in Guangzhou, China, after 21 years of success in different countries/regions including Taiwan, Japan, China, South Korea, Singapore, Australia, and New Zealand. Due to the impact of the COVID-19 pandemic, this year's conference was conducted online for external participants and in person for local participants.

This year we received 97 submissions from authors in 15 different countries and regions across the world. Out of these submissions, we accepted 24 regular papers and 34 short papers, representing an acceptance rate of 25% for regular papers and 35% for short papers. The submissions were in general of high quality, making paper selection a tough task. The paper review process involved all Program Committee members. To ensure a high-quality program and provide sufficient feedback to authors, we made a great effort to have each paper reviewed by three independent reviewers on average. All accepted papers are included in the proceedings.

It would not have been possible for PDCAT 2021 to take place without the help and support of various people. The efforts of the authors, Program Committee members, and reviewers were essential to the conference's quality and deserve our utmost appreciation. We also wish to thank the local organization committee members for all their hard work in making PDCAT 2021 a great success, and we thank our sponsors, Sun Yat-sen University and Springer, for their support.
Last but not least, we wish to thank Guoliang Chen from the Nanjing University of Posts and Telecommunications and Shenzhen University, China; Depei Qian from Beihang University, China; Manu Malek as the Editor-in-Chief of the Computers and Electrical Engineering journal; Jiannong Cao from the Hong Kong Polytechnic University, China; Haibing Guan from Shanghai Jiao Tong University, China; Zhiwen Yu from the Northwestern Polytechnical University, China; Chengzhong Xu from the University of Macau, Macao SAR, China; Ajay Gupta from Western Michigan University, USA; and Hiroyuki Takizawa from Tohoku University, Japan, who delivered keynote speeches and helped attain the objectives of the conference.


We are grateful to all authors for submitting their up-to-date research results to the conference and all participants for attending the conference. We hope that you found the conference rewarding.

December 2021

Hong Shen Yingpeng Sang Yong Zhang Nong Xiao Hamid Arabnia Geoffrey Fox Ajay Gupta Manu Malek

Organization

Organizing Committee

General Chair
Hong Shen: Sun Yat-sen University, China

Program Chairs
Nong Xiao: Sun Yat-sen University, China
Hamid Arabnia: University of Georgia, USA
Geoffrey Fox: University of Utah, USA
Ajay Gupta: Western Michigan University, USA
Manu Malek: Stevens Institute of Technology, USA

Workshop and Tutorial Chair
Di Wu: Sun Yat-sen University, China

Publicity Chairs
Shi-Jinn Horng: National Taiwan University of Science and Technology, China
Hiroyuki Takizawa: Tohoku University, Japan

Publications Chairs
Yingpeng Sang: Sun Yat-sen University, China
Yong Zhang: Shenzhen Institute of Advanced Technology, China

Local Arrangement Chairs
Yuedong Yang: Sun Yat-sen University, China
Chao Yu: Sun Yat-sen University, China

Registration and Finance Chair
Xiangyin Liu: Sun Yat-sen University, China

viii

Organization

Program Committee

Yuebin Bai: Beihang University, China
Raj Bayyar: University of Melbourne, Australia
Ümit V. Çatalyürek: Georgia Institute of Technology, USA
Zhansheng Chen: Beijing Union University, China
Yawen Chen: University of Otago, New Zealand
Shi-Jinn Horng: National Taiwan University of Science and Technology, China
Zhengxiong Hou: Northwestern Polytechnical University, China
Mirjana Ivanovic: University of Novi Sad, Serbia
Teofilo Gonzalez: University of California, Santa Barbara, USA
Huaxi Gu: Xidian University, China
Haibing Guan: Shanghai Jiao Tong University, China
Longkun Guo: Fuzhou University, China
Hai Jin: Huazhong University of Science and Technology, China
Haibin Kan: Fudan University, China
Francis Lau: University of Hong Kong, China
Kenli Li: Hunan University, China
Keqiu Li: Tianjin University, China
Yidong Li: Beijing Jiaotong University, China
Yamin Li: Hosei University, Japan
Weifa Liang: Australian National University, Australia
Shangsong Liang: Sun Yat-sen University, China
Li Ma: North China University of Technology, China
Rui Mao: Shenzhen University, China
Koji Nakano: University of Hiroshima, Japan
James J. Park: Seoul National University of Science and Technology, South Korea
Depei Qian: Beihang University, China
Jiangbo Qian: Ningbo University, China
Yingpeng Sang: Sun Yat-sen University, China
Michael Sheng: Macquarie University, Australia
Jiwu Shu: Xiamen University, China
Hiroyuki Takizawa: Tohoku University, Japan
Hui Tian: Griffith University, Australia
Rangding Wang: Ningbo University, China
Xun Wang: Zhejiang Gongshang University, China
Jian Weng: Jinan University, China
Di Wu: Sun Yat-sen University, China
Jigang Wu: Guangdong University of Technology, China
Weigang Wu: Sun Yat-sen University, China

Chengzhong Xu: University of Macau, China
Jingling Xue: University of New South Wales, Australia
Yuedong Yang: Sun Yat-sen University, China
Chao Yu: Sun Yat-sen University, China
Jiguo Yu: Qilu University of Technology, China
Zhiwen Yu: Northwestern Polytechnical University, China
Haibo Zhang: University of Otago, New Zealand
Jianbiao Zhang: Beijing Polytechnic University, China
Yong Zhang: Shenzhen Institute of Advanced Technology, China
Zonghua Zhang: Huawei Paris, France
Xiaofan Zhao: Police University of China, China
Yuanjie Zheng: Shandong Normal University, China
Cheng Zhong: Guangxi University, China
Albert Zomaya: University of Sydney, Australia

Organizers

Hosted by: Sun Yat-sen University
In cooperation with: Springer
Contents

Networking and Architectures

Accelerating GPU-Based Out-of-Core Stencil Computation with On-the-Fly Compression . . . 3
Jingcheng Shen, Yifan Wu, Masao Okita, and Fumihiko Ino

Routing with Ant Colony Optimization in Wireless Mesh Networks . . . 15
Jiadong Peng, Zhanmao Cao, and Qisong Huang

A Light-Weight Scheme for Detecting Component Structure of Network Traffic . . . 27
Zihui Wu, Yi Xie, and Ziyang Wu

Evaluating the Performance and Conformance of a SYCL Implementation for SX-Aurora TSUBASA . . . 36
Jiahao Li, Mulya Agung, and Hiroyuki Takizawa

Bayesian Optimization-Based Task Scheduling Algorithm on Heterogeneous System . . . 48
Tan Cai and Hong Shen

Optimizing Uplink Bandwidth Utilization for Crowdsourced Livecast . . . 57
Xianzhi Zhang, Guoqiao Ye, Miao Hu, and Di Wu

A Batched Jacobi SVD Algorithm on GPUs and Its Application to Quantum Lattice Systems . . . 69
Rongfeng Huang, Tianyu Yu, Shifang Liu, Xinyin Zhang, and Yonghua Zhao

A Molecular Dynamics Based Multi-scale Platelet Aggregation Model and Its High-Throughput Simulation . . . 81
Zhipeng Xu and Qingsong Zou

Approximation and Polynomial Algorithms for Multi-depot Capacitated Arc Routing Problems . . . 93
Wei Yu and Yujie Liao

Zero-Shot Face Swapping with De-identification Adversarial Learning . . . 101
Huifang Li, Yidong Li, Jiaming Liu, Zhibin Hong, Tianshu Hu, and Yan Ren
An User-Driven Active Way to Push ACL in Software-Defined Networking . . . 113
Haisheng Yu, Dong Liu, Wenyong Wang, Keqiu Li, Sai Zou, Zhaobin Liu, and Yan Liu

Photonic Computing and Communication for Neural Network Accelerators . . . 121
Chengpeng Xia, Yawen Chen, Haibo Zhang, Hao Zhang, and Jigang Wu

Performance Comparison of Multi-layer Perceptron Training on Electrical and Optical Network-on-Chips . . . 129
Fei Dai, Yawen Chen, Zhiyi Huang, and Haibo Zhang

The Design and Implementation of Reconfigurable Quaternary Logic Processor . . . 142
Hongjian Wang, Youdong Wu, Shan Ouyang, Xunlei Chen, Yunfu Shen, and Yi Jin

A 3D Dubins Curve Constructing Method Based on Particle Swarm Optimization . . . 150
Cheng Ji, Chu Wang, Mingyan Song, and Fengmin Wang

Software Systems and Technologies

Towards Conflict-Aware Workload Co-execution on SX-Aurora TSUBASA . . . 163
Riku Nunokawa, Yoichi Shimomura, Mulya Agung, Ryusuke Egawa, and Hiroyuki Takizawa

A Learning-Based Scheduler for High Volume Processing in Data Warehouse Using Graph Neural Networks . . . 175
Vivek Bengre, M. Reza HoseinyFarahabady, Mohammad Pivezhandi, Albert Y. Zomaya, and Ali Jannesari

Adaptive Updates for Erasure-Coded Storage Systems Based on Data Delta and Logging . . . 187
Bing Wei, Jigang Wu, Xiaosong Su, Qiang Huang, and Yujun Liu

Matching Program Implementations and Heterogeneous Computing Systems . . . 198
Martin Sandrieser and Siegfried Benkner

FastDCF: A Partial Index Based Distributed and Scalable Near-Miss Code Clone Detection Approach for Very Large Code Repositories . . . 210
Liming Yang, Yi Ren, Jianbo Guan, Bao Li, Jun Ma, Peng Han, and Yusong Tan


Towards Optimal Fast Matrix Multiplication on CPU-GPU Platforms . . . 223
Senhao Shao, Yizhuo Wang, Weixing Ji, and Jianhua Gao

Temperature Matrix-Based Data Placement Using Improved Hungarian Algorithm in Edge Computing Environments . . . 237
Yuying Zhao, Pengwei Wang, Hengdi Huang, and Zhaohui Zhang

Realtime Physics Simulation of Large Virtual Space with Docker Containers . . . 249
Seiji Saito and Satoshi Fujita

A Deep Reinforcement Learning-Based Approach to the Scheduling of Multiple Workflows on Non-dedicated Edge Servers . . . 261
Yongqiang Gao and Ke Feng

A MVCC Approach to Parallelizing Interoperability of Consortium Blockchain . . . 273
Weiyi Lin, Qiang Qu, Li Ning, Jianping Fan, and Qingshan Jiang

An Effective and Reliable Cross-Blockchain Data Migration Approach . . . 286
Mengqiu Zhang, Qiang Qu, Li Ning, Jianping Fan, and Ruijie Yang

Algorithm for the Facility Location Problem with Origin and Destination . . . 295
Fengmin Wang, Chu Wang, Na Li, and Wenxing Kang

Reinforcement Learning-Based Auto-scaling Algorithm for Elastic Cloud Workflow Service . . . 303
Jian-bin Lu, Yang Yu, and Mao-lin Pan

Optimal Energy Efficiency Strategy of mm Wave Cooperative Communication Small Cell Based on SWITP . . . 311
Taoshen Li and Mingyu Lu

Low Latency Execution Guarantee Under Uncertainty in Serverless Platforms . . . 324
M. Reza HoseinyFarahabady, Javid Taheri, Albert Y. Zomaya, and Zahir Tari

High Resolution Patient-Specific Blood Flow Simulation in a Full-Size Aneurysmal Aorta Based on a Parallel Two-Level Method . . . 336
Jie Zhou, Jing Li, Shanlin Qin, and Rongliang Chen

Optimizing Data Locality by Executor Allocation in Reduce Stage for Spark Framework . . . 349
Zhongming Fu, Mengsi He, Zhuo Tang, and Yang Zhang


TEFRED: A Temperature and Energy Cognizant Fault-Tolerant Real-Time Scheduler Based on Deadline Partitioning for Heterogeneous Platforms . . . 358
Yanshul Sharma, Zinea Das, and Sanjay Moulik

Algorithms and Applications

Social Recommendation via Graph Attentive Aggregation . . . 369
Yuanwei Liufu and Hong Shen

MACSQ: Massively Accelerated DeepQ Learning on GPUs Using On-the-fly State Construction . . . 383
Marcel Köster, Julian Groß, and Antonio Krüger

Model-Based Multi-agent Policy Optimization with Dynamic Dependence Modeling . . . 396
Biyang Hu, Chao Yu, and Zifan Wu

Multi-index Federated Aggregation Algorithm Based on Trusted Verification . . . 412
Zhenshan Bao, Wei Bai, and Wenbo Zhang

Few-Shot Generative Learning by Modeling Stereoscopic Priors . . . 421
Yuehui Wang, Qing Wang, and Dongyu Zhang

Distributed Fair k-Center Clustering Problems with Outliers . . . 430
Fan Yuan, Luhong Diao, Donglei Du, and Lei Liu

Multi-zone Residential HVAC Control with Satisfying Occupants' Thermal Comfort Requirements and Saving Energy via Reinforcement Learning . . . 441
Zhengkai Ding, Qiming Fu, Jianping Chen, Hongjie Wu, You Lu, and Fuyuan Hu

Approximating BP Maximization with Distorted-Based Strategy . . . 452
Ruiqi Yang, Suixiang Gao, Lu Han, Gaidi Li, and Zhongrui Zhao

Streaming Algorithms for Maximization of a Non-submodular Function with a Cardinality Constraint on the Integer Lattice . . . 460
Jingjing Tan, Yue Sun, Yicheng Xu, and Juan Zou

Adaptable Focal Loss for Imbalanced Text Classification . . . 466
Lu Cao, Xinyue Liu, and Hong Shen


Roman Amphitheater Classification Using Convolutional Neural Network and Data Augmentation . . . 476
Haïfa Nakouri

Data-Hungry Issue in Personalized Product Search . . . 485
Bin Wu, Yuehong Wu, and Shangsong Liang

Jointly Super Resolution and Degradation Learning on Unpaired Real-World Images . . . 495
Xuankun Chen, Junhong Chen, and Dongyu Zhang

Enhanced Discriminant Local Direction Pattern Learning for Robust Palmprint Identification . . . 504
Siyuan Ma, Qintai Hu, Shuping Zhao, Lin Jiang, and Wenyan Wu

Latent Multi-view Subspace Clustering Based on Schatten-P Norm . . . 512
Yuqin Lu, Yilan Fu, Jiangzhong Cao, Shangsong Liang, and Wing-kuen Ling

Security and Privacy

MOFIT: An Efficient Access Control Scheme with Attribute Merging and Outsourcing Capability for Fog-Enhanced IoT . . . 523
Richa Sarma and Ferdous Ahmed Barbhuiya

RepBFL: Reputation Based Blockchain-Enabled Federated Learning Framework for Data Sharing in Internet of Vehicles . . . 536
Haoyu Chen, Naiyue Chen, He Liu, Honglei Zhang, Jiabo Xu, Huaping Chen, and Yidong Li

Multimodal Fusion Representation Learning Based on Differential Privacy . . . 548
Chaoxin Cai, Yingpeng Sang, Jinghao Huang, Maliang Zhang, and Weizheng Li

Efficient List Decoding Applied to ECC2 . . . 560
Peidong Guan, Yunqi Wan, Zhuoran Zhang, and Fangguo Zhang

Federated Data Integration for Heterogeneous Partitions Based on Differential Privacy . . . 568
Jinghao Huang, Yingpeng Sang, Chaoxin Cai, Weizheng Li, and Maliang Zhang


Patient-Chain: Patient-centered Healthcare System a Blockchain-based Technology in Dealing with Emergencies . . . 576
Hai Trieu Le, Lam Nguyen Tran Thanh, Hong Khanh Vo, Hoang Huong Luong, Khoi Nguyen Huynh Tuan, Tuan Dao Anh, The Anh Nguyen, Khang Hy Nguyen Vuong, and Ha Xuan Son

A Differential Privacy Image Publishing Method Based on Wavelet Transform . . . 584
Guifen Zhang, Hangui Wei, Lina Ge, and Xia Qin

Traffic Matrix Prediction Based on Differential Privacy and LSTM . . . 596
Weizheng Li, Yingpeng Sang, Maliang Zhang, Jinghao Huang, and Chaoxin Cai

A Blockchain-Based Continuous Query Differential Privacy Algorithm . . . 604
Heng Ouyang, Hongqin Lyu, Shigong Long, Hai Liu, and Hongfa Ding

Formalization and Verification of Group Communication CoAP Using CSP . . . 616
Sini Chen, Ran Li, and Huibiao Zhu

Author Index . . . 629

Networking and Architectures

Accelerating GPU-Based Out-of-Core Stencil Computation with On-the-Fly Compression

Jingcheng Shen(B), Yifan Wu, Masao Okita, and Fumihiko Ino

Osaka University, 565-0871 Osaka, Japan

Abstract. Stencil computation is an important class of scientific applications that can be efficiently executed by graphics processing units (GPUs). Out-of-core approaches help run large-scale stencil codes that process data with sizes larger than the limited capacity of GPU memory. Nevertheless, the performance of out-of-core approaches is always limited by the data transfer between the CPU and GPU. Many optimizations have been explored to reduce such data transfer; however, published results on the use of on-the-fly compression are insufficient. In this study, we propose a method that accelerates GPU-based out-of-core stencil computation with on-the-fly compression, introducing a novel data compression scheme that solves the data dependency between contiguous decomposed data blocks. We also modify a widely used GPU-based compression library to support pipelining that overlaps data transfer with computation. Experimental results show that the proposed method achieved a speedup of 1.2× compared with a method that involves no compression. Moreover, although the precision loss caused by compression increased with the number of time steps, it was trivial up to 4,320 time steps, demonstrating the usefulness of the proposed method.

Keywords: High performance computing · On-the-fly compression · Stencil computation · Simulation · GPGPU

1 Introduction

Stencil computation is the backbone of many scientific applications, such as geophysics simulations [4,15,16], computational electromagnetics [1], and image processing [22]. The key principle of stencil computation is to iteratively apply a fixed calculation pattern (stencil) to every element of the input datasets. This single-instruction multiple-data (SIMD) characteristic makes stencil computation a perfect candidate for acceleration on graphics processing units (GPUs). A GPU has thousands of cores, and its memory bandwidth is 5–10 times higher than that of a CPU, so it excels at accelerating both compute- and memory-intensive scientific applications [5,13,18,19]. However, because a GPU has a limited capacity of device memory (tens of GBs), it cannot directly run a large stencil code whose data size exceeds that capacity.

© Springer Nature Switzerland AG 2022. H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 3–14, 2022. https://doi.org/10.1007/978-3-030-96772-7_1

A large body of research on GPU-based out-of-core stencil computation has addressed this issue [6,9,16,20,21]. For a large dataset whose size exceeds the capacity of the device memory, out-of-core computation first decomposes the dataset into smaller blocks and then streams the blocks to and from the GPU for processing. Nevertheless, the performance of this approach is often limited by data transfer between the CPU and GPU, because the interconnects fail to keep up with the growth in GPU computation capability, as described in [19]. Data-centric strategies are thus necessary to reduce the data transfer. Studies have introduced strategies such as temporal blocking and region sharing to reuse the on-GPU data and to avoid extra data transfer [6,9,16]. Nevertheless, according to [16], the performance of out-of-core code was still limited by data transfer despite these strategies. We therefore need to further reduce the data transfer time. A potential solution is on-the-fly compression: compress the data on the GPU before transferring it back to the CPU, and decompress the data on the GPU before processing. However, studies on accelerating GPU-based out-of-core stencil computation with on-the-fly compression remain rare. According to a comprehensive review [3], studies on leveraging compression techniques in scientific applications have mainly focused on scenarios such as post-analysis and failure recovery. The scarcity of relevant research raises two research questions:

– Would the overhead of compression/decompression outweigh the reduced data transfer time?
– Would the precision loss introduced by data compression be so large that the output becomes useless?

In this study, we (1) propose a method to accelerate out-of-core stencil computation with on-the-fly compression on the GPU and (2) try to answer the two above-mentioned questions. The contribution of this work is three-fold:

– We introduce a novel approach to integrate on-the-fly lossy compression into the workflow of a 25-point stencil computation. For large datasets that are decomposed into blocks, this approach solves the data dependency between contiguous blocks and thus preserves access to the common regions between contiguous blocks after compression.
– We modify a widely used GPU-based compression library [8] to support pipelining, which is mandatory for overlapping CPU-GPU data transfer with GPU computation.
– We analyze experimental results to answer the aforementioned questions: on-the-fly compression is useful in reducing the overall execution time of out-of-core stencil computation, and the precision loss is tolerable.

The remainder of this study is organized as follows: related studies on accelerating stencil and similar scientific applications with compression techniques are introduced in Sect. 2. The background of stencil computation and the challenges in applying on-the-fly compression to it are briefly described in Sect. 3. Section 4 discusses the selection of an appropriate GPU-based compression library. The proposed method to integrate the compression processes into the workflow of out-of-core stencil computation is described in Sect. 5. In Sect. 6, experimental results are presented and analyzed. Finally, Sect. 7 concludes the present study and proposes future research directions.

2 Previous Work

Nagayasu et al. [10] proposed a decompression pipeline to accelerate out-of-core volume rendering of time-varying data. Their method was specific to RGB data, and the decompression procedure was partially performed on the CPU. Tao et al. [23] proposed a lossy checkpointing scheme that significantly improved the checkpointing performance of iterative methods with lossy compressors. Their scheme reduced the fault-tolerance overhead for iterative methods by 23%–70% and 20%–58% compared to traditional checkpointing and lossless-compressed checkpointing, respectively. Calhoun et al. [2] proposed metrics to evaluate the loss of accuracy caused by using lossy compression to reduce the snapshot data used for checkpoint restart. They improved the efficiency of checkpoint restart for partial differential equation (PDE) simulations by compressing the snapshot data, and found that this compression did not affect the overall accuracy of the simulation. Wu et al. [25] proposed a method to simulate large quantum circuits using lossy and/or lossless compression techniques adaptively. They managed to increase the simulation size by 2–16 qubits. However, their method was designed for CPU-based supercomputers, so the compression libraries cannot be used in GPU-based scenarios. Moreover, the adaptive selection between lossy and lossless compression, i.e., falling back to lossy compression when lossless compression fails, is impractical in GPU-based high-performance applications because such failures heavily impair computational performance. Jin et al. [7] proposed a method to use GPU-based lossy compression for extreme-scale cosmological simulations. Their findings show that GPU-based lossy compression can provide sufficient accuracy for the post-analysis of cosmological simulations together with high compression and decompression throughputs. Tian et al. [24] proposed Cusz, an efficient GPU-based error-bounded lossy compression framework for scientific computing. This framework reported high compression and decompression throughputs and a good compression ratio. However, according to their study, Cusz has sequential subprocedures, which prevents us from using this framework for on-the-fly compression in our work because of the overhead of shifting from GPU to CPU computation. Zhou et al. [26] designed high-performance MPI libraries with on-the-fly compression for modern GPU clusters. In their work, they reduced the inter-node communication time by compressing the messages transferred between nodes, where the message size was up to 32 MB. In contrast, our method compresses large datasets for stencil computation, more than 10 GB in size, to reduce the data transfer time between the CPU and GPU (i.e., intra-node communication time). Moreover, our method is designed for out-of-core stencil code, solving the data dependency between decomposed data blocks.

Fig. 1. Five-point stencil computation. (a): Update of an element relies on its four neighboring elements. (b): The decomposed blocks must be transferred with the halo data.

Fig. 2. Contiguous blocks can share common regions on the GPU, thus avoiding transfer of an amount of data equivalent to that of the halo areas.

3 Out-of-Core Stencil Computation

Stencil computation is an iterative computation that updates each element of the input datasets according to a fixed pattern based on the elements surrounding it. A hello-world application of stencil computation is a solver of Laplace's equation, which can describe the phenomenon of heat conduction: a five-point stencil code in which the temperature of each data point at the (t+1)-th time step is obtained by taking the average temperature of the four surrounding points at the t-th time step (Fig. 1(a)). To use out-of-core approaches that handle excess data, we decompose the original datasets into smaller blocks and stream the blocks to and from the GPU for processing. Due to the data dependency of stencil computation, when we transfer a block to the GPU, we must also piggyback the neighboring data (the "halo area") with the block (Fig. 1(b)). The size of the halo data we must transfer along with the block grows with the number of time steps for which we want to process the block on the GPU. As two contiguous blocks share common regions, a block can get common regions from its predecessor as well as provide its successor with common regions. By doing so, we can effectively reduce the amount of data transfer by the size of the halo data (Fig. 2).

On-the-Fly Compression for Large Stencil Computation

7

One challenge in integrating on-the-fly compression into the workflow of out-of-core stencil computation is that we must solve the aforementioned data dependency. Naively compressing each block not only consumes more memory space but also prevents the sharing of common regions across contiguous blocks. Therefore, a sophisticated compression strategy is necessary; it is introduced in Sect. 5.1.

4 On-the-Fly Compression

Another concern in leveraging on-the-fly compression in out-of-core stencil code is the often considerable overhead of compression and decompression. GPU-based compression libraries such as cuZFP [8], Cusz [24], and nvComp [12] report high compression and decompression speeds. The cuZFP and Cusz libraries are based on lossy compression, whereas nvComp is lossless. In this study, we used cuZFP given that it is a high-performance library whose source code is relatively easy to modify to implement the functionality we need. The library allows users to specify the number of bits used to preserve a value. For example, specifying 32 bits to preserve a double-precision floating-point (i.e., double-type) value achieves a compression ratio of 1/2. We avoided using the lossless nvComp due to concerns about the compression ratio: in our preliminary experiments, we found that the size of data compressed with nvComp could be larger than that of the original data. We therefore chose not to use nvComp in the present study, because we could not estimate the upper bound of the size of the compressed data and would have to allocate device memory every time compression happens, instead of reusing pre-allocated device buffers of fixed sizes. The reason we avoided using Cusz was explained in Sect. 2.
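The fixed-rate arithmetic mentioned above can be made explicit with two small helpers. These are illustrative only and not part of cuZFP's API; the function names are our own.

```python
def fixed_rate_ratio(rate_bits, value_bits=64):
    """Compressed/original size ratio in a fixed-rate mode, e.g. preserving
    a 64-bit double with 32 bits gives a ratio of 1/2."""
    return rate_bits / value_bits

def compressed_buffer_bytes(num_values, rate_bits):
    """A fixed rate makes the compressed size predictable, so a device
    buffer of this size can be pre-allocated once and reused."""
    return num_values * rate_bits // 8
```

This predictability is precisely what a lossless compressor cannot guarantee, which motivates the choice of fixed-rate lossy compression here.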

5 Proposed Method

In this section, we introduce our proposed method, including separate compression, which solves the data dependency between contiguous blocks and thus allows us to compress the decomposed datasets freely, and a pipelining version of cuZFP, which supports overlapping compression/decompression with CPU-GPU data transfer.

5.1 Separate Compression

As shown in Fig. 2, two contiguous blocks have common regions that are shareable. The bottom halo areas needed by the i-th block lie in the (i+1)-th block, and the top halo areas needed by the (i+1)-th block lie in the i-th block. Therefore, the common region between the two blocks consists of the top halo areas and a part of the (i+1)-th block whose size is equivalent to that of the top halo areas. If we transfer the i-th block together with its bottom halo areas, we can avoid transferring the common regions for the (i+1)-th block.
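A minimal 1-D sketch of this decomposition (our own illustration; the paper works with 3-D blocks) shows how each block is transferred with only its bottom halo piggybacked, since the rows it needs above are already resident from the preceding block:

```python
def block_ranges(n_rows, n_blocks, halo):
    """Row ranges (start, stop) for streaming a grid in n_blocks pieces;
    each block carries only its bottom halo, so consecutive ranges overlap
    by exactly `halo` rows (the shared common region)."""
    base = n_rows // n_blocks
    ranges = []
    for i in range(n_blocks):
        start = i * base
        stop = min(n_rows, (i + 1) * base + halo)  # bottom halo piggybacked
        ranges.append((start, stop))
    return ranges
```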


Fig. 3. Separate compression approach to solve data dependency between contiguous blocks. In this approach, the remainder and the common region are compressed separately for each block. As shown in (a), the i-th compressed remainder and common region are decompressed on the GPU for computation; and in (b), after computation, the remainder and common region are compressed and transferred back to CPU to update the i-th remainder and (i − 1)-th common region, respectively.

Similarly, each block only needs to be transferred with its remainder and bottom halo areas, so the two parts, i.e., the remainder and half of the common region, must be exclusively readable and writable to the corresponding contiguous blocks. Based on this observation, we propose a separate compression approach that compresses the two parts separately. As shown in Fig. 3(a), prior to computation, the i-th compressed remainder and common region are decompressed, so that the i-th block can be computed on and provides the data needed by the (i+1)-th block. As shown in Fig. 3(b), after computation, the (i+1)-th block is compressed as the (i+1)-th remainder and the i-th common region.

5.2 Pipelining cuZFP

The cuZFP library [8] is mainly designed as a standalone tool that can be seamlessly used for post-analysis and CPU-centric scientific computations. However, to use it as an on-the-fly process in out-of-core stencil computation, we had to modify the source code to support pipelining that overlaps CPU-GPU data transfer with GPU computation. Thanks to the good maintenance of the cuZFP project, we managed to add such functionality with a reasonable amount of programming effort. In pipelining cuZFP, we use three CUDA [11] streams (Fig. 4).
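An idealized view of such a three-stream pipeline, assuming equal stage times, can be sketched as a schedule. The stage names below are our own shorthand for illustration, not cuZFP or CUDA identifiers:

```python
def pipeline_schedule(n_blocks):
    """Idealized 3-stream pipeline: block b occupies stream s (one of
    h2d_decompress, compute, compress_d2h) during time slot b + s, so in
    steady state all three operations run concurrently on different blocks."""
    stages = ("h2d_decompress", "compute", "compress_d2h")
    schedule = [(b + s, s, op, b)
                for b in range(n_blocks)
                for s, op in enumerate(stages)]
    schedule.sort()
    return schedule  # (time_slot, stream, operation, block) tuples
```

For example, in slot 1 the pipeline computes on block 0 while transferring and decompressing block 1, which is the overlap Fig. 4 depicts.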


Fig. 4. Modified cuZFP that supports pipelining. Three CUDA streams are used to perform operations, overlapping CPU-GPU data transfer with GPU kernels including compression, decompression, and computation.

Table 1. Target stencil code.

No. of datasets: 4
Data type: Double
Dim. info.: (1152 + 2 × HALO)^3, HALO = 4
Entire data size: 46 GB

Table 2. Testbed for experiments.

GPU: NVIDIA Tesla V100-PCIe
Device memory: 32 GB
CPU: Xeon Silver 4110
Host memory: 500 GB
OS: Ubuntu 16.04.6
CUDA: 10.1
cuZFP: 0.5.5

6 Experimental Results

In this section, we analyze the experimental results to evaluate the benefits of using on-the-fly compression in out-of-core stencil computation on a GPU. The stencil code we used is an acoustic wave propagator from a previous work [16]. The code is a 25-point stencil code that has two read-write datasets, a write-only dataset, and a read-only dataset. The two read-write datasets store the updated elements that need to be transferred to and from the GPU. The write-only dataset stores intermediate results at run time and does not need to be transferred at all. The read-only dataset contains constant values that must be referenced at run time and thus needs to be transferred to the GPU. The values are of double-type because it is preferable to the single-precision floating-point format (i.e., float-type) in iterative scientific applications. According to a previous work [17], the CPU version of a code using float-type data leads to outputs different from those of the GPU version, and such divergence becomes a more severe problem as the total number of iterations increases. On the other hand, when using double-type, the results of the CPU and GPU versions of the same code were consistent. Table 1 shows the details of the datasets used by the stencil code.


Fig. 5. Performance of the four stencil codes.

Moreover, we used four codes in our experiments to evaluate the performance and precision loss:

1. The original stencil code.
2. The stencil code with one read-write dataset compressed using a 32/64 rate (i.e., using 32 bits to preserve each double value).
3. The stencil code with the read-only dataset compressed using a 32/64 rate.
4. The stencil code with one read-write dataset and the read-only dataset compressed using a 24/64 rate.

Note that we used 24 bits to preserve each double value to reduce memory usage, in conformity with the limited device memory capacity. The configuration for running the stencil codes is the one described in [16], where the number of divisions is 8 and the number of temporal blocking time steps is 12. Accordingly, we divide the data into 8 blocks, and when a block is transferred to the GPU, it is computed on for 12 time steps before being transferred back to the CPU. For the total time steps, we used numbers from 480 to 4,320 with an increment of 480. For the specifications of the testbed used in all experiments, see Table 2.

6.1 Evaluation of Performance Benefits
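Under the configuration described above (8 blocks, 12 temporal-blocking time steps per transfer), the number of CPU-GPU block round trips for a run can be sketched as follows. This is an illustrative helper of ours, not the authors' code:

```python
def block_round_trips(total_steps, steps_per_transfer=12, n_blocks=8):
    """Each of the n_blocks blocks is computed on for steps_per_transfer
    time steps per CPU-GPU round trip, so a run of total_steps implies
    n_blocks * (total_steps / steps_per_transfer) block transfers."""
    assert total_steps % steps_per_transfer == 0
    return n_blocks * (total_steps // steps_per_transfer)
```

At 480 total steps this is already 320 block transfers, which is why shrinking each transfer via compression pays off.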

As shown in Fig. 5, the three codes using on-the-fly compression ran faster than the original code. The code compressing one of the read-write datasets and the read-only dataset outperformed the others, running 1.20× as fast as the original code. The code compressing the read-only dataset and the code compressing one of the read-write datasets achieved speedups of 1.18× and 1.16×, respectively. Based on these results, our proposed method is beneficial for GPU-based out-of-core stencil computation in terms of performance. A detailed analysis of the achieved performance improvement is given in the next section.


Fig. 6. Breakdown of the execution time of the four GPU-based codes that ran for 12 time steps. The execution time of a CPU-based code was measured to show the performance benefits of using GPU acceleration. Note that the bounding operation time for the fourth GPU-based code was the GPU computation time (bars in the middle), whereas the bounding operation time for the other three GPU-based codes was the CPU-to-GPU data transfer time (dark green bars). (Color figure online)

6.2 Detailed Analysis of Achieved Performance Improvement

In this experiment, we ran the four GPU-based codes individually for 12 time steps and profiled the breakdown of the execution time. We also ran a CPU-based code for 12 time steps to show the performance advantage of the GPU-based codes over the CPU-based code. The CPU-based code was parallelized with OpenMP [14] and executed with 40 CPU threads. As shown in Fig. 6, the three codes using compression reduced the CPU-to-GPU transfer time (dark green bars) that limited the overall performance. The most interesting finding is that the fourth GPU-based code shifted from being data-transfer-bound to computation-bound compared to the former three GPU-based codes, which is favorable because it theoretically means that the data transfer time can be fully hidden by the computation time. Moreover, although the code compressing the read-only dataset did not reduce the GPU-to-CPU data transfer time, neither did it involve relatively significant compression time (dark purple bars). Therefore, the code compressing the read-only dataset slightly outperformed the code compressing one of the read-write datasets. Nevertheless, the gaps between the overall execution time and the bounding operation time (i.e., the longest bar) of the three codes with compression are larger than that of the original GPU-based code. This suggests that the compression and/or decompression involved some unidentified overheads that compromised the efficiency of overlapping data transfer with GPU computation; otherwise, the overall execution time should have been closer to the bounding operation time. Therefore, more sophisticated measures to orchestrate the pipelining could achieve further improvement, providing a direction for future work.


Fig. 7. Change in precision loss as total time steps increase.

6.3 Evaluation of Precision Loss

Besides showing performance benefits, it is crucial to demonstrate that the compression involves no significant precision loss. After completing the total time steps, we sampled 115,200 points (i.e., 100 points per plane) and compared the point values of the three codes using compression with those of the original code to calculate the average point-wise relative errors (Fig. 7). Although the relative errors increased with the total time steps, they were still far from significant at 4,320 time steps. The code compressing the read-only dataset had the lowest precision loss because the read-only dataset does not need to be compressed repeatedly. The code compressing one of the read-write datasets and the read-only dataset using a 24/64 rate resulted in the largest precision loss due to the fewer bits used to preserve the double values. Nevertheless, the code is useful because the relative error was trivial (between 10^-6 and 10^-7). Given this, the proposed method will not lead to intolerable precision loss, at least for a moderate number of time steps.
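The average point-wise relative error used above can be computed as follows. This is a sketch of the standard metric under our own naming; the exact sampling procedure is the one described in the text:

```python
import numpy as np

def mean_relative_error(reference, candidate, eps=1e-30):
    """Average point-wise relative error between sampled outputs of the
    original code (reference) and a code using on-the-fly compression
    (candidate); eps guards against division by zero."""
    reference = np.asarray(reference, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    return float(np.mean(np.abs(candidate - reference) /
                         (np.abs(reference) + eps)))
```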

7 Conclusions and Future Work

In this study, we introduced a method to accelerate GPU-based out-of-core stencil computation with on-the-fly compression. To realize the method, we proposed a novel approach to compress the decomposed data, solving the data dependency between contiguous blocks. We also modified the cuZFP library [8] to support pipelining for overlapping data transfer with GPU computation. Experimental results show that the proposed method achieved a speedup of 1.2× at the expense of a trivial precision loss, i.e., an average point-wise relative error between 10^-6 and 10^-7. The results answer the two research questions mentioned in Sect. 1. First, the reduction of CPU-GPU data transfer time achieved by using on-the-fly compression outweighs the overhead of compression/decompression, improving the overall performance of GPU-based out-of-core stencil computation. Second, the on-the-fly compression does not cause severe precision loss for thousands of time steps. Future work includes (1) comparing other on-the-fly compression algorithms to cuZFP and (2) orchestrating the pipelining for better efficiency in overlapping data transfer with GPU computation.

Acknowledgment. This study was supported in part by the Japan Society for the Promotion of Science KAKENHI under grant 20K21794 and the "Program for Leading Graduate Schools" of the Ministry of Education, Culture, Sports, Science, and Technology, Japan.

References

1. Adams, S., Payne, J., Boppana, R.: Finite difference time domain (FDTD) simulations using graphics processors. In: 2007 DoD High Performance Computing Modernization Program Users Group Conference, pp. 334–338. IEEE (2007)
2. Calhoun, J., Cappello, F., Olson, L.N., Snir, M., Gropp, W.D.: Exploring the feasibility of lossy compression for PDE simulations. Int. J. High Perf. Comput. Appl. 33(2), 397–410 (2019)
3. Cappello, F., Di, S., Gok, A.M.: Fulfilling the promises of lossy compression for scientific applications. In: Nichols, J., Verastegui, B., Maccabe, A.B., Hernandez, O., Parete-Koon, S., Ahearn, T. (eds.) SMC 2020. CCIS, vol. 1315, pp. 99–116. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-63393-6_7
4. Farres, A., Rosas, C., Hanzich, M., Jordà, M., Peña, A.: Performance evaluation of fully anisotropic elastic wave propagation on NVIDIA Volta GPUs. In: 81st EAGE Conference and Exhibition 2019, vol. 2019, pp. 1–5. European Association of Geoscientists & Engineers (2019)
5. Ikeda, K., Ino, F., Hagihara, K.: Efficient acceleration of mutual information computation for nonrigid registration using CUDA. IEEE J. Biomed. Health Inf. 18(3), 956–968 (2014)
6. Jin, G., Lin, J., Endo, T.: Efficient utilization of memory hierarchy to enable the computation on bigger domains for stencil computation in CPU-GPU based systems. In: 2014 International Conference on High Performance Computing and Applications (ICHPCA), pp. 1–6. IEEE (2014)
7. Jin, S., et al.: Understanding GPU-based lossy compression for extreme-scale cosmological simulations. In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 105–115. IEEE (2020)
8. Lindstrom, P.: Fixed-rate compressed floating-point arrays. IEEE Trans. Vis. Comput. Graph. 20(12), 2674–2683 (2014)
9. Miki, N., Ino, F., Hagihara, K.: PACC: a directive-based programming framework for out-of-core stencil computation on accelerators. Int. J. High Perf. Comput. Netw. 13(1), 19–34 (2019)
10. Nagayasu, D., Ino, F., Hagihara, K.: A decompression pipeline for accelerating out-of-core volume rendering of time-varying data. Comput. Graph. 32(3), 350–362 (2008)
11. NVIDIA Corporation: CUDA C++ Programming Guide v11.4 (2021)
12. NVIDIA Developer: nvComp: High Speed Data Compression Using NVIDIA GPUs (2021)
13. Okuyama, T., et al.: Accelerating ODE-based simulation of general and heterogeneous biophysical models using a GPU. IEEE Trans. Parallel Distrib. Syst. 25(8), 1966–1975 (2013)
14. Van der Pas, R., Stotzer, E., Terboven, C.: Using OpenMP - The Next Step: Affinity, Accelerators, Tasking, and SIMD. MIT Press, Cambridge (2017)
15. Serpa, M.S., et al.: Strategies to improve the performance of a geophysics model for different manycore systems. In: 2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), pp. 49–54. IEEE (2017)
16. Shen, J., Ino, F., Farrés, A., Hanzich, M.: A data-centric directive-based framework to accelerate out-of-core stencil computation on a GPU. IEICE Trans. Inf. Syst. 103(12), 2421–2434 (2020)
17. Shen, J., Mei, J., Walldén, M., Ino, F.: Integrating GPU support for FreeSurfer with OpenACC. In: 2020 IEEE 6th International Conference on Computer and Communications (ICCC), pp. 1622–1628. IEEE (2020)
18. Shen, J., Shigeoka, K., Ino, F., Hagihara, K.: An out-of-core branch and bound method for solving the 0-1 knapsack problem on a GPU. In: Ibrahim, S., Choo, K.-K.R., Yan, Z., Pedrycz, W. (eds.) ICA3PP 2017. LNCS, vol. 10393, pp. 254–267. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-65482-9_17
19. Shen, J., Shigeoka, K., Ino, F., Hagihara, K.: GPU-based branch-and-bound method to solve large 0-1 knapsack problems with data-centric strategies. Concurr. Comput. Pract. Exp. 31(4), e4954 (2019)
20. Shimokawabe, T., Endo, T., Onodera, N., Aoki, T.: A stencil framework to realize large-scale computations beyond device memory capacity on GPU supercomputers. In: 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp. 525–529. IEEE (2017)
21. Sourouri, M., Baden, S.B., Cai, X.: Panda: a compiler framework for concurrent CPU+GPU execution of 3D stencil computations on GPU-accelerated supercomputers. In: Int. J. Parallel Program. 45(3), 711–729 (2017)
22. Tabik, S., Peemen, M., Romero, L.F.: A tuning approach for iterative multiple 3D stencil pipeline on GPUs: anisotropic nonlinear diffusion algorithm as case study. J. Supercomput. 74(4), 1580–1608 (2018)
23. Tao, D., Di, S., Liang, X., Chen, Z., Cappello, F.: Improving performance of iterative methods by lossy checkpointing. In: Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, pp. 52–65 (2018)
24. Tian, J., et al.: Cusz: an efficient GPU-based error-bounded lossy compression framework for scientific data. arXiv preprint arXiv:2007.09625 (2020)
25. Wu, X.C., et al.: Full-state quantum circuit simulation by using data compression. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–24 (2019)
26. Zhou, Q., et al.: Designing high-performance MPI libraries with on-the-fly compression for modern GPU clusters. In: 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 444–453. IEEE (2021)

Routing with Ant Colony Optimization in Wireless Mesh Networks

Jiadong Peng1, Zhanmao Cao1(B), and Qisong Huang2

1 South China Normal University, Tianhe District, Guangzhou 510631, China
2 Agricultural Bank of China, Guangdong Branch, Tianhe District, Guangzhou 510623, China

Abstract. Multiple-radio multiple-channel wireless mesh networks (MRMC WMNs) are well suited as wireless backbone networks for ubiquitous Internet access. It is quite a challenge to satisfy multiple traffic requests from multiple source-destination pairs with different data transmission requirements. Multiple-pair traffic flows may cause heavy conflicts due to the nature of the wireless medium. To make nearly full use of the limited resources, we design a routing algorithm based on ant colony optimization, in which pheromone guides the discovery of primary paths. Various simulations show the efficiency of the algorithm.

Keywords: Routing · Ant colony optimization · Wireless mesh networks · Link interference

© Springer Nature Switzerland AG 2022. H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 15–26, 2022. https://doi.org/10.1007/978-3-030-96772-7_2

1 Introduction

Multi-radio multi-channel wireless mesh networks (MRMC WMNs) have become a promising solution to provide convenient and ubiquitous broadband access to the Internet, while aiming to provide ubiquitous information services [1]. WMNs can offer high levels of service and wide coverage, while their deployment is relatively inexpensive [2]. Different from traditional wireless networks, a WMN is a dynamic self-organizing and self-configuring network [3]. In other words, each node of a mesh network automatically creates and maintains its network connections. The special features of WMNs also include high reliability and easy Internet access for mobile devices. Compared with traditional wireless networks, MRMC WMNs provide higher capacity, but ant colony methods are rarely used to address the multiple-flow problem in MRMC WMNs. It is a common traffic mode for multiple users to transmit data at the same time, and it is a challenge to satisfy multiple traffic flows from different source-destination pairs. We use ant colony optimization to find a near-optimal routing and scheduling scheme for such simultaneous traffic flows.

Interference is an inherent characteristic of wireless links and decreases performance significantly. Nearby nodes and links sharing the same channel cause heavy interference, which degrades network performance and wastes wireless network resources. Thus, interference-free channel assignment is also critical [4]. To get rid of interference, we need to design efficient routing and channel assignment schemes for real applications. If we can use ant colony optimization to find an effective scheme automatically, it will make sense for deploying various MRMC WMNs.

For full usage of the limited network resources, reducing link interference is a key issue in improving network performance. Optimizing multiple concurrent traffic flows is a challenging problem arising from multi-pair requests, which are a common phenomenon of data streams and transmission requests [5]. Each data flow should have a path to forward data packets hop by hop. An uncooperative scheduling of multiple flows may result in unbalanced load and even serious interference; a simple consequence is that transmission tasks are not completed in time [6]. Shortest-path routing, which is simply based on hop count, cannot achieve better network performance [7]. Hence, we need to consider critical factors such as the topology, the radio interfaces, and the channels. Although ant colony algorithms have been explored for sensor networks and even for ad hoc networks, there are still few related conclusions that address the multiple-flow problem of MRMC WMNs.

The main contribution of this paper is an optimized routing algorithm based on ant colony optimization, which aims at effective use of network resources and improved transmission performance. In order to find more multiple-pair active paths over more interference-free links on independent orthogonal channels, a pheromone-based algorithm is used to create optimal routing in WMNs. Through the regulation of pheromone, we connect the characteristics of MRMC WMNs to produce a better scheme, toward concurrent transmission free of channel interference. The rest of the paper is organized as follows.
Section 2 gives a survey of related work. Section 3 designs a routing algorithm. Section 4 evaluates the performance of our algorithm. Section 5 is a short conclusion.

2 Related Work

In WMNs, routing multiple flows is quite complex because more constraints have to be considered: optimization, scheduling, routing, channel allocation, and interference avoidance. For this combinatorial problem, even with only one aspect involved, it is hard to get an exact optimal solution. For example, to schedule multiple paths, we first need to give the channel assignment for real-time data flows. However, the CA problem is NP-complete, because it can be reduced to the 3-partition problem [8]. The problem of performing routing to achieve maximum utilization of network resources is also NP-complete. Various solutions from different angles have been proposed. For example, a distributed multi-flow opportunistic routing algorithm combining candidate node selection and rate allocation was proposed by He et al. [9]. Chu et al. reported a distributed algorithm to minimize the maximum channel congestion and solve the routing problem of multiple concurrent flows based on MIMO [10]. They focused on the traffic load, but other factors were not involved.


Qiao et al. propose a loosely joint cooperative routing and channel allocation algorithm to promote network throughput effectively [11]. Bezzina et al. propose an interference-aware routing metric that considers intra-flow and inter-flow interference as well as link rate [12]. Yan et al. propose a cross-layer joint channel allocation and routing algorithm, which greedily selects the channel with the least link interference in the channel allocation phase [13]. However, there is little discussion of multiple concurrent flows.

For ant colony algorithms, there is a lot of research on sensor networks and other simple cases of radios and channels. For example, an energy consumption optimization algorithm based on the ant colony algorithm was proposed for wireless sensor networks by Li [14]. These conclusions for wireless sensor networks are not suitable for our MRMC WMNs, as most of the research does not consider multi-channel, multi-radio operation and link interference. Even though Amudhavel et al. introduced a recursive scheme of ant colony optimization in WMNs [15], subdividing the large routing problem into smaller ones and achieving some results, their algorithm does not make good use of the advantages of MRMC. Few reported algorithms address the multiple concurrent flow problem, and the ant colony algorithms for sensor networks pay no attention to link interference avoidance. In addition, most similar research focuses on already-deployed networks, while our algorithm can play a role in precomputing for network deployment.

As discussed above, the multiple-pair concurrent paths problem in MRMC WMNs is challenging and not thoroughly studied. In the aspect of routing, most existing solutions to the concurrency problem need all the information of the network, such as the topology, the radio interfaces, and the available channels, which should be collected in advance. In this paper, we tackle the optimization problem of maximizing the utilization of wireless network resources. Inspired by these studies, we propose a routing algorithm based on ant colony optimization that uses resources efficiently in order to improve network performance.

3 Routing Algorithm

The Cartesian Product of Graphs (CPG) model is useful to reduce the CA complexity for the path selection criteria under the condition of multiple concurrent flows [16]. It divides the network topology into different virtual layers according to the number of channels, as in Fig. 1. When one link of a neighbor pair is working over a certain channel layer, its other links over other layers can also work concurrently. This intuitively helps us deal with channel conflicts. To facilitate a rigorous formulation of the problem, we provide some symbols for both the model and the algorithm in Table 1.

For routing the multiple-pair traffic, we need to combine channel allocation and scheduling. For each request, the routing algorithm should search for a path from the source node to the destination node; to facilitate expression, a path can be represented by a sequential node sequence. Our routing algorithm takes the channel allocation information and resource information into account in the process of routing, so that the subsequent channel allocation and scheduling can make full use of the limited resources. If we treat routing, channel allocation, and scheduling each on its own, it is hard to reach an optimal scheme.

Table 1. List of the notations

(si, di): The i-th source-destination node pair
l(i,j): The potential link between neighbors i and j
Pijk: The probability of the k-th ant passing l(i,j)
Ak: The node set that the k-th ant can reach
τij(t): The pheromone on l(i,j) at iteration t
ηij(t): The influence factor on l(i,j) at iteration t
ρ: The residue coefficient of pheromone
C: The available channel set
R: The available radio interfaces
|C|: The number of available channels
|R|: The number of available interfaces
m: The number of ants for each (si, di)

According to the number of orthogonal channels, the CPG model maps the MRMC mesh into virtual channel layers [17]. Each channel layer has the same topology, as shown in Fig. 1. A link can only transmit over one available channel. Multiple links can coexist to forward packets if and only if they meet the following interference-free conditions: the distance between any two senders is not less than two hops; the distance between any two receivers is not less than one hop; and the distance between a sender and a receiver of different links is more than one hop. The colored links in Fig. 1 can coexist in each channel.
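The three coexistence conditions can be checked mechanically. The following sketch is our own illustration with hypothetical helper names; it tests whether a set of (sender, receiver) links is interference-free on one channel:

```python
from collections import deque
from itertools import combinations

def hop_distance(adj, a, b):
    """BFS hop distance between nodes a and b in an unweighted graph."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")

def interference_free(adj, links):
    """True if all (sender, receiver) links can coexist on one channel:
    senders >= 2 hops apart, receivers >= 1 hop apart, and each sender
    more than 1 hop from the other link's receiver."""
    for (s1, r1), (s2, r2) in combinations(links, 2):
        if hop_distance(adj, s1, s2) < 2:
            return False
        if hop_distance(adj, r1, r2) < 1:
            return False
        if hop_distance(adj, s1, r2) <= 1 or hop_distance(adj, s2, r1) <= 1:
            return False
    return True
```

For example, on a chain 0-1-2-3-4-5, links (0,1) and (3,4) can coexist, while (0,1) and (2,3) cannot, because sender 2 is only one hop from receiver 1.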

Fig. 1. A mesh. (Color figure online)

For the problem of multiple concurrent traffic flows, it is not suitable to consider only the shortest path [18]. Because of interference and resource allocation, considering only the shortest path can easily lead to overload, which makes network performance worse. Our algorithm therefore does not simply find the shortest path. We illustrate this problem with a simple example based on Fig. 1, shown in Fig. 2.


We layer the topology by the number of channels, and each layer represents the usage of one channel. For example, this simple topology has three orthogonal channels c1, c2, and c3, so we divide it into three layers. When a link is working in a certain layer, other links near it become unusable; we can calculate which links will conflict according to the interference conditions mentioned above. Suppose a traffic request of node pair (1,9) gets its turn to transmit. Let l(i,j) denote the potential link along a path from node i to node j. Figure 2(a) shows one of the shortest paths, which contains three potential links, l(1,6), l(6,7), and l(7,9). These links can be scheduled simultaneously over channels c1, c2, and c3, respectively. If there is only one path, this choice is fine. However, for multiple-pair requests, as we need to deal with many concurrent requests, it may cause serious interference. For example, if l(7,3) is working over channel c1, it interferes with l(1,6) in Fig. 2(a). We may then need to choose another path to avoid congestion. In that case, the path in Fig. 2(b) may reduce conflicts and get better performance; it also consists of three links, scheduled over channels c1, c2, and c3. Sometimes the ant chooses a longer new path, which may nevertheless lead to better performance.

Fig. 2. Paths for source-destination pair of (1,9). (Color figure online)

The path with better performance should be the one chosen via pheromone in the ant colony algorithm, in order to improve network performance. Ant colony optimization is an algorithm that mimics real ant colony behavior. When searching for food, ants leave pheromones on the path, and other ants choose paths according to the pheromone concentrations. As pheromones evaporate over time, they accumulate rapidly on shorter paths, so after repeated iterations a shortest path is found. When there are multiple paths, ants spread in these directions with equal probability at the beginning; after some iterations, due to the accumulation of pheromones, ants tend to choose the shorter path. As the simple shortest path cannot avoid conflicts, our algorithm focuses on designing the pheromones. The pheromone in our algorithm also evaporates over time, but it accumulates on links with surplus resources. When there are multiple paths, ants again spread in these directions with equal probability at the beginning; later, due to the accumulation of pheromones, ants tend to choose a more optimized path rather than the shortest one. The pheromone formula is defined in detail as (3).
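A minimal sketch of the evaporation-plus-deposit update common to ACO is shown below; note that the paper's actual pheromone formula (3) additionally accumulates pheromone on links with surplus resources, which this generic sketch does not model. ρ is the residue coefficient from Table 1.

```python
def update_pheromone(tau, rho, deposits):
    """Generic ACO pheromone update: every link keeps a fraction rho of its
    pheromone (the rest evaporates), then links used by completed paths
    receive new deposits."""
    new_tau = {link: rho * value for link, value in tau.items()}
    for link, amount in deposits.items():
        new_tau[link] = new_tau.get(link, 0.0) + amount
    return new_tau
```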

20

J. Peng et al.

Our algorithm is executed on a host to find the routing scheme for given multiple pairs (si, di), i = 1, 2, ..., k. The paths are found on demand the first time; the result then serves as a proactive solution if the same multiple pairs emerge again in the future. Our ant colony optimization algorithm proceeds as follows:

1. Each link in the network topology is given the same initial pheromone value, in order to reduce the impact caused by ants at the beginning of the algorithm.
2. One source node generates m ants, which explore the path from the source node.
3. Each ant selects the next-hop node according to the transfer formula until it reaches the destination node or exceeds the maximum hop count.
4. When all m ants complete a path search from the source node to the destination node, the pheromone values are updated.
5. Check whether the iteration is finished. If the paths converge, the iteration is finished and we obtain an available path. Otherwise, repeat steps 2 to 5.
6. Repeat the above steps to select paths for each request.

Algorithm 1. ACO routing algorithm
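The six steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the helper names, the convergence test, and the pheromone reward are assumptions, and the full transfer formula and resource-aware pheromone update described later are simplified here to a pheromone-weighted random choice.

```python
import random

def aco_route(topology, src, dst, m, max_hops, max_iters, init_tau=20.0):
    """Find a path from src to dst with a simplified ant colony search.

    topology: dict mapping node -> list of neighbor nodes.
    m: number of ants per iteration (e.g. 5x the shortest hop count).
    """
    # Step 1: every link starts with the same pheromone value.
    tau = {(u, v): init_tau for u in topology for v in topology[u]}

    best = None
    for _ in range(max_iters):
        paths = []
        # Step 2: the source node generates m ants.
        for _ in range(m):
            path = [src]
            # Step 3: each ant moves hop by hop until it reaches dst
            # or exceeds the maximum hop count.
            while path[-1] != dst and len(path) <= max_hops:
                cands = [v for v in topology[path[-1]] if v not in path]
                if not cands:
                    break  # dead end: discard this ant
                weights = [tau[(path[-1], v)] for v in cands]
                path.append(random.choices(cands, weights=weights)[0])
            if path[-1] == dst:
                paths.append(path)
        # Step 4: update pheromones once all m ants have finished.
        for key in tau:
            tau[key] *= 0.8                      # evaporation, rho = 0.2
        for p in paths:
            for u, v in zip(p, p[1:]):
                tau[(u, v)] += 1.0 / len(p)      # reward shorter paths
        # Step 5: stop when the ants converge on a single path.
        if paths and all(p == paths[0] for p in paths):
            return paths[0]
        if paths:
            best = min(paths, key=len)
    return best
```

Step 6 would simply call `aco_route` once per source-destination request.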

Routing with Ant Colony Optimization in Wireless Mesh Networks

21

The above contents are the steps of the algorithm and the pseudocode of our algorithm. Some parts need to be described in more detail. In step 1, we set an initial pheromone value. This value needs to be set according to the size of the topology; a suitable initial value makes the algorithm converge faster. For the 64-node topology shown in Fig. 3 and used in this paper, the initial value is set to 20. The number of ants m used in each iteration, mentioned in step 2, is set according to the shortest-hop path between the pair; in this paper, we use 5 times the number of shortest-path hops. When an ant selects a link, the algorithm modifies the resource data of the related nodes, which affects the selection of other ants serving other pairs. When an ant is at an intermediate node, its next hop is determined by the transfer formula, so the formula is an important part of our algorithm. The state transfer formula determines the rules that the ant colony follows when moving from the current state to the next, and the rationality of its parameters affects the quality of path selection. The formula of the transition probability P_{ij}^k is as follows:

P_{ij}^k(t) =
\begin{cases}
\dfrac{\tau_{ij}(t)\,\eta_{ij}(t)}{\sum_{s \in A_k} \tau_{is}(t)\,\eta_{is}(t)}, & j \in A_k \\
0, & \text{otherwise}
\end{cases}    (1)

where τ_ij(t) is defined as the value of pheromones on link l(i,j) at iteration t, and A_k is the set of nodes that the k-th ant can reach from node i in one hop. The transition formula in the routing algorithm is used as the basis for ants to select the next-hop node. For each node that the ant may reach in the next hop, the probability is computed through the formula. We normalize the formula so that the probabilities of the candidate next-hop nodes sum to 1. Next, we describe each parameter in (1):

\eta_{ij}(t) = \frac{1}{|C| * |R| + 1}    (2)

η_ij(t) is the value of the resource surplus of l(i,j). It is calculated from the number of available channels and the number of available interfaces. After an ant completes its path, the pheromone of the path is updated. Following the ant colony algorithm, the pheromone on a link is defined by (3):

\tau_{ij}(t+1) = \tau_{ij}(t)\,(1 - \rho) + \Delta\tau_{ij}, \quad 0 < \rho < 1    (3)

where ρ denotes the residue coefficient of the pheromone and t represents the iteration number. In the simulation, ρ is set to 0.2; that is, 20% of the pheromone is dispersed each time the pheromone is updated. Δτ_ij is the sum of the pheromones released by all ants walking through l(i,j). We define the increment of pheromone by (4):

\Delta\tau_{ij} = \sum_{k=1}^{m} \Delta\tau_{ij}^{k}    (4)

where Δτ_ij^k denotes the pheromone released by the k-th ant on l(i,j). Our algorithm is adapted from the ant colony algorithm and applied to wireless mesh networks, aiming to maximize the utilization of network resources and reduce the link interference between paths under multiple concurrent requests. For the problem of multiple concurrent paths, the algorithm can produce a path scheme without collecting global information.
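As an illustration, formulas (1), (3), and (4) might be implemented as below. This is a sketch under assumptions: `eta` is supplied as a precomputed per-link table, and each ant deposits a fixed amount `deposit` on every link of its path, since the exact form of Δτ_ij^k is not specified in this excerpt.

```python
def transition_probs(i, candidates, tau, eta):
    """Formula (1): probability of moving from node i to each j in A_k.

    tau, eta: dicts mapping link (i, j) -> pheromone / resource-surplus value.
    """
    weights = {j: tau[(i, j)] * eta[(i, j)] for j in candidates}
    total = sum(weights.values())
    # Normalized so the probabilities of all candidate next hops sum to 1.
    return {j: w / total for j, w in weights.items()}

def update_pheromone(tau, ant_paths, rho=0.2, deposit=1.0):
    """Formulas (3) and (4): evaporation plus the deposits of all m ants."""
    delta = {}                                   # accumulated delta_tau per link
    for path in ant_paths:
        for link in zip(path, path[1:]):
            delta[link] = delta.get(link, 0.0) + deposit
    for link in tau:
        tau[link] = tau[link] * (1 - rho) + delta.get(link, 0.0)
    return tau
```

With ρ = 0.2, a link with pheromone 10.0 crossed by one ant ends the update at 10.0 × 0.8 + 1.0 = 9.0.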


Fig. 3. Mesh topology

4 Performance Evaluation

The simulations are carried out under various network resource combinations and traffic requests. We evaluate the performance of our ACO algorithm in terms of maximum throughput, and choose the Dijkstra algorithm (DA) and the joint optimal scheduling scheme (COSS) for comparison. The parameters and values are as follows. The resource combinations are virtually deployed with the number of available interfaces chosen from {4, 8, 12, 16, 20} and the number of interference-free channels chosen from {8, 16, 32}. The time duration is set to 5 ms, the packet size to 1 MB, and each link capacity to 200 MB/s. In general, the computing time depends on both the topology and the number of pairs; the topology matters in two aspects, its size and its particular local distribution. On the topology used in this paper, a converged solution is obtained after about 200 iterations. To evaluate the performance of our algorithm, we conduct the simulation with a random 64-node mesh topology, as shown in Fig. 3.

We first examine the tendency of the maximum throughput under different combinations of radios and channels. By changing the number of radios and channels, we analyze the performance of ACO as shown in Fig. 4. When |C| = 8, the maximum throughput of the network does not increase significantly with the number of radios, which indicates that the dominant factor limiting the maximum throughput in this case is the number of channels. When |C| = 16, the maximum throughput increases rapidly with the number of radios, which shows that the dominant factor limiting the maximum throughput is the number of radios; once the number of radios is sufficient, the dominant limiting factor again becomes the number of channels. When |C| = 32, the maximum throughput increases with the number of radios; the number of available channels is sufficient and no longer significantly limits the maximum throughput. From Fig. 4 we can see that our algorithm makes full use of network resources: as long as there are still available resources, its performance improves steadily.

Fig. 4. The maximum throughputs for the combinations of the number of radios and that of channels (throughput in MB/s versus the number of radios, for |C| = 8, 16, and 32)

Network resource combinations are set as |C| = 8, |R| = 8; |C| = 16, |R| = 12; and |C| = 32, |R| = 16 to evaluate the ACO algorithm under various traffic requests. When the number of source-destination pairs varies from 20 to 200, the maximum throughput improves slowly for a given resource combination, but it shows significant jumps compared to lower resource deployments, as shown in Fig. 5. The more source-destination pairs, the larger the maximum throughput, and the increasing tendency is greater with more plentiful resources. This is easy to explain: more available network resources mean more accessible network capacity, so more compatible paths can be scheduled over the WMN in a time slot. At the same time, the maximum throughput of the network tends to a stable value. This is because the number of source-destination pairs has exceeded the network capacity and the number of compatible paths in the same time slot has reached its peak, so the maximum throughput of the network no longer improves significantly.

Fig. 5. The maximum throughputs with different numbers of source-destination pairs (throughput in MB/s, for |R| = 8 ∧ |C| = 8, |R| = 12 ∧ |C| = 16, and |R| = 16 ∧ |C| = 32)


Given different network resource combinations, Fig. 6 shows that the maximum throughput of the ACO algorithm is better than that of DA. With the increase of traffic requests, our algorithm gradually exceeds COSS. Our algorithm can surpass the existing algorithms without needing global network information; this saves considerable resources, because our algorithm is a blind-search one. To evaluate the efficiency of ACO, we compare it with DA and COSS on maximum throughput. As a single-source shortest-path algorithm, DA considers neither the available resources nor the mesh property, which leads to the overload of some nodes. COSS is a combinatorial optimization algorithm; it uses heuristic methods to find many compatible paths and realize the combinatorial optimization of compatible paths. When there are a large number of traffic requests, DA causes local overload and reduced performance, and the performance of COSS is also worse than ours in this case. Moreover, as network resources increase, the throughput performance of the ACO algorithm keeps improving. In the routing stage, the path selection criteria of the ACO algorithm intelligently select the path with the maximum available resources under the current network status, and at the same time effectively reduce the overlaps between multiple paths. With this resource awareness, ACO realizes node load balancing and improves the network performance. To achieve simultaneous multi-path optimization, we proposed an ACO algorithm for wireless mesh networks in which each path avoids interference. It can improve the performance of the network and balance the load of nodes and channels. Simulation results show that the ACO algorithm achieves better performance under different network resource combinations and various traffic requests. Through the throughput performance comparison, we can see that it is slightly better than the DA and COSS algorithms even with fewer resources.

Fig. 6. Maximum throughput comparison (throughput in MB/s versus the number of source-destination pairs, for ACO, DA, and COSS)

5 Conclusion

This work mainly focuses on routing in WMNs and evaluates the performance by simulations. Paths are built upon the pheromone. The routing algorithm is based


on ant colony optimization. The performance of the algorithm is verified via various simulations, which show its efficiency. In the research on routing algorithm optimization, we found that some points remain to be tackled in the future, such as how to reuse the solutions for already-known traffic modes. We want to achieve continuous optimization of the path scheme: after obtaining the corresponding path scheme through the algorithm, we carry out the experimental calculation, and if we find that the effect is not good, we adjust the path selection again. It would be better to take assistance from the last ant colony optimization to support the new optimization. We may also introduce a localized scheme to improve the algorithm in a distributed way.


A Light-Weight Scheme for Detecting Component Structure of Network Traffic

Zihui Wu, Yi Xie(B), and Ziyang Wu

School of Computer Science and Engineering, Guangdong Key Laboratory of Information Security, Sun Yat-sen University, 510275 Guangzhou, China
[emailprotected]

Abstract. The rapid development of network services not only expands the scale of Internet traffic but also diversifies its types. In this work, we design a light-weight compromise scheme to meet the management requirements of large-scale, business-sensitive scenarios. The proposed scheme regards the mixed traffic as a whole and directly analyzes its component structure. It converts the structural and attribute features into a traffic profile by encoding, embedding, and mapping. The traffic profile is then used to infer the component structure with a CNN. The proposed scheme does not need to perform flow-by-flow classification; it is not limited to the "quantity" balance of traffic, but also considers the types of traffic on each link. Experiments with a real dataset show that the proposed scheme can infer the component structure of mixed traffic quickly and accurately.

Keywords: Component structure · Proportion analysis · Traffic profile

1 Background

The rapid development of network services has brought huge network traffic with different requirements to the Internet, which poses new challenges to the network. First, new devices and services bring massive traffic that needs to be transmitted through the Internet. Second, the increase of service types and the access of heterogeneous devices lead to the complexity of network traffic. The "best effort" service provided by traditional TCP/IP cannot meet the diversified and customized requirements of different business flows [1]. Existing work in this area mainly falls into two major categories: improving resource utilization [6] and guaranteeing end-to-end QoS [2]. These two kinds of schemes have their own advantages and disadvantages. The schemes focusing on resource management can achieve balanced load distribution at the resource level and improve the utilization of resources. However, they treat every kind of traffic equally and can only achieve load balance from the perspective of "quantity", without considering the needs of different traffic at the business level, which makes it difficult to guarantee service quality. The schemes designed for end-to-end QoS guarantee distinguish service types through

© Springer Nature Switzerland AG 2022. H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 27–35, 2022. https://doi.org/10.1007/978-3-030-96772-7_3


traffic classification methods and meet the resource requirements of different services. However, these schemes need to perform flow-by-flow identification; in large-scale traffic scenarios, their performance is affected by the number of flows, leading to huge computational overhead. In general, neither traditional resource management nor end-to-end QoS guarantee is suitable for large-scale, business-sensitive network scenarios. To meet these challenges, we propose a light-weight component structure analysis scheme for large-scale, business-sensitive scenarios. The scheme regards the mixed traffic as a whole and uses the attribute and structural features of the mixed traffic to analyze the component structure, that is, the proportion of each kind of service. The main contributions of this work include two aspects: (1) Compared with QoS guarantee schemes, there is no need to identify the flows one by one, so huge and meaningless overhead is avoided. (2) Compared with resource management schemes based on "quantity", our scheme can realize load balancing in "quantity" and meet the link traffic composition ratio at the business level of different scenarios, which is more flexible.

2 Methodology

2.1 Overview

To formally define the problem, we use u ∈ {1, 2, ..., U} to represent the type of network traffic, and its proportion is calculated as follows:

P_u = \frac{N_u}{N}    (1)

where P_u is the ratio of N_u to N, N_u is the number of five-tuple flows of type u, and N is the number of all five-tuple flows in the mixed network traffic. In addition, the proportion of the rest of the network traffic is represented as P_{U+1}. Network traffic component structure analysis is to identify the proportion of each type of traffic [P_1, P_2, ..., P_U, P_{U+1}] in the mixed traffic. As shown in Fig. 1, the proposed scheme for the component structure analysis problem consists of three modules: Preprocessing, Traffic analysis, and Proportion analysis, which are described in detail below.

2.2 Preprocessing

The Preprocessing module is responsible for extracting information from the mixed traffic and representing it in a traffic topology. A traffic capture tool is deployed on the network link to collect IPs and their communication relationships, which are stored in the IP set S_IP and the IP-pair relationship set S_{IP,IP}, respectively. Then, for each IP in S_IP, we extract its eigenvector f_IP and store it in the communication IP eigenvector set S_{f_IP} = {f_IP | IP ∈ S_IP}.
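A minimal sketch of this preprocessing step follows. The function name is hypothetical, and the per-IP eigenvector extraction (ports, traffic statistics) is stubbed out, since it is performed by the capture tool:

```python
def build_topology(packets):
    """Build the sets behind the traffic topology G from captured packets.

    packets: iterable of (src_ip, dst_ip) pairs observed on the link.
    Returns (V, E, F): node set, undirected edge set, and a per-IP
    eigenvector placeholder to be filled with real features.
    """
    V = set()
    E = set()
    for src, dst in packets:
        V.update((src, dst))
        # One undirected edge per communicating IP pair.
        E.add((min(src, dst), max(src, dst)))
    # F: one eigenvector per IP (stubbed; real features would go here).
    F = {ip: [] for ip in V}
    return V, E, F
```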


Fig. 1. Architecture

The information extracted from the original traffic sample is represented in the traffic topology G = (V, E, F), where V is the node set of G and each node v in V represents an IP in S_IP; F is the set of node eigenvectors, and each item f_v ∈ F corresponds to the f_IP ∈ S_{f_IP} of the IP that maps to node v; E is the edge set of G: for every IP pair (IP_i, IP_j) ∈ S_{IP,IP}, there exists an edge e_{v_i,v_j} between nodes v_i and v_j.

2.3 Traffic Analysis

Obviously, the specifications of the traffic topologies generated by different traffic samples differ, so it is difficult to use a unified model for analysis. Therefore, the design idea of the Traffic analysis module is to map the information in the irregular traffic topology to a regular traffic profile, after which a unified model can be used to analyze the component structure of the traffic profile.

The node encoding sub-module encodes all nodes according to a node attribute, which maps nodes in different traffic topologies to the same coding space and divides the nodes into a limited number of types, so that the traffic topology can be regarded as constructed from finitely many types of nodes. In our scheme, we use the degree of the node itself and of its first-order neighbors to encode the node. Each node v_i is encoded as a two-tuple

C_{v_i} = \left( Deg(v_i), \; \frac{\sum_{v_j \in N_{v_i}} Deg(v_j)}{|N_{v_i}|} \right)

where N_{v_i} represents the first-order neighbor nodes of v_i. It should be emphasized that the encoding attribute is not unique. Only the following two requirements need to be met: (1) the values of the attribute are rich enough to divide the nodes into enough categories to facilitate the subsequent construction of traffic profiles; (2) the attribute is distinguishable enough to effectively realize node classification.

Various connection modes of nodes lead to different structural characteristics of the traffic topology. Node embedding technology is used to learn the node context relationships in the traffic topology, and the procedure is shown as Algorithm 1. For all samples G, R rounds of random walks are performed in each traffic topology to generate sequences. The sequence set sequences is used to train the

30

Z. Wu et al.

Algorithm 1. CODE2VEC(G, δ)
Input: G = {G_1, G_2, ..., G_n}: encoded traffic topology set; δ: dimension of embedding
Output: Φ: map from a node's code to the node's embedding, with Φ(C_{v_i}) ∈ R^δ
1: initialize sequences = {}
2: for G_i ∈ G do
3:   for r = 1 → R do
4:     for v_j ∈ V_i do
5:       sequence = randomWalk(G_i, v_j, L)
6:       sequences.add(sequence)
7:     end for
8:   end for
9: end for
10: Φ ← word2vec(sequences)
11: return Φ
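The node encoding and walk-generation steps of Algorithm 1 can be sketched in Python as below. This is illustrative only: the `word2vec` training itself is omitted (a library such as gensim would consume the returned sequences), and the function names are assumptions.

```python
import random

def encode_node(v, adj):
    """Two-tuple code C_v: (own degree, mean degree of first-order neighbors)."""
    deg = len(adj[v])
    mean_nbr_deg = sum(len(adj[u]) for u in adj[v]) / deg
    return (deg, mean_nbr_deg)

def random_walks(adj, rounds, length):
    """Generate R rounds of random-walk code sequences from every node,
    i.e. the 'sequences' set that Algorithm 1 feeds to word2vec."""
    sequences = []
    for _ in range(rounds):
        for v in adj:
            walk = [v]
            for _ in range(length - 1):
                walk.append(random.choice(adj[walk[-1]]))
            # Each walk is stored as a sequence of node codes, not node ids.
            sequences.append([encode_node(u, adj) for u in walk])
    return sequences
```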

Fig. 2. Building a traffic profile, where N^f is 3 and δ is 2

word2vec model, which is used here to obtain the node embeddings. The map Φ from a node's code to its embedding is then output, where H_{C_{v_i}} = Φ(C_{v_i}) ∈ R^δ. For each encoded traffic topology G, each node v_i in V obtains its embedding through Φ, and the encoded traffic topology is updated to the embedded traffic topology G = (V, E, F, C, H), where H = {H_{v_i} | 1 ≤ i ≤ |V|} and each node embedding H_{v_i} ∈ R^{1×δ}.

After completing the node embedding, we use the proposed profile-building method to transform the irregular traffic topology into a regular traffic profile. It includes two steps: node mapping and pixel assignment. Node mapping maps each node v_i in the node set V to the δ-dimensional space formed by the δ-dimensional embedding. For each node in the traffic topology, the embedding H_{v_i} is regarded as its coordinate in the δ-dimensional traffic profile, and the node is mapped to the corresponding pixel of the traffic profile according to this coordinate. If the embeddings H_{v_i}, H_{v_j} of two nodes are the same, the two nodes v_i, v_j are mapped to the same pixel.
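The node-mapping step, together with averaging the eigenvectors of nodes that collide on the same pixel, might be sketched as follows. This is illustrative only; it assumes embeddings are normalized to [0, 1) so they can be quantized to pixel coordinates, which the excerpt does not specify.

```python
import numpy as np

def build_profile(embeddings, features, size, n_f):
    """Map node embeddings to pixel coordinates of a delta-dim profile.

    embeddings: dict node -> delta-dim vector with entries in [0, 1).
    features:   dict node -> eigenvector of length n_f.
    Nodes landing on the same pixel have their eigenvectors averaged;
    unmapped pixels keep the default value 0.
    """
    delta = len(next(iter(embeddings.values())))
    profile = np.zeros((size,) * delta + (n_f,))
    counts = np.zeros((size,) * delta)
    for v, emb in embeddings.items():
        # Quantize the embedding into integer pixel coordinates.
        pixel = tuple(min(int(x * size), size - 1) for x in emb)
        profile[pixel] += features[v]
        counts[pixel] += 1
    mask = counts > 0
    profile[mask] /= counts[mask][..., None]    # mean over colliding nodes
    return profile
```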


The pixel assignment operation assigns each channel of the pixels after the nodes have been mapped. First, the N^f channel values of each pixel in the traffic profile are initialized to 0, where N^f equals the dimension |f_{v_i}| of the node eigenvector. Then, for a target pixel, the values of its N^f channels are set to the eigenvector f_{v_i} of the node v_i mapped to that pixel. If multiple nodes are mapped to the same pixel, the pixel value is assigned the mean of the eigenvectors of all those nodes; pixels that no node maps to keep the default value. Figure 2 shows how to build a traffic profile when δ = 2 and N^f = 3. In the same way, when δ is greater than 2, the δ-dimensional traffic profile can be built according to the method described above. The structural information of the original traffic topology is expressed in the pixel coordinates, and the attribute information is expressed in the pixel values.

2.4 Proportion Analysis

The Proportion analysis module analyzes the input multi-dimensional traffic profile and obtains the proportion results. The information of the traffic profile is contained in the arrangement and values of pixels, which is similar to images. Therefore, we use a multi-dimensional CNN, as often used in image analysis tasks, to extract features from the traffic profile. Finally, these features are used to predict the proportions of the original traffic sample. For a convolutional layer in the δ-dimensional CNN, the convolution kernel's size is defined as S_1^k × S_2^k × ... × S_δ^k, and the convolution operation is defined as:

y^p_{j_1 \sim j_\delta} = b^p + \sum_{c=1}^{N^f} \sum_{s_1=1}^{S_1^k} \sum_{s_2=1}^{S_2^k} \cdots \sum_{s_\delta=1}^{S_\delta^k} W^{pc}_{s_1 \sim s_\delta} \, x^c_{(j_1+s_1-1)(j_2+s_2-1)\cdots(j_\delta+s_\delta-1)}    (2)

where y^p_{j_1∼j_δ} is the output value of the pixel with coordinate (j_1, j_2, ..., j_δ) in the p-th output feature map; b^p is the bias corresponding to the p-th output feature map; c denotes a channel of the input feature map (for the input traffic profile, there are N^f channels); S_1^k × S_2^k × ... × S_δ^k is the convolutional kernel size; W^{pc}_{s_1∼s_δ} is the weight at position (s_1, s_2, ..., s_δ) of the convolution kernel corresponding to the p-th output feature map and the c-th input channel; and x^c_{(j_1+s_1-1)(j_2+s_2-1)...(j_δ+s_δ-1)} is the value of the c-th channel of the pixel located at position (j_1+s_1-1, j_2+s_2-1, ..., j_δ+s_δ-1). For the pooling layer of the δ-dimensional CNN, the pooling window's size is defined as S_1^p × S_2^p × ... × S_δ^p. Maximum pooling takes the maximum value in the window, and average pooling computes the average of all pixel values in the pooling window. The predicted proportions must satisfy

\sum_{u=1}^{U+1} \hat{P}_u = 1    (3)

Since our prediction goal is a multi-dimensional proportion vector and the vector satisfies the constraint of summing to 1, we add a fully connected network with


Fig. 3. Proportion analysis model architecture

softmax as the activation function to analyze the features extracted by the CNN and make the output proportion result (P̂_1, P̂_2, ..., P̂_U, P̂_{U+1}) meet the constraint.
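As a sketch, the softmax output head that enforces constraint (3) could look like the following (the weight shapes are illustrative, not the authors' exact architecture):

```python
import numpy as np

def proportion_head(features, W, b):
    """Fully connected layer + softmax: outputs U+1 proportions summing to 1.

    features: 1-D feature vector extracted by the CNN.
    W, b: weights and bias of the final fully connected layer.
    """
    logits = features @ W + b
    exp = np.exp(logits - logits.max())   # shift for numerical stability
    return exp / exp.sum()                # non-negative, sums to 1
```

Because softmax outputs are non-negative and normalized, the constraint Σ P̂_u = 1 holds by construction, so no extra penalty term is needed.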

3 Evaluation

3.1 Data Set

To demonstrate the performance of our scheme, we construct a mixed traffic data set containing three types of traffic (web, P2P, and live), with a total of 2000 samples carrying proportion labels. Each sample contains the traffic passing the collection point during 30 seconds. Among them, 1600 traffic samples form the training set, and the rest form the test set.

3.2 Experiment Setting

For feature extraction, we first collect the communicating IPs and communication relationships in the sample, and extract the following features for each IP: the average/variance of the traffic that the IP exchanges with other IPs; the number of distinct destination ports of the IP's communication; the number of source ports used in the IP's communication; and the ratio of the number of destination IPs to the number of destination ports of the IP's communication. For the scheme settings, the embedding dimension is set to 3 and the profile size is set to 32 × 32 × 32 × 5. The proportion analysis model is shown in Fig. 3. Since the proportion analysis model is a regression model, we choose RMSE as the loss function and Adam as the optimizer with learning rate 0.0001. Model evaluation mainly includes accuracy evaluation and real-time evaluation. We first define an accuracy metric SOMP to measure the error between the predicted proportions and the ground truth:

SOMP = \sum_{u=1}^{U+1} \min(\hat{P}_u, P_u)    (4)

where SOMP means "sum of maximal proportion" and U is the number of traffic types. Moreover, the mean value of SOMP and the standard deviation


Table 1. Experiment result

Method            SOMP   S_SOMP  T_P (s)  S_TP   T_A (s)  S_TA   T (s)   S_T
FS-Net [4]        99.7%  0.003   1.225    1.065  0.870    0.521  2.095   1.195
TRF+C4.5 [3]      89.1%  0.102   2.429    1.429  1.879    0.867  2.472   1.441
CNN+LSTM [5]      87.3%  0.114   1.741    1.155  3.600    1.504  5.341   2.367
Traffics2Profile  94.9%  0.043   1.056    0.652  0.007    0.001  1.063   0.652

Fig. 4. Parameter discussion experiment result: (a) profile size discussion; (b) embedding dimension δ discussion

S_SOMP are used to measure the global error over the training set. To measure time performance, we separately record the traffic processing time T_P and the proportion analysis time T_A. The traffic processing time is the time to obtain the traffic profile from the original traffic sample, and the proportion analysis time is the time to predict the traffic proportions from the traffic profile. As with the accuracy evaluation, we use the mean and standard deviation of the processing time and analysis time to measure global time performance.

3.3 Experiment

We compare our scheme with three traffic classification methods and show the results in Table 1, from which we can draw two conclusions about accuracy and time performance. First, for accuracy, FS-Net achieves the highest average SOMP value of 99.7%, and its performance is very stable. FS-Net is a state-of-the-art method in the traffic classification field, and it realizes proportion prediction through one-by-one classification, so it is reasonable for such a fine-grained method to achieve higher accuracy. The average accuracy of our coarse-grained method Traffics2Profile also reaches 94.9%. Although its accuracy and stability are not as good as those of the traffic classification methods, the accuracy can meet most coarse-grained network management scenarios. Second, for time performance, Traffics2Profile outperforms all the other methods. Its traffic processing time is only slightly better than that of the other methods, because every method needs to extract features from the original traffic, and how


to extract features efficiently is not the focus of this article. As for the proportion analysis time, our method greatly outperforms the other traffic classification methods due to its overall analysis, which meets our expectations.

The profile size may have an impact on the prediction effectiveness, so we set the profile size to 16, 20, 24, 28, 32, and 36 to explore its impact. According to the experiment results shown in Fig. 4a, as the traffic profile size increases, the accuracy of the method first improves and then slightly decreases. When the traffic profile is small, its resolution is low: nodes with similar embeddings are mapped to the same pixel, losing part of the structural information of the original traffic. However, excessively increasing the profile size increases the number of invalid pixels in the traffic profile, and these noisy pixels have a negative impact on accuracy.

The embedding dimension δ is also an important parameter affecting the method's accuracy. We set δ to 2, 3, and 4, and use a CNN of the corresponding dimension to analyze the traffic profile. According to the experiment results shown in Fig. 4b, when the embedding dimension is 3 or 4, the accuracy is significantly better than when it is 2. This is because higher-dimensional embedding vectors can represent richer structural information, which better supports the proportion analysis. However, the difference in accuracy between dimensions 3 and 4 is very small, which shows that there is a limit to how much the embedding dimension can improve accuracy.

4 Conclusion

In this paper, we proposed a component structure analysis scheme that can effectively analyze the proportion of the various traffic components in a mixed flow. A mixed traffic data set was collected to verify the effectiveness of the proposed scheme. The results show that our scheme has a significant advantage in both time consumption and performance on the traffic proportion analysis task.

Funding Information. This work was supported by the Natural Science Foundation of China (No. 61972431, U2001204, 61873290) and the Natural Science Foundation of Guangdong Province, China (No. 2018A030313303).

References

1. Barakabitze, A.A.: QoE management of multimedia streaming services in future networks: a tutorial and survey. IEEE Commun. Surv. Tutor. 22(1), 526–565 (2020)
2. Blake, S., Black, D., Carlson, M., Davies, E., Wang, Z., Weiss, W.: RFC 2475: an architecture for differentiated services (1998)
3. Draper-Gil, G., Lashkari, A.H., Mamun, M.S.I., Ghorbani, A.A.: Characterization of encrypted and VPN traffic using time-related features. In: Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP), pp. 407–414 (2016)

A Light-Weight Scheme for Detecting Component Structure

35

4. Liu, C., He, L., Xiong, G., Cao, Z., Li, Z.: FS-Net: a flow sequence network for encrypted traffic classification. In: IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, pp. 1171–1179. IEEE (2019)
5. Lopez-Martin, M., Carro, B., Sanchez-Esguevillas, A., Lloret, J.: Network traffic classifier with convolutional and recurrent neural networks for Internet of Things. IEEE Access 5, 18042–18050 (2017)
6. Zhang, J., Yu, F.R., Wang, S., Huang, T., Liu, Z., Liu, Y.: Load balancing in data center networks: a survey. IEEE Commun. Surv. Tutor. 20(3), 2324–2352 (2018)

Evaluating the Performance and Conformance of a SYCL Implementation for SX-Aurora TSUBASA

Jiahao Li¹, Mulya Agung², and Hiroyuki Takizawa¹

¹ Cyberscience Center, Tohoku University, Sendai, Japan
² MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK

Abstract. SX-Aurora TSUBASA (SX-AT) is a vector supercomputer equipped with Vector Engines (VEs). SX-AT offers not only this new system architecture but also several execution modes to achieve high performance when executing a real-world application, which often consists of vector-friendly and vector-unfriendly parts. Vector Engine Offloading (VEO) is a programming framework for offloading only the vector-friendly parts to VEs, and neoSYCL has been developed on top of VEO to let programmers use the standard SYCL interface for offload programming on SX-AT. However, it is unclear how closely neoSYCL, being based on VEO, can conform to the SYCL standard, which is primarily based on OpenCL. Therefore, this paper discusses the conformance of neoSYCL to the SYCL standard, as well as its performance. Our thorough evaluation with SYCL-Bench kernels demonstrates that neoSYCL is conformant to the SYCL standard except for OpenCL-related features. In addition, the runtime overhead of using the SYCL interface on top of VEO is negligible in most cases, allowing neoSYCL codes to achieve performance comparable to the VEO codes.

Keywords: SX-Aurora TSUBASA · SYCL · Benchmarking

1 Introduction

SYCL is an open industry standard for programming a wide range of heterogeneous architectures [5]. The design of SYCL allows standard C++ source code to be written such that it can run on either an accelerator device or on the host. It features high-level abstractions, easing many of the burdens commonly encountered in parallel programming, while still allowing fine-grained control over performance and hardware features. NEC SX-Aurora TSUBASA (SX-AT) is the latest vector supercomputer [9]. An SX-AT system is equipped with two kinds of processors, Vector Hosts (VHs) and Vector Engines (VEs). A VH is a standard x86 processor for running the Linux operating system and hosting VEs, while a VE is NEC's eight-core vector processor implemented as a PCIe device card. With six High Bandwidth Memory 2E (HBM2E) modules, a VE can provide a high memory bandwidth of 1.53 TB/s [6]. Despite the heterogeneous hardware configuration, users can run a program on the VE as if the whole program were running in a standard Linux environment. However, since a practical application is often a mix of vector-friendly and vector-unfriendly parts, there is a demand for offloading only the vector-friendly parts to VEs and executing the rest on VHs. Thus, an offload programming model called Vector Engine Offloading (VEO) [10] is also provided by NEC. However, the programming interface of VEO is not only low-level but also non-portable to other platforms. A SYCL implementation named neoSYCL is the first and only SYCL implementation for SX-AT based on VEO [4]. At the source code level, neoSYCL provides a simple tool that identifies and separates the kernel part of a SYCL application, converting it into a distinct function. Relying on this simple approach, neoSYCL has been implemented as a collection of header files only, internally using VEO functions. Due to architectural differences between the vector processor and GPUs, some of OpenCL's concepts employed in the SYCL standard do not fit the vector architecture, as discussed in [12]. Hence, neoSYCL implements only a subset of the standard SYCL specification. In addition, for VEs to achieve high sustained performance, the kernel code should be vector-friendly, i.e., contain vectorizable long loops. Therefore, this paper discusses the conformance and performance of neoSYCL through evaluation results. The purpose of this paper is to demonstrate that neoSYCL is conformant to the SYCL standard for offload programming on SX-AT.

© Springer Nature Switzerland AG 2022
H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 36–47, 2022. https://doi.org/10.1007/978-3-030-96772-7_4

A SYCL Implementation for SX-AT

37
There are a large number of features defined in the SYCL standard, and some of them are used mostly on GPU platforms, not on others such as Field Programmable Gate Arrays (FPGAs) [13]. Consequently, this paper focuses on the basic and popular SYCL features likely to be used on SX-AT, and discusses the conformance of neoSYCL. In addition, this paper experimentally discusses the runtime overheads induced by neoSYCL's abstraction layer. The main contributions of this paper are as follows.

1. This is the first work to demonstrate the conformance and performance of the neoSYCL implementation with a variety of benchmark programs.
2. Based on SYCL-Bench [7], we have developed a portable benchmark suite, named VEO-SYCL-Bench, to compare the neoSYCL and VEO versions of a program.
3. We investigate the performance gain of using another framework, Alternative VE Offloading (AVEO) [2], instead of VEO.

2 NEC SX-Aurora TSUBASA

SX-AT is a new generation of NEC's SX-series supercomputers with dedicated vector processors. SX-AT employs a heterogeneous hardware configuration consisting of VHs and VEs. A VH is a standard x86 processor for running the Linux operating system (OS) as well as hosting VEs.

38

J. Li et al.

Fig. 1. Software stack of SX-AT.

To control VEs, VEOS is a Linux process running on the VH that provides OS functionality to VE programs running on VEs. Each VE is packaged in the form factor of a PCIe card. The vector processor consists of eight cores, six HBM2E modules, and one 16-MB Last-Level Cache (LLC) shared by all the cores. Figure 1 shows an overview of an SX-AT system. Since there is no OS kernel on the VE side, VEOS running on the VH provides the OS functionality to a user process running on the VE. VEOS consists of the ve_exec command and the VEOS service. The ve_exec command loads a VE program, requests permission to create a VE process, and handles the system calls and exceptions of the VE process. The VE driver installed in the VH Linux kernel space is a PCI device driver that provides access to VE resources and handles interrupts from the VEs. NEC provides C, C++, and Fortran compilers to build programs executable on a VE. Since the vector processors achieve high performance when executing vectorized code, these compilers support automatic vectorization of loops. In other words, to achieve high performance, the application code should be vector-friendly, meaning that the execution time is mostly spent executing vectorizable long loops. There are two execution models for running a program with the VE. The first is native execution, which simply runs the whole program on a VE to avoid data transfers between VHs and VEs. However, in some applications and application areas, it might not be straightforward to vectorize the whole application, and thus non-vectorized parts could critically degrade the overall performance. Since most of the execution time of a scientific program is likely spent on a particular loop (expressed as a kernel in SYCL), the second execution model is VEO, which is an accelerator-style programming model like OpenCL [8] and CUDA [11].
In VEO, a compute-intensive kernel part of an application is offloaded to VEs while the rest is executed on the VH. VEO provides a set of APIs for loading a shared library onto the VE, locating functions and symbols in the library, allocating and freeing memory on the VE, transferring data to and from the VE, and asynchronously executing functions on the VE side. By properly offloading only a kernel part of an application to the VE, the total performance is improved in many cases. However, programmers need to invest extra effort in modifying the original source code. In addition, an application developed with VEO is not portable to other platforms because the VEO programming interface is dedicated to SX-AT. Therefore, a standard offload programming interface for the SX-AT platform is needed.

3 Overview of neoSYCL

neoSYCL is a new SYCL implementation that aims to address the productivity issue of offload programming on SX-AT. The SYCL standard is designed to encourage and support a data-parallel programming style. A SYCL single-source code contains both host code that runs natively on the host CPU and device code that is executed on SYCL devices. Although the host code and device code can be written in a single source file, different compilers must be used for VHs and VEs. Thus, neoSYCL first extracts the kernel part from the source code and writes it to another file as a distinct function. This is the so-called kernel outlining, and the neoSYCL project provides a kernel generator tool for it. The tool can extract and transform a kernel part at the source-code level. Since the kernel part has been converted to a C/C++ function, the function can be compiled by a device compiler for VEs and linked to the host program to be run on the VH. The proof-of-concept implementation of neoSYCL in [4] provides the important SYCL concepts, including buffers, accessors, and queues. All of them are implemented by internally using VEO APIs. In the SYCL specification, data storage and data access are handled by the sycl::buffer and sycl::accessor classes, respectively. A sycl::accessor instance is created by calling sycl::buffer::get_access() and represents basic operations on the data storage associated with the instance. A sycl::buffer instance can be associated with a 1D, 2D, or 3D array that is accessible from kernels through the corresponding sycl::accessor instance. The sycl::buffer class is a C++ template with two parameters, the type and dimension of the data stored in the buffer. In the neoSYCL implementation, a sycl::buffer instance is implemented as a standard C++ array, and copying data to the VE can be done by simply copying the whole array to the VE's memory space. Accordingly, the original neoSYCL implementation provides buffers and accessors conformant to the SYCL specification.
Unlike buffers and accessors, queues in the original neoSYCL implementation are not conformant to the SYCL specification. A sycl::queue instance represents a mechanism through which host code submits work to a device for execution in the future. A sycl::queue instance passes kernels to devices in an asynchronous manner. In neoSYCL, two kinds of devices are available: a VE device acting as a device or accelerator, and a VH device that also works as the host. A sycl::queue instance is by default bound to the VH running the application. Any task submitted to such a queue is executed on the VH without any data transfers between VH memory and VE memory. In the original neoSYCL implementation [4], only a sycl::ve_queue instance could be bound to a VE to execute the kernel part on the VE. Although a SYCL application should be able to bind a queue to the host device or other accelerator devices by using the sycl::device_selector class, the original version of the neoSYCL implementation does not support sycl::device_selector. With the original neoSYCL implementation, every queue had to be replaced with the special sycl::ve_queue to run a standard SYCL program on a VE. Therefore, to improve conformance to the SYCL standard, we have modified the neoSYCL implementation to support a sycl::device_selector class compatible with the SYCL specification. In this way, we have reviewed the neoSYCL classes one by one to check whether they are conformant to the SYCL specification. Some classes have been rewritten if they are needed for offload programming on SX-AT but were not conformant to the SYCL specification. In the SYCL specification, there are two ways of invoking a kernel. One is to use sycl::queue::single_task() (or its variant) to create a single thread on the device side to execute a kernel. If necessary, the single thread could later become the master thread and spawn other worker threads for multi-threaded execution; for example, OpenMP directives [1] can be used for multi-threaded execution of the kernel loop. The other way is to use sycl::queue::parallel_for() (or its variant) to create multiple threads on the device side to execute a kernel. The nd_item and nd_range classes are used to express information about the kernel invocation, such as the number of threads (work items) to be created. The latter way is a basic SYCL feature inherited from OpenCL, which was originally designed with GPU computing in mind.
However, although GPUs need to create a large number of concurrent threads for efficient data-parallel processing, this execution model does not necessarily fit non-GPU platforms. Accordingly, we have decided that the current neoSYCL implementation should not support the SYCL features relevant to the nd_item and nd_range classes, and thus this paper discusses the conformance and performance of neoSYCL excluding these unsupported features.

4 Evaluation and Discussions

This section discusses the conformance of the original and new neoSYCL implementations by running the basic test cases provided by DPC++ [3]. Meanwhile, we use SYCL-Bench kernels to further measure the conformance and performance of the neoSYCL implementations. The neoSYCL implementations support only SX-AT, while other SYCL implementations are not available on SX-AT. Therefore, existing SYCL implementations cannot be directly compared to neoSYCL. However, this paper can still discuss the runtime overhead introduced by neoSYCL by comparing its performance to that of two offloading frameworks for SX-AT, VEO and AVEO. The specifications of the system used in the following experiments are listed in Table 1. We use the default optimization level of the VH and VE compilers to compile the programs used in our evaluations.


Table 1. System specifications (NEC SX-Aurora TSUBASA A100-1).

VH processor: Intel Xeon Gold 6126 CPU
VH memory:   96 GB
VH compiler: Clang 12.0.0
VE processor: NEC Vector Engine Type-10C
VE memory:   24 GB
VE compiler: NEC ncc 2.5.1
Software:    CentOS Linux 7.9.2009, VEOS 2.7.4, VEO 2.5.0, DPC++ source code¹, SYCL-Bench²

¹ https://github.com/intel/llvm/tree/sycl/sycl/test
² https://github.com/bcosenza/sycl-bench

4.1 Conformance Test Cases

DPC++ provides test cases that cover various aspects of the SYCL specification [3]. In this work, our neoSYCL implementations are compared in terms of conformance by using the DPC++ test cases, and conformance is quantified by the number of test cases passed. Note that some of the test cases are designed for Intel hardware and additional extensions; therefore, we use only the most basic and important test cases in the following evaluation. Specifically, 37 test cases covering runtime classes (device selection, device, platform, context, queue, and event) and data access and storage (buffer and accessor) are used, because these are the most common APIs in SYCL applications. We evaluate the conformance of the original and new neoSYCL implementations by running these test cases on SX-AT. In the original neoSYCL implementation, an instance of the special class ve_queue must first be created, and a task is submitted to the VE via the ve_queue instance. In the SYCL specification, however, at any point where the SYCL runtime needs to select a SYCL device, through an explicit device selector specialization or through the implicit default selector, the system calls select_device(), which queries all SYCL devices available in the system, passes each one to the selector's function call operator, and selects one device. To make neoSYCL more conformant to the SYCL standard, ve_queue is deprecated in this work and the device selector classes are implemented. Since a VE can be seen as a kind of accelerator, accelerator_selector is defined as a class derived from the SYCL device_selector class that selects a VE as a SYCL device. As a result, standard SYCL applications can be executed on SX-AT without code modification. Consequently, the original neoSYCL implementation passes only 20 test cases (54%), while the new neoSYCL implementation passes 35 test cases (95%).


Table 2. The detailed list of benchmarks included in the VEO-SYCL-Bench suite.

Benchmark name  Short        Domain
lin_reg_coeff   LRC          Data analytics
lin_reg_error   LRE          Data analytics
median          MEDIAN       Image processing
mol_dyn         MD           Physics simulation
scalar_prod     SP           Linear algebra
sobel3/5/7      SOBEL3/5/7   Image processing
vec_add         VA           Linear algebra
2DConvolution   2DCON        Image processing
2mm             2MM          Linear algebra
3DConvolution   3DCON        Image processing
3mm             3MM          Linear algebra
atax            ATAX         Linear algebra
bicg            BICG         Linear algebra
correlation     CORR         Data mining
covariance      COV          Data mining
fdtd2d          FTD2D        Stencils
gemm            GEMM         Linear algebra
gesummv         GESUM        Linear algebra
gramschmidt     GRAMS        Linear algebra
mvt             MVT          Linear algebra
syr2k           SYR2K        Linear algebra
syrk            SYRK         Linear algebra

The SYCL specification inherits some concepts from OpenCL, and some of those features are not supported by neoSYCL at present. There are still two test cases not passed even by the new neoSYCL implementation. This is because the two functions parallel_for_work_group and parallel_for_work_item used in those test cases are not supported by the neoSYCL implementations. Due to the great disparity between VEs and GPUs, it is difficult for VEs to efficiently support those OpenCL-related functions [12]. Although the new neoSYCL implementation does not currently support those APIs, the results demonstrate that it conforms to the most important and commonly used SYCL APIs.

4.2 VEO-SYCL-Bench

The SYCL-Bench suite [7] contains a number of real-world applications and kernels from different domains, such as linear algebra, image processing, and molecular dynamics. It is a benchmarking framework that provides many features, such as command-line arguments, a verification layer for all benchmarks, and automated execution of the entire benchmark suite. However, SYCL-Bench is not portable, because some of the APIs used in the framework are not conformant to the standard SYCL specification, and thus neither DPC++ nor neoSYCL can compile the original SYCL-Bench. Hence, based on SYCL-Bench, we have developed a simple but portable version of those benchmarks. To discuss the runtime overhead induced by neoSYCL's abstraction layer, we also developed a VEO version of those benchmarks. The collection of our SYCL benchmarks and VEO benchmarks is named VEO-SYCL-Bench¹. Table 2 lists the benchmarks.

Fig. 2. SYCL-Bench (a) and VEO-SYCL-Bench (b) versions of the vec_add benchmark.

In SYCL-Bench, many of the benchmarks provide variants for the different kernel invocation mechanisms mentioned in Sect. 3. Since neoSYCL supports only single_task() and parallel_for(), we have copied only the kernels invoked

¹ https://github.com/Tohoku-University-Takizawa-Lab/veo-sycl-bench


Fig. 3. Performance and code complexity comparison between VEO and neoSYCL versions of benchmark programs.

with parallel_for() from SYCL-Bench to VEO-SYCL-Bench, and rewrote the other parts, such as buffer allocation and initialization, which are performance-insensitive. Figure 2 serves as an illustrative example. Figure 2a shows that the original vec_add benchmark consists of the most important parts, including buffer initialization and the kernel function, while the arguments, queue, and device selection are initialized through the framework. Hence, we simplify the SYCL-Bench code for VEO-SYCL-Bench as shown in Fig. 2b. The queue is bound to a VE as a device by explicitly passing a sycl::accelerator_selector instance to the constructor. The data required by the kernel are initialized by using sycl::buffer instances. The kernel part is almost the same as that of the SYCL-Bench version. Since VEO-SYCL-Bench uses only standard SYCL APIs, it can easily be used with other SYCL implementations. Figure 3 shows the execution times of the benchmarks for the two implementations. For the time measurement, we run each benchmark 10 times with a small input size and then calculate the average execution time. The results show that the neoSYCL version is only 0.42% slower than the native VEO version on average. Although high-level abstraction commonly introduces some runtime overhead, the results show that the overhead caused by the neoSYCL runtime is small enough to be negligible. Furthermore, to evaluate the impact of SYCL on productivity, a code complexity analyzer called Lizard [14] is used to measure the code complexity of the different implementations. Figure 3 also shows the NLOC (number of lines of code without comments) of the two implementations of each benchmark. Other metrics, including the CCN (cyclomatic complexity number) and the token count (the number of distinct operators and distinct operands), are also calculated in our experiments. The average CCN and token values of the SYCL versions are 10 and 653, respectively. On the other hand, the average CCN and token values of the VEO versions are 12 and 764.5, respectively. All the results show that the SYCL versions are less complex and thus easier to maintain. In conclusion, by employing the SYCL programming interface, neoSYCL can decrease the code complexity while achieving almost the same performance.

Fig. 4. Performance comparison between the VEO and AVEO implementations.

AVEO is an alternative implementation of the VEO framework that is fully compatible with the original VEO [2]. It has been redesigned to solve a set of problems in VEO and to improve the kernel invocation latency as well as the data transfer bandwidth. Therefore, in this paper, we also evaluate the performance of the neoSYCL implementation with AVEO while changing the input data size. The neoSYCL implementations on top of VEO and AVEO are called neoSYCL-VEO and neoSYCL-AVEO, respectively. The kernel invocation latency of the original VEO is about 100 µs, while Focht [2] has shown that the kernel invocation latency of AVEO is in the range of 5.5–6 µs. In this work, however, most benchmarks invoke kernels only a few times. As a result, the time spent on kernel invocation is almost negligible compared to the time spent executing the kernel, so a reduction in the kernel invocation latency may not significantly decrease the total execution time. On the other hand, the data transfer bandwidth between a VH and a VE is essential even for these benchmarks, because they usually need to transfer a certain amount of data between the VH and the VE. For example, 3DConvolution is an image processing benchmark from SYCL-Bench that stores a large amount of data in both the VH memory and the VE memory. Therefore, the improvement in data transfer bandwidth can potentially decrease the total execution time. Figure 4 shows the performance comparison between neoSYCL-VEO and neoSYCL-AVEO for the 3DConvolution benchmark. For small buffer sizes, VEO and neoSYCL-VEO achieve better performance than AVEO and neoSYCL-AVEO. However, as the input size increases, AVEO and neoSYCL-AVEO outperform VEO and neoSYCL-VEO.
This is because the architecture of AVEO is more complex than that of VEO, resulting in some overhead for small inputs. Since the performance for large input data is more important in practice, these results suggest that it is more promising to use AVEO for implementing neoSYCL, although the performance difference is not very significant in this particular benchmark. The results also show that neoSYCL-VEO and neoSYCL-AVEO always achieve performance comparable to VEO and AVEO, respectively, regardless of the input size. Thus, it is again demonstrated that the abstraction penalty induced by the neoSYCL implementation is negligible.

5 Conclusions

SX-AT is a heterogeneous computing system equipped with VEs, which provide the world's highest memory bandwidth, and neoSYCL is a SYCL implementation that enables offload programming on SX-AT with a standard programming interface. Although the original neoSYCL implementation already supported most of the major SYCL features, some remained unsupported. In this work, therefore, we demonstrate the conformance and performance of the neoSYCL implementation with a variety of benchmark programs. We have reviewed the SYCL classes one by one and modified some of them to improve conformance. Although the improved neoSYCL implementation still lacks some features due to hardware limitations, our evaluation has shown that most SYCL benchmarks can be executed on SX-AT. For the performance evaluation, we have also developed a benchmark suite named VEO-SYCL-Bench. The evaluation results indicate that the performance difference between the neoSYCL and VEO versions of a program is small, and thus the runtime overhead induced by neoSYCL is negligible in practice. Meanwhile, the code complexity metrics show that the neoSYCL versions are consistently less complex than the native implementations. Moreover, we investigate the performance gain of using AVEO. Our results show that VEO can perform better than AVEO for small input data, while for large input data AVEO outperforms VEO. In our future work, we will improve the neoSYCL implementation to support more SYCL features and various devices. Meanwhile, due to the great distinctions among computing architectures, an efficient device selection mechanism is required to fully utilize computing resources. Thus, we will discuss the automation of a task-to-device mapping mechanism.

Acknowledgement. This work is partially supported by the MEXT Next Generation High-Performance Computing Infrastructures and Applications R&D Program "R&D of A Quantum-Annealing-Assisted Next Generation HPC Infrastructure and its Applications," Grant-in-Aid for Scientific Research (A) #20H00593, and Grant-in-Aid for Scientific Research (B) #21H03449.


References

1. Chandra, R., Dagum, L., Kohr, D., Menon, R., Maydan, D., McDonald, J.: Parallel Programming in OpenMP. Morgan Kaufmann, Burlington (2001)
2. Focht, E.: Speeding up vector engine offloading with AVEO, pp. 35–47 (2021)
3. Intel: Data Parallel C++ language. https://software.intel.com/content/www/cn/zh/develop/tools/oneapi/data-parallel-c-plus-plus.html
4. Ke, Y., Agung, M., Takizawa, H.: neoSYCL: a SYCL implementation for SX-Aurora TSUBASA. In: The International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2021), pp. 50–57. Association for Computing Machinery, New York (2021). https://doi.org/10.1145/3432261.3432268
5. Khronos: SYCL 1.2.1. Technical report, Khronos Group, Inc. (2020). https://www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf
6. Komatsu, K., et al.: Performance evaluation of a vector supercomputer SX-Aurora TSUBASA. In: SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 685–696 (2018). https://doi.org/10.1109/SC.2018.00057
7. Lal, S.: SYCL-Bench: a versatile cross-platform benchmark suite for heterogeneous computing. In: Malawski, M., Rzadca, K. (eds.) Euro-Par 2020. LNCS, vol. 12247, pp. 629–644. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-57675-2_39
8. Munshi, A.: The OpenCL specification. In: 2009 IEEE Hot Chips 21 Symposium (HCS), pp. 1–314 (2009). https://doi.org/10.1109/HOTCHIPS.2009.7478342
9. NEC: SX-Aurora TSUBASA - Vector Engine. https://www.nec.com/en/global/solutions/hpc/sx/vector_engine.html
10. Noack, M., Focht, E., Steinke, T.: Heterogeneous active messages for offloading on the NEC SX-Aurora TSUBASA. In: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 26–35 (2019). https://doi.org/10.1109/IPDPSW.2019.00014
11. Sanders, J., Kandrot, E.: CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, Boston (2010)
12. Takizawa, H., Shiotsuki, S., Ebata, N., Egawa, R.: An OpenCL-like offload programming framework for SX-Aurora TSUBASA. In: 2019 20th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), pp. 282–288 (2019). https://doi.org/10.1109/PDCAT46702.2019.00059
13. Waidyasooriya, H.M., Takei, Y., Tatsumi, S., Hariyama, M.: OpenCL-based FPGA platform for stencil computation and its optimization methodology. IEEE Trans. Parallel Distrib. Syst. 28(5), 1390–1402 (2017). https://doi.org/10.1109/TPDS.2016.2614981
14. Yin, T.: Lizard: An extensible Cyclomatic Complexity Analyzer (2019)

Bayesian Optimization-Based Task Scheduling Algorithm on Heterogeneous System

Tan Cai and Hong Shen
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
{cait7,shenh3}@mail2.sysu.edu.cn

Abstract. In heterogeneous computing systems, efficient task scheduling is essential for utilizing resources and reducing computing time. This problem has been shown to be NP-complete in the general case. Existing solutions are mainly heuristic-based, which easily get trapped in local optima, or reinforcement-learning-based, which incur an expensive computation cost for training neural networks. To overcome these shortcomings, we propose a Bayesian-optimization-based task scheduling algorithm that automatically searches for the best heuristic strategy in the problem space. Our algorithm builds a Bayesian optimization model relating heuristic strategy to scheduling performance, and updates the model by interacting with the environment to find globally optimal solutions. To enhance the confidence of our experiments, we measure the average (weighted) makespans and running time of our algorithm. The experimental results show that our approach can improve scheduling performance compared to the baselines.

Keywords: Task scheduling · Bayesian optimization · Heuristic

1 Introduction

A cloud data center containing heterogeneous servers interconnected by a high-speed network supports parallel execution of multiple tasks. In this paper, we study the problem of job (workflow) scheduling in a data center. For a given set of jobs, each job is represented as a task graph (directed acyclic graph, DAG), where nodes represent tasks in the job and edges represent the dependencies between tasks. Under the task dependency constraints, we need to jointly determine the execution order of each task and the task-to-server allocation plan to minimize the overall makespan of all jobs. Improving the performance of task scheduling is remarkably challenging and is of critical importance in boosting the profit of cloud computing platforms. Existing static task scheduling methods can be roughly classified into three categories: heuristic-based, meta-heuristic, and machine learning-based scheduling algorithms. The heuristic-based scheduling algorithms [1,11,12]

© Springer Nature Switzerland AG 2022
H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 48–56, 2022. https://doi.org/10.1007/978-3-030-96772-7_5


have the advantage of short execution time. However, their performance relies on the specific heuristic strategy and is therefore poor in robustness. Meta-heuristics [8] are another popular class of algorithms; they provide good-quality schedules, but their scheduling latency is much higher than that of the other categories. In recent years, with the rapid development of machine learning techniques, many researchers have attempted to address the task scheduling problem with reinforcement learning [3,4,7], which can interact with the environment and automatically generate the scheduling strategy but requires expensive computation with neural networks. We propose a Bayesian optimization based task scheduling algorithm for static task scheduling on a heterogeneous system to handle the above problems. We summarize the main contributions of this paper as follows:

– We propose a Bayesian optimization based scheduling algorithm for the task scheduling problem. Our algorithm searches for the best heuristic strategy in the problem space, effectively improving performance and reducing scheduling latency.
– We analyze the mathematical properties of our algorithm and provide a theoretical guarantee of convergence. The worst-case time complexity of our algorithm is lower than that of other search methods (random search, grid search, etc.).
– We conduct simulation experiments to validate our algorithm. The experimental results illustrate that our algorithm improves scheduling performance compared with the baselines.

2 Related Work

Various algorithms have been proposed in the literature for the task scheduling problem; they can traditionally be classified into three categories: heuristic-based, meta-heuristic, and machine learning-based scheduling algorithms.

Heuristics are a traditional class of algorithms for the scheduling problem, using a heuristic strategy to guide the scheduling process. Heterogeneous Earliest-Finish-Time (HEFT) [11] is the most well-known list-based scheduling algorithm. Predict Earliest Finish Time (PEFT) [1], Lookahead [2], and Critical-Path-on-a-Processor (CPOP) [11] have been proposed as extensions of HEFT. Heuristic-based algorithms have low computational complexity but easily get trapped in local optima, so they are poor in robustness and often need adjustment when the problem space changes slightly.

Meta-heuristics are another popular class, including genetic algorithms, simulated annealing, and ant colony optimization. Manasrah [8] proposes a hybrid GA-PSO algorithm that aims to reduce the makespan and cost and to balance the load of dependent tasks over heterogeneous resources in cloud computing environments. The main disadvantage of meta-heuristic algorithms is their high computational cost.

In recent years, with the rapid development of machine learning, machine learning-based scheduling algorithms have been proposed to reduce computation time. In [9], the authors use reinforcement learning (Q-learning and SARSA) to solve the workflow scheduling problem for computation resources, reducing task execution time. In [5,7], and [4], reinforcement learning and deep learning are combined to solve more difficult real-life problems. The main disadvantage of machine learning-based scheduling algorithms is the high cost of training and computing with neural networks.

3 Problem Description

3.1 Scheduling Model

In this section, we formulate the task scheduling model. In a heterogeneous computing system, the user submits jobs (workflows) $\{j_i\}_{i=1}^{N}$ to be executed, and each job can be represented as a directed acyclic graph, where vertices indicate tasks and edges indicate dependencies between tasks. We call a task node without any parent an entry node and a task node without any child an exit node. The task scheduling process always begins with the entry nodes. We assume the heterogeneous environment consists of servers $\{p_i\}_{i=1}^{M}$.

Each job $j_n$ contains tasks $\{t_{n,i}\}_{i=1}^{K_n}$, where the required resources $\{r_{n,i}\}_{i=1}^{K_n}$ and the workloads $\{w_{n,i}\}_{i=1}^{K_n}$ are known in advance. For a task $t_{n,i}$ to be executed on a server $p_j$ (with $W_j$ the processing capacity of $p_j$), the execution time is obtained as:

$$c_{n,i}^{j} = \frac{w_{n,i}}{W_j}. \tag{1}$$

In this paper, we aim to minimize the total weighted Job Completion Time (JCT):

$$T = \min \sum_{n=1}^{N} W_n \cdot J_n \tag{2}$$

where $W_n$ indicates the weight of job $j_n$ and $J_n$ its completion time, respectively. We suppose that important jobs are assigned higher weights.
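To make Eqs. (1) and (2) concrete, the following toy sketch computes the execution times and the weighted JCT of one fixed schedule. All workloads, server speeds, and job weights are made-up illustrative values, and the variable names are ours, not the paper's.

```python
# Toy illustration of Eq. (1) and Eq. (2); all numbers are made up.
workloads = {"t1": 8.0, "t2": 4.0}      # w_{n,i}
server_speed = {"p1": 2.0, "p2": 4.0}   # W_j

def exec_time(task, server):
    """Execution time c^j_{n,i} = w_{n,i} / W_j (Eq. 1)."""
    return workloads[task] / server_speed[server]

# Suppose two single-task jobs finish at these times (J_n), with weights W_n.
completion = {"j1": exec_time("t1", "p1"), "j2": exec_time("t2", "p2")}
weights = {"j1": 2.0, "j2": 1.0}

# Total weighted JCT (Eq. 2) for this particular schedule.
T = sum(weights[j] * completion[j] for j in completion)
print(T)  # 2*4.0 + 1*1.0 = 9.0
```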

3.2 Constraints

Resource Capacity Constraint. We use $V_u$ to denote the set of tasks running on server $p_u$, and introduce Eq. (3) to ensure that resources are sufficient to support the running tasks:

$$\sum_{t \in V_u} r_t \le R_u \tag{3}$$

Task Dependency Constraint. For each job $j_n$, the completion time $J_n$ satisfies:

$$J_n = \max_{i \in \{1,\dots,K_n\}} C_{n,i} \tag{4}$$

where $C_{n,i}$ indicates the completion time of task $t_{n,i}$. Note that $C_{n,i}$ starts timing after the job $j_n$ arrives. The completion time of a job is determined by the completion time of the latest task composing it. Similarly, a task $t_{n,i}$ can be scheduled only after the job $j_n$ is released, which gives the following constraint:

$$R_n = \min_{i \in \{1,\dots,K_n\}} S_{n,i} \tag{5}$$

where $S_{n,i}$ indicates the start time of task $t_{n,i}$, and $R_n$ indicates the release time of $j_n$. Considering the dependencies within each job, a task can start only after all the tasks it depends on have completed their transmissions. The dependency constraint can be expressed as:

$$S_{n,i} = \max_{t' \in pred(t_{n,i})} \big( C_{t'} + m(t', t_{n,i}) \big) \tag{6}$$

where $pred(\cdot)$ indicates the set of immediate predecessors and $m(\cdot, \cdot)$ indicates the transmission time.

Algorithm 1: Bayesian optimization based scheduling algorithm.
Input: A set of jobs to be scheduled $\{j_i\}_{i=1}^{N}$; a set of servers $\{p_i\}_{i=1}^{M}$.
Output: The weighted job completion time $T^*$.

1:  Init a multivariate Gaussian process model $M: \{w_i\}_{i=1}^{4} \to T$;   /* Ref. to 1.1 */
2:  repeat
3:    Get the most promising candidate $\{w_i^*\}_{i=1}^{4}$ by minimizing $M$;   /* Ref. to 1.2 */
4:    $T \leftarrow 0$; $S \leftarrow \emptyset$; $J \leftarrow \{j_i\}_{i=1}^{N}$;
5:    foreach job $j_i$ in $J$ do
6:      foreach task $t_{i,n}$ in $j_i$ do
7:        if indegree($t_{i,n}$) = 0 then
8:          $S \leftarrow S \cup \{t_{i,n}\}$;   /* Ref. to 2.2 */
9:          $j_i \leftarrow j_i \setminus \{t_{i,n}\}$;
10:   while $S$ is not empty do
11:     Select $(t, p)$ with $t \in S$ and $p \in \{p_i\}_{i=1}^{M}$ having the highest priority according to $f^*$;   /* Ref. to 2.3 */
12:     $S \leftarrow S \setminus \{t\}$;
13:     Release the dependencies of $t$ and append newly dependency-free tasks to $S$;
14:     if all tasks in a job $j$ are completed then
15:       $T \leftarrow T + W_j \cdot J_j$;
16:   $M \leftarrow$ FitModel($M$, $(w^*, T)$);   /* Ref. to 1.2 */
17:   $T^* \leftarrow \min(T^*, T)$;   /* Ref. to 2.4 */
18: until the result $T^*$ is acceptable;

4 Algorithm

In this section, we first introduce the scheduling strategy and then present our algorithm for automatically finding the best scheduling strategy in the problem space.

4.1 Scheduling Strategy

In heuristic-based scheduling, the scheduling strategy assigns priorities to tasks and decides the execution order. The format of the scheduling strategy also differs among algorithms. In this paper, we present the strategy $f$ as a linear combination of scheduling indicators (such as resource utilization rate, upward rank, and so on) and express it as:

$$f(t, p) = w_1 \cdot x_{ru}(t, p) + w_2 \cdot x_{up}(t) + w_3 \cdot x_{down}(t) + w_4 \cdot x_{exec}(t, p), \tag{7}$$

where $f(t, p)$ measures the priority of an executable pair $(t, p)$; the resource utilization rate $x_{ru}$, upward rank $x_{up}$, downward rank $x_{down}$, and execution time $x_{exec}$ are defined next.

The resource utilization rate of task $t$ executing on server $p$ can be represented as:

$$x_{ru}(t, p) = \frac{r_t + \sum_{t' \in V_p} r_{t'}}{R_p}, \tag{8}$$

The execution time $x_{exec}$ is defined in Eq. (1). The upward rank can be represented as:

$$x_{up}(t) = w_t + \max_{t' \in succ(t)} \big( x_{up}(t') + m(t, t') \big) \tag{9}$$

where $w_t$ is the workload of $t$, $succ(t)$ is the set of immediate successors of $t$, and $m(t, t')$ is the transmission cost between tasks $t$ and $t'$. The downward rank can be represented as:

$$x_{down}(t) = \max_{t' \in pred(t)} \big( x_{down}(t') + w_{t'} + m(t', t) \big), \tag{10}$$

where $pred(t)$ is the set of immediate predecessors of task $t$. According to Eq. (7), the quality of $f$ is determined by the parameters $\{w_i\}_{i=1}^{4}$. We divide the scheduling process of our algorithm into two phases, introduced in the next subsections.
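As an illustration of the upward rank of Eq. (9) and the combined priority of Eq. (7), here is a small self-contained sketch. The four-node DAG, its workloads, and the helper names are our own assumptions; the other indicators ($x_{ru}$, $x_{down}$, $x_{exec}$) are passed in as precomputed values.

```python
from functools import lru_cache

# Illustrative DAG: succ[t] lists (successor, transmission cost m(t, t')).
succ = {"a": [("b", 1.0), ("c", 2.0)], "b": [("d", 1.0)],
        "c": [("d", 3.0)], "d": []}
workload = {"a": 2.0, "b": 3.0, "c": 1.0, "d": 2.0}  # w_t

@lru_cache(maxsize=None)
def x_up(t):
    """Upward rank, Eq. (9): w_t + max over successors of (x_up + m)."""
    if not succ[t]:                      # exit node: no successors
        return workload[t]
    return workload[t] + max(x_up(s) + m for s, m in succ[t])

def priority(t, p, w, x_ru, x_down, x_exec):
    """Priority of pair (t, p), Eq. (7); the x_* indicators are supplied."""
    w1, w2, w3, w4 = w
    return w1 * x_ru + w2 * x_up(t) + w3 * x_down + w4 * x_exec

print(x_up("d"))  # 2.0 (exit node)
print(x_up("a"))  # 2 + max(x_up(b)+1, x_up(c)+2) = 2 + max(7, 8) = 10.0
```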

4.2 First Phase: Bayesian Optimization Training

We regard the relationship between the parameters and the total completion time as a black-box function, and describe it with the help of Bayesian optimization:

1.1 Propose a Gaussian process model $M: \{w_i\}_{i=1}^{4} \to T$, where $\{w_i\}_{i=1}^{4}$ are the parameters in Eq. (7) and $T$ is defined in Eq. (2);
1.2 Fetch parameters $\{w_i^*\}_{i=1}^{4}$ by minimizing $M$, obtain the observation $(\{w_i\}_{i=1}^{4}, T)$ by simulating the scheduling process, and update $M$ with the observation;
1.3 Repeat 1.2 until we obtain an acceptable result.
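A minimal sketch of the loop in steps 1.1–1.3, under simplifying assumptions of our own: a one-dimensional parameter instead of the paper's four weights, an RBF-kernel Gaussian process, a lower-confidence-bound acquisition minimized over random candidates, and a synthetic objective standing in for the scheduling simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def objective(w):
    """Stand-in for the scheduling simulation: returns a completion time."""
    return (w - 0.3) ** 2 + 1.0

def rbf(a, b, ls=0.2):
    """RBF kernel between 1-D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """GP posterior mean/std at query points Xs, given observations (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    Kinv = np.linalg.inv(K)
    mu = Ks.T @ Kinv @ y
    var = 1.0 - np.einsum("ij,ji->i", Ks.T @ Kinv, Ks)
    return mu, np.sqrt(np.maximum(var, 1e-12))

# Step 1.1: initialize the model with one random observation.
X = rng.uniform(0, 1, size=1)
y = np.array([objective(X[0])])

for _ in range(15):                        # steps 1.2-1.3
    cand = rng.uniform(0, 1, size=256)     # random candidate parameters
    mu, sd = gp_posterior(X, y, cand)
    w_star = cand[np.argmin(mu - 2.0 * sd)]  # lower confidence bound
    X = np.append(X, w_star)               # simulate and update the model
    y = np.append(y, objective(w_star))

print(float(y.min()))  # best observed completion time, close to 1.0
```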

4.3 Second Phase: Task Scheduling Simulation

In this phase, we predict the total completion time $T$ from $\{w_i\}_{i=1}^{4}$:

2.1 Determine the heuristic strategy $f$ from the given parameters $\{w_i\}_{i=1}^{4}$;
2.2 For the input jobs $\{j_i\}_{i=1}^{N}$, define a list of executable tasks $l_{execute}$ and initialize it with the entry nodes of $\{j_i\}_{i=1}^{N}$;
2.3 With the help of the heuristic strategy $f$, fetch the pair $(t, p)$ with the highest priority among all candidates, where $t$ is taken from $l_{execute}$ and $p$ is a server. Once task $t$ is selected, remove it from $l_{execute}$ and release its dependencies to other tasks; tasks that become dependency-free are appended to $l_{execute}$;
2.4 Repeat 2.3 until $l_{execute}$ is empty; we then obtain the reward $T$ defined in Eq. (2).
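Steps 2.1–2.4 amount to a ready-list simulation. The toy sketch below is our own simplified version (it ignores server assignment and transmission times, and uses a stand-in priority function), showing only the ready-set bookkeeping of step 2.3:

```python
# Toy ready-list scheduling simulation for steps 2.1-2.4 (illustrative only).
# deps[t] = set of tasks t depends on; times are made-up values.
deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
exec_time = {"a": 2.0, "b": 3.0, "c": 1.0, "d": 2.0}

def simulate(priority):
    """Schedule tasks one at a time in priority order; return finish times."""
    remaining = {t: set(d) for t, d in deps.items()}
    ready = [t for t, d in remaining.items() if not d]   # entry nodes
    finish = {}
    while ready:
        # Step 2.3: pick the ready task the strategy ranks highest.
        t = max(ready, key=priority)
        ready.remove(t)
        start = max((finish[p] for p in deps[t]), default=0.0)
        finish[t] = start + exec_time[t]
        for t2, d in remaining.items():   # release dependencies on t
            d.discard(t)
            if not d and t2 not in finish and t2 not in ready:
                ready.append(t2)
    return finish

finish = simulate(priority=lambda t: exec_time[t])  # stand-in strategy
print(finish["d"])  # 7.0: d starts after b (5.0) and c (3.0) finish
```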

4.4 Theoretical Analysis

In this subsection, we explore the effectiveness of our algorithm. We first introduce an assumption on the gradients of GP sample paths [6]:

Lemma 1. For the optimal solution $T = F(w^*)$ in Bayesian optimization, which can be represented as $F(w^*) = \max_{i=1,\dots,j} F(w_i)$, we have

$$F(w^*) - \max_{i=1,2,\dots,j} F(w_i) \le E[F(w^*) - F(w_j)] \le E\big[L\,\|w^* - w_j\|\big] \le \frac{d}{\tau_j}\,E[L] \le \frac{d}{\tau_j}\int_0^\infty a\,e^{-t^2/b^2}\,dt = \frac{d\,a\,b\,\sqrt{\pi}}{2\tau_j} = \frac{1}{2j^2}$$

where $j$ indicates the number of iterations and $w_i$ indicates the most promising candidate $w$ obtained from $M$ in the $i$-th iteration. The first step bounds the difference in function values by the largest partial derivative $L$ and the distance between the points. The second step uses the properties of the discretization.

Following Lemma 1, the ratio between the optimal solution $F(w^*)$ and the best observed solution $\max_{i=1,2,\dots,j} F(w_i)$ at the $j$-th iteration satisfies:

$$\frac{F(w^*)}{\max_{i=1,2,\dots,j} F(w_i)} \le 1 + \frac{1}{2j^2 \cdot \max_{i=1,2,\dots,j} F(w_i)} \le 1 + \frac{1}{\Psi_j \cdot 2j^2}. \tag{11}$$

We then apply the result of [10]:

$$E[\Psi_j] = \mu_j + \sigma_j\,\Phi^{-1}\!\left(\frac{j - \pi/8}{j - \pi/4 + 1}\right). \tag{12}$$

This yields:

$$\frac{F(w^*)}{\max_{i=1,2,\dots,j} F(w_i)} \le 1 + \frac{1}{2j^2\mu_j + 2j^2\sigma_j\,\Phi^{-1}\!\left(1 + \frac{\pi-8}{16-4\pi}\right)} \le 1 + \frac{1}{2j^2\mu_j + 2.36\,j^2\sigma_j} \tag{13}$$

where $\varphi(x) = \frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}x^2)$ and $\Phi(x) = \int_{-\infty}^{x}\varphi(z)\,dz$. Equation (13) indicates that our algorithm guarantees convergence.
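As a quick numeric sanity check of the final bound $1 + 1/(2j^2\mu_j + 2.36\,j^2\sigma_j)$ in Eq. (13), with made-up values for $\mu_j$ and $\sigma_j$, the bound indeed shrinks toward 1 as the iteration count $j$ grows:

```python
# Numeric illustration of the convergence bound in Eq. (13);
# mu and sigma are fixed to made-up posterior values.
def ratio_bound(j, mu=2.0, sigma=0.5):
    return 1.0 + 1.0 / (2 * j**2 * mu + 2.36 * j**2 * sigma)

bounds = [ratio_bound(j) for j in (1, 5, 20)]
print(bounds[0] > bounds[1] > bounds[2] > 1.0)  # True: bound shrinks toward 1
```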

5 Experimental Results

5.1 Research Questions

We discuss the following research questions in the remainder of this section:

RQ1 Is the algorithm we propose workable and effective? Does it obtain higher-quality scheduling results than other scheduling algorithms?
RQ2 Is the running time of our algorithm better than that of other algorithms? How does its running speed differ from theirs?

Fig. 1. The total weighted makespans of our algorithm and baselines with the number of tasks changing. (Curves: HEFT, PEFT, GA-PSO, RLTS, Ours; y-axis: average makespan; x-axis: the number of tasks.)

Fig. 2. Average running time in milliseconds (over 100 runs on randomly generated jobs) of our algorithm and baselines. (Same five algorithms; x-axis: number of tasks.)

5.2 Baselines

In this subsection, we introduce the four typical algorithms against which we compare performance: HEFT [11], PEFT [1], GA-PSO [8], and RLTS [5], where HEFT and PEFT belong to heuristic-based scheduling, GA-PSO belongs to meta-heuristic scheduling, and RLTS belongs to reinforcement learning-based scheduling (Figs. 1 and 2).

5.3 Experiments

We set up two main experiments to answer the research questions listed above. In the first experiment, we examine the performance of our algorithm compared with other task scheduling algorithms. We can see from Fig. 1 that all the makespan values produced by our algorithm are lower than those of the other algorithms. This is because our algorithm searches for the best heuristic strategy in the problem space and interacts with the scheduling environment to avoid getting stuck in local optima. In the second experiment, we evaluate the average running time of these algorithms. We can see from Fig. 2 that the running time of our algorithm is less than that of RLTS and GA-PSO, which indicates that our algorithm has lower time complexity than the reinforcement learning-based and meta-heuristic algorithms. The running time of our algorithm is greater than that of HEFT and PEFT, because those algorithms have low computational complexity but often end at local optima.

6 Conclusion

We propose a Bayesian optimization based scheduling algorithm that automatically searches for the best heuristic strategy in the problem space. We also provide a theoretical guarantee of the convergence of our algorithm. The experimental results show that our algorithm improves scheduling performance compared to the baselines.

Acknowledgement. This work is supported by the Key-Area Research and Development Plan of Guangdong Province #2020B010164003.

References

1. Arabnejad, H., Barbosa, J.G.: List scheduling algorithm for heterogeneous systems by an optimistic cost table. IEEE Trans. Parallel Distrib. Syst. 25(3), 682–694 (2014)
2. Bittencourt, L.F., Sakellariou, R., Madeira, E.R.M.: DAG scheduling using a lookahead variant of the heterogeneous earliest finish time algorithm. In: 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, pp. 27–34 (2010)
3. Chen, X., Zhang, H., Wu, C., Mao, S., Ji, Y., Bennis, M.: Optimized computation offloading performance in virtual edge computing systems via deep reinforcement learning. IEEE Internet Things J. 6(3), 4005–4018 (2019)
4. Dai, H., Khalil, E.B., Zhang, Y., Dilkina, B., Song, L.: Learning combinatorial optimization algorithms over graphs (2018)
5. Dong, T., Xue, F., Xiao, C., Li, J.: Task scheduling based on deep reinforcement learning in a cloud manufacturing environment. Concurr. Comput. Pract. Exper. 32(11), e5654 (2020)
6. Kandasamy, K., Krishnamurthy, A., Schneider, J., Poczos, B.: Parallelised Bayesian optimisation via Thompson sampling. In: Storkey, A., Perez-Cruz, F. (eds.) Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 84, pp. 133–142. PMLR (2018). https://proceedings.mlr.press/v84/kandasamy18a.html
7. Lin, C.C., Deng, D.J., Chih, Y.L., Chiu, H.T.: Smart manufacturing scheduling with edge computing using multi-class deep Q network. IEEE Trans. Ind. Inf. 15(7), 4276–4284 (2019)
8. Manasrah, A.M., Ba Ali, H., Gupta, B.B.: Workflow scheduling using hybrid GA-PSO algorithm in cloud computing. Wirel. Commun. Mob. Comput. 2018 (2018). https://doi.org/10.1155/2018/1934784
9. Orhean, A.I., Pop, F., Raicu, I.: New scheduling approach using reinforcement learning for heterogeneous distributed systems. J. Parallel Distrib. Comput. 117, 292–302 (2018). https://www.sciencedirect.com/science/article/pii/S0743731517301521
10. Royston, J.P.: Algorithm AS 177: expected normal order statistics (exact and approximate). J. Roy. Stat. Soc. Series C (Appl. Stat.) 31(2), 161–165 (1982). http://www.jstor.org/stable/2347982
11. Topcuoglu, H., Hariri, S., Wu, M.Y.: Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Trans. Parallel Distrib. Syst. 13(3), 260–274 (2002)
12. Wang, H., Sinnen, O.: List-scheduling versus cluster-scheduling. IEEE Trans. Parallel Distrib. Syst. 29(8), 1736–1749 (2018)

Optimizing Uplink Bandwidth Utilization for Crowdsourced Livecast

Xianzhi Zhang1,2, Guoqiao Ye1,2, Miao Hu1,2, and Di Wu1,2(B)

1 School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
{zhangxzh9,yegq3}@mail2.sysu.edu.cn, {humiao5,wudi27}@mail.sysu.edu.cn
2 Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou 510006, China

Abstract. Driven by the prevalence of video generation devices and the development of network infrastructures, there has been explosive growth of Crowdsourced Video Livecast (CVL) services in the past few years. Significant efforts have been made to provide high-quality CVL services with limited bandwidth availability. However, most existing works focus on optimizing downlink bandwidth for video distribution rather than uplink bandwidth for video uploading. For example, uploaders (i.e., broadcasters) on Twitch can arbitrarily set their upload rates, which may lead to a significant waste of upload bandwidth as the number of uploaders increases. In this paper, we propose an effective low-complexity algorithm called Bubal to optimize upload bandwidth allocation among massive numbers of uploaders. Our objective is to optimize the utility of video uploading from the perspective of CVL platform operators by considering both viewers' Quality-of-Experience (QoE) and upload bandwidth cost. To guarantee the effectiveness and fairness of bandwidth allocation, we adopt the optimization framework of the Nash Bargaining Solution (NBS), which jointly determines the optimal bandwidth budget, upload bitrate, and datacenter selection for each uploader. Finally, we conduct extensive trace-driven simulations to evaluate our proposed algorithm; the results show that it achieves much higher utility than alternative strategies under various conditions.

Keywords: Crowdsourced Video Livecast · Upload bandwidth · Quality-of-Experience (QoE) · Utility maximization · Nash bargaining solution

1 Introduction

In recent years, Crowdsourced Video Livecast (CVL) has flourished with the prevalence of high-end user devices, leveraging the power of cloud computing

This work was supported by the National Natural Science Foundation of China under Grants U1911201, U2001209, 62072486, and 61802452; the Science and Technology Planning Project of Guangdong Province under Grant 2021A0505110008; and the Science and Technology Program of Guangzhou under Grants 202007040006, 202002020045, and 202103010004.

© Springer Nature Switzerland AG 2022
H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 57–68, 2022. https://doi.org/10.1007/978-3-030-96772-7_6


platforms. A number of worldwide crowdsourced video livecast platforms have emerged, such as Twitch.tv, YouTube Live, Azubu.tv, and Hitbox.tv. As one of the most successful CVL platforms, Twitch.tv has attracted over 200 million concurrent viewers and more than 3 million concurrent broadcasters at its peak hours [2]. Moreover, CVL has received research attention from both industry and academia, spanning from measuring real platforms and developing transcoding frameworks to optimizing the resources consumed by viewers. Despite extensive contributions by previous researchers, there is very limited work on optimizing the bandwidth resources allocated to uploaders. However, according to the measurement study [18], 25% of upload bandwidth is wasted by broadcasters. The reason is that, with an arbitrary selection of upload bitrates, all uploaders prefer the highest upload bitrate they can support to maximize the streaming quality for their viewers, no matter how many viewers there are, which can cause significant resource wastage. From the perspective of CVL platform operators, the goal is to maximize overall utility by maintaining sufficiently good viewer QoE at a reasonable bandwidth cost. However, it is non-trivial to determine the optimal upload bitrate and datacenter selection, for the following reasons. First, we need to balance the upload bitrates of different uploaders to achieve high overall QoE for the platform while minimizing the bandwidth cost. Second, since the bandwidth prices and locations of datacenters differ, we need to carefully choose an appropriate datacenter for each uploader to upload its video. Third, the viewer population of a particular video stream may fluctuate significantly and rapidly over time. Therefore, the optimal bandwidth and upload bitrates of uploaders should change dynamically with the latest viewer population.
In this paper, to address the above challenges, we determine the optimal bandwidth budget, upload bitrate, and datacenter selection to maximize the overall utility from the perspective of a CVL platform operator. We adopt the Nash Bargaining Solution (NBS) to ensure effectiveness and fairness, and design an effective low-complexity algorithm called Bubal, which can help CVL platforms control cost and enhance service quality with a massive number of uploaders. Our contributions are summarized as follows:

– To the best of our knowledge, we are the first to consider the optimization of upload bandwidth allocation among broadcasters in crowdsourced video livecast systems. We formulate the problem as a constrained utility optimization problem and balance the tradeoff between bandwidth cost and viewer QoE.
– By exploiting the Nash bargaining solution (NBS) optimization framework, we design an effective low-complexity algorithm called Bubal to solve the optimization problem, which determines the optimal bandwidth budget, upload bitrate, and datacenter selection for each uploader.
– To evaluate the effectiveness of our proposed algorithm, we conduct extensive trace-driven simulations using public Twitch live streaming traces. Experimental results show that our proposed algorithm achieves much higher utility than alternative strategies under various conditions.


The rest of this paper is organized as follows. We first introduce the system model in Sect. 3. In Sect. 4, we present the solution of the QoE optimization problem with a given bandwidth budget. In Sect. 4.3, we explore how to solve the utility optimization problem to find the optimal bandwidth budget. We conduct a series of experiments to evaluate the performance of our design in Sect. 5. We discuss related work in Sect. 2 and conclude the paper in Sect. 6.

2 Related Work

The popularity of Crowdsourced Video Livecast (CVL) has attracted significant attention recently. Related studies can be divided into two major categories: i) measurements and pattern analysis of CVL systems, and ii) optimization of transcoding and scheduling.

To understand viewer interactions, Wang et al. [15] performed a comprehensive measurement study of viewer interactions on a popular crowdsourced live broadcasting website in China and further designed methodologies to predict the popularity of channels. Yi et al. [17] for the first time conducted an experiment-based measurement of YouTube's 360-degree live video streaming and identified the primary design weaknesses of current CVL systems.

Quite a few papers focus on the optimization of transcoding, scheduling, and resource provisioning. Luo et al. [9] adopted a novel live video ingest approach, named CrowdSR, that transforms a low-resolution video stream into a high-resolution stream for viewers in crowdsourced livecast using a super-resolution method. Zhang et al. [19] designed a novel framework, CastFlag, to predict highlights, i.e., key events in livecast, and optimize the transcoding workload. Ma et al. [10] proposed a viewer-assisted Crowdsourced Livecast Services (CLS) framework with a fairness-guaranteed task assignment scheme, solved as a dynamic programming problem. Wang et al. [14] presented an edge-assisted crowdcast framework called DeepCast that targets the heterogeneous and personalized QoE demands of viewers by leveraging DRL. Besides, a review of crowdcast solutions, challenges, and opportunities for personalized CVL with intelligent edge technology was provided by Wang et al. [13].

Our work differs from previous work in three main aspects. Firstly, we focus on the upload bandwidth allocation problem instead of the optimization of video transcoding or distribution in CVL platforms. Secondly, we focus on utility optimization, jointly considering bandwidth cost and the QoE profits of viewers. Thirdly, this paper provides guidelines for operators to find the optimal bandwidth budget, which is currently set empirically by CVL platforms.

3 System Model

In this section, we first introduce our model of the CVL system, and then describe how to formulate the problem as a constrained optimization problem.


3.1 System Overview

In a generic CVL system, there are three major players: uploaders $u_i \in \mathcal{U}$ (or broadcasters), the CVL platform with multiple datacenters $d_m \in \mathcal{D}$, and viewers. $N = |\mathcal{U}|$ and $M = |\mathcal{D}|$ denote the numbers of uploaders and datacenters, respectively, and $V_i$ represents the number of viewers associated with $u_i$. A typical workflow is as follows: the uploader $u_i$ uploads a live stream with bitrate $r_{i,m}$ to a datacenter $d_m$; the CVL platform then transcodes the video into multiple bitrates no higher than the uploaded bitrate and delivers the video streams to viewers at suitable bitrates. The goal of our design is to achieve the maximum utility of video uploading by balancing viewers' QoE and upload bandwidth cost.

3.2 QoE and Bandwidth Cost

The uploaded video is transcoded by the CVL platform based on network conditions and viewers' devices. To evaluate the impact of a video's upload bitrate on viewers' QoE, we define the Opportunity QoE under upload bitrate $r_i$, which is treated as the upper bound of the actual QoE at the viewers' side (transcoded bitrates cannot exceed the uploaded bitrate). Assume that each viewer must be served with a minimum bitrate $r^{min}$; similar to the QoE model in [6], the opportunity QoE $Q_i^o$ for all viewers watching the video uploaded by $u_i$ is defined as:

$$Q_i^o = Q(r_i, r^{min}) = \ln\left(1 + \frac{r_i}{r^{min}}\right),$$

where the minimum opportunity QoE for each viewer watching the video uploaded by uploader $u_i$ is defined as $Q_i^{min} = Q(r^{min}, r^{min})$.

In our problem, upload bandwidth cost is incurred when uploaders upload videos to datacenters. We define $C_m$ as the bandwidth cost associated with datacenter $d_m$, and adopt a pricing model similar to that of the Google cloud platform [1]. Let $c_m$ denote the unit price of upload traffic in datacenter $d_m$; the bandwidth cost associated with $d_m$ can be defined as $C_m = \sum_{i=1}^{N} r_{i,m} \cdot c_m$. From the perspective of the CVL platform, we define $C = \sum_{m=1}^{M} C_m$ as its total upload bandwidth cost.
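The QoE and cost definitions above can be sketched directly; the bitrates and unit price below are made-up values for illustration:

```python
import math

R_MIN = 0.5  # minimum served bitrate r_min in Mbps (illustrative)

def opportunity_qoe(r):
    """Q_i^o = ln(1 + r_i / r_min): upper bound on per-viewer QoE."""
    return math.log(1.0 + r / R_MIN)

def datacenter_cost(bitrates, unit_price):
    """C_m = sum_i r_{i,m} * c_m for one datacenter."""
    return sum(bitrates) * unit_price

q = opportunity_qoe(3.5)           # an uploader streaming at 3.5 Mbps
q_min = opportunity_qoe(R_MIN)     # QoE floor Q(r_min, r_min) = ln(2)
cost = datacenter_cost([3.5, 2.0], unit_price=0.08)  # made-up $/Mbps
print(round(q, 3), round(q_min, 3), round(cost, 3))
```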

3.3 Problem Formulation

Before the problem formulation, we first introduce some constraints in our model. We denote $r_i^{min}$ and $r_i^{max}$ as the minimum and maximum upload bitrates, respectively. Thereby, we have the following constraints:

$$r_i^{min} \le r_i = \sum_{m=1}^{M} r_{i,m} \le r_i^{max}, \quad \forall r_{i,m} \ge 0, \tag{1}$$

$$\sum_{m_1=1}^{M} \sum_{m_2=1, m_2 \ne m_1}^{M} r_{i,m_1} \cdot r_{i,m_2} \le 0, \quad \forall i. \tag{2}$$

In the above constraints, constraint (1) ensures that each upload bitrate $r_{i,m}$ is non-negative and that the global upload bitrate $r_i$ of uploader $u_i$ neither exceeds the upper bound nor falls below the lower bound. Constraint (2) ensures that an uploader can connect to at most one datacenter at a time. Besides, we assume that each datacenter $d_m$ has bandwidth $b_m \in b$, where $b = \{b_m, \forall m\}$ collects the bandwidths of all datacenters. To guarantee that the bandwidth requirement of the uploaders does not exceed the bandwidth of each datacenter, we introduce the following constraint:

$$\sum_{i=1}^{N} r_{i,m} \le b_m, \quad \forall m. \tag{3}$$

f (r, b) − k ∗ g(r, b),

s.t. (1)(2)(3). where k is a tunable parameter representing the weight of upload bandwidth cost. Besides the total bandwidth cost of all the datacenters is deﬁned as: g(r, b) =

M

Cm .

m=1

We propose to tackle problem P1 by solving two subproblems: 1)Optimization of upload bitrates allocation r with a given bandwidth budget; 2)Optimization of utility with vary bandwidth budgets For the ﬁrst subproblem, given the total upload bandwidth budget b, the major objective of a CVL platform is to maximize the overall QoE of all viewers by determining the upload bitrate and datacenter selection for each uploader, i.e., r = {ri,m , ∀i, ∀m}, which is indeed a bandwidth allocation problem. Considering both eﬀectiveness and fairness, we employ the Nash bargaining solution (NBS ) in game theory to tackle this problem, which was ﬁrstly presented by Mazumdar et al. [11] in communication networks. Besides, we introduce the key concepts of NBS in our scenario according to the game-theoretical optimization frameworks [16]. The N uploaders can be viewed as the players who are competing for given upload bandwidth b in a CVL system. For each viewer in Vi associated with uploader ui , the initial proﬁt is the basic QoE represented as the proﬁt gain represented as (Qoi − Qbi ). Deﬁne the Qbi . We need to maximize N

r

i,m ) ≥ Qbi , ∀i ∈ N, ∀m} and (G, Qbi ) is a bargaining set G = {ri,j | ln(1 + i=1 r min game by supposing G is nonempty. Due to the fact that the numbers of viewers associated with diﬀerent uploaders are distinct, all the players in game have their asynashimmetric weights [4] by adopting the exponentiation of the proﬁt gain, i.e., (Qoi − Qbi )Vi . Intuitively, if an

62

X. Zhang et al.

uploader has more viewers, he (or she) should be allocated with more bandwidth resources. Thus, we can deﬁne the aggregated viewers’ QoE f (r, b) =

N

(Qoi − Qbi )Vi

i=1

as the Nash product and with a mathematical derivation, the ﬁrst subproblem is then formulated as the following Nash bargaining problem: P2 : argmin r

−

N

Vi · ln(Qoi − Qbi ),

i=1

s.t. (1)(2)(3). P2 depicts the joint proﬁt in the bargaining game, represented as the product of the proﬁt gains of all the players, which can be maximized by the Nash bargaining solution.
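To make the weighted bargaining concrete, consider a simplified sketch (not the paper's solver) in which each player's profit gain is linear in the allocated bitrate, i.e., Qo_i − Qb_i is replaced by r_i − r^min, and there is a single budget constraint. The weighted Nash product then has a closed-form maximizer: each uploader receives the disagreement bitrate r^min plus a share of the surplus proportional to its weight V_i. The function name `nbs_allocate` is illustrative.

```python
def nbs_allocate(V, b, r_min):
    """Weighted Nash bargaining split of budget b among players with
    weights V (viewer counts), each guaranteed the disagreement
    bitrate r_min.  Maximizes sum_i V[i] * ln(r_i - r_min) subject to
    sum_i r_i = b, i.e. a simplified model in which the profit gain
    is r_i - r_min rather than ln(1 + r_i) - Qb_i."""
    surplus = b - len(V) * r_min          # bandwidth left after the basic QoE
    total_w = sum(V)
    return [r_min + v / total_w * surplus for v in V]

V = [300, 100, 50, 50]                    # viewer counts act as weights
r = nbs_allocate(V, 20.0, 0.4)
assert abs(sum(r) - 20.0) < 1e-9          # the budget is exhausted
gains = [ri - 0.4 for ri in r]
assert abs(gains[0] / gains[1] - 3.0) < 1e-9  # surplus split proportional to V
```

The KKT condition V_i/(r_i − r^min) = λ forces the surplus shares to be proportional to the weights, which is exactly the "more viewers, more bandwidth" intuition above.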

4 Upload Bitrate Allocation with Bandwidth Constraints

In this section, we tackle the optimization problem P2 defined above. P2 is a mixed-integer convex programming (MICP) problem, which is NP-hard [8]. Therefore, we relax problem P2 into a convex problem P3. We then apply a Lagrangian transformation, dual decomposition, and the subgradient method, and design an effective algorithm to obtain the optimal solution of P3. In addition, we design a heuristic algorithm to obtain a sub-optimal solution for problem P2. With the output of P2, we find the optimal b that maximizes the overall utility and thus solve problem P1.

4.1 Problem Relaxation

The first and third constraints in P2 follow the disciplined convex programming (DCP) ruleset [5], while the second constraint, which ensures that each uploader can connect to at most one datacenter, violates the DCP rules. If we define a binary variable I ∈ R^{N×M}, the second constraint is mathematically equivalent to Σ_{m=1}^{M} I(i, m) ≤ 1, ∀i, which can be regarded as a binary variable constraint. Therefore, P2 can be converted to a mixed-integer convex programming (MICP) problem with the binary variable constraint [7]. Hence, we relax constraint (2) and formulate P3 as follows:

$$\text{P3}: \quad \operatorname*{argmin}_{r} \; -\sum_{i=1}^{N} V_i \ln(Qo_i - Qb_i), \qquad \text{s.t. } (1)(3).$$

In problem P3, each uploader may be connected to more than one datacenter. However, the upload bitrate of each uploader is aggregated from the upload bitrates over all datacenters, i.e., $r_i = \sum_{m=1}^{M} r_{i,m}$, indicating that the eliminated constraint does not affect the QoE of viewers as long as the sum r_i is identical.
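A two-line check illustrates why the relaxation is harmless for viewer QoE, under the assumption (as in the bargaining set) that the QoE of uploader i depends only on the aggregated bitrate through ln(1 + r_i):

```python
import math

def qoe(parts):
    # viewer QoE depends only on the aggregated bitrate r_i = sum_m r_{i,m}
    return math.log(1 + sum(parts))

single = qoe([2.5, 0.0, 0.0])   # uploader connected to a single datacenter
split = qoe([1.0, 1.0, 0.5])    # the same total bitrate spread over three
assert abs(single - split) < 1e-12
```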

Optimizing Uplink Bandwidth Utilization for Crowdsourced Livecast

4.2 UBA Algorithm Design

For the convex problem P3, we note that the constraints on the variable r are linear, so we can apply the method of Lagrange multipliers and dual-based decomposition to solve P3. Therefore, we define the Lagrangian function L(·) associated with P3 under the KKT conditions [5]:

$$L(r, \alpha, \beta, \kappa, \gamma) = -\sum_{i=1}^{N} V_i \ln(Qo_i - Qb_i) - \sum_{i=1}^{N}\sum_{m=1}^{M} \alpha_{i,m} r_{i,m} + \sum_{i=1}^{N} \beta_i \Big(\sum_{m=1}^{M} r_{i,m} - r_i^{\max}\Big) + \sum_{i=1}^{N} \gamma_i \Big(r^{\min} - \sum_{m=1}^{M} r_{i,m}\Big) + \sum_{m=1}^{M} \kappa_m \Big(\sum_{i=1}^{N} r_{i,m} - b_m\Big), \quad (4)$$

where α, β, κ, γ are the dual variables associated with the problem. Setting the derivative of the Lagrangian function to zero, we obtain

$$\nabla L(r^*, \alpha, \beta, \kappa, \gamma) = 0, \quad (5)$$

where r* = {r*_{i,m}, ∀i, ∀m} is the optimal solution of P3. Besides, the Lagrange dual function d(·) corresponding to L(·) is defined as follows:

$$d(\alpha, \beta, \kappa, \gamma) = \inf_{r} L(r, \alpha, \beta, \kappa, \gamma). \quad (6)$$

Note that P3 is a convex problem and the variable r satisfies the KKT conditions [5]. Therefore, we can obtain the dual problem corresponding to P3 with no duality gap. The dual problem can be written in the following form: max d(α, β, κ, γ) = L(r*, α, β, κ, γ), where d(·) is the dual function and L(·) is the Lagrangian function of P3. On the basis of the subgradient algorithm, the iterative expressions for α, β, κ, γ and the partial derivatives of d(α, β, κ, γ) can be obtained directly; they are omitted for space reasons. The iterative algorithm terminates when |d(s+1) − d(s)| ≤ σ, where σ is a very small positive scalar and s is the step number. The subgradient updating laws guarantee that α, β, κ, γ converge to the optimal multipliers α*, β*, κ*, γ* as long as the step size ξ satisfies the diminishing step-size rules [5].

Based on the above formulation, we design an Upload Bitrate Allocation (UBA) algorithm for allocating upload bandwidth to each uploader, whose details are shown in Algorithm 1. In the UBA algorithm, each uploader first maximizes the profits of his (or her) viewers by calculating an optimal upload bitrate in each iteration, based on the upload bitrates of the other uploaders in the previous iteration. When the algorithm converges, all uploaders obtain stable upload bitrates with the highest overall profit, which maps to the key idea of the Nash bargaining solution: no player can profitably deviate given the actions of the other players, and the overall utility is maximized.

After the optimal solution r* of P3 is obtained, we also design a heuristic algorithm to obtain a sub-optimal solution of P2. The details of the heuristic algorithm are shown in lines 7–22 of Algorithm 1; the key idea is to first calculate the sum of bitrates for each uploader and then assign it to a datacenter. If the assignment cannot be satisfied, we sacrifice the uploaders with the smallest numbers of viewers by assigning them the basic bitrate.

64

X. Zhang et al.

Algorithm 1: Upload Bitrate Allocation Algorithm
Require: b = {b_m, ∀m}; r^min, r_i^max, ∀i; V_i, ∀i; ξ; σ.
Ensure: Optimal bandwidth allocated to uploaders: r*
1: Initialize the Lagrangian multipliers; let flag = N, ϵ = 0.01;
2: while |d(s + 1) − d(s)| > σ do
3:   Update r_{i,m}, ∀i, ∀m based on Eq. (5);
4:   Update the step size ξ and the iteration round s ← s + 1;
5:   Update α, β, κ, γ based on the partial derivatives of Eq. (6);
6: end while
7: while Not all uploaders are allocated to a specific datacenter do
8:   Initialization: let r^sub_{i,m} = 0, ∀i, ∀m;
9:   for i = 1 to N do
10:    if i ≤ flag then
11:      r_i = Σ_{m=1}^{M} r*_{i,m};
12:    else
13:      r_i = r^min + ϵ;
14:    end if
15:    for m = 1 to M do
16:      if r_i ≤ (b_m − Σ_{i=1}^{N} r^sub_{i,m}) then
17:        r^sub_{i,m} = r_i; break;
18:      end if
19:    end for
20:  end for
21:  flag = flag − 1;
22: end while
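The heuristic rounding in lines 7–22 can be sketched as follows. This is an illustrative Python sketch, not the paper's implementation: the name `assign_to_datacenters` is chosen here, uploaders are explicitly ranked by viewer count before demotion (the paper's index order is assumed to reflect this), and the sketch assumes the basic bitrates fit once enough uploaders have been demoted.

```python
def assign_to_datacenters(rates, viewers, b, r_min, eps=0.01):
    """Heuristic rounding step (lines 7-22 of Algorithm 1): place each
    uploader's aggregated bitrate r_i = sum_m r*_{i,m} into the first
    datacenter with enough residual capacity; while some uploader
    cannot be placed, demote one more of the least-viewed uploaders to
    the basic bitrate r_min + eps and retry."""
    order = sorted(range(len(rates)), key=lambda i: viewers[i], reverse=True)
    flag = len(rates)
    while flag >= 0:
        alloc = [[0.0] * len(b) for _ in rates]      # r_sub
        used = [0.0] * len(b)
        ok = True
        for rank, i in enumerate(order):
            r_i = rates[i] if rank < flag else r_min + eps
            for m in range(len(b)):
                if r_i <= b[m] - used[m]:
                    alloc[i][m] = r_i
                    used[m] += r_i
                    break
            else:
                ok = False                           # uploader left unplaced
        if ok:
            return alloc
        flag -= 1
    raise ValueError("even the basic bitrates do not fit")

alloc = assign_to_datacenters([3.0, 2.0, 2.0], [100, 50, 10], [4.0, 2.0], 0.4)
# every uploader lands in exactly one datacenter, capacities respected
assert all(sum(1 for x in row if x > 0) == 1 for row in alloc)
assert sum(row[0] for row in alloc) <= 4.0
assert sum(row[1] for row in alloc) <= 2.0
# the least-viewed uploader was demoted to the basic bitrate
assert abs(sum(alloc[2]) - 0.41) < 1e-9
```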

4.3 Bubal: UBA Algorithm with Optimal Bandwidth Budget

In this subsection, we find the optimal b that maximizes the overall utility and solve problem P1 with the output of P2 obtained in the last subsection. We assume that all the datacenters have a maximum bandwidth constraint, denoted as b^max = {b^max_m, ∀m}. Let B = |b| denote the value of the total bandwidth budget. We can derive the lower and upper bounds of B as $b^{l} = N \cdot r^{\min}$ and $b^{up} = \sum_{i=1}^{N} r_i^{\max}$, and we assume |b^max| > b^l. Note that if we assign B a specific value, there exist different b = {b_m, ∀m} that ensure B = |b|. Once the value of B is given, more of the bandwidth budget should be allocated to the datacenters with smaller unit prices of upload traffic. Under the assumption that the datacenters are arranged in ascending order of unit price, i.e., c_1 ≤ c_2 ≤ ··· ≤ c_M, we can determine the unique division b when B is given:

$$b = \Big\{b^{\max}_1, \ldots, b^{\max}_{i-1},\; B - \sum_{j=1}^{i-1} b^{\max}_j,\; 0, 0, \ldots\Big\}, \qquad \sum_{j=1}^{i-1} b^{\max}_j \le B \le \sum_{j=1}^{i} b^{\max}_j, \quad (7)$$


where B is in the range [b^l, b^up]. Besides, assuming that f(·) is a concave function and g(·) is a linear function, there exists an optimal value B* between b^l and b^up that maximizes the overall utility in P1. Therefore, we design our algorithm, called Bubal, to search for the optimal value B* and the related b*. In this algorithm, we iteratively divide the domain of the bandwidth budget into three parts, obtain the utility under the budget constraint at the two division points, and shorten the domain by eliminating the part with lower utility, until the length of the domain is less than a tiny positive value.
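Both ingredients of Bubal can be sketched in a few lines: the price-ordered budget division of Eq. (7) and the ternary search over B. This is an illustrative sketch, not the paper's implementation; the function names are chosen here, datacenters are assumed already sorted by ascending unit price, and the utility is assumed unimodal on the search interval.

```python
import math

def divide_budget(B, b_max):
    """Eq. (7): with datacenters already sorted by ascending unit price,
    fill the cheapest ones to capacity and give the remainder of B to
    the next one; later datacenters receive zero."""
    b, remaining = [], B
    for cap in b_max:
        take = min(cap, remaining)
        b.append(take)
        remaining -= take
    return b

def bubal_search(utility, lo, hi, tol=1e-6):
    """Ternary search over the budget B (the Bubal loop): split [lo, hi]
    in three, drop the third with lower utility, and repeat until the
    interval is tiny.  Assumes utility(B) is unimodal on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if utility(m1) < utility(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0

assert divide_budget(1500, [400, 1000, 600, 2000]) == [400, 1000, 100, 0]
# a concave QoE term minus a linear cost term peaks at B* = 4
B_star = bubal_search(lambda B: math.log(1 + B) - 0.2 * B, 0.0, 10.0)
assert abs(B_star - 4.0) < 1e-3
```

Ternary search needs O(log(1/tol)) utility evaluations, which matches the "shorten the domain by eliminating the part with lower utility" description above.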

5 Performance Evaluation

In this section, we conduct extensive trace-driven simulations based on public Twitch live streaming traces.

5.1 Experimental Settings

In our simulation, we discretize time into slots corresponding to a 5-min interval. We retrieve the public Twitch live streaming dataset from [12] to simulate the behaviors of 1,000 uploaders in each slot. The unit price for the datacenters is set as 0.02 + 0.02m per GB, where m is the index of the datacenter. The minimum and maximum upload bitrates for each uploader are set to 0.4 Mbps and 5 Mbps for simplicity. The maximum bandwidth of the datacenters is set as b^max = [400, 1000, 600, 2000], and we can derive the lower and upper bounds of B* as b^l = 400 and b^up = 5000. In addition, the weight of the bandwidth cost defined in P1 is initialized as 0.5, and we can increase the value of k to emphasize the importance of the bandwidth cost. To evaluate the performance of our proposed algorithm Bubal, we select two alternative strategies as baselines:

– Proportional Allocation (Proportional) [3], in which the upload bandwidth is allocated to each uploader in proportion to its number of viewers. If an uploader has a larger proportion of viewers, he (or she) is allocated more bandwidth.
– Average Allocation (Average), in which we allocate the upload bandwidth among all uploaders evenly.

Figure 1(a) describes the total number of viewers of the 1,000 uploaders over 500 time slots. First, it is obvious that the number of viewers is highly dynamic. Second, there are about 480,000 online viewers of these 1,000 uploaders in the peak period, and the peak-to-valley gap is about 285,000. Figure 1(b) illustrates the CDF of viewers per uploader in time slot 500; nearly 80% of the uploaders have fewer than 100 viewers.
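As a sanity check, the budget bounds and unit prices stated above follow directly from these settings (assuming the datacenter index m starts at 1):

```python
N, r_min, r_max = 1000, 0.4, 5.0
b_max = [400, 1000, 600, 2000]                 # Mbps, per datacenter
prices = [0.02 + 0.02 * m for m in range(1, len(b_max) + 1)]   # $/GB

b_l = N * r_min                                # lower bound b^l = N * r^min
b_up = N * r_max                               # upper bound b^up = sum_i r_i^max
assert abs(b_l - 400) < 1e-9 and abs(b_up - 5000) < 1e-9
assert all(abs(p - e) < 1e-9 for p, e in zip(prices, [0.04, 0.06, 0.08, 0.10]))
```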

Fig. 1. Descriptions of the dataset and some experiment results of the bandwidth budget: (a) total viewers of 1,000 uploaders over 500 slots; (b) CDF of viewers of 1,000 uploaders in the last slot; (c) optimal bandwidth budget with varying k; (d) QoE gain of 1,000 uploaders with varying bandwidth budget.

5.2 Performance Comparison

We first explore the performance of our design compared to the alternative strategies. To evaluate our algorithm under different tradeoff requirements between bandwidth cost and viewer QoE, we conduct simulations with three different values of k (0.5, 0.2, and 0.05); the results are shown in Fig. 2. A higher k means the CVL platform is more sensitive to bandwidth cost. We find that when k = 0.5, which simulates the most cost-sensitive CVL platforms, our proposed Bubal achieves the best overall utility and QoE gain compared to the two baselines while having a slightly higher bandwidth cost. When k = 0.2 and k = 0.05, which respectively simulate moderately cost-sensitive and the least cost-sensitive CVL platforms, Bubal still achieves a better overall utility than both baselines while having a slightly lower QoE gain but the lowest bandwidth cost.

To further explain why Bubal always achieves the best overall utility while not always having the highest QoE gain and the lowest bandwidth cost, we show the detailed impact of the parameter k (varying from 0 to 1) on the bandwidth budget in Fig. 1(c). k = 0 means that the bandwidth cost is ignored and all uploaders can be allocated the maximum upload bitrate. As k increases, the bandwidth budget B* decreases because CVL platforms care more about the bandwidth cost. As such, the three algorithms perform very differently under various bandwidth budgets, as shown in Fig. 1(d). When the bandwidth budget is low, the slope of the QoE gain with Bubal is the highest, which means Bubal can achieve a high QoE-gain benefit. Therefore, as shown in Fig. 2(a), when k = 0.5, Bubal achieves the best QoE gain while having a slightly higher bandwidth cost. When the bandwidth budget is higher, the slopes of the QoE gain with Proportional and Average are the steepest, successively. Therefore, Proportional and Average achieve the best QoE gain while having the highest bandwidth cost in Fig. 2(b) and 2(c), respectively.

Fig. 2. The overall utility, QoE gain and bandwidth cost for different k: (a) the most cost-sensitive CVL platforms (k = 0.5); (b) the moderately cost-sensitive CVL platforms (k = 0.2); (c) the least cost-sensitive CVL platforms (k = 0.05).

6 Conclusion

In this paper, we focus on the utility optimization of upload bandwidth from the perspective of CVL platform operators. We adopt the Nash bargaining solution (NBS) optimization framework to ensure effectiveness and fairness, and design an effective low-complexity algorithm to determine the optimal bandwidth budget, upload bitrates, and datacenter selection. Finally, trace-driven simulations show that our design can significantly improve the overall utility. In future work, we plan to study the impact of dynamic network conditions. In addition, we plan to consider more live streaming scenarios and more complicated network structures.

References
1. Google Cloud Platform pricing. https://cloud.google.com/pricing/. Accessed 2020
2. Twitch revenue and usage statistics. https://www.businessofapps.com/data/twitch-statistics/. Accessed 2020
3. Abdel-Hadi, A., Clancy, C.: A utility proportional fairness approach for resource allocation in 4G-LTE. In: 2014 International Conference on Computing, Networking and Communications (ICNC), pp. 1034–1040. IEEE (2014)
4. Boche, H., Schubert, M., Vucic, N., Naik, S.: Non-symmetric Nash bargaining solution for resource allocation in wireless networks and connection to interference calculus. In: 2007 15th European Signal Processing Conference, pp. 1317–1321. IEEE (2007)
5. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
6. He, J., Wen, Y., Huang, J., Wu, D.: On the cost-QoE tradeoff for cloud-based video streaming under Amazon EC2's pricing models. IEEE Trans. Circ. Syst. Video Technol. 24(4), 669–680 (2014)
7. Lubin, M., Yamangil, E., Bent, R., Vielma, J.P.: Extended formulations in mixed-integer convex programming. In: Louveaux, Q., Skutella, M. (eds.) IPCO 2016. LNCS, vol. 9682, pp. 102–113. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-33461-5_9
8. Lubin, M., Zadik, I., Vielma, J.P.: Mixed-integer convex representability. In: Eisenbrand, F., Koenemann, J. (eds.) IPCO 2017. LNCS, vol. 10328, pp. 392–404. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59250-3_32
9. Luo, Z., et al.: CrowdSR: enabling high-quality video ingest in crowdsourced livecast via super-resolution. In: Lutu, A., Simon, G., Farias, M.C.Q. (eds.) Proceedings of the 31st ACM Workshop on Network and Operating Systems Support for Digital Audio and Video, NOSSDAV 2021, pp. 90–97. ACM (2021)
10. Ma, Y., Xu, C., Chen, X., Xiao, H., Zhong, L., Muntean, G.M.: Fairness-guaranteed transcoding task assignment for viewer-assisted crowdsourced livecast services. In: ICC 2021 - IEEE International Conference on Communications, pp. 1–6 (2021)
11. Mazumdar, R., Mason, L.G., Douligieris, C.: Fairness in network optimal flow control. In: SBT/IEEE International Symposium on Telecommunications, ITS 1990 Symposium Record, pp. 590–596. IEEE (1990)
12. Pires, K., Simon, G.: DASH in Twitch: adaptive bitrate streaming in live game streaming platforms. In: Proceedings of the 2014 Workshop on Design, Quality and Deployment of Adaptive Video Streaming, pp. 13–18. ACM (2014)
13. Wang, F., Liu, J., Zhang, C., Sun, L., Hwang, K.: Intelligent edge learning for personalized crowdsourced livecast: challenges, opportunities, and solutions. IEEE Netw. 35(1), 170–176 (2021)
14. Wang, F., et al.: DeepCast: towards personalized QoE for edge-assisted crowdcast with deep reinforcement learning. IEEE/ACM Trans. Netw. 28, 1255–1268 (2020)
15. Wang, X., Tian, Y., Lan, R., Yang, W., Zhang, X.: Beyond the watching: understanding viewer interactions in crowdsourced live video broadcasting services. IEEE Trans. Circ. Syst. Video Technol. 29(11), 3454–3468 (2018)
16. Yaïche, H., Mazumdar, R.R., Rosenberg, C.: A game theoretic framework for bandwidth allocation and pricing in broadband networks. IEEE/ACM Trans. Netw. 8(5), 667–678 (2000)
17. Yi, J., Luo, S., Yan, Z.: A measurement study of YouTube 360° live video streaming. In: Proceedings of the 29th ACM Workshop on Network and Operating Systems Support for Digital Audio and Video, pp. 49–54 (2019)
18. Zhang, C., Liu, J., Wang, H.: Towards hybrid cloud-assisted crowdsourced live streaming: measurement and analysis. In: Proceedings of the 26th International Workshop on Network and Operating Systems Support for Digital Audio and Video, p. 1. ACM (2016)
19. Zhang, C., Liu, J., Wang, Z., Sun, L.: Look ahead at the first-mile in livecast with crowdsourced highlight prediction. In: IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, pp. 1143–1152. IEEE (2020)

A Batched Jacobi SVD Algorithm on GPUs and Its Application to Quantum Lattice Systems

Rongfeng Huang(1,2), Tianyu Yu(1,2), Shifang Liu(1,2), Xinyin Zhang(1,2), and Yonghua Zhao(1)

1 Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
[emailprotected]
2 University of Chinese Academy of Sciences, Beijing, China

Abstract. Batched linear algebra problems are becoming increasingly important in engineering and scientific applications. As the performance of graphics processing units (GPUs) improves rapidly, GPUs are very attractive for solving this class of problems. This paper presents a parallel blocked Jacobi SVD algorithm for many small matrices on GPUs. The parallelism of the Jacobi algorithm is sufficiently exploited, and our algorithm can be mapped onto the GPU memory hierarchy properly due to its blocking structure. Reduction operations, which are used for computing inner products and have low thread utilization, are replaced by performing the Jacobi rotation on the Gram matrix in parallel. We identify kernels that share data and fuse them to improve memory locality by placing shared data, originally passed via off-chip global memory, into the on-chip shared memory. Numerical results on an NVIDIA Tesla V100 GPU show that our batched SVD routine outperforms state-of-the-art approaches by between 2.0× and 4.1× for the examples tested. As one application of our routine, the numerical simulation of quantum lattice systems is tested and achieves a maximum speedup of 54.1× over a CPU implementation running on a 48-core Xeon CPU.

Keywords: Batched execution · SVD · Blocked algorithms · Kernel fusion · GPU

1 Introduction

Batched linear algebra problems are to solve many independent problems simultaneously. When the matrices are large enough to take full advantage of the computing resources of the device, these independent problems are preferably solved serially for better data locality and reuse, and thus there is no need for batched routines. However, when the matrices are small, such as matrices of size no more than 512, the workload of a single matrix cannot saturate the device, especially a GPU. To this end, many matrices should be solved together, and batched routines are required. Up to now there are many batched linear algebra routines, such as batched general matrix-matrix multiplication (GEMM), batched Cholesky factorization, batched lower-upper (LU) factorization, and batched singular value decomposition (SVD), to name a few. These routines are widely used in machine learning, computer vision, astrophysics, and other fields [1–4].

The development of routines for computing batches of small matrices is relatively easy on multicore CPUs. For example, a combination of OpenMP and highly optimized LAPACK/BLAS libraries (such as MKL and OpenBLAS) usually obtains good performance, since most of the computation can be performed through the fast CPU cache. However, the development is not as straightforward on GPUs due to the lack of large caches. Batched GEMM may be the most basic operation in dense linear algebra, probably because many other batched routines call it. Many vendors provide batched GEMM implementations on their devices [5,6] to satisfy the growing demand from different fields. The University of Tennessee also provides an implementation of batched GEMM in the open-source package MAGMA, both on CPUs and GPUs. There are also many works focusing on batched LAPACK routines on GPUs [7]. For example, Dong et al. [8] presented different implementations of batched Cholesky factorizations. Abdelfattah et al. [9] demonstrated a high-performance routine for batched LU factorization with partial pivoting. Unlike the Cholesky and LU factorizations, the SVD algorithm is iterative, so the computations of all matrices can hardly terminate at the same time. After most of the matrices converge, the remaining matrices cannot fully utilize the streaming multiprocessors on GPUs.

This work is supported by the National Key Research and Development Program of China (2017YFB0202202) and the Strategic Priority Research Program of the Chinese Academy of Sciences (XDC05000000).
© Springer Nature Switzerland AG 2022
H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 69–80, 2022. https://doi.org/10.1007/978-3-030-96772-7_7
As a result, it is more challenging to design batched SVD algorithms on GPUs. The cuSOLVER library [10] released by NVIDIA supports batched SVD decompositions only for matrices of size no more than 32 × 32. Dong et al. [11] presented a method to accelerate the SVD bi-diagonalization stage of a batch of small matrices using GPUs; however, the subsequent SVD diagonalization stage remains unresolved. Badolato et al. [12] used each thread within a warp to compute the SVD of a single matrix; the algorithm implemented in their work was a conventional non-blocked Jacobi algorithm. Boukaram et al. [13] presented batched SVD routines and used them for the compression of hierarchical matrices. For matrices of size no more than 64 × 64, whole matrices were loaded into the registers or the shared memory, and thus good performance was achieved. For matrices larger than 64 × 64, because the registers and the shared memory cannot hold the entire matrices, the blocked Jacobi SVD algorithm was employed. But the blocked Jacobi rotations in a single matrix were conducted serially, so the GPUs could not be saturated after some matrices terminated early. These batched SVD routines were integrated into the KBLAS library [14]. In this paper, we present an optimized routine for batched SVD decomposition on GPUs and its applications. We summarize our contributions as follows:


(1) We design a parallel blocked Jacobi SVD algorithm, as well as efficient implementations and optimization techniques. In our design, the blocked Jacobi rotations within a single matrix are conducted concurrently; this is the main difference between our work and previous works.
(2) We replace reduction operations, which have low thread utilization, by performing the Jacobi rotation on the Gram matrix in parallel.
(3) We show an application of our work by accelerating quantum lattice simulations.

The remainder of this paper is organized as follows. Section 2 introduces the algorithmic background. In Sect. 3, efficient implementations and optimization techniques are presented. Section 4 provides the experimental results and analysis. Section 5 shows the acceleration of the quantum lattice simulation. Section 6 concludes this paper and outlines future work.

2 Algorithmic Background

Given an m × n real matrix A, the SVD decomposition of A is to find an m × m orthonormal matrix U, an m × n diagonal matrix Σ, and an n × n orthonormal matrix V such that

$$A = U \Sigma V^{T}. \quad (1)$$

The columns of U and V are called the left singular vectors and the right singular vectors, respectively. The diagonal entries of Σ are called the singular values and are sorted in decreasing order.

2.1 Jacobi Algorithms

Algorithm 1 describes the canonical one-sided Jacobi SVD algorithm. The algorithm repeatedly orthogonalizes pairs of columns in sweeps using Jacobi rotations until all columns are mutually orthogonal up to machine precision. The process in which every pair of columns is orthogonalized once is called a sweep, so a sweep includes n(n − 1)/2 pairs of columns. There are many methods to generate all pairs of columns in a sweep. The classical methods are the row-cyclic ordering method and the column-cyclic ordering method; unfortunately, these two methods result in poor parallelism. One of the strengths of Jacobi SVD algorithms is parallelism: as long as the picked pairs share no columns, steps 3 and 4 can be performed simultaneously for different pairs. The two most common methods for generating all pairs of columns in a sweep that are suitable for parallel calculation are the round-robin method [15] and the odd-even method [16]. Although the parallelism of the round-robin method is superior to that of the odd-even method, the round-robin method does not converge for some particular matrices [17]. Hence, the odd-even method is adopted because it converges for all matrices [17].

As shown in Algorithm 1, the Jacobi rotation is the core module. Let a_p and a_q be the pth and qth columns of matrix A, respectively; then the 2 × 2 Jacobi rotation matrix J^{p,q} can be obtained from closed-form formulas depending on the inner products of a_p and a_q [15]. The off-norm of a matrix in Algorithm 1 is defined as the Frobenius norm of a new matrix which is equal to the initial matrix except


Algorithm 1: Jacobi algorithms
Input: A
Output: U, Σ, V
1: while off(A^T A) > ε do
2:   foreach (p, q) in {All pairs of columns in a sweep} do
3:     Calculate the Jacobi rotation matrix J^{p,q}
4:     Conduct the Jacobi rotation: [a_p a_q] = [a_p a_q] · J^{p,q}   /* BLAS-1 operation */
5:   end
6: end
7: Calculate the singular values Σ: Σ = √(A^T A)
8: Calculate the left singular vectors U: U = AΣ^{-1}
9: Calculate the right singular vectors V: V = Σ^{-1} U^T A
10: Sort the singular values and move the corresponding singular vectors if necessary

that the diagonal elements are all zero. In addition, ε is the convergence tolerance. As the Jacobi rotation is a memory-bounded BLAS-1 operation, Algorithm 1 is also memory-bounded.

2.2 Parallel Blocked Jacobi Algorithms

Blocked versions of algorithms accumulate several BLAS-1 or BLAS-2 operations into a BLAS-3 operation. Therefore, the blocked versions are compute-intensive and perform well on modern machines. In the blocked SVD algorithms, the matrix A is divided into panels, i.e., A = [A_1 A_2 ··· A_K]. For simplicity, we assume that K is even and that the sizes of all panels are equal to NB. The blocked SVD algorithms follow a workflow similar to Algorithm 1 and also iterate sweep by sweep until convergence. In each iteration, a pair of panels is picked and the blocked Jacobi rotation is conducted. However, the computation of the blocked Jacobi rotation matrix BJ^{p,q} is more complicated, and iterative algorithms must be used. In [13], two algorithms are used: one computes the SVD decomposition of the Gram matrix, i.e., the inner product of the picked panels; the other first carries out a QR factorization on the picked panels and then applies the SVD decomposition to the upper triangular matrix arising from the QR factorization. The former achieved higher performance than the latter. In this paper, we obtain the blocked Jacobi rotation matrix by conducting Algorithm 1 on the picked panels directly. However, a trivial implementation of Algorithm 1 results in poor performance because the calculations of the inner products are reduction operations and cannot make good use of the GPU's numerous cores. Our approach is to update the inner products of the matrix columns in parallel; the details are presented in Algorithm 2. In fact, the Gram matrix is exactly the matrix of inner products of the columns. Except for the initial step (step 4 in Algorithm 2), the costly iterative process (steps 7 to 9 in Algorithm 2) does not visit the picked


Algorithm 2: Parallel blocked Jacobi algorithms
Input: A
Output: U, Σ, V
1: while off(A^T A) > ε do
2:   foreach (p, q) in {All pairs of panels in a sweep} do parallel
       /* Steps 3 to 11 are essentially Algorithm 1 */
3:     Initialize BJ^{p,q} as an identity matrix I: BJ^{p,q} = I
4:     Calculate the Gram matrix H^{p,q}: H^{p,q} = [A_p A_q]^T · [A_p A_q]
5:     while off(H^{p,q}) > ε do
6:       foreach (p_in, q_in) in {All pairs of columns in a sweep} do parallel
7:         Calculate the Jacobi rotation matrix J^{p_in,q_in}
8:         Conduct the Jacobi rotation on H^{p,q}: H^{p,q}[p_in q_in] = H^{p,q}[p_in q_in] · J^{p_in,q_in}; (H^{p,q}[p_in q_in])^T = (H^{p,q}[p_in q_in])^T · J^{p_in,q_in}
9:         Conduct the Jacobi rotation on BJ^{p,q}: BJ^{p,q}[p_in q_in] = BJ^{p,q}[p_in q_in] · J^{p_in,q_in}
10:      end
11:    end
12:    Conduct the blocked Jacobi rotation on [A_p A_q]: [A_p A_q] = [A_p A_q] · BJ^{p,q}   /* BLAS-3 operation */
13:  end
14: end
15: Calculate Σ, U, and V in a similar way as in Algorithm 1
16: Sort the singular values and move the corresponding singular vectors if necessary

panels, which reside in global memory. On the other hand, there are no reduction operations in steps 7 to 9. As a result, the performance is greatly improved. Similar to Algorithm 1, the blocked SVD algorithms are also suited to parallel execution: as long as the picked panels are disjoint, the blocked Jacobi rotations can be conducted simultaneously. On the other hand, we employ the parallel Algorithm 1 to obtain the blocked Jacobi rotation matrix. Therefore, a parallel blocked Jacobi SVD algorithm (Algorithm 2) enjoying two-tier parallelism can be developed.
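The rotation machinery shared by Algorithms 1 and 2 can be sketched in plain Python. This is an illustrative serial sketch, not the authors' GPU implementation: it uses a simple cyclic pair ordering rather than the odd-even ordering, computes each rotation from the 2 × 2 Gram entries of the picked pair (the same quantities Algorithm 2 keeps in the Gram matrix), and assumes m ≥ n and full column rank.

```python
import math

def jacobi_svd(A, eps=1e-12, max_sweeps=30):
    """One-sided (Hestenes) Jacobi SVD for an m x n matrix A (list of
    rows, m >= n, full column rank assumed).  Columns are repeatedly
    orthogonalized in sweeps; each rotation is computed from the 2x2
    Gram entries of the picked pair.  A cyclic pair ordering is used
    for simplicity; the paper adopts the odd-even ordering so that
    disjoint pairs can rotate in parallel."""
    m, n = len(A), len(A[0])
    W = [row[:] for row in A]               # working copy, converges to U*Sigma
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = 0.0                           # largest relative off-diagonal seen
        for p in range(n - 1):
            for q in range(p + 1, n):
                app = sum(W[i][p] ** 2 for i in range(m))
                aqq = sum(W[i][q] ** 2 for i in range(m))
                apq = sum(W[i][p] * W[i][q] for i in range(m))
                off = max(off, abs(apq) / math.sqrt(app * aqq))
                if abs(apq) <= eps * math.sqrt(app * aqq):
                    continue
                # 2x2 Jacobi rotation zeroing the off-diagonal Gram entry
                zeta = (aqq - app) / (2.0 * apq)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for Mat in (W, V):          # apply [c s; -s c] on the right
                    for i in range(len(Mat)):
                        xp, xq = Mat[i][p], Mat[i][q]
                        Mat[i][p] = c * xp - s * xq
                        Mat[i][q] = s * xp + c * xq
        if off < eps:                       # off(A^T A) convergence test
            break
    sigma = [math.sqrt(sum(W[i][j] ** 2 for i in range(m))) for j in range(n)]
    U = [[W[i][j] / sigma[j] for j in range(n)] for i in range(m)]
    # sort singular values in decreasing order, permuting U and V columns
    order = sorted(range(n), key=lambda j: -sigma[j])
    U = [[U[i][j] for j in order] for i in range(m)]
    V = [[V[i][j] for j in order] for i in range(n)]
    return U, [sigma[j] for j in order], V

U, s, V = jacobi_svd([[3.0, 1.0], [1.0, 3.0]])
assert abs(s[0] - 4.0) < 1e-8 and abs(s[1] - 2.0) < 1e-8
# reconstruct A = U * diag(s) * V^T
recon = [[sum(U[i][k] * s[k] * V[j][k] for k in range(2)) for j in range(2)]
         for i in range(2)]
assert all(abs(recon[i][j] - [[3.0, 1.0], [1.0, 3.0]][i][j]) < 1e-8
           for i in range(2) for j in range(2))
```

Because W = A·V after the sweeps, the column norms of W give Σ and its normalized columns give U, exactly as in steps 7–9 of Algorithm 1.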

3 Design Details

We assume that all matrices have been stored in the global memory of the GPU and that the matrix elements are aligned in a column-major manner. The representative CUDA is used as the programming model for GPUs; however, our ideas are also applicable to other programming models such as OpenCL and HIP.

74

3.1

R. Huang et al.

Overall Design

The previous work depended on the independence of the batched execution. For a single matrix, the serial block Jacobi algorithm was used. As a result, the onedimensional grid was a good choice. Unlike the Cholesky factorization and LU factorization, the Jacobi SVD algorithm is iterative, so the computation of all matrices can terminate at the same time barely. After some matrices terminate early, the remaining matrices cannot utilize the computing resources of GPUs eﬀectively. This problem can be overcome by conducting the parallel block Jacobi algorithm for a single matrix. Therefore, the two-dimensional grid is more favorable. The ﬁrst dimension of the grid is n/(2N B). The second dimension of the grid is equal to the number of matrices. Then each matrix has a unique ID, and all matrices are factorized concurrently. Because matrices are two-dimensional, the two-dimensional thread block is an appropriate candidate, and the size of the thread block is (2N B, N B). We use a thread block to conduct the blocked Jacobi rotation of a pair of panels and have a 1:2 map between threads and the elements of the Gram matrices. One of the advantages of CUDA is that users can program on the L1 cache through the on-chip shared memory. The shared memory has the same physical structure and the same bandwidths as the L1 cache, but lower access latency than the L1 cache. It is a pity that the shared memory is a precious resource and the size is no more than 48KB for most GPUs. In our design, the panel size N B is chosen such that the shared memory can hold the Gram matrix which is a 2N B × 2N B matrix. In that way, the costly iterative process (steps 7 to 9 in Algorithm 2) can be performed on the shared memory entirely, so high performance can be achieved. In addition, a column of threads is used to perform steps 7 to 9 in Algorithm 2 for a pair of columns. 
Calculating the Gram matrix (step 4 in Algorithm 2) and conducting the blocked Jacobi rotation (step 12 in Algorithm 2) can be fulfilled by batched GEMM routines. There are many excellent batched GEMM routines to use directly. However, in our scenario, the two panels picked are not contiguous in memory, so calling existing batched GEMM routines would require memory copies, thereby degrading performance. New solutions must therefore be proposed. Our implementation of batched GEMM is similar to MAGMA's but is equipped with a flexible interface. Based on this flexible interface, we just need to shift the pointers to access the picked panels at non-contiguous addresses when performing GEMM, and thus avoid unnecessary memory copies. Well-known techniques such as efficient off-chip memory access and double-buffering are used to optimize the efficiency of the matrix computations [18].
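The pointer-shifting idea can be emulated at a high level: the Gram matrix of a non-adjacent panel pair is assembled from views into the stored matrix (the analogue of shifted pointers), never concatenating the two panels. This is our illustration, not the authors' CUDA code; the function name and panel layout are assumptions:

```python
import numpy as np

def gram_of_panel_pair(A, p, q, nb):
    """Gram matrix G = [Ap Aq]^T [Ap Aq] of two (possibly non-adjacent)
    NB-wide panels, built from views of A -- no copy of the panels."""
    Ap = A[:, p*nb:(p+1)*nb]   # basic slices are views, like shifted pointers
    Aq = A[:, q*nb:(q+1)*nb]
    G = np.empty((2*nb, 2*nb))
    G[:nb, :nb] = Ap.T @ Ap
    G[:nb, nb:] = Ap.T @ Aq
    G[nb:, :nb] = G[:nb, nb:].T   # Gram matrix is symmetric
    G[nb:, nb:] = Aq.T @ Aq
    return G

rng = np.random.default_rng(0)
A = np.asfortranarray(rng.standard_normal((8, 8)))  # column-major, 4 panels of width 2
G = gram_of_panel_pair(A, 0, 3, 2)                  # panels 0 and 3 are not adjacent
ref = np.hstack([A[:, 0:2], A[:, 6:8]])
assert np.allclose(G, ref.T @ ref)
```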

3.2 Kernel Optimization

A Batched Jacobi SVD Algorithm on GPUs

For designing batched algorithms on GPUs, higher performance can be delivered by improving data reuse. A common way to improve reuse is to fuse kernels. Kernel fusion [19] is employed in this paper not only to decrease the overhead of launching kernels but also to improve memory locality by placing the data shared by multiple kernels, originally passed via off-chip global memory, into the on-chip shared memory. We fuse all kernels corresponding to steps 3 to 12 of Algorithm 2 into a single kernel, which brings the following two benefits. First and foremost, different kernels cannot share the register files and the shared memory, so the global memory, which has the greatest access latency, must be used to exchange data. As noted before, the Gram matrix resides in the shared memory; launching only one kernel avoids reading and writing the Gram matrix from and to the global memory. Second, every kernel launch incurs overhead, so it is advantageous to decrease the number of kernels.

3.3 Convergence Criterion

Since the rates of convergence of different matrices are not equal, keeping track of every matrix is necessary to identify the unconverged matrices. For our batched SVD routine, we exploit the highly optimized batched GEMM routine from cuBLAS and some auxiliary routines (which are not difficult to develop, so the details are omitted for brevity) to calculate the off-norms (step 1 in Algorithm 2). Then, we find the maximum value of the off-norms over all matrices. The algorithm terminates when the maximum value is less than the given tolerance ε. Another convergence criterion, i.e., step 5 in Algorithm 2, which is applied within a single thread block, can be implemented in a more intelligent way: we use a root thread to calculate the off-norm and broadcast it to the other threads through the shared memory. The iterative process comes to an end when the off-norm is less than the given tolerance ε.
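A minimal sketch of this two-level convergence test (our illustration only; the GPU version computes off-norms with cuBLAS and broadcasts via shared memory):

```python
import math

def off_norm(G):
    """Frobenius norm of the off-diagonal part of a square matrix G."""
    n = len(G)
    return math.sqrt(sum(G[i][j]**2
                         for i in range(n) for j in range(n) if i != j))

def all_converged(gram_list, eps):
    """Batched criterion: stop when the largest off-norm is below eps."""
    return max(off_norm(G) for G in gram_list) < eps

diag = [[2.0, 0.0], [0.0, 3.0]]
near = [[2.0, 1e-9], [1e-9, 3.0]]
assert all_converged([diag, near], eps=1e-6)
assert not all_converged([[[2.0, 0.5], [0.5, 3.0]]], eps=1e-6)
```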

4 Experimental Results and Analysis

4.1 Experimental Setup

We conduct our experiments on a computing platform with a CPU and a GPU. The CPU is a 2.4 GHz Intel Xeon Gold 6240R with a total of 48 cores, and the GPU is an NVIDIA V100-PCIe-32GB. The CUDA version used in this paper is 11.1. Our batched SVD routine is developed in C without any use of low-level instruction sets. The following two types of synthetic matrices are used to test the performance of our batched SVD routine.

– Type 1: All matrix elements are generated randomly from the uniform distribution U(0, 1).
– Type 2: The matrices are first arranged as upper triangular matrices, and the diagonal elements are set to n. The remaining matrix elements are generated randomly from the uniform distribution U(0, 1). Matrices of this type are row diagonally dominant.

Figure 1 shows the convergent sweeps of the two types of matrices for our batched SVD routine. The sizes of the matrices range from 96 to 512. For each size, 200 matrices


are tested, and the convergent sweeps shown in Fig. 1 are the maximum sweeps among the 200 matrices. It is observed that the convergent sweeps of type 1 are larger than those of type 2 for all sizes.

Fig. 1. Convergent sweeps for different types of matrices (sweeps versus matrix size).

Fig. 2. Performance using kernel fusion (Gflops/s versus matrix size, for types 1 and 2 with fused and unfused versions).

4.2 Performance Analysis

To verify the impact of kernel fusion, we also develop an unfused version in which three kernels are launched for step 4, steps 5 to 11, and step 12, respectively. The Gram matrix and the blocked Jacobi rotation matrix shared by different kernels are transmitted through the global memory. Figure 2 shows the test results of the two versions using 200 matrices, all drawn from the same type (type 1 or type 2). It can be seen that kernel fusion achieves performance increments ranging from 6% to 10%. Moreover, the benefits of kernel fusion grow as the matrices become larger, because more kernels are launched for larger matrices in the unfused version.

In the following, we compare our routine with kernel fusion against the KBLAS library. There are two batched SVD routines in KBLAS: the Gram SVD routine and the direct SVD routine. Since the performance of the Gram SVD routine is higher than that of the direct SVD routine, we compare our routine with the Gram SVD routine. First, we generate matrices in varying amounts for testing. In this test, all matrices are from type 1, and the results are shown in Fig. 3(a). Second, 200 matrices mixed from the two types according to the ratio of type 1 to type 2 are generated for testing. The ratios are 2:8 and 8:2. Figure 3(b) presents the results. It is obvious that our routine outperforms KBLAS for all test cases, scoring speedups ranging from 2.0× up to 4.1×. Moreover, KBLAS performs poorly for small numbers of matrices, whereas our routine still achieves considerable performance. In all tests, the singular values, the left singular vectors, and the right singular vectors are all computed.

Fig. 3. Performance comparison of our routine versus the KBLAS (Gflops/s versus matrix size): (a) matrices with varying amounts; (b) matrices mixed with different types.

Four quantities err_σ, err_U, err_V and err_D, defined in (2) and (3), are used to depict the accuracy of SVD implementations. The σ_i^{ex} are the exact singular values. In practice, the exact singular values of random matrices are unavailable, so we use the SVD routines in LAPACK to acquire reference singular values instead. The Σ̂, Û and V̂ are the calculated singular values, the calculated left singular vectors, and the calculated right singular vectors, respectively. Table 1 gives the numerical error of our routine and the KBLAS. We can see that our routine is comparable to the KBLAS in err_σ. Nevertheless, our routine is slightly inferior to the KBLAS in the other three quantities. As shown in Sect. 5, the accuracy is sufficient for practical applications.

$$\mathrm{err}_{\sigma} = \max_{1\le i\le n}\frac{|\sigma_i^{ex}-\hat{\sigma}_i|}{\max(1,\sigma_i^{ex})},\qquad \mathrm{err}_{D} = \frac{\|A-\hat{U}\hat{\Sigma}\hat{V}^{T}\|_F}{\sqrt{n}\,\|A\|_F} \tag{2}$$

$$\mathrm{err}_{U} = \frac{\|I-\hat{U}^{T}\hat{U}\|_F}{\sqrt{n}},\qquad \mathrm{err}_{V} = \frac{\|I-\hat{V}^{T}\hat{V}\|_F}{\sqrt{n}} \tag{3}$$
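The four accuracy metrics of Eqs. (2)-(3) can be computed as in the following sketch (ours; `svd_errors` and its arguments are illustrative names, not the paper's code):

```python
import numpy as np

def svd_errors(A, U, S, Vt, sigma_ref):
    """err_sigma, err_U, err_V, err_D as in Eqs. (2)-(3)."""
    n = A.shape[1]
    err_sigma = np.max(np.abs(sigma_ref - S) / np.maximum(1.0, sigma_ref))
    err_U = np.linalg.norm(np.eye(U.shape[1]) - U.T @ U) / np.sqrt(n)
    err_V = np.linalg.norm(np.eye(Vt.shape[0]) - Vt @ Vt.T) / np.sqrt(n)
    err_D = np.linalg.norm(A - (U * S) @ Vt) / (np.sqrt(n) * np.linalg.norm(A))
    return err_sigma, err_U, err_V, err_D

rng = np.random.default_rng(1)
A = rng.standard_normal((64, 64))
U, S, Vt = np.linalg.svd(A)
errs = svd_errors(A, U, S, Vt, S)   # reference equals computed, so err_sigma is 0
assert errs[0] == 0.0 and all(e < 1e-12 for e in errs[1:])
```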

5 Application to Quantum Lattice Systems

One of the applications built on top of the batched SVD routine presented in this paper is the numerical simulation of quantum lattice systems, which are very important in modern condensed matter physics. We only show the application of our proposed routine to one-dimensional systems; the extension to multi-dimensional systems is straightforward. For one-dimensional quantum lattice systems, an optimal tensor network is the matrix product state (MPS). Reviewing and analyzing MPS algorithms is beyond the scope of this paper. Here we focus on the narrow task of calculating the time evolution of one-dimensional quantum lattice systems by the time-evolving block decimation (TEBD) algorithm. We give only a brief description of the TEBD algorithm; the details can be found in [20].

Table 1. Numerical error of our routine and the KBLAS.

| Matrix dimension | Our err_σ | Our err_U | Our err_V | Our err_D | KBLAS err_σ | KBLAS err_U | KBLAS err_V | KBLAS err_D |
|---|---|---|---|---|---|---|---|---|
| 96 | 2.4E–17 | 1.9E–15 | 6.4E–14 | 1.1E–15 | 3.4E–17 | 4.3E–16 | 4.0E–14 | 4.9E–16 |
| 128 | 7.1E–17 | 2.1E–15 | 1.1E–13 | 2.4E–15 | 5.8E–17 | 5.2E–16 | 4.0E–14 | 5.2E–16 |
| 192 | 3.4E–17 | 2.5E–15 | 1.2E–13 | 3.0E–15 | 4.4E–17 | 6.1E–16 | 4.0E–14 | 6.5E–16 |
| 256 | 3.9E–18 | 2.5E–15 | 1.1E–13 | 1.4E–15 | 3.5E–17 | 7.2E–16 | 7.1E–14 | 7.7E–16 |
| 320 | 5.0E–17 | 3.2E–15 | 8.0E–13 | 2.4E–15 | 2.8E–17 | 8.2E–16 | 4.0E–13 | 8.7E–16 |
| 384 | 3.6E–17 | 3.2E–15 | 5.7E–12 | 1.9E–15 | 1.2E–17 | 9.0E–16 | 2.3E–12 | 9.7E–16 |
| 448 | 7.7E–18 | 3.5E–15 | 3.8E–13 | 4.6E–15 | 1.7E–17 | 8.8E–16 | 1.9E–13 | 9.3E–16 |
| 512 | 8.8E–17 | 3.8E–15 | 2.8E–13 | 3.1E–15 | 3.9E–17 | 9.6E–16 | 1.2E–13 | 1.0E–15 |

Table 2. The SVD routine time ratio of the TEBD algorithm.

| Matrix dimension | Percentage |
|---|---|
| 96 | 70.6% |
| 128 | 74.1% |
| 256 | 90.6% |
| 512 | 95.9% |

There are two main steps in the TEBD algorithm. The first step is to contract the two-site operators and the tensors in the MPS representation, which results in a non-MPS representation. To recover the MPS representation, each tensor needs to be split into two tensors. To achieve this, the tensors are reshaped to matrices, and SVD routines for matrices are employed; this is the second step.

As a benchmark, we first implement the TEBD algorithm using the TNSPackage [21] (version 3.5.8), a highly optimized Fortran 2003 library for tensor network state methods. The one-dimensional quantum lattice system used in this paper is the Heisenberg model. The number of quantum lattices is 100, and the physical degree of freedom on each site is 2. The initial time step is 0.1, and the time step is divided by 5 after every ten steps. The total number of steps is 50, so the minimum time step is 0.00016. The experimental environment in this section is the same as in Sect. 4. Table 2 shows the running time percentage of the SVD routine in the TEBD algorithm implemented with the TNSPackage. We can see that the ratios grow with the matrix dimension, reaching 95.9% when the matrix dimension is 512. Therefore, accelerating the SVD routine is critical for improving the performance of the TEBD algorithm.

In the TEBD algorithm, the splits of the tensors are independent, so the splitting can be conducted simultaneously. To utilize our proposed batched SVD routine, we first reshape all tensors to matrices and then do the SVD decomposition of all matrices simultaneously. Table 3 displays the time in seconds of the TEBD algorithm implemented with the TNSPackage and with our batched SVD routine.

Table 3. The comparison of the TEBD algorithm for different implementations.

| Matrix dimension | TNSPackage | Our | Speedup | Difference |
|---|---|---|---|---|
| 96 | 27.168 | 3.700 | 7.3× | 1.6E–13 |
| 128 | 64.320 | 5.074 | 12.7× | 2.0E–13 |
| 256 | 561.428 | 17.211 | 32.6× | 2.2E–13 |
| 512 | 4542.161 | 83.958 | 54.1× | 1.9E–13 |

It can be seen that the implementation based on our batched SVD routine outperforms the TNSPackage for all dimensions tested, with a maximum speedup of 54.1×. We also compare the numerical difference between the TEBD algorithm implemented with the TNSPackage and with our batched SVD routine; the difference is presented in Table 3. In fact, the approximation error of the TEBD algorithm is O(δt), introduced by the Trotter-Suzuki decomposition [20]. The difference presented in Table 3 is much smaller, which indicates that our batched SVD routine is trustworthy.
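The second TEBD step (split a reshaped two-site tensor back into two MPS tensors via truncated SVD) can be sketched as follows. This is a generic illustration with an assumed bond-dimension cap `chi_max`, not the TNSPackage implementation:

```python
import numpy as np

def split_tensor(theta, chi_max):
    """Split a two-site tensor, already reshaped to a (chi_l*d) x (d*chi_r)
    matrix, into two MPS tensors by truncated SVD."""
    U, S, Vt = np.linalg.svd(theta, full_matrices=False)
    chi = min(chi_max, len(S))
    S = S[:chi]
    S = S / np.linalg.norm(S)        # renormalize after truncation
    return U[:, :chi], S, Vt[:chi, :]

rng = np.random.default_rng(2)
theta = rng.standard_normal((8, 8))
U, S, Vt = split_tensor(theta, chi_max=4)
assert U.shape == (8, 4) and Vt.shape == (4, 8)
assert np.allclose(U.T @ U, np.eye(4))
```

Since every such split is independent of the others, the matrices from all sites can be handed to a batched SVD in one call, which is exactly where the routine of Sect. 3 is applied.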

6 Conclusion

In this paper, we presented a parallel blocked Jacobi algorithm and its efficient implementation for the singular value decomposition of many small matrices. Our approach fully exploits the blocking structure and the parallelism of the blocked Jacobi SVD algorithm, thus fitting well into SIMT GPU architectures. Our implementation needs a CPU only for control flow and delivers high performance against state-of-the-art solutions. To illustrate the power of our routine, we further developed an application, the numerical simulation of quantum lattice systems, on top of our routine and achieved a maximum speedup of 54.1× versus its CPU counterpart. In the future, we plan to generalize our methodology to non-uniform workloads.

Acknowledgment. We would like to acknowledge He L. and Dong S. for helpful conversations and insights on numerical simulations of quantum lattice systems.

References

1. Abdelfattah, A., Baboulin, M., Dobrev, V., et al.: High-performance tensor contractions for GPUs. Procedia Comput. Sci. 80(1), 108–118 (2016)
2. Molero, J.M., Garzón, E.M., García, I., Quintana-Ortí, E.S., Plaza, A.: Efficient implementation of hyperspectral anomaly detection techniques on GPUs and multicore processors. IEEE J. Sel. Topics Appl. Earth Obs. Remote Sens. 7(6), 2256–2266 (2014)
3. Villa, O., Gawande, N., Tumeo, A.: Accelerating subsurface transport simulation on heterogeneous clusters. In: IEEE International Conference on Cluster Computing, Indianapolis, pp. 1–8. IEEE (2013)
4. Zhang, T., Liu, X., Wang, X., Walid, A.: cuTensor-tubal: efficient primitives for tubal-rank tensor learning operations on GPUs. IEEE Trans. Parallel Distrib. Syst. 31(3), 595–610 (2020)
5. NVIDIA cuBLAS Homepage. https://docs.nvidia.com/cuda/pdf/CUBLAS_Library.pdf
6. AMD rocBLAS Homepage. https://github.com/ROCmSoftwarePlatform/rocBLAS
7. Abdelfattah, A., Costa, T., Dongarra, J., et al.: A set of batched basic linear algebra subprograms and LAPACK routines. ACM Trans. Math. Softw. 47(7), 1–23 (2021)
8. Dong, T., Haidar, A., Tomov, S., Dongarra, J.: A fast batched Cholesky factorization on a GPU. In: International Conference on Parallel Processing, pp. 432–440. IEEE (2014)
9. Abdelfattah, A., Haidar, A., Tomov, S., Dongarra, J.: Factorization and inversion of a million matrices using GPUs: challenges and countermeasures. In: International Conference on Computational Science, pp. 606–615 (2017)
10. NVIDIA cuSOLVER Homepage. https://docs.nvidia.com/cuda/cusolver/index.html
11. Dong, T., Haidar, A., Tomov, S., Dongarra, J.: Accelerating the SVD bidiagonalization of a batch of small matrices using GPUs. J. Comput. Sci. 26(5), 237–245 (2018)
12. Badolato, I., Paula, L.D., Farias, R.: Many SVDs on GPU for image mosaic assemble. In: IEEE International Symposium on Computer Architecture and High Performance Computing Workshop, pp. 37–42 (2015)
13. Boukaram, W.H., Turkiyyah, G., Ltaief, H., Keyes, D.E.: Batched QR and SVD algorithms on GPUs with applications in hierarchical matrix compression. Parallel Comput. 74(5), 19–33 (2018)
14. KBLAS Homepage. https://github.com/ecrc/kblas-gpu. Accessed 30 Nov 2020
15. Brent, R.P., Luk, F.T.: The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays. SIAM J. Sci. Stat. Comput. 6(1), 69–84 (1985)
16. Luk, F.T., Park, H.: On parallel Jacobi orderings. SIAM J. Sci. Stat. Comput. 10(1), 18–26 (1989)
17. Luk, F.T., Park, H.: A proof of convergence for two parallel Jacobi SVD algorithms. IEEE Trans. Comput. 38(6), 806–811 (1989)
18. Rivera, C., Chen, J., Xiong, N., Zhang, J., Song, S.: TSM2X: high-performance tall-and-skinny matrix-matrix multiplication on GPUs. J. Parallel Distrib. Comput. 151(3), 70–85 (2021)
19. Filipovič, J., Madzin, M., Fousek, J., Matyska, L.: Optimizing CUDA code by kernel fusion: application on BLAS. J. Supercomput. 71(10), 3934–3957 (2015). https://doi.org/10.1007/s11227-015-1483-z
20. Vidal, G.: Efficient simulation of one-dimensional quantum many-body systems. Phys. Rev. Lett. 93(4), 40502–40505 (2004)
21. Dong, S., Liu, W., Wang, C., Han, Y., Guo, G., He, L.: TNSPackage: a Fortran 2003 library designed for tensor network state methods. Comput. Phys. Commun. 228(7), 163–177 (2018)

A Molecular Dynamics Based Multi-scale Platelet Aggregation Model and Its High-Throughput Simulation

Zhipeng Xu1 and Qingsong Zou2(B)

1 School of Science, Nantong University, Nantong 226019, Jiangsu, China
[emailprotected]
2 School of Computer Science and Engineering and Guangdong Province Key Laboratory of Computational Science, Sun Yat-sen University, Guangzhou 510006, Guangdong, China
[emailprotected]

Abstract. In this paper, we develop a multi-scale model to simulate the aggregation of platelets in a low shear-coefficient flow. In this multi-scale model, the Morse potential is used to describe the interaction between the αIIbβ3 receptor and fibrinogen, dissipative particle dynamics (DPD) is used to simulate fluids on the macro-scale, and coarse-grained molecular dynamics (CGMD) is used to simulate the fine-scale biochemical reactions of the receptors. Moreover, with the assistance of high-throughput simulations on a heterogeneous cluster, we calibrate the parameters of the Morse potential, which are critical for a proper simulation of the aggregation of platelets. With this model, we simulate the long-term behaviour of thrombus formation by many platelets. Our simulation results are consistent with in-vitro experiments on contact areas and detaching forces. Moreover, the model reduces the computational cost significantly.

Keywords: Platelet aggregation · High-throughput simulation · Molecular dynamics · Morse potential

1 Introduction

Platelet aggregation is a common phenomenon in the blood flow, which promotes wound repair in general. However, platelet aggregation might also be a crucial factor in triggering thrombosis. For patients suffering from cardiovascular disease or wearing an extracorporeal blood circulation device, abnormal platelet aggregation caused by high blood pressure may cause serious complications. Therefore, understanding the mechanism of platelet aggregation is of great significance for the prevention and treatment of cardiovascular diseases. So far, many medical experiments have been completed to understand the mechanism of platelet aggregation. For instance, some experiments show that platelet aggregation in the vein or aorta with low-to-medium shear flow is due to fibrinogen [11,16,17], distributed in the blood, binding to the αIIbβ3 protein [2]. Other experiments discover that an initial small clot will attract more platelets to participate, and eventually they combine to form a large thrombus. Since various factors, including platelet surface proteins, ligands, and shear stress, participate in the reaction during platelet aggregation, it seems very difficult to discover the mechanism of platelet aggregation only by medical/chemical experiments.

© Springer Nature Switzerland AG 2022. H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 81–92, 2022. https://doi.org/10.1007/978-3-030-96772-7_8

Z. Xu and Q. Zou

Recently, more and more scientists have tried to use numerical simulation to reveal the mechanism of platelet aggregation. Since the process of platelet aggregation usually involves multi-scale physical or chemical behaviors, such as the macro-scale fluid flow and molecular-size reactions among surface proteins, a high-precision simulation requires a multi-scale numerical model that covers at least fluid mechanics and molecular dynamics. Usually, a good simulation model [4–6,18–21] includes three scales: macro-scale, meso-scale, and micro-scale. With the macro-scale model, we simulate the flow of blood in vessels. Note that the classical fluid dynamics equations, such as the Navier-Stokes equations, and particle methods [14] such as SPH [10], DPD [1], and SDPD [8,15] can be used to characterize the blood flow. With the meso-scale model, we simulate the interaction between the blood fluid and the platelet. The motion of platelets caused by the blood flow and the change of the flow field due to immersed platelets and thrombi can be calculated at this scale. With the micro-scale model, we simulate the nanometer-size proteins, which play a critical role in the aggregation of platelets. It is challenging to simulate all the details of the platelet aggregation process using full-atom molecular dynamics, but multi-scale models are available. For instance, in [4], Prachi et al. simulate platelet aggregation with the DPD-CGMD model [20], in which the DPD method is used to solve the viscous flow and the CGMD is used to simulate the movements of the particles in the interior of platelets. With their method, they successfully simulate the aggregation process of two platelets in a blood flow. However, the massive amount of computation limits the application of their model to simulating the aggregation of more platelets, even on the fastest modern supercomputer.

In this paper, we propose a rigid platelet multi-scale model to simulate the aggregation of more platelets. In our model, each platelet is regarded as a rigid body in the sense that there is no relative movement between the particles/molecules of the same platelet. Of course, this simplified model cannot simulate the process of platelet deformation to produce filopodia. However, our model includes the interaction of the particles/molecules distributed on the membrane of platelets. Since real platelet aggregation is driven by the many αIIbβ3 proteins distributed on the membrane of platelets and the mediation of fibrinogen in the blood, our simplified model can simulate the following main process of platelet aggregation: once two platelets gather together, the blood flow speed is slowed down, and the consequent increase of shear pressure leads to the aggregation of more platelets, eventually forming an enormous clot.

Multi-scale Platelet Aggregation Model and High-Throughput Simulation

2 A Rigid Platelet Multi-scale Model

This section introduces the basic ingredients to simulate platelet aggregation with the MD-based rigid platelet multi-scale model.

2.1 DPD Model for Blood Flow

From the microcosmic point of view, calculating the statistical properties of all fluid molecule trajectories is one way to realize the fluid dynamics. However, the enormous amount of calculation cannot be carried out even on the fastest modern supercomputer. Therefore, the mesoscopic DPD fluid model [12] was established and used in the simulation of biological fluids. Here, a DPD particle represents a cluster of fluid molecules, similar to a renormalization group [7]. Although the details of a single molecule are lost, the physical properties of a bunch of DPD particles can still reflect the fluid's motion characteristics [3], even turbulence. Assuming that, within the cut-off distance, the ith particle is affected by its surrounding DPD particles, the resultant force and the change in velocity can be written as follows:

$$dv_i = \frac{1}{m_i}\sum_{j\ne i}^{N}\left(F^{C}_{ij}\,dt + F^{D}_{ij}\,dt + F^{R}_{ij}\,\sqrt{dt} + F^{E}_{ij}\,dt\right). \tag{1}$$

Here, $m_i$, $F^{C}_{ij}$, $F^{D}_{ij}$, $F^{R}_{ij}$ and $F^{E}_{ij}$ represent the mass of the ith particle and the conservative force, dissipation force, random force and external force of the jth particle on the ith particle, respectively. Detailed expressions of the above forces are given below:

$$F^{C}_{ij} = a\left(1.0-\frac{r_{ij}}{r_c}\right)e_{ij},\qquad F^{D}_{ij} = -\gamma\,\omega^{D}(r_{ij})\,(e_{ij}\cdot v_{ij})\,e_{ij},\qquad F^{R}_{ij} = \sigma\,\omega^{R}(r_{ij})\,\varsigma_{ij}\,e_{ij},\qquad \omega^{D}(r_{ij}) = \left[\omega^{R}(r_{ij})\right]^{2} = \left(1.0-\frac{r_{ij}}{r_c}\right)^{2k}. \tag{2}$$
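As an illustration only (not the authors' code), the pairwise force evaluation of Eqs. (1)-(2), with σ tied to γ by the fluctuation-dissipation relation of Eq. (4), can be sketched as below; the per-step √dt scaling of the random term follows the standard Groot-Warren DPD integrator and is an assumption here:

```python
import math
import random

def dpd_force(ri, rj, vi, vj, a=25.0, gamma=67.5, k=0.25, rc=1.7,
              kBT=1.0, dt=0.01, zeta=None):
    """Pairwise DPD force on particle i from particle j, following Eqs. (1)-(2)."""
    sigma = math.sqrt(2.0 * gamma * kBT)      # fluctuation-dissipation, Eq. (4)
    r = [p - q for p, q in zip(ri, rj)]
    rij = math.sqrt(sum(c * c for c in r))
    if rij >= rc or rij == 0.0:
        return [0.0, 0.0, 0.0]                # no interaction beyond the cut-off
    e = [c / rij for c in r]                  # unit vector along the line of centres
    wR = (1.0 - rij / rc) ** k                # omega^R
    wD = wR * wR                              # omega^D = (omega^R)^2, Eq. (2)
    ev = sum(ec * (va - vb) for ec, va, vb in zip(e, vi, vj))
    if zeta is None:
        zeta = random.gauss(0.0, 1.0)
    mag = (a * (1.0 - rij / rc)               # conservative (repulsive)
           - gamma * wD * ev                  # dissipative (friction)
           + sigma * wR * zeta / math.sqrt(dt))  # random (Brownian)
    return [mag * ec for ec in e]

# with the noise fixed to 0, only conservative + dissipative parts remain
f = dpd_force((0.0,) * 3, (1.0, 0.0, 0.0), (0.0,) * 3, (0.0,) * 3, zeta=0.0)
assert abs(f[0] + 25.0 * (1.0 - 1.0 / 1.7)) < 1e-12
```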

The physical meaning of the conservative force $F^{C}_{ij}$ is the compressibility of the fluid. Here $r_{ij}$ is the distance between the two particles, $r_c$ represents the cut-off distance, $e_{ij}$ is the unit vector pointing from the ith particle to the jth particle, and the coefficient $a$ is determined by letting the two particles be in the same position (i.e., $r_{ij} = 0$). We may observe that the conservative force is a linear function negatively correlated with the distance $r_{ij}$; that is, the conservative force attenuates linearly as the distance grows. When the particle density increases, the decrease of the average distance causes more particles to move within the cut-off distance, but the larger repulsive force drives the density of particles back to a stable value, which shows a spring-like compressibility. Similarly, the small repulsive force corresponding to a low density always attracts other particles to increase the density.

The dissipation force $F^{D}_{ij}$ reflects the frictional force between particles that move irregularly in the fluid; namely, the dissipation force indicates the viscosity of the liquid. The parameter $\gamma$ is the coefficient of the dissipative force. The magnitude of the dissipative force is related to the relative distance and relative speed between particles. The negative sign of $F^{D}_{ij}$ means a decelerating effect in the direction of the relative velocity. The random force $F^{R}_{ij}$ reflects the characteristics of the random Brownian motion of liquid particles. To meet the features of the constant-temperature, constant-volume (NVT) ensemble, according to the fluctuation-dissipation theorem, the coefficients of the conservative force, random force, and dissipation force satisfy the following relationships:

$$a = 75\,k_B T / (\rho_f r_c), \tag{3}$$

$$\sigma^2 = 2\gamma k_B T,\qquad k_B T = 1.0, \tag{4}$$

where $\rho_f$ is the density of DPD particles. According to Prachi's [4] simulation, the above parameters can be chosen as

$$a = 25.0,\qquad \gamma = 67.5,\qquad k = 0.25,\qquad r_c = 1.7. \tag{5}$$

2.2 CGMD-DPD Model for Fluid-Platelet Interaction

The DPD model at the fluid level only considers the interaction between fluid particles. To represent the interaction between fluid particles and particles in the platelet membrane, we need to introduce a platelet interface potential, for which the velocity update function can be written as below:

$$dv_i = \frac{1}{m_i}\sum_{j\ne i}^{N}\left(\nabla U_{LJ}(r_{ij})\,dt + F^{D}_{ij}\,dt + F^{R}_{ij}\,\sqrt{dt}\right), \tag{6}$$

where $F^{D}_{ij} = -\gamma\,\omega^{D}(r_{ij})\,(e_{ij}\cdot v_{ij})\,e_{ij}$, $F^{R}_{ij} = \sigma\,\omega^{R}(r_{ij})\,\varsigma_{ij}\,e_{ij}$, and

$$U_{LJ} = 4\epsilon\left[\left(\frac{\sigma}{r_{ij}}\right)^{12} - 2\left(\frac{\sigma}{r_{ij}}\right)^{6}\right].$$

In this model, the particles on the surface of platelets are regarded as a particular part of the fluid during the coupling of fluid and platelets. However, since the pressure from the fluid on the membrane particles can be ignored, there is no conservative force in the above formula. Moreover, the external force is represented by the Lennard-Jones potential $U_{LJ}$, preventing the fluid particles from penetrating the membrane [19]. Note that in [4] the L-J potential is replaced by

$$V_{CGMD} = \sum_{bonds} k_b (r-r_0)^2 + \sum_{L\text{-}J} 4\epsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r}\right)^{12} - \left(\frac{\sigma_{ij}}{r}\right)^{6}\right], \tag{7}$$

where $k_b$ is the bond energy between two adjacent membrane particles. The bond energy term and the L-J term are used to maintain the structure of the platelet; otherwise, the platelet would shrink to a clump by the minimum energy principle. Since the platelet is assumed to be a rigid body with no deformation in our model, the platelet structure never changes. Therefore, we keep using (6) as our model to simulate the fluid-solid interaction.

2.3 The Morse Potential for Platelet-Platelet Interaction

When two platelets move close to each other, the proteins on their surfaces bind to fibrin to produce an aggregation effect. The attractive force between particles of different platelets drives the aggregation progress. The Morse potential is a common tool to describe the interaction of diatomic molecules in chemical reactions. Prachi and Zhang et al. [19] modified the Morse potential function to simulate fluid and platelet aggregation by calibrating with experimental data. They define

$$E = D_0\left(e^{-2\alpha(r-r_0)} - 2e^{-\alpha(r-r_0)}\right) + \frac{f_A}{2r_0}(r-r_0)^2. \tag{8}$$

According to the literature, the CGMD elastic model with harmonic and L-J potentials in Eq. (7) leads to more computational cost. Therefore, we consider building a rigid model that ignores Eq. (7) and uses the original Morse potential to simulate the aggregation. The relationship between the potential energy and the distance for the original Morse potential is as follows:

$$E = D_0\left(e^{-2\alpha(r-r_0)} - 2e^{-\alpha(r-r_0)}\right). \tag{9}$$

In Eq. (9), $D_0$ is the coefficient measuring the energy needed to move one molecule from the stabilizing point of minimum energy to infinity, $r_0$ is the equilibrium distance, and $\alpha$ is a parameter related to the molecule. When the distance $r - r_0$ is small, Taylor expansion shows that the equation describes simple harmonic motion around the equilibrium point. Hence, two platelets will vibrate near the aggregation balance point when the driving force is negligible compared with the Morse force. The parameters $D_0$, $\alpha$, and $r_0$ in Eq. (9) must be specified to make the model computable, which means we should exhaust the parameter space and compare the output metrics with experimental results. In summary, Table 1 lists the equation of each scale for the rigid platelet model, and the contribution of this work is to obtain the detailed parameters for Eq. (9) by using high-throughput simulation.
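A small numerical check of Eq. (9) and its harmonic behaviour near $r_0$ (our illustrative sketch; the parameter values are arbitrary):

```python
import math

def morse_energy(r, D0, alpha, r0):
    """Original Morse potential, Eq. (9)."""
    x = math.exp(-alpha * (r - r0))
    return D0 * (x * x - 2.0 * x)

def morse_force(r, D0, alpha, r0):
    """Radial force F = -dE/dr for Eq. (9)."""
    x = math.exp(-alpha * (r - r0))
    return 2.0 * alpha * D0 * (x * x - x)

D0, alpha, r0 = 1.0, 2.0, 1.0
assert abs(morse_energy(r0, D0, alpha, r0) + D0) < 1e-12  # minimum -D0 at r0
assert abs(morse_force(r0, D0, alpha, r0)) < 1e-12        # zero force at equilibrium
# near r0 the potential is approximately harmonic: E ~ -D0 + D0*alpha^2*(r-r0)^2
dr = 1e-4
assert abs(morse_energy(r0 + dr, D0, alpha, r0) - (-D0 + D0 * alpha**2 * dr**2)) < 1e-9
```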

3 Parameters Calibration

To determine the parameters of the multi-scale model for platelet aggregation, we need to compare two indicators, the contact area and the detaching force, with medical experimental data. Since molecular dynamics is a many-body problem, the relationship between the parameters and the experimental medical data is nonlinear, so finding appropriate parameters often takes considerable time and cost.

Table 1. Model of each layer for the rigid platelet model.

| Layer | Model |
|---|---|
| Fluid | Eq. 1 |
| Interaction of Fluid-Platelet | Eq. 6 |
| Interaction of Platelet-Platelet | Eq. 9 |

Here, we use Hygon DCUs (Deep Computing Units) [22] to accelerate the high-throughput molecular dynamics simulations used to obtain the model parameters. Figure 1 shows the system: its size is 16 µm × 16 µm × 8 µm, and it contains 256,000 DPD particles and 23,592 × 2 platelet particles. As shown in the figure, each particle (blue point) represents a protein on the membrane, and the binding of protein and fibrinogen is described with the Morse potential. The maximum velocity of the Poiseuille flow is 0.28 cm/s, driven by a constant force with no-slip boundary conditions [13].

Fig. 1. Profile of the system (a 16 µm × 16 µm × 8 µm channel; 23,592 membrane particles per platelet; maximum flow velocity 0.28 cm/s; fibrinogen mediates the interaction during aggregation).

3.1 Contact Area and Detaching Force

When the distance between two platelets is smaller than the critical distance, a protein on the membrane of one platelet will attract the nearest protein on the other to drive the aggregation of the two platelets. During the reaction, some of the proteins detach due to the external force of the fluid, but some will re-aggregate if the driving pressure from the liquid is not too large, similar to the effect of nylon hook-and-loop fasteners. Therefore, the contact area calculation can be converted into counting the particle pairs whose distance is less than the sum of the lengths of the fibrinogen protein and the membrane proteins. Some work indicates

Multi-scale Platelet Aggregation Model and High-Throughput Simulation

87

that the ﬁbrinogen protein length is about 47.5 nm, and the membrane protein is about 20 nm above the cell membrane surface. Therefore, when the distance between two proteins on the surface of two platelets is less than 87.5 nm, namely, 0.5 in the L-J unit, two proteins can be considered as contacted. Consequently, the contact area between two small platelets can be calculated by the following formula, S . (10) Ca = |CAB | · Ns Here, CAB can be calculated as follows, CAB = ri | rij < Td , rij = r i − r j 2 ri ∈ NA , rj ∈ NB . (11) S is the surface area of the membrane; Ns is the number of distance pairs less on the membrane surface, Td represents the critical distance. |CAB | is the number of distance pairs of the nearest neighbor distance between A and B platelets less than Td . In the system, Ns = 23592, S ≈ 22.696 um2 with semi-major axis a, b = 1.78 um and c = 0.445 um. Two platelets will produce an interaction force when they aggregate. When the external force exceeds the critical value, two platelets will separate, so the interaction force can also be called the detaching force. For a single ﬁbrin-αIIbβ3 protein pair, the interaction force can be measured by atomic force microscopy. Atomic force microscopes use contact currents to image the surface of a sample, which can distinguish single atoms. The detaching force can be obtained by calculating the sum of the Morse force corresponding to all the distance pairs where the distance is less than Td at the time of contact,

$$F_{\mathrm{detaching}} = \sum_{r_{ij} < T_d} D_0\left(-2\alpha e^{-2\alpha(r_{ij}-r_0)} + 2\alpha e^{-\alpha(r_{ij}-r_0)}\right). \qquad (12)$$

[Table 2: comparison of the contact area and the detaching force (nN) between [4] and this work. Recovered values: contact area 0.213 ± 0.001, 2.227 ± 0.003, range 1.4660–2.4340; detaching force 16.1 ± 0.5, 0.844 ± 0.007, 17.842 ± 0.027, range 9.10–18.20.]
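Equation (12) can be evaluated directly from a list of protein-pair distances. The sketch below sums the Morse force over the contacting pairs only; the parameter values in the usage note are illustrative, not the calibrated ones from the paper.

```python
import numpy as np

def detaching_force(distances, d0, alpha, r0, t_d):
    """Eq. (12): sum of Morse forces over all pairs with r_ij < T_d."""
    r = np.asarray(distances, dtype=float)
    r = r[r < t_d]  # only pairs closer than the critical distance contribute
    return float(np.sum(d0 * (-2.0 * alpha * np.exp(-2.0 * alpha * (r - r0))
                              + 2.0 * alpha * np.exp(-alpha * (r - r0)))))
```

At $r_{ij} = r_0$ the two exponential terms cancel, so an equilibrium pair contributes no net force; pairs beyond $T_d$ contribute nothing at all.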

The contact region is round due to the regular surface of the rigid body. Table 2 compares Prachi's results [4] with ours. In [4], the contact area of a rigid platelet is much smaller than that of a non-rigid body with the same parameters. Our high-throughput molecular dynamics simulation, however, shows that a rigid-body model with appropriate potential parameters can also reproduce the experimental result.
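The contact-area computation of Eqs. (10)–(11) amounts to a nearest-neighbour pair count. A minimal numpy sketch with brute-force distance evaluation (all inputs illustrative):

```python
import numpy as np

def contact_area(pos_a, pos_b, t_d, surface_area, n_s):
    """Eq. (10): C_a = |C_AB| * S / N_s, where C_AB (Eq. (11)) collects the
    particles of membrane A whose nearest neighbour on membrane B lies
    closer than the critical distance T_d."""
    diff = pos_a[:, None, :] - pos_b[None, :, :]   # all pairwise displacement vectors
    nearest = np.linalg.norm(diff, axis=-1).min(axis=1)
    c_ab = int(np.count_nonzero(nearest < t_d))
    return c_ab * surface_area / n_s
```

A production code would use a neighbour list or cell list instead of the O(|A|·|B|) distance matrix, but the counting logic is the same.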

Fig. 6. The contact region for t = 40 µs

4

Conclusion

Thrombosis is closely related to various diseases. High-precision platelet aggregation simulation helps in understanding the formation mechanism of thrombi. Most elastic platelet models based on molecular dynamics are computationally expensive, which makes larger-scale thrombosis simulation difficult. Our high-throughput molecular dynamics runs show that the simulation results of the rigid platelet model are consistent with actual experimental data. The rigid platelet model with calibrated parameters is successfully applied to simulate the aggregation of four platelets, as shown in Fig. 6. Simulating the formation of a thrombus containing hundreds or thousands of platelets is our next goal.


Acknowledgment. The research was supported in part by NSFC Grant 12071496, Guangdong Provincial NSF Grant 2017B030311001, Guangdong Province Key Laboratory of Computational Science at the Sun Yat-sen University (2020B1212060032), and Nantong Science & Technology Research Plan (No. JC2021133). This work also beneﬁted from resources made available at the National Supercomputer Center in Kunshan.

References

1. Duc, D.-H., Nhan, P.-T., Xijun, F.: An implementation of no-slip boundary conditions in DPD. Comput. Mech. 35(1), 24–29 (2004)
2. Durrant, T.N., van den Bosch, M.T., Hers, I.: Integrin αIIbβ3 outside-in signaling. Blood 130(14), 1607–1619 (2017)
3. Gao, C., Zhang, P., Marom, G., Deng, Y., Bluestein, D.: Reducing the effects of compressibility in DPD-based blood flow simulations through severe stenotic microchannels. J. Comput. Phys. 335, 812–827 (2017)
4. Gupta, P., Zhang, P., Sheriff, J., Bluestein, D., Deng, Y.: A multiscale model for recruitment aggregation of platelets by correlating with in vitro results. Cell. Mol. Bioeng. 12(4), 327–343 (2019)
5. Gupta, P., Zhang, P., Sheriff, J., Bluestein, D., Deng, Y.: A multiscale model for multiple platelet aggregation in shear flow. Biomech. Model. Mechanobiol. 20(3), 1013–1030 (2021). https://doi.org/10.1007/s10237-021-01428-6
6. Han, C., Zhang, P., Bluestein, D., Cong, G., Deng, Y.: Artificial intelligence for accelerating time integrations in multiscale modeling. J. Comput. Phys. 427, 110053 (2021)
7. Lan, Y.: Bridging steady states with renormalization group analysis. Phys. Rev. E 87, 012914 (2013)
8. Li, G., Ye, T., Wang, S., Li, X., Ul Haq, R.: Numerical design of a highly efficient microfluidic chip for blood plasma separation. 32(3), 031903 (2020)
9. Litvinov, R.I., Farrell, D.H., Weisel, J.W., Bennett, J.S.: The platelet integrin αIIbβ3 differentially interacts with fibrin versus fibrinogen. 291(15), 7858–7867 (2016)
10. Tanaka, N., Takano, T.N.: Microscopic-scale simulation of blood flow using the SPH method. 02(04), 555–568 (2005)
11. Vilar, R., Fish, R.J., Casini, A., Neerman-Arbez, M.: Fibrin(ogen) in human disease: both friend and foe. Haematologica 105(2), 284–296 (2020)
12. Wang, L., Chen, Z., Zhang, J., Zhang, X., Wu, Z.J.: Modeling clot formation of shear-injured platelets in flow by a dissipative particle dynamics method. Bull. Math. Biol. 82(7), June 2020
13. Willemsen, S.M., Hoefsloot, H.C.J., Iedema, P.D.: No-slip boundary condition in dissipative particle dynamics. 11(05), 881–890 (2000)
14. Yamaguchi, T., et al.: Particle-based methods for multiscale modeling of blood flow in the circulation and in devices: challenges and future directions. Ann. Biomed. Eng. 38(3), 1225–1235 (2010)
15. Ye, T., Phan-Thien, N., Lim, C.T., Peng, L., Shi, H.: Hybrid smoothed dissipative particle dynamics and immersed boundary method for simulation of red blood cells in flows. 95(6), 063314, June 2017
16. Yesudasan, S., Wang, X., Averett, R.D.: Coarse-grained molecular dynamics simulations of fibrin polymerization: effects of thrombin concentration on fibrin clot structure. J. Mol. Model. 24(5), 1–14 (2018). https://doi.org/10.1007/s00894-018-3642-7

92

Z. Xu and Q. Zou

17. Yesudasan, S., Wang, X., Averett, R.D.: Fibrin polymerization simulation using a reactive dissipative particle dynamics method. Biomech. Model. Mechanobiol. 17(5), 1389–1403 (2018). https://doi.org/10.1007/s10237-018-1033-8
18. Zhang, N., Zhang, P., Kang, W., Bluestein, D., Deng, Y.: Parameterizing the Morse potential for coarse-grained modeling of blood plasma. J. Comput. Phys. 257, 726–736 (2014)
19. Zhang, P., Gao, C., Zhang, N., Slepian, M.J., Deng, Y., Bluestein, D.: Multiscale particle-based modeling of flowing platelets in blood plasma using dissipative particle dynamics and coarse grained molecular dynamics. Cell. Mol. Bioeng. 7(4), 552–574 (2014)
20. Zhang, P., Zhang, L., Slepian, M.J., Deng, Y., Bluestein, D.: A multiscale biomechanical model of platelets: correlating with in-vitro results. J. Biomech. 50, 1–15 (2016)
21. Zhang, P., Zhang, N., Deng, Y., Bluestein, D.: A multiple time stepping algorithm for efficient multiscale modeling of platelets flowing in blood plasma. J. Comput. Phys. 284, 668–686 (2015)
22. Zhang, Y., Qian, H.: Porting and optimizing g-BLASTN to the ROCm-based supercomputer. In: 2020 International Conference on Computer Science and Management Technology (ICCSMT). IEEE, November 2020

Approximation and Polynomial Algorithms for Multi-depot Capacitated Arc Routing Problems Wei Yu(B) and Yujie Liao School of Mathematics, East China University of Science and Technology, Shanghai 200237, China [emailprotected], [emailprotected]

Abstract. We study the multi-depot capacitated arc routing problem (MCARP), which generalizes the classical arc routing problem to the more realistic situation with multiple depots. We propose approximation and polynomial algorithms for different variants of the MCARP. First, we present the first constant-factor approximation algorithms for the MCARP and its nonfixed destination variant. Second, for a restricted case of the MCARP with infinite vehicle capacity, called the multi-depot rural postman problem, we devise a (2 − 1/(2k + 1))-approximation algorithm, where k is the number of depots. Lastly, we show that the equal-demand MCARP defined on a line graph is polynomially solvable and develop a 2-approximation algorithm for the multi-depot capacitated vehicle routing problem on a line.

Keywords: Approximation algorithm · Multi-depot · Vehicle routing problem · Arc routing problem · Rural postman problem

1

Introduction

Let G = (V, E) be an undirected graph (possibly a multigraph) with vertex set V and edge set E. Each edge e ∈ E is associated with a nonnegative cost c(e) and a nonnegative integer demand d(e). There is a fleet of homogeneous vehicles with capacity Q located at a specified vertex o ∈ V, called the depot. The Capacitated Arc Routing Problem (CARP) is to find a set of routes (or closed walks), starting from and ending at the depot, for the vehicles to serve the edges with positive demands such that each vehicle serves a total demand of at most Q (capacity constraint) and the total cost of the routes is minimized.

If the demands are defined for the vertices instead of the edges in the CARP, we obtain the Capacitated Vehicle Routing Problem (CVRP). As noted by Golden and Wong [13], the CVRP can be seen as a special case of the CARP, because each vertex in the CVRP can be split into two vertices connected by a zero-cost edge whose demand equals the original vertex demand. The CARP occurs frequently in practical applications, including the

© Springer Nature Switzerland AG 2022
H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 93–100, 2022. https://doi.org/10.1007/978-3-030-96772-7_9


inspection of electric power lines [9], distribution services [16], garbage collection [10], school bus routing [24], and so on.

A natural extension of the CARP/CVRP is the Multi-Depot Capacitated Arc/Vehicle Routing Problem (MCARP/MCVRP), where there are multiple depots instead of a single depot and the routes are required to start from and end at the same depot (but different routes may use different depots). The motivation to study the MCARP/MCVRP lies not only in their theoretical interest, but also in their widespread applications. For the CARP/CVRP, when the service area is large, multiple depots are usually set up to meet the service requirements [11]. Such depots correspond to vehicle stations, warehouses, dumping places, supply points or relay boxes. For example, the online shopping business usually operates from multiple depots to improve customers' experience and satisfaction in cities [19]. Other applications of the MCARP/MCVRP encompass mail delivery [17], explosive waste recycling [27], police patrolling [7], etc.

One can see that the CARP (resp. CVRP) is NP-hard, since it contains the well-known Rural Postman Problem (resp. Metric Traveling Salesman Problem) as a special case where the vehicle capacity is infinite. In turn, as a generalization of the CARP/CVRP, the MCARP/MCVRP is also NP-hard. Therefore, the existing literature on the MCARP/MCVRP has centered on branch-and-cut approaches (e.g., see [12,20]) and meta-heuristics (e.g., see [17,19,23]). In contrast, we address the multi-depot CARP from the point of view of approximation algorithms. As far as we know, there are few approximability results on multi-depot variants of the CARP/CVRP. In particular, we are not aware of any approximation algorithm for the MCARP.

The research on approximation algorithms for the CARP/CVRP was initiated by Haimovich and Rinnooy Kan [14], who studied the equal-demand CVRP, a special case of the CVRP with d(v) = 1 for each vertex v.
They gave the well-known Iterated Tour Partition heuristic, denoted by ITP(α), where α indicates the approximation ratio of the metric TSP (α ≤ 3/2 due to the results in [5,8]), and proved that ITP(α) achieves an approximation ratio of 1 + (1 − 1/Q)α if the number n = |V| of vertices is a multiple of Q. Later, Haimovich et al. [15] and Altinkemer and Gavish [2] removed the condition that n is a multiple of Q while achieving the same result (actually, the versions of ITP(α) in [2,15] are slightly different from that in [14], but we still refer to them as ITP(α)). For the general CVRP, Altinkemer and Gavish [1] obtained a (2 + (1 − 2/Q)α)-approximation algorithm, called UITP(α), which extends ITP(α) to the general case of unequal demands. A simplified proof of this result can be found in [15]. Recently, Blauth et al. [6] managed to improve the longstanding ratio for the CVRP to 2 + α − 2ε for some absolute constant ε > 0. For the equal-demand case, they also devised an improved (1 + α − ε)-approximation algorithm.

Besides the results on the CVRP defined on general graphs, there are also approximation algorithms tailored for the CVRP defined on special graphs. Labbe et al. [21] devised a 2-approximation for the CVRP on trees. If the graph is a line, Wu and Lu [26] further improved the ratio to 5/3. Note that the CVRP on


a half-line (i.e., the depot is located at one of the endpoints of the line) is already NP-hard [3]. What is worse, the CVRP on a half-line cannot be approximated within ratio 3/2 unless P = NP [26].

As for the CARP, Jansen [18] showed how to generalize the above ITP(α) and UITP(α) heuristics for the CVRP to obtain approximation algorithms with ratios 1 + (1 − 1/Q)α₀ and 2 + (1 − 2/Q)α₀ for the CARP with triangle inequality, where α₀ is the approximation ratio for the Rural Postman Problem (due to the results in [4,9], α₀ ≤ 3/2). Wohlk [25] presented an alternative (2 + (1 − 2/Q)α₀)-approximation algorithm for the CARP with triangle inequality. Interestingly, van Bevern et al. [4] proved that any factor-β approximation algorithm for the CARP with triangle inequality yields a factor-β approximation algorithm for the general CARP (without the triangle inequality). As a result, the CARP (resp. the equal-demand CARP) admits an approximation algorithm of ratio 2 + (1 − 2/Q)α₀ (resp. 1 + (1 − 1/Q)α₀).

For the multi-depot CVRP, Li and Simchi-Levi [22] developed approximation algorithms with ratios 1 + (2 − 1/Q)α and 2 + (2 − 2/Q)α for the equal-demand case and the general case, respectively. In addition, they also considered the nonfixed destination MCVRP, i.e., a variant of the MCVRP where the vehicles are allowed to depart from one depot but end at another, and gave two approximation algorithms with ratios 1 + (1 − 1/Q)α and 2 + (1 − 2/Q)α for the equal-demand case and the general case, respectively.

In this paper, we mainly obtain the following results. First, we present the first approximation algorithms for the MCARP and its nonfixed destination variant, which have constant approximation ratios. Second, for the multi-depot Rural Postman Problem (MRPP), a restricted case of the MCARP with infinite vehicle capacity, we devise a better approximation algorithm with ratio 2 − 1/(2k + 1), where k indicates the number of depots. Lastly, we investigate the MCARP/MCVRP defined on a line graph, show that the equal-demand MCARP on a line is polynomially solvable, and propose a 2-approximation algorithm for the MCVRP on a line.

The rest of the paper is organized as follows. We give some notations used throughout the paper in Sect. 2. In Sect. 3 we deal with the approximation algorithms for the nonfixed destination MCARP. Subsequently, we discuss the (fixed destination) MCARP in Sect. 4. Approximation algorithms for the MRPP are presented in Sect. 5. At last, we give approximation and polynomial algorithms for the MCARP/MCVRP defined on a line graph in Sect. 6.

2

Notations

Throughout the paper, we analyze algorithms for different versions of the MCARP/MCVRP. For the MCARP, we denote by Z∗ the optimal value; Zn∗ indicates the optimal value of the nonfixed destination MCARP; and Z^A denotes the objective value of the solution obtained by some algorithm A. Let G = (V, E) be the underlying graph with vertex set V and edge set E, and let c(e) ≥ 0 denote the cost (or length) of edge e ∈ E. If e = (u, v), we call u, v the end vertices of e. The nonnegative integer demand of vertex v (edge e) is


denoted by d(v) (d(e), respectively). The edges with d(e) > 0 are called required edges, and the set of all required edges is denoted by R. Q is the capacity of the vehicles. For any u, v ∈ V, c_s(u, v) denotes the length of the shortest path between u and v. For a subgraph H of G, V(H) and E(H) denote the vertex set and edge (multi)set of H, respectively. The cost of H is defined as $c(H) = \sum_{e \in E(H)} c(e)$. Let c_R(H) be the sum of the costs of the required edges in H. Consequently, the sum of the costs of the non-required edges in H equals c(H) − c_R(H).
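The shortest-path costs c_s(u, v) used throughout can be obtained with Dijkstra's algorithm; a minimal standard-library sketch, assuming the graph is given as (u, v, cost) triples:

```python
import heapq
from collections import defaultdict

def shortest_path_costs(edges, source):
    """Dijkstra from `source` on an undirected graph with nonnegative
    edge costs. Returns {vertex: c_s(source, vertex)} for reachable vertices."""
    adj = defaultdict(list)
    for u, v, c in edges:
        adj[u].append((v, c))
        adj[v].append((u, c))  # the graph is undirected
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, c in adj[u]:
            if d + c < dist.get(v, float("inf")):
                dist[v] = d + c
                heapq.heappush(heap, (d + c, v))
    return dist
```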

3

The Nonfixed Destination MCARP

In this section, we extend the algorithm for the nonfixed destination MCVRP in [22] to solve the nonfixed destination MCARP. Our algorithm, called NMCARP(β), also has a simple description by using the result for the CARP (without triangle inequality) in [4]. Here β indicates the approximation ratio for the CARP.

Let G = (V, E) be the original graph for the nonfixed destination MCARP and let D ⊆ V be the depot set. NMCARP(β) uses a β-approximation algorithm for the CARP as a subroutine and consists of two stages. The first stage contracts the set D of depots in G into a single depot d to generate a new graph G′ and uses the β-approximation algorithm for the corresponding CARP to derive a solution composed of a series of routes starting from and ending at d. The second stage uncontracts d back into the original set D of depots, which produces a feasible solution of the original MCARP. The following is the formal description of the algorithm.

Algorithm NMCARP(β)

Step 1. Obtain a new graph G′ = (V′, E′) from G = (V, E), where V′ = {d} ∪ (V \ D) and each edge (u, v) ∈ E corresponds to an edge (u′, v′) ∈ E′ with the same cost and demand such that

$$
\begin{cases}
u' = u,\ v' = v, & \text{if } u, v \in V \setminus D;\\
u' = u,\ v' = d, & \text{if } u \in V \setminus D,\ v \in D;\\
u' = d,\ v' = v, & \text{if } u \in D,\ v \in V \setminus D;\\
u' = v' = d, & \text{if } u, v \in D.
\end{cases}
$$

Note that the last case indicates that (u′, v′) is a self-loop in G′.

Step 2. Apply a β-approximation algorithm for the CARP defined on G′ to generate a solution consisting of l routes C1, . . . , Cl starting from and ending at the depot d. Moreover, we assume w.l.o.g. that each Ci contains d exactly twice (otherwise, we can break Ci into a series of routes containing d exactly twice).

Step 3. For each Ci (i = 1, . . . , l), replace each edge (u′, v′) of Ci by the original edge (u, v) corresponding to (u′, v′). This results in a route Pi in G whose two endpoints are depots in D (but they may be different).

Approximation and Polynomial Algorithms for Multi-depot CARP

97

Step 4. Return the routes P1, . . . , Pl.

Lemma 1. Z^{NMCARP(β)} ≤ βZn∗.

Proof. Let Z∗(G′) be the optimal value of the CARP defined on G′ in Step 2. It can be seen that any feasible solution to the nonfixed destination MCARP induces a feasible solution to the CARP defined on G′ of no greater cost after contracting the depots in D into a single depot d. This implies that Z∗(G′) ≤ Zn∗. By definition, the total cost of the routes C1, . . . , Cl is at most βZ∗(G′). Observe that in Step 3 the total cost of the routes P1, . . . , Pl is the same as the total cost of the routes C1, . . . , Cl. Therefore, Z^{NMCARP(β)} ≤ βZ∗(G′) ≤ βZn∗.

Due to the results in [4,18,25], there exists an approximation algorithm, say UITP(α₀), with ratio 2 + (1 − 2/Q)α₀ for the CARP and another approximation algorithm, which we call ITP(α₀), with ratio 1 + (1 − 1/Q)α₀ for the equal-demand problem. Recall that α₀ is the approximation ratio for the Rural Postman Problem. Using Lemma 1, this yields the following result.

Theorem 1. The nonfixed destination MCARP admits a (2 + (1 − 2/Q)α₀)-approximation algorithm. If the demands are equal, there is a (1 + (1 − 1/Q)α₀)-approximation algorithm.

Remark 1. One can see that our algorithm has a very simple description, thanks to the adoption of the β-approximation algorithm for the CARP without triangle inequality. In particular, when constructing the graph G′ we need not alter the costs and demands of the edges except for contracting the depot set. In contrast, the UITPn(α) heuristic for the nonfixed destination CVRP, given by Li and Simchi-Levi [22], has to further revise the edge costs by computing the all-pairs shortest paths between the vertices in G and add some dummy edges, because their algorithm invokes the UITP(α) heuristic for the CVRP, which needs the triangle inequality, and G may not respect the triangle inequality.
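Step 1 of NMCARP(β), the depot contraction, can be sketched as follows; as just noted, costs and demands are left untouched. The edge-tuple representation (u, v, cost, demand) and the super-depot name "d" are illustrative.

```python
def contract_depots(edges, depots, d="d"):
    """Map every depot endpoint to the single super-depot d (Step 1 of
    NMCARP(beta)). Edges are (u, v, cost, demand) tuples; an edge between
    two depots becomes a self-loop at d."""
    depots = set(depots)
    contracted = []
    for u, v, cost, demand in edges:
        u2 = d if u in depots else u
        v2 = d if v in depots else v
        contracted.append((u2, v2, cost, demand))
    return contracted
```

Keeping the original edge objects (or an index) alongside each contracted edge makes Step 3's uncontraction a simple lookup.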

4

The (Fixed Destination) MCARP

We now discuss the (fixed destination) MCARP where all the routes are required to start from and end at the same depot.

We give an algorithm, called MUITP(α₀), for the MCARP by modifying the algorithm NMCARP(β) as follows. First, we replace the β-approximation algorithm in Step 2 by the above-mentioned algorithm UITP(α₀). Then we modify the solution generated in Step 4 to derive a feasible solution for the MCARP. Let $P_i = d_1^{(i)}, v_1^{(i)}, \ldots, v_r^{(i)}, d_2^{(i)}$ be the ith route with

$$c(P_i) = c_s(d_1^{(i)}, v_1^{(i)}) + \sum_{h=1}^{r-1} c_s(v_h^{(i)}, v_{h+1}^{(i)}) + c_s(v_r^{(i)}, d_2^{(i)}),$$


where $d_1^{(i)}, d_2^{(i)} \in D$ are the depots and $v_h^{(i)} \in V \setminus D$ (h = 1, . . . , r). The modification of Pi (i = 1, . . . , l) to Ci is defined as below: if $d_1^{(i)} = d_2^{(i)}$ then Pi is already feasible and we set Ci = Pi; otherwise, Pi is replaced by

$$
C_i =
\begin{cases}
d_1^{(i)}, v_1^{(i)}, \ldots, v_r^{(i)}, d_1^{(i)}, & \text{if } c_s(d_1^{(i)}, v_1^{(i)}) + c_s(v_r^{(i)}, d_1^{(i)}) \le c_s(d_2^{(i)}, v_1^{(i)}) + c_s(v_r^{(i)}, d_2^{(i)});\\[2pt]
d_2^{(i)}, v_1^{(i)}, \ldots, v_r^{(i)}, d_2^{(i)}, & \text{otherwise.}
\end{cases}
$$
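The modification above simply closes each route through whichever of its two depots is cheaper. A sketch, with `cs` any shortest-path cost function:

```python
def fix_route_endpoints(route, cs):
    """Given a route [d1, v1, ..., vr, d2] with possibly different depots,
    return the cheaper of the two single-depot closures (Sect. 4)."""
    d1, d2 = route[0], route[-1]
    inner = route[1:-1]
    if d1 == d2:
        return route  # already feasible for the fixed-destination MCARP
    cost1 = cs(d1, inner[0]) + cs(inner[-1], d1)
    cost2 = cs(d2, inner[0]) + cs(inner[-1], d2)
    return [d1] + inner + [d1] if cost1 <= cost2 else [d2] + inner + [d2]
```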

To analyze the performance of the algorithm MUITP(α₀), we define L∗ as the cost of the optimal rural postman tour with respect to G′ in Step 2. In other words, L∗ is the length of the shortest closed walk in G′ going through d and all required edges. L(α₀) is the cost of the α₀-approximate rural postman tour used by UITP(α₀). Clearly, L(α₀) ≤ α₀L∗. Moreover, according to UITP(α₀) it holds that $\sum_{i=1}^{l} \sum_{h=1}^{r-1} c_s(v_h^{(i)}, v_{h+1}^{(i)}) \le L(\alpha_0)$. We proceed to show the following result.

Lemma 2. $Z^{MUITP(\alpha_0)} \le \left(2 + \left(2 - \frac{2}{Q}\right)\alpha_0\right) Z^*$.

Proof. Similarly to the analysis of the ITPf(α) heuristic for the MCVRP in [22], we can show that $c(C_i) \le c(P_i) + \sum_{h=1}^{r-1} c_s(v_h^{(i)}, v_{h+1}^{(i)})$ and hence

$$Z^{MUITP(\alpha_0)} = \sum_{i=1}^{l} c(C_i) \le \left(2 + \left(1 - \frac{2}{Q}\right)\alpha_0\right) Z_n^* + L(\alpha_0).$$

Since Zn∗ ≤ Z∗ and L(α₀) ≤ α₀L∗ ≤ α₀Z∗, the proof is completed.

By substituting ITP(α₀) for UITP(α₀) in the above algorithm MUITP(α₀), we can obtain an approximation algorithm for the equal-demand MCARP with ratio 1 + (2 − 1/Q)α₀. To sum up, we have the following result for the MCARP.

Theorem 2. There exists a (2 + (2 − 2/Q)α₀)-approximation algorithm for the MCARP. Moreover, for the equal-demand problem there is a (1 + (2 − 1/Q)α₀)-approximation algorithm.

5

The Multi-depot Rural Postman Problem

In this section, we consider the multi-depot Rural Postman Problem (MRPP), which is a restricted case of the MCARP with infinite vehicle capacity, i.e., Q = +∞. Suppose that there are k = |D| depots. Then the MRPP is essentially to find at most k closed walks, each of which starts from and ends at a distinct depot, such that these walks cover all the required edges and the total cost of the walks is minimized.

Theorem 3. There exists a (2 − 1/(2k + 1))-approximation algorithm for the MRPP.


6


Multi-depot CARP on a Line

In this section, we deal with the MCARP/MCVRP defined on a line graph. We show that the equal-demand MCARP on a line can be solved in O(n²) time. For the MCVRP on a line, we give the first 2-approximation algorithm.

Theorem 4. The equal-demand MCARP on a line can be solved in O(n²) time.

Theorem 5. The MCVRP on a line admits a 2-approximation algorithm.

Acknowledgements. This research is supported by the National Natural Science Foundation of China under grant numbers 11671135, 11871213, 11901255 and the Natural Science Foundation of Shanghai under grant number 19ZR1411800.

References

1. Altinkemer, K., Gavish, B.: Heuristics for unequal weight delivery problems with a fixed error guarantee. Oper. Res. Lett. 6(4), 149–158 (1987)
2. Altinkemer, K., Gavish, B.: Heuristics for delivery problems with constant error guarantees. Transp. Sci. 6(4), 294–297 (1990)
3. Archetti, C., Feillet, D., Gendreau, M., Speranza, M.G.: Complexity of the VRP and SDVRP. Transp. Res. Part C Emerg. Technol. 19, 741–750 (2011)
4. van Bevern, R., Hartung, S., Nichterlein, A., Sorge, M.: Constant-factor approximations for Capacitated Arc Routing without triangle inequality. Oper. Res. Lett. 42, 290–292 (2014)
5. van Bevern, R., Slugin, V.A.: A historical note on the 3/2-approximation algorithm for the metric traveling salesman problem. Hist. Math. 53, 118–127 (2020)
6. Blauth, J., Traub, V., Vygen, J.: Improving the approximation ratio for capacitated vehicle routing. In: Singh, M., Williamson, D.P. (eds.) IPCO 2021. LNCS, vol. 12707, pp. 1–14. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-73879-2_1
7. Chen, H., Cheng, T., Shawe-Taylor, J.: A balanced route design for min-max multiple-depot rural postman problem (MMMDRPP): a police patrolling case. Int. J. Geogr. Inf. Sci. 32(1), 169–190 (2018)
8. Christofides, N.: Worst-case analysis of a new heuristic for the traveling salesman problem. Technical report, Graduate School of Industrial Administration, Carnegie-Mellon University, Pittsburgh (1976)
9. Eiselt, H.A., Gendreau, M., Laporte, G.: Arc routing problems, part II: the rural postman problem. Oper. Res. 43, 399–414 (1995)
10. Fernandez, E., Fontana, D., Grazia Speranza, M.: On the collaboration uncapacitated arc routing problem. Comput. Oper. Res. 67, 120–131 (2016)
11. Fernández, E., Rodríguez-Pereira, J.: Multi-depot rural postman problems. TOP 25(2), 340–372 (2016). https://doi.org/10.1007/s11750-016-0434-z
12. Fernandez, E., Laporte, G., Rodriguez-Pereira, J.: A branch-and-cut algorithm for the multidepot rural postman problem. Transp. Sci. 52(2), 353–369 (2018)
13. Golden, B.L., Wong, R.T.: Capacitated arc routing problems. Networks 11(3), 305–315 (1981)
14. Haimovich, M., Rinnooy Kan, A.H.G.: Bounds and heuristics for capacitated routing problems. Math. Oper. Res. 10(4), 527–542 (1985)


15. Haimovich, M., Rinnooy Kan, A.H.G., Stougie, L.: Analysis of heuristics for vehicle routing problems. In: Golden, B.L., Assad, A.A. (eds.) Vehicle Routing: Methods and Studies, pp. 47–61. Elsevier, Amsterdam (1988)
16. Hertz, A., Laporte, G., Mittaz, M.: A tabu search heuristic for the capacitated arc routing problem. Oper. Res. 48(1), 129–135 (2000)
17. Hu, H., Liu, T., Ning, Z., Zhou, Y., Min, D.: A hybrid genetic algorithm with perturbation for the multi-depot capacitated arc routing problem. J. Appl. Sci. 13(16), 3239–3244 (2013)
18. Jansen, K.: Bounds for the general capacitated routing problem. Networks 23, 165–173 (1993)
19. Kansou, A., Yassine, A.: A two ant colony approaches for the multi-depot capacitated arc routing problem. In: International Conference on Computers & Industrial Engineering, Troyes, France, pp. 1040–1045 (2009)
20. Krushinsky, D., Van Woensel, T.: An approach to the asymmetric multi-depot capacitated arc routing problem. Eur. J. Oper. Res. 244, 100–109 (2015)
21. Labbe, M., Laporte, G., Mercure, H.: Capacitated vehicle routing on trees. Oper. Res. 39(4), 616–622 (1991)
22. Li, C.-L., Simchi-Levi, D.: Worst-case analysis of heuristics for multidepot capacitated vehicle routing problems. ORSA J. Comput. 40, 790–799 (1992)
23. Liu, T., Jiang, Z., Geng, N.: A genetic local search algorithm for the multi-depot heterogeneous fleet capacitated arc routing problem. Flex. Serv. Manuf. J. 26(4), 540–564 (2012). https://doi.org/10.1007/s10696-012-9166-z
24. Park, J., Kim, B.I.: The school bus routing problem. Eur. J. Oper. Res. 202(2), 311–319 (2010)
25. Wohlk, S.: An approximation algorithm for the capacitated arc routing problem. Open Oper. Res. J. 2, 8–12 (2008)
26. Wu, Y., Lu, X.: Capacitated vehicle routing problem on line with unsplittable demands. J. Comb. Optim. (2020). https://doi.org/10.1007/s10878-020-00565-5
27. Zhao, J., Zhu, F.: A multi-depot vehicle-routing model for the explosive waste recycling. Int. J. Prod. Res. 54(2), 550–563 (2016)

Zero-Shot Face Swapping with De-identification Adversarial Learning

Huifang Li1, Yidong Li1(B), Jiaming Liu2, Zhibin Hong2, Tianshu Hu2, and Yan Ren3

1 School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China {hflili,ydli}@bjtu.edu.cn
2 Baidu Inc., Baidu Technology Park Building No. 2, Xibeiwang East Road, Beijing 100193, China {liujiaming03,hongzhibin,hutianshu01}@baidu.com
3 QI-ANXIN Technology Group Inc., Beijing 100044, China [emailprotected]

Abstract. In this paper, we propose a Zero-shot Face Swapping Network (ZFSNet) to swap novel identities for which no training data is available, which is very practical. In contrast to many existing methods that consist of several stages, the proposed model can generate images containing the unseen identity in a single forward pass without fine-tuning. To achieve this, on top of the basic encoder-decoder framework, we propose an additional de-identification (De-ID) module after the encoder to remove the source identity information remaining in the encoding stream, which improves the model's generalization capability. We then introduce an attention component (ASSM) to blend the encoded source feature and the target identity feature adaptively. It amplifies proper local details and helps the decoder attend to the related identity feature. Extensive experiments on synthesized and real images demonstrate that the proposed modules are effective for zero-shot face swapping. In addition, we also evaluate our framework on zero-shot facial expression translation to show its versatility and flexibility.

Keywords: Face swapping · Facial expression translation · Adversarial learning

1

Introduction

Image-to-image translation changes a particular aspect of a given image to a required one, such as facial identity, facial expression, hairstyle and

This work is supported by the Fundamental Research Funds for the Central Universities of China 2019YJS032, the Joint Funds of the National Natural Science Foundation of China under Grant No. U1934220, and the 2020 Industrial Internet Innovation and Development Project.
© Springer Nature Switzerland AG 2022
H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 101–112, 2022. https://doi.org/10.1007/978-3-030-96772-7_10


Fig. 1. Visualizations of the image translation results of ZFSNet. The ﬁrst and second row illustrate the face swapping and facial expression translation results respectively.

gender. It is a popular topic given the ubiquitous use of social media. Face swapping (as shown in the first row of Fig. 1) is a task that transfers the target identity into a source image while keeping the source content, such as pose, expression and yaw/pitch, unchanged. This technique can be widely used for entertainment [10] and data augmentation. With the introduction of generative adversarial networks (GANs) [6], recent years have witnessed great progress in image translation [4,6,7]. However, the success of image translation relies heavily on enormous paired training data, and it is impractical to collect training data for every class in the real world. To tackle this situation, we propose a zero-shot face swapping network, which attempts to transfer an unseen target identity to the source face image. This is practical, especially when the target images are difficult to collect.

Recently, existing methods achieve face swapping with deep generative models [6]. For example, Deepfakes [1] first leverages the autoencoder network for face swapping and achieves a promising result. However, it requires hundreds or even thousands of examples to train the network, and the model can only be applied to a specific identity. Aiming at loosening the one-to-one face swapping constraint, [15] proposes a many-to-many face swapping framework by disentangling the identity and content features of faces and then recombining the content feature with another identity feature. However, it suffers from limited generalization capability: when tested with unseen identities that are not included in the training data, the model's performance deteriorates. One explanation of this phenomenon might be that the embedded content features still retain some source identity information, and the redundant source identity could result in an unidentifiable face. To address this problem, [2,14] constrain the content embedding by a standard Gaussian.
Specifically, they regularize the content distribution qθ(z|x) based on the KL divergence between qθ(z|x) and p(z) = N(0, I) to force the content embedding to be general and to neglect the identity information. Nevertheless, images generated by this method are usually blurry. A possible cause is that the posterior defined by qθ(z|x) is not complex enough [22]. Alternatively, [20] creates a few-shot face reenactment model. It requires

Zero-Shot Face Swapping with De-identiﬁcation Adversarial Learning

103

a few examples to generalize the model to unseen targets via ﬁne-tuning the trained model, but the time-consuming ﬁne-tuning process restricts its potential application. In this paper, a general one-stage Zero-shot Face Swapping Network (ZFSNet) is presented to address the challenging zero-shot face swapping, which requires only one target image and no ﬁne-tuning. The generated image not only has the identity of the unseen targets, but also retains the content (such as pose and expression) of the source image. To achieve it, based on the basic encoderdecoder framework, we propose a novel De-IDentification (De-ID) module to destroy the identity-speciﬁc features of source images based on an adversarial classiﬁer. As a result, the generated image will not be disturbed by the source identity information. We conduct extensive experiments to validate its eﬀectiveness to destroy the source identity. Then we propose an Attentive Spatial Style Modulation (ASSM) module to fuse the source content feature and the target identity adaptively, allowing the decoder to retrieve the appropriate code for each spatial location and pay attention to local details. Extensive experiments are conducted to demonstrate that the proposed modules are eﬀective, and the network can be applied to other image translation tasks, such as facial expression translation (as shown in the second row of Fig. 1).

2 Method

2.1 Overall Framework

As shown in Fig. 2, the basic ZFSNet contains a content encoder ΦEc to learn content features zc, an identity encoder ΦEi to extract the identity information of the target image, and a decoder ΦD to generate a new image. We then introduce a De-IDentification module ΦCl that, via adversarial learning, prevents ΦEc from learning source identity features. As a result, there is no stable representation of identity features in zc, and the identity information of the source image cannot be clearly decoded from zc. Furthermore, in the decoder ΦD, we propose an Attentive Spatial Style Modulation module to help fuse the source content and the target identity feature adaptively. Therefore, the generated image can contain the identity of the target image and the content of the source image. More details are given in the following subsections.

2.2 De-identification Module

As mentioned in the previous section, given a source image Ic, a content encoder is employed to extract the face content feature zc. Intuitively, we expect zc to contain as little identity information as possible while keeping enough content information for the decoder to recover the content of the output face. Inspired by popular adversarial learning methods [16], we introduce a de-identification module after the content representation and train the content encoder in an adversarial fashion.


H. Li et al.

Fig. 2. The framework of our ZFSNet. The model is trained based on a mix-batch training strategy. In paired training, the inputs of the network have the same identity, while in unpaired training, the inputs of the network have diﬀerent identities.

In this module, we employ a classifier ΦCl, a multi-layer perceptron (as shown in Table 1) that takes zc as input. The learning of ΦCl and the other sub-networks, i.e., ΦEc, ΦEi and ΦD, is conducted in two iterative steps. In the first step, the parameters of ΦEc are fixed and the classic cross-entropy loss is introduced:

Lcls = E[−log P(y = ŷ | x = zc)],   (1)

where ŷ is the ground-truth identity label of Ic. In this phase, ΦCl is trained to extract the identity information from zc and thereby differentiate the identity of Ic. In the second step, ΦCl is frozen and a de-identification loss is imposed. Specifically, the de-identification loss Ldeid is the negative entropy of the ΦCl prediction:

Ldeid = E[−H(y | x = ΦEc(Ic))].   (2)
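The two losses of the alternation can be illustrated with a small numpy sketch (the helper names are ours, not the paper's; in the real model these are computed on the classifier's softmax outputs during the alternating updates):

```python
import numpy as np

def cross_entropy_loss(probs, labels):
    # Eq. (1): L_cls = -log P(y = y_hat | z_c), averaged over the batch.
    # Used in step 1, where the classifier is trained with the encoder fixed.
    return float(np.mean(-np.log(probs[np.arange(len(probs)), labels])))

def de_id_loss(probs, eps=1e-12):
    # Eq. (2): L_deid is the negative entropy of the classifier prediction.
    # Used in step 2 with the classifier frozen; minimizing -H(y|x) maximizes
    # the prediction entropy, pushing it toward uniform so the classifier can
    # no longer recover the source identity from z_c.
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    return float(np.mean(-entropy))
```

A uniform prediction gives the minimal value of the de-identification loss, which is the state the content encoder is driven toward.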

It should be noted that Ldeid is jointly imposed with other functional losses to learn ΦEc. In this manner, the content-related features in zc can be effectively encoded by ΦEc, but the identity-related features in zc are unstable due to Ldeid, so the identity of the source image cannot be learned and captured.

Table 1. Network architecture of the classifier ΦCl. ID represents the identity number.

ΦCl: BN → [FC (1024), BN, LeakyReLU] × 4 → [FC (512), BN, LeakyReLU] × 2 → FC (ID)

2.3 Attentive Spatial Style Modulation Module

The decoder is required to recover the target identity from the identity encoder. Recent works [8,9] show that controlling the statistics, a.k.a. styles, of feature maps enables the decoder to yield controllable face synthesis results. However, in [9], features across all spatial locations share the same style code, whereas face editing tasks usually require spatially aware style modification. Instead of generating global styles, we design an attentive spatial style modulation module to help the decoder retrieve the corresponding style code adaptively. Concretely, the ASSM module takes a content feature F and an identity feature Fi as inputs, and returns the modulated feature F̃. F produces a query map Q, and Fi generates styles V and the corresponding keys K, where Q, K and V are produced by 1 × 1 convolutions. An attentive matrix A is then calculated as

A(i, j) = exp(λat Q(i)ᵀ · K(j)ᵀ) / Στ∈H·W exp(λat Q(i)ᵀ · K(τ)ᵀ),   (3)

where λat denotes the temperature term controlling the sharpness of the softmax distribution, and i and j are the row and column indices of A, corresponding to spatial locations in Q and K, respectively. λat is set to 0.01 by default. The retrieved style γ is the weighted average of V obtained by multiplying by A:

γ(i) = Σj∈H×W A(i, j) · V(j).   (4)

Finally, the modulated feature F̃ is generated by Norm(Conv(F ⊗ γ)), where Norm(·) is the normalization operator and ⊗ refers to element-wise product. The modulated feature map F̃ therefore contains both the content of F and the style of Fi, achieved by adaptively combining the identity features according to the corresponding semantics in F.
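A minimal numpy sketch of the attention computation in Eqs. (3)-(4), operating on flattened (H·W, C) feature maps (function and variable names are illustrative; in the model, Q, K and V come from 1 × 1 convolutions):

```python
import numpy as np

def assm_attention(Q, K, V, lam=0.01):
    """Attentive spatial style retrieval, Eqs. (3)-(4).

    Q: (HW, C) query map from the content feature F.
    K, V: (HW, C) keys and styles from the identity feature Fi.
    Returns gamma, one retrieved style vector per spatial location.
    """
    logits = lam * (Q @ K.T)                     # lam is the temperature λ_at
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)            # Eq. (3): softmax over key locations
    return A @ V                                 # Eq. (4): gamma(i) = Σ_j A(i,j) V(j)
```

Because each row of A sums to one, every retrieved style is a convex combination of the styles in V; the subsequent Norm(Conv(F ⊗ γ)) step is a standard convolution and normalization and is omitted here.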

2.4 Mix-Batch Training Strategy

During the training phase, we employ a mix-batch training strategy. As shown in Fig. 2, each batch of images mixes two components. One component contains paired images, meaning Ic and Ii1 come from the same person, and the output Io must be the same as Ic. This component helps the network quickly learn to generate face images by providing strong reconstruction supervision. The other component consists of unpaired images, meaning Ic and Ii2 have different identities, and the output Io contains the pose and expression of Ic and the identity of Ii2. The unpaired training is essential to enhance the model's target identity or attribute transfer ability. When paired images are fed into the network, the output is the reconstruction of Ic as well as its binary face mask Mc. The reconstruction losses are as follows:

Lrec = SSIM(Ic · Mc, Io · M),  Lmask1 = L1(Mc, M),   (5)


where Io represents the predicted image, M is the predicted mask, and SSIM refers to the Structural Similarity (SSIM) [19] loss. It is worth mentioning that we only force the network to reconstruct the facial area of the image, regardless of the background. Finally, we can simply apply alpha blending with Ic, Io and M to get the final face swapping result Ib as Ib = M · Io + (1 − M) · Ic. As mentioned in Sect. 2.2, we introduce a de-identification classifier ΦCl in the latent space and train ΦCl and ΦEc in an adversarial way; the training losses are formulated in Eqs. 1 and 2. For the unpaired branch, to ensure that the generated face in Io keeps the same identity as Ii2, we propose an identity preserving loss Lid. We also add a mask reconstruction loss Lmask2 to guide the network to learn the pose of the source image:

Lid = L2(f(Io), f(Ii2)),  Lmask2 = L1(Mc, M),   (6)

where f(·) indicates a pre-trained facial identity extractor [5]. The adversarial loss Ladv following WGAN-GP [6] is used to learn the parameters of the generator and discriminator. Ladv is formulated as Eq. 7, where the first two terms are the original critic losses and the last term is a penalty on the gradient norm. Pr is the real data distribution, Pg is the generator distribution, PĨ is the random interpolation distribution, and λgp is a penalty coefficient:

Ladv = E_{Io∈Pg}[D(Io)] − E_{Ic∈Pr}[D(Ic)] + λgp E_{Ĩ∈PĨ}[(‖∇Ĩ D(Ĩ)‖₂ − 1)²].   (7)
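The simpler pieces of this pipeline, the identity loss of Eq. (6) and the final alpha blending, can be sketched in numpy as follows (a simplification: we treat the identity embeddings produced by f(·) as given arrays):

```python
import numpy as np

def identity_loss(emb_out, emb_target):
    # Eq. (6): L_id is the L2 distance between the identity embeddings
    # f(I_o) of the output and f(I_i2) of the unpaired target.
    return float(np.linalg.norm(emb_out - emb_target))

def alpha_blend(mask, generated, source):
    # I_b = M * I_o + (1 - M) * I_c: keep the generated face inside the
    # predicted mask and the untouched source pixels everywhere else.
    return mask * generated + (1.0 - mask) * source
```

With a mask of all ones the blend returns the generated image; with all zeros it returns the source, so only the facial region is ever replaced.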

Finally, we combine these constraints to optimize our network. At test time, a source image and a target image are fed into the content encoder ΦEc and the identity encoder ΦEi, respectively, and the decoder ΦD, equipped with the ASSM module, outputs the result.

3 Experiments

3.1 Experiment Setup

Datasets. Our model is trained on the VGGFace2 [3] and FaceForensics++ [17] datasets. To verify the generalization ability of the model over unseen subjects, we choose the last 50 videos of FaceForensics++ as the unseen test set N, meaning that the identities in it are not included in the training data. We also prepare a seen test set S, consisting of 100 subjects with 20 images per subject. The 100 subjects in test set S appear in the training data, but these particular images do not. The remaining image sequences of the FaceForensics++ dataset form the training data. In addition, we validate that our ZFSNet can be flexibly applied to other facial translation tasks. Since face swapping is closely related to facial expression translation, the task of changing the source expression of a given image to a target expression, we apply ZFSNet to the expression dataset RaFD [13]. To validate the model, we also prepare a seen test set S and an unseen test set N on RaFD. Following [4], we exclude the 'neutral' expression during training and regard the 'neutral' images as the N test set. The remaining 7 expression categories are retained for training; from these 7 categories, we randomly select 20 images per category as the test set S.

Implementation Details. For face swapping, we align and crop the face image with MTCNN [21] and use the face recognizer VGGFace2 [3] to extract a 256-dim face identity embedding. The cropped image is of size 256 × 256 and is then resized to 128 × 128. For facial expression translation, we first train an expression classification model on the RaFD training set based on the VGG19 [18] network and then test its classification accuracy on the set S; the accuracy of the classifier is 99.28%. This expression classification model is used to extract expression embeddings.

Metrics. For identity preserving capacity, we compute the cosine similarity of embedding vectors extracted from the widely used face recognition model [5]; a larger value means a higher similarity between the two images. For each subject in the test sets, we first find a frontal face image using the Euler angles calculated by Dlib [11], and these frontal face images are used as target images for face swapping. The identity preserving metric is calculated between the generated image and the frontal target image. To inspect pose and expression fitting accuracy, we use Dlib [11] to estimate the Euler angles and landmark positions of face images. We then compute the root mean square error of the Euler angles, and the mean distance of the landmark vectors, normalized by the face's binocular distance, between the synthesized image and the source image. For these two indicators, a lower value means a smaller difference in pose and expression.
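The three metrics can be sketched in numpy as follows (illustrative helper names; the embeddings, Euler angles, and landmarks would come from the recognition model [5] and Dlib as described above):

```python
import numpy as np

def cosine_similarity(a, b):
    # Identity preservation: cosine of the angle between two face embeddings.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euler_rmse(angles_pred, angles_src):
    # Pose error: root mean square error over the Euler angles.
    return float(np.sqrt(np.mean((angles_pred - angles_src) ** 2)))

def normalized_landmark_distance(lm_pred, lm_src, left_eye, right_eye):
    # Expression error: mean landmark distance normalized by the
    # binocular (inter-ocular) distance of the face.
    binocular = np.linalg.norm(left_eye - right_eye)
    return float(np.mean(np.linalg.norm(lm_pred - lm_src, axis=1)) / binocular)
```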

3.2 Ablation Study

The Effectiveness of the Proposed Components. We conduct an ablation study to validate the proposed De-ID and ASSM modules. The configurations are as follows: (1) Ours is the proposed network with the De-ID and ASSM modules. (2) w/o De-ID is the network without the De-ID adversarial learning. (3) w/ KL refers to the network that constrains the content feature with the KL divergence regularization instead of the proposed De-ID. (4) w/o ASSM corresponds to the network without the ASSM module. Table 2 shows the quantitative results and Fig. 3 visualizes the outputs. We find that the output images of ZFSNet have the content of the source image and the identity of the target image. On the contrary, when ZFSNet is trained without the De-ID module, the generated images sometimes retain some identity information from the source image. The Impact of the De-ID Module on Content Feature Learning. We further explore whether the De-ID module really prevents the encoder from

Table 2. Quantitative results on the N − N swapping setting.

Method           | Cosine ↑ | Euler ↓ | Landmark ↓
Ours             | 0.520    | 18.910  | 2.877
w/o De-ID module | 0.385    | 20.35   | 1.631
w/ KL            | 0.429    | 31.252  | 3.324
w/o ASSM         | 0.481    | 21.683  | 2.027

Fig. 3. Results on ablation setups. The source-target images are all from the N set.

extracting identity information. Concretely, we train two models, with and without the De-ID module, on both the face swapping and facial expression datasets. Taking zc of the pre-trained models as the input feature, we train classifiers to classify zc into the corresponding class. Ideally, the more identity-related information contained in zc, the more accurate the classifier will be. Therefore, we expect a significant drop in classification performance, or an increase in converged loss, when zc is trained with De-ID. The training losses of the classifiers are shown in Fig. 4, where each classifier is trained for 2000 iterations and the weights of ZFSNet are fixed during training. The converged training losses with the De-ID module are close to the original training losses with randomly initialized classifiers. Therefore, we can safely draw the conclusion that the De-ID module indeed prevents the encoder from extracting identity features.

Fig. 4. Classifier training losses with and without the De-ID module.

3.3 Comparison

We compare our method with the popular face swapping methods Deepfakes [1] and FaceSwap [12]. Deepfakes [1] is a one-to-one swapping model based on the denoising autoencoder. FaceSwap is also a one-to-one swapping model, based on a convolutional network that renders an image with the style of the target image. Our ZFSNet is a many-to-many framework for zero-shot face swapping, which requires only one target image and no fine-tuning. Importantly, this comparison is unfair to us, since Deepfakes and FaceSwap need to retrain the face swapping model for each source-target subject pair, and the target identity is used during training. Even so, our method achieves comparable or better performance, which proves that our model successfully achieves zero-shot face translation.

Quantitative Comparison. We provide two quantitative comparisons to validate the zero-shot translation ability of ZFSNet, which translates images without fine-tuning. The comparisons are implemented on N − N swapping (source and target identities are all from the unseen test set) and S − N swapping (source and target identities are from the seen test set and the unseen test set, respectively). The results are shown in Tables 3 and 4. On N − N swapping, our method performs better than Deepfakes and FaceSwap, which is our main concern in this paper. On S − N swapping, the performance of our method is comparable to Deepfakes and better than FaceSwap. Our ZFSNet achieves results comparable to or even better than [1,12] without fine-tuning and without using the target image during training. The results illustrate the effectiveness of our method.

Table 3. Quantitative face swapping results on the N − N swapping setting.

Method        | Cosine ↑ | Euler ↓ | Landmark ↓
Deepfakes [1] | 0.506    | 49.078  | 4.194
FaceSwap [12] | 0.441    | 29.903  | 2.593
Ours          | 0.520    | 18.910  | 2.877

Table 4. Quantitative face swapping results on the S − N swapping setting.

Method        | Cosine ↑ | Euler ↓ | Landmark ↓
Deepfakes [1] | 0.530    | 73.137  | 8.515
FaceSwap [12] | 0.446    | 42.215  | 4.453
Ours          | 0.515    | 29.889  | 4.855

Qualitative Comparison. We also visualize the outputs of our method in Fig. 5. Our ZFSNet not only preserves the identity of the target faces but also retains the content, such as the pose, yaw/pitch, and expression, of the source faces. It achieves results comparable to Deepfakes and FaceSwap.


Fig. 5. Qualitative comparison on FaceForensics++ image sequences.

3.4 Facial Expression Translation

We also verify our ZFSNet on the facial expression translation task. In practice, the facial expression embedding is treated as a kind of identity embedding. We train our model on RaFD [13]. The source image and the target expression image are fed into the content encoder and the expression encoder to obtain the content feature zc and expression feature zi. The decoder then decodes zc and zi to generate a new expression image Io.

Fig. 6. Results of transferring seven facial expressions to the unseen neutral expression.


To validate zero-shot expression translation, the source images are selected from the seen 7 expression categories and the target image is taken from the unseen neutral expression data. Figure 6 shows the results of translating the expressions into the unseen neutral expression. It can be seen that our ZFSNet outputs high-quality neutral images, indicating that the model can be flexibly applied to unseen domains.

4 Conclusion

In this paper, we propose a general Zero-shot Face Swapping Network (ZFSNet) that can swap in an unseen target face using only one target image. Specifically, we propose a de-identification (De-ID) module to constrain the content encoder and alleviate the problem of identity information being retained in the learned content feature; the De-ID module and the content encoder are learned in an adversarial manner. We then design an attentive spatial style modulation (ASSM) module to combine the content feature and the target identity feature adaptively, guiding the decoder to attend to related local details. Through these improvements, ZFSNet can successfully generate images containing a specific unseen identity. Moreover, the proposed method is generic and can easily be applied to other attribute translation tasks, such as facial expression translation. Extensive experiments validate the effectiveness of our method. In future work, we will further improve the generalization ability of the model.

References

1. Deepfakes: faceswap (2016). https://github.com/deepfakes/faceswap. Accessed 06 Feb 2019
2. Bao, J., Chen, D., Wen, F., Li, H., Hua, G.: Towards open-set identity preserving face synthesis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6713–6722 (2018)
3. Cao, Q., Shen, L., Xie, W., Parkhi, O.M., Zisserman, A.: VGGFace2: a dataset for recognising faces across pose and age. In: Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition, pp. 67–74 (2018)
4. Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797 (2018)
5. Deng, J., Guo, J., Xue, N., Zafeiriou, S.: ArcFace: additive angular margin loss for deep face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699 (2019)
6. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of Wasserstein GANs. In: Proceedings of the Conference on Neural Information Processing Systems, pp. 5767–5777 (2017)
7. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134 (2017)
8. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410 (2019)
9. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of StyleGAN. arXiv preprint arXiv:1912.04958 (2019)
10. Kim, H., et al.: Deep video portraits. ACM Trans. Graph. 37(4), 1–14 (2018)
11. King, D.E.: Dlib-ml: a machine learning toolkit. J. Mach. Learn. Res. 10, 1755–1758 (2009)
12. Korshunova, I., Shi, W., Dambre, J., Theis, L.: Fast face-swap using convolutional neural networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3677–3685 (2017)
13. Langner, O., Dotsch, R., Bijlstra, G., Wigboldus, D.H., Hawk, S.T., van Knippenberg, A.: Presentation and validation of the Radboud Faces Database. Cogn. Emot. 24(8), 1377–1388 (2010)
14. Natsume, R., Yatagawa, T., Morishima, S.: FSNet: an identity-aware generative model for image-based face swapping. In: Proceedings of the Asian Conference on Computer Vision, pp. 117–132 (2018)
15. Natsume, R., Yatagawa, T., Morishima, S.: RSGAN: face swapping and editing using face and hair representation in latent spaces. arXiv preprint arXiv:1804.03447 (2018)
16. Perera, P., Nallapati, R., Xiang, B.: OCGAN: one-class novelty detection using GANs with constrained latent representations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2898–2906 (2019)
17. Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M.: FaceForensics++: learning to detect manipulated facial images, pp. 1–11 (2019)
18. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
19. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Processing 13(4), 600–612 (2004)
20. Zakharov, E., Shysheya, A., Burkov, E., Lempitsky, V.: Few-shot adversarial learning of realistic neural talking head models. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 9459–9468 (2019)
21. Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 23(10), 1499–1503 (2016)
22. Zheng, Z., Sun, L.: Disentangling latent space for VAE by label relevant/irrelevant dimensions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12192–12201 (2019)

An User-Driven Active Way to Push ACL in Software-Defined Networking

Haisheng Yu2,3(B), Dong Liu1,3, Wenyong Wang1,2, Keqiu Li4, Sai Zou5, Zhaobin Liu6, and Yan Liu1

1 Macau University of Science and Technology, Macao, China
2 University of Electronic Science and Technology of China, Chengdu, China
3 BII Group, Beijing, China
4 Tianjin University, Tianjin, China
5 Guizhou University, Guiyang, China
6 Dalian Maritime University, Dalian, China
[emailprotected]

Abstract. Compared with the traditional network, Software-Defined Networking (SDN) provides a more convenient network paradigm for building Access Control List (ACL) applications. There have been a few studies focusing on ACL applications in SDN so far, but most of the existing work adopts a reactive way to enforce ACL, so a new ACL update cannot take effect immediately. In this paper, we propose CLACK, an approach for user-driven centralized ACL in SDN. We implement CLACK on both the Floodlight and ONOS controllers. The experimental results show that CLACK has better performance than the existing Floodlight firewall application.

Keywords: Access Control List (ACL) · Software-Defined Networking (SDN) · Security · Floodlight · ONOS

1 Introduction

The Internet, accommodating a variety of heterogeneous networks and distributed applications [12], has achieved great success and been an enormous force for social and economic development since it was proposed [10]. However, the Internet environment has changed dramatically as a result of emerging network services and network scale expansion, and the traditional Internet architecture has exposed serious deficiencies, such as unexpected delays in data communication [14] and difficulty in balancing traffic load among links [9]. The fundamental reason is the tight coupling of control logic and data forwarding in network devices (e.g., routers and switches) and the distributed control of network devices [3]. SDN provides an open, software-programmable model and a diversity of network control functions. It has gained wide recognition and good support from both academia and industry.

An Access Control List (ACL) is a network security enhancement. It applies a set of ACL rules to each IP packet and determines whether to forward or drop the packet based on its header fields.

© Springer Nature Switzerland AG 2022
H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 113–120, 2022. https://doi.org/10.1007/978-3-030-96772-7_11

ACL is similar to a stateless firewall or packet-filtering firewall, which provides basic traffic filtering capabilities [13]. In traditional networks, ACL is often placed in network devices (e.g., routers and switches) and can be configured to control both inbound and outbound traffic. Network devices examine each packet and determine whether to forward or drop it on the basis of the rules specified in the ACL [4]. Unfortunately, this approach has several deficiencies. First, network devices must have appropriate hardware and processing capabilities to enforce ACL, causing a vast expense. Worse, it is too complicated to design and configure ACL across distributed network devices, not to mention the situation when the network security policy changes. The cumbersome maintenance of ACL in complex networks is also prone to error. The root reason lies in the distributed way ACL is enforced in traditional networks. Software-Defined Networking (SDN) provides a convenient network paradigm to solve this problem: SDN separates the control logic and forwarding logic of traditional networks, and the SDN controller configures the network in a centralized manner rather than through distributed configuration [8]. In this paper, we propose CLACK, an approach for user-driven centralized ACL in SDN. CLACK adopts a proactive way to enforce ACL, avoiding additional delay and saving the controller's resources, and it reacts to new ACL updates and network view updates in real time to ensure network security. CLACK uses an abstract network view to accelerate processing and performs a match check for each newly added ACL rule to avoid invalid rules. We implement CLACK on both the Floodlight and ONOS controllers [2], and CLACK has also been integrated into new versions of both controllers.

Fig. 1. Network security violation in a reactive way


Fig. 2. CLACK architecture

2 Clack Design

2.1 Overview

Figure 2 depicts CLACK's architecture. CLACK provides a REST API for users and contains two core modules, the Accessing Pair (AP) Manager and the Access Control List (ACL) Manager. Each module has several submodules in charge of different processing steps. In CLACK, each ACL rule contains several match fields and an action field. Packets matched by the match fields are forwarded or dropped according to the action field. An ACL rule is denoted as:

R{id; nw proto; src ip; dst ip; dst port; action}

Each ACL rule has a distinct id. The match fields comprise nw proto (network protocol), src ip (source IP address), dst ip (destination IP address), and dst port (TCP or UDP destination port). A match field value may be a wildcard, which stands for all possible field values. The src ip and dst ip fields use CIDR IP addresses, each of which can designate many unique IP addresses. The action field value is either "ALLOW" or "DENY". CLACK provides a friendly and centralized user interface through the REST API for users to add, remove, and query ACL rules. Users can use CLACK easily by sending an HTTP request containing a JSON string, and they no longer need to configure distributed switches one by one, as CLACK does all the work.
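An illustrative Python sketch of this rule format and its packet-matching semantics (the names here are ours for illustration; CLACK itself is implemented inside the Floodlight and ONOS controllers):

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

WILDCARD = "*"

@dataclass
class AclRule:
    id: int
    nw_proto: str   # e.g. "TCP", "UDP", "ICMP", or "*"
    src_ip: str     # CIDR, e.g. "10.0.0.0/24", or "*"
    dst_ip: str     # CIDR or "*"
    dst_port: str   # port number as a string, or "*"
    action: str     # "ALLOW" or "DENY"

def rule_matches_packet(rule, proto, src, dst, dport):
    # A packet matches when every match field is either a wildcard
    # or consistent with the corresponding packet header field.
    if rule.nw_proto != WILDCARD and rule.nw_proto != proto:
        return False
    if rule.src_ip != WILDCARD and ip_address(src) not in ip_network(rule.src_ip):
        return False
    if rule.dst_ip != WILDCARD and ip_address(dst) not in ip_network(rule.dst_ip):
        return False
    if rule.dst_port != WILDCARD and rule.dst_port != str(dport):
        return False
    return True
```

For example, a DENY rule on src 10.0.0.0/24 with dst port 80 matches a TCP packet from 10.0.0.5 to port 80, but not one from 10.0.1.5.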


CLACK filters IP packets via ACL flow entries that exactly reflect the ACL rules in ingress or egress switches. After receiving a user's ACL update request, CLACK updates the ACL rules and ACL flow entries immediately. We describe CLACK's core modules in the following subsections.

Fig. 3. Abstract network view and Accessing Pair (AP)

2.2 Accessing Pair (AP) Manager

In CLACK, the real network view is transformed into an abstract network view. The abstract network view conceals the internal network topology and exposes only the interfaces between edge switches and external hosts, as Fig. 3 depicts. We use an Accessing Pair (AP) to store the interface information in the abstract network view. An AP is denoted as:

AP: {id; dpid; ip}

The fields represent the AP id, the edge switch's dpid (data path id), and the host's IP address, respectively. The AP Manager is the CLACK module that maintains AP information in real time and provides a query function. The AP Manager monitors host update events in the network and stores all interface information in the AP Set. When a new host appears or disappears in the network, the AP Manager updates the AP Set correspondingly and calls the ACL Manager for further processing, as described in Sect. 2.3.


The AP Manager also provides a query function, getSwitchSet. Given a CIDR IP address, the function traverses the AP Set and returns a switch set; each switch in the set connects with a host whose IP address is contained in the CIDR IP address. This function is used when generating ACL flow entries.
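A small Python sketch of getSwitchSet over an AP Set, using CIDR containment (the names are illustrative; the actual implementation lives inside the controller):

```python
from collections import namedtuple
from ipaddress import ip_address, ip_network

# Each AP records which edge switch (dpid) an external host attaches to.
AccessingPair = namedtuple("AccessingPair", ["id", "dpid", "ip"])

def get_switch_set(ap_set, cidr):
    """Return the dpids of edge switches whose attached host's IP
    falls inside the given CIDR block (sketch of getSwitchSet)."""
    net = ip_network(cidr)
    return {ap.dpid for ap in ap_set if ip_address(ap.ip) in net}
```

Only the switches returned here need ACL flow entries for a rule whose src ip or dst ip is that CIDR block, which is what keeps enforcement proportional to the abstract view rather than the full topology.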

2.3 Access Control List (ACL) Manager

The Access Control List (ACL) Manager is the CLACK module that updates the ACL and processes AP updates. After receiving a new ACL update request, the ACL Manager verifies its validity and returns an error message if it is not valid. If the user requests to add a new ACL rule, the ACL Manager first parses the user's request JSON string and generates a new ACL rule. It then traverses the ACL Rule Set to check whether the new ACL rule matches another existing rule; the new rule is rejected if a match is found. The ACL Manager generates a distinct id for each rule passing the match check, adds it to the ACL Rule Set, and starts the enforcing stage. The match check is important because it rejects invalid rules, reducing storage overhead in both the switches and the controller. Two functions are used in the match check, and they give the definition of match:

cover(Rnew, Rold, field): a Boolean function, where Rnew and Rold denote ACL rules and field denotes an ACL rule's match field. cover(Rnew, Rold, field) = true if: for field ∈ {nw proto, dst port}, Rold.field has a wildcard value and Rnew.field has a user-assigned value; for field ∈ {src ip, dst ip}, Rold.field contains all the IP addresses in Rnew.field.

match(Rnew, Rold): a Boolean function. match(Rnew, Rold) = true if, for every field ∈ {nw proto, src ip, dst ip, dst port}: Rnew.field = Rold.field or cover(Rnew, Rold, field) = true.

We say ACL rule Rnew matches Rold if all packets filtered by Rnew are already filtered by Rold, so Rnew would not work at all if added. If the user requests to remove an existing ACL rule, the ACL Manager first parses the user's request and gets the rule's id. It then removes the rule from the ACL Rule Set and starts the enforcing stage.
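The cover and match definitions above translate almost directly into code. A Python sketch (rules represented as plain dicts with "*" as the wildcard; the real CLACK modules are controller code):

```python
from ipaddress import ip_network

WILDCARD = "*"

def cover(r_new, r_old, field):
    """cover(Rnew, Rold, field): Rold's field is strictly broader."""
    if field in ("nw_proto", "dst_port"):
        return r_old[field] == WILDCARD and r_new[field] != WILDCARD
    if field in ("src_ip", "dst_ip"):
        if WILDCARD in (r_new[field], r_old[field]):
            return r_old[field] == WILDCARD and r_new[field] != WILDCARD
        # Rold.field contains every IP address in Rnew.field
        return ip_network(r_new[field]).subnet_of(ip_network(r_old[field]))
    return False

def match(r_new, r_old):
    """Rnew matches Rold iff every packet filtered by Rnew is already
    filtered by Rold; such a new rule is rejected as invalid."""
    return all(
        r_new[f] == r_old[f] or cover(r_new, r_old, f)
        for f in ("nw_proto", "src_ip", "dst_ip", "dst_port")
    )
```

For example, a rule on TCP traffic from 10.0.1.0/24 to port 80 matches an existing all-protocol rule on 10.0.0.0/16 with wildcard port, but not the other way around.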

3 Evaluation

We compare CLACK with the Floodlight firewall application. As mentioned before, to enforce ACL, CLACK works in a proactive way while Floodlight adopts a reactive way. This means that different events trigger their ACL enforcing processes: a user's request for CLACK, and a Packet-in message for the Floodlight firewall application. It is therefore unreasonable to compare their performance in the general situation. Instead, we create a situation in which a new ACL update conflicts with an ACL flow entry in the switches and compare the delay for the new ACL update to take effect, as in Fig. 1. We build a virtual network in Mininet [1] and run several experiments. For each experiment, we add different numbers of ACL rules in advance and ensure that CLACK has to traverse the ACL Rule Set during the update. We then let host A in the network send ICMP packets to host B using the ping command. If host A succeeds in pinging host B at first, we add a new ACL rule to deny the flow and record the delay until an ACL flow entry drops the flow. If there is already an ACL rule denying the flow and host A fails to ping host B at first, we remove that ACL rule and record the delay until a regular flow entry forwards the flow. The delay in the Floodlight firewall application is more than 5000 ms because a flow entry's default idle timeout is set to 5000 ms in Floodlight; no Packet-in messages are sent to the controller as long as the ACL flow entry persists. As a result, a new ACL update will not take effect until after at least one idle timeout. We regard this delay as 5000 ms uniformly.

Fig. 4. Add a new ACL rule (single-controller version). [Plot omitted: delay in ms vs. existing rule number, comparing CLACK rule adding, CLACK adding enforcing, CLACK total, and Floodlight firewall total.]

As Fig. 4 shows, in the single-controller version, the delay for rule adding and removing in CLACK goes up linearly as the existing ACL rule number increases because CLACK needs to traverse ACL Rule Set. The delay for enforcing ACL update vibrates for reason that CLACK needs to communicate with switches, and the delay depends on the network quality at that time.

An User-Driven Active Way to Push ACL in Software-Deﬁned Networking


Fig. 5. Add a new ACL rule (multi-controller version). [Plot omitted: delay in ms vs. existing rule number, comparing CLACK rule adding, CLACK adding enforcing, CLACK total, and Floodlight firewall total.]

As Fig. 5 shows, the evaluation result for the multi-controller version is similar to that of the single-controller version, except that the delay for rule removing remains almost constant. This is because we use hash tables rather than a single set to store ACL rules, and hash tables are more efficient for indexing and updating. The comparison indicates that CLACK substantially outperforms the Floodlight firewall application when handling new ACL update requests in the collision situation.
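A toy illustration (not CLACK's code) of why keying rules by id keeps removal time flat: a hash table removes a rule in expected O(1), while a single flat rule set must be scanned linearly:

```python
# Two storage strategies for the same 20,000 rules.
rules_list = [{"id": i} for i in range(20000)]   # single-set storage
rules_by_id = {r["id"]: r for r in rules_list}   # hash-table storage

def remove_linear(rules, rule_id):
    for i, r in enumerate(rules):                # O(n) scan
        if r["id"] == rule_id:
            del rules[i]
            return True
    return False

def remove_hashed(index, rule_id):
    return index.pop(rule_id, None) is not None  # O(1) expected

assert remove_linear(rules_list, 19999)          # worst case: full scan
assert remove_hashed(rules_by_id, 19999)         # one lookup
```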

4 Conclusion and Future Work

In this paper, we propose CLACK, an approach for user-driven centralized ACL in SDN. CLACK adopts a proactive way to enforce ACL and reacts to new ACL updates and network view updates in real time. We implement CLACK on the Floodlight and ONOS controllers and conduct a large number of experiments. The experimental results show that CLACK performs better than the existing Floodlight firewall application. The dynamic flow tunneling scenario shows that a malicious application can evade ACL by simply adding a few flow entries in SDN [11]. The root cause is that OpenFlow allows various Set-Field actions that can dynamically change packet headers [5]. Kazemian et al. proposed a real-time policy checking tool called NetPlumber [6] based on Header Space Analysis (HSA) [7]. In the future, we intend to add an HSA-based security check capability to CLACK to prevent attacks from adversaries.

Acknowledgement. This work is supported by the Macau Science and Technology Development Fund (Grant No. 0018/2021/A).


H. Yu et al.

References
1. Mininet: an instant virtual network on your laptop. http://mininet.org/
2. ONOS: a new carrier-grade SDN network operating system designed for high availability, performance, scale-out. http://onosproject.org/
3. Casado, M., Foster, N., Guha, A.: Abstractions for software-defined networks. Communications of the ACM (2014)
4. Cisco: Security configuration guide, release 12.2. Cisco, San Jose, CA (2003)
5. Hu, H., Han, W., Ahn, G.J., Zhao, Z.: FlowGuard: building robust firewalls for software-defined networks. In: Proceedings of the Third Workshop on Hot Topics in Software Defined Networking, pp. 97–102 (2014)
6. Kazemian, P., Chang, M., Zeng, H., Varghese, G., McKeown, N., Whyte, S.: Real time network policy checking using header space analysis. In: 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), pp. 99–111 (2013)
7. Kazemian, P., Varghese, G., McKeown, N.: Header space analysis: static checking for networks. In: 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp. 113–126 (2012)
8. Kim, H., Feamster, N.: Improving network management with software defined networking. IEEE Commun. Mag. 51(2), 114–119 (2013)
9. Manoj, N.: Fuzzy controlled routing in a swarm robotic network. IAES Int. J. Robot. Autom. (IJRA) 3(4), 272 (2014)
10. Paulus, C.: A brief history of the internet (1997)
11. Porras, P., Shin, S., Yegneswaran, V., Fong, M., Tyson, M., Gu, G.: A security enforcement kernel for OpenFlow networks. In: Proceedings of the First Workshop on Hot Topics in Software Defined Networks, pp. 121–126 (2012)
12. Shrivastav, A.A.: Reorganization of intruder using ad-hoc network and RFID. IAES Int. J. Robot. Autom. (IJRA) 3(4), 46–52 (2014)
13. Stallings, W.: Network Security Essentials: Applications and Standards (2010)
14. Vasalya, A., Agrawal, R.: Smart telerobotic surveillance system via internet with reduced time delay. IAES Int. J. Robot. Autom. (IJRA) 2(1), 11 (2012)

Photonic Computing and Communication for Neural Network Accelerators

Chengpeng Xia1(B), Yawen Chen1, Haibo Zhang1, Hao Zhang1, and Jigang Wu2

1 Department of Computer Science, University of Otago, Dunedin, New Zealand
{chengpeng.xia,hao.zhang}@postgrad.otago.ac.nz, {yawen,haibo}@cs.otago.ac.nz
2 School of Computers, Guangdong University of Technology, Guangzhou, China

Abstract. Conventional electronic Artificial Neural Network (ANN) accelerators focus on architecture design and numerical computation optimization to improve the training speed. Optical technology, with its low energy consumption and high transmission speed, is expected to play an important role in the next generation of computing architectures. To provide a better understanding of optical technology used in ANN acceleration, this paper presents a comprehensive review of optical implementations of ANN accelerators. We propose a classification of existing solutions into optical computing acceleration and optical communication acceleration, according to the optical effects and optical architectures employed. Moreover, we discuss the challenges facing these photonic neural network acceleration approaches to highlight the most promising future research opportunities in this field.

Keywords: Optical neural networks · Optical interconnection networks · Neural network accelerator

1 Introduction

The wide application of Artificial Intelligence (AI) in areas such as computer vision, speech recognition, and language processing calls for efficient implementation of the model training and inference phases in machine learning [16]. In particular, since the seminal work by Hinton et al. on deep learning in 2006, Artificial Neural Networks (ANNs) have returned to prominence [5]. Multiple neural networks have been studied and applied in different fields. However, with large data sets and massively interconnected ANNs, traditional computer architectures struggle to deliver efficient inference and prediction due to their limited computing power. Photonic architectures, with low power consumption, high bandwidth, and high transmission speed, have been considered a potential future alternative to electronic architectures, and optical solutions for ANN computing and communication acceleration have emerged in response. To this aim, many linear
© Springer Nature Switzerland AG 2022. H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 121–128, 2022. https://doi.org/10.1007/978-3-030-96772-7_12

Fig. 1. Classification of photonic implementations in ANN accelerators: optical implementations for computing (optical resonance, optical diffraction, and optical interference based ANN accelerators) and optical implementations for communication (off-chip and on-chip communication for ANN accelerators).

transformations have been demonstrated to be performable with passive optics, without power consumption and with minimal latency [14]. The feasibility of optical logic gates has also been demonstrated [7]. Hence, optical implementations of neural networks have been investigated to increase ANN training speed and energy efficiency [15]. Moreover, optical on-chip and off-chip network architectures have been designed with the aim of increasing model parallelism and data transmission speed. In this paper we present a survey of approaches for implementing Optical Neural Network (ONN) accelerators. We propose a classification of the existing solutions into two categories: optical implementations for computing and for communication. Previous surveys focused either on computing acceleration in neural networks or on the bottlenecks of photonic technologies, ignoring the contribution of on-chip optical communication to neural network acceleration. The remainder of this paper is organized as follows. The classification of photonic computing and communication in ANN accelerators is presented in Fig. 1. In Sect. 2, we review the most relevant solutions categorized according to the optical implementations for computing, while in Sect. 3 we describe the optical approaches devised for the communication acceleration of ANN training. Section 4 discusses the challenges and future research opportunities in this field, and Sect. 5 concludes the paper.

2 Optical Implementations for Computing

2.1 Optical Resonance Based Neural Network Accelerators

Inspired by neuroscience, where biological neurons communicate through short pulses, optical resonance based ANN accelerators have been studied extensively. The wavelength selectivity of the Micro-Ring Resonator (MRR), a key element of such accelerators, makes the Wavelength Division Multiplexing (WDM) approach possible, which is closely tied to the non-coherent architectures of the ONN. In contrast to spatial multiplexing, the WDM channel


Fig. 2. The Broadcast-and-weight architecture proposed by [17].

can coexist in a single bus waveguide without interference, which simplifies the interconnection network of neurons to some extent. An on-chip optical architecture for neural network implementations, named Broadcast-and-Weight (BW), was explored in [17]. As shown in Fig. 2, the BW architecture employs multiple wavelengths to transfer data in parallel, with each distinct wavelength output to a common bus waveguide. The outputs are multiplexed and distributed to an all-neuron connection, in which the broadcast is realized by passively splitting the bus waveguide. The MRR weight bank is an array of reconfigurable filters that can be tuned to drain energy from their resonant wavelengths, thereby imprinting the weight coefficient onto each corresponding channel. Inspired by the BW protocol, a photonic convolutional neural network accelerator (PCNNA) was proposed for CNN inference in [11]. PCNNA uses a single-layer multiplexing CNN architecture, which enables the propagation of different neural network layers. The authors argued that, because multiple kernels share the same receptive field values per layer, convolution computations for different kernels can be performed in parallel. In the high-level framework, PCNNA runs on two clock domains: the faster domain operates the optical network, and the slower domain interfaces with the electronic circuits.
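Numerically, the broadcast-and-weight scheme computes a dot product: each input rides its own wavelength, the MRR weight bank applies a per-wavelength transmission coefficient, and a balanced photodetector sums all weighted channels. The values below are purely illustrative:

```python
import numpy as np

# Toy sketch of broadcast-and-weight (illustrative values, not from [17]).
inputs = np.array([0.8, 0.3, 0.5])    # optical power per wavelength channel
weights = np.array([0.9, -0.2, 0.4])  # MRR transmission (weight) per channel

weighted = weights * inputs           # per-channel weighting in the MRR bank
neuron_input = weighted.sum()         # photodetector accumulates all channels
assert np.isclose(neuron_input, inputs @ weights)  # i.e., a dot product
```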

2.2 Optical Diffraction Based Neural Network Accelerators

Diffraction effects are usually the main factor limiting the performance of optical devices, but using the diffraction principle appropriately can effectively realize an ONN. The Holographic Optical Element (HOE) is a current research focus in information storage and is considered a good storage medium for the weights and directions of ONN connections [13]. In [19], Zuo et al. presented a Spatial Light Modulator (SLM) based all-optical ANN, in which optical matrix multiplication is implemented in a clever way. The authors divided the SLM into several regions according to the number of input beams; each region is a superposition of multiple phase grating stacks, i.e., holograms. The multiplication of the ANN is realized by the diffraction of the incident beam in the


Fig. 3. Diﬀractive deep neural networks (D2 NN) depicted by [10].

HOEs, in which the weights of the neural network are mapped to the directions of the incident beams. After the diffracted beam passes through a convex lens, it undergoes a Fourier transform. Finally, beams travelling in the same direction are focused onto the same point of the plane, realizing the accumulation operation. In addition to holograms, Lin et al. [10] explored a diffraction-based all-optical neural network called D2NN, built from sequentially cascaded phase masks. As depicted in Fig. 3, in D2NN the fully connected neural network is implemented by multiple 3D-printed phase masks arranged in order as a layered array with fixed spacing. Each phase mask represents one layer of the fully connected neural network. The small grids in the phase masks denote neurons, which are loaded with different weight information through different refractive indices and thicknesses. Along the direction of the incident beam, each neuron connects to all neurons in the next layer after diffraction, so all neurons in each phase mask are fully connected. D2NN assigns weights to the neural network by adjusting the phases and the light attenuation.
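A minimal numerical toy of one diffractive layer, under strong simplifying assumptions (this is not the D2NN implementation from [10]): the phase mask modulates the phase of the incident field, and free-space propagation to the next layer is approximated by a 2-D Fourier transform (far-field/Fraunhofer regime):

```python
import numpy as np

rng = np.random.default_rng(0)
field_in = np.ones((32, 32), dtype=complex)        # uniform incident beam
phase_mask = rng.uniform(0, 2 * np.pi, (32, 32))   # learned phases (weights)

modulated = field_in * np.exp(1j * phase_mask)     # phase-only modulation
field_out = np.fft.fftshift(np.fft.fft2(modulated))  # propagate to next layer
intensity = np.abs(field_out) ** 2                 # what a detector measures

# Energy is conserved up to the DFT scaling factor (Parseval's theorem).
assert np.isclose(intensity.sum(),
                  modulated.size * (np.abs(modulated) ** 2).sum())
```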

2.3 Optical Interference Based Neural Network Accelerators

Different from diffraction, the interference effect usually requires fewer linear light waves, and waveguides are needed to propagate them. Interference based ANN implementations mainly rely on the Mach-Zehnder Interferometer (MZI), an optical device made of two waveguides with directional couplers and phase shifters. The MZI has a coherent structure that loads the weight information into the neural network by adjusting the phase and amplitude of the input light. Shen et al. [15] proposed an all-optical neural network using coherent nanophotonic circuits, which became a seminal work for subsequent interference based ANN accelerators. Singular Value Decomposition (SVD) [9] is used to realize optical matrix multiplication: it decomposes the matrix M into M = U Σ V†, where U and V are unitary matrices and Σ is a diagonal matrix. The MZIs are set up as a cascaded array divided into three parts, with the parts realizing the matrices U, Σ, and V† respectively. The cascaded array can be regarded as a fully connected neural network. When the input light passes through the MZIs, the accelerator applies two parallel coherent light waves at both phase shifters, which interfere with the input light, so that the matrix multiplication operation of a CNN can be realized in these processes.
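The SVD mapping used by such coherent meshes can be checked numerically: any weight matrix factors as M = U Σ V†, where the unitary factors are realizable as MZI arrays and the diagonal factor as attenuators/amplifiers. A small NumPy sketch (illustrative sizes only):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))          # arbitrary weight matrix

U, s, Vh = np.linalg.svd(M)          # M = U @ diag(s) @ Vh
S = np.diag(s)

x = rng.normal(size=4)               # input optical amplitudes
y = U @ (S @ (Vh @ x))               # three cascaded physical stages
assert np.allclose(y, M @ x)         # same result as direct matrix-vector
```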

3 Optical Implementations for Communication

Existing ANNs are challenged by high computational complexity, large volumes of computational data, strong demand for memory access, and a high demand for system parallelism, all of which are widespread in current model training. In the latest ANNs, tens to hundreds of megabytes of parameters are required to execute a single inference pass, and over one billion operations generate a large volume of memory-access requests from the processing elements (PEs), which makes existing architectures face the memory-wall challenge. Model training also generates a large amount of reusable data; for example, the filter data, input feature map data, and partial sum data created during convolution in a CNN can all be regarded as reusable resources.

3.1 Off-Chip Communication for Neural Network Accelerators

Optical interconnection has a long research history in the datacenter field. To improve communication performance, prior work shows the benefits of reconfigurable topologies in datacenter networks, either by adding optical links to the electrical topology [4,12] or by creating all-optical datacenter interconnects [1]. Nevertheless, there are only a limited number of studies on using optical interconnection to optimize ANN accelerators. The work in [6] proposed an all-optical interconnect for ANN systems named SiP-ML, for strong scaling of ML workloads by leveraging SiP chiplets. Considering the parallelism of ANN algorithms and the uniformity and repeatability of the communication pattern during training, SiP-ML designed two data-reuse based topologies at opposite ends of the spectrum. As shown in Fig. 4a, an Optical Circuit Switch (OCS) based topology called SiP-OCS consists of Q commercially available optical switches. Each OCS has N ports (the same as the number of GPUs), and each GPU is connected to every OCS in a flat topology. Given the roughly 10 ms reconfiguration latency, a SiP-OCS configuration can last through the entire model training. Meanwhile, micro-ring resonators embedded in SiP ports are used to build a switch-free topology, named SiP-Ring, which completely removes switching elements. The MRRs act as spectral filters to select and forward wavelengths, and they enable the reuse of wavelengths across non-overlapping segments of the ring. In contrast to SiP-OCS, SiP-Ring reconfigures wavelengths within each port to achieve logically rich topologies. Moreover, an inter/intra-chip silicon photonic network for rack-scale computing systems called RSON was presented in [18]. RSON adopts circuit switching for the inter-chip network and the ONoC because of the relatively high overhead of


Fig. 4. Two topologies for SiP-ML proposed by [6].

optical path setup/teardown and the difficulty of buffering optical signals. RSON utilizes the inter-node interface as the medium to coordinate requests from both the local ONoC and the optical switch [18]. A channel partition and dynamic path priority control scheme is designed to reduce the control complexity and arbitration overhead.

3.2 On-Chip Communication for Neural Network Accelerators

In [8], the authors argued that electrical interconnection in existing many-core platforms will not be able to sustain the massively increasing bandwidth demands of big-data-driven AI applications. Hence, a rapid topology generation and core mapping of ONoC (REGO) for heterogeneous multicore architectures was proposed. Based on a genetic algorithm, REGO receives as inputs an application task graph, including the number of cores, and the ONoC parameters, which further include the available router structures and the loss and noise factors of the optical elements. REGO can thus accommodate various router structures and optical elements, because it calculates the worst-case OSNR from loss and noise parameters obtained in advance. A fine-grained parallel computing model for ANN training on ONoC was presented in [3], in which the trade-off between computation and communication is analyzed to support ANN acceleration. To minimize the total training time, three mapping strategies were designed, one for each ANN training stage, each with an optimal number of cores. The advantages and disadvantages of each mapping strategy are discussed and analyzed in terms of hotspot level, memory requirement, and state transitions.

4 Challenges and Opportunities

This paper reviews the optical approaches to accelerating neural networks from two aspects, i.e., computing and communication. In recent years, with the maturation of ANN theory and the development of silicon photonics, the implementation of ONNs has attracted growing attention. Nevertheless, there are still some outstanding challenges that limit the inference accuracy,


reliability, and scalability of ONNs. Hence, we summarize the challenges and opportunities to offer suggestions for future research.

Scalability: The existing works discussed in this review mainly focus on three approaches to accelerating ANN model training: small optical neural network implementations, matrix-vector multiplication acceleration, and optical network architectures for communication acceleration. The two major issues with the above approaches are the area consumption and energy attenuation of the optical devices. The schemes in [14] and [2] show that the optical depth (the number of MZI units traversed along the longest path) for a unitary matrix is limited to 2N − 3 and N, respectively, for an ANN with N neurons. The optical depth increases linearly with the number of neurons, which directly translates into additional loss in silicon photonic integration. Research is thus needed to design novel architectures that reduce silicon photonic hardware complexity.

Robustness: Robustness also becomes more and more critical as systems scale up. Specifically, since the phase of each MZI is highly sensitive to environmental change, thermal crosstalk, and imperfect manufacturing, phase errors cascade throughout the computation. Whereas on-chip thermal crosstalk can be suppressed, the finite encoding precision of the phase settings remains a fundamental limitation for ONNs with high computational complexity. The phase errors, in particular, accumulate as the lightwave signal traverses an MZI mesh with an optical depth of 2N + 1. In addition, such errors propagate through each layer of the network, which ultimately restricts the depth of the neural network. To realize robust photonic accelerators, research is needed on effective photonic crosstalk mitigation, phase noise correction, and noise-resilient photodetection.
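The scalability argument can be made concrete with a back-of-envelope sketch: total insertion loss grows linearly with optical depth. The per-MZI loss figure below is an assumed illustrative value, not a measured one:

```python
def optical_depth_reck(n):
    """Optical depth of a Reck-style triangular mesh [14]: 2N - 3."""
    return 2 * n - 3

def optical_depth_clements(n):
    """Optical depth of a Clements-style rectangular mesh [2]: N."""
    return n

LOSS_PER_MZI_DB = 0.25  # assumed insertion loss per MZI, in dB

assert optical_depth_reck(8) == 13 and optical_depth_clements(8) == 8
# For N = 64 neurons, a Reck mesh accumulates 125 * 0.25 = 31.25 dB of loss.
assert optical_depth_reck(64) * LOSS_PER_MZI_DB == 31.25
```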

5 Conclusion

In this paper, we provide a comprehensive survey of optical implementations of ANN accelerators, covering photonic computing acceleration and photonic communication acceleration. For optical neural networks, we present the current ANN accelerators realized through optical effects. For optical interconnection, we introduce the existing studies from the perspectives of off-chip and on-chip communication for ANN accelerators. Furthermore, we point out the open challenges and future research opportunities for photonic neural network accelerators, which we expect to provide guidance and insight for future researchers and developers in this field.

Acknowledgement. This work is supported by the National Natural Science Foundation of China under Grant Nos. 62106052 and 62072118.


References
1. Chen, L., et al.: Enabling wide-spread communications on optical fabric with MegaSwitch. In: 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pp. 577–593 (2017)
2. Clements, W.R., Humphreys, P.C., Metcalf, B.J., Kolthammer, W.S., et al.: Optimal design for universal multiport interferometers. Optica 3(12), 1460–1465 (2016)
3. Dai, F., Chen, Y., Zhang, H., Huang, Z.: Accelerating fully connected neural network on optical network-on-chip (ONoC). arXiv preprint arXiv:2109.14878 (2021)
4. Farrington, N., et al.: Helios: a hybrid electrical/optical switch architecture for modular data centers. In: ACM SIGCOMM 2010, pp. 339–350 (2010)
5. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)
6. Khani, M., et al.: SiP-ML: high-bandwidth optical network interconnects for machine learning training. In: ACM SIGCOMM 2021, pp. 657–675 (2021)
7. Kim, J.Y., Kang, J.M., Kim, T.Y., Han, S.K.: All-optical multiple logic gates with XOR, NOR, OR, and NAND functions using parallel SOA-MZI structures: theory and experiment. J. Lightwave Technol. 24(9), 3392 (2006)
8. Kim, Y.W., Choi, S.H., Han, T.H.: Rapid topology generation and core mapping of optical network-on-chip for heterogeneous computing platform. IEEE Access 9, 110359–110370 (2021)
9. Lawson, C.L., Hanson, R.J.: Solving Least Squares Problems. SIAM (1995)
10. Lin, X., Rivenson, Y., Yardimci, N.T., Veli, M., et al.: All-optical machine learning using diffractive deep neural networks. Science 361(6406), 1004–1008 (2018)
11. Mehrabian, A., Al-Kabani, Y., Sorger, V.J., El-Ghazawi, T.: PCNNA: a photonic convolutional neural network accelerator. In: 2018 31st IEEE International System-on-Chip Conference (SOCC), pp. 169–173. IEEE (2018)
12. Mellette, W.M., McGuinness, R., Roy, A., Forencich, A., Papen, G., Snoeren, A.C., Porter, G.: RotorNet: a scalable, low-complexity, optical datacenter network. In: ACM SIGCOMM 2017, pp. 267–280 (2017)
13. Psaltis, D., Brady, D., Wagner, K.: Adaptive optical networks using photorefractive crystals. Appl. Opt. 27(9), 1752–1759 (1988)
14. Reck, M., Zeilinger, A., Bernstein, H.J., Bertani, P.: Experimental realization of any discrete unitary operator. Phys. Rev. Lett. 73(1), 58 (1994)
15. Shen, Y., et al.: Deep learning with coherent nanophotonic circuits. Nat. Photonics 11(7), 441–446 (2017)
16. Silver, D., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
17. Tait, A.N., Nahmias, M.A., Shastri, B.J., Prucnal, P.R.: Broadcast and weight: an integrated network for scalable photonic spike processing. J. Lightwave Technol. 32(21), 4029–4041 (2014)
18. Yang, P., et al.: RSON: an inter/intra-chip silicon photonic network for rack-scale computing systems. In: 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1369–1374. IEEE (2018)
19. Zuo, Y., Li, B., Zhao, Y., Jiang, Y., Chen, Y.C., Chen, P., et al.: All-optical neural network with nonlinear activation functions. Optica 6(9), 1132–1137 (2019)

Performance Comparison of Multi-layer Perceptron Training on Electrical and Optical Network-on-Chips

Fei Dai(B), Yawen Chen, Zhiyi Huang, and Haibo Zhang

University of Otago, Dunedin, New Zealand
{travis,yawen,hzy,haibo}@cs.otago.ac.nz

Abstract. Multi-layer Perceptron (MLP) is a class of Artificial Neural Networks widely used in regression, classification, and prediction. To accelerate the training of MLP, more cores can be used for parallel computing on many-core systems. With the increasing number of cores, the interconnection of cores plays a pivotal role in accelerating MLP training. Currently, chip-scale interconnection can use either electrical or optical signals for data transmission among cores; the former is known as Electrical Network-on-Chip (ENoC) and the latter as Optical Network-on-Chip (ONoC). Due to the differences between optical and electrical characteristics, the performance and energy consumption of MLP training on ONoC and ENoC can be very different, so comparing them is worthy of study. In this paper, we first compare the differences between ONoC and ENoC based on a parallel MLP training method. Then, we formulate their performance models by analyzing communication and computation time. Furthermore, the energy models are formulated according to their static and dynamic energy consumption. Finally, we conduct extensive simulations to compare the performance and energy consumption of ONoC and ENoC. Results show that, compared with ENoC, the MLP training time on ONoC is reduced by 70.12% on average and the energy consumption is reduced by 48.36% under batch size 32. However, with a small number of cores in MLP training, ENoC consumes less energy than ONoC.

Keywords: Multi-layer perceptron · Optical network-on-chip · Artificial Neural Networks · Energy consumption

1 Introduction

Multi-layer Perceptron (MLP) is one type of deep learning model that can be applied to classification, recommendation engines, and anomaly detection. However, the training of complex MLP models can be very slow with large data sets. Since MLP is intrinsically amenable to parallel computation, more cores can be integrated into many-core systems to accelerate its training.
© Springer Nature Switzerland AG 2022. H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 129–141, 2022. https://doi.org/10.1007/978-3-030-96772-7_13

With the increasing number of cores integrated into the chip, on-chip interconnection becomes an essential factor in accelerating MLP training, which is normally constrained by communication cost and memory requirements. Electrical Network-on-Chip (ENoC) was first proposed to improve system performance, with communications among cores using electrical signals. However, it has scalability issues due to hop-by-hop routing via electrical routers, which does not scale well to a large number of cores. Optical Network-on-Chip (ONoC) was proposed as a promising alternative paradigm to ENoC, using optical communications among cores. Compared with ENoC, ONoC has many advantages, such as low transmission delay, low power cost, high bandwidth, and large throughput [1]. Moreover, ONoC enables the transmission of multiple signals in one waveguide using different wavelengths through Wavelength Division Multiplexing (WDM) technology [2]. With these advantages, ONoC has great capability to efficiently perform intensive inter-core communications and can effectively accelerate the parallel computing of MLP training. However, ONoC also has some extra overheads, such as OE/EO conversion cost, insertion loss caused by light transmission in the waveguide, and the tuning power of micro-rings, which can affect the performance and energy consumption of MLP training. Moreover, the performance of MLP training also depends on the communication patterns in the on-chip network, which in turn depend on the number of cores, the batch size, the NN benchmark, and so on. To date, there have been no comparative studies of MLP training on ENoC and ONoC regarding training performance and energy consumption. Only a few pieces of work comparing ENoC and ONoC can be found: [3] compares their performance under different topologies, and [4] reports their performance and energy consumption using synthetic traffic.
Nevertheless, these studies do not consider the comparison in the scenario of neural network training. It is therefore important to investigate MLP training efficiency on ONoC versus ENoC under different configurations. The research questions include: 1) Does ONoC always outperform ENoC for MLP training? 2) How much improvement can be achieved for MLP training on ONoC compared with ENoC under different configurations? 3) Under what conditions and settings does ENoC consume less energy than ONoC for MLP training? In this paper, we aim to compare the performance and energy consumption of MLP training on ONoC and ENoC under different configurations. We answer the above questions with key contributions summarized as follows:
1. We compare the differences between ONoC and ENoC based on a parallel MLP training method [5]. We formulate their performance by analyzing their communication and computation costs, and formulate their energy based on static and dynamic energy costs.
2. We conduct extensive simulations to compare the MLP training performance and energy consumption between ONoC and ENoC under different batch sizes using different NN benchmarks. Results show that ONoC outperforms ENoC with an average training time reduction of 70.12%, and that ONoC is more energy-efficient than ENoC, especially when a large number of cores are used.

Performance Comparison of MLP Training on ENoC and ONoC


The remainder of the paper proceeds as follows: Sect. 2 describes the background of this paper, including MLP training and the ONoC/ENoC systems. Section 3 first illustrates parallel MLP training on NoC systems, then presents the performance and energy models of ONoC and ENoC. Section 4 compares performance and energy consumption between ONoC and ENoC. Finally, Sect. 5 concludes the paper.

2 Background

2.1 Training of MLP

The training process of MLP consists of forward propagation and backward propagation. We use Zl to represent the activation vector of layer l (the output vector of layer l, input to layer l + 1) and Wl to represent the weight matrix at layer l. In forward propagation, the output of layer l with nl neurons is defined as Zl = f(Wl Zl−1 + bl), where f(∗) is the activation function and bl is the bias vector of layer l. In backward propagation, we use El and ΔWl to represent the error vector and the weight gradient at layer l. The error is calculated as El = (El+1 WlT) f′(Zl), where f′(∗) is the derivative of f(∗). Then, using the error vector El, the weight gradient is calculated as ΔWl = ZlT El+1. Finally, after the gradient is obtained, the weights are updated as Wl = Wl + σΔWl, where σ is the learning rate.
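The forward and backward passes above can be sketched in NumPy. This is a minimal illustration only, not the paper's implementation (which is in C with BLAS); it uses a column-vector convention, so the formulas appear transposed relative to the row-vector notation above, and sigmoid stands in for the unspecified activation f:

```python
import numpy as np

def f(x):
    """Activation function (sigmoid chosen as an example)."""
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(z):
    """Derivative of the sigmoid, written in terms of its output z = f(x)."""
    return z * (1.0 - z)

def forward(weights, biases, z0):
    """Forward propagation: Z_l = f(W_l Z_{l-1} + b_l)."""
    zs = [z0]
    for W, b in zip(weights, biases):
        zs.append(f(W @ zs[-1] + b))
    return zs

def backward(weights, zs, e_out):
    """Backward propagation: E_l = (W_{l+1}^T E_{l+1}) * f'(Z_l);
    the gradient of W_l is the outer product of E_l and Z_{l-1}."""
    errors = [e_out]
    for l in range(len(weights) - 1, 0, -1):
        errors.insert(0, (weights[l].T @ errors[0]) * f_prime(zs[l]))
    return [np.outer(e, z) for e, z in zip(errors, zs[:-1])]
```

Each gradient returned by `backward` has the same shape as the corresponding weight matrix, so the update Wl = Wl + σΔWl is a single array operation per layer.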

[Fig. 1 legend: RWA – Routing and Wavelength Allocator; ONI – Optical Network Interface; NI – Network Interface; MPE – Manager Processing Element; PE – Processing Element; R – Router. Panels: (a) Optical NoC with optical network plane, optical control plane, and core plane; (b) Electrical NoC with electrical network plane and core plane; (c) Illustration of periods in parallel MLP training.]

Fig. 1. Overview of (a) Optical network-on-chip system and (b) Electrical network-on-chip system; (c) Illustration of periods in parallel MLP training.

F. Dai et al.

2.2 Optical and Electrical On-Chip Interconnects

We first describe the major differences, advantages, and disadvantages of ONoC and ENoC, and then present the ONoC and ENoC architectures used in this paper.

The main difference between ENoC and ONoC is the transmission medium used for communication among cores. In ENoC, packets travel from source to destination through electrical links and routers. In ONoC, cores communicate via optical routers and can use different wavelengths in parallel through the waveguide thanks to Wavelength Division Multiplexing (WDM) technology. The merits of optical communication can be summarized as follows: low transmission delay (2–3 cycles between any two points on the chip with a 2 GHz clock), low power cost (roughly independent of distance), high bandwidth (up to 40 Gb/s per wavelength), and the feasibility of wavelength division multiplexing (64 wavelengths per waveguide). One drawback of ONoC is that it requires a large number of optical components, which dissipate considerable static power. Compared with ONoC, ENoC offers good flexibility (a variety of topologies) and performs well over short distances, but it does not scale well, resulting in high latency as more cores are integrated into the chip.

The ONoC and ENoC architectures used to train the MLP in this paper are shown in Fig. 1(a) and (b), respectively; both are based on a ring topology. The ONoC architecture is similar to the one proposed in [6], where the optical network plane has an optical control plane for configuring the optical routers (each a pair of transmitter and receiver). In each router, the receiver is equipped with a splitter to split optical signals. As can be seen from Fig. 1(a), the PEs and the optical routers are connected to the optical network interface through vertical links for router configuration and data transmission.
Before communication, the Manager Processing Element (MPE) and the Routing and Wavelength Allocator (RWA) are used to configure the optical network. After configuration, the corresponding modulators in the transmitters and drop filters in the receivers are ready for communication. We assume only one optical waveguide is used in this paper. The ENoC architecture consists of an electrical network plane and a core plane, as shown in Fig. 1(b). The network interface of each PE connects to an electrical router, and the routers in the electrical network plane are connected to each other via electrical links in a ring topology. Note that each core in the core plane of ONoC/ENoC has an on-chip distributed memory architecture, with a private L1 cache and distributed SRAM connected to the main memory via the memory controller. More details about the system parameters are given in Sect. 4.

3 Methodology of Parallel MLP Training on ONoC and ENoC Systems

3.1 Parallel MLP Training

We ﬁrst use an example given in Fig. 1(c) to explain the process of MLP training on ONoC/ENoC system. For parallel computation during MLP training, the


neurons in the MLP can be mapped to multiple cores to execute in parallel, where multiple neurons can be mapped onto the same core. As illustrated in Fig. 1(c), one epoch of training is divided into multiple periods based on layers, and these periods are executed sequentially. In the initialization process (Period 0), data and MLP instructions in the main memory are loaded into the distributed SRAM of the cores. In the subsequent periods, the cores mapped with neurons in the corresponding layer perform computations concurrently and then exchange the outputs with the cores mapped with neurons in the next layer through inter-core communications, instead of accessing the main memory.

3.2 Performance Model

As illustrated in Fig. 1(c), one epoch of training is divided into multiple periods based on layers. The FP process is divided into l + 1 periods, labeled Period 0 to Period l, and the BP process is divided into another l periods, labeled Period l + 1 to Period 2l. Note that Period 0 is the initialization period, which has no computation or communication. To take advantage of data locality, the cores used in forward propagation are reused in back propagation. In this way, all MLP parameters and intermediate values are stored distributively in the SRAM of the corresponding cores and stay there during one epoch of training. Cores used in different layers exchange data via communications on ONoC/ENoC.

During the MLP training process, the only difference between ENoC and ONoC is the communication stage, which can result in different training times. Therefore, we first formulate the communication time of ONoC and ENoC separately, then formulate their computation time, and finally derive their total MLP training times. Because each epoch of MLP training is repetitive, the formulation below is based on one epoch of MLP training.

Communication Time. We use m to represent the number of cores used in parallel MLP training and assume the neurons are evenly mapped to the m cores in each period. Let di and ni represent the transferred data volume and the number of neurons in period i, where i ∈ [1, 2l]. According to the parallel training of the FP and BP processes, the transferred data volume varies with the number of cores and can be calculated as

    di = 0,                            i = 1, l, and 2l;
    di = ni·v·a / m,                   i ∈ [2, l − 1];                  (1)
    di = (ni·n2l−i + ni)·v·a / m,      i ∈ [l + 1, 2l − 1],

where v is the batch size and a is the storage size of one parameter. d1, dl, and d2l are 0 because there is no communication in these periods.
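Equation (1) can be transcribed directly. The sketch below is illustrative only; `layers` holds the neuron counts n_1..n_l, and the helper reflects the convention that n_i in a BP period equals n_{2l−i+1}:

```python
def neurons_in_period(layers, i):
    """n_i: neurons active in period i (layer i in FP, layer 2l-i+1 in BP)."""
    l = len(layers)
    layer = i if i <= l else 2 * l - i + 1
    return layers[layer - 1]

def data_volume(layers, m, v, a):
    """Transferred data volume d_i per period, following Eq. (1).
    layers: [n_1, ..., n_l]; m: cores; v: batch size; a: bytes/parameter."""
    l = len(layers)
    d = {}
    for i in range(1, 2 * l + 1):
        n_i = neurons_in_period(layers, i)
        if i in (1, l, 2 * l):
            d[i] = 0.0                        # no communication in these periods
        elif i < l:
            d[i] = n_i * v * a / m            # FP periods 2 .. l-1
        else:
            n_prev = layers[(2 * l - i) - 1]  # n_{2l-i}
            d[i] = (n_i * n_prev + n_i) * v * a / m  # BP periods l+1 .. 2l-1
    return d
```

For NN1 (784–1000–500–10) with m = 50, v = 32, and a = 4 bytes, period 2 transfers 1000·32·4/50 = 2560 bytes per core, while periods 1, l, and 2l transfer nothing.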

134

F. Dai et al.

ONoC Communication Time: The communication time of MLP training on ONoC in period i equals the time for the m cores in period i to finish exchanging their data di with other cores using optical communications. Let s represent the flit size; then the total number of flits transmitted in period i equals di/s. Assume the number of available wavelengths is λmax. By leveraging WDM technology, the communications of ONoC in each period can be parallelized by letting multiple cores transmit simultaneously on different wavelengths. For a period i that requires communication, all m cores can transmit concurrently if m ≤ λmax; otherwise, Time Division Multiplexing (TDM) is needed to complete the transmissions of the m cores. The delays of O/E/O conversion, time of flight, de/serialization, and routing and wavelength assignment are denoted Do, Df, Ds, and Da, respectively. Let ε1(i) be the time required to complete communications in period i on ONoC. We have

    ε1(i) = (m/λmax) · (di/s) · (Do + Df + Ds) + Da.                  (2)

ENoC Communication Time: The communication time of MLP training on ENoC in period i equals the time for the m cores in period i to finish exchanging data volume di with each other via electrical routers. The communication pattern on ENoC is the same as an all-gather/all-reduce operation among cores. As the Bulk Synchronous Parallel (BSP) model is widely used for evaluating the performance of parallel algorithms on distributed-memory systems [7], we use the BSP model to evaluate the all-gather/all-reduce operation during parallel MLP training on ENoC, with the communication time formulated as follows. Each super-step in the BSP model is regarded as one execution period of the MLP on ENoC. We denote by hij the number of flits that core j sends or receives during period i, where i ∈ [1, 2l] and j ∈ [1, m]. Then the maximum number of flits sent or received by any core in period i, denoted Hi, can be calculated as

    Hi = max_{j∈[1,m]} (hij).                                         (3)

The process in which all cores exchange their data with the other cores in each execution period is an all-gather/all-reduce process, for which we use the recursive doubling method [8] to execute the all-gather operation. This process takes log2 m sub-steps to finish; the data held by each core doubles at each sub-step until the full data volume di is gathered/reduced. The cost of sending or receiving data volume di in period i is therefore g·Hi·Σ_{k=1}^{log2 m} (di/2^k), where g is the bandwidth cost of the ENoC to transmit data and k (k ∈ [1, log2 m]) indexes the sub-steps of the all-gather/all-reduce process. Let ε2(i) be the time required to complete communications on ENoC in period i. We have

    ε2(i) = g·Hi·Σ_{k=1}^{log2 m} (di/2^k) + bi,                      (4)

where bi is the latency for barrier synchronization in period i.
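Equations (2) and (4) translate into a few lines of code. This is a hedged sketch: the ceiling operations assume integral TDM rounds and flits (the equations use plain ratios), and the ENoC model assumes m is a power of two, as recursive doubling requires:

```python
import math

def onoc_comm_time(d_i, m, lam_max, s, Do, Df, Ds, Da):
    """Eq. (2): TDM rounds ceil(m/lam_max) times per-flit latency,
    over ceil(d_i/s) flits, plus one routing/wavelength assignment Da."""
    if d_i == 0:
        return 0.0
    return math.ceil(m / lam_max) * math.ceil(d_i / s) * (Do + Df + Ds) + Da

def enoc_comm_time(d_i, m, g, H_i, b_i):
    """Eq. (4): recursive-doubling all-gather over log2(m) sub-steps
    (m assumed to be a power of two), plus barrier latency b_i."""
    if d_i == 0:
        return 0.0
    steps = int(math.log2(m))
    return g * H_i * sum(d_i / 2**k for k in range(1, steps + 1)) + b_i
```

The geometric sum in `enoc_comm_time` approaches d_i as the sub-step count grows, which is why ENoC's synchronization term, not the payload, dominates at high core counts (see Sect. 4.2).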


Computation Time. The computation time in each period equals the time for the corresponding cores to finish processing their computation workload for that period. We use ρi to represent the amount of computation per neuron in period i of the FP process, and σi to represent the amount of computation needed to calculate the gradient and update the weight of one connection over all training samples. When the batch size (i.e., the number of samples in one training epoch) is larger than one, ρi is the amount of computation for each neuron in period i to process all samples of the current training. According to the definition of periods, the neurons in layer i (i ∈ [1, l]) are involved in period i during the FP process, and the neurons in layer 2l − i + 1 (i ∈ [l + 1, 2l]) are involved in period i during the BP process. Therefore, the number of neurons ni in the FP process equals n2l−i+1 in the BP process. The amount of computation per core is then ρi·ni/m for the FP process (i ∈ [1, l]) and σi·n2l−i+1·(n2l−i + 1)/m for the BP process (i ∈ [l + 1, 2l]). Let τ(i) represent the computation time required by each of the m cores in period i, and assume all cores are homogeneous with the same computation capacity C. We have

    τ(i) = ρi·ni / (mC),                      i ∈ [1, l];
    τ(i) = σi·ni·(n2l−i + 1) / (mC),          i ∈ [l + 1, 2l].        (5)

Total Training Time. Having obtained the communication costs of ONoC and ENoC in Eq. (2) and Eq. (4) and the computation cost in Eq. (5), we can derive the total MLP training time on ONoC and ENoC as follows. The total training time of ONoC, denoted Tonoc, equals the sum of the ONoC communication time, the computation time, and the initialization delay in one epoch of training:

    Tonoc = Σ_{i=1}^{2l} (ε1(i) + τ(i)) + ξ,                          (6)

where ξ represents the initialization delay caused by loading input data and MLP instructions from the main memory to the cores during initialization, plus other extra main-memory accesses, software overhead, etc. Similarly, the total training time of ENoC, denoted Tenoc, can be formulated as

    Tenoc = Σ_{i=1}^{2l} (ε2(i) + τ(i)) + ξ.                          (7)
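Equations (5)–(7) can be combined into a total-time sketch. Two simplifying assumptions are made here that the paper does not: ρ and σ are treated as constants across periods (the paper indexes them by i), and period 2l is assumed, like period 1, to carry no gradient computation below the input layer — an edge case Eq. (5) leaves implicit:

```python
def tau(layers, i, m, C, rho, sigma):
    """Per-core computation time for period i, following Eq. (5).
    layers = [n_1, ..., n_l]; periods run 1 .. 2l."""
    l = len(layers)
    if i <= l:                           # FP: rho * n_i / (m C)
        return rho * layers[i - 1] / (m * C)
    if i == 2 * l:                       # assumed: no gradient below input layer
        return 0.0
    n_cur = layers[2 * l - i]            # n_{2l-i+1} (0-indexed)
    n_prev = layers[2 * l - i - 1]       # n_{2l-i}
    return sigma * n_cur * (n_prev + 1) / (m * C)

def total_time(layers, comm, m, C, rho, sigma, xi):
    """Eqs. (6)/(7): communication plus computation over periods 1..2l,
    plus initialization delay xi. `comm` maps period -> epsilon(i)."""
    l = len(layers)
    return sum(comm.get(i, 0.0) + tau(layers, i, m, C, rho, sigma)
               for i in range(1, 2 * l + 1)) + xi
```

Passing the output of an ε1-style model as `comm` yields Tonoc, and an ε2-style model yields Tenoc; the computation term τ is identical in both, which is why only the communication stage differentiates the two networks.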

3.3 Energy Model

ENoC Energy Consumption. We use PS and PL to represent the power of a switch and of a link, respectively. Let Estat be the static energy consumption of ENoC, which can be calculated as

    Estat = (Σ_{i=1}^{ns} PSi + Σ_{i=1}^{nl} PLi) × Tenoc,            (8)

where ns is the number of switches and nl is the number of links used during the MLP training. We use ESi and ELi to represent the energy per bit of the i-th switch and link, and BSi and BLi to represent the bits transmitted through the i-th switch and link. Let Edyn be the dynamic energy consumption of ENoC; then we have

    Edyn = Σ_{i=1}^{ns} (ESi × BSi) + Σ_{i=1}^{nl} (ELi × BLi).       (9)

ONoC Energy Consumption. The static energy consumption of ONoC, denoted OEstat, comprises the energy costs of micro-ring tuning, the laser, and electrical-to-optical conversion. It can be calculated as

    OEstat = (Pmt + Plaser + Poe) × Tonoc,                            (10)

where Pmt, Plaser, and Poe represent the powers of micro-ring tuning, the laser, and electrical-to-optical conversion, respectively. The dynamic energy consumption of ONoC, denoted OEdyn, is determined by the total number of optical flits that traverse the modulator, photo-detector, serializer/deserializer, and waveguide. We use Em, Ep, Es, and Ew to represent the energy per flit of the modulator, photo-detector, serializer/deserializer, and waveguide, respectively. According to [9], the dynamic energy consumption of ONoC can be calculated as

    OEdyn = (Em + Ep + Es + Ew) × Nflits,                             (11)

where Nflits is the number of flits.
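Equations (8)–(11) reduce to straightforward sums. The sketch below mirrors them with illustrative parameter lists (one power or energy-per-bit entry per switch or link); the names are ours, not the paper's:

```python
def enoc_energy(ps, pl, es, el, bs, bl, t_enoc):
    """Eqs. (8)-(9): ENoC static plus dynamic energy.
    ps/pl: per-switch and per-link power; es/el: energy per bit;
    bs/bl: bits transmitted through each switch/link; t_enoc: training time."""
    static = (sum(ps) + sum(pl)) * t_enoc
    dynamic = (sum(e * b for e, b in zip(es, bs))
               + sum(e * b for e, b in zip(el, bl)))
    return static + dynamic

def onoc_energy(p_mt, p_laser, p_oe, e_m, e_p, e_s, e_w, n_flits, t_onoc):
    """Eqs. (10)-(11): ONoC static (tuning + laser + E/O power over the
    training time) plus dynamic (per-flit energy times flit count)."""
    static = (p_mt + p_laser + p_oe) * t_onoc
    dynamic = (e_m + e_p + e_s + e_w) * n_flits
    return static + dynamic
```

The structural difference is visible in the code: ONoC's static term scales with training time regardless of traffic, while ENoC's dynamic term grows with every bit pushed through a switch or link — the basis of the crossover behavior discussed in Sect. 4.3.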

4 Comparison of MLP Training on ENoC and ONoC

4.1 Simulation Setup

Since the computation parts of ONoC and ENoC are identical, we separate the simulation into computation and communication processes. For the communication-level simulation, we build an in-house simulator for ONoC based on the cost model in Sect. 3.2, while the communication time of ENoC is measured with the Garnet standalone mode [10]. To collect computation times and communication traces, we implemented the MLP in C using the GNU Scientific Library and BLAS gemm [11] on a machine with an Intel i5 3200 CPU and 32 GB of main memory. To obtain the accurate computation time of each core, we repeat the computation workload of each core a thousand times and take the average. In this way, we make sure the computation is carried out in the


CPU caches, which matches our simulated architecture. We run the configured workloads with up to 300 threads to generate communication traces for up to 300 cores. The communication traces are fed into our ONoC and ENoC simulators to obtain the communication times of the simulated ONoC and ENoC systems. Based on the simulated results, we calculate the energy consumption of ONoC and ENoC using the energy model in Sect. 3.3, where the values of the ONoC/ENoC energy parameters are retrieved from DSENT [12].

We use three well-known MLP models [5] that process the fashion-mnist and cifar-10 datasets with high classification accuracy; the hyper-parameters of the neural networks are listed in Table 1. The parameters of the simulated architecture are shown in Table 2, and the remaining ONoC parameters are set as follows: bandwidth per wavelength 40 Gb/s, waveguide propagation loss 1.5 dB/cm, waveguide bending loss 0.005 dB/90°, splitter 0.5 dB, MR pass 0.005 dB/MR, MR drop 0.5 dB/MR, coupler 1 dB, laser efficiency 30%. These parameters are obtained from [5,9,13]. The packet size and flit size for ONoC/ENoC are set to 64 bytes and 16 bytes, respectively. Note that the size of the distributed SRAM in Table 2 is the worst-case maximum memory requirement of the NN benchmarks under batch size 32. The required distributed SRAM can be greatly reduced by adopting state-of-the-art pruning techniques for the neural network [14]. If the memory requirement of the NN exceeds the memory capacity, performance degrades because additional main-memory accesses add extra delay to the training time.

Table 1. Hyper-parameters of the neural networks

NN1: 784–1000–500–10
NN2: 784–1500–784–1000–500–10
NN3: 1024–4000–1000–4000–1000–4000–1000–4000–10

Table 2. Parameters of the simulated architecture

Core: 3.4 GHz, 6 GFLOPS (64 bit)
Private L1 (I cache / D cache): 128/128 KB
L1 latency: 1 cycle
Distributed SRAM: 42 MB
Distributed SRAM latency: 10 cycles (front end/back end)
Memory controller latency: 6 cycles
Bandwidth of main memory: 10 Gb/s
ENoC: 2D ring, 2 cycles/hop, 2 cycles/routing, 32 nm, shortest-path routing, 4-virtual-channel router
ONoC: 3D ring, 1 waveguide, 30 mm length, 64 wavelengths; time of flight & OE/EO: 1 cycle/flit; de/serialization: 2 cycles/flit, 10 Gb/s

4.2 Performance Comparison

To better show the performance comparison of ONoC and ENoC, we first compare their computation and communication times using the NN benchmarks with fixed numbers of cores (50, 100, 150, 200, 250, 300) under batch size 32. Note that the following results are obtained from one epoch of MLP training, including forward and back propagation.

Fig. 2. Performance comparison of ONoC and ENoC with different numbers of cores.

From Fig. 2, we can see that the communication time of ONoC during one training epoch stays almost steady, and its total training time keeps decreasing as the number of cores increases. In contrast, the communication time of ENoC shows an upward trend with the increasing number of cores, and the training time of ENoC (for most NNs) first decreases, reaches its minimum between 50 and 100 cores, and then keeps increasing. The reason is that the communication cost on ENoC depends on the number and locations of the communicating cores. According to Eq. (4), the communication time of ENoC mainly depends on the synchronization time and the maximum cost of sending or receiving the di messages. The barrier synchronization time of each execution period equals the per-sub-step synchronization latency multiplied by the number of sub-steps log2 m of the all-gather process. Although the data volume transferred per core shrinks as the number of cores grows, the number of sub-steps and the synchronization time increase because more cores need to exchange data with other cores. Therefore, the communication time of ENoC grows substantially with the number of cores. The communication time of ONoC, by contrast, depends on the transmitted data volume and the number of time slots, according to Eq. (1) and Eq. (2). With more cores, the data volume per core is reduced, but more time slots are needed for inter-core communication due to the limited number of wavelengths. Compared with ENoC, the communication time of ONoC occupies only a very small fraction of the total training time. On average, the MLP training time on ONoC is 70.12% lower than on ENoC. In conclusion, ONoC outperforms ENoC for MLP training across different numbers of cores, and the effect is more notable when more cores are used for training (e.g., 300 cores).

4.3 Comparison of Energy Consumption

To compare the energy consumption of ONoC and ENoC, we first examine their static and dynamic energy consumption using the three NN benchmarks with different numbers of cores (50, 100, 150, 200, 250, 300), 64 wavelengths, and batch size 32.

Fig. 3. Energy comparison of ONoC and ENoC with different numbers of cores.

Figure 3 shows the energy consumption of the three NN benchmarks with different numbers of cores under batch size 32. With an increasing number of cores, the total energy consumption of ONoC decreases while its dynamic energy increases slowly. ENoC shows a different trend: both its total and dynamic energy increase with the number of cores. We also observe that the total energy consumption of ONoC is larger than that of ENoC when the number of cores is small (e.g., 50), but becomes smaller as the number of cores increases. This is because static power dominates in ONoC and depends largely on the training time, according to Eq. (10), whereas dynamic energy dominates in ENoC and is mainly related to the communication volume. From Eqs. (8) and (10), the static energy of both ONoC and ENoC is linear in the training time; thus the static energy consumption of ONoC decreases as more cores are used in MLP training. From Eqs. (9) and (11), the dynamic energy of ENoC is dominated by the electrical components (e.g., switches and links) that flits traverse, while the dynamic energy of ONoC is related to the number of flits traversing the optical components. When more cores are used for MLP training on ENoC, communication involves more electrical components, which consume much more dynamic energy, so the total energy consumption increases. When fewer cores are used (e.g., 50), the training times of ENoC and ONoC are closer, but ONoC has larger static power, resulting in higher total energy consumption for ONoC than ENoC. On average, the energy consumption of ONoC is 48.36% lower than that of ENoC for the three NNs.


In summary, ONoC is more energy-efficient, especially when a large number of cores is used for MLP training. ENoC shows better energy efficiency than ONoC when a small number of cores is used (e.g., fewer than 50 in our simulations).

5 Conclusion

In this paper, we first compared the differences between ONoC and ENoC based on a parallel MLP training method. Next, we formulated their performance in terms of communication and computation time, and their energy consumption in terms of static and dynamic energy. We then conducted simulations to compare the performance and energy efficiency of ONoC and ENoC for MLP training. The results show that ONoC outperforms ENoC in MLP training time with a 70.12% reduction on average. Moreover, the energy consumption of ONoC is 48.36% lower than that of ENoC under batch size 32. The results also show that ENoC consumes less energy than ONoC when a small number of cores is used for MLP training. Future work will extend this study to other neural networks and other topologies.

References

1. Liu, F., Zhang, H., Chen, Y., Huang, Z., Gu, H.: Wavelength-reused hierarchical optical network on chip architecture for manycore processors. IEEE Trans. Sustain. Comput. 4(2), 231–244 (2017)
2. Yang, W., Chen, Y., Huang, Z., Zhang, H.: RWADMM: routing and wavelength assignment for distribution-based multiple multicasts in ONoC. In: 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC), pp. 550–557. IEEE (2017)
3. Yahya, M.R., Wu, N., Ali, Z.A., Khizar, Y.: Optical versus electrical: performance evaluation of network-on-chip topologies for UWASN manycore processors. Wirel. Pers. Commun. 116(2), 963–991 (2021)
4. Okada, R.: Power and performance comparison of electronic 2D-NoC and opto-electronic 2D-NoC
5. Dai, F., Chen, Y., Zhang, H., Huang, Z.: Accelerating fully connected neural network on optical network-on-chip (ONoC). arXiv preprint arXiv:2109.14878 (2021)
6. Liu, F., Zhang, H., Chen, Y., Huang, Z., Gu, H.: Dynamic ring-based multicast with wavelength reuse for optical network on chips. In: IEEE MCSoC (2016)
7. Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990)
8. Zhuang, X., Liberatore, V.: A recursion-based broadcast paradigm in wormhole routed networks. IEEE Trans. Parallel Distrib. Syst. 16(11), 1034–1052 (2005)
9. Grani, P., Bartolini, S.: Design options for optical ring interconnect in future client devices. ACM J. Emerg. Technol. Comput. Syst. (JETC) 10(4), 1–25 (2014)
10. Lowe-Power, J., Mutaal, A.: The gem5 simulator: version 20.0+: a new era for the open-source computer architecture simulator. arXiv.org (2020)


11. Kågström, B., Ling, P., Van Loan, C.: GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark. ACM Trans. Math. Softw. (TOMS) 24(3), 268–302 (1998)
12. Sun, C., et al.: DSENT - a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling. In: 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip, pp. 201–210. IEEE (2012)
13. Van Laer, A.: The effect of an optical network on-chip on the performance of chip multiprocessors. Ph.D. thesis, UCL (University College London) (2018)
14. Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149 (2015)

The Design and Implementation of Reconfigurable Quaternary Logic Processor

Hongjian Wang1, Youdong Wu1, Shan Ouyang2(B), Xunlei Chen2, Yunfu Shen2, and Yi Jin2

1 Donghua University, North Renmin Rd. 2999, Shanghai 201620, China
2 Shanghai University, Shangda Rd. 99, Shanghai 200444, China

Abstract. We propose a multi-valued processor called reconfigurable quaternary logic processor (RQLP), in which two binary bits express one quaternary (i.e., 4-valued) bit. The RQLP can be built with massive processor bits. Each processor bit has a unified structure consisting of four column operators gathered by an electric potential combiner. Each column operator is composed of a signal selector, working enabler, reconfiguration register, reconfiguration circuit, output enabler, and output generator. The unified structure of each processor bit can be reconfigured into any one of the 4^16 types of two-input quaternary logic operators. Compared with modern binary 64-bit processors, the proposed many-bit RQLP can perform far more types of logic operations in hardware, and the massive processor bits on a single RQLP can be partitioned for parallel processing. We design a general structure for the RQLP and provide the prototype circuit of its processor bit. We implement the RQLP on an FPGA and verify it with different quaternary logic operations. Our results demonstrate the effectiveness of the RQLP with respect to correctness and reconfigurability.

Keywords: Multi-valued logic · Quaternary logic operator · Reconfigurable processor · Many-bit processor · FPGA

1 Introduction

In the digital world, although binary expression and Boolean logic have become the foundation of modern computing, multi-valued (or many-valued) logic is still a very active field of study [1–3,5–9]. Multi-valued logics differ from binary logic in the fundamental fact that they do not restrict the number of truth values to two: they allow a larger set of truth degrees. For example, 4-valued (quaternary) logic allows four truth degrees, which can be represented by four symbols. For binary logic, there are only 2^(2×2) = 16 types of two-input binary logic operations. For quaternary logic, however, there are 4^(4×4) = 4,294,967,296 types of two-input quaternary logic operations in total.

© Springer Nature Switzerland AG 2022
H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 142–149, 2022. https://doi.org/10.1007/978-3-030-96772-7_14
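The operator counts above follow from a simple rule: a two-input n-valued operation assigns one of n outputs to each of the n × n possible input pairs. A one-line illustration:

```python
def num_two_input_ops(n):
    """Number of distinct two-input n-valued logic operations.
    Each of the n*n input pairs can map to any of n outputs: n**(n*n)."""
    return n ** (n * n)
```

For n = 2 this gives the familiar 16 binary gates; for n = 4 it gives 4^16 = 4,294,967,296, which is why per-operator circuits are impractical and a reconfigurable structure is needed.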


It is impossible and unnecessary to design a specific circuit for each of the many types of quaternary logic operators individually. In this paper, we propose a quaternary processor called reconfigurable quaternary logic processor (RQLP), in which two binary bits represent the four symbols of a quaternary bit. The RQLP can be built with massive processor bits. Each processor bit has a unified structure that can be reconfigured into any specific two-input quaternary logic operator: to realize a different logic function, we only have to write a different reconfiguration instruction into the processor bit's reconfiguration register. The main contributions of this paper are summarized as follows:

• We propose a structure of RQLP with massive processor bits. The unified structure of each processor bit consists of four column operators gathered by an electric potential combiner. Each column operator is composed of a signal selector, working enabler, reconfiguration register, reconfiguration circuit, output enabler, and output generator. The RQLP is able to perform all types of quaternary logic operations.

• We design a prototype circuit for the RQLP and its processor bits. Each processor bit is equipped with a reconfiguration register, and reconfiguration instructions determine the specific logic functions of the column operators. The logic function of each processor bit can be changed (reconfigured) while the RQLP is running, simply by rewriting another reconfiguration instruction into its reconfiguration register.

• We implement a 1-bit RQLP (with only one processor bit) on an FPGA device and verify the effectiveness and reconfigurability of the proposed processor structure and circuit. Based on the 1-bit RQLP, we then realize a many-bit RQLP with 1,696 processor bits.

Compared with conventional binary 64-bit processors, the proposed many-bit RQLP has three merits.
Firstly, it can perform far more (almost 4.3 billion) types of logic operations in hardware. Secondly, the massive processor bits on a single RQLP can be divided and assigned to different tasks for parallel processing, where any group of processor bits can be configured into a user-specific operator. Thirdly, the processor bits can be regrouped and reassigned, while the hardware logic function of each processor bit can be reconfigured. These merits enable new algorithms and new ways to deal with difficult problems in various fields, and many potential applications can be envisaged. For example, tasks such as quaternary logic operations, quaternary symbol transformation, and quaternary decision-making, which can only be processed slowly in software on current binary computers, can be accelerated to finish in one clock cycle on a quaternary logic operator. Currently, we are developing a novel encryption chip that utilizes the 4^16 types of quaternary logic operators to achieve one-time-pad encrypted real-time communication. Moreover, the prototype circuit for the RQLP and its processor bits can be simplified to implement a reconfigurable ternary (i.e., 3-valued) logic processor, or extended to implement a reconfigurable n-valued logic processor (where


n > 4). We hope RQLP and its construction method will provide new insight into the development of modern processors.

2 Reconfigurable Quaternary Logic Processor

There are various expression methods for n-valued logic. The most common method is the one-dimensional n-valued expression. That is, a logic value is expressed by one symbol, where the symbol has n different possible values. For example, a one-symbol set for n-valued logic expression could be {0, 1, 2, ..., n − 1}, where n = 4 in the case of quaternary logic expression. However, the values of n-valued logic may alternatively be expressed by multiple symbols, which is a mathematically equivalent information expression form. For example, two binary values can be used for quaternary logic expression, which is called the "2-binary-bit" expression in the rest of this paper. We use the 2-binary-bit set {00, 01, 10, 11} for quaternary logic expression in the design of the RQLP throughout the rest of this paper. The advantage of adopting this expression form is that it can make full use of existing binary logic devices to build n-valued logic operators in a convenient and inexpensive way. The design of the RQLP starts from a truth table for quaternary logic operation. Conventionally, the truth table for a quaternary logic operation is a 4 × 4 square table, such as the four examples shown in Table 1, where A and B are two inputs while C is the output. Each of A, B, and C is represented by the 2-binary-bit quaternary logic expression set {00, 01, 10, 11}.
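As an illustration of this encoding, the sketch below (our own Python model, not the authors' hardware) builds a two-input quaternary operator from a 4 × 4 truth table and splits each symbol into its high and low binary bits; it also checks that 16 cells with 4 choices each give 4^16 ≈ 4.3 billion distinct operators.

```python
# Sketch (our own Python model, not the authors' circuit): a two-input
# quaternary operator defined by a 4x4 truth table, using the 2-binary-bit
# encoding {00, 01, 10, 11} for the symbols {0, 1, 2, 3}.

def make_operator(table):
    """table[b][a] is the output symbol C for inputs A = a, B = b."""
    def op(a, b):
        # Split each quaternary symbol into its high and low binary bits,
        # as the hardware does with the two lines of each quaternary input.
        a1, a0 = (a >> 1) & 1, a & 1
        b1, b0 = (b >> 1) & 1, b & 1
        return table[(b1 << 1) | b0][(a1 << 1) | a0]
    return op

# Example table: C = A regardless of B (every column holds one constant).
identity_in_a = [[0, 1, 2, 3] for _ in range(4)]
op = make_operator(identity_in_a)
assert op(2, 3) == 2

# There are 4 choices per cell and 16 cells, hence 4**16 distinct operators.
assert 4 ** 16 == 4294967296  # "almost 4.3 billion"
```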

2.1 General Structure

The RQLP is designed to have massive processor bits, where each processor bit corresponds to a quaternary logic unit that can perform any of the 4^16 quaternary logic operations. Inspired by the decrease-radix design principle [11] and the reconfigurable ternary optical processor [4,10], we design a general structure called the column operator, with four different forms. Figure 1 shows a schematic diagram of the structure of an m-bit RQLP. Each processor bit includes four column operators, such as 3 and 4, which are connected by an electric potential combiner with four input terminals, such as 11. The output terminal of the kth column operator is connected to the kth input terminal of the electric potential combiner, where k ∈ {0, 1, 2, 3}. The output of the ith electric potential combiner forms the output signal of the ith processor bit, where i ∈ {0, 1, ..., m − 1}. In Fig. 1, the output of the kth column operator included in the ith processor bit is denoted as C_i^k (i ∈ {0, 1, ..., m − 1}, k ∈ {0, 1, 2, 3}). Each of the four column operators mainly includes six components, namely the output enabler 5, output generator 6, A-signal selector 7, working enabler 8, reconfiguration register 9, and reconfiguration circuit 10. The working procedure of the m-bit RQLP is based on the reconfiguration register. A reconfiguration instruction, which can be written into the reconfiguration register via line G, determines a specific function for the
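To make the data path concrete, here is a minimal behavioral model of one processor bit (our own assumption-based sketch in Python, not the authors' circuit): each of the four column operators is enabled by exactly one value of A, its reconfiguration register stores the four outputs for that column, and the electric potential combiner merges the single active column output.

```python
# Behavioral sketch of one processor bit (our assumption-based model, not
# the authors' RTL). Column operator k is enabled only when input A equals
# k; its reconfiguration register stores the outputs C(k, B) for the four
# values of B. The electric potential combiner merges the column outputs,
# of which exactly one is driven for any A.

class ProcessorBit:
    def __init__(self):
        self.rg = [[0, 0, 0, 0] for _ in range(4)]  # one register per column

    def reconfigure(self, table):
        """table[b][a] = C for inputs A = a, B = b (written via line G)."""
        for k in range(4):
            self.rg[k] = [table[b][k] for b in range(4)]

    def run(self, a, b):
        # A-signal selector: only column a is enabled; the combiner then
        # passes that single active column's output through.
        outputs = [self.rg[k][b] if k == a else 0 for k in range(4)]
        return sum(outputs)

bit = ProcessorBit()
bit.reconfigure([[3, 2, 1, 0] for _ in range(4)])  # C = 3 - A, for any B
assert bit.run(1, 2) == 2
```

Rewriting the registers at runtime corresponds to the reconfiguration-while-running property claimed for the hardware.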

The Design and Implementation of Reconﬁgurable Quaternary

145

Fig. 1. Schematic diagram of the structure of an m-bit RQLP.

corresponding column operator, that is, to implement one of the 4^16 types of quaternary logic operations or no operation. The electric potential combiner of the ith processor bit is designed to combine the output signals of all the column operators of that processor bit, and to form the final output signal of that processor bit. No matter what value the ith bit of input data A takes, it satisfies the selection requirement of exactly one of the four A-signal selectors among the four column operators. Hence, the processor bit can always complete the logic operation for any value of the ith bit of the input data A and B. (Note that the inputs A and B are m-bit quaternary data.) The many processor bits can be divided into different groups with flexible group sizes, where each group can be reconfigured into a specific quaternary logic operator with k (k ≤ m) processor bits according to the user's need, by writing corresponding reconfiguration instructions into the reconfiguration registers. After the task is finished, the many processor bits can be re-grouped and reconfigured.

2.2 Circuit of RQLP's Processor Bit

Based on the general structure, we design a circuit structure of the RQLP’s processor bit. An m-bit RQLP (Fig. 1) contains m processor bits, where each processor bit has the same structure (Fig. 2) and working principle. Here, we only give the structure of the ith (i ∈ {0, 1, ..., m − 1}) processor bit, as shown in Fig. 2. Each processor bit includes four column operators ( 13 , 14 , 15 , and 16 ) and one electric potential combiner ( 17 ). The diﬀerences among the four


column operators lie only in the A-signal selectors (20, 40, 41, and 42), which have different structures: 20 is a NOR gate whose two input terminals are respectively connected to A_i^1 (the high bit of the ith line of input data A) and A_i^0 (the low bit of the ith line of input data A); 40 is an AND gate with one inverted input terminal, where the inverted input terminal is connected to A_i^1 while the other input terminal is connected to A_i^0; 41 is also an AND gate with one inverted input terminal, where the inverted input terminal is connected to A_i^0 while the other input terminal is connected to A_i^1; 42 is an AND gate whose two input terminals are respectively connected to A_i^0 and A_i^1. The remaining parts of the four column operators are identical, and we depict them only in the first column operator in Fig. 2.
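The four selector gates together form a one-hot decoder over the 2-bit input A_i: for each of the four input values exactly one gate outputs 1, which is what lets exactly one column operator drive the combiner. A quick Python sanity check of this gate-level description (our own model):

```python
# Quick check (our own model) that the four A-signal selector gates form a
# one-hot decoder over the 2-bit input: NOR fires for A=00, the two AND
# gates with one inverted input fire for A=01 and A=10, and AND for A=11.

selectors = {
    0b00: lambda a1, a0: not (a1 or a0),   # NOR gate (component 20)
    0b01: lambda a1, a0: (not a1) and a0,  # AND, inverted A_i^1 (40)
    0b10: lambda a1, a0: (not a0) and a1,  # AND, inverted A_i^0 (41)
    0b11: lambda a1, a0: bool(a1 and a0),  # AND gate (42)
}

for a in range(4):
    a1, a0 = (a >> 1) & 1, a & 1
    fired = [k for k, gate in selectors.items() if gate(a1, a0)]
    assert fired == [a]  # exactly one column operator is enabled
```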

Fig. 2. Schematic diagram of the structure of an RQLP's processor bit. The structure is shown for one processor bit; the other processor bits are identical.

Now we explain the first column operator in detail. It includes an A-signal selector, a working enabler, a reconfiguration register, a reconfiguration circuit, an output enabler, and an output generator. The A-signal selector is implemented by a NOR gate 20. The working enabler is implemented by an AND gate 19. The reconfiguration register is implemented by a register 29 denoted as RG_i^0. (Similarly, RG_i^k denotes the reconfiguration register in the kth column operator of the ith processor bit.) The reconfiguration circuit consists of two components. One component is formed by an 8-to-1 multiplexer 23; two XOR gates 22 and 24; an AND gate 25; two AND gates with inverted


input terminals 26 and 27; and a NOR gate 28. Similarly, the other component is formed by an 8-to-1 multiplexer 32; two XOR gates 31 and 33; an AND gate 34; two AND gates with inverted input terminals 35 and 36; and a NOR gate 37. The output enabler is implemented by an AND gate 18. The output generator is implemented by two AND gates 21 and 30. The connections among the parts in the column operator are as shown in Fig. 2. For the two 8-to-1 multiplexers, D0–D7 are the eight input signals while C0–C2 are the three select lines. Each of the eight input lines of the 8-to-1 multiplexer is connected to one circuit for filtering the input data B_i signal. According to the circuit structure and the working principle of the column operators, we can work out the reconfiguration instructions for all 16 possible situations of a column operator. Then, each of the m processor bits can be reconfigured into one bit of a quaternary logic operator, so that the entire processor becomes a composite operator having various quaternary logic units.

3 Experiments

To verify the effectiveness of the proposed RQLP, we implement the circuit structure on an embedded Zynq AX7020 FPGA device. We make the column operator a module, and we connect four column operator modules according to Fig. 2 to form an RQLP processor bit. As for resource utilization, it takes 18 LUTs and 36 FFs to implement a processor bit.

Table 1. Truth tables of the four tested quaternary logic operations. Rows are indexed by B1B0, columns by A1A0; each cell is the output C1C0 ("–" marks the deleted column in Test No. 4).

Test No.1
B1B0\A1A0  00  01  10  11
00         00  01  10  11
01         00  01  10  11
10         00  01  10  11
11         00  01  10  11

Test No.2
B1B0\A1A0  00  01  10  11
00         11  00  01  10
01         10  11  00  01
10         01  10  11  00
11         00  01  10  11

Test No.3
B1B0\A1A0  00  01  10  11
00         01  10  11  10
01         10  10  00  01
10         11  00  00  11
11         10  01  11  01

Test No.4
B1B0\A1A0  00  01  10  11
00         01  10  –   00
01         10  11  –   10
10         00  01  –   10
11         00  10  –   11

We test the processor bit on four quaternary logic operations, whose truth tables are listed in Table 1. Test case No. 1 is a relatively simple truth table where each column has the same value. Test case No. 2 is a complex truth table where each column and each row contain four different values. Test case No. 3 is randomly chosen from all 4^16 quaternary logic operations. Test case No. 4 is also a randomly chosen truth table, but with the third column deleted. Based on the circuit of the RQLP's processor bit and the working principle of the column operators, we obtain the reconfiguration instructions of the four test


cases. Each tested quaternary logic operation has four 9-bit (G8–G0) reconfiguration instructions for the four column operators, forming a 36-bit reconfiguration instruction for the processor bit. In order to verify the reconfigurability of the proposed RQLP, we test the four quaternary logic operations one by one without turning off the FPGA device, so that the processor reconfiguration is done at runtime. For each test case, we first input its 36-bit reconfiguration instruction, completing the processor reconfiguration. Then, we input all 16 A–B combinations one by one and check the output results. After the first test case is finished, we continue to test the second one in the same way, without turning the FPGA device off and on again. All the observed outputs of the tested cases are consistent with the expected values in Table 1. The above experiments show that, for the proposed RQLP, 1) the processor structure and reconfiguration circuit function correctly; 2) the reconfiguration instructions are effective; and 3) the processor reconfigurability is valid. Based on the implemented processor bit, we build an RQLP with massive processor bits. We group 32 processor bits together, using a 5:32 address decoder for addressing the 36-bit reconfiguration register of each processor bit. Then, we combine 53 groups using a 6:64 address decoder to form an RQLP with 1,696 processor bits in total. As for resource utilization, this many-bit RQLP takes 31,436 LUTs and 79,938 FFs on the FPGA device.
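The reported scale is internally consistent, as this small check (our own arithmetic, not a measured result) confirms: a 5:32 decoder can address all 32 bits inside a group, a 6:64 decoder can select any of the 53 groups, and 53 × 32 gives the 1,696 processor bits.

```python
# Back-of-the-envelope check of the reported scale (our arithmetic, not a
# measured result): each group of 32 processor bits is addressed by a 5:32
# decoder, a 6:64 decoder selects among the 53 groups, and together they
# reach all 1,696 processor bits.

groups, bits_per_group = 53, 32
assert bits_per_group <= 2 ** 5   # 5 address lines cover a group
assert groups <= 2 ** 6           # 6 address lines cover the groups
assert groups * bits_per_group == 1696
```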

4 Conclusions and Future Work

In this paper, we have proposed a general structure for the RQLP. We have instantiated the general structure and designed a prototype circuit for the processor bit. We have proposed using reconfiguration instructions to determine specific logic functions for the column operators, that is, to implement one of the 4^16 types of quaternary logic operations. We have implemented a 1-bit RQLP using an FPGA and tested it with four carefully selected examples of quaternary logic operations. Based on the 1-bit RQLP, we have also implemented a many-bit RQLP with 1,696 processor bits. Experimental results have verified the effectiveness and reconfigurability of the RQLP structure and circuit. As initial work on the RQLP, we currently use an FPGA to verify the correctness and functionality of our circuit design. In the future, we will gradually work on timing and speed evaluation, ISA design, a programming model, etc. We will also experimentally compare the RQLP with other architectures, such as normal CPU or GPU implementations, for executing specific applications or benchmarks. The ultimate goal is an ASIC chip of a multi-valued processor with reconfigurability. On the one hand, we should study how to make the reconfigurable multi-valued processor cooperate seamlessly with current CPUs and GPUs. On the other hand, we need to find more interesting applications that can take full advantage of this new class of processors.


Acknowledgements. The work was supported by the "Fundamental Research Funds for the Central Universities" from Donghua University under grant no. 2232020D36, the "Shanghai Pujiang Program" from the Shanghai Municipal Human Resources and Social Security Bureau under grant no. 21PJD001, and the "Young Teacher Research Startup Fund" from Donghua University under grant no. 112-07-0053079.

References

1. Bhattacharjee, D., Kim, W., Chattopadhyay, A., Waser, R., Rana, V.: Multi-valued and fuzzy logic realization using TaOx memristive devices. Sci. Rep. 8(1), 1–10 (2018)
2. Bykovsky, A.Y.: A multiple-valued logic for implementing a random oracle and the position-based cryptography. J. Russ. Laser Res. 40(2), 173–183 (2019)
3. Homma, N., Saito, K., Aoki, T.: Formal design of multiple-valued arithmetic algorithms over Galois fields and its application to cryptographic processor. In: 2012 IEEE 42nd International Symposium on Multiple-Valued Logic, pp. 110–115. IEEE (2012)
4. Jin, Y., Wang, H., Ouyang, S., Zhou, Y., Shen, Y., Peng, J., Liu, X.: Principles, structures, and implementation of reconfigurable ternary optical processors. Sci. China Inf. Sci. 54(11), 2236–2246 (2011)
5. Kazakova, N., Sokolov, A.: Spectral and nonlinear properties of the complete quaternary code. In: CPITS, pp. 76–86 (2020)
6. Novák, V.: A formal theory of intermediate quantifiers. Fuzzy Sets Syst. 159(10), 1229–1246 (2008)
7. Roy, J.N., Chattopadhyay, T.: All-optical quaternary logic based information processing: challenges and opportunities. In: Design and Architectures for Digital Signal Processing, pp. 81–109. InTech (2013)
8. Stoilos, G., Stamou, G., Pan, J.Z., Tzouvaras, V., Horrocks, I.: Reasoning with very expressive fuzzy description logics. J. Artif. Intell. Res. 30, 273–320 (2007)
9. Straccia, U.: Reasoning within fuzzy description logics. J. Artif. Intell. Res. 14, 137–166 (2001)
10. Wang, H., Song, K.: Simulative method for the optical processor reconfiguration on a dynamically reconfigurable optical platform. Appl. Opt. 51(2), 167–175 (2012)
11. Yan, J., Jin, Y., Zuo, K.: Decrease-radix design principle for carrying/borrowing free multi-valued and application in ternary optical computer. Sci. China Ser. F Inf. Sci. 51(10), 1415–1426 (2008)

A 3D Dubins Curve Constructing Method Based on Particle Swarm Optimization

Cheng Ji, Chu Wang, Mingyan Song, and Fengmin Wang(B)

Beijing Jinghang Research Institute of Computing and Communication, Beijing 100074, People's Republic of China
casic [emailprotected]

Abstract. The navigation error of an aircraft increases during a task. The aircraft has to correct the navigation error under structural constraints to avoid path deviation caused by the navigation error. Aircraft path planning with navigation correction under the turning radius constraint is a challenge for traditional path planning methods. In this paper, we propose a 3D Dubins curve constructing method which can draw a smooth path in 3D space for the aircraft. We then extend the Dynamic Programming for Navigation Error Correction method with 3D Dubins curves to obtain a feasible path under the turning radius constraint, and improve the particle swarm optimization method to compute an almost optimal Dubins curve. Finally, our algorithm returns a feasible smooth path of approximately optimal length for the path planning problem with navigation correction under the turning radius constraint.

Keywords: Path planning · Dubins curve · Particle swarm optimization

1 Introduction

Aircraft path planning is a multi-objective optimization problem involving collision avoidance, complicated landforms, structural constraints, risk avoidance, and so on. Aircraft path planning is always based on many factors, such as the turning radius and climb ability of the aircraft. It has been widely applied in many tasks, such as topographic survey [1], war information reconnaissance [2], electronic interference [3], material placement [4], and so on. The solution of aircraft path planning can be divided into two parts: the first is to choose the regions of the feasible path; the second is to calculate the aircraft trajectory planning. The regions of the feasible path have already been chosen in [5], but the aircraft trajectory planning problem remains to be solved. For an aircraft trajectory planning strategy, the turning radius is an important factor. Many methods aim to compute a feasible aircraft path under the constraint of turning radius, such as the Dubins curve [6] and the Clothoid curve [7].

Supported by NNSF of China under Grant No. 11901544.
© Springer Nature Switzerland AG 2022
H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 150–160, 2022. https://doi.org/10.1007/978-3-030-96772-7_15

A 3D Dubins Curve Constructing Method

151

[8] extended continuous Bezier curves along the Z-axis to construct a 3D Dubins curve, which is derived from the two-dimensional Dubins curve. In this paper, we propose a novel method to compute a feasible path under the constraint of turning radius in 3D space. We select feasible navigation error regions based on DyProg [5]. We parameterize a feasible path based on the Dubins curve under the conditions of smoothness and coplanarity. We optimize the parameters by PSO to obtain the shortest feasible path. This paper is organized as follows: In Sect. 2, we state the problem of path planning with navigation error correction formally. In Sect. 3, we show how the turning radius constraint is incorporated into the problem of path planning with navigation error correction, and then we propose the 3D Dubins curve to describe the aircraft path with turning radius and improve PSO to calculate the length of this path. In Sect. 4, we show and analyze the experimental results of the proposed methods on simulated data. In Sect. 5, we conclude our work and discuss future prospects.

2 Problem Formulation

Let A and B be the departure point and destination, respectively. For a path p from A to B, we continue to use the error correction restriction of [5] as the condition on path p, and define {p_i}_{i=1}^{n} as the collection of error correction regions on p. We add a new restriction that the turning radius R is no smaller than r. We consider the problem of how to compute a feasible path such that both the number of error correction regions and the length of the path are minimized.

3 Proposed Methods for Path Planning with Dubins Curve

The feasible path p has been calculated in [5]. In this section, we show the method to compute the feasible path by the 3D Dubins curve. In Subsection 3.1, we propose a three-dimensional Dubins curve to solve the problem of path planning with turning radius. In Subsection 3.2, we extend Dynamic Programming for Navigation Error Correction (DyProg) [5] with the 3D Dubins curve to calculate the feasible path. In Subsection 3.3, we improve PSO to compute a feasible path with almost minimal length. As the calculation method is similar, we regard A as p_0 and B as p_{n+1} in the calculation process.

3.1 Smooth Path with Dubins Curve

In this section, we propose a novel method for computing a smooth path from A to B which goes through the given error correction regions {p_i}_{i=1}^{n}. We construct a two-dimensional Dubins curve for each pair of adjacent error correction regions. Then, we construct a three-dimensional Dubins curve by splicing the multiple two-dimensional Dubins curves together smoothly.

152

C. Ji et al.

To construct the two-dimensional Dubins curves, we need to map the three-dimensional parameters to corresponding two-dimensional parameters, and we need to maintain a linear change in velocity to construct a smooth path. The two-dimensional Dubins curve is smooth, so we need to keep the velocity at each error correction region consistent. Define the velocity at p_i as v_i, and similarly v_{i+1} at p_{i+1}. The Dubins curve runs from p_i to p_{i+1}, and v_i, v_{i+1} both lie in the plane γ:

    p_i p_{i+1} · (v_i × v_{i+1}) = 0    (1)

We convert the 3D space to a two-dimensional plane with Gram–Schmidt orthogonalization to calculate the Dubins curve. For each segment from p_i to p_{i+1}, we create a two-dimensional coordinate system with p_i as the origin. The basis vectors x and y are:

    x = p_i p_{i+1} / |p_i p_{i+1}|
    y' = v_i − (v_i · x) x    (2)
    y = y' / |y'|

We transform a point Q = (Q_X, Q_Y) in the two-dimensional plane into the point Q' in three-dimensional space by the following formula:

    Q' = p_i + Q_X · x + Q_Y · y    (3)

In the two-dimensional plane, p_i is (0, 0) and p_{i+1} is (|p_i p_{i+1}|, 0). We define the incidence angle of the Dubins curve as η and the exit angle as λ. We calculate η and |λ| by the following formulas:

    η = arccos( (v_i · p_i p_{i+1}) / (|v_i| |p_i p_{i+1}|) )
    |λ| = arccos( (v_{i+1} · p_i p_{i+1}) / (|v_{i+1}| |p_i p_{i+1}|) )    (4)
    λ = |λ| if v_{i+1} · y ≥ 0,  λ = −|λ| if v_{i+1} · y < 0

with η ∈ (0, π) and λ ∈ (−π, π). The two-dimensional Dubins curve can then be used to obtain an aircraft path that satisfies the constraints of η, λ, and |p_i p_{i+1}|.
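The frame construction of Eq. (2) and the angle definitions of Eq. (4) can be sketched as follows (a minimal Python illustration with our own helper names; the test vectors are made up):

```python
import math

# Illustrative computation (our own helper names) of the 2D frame from
# Eq. (2) via Gram-Schmidt, and the incidence/exit angles eta and lambda
# from Eq. (4), including the sign rule on lambda.

def sub(a, b):   return tuple(x - y for x, y in zip(a, b))
def dot(a, b):   return sum(x * y for x, y in zip(a, b))
def norm(a):     return math.sqrt(dot(a, a))
def scale(a, s): return tuple(x * s for x in a)

def frame_and_angles(p_i, p_j, v_i, v_j):
    d = sub(p_j, p_i)
    x = scale(d, 1.0 / norm(d))               # x along p_i p_{i+1}
    y = sub(v_i, scale(x, dot(v_i, x)))       # Gram-Schmidt step of Eq. (2)
    y = scale(y, 1.0 / norm(y))
    eta = math.acos(dot(v_i, d) / (norm(v_i) * norm(d)))
    lam = math.acos(dot(v_j, d) / (norm(v_j) * norm(d)))
    if dot(v_j, y) < 0:                       # sign rule of Eq. (4)
        lam = -lam
    return x, y, eta, lam

x, y, eta, lam = frame_and_angles((0, 0, 0), (4, 0, 0), (1, 1, 0), (1, -1, 0))
assert abs(eta - math.pi / 4) < 1e-12 and abs(lam + math.pi / 4) < 1e-12
```

A planar point (Q_X, Q_Y) of the resulting Dubins curve is then mapped back to 3D with Eq. (3) as p_i + Q_X · x + Q_Y · y.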


Fig. 1. 3D Dubins curve

In Fig. 1, we show a 3D Dubins curve. We set three error correction regions and give the direction of the velocity at each error correction region, and then obtain a smooth 3D curve through the Dubins curve.

3.2 Dynamic Programming Algorithm with Turning Radius Constraint

We consider improving DyProg to calculate a feasible aircraft path with turning radius. The aircraft cannot change the direction of its velocity immediately, so we add a turning radius constraint to this discrete constrained optimization problem. Since the path from p_i to p_j is determined by the directions of v_i and v_j, we consider the upper bound of the length of the path from p_i to p_j to calculate the navigation error, instead of the Euclidean distance from p_i to p_j used in DyProg. In our case, the distance between two error correction regions is far shorter than the turning radius. Define the Dubins set D = {LSL, RSR, RSL, LSR}, and the shortest path in D as the shortest Dubins curve. Redefine the distance between p_i and p_j as d_ij = |p_i p_j| / R, with η and λ as above. The upper bound Dubins(max) of the Dubins curve length is as follows [9]:

    Dubins(max) = min(D) ≤ min(RSR, LSL) = |η − λ| mod 2π + P ≤ 2π + P    (5)


Here P for the RSR case satisfies:

    P_RSR = √(2 + d² − 2cos(η − λ) + 2d(sin λ − sin η))
          ≤ √(2 + d² − 2 · (−1) + 2d(1 + 1))
          = d + 2    (6)

From (5) and (6), we get Dubins(max) ≤ 2π + d + 2. Define the length of the path from p_i to p_j as len_ij:

    len_ij = (2π + 2) · R + |p_i p_j|    (7)
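Equation (7) is simply R · (2π + d + 2) with d = |p_i p_j| / R. A short sketch of this bound (our own function name; the points and R are made up):

```python
import math

# Sketch of the conservative length bound of Eq. (7), used inside DyProg2
# in place of the Euclidean distance; the points and R here are made up.

def len_bound(p_i, p_j, R):
    dist = math.dist(p_i, p_j)                # |p_i p_j|
    # R * Dubins(max) = R * (2*pi + d + 2) with d = dist / R
    return (2 * math.pi + 2) * R + dist

bound = len_bound((0, 0, 0), (3, 4, 0), 200)
assert abs(bound - ((2 * math.pi + 2) * 200 + 5)) < 1e-9
assert bound >= math.dist((0, 0, 0), (3, 4, 0))  # never below the straight line
```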

We can get the feasible error correction regions by DyProg, but we still need to consider how to calculate the length of the specific aircraft path.

3.3 Improved Particle Swarm Optimization

We describe the velocity v_i at each error correction region p_i by (μ_i, ψ_i) and compute a smooth path from A to B. We calculate the aircraft path with minimal length by optimizing {(μ_i, ψ_i)}_{i=0}^{n+2} with PSO [10]. Given the direction of the aircraft's velocity at each region, we can describe the aircraft path between two error correction regions by a Dubins curve and thus obtain a complete aircraft path. The aircraft path changes with the direction of the velocity at each region. We optimize the aircraft path by PSO to get a set of velocity directions that makes the aircraft path as short as possible. Then our problem becomes the following optimization problem [11]:

    min Σ dubins(p_i, p_{i+1})    (8)

A two-dimensional Dubins curve describing the aircraft path requires that the paths between every two error correction regions are coplanar, so we add Constraint (1) to Problem (8):

    min Σ dubins(p_i, p_{i+1})
    s.t. p_i p_{i+1} · (v_i × v_{i+1}) = 0,  i = 0, 1, 2, ..., n + 2

Then we turn Optimization Problem (8) into:

    min Σ ( dubins(p_i, p_{i+1}) + | p_i p_{i+1} · (v_i × v_{i+1}) | )    (9)

The problem becomes Optimization Problem (9). Considering that the direction of the aircraft's velocity at each error correction region points toward the centre of the ball which contains all error correction regions, we can express the direction of v_i in polar coordinates: v_i is defined by (μ_i, ψ_i) as follows:

    x_i = cos μ_i sin ψ_i
    y_i = sin μ_i sin ψ_i
    z_i = cos ψ_i


p_i p_{i+1} can be written componentwise as:

    p_i p_{i+1} = (p_i p_{i+1}[x], p_i p_{i+1}[y], p_i p_{i+1}[z])

We adjust {(μ_i, ψ_i)}_{i=0}^{n+2} to calculate the aircraft path with minimal length, so we optimize 2n + 4 parameters by PSO. Optimization Problem (9) is difficult to solve while accurately guaranteeing Constraint (1). If coplanarity cannot be guaranteed, we cannot characterize the 3D aircraft path. We propose the improved PSO to solve the problem of non-coplanarity. For each Dubins curve, the coplanarity condition is Constraint (1). v_i and v_{i+1} each contain two values. In the ith curve, if we have μ_i, ψ_i, and μ_{i+1}, then we calculate ψ_{i+1} from Constraint (1):

    | p_i p_{i+1}[x]            p_i p_{i+1}[y]            p_i p_{i+1}[z]  |
    | cos μ_i sin ψ_i           sin μ_i sin ψ_i           cos ψ_i         |  = 0    (10)
    | cos μ_{i+1} sin ψ_{i+1}   sin μ_{i+1} sin ψ_{i+1}   cos ψ_{i+1}     |

For simplicity, define equ1, ..., equ6 as follows:

    equ1 = p_i p_{i+1}[x] · sin μ_i sin ψ_i cos ψ_{i+1}
    equ2 = p_i p_{i+1}[y] · cos ψ_i cos μ_{i+1} sin ψ_{i+1}
    equ3 = p_i p_{i+1}[z] · cos μ_i sin ψ_i sin μ_{i+1} sin ψ_{i+1}
    equ4 = p_i p_{i+1}[x] · cos ψ_i sin μ_{i+1} sin ψ_{i+1}
    equ5 = p_i p_{i+1}[y] · cos μ_i sin ψ_i cos ψ_{i+1}
    equ6 = p_i p_{i+1}[z] · sin μ_i sin ψ_i cos μ_{i+1} sin ψ_{i+1}

Constraint (10) is then equivalent to:

    equ1 + equ2 + equ3 − equ4 − equ5 − equ6 = 0

For simplicity, define equ7, equ8, equ9 as follows:

    equ7 = cos μ_i sin ψ_i · p_i p_{i+1}[y] − sin μ_i sin ψ_i · p_i p_{i+1}[x]
    equ8 = cos μ_i sin ψ_i · p_i p_{i+1}[z] sin μ_{i+1} − sin μ_i sin ψ_i · p_i p_{i+1}[z] cos μ_{i+1}
    equ9 = cos ψ_i · p_i p_{i+1}[x] sin μ_{i+1} − cos ψ_i · p_i p_{i+1}[y] cos μ_{i+1}

Then we can represent ψ_{i+1} with equ7, equ8, and equ9 as follows:

    ψ_{i+1} = arctan( equ7 / (equ8 − equ9) )    (11)

For the first curve, we assume μ_1, μ_2, and ψ_1, and then ψ_2 can be derived. For the ith curve, we assume μ_i and μ_{i+1}; ψ_i of region p_i is obtained from the (i − 1)th curve, and ψ_{i+1} can then be derived. Our optimization problem then becomes Optimization Problem (9). The PSO sets two values at p_0 and one value at each of the other error correction regions, so we optimize n + 3 parameters and calculate the other n + 1 parameters by (11).
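Formula (11) can be checked numerically: grouping the cos ψ_{i+1} and sin ψ_{i+1} terms of Constraint (10) gives −equ7 · cos ψ_{i+1} + (equ8 − equ9) · sin ψ_{i+1} = 0, so the arctan choice makes the triple product vanish. The sketch below is our own verification code (the vector P and the angles are arbitrary test values), not the authors' implementation:

```python
import math

# Numerical check (our own sketch) of the coplanarity-derived Formula (11):
# given mu_i, psi_i, and mu_{i+1}, pick psi_{i+1} so that p_i p_{i+1}, v_i,
# and v_{i+1} are coplanar. The names equ7/equ8/equ9 follow the paper; the
# test vector P and the angles are arbitrary.

def sph(mu, psi):
    return (math.cos(mu) * math.sin(psi),
            math.sin(mu) * math.sin(psi),
            math.cos(psi))

def next_psi(P, mu_i, psi_i, mu_j):
    Px, Py, Pz = P
    s, c = math.sin(psi_i), math.cos(psi_i)
    equ7 = math.cos(mu_i) * s * Py - math.sin(mu_i) * s * Px
    equ8 = Pz * s * (math.cos(mu_i) * math.sin(mu_j)
                     - math.sin(mu_i) * math.cos(mu_j))
    equ9 = c * (Px * math.sin(mu_j) - Py * math.cos(mu_j))
    return math.atan2(equ7, equ8 - equ9)  # arctan with quadrant handling

P = (1.0, 2.0, 0.5)                       # p_i p_{i+1}
mu_i, psi_i, mu_j = 0.3, 1.1, -0.7
psi_j = next_psi(P, mu_i, psi_i, mu_j)
vi, vj = sph(mu_i, psi_i), sph(mu_j, psi_j)
cross = (vi[1] * vj[2] - vi[2] * vj[1],
         vi[2] * vj[0] - vi[0] * vj[2],
         vi[0] * vj[1] - vi[1] * vj[0])
assert abs(sum(p * c for p, c in zip(P, cross))) < 1e-9  # coplanar
```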


Algorithm 1. Improved Particle Swarm Optimization for Track Planning

Inputs: {p_i}_{i=0}^{n−1}, a feasible path p
Outputs: a 3D aircraft path op
Initialization: set the PSO parameters, the iteration count N, the particles' initial positions and velocities, and the fitness of the initial positions.

Step 1: Update and iterate particles:
  while iteration number < N do
    update particle speed and position
    calculate the particle fitness fitness(x)
    update the particle swarm's individual optimal values and the population optimal value
  end while
  go to Step 2.

Step 2: Calculate the 3D Dubins curve:
  for i = 0 to length(p) do
    T, Q = dubins(p_i, p_{i+1})
    establish a two-dimensional coordinate system:
      P_i = (0, 0), P_{i+1} = (dis, 0)
      V_i = (cos η_i, sin λ_i), V_{i+1} = (cos η_{i+1}, sin λ_{i+1})
      radius directions of the Dubins curve: R_i = (sin η_i, −cos λ_i), R_{i+1} = (sin η_{i+1}, −cos λ_{i+1})
    Label1 = Label2 = 1
    if the incident Dubins arc turns left then Label1 = −1 end if
    if the exit Dubins arc turns left then Label2 = −1 end if
    circle centers of the Dubins curve: r_i = P_i + Label1 · R_i · R, r_{i+1} = P_{i+1} + Label2 · R_{i+1} · R
    incident Dubins arc: arc_i = r_i + R · (R_i · cos τ + V_i · sin τ), τ ∈ (π, π − T)
    3D incident Dubins arc: Arc_i = p_i + arc_i[0] · x_axis + arc_i[1] · y_axis
    exit Dubins arc: arc_{i+1} = r_{i+1} + R · (R_{i+1} · cos τ + V_{i+1} · sin τ), τ ∈ (π, π + Q)
    3D exit Dubins arc: Arc_{i+1} = p_{i+1} + arc_{i+1}[0] · x_axis + arc_{i+1}[1] · y_axis
    a1 = Arc_i at τ = π − T; a2 = Arc_{i+1} at τ = π + Q
    Str = the straight segment a1 a2
    op = op ∪ {Arc_i, Arc_{i+1}, Str}
  end for
  return op

fitness(x):
  for i = 0 to length(p) do
    v_i[x] = cos μ_i sin ψ_i; v_i[y] = sin μ_i sin ψ_i; v_i[z] = cos ψ_i
    calculate ψ_{i+1} by Formula (11)
    Fitness += dubins(p_i, p_{i+1})
  end for
  return Fitness

dubins(p_i, p_{i+1}):
  η = arccos( (v_i · p_i p_{i+1}) / (|v_i| |p_i p_{i+1}|) )
  take v_i / |v_i| as the positive x_axis direction
  y_axis = v_i − (v_i · x_axis) x_axis, normalized, as the positive y_axis direction
  |λ| = arccos( (v_{i+1} · p_i p_{i+1}) / (|v_{i+1}| |p_i p_{i+1}|) )
  if v_{i+1} · y_axis < 0 then λ = −|λ| end if
  d_{i,i+1} = |p_i p_{i+1}| / R
  choose the case of the Dubins curve from η, λ, and dis by the classification scheme
  calculate the Dubins curve with η, λ, dis, and the chosen case

4 Experiment

In this section, we show the experimental results of our methods on simulated data. In our experiment, we show that Algorithm 1 processes the feasible path into a 3D Dubins curve. More details are given later in this section. First, let us describe the experimental setup.

4.1 Experimental Setup

The datasets of our experiments are simulated, and there exists at least one feasible path for each of them. All other error correction regions are generated randomly. In our experiment, we choose two groups of simulated parameters: Parameter I and Parameter II. The details of Parameter I are as follows:

    α1 = 25, α2 = 15, β1 = 20, β2 = 25, δ = 0.001, θ = 30, r = 200

The details of Parameter II are as follows:

    α1 = 20, α2 = 10, β1 = 15, β2 = 20, δ = 0.001, θ = 20, r = 200


From the results of simulations, there are more feasible paths for Parameter I than for Parameter II if the number of error correction regions is similar. In our experiment, we simulate datasets 1 to 8, whose parameters are shown in Table 1:

Table 1. Parameters of simulated data 1 to 8

No.  Parameter  Number of error correction regions
1    I          200–400
2    I          400–600
3    I          600–800
4    I          800–1000
5    II         200–400
6    II         400–600
7    II         600–800
8    II         800–1000

In our experiment, our proposed methods are as follows:

– DyProg2: DyProg with the turning radius constraint.
– IPSO: the 3D Dubins curve calculation shown in Algorithm 1.

The performance of feasible paths is measured by path length and running time (RT) in the same hardware environment.

4.2 Experimental Results on Simulated Data

In this section, we calculate the 3D Dubins curve by our methods DyProg2 and IPSO. In DyProg2, we use Formula (7) instead of the Euclidean distance and then run DyProg, which gives the error correction regions of a feasible path. We then use these error correction regions to calculate the optimized curve in IPSO. We record the path length before optimization (PLBO, unit: km), path length after optimization (PLAO, unit: km), straight path length (SPL, unit: km), optimization rate (OR), RT of DyProg2 (RTOD, unit: s), and RT of IPSO (RTOP, unit: s) in Table 2. The results of the aircraft trajectory planning experiments are shown in Table 2. In Experiment 5, we cannot find a feasible path for the given distribution of error correction regions. In the other experiments, DyProg2 and IPSO perform well. The optimized length of the aircraft path is far less than that before optimization. We subtract the SPL from the curved path lengths (PLBO and PLAO) and calculate the optimization rate. The optimization rate for Parameter I is up to 85%, and that for Parameter II is higher than 60%. The RT of DyProg2 is less than 10 s. The RT of PSO is somewhat longer, but no more than 2 min.
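The optimization rate in Table 2 appears consistent with OR = (PLBO − PLAO) / (PLBO − SPL), i.e. the fraction of the excess length over the straight path that optimization removes; this is our reading, not a formula stated by the authors. A quick numerical check against two table rows (tolerating the rounding of the displayed values):

```python
# Hypothetical definition of the optimization rate (our inference, not a
# formula given in the paper): fraction of the excess path length over the
# straight path length that is eliminated by IPSO.

def opt_rate(plbo, plao, spl):
    return (plbo - plao) / (plbo - spl)

# Rows 1 and 6 of Table 2; the reported ORs are 86.67% and 75.04%.
assert abs(opt_rate(122.17, 120.64, 120.41) - 0.8667) < 0.02
assert abs(opt_rate(118.90, 116.22, 115.32) - 0.7504) < 0.02
```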


Table 2. Experimental results on Parameters I and II (PLBO, PLAO, SPL in km; RTOD, RTOP in s)

No.  PLBO    PLAO    SPL     OR      RTOD   RTOP
1    122.17  120.64  120.41  86.67%  0.688  73.203
2    113.49  111.81  111.77  96.61%  2.281  57.719
3    113.32  111.19  111.03  93.01%  6.047  58.719
4    109.27  107.84  107.67  89.48%  8.671  54.234
5    –       –       –       –       –      –
6    118.90  116.22  115.32  75.04%  2.484  84.969
7    129.03  125.94  125.27  82.18%  5.531  90.797
8    113.81  111.99  110.89  62.30%  9.828  85.188

5 Conclusion

In this paper, we consider the problem of aircraft path planning under the constraint of turning radius. For this problem, we propose a 3D Dubins curve and use this curve to accurately plan the aircraft path. We apply PSO to optimize the aircraft path length and obtain an almost optimal aircraft path. In the future, we will try different curves to make the feasible paths more suitable for aircraft, and use parallel and distributed computing methods to increase the speed of aircraft path planning.

References

1. Stentz, A.: Optimal and efficient path planning for partially-known environments. In: Proceedings of the 1994 IEEE International Conference on Robotics and Automation. IEEE (1994)
2. Liu, Yu.Z., Wei, X., et al.: An amphibious vehicle modeling and maneuvering path planning method suitable for military topographic maps. Wuhan Univ. J. Natural Sci. 25(134(06)), 67–75 (2020)
3. Wang, Y., Wang, S., Tan, M.: Path generation of autonomous approach to a moving ship for unmanned vehicles. IEEE Trans. Industr. Electron. 62(9), 5619–5629 (2015)
4. Yongwei, L.I., Hongfei, W.: Fuzzy adaptive PID control for six rotor eppo UAV. J. Hebei Univ. Sci. Technol. 38, 59–65 (2017)
5. Song, M., Ji, C., Wang, C., et al.: A novel dynamic programming based method for path planning with navigation error correction. In: 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC). IEEE (2020)
6. Dubins, L.E.: On curves of minimal length with a constraint on average curvature and with prescribed initial and terminal positions and tangents. Am. J. Math. 79(3), 497–516 (1957)
7. Wensen, L.I., Guan, S., Zheng, L., et al.: A smoothing method for tool path with G2 continuity based on clothoid curves. J. Xi'an Polytechnic Univ. (2019)


C. Ji et al.

8. Cai, W., Zhang, M.: Smooth 3D Dubins curves based mobile data gathering in sparse underwater sensor networks. Sensors 18(7), 2105 (2018)
9. Shkel, A.M., Lumelsky, V.: Classification of the Dubins set. Robot. Auton. Syst. 34(4), 179–202 (2001)
10. Innocente, M.S., Sienz, J.: A study of the fundamental parameters of particle swarm optimizers (2021)
11. Gao, H., Li, Y., Zhang, H.: The analysis of alternating minimization method for double sparsity constrained optimization problem. Asia Pacific J. Oper. Res. 37(4), 2040002 (2020)

Software Systems and Technologies

Towards Conflict-Aware Workload Co-execution on SX-Aurora TSUBASA

Riku Nunokawa¹, Yoichi Shimomura², Mulya Agung³, Ryusuke Egawa²,⁴, and Hiroyuki Takizawa¹,²(B)

¹ Graduate School of Information Sciences, Tohoku University, Sendai, Japan
² Cyberscience Center, Tohoku University, Sendai, Japan {shimomura32,takizawa}@tohoku.ac.jp
³ Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
⁴ School of Engineering, Tokyo Denki University, Tokyo, Japan

Abstract. NEC SX-Aurora TSUBASA is the latest vector supercomputer, consisting of host processors called Vector Hosts (VHs) and vector processors called Vector Engines (VEs). The final goal of this work is to simultaneously use both VHs and VEs to increase the resource utilization and improve the system throughput by co-executing more workloads. However, performance interference among VH and VE workloads could occur because they share some computing resources and potentially compete to use the same resource at the same time, causing so-called resource conflicts. As the first step toward efficient workload co-execution, this paper experimentally investigates the performance interference between a VH and a VE when each of the two processors executes a different workload. Our evaluation results clearly demonstrate that some characteristics of a workload, such as its system call frequency, can be used as a good indicator to predict whether the workload can affect the performance of another co-executing workload. We believe that this will be helpful to identify a pair of workloads causing frequent resource conflicts, and thus reduce the risk of performance interference between co-executing workloads on an SX-AT system.

Keywords: Workload colocation · SX-Aurora TSUBASA · Performance interference

1 Introduction

© Springer Nature Switzerland AG 2022. H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 163–174, 2022. https://doi.org/10.1007/978-3-030-96772-7_16

Recently, high-performance computing systems often adopt heterogeneous system architectures equipped with different kinds of processors. NEC SX-Aurora TSUBASA (SX-AT) is one such heterogeneous computing system; it consists of x86 processors and vector processors, called Vector Hosts (VHs) and Vector Engines (VEs), respectively [20]. A VE is physically implemented as a PCI Express card, which is similar to an accelerator such as a graphics processing


R. Nunokawa et al.

Fig. 1. The hardware conﬁguration of a VI.

unit (GPU). On the other hand, a VH is responsible for executing the operating system (OS) and managing VEs. In one compute node, each VH could manage multiple VEs. Such a node of VHs and VEs is called a Vector Island (VI). The hardware configuration of one VI is illustrated in Fig. 1. Since a VE has a high memory bandwidth of 1.53 TB/s, VEs are expected to achieve high sustained performance in executing memory-intensive scientific computations while using the standard x86 environment provided by the VH [4,9].

Unlike other accelerators such as GPUs, a VE can execute an application as if the whole application were running on the VE. However, when the application running on a VE invokes a system call, the system call is implicitly forwarded to the VH and processed by the OS running on the VH. In addition to the VH's CPU time for handling system calls, some other computing resources, such as the VI's network bandwidth, are shared by the VEs. Thus, on a large SX-AT system shared by many users, each VI is exclusively assigned to a job so as to avoid performance interference among jobs, which could occur by sharing VHs. For example, in the AOBA system installed at Tohoku University Cyberscience Center [17], multiple jobs do not usually share one VI, and one VI might co-execute multiple jobs only if each of the jobs uses only a single VE in the VI.

Under such an operation policy, a job does not necessarily use all VEs in the assigned VIs, and some VEs are thus unused during the job execution. Therefore, if multiple jobs are assigned to one VI so that more VHs and VEs are used for the execution, it is possible to increase the utilization of computing resources. However, multiple jobs running on a VI may simultaneously require the same computing resource. This is a so-called resource conflict, and could cause severe performance degradation.
For this reason, understanding the performance interference between multiple jobs running on a VI is an important technical issue for achieving high efficiency on SX-AT systems. This paper first empirically investigates the performance interference between a VH and a VE when each of the two processors executes a different workload. Then, we discuss workload co-execution that reduces the performance interference due to resource conflicts, in order to improve the resource utilization.


Evaluation results demonstrate that some characteristics of workloads such as system call frequency can be used as a good indicator to identify a pair of workloads causing frequent resource conﬂicts, and thus reduce the risk of performance degradation while improving the resource utilization.

2 Resource Conflicts on an SX-Aurora TSUBASA System

This section briefly reviews the resource conflicts among VH and VE workloads co-executing in one VI. When an application is running on a standard x86 Linux system, the application has a user memory space, which is logically separate from the kernel memory space used by the OS. On the other hand, when a user memory space is assigned to an application running on a VE, unlike on a standard system, the user memory space physically resides on the memory devices attached to the VE. However, even on an SX-AT system, the kernel memory space is located in the VH memory. Namely, when an application is running on a VE, its user memory space is not only logically but also physically isolated from the kernel memory space. System calls on the VE are forwarded to a dedicated process running on the VH, called a VEOS pseudo process, which actually invokes the corresponding system calls on the VH to call the OS kernel. Accordingly, when an application is running on a VE, it internally uses a VH within the VI.

Moreover, if a VH and a VE within one VI each execute a different application, both the VH and VE workloads share the VH and have their own memory spaces, which are logically and physically isolated from each other. In this case, if the VE workload invokes a system call, the system call request is forwarded to the VEOS process, and the VH workload would be context-switched to the VEOS process so that the VH core can handle the system call from the VE workload. Since the VEOS process spends CPU time, the VH workload execution would be delayed, degrading the VH performance. If the VH workload cannot immediately be switched to the VEOS process for any reason, the system call from the VE workload might be delayed, degrading the VE performance. In this way, VH and VE workloads may compete to use the same computing resources, such as the VH's CPU time, the VH's memory bandwidth, the network, and file access.
Therefore, VH and VE workloads can affect each other's performance, which is referred to as inter-process performance interference. Performance interference is expected to occur especially if a workload on either the VH or the VE intensively uses the shared computing resources. For example, suppose that a memory-intensive workload is running on one of the VH cores. Then, if the memory bandwidth is exhausted, the memory access latency of the VEOS process increases and thus the system call from the VE workload is delayed, degrading the VE performance. As in research dealing with performance interference on a single processor [19], in order to maximize the benefits of concurrency while efficiently controlling the overall performance degradation that may occur, it is necessary to clarify the characteristics of the applications


that cause conflicts through quantitative research. As one major kind of resource conflict, it is known that the total execution time of a workload increases due to access conflicts on the file system [2], which is one of the shared computing resources. If such a root cause of performance interference is known in advance, it would be possible to schedule jobs so as to avoid resource conflicts among them.

3 Performance Interference by Workload Co-execution

3.1 Evaluation Setup

In this work, we experimentally investigate the effect of co-executing various VH and VE workloads on their performance, and identify the combinations of VH and VE workloads causing severe performance interference on SX-AT. The execution time of each workload is adjusted to be almost the same. In the evaluation, the popular benchmarks Himeno [5], IOR [1], Intel MPI [6], STREAM [10], b_eff [13], MiniAMR [15], and HPL [12] are first used as VH and VE workloads for general discussions on performance interference. After that, we further discuss the performance interference with some tiny benchmark programs that intensively use only particular computing resources, such as the CPU time, memory bandwidth, file I/O, and network. Each benchmark program is compiled for both a VH and a VE, and executed using all cores in the processor. The system specifications used in the following evaluations are listed in Table 1.

Table 1. Hardware configuration of NEC SX-Aurora TSUBASA A300-8.

Vector host          | Xeon Gold 6126 (12 cores) × 2
Vector engine        | Type 10B (8 cores) × 8
Host channel adaptor | Mellanox HDR100 × 2
Operating system     | CentOS Linux 8.1.1911
VEOS                 | veos-2.6.2-1.el8.x86_64
VH compiler          | gcc-4.8.5
VE compiler          | ncc-3.3.1

3.2 Interference Evaluation Results

We evaluate the changes in execution time when a VH and a VE within one VI co-execute the benchmark programs, expecting that the performance interference will increase the execution time. Figure 2 shows the increase in execution time of each VH benchmark program while changing the combination of VH and VE workloads. On the other hand, Fig. 3 shows the increase in execution time of each


Fig. 2. Increases in elapsed time of VH workloads.

Fig. 3. Increases in elapsed time of VE workloads.

VE benchmark program. The system call frequency of each benchmark program is shown in Fig. 4. Comparing Figs. 2 and 3, we can see that the VH performance is likely to degrade more signiﬁcantly than the VE performance when co-executing VH and VE workloads. This is because the VH’s computing resources such as the CPU time and memory bandwidth are spent not only by the VH workload but also the VEOS process for handling system calls from the VE workload. Since the Himeno [5] and STREAM [10] benchmarks are memory-intensive workloads, their performances are degraded mainly by sharing the memory bandwidth with the VEOS process. The HPL benchmark is compute-intensive, and thus the performance is degraded by sharing the VH’s CPU time with the VEOS process. In comparison with the VH performance, the VE performance degradation is small because the VE computing resources are dedicated to each VE workload, and only the system call overhead increases by co-executing VH and VE workloads. In most scientiﬁc computing applications, most of the total execution time is spent for executing kernel loops and hence the system call overhead is not signiﬁcant. Therefore, these results clearly show that co-execution of VH and VE workloads is a promising approach to improving the resource utilization without critical performance degradation except for some cases. In Fig. 2, the execution time of every VH workload obviously increases when the IOR benchmark is running on the VE. Similarly, in Fig. 3, the execution time


Fig. 4. System call frequency.

of every VE workload increases when the IOR benchmark is running on the VH. Thus, it is clear that the performance interference occurs if the IOR benchmark is running on either of the VH or the VE. This is because the IOR benchmark invokes system calls very frequently for measuring the ﬁle I/O performance. As shown in Fig. 4, the IOR benchmark frequently invokes system calls of ﬁle I/O operations. Therefore, it is demonstrated that frequent context-switching for handling system calls from the IOR benchmark hinders the co-executing program from consuming the CPU time as well as other shared computing resources. In addition, VH workloads of MPI applications are generally more sensitive to resource conﬂicts than VE workloads. One reason for this is that only some of VH cores are spent for executing the VEOS processes and the others are not, resulting in the load imbalance among MPI processes that could lead to a long delay at synchronizations such as MPI collective communications. The results above suggest that performance interference at co-execution of VH and VE workloads can signiﬁcantly be aﬀected by the system call frequency. To further analyze the causes of performance interferences, we develop a microbenchmark program that invokes typical system calls at an arbitrary interval. The program is executed on one processor, either of the VH or the VE, and another benchmark program is co-executed on the other processor. In this work, we have developed tiny benchmark programs to repetitively invoke a pair of system calls at a certain time interval, and evaluate the performance degradation due to the system call overheads. The evaluation results with changing system call frequencies are shown in Figs. 5 and 6. In those ﬁgures, Alone indicates the execution without co-execution, and thus there is no interference. 
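As an illustration, a tiny benchmark of this style can be sketched as below. The authors' actual programs run natively on the VH/VE, so the Python form, the pipe-based read/write pair, and all parameter values here are our own illustrative assumptions.

```python
import os
import time

def syscall_load(pairs, interval_s=0.01, chunk=10 * 1024):
    """Invoke a write/read system-call pair `pairs` times at a fixed interval.

    Each iteration writes `chunk` bytes into a pipe and reads them back,
    forcing at least two system calls per iteration; on SX-AT these calls
    would be forwarded from the VE to the VEOS pseudo process on the VH.
    """
    r, w = os.pipe()
    data = b"x" * chunk
    moved = 0
    for _ in range(pairs):
        os.write(w, data)                  # write(2)
        remaining = chunk
        while remaining:                   # read(2), looping over short reads
            remaining -= len(os.read(r, remaining))
        moved += chunk
        time.sleep(interval_s)             # sets the system-call frequency
    os.close(r)
    os.close(w)
    return moved

moved = syscall_load(pairs=5, interval_s=0.001)  # 5 pairs, 10 KB per call
```

Varying `interval_s` sweeps the system-call frequency, which is the independent variable in Figs. 5 and 6.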
When we ran the HPL benchmark on the VH side while invoking the read and write system calls on the VE side to exchange 10 KB of data every 100 ms, the performance drop was too large to finish the execution for time measurement. These results clearly indicate that the system call frequency correlates with performance interference. This is because a system call could cause context switching, switching between kernel and user modes, and access to shared computing resources on the VH side


via the system call. These results demonstrate that the system call frequency is a good indicator to detect whether a workload can degrade the performance of another co-executing workload.

3.3 Avoidance of Performance Interferences

The evaluation results discussed so far have clarified that the system call frequency of a workload can be used to quantify the risk of degrading the performance of co-executing workloads. If a job scheduler knows the system call frequency of each job in advance, it might be able to find a combination of jobs that can safely share VIs for co-execution. Since the main purpose of this paper is to experimentally investigate the performance interference at workload co-execution on an SX-AT system, such a job scheduling mechanism will be discussed in our future work.

Even if a VEOS pseudo process running on a VH core invokes system calls frequently enough to cause conflicts, it does not significantly affect either the memory access latency or the memory bandwidth. Figures 7 and 8 show that co-execution of VH and VE workloads (lmbench [11]) does not drastically affect their sustained memory bandwidths, while the overhead of context switching obviously increases with the number of co-running processes and thus the context switching frequency. Those results indicate that, on the VH side, one main factor causing conflicts is frequent context switching. One major reason for this would be that context

Fig. 5. Changes in elapsed time of VH workloads when changing the system call frequency and type: (a) mmap and munmap; (b) open and close; (c) read and write (10 bytes/call); (d) read and write (10 Kbytes/call).


Fig. 6. Changes in elapsed time of VE workloads when changing the system call frequency and type: (a) mmap and munmap; (b) open and close; (c) read and write (10 bytes/call); (d) read and write (10 Kbytes/call).

Fig. 7. VH memory access performance at executing a VH workload alone: (a) memory latency; (b) memory bandwidth; (c) context switch cost.

switching could save the context in cache memory by evicting other data and thus increase cache misses. Therefore, it is experimentally shown that VH workloads intensively accessing cached data are prone to being affected by frequent context switching.

One might consider that one approach to avoiding context switching overheads due to the interference is allocating some VH cores to handling the system


Fig. 8. VH memory access performance at co-executing VH and VE workloads frequently invoking system calls: (a) memory latency; (b) memory bandwidth; (c) context switch cost.

calls forwarded from the VE workloads. However, we have experimentally confirmed that this approach is ineffective for a VI consisting of multiple VEs, such as the configuration in Table 1. Notice that, in our evaluation, the total number of VE cores in the VI is 64, while the total number of VH physical cores is 24. As a result, if all the VE cores are used to execute VE workloads, 64 VEOS processes compete to use 24 VH physical cores, resulting in severe performance degradation. Accordingly, the degradation is clearly alleviated when the number of VEs managed by the VH becomes smaller, as shown in Fig. 9. In the figure, the vertical axis indicates the increase rate of the execution time when executing the HPL benchmark on the VH side. In this evaluation, a tiny benchmark calling mmap and munmap is executed on the VE side, while changing the number of VE cores executing the tiny benchmark program in parallel. As shown in Fig. 5(a), this combination of VH and VE workloads causes resource conflicts, resulting in performance degradation. Note that the number of VE workloads running in parallel is changed from 12 to 64 without changing the frequency at which each workload invokes system calls.

In Fig. 9, we can see that the performance degradation is clearly mitigated by reducing the number of VE workloads and leaving some VE cores unused. Consequently, the performance degradation of VH workloads can be restrained by sufficiently reducing the number of used VE cores, simply assuming that every VE workload invokes system calls with the same frequency. However, as this approach also reduces the utilization of VE cores, a way of finding a good trade-off point between performance interference avoidance and resource utilization will be discussed in our future work.
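To make the idea of conflict-aware co-execution concrete, here is a toy greedy pairing heuristic based on measured system-call rates. Such a scheduler is explicitly left as future work in the paper, so this sketch, including the `pair_jobs` helper, the job names, the rates, and the threshold, is entirely our own illustration.

```python
def pair_jobs(vh_jobs, ve_jobs, max_combined_rate=1000.0):
    """Greedily pair VH and VE jobs so that the combined system-call rate
    (calls/s) of each co-located pair stays under a threshold.

    `vh_jobs` / `ve_jobs`: dicts mapping job name -> syscall rate.
    Returns (pairs, leftovers): jobs that cannot be safely paired run alone.
    """
    vh = sorted(vh_jobs.items(), key=lambda kv: kv[1])                # quietest first
    ve = sorted(ve_jobs.items(), key=lambda kv: kv[1], reverse=True)  # noisiest first
    pairs, leftovers = [], []
    for ve_name, ve_rate in ve:
        # find the quietest VH job whose rate still fits under the cap
        match = next((i for i, (_, r) in enumerate(vh)
                      if r + ve_rate <= max_combined_rate), None)
        if match is None:
            leftovers.append(ve_name)          # too noisy to co-locate safely
        else:
            pairs.append((vh.pop(match)[0], ve_name))
    leftovers.extend(name for name, _ in vh)
    return pairs, leftovers

pairs, alone = pair_jobs({"HPL": 50, "STREAM": 20}, {"IOR": 990, "Himeno": 5})
```

With these invented rates, the syscall-heavy IOR-like job is kept alone while the quiet pair shares a VI, mirroring the paper's observation that IOR is the workload that interferes most.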

4 Related Work

In Sect. 2, we reviewed that SX-AT adopts a heterogeneous conﬁguration. Therefore, the challenges discussed so far in improving the computational eﬃciency of


Fig. 9. Increases in elapsed time of a VH workload by increasing the number of VE workloads.

a standard CPU-GPU system might also provide interesting insights for improving the SX-AT efficiency. A GPU workload running on a standard CPU-GPU system is likely to saturate shared hardware resources, such as the memory and network bandwidths, due to its massive thread parallelism. Hence, in [7], a platform has been proposed to control the performance trade-off between CPU and GPU workloads. The proposed platform can dynamically determine the GPU concurrency level so as to maximize the system performance, considering system-wide memory and network conflict information as well as the state of the GPU cores. Another related study introduces a runtime framework for scheduling each of multiple users' OpenCL tasks to its optimal device, either a GPU or a CPU, on a CPU-GPU system [18]. The runtime framework uses a machine-learning-based performance prediction model at runtime to select optimal devices. Some algorithms and power prediction models are proposed in [21] for schedulers to co-execute workloads while considering the impact on power consumption as well as on other shared resources.

There are many other studies on oversubscription [2], where each CPU is used for the concurrent execution of multiple workloads. However, most of these studies do not assume a heterogeneous computing system consisting of different types of processors. On the other hand, studies on job scheduling and resource allocation for heterogeneous computing systems usually focus on whether a CPU or a GPU is used to execute each job [3], and those existing approaches cannot directly be applied to SX-AT, on which a pseudo process, i.e., VEOS, runs on the VH to control the VE and shares the VH resources with VH workloads. Several researchers have evaluated the performance of SX-AT and reported on various scientific applications [4,9], VH-VE offload programming [8,16], and I/O performance [14].
However, there is no report that quantitatively evaluates the performance interference when VH and VE workloads coexist. We believe that this study is the ﬁrst to discuss the concurrent execution of VH and VE workloads through quantitative performance evaluation results.


5 Concluding Remarks

This paper has experimentally investigated the performance interference between a VH and a VE when each of the two processors executes a different workload. The evaluation results clearly demonstrate that the system call frequency of a workload can be used as a good indicator to predict whether the workload can affect the performance of another co-executing workload. It is also worth considering the number of used cores, because performance interference can be restrained if some VE cores are left unused when co-executing VH and VE workloads. These experimental results will be helpful to identify a combination of workloads causing frequent resource conflicts, and thus to reduce the risk of performance interference between co-executing workloads on an SX-AT system. In our future work, we will develop a job scheduling mechanism that uses the experimental findings in this paper to realize conflict-aware workload co-execution on an SX-AT system.

Acknowledgements. The authors would like to thank Associate Professor Masayuki Sato of Tohoku University for his valuable help. This work is partially supported by the MEXT Next Generation High-Performance Computing Infrastructures and Applications R&D Program "R&D of a Quantum-Annealing-Assisted Next Generation HPC Infrastructure and its Applications," and Grant-in-Aid for Scientific Research (B) #21H03449.

References

1. HPC IOR benchmark repository. https://github.com/hpc/ior
2. Aceituno, J.M., Guasque, A., Balbastre, P., Simó, J., Crespo, A.: Hardware resources contention-aware scheduling of hard real-time multiprocessor systems. J. Syst. Architect. 118, 102223 (2021)
3. Alsubaihi, S., Gaudiot, J.L.: PETRAS: performance, energy and thermal aware resource allocation and scheduling for heterogeneous systems. In: International Workshop on Programming Models and Applications for Multicores and Manycores (2017)
4. Egawa, R., et al.: Exploiting the potentials of the second generation SX-Aurora TSUBASA. In: 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) (2020)
5. Himeno, R.: Himeno benchmark. https://i.riken.jp/en/supercom/documents/himenobmt/
6. Intel Corporation: Introducing Intel MPI benchmarks. https://software.intel.com/content/www/us/en/develop/articles/intel-mpi-benchmarks.html
7. Kayiran, O., et al.: Managing GPU concurrency in heterogeneous architectures. In: IEEE/ACM International Symposium on Microarchitecture (MICRO) (2014)
8. Ke, Y., Agung, M., Takizawa, H.: neoSYCL: a SYCL implementation for SX-Aurora TSUBASA. In: International Conference on High Performance Computing in Asia-Pacific Region, pp. 50–57 (2021)
9. Komatsu, K., et al.: Performance evaluation of a vector supercomputer SX-Aurora TSUBASA. In: The International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018, pp. 685–696 (2018)
10. McCalpin, J.D.: STREAM: sustainable memory bandwidth in high performance computers. https://www.cs.virginia.edu/stream/
11. McVoy, L., Staelin, C.: lmbench: portable tools for performance analysis. In: Proceedings of the Annual Conference on USENIX Annual Technical Conference (1996)
12. Petitet, A., Whaley, R.C., Dongarra, J., Cleary, A.: HPL - a portable implementation of the high-performance Linpack benchmark for distributed-memory computers, version 2.3. https://www.netlib.org/benchmark/hpl/
13. Rabenseifner, R., Koniges, A.E.: The parallel communication and I/O bandwidth benchmarks: b_eff and b_eff_io. https://cug.org/5-publications/proceedings_attendee_lists/2001CD/S01_Proceedings/Pages/Authors/Rabenseifner/Rabensei.htm
14. Sasaki, Y., Ishizuka, A., Agung, M., Takizawa, H.: Evaluating I/O acceleration mechanisms of SX-Aurora TSUBASA. In: 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (2021)
15. Sasidharan, A., Snir, M.: MiniAMR - a miniapp for adaptive mesh refinement. Technical report (2016)
16. Takizawa, H., Shiotsuki, S., Ebata, N., Egawa, R.: OpenCL-like offloading with metaprogramming for SX-Aurora TSUBASA. Parallel Comput. 102, 102754 (2021)
17. Tohoku University Cyberscience Center: Supercomputer AOBA (2020). https://www.sc.cc.tohoku.ac.jp/sc20/
18. Wen, Y., O'Boyle, M.F.P.: Merge or separate? Multi-job scheduling for OpenCL kernels on CPU/GPU platforms. In: Proceedings of the General Purpose GPUs, GPGPU-10 (2017)
19. Xiong, Q., Ates, E., Herbordt, M.C., Coskun, A.K.: Tangram: colocating HPC applications with oversubscription. In: IEEE High Performance Extreme Computing Conference (2018)
20. Yamada, Y., Momose, S.: Vector engine processor of NEC's brand-new supercomputer SX-Aurora TSUBASA. In: A Symposium on High Performance Chips (Hot Chips) (2018)
21. Zhu, Q., Wu, B., Shen, X., Shen, L., Wang, Z.: Co-run scheduling with power cap on integrated CPU-GPU systems. In: International Symposium on Parallel and Distributed Processing (2017)

A Learning-Based Scheduler for High Volume Processing in Data Warehouse Using Graph Neural Networks

Vivek Bengre¹, M. Reza HoseinyFarahabady²(B), Mohammad Pivezhandi¹, Albert Y. Zomaya², and Ali Jannesari¹

¹ Department of Computer Science, Laboratory for Software Analytics and Pervasive Parallelism (SwAPP), Iowa State University, Ames, USA {bvivek2,mpvzhndi,jannesari}@iastate.edu
² School of Computer Science, Center for Distributed and High Performance Computing, The University of Sydney, Camperdown, NSW, Australia {reza.hoseiny,albert.zomaya}@sydney.edu.au

Abstract. The process of extracting, transforming, and loading (also known as ETL) a high volume of data has played an essential role in data integration strategies in data warehouse systems in recent years. Almost all distributed ETL systems currently used in both industrial and academic contexts employ a simple heuristic-based scheduling policy. Such a heuristic policy tries to process a stream of jobs in a best-effort fashion; however, it can result in under-utilization of computing resources in most practical scenarios. At the same time, such an inefficient resource allocation strategy can result in an unwanted increase in the total completion time of data processing jobs. In this paper, we develop an efficient reinforcement learning technique that uses a Graph Neural Network (GNN) model to combine all submitted task graphs into a single graph, which simplifies the representation of the states within the environment and efficiently parallelizes the processing of the submitted jobs. Besides, to augment the embedding features in each leaf node, we pass messages from leaf to root so that the nodes can collaboratively represent actions within the environment. The performance results show up to 15% improvement in job completion time compared to a state-of-the-art machine learning scheduler, and up to 20% enhancement compared to a tuned heuristic-based scheduler.

Keywords: Extract Transform Load (ETL) operations · Scheduling policy · Data streaming processing system · Graph neural networks · Job completion time · Reinforcement learning

1 Introduction

© Springer Nature Switzerland AG 2022. H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 175–186, 2022. https://doi.org/10.1007/978-3-030-96772-7_17

The process of extracting, transforming, and loading (also known as ETL) a high volume of data plays an essential role in data integration strategies in data


warehouse systems in recent years. A typical ETL process gathers several types of data from different sources, refines them, and then delivers the refined sets to a Data Warehouse (DW) platform (e.g., Amazon Redshift [1], Azure Data Warehouse Service [2], or Google BigQuery [3]), where the underlying engine allows the end-users to effectively perform critical business intelligence (BI) activities (such as predictive data analytics).

Data processing systems over batch/streaming flows have become more and more prominent in the past few years, as there is a need to apply sets of distributed data mining algorithms over massive data-sets at a petabyte scale. This versatility allows the end-users to submit and run a variety of different algorithms with different load characteristics. In particular, the set of end-user jobs can be scheduled by running a simple heuristic-based scheduling algorithm such as Round Robin (RR), rule-based scheduling heuristics, First Come First Serve (FCFS), or Shortest Job First (SJF), among others [4,5]. While on a small scale the achieved performance of such simple scheduling policies can be considered acceptable, the performance degradation caused by applying them becomes immediately visible on larger clusters that handle various large workloads on their expensive compute resources. As a result, a near-optimal solution that can effectively cope with the challenge of dedicating an appropriate number of executors to each job or stage, when the arrival rate of jobs or data is not known in advance, is highly desirable.

In most distributed ETL frameworks in data warehouse environments, data processing jobs are broken down into smaller sub-tasks known as processing stages. These processing stages can be conceptually linked together to form an abstract processing structure (a graph) that represents the dependencies between the processing stages.
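The gap between such simple policies can be seen with textbook scheduling arithmetic. The sketch below compares FCFS and SJF average completion times on a toy single-executor workload; the job durations are invented for illustration and are not from the paper.

```python
def avg_completion_time(durations):
    """Average completion time when jobs run back-to-back in the given
    order on a single executor (all jobs arrive at time 0)."""
    t, total = 0.0, 0.0
    for d in durations:
        t += d           # this job finishes at the running total
        total += t
    return total / len(durations)

jobs = [10.0, 1.0, 2.0, 1.0]             # job durations, in arrival order
fcfs = avg_completion_time(jobs)         # First Come First Serve: keep order
sjf = avg_completion_time(sorted(jobs))  # Shortest Job First: run short jobs first
```

Here FCFS yields an average completion time of 12.0 while SJF yields 5.25, because the long job delays every job queued behind it; SJF in turn requires job durations to be known, which is exactly the kind of information a learned scheduler tries to infer.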
Breaking the submitted processing jobs down into smaller stages/fragments makes them more manageable. Moreover, fragmentation makes it possible to run sub-tasks in a concurrent/parallel fashion. In most practical scenarios, such smaller tasks are linked together to form an underlying application structure usually referred to as a Directed Acyclic Graph (DAG). When jobs are encoded as DAGs, dedicating jobs to each cluster node is known to be an NP-hard problem, but the solution can be approximated using graph processing techniques [6]. As such, we use the information present within the job structure to find patterns of efficient execution. Manually traversing all execution paths to make a decision is infeasible (or extremely slow) for large job sets. Therefore, in this paper, we aim to develop an innovative way to look ahead from the leaf nodes to the root node of the DAG using Graph Neural Networks (GNNs) and decide the order of execution.

Original Contribution. In this paper, we develop an efficient embedding plan that reduces the convergence time and enhances the reward obtained in each episode by the reinforcement learning (RL) agents. We employ a Double Deep Q-Networks (DDQN) [7]

Deep RL-Based Scheduler for High Volume Processing


that can tune the parameters of the graph neural network to set an efficient embedding for the DAGs. For any DQN-based algorithm to find an efficient policy (e.g., [8]), it has to explore the state space sufficiently; however, this makes converging to a policy take a long time. We use an initial step in which a heuristic-based scheduler assists the reinforcement learning agent in efficient policy exploration through the first episodes. Further, we solve the executor-limit selection problem by limiting each stage to one executor and letting the agent select the execution order. Limiting the number of executors per node allows more executors to be accessible at a given time. We test our model on a simulator built for Apache Spark that also simulates Decima [9]. Our method, Decima, FIFO, and a heuristic-based dynamic partitioning scheduler are compared based on average job completion time, executor usage, and training time. The main contributions of the current study are summarized as follows.

– We use SageCONV to implement message passing in the reverse direction, which allows us to embed more information in each node for taking actions.
– We make the training process an order of magnitude faster by directly representing the Q-values as node feature embeddings in the reinforcement learning agent.
– We combine all the DAGs into a single DAG structure to enhance reinforcement learning parallelism and the descriptiveness of the state representation. We train the model by utilizing DDQNs for continuous job arrival.

The rest of this paper is organized as follows. Section 2 highlights the main challenges associated with scheduling sub-tasks for data ETL operations in distributed data processing platforms (such as a data warehouse system). Section 3 presents the details of our proposed scheme. The performance of the proposed solution against well-known heuristic-based static and dynamic algorithms is evaluated in Section 4. Finally, Section 5 concludes our work.

2

Problem Statement

The process of extracting data from data sources, transforming it, and loading it to a central host (commonly known as ETL operations) is among the core strategies and technologies used by enterprises for analyzing business information and making business decisions in common business intelligence (BI) platforms. BI technologies can handle large amounts of structured/unstructured data to develop and create new strategic business opportunities through easy interpretation of big data sets, usually derived from the market in which an enterprise operates (also known as external data) combined with data from the internal sources of the business (such as financial and operations data). Such insights can provide enterprises with a competitive market advantage and long-term stability at the broadest level. Common applications of BI tasks include, but are not limited to, online analytical processing, data/process/text mining, complex event processing, and predictive/prescriptive analytics. Such applications


can empower enterprises to gain insight into new markets or to assess the demand and suitability of products and services for different market segments. Large-scale data processing systems can involve a considerable amount of complexity; hence, significant operational problems can occur when one employs improperly designed data processing systems. Creating an effective schedule of data processing tasks over limited computing resources, across the lifetime of their usage, is immensely important in such systems. In particular, an efficient scheduling policy must solve issues such as the decomposition of the original data processing applications into smaller independent tasks that may be processed in a parallel or distributed manner. Further, thread management, synchronization, and communication can exacerbate the problem as the amount of data becomes larger. Parallel processing of data streams is a very active research topic, and many studies have proposed different scheduling strategies to process data streams or real-time data streams [10,11]. A common requirement for all such systems is throughput, i.e., efficient utilization of available resources [12]. Some previous studies instead target the average or p99 response time. In the rest of this section, we highlight the main challenges in designing a scheduling policy for a large-scale data processing application.

Job Scheduling Challenge. Scheduling policies can be grouped into two broad categories: domain-specific [12,13] and general data processing approaches. The domain-specific policies mostly concentrate on efficiently separating tasks into efficient processing sub-tasks. On the other hand, the general data processing approaches focus on separating general jobs into multiple stages and tasks regardless of their intrinsic behavior [11,14,15]. 
The scheduler policies most commonly used in industrial projects are those designed on simple heuristic-based approaches [4,5,16]. The authors in [17–20] propose control-based approaches for guaranteeing the Quality-of-Service (QoS) requirements associated with parallel running queries in distributed stream processing engines and event-driven serverless platforms. Such policies usually simplify scheduling by modeling task properties based on the embedded features of the jobs. These modeling policies can be improved by considering the dependencies among tasks [21], or by hybridizing them with learning mechanisms [22]. However, they have been shown to be inefficient for complex, high-frequency job arrivals [9]. The current trend is to provide a self-improving scheduler that enhances resource allocation over time [15,23].

Graph Structure Challenge. Effective handling of task scheduling problems is a critically important part of any data processing framework. Because an application can be composed of several smaller partial tasks/operations (also known as the underlying application graph), a scheduler must be optimized accordingly. The goal can be to optimize the utilization of the CPU or memory of the underlying system, to reduce the response time of the tasks, or a combination of both. Having the application graph helps to reduce the model complexity substantially and introduces tools for efficient learning, fast training, and low-latency scheduling [24].


Graph Neural Networks. Graph Neural Networks (GNNs) are a class of deep learning models that address graph-structured problems represented via vertices and edges encoding dependencies. GNNs have a wide variety of applications in social network recommendations, node classification, medicinal drug delivery, and protein-protein interaction. Graph embedding changes the representation of the nodes and edges of a graph so as to preserve information while compressing it down to a manageable size. There are multiple ways in which this embedding can be done, but all procedures use message passing in some way to include the features of adjacent nodes. The node embedding is computed by a user-specified function; similarly, edges can have features of their own, and the embedding for each edge is calculated by considering the connected nodes and their node features [24].

Reinforcement Learning. Machine learning, in essence, tries to find patterns in data. Very often, optimal data is required by ML algorithms to make correct predictions. However, data describing the optimal solution does not exist in some settings, such as decision-making environments; the optimum has to be found without such data. Reinforcement learning algorithms provide a way of interacting with the environment to make decisions and classify a decision as good or bad. Reinforcement learning is always goal-directed and is implemented as an active learning model, i.e., the model learns while interacting with the environment. Reinforcement learning models that make decisions are called agents. An agent has a state, a policy, a value function, and a model. The actions performed by an agent depend entirely on the state it is in, and this state is not to be mistaken for the environment state. Environment states are generally not completely visible to the agent; however, there are cases where the environment state is fully visible, as in games like chess. 
A policy defines the agent's behavior by mapping states to actions; it is represented by π in Eqs. 1 and 2. The value function calculates the expected reward obtained by following π from a state s, and the model predicts what the environment does. The model is never perfect, but a good approximation of the environment. For reinforcement learning algorithms, the environment is always considered Markovian, i.e., the current time step represents all the time steps before it.

π(action|state) = P(action|state)

(1)

vπ(s) = E[ Gt | St = s, At ∼ π(s) ]

(2)

In Eqs. 2 and 3, Gt represents the total expected reward for state St and action At under policy π. Gt can be expanded as in Eq. 3 to express the total expected reward. The discount factor γ, shown in Eq. 4, represents the uncertainty with which the reward for the next steps is weighted. The objective of the algorithm is to find the optimal policy π∗.

Gt = Rt+1 + γ Rt+2 + γ^2 Rt+3 + · · ·

(3)


π∗ = max( Σ_{t>0} γ^t Rt )

(4)
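As a minimal, generic RL sketch (not the authors' code) of the discounted return in Eq. 3, the finite-horizon sum can be computed directly:

```python
# Sketch of Eq. 3: discounted return G_t for a finite episode of rewards
# R_{t+1}, R_{t+2}, ..., with discount factor gamma.

def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# With gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

A smaller γ makes the agent care less about distant rewards, which is the "uncertainty" role the text assigns to the discount factor.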

3

Proposed Approach: Design and Analysis

The information present within the structure of a job helps to find efficient patterns of execution. We develop an innovative way to look ahead from the leaf nodes to the root node of the DAG using Graph Neural Networks (GNNs) [6] and decide the order of execution according to an enhanced agglomeration of information. A Double Deep Q-Network [7] tunes the parameters of the graph neural network to set efficient embeddings for the DAG features. GNNs generalize conventional deep learning by representing structure as a set of nodes with edges as their dependencies [25]. A graph neural network can represent deep neural networks hierarchically to reduce the complexity of training by creating replicated kernels [26–29]. In addition, stages are limited to one executor, and the agent decides how to dedicate free executors after resources are allocated. To deal with the time-consuming convergence of the policy search in the approach proposed in [8], we propose a hybrid scheme in which a heuristic-based scheduler assists by executing the first few episodes. Along with creating a large DAG structure, we also utilize a state representation that helps us to parallelize the training and inference processes.

3.1

Preliminaries

Apache Spark is one of the most widely used open-source computing engines. Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the main program. To run a Spark engine on a cluster, the SparkContext object needs to connect to the cluster manager; it can connect to YARN, Mesos, Kubernetes, or Spark's default Standalone manager. Once the nodes are connected to the cluster manager, the Spark engine acquires executors on the worker nodes in the cluster, which are processes that run computations and store data for the application. Then, the SparkContext object sends the submitted tasks to the executors for execution. The Spark engine provides a rich selection of APIs and libraries that support Extract Transform Load (ETL) operations, graph computations, streaming jobs, real-time query processing, and machine learning capabilities. The model proposed in this paper is tested on an accurate simulator built for Apache Spark [9]. The comparison metrics are average job completion time, executor usage, and training and inference time. Our model is compared with various heuristic-based schedulers, including First In First Out (FIFO) and dynamic partitioning algorithms, and with the state-of-the-art reinforcement learning-based scheduler named Decima. The proposed workflow is represented in Fig. 1.


Table 1. Notation used in the paper

Entity          | Symbol
Discount factor | γ
Action          | At
Policy function | π
State           | St

3.2

State Representation

The environment state at any given point contains all the job DAGs that have not been executed. Each job DAG is a sparse matrix of edges and vertices, along with a matrix of features for each node. An environment state can thus be represented as a collection of sparse matrices with their corresponding feature matrices. The solution proposed in this paper uses something similar to describe the environment state, i.e., one large graph containing all the DAGs present, that is, the DAGs that have not been fully executed. When a node completes its execution, a flag is changed to indicate so; this "soft" delete keeps the node numbering and positions correct. In Fig. 1, the node numbering for some nodes is repeated, and these repetitions represent new job arrivals. To identify nodes internally, they are numbered from 0 to node count, where node count is the total number of nodes in the graph. This structure allows new jobs to be added quickly by appending them to the existing list. The job features keep changing as new jobs arrive and the cluster state changes, so it is better to compute the feature matrix just before training or inference. Feature calculation is also not expensive, as most frameworks provide this information about each node. This aggregated state representation also allows an efficient way to parallelize and compute the embedding in the next step. The final step is to reverse the directions of all the edges in the graph; this is required for the leaf embedding to have influence from the higher nodes.

[Figure: environment state → GraphSAGE-2 convolutions with reverse messaging → Q-values → leaf-node mask and argmax (node limit = 1) → new state and reward.]

Fig. 1. The workﬂow and the structure of the proposed solution.
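The combined-graph bookkeeping described in Sect. 3.2 can be sketched as follows. This is an assumed data layout for illustration, not the authors' implementation: job DAGs are merged into one global edge list with renumbered nodes, edges are reversed, and completed nodes are soft-deleted via flags:

```python
# Sketch (hypothetical layout): merge all pending job DAGs into a single
# reversed edge list with globally unique node ids and soft-delete flags.

def merge_job_dags(job_dags):
    """job_dags: list of (num_nodes, edge_list) with per-job local ids."""
    edges, done, offset = [], [], 0
    for num_nodes, local_edges in job_dags:
        for u, v in local_edges:
            edges.append((v + offset, u + offset))  # reversed direction
        done.extend([False] * num_nodes)            # soft-delete flags
        offset += num_nodes                          # global renumbering
    return edges, done

# Two toy DAGs: 0 -> 1 -> 2 and 0 -> 1.
edges, done = merge_job_dags([(3, [(0, 1), (1, 2)]), (2, [(0, 1)])])
print(edges)   # [(1, 0), (2, 1), (4, 3)]
done[2] = True  # a node finished: flag it instead of deleting it
```

Appending a newly arrived job is just another `extend` with a larger offset, which is what makes job arrival cheap in this representation.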


3.3


Graph Embedding

A graph neural network is used to embed each node's information as logits. In our solution, the resulting logit is just one number, which represents the Q-value for selecting that node as the action. The graph neural network takes the job DAGs and the features of each node as input and outputs a Q-value for each node. In the graph neural network, we adopt three GraphSAGE [30] layers. In the first two layers, the numbers of input and output features are both five, while the last layer takes the five features output by the second layer as input and outputs only one feature, which is the Q-value (logit) for each node. Since the message passing path is reversed, the computed leaf node values have influence from nodes that are a few generations (in terms of dependency) above them. So, the leaf node value represents the "path" from the leaf to some parent node. For any node u in the graph, the embedding is calculated as follows.

x^l_{parent(u)} = agg( { x^{l−1}_v : ∀ v ∈ parent(u) } )

(5)

x^l_u = σ( W · concat( x^{l−1}_u , x^l_{parent(u)} ) )

(6)

The neighborhood N of a given node automatically changes to the parents/dependants of the node. One round of message passing is not enough for the leaf nodes to gain enough influence from the nodes higher up. So, to obtain a reasonable influence, three rounds of message passing are done. This ensures an embedding that takes into account a neighbourhood spanning reasonably far from the leaf nodes. The next step is to train the embedding to give accurate/efficient Q-values per node.
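A toy version of this reverse message passing (scalar features, a mean aggregator, and the learned transform σ(W · concat(·)) replaced by a plain sum for readability; not the authors' code) shows why three rounds are needed for a leaf to feel nodes several generations above it:

```python
# Toy stand-in for Eqs. 5-6: each round, every node adds the mean of the
# features of its parents (its "neighborhood" after edge reversal).

def message_passing(parents, x, rounds=3):
    """parents[u] lists the nodes whose messages node u receives."""
    for _ in range(rounds):
        x = [
            x[u] + (sum(x[v] for v in parents[u]) / len(parents[u])
                    if parents[u] else 0.0)
            for u in range(len(x))  # synchronous update: reads old x
        ]
    return x

# Chain 2 -> 1 -> 0 after edge reversal: leaf 0 hears from 1, 1 from 2.
parents = {0: [1], 1: [2], 2: []}
print(message_passing(parents, [0.0, 0.0, 1.0], rounds=3))  # [3.0, 3.0, 1.0]
```

After one round the leaf's value is still 0.0 (node 2's feature has not reached it yet); after three rounds it reflects the node two generations above, mirroring the paper's choice of three GraphSAGE layers.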

4

Performance Evaluation Results

The proposed approach is evaluated on the Decima Spark simulator [9]. We compare the results with FIFO (Spark's default scheduling), a dynamic partitioning scheduler, and Decima. Executor usage, average job completion time, and the cumulative distribution of the rewards are the three main evaluation criteria. The jobs are generated randomly based on the TPC-H dataset [31], and the rewards may increase as the generated job set is extended. The proposed solution includes the same randomness of input jobs, and the evaluation is based on the average improvement ratio over multiple runs. Instead of focusing on matrix factorization, a common embedding technique in GCNs, we use an inductive method based on node features, GraphSAGE [30], to learn embedding features that generalize to unseen nodes. Our model aggregates feature information from the neighboring nodes, and back-propagation with stochastic gradient descent is used to train the parameters. A symmetric aggregator function makes the model trainable over the unordered set of vectors formed by the neighbors of each node. We considered two different aggregator functions, mean and pooling,


Fig. 2. Performance evaluation of Decima executor usage versus dynamic scheduling

Table 2. Parameters for different training stages based on pooling and mean aggregator stages.

Parameter          | Pooling | Mean (stage 1) | Mean (stage 2)
Burning            | 1000    | 1000           | 1000
Learning rate      | 0.001   | 0.001          | 0.001
Episodes           | 0.001   | 15             | 30
Gamma              | 0.9     | 0.9            | 0.9
Assist             | 90%     | 100%           | 0%
Random exploration | 10%     | 0%             | 100%
Exploration decay  | 0.9998  | 0.9998         | 0.9999
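The assist/exploration split in Table 2 can be sketched as the following action-selection rule (hypothetical helper names; a generic ε-greedy-with-assist sketch, not the authors' code):

```python
# Sketch of assisted exploration: with probability assist_p follow the
# heuristic scheduler, with probability explore_p pick a random node,
# otherwise act greedily on the Q-values. explore_p decays each step.

import random

def select_action(q_values, heuristic_action, assist_p, explore_p):
    r = random.random()
    if r < assist_p:                      # follow the heuristic scheduler
        return heuristic_action
    if r < assist_p + explore_p:          # random exploration
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)  # greedy

explore_p, decay = 0.10, 0.9998           # stage-1 pooling row of Table 2
action = select_action([0.1, 0.7, 0.2], heuristic_action=0,
                       assist_p=0.90, explore_p=explore_p)
explore_p *= decay                        # exploration share decays per step
```

With assist at 90–100% in the early stages, the agent observes mostly heuristic-quality trajectories, which is what shortens the otherwise long DQN exploration phase.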

to train the models. Our experiments in Fig. 3 show that the pooling aggregator requires more training episodes, and its loss value converges considerably more slowly than the mean aggregator's. However, the converged policies are comparable in terms of scheduling efficiency. The parameters are initialized as given in Table 2.


Fig. 3. GraphSAGE with the pooling aggregator converges fairly fast in the beginning episodes and then stabilizes. The left-side figure shows the average loss plot; the right-side figures represent the average Q value. The top images represent training via the pooling aggregator, and the bottom images are stage 1 and stage 2 for the mean aggregator.

5

Conclusion

Graphinator successfully reduces job completion time in high-frequency job arrival cases. Our results show that having a graph neural network compute the Q-values helps execute jobs much more efficiently. Irrespective of the number of parallel nodes assigned, this work also shows that, with the assistance of optimized scheduling algorithms, the training time for a model can be drastically reduced. We also show that assigning one executor per stage in a job DAG works well for high-load environments. However, we also observed that the overall response time (makespan) of the jobs is the limiting factor of assigning one node per stage. This limitation can be addressed by manually tuning the algorithm for lower loads and increasing the maximum number of executors per stage. Our model's ability to learn helps it efficiently enhance its performance over an extended duration of time with more randomized real-life cluster loads. As future work, the proposed method can be extended to optimize hardware requirements under limited memory, CPU, and storage.

Acknowledgment. We thank the Research IT team (ResearchIT – RIT) of Iowa State University for their continuous support in providing access to HPC clusters for conducting the experiments of this research project. Prof. Albert Y. Zomaya acknowledges the support of the Australian Research Council Discovery scheme (DP190103710). Dr. MohammadReza HoseinyFarahabady acknowledges the continued support and patronage of the Center for Distributed and High Performance Computing at the University of Sydney, NSW, Australia, for giving access to advanced high-performance computing platforms, industry-leading cloud facilities, machine learning (ML) and analytics infrastructure, digital IT services, and other necessary tools.


References

1. Amazon Redshift: Cloud data warehouse. https://aws.amazon.com/redshift/. Accessed 25 Oct 2021
2. Azure data warehousing architectures. https://docs.microsoft.com/en-us/azure/architecture/data-guide/relational-data/data-warehousing. Accessed 25 Oct 2021
3. BigQuery: Cloud data warehouse. https://cloud.google.com/bigquery. Accessed 25 Oct 2021
4. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, pp. 59–72 (2007)
5. Zaharia, M., et al.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing, p. 2 (2012)
6. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE Trans. Neural Netw. 20(1), 61–80 (2008)
7. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30 (2016)
8. Mnih, V., et al.: Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
9. Mao, H., Schwarzkopf, M., Venkatakrishnan, S.B., Meng, Z., Alizadeh, M.: Learning scheduling algorithms for data processing clusters. In: Proceedings of the ACM Special Interest Group on Data Communication, pp. 270–288 (2019)
10. Yang, Z., Nguyen, P., Jin, H., Nahrstedt, K.: MIRAS: model-based reinforcement learning for microservice resource allocation over scientific workflows. In: 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pp. 122–132. IEEE (2019)
11. Peng, Y., Bao, Y., Chen, Y., Wu, C., Guo, C.: Optimus: an efficient dynamic resource scheduler for deep learning clusters. In: Proceedings of the 13th EuroSys Conference, pp. 1–14 (2018)
12. Peng, Y., Bao, Y., Chen, Y., Wu, C., Meng, C., Lin, W.: DL2: a deep learning-driven scheduler for deep learning clusters. arXiv preprint arXiv:1909.06040 (2019)
13. Moritz, P., Nishihara, R., Stoica, I., Jordan, M.I.: SparkNet: training deep networks in Spark. arXiv preprint arXiv:1511.06051 (2015)
14. Mirhoseini, A., et al.: Device placement optimization with reinforcement learning. In: International Conference on Machine Learning, pp. 2430–2439. PMLR (2017)
15. Mao, H., Alizadeh, M., Menache, I., Kandula, S.: Resource management with deep reinforcement learning. In: Proceedings of the 15th ACM Workshop on Hot Topics in Networks, pp. 50–56 (2016)
16. Ghodsi, A., Zaharia, M., Hindman, B., Konwinski, A., Shenker, S., Stoica, I.: Dominant resource fairness: fair allocation of multiple resource types. In: NSDI 2011, pp. 24–24 (2011)
17. Farahabady, M.R.H., Zomaya, A.Y., Tari, Z.: QoS- and contention-aware resource provisioning in a stream processing engine. In: International Conference on Cluster Computing, pp. 137–146 (2017)
18. Wang, Y., Tari, Z., HoseinyFarahabady, M.R., Zomaya, A.Y.: QoS-aware resource allocation for stream processing engines using priority channels. In: International Symposium on Network Computing and Applications (NCA), pp. 1–9 (2017)


19. HoseinyFarahabady, M.R., Zomaya, A.Y., Tari, Z.: A model predictive controller for managing QoS enforcements and microarchitecture-level interferences in a Lambda platform. IEEE Trans. Parallel Distrib. Syst. 29(7), 1442–1455 (2018)
20. Kim, Y.K., HoseinyFarahabady, M.R., Lee, Y.C., Zomaya, A.Y., Jurdak, R.: Dynamic control of CPU usage in a Lambda platform. In: International Conference on Cluster Computing (CLUSTER), pp. 234–244 (2018)
21. Grandl, R., Kandula, S., Rao, S., Akella, A., Kulkarni, J.: GRAPHENE: packing and dependency-aware scheduling for data-parallel clusters. In: 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, pp. 81–97 (2016)
22. Kumar, N., Vidyarthi, D.P.: A novel hybrid PSO-GA meta-heuristic for scheduling of DAG with communication on multiprocessor systems. Eng. Comput. 32(1), 35–47 (2016)
23. Bingqian, D., Chuan, W., Huang, Z.: Learning resource allocation and pricing for cloud profit maximization. Proc. AAAI Conf. Artif. Intell. 33, 7570–7577 (2019)
24. Zonghan, W., Pan, S., Chen, F., Long, G., Zhang, C., Philip, S.Y.: A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32, 4–24 (2020)
25. Wang, M., et al.: Deep graph library: a graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315 (2019)
26. Yu, S., Nguyen, P., Abebe, W., Anwar, A., Jannesari, A.: SPATL: salient parameter aggregation and transfer learning for heterogeneous clients in federated learning (2021)
27. Yu, S., Mazaheri, A., Jannesari, A.: Auto graph encoder-decoder for neural network pruning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6362–6372, October 2021
28. Yu, S., Mazaheri, A., Jannesari, A.: Auto graph encoder-decoder for model compression and network acceleration. arXiv preprint arXiv:2011.12641 (2020)
29. Liu, H., Simonyan, K., Vinyals, O., Fernando, C., Kavukcuoglu, K.: Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436 (2017)
30. Hamilton, W.L., Ying, R., Leskovec, J.: Inductive representation learning on large graphs. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1025–1035 (2017)
31. TPC-H version 2 and version 3

Adaptive Updates for Erasure-Coded Storage Systems Based on Data Delta and Logging

Bing Wei, Jigang Wu(B), Xiaosong Su, Qiang Huang, and Yujun Liu

The School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China

Abstract. With the explosive growth of data in modern storage systems, erasure coding is widely used to ensure data reliability because of its low storage cost and high reliability. However, a small update leads to a partial update in an erasure-coded storage system, and such data updates incur high I/O latency. This paper proposes an adaptive update approach, named DETOG, which efficiently speeds up partial updates for erasure-coded storage systems. DETOG employs machine learning approaches to classify files into non-write-only and write-only files. For non-write-only files, DETOG uses the data deltas, i.e., the differences between the latest data values and the original data values, rather than the parity deltas, to reconstruct the lost data. This allows erasure-coded storage systems to read the old data only for the first update instead of for each update. For write-only files, DETOG directly appends the new data to the logs of the data nodes and the parity nodes. This allows erasure-coded storage systems to avoid reading the old data for each update. We implement DETOG on a newly designed prototype storage system to perform the performance evaluation. Extensive experimental results on real-world traces show that DETOG can efficiently improve the I/O throughput.

Keywords: Partial updates · File classifier · Erasure coding · Data delta · Logging

1


Introduction

Modern storage systems continuously expand in scale to cope with the ever-increasing volume of data storage. In large-scale storage systems, it is necessary to ensure both high data availability and data reliability, because failures become more prevalent due to disk crashes, sector errors, server outages, etc. [10,12,14]. To ensure both high data availability and data reliability, keeping additional redundancy in storage systems is a commonly used approach that enables data recovery once failures occur [7]. Two representative redundancy mechanisms are replication and erasure coding (EC) [7]. When replication is applied,

© Springer Nature Switzerland AG 2022
H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 187–197, 2022. https://doi.org/10.1007/978-3-030-96772-7_18


B. Wei et al.

the identical replicas of each data item are copied and then distributed across multiple data nodes of the storage system. This can incur substantial storage overhead, especially in the face of the ever-increasing volume of data being stored nowadays. When EC is applied, original data blocks are encoded to generate new parity blocks, such that a subset of the data and parity blocks can sufficiently recover all original data blocks. It is known that EC introduces less storage overhead and write bandwidth than replication under the same degree of fault tolerance [16]. Although EC can provide fault tolerance with low redundancy, it can introduce additional performance overhead for small updates. This is because EC needs to maintain the consistency of parity chunks to ensure the correctness of data reconstruction [15]. In EC, two representative update mechanisms are re-encoding and delta-based write [5,11,13,16]. In re-encoding, the new parity blocks are generated by computing a linear combination of the unmodified data blocks and the new data blocks of an EC group [9,14]. Delta-based write computes the new parity blocks based on the change of data blocks instead of summing over all data blocks; it employs the difference between new data and old data to compute the parity deltas, then uses the parity deltas to reconstruct the lost data blocks. In small-write scenarios, delta-based write significantly outperforms re-encoding [2,16]. Erasure-coded storage systems usually combine delta-based write and logging to speed up partial updates. Full-logging (FL) saves the disk read overhead of parity chunks by appending all data and parity updates. That is, after the modified data range and parity deltas are respectively sent to the corresponding data and parity nodes, the storage nodes create logs to store the updates [2]. Parity-logging (PL) is a hybrid of full-overwrite (FO) and FL. 
It saves the disk read overhead of parity chunks and additionally avoids the merging overhead on data chunks introduced by FL, because FO applies in-place updates to both data and parity chunks. However, FL and PL still have to perform a time-consuming write-after-read for each partial update. PARIX is a speculative partial write scheme for fast parity logging; it performs write-after-write instead of write-after-read to reduce the seek overhead. However, it introduces an extra write for each partial update, in comparison to replication. In this paper, we focus on how to minimize the I/O overhead of partial updates for erasure-coded storage systems. We propose an adaptive update approach, named DETOG, to solve this problem. DETOG classifies files into non-write-only and write-only using a decision tree (DT) [8]. For non-write-only files, DETOG uses the data deltas, i.e., the differences between the latest data values and the original data values, rather than the parity deltas, to reconstruct the lost data. This allows DETOG to perform a single write instead of a write-after-read for the last n − 1 partial updates when handling a series of n partial writes to the same data. For write-only files, DETOG performs partial updates using FL. This allows DETOG to directly send the new data to the data nodes and the parity nodes for each update, thereby transforming write-after-read into a single write for

Adaptive Updates for Erasure-Coded Storage Systems


each partial update. The main contributions of this paper are summarized as follows.

– We propose an adaptive update approach, DETOG, to speed up partial updates. DETOG uses data deltas instead of parity deltas to bypass the computation of parity deltas and the read of old data. DETOG classifies files into non-write-only and write-only using machine learning. When updating non-write-only files with a series of n partial updates to the same data, DETOG performs a write-after-read for the first partial update and a single write for the last n − 1 partial updates. When updating write-only files, DETOG performs a single write for each partial update.
– Based on DETOG, we have designed a distributed prototype file system for small-write-intensive workloads. We have implemented DETOG and compared it with the latest work on the proposed storage system, using the same real-world I/O traces as [3]. Extensive experimental results show that DETOG successfully improves the I/O throughput.

2

Preliminary

We divide file content into blocks and apply EC independently on a per-block basis. We denote by (k, m)-code an EC approach defined by two parameters k and m. A (k, m)-code encodes k equal-size data blocks to form m parity blocks. Let n denote the number of nodes (or servers) in an erasure-coded storage cluster. We assume n ≥ k + m, and the collection of k + m data and parity blocks is distributed across k + m of the n nodes in the erasure-coded storage cluster. We mainly consider Maximum Distance Separable (MDS) codes. It has been proved that MDS codes achieve the optimal storage efficiency for a given level of fault tolerance [3]. For example, for a (k, m)-code, k original data blocks are encoded to generate m parity blocks, and the original data blocks can be reconstructed from any k of the k + m data and parity blocks. In an EC group, each parity block can be encoded by computing a linear combination of the k data blocks. For a (k, m)-code, let $d_j$ ($1 \le j \le k$) denote a data block and $p_i$ ($1 \le i \le m$) denote a parity block; then $p_i$ can be computed by

$$p_i = \gamma_{i1} d_1 + \gamma_{i2} d_2 + \cdots + \gamma_{ik} d_k \qquad (1)$$

where $\gamma_{ij}$ ($1 \le j \le k$, $1 \le i \le m$) denotes an encoding coefficient. All arithmetic operations are performed in the Galois field $GF(2^w)$ [9]. The re-encoding approach computes the new parity blocks by Eq. (1). The linearity of EC provides an alternative that reduces the I/O overhead of computing the new parity blocks when one or more data blocks are updated. Assume that the data block $d_l$ ($1 \le l \le k$) is updated to $d'_l$ in an EC group; then each parity block in the group must be updated. Each new parity block $p'_i$ ($1 \le i \le m$) can be computed by

$$p'_i = \sum_{j=1,\, j \neq l}^{k} \gamma_{ij} d_j + \gamma_{il} d'_l = p_i + \gamma_{il} (d'_l - d_l) = p_i + \gamma_{il} \, \Delta d_l = p_i + \Delta p_i \qquad (2)$$


B. Wei et al.

where $\Delta d_l$ is the data delta and $\Delta p_i$ is the parity delta. Thus, instead of summing over all data blocks, the new parity blocks can be computed from the old parity blocks and the change of the data blocks. The delta-based write approach computes the updated parity blocks by Eq. (2). Furthermore, Eq. (2) can be generalized to the case where only part of a data block is updated, but a subtlety is that a data update may affect different parts of a parity block depending on the erasure code construction. Because delta-based write leverages the linearity of EC described in Eq. (2), it introduces smaller I/O overhead than re-encoding for small updates. Three typical delta-based write approaches used in modern EC-based storage systems are described as follows.

FO. FO applies in-place updates to both data and parity blocks. It requires an additional disk read of the old parity block at the specific offset.

FL. FL appends all data and parity updates to logs, saving the disk read overhead of parity blocks. That is, after the modified data range and parity deltas are respectively sent to the corresponding data and parity nodes, the storage nodes create logs to store the updates. The logs are merged with the original blocks when the blocks are subsequently read.

PL. PL can be regarded as a hybrid of FO and FL. It saves the disk read overhead of parity blocks and additionally avoids the merging overhead on data blocks introduced in FL. Because data blocks are more likely to be read than parity blocks, merging logs into data blocks can significantly degrade read performance.
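The equivalence of re-encoding (Eq. (1)) and delta-based write (Eq. (2)) rests only on linearity, so it can be sketched with ordinary integers standing in for $GF(2^w)$ elements; the coefficients and values below are arbitrary illustrations, not from the paper:

```python
# Hedged sketch: integers stand in for GF(2^w), since the
# re-encoding/delta-write equivalence only needs linearity.

k = 4                     # data blocks per EC group
gamma = [1, 2, 3, 4]      # encoding coefficients gamma_{i1..ik} for one parity
d = [10, 20, 30, 40]      # current data blocks d_1..d_k

def encode(data):
    """Re-encoding, Eq. (1): p_i = sum_j gamma_ij * d_j."""
    return sum(g * x for g, x in zip(gamma, data))

p = encode(d)             # old parity

# Delta-based write, Eq. (2): update d_l -> d_l' using only the old
# parity and the data delta, never touching the other data blocks.
l, new_val = 2, 35
delta_d = new_val - d[l]          # data delta
delta_p = gamma[l] * delta_d      # parity delta
p_new = p + delta_p

d[l] = new_val
assert p_new == encode(d)         # same parity as full re-encoding
```

In a real system the multiplications and additions are Galois-field operations, but the telescoping structure is identical.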


Fig. 1. Illustration of different parity update approaches.

Figure 1 illustrates the differences among the delta-based approaches, using a (2,1)-code as an example. FO performs in-place writes for both data updates and parity deltas; FL appends both data updates and parity deltas according to the incoming order; PL performs in-place writes for data updates and appends parity deltas. FO introduces extra disk reads of the old parity blocks, in comparison to FL and PL. FL introduces additional disk seeks to the update log for reads, because


the data are scattered in the log. PL updates data in an in-place manner and uses logging to update parities. It can effectively improve update performance without affecting data reads.
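The trade-offs among the three approaches can be summarized as a toy accounting of the disk operations each one issues for a single delta-based partial update. This is our own simplification (one data node, one parity node), not code from the paper:

```python
# Toy per-update disk-operation accounting for FO, FL, and PL, following
# the descriptions above. Our own simplification, for illustration only.

def ops_per_update(mode):
    if mode == "FO":                      # in-place writes everywhere
        return ["read old data", "read old parity",
                "in-place write data", "in-place write parity"]
    if mode == "FL":                      # append everything to logs
        return ["read old data",
                "append data update to log", "append parity delta to log"]
    if mode == "PL":                      # hybrid: in-place data, logged parity
        return ["read old data",
                "in-place write data", "append parity delta to log"]
    raise ValueError(mode)

# Only FO must read the old parity; FL and PL avoid that read, and PL
# additionally keeps data blocks log-free so data reads stay fast.
assert "read old parity" in ops_per_update("FO")
assert "read old parity" not in ops_per_update("FL")
assert "in-place write data" in ops_per_update("PL")
```

Note that all three still read the old data to compute the delta, which is exactly the write-after-read cost that DETOG targets in the next section.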

3

Proposed Approach DETOG

This section presents the proposed approach DETOG, which first classifies files into non-write-only and write-only, then uses different delta-based write approaches to perform updates. In large-scale storage clusters, users are diverse and changing, which results in dynamic features. Machine learning approaches not only accomplish efficient data analysis, but also adapt to dynamic workloads and automatically adjust feature selection. Therefore, we use machine learning to classify files. For non-write-only files, DETOG uses a data-delta-based approach to perform partial updates. For write-only files, DETOG directly appends the new data to the logs of the data nodes and the parity nodes, so as to bypass the read of old data.

3.1

File Classification

In our scenario, both high AUC and low complexity are important. Therefore, we choose a DT as the file classifier. Feature extraction largely determines the effectiveness of prediction algorithms. The features used in the DT are listed as follows.

– File type: file type is strongly related to file classes. For example, a large number of image files that are used to store the checkpoints of applications are frequently appended, but rarely read. However, some document files are frequently read, but rarely written.
– File age: file age is measured by the time interval between the current time and the creation time. Intuitively, newer files are more popular.
– Recency: the difference between the current access time and the last access time.
– File size: file size is related to file classes. In general, the larger a file, the higher the possibility of it being frequently appended.
– Owner: the owner of a file. Some file owners no longer read or write their files after uploading them.
– Recent access requests: the number of access requests in a recently configured interval. In general, a higher number means higher activity of the whole user group.
– Access count: the access count of a file in a day.

For a given feature set {a1, a2, ..., an} that has n features, we choose the optimal feature based on the information gain. In general, the larger the information gain, the better the classification. For example, we first choose ai, the feature with the largest information gain, to construct the target set {ai}, then remove ai from the original feature set. Again, we move the optimal feature {aj} from the


original feature set to the target set. If {ai, aj} is superior to the previous target set {ai}, which means the effect of the new classification is better than that of the old classification, then the iteration is repeated accordingly. Otherwise, the process terminates.

3.2

File Updates

Intuitively, PL is the best choice for non-write-only files. However, in PL the old data on the data nodes have to be read to compute the parity deltas. This leads to a time-consuming write-after-read for each partial update. We use data deltas instead of parity deltas to bypass the computation of parity deltas and the read of old data. Let $p_i^{(r)}$ denote the $r$th update of the parity $p_i$, where $p_i$ corresponds to the data $d_l$ in an EC group. Let $d_l^{(0)}$ and $p_i^{(0)}$ denote the original data of $d_l$ and the original parity of $p_i$, respectively. Assume that $d_l$ is updated $r$ times; then we have $d_l^{(1)}, d_l^{(2)}, \cdots, d_l^{(r)}$ and $p_i^{(1)}, p_i^{(2)}, \cdots, p_i^{(r)}$. According to Eq. (2), we have

$$p_i^{(r)} = p_i^{(0)} - \gamma_{il} d_l^{(0)} + \gamma_{il} d_l^{(1)} - \gamma_{il} d_l^{(1)} + \gamma_{il} d_l^{(2)} - \cdots - \gamma_{il} d_l^{(r-1)} + \gamma_{il} d_l^{(r)} = p_i^{(0)} + \gamma_{il} \left( d_l^{(r)} - d_l^{(0)} \right) \qquad (3)$$

Eq. (3) illustrates that $p_i^{(r)}$ can be computed from $p_i^{(0)}$, $d_l^{(0)}$, and $d_l^{(r)}$. We propose a new update approach, named data-delta based PL (DDBPL), which is built on PL and Eq. (3). Figure 2 shows the procedure of DDBPL for non-write-only files, in terms of partial updates. For each partial update, the client first forwards the new data $d_l^{(r)}$ to the data node, and the data node then forwards $d_l^{(r)}$ to the parity node. The original data value $d_l^{(0)}$ is read in the 1st partial update, whereas it is no longer read in subsequent partial updates. Figure 2(a) shows the procedure of the 1st partial update. When receiving $d_l^{(1)}$, the data node learns by retrieving its log that $d_l^{(0)}$ has not yet been updated, and reads $d_l^{(0)}$ directly. When receiving $d_l^{(1)}$, the parity node appends $d_l^{(1)}$ to its logs, then explicitly requests $d_l^{(0)}$ from the data node. Once it receives the request, the data node appends $d_l^{(1)}$ to its log after sending $d_l^{(0)}$. Once it receives $d_l^{(0)}$, the parity node appends it to its logs and then returns success. Figure 2(b) shows the procedure of the $r$th ($r > 1$) partial update. The data node directly sends $d_l^{(r)}$ to the parity node, then in-place writes $d_l^{(r)}$ into the original file. Meanwhile, the parity node appends $d_l^{(r)}$ to its own logs. In FL, $d_l^{(0)}$ will never be overwritten. Therefore, the data node does not need to send $d_l^{(0)}$ to the parity node. Based on this analysis, we propose a new update approach, named data-delta based FL (DDBFL), which is built on FL and

Fig. 2. Procedure of DDBPL for non-write-only files, in terms of partial updates: (a) procedure of the 1st partial write; (b) procedure of the rth (r > 1) partial write.


Fig. 3. Procedure of DDBFL for write-only ﬁles, in terms of partial updates.

Eq. (3). Figure 3 shows the procedure of DDBFL for write-only files, in terms of partial updates. For each update, the data node and the parity node append $d_l^{(r)}$ ($r \ge 1$) to their own logs.
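The telescoping in Eq. (3) that makes this single-write behavior possible can be sketched concretely. As before, integers stand in for $GF(2^w)$ arithmetic (only linearity matters for the cancellation), and the variable names are our own:

```python
# Hedged sketch of the data-delta update behind DDBPL/DDBFL (Eq. (3)):
# integers stand in for GF(2^w); gamma is one coefficient gamma_il.

gamma = 3
d0 = 20                 # original data d_l^(0)
p0 = 7 + gamma * d0     # original parity p_i^(0) (other terms folded into 7)

# Log behavior: d0 is captured once (write-after-read happens only for
# the first update); every later partial update is a single append.
log = {"d0": d0, "latest": d0}
for d_r in (25, 31, 44):        # three partial updates to the same data
    log["latest"] = d_r         # single write, no read of old data

# Lazy reconstruction of the current parity via Eq. (3):
p_r = p0 + gamma * (log["latest"] - log["d0"])

# Cross-check against r successive delta-based writes (Eq. (2)):
p_check, prev = p0, d0
for d_r in (25, 31, 44):
    p_check += gamma * (d_r - prev)
    prev = d_r
assert p_r == p_check           # the intermediate terms telescope away
```

The parity node thus needs only the original value and the latest value of the data, never the intermediate versions.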

4

Implementation

Based on DETOG, we implement a prototype distributed file system named DETFS. DETFS splits file content into fixed-size data blocks and stores each block at a single data node. DETFS encodes every k consecutive data blocks of a file to generate m parity blocks. The size of a parity block is the same as that of a data block, and each parity block is independently stored on a single parity node. Figure 4 shows the architecture of DETFS. DETFS implements a global master (metadata node) to maintain all file system metadata. The master chooses


the node to host a data block or a parity block. When reading a file, the DETFS client first asks the master for the location information of the blocks of the file. It then contacts the data node that holds the target block for data transfer. When writing a file, the DETFS client first asks the master to choose the nodes to host the data block and the corresponding parity blocks. The file classifier is implemented on the client, and it classifies files into non-write-only and write-only. DETFS uses DDBPL and DDBFL to perform partial updates for non-write-only files and write-only files, respectively. When the utilization of a node (the ratio of used disk space at the node to the total capacity of the node) reaches a threshold, merging compactions are performed asynchronously to shrink the disk usage of the logs.


Fig. 4. Architecture of DETFS.

5

Experiments

Our experiments are conducted on eight machines: four are data nodes, two are parity nodes, one is the client, and the last one is the master. Each machine is configured with two 20-core 2.2 GHz Intel Xeon 4114 CPUs, 128 GB of memory, four 4 TB disks, and the Ubuntu 18.04 LTS operating system. The network is 1-Gigabit Ethernet. The size of each data block or parity block is 64 MB. For an EC(k, m) group, k and m are set to 4 and 2, respectively, the same setting as in [16]. We evaluate our proposed approach DETOG by comparing it with the following four state-of-the-art approaches: 1) FL [2]; 2) PL [6]; 3) PARIX [16]; and 4) three-way replication (R3) [4]. R3 takes in-place writes for data updates. All approaches are implemented in DETFS. We evaluate the performance of all approaches using the NFS trace set [1]. We randomly sample the trace data of each trace in the set using the following steps: 1) extracting the distinct files to construct the file set S; 2) constructing the file set S′ by randomly sampling S at a 1:100 ratio; and 3) extracting the records whose file id belongs to S′ from the original data set, so as to construct a new trace sequence ordered by timestamp.
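The three sampling steps can be sketched as follows; the function name, record layout, and seed are our own illustration, not the paper's tooling:

```python
import random

def sample_trace(records, ratio=100, seed=0):
    """Steps 1)-3) above: collect the distinct-file set S, draw S' from S
    at 1:ratio, and keep only records whose file id is in S', ordered by
    timestamp. `records` are (timestamp, file_id) pairs."""
    files = sorted({fid for _, fid in records})              # step 1: S
    n_sampled = max(1, len(files) // ratio)
    sampled = set(random.Random(seed).sample(files, n_sampled))  # step 2: S'
    return sorted(r for r in records if r[1] in sampled)     # step 3

# Tiny demo: 200 files with two records each -> two sampled files.
records = [(2 * i + j, "f%03d" % i) for i in range(200) for j in (0, 1)]
subset = sample_trace(records)
assert len(subset) == 4 and subset == sorted(subset)
```

Seeding the generator keeps the sample reproducible across runs, which matters when the same sampled sequence is replayed against all five approaches.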


Fig. 5. I/O throughput for all approaches when replaying NFS traces: (a) dasna_w1 (13.7% update; 7.4% write-only files); (b) lair62_w2 (35.6% update; 12.2% write-only files); (c) home03_w1 (51.6% update; 38.2% write-only files); (d) lair62b_w1 (63.7% update; 47.16% write-only files); (e) home04_w3 (87.6% update; 66.1% write-only files); (f) dasna2_w2 (95.2% update; 89.4% write-only files).

5.1

Trace Evaluations

We choose six representative traces with different percentages of overwrites and write-only files to perform the performance evaluations. Merging compactions are triggered whenever the utilization of a node (the ratio of used space at the node to the total capacity of the node) is greater than the threshold value. Figure 5 shows the I/O throughput for all approaches when replaying the NFS traces. R3 always achieves the highest I/O throughput for all selected traces, because it does not need to perform additional reads, parity computation, and data compaction. DETOG performs better than FL, PL, and PARIX for all selected traces, particularly for the traces with high percentages of updates and write-only files. For example, Fig. 5(a) shows that DETOG improves the I/O throughput by 29.55%, 18.71%, and 9.88% compared with FL, PL, and PARIX, respectively, when the trace dasna_w1, with an update percentage of 13.7% and a write-only file percentage of 7.4%, is replayed; whereas Fig. 5(f) shows that DETOG improves the I/O throughput by 51.41%, 75.77%, and 62.69% compared with FL, PL, and PARIX, respectively, when the trace dasna2_w2, with an update percentage of 95.2% and a write-only file percentage of 89.4%, is replayed. This behavior occurs because DETOG performs a single write for the rth (r ≥ 2) partial update on the same data for non-write-only files, and a single write for each partial update for write-only files. The higher the overwrite percentage and the write-only file percentage of a trace, the larger the advantage of DETOG.


5.2

Storage Overhead

Figure 6 shows the storage overhead of the different approaches when replaying the NFS traces. R3 always has the highest storage overhead, namely 3× for all traces. This is because R3 keeps three replicas of every data block, and employs in-place writes instead of log-based writes to perform updates. The storage overhead of FL, PL, PARIX, and DETOG is much lower than that of R3. This demonstrates that EC can significantly reduce storage overhead.


Fig. 6. Storage overhead for diﬀerent approaches replaying NFS traces.

PL always has the lowest storage overhead, because it updates data blocks using in-place writes. The storage overhead of PARIX is greater than that of PL, because the original data have to be stored on the parity nodes. The storage overhead of DETOG is greater than that of PARIX, because DETOG performs a logging-based update for the first update on the same data for non-write-only files, and for each update for write-only files. The storage overhead of FL is greater than that of DETOG, because FL appends all data and parity updates to logs.

6

Conclusion

We have proposed DETOG, an adaptive update approach to support fast partial updates for erasure-coded storage systems. DETOG classifies files into non-write-only and write-only. For non-write-only files, DETOG uses data deltas rather than parity deltas to bypass the read of old data and the computation of parity deltas. For write-only files, DETOG directly appends the new data to the logs of data nodes and parity nodes, so as to bypass the read of old data. Extensive experimental results show that DETOG successfully improves the I/O throughput compared with the state-of-the-art.

Acknowledgment. This work was supported in part by the National Natural Science Foundation of China under Grant No. 62072118, the China Postdoctoral Science Foundation under Grant No. 2021M690733, and the Key-Area Research and Development Program of Guangdong Province under Grant 2019B010121001.


References

1. Harvard NFS traces. http://iotta.snia.org/traces/nfs/3378. Accessed August 2021
2. Aguilera, M.K., Janakiraman, R., Xu, L.: Using erasure codes efficiently for storage in a distributed system. In: 2005 International Conference on Dependable Systems and Networks, pp. 336–345 (2005)
3. Chan, J.C., Ding, Q., Lee, P.P., Chan, H.H.: Parity logging with reserved space: towards efficient updates and recovery in erasure-coded clustered storage. In: 12th USENIX Conference on File and Storage Technologies, pp. 163–176 (2014)
4. Ghemawat, S., Gobioff, H., Leung, S.T.: The Google file system. In: Proceedings of the 19th ACM Symposium on Operating Systems Principles, pp. 29–43 (2003)
5. Hu, Y., Cheng, L., Yao, Q., Lee, P.P., Wang, W., Chen, W.: Exploiting combined locality for wide-stripe erasure coding in distributed storage. In: 19th USENIX Conference on File and Storage Technologies, pp. 233–248 (2021)
6. Jin, C., Feng, D., Jiang, H., Tian, L.: RAID6L: a log-assisted RAID6 storage architecture with improved write performance. In: 2011 IEEE 27th Symposium on Mass Storage Systems and Technologies, pp. 1–6. IEEE (2011)
7. Kadekodi, S., Rashmi, K., Ganger, G.R.: Cluster storage systems gotta have HeART: improving storage efficiency by exploiting disk-reliability heterogeneity. In: 17th USENIX Conference on File and Storage Technologies, pp. 345–358 (2019)
8. Myles, A.J., Feudale, R.N., Liu, Y., Woody, N.A., Brown, S.D.: An introduction to decision tree modeling. J. Chemom. 18(6), 275–285 (2004)
9. Plank, J.S., Greenan, K.M., Miller, E.L.: Screaming fast Galois field arithmetic using Intel SIMD instructions. In: 11th USENIX Conference on File and Storage Technologies, pp. 299–306 (2013)
10. Shen, Z., Lee, P.P.: Cross-rack-aware updates in erasure-coded data centers: design and evaluation. IEEE Trans. Parallel Distrib. Syst. 31(10), 2315–2328 (2020)
11. Silberstein, M., Ganesh, L., Wang, Y., Alvisi, L., Dahlin, M.: Lazy means smart: reducing repair bandwidth costs in erasure-coded distributed storage. In: Proceedings of International Conference on Systems and Storage, pp. 1–7 (2014)
12. Subedi, P., Huang, P., Young, B., He, X.: FINGER: a novel erasure coding scheme using fine granularity blocks to improve Hadoop write and update performance. In: 2015 IEEE International Conference on Networking, Architecture and Storage, pp. 255–264. IEEE (2015)
13. Xia, M., Saxena, M., Blaum, M., Pease, D.A.: A tale of two erasure codes in HDFS. In: 13th USENIX Conference on File and Storage Technologies, pp. 213–226 (2015)
14. Xu, B., Huang, J., Qin, X., Cao, Q.: Traffic-aware erasure-coded archival schemes for in-memory stores. IEEE Trans. Parallel Distrib. Syst. 31(12), 2938–2953 (2020)
15. Ye, L., Feng, D., Hu, Y., Wei, X.: Hybrid codes: flexible erasure codes with optimized recovery performance. ACM Trans. Storage 16(4), 1–26 (2020)
16. Zhang, Y., Li, H., Liu, S., Xu, J., Xue, G.: PBS: an efficient erasure-coded block storage system based on speculative partial writes. ACM Trans. Storage 16(1), 1–25 (2020)

Matching Program Implementations and Heterogeneous Computing Systems Martin Sandrieser(B) and Siegfried Benkner Research Group Scientiﬁc Computing, Faculty of Computer Science, University of Vienna, Vienna, Austria {martin.sandrieser,siegfried.benkner}@univie.ac.at

Abstract. High performance computing (HPC) systems have become highly parallel aggregations of heterogeneous system elements. Diﬀerent kinds of processors, memory regions, interconnects and software resources constitute the modern HPC computing platform. This makes software development and eﬃcient program execution a challenging task. Previously, we have developed a platform description framework for describing multiple aspects of computing platforms. It enables tools and users to better cope with the complexities of heterogeneous platforms in a programming model and system independent way. In this paper we present how our platform model can be used to describe program implementation variants that utilize diﬀerent parallel programming models. We show that by matching platform models of program implementations to descriptions of a concrete heterogeneous system we can increase overall resource utilization. In addition, we show that our model featuring control relationships brings signiﬁcant performance gains for ﬁnding platform patterns within a commonly used heterogeneous compute cluster conﬁguration.

Keywords: Modeling

1

· Platform · Heterogeneous computing

Introduction

Software development and efficient program execution for highly parallel computing systems have always been challenging. With the spread of heterogeneous computing paradigms, these challenges have been aggravated. Users and tools now have to cope with different kinds of hardware resources and diverse programming environments available within a single system. Achieving high computational performance while maintaining productivity is very demanding. Therefore, methods and tools are required to better support programming of heterogeneous systems. Previously we have developed an XML-based platform description language (PDL) as well as a generic platform model [13]. The main goal of these platform description facilities is to enable programmers to describe – in a machine-readable way – hardware and software properties that are relevant for application tuning, tool support, and portability of software. They provide a holistic

© Springer Nature Switzerland AG 2022. H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 198–209, 2022. https://doi.org/10.1007/978-3-030-96772-7_19


view of the computing platform, which we define as a set of hardware and software resources. Our platform description facilities also enable describing generic platform patterns. Platform patterns describe how processing and memory resources interact. Such interactions are usually only defined implicitly by programming models and are not available in a machine-readable form. Multiple programming models may adhere to high-level platform patterns. In addition to high-level interactions of resources, our modeling approach also allows capturing low-level hardware and software information such as memory sizes, locality, and CPU properties. By supporting both aspects, high-level resource interactions and low-level entity properties, we aim at providing descriptor facilities that are usable for a variety of use-cases at different layers of abstraction. In this paper we make the following contributions:

– We introduce our platform description framework, which is based on a generic platform model.
– We use our platform modeling framework to model characteristics of program implementation variants developed with MPI [11], OpenMP [10], Nvidia CUDA [12] and AMD HIP [7].
– We show that by matching platform descriptors of program implementation variants and of a target system we can increase resource utilization. Utilizing our modeling framework, we generate optimized program execution configurations that improve benchmark application performance by up to 2.9x.
– We show that our hierarchical modeling approach based on control relationships results in more efficient graph search for finding platform patterns compared to an approach that does not use a hierarchical model.

This paper is structured as follows. In Sect. 2 we present context and related work. In Sect. 3 we introduce our platform description framework. Sect. 4 shows how our approach is used to improve resource utilization on a highly heterogeneous system. In Sect. 5 we evaluate our modeling approach with respect to the applicability of a graph algorithm. Section 6 summarizes our findings.

2

Context and Related Work

Using higher level models that capture aspects of the computing environment is a common method in all software development domains. Especially in the context of high performance computing (HPC) we observe a wide variety of platform abstractions to improve programmer productivity. In many cases utilized abstractions focus on locality information and are tightly coupled with speciﬁc programming languages or runtime systems. A prominent example is the X10 [4] programming language. X10 introduces the concept of place which describes a locality boundary for data and computational tasks. How such places are mapped to concrete resources of an execution environment can be inﬂuenced externally with low programmer interaction.


M. Sandrieser and S. Benkner

This methodology greatly increases programmer productivity and code portability. Therefore, there exist multiple similar approaches that improve portability through adaptable locality abstractions, e.g., Chapel [3], HPX [8], Charm++ [9]. The memory hierarchy is a key factor for achieving high computational performance. Hence, the projects Sequoia [6], HPT [16] and Legion [1] utilize tree-based models of a system's memory organization. These projects also use changeable mappings of abstract descriptors to concrete hardware resources to improve code portability. The previously mentioned approaches all combine abstract platform modeling and mapping with specific programming languages or runtime systems. Our approach does not include a specific programming environment. In fact, in addition to locality information, we aim at describing the properties and resource interactions of the programming approach itself. This serves to support the interoperability of software in heterogeneous environments where multiple programming models are combined within one system or application. In this paper, we use our modeling approach to support the selection of program implementation variants. Implementation variants achieve the same computational task but are implemented in different flavors, often with different programming models and resource requirements. Such a programming methodology is common in heterogeneous environments where programs need to be adapted to a diverse set of hardware resources. However, in many cases the resource requirements of implementation variants are only defined implicitly with string-based identifiers. Our approach aims at providing more detailed, machine-processable structural information on how an implementation variant utilizes resources.

3

Platform Descriptors

Describing relevant properties of the computing platform in a structured and machine-readable way is a challenging task. Description facilities have to be generic and adaptable to support a wide variety of use-cases ranging from high-level platform patterns to low-level hardware-specific information. Our platform descriptor facilities utilize a generic platform model. This model is based on a hierarchical aggregation of processing units, memory regions and interconnect entities. We represent this model as an undirected graph with different node and edge types. The nodes in this graph are of type processing unit (PU) or memory region (MR). Edges between nodes represent control relationships between processing units, or interconnects. An interconnect describes communication and data transfer within the platform. In addition, we define a control relationship as the possibility of offloading computational tasks from one processing unit to another [13]. Due to this hierarchical control relation between PUs, we further introduce three different PU types: Master, Hybrid and Worker. Master PUs may delegate work to other processing units, and at least one master must exist within a platform. Worker PUs execute work delegated by other PUs but cannot offload work. Hybrid PUs may act as both master and worker. Figure 1 depicts an example platform graph with 5 processing units and a single


Fig. 1. Example platform graph with master (M), hybrid (H) and worker (W0, W1, W2) processing units. One memory region (MR) is accessible for all processing units. This memory interconnect is depicted as dashed edges. The control edges are shown as solid lines.

shared memory region. The PUs form a hierarchy with one intermediate level (H). The shared memory access is modeled by interconnect edges between the PUs and the memory region (MR).

3.1

Programming Models

What distinguishes our approach from other approaches that focus on hardware description (e.g., [2]) is the capability to express logical relationships between system entities which are usually deﬁned implicitly by the programming environment. Hence, in our approach multiple platform descriptions for the same physical hardware may exist depending on how system resources are utilized. Moreover, platform descriptions may combine multiple platform models within one graph. For example, this situation arises for hybrid programs that combine multiple programming models. This is a common scenario for clusters of sharedmemory machines (e.g., MPI+OpenMP) or machines equipped with accelerators (e.g., OpenMP+OpenCL). 3.2

3.2 Abstraction Levels

To be applicable to a wide variety of use cases, platform description facilities should enable system modeling at different levels of abstraction. Our model has been designed to support coarse-grained and fine-grained modeling of software and hardware characteristics. Therefore, in addition to the structural graph-based model, we support the attachment of arbitrary descriptor properties to all system entities. This support has been realized via a generic key-value scheme. We distinguish between the following descriptor abstraction levels:

202

M. Sandrieser and S. Benkner

– High-Level: Generic platform patterns which capture entity interactions found in multiple programming environments. We have pre-specified generic patterns often found in the HPC domain such as Threading, Message-Passing or Accelerator (see Fig. 2).
– Mid-Level: Platform descriptors that may comprise abstract higher-level patterns but make further refinements regarding entity quantities and their connectivity. For example, a high-level thread pattern might be present at several sub-parts of a complex platform that features multiple shared memory regions (i.e., a cluster of shared-memory machines).
– Low-Level: Platform descriptors that include mappings of the abstract platform entities (processing units, memory regions and interconnects) to concrete hardware and software resources of a computing system.

A major motivation for our approach is that the same modeling facilities can be used at all levels of the computing platform. This aims at improving the interoperability between programming approaches, supporting portability and performance optimization.
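The generic key-value scheme for descriptor properties might be used roughly as follows. This is a Python sketch; the entity names and property keys are purely illustrative and not the framework's actual vocabulary:

```python
# Descriptor properties attached to platform entities via a generic
# key-value scheme; the same mechanism serves all abstraction levels.
properties = {}  # entity name -> {key: value}

def set_prop(entity: str, key: str, value: str):
    properties.setdefault(entity, {})[key] = value

# High-level: tag a sub-graph with a generic platform pattern.
set_prop("node0", "pattern", "Threading")
# Low-level: map an abstract PU to a concrete hardware resource.
set_prop("PU3", "hw.core-id", "3")
set_prop("PU3", "hw.freq-ghz", "2.1")

print(properties["PU3"]["hw.core-id"])  # prints 3
```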

3.3 Programming Support

We have implemented our platform modeling framework as a C++ programming library. This library supports the import and export of platform descriptors to/from an XML-based storage format. In addition, it provides functionality for working with high-level platform models, for storing and querying entity properties, and for automatically creating platform descriptions for concrete target systems.
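An XML round trip for such descriptors might look roughly like this. This is a Python sketch with an invented minimal schema; the library's actual XML storage format is not specified in the paper:

```python
import xml.etree.ElementTree as ET

def export_platform(pus, mrs, edges) -> str:
    """Serialize a platform descriptor to an XML string (illustrative schema)."""
    root = ET.Element("platform")
    for name, role in pus:
        ET.SubElement(root, "pu", name=name, role=role)
    for name in mrs:
        ET.SubElement(root, "mr", name=name)
    for kind, a, b in edges:
        ET.SubElement(root, "edge", kind=kind, src=a, dst=b)
    return ET.tostring(root, encoding="unicode")

def import_platform(xml_text: str):
    """Parse a platform descriptor back into (pus, mrs, edges)."""
    root = ET.fromstring(xml_text)
    pus = [(e.get("name"), e.get("role")) for e in root.iter("pu")]
    mrs = [e.get("name") for e in root.iter("mr")]
    edges = [(e.get("kind"), e.get("src"), e.get("dst")) for e in root.iter("edge")]
    return pus, mrs, edges

xml_text = export_platform([("M", "master"), ("W0", "worker")], ["MR"],
                           [("control", "M", "W0"), ("interconnect", "W0", "MR")])
# The round trip reproduces the original descriptor.
assert import_platform(xml_text) == ([("M", "master"), ("W0", "worker")], ["MR"],
                                     [("control", "M", "W0"), ("interconnect", "W0", "MR")])
```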

4 Case Study: Improving System Utilization

In this section we investigate a common performance tuning problem occurring in heterogeneous systems and show how our approach can improve application throughput and resource utilization.

Problem: We consider a highly heterogeneous compute cluster. Each of the compute nodes has different hardware characteristics, and therefore efficient program execution requires the utilization of different programming models and/or configuration parameters on each machine. Finding good configurations usually requires the manual examination of program implementation variants (i.e., one for each available programming model) and low-level hardware details. This process is often time consuming and requires a high degree of expert knowledge. Our approach provides means to describe programming model characteristics as well as low-level platform details. By comparing descriptors of a program's required platform with descriptors of the concrete execution environment, efficient program mapping configurations can be created automatically. This relieves users of time-consuming application tuning steps and improves portability.


In what follows we show a concrete example for a highly challenging heterogeneous system configuration. We generate program execution configurations that can utilize all available resources of the heterogeneous platform and therefore increase application throughput.

Execution Environment: The heterogeneous compute cluster Exa is comprised of 4 compute nodes (exa01-04) that are connected via 4X QDR InfiniBand and an Ethernet network. Exa01 features 4 Intel Xeon 6138 2.0 GHz CPUs (4 × 20 cores) with 4 NUMA domains and 192 GB RAM. Exa02 is comprised of 2 AMD Epyc 7501 2.0 GHz CPUs (2 × 32 cores) with 8 NUMA domains and 96 GB RAM. The nodes exa03 and exa04 each feature 2 Intel Xeon 6130 2.1 GHz CPUs (2 × 16 cores) with 2 NUMA domains and 96 GB RAM. Exa03 is further equipped with one Nvidia Tesla V100 32 GB GPU. The node exa04 features one AMD Radeon Instinct MI25 16 GB GPU. With its complex memory configurations, different kinds of processors and GPU accelerators, this system poses great challenges for executing applications that aim at using all available resources.

We automatically created a platform description for the whole system in the following way. As input we use hardware locality information gathered from the hwloc [2] library, CMake-based library discovery and Nvidia/AMD GPU management libraries. For each NUMA domain we model one memory region (MR). Per NUMA memory region we then use one CPU core as master processing unit and the remaining cores as worker entities. We insert control relationship edges between master and worker PUs. Processing units and memory regions are connected via shared memory interconnect edges. Those edges also store relative distances between PUs and NUMA domain MRs as edge properties. In addition, the GPUs in exa03 and exa04 are modeled as worker PUs with one distinct memory region each. One CPU core acts as master for each GPU worker. Subsequently, we insert interconnect edges between the GPU memory region and the related master and worker PUs.
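The per-NUMA-domain construction just described can be sketched as follows. This is a Python sketch; the input mapping is a hypothetical stand-in for the hwloc-derived locality information, not the framework's actual interface:

```python
def build_node_description(numa_domains):
    """For each NUMA domain, model one MR, one master PU (the first core)
    and worker PUs for the remaining cores, plus control and shared-memory
    interconnect edges.  `numa_domains` maps domain id -> list of core ids."""
    masters, workers, mrs, edges = [], [], [], []
    for dom, cores in numa_domains.items():
        mr = f"MR{dom}"
        mrs.append(mr)
        master, *rest = [f"PU{c}" for c in cores]
        masters.append(master)
        workers += rest
        for w in rest:
            edges.append(("control", master, w))   # master delegates to workers
        for pu in [master, *rest]:
            edges.append(("shmIC", pu, mr))        # shared-memory interconnect
    return masters, workers, mrs, edges

# Two NUMA domains with 4 cores each (a toy stand-in, not a real Exa node):
masters, workers, mrs, edges = build_node_description({0: [0, 1, 2, 3],
                                                       1: [4, 5, 6, 7]})
print(len(masters), len(workers), len(mrs))  # prints: 2 6 2
```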
To express the availability of a message-passing library on the target system, we insert message interconnect edges between master processing units.

Fig. 2. Platform patterns used in program implementation variant descriptions: (a) Thread, (b) Accelerator, (c) Message.

Application: We investigate the execution of the XSBench Monte Carlo neutron transport application benchmark [15]. This application is available in a variety of different programming models and is therefore suitable to run on all available resources of the Exa machine. We use implementations that utilize the following programming models: OpenMP, CUDA, HIP and MPI. Each program implementation variant is compiled into a separate binary executable. In addition to the intra-process parallelism of the application, multiple processes can be combined via an MPI [11] coordination layer. This may result in complex execution configurations featuring different programming models (e.g., MPI+OpenMP+X) within one application run. To utilize our approach for generating suitable mappings to the target environment, we model each application variant. For the OpenMP programming model, we create a model that utilizes a Thread pattern. As depicted in Fig. 2a, this pattern has one master PU and one worker PU connected to a shared memory region. For the CUDA and HIP implementation variants we model an Accelerator pattern. Figure 2b shows this pattern with one master PU and one worker PU with an additional distinct worker memory region (AccMR). The worker refers to a CUDA/HIP device. To distinguish between the CUDA and HIP programming environments, we further annotate the accelerator entities with key/value properties. For implementation variants featuring message-passing, we use a Message pattern (Fig. 2c) consisting of a message interconnect between master processing units.

Mapping: We use an execution configuration generator that produces MPI rankfiles equipped with additional information on executable filenames and thread counts. To find suitable mappings, we search for the platform patterns Message, Thread and Accelerator defined by the requirement descriptors that model the application implementation variants. We search for these patterns within the concrete execution environment description of the Exa system. Since all descriptors utilize the graph-based model described in Sect. 3, we can rely on the widespread VF2 [5] (sub-)graph isomorphism algorithm. The generator records all concrete system entities from the target description that are capable of forming a specific platform pattern. It maps message interconnect participants to MPI ranks, worker PUs of the thread model to thread groups, and worker entities of the accelerator pattern to GPUs.

Results: We have conducted experiments on the Exa system with four reference configurations: OpenMP (OMP), CUDA, HIP and MPI+OMP. The reference configurations were executed on exa01 (OMP), exa03 (CUDA), exa04 (HIP) and nodes exa01-04 (MPI+OMP). Reference configurations were run with default settings, with resource selection as specified by the original application. We compare the references against two auto-generated execution configurations created by our approach. These versions utilize all compute nodes exa01-04. All programs were run on CentOS 7.8.2003, kernel 3.10, and were compiled with GCC 8.3.0 with -O3 flags. For the GPUs we used NVCC/CUDA 11.3 and HIP 4.1.0 with Clang 12. In addition, OpenMPI 4.0.5 with UCX 1.8.1 was used. All distributed (MPI) application versions execute the full amount of work (no work sharing across ranks) and use MPI for coordination and performance data reporting. Hence, we show total lookups measured by all ranks. We use XSBench V20 with benchmark size large and event-based simulation. For the GPU-based variants we include device data-transfer in the timing. All results are mean values gathered from 10 repeated application runs.

Fig. 3. XSBench performance for different program execution configurations. Using our modeling framework, we can automatically generate execution configurations that utilize all available resources and improve performance.

Figure 3 shows the application performance of the different execution configurations. We observe that, since the threading pattern in the target platform description is built around NUMA domains, our generated MPI+OMP configuration outperforms the reference version, which does not consider the NUMA organization. The highest overall performance is achieved by our generated configuration MPI+OMP+CUDA+HIP, which uses all available program variants. This version considers NUMA-based mapping and selects the CUDA and HIP implementations for compute nodes exa03 and exa04.
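The pattern search at the heart of the mapping step can be illustrated with a brute-force typed subgraph matcher. This is a naive Python stand-in for VF2 (a real implementation should use a proper (sub-)graph isomorphism library); node and edge type names are illustrative:

```python
from itertools import permutations

def find_patterns(p_nodes, p_edges, t_nodes, t_edges):
    """Brute-force typed subgraph search (illustrative stand-in for VF2).
    Nodes: {name: type}; edges: {(a, b): edge_type}, treated as undirected."""
    t_edge = {frozenset(k): v for k, v in t_edges.items()}
    matches = []
    p_names = list(p_nodes)
    for cand in permutations(t_nodes, len(p_names)):
        m = dict(zip(p_names, cand))
        if any(p_nodes[p] != t_nodes[m[p]] for p in p_names):    # vertex equivalence
            continue
        if all(t_edge.get(frozenset((m[a], m[b]))) == et          # edge equivalence
               for (a, b), et in p_edges.items()):
            matches.append(m)
    return matches

# Thread pattern: master and worker sharing one memory region (cf. Fig. 2a).
pattern_nodes = {"M": "master", "W": "worker", "MR": "memory"}
pattern_edges = {("M", "W"): "control", ("M", "MR"): "shmIC", ("W", "MR"): "shmIC"}
# A toy target: one master controlling two workers, all attached to one MR.
target_nodes = {"pu0": "master", "pu1": "worker", "pu2": "worker", "mr0": "memory"}
target_edges = {("pu0", "pu1"): "control", ("pu0", "pu2"): "control",
                ("pu0", "mr0"): "shmIC", ("pu1", "mr0"): "shmIC",
                ("pu2", "mr0"): "shmIC"}
hits = find_patterns(pattern_nodes, pattern_edges, target_nodes, target_edges)
print(len(hits))  # prints 2: the worker slot can map to pu1 or pu2
```

VF2 avoids the factorial candidate enumeration used here by growing partial mappings incrementally, which is what makes the search feasible on graphs with thousands of nodes.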

5 Model Performance Evaluation

Introducing a platform model that includes the logical relationships of how processing units are utilized is an uncommon approach. Many existing projects follow a platform model that is predetermined by the physical hardware organization. This usually results in tree-like hierarchies that often resemble a system's memory hierarchy. Our approach also provides memory locality information, but achieves this through a more generic graph structure with memory-associated interconnect edges. To evaluate our platform model, we have performed experimental evaluations aimed at answering the following question: does the hierarchical platform model featuring control relationships and different processing unit classes bring an advantage for finding platform patterns?


Therefore, we evaluate the search performance of the VF2 [5] algorithm for finding platform patterns. We compare our hierarchical platform model against a reference modeling approach that does not utilize control relationships and different processing unit classes.

Experimental Setup: Our implementation uses the Boost Graph Library (BGL) [14] from Boost version 1.74.0. For matching platform patterns in larger platform graphs, we use the VF2 [5] (sub-)graph isomorphism algorithm implementation from BGL. All examples have been compiled with GCC 8.3.0 and the -O3 optimization flag. We search for small platform pattern graphs within larger platform graphs. These larger graphs resemble a commonly used HPC system configuration based on shared-memory multi-processor systems with multiple NUMA domains. This model represents a commonly used multi-socket compute node where each processor features multiple CPU cores. Compute nodes are further connected into a larger cluster via a networking fabric. For all experiments, we have modeled a generic cluster of shared-memory compute nodes in the following way: 16 processing units (PUs) have two shared-memory interconnect relations with two distinct memory regions (MRs). One MR is local and the other MR is remote to each group of PUs. This distance is stored as an edge property of the interconnect in the graph representation. Each of the compute nodes in the system features 2 × 16 general-purpose PUs that represent CPU cores and two distinct MRs representing NUMA domains. For experiment configurations that feature accelerators, we assume that 10% of the compute nodes are equipped with one accelerator per NUMA domain. We evaluate the platform pattern search performance for platform graphs representing systems with up to 100 compute nodes.
We consider a common HPC use-case of hybrid parallel programs that use message-passing for communication and combine the message-passing model with another parallel programming environment (i.e., MPI+X). For the message-passing layer, we model a topology of participating processes in a Cartesian grid. Hence, we insert message interconnect edges into the platform graph in such a way that a 3-dimensional processing unit topology is constructed. The resulting platform graph structure resembles a commonly used 3D-torus interconnect network topology. In our example, this kind of messaging interconnect exists between general-purpose master processing units.

We have conducted the experiments on a server machine running CentOS 7.8.2003, kernel 3.10, equipped with two Intel Xeon Gold 6130 16-core 2.10 GHz processors and 96 GB RAM. All results are based on the mean of 10 repeated runs. Error bars in plots show the 95% confidence interval. For the pattern search, we check vertex equivalence by comparing vertex types (Master, Hybrid, Worker, Memory). For edge equivalence we compare edge types (Control, Interconnect) and sub-types (e.g., Message, Shared-Memory).

Thread: As described in Sect. 4, this platform pattern features one master PU and one worker PU which share a memory region (MR). Since the concrete system under investigation features 2 × 16 CPU cores and two NUMA domains, we model one master PU and 15 worker PUs which have access to two MRs, one for each NUMA domain. The master PUs are connected via messaging interconnect edges in a 3D-torus fashion.

Fig. 4. Pattern search performance with and without control relationships.

For the alternative modeling approach used as comparison, we omit control relationships and therefore also do not use the worker PUs. However, to still capture a threading relation between processing units, we insert message interconnects between one PU that takes a coordinative role and the remaining 15 PUs in the same NUMA domain. All the PUs are modeled as master PUs, but only the coordinative master participates in the inter-NUMA 3D-torus messaging interconnect. Since there is no further differentiation between message interconnects, there is a semantic gap in the reference model. The reference also matches PU interactions that span remote NUMA domains. Hence, we observe that the control relationships provide more utility for locality modeling. We observe that the control relationship approach for modeling the thread patterns brings significant performance advantages. As shown in Fig. 4, for all cluster sizes of up to 100 compute nodes featuring 3200 CPU cores, the mean time to find all thread patterns with control relationships is well below 40 ms. For the reference thread pattern that omits control relationships, mean search times are higher in all cases.

Accelerator: Due to the rise of heterogeneous computing, this pattern has become more and more important in recent years. The accelerator pattern models the offloading of computational tasks to often specialized compute units that feature distinct memory regions. We model this pattern by introducing a control relationship between a master and a worker PU. As the reference modeling approach that does not use control relationships, we use a one-sided model. The structural difference of this model is that the control relationship is again replaced by a message interconnect and no worker PUs are used. Similar to the thread model, we use the coordinating master PU for the inter-NUMA torus messaging interconnect. The experimental results show that finding the reference pattern without control relationships exhibits much higher performance variations and lower performance compared to when control relationships are used. As shown in Fig. 4, the average search times for the reference can reach around 250 ms whereas our approach does not go beyond 50 ms.
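The 3D-torus message interconnect between master PUs used in these experiment graphs can be generated as sketched below (a Python sketch; the dimensions are illustrative):

```python
from itertools import product

def torus_3d_edges(nx, ny, nz):
    """Message interconnect edges forming a 3-dimensional torus:
    each position is linked to its +1 neighbour in every dimension,
    with wrap-around at the boundaries."""
    edges = set()
    for x, y, z in product(range(nx), range(ny), range(nz)):
        for dx, dy, dz in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
            nbr = ((x + dx) % nx, (y + dy) % ny, (z + dz) % nz)
            edges.add(frozenset({(x, y, z), nbr}))  # undirected edge
    return edges

# A 4x4x4 torus of master PUs: each node has degree 6, so |E| = 64 * 6 / 2.
edges = torus_3d_edges(4, 4, 4)
print(len(edges))  # prints 192
```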

6 Conclusion

Utilizing heterogeneous computing systems is a challenging task. Users have to consider a diverse set of hardware and software resources. This makes software development and application tuning time-consuming and error-prone. Methods and tools are needed that improve productivity and performance. In this paper we utilized a platform description framework that supports tools and users in better coping with heterogeneous systems. Our approach is based on a hierarchical platform model that captures major characteristics of hardware and software in a structured way. In addition to low-level system properties, our framework can describe high-level structural platform patterns which are usually implicitly defined by the programming environment. We have shown that our approach can support the automatic generation of optimized program execution configurations in a highly heterogeneous environment. By automatically combining different program implementation variants, each developed with a different programming model, we could increase the resource utilization of a highly heterogeneous cluster. We achieved this by describing software implementations as well as the target execution environment with the same platform modeling framework. We then used a common graph algorithm to determine which implementation variant should be mapped to which sub-parts of the target machine. This approach relieves users of time-consuming optimization tasks that usually require expert knowledge about software implementations and the hardware execution environment. Using our approach, we could improve the performance of a hybrid benchmark application by up to 2.9×. In addition, we showed the applicability of our approach by modeling a 100-node heterogeneous compute cluster with a complex NUMA memory setup, 3200 CPU cores and GPU accelerators. We showed that our model is well suited for finding high-level platform patterns in the 100-node cluster model.
We also showed that our hierarchical platform model brings significant search performance improvements compared to a reference approach that omits hierarchical control relationships between processing units. In the future we will address the automatic generation of platform descriptors from program execution runs. In addition, we will investigate the use of platform models for task-based runtime systems to facilitate dynamic adaptation of programs.

References

1. Bauer, M., Treichler, S., Slaughter, E., Aiken, A.: Legion: expressing locality and independence with logical regions. In: International Conference on High Performance Computing, Networking, Storage and Analysis, SC 2012, pp. 1–11. IEEE (2012)
2. Broquedis, F., et al.: hwloc: a generic framework for managing hardware affinities in HPC applications. In: 2010 18th Euromicro Conference on Parallel, Distributed and Network-Based Processing, pp. 180–186 (February 2010). ISSN 2377-5750. https://doi.org/10.1109/PDP.2010.67
3. Chamberlain, B., Callahan, D., Zima, H.: Parallel programmability and the Chapel language. Int. J. High Perform. Comput. Appl. 21(3), 291–312 (2007). https://doi.org/10.1177/1094342007078442
4. Charles, P., et al.: X10: an object-oriented approach to non-uniform cluster computing. ACM SIGPLAN Not. 40(10), 519–538 (2005). https://doi.org/10.1145/1103845.1094852
5. Cordella, L.P., Foggia, P., Sansone, C., Vento, M.: A (sub)graph isomorphism algorithm for matching large graphs. IEEE Trans. Pattern Anal. Mach. Intell. 26(10), 1367–1372 (2004). https://doi.org/10.1109/TPAMI.2004.75
6. Fatahalian, K., et al.: Sequoia: programming the memory hierarchy. In: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC 2006, p. 83-es (November 2006). https://doi.org/10.1109/SC.2006.55
7. HIP: HIP Programming Guide - ROCm Documentation 1.0.0 documentation. https://rocmdocs.amd.com/en/latest/Programming_Guides/HIP-GUIDE.html
8. Kaiser, H., Heller, T., Adelstein-Lelbach, B., Serio, A., Fey, D.: HPX: a task based programming model in a global address space. In: Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, PGAS 2014, pp. 1–11. ACM, New York (October 2014)
9. Kalé, L., Krishnan, S.: CHARM++: a portable concurrent object oriented system based on C++. In: Paepcke, A. (ed.) Proceedings of OOPSLA 1993, pp. 91–108. ACM Press (September 1993)
10. Menon, R., Dagum, L.: OpenMP: an industry-standard API for shared-memory programming. Comput. Sci. Eng. 5, 46–55 (1998). https://doi.org/10.1109/99.660313
11. MPIForum: MPI: A Message-Passing Interface Standard, Version 3.1; June 4, 2015 (2015). https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf
12. Nvidia: CUDA C++ Programming Guide (2020). https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
13. Sandrieser, M., Benkner, S., Pllana, S.: Using explicit platform descriptions to support programming of heterogeneous many-core systems. Parallel Comput. 38(1–2), 52–65 (2012)
14. Siek, J., Lee, L., Lumsdaine, A.: The Boost Graph Library: User Guide and Reference Manual. Addison-Wesley (2002)
15. Tramm, J.R., Siegel, A.R., Islam, T., Schulz, M.: XSBench - the development and verification of a performance abstraction for Monte Carlo reactor analysis. In: The Role of Reactor Physics toward a Sustainable Future, PHYSOR 2014, Kyoto (2014). https://www.mcs.anl.gov/papers/P5064-0114.pdf
16. Yan, Y., Zhao, J., Guo, Y., Sarkar, V.: Hierarchical place trees: a portable abstraction for task parallelism and data movement. In: Gao, G.R., Pollock, L.L., Cavazos, J., Li, X. (eds.) LCPC 2009. LNCS, vol. 5898, pp. 172–187. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-13374-9_12

FastDCF: A Partial Index Based Distributed and Scalable Near-Miss Code Clone Detection Approach for Very Large Code Repositories

Liming Yang1, Yi Ren1(B), Jianbo Guan1(B), Bao Li1, Jun Ma1, Peng Han2, and Yusong Tan1

1 National University of Defense Technology, Changsha, China

{ylm19,renyi,guanjb}@nudt.edu.cn
2 CS&S Information System Engineering Co., Ltd., Beijing, China

Abstract. Although a number of techniques have been proposed over the years to detect clones for improving software maintenance, reusability or security, there is still a lack of language-agnostic approaches with code-granularity flexibility for near-miss clone detection in big code at scale. Detecting near-miss clones in big code is challenging, since it requires more computing and memory resources as the scale of the source code increases. In this paper, we present FastDCF, a fast and scalable distributed clone finder, which is partial-index based and optimized with a multithreading strategy. Furthermore, it overcomes the CPU and memory resource limitations of a single node with MapReduce and HDFS by scalable distributed parallelization, which further improves efficiency. It can detect not only Type-1 and Type-2 clones but also the most computationally expensive Type-3 clones in large repositories. Meanwhile, it works at both function and file granularities, and it supports many different programming languages. Experimental results show that FastDCF detects clones in 250 million lines of code within 24 min, which is more efficient than existing clone detection techniques, with recall and precision comparable to state-of-the-art approaches. On BigCloneBench, a recent and widely used benchmark, FastDCF achieves both high recall and precision, competitive with other existing tools.

Keywords: Clone detection · Distributed algorithm · Large scale code analysis · Efficiency and scalability · Language agnostic · Multiple granularities

1 Introduction

Code clones are source code fragments that are identical or similar to each other and widely exist in different software projects [1]. Code clones can be categorized according to their level of similarity [22]: Type-1 are exact clones; Type-2 are parameterized clones; Type-3 are clones with further modifications (such as inserted or deleted statements) based on Type-1/2; and Type-4 are clones that are not syntactically but semantically similar. Code fragments that are not exactly identical but share a certain level of similarity are known as near-miss clones [14].

© Springer Nature Switzerland AG 2022
H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 210–222, 2022.
https://doi.org/10.1007/978-3-030-96772-7_20


Code cloning can be helpful if it is properly used, but it is also regarded as a bad programming practice, since it can raise maintenance costs [15], reduce code quality [16] and even propagate software vulnerabilities [3, 6]. Many researchers have proposed code clone detection techniques to address these clone-related problems. In the big data era, large scale software is widely deployed in mission critical systems. Studying clones in big code is a useful way to improve code quality and to facilitate inter-project maintenance. Therefore, it is necessary to extend clone detection to large scale systems. However, as code size grows, detection becomes much more expensive, since the number of code fragment comparisons drastically increases. For instance, the time complexity of one-to-one code segment matching is O(n²), which means 25 million comparisons for only 5 thousand segments. Thus, enormous computation and memory resources are required. Furthermore, near-miss clones are the most common clones in software systems and the most needed in code clone detection [20]. However, near-miss clone detection is particularly expensive, because numerous differences (i.e., insertions, deletions or modifications of source code lines or tokens) between code segments need to be examined. Detecting near-miss clones in large scale systems is therefore a challenging task. A number of tools have been proposed to address this problem [2, 7, 22–25]. However, non-distributed techniques still take hours or even days to detect inter-project clones in 250 million lines of code (MLOC) [22, 23] because of the limited computation and memory resources of a single node. Distribution is an effective way to solve this problem. However, existing distributed approaches have some problems. Hummel et al. present an index-based clone detection approach [25]. It is both incremental and scalable to very large codebases, but it only supports Type-1 and Type-2 clone detection.
IBFET [2] is a MapReduce-based tool which utilizes an index-based feature extraction technique to detect code clones. However, IBFET's preprocessing stage is non-distributed, and this becomes a bottleneck when processing large code bases. Furthermore, since inter-project repositories often contain code written in diverse programming languages, multi-language detection is necessary to work across large repositories. It is also important to flexibly support different granularities of detection. For example, function-level detection is suitable for clone-based vulnerability detection [6], while file-level detection is handy for license violation checking. Therefore, it is necessary for a clone detection approach to support many different languages and multiple granularities. In this paper, we present FastDCF, an efficient and effective distributed clone detection approach that can detect clones in inter-project/intra-project big code with flexibility in both programming language and code processing granularity:

Efficiency and Scalability: To break the limitation of computation and memory resources, we design FastDCF as a fully distributed approach. This makes FastDCF work efficiently and scalably on a massive code base. To further improve efficiency, we use partial token indexing to reduce the number of required comparisons.

Type-1/2/3 Clone Detection: To detect near-miss clones, FastDCF uses a simple and fast bag-of-tokens strategy for comparing code blocks, which is resilient to Type-3 changes. Therefore, FastDCF can detect Type-1/2/3 clones.
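The two ideas named above — partial token indexing to prune comparisons and a bag-of-tokens similarity that tolerates Type-3 edits — can be sketched as follows. This is a toy Python illustration, not FastDCF's implementation; the threshold and the prefix-size formula are illustrative:

```python
from collections import Counter, defaultdict

THRESHOLD = 0.7  # similarity threshold (illustrative, not FastDCF's value)

def tokens(block: str) -> Counter:
    # Bag of lower-cased tokens; renamings that only change case
    # no longer distinguish two blocks.
    return Counter(block.lower().split())

def index_prefix(bag: Counter, global_freq: Counter) -> list:
    # Partial index: sort a block's tokens from globally rarest to most
    # common and keep only the shortest prefix that two blocks at the
    # given threshold must still share at least one token of.
    toks = sorted(bag.elements(), key=lambda t: global_freq[t])
    k = len(toks) - int(THRESHOLD * len(toks)) + 1
    return toks[:k]

def detect_clones(blocks):
    bags = {i: tokens(b) for i, b in enumerate(blocks)}
    global_freq = sum(bags.values(), Counter())
    index = defaultdict(set)          # indexed token -> block ids
    pairs = set()
    for i, bag in bags.items():
        prefix = index_prefix(bag, global_freq)
        candidates = set()
        for t in prefix:              # partial-index lookup: only blocks
            candidates |= index[t]    # sharing a rare token are candidates
        for t in prefix:
            index[t].add(i)
        for j in candidates:          # verify candidates only
            overlap = sum((bag & bags[j]).values())
            if overlap >= THRESHOLD * max(sum(bag.values()), sum(bags[j].values())):
                pairs.add((j, i))
    return pairs

blocks = [
    "int sum = 0 ; for ( i = 0 ; i < n ; i ++ ) sum += a [ i ] ;",
    "int total = 0 ; for ( j = 0 ; j < n ; j ++ ) total += a [ j ] ;",
    "printf ( hello world ) ;",
]
print(detect_clones(blocks))  # prints {(0, 1)}: the two loops are near-miss clones
```

Because only the globally rarest tokens of each block are indexed, most non-clone pairs never reach the verification step, while Type-3 insertions or deletions merely lower the token overlap instead of breaking an exact match.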


Language Agnostic and Multiple Granularities: With our purpose-built parser, FastDCF transforms source code into its lower-case equivalent. This allows FastDCF to support many languages, such as C, C++, Java, Python and C#, and to support code granularities at both file and function levels.

We evaluate FastDCF in terms of efficiency, scalability, recall, precision, language support and multi-granularity detection. The experimental results show that FastDCF significantly outperforms existing typical tools, including IBFET [2], DCCFinder [24], SourcererCC [22], CloneWorks [23] and others. It takes only a few minutes for FastDCF to detect clones on 250 MLOC. FastDCF is 10 times faster than CloneWorks on 250 MLOC and 60 times faster than SourcererCC on 75 MLOC. According to the available literature and our test results, FastDCF is the fastest implemented approach for detecting near-miss clones in large scale systems.

The rest of this paper is organized as follows. Section 2 summarizes existing approaches to code clone detection. Section 3 discusses several key issues in designing a fast and scalable distributed clone detection tool and presents the design of FastDCF. Section 4 describes our implementation. Section 5 demonstrates comprehensive experimental evaluations and compares our approach with the most competitive existing tools on large scale real-world code. The paper concludes with a discussion in Sect. 6.

2 Related Work

There are many approaches to large scale code clone detection, and we can divide them into two categories: non-distributed and distributed scalable clone detection.

Non-distributed Scalable Clone Detection. NiCad is a text-based single-node approach [7] which uses the longest common subsequence algorithm to compare lines of source code. It can detect Type-1, Type-2 and Type-3 clones. SourcererCC [22] and CloneWorks [23] are also non-distributed approaches. They use effective token-based single-node methods. Though these approaches improve the efficiency of clone detection on large scale code, they face bottlenecks since the resources of a single node are limited.

Distributed Scalable Clone Detection. With the development of hardware capabilities and virtualization technology, distributed and parallel processing optimization for clone detection is emerging. DCCFinder [24] is the first distributed clone detection tool; it runs CCFinder [4] in parallel. To be analyzed with CCFinder, the target must be partitioned into small pieces, and every node loads two pieces to detect clones between them. Hummel et al. use an index-based strategy to enlarge the scale of clone detection and to provide real-time cloning information for very large software [25]. However, they can only detect Type-1 and Type-2 clones. IBFET is an index-based method that uses hash algorithms to extract features from source code, and these features are stored in HBase [2]. IBFET can scale clone detection to billions of LOC at file-level granularity. However, IBFET is not a fully distributed clone detection tool, since it only parallelizes feature-based index creation and clone detection and retrieval, with core steps such as preprocessing and normalization, and feature extraction, not parallelized. This significantly affects its overall efficiency, and these non-distributed core steps become bottlenecks if the code is very large.

3 Design

3.1 Preliminary Concepts and Definitions

In this section, we introduce concepts and definitions regarding code clones that appear in our approach. A code block is a continuous segment of source code, which can be a function or a sequence of statements in a source file. A clone pair is a pair of code blocks that are similar and detected as clones. A clone group is a set of similar code blocks and consists of a number of clone pairs. Clones are made up of clone pairs or groups. A query code block is a code block that is used to query the index and obtain potential clones. Candidates are the code blocks returned by a query code block's index query; they are potential clones of the query code block. Zipf's law asserts that the frequency f of specific events is inversely proportional to their rank r [11].

3.2 Efficiency and Scalability Limitations of Existing Techniques: Experiments and Analysis

Due to the limited main memory and CPU capacity of a single node, the scalability of non-distributed tools is usually limited once the size of the code reaches a threshold. SourcererCC is the first approach proposed and implemented to detect clones in MLOC. To illustrate this dilemma, we evaluate SourcererCC [22] with the bcb_reduced dataset, whose size is 10 MLOC. The experimental environment is set up according to [22]. The tests are carried out on a workstation with 4 Intel Xeon Platinum 8269CY cores and 16 GB RAM (8 GB set as available). Figure 1 shows how CPU usage and memory usage change over time. From Fig. 1(a), we can see that the CPU usage rate shows a fluctuating increase at the start and reaches almost 100% shortly after. This suggests that SourcererCC consists mainly of CPU-intensive tasks and has a large demand for computing resources when processing large scale code. As shown in Fig. 1(b), the memory usage increases in an even more fluctuating way at the beginning.

3 Design

3.1 Preliminary Concepts and Definitions

In this section, we introduce concepts and definitions regarding code clones that appear in our approach. A code block is a continuous segment of source code, which can be a function or a sequence of statements in a source file. A clone pair is a pair of code blocks that are similar and detected as clones. A clone group is a set of similar code blocks and consists of a number of clone pairs. Clones are made up of clone pairs or groups. A query code block is a code block used to query the index and obtain potential clones. Candidates are the code blocks returned by a query code block's index query; they are potential clones of the query code block. Zipf's law asserts that the frequency f of a specific event is inversely proportional to its rank r [11].

3.2 Efficiency and Scalability Limitations of Existing Techniques: Experiments and Analysis

Due to the main memory and CPU capacity limitations of a single node, the scalability of non-distributed tools is usually prohibited when the size of the code reaches a threshold. SourcererCC is the first approach proposed and implemented to detect clones at MLOC scale. To illustrate this dilemma, we evaluate SourcererCC [22] with the bcb_reduced dataset, whose size is 10 MLOC. The experimental environment is set according to [22]. The tests are carried out on a workstation with 4 Intel Xeon Platinum 8269CY cores and 16 GB RAM (8 GB set as available). Figure 1 shows how the CPU and memory usage change over time. From Fig. 1(a), we can see that the CPU usage rate fluctuates upward at the start and reaches almost 100% shortly afterwards. This suggests that SourcererCC mainly contains CPU-intensive tasks and has a large demand for computing resources when processing large-scale code. As shown in Fig. 1(b), the memory usage increases in an even more fluctuating way at the beginning. Then it reaches nearly 52% of the upper limit we set. This is because the big code data are usually loaded into and kept in memory; the larger the amount of code, the more memory is required. From Fig. 1 (screenshots of the running system), we can see that memory and CPU usage stay close to their upper limits, since detecting clones in big code requires substantial computation and memory resources.

3.3 Design Goals and Our Approaches

We want to design an approach that can detect clones in inter-project/intra-project big code with flexibility in both programming language and code processing granularity.

214

L. Yang et al.

(a) Usage rate of CPU

(b) Usage rate of main memory

Fig. 1. SourcererCC: the usage rate of CPU and main memory

From Fig. 1, we can see that detecting clones in big code requires substantial computation and memory resources. To address this problem, we propose I2nOPT, an intra/inter-node optimized method that combines distributed parallelization with token-based partial indexing. By building our parser with flexible source code parsing techniques, our approach supports multi-language and multi-granularity code clone detection.

3.3.1 I2nOPT: Combining Distributed Parallel Optimization and Token-Based Partial Indexing

We propose a fast and scalable approach that combines distributed parallel optimization and token-based partial indexing. We use distribution as an inter-node optimization to break through the resource limits of a single node, and token-based partial indexing as an intra-node optimization to further improve the efficiency and scalability of our approach. Another benefit of using token-based partial indexing is that it can detect near-miss clones.

Inter-node Optimization. Generally speaking, clone detection is divided into two stages, the preprocessing stage and the clone detection stage, both of which require a lot of computing and memory resources. We parallelize FastDCF in both stages, and multi-threading is used within each stage. The codebase in our design consists of many projects collected and maintained by administrators. User code is a project submitted by users, used to find clones between the user code and the codebase. A sub-codebase is a part of the codebase, used for parallelization. The distributed preprocessing stage is the first parallelization stage. We divide the job into smaller tasks and assign them to each node: the big codebase is split into a number of smaller sub-codebases, and the preprocessing is executed independently in a mapper for each sub-codebase. By the clone detection stage, the source code has been split into many small fragments.
In order to detect all clones between two projects, each node loads a preprocessed sub-codebase and keeps it in memory. The user code is then streamed into the node's main memory, and clones between the user code and the loaded sub-codebase are detected. User code is not stored in memory. This is repeated until all potential clones are identified. This design makes our approach faster since it makes full use of the distributed CPU and memory resources.

Intra-node Optimization. When detecting clones between a loaded sub-codebase and user code, a token-based partial index [22, 23] is adopted. In traditional token-based approaches, the source code is converted into code blocks made up of tokens, and each code block is compared with every other block to detect clone pairs. The time complexity is O(n^2), which is not desirable in large-scale clone detection. Thus, we use a partial index in FastDCF to reduce the number of comparisons and save computational overhead. We state this formally as the following property:

Property 1: Let code block A consist of t1 tokens and code block B consist of t2 tokens, each in a predefined order, and denote a sub-block of A as SA and a sub-block of B as SB. If |A ∩ B| ≥ i, where i is the given threshold, then the sub-block SA consisting of the first t1-i+1 tokens and the sub-block SB consisting of the first t2-i+1 tokens have at least one token in common.

To illustrate this property, consider two code blocks A = {T1, T2, T3, T4, T5} and B = {T6, T7, T3, T4, T5} with 5 tokens (t = 5) each, where two blocks are considered clones if they share at least 4 tokens (i = 4). To decide whether A and B can be clones, according to this property we only need to check whether their sub-blocks consisting of the first t-i+1 = 2 tokens ({T1, T2} and {T6, T7}) share any token. In this example they do not, so A and B cannot be clones: even if all the remaining tokens were identical, the number of shared tokens could not reach the threshold. In other words, this property lets us rule out pairs that cannot be clones by comparing only their sub-blocks instead of all the tokens of A and B.
Tokens in sub-blocks are used as the partial index in our approach. Furthermore, software vocabulary exhibits characteristics very similar to natural language corpora and also follows Zipf's law [11]: the frequency of tokens decreases very rapidly with rank, a few popular tokens appear in most code blocks, and rare tokens are shared by only a few code blocks. According to this law, we sort the tokens of each code block from low to high frequency and take the first t-i+1 tokens as the sub-block.

Near-Miss Clone Detection. Near-miss clones are common in real projects, so it is necessary to detect them when designing a clone detection tool. FastDCF is token-based, and compared to existing token-based tools, FastDCF can detect near-miss clones because it adopts the bag-of-tokens model. This model is similar to the bag-of-words model: it computes similarity from common tokens and can detect near-miss clones as long as two code blocks share enough tokens to exceed a given threshold. Other token-based approaches use token sequences as the unit of match [4], which makes near-miss clones more difficult to detect.
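As a minimal illustration (FastDCF itself is implemented in Java; the global token-frequency table here is a hypothetical precomputed one), the Property 1 sub-block filter with Zipf-based token ordering can be sketched as:

```python
def sub_block(tokens, global_freq, i):
    """Return the first len(tokens) - i + 1 tokens after sorting by
    ascending global frequency (rare tokens first, following Zipf's law)."""
    ordered = sorted(tokens, key=lambda t: global_freq.get(t, 0))
    return set(ordered[:len(ordered) - i + 1])

def may_be_clones(a, b, global_freq, i):
    """Property 1: if |A ∩ B| >= i, the sub-blocks must share a token,
    so a pair with disjoint sub-blocks can be skipped safely."""
    return bool(sub_block(a, global_freq, i) & sub_block(b, global_freq, i))

# The running example from the text: sub-blocks {T1, T2} and {T6, T7}.
freq = {"T3": 9, "T4": 8, "T5": 7, "T1": 1, "T2": 1, "T6": 1, "T7": 1}
A = ["T1", "T2", "T3", "T4", "T5"]
B = ["T6", "T7", "T3", "T4", "T5"]
assert not may_be_clones(A, B, freq, 4)  # A and B cannot be clones
```

Sorting rare tokens to the front keeps the indexed sub-blocks as selective as possible, so few false candidates share a sub-block token.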

3.3.2 Flexible Source Code Parsing

In addition to fast and scalable clone detection, we want our approach to be convenient, user friendly, and applicable to different scenarios. We want FastDCF to be language agnostic and to support multi-granularity detection. To achieve this goal, we need a flexible parser that converts the source code of different languages into an intermediate representation at any granularity we want. Therefore, we build our parser using TXL [9], a functional programming language specifically designed for expressing source transformation tasks. We use it in FastDCF to extract code blocks from source code at different granularities. Thus, FastDCF is language agnostic and can detect clones at both file level and function level.

4 FastDCF Implementation

We implement FastDCF in Java with about 3000 lines of code. As shown in Fig. 2, FastDCF fulfills the fast and effective detection of big code in four stages: data submitting, codebase splitting, preprocessing, and clone detection. The output of each step becomes the input of the next step, and the final detection results are produced through these elaborate steps. We use HDFS [21] and MapReduce [5] as the distributed computing framework for our parallelization.

Data Submitting. In the data submitting stage, the administrator uploads the codebase and the user submits the code for clone detection. In order to improve disk space usage, we package a number of small files into one big file in SequenceFile format.
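To make the packing idea concrete: FastDCF uses Hadoop's SequenceFile format, but the effect of combining many small files into one large blob can be sketched with a toy length-prefixed record encoding (this encoding is an illustration only, not the SequenceFile format):

```python
import io
import struct

def pack(files):
    """Concatenate many small files into one blob as
    length-prefixed (name, payload) records."""
    out = io.BytesIO()
    for name, payload in files.items():
        encoded = name.encode()
        # Big-endian 4-byte lengths for the name and the payload.
        out.write(struct.pack(">II", len(encoded), len(payload)))
        out.write(encoded)
        out.write(payload)
    return out.getvalue()

def unpack(blob):
    """Inverse of pack: recover the individual files."""
    files, pos = {}, 0
    while pos < len(blob):
        name_len, payload_len = struct.unpack_from(">II", blob, pos)
        pos += 8
        name = blob[pos:pos + name_len].decode(); pos += name_len
        files[name] = blob[pos:pos + payload_len]; pos += payload_len
    return files
```

One container file avoids the per-file metadata overhead that makes millions of tiny source files expensive to store and scan in HDFS.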

Fig. 2. Implementation of FastDCF

Codebase Splitting. The whole codebase is too large to fit into memory. To solve this problem, we break the codebase into smaller sub-codebases, where the size of each sub-codebase fits the memory capacity of each node.

Preprocessing. The preprocessing stage converts data blocks (a sub-codebase or the user code) into token sequences. Each node loads a sub-codebase as input and then performs code block retrieval, filtering, and token extraction. The code block retrieval module retrieves code blocks from the given sub-codebase using the robust parser. The filtering module filters out code blocks that do not satisfy the required size. In the token extraction module, tokens are extracted, with operators and separators filtered out, and assembled into token sequences. Finally, the preprocessed token sequences are written back to HDFS.
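The filter-then-tokenize pipeline can be sketched as follows (an illustrative Python sketch, not the authors' Java modules; the token pattern and the `MIN_TOKENS` threshold are assumptions):

```python
import re

MIN_TOKENS = 5  # hypothetical minimum-size filter threshold

def preprocess(code_blocks):
    """Sketch of the preprocessing pipeline: tokenize each code block,
    drop operators/separators, and discard undersized blocks."""
    # Keep identifiers/keywords and numeric literals; punctuation,
    # operators, and separators are simply never matched.
    token_re = re.compile(r"[A-Za-z_]\w*|\d+")
    sequences = []
    for block in code_blocks:
        tokens = token_re.findall(block)
        if len(tokens) >= MIN_TOKENS:
            sequences.append(tokens)
    return sequences
```

In the real system each mapper would run this over its sub-codebase and write the resulting token sequences back to HDFS.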


Clone Detection. In the clone detection stage, clones between the codebase and the user code are detected. Each node loads a sub-codebase and all user code. By counting the frequency of each token in the sub-codebase via the token frequency creation module, a local token frequency table is produced. Then the index creation module creates a partial index for each sub-codebase. In the code search module, tokens in sub-blocks from the user code are used to query the index and generate candidates. Finally, FastDCF computes the similarity using the Jaccard measure [22] and outputs the detection results.
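The index-then-verify flow of this stage can be sketched as below (illustrative Python only; the fixed threshold `i = 4`, the Jaccard cutoff of 0.7, and the assumption that tokens are already frequency-sorted are all simplifications):

```python
from collections import defaultdict

def first_tokens(tokens, i=4):
    """Hypothetical sub-block: the first len(tokens) - i + 1 tokens,
    assuming tokens are already sorted by ascending global frequency."""
    return set(tokens[:max(1, len(tokens) - i + 1)])

def build_partial_index(blocks):
    """Inverted index mapping sub-block tokens to code block ids."""
    index = defaultdict(set)
    for bid, tokens in blocks.items():
        for t in first_tokens(tokens):
            index[t].add(bid)
    return index

def detect(query_tokens, blocks, index, threshold=0.7):
    """Gather candidates via the partial index, then verify each
    candidate with Jaccard similarity over full token sets."""
    candidates = set()
    for t in first_tokens(query_tokens):
        candidates |= index.get(t, set())
    q = set(query_tokens)
    clones = []
    for bid in sorted(candidates):
        b = set(blocks[bid])
        if len(q & b) / len(q | b) >= threshold:
            clones.append(bid)
    return clones
```

Only blocks sharing a rare sub-block token are ever compared in full, which is the source of the savings over all-pairs comparison.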

5 Evaluation

FastDCF is evaluated in four aspects: 1) we evaluate the scalability and efficiency of FastDCF using inputs of varying sizes in terms of lines of code and compare it with state-of-the-art tools; 2) we measure FastDCF's recall on BigCloneBench [27], as well as its precision; 3) we verify the effectiveness of our distributed optimization by comparing efficiency before and after distribution; 4) we show the ability of FastDCF to detect clones at file level and function level. We rented a total of 23 ECS machine instances. Each machine has a quad-core CPU, 16 GB memory, and a 60 GB hard disk. The Hadoop version is 2.7.7, and Ubuntu 16.04 is used as the operating system. We limit each task to use up to 10 GB of memory. We evaluate distributed tools on multiple machines and non-distributed tools on a single machine.

5.1 Execution Time and Scalability

Table 1. Execution time for varying input sizes

Input   FastDCF       CloneWorks       Nicad            SourcererCC
1M      42 s          22 s             1 min 1 s        1 min 18 s
10M     5 min 41 s    4 min 16 s       2 h 4 min 12 s   29 min 18 s
30M     7 min 36 s    18 min 7 s       Internal         49 min 19 s
75M     9 min 42 s    52 min 19 s      –                9 h 47 min 15 s
150M    15 min 7 s    1 h 54 min 35 s  –                –
250M    23 min 19 s   4 h 3 min 24 s   –                –

Comparing with Non-distributed Methods. We compare FastDCF's execution time and scalability against three clone detection tools, CloneWorks [23], SourcererCC [22], and Nicad [7], which are representative pioneering works in clone detection for big code. We chose them because they perform well in large-scale detection [2]. Nicad is a popular tool that supports Type-3 detection. SourcererCC is the first tool designed for large-scale clone detection. CloneWorks optimizes the implementation details on the basis of SourcererCC and improves its efficiency. Files were randomly selected from IJaDataset [12] to build inputs of different sizes, ranging from 1 MLOC to 250 MLOC. Experimental results are shown in Table 1. From Table 1, we can see that FastDCF scales with reasonable execution time as the input grows from 1 MLOC to 250 MLOC. In contrast, Nicad is able to scale to the 10 MLOC input, but it cannot scale to a dataset of 30 MLOC or more. According to the description in [8], due to the limitation of its internal data structure, it cannot handle the large number of clone-pair computations, which prevents it from scaling up as the code size grows. CloneWorks scales better than Nicad and SourcererCC, but it spends more time than FastDCF when the input is larger than 10 MLOC. When the size is less than 10 MLOC, the effect of FastDCF's distributed strategy is not obvious and its execution speed is slightly inferior to CloneWorks; the reason is that parallelization brings extra delay. However, as the input becomes larger, FastDCF's lead over the other tools becomes obvious. When the input reaches 250 MLOC, the efficiency of FastDCF is 8 times that of CloneWorks.

Comparing with Distributed Methods. We also compare FastDCF with representative distributed methods, including the technique of Hummel et al. [25] and IBFET. Table 2 shows the comparison with other index-based distributed clone detection techniques. We use Linux 2.6 as the dataset, which contains about 11 MLOC. Hummel's method can only detect Type-1 and Type-2 clones and spends much more time than IBFET and FastDCF. IBFET can support Type-1, Type-2, and Type-3 clones. However, FastDCF has an obvious advantage over IBFET in execution time. FastDCF performs best among the three tools because of its subtle distributed design based on the partial index.

Table 2. Clone detection execution time comparison

Techniques   Linux-Kernel   Clone types
Hummel       47 min 29 s    T-1, T-2
IBFET        20 min 40 s    T-1, T-2, T-3
FastDCF      7 min 45 s     T-1, T-2, T-3

5.2 Distributed Parallelization

To measure the performance speedup from distributed parallelization in FastDCF, we conduct two experiments. In the first experiment, the number of nodes is kept constant while the size of the data grows. In the second experiment, the input size is kept constant while the number of nodes grows. Figure 3 illustrates the results of the first experiment. The number of nodes is fixed at 23. The y-axis is the ratio of the time required by the distributed method to the time required by the non-distributed method, and the x-axis is the input size, from 1M to 250M. When the code size is less than or equal to 10M, the performance of the distributed approach is worse than that of the non-distributed approach (except for 10M preprocessing). This is mainly due to the extra communication overhead of the distributed nodes. As the code size grows, the effect of the parallelization optimization becomes obvious.

Fig. 3. Different size of the data

Figure 4 shows the results of the second experiment. The code size is 250 MLOC. The x-axis is the number of nodes, which varies from 1 to 23. The y-axis is the ratio of the time on multiple nodes to the time on a single node. As the number of nodes increases, the average time spent on preprocessing and clone detection decreases, and the optimization effect on preprocessing is even better than that on clone detection. The speedups of preprocessing and clone detection are both nearly linear.

Fig. 4. Different number of the nodes

5.3 Precision and Recall

BigCloneBench is a big clone benchmark of manually validated clone pairs in the inter-project software repository IJaDataset [12]. In order to measure recall in more detail, we further divide the Type-3 and Type-4 clones into four categories based on their syntactical similarity: Very Strongly Type-3 (VST3) clones have a syntactical similarity between 90% and 100%, Strongly Type-3 (ST3) in 70%–90%, Moderately Type-3 (MT3) in 50%–70%, and Weakly Type-3/Type-4 (WT3/4) in 0–50%. More details are given in [27]. MT3 and WT3/4 do not belong to near-miss clones and are therefore not considered in our work.

It can be seen from Table 3 that FastDCF has perfect Type-1 detection and near-perfect Type-2 detection, which means that FastDCF has a strong ability to detect Type-1 and Type-2 clones. FastDCF has excellent Type-3 recall for the VST3 category; its recall is slightly lower for the ST3 category. Though Nicad can detect more clones, as we saw previously in Sect. 5, its execution time for larger inputs and its scalability constraint at the 100 MLOC input are not as good. CloneWorks and SourcererCC are both competitive Type-3 clone detectors.

Table 3. BigCloneBench recall and precision results

Tool          Type-1   Type-2   VST3   ST3   Precision
FastDCF       100      99       93     67    94
CloneWorks    100      99       94     62    93
Nicad         100      100      100    95    80
SourcererCC   100      98       93     61    86
Precision. Measuring clone detection precision remains an open problem since there is no standard benchmark or methodology. We estimate the precision of the tools by manually validating a random sample of their outputs, which is the typically accepted approach. We randomly selected 100 clones, a statistically significant sample. The results are shown in Table 3. FastDCF has the best precision at 94%, which is slightly better than CloneWorks. Nicad and SourcererCC also have good precision, but lower than that of FastDCF.

5.4 Multi-granularity Detection

FastDCF can detect clones at different granularities. Function-level detection was validated in the experiments in the previous sections. We use Linux kernel 5.10 to measure file-level granularity detection. The results show that FastDCF is capable of taking input in the form of files and that file-level clones can be detected. For instance, FastDCF detects the clone pair "linux-master/arch/x86/um/ptrace_32.c" and "linux-master/arch/x86/um/ptrace_64.c", which are similar in content.

6 Conclusion

In this paper, we propose FastDCF, a fast and scalable near-miss clone detection technique, which exploits a distribution strategy over the MapReduce framework to scale detection to large codebases and uses partial indexing and multi-threading to improve scalability and efficiency. We measure the efficiency and scalability with 250 MLOC of IJaDataset. Experimental results show that FastDCF outperforms existing work in scale. FastDCF's recall and precision are comparable to state-of-the-art clone detection tools, and it achieves the goal of multi-language support and multiple code granularities. To the best of our knowledge, FastDCF is the most efficient implemented tool for detecting near-miss clones. For future work, we plan to apply our approach to vulnerability detection for large-scale software such as OS distributions, web servers, data-intensive large systems, and so on.

Acknowledgement. The work in this paper is supported by the Natural Science Foundation of China (Grants No. 61872444 and U19A2060) and the National Key Research and Development Program of China (2018YFB1003602).

References

1. Lopes, C.V., et al.: DéjàVu: a map of code duplicates on GitHub. Proc. ACM Program. Lang. 1, 1–28 (2017)
2. Akram, J., Mumtaz, M., Luo, P.: IBFET: index-based features extraction technique for scalable code clone detection at file level granularity. Softw. Pract. Exp. 50(1), 22–46 (2020)
3. Baker, B.S.: On finding duplication and near-duplication in large software systems. In: Proceedings of 2nd Working Conference on Reverse Engineering, pp. 86–95 (1995)
4. Kamiya, T., Kusumoto, S., Inoue, K.: CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng. 28, 654–670 (2002)
5. Dean, J., Ghemawat, S.: MapReduce: a flexible data processing tool. Commun. ACM 53, 72–77 (2010)
6. Kim, S., Woo, S., Lee, H., Oh, H.: VUDDY: a scalable approach for vulnerable code clone discovery. In: IEEE Symposium on Security and Privacy (SP) (2017)
7. Cordy, J.R., Roy, C.K.: The NiCad clone detector. In: IEEE 19th International Conference on Program Comprehension, pp. 219–220 (2011)
8. Chen, K., Liu, P., Zhang, Y.: Achieving accuracy and scalability simultaneously in detecting application clones on Android markets. In: Proceedings of the 36th International Conference on Software Engineering, pp. 175–186 (2014)
9. The TXL Programming Language. https://www.txl.ca/. Accessed 21 Apr 2020
10. Roy, C.K., Cordy, J.R.: A mutation/injection-based automatic framework for evaluating code clone detection tools. In: Software Testing, Verification and Validation Workshops, ICSTW 2009, pp. 157–166 (2009)
11. Hindle, A., Barr, E.T., Su, Z., Gabel, M., Devanbu, P.: On the naturalness of software. In: 34th International Conference on Software Engineering (ICSE), pp. 837–847 (2012)
12. Ambient Software Evolution Group: IJaDataset 2.0 (January 2013). http://secold.org/projects/seclone. Accessed 21 Oct 2019
13. Roy, C.K., Cordy, J.R., Koschke, R.: Comparison and evaluation of code clone detection techniques and tools: a qualitative approach. Sci. Comput. Program. 74, 470–495 (2009)
14. Zibran, M.F., Saha, R.K., Asaduzzaman, M., Roy, C.K.: Analyzing and forecasting near-miss clones in evolving software: an empirical study. In: IEEE International Conference on Engineering of Complex Computer Systems (2011)
15. Mayrand, J., Leblanc, C., Merlo, E.M.: Experiment on the automatic detection of function clones in a software system using metrics. In: International Conference on Software Maintenance (1996)
16. Lavoie, T., Eilers-Smith, M., Merlo, E.: Challenging cloning related problems with GPU-based algorithms. In: International Workshop on Software Clones (2010)
17. Pham, N.H., Nguyen, T.T., Nguyen, H.A., Nguyen, T.N.: Detection of recurring software vulnerabilities. In: IEEE/ACM International Conference on Automated Software Engineering (2010)
18. Li, H., Kwon, H., Kwon, J., Lee, H.: CLORIFI: software vulnerability discovery using code clone verification. Concurr. Comput. Pract. Exp. 28, 1900–1917 (2016)
19. Saha, R.K., Roy, C.K., Schneider, K.A., Perry, D.E.: Understanding the evolution of Type-3 clones: an exploratory study. In: 2013 10th IEEE Working Conference on Mining Software Repositories (MSR) (2013)
20. Wang, P., Svajlenko, J., Wu, Y., Xu, Y., Roy, C.K.: CCAligner: a token based large-gap clone detector. In: IEEE/ACM 40th International Conference on Software Engineering (ICSE), pp. 1066–1077 (2018)
21. Honnutagi, P.S.: The Hadoop distributed file system. Int. J. Comput. Sci. Inf. Technol. 5, 6238–6243 (2014)
22. Sajnani, H., Saini, V., Svajlenko, J., Roy, C.K., Lopes, C.V.: SourcererCC: scaling code clone detection to big code. In: Proceedings of the 38th International Conference on Software Engineering, pp. 1157–1168 (2015)
23. Svajlenko, J., Roy, C.K.: CloneWorks: a fast and flexible large-scale near-miss clone detection tool. In: IEEE/ACM International Conference on Software Engineering Companion (2017)
24. Livieri, S., Higo, Y., Matsush*ta, M., Inoue, K.: Very-large scale code clone analysis and visualization of open source programs using distributed CCFinder: D-CCFinder. In: 29th International Conference on Software Engineering, ICSE 2007, pp. 106–115 (2007)
25. Hummel, B., Juergens, E., Heinemann, L., Conradt, M.: Index-based code clone detection: incremental, distributed, scalable. In: 2010 IEEE International Conference on Software Maintenance, pp. 1–9 (2010)
26. Roy, C.K., Cordy, J.R.: Near-miss function clones in open source software: an empirical study. J. Softw. Maint. Evol. Res. Pract. 22, 165–189 (2012)
27. Svajlenko, J., Islam, J.F., Keivanloo, I., Roy, C.K., Mia, M.M.: Towards a big data curated benchmark of inter-project code clones. In: IEEE International Conference on Software Maintenance and Evolution, pp. 476–480 (2014)
28. Jang, J., Agrawal, A., Brumley, D.: ReDeBug: finding unpatched code clones in entire OS distributions. In: IEEE Symposium on Security and Privacy, pp. 48–62 (2012)

Towards Optimal Fast Matrix Multiplication on CPU-GPU Platforms

Senhao Shao, Yizhuo Wang(B), Weixing Ji, and Jianhua Gao

School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
[emailprotected]

Abstract. Increasing computing power has become available through the use of GPUs, bringing new opportunities to accelerate fast matrix multiplication on GPUs. Although researchers have proposed several optimization schemes for the Strassen algorithm on the GPU, they have not fully utilized the computing resources of the CPU. In this paper, we propose a CPU-GPU heterogeneous implementation of the Winograd algorithm based on task graph scheduling. It uses a work-stealing scheduler to achieve a balanced load. We also propose two recursive task graph extension strategies: hom*ogeneous and heterogeneous extension. We invoke different execution strategies at different recursive levels and design a predictor based on a random forest regression model to make the decision. Finally, experimental evaluations are performed on a CPU-GPU heterogeneous platform. They show that the improved Winograd algorithm achieves average speedups of 1.6x, 1.5x, and 1.4x against cuBLAS, Winograd on CPU, and Winograd on GPU, respectively, for matrices with dimension greater than 5000.

Keywords: Winograd algorithm · Matrix multiplication · Random forest regression · CPU-GPU heterogeneous architecture

1 Introduction

Matrix multiplication is an important linear algebra operation with a myriad of applications in image processing, scientific computing, etc. Fast matrix multiplication algorithms have lower time complexity than standard matrix multiplication, which runs in O(n^3) time. In 1969, Volker Strassen proposed the first fast matrix multiplication with a time complexity of O(n^2.81), named the Strassen algorithm [19]. It is a divide-and-conquer algorithm that decomposes matrix multiplication, reorganizes the calculation based on block matrix multiplication, and completes the calculation through 7 recursive matrix multiplications and 18 matrix additions. Its proposal has led to more research on fast matrix multiplication, resulting in faster methods such as the Coppersmith-Winograd algorithm.

(This work is supported by the National Natural Science Foundation of China, Grant No. 61972033. © Springer Nature Switzerland AG 2022. H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 223–236, 2022. https://doi.org/10.1007/978-3-030-96772-7_21)

A heterogeneous computing system usually consists of one or multiple CPUs with a set of computing cores, and a GPU. In such a system, the CPU is a latency-optimized general-purpose processor best suited to executing a wide variety of tasks quickly, while the GPU is a throughput-optimized specialized processor designed to accelerate specific tasks that exhibit a high degree of parallelism. At present, CPU-GPU heterogeneous computing mainly falls into two cases: (1) the CPU is only responsible for task scheduling and is not involved in the calculation; (2) both the CPU and GPU are responsible for the calculation. Most existing Strassen implementations run on the GPU or the CPU alone, so the computing resources of both units cannot be fully utilized at the same time. Our implementation, based on collaborative computing of the CPU and GPU, can fully tap the computing performance of both.

In this paper, we propose a CPU-GPU heterogeneous implementation of the Winograd algorithm based on task graph scheduling. We also propose two recursive task graph extension strategies, hom*ogeneous and heterogeneous extension, and invoke different execution strategies at different recursive levels. In our implementation, a predictor based on a random forest regression model is applied to find the approximately optimal extension strategy for a given matrix. The input to the runtime system is the task graph generated according to the extension strategy, and the runtime system uses a work-stealing scheduler to achieve a balanced load. Finally, we perform experimental evaluations on a CPU-GPU heterogeneous platform consisting of an Intel i9-10920X CPU and a GTX 3090 GPU.
The results show that the proposed Winograd algorithm achieves average speedups of 1.6x, 1.5x, and 1.4x against cuBLAS, Winograd on CPU, and Winograd on GPU, respectively, for matrices with dimension greater than 5000.
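To make the 7-multiplication structure mentioned above concrete, here is a minimal Python sketch of one level of Strassen's scheme on 2×2 blocks (scalars stand in for sub-matrix blocks; the real algorithm applies the same formulas recursively to matrix blocks):

```python
def strassen_2x2(a11, a12, a21, a22, b11, b12, b21, b22):
    """One level of Strassen's scheme: 7 multiplications instead of the
    8 required by the standard block formula, at the cost of extra additions."""
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    # Recombine the 7 products into the 4 result blocks.
    c11 = m1 + m4 - m5 + m7
    c12 = m3 + m5
    c21 = m2 + m4
    c22 = m1 - m2 + m3 + m6
    return c11, c12, c21, c22
```

Applied recursively, saving one of eight multiplications at every level is what yields the O(n^2.81) = O(n^log2(7)) complexity.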

2 Related Work

In order to reduce the time complexity of matrix multiplication, researchers have conducted extensive studies [13,18]. Pan constructed a fast linear non-commutative algorithm for matrix multiplication using trilinear operations, with a time complexity of O(n^2.7951) [16]. Bini et al. proposed an approximate algorithm with a time complexity of O(n^2.7799) [3]. Strassen achieved a time complexity of O(n^2.4785) using the laser method [20]. Subsequently, Coppersmith and Winograd adopted the laser method to reduce the time complexity to O(n^2.376) [4]. François Le Gall proposed a method based on convex optimization to reduce the time complexity to O(n^2.3728639) [13]. However, this line of research so far has only theoretical significance. Although fast matrix multiplications have lower complexity, they have numerical stability problems. Some researchers have studied the numerical stability of fast matrix multiplications and found that limiting the number of recursion levels does not affect the numerical stability of the algorithm [6,8].

Towards Optimal Fast Matrix Multiplication on CPU-GPU Platforms

225

Therefore, after some levels of recursion, the subsequent implementation relies on standard general matrix multiplication. With the improvement of multi-core processor performance, fast matrix multiplications on multi-threaded architectures have been extensively researched [5,11]. Huang et al. used the BLIS software framework to implement the Strassen algorithm, which effectively avoids additional intermediate matrix storage [9]. Ballard et al. developed an automatic code generation tool that can automatically generate sequential and shared-memory implementations of each fast algorithm [2]. Fast matrix multiplications on the GPU architecture have been widely implemented as GPU computing performance improves. Li et al. implemented the Strassen and Winograd algorithms on an NVIDIA C1060 GPU [14]. Lai et al. implemented the Strassen algorithm and proposed determining the cut-off point based on an experience-driven model [12]. Ray et al. compared Strassen's algorithm and classical matrix multiplication on CPU and GPU respectively [17]. Huang et al. proposed a novel Strassen primitive for the GPU architecture, which effectively reuses shared memory and registers to avoid additional memory space overhead [10]. Although fast matrix multiplications have been extensively optimized on the CPU or the GPU, these implementations fail to effectively utilize the computing resources of the CPU and GPU together.

3 Method

3.1 Overall Framework

As shown in Fig. 1, the overall framework includes task graph generation, which transforms the recursive Winograd algorithm into a non-recursive task graph, and the runtime system. First, we perform feature calculations based on the input matrix, use the offline-trained model to obtain the optimal extension strategy, and finally generate a task graph based on that strategy. The task graph can be abstracted as a directed acyclic graph: each circle represents a task node, i.e., a matrix operation, and the edges between nodes represent dependencies between tasks. The runtime system schedules tasks based on the task graph. Current task scheduling algorithms fall into two main groups: static scheduling, where decisions are made before the tasks execute, and dynamic scheduling, which allocates resources at runtime. In our implementation, we adopt a dynamic scheduling algorithm based on work stealing, whose process is shown in Fig. 2. The CPU and the GPU, each called a worker, each maintain a queue of ready tasks. Initially, we assign tasks to the CPU and GPU using Round-Robin scheduling. At runtime, a worker first checks its ready queue; if the queue is not empty, it removes the task at the head of the queue and executes it. When a task completes, it may make other tasks ready; such tasks are placed at the tail of the current worker's ready queue. When the worker runs out


of ready tasks, it performs a stealing operation on another worker: tasks taken from the tail of the victim's ready queue are inserted at the tail of the current worker's ready queue.
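A minimal sketch of this scheduling loop (plain Python with one `deque` per worker; the `Worker` class, the serial execution, and the two-worker setup are illustrative assumptions, not the paper's runtime system):

```python
from collections import deque

class Worker:
    def __init__(self, name):
        self.name = name
        self.ready = deque()  # this worker's queue of ready tasks

    def run(self, peers, on_complete):
        """Execute tasks until every queue is empty; on_complete(task)
        returns the tasks that became ready when `task` finished."""
        executed = []
        while True:
            if not self.ready:
                victim = next((p for p in peers if p.ready), None)
                if victim is None:
                    break  # nothing left to steal
                # steal from the tail of the victim's ready queue
                self.ready.append(victim.ready.pop())
            task = self.ready.popleft()  # take the head of own queue
            executed.append(task)
            # newly ready tasks go to the tail of the current worker's queue
            self.ready.extend(on_complete(task))
        return executed
```

The initial Round-Robin assignment described above would fill the workers' queues before `run` is called.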

Fig. 1. The framework of the heterogeneous implementation of fast matrix multiplication.

Fig. 2. Task scheduling process based on work stealing.

Offline trace analysis is applied to evaluate the heterogeneous load under the work-stealing scheduling method. It consists of two parts: capturing the task runtime trace and visualizing it with the Bokeh library. After all tasks are completed, the trace information captured by the heterogeneous runtime system, which manages the running environment and records the running time of each task, is written to a trace file. The running time includes the start time and finish time of each task. The trace format has two parts. The first part, occupying the first line, has the format "running device 0 - running device 1". The second part, from the second line to the last, is the execution trace of each task in the format "running device - start time - finish time". Figure 3 shows an example of the heterogeneous load visualization using the Bokeh library. The horizontal axis represents time, and the vertical axis represents the device name. The areas covered in red indicate that the device is executing tasks, and the blank gaps correspond to data transmission or idle time. It can be seen from the figure that the load of the CPU and GPU is basically balanced.

Fig. 3. Offline trace visualization of the heterogeneous system load.

3.2 Winograd

Table 1. The 18-variables Winograd algorithm

ID  Task               ID  Task
1   S3 = A11 − A21     12  P1 = A11 * B11
2   T3 = B22 − B12     13  U2 = P1 + P6
3   P7 = S3 * T3       14  U3 = U2 + P7
4   S1 = A21 + A22     15  U4 = U2 + P5
5   T1 = B12 − B11     16  U7 = U3 + P5
6   P5 = S1 * T1       17  U5 = U4 + P3
7   S2 = S1 − A11      18  T4 = T2 − B21
8   T2 = B22 − T1      19  P4 = A22 * T4
9   P6 = S2 * T2       20  U6 = U3 − P4
10  S4 = A12 − S2      21  P2 = A12 * B21
11  P3 = S4 * B22      22  U1 = P1 + P2

The Winograd algorithm is a variant of the Strassen algorithm. Its computing sequence is shown in Table 1. Considering the computing sequence as a set of tasks, a single-level Winograd algorithm can be abstracted into a task graph based on the dependencies between variables, as shown in Fig. 4(a). Because 18 additional intermediate matrices are needed, this algorithm is called the 18-variables Winograd algorithm. To facilitate synchronization between tasks in a heterogeneous environment, we add an empty task named "Join" to the task graph. For convenience of analysis, assume that the matrix multiplication involves square matrices, that is, m = k = n. Let the extra storage used by the above algorithm be denoted E(m, k, n). Then:

$E(m, k, n) = 4 \cdot \frac{mk}{2 \cdot 2} + 4 \cdot \frac{kn}{2 \cdot 2} + 3 \cdot \frac{mn}{2 \cdot 2} + E(\frac{m}{2}, \frac{k}{2}, \frac{n}{2})$  (1)

$E(m, k, n) = \sum_{i=1}^{\log m} \frac{1}{4^i} (4 \cdot mk + 4 \cdot kn + 3 \cdot mn)$  (2)
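The single-level computing sequence of Table 1 can be sketched in NumPy (a reference implementation, not the paper's CPU-GPU code; the seven sub-multiplications fall back to the standard matmul instead of recursing):

```python
import numpy as np

def winograd_one_level(A, B):
    """One recursion level of the 18-variables Winograd algorithm
    (Table 1), assuming even-sized square matrices."""
    h = A.shape[0] // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # additions on the inputs (tasks S1..S4, T1..T4)
    S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
    T1 = B12 - B11; T2 = B22 - T1; T3 = B22 - B12; T4 = T2 - B21

    # the seven sub-multiplications (tasks P1..P7)
    P1 = A11 @ B11; P2 = A12 @ B21; P3 = S4 @ B22; P4 = A22 @ T4
    P5 = S1 @ T1;   P6 = S2 @ T2;   P7 = S3 @ T3

    # combinations (tasks U1..U7) assembled into the output blocks
    U2 = P1 + P6; U3 = U2 + P7; U4 = U2 + P5
    C = np.empty_like(A)
    C[:h, :h] = P1 + P2  # U1
    C[:h, h:] = U4 + P3  # U5
    C[h:, :h] = U3 - P4  # U6
    C[h:, h:] = U3 + P5  # U7
    return C
```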

As the algorithm recurses, more intermediate storage is introduced. Because the algorithm's advantages are most pronounced for large-scale matrices, the storage consumption is severe. Lai et al. implemented the Winograd algorithm with only two intermediate matrices, optimizing


(a) 18-variables Winograd.

(b) 2-variables Winograd.

Fig. 4. The task graph of the Winograd algorithm with 18 and 2 additional intermediate matrices.

the number of intermediate storage matrices [12]. The task graph is shown in Fig. 4(b). The sequence of the computation is shown in Table 2; the extra column records the storage location of each result (one of the two temporary matrices X and Y, or a block of the output matrix C).

Table 2. The 2-variables Winograd algorithm

ID  Task             Stored in   ID  Task             Stored in
1   S3 = A11 − A21   X           12  P1 = A11 * B11   X
2   T3 = B22 − B12   Y           13  U2 = P1 + P6     C12
3   P7 = S3 * T3     C21         14  U3 = U2 + P7     C21
4   S1 = A21 + A22   X           15  U4 = U2 + P5     C12
5   T1 = B12 − B11   Y           16  U7 = U3 + P5     C22
6   P5 = S1 * T1     C22         17  U5 = U4 + P3     C12
7   S2 = S1 − A11    X           18  T4 = T2 − B21    Y
8   T2 = B22 − T1    Y           19  P4 = A22 * T4    C11
9   P6 = S2 * T2     C12         20  U6 = U3 − P4     C21
10  S4 = A12 − S2    X           21  P2 = A12 * B21   C11
11  P3 = S4 * B22    C11         22  U1 = P1 + P2     C11

It is assumed that the additional storage of the algorithm is denoted as R(m, k, n), which is as follows:

$R(m, k, n) = \frac{m}{2} \cdot \max(\frac{k}{2}, \frac{n}{2}) + \frac{kn}{2 \cdot 2} + R(\frac{m}{2}, \frac{k}{2}, \frac{n}{2})$  (3)

$R(m, k, n) = \sum_{i=1}^{\log m} \frac{1}{4^i} (m \cdot \max(k, n) + kn)$  (4)
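Evaluating the closed forms (2) and (4) shows the saving directly (a small sketch; the function names are ours, and storage is measured in matrix elements):

```python
import math

def extra_storage_18(m, k, n):
    # Eq. (2): (4*mk + 4*kn + 3*mn) / 4^i summed over recursion levels
    levels = int(math.log2(m))
    return sum((4 * m * k + 4 * k * n + 3 * m * n) / 4 ** i
               for i in range(1, levels + 1))

def extra_storage_2(m, k, n):
    # Eq. (4): (m*max(k, n) + kn) / 4^i summed over recursion levels
    levels = int(math.log2(m))
    return sum((m * max(k, n) + k * n) / 4 ** i
               for i in range(1, levels + 1))
```

For square matrices the per-level coefficients are 11n² versus 2n², so the 2-variables variant needs 2/11 of the 18-variables extra storage.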


Compared with the 18-variables algorithm, this algorithm introduces fewer intermediate matrices, reducing the overall storage.

4 Regression Model Predictor

4.1 “Depth First” and “Breadth First”

In the field of parallel computing, researchers have employed breadth-first and depth-first parallel strategies to avoid communication problems. Depth-first and breadth-first are alternative ways for processors to process the subproblems of a recursive problem: at a depth-first step, subproblems are executed in sequence, while at a breadth-first step, subproblems are executed in parallel. Although the breadth-first strategy reduces the communication between subproblems and exposes more parallelism, it requires extra memory compared to the depth-first strategy. In a shared-memory environment, the interleaving of depth-first and breadth-first steps affects the memory consumption of the fast matrix multiplication algorithm, the cache access pattern, the number of execution threads, and the size of the base problem, all of which lead to performance differences. In a heterogeneous environment, we can likewise expect different interleaving strategies to yield different fast matrix multiplication performance. Because the number of threads adopted in this paper is fixed, our setting differs from implementations in a hom*ogeneous environment. Since the 2-variables Winograd algorithm has more dependencies and most of its tasks execute in sequence, its execution resembles depth-first; hence, in this paper, the depth-first strategy corresponds to the 2-variables Winograd algorithm, while the breadth-first strategy corresponds to the 18-variables Winograd algorithm.

4.2 Strategy Sequence

Recursive task graph extension includes hom*ogeneous extension and heterogeneous extension. hom*ogeneous extension means that the task graph generated at each recursion level is the same, i.e., every level uses the Winograd algorithm with 18 variables, or every level uses the one with 2 variables. Heterogeneous extension means that the algorithm used at each recursion level can differ. A strategy sequence is used to describe the task graph generation: it is a string over the characters B and D, where each character represents the extension strategy of one recursion level. D represents a depth-first step, i.e., the 2-variables Winograd algorithm, and B represents a breadth-first step, i.e., the 18-variables Winograd algorithm. The length of the sequence determines the cut-off point of the fast matrix multiplication. The task graphs of the hom*ogeneous extension strategy sequence “BB” and the heterogeneous extension strategy sequence “BD” are shown in Fig. 5.
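Candidate strategy sequences up to a given recursion depth can be enumerated directly (a hypothetical helper; the paper does not specify how its candidate set is generated):

```python
from itertools import product

def candidate_strategies(max_depth):
    """All strategy sequences over {B, D} of length 1..max_depth;
    the sequence length fixes the recursion cut-off point."""
    return [''.join(p)
            for depth in range(1, max_depth + 1)
            for p in product('BD', repeat=depth)]
```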


(a) BB.

(b) BD.

Fig. 5. The task graph of the hom*ogeneous extension strategy sequence “BB” and the heterogeneous extension strategy sequence “BD”

4.3 Details of the Implementation

As noted above, recursive task graph extension can be hom*ogeneous or heterogeneous. Due to the complexity and diversity of extension strategies for the same matrix, a predictor based on the random forest regression algorithm is applied to find an approximately optimal extension strategy by predicting performance. Suppose that for a given matrix M, the set of extension strategies is {SEQ_1, SEQ_2, ..., SEQ_n}, and let G(M, SEQ_i), i = 1, 2, ..., n, denote the performance achieved with strategy SEQ_i. The predictor returns an approximately optimal extension strategy based on the predicted performance:

$Seq = \operatorname{argmax}(G(M, SEQ_1), G(M, SEQ_2), ..., G(M, SEQ_n))$  (5)

As shown in Fig. 6, the prediction consists of two phases. The first is the offline training phase, which generates performance data on the heterogeneous runtime system by running a series of extension strategies on a given dataset; the random forest regression algorithm is then trained on this performance data using the selected features (matrix size, extension strategy, recursion depth, number of temporary matrices, and maximum and minimum sizes of the temporary matrices). The second is the online decision phase, in which the features of each candidate extension strategy are generated for a given matrix, the trained model predicts the corresponding performance, and the extension strategy with the best predicted performance is selected as the output.
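The online decision phase of Eq. (5) reduces to an argmax over predicted performance (a sketch; `featurize` is a hypothetical stand-in for the paper's six features, and any regressor with a scikit-learn-style `predict` method can serve as the model):

```python
import numpy as np

def featurize(matrix_size, strategy):
    # illustrative features: size, recursion depth, counts of B and D steps
    return [matrix_size, len(strategy),
            strategy.count('B'), strategy.count('D')]

def select_strategy(model, matrix_size, strategies):
    """Predict the performance of every candidate extension strategy
    and return the one with the highest predicted value (Eq. 5)."""
    X = np.array([featurize(matrix_size, s) for s in strategies], float)
    return strategies[int(np.argmax(model.predict(X)))]
```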


Fig. 6. Regression model predictor based on random forest regression algorithm.

5 Experiment

5.1 Experimental Setup

All experiments are conducted on a heterogeneous platform consisting of an NVIDIA RTX 3090 GPU and an Intel i9-10920X CPU. The CPU runs at 3.5 GHz with 12 cores and 256 GB of memory. The GPU has 10,496 CUDA cores and 24 GB of GDDR6X memory. Our software environment is based on Ubuntu, GCC 9.0, and CUDA 11.0.

5.2 Performance Evaluation

In order to evaluate the performance of heterogeneous fast matrix multiplication quantitatively, GFLOPS is used to measure the strengths and weaknesses of each implementation. The expression for computing GFLOPS is shown below:

$GFLOPS = \frac{2n^3 \times 10^{-9}}{seconds}$  (6)
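As a sanity check, Eq. (6) in code (classical matrix multiplication of two n × n matrices performs 2n³ floating-point operations):

```python
def gflops(n, seconds):
    """Eq. (6): GFLOPS of an n x n matrix multiplication taking `seconds`."""
    return 2 * n ** 3 * 1e-9 / seconds
```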

5.3 Heterogeneous Implementation

In order to evaluate the effectiveness of the heterogeneous implementation, a series of experiments are conducted. We select a total of 113 matrix sizes at intervals of 64 from 1024 to 8192, and the extension strategies “BDB”, “BD”, “B”, “BB”, “BDDB”, “D”, “BBD”, and “DBD”. As shown in Fig. 7, the improved Winograd algorithm achieves average speedups of 1.6x, 1.5x, and 1.4x over cuBLAS, Winograd on CPU, and Winograd on GPU [12], respectively, for matrices with dimension greater than 5000, while the performance of the “BB” and “BD” extension strategies drops suddenly at a matrix size of 5200. Based on the analysis of the trace information, we find that the decrease in speedup is due to the overhead of a large-scale submatrix multiplication task executed by the CPU.

[Figure 7: three panels of speedup curves, (a) GPU Winograd, (b) CPU Winograd, (c) cuBLAS; horizontal axis: matrix size, vertical axis: speedup, one curve per extension strategy (B, D, BB, BD, BDB, BBD, DBD, BDDB).]

Fig. 7. The speedup ratio of each extension strategy relative to cuBLAS, GPU-Winograd and CPU-Winograd implementations.

5.4 Extension Strategy

Different recursive extension strategies correspond to different algorithm implementations. We conduct experimental evaluations of the extension strategies and analyze their impact on performance. We again take the 113 matrices selected from the range 1024–8192 with an interval of 64 as an example, and select several extension strategies for performance statistics.

[Figure 8: horizontal axis: matrix size, vertical axis: GFLOPS, one curve per extension strategy.]

Fig. 8. The performance comparison between different extension strategies.

It can be seen from Fig. 8 that the extension strategies achieving the best performance differ between matrices. The extension strategies


of “B”, “BB”, “D” and “BD” show the best performance when the matrix size is less than 2000. The performance of “BDB”, “DBD”, etc. improves as the matrix size increases, and “BDB”, “DBD”, and “BBD” outperform the other strategies for relatively large matrices. Therefore, using different execution strategies at different recursion levels has a significant impact on performance.

5.5 Regression Model Predictor

The performance predictor based on random forest regression can predict the performance for a given matrix. To distinguish it from the training dataset, we take 103 matrices with an interval of 70 between 1023 and 8193 as an example to illustrate the effectiveness of the predictor. In Fig. 9, the red solid line with circles represents the best performance predicted by the predictor for each matrix, and the blue solid line with triangles represents the performance of the actual optimal extension strategy for each matrix.

[Figure 9: horizontal axis: matrix size, vertical axis: GFLOPS; curves for the predicted optimum, the actual optimum, and each extension strategy.]

Fig. 9. The comparison between the predicted performance of the model and the actual performance of each extension strategy. (Color figure online)

In order to measure the correctness of the predictions, we use the root mean square error (RMSE), the maximum and minimum absolute errors, and the mean absolute error (MAE) for evaluation, as follows:

$RMSE = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2}$  (7)

$MAE = \frac{1}{m} \sum_{i=1}^{m} |y_i - \hat{y}_i|$  (8)

In the above formulas, $y_i$ is the performance of the actual optimal extension strategy for the i-th matrix and $\hat{y}_i$ is the performance predicted by the model, both in milliseconds. The evaluation results are shown in Table 3. The table shows that the performance predicted by the model is similar to that of the actual optimal
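Eqs. (7) and (8) in code, for reference:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Eq. (7): root mean square error
    d = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return float(np.sqrt(np.mean(d ** 2)))

def mae(y_true, y_pred):
    # Eq. (8): mean absolute error
    d = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return float(np.mean(np.abs(d)))
```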


strategy: the maximum error is less than 100 ms, and the minimum error is less than 0.1 ms. As shown in Fig. 10, we take two matrices as examples to evaluate the rationality of the extension strategies predicted by the model. The maximum and minimum errors for the matrix of size 4733 are 26.278 ms and 0.2217 ms, and for the matrix of size 8163 they are 120.039 ms and 5.015 ms. The performance of the extension strategies predicted by the model is thus similar to that of the actual strategies. The optimal strategies predicted by the model are “BD” and “BBD”, respectively, which are consistent with the actual optimal extension strategies. Moreover, in the online decision phase, the preprocessing time for predicting the optimal extension strategy is about 6 ms, and the proportion of this overhead decreases as the matrix size increases. Therefore, the random forest regression model designed in this paper is feasible for selecting the optimal extension strategy by predicting performance.

Table 3. The evaluation results.

Evaluation index         Milliseconds
Root mean squared error  20.087428
Mean absolute error      12.618624
Max absolute error       93.297792
Min absolute error        0.004760

(a) Matrix size 4733.  (b) Matrix size 8163.

Fig. 10. The performance of the extension strategies predicted by the model.

6 Conclusion

In this paper, a CPU-GPU heterogeneous Winograd algorithm is implemented. We propose two recursive task graph extension strategies, hom*ogeneous and heterogeneous extension, invoke different execution strategies at different recursion levels, and design a predictor based on a random forest regression model to


make the decision. In our implementation, we first perform feature calculations based on the input matrix, then invoke the trained model to obtain the optimal extension strategy, and generate the task graph based on that strategy. The task graph is the input of the runtime system, which uses a work-stealing scheduler to balance the load. Overall, our method achieves higher performance than GPU-based approaches, including cuBLAS, and the CPU-based approach.

References 1. Ballard, G., Demmel, J., Holtz, O., Lipsh*tz, B., Schwartz, O.: Communicationoptimal parallel algorithm for Strassen’s matrix multiplication. In: Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 193–204 (2012) 2. Benson, A.R., Ballard, G.: A framework for practical parallel fast matrix multiplication. ACM SIGPLAN Not. 50(8), 42–53 (2015) 3. Bini, D., et al.: O (n2. 7799) complexity for nxn approximate matrix multiplication (1979) 4. Coppersmith, D., Winograd, S.: Matrix multiplication via arithmetic progressions. In: Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, pp. 1–6 (1987) 5. D’Alberto, P., Nicolau, A.: Adaptive Winograd’s matrix multiplications. ACM Trans. Math. Softw. (TOMS) 36(1), 1–23 (2009) 6. Demmel, J., Dumitriu, I., Holtz, O., Kleinberg, R.: Fast matrix multiplication is stable. Numer. Math. 106(2), 199–224 (2007) 7. Demmel, J., et al.: Communication-optimal parallel recursive rectangular matrix multiplication. In: 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp. 261–272. IEEE (2013) 8. D’Alberto, P., et al.: The better accuracy of Strassen-Winograd algorithms (FastMMW). Adv. Linear Algebra Matrix Theory 4(01), 9 (2014) 9. Huang, J., Smith, T.M., Henry, G.M., van de Geijn, R.A.: Implementing Strassen’s algorithm with BLIS. arXiv preprint arXiv:1605.01078 (2016) 10. Huang, J., Yu, C.D., Geijn, R.A.v.d.: Strassen’s algorithm reloaded on GPUs. ACM Trans. Math. Softw. (TOMS) 46(1), 1–22 (2020) 11. Kumar, B., Huang, C.H., Sadayappan, P., Johnson, R.W.: A tensor product formulation of Strassen’s matrix multiplication algorithm with memory reduction. Sci. Program. 4(4), 275–289 (1995) 12. Lai, P.W., Arafat, H., Elango, V., Sadayappan, P.: Accelerating StrassenWinograd’s matrix multiplication algorithm on GPUs. In: 20th Annual International Conference on High Performance Computing, pp. 139–148. IEEE (2013) 13. 
Le Gall, F.: Powers of tensors and fast matrix multiplication. In: Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation, pp. 296–303 (2014) 14. Li, J., Ranka, S., Sahni, S.: Strassen’s matrix multiplication on GPUs. In: 2011 IEEE 17th International Conference on Parallel and Distributed Systems, pp. 157– 164. IEEE (2011) 15. Lipsh*tz, B., Ballard, G., Demmel, J., Schwartz, O.: Communication-avoiding parallel Strassen: implementation and performance. In: SC 2012: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11. IEEE (2012)


16. Pan, V.Y.: Strassen’s algorithm is not optimal trilinear technique of aggregating, uniting and canceling for constructing fast algorithms for matrix operations. In: 19th Annual Symposium on Foundations of Computer Science (SFCS 1978), pp. 166–176. IEEE (1978) 17. Ray, U., Hazra, T.K., Ray, U.K.: Matrix multiplication using Strassen’s algorithm on CPU & GPU. Int. J. Comput. Sci. Eng. 4(10), 98–105 (2016) 18. Stothers, A.J.: On the complexity of matrix multiplication (2010) 19. Strassen, V.: Gaussian elimination is not optimal. Numer. Math. 13(4), 354–356 (1969) 20. Strassen, V.: The asymptotic spectrum of tensors and the exponent of matrix multiplication. In: 27th Annual Symposium on Foundations of Computer Science (SFCS 1986), pp. 49–54. IEEE (1986)

Temperature Matrix-Based Data Placement Using Improved Hungarian Algorithm in Edge Computing Environments Yuying Zhao1,2 , Pengwei Wang1,2(B) , Hengdi Huang1 , and Zhaohui Zhang1 1 School of Computer Science and Technology, Donghua University, Shanghai 201620, China

[emailprotected]

2 Engineering Research Center of Digitalized Textile and Fashion Technology, Ministry of Education, Shanghai 201620, China

Abstract. The scale of data shows an explosive growth trend, with wide use of cloud storage. However, there are problems such as network latency and power costs. The emergence of edge computing brings data close to the edge of the network, making edge computing a good supplement to cloud computing. The spatiotemporal characteristics of data have been largely ignored in studies of data placement and storage optimization. To address this, we propose a temperature matrix-based data placement method using an improved Hungarian algorithm (TEMPLIH). A temperature matrix reflects the influence of data characteristics on its placement. A replica selection algorithm based on the temperature matrix (RSA-TM) can meet latency requirements. An improved Hungarian algorithm (IHA-RM) is proposed on the basis of replica selection, which balances the multiple goals of latency, cost, and load balancing. Compared with commonly used data placement strategies, experiments show that TEMPLIH can effectively reduce the cost of data placement while meeting user access latency requirements and maintaining a reasonable load balance between edge servers. Keywords: Edge computing · Data placement · Data temperature · Hungarian algorithm · Load balancing

1 Introduction

Cloud computing has developed rapidly. However, with the advent of artificial intelligence and 5G, applications continue to appear and data volumes increase, placing high demands on network latency. Hence, edge computing is in great demand because it places computing at or near the physical location of the data source, enabling faster and more reliable service. From the perspective of application providers, centralized cloud computing adapts with difficulty to frequent data interaction. It has become increasingly powerless in terms of network latency, broadband load, and data management costs. Hence, providers seek to reduce their operating costs while meeting the service requirements of users, and data caching in the edge computing environment is the object of much

© Springer Nature Switzerland AG 2022. H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 237–248, 2022. https://doi.org/10.1007/978-3-030-96772-7_22

238

Y. Zhao et al.

research. Although researchers have done much optimization work, they have focused on improving the optimization algorithms themselves in terms of latency, cost, and service quality. In fact, with increasing amounts of data, there is huge room for exploration, especially in terms of regional temporal and spatial characteristics. Whether in social networks or streaming media, there are obvious differences between individuals and regions. Therefore, in this study, we propose the concept of data temperature, which considers the temporal and spatial characteristics to model and calculate data. To be precise, a data replica placement scheme that satisfies the latency requirement is obtained based on the temperature matrix. Finally, an improved Hungarian algorithm based on the cost matrix reduces the cost of data placement while ensuring reasonable load balancing. This study makes three main contributions:

• We propose the concept of data temperature and its calculation model. On this basis, we construct a data temperature matrix, which can be used to optimize the placement of data;
• To meet users' latency needs and improve the user experience, we propose a data replica matrix selection algorithm based on the temperature matrix (RSA-TM), which can obtain a replica placement solution that meets latency requirements;
• We propose an improved Hungarian algorithm (IHA-RM) based on the data replica matrix, which satisfies user latency needs and guarantees the load balance and cost-effectiveness of data placement.

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 provides related definitions and the calculation model of the problem. Section 4 discusses the design of the algorithm. Section 5 compares our algorithm with some classic algorithms. Section 6 presents our conclusions.

2 Related Work

As the quantity of data increases, so does the number of users. The reasonable placement of data must not only meet the increasingly high service-quality requirements of users, but also take into account the constraints of system storage space and computing power in the context of large-scale data storage in a real-world environment. Current research on strategy optimization of data placement focuses on cost optimization, latency optimization, and load balancing in the cloud computing environment. Cloud computing has an on-demand usage model, and service providers hope to reduce operating costs while meeting user service requirements. Wang et al. [1] proposed a multi-cloud storage architecture. A multi-objective optimization problem was defined to minimize total cost and maximize data availability; it can be solved by a method based on non-dominated sorting genetic algorithm II (NSGA-II), yielding a set of non-dominated solutions called a Pareto optimality set. Wang et al. [2] proposed an adaptive data placement architecture that can adjust according to time-varying data access patterns and topics to minimize the total cost and maximize data availability. Wang et al. [3] proposed a method based on an ant colony algorithm for data hosting in a multi-cloud environment, constrained by optimization objectives such as cost and availability.


With the development of the network and the emergence of various applications, service providers cannot just reduce costs and ignore users' increasing latency requirements. Wang et al. [4] analyzed the geographical distribution characteristics of data centers through a clustering algorithm and proposed an effective data initialization strategy, then used a genetic algorithm to further optimize cost-effectiveness and minimize latency. Rao et al. [5] studied the problem of minimizing the total cost while ensuring quality of service across locations and times, modeling it as a constrained mixed-integer programming problem. The load balance of the system is another important factor affecting performance [6]. Pujol et al. [7] proposed an algorithm to locate connected users' data in the same service while maintaining load balance, with the aim of maintaining a better online social environment. Tran and Zhang [8] proposed a framework based on evolutionary algorithms to place data so as to minimize and balance the server load and to optimize storage efficiency. Chen et al. [9] proposed a method to explore the potential social relationships of users in social networks while balancing the workload between servers to minimize the traffic between them. The emergence of edge services can effectively provide real-time, high-bandwidth, and low-latency access to applications, and there has been much research on content placement in edge environments. Cao et al. [10] present a method that combines NSGA-II with a multi-group technique, which has better global search ability, to help users determine the cloud and edge services on which to store and access data objects. Xu et al. [11] studied the service caching problem in MEC cellular networks; an online algorithm was proposed for random online service caching in edge computing that minimizes computational latency under a long-term energy consumption constraint.
However, there is a lack of research on data placement in the edge environment, and most existing work addresses the optimization of algorithms without considering the temporal and spatial characteristics of the data. We propose temperature matrix-based data placement using an improved Hungarian algorithm (TEMPLIH), combining the temperature, replica, and cost matrices. While ensuring user latency, we reduce storage costs as much as possible and balance loads through the improved Hungarian algorithm.

3 System Model and Problem Definition

We introduce the system structure of edge data placement; define the three matrices, namely the temperature matrix, data replica matrix, and cost matrix; and define the optimization objectives and constraints.

3.1 System Framework

We define a dataset D = {d1, d2, d3, ..., dm} as the set of data blocks requested by users. The user areas R = {r1, r2, r3, ..., rN} are the access areas formed by the user sets, and are used for latency calculation. The edge servers S = {s1, s2, s3, ..., sK} comprise the edge servers in each area, provided by the service providers to store data blocks while meeting latency requirements. Each edge server is associated with a set of


attributes $\langle P_e^s, P_e^b, P_e^o, l_e \rangle$, where $P_e^s$ is the storage price, $P_e^b$ is the bandwidth price, $P_e^o$ is the operation price, and $l_e$ is the storage capacity. The relationship between edge servers, user areas, and data is shown in Fig. 1.

Fig. 1. Framework of data placement in edge environment.

3.2 Data Temperature and Calculation

Since the popularity of data access differs across regions, data have their own attributes according to the degree of access in different regions [12]. This degree of preference must consider the changes in data attributes and spatial characteristics during a certain period of time. Spatiotemporal data refers to geographic entities whose spatial elements or attributes change over time. We propose the concept of data temperature based on the attributes of the data and the regional characteristics of the data distribution. On this basis, we define that each data block contains a set of attributes $\langle d_c, d_t, d_d, d_f \rangle$, where $d_c$ is the number of clicks, $d_t$ is the number of comments, $d_d$ is the number of downloads of the video, and $d_f$ is the number of times users favorited the video. The importance $x_i$ of each data block $d_i$ is evaluated from the numbers of clicks, comments, downloads, and favorites; clicks, comments, and downloads together account for a weight of 0.8, while favorites account for 0.2, i.e.,

$x_i = 0.8(d_c + d_t + d_d) + 0.2 d_f$  (1)

The relative weight $w_i$ of data block $d_i$ is the ratio of its importance $x_i$ to the total importance of all data blocks:

$w_i = \frac{x_i}{\sum_{j=0}^{m-1} x_j}$  (2)

According to the change characteristics of data temperature, let $H$ be the temperature value of the current data, $w$ the relative importance, $H_0$ the initial temperature, and $k$ the attenuation coefficient. Temperature is positively correlated with importance and timeliness, and negatively correlated with time:

$H(t) = w \cdot H_0 \cdot e^{-kt}$  (3)

Temperature Matrix-Based Data Placement Using Improved Hungarian Algorithm

241

The data temperature matrix Tmn is defined to store the temperature values of data in different regions; i.e., entry h_{m,n} is the temperature value of data m in area n,

T_{mn} = \begin{pmatrix} h_{1,1} & \cdots & h_{1,n} \\ \vdots & \ddots & \vdots \\ h_{m,1} & \cdots & h_{m,n} \end{pmatrix}.    (4)
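To make Eqs. (1)–(4) concrete, the following sketch evaluates a small temperature matrix for a toy workload. The per-region access counts, the initial temperature H0, and the decay constant k are hypothetical values; only the 0.8/0.2 weighting is taken from the text, and the weights are normalized per region so that each column of T sums to the decayed temperature budget.

```python
import math

def importance(clicks, comments, downloads, favorites):
    """Eq. (1): importance x_i of a data block."""
    return 0.8 * (clicks + comments + downloads) + 0.2 * favorites

def weights(xs):
    """Eq. (2): relative weight w_i of each importance value."""
    total = sum(xs)
    return [x / total for x in xs]

def temperature(w, h0, k, t):
    """Eq. (3): exponentially decaying data temperature H(t)."""
    return w * h0 * math.exp(-k * t)

# Hypothetical access statistics: stats[m][n] = (clicks, comments,
# downloads, favorites) of data block m in region n.
stats = [[(100, 10, 5, 8), (40, 2, 1, 0)],
         [(10, 1, 0, 2), (300, 50, 20, 60)]]

# Temperature matrix T_{mn} of Eq. (4), evaluated at time t with H0=100, k=0.1.
t, h0, k = 1.0, 100.0, 0.1
M, N = len(stats), len(stats[0])
T = [[0.0] * N for _ in range(M)]
for n in range(N):                                     # one region per column
    xs = [importance(*stats[m][n]) for m in range(M)]
    ws = weights(xs)
    for m in range(M):
        T[m][n] = temperature(ws[m], h0, k, t)
```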

In addition, because the number of edge servers differs by region, their computing power, storage cost, and operational cost also differ. We define a regional server matrix to record servers by region:

R_{nk} = \begin{cases} 1 & \text{server } k \text{ is in area } n \\ 0 & \text{server } k \text{ is not in area } n \end{cases}    (5)

3.3 Network Latency

Satisfying the user's access latency requirements is important in the optimization of data placement strategies. We take latency as a constraint to ensure that users can access the data they want within an acceptable time, and we guarantee that the maximum response time of each request is 200 ms [13]. We use geographic distance as a rough measure of network latency, which we express as a linear function of distance. The correlation between latency and geographic distance can be obtained through network latency data collection, and the round-trip time (RTT) [14, 15] is used to calculate the data access latency,

l_m = \max_{d \in S(t)} \{ 5 + 0.02\, D(d) \},    (6)

where D(d) is the distance between the user and the data center, D is the maximum acceptable latency, and the average access latency satisfies

a(\sum_{i=1}^{M} l_i) \le D.    (7)
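The latency model of Eqs. (6)–(7) can be sketched as follows. The replica distances are hypothetical, while the 5 ms base and 0.02 coefficient follow Eq. (6); the averaging function a(·) is taken here to be the arithmetic mean, which is one plausible reading of Eq. (7).

```python
def access_latency(distances_km):
    """Eq. (6): l_m = max over replica servers d in S(t) of 5 + 0.02 * D(d)."""
    return max(5 + 0.02 * d for d in distances_km)

def meets_constraint(latencies, d_max):
    """Eq. (7): the average access latency must not exceed the bound D."""
    return sum(latencies) / len(latencies) <= d_max

# Hypothetical distances (km) from a user area to two replica servers.
lat = access_latency([1000, 2500])   # dominated by the 2500 km replica
ok = meets_constraint([lat, 100.0], 200.0)
```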

3.4 System Cost

To reduce the average access latency, the number of data copies must be appropriately increased, which increases the cost. The cost to the service provider and the average latency in responding to user requests are conflicting considerations. Placing more copies of content on edge nodes can reduce the average latency in responding to file requests, but it increases the resource usage of edge nodes. We consider the three main parts of resource usage costs, i.e., the costs of data calculation, bandwidth, and storage. At time t, the total cost of a placement plan for data di, including storage, bandwidth, and operation, is

PC = \sum_{e \in S(t)} z_i P_e^s + \sum_{e \in S(t)} z_i P_e^b + \sum_{e \in S(t)} d_c P_e^o.    (8)
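A minimal sketch of the cost model in Eq. (8): each replica server contributes a storage term, a bandwidth term, and a GET-operation term. The prices, the data size z_i, and the operation count d_c below are hypothetical.

```python
def placement_cost(replicas, z, d_c):
    """Eq. (8): PC summed over servers e in S(t).
    replicas -- list of per-server price tuples (Pes, Peb, Peo)
    z        -- data block size (drives storage and bandwidth cost)
    d_c      -- number of GET operations (drives operation cost)"""
    return sum(z * pes + z * peb + d_c * peo for (pes, peb, peo) in replicas)

# Hypothetical prices ($/GB storage, $/GB bandwidth, $/10k GETs) for 2 servers,
# with a 0.6 GB block and 3 (x10k) GET operations.
cost = placement_cost([(0.023, 0.09, 0.004), (0.020, 0.08, 0.005)], z=0.6, d_c=3.0)
```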


3.5 Load Balancing

With the explosive growth of data requiring storage and processing, maintaining a good system balance is of practical significance. If a server stores the data of a group of active users, it will receive a large number of visits, and the resulting longer response times will diminish the user experience. By maintaining a good load balance in the storage system, system performance and response speed can be improved. The load of a data placement scheme is

L = \frac{1}{K} \sum_{i=1}^{M} (U_m - U_K)^2,    (9)

where K is the total number of servers, M is the number of servers where data is placed, Um is the server utilization, and UK is the total server utilization. The smaller the value of L, the more balanced the load.

3.6 Problem Definition

The optimization goal is to perform reasonable data placement for any given data object and to give its placement plan in the edge environment, so that its cost and load at the edge reach a relatively balanced state. The entire optimization problem can therefore be defined as follows:

\min C = \sum_{m=1,n=1,k=1}^{M,N,K} E_{mnk}\, PC    (10)

\min L = \frac{1}{K} \sum_{i=1}^{M} (U_m - U_K)^2    (11)

\text{s.t.}\quad a(\sum_{i=1}^{M} l_i) \le D.    (12)
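The load metric of Eqs. (9) and (11) can be computed directly, as in the sketch below; the utilization values are hypothetical. A perfectly balanced placement yields L = 0, and any skew increases L.

```python
def load_imbalance(utilizations, mean_util, k):
    """Eq. (9): L = (1/K) * sum over placed servers of (U_m - U_K)^2.
    Smaller L means a more balanced load."""
    return sum((u - mean_util) ** 2 for u in utilizations) / k

# Hypothetical utilizations of the M servers holding replicas, K = 4 servers.
L_balanced = load_imbalance([0.5, 0.5, 0.5], mean_util=0.5, k=4)  # perfectly balanced
L_skewed = load_imbalance([0.9, 0.1, 0.5], mean_util=0.5, k=4)
```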

4 Algorithm Design

TEMPLIH consists of a data replica selection algorithm based on a temperature matrix (RSA-TM) and an improved Hungarian algorithm based on a replica matrix (IHA-RM). RSA-TM considers the characteristics of the data and obtains the data temperature matrix, which can screen suitable data and reduce unnecessary resource consumption. We select areas in which to place the data in descending order of the data temperature matrix, stop placing once the latency requirement is met, and record the data placement areas. The time complexity of calculating the temperature matrix is O(MN). We define a data replica matrix based on the temperature matrix. The placement area where data m satisfies the latency in area n is recorded as 1, and otherwise 0, i.e.,

L_{mn} = \begin{cases} 1 & \text{data } m \text{ is placed in area } n \\ 0 & \text{data } m \text{ is not placed in area } n \end{cases}    (13)
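The greedy RSA-TM loop described above can be sketched as follows. The latency_of callback is a hypothetical stand-in for the RTT model of Sect. 3.3; areas are tried in descending temperature order, and placement stops once the latency bound is met, yielding one row of the replica matrix of Eq. (13).

```python
def rsa_tm(temp_row, latency_of, d_max):
    """Greedy replica selection for one data block.
    temp_row   -- temperatures of this block across the areas (row of T)
    latency_of -- latency_of(placed_areas) -> current access latency
    d_max      -- latency requirement D
    Returns the 0/1 replica vector for this block, as in Eq. (13)."""
    order = sorted(range(len(temp_row)), key=lambda n: temp_row[n], reverse=True)
    replica = [0] * len(temp_row)
    placed = []
    for n in order:
        placed.append(n)
        replica[n] = 1
        if latency_of(placed) <= d_max:   # stop once latency is satisfied
            break
    return replica

# Hypothetical latency model: more replicas -> lower access latency.
row = rsa_tm([3, 9, 5], lambda placed: 300 / len(placed), 200)
```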


When meeting user access latency, to obtain a data placement solution at the least cost while ensuring load balance, we propose an improved Hungarian algorithm based on the replica matrix (IHA-RM). The data server placement matrix Dmk expresses the placement relationship between data m and area server k:

D_{mk} = \begin{cases} 1 & \text{data } m \text{ is placed on server } k \\ 0 & \text{data } m \text{ is not placed on server } k \end{cases}    (14)

We combine the regional server matrix Rnk and the data server placement matrix Dmk to obtain the placement cost of the data on the servers in each region according to the cost calculation formula. The cost matrix P_N = [P_1, P_2, P_3, ..., P_N] represents the placement cost of the data blocks on the servers in each region; i.e., for each of the N areas, the cost of server k storing data block m is recorded as C_{k,m}, and

P_n = \begin{pmatrix} C_{1,1} & \cdots & C_{1,m} \\ \vdots & \ddots & \vdots \\ C_{k,1} & \cdots & C_{k,m} \end{pmatrix}.    (15)

In our scenario, the numbers of data blocks and servers in each area are often unequal. Therefore, we compare the numbers of data blocks and computing resources in each area. If these are equal, the standard Hungarian algorithm can be used directly. If they are unequal, we compare the numbers of servers and data blocks. If the number of servers exceeds the number of data blocks, we add virtual data blocks (with zero cost) to create as many rows as there are servers, and then use the standard Hungarian algorithm. If there are more data blocks than there are servers, the cost matrix is split, according to the dimension of the number of servers, into a number of small matrices equal to the


number of data blocks divided by the number of servers (rounded up). If the number of data blocks in the last sub-matrix is less than the number of servers, virtual data blocks (with zero cost) are added to make it consistent with the number of servers. After completing the matrix, we use the traditional Hungarian algorithm to determine the data placement plan. The time complexity of calculating the data server matrix and cost matrix is O(NMK).
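The padding and splitting rules of IHA-RM can be sketched as follows. For brevity, the final one-to-one assignment is solved by brute force over the small padded matrices rather than by the Hungarian algorithm itself; any Hungarian implementation (e.g., scipy.optimize.linear_sum_assignment) would take its place in practice, and the cost values are hypothetical.

```python
from itertools import permutations

def pad_to_square(cost, n_servers):
    """Pad with zero-cost virtual data blocks until #rows == n_servers."""
    rows = [row[:] for row in cost]
    while len(rows) < n_servers:
        rows.append([0.0] * n_servers)
    return rows

def split_blocks(cost, n_servers):
    """If there are more data blocks than servers, split the cost matrix into
    chunks of at most n_servers rows, padding the last chunk with zeros."""
    chunks = [cost[i:i + n_servers] for i in range(0, len(cost), n_servers)]
    return [pad_to_square(c, n_servers) for c in chunks]

def min_cost_assignment(square):
    """Optimal one-to-one assignment (brute-force stand-in for Hungarian)."""
    n = len(square)
    best = min(permutations(range(n)),
               key=lambda p: sum(square[i][p[i]] for i in range(n)))
    return list(best)   # best[i] = server assigned to data block i

# Hypothetical costs: 3 data blocks, 2 servers -> split into 2 padded chunks.
chunks = split_blocks([[4, 1], [2, 3], [5, 2]], 2)
plans = [min_cost_assignment(c) for c in chunks]
```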

5 Experimental Evaluation

We introduce the simulation experiment settings and several benchmark algorithms for comparison. The experimental evaluation shows that our algorithm can balance the cost and load-balancing goals of data placement while satisfying latency requirements.


5.1 Experiment Setup

We introduce the video dataset, edge server information, and parameter settings. The dataset is a YouTube popular-video dataset with 40,726 items, including the numbers of views, shares, comments, and likes. Regional edge server information was obtained from the websites of major cloud service providers, including storage price ($/GB), bandwidth price ($/GB), GET-operation price ($/10k times), and the latitude and longitude of each edge server. The experiment was run on a computer with an Intel Core i7-7500U at 2.7 GHz, with 8 GB memory and Windows 10.

5.2 Experimental Results and Analysis

We compared the cost and load rate of TEMPLIH with those of several other data placement algorithms on the same experimental data.

• Random: The distribution relationship between the data and servers is obtained from the replica matrix, and each data block is randomly placed on a regional edge server.
• Latency-based [16]: The data are placed on the regional edge server with the lowest total network latency; we then calculate the cost and load balancing of the resulting placement.
• Cost-based [5]: According to the replica matrix, we obtain the distribution relationship between the data and servers, and place each data block on the edge server with the lowest cost.
• Load Balance [8]: After the data replica matrix that meets the latency requirement is known, the data blocks are sequentially placed on the edge servers.

The algorithm performance was evaluated by changing the number of data blocks from 6000 to 13000. The data block size was fixed at 0.6 GB, the number of servers was 425, and the server capacity was 600 GB. Figures 2 and 3 describe the load rate and cost, respectively, of the data placement schemes obtained by the five algorithms. It can be seen that the load rate of our algorithm is similar to that of the load balance algorithm, but its total average cost is 18.9% less.

Fig. 2. Comparison of load rate with changing of data blocks.

Fig. 3. Comparison of cost with changing of data blocks.


The data block size was then changed to 1.2 GB, with the number of data blocks fixed at 10,000. The number of servers and their capacities were consistent with the above experiment. Figure 5 shows the costs of the placement schemes obtained by the five algorithms. It can be seen that when the data block size becomes large, the data resources tend toward saturation and the cost is reduced. Figure 4 compares the load rates of the five algorithms. As the size of the data blocks increases, their distribution becomes more dispersed, so the load rate decreases when the number of servers and their capacities are unchanged, and the load becomes more balanced. Our TEMPLIH algorithm is less effective in cost than the solution obtained by the cost-based algorithm. However, Fig. 4 demonstrates that the load rate of the cost-based algorithm is 32.8 times that of our proposed algorithm; i.e., its load balancing is much worse than that of TEMPLIH.

Fig. 4. Comparison of load rate with changing data block size.

Fig. 5. Comparison of cost with changing data block size.

Changing the server capacity over the range from 400 GB to 650 GB, there were 10,000 data blocks of size 0.6 GB, and the number of servers remained unchanged. Figures 6 and 7 show the changes in load rate and cost of data placement, respectively, for the five algorithms when the server capacity changed. It can be seen that the load rate increases with the server capacity, because the increase in server capacity enables better placement options for data blocks when resources are relatively abundant. Combining Figs. 6 and 7, with the increase of server capacity, the load rate of our proposed algorithm is similar to that of the load balancing algorithm, while Fig. 7 shows that our proposed method (TEMPLIH) performs better than the load balancing algorithm in terms of cost, saving 16.8% in total average cost.


Fig. 6. Comparison of load rate with changing server capacity.


Fig. 7. Comparison of cost with changing server capacity.

6 Conclusion and Future Work

In the current environment, where the scale of data and the number of terminals continue to expand, demands on network latency continue to increase. In the edge environment, edge servers can take advantage of their lightweight, real-time computing capabilities and closer proximity to users to place data reasonably, which can effectively improve the user experience. However, how to use data characteristics and quickly weigh the relationships between various indicators remains an open problem in the field of data placement. Our proposed TEMPLIH can optimize the cost and load balance of data in the edge environment while meeting latency requirements. Specifically, RSA-TM and IHA-RM are adopted. Experiments show that, with respect to optimization effects, the TEMPLIH strategy, which considers the data temperature matrix, is better than the traditional multi-cloud data storage strategy. In future work, we will consider data characteristics for collaborative research on data placement and task scheduling.

Acknowledgement. This work was partially supported by the National Natural Science Foundation of China (NSFC) under Grant 61602109, the DHU Distinguished Young Professor Program under Grant LZB2019003, the Shanghai Science and Technology Innovation Action Plan under Grant 19511101802, and the Fundamental Research Funds for the Central Universities.

References

1. Wang, P., Zhao, C., Liu, W., Chen, Z., Zhang, Z.: Optimizing data placement for cost effective and high available multi-cloud storage. Comput. Inform. 39(1–2), 51–82 (2020). https://doi.org/10.31577/cai_2020_1-2_51
2. Wang, P., Zhao, C., Wei, Y., Wang, D., Zhang, Z.: An adaptive data placement architecture in multicloud environments. Sci. Program. 2020(1), 1–12 (2020). https://doi.org/10.1155/2020/1704258


3. Wang, P., Zhao, C., Zhang, Z.: An ant colony algorithm-based approach for cost-effective data hosting with high availability in multi-cloud environments. In: 2018 15th International Conference on Networking, Sensing and Control (ICNSC), pp. 1–6. IEEE (2018). https://doi.org/10.1109/ICNSC.2018.8361288
4. Wang, P., Chen, Z., Zhou, M., Zhang, Z., et al.: Cost-effective and latency-minimized data placement strategy for spatial crowdsourcing in multi-cloud environment. IEEE Trans. Cloud Comput. 1 (2021). https://doi.org/10.1109/TCC.2021.3119862
5. Rao, L., Liu, X., Xie, L., Liu, W.: Minimizing electricity cost: optimization of distributed internet data centers in a multi-electricity-market environment. In: 2010 Proceedings IEEE INFOCOM, pp. 1–9. IEEE (2010). https://doi.org/10.1109/INFCOM.2010.5461933
6. Kumar, A., Kalra, M.: Load balancing in cloud data center using modified active monitoring load balancer. In: 2016 International Conference on Advances in Computing, Communication, & Automation (ICACCA) (Spring), pp. 1–5. IEEE (2016). https://doi.org/10.1109/ICACCA.2016.7578903
7. Pujol, J.M., Erramilli, V., Siganos, G., Yang, X., et al.: The little engine(s) that could: scaling online social networks. IEEE/ACM Trans. Netw. 20(4), 1162–1175 (2012). https://doi.org/10.1109/TNET.2012.2188815
8. Tran, D.A., Zhang, T.: S-PUT: an EA-based framework for socially aware data partitioning. Comput. Netw. 75(24), 504–518 (2014). https://doi.org/10.1016/j.comnet.2014.08.026
9. Chen, H., Jin, H., Wu, S.: Minimizing inter-server communications by exploiting self-similarity in online social networks. IEEE Trans. Parallel Distrib. Syst. 27(4), 1116–1130 (2016). https://doi.org/10.1109/TPDS.2015.2427155
10. Cao, E., Wang, P., Yan, C., Jiang, C.: A cloud-edge-combined data placement strategy based on user access regions. In: 6th International Conference on Big Data and Information Analytics (BigDIA 2020), Shenzhen, China, pp. 243–250 (2020). https://doi.org/10.1109/BigDIA51454.2020.00046
11. Xu, J., Chen, L., Zhou, P.: Joint service caching and task offloading for mobile edge computing in dense networks. In: IEEE Conference on Computer Communications, pp. 207–215. IEEE (2018). https://doi.org/10.1109/INFOCOM.2018.8485977
12. Wang, P., Wei, Y., Zhang, Z.: Optimizing data placement in multi-cloud environments considering data temperature. In: 7th International Conference on Artificial Intelligence and Security, pp. 167–179. ICAIS (2021). https://doi.org/10.1007/978-3-030-78612-0_14
13. Khalajzadeh, H., Dong, Y., Grundy, J., Yang, Y.: Improving cloud-based online social network data placement and replication. In: IEEE International Conference on Cloud Computing, pp. 678–685. IEEE (2016). https://doi.org/10.1109/CLOUD.2016.0095
14. Wu, Z., Butkiewicz, M., Perkins, D., Katz-Bassett, E., Madhyastha, H.V.: SPANStore: cost-effective geo-replicated storage spanning multiple cloud services. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 292–308. ACM (2013). https://doi.org/10.1145/2517349.2522730
15. Wu, Y., Wu, C., Li, B., Zhang, L., Lau, F.: Scaling social media applications into geo-distributed clouds. IEEE/ACM Trans. Netw. 23(3), 689–702 (2015). https://doi.org/10.1109/TNET.2014.2308254
16. Li, X., Wu, J., Tang, S., Lu, S.: Let's stay together: towards traffic aware virtual machine placement in data centers. In: IEEE Conference on Computer Communications, pp. 1842–1850. IEEE (2014). https://doi.org/10.1109/INFOCOM.2014.6848123

Realtime Physics Simulation of Large Virtual Space with Docker Containers

Seiji Saito and Satoshi Fujita(B)

Graduate School of Advanced Science and Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi-Hiroshima 739-8527, Japan
[emailprotected]

Abstract. In this paper, we propose a way of distributed processing of realtime physics simulations for 3D video games with a large virtual space. The basic idea of the proposed method is to divide a given virtual space into multiple subspaces, and to simulate each subspace with a physics simulator running on a container of a virtual environment, by assuming that subspaces are sufficiently independent so that each simulation is not affected by the others. In the prototype system we have implemented, the configuration of objects in the subspace allocated to a client is exchanged among hosts every few frames through WebSocket. According to the experiments conducted with the prototype system, it is confirmed that we could achieve a sufficiently high processing speed and high frame rate by bounding the number of objects in each subspace, even if the entire virtual space contains a huge number of virtual objects exceeding 10,000.

Keywords: Edge computing · Virtual world · Physics simulation · Docker container · Cloud game

1 Introduction

The Japanese novelist Reki Kawahara wrote a light novel called Sword Art Online in 2002, which depicts various conflicts of characters playing a VRMMORPG (Virtual Reality Massively Multiplayer Online Role-Playing Game), also called Sword Art Online, launched in 2022 (in the novel). This virtual online game allows more than 10,000 players to simultaneously log in to the system, and the scenery of the virtual world, including the shape of monsters and the faces of other players, dynamically changes in realtime according to: the movement of the location and the gaze of the player (e.g., through walking or flying), interference with the virtual world (e.g., engagement in battles), and other miscellaneous game events. From a technical point of view, realtime rendering of high-resolution videos such as 4K or 8K quality is becoming a reality with the rapid progress in GPU technology. A typical example is Detroit: Become Human, a game for PlayStation 4 released by Sony Interactive Entertainment (SIE) in 2018. On the other hand, in recent years, specific game software called VR games have


become widely popular, especially in game genres such as shooting, action, simulation, and strategy, so that realistic artificial images which can be mistaken for photographs are being presented through the head-mounted displays (HMDs) of game players. Based on the above technological trends, this paper focuses on another important issue in VRMMORPGs: the scalability issue related to the complexity of the virtual space and the number of players. The basic idea of our proposed method is to divide the entire virtual space into several subspaces, and assign each subspace to a separate machine for processing, in order to keep the peak load of each physical server as low as possible. If objects in the virtual space and the gaze of the player are both stationary, we can generate a high-resolution still image within a short time by using a sophisticated rendering engine provided on the server, and even when the player's gaze dynamically changes, the rendering results for each gaze can be combined to generate a realistic video stream. Therefore, the remaining problem is how to keep track of the position and state of virtual objects in the virtual space as they are updated by external events. In this paper, we consider this challenging issue and propose a method for calculating the position and state of virtual objects without exceeding the processing capacity. Such physics simulations are generally conducted by using physics engines such as PhysX (https://github.com/NVIDIAGameWorks/PhysX), Open Dynamics Engine, and Newton Game Dynamics. It is worth noting here that in physics simulation for video games, accuracy can often be sacrificed to some extent, since high responsiveness is much more important than accuracy. In fact, it is common to treat only a limited number of objects relevant to the player as the target of physics simulations and regard the rest as static images, since the changes in the distant scene on the human retina are usually very small, if any.
However, there can exist situations in which the details of moving images which become visible as a result of the player's actions have a significant impact on the player's impression; e.g., the reader could imagine the ears of wheat rustling in the wind, or the changes in the scene of a snowstorm caused by changes in temperature. With those observations, we thought that it would be of great significance to study the basis for realizing such physics simulations in a scalable manner with as little loss of accuracy as possible. In this paper, we focus on PhysX as the concrete real-time physics engine, and investigate a way of decentralizing physics simulations using a container-based virtual environment. We implemented a prototype system consisting of one server application and one or more client applications. Each application is assigned a specific machine, where each client is executed on a Docker container to allow for live migration of clients depending on the change of the load of physical machines. The partitioning of a large virtual space into subspaces is realized by using the coordinates in the virtual space, which can dynamically change according to the load of the clients. In order to properly conduct such subspace processing, the server should designate the information on the subspace in a rigid manner, and should send it to the clients in a reliable and timely manner. To this end, we introduce a specific data format called O-data (object data) and


use WebSocket to send network commands written as text data. With the prototype system, we conducted experiments to evaluate the performance of the proposed method. The results of the experiments show that although the proposed method reduces the load of the physics simulation, the aggregation of the simulation results becomes a bottleneck, so that the host could not keep a high frame rate such as 90 fps. The remainder of this paper is organized as follows. Section 2 overviews related work. Section 3 describes the proposed method. Section 4 summarizes the results of evaluations. Finally, Sect. 5 concludes the paper with future work.

2 Related Work

The design of a scalable Cloud Gaming Platform (CGP) has been a main concern in realizing an efficient handling of requests issued by a huge number of game players in real-time. Many existing works on CGPs explore an effective way of assigning tasks to virtual machines (VMs) and assigning resources to each VM [3,5,8–10,12,18–20]. Avino et al. [1] measured the amount of CPU utilization by Docker containers while executing the game server of a multiplayer game, to evaluate the suitability of container architecture for Multi-access Edge Computing. In the experiments, they used Minecraft Pocket Edition (http://www.pocketmine.net/, version 0.10.5) as the container of the game server and employed an emulator called Genymotion (https://www.genymotion.com), which emulates an Android client, to test the behavior of mobile clients. In addition, to realize a rigorous verification, they installed the FRep Android application (http://strai.x0.com/frep/) on each emulator. The evaluation results show that for game services, the overhead due to Docker increases as the number of servers increases. Messaoudi et al. [13] evaluated the performance of Unity 3D, one of the most popular game engines, in MEC environments. Their main question was whether the computation of the game engine can be properly offloaded to edge servers, and they considered this question by dividing the game engine into several modules. The conclusions of the paper can be summarized as follows: 1) there is a high correlation between CPU and GPU consumption, and in many cases the GPU was the main cause of performance limitation; 2) the frame rates of device-friendly games were generally higher than 60 fps; 3) some modules related to rendering were mostly in standby mode, and the CPU consumption associated with those modules was not significant; and 4) in many games, the rendering process accounted for 70% of the CPU load, but in a certain class of games with complex scripts, the non-graphical components accounted for most of the CPU utilization.
Messaoudi proposed a game system called Offload 3D FPS [14] based on Unity 3D. A scene in the game system is a projection of a dynamic foreground onto a static background or a static layout, and it classifies game objects (GOs) processed by the game engine into several types. Different types of GOs are placed in the game world and controlled by modules in different manners, so that the game player


explores the virtual world through interactions with them. Offload 3D FPS tries to offload modules controlling GOs to meet the performance requirements. GamingAnywhere [6] is an open source cloud gaming platform developed by a group in Taiwan. It runs on several platforms including Windows and macOS, and can be easily customized by replacing several components with others. This architecture has two basic flows called the data flow and the control flow. The data flow is used to stream audio-video (A/V) frames from the server to clients, whereas the control flow is used to send user actions from clients to the server. In this system, every game selected by the users runs on the game server, and agents of the users run along with the selected game on the same server. The agent can be a standalone process or a module (in the form of a shared object or DLL) injected into the selected game, depending on the game type and implementation. Since the server of GamingAnywhere delivers encoded A/V frames using standard RTSP and RTP protocols, clients can watch the game play by simply accessing the corresponding URL using a standard VLC-enabled multimedia player. A fog-based architecture proposed by Kannan et al. [7] uses GamingAnywhere as the underlying platform. In this architecture, the game server is realized as a Docker container, and is created from the source code of GamingAnywhere and other necessary packages and libraries. More specifically, after selecting the target of task offload, it deploys the Docker container created from a Docker image to the selected fog node. The deployed container acts as a dedicated game server which contains necessary game resources such as video/audio encoders, decoders, and realtime streaming capabilities. Simiscuka et al. [16] proposed a social VR-IoT (Virtual Reality Internet of Things) environment in which IoT devices are shared and controlled on a virtual platform.
This environment includes a synchronization scheme called VRITESS (VR-IoT Environment Synchronization Scheme), which allows VR headsets to be used to control real-world IoT objects. VRITESS updates real objects according to instructions given in the virtual world, and vice versa. Results of experiments show that the local network testbed exhibits lower latency than the cloud testbed, and experiments on the communication protocols implemented in the cloud testbed indicate that the MQTT protocol has lower latency and less data traffic than REST-based protocols.

3 Prototype System

3.1 Overview

In this section, we describe an overview of the prototype system, which uses Docker containers as the virtual environment for executing physics operations, and PhysX as the physics simulator. We also use glut to visualize the results of the physics operations, and WebSocketpp for the communication between (virtual) machines. The prototype system consists of one server application and one or more client applications. See Fig. 1 for an illustration. Each application is assigned a specific machine, where each client is not executed directly on the physical machine


Fig. 1. Prototype system consisting of server and client applications.

but on a Docker container (this configuration is intended to allow for live migration of clients depending on the change of the load of physical machines). The program is written as a console application in C++, and the server application and the client application have the same structure as a program. Thus, when the application starts on a machine, we need to select the execution mode, i.e., whether to run as a server or a client, in addition to the URI of the WebSocket connection. If it is invoked as a server application, it immediately builds the PhysX Scene corresponding to the entire virtual space and starts the glut rendering of the space; if it is invoked as a client application, it transits to a waiting state to accept requests from the server. It then builds the PhysX Scene corresponding to an assigned subspace according to instructions received from the server application.

3.2 Partitioning into Subspaces and Assigning to Clients

In the proposed method, a large-scale virtual space is divided into several subspaces to reduce the machine load of the physics simulation. In the following explanation, the number of clients and the number of subspaces are both fixed at two, and clients and subspaces are distinguished with the name A or B. In the prototype system, the server is responsible for the entire space, and each client is responsible for one subspace. The partitioning of the whole space into subspaces is realized by using the coordinates in the PhysX Scene, e.g., whether or not the value of the x-coordinate exceeds 0. It is also possible to change the boundary of the partition according to the load of the clients. A client conducts the processing of an assigned subspace, which means the physics simulation of objects whose coordinates are contained in the subspace. In order to properly conduct such subspace processing, the server should designate

Table 1. Network commands.

Command | Data          | Explanation
--------|---------------|------------------------------
Init    | None          | Initialization of PhysX Scene
Object  | O-data        | Update of PhysX Objects
Input   | Input keydata | Process the input keydata
Return  | None          | Return O-data to the server

the information on the subspace in a rigid manner, and should send it to the client in a reliable and timely manner. In other words, we should determine the way of representing the subspace information and the way of transferring the represented information. In the prototype system, we introduce a specific data format called O-data (object data) for the former, and for the latter, we use WebSocket to send network commands written as text data (see Table 1 for an illustration). In summary, the allocation (and updates) of a subspace to a client is realized in the following two steps: 1) the server creates an O-data for each subspace and packs it into a network command; and 2) the server sends the created network command to each client through WebSocket. The result of physics operations is collected by the server through the client returning another O-data to the server.

3.3 Distributed Simulation of the Virtual Space

After assigning a subspace to a client, the physics simulation using PhysX is conducted on each client, almost the same as when the entire space is simulated on a single machine. The server, on the other hand, does not conduct such a physics simulation, but only maintains the position and angle of each PhysX Object in the entire virtual space (note that to generate the game view, the rendering of the virtual space should also be conducted by glut, though it can be turned off). Before starting the physics simulation, each client receives O-data from the server through network commands to reflect the objects of the assigned subspace into the scene. The O-data contains all objects which should exist in that scene in a mixed manner: some objects in the O-data already exist in the scene and others do not. We therefore conduct a matching of object IDs, in such a way that if an object with the same ID already exists in the scene, the information on the object is updated with the O-data, and otherwise a new object is added to the scene. Such an addition of objects can be done while the physics simulation is running. However, if the added object intersects with an existing object, it would lead the physics simulation to a wrong result. Thus, to avoid such an intersection, the prototype system adds a new object to the scene at a position slightly higher than the position designated in the O-data, based on the intuition

Realtime Physics Simulation of Large Virtual Space with Docker Containers

Fig. 2. Virtual space to be simulated.

Table 2. Specifications of machines.

Name      CPU                  RAM    GPU
Host      Intel Core i3-7100   8 GB   NA
Client A  Intel Core i7-7700   16 GB  GeForce GTX 1070 Ti
Client B  Intel Core i7-7700K  16 GB  GeForce GTX 1080

that objects being simulated are less likely to be in a high position due to the eﬀect of gravity.
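The matching step described above can be sketched as follows, assuming the scene is modeled as a dictionary keyed by object ID and that the vertical axis is y. The offset constant is a hypothetical value, since the paper does not quantify "slightly higher".

```python
# Sketch of the object-matching step when reflecting O-data into a scene.
SPAWN_OFFSET = 0.1  # assumed lift above the designated position, in scene units

def reflect_o_data(scene, o_data):
    for obj in o_data:
        oid = obj["id"]
        if oid in scene:
            # Known ID: update the existing object's state in place.
            scene[oid].update(pos=obj["pos"], angle=obj["angle"])
        else:
            # New ID: add the object slightly above its designated position
            # so that it does not intersect an existing object mid-simulation.
            x, y, z = obj["pos"]
            scene[oid] = {"pos": [x, y + SPAWN_OFFSET, z], "angle": obj["angle"]}
    return scene

scene = {1: {"pos": [0, 0, 0], "angle": [0, 0, 0]}}
reflect_o_data(scene, [{"id": 1, "pos": [2, 0, 0], "angle": [0, 0, 0]},
                       {"id": 2, "pos": [5, 1.0, 0], "angle": [0, 0, 0]}])
```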

4 Evaluation

4.1 Setup

To evaluate the performance of the proposed method, we conducted experiments using the prototype system. In the experiments, we use one host machine and two client machines, which are referred to as Host, Client A, and Client B, respectively. The speciﬁcations of those machines are summarized in Table 2. In the experiments, we conducted simulations of a virtual space (i.e., PhysX Scene) illustrated in Fig. 2, which consists of two subspaces isolated by a big green wall and several small walls enclosing a large number of PhysX Objects. When it is

Fig. 3. Scatter plot of the execution time of physics simulation conducted on Client A: (a) without GPU; (b) with GPU.

Fig. 4. Average execution time of physics simulation conducted on Client A: (a) without GPU; (b) with GPU. (Color figure online)

simulated with two clients, the subspace in front of the green wall is assigned to Client A and the other subspace to Client B. During the simulation, one of the enclosing walls in each subspace moves left and right to stir up the PhysX Objects inside, intentionally causing collisions of objects so as to keep the load of the physics simulation sufficiently high. In the following, to clarify the effect of GPU acceleration and Docker virtualization on the physics simulation, we also evaluate the performance without GPU and with the application executed directly on the target machine without Docker. In addition, to clarify the effect of decentralization, we evaluate the performance on a single machine, indicated as "local" in the figures of the experimental results.

4.2 Effect of Decentralization for Reducing the Simulation Time

At ﬁrst, we evaluate the eﬀect of decentralization in terms of the reduction of the simulation time. Figure 3 summarizes the distribution of the simulation time per


Fig. 5. The number of frame updates per second (fps) observed by the host.

frame conducted by Client A for each number of objects, where we evaluated 150 frames for a ﬁxed number of objects. The simulation time was measured with a C++ library called chrono, and we did not conduct rendering since our objective is to evaluate the performance of the physics simulation. For reference, this ﬁgure includes the results without distributed processing (i.e., local) and with Docker (i.e., docker). The diﬀerence between ﬁgures (a) and (b) is whether GPU is used or not. From these results, it can be seen that the simulation time of most frames is within the target time of 20 ms with the decentralization, while it is often not within 20 ms when the simulation is conducted on a single machine (i.e., local). Figure 4 shows the average simulation time for each number of objects. From this ﬁgure, we can observe that the average simulation time increases almost in proportion to the number of objects, indicating a reduction in the simulation time due to decentralization and an increase in the simulation time due to the use of Docker. In addition, we can ﬁnd that the use of GPU certainly reduces the average simulation time. Note that in this ﬁgure, the target time of 20 ms (corresponding to 100 fps) is indicated by a horizontal blue dashed line. To properly evaluate the eﬀect of decentralization in a consecutive task such as real-time physics simulation, it is not enough to look at the average simulation time per frame, but it is also necessary to evaluate the number of frames to be processed per ﬁxed time including the time required for communicating with the host; i.e., the throughput. To this end, we count the number of updates per second, where the timing of update is when the reﬂection of O-data to the host is completed. The results are shown in Fig. 5. 
This ﬁgure summarizes the number of frame updates per second (fps) which are measured 60 times for a ﬁxed number of objects, where (a) shows the scatter plot of fps values and (b) is the average fps value for each number of objects. From these ﬁgures, it can be

Fig. 6. Average processing time for O-data: (a) time for reflection; (b) time for creation.

found that the fps value is more stable in the single-machine case than with distributed processing. In particular, with distributed processing, the variation of the fps values increases with the number of objects. The reason for this phenomenon is discussed below.

4.3 Overhead of Decentralization

Finally, to evaluate the overhead associated with the decentralization, we measured the time required for creating and reflecting O-data. The former is the time required to create O-data on Client A, and the latter is the time required to reflect O-data received from Client A on the host. Note that since half of the objects are processed on Client A in our setup, the number of objects to be reflected is half of the number of objects present on the host. The average time for reflecting O-data is shown in Fig. 6(a), which includes the results with Docker and GPU for comparison. In all cases, the time required to reflect O-data is proportional to the number of objects; e.g., when there are 10000 objects, the average time required to reflect O-data is about 12 ms. Figure 6(b) shows the average time taken to create O-data on Client A. From this figure, we can see that the creation time is the shortest when Docker is used, followed by GPU without Docker. For example, the creation time for 8000 objects is about 13 ms when Docker is used, while it takes about 20 ms in the other cases, which equals the length of one frame. This indicates that although the physics simulation itself becomes faster as the number of objects allocated to each client decreases, our current implementation cannot achieve sufficient throughput due to the bottleneck of returning the simulation results to the host, which explains the degraded behavior of the proposed method observed in the previous subsection.

5 Concluding Remarks

This paper proposed a distributed processing method for the physics simulation of a large virtual space. By using PhysX as the physics simulator and Docker containers as the virtual machine environment, we realized a prototype system which does not depend on specific platforms. Experiments conducted with the prototype system confirmed that although the distributed processing certainly reduces the load of the physics simulation, the aggregation of the simulation results at the host becomes a bottleneck, so that the host cannot maintain a sufficiently high frame rate such as 100 fps. Our future work includes the optimization of the creation and reflection of O-data, investigation of methods for dealing with cases in which the subspace is not explicitly separated and objects come and go across boundaries, and automatic scaling with increases or decreases in the load. We plan to use orchestration tools such as Kubernetes for dynamic partitioning of tasks and migration between physical computers.


A Deep Reinforcement Learning-Based Approach to the Scheduling of Multiple Workflows on Non-dedicated Edge Servers

Yongqiang Gao1,2,3(B) and Ke Feng2,3

1 Engineering Research Center of Ecological Big Data, Ministry of Education, Hohhot 010021, China
2 Inner Mongolia Engineering Laboratory for Cloud Computing and Service Software, Hohhot 010021, China
3 College of Computer Science, Inner Mongolia University, Hohhot 010021, China
[emailprotected], [emailprotected]

Abstract. Prompted by the remarkable progress in mobile communication technologies, more and more users are starting to execute their workflow applications in the mobile edge computing environment. Scheduling multiple parallel workflows on a non-dedicated edge server is a great challenge because of different users' requirements. In this paper, we propose an approach based on Deep Reinforcement Learning (DRL) to schedule multiple workflows on an edge server with multiple heterogeneous CPUs so as to minimize the violation rate of the service level agreements of workflows. The effectiveness of our proposed approach is evaluated by simulation experiments based on a set of real-world scientific workflows. The results show that our approach performs better than the current state-of-the-art approaches applied to similar problems.

Keywords: Edge server · Scientific workflows · Multiple workflows scheduling · Resource allocation · Deep reinforcement learning

1 Introduction

Due to the rapid development of scientific computing, scientific workflow applications have become extensive data applications, requiring large-scale infrastructure to execute reasonably. The inherent resources of mobile devices cannot meet their regular operation; therefore, scheduling workflows on heterogeneous

Supported in part by the National Natural Science Foundation of China under Grant 61662052, in part by the Natural Science Foundation of Inner Mongolia Autonomous Region under Grant 2021MS06002, in part by the Science and Technology Planning Project of Inner Mongolia Autonomous Region under Grant 2021GG0155, and in part by the Major Research Plan of Inner Mongolia Natural Science Foundation under Grant 2019ZD15.

© Springer Nature Switzerland AG 2022
H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 261–272, 2022. https://doi.org/10.1007/978-3-030-96772-7_24


Y. Gao and K. Feng

resources has become an urgent problem to be solved. Previous studies usually schedule the scientific workflows generated by applications to cloud computing platforms with powerful computing resources [1]. However, the cloud computing platform is far away from users, which increases communication cost and energy consumption and is fatal to applications requiring low latency. Edge computing can solve this problem as a complementary computing platform between mobile devices and the remote cloud. In this distributed architecture, the large-scale services originally handled by the central node are cut into smaller parts and distributed to edge nodes closer to users for processing, which significantly reduces delay and energy consumption. Existing studies usually focus on offloading single or multiple workflows to the edge server [2], and there is little literature on scheduling multiple workflows on a non-dedicated edge server. In this paper, we propose a multiple-workflows scheduling algorithm based on DRL to assign workflow tasks to appropriate CPUs on a non-dedicated edge server, so as to reduce the violation rate of the service level agreements of workflows and improve the QoS of the edge server. The contributions of this paper are as follows:

– We investigate the scheduling problem of multiple workflows on a non-dedicated edge server with multiple heterogeneous CPUs to minimize the violation rate of the service level agreements of workflows.
– We formulate the scheduling problem as a constrained optimization model and propose a novel PRDDQN algorithm based on DRL to solve it. The proposed PRDDQN utilizes a new sample storage structure to optimize the sampling process.
– We evaluate the effectiveness of our approach by simulation experiments conducted on real-world scientific workflows. The results show that, compared with other alternatives, our approach has better performance.

The rest of this paper is organized as follows.
The related works are summarized in Sect. 2. Section 3 describes the models and problem formulation. Section 4 describes the proposed PRDDQN algorithm in detail. The experimental results are presented in Sect. 5. Finally, the conclusion is drawn in Sect. 6.

2 Related Work

As is well known, the workflow scheduling problem is NP-hard [3], so it is difficult to find an optimal solution for the problem. Typically, two kinds of methods are used to solve this problem: heuristic algorithms and meta-heuristic algorithms. Among heuristic algorithms, Yuan et al. [4] proposed a DBL algorithm with deadline constraints. The algorithm divides the nodes of the same layer into the same group from the bottom to the top based on deep reverse layering of nodes, and then uses the reverse layering to transform the deadline of the workflow into time intervals of activities to optimize the cost locally. In addition, there are many heuristic algorithms optimizing different objectives, such as accuracy, reliability, etc. [5–8]. Among meta-heuristic algorithms, Gao et al. [9] proposed a

Multiple Workﬂows Scheduling in Mobile Edge Server


new Pareto-based multi-objective workflow scheduling algorithm, HGAABC. It combines the exploitation capability of ABC [10] with the exploration capability of GA [11], and maps each task to an instance of the corresponding virtual machine type according to the pay-per-use pricing model to reduce the cost of virtual machines and the make-span of the workflow. Rizvi et al. [12] proposed a scheduling method, HBDCWS, which minimizes the scheduling time and cost of the workflow by allocating a budget and deadline for the workflow in advance. Unlike these studies, we propose an approach based on DRL to schedule multiple workflows so as to minimize the violation rate of the service level agreements of workflows. Some recent literature has studied how to use machine learning algorithms to schedule workflow applications in a cloud computing environment. Tong et al. [13] proposed a deep Q-learning scheduling algorithm, which combines the advantages of the Q-learning algorithm and deep neural networks; its target is to minimize the make-span of the workflow and maximize load balancing. Dong et al. [14] developed an Actor-Critic algorithm and designed a new P-Network model to predict the queuing order of tasks and reduce the average execution time of the workflow. Wang et al. [15] proposed a multi-agent DQN algorithm in which the optimized target cost and total execution time of the workflow are regarded as a Markov game between two agents, finally obtaining the Nash equilibrium of the two optimization objectives. Different from these studies, we focus on the scheduling problem of multiple workflows on non-dedicated edge servers.

3 Problem Modeling

3.1 System Model

Assume there are many users in a particular geographic area, and a base station with a non-dedicated edge server is located in this area. In this paper, we consider a non-dedicated edge server with multiple heterogeneous processor resources represented by CPUS = {CPU1, CPU2, ..., CPUl}. This non-dedicated edge server includes a workflow scheduler. At different times, these users can submit tasks associated with workflows to the workflow scheduler via a cellular mobile network or Wi-Fi. Next, the workflows are sent to the proposed PRDDQN algorithm, which is used to find the optimal placement for each workflow task. Finally, the results are collected and returned to the corresponding users. Figure 1 shows the proposed system architecture.

3.2 Workflow Model

We utilize WF = {wf1, wf2, ..., wfm} to denote a system composed of multiple scientific workflows. A scientific workflow submitted by a user can be represented as a Directed Acyclic Graph (DAG) G = (T, D), where T = {tm,0, tm,1, ..., tm,n} is the set of tasks of workflow m represented by vertices, and D = {dm,i,j | tm,i, tm,j ∈ T} is the set of dependencies between


Fig. 1. System architecture.

tasks tm,i and tm,j represented by directed edges. A dependency dm,i,j indicates a constraint between tasks tm,i and tm,j: task tm,j can start to execute only after task tm,i has completed its execution on the corresponding CPU and transferred all data to task tm,j. Therefore, task tm,i is called a predecessor of task tm,j, and task tm,j is called a successor of task tm,i. A task without any predecessor is called the entry task tentry; similarly, a task without any successor is called the exit task texit. A task tm,i may have multiple predecessors or successors, denoted pr(tm,i) and su(tm,i). Task tm,i is ready only when all the predecessors of task tm,i have been completed. In addition, each edge dm,i,j has a weight representing the data transferred from task tm,i to task tm,j; its transmission time is short enough to be ignored in this paper. Each task tm,i has its length, also called workload, expressed as LDm,i. When a user submits a workflow, they specify a deadline for it, denoted Deadlinem, which the edge server must observe; otherwise, the service level agreement of the workflow is violated, which reduces the QoS of the edge server.

3.3 Scheduling Model

The completion time of each workflow is called the make-span, denoted by MSm. Because texit is the last task of the workflow to be executed, the make-span is equivalent to the completion time of texit. MSm can be calculated as

MSm = CTm,texit,k   (1)

where CTm,texit,k is the completion time of the last task in the workflow. A task can only be scheduled to one CPU, and the CPU does not release its resources until the task completes. When task tm,i is scheduled to cpuk, its run-time can be calculated as

RTm,i,k = LDm,i / PPk   (2)

where PPk represents the processing performance of cpuk in terms of Million Instructions Per Second (MIPS). The earliest start time of task tm,i can be calculated as

STm,i,k = 0, if tm,i = tentry;
STm,i,k = max( max{tm,q̂ ∈ SC(k)} CTm,q̂,k , max{tm,p ∈ pr(tm,i)} CTm,p,k̂ ), otherwise   (3)

where SC(k) represents the collection of all tasks scheduled on cpuk, CTm,q̂,k is the completion time of task tm,q̂ on cpuk, and CTm,p,k̂ is the completion time of a direct predecessor task of task tm,i. Therefore, the completion time CTm,i,k of task tm,i can be calculated as

CTm,i,k = STm,i,k + RTm,i,k   (4)

Our target is to schedule multiple workflows to appropriate CPUs to minimize the violation rate of the service level agreements of workflows, that is, to make every workflow complete before its deadline as far as possible. Therefore, the scheduling problem can be formulated as

Minimize VSLA = ( Σ{m ∈ WF} Vm ) / SIZE(WF)   (5)

Subject to

Σ{k=1..SIZE(CPU)} Σ{s=1..SIZE(TQ)} xm,i,s,k = 1, ∀m ∈ WF, ∀i ∈ Tm   (6)

Σ{m=1..SIZE(WF)} Σ{i=1..SIZE(Tm)} xm,i,s,k = 1, ∀k ∈ CPU, ∀s ∈ TQ   (7)

STm,i,k ≥ max{tm,p ∈ pr(tm,i)} ( STm,p,k̂ + RTm,p,k̂ )   (8)

STm,i,k + RTm,i,k ≤ min{tm,q ∈ su(tm,i)} STm,q,k̂   (9)

xm,i,s,k ∈ {0, 1}   (10)

Constraint (10) defines the value range of the decision variables xm,i,s,k, where xm,i,s,k indicates whether the ith task of workflow m is assigned to the sth location in the waiting queue of CPUk. Tm represents the set of all tasks of workflow m, CPU represents the set of processors, WF represents the workflow set to be scheduled, and TQ represents the task queue. Constraint (6) ensures that each task can only appear in the task queue once. Constraint (7)


ensures that each position in the task queue can only be occupied by one task. Constraints (8) and (9) express the dependencies between tasks in the workflow. In (5), VSLA represents the violation rate of the service level agreements of all scheduled workflows, and Vm indicates whether workflow m violates its service level agreement: when the total execution time of wfm exceeds Deadlinem, Vm is 1; otherwise, Vm is 0.
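Under a fixed assignment of tasks to CPU queues, Eqs. (2)-(4) can be evaluated by a single list-scheduling pass. The sketch below is illustrative only; the task names, workloads, and MIPS values are made up.

```python
# Toy evaluation of Eqs. (2)-(4): run-time RT = LD / PP, earliest start time
# ST = max(CPU-available time, latest predecessor completion), CT = ST + RT.
def simulate_schedule(tasks, preds, assign, pp):
    """tasks: {tid: workload LD}; preds: {tid: [predecessor tids]};
    assign: list of (tid, cpu) in dispatch order; pp: {cpu: MIPS}."""
    cpu_free = {k: 0.0 for k in pp}   # completion time of the last task on each CPU
    ct = {}                           # completion times CT
    for tid, k in assign:
        rt = tasks[tid] / pp[k]                                        # Eq. (2)
        st = max([cpu_free[k]] + [ct[p] for p in preds.get(tid, [])])  # Eq. (3)
        ct[tid] = st + rt                                              # Eq. (4)
        cpu_free[k] = ct[tid]
    return ct

# Two-CPU example: t2 and t3 depend on t1; the make-span is the CT of the exit task.
ct = simulate_schedule(
    tasks={"t1": 100.0, "t2": 200.0, "t3": 100.0},
    preds={"t2": ["t1"], "t3": ["t1"]},
    assign=[("t1", "cpu1"), ("t2", "cpu1"), ("t3", "cpu2")],
    pp={"cpu1": 100.0, "cpu2": 50.0},
)
```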

4 The Prioritized Replay Double DQN Algorithm

The multiple workflows scheduling problem is NP-hard, so we propose an algorithm, PRDDQN, based on DRL to find an approximate optimal solution.

4.1 Algorithm Theory

The parameter updating method of the neural network in the traditional DQN algorithm may lead to an overestimation problem in which the Q-value becomes excessively large. Therefore, we introduce another neural network to eliminate the influence of such maximization errors:

Q(St, At) ← Rt+1 + γ Q(St+1, arg max{a} Q(St+1, a; ω); ω−)   (11)
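The target computation of Eq. (11) can be sketched as follows: the online network (weights ω) selects the action, and the target network (weights ω−) evaluates it. The numeric Q-values are made up for illustration.

```python
# Double-DQN target per Eq. (11): action selection and action evaluation
# are split across two networks to reduce Q-value overestimation.
def double_dqn_target(r, gamma, q_online_next, q_target_next, final):
    """q_online_next / q_target_next: per-action Q-values for state S_{t+1}."""
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r if final else r + gamma * q_target_next[a_star]

y = double_dqn_target(1.0, 0.9,
                      q_online_next=[0.2, 0.8, 0.5],
                      q_target_next=[0.3, 0.6, 0.9],
                      final=False)
```

Note that the online network picks action 1 even though the target network assigns its highest value to action 2; this decoupling is exactly what suppresses the overestimation bias.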

In addition, the PRDDQN algorithm uses an experience replay mechanism. During the interaction between the agent and the environment, the data obtained by the agent are put into a replay memory. When the parameters of the neural network need to be updated, a mini-batch of samples is drawn from the replay memory to train the neural network. Similar to the PRDDQN algorithm, the DQN algorithm also uses an experience replay mechanism, but its sampling method is uniformly random. This has an apparent defect: some randomly drawn samples have little effect on the training of the neural network, so there is no need to draw such samples. To overcome this shortcoming, the proposed PRDDQN algorithm uses a new storage structure called sumtree, a binary tree in which each leaf node stores the priority P of one sample and each non-leaf node has only two branches, whose value is the sum of the two branches, so that the top node of the sumtree is the sum of the priorities P of all leaf nodes. When sampling, we divide the total priority at the top node into as many intervals as there are samples to draw, randomly select a number within each interval, and use that number to locate the corresponding sample. Through the sumtree, the PRDDQN algorithm can sample the data that are really worth learning.
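A minimal sumtree supporting the priority updates and interval-based sampling described above might look like the following; the array layout and interface are our assumptions, not the paper's implementation.

```python
import random

# Minimal sumtree: leaves hold sample priorities; each internal node stores
# the sum of its two children, so the root holds the total priority.
class SumTree:
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)  # 1-based heap; leaves at [capacity, 2*capacity)

    def update(self, leaf, priority):
        i = leaf + self.capacity
        self.tree[i] = priority
        while i > 1:                        # propagate the change up to the root
            i //= 2
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def total(self):
        return self.tree[1]

    def sample(self, batch):
        """Stratified sampling: split [0, total) into `batch` intervals and
        walk down the tree with one random value per interval."""
        leaves, seg = [], self.total() / batch
        for b in range(batch):
            v, i = random.uniform(b * seg, (b + 1) * seg), 1
            while i < self.capacity:        # descend until a leaf is reached
                left = 2 * i
                if v <= self.tree[left]:
                    i = left
                else:
                    v -= self.tree[left]
                    i = left + 1
            leaves.append(i - self.capacity)
        return leaves

tree = SumTree(4)
for leaf, p in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree.update(leaf, p)
picked = tree.sample(2)
```

High-priority leaves cover larger sub-intervals of [0, total), so they are drawn proportionally more often, while both updates and lookups cost O(log capacity).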

4.2 Algorithm Framework

The pseudo-code of the PRDDQN algorithm is presented in Algorithm 1. The main steps of the algorithm are as follows: ﬁrstly, initializing the variables and


Algorithm 1: Framework of PRDDQN algorithm

Input: budget T, mini-batch m, decay factor γ, exploration rate ε, replay interval I and replay capacity N, exponents α and β, the number of leaf nodes of sumtree R, Q target network parameter update frequency C
Output: scheduling scheme A
1: Initialize action-value function Q with random weights ω; initialize target action-value function Q̂ with random weights ω− = ω;
2: Initialize the structure of sumtree; initialize the priority Pj of its R leaf nodes to 1;
3: Initialize the agent and environment, including workflows and CPUs;
4: for t = 1 to T do
5:   The agent observes the current state St and obtains its eigenvector φt;
6:   With probability ε select a random action at, otherwise select at = arg max{a} Q(φ(St), a; ω);
7:   The agent takes action at, gets reward rt according to (12), and gets whether it is in the termination state final_t;
8:   St = St+1;
9:   Store transition (φt, at, rt, γt, final_t, φt+1) in sumtree with maximal priority Pt = max{i} Pi;

ti           ti > tj if i > j
ki           A state key
vj           The transaction state version; vj is a newer version than vj−1
tsid         Transaction serial ID
tn           The transaction that updates the current state; n is its tsid
skvji,tn     The leaf node of MVM-DAG
MVM-DAG      The Multi-Version Merkle Directed Acyclic Graph we proposed
active list  The active transaction list while the snapshot is generated
WS           The write set of a transaction
Pw           The probability of write operations

Definition 1 (Cross-chain Transaction). A complete interoperable transaction on one blockchain side is denoted by a cross-chain transaction. A cross-chain transaction is composed of multiple sub-transactions to interact with the other blockchain. The following transaction operation interfaces need to be implemented:

– BeginTx starts a cross-chain transaction on one blockchain.
– ContinueTx continues the cross-chain transaction after the execution of the other blockchain.

A MVCC Approach to Parallelizing Interoperability


– CommitTx commits all operations of a cross-chain transaction.
– RollbackTx rolls back all operations of the cross-chain transaction.

As shown in Fig. 1, a blockchain commits only the validated transactions in a block at ti while applying their operations to the world state. Thus, for a cross-chain transaction that operates across blocks, the temporary data generated by the sub-transactions before the final submission should be invisible to other transactions; otherwise, integrity would be destroyed, as shown in Example 1.

Example 1. In an interoperable transaction, the account on blockchain A first deducts the transfer amount, and then blockchain B should increase the corresponding account balance. If the execution on B fails, the A system must roll back. At the same time, there is another contract on A that grants a subsidy based on the account balance. The contract is executed after A deducts the money, and the corresponding account receives the subsidy. Then A rolls back that account due to the transaction execution failure on the B system. Thus, the account can obtain subsidies through a system loophole.

It is the dirty read on invisible temporary data that causes the problem in Example 1. Contract locks tackle this by blocking invisible versions of data via data locks. However, they also decrease efficiency, as shown in Example 2.
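The four interfaces of Definition 1 can be sketched as follows on a single chain, with a write set recording old values for rollback. This is an illustrative sketch, not the paper's implementation: visibility control of the tentative writes via version matching is omitted here.

```python
# Hedged sketch of the four cross-chain transaction interfaces on one chain.
# `state` is the world state; `ws` records original values to support rollback.
class CrossChainTx:
    def __init__(self, state):
        self.state, self.ws, self.active = state, {}, False

    def begin_tx(self):             # BeginTx: start the cross-chain transaction
        self.active = True

    def continue_tx(self, writes):  # ContinueTx: apply sub-transaction writes
        assert self.active
        for key, value in writes.items():
            self.ws.setdefault(key, self.state.get(key))  # remember the old value
            self.state[key] = value                       # tentative update

    def commit_tx(self):            # CommitTx: make all tentative writes final
        self.ws.clear()
        self.active = False

    def rollback_tx(self):          # RollbackTx: restore values from the write set
        for key, old in self.ws.items():
            if old is None:
                self.state.pop(key, None)
            else:
                self.state[key] = old
        self.ws.clear()
        self.active = False

state = {"accountA": 100}
tx = CrossChainTx(state)
tx.begin_tx()
tx.continue_tx({"accountA": 70})  # e.g. deduct a transfer amount
tx.rollback_tx()                  # the other chain failed: undo the deduction
```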

Fig. 2. A read-write blocking case

Example 2. Figure 2 illustrates two cross-chain transactions on an identical blockchain. Tx1 writes (puts) the state of kB (KeyB) first, and then reads (gets) the state of kA (KeyA). Then Tx2 writes the state of kA. Meanwhile, related transactions involving kA (kB) include Txi, ..., Txj (Txm, ..., Txn).


W. Lin et al.

In the contract lock scheme, Tx1 cannot read the state of kA until t3 because Tx2 holds a lock on kA in advance. Finally, Tx1 completes the read of kA after Tx2 is submitted, and submits at t5. Since Tx1 writes kB and holds a lock on it, Txm, ..., Txn can all be blocked due to the inability to access kB. Moreover, Txm, ..., Txn can block more transactions shortly afterward, dragging down the overall performance. On the contrary, with MVCC, Tx1 can read the state of kA without blocking and therefore does not block other data-related transactions. We aim to find an MVCC approach for blockchain interoperability satisfying: (1) integrity ensured: each transaction obtains its visible version, and invisible versions are blocked; (2) data locks removed: good performance is achieved by reducing the read-write blocking caused by data locks.

4 Multi-version Merkle Directed Acyclic Graph

The original MVCC on HLF only supports the latest-version match and does not make full use of the historical data of the blockchain, as discussed in Sect. 2. This inspired us to utilize these data to realize traceability of state versions. Thus, we transform the Merkle tree into a Multi-Version Merkle Directed Acyclic Graph (MVM-DAG) to store and trace these versions of data.

Fig. 3. Multi-Version Merkle Directed Acyclic Graph, MVM-DAG

Figure 3 is an MVM-DAG example. Each block in the blockchain contains a Merkle tree keeping the historical state data. Each leaf has a pointer to its previous state to effectively support state traceability. A leaf can be expressed as s_{kv^i_j, t_n} = {value, point(s_{kv^{i-1}_j, t_m})}. A transaction can access its visible versions of states according to tsid via the structure proposed above. Thus, the system effectively avoids the accumulation

A MVCC Approach to Parallelizing Interoperability


of reading requests caused by write locks. Since it is hard to keep a uniform transaction number across blockchains, the ID of a cross-chain transaction is determined by its first transaction on a chain. For example, tx2 is a cross-chain transaction with tsid t2, determined by BeginTx. During its execution, k1 is updated to v2 and v3, which are temporary data until block n is submitted. Note that they are neither visible nor accessible to other transactions.

5 Algorithms

We implement MVCC to parallelize interoperability via MVM-DAG. We re-implement PutState and GetState for HLF, making each transaction operate on its visible versions of states while blocking invisible ones. Further, we provide cross-chain transactions with specific operation interfaces.

5.1 Operations on States

Each transaction, at the beginning, generates a snapshot, which is used for the subsequent version matching of the states it operates on. Such a snapshot includes the tsid of the transaction, the active list, and the write set (WS), which supports rollback. Accordingly, we re-implement PutState and GetState as Algorithm 1.

Algorithm 1. Operations on States

procedure GetState(TC, k)                      ▷ TC: transaction context; k: key of state s
  RV ← TC.Get("snapshot")
  if s.version > RV.active_list.max_id then    ▷ max_id: the max tsid
    while s.version > RV.active_list.max_id do
      s ← s.roll_ptr
    end while
  end if
  if s.version < RV.active_list.min_id then
    return s.value
  end if
  if s.version = RV.tsid then
    return s.value
  end if
  if s.version ∈ RV.active_list.tx_list then
    return s.roll_ptr.value
  else
    return s.value
  end if
end procedure

procedure PutState(SD, TC, k, v, WS)           ▷ SD: state database; WS: write set
  s ← SD.Get(k)
  new_s ← state{}                              ▷ new_s: the new state to be written
  RV ← TC.Get("snapshot")
  new_s.version ← RV.tsid
  new_s.value ← v
  new_s.roll_ptr ← s
  WS.Append(new_s)
end procedure
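The visibility rules of Algorithm 1 can be sketched in executable form (a simplified model with hypothetical class names, not the actual HLF chaincode API):

```python
# Sketch of the re-implemented GetState/PutState visibility rules.

class State:
    def __init__(self, version, value, roll_ptr=None):
        self.version = version      # tsid of the writing transaction
        self.value = value
        self.roll_ptr = roll_ptr    # pointer to the previous version

class Snapshot:
    def __init__(self, tsid, active):       # active: tsids still running
        self.tsid = tsid
        self.active = set(active)
        self.min_id = min(active) if active else tsid
        self.max_id = max(active) if active else tsid

def get_state(snap, s):
    while s.version > snap.max_id:          # written after the snapshot
        s = s.roll_ptr                      # roll back to an older version
    if s.version < snap.min_id:             # committed before all actives
        return s.value
    if s.version == snap.tsid:              # our own write is visible
        return s.value
    if s.version in snap.active:            # uncommitted concurrent write
        return s.roll_ptr.value
    return s.value

def put_state(snap, s, value, write_set):
    new_s = State(snap.tsid, value, roll_ptr=s)  # chain to the old version
    write_set.append(new_s)                      # buffered until commit
    return new_s

v1 = State(1, "v1")
snap = Snapshot(tsid=3, active=[2, 3])      # tx 2 is concurrent and active
v2 = State(2, "v2", roll_ptr=v1)            # tx 2's uncommitted write
assert get_state(snap, v2) == "v1"          # tx 3 sees the older version v1
```

A transaction's own writes remain visible to itself: after `put_state(snap, v2, "v3", ws)`, `get_state` on the new node returns "v3" for tsid 3, matching the `s.version = RV.tsid` branch of Algorithm 1.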


GetState returns the visible state version to the caller transaction. As in Algorithm 1, we get the state s with its latest version from the world state. If s.version > active_list.max_id, other transactions have updated this state after the snapshot was generated, so this version should be invisible to the caller transaction. The same judgment is executed on the previous versions via the pointer roll_ptr until s.version ≤ active_list.max_id.

When s > 0.4, the throughput can reach 1.5x that of Contract Lock. Additionally, when Pw = 95%, the performance improvement is significant: when s > 0.6, the throughput can reach 4x, and the delay is 42% of the contract-lock method.

Fig. 5. Transaction latency of MVCC and contract lock methods

7 Conclusion

This paper overviewed the challenges of contract locks for data integrity and performance. To solve the issues of the existing methods, we proposed an MVCC approach for blockchain interoperability and implemented it on HLF. Experiments with data integrity analysis demonstrated the effectiveness and efficiency of the proposed approach. The findings show that the proposed approach achieved up to a 4x performance increase compared with the existing methods and decreased the average latency by 58%.

Acknowledgments. This work was partially supported by National Key Research and Development Project of China (Grant No. 2019YFB2102500), National Natural Science Foundation of China (No. 61902385), Shenzhen Key Basic Research Project (JCYJ20200109115422828), Huawei Cloud Research Project (YBN2020085125) and National Archives Technology Project (2020-X-10).

References

1. Bitcoin (2021). https://bitcoin.org/bitcoin.pdf
2. Ethereum white paper (2021). https://github.com/ethereum/wiki/wiki/White-Paper
3. Dean, J., Ghemawat, S.: LevelDB (2021). https://github.com/google/leveldb/
4. Apache CouchDB (2021). https://couchdb.apache.org/
5. Oracle Timeline (2021). http://oracle.com.edgesuite.net/timeline/oracle/
6. Buterin, V.: Chain interoperability. R3 Research Paper (2016)
7. Stonebraker, M., Rowe, L.A.: The design of POSTGRES. SIGMOD (1986)
8. Zakhary, V., Agrawal, D., El Abbadi, A.: Atomic commitment across blockchains. Proc. VLDB Endowment 13(9)
9. He, Y., Zhang, C., Wu, B., et al.: A cross-chain trusted reputation scheme for a shared charging platform based on blockchain. IEEE Internet Things J. (2021)
10. Androulaki, E., et al.: Hyperledger fabric: a distributed operating system for permissioned blockchains. In: Proceedings of the Thirteenth EuroSys Conference (2018)
11. Warnat-Herresthal, S., et al.: Swarm learning for decentralized and confidential clinical machine learning. Nature 594(7862), 265–270 (2021)
12. Muzammal, M., Qu, Q., Nasrulin, B.: Renovating blockchain with distributed databases: an open source system. Future Gener. Comput. Syst. 90, 105–117 (2019)
13. Thakkar, P., Senthil Nathan, N.: Performance benchmarking & optimizing hyperledger fabric blockchain platform (2018)
14. Qu, Q., Nurgaliev, I., Muzammal, M., et al.: On spatio-temporal blockchain query processing. Future Gener. Comput. Syst. 98, 208–218 (2019)
15. Sharma, A., Schuhknecht, F.M., Agrawal, D., et al.: Blurring the lines between blockchains and database systems: the case of hyperledger fabric. In: SIGMOD, pp. 105–122 (2019)
16. Ruan, P., Loghin, D., Ta, Q.T., et al.: A transactional perspective on execute-order-validate blockchains. In: SIGMOD, pp. 543–557 (2020)
17. Nurgaliev, I., Muzammal, M., Qu, Q.: Enabling blockchain for efficient spatio-temporal query processing. In: Hacid, H., Cellary, W., Wang, H., Paik, H.-Y., Zhou, R. (eds.) WISE 2018. LNCS, vol. 11233, pp. 36–51. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-02922-7_3


18. Saberi, S., et al.: Blockchain technology and its relationships to sustainable supply chain management (2018)
19. Chacko, J.A., Mayer, R., Jacobsen, H.A.: Why do my blockchain transactions fail? A study of hyperledger fabric. In: SIGMOD, pp. 221–234 (2021)
20. Zhang, L., et al.: The challenges and countermeasures of blockchain in finance and economics. Syst. Res. Behav. Sci. 37(4), 691–698 (2020)
21. Batubara, F.R., Ubacht, J., Janssen, M.: Challenges of blockchain technology adoption for e-government: a systematic literature review (2018)
22. Thomas, S., Schwartz, E.: A protocol for interledger payments (2015). https://interledger.org/interledger.pdf
23. Kwon, J., Buchman, E.: Cosmos: a network of distributed ledgers (2018)
24. Wood, G.: Polkadot: vision for a heterogeneous multi-chain framework. https://github.com/polkadot-io/polkadotpaper/raw/master/PolkaDotPaper.pdf
25. Herlihy, M.: Atomic cross-chain swaps. arXiv e-prints arXiv:1801.09515 (2018)
26. Herlihy, M., Liskov, B., Shrira, L.: Cross-chain deals and adversarial commerce. VLDB J. 1–19 (2021). https://doi.org/10.1007/s00778-021-00686-1
27. Reed, D.P.: Naming and synchronization in a decentralized computer system. Massachusetts Institute of Technology (1978)
28. Larson, P.-Å., Blanas, S., Diaconu, C., et al.: High-performance concurrency control mechanisms for main-memory databases. Proc. VLDB Endowment 5(4) (2011)
29. Qu, Q., et al.: Graph-based knowledge representation model and pattern retrieval. In: 2008 Fifth International Conference on Fuzzy Systems and Knowledge Discovery, vol. 5. IEEE (2008)
30. Wang, T., Kimura, H.: Mostly-optimistic concurrency control for highly contended dynamic workloads on a thousand cores. Proc. VLDB Endowment 10(2), 49–60 (2016)
31. Herlihy, M.P., Wing, J.M.: Linearizability: a correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst. (TOPLAS) 12(3), 463–492 (1990)
32. Cahill, M.J.: Serializable isolation for snapshot databases (2009)
33. Yu, X., Bezerra, G., Pavlo, A., et al.: Staring into the abyss: an evaluation of concurrency control with one thousand cores. Proc. VLDB Endowment 8(3) (2014)

An Effective and Reliable Cross-Blockchain Data Migration Approach

Mengqiu Zhang1,2, Qiang Qu1,3(B), Li Ning1, Jianping Fan1,2, and Ruijie Yang3

1 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
{zhangmq,qiang,li.ning,jp.fan}@siat.ac.cn
2 University of Chinese Academy of Sciences, Beijing, China
3 Huawei Cloud Blockchain Lab, Shenzhen, China
[emailprotected]

Abstract. As blockchain is widely applied, various decentralized applications inevitably encounter data migration problems, for reasons such as multilevel blockchain scenarios, the exhaustion of blockchain disk space, and the swap of rapidly evolving blockchain engines. In order for the applications to proceed smoothly, it is necessary to migrate the original blockchain data to a new blockchain instance, which is cross-blockchain data migration. However, ensuring the reliability of data provenance and data consistency, while balancing migration efficiency and historical state granularity, introduces unique challenges for cross-blockchain data migration. This paper proposes an effective and reliable cross-blockchain data migration approach to cope with these challenges. To ensure reliability, a collective mechanism of controlling, executing and storing procedures is proposed to assort migration transactions between blockchains. Furthermore, we propose two migration schemes in order to adapt to decentralized application scenarios. Extensive experiments are conducted to demonstrate the effectiveness of the proposed approach.

Keywords: Blockchain · Data migration · Cross-blockchain · Distributed transactions · Decentralized applications

1 Introduction

Since 2008, the blockchain technology introduced by Satoshi Nakamoto in "Bitcoin: A Peer-to-Peer Electronic Cash System" [1] has received enormous attention due to the growing demands of decentralized applications for trust purposes. The emergence of Ethereum [2] makes blockchain applicable in wide fields because of its leverage of smart contracts. Meanwhile, a line of blockchain engines and platforms have been proposed in order to satisfy particular features.

© Springer Nature Switzerland AG 2022. H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 286–294, 2022. https://doi.org/10.1007/978-3-030-96772-7_26

With the fast development of blockchain, we have witnessed the technology being broadly used in the applications of product tracing, privacy protection, supply chain, finance, health, and decentralized file storage [3,4]. The increasing volume of blockchain-based applications shows that people are keen to set up blockchain systems, which presents an underlying requirement of cross-blockchain data migration due to the limitation of blockchain data storage, the scenarios of multi-level blockchains, and the swap need of the rapidly evolving blockchain engines [5].

With the changes in technology, policy and market circumstances, blockchain applications have encountered a series of new problems. For instance, the distributed consistent storage of blockchain raises huge consumption of space, and blockchain services suffer considerable pressure from the unprecedented growth of decentralized transactions [6]. On the other hand, to ensure the competitiveness and safety of blockchain-based applications, we often need to replace the underlying blockchain engines to adapt to new application requirements, e.g., the latest version of Hyperledger Fabric. Hereby, cross-blockchain data migration is necessary to copy the data from the original blockchain to a new instance. However, how to ensure reliability and data consistency while realizing migration efficiency remains challenging for the traditional data migration methods of centralized systems [7].

Thus, this paper proposes an effective and reliable cross-blockchain data migration approach. A collective mechanism of controlling, executing and storing procedures is proposed with the following three procedures to assort the migration transactions. In general, a controlling procedure provides services for cross-blockchain data migration, including registration of blockchains that require data migration. An executing procedure provides solo and aggregate migration methods in the process of cross-blockchain data migration.
The storing procedure is used to store conﬁguration ﬁles and data migration records. This approach with the three modules can eﬀectively implement cross-blockchain data migration, and the experiments show the eﬀectiveness and reliability. The rest of this paper is organized as follows. Section 2 discusses the related work of data migration. Section 3 presents the proposed approach with the collective mechanism design and its three main modules. The detailed experiments and the application usage of the proposed methods are discussed in Sect. 4. Section 5 concludes the paper.

2 Related Work

We classify the literature into data migration in traditional centralized systems and in blockchain decentralized systems [8,11,14], respectively. In traditional centralized systems, research on data migration is carried out on moving data stored on devices in a network from one configuration to another. For instance, Haller [10] mentioned the importance of data migration to maintain system competitiveness and proposed a general migration architecture. Biswas et al. [11] proposed a blockchain data migration method for the


medical field, which supports the migration of medical records from traditional databases to the blockchain. These methods are based on traditional databases, and they do not consider the complexity of the data structures of distributed applications. They are thus hard to adapt directly to the cross-blockchain migration problem.

In blockchain decentralized systems, a limited number of methods have been proposed for data migration due to the short research history of blockchain, and few studies have realized the importance of cross-blockchain data management, e.g., blockchain interoperability and data migration. For blockchain interoperability, Wang et al. [9] introduced a blockchain router that empowers blockchains to connect and communicate across chains, and Herlihy et al. [14] introduced several commonly used cross-chain transaction methods. They described novel safety and liveness properties, along with two alternative protocols for implementing cross-chain deals in a system of independent blockchain ledgers. These methods are able to support interoperation between blockchains, but they cannot be directly applied to cross-blockchain data migration because of the lack of consideration of historical data status, consistency and throughput. For cross-blockchain data migration, VeChain [12] introduced a method of swapping original tokens and newly-issued tokens, and Bandara et al. [13] introduced a set of blockchain migration scenarios and data fidelity levels and then identified several patterns to achieve those migration scenarios under varying data fidelity levels. These methods are designed for particular systems and lack generalization for effective cross-blockchain data migration.

3 The Proposed Migration Approach

Our purpose is to conduct effective and reliable cross-blockchain data migration when various decentralized applications encounter data migration problems. To achieve this goal, the approach provides a collective mechanism of controlling, executing and storing procedures to assort migration transactions between blockchains. Blockchain information needs to be recorded before migration to ensure that the data source and migration process are reliable. Two migration schemes are proposed to adapt to decentralized application scenarios. Furthermore, we apply configuration parameters, migration records, and other information to support recovery after a migration interruption. Table 1 lists the notations used throughout the paper.

Table 1. The summary of notations

Notation              Definition
name_from, name_to    User-defined chain names during registration
solo, aggr            Solo and aggregate migration modes
CFG                   Configuration parameters, log file path, etc.
timeout               Timeout threshold of a migration event
blockout              Threshold of traversed blocks in a migration event
routineMax            Maximum number of coroutines during a migration event


Figure 1 overviews the proposed cross-blockchain data migration approach. The approach, with a collective mechanism of controlling, executing and storing procedures, provides custom parameter configuration to support personalized migration for various application scenarios. The architecture ensures the reliability of data sources and the consistency of transactions while balancing migration efficiency and historical state granularity.

Fig. 1. The cross-blockchain data migration method architecture

3.1 Preparing for the Migration

Before data migration, blockchain information needs to be registered to ensure that data sources are trusted. In order to ensure data source reliability and the consistency and invariance of migration transactions, two to three steps are necessary to implement data migration: registering a blockchain, viewing the registered list (optional), and executing the migration. Registration means recording information about the source and target blockchains before migration, including the user-defined name (blockchain name), channel (channel name), type (blockchain type), config (configuration file path) and certs (certificate file path). Viewing the registered list queries the list of successfully registered blockchains. Executing the migration selects the migration mode and operating parameters in terms of the input configuration to migrate data from name_from to name_to.
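The registration record described above can be sketched as a simple structure (field names follow the text; the in-memory store and function names are hypothetical):

```python
# Sketch of the controlling procedure's registration step.
# A registered blockchain is identified by its user-defined name.

registry = {}   # hypothetical in-memory store for registration records

def register_chain(name, channel, chain_type, config, certs):
    registry[name] = {
        "name": name,          # user-defined blockchain name
        "channel": channel,    # channel name
        "type": chain_type,    # blockchain type
        "config": config,      # configuration file path
        "certs": certs,        # certificate file path
    }

def list_registered():
    return sorted(registry)    # viewing the registered list

register_chain("chain_from", "ch1", "fabric", "/cfg/from.yaml", "/certs/from")
register_chain("chain_to", "ch1", "fabric", "/cfg/to.yaml", "/certs/to")
assert list_registered() == ["chain_from", "chain_to"]
```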

3.2 Cross-Chain Data Migration Process

In a specific migration, the cost, migration duration, and data consistency need to be considered. In order to balance efficiency and historical state granularity, the proposed data migration approach supports two execution modes: solo and aggr.

290

M. Zhang et al.

Figure 2 shows the solo and the aggregation (aggr) migration modes. N represents the number of transactions in a block, and M represents the number of key-value (K-V) pairs included in the block. In the solo mode, the information of each transaction on the source chain is written into the replication chain separately. In the aggregation mode, X represents the number of blocks aggregated at a time, and M represents the number of K-V pairs remaining after the repeated K-V pairs are removed from the transaction set of the X blocks. This mode deduplicates the read-write sets of X blocks on the source chain and then aggregates them into one transaction to be written to the target blockchain instance.

Fig. 2. The schematic of solo pattern
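The aggregation step can be sketched as follows (a simplified model: each block is reduced to its ordered K-V writes; later writes to the same key overwrite earlier ones, leaving the M deduplicated pairs):

```python
# Sketch of the aggr mode's deduplication: merge the K-V write sets of X
# blocks, keep only the latest value per key, then emit one transaction.

def aggregate_blocks(blocks):
    """blocks: list of blocks, each a list of (key, value) writes in
    commit order. Returns one merged write set for a single transaction."""
    merged = {}
    for block in blocks:
        for key, value in block:
            merged[key] = value      # a later write overwrites an earlier one
    return merged                    # the M deduplicated K-V pairs

x_blocks = [
    [("k1", "v1"), ("k2", "v1")],
    [("k1", "v2")],                  # k1 is rewritten in a later block
]
assert aggregate_blocks(x_blocks) == {"k1": "v2", "k2": "v1"}
```

This is why the aggr mode preserves the final world state but not every intermediate historical state: the overwritten value "v1" of k1 never reaches the target chain.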

Algorithm 1 shows the execution of cross-blockchain data migration.

Algorithm 1. Cross-blockchain data migration

Require: name_from, name_to
  chain_from, chain_to ← init(name_from, name_to)
  h_point ← getBreakpoint(name_from, name_to)
  h_end ← chain_from.GetBlockHeight()
  CFG ← GetConfig()
  if mode == solo then
    for h_point → h_end do
      block ← chain_from.GetBlock(h_point)
      TX ← unmarshal(block)
      for i = 1 → len(TX) do
        chain_to.InputTX(tx_i)
      end for
      h_point++
    end for
  else if mode == aggr then
    for h_point → h_end do
      while MEET_CONDITION do   ▷ until the number of blocks obtained reaches blockout, or the elapsed time reaches timeout
        block ← chain_from.GetBlock(h_point)
        TX ← unmarshal(block), TXset ← append(TX)
        h_point++
      end while
      tx ← aggrTX(TXset)
      chain_to.InputTX(tx)
    end for
  end if

In this approach, we first initialize the blockchains that need data migration to ensure the reliability of the data source, then get the starting and ending block heights of the source blockchain by using the getBreakpoint and GetBlockHeight functions. If the migration mode in the CFG is solo, the executing procedure unmarshals the blocks of the source blockchain to obtain transactions and ordinally writes them into the target blockchain. When the migration mode is aggr, the executing procedure traverses blocks in the source blockchain until the number of blocks obtained reaches blockout or the elapsed time reaches timeout; then the algorithm removes duplicate keys from the transaction set TX and aggregates them into one transaction. Finally, it writes the transaction to the target blockchain and repeats the process until the cross-blockchain data migration is completed.

To support the complete execution of the data migration process and the function of resuming a broken migration, the configuration and the data migration record should be stored. The configuration contains parameters such as the working mode and log path, and running parameters including timeout, blockout, and routineMax in the aggregation mode.
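The resume-after-interruption behavior can be sketched as follows (the record format and function names getBreakpoint/saveBreakpoint are simplified from the text; the paper persists the record via the storing procedure):

```python
# Sketch of resuming a broken migration from a stored breakpoint record.

record = {}   # persisted migration record, keyed by the (from, to) chain pair

def get_breakpoint(name_from, name_to):
    # Height to resume from; 0 means the migration starts from the genesis.
    return record.get((name_from, name_to), 0)

def save_breakpoint(name_from, name_to, height):
    record[(name_from, name_to)] = height   # persisted after each block

def migrate(name_from, name_to, end_height):
    h = get_breakpoint(name_from, name_to)  # skip already-migrated blocks
    while h < end_height:
        # ... fetch block h from the source chain and write it out ...
        h += 1
        save_breakpoint(name_from, name_to, h)

migrate("chain_from", "chain_to", end_height=5)
assert get_breakpoint("chain_from", "chain_to") == 5
```

If the process crashes mid-run, the next `migrate` call starts from the last saved height instead of block 0, so no block is migrated twice.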

4 Experimental Results

To verify the effectiveness of the cross-blockchain data migration method and to test the effect of different configuration parameters on the migration results, two sets of experiments are carried out: comparative experiments of the solo and the aggregation migration modes, and parameter studies of the approach.

4.1 Comparison Study of solo and aggr

A dataset including 1000 transactions is used to perform five groups of cross-blockchain data migration tests based on the solo and aggregate modes, where the configuration of the aggregate mode is: timeout = 50, blockout = 50, and routineMax = 50.

Table 2. Data migration tests for the solo mode and the aggr mode

Sequence   Solo mode result   Solo time   Aggregate mode result   Aggregate time
1          Full               2017 s      Full                    13 s
2          Full               2070 s      Full                    13 s
3          Full               2053 s      Full                    10 s
4          Full               2032 s      Full                    7 s
5          Full               2049 s      Full                    13 s

As Table 2 shows, both the solo and aggregate modes eﬀectively complete full-blockchain data migration. The average migration duration in solo is 2044 s, while aggregate is 11 s. Both modes ensure that the world states of the target blockchain and the source chain are entirely consistent. The solo mode additionally ensures that the historical states of the target and source chain are consistent.

Fig. 3. Comparison of storage consumption and migration duration

Figure 3 shows the storage consumption of the source and target blockchain in the two migration modes. The storage consumption of the target blockchain in solo is slightly higher than that of the source blockchain because more block header data is contained. However, in aggregation, the storage consumption of the target blockchain is much lower due to the fewer transactions and blocks. The advantage of the solo mode is that it can simultaneously make the world


state and historical state of the new target blockchain completely consistent with the source blockchain. The aggregation mode, on the other hand, achieves higher efficiency with less storage consumption.

4.2 Configuration Parameter Study for Aggregation

The timeout is fixed to 10 s, and 9 sets of data migration experiments are performed using the dataset containing 100,000 transactions. Each set of experiments successfully completes the total data migration. The migration efficiency is shown in the upper right corner of Fig. 3. On the one hand, the migration duration decreases as routineMax increases. On the other hand, when blockout is small, the migration duration decreases as the number of aggregated blocks increases; when blockout is large, the block processing time is greater than the transaction submission time, and the migration duration increases as the number of aggregated blocks increases. We then compare the migration duration for different blockout and timeout values, with routineMax fixed to 50. The migration efficiency is shown in the lower right corner of Fig. 3. Since the production of the last block during migration often fails to meet the "maximum number of transactions" condition, it is necessary to wait for the timeout condition to produce the block, so the migration duration increases as timeout increases.

5 Conclusion

In this paper, we proposed an effective and reliable approach to cope with scenarios where historical data on a blockchain needs to be migrated to a new blockchain engine. A collective mechanism with various methods was presented in order to achieve the reliability of the migration process. In addition, we discussed two migration methods, the solo and aggr modes, and analyzed their pros and cons. We demonstrated the effectiveness of the proposed method through extensive experiments under different configuration parameters.

Acknowledgments. This work was partially supported by National Key Research and Development Project of China (Grant No. 2019YFB2102500), National Natural Science Foundation of China (No. 61902385), Shenzhen Key Basic Research Project (JCYJ20200109115422828), Huawei Cloud Research Project (YBN2020085125) and National Archives Technology Project (2020-X-10).

References

1. Nakamoto, S.: Bitcoin: a peer-to-peer electronic cash system. Decentralized Bus. Rev. 21260 (2008)
2. Wood, G.: Ethereum: a secure decentralised generalised transaction ledger. Ethereum Project Yellow Paper 151, 1–32 (2014)


3. Nasrulin, B., Muzammal, M., Qu, Q.: ChainMOB: mobility analytics on blockchain. In: 2018 19th IEEE International Conference on Mobile Data Management (MDM), pp. 292–293. IEEE (2018)
4. Muzammal, M., Qu, Q., Nasrulin, B.: Renovating blockchain with distributed databases: an open source system. Future Gener. Comput. Syst. 90, 105–117 (2019)
5. Xie, J., Yu, F.R., Huang, T., et al.: A survey on the scalability of blockchain systems. IEEE Netw. 33(5), 166–173 (2019)
6. Kanza, Y.: Technical perspective: revealing every story of data in blockchain systems. ACM SIGMOD Record 49(1), 69 (2020)
7. Das, S., Nishimura, S., Agrawal, D., et al.: Albatross: lightweight elasticity in shared storage databases for the cloud using live data migration. Proc. VLDB Endowment 4(8), 494–505 (2011)
8. Carreira, P., Galhardas, H.: Efficient development of data migration transformations. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 915–916 (2004)
9. Wang, H., Cen, Y., Li, X.: Blockchain router: a cross-chain communication protocol. In: Proceedings of the 6th International Conference on Informatics, Environment, Energy and Applications, pp. 94–97 (2017)
10. Haller, K.: Towards the industrialization of data migration: concepts and patterns for standard software implementation projects. In: van Eck, P., Gordijn, J., Wieringa, R. (eds.) CAiSE 2009. LNCS, vol. 5565, pp. 63–78. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02144-2_10
11. Biswas, S., Sharif, K., Li, F., et al.: Blockchain for e-health-care systems: easier said than done. Computer 53(7), 57–67 (2020)
12. VeChain: VeChainThor wallet manual including token swap and X node migration, July 2018. https://cdn.vechain.com/vechainthor_wallet_manual_en_v1.0.pdf
13. Bandara, H.M.N.D., Xu, X., Weber, I.: Patterns for blockchain data migration. In: Proceedings of the European Conference on Pattern Languages of Programs 2020, pp. 1–19 (2020)
14. Herlihy, M., Liskov, B., Shrira, L.: Cross-chain deals and adversarial commerce. VLDB J. 1–19 (2021). https://doi.org/10.1007/s00778-021-00686-1

Algorithm for the Facility Location Problem with Origin and Destination

Fengmin Wang(B), Chu Wang, Na Li, and Wenxing Kang

Beijing Jinghang Research Institute of Computing and Communication, Beijing 100074, People's Republic of China

Abstract. The Uncapacitated Facility Location Problem with Origin and Destination (FLPWOD) is an extension of the Uncapacitated Facility Location Problem (UFLP), where each unit of demand has its own origin and destination and must be shipped from its origin, via a location at which a transit station is built, to its destination. As in the UFLP, facilities can be opened at any of the predefined locations with given fixed costs. In classical location models, the clients have to be assigned to the open facilities, and the assignment cost is the distance between a client and an open facility. In the FLPWOD, the demands with origins and destinations have to be assigned to the open facilities, and the assignment cost is the length of a tour from the origin to the destination through an open facility. An LP-rounding approximation algorithm is developed with the first constant approximation ratio of 4.

Keywords: Facility location · Origin and destination · LP-rounding · Approximation algorithm

1 Introduction

The Uncapacitated Facility Location Problem with Origin and Destination (FLPWOD) is an extension of the Uncapacitated Facility Location Problem (UFLP), which has been extensively investigated in the field of combinatorial optimization over the past three decades [2,5,9]. The UFLP consists of locating uncapacitated facilities among a set of candidate sites and of allocating clients to open facilities in such a way that the sum of location and allocation costs is minimized. More precisely, in the UFLP, the inputs are a facility set F, a client set C, a nonnegative facility opening cost for every facility in F, and a nonnegative service cost for connecting each pair of a facility in F and a client in C. The connection cost is often assumed to be metric. The objective is to open (locate) some facilities in F and connect (allocate) each client in C to one of the open facilities, in such a way that the sum of opening and connection costs is minimized. In this model, each client is serviced separately. However, in several applications, visits to clients may be combined, such as in bike sharing systems (see, e.g., [12,14,16]).

Supported by NNSF of China under Grant No. 11901544.
© Springer Nature Switzerland AG 2022. H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 295–302, 2022. https://doi.org/10.1007/978-3-030-96772-7_27

The FLPWOD corresponds to the case

296

F. Wang et al.

where depots must be located and two clients can be serviced together from a given depot. Such applications arise naturally in container transportation, petroleum delivery, and bulk garbage collection.

1.1 Related Work

We briefly review here the studies related to the facility location problem. Shmoys et al. [13] developed the first constant-factor approximation algorithm for the metric uncapacitated facility location problem; they used the LP-rounding technique to obtain the approximation ratio 3.16. The ratio was improved later by Chudak and Shmoys [3], who provided a randomized-rounding-based 1.736-approximation algorithm. Currently, Li [7] gives the best approximation ratio of 1.488. On the hardness side, Guha and Khuller [4] showed that it is hard to approximate the uncapacitated facility location problem within a factor of 1.463. For the capacitated version, An et al. [1] considered the metric capacitated facility location problem and presented a constant-factor approximation algorithm based on LP-rounding. Furthermore, (randomized) LP-rounding techniques have been successfully used to design several algorithms for the facility location problem and its variants (see [8,10,15] and references therein). Nezhad et al. [11] investigated the facility location problem with point and area destinations in a fuzzy environment.

1.2 Our Contribution

The main contributions of this paper are summarized as follows.
– We introduce the FLPWOD, which generalizes the classic facility location problem.
– We present an LP-rounding approximation algorithm with ratio 4.
– Our algorithm obtains the first constant approximation ratio for the FLPWOD.

1.3 Organization

The remainder of this paper is organized as follows. In Sect. 2 we state the FLPWOD, give its model and our algorithm, and conduct a theoretical analysis showing how the LP-rounding approximation algorithm handles the FLPWOD. Section 3 is devoted to conclusions and future work.

2 Uncapacitated Facility Location Problem with Origin and Destination

2.1 Problem Statement

Consider a set of locations N = {1, . . . , n}. The travel costs between them, c_st ≥ 0, s, t = 1, . . . , n, are assumed to be symmetric and to satisfy the triangle inequality.

Algorithm for the Facility Location Problem with Origin and Destination


There is a facility set F ⊆ N and an origin-and-destination demand pair set D = {(i, j) : i, j ∈ N}. In the uncapacitated facility location problem with origin and destination (FLPWOD), we must select some facilities to open and assign each demand pair to exactly one open facility; for each demand pair (i, j) ∈ D, there is a positive integral demand d_ij that must be shipped via its assigned facility. For each location k ∈ F, the non-negative cost of opening a facility at k is f_k. The cost of assigning demand pair (i, j) to an open facility at k is c_ijk = c_ik + c_kj per unit of demand shipped. The objective of the FLPWOD is to minimize the sum of the fixed facility location costs and the assignment costs. The general solution structure of the FLPWOD addressed in this study is represented in Fig. 1.

Fig. 1. Solution structure of the FLPWOD.
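The cost structure just defined can be evaluated directly; the following small sketch is illustrative (function and variable names are ours, not from the paper):

```python
def flpwod_cost(open_facilities, assignment, f, c, d):
    """Total FLPWOD cost of a feasible solution: facility opening costs plus
    per-unit assignment costs c_ijk = c_ik + c_kj, as defined in the text.

    open_facilities: set of opened facility locations
    assignment: dict mapping demand pair (i, j) -> its open facility k
    f: dict of opening costs; c: dict-of-dicts of travel costs; d: demands
    """
    opening = sum(f[k] for k in open_facilities)
    shipping = sum(d[(i, j)] * (c[i][k] + c[k][j])
                   for (i, j), k in assignment.items())
    return opening + shipping
```

For instance, opening facility 2 at cost 5 and shipping 2 units from origin 0 to destination 1 via it, with c[0][2] = 1 and c[2][1] = 3, costs 5 + 2·(1+3) = 13.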

We introduce two decision variables, y_k and x_ijk: y_k = 1 if a facility is open at location k, and y_k = 0 otherwise; x_ijk = 1 if the origin and destination demand pair (i, j) is assigned to facility k, and x_ijk = 0 otherwise. The model is as follows.

$$\min \sum_{k \in F} f_k y_k + \sum_{(i,j) \in D} \sum_{k \in F} d_{ij} c_{ijk} x_{ijk}$$
$$\text{s.t.} \quad \sum_{k \in F} x_{ijk} = 1, \quad \forall (i,j) \in D, \tag{1}$$
$$x_{ijk} \le y_k, \quad \forall (i,j) \in D,\ k \in F,$$
$$x_{ijk}, y_k \in \{0,1\}, \quad \forall (i,j) \in D,\ k \in F.$$

The first constraint guarantees that every origin and destination demand pair (i, j) ∈ D is assigned to exactly one transit station k ∈ F. The second constraints indicate that if the demand pair (i, j) ∈ D is assigned to the transit station k ∈ F, then facility k must be open. Relaxing the 0–1 constraints of the above integer program (1), we obtain the following linear relaxation.


$$\min \sum_{k \in F} f_k y_k + \sum_{(i,j) \in D} \sum_{k \in F} d_{ij} c_{ijk} x_{ijk}$$
$$\text{s.t.} \quad \sum_{k \in F} x_{ijk} = 1, \quad \forall (i,j) \in D, \tag{2}$$
$$x_{ijk} \le y_k, \quad \forall (i,j) \in D,\ k \in F,$$
$$x_{ijk}, y_k \ge 0, \quad \forall (i,j) \in D,\ k \in F.$$
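The relaxation (2) is an ordinary linear program and can be handed to any LP solver; the paper does not prescribe one, so the sketch below uses SciPy's `linprog` purely for illustration:

```python
import numpy as np
from scipy.optimize import linprog

def solve_flpwod_relaxation(f, d, c):
    """Solve the linear relaxation (2).

    f: (m,) facility opening costs; d: (p,) demands per pair;
    c: (p, m) assignment costs c_ijk per (pair, facility).
    Returns the fractional (x, y).
    """
    f, d, c = map(np.asarray, (f, d, c))
    m, p = len(f), len(d)
    n = m + p * m                       # variables: y_0..y_{m-1}, then x_{q,k}
    obj = np.concatenate([f, (d[:, None] * c).ravel()])

    A_eq = np.zeros((p, n))             # sum_k x_{q,k} = 1 for each pair q
    for q in range(p):
        A_eq[q, m + q * m: m + (q + 1) * m] = 1.0

    A_ub = np.zeros((p * m, n))         # x_{q,k} - y_k <= 0
    for q in range(p):
        for k in range(m):
            A_ub[q * m + k, m + q * m + k] = 1.0
            A_ub[q * m + k, k] = -1.0

    res = linprog(obj, A_ub=A_ub, b_ub=np.zeros(p * m),
                  A_eq=A_eq, b_eq=np.ones(p), bounds=(0, None))
    return res.x[m:].reshape(p, m), res.x[:m]
```

On a toy instance with two facilities (f = [1, 10]) and one unit-demand pair with costs [1, 0.1], the cheap-to-open facility wins: y = (1, 0) and the pair is fully assigned to it.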

2.2 Algorithm

Our algorithm is a slight adaptation of the approximation algorithm for the uncapacitated facility location problem by Shmoys et al. [13], where we think of the origin-destination pair as an imaginary client in the facility location problem. The major contribution of this paper is to show that the LP-rounding algorithm of Shmoys et al. [13] can easily be adapted to solve the FLPWOD. We present the following definition used in our algorithm.

Definition 1. For each demand pair (i, j) ∈ D, a value g_ij is given. A feasible solution (x, y) to the linear program (2) is said to be g-close if it satisfies the property x_ijk > 0 ⇒ c_ijk ≤ g_ij.

We can see from the above definition that if a fractional solution to the linear program (2) is g-close, then whenever a demand pair (i, j) is fractionally assigned to a (partially opened) facility k, the cost c_ijk associated with that assignment is not too big. In our algorithm, after solving the linear relaxation (2) of the integer program (1), we apply the filtering and rounding technique to obtain a new g-close fractional solution. We then show how to round the g-close fractional solution to a 3g-close integer solution. We now give the details of the rounding algorithm.

Algorithm 1. We run the following steps.

Step 1. Solve the linear program (2). Denote the feasible fractional solution by (x, y).

Step 2. (Filtering and rounding) Let α be a fixed value in the interval (0, 1). For each demand pair (i, j) ∈ D, sort the connection costs c_ijk over all facilities k ∈ F in nondecreasing order; add the associated values x_ijk in this order, and let k* be the first facility for which this running sum is at least α; set c_ij(α) = c_ijk*. For each demand pair (i, j) ∈ D, let α_ij = Σ_{k∈F : c_ijk ≤ c_ij(α)} x_ijk. We then round the fractional solution (x, y) to obtain (x̄, ȳ) as follows. For each demand pair (i, j) ∈ D and each facility k ∈ F, we set

$$\bar{x}_{ijk} = \begin{cases} x_{ijk}/\alpha_{ij}, & \text{if } c_{ijk} \le c_{ij}(\alpha), \\ 0, & \text{otherwise}, \end{cases} \qquad \bar{y}_k = \min\{1, y_k/\alpha\}.$$

For each demand pair (i, j) ∈ D, let g_ij = c_ij(α); then (x̄, ȳ) is a g-close solution.
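The filtering-and-rounding step above can be sketched in a few lines of pure Python (names are ours, chosen for readability):

```python
def filter_and_round(x, y, c, alpha):
    """Step 2 of Algorithm 1: filtering and rounding a fractional solution.

    x[q][k]: fractional assignment of demand pair q to facility k
    y[k]:    fractional opening of facility k
    c[q][k]: assignment cost c_ijk
    alpha:   filtering parameter in (0, 1)
    Returns (x_bar, y_bar, g) where g[q] = c_q(alpha).
    """
    p, m = len(x), len(y)
    g = []
    x_bar = [[0.0] * m for _ in range(p)]
    for q in range(p):
        # sort facilities by connection cost, accumulate x until the sum >= alpha
        order = sorted(range(m), key=lambda k: c[q][k])
        total = 0.0
        for k in order:
            total += x[q][k]
            if total >= alpha:
                c_q_alpha = c[q][k]          # this is c_q(alpha)
                break
        g.append(c_q_alpha)
        alpha_q = sum(x[q][k] for k in range(m) if c[q][k] <= c_q_alpha)
        for k in range(m):
            if c[q][k] <= c_q_alpha:
                x_bar[q][k] = x[q][k] / alpha_q
    y_bar = [min(1.0, yk / alpha) for yk in y]
    return x_bar, y_bar, g
```

With α = 1/4 and one pair split 0.5/0.5 between facilities of cost 1 and 2, the cheaper facility already accumulates 0.5 ≥ α, so the pair's assignment is concentrated on it.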


Step 3. (Clustering and rounding)

Step 3.1 (Construct clustering) The algorithm maintains a feasible fractional solution (x̂, ŷ); initially, we set (x̂, ŷ) = (x̄, ȳ). Let D^c denote the set of demand pairs selected as cluster centers and U the set of demand pairs that have not yet been clustered. At the beginning, set D^c := ∅, C := ∅, U := D. Considering the given values g_ij for the pairs (i, j) ∈ U, find (i^c, j^c) := arg min_{(i,j)∈U} g_ij; if more than one (i, j) ∈ U attains the smallest g_ij, let (i^c, j^c) be any one of them. Let F^{(i^c,j^c)} := {k ∈ F : x̂_{i^c j^c k} > 0} and S^{(i^c,j^c)} := {(i, j) : ∃ k ∈ F^{(i^c,j^c)}, x̂_{ijk} > 0}. Denote the cluster centered at (i^c, j^c) by C^{(i^c,j^c)} := F^{(i^c,j^c)} ∪ S^{(i^c,j^c)}. Update D^c := D^c ∪ {(i^c, j^c)}, C := C ∪ {C^{(i^c,j^c)}}, U := U − S^{(i^c,j^c)}. Iterate this clustering process until U = ∅, then go to Step 3.2.

Step 3.2 (Rounding) For each demand pair (i^c, j^c) ∈ D^c, let k^c := arg min_{k∈F^{(i^c,j^c)}} f_k; open k^c and assign the demand pairs in S^{(i^c,j^c)} to the facility k^c. We set

$$\hat{y}_k = \begin{cases} 1, & k = k^c, \\ 0, & k \in F^{(i^c,j^c)} \setminus \{k^c\}, \end{cases} \qquad \hat{x}_{ijk} = \begin{cases} 1, & (i,j) \in S^{(i^c,j^c)},\ k = k^c, \\ 0, & (i,j) \in S^{(i^c,j^c)},\ k \ne k^c. \end{cases}$$

So far we obtain a 3g-close solution (x̂, ŷ) (see the proof of Lemma 3).

The fractional solution (x̄, ȳ) obtained in Step 2 is feasible. By the definition of x̄, we have

$$\sum_{k \in F} \bar{x}_{ijk} = \sum_{k \in F: c_{ijk} \le c_{ij}(\alpha)} \frac{x_{ijk}}{\alpha_{ij}} + \sum_{k \in F: c_{ijk} > c_{ij}(\alpha)} 0 = \frac{\sum_{k \in F: c_{ijk} \le c_{ij}(\alpha)} x_{ijk}}{\sum_{k \in F: c_{ijk} \le c_{ij}(\alpha)} x_{ijk}} = 1.$$

Thus the first condition of program (2) holds. Furthermore, x̄_ijk ≤ 1. Since (x, y) is feasible, we have x_ijk ≤ y_k. If c_ijk ≤ c_ij(α), then x̄_ijk = x_ijk/α_ij ≤ y_k/α_ij. By the definition of c_ij(α), we have α_ij ≥ α, so y_k/α_ij ≤ y_k/α. Thus x̄_ijk ≤ ȳ_k. If c_ijk > c_ij(α), then x̄_ijk = 0 ≤ ȳ_k. The second condition of program (2) holds as well. The feasibility of the solution (x̂, ŷ) is also clear: the algorithm only assigns demand pairs (i, j) ∈ D to opened facilities, and whenever we set a variable ŷ_k to 0, we also set each variable x̂_ijk to 0.

2.3 Analysis

We present the following lemma, which is important in analyzing the assignment cost.

Lemma 1. For each demand pair (i, j) ∈ D, $c_{ij}(\alpha) \le \frac{1}{1-\alpha} \sum_{k \in F} c_{ijk} x_{ijk}$.

Proof. Let K = {k : c_ijk ≥ c_ij(α)}. By the definition of c_ij(α), we have Σ_{k∈F−K} x_ijk < α, which together with the fact that Σ_{k∈F} x_ijk = 1 implies Σ_{k∈K} x_ijk ≥ 1 − α. Hence,

$$\sum_{k \in F} c_{ijk} x_{ijk} \ge \sum_{k \in K} c_{ijk} x_{ijk} \ge (1-\alpha)\, c_{ij}(\alpha),$$

i.e., $c_{ij}(\alpha) \le \frac{1}{1-\alpha} \sum_{k \in F} c_{ijk} x_{ijk}$.

We now analyze the approximation factor of Algorithm 1, i.e., the relationship between the cost of the solution obtained by Algorithm 1 and the cost of the optimal solution, denoted OPT. To bound the total cost of the solution (x̂, ŷ), we provide the following lemmas, which bound the facility cost and the assignment cost, respectively.

Lemma 2. The facility cost of the feasible integer solution (x̂, ŷ) is no more than 1/α times the facility cost of the feasible fractional solution (x, y), i.e.,

$$\sum_{k \in F} f_k \hat{y}_k \le \frac{1}{\alpha} \sum_{k \in F} f_k y_k.$$

Proof. By Step 3.2 in Algorithm 1, $f_{k^c} = \min_{k \in F^{(i^c,j^c)}} f_k$. Since the minimum of a set of numbers is never more than their weighted average, and $\sum_{k \in F^{(i^c,j^c)}} \bar{x}_{i^c j^c k} = 1$, we obtain $f_{k^c} \le \sum_{k \in F^{(i^c,j^c)}} f_k \bar{x}_{i^c j^c k}$. As shown at the end of Subsect. 2.2, x̄_ijk ≤ ȳ_k, so $f_{k^c} \le \sum_{k \in F^{(i^c,j^c)}} f_k \bar{y}_k$. This inequality implies that the facility cost of ŷ never increases throughout the execution of the algorithm; hence $\sum_{k \in F} f_k \hat{y}_k \le \sum_{k \in F} f_k \bar{y}_k$. By the definition of ȳ, we know that ȳ_k ≤ (1/α) y_k. Finally, we obtain $\sum_{k \in F} f_k \hat{y}_k \le \frac{1}{\alpha} \sum_{k \in F} f_k y_k$.

Lemma 3. The assignment cost of the feasible integer solution (x̂, ŷ) is no more than 3/(1−α) times the assignment cost of the feasible fractional solution (x, y), i.e.,

$$\sum_{k \in F} \sum_{(i,j) \in D} d_{ij} c_{ijk} \hat{x}_{ijk} \le \frac{3}{1-\alpha} \sum_{k \in F} \sum_{(i,j) \in D} d_{ij} c_{ijk} x_{ijk}.$$

Proof. Consider the demand pairs in the cluster C^{(i^c,j^c)}. According to Step 3 in Algorithm 1, there are the following cases.

Case 1. If (i, j) = (i^c, j^c), then c_ijk^c = c_{i^c j^c k^c} ≤ g_{i^c j^c}.

Case 2. If (i, j) ≠ (i^c, j^c), then there must exist k ∈ F^{(i^c,j^c)} such that x̂_ijk > 0, and we have c_ijk ≤ g_ij. If k = k^c, then c_ijk^c ≤ g_ij. If k ≠ k^c, then x̂_{i^c j^c k} > 0, and we have c_{i^c j^c k} ≤ g_{i^c j^c}. By the triangle inequality, we have the following inequalities. When i ≠ i^c and j ≠ j^c,

$$c_{ijk^c} = c_{ik^c} + c_{k^c j} \le c_{ik} + c_{i^c k} + c_{i^c k^c} + c_{kj} + c_{kj^c} + c_{k^c j^c} = c_{ijk} + c_{i^c j^c k} + c_{i^c j^c k^c} \le g_{ij} + 2 g_{i^c j^c} \le 3 g_{ij}.$$


When i = i^c and j ≠ j^c,

$$c_{ijk^c} = c_{i^c j k^c} = c_{i^c k^c} + c_{k^c j} \le c_{i^c k^c} + c_{kj} + c_{kj^c} + c_{k^c j^c} \le c_{i^c j^c k^c} + c_{ijk} + c_{i^c j^c k} \le g_{ij} + 2 g_{i^c j^c} \le 3 g_{ij}.$$

When i ≠ i^c and j = j^c,

$$c_{ijk^c} = c_{i j^c k^c} = c_{ik^c} + c_{k^c j^c} \le c_{ik} + c_{i^c k} + c_{i^c k^c} + c_{k^c j^c} \le c_{ijk} + c_{i^c j^c k} + c_{i^c j^c k^c} \le g_{ij} + 2 g_{i^c j^c} \le 3 g_{ij}.$$

Since for each demand pair (i, j) ∈ D we have g_ij = c_ij(α), and by Lemma 1 $c_{ij}(\alpha) \le \frac{1}{1-\alpha} \sum_{k \in F} c_{ijk} x_{ijk}$, we obtain $c_{ijk^c} \le \frac{3}{1-\alpha} \sum_{k \in F} c_{ijk} x_{ijk}$. Summing over all demand pairs in the clusters of C, we obtain

$$\sum_{k \in F} \sum_{(i,j) \in D} d_{ij} c_{ijk} \hat{x}_{ijk} \le \frac{3}{1-\alpha} \sum_{k \in F} \sum_{(i,j) \in D} d_{ij} c_{ijk} x_{ijk}.$$

Theorem 4. The total cost of the feasible integer solution (x̂, ŷ) is no more than 4 times OPT, i.e.,

$$\sum_{k \in F} f_k \hat{y}_k + \sum_{k \in F} \sum_{(i,j) \in D} d_{ij} c_{ijk} \hat{x}_{ijk} \le 4\, OPT.$$

Proof. By Lemma 2 and Lemma 3, we have

$$\sum_{k \in F} f_k \hat{y}_k + \sum_{k \in F} \sum_{(i,j) \in D} d_{ij} c_{ijk} \hat{x}_{ijk} \le \frac{1}{\alpha} \sum_{k \in F} f_k y_k + \frac{3}{1-\alpha} \sum_{k \in F} \sum_{(i,j) \in D} d_{ij} c_{ijk} x_{ijk}$$
$$\le \max\left\{\frac{1}{\alpha}, \frac{3}{1-\alpha}\right\} \left( \sum_{k \in F} f_k y_k + \sum_{k \in F} \sum_{(i,j) \in D} d_{ij} c_{ijk} x_{ijk} \right) \le \max\left\{\frac{1}{\alpha}, \frac{3}{1-\alpha}\right\} OPT.$$

Setting α = 1/4, we obtain the theorem.
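The choice α = 1/4 balances the two factors in the bound, as a quick check confirms:

```python
# 1/alpha = 3/(1 - alpha) at alpha = 1/4, so both factors equal 4
alpha = 0.25
facility_factor = 1 / alpha          # Lemma 2 factor
assignment_factor = 3 / (1 - alpha)  # Lemma 3 factor
approximation_ratio = max(facility_factor, assignment_factor)
print(approximation_ratio)  # 4.0
```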

3 Conclusion

In this paper, we introduce the uncapacitated facility location problem with origin and destination, where each unit of demand has its own origin and destination and must be shipped from its origin, via a location at which a transit station is built, to its destination. An LP-rounding approximation algorithm with ratio 4 is developed, which provides a reference point for other methods to improve the approximation ratio. For further research, one can conduct experiments and analysis of the algorithm. There are several other directions for future work, such as the capacitated facility location problem with origin and destination and the k-level facility location problem with origin and destination.

References

1. An, H.C., Singh, M., Svensson, O.: LP-based algorithms for capacitated facility location. In: Proceedings of the IEEE Symposium on Foundations of Computer Science, pp. 256–265. IEEE Computer Society (2014)
2. Cornuejols, G., Nemhauser, G.L., Wolsey, L.A.: The uncapacitated facility location problem. In: Mirchandani, P.B., Francis, R.L. (eds.) Discrete Location Theory, pp. 119–171. Wiley, New York (1990)
3. Chudak, F.A., Shmoys, D.B.: Improved approximation algorithms for the uncapacitated facility location problem. SIAM J. Comput. 33, 1–25 (2003)
4. Guha, S., Khuller, S.: Greedy strikes back: improved facility location algorithms. J. Algorithms 31(1), 228–248 (1999)
5. Hh, A., Zo, B.: An improved scatter search algorithm for the uncapacitated facility location problem. Comput. Ind. Eng. 135, 855–867 (2019)
6. Klincewicz, J.G.: Enumeration and search procedures for a hub location problem with economies of scale. Ann. Oper. Res. 110, 107–122 (2002)
7. Li, S.: A 1.488 approximation algorithm for the uncapacitated facility location problem. In: Proceedings of ICALP, Part II, pp. 77–88 (2011)
8. Li, Y., Du, D., Xiu, N., Xu, D.: Improved approximation algorithms for the facility location problems with linear/submodular penalties. Algorithmica 73(2), 460–482 (2015)
9. Labbe, M., Louveaux, F.V.: Location problems. In: Dell'Amico, M., Maffioli, F., Martello, S. (eds.) Annotated Bibliographies in Combinatorial Optimization, pp. 261–281. Wiley, Chichester (1997)
10. Lv, W., Wu, C.: An LP-rounding based algorithm for a capacitated uniform facility location problem with penalties. J. Comb. Optim. 41(4), 888–904 (2021). https://doi.org/10.1007/s10878-021-00726-0
11. Nezhad, N.A.T., Moradi, S., Karamali, G.: Fuzzy facility location problem with point and rectangular destinations. Int. J. Math. Oper. Res. 18(1), 21–44 (2021)
12. Quilliot, A., Sarbinowski, A.: Facility location models for vehicle sharing systems. In: Computer Science Information Systems. IEEE (2016)
13. Shmoys, D.B., Tardos, É., Aardal, K.I.: Approximation algorithms for facility location problems. In: Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pp. 265–274 (1997)
14. Wang, F., Hu, X., Wu, C.: 2-level station location for bike sharing. In: Zhang, Z., Li, W., Du, D.Z. (eds.) Algorithmic Aspects in Information and Management. AAIM 2020 (2020)
15. Xu, G., Xu, J.: An LP rounding algorithm for approximating uncapacitated facility location problem with penalty. Inf. Process. Lett. 94, 119–123 (2005)
16. Zhang, J., Pan, X., Li, M., Yu, P.S.: Bicycle-sharing systems expansion: station re-deployment through crowd planning. In: ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, vol. 2. ACM (2016)

Reinforcement Learning-Based Auto-scaling Algorithm for Elastic Cloud Workflow Service Jian-bin Lu, Yang Yu(B) , and Mao-lin Pan School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China [emailprotected]

Abstract. Deploying a workflow engine as a service in a container cloud environment can improve its service quality and reliability, but auto-scaling of elastic cloud workflow services has attracted little research attention. Current auto-scaling algorithms oriented to common microservices take little account of the long startup time and high startup cost that characterize workflow services, which can easily cause problems such as untimely or excessive scaling. Given this, based on reinforcement learning and semi-Markov decision process (SMDP) modeling, an auto-scaling algorithm for elastic cloud workflow engines is proposed, which enables the cloud workflow service to scale in time, allocating resources appropriately and ensuring service availability. Simulation comparison experiments show that the algorithm automatically scales instances in advance and adapts to changes in traffic through the reinforcement learning SMDP strategy, so that it reduces the violation rate of Service Level Agreements (SLA) and improves the availability of the cloud workflow service.

Keywords: Workflow · Cloud computing · Auto scaling · Reinforcement learning

1 Introduction

With the increase of globalization, business process management (BPM) is expected to help modern enterprises be both competitively agile and cost-efficient. Due to the development of cloud computing, BPM can be offered as a service that provides a dedicated business process in a cloud-based manner, so-called BPM as a service (BPMaaS) [1]. Current research on cloud workflow services focuses on the application and architecture design of cloud workflow services to improve the efficiency of the cloud environment, but pays little attention to the elasticity of cloud workflow services [2]. To improve the elasticity of BPMaaS, it is important to auto-scale the cloud workflow engine services, one of the cores of BPMaaS. However, compared with general cloud services, the cloud workflow engine service has a larger granularity, takes longer to start, and consumes more resources [3], so auto-scaling such a service faces more challenges. Considering the stochasticity and uncertainty of the cloud environment, solutions based on Reinforcement Learning (RL) have been proposed to solve auto-scaling problems [4]. Auto-scaling problems are usually modeled as Markov decision processes

© Springer Nature Switzerland AG 2022
H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 303–310, 2022. https://doi.org/10.1007/978-3-030-96772-7_28


J. Lu et al.

(MDP) problems. In cloud auto-scaling problems, the RL agent learns how to allocate appropriate resources in a pay-per-use manner. However, due to the characteristics of cloud workflow services, the observation of rewards and states is not as straightforward as for ordinary microservices, and it is necessary to auto-scale the BPMS proactively. Applying ordinary RL methods causes untimely scaling, over-allocation of resources, and oscillation. To address these challenges, this paper proposes an automatic scaling algorithm for elastic cloud workflow services based on load prediction and reinforcement learning, considering the features of cloud workflow service scaling. The algorithm models the automatic scaling problem of cloud workflow services as an SMDP and combines reinforcement learning and load prediction algorithms to perform automatic scaling operations on cloud workflow services. It can auto-scale the services in advance as the traffic load changes and allocate resources rationally, so that it can provide stable service.

2 Problem Description

This section analyzes the auto-scaling problem of cloud workflow services from the perspectives of auto-scaling and reinforcement learning. Auto-scaling problems for cloud applications are commonly abstracted as a MAPE (Monitoring, Analysis, Planning, and Execution) control loop [5]. Because of the long startup time and high resource consumption, the cloud workflow engine service should be scaled proactively, and oscillation should be prevented, as it results in resource wastage and more SLA violations.
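The MAPE abstraction can be written as a tiny control-loop skeleton; the four phase functions below are placeholders to be supplied by a concrete autoscaler, not part of the paper:

```python
import time

def mape_loop(monitor, analyze, plan, execute, interval_s=30.0, steps=None):
    """Generic MAPE (Monitoring, Analysis, Planning, Execution) control loop.
    Runs forever when steps is None, otherwise for `steps` iterations."""
    i = 0
    while steps is None or i < steps:
        metrics = monitor()          # M: collect indicators
        state = analyze(metrics)     # A: interpret them
        action = plan(state)         # P: decide a scaling action
        execute(action)              # E: apply it to the cluster
        time.sleep(interval_s)
        i += 1
```

For example, a trivial threshold policy plugs in as `plan=lambda hot: +1 if hot else 0` with `analyze` comparing a metric against a threshold.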

Fig. 1. MDP interaction process between an agent and the environment.

SMDP has proven to be a successful approach to making good decisions in stochastic environments, and it is feasible to model the auto-scaling problem of cloud workflow engines as an SMDP problem [6]. We apply reinforcement learning to the automatic scaling problem of cloud workflow services and model the problem as an SMDP, as depicted in Fig. 1. An SMDP is defined as a 5-tuple (S, Ψ, P_a(·,·), R_a(·,·), γ), where: S represents the environmental state space; the indicators of the last several time intervals are combined into a state. Ψ represents the decision sequence space; a decision sequence comprises a scaling action and the number of time intervals it stays (e.g. +1, 0, 0, 0). P_a(s, s′) represents the probability that action a in state s at time t will lead to state s′ at time t + 1.

Reinforcement Learning-Based Auto-scaling Algorithm


R_a(s, s′) represents the (expected) immediate reward received after transitioning from state s to state s′ due to action a. γ is the discount factor; it represents the difference in importance between future and immediate rewards. The auto-scaling problem of cloud workflow engine services is complex, requiring more monitoring indicators to achieve more precise control, and the state-action space in such problems is relatively large; it costs many resources to maintain such a space. It may also cause oscillation because of its exploration policies and frequent actions.

3 RL-Based Auto-scaling Algorithm for Elastic Cloud Workflow Service

The objective of the proposed algorithm is to auto-scale the cloud workflow engine service to attain maximum resource utilization, minimal response time, and maximum throughput. The system architecture and the algorithm are introduced in this section.

3.1 System Design

The proposed algorithm is implemented on Kubernetes, an open-source system for the management of containerized applications [7]. The architecture and its components are presented in Fig. 2. The major components of the architecture are explained subsequently.

[Figure 2 depicts the architecture: a Kubernetes monitor observes the workload, resource utilization, and other indicators of the gateway and the workflow engine (runtime bundle) containers; the autoscaler acts on them through the scale API, alongside the query, notification, and audit services and the REST API server.]

Fig. 2. System architecture for auto-scaling cloud workflow engine services.

Monitor. The system collects indicators such as the amount of workload to be processed, resource utilization, and the number of database interactions, and uses the indicators as observations to be processed by the RL agent.


Autoscaler. The system calculates the best scaling action based on the performance, utilization, and load information sent by the monitor.

The overview of the proposed system is as follows. The autoscaler scales the cloud workflow engine containers through the interface provided by the Kubernetes cluster, and a scaling action is adopted at regular intervals. The indicator monitor obtains workload, resource utilization, and other performance indicators from Kubernetes and the cloud workflow engine containers, and submits these performance indicators to the autoscaler for calculation and processing. The autoscaler obtains the feedback indicators from the indicator monitor at the next time point after an action is performed, and computes the reward for the transition from the previous state to the next state. The autoscaler uses the SARSA algorithm to learn the auto-scaling strategy, which can predict future reward estimates from the current state.

3.2 Algorithm Design

In the SMDP problem, the optimal Q-function satisfies Eq. 1:

$$Q^*(s,a) = \sum_{s' \in S} P_a(s,s') \int_0^\infty \!\! \int_0^t e^{-\beta\tau}\, r\, d\tau\, dF_{ss'}(t\,|\,a) + \sum_{s' \in S} P_a(s,s') \int_0^\infty e^{-\beta t} \max_{a' \in A} Q^*(s',a')\, dF_{ss'}(t\,|\,a) \tag{1}$$

Here, F_ss′(·|a) represents the distribution of the time until the transition from s to s′ occurs. Equation 1 leads to SARSA for SMDP, which updates the function Q(·, ·) as expressed in Eq. 2:

$$Q(S_t, \psi_t) \leftarrow Q(S_t, \psi_t) + \alpha\left[\frac{1 - e^{-\beta\tau}}{\beta} r_t + e^{-\beta\tau} Q(S_{t+1}, \psi_{t+1}) - Q(S_t, \psi_t)\right] \tag{2}$$

Here, (1 − e^{−βτ})/β · r_t is the cumulative reward and e^{−βτ} is the discount factor, which represents the difference in importance between future and immediate rewards. The proposed algorithm is described in Algorithm 1.


We combine the indicators that the monitor gathers, i.e. CPU utilization and workload, into a state s. We then take the states of the past τ time intervals as the RL state S, as described in Eq. 3:

$$S = (s_0, a_0, s_1, a_1, \ldots, s_{\tau-1}, a_{\tau-1}, s_\tau) \tag{3}$$

Inspired by deep RL, we use a neural network approximation function to estimate the Q-value. We combine the ε-greedy policy with a time series forecasting algorithm and propose an ε-workload-predict method, as described in Algorithm 2.
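Algorithm 2 itself is not reproduced in this chunk; the following is one plausible sketch of an ε-workload-predict selection rule that combines ε-greedy exploration with a workload forecast (all names are illustrative, not from the paper):

```python
import random

def epsilon_workload_predict(q_values, history, forecast, epsilon=0.1):
    """With probability epsilon, explore guided by a workload forecast
    (scale out if load is predicted to rise, in if it falls); otherwise
    act greedily on the Q estimates for the current state.

    q_values: dict mapping action -> Q estimate
    history:  recent workload observations
    forecast: callable mapping history -> predicted next workload
    """
    if random.random() < epsilon:
        predicted = forecast(history)
        if predicted > history[-1]:
            return +1        # scale out ahead of a rising load
        if predicted < history[-1]:
            return -1        # scale in ahead of a falling load
        return 0
    return max(q_values, key=q_values.get)
```

With ε = 0 this reduces to the greedy policy on the Q estimates; with ε = 1 it acts purely on the forecast trend.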

For rewards, we set penalties for SLA violations and rewards for saving resources; the former constitutes the first part of the reward function in Eq. 4:

$$r = \frac{1 - e^{-\rho\, t_{res}/RT_{TH}}}{1-\sigma} - \theta\, \Delta_{ins} \tag{4}$$

The autoscaler calculates the reward and receives the next state S′. It then obtains the action sequence ψ through the neural network and ε-workload-prediction, and updates the Q-value through Eq. 2.

4 Experiment

To evaluate the performance of the RL-based auto-scaling algorithm for cloud workflow engine services, the design of the experiment is introduced in this section, and then the experimental results are given and analyzed.

4.1 Experiment Design

To study the advantages and disadvantages of the proposed algorithm, the experiment separately tested and compared the performance of the static threshold algorithm, the SARSA algorithm modeled as an MDP, and the proposed algorithm in auto-scaling the runtime bundle containers of Activiti Cloud. The environment of this experiment is a Kubernetes cluster deployed on 3 local servers. Each server is configured with an Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90 GHz and 32 GB memory. The official reference version of Activiti Cloud was deployed in the experiment, which offers a set of cloud-native building blocks designed to run on distributed infrastructures [8].


Fig. 3. A process definition for ranking movies.

Activiti Cloud separates out the creation of the process definition, parsing it and storing it in the database when the cloud workflow engine is deployed, and it takes a similar time to process requests that create processes of different definitions. The process definition used in the experiment is shown in Fig. 3. We send different types of requests, such as creating processes and finishing tasks, to test the elasticity of the system.

Fig. 4. 1998 World Cup web site access traffic diagram.

To simulate actual traffic more realistically, this article uses part of the access traffic of the 1998 World Cup web site as the data set, as shown in Fig. 4. For convenience, we use the portion of the trace from June 1 to June 4.

4.2 Experiment Result

This section evaluates the static threshold algorithm, the SARSA algorithm, and the algorithm in this article in terms of CPU utilization, resource usage, and SLA violation rate. An SLA violation is defined as a response time exceeding 1 s. As depicted in Fig. 5(a), the static threshold algorithm is unable to handle traffic peaks in time, and the MDP-based SARSA algorithm causes oscillation. The proposed algorithm can automatically scale the cloud workflow engine service in time with changes in traffic and mitigate oscillation. Figure 5(b) and Table 1 show that the static threshold algorithm is unable to cope with changes in traffic, resulting in a high response time and SLA violation rate. MDP-based SARSA can scale in time with load changes to a certain extent, but due to the oscillation it causes, its average response time is also high. The proposed algorithm can scale the cloud workflow engine service in time, and the SLA violation rate and average response time are reduced.


Fig. 5. (a) Pod supply comparison with the static threshold and SARSA-MDP algorithms. (b) Response time comparison.

Table 1. Comparison of SLA violation rate, average response time, and average pod supply.

Algorithm        | SLA violation rate/% | Average response time/ms | Average pod supply
Static threshold | 45                   | 3443                     | 3.01
MDP-based SARSA  | 28                   | 1534                     | 3.70
Proposed         | 17                   | 937                      | 4.01

As depicted in the above figures and table, compared with the other algorithms, the proposed algorithm can scale the cloud workflow engine service in time. It reduces the SLA violation rate and improves the availability of cloud workflow engine services. The proposed algorithm allocates slightly more resources, but relative to MDP-based SARSA the SLA violation rate is reduced by 39% and the average response time is reduced by 39%.
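The 39% figures quoted above can be reproduced from Table 1, taking MDP-based SARSA as the baseline:

```python
# reductions of the proposed algorithm relative to MDP-based SARSA (Table 1)
sla_reduction = (28 - 17) / 28        # SLA violation rate: 28% -> 17%
rt_reduction = (1534 - 937) / 1534    # average response time: 1534 ms -> 937 ms
print(round(sla_reduction, 2), round(rt_reduction, 2))  # 0.39 0.39
```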

5 Conclusion

To improve the elasticity of the cloud workflow engine service, allocate resources appropriately, and achieve high availability, we design a cloud workflow engine auto-scaling algorithm based on reinforcement learning that considers the characteristics of the cloud workflow engine service. We model the auto-scaling problem of the cloud workflow engine service as an SMDP problem, use the ε-workload-predict policy for strategy exploration, and use the SMDP-based SARSA algorithm to learn an appropriate scaling policy. As shown in the experiments, the proposed algorithm can scale the cloud workflow engine service automatically with changes in traffic load. It reduces the SLA violation rate and improves the availability of cloud workflow engine services.


Although the proposed algorithm can solve the elasticity problem of cloud workflow engine services to a certain extent, there is still room for improvement. Firstly, the convergence speed of reinforcement learning is relatively slow, and it may be difficult to cope with sudden changes in traffic load in practical applications. To ensure the high availability of cloud workflow engine services, methods such as parallel learning can be used to speed up convergence. Secondly, due to the cache mechanism of the cloud workflow engine service, CPU and other resource utilization in the initial stage is relatively high; how to effectively start a cloud workflow engine in advance is also a future research direction. Last but not least, we chose Activiti Cloud as the cloud workflow engine service for the experiments, and the approach remains to be verified and tested on the auto-scaling of other workflow engines.

Acknowledgements. This work is supported by the NSFC-Guangdong Joint Fund Project under Grant No. U20A6003; the National Natural Science Foundation of China (NSFC) under Grant No. 61972427; and the Research Foundation of the Science and Technology Plan Project in Guangdong Province under Grant No. 2020A0505100030.

References

1. Baeyens, T.: BPM in the cloud. In: Daniel, F., Wang, J., Weber, B. (eds.) BPM 2013. LNCS, vol. 8094, pp. 10–16. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40176-3_3
2. Schulte, S., Janiesch, C., Venugopal, S., Weber, I., Hoenisch, P.: Elastic business process management: state of the art and open challenges for BPM in the cloud. Future Gener. Comput. Syst. 46, 36–50 (2015)
3. Garí, Y., Monge, D.A., Pacini, E., Mateos, C., Garino, C.G.: Reinforcement learning-based application autoscaling in the cloud: a survey. Eng. Appl. Artif. Intell. 102, 104288 (2021)
4. van Otterlo, M.: The Logic of Adaptive Behavior: Knowledge Representation and Algorithms for Adaptive Sequential Decision Making under Uncertainty in First-Order and Relational Domains. IOS Press (2009)
5. Qu, C., Calheiros, R.N., Buyya, R.: Auto-scaling web applications in clouds: a taxonomy and survey. ACM Comput. Surv. (CSUR) 51, 1–33 (2018)
6. Bradtke, S.J., Duff, M.O.: Reinforcement learning methods for continuous-time Markov decision problems. In: Advances in Neural Information Processing Systems, vol. 7, p. 393 (1995)
7. Kubernetes. https://kubernetes.io/. Accessed 11 Jun 2021
8. Activiti.org. https://www.activiti.org/. Accessed 11 Jun 2021

Optimal Energy Efficiency Strategy of mm Wave Cooperative Communication Small Cell Based on SWITP

Taoshen Li1(B) and Mingyu Lu2

1 China-ASEAN International Joint Laboratory of Integrated Transport, Nanning University, 8 Longting Road, Nanning, People's Republic of China
[emailprotected]
2 School of Computer, Electronics and Information, Guangxi University, 100 Daxue Road, Nanning, People's Republic of China

Abstract. Aiming at the optimization problem in the stage of simultaneous wireless information and power transfer (SWITP), an optimal energy efficiency strategy for millimeter-wave cooperative communication small cells based on SWITP is proposed to maximize the link energy efficiency, in which the receivers of the user equipment work in the power splitting mode. Under constraints on the minimum link transmission rate and minimum harvested energy, the strategy maximizes the link energy efficiency of the system by jointly optimizing the transmit power control and the power splitting factor. Since the original problem is a non-convex fractional programming problem and NP-hard, the strategy transforms it into a tractable convex optimization problem by the Dinkelbach method, and the Lagrange dual method is then used to solve it. Finally, a cross-iteration algorithm is designed to obtain the optimal solution. Simulation results show that the proposed strategy is more effective than the traditional power control method and the maximum transmit power method.

Keywords: Millimeter-wave cooperative communication · Simultaneous wireless information and power transfer (SWITP) · Energy harvesting · Energy efficiency · Spectral efficiency · Power beacon
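The Dinkelbach method mentioned in the abstract turns a ratio maximization into a sequence of parametric subproblems. A generic sketch follows; the subproblem solver is problem-specific and supplied by the caller, not taken from the paper:

```python
def dinkelbach(numer, denom, solve_sub, q0=0.0, tol=1e-9, max_iter=100):
    """Maximize numer(x)/denom(x) (denom > 0 on the feasible set) by
    Dinkelbach's iteration: repeatedly solve x = argmax numer(x) - q*denom(x)
    and update q = numer(x)/denom(x) until numer(x) - q*denom(x) ~ 0."""
    q = q0
    x = None
    for _ in range(max_iter):
        x = solve_sub(q)                   # parametric subproblem
        gap = numer(x) - q * denom(x)
        if abs(gap) < tol:
            break
        q = numer(x) / denom(x)
    return x, q                            # q is the achieved ratio
```

As a usage example, maximizing (x+1)/(x²+1) over a grid of candidates converges to the optimum near x = √2 − 1 with ratio ≈ 1.2071.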

1 Introduction

5G wireless communications have brought new challenges to traditional energy-constrained wireless networks. Energy harvesting (EH) technology can harvest energy from radio frequency (RF) signals and use it for subsequent wireless communication, which can prolong the lifetime of equipment and improve the performance of wireless networks. Simultaneous wireless information and power transfer (SWITP) is an effective way to solve the problem of energy limitation in wireless communication networks, as it can realize information transmission and energy harvesting at the same time [1]. Device-to-device (D2D) technology is a direct communication model between two peer-to-peer user nodes, which can reduce the resource consumption and delay of the access and backhaul networks, alleviate the data pressure on the core network of the communication system, and improve spectrum utilization and system capacity. The millimeter-wave (mm Wave) band, spanning roughly 30–300 GHz, offers rich spectrum resources, high transmission rates and few interference sources. The combined application of D2D and mm Wave can therefore improve wireless network performance by improving spectrum efficiency and system throughput. Energy harvesting from RF signals can provide continuous and stable energy for mobile devices, so as to ensure sustainable D2D communication. Therefore, applying SWITP technology to D2D communication is a potential solution. [2] and [3] studied D2D networks with wireless power and information transmission (WPIT) functions. Based on SWIPT, [4] proposed a D2D communication EH heterogeneous cellular network. [5] presented a novel D2D-aware caching policy for high-rate D2D mm Wave communication. [6] proposed an energy-efficient multicast scheduling scheme that can utilize D2D communications. [7] solved the average energy efficiency of EH-based D2D communication heterogeneous networks. In 5G networks, the deployment of ultra-dense cells can greatly reduce the propagation loss of wireless energy transmission (WET). [8] focused on the design and optimization of SWITP networks at the 5G new frequencies. [9] proposed a low-power multi-antenna mm Wave receiver architecture. [10] implemented SWITP in mm Wave networks by the power splitting (PS) method. [11] designed a wireless ad-hoc network with power beacon (PB) aided mm Wave. [12] studied the feasibility of using mm Wave for WET in a large-scale network composed of PBs and energy collectors. [13] used non-orthogonal multiple access (NOMA) to improve spectral efficiency in mm Wave massive multiple-input multiple-output (MIMO) systems.

© Springer Nature Switzerland AG 2022 H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 311–323, 2022. https://doi.org/10.1007/978-3-030-96772-7_29
Most existing research on energy harvesting with mm Wave only considers harvesting mm Wave energy from RF signal energy sources (such as base stations, APs and PBs), and does not consider SWITP at the receiving end. However, in D2D communication, the transmitter and receiver form a paired device pair and should not be considered separately. Moreover, the deployment of multi-antenna systems also implies greater energy consumption. Aiming at the demand for green communication, this paper applies energy harvesting technology to D2D and mm Wave communication, establishes a new small cell network model of user equipment devices (UEs) and mm Wave cooperative communication for high-low frequency hybrid networking, and proposes an optimal energy efficiency strategy for mm Wave cooperative communication small cells under SWIPT to maximize the link energy efficiency. Finally, the feasibility and effectiveness of the proposed scheme are illustrated by simulation and comparison experiments.

2 System Model

2.1 Network Model

Consider the cellular cell of a 5G high-low frequency hybrid network as shown in Fig. 1. Within the coverage of the base station (BS), there are multiple mm Wave small cells suitable for transmission using mm Wave technology. The BS works in the Sub-6 GHz spectrum range and provides additional signal services for the mm Wave small cells. Since the mm Wave cells use mm Wave communication and operate in a different frequency band from the macro cell, the interference between the macro cell and the mm Wave cells can be avoided. In addition, because mm Wave has the characteristics of directional transmission, high path loss and sensitivity to blocking, the interference between mm Wave cells, and between indoor and outdoor mm Wave cells, can almost be ignored.

Fig. 1. 5G high-low frequency hybrid networking cellular cell

Assuming that the UEs in the mm Wave cell work in the WPCN mode, the system working time slot is shown in Fig. 2. In a WPCN cycle, all UEs in a small cell first obtain energy from the RF signal radiated by the PB through energy harvesting, and then use SWIPT to realize the simultaneous transmission of energy and information in the downlink phase.

Fig. 2. Structure of a WPCN cycle: the PB→UE energy harvesting phase followed by the SWITP phase

2.2 System Model

In the SWITP phase, the mm Wave small cell system model is shown in Fig. 3. K pairs of energy-limited transmitters (TX) and receivers (RX) are represented by TX = {1, 2, ..., K} and RX = {1, 2, ..., K}, respectively. To save computing power and resources, it is assumed that all energy-limited devices are equipped with a single antenna, and that each RX adopts SWITP technology: from the mm Wave signal transmitted by the corresponding TX, each RX harvests a certain amount of energy from the received signal through the power splitting method.

Fig. 3. Illustration of the UE paired system model

The signal received by the i-th RX can be expressed as:

$$y_i = h_{(i,i)}\sqrt{L(r_{(i,i)})}\,x_i + \sum_{j\in TX,\, j\neq i} h_{(j,i)}\sqrt{L(r_{(j,i)})}\,x_j + n_A \quad (1)$$

where $h_{(i,i)}$ denotes the quasi-static fading of the i-th channel link, $L(r_{(i,i)})^{1/2}$ denotes the path-loss factor, and $x_i$ is the symbol transmitted from TX$_i$. The second term of the formula represents the co-channel interference caused at RX$_i$ by all TXs other than TX$_i$; $n_A \sim \mathcal{CN}(0, \sigma_A^2)$ denotes the additive white Gaussian noise produced by the antenna in the RF signal receiving stage, with variance $\sigma_A^2$ and zero mean. Assuming that the PS structure is as shown in Fig. 4 and that each RX divides the received signal into two power streams by the PS method, the power stream for information decoding at RX$_i$ is:

$$y_i^{ID} = \sqrt{\rho_i}\, y_i + n_0 = \sqrt{\rho_i}\left(h_{(i,i)}\sqrt{L(r_{(i,i)})}\,x_i + \sum_{j\in TX,\, j\neq i} h_{(j,i)}\sqrt{L(r_{(j,i)})}\,x_j + n_A\right) + n_0 \quad (2)$$

where $0 < \rho_i < 1$ represents the power split ratio and $n_0 \sim \mathcal{CN}(0, \sigma_0^2)$ denotes the additive white Gaussian noise produced by the information decoding circuit, with variance $\sigma_0^2$ and zero mean.

Fig. 4. The RX structure with power splitting

The signal-to-interference-plus-noise ratio at RX$_i$ is

$$SINR_i = \frac{\rho_i P_i |h_{(i,i)}|^2 L(r_{(i,i)})}{\rho_i\left(\sum_{j\in TX,\, j\neq i} P_j |h_{(j,i)}|^2 L(r_{(j,i)}) + \sigma_A^2\right) + \sigma_0^2} \quad (3)$$

where $P_i$ denotes the transmit power of TX$_i$, and $P_j$ denotes the transmit power of the other TXs causing co-channel interference. According to Shannon theory, the unit-bandwidth throughput of the i-th pair of UEs can be expressed as:

$$R_i = \log_2(1 + SINR_i) = \log_2\!\left(1 + \frac{\rho_i P_i |h_{(i,i)}|^2 L(r_{(i,i)})}{\rho_i\left(\sum_{j\in TX,\, j\neq i} P_j |h_{(j,i)}|^2 L(r_{(j,i)}) + \sigma_A^2\right) + \sigma_0^2}\right) \quad (4)$$

Similarly, the power stream for energy harvesting can be expressed as:

$$y_i^{EH} = \sqrt{1-\rho_i}\, y_i = \sqrt{1-\rho_i}\left(h_{(i,i)}\sqrt{L(r_{(i,i)})}\,x_i + \sum_{j\in TX,\, j\neq i} h_{(j,i)}\sqrt{L(r_{(j,i)})}\,x_j + n_A\right) \quad (5)$$

Since the energy carried by the noise terms $n_A$ and $n_0$ is too small to activate the energy harvesting circuit, it can be ignored. Therefore, the energy harvested at RX$_i$ is:

$$E_i = (1-\rho_i)\,\eta\left(P_i |h_{(i,i)}|^2 L(r_{(i,i)}) + \sum_{j\in TX,\, j\neq i} P_j |h_{(j,i)}|^2 L(r_{(j,i)})\right) \quad (6)$$

where $\eta$ denotes the energy conversion efficiency.

According to the linear power consumption model [14], the total power consumption of the i-th pair of UEs is:

$$P_i^{tot} = \xi P_i + 2P_{cir} \quad (7)$$

where $\xi \in [1, \infty)$ denotes the reciprocal of the power amplifier efficiency, and $P_{cir}$ is the static circuit power consumed by the filter, digital-to-analog converter and other modules. The energy efficiency of the i-th pair of UEs is defined as:

$$EE_i = \frac{R_i}{P_i^{tot}} \quad (8)$$
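The per-link quantities defined in Eqs. (3)–(8) are simple enough to sketch directly; the Python functions below mirror them (the parameter values in the usage line are illustrative assumptions, not taken from the paper):

```python
import math

def sinr(rho, P, g, I, NA, N0):
    # Eq. (3): SINR of the information stream after power splitting,
    # with g = |h|^2 L(r) and I the aggregate co-channel interference power.
    return (rho * P * g) / (rho * (I + NA) + N0)

def rate(rho, P, g, I, NA, N0):
    # Eq. (4): unit-bandwidth throughput (bit/s/Hz)
    return math.log2(1.0 + sinr(rho, P, g, I, NA, N0))

def harvested(rho, eta, P, g, I):
    # Eq. (6): energy harvested from the (1 - rho) power stream
    return (1.0 - rho) * eta * (P * g + I)

def energy_efficiency(rho, P, g, I, NA, N0, xi, Pcir):
    # Eq. (8): throughput divided by the total power of Eq. (7)
    return rate(rho, P, g, I, NA, N0) / (xi * P + 2.0 * Pcir)

# Illustrative single-link evaluation (all numbers are assumptions, in watts):
ee = energy_efficiency(rho=0.5, P=0.2, g=1e-4, I=1e-8,
                       NA=1e-13, N0=1e-12, xi=2.6, Pcir=0.05)
```

Here `g` and `I` play the roles of $g_{i,i}$ and $I_{j,i}$ introduced in the next section; raising $\rho_i$ trades harvested energy for decoding SINR.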

3 Problem Description and Solution Strategy

Under the joint constraints of a minimum rate and minimum energy harvesting, the strategy proposed in this paper takes the energy efficiency of all UEs as the optimization goal, and optimizes the transmit power control and the power splitting factor so as to transmit more bits per unit of power. The mathematical model of the optimization problem P1 can therefore be expressed as follows:

$$\max_{P_i,\rho_i} \sum_{i\in K} EE_i$$
$$\text{s.t.}\quad C1:\ E_i \ge E_{min},\ \forall i \in K;\quad C2:\ P_i \le P_{max},\ \forall i \in K;$$
$$C3:\ 0 < \rho_i < 1,\ \forall i \in K;\quad C4:\ R_i \ge R_{th},\ \forall i \in K \quad (9)$$

where $K = \{1, \ldots, k\}$ is the index set of UE pairs, $E_{min}$ is the minimum EH constraint at each RX, $P_{max}$ denotes the maximum allowable transmit power at each TX, and $R_{th}$ denotes the minimum rate threshold of a UE link. To express the solution process conveniently, let $N_A = \sigma_A^2$, $N_0 = \sigma_0^2$, $g_{i,i} = |h_{(i,i)}|^2 L(r_{(i,i)})$ and $I_{j,i} = \sum_{j\in TX,\, j\neq i} P_j |h_{(j,i)}|^2 L(r_{(j,i)})$. Then P1 can be rewritten as problem P2:

$$\max_{P_i,\rho_i} \sum_{i\in K} \frac{\log_2\!\left(1 + \frac{\rho_i P_i g_{i,i}}{\rho_i (I_{j,i}+N_A)+N_0}\right)}{\xi P_i + 2P_{cir}}$$
$$\text{s.t.}\quad C1:\ (1-\rho_i)\,\eta\left(P_i g_{i,i} + I_{j,i}\right) \ge E_{min},\ \forall i\in K;\quad C2:\ P_i \le P_{max},\ \forall i\in K;$$
$$C3:\ 0 < \rho_i < 1,\ \forall i\in K;\quad C4:\ \log_2\!\left(1 + \frac{\rho_i P_i g_{i,i}}{\rho_i (I_{j,i}+N_A)+N_0}\right) \ge R_{th},\ \forall i\in K \quad (10)$$

Obviously, the optimization problem P2 is a non-linear fractional programming problem, and it is difficult to find an exact solution directly. According to the Dinkelbach method, this problem can be transformed into an equivalent problem in subtractive form. Let $q_{ee}^*$ be the optimal value of the problem, defined as:

$$q_{ee}^* = \max_{P_i,\rho_i} \sum_{i\in K} \frac{\log_2\!\left(1 + \frac{\rho_i^* P_i^* g_{i,i}}{\rho_i^* (I_{j,i}+N_A)+N_0}\right)}{\xi P_i^* + 2P_{cir}} \quad (12)$$

where $P_i^*$ and $\rho_i^*$ are the optimal transmit power and power split ratio at which the energy efficiency of the i-th pair of UEs reaches its optimum. The equivalent subtractive objective function can be obtained by the Dinkelbach method, so the original optimization problem P2 can be rewritten as:

$$\max_{P_i,\rho_i} \sum_{i\in K}\left[\log_2\!\left(1 + \frac{\rho_i P_i g_{i,i}}{\rho_i (I_{j,i}+N_A)+N_0}\right) - q_{ee}\left(\xi P_i + 2P_{cir}\right)\right]$$
$$\text{s.t.}\quad C1:\ (1-\rho_i)\,\eta\left(P_i g_{i,i} + I_{j,i}\right) \ge E_{min},\ \forall i\in K;\quad C2:\ P_i \le P_{max},\ \forall i\in K;$$
$$C3:\ 0 < \rho_i < 1,\ \forall i\in K;\quad C4:\ \log_2\!\left(1 + \frac{\rho_i P_i g_{i,i}}{\rho_i (I_{j,i}+N_A)+N_0}\right) \ge R_{th},\ \forall i\in K \quad (13)$$
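As an aside, the Dinkelbach transformation used above can be illustrated on a toy one-variable ratio maximization; the functions `f`, `h` and the candidate grid below are assumptions for illustration only, not quantities from the paper:

```python
import math

def dinkelbach(f, h, candidates, tol=1e-9, max_iter=500):
    # Generic Dinkelbach iteration for max f(x)/h(x) with h > 0:
    # repeatedly maximize the subtractive form f(x) - q*h(x), then
    # update q to the achieved ratio; stop when the subtractive
    # optimum F(q) reaches zero, at which point q is the optimal ratio.
    q, x_star = 0.0, candidates[0]
    for _ in range(max_iter):
        x_star = max(candidates, key=lambda x: f(x) - q * h(x))
        if abs(f(x_star) - q * h(x_star)) < tol:
            break
        q = f(x_star) / h(x_star)
    return x_star, q

# Toy single-variable instance (all numbers illustrative):
f = lambda p: math.log2(1.0 + 50.0 * p)   # throughput-like numerator
h = lambda p: 2.6 * p + 0.1               # power-like denominator
grid = [i / 1000.0 for i in range(1, 251)]
p_opt, q_opt = dinkelbach(f, h, grid)
```

The sequence of `q` values is non-decreasing and the subtractive optimum shrinks to zero, which is what makes the subtractive reformulation (13) equivalent to the fractional problem (10).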

The rewritten problem P2 is a convex optimization problem, which can be solved by common convex optimization methods such as the Lagrange dual method. The Lagrange function of Eq. (13) is

$$L(P_i, \rho_i, \lambda_{1,i}, \lambda_{2,i}, \lambda_{3,i}, \lambda_{4,i}) = \sum_{i\in K}\left[\log_2\!\left(1+\frac{\rho_i P_i g_{i,i}}{\rho_i(I_{j,i}+N_A)+N_0}\right) - q_{ee}\left(\xi P_i + 2P_{cir}\right)\right]$$
$$+ \sum_{i\in K}\lambda_{1,i}\left[(1-\rho_i)\,\eta\left(P_i g_{i,i}+I_{j,i}\right)-E_{min}\right] + \sum_{i\in K}\lambda_{2,i}\left(P_i - P_{max}\right) + \sum_{i\in K}\lambda_{3,i}\left(\rho_i - 1\right)$$
$$+ \sum_{i\in K}\lambda_{4,i}\left[\log_2\!\left(1+\frac{\rho_i P_i g_{i,i}}{\rho_i(I_{j,i}+N_A)+N_0}\right)-R_{th}\right] \quad (14)$$

where $\{\lambda_1, \lambda_2, \lambda_3, \lambda_4\} \ge 0$ respectively represent the Lagrange multipliers of C1–C4. The dual problem of the Lagrange function (14) is:

$$\min_{\lambda_{1,i},\lambda_{2,i},\lambda_{3,i},\lambda_{4,i}}\ \max_{P_i,\rho_i}\ L(P_i, \rho_i, \lambda_{1,i}, \lambda_{2,i}, \lambda_{3,i}, \lambda_{4,i}) \quad (15)$$

The optimal $P_i$ and $\rho_i$ can be obtained from the Karush-Kuhn-Tucker (KKT) conditions:

$$P_i = \left[\frac{(1+\lambda_{4,i})\log_2 e}{q_{ee}\,\xi + \lambda_{2,i} - \lambda_{1,i}(1-\rho_i)\,\eta g_{i,i}} - \frac{N_0 + \rho_i(I_{j,i}+N_A)}{\rho_i g_{i,i}}\right]^{+} \quad (16)$$

$$\rho_i = \left\{\frac{-N_0(2A_0 + P_i g_{i,i})}{2A_0(A_0+P_i g_{i,i})} + \frac{\sqrt{N_0 P_i g_{i,i}\left(N_0 P_i g_{i,i} + \frac{4A_0(A_0+P_i g_{i,i})(1+\lambda_{4,i})\log_2 e}{\lambda_{1,i}\eta(P_i g_{i,i}+I_{j,i})+\lambda_{3,i}}\right)}}{2A_0(A_0+P_i g_{i,i})}\right\}^{+} \quad (17)$$

where $A_0 = I_{j,i} + N_A$ and $\{x\}^{+} = \max\{0, x\}$. The Lagrange multipliers $\lambda_{1,i}$, $\lambda_{2,i}$, $\lambda_{3,i}$ and $\lambda_{4,i}$ can be updated iteratively by the gradient descent method, that is:

$$\lambda_{1,i} = \left[\lambda_{1,i} - \alpha\left((1-\rho_i)\,\eta\left(P_i g_{i,i}+I_{j,i}\right)-E_{min}\right)\right]^{+},\ \forall i\in K \quad (18)$$
$$\lambda_{2,i} = \left[\lambda_{2,i} - \alpha\left(P_i - P_{max}\right)\right]^{+},\ \forall i\in K \quad (19)$$
$$\lambda_{3,i} = \left[\lambda_{3,i} - \alpha\left(\rho_i - 1\right)\right]^{+},\ \forall i\in K \quad (20)$$
$$\lambda_{4,i} = \left[\lambda_{4,i} - \alpha\left(\log_2\!\left(1+\frac{\rho_i P_i g_{i,i}}{\rho_i(I_{j,i}+N_A)+N_0}\right)-R_{th}\right)\right]^{+},\ \forall i\in K \quad (21)$$

where $\alpha$ is the step size, chosen to ensure convergence.


According to the above analysis, the cross-iteration algorithm that solves the overall optimization problem alternates two loops: for a fixed $q_{ee}$, the inner loop updates the primal variables $(P_i, \rho_i)$ by Eqs. (16)–(17) and the Lagrange multipliers by Eqs. (18)–(21) until they converge; the outer loop then updates $q_{ee}$ as in Eq. (12), and the iteration stops when $q_{ee}$ converges.
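A minimal sketch of such a cross-iteration for a single UE pair is given below; a plain grid search stands in for the closed-form KKT and dual updates of Eqs. (16)–(21), and all parameter values are illustrative assumptions, not the paper's settings:

```python
import math

def solve_link(q, g, I, NA, N0, eta, xi, Pcir, Pmax, Emin, Rth, grid=100):
    # Inner step for one UE pair at a fixed q: maximize the subtractive
    # objective log2(1 + SINR) - q*(xi*P + 2*Pcir) over (P, rho) subject
    # to C1-C4 of problem (13).  The grid search replaces the paper's
    # closed-form updates and is for illustration only.
    best, best_val = None, -math.inf
    for a in range(1, grid + 1):
        P = Pmax * a / grid
        for b in range(1, grid):
            rho = b / grid                      # C3: 0 < rho < 1
            R = math.log2(1 + rho * P * g / (rho * (I + NA) + N0))
            E = (1 - rho) * eta * (P * g + I)
            if E < Emin or R < Rth:             # C1 or C4 violated
                continue
            val = R - q * (xi * P + 2 * Pcir)
            if val > best_val:
                best, best_val = (P, rho), val
    return best

def cross_iterate(params, tol=1e-6, max_iter=50):
    # Outer Dinkelbach loop on the energy-efficiency parameter q_ee.
    q = 0.0
    P = rho = None
    for _ in range(max_iter):
        P, rho = solve_link(q, **params)
        R = math.log2(1 + rho * P * params["g"]
                      / (rho * (params["I"] + params["NA"]) + params["N0"]))
        q_new = R / (params["xi"] * P + 2 * params["Pcir"])
        if abs(q_new - q) < tol:
            return P, rho, q_new
        q = q_new
    return P, rho, q

# Illustrative single-pair instance (all values are assumptions, in watts):
params = dict(g=1e-4, I=1e-8, NA=1e-13, N0=1e-12, eta=0.7,
              xi=2.6, Pcir=0.05, Pmax=0.2, Emin=1e-6, Rth=5.0)
P_opt, rho_opt, q_opt = cross_iterate(params)
```

The returned pair is feasible by construction, and `q_opt` is the achieved link energy efficiency at convergence of the outer loop.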

4 Experimental Results and Analysis

4.1 Experimental Environment and Parameter Setting

To illustrate the feasibility and effectiveness of our strategy, this section evaluates the proposed strategy through simulation experiments. The simulation parameters follow the mm Wave channel and power consumption models of [7] and [16], and are set as follows: maximum transmit power of TX $P_{max}$ = 23 dBm, energy conversion efficiency η = 0.7, static circuit power consumption $P_{cir}$ = 50 mW, path loss exponents $α_L$ = 2 and $α_N$ = 4, amplifier efficiency ζ = 1/0.38, Gaussian white noise powers $N_0$ = −70 dBm and $N_A$ = −100 dBm, minimum harvested energy threshold $E_{min}$ = −14 dBm, and throughput threshold $R_{th}$ = 5 bit/s/Hz. In the simulation scenario, TX-RX links are randomly deployed in an area; the distances of each intended TX-RX link and of the interference links are 40 m and 80 m, respectively. In the following comparative analysis, each reported value is the average over 100 independent executions of the algorithm.


4.2 Performance Analysis and Comparison of Algorithms

To compare and analyze the performance of the proposed strategy, we take the traditional transmit power control algorithm as the baseline. The traditional energy-efficient transmit power control algorithm does not consider the dynamic joint optimization of the PS factor of SWITP. Therefore, referring to the comparison method in [7], we designed a power control algorithm with an equally divided PS factor (PC-E scheme) to compare with our strategy.

The first experiment analyzes the relationship between the maximum transmit power threshold and energy efficiency. In the experiment, the threshold $P_{max}$ is set to 50 mW, 100 mW, 150 mW, 200 mW and 250 mW. The comparison results are shown in Fig. 5. As $P_{max}$ increases, the link energy efficiency of both schemes also increases. This is because, within an appropriate range, a larger $P_{max}$ allows a larger transmit power at the TX, which increases the transmission throughput and improves the link energy efficiency. However, once $P_{max}$ exceeds 200 mW, although a higher TX transmit power can bring greater throughput, the energy consumption of the link also increases, so the link energy efficiency trends downward as $P_{max}$ increases further. It is therefore important to select an appropriate maximum transmit power threshold: to obtain good performance, the setting of $P_{max}$ must trade off throughput against energy consumption. In the later simulation experiments, $P_{max}$ is set to 200 mW.

Fig. 5. Illustration of the impact of the maximum transmission power

The second experiment compares and analyzes the impact of the minimum energy harvesting threshold on energy efficiency. In the experiment, the threshold $E_{min}$ is set to −20 dBm, −18 dBm, −16 dBm, −14 dBm and −12 dBm. The experimental results are shown in Fig. 6, and they show that the link energy efficiency decreases as the minimum energy harvesting threshold increases. This is because, by conservation of energy, when the signal power transmitted by the TX


remains unchanged, increasing the energy harvesting power leaves less power for information transmission, resulting in lower throughput. From the comparison results in Fig. 5 and Fig. 6, it can be seen that our strategy is better than the traditional power control scheme in energy efficiency. To further verify the effectiveness of our scheme, the third and fourth experiments adopt the comparison method of [17] and add a benchmark scheme of dynamic PS with maximum transmit power (denoted PS-max) as an additional comparison scheme.

Fig. 6. Illustration of the impact of the energy harvesting threshold

The third experiment compares the effect of the TX-RX link distance on the link energy efficiency under the three schemes. The experimental results are shown in Fig. 7. The energy efficiency of all three schemes decreases gradually as the TX-RX link distance increases, because the path loss between TX and RX grows with distance and the channel gain decreases accordingly. Nevertheless, the scheme in this paper remains better than the two comparison schemes: the joint dynamic optimization of the transmit power and the power splitting factor obtains an optimal SWIPT power splitting factor, and thereby achieves the optimal trade-off between link throughput and energy consumption.

The fourth experiment compares and analyzes the influence of the interference link distance on energy efficiency. In the experiment, the interference link distances are 60 m, 70 m, 80 m, 90 m and 100 m. The experimental results are shown in Fig. 8. The energy efficiency of all three schemes improves as the interference link distance increases, because a larger interference distance raises the SINR of the TX-RX link and, by Shannon's theorem, the link throughput. Therefore, for a fixed link energy consumption, the more bits transmitted, the greater the link energy efficiency. Under the same interference link distance, the performance of our scheme is better than that of the two comparison schemes. This is


Fig. 7. Illustration of the impact of the distance of TX-RX

Fig. 8. Illustration of the impact of the distance of the interference link

because our scheme better balances link throughput and energy consumption, thereby maximizing the energy efficiency.

5 Conclusions

This paper studies the energy efficiency optimization of mm Wave cooperative small cells under SWITP. Firstly, a system model of energy-limited UE pairing in an mm Wave small cell is constructed, and a SWITP-based optimal energy efficiency strategy for mm Wave cooperative communication small cells is proposed to maximize the link energy efficiency. To achieve the goal of green communication, under the joint constraints of a minimum link transmission rate and minimum energy harvesting, the strategy maximizes the link energy efficiency of the system by optimizing the transmit power control and the power splitting factor. As the original problem is a non-convex fractional programming


problem, the strategy uses the Dinkelbach method to transform the objective function into a convex optimization problem, which is then solved by the Lagrange dual method. The simulation results show that the proposed strategy outperforms the traditional power control method and the maximum transmit power method in optimizing the energy efficiency performance of the system.

Acknowledgment. This work was supported by the Guangxi Science and Technology Plan Project of China (No. AD20297125).

References

1. Clerckx, B., Zhang, R., Schober, R., Ng, D.W.K., Kim, D.I., Vincent Poor, H.: Guest editorial wireless transmission of information and power—part II. IEEE J. Sel. Areas Commun. 37(2), 249–252 (2019)
2. Deng, N., Haenggi, M.: The energy and rate meta distributions in wirelessly powered D2D networks. IEEE J. Sel. Areas Commun. 37(2), 269–282 (2019)
3. Luo, Y., Hong, P., Su, R., et al.: Resource allocation for energy harvesting-powered D2D communication underlaying cellular networks. IEEE Trans. Veh. Technol. 66(11), 10486–10498 (2017)
4. Yang, H.H., Lee, J., Quek, T.Q.S.: Heterogeneous cellular network with energy harvesting-based D2D communication. IEEE Trans. Wirel. Commun. 15(2), 1406–1419 (2016)
5. Giatsoglou, N., Ntontin, K., Kartsakli, E., Antonopoulos, A., Verikoukis, C.: D2D-aware device caching in mmWave-cellular networks. IEEE J. Sel. Areas Commun. 35(9), 2025–2037 (2017)
6. Niu, Y.Y., Liu, Y.L., Chen, X., Zhong, Z., Han, Z.: Device-to-device communications enabled energy efficient multicast scheduling in mmWave small cells. IEEE Trans. Commun. 66(3), 1093–1109 (2018)
7. Kuang, Z., Liu, G., Li, G., et al.: Energy efficient resource allocation algorithm in energy harvesting-based D2D heterogeneous networks. IEEE Internet Things J. 6(1), 557–567 (2019)
8. Zhai, D., Zhang, R., Jianbo, D., Ding, F.Z., Richard, Y.: Simultaneous wireless information and power transfer at 5G new frequencies: channel measurement and network design. IEEE J. Sel. Areas Commun. 37(1), 171–186 (2019)
9. Khan, T.A., Alkhateeb, A., Heath, R.W.: Millimeter wave energy harvesting. IEEE Trans. Wirel. Commun. 15(9), 6048–6062 (2016)
10. Tu, L.T., Di Renzo, M.: Analysis of millimeter wave cellular networks with simultaneous wireless information and power transfer. In: 2017 International Conference on Recent Advances in Signal Processing, Telecommunications & Computing, Da Nang, Vietnam. IEEE Press (2017)
11. Zhou, X., Guo, J., Durrani, S., et al.: Power beacon-assisted millimeter wave ad hoc networks. IEEE Trans. Wirel. Commun. 66(2), 830–844 (2017)
12. Khan, T.A., Heath, R.W.: Wireless power transfer in millimeter wave tactical networks. IEEE Sig. Process. Lett. 24(9), 1284–1287 (2017)
13. Dai, L., Wang, B., Peng, M., Chen, S.: Hybrid precoding-based millimeter-wave massive MIMO-NOMA with simultaneous wireless information and power transfer. IEEE J. Sel. Areas Commun. 37(1), 131–141 (2019). https://doi.org/10.1109/JSAC.2018.2872364
14. Wang, X., Jin, T., Hu, L., et al.: Energy-efficient power allocation and Q-learning-based relay selection for relay-aided D2D communication. IEEE Trans. Veh. Technol. 69(6), 6452–6462 (2020)


15. Yang, L., Xiong, K., Fan, P., Ding, Z., Zhong, Z., Letaief, K.B.: Global energy efficiency in secure MISO SWIPT systems with non-linear power-splitting EH model. IEEE J. Sel. Areas Commun. 37(1), 216–232 (2019)
16. Lee, K., Hong, J.-P., Seo, H., Choi, W.: Learning-based resource management in device-to-device communications with energy harvesting requirements. IEEE Trans. Commun. 68(1), 402–413 (2020)
17. Ding, H., Zhang, H., Tian, J., et al.: Energy efficient user association and power control for dense heterogeneous networks. In: 2018 International Conference on Computing, Networking and Communications, Maui, HI, USA. IEEE Press (2018)

Low Latency Execution Guarantee Under Uncertainty in Serverless Platforms

M. Reza HoseinyFarahabady1(B), Javid Taheri2, Albert Y. Zomaya1, and Zahir Tari3

1 School of Computer Science, Center for Distributed and High Performance Computing, The University of Sydney, Sydney, NSW, Australia
{reza.hoseiny,albert.zomaya}@sydney.edu.au
2 Department of Mathematics and Computer Science, Karlstad University, Karlstad, Sweden
[emailprotected]
3 School of Computing Technologies, RMIT University, Victoria, Australia
[emailprotected]

Abstract. Serverless computing recently emerged as a new run-time paradigm that relieves the client of the burden of provisioning physical computing resources, leaving that difficulty on the service provider's side. However, an unsolved problem in such an environment is how to cope with the challenges of executing several co-running applications while fulfilling the Quality of Service (QoS) level requested by all application owners. In practice, developing an efficient mechanism to reach the requested performance level (such as p-99 latency and throughput) is limited by the controller's awareness of the dynamics of the underlying platform (resource availability, performance interference among consolidated workloads, etc.). In this paper, we develop an adaptive feedback controller for coping with the buffer instability of serverless platforms when several collocated applications run in a shared environment. The goal is to support low-latency execution by managing the arrival event rate of each application when shared resource contention causes a significant throughput degradation among workloads with different priorities. The key component of the proposed architecture is the continuous management of server-side internal buffers for each application, providing a low-latency feedback control mechanism based on the requested QoS level of each application (e.g., buffer information) and the throughput of the worker nodes. The empirical results confirm the response stability for high priority workloads when a dynamic condition is caused by low priority applications. We evaluate the performance of the proposed solution with respect to the response time and the QoS violation rate for high priority applications in a serverless platform with four worker nodes set up in our in-house virtualized cluster. We compare the proposed architecture against the default resource management policy in Apache OpenWhisk, which is extensively used in commercial serverless platforms. The results show that our approach achieves a very low overhead (less than 0.7%) while it can improve the p-99 latency of high priority applications by 64%, on average, in the presence of dynamic high traffic conditions.

Keywords: Dynamic controller of computer systems · Serverless computing · Virtualized platforms · Quality of Service (QoS)

© Springer Nature Switzerland AG 2022 H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 324–335, 2022. https://doi.org/10.1007/978-3-030-96772-7_30

1 Introduction

Serverless computing, also known as function-as-a-service (FaaS) or lambda services, has become increasingly popular in recent years due to its unique pay-per-use business model. The new paradigm enables business owners to design and develop complex data-intensive applications by breaking them into more manageable functional units. The FaaS paradigm can also be exploited to execute a wide range of applications including, but not limited to, web services, information exchange systems, machine learning, data mining, and image and text processing [1–4]. Adaptive micro-service computing in the form of event streaming is the current trend of the FaaS (serverless) paradigm. However, extensive empirical evaluations have revealed that the resource management policies adopted by almost all commercial products can lead to long delays in the internal buffers of high priority applications (hence a degraded performance), particularly when a significant contention among consolidated workloads occurs across a shared environment (e.g., see [3,5–7]). When the backlog of unprocessed events grows beyond a predefined threshold, the FaaS platform suffers from high latency, and therefore a degraded performance perceived by the application end-users [4,8,9]. Based on our observations with several real workload benchmarks, the following inefficiencies are the main limiting factors for a proper deployment of low-latency computation. First, the lack of a congestion control mechanism to stabilize the throughput of the underlying hardware can lead to a high level of instability in the computation latency of some (if not all) applications that share a physical machine. Second, open-loop mechanisms, currently employed by almost all commercial products, introduce a significant delay and a fluctuating utilization level of computing resources, particularly when there is an abrupt change in the arrival rate of some applications.
Third, an inaccurate estimation of the arrival rate or of the performance degradation among collocated applications (usually due to random disturbance of the input variables) can significantly degrade the level of performance isolation among consolidated workloads inside a worker node, and therefore leads to a critical level of QoS violation incidents for high priority applications. In such contexts, it is vital to improve the operational efficiency of the underlying platform so that application requests are served as requested by the application owners. In practical scenarios, a feedback controller can be effectively employed by the service provider to provision the right amount of computing resources to each serverless application at run-time.


Most existing FaaS/serverless platforms are unaware of the time-line target value and the quality of service (QoS) requirements perceived by end-users. In fact, such platforms merely aim to enhance the average or a specific percentile of the query response time, or the average resource utilization of the underlying devices. As a result, the transit delay in the response time of each application, which is usually caused by the waiting time in the internal buffer of each functional unit, may significantly exceed the desired threshold value set by an application owner (i.e., a QoS violation incident occurs). Supporting the desired QoS enforcement level is challenging, since real-time events may arrive in bursts at any arbitrary rate (e.g., due to varying market demand or traffic status for a data science application in a financial context). Furthermore, the degree of shared resource contention among consolidated workloads may change over the course of their execution; this makes the problem of allocating computing resources to guarantee the QoS requirements even more challenging. To address such barriers, the main aim of this research work is to design a "feedback control" mechanism to support applications' QoS enforcement levels in FaaS platforms. Most of the existing open-source FaaS platforms, such as Dask [10] and Apache OpenWhisk [11], only aim to support fast processing of event-driven applications on-the-fly; they usually update the results of running processing units in a timely fashion once the corresponding events are triggered within a predefined interval. In this paper, we consider soft real-time serverless applications (such as those found in the finance sector) in which a processing delay not only may degrade the level of QoS achievement from the end-users' perspective, but may also yield a loss of revenue for the service provider.
If enough information about the worst-case execution time (WCET) or worst-case resource requirement (WCRR) of each submitted application were available, the results of classic schedulability theory could be employed to decide whether a given deadline constraint can be fulfilled. In such a case, a priority-based or deadline-based scheduling policy could be used to guarantee the timing constraints during the course of execution. Because in most practical cases such worst-case values cannot be derived at compile time, the platform may encounter under-utilization of computing resources. Our aim, in this paper, is to keep the delay in the internal buffer of each functional unit below a specific threshold (even in the presence of burst traffic), while the resource requirement of each submitted task is unknown a priori and may vary during execution (i.e., due to changes in the external load of each functional unit). The rest of this paper is organized as follows. Section 2 highlights the main challenges associated with fulfilling the timing constraints of application tasks in FaaS platforms with shared resources when there is uncertainty in the actual resource consumption and execution time of each functional unit. Section 3 presents the details of our proposed feedback control scheme. The performance of the predictive model controlling scheme is evaluated in Sect. 4. Finally, Sect. 5 concludes our work.
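To make the feedback-control idea concrete, the following toy controller keeps a fluid-model buffer near a target by feeding the observed throughput forward and correcting it with a proportional term on the buffer error; the class name, gains and numbers are all illustrative assumptions, not the controller actually proposed in this paper:

```python
class BufferRateController:
    # Toy admission-rate controller: feeds the observed worker throughput
    # forward and corrects it with a proportional term on the buffer-length
    # error, so the internal buffer tracks a target even though the true
    # service capacity is unknown a priori.
    def __init__(self, target_buffer, kp=0.5, max_rate=1000.0):
        self.target = target_buffer
        self.kp = kp
        self.max_rate = max_rate

    def next_rate(self, served, buffer_len):
        error = self.target - buffer_len
        rate = served + self.kp * error        # stable for 0 < kp < 2 here
        return min(max(rate, 0.0), self.max_rate)

# Fluid-queue simulation: the capacity mu is hidden from the controller,
# which only observes the served amount and the current buffer length.
ctrl = BufferRateController(target_buffer=50.0)
rate, buf, mu = 200.0, 0.0, 120.0
for _ in range(200):
    buf = max(0.0, buf + rate - mu)            # events in, mu served per tick
    rate = ctrl.next_rate(served=mu, buffer_len=buf)
```

In this simplified model the closed loop contracts the buffer error by a factor of (1 - kp) per tick, so the buffer settles at the target and the admitted rate settles at the (unknown) service capacity.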

Low Latency Execution Guarantees in Serverless Platforms

2 Problem Statement

In this section, the overall structure of the target platform and its execution plan are presented. We frame the performance optimization challenge in a FaaS platform as a resource allocation problem that must be adjusted dynamically in response to external events, while meeting quality of service (QoS) constraints. We also give a high-level description of the proposed feedback control approach for supporting the desired QoS performance of each submitted application.

2.1 FaaS Platform and Application Structure

An overall architecture of FaaS platforms can be described as follows. The FaaS paradigm enables application developers to represent the software architecture of a complex application by breaking it down into manageable functional units (FUs) [12]. Each functional unit responds to a series of events that might be triggered by external or internal event sources. We assume that the underlying platform runs a set of event-driven, CPU-intensive applications, denoted by Λ = {A_1, A_2, ...}. Each serverless application, A_j, can be modeled as a set of FUs, denoted by Λ_{A_j} = {F_1, F_2, ...}; each FU might be triggered by a set of predefined events. The set of all event sources that can trigger a particular F_j is denoted by E_{F_j} = {e_1, e_2, ...}. The main responsibility of a FaaS platform is to invoke the corresponding FUs once the triggering events occur [13]. The service provider can also choose to pack several FUs, which may belong to different QoS classes, into a single physical machine for execution. We further assume that there are m physical machines to which the controller can deploy a copy of an FU for execution in the next controlling interval.

2.2 Quality of Service Semantics

In this paper, we assume that the service provider of a FaaS platform can specify a certain number of service level agreements (SLAs) as quality of service (QoS) classes, where each QoS class identifies a commitment between the service provider and application owners as an agreed run-time performance target. In most event-driven applications, the response time of a service after the corresponding event is triggered can be considered as the main performance metric for a QoS class. The SLA target for such a metric is usually expressed as the 99th percentile of the application response time. We assume that the SLA contract defines exactly q different QoS classes, denoted by {Q_1, ..., Q_q}, from which an application owner can choose the requested performance target and get billed accordingly. Each QoS class Q_j stipulates a pair of values ⟨R*_j, P_{j,Δt}⟩, where R*_j denotes an upper bound on the attained response time to be fulfilled by the service provider during the course of execution, and P_{j,Δt} represents an upper bound on the percentage of QoS violations that is accepted by the end-user within an interval of length Δt (a similar semantic is defined and used by the authors in [14]).
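To make the semantic concrete, the following sketch (with illustrative names and thresholds, not taken from the paper) shows how a QoS class ⟨R*_j, P_{j,Δt}⟩ could be checked against one observation window of sampled response times:

```python
from dataclasses import dataclass

@dataclass
class QoSClass:
    """A QoS class <R*_j, P_{j,dt}>: a response-time upper bound and a
    tolerated violation fraction per observation window (illustrative names)."""
    r_star_ms: float   # upper bound on the attained response time (R*_j)
    p_max: float       # tolerated fraction of violations within a window

def violates_sla(qos: QoSClass, response_times_ms: list) -> bool:
    """Check one window of sampled response times against the SLA target."""
    if not response_times_ms:
        return False
    violations = sum(1 for t in response_times_ms if t > qos.r_star_ms)
    return violations / len(response_times_ms) > qos.p_max

# e.g., "99% of events must finish within 200 ms" per window
q1 = QoSClass(r_star_ms=200.0, p_max=0.01)
```

A window of 100 samples with two slow responses (2% > 1%) would then count as a QoS violation incident for this class.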


M. R. HoseinyFarahabady et al.

One of the key challenges in guaranteeing an absolute service delay in a FaaS platform is to find a resource allocation solution that achieves the desired delay for submitted applications belonging to different QoS classes, even in the presence of varying load conditions that are unknown a priori [15]. Another challenge in the context of the resource management problem is how to bridge the levels of abstraction (such as functional units, delay, internal buffer, stability conditions, and the arrival rate) to formulate and solve an optimal control problem. The main contribution of this paper is that we formulate and solve such resource management problems in a dynamic environment by employing the design principles of control theory. Using a feedback loop, our approach provides the aforementioned delay guarantee for a FaaS platform with multiple QoS levels when the underlying system exhibits dynamic behavior. Furthermore, we employ results from queueing theory to predict the statistical properties of the internal buffers of each software component.

3 Design Approach

In this section, we formally introduce the steps to design a feedback controller that supports the desired QoS enforcement bounds in the presence of a dynamic workload in a serverless platform.

3.1 Main Components

The architecture of the proposed feedback controller can be described as follows. It consists of a rate estimator to predict the future rate of arrival events for each FU, a system model to represent the behavior of the underlying dynamical system (here, to estimate the number of unprocessed events in the internal buffer of each FU), an optimization component, and a target FaaS platform that consists of serverless working nodes to execute the submitted scripts (Fig. 1).

Fig. 1. An overall structure of the proposed feedback controller running across a FaaS/serverless platform with multiple worker nodes

The feedback controller is designed based on the principles of model predictive control (MPC) theory, which is used to control the underlying system components while satisfying a set of predefined performance constraints. It relies on dynamic system models that are obtained by system identification techniques based on empirical results. One of the biggest advantages of using MPC in nonlinear systems is that it produces a robust, near-optimal solution that tolerates erroneous values in the prediction or system models. Such robustness is achieved by optimizing the target system variables over a finite time horizon while taking future system states into account [16]. The controller applies only one step of the control action, and then repeatedly re-optimizes the entire process in the next interval by considering the current and future states of the involved components. The actuator employs Linux's built-in control groups (cgroups), a resource allocation mechanism that limits the amount of resources available to each FU in the next controlling interval.
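As a concrete illustration of the actuator, the sketch below formats and writes a CPU quota through the cgroup-v2 `cpu.max` interface. The paper does not state the cgroup version or filesystem layout, so the path and interface here are assumptions (a cgroup-v1 deployment would write `cpu.cfs_quota_us`/`cpu.cfs_period_us` instead):

```python
from pathlib import Path

def cpu_max_line(cpu_share: float, period_us: int = 100_000) -> str:
    """Format a cgroup-v2 'cpu.max' value granting `cpu_share` CPUs worth of
    time, e.g. 0.5 CPUs -> '50000 100000' (quota and period in microseconds)."""
    quota_us = int(cpu_share * period_us)
    return f"{quota_us} {period_us}"

def apply_cpu_cap(cgroup: str, cpu_share: float,
                  root: str = "/sys/fs/cgroup") -> None:
    """Write the computed quota to the FU's cgroup (requires privileges;
    the per-FU cgroup layout assumed here is illustrative)."""
    Path(root, cgroup, "cpu.max").write_text(cpu_max_line(cpu_share))
```

For example, `cpu_max_line(0.5)` yields the string `"50000 100000"`, which caps the FU at half of one core per scheduling period.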

3.2 Monitoring and System Model

The monitor component is invoked at each sampling interval τ to compute the average arrival rate, the number of unprocessed events in the internal buffer of each FU, and their service time during the last sampling period. Such information is used by the optimizer to estimate the arrival rate for each FU and to compute the new processing budget for the forthcoming interval. We employ a classical autoregressive moving average (ARMA) model to predict the arrival rate of incoming events to each FU, denoted by λ_{j,τ}, as a linear function of the past observations and the forecast errors at the prior H intervals. The ARMA model with parameters K, φ, and θ can be formally defined as follows.

$$\lambda_{j,\tau} = K_j + \sum_{h=1}^{H} \phi_{j,\tau-h}\,\lambda_{j,\tau-h} + \sum_{h=1}^{H} \theta_{j,\tau-h}\,\varepsilon_{j,\tau-h} \qquad (1)$$

Here, ε_{j,τ} is an uncorrelated innovation process with zero mean representing the past forecast errors, λ_{j,τ−h} are the past observations of the arrival rate [17], and H ≥ 1 is the order of the ARMA predictor. A higher-order ARMA model is more accurate, but it requires more complex computation as the number of submitted applications in a given host grows. To design an effective feedback control system, it is essential to predict the system performance dynamics when the incoming workload changes. We developed a simple model to capture the relation between the "queue size" and the "delay" perceived by each service. Such a model can be used by the optimizer module to bound the number of unprocessed events in the internal buffer of each application. We employed the Allen-Cunneen approximation for the G/G/m queue [18] to estimate the average time each event waits before being processed by the corresponding FU, as stated below.

$$W_m = \frac{P_{cb,m}}{\mu\, m\,(1-\rho)} \cdot \frac{C_s^2 + C_d^2}{2} \qquad (2)$$

Here, W_m represents the waiting time experienced by each unprocessed event when both the inter-arrival and the service times follow general distributions; m is the number of concurrently running instances of the corresponding FU; ρ represents the monitored utilization of the computing resource; C_d and C_s represent the coefficients of variation of the inter-arrival and service times, respectively; and P_{cb,m} represents the probability that all m instances are fully utilized and no more events can be processed in this interval.
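Equations (1) and (2) can be evaluated directly. The sketch below is a minimal rendering of both; the parameter values in the example are illustrative, and in practice the ARMA coefficients would be fitted from the monitored history:

```python
def arma_forecast(k, phi, theta, past_rates, past_errors):
    """One-step ARMA forecast of the arrival rate (Eq. 1): the constant K_j
    plus weighted past observations plus weighted past forecast errors."""
    assert len(phi) == len(theta) == len(past_rates) == len(past_errors)
    return (k
            + sum(p * lam for p, lam in zip(phi, past_rates))
            + sum(t * e for t, e in zip(theta, past_errors)))

def allen_cunneen_wait(mu, m, rho, c_s2, c_d2, p_cb):
    """Allen-Cunneen G/G/m waiting-time approximation (Eq. 2):
    W_m = P_cb,m / (mu * m * (1 - rho)) * (C_s^2 + C_d^2) / 2."""
    return p_cb / (mu * m * (1.0 - rho)) * (c_s2 + c_d2) / 2.0

# Illustrative numbers: an order-1 predictor, 4 instances at 80% utilization.
rate = arma_forecast(100.0, [0.6], [0.2],
                     past_rates=[120.0], past_errors=[10.0])
wait = allen_cunneen_wait(mu=10.0, m=4, rho=0.8,
                          c_s2=1.0, c_d2=1.0, p_cb=0.55)
```

The optimizer can compare `wait` against the class target R*_j to decide whether more instances or a larger CPU share are needed in the next interval.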

3.3 Optimizer

Once a new FaaS script is submitted, the optimizer component selects a working node to run the submitted script. The optimizer calculates the amount of performance degradation that would be experienced by previously allocated applications in each candidate host, and then chooses the host that minimizes such degradation among all possible allocation decisions. It also ensures that the total computing requests do not exceed the processor capacity of any working node. At each sampling interval τ, the controller compares the sampled delay of each submitted application, denoted by y_{j,τ}, to the desired absolute delay of the corresponding QoS class, denoted by R*_j. Based on the error value, e_{j,τ} = |R*_j − y_{j,τ}|, the optimizer computes the computing budget (i.e., the CPU share) to be allocated to each F_j. This value is used by the progressive actuator to (re)allocate the processing budget of each running process in the target host. Although the main goal of the optimizer is to reach the desired response time for applications in different QoS classes, the controller must also provide a robust solution. That is, it should be able to effectively handle changes in the incoming workload, as the arrival traffic rate of each application is usually unknown and can change over time. Because of this robustness requirement, we selected the model predictive control approach to determine the appropriate CPU share for each FU. In particular, the MPC optimization module performs a series of actions at every controlling interval, denoted by τ ∈ {T_1, T_1 + ΔT, ...}, which are highlighted as follows.
– The monitoring module gathers a sample of the non-processed events in the internal buffer of every FU to estimate an upper bound on the queuing delay of each application within the next controlling intervals.
– The optimizer calculates the required processor share to be allocated to every F_j such that its response time in the future T_ref intervals brings the performance error of the output response, e_{j,τ+T_ref}, to zero.
– In case the entire computing resource demand exceeds the available capacity of such resources, the optimizer performs a cost-benefit analysis (CBA) to determine a near-optimal allocation of computing resources that minimizes the rate of QoS violation incidents across the entire platform.
– Once the optimizer resolves a feasible allocation of processing capacity to each FU, the progressive actuator applies one step of the updating action to the current CPU share of the FUs by considering the response speed factor (T_ref). Having a value greater than one for T_ref guarantees a robust performance output even in the presence of errors in the workload prediction or the system performance model.

Low Latency Execution Guarantees in Serverless Platforms

331

– Finally, in the next controlling interval, the entire cycle of monitoring (as a feedback loop), modeling, and optimization is repeated.
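The per-interval cycle above can be condensed into a small skeleton. The proportional update below is an illustrative stand-in for the full MPC optimization (the paper's optimizer solves a constrained problem; the gain, share bounds, and signed-error convention here are our assumptions):

```python
def mpc_step(cpu_share, measured_delay, target_delay, t_ref=4,
             gain=0.1, min_share=0.05, max_share=8.0):
    """One actuation step: drive the delay error e = measured - target toward
    zero over t_ref future intervals, applying only 1/t_ref of the required
    correction now (t_ref > 1 damps the reaction to prediction errors)."""
    error = measured_delay - target_delay        # positive -> FU is too slow
    new_share = cpu_share + gain * error / t_ref
    return max(min_share, min(max_share, new_share))

def control_cycle(fus, measure, predict_rate, actuate):
    """Monitor -> predict -> optimize -> actuate, repeated every interval."""
    for fu in fus:
        delay = measure(fu)                      # sampled queueing delay
        fu["rate"] = predict_rate(fu)            # e.g., the ARMA estimate
        fu["share"] = mpc_step(fu["share"], delay, fu["target"])
        actuate(fu)                              # e.g., write the cgroup cap
```

In a real deployment the `actuate` callback would perform the cgroup update and the per-host capacity check would run before the new shares are applied.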

3.4 Cost-Benefit Analysis (CBA)

To optimally exploit the cost-effectiveness of the available computing resources, we develop a model to promptly adjust the resource allocation in response to fluctuating workloads, based on estimates of the total demand, the queuing delay, and the projected QoS violation rate of each application. Such a process is able to allocate more computing resources prior to the occurrence of a high-volume or highly resource-demanding workload. In a virtualized environment, the computing resources can be dynamically provisioned and managed by leveraging the estimation of the resources requested by different QoS classes in the forthcoming intervals (e.g., by exploiting the characteristics of the traffic patterns of each application using Eq. (1)). We developed a simple CBA method to significantly reduce the operational costs without compromising the level of quality of service. Such factors are impacted by how a serverless platform manages the available computing resources in the presence of high incoming traffic, and therefore having a set of appropriate tools to optimize this process is an important differentiator. In the following, we present the proposed mechanism by modeling the resource allocation burden as a profit maximization problem. We also developed a dynamic programming method to find solutions in a reasonable amount of time. We use C^Σ_τ to denote the sum of the requested CPU caps demanded by all submitted applications at interval τ. In the same manner, we use C_{j,τ} to denote the processing demand requested by a specific FU F_j at τ. Moreover, let U* denote the maximum processing capacity available in the entire FaaS platform (which depends on the number of working hosts). Our assumption for employing the CBA during a given interval is that C^Σ_τ ≥ U*. The CBA is stated as a reward function, denoted by R, to be maximized when only a partial fulfillment of the requested resource capacity of an FU F_j is possible. The reward function is formulated as follows.
$$R_{j,\tau}(r) = (C_{j,\tau} - r_{j,\tau}) \times I_{q_j} \qquad (3)$$

Here, r_{j,τ} is the partially fulfilled portion of F_j's resource request. In Eq. (3), I_{q_j} denotes a constant factor representing the importance of F_j relative to other FUs that might belong to different QoS classes. The objective function, which maximizes the total contribution received by the service provider, is formulated as follows.

$$\max_{r} \sum_{F_j \in \Lambda} R_{j,\tau}(r) \qquad (4)$$

subject to the obvious constraints of resource availability at any given time. We developed a dynamic programming approach to find a near-optimal solution for the above-mentioned optimization problem. In particular, we only allow the values of the partial resource allocation to be taken from a discrete bracket, that is, r_{j,τ} ∈ D = {5%, 10%, ..., 100%} × U*_m in every working machine m. Then we can develop the Bellman equation of the sub-optimization problem as follows.

$$V_\omega(R_\omega) = \max_{0 \le r_\omega \le R_\omega} \left[\, V_{\omega+1}(R_\omega - r_\omega) + R_{j,\tau}(r_\omega) \,\right] \qquad (5)$$

where V_ω(·) denotes the optimal reward of allocating R_ω resources among all not-yet-allocated FUs.
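The dynamic program of Eq. (5) can be sketched as a discrete allocation over the bracket D. Note that, read literally, maximizing the shortfall-based reward of Eq. (3) would favor allocating nothing; the sketch below therefore maximizes the importance-weighted fulfilled share (the complementary reading of the same quantity), which is an assumption on our part:

```python
from functools import lru_cache

def cba_allocate(demands, importances, capacity, step=5):
    """DP sketch of the CBA allocation (cf. Eqs. 3-5). Assumption: we maximize
    the importance-weighted fulfilled share, sum(r_j * I_j), with each r_j
    drawn from a discrete bracket {0, step, 2*step, ...}, capped by the FU's
    demand C_j and by the remaining capacity. Returns (reward, allocations)."""
    n = len(demands)

    @lru_cache(maxsize=None)
    def best(j, remaining):
        # V_w(R_w) = max over 0 <= r <= R_w of V_{w+1}(R_w - r) + reward(r)
        if j == n:
            return 0.0, ()
        best_val, best_alloc = -1.0, ()
        r = 0
        while r <= min(demands[j], remaining):
            tail_val, tail_alloc = best(j + 1, remaining - r)
            val = tail_val + r * importances[j]
            if val > best_val:
                best_val, best_alloc = val, (r,) + tail_alloc
            r += step
        return best_val, best_alloc

    value, alloc = best(0, capacity)
    return value, list(alloc)
```

For two FUs with demands (20, 30), importances (3, 1), and a capacity of 40, the scheme fully serves the important FU and gives the remainder to the other one.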

4 Performance Evaluation

To evaluate the performance of the proposed controlling mechanism, we implemented the proposed solution as a proxy tier in the latest version of Apache OpenWhisk (version 20.11) running on our in-house cluster consisting of four nodes, each equipped with an Intel i7-7700 CPU with 8 cores and 64 GB of main memory. The proposed approach is evaluated against the default policy of OpenWhisk. The application test cases are chosen from a set of functional workloads from CloudSuite [19] in the category of web services (WS). We conducted experiments with different load traffic patterns by varying the number of HTTP requests per second and the probability distribution from which the incoming traffic is drawn (i.e., Poisson and Weibull distributions). The average number of triggered events per FU varies in the range of λ_j ∈ [1000, 5000] requests per minute. Each class is defined by a set-point value for the 99th percentile response time over a period of one second. The set-point values for each QoS class are configured such that the available capacity of computing resources can only fulfill the response-time targets of the highest-priority applications (Q_1). By continuously monitoring the actual response time of applications in each QoS class, we can evaluate the ability of the controller to contain the total QoS violation rate under the dynamic workload incurred by low-priority applications.
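The load traffic patterns described above (Poisson or Weibull arrivals at a target mean rate) can be generated with the standard library; the Weibull shape parameter and the seed below are illustrative choices, not values from the paper:

```python
import math
import random

def interarrival_gaps(n, rate_per_min, dist="poisson", shape=1.5, seed=7):
    """Generate n inter-arrival gaps (seconds) with a target mean rate.
    Poisson arrivals have exponentially distributed gaps; Weibull gaps give
    burstier (shape < 1) or more regular (shape > 1) traffic."""
    rng = random.Random(seed)
    mean_gap = 60.0 / rate_per_min
    if dist == "poisson":
        return [rng.expovariate(1.0 / mean_gap) for _ in range(n)]
    # Weibull mean is scale * Gamma(1 + 1/shape); rescale to hit mean_gap.
    scale = mean_gap / math.gamma(1.0 + 1.0 / shape)
    return [rng.weibullvariate(scale, shape) for _ in range(n)]
```

A load generator would then sleep for each gap before firing the next HTTP trigger, so that e.g. `rate_per_min=3000` yields a mean gap of 20 ms.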

4.1 Result Summary

Plots in Fig. 2 show the rate of QoS violation incidents for applications in different priority classes as we increase the total number of applications from 64 to 512. This performance metric reflects how well the proposed controller can satisfy the requested service level agreement compared to the results obtained by applying the default policy of OpenWhisk. Results show that the default policy evenly allocates the computing capacity in a round-robin fashion, which, in turn, causes a significant QoS violation rate for applications in high-priority classes (Q_1 and Q_2). By contrast, our proposed controller can dynamically identify and prevent a high violation rate for Q_1 and Q_2 applications. On average, the reduction in the QoS violation rate for Q_1 and Q_2 applications using the proposed controller is 64% and 51%, respectively. Plots in Fig. 3 show the attained processor utilization of FUs belonging to different QoS classes as the total number of applications in each QoS class is


Fig. 2. The QoS violation rate experienced by applications in diﬀerent QoS classes. The target response time of each QoS class is set to a value such that the available resources can only satisfy the demands from high priority applications. The total number of submitted applications varies from 64 to 512.

Fig. 3. Aggregated processor utilization of applications belonging to diﬀerent QoS classes as the number of applications in each QoS class increases.

increased. Results confirm that the aggregated processor utilization for applications in Q_1 and Q_2 is significantly enhanced by applying the proposed feedback controller compared to the results of the default policy in OpenWhisk. The normalized value of such improvements is 36% and 24% for Q_1 and Q_2 applications, respectively. The reason for these improvements is that the proposed controller uses the CBA to allocate a higher share of the available processor capacity to Q_1 and Q_2 FUs, while preventing any host from operating near its saturation point. Results also confirm that the utilized processor capacity is mostly consumed to effectively fulfill the target performance of high-priority applications. Improving this parameter can significantly enhance the service provider's revenue by decreasing the wasted utilization of computing resources, as well as improving end-users' satisfaction.

4.2 Computational Overhead

We measured the overhead incurred by performing the different steps of monitoring, prediction, and solving the optimization problem with the dynamic programming approach. Table 1 lists this overhead as the total number of applications increases up to N = 512. Results show that the fraction of such overhead remains below 0.7% of the controlling interval length (1 s).


Table 1. Computational overhead as the total number of applications increases.

N    Overhead [Sec.]
64   0.07
128  0.17
512  0.68

5 Conclusion

Serverless technology is a recent computing paradigm that allows developers to enjoy automatic scaling and high availability for running scripts without the burden of infrastructure management. Developing a QoS-aware resource allocation mechanism for serverless computing platforms has drawn significant attention in recent years. In this paper, we developed a QoS-aware resource controller that can guarantee the response time of event-driven applications, while mitigating the performance isolation problem experienced by high-priority applications in a platform with shared resources. The experimental results using an in-house OpenWhisk cluster with four nodes confirm the effectiveness of the proposed solution when coping with modern workloads inspired by web services applications. In particular, the proposed solution can reduce the overall QoS violation rate for high-priority applications by 64% on average.

Acknowledgment. Prof. Albert Y. Zomaya acknowledges the support of the Australian Research Council Discovery scheme (DP190103710). Prof. Javid Taheri would like to acknowledge the support of the Knowledge Foundation of Sweden through the AIDA project. Prof. Zahir Tari would like to acknowledge the support of the Australian Research Council (grant DP200100005). Dr. MohammadReza HoseinyFarahabady acknowledges the continued support and patronage of The Center for Distributed and High Performance Computing at The University of Sydney, NSW, Australia, for giving access to advanced high-performance computing platforms, industry-leading cloud facilities, machine learning (ML) and analytics infrastructure, digital IT services, and other necessary tools.

References

1. Menascé, D.A., Almeida, V.A.F., Riedi, R., Ribeiro, F., et al.: Hierarchical and multiscale approach to analyze e-business workloads. Perform. Eval. 54, 33–57 (2003)
2. Poccia, D.: AWS Lambda in Action: Event-Driven Serverless Applications. Simon and Schuster (2016)
3. Sbarski, P., Kroonenburg, S.: Serverless Architectures on AWS: With Examples Using AWS Lambda. Simon and Schuster (2017)
4. Kim, Y.K., HoseinyFarahabady, M.R., Lee, Y.C., Zomaya, A.Y.: Automated fine-grained CPU cap control in serverless computing platform. IEEE Trans. Parallel Distrib. Syst. 31(10), 2289–2301 (2020)


5. Schad, J., Dittrich, J., et al.: Runtime measurements in the cloud: observing, analyzing, and reducing variance. Proc. VLDB Endow. 3, 460–471 (2010)
6. Wang, H., et al.: A-DRM: architecture-aware distributed resource management of virtualized clusters. In: ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, pp. 93–106 (2015)
7. Shuai, Y., Petrovic, G., Herfet, T.: OLAC: an open-loop controller for low-latency adaptive video streaming. In: 2015 IEEE International Conference on Communications (ICC), pp. 6874–6879 (2015)
8. Taheri, J., Zomaya, A.Y., Kassler, A.: A black-box throughput predictor for VMs in cloud environments. In: European Conference on Service-Oriented and Cloud Computing, pp. 18–33. Springer (2016). https://doi.org/10.1007/978-3-319-44482-6_2
9. Al-Dulaimy, A., Taheri, J., Kassler, A., HoseinyFarahabady, M.R., Deng, S., Zomaya, A.: MULTISCALER: a multi-loop auto-scaling approach for cloud-based applications. IEEE Trans. Cloud Comput. (2020)
10. NumFOCUS: Dask: advanced parallelism for analytics, enabling performance. https://dask.org/ (2021)
11. Apache Software Foundation: OpenWhisk: open source serverless cloud platform. https://openwhisk.incubator.apache.org (2021)
12. Kim, Y.K., HoseinyFarahabady, M.R., Lee, Y.C., Zomaya, A.Y., Jurdak, R.: Dynamic control of CPU usage in a lambda platform. In: 2018 IEEE International Conference on Cluster Computing (CLUSTER), pp. 234–244 (2018)
13. HoseinyFarahabady, M.R., Zomaya, A.Y., Tari, Z.: MPC for managing QoS enforcements and microarchitecture-level interferences in a lambda platform. IEEE Trans. Parallel Distrib. Syst. 29(7), 1442–1455 (2018)
14. HoseinyFarahabady, M.R., Tari, Z., Zomaya, A.Y.: Disk throughput controller for cloud data-centers. In: International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 404–409 (2019)
15. HoseinyFarahabady, M.R., Taheri, J., Tari, Z., Zomaya, A.Y.: A dynamic resource controller for a lambda architecture. In: 2017 46th International Conference on Parallel Processing (ICPP), pp. 332–341 (2017)
16. Rawlings, J., Mayne, D.Q., Diehl, M.M.: Model Predictive Control: Theory, Computation, and Design. Nob Hill Publishing, Madison, Wisconsin (2017)
17. Box, G., et al.: Time Series Analysis: Forecasting and Control. Wiley (2008)
18. Allen, A.O.: Probability, Statistics, and Queueing Theory. Academic Press, Cambridge (1990)
19. Ferdman, M., Adileh, A., et al.: Clearing the clouds: a study of emerging scale-out workloads on modern hardware. In: Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 37–48. ACM (2012)

High Resolution Patient-Specific Blood Flow Simulation in a Full-Size Aneurysmal Aorta Based on a Parallel Two-Level Method

Jie Zhou1, Jing Li1, Shanlin Qin2(B), and Rongliang Chen2(B)

1 School of Mathematics and Statistics, Changsha University of Science and Technology, Changsha 410014, China
2 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
{sl.qin,rl.chen}@siat.ac.cn

Abstract. An accurate and efficient blood flow simulation in patient-specific arteries is instructive for the diagnosis and treatment of various vascular diseases; it is, however, computationally challenging because of the complicated geometry of the artery and the turbulence in the blood flow. In this work, we introduce a parallel, scalable, two-level additive Schwarz method for the fast solution of the Navier-Stokes equations in a patient-specific, full-size aorta with aneurysms. Distributions of the hemodynamics, such as the pressure, velocity, and wall shear stress, are presented and analyzed. The algorithm is studied with a focus on its robustness against different values of the model parameters and on its parallel scalability. The results show that the proposed method can robustly solve large and complicated simulation problems with over 25 million unstructured elements using over 5000 processors on a supercomputer. Keywords: Aortic aneurysm · Blood flow simulation · Parallel computing · Newton-Krylov-Schwarz · Two-level additive Schwarz method

1 Introduction

Blood flow simulation has been used to investigate the hemodynamics of vascular diseases, such as stenosis, dissection, and aneurysm. However, an accurate and efficient description of the flow field is computationally challenging due to the complexity of the geometry and the large scale of the problem, which requires the development of robust and efficient parallel numerical methods [16].

This work is financially supported by the NSFC (Grant No. 11801543 and 12071461) and the Shenzhen grant (Grant No. JCYJ20190806165805433 and RCYX20200714114735074). Jing Li is supported by the Hunan Provincial Natural Science Foundation of China (2021JJ30697) and the Scientific Research Project of the Hunan Provincial Office of Education (20A022).

© Springer Nature Switzerland AG 2022
H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 336–348, 2022. https://doi.org/10.1007/978-3-030-96772-7_31


The Newton-Krylov method is a powerful method for solving nonlinear systems, which adopts a Newton-type method for handling the nonlinear equations and a Krylov subspace method for solving the linear system at each Newton step to obtain the Newton search direction. However, the Krylov method, whose convergence rate depends on the condition number of the matrix, often fails to converge, or converges very slowly, for large or complicated problems such as the one considered in this paper. One efficient way to accelerate the Krylov subspace method is preconditioning, that is, to design a preconditioner that reduces the condition number of the matrix before the Krylov subspace method is applied. Many preconditioning techniques have been studied for blood flow simulations, such as the dual-threshold incomplete LU factorization and the incomplete block-LU factorization [2]. A Newton-Krylov method preconditioned with additive Schwarz methods, which is called Newton-Krylov-Schwarz (NKS), has recently been studied for the blood flow simulation in the cerebral artery [9] and the abdominal aorta [12]. The performance of the NKS method depends largely on the effectiveness of the preconditioner, especially when using a large number of processor cores. In this work, we introduce a two-level Schwarz preconditioner for the NKS method and simulate the blood flow in a full-size aorta with aneurysms. The two-level Schwarz preconditioner has been applied to many problems, such as fluid-structure interaction [6], multigroup neutron diffusion [7], elastic crack analysis [4], and porous media [8]. Most of these works adopt a pair of nested meshes, since the interpolation and restriction matrices between the coarse and fine meshes can then be easily obtained. However, nested meshes are difficult to generate, especially for a computational domain with complex structures [3]. Therefore, we consider non-nested meshes, where the coarse and fine meshes are independently generated and the interpolation is achieved by using radial basis functions. This method has been used to solve linear systems [1], the Poisson equation [13], coupled PDE systems [15], and so on. In our previous work, the two-level overlapping Schwarz algorithm was used to simulate the blood flow in a cerebral artery with stenoses and achieved good strong scalability [3]. In this work, we use it to simulate the blood flow in a full-size aorta with aneurysms and further study the performance of the algorithm. In particular, the performance of the algorithm is comprehensively studied by testing the strong and weak scalability and by investigating the influence of the subdomain overlap size and the fill-in level of the incomplete factorization used as the subdomain solver. We also report the robustness of the algorithm against different values of the model parameters, such as the viscosity, resistance, and compliance, which may differ among various diseases. The rest of this paper is organized as follows. In Sect. 2, we introduce the 3D artery geometry and the mesh used in the simulation, followed by a detailed introduction of the two-level NKS method. In Sect. 3, we present some results on the hemodynamics of the aneurysmal aorta and study the numerical performance of the algorithm with respect to its robustness and scalability. Some concluding remarks are drawn in Sect. 4.

2 Methodology

2.1 Image Segmentation and Mesh Generation

As shown in Fig. 1, the geometry of a full-size aorta, from the ascending aorta to the iliac arteries, is reconstructed from the CT image by using the software Mimics (Materialise, Leuven, Belgium). The geometry has 1 inlet at the ascending aorta and 13 outlets at the major branch vessels, including the common carotid artery, the brachiocephalic artery, the left subclavian artery, the common hepatic artery, the splenic artery, the superior mesenteric artery, the left and right renal arteries, and the left and right common iliac arteries. There are three aneurysms located in the aortic arch and the right and left common iliac arteries, marked as dashed squares 1, 2, and 3 in the left of Fig. 1, respectively. A coarse mesh of 68,506 and a fine mesh of 13,902,281 tetrahedral elements are generated independently to cover the geometry by using the commercial software ICEM (ANSYS, Canonsburg, Pennsylvania), as shown in the enlarged views in the right of Fig. 1. It can be seen that the size of the elements in the coarse mesh (red) is larger than that in the fine mesh (blue), and that the nodal points are not nested, since the meshes are generated independently. The mesh is critical to the accuracy of the numerical results, and its generation includes the following main steps: (1) import the geometry into ICEM and create parts for the wall, the inlet, and the outlets to assign different boundary conditions; (2) set a global mesh size for the overall meshing and adjust local mesh sizes for different parts; and (3) create an unstructured mesh of tetrahedral elements to cover the whole domain and export it after a check of the mesh quality. The mesh is partitioned into non-overlapping subdomains by ParMETIS, which ensures that the number of elements assigned to each processor is roughly balanced.

Fig. 1. The geometry, meshes and boundary conditions of the aorta with major branch vessels and aneurysms. (Color ﬁgure online)

2.2 Governing Equation and Boundary Conditions

The blood flow is considered as an incompressible Newtonian fluid governed by the following Navier-Stokes equations [11],

$$\rho\left(\frac{\partial \boldsymbol{u}}{\partial t} + (\boldsymbol{u}\cdot\nabla)\boldsymbol{u}\right) + \nabla p - \mu\Delta \boldsymbol{u} = 0 \ \text{ in } \Omega\times(0,T], \qquad \nabla\cdot\boldsymbol{u} = 0 \ \text{ in } \Omega\times(0,T], \qquad (1)$$

where ρ and μ are the density and the dynamic viscosity of the blood, and u and p are the velocity vector and the pressure to be solved, respectively. Dirichlet and no-slip boundary conditions are imposed on the inlet Γ_I and the wall Γ_W as follows,

$$\boldsymbol{u} = \boldsymbol{v}_I \ \text{ on } \Gamma_I\times(0,T], \qquad \boldsymbol{u} = 0 \ \text{ on } \Gamma_W\times(0,T],$$

where v_I is a pulsatile velocity waveform obtained from the patient-specific clinical measurement in the ascending aorta, as shown in Fig. 1. A three-element Windkessel model is applied at each outlet to account for the impact of the downstream vasculature, which is governed by the following equation [5],

$$P_i(t) + R_i' C_i \frac{dP_i(t)}{dt} = (R_i + R_i')\,Q_i(t) + P_i^b(t) + R_i R_i' C_i \frac{dQ_i(t)}{dt},$$

where Ri , Ri , Ci , Pi (t) and Qi (t) are the resistances, the compliance, the pressure and the ﬂow rate at the ith outlet respectively, as shown in Fig. 1. Pib (t) is the pressure at the downstream vasculature. As given in [3], the analytic solution of this equation is Pi (t) = Ri Qi (t) + Pi (0) − Ri Qi (0) e−t/τi +

t

e−(t − s)/τ Qi (s)ds, Ci

Ri Ci ;

where τi = Pi (0) and Qi (0) are the initial pressure and ﬂow rate at the ith outlet. Here, the distal pressure Pib (t) is assumed to be 0. During the calculation, a total resistance RT and total compliance CT will be introduced, whose values are manually adjusted so that the obtained diastolic and systolic pressures at the inlet match the clinically measured values. Then, RT and CT are split to each outlet for the values of Ri and Ci by the radius of the vessels. 2.3
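The Windkessel outlet model can be exercised in isolation; below is a minimal backward-Euler sketch of the three-element Windkessel ODE with P_i^b = 0. All parameter values are illustrative, not the patient-specific ones used in the paper.

```python
def windkessel_step(P, Q_new, dQdt_new, dt, R, Rp, C):
    """One backward-Euler step of  P + R'C dP/dt = (R + R')Q + R R' C dQ/dt.

    Rp stands for the distal resistance R'; the distal pressure P_b is 0.
    """
    tau = Rp * C
    rhs = (R + Rp) * Q_new + R * Rp * C * dQdt_new
    # (P_new - P_old)/dt = (rhs - P_new)/tau  ->  solve for P_new
    return (P + dt * rhs / tau) / (1.0 + dt / tau)

# Illustrative parameters (not the patient-specific values of the paper).
R, Rp, C = 100.0, 900.0, 1e-3   # dyn*s/cm^5, dyn*s/cm^5, cm^5/dyn
P, Q, dt = 0.0, 80.0, 1e-3      # start from P = 0 with a constant flow rate Q
for _ in range(20000):          # integrate 20 s, well past the time constant
    P = windkessel_step(P, Q, 0.0, dt, R, Rp, C)
```

With a constant inflow the pressure relaxes to (R + R')Q with time constant τ = R'C, which is exactly the steady state implied by the analytic solution above.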

2.3 Newton-Krylov-Schwarz Method with a Two-Level Preconditioner

For the full discretization of Eq. (1), we adopt a stabilized P1 -P1 ﬁnite element method in space and an implicit backward Euler method in time [3]. After the discretization, we obtain a large, sparse, and nonlinear algebraic system at each time step, which is denoted as


J. Zhou et al.

F(χ) = 0,    (2)

where χ collects all the velocity and pressure unknowns at the mesh points. To solve Eq. (2), the NKS method is adopted, which updates the solution through the iteration

$$\chi_{k+1} = \chi_k + \tau_k S_k,$$

where τ_k is a step length calculated by a line search method and S_k is the Newton correction obtained by inexactly solving the Jacobian system at each Newton step in the sense

$$\left\| J_k M_k^{-1} M_k S_k + F(\chi_k) \right\| \le \eta \left\| F(\chi_k) \right\|,$$

where η is a given relative tolerance that controls the "exactness" of the solution of the Jacobian system and M_k is a two-level Schwarz preconditioner defined as

$$M_k^{-1} = I_H^h B_c^{-1} (I_H^h)^T + \sum_{l=1}^{N_p} (R_l^0)^T B_l^{-1} R_l^\delta \left( I - J_k\, I_H^h B_c^{-1} (I_H^h)^T \right),$$

where I_H^h is an interpolation operator from the coarse mesh to the fine mesh, B_l^{-1} is a fine-level subdomain preconditioner for the Jacobian matrix J_k, and B_c^{-1} is a coarse-level preconditioner approximating the inverse of the coarse-level Jacobian matrix. N_p is the number of subdomains, which also equals the number of processors used in the parallel computation. δ is the overlap by which the non-overlapping subdomains Ω_l (l = 1, 2, ..., N_p) are extended to the overlapping subdomains Ω_l^δ. R_l^0 and R_l^δ are the restriction operators that map global vectors in Ω to vectors in Ω_l and Ω_l^δ, respectively. Notably, it is difficult to solve the problem directly on the fine mesh but much easier on the coarse mesh; the coarse solution is then interpolated to the fine mesh using radial basis function interpolation, as described in [3].
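To make the update χ_{k+1} = χ_k + τ_k S_k concrete, here is a toy sketch of the outer Newton loop with a backtracking line search on τ_k. For simplicity the Jacobian system is solved directly on a 2×2 model problem (the paper instead uses GMRES with the two-level Schwarz preconditioner), so only the outer structure carries over.

```python
import math

def newton_line_search(F, J, x, tol=1e-10, max_iter=50):
    """Outer NKS-style loop on a 2x2 model problem: x_{k+1} = x_k + tau_k * s_k."""
    for _ in range(max_iter):
        f0, f1 = F(x)
        if math.hypot(f0, f1) < tol:
            break
        (a, b), (c, d) = J(x)
        det = a * d - b * c
        # Newton correction s_k from J s = -F; a direct 2x2 solve (Cramer's rule)
        # stands in for the preconditioned GMRES solve used in the paper.
        s = ((-f0) * d - b * (-f1)) / det, (a * (-f1) - (-f0) * c) / det
        tau = 1.0
        # Backtracking line search on the step length tau_k
        while True:
            g0, g1 = F((x[0] + tau * s[0], x[1] + tau * s[1]))
            if math.hypot(g0, g1) < math.hypot(f0, f1) or tau < 1e-12:
                break
            tau *= 0.5
        x = (x[0] + tau * s[0], x[1] + tau * s[1])
    return x

# Model problem: x0^2 + x1^2 = 4 and x0 = x1, with root (sqrt(2), sqrt(2)).
F = lambda x: (x[0] ** 2 + x[1] ** 2 - 4.0, x[0] - x[1])
J = lambda x: ((2 * x[0], 2 * x[1]), (1.0, -1.0))
root = newton_line_search(F, J, (1.0, 2.0))
```

The damping by τ_k guards the Newton step far from the solution; close to it, τ_k = 1 is accepted and quadratic convergence is recovered.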

3 Results and Discussion

In this section, we present numerical results for the hemodynamics in a full-size patient-specific aorta, with a focus on the performance of the proposed algorithm. All computations are carried out on the Tianhe-2A supercomputer at the National Supercomputer Center in Guangzhou, China.

3.1 Simulation Results and Discussion

For the simulation, the values of the total resistance R_T and total compliance C_T are chosen as 1012.27 dyn·s/cm^5 and 1.026146 × 10^-2 cm^5/dyn, with which the simulated pressure matches the patient's pressure. Figure 2 shows the spatial distributions of the pressure, the streamlines of the velocity, and the wall shear stress (WSS) in the aorta during a systolic period. It can be seen that the pressure gradually decreases along the artery from the proximal to the distal ends, and


the range of the pressure matches that reported in [10]. The velocity streamlines show that physiologically reasonable velocities are obtained, with relatively lower velocity in the aneurysmal regions compared to other regions. Secondary flows develop in aneurysmal regions 1 and 2 marked in Fig. 1; similar results are reported in [17]. The distribution of the WSS shows relatively lower WSS in the aneurysms compared to the other regions, as has been shown in [12]. It is reported that the WSS shifts to a low level while an aneurysm grows, and that rupture usually occurs at these sites [19]. The hemodynamic features in the aneurysmal regions should consequently have an impact on aneurysm development and rupture, which deserves more extensive study.

Fig. 2. Spatial distributions of the pressure, streamline of the velocity and WSS at the period of systole

3.2 Robustness and Scalability

In this subsection, we show the robustness and parallel scalability of the proposed NKS method for the simulation of the hemodynamics in the whole aorta. The algorithm has several important parameters, such as the time-step size Δt, the viscosity μ, the overlapping size δ, the ILU fill-in level, and the resistance and compliance, which affect the performance of the method. For all the numerical tests in this subsection, the stopping criteria for the linear and nonlinear solvers are set to 10^-6 (relative error). In all tables, "Newton", "GMRES", and "Time (s)" refer to the average number of Newton iterations per time step, the average number of GMRES iterations per Newton iteration, and the average compute time in seconds per time step, respectively. In the two-level method, the coarse-level problem is solved by the GMRES method preconditioned with a one-level Schwarz preconditioner, and we use the maximum number of iterations ("Coarse Its") as the stopping condition for the coarse-level GMRES. For each test, we change only one parameter at a time to observe the performance of the algorithm. The two-level method uses a mesh with 6.85 × 10^4 elements as the coarse mesh, and all the tests are carried out on two meshes (mesh1 with 3.26 × 10^6 elements and mesh2 with 1.39 × 10^7 elements) for comparison.

Table 1 shows the influence of the time-step size Δt on the performance of the algorithm; four time-step sizes, 5 × 10^-4, 1 × 10^-3, 2 × 10^-3, and 4 × 10^-3, are tested. The results show that, in general, as the time-step size increases, the numbers of Newton and GMRES iterations and the compute time increase, meaning that the solver becomes harder to converge. For example, when the time-step size increases to 4 × 10^-3, the GMRES iterations almost triple for the coarse mesh case (mesh1) and the solver diverges for the fine mesh case (mesh2). The main reason is that the initial guess of Newton's method becomes too far from the exact solution for large time-step sizes, which slows down or even prevents convergence. Table 1 also shows that the fine mesh case is harder to solve than the coarse mesh case; at the same time, our algorithm shows good robustness with respect to mesh refinement, since both the linear and nonlinear iteration counts increase only slightly after a threefold increase in the problem size, indicating that the proposed algorithm has the potential to solve even larger problems. In the rest of the paper, we use 1 × 10^-3 as the default time-step size.

Table 1. The impact of the time-step size Δt on the performance of the solver. Mesh1 and mesh2 are run with 120 and 480 processors, respectively (the same setup is used for the remaining test cases). Here NC means "Not Converged".

            Mesh1: 3.26 × 10^6            Mesh2: 1.39 × 10^7
Δt          Newton  GMRES  Time (s)       Newton  GMRES  Time (s)
5 × 10^-4     2.20   5.09    21.67          2.60   7.92    28.26
1 × 10^-3     2.30   5.23    22.65          2.40   6.13    25.42
2 × 10^-3     3.10   6.48    30.29          3.20   8.00    34.91
4 × 10^-3     3.10  17.31    34.92          NC     –       –

In Table 2, we show the impact of the accuracy of the coarse-level solution on the performance of the proposed two-level method. The accuracy of the coarse problem is controlled by "Coarse Its", where a larger "Coarse Its" corresponds to a more accurate coarse-level solution. From the results, we see that as "Coarse Its" increases, the number of Newton iterations does not change and the number of GMRES iterations decreases, which means that the two-level preconditioner becomes stronger. At the same time, the time spent on the coarse level grows with "Coarse Its", which increases the total compute time once "Coarse Its" exceeds a certain value. Therefore, the optimal choice of "Coarse Its" in terms of compute time is 40 for this test case, and we use 40 as the default value of "Coarse Its" for all the remaining tests.

Table 2. The impact of the stopping condition for the coarse-level GMRES on the performance of the solver

            Mesh1: 3.26 × 10^6            Mesh2: 1.39 × 10^7
Coarse Its  Newton  GMRES  Time (s)       Newton  GMRES  Time (s)
30            2.30   6.37    22.91          2.40   6.13    25.50
40            2.30   5.23    22.65          2.40   6.13    25.42
50            2.30   5.20    22.70          2.40   6.13    25.44
60            2.30   5.09    22.73          2.40   6.13    25.60

The viscosity μ is an important parameter in blood flow simulation. Table 3 shows that the two-level method converges robustly over a wide range of μ. We observe that as the viscosity increases, the number of Newton iterations gradually stabilizes at a constant, the number of GMRES iterations shows only a small variation, and the compute time gradually stabilizes. Moreover, the effect of the viscosity μ is similar for both meshes, which indicates that the proposed algorithm is robust with respect to the viscosity.

Table 3. The impact of the viscosity μ on the performance of the solver

        Mesh1: 3.26 × 10^6            Mesh2: 1.39 × 10^7
μ       Newton  GMRES  Time (s)       Newton  GMRES  Time (s)
0.01      2.85   5.89    27.69          2.50   6.68    26.87
0.04      2.35   5.26    23.03          2.40   6.42    25.76
0.07      2.25   5.36    22.12          2.40   6.54    26.29
0.10      2.25   5.20    22.11          2.40   6.86    26.42

In Table 4, the two-level preconditioner also shows robust performance with respect to the resistance R and the compliance C. The total resistance R and the total compliance C are critical parameters of the Windkessel model and are generally determined by the clinical condition of the patient. The results show that the numbers of Newton and GMRES iterations are almost stable, with small variations leading to slight fluctuations of the compute time. Overall, the proposed algorithm is robust to both the resistance R and the compliance C. For the two-level Schwarz preconditioner, the fill-in level of the incomplete LU factorization (ILU) [14] is another parameter that affects the performance of the algorithm,


Table 4. The impact of the resistance R and compliance C on the performance of the solver

                  Mesh1: 3.26 × 10^6            Mesh2: 1.39 × 10^7
R (dyn·s/cm^5)    Newton  GMRES  Time (s)       Newton  GMRES  Time (s)
5.06 × 10^2         2.45   5.33    23.90          2.40   6.51    25.69
1.012 × 10^3        2.30   5.23    22.65          2.40   6.13    25.42
2.024 × 10^3        2.30   5.76    22.69          2.40   7.04    26.13

C (cm^5/dyn)      Newton  GMRES  Time (s)       Newton  GMRES  Time (s)
5.131 × 10^-3       2.35   5.36    23.14          2.40   6.29    25.54
1.026 × 10^-2       2.30   5.23    22.65          2.40   6.13    25.42
2.052 × 10^-2       2.35   5.23    23.02          2.40   6.04    25.47

which is tested and summarized in Table 5. Np is the number of processors used to solve the problem. We use different fill-in levels for the subdomain solvers to test the robustness of the proposed algorithm, fixing the overlapping size at 2 and the coarse-level ILU fill-in level at 1, and testing on the meshes with 1.39 × 10^7 and 2.60 × 10^7 elements. The results show that the numbers of Newton and GMRES iterations are almost stable while the compute time increases with the fill-in level, so the algorithm is stable as the ILU fill-in level increases. This means that we can use very small fill-in levels in our simulation, unlike the one-level method, which usually needs large fill-in levels.

Table 5. The effect of the ILU fill-in levels on the performance of the algorithm

           Mesh1: 1.39 × 10^7                  Mesh2: 2.60 × 10^7
Subsolve     Np   Newton  GMRES  Time (s)        Np   Newton  GMRES  Time (s)
ILU(0)      720     2.40   6.38    17.54       1440     2.50   8.28    19.29
ILU(1)      720     2.40   5.67    18.04       1440     2.50   8.88    20.34
ILU(2)      720     2.40   5.75    19.89       1440     2.50   8.76    23.42
ILU(3)      720     2.40   6.58    23.90       1440     2.60   8.77    27.54

Table 6 studies the impact of the subdomain overlapping size on the proposed algorithm. The overlapping size controls the amount of information exchanged between subdomains. For the one-level method, as the number of subdomains (equal to the number of processors in the parallel computation) increases, the preconditioner becomes weaker and therefore usually needs a large overlapping size, as reported in [6]. For the proposed two-level method, the results show that the numbers of Newton and GMRES iterations are not sensitive to the overlapping size, which means that we can use a very small overlapping


size in the simulation. Overlap always implies repeated work; therefore, in the design of the parallel algorithm, we prefer a small overlap to save time. The theory of the two-level domain decomposition method also suggests that the convergence rate is independent of the overlapping size, which is consistent with our results [18].

Table 6. The effect of the overlapping size δ on the performance of the solver

              Mesh1: 1.39 × 10^7                  Mesh2: 2.60 × 10^7
Overlap (δ)     Np   Newton  GMRES  Time (s)        Np   Newton  GMRES  Time (s)
0              720     2.40   7.67    18.27       1440     2.20  12.27    18.90
1              720     2.40   7.08    18.39       1440     2.20  10.71    17.88
2              720     2.40   6.58    18.38       1440     2.20   9.24    17.56
3              720     2.40   6.17    18.49       1440     2.20   9.86    18.96

To understand the parallel scalability of the two-level preconditioner, we test the weak and strong scalability of the algorithm. For weak scalability, the number of linear iterations and the compute time should theoretically stay constant when the number of processor cores and the problem size increase at the same rate, so that the subproblem size per processor is unchanged. Four meshes with 3.26 × 10^6, 6.70 × 10^6, 1.39 × 10^7, and 2.60 × 10^7 elements are solved with 180, 360, 720, and 1440 processor cores, respectively. The results in Table 7 show that the numbers of Newton and GMRES iterations stay close to a constant as the number of mesh elements and the number of processor cores increase proportionally, and the compute time per time step changes little. Our results indicate that the proposed algorithm is weakly scalable.

Table 7. The weak scalability results tested on four different meshes

Mesh            Np   Newton  GMRES  Time (s)
3.26 × 10^6    180     2.35   5.28    16.11
6.70 × 10^6    360     2.30   7.42    17.60
1.39 × 10^7    720     2.20   5.86    16.21
2.60 × 10^7   1440     2.25   7.09    17.53

For strong scalability, we test two meshes with 1.39 × 10^7 and 2.60 × 10^7 elements. The results in Table 8 show that the number of Newton iterations remains almost constant for both meshes as the number of processors increases, while the number of GMRES iterations grows slowly at first and then quickly once the number of processors reaches 2880. For


the coarse mesh, when the number of processors increases from 360 to 2880, the growth in the number of GMRES iterations is slow and the compute time decreases at a fairly uniform rate. For the fine mesh, when the number of processors increases from 2880 to 5760, the number of GMRES iterations increases almost fourfold, which results in a sharp drop in the parallel efficiency. The main reason for the low efficiency is that the problem size is too small for 5760 processors, so the ratio of computing time to inter-processor communication time becomes too small; communication is then the main bottleneck for achieving high parallel efficiency, and one way to improve it is to increase the problem size. We define the speedup and the parallel efficiency as speedup = t_m/t_n and efficiency = (t_m × Np_m)/(t_n × Np_n), where t_m and t_n are the average compute times per time step using Np_m and Np_n processor cores, respectively, with Np_m ≤ Np_n. The parallel efficiency of the two-level algorithm is 45% when the number of processor cores reaches 2880 for the mesh with 1.39 × 10^7 elements, and 35% when the number of processor cores reaches 5760 for the mesh with 2.60 × 10^7 elements. Overall, the proposed algorithm is robust and scalable for the solution of large-scale problems.

Table 8. Strong scalability results tested on two different meshes

Mesh            Np   Newton  GMRES  Time (s)  Speedup  Ideal  Efficiency
1.39 × 10^7    360     2.40   5.54    42.35     1.00    1.00     100%
               720     2.40   6.63    24.45     1.73    2.00      87%
              1440     2.40   9.13    15.87     2.67    4.00      67%
              2880     2.30  17.09    11.69     3.62    8.00      45%
2.60 × 10^7    720     2.50   8.16    55.60     1.00    1.00     100%
              1440     2.60  10.50    36.60     1.52    2.00      76%
              2880     2.40  22.13    22.67     2.45    4.00      61%
              5760     2.40  80.54    20.18     2.76    8.00      35%
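The Speedup and Efficiency columns follow directly from these definitions; for instance, the rows for the 1.39 × 10^7-element mesh can be reproduced as follows (pure Python, data copied from Table 8):

```python
def scaling_metrics(cores, times):
    """Speedup t_m/t_n and efficiency (t_m*Np_m)/(t_n*Np_n), baseline = first entry."""
    np_m, t_m = cores[0], times[0]
    speedup = [t_m / t for t in times]
    efficiency = [(t_m * np_m) / (t * n) for t, n in zip(times, cores)]
    return speedup, efficiency

# Strong-scaling data for the 1.39e7-element mesh (Table 8).
cores = [360, 720, 1440, 2880]
times = [42.35, 24.45, 15.87, 11.69]
speedup, eff = scaling_metrics(cores, times)
print([round(s, 2) for s in speedup])   # speedups relative to 360 cores
print([round(100 * e) for e in eff])    # parallel efficiency in percent
```

The computed values match the Speedup (1.00, 1.73, 2.67, 3.62) and Efficiency (100%, 87%, 67%, 45%) columns of the table.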

4 Conclusion

In this work, a parallel NKS algorithm with a two-level preconditioner is used to simulate the blood flow in a full-size aorta with aneurysms. A large nonlinear system is obtained from the discretization of the Navier-Stokes equations using a stabilized finite element method in space and an implicit backward Euler method in time. The system is then solved by the NKS algorithm with a two-level Schwarz preconditioner, which is constructed by radial basis function interpolation between the non-nested meshes. Numerical tests show that the algorithm is robust with respect to the viscosity, the overlapping size, and the ILU fill-in level, and demonstrates good strong and weak scalability with up to 5000 processor cores.


References

1. Antonietti, P.F., Houston, P., Hu, X., Sarti, M., Verani, M.: Multigrid algorithms for hp-version interior penalty discontinuous Galerkin methods on polygonal and polyhedral meshes. Calcolo 54(4), 1169–1198 (2017)
2. Badia, S., Quaini, A., Quarteroni, A.: Modular vs. non-modular preconditioners for fluid-structure systems with large added-mass effect. Comput. Methods Appl. Mech. Eng. 197(49–50), 4216–4232 (2008)
3. Chen, R., et al.: A parallel non-nested two-level domain decomposition method for simulating blood flows in cerebral artery of stroke patient. Int. J. Numer. Methods Biomed. Eng. 36(11), e3392 (2020)
4. Chen, X., Cai, X.C.: Effective two-level domain decomposition preconditioners for elastic crack problems modeled by extended finite element method. Commun. Comput. Phys. 28(4), 1561–1584 (2020)
5. Grinberg, L., Karniadakis, G.E.: Outflow boundary conditions for arterial networks with multiple outlets. Ann. Biomed. Eng. 36(9), 1496–1514 (2008)
6. Kong, F., Cai, X.C.: A scalable nonlinear fluid-structure interaction solver based on a Schwarz preconditioner with isogeometric unstructured coarse spaces in 3D. J. Comput. Phys. 340, 498–518 (2017)
7. Kong, F., et al.: A fully coupled two-level Schwarz preconditioner based on smoothed aggregation for the transient multigroup neutron diffusion equations. Numer. Linear Algebra Appl. 25(3), e2126 (2018)
8. Luo, L., Liu, L., Cai, X.C., Keyes, D.E.: Fully implicit hybrid two-level domain decomposition algorithms for two-phase flows in porous media on 3D unstructured grids. J. Comput. Phys. 409, 109312 (2020)
9. Luo, L., Shiu, W.S., Chen, R., Cai, X.C.: A nonlinear elimination preconditioned inexact Newton method for blood flow problems in human artery with stenosis. J. Comput. Phys. 399, 108926 (2019)
10. Meidert, A.S., Nold, J.S., Hornung, R., Paulus, A.C., Zwißler, B., Czerner, S.: The impact of continuous non-invasive arterial blood pressure monitoring on blood pressure stability during general anaesthesia in orthopaedic patients. Eur. J. Anaesthesiol. 34(11), 716–722 (2017)
11. Morris, P.D., et al.: Computational fluid dynamics modelling in cardiovascular medicine. Heart 102(1), 18–28 (2016)
12. Qin, S., et al.: Efficient parallel simulation of hemodynamics in patient-specific abdominal aorta with aneurysm. Comput. Biol. Med. 136, 104652 (2021)
13. Radhakrishnan, A., Xu, M., Shahane, S., Vanka, S.P.: A non-nested multilevel method for meshless solution of the Poisson equation in heat transfer and fluid flow. arXiv preprint arXiv:2104.13758 (2021)
14. Saad, Y.: Iterative Methods for Sparse Linear Systems. SIAM (2003)
15. Salvador, M., Dede', L., Quarteroni, A.: An intergrid transfer operator using radial basis functions with application to cardiac electromechanics. Comput. Mech. 66(2), 491–511 (2020). https://doi.org/10.1007/s00466-020-01861-x
16. Shang, Y.: A parallel two-level finite element variational multiscale method for the Navier-Stokes equations. Nonlin. Anal. Theory Methods Appl. 84, 103–116 (2013)
17. Sheidaei, A., Hunley, S., Zeinali-Davarani, S., Raguin, L., Baek, S.: Simulation of abdominal aortic aneurysm growth with updating hemodynamic loads using a realistic geometry. Med. Eng. Phys. 33(1), 80–88 (2011)
18. Toselli, A., Widlund, O.: Domain Decomposition Methods - Algorithms and Theory. Springer, Heidelberg (2004). https://doi.org/10.1007/b137868
19. Wang, Y., Leng, X., Zhou, X., Li, W., Siddiqui, A.H., Xiang, J.: Hemodynamics in a middle cerebral artery aneurysm before its growth and fatal rupture: case study and review of the literature. World Neurosurg. 119, e395–e402 (2018)

Optimizing Data Locality by Executor Allocation in Reduce Stage for Spark Framework

Zhongming Fu1(B), Mengsi He1, Zhuo Tang2, and Yang Zhang3

1 College of Computer Science and Technology, University of South China, Hengyang, China
[emailprotected]
2 College of Information Science and Engineering, Hunan University, Changsha, China
3 Science and Technology on Parallel and Distributed Laboratory (PDL), National University of Defense Technology, Changsha, China

Abstract. Data locality is a key factor influencing the performance of Spark systems. As the execution containers of tasks, the executors, and in particular the nodes on which they are started, directly affect the locality level achieved by the tasks. This paper improves data locality through executor allocation in the reduce stage of the Spark framework. Firstly, we calculate the network distance matrix of the executors and formulate an optimal executor allocation problem to minimize the total communication distance. Then, an approximation algorithm is proposed, and its approximation factor is proved to be 2. Finally, we evaluate the performance of our algorithm on a practical Spark cluster using several representative benchmarks: Sort, PageRank, and LDA. Experimental results show that the proposed algorithm noticeably improves the data locality and the application/job performance.

Keywords: Communication distance · Data locality · Executor allocation · Spark

1 Introduction

Apache Spark has become a popular parallel computing framework for massive data processing. A typical Spark application contains one or more jobs, and a job usually consists of many stages. Since these stages are executed sequentially, the intermediate output of a former stage is used as the input of the later one. When the tasks of a stage run in parallel on different nodes, data communication is required during the job execution. In the map (i.e., shuffleMap) stage, each task reads a data block to process and writes the intermediate data to local disks. In the reduce (i.e., result) stage, each task fetches part of the intermediate data from all the previous tasks for processing. This is a many-to-many communication pattern. The resulting large amount of network traffic in

© Springer Nature Switzerland AG 2022
H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 349–357, 2022. https://doi.org/10.1007/978-3-030-96772-7_32


these two stages can extend execution time and congest the cluster network, thereby hindering the system [1]. To improve performance, data locality is a key factor considered by the task scheduling of Spark stages [2]. The task scheduling determines the executor (and thus the node) on which a task runs, and data locality refers to scheduling computation/tasks close to their data. In particular, in the map stage, the task scheduler uses the delay scheduling algorithm [3], which assigns a map task to the node that stores its data block, thus avoiding remote data copies. In the reduce stage, the task scheduler assigns a reduce task to one of the nodes holding more of the task's intermediate data, thus minimizing the data transmission volume.

However, as the execution containers of tasks, the executors limit the nodes available to the task scheduler, which affects the locality level achieved by the tasks. On the one hand, if no executor is started on the node where a data block is located, a map task is almost unable to retrieve the data locally. On the other hand, if the executors are started on nodes far away from each other, a reducer has to span long network distances to fetch data in the reduce stage. In the Spark framework, spreadOut and noSpreadOut are the two algorithms provided to decide where executors are started. Unfortunately, neither of them fully considers the locality factor.

In this paper, we improve the data locality of tasks from the perspective of executor allocation for the reduce stage of Spark applications. As the number of reduce stages is in general much greater than that of map stages, the reduce stage has an important impact on the overall application/job performance. The main contributions of this paper are summarized below.

• We calculate the network distance matrix of the executors and formulate an executor allocation problem to minimize the total communication distance. This problem is proved to be NP-Hard.
• We propose an approximation algorithm for the optimal executor allocation problem and prove that its approximation factor is 2.
• We implement our algorithm in Spark 3.0.1 and evaluate its performance on representative benchmarks. The experimental results show that the proposed algorithm can decrease task execution time through better data locality.

The rest of this paper is organized as follows. Section 2 reviews related research. Section 3 presents the proposed executor allocation algorithm. Experiments and performance evaluation are given in Sect. 4. Section 5 concludes this paper.

2 Related Work

A lot of research has been done to optimize the cross-node/rack data communication problem in MapReduce-type frameworks, which can be categorized as follows: Task Scheduling. In the design of MapReduce, Dean et al. [4] took the locality of map tasks into account to save bandwidth consumption. The priority of tasks


scheduled to nodes is classified into three levels: node-local, i.e., the task and its data block are on the same node; rack-local, i.e., the task and its data block are on different nodes of the same rack; and off-rack, i.e., the task and its data block are on different racks of the cluster. Further, using a time-for-space strategy, Zaharia et al. [3] proposed the delay scheduling algorithm: if no task can obtain data locally on the requesting node, scheduling waits for a small amount of time in the hope of obtaining better locality on subsequent nodes. In a cluster that releases resources quickly, delay scheduling achieves a higher proportion of node-local tasks while preserving fairness. Besides the map stage, the data locality of reducers also affects job performance. Tang et al. [5] presented a minimum transmission cost reduce task scheduler (MTCRS), which decides the appropriate launching locations for reduce tasks according to the waiting time of each reduce task and the transmission cost set, computed from the sizes and locations of the intermediate data partitions. Data Pre-fetching. From another angle, Sun et al. [6] designed a high performance scheduling optimizer (HPSO), a prefetching-based task scheduler that improves data locality for MapReduce jobs. Their idea is to predict the most appropriate nodes to which future map tasks should be assigned and then preload the input data into memory without delaying normally running tasks. Nevertheless, the method may incur additional overhead and does not help to alleviate the network traffic of the cluster. In our earlier work [7], we optimized task locality in the map stage by executor allocation in the Spark framework. This paper focuses on executor allocation in the reduce stage, with the purpose of giving tasks the possibility of better locality when the reduce tasks are scheduled.

3 Executor Allocation Algorithm

This section first formulates the optimal executor allocation problem and then presents the approximation algorithm for the problem.

3.1 Optimal Executor Allocation Problem

When a Spark application is submitted to the cluster for execution, the master registers with the resource manager and applies for resources to start a group of executors. An executor is the container in which tasks execute; it is in fact a collection of computing resources (i.e., CPU cores and memory). A task can be scheduled to run only on a node that has idle executors. In the initial state of allocating executors for an application, the following data structures are defined: (1) E: the set of executors allowed to be started on the nodes, of size m. The element e_i^l represents the ith executor, which can be started on the lth node if marked. In the Spark system, the number of executors allowed to


start on each node can be calculated from the free resources of the node, formalized as:

$$exe\_num_i = \min\left\{ \left\lfloor \frac{free\_cpu_i}{cpu\_conf} \right\rfloor, \left\lfloor \frac{free\_memory_i}{memory\_conf} \right\rfloor \right\}, \qquad (1)$$

where exe_num_i indicates the number of executors allowed to be started on node N_i, and cpu_conf and memory_conf are the CPU and memory capacities configured per executor, respectively.

(2) D: an m × m matrix representing the communication distance between any two executors of E:

$$D = \begin{bmatrix} d_{00} & d_{01} & \cdots & d_{0(m-1)} \\ d_{10} & d_{11} & \cdots & d_{1(m-1)} \\ \vdots & \vdots & \ddots & \vdots \\ d_{(m-1)0} & d_{(m-1)1} & \cdots & d_{(m-1)(m-1)} \end{bmatrix},$$

where d_ij represents the communication distance between executors e_i and e_j. The communication distance depends on the network latency and bandwidth. To capture data locality, we divide the proximity level (PL) of two executors into three levels: (1) the two executors are on the same node, then PL is 0; (2) they are on different nodes of the same rack, then PL is 1; (3) they are on different nodes of different racks, then PL is 2. The distance d_ij can then be calculated as:

$$d_{ij} = \begin{cases} 0, & \text{if } PL = 0 \\ 2 \times \frac{1}{band_{NS}} + latency_{NS}, & \text{if } PL = 1 \\ 2 \times \frac{1}{band_{NS}} + latency_{NS} + 2 \times \frac{1}{band_{SS}} + latency_{SS}, & \text{if } PL = 2 \end{cases} \qquad (2)$$

where band_NS is the network bandwidth from node to switch, band_SS is the network bandwidth from switch to switch, latency_NS is the network delay from node to switch, and latency_SS is the network delay from switch to switch.

In this model, our purpose is to start the required executors on nodes close to each other. Assuming that the number of executors required by the application is k, the optimal executor allocation problem can be described as selecting a subset E' ⊆ E that minimizes the total communication distance between executors. Using integer programming, this can be formalized as:

$$\min \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} d_{ij} \times (x_i \times x_j), \quad \text{subject to} \ \sum_{i=0}^{m-1} x_i = k, \quad x_i \in \{0, 1\}, \ 0 \le i \le m-1, \qquad (3)$$

where x_i is a binary variable whose value is 1 if executor e_i is selected and 0 otherwise.
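To make Eqs. (2) and (3) concrete, the sketch below builds D for a toy cluster and solves the integer program by brute-force enumeration, which is feasible only for very small m. All topology values (bandwidths, latencies, node and rack names) are made up for illustration.

```python
from itertools import combinations

BAND_NS, LAT_NS = 1.0, 0.1   # node<->switch bandwidth and latency (illustrative)
BAND_SS, LAT_SS = 0.5, 0.2   # switch<->switch bandwidth and latency (illustrative)

def distance(a, b):
    """d_ij from Eq. (2); a and b are (node, rack) locations of executor slots."""
    if a[0] == b[0]:                    # same node: PL = 0
        return 0.0
    d = 2 * (1 / BAND_NS) + LAT_NS      # PL = 1: traffic goes through the rack switch
    if a[1] != b[1]:                    # PL = 2: add the switch-to-switch hop
        d += 2 * (1 / BAND_SS) + LAT_SS
    return d

# Five executor slots: two on n1 (rack r1), one on n2 (r1), two on n3 (r2).
executors = [("n1", "r1"), ("n1", "r1"), ("n2", "r1"), ("n3", "r2"), ("n3", "r2")]
m = len(executors)
D = [[distance(executors[i], executors[j]) for j in range(m)] for i in range(m)]

def total_cost(subset):
    # Objective of Eq. (3): sum of d_ij over all selected pairs
    return sum(D[i][j] for i in subset for j in subset)

k = 3
best = min(combinations(range(m), k), key=total_cost)
```

For this toy instance the optimum keeps all k = 3 executors inside rack r1, illustrating why the selected executors should be close to each other.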


Theorem 1. The optimal executor allocation problem (abbreviated as the OEA problem) is NP-Hard.

Proof. The k-clique problem in graph theory can be shown to be reducible to the OEA problem. That is, for any instance of k-clique, an instance of OEA can be created in polynomial time such that solving the OEA instance also solves the k-clique instance. By the NP-completeness of the k-clique problem, the OEA problem is NP-Hard [8].

3.2 Approximation Algorithm

Algorithm 1 describes the approximation algorithm for the optimal executor allocation problem. First, the algorithm selects the k nearest executors (including e_i itself) for each executor e_i; this set is represented as S(e_i), and the sum of communication distances from e_i to the other k − 1 executors is calculated and represented as C(e_i). Second, it finds the smallest C(e_v) among all executors and assigns the executor set S(e_v) to MinSet. Third, it calculates the total communication distance between the executors of MinSet and represents it as MinCost. Finally, it returns MinSet.

Algorithm 1: Approximation Algorithm
Input: the set of executors allowed to start E; the communication distance matrix D; the number of executors required k
Output: the executors selected to start
1 begin
2   for each executor e_i of E, find the k executors nearest to e_i (including e_i itself), represented as S(e_i);
3   calculate the sum of communication distances from e_i to the other k − 1 executors: C(e_i) = Σ_{e_j ∈ S(e_i)} d_ij;
4   find the smallest C(e_v); the corresponding executor set is represented as MinSet;
5   calculate the total communication distance between the executors of MinSet, represented as MinCost;
6   return MinSet.
7 end

The algorithm takes O(m) time to find the k nearest executors of one executor by using a linear-time selection algorithm. For m executors, the total time is m × O(m). Therefore, the time complexity of Algorithm 1 is O(m^2), where m is the number of executors allowed to start.

Theorem 2. The approximation factor of the approximation algorithm for the optimal executor allocation problem is 2.
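A minimal Python sketch of Algorithm 1 follows (our illustration, not the authors' implementation; for brevity it sorts each row, O(m log m) per executor, instead of the linear-time selection assumed in the complexity analysis):

```python
def approx_executor_allocation(D, k):
    """Approximation algorithm for the OEA problem.

    D is the m x m communication distance matrix; returns (MinSet, MinCost).
    """
    m = len(D)
    best_set, best_c = None, float("inf")
    for i in range(m):
        # k nearest executors to e_i, including e_i itself (d_ii = 0)
        nearest = sorted(range(m), key=lambda j: D[i][j])[:k]
        c = sum(D[i][j] for j in nearest)        # C(e_i)
        if c < best_c:
            best_c, best_set = c, nearest        # MinSet = S(e_v)
    # MinCost: each pair counted twice in the double sum, hence the 1/2
    min_cost = 0.5 * sum(D[i][j] for i in best_set for j in best_set)
    return best_set, min_cost

# two well-separated pairs of executors; with k = 2 either pair is optimal
D = [[0, 1, 4, 4],
     [1, 0, 4, 4],
     [4, 4, 0, 1],
     [4, 4, 1, 0]]
sel, cost = approx_executor_allocation(D, 2)
print(sorted(sel), cost)   # [0, 1] 1.0
```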


Z. Fu et al.

Proof. Let MinSet be the solution of the approximation algorithm for the optimal executor allocation, with total communication distance MinCost, and let MinSet* be the optimal solution, with total communication distance MinCost*. Let e_v be the executor selected by the algorithm, i.e., C(e_v) = min_{e_i} C(e_i), so that MinSet = S(e_v).

For MinCost*, note that MinSet* contains k executors and, for every executor e_i, C(e_i) is the minimum possible sum of distances from e_i to k executors (including e_i itself). Hence:

MinCost* = (1/2) Σ_{e_i ∈ MinSet*} Σ_{e_j ∈ MinSet*} d_ij ≥ (1/2) Σ_{e_i ∈ MinSet*} C(e_i) ≥ (k/2) × C(e_v).    (4)

For MinCost, there is:

MinCost = (1/2) Σ_{e_i ∈ MinSet} Σ_{e_j ∈ MinSet} d_ij.    (5)

According to the triangle inequality [9], d_ij ≤ d_iv + d_vj for any e_i, e_j ∈ MinSet, so:

Σ_{e_i ∈ MinSet} Σ_{e_j ∈ MinSet} d_ij ≤ Σ_{e_i ∈ MinSet} Σ_{e_j ∈ MinSet} (d_iv + d_vj)
                                       = k × (Σ_{e_i ∈ MinSet} d_iv) + k × (Σ_{e_j ∈ MinSet} d_vj)
                                       = k × C(e_v) + k × C(e_v).    (6)

Therefore, for MinCost, there is:

MinCost ≤ (1/2) × 2k × C(e_v) = k × C(e_v).    (7)

The approximation factor of our solution MinSet is:

σ = MinCost / MinCost* ≤ (k × C(e_v)) / ((k/2) × C(e_v)) = 2.    (8)

Therefore, the approximation algorithm for the optimal executor allocation problem is a 2-approximation algorithm.
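The bound of Theorem 2 can also be checked numerically. The following sketch (ours, not from the paper) compares the algorithm against a brute-force optimum on small random instances under an L1 metric, which satisfies the triangle inequality:

```python
import itertools
import random

def total_cost(D, subset):
    # each pair counted twice in the double sum, hence the 1/2
    return 0.5 * sum(D[i][j] for i in subset for j in subset)

def brute_force(D, k):
    m = len(D)
    return min(total_cost(D, s) for s in itertools.combinations(range(m), k))

def approx(D, k):
    m = len(D)
    # C(e_i) = sum of the k smallest entries in row i (row includes d_ii = 0)
    best = min(range(m), key=lambda i: sum(sorted(D[i])[:k]))
    nearest = sorted(range(m), key=lambda j: D[best][j])[:k]
    return total_cost(D, nearest)

random.seed(0)
for _ in range(50):
    pts = [(random.random(), random.random()) for _ in range(8)]
    # L1 distances form a metric, so the triangle inequality holds
    D = [[abs(a[0] - b[0]) + abs(a[1] - b[1]) for b in pts] for a in pts]
    assert approx(D, 4) <= 2 * brute_force(D, 4) + 1e-9
print("2-approximation bound held on all trials")
```

On every trial the heuristic's cost stays within twice the brute-force optimum, consistent with the theorem.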

4 Experimental Evaluation

We evaluate the performance in a data center in which KVM is used to build virtual machines. Each VM is equipped with 4 virtual cores, 8 GB RAM, and 64 GB disk space. We then deploy a Spark 3.0.1 cluster in the data center that contains 18 nodes (each server starts 2 VMs).

4.1 Performance

(1) Micro-benchmark

Sort is a frequently used application that arranges data objects in order. The experiment uses a 30 GB data set of Wikipedia entries as input data. This application contains one job with two stages, a map stage and a reduce stage, and each stage has 80 tasks. To evaluate the performance under different numbers of executors, the required number of executors is set to 30, 40, and 50, respectively, in the procedure. Figure 1(a) shows the performance comparison of the three executor allocation methods, where the proposed approximation algorithm is marked as OTCD. It illustrates that the job execution time of OTCD is lower than those of spreadOut and noSpreadOut. In particular, when the required number of executors is 50 (i.e., Executor_50), OTCD decreases the execution time by 32.8% and 24.5% compared with the other two methods, respectively. Figure 1(b) shows the comparison of the reduce stage time under the different methods. In this stage, the reduce tasks take a lot of time to obtain the intermediate data from previous tasks. Because our optimization of data locality through executor allocation targets the reduce stage, OTCD shows an even more significant performance improvement here than in Fig. 1(a). In particular, when the required number of executors is 40 (i.e., Executor_40), OTCD reduces the execution time by 37.1% and 28.2% compared with spreadOut and noSpreadOut, respectively.

[Figure 1: bar charts of job and stage execution time (s) for SpreadOut, NoSpreadOut, and OTCD at Executor_30, Executor_40, and Executor_50. (a) Job execution time (b) Reduce stage execution time]

Fig. 1. Performance comparison of different methods under Sort.


(2) Macro-benchmark

To evaluate the performance under more complex applications, we select two popular machine learning algorithms, PageRank and LDA, from the Spark examples for testing. Since these two applications contain one or more jobs, each of which usually contains many stages, the application execution time is used. PageRank is a widely recognized iterative algorithm for ranking web pages according to their importance. The experiment uses a 10 GB data set of WT10g as input data and sets the parameter numIterations to 10 in the procedure. The application execution consists of 1 job and 13 stages. From the experimental result in Fig. 2(a), it can be seen that, compared with spreadOut and noSpreadOut, OTCD has the shortest application execution time. In particular, when the number of executors required is 40 (i.e., Executor_40), OTCD reduces the application time by 41.2% and 24.6%, respectively. LDA is a document generation model in natural language processing, which identifies the hidden topics in large-scale documents. The experiment runs on a 20 GB arXiv Bulk Data data set, and the procedure sets the parameter maxIterations to 20. This application is concretely executed as 26 jobs and 90 stages in total. The experimental results illustrate that OTCD has a greater performance advantage than the other two methods, as shown in Fig. 2(b). In particular, when the number of executors required is 50 (i.e., Executor_50), OTCD decreases the application time by 72.7% and 43.2% compared with spreadOut and noSpreadOut, respectively. As we can see, for applications with many jobs and stages, optimizing data locality by executor allocation in multiple reduce stages brings a more substantial performance improvement.

[Figure 2: bar charts of application execution time (s) for SpreadOut, NoSpreadOut, and OTCD at Executor_30, Executor_40, and Executor_50. (a) PageRank (b) LDA]

Fig. 2. Performance comparison under macro-benchmark.

5 Conclusion

This paper has optimized data locality through executor allocation for the Spark framework. We propose an approximation algorithm for optimal executor allocation, and the experimental results show that it can improve data locality and thus reduce data communication. As future work, we intend to consider the input data distribution of each stage in the executor allocation.


Acknowledgment. The work is supported by the Doctoral Research Startup Foundation of University of South China (No. 200XQD083).

References

1. Shabeera, T.P., Kumar, S.D.M.: A novel approach for improving data locality of MapReduce applications in cloud environment through intelligent data placement. Int. J. Serv. Technol. Manag. 26(4), 323–340 (2020)
2. Cheng, L., et al.: Network-aware locality scheduling for distributed data operators in data centers. IEEE Trans. Parallel Distrib. Syst. 32(6), 1494–1510 (2021)
3. Zaharia, M., Borthakur, D., Sarma, J.S., Elmeleegy, K., Shenker, S., Stoica, I.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In: European Conference on Computer Systems, pp. 265–278 (2010)
4. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51, 107–113 (2008)
5. Xia, T., Wang, L., Geng, Z.: A reduce task scheduler for MapReduce with minimum transmission cost based on sampling evaluation. Int. J. Database Theory Appl. 8, 1–10 (2015)
6. Sun, M., Zhuang, H., Zhou, X., Lu, K., Li, C.: HPSO: prefetching based scheduling to improve data locality for MapReduce clusters. In: Sun, X., et al. (eds.) ICA3PP 2014. LNCS, vol. 8631, pp. 82–95. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11194-0_7
7. Fu, Z., Tang, Z., Yang, L., Liu, C.: An optimal locality-aware task scheduling algorithm based on bipartite graph modelling for Spark applications. IEEE Trans. Parallel Distrib. Syst. 31(10), 2406–2420 (2020)
8. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman & Co., San Francisco (1979)
9. Alicherry, M., Lakshman, T.V.: Optimizing data access latencies in cloud systems by intelligent virtual machine placement. In: 2013 Proceedings IEEE INFOCOM (2013)

TEFRED: A Temperature and Energy Cognizant Fault-Tolerant Real-Time Scheduler Based on Deadline Partitioning for Heterogeneous Platforms Yanshul Sharma, Zinea Das, and Sanjay Moulik(B) Indian Institute of Information Technology Guwahati, Guwahati, India {yanshul.sharma,zinea.das,sanjay}@iiitg.ac.in

Abstract. Energy consumption and peak temperatures on MPSoCs are growing exponentially as transistor density increases, rendering systems unstable. Thus, in modern real-time systems, fault-tolerance is an essential design feature. This paper proposes TEFRED, a heuristic scheduling strategy that addresses the problem of controlling energy and peak temperature levels simultaneously on systems with two types of cores while remaining resistant to transient faults. Our experimental results demonstrate that TEFRED can save considerable energy and lower core peak temperatures compared to a state-of-the-art energy-efficient fault-tolerant scheduler.

Keywords: Heterogeneous · Fault-tolerant · Temperature · Energy

1 Introduction

Real-time systems are widely employed in high-risk sectors such as automobiles, aviation, and even medicine. The applications executing in such systems tend to have high demand, which led to the migration of such systems from single-core to multicore platforms a decade ago. In hom*ogeneous multicore platforms, general-purpose cores cannot deliver the degree of efficacy afforded by heterogeneous multicore platforms. This is because heterogeneous architectures are made up of various types of cores, each of which is suited for a particular set of activities. For this reason, every task needs a different duration to finish on different cores. Hence, the preparation of task schedules is more challenging on heterogeneous multicore platforms. As real-time systems are prone to failure, fault-tolerance is essential for such systems. Faults can be permanent, transient, or intermittent. We focus on transient faults in this paper, which have risen exponentially over time as transistor density, frequency, temperature, and other factors have increased. Standby-sparing is a commonly used technique for fault-tolerance. Each task has two copies: the primary copy runs on the primary core, while the backup copy runs only if the primary copy fails (as determined by an acceptance test).

© Springer Nature Switzerland AG 2022
H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 358–366, 2022. https://doi.org/10.1007/978-3-030-96772-7_33


The power density of MPSoCs has increased dramatically as the level of integration on the chips has increased. In [14], the authors presented an energy-cognizant scheduling strategy for tasks given as Directed Acyclic Graphs (DAGs) on a heterogeneous platform with two types of cores, which can handle at most one transient fault per task and one permanent processor fault at the same time. The authors suggested work for energy management and fault-tolerance while preserving the service level in mixed-criticality multicore systems in [15]. They utilized the task replication technique to handle failures and a variety of execution modes to keep the service quality high. In [4], the system chooses temporal redundancy and/or spatial redundancy approaches to achieve its aim. The slacks in task execution times are used to reduce energy usage. The rise in power density of SoCs is closely connected to their rising temperatures, which play a crucial role in degrading the regular operation of these systems, rendering them unreliable. Interconnect latency rises by roughly 5% for every 10 °C increase in temperature, while MOS-current driving capability drops by around 4% [13]. The resulting timing violations cause transient faults. The authors presented a MILP framework for scheduling separate activities on a heterogeneous platform in [18]. A MILP solver requires an exponential amount of time to solve problems at finer granularity. As a result, they devised a two-stage heuristic that included task allocation to clusters and task replication, followed by task assignment to cores and frequency selection while preserving reliability and temperature restrictions. In [8], the authors looked at the power consumption of tasks on a heterogeneous platform, as well as the removal or reduction of waiting times for tasks that share the same successor task.

In [16], two heuristic techniques were devised: the leakage-aware workload stabilizing strategy and the temperature management strategy. They employed the variable-sized bin packing approach for task partitioning to ensure adequate resource usage under energy and temperature restrictions. Although numerous studies address the challenge of energy-efficient scheduling for fault-tolerant real-time systems on hom*ogeneous multicore architectures, just a few use heterogeneous multicore architectures. Furthermore, no previous research has combined thermal control and energy efficiency for fault-tolerant real-time systems. Hence, we propose a heuristic-based scheduler named TEFRED, which performs thermal and power management in fault-tolerant heterogeneous multicore platforms with two types of cores. In our view, the proposed strategy fits precisely those platforms that have cores with different micro-architectures but an identical ISA, like Helio X20® or big.LITTLE®.

2 Specifications

System Model: We consider a set of n periodic tasks Γ = {τ1, τ2, ..., τn} to be scheduled on a heterogeneous processing platform Π which uses two types of cores: Π^LP for power efficiency and Π^HP for performance. Each core type comprises r cores, where the j-th core of type Π^m is


denoted as Π_j^m. It may be noted that such processing platforms are already available in the market. Each occurrence of a periodic task τ_i is associated with a tuple <exec_i^LP, exec_i^HP> (execution requirements on the respective core types at maximum frequency), a deadline/period d_i, a tuple <tss_i^LP, tss_i^HP> (steady-state temperatures on the respective core types), and a tuple <u_i^LP, u_i^HP> (utilizations w.r.t. the LP and HP cores). The steady-state temperature of a task on a core is defined as the maximum temperature attained by the core when the same task is run on it for an infinitely long time, possibly over multiple instances. We assume implicit task deadlines. Every task set is characterized by a parameter called the Utilization Factor (UF), which gives a measure of the resource utilization corresponding to the given task set.

Power Model: We use the analytical core energy model given in [10]. For our system, the dynamic power consumption P ∝ f v^2, where f is the operating frequency and v is the supply voltage. Since the supply voltage is linearly proportional to the operating frequency, P = c f^3, where c is the constant of proportionality. For efficient power management, we employ Dynamic Power Management (DPM). This energy-saving mechanism minimizes static power consumption by switching a core to sleep mode when it is idle.

Thermal and Fault Model: The thermal model used in our work is based on [12]. For an interval [t_0, t_e] in which τ_i is executing on the core Π_j^m, if the core temperature is T_0 at time t_0, the temperature T_e at the end of the interval at time t_e is given by: T_e = tss_i^{Π_j^m} + (T_0 − tss_i^{Π_j^m}) e^{−B(t_e − t_0)}, where B is a constant depending upon the power consumption in the system. We use the standby-sparing scheme, where each task has two copies, primary and backup. The primary copy of a task is scheduled on the LP cores, while the backup copy is scheduled on the HP cores.
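The power and thermal models above can be sketched in Python as follows (the constants c and B and the temperature values are invented placeholders, not from the paper):

```python
import math

def dynamic_power(f, c=1.0):
    """Dynamic power P = c * f^3 (supply voltage folded into c)."""
    return c * f ** 3

def core_temp(t_start, tss, dt, B=0.05):
    """Core temperature after executing a task for dt time units,
    given the task's steady-state temperature tss on that core."""
    return tss + (t_start - tss) * math.exp(-B * dt)

# a hot task (tss = 90) warms a core that starts at 50 degrees;
# the temperature approaches 90 from below but never exceeds it
t = 50.0
for _ in range(3):
    t = core_temp(t, 90.0, 10.0)
print(round(t, 2))
```

Note how the exponential model caps the core temperature at the task's steady-state temperature, which is what the temperature-aware heuristics later exploit.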
Whenever a primary copy completes its execution, an acceptance or sanity test [6] is performed to check whether any transient fault occurred. If so, the backup copy is executed on the HP core; otherwise, the backup copy is deallocated from the HP core. This scheme works on the assumption that each task can undergo a fault at most once, and there can be at most q transient faults per frame, where a frame is a group of time-slots into which the execution in the system can be divided.

Algorithm 1: TEFRED
Input: set of tasks Γ, set of cores Π, number of transient faults q
Output: energy- and temperature-aware fault-tolerant schedule
1 Let {τ1, τ2, ..., τn} be the set of ready tasks
2 Compute the average steady-state temperature tss_avg = Σ_{i=1}^{n} tss_i^LP / n
3 while true do
4   Using deadline-partitioning, compute the next frame (say the kth) R_k
5   Compute the shares required by each task on the LP and HP cores in R_k
6   core_LP = ASSIGN-TO-LP-CORES(Γ, tss_avg)
7   core_HP = ASSIGN-TO-HP-CORES(Γ)

3 Proposed Scheduling Scheme

TEFRED works in three phases to prepare a schedule for a set of real-time periodic tasks Γ on a heterogeneous platform Π comprising two types of cores. In the first phase, it uses deadline partitioning to compute the set of frames [11]. The second phase assigns tasks to the power-efficient cores while controlling the excessive rise in their peak temperature using an efficient temperature-aware heuristic. The third phase creates a heuristic schedule for tasks on the backup cores by reserving slots for possible transient faults.

TEFRED (Algorithm 1): It starts by computing the next frame using a mechanism called deadline partitioning [11]. Then, within the ensuing frame, it computes the share of each task τ_i on both the LP and HP cores using the following equation: shr_i^m = u_i^m × |R_k|, where m is one of {HP, LP}, u_i^m = exec_i^m / d_i, and |R_k| denotes the size of the ensuing frame R_k. It calls Algorithm 2 to get the task schedule on the power-efficient cores. Finally, it calls Algorithm 3 to get the reserved slots in case of faults.

Algorithm 2: ASSIGN-TO-LP-CORES
Input: set of tasks Γ, tss_avg^LP
Output: task schedule on LP cores (core_LP)
1  Set L_hot = ∅, L_cool = ∅ and core_LP = ∅
2  for each task τ_i do
3    if tss_i^LP > tss_avg^LP then add τ_i to L_hot, kept sorted in non-increasing order of tss_i^LP
4    else add τ_i to L_cool, kept sorted in non-decreasing order of tss_i^LP
5  while L_hot ≠ ∅ and L_cool ≠ ∅ do
6    Extract the task at the front of L_hot and append it to core_LP
7    Extract the task at the front of L_cool and append it to core_LP
8  if L_hot = ∅ then append all remaining tasks of L_cool to core_LP
9  else append all remaining tasks of L_hot to core_LP
10 Schedule core_LP onto the LP cores using McNaughton's wrap-around rule
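The per-frame share computation shr_i^m = u_i^m × |R_k| used by TEFRED can be sketched as follows (the task parameters below are invented for illustration):

```python
def shares(tasks, frame_len):
    """Per-frame shares on LP and HP cores.

    tasks: list of dicts with keys exec_lp, exec_hp (execution requirements)
    and d (implicit deadline/period); frame_len is |R_k|.
    """
    out = []
    for t in tasks:
        u_lp = t["exec_lp"] / t["d"]        # u_i^LP = exec_i^LP / d_i
        u_hp = t["exec_hp"] / t["d"]        # u_i^HP = exec_i^HP / d_i
        out.append((u_lp * frame_len, u_hp * frame_len))
    return out

tasks = [{"exec_lp": 50, "exec_hp": 25, "d": 100},
         {"exec_lp": 25, "exec_hp": 10, "d": 50}]
s = shares(tasks, 10)
print(s)   # [(5.0, 2.5), (5.0, 2.0)]
```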

ASSIGN-TO-LP-CORES (Algorithm 2): First, it initializes the empty lists L_hot, L_cool, and core_LP. It considers each task τ_i in the task set Γ and assigns it either to L_hot or to L_cool, using the rule that if its steady-state temperature tss_i^LP is greater than the average steady-state temperature of the task set tss_avg, it goes to the hot list L_hot; otherwise it goes to the cool list L_cool. The hot list contains tasks in non-increasing order of tss_i^LP, and the cool list contains tasks in non-decreasing order of tss_i^LP. Once these two lists are formed, the hottest task and the coolest task are extracted from the hot list and cool list alternately and appended to the third list core_LP. At last, McNaughton's wrap-around rule [9] is applied on core_LP to schedule the tasks on the available LP cores. McNaughton's wrap-around rule prepares an optimal schedule for tasks on a hom*ogeneous multicore platform; since we assign the primary copies of tasks to the same core type, i.e., LP, the wrap-around rule can be used to prepare the final schedule.
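The hot/cool interleaving described above can be sketched as follows (the task names and temperatures are invented for illustration):

```python
def interleave_by_temperature(tss):
    """tss: dict mapping task name -> steady-state temperature on LP cores.

    Returns the ordered list core_LP: hottest, coolest, next-hottest, ...
    """
    avg = sum(tss.values()) / len(tss)
    hot = sorted((t for t in tss if tss[t] > avg), key=tss.get, reverse=True)
    cool = sorted((t for t in tss if tss[t] <= avg), key=tss.get)
    order = []
    while hot and cool:
        order.append(hot.pop(0))    # hottest remaining task
        order.append(cool.pop(0))   # coolest remaining task
    order += cool or hot            # append whichever list is non-empty
    return order

order = interleave_by_temperature({"a": 91, "b": 85, "c": 68, "d": 80})
print(order)   # ['a', 'c', 'b', 'd']
```

Alternating hot and cool tasks gives a core running a hot task time to cool down before the next hot task arrives, which is the intuition behind the heuristic.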


ASSIGN-TO-HP-CORES (Algorithm 3): In the considered platform, there is an equal number of LP and HP cores, which makes the platform suitable for the backup-overloading [5] technique. Since each task has two shares in a frame, i.e., on the LP cores and on the HP cores, when a task has been allotted to the core Π_j^LP for a certain time interval, the exact proportionate workload for the task has to be allotted on the corresponding Π_j^HP. For each core Π_j^HP in Π^HP, the algorithm creates a list eList of all tasks in non-increasing order of their shares on the core Π_j^HP. It then finds the sum of the first q shares from eList and calls this sum backup_slots. Finally, it assigns this contiguous series of slots as late as possible in the current frame, so that execution on the HP core can be cancelled if the corresponding primary copies are successfully executed on the LP core. The algorithm overlaps q tasks in the backup slots, thus following the non-work-conserving strategy of backup overloading. Since we reserve backup slots equivalent to the sum of the q largest shares, the HP core will utilize all of these slots only in the worst case. In most cases, some of these backup slots will remain idle because the corresponding tasks on the LP core have been successfully executed.
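The backup-slot reservation step can be sketched as follows (our reading of the scheme; the share values are invented):

```python
def reserve_backup_slots(shares_hp, q, frame_len):
    """Reserve backup capacity on one HP core for up to q transient faults.

    shares_hp: per-task shares on this HP core (time units within the frame).
    Returns the (start, end) interval reserved as late as possible.
    """
    # sum of the q largest shares = worst case for q faults in this frame
    biggest = sorted(shares_hp, reverse=True)[:q]
    backup = sum(biggest)
    start = frame_len - backup          # as late in the frame as possible
    return start, frame_len

start, end = reserve_backup_slots([4.0, 2.0, 3.0, 1.0], q=2, frame_len=10)
print(start, end)   # 3.0 10 -> the interval [3.0, 10) is reserved
```

Placing the reservation at the end of the frame maximizes the chance that it can be cancelled once the primary copies finish cleanly on the LP core.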

Algorithm 3: ASSIGN-TO-HP-CORES
Input: set of tasks Γ, number of transient faults q
Output: reserved slot schedule on HP cores
1 for j = 1 : r in Π^HP do
2   Create a list eList of all tasks in non-increasing order of their share on Π_j^HP
3   Initialize backup_slots = 0
4   for i = 1 : q in eList do
5     backup_slots = backup_slots + exec_i^HP
6   Reserve backup_slots slots as late in the frame as possible
7 return the reserved slot schedule on the HP cores

4 Experimental Set Up and Results

We have implemented the TEFRED algorithm and compared it against the following two algorithms based on hom*ogeneous multicore platforms: (i) a hom*ogeneous version of TEFRED named TEFRED-HM, and (ii) an energy-efficient fault-tolerant scheduler named FEED-O [1]. We will use TEFRED-HET to refer to our proposed strategy for heterogeneous platforms from now on. To the best of our knowledge, no work focusing on fault-tolerance coupled with temperature and energy management has been done yet. As temperature is also an essential aspect of our work, we have compared the performance of TEFRED-HET against TEFRED-HM, which focuses on fault-tolerance, energy, and temperature, and FEED-O, which focuses on fault-tolerance and energy.


Table 1. Task specifications for benchmark programs

Program        Execution requirement (in ms)   Steady-state temperature (in °C)
Bodytrack        3824                            85
Canneal          6455                            91
Dedup            1007                            80
Fluidanimate     4090                            81
Freqmine        11082                            84
Stream           6156                            68
Swaptions        4535                            76
Blackscholes     1203                            85

All our simulations have been run for a total execution time of 100000 time slots with task sets having pre-specified utilization factors or system workload. For each set of input parameters, the average of 50 simulations has been taken as the outcome. The PARSEC [2] benchmark suite (with a large input set) has been used to substantiate the efficiency of the algorithms over different real-life scenarios that may arise (Table 1). For all the experiments, we have taken n = 20 by selecting tasks from the 8 benchmark applications (with some tasks repeated in the set to form a task set of size 20). We obtained periodic performance traces from the gem5 [3] simulator for an 8-core heterogeneous chip-multiprocessor (considering 32 nm CMOS technology), where each of the 4 faster Out-of-Order cores can operate at a frequency of 3.0 GHz, and each of the 4 smaller In-Order cores can operate at 1.8 GHz. We have used DPM for efficient energy consumption. Note that, for each of our cores (both in-order and out-of-order), we have considered the Alpha 21364 ISA. For complete periodic performance-power-thermal analysis, we integrated the gem5 [3], McPAT-monolithic [7], and HotSpot 6.0 [17] simulators.

[Figure 1: bar charts of ECon for TEFRED-HET, FEED-O, and TEFRED-HM. (a) Effect of Utilization Factor (0.4 to 1.0) (b) Effect of Number of Faults (1 to 10)]

Fig. 1. Effect on energy consumption

Experimental Results: We have performed a set of extensive simulation-based experiments to gauge the efficiency of the algorithms.


Experiment 1: We varied UF from 0.4 to 1.0, and the number of transient faults to be handled by the system per frame was fixed at 10. Figure 1a shows that ECon values increase with UF. This is because, with the increase in UF (based on only the primary cores), the idle time of the HP cores with DPM capability decreases, which leads to higher energy consumption. However, TEFRED-HET outperforms TEFRED-HM and FEED-O because the latter algorithms are oblivious to the heterogeneity of the cores and choose the cores for backup-overloading randomly. In contrast, TEFRED-HET compensates ECon by assigning all primary copies to the LP cores. It can be observed from Fig. 1a that the ECon values vary from 0.26 to 0.31, 0.44 to 0.53, and 0.49 to 0.58 for TEFRED-HET, FEED-O, and TEFRED-HM, respectively, as the UF values vary.

Experiment 2: We varied the number of faults q from 1 to 10 at UF = 0.6. It can be observed from Fig. 1b that ECon is directly proportional to q because the number of backup slots on the backup cores increases with q. Also, each failed task running on an HP core consumes more energy than the same task would on an LP core. Hence, the ECon values are considerably lower for TEFRED-HET compared to TEFRED-HM and FEED-O. It can be observed from Fig. 1b that the ECon values vary from 0.3 to 0.35, 0.48 to 0.54, and 0.53 to 0.61 for TEFRED-HET, FEED-O, and TEFRED-HM, respectively, as the q values vary.

[Figure 2: bar charts of peak temperature (PTC, in °C) of the backup schedules of TEFRED-HET, FEED-O, and TEFRED-HM versus Utilization Factor (0.4 to 1.0). (a) Effect on Temp. of LP Cores (b) Effect on Temp. of HP Cores]

Fig. 2. Effect on temperature of cores

Experiment 3: In this experiment, we used the settings of Experiment 1. From Fig. 2, we observe that the PTC values of the primary and backup cores increase with UF. This is because, as the workload of a core increases with UF, the core gets less time to cool down and hence exhibits higher peak temperatures. The temperature-aware heuristic strategy in TEFRED-HET and TEFRED-HM achieves efficient PTC values because it schedules hot and cool tasks alternately. However, task-to-core assignments are more efficient in TEFRED-HET than in TEFRED-HM. As TEFRED-HM also chooses random cores for task assignment, it leads to less efficient scheduling and higher temperatures on the cores.

5 Conclusion

In this paper, we propose a fault-tolerant heuristic scheduling mechanism, TEFRED-HET, which successfully schedules tasks meeting their implicit deadlines. It outperforms TEFRED-HM, a hom*ogeneous version of TEFRED-HET, and a state-of-the-art fault-tolerant energy-aware scheduler for hom*ogeneous platforms named FEED-O. The proposed algorithm adopts the DPM technique to minimize static energy consumption and reserves only the necessary backup slots for a known maximum number of possible transient faults. TEFRED-HET also utilizes the difference in steady-state temperatures of the tasks to achieve a remarkable reduction in the peak temperatures of the system.

References

1. Bansal, S., Bansal, R.K., Arora, K.: Energy efficient backup overloading schemes for fault tolerant scheduling of real-time tasks. J. Syst. Architect. 113, 101901 (2021)
2. Bienia, C., Kumar, S., Singh, J.P., Li, K.: The PARSEC benchmark suite: characterization and architectural implications. In: International Conference on Parallel Architectures and Compilation Techniques, pp. 72–81 (2008)
3. Binkert, N., et al.: The gem5 simulator. ACM SIGARCH Comput. Archit. News 39(2), 1–7 (2011)
4. Chatterjee, N., Paul, S., Chattopadhyay, S.: Task mapping and scheduling for network-on-chip based multi-core platform with transient faults. J. Syst. Architect. 83, 34–56 (2018)
5. Ghosh, S., Melhem, R., Mosse, D.: Fault-tolerant scheduling on a hard real-time multiprocessor system. In: International Parallel Processing Symposium, pp. 775–782. IEEE (1994)
6. Koren, I., Krishna, C.M.: Fault-Tolerant Systems. Elsevier, Cambridge (2010)
7. Li, S., Ahn, J.H., Strong, R.D., Brockman, J.B., Tullsen, D.M., Jouppi, N.P.: McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In: IEEE/ACM International Symposium on Microarchitecture, pp. 469–480 (2009)
8. Li, T., Zhang, T., Yu, G., Song, J., Fan, J.: Minimizing temperature and energy of real-time applications with precedence constraints on heterogeneous MPSoC systems. J. Syst. Architect. 98, 79–91 (2019)
9. McNaughton, R.: Scheduling with deadlines and loss functions. Manage. Sci. 6(1), 1–12 (1959)
10. Moulik, S., Devaraj, R., Sarkar, A.: HEALERS: a heterogeneous energy-aware low-overhead real-time scheduler. IET Comput. Digit. Tech. 13(6), 470–480 (2019)
11. Moulik, S., Sarkar, A., Kapoor, H.K.: Energy aware frame based fair scheduling. Sustain. Comput. Inform. Syst. 18, 66–77 (2018)
12. Moulik, S., Sarkar, A., Kapoor, H.K.: TARTS: a temperature-aware real-time deadline-partitioned fair scheduler. J. Syst. Architect. 112, 101847 (2021)
13. Narayanan, V., Xie, Y.: Reliability concerns in embedded system designs. Computer 39(1), 118–120 (2006)
14. Roy, A., Aydin, H., Zhu, D.: Energy-efficient fault tolerance for real-time tasks with precedence constraints on heterogeneous multicore systems. In: International Green and Sustainable Computing Conference, pp. 1–8. IEEE (2019)
15. Safari, S., Ansari, M., Ershadi, G., Hessabi, S.: On the scheduling of energy-aware fault-tolerant mixed-criticality multicore systems with service guarantee exploration. IEEE Trans. Parallel Distrib. Syst. 30(10), 2338–2354 (2019)
16. Sha, S., Wen, W., Chaparro-Baquero, G.A., Quan, G.: Thermal-constrained energy efficient real-time scheduling on multi-core platforms. Parallel Comput. 85, 231–242 (2019)
17. Stan, M.R., Zhang, R., Skadron, K.: HotSpot 6.0: validation, acceleration and extension (2015)
18. Zhou, J., et al.: Reliability and temperature constrained task scheduling for makespan minimization on heterogeneous multi-core platforms. J. Syst. Softw. 133, 1–16 (2017)

Algorithms and Applications

Social Recommendation via Graph Attentive Aggregation

Yuanwei Liufu and Hong Shen(B)

Sun Yat-sen University, Guangzhou, China
[emailprotected], [emailprotected]

Abstract. Recommender systems play an important role in helping users discover items of interest from a large resource collection in various online services. Although deep graph neural network-based collaborative filtering methods have achieved promising performance in recommender systems, they still have some weaknesses. First, existing graph neural network methods take only user-item interactions into account, neglecting the direct user-user interactions that can be obtained from social networks. Second, they treat the observed data uniformly, without considering fine-grained differences in importance or relevance among the user-item interactions. In this paper, we propose a novel graph neural network, social graph attentive aggregation (SGA), that is suitable for parallel training to boost efficiency, a common bottleneck for deployed neural network models. This model obtains user-user collaborative information from social networks and utilizes a self-attention mechanism to model the differences in importance among user-item interactions. We conduct experiments on two real-world datasets, and the results demonstrate that our method is effective and can be trained in parallel efficiently.

Keywords: Recommendation system · Social recommendation · Graph neural network · Parallel computing

1 Introduction

Recommender systems have been studied to resolve the issue of information overload in various fields during the past decades, such as product-to-customer recommendation on e-commerce platforms and people-to-people recommendation in social networks. Collaborative filtering (CF), which assumes that two users with similar behaviors may show similar interests in items, is a class of widely used personalized recommender systems based on user-item interaction data such as purchases and clicks. Thanks to the strong capability of Graph Neural Networks (GNNs) [5] in representing graph data, an increasing number of studies utilize GNNs [8,23,25] to learn representations in CF, yielding promising performance gains. Our model is mainly based on Neural Graph Collaborative Filtering (NGCF) [23], which regards user-item interactions as a bipartite graph structure and uses graph aggregation techniques to capture collaborative information.

© Springer Nature Switzerland AG 2022
H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 369–382, 2022. https://doi.org/10.1007/978-3-030-96772-7_34


Despite the effectiveness of NGCF, we argue that traditional CF models often suffer from the sparsity problem [1], and so does NGCF. For example, users usually give feedback on only a very small proportion of items, all at the same preference intensity level, so there is insufficient data to build the models. An easy way to address this problem is to take more information into consideration. Besides user-item interactions, social recommender systems take social relations among users (user-user interactions) into account to model users' preferences. As shown in social theories, people are easily influenced by their friends in the same social community, and thus people who are social neighbors tend to have similar preferences [2,7,8]. We note that considering direct user-user interactions in CF, obtained from social relations among users, can bring a great amount of semantic and collaborative information into the recommender system.

Fig. 1. An example of the proposed graph model and the high-order connectivity with social information

The graph structure with direct user-user interaction information is illustrated in Fig. 1. The user to be analyzed in this recommender system is u1, highlighted with a double circle in the left sub-figure. The right sub-figure shows the hierarchical structure expanded from u1, where l is the distance of a node to u1. In the right sub-figure, the collaborative information is related to the distance of a node. For example, the distance between u1 and i1 is 1 (u1 → i1) while the distance between u1 and i4 is 2 (u1 → u3 → i4); thus we can assume that u1 is more likely to choose i1 than i4. Similarly, the distance between u1 and u3 is less than the distance between u1 and u2, which indicates that the preferences of u1 are more similar to those of u3 than to those of u2. From the right sub-figure, we can easily see that without considering the social relation information, a large amount of collaborative information would be lost (only the nodes framed with purple dotted lines would remain in Fig. 1). We believe that the introduction of social relations among users brings more expressive power to the model. Unlike many GNN-based social recommendation approaches that utilize complex neural models, we use a simple yet effective way to model the interactions by encoding both items and users in the same vector space. We also verify through experiments that our model can easily be trained in parallel.


Moreover, we argue that it is unreasonable to assume that the weights of interacted items are the same. For example, more attention should be paid to baby products than to other products when someone has bought diapers, as there might be a newborn baby in the family. The self-attention mechanism [21] is able to assign learnable importance weights to neighbors during embedding aggregation, and we adopt it for this purpose. To summarize, the main contributions of this paper include:

– We show that social relations are important to consider in graph neural networks for CF and propose a novel graph neural network with graph aggregation techniques.
– We propose a new GNN layer, social graph attentive aggregation (SGA), with a self-attention mechanism to capture fine-grained modeling of user-item and user-user interactions.
– We demonstrate that our model obtains promising results on real-world datasets and can be efficiently trained in parallel.

2 Related Work

In this section, we review existing work on graph-based CF, social recommendation, and attention mechanisms.

2.1 Graph-Based CF

This line of studies regards users and items as nodes and the interactions between them as edges, thus building a bipartite user-item interaction graph. A variety of graph-based methods are then used to obtain embeddings of users and items. With the embedding information, we can utilize interaction modeling methods to reconstruct historical interactions or predict future interactions. Owing to the current popularity of GNNs, a great number of studies on graph-based recommender systems have been proposed. GC-MC [3] may be the first study to apply GNNs to this task. It utilizes a GAE [15] framework with a GCN encoder and a bilinear decoder for the matrix completion task, which regards recommendation as a link prediction problem in bipartite graphs. However, it mainly focuses on the rating prediction task, which requires ratings as side-information, and it is very time-consuming and thus unsuitable for CF with large-scale datasets. The work closest to ours is NGCF by Wang et al. [23]. The NGCF model uses a GCN to obtain high-order collaborative information in the user-item bipartite graph. However, as discussed above, it neglects the social relations among users, which contain a great deal of collaborative information. Moreover, an attention mechanism and the order information of user-item interactions could be considered to improve the model's expressiveness.
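To illustrate the graph construction described above, the following sketch (hypothetical function and variable names, not from any of the cited papers) builds the adjacency matrix of the bipartite user-item interaction graph from a list of observed interactions:

```python
import numpy as np

def build_bipartite_adjacency(interactions, n_users, n_items):
    """Adjacency matrix of the bipartite user-item graph.

    Users occupy indices [0, n_users) and items occupy indices
    [n_users, n_users + n_items); every interaction adds an
    undirected edge between a user node and an item node.
    """
    n = n_users + n_items
    A = np.zeros((n, n))
    for u, i in interactions:
        A[u, n_users + i] = 1.0  # user -> item edge
        A[n_users + i, u] = 1.0  # item -> user edge (undirected)
    return A

# Toy example: user 0 interacted with items 0 and 1, user 1 with item 1.
A = build_bipartite_adjacency([(0, 0), (0, 1), (1, 1)], n_users=2, n_items=2)
```

A social-aware variant of this construction would additionally add user-user edges into the same matrix, yielding the composite graph used later in the paper.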

2.2 Social Recommendation

Thanks to the popularity of social platforms, the exploitation of social relation information has drawn a lot of attention from researchers. By considering user-user interactions, social recommendation tends to be a promising way to alleviate the data sparsity issue that often occurs in earlier CF models. The general idea of social recommendation is that similar users have similar preferences and thus similar latent embeddings. Early models are mainly based on matrix factorization. SR2 [17] obtains social embeddings by regularizing latent user factors to force users connected in the social relation graph to be close to each other. SBPR [27], based on BPR [20], considers social pair-wise information and tends to assign higher ratings to the items that a user's friends prefer. There are also studies that consider other side-information in social networks to construct the model. For example, TrustSVD [9], ContextMF [13], and PTPMF [24] consider trust influence, social context, and the strength of social ties, respectively. However, all the models discussed above are shallow models that only consider one-hop relations in the social network. Instead of considering only the direct social relations of users, our model differs from these works in using a GNN to capture high-order social information.

2.3 Attention

To enable fine-grained modeling of user-item and user-user interactions, our model relies on the neural attention mechanism, which has been widely applied in natural language processing [21] and computer vision [19]. For recommender systems, several studies attempt to employ attention-based memory networks to capture complex and fine-grained user-item interactions in CF [6]. Additional side information such as texts [29] and heterogeneous relations [28] can also be integrated into the memory network. However, these methods still center only on user-item interactions, whereas our model also considers direct user-user interactions, capturing fine-grained high-order contexts. Moreover, the methods above mostly consider only one-hop semantic information, while our layerwise aggregation model can capture multi-hop semantic information.

3 Methodology

In this section, we introduce our social graph attentive aggregation (SGA) model for social recommendation in detail. An overview of the proposed framework is shown in Fig. 2. It consists of three components: (1) a pre-trained embedding layer, which parameterizes each user and item into a low-dimensional dense vector preserving their interaction information; (2) multiple graph aggregation layers, which aggregate both social relations among users and interactions between users and items; and (3) preference prediction, which integrates the user and item embeddings and outputs their proximity score to make proper recommendations.

3.1 Pre-trained Embedding Layer

Many neural collaborative filtering recommendation systems parameterize each user and item into latent embeddings [11,12,20]. In these models, users and items are represented by dense low-dimensional vectors that encode item similarity and user preferences. By learning the representations of users and items in advance, we can use simple operations to obtain the preference score. The interaction matrix is usually used to train the embeddings; it is a 0-1 matrix R where R_ij indicates whether the i-th user is related to the j-th item, i.e., whether the user has some interaction with the item. Since we refine the embeddings by aggregating information from the user-item interaction graph and the user-user social graph, it is useful to initialize with embeddings of users and items trained by previous methods that have proved efficient and effective, in order to get better performance. In our experiments we use the initial interaction matrix as the pre-trained embeddings.
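For concreteness, a minimal sketch of this initialization (hypothetical names; dense matrices are used for clarity, whereas a real implementation would use sparse storage): the 0-1 interaction matrix R is built from the observed interactions, and its rows and columns serve as the initial user and item vectors:

```python
import numpy as np

def interaction_matrix(interactions, n_users, n_items):
    """0-1 matrix R with R[i, j] = 1 iff user i interacted with item j."""
    R = np.zeros((n_users, n_items))
    for u, i in interactions:
        R[u, i] = 1.0
    return R

# User 0 interacted with items 0 and 1; user 1 only with item 1.
R = interaction_matrix([(0, 0), (0, 1), (1, 1)], n_users=2, n_items=2)
user_emb = R      # each user is described by the items they interacted with
item_emb = R.T    # each item is described by the users who interacted with it
```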

Fig. 2. Multiple SGA layers

Fig. 3. A single SGA layer

3.2 Graph Attentive Aggregation Layer

We start by introducing the building blocks of a single graph attentive aggregation layer (Fig. 3), as this single layer is used throughout the framework and models how information flows and aggregates in the social recommendation graph. The ultimate model can then be built by stacking multiple graph attentive aggregation layers, each followed by a point-wise non-linearity, through which we can explore high-order interactions among users and items.


First-Order Aggregation. In graph theory, connected nodes in a graph are likely to share the same properties [14]. By incorporating and aggregating node features in the learning algorithm, a graph neural network can explicitly learn the topological structure of each node's neighborhood (the first-order proximity) as well as the distribution of node features in that neighborhood [10]. Traditionally, many graph-aggregation-based recommender systems treat the data as a bipartite graph [23]. User preferences can be inferred from interacted items, and the collaborative similarity of items is measured by the users who consume them. From a graph aggregation perspective, a user's embedding can only propagate to items, and vice versa. However, social relations intuitively influence users' behaviors: some people may choose items they have never bought before after a friend's strong recommendation, which motivates the consideration of feature aggregation among users when we describe users in a graph aggregation way. We build upon this basis to perform graph aggregation on connected users as well as on user-item interactions. In a composite graph including user social relations and user-item interactions, we can simultaneously encode each node's first-order proximity with the different types of nodes, i.e., users and items, into a single latent space by aggregating the neighborhood information without distinction. Specifically, in each layer all user nodes are updated by their adjacent nodes, including both user and item nodes, while all item nodes are updated by connected user nodes only, as there are no relations among item nodes.

Message Construction. For a connected node pair (u, v) in the social recommendation graph, we define the message from node v to u as:

$$M_{(u,v)} = f(e_u, e_v), \qquad (1)$$

where $M \in \mathbb{R}^{N \times N \times d}$ is the message embedding matrix for each pair (u, v), and f(·) is the message encoding function, which takes two embeddings, $e_u$ and $e_v$, as inputs and outputs an embedding of the same dimension. It can be implemented by simple element-wise multiplication, a Multi-Layer Perceptron (MLP), or any other transformation. Here our implementation of f(·) is the same as in the model of [23]:

$$M_{(u,v)} = \frac{1}{\sqrt{|N_u|\,|N_v|}} \big( W_1 e_v + W_2 (e_u \odot e_v) \big), \quad u \neq v, \qquad (2)$$

where $W_1, W_2 \in \mathbb{R}^{d' \times d}$ are two trainable linear transformations used to extract features for later aggregation. The term $e_u \odot e_v$ encodes the interaction on each dimension, where $\odot$ denotes the element-wise product; this term is more expressive for encoding node-pair affinity and is followed by a fully connected layer. The term $W_1 e_v$ retains the initial information from the neighborhood, acting to some extent like a skip-connection [11]. It can improve the model's capacity while avoiding twisting the data, thus promoting the generalization performance of the model. After the joint transformation for pair (u, v), we use the graph Laplacian normalization factor $1/\sqrt{|N_u|\,|N_v|}$ to normalize the messages, where $|N_u|$ and


$|N_v|$ denote the numbers of first-order neighbors of nodes u and v, respectively. Without this Laplacian normalization factor, high-degree nodes would receive superabundant messages in the graph aggregation process, which breaks the balance of message aggregation and reduces the utility of the model.

Self-attention Layer. Self-attention is a special case of the attention mechanism and has been successfully applied to graph-structured data to assign different importance weights to the neighbors of each node [22]. In the social recommendation graph, the self-attention layer is used to capture a user's global dependencies on users in the social relation graph and on items in the interaction graph, regardless of their distances, by applying multiple aggregations. For each node u, a shared self-attention operation $f : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is performed on all the message embeddings $M_{(u,v)}$ with $v \in N(u)$, and outputs the attention coefficients that indicate the importance of the messages from its neighbors. Specifically, we first apply a shared linear transformation parametrized by a weight matrix $W_a \in \mathbb{R}^{d \times d}$ to the message embeddings:

$$c_{(u,v)} = f(M_{(u,u)} W_a, \; M_{(u,v)} W_a). \qquad (3)$$

Note that we take the self-connection of u into consideration, which can be calculated by the first term in Eq. (2), as the weight matrix $W_1$ suffices to represent the self-connection aggregation:

$$M_{(u,u)} = W_1 e_u. \qquad (4)$$

As we model the graph aggregation process layerwise and high-order global dependencies can be computed by stacking aggregation layers, it is neither effective nor efficient to compute the messages and attention coefficients of all node pairs with the attention mechanism. Therefore, we use masked attention to preserve the first-order graph structure: we only compute $c_{(u,v)}$ where v is a first-order neighbor of u. In our experiments, the attention operation f is a simple feedforward neural network with a LeakyReLU non-linearity that takes the two transformed message embeddings as inputs and outputs a single score, followed by the softmax function to normalize the masked attention coefficients:

$$c_{(u,v)} = \mathrm{Softmax}\Big( \mathrm{LeakyReLU}\big( (M_{(u,u)} W_a)^{\top} (M_{(u,v)} W_a) \big) \Big). \qquad (5)$$

Message Aggregation. Next we introduce how to refine u's embedding by aggregating the messages from u's first-order neighbors. Formally, u's representation $e_u^{(l+1)}$ after the (l+1)-th aggregation is given by:

$$e_u^{(l+1)} = \sigma\Big( W_{agg} \cdot f_{agg}\big( M^{(l)}, u, N(u), c \big) + b \Big), \qquad (6)$$

where $M^{(l)}$ denotes the message matrix in the l-th aggregation, and N and c denote the set of u's neighbors and the attention coefficient matrix for each connected pair,


respectively. After the aggregation, we apply a single-layer network parameterized by the weight $W_{agg}$ and the bias b, followed by a non-linear activation function σ(·), e.g., LeakyReLU [18]. We implement the aggregator as weighted sum pooling with self-connection, i.e.,

$$f_{agg}\big( M^{(l)}, u, N(u), c \big) = m_{(u,u)} + \sum_{v \in N_u,\, v \neq u} c_{(u,v)}\, m_{(u,v)}. \qquad (7)$$
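Putting the message construction, masked self-attention, and weighted sum pooling together, one aggregation step for a single node u might be sketched as follows in dense NumPy. All weight matrices and the toy graph are illustrative assumptions, not the authors' released code, and a real implementation would batch these operations over all nodes:

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sga_aggregate(u, emb, neighbors, W1, W2, Wa, Wagg, b):
    """One attentive aggregation step for node u, following Eqs. (2)-(7).

    emb:       dict node -> current embedding e_v
    neighbors: dict node -> list of first-order neighbors (users or items)
    """
    Nu = neighbors[u]
    # Eq. (2): Laplacian-normalized message from each neighbor v to u.
    msgs = {}
    for v in Nu:
        norm = 1.0 / np.sqrt(len(Nu) * len(neighbors[v]))
        msgs[v] = norm * (W1 @ emb[v] + W2 @ (emb[u] * emb[v]))
    # Eq. (4): self-connection message.
    m_uu = W1 @ emb[u]
    # Eqs. (3)/(5): masked attention coefficients over u's neighbors only.
    scores = np.array([float(leaky_relu((m_uu @ Wa) @ (msgs[v] @ Wa)))
                       for v in Nu])
    c = softmax(scores)
    # Eq. (7): weighted sum pooling with self-connection.
    agg = m_uu + sum(c_v * msgs[v] for c_v, v in zip(c, Nu))
    # Eq. (6): linear transform, bias, and point-wise non-linearity.
    return leaky_relu(Wagg @ agg + b)

# Toy composite graph: user 0 is connected to user 1 (social) and item 2.
emb = {0: np.array([1.0, 1.0]), 1: np.array([1.0, 2.0]), 2: np.array([0.5, 0.5])}
neighbors = {0: [1, 2], 1: [0], 2: [0]}
I = np.eye(2)
out = sga_aggregate(0, emb, neighbors, I, I, I, I, np.zeros(2))
```

Note that the same routine serves both node types: for a user it mixes social and interaction neighbors, while for an item the neighbor list contains users only.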

As stated previously, we include the self-connection in each node's aggregation, which acts like a skip-connection to retain the information of the original features. Through the attentive aggregation, we refine a user's (or an item's) embedding by considering both connected users and connected items, explicitly exploiting both the social relations and the user-item dependencies.

3.3 Preference Prediction

After stacking L aggregation layers according to the complexity of the data, we obtain multiple representations for node u, namely $\{e_u^{(1)}, e_u^{(2)}, \dots, e_u^{(L)}\}$. Each representation captures the dependencies between u and its direct neighborhood (i.e., social influences among users and user preferences for items). To promote model performance, we apply skip connections to concatenate the multiple representations for each node. Finally, a shared fully-connected layer is used to counter sparsity and the curse of dimensionality. In this way, we not only enrich the pre-trained embeddings with several aggregation layers but also allow controlling the aggregation level by changing L, thus promoting generalization performance regardless of graph complexity. Typically, we use the average node degree in the social recommendation graph as a reference for selecting L. Next we build our recommender system to learn the model parameters. With the representations of all users and items, we compute a simple inner product to estimate a user's preference for a target item. For the loss function, we choose the pairwise BPR loss [20] to optimize the model:

$$Loss = \sum_{(u,i,j) \in O} -\ln \sigma(\hat{y}_{ui} - \hat{y}_{uj}) + \lambda \|\Theta\|_2^2 \qquad (8)$$

where $O = \{(u,i,j) \mid (u,i) \in A^+, (u,j) \in A^-\}$ denotes the pairwise training data, $\hat{y}_{ui}$ denotes user u's preference for item i, $A^+$ is the set of connected pairs in the composite social recommendation graph, and $A^-$ indicates pairs without a connection, which are usually obtained by sampling. $\lambda \|\Theta\|_2^2$ is the L2 regularization term that controls the capacity of the model.
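A sketch of this pairwise objective for a single training triple (u, i, j) is shown below. The function name and the toy embeddings are hypothetical; the embedding lookup and the negative sampling of j are assumed to happen elsewhere:

```python
import numpy as np

def bpr_loss(e_u, e_i, e_j, params, lam=1e-4):
    """Pairwise BPR loss (Eq. (8)) for one triple (u, i, j):
    i is an observed (positive) item, j a sampled unobserved item."""
    y_ui = e_u @ e_i  # inner-product preference scores
    y_uj = e_u @ e_j
    sigma = 1.0 / (1.0 + np.exp(-(y_ui - y_uj)))
    reg = lam * sum(np.sum(p ** 2) for p in params)  # lambda * ||Theta||_2^2
    return -np.log(sigma) + reg

e_u = np.array([1.0, 0.0])
e_i = np.array([1.0, 0.0])  # positive item aligned with the user
e_j = np.array([0.0, 1.0])  # sampled negative item
loss = bpr_loss(e_u, e_i, e_j, params=[e_u, e_i, e_j])
```

Minimizing this loss pushes the score margin $\hat{y}_{ui} - \hat{y}_{uj}$ to be large, i.e., observed items are ranked above sampled unobserved ones.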

4 Experiment

In this section, we detail our experimental setup. We describe the experimental datasets in Sect. 4.1. Baselines and evaluation metrics are given in Sect. 4.2 and


Sect. 4.3, respectively. Training and parameter settings are in Sect. 4.4. Finally, we report our experimental results by comparing the overall performance and efficiency of the proposed model and the baseline models in Sect. 4.5.

4.1 Dataset

We conduct experiments on two real-world datasets, Last.fm and Gowalla, which are described in detail as follows.

– Last.fm [4]: contains music artist listening records of 2K users from the Last.fm online music system¹. The artists are viewed as items in this dataset. To ensure dataset quality, we use the 10-core setting, i.e., we only retain users and items with at least ten interactions.
– Gowalla [16]: a location-based social network dataset where users share their locations by checking in. In this dataset, we treat locations as items and predict user-location interactions. We use the version of the dataset published by Wang et al. [23] in our experiments.

4.2 Baselines

To demonstrate the effectiveness of the proposed model, we compare it with the following baseline methods.

– MF [20]: a matrix factorization method based on the implicit feedback of user-item interactions. The method is optimized with the Bayesian personalized ranking (BPR) loss, which can be viewed as a maximum posterior estimator derived from the Bayesian formulation of the problem.
– GC-MC [3]: a collaborative filtering method based on graph convolutional networks [14]. The method views the user-item interactions as a bipartite graph and uses a graph auto-encoder framework to learn the representations of users and items.
– HOP-Rec [26]: a unified method of factorization and graph models that captures high-order information within a user-item interaction matrix. The high-order information is obtained with random walks on the graph and is used to enrich the user-item interaction data.
– NGCF [23]: a graph-based collaborative filtering method that learns embeddings of users and items by leveraging high-order connectivities in the user-item interaction bipartite graph.

4.3 Evaluation Metrics

To evaluate the performance of the proposed model, we adopt precision@k, recall@k, and ndcg@k as evaluation metrics, which are detailed as follows:

¹ http://www.lastfm.com/.


Precision@k is the fraction of the top-k retrieved items that are relevant to the user's preference, i.e., items appearing in the test set of the user. It can be calculated by:

$$\mathrm{Precision@}k = \frac{d}{k}, \qquad (9)$$

where d is the number of relevant items among the top-k retrieved items. Recall@k is the fraction of the items relevant to the user's preference that are successfully retrieved in the top-k results, which can be calculated by:

$$\mathrm{Recall@}k = \frac{d}{n}, \qquad (10)$$
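The two metrics above follow directly from their definitions; a small sketch (hypothetical function names) for a single user:

```python
def precision_at_k(retrieved, relevant, k):
    """Eq. (9): fraction of the top-k retrieved items that are relevant."""
    top_k = retrieved[:k]
    d = sum(1 for item in top_k if item in relevant)
    return d / k

def recall_at_k(retrieved, relevant, k):
    """Eq. (10): fraction of the relevant items found in the top-k."""
    top_k = retrieved[:k]
    d = sum(1 for item in top_k if item in relevant)
    return d / len(relevant)

# Ranked list of item ids for one user vs. the user's test-set items.
retrieved = [5, 3, 9, 1]
relevant = {3, 1, 7}
p = precision_at_k(retrieved, relevant, k=4)  # 2 hits out of 4 -> 0.5
r = recall_at_k(retrieved, relevant, k=4)     # 2 hits out of 3 relevant
```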

where n is the total number of relevant items of the user. Ndcg@k is a widely used measure in retrieval performance evaluation. The main idea is that highly relevant items should appear earlier in the retrieved results, i.e., at lower ranks. It assigns each item a graded relevance and penalizes highly relevant items that appear later in the retrieved results. The range of ndcg@k is [0, 1], with higher values representing better performance. In our experiments we set k = 20 and report the average metrics over all users in the test set.

4.4 Parameter Settings

In our experiments, the number of hidden layers is set to 3, with 64 hidden units in each layer, and we fix the size of the user and item embeddings to 64 as well. The model is implemented using TensorFlow.

4.5 Experiment Results

We compare the performance of the different methods on the item retrieval task, whose goal is to retrieve the most relevant items for a given user. Specifically, given a user, we calculate the relevance scores of the items that do not appear in the user's training set and rank them accordingly. We then calculate the evaluation metrics described in Sect. 4.3.

Table 1. Overall performance (k = 20).

Method   | Last.fm                      | Gowalla
         | Precision  Recall   ndcg     | Precision  Recall   ndcg
MF       | 0.0492     0.2265   0.2598   | 0.3987     0.1291   0.1878
GC-MC    | 0.0531     0.2368   0.2577   | 0.0431     0.1395   0.1960
HOP-Rec  | 0.0587     0.2401   0.2601   | 0.0512     0.1399   0.2128
NGCF     | 0.0668     0.2457   0.2687   | 0.0478     0.1547   0.2237
Ours     | 0.0712     0.2497   0.2723   | 0.0489     0.1601   0.2294


The overall performance of the proposed model and the baselines is given in Table 1, from which we draw the following observations:

– Compared with MF, GC-MC, which treats the user-item interactions as a bipartite graph and aggregates the features of neighbors to learn embeddings of users and items, achieves better results. HOP-Rec and NGCF, which consider higher-order interactions between entities, also perform better. This indicates that the complex relations between users and items can be better captured by aggregating features of higher-order neighbors.
– The proposed model achieves the best performance on the Last.fm dataset and the best performance in terms of recall@k and ndcg@k on the Gowalla dataset. The results indicate the effectiveness of the proposed attentive layer: our model is not only able to capture high-order relations between entities, but also captures the relative importance of neighbors with the attention mechanism.

4.6 Parallel Efficiency Evaluation

To evaluate the parallel efficiency of the proposed model, we compare our model with our base model NGCF and with GraphRec [8], which also takes social relations into account. We use KaHIP as our graph partitioning method and run the experiment on one machine with 8 Nvidia RTX 2080 Ti GPUs. We set the parameters as in Sect. 4.4 and evaluate the speedup ratio on the datasets described in Sect. 4.1. The result is shown in Fig. 4.

Fig. 4. The speedup ratio with different numbers of GPUs used: (a) Last.fm, (b) Gowalla. Each panel plots the speedup ratio of SGA, Base, and GraphRec against the number of GPUs used (2 to 8).

On the Last.fm dataset, social edges are limited, so all three models perform similarly. However, when the number of social relations increases on the Gowalla dataset, the speedup ratio of GraphRec drops sharply. This shows that our SGA model achieves better performance than other GNN-based social recommendation models while maintaining parallel efficiency similar to that of models that do not consider social relations.

5 Conclusion and Future Work

In this paper, we proposed a social graph attentive aggregation (SGA) network for social recommendation. Our model combines the strength of NGCF in leveraging high-order collaborative information from the user-item bipartite graph with social networks that provide direct user-user interaction information, alleviating the data sparsity issue. Moreover, we utilized an attention mechanism to enable fine-grained modeling. The experimental results showed that our model is effective and well suited to parallel training for efficiency. For future work, we will take side information other than social networks into consideration. For example, if two items were put in the same shopping cart, we can assume that they are related and model this as an edge in user-item graphs to further alleviate the sparsity problem. Moreover, we will explore effective parallelization strategies to further boost the efficiency of our model.

Acknowledgement. This work is supported by the Key-Area Research and Development Plan of Guangdong Province #2020B010164003.

References 1. Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng. 6, 734–749 (2005) 2. Bakshy, E., Rosenn, I., Marlow, C., Adamic, L.: The role of social networks in information diﬀusion. In: Proceedings of the 21st International Conference on World Wide Web, pp. 519–528. ACM (2012) 3. Berg, R.V.D., Kipf, T.N., Welling, M.: Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263 (2017) 4. Cantador, I., Brusilovsky, P., Kuﬂik, T.: 2nd workshop on information heterogeneity and fusion in recommender systems (hetrec 2011). In: Proceedings of the 5th ACM Conference on Recommender Systems. RecSys 2011, ACM, New York, NY, USA (2011) 5. Deﬀerrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral ﬁltering. In: Advances in Neural Information Processing Systems, pp. 3844–3852 (2016) 6. Ebesu, T., Shen, B., Fang, Y.: Collaborative memory network for recommendation systems. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 515–524. ACM (2018) 7. Fan, W., Derr, T., Ma, Y., Wang, J., Tang, J., Li, Q.: Deep adversarial social recommendation. arXiv preprint arXiv:1905.13160 (2019) 8. Fan, W., et al.: Graph neural networks for social recommendation. In: The World Wide Web Conference, pp. 417–426. ACM (2019) 9. Guo, G., Zhang, J., Yorke-Smith, N.: TrustSVD: collaborative ﬁltering with both the explicit and implicit inﬂuence of user trust and of item ratings. In: TwentyNinth AAAI Conference on Artiﬁcial Intelligence (2015) 10. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. In: Advances in Neural Information Processing Systems, pp. 1024–1034 (2017)

Social Recommendation via Graph Attentive Aggregation

381

11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 12. He, X., Liao, L., Zhang, H., Nie, L., Hu, X., Chua, T.S.: Neural collaborative ﬁltering. In: Proceedings of the 26th International Conference on World Wide Web, pp. 173–182. International World Wide Web Conferences Steering Committee (2017) 13. Jiang, M., Cui, P., Wang, F., Zhu, W., Yang, S.: Scalable recommendation with social contextual information. IEEE Trans. Knowl. Data Eng. 26(11), 2789–2802 (2014) 14. Kipf, T.N., Welling, M.: Semi-supervised classiﬁcation with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016) 15. Kipf, T.N., Welling, M.: Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016) 16. Liang, D., Charlin, L., McInerney, J., Blei, D.M.: Modeling user exposure in recommendation. In: Proceedings of the 25th International Conference on World Wide Web, pp. 951–961. International World Wide Web Conferences Steering Committee (2016) 17. Ma, H., Zhou, D., Liu, C., Lyu, M.R., King, I.: Recommender systems with social regularization. In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 287–296. ACM (2011) 18. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectiﬁer nonlinearities improve neural network acoustic models. In: Proceedings ICML, vol. 30, p. 3 (2013) 19. Mnih, V., Heess, N., Graves, A., et al.: Recurrent models of visual attention. In: Advances in Neural Information Processing Systems, pp. 2204–2212 (2014) 20. Rendle, S., Freudenthaler, C., Gantner, Z., Schmidt-Thieme, L.: BPR: Bayesian personalized ranking from implicit feedback. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artiﬁcial Intelligence, pp. 452–461. AUAI Press (2009) 21. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017) 22. 
Veliˇckovi´c, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017) 23. Wang, X., He, X., Wang, M., Feng, F., Chua, T.S.: Neural graph collaborative ﬁltering. arXiv preprint arXiv:1905.08108 (2019) 24. Wang, X., Hoi, S.C., Ester, M., Bu, J., Chen, C.: Learning personalized preference of strong and weak ties for social recommendation. In: Proceedings of the 26th International Conference on World Wide Web, pp. 1601–1610. International World Wide Web Conferences Steering Committee (2017) 25. Wu, L., Sun, P., Hong, R., Fu, Y., Wang, X., Wang, M.: SocialGCN: an eﬃcient graph convolutional network based model for social recommendation. arXiv preprint arXiv:1811.02815 (2018) 26. Yang, J.H., Chen, C.M., Wang, C.J., Tsai, M.F.: HOP-rec: high-order proximity for implicit recommendation. In: Proceedings of the 12th ACM Conference on Recommender Systems, pp. 140–144. ACM (2018) 27. Zhao, T., McAuley, J., King, I.: Leveraging social connections to improve personalized ranking for collaborative ﬁltering. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pp. 261–270. ACM (2014)

382

Y. Liufu and H. Shen


MACSQ: Massively Accelerated DeepQ Learning on GPUs Using On-the-fly State Construction

Marcel Köster(B), Julian Groß, and Antonio Krüger

German Research Center for Artificial Intelligence (DFKI), Saarland Informatics Campus, Campus D3.2, 66123 Saarbrücken, Germany {marcel.koester,julian.gross,antonio.krueger}@dfki.de

Abstract. The current trend of using artificial neural networks to solve computationally intensive problems is omnipresent. In this scope, DeepQ learning is a common choice for agent-based problems. DeepQ combines the concept of Q-learning with (deep) neural networks to learn different Q-values/matrices based on environmental conditions. Unfortunately, DeepQ learning requires hundreds of thousands of iterations/Q-samples that must be generated and learned for large-scale problems. Gathering data sets for such challenging tasks is extremely time consuming and requires large data-storage containers. Consequently, a common solution is the automatic generation of input samples for agent-based DeepQ networks. However, a usual workflow is to create the samples separately from the training process, either in a (set of) pre-processing step(s) or interleaved with the training process. This requires the input Q-samples to be materialized in order to be fed into the training step of the attached neural network. In this paper, we propose a new GPU-focused method for on-the-fly generation of training samples tightly coupled with the training process itself. This allows us to skip the materialization of all samples (e.g. avoid dumping them to disk), as they are (re)constructed when needed. Our method significantly outperforms usual workflows that generate the input samples on the CPU in terms of runtime performance and memory/storage consumption.

Keywords: Massively-parallel processing · Neural networks · Q-learning · Graphics processing units · GPUs · State construction

1

Introduction

Neural networks and DeepQ learning are becoming more and more prominent [19]. Due to advancements in parallel GPU-based processing over the past years, applying DeepQ learning to large-scale problems has become feasible. However, a severe limitation is always the dataset processing in general. Either researchers have to deal with large binary datasets in data storages or they favor automatic sample generation. Although combinations of both approaches are also common choices, we focus on purely automatic generation of training samples in this paper. This work has been developed in the project APPaM (01IW20006), which is partly funded by the German ministry of education and research (BMBF). © Springer Nature Switzerland AG 2022 H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 383–395, 2022. https://doi.org/10.1007/978-3-030-96772-7_35

384

M. Köster et al.

In this context, we have to randomly generate a large number of states used for training. A state thereby contains all environment information in which the agent(s) live(s). It also includes the exact state of all agents in order to represent them as precisely as necessary for the overall problem-domain description. Given a set of generated states, a single Q-matrix is trained for each of them. After training these matrices, they act as desired outputs for an attached neural network. The inputs of this network are then given by the different states. This allows for learning computational rules to infer Q-based decisions from environmental conditions defined by the input states. Since these states must be generated prior to learning, a common choice is generation on the CPU side. This makes it convenient to model the state-generation code in an arbitrary programming language. It is often possible to use straightforward parallelization principles on the CPU side to improve the performance of the state-generation logic. Although this seems to be a perfect choice at first sight, large-scale problems require hundreds of thousands of states to achieve high learning accuracy. This often causes scalability issues on the CPU side and/or storage problems when saving all generated samples to a storage device for learning. In this paper, we propose a new high-level method and a set of GPU-driven algorithms to accelerate DeepQ learning. In particular, our approach enables automatic (re-)generation of states on GPU devices without any further CPU intervention. This helps to significantly outperform CPU-based sample generation on the one hand and to reduce the required memory consumption in already GPU-specialized learning pipelines on the other hand.
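The setup described above amounts to a supervised learning problem whose inputs are states and whose targets are the per-state Q-matrices. The following Python sketch is purely illustrative (all names, the random agent placement, and the zero-initialized Q-matrices are our assumptions, not the paper's implementation); it only shows the shape of the resulting training set:

```python
import random

def generate_state(num_stations, num_agents, rng):
    """Toy stand-in for a domain-specific state generator: randomly
    place each agent at one of the stations."""
    return tuple(rng.randrange(num_stations) for _ in range(num_agents))

def train_q_matrix(state, num_stations):
    """Placeholder for Q-learning on a single state; returns a
    num_stations x num_stations Q-matrix (here simply zeros)."""
    return [[0.0] * num_stations for _ in range(num_stations)]

def build_training_set(num_states, num_stations, num_agents, seed=42):
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_states):
        state = generate_state(num_stations, num_agents, rng)
        q = train_q_matrix(state, num_stations)
        dataset.append((state, q))  # network input: state, desired output: Q
    return dataset

dataset = build_training_set(num_states=4, num_stations=10, num_agents=4)
assert len(dataset) == 4
```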

2

Related Work

As outlined in the introduction, DeepQ learning is a state-of-the-art and often-chosen method. For this reason, it is also a well-researched topic in general, covered by hundreds of applications. Although it is widely applied, the usual way to train these networks is by generating sample input states on the CPU [19]. For example, Mnih et al. [18,19] evaluated different games using CPU-created samples. Also, papers reasoning about improving precision and convergence mainly take CPU-generated samples into account [3,6,22]. In contrast to these mainly CPU-driven methods in terms of state generation, the work by Liang et al. [14] takes GPU acceleration into account. In this paper, like in many others [17], CPU-evolved samples are passed to the GPUs for performing multiple training epochs [23]. To overcome runtime and memory limitations of these approaches, we generate samples on-the-fly on the GPUs, which also improves training performance. Recent work has shown significant performance improvements when using GPUs in the context of massively parallel simulations. A well-known example is the work by Groß et al. [4,5] accelerating parallel neighborhood lookups in large-scale 3D particle simulations (e.g. general [12] and fluid simulations [11]). However, GPU acceleration is not limited to particles in general. There have been great advancements in the domain of purely GPU-optimized simulation methods for arbitrary domains [7,9]. This makes GPUs more applicable to general-purpose simulations targeting many parallel states.

MACSQ

385

A prominent optimization technique to leverage the parallel performance of GPUs is the use of proper memory-access patterns [1,13,21]. This task becomes particularly challenging in our domain while processing multiple states in parallel. Previous work by Köster et al. [10] evaluates various possibilities to design suitable data-structure layouts in this context. We follow their advice and use the same techniques to realize all of our memory-access patterns. Most similar to our approach in terms of tracking states is the work by Köster et al. [8]. The authors target the setting in which it is often beneficial not to remember states by storing them but to efficiently reconstruct them when needed. Our new method is based on the same principle but with a different purpose, which requires major adjustments of this approach to be used in our domain. In terms of parallel learning, our method borrows architectural concepts from the one by Amin et al. [2]. In contrast to their approach, our algorithms focus on multiple network adjustments using many states per GPU. However, we also perform parallel feed-forward steps while adjusting the matrix and bias weights using parallel reductions.

3

MACSQ

Fig. 1. A single update step of our processing pipeline. First, we need a given domain description (step 1). Afterwards, we instantiate different states by iteratively sampling for valid solutions (green, step 2). The actual Q-matrices for each state are maintained in shared memory (blue, step 2), which are also iteratively built. Next, we feed the state descriptions and their Q-matrices into the same neural network in parallel, which we want to train (yellow, step 3). Finally, we perform a parallel reduction of all matrix weights and bias vectors (step 4) in order to realize the network updates (step 5). (Color figure online)

As outlined in the introduction, we focus on the automatic generation of states on the GPU. For this purpose, we leverage the high-level architectural design


of state reconstruction by Köster et al. [8] (see Sect. 2). The main idea in this scope is to avoid storing states in memory/on a storage device if they are not needed for the current operation. However, this implies that they have to be re-computed (reconstructed) later on in order to use them again. Figure 1 shows a single step of our approach while taking the nature of GPUs into account. We assume a given domain description (model) that can be imperatively executed in the context of multiple states on a GPU (see also Sect. 4). This description is then used to spawn multiple states that are created using a random-number generator (RNG). Thereby, the RNG is maintained and managed in the background without being tied to the domain description. This gives us the ability to reconstruct the same states by using previously stored RNG states, which will be recovered for reconstruction. In order to improve performance, we maintain all Q-matrices for each state in shared memory. This significantly reduces the number of expensive global-memory accesses, since Q-learning requires many updates to the Q-matrix values.
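The reconstruction principle can be illustrated with a short sketch (our own CPU-side illustration using Python's RNG; the class and method names are not from the paper): instead of materializing each generated state, only the RNG state from just before its construction is stored, and restoring that RNG state replays exactly the same construction.

```python
import random

class ReconstructibleStates:
    """Remember only the RNG state per constructed state, not the state
    itself; reconstruction reruns the generator from the saved RNG state."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.saved_rng_states = []

    def construct(self, size):
        # Snapshot the RNG state *before* generating, so the state can
        # be replayed later without storing its contents.
        self.saved_rng_states.append(self.rng.getstate())
        return [self.rng.random() for _ in range(size)]

    def reconstruct(self, index, size):
        rng = random.Random()
        rng.setstate(self.saved_rng_states[index])
        return [rng.random() for _ in range(size)]

gen = ReconstructibleStates(seed=1234)
first = gen.construct(8)
second = gen.construct(8)
# Identical states are recovered without ever having been stored:
assert gen.reconstruct(0, 8) == first
assert gen.reconstruct(1, 8) == second
```

The trade-off is exactly the one discussed in the evaluation: memory for the saved RNG states is tiny, but every reconstruction re-pays the generation cost.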

Fig. 2. Traditional processing approach (1): each state is processed by a single thread group on the GPU. Our concept inspired by [10] (2): each state is processed by a single warp. Since a thread group contains multiple warps in our case, we process multiple states per thread group.

To ensure scalability while keeping the overhead for the GPU warp schedulers as low as possible, we spawn many large thread groups covering as many threads as possible on each GPU device (see also Sect. 5). Within each group, we use a single warp per state, rather than using the whole group to process a single state (see Fig. 2). This approach has already been used successfully in previous work [10] to handle thousands of states efficiently in parallel. The concept is suitable for small-scale (in terms of a small number of agents and environment properties that must be tracked) as well as large-scale domain descriptions. In the case of small-scale descriptions, many parallel threads working on a single state can easily become idle. This causes loss of occupancy, and thus often significant performance bottlenecks. In contrast to this problem, large-scale domain descriptions would require many threads to improve the overall throughput. However, these domains usually require more samples in general, which implies more threads working on the different states at the same time. The method presented here ensures scalability in the context of small-scale and large-scale problems by using a compromise in the number of threads per state. Algorithm 1 shows our GPU-friendly state-initialization algorithm that is applied to each state. As mentioned above, we assign a single warp per state. Consequently, we have to compute a globally unique state index per warp first. Note


that the algorithm contains a divergent branch as one of its first instructions: if the currently computed warp-wide state index exceeds the current number of states, all threads in this warp leave the current group. This case occurs when the number of states is not divisible by the total number of warps in all thread groups. Note further that this is not a performance issue. If all threads in a warp leave the thread group, the warp dispatcher can activate another warp, which implicitly realizes the concept of thread compaction on a warp level [10].

Algorithm 1: High-Level State-Initialization Algorithm

    /* Compute the state index for each warp in each group */
 1  stateIdx := gridIdx · (groupDim / warpSize) + warpIdx;
 2  if stateIdx ≥ numStates then
 3      return;
 4  end
 5  random := LoadRNG(stateIdx);
 6  validInitialization := 0;
 7  while validInitialization ≠ 1 do
 8      initialized := DomainDescription.InitState(stateIdx, random);
 9      validInitialization := Warp.AllReduceAdd(initialized);
10  end
11  StoreRNG(stateIdx, random);

From an algorithmic point of view, we use an RNG-based iterative initialization approach: we use the current domain description to perform a parallel initialization using all threads of a single warp (referred to as lanes, lines 7–10). Each thread invocation returns a lane-local result indicating whether the initialization has been valid in terms of domain-specific constraints. After each initialization attempt, we perform a warp-wide reduction to verify that all lanes have returned a successful initialization result. This process is repeated until a single attempt has been successful. Note that the domain-description implementation needs to take care of initializing all state-dependent properties/agent states using the lanes of a single warp. Our main algorithm to compute all Q-matrices is presented in Algorithm 2. In analogy to Algorithm 1, we have to query and verify the current state index of each warp. Next, we allocate a sufficient amount of shared memory per thread group to store all Q-matrices for each state (each warp, line 5). Each warp computes its unique sub-view into shared memory in order to address the associated Q-matrix elements (line 6). Afterwards, each warp initializes its Q-matrix by either loading pre-trained Q-values or zeroing them (in the case of a new training process, line 7). The following lines of the algorithm use a GetFromFirstLane function. Its purpose is to execute the passed function invocation in the first lane of each warp only. All other lanes do not perform an operation while the function invocation is evaluated. Subsequently, all lanes participate in a divergence-free warp-shuffle operation in which each lane gets the value from the first lane. This efficient approach allows us to broadcast the single result value from the function invocation to all other threads in the warp.
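The per-warp index computation, the initialization retry loop, and the first-lane broadcast can be emulated sequentially; the sketch below is our own illustration (the success probability and the interpretation of the warp-wide reduction as "all lanes succeeded" are assumptions, not the paper's exact semantics):

```python
import random

WARP_SIZE = 32

def state_index(grid_idx, group_dim, warp_idx):
    """Globally unique state index per warp (cf. Algorithm 1, line 1):
    each group hosts group_dim / WARP_SIZE warps, one state per warp."""
    return grid_idx * (group_dim // WARP_SIZE) + warp_idx

def warp_initialize(rng, success_prob=0.9):
    """All lanes attempt an initialization in parallel; a warp-wide
    reduction checks whether every lane succeeded, otherwise the whole
    warp retries (cf. Algorithm 1, lines 7-10)."""
    attempts = 0
    while True:
        attempts += 1
        lane_results = [1 if rng.random() < success_prob else 0
                        for _ in range(WARP_SIZE)]
        if sum(lane_results) == WARP_SIZE:  # emulates Warp.AllReduceAdd
            return attempts

def get_from_first_lane(lane_results):
    """Emulates the GetFromFirstLane broadcast: only lane 0's result is
    kept and shuffled to all other lanes of the warp."""
    return [lane_results[0]] * len(lane_results)

# A group of 1024 threads contains 32 warps, i.e. 32 states per group
# (matching Table 2 for the GTX 1080 Ti):
assert state_index(grid_idx=1, group_dim=1024, warp_idx=0) == 32
assert warp_initialize(random.Random(7)) >= 1
assert get_from_first_lane([7, 1, 2, 3]) == [7, 7, 7, 7]
```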


Algorithm 2: Massively-Parallel Q-Determination Algorithm

    /* Compute the state index for each warp in each group */
 1  stateIdx := gridIdx · (groupDim / warpSize) + warpIdx;
 2  if stateIdx ≥ numStates then
 3      return;
 4  end
    /* Initialize the Q-matrix for all active warps in this group */
 5  sharedQ := SharedMemory[qDim.X · qDim.Y · (groupDim / warpSize)];
 6  qViewPerState := SubView(sharedQ, qDim.X · qDim.Y · warpIdx);
 7  LoadOrInitQView(stateIdx, qViewPerState);
 8  random := LoadRNG(stateIdx);
    /* Build or update the Q-matrix */
 9  numSourcePossibilities := GetFromFirstLane(
10      DomainDescription.GetNumSourcePossibilities(stateIdx, random));
    /* Perform the specified number of Q-tries */
11  for i := 1 to #Q-S do
        /* Determine the current source possibility for all threads in this warp */
12      source := GetFromFirstLane(
13          NextRandom(random, 0, numSourcePossibilities));
        /* Get a target possibility for this thread (if any) */
14      (hasTarget, target) := DomainDescription.TryGetTarget(
15          stateIdx, source, random);
        /* Determine the reward for this thread (if any) */
16      (hasReward, reward) := DomainDescription.TryGetReward(
17          stateIdx, source, hasTarget, target);
        /* Get the Q-matrix data */
18      currentQ := qViewPerState[source, target];
19      nextQ := qViewPerState[source, SelectQTarget(target)];
        /* Compute the updated Q-value using α and γ */
20      newQ := UpdateQ(reward, currentQ, nextQ);
        /* Wait for all threads and propagate changes */
21      Warp.Barrier;
        /* Update the Q-matrix after reading all data */
22      if hasReward then
23          qViewPerState[source, target] := newQ;
24      end
        /* Wait for all threads and propagate changes */
25      Warp.Barrier;
26  end
    /* Store the state of the current RNG */
27  StoreRNG(stateIdx, random);
    /* Export Q-matrix values to the neural network input */
28  ExportToNeuralNetworkOutput(stateIdx, qViewPerState);

The primary idea here is to perform (at least) a specified number of Q-samples per state (#Q-S, lines 11–26, see also Sect. 5). "At least" here refers to the fact that each lane in a warp gets the same Q-source value (lines 12–13)


for sampling in each iteration, which can result in a total of warpSize · #Q-S samples. Then, all lanes try to determine a valid Q-target value within the Q-dimensions according to the domain-specific constraints (lines 14–15). As this operation can fail for each target possibility, this function returns a tuple consisting of a success value hasTarget and the actual Q-target reference target (if any). A prerequisite at this point is that the domain-description logic has to ensure that different lanes will be assigned to different targets. Otherwise, this results in race conditions during Q-matrix updates later on. Although this might sound quite sophisticated to achieve in general, it turns out to be straightforward in most cases in practice based on our experience. A common use case is to select between different target values in a certain range. By subdividing this range into several sections based on the warp size, the different target-value intervals can be directly assigned to the different lanes. The remaining steps are to determine the reward (lines 16–17) and to perform the computation of the newQ value based on the current α and γ settings (lines 18–20). Before issuing any Q-matrix value updates, we have to wait for all lanes in the warp. This is important since the computation of the newQ values involves reading data from the current Q-matrix. Removing this barrier would lead to read-write race conditions. If a reward could be determined for the current lane in lines 16–17, the newQ value can be updated in shared memory. Note that we also need an additional barrier after the Q-matrix updates to avoid reading outdated information in the next iteration. Finally, we store the current (state-dependent) RNG state and export the Q-matrix for each state from shared memory to a location in global memory for training purposes.
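The paper does not spell out UpdateQ. A standard Q-learning update with learning rate α and discount factor γ, which we assume here purely for illustration, together with the lane-to-target-interval subdivision described above, could look like this:

```python
WARP_SIZE = 32

def update_q(reward, current_q, next_q, alpha=0.1, gamma=0.9):
    """Assumed standard Q-learning rule (not spelled out in the paper):
    Q_new = (1 - alpha) * Q_old + alpha * (reward + gamma * Q_next)."""
    return (1.0 - alpha) * current_q + alpha * (reward + gamma * next_q)

def lane_interval(lane, num_targets):
    """Subdivide the target range into WARP_SIZE disjoint sections so
    that different lanes can never collide on the same target."""
    return (lane * num_targets // WARP_SIZE,
            (lane + 1) * num_targets // WARP_SIZE)

q = update_q(reward=1.0, current_q=0.0, next_q=0.5, alpha=0.5, gamma=0.9)
assert abs(q - 0.725) < 1e-12

# The 32 intervals tile [0, num_targets) without overlap, so per-lane
# Q-matrix updates are race free:
intervals = [lane_interval(lane, 256) for lane in range(WARP_SIZE)]
assert intervals[0] == (0, 8) and intervals[31] == (248, 256)
```

The disjointness of the intervals is exactly what removes the need for atomic Q-matrix updates; only the two warp barriers remain necessary to order reads and writes.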

4

Implementation Details

We have used C# in combination with the ILGPU compiler1 to implement our system. ILGPU is used to compile parts of our application written in managed code to executable GPU code that can be run on our NVIDIA GPUs. Note that we perform all memory allocations prior to launching any GPU kernel in order to avoid unnecessary latencies and blocking operations. Furthermore, we completely avoid using floating-point-based atomic operations in order to have deterministic and reproducible results [20] in the context of reduction operations. However, given different group sizes targeting different GPUs [1,15,20,21], the results may still vary. This is not an issue in general, as a fixed group size using our implementation on a particular GPU architecture (e.g. NVIDIA Ampere [21]) always yields the same results. Furthermore, we use an xorshift-based random-number generator to compute new random numbers on-the-fly on the GPU [16].
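Marsaglia's xorshift generators [16] advance a small integer state with three shift/XOR steps, which makes them cheap to store, replay, and run per thread on a GPU. The sketch below shows the classic 32-bit 13/17/5 variant; the exact shift triple and the NextRandom-style range mapping are our assumptions, not necessarily the paper's configuration:

```python
def xorshift32(state):
    """One step of a 32-bit xorshift RNG (classic shift triple 13/17/5)."""
    state ^= (state << 13) & 0xFFFFFFFF
    state ^= state >> 17
    state ^= (state << 5) & 0xFFFFFFFF
    return state & 0xFFFFFFFF

def next_random(state, lo, hi):
    """Advance the RNG and map the raw value into [lo, hi); roughly what
    NextRandom in Algorithm 2 is assumed to do."""
    state = xorshift32(state)
    return state, lo + state % (hi - lo)

# The generator is fully determined by its tiny state, which is what
# enables state reconstruction: a stored state replays the same sequence.
s1, v1 = next_random(2463534242, 0, 100)
s2, v2 = next_random(2463534242, 0, 100)
assert (s1, v1) == (s2, v2)
assert 0 <= v1 < 100
```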

5

Evaluation

The whole evaluation section is based on a simple, yet challenging, agent-based simulation/optimization problem (see Fig. 3). It is built around an assignment problem from the field of manufacturing, which requires different agents to be assigned to different working stations. The agents can move between the stations

1 www.ilgpu.net.


by taking a pre-defined movement-time matrix into account. Thereby, the overall purpose is to assemble products that have to pass all stations in order to be completed. If a product reaches a station, a single work step needs to be performed on the product using an agent (if any). After completing a single work step, the product is passed to the next station (until it reaches the final station). Note that only a single agent can be assigned to a station at a time, although multiple agents might stand in front of the station. Moreover, only a single product is allowed to be on a station at any point in time.

Fig. 3. A sample production line with 10 stations (black lines), 5 products (purple) and 4 agents (in front of their stations, green). Agents can move freely between the stations. (Color figure online)
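The station/product/agent rules above can be condensed into a few lines. The following is our own toy model of one line update, not the paper's simulation (movement times and agent reassignment are deliberately ignored here): a product advances from station s to s+1 only if an agent is assigned to s and s+1 is free.

```python
def step(products, agents, num_stations):
    """One simplified production-line update: products move downstream
    first, so a freed station can immediately be filled from upstream."""
    occupied = set(products)        # at most one product per station
    agent_at = set(agents)          # at most one agent per station
    new_products = []
    for s in sorted(products, reverse=True):
        nxt = s + 1
        if s in agent_at and nxt < num_stations and nxt not in occupied:
            occupied.discard(s)
            occupied.add(nxt)
            new_products.append(nxt)
        else:
            new_products.append(s)
    return sorted(new_products)

# Agents at stations 0 and 1, products at 0 and 1: the product at 1
# moves to 2, after which the product at 0 can move to 1.
assert step([0, 1], [0, 1], num_stations=10) == [1, 2]
# A product at the final station stays; without an agent nothing moves.
assert step([9], [9], num_stations=10) == [9]
assert step([0], [], num_stations=10) == [0]
```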

In order to evaluate different computational workloads and simulate multiple use cases, we have to differentiate between scenarios and states. A scenario refers to a given number of stations and agents, whereas a single state lives within its parent scenario definition and contains an actual description of all product/station/agent states. Based on this differentiation, Fig. 3 shows a sample state within a scenario of 10 stations and 4 agents. Changing a scenario configuration also influences the size of the hidden layers used for implementing the assignment logic (for the sake of simplicity, we use a single hidden layer for all evaluation scenarios). Table 1 presents the evaluated scenario configurations, as well as their neural network settings. Note that we do not use any convolutional networks for these simple scenarios while taking common pitfalls into account [3]. These configurations have been selected because they refer to existing real-world use cases. Note that these configurations do not contain any products since product placement and agent assignment remain state dependent rather than scenario dependent.

Table 1. The used evaluation scenarios (1–3) with different station and agent setups. The Q-dimension is always equal to the squared number of stations in all cases. Note that the neural network configuration is chosen in a way that the input dimension (size of the input layer) is equal to the number of stations + agents. The output dimension (size of the output layer) is equal to the corresponding Q-dimension (as we learn whole Q-matrices) and the size of the hidden layer (like the number of samples) has been determined using an offline auto-tuning process. #Q-S refers to the number of samples to compute the Q-matrix in each state and #N-S refers to the total number of training states.

Scenario  #Stations  #Agents  Q-Dimension  #Q-S  Network         #N-S
1         10         4        10 × 10      20K   14 × 64 × 100   192K
2         12         4        12 × 12      28K   16 × 72 × 144   512K
3         16         6        16 × 16      50K   24 × 128 × 256  1792K


Table 2. Thread group configurations for the used GPUs, their number of states per thread group and the number of dispatched states in parallel. Note that this number is twice as large compared to the maximum number of parallel states per GPU to maximize occupancy.

                    GTX 1080 Ti           RTX 3090
Group size          1024                  768
States per group    32                    24
#Parallel states    28 × 2 × 32 = 1792    82 × 2 × 24 = 3936
#Dispatched states  1792 × 2 = 3584       3936 × 2 = 7872

We use two different GPUs from NVIDIA, a GTX 1080 Ti and an RTX 3090, and compare these results to a pure profiling-tuned C#-based sample-generation engine running on an AMD Ryzen 3950X. As discussed in Sect. 3, we process multiple states per GPU thread group. Table 2 depicts the used group sizes in order to achieve maximum occupancy on our evaluation GPUs. Note that the number of dispatched states in parallel is also referred to as the batch size (BS) in the remainder of this section. As presented in the introduction, a common approach is to generate all samples used for learning on the CPU prior to the actual training step. Table 3 shows runtime measurements for our three evaluation scenarios (see Table 1) using a purely CPU-based state-generation step. As expected, the runtime grows significantly with the complexity of the scenario. However, the runtime is primarily dominated by the number of samples #N-S and not by the required number of Q-learning samples #Q-S. This is due to the fact that the Q-matrices are maintained in the L1/L2 caches.

Table 3. Runtime in seconds for generating all samples (#N-S) on our evaluation CPU for learning purposes.

Scenario  #N-S   Ryzen 3950X (16 cores, 32 threads)
1         192K   48.75 s
2         512K   160.37 s
3         1792K  1,717.51 s

Using our purely GPU-based method results in considerable runtime improvements (see Table 4). Since we make extensive use of the L1 caches to maintain our Q-matrices (in shared memory), the overall runtime is primarily dominated by the number of training samples #N-S (similar to the CPU version). However, in this evaluation table we differentiate between two types A and B. In the first case (type A), we generate a single batch (achieving maximum occupancy on the device) only. Type B covers the case in which we have to generate all states on the GPU.


Table 4. Runtime measurements in seconds on the evaluation GPUs. Type A: generation of a single batch only (BS, see also Table 2). Type B: iterative generation of all states #N-S in GPU memory.

Scenario  Type  GTX 1080 Ti  σ      BS    RTX 3090  σ      BS
1         A     0.005 s      0      3584  0.004 s   0      7872
1         B     2.911 s      0.032  –     1.172 s   0.053  –
2         A     0.080 s      0      3584  0.007 s   0      7872
2         B     11.906 s     0.206  –     4.831 s   0.024  –
3         A     0.516 s      0.024  3584  0.349 s   0.015  7872
3         B     261.832 s    0.563  –     85.386 s  0.451  –

Comparing the runtime of our GPU-based method with the CPU implementation reveals speedups from 6.5× to 16.75× on the GTX 1080 Ti and from 20× to 41× on the RTX 3090. Note that the speedup decreases the more samples are generated at once in these simple evaluation scenarios. This is due to the fact that the maximum occupancy has already been reached using our computed batch sizes. Note that the speedup will not decrease any further since the parallel processing capabilities of our GPU devices beat our CPU by orders of magnitude. This is particularly helpful when dealing with larger scenarios and problem domains, yielding even higher speedup factors. If the actual network-training step is performed on the GPU, the CPU samples need to be copied to the GPU devices. Moreover, if all training samples do not fit into global GPU memory, we need to "page-in" and "page-out" subsets of them. This makes the CPU version even slower. Consider the total memory consumption of our states (including their Q-matrices) shown in Table 5. Since our new approach is also capable of reconstructing "old" (already seen) states, it is possible to limit the number of states that must be held in memory at any point in time. Limiting this number to be equal to the batch size allows us to reduce the memory consumption on our benchmarks by factors of 53× up to 500× (see Table 5). Although this is not required given our simple evaluation scenarios (as all samples fit into main memory), this still shows great improvement possibilities in large-scale applications.

Table 5. The memory consumption of a single state in bytes. The GPU columns present the total memory consumption in MB when processing a batch-size number of states in parallel. The right-most column (All states) depicts the memory consumption in MB when materializing all training states #N-S in memory. Note that a single entry in the Q-matrix is implemented using a 32-bit float.

Scenario  State size  GTX 1080 Ti  RTX 3090  All states
1         414 B       1.42 MB      3.11 MB   75.81 MB
2         592 B       2.02 MB      4.44 MB   289.06 MB
3         1046 B      3.57 MB      7.85 MB   1787.60 MB
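The quoted reduction factors follow directly from Table 5 by dividing the all-states footprint by the batch-sized footprint (GTX 1080 Ti column); a quick check on the published numbers:

```python
# Memory footprints in MB per scenario, taken from Table 5.
batch_mb      = {1: 1.42, 2: 2.02, 3: 3.57}       # one batch (GTX 1080 Ti)
all_states_mb = {1: 75.81, 2: 289.06, 3: 1787.60}  # all #N-S states

reduction = {s: all_states_mb[s] / batch_mb[s] for s in batch_mb}

# Matches the quoted factors of roughly 53x (scenario 1) up to 500x
# (scenario 3); scenario 2 lands in between:
assert 53 < reduction[1] < 54
assert 143 < reduction[2] < 144
assert 500 < reduction[3] < 501
```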


A common strategy is to use a certain number of samples per training epoch, which can be regenerated on-the-fly, as discussed above (see Table 6). However, this imposes an additional runtime overhead. On our benchmarks, the measured slowdown of regenerating samples (type A), rather than maintaining all of them in main memory (type B), lies between 4× and 5×. We do not believe that this is a severe limitation, as "paging-in" and "paging-out" states in large-scale applications will result in even larger overheads.

Table 6. Neural-network training setups using multiple epochs. A given number of randomly chosen samples (out of the set of all training samples #N-S) is used per epoch. Type A: using on-the-fly state reconstruction with the help of multiples of the batch size. Type B: generating all states on the GPU prior to the training step.

Scenario  Epochs  #Samples  Type  GTX 1080 Ti  RTX 3090
1         900     960       A     12.8 s       4.9 s
1         900     960       B     2.9 s        1.2 s
2         1000    2560      A     57.6 s       22.9 s
2         1000    2560      B     11.9 s       4.8 s
3         1500    6000      A     1,296.7 s    399.4 s
3         1500    6000      B     261.8 s      85.4 s

6

Conclusion

In this paper, we presented a new approach to on-the-fly sample generation and training for agent-based DeepQ networks. It is entirely GPU-based and does not require CPU interop, which makes it a great choice for asynchronous processing. The evaluation section describes the significant speedups and memory-size reductions of our method. Compared to CPU-based sample generation, our GPU-designed algorithms help to achieve runtime improvements of 6.5× (on an older GPU architecture) and up to 41× on a recent GPU device using our simple evaluation scenarios. Larger-scale real-world scenarios will yield substantially higher improvements. It is also possible to trade runtime performance against memory consumption. Accepting a slowdown of up to 5× on the one hand, we are able to reduce the memory consumption by up to 500× on the other hand. We argue to trade memory consumption for runtime performance, since large-scale applications require billions of samples that would otherwise have to be paged in and out of GPU memory. This causes even worse runtime slowdowns. Analyzing further scenarios in detail will reveal even more optimization potential. Hence, we would like to improve our method to take additional domain-dependent factors into account. Acknowledgment. The authors would like to thank Nurten Öksüz for her suggestions and feedback regarding our paper.


References

1. AMD: AMD Vega Instruction Set Architecture (2019)
2. Amin, M.A., Kashif, M., Umer, M., Rehman, A., Waheed, F., Rehman, H.U.: Parallel backpropagation neural network training techniques using graphics processing unit. Int. J. Adv. Comput. Sci. Appl. (2019)
3. Fu, J., Kumar, A., Soh, M., Levine, S.: Diagnosing bottlenecks in deep Q-learning algorithms (2019)
4. Groß, J., Köster, M., Krüger, A.: Fast and efficient nearest neighbor search for particle simulations. In: Proceedings of the Conference on Computer Graphics & Visual Computing (CGCV-2019). The Eurographics Association (2019)
5. Groß, J., Köster, M., Krüger, A.: CLAWS: computational load balancing for accelerated neighbor processing on GPUs using warp scheduling. In: Proceedings of the Conference on Computer Graphics and Visual Computing (CGCV-2020). The Eurographics Association (2020)
6. Hasselt, H.V., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. AAAI Press (2016)
7. Köster, M., Groß, J., Krüger, A.: FANG: fast and efficient successor-state generation for heuristic optimization on GPUs. In: Wen, S., Zomaya, A., Yang, L.T. (eds.) ICA3PP 2019. LNCS, vol. 11944, pp. 223–241. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-38991-8_15
8. Park, J.H., Shen, H., Sung, Y., Tian, H. (eds.): PDCAT 2018. CCIS, vol. 931. Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-5907-1
9. Köster, M., Groß, J., Krüger, A.: High-performance simulations on GPUs using adaptive time steps. In: Qiu, M. (ed.) ICA3PP 2020. LNCS, vol. 12452, pp. 369–385. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-60245-1_26
10. Köster, M., Groß, J., Krüger, A.: Massively parallel rule-based interpreter execution on GPUs using thread compaction. Int. J. Parallel Program. 48(4), 675–691 (2020)
11. Köster, M., Krüger, A.: Adaptive position-based fluids: improving performance of fluid simulations for real-time applications. Int. J. Comput. Graph. Animation (2016)
12. Köster, M., Krüger, A.: Screen space particle selection. In: Proceedings of the Conference on Computer Graphics and Visual Computing (CGCV-2018). The Eurographics Association (2018)
13. Köster, M., Leißa, R., Hack, S., Membarth, R., Slusallek, P.: Code refinement of stencil codes. Parallel Process. Lett. (PPL) 24, 1441003 (2014)
14. Liang, J., Makoviychuk, V., Handa, A., Chentanez, N., Macklin, M., Fox, D.: GPU-accelerated robotic simulation for distributed reinforcement learning (2018)
15. Lustig, D., Sahasrabuddhe, S., Giroux, O.: A formal analysis of the NVIDIA PTX memory consistency model. In: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (2019)
16. Marsaglia, G.: Xorshift RNGs. J. Stat. Softw. 8 (2003)
17. Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: Proceedings of the 33rd International Conference on Machine Learning. PMLR (2016)
18. Mnih, V., et al.: Playing Atari with deep reinforcement learning (2013)
19. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)
20. NVIDIA: Faster parallel reductions on Kepler (2014)
21. NVIDIA: CUDA C Programming Guide v11.5 (2021)
22. Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.): ECML 2005. LNCS (LNAI), vol. 3720. Springer, Heidelberg (2005). https://doi.org/10.1007/11564096
23. Stooke, A., Abbeel, P.: Accelerated methods for deep reinforcement learning (2019)

Model-Based Multi-agent Policy Optimization with Dynamic Dependence Modeling

Biyang Hu, Chao Yu(B), and Zifan Wu

School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou 510006, China
{huby25,wuzf5}@mail2.sysu.edu.cn, [emailprotected]

Abstract. This paper explores the combination of model-based methods and multi-agent reinforcement learning (MARL) for more efficient coordination among multiple agents. A decentralized model-based MARL method, Policy Optimization with Dynamic Dependence Modeling (POD2M), is proposed to dynamically determine the importance of other agents' information during the model building process. In POD2M, the agents adapt their mutual dependence while building their own dynamic models in order to make a trade-off between an individual-learning process and a coordinated-learning process. Once the dynamic models have been built, the policies are trained based on one-step model predictive rollouts. Empirical experiments on both cooperative and competitive scenarios indicate that our method achieves higher sample efficiency than the compared model-free MARL algorithms, and outperforms the centralized method in large domains.

Keywords: Multi-agent reinforcement learning · Model-based policy optimization · Dynamic dependence

1 Introduction

Reinforcement learning (RL) has made exciting progress in a variety of domains, such as Atari games [1], Go [2] and recently the Android system [3]. RL algorithms can be divided into two categories: model-based methods and model-free methods. Model-based methods build a predictive dynamic model of the true environment so that the agent can learn its policy from simulated samples to reduce the sample complexity [4]. In contrast, model-free methods learn policies directly from the experience data. While model-free methods have proved to be a general solution for learning complex tasks [5–8], these algorithms suffer from low sample efficiency. Especially in scenarios such as the medical and military fields, collecting enough experience data to train a model-free RL agent can be very difficult. In contrast, model-based methods can guarantee high sample efficiency of learning. However, the accuracy of model estimation

c Springer Nature Switzerland AG 2022
H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 396–411, 2022. https://doi.org/10.1007/978-3-030-96772-7_36

Model-Based Multi-agent POD2M

397

acts as an essential bottleneck to policy quality, generally resulting in inferior performance of model-based methods compared to their model-free counterparts. Recently, several studies have proposed model-based methods [9–11] that achieve higher sample efficiency and similar asymptotic performance compared to model-free RL methods in single-agent environments. In contrast to single-agent RL, multi-agent RL (MARL) has been extensively applied to various scenarios including multi-robot systems [12,13], real-time strategy games [14,15] and autonomous driving [16,17]. The main challenge of MARL is that an agent is required to interact with other agents, and the environment feedback depends on the joint actions of all the agents. The coexistence of other agents and the concurrent update of multiple agents' policies cause a non-stationarity issue from the perspective of each learning agent. This issue is further exacerbated in model-based MARL, where agents not only need to reason about other agents' behaviors in a dynamic environment, but also need to build a model that correctly captures the transitions of this environment. An intuitive solution [18] is to build a centralized dynamic model that approximates the transition process with the observations of all the agents as input. However, this kind of centralized method may lead to poor performance in complex problems due to the exponential increase of complexity in the number of agents. This paper focuses on how to learn a decentralized dynamic model for each agent to approximate the transition process with the information of others only when it is necessary. In multi-agent systems, the mutual dependence among the agents and the necessity of coordination can change dynamically over time.
For example, at a certain time step, the multi-agent system can be in a loosely coupled state [19], in which an agent depends only weakly on others for coordination, so that its own information is enough to build its local dynamic model. In order to account for the dynamic mutual dependence of agents when building their local dynamic models, we propose a novel model-based MARL method called Policy Optimization with Dynamic Dependence Modeling (POD2M), in which each agent's policy is optimized using simulated experiences from its local dynamic model, which dynamically incorporates other agents' information during model estimation. The main feature of our proposed method is to dynamically adapt the mutual dependence during the building of the local dynamic models, so that the agents can make a trade-off between an individual learning process and a coordinated learning process. Moreover, when considering the information of others in the coordinated process, the input dimension of our method increases only linearly with the number of agents, which addresses the exponential complexity issue of the centralized approach. We validate our method in both cooperative and competitive scenarios using the particle environment [20]. The results reveal that our method converges efficiently and achieves higher sample efficiency than model-free algorithms. The final asymptotic performance shows that our method achieves results comparable to the centralized model-based MARL method in small-scale domains and much better performance in larger domains.

398

B. Hu et al.

The rest of the paper is organized as follows. Section 2 discusses the related work, followed by a background introduction of RL and model-based learning in Sect. 3. Section 4 provides a detailed description of our method, and Sect. 5 reports experimental studies. Finally, Sect. 6 concludes the paper.

2 Related Work

Model-based RL has two main challenges: model building and model usage. For model building, the most common approaches [21,22] build either deterministic or probabilistic models, depending on whether the state transitions are deterministic in the specific application environment. For model usage, the agent's policy can be learned by exploiting the model's predicted experiences. The classic Dyna-Q algorithm [23] provides a model-based training framework that uses both model-predicted and environment-returned experiences. Shooting methods [24] use the model to predict the state transition process over a fixed horizon and compute the accumulated reward over the predicted steps to help select actions. Methods based on model-based value function expansion [25] and policy search with back-propagation through paths [26] integrate both model-free and model-based processes into policy optimization. Previous theoretical work [4] provided a monotonic improvement guarantee by enforcing a distance constraint between the learning policy and the data-collecting policy. On this foundation, subsequent work [27,28] derives a return discrepancy bound with branched rollouts and constructs a policy optimization framework based on the experiences generated by the dynamic model. Other algorithms learn the dynamic model in a latent space, such as Dreamer [10], which constructs a closed-loop training scheme and verifies that the learned model can predict transition states accurately over long rollout horizons. MuZero [11] extends model-based methods with Monte Carlo tree search and derives an end-to-end strategy to update its set of networks. In terms of MARL, the framework of centralized training with decentralized execution (CTDE) is commonly used as the basis of coordination among multiple agents.
Decentralized policies are learned in a centralized manner so that they can share information such as parameters without restriction during training. Algorithms based on CTDE [20] use a centralized value function that treats all the agents as a single one to solve the non-stationarity problem during training. Although CTDE algorithms can solve many multi-agent problems, they must search in the joint observation-action space, which grows exponentially with the number of agents. On this foundation, a method with an attention mechanism [29] has been applied to address credit-assignment challenges and further improves the performance of the CTDE framework. In addition to CTDE, another typical type of decentralized training algorithm [30] decomposes the centralized value function into individual value functions and guarantees a positive growth of total returns, but these methods are also constrained by the number of agents. Some other algorithms [31] utilize a reward shaping mechanism to promote coordination and distribute to each agent an intrinsic


reward representing its individual goal. This kind of reward shaping method requires the full state to train the intrinsic reward distributor, which is impossible in some application scenarios. Last but not least, role-based algorithms [32] assume that each agent performs a different role and that the action space can be segmented according to the role, which is not always feasible in multi-agent settings. For the model-based MARL problem, there is relatively limited work in the literature to our knowledge. A common solution is to build a centralized predictive dynamic model [18,33] to deal with the non-stationarity problem. The centralized model predicts the transition process considering all the agents, and each agent trains its policy based on the CTDE framework. Obviously, the centralized model encounters the dimension explosion problem as the number of agents grows. Some decentralized methods, e.g., [34], provide a general framework and return discrepancy bounds for model-based MARL. However, these methods require each agent to model its opponents or partners and precisely predict their actions, which may incur tremendous computational cost.

3 Preliminaries

In this section, we first introduce the MARL problem, and then the traditional methods of model-based RL, including model building and model training.

3.1 MARL

We consider the framework of Markov Games, a multi-agent extension of Markov Decision Processes (MDPs). $S$ is the state space of the game. $A^i$ is the action space of agent $i \in \{1, \dots, n\}$, and $A = \prod_{i=1}^{n} A^i$ is the joint action space. $R^i: S \times A \to \mathbb{R}$ is the reward function of agent $i$. In cooperative scenarios, each agent $i$ observes a reward $r = R(s, a)$ shared by all agents. $T: S \times A \to S$ defines the probability distribution over possible next states. $\gamma \in [0, 1]$ is the discount factor. At each time step, agent $i$ receives a partial observation $o^i$ that contains partial information from the global state. Agent $i$ uses its policy $\pi^i(a_t^i | o_t^i)$ to give the probability of taking action $a_t^i$ under observation $o_t^i$ at time step $t$. Each agent aims to find the optimal policy $\pi_*^i$ that maximizes its expected discounted return $\eta[\pi^i]$:

$$\pi_*^i = \arg\max_{\pi^i} \eta\left[\pi^i\right] = \mathbb{E}_{a_t^1 \sim \pi^1, \dots, a_t^n \sim \pi^n,\, s_t \sim T}\left[\sum_{t=0}^{\infty} \gamma^t r_t^i\left(s_t, a_t^1, \dots, a_t^n\right)\right].$$

Policy gradient methods [23] estimate the gradient of an agent's expected return with respect to the parameter $\theta$ of its policy $\pi_\theta$. The gradient of the objective function is given as follows:

$$\nabla_\theta J(\pi_\theta) = \nabla_\theta \log\left(\pi_\theta(a_t | s_t)\right) \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}(s_{t'}, a_{t'}). \tag{1}$$


The term $\sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}(s_{t'}, a_{t'})$ can lead to high variance. To this end, the Actor-Critic (AC) framework [35] uses a critic Q-function $Q_\phi(s_t, a_t) = \mathbb{E}\left[\sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}(s_{t'}, a_{t'})\right]$ to approximate the expected discounted return. The approximated Q-function with parameter $\phi$ is learned by minimizing the regression loss:

$$L_Q(\phi) = \mathbb{E}_{(s,a,r,s') \sim D}\left[\delta_\phi(s, a, s')^2\right],$$
$$\delta_\phi(s, a, s') = r(s, a) + \gamma \mathbb{E}_{a' \sim \pi(s')}\left[\bar{Q}_\phi(s', a')\right] - Q_\phi(s, a), \tag{2}$$

where $\delta_\phi$ is the TD-error, $\bar{Q}_\phi$ is the target Q-function that is updated at fixed intervals, and $D$ is the replay buffer storing past experiences. Once the critic is updated by minimizing the TD-error, the actor $\pi_\theta$ can be improved by maximizing the action-value function for actions produced by the policy.
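The TD-error of Eq. (2) and the resulting critic update can be sketched numerically with a tabular Q-function. The following is a minimal, illustrative sketch; the toy MDP, the function names, and the use of a single sampled next action are our own assumptions, not from the paper:

```python
import numpy as np

def td_error(Q, Q_target, s, a, r, s_next, a_next, gamma=0.95):
    """TD-error of Eq. (2): r + gamma * Q_target[s', a'] - Q[s, a]."""
    return r + gamma * Q_target[s_next, a_next] - Q[s, a]

def critic_step(Q, Q_target, batch, gamma=0.95, lr=0.5):
    """One stochastic step on the squared TD-error of Eq. (2)."""
    for (s, a, r, s_next, a_next) in batch:
        delta = td_error(Q, Q_target, s, a, r, s_next, a_next, gamma)
        Q[s, a] += lr * delta  # gradient of delta^2 w.r.t. Q[s, a] is -2*delta
    return Q

Q = np.zeros((2, 2))         # tabular critic for a toy 2-state, 2-action MDP
Q_target = np.zeros((2, 2))  # periodically updated target critic
batch = [(0, 1, 1.0, 1, 0),  # (s, a, r, s', a')
         (1, 0, 0.0, 0, 1)]
Q = critic_step(Q, Q_target, batch)
print(Q[0, 1])  # → 0.5, moved halfway toward the TD target r = 1.0
```

In a deep-RL implementation the table is replaced by a neural network and the update becomes a gradient step on the batch-averaged squared TD-error.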

3.2 Model-Based RL

Model-based RL learns a forward dynamic model to approximate the true transition function $S \times A \to S$ and reward function $S \times A \to \mathbb{R}$ of the environment. The dynamic model is trained on the true environment dataset $D_{env} = \{(s_t, a_t, s_{t+1}, r_t, d_t)\}_{t=0}^{N}$, where $r_t$ is the sampled reward and $d_t$ is the termination indicator denoting the end of the episode. There are two ways to build the learned dynamic model: deterministic and probabilistic. For deterministic models, the standard approach is to train the model to minimize the Mean Squared Error (MSE) between the predicted states and the true states:

$$L_{MSE} = \sum_{t=1}^{N} \left\| \hat{p}(s_t, a_t) - s_{t+1} \right\|_2^2, \tag{3}$$

where $\hat{p}(s_t, a_t)$ is the deterministic next state predicted by the dynamic model from the current state and action. For probabilistic models, a Gaussian model is commonly used to predict a distribution over next states, $\hat{s}_{t+1} \sim \mathcal{N}(\mu(s_t, a_t), \sigma(s_t, a_t))$, and to optimize the Negative Log Likelihood (NLL):

$$L_{NLL} = \sum_{t=1}^{N} \left[\mu(s_t, a_t) - s_{t+1}\right]^T \sigma^{-1}(s_t, a_t) \left[\mu(s_t, a_t) - s_{t+1}\right] + \log \det \sigma(s_t, a_t). \tag{4}$$

To account for uncertainty in model predictions, model-based RL methods usually use an ensemble of learned models [36] rather than a single model. Each model $\hat{p}^j$ in the ensemble is trained independently on its own copy of the dataset $D_{env}^j$. The final prediction of an ensemble of $M$ models is then given by:

$$\hat{s}_{t+1} = \frac{1}{M} \sum_{j=1}^{M} \hat{p}^j(s_t, a_t). \tag{5}$$


In the following sections, we denote the model ensemble for agent $i$ as $\hat{p}^i$ for simplicity.
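A minimal sketch of the deterministic ensemble described above, assuming linear models fitted by least squares on hypothetical data. The member count M = 8 matches the ensemble size used later in the paper's experiments; everything else (the toy transition, the bootstrap resampling of the per-member datasets) is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy environment: the next state is a linear function of (state, action) plus noise.
true_W = np.array([[0.9, 0.1]])          # s' = 0.9*s + 0.1*a
X = rng.normal(size=(256, 2))            # columns: (s_t, a_t)
y = X @ true_W.T + 0.01 * rng.normal(size=(256, 1))

M = 8  # ensemble size, as in the paper's experiments (Sect. 5.1)
ensemble = []
for j in range(M):
    # each member is trained on its own copy of the dataset (Sect. 3.2)
    idx = rng.integers(0, len(X), size=len(X))
    Wj, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)  # minimizes the MSE of Eq. (3)
    ensemble.append(Wj)

def predict(s, a):
    """Eq. (5): average the M member predictions."""
    x = np.array([s, a])
    return float(np.mean([x @ Wj for Wj in ensemble]))

print(predict(1.0, 0.0))  # close to the true coefficient 0.9
```

Replacing the linear least-squares fit with a neural network trained by gradient descent on Eq. (3) recovers the setting used in the paper.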

4 The POD2M Method

We propose a model-based MARL method named Policy Optimization with Dynamic Dependence Modeling (POD2M). POD2M has two key components: model-based policy optimization and dynamic dependence modeling among multiple agents. In POD2M, each agent learns a dynamic model and uses the data collected from model rollouts to learn its policy. The overall framework of the proposed POD2M method, including the structure of the critic network, the computation graph of the policy optimization and the prediction process of the dynamic model, is given in Fig. 1.

Fig. 1. The overall framework of our proposed method. The computation graph of the model-based TD-error is placed in the middle. The critic network, shown on the left, uses the attention mechanism to adapt the dynamic dependence on other agents. In order to optimize the policy, the dynamic model is used to derive the values of the target Q-function. Note that the attention module in the critic network is shared with the dynamic model during policy optimization, since both need to consider the information of other agents.

4.1 Model-Based Policy Optimization

Policy optimization with dynamic models requires an accurate critic Q-function $Q_{\phi^i}$ with parameter $\phi^i$ for each agent. Denoting the policy of agent $i$ as $\pi_{\theta^i}$ with parameter $\theta^i$, and the transition function of the true environment as $p(o_{t+1}^i | o_t^i, a_t^i)$, traditional TD-learning can be seen as the optimization problem:

$$\arg\min_{\phi^i}\; \mathbb{E}_{o_t^i \sim D,\; o_{t+1}^i \sim p(o_{t+1}^i | o_t^i, \pi_{\theta^i}(o_t^i))}\left[\delta_{\phi^i}(o_t^i, \pi_{\theta^i}(o_t^i), o_{t+1}^i)\right]. \tag{6}$$


Computing the gradient of the TD-error $\delta_{\phi^i}$ requires considering the effect of the action $a_t^i = \pi_{\theta^i}(o_t^i)$ on the transition to the subsequent state $o_{t+1}^i$, and this equals back-propagating through the true environment dynamics $p(o_{t+1}^i | o_t^i, a_t^i)$. In POD2M, agent $i$ learns a decentralized model $\hat{p}^i(\hat{o}_{t+1}^i | o_t^i, a_t^i)$ to approximate its own state transitions and uses the learned model to derive predicted rollouts. In Eq. (2), the approximate Q-function of the subsequent state $o_{t+1}^i$ is used to optimize the critic network. In order to incorporate the model estimates, we take advantage of the dynamic model to make one-step predictions and sample $\hat{o}_{t+1}^i$ from $\hat{p}^i(\cdot | o_t^i, \pi_{\theta^i}(o_t^i))$, leading to the following model-based TD-error:

$$\hat{\delta}_{\phi^i}(o_t^i, \pi_{\theta^i}(o_t^i), \hat{o}_{t+1}^i) = \left[r(o_t^i, a_t^i) + \gamma Q_{\phi^i}\left(\hat{o}_{t+1}^i, \pi_{\theta^i}(\hat{o}_{t+1}^i)\right) - Q_{\phi^i}(o_t^i, a_t^i)\right]^2. \tag{7}$$

In this one-step policy optimization method, the agents only need to learn the transition model, rather than a reward model or the opponents' policy models. In contrast, some commonly used model-based RL methods [27,28,37] require not only a predicted transition function but also a predicted reward function and the opponents' policies. These methods may introduce compounding bias into policy optimization, potentially resulting in poor performance and high variance.

4.2 Dynamic Dependence Modeling

There are many ways for an agent to take the information of other agents into consideration, such as communication [38], social influence [39], and opponent modeling [40]. Dynamically assigning importance weights to other agents enables each agent to selectively consider their information. We apply the attention mechanism [41] in our method for dynamic dependence modeling and thus efficient critic learning. Taking agent $i$'s observation $o^i$, action $a^i$ and the information of other agents $(o^{-i}, a^{-i})$ as input, the critic Q-function can be written as follows:

$$Q_{\phi^i}(o^i, a^i, o^{-i}, a^{-i}) = Q_{\phi^i}\left(e^i(o^i, a^i), x^i\right), \qquad x^i = \sum_{j \neq i} \alpha_j v^j, \tag{8}$$

where $e^i$ is a one-layer MLP embedding function, $x^i$ is the contribution from the other agents, $v^j$ is agent $j$'s value, and $\alpha_j$ is the attention weight of agent $j$. Since the attention mechanism requires the same embedding space for selectors, keys and values, the embedding function $e^i$ maps $(o^i, a^i)$ to the same dimension as $x^i$, i.e., the weighted sum of the other agents' values. The attention weight $\alpha_j$ is derived by comparing the embedding $e^j$ with $e^i$ and passing the similarity value between the two embeddings through a softmax, $\mathrm{softmax}\!\left(\frac{(W_q e^i)(W_k e^j)^T}{\sqrt{d_{W_k}}}\right)$ [41], where $W_q$ transforms $e^i$ into a "query" and $W_k$ transforms $e^j$ into a "key".
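The attention-weight computation described above can be sketched as follows for a single head. The dimensions, the random parameters, and the function names are hypothetical, not taken from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_contribution(e_i, e_others, Wq, Wk, V):
    """Compute x^i = sum_{j != i} alpha_j v^j for one attention head (Eq. (8)).

    e_i:      embedding of agent i's (o^i, a^i), shape (d,)
    e_others: embeddings e^j of the other agents, shape (n-1, d)
    """
    d_k = Wk.shape[0]
    query = Wq @ e_i                              # transform e^i into a "query"
    keys = e_others @ Wk.T                        # transform each e^j into a "key"
    values = e_others @ V.T                       # each agent j's value v^j
    alpha = softmax(keys @ query / np.sqrt(d_k))  # scaled dot-product weights
    return alpha @ values, alpha

rng = np.random.default_rng(1)
d = 4
Wq, Wk, V = (rng.normal(size=(d, d)) for _ in range(3))
x_i, alpha = attention_contribution(rng.normal(size=d),
                                    rng.normal(size=(2, d)), Wq, Wk, V)
print(alpha.sum())  # the weights form a distribution over the other agents
```

With multiple heads, each head keeps its own $(W_k, W_q, V)$ tuple and their outputs are combined, as the paper describes next.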


Multiple attention heads are used in our experiments, and each head maintains a separate tuple of parameters $(W_k, W_q, V)$. The vector $x^i$ is then constructed simply by concatenating the contributions from the others. The learning of the dynamic models utilizes the same attention component as the critic learning. In this way, each agent is able to selectively take other agents' information into account when predicting its own state transition. The dynamic model for agent $i$ can be written as:

$$\hat{o}_{t+1}^i = \hat{p}^i\left(\cdot \,\big|\, e^i(o_t^i, a_t^i), x_t^i\right). \tag{9}$$

The counterfactual advantage trick [42] defined below is employed to solve the credit assignment problem:

$$A^i = Q_{\phi^i}\left(o, (a^i, a^{-i})\right) - \mathbb{E}_{a' \sim A^i}\left[Q_{\phi^i}\left(o, (a', a^{-i})\right)\right], \tag{10}$$

where $o$ is the concatenated observations of all the agents, $a^{-i}$ is the joint action of all the agents except agent $i$, and $a'$ ranges over every possible action that agent $i$ can take. The gradient of the objective function in Eq. (1) is then given by $\nabla_{\theta^i} J(\pi_{\theta^i}) = \nabla_{\theta^i} \log \pi_{\theta^i}(a^i | o^i)\, A^i$.

Algorithm 1. Policy Optimization with Dynamic Dependence Modeling (POD2M)
Initialize: policy $\pi_{\theta^i}$, Q-function $Q_{\phi^i}$, dynamic model $\hat{p}^i$, target policy $\pi_{\bar{\theta}^i}$, target Q-function $Q_{\bar{\phi}^i}$, environment buffer $D_{env}$
1: for each episode do
2:   for $m$ trajectories do
3:     Collect transitions $(o^i, a^i, o'^i, r^i)$ acting according to the policy $\pi_{\theta^i}$
4:     $D_{env} \leftarrow D_{env} \cup \{(o^i, a^i, o'^i, r^i)\}$
5:   end for
6:   for model training steps do
7:     Train model $\hat{p}^i$ on $D_{env}$
8:   end for
9:   for policy optimization steps do
10:    Extract local information $(o^i, a^i, r^i) \sim D_{env}$
11:    Compute the encoding representation $e^i(o^i, a^i)$ and the weighted sum $x^i$ of the other agents
12:    $\hat{y} \leftarrow r(o^i, a^i) + \gamma Q_{\bar{\phi}^i}\left(e^i(\hat{o}^i, \pi_{\bar{\theta}^i}(\hat{o}^i)), \hat{x}^i\right)$, where $\hat{o}^i \sim \hat{p}^i(\cdot | o^i, a^i)$
13:    $\hat{\delta}_{\phi^i}(o^i, \pi_{\theta^i}(o^i), \hat{o}^i) \leftarrow \left(\hat{y} - Q_{\phi^i}(e^i(o^i, \pi_{\theta^i}(o^i)), x^i)\right)^2$
14:    $\phi^i \leftarrow \phi^i - \alpha_Q \nabla_{\phi^i} \hat{\delta}_{\phi^i}$
15:    $\bar{\phi}^i \leftarrow \tau \phi^i + (1 - \tau) \bar{\phi}^i$
16:    if $t \bmod d = 0$ then
17:      $\theta^i \leftarrow \theta^i + \alpha_\pi \nabla_{\theta^i} \log \pi_{\theta^i}(a^i | o^i)\, A^i(o^i, a^i)$
18:      $\bar{\theta}^i \leftarrow \tau \theta^i + (1 - \tau) \bar{\theta}^i$
19:    end if
20:  end for
21: end for
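The counterfactual advantage of Eq. (10) can be sketched for a discrete action space. The sketch below defaults to a uniform expectation over agent i's alternative actions, which is one reading of Eq. (10); COMA [42] itself weights the baseline by the agent's policy, so an optional policy argument is also provided. Names and the toy Q-values are illustrative:

```python
import numpy as np

def counterfactual_advantage(Q_row, a_i, policy_i=None):
    """Eq. (10): A^i = Q(o, (a^i, a^-i)) - E_{a'}[Q(o, (a', a^-i))].

    Q_row:    Q-values for every action a' of agent i, with the other agents'
              joint action a^-i held fixed (shape: (|A^i|,)).
    policy_i: optional action distribution for the baseline; uniform if omitted.
    """
    if policy_i is None:
        policy_i = np.full(len(Q_row), 1.0 / len(Q_row))
    baseline = policy_i @ Q_row          # counterfactual baseline over a'
    return Q_row[a_i] - baseline

Q_row = np.array([1.0, 3.0, 2.0])        # hypothetical Q-values over 3 actions
print(round(counterfactual_advantage(Q_row, a_i=1), 6))  # → 1.0
```

The advantage is positive only when the chosen action beats what agent i would have achieved on average, which is what makes it useful for credit assignment.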


The overall algorithm of our proposed POD2M method is presented in Algorithm 1. For simplicity, the algorithm is described from the perspective of agent $i$, and we use $o^i$ to denote the current local observation and $\hat{o}^i$ the model-predicted subsequent observation. During interaction with the true environment (lines 2 to 5), the sampled trajectories are stored for training the dynamic model and for policy optimization. The training loss of the dynamic model (lines 6 to 8) for agent $i$ can be written as:

$$L_{model} = \sum_{t=1}^{N} \left\| \hat{p}^i\left(e^i(o_t^i, a_t^i), x_t^i\right) - o_{t+1}^i \right\|_2^2. \tag{11}$$

The RL agent uses both true trajectories and predicted rollouts to update its critic and policy networks (lines 9 to 19).
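The prediction step of Eq. (9), with the model consuming the local embedding plus the attention-weighted contribution of the others, can be sketched as below. Note how the model input is the fixed-size pair (e^i, x^i), which is why the input dimension does not blow up with the number of agents; all weights, dimensions and function names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4  # shared embedding dimension of e^i and x^i

def embed(o, a, We):
    """One-layer embedding e^i(o^i, a^i) (cf. Eq. (8)); We is a toy weight."""
    return We @ np.concatenate([o, a])

def model_predict(e_i, x_i, Wm):
    """Eq. (9): the dynamic model takes the local embedding e^i plus the
    attention-weighted contribution x^i of the other agents."""
    return Wm @ np.concatenate([e_i, x_i])

obs_dim, act_dim = 3, 2
We = rng.normal(size=(d, obs_dim + act_dim))
Wm = rng.normal(size=(obs_dim, 2 * d))
e_i = embed(rng.normal(size=obs_dim), rng.normal(size=act_dim), We)
x_i = rng.normal(size=d)   # would come from the shared attention module
o_next = model_predict(e_i, x_i, Wm)
print(o_next.shape)        # a predicted local observation of dimension obs_dim
```

Training this model against the true next observations with the squared loss recovers Eq. (11).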

5 Experiments

5.1 Setup

We evaluate our method in the two-dimensional Multi-agent Particle Environment (MPE) [20] that consists of X agents and Y landmarks. MPE offers multiple environments, including cooperative scenarios (all agents maximize a shared return) and competitive scenarios (agents have conflicting aims). The agents have continuous observation spaces (containing information such as location and speed) and discrete action spaces (move up, down, left, right, and stay). Here we focus on three scenarios, i.e., Spread, Tag and Adversary, as introduced in Fig. 2. We first introduce a model-based policy optimization method for multi-agent systems with a centralized dynamic model, denoted as Policy Optimization with Centralized Modeling (POCM). We regard POCM as an essential baseline for our proposed POD2M. The main idea of POCM is to treat the multiple agents as a single agent: the multi-agent system has only one dynamic model to approximate the transition function of the true environment. Since this model serves all the agents and plays a centralized role in the system, it can be considered a centralized model and formulated as follows:

$$\left(\hat{o}_{t+1}^1, \dots, \hat{o}_{t+1}^n\right) = \hat{p}\left(o_t^1, a_t^1, \dots, o_t^n, a_t^n\right). \tag{12}$$

The centralized model for the multi-agent system is constructed by taking the local observations of all the agents as input and the concatenation of all the predicted local observations as output. In competitive scenarios, the single opponent agent uses the model-based deterministic policy gradient method [37] to update its policy. An ensemble of 8 neural networks of 3 hidden layers with 256 neurons is used for the dynamic models, which learn the transition between the current and next states for agent $i$ as $\hat{o}_{t+1}^i = o_t^i + \hat{p}(o_t^i, a_t^i)$. We employ multi-layer perceptrons for


the actor (3 layers, 64 neurons for each agent) and the critic (3 layers, 128 neurons for each agent). All the neural networks are trained with the Adam optimizer with a learning rate of 0.001 and a weight decay of 0.0001. As described in Sect. 4, we employ the attention component for both the critic networks and the dynamic models, with 4 attention heads and separate "query", "key" and "value" parameters. We keep the same input dimension between the critic networks and the attention component by utilizing the state embedding function $e^i(o_t^i)$ and the state-action embedding function $e^i(o_t^i, a_t^i)$, which encode the information for the attention component, each with 1 layer and 128 neurons. We employ a policy update delay $d$ of 2 and a soft-update ratio $\tau$ of 0.01 for the target networks, and a discount factor $\gamma$ of 0.95. Moreover, we employ categorical sampling for action selection and set gradient norm clipping to 10 by default for all the experiments.
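For reference, the hyperparameters stated above can be collected into a single configuration sketch. The dictionary layout and key names are ours; the values are those reported in this subsection:

```python
# Hyperparameters as stated in Sect. 5.1 (key names are our own):
CONFIG = {
    "model_ensemble_size": 8,     # ensemble of 8 dynamic-model networks
    "model_hidden_layers": 3,
    "model_hidden_units": 256,
    "actor_layers": 3,
    "actor_units": 64,
    "critic_layers": 3,
    "critic_units": 128,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "weight_decay": 1e-4,
    "attention_heads": 4,
    "embedding_layers": 1,
    "embedding_units": 128,
    "policy_update_delay_d": 2,
    "soft_update_tau": 0.01,
    "discount_gamma": 0.95,
    "grad_clip_norm": 10,
}
print(len(CONFIG))  # → 17
```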

Fig. 2. (a) Spread: a cooperative scenario including 3 agents and 3 landmarks; the agents should learn to reach the landmarks respectively while avoiding collisions and repeated overlaps. (b) Tag: a competitive scenario including 3 good agents (red), 1 opponent (green) and random obstacles (grey). The good agents learn to cooperate to pursue and capture their opponent, while the opponent agent, which moves faster, learns to avoid being caught. (c) Adversary: a competitive scenario including 2 good agents (blue), 1 opponent agent (red), 1 goal landmark (green) and 1 fake landmark (grey). The opponent agent can only observe the positions of the good agents and aims to find and occupy the goal landmark, while the 2 good agents learn to confuse their opponent and reach the goal landmarks respectively. (Color figure online)

5.2 Results

Cooperative Scenario. To make a full comparison between POD2M and model-free methods, Value-Decomposition Networks (VDN) [30], Multi-Actor-Attention-Critic (MAAC) [29] and Counterfactual Multi-Agent (COMA) [42] are implemented in the fully cooperative scenario, i.e., Spread, though they were originally applied to the SMAC [43] tasks. As shown in Fig. 3, in the Spread domain, the average reward reaches nearly −5.5 when all the agents are able to reach their respective landmarks. In contrast to the traditional model-free algorithms, the convergence of the model-based methods is much faster. This reveals


that POD2M can achieve higher sample eﬃciency than model-free MARL methods. The centralized method POCM can also achieve high learning performance in this domain due to its relatively small scale and thus accurate estimation of the centralized model.

Fig. 3. The reward curve of POD2M against traditional model-free MARL algorithms and POCM method in the cooperative scenario for 7500 episodes.

Competitive Scenarios. Tag is a competitive scenario where the three good agents learn their own policies and receive their rewards individually instead of a shared return. Hence, it is hard to use the rewards of the three good agents to represent the performance of a method in this scenario. Instead, we evaluate the coordinated behaviors of the three good agents by the learning curve of the opponent agent they pursue. The Adversary domain is also a competitive scenario in which the two good agents receive their rewards individually. Since the goals of the two agents are relatively aligned, we use the sum of their rewards to assess the learning results in this scenario. In these competitive scenarios, the dynamic models for the good agents are constructed by considering all the agents' information, including that of their opponent. The opponent agent uses its own information to build its dynamic model and utilizes the same policy optimization method as the good agents.


Fig. 4. (a) The reward curve of the opponent agent pursued by the good agents. The less reward the opponent agent receives, the better the learning performance the good agents have achieved. (b) The average return of the 2 cooperative agents. Higher rewards indicate better learning performance.

Note that in the Tag and Adversary scenarios, we only compare POD2M with POCM and Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [20]. The methods mentioned in Spread, such as COMA, VDN and MAAC, are designed for the fully cooperative SMAC tasks and thus not suitable for competitive scenarios. The POCM method constructs a centralized model for all the agents in the environment to approximate the transition function. In other words, the centralized model includes the information of both the good agents and the opponent agent. In Fig. 4 (a), we can see that the reward of the opponent agent shows an upward trend in the early stage, because the opponent agent learns to avoid being caught. After a few episodes, the good agents have learned coordinated behaviors to pursue and capture their opponent, so the reward of the opponent falls. However, the good agents using the POD2M method learn to capture the opponent agent more quickly than with the MADDPG and POCM methods. In Fig. 4 (b), POD2M still performs best among the three methods. It is a bit surprising to observe that, in this domain, the performance of POCM is rather poor, suggesting the limits of building a centralized model in competitive domains. Larger Scale Scenario. POD2M takes the mutual dependence on other agents into consideration through soft limits instead of taking all local observations and actions as inputs. We extend the POD2M method to a larger scale


domain to evaluate its scalability. We employ Spread with 6 cooperative agents and 6 landmarks, compared to the 3-agent scenario mentioned above. From Fig. 5, we can see that POD2M still maintains high sample efficiency and achieves steady asymptotic performance. The model-free algorithms combine the information of all the agents as the inputs of their critic Q-functions. Due to the exponential growth of this dimension, their expressive capacity decreases significantly, which makes policy learning difficult. Unlike in the small-scale domain in Fig. 3, where POCM performs similarly to POD2M, in this relatively larger domain POCM cannot converge to the same level as POD2M, since POCM uses the combination of local observations and actions to estimate the joint model and thus encounters the same scalability problem as the model-free methods.

Fig. 5. The larger-scale performance of the model-based and model-free algorithms

6 Conclusion

In this paper, we investigated model-based MARL problems and designed a method utilizing the dynamic dependence among agents and model-based policy optimization for more efficient model estimation and policy learning. In multi-agent systems, the agents need to dynamically adapt their dependence when building their own dynamic models in order to make a trade-off between the individual learning process and the coordinated learning process. We validated our method in both cooperative and competitive scenarios using the particle environment. The results reveal that our method converges efficiently and achieves higher sample efficiency than the model-free algorithms. The final asymptotic performance shows that our method achieves results comparable to the centralized model-based MARL method in small-scale domains


and much better performance in larger domains. In the future, we plan to provide theoretical analysis of our proposed method, and evaluate it in other more complex domains. Acknowledgement. This work is supported by the National Natural Science Foundation of China under Grant 62076259.

References 1. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015) 2. Silver, D., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016) 3. Toyama, D., et al.: Androidenv: a reinforcement learning platform for android. arXiv preprint arXiv:2105.13231 (2021) 4. Luo, Y., Xu, H., Li, Y., Tian, Y., Darrell, T., Ma, T.: Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858 (2018) 5. Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M.: Deterministic policy gradient algorithms. In International Conference on Machine Learning, pp. 387–395. PMLR (2014) 6. Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897. PMLR (2015) 7. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) 8. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: oﬀ-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International Conference on Machine Learning, pp. 1861–1870. PMLR (2018) 9. Moerland, T.M., Broekens, J., Jonker, C.M.: Model-based reinforcement learning: a survey. arXiv preprint arXiv:2006.16712 (2020) 10. Hafner, D., et al.: Learning latent dynamics for planning from pixels. In: International Conference on Machine Learning, pp. 2555–2565. PMLR (2019) 11. Schrittwieser, J., et al.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588(7839), 604–609 (2020) 12. Todorov, E., Erez, T., Tassa, Y.: Mujoco: a physics engine for model-based control. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE (2012) 13. 
Chao, Yu., Dong, Y., Li, Y., Chen, Y.: Distributed multi-agent deep reinforcement learning for cooperative multi-robot pursuit. J. Eng. 2020(13), 499–504 (2020) 14. Vinyals, O., et al.: Starcraft II: a new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782 (2017) 15. Wu, Z., Yu, C., Ye, D., Zhang, J., Piao, H., Zhuo, H.H.: Coordinated proximal policy optimization. arXiv preprint arXiv:2111.04051 (2021) 16. Wang, R.E., et al.: Model-based reinforcement learning for decentralized multiagent rendezvous. In Conference on Robot Learning (CoRL), pp. 711–725 (2020) 17. Chao, Yu., et al.: Distributed multiagent coordinated learning for autonomous driving in highways based on dynamic coordination graphs. IEEE Trans. Intell. Transp. Syst. 21(2), 735–748 (2019)


B. Hu et al.

18. Willemsen, D., Coppola, M., de Croon, G.C.H.E.: MAMBPO: sample-efficient multi-robot reinforcement learning using learned world models. arXiv preprint arXiv:2103.03662 (2021)
19. Yu, C., Zhang, M., Ren, F., Tan, G.: Multiagent learning of coordination in loosely coupled multiagent systems. IEEE Trans. Cybernet. 45(12), 2853–2867 (2015)
20. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., Mordatch, I.: Multi-agent actor-critic for mixed cooperative-competitive environments. In: Advances in Neural Information Processing Systems, pp. 6382–6393 (2017)
21. Nagabandi, A., Kahn, G., Fearing, R.S., Levine, S.: Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. IEEE (2018)
22. Chua, K., Calandra, R., McAllister, R., Levine, S.: Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In: Advances in Neural Information Processing Systems, pp. 4759–4770 (2018)
23. Sutton, R.S.: Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In: Machine Learning Proceedings 1990, pp. 216–224. Elsevier (1990)
24. Wang, T., Ba, J.: Exploring model-based planning with policy networks. arXiv preprint arXiv:1906.08649 (2019)
25. Feinberg, V., Wan, A., Stoica, I., Jordan, M.I., Gonzalez, J.E., Levine, S.: Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101 (2018)
26. Clavera, I., Fu, V., Abbeel, P.: Model-augmented actor-critic: backpropagating through paths. In: International Conference on Learning Representations (2020)
27. Janner, M., Fu, J., Zhang, M., Levine, S.: When to trust your model: model-based policy optimization. In: Advances in Neural Information Processing Systems, pp. 12498–12509 (2019)
28. Rajeswaran, A., Mordatch, I., Kumar, V.: A game theoretic framework for model based reinforcement learning. In: International Conference on Machine Learning, pp. 7953–7963. PMLR (2020)
29. Iqbal, S., Sha, F.: Actor-attention-critic for multi-agent reinforcement learning. In: International Conference on Machine Learning, pp. 2961–2970. PMLR (2019)
30. Sunehag, P., et al.: Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296 (2017)
31. Du, Y., Han, L., Fang, M., Liu, J., Dai, T., Tao, D.: LIIR: learning individual intrinsic reward in multi-agent reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 4403–4414 (2019)
32. Wang, T., Dong, H., Lesser, V., Zhang, C.: ROMA: multi-agent reinforcement learning with emergent roles. In: International Conference on Machine Learning, pp. 9876–9886 (2020)
33. Park, Y.J., Cho, Y.S., Kim, S.B.: Multi-agent reinforcement learning with approximate model learning for competitive games. PLoS ONE 14(9), e0222215 (2019)
34. Zhang, W., Wang, X., Shen, J., Zhou, M.: Model-based multi-agent policy optimization with adaptive opponent-wise rollouts. arXiv preprint arXiv:2105.03363 (2021)
35. Konda, V.R., Tsitsiklis, J.N.: Actor-critic algorithms. In: Advances in Neural Information Processing Systems, pp. 1008–1014 (2000)

Model-Based Multi-agent POD2M


36. Kurutach, T., Clavera, I., Duan, Y., Tamar, A., Abbeel, P.: Model-ensemble trust-region policy optimization. In: International Conference on Learning Representations (2018)
37. D'Oro, P., Jaśkowski, W.: How to learn a useful critic? Model-based action-gradient-estimator policy optimization. In: Advances in Neural Information Processing Systems, pp. 313–324 (2020)
38. Sukhbaatar, S., Fergus, R., et al.: Learning multiagent communication with backpropagation. In: Advances in Neural Information Processing Systems, pp. 2244–2252 (2016)
39. Wang, T., Wang, J., Wu, Y., Zhang, C.: Influence-based multi-agent exploration. In: International Conference on Learning Representations (2019)
40. He, H., Boyd-Graber, J., Kwok, K., Daumé III, H.: Opponent modeling in deep reinforcement learning. In: International Conference on Machine Learning, pp. 1804–1813. PMLR (2016)
41. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
42. Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., Whiteson, S.: Counterfactual multi-agent policy gradients. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, pp. 2974–2982 (2018)
43. Samvelyan, M., et al.: The StarCraft multi-agent challenge. In: International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pp. 2186–2188 (2019)

Multi-index Federated Aggregation Algorithm Based on Trusted Verification

Zhenshan Bao, Wei Bai, and Wenbo Zhang(B)

The Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
[emailprotected]

Abstract. Motivated by the modern phenomenon of distributed data collected at scale by edge devices, federated learning can exploit the large amounts of training data from diverse users for better representation and generalization. To improve flexibility and scalability, we propose a new federated optimization algorithm, the Multi-index federated aggregation algorithm based on trusted verification (TVFedmul). TVFedmul builds on the Fedavg algorithm and overcomes a series of problems caused by the original aggregation rule, which measures the aggregation weight of each client by the single index of data quantity. The improved aggregation algorithm is based on multi-index measurement, which reflects the overall capability of each client more comprehensively and allows the server to make a holistic judgment. Further, we introduce a hyperparameter α that can be tuned to set the relative importance of the indexes. Finally, extensive experiments verify the efficiency and effectiveness of the proposed algorithm.

Keywords: Federated learning · Aggregation algorithm · Distributed learning

1 Introduction With the growing prevalence of edge devices, designing communication-efficient techniques for learning using client data is an increasingly important area in distributed machine learning. AI-based solutions rely intrinsically on appropriate algorithms, but even more so on large training datasets [1]. Federated learning has emerged as an important paradigm in modern large-scale machine learning [2]. In federated learning, the training data remains distributed over a large number of clients [3]. Data is typically generated at different scenarios, which can lead to significant differences in the distribution of data across data partitions [4]. A federated learning system is often composed of servers and clients, with an architecture that is similar to parameter servers [5]. The main objective of federated learning is to fit a model to data generated from network devices without continuous transfer of the massive amount of collected data from edge of the network to back-end servers for processing [6, 7]. Federated averaging (Fedavg) [8] has emerged due to its simplicity and low communication cost. In each iteration, the algorithm selects a number of clients with a ratio © Springer Nature Switzerland AG 2022 H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 412–420, 2022. https://doi.org/10.1007/978-3-030-96772-7_37


of ρ and performs stochastic gradient descent on the loss function over the local private data. The key challenges for Fedavg are: 1) Fedavg's update rule, which refers only to data quantity, may cause clients to overstate the amount of their data so that their local models occupy a larger proportion in aggregation. 2) Fedavg increases the insecurity of the system. 3) In the training process, noisy data will degrade the model; conversely, if a dataset with a small amount of data is of good quality and more representative [9, 10], it still makes its own contribution to the model. 4) When the data is heterogeneous (non-iid), Fedavg may result in unstable and slow convergence. To address the above, in this study we propose a new algorithm, TVFedmul. The contributions of our work can be summarized as follows. 1) TVFedmul takes data quality, as well as data quantity, into account when measuring each client's contribution to the federated learning model. 2) TVFedmul increases the security and fairness of the federated system to a certain extent. 3) TVFedmul makes the federated system more flexible and scalable. 4) Customized federated learning is realized and the practicability of the algorithm is improved.
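To make the baseline concrete, one Fedavg round can be sketched in a few lines of numpy. This is an illustrative toy only: a least-squares objective stands in for a real network loss, and `local_sgd` and `fedavg_round` are our own names, not code from the paper.

```python
import numpy as np

def local_sgd(w, data, lr=0.1, epochs=1):
    """A client's local update: plain gradient steps on a toy
    least-squares loss 0.5 * ||X w - y||^2 / n (a stand-in for
    SGD on a real model)."""
    X, y = data
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def fedavg_round(w_global, clients, rho=0.5, seed=0):
    """One Fedavg round: sample a fraction rho of the clients and
    average their local models, weighted ONLY by data quantity
    n_k / n -- the single index that TVFedmul generalizes."""
    rng = np.random.default_rng(seed)
    k = max(1, int(rho * len(clients)))
    chosen = rng.choice(len(clients), size=k, replace=False)
    n = np.array([len(clients[i][1]) for i in chosen], dtype=float)
    updates = [local_sgd(w_global.copy(), clients[i]) for i in chosen]
    weights = n / n.sum()  # n_k / n, as in Fedavg
    return sum(wk * up for wk, up in zip(weights, updates))
```

Challenge 1) above is visible here: a client reporting an inflated n_k directly inflates its aggregation weight, which is what the verification-based quality index of TVFedmul is designed to counter.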

2 Related Work

Recently we have witnessed significant progress in developing novel methods that address different challenges in federated learning. Zhang et al. [11] proposed an asynchronous approach with "soft" averaging, which only considers the data center setting and does not consider datasets that are unbalanced and non-iid, properties that are essential to the federated learning setting. Chen et al. [12] proposed FedSA, a novel federated learning algorithm that accelerates convergence and resists performance degradation caused by non-iid data and staleness. Despite the attention on performance degradation with non-iid data in recent works [13], none of them provide theoretical guarantees. Zhou et al. [14] proposed methods that dynamically change learning rates, including learning rate decay and adaptive learning rates. Xie et al. [15] proposed an algorithm that uses a mixed hyperparameter to balance the robustness-efficiency trade-off. However, this method in general only evaluates equally sized local datasets, and thus fails to generalize to more practical situations where most real-world datasets differ in size. Fallah et al. [16] considered the heterogeneous case in federated learning and studied a personalized variant of the classic federated learning formulation, whose goal is to find a proper initialization model that can be quickly adapted to the local data of each user after the training phase. Li et al. [17] proposed q-FFL, a novel optimization objective inspired by fair resource allocation in wireless networks that encourages fairer accuracy distributions across devices in federated learning. However, none of these federated learning algorithms studied the effect of the quality of the private data owned by the clients.


3 TVFedmul

3.1 Weight Calculation

Data Quantity Proportion. We denote the data quantity ratio as Q1, which is fixed during each round of aggregation because the amount of data on each client is determined. Assume that there are k clients, and that each client i ∈ [1, k] has its own local private data D_i containing n_i data samples. The total amount of data across all clients is denoted n, with $n = \sum_{i=1}^{k} n_i$. The data quantity ratio Q1_i of client i is then calculated as in Eq. (1):

$$Q_1^i = \frac{n_i}{n}, \quad i \in \{1, 2, \cdots, k\} \tag{1}$$

Data Quality Proportion. We denote the data quality ratio as Q2. In federated learning, the update effect is the most intuitive reflection of data quality. Therefore, the TVFedmul algorithm introduces verification nodes to verify the model update effect of each client. A verification node can obtain the model update information of every client, so it should be an honest node with high comprehensive capability. In TVFedmul, the honesty and comprehensive ability of each client are measured by its performance on the public data set. The verification nodes of a round are selected from the λ clients with the highest model verification scores in the previous round. The clients selected as verification nodes for a round do not participate in the training of that round, but validate and score the updated models of the other clients on their local data sets. The verification nodes therefore change dynamically in each round, and so does the public data set, which increases the generalization ability of the model to a certain extent. To prevent clients with high-quality data from being repeatedly selected as verification nodes that do not participate in model updates, which would hurt the overall model iteration efficiency, the λ nodes in the even-numbered positions of the score ranking (ordered from high to low) are selected as the verification nodes.

Assume that there are k clients and m verification nodes, and let S_ij represent the test accuracy of the model update of the i-th client on the j-th verification node. The final score S_i is then calculated as:

$$S_i = \frac{1}{m}\sum_{j=1}^{m} S_{ij} \tag{2}$$

Let $S = \sum_{i=1}^{k} S_i$ denote the total score over all clients. The data quality ratio Q2_i is then:

$$Q_2^i = \frac{S_i}{S}, \quad i \in \{1, 2, \cdots, k\} \tag{3}$$
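Equations (1)-(3) amount to two normalizations. A minimal numpy sketch (the function names are ours; `scores` is assumed to be a k x m array of verification accuracies):

```python
import numpy as np

def quantity_ratios(n_samples):
    """Q1_i = n_i / n: each client's share of the total data (Eq. 1)."""
    n = np.asarray(n_samples, dtype=float)
    return n / n.sum()

def quality_ratios(scores):
    """Q2_i = S_i / S, where S_i is the mean test accuracy of client i's
    update over the m verification nodes (Eqs. 2-3).
    `scores` has shape (k clients, m verification nodes)."""
    S_i = np.asarray(scores, dtype=float).mean(axis=1)  # Eq. (2)
    return S_i / S_i.sum()                              # Eq. (3)
```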


3.2 Aggregation

The objective is $\min_{\omega \in \mathbb{R}^d} f(\omega)$, where f(ω) is defined as:

$$f(\omega) \stackrel{\mathrm{def}}{=} \frac{1}{n}\sum_{i=1}^{n} f_i(\omega) \tag{4}$$

$$f_i(\omega) = L(x_i, y_i; \omega) \tag{5}$$

where L(x_i, y_i; ω) is the loss of sample (x_i, y_i) predicted under the given parameter ω. Assume D_k is the data set owned by the k-th client and n_k is its size. The average loss over the samples of client k is:

$$F_k(\omega) = \frac{1}{n_k}\sum_{i \in P_k} f_i(\omega) \tag{6}$$

The gradient of client k in iteration t is $g_k = \nabla F_k(\omega_t^k)$ and the learning rate is η. The local update for this round is then:

$$\omega_{t+1}^k \leftarrow \omega_t^k - \eta \nabla F_k(\omega_t^k) \tag{7}$$

After each client completes the local update, the result is uploaded to the verification nodes, which in turn upload to the aggregation server; the server calculates the update weight of each client for the round and performs the aggregation. The aggregation weight Q_t^i of client i in round t is:

$$Q_t^i = \alpha\,\frac{n_i}{n} + (1-\alpha)\,\frac{1}{S}\cdot\frac{1}{m}\sum_{j=1}^{m} S_{ij} \tag{8}$$

Here, α is a hyperparameter that can be changed according to the specific federated learning task and adjusts the balance between the two influencing factors. The score of the local model on the public test set reflects the data quality of a client to some extent, so S_i/S is used as a reference factor together with n_i/n to determine the contribution of each client to the global model. Compared with the Fedavg algorithm, the combined metric makes the evaluation of the clients more rigorous and comprehensive, and is more conducive to a holistic judgment by the aggregation server. In addition, over the many iterations the local model of each client obtains a different proportion Q2_i in each round; consequently, the comprehensive weight of each client differs between rounds of model aggregation, and this variable weight truly reflects each client's contribution to the global model update. The global parameter after aggregation in round t is:

$$\omega_{t+1} \leftarrow \sum_{k} Q_t^k\, \omega_{t+1}^k \tag{9}$$

where $\omega_{t+1}^k$ comes from Eq. (7) and the sum runs over all clients k.


The total loss function of the round-t model is:

$$f_t(\omega) = \sum_{k} Q_t^k\, F_k(\omega) \tag{10}$$

where $F_k(\omega)$ comes from Eq. (6).
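Putting Eqs. (8) and (9) together, the server-side aggregation can be sketched as follows. This is illustrative numpy only: `tvfedmul_aggregate` is our own name, and client models are flat parameter vectors for simplicity.

```python
import numpy as np

def tvfedmul_aggregate(client_params, n_samples, scores, alpha=0.1):
    """Blend the data-quantity and data-quality shares into the
    aggregation weight of Eq. (8), then average the client models
    as in Eq. (9). Small alpha weights data quality more heavily."""
    n = np.asarray(n_samples, dtype=float)
    q1 = n / n.sum()                                      # n_i / n
    S_i = np.asarray(scores, dtype=float).mean(axis=1)    # Eq. (2)
    q2 = S_i / S_i.sum()                                  # S_i / S
    q = alpha * q1 + (1 - alpha) * q2                     # Eq. (8)
    return sum(qi * w for qi, w in zip(q, client_params)) # Eq. (9)
```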

4 System Model

Figure 1 shows the architecture of the TVFedmul algorithm. The algorithm consists of a distributed training stage, a model verification stage and a model aggregation stage.

Fig. 1. The architecture of TVFedmul

5 Experimental Validation

5.1 Datasets

In this section, we empirically evaluate the proposed algorithm in the iid and non-iid settings. The training set is partitioned onto n = 100 devices. We conduct experiments on the benchmark MNIST (http://yann.lecun.com/exdb/mnist/). For the non-iid setting, each client owns data from only a few label categories. First, MNIST is sorted by label from 0 to 9, and the images are then sliced so that all images within a slice carry the same label. The data is divided into 200 shards of 300 images each and distributed to 100 clients to simulate private data; each client ends up with one of two possible configurations: 600 images carrying a single label, or two shards of 300 images each carrying a different label. During federated training, clients do not share data with each other; they can only access the data assigned to them, covering at most two different labels, which closely simulates a non-iid data distribution.
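The label-sorted sharding described above can be sketched as follows (a hypothetical helper of our own; it assumes the per-class counts divide the shard size, as in the idealized description, so every shard is single-label and every client sees at most two classes):

```python
import numpy as np

def make_noniid_shards(labels, n_shards=200, shard_size=300, n_clients=100,
                       seed=0):
    """Sort sample indices by label, cut them into n_shards shards of
    shard_size, and deal n_shards // n_clients shards to each client,
    so a client sees at most two digit classes (the paper's non-iid
    split)."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels, kind="stable")
    shards = order[: n_shards * shard_size].reshape(n_shards, shard_size)
    deal = rng.permutation(n_shards).reshape(n_clients,
                                             n_shards // n_clients)
    return [np.concatenate([shards[s] for s in row]) for row in deal]
```

On real MNIST the class counts are not exact multiples of 300, so a shard can occasionally straddle a class boundary; the idealized split above matches the description in the text.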


5.2 Experiments

Non-IID. To examine the influence of the proportion of participating clients, the two algorithms were compared. For Fedavg, a total of 200 rounds of training were run with ρ set to 0.1, 0.3, 0.5 and 0.7, as shown in Fig. 2. For TVFedmul, a total of 260 rounds were run with α = 0.5 and ρ set to 0.1, 0.3, 0.5 and 0.7, as shown in Fig. 3. The results show that the more clients participate in training, the faster the model converges and the higher its accuracy. To examine the influence of the two factors in TVFedmul, different values of α were compared: the proportions of data quantity were 0.1, 0.3, 0.5, 0.7 and 0.9, with corresponding proportions of data quality of 0.9, 0.7, 0.5, 0.3 and 0.1. Training ran for 260 rounds; the results are shown in Fig. 4. For different α the convergence trend of the model is almost the same, but the convergence rate and final accuracy differ. Convergence is best at α = 0.1 and worst at α = 0.9; as α decreases, i.e., as data quality is weighted more heavily, the convergence of the model improves. To further verify the effectiveness, the two algorithms were compared under the same experimental conditions, as shown in Fig. 5, with 240 training rounds, ρ = 0.7 and α = 0.1. The improved algorithm converges faster, and its final accuracy reaches 94.59%, which is 2.53% higher than that of Fedavg (92.06%). Figure 6 shows the training loss.

Fig. 2. Fedavg-noniid-ρ

Fig. 3. TVFedmul-noniid-ρ

Fig. 5. TVFedmul & Fedavg-noniid

Fig. 4. TVFedmul-noniid-α


Fig. 6. The training loss of TVFedmul & Fedavg-noniid

IID. The same experiments are carried out in the iid setting.

Fig. 7. Fedavg-iid-ρ

Fig. 8. TVFedmul-iid-ρ

Fig. 9. TVFedmul-iid-α

Fig. 10. TVFedmul & Fedavg-iid

Fig. 11. The training loss of TVFedmul & Fedavg-iid

As shown in Fig. 7 and Fig. 8, different values of ρ (0.1, 0.5 and 0.9) are evaluated with the Fedavg and TVFedmul algorithms. The experimental results are consistent with the non-iid case: the convergence rate and training accuracy of the model improve as the number of participating clients increases.


As shown in Fig. 9, the convergence of the model with hyperparameter values of 0.1, 0.5 and 0.9 again confirms that data quality has the greater impact on the federated system, and that both factors should be considered in model aggregation. As shown in Fig. 10, TVFedmul is superior to Fedavg in the iid setting as well, and the training accuracy of the model is improved from 98.1% to 98.69%. Figure 11 shows the training loss.

6 Conclusion

In this work, we propose TVFedmul, which takes both data quantity and data quality into consideration, makes the calculation of the aggregation weight more rigorous and comprehensive, speeds up the convergence rate, and improves the accuracy of the global model. With the introduction of the data quality index, the comprehensive weight of each client is adjusted according to its actual training effect, which improves the flexibility of the system. In addition, multi-index aggregation to some extent raises the cost for malicious nodes and protects the fairness and security of the system. Finally, the introduction of the hyperparameter enables customized federated learning.

References

1. Warnat-Herresthal, S., Schultze, H., Shastry, K.L., et al.: Swarm learning for decentralized and confidential clinical machine learning. Nature 594(7862), 265–270 (2021)
2. Hamer, J., Mohri, M., Suresh, A.T.: FedBoost: communication-efficient algorithms for federated learning. In: International Conference on Machine Learning, pp. 3931–3941 (2020)
3. Karimireddy, S.P., Kale, S., Mohri, M., et al.: SCAFFOLD: stochastic controlled averaging for on-device federated learning. arXiv preprint (2019)
4. Hsieh, K., Phanishayee, A., Mutlu, O., et al.: The non-IID data quagmire of decentralized machine learning. In: International Conference on Machine Learning, pp. 4337–4348 (2020)
5. Reisizadeh, A., Mokhtari, A., Hassani, H., et al.: FedPAQ: a communication-efficient federated learning method with periodic averaging and quantization. In: International Conference on Artificial Intelligence and Statistics, vol. 108, pp. 2021–2030 (2020)
6. Lyu, L., Yu, J., Nandakumar, K., et al.: Towards fair and privacy-preserving federated deep models. IEEE Trans. Parallel Distrib. Syst. 31, 2524–2541 (2020)
7. Acar, D.A., Zhao, Y., Navarro, R.M., et al.: Federated learning based on dynamic regularization. In: International Conference on Learning Representations (2021)
8. McMahan, H.B., Moore, E., Ramage, D., et al.: Communication-efficient learning of deep networks from decentralized data. In: International Conference on Artificial Intelligence and Statistics, vol. 54, pp. 1273–1282 (2017)
9. Nishio, T., Yonetani, R.: Client selection for federated learning with heterogeneous resources in mobile edge. In: IEEE International Conference on Communications, pp. 1–7 (2019)
10. Li, L., Xu, W., Chen, T., et al.: RSA: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 1544–1551 (2019)
11. Zhang, S., Choromanska, A., LeCun, Y.: Deep learning with elastic averaging SGD. In: NIPS, vol. 28 (2015)
12. Chen, M., Mao, B., Ma, T.: A staleness-aware asynchronous federated learning algorithm with non-IID data. Future Gener. Comput. Syst. 120, 1–12 (2021)


13. Li, X., Huang, K., Yang, W., et al.: On the convergence of FedAvg on non-IID data. arXiv preprint (2020)
14. Dai, W., Zhou, Y., Dong, N., et al.: Toward understanding the impact of staleness in distributed machine learning. In: International Conference on Learning Representations (2019)
15. Xie, C., Koyejo, O., Gupta, I.: Asynchronous federated optimization. arXiv preprint (2019)
16. Fallah, A., Mokhtari, A., Ozdaglar, A.: Personalized federated learning: a meta-learning approach. arXiv preprint (2020)
17. Li, T., Sanjabi, M., Smith, V.: Fair resource allocation in federated learning. arXiv preprint (2020)

Few-Shot Generative Learning by Modeling Stereoscopic Priors

Yuehui Wang, Qing Wang, and Dongyu Zhang(B)

Sun Yat-sen University, Guangzhou, China
[emailprotected], [emailprotected]

Abstract. Few-shot image generation, which aims to generate images for a new category from only a few images, has attracted some research interest in recent years. However, existing few-shot generation methods focus only on 2D images, ignoring 3D information. In this work, we propose a few-shot generative network which leverages 3D priors to improve the diversity and quality of generated images. Inspired by classic graphics rendering pipelines, we unravel the image generation process into three factors: shape, viewpoint and texture. This disentangled representation enables us to make the most of both 3D and 2D information in few-shot generation. To be specific, by changing the viewpoint and extracting textures from different real images, we can generate various new images even in data-scarce settings. Extensive experiments show the effectiveness of our method.

Keywords: Computer vision · Few-shot image generation · Generative adversarial network · Data augmentation

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-030-96772-7_38.

© Springer Nature Switzerland AG 2022
H. Shen et al. (Eds.): PDCAT 2021, LNCS 13148, pp. 421–429, 2022. https://doi.org/10.1007/978-3-030-96772-7_38

1 Introduction

The challenge of learning a new concept from very few examples, often called few-shot learning or low-shot learning, is a long-standing problem. Some recent works [9,11] explore the ability of few-shot generation under specific circumstances. To be more concrete, [11] proposes a meta-learning based method for generating personalized talking-head images, and [9] presents a framework to learn a generative model from a single natural image. However, these works only exploit the information carried by 2D image datasets; we instead consider using 3D priors to guide image generation. In this paper, we explore image generation in few-shot settings while simultaneously accounting for 3D information: shape, viewpoint and texture. First, the shape of the objects in the generated images depends on the category of our 2D image dataset (e.g., car, chair and table). Second, by changing the viewpoint of the


camera in the process of rendering 3D priors, we can get a variety of 2.5D samples (e.g., depth images). After that, we extract the texture of an arbitrarily sampled image from the 2D image dataset. Finally, we recombine these three factors, with our novel generative model Few-shot Generative Network with 3D priors (FGN-3D), to generate new images.

Fig. 1. Qualitative results. When given a real 3D prior (with determined shape and viewpoint) and a texture image, our model successfully applies the texture to the prior and generates realistic images without mode collapse or mode confusion.

The few-shot learning ability of our proposed method is obtained through two stages: (a) meta-learning and (b) fine-tuning. Meta-learning is performed on base classes, for which a large training set of 3D collections and corresponding 2D real images is available. In the course of meta-learning, our system simulates few-shot learning tasks and learns to transform 2.5D samples (e.g., depth images) into realistic RGB images. After that, we fine-tune our models, with the high-capacity generator and discriminator pre-trained via meta-learning, on novel classes where the training data is scarce. The proposed network quickly learns to generate realistic images of novel classes, unseen during meta-learning, after a few training steps. Note that during the whole training process, the 3D priors and the 2D real images do not need to come from the same class, i.e., our model is class-agnostic. Figure 1 shows some qualitative results produced by our


model, where the desired texture is applied to the specified 3D prior, regardless of their classes. Summarizing the contributions of this paper, we:

– Propose a two-stage training model (FGN-3D) which introduces 3D priors into image generation in few-shot scenarios.
– Demonstrate that our model produces state-of-the-art results compared to extended baselines while retaining good generalization performance.

Fig. 2. Overview of the proposed FGN-3D model. To generate an image x̂, we first extract k depth and mask pairs from a 3D prior (from modeling in the meta-learning stage or sampling in the fine-tuning stage); after that we encode l augmented texture images into z_texture. Finally we recombine them and choose the one with the lowest feature matching loss as the output.

2 Method

2.1 Architecture and Notation

First we'd like to introduce the necessary notation. Let I denote the 2D RGB image space $\mathbb{R}^{H\times W\times 3}$, V the 3D prior space $\mathbb{R}^{V\times V\times V}$, and C = {0, ..., L} the discrete label space. Our training dataset S consists of 3D collections $\{v_i\}_{i=1}^{N}$ and real 2D RGB images $\{x_j\}_{j=1}^{M}$, i.e., $S = \{\{v_i\}_{i=1}^{N}, \{x_j\}_{j=1}^{M}\}$. Note that we use i and j to accentuate that there is no pair relationship between the 3D and 2D data. For few-shot learning, we separate the label space C into C_base, for which a large amount of training data is available, and C_novel, which is underrepresented.


Then we introduce the network architectures of the different modules in the framework. In the meta-learning stage of our approach, the proposed FGN-3D framework is split into two parts: (a) a 3D priors modeling part and (b) a 2D image generation part. Figure 2 shows an overview of the proposed FGN-3D framework. Specifically, for the 3D priors modeling part, two networks are trained:

– The 3D priors generator G_3D takes a latent code z_shape sampled from a normal distribution and a class label y ∈ C_base, and outputs a 3D instance v̂, i.e., v̂ = G_3D(z_shape, y).
– The 3D priors discriminator D_3D takes a 3D instance v and a class label y ∈ C_base, and outputs a single scalar r_3D, i.e., r_3D = D_3D(v, y), which indicates whether the input v is a real instance from class y.

For 2D image generation, three networks are trained:

– The texture embedder E maps a real image x into a vector z_texture, i.e., z_texture = E(Aug(x); φ). Here, Aug(·) represents data augmentation operations and φ denotes the model parameters. Note that E is designed to be class-agnostic to leverage all training data and increase the diversity of generated images.
– The image generator G_2D takes a depth image x_d and a texture latent code z_texture, and outputs a synthesized image x̂, i.e., x̂ = G_2D(x_d, z_texture; ψ). Here x_d is obtained by employing a fully differentiable projection function p with a specific viewpoint vp on a 3D prior v: x_d = p(v, vp), and ψ denotes model parameters learned in the meta-learning stage. In general, during meta-learning we aim to learn ψ such that G_2D maximizes the similarity between its outputs and the real image.
– The image discriminator D_2D takes a 2D image x and a class label y ∈ C_base, and outputs a single scalar r_2D, i.e., r_2D = D_2D(x, y; ϕ), which indicates whether the input x is a real image from class y.
For each training stage, we first train the two parts separately to ensure that G_3D is able to generate realistic 3D priors and that G_2D is able to generate corresponding RGB images given the depth map x_d. After that we train them jointly to improve the diversity and quality of the generated images.

2.2 Meta-Learning on Base Classes

3D Priors Modeling. We base our 3D priors generator G_3D and discriminator D_3D on the 3D-GAN architecture proposed by [10]. However, vanilla 3D-GAN suffers from mode collapse and an unstable training process when extended to the multi-class generation setting. To address these problems, the Wasserstein distance [2] and spectral normalization [6] are used. Besides, following the advice of [7], we feed the conditional information y into the discriminator by projection instead of concatenation. Specifically, the loss function for modeling 3D priors is:

$$\min_{G_{3D}} \max_{D_{3D}} \mathcal{L}_{3D} = \mathbb{E}_{v}[D_{3D}(v, y)] - \mathbb{E}_{z_{shape}}[D_{3D}(G_{3D}(z_{shape}, y), y)]. \tag{1}$$
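Eq. (1) is a Wasserstein-style critic objective: the critic's expected score on real 3D instances minus its score on generated ones. A toy numpy sketch of the scalar loss (our own function; the class label y is folded into the critic via projection and therefore omitted from the signature):

```python
import numpy as np

def wgan_3d_loss(d_real, d_fake):
    """Eq. (1): critic output on real 3D instances minus critic output
    on generated ones. D_3D maximizes this quantity while G_3D
    minimizes it."""
    return np.mean(d_real) - np.mean(d_fake)
```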

2D Image Generation. The 2D image generation part is trained by simulating episodes of K-shot learning. In each episode, we randomly sample a 3D instance v̂ from G_3D and a real image x from the training dataset. Then K depth images {x_d1, x_d2, ..., x_dK} are obtained by changing the viewpoint in the projection function p(v̂, vp). Additionally, we obtain K corresponding image masks {x_mask1, x_mask2, ..., x_maskK} with a simple threshold; these will later be used to regularize the synthesized image. To increase the diversity of the generated images, we produce L augmented real images {x_1, x_2, ..., x_L} = Aug(x) before feeding them into the texture embedder E. Here we use a CycleGAN-like [12] architecture with two generators and two discriminators: a forward (depth to real RGB) generator G_fw with discriminator D_fw, and a backward (real RGB to depth) generator G_bw with discriminator D_bw. We train these four networks jointly with adversarial losses and cycle-consistency losses. More formally, when training forward, the adversarial loss is given by:

L_fw = E_x[log(D_fw(x))] + E_{(x_d, {x_1, ..., x_l})}[log(1 − D_fw(x̂))],   (2)

where

x̂ = G_fw(x_d, E({x_1, ..., x_l})).   (3)

When training backward:

L_bw = E_{x_d}[log(D_bw(x_d))] + E_x[log(1 − D_bw(G_bw(x)))].   (4)
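Both adversarial terms share the standard log-loss form. As a shape-level sketch, they can be written in plain NumPy; the sigmoid discriminator below is a toy stand-in for D_fw (and, symmetrically, D_bw), not the paper's network:

```python
import numpy as np

rng = np.random.default_rng(1)

def toy_discriminator(batch, w):
    """Stand-in discriminator: a linear map followed by a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(batch @ w)))

def adversarial_loss(d_real, d_fake, eps=1e-8):
    """E[log D(real)] + E[log(1 - D(fake))], as in Eqs. (2) and (4)."""
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

w_fw = rng.normal(size=64)         # weights of the toy forward discriminator
x_real = rng.normal(size=(4, 64))  # real RGB images (flattened toys)
x_hat = rng.normal(size=(4, 64))   # images produced by G_fw from depth + texture

l_fw = adversarial_loss(toy_discriminator(x_real, w_fw),
                        toy_discriminator(x_hat, w_fw))
print(l_fw)  # the discriminator pushes this up, the generator pushes it down
```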

Cycle-consistency losses are also used to enforce the bijective relationship between the two domains in the forward and backward phases:

L_cyc_fw = E_x[||G_fw(G_bw(x)) − x||_1],   (5)

and

L_cyc_bw = E_{(x_d, {x_1, ..., x_l})}[||G_bw(x̂) − x_d||_1].   (6)
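The ℓ1 cycle terms reduce to a mean absolute difference between an image and its round-trip reconstruction; a minimal sketch with toy arrays standing in for images:

```python
import numpy as np

def l1_cycle_loss(original, reconstructed):
    """||reconstructed - original||_1, averaged over all pixels (cf. Eqs. 5-6)."""
    return np.mean(np.abs(reconstructed - original))

x = np.array([[0.2, 0.8], [0.5, 0.1]])             # a toy 'real RGB' batch
x_roundtrip = np.array([[0.25, 0.7], [0.5, 0.2]])  # G_fw(G_bw(x)) stand-in

print(l1_cycle_loss(x, x_roundtrip))  # 0.0625
```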

Additionally, a feature matching loss [4] is employed to ensure that our generated x̂ shares, in general, the same texture as the input real image x. Removing the last layer from D_fw, we construct a feature extractor D'_fw, which is then used to extract features from x̂ and {x_1, ..., x_l}:

L_FM = E_{(x̂, {x_1, ..., x_l})}[||D'_fw(x̂) − (1/L) Σ_{l=1}^{L} D'_fw(x_l)||_1].   (7)
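Equation (7) compares the discriminator features of the generated image against the average features of the L augmented real images. A NumPy sketch with a toy feature extractor in place of the truncated discriminator:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(32, 8))  # toy feature extractor standing in for D_fw minus its last layer

def features(img_flat):
    return np.tanh(img_flat @ W)

def feature_matching_loss(x_hat, x_augs):
    """||f(x_hat) - mean_l f(x_l)||_1, cf. Eq. (7)."""
    mean_real_feat = np.mean([features(x) for x in x_augs], axis=0)
    return np.sum(np.abs(features(x_hat) - mean_real_feat))

x_hat = rng.normal(size=32)                       # generated image (flattened toy)
x_augs = [rng.normal(size=32) for _ in range(4)]  # L = 4 augmented real images

print(feature_matching_loss(x_hat, x_augs))
```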


Y. Wang et al.

At this point, we write the full loss of the 2D image generation process as

L_2D = L_fw + L_bw + L_cyc_fw + L_cyc_bw + λ_fm L_FM,   (8)

where λ_fm is the weight of the feature matching loss.

Full Model. Our full objective in this stage is as follows:

min_{(G_3D, G_fw, G_bw)} max_{(D_fw, D_bw)} L_3D + L_2D.   (9)

2.3 Fine-Tuning on Novel Classes

Once meta-learning has finished, the forward generator G_fw is able to generate RGB images for a novel class, unseen during the meta-learning stage, conditioned on the depth images projected from the 3D priors. In this stage, the fine-tuning loss of image generation is:

L_2D^finetune = E[log(D_2D(x))] + E[log(1 − D_2D(x̂))],   (10)

where

x̂ = G_2D(p(v, vp), E({x_1, ..., x_l})).   (11)

The full objective in this stage is:

min_{G_2D} max_{D_2D} L_2D^finetune + λ_fm L_FM.   (12)

3 Experiment

3.1 Experimental Setting

Baselines. We compare our method against four popular GAN variants: DCGAN [8], LSGAN [5], WGAN-GP [2], and VON [13]. Since the vanilla baselines are class-specific, we extend them to support multi-class generation for a fair comparison. The extensions are as follows:

– 3D-free GAN variants: we simply extend them to conditional generation based on class labels, i.e., c-DCGAN, c-LSGAN, and c-WGAN-GP.
– extended-VON: we introduce the multi-class generation setting (a conditional 3D-GAN) and texture extraction ability (a texture encoder) into VON.

Note that, in the papers in which they were originally proposed, these baselines require much more training data than our method.
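The class-conditional extension of the 3D-free baselines amounts to conditioning both networks on the class label; the simplest scheme concatenates a one-hot label onto the generator's latent vector (and, analogously, onto the discriminator's input). The sketch below illustrates only this conditioning step, with our own naming rather than the baselines' code:

```python
import numpy as np

NUM_CLASSES = 5  # the five base classes

def condition_latent(z, class_idx, num_classes=NUM_CLASSES):
    """Concatenate a one-hot class label onto a latent vector z."""
    one_hot = np.zeros(num_classes)
    one_hot[class_idx] = 1.0
    return np.concatenate([z, one_hot])

z = np.random.default_rng(3).normal(size=100)
z_cond = condition_latent(z, class_idx=2)
print(z_cond.shape)  # (105,)
```

Concatenation is the simplest conditioning scheme; for the 3D discriminator itself, Sect. 2.2 uses projection conditioning [7] instead.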


Fig. 3. Quantitative comparison between meta-VON and our method with T = 20 on novel classes, where T represents the number of samples used for ﬁne-tuning.

Datasets.

– 3D collections: we use ShapeNet [1] models for 3D priors modeling. Specifically, we choose the five largest classes (car, chair, airplane, sofa, and rifle) as our base classes C_base. For each of them, we limit the number of CAD models to 500. The next five largest classes (table, lamp, vessel, bench, and speaker) are the novel classes C_novel, with at most 20 models for each.
– 2D images: there are 500 images for each class in C_base. The car and chair images are all crawled from Google; for the remaining three classes (airplane, sofa, and rifle), 250 images are from Google and 250 are renderings from the corresponding ShapeNet classes. As in the 3D collections, each class in C_novel holds at most 20 images.

Metrics. We calculate the Fréchet Inception Distance (FID) [3] to evaluate the distribution matching between generated and real images; lower FID values mean better image quality and diversity.
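FID is the Fréchet distance between two Gaussians fitted to Inception features of real and generated images: ||μ1 − μ2||² + Tr(Σ1 + Σ2 − 2(Σ1Σ2)^{1/2}). The sketch below shows the simplified diagonal-covariance case, where the trace term factors per dimension; actual FID uses full covariance matrices of Inception activations:

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """Frechet distance between two Gaussians with diagonal covariances:
    ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2))."""
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return mean_term + cov_term

# Toy 2-D feature statistics for 'real' and 'generated' distributions.
mu_real, var_real = np.array([0.0, 0.0]), np.array([1.0, 1.0])
mu_gen,  var_gen  = np.array([1.0, 0.0]), np.array([4.0, 1.0])

print(fid_diagonal(mu_real, var_real, mu_gen, var_gen))  # 1.0 + (1 + 4 - 4) = 2.0
```

Identical distributions give a distance of 0; mismatches in either the mean or the spread of the features increase it.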


Table 1. Quantitative comparisons with FID; smaller numbers are better. Here '–' represents severe mode collapse. Note that even on the base classes, where the other baselines use all the training data and we use only part of it, our model also shows state-of-the-art performance.

Methods\Classes   Car    Chair  Airplane  Sofa   Rifle  Table  Lamp   Vessel  Bench  Speaker  mFID ↓
c-DCGAN           245.0  186.5  153.2     258.8  201.5  –      –      –       –      –        209.0
c-LSGAN           235.4  137.3  175.6     224.6  177.6  –      –      –       –      –        190.1
c-WGAN-GP         174.1  110.9  143.1     217.6  156.9  –      –      –       –      –        160.5
extended-VON      58.8   89.8   81.3      96.1   58.9   219.7  240.5  223.3   281.3  266.6    161.6
FGN-3D (ours)     64.7   86.2   77.2      90.2   55.6   89.0   102.4  111.8   98.6   106.4    88.2

3.2 Main Results

We provide both quantitative and qualitative evaluations of the baselines and our model. Please refer to our supplementary material for more training details and additional results.

Qualitative Evaluation. Figure 1 demonstrates some images generated by the proposed model when given a 3D prior and a texture image (regardless of their classes). Note that our method applies the texture information well without mode collapse or mode confusion, which are often observed in the other baselines. Figure 3 shows more examples on novel classes with T = 20, where T represents the number of samples used in the fine-tuning stage. Note that both the diversity and the quality of the generated images are improved with our method.

Quantitative Evaluation. Table 1 reports the quantitative results of our model and all baselines on both base classes and novel classes. Averaged FID is reported, and our model (FGN-3D) outperforms all baselines on both base classes and novel classes, obtaining state-of-the-art results (Table 2).

Table 2. Analysis of the benefits of introducing the two-stage training strategy and making full use of 3D information for few-shot generation.

Methods\Classes  Table  Lamp   Vessel  Bench  Speaker
meta-VON         93.4   105.0  133.3   144.9  106.9
meta-FGN-3D      95.5   118.9  115.1   102.3  116.8
full-FGN-3D      89.0   102.4  111.8   98.6   106.4

4 Conclusion

In this paper, we propose a two-stage model based on GANs (FGN-3D), which introduces 3D priors into image generation in few-shot scenarios. Empirical evidence shows that, by fully utilizing 3D structure information, our model outperforms all extended baselines on novel classes with few samples (at most 20).


References

1. Chang, A.X., et al.: ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015)
2. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of Wasserstein GANs. In: NeurIPS (2017)
3. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: NeurIPS (2017)
4. Liu, M.Y., et al.: Few-shot unsupervised image-to-image translation. In: ICCV (2019)
5. Mao, X., Li, Q., Xie, H., Lau, R.Y., Wang, Z., Paul Smolley, S.: Least squares generative adversarial networks. In: ICCV (2017)
6. Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks. In: ICLR (2018)
7. Miyato, T., Koyama, M.: cGANs with projection discriminator. In: ICLR (2018)
8. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
9. Shaham, T.R., Dekel, T., Michaeli, T.: SinGAN: learning a generative model from a single natural image. In: ICCV (2019)
10. Wu, J., Zhang, C., Xue, T., Freeman, W.T., Tenenbaum, J.B.: Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In: NeurIPS (2016)
11. Zakharov, E., Shysheya, A., Burkov, E., Lempitsky, V.: Few-shot adversarial learning of realistic neural talking head models. In: ICCV (2019)
12. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV (2017)
13. Zhu, J.Y., et al.: Visual object networks: image generation with disentangled 3D representations. In: NeurIPS (2018)

Distributed Fair k-Center Clustering Problems with Outliers

Fan Yuan1, Luhong Diao1, Donglei Du2, and Lei Liu1(B)

1 Department of Operations Research and Information Engineering, Beijing University of Technology, Beijing 100124, People's Republic of China
[emailprotected], {diaoluhong,liuliu leilei}@bjut.edu.cn
2 Faculty of Management, University of New Brunswick, Fredericton, NB E3B 5A3, Canada
[emailprotected]

Abstract. Big data clustering is a fundamental problem with a vast number of applications. Due to the increasing size of data, interests in clustering problems in distributed computation models have increased. On t