Big data problems can be classified into three main categories: batch processing (Hadoop), stream processing (Apache Flink and Apache Heron), and iterative machine learning and graph problems (Apache Spark). Each of these problem types has different processing, communication, and storage requirements, so each system provides its own solution to these needs.
All of these systems use the dataflow programming model to perform distributed computations. With this model, a big data framework represents a computation as a generic graph in which the nodes perform computations and the edges represent the communication between them. The nodes of the graph can be executed on different machines in the cluster, depending on the requirements of the application.
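To make the dataflow model concrete, here is a minimal single-process sketch in Python. It is not Twister2 code: the `DataflowGraph` class, its methods, and the worker names are all hypothetical and exist only for illustration. Nodes hold computations, edges carry each node's output to its downstream nodes, and a simple round-robin rule stands in for the placement of nodes on cluster machines.

```python
from collections import deque


class DataflowGraph:
    """Toy dataflow graph (hypothetical, not a Twister2 API).

    Nodes hold computations; edges carry a node's output downstream.
    """

    def __init__(self):
        self.nodes = {}   # node name -> callable
        self.edges = {}   # node name -> list of downstream node names
        self.inputs = {}  # node name -> list of upstream node names

    def add_node(self, name, func):
        self.nodes[name] = func
        self.edges.setdefault(name, [])
        self.inputs.setdefault(name, [])

    def add_edge(self, src, dst):
        self.edges[src].append(dst)
        self.inputs[dst].append(src)

    def topological_order(self):
        # Kahn's algorithm: a node becomes ready once all its inputs are done.
        pending = {name: len(ups) for name, ups in self.inputs.items()}
        ready = deque(name for name, count in pending.items() if count == 0)
        order = []
        while ready:
            name = ready.popleft()
            order.append(name)
            for downstream in self.edges[name]:
                pending[downstream] -= 1
                if pending[downstream] == 0:
                    ready.append(downstream)
        if len(order) != len(self.nodes):
            raise ValueError("graph contains a cycle")
        return order

    def run(self):
        # Execute every node once, passing upstream results along the edges.
        results = {}
        for name in self.topological_order():
            args = [results[up] for up in self.inputs[name]]
            results[name] = self.nodes[name](*args)
        return results

    def place(self, workers):
        # Assumed placement policy: round-robin over the available workers.
        order = self.topological_order()
        return {name: workers[i % len(workers)] for i, name in enumerate(order)}


graph = DataflowGraph()
graph.add_node("source", lambda: [1, 2, 3, 4])
graph.add_node("square", lambda xs: [x * x for x in xs])
graph.add_node("total", lambda xs: sum(xs))
graph.add_edge("source", "square")
graph.add_edge("square", "total")

results = graph.run()
placement = graph.place(["worker-0", "worker-1"])
print(results["total"], placement)
```

In a real framework the edges would be network communication operations and the placement would be decided by a scheduler; the sketch only shows how a computation decomposes into nodes and edges.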
We identify four key tasks in big data systems:
Job submission (acquiring compute resources and managing a job)
Parallel communication between processes
Executing user logic with threads and processes
Managing data, both static and intermediate
An independent component can be developed for each of these tasks. However, current systems provide tightly coupled solutions to these tasks, with the exception of resource scheduling.
Twister2 [1-3] takes a loosely coupled, component-based approach to big data. Each of the four essential abstractions has several implementations to support different kinds of applications, which gives Twister2 a pluggable architecture. It can be used to solve all three types of big data problems mentioned above.
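The idea of one independent, swappable component per task can be sketched as follows. This is an illustration in Python under stated assumptions, not the Twister2 API: every class and function name below (`ResourceScheduler`, `Communicator`, `Executor`, `DataStore`, `run_job`, and the concrete implementations) is hypothetical. Each of the four tasks gets its own small interface, and a job is assembled from whichever implementations suit the application.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


# One hypothetical interface per task identified above.
class ResourceScheduler(ABC):
    @abstractmethod
    def acquire(self, num_workers):
        """Return identifiers for the acquired workers."""


class Communicator(ABC):
    @abstractmethod
    def reduce(self, partials):
        """Combine the partial results produced by the workers."""


class Executor(ABC):
    @abstractmethod
    def run(self, func, partitions):
        """Apply the user logic to every partition."""


class DataStore(ABC):
    @abstractmethod
    def partitions(self, num_parts):
        """Split the managed data into num_parts pieces."""


# Simple local implementations, for illustration only.
class LocalScheduler(ResourceScheduler):
    def acquire(self, num_workers):
        return [f"local-{i}" for i in range(num_workers)]


class SumCommunicator(Communicator):
    def reduce(self, partials):
        return sum(partials)


class SequentialExecutor(Executor):
    def run(self, func, partitions):
        return [func(part) for part in partitions]


class ThreadExecutor(Executor):
    def run(self, func, partitions):
        with ThreadPoolExecutor() as pool:
            return list(pool.map(func, partitions))


class InMemoryStore(DataStore):
    def __init__(self, data):
        self.data = list(data)

    def partitions(self, num_parts):
        # Strided split: element i goes to partition i % num_parts.
        return [self.data[i::num_parts] for i in range(num_parts)]


def run_job(scheduler, store, executor, communicator, func, num_workers):
    """Wire the four components together; none depends on another's internals."""
    workers = scheduler.acquire(num_workers)
    parts = store.partitions(len(workers))
    partials = executor.run(func, parts)
    return communicator.reduce(partials)


store = InMemoryStore(range(10))
sequential = run_job(LocalScheduler(), store, SequentialExecutor(),
                     SumCommunicator(), sum, num_workers=3)
threaded = run_job(LocalScheduler(), store, ThreadExecutor(),
                   SumCommunicator(), sum, num_workers=3)
print(sequential, threaded)
```

Swapping `SequentialExecutor` for `ThreadExecutor` changes how the user logic runs without touching the scheduler, data, or communication components, which is the property a loosely coupled design aims for.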
Twister2 provides a hosting environment for data analytics that supports different workloads, including streaming, data pipelines, and iterative computations. Unlike many other big data systems, which are designed around their user APIs, Twister2 is built from the bottom up to support different APIs and workloads. Our vision for Twister2 is a complete computing environment for data analytics.
In this tutorial, we review big data problems and systems, explain the Twister2 architecture and features, and provide examples of developing and running applications on Twister2. By learning Twister2, big data developers will gain experience with a flexible solution that can be used to solve all three types of big data problems.
Twister2 is an ongoing open source project at Indiana University. It started in the 4th quarter of 2017.
GitHub - https://github.com/DSC-SPIDAL/twister2
Documentation - https://twister2.gitbook.io/twister2
User List - firstname.lastname@example.org
Slack Channel - dsc-twister.slack.com
By taking this tutorial you will:
Get an introduction to big data systems and Apache big data solutions
Learn Twister2 concepts
Learn how job submission is done in Twister2 and related resource schedulers
Learn how parallel communication happens in Twister2
Learn the task system of Twister2
Learn how to develop big data solutions
This tutorial consists of several presentations and hands-on examples. We have prepared a Docker image so that you can easily run the examples on your laptop, as well as a cluster deployment of Twister2.
This tutorial website consists of the following sections.
Here are the introductory slides we will go through in this tutorial.
Explains Twister2 Job Submission and Resource abstractions
Explains the parallel communication model
Explains the task layer and API of Twister2
Explains the background
Here are the slides for the hands-on section; an online description is also available at https://twister2.gitbook.io/twister2/tutorial/developing
[1] Supun Kamburugamuve, Kannan Govindarajan, Pulasthi Wickramasinghe, Vibhatha Abeykoon, Geoffrey Fox, "Twister2: Design of a Big Data Toolkit", EXAMPI 2017 workshop at the SC17 conference, November 12, 2017, Denver, CO.
[2] Supun Kamburugamuve, Pulasthi Wickramasinghe, Kannan Govindarajan, Ahmet Uyar, Gurhan Gunduz, Vibhatha Abeykoon, Geoffrey Fox, "Twister:Net - Communication Library for Big Data Processing in HPC and Cloud Environments", Proceedings of the Cloud 2018 Conference, July 2-7, 2018, San Francisco.
[3] Kannan Govindarajan, Supun Kamburugamuve, Pulasthi Wickramasinghe, Vibhatha Abeykoon, Geoffrey Fox, "Task Scheduling in Big Data - Review, Research Challenges, and Prospects", Proceedings of the 9th International Conference on Advanced Computing (ICoAC), December 14-16, 2017, India.
This work was partially supported by NSF CIF21 DIBBS 1443054 and the Indiana University Precision Health initiative. We thank Intel for their support of the Juliet and Victor systems and extend our gratitude to the FutureSystems team for their support with the infrastructure.