Testing, Integration, Development of SPARK for MIDAS

MIDAS (Middleware for Data-Intensive Analysis and Science) provides resource management, coordination, and communication; addresses heterogeneity at the infrastructure level; and offers flexible compute-data coupling.

Spark is an open-source cluster-computing framework in the Apache ecosystem. In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's multistage in-memory primitives offer performance up to 100x faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms.
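The advantage of in-memory reuse for iterative workloads can be sketched in plain Python, without any Spark dependency. The function names below are illustrative stand-ins, not part of the Spark API: `load_records` plays the role of an expensive disk/HDFS read, and keeping the loaded data in a local variable plays the role of Spark's RDD caching.

```python
# Sketch: why in-memory caching helps iterative algorithms.
# 'load_records' simulates an expensive disk-based read (hypothetical
# name, not a Spark API call).

disk_reads = 0

def load_records():
    """Simulate an expensive disk-based load."""
    global disk_reads
    disk_reads += 1
    return list(range(1_000))

def iterate_without_cache(passes):
    """Re-read the data on every pass, as in a disk-based MapReduce stage."""
    total = 0
    for _ in range(passes):
        total += sum(load_records())
    return total

def iterate_with_cache(passes):
    """Load once and keep the data in memory, as with Spark's cache()/persist()."""
    records = load_records()
    total = 0
    for _ in range(passes):
        total += sum(records)
    return total

without = iterate_without_cache(10)
reads_without = disk_reads      # 10 simulated disk reads
disk_reads = 0
with_cache = iterate_with_cache(10)
reads_with = disk_reads         # 1 simulated disk read
assert without == with_cache    # same result, far less I/O
```

For an algorithm that makes many passes over the same data, the I/O savings grow linearly with the number of iterations, which is the essence of Spark's performance advantage for such workloads.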

The main objective of the project is to integrate Apache Spark with MIDAS tools using the Pilot Abstraction, which serves as the interoperability layer between Spark and HPC. A Spark cluster can be launched via a Pilot-Job, with the Pilot Agent responsible for managing Spark's resources: the CPU cores, the nodes, and the resources required by the application. Because Spark targets iterative algorithms, an in-memory API is provided over the distributed cluster memory; the backing memory manager can be file-based, Spark itself, or Redis. To test the implementation, we developed two iterative applications to run on top of the infrastructure: a k-means clustering algorithm and a leaflet-finder algorithm. The project is still under active development.
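To illustrate the iterative structure that motivates in-memory execution, a minimal one-dimensional k-means in plain Python might look like the following. This is a standalone sketch of the algorithm's shape, not the project's actual Spark implementation; every pass re-reads the same data set, which is exactly the data Spark would keep cached in cluster memory.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Minimal 1-D k-means: assign each point to its nearest centroid,
    then recompute centroids as cluster means, repeating over the same
    data set. Each pass re-reads 'points' in full."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: nearest centroid wins
            idx = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # update step: move each centroid to its cluster's mean
        # (keep the old centroid if the cluster came up empty)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two well-separated 1-D clusters around 0 and 100.
data = [0.0, 1.0, 2.0, 100.0, 101.0, 102.0]
centers = kmeans(data, k=2)
```

On this toy data the algorithm converges to the two cluster means, 1.0 and 101.0, within a few passes; in the Spark-based version the assignment and update steps map naturally onto map and reduce operations over a cached distributed data set.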

We have also begun building a testing framework for these components.