zos-spark

Welcome to the GitHub organization dedicated to community contributions around the IBM z/OS Platform for Apache Spark.

The intent of this GitHub organization is to enable the development of an ecosystem of tools associated with a reference architecture that demonstrates how the IBM z/OS Platform for Apache Spark can be combined with a variety of open source tools. Of specific interest is the combination of Project Jupyter and Apache Spark to provide a flexible and extensible data processing and analytics solution.

The reference architecture is intended to serve as a discussion starter and a guide for teams considering the use of Apache Spark.

[Reference architecture diagram]

Several of the projects in this GitHub organization are used together to demonstrate the reference architecture and to serve as an integration verification test (IVT) of a new deployment of the IBM z/OS Platform for Apache Spark. For more details, refer to the Client Retention Demo repo.

Overview

Architecture Concepts

  1. Data Access: At the center of the solution is a Tidy Data Repository that is kept fresh with datasets from various data processing endpoints and serves as the central data source for analytical activity. This can be implemented in several ways, such as:
    • NoSQL database, storing JSON data
    • Hadoop Distributed File System (HDFS) or General Parallel File System (GPFS), storing JSON files
  2. Flexibility: A variety of use cases can be addressed by this solution. JSON datasets can be created using Spark jobs that handle complex data munging tasks (see the sketch following this list), and/or Data Scientists can perform ETL activities within notebooks across multiple datasets in the NoSQL database. Depending on skills and the requirements of a particular analytical task, users can decide when and where to perform ETL activities.
  3. Extensibility: New datasets can be injected into the NoSQL database from a variety of external data sources and/or data processing environments. For example, a separate instance of Apache Spark may be installed for use with a set of databases on distributed systems. Furthermore, users may want to run jobs that aggregate data from multiple data processing environments.
  4. Interoperability: An ecosystem of analytical tools is appropriately positioned to interoperate with the various solution components. For instance, the Scala Workbench is tightly coupled with the installation of the IBM z/OS Platform for Apache Spark.
  5. Integration: This solution should be viewed as a complementary approach to the data processing challenges of an enterprise. Nothing in this solution alters the procedures, policies, or existing programmatic approaches currently deployed.
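
To make the idea of Spark jobs feeding the Tidy Data Repository concrete, here is a minimal sketch of a data munging job that publishes a tidy JSON dataset. It assumes a Spark 2.x-style SparkSession, and the paths and column names are purely illustrative; on older Spark releases the equivalent entry point is SQLContext, and the repository could just as well be a NoSQL database rather than a file system.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Minimal sketch of a data munging job that publishes a tidy dataset.
// All paths and column names are illustrative, not part of the reference architecture.
object PublishTidyDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("publish-tidy-dataset").getOrCreate()

    // Read a raw extract (any source reachable through Spark's data access layer would do).
    val raw = spark.read.option("header", "true").csv("/data/raw/transactions.csv")

    // Simple munging: keep the columns analysts need, fix types, drop incomplete rows.
    val tidy = raw
      .select(col("customer_id"), col("txn_date"), col("amount").cast("double"))
      .na.drop()

    // Publish to the Tidy Data Repository as JSON, the common exchange format.
    tidy.write.mode("overwrite").json("/data/tidy/transactions")

    spark.stop()
  }
}
```

A job like this would typically be packaged as a jar and scheduled through the managed data processing environment described in the next section.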

Architecture Considerations

  1. Managed Data Processing Environment: Spark provides distributed task dispatching and scheduling. It can be used to allow data processing tasks to be submitted as batch jobs at a predefined frequency. Few users need to interact with such a managed environment for data processing jobs. Only Data Wranglers need a deep understanding of Spark and the tools necessary to integrate with it.
  2. Tidy Data Repository: As data processing tasks are completed, the results of those jobs can be stored in a central location that is more easily accessible by a broader user community. New datasets produced by the Spark jobs can be refreshed or purged as desired by the system administrators or user community. Two obvious deployment topologies stand out:

    • NoSQL database deployed in a Linux on Z partition
    • NoSQL database deployed on a distributed server

    In either case, the NoSQL database chosen should offer a robust set of Python and/or R libraries for manipulating data.

  3. Content Format: Given the various programming languages used by Data Scientists, the Tidy Data Repository should embrace a data storage format, such as JSON, that is commonly supported across programming languages and data stores. A brief example of reading such a JSON dataset follows this list.
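
On the consumption side, the same JSON dataset can be pulled into a notebook session with a few lines. The sketch below uses Scala against a pre-created SparkSession, as in the Scala Workbench; the path is again illustrative, and Python or R users would read the same data through their own Spark or NoSQL client libraries.

```scala
// In a Scala Workbench (Jupyter) session a SparkSession is typically already available.
// The path is illustrative; it matches the dataset written in the earlier sketch.
val tidy = spark.read.json("/data/tidy/transactions")

// Ad hoc ETL across the repository: aggregate and inspect without touching source systems.
val byCustomer = tidy.groupBy("customer_id").sum("amount")
byCustomer.show(10)
```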

Solution Personas

Persona | Tool | Description
--- | --- | ---
Operations Data Wrangler | Spark Job Manager | Combines Spark Job Server and Spark Job Server Client to provide a robust toolset for managing the lifecycle of Spark jobs.
Information Management Specialist | Scala Workbench | Tightly coupled Jupyter + Spark workbench that gives Scala users direct access to transactional systems to query, analyze, and visualize enterprise data.
Data Scientist | Interactive Insights Workbench | Open source tool for Python and R users to access datasets for analysis and insight generation.

Solution Components

Component | Purpose
--- | ---
IBM z/OS Platform for Apache Spark | Includes the Apache Spark open source capabilities: the Spark core, Spark SQL, Spark Streaming, the Machine Learning library (MLlib), and GraphX. It also includes the industry's only mainframe-resident Spark data abstraction solution, providing universal, optimized data access to a broad set of structured and unstructured data sources through Spark APIs. With this capability, traditional z/OS data sources such as IMS, VSAM, DB2 for z/OS, PDSE, or SMF data can be accessed in a performance-optimized manner with Spark.
Tidy Data Repository | NoSQL database that stores the results of Spark jobs in JSON format. In this solution, JSON is positioned as the common point of exchange for all languages to load and manipulate.
Spark Job Server | Provides a REST API for submitting, running, and monitoring Spark jobs (see the sketch below). Also enables the results of jobs to be returned in JSON format.
Spark Job Server Client | Java client API for developing GUI tools that let developers easily manage Spark jobs.
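
As a rough illustration of how the Spark Job Manager pieces fit together, the sketch below drives Spark Job Server's documented REST endpoints (`POST /jobs`, `GET /jobs/<jobId>`) directly from Scala. The host, application name, and job class are placeholders, and a real tool would more likely go through the Spark Job Server Client API rather than raw HTTP.

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

// Hypothetical Spark Job Server endpoint; adjust to the actual deployment.
val jobServer = "http://jobserver.example.com:8090"

// Submit a job: POST /jobs?appName=<jar name>&classPath=<job class>, with the
// job's configuration in the request body (spark-jobserver's documented API).
def submitJob(appName: String, classPath: String, config: String): String = {
  val url = new URL(s"$jobServer/jobs?appName=$appName&classPath=$classPath")
  val conn = url.openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setDoOutput(true)
  conn.getOutputStream.write(config.getBytes("UTF-8"))
  Source.fromInputStream(conn.getInputStream).mkString // JSON response containing the job id
}

// Check a job's status or result: GET /jobs/<jobId>
def jobStatus(jobId: String): String =
  Source.fromURL(s"$jobServer/jobs/$jobId").mkString
```

Results come back as JSON, which fits the solution's use of JSON as the common exchange format for the Tidy Data Repository.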

Notes:

  1. Current understanding of Spark Job Server and Client repos is preliminary and warrants further investigation.

  2. It may make sense to merge the Scala Workbench and Spark Job Manager tools into a single application.

Tool Development

Scala Workbench

Interactive Insights Workbench