Spark
What is Apache Spark?
Apache Spark is an open-source framework that simplifies the development of data analytics jobs and makes them more efficient. It supports a wide range of APIs and languages and offers more than 80 data transformation and action operators that hide the complexity of cluster computing.
Who built and maintains Spark?
Developers from more than 300 companies have built Spark, and a large community of users contributes to its continued refinement. Organizations across a wide range of industries use it, and its developer community is the largest in Big Data.
Why is Spark popular?
With reported speeds up to 100 times faster than similar analytics engines, Spark can access varied data sources and run on several platforms, including Hadoop, Apache Mesos, and Kubernetes, as well as in standalone mode or in the cloud. Whether you’re processing batch or streaming data, you’ll see top-level performance thanks to Spark’s DAG scheduler, query optimizer, and physical execution engine.
Why is Spark powerful?
Spark’s distinctive power comes from its in-memory processing. It uses a distributed pool of memory-heavy nodes and compact data encoding along with an optimizing query planner to minimize execution time and memory demand.
Because Spark performs calculations in memory, it can process data as much as 100 times faster than frameworks that process on disk. It is the preferred tool for processing the large volumes of data required for analytics and training models for machine learning and AI.
In addition, Spark ships with a stack of native libraries for machine learning and SQL-like structured data processing, which helps you get strong performance on large data sets. And with more than 80 high-level operators, Spark makes it easier to build parallel apps.
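The split between transformation operators and action operators can be illustrated with a minimal sketch. It is a toy model in plain Python, not Spark's API: the class `LazyDataset` and its methods are hypothetical names chosen for illustration. Transformations only record work against in-memory partitions, and nothing runs until an action asks for a result.

```python
from functools import reduce


class LazyDataset:
    """Toy stand-in for a partitioned collection (hypothetical, not Spark's API)."""

    def __init__(self, partitions, ops=()):
        self.partitions = partitions  # list of lists, one per imagined node
        self.ops = ops                # recorded transformations, not yet run

    # Transformations: record the step and return a new dataset.
    def map(self, fn):
        return LazyDataset(self.partitions, self.ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self.partitions, self.ops + (("filter", pred),))

    def _run_partition(self, part):
        # Apply every recorded step to one partition, entirely in memory.
        for kind, fn in self.ops:
            if kind == "map":
                part = [fn(x) for x in part]
            else:
                part = [x for x in part if fn(x)]
        return part

    # Actions: trigger execution of the recorded plan.
    def collect(self):
        return [x for p in self.partitions for x in self._run_partition(p)]

    def reduce(self, fn):
        return reduce(fn, self.collect())


data = LazyDataset([[1, 2, 3], [4, 5, 6]])
squares_of_evens = data.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

result = squares_of_evens.collect()                  # [4, 16, 36]
total = squares_of_evens.reduce(lambda a, b: a + b)  # 56
```

In a real cluster each partition would be processed on a different node. Here the partitions are simply processed one after another.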
Where did Spark come from?
Spark began in 2009 at the University of California, Berkeley’s AMPLab, where what started out as a few graduate research papers soon blossomed into what’s now known as Spark. In 2010, AMPLab open-sourced Spark and wider collaboration began. By 2013, the Spark community had grown so large that the project moved to the Apache Software Foundation.
Since then, more than 1200 developers from hundreds of organizations have helped Spark continue to evolve and become the powerful engine it is today. In fact, more than 1000 organizations use Spark in production, according to a 2016 survey.
How are Spark and Hadoop different?
Spark and Hadoop have a few similarities. Both are open-source frameworks for analytic data processing; both live in the Apache Software Foundation; both contain machine learning libraries; and both can be programmed in several different languages, such as Java, Python, R, or Scala.
However, Spark was written to extend the range of computations possible with Hadoop, so the true comparison is how Spark enhances Hadoop’s native data processing component, known as MapReduce.
For example, Hadoop MapReduce processes data only in batches, while Spark handles both batch and real-time streaming data. Additionally, while both have machine learning libraries, only Spark processes data in memory, which makes it much faster than Hadoop. Finally, the biggest difference between Spark and Hadoop is efficiency. MapReduce uses a two-stage execution process (map, then reduce), while Spark builds Directed Acyclic Graphs (DAGs) to schedule tasks and manage worker nodes so that independent work can run concurrently and hence more efficiently.
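To make the DAG idea concrete, here is a minimal sketch in plain Python. It is not Spark's scheduler: the job, the task names, and the helper `schedule_waves` are hypothetical, and the sketch only shows how a dependency graph reveals which tasks could run at the same time.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job: each key is a task, each value is the set of tasks
# that must finish before it can start.
job = {
    "load_orders": set(),
    "load_users": set(),
    "clean_orders": {"load_orders"},
    "join": {"clean_orders", "load_users"},
    "aggregate": {"join"},
}


def schedule_waves(dag):
    """Group tasks into waves; tasks within a wave do not depend on each other."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)                 # pretend the whole wave finished
    return waves


waves = schedule_waves(job)
for number, tasks in enumerate(waves, start=1):
    print(f"wave {number}: {tasks}")
```

The two load tasks land in the same wave because neither depends on the other, so they could run concurrently. A fixed two-stage pipeline has no equivalent way to express that.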
Benefits of Apache Spark
Spark has many advantages over other frameworks. It provides advanced analytics in an easy-to-use format with the flexibility and scalability needed to accelerate processing speed and efficiency. Some of its benefits include:
Speed
Spark organizes data so that in-memory processing scales across distributed cluster nodes, and it can process data without writing intermediate results back to disk storage. As a result, it can run batch jobs up to 100 times faster than MapReduce when processing in memory and up to ten times faster on disk.
Multilingual
Spark is written in Scala and also provides APIs for Java and Python, as well as an R package that lets data scientists process very large data sets.
Ease of use
Because Spark distributes data across nodes and clusters efficiently, it handles parallel data processing and data abstraction for you. Its ability to tie together multiple types of databases and compute over data from many types of data stores also makes it suitable for a wide range of use cases.
Power
Spark can handle a huge volume of data, as much as several petabytes according to proponents. It also allows users to perform exploratory data analysis on this petabyte-scale data without needing to downsample.
Advanced analytics
Spark comes packaged with several libraries of code to run data analytics applications. For example, MLlib provides machine learning code for advanced statistical operations, and the Spark Streaming library enables users to analyze data in real time.
Increased access to Big Data
Spark separates storage and compute, which allows customers to scale each to accommodate the performance needs of analytics applications. And it seamlessly performs batch jobs to move data into a data lake or data warehouse for advanced analytics.
Dynamic qualities
Spark includes tools to help users dynamically scale nodes to adjust to changing workloads. And at the end of a processing cycle, Spark makes it easier to release and reallocate nodes automatically.
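As a rough sketch of what dynamic scaling looks like in practice, the snippet below collects a few of Spark's dynamic allocation properties and renders them as `--conf` arguments for `spark-submit`. The property names are real Spark settings, but the values and the helper `to_submit_args` are illustrative assumptions. Depending on the cluster manager and Spark version, further settings (for example, an external shuffle service) may also be required.

```python
# Real Spark property names; the values are illustrative assumptions only.
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "20",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
}


def to_submit_args(conf):
    """Render a config dict as spark-submit style --conf arguments (hypothetical helper)."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(dynamic_allocation_conf)
print(" ".join(submit_args))
```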
Demand for Spark developers
As businesses increasingly require faster analytics processing to remain competitive, the demand for Spark developers is rising. And with on-demand machine learning overtaking the market, those who can facilitate deep learning accelerators and AI technologies are essential.
What is the Spark Framework?
The Spark Framework is a separate project from Apache Spark: it is a micro web framework for Java and Kotlin. When developers using Java or Kotlin want to develop more expressive web applications with limited boilerplate, they often turn to the Spark Framework. With a declarative and expressive syntax, it is designed for rapid, productive development.
As a micro framework, the Spark Framework allows developers to take full advantage of the Java Virtual Machine (JVM) with a less cumbersome process. Its code syntax is so concise that coding with it is far more streamlined than with other Java web frameworks.
In addition, the Spark Framework runs on the server side with static types built in. This appeals to NodeJS developers who use statically typed languages that compile to JavaScript, such as TypeScript, and who are taking on more server-side web development.
Apache Spark operates with several data structures and related components that make it a more powerful framework than other alternatives. These include RDDs, DataFrames, Datasets, Tungsten, and GraphFrames, which are described below:
- Resilient Distributed Datasets (RDDs): RDDs distribute data across clusters, allowing many processing tasks to run simultaneously. If any node in a cluster fails, the affected tasks can be recomputed so actions can continue without intervention.
- DataFrames: DataFrames organize data into named columns for SQL operations. They do not provide compile-time type safety; that is covered by Datasets.
- Datasets: Datasets also organize data into columns for SQL queries, and they do provide compile-time type safety.
- Tungsten: Strictly speaking, Tungsten is an execution engine component rather than a data structure. It is a more recent addition that was introduced to push Spark’s performance closer to bare-metal levels. It targets memory management, binary processing, code generation, and cache-aware algorithms for faster processing.
- GraphFrames: With the GraphFrames data structure, you can run graph queries, which are run in-memory for top performance speeds.
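The fault tolerance described for RDDs in the list above can be sketched as follows. This is a toy model in plain Python, not Spark's implementation: the `Partition` class and `simulate_node_failure` method are hypothetical. The idea is that each partition remembers the steps that produced it, so a lost in-memory result can be rebuilt from its durable source.

```python
class Partition:
    """Toy partition that remembers how to rebuild itself (hypothetical model)."""

    def __init__(self, source, transforms):
        self.source = source          # durable input, e.g. a block on disk
        self.transforms = transforms  # ordered functions applied to the source
        self.cached = None            # in-memory result; lost if the node fails

    def compute(self):
        # Recompute from the source only when no in-memory copy exists.
        if self.cached is None:
            rows = list(self.source)
            for fn in self.transforms:
                rows = [fn(r) for r in rows]
            self.cached = rows
        return self.cached

    def simulate_node_failure(self):
        self.cached = None


lineage = [lambda x: x + 1, lambda x: x * 10]
partitions = [Partition([1, 2], lineage), Partition([3, 4], lineage)]

before = [p.compute() for p in partitions]  # [[20, 30], [40, 50]]
partitions[1].simulate_node_failure()       # in-memory result is gone
after = [p.compute() for p in partitions]   # rebuilt from source and lineage
```

Only the failed partition is recomputed. The other partition keeps serving its cached result.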
By combining these five components, Spark can prepare data and provide descriptive analysis and searches like other frameworks. At the same time, it also provides predictive analysis, machine learning, and graph processing that enable businesses to make quicker, more informed decisions.
What does HPE offer to maximize Spark performance?
Enterprises pursuing a data-first strategy can use any number of HPE solutions to help unlock the value of their data in on-premises, multi-cloud, or edge deployments. Leading the digital transformation, HPE offers breakthrough services that integrate Spark into notebooks to accelerate machine learning and AI workloads, decreasing time to insights that add value to the business.
For example, HPE Ezmeral is an elastic platform that scales Apache Spark workloads with a multi-tenant control plane, GPU acceleration, full isolation and control, and prebuilt data analytics tools. HPE Ezmeral is the first data lakehouse that brings cloud-native capabilities to on-premises analytics, machine learning, and AI. Recent tests show that HPE Ezmeral, NVIDIA RAPIDS, and Tensor Core A100 GPUs accelerate Spark AI and ETL workloads by 29x. [1]
In addition, HPE recently introduced HPE Ezmeral Unified Analytics, which is an industry-first cloud data lakehouse platform that integrates Spark as part of its opinionated stack along with other best-in-class open-source tools. HPE Ezmeral Unified Analytics is also available on HPE GreenLake.
[1] HPE standardized our testing models based on Big-Data benchmarks leveraging Kubernetes Pods managed by HPE Ezmeral, NVIDIA A100 40GB GPUs, and HPE ProLiant DL385 Gen10 Plus v2.