Apache Spark: Ruling the Market for Over a Decade

Apache Spark is about a decade old and was released as open source in 2010. It has boomed over the past five years and is now one of the most widely used and best-known technologies in AI and big data. Let us look at Spark’s remarkable journey and where it may head next.

Spark is an in-memory replacement for MapReduce, the earlier disk-based computational engine designed for Hadoop clusters. Driven by complaints about MapReduce’s painfully slow performance, Spark displaced MapReduce on Hadoop clusters.

Spark’s advantage over MapReduce lies in how it processes data through RDDs (resilient distributed datasets). The map, shuffle and reduce phases of MapReduce involve many expensive round trips to disk, whereas Spark RDDs reduce that I/O by keeping the dataset in memory until the task is complete. (When a dataset exceeds the available RAM, Spark spills over to disk.) This RDD approach made Spark much faster.
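The three phases named above can be sketched in a few lines of plain Python. This is a toy illustration only, not the Spark or Hadoop API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and the data is a small in-memory list rather than a distributed dataset. It shows how intermediate results can be passed from phase to phase in memory instead of being written to disk between steps.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group together all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    # Assumption for this sketch: every intermediate result is an ordinary
    # Python object held in memory, so nothing is written to disk between phases.
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["spark keeps data in memory", "mapreduce writes data to disk"]
    print(word_count(sample))
```

In a disk-based engine, the output of each phase would be written out and read back before the next phase starts; the sketch simply hands the intermediate lists and dictionaries along in memory, which is the idea behind the speedup described above.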

Coding in Spark is also more efficient, because developers can write compact routines in several languages using the APIs for Python, Scala, R, and Java. The launch of DataFrames in 2015 boosted Spark’s productivity story: data could now be stored in table-like structures cached in memory, a feature that went hand in hand with Spark SQL (the successor to the earlier Shark project). A year later, with Spark 2.0, Spark creator Matei Zaharia and the Spark community added Datasets, a type-safe, object-oriented programming interface built on DataFrames.

Spark on Hadoop proved its worth across several workloads, starting with the replacement of traditional batch-oriented ETL jobs. Spark’s capacity for fast iteration caught the attention of many data scientists working on machine learning models and algorithms. With the SQL layer added, Spark also handles interactive analytics, which makes it usable by business analysts too.

Spark’s popularity surged sometime in 2013-14. Cloudera, the first Hadoop distributor, recognized its impact on the market, and MapR Technologies and Hortonworks (now part of Cloudera) were close behind. Many other vendors also jumped on the Spark bandwagon, replacing MapReduce with the faster, simpler and superior Spark engine.

Moving beyond Hadoop

Spark is versatile in where it runs as well: it integrates with YARN but is not limited to running on Hadoop. Spark was developed at AMPLab alongside Apache Mesos, which Zaharia also helped develop. Today it works with open source resource schedulers such as these, and it also runs standalone on a single computer, whether a laptop or a server.

Spark was developed to work with HDFS, and MapR also adopted it for its hybrid MapR File System. Spark now also works with MemSQL, Apache Cassandra, Amazon S3, Alluxio, Cloudera’s Kudu, OpenStack Swift, and Elasticsearch. It is available as a processing engine on all the major public clouds: it is a key engine behind Amazon’s well-known EMR (Elastic MapReduce) service and a popular choice on Microsoft Azure. GCP (Google Cloud Platform) supports Spark as well, and Spark is one of the “runners” supported by Apache Beam.

Spark offers several specialized processing engines. Its core functions began as substitutes for the batch routines in MapReduce. Since then, Spark has added a SQL engine (Spark SQL), a machine learning library (MLlib), a graph processing engine (GraphX) and a real-time streaming analytics engine (Spark Streaming).

Spark is written in Scala and executes in a JVM (Java Virtual Machine). APIs for Python, R, Scala, and Java make it accessible to a wide range of developers. Engineers and power users can write Spark routines directly from the command line interface, while other developers write Spark applications through interfaces such as IBM’s Data Science Experience, Cloudera’s Data Science Workbench or Jupyter.

Sustaining its Popularity

Spark is one of the most popular and active open source projects, with more than 1,000 contributors from more than 250 companies. Databricks, a California company, delivers Apache Spark as a service in the cloud. It raised $250 million in its Series E funding round and employs several of the key individuals responsible for Spark’s development.

Despite Hadoop’s troubles, it is striking how well Spark has held its ground. Neither the decline of Hortonworks and its merger with Cloudera, nor the operational challenges and technical complexity faced by Hadoop users, has hurt Spark’s standing. Its versatility has sustained Spark’s popularity beyond Hadoop.

Spark is quite flexible. It unified several kinds of analytics within a single framework, including SQL, machine learning and real-time processing, which Hadoop on its own did not offer. Bringing these together under one roof is what made Spark so powerful. Spark also has a big influence on big data practice: knowing Spark Core is a must for data engineers building new data pipelines, and data scientists still use MLlib to develop machine learning models.

The Future

Spark’s future looks bright, and the resources behind it are keeping it very much in business. The software had a big update last year with Spark 2.3, which added Kubernetes support and real-time processing in its streaming engine. Spark 3.0 may be released soon.

What will Spark 3.0 offer?

It is speculated that Spark 3.0 may focus on emerging AI technologies. Spark does not currently integrate well with deep learning frameworks. Some users do run MXNet, PyTorch and TensorFlow alongside Spark, but incompatibilities between the job schedulers create severe operational issues.

Other improvements are also being considered: completely deprecating the RDD API, better online serving of machine learning models, enhancements to the Scala API, support for data formats such as Apache Arrow, support for Neo4j’s Cypher, type-safe MLlib APIs, and better support for processor types such as FPGAs and GPUs.
