Apache Spark - An Overview and Best Practices


Apache Spark

  • Apache Spark is a unified analytics engine for processing large volumes of data.
  • It can run some workloads up to 100 times faster than Hadoop MapReduce.
  • It offers over 80 high-level operators that make it easy to build parallel apps.
  • Spark can run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and can access data from multiple sources.
  • Spark processes data in batches as well as in real time.
  • Spark keeps working data in RAM, i.e. in-memory, so repeated access is fast.
  • Spark provides explicit caching and in-memory persistence of datasets (see the sketch after this list).
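
As a rough sketch of the last two points, the following PySpark snippet parallelizes a computation and caches the result in memory, so a second action reuses the cached data instead of recomputing it. It assumes pyspark is installed locally (`pip install pyspark`); the app name and dataset are illustrative:

```python
from pyspark.sql import SparkSession

# Start a local Spark session ("overview-demo" is just an illustrative app name)
spark = SparkSession.builder.appName("overview-demo").master("local[*]").getOrCreate()

# Distribute a small dataset across the available workers (here, local cores)
rdd = spark.sparkContext.parallelize(range(1_000_000))

# cache() keeps the transformed RDD in memory after the first action,
# so subsequent actions reuse it rather than recomputing from scratch
squares = rdd.map(lambda x: x * x).cache()

print(squares.count())                                # first action: computes and caches
print(squares.filter(lambda x: x % 2 == 0).count())   # reuses the cached data

spark.stop()
```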

Components of the Spark ecosystem

The Apache Spark ecosystem comprises three main categories:

Language support

Spark integrates with several programming languages, letting you build applications and run analytics in Java, Python, Scala, or R.
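
For illustration, here is a minimal DataFrame example in Python; the same API is exposed with a near-identical shape in Scala, Java, and R (the session name and sample rows below are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lang-demo").master("local[*]").getOrCreate()

# In Scala the equivalent filter would read: df.filter($"age" > 30).show()
df = spark.createDataFrame([("Alice", 34), ("Bob", 28)], ["name", "age"])
df.filter(df.age > 30).show()

spark.stop()
```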

Core Components

Spark has five core components: Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and GraphX.
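
As a small example of one of these components, Spark SQL lets you register a DataFrame as a temporary view and query it with plain SQL (the view name and data below are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").master("local[*]").getOrCreate()

# Register a DataFrame as a temporary view, then query it with SQL
df = spark.createDataFrame([("Alice", 34), ("Bob", 28)], ["name", "age"])
df.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```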

Cluster Management

Spark can run under three cluster managers: its built-in Standalone cluster manager, Apache Mesos, and Hadoop YARN (Kubernetes, mentioned above, is also supported).
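
In practice, the cluster manager is selected by the master URL you pass to Spark. A minimal sketch in Python; the host names and ports in the comments are placeholders, not real endpoints:

```python
from pyspark.sql import SparkSession

# The master URL picks the cluster manager:
#   local[*]              - run locally, one worker thread per core
#   spark://host:7077     - Spark's built-in Standalone cluster manager
#   mesos://host:5050     - Apache Mesos
#   yarn                  - Hadoop YARN (cluster details come from the Hadoop config)
#   k8s://https://host:443 - Kubernetes
spark = (SparkSession.builder
         .appName("cluster-demo")
         .master("local[*]")  # swap for one of the URLs above when deploying
         .getOrCreate())
```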

How Spark runs applications