Apache Spark
- Apache Spark is a unified analytics engine for processing large volumes of data.
- It can run certain workloads, especially in-memory ones, up to 100 times faster than Hadoop MapReduce.
- It offers over 80 high-level operators that make it easy to build parallel apps.
- Spark can run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and can access data from multiple sources.
- Spark processes data in batch as well as in near real-time (streaming).
- Spark keeps intermediate data in RAM, i.e. in-memory, wherever possible, so repeated access is much faster than re-reading from disk.
- Spark also lets applications explicitly cache (persist) datasets in memory for reuse across operations.
Components of the Spark ecosystem
The Apache Spark ecosystem comprises three main categories:
Language support
Spark integrates with several languages for building applications and performing analytics: Java, Python, Scala, and R.
Core Components
Spark has five core components. These are Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and GraphX.
Cluster Management
Spark can run under several cluster managers: its own Standalone cluster manager, Apache Mesos, Hadoop YARN, and Kubernetes.
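In practice the cluster manager is selected with the `--master` option of `spark-submit`; the application code itself does not change. The lines below are an illustrative sketch only — `app.py` and the host/port values are placeholders.

```shell
# Same application, different cluster managers (hosts/ports are placeholders).
spark-submit --master local[4] app.py                 # local threads, no cluster
spark-submit --master spark://host:7077 app.py        # Spark Standalone cluster
spark-submit --master yarn app.py                     # Hadoop YARN
spark-submit --master mesos://host:5050 app.py        # Apache Mesos
spark-submit --master k8s://https://host:6443 app.py  # Kubernetes
```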