Learning Objectives:

     Understand Big Data and its components such as HDFS. You will learn about the Hadoop Cluster Architecture and you will also get an introduction to Spark and you will get to know about the difference between batch processing and real-time processing.
  • What is Big Data?
  • Big Data Customer Scenarios
  • Limitations and Solutions of Existing Data Analytics Architecture with Uber Use Case
  • How Hadoop Solves the Big Data Problem?
  • What is Hadoop?
  • Hadoop’s Key Characteristics
  • Hadoop Ecosystem and HDFS
  • Hadoop Core Components
  • Rack Awareness and Block Replication
  • YARN and its Advantage
  • Hadoop Cluster and its Architecture
  • Hadoop: Different Cluster Modes

Learning Objectives:

     Learn the basics of Scala that are required for programming Spark applications. You will also learn about the basic constructs of Scala such as variable types, control structures, collections such as Array, ArrayBuffer, Map, Lists, and many more.
  • What is Scala?
  • Why Scala for Spark?
  • Scala in other Frameworks
  • Introduction to Scala REPL
  • Basic Scala Operations
  • Variable Types in Scala
  • Control Structures in Scala

Learning Objectives:

     In this module, you will learn about object-oriented programming and functional programming techniques in Scala.
  • Functional Programming
  • Higher Order Functions
  • Anonymous Functions
  • Class in Scala
  • Getters and Setters
  • Custom Getters and Setters
  • Properties with only Getters
  • Auxiliary Constructor and Primary Constructor
  • Singletons
  • Extending a Class
  • Overriding Methods
  • Traits as Interfaces and Layered Traits

Learning Objectives:

     Understand Apache Spark and learn how to develop Spark applications. At the end, you will learn how to perform data ingestion using Sqoop.
  • Spark’s Place in Hadoop Ecosystem
  • Spark Components & its Architecture
  • Spark Deployment Modes
  • Introduction to Spark Shell
  • Writing your first Spark Job Using SBT
  • Submitting Spark Job
  • Spark Web UI
  • Data Ingestion using Sqoop

Learning Objectives:

     Get an insight of Spark – RDDs and other RDD related manipulations for implementing business logics (Transformations, Actions and Functions performed on RDD).
  • Challenges in Existing Computing Methods
  • Probable Solution & How RDD Solves the Problem
  • What is RDD, It’s Operations, Transformations & Actions
  • Data Loading and Saving Through RDDs
  • Key-Value Pair RDDs
  • Other Pair RDDs, Two Pair RDDs
  • RDD Lineage
  • RDD Persistence
  • WordCount Program Using RDD Concepts
  • RDD Partitioning & How It Helps Achieve Parallelization
  • Passing Functions to Spark

Learning Objectives:

     In this module, you will learn about SparkSQL which is used to process structured data with SQL queries, data-frames and datasets in Spark SQL along with different kind of SQL operations performed on the data-frames. You will also learn about the Spark and Hive integration.
  • Need for Spark SQL
  • What is Spark SQL?
  • Spark SQL Architecture
  • SQL Context in Spark SQL
  • User Defined Functions
  • Data Frames & Datasets
  • Interoperating with RDDs
  • JSON and Parquet File Formats
  • Loading Data through Different Sources
  • Spark – Hive Integration
Criteria MapReduce Spark
Processing speed Good Excellent (up to 100 times faster)
Data caching Hard disk In-memory
Performing iterative jobs Average Excellent
Dependency on Hadoop Yes No
Machine Learning applications Average Excellent

Spark is a fast, easy-to-use, and flexible data processing framework. It has an advanced execution engine supporting a cyclic data flow and in-memory computing. Apache Spark can run standalone, on Hadoop, or in the cloud and is capable of accessing diverse data sources including HDFS, HBase, and Cassandra, among others.

  • Apache Spark allows integrating with Hadoop.
  • It has an interactive language shell, Scala (the language in which Spark is written).
  • Spark consists of RDDs (Resilient Distributed Datasets), which can be cached across the computing nodes in a cluster.
  • Apache Spark supports multiple analytic tools that are used for interactive query analysis, real-time analysis, and graph processing

RDD is the acronym for Resilient Distribution Datasets—a fault-tolerant collection of operational elements that run in parallel. The partitioned data in an RDD is immutable and distributed. There are primarily two types of RDDs:

  • Parallelized collections: The existing RDDs running in parallel with one another
  • Hadoop datasets: Those performing a function on each file record in HDFS or any other storage system

A Spark engine is responsible for scheduling, distributing, and monitoring the data application across the cluster.

As the name suggests, a partition is a smaller and logical division of data similar to a ‘split’ in MapReduce. Partitioning is the process of deriving logical units of data to speed up data processing. Everything in Spark is a partitioned RDD.

  • Transformations
  • Actions

Transformations are functions applied to RDDs, resulting in another RDD. It does not execute until an action occurs. Functions such as map() and filer() are examples of transformations, where the map() function iterates over every line in the RDD and splits into a new RDD. The filter() function creates a new RDD by selecting elements from the current RDD that passes the function argument.

In Spark, an action helps in bringing back data from an RDD to the local machine. They are RDD operations giving non-RDD values. The reduce() function is an action that is implemented again and again until only one value if left. The take() action takes all the values from an RDD to the local node.

Serving as the base engine, Spark Core performs various important functions like memory management, monitoring jobs, providing fault-tolerance, job scheduling, and interaction with storage systems.

Learning Objectives:

     Learn why machine learning is needed, different Machine Learning techniques/algorithms, and SparK MLlib.
  • Why Machine Learning?
  • What is Machine Learning?
  • Where Machine Learning is Used?
  • Face Detection: USE CASE
  • Different Types of Machine Learning Techniques
  • Introduction to MLlib
  • Features of MLlib and MLlib Tools
  • Various ML algorithms supported by MLlib

Learning Objectives:

     Implement various algorithms supported by MLlib such as Linear Regression, Decision Tree, Random Forest and many more.
  • Supervised Learning – Linear Regression, Logistic Regression, Decision Tree, Random Forest
  • Unsupervised Learning – K-Means Clustering & How It Works with MLlib
  • Analysis on US Election Data using MLlib (K-Means)

Learning Objectives:

     Understand Kafka and its Architecture. Also, learn about Kafka Cluster, how to configure different types of Kafka Cluster. Get introduced to Apache Flume, its architecture and how it is integrated with Apache Kafka for event processing. At the end, learn how to ingest streaming data using flume.
  • Need for Kafka
  • What is Kafka?
  • Core Concepts of Kafka
  • Kafka Architecture
  • Where is Kafka Used?
  • Understanding the Components of Kafka Cluster
  • Configuring Kafka Cluster
  • Kafka Producer and Consumer Java API
  • Need of Apache Flume
  • What is Apache Flume?

Learning Objectives:

     In this module, you will learn about the different streaming data sources such as Kafka and flume. At the end of the module, you will be able to create a spark streaming application.
  • Apache Spark Streaming: Data Sources
  • Streaming Data Source Overview
  • Apache Flume and Apache Kafka Data Sources
  • Example: Using a Kafka Direct Data Source
  • Perform Twitter Sentimental Analysis Using Spark Streaming

Learning Objectives:

     Work on an end-to-end Financial domain project covering all the major concepts of Spark taught during the course.


Learning Objectives:

     In this module, you will be learning the key concepts of Spark GraphX programming and operations along with different GraphX algorithms and their implementations.


Spark does not support data replication in memory and thus, if any data is lost, it is rebuild using RDD lineage. RDD lineage is a process that reconstructs lost data partitions. The best thing about this is that RDDs always remember how to build from other datasets.

Spark driver is the program that runs on the master node of a machine and declares transformations and actions on data RDDs. In simple terms, a driver in Spark creates SparkContext, connected to a given Spark Master. It also delivers RDD graphs to Master, where the standalone Cluster Manager runs.

Hive contains significant support for Apache Spark, wherein Hive execution is configured to Spark:

hive> set spark.home=/location/to/sparkHome;
hive> set hive.execution.engine=spark;

Hive supports Spark on YARN mode by default.

  • Spark SQL (Shark) for developers
  • Spark Streaming for processing live data streams
  • GraphX for generating and computing graphs
  • MLlib (Machine Learning Algorithms)
  • SparkR to promote R programming in the Spark engine

Spark supports stream processing—an extension to the Spark API allowing stream processing of live data streams. Data from different sources like Kafka, Flume, Kinesis is processed and then pushed to file systems, live dashboards, and databases. It is similar to batch processing in terms of the input data which is here divided into streams like batches in batch processing.

Spark uses GraphX for graph processing to build and transform interactive graphs. The GraphX component enables programmers to reason about structured data at scale.

MLlib is a scalable Machine Learning library provided by Spark. It aims at making Machine Learning easy and scalable with common learning algorithms and use cases like clustering, regression filtering, dimensional reduction, and the like.

Spark SQL, better known as Shark, is a novel module introduced in Spark to perform structured data processing. Through this module, Spark executes relational SQL queries on data. The core of this component supports an altogether different RDD called SchemaRDD, composed of row objects and schema objects defining the data type of each column in a row. It is similar to a table in relational databases.

Parquet is a columnar format file supported by many other data processing systems. Spark SQL performs both read and write operations with the Parquet file and considers it be one of the best Big Data Analytics formats so far.

  • Hadoop Distributed File System (HDFS)
  • Local file system
  • Amazon S3