Learning Objectives:

     Understand Big Data and analyse the limitations of traditional solutions. You will learn about Hadoop and its core components, and get to know the differences between Hadoop 1.0 and Hadoop 2.x.
Topics:
  • Introduction to big data
  • Common big data domain scenarios
  • Limitations of traditional solutions
  • What is Hadoop?
  • Hadoop 1.0 ecosystem and its Core Components
  • Hadoop 2.x ecosystem and its Core Components
  • Application submission in YARN

Learning Objectives:

     In this module, you will learn about Hadoop Distributed File System, Hadoop Configuration Files and Hadoop Cluster Architecture. You will also get to know the roles and responsibilities of a Hadoop administrator.
Topics:
  • Distributed File System
  • Hadoop Cluster Architecture
  • Replication rules
  • Hadoop Cluster Modes
  • Rack awareness theory
  • Hadoop cluster administrator responsibilities
  • How HDFS works
  • NTP server
  • Initial configuration required before installing Hadoop
  • Deploying Hadoop in a pseudo-distributed mode

Learning Objectives:

     Learn how to build a Hadoop multi-node cluster and understand the various properties of Namenode, Datanode and Secondary Namenode.
Topics:
  • OS Tuning for Hadoop Performance
  • Prerequisites for installing Hadoop
  • Hadoop Configuration Files
  • Stale Configuration
  • RPC and HTTP Server Properties
  • Properties of Namenode, Datanode and Secondary Namenode
  • Log Files in Hadoop
  • Deploying a multi-node Hadoop cluster

Learning Objectives:

     In this module, you will learn how to add nodes to and remove nodes from your cluster, both in an ad hoc and in the recommended way. You will also understand day-to-day cluster administration tasks such as balancing data in the cluster, protecting data by enabling trash, attempting a manual failover, and creating backups within or across clusters.
Topics:
  • Commissioning and Decommissioning of Nodes
  • HDFS Balancer
  • Namenode Federation in Hadoop
  • High Availability in Hadoop
  • .Trash Functionality
  • Checkpointing in Hadoop
  • Distcp
  • Disk balancer

Learning Objectives:

     Get to know the various processing frameworks in Hadoop and understand the YARN job execution flow. You will also learn about the MapReduce programming model and the various schedulers from a Hadoop administrator's perspective.
Topics:
  • Different Processing Frameworks
  • Different phases in MapReduce
  • Spark and its Features
  • Application Workflow in YARN
  • YARN Metrics
  • YARN Capacity Scheduler and Fair Scheduler
  • Service Level Authorization (SLA)

Learning Objectives:

     In this module, you will gain insights into cluster planning and management, and learn which aspects need to be considered when planning the setup of a new cluster.
Topics:
  • Planning a Hadoop 2.x cluster
  • Cluster sizing
  • Hardware, Network and Software considerations
  • Popular Hadoop distributions
  • Workload and usage patterns
  • Industry recommendations

Hadoop comes with many advantages. Compared with other frameworks, it makes it easy for programmers to write reliable code and to detect and correct errors in it. It is based on Java, which keeps compatibility issues to a minimum. For distributed systems and distributed processing, Hadoop has become the first choice of programmers all over the world, and its ability to handle bulk data easily is another strength of the framework.

Relational database management tools often fail at some stage, typically when they have to handle very large amounts of data. Big data refers to collections of large, complex data sets; handling it well is an approach that lets businesses extract the maximum information from their data by capturing, searching, analyzing, sharing, transferring and visualizing it.

Big data is commonly characterized by a set of "V"s:
1. Volume
2. Velocity
3. Veracity
4. Value

Hadoop is an approach that makes it easy for users to handle big data without running into problems, so that business decisions can be based on the most useful information in very little time. It is equipped with components that let users keep pace even when the data grows very large, and it has two prime components:

1. Processing Framework
2. Storage Unit

YARN stands for Yet Another Resource Negotiator
HDFS stands for Hadoop Distributed File System

Hadoop's data storage unit is the Hadoop Distributed File System (HDFS). Any form of data can be stored in it in the form of blocks, and it uses a master-and-slave topology. Whenever more storage is needed, the file system can be extended to meet the demand, and Hadoop is very good in this respect.
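
The sketch below is a minimal illustration (not part of the course material) of how a client talks to HDFS through the Java FileSystem API: a small file is written and read back, while HDFS takes care of splitting the data into blocks and replicating them across DataNodes. The NameNode URI and the file path are hypothetical placeholders.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; in practice this comes from core-site.xml.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            Path file = new Path("/user/demo/sample.txt");  // hypothetical path
            // Write: the client streams bytes, HDFS stores them as replicated blocks.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeBytes("hello hdfs\n");
            }

            // Read the same file back through the same API.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file)))) {
                System.out.println(in.readLine());
            }
            fs.close();
        }
    }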

Both are related to storage in Hadoop. The NameNode is the master node and is responsible for maintaining the metadata about the blocks in the cluster, such as which blocks belong to which files and where they are stored. DataNodes are the slave nodes and are mainly responsible for storing and managing the actual block data.
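
As a purely illustrative follow-up to this master/slave split, the sketch below asks the NameNode for the block metadata of a file; the host names it prints are the DataNodes holding the replicas. The file path is again a made-up example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockReport {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt"));

            // A metadata query answered by the NameNode; no block data is read here.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + " length " + block.getLength()
                        + " on DataNodes " + String.join(",", block.getHosts()));
            }
        }
    }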

Learning Objectives:

     Get to know Hadoop cluster monitoring and security concepts. You will also learn how to secure a Hadoop cluster with Kerberos.
Topics:
  • Monitoring Hadoop Clusters
  • Hadoop Security System Concepts
  • Securing a Hadoop Cluster With Kerberos
  • Common Misconfigurations
  • Overview of Kerberos
  • Checking log files to troubleshoot Hadoop clusters

Learning Objectives: 

    In this module, you will learn about Cloudera Hadoop 2.x and its various features.
Topics:
  • Visualize Cloudera Manager
  • Features of Cloudera Manager
  • Build Cloudera Hadoop cluster using CDH
  • Installation choices in Cloudera
  • Cloudera Manager Vocabulary
  • Cloudera terminologies
  • Different tabs in Cloudera Manager
  • What is HUE?
  • Hue Architecture
  • Hue Interface
  • Hue Features

Learning Objectives:

     Get to know the working and installation of Hadoop ecosystem components such as Pig and Hive.
Topics:
  • Explain Hive
  • Hive Setup
  • Hive Configuration
  • Working with Hive
  • Setting Hive in local and remote metastore mode
  • Pig setup
  • Working with Pig

Learning Objectives:

     In this module, you will learn about the working and installation of HBase and Zookeeper.
Topics:
  • What is a NoSQL Database?
  • HBase data model
  • HBase Architecture
  • MemStore, WAL, BlockCache
  • HBase HFile
  • Compactions
  • HBase Read and Write
  • HBase balancer and hbck
  • HBase setup
  • Working with HBase
  • Installing Zookeeper

Learning Objectives:

     In this module, you will get to know Apache Oozie, a server-based workflow scheduling system for managing Hadoop jobs.
Topics:
  • Oozie overview
  • Oozie Features
  • Oozie workflow, coordinator and bundle
  • Start, End and Error Node
  • Action Node
  • Join and Fork
  • Decision Node
  • Oozie CLI
  • Install Oozie

Learning Objectives:

     Learn about the different data ingestion tools such as Sqoop and Flume.
Topics:
  • Types of Data Ingestion
  • HDFS data loading commands
  • Purpose and features of Sqoop
  • Perform operations like Sqoop Import, Export and Hive Import
  • Sqoop 2
  • Install Sqoop
  • Import data from RDBMS into HDFS
  • Flume features and architecture
  • Types of flow
  • Install Flume
  • Ingest Data From External Sources With Flume
  • Best Practices for Importing Data

Both the ResourceManager and the NodeManagers are part of YARN. The ResourceManager receives data-processing requests and passes them on to the corresponding NodeManagers, ensuring that the processing actually takes place. Each NodeManager, in turn, makes sure that tasks are executed properly on its own node.
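
To make the split concrete, here is a hedged sketch that uses the YARN client API to ask the ResourceManager for a report on every running NodeManager, including how many containers each one is currently executing. It assumes a reachable cluster whose settings are found in yarn-site.xml on the classpath.

    import java.util.List;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListNodeManagers {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration());  // picks up yarn-site.xml from the classpath
            yarn.start();

            // The ResourceManager tracks every NodeManager and its container usage.
            List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                System.out.println(node.getNodeId()
                        + " containers=" + node.getNumContainers()
                        + " used=" + node.getUsed()
                        + " capacity=" + node.getCapability());
            }
            yarn.stop();
        }
    }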

The Secondary NameNode is responsible for checkpointing. It periodically fetches the edit log and the fsimage from the NameNode, merges them into a new, up-to-date fsimage, and ships the result back to the NameNode. This keeps the edit log from growing without bound and keeps NameNode restarts fast.
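
The checkpoint interval is driven by two standard HDFS properties. The short sketch below simply reads them back through the Hadoop Configuration API; the default values shown (one hour, one million transactions) are the usual Hadoop 2.x defaults, and the presence of hdfs-site.xml on the classpath is an assumption.

    import org.apache.hadoop.conf.Configuration;

    public class CheckpointSettings {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.addResource("hdfs-site.xml");  // assumes the file is on the classpath

            // Checkpoint every hour, or earlier if one million transactions accumulate.
            long periodSeconds = conf.getLong("dfs.namenode.checkpoint.period", 3600);
            long txnThreshold  = conf.getLong("dfs.namenode.checkpoint.txns", 1000000);

            System.out.println("checkpoint period (s): " + periodSeconds);
            System.out.println("checkpoint txn limit : " + txnThreshold);
        }
    }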

NAS stands for Network-Attached Storage and is generally regarded as a file-level storage server. It is attached to a network and is mainly responsible for providing a group of users with access to files; when it comes to storing and accessing those files, all of the responsibility is borne by the NAS, which can be software or hardware. HDFS, on the other hand, is a distributed file system that runs on commodity hardware.

Data can either be stored in a distributed manner across all the machines in a cluster, or kept on a single dedicated machine. The distributed approach is the better option, because the failure of one machine does not interrupt the entire operation of the organization. A dedicated machine can be backed up, but accessing the backup data and bringing it back onto the main server takes a lot of time. The distributed option is therefore the more reliable one, although every machine in the cluster needs to be protected in every respect when the data is confidential.
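
The reliability of the distributed option rests on replication: HDFS keeps several copies of every block, controlled by the dfs.replication property. The sketch below reads the cluster-wide default and raises the factor for one (hypothetical) important file; it assumes a reachable cluster configured on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Cluster-wide default, normally set in hdfs-site.xml (3 is the usual default).
            System.out.println("default replication: " + conf.getInt("dfs.replication", 3));

            FileSystem fs = FileSystem.get(conf);
            // Ask HDFS to keep an extra copy of a particularly important file.
            fs.setReplication(new Path("/user/demo/critical.csv"), (short) 4);
            fs.close();
        }
    }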

With Hadoop, it does not matter whether the data to be stored is structured, semi-structured or unstructured, and the schema of the data does not need to be known to Hadoop in advance. An RDBMS, by contrast, always works with structured data and cannot process the other kinds; its schema is always known up front. In terms of processing, an RDBMS offers a limited set of capabilities, while Hadoop scales processing without a strict upper limit. Another key difference is that Hadoop is open source, while an RDBMS is licensed.

In Hadoop, the schema is applied mainly after the data has been loaded rather than while it is being written. This can sometimes surface errors, but those can be handled at a later stage. In other words, Hadoop follows the schema-on-read approach.
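
A tiny, self-contained illustration of schema-on-read (the record format and field names are invented for the example): the raw line is stored exactly as it arrived, and the structure is imposed only at the moment it is read.

    public class SchemaOnRead {
        public static void main(String[] args) {
            // The line was loaded as-is; nothing validated it at write time.
            String raw = "ORD-1001,250.75";

            // The schema (order id, amount) is applied only now, while reading.
            String[] fields = raw.split(",");
            String orderId = fields[0];
            double amount = Double.parseDouble(fields[1]);

            System.out.println(orderId + " -> " + amount);
        }
    }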

Hadoop is a good option to consider for OLAP systems, data discovery and data analysis. Its features make handling bulk data very easy, and because all of these tasks involve large amounts of data, the Hadoop approach can be trusted for them.