 Define Data Science
 Discuss the era of Data Science
 Describe the Role of a Data Scientist
 Illustrate the Life cycle of Data Science
 List the Tools used in Data Science
 State what role Big Data and Hadoop, Python, R and Machine Learning play in Data Science
Topics:
 What is Data Science?
 What does Data Science involve?
 Era of Data Science
 Business Intelligence vs Data Science
 Life cycle of Data Science
 Tools of Data Science
 Introduction to Python
 Discuss Data Acquisition techniques
 List the different types of Data
 Evaluate Input Data
 Explain the Data Wrangling techniques
 Discuss Data Exploration
 Data Analysis Pipeline
 What is Data Extraction
 Types of Data
 Raw and Processed Data
 Data Wrangling
 Exploratory Data Analysis
 Visualization of Data
 Essential Python Revision
 Necessary Machine Learning Python libraries
 Define Machine Learning
 Discuss Machine Learning Use cases
 List the categories of Machine Learning
 Illustrate Supervised Learning Algorithms
 Identify and recognize machine learning algorithms around us
 Understand the various elements of a machine learning algorithm, such as parameters, hyperparameters, loss function, and optimization
 Python Revision (NumPy, Pandas, scikit-learn, Matplotlib)
 What is Machine Learning?
 Machine Learning Use Cases
 Machine Learning Process Flow
 Machine Learning Categories
 Linear regression
 Gradient descent
 Understand what Supervised Learning is
 Illustrate Logistic Regression
 Define Classification
 Explain different Types of Classifiers such as Decision Tree and Random Forest
 What is Classification and its use cases?
 What is Decision Tree?
 Algorithm for Decision Tree Induction
 Creating a Perfect Decision Tree
 Confusion Matrix
 What is Random Forest?
 Define the importance of Dimensions
 Explore PCA and its implementation
 Discuss LDA and its implementation
 Introduction to Dimensionality
 Why Dimensionality Reduction
 PCA
 Factor Analysis
 Scaling dimensional model
 LDA
In supervised machine learning, a model makes predictions or decisions based on past, labeled data. Labeled data refers to sets of data that have been given tags or labels, and thus made more meaningful.
Overfitting occurs when a model learns the training set too well, picking up random fluctuations in the training data as if they were real concepts. These quirks hurt the model's ability to generalize, because they do not apply to new data.
When such a model is given the training data, it can show close to 100 percent accuracy (technically, a very small loss). But when it is run on the test data, it may show large errors and low efficiency. This condition is known as overfitting.
There are multiple ways of avoiding overfitting, such as:
 Regularization. It adds a cost term to the objective function for the parameters involved, discouraging overly complex fits
 Making a simpler model. With fewer variables and parameters, the variance can be reduced
 Cross-validation methods such as k-fold can also be used
 If some model parameters are likely to cause overfitting, regularization techniques like LASSO can be used to penalize these parameters
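The penalty idea behind the first and last points can be sketched in a few lines of NumPy. This is an illustrative example with made-up data, using an L2 (ridge) cost term, which has a closed-form solution; the L1 (LASSO) penalty mentioned above behaves similarly but needs an iterative solver.

```python
import numpy as np

# toy regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
true_w = np.array([1.0, 2.0, 0.0, 0.0, -1.0])
y = X @ true_w + rng.normal(scale=0.5, size=30)

def ridge_fit(X, y, alpha):
    """Least squares with an L2 cost term: w = (X^T X + alpha*I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_plain = ridge_fit(X, y, alpha=0.0)   # no penalty: fits the noise more freely
w_reg = ridge_fit(X, y, alpha=10.0)    # penalized: coefficients shrink toward zero
```

Increasing `alpha` strengthens the cost term, shrinking the fitted parameters and reducing the model's ability to chase random fluctuations in the training data.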
There is a three-step process followed to create a model:
 Train the model
 Test the model
 Deploy the model
Training Set: the portion of the labeled data used to fit the model; it passes through the model multiple times during training.
Test Set: the portion set aside before training and used only to evaluate the model's accuracy.
Consider a case where you have labeled data for 1,000 records. One way to train the model is to expose all 1,000 records during the training process. Then you take a small set of the same data to test the model, which would appear to give good results, because the model has already seen that data.
But this is not an accurate way of testing. So, we set aside a portion of that data, called the 'test set,' before starting the training process. The remaining data, called the 'training set,' is used for training the model. The training set passes through the model multiple times until the accuracy is high and the error is minimized.
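The hold-out idea above can be sketched in a few lines of plain Python. The 80/20 split ratio and the record IDs are illustrative assumptions, not prescribed by the text.

```python
import random

records = list(range(1000))   # stand-ins for the 1,000 labeled records
random.seed(42)               # fixed seed so the split is reproducible
random.shuffle(records)       # shuffle before splitting to avoid ordering bias

split = int(0.8 * len(records))      # hold out 20% of the data as the test set
training_set = records[:split]       # used to fit the model, possibly many times
test_set = records[split:]           # never shown to the model during training
```

Because the test records never appear in training, the accuracy measured on them is an honest estimate of how the model will behave on new data.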
One of the easiest ways to handle missing or corrupted data is to drop those rows or columns or replace them entirely with some other value.
There are two useful methods in Pandas:
 isnull() and dropna() will help find the columns/rows with missing data and drop them
 fillna() will replace the missing values with a placeholder value
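Both methods can be shown in a minimal Pandas sketch; the column names and values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# toy table with one missing value in each column
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0],
    "city": ["NY", "LA", None],
})

missing = df.isnull()     # boolean mask: True wherever a value is missing
dropped = df.dropna()     # drop every row that contains any missing value
filled = df.fillna({"age": df["age"].mean(),   # numeric placeholder: column mean
                    "city": "unknown"})        # categorical placeholder
```

Dropping is simpler but throws away whole rows; filling keeps the rows at the cost of introducing placeholder values that the model will treat as real.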
When the training set is small, a model with high bias and low variance tends to work better, because it is less likely to overfit; for example, Naive Bayes performs well on small training sets. When the training set is large, models with low bias and high variance tend to perform better, as they can capture complex relationships.
A confusion matrix (or error matrix) is a specific table that is used to measure the performance of an algorithm. It is mostly used in supervised learning; in unsupervised learning, it’s called the matching matrix.
The confusion matrix has two parameters:
 Actual
 Predicted
The same set of classes appears along both of these dimensions.
False positives are those cases which wrongly get classified as True but are False.
False negatives are those cases which wrongly get classified as False but are True.
In the term ‘False Positive,’ the word ‘Positive’ refers to the ‘Yes’ row of the predicted values in the confusion matrix. The complete term indicates that the system predicted it as positive, but the actual value is negative.
Consider the following example confusion matrix:

                 Actual: Yes   Actual: No
Predicted: Yes        12            3
Predicted: No          1            9

So, looking at the confusion matrix, we get:
False Positive = 3
True Positive = 12
Similarly, in the term ‘False Negative,’ the word ‘Negative’ refers to the ‘No’ row of the predicted values in the confusion matrix. The complete term indicates that the system predicted it as negative, but the actual value is positive.
So, looking at the confusion matrix, we get:
False Negative = 1
True Negative = 9
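Those four counts can be reproduced with a small pure-Python helper; the label vectors below are toy data constructed to match the counts above (1 stands for ‘Yes’, 0 for ‘No’).

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN) for binary labels, treating 1 as 'Yes'."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn

# toy labels arranged to reproduce the counts above
actual    = [1] * 12 + [1] + [0] * 3 + [0] * 9
predicted = [1] * 12 + [0] + [1] * 3 + [0] * 9
```

In practice a library routine such as scikit-learn's `confusion_matrix` does the same counting, but writing it out makes clear that each cell is just a count of (actual, predicted) pairs.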
The three stages of building a machine learning model are:

Model Building
Choose a suitable algorithm for the model and train it according to the requirement

Model Testing
Check the accuracy of the model through the test data

Applying the Model
Make the required changes after testing and use the final model for real-time projects
Here, it's important to remember that the model needs to be checked once in a while to make sure it's working correctly, and modified to keep it up to date.
Deep learning is a subset of machine learning that involves systems that think and learn like humans using artificial neural networks. The term ‘deep’ comes from the fact that you can have several layers of neural networks.
One of the primary differences between machine learning and deep learning is that feature engineering is done manually in machine learning. In the case of deep learning, the model consisting of neural networks will automatically determine which features to use (and which not to use).
Machine Learning: feature engineering is done manually by the practitioner.
Deep Learning: a subset of machine learning in which multi-layer neural networks automatically determine which features to use.
 Understand what the Naïve Bayes Classifier is
 How Naïve Bayes Classifier works?
 Understand Support Vector Machine
 Illustrate How Support Vector Machine works?
 Hyperparameter optimization
 What is Naïve Bayes?
 How Naïve Bayes works?
 Implementing Naïve Bayes Classifier
 What is Support Vector Machine?
 Illustrate how Support Vector Machine works?
 Hyperparameter optimization
 Grid Search vs Random Search
 Implementation of Support Vector Machine for Classification

 Define Unsupervised Learning
 Discuss the following Cluster Analysis
 What is Clustering & its Use Cases?
 What is K-means Clustering?
 How K-means algorithm works?
 How to do optimal clustering
 What is C-means Clustering?
 What is Hierarchical Clustering?
 How Hierarchical Clustering works?
 Define Association Rules
 Learn the backend of recommendation engines and develop your own using Python
 What are Association Rules?
 Association Rule Parameters
 Calculating Association Rule Parameters
 Recommendation Engines
 How Recommendation Engines work?
 Collaborative Filtering
 Content Based Filtering
 Explain the concept of Reinforcement Learning
 Generalize a problem using Reinforcement Learning
 Explain Markov’s Decision Process
 Demonstrate Q Learning
 What is Reinforcement Learning
 Why Reinforcement Learning
 Elements of Reinforcement Learning
 Exploration vs Exploitation dilemma
 Epsilon Greedy Algorithm
 Markov Decision Process (MDP)
 Q values and V values
 Q-Learning
 α values
 Explain Time Series Analysis (TSA)
 Discuss the need of TSA
 Describe ARIMA modelling
 Forecast the time series model
 What is Time Series Analysis?
 Importance of TSA
 Components of TSA
 White Noise
 AR model
 MA model
 ARMA model
 ARIMA model
 Stationarity
 ACF & PACF
 Discuss Model Selection
 Define Boosting
 Express the need of Boosting
 Explain the working of Boosting algorithm
 What is Model Selection?
 Need of Model Selection
 Cross-Validation
 What is Boosting?
 How Boosting Algorithms work?
 Types of Boosting Algorithms
 Adaptive Boosting
Applications of supervised machine learning include:

Email Spam Detection
Here we train the model using historical data that consists of emails categorized as spam or not spam. This labeled information is fed as input to the model.

Healthcare Diagnosis
By providing images regarding a disease, a model can be trained to detect if a person is suffering from the disease or not.

Sentiment Analysis
This refers to the process of using algorithms to mine documents and determine whether they’re positive, neutral, or negative in sentiment.

Fraud Detection
By training the model to identify suspicious patterns, we can detect instances of possible fraud.
Supervised learning uses data that is completely labeled, whereas unsupervised learning uses no training data.
In the case of semi-supervised learning, the training data contains a small amount of labeled data and a large amount of unlabeled data.
There are two techniques used in unsupervised learning: clustering and association.
Clustering
Clustering problems involve dividing the data into subsets. These subsets, also called clusters, contain data points that are similar to each other. Different clusters reveal different details about the objects, unlike classification or regression.
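One such algorithm, K-means (listed in the topics above), can be sketched in one dimension with toy points and k = 2; real projects would use a library implementation such as scikit-learn's, but the two alternating steps are the whole idea.

```python
# toy 1-D data with two obvious groups
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers = [points[0], points[-1]]   # naive initialization: first and last point

for _ in range(10):                 # a few iterations suffice for this tiny example
    clusters = ([], [])
    for p in points:
        # assignment step: each point joins its nearest center
        nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
        clusters[nearest].append(p)
    # update step: each center moves to the mean of its cluster
    centers = [sum(c) / len(c) for c in clusters]
```

After convergence the two centers sit at the means of the two natural groups, and each cluster contains points that are similar (close) to each other.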
Association
In an association problem, we identify patterns of associations between different variables or items.
For example, an e-commerce website can suggest other items for you to buy based on your prior purchases, spending habits, items in your wishlist, other customers' purchase habits, and so on.
 Supervised learning – This model learns from the labeled data and makes a future prediction as output
 Unsupervised learning – This model uses unlabeled input data and allows the algorithm to act on that information without guidance.
Inductive Learning: the model generalizes rules from observed examples, learning from the data itself.
Deductive Learning: the model applies already-established rules to draw conclusions about new cases.
K-means: an unsupervised clustering algorithm that groups unlabeled data into k clusters.
KNN (k-nearest neighbors): a supervised algorithm that labels a data point based on the majority label of its k nearest neighbors.
The classifier is called ‘naive’ because it makes assumptions that may or may not turn out to be correct.
The algorithm assumes that the presence of one feature of a class is not related to the presence of any other feature (absolute independence of features), given the class variable.
For instance, a fruit may be considered a cherry if it is red in color and round in shape, regardless of other features. This assumption may or may not be correct (an apple also matches the description).
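The fruit example can be turned into a tiny classifier sketch. The labeled data below is invented for illustration; each probability is estimated by counting, and the independence assumption appears as the product of per-feature likelihoods (with Laplace smoothing so an unseen value doesn't zero out the product).

```python
from collections import Counter, defaultdict

# invented labeled fruit data: (color, shape) -> fruit
data = [
    (("red", "round"), "cherry"),
    (("red", "round"), "cherry"),
    (("red", "oval"), "cherry"),
    (("red", "round"), "apple"),
    (("green", "round"), "apple"),
]

priors = Counter(label for _, label in data)      # class counts for P(class)
feature_counts = defaultdict(Counter)             # (label, position) -> value counts
for features, label in data:
    for i, value in enumerate(features):
        feature_counts[(label, i)][value] += 1

def classify(features):
    scores = {}
    for label, count in priors.items():
        p = count / len(data)                     # prior P(class)
        for i, value in enumerate(features):
            # naive independence: multiply per-feature likelihoods
            # (+1 / +2 is Laplace smoothing; each feature has two values here)
            p *= (feature_counts[(label, i)][value] + 1) / (count + 2)
        scores[label] = p
    return max(scores, key=scores.get)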
Reinforcement learning has an environment and an agent. The agent performs some actions to achieve a specific goal. Every time the agent performs a task that takes it towards the goal, it is rewarded; every time it takes a step that goes against the goal, it is penalized.
Earlier, chess programs had to determine the best moves after much research on numerous factors. Building a machine designed to play such games would require many rules to be specified.
With reinforcement learning, we don't have to deal with this problem, as the learning agent learns by playing the game. It will make a move (decision), check if it's the right move (feedback), and keep the outcomes in memory for the next step it takes (learning). There is a reward for every correct decision the system makes and a penalty for every wrong one.
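This decision-feedback-learning loop can be sketched with tabular Q-learning (listed in the topics above) on a made-up "corridor" game: four states, move left or right, reward only at the goal. All constants and the game itself are illustrative assumptions.

```python
import random

N_STATES = 4                                 # corridor: 0 - 1 - 2 - goal(3)
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2        # learning rate, discount, exploration
Q = [[0.0, 0.0] for _ in range(N_STATES)]    # Q[state][action]; 0 = left, 1 = right

random.seed(0)
for _ in range(300):                         # play many episodes of the game
    state = 0
    while state != N_STATES - 1:
        # epsilon-greedy: mostly exploit the best known move, sometimes explore
        if random.random() < EPSILON:
            action = random.randrange(2)
        else:
            action = 0 if Q[state][0] > Q[state][1] else 1
        next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        # Q-learning update: nudge Q toward reward + discounted best future value
        Q[state][action] += ALPHA * (reward + GAMMA * max(Q[next_state]) - Q[state][action])
        state = next_state
```

After training, the learned values prefer "right" in every state: the reward at the goal has propagated backward through the table, which is exactly the "keep the outcomes in memory" step described above.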
While there is no fixed rule to choose an algorithm for a classification problem, you can follow these guidelines:
 If accuracy is a concern, test different algorithms and cross-validate them
 If the training dataset is small, use models that have low variance and high bias
 If the training dataset is large, use models that have high variance and low bias
Once a user buys something from Amazon, Amazon stores that purchase data for future reference and finds products that are most likely also to be bought. This is possible because of the Association algorithm, which can identify patterns in a given dataset.
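The two standard association-rule parameters listed in the topics above, support and confidence, can be computed in a few lines of Python; the transaction data is invented for illustration.

```python
# invented purchase transactions, each a set of items bought together
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also
    contain the consequent: support(A and B) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)
```

A rule like "bread implies milk" is worth suggesting when both its support (the itemset is common) and its confidence (buyers of bread usually also buy milk) are high.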