Aditya Gupta, Himanshu Sharma, Priyanshi Singh, Sanchita Paul.

Some 2,500 years ago, the philosopher-mystic Pythagoras claimed that everything can be expressed in numbers. At that time, no one understood him. Today, we are witnessing a digital breakthrough in which machines analyze large amounts of data on decisions made by people in different situations, translate learning algorithms into their own language, and act by analogy with humans. Developments in the fields of AI and Machine Learning are confidently following the path of creating a computer whose cognitive functions are comparable to the human brain. …

LightGBM is another implementation of Gradient Boosting. The "Light" stands for its lightweight design, which makes it fast while remaining highly accurate.

- It is a fast, distributed, high-performance GBM framework based on decision trees.

- It splits the tree leaf-wise (other frameworks such as XGBoost split level-wise).
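The leaf-wise idea above can be sketched in plain Python: instead of splitting every leaf at the current depth (level-wise), we repeatedly split whichever single leaf yields the largest loss reduction. This is a toy illustration of the strategy for 1-D regression with squared-error loss, not LightGBM's actual implementation, and all function names here are invented for the sketch.

```python
# Toy sketch of leaf-wise tree growth (the strategy LightGBM uses),
# assuming 1-D data and squared-error loss. Illustrative only.

def sse(ys):
    """Sum of squared errors around the mean of a leaf."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(pairs):
    """Return (gain, threshold) for the best split of (x, y) pairs."""
    pairs = sorted(pairs)
    base = sse([y for _, y in pairs])
    best = (0.0, None)
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = base - sse(left) - sse(right)
        if gain > best[0]:
            best = (gain, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best

def grow_leaf_wise(pairs, max_leaves=4):
    """Repeatedly split the single leaf with the largest gain
    (leaf-wise), rather than every leaf at a depth (level-wise)."""
    leaves = [pairs]
    while len(leaves) < max_leaves:
        gains = [best_split(leaf) for leaf in leaves]
        idx = max(range(len(leaves)), key=lambda i: gains[i][0])
        gain, thr = gains[idx]
        if thr is None or gain <= 0:
            break
        leaf = leaves.pop(idx)
        leaves.append([p for p in leaf if p[0] <= thr])
        leaves.append([p for p in leaf if p[0] > thr])
    return leaves

# step-shaped toy data: y jumps at x > 5 and again at x > 8
data = [(x, (x > 5) * 10 + (x > 8) * 5) for x in range(10)]
leaves = grow_leaf_wise(data, max_leaves=3)
print(len(leaves))  # 3
```

Because the highest-gain leaf is always split first, deep, uneven trees can emerge quickly, which is exactly why leaf-wise growth tends to reduce loss faster than level-wise growth.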

Natural Language Processing (NLP) is becoming one of the most popular techniques in use today. It is used to program computers to understand, process, and analyze huge amounts of data in natural human language, such as text and speech.

Let us take an example: when we read reviews for a book, we as humans can tell just by looking at them whether they are positive or negative, right? But how do machines understand these sentiments?

This is where Natural Language Processing comes into the picture.

Libraries one can use for NLP are spaCy…
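To make the book-review example concrete, here is a deliberately tiny sketch of how a machine might score sentiment: count words from hand-made positive and negative lists. The word lists and the `sentiment` function are invented for illustration; real libraries such as spaCy or NLTK use far richer models than this.

```python
# Toy lexicon-based sentiment scorer, assuming a tiny hand-made
# vocabulary. Real NLP libraries are far more sophisticated.
POSITIVE = {"great", "wonderful", "loved", "excellent", "amazing"}
NEGATIVE = {"boring", "terrible", "hated", "awful", "dull"}

def sentiment(review: str) -> str:
    words = review.lower().split()
    # +1 for each positive word, -1 for each negative word
    score = sum(w.strip(".,!?") in POSITIVE for w in words) \
          - sum(w.strip(".,!?") in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I loved this book, the plot was excellent!"))  # positive
print(sentiment("Hated it: boring and dull."))                  # negative
```

Even this crude counting scheme shows the core idea: turning free-form text into numbers a machine can compare.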

What is Hierarchical Clustering? By definition, it is an unsupervised method that builds a hierarchy of similar groups, either from the top down or from the bottom up.

There are two major types of this clustering: Agglomerative and Divisive.

I will explain the first type of clustering and then the second. Agglomerative clustering can be understood very well through heat maps.

Clustering can be visualized in various ways, but I have found heat maps to be the most intuitive.

Heat maps are available in the seaborn library; they give us a fair understanding of the correlation between variables.
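Before the heat-map view, the agglomerative (bottom-up) mechanics themselves can be sketched in a few lines: start with every point in its own cluster, then repeatedly merge the two closest clusters. This toy uses single linkage on 1-D points and invented function names; in practice, routines such as `scipy.cluster.hierarchy.linkage` (which seaborn's `clustermap` uses) do this efficiently.

```python
# Minimal sketch of agglomerative (bottom-up) clustering with
# single linkage on 1-D points. Illustrative, not production code.

def single_linkage(a, b):
    """Distance between two clusters = their closest pair of points."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points, n_clusters=2):
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > n_clusters:
        # find the two closest clusters and merge them
        pairs = [(single_linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

clusters = agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], n_clusters=3)
print(clusters)  # [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

Recording the order of merges, rather than stopping at a fixed cluster count, is what produces the dendrogram drawn alongside a clustered heat map.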

In this…

Before we dive into what Regularized Regression is, let us first understand what high bias and overfitting mean.

Let us take a simple example: what will happen if we ask Sachin Tendulkar (the Indian cricketer) to solve a data science problem? He will be unable to do it, right? This is because he knows cricket too well and will thus fail to understand or solve a problem that is not in his domain.

Let us now use this example to understand overfitting in simple terms.

*Overfitting is what happens when the model has understood the data so well and so closely that…*

Before we move on to checking the assumptions, let us first understand why we need to check for assumptions before fitting a model.

Why do we do this? You do not need the assumptions to obtain a best-fit line, but without them your parameters may be biased or have high variance. Violating the assumptions makes the interpretation of regression results much more difficult, and the predictions made by the model can be extremely inefficient. Now that we understand the why, let us see the how.
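As a warm-up for the checks, here is a small sketch of one of them: after fitting an ordinary-least-squares line, the residuals should average to (essentially) zero and show no pattern against the predictors. The numbers below are made up for illustration and are not the 50 start-ups data.

```python
# Hedged sketch: fit a simple OLS line by the closed-form formulas,
# then inspect the residuals (zero-mean errors assumption).
# Data invented for illustration.
from statistics import mean

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 4.0, 6.2, 7.9, 10.1, 12.0]

# closed-form OLS slope and intercept
xm, ym = mean(x), mean(y)
slope = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / \
        sum((xi - xm) ** 2 for xi in x)
intercept = ym - slope * xm

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
print(round(mean(residuals), 10))  # ~0 by construction for OLS
```

Note that a near-zero residual mean is guaranteed by the OLS algebra itself; the interesting assumption checks (normality, homoscedasticity, independence) look at the *pattern* of these residuals, which is what the dataset walkthrough below does.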

I will be using the 50 start-ups dataset to check for the assumptions. …

**What is Machine Learning?**

Machine Learning is the study of computer algorithms that improve automatically through experience and the use of data. In simpler words, Machine Learning is a different approach from Heuristic Learning: the algorithms learn from the training data and make predictions without any explicit programming.
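The phrase "without any explicit programming" can be made concrete with a tiny 1-nearest-neighbour classifier. Nothing below is programmed to recognise a particular class; the decision rule comes entirely from the training examples. The data and names are invented for illustration.

```python
# Tiny illustration of learning from data: 1-nearest-neighbour.
# The rule is induced from examples, not hand-coded.
def predict(train, query):
    """train: list of ((features), label); query: a features tuple."""
    def dist(a, b):
        # squared Euclidean distance between feature tuples
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(train, key=lambda ex: dist(ex[0], query))[1]

# toy training data: (height_cm, weight_kg) -> species label
train = [((20, 4), "cat"), ((22, 5), "cat"),
         ((60, 25), "dog"), ((65, 30), "dog")]

print(predict(train, (21, 4.5)))  # cat
print(predict(train, (63, 28)))   # dog
```

Change the training examples and the predictions change with them: that is the "improves through experience and data" part of the definition.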

One real-life example of Machine Learning is a drowsiness detection system. This is a very advanced example that uses computer vision, OpenCV, and Python to detect whether a person's eyes have been closing for a few seconds! How cool is that? …

Before we begin with A/B testing, it is of the utmost importance that we understand Hypothesis testing first.

In simple words, a hypothesis is an idea that can be tested, and in hypothesis testing we try to reject the *status quo*, thereby looking for an alternative: a change or an innovation.

Let us suppose that we are primarily concerned with using the resulting sample to test some particular hypothesis about the population. As an illustration, suppose that a construction firm has just purchased a large supply of cables that have been guaranteed to have an average breaking strength of at least…

We know that descriptive statistics provide information about our immediate group of data.

For example, we could calculate the mean and standard deviation of the exam marks for the 100 students and this could provide valuable information about this group of 100 students.
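The mean-and-standard-deviation calculation mentioned above takes two lines with Python's standard library. The marks below are invented for illustration (a small stand-in for the 100 students), and `pstdev` is used because the group is being treated as a complete population, as the next paragraph defines.

```python
# Descriptive statistics for a toy "population" of exam marks.
# Numbers are invented for illustration.
from statistics import mean, pstdev

marks = [62, 75, 81, 55, 90, 68, 73, 77]

print(mean(marks))                # average mark
print(round(pstdev(marks), 2))    # population standard deviation
```

If these marks were instead a *sample* drawn from a larger group of interest, `statistics.stdev` (the sample standard deviation, with its n−1 divisor) would be the appropriate choice; that distinction is exactly the population-versus-sample idea introduced next.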

Any group of data like this, which includes all the data you are interested in, is called a **population**. A population can be small or large, as long as it includes all the data you are interested in.

For example, if you were only interested in the exam marks of 100 students, the 100 students would represent…

**Random Variables**

A random variable is a numerical description of the outcome of a statistical experiment.

**Types of Random Variable**

Random variables whose set of possible values can be written either as a finite sequence x1, x2, …, xn, or as an infinite sequence x1, x2, … are said to be discrete. For example, a random variable whose set of possible values is the set of non-negative integers is a discrete random variable.

There are also random variables that take on a continuum of possible values. These are called continuous random variables.
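The two types can be contrasted with Python's `random` module: a die roll takes one of finitely many values (discrete), while `random.random()` models a draw from a continuum on [0, 1) (continuous, up to floating-point precision). The seed and variable names are arbitrary.

```python
# Discrete vs continuous random variables, sketched with stdlib random.
import random

random.seed(0)  # fixed seed so the sketch is reproducible

die = random.randint(1, 6)   # discrete: one of the six values {1,...,6}
u = random.random()          # continuous: any value in [0, 1)

print(die in {1, 2, 3, 4, 5, 6})  # True
print(0.0 <= u < 1.0)             # True
```

For the die, probability is distributed as a mass on each of the six values; for `u`, it is spread as a density over the whole interval, which is the distinction the next paragraph develops.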

The probability distribution for a random variable describes how the probabilities are distributed over the…

Hi, I am Sanchita: an engineer, a math enthusiast, an AlmaBetter Data Science trainee, and a writer at Analytics Vidhya.