Exploratory-data-analysis-using-R

Analyzed the distribution of Indian census data using R, applied the central limit theorem and compared different sampling methods.

Abstract

This project takes a real-life dataset, prepares and pre-processes it, analyses it using various methods, and plots the results.

Introduction

A population census is the complete process of collecting, compiling, analyzing, and disseminating demographic, economic, and social data pertaining, at a specific time, to all persons in a country or a well-defined part of a country. As such, the census provides a snapshot of the country’s population and housing at a given point in time.

The Census of India is a rich database that can tell the stories of over a billion Indians. This dataset was extracted from the 2001 Census and includes data for 590 districts and 34 states, with around 80 variables each.

Source

The dataset comes from Kaggle: https://www.kaggle.com/bazuka/census2001

Analysis for Categorical Data:

Categorical data is qualitative data that describes a property or a quality. Bar plots and pie charts are the usual way to show the frequency of each category. Here we examine the religions.
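As a rough illustration of the idea above, the sketch below counts category frequencies and draws a bar plot and a pie chart. The data frame and its column names (`District`, `MajorityReligion`) are made-up toy values, not the actual columns of the census CSV.

```r
# Hypothetical toy data: the real column names in the census CSV may differ.
census <- data.frame(
  District = c("A", "B", "C", "D", "E", "F"),
  MajorityReligion = c("Hindu", "Muslim", "Hindu", "Sikh", "Hindu", "Christian")
)

# Frequency of each category
religion_counts <- table(census$MajorityReligion)
print(religion_counts)

# Bar plot of the category frequencies
barplot(religion_counts,
        main = "Districts by majority religion (toy data)",
        xlab = "Religion", ylab = "Number of districts")

# Pie chart of the same frequencies
pie(religion_counts,
    main = "Share of districts by majority religion (toy data)")
```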

Analyzing Numeric data:

Numeric data is quantitative data that comes from a numeric measurement. Histograms, bar plots, and dot charts are the usual ways to represent it graphically. Here we examine the number of males in each state.
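A minimal sketch of this kind of summary is shown below. It assumes district-level rows with hypothetical `State` and `Males` columns and uses invented numbers; the real dataset's column names and values may differ.

```r
# Hypothetical toy data: one row per district.
census <- data.frame(
  State = c("S1", "S1", "S2", "S2", "S3"),
  Males = c(1200, 800, 1500, 500, 900)
)

# Total number of males per state
males_by_state <- aggregate(Males ~ State, data = census, FUN = sum)
print(males_by_state)

# Bar plot of the per-state totals
barplot(males_by_state$Males, names.arg = males_by_state$State,
        main = "Males per state (toy data)", xlab = "State", ylab = "Males")

# Histogram of the district-level values
hist(census$Males, main = "Males per district (toy data)", xlab = "Males")

# Dot chart of the per-state totals
dotchart(males_by_state$Males, labels = males_by_state$State,
         main = "Males per state (toy data)", xlab = "Males")
```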

Applying the Central Limit Theorem:

The central limit theorem states that, for a sufficiently large sample size, the distribution of the sample means is approximately normal, whatever the shape of the population's distribution (provided the population has finite variance).
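The theorem can be checked with a small simulation. The sketch below is an assumption-laden illustration: it uses a simulated right-skewed population (exponential) as a stand-in for a census variable, and the sample size and number of samples are arbitrary choices.

```r
set.seed(42)

# Hypothetical skewed "population", standing in for a census variable.
population <- rexp(100000, rate = 1 / 50)

sample_size <- 40     # size of each sample (arbitrary)
n_samples   <- 2000   # number of repeated samples (arbitrary)

# Draw many samples and record the mean of each one
sample_means <- replicate(n_samples, mean(sample(population, sample_size)))

# The mean of the sample means should be close to the population mean,
# and their spread close to sd(population) / sqrt(sample_size).
print(c(population_mean = mean(population),
        mean_of_sample_means = mean(sample_means)))
print(c(theoretical_se = sd(population) / sqrt(sample_size),
        observed_se = sd(sample_means)))

# The histogram of sample means looks roughly bell-shaped
# even though the population itself is strongly skewed.
hist(sample_means, breaks = 40,
     main = "Distribution of sample means (simulated)",
     xlab = "Sample mean")
```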

Sampling Methods:

1. Simple Random Sampling

In simple random sampling, every item in the frame has the same chance of being selected for the sample as every other item.
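One way this could look in R is sketched below, using base R's `sample()` on row indices. The `districts` data frame and its columns are invented for the example.

```r
set.seed(1)

# Hypothetical sampling frame of 100 districts.
districts <- data.frame(id = 1:100,
                        value = round(runif(100, 1000, 5000)))

n <- 10  # desired sample size (arbitrary)

# sample() draws row indices without replacement by default,
# so every row is equally likely to be chosen.
srs_rows <- sample(nrow(districts), size = n)
srs <- districts[srs_rows, ]
print(srs)
```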

2. Systematic Sampling

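In systematic sampling, the N items in the frame are divided into n groups of k = N/n items. The first item is selected at random from the first k items, and every k-th item after it is then selected.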
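A minimal sketch of a systematic sample, assuming the common "random start, then every k-th item" scheme and a frame whose size is a multiple of the sample size. The frame below is invented for illustration.

```r
set.seed(1)

# Hypothetical sampling frame of 100 districts.
districts <- data.frame(id = 1:100)

N <- nrow(districts)  # frame size
n <- 10               # desired sample size (arbitrary)
k <- N %/% n          # sampling interval

# Random starting point within the first k items
start <- sample(k, 1)

# Take every k-th row from the starting point
idx <- seq(from = start, by = k, length.out = n)
systematic <- districts[idx, , drop = FALSE]
print(systematic$id)
```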

3. Stratified Sampling

In stratified sampling, the items from the frame are subdivided into separate subgroups called strata. Simple random samples are selected from each stratum and combined for the desired sample of size n.
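The procedure described above can be sketched in base R with `split()` and `lapply()`. The example treats a hypothetical `State` column as the stratum and draws the same number of rows from each one; both choices are assumptions made for illustration.

```r
set.seed(1)

# Hypothetical frame: 60 districts spread unevenly over three states.
districts <- data.frame(
  id = 1:60,
  State = rep(c("S1", "S2", "S3"), times = c(30, 20, 10))
)

n_per_stratum <- 4  # rows to draw from each stratum (arbitrary)

# Split the frame into strata, draw a simple random sample from each,
# then combine the pieces into one sample.
strata <- split(districts, districts$State)
stratified <- do.call(rbind, lapply(strata, function(s) {
  s[sample(nrow(s), n_per_stratum), ]
}))

print(table(stratified$State))
```

Drawing a fixed count per stratum is only one option; a proportional allocation would instead draw from each stratum in proportion to its size.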

4. Cluster Sampling

In cluster sampling, the data is divided into groups called clusters. Ideally, each cluster is representative of the entire data. A random sample of clusters is then selected, and the items in the selected clusters are analyzed.
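A small sketch of cluster sampling follows. The `Cluster` column, the number of clusters, and the number of clusters drawn are all hypothetical; the sketch keeps every row of each selected cluster.

```r
set.seed(1)

# Hypothetical frame: 60 districts grouped into 6 clusters of 10.
districts <- data.frame(
  id = 1:60,
  Cluster = rep(paste0("C", 1:6), each = 10)
)

n_clusters <- 2  # number of clusters to draw (arbitrary)

# Randomly choose whole clusters, then keep every row in those clusters
chosen <- sample(unique(districts$Cluster), n_clusters)
cluster_sample <- districts[districts$Cluster %in% chosen, ]

print(chosen)
print(nrow(cluster_sample))
```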

5. Confidence Level

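The confidence level is the long-run proportion of confidence intervals, each computed from a new sample, that contain the true population mean. A 95% confidence level means that about 95 out of 100 such intervals would contain it.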
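As an illustration, the sketch below computes a 95% confidence interval for a mean in two ways: with `t.test()` and by hand from the t distribution. The sample is simulated, so the numbers say nothing about the census data.

```r
set.seed(1)

# Hypothetical sample of 50 measurements.
x <- rnorm(50, mean = 100, sd = 15)

# 95% confidence interval for the mean from a one-sample t-test
ci <- t.test(x, conf.level = 0.95)$conf.int
print(ci)

# The same interval computed by hand:
# mean +/- t quantile * standard error
n <- length(x)
margin <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
manual_ci <- c(mean(x) - margin, mean(x) + margin)
print(manual_ci)
```

Raising `conf.level` (for example to 0.99) widens the interval, since a higher confidence level requires covering more of the plausible values for the mean.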

