Analyzed the distribution of Indian census data using R, applied the central limit theorem and compared different sampling methods.
The project takes a real-life dataset, prepares and pre-processes it, and then analyses it with several methods and plots the results.
A population census is the total process of collecting, compiling, analyzing and disseminating demographic, economic and social data pertaining, at a specific time, to all persons in a country or a well-defined part of a country. As such, the census provides a snapshot of the country’s population and housing at a given point in time.
The Census of India is a rich database that can tell the stories of over a billion Indians. This dataset was extracted from the 2001 Census and covers 590 districts and 34 states, with around 80 variables each.
The dataset is taken from Kaggle: https://www.kaggle.com/bazuka/census2001
Categorical data is qualitative data that describes a property or a quality. The frequency of each category is usually shown with a bar plot or a pie chart. Here we look at the religions.
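A minimal sketch of the frequency count behind such a bar plot. The project itself uses R; this illustration is in Python with the standard library only. The labels in `religions` are toy values invented for the example, not rows from the census file, and `frequency_table` is a hypothetical helper.

```python
from collections import Counter

# Toy category labels, invented for illustration (not taken from the census file).
religions = ["Hindu", "Muslim", "Hindu", "Christian", "Sikh", "Hindu", "Muslim"]

def frequency_table(values):
    """Return (category, count, share) rows sorted by descending count."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(category, n, n / total) for category, n in counts.most_common()]

for category, count, share in frequency_table(religions):
    # A text bar stands in for the bar plot: one '#' per observation.
    print(f"{category:<10} {count:>2} {share:6.1%} {'#' * count}")
```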
Numerical data is quantitative data that comes from a numeric measurement. It is usually shown with a histogram, a bar plot or a dot chart. Here we look at the number of males in each state.
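As a rough illustration of the per-state aggregation, the sketch below sums a numeric column by group and prints a text bar chart. It is written in Python (standard library only) rather than the project's R; the state names and counts in `rows` are made up, and `total_by_group` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical district-level rows: (state, male population). Values are made up.
rows = [
    ("StateA", 1200), ("StateA", 800),
    ("StateB", 500), ("StateB", 700),
    ("StateC", 1500),
]

def total_by_group(records):
    """Sum the numeric value of each record within its group."""
    totals = defaultdict(int)
    for group, value in records:
        totals[group] += value
    return dict(totals)

males_by_state = total_by_group(rows)

# Scale the bars so that the largest total is 40 characters wide.
scale = max(males_by_state.values()) / 40
for state, total in sorted(males_by_state.items()):
    print(f"{state:<8} {total:>6} {'#' * round(total / scale)}")
```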
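The central limit theorem states that, for a sufficiently large sample size, the distribution of the sample means drawn from a population is approximately normal, regardless of the shape of the population's own distribution (provided its variance is finite).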
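The theorem can be checked with a small simulation. The sketch below is in Python (standard library only) rather than the project's R, and uses a synthetic, strongly skewed population instead of the census data. The sample size of 50 and the 2,000 repetitions are arbitrary choices for the illustration.

```python
import random
import statistics

def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples samples (with replacement) and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]

# Synthetic, right-skewed population (exponential), clearly not normal.
rng = random.Random(42)
population = [rng.expovariate(1.0) for _ in range(10_000)]

sample_size = 50
means = sample_means(population, sample_size, n_samples=2_000)

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

# The sample means should centre on the population mean, with a spread
# close to sigma / sqrt(n).
print("population mean     :", round(pop_mean, 3))
print("mean of sample means:", round(statistics.fmean(means), 3))
print("sd of sample means  :", round(statistics.stdev(means), 3))
print("sigma / sqrt(n)     :", round(pop_sd / sample_size ** 0.5, 3))
```

Plotting a histogram of `means` for increasing values of `sample_size` shows the shape moving closer to the normal curve.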
In simple random sampling, every item in the frame has the same chance of being selected for the sample as every other item.
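A minimal sketch of simple random sampling without replacement, in Python (standard library only) rather than the project's R. The frame is a list of stand-in IDs for the 590 districts; `simple_random_sample` is a hypothetical helper.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Pick n distinct items; every item is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Stand-in IDs 1..590, one per district (not the real district codes).
frame = list(range(1, 591))
sample = simple_random_sample(frame, 10, seed=1)
print(sample)
```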
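In systematic sampling, sample members are selected from a larger population at a fixed interval after a random starting point: the N items in the frame are split into n groups of k = N/n items, one item is picked at random from the first group, and every k-th item after it is taken.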
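A sketch of systematic selection, in Python (standard library only) rather than the project's R. It assumes the interval is k = N // n (integer division) and that the start is drawn uniformly from the first k positions; the frame is a list of stand-in district IDs and `systematic_sample` is a hypothetical helper.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Take every k-th item after a random start, where k = len(frame) // n."""
    if not 0 < n <= len(frame):
        raise ValueError("n must be between 1 and the size of the frame")
    k = len(frame) // n
    start = random.Random(seed).randrange(k)  # random position in the first group
    # Slicing with step k yields at least n items; keep exactly n of them.
    return frame[start::k][:n]

# Stand-in IDs 1..590, one per district (not the real district codes).
frame = list(range(1, 591))
sample = systematic_sample(frame, 10, seed=7)
print(sample)
```

With 590 items and n = 10 the interval is 59, so consecutive sample members are always 59 positions apart.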
In stratified sampling, the items from the frame are subdivided into separate subgroups called strata. Simple random samples are selected from each stratum and combined for the desired sample of size n.
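A sketch of stratified sampling with proportional allocation, in Python (standard library only) rather than the project's R. The allocation rule (the same fraction from every stratum, at least one item each) is an assumption of this example, and rounding can make the total differ slightly from the target n. The frame, the state names and `stratified_sample` are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(items, key, fraction, seed=None):
    """Draw a simple random sample of the given fraction from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    sample = []
    for name in sorted(strata):
        members = strata[name]
        size = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical frame of (district_id, state) pairs; states act as strata.
frame = (
    [(i, "StateA") for i in range(60)]
    + [(i, "StateB") for i in range(60, 90)]
    + [(i, "StateC") for i in range(90, 100)]
)
sample = stratified_sample(frame, key=lambda row: row[1], fraction=0.1, seed=3)
print(sample)
```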
In cluster sampling, the data is divided into groups called clusters. Ideally, each cluster is representative of the entire data set. A random sample of clusters is then drawn, and the items in the selected clusters are analyzed.
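A sketch of one-stage cluster sampling, in Python (standard library only) rather than the project's R. It assumes that every item of a selected cluster is kept; the frame, the cluster names and `cluster_sample` are hypothetical.

```python
import random
from collections import defaultdict

def cluster_sample(items, cluster_of, n_clusters, seed=None):
    """Pick n_clusters clusters at random and keep all of their items."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for item in items:
        clusters[cluster_of(item)].append(item)

    chosen = rng.sample(sorted(clusters), n_clusters)
    selected_items = [item for name in chosen for item in clusters[name]]
    return chosen, selected_items

# Hypothetical frame: 5 clusters (here, states) with 4 districts each.
frame = [(i, f"State{i // 4}") for i in range(20)]
chosen, sample = cluster_sample(frame, cluster_of=lambda row: row[1], n_clusters=2, seed=5)
print(chosen)
print(sample)
```

Unlike stratified sampling, where every subgroup contributes some items, here whole subgroups are either included or left out.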
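The confidence level is the long-run proportion of confidence intervals, each built in the same way from a fresh sample, that contain the true population mean. At a 95% confidence level, for example, about 95 out of 100 such intervals would be expected to cover the population mean.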
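The interpretation can be checked by simulation. The sketch below is in Python (standard library only) rather than the project's R. It builds a normal-approximation interval, mean ± z · s / sqrt(n), which is an assumption of this example (a t-based interval would be slightly wider for small samples), and counts how often it covers the mean of a synthetic population.

```python
import random
import statistics
from statistics import NormalDist

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / n ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return mean - z * standard_error, mean + z * standard_error

# Synthetic population, used only to illustrate coverage.
rng = random.Random(0)
population = [rng.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.fmean(population)

trials = 1_000
hits = 0
for _ in range(trials):
    low, high = mean_confidence_interval(rng.sample(population, 50))
    if low <= true_mean <= high:
        hits += 1

coverage = hits / trials
print("share of intervals containing the population mean:", coverage)
```

The printed share should land close to the chosen confidence level.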