R Statistics

Statistics is the science of analyzing, reviewing, and drawing conclusions from data.

Some basic statistical numbers include:
Mean, median and mode
Minimum and maximum value
Percentiles
Variance and Standard Deviation
Covariance and Correlation
Probability distributions
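
As a quick orientation, here is a minimal sketch that computes a few of the quantities listed above with base R functions. The choice of the wt and hp columns from the built-in mtcars data set is an assumption made purely for illustration.

```r
# Two numeric columns from the built-in mtcars data set (illustrative choice)
wt <- mtcars$wt   # weight, in 1000 lbs
hp <- mtcars$hp   # gross horsepower

var(wt)       # sample variance
sd(wt)        # sample standard deviation (square root of the variance)
cov(wt, hp)   # sample covariance of the two variables
cor(wt, hp)   # Pearson correlation, always between -1 and 1
```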

The R language was developed by two statisticians, Ross Ihaka and Robert Gentleman. It has many built-in functions, in addition to libraries, designed specifically for statistical analysis.

Min and Max

The min() and max() functions can be used to find the lowest or highest value in a set. Let's consider the mtcars data set and find the largest and smallest values of the variable hp (horsepower):
> min(mtcars$hp)
[1] 52
> max(mtcars$hp)
[1] 335

Now we know that the largest horsepower value in the set is 335, and the lowest is 52. We could take a look at the data set and try to find out which cars these two values belong to.

For example, we can use the which.max() and which.min() functions to find the index position of the max and min value in the table:
> which.max(mtcars$hp)
[1] 31
> which.min(mtcars$hp)
[1] 19

Or even better, combine which.max() and which.min() with the rownames() function to get the name of the car with the largest and smallest horsepower:
> rownames(mtcars)[which.max(mtcars$hp)]
[1] "Maserati Bora"
> rownames(mtcars)[which.min(mtcars$hp)]
[1] "Honda Civic"
Now we know for sure:
Maserati Bora is the car with the highest horsepower, and Honda Civic is the car with the lowest horsepower.
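
The same lookup can be condensed a little. The sketch below is only one possible way to do it: it uses range() to get both extremes in one call and then pulls the full rows for the two cars. The variable names hp_range and extremes, and the selection of columns, are illustrative assumptions.

```r
# range() returns the minimum and the maximum in a single call
hp_range <- range(mtcars$hp)
hp_range   # 52 335

# Full rows for the cars with the lowest and highest horsepower
idx <- c(which.min(mtcars$hp), which.max(mtcars$hp))
extremes <- mtcars[idx, c("hp", "cyl", "wt")]
extremes
```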

min() and max() in R with NA Values

While working on a large data set, we may encounter NA (Not Available) values in a vector.

In this case, min() returns NA instead of the smallest number whenever an NA is present. For example:
numbers <- c(2, NA, 6, 7, NA, 10)
# return smallest value 
min(numbers) # NA

Output
[1]  NA

Here, we get NA as output. But that is not the desired output.

We can handle this using the na.rm argument:
numbers <- c(2, NA, 6, 7, NA, 10)
# return smallest value 
min(numbers, na.rm = TRUE) # 2

Output
[1] 2

Here, we have used the na.rm argument to handle NA values.

By setting na.rm to TRUE, we have removed NA before the computation. So the output will be 2 not NA.

Note: Similar to min(), we can use max() with NA values too.
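
To illustrate the note above, here is a short sketch that applies the same idea to max(), using the same numbers vector as before. The extra calls to range() and is.na() are additions for illustration.

```r
numbers <- c(2, NA, 6, 7, NA, 10)

max(numbers)                   # NA, because missing values propagate
max(numbers, na.rm = TRUE)     # 10
range(numbers, na.rm = TRUE)   # 2 10

# Count how many values are missing before deciding to drop them
sum(is.na(numbers))            # 2
```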

Outliers

Max and min can also be used to detect outliers. An outlier is a data point that differs markedly from the rest of the observations.

Examples of data points that would have been outliers in the mtcars data set:
If the maximum number of forward gears of a car was 11
If the minimum horsepower of a car was 0
If the maximum weight of a car was 50 000 lbs
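
A simple plausibility check along these lines might look like the sketch below. The limits are hypothetical and only mirror the examples above; they are not a general rule for detecting outliers. Note that wt is recorded in units of 1000 lbs, so 50 000 lbs corresponds to 50.

```r
# Hypothetical plausibility limits, chosen only to mirror the examples above
implausible <- mtcars$gear > 10 | mtcars$hp <= 0 | mtcars$wt > 50

# Number of flagged rows, and the names of the flagged cars (if any)
sum(implausible)
rownames(mtcars)[implausible]
```

In the unmodified mtcars data set no car is flagged, which is what we would expect from clean data.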

Mean, Median, and Mode

In statistics, there are often three values that interest us:
Mean - The average value
Median - The middle value
Mode - The most common value

Mean

To calculate the average value (mean) of a variable from the mtcars data set, find the sum of all values, and divide the sum by the number of values.
The mean() function in R can do it for you:
Example:
> mean(mtcars$wt)
[1] 3.21725
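
To make the definition concrete, the following sketch computes the mean by hand (sum divided by count) and compares it with mean(). The variable names are illustrative.

```r
wt <- mtcars$wt

# Sum of all values divided by the number of values
manual_mean <- sum(wt) / length(wt)
manual_mean

# Should agree with the built-in function
all.equal(manual_mean, mean(wt))
```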

Median

The median value is the value in the middle, after you have sorted all the values. The wt variable (from the mtcars data set) has 32 values, an even number, so after sorting there are two numbers in the middle.

Note: If there are two numbers in the middle, you must divide the sum of those numbers by two, to find the median.

Luckily, R has a function that does all of that for you: Just use the median() function to find the middle value:
Example:
> median(mtcars$wt)
[1] 3.325
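
The steps described above (sort, pick the two middle values, average them) can be sketched by hand as follows. This assumes an even number of values, as is the case for wt; the variable names are illustrative.

```r
sorted_wt <- sort(mtcars$wt)
n <- length(sorted_wt)                 # 32, an even number

# The two values in the middle of the sorted vector
middle <- sorted_wt[c(n / 2, n / 2 + 1)]
middle                                 # 3.215 3.435

# Their average is the median
manual_median <- sum(middle) / 2
all.equal(manual_median, median(mtcars$wt))
```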

Mode

The mode value is the value that appears most often.

R does not have a built-in function to calculate the statistical mode (the built-in mode() function returns the storage type of an object instead). However, we can write our own expression or function to find it.

If we take a look at the values of the wt variable (from the mtcars data set), we will see that the number 3.440 appears often:


> names(sort(-table(mtcars$wt))[1])
[1] "3.44"
From the example above, we now know that the value that appears most often in the mtcars wt variable is 3.44, or 3 440 lbs.
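
If the mode is needed more than once, the one-liner can be wrapped in a small helper. The function below is a sketch, and stat_mode is a hypothetical name, not part of base R. Unlike the names(table(...)) approach, it returns the value in its original type rather than as a character string. When several values are tied, it returns the one that appears first in the data.

```r
# Hypothetical helper: most frequent value of a vector
stat_mode <- function(x) {
  ux <- unique(x)                    # distinct values, in order of appearance
  counts <- tabulate(match(x, ux))   # how often each distinct value occurs
  ux[which.max(counts)]              # first value with the highest count
}

stat_mode(mtcars$wt)          # 3.44
stat_mode(c("a", "b", "b"))   # "b"
```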

Percentiles

A percentile is the value below which a given percentage of the observations fall.

Consider the wt (weight) variable from the mtcars data set. What is the 75th percentile of the weight of the cars? The answer is 3.61, or 3 610 lbs, meaning that 75% of the cars weigh 3 610 lbs or less:

> quantile(mtcars$wt, c(0.75))
 75% 
3.61 
If you run the quantile() function without specifying the probs argument, you will get the percentiles of 0, 25, 50, 75 and 100:
> quantile(mtcars$wt)
     0%     25%     50%     75%    100% 
1.51300 2.58125 3.32500 3.61000 5.42400 
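
The probs argument also accepts several probabilities at once. The sketch below requests the 10th and 90th percentiles (an arbitrary choice for illustration) and then checks the interpretation of the 75th percentile by counting the share of cars at or below it.

```r
# Several percentiles in one call; the probabilities are an illustrative choice
q <- quantile(mtcars$wt, probs = c(0.10, 0.90))
q

# Share of cars whose weight is at or below the 75th percentile
q75 <- quantile(mtcars$wt, probs = 0.75)[[1]]
share_below <- mean(mtcars$wt <= q75)
share_below   # 0.75
```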

Quartiles

Quartiles are the values that divide the data into four parts, when sorted in ascending order:
The value of the first quartile cuts off the first 25% of the data
The value of the second quartile cuts off the first 50% of the data
The value of the third quartile cuts off the first 75% of the data
The value of the fourth quartile cuts off 100% of the data, so it is the maximum value


Use the quantile() function to get the quartiles.
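
For instance, a minimal sketch using the wt variable of mtcars (an illustrative choice) requests the four quartiles explicitly and confirms that the second quartile is the median and the fourth is the maximum:

```r
# The four quartiles of the weight variable
quartiles <- quantile(mtcars$wt, probs = c(0.25, 0.50, 0.75, 1))
quartiles

# The second quartile equals the median, the fourth equals the maximum
quartiles[["50%"]] == median(mtcars$wt)
quartiles[["100%"]] == max(mtcars$wt)
```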

Statistical Summary of Data in R

We use the summary() function to get statistical information about the dataset.

The summary() function returns six statistical summaries:
Min
First Quartile
Median
Mean
Third Quartile
Max

Let's take a look at an example:
# get statistical summary of Temp variable 
summary(airquality$Temp)

Output 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  56.00   72.00   79.00   77.88   85.00   97.00 

In the above example, we have used the summary() function to get a statistical summary of the Temp variable of the airquality dataset.

Here,
Min - is the minimum value i.e. 56.00
1st Qu. - is the first quartile i.e. 72.00
Median - is the median value i.e. 79.00
Mean - is the mean value i.e. 77.88
3rd Qu. - is the third quartile i.e. 85.00
Max - is the maximum value i.e. 97.00
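
The result of summary() on a numeric vector is a named vector, so individual statistics can be picked out by name. The sketch below shows this; the variable name s is illustrative.

```r
s <- summary(airquality$Temp)

names(s)       # "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."
s[["Min."]]    # 56
s[["Max."]]    # 97
s[["Mean"]]    # the mean, printed as 77.88 in the summary output

# summary() also works on several columns of a data frame at once
summary(airquality[, c("Temp", "Wind")])
```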


