R Factors


A factor is a data structure for fields that take only a predefined, finite number of values (categorical data). For example, a data field such as marital status may contain only the values single, married, separated, divorced, or widowed.

In such cases, we know the possible values beforehand; these predefined, distinct values are called levels. Examples of factors are:
Demography: Male/Female
Music: Rock, Pop, Classic, Jazz
Training: Strength, Stamina

To create a factor, use the factor() function and pass a vector as the argument:

Example:
# Create a factor
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

# Print the factor
music_genre
Output:
[1] Jazz    Rock    Classic Classic Pop     Jazz    Rock    Jazz   
Levels: Classic Jazz Pop Rock

The example above shows that the factor has four levels (categories): Classic, Jazz, Pop and Rock. By default, factor() sorts the levels alphabetically.

To only print the levels, use the levels() function:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

levels(music_genre)

Output:
[1] "Classic" "Jazz"    "Pop"     "Rock" 

Factors are closely related to vectors. In fact, factors are stored as integer vectors, which is visible in the output of str():
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

str(music_genre)
Output:
Factor w/ 4 levels "Classic","Jazz",..: 2 4 1 1 3 2 4 2

The levels are stored in a character vector, and each element is stored as an integer index into that vector (for example, 2 stands for Jazz).
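The relationship between the integer codes and the levels can be checked by hand. The following is a minimal sketch; the variable names codes and labels are only illustrative:

```r
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

# Integer codes stored inside the factor (positions in the levels vector)
codes <- as.integer(music_genre)
codes                                   # 2 4 1 1 3 2 4 2

# Indexing the levels by those codes rebuilds the original labels
labels <- levels(music_genre)[codes]
labels

# as.character() gives the same labels directly
identical(labels, as.character(music_genre))   # TRUE
```

When the labels are needed as plain text, as.character() is usually the safer choice, because as.integer() returns the codes and not the values.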

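Factors can also be created when we read non-numerical columns into a data frame.

In versions of R before 4.0.0, the data.frame() function converted character vectors into factors by default, and we had to pass the argument stringsAsFactors = FALSE to suppress this behavior. Since R 4.0.0 the default is stringsAsFactors = FALSE, so we have to pass stringsAsFactors = TRUE if we want character columns converted to factors.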
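Setting the argument explicitly makes the result independent of the default in the installed R version. A small sketch, where genres, df_chr and df_fct are illustrative names:

```r
genres <- c("Jazz", "Rock", "Classic")

# Keep the column as a character vector
df_chr <- data.frame(genre = genres, stringsAsFactors = FALSE)

# Convert the column to a factor
df_fct <- data.frame(genre = genres, stringsAsFactors = TRUE)

class(df_chr$genre)   # "character"
class(df_fct$genre)   # "factor"
```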

Factor Length

Use the length() function to find out how many items there are in the factor:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

length(music_genre)
Output:
[1] 8
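Note that length() counts items, not categories. As a short sketch, the base functions nlevels() and table() can be used alongside it to count the levels and the items per level (the name counts is illustrative):

```r
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

length(music_genre)    # 8 items in total
nlevels(music_genre)   # 4 distinct levels

# Number of items in each level
counts <- table(music_genre)
counts
```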

Access Factors

To access the items in a factor, refer to the index number, using [] brackets:

Example:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
# Access the third item
music_genre[3]
# Access the first and second items
music_genre[c(1, 2)]
# Access all items except the first
music_genre[-1]
# Use a logical vector to access items
music_genre[c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)]
Output:
[1] Classic
Levels: Classic Jazz Pop Rock
[1] Jazz Rock
Levels: Classic Jazz Pop Rock
[1] Rock    Classic Classic Pop     Jazz    Rock    Jazz   
Levels: Classic Jazz Pop Rock
[1] Jazz    Classic
Levels: Classic Jazz Pop Rock
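In practice the logical vector is rarely typed out by hand; it usually comes from a comparison. A minimal sketch, with is_jazz as an illustrative name:

```r
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

# Comparing a factor with a level name gives a logical vector
is_jazz <- music_genre == "Jazz"

# Use it to select the matching items
music_genre[is_jazz]

# Positions of the matching items
which(is_jazz)   # 1 6 8
```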

Levels can also be predefined by the programmer. This fixes their order and allows levels that do not occur in the data:
# Create a factor with levels defined by the programmer
gender <- factor(c("female", "male", "male", "female"),
                 levels = c("female", "transgender", "male"))
gender

Output:
[1] female male   male   female
Levels: female transgender male
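A predefined level that never occurs in the data is still part of the factor. The sketch below shows this with table(), and uses the base function droplevels() to remove unused levels; counts and gender_used are illustrative names:

```r
gender <- factor(c("female", "male", "male", "female"),
                 levels = c("female", "transgender", "male"))

# Unused levels are reported with a count of zero
counts <- table(gender)
counts

# droplevels() returns a copy without the unused levels
gender_used <- droplevels(gender)
levels(gender_used)   # "female" "male"
```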

Change Item Value

To change the value of a specific item, refer to the index number:

Example:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

music_genre[3] <- "Pop"

music_genre
Output:
[1] Jazz    Rock    Pop     Classic Pop     Jazz    Rock    Jazz   
Levels: Classic Jazz Pop Rock

Note that you cannot assign a value that is not one of the factor's levels. The following example produces a warning, and the item becomes NA:

Example:

Try to change the value of the third item ("Classic") to a value that is not a predefined level ("Opera"):
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

music_genre[3] <- "Opera"

music_genre[3]

Output:
Warning message:
In `[<-.factor`(`*tmp*`, 3, value = "Opera") :
  invalid factor level, NA generated
[1] <NA>
Levels: Classic Jazz Pop Rock


However, if the value is already listed in the levels argument, the assignment works:

Example:

Change the value of the third item:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"), levels = c("Classic", "Jazz", "Pop", "Rock", "Opera"))

music_genre[3] <- "Opera"

music_genre[3]
Output:
[1] Opera
Levels: Classic Jazz Pop Rock Opera
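If the factor already exists, its levels can also be extended afterwards instead of recreating it. A minimal sketch of this common idiom, which appends "Opera" to the existing levels before assigning it:

```r
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

# Append the new level to the existing ones
levels(music_genre) <- c(levels(music_genre), "Opera")

# The assignment now succeeds without a warning
music_genre[3] <- "Opera"

music_genre[3]
levels(music_genre)   # "Classic" "Jazz" "Pop" "Rock" "Opera"
```

The new level must be appended to the end of the vector; reordering the names in this assignment would relabel the existing items.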


