
Vector and data frame indexing

Steve Bagley

somgen223.stanford.edu 1


More about vectors


Recycling rule

• When operating on multiple vectors of different lengths, R will reuse values from the shorter vector if there are insufficient ones, wrapping around to its start.

• This is a common cause of confusion (and bugs), so be careful.


Recycling rule examples

c(1, 2) + c(3, 4) # length 2 + length 2
[1] 4 6
1 + c(3, 4) # length 1 + length 2
[1] 4 5
c(1, 2, 3, 4, 5) + c(3, 4) # length 5 + length 2
Warning in c(1, 2, 3, 4, 5) + c(3, 4): longer object length is not a multiple of shorter object length
[1] 4 6 6 8 8
## which is as if you had typed (but without a warning):
c(1, 2, 3, 4, 5) + c(3, 4, 3, 4, 3) # length 5 + length 5
[1] 4 6 6 8 8
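Because partial recycling only produces a warning, it can help to check lengths explicitly. The following is a minimal sketch in base R; add_strict is a hypothetical helper written for this illustration and is not part of R or of the course material.

```r
# rep_len() shows what recycling does to the shorter vector.
rep_len(c(3, 4), 5)   # 3 4 3 4 3

# Hypothetical helper: add two vectors, but stop instead of warning
# when the longer length is not a multiple of the shorter length.
add_strict <- function(a, b) {
  longer <- max(length(a), length(b))
  shorter <- min(length(a), length(b))
  if (shorter == 0 || longer %% shorter != 0) {
    stop("lengths ", length(a), " and ", length(b), " do not recycle evenly")
  }
  a + b
}

add_strict(c(1, 2, 3, 4), c(10, 20))   # 11 22 13 24

# The uneven case from the slide now fails loudly; capture the message.
result <- tryCatch(
  add_strict(c(1, 2, 3, 4, 5), c(3, 4)),
  error = function(e) conditionMessage(e)
)
result
```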


Indexing a vector: positive integers index those elements of the vector

(x <- c(9, 12, 6, 10, 10, 16, 8, 4))
[1] 9 12 6 10 10 16 8 4
x[1]
[1] 9
x[2:4]
[1] 12 6 10
x[c(3, 1)]
[1] 6 9
index <- c(1, 1, 1, 2, 2, 3)
x[index]
[1] 9 9 9 12 12 6

• Indexing returns a new vector containing the selected elements. It does not change the original vector.

• Brackets [ ] are used for indexing.
• R starts counting vector indices from 1.
• You can index using a multi-element vector.
• The length of the result is the length of the index vector.
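The points above can be checked directly. This is a small sketch that reuses the slide's x and index; the variable name picked is only for illustration.

```r
x <- c(9, 12, 6, 10, 10, 16, 8, 4)
index <- c(1, 1, 1, 2, 2, 3)

picked <- x[index]
length(picked) == length(index)   # TRUE: one result per index value
length(x)                         # 8: x itself is unchanged

# To change x, assign the indexed result back to it.
x <- x[2:4]
x                                 # 12 6 10
```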


Indexing a vector: logical values pick those vector elements corresponding to TRUE

x
[1] 9 12 6 10 10 16 8 4
x >= 11
[1] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
x[x >= 11]
[1] 12 16

• Logical values are either TRUE or FALSE.
• They are typically produced by using a comparison operator or similar test.
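Logical vectors can also be combined and inspected before they are used as an index. A short sketch, using the same x; the name keep is chosen only for this example.

```r
x <- c(9, 12, 6, 10, 10, 16, 8, 4)

keep <- x >= 8 & x <= 12   # element-wise AND of two comparisons
x[keep]                    # 9 12 10 10 8
which(keep)                # positions of the TRUE values: 1 2 4 5 7
sum(keep)                  # how many elements were selected: 5
```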


Indexing a vector: negative integers leave out those elements of the vector

x
[1] 9 12 6 10 10 16 8 4
x[1]
[1] 9
x[-1]
[1] 12 6 10 10 16 8 4
x[-length(x)]
[1] 9 12 6 10 10 16 8
x[c(-1, -length(x))]
[1] 12 6 10 10 16 8

• You can’t mix positive and negative vector indices in a single index expression. R will complain.

• What about using 0 as an index? It is ignored.
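Both remarks can be tried out safely. The sketch below uses tryCatch so that the error from mixing signs does not stop the script; the name mixed is only for illustration.

```r
x <- c(9, 12, 6, 10, 10, 16, 8, 4)

x[0]         # an empty numeric vector: the 0 selects nothing
x[c(0, 2)]   # 12: the 0 is ignored

# Mixing positive and negative indices is an error; catch it to inspect it.
mixed <- tryCatch(x[c(-1, 2)], error = function(e) "error")
mixed        # "error"
```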


Exercise: mean values

Using the vector x with values (9, 12, 6, 10, 10, 16, 8, 4):
• Select out those values greater than the mean.


Answer: mean values

x <- c(9, 12, 6, 10, 10, 16, 8, 4)
x > mean(x)
[1] FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE
x[x > mean(x)]
[1] 12 10 10 16


Indexing a vector with an out-of-bounds index

x
[1] 9 12 6 10 10 16 8 4
x[20]
[1] NA

• An out-of-bounds index does not cause an error.
• It returns NA.
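Since the NA appears silently, it can propagate into later calculations. One possible guard, sketched in base R, is to filter the indices against length(x) first; idx and in_range are names made up for this example.

```r
x <- c(9, 12, 6, 10, 10, 16, 8, 4)
idx <- c(2, 20)

x[idx]        # 12 NA
sum(x[idx])   # NA: the missing value spreads into the result

# Keep only the indices that fall inside the vector.
in_range <- idx >= 1 & idx <= length(x)
x[idx[in_range]]   # 12
```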


Assigning to an out-of-bounds position

x
[1] 9 12 6 10 10 16 8 4
x[10] <- 333
x
[1] 9 12 6 10 10 16 8 4 NA 333

• Assigning to an out-of-bounds position creates that position and all the positions up to it; the new intermediate positions are filled with NA.
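A quick way to see the effect is to look at the length and the NA positions afterwards. The sketch also shows appending at the next free position, which leaves no gap; y is a made-up vector for this example.

```r
x <- c(9, 12, 6, 10, 10, 16, 8, 4)
x[10] <- 333
length(x)         # 10: the vector grew by two positions
which(is.na(x))   # 9: the skipped position holds NA

# Appending at length + 1 extends the vector without creating NA values.
y <- c(9, 12, 6)
y[length(y) + 1] <- 333
y                 # 9 12 6 333
```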


Indexing a data frame


Set up data

(gene_exp1 <- read_csv(str_c(data_dir, "gene_exp1.csv")))
# A tibble: 3 x 3
  gene   control treatment
  <chr>    <dbl>     <dbl>
1 ABC123       0         1
2 DEF234      10         3
3 GKK7        12        13


Turn a data frame column into a vector

gene_exp1$gene
[1] "ABC123" "DEF234" "GKK7"
gene_exp1$gene_name
Warning: Unknown or uninitialised column: `gene_name`.
NULL

• Use $ when you need to explicitly refer to the column by name.
• Note that using a non-existent name will issue a warning, and return the value NULL. This is a common source of bugs.
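One way to avoid the silent NULL is to check the name before extracting. The sketch below rebuilds the slide's data as a base R data frame so that it runs without extra packages (an assumption; the slides read a tibble from a CSV file), and get_column is a hypothetical helper. Note that a base data frame returns NULL for an unknown name without even a warning.

```r
# Stand-in for the slide's gene_exp1, built by hand as a base data frame.
gene_exp1 <- data.frame(
  gene = c("ABC123", "DEF234", "GKK7"),
  control = c(0, 10, 12),
  treatment = c(1, 3, 13),
  stringsAsFactors = FALSE
)

is.null(gene_exp1$gene_name)       # TRUE: no such column
"gene_name" %in% names(gene_exp1)  # FALSE

# Hypothetical helper: stop with a clear message instead of returning NULL.
get_column <- function(df, name) {
  if (!name %in% names(df)) {
    stop("no column named '", name, "'")
  }
  df[[name]]
}

get_column(gene_exp1, "gene")      # "ABC123" "DEF234" "GKK7"
missing_col <- tryCatch(
  get_column(gene_exp1, "gene_name"),
  error = function(e) conditionMessage(e)
)
missing_col
```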


Select column(s) by name

gene_exp1[, "gene"]
# A tibble: 3 x 1
  gene
  <chr>
1 ABC123
2 DEF234
3 GKK7
gene_exp1[, c("treatment", "control")]
# A tibble: 3 x 2
  treatment control
      <dbl>   <dbl>
1         1       0
2         3      10
3        13      12

• Use [row, col] format for a data frame.
• You can leave out row or col.
• This returns a data frame, perhaps with only a single column.
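The last point describes the tibble used in the slides. A base R data frame behaves differently: selecting a single column with [ , ] drops the result to a plain vector unless drop = FALSE is given. A brief sketch, with the data rebuilt by hand as a base data frame (an assumption made so the example is self-contained):

```r
gene_df <- data.frame(
  gene = c("ABC123", "DEF234", "GKK7"),
  control = c(0, 10, 12),
  treatment = c(1, 3, 13),
  stringsAsFactors = FALSE
)

one_col <- gene_df[, "gene"]               # base data frame: a character vector
is.data.frame(one_col)                     # FALSE

kept <- gene_df[, "gene", drop = FALSE]    # stays a one-column data frame
is.data.frame(kept)                        # TRUE

two_cols <- gene_df[, c("treatment", "control")]
names(two_cols)                            # "treatment" "control"
```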


Select column(s) by number

gene_exp1[, c(2, 1)]
# A tibble: 3 x 2
  control gene
    <dbl> <chr>
1       0 ABC123
2      10 DEF234
3      12 GKK7

• You can refer to columns by number, starting from 1.


Select rows or columns by number

z <- c(2, 3)
gene_exp1[z, ]
# A tibble: 2 x 3
  gene   control treatment
  <chr>    <dbl>     <dbl>
1 DEF234      10         3
2 GKK7        12        13
gene_exp1[, z]
# A tibble: 3 x 2
  control treatment
    <dbl>     <dbl>
1       0         1
2      10         3
3      12        13


Select row and column

gene_exp1[1, 2]
# A tibble: 1 x 1
  control
    <dbl>
1       0

• The result is a one-row, one-column data frame.
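When the bare value is wanted rather than a 1 x 1 data frame, one option is to extract the column as a vector and then index it. The sketch uses a hand-built base data frame as a stand-in for gene_exp1 (an assumption); drop = FALSE is added so the base data frame mimics the tibble result shown above.

```r
gene_df <- data.frame(
  gene = c("ABC123", "DEF234", "GKK7"),
  control = c(0, 10, 12),
  treatment = c(1, 3, 13),
  stringsAsFactors = FALSE
)

cell <- gene_df[1, 2, drop = FALSE]   # one-row, one-column data frame
dim(cell)                             # 1 1

value <- gene_df[["control"]][1]      # the number itself
value                                 # 0
```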


Use of [[ ]]

## Use name explicitly
gene_exp1[["gene"]]
[1] "ABC123" "DEF234" "GKK7"
## Set a variable to the column name
col <- "treatment"
gene_exp1[[col]]
[1] 1 3 13

• [[ ]] returns a single data frame column as a vector.


Comparing $ and [[ ]]

x <- 2
df$x
df[[x]]
df[["x"]]

• The first expression returns the column named x.
• The second expression returns the second column, because x has the value 2.
• The third expression returns the column named x, using quotes around the column name.
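A concrete data frame makes the difference visible. In this sketch the data frame df and its columns are invented for illustration, and x is set to 3 so that the column named x and the column at position x are different columns.

```r
df <- data.frame(
  a = c(1, 2),
  x = c("p", "q"),
  b = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)
x <- 3

df$x        # the column named x: "p" "q"
df[[x]]     # the third column, because x has the value 3: TRUE FALSE
df[["x"]]   # the column named x again: "p" "q"
```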


Factors (repeated from day 4)

• Factors are a powerful, but sometimes perplexing, way to work with discrete-valued data.

• The possible values of a factor are drawn from a finite set of alternatives or categories. Factors are often used in graphics and analysis for grouping.

• Example: encoding the sex of a human subject as either M or F and grouping by sex.

• Example: encoding the names of the fifty US states and grouping by state.

• Note that many measured values are better represented not as factors but as either integers (such as for counting) or floating-point (real-valued) numbers. Example: number of subjects, weight.
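As a small illustration of the first example, the sketch below builds a factor in base R and uses it for grouping. The sex and weight values are made up for this example.

```r
sex <- factor(c("M", "F", "F", "M", "F"))   # made-up subjects

levels(sex)       # "F" "M": the finite set of categories
as.integer(sex)   # 2 1 1 2 1: each value is stored as a level number
table(sex)        # counts per category

# Weight stays numeric; the factor is used only to group it.
weight <- c(70.5, 62.0, 58.3, 81.2, 66.4)   # made-up measurements
group_means <- tapply(weight, sex, mean)
group_means
```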


Set up data

gene_tall <- gather(gene_exp1, condition, expression_level, control:treatment)

(gene_tall2 <- mutate(gene_tall, condition = as.factor(condition)))
# A tibble: 6 x 3
  gene   condition expression_level
  <chr>  <fct>                <dbl>
1 ABC123 control                  0
2 DEF234 control                 10
3 GKK7   control                 12
4 ABC123 treatment                1
5 DEF234 treatment                3
6 GKK7   treatment               13

• <fct> means that column type is factor.


Plot

gene_tall2 %>%
  ggplot(aes(condition, expression_level)) +
  geom_point(aes(color = gene))

[Figure: scatter plot of expression_level (y-axis) against condition (x-axis, left to right: control, treatment), with points colored by gene (ABC123, DEF234, GKK7).]

• Note the order of values on the x-axis: it comes from the order of the levels of the factor: “control”, then “treatment”.

• By default this will be alphabetical order.


What are the levels?

gene_tall2$condition
[1] control   control   control   treatment treatment treatment
Levels: control treatment
levels(gene_tall2$condition)
[1] "control"   "treatment"

• A factor is a type of vector, so has a similar print representation.
• It is augmented by the second line, which lists the levels in order.
• The levels function returns the levels explicitly.


How to change the order of the levels

gene_tall2$condition
[1] control   control   control   treatment treatment treatment
Levels: control treatment
fct_relevel(gene_tall2$condition, "treatment", "control")
[1] control   control   control   treatment treatment treatment
Levels: treatment control

• Note the values are unchanged.
• Note the order of the levels is changed.
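The same reordering can be sketched in base R, without the forcats function used on the slide, by calling factor() again with an explicit levels argument. The vector condition below is a hand-built copy of the slide's column.

```r
condition <- factor(c("control", "control", "control",
                      "treatment", "treatment", "treatment"))
levels(condition)    # "control" "treatment" (alphabetical default)

# Re-create the factor with the levels listed in the desired order.
reordered <- factor(condition, levels = c("treatment", "control"))
levels(reordered)    # "treatment" "control"

# The values themselves are the same as before.
all(as.character(reordered) == as.character(condition))   # TRUE
```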


Update the data frame with the new levels

gene_tall2 <- gene_tall2 %>%
  mutate(condition = fct_relevel(condition, "treatment", "control"))


New plot

gene_tall2 %>%
  ggplot(aes(condition, expression_level)) +
  geom_point(aes(color = gene))

[Figure: the same scatter plot of expression_level against condition, now with the x-axis ordered treatment, then control; points colored by gene (ABC123, DEF234, GKK7).]

• Order on x-axis reflects the new factor level order.


How to change the factor values

gene_tall2$condition
[1] control   control   control   treatment treatment treatment
Levels: treatment control
fct_recode(gene_tall2$condition, ctrl = "control",
           trt = "treatment")
[1] ctrl ctrl ctrl trt  trt  trt
Levels: trt ctrl

• You might need shorter values for graph labels.
• In fct_recode, write each change as new value = "old value".
• Note the order of the levels stays the same (trt, formerly treatment, still comes first).
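For comparison, a base R sketch of the same renaming assigns to levels() directly instead of using the forcats function on the slide. The vector condition is a hand-built copy of the slide's column, with the levels already in the order treatment, control.

```r
condition <- factor(
  c("control", "control", "control", "treatment", "treatment", "treatment"),
  levels = c("treatment", "control")
)

# Rename each level in place; the positions of the levels do not change.
levels(condition)[levels(condition) == "control"] <- "ctrl"
levels(condition)[levels(condition) == "treatment"] <- "trt"

levels(condition)         # "trt" "ctrl"
as.character(condition)   # "ctrl" "ctrl" "ctrl" "trt" "trt" "trt"
```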


New plot

gene_tall2 <- gene_tall2 %>%
  mutate(condition = fct_recode(condition, ctrl = "control",
                                trt = "treatment"))
gene_tall2 %>%
  ggplot(aes(condition, expression_level)) +
  geom_point(aes(color = gene))

[Figure: the same scatter plot of expression_level against condition, with the x-axis now labeled trt, then ctrl; points colored by gene (ABC123, DEF234, GKK7).]

• The x-axis labels now show the recoded values; their order still follows the factor level order.


Reading

• Read: 15 Factors | R for Data Science
