class: center, middle, inverse, title-slide # Lec 03 - NAs, functions, loops, and merge conflicts ##
Statistical Programming ### Fall 2021 ###
Dr. Colin Rundel --- exclude: true --- class: middle, center # Missing Values --- ## Missing Values R uses `NA` to represent missing values in its data structures, what may not be obvious is that there are different `NA`s for different atomic types. .pull-left[ ```r typeof(NA) ``` ``` ## [1] "logical" ``` ```r typeof(NA+1) ``` ``` ## [1] "double" ``` ```r typeof(NA+1L) ``` ``` ## [1] "integer" ``` ```r typeof(c(NA,"")) ``` ``` ## [1] "character" ``` ] -- .pull-right[ ```r typeof(NA_character_) ``` ``` ## [1] "character" ``` ```r typeof(NA_real_) ``` ``` ## [1] "double" ``` ```r typeof(NA_integer_) ``` ``` ## [1] "integer" ``` ```r typeof(NA_complex_) ``` ``` ## [1] "complex" ``` ] --- ## NA "stickiness" Because `NA`s represent missing values it makes sense that any calculation using them should also be missing. .pull-left[ ```r 1 + NA ``` ``` ## [1] NA ``` ```r 1 / NA ``` ``` ## [1] NA ``` ```r NA * 5 ``` ``` ## [1] NA ``` ] .pull-right[ ```r sqrt(NA) ``` ``` ## [1] NA ``` ```r 3^NA ``` ``` ## [1] NA ``` ```r sum(c(1, 2, 3, NA)) ``` ``` ## [1] NA ``` ] -- <br/> Summarizing functions (e.g. `sum()`, `mean()`, `sd()`, etc.) will often have a `na.rm` argument which will allow you to *drop* missing values. ```r sum(c(1, 2, 3, NA), na.rm = TRUE) ``` ``` ## [1] 6 ``` ```r mean(c(1, 2, 3, NA), na.rm = TRUE) ``` ``` ## [1] 2 ``` --- ## NAs are not always sticky A useful mental model for `NA`s is to consider them as a unknown value that could take any of the possible values for that type. -- For numbers or characters this isn't very helpful, but for a logical value we know that the value must either be `TRUE` or `FALSE` and we can use that when deciding what value to return. -- ```r TRUE & NA ``` ``` ## [1] NA ``` -- ```r FALSE & NA ``` ``` ## [1] FALSE ``` -- ```r TRUE | NA ``` ``` ## [1] TRUE ``` -- ```r FALSE | NA ``` ``` ## [1] NA ``` --- ## Conditionals and missing values `NA`s can be problematic in some cases (particularly for control flow) ```r 1 == NA ``` ``` ## [1] NA ``` -- ```r if (2 != NA) "Here" ``` ``` ## Error in if (2 != NA) "Here": missing value where TRUE/FALSE needed ``` -- ```r if (all(c(1,2,NA,4) >= 1)) "There" ``` ``` ## Error in if (all(c(1, 2, NA, 4) >= 1)) "There": missing value where TRUE/FALSE needed ``` -- ```r if (any(c(1,2,NA,4) >= 1)) "There" ``` ``` ## [1] "There" ``` --- ## Testing for `NA` To explicitly test if a value is missing it is necessary to use `is.na` (often along with `any` or `all`). .pull-left[ ```r NA == NA ``` ``` ## [1] NA ``` ```r is.na(NA) ``` ``` ## [1] TRUE ``` ```r is.na(1) ``` ``` ## [1] FALSE ``` ] -- .pull-right[ ```r is.na(c(1,2,3,NA)) ``` ``` ## [1] FALSE FALSE FALSE TRUE ``` ```r any(is.na(c(1,2,3,NA))) ``` ``` ## [1] TRUE ``` ```r all(is.na(c(1,2,3,NA))) ``` ``` ## [1] FALSE ``` ] --- ## Other Special values (double) These are defined as part of the IEEE floating point standard (not unique to R) * `NaN` - Not a number * `Inf` - Positive infinity * `-Inf` - Negative infinity .pull-left[ ```r pi / 0 ``` ``` ## [1] Inf ``` ```r 0 / 0 ``` ``` ## [1] NaN ``` ```r 1/0 + 1/0 ``` ``` ## [1] Inf ``` ] .pull-right[ ```r 1/0 - 1/0 ``` ``` ## [1] NaN ``` ```r NaN / NA ``` ``` ## [1] NaN ``` ```r NaN * NA ``` ``` ## [1] NaN ``` ] --- ## Testing for `Inf` and `NaN` `NaN` and `Inf` don't have the same testing issues that `NA`s do, but there are still convenience functions for testing for these types of values .pull-left[ ```r is.finite(Inf) ``` ``` ## [1] FALSE ``` ```r is.infinite(-Inf) ``` ``` ## [1] TRUE ``` ```r is.nan(Inf) ``` ``` ## [1] FALSE ``` ```r is.nan(-Inf) ``` ``` ## [1] FALSE ``` ```r Inf > 1 ``` ``` ## [1] TRUE ``` ```r -Inf > 1 ``` ``` ## [1] FALSE ``` ] -- .pull-right[ ```r is.finite(NaN) ``` ``` ## [1] FALSE ``` ```r is.infinite(NaN) ``` ``` ## [1] FALSE ``` ```r is.nan(NaN) ``` ``` ## [1] TRUE ``` ```r is.finite(NA) ``` ``` ## [1] FALSE ``` ```r is.infinite(NA) ``` ``` ## [1] FALSE ``` ```r is.nan(NA) ``` ``` ## [1] FALSE ``` ] --- ## Coercion for infinity and NaN First remember that `Inf`, `-Inf`, and `NaN` are doubles, however their coercion behavior is not the same as for other doubles ```r as.integer(Inf) ``` ``` ## Warning: NAs introduced by coercion to integer range ``` ``` ## [1] NA ``` ```r as.integer(NaN) ``` ``` ## [1] NA ``` -- .pull-left[ ```r as.logical(Inf) ``` ``` ## [1] TRUE ``` ```r as.logical(-Inf) ``` ``` ## [1] TRUE ``` ```r as.logical(NaN) ``` ``` ## [1] NA ``` ] -- .pull-right[ ```r as.character(Inf) ``` ``` ## [1] "Inf" ``` ```r as.character(-Inf) ``` ``` ## [1] "-Inf" ``` ```r as.character(NaN) ``` ``` ## [1] "NaN" ``` ] --- class: middle count: false # Functions --- ## Function Parts Functions are defined by two components: the arguments (`formals`) and the code (`body`). Functions are assigned names like any other object in R (using `=` or `<-`) ```r gcd = function(x1, y1, x2 = 0, y2 = 0) { R = 6371 # Earth mean radius in km # distance in km acos(sin(y1)*sin(y2) + cos(y1)*cos(y2) * cos(x2-x1)) * R } ``` -- .pull-left[ ```r typeof(gcd) ``` ``` ## [1] "closure" ``` ] .pull-right[ ```r mode(gcd) ``` ``` ## [1] "function" ``` ] -- <div> .pull-left[ ```r formals(gcd) ``` ``` ## $x1 ## ## ## $y1 ## ## ## $x2 ## [1] 0 ## ## $y2 ## [1] 0 ``` ] .pull-right[ ```r body(gcd) ``` ``` ## { ## R = 6371 ## acos(sin(y1) * sin(y2) + cos(y1) * cos(y2) * cos(x2 - x1)) * ## R ## } ``` ] </div> --- ## Return values There are two approaches to returning values from functions in R - explicit and implicit returns. -- **Explicit** - using one or more `return` function calls ```r f = function(x) { return(x * x) } f(2) ``` ``` ## [1] 4 ``` -- <br/> **Implicit** - return value of the last expression is returned. ```r g = function(x) { x * x } g(3) ``` ``` ## [1] 9 ``` --- ## Invisible returns Many functions in R make use of an invisible return value ```r f = function(x) { print(x) } y = f(1) ``` ``` ## [1] 1 ``` ```r y ``` ``` ## [1] 1 ``` -- ```r g = function(x) { invisible(x) } ``` ```r g(2) ``` ```r z = g(2) z ``` ``` ## [1] 2 ``` --- ## Returning multiple values If we want a function to return more than one value we can group things using atomic vectors or lists. ```r f = function(x) { c(x, x^2, x^3) } f(1:2) ``` ``` ## [1] 1 2 1 4 1 8 ``` -- <br/> ```r g = function(x) { list(x, "hello") } g(1:2) ``` ``` ## [[1]] ## [1] 1 2 ## ## [[2]] ## [1] "hello" ``` .footnote[More on generic vectors next Tuesday] --- class: split-50 ## Argument names When defining a function we explicitly define names for the arguments, which become variables within the scope of the function. When calling a function we can use these names to pass arguments in an alternative order. ```r f = function(x, y, z) { paste0("x=", x, " y=", y, " z=", z) } ``` -- .pull-left[ ```r f(1, 2, 3) ``` ``` ## [1] "x=1 y=2 z=3" ``` ```r f(z=1, x=2, y=3) ``` ``` ## [1] "x=2 y=3 z=1" ``` ] -- .pull-right[ ```r f(y=2, 1, 3) ``` ``` ## [1] "x=1 y=2 z=3" ``` ```r f(y=2, 1, x=3) ``` ``` ## [1] "x=3 y=2 z=1" ``` ] -- ```r f(1, 2, 3, 4) ``` ``` ## Error in f(1, 2, 3, 4): unused argument (4) ``` ```r f(1, 2, m=3) ``` ``` ## Error in f(1, 2, m = 3): unused argument (m = 3) ``` --- ## Argument defaults It is also possible to give function arguments default values, so that they don't need to be provided every time the function is called. ```r f = function(x, y=1, z=1) { paste0("x=", x, " y=", y, " z=", z) } ``` -- .pull-left[ ```r f(3) ``` ``` ## [1] "x=3 y=1 z=1" ``` ```r f(x=3) ``` ``` ## [1] "x=3 y=1 z=1" ``` ] -- .pull-right[ ```r f(z=3, x=2) ``` ``` ## [1] "x=2 y=1 z=3" ``` ```r f(y=2, 2) ``` ``` ## [1] "x=2 y=2 z=1" ``` ] -- <br/> ```r f() ``` ``` ## Error in paste0("x=", x, " y=", y, " z=", z): argument "x" is missing, with no default ``` --- ## Scope R has generous scoping rules, if it can't find a variable in the functions body, it will look for it in the next higher scope, and so on. ```r y = 1 f = function(x) { x + y } f(3) ``` ``` ## [1] 4 ``` -- ```r y = 1 g = function(x) { y = 2 x + y } g(3) ``` ``` ## [1] 5 ``` ```r y ``` ``` ## [1] 1 ``` --- Additionally, variables defined within a scope only persist for the duration of that scope, and do not overwrite variables at a higher scopes ```r x = 1 y = 1 z = 1 f = function() { y = 2 g = function() { z = 3 return(x + y + z) } return(g()) } f() ``` ``` ## [1] 6 ``` ```r c(x,y,z) ``` ``` ## [1] 1 1 1 ``` --- ## Exercise 1 - scope What is the output of the following code? Explain why. ```r z = 1 f = function(x, y, z) { z = x+y g = function(m = x, n = y) { m/z + n/z } z * g() } f(1, 2, x = 3) ``` --- ## Operators as functions In R, operators are actually a special type of function - using backticks around the operator we can write them as functions. ```r `+` ``` ``` ## function (e1, e2) .Primitive("+") ``` ```r typeof(`+`) ``` ``` ## [1] "builtin" ``` -- .pad-top[] ```r x = 4:1 x + 2 ``` ``` ## [1] 6 5 4 3 ``` ```r `+`(x, 2) ``` ``` ## [1] 6 5 4 3 ``` --- ## Getting Help Prefixing any function name with a `?` will open the related help file for that function. ```r ?`+` ?sum ``` -- For functions not in the base package, you can generally see their implementation by entering the function name without parentheses (or using the `body` function). ```r lm ``` ``` ## function (formula, data, subset, weights, na.action, method = "qr", ## model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, ## contrasts = NULL, offset, ...) ## { ## ret.x <- x ## ret.y <- y ## cl <- match.call() ## mf <- match.call(expand.dots = FALSE) ## m <- match(c("formula", "data", "subset", "weights", "na.action", ## "offset"), names(mf), 0L) ## mf <- mf[c(1L, m)] ## mf$drop.unused.levels <- TRUE ## mf[[1L]] <- quote(stats::model.frame) ## mf <- eval(mf, parent.frame()) ## if (method == "model.frame") ## return(mf) ## else if (method != "qr") ## warning(gettextf("method = '%s' is not supported. Using 'qr'", ## method), domain = NA) ## mt <- attr(mf, "terms") ## y <- model.response(mf, "numeric") ## w <- as.vector(model.weights(mf)) ## if (!is.null(w) && !is.numeric(w)) ## stop("'weights' must be a numeric vector") ## offset <- model.offset(mf) ## mlm <- is.matrix(y) ## ny <- if (mlm) ## nrow(y) ## else length(y) ## if (!is.null(offset)) { ## if (!mlm) ## offset <- as.vector(offset) ## if (NROW(offset) != ny) ## stop(gettextf("number of offsets is %d, should equal %d (number of observations)", ## NROW(offset), ny), domain = NA) ## } ## if (is.empty.model(mt)) { ## x <- NULL ## z <- list(coefficients = if (mlm) matrix(NA_real_, 0, ## ncol(y)) else numeric(), residuals = y, fitted.values = 0 * ## y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w != ## 0) else ny) ## if (!is.null(offset)) { ## z$fitted.values <- offset ## z$residuals <- y - offset ## } ## } ## else { ## x <- model.matrix(mt, mf, contrasts) ## z <- if (is.null(w)) ## lm.fit(x, y, offset = offset, singular.ok = singular.ok, ## ...) ## else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok, ## ...) ## } ## class(z) <- c(if (mlm) "mlm", "lm") ## z$na.action <- attr(mf, "na.action") ## z$offset <- offset ## z$contrasts <- attr(x, "contrasts") ## z$xlevels <- .getXlevels(mt, mf) ## z$call <- cl ## z$terms <- mt ## if (model) ## z$model <- mf ## if (ret.x) ## z$x <- x ## if (ret.y) ## z$y <- y ## if (!qr) ## z$qr <- NULL ## z ## } ## <bytecode: 0x5566ea13e418> ## <environment: namespace:stats> ``` --- ## Less Helpful Examples ```r list ``` ``` ## function (...) .Primitive("list") ``` ```r `[` ``` ``` ## .Primitive("[") ``` ```r sum ``` ``` ## function (..., na.rm = FALSE) .Primitive("sum") ``` ```r `+` ``` ``` ## function (e1, e2) .Primitive("+") ``` --- class: middle count: false # Loops --- ## for loops Simplest, and most common type of loop in R - given a vector iterate through the elements and evaluate the code block for each. ```r is_even = function(x) { res = c() for(val in x) { res = c(res, val %% 2 == 0) } res } is_even(1:10) ``` ``` ## [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE ``` --- ## `while` loops Repeat until the given condition is **not** met (i.e. evaluates to `FALSE`) ```r make_seq = function(from = 1, to = 1, by = 1) { res = c(from) cur = from while(cur+by <= to) { cur = cur + by res = c(res, cur) } res } make_seq(1, 6) ``` ``` ## [1] 1 2 3 4 5 6 ``` ```r make_seq(1, 6, 2) ``` ``` ## [1] 1 3 5 ``` --- ## `repeat` loops Repeat the loop until a `break` is encountered ```r make_seq2 = function(from = 1, to = 1, by = 1) { res = c(from) cur = from repeat { cur = cur + by if (cur > to) break res = c(res, cur) } res } make_seq2(1, 6) ``` ``` ## [1] 1 2 3 4 5 6 ``` ```r make_seq2(1, 6, 2) ``` ``` ## [1] 1 3 5 ``` --- class: split-50 ## Special keywords - `break` and `next` These are special actions that only work *inside* of a loop * `break` - ends the current **loop** (inner-most) * `next` - ends the current **iteration** .pull-left[ ```r f = function(x) { res = c() for(i in x) { if (i %% 2 == 0) break res = c(res, i) } res } f(1:10) ``` ``` ## [1] 1 ``` ```r f(c(1,1,1,2,2,3)) ``` ``` ## [1] 1 1 1 ``` ] .pull-right[ ```r g = function(x) { res = c() for(i in x) { if (i %% 2 == 0) next res = c(res,i) } res } g(1:10) ``` ``` ## [1] 1 3 5 7 9 ``` ```r g(c(1,1,1,2,2,3)) ``` ``` ## [1] 1 1 1 3 ``` ] --- ## Some helpful functions Often we want to use a loop across the indexes of an object and not the elements themselves. There are several useful functions to help you do this: `:`, `length`, `seq`, `seq_along`, `seq_len`, etc. .pull-left[ ```r 4:7 ``` ``` ## [1] 4 5 6 7 ``` ```r length(4:7) ``` ``` ## [1] 4 ``` ```r seq(4,7) ``` ``` ## [1] 4 5 6 7 ``` ] .pull-right[ ```r seq_along(4:7) ``` ``` ## [1] 1 2 3 4 ``` ```r seq_len(length(4:7)) ``` ``` ## [1] 1 2 3 4 ``` ```r seq(4,7,by=2) ``` ``` ## [1] 4 6 ``` ] --- ## Avoid using `1:length(x)` A common loop construction you'll see in a lot of R code is using `1:length(x)` to generate a vector of index values for the vector `x`. .pull-left[ .midi[ ```r f = function(x) { for(i in 1:length(x)) { print(i) } } f(2:1) ``` ``` ## [1] 1 ## [1] 2 ``` ```r f(2) ``` ``` ## [1] 1 ``` ```r f(integer()) ``` ``` ## [1] 1 ## [1] 0 ``` ] ] .pull-right[ ```r g = function(x) { for(i in seq_along(x)) { print(i) } } g(2:1) ``` ``` ## [1] 1 ## [1] 2 ``` ```r g(2) ``` ``` ## [1] 1 ``` ```r g(integer()) ``` ] -- <div> .pull-left[ ```r 1:length(integer()) ``` ``` ## [1] 1 0 ``` ] .pull-right[ ```r seq_along(integer()) ``` ``` ## integer(0) ``` ] </div> --- ## Exercise 2 Below is a vector containing all prime numbers between 2 and 100: .center[ ```r primes = c( 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97) ``` ] If you were given the vector `x = c(3,4,12,19,23,51,61,63,78)`, write the R code necessary to print only the values of `x` that are *not* prime (without using subsetting or the `%in%` operator). Your code should use *nested* loops to iterate through the vector of primes and `x`. --- class: middle center ## Merge Conflict + GitHub Actions <br/> Live Demo