R Advanced, Part 4: Data Management II

Contents:

Math functions
Statistical functions
Worked example
Control flow

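Math functions

ceiling(x): the smallest integer greater than or equal to x, e.g. ceiling(3.213) --> 4
floor(x): the largest integer less than or equal to x, e.g. floor(3.6534) --> 3
trunc(x): the integer part of x (truncation toward zero), e.g. trunc(5.999) --> 5
round(x,digits=n): rounds x to n decimal places, e.g. round(3.4567,2) --> 3.46
signif(x,digits=n): rounds x to n significant digits, e.g. signif(3.4567,2) --> 3.5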
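
The examples above use only positive numbers, where floor and trunc agree. A small sketch with negative inputs (the vector v is made up for illustration) shows where the three functions differ:

```r
# Illustrative values only; not taken from the article's data
v <- c(-2.7, -0.5, 2.7)

up   <- ceiling(v)  # rounds toward +Inf
down <- floor(v)    # rounds toward -Inf
cut0 <- trunc(v)    # drops the fractional part, i.e. rounds toward zero

print(rbind(v, up, down, cut0))
```

For negative values trunc behaves like ceiling, and for positive values like floor.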

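Statistical functions

mean(x): arithmetic mean. Syntax: mean(x, trim = 0, na.rm = FALSE, ...)

trim: a fraction from 0 to 0.5. For example, mean(x, trim = 0.1) first drops the largest 10% and the smallest 10% of the values in x, then averages what remains.

Example code

 x <- c(2,4,6,8,10,11)
 mean(x,trim = 0.2)   # 6*0.2 = 1.2, rounded down to 1: drop the largest and smallest values (11 and 2), then average 4 6 8 10 --> 7
 mean(x)              # 41/6  --> 6.833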
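
To make the trimming rule concrete, here is a minimal sketch that reproduces mean(x, trim = ...) by hand. The helper name trimmed_mean is hypothetical, and the sketch assumes trim is below 0.5 and x has no missing values:

```r
# Hypothetical helper: drop floor(n * trim) values from each end, then average
trimmed_mean <- function(x, trim = 0) {
  n <- length(x)
  k <- floor(n * trim)            # number of values removed at each end
  sorted <- sort(x)
  mean(sorted[(k + 1):(n - k)])
}

x <- c(2, 4, 6, 8, 10, 11)
manual  <- trimmed_mean(x, trim = 0.2)
builtin <- mean(x, trim = 0.2)
print(c(manual = manual, builtin = builtin))
```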

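median(x): median. Syntax: median(x, na.rm = FALSE). With the values sorted (ascending or descending), it is the middle value, or the average of the two middle values when the count is even. It is mainly used to describe the typical level of a whole group...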
```
> x <- c(2,4,6,9,10,11)
>  median(x)
[1] 7.5
>  x <- c(2,4,9,10,11)
>  median(x)
[1] 9
```
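mad(x): median absolute deviation. Syntax: mad(x, center = median(x), constant = 1.4826, na.rm = FALSE, low = FALSE, high = FALSE)

The motivation is that outlying values far from the mean strongly affect the mean and, even more so, the variance; the median absolute deviation is much less sensitive to them.
The value actually computed is: constant * cMedian(abs(x - center)), where cMedian is the ordinary median unless low or high is set.
low, high: when the median falls between two values (an even count), low = TRUE takes the smaller of the two and high = TRUE the larger, instead of their average.

Example code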

> x <- c(1,2,3,5,7,8)
> mad(x)
[1] 3.7065
> # step-by-step breakdown of mad(x)
> y <- abs(x - median(x));y
[1] 3 2 1 1 3 4
> z <- median(y);z
[1] 2.5
> 1.4826 * z
[1] 3.7065
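
The low and high arguments are not exercised above. A short sketch on the same vector shows their effect, under the assumption that the six absolute deviations sort to 1 1 2 3 3 4, so the two middle values are 2 and 3:

```r
x <- c(1, 2, 3, 5, 7, 8)
dev <- sort(abs(x - median(x)))    # sorted absolute deviations from the median

mad_default <- mad(x)              # uses the average of the two middle deviations
mad_low     <- mad(x, low = TRUE)  # uses the smaller middle deviation
mad_high    <- mad(x, high = TRUE) # uses the larger middle deviation

print(c(default = mad_default, low = mad_low, high = mad_high))
```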

range(x): returns the minimum and maximum. Syntax: range(..., na.rm = FALSE)

> x <- c(1,2,3,5,7,8)
> y <- range(x);y
[1] 1 8
> diff(y)   # lagged difference; here max - min, i.e. the width of the range
[1] 7

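sd(x): standard deviation. Syntax: sd(x, na.rm = FALSE). The standard deviation measures how widely a set of values is spread around its mean. A large standard deviation means most values lie far from the mean; a small one means the values cluster close to it.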

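The standard deviation (here the sample, or experimental, standard deviation) is given by $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$, where $\bar{x}$ is the mean of the $n$ values.
In words: subtract the mean from every value, sum the squared differences, divide that sum by the number of values minus one, and take the square root of the result; that is the standard deviation of the data.
In investing, the standard deviation can serve as an indicator of how stable returns are. The larger it is, the further returns stray from their historical average, so returns are less stable and risk is higher. Conversely, a smaller standard deviation means steadier returns and lower risk.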

Example code:

> x <- c(1,2,3,4)
> sd(x)
[1] 1.290994
> m <- mean(x);
> y <- sum((x - m)^2)/(length(x) -1)
> sqrt(y)
[1] 1.290994

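scale(x): centers and/or standardizes the data object x column by column. Syntax: scale(x, center = TRUE, scale = TRUE)

Centering subtracts the column mean from every value in the column.
Standardizing then divides the centered values by the column's standard deviation.

Example code: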

> data <- c(1, 2, 3, 6, 3)
> scale(data)
           [,1]
[1,] -1.0690450
[2,] -0.5345225
[3,]  0.0000000
[4,]  1.6035675
[5,]  0.0000000
attr(,"scaled:center")
[1] 3
attr(,"scaled:scale")
[1] 1.870829
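
The two definitions can be checked by hand. This sketch recomputes the standardized values for the same vector and compares them with what scale() returns (the variable names centered, standardized and scaled are illustrative):

```r
data <- c(1, 2, 3, 6, 3)

centered     <- data - mean(data)    # centering: subtract the mean
standardized <- centered / sd(data)  # standardizing: divide by the standard deviation

scaled <- scale(data)                # one-column matrix with attributes
print(cbind(manual = standardized, builtin = as.vector(scaled)))
```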

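quantile(x): computes quantiles (percentiles). Syntax: quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE,...)

Default method (type 7): linear interpolation between the sorted values, so that (distance from the start point to the quantile) / (distance from the start point to the end point) = the requested proportion. For evenly spaced data this gives min + (max - min) * p.

Example code: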

> m <- c(1:20)
> quantile(m,probs = c(0.25,0.75,1));
  25%   75%  100% 
 5.75 15.25 20.00 
> (20 -1)*0.25 + 1
[1] 5.75
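
The shortcut above works because 1:20 is evenly spaced. For arbitrary data, the default type 7 rule interpolates at position (n - 1) * p + 1 in the sorted values. A minimal sketch, where the helper name quantile_type7 and the test vector are made up for illustration and missing values are not handled:

```r
# Hypothetical helper implementing the type 7 interpolation rule
quantile_type7 <- function(x, p) {
  sorted <- sort(x)
  n  <- length(sorted)
  h  <- (n - 1) * p + 1      # fractional position in the sorted data
  lo <- floor(h)
  hi <- ceiling(h)
  sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo])
}

m <- c(3, 1, 4, 1, 5, 9, 2, 6)
probs   <- c(0.25, 0.5, 0.9)
manual  <- quantile_type7(m, probs)
builtin <- quantile(m, probs = probs, names = FALSE)
print(rbind(manual, builtin))
```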

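Worked example

Using the students' scores in three subjects (math, science, English), grade the students from A to E, then print them sorted by last name and first name.
The sample data is saved locally in the file roster.rda

Step 1: because the three subjects are scored on very different scales, the variables must first be standardized so that they are comparable; each score is then expressed in units of standard deviations, as follows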

> setwd("E:\\R")
> load(file = "roster.rda")
> options(digits = 2)  # print with 2 significant digits (not 2 decimal places)
> z <- scale(roster[,2:4])
> z
     math science english
1   0.013   1.078   0.587
2   1.143   1.591   0.037
3  -1.026  -0.847  -0.697
4  -1.649  -0.590  -1.247
5  -0.068  -1.489  -0.330
6   0.128  -0.205   1.137
7  -1.049  -0.847  -1.247
8   1.432   1.078   1.504
9   0.832   0.308   0.954
10  0.243  -0.077  -0.697

Step 2: average each row to obtain a composite score, and merge it into the roster:

> score <- apply(z,MARGIN = 1,FUN = mean);
> score
    1     2     3     4     5     6     7     8     9    10 
 0.56  0.92 -0.86 -1.16 -0.63  0.35 -1.05  1.34  0.70 -0.18 
> roster <- cbind(roster,score);roster
            student math science english score
1        John Davis  502      95      25  0.56
2    Angla Williams  600      99      22  0.92
3  Bullwinkle Moose  412      80      18 -0.86
4       David Jones  358      82      15 -1.16
5  Janice Makhammer  495      75      20 -0.63
6    Cheryl Cushing  512      85      28  0.35
7    Reuven Ytzrhak  410      80      15 -1.05
8         Greg Knox  625      95      30  1.34
9      Joel England  573      89      27  0.70
10     Mary Rayburn  522      86      18 -0.18

Step 3: assign grades by percentile:

> y <- quantile(roster$score,probs = c(0.8,0.6,0.4,0.2))
> y
  80%   60%   40%   20% 
 0.74  0.44 -0.36 -0.89 
> roster <- within(roster,{
+                   grader <- NA
+                   grader[score >= y[1]] <- "A"
+                   grader[score < y[1] & score >= y[2]] <- "B"
+                   grader[score < y[2] & score >= y[3]] <- "C"
+                   grader[score < y[3] & score >= y[4]] <- "D"
+                   grader[score < y[4]] <- "E"
+ })
> roster
            student math science english score grader
1        John Davis  502      95      25  0.56      B
2    Angla Williams  600      99      22  0.92      A
3  Bullwinkle Moose  412      80      18 -0.86      D
4       David Jones  358      82      15 -1.16      E
5  Janice Makhammer  495      75      20 -0.63      D
6    Cheryl Cushing  512      85      28  0.35      C
7    Reuven Ytzrhak  410      80      15 -1.05      E
8         Greg Knox  625      95      30  1.34      A
9      Joel England  573      89      27  0.70      B
10     Mary Rayburn  522      86      18 -0.18      C
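
As an alternative to the chain of logical assignments in step 3, the same binning can be sketched with cut(). This is a swapped-in technique, not the article's method, and it uses the rounded composite scores printed above rather than the full-precision values in roster:

```r
# Rounded composite scores copied from the printed roster (an approximation)
score <- c(0.56, 0.92, -0.86, -1.16, -0.63, 0.35, -1.05, 1.34, 0.70, -0.18)

# Cut points in ascending order, as cut() expects
y <- unname(quantile(score, probs = c(0.2, 0.4, 0.6, 0.8)))

grade <- cut(score,
             breaks = c(-Inf, y, Inf),
             labels = c("E", "D", "C", "B", "A"),
             right  = FALSE)          # intervals [lower, upper), matching score >= y[k]
grade <- as.character(grade)
print(data.frame(score, grade))
```

With right = FALSE a score exactly equal to a cut point lands in the higher grade, the same convention as the >= comparisons in step 3.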

Step 4: split the student variable into firstname and lastname, sort by last name and first name, and save the result to the local file studentGrade.rda

> name  <- strsplit(roster$student," ")
Error in strsplit(roster$student, " ") : non-character argument
> class(roster$student)   # check the type of student: it is a factor and must be converted to character
[1] "factor"

The corrected code:

>  name  <- strsplit(as.character(roster$student)," ")
>  firstname <- sapply(name,"[",1)  # "[" is a function that extracts part of an object; here it pulls the first (or second) element from each component of the list name
>  lastname <- sapply(name,"[",2)
>  sGrade <- cbind(firstname,lastname,roster[,-1]) 
>  sGrade <- sGrade[order(sGrade$lastname, sGrade$firstname),]; sGrade
    firstname  lastname math science english      score grader
6      Cheryl   Cushing  512      85      28  0.3532485      C
1        John     Davis  502      95      25  0.5592028      B
9        Joel   England  573      89      27  0.6978361      B
4       David     Jones  358      82      15 -1.1620473      E
8        Greg      Knox  625      95      30  1.3378934      A
5      Janice Makhammer  495      75      20 -0.6289776      D
3  Bullwinkle     Moose  412      80      18 -0.8565414      D
10       Mary   Rayburn  522      86      18 -0.1768163      C
2       Angla  Williams  600      99      22  0.9238259      A
7      Reuven   Ytzrhak  410      80      15 -1.0476242      E
>  save(sGrade,file = "studentGrade.rda")
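
The split-and-sort step can be tried in isolation. This sketch uses a three-name factor taken from the roster and vapply in place of sapply so that each extraction is guaranteed to return one string; the data frame name people is illustrative. Since R 4.0.0, data.frame() no longer converts strings to factors by default, so the as.character() call is only needed when the column really is a factor:

```r
students <- factor(c("John Davis", "Angla Williams", "Greg Knox"))

# strsplit needs character input, so convert the factor first
parts <- strsplit(as.character(students), " ", fixed = TRUE)

firstname <- vapply(parts, `[`, character(1), 1)  # first piece of each name
lastname  <- vapply(parts, `[`, character(1), 2)  # second piece of each name

people <- data.frame(firstname, lastname, stringsAsFactors = FALSE)
people <- people[order(people$lastname, people$firstname), ]
print(people)
```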

Control flow

for: example: for(i in 1:10) print("hello")

while: example:

 i<-10
 while (i>0)
 {
   print("hello")
   i <- i-1
 }

ifelse: Syntax: ifelse(test, yes, no). It is vectorized: for each element of test it returns the corresponding value of yes where test is TRUE, and of no otherwise.
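
A short sketch contrasting the vectorized ifelse with the plain if / else statement, which tests a single condition. The scores and labels are made up for illustration:

```r
score <- c(0.56, -0.86, 1.34, -0.18)

# ifelse evaluates the condition element by element and returns a vector
result <- ifelse(score >= 0, "above average", "below average")

# if / else takes one TRUE/FALSE value and runs one branch
label <- if (mean(score) >= 0) "non-negative mean" else "negative mean"

print(result)
print(label)
```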

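switch: Syntax: switch(expression, case1, case2, case3....). When expression evaluates to a number n, the n-th of the following arguments is returned; in the example below x becomes "third".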

x <- switch(
  3,
  "first",
  "second",
  "third",
  "fourth"
)
x
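
switch also accepts a character string, in which case it matches argument names. A minimal sketch follows; the function name describe and its labels are hypothetical:

```r
describe <- function(kind) {
  switch(kind,
    mean   = "average of the values",
    median = "middle value",
    sd     = ,                      # empty: falls through to the next non-empty value
    var    = "measure of spread",
    "unknown statistic"             # unnamed last argument acts as the default
  )
}

r1 <- describe("median")
r2 <- describe("sd")
r3 <- describe("mode")

# A number with no matching position returns NULL
r4 <- switch(5, "first", "second")
print(c(r1, r2, r3))
```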