
Awk in Practice: Worked Examples

Inserting several new fields

Insert three fields, e f g, after b in "a b c d".

echo a b c d|awk '{$3="e f g "$3}1'
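
Assigning to a field makes awk rebuild $0 from the fields joined by OFS, but it does not re-split the record, so NF stays at 4 even though the line now holds seven words. The sketch below illustrates this; `show_nf` is a hypothetical helper name, not something from the original article. Forcing a re-split with `$0=$0` updates NF.

```shell
# show_nf is a hypothetical helper used only for illustration.
# It prints NF right after the assignment, then again after a forced re-split.
show_nf() {
  awk '{
    $3 = "e f g " $3   # $0 is rebuilt, but the fields are not re-split
    print NF
    $0 = $0            # re-split the rebuilt record
    print NF, $0
  }'
}

echo a b c d | show_nf
```

The first line printed is the stale field count, the second is the count after re-splitting together with the rebuilt record.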

Normalizing whitespace

Remove leading and trailing whitespace from each line and left-align the fields.

      aaaa        bbb     ccc                 
   bbb     aaa ccc
ddd       fff             eee gg hh ii jj

awk 'BEGIN{OFS="\t"}{$1=$1;print}' a.txt

Result:

aaaa    bbb     ccc
bbb     aaa     ccc
ddd     fff     eee     gg      hh      ii      jj
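
The key step is `$1=$1`: the assignment leaves the field itself unchanged, but it makes awk rebuild $0 from the fields joined by OFS, which discards the original leading, trailing, and repeated whitespace. A minimal sketch with the default OFS (a single space); `squeeze` is a hypothetical helper name:

```shell
# squeeze is a hypothetical helper used only for illustration.
squeeze() {
  # Brackets are printed around $0 to make leftover whitespace visible.
  awk '{ $1 = $1; print "[" $0 "]" }'
}

printf '   aaaa     bbb   ccc   \n' | squeeze
```

Without the `$1=$1` assignment, `print` would emit the record exactly as it was read.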

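Filtering IPv4 addresses

From the output of ifconfig, extract every IPv4 address except those of the lo interface.

## Method 1:
ifconfig | awk '/inet / && !($2 ~ /^127/){print $2}'

# Read by paragraph. The pattern is anchored as /^lo/ because an unanchored /lo/
# would also match words such as "collisions" inside other interfaces' stanzas.
## Method 2:
ifconfig | awk 'BEGIN{RS=""}!/^lo/{print $6}'

## Method 3:
ifconfig |\
awk '
  BEGIN{RS="";FS="\n"}
  !/^lo/{$0=$2;FS=" ";$0=$0;print $2;FS="\n"}
'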
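
To try the paragraph-mode approach without depending on the local network setup, the sketch below feeds canned ifconfig-style text to awk. The sample stanzas and the helper name `non_lo_ipv4` are assumptions made for illustration, and the field position `$6` assumes the newer net-tools output layout shown here.

```shell
# non_lo_ipv4 is a hypothetical helper; it reads ifconfig-style text on stdin.
non_lo_ipv4() {
  awk '
    BEGIN { RS = "" }      # paragraph mode: one interface stanza per record
    !/^lo/ { print $6 }    # skip the stanza whose first line starts with lo
  '
}

# Made-up sample input in the newer ifconfig layout.
non_lo_ipv4 <<'EOF'
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.2.17  netmask 255.255.255.0  broadcast 192.168.2.255
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
EOF
```

Because the eth0 stanza contains the word "collisions", an unanchored `/lo/` test would reject it as well; anchoring to the start of the record avoids that.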

Reading one section of an .ini configuration file

[base]
name=os_repo
baseurl=https://xxx/centos/$releasever/os/$basearch
gpgcheck=0

enable=1

[mysql]
name=mysql_repo
baseurl=https://xxx/mysql-repo/yum/mysql-5.7-community/el/$releasever/$basearch

gpgcheck=0
enable=1

[epel]
name=epel_repo
baseurl=https://xxx/epel/$releasever/$basearch
gpgcheck=0
enable=1
[percona]
name=percona_repo
baseurl = https://xxx/percona/release/$releasever/RPMS/$basearch
enabled = 1
gpgcheck = 0

awk '
    BEGIN{RS=""}  # paragraph mode
    /\[mysql\]/{
        print;
        while( (getline)>0 ){
            if(/\[.*\]/){
                exit
            }
            print
        }
}' a.txt
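
The paragraph-based version relies on blank lines falling in convenient places. As an alternative, a line-oriented sketch can toggle a flag on every section header, which does not depend on blank lines at all. The helper name `ini_section` and its argument are hypothetical, and the sample input is a shortened stand-in for the file above.

```shell
# ini_section NAME: print the [NAME] section of an ini file read from stdin.
# Hypothetical helper, shown as a sketch rather than the article's method.
ini_section() {
  awk -v want="[$1]" '
    /^\[/ { in_sec = ($0 == want) }   # every header line switches the flag
    in_sec && NF                      # print non-blank lines of the wanted section
  '
}

ini_section mysql <<'EOF'
[base]
name=os_repo

[mysql]
name=mysql_repo

gpgcheck=0
[epel]
name=epel_repo
EOF
```

The section header itself is printed because the flag is already set when the second rule sees that line.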

Deduplicating by a field

Remove lines whose uid=xxx value has already appeared, keeping the first occurrence.

2019-01-13_12:00_index?uid=123
2019-01-13_13:00_index?uid=123
2019-01-13_14:00_index?uid=333
2019-01-13_15:00_index?uid=9710
2019-01-14_12:00_index?uid=123
2019-01-14_13:00_index?uid=123
2019-01-15_14:00_index?uid=333
2019-01-16_15:00_index?uid=9710

awk -F"?" '!arr[$2]++{print}' a.txt

Result:

2019-01-13_12:00_index?uid=123
2019-01-13_14:00_index?uid=333
2019-01-13_15:00_index?uid=9710
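
The one-liner works because `arr[$2]++` yields the old counter value: 0 (false) the first time a key is seen, so the negation is true and the line is printed. An expanded sketch of the same logic, with `dedup_by_uid` as a hypothetical helper name:

```shell
# dedup_by_uid is a hypothetical helper spelling out the !arr[$2]++ idiom.
dedup_by_uid() {
  awk -F'?' '{
    if (seen[$2] == 0) print   # first time this uid shows up
    seen[$2]++                 # remember it for later lines
  }'
}

printf '%s\n' \
  '2019-01-13_12:00_index?uid=123' \
  '2019-01-13_13:00_index?uid=123' \
  '2019-01-13_14:00_index?uid=333' | dedup_by_uid
```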

Counting occurrences

portmapper
portmapper
portmapper
portmapper
portmapper
portmapper
status
status
mountd
mountd
mountd
mountd
mountd
mountd
nfs
nfs
nfs_acl
nfs
nfs
nfs_acl
nlockmgr
nlockmgr
nlockmgr
nlockmgr
nlockmgr

awk '
  {arr[$1]++}
  END{
    OFS="\t";
    for(idx in arr){print arr[idx],idx}
  }
' a.txt

Counting TCP connection states

$ netstat -tnap
Proto Recv-Q Send-Q Local Address   Foreign Address  State       PID/Program name
tcp        0      0 0.0.0.0:22      0.0.0.0:*        LISTEN      1139/sshd
tcp        0      0 127.0.0.1:25    0.0.0.0:*        LISTEN      2285/master
tcp        0     96 192.168.2.17:22 192.168.2.1:2468 ESTABLISHED 87463/sshd: root@pt
tcp        0      0 192.168.2017:22 192.168.201:5821 ESTABLISHED 89359/sshd: root@no
tcp6       0      0 :::3306         :::*             LISTEN      2289/mysqld
tcp6       0      0 :::22           :::*             LISTEN      1139/sshd
tcp6       0      0 ::1:25          :::*             LISTEN      2285/master

Expected result:

5: LISTEN
2: ESTABLISHED

netstat -tnap |\
awk '
    /^tcp/{
        arr[$6]++
    }
    END{
        for(state in arr){
            print arr[state] ": " state
        }
    }
'

One-liners:

netstat -tna | awk '/^tcp/{arr[$6]++}END{for(state in arr){print arr[state] ": " state}}'
netstat -tna | /usr/bin/grep 'tcp' | awk '{print $6}' | sort | uniq -c

Counting non-200 responses per IP in a log

Sample log line:

111.202.100.141 - - [2019-11-07T03:11:02+08:00] "GET /robots.txt HTTP/1.1" 301 169

Count the requests with a non-200 status code per IP, and list the 10 IPs with the most such requests.

# Method 1
awk '
  $8!=200{arr[$1]++}
  END{
    for(i in arr){print arr[i],i}
  }
' access.log | sort -k1nr | head -n 10

# Method 2 (gawk only: relies on PROCINFO["sorted_in"]):
awk '
    $8!=200{arr[$1]++}
    END{
        PROCINFO["sorted_in"]="@val_num_desc";
        for(i in arr){
            if(cnt++==10){exit}
            print arr[i],i
        }
}' access.log

Counting distinct IPs

Fields: URL | client IP | access time | visitor

a.com.cn|202.109.134.23|2015-11-20 20:34:43|guest
b.com.cn|202.109.134.23|2015-11-20 20:34:48|guest
c.com.cn|202.109.134.24|2015-11-20 20:34:48|guest
a.com.cn|202.109.134.23|2015-11-20 20:34:43|guest
a.com.cn|202.109.134.24|2015-11-20 20:34:43|guest
b.com.cn|202.109.134.25|2015-11-20 20:34:48|guest

Requirement: for each URL, count the number of distinct client IPs (deduplicated), and save one file per URL. The result should look like:

a.com.cn  2
b.com.cn  2
c.com.cn  1

along with three corresponding files:

a.com.cn.txt
b.com.cn.txt
c.com.cn.txt

Code:

BEGIN{
  FS="|"
}

!arr[$1,$2]++{
  arr1[$1]++
}

END{
  for(i in arr1){
    print i,arr1[i] >(i".txt")
  }
}

Printing all lines whose second field is duplicated

Given the following text:

1 zhangsan
2 lisi
3 zhangsan
4 lisii
5 a
6 b
7 c
8 d
9 a
10 b

Problem 1: print every line whose second column appears more than once. Expected output:

1 zhangsan
3 zhangsan
5 a
9 a
6 b
10 b

Code:

awk '{
    arr[$2]++;
    if(arr[$2]>1){
      if(arr[$2]==2){
        print first[$2]
      };
      print $0
    }else{
      first[$2]=$0
    }
}' a.txt

Problem 2: print every line whose second column appears exactly once. Expected output:

2 lisi
4 lisii
7 c
8 d

Code:

 awk '{
   arr[$2]++
   first[$2]=$0
 }
 END{
   for(i in arr){
     if(arr[i]==1){print first[i]}
   }
 }' a.txt
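
The END-loop version prints in whatever order `for(i in arr)` happens to yield, which awk does not specify. If the original line order matters, one option is to read the file twice: count on the first pass, print on the second. A sketch using a temporary file with made-up sample lines (the variable names are illustrative only):

```shell
# Write a small sample to a temporary file so the example is self-contained.
tmp=$(mktemp)
printf '%s\n' '1 zhangsan' '2 lisi' '3 zhangsan' '4 lisii' '5 a' '9 a' > "$tmp"

# Pass 1 (NR==FNR) counts each second field;
# pass 2 prints the lines whose second field was seen exactly once.
uniq_lines=$(awk 'NR==FNR { cnt[$2]++; next } cnt[$2] == 1' "$tmp" "$tmp")
printf '%s\n' "$uniq_lines"

rm -f "$tmp"
```

Passing the same file name twice is what makes `NR==FNR` true only during the first pass.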

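Deduplicating adjacent lines, keeping the last one

Compare lines on a field, drop adjacent duplicates, and keep the last line of each run of duplicates along with all non-duplicated lines.

Contents of a.log: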

TCP 10.33.4.149:19404 wrr
-> 10.27.4.197:19404 FullNat 10 2 0
TCP 10.33.4.150:19039 wrr
TCP 10.33.4.150:19089 wrr
-> 10.27.4.201:19089 FullNat 10 2 0
TCP 10.33.4.150:19094 wrr
TCP 10.33.4.150:19102 wrr
TCP 10.33.4.150:19107 wrr
-> 10.27.100.150:19107 FullNat 10 18 0
TCP 10.33.4.150:19111 wrr
TCP 10.33.4.150:19112 wrr
TCP 10.33.4.150:19113 wrr
TCP 10.33.4.150:19114 wrr
TCP 10.33.4.150:19207 wrr
-> 10.27.100.150:19207 FullNat 10 18 0

Using the first field to detect duplicates among adjacent lines, the final output is:

TCP 10.33.4.149:19404 wrr
-> 10.27.4.197:19404 FullNat 10 2 0
TCP 10.33.4.150:19089 wrr
-> 10.27.4.201:19089 FullNat 10 2 0
TCP 10.33.4.150:19107 wrr
-> 10.27.100.150:19107 FullNat 10 18 0
TCP 10.33.4.150:19207 wrr
-> 10.27.100.150:19207 FullNat 10 18 0

Solution:

# Compare on the first field. To compare on another field, replace $1 with $N, where N is the field number
awk '
{
  if($1!=prev){
    a[++n]=$0;
  }else{
    a[n]=$0
  }
}
{prev=$1}
END{
  for(i=1;i<=n;i++){print a[i]}
}
' a.log

Handling data with missing fields

ID  name    gender  age  email          phone
1   Bob     male    28   [email protected]     18023394012
2   Alice   female  24   [email protected]  18084925203
3   Tony    male    21                  17048792503
4   Kevin   male    21   [email protected]    17023929033
5   Alex    male    18   [email protected]    18185904230
6   Andy    female       [email protected]    18923902352
7   Jerry   female  25   [email protected]  18785234906
8   Peter   male    20   [email protected]     17729348758
9   Steven          23   [email protected]    15947893212
10  Bruce   female  27   [email protected]   13942943905

When fields are missing, splitting on FS is very awkward. To handle this case, gawk provides the FIELDWIDTHS variable.

FIELDWIDTHS splits each record into fields by character count.

awk '{print $4}' FIELDWIDTHS="2 2:6 2:6 2:3 2:13 2:11" a.txt

Handling fields that contain the field separator

Below is one line from a CSV file whose fields are separated by commas.

Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA

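Requirement: extract the third field, "1234 A Pretty Street, NE".

When a field contains the field separator, splitting on FS is very awkward. To handle this case, gawk provides the FPAT variable.

FPAT collects the text matched by a regular expression and stores each match in a field. (Just as grep highlights the parts that match, splitting with FPAT stores the matched parts in the fields $1, $2, $3, ...)

echo 'Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA' |\
awk 'BEGIN{FPAT="[^,]+|\"[^\"]+\""}{print $1,$3}'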

Extracting a fixed number of characters from a field

16  001agdcdafasd
16  002agdcxxxxxx
23  001adfadfahoh
23  001fsdadggggg

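Either of the following prints the first field followed by the first three characters of the second:

awk '{print $1,substr($2,1,3)}' a.txt
awk 'BEGIN{FIELDWIDTHS="2 2:3"}{print $1,$2}' a.txt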

Transposing rows and columns

name age
alice 21
ryan 30

After transposing:

name alice ryan
age 21 30

awk '
    {
        for(i=1;i<=NF;i++){
            if(!(i in arr)){
                arr[i]=$i
            } else {
                arr[i]=arr[i]" "$i
            }
        }
    }
    END{
        for(i=1;i<=NF;i++){
            print arr[i]
        }
    }
' a.txt

Transposing rows and columns, part 2

File contents:

74683 1001
74683 1002
74683 1011
74684 1000
74684 1001
74684 1002
74685 1001
74685 1011
74686 1000
....
100085 1000
100085 1001

The file has only two columns, and the goal is to turn it into:

74683 1001 1002 1011
74684 1000 1001 1002
...

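That is, whenever the first-column number is the same, put the corresponding second-column values on one line, separated by spaces.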

{
  if($1 in arr){
    arr[$1] = arr[$1]" "$2
  } else {
    arr[$1] = $2
  }
  
}

END{
  for(i in arr){
    printf "%s %s\n",i,arr[i]
  }
}
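
Since `for(i in arr)` yields keys in an unspecified order, the array-based version may print the groups out of order. When rows with the same key are adjacent, as they are in the sample, a streaming sketch can keep the input order and avoid holding the whole file in memory. `group_runs` is a hypothetical helper name, and the adjacency of equal keys is an assumption.

```shell
# group_runs is a hypothetical helper; it assumes equal keys are adjacent.
group_runs() {
  awk '
    $1 != prev {                  # a new key starts
      if (NR > 1) print line      # flush the previous group
      line = $1
      prev = $1
    }
    { line = line " " $2 }        # append the value to the current group
    END { if (NR > 0) print line }
  '
}

printf '%s\n' '74683 1001' '74683 1002' '74683 1011' '74684 1000' | group_runs
```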

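Filtering logs within a given time range

When filtering logs with regular expressions in grep/sed/awk, matching precisely down to the hour, minute, or second is very hard.

gawk, however, provides the mktime() function, which converts a time into an epoch value.

# Convert 2019-11-10 03:42:40 to an epoch value (the result depends on the local time zone; this one is for UTC+8)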
$ awk 'BEGIN{print mktime("2019 11 10 03 42 40")}'
1573328560

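With this, you can extract the timestamp string from each log line, pull out its year, month, day, hour, minute, and second, and pass them to mktime() to build the corresponding epoch value. Because epoch values are numbers, they can be compared directly to decide which time is earlier or later.

strptime1() below converts a string in the format 2019-11-10T03:42:40+08:00 into an epoch value; comparing it with which_time then filters the log with one-second precision.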

BEGIN{
  # Build the epoch value of the cutoff time to filter on
  which_time = mktime("2019 11 10 03 42 40")
}

{
  # Extract the date-time string from the log line
  match($0,"^.*\\[(.*)\\].*",arr)

  # Convert the date-time string to an epoch value
  tmp_time = strptime1(arr[1])

  # Compare epoch values to compare times
  if(tmp_time > which_time){print}
}

# Expected input string format: "2019-11-10T03:42:40+08:00"
function strptime1(str   ,arr,Y,M,D,H,m,S) {
  patsplit(str,arr,"[0-9]{1,4}")
  Y=arr[1]
  M=arr[2]
  D=arr[3]
  H=arr[4]
  m=arr[5]
  S=arr[6]
  return mktime(sprintf("%s %s %s %s %s %s",Y,M,D,H,m,S))
}
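
When every timestamp is in the fixed-width ISO 8601 form shown in the sample log line and all lines share the same UTC offset, plain string comparison already orders them chronologically, so epoch conversion can be skipped. The sketch below works under those assumptions only. `newer_than` is a hypothetical helper, the second sample line is made up, and the timestamp is assumed to be the fourth whitespace-separated field.

```shell
# newer_than CUTOFF: print log lines whose timestamp is later than CUTOFF.
# CUTOFF has the form YYYY-MM-DDTHH:MM:SS. Hypothetical helper for illustration.
newer_than() {
  awk -v cutoff="$1" '{
    ts = substr($4, 2, 19)      # "[2019-11-07T03:11:02+08:00]" -> "2019-11-07T03:11:02"
    if (ts > cutoff) print      # both sides are strings, so this compares lexically
  }'
}

printf '%s\n' \
  '111.202.100.141 - - [2019-11-07T03:11:02+08:00] "GET /robots.txt HTTP/1.1" 301 169' \
  '111.202.100.142 - - [2019-11-10T04:00:00+08:00] "GET /index.html HTTP/1.1" 200 512' |
  newer_than 2019-11-10T03:42:40
```

This shortcut does not apply to formats such as 10/Nov/2019:23:53:44, where lexical order and chronological order differ; those still need a conversion like the mktime()-based functions.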

strptime2() below converts a string in the format 10/Nov/2019:23:53:44+08:00 into an epoch value; comparing it with which_time then filters the log with one-second precision.

BEGIN{
  which_time = mktime("2019 11 10 03 42 40")
}

{
  match($0,"^.*\\[(.*)\\].*",arr)
  
  tmp_time = strptime2(arr[1])
  
  if(tmp_time > which_time){
    print 
  }
}

# Expected input string format: "10/Nov/2019:23:53:44+08:00"
function strptime2(str   ,dt_str,arr,Y,M,D,H,m,S) {
  dt_str = gensub("[/:+]"," ","g",str)
  # dt_str = "10 Nov 2019 23 53 44 08 00"
  split(dt_str,arr," ")
  Y=arr[3]
  M=mon_map(arr[2])
  D=arr[1]
  H=arr[4]
  m=arr[5]
  S=arr[6]
  return mktime(sprintf("%s %s %s %s %s %s",Y,M,D,H,m,S))
}

function mon_map(str   ,mons){
  mons["Jan"]=1
  mons["Feb"]=2
  mons["Mar"]=3
  mons["Apr"]=4
  mons["May"]=5
  mons["Jun"]=6
  mons["Jul"]=7
  mons["Aug"]=8
  mons["Sep"]=9
  mons["Oct"]=10
  mons["Nov"]=11
  mons["Dec"]=12
  return mons[str]
}

Removing `/* */` comments

Sample data:

/*AAAAAAAAAA*/
1111
222

/*aaaaaaaaa*/
32323
12341234
12134 /*bbbbbbbbbb*/ 132412

14534122
/*
    cccccccccc
*/
xxxxxx /*ddddddddddd
    cccccccccc
    eeeeeee
*/ yyyyyyyy
5642341

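# Lines that open a comment
/\/\*/{
    # The closing "*/" is on the same line
    if(/\*\//){
        print gensub("(.*)/\\*.*\\*/(.*)","\\1\\2","g",$0)
    } else {
        # No closing "*/" on this line

        # 1. Drop everything from "/*" to the end of the line
        print gensub("(.*)/\\*.*","\\1","g",$0)

        # 2. Keep reading until "*/" appears, discarding everything in between
        while( ( getline ) > 0 ){
            # Found the line containing "*/": keep what follows it, then stop reading
            if(/\*\//){
                print gensub(".*\\*/(.*)","\\1","g",$0)
                break
            }
        }
    }
    # getline may have replaced $0, so skip the rule below for this record
    next
}
# Lines outside any comment
!/\/\*/{print}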

Relating adjacent paragraphs

From a file like the following, find each false block whose preceding block is an i-order block, and print both blocks.

2019-09-12 07:16:27 [-][
  'data' => [
    'http://192.168.100.20:2800/api/payment/i-order',
  ],
]
2019-09-12 07:16:27 [-][
  'data' => [
    false,
  ],
]
2019-09-21 07:16:27 [-][
  'data' => [
    'http://192.168.100.20:2800/api/payment/i-order',
  ],
]
2019-09-21 07:16:27 [-][
  'data' => [
    'http://192.168.100.20:2800/api/payment/i-user',
  ],
]
2019-09-17 18:34:37 [-][
  'data' => [
    false,
  ],
]

BEGIN{
  RS="]\n"
  ORS=RS
}
{
  if(/false/ && prev ~ /i-order/){
    print prev
    print
  }
  prev=$0
}

The same search with a recursive regular expression:

grep -Pz '(?s)\d+((?!2019).)*i-order(?1)+\d+(?1)+false(?1)+'

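Working with two files

There are two files, file1 and file2, with identical formats.

Requirement: delete the fifth column of file2, then subtract the first column of file1 from the first column of file2 and put the result where the fifth column used to be. How should this script be written?

file1: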
50.481  64.634  40.573  1.00  0.00
51.877  65.004  40.226  1.00  0.00
52.258  64.681  39.113  1.00  0.00
52.418  65.846  40.925  1.00  0.00
49.515  65.641  40.554  1.00  0.00
49.802  66.666  40.358  1.00  0.00
48.176  65.344  40.766  1.00  0.00
47.428  66.127  40.732  1.00  0.00
51.087  62.165  40.940  1.00  0.00
52.289  62.334  40.897  1.00  0.00
file2:
48.420  62.001  41.252  1.00  0.00
45.555  61.598  41.361  1.00  0.00
45.815  61.402  40.325  1.00  0.00
44.873  60.641  42.111  1.00  0.00
44.617  59.688  41.648  1.00  0.00
44.500  60.911  43.433  1.00  0.00
43.691  59.887  44.228  1.00  0.00
43.980  58.629  43.859  1.00  0.00
42.372  60.069  44.032  1.00  0.00
43.914  59.977  45.551  1.00  0.00

# Method 1:

awk '{
  f1 = $1
  if( (getline <"file2") > 0 ){
    $5 = $1 - f1
    print $0
  }
}' file1

# Method 2:
awk '
  NR==FNR{arr[FNR]=$1}
  NR!=FNR{$5=$1-arr[FNR];print}
' file1 file2
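
A third option, sketched here as an alternative rather than as the article's method, is to let `paste` line the two files up side by side so awk sees both rows in one record. The temporary directory and variable names are illustrative, and only the first two rows of each file are reproduced.

```shell
# Recreate the first two rows of each sample file in a temporary directory.
dir=$(mktemp -d)
printf '%s\n' '50.481  64.634  40.573  1.00  0.00' \
              '51.877  65.004  40.226  1.00  0.00' > "$dir/file1"
printf '%s\n' '48.420  62.001  41.252  1.00  0.00' \
              '45.555  61.598  41.361  1.00  0.00' > "$dir/file2"

# paste joins line N of both files: fields 1-5 come from file1, 6-10 from file2.
# Print the first four columns of file2 followed by (file2 col 1 - file1 col 1).
diff_out=$(paste "$dir/file1" "$dir/file2" | awk '{ print $6, $7, $8, $9, $6 - $1 }')
printf '%s\n' "$diff_out"

rm -rf "$dir"
```

Unlike the two awk-only methods, this prints the columns separated by single spaces; set OFS or use printf if the original spacing has to be preserved.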

Aggregating multiple statistics

In the data below, the first field is an IP and the second field is the URI of each request.

1.1.1.1 /index1.html
1.1.1.1 /index1.html
1.1.1.1 /index2.html
1.1.1.1 /index2.html
1.1.1.1 /index2.html
1.1.1.1 /index3.html
1.1.1.2 /index1.html
1.1.1.2 /index2.html
1.1.1.2 /index2.html
1.1.1.3 /index1.html
1.1.1.3 /index1.html
1.1.1.3 /index2.html
1.1.1.3 /index2.html
1.1.1.3 /index2.html
1.1.1.3 /index3.html
1.1.1.3 /index3.html
1.1.1.4 /index2.html
1.1.1.4 /index2.html

The goal is to count the total number of requests per IP, as well as the number of requests per URI for each IP.

Expected output:

1.1.1.1 6 /index3.html 1
1.1.1.1 6 /index2.html 3
1.1.1.1 6 /index1.html 2
1.1.1.2 3 /index2.html 2
1.1.1.2 3 /index1.html 1
1.1.1.3 7 /index3.html 2
1.1.1.3 7 /index2.html 3
1.1.1.3 7 /index1.html 2
1.1.1.4 2 /index2.html 2

awk code (gawk only: uses arrays of arrays):

awk '
  {
    a[$1][$2]++
  }
  END{
    # Walk the array and sum the total number of requests per IP
    for(ip in a){
      for(uri in a[ip]){
        b[ip] += a[ip][uri]
      }
    }
    
    # Walk it again to print the results
    for(ip in a){
      for(uri in a[ip]){
        print ip, b[ip], uri, a[ip][uri]
      }
    }
  }
' a.log
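
Arrays of arrays such as `a[$1][$2]` are a gawk extension. A sketch of the same statistics for other awk implementations can use the classic `arr[i,j]` form, whose key is the two subscripts joined by SUBSEP. `ip_uri_stats` is a hypothetical helper name, and the final sort is only there to make the output order deterministic.

```shell
# ip_uri_stats is a hypothetical helper; it reads "ip uri" lines on stdin.
ip_uri_stats() {
  awk '
    {
      hits[$1, $2]++    # per (ip, uri) counter, keyed as ip SUBSEP uri
      total[$1]++       # per ip counter
    }
    END {
      for (key in hits) {
        split(key, part, SUBSEP)   # part[1] is the ip, part[2] is the uri
        print part[1], total[part[1]], part[2], hits[key]
      }
    }
  ' | LC_ALL=C sort
}

printf '%s\n' \
  '1.1.1.1 /index1.html' \
  '1.1.1.1 /index1.html' \
  '1.1.1.1 /index2.html' \
  '1.1.1.2 /index1.html' | ip_uri_stats
```

Keeping a separate `total` array updated per line also removes the need for the first END loop.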

Processing paragraphs

File contents:

{ "ent_id" : MinKey, "_id" : MinKey } -->> {
        "ent_id" : NumberLong("aaaaa"),
        "_id" : ObjectId("bbbbb")
} on : shard04 Timestamp(685, 0)
{
        "ent_id" : NumberLong("ccccc"),
        "_id" : ObjectId("ddddd")
} -->> {
        "ent_id" : NumberLong("eeeee"),
        "_id" : ObjectId("fffff")
} on : shard04 Timestamp(331, 1)
{
        "ent_id" : NumberLong("ggggg"),
        "_id" : ObjectId("hhhhh")
} -->> {
        "ent_id" : NumberLong("iiiii"),
        "_id" : ObjectId("jjjjj")
} on : shard04 Timestamp(680, 0)

Expected result:

MinKey,MinKey,NumberLong("aaaaa"),ObjectId("bbbbb"),shard04
NumberLong("ccccc"),ObjectId("ddddd"),NumberLong("eeeee"),ObjectId("fffff"),shard04
NumberLong("ggggg"),ObjectId("hhhhh"),NumberLong("iiiii"),ObjectId("jjjjj"),shard04

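awk code:

BEGIN{
  # Use " Timestamp(...)" plus the newline after it as the input record separator, so each read returns one block
  RS=" Timestamp\\([0-9]+, [0-9]+\\)\n"
}
{
  # Save everything that follows a colon in the block into an array
  n = patsplit($0,arr,": ([0-9a-zA-Z\"\\(\\)])+")
  # Walk the matches by index, because for-in order is unspecified
  for(i=1;i<=n;i++){
    # Strip the colon and join the elements with commas
    str = str gensub(": ","","g",arr[i])","
  }
  # Drop the trailing comma
  print(substr(str,1,length(str)-1))
  str=""
}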

Perl or Ruby makes this simpler:

perl -0nE 'BEGIN{$,=","}say $& =~ /: \K[^\s,]+/g while /{.*?} on : \S+/sg' test.log
ruby -ne 'BEGIN{$/=nil};$_.scan(/{.*?} on : \S+/m){|s|puts s.scan(/: \K[^\s,]+/).join(",")}' test.log

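# Or use a more general approach:
# read each line, save the values after colons into an array, and print the array when a line matching "on :" is found
# General-purpose implementation in Ruby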
arr = []
File.foreach("/home/longshuai/test.log") do |line|
  line.scan(/: \K[^\s,]+/) do |s|
    arr << s
  end
  if /on : \S+/ =~ line
    puts arr.join(",")
    arr = []
  end
end

Author: 駿馬金龍 | Source: http://www.cnblogs.com/f-ck-need-u/ | Linux operations discussion group: 921383787
