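Linux Text Processing

I. Getting to Know sed and gawk

Text processing

1. The sed editor

sed is a stream editor. Unlike an interactive text editor, a stream editor edits a stream of data according to a set of rules supplied before it starts processing the data. The sed editor handles the data in the stream according to the commands it is given, doing the following for each line: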

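(1) Read one line of data from the input.

(2) Match that data against the supplied editor commands.

(3) Modify the data in the stream as the commands specify.

(4) Write the new data to STDOUT.

After the stream editor has matched every command against a line of data, it reads the next line and repeats the process.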

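sed options script file

Option	Description
-e script	Add the commands specified in script to the commands run while processing the input
-f file	Add the commands specified in file to the commands run while processing the input
-n	Do not produce output for each command; wait for the print command to produce output

The script parameter specifies a single command to apply to the stream data. If you need more than one command, either use the -e option to specify them on the command line or use the -f option to specify them in a separate file. A large number of commands are available for manipulating data.

Defining an editor command on the command line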

$ echo "This is a test" | sed 's/test/big test/' 
This is a big test
# The s command replaces the first text string pattern between the slashes with the second text string. In this example, big test replaces test.

$ sed 's/dog/cat/' data1.txt
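# The sed editor does not modify the data in the text file; it only sends the modified data to STDOUT. If you look at the original text file, it still contains the original data.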

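Using multiple editor commands on the command line

$ sed -e 's/brown/green/; s/dog/cat/' data1.txt
Both commands are applied to every line of data in the file. The commands must be separated by a semicolon; as a rule, leave no space between the end of a command and the semicolon.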
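
The same two substitutions can also be passed as separate -e options, which avoids any question about spacing around the semicolon. This is a minimal sketch: the sample sentence is fed in with printf in place of data1.txt, and the variable name out is only for illustration.

```shell
# Apply two substitutions, each supplied through its own -e option.
out=$(printf 'The quick brown fox jumps over the lazy dog.\n' |
  sed -e 's/brown/green/' -e 's/dog/cat/')
printf '%s\n' "$out"
```

Each -e adds one command to the script, and sed runs them in the order given on every input line.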

Reading editor commands from a file

$ cat script1.sed 
s/brown/green/ 
s/fox/elephant/ 
s/dog/cat/ 
$ 
$ sed -f script1.sed data1.txt 
The quick green elephant jumps over the lazy cat. 
The quick green elephant jumps over the lazy cat. 
The quick green elephant jumps over the lazy cat. 
The quick green elephant jumps over the lazy cat.

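2. gawk

gawk provides a programming-like environment for modifying and reorganizing the data in a file. With it you can:

Define variables to store data;
Use arithmetic and string operators to operate on data;
Use structured programming concepts (such as if-then statements and loops) to add logic to your data processing;
Generate formatted reports by extracting data elements from the data file and rearranging or formatting them.

A typical example is formatting log files.

gawk command format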

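gawk options program file

Option	Description
-F fs	Specify the field separator that delimits data fields in a line
-f file	Read the program from the specified file
-v var=value	Define a variable and its default value for use in the gawk program
-mf N	Specify the maximum number of fields to process in the data file (a legacy option that gawk accepts but ignores)
-mr N	Specify the maximum record size in the data file (a legacy option that gawk accepts but ignores)
-W keyword	Specify the compatibility mode or warning level for gawk

Reading the program script from the command line

A gawk program script is defined by a pair of braces.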

# The gawk command line assumes the script is a single text string, so you must also enclose the script in single quotes.
$ gawk '{print "Hello World!"}'

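# No file name is given on the command line, so the gawk program takes its data from STDIN.
# Type a line of text and press Enter, and gawk runs the program script once against that line.

The gawk program executes the program script against every line of text in the data stream. Because the script is set to display a fixed text string, you get the same output no matter what text you enter.

To terminate this gawk program, you need to signal the end of the data stream. In the bash shell, the Ctrl+D key combination generates an EOF (End-of-File) character.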

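Using data field variables

gawk automatically assigns a variable to each data element in a line.

$0 represents the entire line of text;
$1 represents the first data field in the line;
$2 represents the second data field in the line;
$n represents the nth data field in the line.

When gawk reads a line of text, it splits the line into data fields using the defined field separator. The default field separator in gawk is any whitespace character (such as a space or a tab).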

$ cat data2.txt 
One line of test text. 
Two lines of test text. 
Three lines of test text.
$ gawk '{print $1}' data2.txt 
One 
Two 
Three

For a file that uses a different field separator, specify it with the -F option.

$ gawk -F: '{print $1}' /etc/passwd
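
To see the effect of -F without depending on the contents of the local /etc/passwd, the sketch below feeds gawk one hypothetical passwd-style record. The record, the variable names, and the fallback to plain awk when gawk is not installed are all assumptions made for illustration.

```shell
# Use gawk when available, otherwise fall back to the system awk
# (assumption: both treat -F the same way for a single-character separator).
AWK=$(command -v gawk || command -v awk)

# Hypothetical passwd-style record, used instead of reading /etc/passwd.
record='rich:x:500:500:Rich Blum:/home/rich:/bin/bash'

# With ":" as the separator, $1 is the login name, $6 the home directory,
# and $7 the login shell.
out=$(printf '%s\n' "$record" | "$AWK" -F: '{print $1, $6, $7}')
printf '%s\n' "$out"
```

The commas in the print statement separate the fields with a single space in the output.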

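Using multiple commands in the program script

gawk lets you combine several commands into one program; on the command line, separate them with semicolons.

# Assign a value to the field variable $4, then print the whole line.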
$ echo "My name is Rich" | gawk '{$4="Christine"; print $0}' 
My name is Christine

Reading the program from a file

$ cat script2.gawk 
{print $1 "'s home directory is " $6} 
$ 
$ gawk -F: -f script2.gawk /etc/passwd 
root's home directory is /root

# In a program file no semicolons are needed; just put each command on its own line
$ cat script3.gawk 
{ 
text = "'s home directory is " 
print $1 text $6
}

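Running a script before processing data

By default, gawk reads a line of text from the input and then executes the program script against that line. Sometimes you need to run a script before processing any data, for example to create a header for a report. The BEGIN keyword forces gawk to execute the program script that follows it before reading any data.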
$ gawk 'BEGIN {print "Hello World!"}'
Hello World!

$ gawk 'BEGIN {print "The data3 File Contents:"}
> {print $0}' data3.txt

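Running a script after processing data

The END keyword specifies a program script that gawk executes after it has read all the data.

$ gawk 'BEGIN {print "The data3 File Contents:"}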
> {print $0} 
> END {print "End of File"}' data3.txt

A complete report from a small program script

$ cat script4.gawk 
BEGIN { 
print "The latest list of users and shells" 
print " UserID \t Shell" 
print "-------- \t -------" 
FS=":" 
} 
{ 
print $1 " \t " $7 
} 
END { 
print "This concludes the listing" 
}

FS is a special variable. Setting it in the BEGIN section is another way to define the field separator.
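
The BEGIN, main, and END sections can be exercised on a couple of lines without touching /etc/passwd. The sketch below is an illustration only: the two passwd-style records are hypothetical, the counter variable count is not part of script4.gawk, and plain awk is used when gawk is not installed.

```shell
# Use gawk when available, otherwise fall back to the system awk.
AWK=$(command -v gawk || command -v awk)

out=$(printf 'root:x:0:0:root:/root:/bin/bash\nrich:x:500:500:Rich:/home/rich:/bin/csh\n' |
  "$AWK" 'BEGIN { FS=":"; print "UserID Shell" }   # runs once, before any input
          { print $1, $7; count++ }                # runs for every input line
          END { print count " users listed" }')    # runs once, after the last line
printf '%s\n' "$out"
```

Because FS is assigned in BEGIN, it is already in effect when the first record is split into fields.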

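sed Editor Basics

1. Substitution options

Substitution flags

The substitute command works fine on text that spans multiple lines, but by default it replaces only the first occurrence in each line. To replace other occurrences in a line, you must use a substitution flag.

s/pattern/replacement/flags

1. A number, indicating which occurrence of the pattern the new text should replace;
2. g, indicating that the new text should replace every occurrence of the pattern;
3. p, indicating that the contents of the line should be printed when a substitution is made;
4. w file, which writes the result of the substitution to a file.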

$  cat data4.txt 
This is a test of the test script. 
This is the second test of the test script.

# 1. A number replaces that occurrence of the pattern in each line (here, the second).
$  sed 's/test/trial/2' data4.txt
This is a test of the trial script. 
This is the second test of the trial script.

# 2. g replaces every occurrence of the pattern in the text.
$  sed 's/test/trial/g' data4.txt 
This is a trial of the trial script. 
This is the second trial of the trial script.

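# 3. p prints the lines in which the substitute command made a replacement. It is usually used with sed's -n option: -n suppresses the editor's normal output, so only the lines modified by the substitute command are printed.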
$ cat data5.txt 
This is a test line. 
This is a different line.  
$ sed -n 's/test/trial/p' data5.txt 
This is a trial line.

# 4. The w flag produces the same output on STDOUT, but also saves the modified lines to the specified file.
$ sed 's/test/trial/w test.txt' data5.txt 
This is a trial line. 
This is a different line. 
$ 
$ cat test.txt 
This is a trial line.
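
The difference between what the w flag sends to STDOUT and what it saves can be checked in one go. This is a sketch under a few assumptions: a temporary file from mktemp stands in for test.txt, the two input lines are supplied with printf, and the variable names are illustrative.

```shell
# Temporary file standing in for test.txt.
tmp=$(mktemp)

# STDOUT still receives every line; only lines where the substitution
# happened are written to the file named after the w flag.
out=$(printf 'This is a test line.\nThis is a different line.\n' |
  sed "s/test/trial/w $tmp")
saved=$(cat "$tmp")
rm -f "$tmp"

printf 'STDOUT:\n%s\nFile:\n%s\n' "$out" "$saved"
```

The file name must be the last thing in the command, because sed treats everything after w up to the end of the line as part of the name.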

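Replacing characters

The forward slash (/) is awkward to substitute because it is also the default delimiter, so every slash in the pattern has to be escaped with a backslash. The sed editor lets you pick a different character as the string delimiter in the substitute command: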
$ sed 's!/bin/bash!/bin/csh!' /etc/passwd
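
For comparison, the sketch below runs the same substitution twice on one hypothetical passwd-style line: once with escaped slashes and once with ! as the delimiter. The input record and variable names are assumptions for illustration.

```shell
line='rich:x:500:500::/home/rich:/bin/bash'

# Default delimiter: every slash in the path must be escaped.
escaped=$(printf '%s\n' "$line" | sed 's/\/bin\/bash/\/bin\/csh/')

# Alternate delimiter: the path can be written as-is.
plain=$(printf '%s\n' "$line" | sed 's!/bin/bash!/bin/csh!')

printf '%s\n%s\n' "$escaped" "$plain"
```

Both commands produce the same line; the second is simply easier to read.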

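2. Using addresses

To apply a command only to a specific line or group of lines, you must use line addressing. There are two forms:

A numeric range of lines
A text pattern that filters lines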

[address]command

address { 
 command1 
 command2 
 command3 
}

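Addressing lines numerically

Lines are referenced by their position in the text stream. The first line is number 1.

# A single line number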
$ sed '2s/dog/cat/' data1.txt 
The quick brown fox jumps over the lazy dog 
The quick brown fox jumps over the lazy cat 
The quick brown fox jumps over the lazy dog 
The quick brown fox jumps over the lazy dog

# A range of line addresses
$ sed '2,3s/dog/cat/' data1.txt 
The quick brown fox jumps over the lazy dog 
The quick brown fox jumps over the lazy cat 
The quick brown fox jumps over the lazy cat 
The quick brown fox jumps over the lazy dog

# From a given line to the end of the text, using $ as the last-line address
$ sed '2,$s/dog/cat/' data1.txt 
The quick brown fox jumps over the lazy dog 
The quick brown fox jumps over the lazy cat 
The quick brown fox jumps over the lazy cat 
The quick brown fox jumps over the lazy cat

Using text pattern filters

/pattern/command

To change the default shell for only the user Samantha, you can use the sed command.

$ grep Samantha /etc/passwd 
Samantha:x:502:502::/home/Samantha:/bin/bash 

# The command applies only to lines that match the text pattern.
$ sed '/Samantha/s/bash/csh/' /etc/passwd 
Samantha:x:502:502::/home/Samantha:/bin/csh

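In text patterns the sed editor uses a feature called regular expressions, which helps you build patterns that match more precisely.

Grouping commands

To run more than one command on a single line, group the commands with braces.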

$ sed '2{ 
> s/fox/elephant/ 
> s/dog/cat/ 
> }' data1.txt 
The quick brown fox jumps over the lazy dog. 
The quick brown elephant jumps over the lazy cat. 
The quick brown fox jumps over the lazy dog. 
The quick brown fox jumps over the lazy dog.

3. Deleting lines

# Delete a specific line from the data stream, identified by line number
$ sed '3d' data6.txt 
This is line number 1. 
This is line number 2. 
This is line number 4.

# A range of lines works too
$ sed '2,3d' data6.txt

# So does the special end-of-file address
$ sed '3,$d' data6.txt

# And so does pattern matching
$ sed '/number 1/d' data6.txt 
This is line number 2. 
This is line number 3. 
This is line number 4.

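The sed editor does not modify the original file. The lines you delete only disappear from the sed editor's output; the original file still contains the "deleted" lines.

You can also delete a range of lines with two text patterns. The first pattern "turns on" line deletion and the second pattern "turns it off". The sed editor deletes every line between the two matching lines, including the matching lines themselves.

$ sed '/1/,/3/d' data6.txt
This is line number 4.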
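
One consequence of the on/off behavior is worth trying out: if the first pattern matches again later and the second pattern never follows, deletion stays on until the end of the stream. The sketch below uses a hypothetical six-line input supplied with printf; the last two lines are assumptions added to trigger the effect.

```shell
# Lines 1-3 are deleted as expected. "line number 1 again" turns deletion
# back on, and because no later line contains a 3, every remaining line
# is deleted as well.
out=$(printf '%s\n' \
  'This is line number 1.' \
  'This is line number 2.' \
  'This is line number 3.' \
  'This is line number 4.' \
  'This is line number 1 again.' \
  'This text should be kept.' |
  sed '/1/,/3/d')
printf '%s\n' "$out"
```

Only the fourth line survives, so pattern ranges need start and stop patterns that are specific enough not to match by accident.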

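4. Inserting and appending text

The sed editor lets you insert and append lines of text to the data stream.

The insert command (i) adds a new line before the specified line;
The append command (a) adds a new line after the specified line.

sed '[address]command\
new line'
# The text in new line appears in the sed editor's output at the place you specify.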

$ echo "Test Line 2" | sed 'i\Test Line 1' 
Test Line 1 
Test Line 2

# Add text inside the data stream
$ sed '3i\
> This is an inserted line.' data6.txt
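
The append command follows the same pattern as insert. The sketch below appends once after line 1 and once after the last line (address $); the two-line input and the variable names are assumptions for illustration.

```shell
input='This is line number 1.
This is line number 2.'

# Append after a numbered line. The backslash must be the last character
# on its line, with the new text on the following line.
out_mid=$(printf '%s\n' "$input" | sed '1a\
This is an appended line.')

# Append after the last line of the stream, addressed with $.
out_end=$(printf '%s\n' "$input" | sed '$a\
This is the new last line.')

printf '%s\n---\n%s\n' "$out_mid" "$out_end"
```

Using $ as the address is a convenient way to add a trailer to a stream whose length is not known in advance.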

5. Changing lines

The change command lets you replace the contents of an entire line in the data stream.

$ sed '3c\
> This is a changed line of text.' data6.txt

# Address with a text pattern
$ sed '/number 3/c\
> This is a changed line of text.' data6.txt

# A change command addressed by a text pattern changes every line in the data stream that matches.
$ sed '/number 1/c\
> This is a changed line of text.' data8.txt 
This is a changed line of text. 
This is line number 2. 
This is line number 3. 
This is line number 4. 
This is a changed line of text.


# Replace two lines of the data stream with a single line of text
$ sed '2,3c\
> This is a new line of text.' data6.txt 
This is line number 1. 
This is a new line of text. 
This is line number 4.

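6. The transform command

The transform command (y) is the only sed editor command that operates on single characters.

[address]y/inchars/outchars/

The first character in inchars is converted to the first character in outchars, the second character in inchars to the second character in outchars, and so on. If inchars and outchars are not the same length, the sed editor produces an error message.

$ echo "This 1 is a test of 1 try." | sed 'y/123/456/'
This 4 is a test of 4 try.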
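
A common use of the one-to-one mapping is case conversion. The sketch below maps each lowercase letter to its uppercase counterpart; the input string and variable name are assumptions for illustration.

```shell
# inchars and outchars have the same length (26), so each lowercase
# letter maps to the uppercase letter in the same position.
out=$(echo "hello world" |
  sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/')
printf '%s\n' "$out"
```

Characters that do not appear in inchars, such as the space, pass through unchanged.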

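The transform command is global: it converts every matching character in the line, and you cannot limit it to particular occurrences.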

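7. Printing

The p command prints a text line;
The equal sign (=) command prints line numbers;
The l (lowercase L) command lists a line.

Printing lines: print the lines that contain a matching text pattern.

# With the -n option, you can suppress every other line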
$ sed -n '/number 3/p' data6.txt 
This is line number 3.

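# View a line before changing it
# The script finds the lines that contain the number 3 and runs two commands on them. First it uses the p command to print the original line; then it uses the s command to substitute text, with the p flag to print the result. The output shows both the original line and the new line.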
$ sed -n '/3/{ 
> p 
> s/line/test/p 
> }' data6.txt 
This is line number 3. 
This is test number 3.

Printing line numbers

$ sed -n '/number 4/{ 
> = 
> p 
> }' data6.txt 
4 
This is line number 4.
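
Combined with -n, the = command gives two handy one-liners: finding the line number of a match and counting the lines in a stream. The three-line input and the variable names below are assumptions for illustration.

```shell
# Print only the line number of the line that matches the pattern.
match_no=$(printf 'one\ntwo\nthree\n' | sed -n '/two/=')

# Print the number of the last line, which is the line count.
total=$(printf 'one\ntwo\nthree\n' | sed -n '$=')

printf 'match on line %s of %s\n' "$match_no" "$total"
```

Without -n, sed would print each line of text in addition to the numbers.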

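Listing lines

The list command (l) prints the text in the data stream together with nonprintable ASCII characters, which are shown as escape codes. In the example, each tab in data9.txt appears as \t and the end of the line is marked with $.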

$ cat data9.txt 
This line contains tabs. 
$ 
$ sed -n 'l' data9.txt 
This\tline\tcontains\ttabs.$

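8. Using files with sed

Writing to a file

[address]w filename

The filename can be a relative or an absolute path, but either way the user running the sed editor must have write permission for the file.

# Write the first two lines of the data stream to a text file.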
$ sed '1,2w test.txt' data6.txt

# Create a data file from a master file based on a common text value
$ cat data11.txt 
Blum, R Browncoat 
McGuiness, A Alliance 
Bresnahan, C Browncoat 
Harken, C Alliance
$ sed -n '/Browncoat/w Browncoats.txt' data11.txt 
$ cat Browncoats.txt 
Blum, R Browncoat 
Bresnahan, C Browncoat

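Reading data from a file

[address]r filename

The read command takes only a single line number or text pattern address; the data read from the file is inserted after the addressed line.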

$ cat data12.txt 
This is an added line. 
This is the second added line.
$ sed '/number 2/r data12.txt' data6.txt 
This is line number 1. 
This is line number 2. 
This is an added line. 
This is the second added line. 
This is line number 3. 
This is line number 4.

Combined with the delete command, the read command can replace a placeholder in one file with data from another file.

$ cat notice.std
Would the following people: 
LIST 
please report to the ship's captain.

$ sed '/LIST/{ 
> r data11.txt 
> d 
> }' notice.std 
Would the following people: 
Blum, R Browncoat 
McGuiness, A Alliance 
Bresnahan, C Browncoat 
Harken, C Alliance 
please report to the ship's captain.
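# The form letter uses the generic placeholder LIST where the list of people belongs. The read command inserts the list after the placeholder, and the d command then deletes the placeholder.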
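
The placeholder technique can be tried end to end in a scratch directory. This is a sketch under stated assumptions: the directory comes from mktemp -d, the file names and the shortened two-name list are made up for the test, and the sed script is double-quoted so the shell can expand the directory path.

```shell
# Scratch directory holding a hypothetical name list and form letter.
dir=$(mktemp -d)
printf 'Blum, R Browncoat\nHarken, C Alliance\n' > "$dir/names.txt"
printf 'Would the following people:\nLIST\nplease report to the captain.\n' > "$dir/notice.std"

# On the LIST line, queue the file's contents for output (r), then delete
# the placeholder itself (d). The file name ends at the end of its line.
out=$(sed "/LIST/{
r $dir/names.txt
d
}" "$dir/notice.std")
rm -rf "$dir"

printf '%s\n' "$out"
```

The r command must come before d, because d ends processing of the current line immediately.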

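II. Advanced sed

Multiline commands

When the sed editor reads a data stream, it divides the data into lines based on the position of newline characters, and it processes one line at a time.

If you are looking for the phrase Linux System Administrators Group in your data, it may well be split across two lines, each containing part of the phrase.

The sed editor has three commands for processing multiline text:

N adds the next line in the data stream to create a multiline group for processing.
D deletes a single line in a multiline group.
P prints a single line in a multiline group.

1. The next command

The single-line next command

The lowercase n command tells the sed editor to move to the next line of text in the data stream without going back to the beginning of the commands. Normally the sed editor runs all the defined commands on a line before it moves to the next line of text.

# The data file has five lines, two of which are empty. The goal is to remove the blank line after the header line but leave the blank line before the last line. A sed script that simply deletes blank lines removes both of them.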
$ cat data1.txt
This is the header line. 

This is a data line. 

This is the last line. 
$ 
$ sed '/^$/d' data1.txt
This is the header line. 
This is a data line. 
This is the last line.

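# The script looks for the line that contains the word header. Once it finds it, the n command moves the sed editor to the next line of text, which is the blank line, and the d command deletes it.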
$ sed '/header/{n ; d}' data1.txt 
This is the header line. 
This is a data line. 

This is the last line.
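
The n command is also useful for acting on the line that follows a match instead of the match itself. The sketch below prints only the line after a label line; the four-line input and the label Name: are assumptions for illustration.

```shell
# With -n nothing is printed automatically. On the line that matches,
# n replaces the pattern space with the next input line and p prints it.
out=$(printf 'Name:\nRich\nShell:\nbash\n' | sed -n '/^Name:/{n;p;}')
printf '%s\n' "$out"
```

The semicolon before the closing brace is optional in GNU sed but keeps the one-liner acceptable to stricter sed implementations.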

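Combining lines of text

The single-line next command moves the next line of text in the data stream into the sed editor's workspace, called the pattern space, replacing what was there. The multiline version of the next command (uppercase N) appends the next line of text to the text already in the pattern space.

This has the effect of combining two lines of text from the data stream in the same pattern space. The lines are still separated by a newline character, but the sed editor now treats both lines as one line.

# The sed editor script looks for the line that contains the word first. When it finds it, it uses the N command to merge the next line with it, and then uses the substitute command s to replace the newline character with a space.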
$ cat data2.txt 
This is the header line. 
This is the first data line. 
This is the second data line. 
This is the last line. 
$ 
$ sed '/first/{ N ; s/\n/ / }' data2.txt 
This is the header line. 
This is the first data line. This is the second data line. 
This is the last line.

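# When you search a data file for a text phrase that may be split across two lines, a pattern written for a single line misses the split occurrence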
$ cat data3.txt 
On Tuesday, the Linux System 
Administrator's group meeting will be held. 
All System Administrators should attend. 
Thank you for your attendance. 
$ 
$ sed 'N ; s/System Administrator/Desktop User/' data3.txt 
On Tuesday, the Linux System 
Administrator's group meeting will be held. 
All Desktop Users should attend. 
Thank you for your attendance.

# Here the substitute command uses the wildcard pattern (.) between System and Administrator to match both the space and the newline character
$ sed 'N ; s/System.Administrator/Desktop User/' data3.txt 
On Tuesday, the Linux Desktop User's group meeting will be held. 
All Desktop Users should attend. 
Thank you for your attendance.

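# But when the pattern matches the newline character, the substitution removes it from the string, so the two lines are merged into one.
# To avoid that, use two substitute commands in the sed editor script: one to match the phrase when it is split across lines, and one to match it within a single line.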
$ sed 'N 
> s/System\nAdministrator/Desktop\nUser/ 
> s/System Administrator/Desktop User/ 
> ' data3.txt 
On Tuesday, the Linux Desktop 
User's group meeting will be held. 
All Desktop Users should attend. 
Thank you for your attendance.

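# This script always reads the next line of text into the pattern space before it runs the sed editor commands. When it reaches the last line of text, there is no next line to read, so the N command makes the sed editor stop. If the matching text is on the last line of the data stream, the commands never see it.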
$ cat data4.txt 
On Tuesday, the Linux System 
Administrator's group meeting will be held. 
All System Administrators should attend. 
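# Because the System Administrator text appears on the last line of the data stream, the N command misses it: there is no other line to read into the pattern space and combine with it.
# The fix is to put the single-line commands before the N command and only the multiline commands after it.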
$ sed ' 
> s/System Administrator/Desktop User/ 
> N 
> s/System\nAdministrator/Desktop\nUser/ 
> ' data4.txt 
On Tuesday, the Linux Desktop 
User's group meeting will be held. 
All Desktop Users should attend.
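
Another way to keep the last line from being skipped is to run N on every line except the last, using the address $!. The sketch below assumes GNU sed (for \n in the replacement text) and supplies the three lines of data4.txt through a here-document; the variable name is illustrative.

```shell
# $!N appends the next line on every line but the last, so the
# substitutions that follow still run when the final line stands alone.
out=$(sed '$!N
s/System\nAdministrator/Desktop\nUser/
s/System Administrator/Desktop User/' <<'EOF'
On Tuesday, the Linux System
Administrator's group meeting will be held.
All System Administrators should attend.
EOF
)
printf '%s\n' "$out"
```

With this form the order of the single-line and multiline substitutions no longer matters, because both always get a chance to run.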

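2. The multiline delete command

The sed editor provides the multiline delete command D, which deletes only the first line in the pattern space. It removes all characters up to and including the first newline character.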

$ sed 'N ; /System\nAdministrator/D' data4.txt 
Administrator's group meeting will be held. 
All System Administrators should attend.

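This is handy when you need to delete the line that comes before the line containing the target string.

# The sed editor script looks for blank lines and then uses the N command to add the next line of text to the pattern space. If the new pattern space contains the word header, the D command deletes the first line in the pattern space, which is the blank line.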
$ cat data5.txt

This is the header line.
This is a data line.

This is the last line.
$
$ sed '/^$/{N ; /header/D}' data5.txt
This is the header line.
This is a data line.

This is the last line.

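3. The multiline print command

The multiline print command (P) prints only the first line in a multiline pattern space, that is, all characters up to and including the first newline character.

When you use the -n option to suppress the script's output, it works in much the same way as the single-line p command that displays text.

$ sed -n 'N ; /System\nAdministrator/P' data3.txt
On Tuesday, the Linux System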
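
N, P, and D are often used together as a sliding two-line window. The sketch below is a well-known one-liner that removes consecutive duplicate lines, much like uniq; it is offered as an illustration, and the five-line input and variable name are assumptions.

```shell
# $!N  : append the next line (skipped on the last line)
# /.../!P : if the two lines in the window differ, print the first one
# D    : drop the first line and rerun the script on what remains
out=$(printf 'a\na\nb\nc\nc\n' | sed '$!N; /^\(.*\)\n\1$/!P; D')
printf '%s\n' "$out"
```

Because D restarts the script without reading new input while text remains in the pattern space, each line is compared with its successor exactly once.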

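The hold space

The pattern space is an active buffer that holds the text the sed editor is examining while it processes commands. The sed editor also has a second buffer called the hold space.

Command	Description
h	Copy the pattern space to the hold space
H	Append the pattern space to the hold space
g	Copy the hold space to the pattern space
G	Append the hold space to the pattern space
x	Exchange the contents of the pattern space and the hold space

Typically, after using the h or H command to move a string to the hold space, you eventually use the g, G, or x command to move the stored string back into the pattern space.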
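
A small way to watch the two buffers interact is to reverse the order of the lines in a stream. The sketch below is a familiar idiom, shown as an illustration only; the three-line input and variable name are assumptions.

```shell
# 1!G : on every line but the first, append the hold space (the lines
#       seen so far, already reversed) after the current line
# h   : copy the result back to the hold space
# $p  : on the last line, print the accumulated text
out=$(printf 'first\nsecond\nthird\n' | sed -n '1!G;h;$p')
printf '%s\n' "$out"
```

The -n option suppresses the per-line output, so only the final, fully reversed pattern space is printed.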