Thursday, September 8, 2016

Processing Text Streams

When you’re processing text streams within a script or when piping output at the shell prompt, there may be times when you need to filter the output of one command so that only certain portions of the text stream are actually passed along to the stdin of the next command. You can use a variety of tools to do this. In the last part of this chapter, we’ll look at using the following commands:

cut
expand and unexpand
fmt
join and paste
nl
od
pr
sed and awk
sort
split
tr
uniq
wc

cut
The cut command is used to print columns or fields that you specify from a file to the standard output. By default, the tab character is used as a delimiter. The following options can be used with cut:

–b list  Select only these bytes.
–c list  Select only these characters.
–d delim  Use the specified character instead of tab for the field delimiter.
–f list  Select only the specified fields. Print any line that contains no delimiter character, unless the –s option is specified.
–s  Do not print lines that do not contain delimiters.

For example, you could use the cut command to display all group names from the /etc/group file. Remember, the name of each group is contained in the first field of each line of the file.
However, the group file uses colons as the delimiter between fields, so you must specify a colon instead of a tab as the delimiter. The command to do this is cut –d: –f1 /etc/group.
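As a sketch you can try without touching system files, the same idea works against a throwaway file containing a few group-style lines (the entries below are made up):

```shell
# Hypothetical sample with /etc/group-style lines (made-up entries)
printf 'root:x:0:\nusers:x:100:\naudio:x:496:pulse\n' > groupsample

# -d: sets the delimiter to a colon; -f1 selects the first field (the name)
cut -d: -f1 groupsample
```

This prints root, users, and audio, one per line.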

expand and unexpand
The expand command is used to process a text stream, removing all instances of the tab character and replacing them with spaces (the default tab stop is every eight columns). You can use the –t number option to specify a different number. The syntax is expand –t number filename.
In Figure 14-3, the tab characters in the tabfile file are replaced with five spaces.

You can also use the unexpand command. The unexpand command works in the opposite manner as the expand command. It converts spaces in a text stream into tab characters. By default, eight contiguous spaces are converted into tabs. However, you can use the –t option to specify a different number of spaces.

It’s important to note that, by default, unexpand converts only the leading spaces at the beginning of each line. To force it to convert qualifying runs of spaces anywhere on the line, you must include the –a option with the unexpand command.
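A minimal round trip showing both commands (the file name and contents here are just examples):

```shell
# Create a file with a literal tab between two words
printf 'ab\tvalue\n' > tabfile

# Replace each tab with spaces, using tab stops every 5 columns
expand -t 5 tabfile > spacefile

# Convert the spaces back into tabs; -a converts runs of spaces
# anywhere on the line, not just leading ones
unexpand -a -t 5 spacefile > tabfile2
```

After expand, the tab becomes three spaces (filling out to column 5), and unexpand –a restores the original tab, so tabfile2 matches tabfile.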


fmt
You can use the fmt command to reformat a text file. It is commonly used to change the wrapping of long lines within the file to a more manageable width. The syntax for using fmt is fmt option filename.

For example, you could use the –w option with the fmt command to narrow the text of a file to 80 columns by entering fmt –w 80 filename.
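For instance, with a throwaway file (the width of 30 here is arbitrary, chosen to force wrapping of a short sample line):

```shell
# One long line of sample text
printf 'one two three four five six seven eight nine ten eleven twelve\n' > longfile

# Re-wrap so that no output line is wider than 30 columns
fmt -w 30 longfile
```

The single input line is rewrapped into several lines, none longer than 30 characters.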

join and paste
The join command prints a line from each of two specified input files that have identical join fields. The first field is the default join field, delimited by white space. You can specify a different join field using the –j field option.

For example, suppose you have two files. The first file (named firstnames) contains the following content:
1 Mike
2 Jenny
3 Joe
The second file (named lastnames) contains the following content:
1 Johnson
2 Doe
3 Jones
You can use the join command to join the corresponding lines from each file by entering join –j 1 firstnames lastnames. This is shown here:
rtracy@openSUSE:~> join -j 1 firstnames lastnames
1 Mike Johnson
2 Jenny Doe
3 Joe Jones


The paste command works in much the same manner as the join command. It pastes together corresponding lines from one or more files into columns. By default, the tab character is used to separate columns. You can use the –d n option to specify a different delimiter character. You can also use the –s option to put the contents of each file into a single line.

For example, you could use the paste command to join the corresponding lines from the firstnames and lastnames files by entering paste firstnames lastnames. An example is shown here:
rtracy@openSUSE:~> paste firstnames lastnames
1 Mike   1 Johnson
2 Jenny  2 Doe
3 Joe    3 Jones
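A sketch of both options, recreating the two sample files first:

```shell
printf '1 Mike\n2 Jenny\n3 Joe\n' > firstnames
printf '1 Johnson\n2 Doe\n3 Jones\n' > lastnames

# Separate the columns with a comma instead of the default tab
paste -d, firstnames lastnames

# Merge all lines of one file into a single tab-separated line
paste -s firstnames
```

The first command prints lines such as `1 Mike,1 Johnson`; the second prints all three firstnames entries on one line, separated by tabs.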


nl
The nl command numbers the lines in a file. When you run the command, the output is written with a line number added to the beginning of each line in the file. The syntax is nl filename.

For example, in the example shown here, the nl command is used to add a number to the beginning of each line in the tabfile.txt file:

rtracy@openSUSE:~> nl tabfile.txt
     1 This file uses tabs.
     2         This line used a tab.
     3         This line used a tab.
     4 After using expand, the tabs will be replaced with spaces.


od
The od (octal dump) command is used to dump a file, including binary files. This utility can
dump a file in several different formats, including octal, decimal, floating point, hex, and character format. The output from od is simple text, so you can use the other stream-processing tools we’ve been looking at to further filter it.

The od command can be very useful. For example, you can perform a dump of a file to locate stray characters in a file. Suppose you created a script file using an editor on a different operating system (such as Windows) and then tried to run it on Linux. Depending on which editor you used, there may be hidden formatting characters within the script text that aren’t displayed by your text editor. However, they will be read by the bash shell when you try to run the script, thus causing errors. When you look at the script in an editor, everything seems fine.

You could use the od command to view a dump of the script to isolate where the problem-causing characters are located in the file. The syntax for using od is od options filename. Some of the more commonly used options include the following:

–b  Octal dump
–d  Decimal dump
–x  Hex dump
–c  Character dump
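For example, a character dump of a small file makes an otherwise invisible tab obvious (the file contents below are made up):

```shell
# The file contains a tab and a newline that plain cat would hide
printf 'Hi\tok\n' > sample

# -c prints nonprinting characters as escapes such as \t and \n
od -c sample
```

In the dump, the tab shows up as `\t` and the line ending as `\n` between the visible characters.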

For example, suppose a “Hello World” script has been created in the LibreOffice word processor and saved as an .odt file. As such, it has a myriad of hidden characters embedded in the text.

These characters obviously cannot be viewed from within LibreOffice. However, they can be viewed using the od –c helloworld.odt command.


pr
The pr command is used to format text files for printing. It formats the file with pagination, headers, and columns. The header contains the date and time, filename, and page number. You can use the following options with pr:

–d  Double-space the output.
–l page_length  Set the page length to the specified number of lines. The default is 66.
–o margin  Offset each line with the specified number of spaces. The default margin is 0.
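A quick sketch (the page length of 20 is arbitrary, and the header pr adds includes the current date and time, so the exact output varies):

```shell
printf 'line one\nline two\n' > prfile

# Double-space the text on a 20-line page; the header shows
# the date and time, the filename, and the page number
pr -d -l 20 prfile
```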


sed and awk
The sed command is a stream text editor. Unlike the interactive text editors that you’ve already learned how to use in this book, such as vi, a stream editor takes a stream of text as its stdin and then performs operations on it that you specify. Then, sed sends the results to stdout. You can use the following commands with sed:

s  Replaces instances of a specified text string with another text string. The syntax for using the s command is sed s/term1/term2/.
For example, I’ve used the cat command to read a file in the tux user’s home directory named lipsum.txt, piping its stdout to the stdin of the sed command and specifying that the term “ipsum” be replaced with “IPSUM.”

d  Deletes the specified text. For example, to delete every line of text from the stdin that contains the term “eos,” you would enter sed /eos/d.
Remember, sed doesn’t actually modify the source of the information (in this case, the lipsum.txt file). It takes its stdin, makes the changes, and sends the result to stdout. If you want to save the changes made by sed, you need to redirect its stdout to a file using >.
For example, I could redirect the output from the command in Figure 14-8 to a file named lipsum_out.txt by entering cat lipsum.txt | sed s/ipsum/IPSUM/ > lipsum_out.txt at the shell prompt.
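A self-contained sketch of both sed commands, using a stand-in lipsum.txt (the contents are made up):

```shell
# Stand-in for the lipsum.txt file used in the text
printf 'lorem ipsum dolor\neos sample line\n' > lipsum.txt

# Replace "ipsum" with "IPSUM" and save the result to a new file;
# sed can read the file directly, so cat is not strictly required
sed s/ipsum/IPSUM/ lipsum.txt > lipsum_out.txt

# Delete every line containing "eos"
sed /eos/d lipsum.txt
```

The first command leaves lipsum.txt untouched and writes the modified text to lipsum_out.txt; the second prints only the line without “eos.”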


In addition to sed, you can also use awk to manipulate output. Like sed, awk can be used to
receive output from another command as its stdin and manipulate it in a manner you specify.

However, the way awk does this is a little bit different. The awk command treats each line of text it receives as a record. Each word in the line, separated by a space or tab character, is treated as a separate field within the record.

For example, consider the following text file:

Lorem ipsum dolor sit amet, consectetur adipisicing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in
reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia
deserunt mollit anim id est laborum

According to awk, this file has seven records because it has seven separate lines of text. Each line of text ends with a newline character, which is the character awk uses to define the end of a record. The first record has eight fields, the second record has 11 fields, and so on.

Notice that white space, not punctuation, delimits the fields. Each field is referenced as $field_number.

For example, the first field of any record is referenced as $1, the second as $2, and so on.
Using awk, we can specify a field in a specific record and manipulate it in some manner. The syntax for using awk is awk 'pattern {manipulation}'.

For example, we could enter cat lipsum2.txt | awk '{print $1,$2,$3}' to print out the first three words (“fields”) of each line (“records”). Because we didn’t specify a pattern to match on, awk simply prints out the first three words of every line.

You can also include a pattern to specify exactly which records to search on. For example,
suppose we only wanted to display the first three fields of any record that includes the text “do” somewhere in the line. To do this, you add a pattern of /do/ to the command.
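As a sketch with a small stand-in file (the lines are invented so that two of them contain “do”):

```shell
# Two of these three records contain "do" (in "do" and in "dolor")
printf 'sed do eiusmod tempor\nUt enim ad minim\nDuis aute irure dolor\n' > lipsum2.txt

# Print the first three fields, but only for records matching /do/
awk '/do/ {print $1, $2, $3}' lipsum2.txt
```

Only the first and third lines are printed, because the middle record contains no “do.”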
You can also add your own text to the output. Just add it to the manipulation part of the command within quotes. In fact, you can also add control characters to output as well. Use the following:

\t  Inserts a tab character
\n  Adds a newline character
\f  Adds a form feed character
\r  Adds a carriage return character

For example, in Figure 14-11, I’ve entered

cat lipsum.txt | awk '/do/ {print "Field 1: "$1"\t", "Field 2: "$2"\t", "Field 3: "$3"\t"}'

which causes each field to be labeled Field 1, Field 2, and Field 3. It also inserts a tab character between each field. As with sed, awk doesn’t modify the original file. It sends its output to stdout (the screen). If you want to send it to a file, you can redirect it using >.


sort
The sort command sorts the lines of a text file alphabetically. The output is written to the standard output. Some commonly used options for the sort command include the following:

–f Fold lowercase characters to uppercase characters.
–M  Sort by month.
–n  Sort numerically.
–r  Reverse the sort order.

For example, the sort –n –r firstnames command sorts the lines in the firstnames file numerically in reverse order. This is shown here:

rtracy@openSUSE:~> sort –n –r firstnames
3 Joe
2 Jenny
1 Mike

The sort command can be used to sort the output of other commands (such as ps) by piping
the standard output of the first command to the standard input of the sort command.


split
The split command splits an input file into a series of files (without altering the original input file). The default is to split the input file into 1,000-line segments. You can use the –l number option to specify a different number of lines.

For example, the split –l 1 firstnames outputfile_ command can be used to split the firstnames file into three separate files, each containing a single line.
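A sketch, recreating the firstnames file first (split names the pieces by appending aa, ab, ac, and so on, to the given prefix):

```shell
printf '1 Mike\n2 Jenny\n3 Joe\n' > firstnames

# One line per output file: outputfile_aa, outputfile_ab, outputfile_ac
split -l 1 firstnames outputfile_

ls outputfile_*
```

Three one-line files are created, and the original firstnames file is left unchanged.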


tr
The tr command is used to translate or delete characters. However, be aware that this command does not work with files. To use it with files, you must first use a command such as cat to send the text stream to the standard input of tr. The syntax is

tr options X Y

Some commonly used options for the tr command include the following:

–c  Use all characters not in X.
–d  Delete characters in X; do not translate.
–s  Replace each input sequence of a repeated character that is listed in X with a single occurrence of that character.
–t  First truncate X to the length of Y.

For example, to translate all lowercase characters in the lastnames file to uppercase characters, you could enter cat lastnames | tr a-z A-Z, as shown in this example:

rtracy@openSUSE:~> cat lastnames | tr a-z A-Z
1 JOHNSON
2 DOE
3 JONES
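Two more quick sketches, one for the –s option and one for –d:

```shell
# -s squeezes each run of repeated spaces down to a single space
echo 'a    b   c' | tr -s ' '

# -d deletes every character in the set (here, all digits)
echo 'abc123' | tr -d 0-9
```

The first command prints `a b c`; the second prints `abc`.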

uniq
The uniq command reports or omits repeated lines. The syntax is

uniq options input output

You can use the following options with the uniq command:

–d  Only print duplicate lines.
–u  Only print unique lines.

For example, suppose our lastnames file contained duplicate entries:
1 Johnson
1 Johnson
2 Doe
3 Jones

You could use the uniq lastnames command to remove the duplicate lines. This is shown in
the following example:

rtracy@openSUSE:~> uniq lastnames
1 Johnson
2 Doe
3 Jones

Be aware that the uniq command only works if the duplicate lines are adjacent to each other.

If the text stream you need to work with contains duplicate lines that are not adjacent, you can use the sort command to first make them adjacent and then pipe the output to the standard input of uniq.
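A sketch of that pipeline, with a duplicate entry deliberately placed out of order:

```shell
# The two "2 Doe" lines are not adjacent, so uniq alone would keep both
printf '2 Doe\n1 Johnson\n2 Doe\n3 Jones\n' > names

# sort makes the duplicates adjacent; uniq then removes one of them
sort names | uniq
```

The pipeline prints three lines, while running uniq on the unsorted file would still print all four.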


wc
The wc command prints the number of newlines, words, and bytes in a file. The syntax is

wc options files

You can use the following options with the wc command:

–c  Print the byte counts.
–m  Print the character counts.
–l  Print the newline counts.
–L  Print the length of the longest line.
–w  Print the word counts.


For example, to print all counts and totals for the firstnames file, you would use the wc firstnames command, as shown in this example:

rtracy@openSUSE:~> wc firstnames
  3 6 21 firstnames




