He Sells Shell Scripts to Intersect Sets

By Thomas Guest

Overload, 15(80), August 2007


Graphical User Interfaces are in great demand but for some tasks there are better tools. Thomas Guest demonstrates the capabilities of command shells.

Introduction

The Unix command shell contains a lot of what I like in a programming environment: it's dynamic, high-level, interpreted, flexible, succinct. It's even reasonably portable now that bash seems to have become the shell of choice. Although there's much about shell scripting I don't like, on many occasions it turns out to be the best tool for the job.

In this article we shall demonstrate how simple shell scripts can be used to implement sets, providing one-line recipes for set creation, set union, set intersection and more. Having explored the power of the Unix shell, we'll consider its limitations before finally discussing the more general lessons we can learn from the Unix tools.

An example: Apache server logs

As an example, let's suppose we want to analyse sets of IP addresses contained in a couple of Apache HTTP Server [Apache] access logs: access_log1 and access_log2. Each log file contains many thousands of lines which look something like this:

    65.214.44.29 - - [25/Jun/2007:00:03:21 +0000] ...
    74.6.87.40 - - [25/Jun/2007:00:03:24 +0000] ...
    65.214.44.29 - - [25/Jun/2007:00:03:24 +0000] ...
    74.6.86.212 - - [25/Jun/2007:00:03:36 +0000] ...
    ...

We can cut this file down to leave just the IP address at the start of each line. cut is a simple tool which we'll be using again later, and here we're passing it options -f1 to select the first field from each line and -d" " to use the space character as a field separator.

    $ cut -f1 -d" " access_log1
    65.214.44.29
    74.6.87.40
    65.214.44.29
    74.6.86.212
    ...

Set creation

The output from this command is likely to be full of duplicates. Regular site visitors typically hit the web server a few times; web spiders and robots are much more hungry. To obtain the sets of unique IP addresses contained in each log file, we could do this:

    $ cut -f1 -d" " access_log1 | sort | uniq > IP1
    $ cut -f1 -d" " access_log2 | sort | uniq > IP2
 

Here cut picks out the IP addresses, sort orders the results, uniq eliminates duplicates, and we've redirected the output into files IP1 and IP2. By the way, we could have eliminated a link from the pipeline using the -u option to sort. The Unix shell tools aren't entirely orthogonal!

     $ cut -f1 -d" " access_log1 | sort -u > IP1
 

The resulting sets are ordered - a set implementation which should be familiar to C++ programmers. The IP addresses will be lexicographically rather than numerically ordered, since we went with the sort defaults. This means that, for example, 122.152.128.10 appears before 58.167.213.128 because 1 alphabetically precedes 5. With a little more effort, we could probably persuade sort to yield a numeric ordering (no, sort -n isn't good enough).
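You can see the default ordering in action at the prompt; here printf supplies just those two addresses (a throwaway check):

    $ printf "122.152.128.10\n58.167.213.128\n" | sort
    122.152.128.10
    58.167.213.128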

Multiset creation

If instead we wanted a multiset - that is, a set in which elements may appear more than once - we could count the number of times items are repeated in the sorted output using the -c option to uniq.

    $ cut -f1 -d" " access_log1 | sort | uniq -c
       8 12.153.20.132
       2 12.217.178.11
      14 12.30.66.226
       1 122.152.128.49
      ...

Here, each IP address is prefixed by the number of times it occurred in the log file, so our multiset contains 12.153.20.132 8 times, etc. This will be useful later when we discuss intersection operations.
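These counts give us something the plain set doesn't. As a throwaway aside, piping this output through a reverse numeric sort puts the busiest clients first - handy for spotting those hungry robots:

    $ cut -f1 -d" " access_log1 | sort | uniq -c | sort -rn | head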

Set union

Let's assume we've followed the steps above and IP1 and IP2 contain the set of IP addresses in the two access logs. Forming the union of these sets is simple.

    $ sort -m IP1 IP2 | uniq > IP1_union_IP2

The -m (merge) option to sort is purely for efficiency and the result would be equally correct without it. Since the inputs are already sorted, we can just merge them together, line by line. For C++ users, it's the difference between the std::sort and std::merge algorithms.
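If you'd like to convince yourself of that, here's a quick check - a throwaway sketch using bash's process substitution - which diffs the merged pipeline against the fully re-sorted one; no output means the results agree:

    $ diff <(sort -m IP1 IP2 | uniq) <(sort IP1 IP2 | uniq)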

Set intersection

The best recipe I've come up with for creating the intersection of the sets IP1 and IP2 isn't quite as simple. Here's how it works. We form the multiset union of IP1 and IP2, then filter it to leave just the elements which appear twice in this multiset.

    $ sort -m IP1 IP2 | uniq -c | grep "^ *2" | \
      tr -s " " | cut -f3 -d" " > IP1_intersection_IP2
 

Let's unpick this pipeline. First, sort -m IP1 IP2 | uniq -c generates the multiset of IP addresses in IP1 and IP2. Since IP1 and IP2 are sets and therefore individually contain no repeats, the resulting multiset looks something like this:

    $ sort -m IP1 IP2 | uniq -c
       1 12.30.66.226
       1 122.152.128.10
       2 122.152.128.49
       1 122.152.129.54
       ...

Each line in the output starts with a count which must be either 1 or 2. Lines starting with 2 correspond to IP addresses common to both files - and these are the IP addresses which form the intersection of IP1 and IP2. The other lines, the ones starting with 1, are the IP addresses in just one of IP1 or IP2.

We then use grep to pick out lines starting with any number of spaces followed by a 2. Next tr -s " " squeezes repeated spaces from each line, making the output suitable for use with cut using the space character as a field delimiter. Finally, cut itself extracts the column we want (the one with the IP address).
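As an aside, the standard comm utility gets us to the same set more directly, provided - as here - both inputs are sorted. Its -1 and -2 options suppress the lines unique to the first and second file respectively, leaving just the lines common to both:

    $ comm -12 IP1 IP2 > IP1_intersection_IP2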

Set symmetric difference

The same recipe with a simple tweak finds the set symmetric difference between IP1 and IP2 (that is, the IP addresses in exactly one of IP1 and IP2). All we need to do is change the grep pattern to match a 1 instead of a 2. The magic of the shell command history allows us to hit the up arrow key and edit the previous command directly; we don't even have to type it all in again.

    $ sort -m IP1 IP2 | uniq -c | grep "^ *1" | \
      tr -s " " | cut -f3 -d" " > IP1_symmetric_diff_IP2

Set subtraction

What about the elements in IP1 but not IP2? We can just intersect IP1 and IP1_symmetric_diff_IP2. Again, we can use the command shell history to recall and adapt the previous command.

    $ sort -m IP1 IP1_symmetric_diff_IP2 | uniq -c | grep "^ *2" | \
      tr -s " " | cut -f3 -d" " > IP1_subtract_IP2

More set operations

One of the nice things about set operations is that there aren't many of them. We've already covered the important ones, and these can easily be extended. Try to work out what set operations are going on in the command history shown below.

    $ diff -q S1 S2
    $ head -1 S1
    $ sort -m S1 S2 S3 S4 S5 | uniq
    $ sort -m S1 S2 S3 | uniq -c | grep -c "^ *3"
    $ sort -m S1 S2 | uniq -c | grep "^ *2" | tr -s " " | cut -f3 -d" " | diff S1 -
    $ sort -m S1 S2 | uniq -c | grep -c "^ *2"
    $ tail -1 S2
    $ wc -l S1

As a hint, the answers in lexicographical order are:

  • are two sets disjoint?
  • are two sets the same?
  • first element of a set
  • how big is the intersection of three sets?
  • how many elements in a set?
  • is a subset of?
  • last element of a set
  • unite five sets

Extending the toolset

The command shell is a powerful, dynamic and extensible programming environment. Even these simple one-line scripts can be stored as functions which can be sourced when a new shell is started; you can add command-line help to them, you can find them using tab-completion, and you can keep them in your source control system. In this way you can create your own customised shell working environment and port it from platform to platform just by checking it out [Guest06a].
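As a sketch of what that might look like - the function name here is my own invention - our intersection recipe could be packaged up in a file sourced from, say, ~/.bashrc:

    # Intersect two sorted set files. Usage: set_intersect SET1 SET2
    set_intersect()
    {
        sort -m "$1" "$2" | uniq -c | grep "^ *2" | \
            tr -s " " | cut -f3 -d" "
    }

Once defined, set_intersect IP1 IP2 > IP1_intersection_IP2 does the same job as the pipeline we built by hand.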

A script's got to know its limitations

Apache server logs are no more and no less than line oriented text. Each record in the log is terminated by a newline character, and each field within each record is delimited in an obvious way: by brackets, spaces, hyphens, whatever - who needs XML? This is the kind of format shell scripts handle well. Conversely, anything more complicated, XML for example, or records which span multiple lines, is likely to push the shell tools too far. Maybe awk could cope, but I don't think many people bother learning awk these days: it's better to use one of the popular high-level languages when basic shell commands won't do.
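For what it's worth, awk would have made light work of the field extraction we did with cut - it splits each line on runs of whitespace by default - so a minimal awk rendering of our set creation step might read:

    $ awk '{ print $1 }' access_log1 | sort -u > IP1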

Shell scripts tend not to fail safely. For example, the following command is meant to clear out files in a temporary directory:

    # Don't try this at home!
    $ rm -rf $TEMP_WORK_DIR/*

You can imagine what happens if TEMP_WORK_DIR has not been set. In general, the Unix commands build on a couple of dangerous assumptions: that programmers know what they are doing; and that the show must go on - by which I mean that, given malformed input, a shell script will not throw an exception. The IP filters we discussed in this article work quite happily with any old text file as input - if it wasn't an Apache HTTP server log, the only indication of failure may well be smaller sets than expected.
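bash can be made to fail safely here, though you have to ask: the ${VAR:?message} form of parameter expansion aborts the command with an error when VAR is unset or empty, rather than silently expanding to nothing.

    # Refuses to run, with an error, if TEMP_WORK_DIR is unset or empty
    $ rm -rf "${TEMP_WORK_DIR:?should be set}"/*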

I'll admit that I personally avoid writing any shell scripts much longer than the ones shown here. As with Makefiles, I admire and respect the technology but I'd rather have someone else deal with the details. The bash manual may be brief to a fault, but I've yet to get to grips with its finer details. Sometimes it's just too subtle.

On the subject of details, earlier in this article I said that by default sort uses lexicographical ordering, which perhaps isn't the ordering we'd prefer for IP addresses; and I also said that a numeric sort -n wouldn't do the job either: IP addresses aren't really numbers, they're dot-separated number quartets. You can use sort to place IP addresses in a more natural order, but the command you'll need is anything but natural.

    # Natural ordering of IP addresses
    $ sort -t. +0n -1n +1n -2n +2n -3n +3n IP

If you want to know how this works you'll have to read the manual. The code, on its own, is unreadable [Guest06b]. If you don't know where the manual is, just open a shell window and type man. If the output from this command doesn't help, try man man, and if you don't know how to open a shell window, I'm surprised you're even reading this sentence!
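Incidentally, the +0n -1n key selectors above are sort's old-style syntax; more recent versions of sort prefer the equivalent -k form which, I'd suggest, is scarcely more natural:

    # The same natural ordering, using modern key syntax
    $ sort -t. -k1,1n -k2,2n -k3,3n -k4,4n IP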

Conclusion

Modern graphical development environments tend to hide the shell and the command line, probably with good reason, and I don't suppose this article will persuade anyone they're worth hunting out. And yet the Unix shell embodies so much that is modern and, I suspect, future best practice.

For me, it's not just what the shell tools can do, it's the example they set. Look again at some of the recipes presented in this article and you'll see container operations without explicit loops. You'll see flexible and generic algorithms. You'll see functional programming. You'll see programs which can parallel-process data without a thread or a mutex in sight; no chance of shared memory corruption or race conditions here. The original design of the shell tools may have become somewhat polluted - we've already seen that sort does some of what uniq can do - but I think the intent shines through as clearly as ever: we have a compact suite of orthogonal tools, each with its own responsibility, which cooperate using simple interfaces. We would do well to emulate this model in our own software designs.

Credits

I would like to thank the Overload editorial team for their help with this article.

References

[Apache] http://httpd.apache.org/

[Guest06a] http://blog.wordaligned.org/articles/2006/09/07/personal-version-control

[Guest06b] http://blog.wordaligned.org/articles/2006/08/06/readable-code





