Unix Utilities

The Unix tool philosophy is often associated with the "worse is better" mindset, and its relative merits remain controversial. Although the debate continues, for practical purposes "worse is better" has won for now.

Unix is based around the philosophy that each tool should do one thing, and do it well. Of course, we often want to carry out large, complicated tasks for which a single small tool is inadequate. There are then two ways to proceed. One could author a large tool specifically designed to carry out the designated task. This approach is often seen in Windows software development. The other approach is to recognise that most large, complicated tasks are composed of several smaller subtasks, and that many of these subtasks are common to two or more of the larger tasks. Thus, if one can find a way to solve the subtasks and then combine the solutions to solve the large tasks, a great deal of time and effort is saved, as one does not have to solve the same subtasks over and over again.

That's great, but I use Windows

I think that you should change operating systems. No, seriously, I do. But on the assumption that you won't (or can't; I'm stuck on a Windows box at my day job), there are ways to get many of the benefits of Unix tools without running Unix.

Examples of the utilities in use

Replacing ISBN10 with ISBN13

A quick summary of how the solution works:

Find all instances of ISBN10s in a document
For each of the instances, find the equivalent ISBN13
Run a search and replace for each pair of ISBN10 and equivalent ISBN13
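The third step is not described in any more detail on this page, so what follows is only a minimal sketch of one way it might be done with sed. It assumes a hypothetical pairs file in which each line holds an ISBN10, a space, and the equivalent ISBN13; the function name replace_isbns and that file layout are assumptions made for illustration, not part of the original method.

```shell
# Hypothetical helper: apply every "ISBN10 ISBN13" pair to a document.
# The body runs in a subshell (note the parentheses) so its variables
# do not leak into the calling shell.
replace_isbns() (
    pairs=$1       # assumed layout: ISBN10, a space, then its ISBN13
    document=$2    # the text in which the ISBN10s should be replaced

    # Turn each pair into one sed substitution command: s/ISBN10/ISBN13/g
    script=$(while read -r isbn10 isbn13; do
        printf 's/%s/%s/g\n' "$isbn10" "$isbn13"
    done < "$pairs")

    # Run all substitutions and write the converted text to standard output.
    sed -e "$script" "$document"
)
```

Called as replace_isbns pairs.txt inputfile.txt > converted.txt, this would write the converted text to a new file and leave the original untouched. The substitutions run one after another, so the sketch does not guard against an ISBN10 that happens to appear inside a freshly inserted ISBN13.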

Details of the method

Find the ISBN10s

We know that an ISBN10 is a string of nine digits, followed by either another digit or the letter 'X'. An ISBN10 consists of four parts, and in many cases the different parts will be separated using hyphens or spaces. Fortunately, the ISBNs that we have to deal with in this example are not subdivided in any way. Therefore, we can use the extended regular expression [0-9]{9}[X0-9]{1} to match ISBN10s (the trailing {1} is redundant but harmless).

grep -o -E "[0-9]{9}[X0-9]{1}" inputfile.txt > outputfile.txt

This command means "in the file 'inputfile.txt', find every string that matches the regular expression specified, and write each match on its own line in the file 'outputfile.txt'". The -E option selects extended regular expression syntax, and -o prints only the matching text rather than the whole line that contains it. The result is a new file (outputfile.txt), each line of which is an ISBN10 found in 'inputfile.txt'.
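To see what the command produces, here is a small sketch that feeds it some made-up sample text through a pipe instead of reading a file. The sample sentences and the variable name found are assumptions chosen purely for illustration.

```shell
# Made-up sample text containing three undivided ISBN10s.
found=$(printf 'First book: 0306406152\nSecond: 080442957X, third: 0131103628\n' |
    grep -o -E "[0-9]{9}[X0-9]{1}")

# Print one match per line, just as they would appear in outputfile.txt.
printf '%s\n' "$found"
# 0306406152
# 080442957X
# 0131103628
```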

Find the corresponding ISBN13s

978 is the EAN prefix assigned to books (sometimes called 'Bookland').

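To convert an ISBN10 to an ISBN13, the check digit is removed, "978" is prepended to the remaining nine digits, and the check digit is then recalculated using the ISBN13 rule: multiply the twelve digits alternately by 1 and 3, add up the products, and choose the check digit that brings the total up to a multiple of 10.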
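The conversion can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name isbn10_to_isbn13 is hypothetical, the input is assumed to be a single undivided ISBN10 such as those produced by the grep step, and the ISBN13 check digit is computed with alternating weights of 1 and 3.

```shell
# Hypothetical helper: print the ISBN13 equivalent of one undivided ISBN10.
# The body runs in a subshell so its variables stay local.
isbn10_to_isbn13() (
    # Drop the old check digit (the last character) and prepend 978.
    body="978${1%?}"

    sum=0
    weight=1
    rest=$body
    while [ -n "$rest" ]; do
        digit=${rest%"${rest#?}"}    # first character of what is left
        rest=${rest#?}               # everything after that character
        sum=$((sum + digit * weight))
        weight=$((4 - weight))       # alternate between 1 and 3
    done

    # The check digit brings the weighted sum up to a multiple of 10.
    check=$(( (10 - sum % 10) % 10 ))
    printf '%s%s\n' "$body" "$check"
)
```

For example, isbn10_to_isbn13 0306406152 prints 9780306406157. To convert a whole list, the function could be called once for each line of outputfile.txt.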

Delete duplicate rows from a list in Excel

Improving the process

There are a number of ways that this solution could be improved. Unfortunately, I don't know how to implement them without some more tools that are trivially available under Unix, but which I don't (yet) have under Win32. For instance, at the moment you need to repeat the process for each individual file that you need to convert. That's not too bad if you have a small number of files that each contain a large number of ISBNs. In the reverse situation, however, carrying out the process for each file would quickly get annoying. A quick and dirty solution (if you don't have access to a decent shell) would be to run the grep command on each of the files that you need to convert, but append the output to a single file (using >> instead of >) rather than writing to several different files or overwriting the same file each time. This will result in a single file that contains all of the ISBN10s that need to be converted.
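Where a shell is available, the appending approach might be wrapped up along the following lines. This is a sketch only: the function name collect_isbn10s and the file names in the usage note are hypothetical. Without a shell, the same effect comes from typing the grep line once per file with >> in place of >.

```shell
# Hypothetical helper: append the ISBN10s found in each input file
# to a single collection file. Runs in a subshell to keep variables local.
collect_isbn10s() (
    collection=$1    # the file that gathers every ISBN10 found
    shift            # the remaining arguments are the files to search

    for file in "$@"; do
        # >> appends to the collection instead of overwriting it.
        grep -o -E "[0-9]{9}[X0-9]{1}" "$file" >> "$collection"
    done
)
```

A call such as collect_isbn10s allisbns.txt chapter1.txt chapter2.txt would gather the matches from both files into allisbns.txt. Piping the collected file through sort -u would then leave one copy of each ISBN10.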

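The above procedure assumes that there are no other numbers in the file that will provide a match to the regular expression but which are not ISBN10s. The first ten digits of any longer run of digits, such as a phone number or an existing ISBN13, would match just as well.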
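One partial safeguard, which is not part of the procedure above, is to test each candidate against the ISBN10 check digit rule: multiply the ten characters by the weights 10 down to 1 (counting 'X' as 10) and require the total to be divisible by 11. The sketch below assumes each candidate is a ten-character string as produced by the grep step; the function name is_valid_isbn10 is hypothetical.

```shell
# Hypothetical helper: succeed only if the argument passes the
# ISBN10 check digit test. Runs in a subshell to keep variables local.
is_valid_isbn10() (
    rest=$1
    sum=0
    weight=10
    while [ -n "$rest" ]; do
        char=${rest%"${rest#?}"}    # first character of what is left
        rest=${rest#?}              # everything after that character
        if [ "$char" = "X" ]; then
            char=10                 # 'X' stands for the value ten
        fi
        sum=$((sum + char * weight))
        weight=$((weight - 1))
    done

    # A valid ISBN10 has a weighted sum that is a multiple of 11.
    [ $((sum % 11)) -eq 0 ]
)
```

Candidates that fail the test can be dropped before the conversion step. This narrows the problem rather than solving it, since roughly one in eleven arbitrary ten-digit strings will still pass.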

External links

GnuWin32 Packages