Mark Needham

Thoughts on Software Development

Archive for the ‘Shell Scripting’ Category

Learning Unix find: Searching in/Excluding certain folders

with 5 comments

I love playing around with commands on the Unix shell but one of the ones that I’ve found the most difficult to learn beyond the very basics is find.

I think this is partially because I find the find man page quite difficult to read and partially because it’s usually quicker to work out how to solve my problem with a command I already know than to learn another one.

However, I recently came across Greg’s wiki which seems to do a pretty good job of explaining it.

Reasonably frequently I want to get a list of files to scan but want to exclude files in the .git directory since the results tend to become overwhelming with those included:

$ find . ! -path  "*.git*" -type f -print

Here we’re saying find items which don’t have .git in their path and which are of type ‘f’ (file) and then print them out.

If we don’t include the -type flag then the results will also include directories which isn’t what we want in this case. The -print is optional in this case since by default what we select will be printed.

Sometimes we want to exclude more than one directory which can be done with the following command:

$ find . \( ! -path "*target*" -a ! -path "*tools*" -a ! -path "*.git*" -print \)

Here we’re excluding the ‘target’, ‘tools’ and ‘.git’ directories from the listing of files that we return.

The -a flag stands for ‘and’ so the above command reads ‘find all files/directories which do not have target in their path and do not have tools in their path and do not have .git in their path’.

We can always make that command a bit more specific if any of those words legitimately appear in a path.
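
As a rough sketch of what ‘more specific’ could look like, the following builds a throwaway tree (every file and folder name in it is made up) and compares the broad "*.git*" pattern with a narrower "*/.git/*" one, which only matches paths that sit underneath a directory named exactly .git:

```shell
# Build a throwaway tree; all names here are invented for illustration.
demo=$(mktemp -d)
mkdir -p "$demo/.git/objects" "$demo/src" "$demo/docs.github"
touch "$demo/.git/config" "$demo/src/App.scala" "$demo/docs.github/notes.txt"

# Broad pattern: also hides docs.github/notes.txt, because ".git" is a
# substring of "docs.github".
broad=$(cd "$demo" && find . ! -path "*.git*" -type f | LC_ALL=C sort)

# Narrower pattern: only excludes paths below a directory called ".git".
narrow=$(cd "$demo" && find . ! -path "*/.git/*" -type f | LC_ALL=C sort)

printf 'broad:\n%s\n\nnarrow:\n%s\n' "$broad" "$narrow"
rm -rf "$demo"
```

The broad version silently drops the notes file, which is the kind of false positive the narrower pattern avoids.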

As well as the -print flag there is also a -prune flag which we can use to stop find from descending into a folder.

The first command could therefore be written like this:

$ find . -path "*.git*" -prune -o -type f -print

This reads ‘don’t go any further into a folder which has .git in the path but print any other files’.

I’m still finding -prune a bit confusing to understand and as the wiki points out:

The most confusing property of -prune is that it is an ACTION, and thus no further filters are processed after it.

To use it, you have to combine it with -o to actually process the non-skipped files, like so:
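
A minimal sketch of that -prune/-o combination, run against a throwaway tree with made-up names (this is not the wiki’s own example). It also shows why the explicit -print matters here: without it, find applies an implicit -print to the whole expression and the pruned directory itself is listed.

```shell
# Throwaway tree; the layout is invented for illustration.
demo=$(mktemp -d)
mkdir -p "$demo/.git/objects" "$demo/src"
touch "$demo/.git/config" "$demo/src/Main.scala"

# Left of -o: anything named .git is pruned (not descended into).
# Right of -o: everything else goes on to -type f -print.
pruned=$(cd "$demo" && find . -name .git -prune -o -type f -print | LC_ALL=C sort)

# Same expression without -print: the implicit -print covers both branches,
# so ./.git itself appears in the output (its contents still do not).
implicit=$(cd "$demo" && find . -name .git -prune -o -type f | LC_ALL=C sort)

printf 'with -print:\n%s\n\nwithout -print:\n%s\n' "$pruned" "$implicit"
rm -rf "$demo"
```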

A couple of months ago I was playing around with our git repository trying to get a list of all the scala files in the ‘src/main’ directory and I went with this command:

$ find . -type f -regex ".*src/main.*\.scala$"

Using the above flags it could instead be written like this:

$ find . -path "*src/main*" -type f -iname "*\.scala*"

or

$ find . -type f -path "*src/main/*\.scala"

Interestingly those latter two versions seem to be a bit slower than the one that uses the -regex flag.

I’m not entirely sure why that is – presumably by supplying two flags on the latter two solutions find has to do more operations per line than it does with the -regex option or something like that?

Written by Mark Needham

October 21st, 2011 at 9:25 pm

Posted in Shell Scripting


Bash: Reusing previous commands

with 2 comments

A lot of the time when I’m using the bash shell I want to re-use commands that I’ve previously entered and I’ve recently learnt some neat ways to do this from my colleagues Tom and Kief.

If we want to list the history of all the commands we’ve entered in a shell session then the following command does the trick:

> history
...
  761  sudo port search pdfinfo
  762  to_ipad andersen-phd-thesis.pdf 
  763  vi ~/.bash_profile
  764  source ~/.bash_profile
  765  to_ipad andersen-phd-thesis.pdf 
  766  to_ipad spotify-p2p10.pdf 
  767  mkdir LinearAlgebra

If we want to execute any of those commands again then we can do that by entering ![numberOfCommand]. For example, to execute the last command on that list we’d do this:

> !767
mkdir LinearAlgebra
mkdir: LinearAlgebra: File exists

We can also search the history and execute the last command that matches the search by doing the following:

> !mk
mkdir LinearAlgebra
mkdir: LinearAlgebra: File exists

A safer way to do this would be to suffix that with :p so the command gets printed to stdout rather than executed:

> !mk:p
mkdir LinearAlgebra

A fairly common use case that I’ve come across is to search for a file and then once you’ve found it open it in a text editor.

We can do this by using the !! command which repeats the previously executed command:

> find . -iname "someFile.txt"
> vi `!!`

We can achieve the same thing by wrapping ‘!!’ inside ‘$()’ as well:

> find . -iname "someFile.txt"
> vi $(!!)

Sam Rowe has a cool post where he goes into this stuff in even more detail.

I’m sure there are more tricks that I haven’t learnt yet so please let me know if you know some!

Written by Mark Needham

October 13th, 2011 at 7:46 pm

Posted in Shell Scripting


Unix: Getting the page count of a linearized PDF

with 2 comments

We were doing some work last week to rasterize a PDF document into a sequence of images and wanted to get a rough idea of how many pages we’d be dealing with if we created an image per page.

The PDFs we’re dealing with are linearized since they’re available for viewing on the web:

A LINEARIZED PDF FILE is one that has been organized in a special way to enable efficient incremental access in a network environment.

The file is valid PDF in all respects, and is compatible with all existing viewers and other PDF applications. Enhanced viewer applications can recognize that a PDF file has been linearized and can take advantage of that organization (as well as added “hint” information) to enhance viewing performance.

The neat thing about this is that the document has metadata detailing the number of pages it contains:

Part 2: Linearization parameter dictionary

43 0 obj

<< /Linearized 1.0 % Version

/L 54567 % File length

/H [475 598] % Primary hint stream offset and length (part 5)

/O 45 % Object number of first page’s page object (part 6)

/E 5437 % Offset of end of first page

/N 11 % Number of pages in document

/T 52786 % Offset of first entry in main cross-reference table (part 11)

>>

endobj

By making use of the strings command Duncan and I hacked together a little script that lets us grab the number of pages in The Games of Strategy PDF or any other linearized PDF:

strings RAND_CB149-1.pdf | 
awk '/Linearized/ { inmeta = 1 } inmeta && match($0, /\/N [0-9]+/) { print substr($0, RSTART, RLENGTH); exit }' |
cut -d" " -f2

It seems much more difficult to find the count if the document hasn’t been linearized but we didn’t need to solve that problem for the moment!
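
The awk step can be tried without a real PDF. The sketch below assumes the linearization dictionary has already been extracted as text (the sample file simply reuses the dictionary shown above, minus the comments) and pulls the page count out in awk itself rather than with a separate cut:

```shell
# Sample input standing in for the output of `strings some.pdf`.
sample=$(mktemp)
cat > "$sample" <<'EOF'
%PDF-1.4
43 0 obj
<< /Linearized 1.0
/L 54567
/H [475 598]
/O 45
/E 5437
/N 11
/T 52786
>>
endobj
EOF

# Only look for /N once the Linearized marker has been seen, then print the
# digits that follow "/N " (RSTART + 3 skips those three characters).
pages=$(awk '/Linearized/ { inmeta = 1 }
             inmeta && match($0, /\/N [0-9]+/) {
               print substr($0, RSTART + 3, RLENGTH - 3)
               exit
             }' "$sample")

echo "$pages"
rm -f "$sample"
```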

Written by Mark Needham

October 9th, 2011 at 11:34 am

Posted in Shell Scripting


Unix: Summing the total time from a log file

with 2 comments

As I mentioned in my last post we’ve been doing some profiling of a data ingestion job and as a result have been putting some logging into our code to try and work out where we need to focus.

We end up with a log file peppered with different statements which looks a bit like the following:

18:50:08.086 [akka:event-driven:dispatcher:global-5] DEBUG - Imported document. /Users/mneedham/foo.xml in: 1298
18:50:09.064 [akka:event-driven:dispatcher:global-1] DEBUG - Imported document. /Users/mneedham/foo2.xml in: 798
18:50:09.712 [akka:event-driven:dispatcher:global-4] DEBUG - Imported document. /Users/mneedham/foo3.xml in: 298
18:50:10.336 [akka:event-driven:dispatcher:global-3] DEBUG - Imported document. /Users/mneedham/foo4.xml in: 898
18:50:10.982 [akka:event-driven:dispatcher:global-1] DEBUG - Imported document. /Users/mneedham/foo5.xml in: 12298

I can never quite tell which column I need to get so end up doing some exploration with awk like this to find out:

$ cat foo.log | awk ' { print $9 }'
1298
798
298
898
12298

Once we’ve worked out the column then we can add them together like this:

$ cat foo.log | awk ' { total+=$9 } END { print total }'
15590

I think that’s much better than trying to determine the total run time in the application and printing it out to the log file.

We can also calculate other stats if we record a log entry for each record:

$ cat foo.log | awk ' { total+=$9; number+=1 } END { print total/number }'
3118
$ cat foo.log | awk 'min=="" || $9 < min {min=$9; minline=$0}; END{ print min}' 
298
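
Since the timing is the last field on each line, one possible shortcut is to use awk’s $NF instead of counting columns. The sketch below gathers the total, average, minimum and maximum in a single pass over the sample lines shown earlier (the temporary log file is only there to keep the example self-contained):

```shell
# Recreate the sample log from above in a temporary file.
log=$(mktemp)
cat > "$log" <<'EOF'
18:50:08.086 [akka:event-driven:dispatcher:global-5] DEBUG - Imported document. /Users/mneedham/foo.xml in: 1298
18:50:09.064 [akka:event-driven:dispatcher:global-1] DEBUG - Imported document. /Users/mneedham/foo2.xml in: 798
18:50:09.712 [akka:event-driven:dispatcher:global-4] DEBUG - Imported document. /Users/mneedham/foo3.xml in: 298
18:50:10.336 [akka:event-driven:dispatcher:global-3] DEBUG - Imported document. /Users/mneedham/foo4.xml in: 898
18:50:10.982 [akka:event-driven:dispatcher:global-1] DEBUG - Imported document. /Users/mneedham/foo5.xml in: 12298
EOF

# $NF is the last field; adding 0 forces a numeric comparison.
stats=$(awk '{ t = $NF + 0
               total += t
               if (NR == 1 || t < min) min = t
               if (NR == 1 || t > max) max = t }
             END { printf "total=%d avg=%d min=%d max=%d\n", total, total / NR, min, max }' "$log")

echo "$stats"
rm -f "$log"
```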

Written by Mark Needham

July 27th, 2011 at 11:02 pm

Posted in Shell Scripting


mount_smbfs: mount error..File exists

with one comment

I’ve been playing around with mounting a Windows file share onto my machine via the terminal because I’m getting bored of constantly having to go to Finder and mount it manually each time!

After a couple of times of mounting and unmounting the drive I ended up with this error:

> mount_smbfs //mneedham@punedc02/shared punedc02_shared/
mount_smbfs: mount error: /Volumes/punedc02_shared: File exists

I originally thought the ‘file exists’ part of the message was suggesting that I’d already mounted a share on ‘punedc02_shared’ but calling the ‘umount’ command led to the following error:

> umount punedc02_shared
umount: punedc02_shared: not currently mounted

I had actually absent-mindedly gone and mounted the drive elsewhere through Finder, which I only realised after reading Victor’s comments on this post.

Make sure that you already do not have the same share mounted on your Mac.

I had //host/share already mounted in /Volumes/share, so when I tried to mount //host/share to /Volumes/newshare it gave me the “file exists” error.

I learnt, thanks to the unix.com forums, that you can see which devices are mounted by using ‘df’.

This is where Finder had mounted the drive for me:

> df
Filesystem                 512-blocks      Used  Available Capacity  Mounted on
...
//mneedham@punedc02/shared  209696376 199773696    9922680    96%    /Volumes/shared

Since the shared drive gets unmounted when I disconnect from the network I decided to write a shell script that would set it up for me again.

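#!/bin/sh
# Create the mount point if needed, then mount the share on it.
mount_drive() {
  mkdir -p "$2"
  mount_smbfs "$1" "$2"
}

# Mount points of any shares from punedc02 that are already mounted.
drives_to_unmount=`df | awk '/mneedham@punedc02/ { print $6 }'`

if [ "$drives_to_unmount" != "" ]; then
  printf 'Unmounting existing drives on punedc02:\n%s\n' "$drives_to_unmount"
  # Left unquoted on purpose so that each mount point is a separate argument.
  umount $drives_to_unmount
fi

mount_drive //mneedham@punedc02/media /Volumes/punedc02_media
mount_drive //mneedham@punedc02/shared /Volumes/punedc02_shared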

At the moment I’ve just put that in ‘/usr/bin’ so that it’s available on the path.

If there’s a better way to do this or a way to simplify the code please let me know.

I did come across a few ways to do the mounting using Apple Script in this post but that’s not half as fun as using the shell!

Written by Mark Needham

January 15th, 2011 at 6:31 pm

Posted in Shell Scripting


Sed: ‘sed: 1: invalid command code R’ on Mac OS X

with 5 comments

A few days ago I wrote about how we’d been using Sed to edit multiple files and while those examples were derived from what we’d been using on Ubuntu I realised that they didn’t actually work on Mac OS X.

For example, the following command:

sed -i 's/require/include/' Rakefile

Throws this error:

sed: 1: "Rakefile": invalid command code R

What I hadn’t realised is that on the Mac version of sed the ‘-i’ flag has a mandatory suffix, as described in this post.

The appropriate section of the man page for sed on the Mac looks like this:

-i extension

Edit files in-place, saving backups with the specified extension. If a zero-length extension is given, no backup will be saved.

It is not recommended to give a zero-length extension when in-place editing files, as you risk corruption or partial content in situations where disk space is exhausted, etc.

Whereas on Ubuntu the suffix is optional so we see this:

-i[SUFFIX], --in-place[=SUFFIX]

edit files in place (makes backup if extension supplied)

In order to get around this we need to provide a blank suffix when using the ‘-i’ flag on the Mac:

sed -i "" 's/require/include/' Rakefile

I didn’t RTFM closely enough the first time!
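
If the same command has to run on both the Mac and Ubuntu, one spelling that both versions of sed accept is a backup suffix attached directly to the flag, with the backup deleted afterwards. A minimal sketch, using a throwaway Rakefile whose contents are made up:

```shell
# Throwaway working directory with a one-line Rakefile (contents invented).
work=$(mktemp -d)
printf "require 'rake'\n" > "$work/Rakefile"

# -i.bak (no space before the suffix) is parsed the same way by GNU and BSD sed.
sed -i.bak 's/require/include/' "$work/Rakefile"

# Remove the backup once the edit has succeeded.
rm -f "$work/Rakefile.bak"

result=$(cat "$work/Rakefile")
echo "$result"
rm -rf "$work"
```

The trade-off is the extra backup file, which the man page quoted above suggests is the safer option anyway.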

Written by Mark Needham

January 14th, 2011 at 2:15 pm

Posted in Shell Scripting


Sed across multiple files

with 12 comments

Pankhuri and I needed to rename a method and change all the places where it was used and decided to see if we could work out how to do it using sed.

We needed to change a method call roughly like this:

home_link(current_user)

To instead read:

homepage_path

For which we need the following sed expression:

sed -i 's/home_link([^)]*)/homepage_path/' [file_name]

Which works pretty well if you know which file you want to change but we wanted to run it over the whole code base.

A bit of googling led us to this thread on devshed which suggested we’d need to get a list of the files and then run sed through the list:

for file in `find .  -type f`; do sed -i 's/home_link([^)]*)/homepage_path/' $file; done

That pretty much works but it doesn’t play nicely if the file has a space in the name since sed thinks the file name has ended before it actually has.

I was pretty sure that we should be able to pipe the output of the find into xargs and a bit more googling led us to the following solution:

find . -type f -print0 | xargs -0 sed -i 's/home_link([^)]*)/homepage_path/'

The ‘-print0’ flag is described like so:

This primary always evaluates to true.  It prints the pathname of the current file to standard output, followed by an ASCII NUL character (character code 0).

While ‘-0’ in ‘xargs’ is described like this:

  -0      Change xargs to expect NUL (``\0'') characters as separators, instead of spaces and newlines.  This is expected to be used in concert with the -print0 function in find(1).

It also runs amazingly fast!

If anyone knows a better way feel free to point it out in the comments.
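
One alternative that avoids the pipe altogether is find’s own -exec with a trailing +, which hands batches of file names straight to sed and so copes with spaces as well. The sketch below runs it on a throwaway tree; the file name and the ‘*.erb’ filter are assumptions made for the example, and -i.bak is used because both GNU and BSD sed accept that spelling:

```shell
# Throwaway tree containing one file with a space in its name.
work=$(mktemp -d)
mkdir -p "$work/app/views"
printf '%s\n' '<%= home_link(current_user) %>' > "$work/app/views/my view.erb"

# {} + passes the matched file names to sed as separate arguments,
# so the space in "my view.erb" is preserved.
find "$work" -type f -name '*.erb' \
  -exec sed -i.bak 's/home_link([^)]*)/homepage_path/' {} +

# Clean up the backups that -i.bak left behind.
find "$work" -type f -name '*.bak' -exec rm -f {} +

result=$(cat "$work/app/views/my view.erb")
echo "$result"
rm -rf "$work"
```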

Written by Mark Needham

January 11th, 2011 at 4:43 pm

Posted in Shell Scripting


A dirty hack to get around aliases not working in a shell script

with 4 comments

In another script I’ve been working on lately I wanted to call ‘mysql’ but unfortunately on my machine it’s ‘mysql5’ rather than ‘mysql’.

I have an alias defined in ‘~/.bash_profile’ so I can call ‘mysql’ from the terminal whenever I want to.

alias mysql=mysql5

Unfortunately shell scripts don’t seem to have access to this alias and the only suggestion I’ve come across while googling this is to source ‘~/.bash_profile’ inside the script.

Since others are going to use the script and might have ‘~/.bashrc’ instead of ‘~/.bash_profile’ I didn’t really want to go down that route.

At this stage a colleague of mine came up with the idea of creating a soft link from mysql to mysql5 inside a folder which is already added to the path.

We located mysql5…

> which mysql5
/opt/local/bin/mysql5

…and then created a soft link like so:

cd /opt/local/bin
ln -s mysql5 mysql

And it works!

Of course ’tis pure hackery so I’d be interested if anyone knows a better way of getting around this.
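
For background, bash does not expand aliases in non-interactive shells unless the script turns on ‘shopt -s expand_aliases’, which is why the alias is not visible from a script. One way to avoid touching the file system is to let the script work out which binary name exists. This is only a sketch: first_available is a hypothetical helper, not something that ships with the shell or with MySQL.

```shell
# first_available (hypothetical helper): print the first name from the
# argument list that resolves to a command on this machine.
first_available() {
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# Prefer mysql5 where it exists, fall back to mysql otherwise.
mysql_cmd=$(first_available mysql5 mysql) || echo "no mysql client found" >&2
```

The rest of the script would then call "$mysql_cmd" wherever it previously called mysql.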

Written by Mark Needham

November 24th, 2010 at 6:48 pm

Posted in Shell Scripting


Browsing around the Unix shell more easily

with 14 comments

Following on from my post about getting the pwd to display on the bash prompt all the time I have learnt a couple of other tricks to make the shell experience more productive.

Aliases are the first new concept I came across and several members of my current team and I now have these set up.

We are primarily using them to provide a shortcut command to get to various locations in the file system. For example I have the following ‘work’ alias in my ~/.bash_profile file:

alias work='cd ~/path/to/my/current/project'

I can then go to the bash prompt and type ‘work’ and it navigates straight there. You can put as many different aliases as you want in there, just don’t forget to execute the following command after adding new ones to get them reflected in the current shell:

. ~/.bash_profile

A very simple idea but one that helps save so many keystrokes for me every day.

Another couple of cool commands I recently discovered are pushd and popd.

They help provide a stack to store directories on, which I have found particularly useful when browsing between distant directories.

For example suppose I am in the directory ‘/Users/mneedham/Desktop/Blog/’ but I want to go to ‘/Users/mneedham/Projects/Ruby/path/to/some/code’ to take a look at some code.

Before changing to that directory I can execute:

pushd .

This will push the current directory (‘/Users/mneedham/Desktop/Blog/’) onto the stack. Then once I’m done I just need to run:

popd

I’m back to ‘/Users/mneedham/Desktop/Blog/’ with a lot less typing.

Running the following command shows a list of the directories currently on the stack:

dirs

I love navigating with the shell so if you’ve got any other useful tips please share them!

Written by Mark Needham

October 15th, 2008 at 10:31 pm

Posted in Shell Scripting


Show pwd all the time

with 3 comments

Finally back in the world of the shell last week, I was constantly typing ‘pwd’ to work out where exactly I was in the file system. A colleague then pointed out that you can adjust your settings so that the current directory shows up automatically on the left-hand side of the prompt.

To do this you need to create or edit your .bash_profile file by entering the following command:

vi ~/.bash_profile

Then add the following line to this file:

export PS1='\u@\H \w\$ '

You should now see something like the following on your command prompt:

mneedham@Macintosh-5.local /users/mneedham/Erlang/playbox$

Another colleague pointed out that the information on the left side is completely configurable. The following entry from the manual pages of bash (type ‘man bash’ then search for ‘PROMPTING’) shows how to do this:

PROMPTING
       When executing interactively, bash displays the primary prompt PS1 when it is ready to read a command, and the secondary prompt PS2 when it needs more input to complete a command.  Bash allows these prompt
       strings to be customized by inserting a number of backslash-escaped special characters that are decoded as follows:
              \a     an ASCII bell character (07)
              \d     the date in "Weekday Month Date" format (e.g., "Tue May 26")
              \D{format}
                     the format is passed to strftime(3) and the result is inserted into the prompt string; an empty format results in a locale-specific time representation.  The braces are required
              \e     an ASCII escape character (033)
              \h     the hostname up to the first `.'
              \H     the hostname
              \j     the number of jobs currently managed by the shell
              \l     the basename of the shell's terminal device name
              \n     newline
              \r     carriage return
              \s     the name of the shell, the basename of $0 (the portion following the final slash)
              \t     the current time in 24-hour HH:MM:SS format
              \T     the current time in 12-hour HH:MM:SS format
              \@     the current time in 12-hour am/pm format
              \A     the current time in 24-hour HH:MM format
              \u     the username of the current user
              \v     the version of bash (e.g., 2.00)
              \V     the release of bash, version + patch level (e.g., 2.00.0)
              \w     the current working directory
              \W     the basename of the current working directory
              \!     the history number of this command
              \#     the command number of this command
              \$     if the effective UID is 0, a #, otherwise a $
              \nnn   the character corresponding to the octal number nnn
              \\     a backslash
              \[     begin a sequence of non-printing characters, which could be used to embed a terminal control sequence into the prompt
              \]     end a sequence of non-printing characters
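
As an illustration of combining a few of those escapes, the line below is one possible variation on the earlier prompt (the particular choice of escapes is just an example): the 24-hour time in brackets, the short hostname and only the basename of the current directory.

```shell
# [HH:MM] user@short-hostname current-directory-basename, then $ (or # for root)
export PS1='[\A] \u@\h \W\$ '
```

Like the earlier setting, it would go in ~/.bash_profile and takes effect in new shells or after sourcing the file.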

This page has more information on some of the other files that come in useful when shell scripting.

Written by Mark Needham

September 28th, 2008 at 10:50 pm

Posted in Shell Scripting
