Mark Needham

Thoughts on Software Development

Archive for the ‘shell-scripting’ tag

Unix parallel: Populating all the USB sticks

without comments

The day before Graph Connect Europe 2016 we needed to create a bunch of USB sticks containing Neo4j and the training materials. We eventually iterated our way to a half decent approach which made use of the GNU parallel command, which I’ve always wanted to use!

But first I needed to get a USB hub so I could do lots of them at the same time. I bought the EasyAcc USB 3.0 but there are lots of other ones that do the same job.

Next I mounted all the USB sticks and then renamed the volumes to be NEO4J1 -> NEO4J7:

for i in 1 2 3 4 5 6 7; do diskutil renameVolume "USB DISK" NEO4J${i}; done

I then created a bash function called ‘duplicate’ to do the copying work:

function duplicate() {
  # parallel passes each input line as the first argument
  i=$1
  echo ${i}
  time rsync -avP --size-only --delete --exclude '.*' --omit-dir-times /Users/markneedham/Downloads/graph-connect-europe-2016/ /Volumes/NEO4J${i}/
}

We can now export the function, so that the shells spawned by parallel can see it, and call it in parallel like so:

export -f duplicate
seq 1 7 | parallel duplicate

And that’s it. We didn’t get a 7x improvement in the throughput of USB creation from doing 7 in parallel, but it took ~9 minutes to complete all 7 compared to 5 minutes each, i.e. roughly 35 minutes one after another. Presumably there’s still some part of the copying that is sequential further down (Amdahl’s law #ftw).
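As a rough back-of-the-envelope check on those timings, the sketch below assumes that the sequential cost is simply 7 x 5 minutes and that Amdahl's law is a fair model of the copy. Both are assumptions rather than measurements from the day. It works out the speed-up and the fraction of the work that would have to be parallelisable to explain it:

```shell
# Assumed inputs: 5 minutes per stick one at a time, 9 minutes for all 7 at once.
summary=$(awk 'BEGIN {
  n = 7
  serial = n * 5          # minutes if the sticks were done one after another
  observed = 9            # minutes with 7 copies running together
  speedup = serial / observed
  # Amdahl: speedup = 1 / ((1 - p) + p / n), rearranged to solve for p
  p = (1 - 1 / speedup) / (1 - 1 / n)
  printf "speedup=%.2f parallel_fraction=%.2f", speedup, p
}')
echo "$summary"
```

With those numbers the speed-up comes out at a little under 4x, which under this model would correspond to roughly 87% of the copy being parallelisable.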

I want to go and find other things that I can pipe into parallel now!
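On a machine without GNU parallel, a similar fan-out can be sketched with xargs -P. That is a different tool from the one used in this post, and the echo below is only a harmless stand-in for the rsync call, so each worker just reports which volume it would populate:

```shell
# Run up to 7 workers at once; each input line is handed to sh as $1.
results=$(seq 1 7 \
  | xargs -P 7 -I{} sh -c 'echo "would copy to /Volumes/NEO4J$1"' _ {} \
  | sort)
echo "$results"
```

Swapping the echo for the real rsync command would be the obvious next step, but that part is untested here.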

Written by Mark Needham

June 1st, 2016 at 5:53 am

Posted in Shell Scripting


SSHing onto machines via a jumpbox

with 4 comments

We wanted to be able to ssh into some machines which were behind a firewall so we set up a jumpbox which our firewall directed any traffic on port 22 towards.

Initially if we wanted to SSH onto a machine inside the network we’d have to do a two step process:

$ ssh jumpbox
# now on the jumpbox
$ ssh internal-network-machine

That got a bit annoying after a while so Sam showed us a neat way of proxying the second ssh command through the first one by making use of netcat.

We put the following into ~/.ssh/config:

Host jumpbox jumpbox-ip
  Hostname jumpbox-ip
  User user
  IdentityFile ~/.ssh/id_rsa
  ProxyCommand none

Host internal-network-machine
  Hostname internal-network-machine-ip

Host 10.*
  User ubuntu
  ProxyCommand ssh jumpbox exec nc -w 9000 %h %p
  UserKnownHostsFile /dev/null
  StrictHostKeyChecking no

The ‘-w 9000’ flag sets a timeout of 9000 seconds (2 1/2 hours) so that any orphaned connections will die off within that time.

%h and %p represent the host and port of the internal machine so in this case %h is ‘internal-network-machine-ip’ and the port will be 22.

We can then just do the following to ssh into the machine:

ssh internal-network-machine

Which is pretty neat!

This is explained further on benno’s blog and on the OpenBSD Journal.
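Newer OpenSSH releases (7.3 onwards, so after this post was written) have a ProxyJump option which expresses the same hop without netcat. The fragment below is only a sketch that reuses the placeholder host names from the config above and has not been tried against this network:

```text
Host jumpbox
  Hostname jumpbox-ip
  User user
  IdentityFile ~/.ssh/id_rsa

Host internal-network-machine
  Hostname internal-network-machine-ip
  User ubuntu
  # Connect via the jumpbox entry defined above
  ProxyJump jumpbox
```

The one-off equivalent on the command line would be ssh -J jumpbox internal-network-machine.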

Written by Mark Needham

August 10th, 2012 at 12:58 am

Posted in Shell Scripting


VCloud Guest Customization Script : [: postcustomization: unexpected operator

without comments

We have been doing some work to automatically provision machines using the VCloud API via fog and one of the things we wanted to do was run a custom script the first time that a node powers on.

The following explains how customization scripts work:

In vCloud Director, when setting a customization script in a virtual machine, the script:

  • Is called only on initial customization and force recustomization.
  • Is called with the precustomization command line parameter before out-of-box customization begins.
  • Is called with the postcustomization command line parameter after out-of-box customization finishes.
  • Needs to be a batch file for Windows virtual machines and a shell script for Unix virtual machines.

We wanted the script to run only when passed the ‘postcustomization’ flag because our script relied on some networking configuration which hadn’t yet been done in the ‘precustomization’ state.

We wrote something like the following script:

#!/bin/bash
if [ x$1 == x"postcustomization" ]; then
  echo post customization
fi

Unfortunately when we provisioned the node it hadn’t run any of the code within the if block and we saw the following message in /var/log/vmware-inc/customization.log:

5: [: xpostcustomization: unexpected operator

Nick pointed out that the test utility which we’re using to do the comparison on the 2nd line uses a single = in the POSIX shell even though it will work with double = in the bash shell.

We thought this was pretty strange since we are telling the script to run with the bash shell in the first line.

We eventually realised that the script was being spawned out to a POSIX shell by /root/.customization/ which is the script that gets called on power on:

 ((${SH} $POST_CUSTOMIZATION_TMP_SCRIPT_NAME "postcustomization" > /tmp/stdout.log) 2>&1 | ${TEE} -a /tmp/stderr.log)

I created a simple script to check the theory:

[ "mark" == "mark" ] && echo Mark

Which works fine when called directly:

$ ./ 

And throws the expected error when called with ‘sh’:

$ sh 
[: 3: mark: unexpected operator

We therefore needed to change our script to do the comparison with a single = like so:

#!/bin/bash
if [ x$1 = x"postcustomization" ]; then
  echo post customization
fi
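A slightly more defensive version of the same check, sketched below, quotes the argument instead of relying on the x prefix and sticks to the single = so that it behaves the same under sh and bash. The phase_of function name is made up for this example:

```shell
# Hypothetical helper: reports which customization phase was requested.
phase_of() {
  # Single = is the POSIX string comparison; quoting copes with a missing argument.
  if [ "$1" = "postcustomization" ]; then
    echo "post"
  else
    echo "other"
  fi
}

phase_of "postcustomization"
```

Because the argument is quoted, calling it with no argument at all falls through to the else branch rather than producing a syntax error.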

Written by Mark Needham

August 6th, 2012 at 9:50 pm

Posted in Shell Scripting


Learning Unix find: Searching in/Excluding certain folders

with 5 comments

I love playing around with commands on the Unix shell but one of the ones that I’ve found the most difficult to learn beyond the very basics is find.

I think this is partially because I find the find man page quite difficult to read and partially because it’s usually quicker to work out how to solve my problem with a command I already know than to learn another one.

However, I recently came across Greg’s wiki which seems to do a pretty good job of explaining it.

Reasonably frequently I want to get a list of files to scan but want to exclude files in the .git directory since the results tend to become overwhelming with those included:

$ find . ! -path  "*.git*" -type f -print

Here we’re saying find items which don’t have .git in their path and which are of type ‘f’ (file) and then print them out.

If we don’t include the -type flag then the results will also include directories which isn’t what we want in this case. The -print is optional in this case since by default what we select will be printed.

Sometimes we want to exclude more than one directory which can be done with the following command:

$ find . \( ! -path "*target*" -a ! -path "*tools*" -a ! -path "*.git*" -print \)

Here we’re excluding the ‘target’, ‘tools’ and ‘git’ directories from the listing of files that we return.

The -a flag stands for ‘and’ so the above command reads ‘find all files/directories which do not have target in their path and do not have tools in their path and do not have .git in their path’.

We can always make that command a bit more specific if any of those words legitimately appear in a path.
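To illustrate what "more specific" might look like, the sketch below builds a throwaway directory tree (all file names are invented) in which a source file happens to contain the word "target". The loose pattern throws that file away, whereas anchoring the pattern to the directory keeps it:

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/target" "$tmp/tools" "$tmp/src"
touch "$tmp/target/out.jar" "$tmp/tools/build.sh" "$tmp/src/retargeting.scala"
cd "$tmp" || exit 1

# Loose: any path containing the word is excluded, including src/retargeting.scala
loose=$(find . -type f ! -path "*target*" ! -path "*tools*")

# Stricter: only paths that live under ./target or ./tools are excluded
strict=$(find . -type f ! -path "./target/*" ! -path "./tools/*")

echo "loose:  $loose"
echo "strict: $strict"
```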

As well as the -print flag there is also a -prune flag which we can use to stop find from descending into a folder.

The first command could therefore be written like this:

$ find . -path "*.git*" -prune -o -type f -print

This reads ‘don’t go any further into a folder which has git in the path but print any other files which don’t have git in their path’.
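The filtering and pruning forms can be compared side by side on a scratch directory. This is only a sketch with made-up file names; both commands should list the same files, the difference being that the -prune version never walks into .git at all:

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/.git/objects" "$tmp/src"
touch "$tmp/.git/config" "$tmp/.git/objects/abc" "$tmp/src/App.scala" "$tmp/README"
cd "$tmp" || exit 1

# Filter: find still descends into .git but discards everything it sees there
filtered=$(find . ! -path "*.git*" -type f -print | sort)

# Prune: find stops at .git and only evaluates the right-hand side of -o elsewhere
pruned=$(find . -path "*.git*" -prune -o -type f -print | sort)

echo "$filtered"
```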

I’m still finding -prune a bit confusing to understand and as the wiki points out:

The most confusing property of -prune is that it is an ACTION, and thus no further filters are processed after it.

To use it, you have to combine it with -o to actually process the non-skipped files, like so:

A couple of months ago I was playing around with our git repository trying to get a list of all the scala files in the ‘src/main’ directory and I went with this command:

$ find . -type f -regex ".*src/main.*\.scala$"

Using the above flags it could instead be written like this:

$ find . -path "*src/main*" -type f -iname "*\.scala*"

Or like this:

$ find . -type f -path "*src/main/*\.scala"

Interestingly those latter two versions seem to be a bit slower than the one that uses the -regex flag.

I’m not entirely sure why that is. Presumably by supplying two flags in the latter two solutions find has to do more operations per file than it does with the -regex option, or something like that?

Written by Mark Needham

October 21st, 2011 at 9:25 pm

Posted in Shell Scripting


mount_smbfs: mount error..File exists

with 7 comments

I’ve been playing around with mounting a Windows file share onto my machine via the terminal because I’m getting bored of constantly having to go to Finder and manually mounting it each time!

After a couple of times of mounting and unmounting the drive I ended up with this error:

> mount_smbfs //mneedham@punedc02/shared punedc02_shared/
mount_smbfs: mount error: /Volumes/punedc02_shared: File exists

I originally thought the ‘file exists’ part of the message was suggesting that I’d already mounted a share on ‘punedc02_shared’ but calling the ‘umount’ command led to the following error:

> umount punedc02_shared
umount: punedc02_shared: not currently mounted

I had actually absent mindedly gone and mounted the drive elsewhere through Finder which I only realised after reading Victor’s comments on this post.

Make sure that you already do not have the same share mounted on your Mac.

I had //host/share already mounted in /Volumes/share, so when I tried to mount //host/share to /Volumes/newshare it gave me the “file exists” error.

I learnt, thanks to the forums, that you can see which devices are mounted by using ‘df’.

This is where Finder had mounted the drive for me:

> df
Filesystem                 512-blocks      Used  Available Capacity  Mounted on
//mneedham@punedc02/shared  209696376 199773696    9922680    96%    /Volumes/shared

Since the shared drive gets unmounted when I disconnect from the network I decided to write a shell script that would set it up for me again.

function mount_drive {
  mkdir -p "$2"
  mount_smbfs "$1" "$2"
}

# Find anything from punedc02 that is already mounted (e.g. by Finder)
drives_to_unmount=`df | awk '/mneedham@punedc02/ { print $6 }'`

if [ "$drives_to_unmount" != "" ]; then
  printf 'Unmounting existing drives on punedc02:\n%s\n' "$drives_to_unmount"
  umount $drives_to_unmount
fi

mount_drive //mneedham@punedc02/media /Volumes/punedc02_media
mount_drive //mneedham@punedc02/shared /Volumes/punedc02_shared

At the moment I’ve just put that in ‘/usr/bin’ so that it’s available on the path.

If there’s a better way to do this or a way to simplify the code please let me know.
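One part that can be checked without touching any real mounts is the awk filter. The sketch below feeds it the df output shown earlier as a fixed string, so the only assumption is that the real df output keeps the same six-column layout:

```shell
# Sample df output copied from above; no real drives are involved.
sample_df='Filesystem                 512-blocks      Used  Available Capacity  Mounted on
//mneedham@punedc02/shared  209696376 199773696    9922680    96%    /Volumes/shared'

# Same filter as the script: the sixth column of matching rows is the mount point.
mounted=$(printf '%s\n' "$sample_df" | awk '/mneedham@punedc02/ { print $6 }')
echo "$mounted"
```

A mount point containing a space would be split across columns, so this filter would need rethinking for volumes with names like that.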

I did come across a few ways to do the mounting using Apple Script in this post but that’s not half as fun as using the shell!

Written by Mark Needham

January 15th, 2011 at 6:31 pm

Posted in Shell Scripting


Sed: ‘sed: 1: invalid command code R’ on Mac OS X

with 8 comments

A few days ago I wrote about how we’d been using Sed to edit multiple files and while those examples were derived from what we’d been using on Ubuntu I realised that they didn’t actually work on Mac OS X.

For example, the following command:

sed -i 's/require/include/' Rakefile

Throws this error:

sed: 1: "Rakefile": invalid command code R

What I hadn’t realised is that on the Mac version of sed the ‘-i’ flag has a mandatory suffix, as described in this post.

The appropriate section of the man page for sed on the Mac looks like this:

-i extension

Edit files in-place, saving backups with the specified extension. If a zero-length extension is given, no backup will be saved.

It is not recommended to give a zero-length extension when in-place editing files, as you risk corruption or partial content in situations where disk space is exhausted, etc.

Whereas on Ubuntu the suffix is optional so we see this:

-i[SUFFIX], --in-place[=SUFFIX]

edit files in place (makes backup if extension supplied)

In order to get around this we need to provide a blank suffix when using the ‘-i’ flag on the Mac:

sed -i "" 's/require/include/' Rakefile
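A form that should be accepted by both the GNU and the Mac versions of sed is to attach a real backup suffix directly to -i, with no space, and delete the backup afterwards. The sketch below runs against a throwaway Rakefile whose contents are made up for the example:

```shell
tmp=$(mktemp -d)
printf "require 'rake'\n" > "$tmp/Rakefile"

# -i.bak: edit in place and keep the original as Rakefile.bak
sed -i.bak 's/require/include/' "$tmp/Rakefile"
rm "$tmp/Rakefile.bak"

result=$(cat "$tmp/Rakefile")
echo "$result"
```

This sidesteps the difference between the two man pages at the cost of a temporary backup file.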

I didn’t RTFM closely enough the first time!

Written by Mark Needham

January 14th, 2011 at 2:15 pm

Posted in Shell Scripting


Sed across multiple files

with 12 comments

Pankhuri and I needed to rename a method and change all the places where it was used and decided to see if we could work out how to do it using sed.

We needed to change a method call roughly like this:


To instead read:


For which we need the following sed expression:

sed -i 's/home_link([^)]*)/homepage_path/' [file_name]
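The expression can be tried on a single line via standard input before letting it loose on real files. The template line below is hypothetical and only there to show what the pattern matches: the parentheses are literal in sed's basic regular expressions, and [^)]* consumes whatever arguments sit between them.

```shell
# Hypothetical line of view code, not taken from the real code base.
before='<%= link_to "Home", home_link(current_user) %>'

# Without -i, sed reads stdin and writes the rewritten line to stdout.
after=$(printf '%s\n' "$before" | sed 's/home_link([^)]*)/homepage_path/')
echo "$after"
```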

Which works pretty well if you know which file you want to change but we wanted to run it over the whole code base.

A bit of googling led us to this thread on devshed which suggested we’d need to get a list of the files and then run sed through the list:

for file in `find .  -type f`; do sed -i 's/home_link([^)]*)/homepage_path/' $file; done

That pretty much works but it doesn’t play nicely if the file has a space in the name since sed thinks the file name has ended before it actually has.

I was pretty sure that we should be able to pipe the output of the find into xargs and a bit more googling led us to the following solution:

find . -type f -print0 | xargs -0 sed -i 's/home_link([^)]*)/homepage_path/'

The ‘-print0’ flag is described like so:

This primary always evaluates to true.  It prints the pathname of the current file to standard output, followed by an ASCII NUL character (character code 0).

While ‘-0’ in ‘xargs’ is described like this:

  -0      Change xargs to expect NUL (``\0'') characters as separators, instead of spaces and newlines.  This is expected to be used in concert with the -print0 function in find(1).

It also runs amazingly fast!

If anyone knows a better way feel free to point it out in the comments.
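One alternative worth knowing about is to let find hand the file names to sed itself with -exec ... {} +, which also copes with spaces and avoids the pipe. The sketch below is not necessarily better or faster; it uses invented file names in a scratch directory and the -i.bak form so that it should run with either GNU or Mac sed:

```shell
tmp=$(mktemp -d)
printf 'home_link(user)\n' > "$tmp/my view.erb"   # note the space in the name
printf 'home_link()\n' > "$tmp/plain.erb"

# find passes batches of file names straight to sed, one argument per file
find "$tmp" -type f -exec sed -i.bak 's/home_link([^)]*)/homepage_path/' {} +

# Tidy up the backups that -i.bak left behind
find "$tmp" -type f -name '*.bak' -exec rm {} +

remaining=$(grep -rl 'home_link' "$tmp" | wc -l | tr -d ' ')
echo "files still mentioning home_link: $remaining"
```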

Written by Mark Needham

January 11th, 2011 at 4:43 pm

Posted in Shell Scripting


A dirty hack to get around aliases not working in a shell script

with 4 comments

In another script I’ve been working on lately I wanted to call ‘mysql’ but unfortunately on my machine it’s ‘mysql5’ rather than ‘mysql’.

I have an alias defined in ‘~/.bash_profile’ so I can call ‘mysql’ from the terminal whenever I want to.

alias mysql=mysql5

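Unfortunately shell scripts don’t have access to this alias, since bash only expands aliases in interactive shells by default and a script doesn’t read ‘~/.bash_profile’ anyway. The only suggestion I’ve come across while googling this is to source ‘~/.bash_profile’ inside the script.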

Since others are going to use the script and might have ‘~/.bashrc’ instead of ‘~/.bash_profile’ I didn’t really want to go down that route.

At this stage a colleague of mine came up with the idea of creating a soft link from mysql to mysql5 inside a folder which is already added to the path.

We located mysql5…

> which mysql5

…and then created a soft link like so:

cd /opt/local/bin
ln -s mysql5 mysql

And it works!

Of course t’is pure hackery so I’d be interested if anyone knows a better way of getting around this.
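One option that avoids both the alias and the soft link is for the script itself to work out which binary name exists and call that. The sketch below is an illustration rather than what we actually did; the first_available function and the MYSQL variable are names invented for the example:

```shell
# Print the first command name from the arguments that exists on the PATH.
first_available() {
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# The rest of the script would then call "$MYSQL" instead of a hard-coded name.
MYSQL=$(first_available mysql mysql5) || echo "no mysql client found" >&2
```

This keeps the script portable across machines where the client is installed under either name, without touching anyone's PATH.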

Written by Mark Needham

November 24th, 2010 at 6:48 pm

Posted in Shell Scripting
