Mark Needham

Thoughts on Software Development

Archive for the ‘Shell Scripting’ Category

Unix: Working with parts of large files

without comments

Chris and I were looking at the neo4j log files of a client earlier in the week and wanted to do some processing of the file so we could ask the client to send us some further information.

The log file was over 10,000 lines long but the bit of the file we were interested in was only a few hundred lines.

I usually use Vim and the ‘:set number’ when I want to refer to line numbers in a file but Chris showed me that we can achieve the same thing with e.g. ‘less -N data/log/neo4j.0.0.log’.

We can then print just a range of lines, say lines 10-15, by passing the ‘-n’ flag to sed together with the ‘p’ command:

-n By default, each line of input is echoed to the standard output after all of the commands have been applied to it. The -n option suppresses this behavior.

$ sed -n '10,15p' data/log/neo4j.0.0.log
INFO: Enabling HTTPS on port [7473]
May 19, 2013 11:11:52 AM org.neo4j.server.logging.Logger log
INFO: No SSL certificate found, generating a self-signed certificate..
May 19, 2013 11:11:53 AM org.neo4j.server.logging.Logger log
INFO: Mounted discovery module at [/]
May 19, 2013 11:11:53 AM org.neo4j.server.logging.Logger log
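
For a really big file it can help to tell sed to stop once it has printed the range. The following is a minimal sketch rather than what we ran at the time: it uses seq to generate 10,000 numbered lines as a stand-in for the log file and adds a ‘q’ command so that sed quits at line 15 instead of reading to the end.

```shell
# Stand-in for a large log file: 10,000 numbered lines (hypothetical data).
range=$(seq 1 10000 | sed -n '10,15p')

# Same range, but 'q' makes sed quit at line 15 instead of reading to the end.
range_quit=$(seq 1 10000 | sed -n '10,15p;15q')

printf '%s\n' "$range"
```

Both variables end up holding the same six lines; the second form just does less work on long inputs.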

We then used a combination of grep, awk and sort to work out which log files we needed.

Written by Mark Needham

May 19th, 2013 at 9:44 pm

Posted in Shell Scripting

Unix: Checking for open sockets on nginx

without comments

Tim and I were investigating a weird problem we were having with nginx where it was getting into a state where it had exceeded the number of open files allowed on the system and started rejecting requests.

We can find out the maximum number of open files that a process started from our shell is allowed (the limit applies per process rather than system-wide) with the following command:

$ ulimit -n
1024

Our hypothesis was that some socket connections were never being closed and therefore the number of open files was climbing slowly upwards until it exceeded the limit.

We wanted to check how many sockets nginx had open so to start with we needed to know the process IDs it was running under:

$ ps aux | grep nginx | grep -v grep
root      1089  0.0  0.7 105152  2736 ?        Ss   17:34   0:00 nginx: master process /usr/sbin/nginx
www-data 17474  0.0  0.6 105300  2296 ?        S    21:49   0:04 nginx: worker process
www-data 17475  0.0  0.7 105300  2856 ?        S    21:49   0:04 nginx: worker process
www-data 17476  0.0  0.7 105300  2792 ?        S    21:49   0:03 nginx: worker process
www-data 17477  0.0  0.7 105300  2668 ?        S    21:49   0:04 nginx: worker process

So the process IDs we’re interested in are 1089, 17474, 17475, 17476 and 17477.

We can check which file descriptors they have open with the following command:

$ sudo ls -alh /proc/{1089,17{474,475,476,477}}/fd
/proc/17476/fd:
total 0
dr-x------ 2 www-data www-data  0 Apr 23 23:40 .
...
l-wx------ 1 www-data www-data 64 Apr 23 23:40 6 -> /var/log/nginx/error.log
l-wx------ 1 www-data www-data 64 Apr 23 23:40 7 -> /var/www/thinkingingraphs/shared/log/nginx_access.log
l-wx------ 1 www-data www-data 64 Apr 23 23:40 8 -> /var/www/thinkingingraphs/shared/log/nginx_error.log
lrwx------ 1 www-data www-data 64 Apr 23 23:40 9 -> socket:[8910]
 
/proc/17477/fd:
total 0
...
lrwx------ 1 www-data www-data 64 Apr 23 23:40 56 -> socket:[52213]
lrwx------ 1 www-data www-data 64 Apr 23 23:40 57 -> anon_inode:[eventpoll]
l-wx------ 1 www-data www-data 64 Apr 23 23:40 6 -> /var/log/nginx/error.log
l-wx------ 1 www-data www-data 64 Apr 23 23:40 7 -> /var/www/thinkingingraphs/shared/log/nginx_access.log
l-wx------ 1 www-data www-data 64 Apr 23 23:40 8 -> /var/www/thinkingingraphs/shared/log/nginx_error.log
lrwx------ 1 www-data www-data 64 Apr 23 23:40 9 -> socket:[8910]

We can narrow that down to just show us how many sockets are open:

$ sudo ls -alh /proc/{1089,17{474,475,476,477}}/fd | grep socket  | wc -l
189
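
If we want the count broken down by process, a small loop over the PIDs does the job. This is only a sketch: count_sockets is a name I've made up for illustration, it relies on the Linux /proc layout shown above, and it is run against the current shell ($$) here so that it works without nginx being installed.

```shell
# count_sockets PID...: print "PID COUNT" for each process (Linux /proc only).
# count_sockets is a hypothetical helper name, not part of any tool.
count_sockets() {
  for pid in "$@"; do
    # Each entry in /proc/PID/fd is a symlink; sockets point at "socket:[inode]".
    # grep -c exits non-zero when it counts 0 matches, hence the "|| true".
    count=$(ls -l "/proc/$pid/fd" 2>/dev/null | grep -c 'socket:' || true)
    echo "$pid $count"
  done
}

# Demo against the current shell; for nginx we'd pass its PIDs instead.
count_sockets "$$"
```

For nginx we would call it as ‘sudo’ would allow, e.g. with the five PIDs listed earlier or with the output of ‘pgrep nginx’ where pgrep is available.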

We could also use lsof although for some reason that returns a slightly different number:

$ sudo lsof -p 1089,17474,17475,17476,17477 | grep socket | wc -l
184

If we want to use brace expansion to do that it becomes a bit more tricky:

$ sudo lsof -p `echo {1089,174{74,75,76,77}} | sed 's/ /,/g'` | grep socket | wc -l
184

Annoyingly we couldn’t actually replicate the error but think that it’s been solved in nginx 1.2.0 (we were using 1.1.19) by this change:

Bugfix: a segmentation fault might occur in a worker process if the
       "try_files" directive was used; the bug had appeared in 1.1.19.

Written by Mark Needham

April 23rd, 2013 at 11:59 pm

Posted in Shell Scripting

awk: Parsing ‘free -m’ output to get memory usage/consumption

with 3 comments

Although I know this problem is already solved by collectd and New Relic I wanted to write a little shell script that showed me the memory usage on a bunch of VMs by parsing the output of free.

The output I was playing with looks like this:

$ free -m
             total       used       free     shared    buffers     cached
Mem:           365        360          5          0         59         97
-/+ buffers/cache:        203        161
Swap:          767         13        754

I wanted to find out what % of the memory on the machine was being used and as I understand it the numbers that we would use to calculate this are the ‘total’ value on the ‘Mem’ line and the ‘used’ value on the ‘buffers/cache’ line.

I initially thought that the ‘used’ value I was interested in should be the one on the ‘Mem’ line but this number includes memory that Linux has borrowed for disk caching so it isn’t the true number.

There’s another quite interesting article showing some experiments you can do to prove this.

So what I wanted to do was get the result of the calculation ‘203/365’, which I wasn’t sure how to do until I realised you can match multiple regular expressions with awk like so:

$ free -m | awk '/Mem:/ { print $2 } /buffers\/cache/ { print $3 }'                                                        
365
203

We’ve now filtered the output down to just our two numbers but another neat thing you can do with awk is change what it uses as its field and record separator.

In this case we want to change the field separator to be the new line character and we’ll set the record separator to be nothing because otherwise it defaults to the new line character which will mess with our field separator.

Those two values are set by using the ‘RS’ and ‘FS’ variables:

$ free -m | 
  awk '/Mem:/ { print $2 } /buffers\/cache/ { print $3 }' | 
  awk 'BEGIN { RS = "" ; FS = "\n" } { print $2 / $1 }'
0.556164

This is still suboptimal because we’re using two awk commands rather than one! We can get around that by storing the two memory values in variables and printing them out in an END block:

$ free -m | 
  awk '/Mem:/ { total=$2 } /buffers\/cache/ { used=$3 } END { print used/total}'
0.556164
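
Since what I was after was a percentage, the END block can format the result with printf. The sketch below feeds awk the sample output from above through a heredoc so that it can be run anywhere; on a real machine the heredoc would be replaced by ‘free -m’. As far as I know newer versions of free drop the ‘-/+ buffers/cache’ line in favour of an ‘available’ column, so treat the second pattern as specific to the output shown here.

```shell
# Sample data copied from the post; swap the heredoc for `free -m` on a live box.
pct=$(awk '/Mem:/           { total = $2 }
           /buffers\/cache/ { used = $3 }
           END { if (total > 0) printf "%.1f%%\n", used / total * 100 }' <<'EOF'
             total       used       free     shared    buffers     cached
Mem:           365        360          5          0         59         97
-/+ buffers/cache:        203        161
Swap:          767         13        754
EOF
)
echo "$pct"
```

The ‘total > 0’ guard is there so that awk doesn't divide by zero if neither pattern matches.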

Written by Mark Needham

April 10th, 2013 at 7:03 am

Posted in Shell Scripting

Sed: Replacing characters with a new line

with 5 comments

I’ve been playing around with writing some algorithms in both Ruby and Haskell and the latter wasn’t giving the correct result so I wanted to output an intermediate state of the two programs and compare them.

I didn’t do any fancy formatting of the output from either program so I had the raw data structures in text files which I needed to transform so that they were comparable.

The main thing I wanted to do was get each of the elements of the collection onto their own line. The output of one of the programs looked like this:

[(1,2), (3,4)…]

To get each of the elements onto a new line my first step was to replace every occurrence of ‘, (‘ with ‘\n(‘. I initially tried using sed to do that:

sed -E -e 's/, \(/\\n(/g' ruby_union.txt

All that did was insert the string value ‘\n’ rather than the new line character.

I’ve come across similar problems before and I usually just use tr but in this case it doesn’t work very well because we’re replacing more than just a single character.

I came across this thread on Linux Questions which gives a couple of ways that we can get sed to do what we want.

The first suggestion is that we should type a backslash followed by the enter key at the point in our sed expression where we want the new line to be and then continue writing the rest of the expression.

We therefore end up with the following:

sed -E -e 's/, \(/\
(/g' ruby_union.txt

This approach works but it’s a bit annoying as you need to delete the rest of the expression so that the enter works correctly.

An alternative is to make use of echo with the ‘-e’ flag which allows us to output a new line. Usually backslashed characters aren’t interpreted and so you end up with a literal representation. e.g.

$ echo "mark\r\nneedham"
mark\r\nneedham
 
$ echo -e "mark\r\nneedham"
mark
needham

We therefore end up with this:

sed -E -e "s/, \(/\\`echo -e '\n\r'`/g" ruby_union.txt

** Update **

It was pointed out in the comments that this final version of the sed statement doesn't actually lead to a very nice output which is because I left out the other commands I passed to it which get rid of extra brackets.

The following gives a cleaner output:

$ echo "[(1,2), (3,4), (5,6)]" | sed -E -e "s/, \(/\\`echo -e '\n\r'`/g" -e 's/\[|]|\)|\(//g'
1,2
3,4
5,6
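
Another option, which isn't from the Linux Questions thread, is the $'\n' quoting that bash and zsh support for writing a literal new line. A minimal sketch, assuming one of those shells: the sed expression is built from three pieces glued together, a single quoted part ending in a backslash, the new line itself, and the rest of the expression.

```shell
# 's/, \(/\' + $'\n' + '(/g' gives sed a backslash followed by a real newline
# in the replacement, which sed treats as "insert a newline here".
out=$(echo "[(1,2), (3,4), (5,6)]" |
  sed -E -e 's/, \(/\'$'\n''(/g' -e 's/\[|]|\)|\(//g')

echo "$out"
```

This avoids the stray carriage return that the echo -e version leaves at the start of each line.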

Written by Mark Needham

December 29th, 2012 at 5:49 pm

Posted in Shell Scripting

Unix: Counting the number of commas on a line

without comments

A few weeks ago I was playing around with some data stored in a CSV file and wanted to do a simple check on the quality of the data by making sure that each line had the same number of fields.

One way this can be done is with awk:

awk -F "," ' { print NF-1 } ' file.csv

Here we’re specifying the field separator -F as ‘,’ and then using the NF (number of fields) variable to print how many commas there are on the line.

Another slightly more complicated way is to combine tr and awk like so:

tr -d -c ",\n" < file.csv | awk ' { print length } '

Here we’re telling tr to delete any characters except for a comma or new line.

If we pass just a comma to the ‘-d’ option like so…

tr -d "," < file.csv

…that would delete all the commas from a line but we can use the ‘-c’ option to complement the comma i.e. delete everything except for the comma.

tr -d -c "," < file.csv

Unfortunately that puts all the commas onto the same line so we need to complement the new line character as well:

tr -d -c ",\n" < file.csv

We can then use the length variable of awk to print out the number of commas on each line.

We can achieve the same thing by making use of sed instead of tr like so:

sed 's/[^,]//g' file.csv | awk ' { print length } '

Since sed operates on a line by line basis we just need to tell it to substitute anything which isn’t a comma with nothing and then pipe the output of that into awk and use the length variable again.
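
To turn any of these into the actual quality check, we can summarise the counts instead of eyeballing them. The following is a sketch using a throwaway file created with mktemp (the three rows are made up for illustration); it tallies the comma counts with sort and uniq, and then lists the line numbers whose field count differs from the first line. Like the commands above it counts every comma, so quoted fields containing commas would be miscounted.

```shell
# Throwaway CSV with one short row (hypothetical data).
csv=$(mktemp)
printf 'a,b,c\nd,e,f\ng,h\n' > "$csv"

# How many lines have each comma count?
summary=$(awk -F "," '{ print NF-1 }' "$csv" | sort -n | uniq -c)
echo "$summary"

# Which lines disagree with the first line's field count?
bad=$(awk -F "," 'NR == 1 { want = NF } NF != want { print NR }' "$csv")
echo "lines with a different field count: $bad"

rm -f "$csv"
```

If the file is consistent the summary has a single row and the second command prints nothing.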

I thought it might be possible to solve this problem using cut as well but I can’t see any way to get it to output the total number of fields.

If anyone knows any other cool ways to do the same thing let me know in the comments – it’s always interesting to see how different people wield the unix tools!

Written by Mark Needham

November 10th, 2012 at 4:30 pm

Posted in Shell Scripting

Upstart: Job getting stuck in the start/killed state

with 3 comments

We’re using upstart to handle the processes running on our machines and since the haproxy package only came packaged with an init.d script we wanted to make it upstartified.

When defining an upstart script you need to specify an expect stanza in which you specify whether or not the process which you’re launching is going to fork.

If you do not specify the expect stanza, Upstart will track the life cycle of the first PID that it executes in the exec or script stanzas.

However, most Unix services will “daemonize”, meaning that they will create a new process (using fork(2)) which is a child of the initial process.

Often services will “double fork” to ensure they have no association whatsoever with the initial process.

There is a table on the upstart cookbook under the ‘Implications of Misspecifying expect‘ section which explains what will happen if we specify this incorrectly:

Expect Stanza Behaviour

        Specification of Expect Stanza
Forks   no expect             expect fork         expect daemon
0       Correct               start hangs         start hangs
1       Wrong pid tracked †   Correct             start hangs
2       Wrong pid tracked     Wrong pid tracked   Correct

When we were defining our script we went for expect daemon instead of expect fork and had also mistyped the arguments to the haproxy script which meant it failed to start and ended up in the start/killed state.

From what we could tell upstart had a handle on a PID which didn’t actually exist and when we tried a stop haproxy the command seemed to succeed but didn’t actually do anything.

Phil pointed us to a neat script written by Clint Byrum which spins up and then kills loads of processes in order to exhaust the PID space until a process with the PID upstart is tracking exists and can be re-attached and killed.

It’s available on his website but that wasn’t responding for a period of time yesterday so I’ll repeat it here just in case:

#!/usr/bin/env ruby1.8
 
class Workaround
  def initialize target_pid
    @target_pid = target_pid
 
    first_child
  end
 
  def first_child
    pid = fork do
      Process.setsid
 
      rio, wio = IO.pipe
 
      # Keep rio open
      until second_child rio, wio
        print "\e[A"
      end
    end
 
    Process.wait pid
  end
 
  def second_child parent_rio, parent_wio
    rio, wio = IO.pipe
 
    pid = fork do
      rio.close
      parent_wio.close
 
      puts "%20.20s" % Process.pid
 
      if Process.pid == @target_pid
        wio << 'a'
        wio.close
 
        parent_rio.read
      end
    end
    wio.close
 
    begin
      if rio.read == 'a'
        true
      else
        Process.wait pid
        false
      end
    ensure
      rio.close
    end
  end
end
 
if $0 == __FILE__
  pid = ARGV.shift
  raise "USAGE: #{$0} pid" if pid.nil?
  Workaround.new Integer pid
end

We can save that as a Ruby script, run it with the PID that upstart is tracking as its argument, and the world of upstart will get back into a good place again!

Written by Mark Needham

September 29th, 2012 at 9:56 am

Posted in Shell Scripting

Finding ways to use bash command line history shortcuts

with 2 comments

A couple of months ago I wrote about a bunch of command line history shortcuts that Phil had taught me and after recently coming across Peteris Krumins’ bash history cheat sheet I thought it’d be interesting to find some real ways to use them.

A few weeks ago I wrote about a UTF-8 byte order mark (BOM) that I wanted to remove from a file I was working on and I realised this evening that there were some other files with the same problem.

The initial command read like this:

awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' data/Taxonomy/Products.csv  > data/Taxonomy/Products.csv.bak

The version of the file without the BOM is data/Taxonomy/Products.csv.bak but I wanted it to be data/Taxonomy/Products.csv so I needed to mv it to that location.

By making use of history expansion we can write this as follows:

mv !$ !!:2

!$ represents the last argument which is data/Taxonomy/Products.csv.bak and !!:2 gets the 2nd argument passed to the last command which in this case is data/Taxonomy/Products.csv.

As you’re typing it will expand to the following:

mv data/Taxonomy/Products.csv.bak data/Taxonomy/Products.csv

One of the things that we do quite frequently is look at the nginx configurations and logs of our different applications, which involves doing the following:

$ tail -f /var/log/nginx/site-1-access.log
$ tail -f /var/log/nginx/site-2-access.log

or

$ vi /etc/nginx/sites-enabled/site-1-really-long-name-cause-we-can
$ vi /etc/nginx/sites-enabled/site-2-really-long-name-cause-we-can

Everything except for the file name is the same but typing the up arrow to get the previous command and then manually deleting the file name can end up taking longer than just writing out the whole command again if the site name is long.

Ctrl-w deletes the whole path so that doesn’t help us either.

An alternative is to use the ‘h’ modifier which “Removes a trailing pathname component, leaving the head.”

In this case we could do the following:

$ vi /etc/nginx/sites-enabled/site-1-really-long-name-cause-we-can
$ vi !$:h/site-2-really-long-name-cause-we-can

We still have to type out the whole file name and we don’t get any auto complete help which is a bit annoying.
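
History expansion only happens at an interactive prompt, so it's hard to show in a script, but what the ‘h’ modifier computes can be reproduced with ordinary parameter expansion. This is a sketch of the equivalent, not of history expansion itself; the variable names are just for illustration.

```shell
# What !$ would have been after the first vi command.
last_arg=/etc/nginx/sites-enabled/site-1-really-long-name-cause-we-can

# ${var%/*} strips the shortest trailing "/..." match, i.e. the last path
# component, which is what the :h modifier leaves us with.
dir=${last_arg%/*}

next="$dir/site-2-really-long-name-cause-we-can"
echo "$next"
```

The same trick is handy inside shell scripts where ‘!$:h’ isn't available.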

I realised that on my zsh if I type a space after a history expansion command it expands what I’ve typed to the full paths of everything, which is due to the following key binding:

.oh-my-zsh $ grep -rn "magic-space" *
lib/key-bindings.zsh:20:bindkey ' ' magic-space    # also do history expansion on space

We can do the same thing in bash by running the following command:

bind Space:magic-space

Then if I wanted to open that second nginx file I could do the following:

$ vi !$:h # then type a space which will expand it to:
$ vi /etc/nginx/sites-enabled/ # I can then type backspace, then type 'site-2' and tab and open the file

It’s not completely smooth because of the backspace but I think it’s marginally quicker than the other options.

Another one which I mentioned in the first post is the ^original^replacement which will run the previous command but replace the first instance of ‘original’ with ‘replacement’.

With this one it often seems faster to type the up arrow and change what you want manually or retype the command but when doing a grep of a specific folder I think this is faster.

e.g.

$ grep -rn "magic-space" ~/.oh-my-zsh/lib
/Users/mneedham/.oh-my-zsh/lib/key-bindings.zsh:20:bindkey ' ' magic-space    # also do history expansion on space

Let’s say I was intrigued about bindkey and wanted to find all the instances of that.

One way to do that would be to press the up arrow and then go back along the line using Meta-B until I get to ‘magic-space’, which I can then delete with a few Ctrl-Ws, but in this case the search/replace approach is quicker:

$ ^magic-space^bindkey
grep -rn "bindkey" ~/.oh-my-zsh/lib
/Users/mneedham/.oh-my-zsh/lib/completion.zsh:24:bindkey -M menuselect '^o' accept-and-infer-next-history
/Users/mneedham/.oh-my-zsh/lib/completion.zsh:71:  bindkey "^I" expand-or-complete-with-dots

I’m still looking for other ways to re-use bash history more effectively so let me know any other cool tricks in the comments.

Written by Mark Needham

September 19th, 2012 at 7:00 am

Posted in Shell Scripting

zsh: Don’t verify substituted history expansion a.k.a. disabling histverify

without comments

I use zsh on my Mac terminal and in general I prefer it to bash but it has an annoying default setting whereby when you try to repeat a command via substituted history expansion it asks you to verify that.

For example let’s say by mistake I try to vi into a directory rather than cd’ing into it:

vi ~/.oh-my-zsh

If I try to cd into the directory by using ‘!$’ to grab the last argument from the previous command it will make me confirm that I want to do this:

$ cd !$
$ cd ~/.oh-my-zsh

While reading another one of Peteris Krumins’ blog posts, this time about bash command line history, I came to learn that this is because a setting called histverify has been enabled.

histverify

Allow to review a history substitution result by loading the resulting line into the editing buffer, rather than directly executing it.

I found a thread on StackOverflow which explains all the zsh settings in more detail but for my purposes I needed to run the following command to disable histverify:

unsetopt histverify

I also put that into my ~/.zshrc file so it will carry across to any new terminal sessions that I open.

Written by Mark Needham

September 16th, 2012 at 1:35 pm

Posted in Shell Scripting

cURL and the case of the carriage return

without comments

We were doing some work this week where we needed to make a couple of calls to an API via a shell script and in the first call we wanted to capture one of the lines of the HTTP response headers and use that as an input to the second call.

The way we were doing this was something like the following:

#!/bin/bash
 
# We were actually grabbing a different header but for the sake 
# of this post we'll say it was 'Set-Cookie'
AUTH_HEADER=`curl -I http://www.google.co.uk | grep Set-Cookie`
 
echo $AUTH_HEADER

When we echoed $AUTH_HEADER it looked exactly as we’d expect…

$ ./blah.txt 2>/dev/null
Set-Cookie: NID=63=gwfYa4fhbdqYyEdySrFn1AYybExgjQbQUKPdC5sZ5orRznGY-bt3gTwlc0XaPXv
TxmCIyjDzKWOGBCYlOouQ5-2l7gQGOAj90VrY3LLabRqwJ5Y3zlf-dNR6Y5U3VDKw; 
expires=Sun, 17-Mar-2013 08:28:25 GMT; path=/; domain=.google.co.uk; HttpOnly

…but when we passed that value into the next cURL command it was returning a 401 response code which suggested that we hadn’t even sent the header at all.

We changed the code so that we manually assigned AUTH_HEADER with the correct value and then everything worked fine which suggested there was something weird in the value we were getting back from cURL.

We were constructing the arguments to our next cURL command like so:

#!/bin/bash
 
# We were actually grabbing a different header but for the sake 
# of this post we'll say it was 'Set-Cookie'
AUTH_HEADER=`curl -I http://www.google.co.uk | grep Set-Cookie`
 
echo $AUTH_HEADER
 
ARGS="-H $AUTH_HEADER OTHER RANDOM STUFF HERE"
echo $ARGS

When we ran that we noticed that $ARGS was displaying some quite strange behaviour where the text after $AUTH_HEADER was overriding the value of $AUTH_HEADER:

$ ./blah.txt 2>/dev/null
 
Set-Cookie: NID=63=rma3ah7oBhyirDUqFPODHfaTK9XOqs0CPapYVgTM6vHyCgDTcXs2P_mVDI_hnsap
33E3E6k54b50J8MLc85JadBAiMdhq5HDeH-LbLqwy_hUAOj-1w-YwZOHW7okuiEy; 
expires=Sun, 17-Mar-2013 08:37:47 GMT; path=/; domain=.google.co.uk; HttpOnly
 
Set-Cookie: NID=63=rma3ah7oBhyirDUqFPODHfaTK9XOqs0CPapYVgTM6vHyCgDTcXs2P_mVDI_hnsap
33E3E6k54b50J8MLc85JadBAiMdhq5HDeH-LbLqwy_hUAOj-1w-YwZOHW7okuiEy; 
expires=Sun, 17-Mar-2013 08:37:47 GMT; path=/; dom RANDOM TEXT SO RANDOMpOnly

Paul was wandering by at the time so we asked him if he could think of anything that could be leading to what we were seeing. He suggested there was probably a carriage return lurking at the end of the line.

Nick showed us how we could prove that was the case using xxd:

xxd <<< $AUTH_HEADER

When we ran the script again we could see the carriage return character (0d) at the end of the line:

$ ./blah.txt 2>/dev/null
Set-Cookie: NID=63=QDECY69302tLN0CSMyug-TzzczxzGNWs70i8huV60qM7BFv18F63dNSz4trqzHXvzbKXNLb
gBcLKKCTuOSTCjS6w_6UNJVrkZ6G_lLxSSyCeHaK4iJGW8XWu86i7CsOB; 
expires=Sun, 17-Mar-2013 08:55:37 GMT; path=/; domain=.google.co.uk; HttpOnly
...
0000160: 2f3b 2064 6f6d 6169 6e3d 2e67 6f6f 676c  /; domain=.googl
0000170: 652e 636f 2e75 6b3b 2048 7474 704f 6e6c  e.co.uk; HttpOnl
0000180: 790d 0d0a                                y..

Nick then showed us how to get rid of it using tr like so:

AUTH_HEADER=`curl -I http://www.google.co.uk | grep Set-Cookie | tr -d '\r'`
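
The effect is easy to reproduce without hitting a real server. In this sketch printf stands in for the curl call and emits two made-up header lines with the CRLF line endings that HTTP uses; comparing the lengths before and after tr shows the one extra character that was causing the trouble.

```shell
# printf stands in for `curl -I ...`; the header values are made up.
raw=$(printf 'Set-Cookie: NID=63\r\nContent-Type: text/html\r\n' | grep Set-Cookie)

# Strip the carriage return that grep happily passed through.
clean=$(printf '%s' "$raw" | tr -d '\r')

echo "raw length:   ${#raw}"
echo "clean length: ${#clean}"
```

When echoed to a terminal the carriage return moves the cursor back to the start of the line, which is why the text after $AUTH_HEADER appeared to overwrite it.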

Written by Mark Needham

September 15th, 2012 at 9:06 am

Posted in Shell Scripting

Bash: Piping data into a command using heredocs

with 2 comments

I’ve been playing around with some data modelled in neo4j recently and one thing I wanted to do was run an ad hoc query in the neo4j-shell, grab the results and do some text manipulation on them.

For example I wrote a query which outputted the following to the screen and I wanted to sum together all the values in the 3rd column:

| ["1","2","3"]         | "3"                             | 1234567    |   
| ["4","5","6"]         | "6"                             | 8910112    |

Initially I was pasting the output into a text file and then running the following sequence of commands to work it out:

$ cat blah2.txt| cut -d"|" -f 4  | awk '{s+=$0} END {print s}'  
10144679

One way to avoid having to create blah2.txt would be to echo the output into standard out like so:

$ echo "| ["1","2","3"]         | "3"                             | 1234567    |   
| ["4","5","6"]         | "6"                             | 8910112    | " | cut -d"|" -f 4  | awk '{s+=$0} END {print s}'   
10144679

But it gets a bit confusing as the number of lines of results increases, and having to keep copy/pasting the cut and awk parts of the chain around is annoying.

One of the things I read on the bus this week was a blog post going through a bunch of bash one liners and half way through it covers piping data into commands using heredocs which I’d completely forgotten about!

A simple example could be to send a simple message to cat which will output the message to standard out:

$ cat <<EOL
heredoc> hello i am mark
heredoc> EOL
hello i am mark

That works if we want to pipe data into a single command but I didn’t know how we’d be able to pipe the output of that command to another command.

In fact it’s actually reasonably simple:

$ cat <<EOL | cut -d"|" -f 4  | awk '{s+=$0} END {print s}' 
pipe pipe heredoc> | ["1","2","3"]         | "3"                             | 1234567    |   
pipe pipe heredoc> | ["4","5","6"]         | "6"                             | 8910112    | 
pipe pipe heredoc> EOL
10144679
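
The cat isn't strictly needed either, since the heredoc can be attached straight to the first command in the pipeline. A minimal sketch using the same two rows; quoting the delimiter as 'EOL' is my addition and stops the shell from expanding anything inside the pasted data.

```shell
# The heredoc becomes cut's standard input; the pipe to awk sits on the same
# line as the <<'EOL' marker, and the data follows on the next lines.
total=$(cut -d"|" -f 4 <<'EOL' | awk '{ s += $0 } END { print s }'
| ["1","2","3"]         | "3"                             | 1234567    |
| ["4","5","6"]         | "6"                             | 8910112    |
EOL
)
echo "$total"
```

The result is the same sum as before, just with one less process in the chain.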

And now I have no need to create random text files all over my machine!

Written by Mark Needham

September 15th, 2012 at 7:54 am

Posted in Shell Scripting
