Mark Needham

Thoughts on Software Development

Archive for the ‘Shell Scripting’ Category

cURL: POST/Upload multipart form

without comments

I've been doing some work which involved uploading a couple of files from an HTML form, and I wanted to check that the server-side code was working by executing a cURL command rather than using the browser.

The form looks like this:

<form action="http://foobar.com" method="POST" enctype="multipart/form-data">
    <p>
        <label for="nodes">File 1:</label>
        <input type="file" name="file1" id="file1">
    </p>
 
    <p>
        <label for="relationships">File 2:</label>
        <input type="file" name="file2" id="file2">
    </p>
 
    <input type="submit" name="submit" value="Submit">
</form>

If we convert the POST request from the browser into a cURL equivalent we end up with the following:

curl 'http://foobar.com' -H 'Origin: null' -H 'Accept-Encoding: gzip,deflate,sdch' -H 'Host: foobar.com:7474' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36' -H 'Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryMxYFIg6GFEIPAe6V' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Cache-Control: max-age=0' -H 'Cookie: splashShown1.6=1; undefined=0; _mkto_trk=id:773-GON-065&token:_mch-localhost-1373821432078-37666; JSESSIONID=123cbkxby1rtcj3dwipqzs7yu' -H 'Connection: keep-alive' --data-binary $'------WebKitFormBoundaryMxYFIg6GFEIPAe6V\r\nContent-Disposition: form-data; name="file1"; filename="file1.csv"\r\nContent-Type: text/csv\r\n\r\n\r\n------WebKitFormBoundaryMxYFIg6GFEIPAe6V\r\nContent-Disposition: form-data; name="file2"; filename="file2.csv"\r\nContent-Type: text/csv\r\n\r\n\r\n------WebKitFormBoundaryMxYFIg6GFEIPAe6V\r\nContent-Disposition: form-data; name="submit"\r\n\r\nSubmit\r\n------WebKitFormBoundaryMxYFIg6GFEIPAe6V--\r\n' --compressed

I tried executing this command but I couldn’t quite get it to work, and in any case it seemed extremely complicated. Thankfully, I came across a cURL tutorial which described a much simpler alternative which does work:

curl --form file1=@file1.csv --form file2=@file2.csv --form submit=Submit http://foobar.com

I knew it couldn’t be that complicated!
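
As an aside, if you want to double check exactly what that simpler command sends over the wire, curl's '--trace-ascii' flag will dump the whole request including the multipart body (passing '-' sends the trace to standard output):

curl --trace-ascii - --form file1=@file1.csv --form file2=@file2.csv --form submit=Submit http://foobar.com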

Written by Mark Needham

September 23rd, 2013 at 10:16 pm

Posted in Shell Scripting


Unix: tar – Extracting, creating and viewing archives

with one comment

I’ve been playing around with the Unix tar command a bit this week and realised that I’d memorised some of the flag combinations but didn’t actually know what each of them meant.

For example, one of the most common things that I want to do is extract a gzipped neo4j archive:

$ wget http://dist.neo4j.org/neo4j-community-1.9.2-unix.tar.gz
$ tar -xvf neo4j-community-1.9.2-unix.tar.gz

where:

  • -x means extract
  • -v means produce verbose output i.e. print out the names of all the files as you unpack it
  • -f means unpack the file which follows this flag i.e. neo4j-community-1.9.2-unix.tar.gz in this example

I didn't realise that tar can read the archive from standard input if we pass '-' as the file name, which means we can actually achieve the above in one go with the following combination:

$ wget http://dist.neo4j.org/neo4j-community-1.9.2-unix.tar.gz -O - | tar -xvf -

The other thing I wanted to do was create a gzipped archive from the contents of a folder, something which I do much less frequently and am therefore much more rusty at! The following does the trick:

$ tar -cvzpf neo4j-football.tar.gz neo4j-football/
$ ls -alh neo4j-football.tar.gz 
-rw-r--r--  1 markhneedham  staff   526M 22 Aug 23:38 neo4j-football.tar.gz

where:

  • -c means create a new archive
  • -z means gzip that archive
  • -p means preserve file permissions

Sometimes we’ll want to exclude some things from our archive which is where the '--exclude' flag comes in handy.

For example, I want to exclude the data, .git and neo4j-community folders which sit inside 'neo4j-football', which I can do with the following:

$ tar --exclude "data*" --exclude "neo4j-community*" --exclude ".git" -cvzpf neo4j-football.tar.gz neo4j-football/
$ ls -alh neo4j-football.tar.gz 
-rw-r--r--  1 markhneedham  staff   138M 22 Aug 23:36 neo4j-football.tar.gz

If we want to quickly check that our file has been created correctly we can run the following:

$ tar -tvf neo4j-football.tar.gz

where:

  • -t means list the contents of the archive to standard out
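
One more flag that's occasionally handy is '-C', which makes tar change into another directory before extracting; the target path here is just for illustration:

$ tar -xvzf neo4j-football.tar.gz -C /tmp/football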

And that pretty much covers my main use cases for the moment!

Written by Mark Needham

August 22nd, 2013 at 10:56 pm

Posted in Shell Scripting


Unix/awk: Extracting substring using a regular expression with capture groups

with 3 comments

A couple of years ago I wrote a blog post explaining how I’d used GNU awk to extract story numbers from git commit messages and I wanted to do a similar thing today to extract some node ids from a file.

My eventual solution looked like this:

$ echo "mark #1000" | gawk '{ match($0, /#([0-9]+)/, arr); if(arr[1] != "") print arr[1] }'
1000

But in the comments an alternative approach was suggested which used the Mac version of awk and the RSTART and RLENGTH global variables which get set when a match is found:

$ echo "mark #1000" | awk 'match($0, /#[0-9]+/) { print substr( $0, RSTART, RLENGTH )}'
#1000

Unfortunately Mac awk doesn't seem to support capture groups, so as you can see it includes the # character which we don't actually want.
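
If installing gawk isn't an option, one workaround (not something from the original comments) is to offset into the match using RSTART and RLENGTH so that the leading # gets skipped:

$ echo "mark #1000" | awk 'match($0, /#[0-9]+/) { print substr( $0, RSTART+1, RLENGTH-1 )}'
1000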

In this instance it wasn’t such a big deal but it was more annoying for the node id extraction that I was trying to do:

$ head -n 5 log.txt
Command[27716, Node[7825340,used=true,rel=14547348,prop=31734662]]
Command[27716, Node[7825341,used=true,rel=14547349,prop=31734665]]
Command[27716, Node[7825342,used=true,rel=14547350,prop=31734668]]
Command[27716, Node[7825343,used=true,rel=14547351,prop=31734671]]
$ head -n 5 log.txt | awk 'match($0, /Node\[([^,]+)/) { print substr( $0, RSTART, RLENGTH )}'
Node[7825340
Node[7825341
Node[7825342
Node[7825343
Node[7825336

I ended up having to brew install gawk and use a variation of the gawk command I mentioned at the beginning of this post:

$ head -n 5 log.txt | gawk 'match($0, /Node\[([^,]+)/, arr) { print arr[1]}'
7825340
7825341
7825342
7825343
7825336
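
An alternative that avoids gawk altogether would have been to let sed do the capturing, since sed -E supports capture groups in the replacement and gives the same list of ids:

$ head -n 5 log.txt | sed -E 's/.*Node\[([^,]+),.*/\1/'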

Written by Mark Needham

June 26th, 2013 at 3:23 pm

Posted in Shell Scripting


Unix: find, xargs, zipinfo and the ‘caution: filename not matched:’ error

with 4 comments

As I mentioned in my previous post, last week I needed to scan all the jar files included with the neo4j-enterprise gem, and I started out by finding out where it's located on my machine:

$ bundle show neo4j-enterprise
/Users/markhneedham/.rbenv/versions/jruby-1.7.1/lib/ruby/gems/shared/gems/neo4j-enterprise-1.8.2-java

I then thought I could get a list of all the jar files using find, pipe it into zipinfo via xargs to get all the file names, and then search for HighlyAvailableGraphDatabaseFactory.

Unfortunately when I tried that it didn’t quite work:

$ cd /Users/markhneedham/.rbenv/versions/jruby-1.7.1/lib/ruby/gems/shared/gems/neo4j-enterprise-1.8.2-java/lib/neo4j-enterprise/jars/
$ find . -iname "*.jar" | xargs zipinfo
caution: filename not matched:  ./lib/neo4j-enterprise/jars/logback-classic-0.9.30.jar
caution: filename not matched:  ./lib/neo4j-enterprise/jars/logback-core-0.9.30.jar
caution: filename not matched:  ./lib/neo4j-enterprise/jars/neo4j-backup-1.8.2.jar
caution: filename not matched:  ./lib/neo4j-enterprise/jars/neo4j-com-1.8.2.jar
caution: filename not matched:  ./lib/neo4j-enterprise/jars/neo4j-consistency-check-1.8.2.jar
caution: filename not matched:  ./lib/neo4j-enterprise/jars/neo4j-ha-1.8.2.jar
caution: filename not matched:  ./lib/neo4j-enterprise/jars/neo4j-udc-1.8.2.jar
caution: filename not matched:  ./lib/neo4j-enterprise/jars/org.apache.servicemix.bundles.netty-3.2.5.Final_1.jar
caution: filename not matched:  ./lib/neo4j-enterprise/jars/server-api-1.8.2.jar
caution: filename not matched:  ./lib/neo4j-enterprise/jars/slf4j-api-1.6.2.jar
caution: filename not matched:  ./lib/neo4j-enterprise/jars/zookeeper-3.3.2.jar

I switched ‘zipinfo’ to ‘echo’ to see what was going on which resulted in the following output:

$ find . -iname "*.jar" | xargs echo
./log4j-1.2.16.jar ./logback-classic-0.9.30.jar ./logback-core-0.9.30.jar ./neo4j-backup-1.8.2.jar ./neo4j-com-1.8.2.jar ./neo4j-consistency-check-1.8.2.jar ./neo4j-ha-1.8.2.jar ./neo4j-udc-1.8.2.jar ./org.apache.servicemix.bundles.netty-3.2.5.Final_1.jar ./server-api-1.8.2.jar ./slf4j-api-1.6.2.jar ./zookeeper-3.3.2.jar

As I understand it, xargs packs as many of the arguments as it can into a single invocation of the command, so zipinfo was being called once with every jar at the same time. zipinfo then treats its first argument as the archive to inspect and any remaining arguments as patterns of files to look for inside that archive, which is why it complained that the other filenames couldn't be matched.

I’ve previously used the ‘-n’ flag to xargs to explicitly tell it to call the corresponding command with one argument at a time and that seemed to do the trick:

$ find . -iname "*.jar" | xargs -n1 zipinfo
Archive:  ./log4j-1.2.16.jar   481535 bytes   346 files
-rw----     2.0 fat     3186 bX defN 30-Mar-10 23:25 META-INF/MANIFEST.MF
-rw----     2.0 fat        0 bl defN 30-Mar-10 23:25 META-INF/
-rw----     2.0 fat    11366 bl defN 30-Mar-10 23:14 META-INF/LICENSE
-rw----     2.0 fat      160 bl defN 30-Mar-10 23:14 META-INF/NOTICE
-rw----     2.0 fat        0 bl defN 30-Mar-10 23:25 META-INF/maven/
-rw----     2.0 fat        0 bl defN 30-Mar-10 23:25 META-INF/maven/log4j/
...
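
With that working, one way to finish off the original search for HighlyAvailableGraphDatabaseFactory is something along these lines, using 'zipinfo -1' to print just the file names inside each archive:

$ find . -iname "*.jar" | xargs -I {} sh -c 'zipinfo -1 "{}" | grep -q HighlyAvailableGraphDatabaseFactory && echo {}'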

Of course, to solve this particular problem we don't actually need to use find and xargs since we can just call zipinfo with a wildcard match:

$ zipinfo \*.jar
Archive:  log4j-1.2.16.jar   481535 bytes   346 files
-rw----     2.0 fat     3186 bX defN 30-Mar-10 23:25 META-INF/MANIFEST.MF
-rw----     2.0 fat        0 bl defN 30-Mar-10 23:25 META-INF/
-rw----     2.0 fat    11366 bl defN 30-Mar-10 23:14 META-INF/LICENSE
-rw----     2.0 fat      160 bl defN 30-Mar-10 23:14 META-INF/NOTICE
-rw----     2.0 fat        0 bl defN 30-Mar-10 23:25 META-INF/maven/
-rw----     2.0 fat        0 bl defN 30-Mar-10 23:25 META-INF/maven/log4j/
...

So it wasn't really xargs behaving unexpectedly at all – I'd just assumed it would call the command once per argument by default, when actually it was zipinfo's handling of the extra arguments that was catching me out.

Written by Mark Needham

June 9th, 2013 at 11:10 pm

Posted in Shell Scripting


Unix: Working with parts of large files

without comments

Chris and I were looking at the neo4j log files of a client earlier in the week and wanted to do some processing of the file so we could ask the client to send us some further information.

The log file was over 10,000 lines long but the bit of the file we were interested in was only a few hundred lines.

I usually use Vim and the ':set number' command when I want to refer to line numbers in a file, but Chris showed me that we can achieve the same thing with e.g. 'less -N data/log/neo4j.0.0.log'.

We can then operate on a subset of the lines, say lines 10 to 15, by passing the '-n' flag to sed along with a print command:

-n By default, each line of input is echoed to the standard output after all of the commands have been applied to it. The -n option suppresses this behavior.

$ sed -n '10,15p' data/log/neo4j.0.0.log
INFO: Enabling HTTPS on port [7473]
May 19, 2013 11:11:52 AM org.neo4j.server.logging.Logger log
INFO: No SSL certificate found, generating a self-signed certificate..
May 19, 2013 11:11:53 AM org.neo4j.server.logging.Logger log
INFO: Mounted discovery module at [/]
May 19, 2013 11:11:53 AM org.neo4j.server.logging.Logger log
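
If sed isn't to hand, the same slice of the file can be pulled out with awk's NR variable or a head/tail combination, which should be equivalent:

$ awk 'NR >= 10 && NR <= 15' data/log/neo4j.0.0.log
$ head -n 15 data/log/neo4j.0.0.log | tail -n 6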

We then used a combination of grep, awk and sort to work out which log files we needed.

Written by Mark Needham

May 19th, 2013 at 9:44 pm

Posted in Shell Scripting


Unix: Checking for open sockets on nginx

without comments

Tim and I were investigating a weird problem we were having with nginx where it was getting in a state where it had exceeded the number of open files allowed on the system and started rejecting requests.

We can find out the maximum number of open files that we’re allowed on a system with the following command:

$ ulimit -n
1024

Our hypothesis was that some socket connections were never being closed and therefore the number of open files was climbing slowly upwards until it exceeded the limit.

We wanted to check how many sockets nginx had open so to start with we needed to know the process IDs it was running under:

$ ps aux | grep nginx | grep -v grep
root      1089  0.0  0.7 105152  2736 ?        Ss   17:34   0:00 nginx: master process /usr/sbin/nginx
www-data 17474  0.0  0.6 105300  2296 ?        S    21:49   0:04 nginx: worker process
www-data 17475  0.0  0.7 105300  2856 ?        S    21:49   0:04 nginx: worker process
www-data 17476  0.0  0.7 105300  2792 ?        S    21:49   0:03 nginx: worker process
www-data 17477  0.0  0.7 105300  2668 ?        S    21:49   0:04 nginx: worker process

So the process IDs we’re interested in are 1089, 17474, 17475, 17476 and 17477.

We can check which file descriptors they have open with the following command:

$ sudo ls -alh /proc/{1089,17{474,475,476,477}}/fd
/proc/17476/fd:
total 0
dr-x------ 2 www-data www-data  0 Apr 23 23:40 .
...
l-wx------ 1 www-data www-data 64 Apr 23 23:40 6 -> /var/log/nginx/error.log
l-wx------ 1 www-data www-data 64 Apr 23 23:40 7 -> /var/www/thinkingingraphs/shared/log/nginx_access.log
l-wx------ 1 www-data www-data 64 Apr 23 23:40 8 -> /var/www/thinkingingraphs/shared/log/nginx_error.log
lrwx------ 1 www-data www-data 64 Apr 23 23:40 9 -> socket:[8910]
 
/proc/17477/fd:
total 0
...
lrwx------ 1 www-data www-data 64 Apr 23 23:40 56 -> socket:[52213]
lrwx------ 1 www-data www-data 64 Apr 23 23:40 57 -> anon_inode:[eventpoll]
l-wx------ 1 www-data www-data 64 Apr 23 23:40 6 -> /var/log/nginx/error.log
l-wx------ 1 www-data www-data 64 Apr 23 23:40 7 -> /var/www/thinkingingraphs/shared/log/nginx_access.log
l-wx------ 1 www-data www-data 64 Apr 23 23:40 8 -> /var/www/thinkingingraphs/shared/log/nginx_error.log
lrwx------ 1 www-data www-data 64 Apr 23 23:40 9 -> socket:[8910]

We can narrow that down to just show us how many sockets are open:

$ sudo ls -alh /proc/{1089,17{474,475,476,477}}/fd | grep socket  | wc -l
189

We could also use lsof although for some reason that returns a slightly different number:

$ sudo lsof -p 1089,17474,17475,17476,17477 | grep socket | wc -l
184

If we want to use brace expansion to do that it becomes a bit more tricky:

$ sudo lsof -p `echo {1089,174{74,75,76,77}} | sed 's/ /,/g'` | grep socket | wc -l
184
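
A variation that avoids looking the PIDs up by hand is to let pgrep feed them to lsof, assuming pgrep and paste are available:

$ sudo lsof -p $(pgrep nginx | paste -s -d ',' -) | grep socket | wc -l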

Annoyingly we couldn’t actually replicate the error but think that it’s been solved in nginx 1.2.0 (we were using 1.1.19) by this change:

Bugfix: a segmentation fault might occur in a worker process if the
       "try_files" directive was used; the bug had appeared in 1.1.19.

Written by Mark Needham

April 23rd, 2013 at 11:59 pm

Posted in Shell Scripting


awk: Parsing ‘free -m’ output to get memory usage/consumption

with 3 comments

Although I know this problem is already solved by collectd and New Relic I wanted to write a little shell script that showed me the memory usage on a bunch of VMs by parsing the output of free.

The output I was playing with looks like this:

$ free -m
             total       used       free     shared    buffers     cached
Mem:           365        360          5          0         59         97
-/+ buffers/cache:        203        161
Swap:          767         13        754

I wanted to find out what percentage of the memory on the machine was being used, and as I understand it the numbers we need for that calculation are the 'total' value on the 'Mem' line and the 'used' value on the '-/+ buffers/cache' line.

I initially thought that the ‘used’ value I was interested in should be the one on the ‘Mem’ line but this number includes memory that Linux has borrowed for disk caching so it isn’t the true number.

There’s another quite interesting article showing some experiments you can do to prove this.

So what I wanted to do was get the result of the calculation '203/365', which I wasn't sure how to do until I realised you can match multiple regular expressions with awk like so:

$ free -m | awk '/Mem:/ { print $2 } /buffers\/cache/ { print $3 }'                                                        
365
203

We’ve now filtered the output down to just our two numbers but another neat thing you can do with awk is change what it uses as its field and record separator.

In this case we want to change the field separator to be the newline character, and we'll set the record separator to be empty because otherwise it defaults to the newline character, which would clash with our field separator.

Those two values are set by using the ‘RS’ and ‘FS’ variables:

$ free -m | 
  awk '/Mem:/ { print $2 } /buffers\/cache/ { print $3 }' | 
  awk 'BEGIN { RS = "" ; FS = "\n" } { print $2 / $1 }'
0.556164

This is still suboptimal because we're using two awk commands rather than one! We can get around that by storing the two memory values in variables and printing them out in an END block:

$ free -m | 
  awk '/Mem:/ { total=$2 } /buffers\/cache/ { used=$3 } END { print used/total}'
0.556164
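
And if we want a friendlier percentage we can let awk's printf do the formatting, multiplying the same numbers up and rounding:

$ free -m | 
  awk '/Mem:/ { total=$2 } /buffers\/cache/ { used=$3 } END { printf "%.1f%% used\n", (used/total)*100 }'
55.6% used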

Written by Mark Needham

April 10th, 2013 at 7:03 am

Posted in Shell Scripting


Sed: Replacing characters with a new line

with 5 comments

I’ve been playing around with writing some algorithms in both Ruby and Haskell and the latter wasn’t giving the correct result so I wanted to output an intermediate state of the two programs and compare them.

I didn’t do any fancy formatting of the output from either program so I had the raw data structures in text files which I needed to transform so that they were comparable.

The main thing I wanted to do was get each of the elements of the collection onto their own line. The output of one of the programs looked like this:

[(1,2), (3,4)…]

To get each of the elements onto a new line my first step was to replace every occurrence of ‘, (‘ with ‘\n(‘. I initially tried using sed to do that:

sed -E -e 's/, \(/\\n(/g' ruby_union.txt

All that did was insert the string value ‘\n’ rather than the new line character.

I’ve come across similar problems before and I usually just use tr but in this case it doesn’t work very well because we’re replacing more than just a single character.

I came across this thread on Linux Questions which gives a couple of ways that we can get sed to do what we want.

The first suggestion is that we should type a backslash followed by the enter key at the point in the sed expression where we want the new line to be, and then continue writing the rest of the expression.

We therefore end up with the following:

sed -E -e "s/,\(/\
/g" ruby_union.txt

This approach works but it’s a bit annoying as you need to delete the rest of the expression so that the enter works correctly.

An alternative is to make use of echo with the '-e' flag, which allows us to output a new line. Usually backslash escape sequences aren't interpreted by echo, so you end up with a literal representation, e.g.

$ echo "mark\r\nneedham"
mark\r\nneedham
 
$ echo -e "mark\r\nneedham"
mark
needham

We therefore end up with this:

sed -E -e "s/, \(/\\`echo -e '\n\r'`/g" ruby_union.txt

** Update **

It was pointed out in the comments that this final version of the sed statement doesn't actually lead to very nice output, which is because I left out the other commands I passed to it to get rid of the extra brackets.

The following gives a cleaner output:

$ echo "[(1,2), (3,4), (5,6)]" | sed -E -e "s/, \(/\\`echo -e '\n\r'`/g" -e 's/\[|]|\)|\(//g'
1,2
3,4
5,6
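
One more aside: GNU sed does interpret \n in the replacement, so if it's installed (brew install gnu-sed typically makes it available as gsed) the echo trick isn't needed at all:

$ echo "[(1,2), (3,4), (5,6)]" | gsed -E -e 's/, \(/\n/g' -e 's/\[|]|\)|\(//g'
1,2
3,4
5,6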

Written by Mark Needham

December 29th, 2012 at 5:49 pm

Posted in Shell Scripting


Unix: Counting the number of commas on a line

with 2 comments

A few weeks ago I was playing around with some data stored in a CSV file and wanted to do a simple check on the quality of the data by making sure that each line had the same number of fields.

One way this can be done is with awk:

awk -F "," ' { print NF-1 } ' file.csv

Here we're specifying the field separator -F as ',' and then using the NF (number of fields) variable to print how many commas there are on the line.

Another slightly more complicated way is to combine tr and awk like so:

tr -d -c ",\n" < file.csv | awk ' { print length } '

Here we’re telling tr to delete any characters except for a comma or new line.

If we pass just a comma to the ‘-d’ option like so…

tr -d "," < file.csv

…that would delete all the commas from a line but we can use the ‘-c’ option to complement the comma i.e. delete everything except for the comma.

tr -d -c "," < file.csv

Unfortunately that puts all the commas onto the same line so we need to complement the new line character as well:

tr -d -c ",\n" < file.csv

We can then use the length variable of awk to print out the number of commas on each line.

We can achieve the same thing by making use of sed instead of tr like so:

sed 's/[^,]//g' file.csv | awk ' { print length } '

Since sed operates on a line by line basis we just need to tell it to substitute anything which isn’t a comma with nothing and then pipe the output of that into awk and use the length variable again.
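
Coming back to the original data quality check, piping any of these through sort and uniq gives a quick summary of how many lines have each field count; a clean file should produce a single row:

$ awk -F "," ' { print NF } ' file.csv | sort | uniq -c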

I thought it might be possible to solve this problem using cut as well but I can’t see any way to get it to output the total number of fields.

If anyone knows any other cool ways to do the same thing let me know in the comments – it’s always interesting to see how different people wield the unix tools!

Written by Mark Needham

November 10th, 2012 at 4:30 pm

Posted in Shell Scripting


Upstart: Job getting stuck in the start/killed state

with 3 comments

We're using upstart to handle the processes running on our machines, and since the haproxy package only came packaged with an init.d script we wanted to make it upstartified.

When defining an upstart script you need to specify an expect stanza in which you specify whether or not the process which you’re launching is going to fork.

If you do not specify the expect stanza, Upstart will track the life cycle of the first PID that it executes in the exec or script stanzas.

However, most Unix services will “daemonize”, meaning that they will create a new process (using fork(2)) which is a child of the initial process.

Often services will “double fork” to ensure they have no association whatsoever with the initial process.

There is a table on the upstart cookbook under the ‘Implications of Misspecifying expect‘ section which explains what will happen if we specify this incorrectly:

Expect Stanza Behaviour (columns are the specification of the expect stanza)

Forks   no expect            expect fork          expect daemon
0       Correct              start hangs          start hangs
1       Wrong pid tracked †  Correct              start hangs
2       Wrong pid tracked    Wrong pid tracked    Correct

When we were defining our script we went for expect daemon instead of expect fork, and we had also mistyped the arguments to the haproxy script, which meant it failed to start and ended up in the start/killed state.
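
For reference, a minimal haproxy job of the kind we were aiming for might look roughly like this; the paths and events are illustrative rather than our exact config:

# /etc/init/haproxy.conf
description "haproxy"

start on runlevel [2345]
stop on runlevel [!2345]

# our understanding was that haproxy -D forks once into the background,
# which matches expect fork rather than expect daemon
expect fork
exec /usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg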

From what we could tell, upstart was holding onto a PID which didn't actually exist any more, and when we tried a 'stop haproxy' the command seemed to succeed but didn't actually do anything.

Phil pointed us to a neat script written by Clint Byrum which spins up and then kills loads of processes in order to cycle through the PID space until a process is created with the PID that upstart is tracking, so that it can be re-attached and killed.

It’s available on his website but that wasn’t responding for a period of time yesterday so I’ll repeat it here just in case:

#!/usr/bin/env ruby1.8
 
class Workaround
  def initialize target_pid
    @target_pid = target_pid
 
    first_child
  end
 
  def first_child
    pid = fork do
      Process.setsid
 
      rio, wio = IO.pipe
 
      # Keep rio open
      until second_child rio, wio
        print "\e[A"
      end
    end
 
    Process.wait pid
  end
 
  def second_child parent_rio, parent_wio
    rio, wio = IO.pipe
 
    pid = fork do
      rio.close
      parent_wio.close
 
      puts "%20.20s" % Process.pid
 
      if Process.pid == @target_pid
        wio << 'a'
        wio.close
 
        parent_rio.read
      end
    end
    wio.close
 
    begin
      if rio.read == 'a'
        true
      else
        Process.wait pid
        false
      end
    ensure
      rio.close
    end
  end
end
 
if $0 == __FILE__
  pid = ARGV.shift
  raise "USAGE: #{$0} pid" if pid.nil?
  Workaround.new Integer pid
end

We can save that to a file, run it with the stuck PID as an argument, and the world of upstart will get back into a good place again!
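
Usage is along these lines; the script file name here is made up:

$ ruby1.8 upstart_workaround.rb 28372   # 28372 being whichever PID 'status haproxy' says it is stuck on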

Written by Mark Needham

September 29th, 2012 at 9:56 am

Posted in Shell Scripting
