Mark Needham

Thoughts on Software Development

Archive for the ‘sed’ tag

Sed: Using environment variables

without comments

I’ve been playing around with the BBC football data set that I wrote about a couple of months ago and I wanted to write some code that would take the import script and replace all instances of remote URIs with a file system path.

For example the import file contains several lines similar to this:

LOAD CSV WITH HEADERS 
FROM "https://raw.githubusercontent.com/mneedham/neo4j-bbc/master/data/matches.csv" 
AS row

And I want that to read:

LOAD CSV WITH HEADERS 
FROM "file:///Users/markneedham/repos/neo4j-bbc/data/matches.csv" 
AS row

The start of that path also happens to be my working directory:

$ echo $PWD
/Users/markneedham/repos/neo4j-bbc

So I wanted to write a script that would look for occurrences of ‘https://raw.githubusercontent.com/mneedham/neo4j-bbc/master’ and replace it with $PWD. I’m a fan of Sed so I thought I’d try and use it to solve my problem.

The first thing we can do to make life easy is to change the default delimiter. Sed usually uses ‘/’ to separate parts of the command but since we’re using URIs that’s going to be horrible so we’ll use an underscore instead.
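As a small illustration of the delimiter change, here is a sketch using a made-up URL (example.com stands in for the real host, and both function names are hypothetical). The same substitution is written once with '/' and once with '_':

```shell
# Hypothetical input line; example.com stands in for the real host.
line='FROM "https://example.com/repo/data/matches.csv"'

# With '/' as the delimiter every slash in the URI has to be escaped.
strip_with_slash() {
  sed 's/https:\/\/example.com\/repo//'
}

# With '_' as the delimiter the URI can be written as-is.
strip_with_underscore() {
  sed 's_https://example.com/repo__'
}

printf '%s\n' "$line" | strip_with_slash
printf '%s\n' "$line" | strip_with_underscore
```

Both commands print the same line. The one thing to watch is that the chosen delimiter doesn't appear unescaped in the pattern or the replacement.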

For a first cut I tried just removing that first part of the URI but not replacing it with anything in particular:

$ sed 's_https://raw.githubusercontent.com/mneedham/neo4j-bbc/master__' import.cql
 
$ sed 's_https://raw.githubusercontent.com/mneedham/neo4j-bbc/master__' import.cql | grep LOAD
LOAD CSV WITH HEADERS FROM "/data/matches.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/fouls.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/subs.csv" AS row

Cool! That worked as expected. Now let’s try and replace it with $PWD:

$ sed 's_https://raw.githubusercontent.com/mneedham/neo4j-bbc/master_file://$PWD_' import.cql | grep LOAD
LOAD CSV WITH HEADERS FROM "file://$PWD/data/matches.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/fouls.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/subs.csv" AS row

Hmmm that didn’t work as expected. The $PWD is being treated as a literal instead of being evaluated like we want it to be.
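The behaviour comes from the shell rather than from sed: inside single quotes the shell leaves $ alone, so sed receives the text $PWD unchanged. A quick sketch with echo shows the difference, using a made-up variable called dir in place of $PWD so the output is predictable:

```shell
dir=/tmp/demo   # hypothetical stand-in for $PWD

# Single quotes: no expansion, the variable name is passed through literally.
echo 'file://$dir'

# Double quotes: the shell expands the variable before the command runs.
echo "file://$dir"

# Mixed: close the single quotes, expand inside double quotes, reopen.
echo 'file://'"$dir"'/data'

# Braces mark where the name ends, so a following '_' is not swallowed.
echo "file://${dir}_"
```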

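It turns out this is a popular question on Stack Overflow and there are lots of suggestions. I tried a few of them and found that closing the single quotes just before $PWD and reopening them straight after did the trick, because the variable then sits outside the quotes and the shell expands it: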

$ sed 's_https://raw.githubusercontent.com/mneedham/neo4j-bbc/master_file://'$PWD'_' import.cql | grep LOAD
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/matches.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/fouls.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/subs.csv" AS row

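We could also use double quotes everywhere if we prefer. The quotes are still closed around $PWD here, which stops the shell from reading the trailing underscore as part of the variable name ($PWD_):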

$ sed "s_https://raw.githubusercontent.com/mneedham/neo4j-bbc/master_file://"$PWD"_" import.cql | grep LOAD
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/matches.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/fouls.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/subs.csv" AS row
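Putting the pieces together, the substitution could be wrapped in a small function. This is only a sketch: the name localise_uris is made up, and it assumes the target directory contains no '_', '&' or backslash characters, any of which would need escaping before being handed to sed.

```shell
# The remote prefix from the examples above.
remote='https://raw.githubusercontent.com/mneedham/neo4j-bbc/master'

# Read an import script on stdin and point its URIs at a local directory.
# ${dir} uses braces so the closing '_' is not read as part of the name.
localise_uris() {
  dir=$1
  sed "s_${remote}_file://${dir}_"
}

# A fixed path keeps the demonstration predictable; for the real file the
# call would be: localise_uris "$PWD" < import.cql
printf '%s\n' "FROM \"${remote}/data/matches.csv\"" |
  localise_uris /Users/markneedham/repos/neo4j-bbc
```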

Written by Mark Needham

August 13th, 2015 at 7:30 pm

Posted in Shell Scripting


Sed: Replacing characters with a new line

with 5 comments

I’ve been playing around with writing some algorithms in both Ruby and Haskell and the latter wasn’t giving the correct result so I wanted to output an intermediate state of the two programs and compare them.

I didn’t do any fancy formatting of the output from either program so I had the raw data structures in text files which I needed to transform so that they were comparable.

The main thing I wanted to do was get each of the elements of the collection onto their own line. The output of one of the programs looked like this:

[(1,2), (3,4)…]

To get each of the elements onto a new line my first step was to replace every occurrence of ‘, (‘ with ‘\n(‘. I initially tried using sed to do that:

sed -E -e 's/, \(/\\n(/g' ruby_union.txt

All that did was insert the string value ‘\n’ rather than the new line character.

I’ve come across similar problems before and I usually just use tr but in this case it doesn’t work very well because we’re replacing more than just a single character.

I came across this thread on Linux Questions which gives a couple of ways that we can get sed to do what we want.

The first suggestion is that we should type a backslash followed by the enter key at the point in our sed expression where we want the new line to be and then continue writing the rest of the expression.

We therefore end up with the following:

sed -E -e 's/, \(/\
(/g' ruby_union.txt

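This approach works but it’s a bit annoying to type: if you’re editing an existing command you need to delete the rest of the expression first so that pressing enter puts the new line in the right place. The expression also needs to be in single quotes, because inside double quotes the shell treats a backslash followed by a new line as a line continuation and removes both.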

An alternative is to make use of echo with the ‘-e’ flag which allows us to output a new line. Usually backslashed characters aren’t interpreted and so you end up with a literal representation. e.g.

$ echo "mark\r\nneedham"
mark\r\nneedham
 
$ echo -e "mark\r\nneedham"
mark
needham

We therefore end up with this:

sed -E -e "s/, \(/\\`echo -e '\n\r'`/g" ruby_union.txt

** Update **

It was pointed out in the comments that this final version of the sed statement doesn’t lead to very nice output. That’s because I left out the other expressions I passed to it, which get rid of the extra brackets.

The following gives a cleaner output:

$ echo "[(1,2), (3,4), (5,6)]" | sed -E -e "s/, \(/\\`echo -e '\n\r'`/g" -e 's/\[|]|\)|\(//g'
1,2
3,4
5,6
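For comparison, here is a sketch that combines the backslash-and-enter form with the bracket clean-up, so no echo is needed. The function name is made up, and the single quotes are deliberate: they let the backslash and the new line reach sed untouched.

```shell
# Put each tuple of a list like [(1,2), (3,4)] on its own line.
split_tuples() {
  # First expression: replace ', (' with a new line followed by '('.
  # Second expression: strip the remaining square and round brackets.
  sed -E -e 's/, \(/\
(/g' -e 's/[][()]//g'
}

echo "[(1,2), (3,4), (5,6)]" | split_tuples
```

This should print the same three lines as above, without the carriage return that the echo version adds after each new line.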

Written by Mark Needham

December 29th, 2012 at 5:49 pm

Posted in Shell Scripting


Sed: Extended regular expressions

without comments

Irfan and I were looking at how to do some text substitution in a text file this afternoon and turned to sed to help us in our quest.

He had originally used grep to find what he wanted to replace on each line, using a grep regular expression to match one or more digits:

cat the_file.txt | grep "[0-9]\+"

That works pretty well, but I only knew how to do the substitution in sed, so we needed to convert the regular expression to work with sed.

We started off with just trying to print the lines which matched the regular expression:

cat the_file.txt | sed -n '/[0-9]\+/p'

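This prints nothing, because sed uses basic regular expressions by default and POSIX basic regular expressions have no ‘+’ operator for matching one or more digits (GNU sed accepts ‘\+’ as an extension, but other implementations may not).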

grep on the other hand…

Grep understands two different versions of regular expression syntax: “basic” and “extended.” In GNU grep, there is no difference in available functionality using either syntax. In other implementations, basic regular expressions are less powerful.

To get sed to allow us to use extended metacharacters we need to pass the ‘-E’ flag to sed, which also means that we no longer need to escape the ‘+’:

cat the_file.txt | sed -nE '/[0-9]+/p'
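To show the extended expression doing the substitution we were after, here is a minimal sketch. The sample lines and both function names are made up, since the contents of the_file.txt aren't shown here:

```shell
# Made-up sample data standing in for the_file.txt.
sample() {
  printf '%s\n' 'no digits here' 'room 101' 'flat 7b'
}

# Print only the lines that contain one or more digits.
sample | sed -nE '/[0-9]+/p'

# Replace each run of digits with a placeholder.
mask_numbers() {
  sed -E 's/[0-9]+/N/g'
}

sample | mask_numbers
```

The first command prints the two lines containing digits; the second prints all three lines with each number replaced by N.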

From what I understand, extended mode also lets us use the following metacharacters without escaping them:

  • ? – for matching zero or one occurrence of a regular expression
  • | – for matching either the preceding or following regular expression
  • () – grouping regular expressions
  • {n,m} – for matching between n and m occurrences of the preceding expression
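A quick way to try each of these is to run them against a small word list. The list and the function name below are made up for the demonstration:

```shell
# Made-up word list for the demonstration.
words() {
  printf '%s\n' color colour cat dog abab 12345
}

words | sed -nE '/^colou?r$/p'      # ?     : the 'u' is optional
words | sed -nE '/^(cat|dog)$/p'    # | ()  : either alternative inside the group
words | sed -nE '/^(ab){2}$/p'      # {n}   : the group repeated exactly twice
words | sed -nE '/^[0-9]{2,5}$/p'   # {n,m} : between two and five digits
```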

I’m told that you can use grep to do substitution as well but I haven’t figured out how exactly you do that yet.

Written by Mark Needham

February 11th, 2011 at 8:34 pm

Capistrano, sed, escaping forward slashes and ‘p’ is not ‘puts’!

with 2 comments

Priyank and I have been working on automating part of our deployment process and one task we needed to do as part of this is replace some variables used in one of our shell scripts.

All the variables in the script refer to production specific locations but we needed to change a couple of them in order to run the script in our QA environment.

We’ve therefore written a sed command, which we call from Capistrano, to allow us to do this.

The Capistrano script looks a little like this:

task :replace_in_shell do
	directory = "/my/directory/path"
	sed_command = "sed 's/^some_key.*$/#{directory}/' shell_script.sh > shell_script_with_qa_variables.sh"
	run sed_command
end

Unfortunately this creates the following sed command which isn’t actually valid syntactically:

sed 's/^some_key.*$//my/directory/path/' shell_script.sh > shell_script_with_qa_variables.sh

We decided to use ‘gsub’ to escape all the forward slashes in the directory path and to work out which parameters we needed to pass to ‘gsub’ we started using irb.

Executing gsub with the appropriate parameters leads us to believe that 2 backslashes will be added:

ruby-1.8.7-p299 > "/my/directory/path".gsub("/", "\\/")
 => "\\/my\\/directory\\/path"

This is because IRB implicitly calls ‘inspect’ on the result, which shows a different string from the one we would actually get.

While writing this blog post I’ve also learnt (thanks to Ashwin) that ‘p’ is not the same as ‘puts’ which is what I originally thought and has been driving me crazy as I try to understand why everything I print includes an extra backslash!

The following code:

p "/mark/dir/".gsub("/", "\\/")

is the same as typing:

puts "/mark/dir/".gsub("/", "\\/").inspect

We were able to change our Capistrano script to escape forward slashes like so:

task :replace_in_shell do
	directory = "/my/directory/path"
	sed_command = "sed 's/^some_key.*$/#{directory.gsub("/", "\\/")}/' shell_script.sh > shell_script_with_qa_variables.sh"
	run sed_command
end
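The same escaping can be sketched directly in shell, which is handy for checking what sed will eventually receive. The helper name and the sample input are made up for illustration, and the sketch assumes the path contains no other characters that are special to sed, such as '&' or a backslash:

```shell
# Escape every '/' in a path so it can sit inside an s/.../.../ command.
# The inner sed uses '|' as its own delimiter to keep the escaping readable.
escape_slashes() {
  printf '%s\n' "$1" | sed 's|/|\\/|g'
}

directory="/my/directory/path"
escaped=$(escape_slashes "$directory")

# Made-up stand-in for shell_script.sh.
printf '%s\n' 'some_key=/prod/location' 'other_key=1' |
  sed "s/^some_key.*\$/${escaped}/"
```

Printing the value of escaped shows a single backslash in front of each slash, which matches what ‘puts’ shows in Ruby rather than what ‘p’ shows.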

Written by Mark Needham

November 18th, 2010 at 6:40 pm

Posted in Ruby
