Mark Needham

Thoughts on Software Development

Archive for the ‘Software Development’ Category

Study until your mind wanders

with 2 comments

I’ve previously found it very difficult to read math-heavy content, which has made it challenging to get through Distributed Computing, a book I bought last May.

After several false starts – where I gave up after getting frustrated that I couldn’t understand things the first time around and then forgot everything if I left it for a couple of days – I decided to try again with a different approach.

I’ve been trying a technique I learned from Mini Habits where every day I have a (very small) goal of reading one page of the book. Having such a small goal means that I can read the material as slowly as I like (repeating previous days if necessary).

So far I’ve read 4 chapters (~100 pages) over the last month – some days I read 6 or 7 pages, other days I only manage one. The key is to keep the rhythm of reading something.

I tried doing the reading at different times of the day – on the bus on the way to work, in the evening before going to sleep – and found that for me the best time is immediately when I wake up. To minimise my mind wandering I don’t read any emails, chat messages or social media accounts before I start.

Despite this, I’ve noticed that after a while my mind starts to wander while reading proofs and that’s a signal for me to stop for the day. When I pick up again the next day I’ve often found that I understand what I was having difficulty with.

I’ve read that meditation prior to studying is an effective way to quiet the mind and would be a more ‘on demand’ way of achieving the concentration required to read this type of material. I’ve never done any meditation so if anyone has any tips on where to start that’d be helpful.

Written by Mark Needham

December 31st, 2015 at 10:47 am

jq: Cannot iterate over number / string and number cannot be added

without comments

In my continued parsing of meetup.com’s JSON API I wanted to extract some information from the following JSON file:

$ head -n40 data/members/18313232.json
[
  {
    "status": "active",
    "city": "London",
    "name": ". .",
    "other_services": {},
    "country": "gb",
    "topics": [],
    "lon": -0.13,
    "joined": 1438866605000,
    "id": 92951932,
    "state": "17",
    "link": "http://www.meetup.com/members/92951932",
    "photo": {
      "thumb_link": "http://photos1.meetupstatic.com/photos/member/8/d/6/b/thumb_250896203.jpeg",
      "photo_id": 250896203,
      "highres_link": "http://photos1.meetupstatic.com/photos/member/8/d/6/b/highres_250896203.jpeg",
      "photo_link": "http://photos1.meetupstatic.com/photos/member/8/d/6/b/member_250896203.jpeg"
    },
    "lat": 51.49,
    "visited": 1446745707000,
    "self": {
      "common": {}
    }
  },
  {
    "status": "active",
    "city": "London",
    "name": "Abdelkader Idryssy",
    "other_services": {},
    "country": "gb",
    "topics": [
      {
        "name": "Weekend Adventures",
        "urlkey": "weekend-adventures",
        "id": 16438
      },
      {
        "name": "Community Building",
        "urlkey": "community-building",

In particular I want to extract the member’s id, name, join date and the ids of topics they’re interested in. I started with the following jq query to try and extract those attributes:

$ jq -r '.[] | [.id, .name, .joined, (.topics[] | .id | join(";"))] | @csv' data/members/18313232.json
Cannot iterate over number (16438)

Annoyingly this applies ‘join’ to each topic id individually rather than to the array of ids that I wanted. I tweaked the query to the following with no luck:

$ jq -r '.[] | [.id, .name, .joined, (.topics[].id | join(";"))] | @csv' data/members/18313232.json
Cannot iterate over number (16438)

As a guess I decided to wrap ‘.topics[].id’ in an array literal to see if it had any impact:

$ jq -r '.[] | [.id, .name, .joined, ([.topics[].id] | join(";"))] | @csv' data/members/18313232.json
92951932,". .",1438866605000,""
jq: error (at data/members/18313232.json:60013): string ("") and number (16438) cannot be added

Woot! A different error message at least and this one seems to be due to a type mismatch between the string we want to end up with and the array of numbers that we currently have.

We can cast our way to victory with the ‘tostring’ function:

$ jq -r '.[] | [.id, .name, .joined, ([.topics[].id | tostring] | join(";"))] | @csv' data/members/18313232.json
...
92951932,". .",1438866605000,""
193866304,"Abdelkader Idryssy",1445195325000,"16438;20727;15401;9760;20246;20923;3336;2767;242;58259;4417;1789;10454;20274;10232;563;25375;16433;15187;17635;26273;21808;933;7789;23884;16212;144477;15322;21067;3833;108403;20221;1201;182;15083;9696;4377;15360;18296;15121;17703;10161;1322;3880;18333;3485;15585;44584;18692;21681"
28643052,"Abhishek Chanda",1439688955000,"646052;520302;15167;563;65735;537492;646072;537502;24959;1025832;8599;31197;24410;26118;10579;1064;189;48471;16216;18062;33089;107633;46831;20479;1423042;86258;21441;3833;21681;188;9696;58162;20398;113032;18060;29971;55324;30928;15261;58259;638;16475;27591;10107;242;109595;10470;26384;72514;1461192"
39523062,"Adam Kinder-Jones",1438677474000,"70576;21549;3833;42277;164111;21522;93380;48471;15167;189;563;25435;87614;9696;18062;58162;10579;21681;19882;108403;128595;15582;7029"
194119823,"Adam Lewis",1444867994000,"10209"
14847001,"Adam Rogers",1422917313000,""
87709042,"Adele Green",1436867381000,"15167;18062;102811;9696;30928;18060;78565;189;7029;48471;127567;10579;58162;563;3833;16216;21441;37708;209821;15401;59325;31792;21836;21900;984862;15720;17703;96823;4422;85951;87614;37428;2260;827;121802;19672;38660;84325;118991;135612;10464;1454542;17936;21549;21520;17628;148303;20398;66339;29661"
11497937,"Adrian Bridgett",1421067940000,"30372;15046;25375;638;498;933;374;27591;18062;18060;15167;10581;16438;15672;1998;1273;713;26333;15099;15117;4422;15892;242;142180;563;31197;20479;1502;131178;15018;43256;58259;1201;7319;15940;223;8652;66493;15029;18528;23274;9696;128595;21681;17558;50709;113737"
14151190,"adrian lawrence",1437142198000,"7029;78565;659;85951;15582;48471;9696;128595;563;10579;3833;101960;16137;1973;78566;206;223;21441;16216;108403;21681;186;1998;15731;17703;15043;16613;17885;53531;48375;16615;19646;62326;49954;933;22268;19243;37381;102811;30928;455;10358;73511;127567;106382;16573;36229;781;23981;1954"
183557824,"Adrien Pujol",1421882756000,"108403;563;9696;21681;188;24410;1064;32743;124668;15472;21123;1486432;1500742;87614;46831;1454542;46810;166000;126177;110474"
...
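
As an aside, the same cast can be expressed with ‘map’ – a sketch which should be equivalent on any reasonably recent jq:

$ jq -r '.[] | [.id, .name, .joined, ([.topics[].id] | map(tostring) | join(";"))] | @csv' data/members/18313232.json

‘map(tostring)’ applies ‘tostring’ to every element of the array, which is exactly what the inline ‘[.topics[].id | tostring]’ form is doing.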

Written by Mark Needham

November 24th, 2015 at 12:12 am

Posted in Software Development

jq: Filtering missing keys

without comments

I’ve been playing around with the meetup.com API again over the last few days and having saved a set of events to disk I wanted to extract the venues using jq.

This is what a single event record looks like:

$ jq -r ".[0]" data/events/0.json
{
  "status": "past",
  "rating": {
    "count": 1,
    "average": 1
  },
  "utc_offset": 3600000,
  "event_url": "http://www.meetup.com/londonweb/events/3261890/",
  "group": {
    "who": "Web Peeps",
    "name": "London Web",
    "group_lat": 51.52000045776367,
    "created": 1034097743000,
    "join_mode": "approval",
    "group_lon": -0.12999999523162842,
    "urlname": "londonweb",
    "id": 163876
  },
  "name": "London Web Design October Meetup",
  "created": 1094756756000,
  "venue": {
    "city": "London",
    "name": "Roadhouse Live Music Restaurant , Bar & Club",
    "country": "GB",
    "lon": -0.1,
    "phone": "44-020-7240-6001",
    "address_1": "The Piazza",
    "address_2": "Covent Garden",
    "repinned": false,
    "lat": 51.52,
    "id": 11725
  },
  "updated": 1273536337000,
  "visibility": "public",
  "yes_rsvp_count": 2,
  "time": 1097776800000,
  "waitlist_count": 0,
  "headcount": 0,
  "maybe_rsvp_count": 5,
  "id": "3261890"
}

We want to extract the keys underneath ‘venue’.
I started with the following:

$ jq -r ".[] | .venue" data/events/0.json
...
{
  "city": "London",
  "name": "Counting House Pub",
  "country": "gb",
  "lon": -0.085022,
  "phone": "020 7283 7123",
  "address_1": "50 Cornhill Rd",
  "address_2": "EC3V 3PD",
  "repinned": false,
  "lat": 51.513407,
  "id": 835790
}
null
{
  "city": "Paris",
  "name": "Mozilla Paris",
  "country": "fr",
  "lon": 2.341002,
  "address_1": "16 Bis Boulevard Montmartre",
  "repinned": false,
  "lat": 48.871834,
  "id": 23591845
}
...

This is close to what I want but it includes ‘null’ values for events which don’t have a venue, which means that when we extract the keys inside ‘venue’ we get completely empty rows as well:

jq -r ".[] | .venue | [.id, .name, .city, .address_1, .address_2, .lat, .lon] | @csv" data/events/0.json
...
101958,"The Green Man and French Horn,  -","London","54, St. Martins Lane - Covent Garden","WC2N 4EA",51.52,-0.1
,,,,,,
107295,"The Yorkshire Grey Pub","London","46 Langham Street","W1W 7AX",51.52,-0.1
...
,,,,,,

In functional programming lingo we want to filter out any JSON documents which don’t have the ‘venue’ key. ‘filter’ has a different meaning in jq so it took me a while to realise that the ‘select’ function was what I needed to get rid of the null values:

$ jq -r ".[] | select(.venue != null) | .venue | [.id, .name, .city, .address_1, .address_2, .lat, .lon] | @csv" data/events/0.json | head
11725,"Roadhouse Live Music Restaurant , Bar & Club","London","The Piazza","Covent Garden",51.52,-0.1
11725,"Roadhouse Live Music Restaurant , Bar & Club","London","The Piazza","Covent Garden",51.52,-0.1
11725,"Roadhouse Live Music Restaurant , Bar & Club","London","The Piazza","Covent Garden",51.52,-0.1
11725,"Roadhouse Live Music Restaurant , Bar & Club","London","The Piazza","Covent Garden",51.52,-0.1
76192,"Pied Bull Court","London","Galen Place, London, WC1A 2JR",,51.516747,-0.12719
76192,"Pied Bull Court","London","Galen Place, London, WC1A 2JR",,51.516747,-0.12719
85217,"Earl's Court Exhibition Centre","London","Warwick Road","SW5 9TA",51.49233,-0.199735
96579,"Olympia 2","London","Near Olympia tube station",,51.52,-0.1
76192,"Pied Bull Court","London","Galen Place, London, WC1A 2JR",,51.516747,-0.12719
101958,"The Green Man and French Horn,  -","London","54, St. Martins Lane - Covent Garden","WC2N 4EA",51.52,-0.1

And we’re done.

Written by Mark Needham

November 14th, 2015 at 10:51 pm

Posted in Software Development

Docker 1.9: Port forwarding on Mac OS X

without comments

Since the Neo4j 2.3.0 release there’s been an official Docker image, which I thought I’d give a try this afternoon.

The last time I used Docker, about a year ago, I had to install boot2docker, which has since been deprecated in favour of Docker Machine and the Docker Toolbox.

I created a container with the following command:

docker run --detach --publish=7474:7474 neo4j/neo4j

And then tried to access the Neo4j server locally:

$ curl http://localhost:7474
curl: (7) Failed to connect to localhost port 7474: Connection refused

I quickly checked that docker had started up Neo4j correctly:

$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                              NAMES
1f7c48e267f0        neo4j/neo4j         "/docker-entrypoint.s"   10 minutes ago      Up 10 minutes       7473/tcp, 0.0.0.0:7474->7474/tcp   kickass_easley

Looks good. Amusingly I then came across my own blog post from a year ago where I’d run into the same problem – the problem being that we need to access the Neo4j server via the VM’s IP address rather than localhost.

Instead of using boot2docker we now need to use docker-machine to find the VM’s IP address:

$ docker-machine ls
NAME      ACTIVE   DRIVER       STATE     URL                         SWARM
default   *        virtualbox   Running   tcp://192.168.99.100:2376
$ curl http://192.168.99.100:7474
{
  "management" : "http://192.168.99.100:7474/db/manage/",
  "data" : "http://192.168.99.100:7474/db/data/"
}
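
As an aside, docker-machine can print just the VM’s IP address, which makes the curl call scriptable – something like:

$ curl http://$(docker-machine ip default):7474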

And we’re back in business.

Written by Mark Needham

November 8th, 2015 at 8:58 pm

Posted in Software Development

IntelliJ ‘java: cannot find JDK 1.8’

without comments

I upgraded to IntelliJ 15.0 a few days ago and was initially seeing the following error when trying to compile:

module-name
 
java: cannot find JDK 1.8

I’ve been compiling against JDK 1.8 for a while now using IntelliJ 14 so I wasn’t sure what was going on.

I checked my project settings and they seemed fine:

[screenshot: project settings]

The error message suggested I look in the logs to find more information but I wasn’t sure where those live! I eventually found out the answer via the comments of this support ticket although I later found a post describing it in more detail.

Looking into the logs revealed the following error message:

$ less /Users/markneedham/Library/Logs/IntelliJIdea15/idea.log
 
2015-11-05 16:31:28,429 [ 428129]   INFO - figurations.GeneralCommandLine - Cannot run program "/Library/Java/JavaVirtualMachines/jdk1.7.0_71.jdk/Contents/Home/bin/java" (in directory "/Applications/IntelliJ IDEA 15.app/Contents/bin"): error=2, No such file or directory
java.io.IOException: Cannot run program "/Library/Java/JavaVirtualMachines/jdk1.7.0_71.jdk/Contents/Home/bin/java" (in directory "/Applications/IntelliJ IDEA 15.app/Contents/bin"): error=2, No such file or directory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
	at com.intellij.execution.configurations.GeneralCommandLine.startProcess(GeneralCommandLine.java:368)
	at com.intellij.execution.configurations.GeneralCommandLine.createProcess(GeneralCommandLine.java:354)
	at com.intellij.execution.process.OSProcessHandler.<init>(OSProcessHandler.java:38)
	at org.jetbrains.idea.maven.server.MavenServerManager$3.startProcess(MavenServerManager.java:359)
	at org.jetbrains.idea.maven.server.MavenServerManager$3.execute(MavenServerManager.java:345)
	at com.intellij.execution.rmi.RemoteProcessSupport.a(RemoteProcessSupport.java:206)
	at com.intellij.execution.rmi.RemoteProcessSupport.acquire(RemoteProcessSupport.java:139)
	at org.jetbrains.idea.maven.server.MavenServerManager.create(MavenServerManager.java:163)
	at org.jetbrains.idea.maven.server.MavenServerManager.create(MavenServerManager.java:71)
	at org.jetbrains.idea.maven.server.RemoteObjectWrapper.getOrCreateWrappee(RemoteObjectWrapper.java:41)
	at org.jetbrains.idea.maven.server.MavenServerManager$9.execute(MavenServerManager.java:525)
	at org.jetbrains.idea.maven.server.MavenServerManager$9.execute(MavenServerManager.java:522)
	at org.jetbrains.idea.maven.server.RemoteObjectWrapper.perform(RemoteObjectWrapper.java:76)
	at org.jetbrains.idea.maven.server.MavenServerManager.applyProfiles(MavenServerManager.java:522)
	at org.jetbrains.idea.maven.project.MavenProjectReader.applyProfiles(MavenProjectReader.java:369)
	at org.jetbrains.idea.maven.project.MavenProjectReader.doReadProjectModel(MavenProjectReader.java:98)
	at org.jetbrains.idea.maven.project.MavenProjectReader.access$300(MavenProjectReader.java:42)
	at org.jetbrains.idea.maven.project.MavenProjectReader$1.doProcessParent(MavenProjectReader.java:422)
	at org.jetbrains.idea.maven.project.MavenProjectReader$1.doProcessParent(MavenProjectReader.java:399)
	at org.jetbrains.idea.maven.project.MavenParentProjectFileProcessor.processRepositoryParent(MavenParentProjectFileProcessor.java:84)
	at org.jetbrains.idea.maven.project.MavenParentProjectFileProcessor.process(MavenParentProjectFileProcessor.java:62)
	at org.jetbrains.idea.maven.project.MavenProjectReader.resolveInheritance(MavenProjectReader.java:425)
	at org.jetbrains.idea.maven.project.MavenProjectReader.doReadProjectModel(MavenProjectReader.java:95)
	at org.jetbrains.idea.maven.project.MavenProjectReader.access$300(MavenProjectReader.java:42)
	at org.jetbrains.idea.maven.project.MavenProjectReader$1.doProcessParent(MavenProjectReader.java:422)
	at org.jetbrains.idea.maven.project.MavenProjectReader$1.doProcessParent(MavenProjectReader.java:399)
	at org.jetbrains.idea.maven.project.MavenParentProjectFileProcessor.processRepositoryParent(MavenParentProjectFileProcessor.java:84)
	at org.jetbrains.idea.maven.project.MavenParentProjectFileProcessor.process(MavenParentProjectFileProcessor.java:62)
	at org.jetbrains.idea.maven.project.MavenProjectReader.resolveInheritance(MavenProjectReader.java:425)
	at org.jetbrains.idea.maven.project.MavenProjectReader.doReadProjectModel(MavenProjectReader.java:95)
	at org.jetbrains.idea.maven.project.MavenProjectReader.access$300(MavenProjectReader.java:42)
	at org.jetbrains.idea.maven.project.MavenProjectReader$1.doProcessParent(MavenProjectReader.java:422)
	at org.jetbrains.idea.maven.project.MavenProjectReader$1.doProcessParent(MavenProjectReader.java:399)
	at org.jetbrains.idea.maven.project.MavenParentProjectFileProcessor.processRepositoryParent(MavenParentProjectFileProcessor.java:84)
	at org.jetbrains.idea.maven.project.MavenParentProjectFileProcessor.process(MavenParentProjectFileProcessor.java:62)
	at org.jetbrains.idea.maven.project.MavenProjectReader.resolveInheritance(MavenProjectReader.java:425)
	at org.jetbrains.idea.maven.project.MavenProjectReader.doReadProjectModel(MavenProjectReader.java:95)
	at org.jetbrains.idea.maven.project.MavenProjectReader.readProject(MavenProjectReader.java:53)
	at org.jetbrains.idea.maven.project.MavenProject.read(MavenProject.java:626)
	at org.jetbrains.idea.maven.project.MavenProjectsTree.doUpdate(MavenProjectsTree.java:564)
	at org.jetbrains.idea.maven.project.MavenProjectsTree.doAdd(MavenProjectsTree.java:509)
	at org.jetbrains.idea.maven.project.MavenProjectsTree.update(MavenProjectsTree.java:470)
	at org.jetbrains.idea.maven.project.MavenProjectsTree.updateAll(MavenProjectsTree.java:441)
	at org.jetbrains.idea.maven.project.MavenProjectsProcessorReadingTask.perform(MavenProjectsProcessorReadingTask.java:60)
	at org.jetbrains.idea.maven.project.MavenProjectsProcessor.doProcessPendingTasks(MavenProjectsProcessor.java:134)
	at org.jetbrains.idea.maven.project.MavenProjectsProcessor.access$100(MavenProjectsProcessor.java:30)
	at org.jetbrains.idea.maven.project.MavenProjectsProcessor$2.run(MavenProjectsProcessor.java:109)
	at org.jetbrains.idea.maven.utils.MavenUtil$7.run(MavenUtil.java:464)
	at com.intellij.openapi.application.impl.ApplicationImpl$8.run(ApplicationImpl.java:365)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
	at org.jetbrains.ide.PooledThreadExecutor$1$1.run(PooledThreadExecutor.java:55)
Caused by: java.io.IOException: error=2, No such file or directory
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:248)
	at java.lang.ProcessImpl.start(ProcessImpl.java:134)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
	... 55 more

Somewhere I had a JDK 1.7 defined which no longer existed on my machine. I actually only have one JDK installed at the moment:

$ /usr/libexec/java_home -V
Matching Java Virtual Machines (1):
    1.8.0_51, x86_64:	"Java SE 8"	/Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home
 
/Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home

A bit of exploring led me to ‘Platform Settings’ which is where the culprit was:

[screenshot: Platform Settings showing a stale JDK 1.7 entry]

That setting actually lives in /Users/markneedham/Library/Preferences/IntelliJIdea15/options/jdk.table.xml and once I removed it IntelliJ resumed normal service.
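
For future reference, a quick way to check that file for stale JDK entries without clicking through the UI is a simple grep:

$ grep -n "jdk1.7" /Users/markneedham/Library/Preferences/IntelliJIdea15/options/jdk.table.xml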

Written by Mark Needham

November 8th, 2015 at 11:47 am

Hadoop: HDFS – java.lang.NoSuchMethodError: org.apache.hadoop.fs.FSOutputSummer.<init>(Ljava/util/zip/Checksum;II)V

without comments

I wanted to write a little program to check that one machine could communicate with a HDFS server running on another and adapted some code from the Hadoop wiki as follows:

package org.playground;
 
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
 
import java.io.IOException;
 
public class HadoopDFSFileReadWrite {
 
    static void printAndExit(String str) {
        System.err.println( str );
        System.exit(1);
    }
 
    public static void main (String[] argv) throws IOException {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/Users/markneedham/Downloads/core-site.xml"));
 
        FileSystem fs = FileSystem.get(conf);
 
        Path inFile = new Path("hdfs://192.168.0.11/user/markneedham/explore.R");
        Path outFile = new Path("hdfs://192.168.0.11/user/markneedham/output-" + System.currentTimeMillis());
 
        // Check if input/output are valid
        if (!fs.exists(inFile))
            printAndExit("Input file not found");
        if (!fs.isFile(inFile))
            printAndExit("Input should be a file");
        if (fs.exists(outFile))
            printAndExit("Output already exists");
 
        // Read from and write to new file
        byte[] buffer = new byte[256];
        try ( FSDataInputStream in = fs.open( inFile ); FSDataOutputStream out = fs.create( outFile ) )
        {
            int bytesRead = 0;
            while ( (bytesRead = in.read( buffer )) > 0 )
            {
                out.write( buffer, 0, bytesRead );
            }
        }
        catch ( IOException e )
        {
            System.out.println( "Error while copying file" );
        }
    }
}
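
I was running the class from inside IntelliJ (as the stack trace below shows), but for reference something like the Maven exec plugin should work from the command line too – a sketch, assuming the exec-maven-plugin is resolvable:

$ mvn compile exec:java -Dexec.mainClass=org.playground.HadoopDFSFileReadWrite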

I initially thought I only had the following dependency in my POM file:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.0</version>
</dependency>

But when I ran the program I got the following exception:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.fs.FSOutputSummer.<init>(Ljava/util/zip/Checksum;II)V
	at org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1553)
	at org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1582)
	at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1614)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1465)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1390)
	at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:394)
	at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:390)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:390)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:334)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:890)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:787)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:776)
	at org.playground.HadoopDFSFileReadWrite.main(HadoopDFSFileReadWrite.java:37)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)

From following the stack trace I realised I’d made a mistake and had accidentally pulled in a dependency on hadoop-hdfs 2.4.1, which doesn’t match hadoop-common 2.7.0 – hence the NoSuchMethodError. If we didn’t have the hadoop-hdfs dependency at all we’d see this error instead:

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:170)
	at org.playground.HadoopDFSFileReadWrite.main(HadoopDFSFileReadWrite.java:22)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)

Now let’s add the correct version of the dependency and make sure it all works as expected:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.0</version>
    <exclusions>
        <exclusion>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
        </exclusion>
        <exclusion>
            <groupId>javax.servlet</groupId>
            <artifactId>servlet-api</artifactId>
        </exclusion>
    </exclusions>
</dependency>
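
If you want to confirm which version of hadoop-hdfs actually ends up on the classpath, the Maven dependency plugin can show it – e.g.:

$ mvn dependency:tree -Dincludes=org.apache.hadoop:hadoop-hdfs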

When we run that a new file is created in HDFS on the other machine with the current timestamp:

$ date +%s000
1446336801000
 
$ hdfs dfs -ls
...
-rw-r--r--   3 markneedham supergroup       9249 2015-11-01 00:13 output-1446337098257
...

Written by Mark Needham

October 31st, 2015 at 11:58 pm

Posted in Software Development

jq: error – Cannot iterate over null (null)

without comments

I’ve been playing around with the jq library again over the past couple of days to convert the JSON from the Stack Overflow API into CSV and found myself needing to deal with an optional field.

I’ve downloaded 100 or so questions and stored them in a JSON array like so:

$ head -n 100 so.json
[
    {
        "has_more": true,
        "items": [
            {
                "is_answered": false,
                "delete_vote_count": 0,
                "body_markdown": "...",
                "tags": [
                    "jdbc",
                    "neo4j",
                    "cypher",
                    "spring-data-neo4j"
                ],
                "question_id": 33023306,
                "title": "How to delete multiple nodes by specific ID using Cypher",
                "down_vote_count": 0,
                "view_count": 8,
                "answers": [
                    {
...
]

I wrote the following command to try and extract the answer metadata and the corresponding question_id:

$ jq -r \
 '.[] | .items[] |
 { question_id: .question_id, answer: .answers[] } |
 [.question_id, .answer.answer_id, .answer.title] |
 @csv' so.json
 
33023306,33024189,"How to delete multiple nodes by specific ID using Cypher"
33020796,33021958,"How do a general search across string properties in my nodes?"
33018818,33020068,"Neo4j match nodes related to all nodes in collection"
33018818,33024273,"Neo4j match nodes related to all nodes in collection"
jq: error (at so.json:134903): Cannot iterate over null (null)

Unfortunately this results in an error since some questions haven’t been answered yet and therefore don’t have the ‘answers’ property.

While reading the docs I came across the alternative operator ‘//’ which can be used to provide defaults – in this case I thought I could plug in an empty array of answers if a question hadn’t been answered yet:

$ jq -r \
 '.[] | .items[] |
 { question_id: .question_id, answer: (.answers[] // []) } |
 [.question_id, .answer.answer_id, .answer.title] |
 @csv' so.json
 
33023306,33024189,"How to delete multiple nodes by specific ID using Cypher"
33020796,33021958,"How do a general search across string properties in my nodes?"
33018818,33020068,"Neo4j match nodes related to all nodes in collection"
33018818,33024273,"Neo4j match nodes related to all nodes in collection"
jq: error (at so.json:134903): Cannot iterate over null (null)

Still the same error! The problem is that ‘//’ only provides a default when the left-hand side yields no (non-null) output – it doesn’t catch the error raised by trying to iterate over null with ‘.answers[]’. Reading down the page I noticed the ‘?’ operator which provides syntactic sugar for handling/catching errors. I gave it a try:

$ jq -r  '.[] | .items[] |
 { question_id: .question_id, answer: .answers[]? } |
 [.question_id, .answer.answer_id, .answer.title] |
 @csv' so.json | head -n10
 
33023306,33024189,"How to delete multiple nodes by specific ID using Cypher"
33020796,33021958,"How do a general search across string properties in my nodes?"
33018818,33020068,"Neo4j match nodes related to all nodes in collection"
33018818,33024273,"Neo4j match nodes related to all nodes in collection"
33015714,33021482,"Upgrade of spring data neo4j 3.x to 4.x Relationship Operations"
33011477,33011721,"Why does Neo4j OGM delete method return void?"
33011102,33011565,"Neo4j and algorithms"
33011102,33013260,"Neo4j and algorithms"
33010859,33011505,"Importing data into an existing database in neo4j"
33009673,33010942,"How do I use Spring Data Neo4j to persist a Map (java.util.Map) object inside an NodeEntity?"

As far as I can tell we’re just skipping any records that don’t contain ‘answers’, which is exactly the behaviour I’m after – just what we need!
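
Another approach that should work, given the explanation above, is to apply the default before iterating so that ‘.answers[]’ never sees a null – a sketch:

$ jq -r \
 '.[] | .items[] |
 { question_id: .question_id, answer: (.answers // [])[] } |
 [.question_id, .answer.answer_id, .answer.title] |
 @csv' so.json | head -n10

Iterating over the empty array produces no output, so unanswered questions get skipped just like with ‘?’.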

Written by Mark Needham

October 9th, 2015 at 6:34 am

Posted in Software Development

Mac OS X: Installing the PROJ.4 – Cartographic Projections Library

without comments

I’ve been following Scott Barnham’s guide to transforming UK postcodes into (lat, long) coordinates and needed to install the PROJ.4 Cartographic Projections library which I initially struggled with.

The first step is to download a tar.gz version which is linked from the wiki page:

$ wget http://download.osgeo.org/proj/proj-4.9.1.tar.gz

Next we’ll unpack the file and then build the binaries:

$ tar -xvf proj-4.9.1.tar.gz
$ cd proj-4.9.1
$ ./configure --prefix ~/projects/land-registry/proj-4.9.1
$ make
$ make install

The files we need are in the bin directory…

$ ls -alh bin/
total 184
drwxr-xr-x   8 markneedham  staff   272B  5 Oct 23:07 .
drwxr-xr-x@ 41 markneedham  staff   1.4K  5 Oct 20:46 ..
-rwxr-xr-x   1 markneedham  staff    20K  5 Oct 23:07 cs2cs
-rwxr-xr-x   1 markneedham  staff    16K  5 Oct 23:07 geod
lrwxr-xr-x   1 markneedham  staff     4B  5 Oct 23:07 invgeod -> geod
lrwxr-xr-x   1 markneedham  staff     4B  5 Oct 23:07 invproj -> proj
-rwxr-xr-x   1 markneedham  staff    13K  5 Oct 23:07 nad2bin
-rwxr-xr-x   1 markneedham  staff    21K  5 Oct 23:07 proj

…now let’s give it a try. We need to feed in OSGB36 grid reference values and then we’ll get back WGS84 Lat/Lng values. We can grab some grid reference values from the Ordnance Survey website.

e.g. the Neo4j London office has the post code SE1 0NZ which translates to coordinates 531950,180195. Let’s try those out with PROJ.4:

$ ./proj-4.9.1/bin/cs2cs -f '%.7f' +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894 +units=m +no_defs +to +proj=latlong +ellps=WGS84 +towgs84=0,0,0 +no_defs
 
531950 180195
-0.1002020	51.5052917 46.0810195
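
cs2cs also reads coordinate pairs from stdin, so the conversion can be done non-interactively – e.g. by piping the grid reference in:

$ echo "531950 180195" | ./proj-4.9.1/bin/cs2cs -f '%.7f' +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894 +units=m +no_defs +to +proj=latlong +ellps=WGS84 +towgs84=0,0,0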

So it suggests a (lat, long) pairing of (51.5052917, -0.1002020). And if we plug that into Google Maps…

[screenshot: the coordinates plotted on Google Maps]

…it’s pretty much spot on!

Written by Mark Needham

October 5th, 2015 at 10:41 pm

Record Linkage: Playing around with Duke

without comments

I’ve become quite interested in record linkage recently and came across the Duke project which provides some tools to help solve this problem. I thought I’d give it a try.

The typical problem when doing record linkage is that we have two records from different data sets which represent the same entity but don’t have a common key that we can use to merge them together. We therefore need to come up with a heuristic that will allow us to do so.

Duke has a few examples showing it in action and I decided to go with the linking countries one. Here we have countries from DBpedia and the Mondial database and we want to link them together.

The first thing we need to do is build the project:

export JAVA_HOME=`/usr/libexec/java_home`
mvn clean package -DskipTests

At the time of writing this will put a zip file containing everything we need at duke-dist/target/. Let’s unpack that:

unzip duke-dist/target/duke-dist-1.3-SNAPSHOT-bin.zip

Next we need to download the data files and Duke configuration file:

wget https://raw.githubusercontent.com/larsga/Duke/master/doc/example-data/countries-dbpedia.csv
wget https://raw.githubusercontent.com/larsga/Duke/master/doc/example-data/countries.xml
wget https://raw.githubusercontent.com/larsga/Duke/master/doc/example-data/countries-mondial.csv
wget https://raw.githubusercontent.com/larsga/Duke/master/doc/example-data/countries-test.txt

Now we’re ready to give it a go:

java -cp "duke-dist-1.3-SNAPSHOT/lib/*" no.priv.garshol.duke.Duke --testfile=countries-test.txt --testdebug --showmatches countries.xml
 
...
 
NO MATCH FOR:
ID: '7706', NAME: 'guatemala', AREA: '108890', CAPITAL: 'guatemala city',
 
MATCH 0.9825124555160142
ID: '10052', NAME: 'pitcairn islands', AREA: '47', CAPITAL: 'adamstown',
ID: 'http://dbpedia.org/resource/Pitcairn_Islands', NAME: 'pitcairn islands', AREA: '47', CAPITAL: 'adamstown',
 
Correct links found: 200 / 218 (91.7%)
Wrong links found: 0 / 24 (0.0%)
Unknown links found: 0
Percent of links correct 100.0%, wrong 0.0%, unknown 0.0%
Records with no link: 18
Precision 100.0%, recall 91.74311926605505%, f-number 0.9569377990430622

We can look in countries.xml to see how the similarity between records is being calculated:

  <schema>
    <threshold>0.7</threshold>
...
    <property>
      <name>NAME</name>
      <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
      <low>0.09</low>
      <high>0.93</high>
    </property>
    <property>
      <name>AREA</name>
      <comparator>no.priv.garshol.duke.comparators.NumericComparator</comparator>
      <low>0.04</low>
      <high>0.73</high>
    </property>
    <property>
      <name>CAPITAL</name>
      <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
      <low>0.12</low>
      <high>0.61</high>
    </property>
  </schema>

So we’re working out the similarity of the capital city and country name by calculating their Levenshtein distance i.e. the minimum number of single-character edits required to change one word into the other – e.g. ‘guatemala’ and ‘guatamala’ are a distance of 1 apart (one substitution).

This works very well if there is a typo or difference in spelling in one of the data sets. However, I was curious what would happen if the country had two completely different names e.g. Cote d’Ivoire is sometimes known as Ivory Coast. Let’s try changing the country name in one of the files:

"19147","Cote dIvoire","Yamoussoukro","322460"
java -cp "duke-dist-1.3-SNAPSHOT/lib/*" no.priv.garshol.duke.Duke --testfile=countries-test.txt --testdebug --showmatches countries.xml
 
NO MATCH FOR:
ID: '19147', NAME: 'ivory coast', AREA: '322460', CAPITAL: 'yamoussoukro',

I also tried it out with the BBC and ESPN match reports of the Man Utd vs Tottenham match – the BBC references players by surname, while ESPN has their full names.

When I compared the full name against surname using the Levenshtein comparator there were no matches as you’d expect. I had to split the ESPN names up into first name and surname to get the linking to work.

Equally when I varied the team names to be ‘Man Utd’ rather than ‘Manchester United’ and ‘Tottenham’ rather than ‘Tottenham Hotspur’ that didn’t work either.

I think I probably need to write a domain specific comparator but I’m also curious whether I could come up with a bunch of training examples and then train a model to detect what makes two records similar. It’d be less deterministic but perhaps more robust.

Written by Mark Needham

August 8th, 2015 at 10:50 pm

The Willpower Instinct: Reducing time spent mindlessly scrolling for things to read

with one comment

I recently finished reading Kelly McGonigal’s excellent book ‘The Willpower Instinct’, having previously watched her Google talk of the same title.

My main takeaway from the book is that there are things that we want to do (or not do) but doing them (or not as the case may be) isn’t necessarily instinctive and so we need to develop some strategies to help ourselves out.

In one of the early chapters she suggests picking a habit that you want to do less of and writing down on a piece of paper every time you want to do it and how you’re feeling at that point.

After writing it down you’re free to then follow through and do it but you don’t have to if you change your mind.

I was quite aware of the fact that I spend a lot of time idly scrolling from email to Twitter to Facebook to LinkedIn to news websites and back again so I thought it’d be interesting to track when/why I was doing this. The annoying thing about this habit is that it can easily eat up 20-30 minutes at a time without you even noticing.

I’ve been tracking myself for about three weeks and in the first few days I noticed that the first thing I did as soon as I woke up was grab my phone and get into the cycle.

It was quite frustrating to be lured in so early in the day but one of the suggestions in the book is that feeling guilty about something is actually detrimental to our progress. Instead we should note why it happened and then move on – the day isn’t a write-off because of one event!

Kelly suggests that if we can work out the times when we’re most likely to fall into our habits then we can pre-plan a mitigation strategy.

From looking over my notes the following are the reasons why I want to start mindlessly scrolling:

  • I’m stuck on the problem I’m working on
  • I’m bored
  • I’m tired
  • I’m hungry
  • I’m getting distracted by notifications
  • I want to not think for a while

The notifications bullet is easy to address – I turn off notifications on my phone for 4 hours at a time so I don’t even know there’s anything to read.

I was intrigued to note that I got distracted when stuck on a problem – the main takeaway here is to check whether the urge to scroll mindlessly is being driven by having to think hard. If it is then I can choose to either get back to it or go for a short walk and then come back. But definitely don’t start scrolling!

I often find myself bored on my commute to work so I’ve addressed this by working out a book/paper I’m going to read the night before and then having that ready for the journey. Lunch time is prime time for mindless scrolling as well so I’ve filled that time with various computer science/data science videos.

Since I started tracking my scrolling I’ve found myself sleeping earlier, so my assumption is that the extra hours awake were being spent mindlessly scrolling, which led to being more tired – a win all around in that respect.

Something I’ve noticed is that I’m sometimes wasting time on other activities which are not ‘forbidden’ but are equally unconstructive e.g. chat applications / watching music videos.

The former are obviously useful for communicating with people so I’ve been trying to use them only when I actually want to chat to someone rather than mindlessly looking for messages to read.

I also find myself not wanting to write down the times I’ve mindlessly scrolled when I’m doing it a lot on a given day. Being aware of this is helpful as I just write it down anyway and get on with the day.

The summary of my experience so far is that it seems beneficial – I don’t think I’ve lost anything by not checking those mediums so often, and I’ve definitely read a lot more than I usually do and been more focused as well.

Now I need to go and try out some of the other exercises from the book – if you’ve read it / tried out any of the tips I’d love to hear what’s worked well for you.

Written by Mark Needham

June 12th, 2015 at 11:12 pm