NIAssembly Open Data – #opendata #ni #spark #clojure

This post was created automatically via an RSS feed and was originally published at https://dataissexy.wordpress.com/2015/07/22/niassembly-open-data-opendata-ni-spark-clojure/

Data From Stormont

The Northern Ireland Assembly has opened up a fair chunk of data as a web service, returning results in either XML or JSON format. And from first plays with it, it’s rather well put together.

What I’ve also learned is that the team listens: a small suggestion was implemented almost as soon as they had returned to work on the Monday morning. It only goes to show that the team wants this to succeed.

The web service is split up into various areas:

  • Members – current and historical MLAs
  • Questions – written and verbal questions
  • Organisations
  • Plenary
  • Hansard – contributions by members during plenary debate

The data is available under an Open Northern Ireland Assembly Licence, so you can use it as long as you provide a link back to it.

The Project

I’m going to set up a Clojure/Spark project and start processing some of this data. I want to do the following:

  1. Load the current members data.
  2. Save the questions for each member.
  3. Load the saved questions for each member.
  4. Join the data together by the member id.
  5. Find the frequency of departments specific member questions are directed at.

Setting Up Spark

Before I can do any of that I need to set up the required Spark context so I can handle the data how I want.

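;; Requires inferred from the aliases used in this post.
(ns mlas.core
  (:require [clojure.data.json :as json] [clojure.string :as str]
            [sparkling.conf :as conf] [sparkling.core :as spark]
            [sparkling.destructuring :as s-de]))
(comment
  (def c (-> (conf/spark-conf)
             (conf/master "local[3]")
             (conf/app-name "niassemblydata-sparkjob")))
  (def sc (spark/spark-context c)))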

The reason I put this in a comment block is that it’s there for copy/pasting in the REPL when I need it.

Loading the Members

The Members API has details of the members past and present. Right now I’m only concerned about the current ones. So the end point I want to use is:

http://data.niassembly.gov.uk/members_json.ashx?m=GetAllCurrentMembers

This returns the current members in JSON format with name, display name, organisation and so on. So that I’m not hammering the API with a request each time, I’ve downloaded the JSON contents and saved them to a file.

$ curl "http://data.niassembly.gov.uk/members_json.ashx?m=GetAllCurrentMembers" > members.json

Next I want to load this file into Spark and create a pair RDD with the member id as the key and the JSON data for that member, as a map, as the value.

(defn load-members [sc filepath]
  (->> (spark/whole-text-files sc filepath)
       ;; each record is [filename, file-contents]; parse the contents
       ;; and emit one record per member
       (spark/flat-map
        (s-de/key-value-fn
         (fn [_ value]
           (-> (json/read-str value :key-fn (fn [k] (-> k str/lower-case keyword)))
               (get-in [:allmemberslist :member])))))
       ;; key each member map by its person id
       (spark/map-to-pair (fn [rec] (spark/tuple (:personid rec) rec)))))

As there’s no guarantee that the JSON file is neatly one object per line, and I know it isn’t in this case, I’ll use Spark’s wholeTextFiles method, which loads each file as a single record. This returns a pair RDD of [filename, contents-of-file]. I then map over each record, using Clojure’s JSON library to read in each member (nested within the AllMembersList->Member JSON array), with each key name converted to lower case and turned into a Clojure keyword.
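
To make the key handling concrete, here is a minimal sketch in plain Clojure, with no Spark or JSON library involved. The sample data, its key casing and the helper names normalise-key and normalise-keys are assumptions for illustration; only the nesting (AllMembersList, then Member) comes from the description above.

```clojure
(require '[clojure.string :as str])

;; The same key function that load-members passes as :key-fn.
(defn normalise-key [k]
  (-> k str/lower-case keyword))

;; Hypothetical parsed response with string keys, i.e. what a JSON
;; parser would produce if no :key-fn were supplied.
(def sample-response
  {"AllMembersList" {"Member" [{"PersonId" "307" "MemberLastName" "Agnew"}]}})

;; Apply the key function to every map key at every nesting level.
(defn normalise-keys [x]
  (cond
    (map? x)        (into {} (map (fn [[k v]] [(normalise-key k) (normalise-keys v)]) x))
    (sequential? x) (mapv normalise-keys x)
    :else           x))

;; Pull out the vector of member maps, as load-members does with get-in.
(def sample-members
  (get-in (normalise-keys sample-response) [:allmemberslist :member]))
```

With the keys normalised this way, (:personid rec) works on each member map regardless of how the service capitalises its field names.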

In the Clojure REPL I can test this easily:

mlas.core> (def members (load-members sc "/Users/Jason/work/data/niassembly/members.json"))
#'mlas.core/members
mlas.core> (spark/first members)
#sparkling/tuple ["307" {:partyorganisationid "111", :memberfulldisplayname "Mr S Agnew", :membertitle "MLA - North Down", :memberimgurl "http://aims.niassembly.gov.uk/images/mla/307_s.jpg", :membersortname "AgnewSteven", :memberlastname "Agnew", :personid "307", :constituencyname "North Down", :memberprefix "Mr", :constituencyid "11", :partyname "Green Party", :membername "Agnew, Steven", :affiliationid "2482", :memberfirstname "Steven"}]

So I have a pair RDD of members, with the member id as the key and a map of key/values for the actual record as the value.

Saving the Questions For Each Member

With the members safely dealt with I can now turn my attention to the questions. There is a web service within the API that will return a JSON set of questions for a given member; all I have to do is pass in the member ID as a value on the end point.

So, for example, if I want to get the questions for member ID 90 I would call the service with (copy/paste the url below in to browser to see the actual output):

http://data.niassembly.gov.uk/questions_json.ashx?m=GetQuestionsByMember&personId=90

As I want to load the questions for each member, I’m going to iterate over my pair RDD of members and use the key (as it’s the member id) to pull the data via URL with Clojure’s slurp function, and then save the JSON response to disk with Clojure’s spit function.

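;; questions-path (output directory prefix) and api-questions-by-member
;; (URL prefix ending in personId=) are not shown in this post.
(defn save-question-data [pair-rdd]
  (->> pair-rdd
       (spark/map
        (s-de/key-value-fn
         (fn [key _]
           ;; download this member's questions and write them to <key>.json
           (spit (str questions-path key ".json")
                 (slurp (str api-questions-by-member key))))))))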

I can run this from the REPL easily but it will take a bit of time.

mlas.core> (spark/collect (save-question-data members))

At this point all I’ve done is call the web service and save the questions to disk. I now need to load them into Spark and create another pair RDD of questions, using :tablerpersonid as the key for each tuple.

Loading The Question Data Into Spark

In the same way we loaded the members data as a filename/filecontent pair we are going to do the same with the question data. This time there’s a whole directory of files.

At present, as I’m in development mode, I’m assuming that every question set has questions in it. There are, though, some MLAs who don’t ask questions, for one reason or another, and their files are tiny:

$ ls -l -S
-rw-r--r-- 1 Jason staff 22 21 Jul 20:18 131.json
-rw-r--r-- 1 Jason staff 22 21 Jul 20:18 5223.json
-rw-r--r-- 1 Jason staff 22 21 Jul 20:17 5225.json
-rw-r--r-- 1 Jason staff 22 21 Jul 20:19 71.json
-rw-r--r-- 1 Jason staff 22 21 Jul 20:21 95.json

So for the minute I’m going to delete those and only concentrate on members who ask questions.
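
An alternative to deleting the files would be to make the parsing step tolerate an empty question set. The sketch below is an assumption, not what the project does: questions-or-empty is a hypothetical helper, and it assumes an empty response parses to a map whose :questionslist entry is nil or missing.

```clojure
;; Hypothetical helper: return the questions from a parsed response,
;; or an empty vector when the response contains none.
(defn questions-or-empty [parsed]
  (or (get-in parsed [:questionslist :question]) []))

;; A member with no questions (assumed shape) and one with a question.
(def no-questions   (questions-or-empty {:questionslist nil}))
(def some-questions (questions-or-empty
                     {:questionslist {:question [{:documentid "2828"}]}}))
```

Returning an empty collection rather than nil means a flat-map step simply emits nothing for that member.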

I’ve written a function that loads every questions file in the directory and keys each question by member ID.

(defn load-questions [sc questions-path] 
 (->> (spark/whole-text-files sc questions-path)
      (spark/flat-map (s-de/key-value-fn (fn [key value] 
         (-> (json/read-str value
                  :key-fn (fn [key] 
                           (-> key 
                               str/lower-case 
                               keyword)))
             (get-in [:questionslist :question])))))
       (spark/map-to-pair (fn [rec] (spark/tuple (:tablerpersonid rec) rec)))
       (spark/group-by-key)))

Notice the spark/group-by-key, which gives us each key with a collection of maps, [k, [v1, v2, …, vn]]. If it were left out we’d have a pair RDD with one row per question. With that loaded into Spark we can have a look at what we have.

mlas.core> (def questions (load-questions sc "/Users/Jason/work/data/niassembly/questions"))
#'mlas.core/questions
mlas.core> (spark/count questions)
103
mlas.core> (spark/first questions)
#sparkling/tuple ["104" [{:tablerpersonid "104", :departmentid "76", :questiondetails "http://data.niassembly.gov.uk/questions.asmx/GetQuestionDetails?documentId=2828", :documentid "2828", :reference "AQW 141/07", :departmentname "Department of Agriculture and Rural Development", :tableddate "2007-05-22T00:00:00+01:00", :questiontext "To ask the Minister of Agricultur...............}]]
mlas.core>
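
To see what group-by-key does to the shape of the data, here is a plain Clojure analogue that runs without Spark. The sample rows and the helper name group-values-by-key are made up for illustration; on a real RDD the grouping is done by Spark across partitions.

```clojure
;; Hypothetical [key value] rows, one per question, as they look
;; before grouping.
(def question-rows
  [["104" {:documentid "1" :departmentname "Department of Education"}]
   ["104" {:documentid "2" :departmentname "Department of Justice"}]
   ["8"   {:documentid "3" :departmentname "Department of Education"}]])

;; Plain Clojure analogue of group-by-key: collect every value under
;; its key, preserving the order in which values were seen.
(defn group-values-by-key [pairs]
  (reduce (fn [acc [k v]] (update acc k (fnil conj []) v))
          {}
          pairs))

(def grouped (group-values-by-key question-rows))
```

Three rows become two entries, one per member, which is why the grouped RDD count matches the number of members with questions rather than the number of questions.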

So far we have two pair RDDs, one for members and one for questions, and both have the member id as the key. This is a good place to be as it means we can easily join the data.

Joining RDD Datasets

Using Spark’s join functionality we get a pair RDD with the key and then the two values, [key, [left-value, right-value]]. A key only appears in the result if it exists on both the left and the right; otherwise it is dropped. If we were to use left-outer-join instead, every left-hand record would be preserved even when there is no matching key on the right.

mlas.core> (def member-questions-rdd (spark/join members questions))
#'mlas.core/member-questions-rdd
mlas.core> (spark/count member-questions-rdd)
103
mlas.core>
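
The difference between the two kinds of join can be sketched with plain Clojure maps standing in for the pair RDDs. Everything here is illustrative: the data is made up, inner-join and left-outer-join are hypothetical helpers rather than the Sparkling functions, and a missing right-hand side is shown as nil, whereas Spark wraps it in an optional value.

```clojure
;; Hypothetical stand-ins for the two pair RDDs, keyed by member id.
(def members-by-id
  {"8"   {:membername "Member A"}
   "104" {:membername "Member B"}
   "131" {:membername "Member C"}})   ; has no questions

(def questions-by-id
  {"8"   [{:documentid "3"}]
   "104" [{:documentid "1"} {:documentid "2"}]})

;; Inner join: keep only the keys present on both sides.
(defn inner-join [left right]
  (into {} (for [[k l] left :when (contains? right k)]
             [k [l (get right k)]])))

;; Left outer join: keep every left key; the right value is nil
;; when there is no match.
(defn left-outer-join [left right]
  (into {} (for [[k l] left]
             [k [l (get right k)]])))

(def joined       (inner-join members-by-id questions-by-id))
(def outer-joined (left-outer-join members-by-id questions-by-id))
```

Member "131" disappears from the inner join but survives the left outer join with nothing on the right.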

A good rule of thumb is to check whether the row count of the joined data is greater than the count of the original left-hand RDD. If it is, check for duplicate member ids in the member RDD.
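
A duplicate check along those lines might look like the following. This is only a sketch on an ordinary sequence of [key value] vectors; duplicate-keys is a hypothetical helper, and records collected from a real pair RDD would first need converting into that shape.

```clojure
;; Hypothetical helper: return the set of keys that occur more than
;; once in a sequence of [key value] pairs.
(defn duplicate-keys [pairs]
  (->> pairs
       (map first)                       ; just the keys
       frequencies                       ; key -> number of occurrences
       (filter (fn [[_ n]] (> n 1)))     ; keep keys seen more than once
       (map first)
       set))

(def dupes (duplicate-keys [["307" {}] ["8" {}] ["307" {}]]))
(def clean (duplicate-keys [["307" {}] ["8" {}]]))
```

An empty set means every member id is unique, so the join cannot multiply rows on the left-hand side.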

Calculating Department Frequencies

First of all let’s discuss what we’re trying to achieve. For every member we want to see the frequency of the departments their questions are directed at. Now that we have the joined RDD with the members and questions, we can use Spark to find out for us.

As Spark RDDs are immutable, each transformation returns a new RDD, so you will end up chaining several Spark map steps to get to an answer. So far we’ve done a map to load the members, a map to load the questions and a join to give us a [key, [member, questions]] pair RDD.

What I really want to do is refine this a bit further. One of the nice things about using Spark from Clojure is Sparkling’s destructuring namespace, which gives you a handy set of functions for pulling apart the tuples you iterate over. For our key/value/value pair RDD we can use s-de/key-val-val-fn, which wraps a function so that it receives the key, the first value (member) and the second value (questions) as separate arguments.

(defn department-frequencies-rdd [members-questions-rdd]
  (->> members-questions-rdd
       (spark/map-to-pair
        (s-de/key-val-val-fn
         (fn [key member questions]
           ;; collect the department name of every question, then count them
           (let [freqmap (map (fn [question] (:departmentname question)) questions)]
             (spark/tuple key (frequencies freqmap))))))))

When I run this I get the following output:

mlas.core> (def freqs (department-frequencies-rdd members-questions-rdd))
#'mlas.core/freqs
mlas.core> (spark/first freqs)
#sparkling/tuple ["8" {"Department of Culture, Arts and Leisure" 32, "Department of the Environment" 96, "Department for Social Development" 76, "Department of Agriculture and Rural Development" 53, "Department for Employment and Learning" 40, "Department for Regional Development" 128, "Northern Ireland Assembly Commission" 18, "Department of Education" 131, "Department of Health, Social Services and Public Safety" 212, "Department of Justice " 38, "Department of Finance and Personnel" 105, "Office of the First Minister and deputy First Minister" 151, "Department of Enterprise, Trade and Investment" 66}]
mlas.core>
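
The counting itself is ordinary Clojure and can be tried without Spark. In this minimal sketch the sample questions are made up, and the ranking step is an addition for illustration rather than part of the project.

```clojure
;; Hypothetical questions for a single member.
(def sample-questions
  [{:departmentname "Department of Education"}
   {:departmentname "Department of Justice"}
   {:departmentname "Department of Education"}])

;; Same approach as department-frequencies-rdd: take each question's
;; department name, then count how often each name occurs.
(def dept-freqs
  (frequencies (map :departmentname sample-questions)))

;; Optional extra: order the departments from most to least asked.
(def ranked (sort-by val > dept-freqs))
```

Sorting by count makes the most questioned department the first entry, which is easier to read than the unordered map in the REPL output.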

We could extend this a little further with some more information on the member but we’ve essentially achieved what we proposed at the start.

Concluding

This is merely scratching the surface of the NI Assembly data sets that are available. With some simple Clojure and Spark usage we’ve managed to pull the member data and questions, do a simple join and then find the frequency of departments.

Most data science is about consolidating large sets of data down to simple numbers that can be presented. Just by looking at the data I wouldn’t have known that a member has asked 212 questions of the Department of Health, Social Services and Public Safety.

Now I can.

You can download the source code for this project from the GitHub repository.

 
