NI Open Data – Mining Prescription Data – #opendata #spark #clojure
Moving On From The NI Assembly
There was plenty of scope from the NI Assembly blog posts I did last time (you can read part 1 and part 2 for the background). While I received a lot of messages with “why don’t you do this” and “can you find xxxxxx subject out” it’s not something I wish to do. Kicking hornets nests isn’t really part of my job description.
Saying that when there’s open data for the taking then it’s worth looking at. Recently the Detail Data project opened up a number of datasets to be used. Buried within is the prescriptions that GP’s or Nurse within the practice has prescribed.
Acquiring the Data
The details of the prescription data are here: http://data.nicva.org/dataset/prescribing-statistics (though the data would suggest it’s not really statistics, just raw CSV data), the files are large but nothing I’m worrying about in the “BIG DATA” scheme of things, this is small in relative terms. I’ve downloaded October 2014 to March 2015, that’s a good six months worth of data.
Creating a Small Sample Test File
When developing these kind of jobs before jumping into any code it’s worth having a look at the data itself. See how many lines of data there are, this time as it’s a CSV file I know it’ll be one object per line.
Jason-Bells-MacBook-Pro:niprescriptions Jason$ wc -l 2014-10.csv 459188 2014-10.csv
Just under half a million lines for one month, that’s okay but too much for testing. I want to knock it down to 200 for testing. The UNIX head command will sort us out nicely.
head -n20 2014-10.csv > sample.csv
So for the time being I’ll be using my sample.csv file for development.
Loading Data In To Spark
First thing I need to do is define the header row of the CSV as a set of map keys. when Spark loads the data in then I’ll use zipmap to pair the values to the keys for each row of the data.
(def fields [:practice :year :month :vtm_nm :vmp_nm :amp_nm :presentation :strength :total-items :total-quantity :gross-cost :actual-cost :bnfcode :bnfchapter :bnfsection :bnfparagraph :bnfsub-paragraph :noname1 :noname2])
You might have noticed the final two keys, noname1 and noname2. The reason for this is simple, there are commas on the header row but no names so I’ve forced them to have a name to keep the importing simple.
PRACTICE,Year,Month,VTM_NM,VMP_NM,AMP_NM,Presentation,Strength,Total Items,Total Quantity,Gross Cost (<A3>),Actual Cost (<A3>),BNF Code,BNF Chapter,BNF Section,BNF Paragraph,BNF Sub-Paragraph,, 1,2015,3,-,-,-,-,-,19,0,755.00,737.28,-,99,0,0,0,,
With that I can now create the function that loads in the data.
(defn load-prescription-data [sc filepath] (->> (spark/text-file sc filepath) (spark/map #(->> (csv/read-csv %) first)) (spark/filter #(not= (first %) "PRACTICE")) (spark/map #(zipmap fields %)) (spark/map-to-pair (fn [rec] (let [practicekey (:practice rec)] (spark/tuple practicekey rec)))) (spark/group-by-key)))
Whereas the NI Assembly data was in JSON format so I had the keys already defined, this time I need to use the zipmap function to mix the values at the head keys together. This gives us a handy map to reference instead of relying on the element number of the CSV line. As you can see I’m grouping all the prescriptions by their GP key.
Counting The Prescription Frequencies
This function is very similar to the frequency function I used in the NI Assembly project, by mapping each prescription record and retaining the item prescribed I can then use the frequencies function to get counts for each distinct type.
(defn def-practice-prescription-freq [prescriptiondata] (->> prescriptiondata (spark/map-to-pair (s-de/key-value-fn (fn [k v] (let [freqmap (map (fn [rec] (:vmp_nm rec)) v)] (spark/tuple k (frequencies freqmap))))))))
Getting The Top 10 Prescribed Items For Each GP
Suppose I want to find out what are the top ten prescribed items for each GP location. As the function I’ve got has the frequencies with a little tweaking we can return what I need. First I’m using sort-by to sort on the function, this will give me a sort smallest to largest, using the reverse function then flips it on it’s head and gives me largest to smallest. With me only wanting ten items I then use the take function to return the first ten items in the sequence.
(defn def-practice-prescription-freq [prescriptiondata] (->> prescriptiondata (spark/map-to-pair (s-de/key-value-fn (fn [k v] (let [freqmap (map (fn [rec] (:vmp_nm rec)) v)] (spark/tuple k (take 10 (reverse (sort-by val (frequencies freqmap)))))))))))
Creating The Output File Process
So with two simple functions we have the workings of a complete Spark job. I’m going to create a function to do all the hard work for us and save us repeating lines in the REPL. This function will take in the Spark context, the file path of the raw data files (or file if I want) and an output directory path where the results will be written.
(defn process-data [sc filepath outputpath] (let [prescription-rdd (load-prescription-data sc filepath)] (->> prescription-rdd (def-practice-prescription-freq) (spark/coalesce 1) (spark/save-as-text-file outputpath))))
What’s going on here then? First of all we load the raw data in to a Spark Pair RDD and then using the thread last function we calculate the item frequencies and then reduce all the RDD’s down to a single RDD with the coalesce function. Finally we output everything to our output path. First of all I’ll test it from the REPL with the sample data I created earlier.
nipresciptions.core> (process-data sc "/Users/Jason/work/data/niprescriptions/sample.csv" "/Users/Jason/Desktop/testoutput/") nil
Looking at the file part-00000 in the output directory you can see the output.
(1,(["Blood glucose biosensor testing strips" 4] ["Ostomy skin protectives" 4] ["Macrogol compound oral powder sachets NPF sugar free" 3] ["Clotrimazole 1% cream" 2] ["Generic Dermol 200 shower emollient" 2] ["Chlorhexidine gluconate 0.2% mouthwash" 2] ["Clarithromycin 500mg modified-release tablets" 2] ["Betamethasone valerate 0.1% cream" 2] ["Alendronic acid 70mg tablets" 2] ["Two piece ostomy systems" 2]))
So we know it’s working okay…. now for the big test, let’s do it against all the data.
Running Against All The Data
First things first, don’t forget to remove sample.csv file if it’s in your data directory or it will get processed with the other raw files.
Back to the REPL and this time my input path will just be the data directory and not a single file, this time all files will be processed (Oct 14 -> Mar 15).
nipresciptions.core> (process-data sc "/Users/Jason/work/data/niprescriptions/" "/Users/Jason/Desktop/output/")
This will take a lot longer as there’s much more data to process. When it does finish have a look at the part-00000 file again.
(610,(["Gluten free bread" 91] ["Blood glucose biosensor testing strips" 62] ["Isopropyl myristate 15% / Liquid paraffin 15% gel" 29] ["Lymphoedema garments" 27] ["Macrogol compound oral powder sachets NPF sugar free" 25] ["Ostomy skin protectives" 24] ["Gluten free mix" 21] ["Ethinylestradiol 30microgram / Levonorgestrel 150microgram tablets" 20] ["Gluten free pasta" 20] ["Carbomer '980' 0.2% eye drops" 19])) (625,(["Blood glucose biosensor testing strips" 62] ["Gluten free bread" 38] ["Gluten free pasta" 27] ["Ispaghula husk 3.5g effervescent granules sachets gluten free sugar free" 24] ["Macrogol compound oral powder sachets NPF sugar free" 20] ["Isopropyl myristate 15% / Liquid paraffin 15% gel" 20] ["Isosorbide mononitrate 25mg modified-release capsules" 18] ["Alginate raft-forming oral suspension sugar free" 18] ["Isosorbide mononitrate 50mg modified-release capsules" 18] ["Oxycodone 40mg modified-release tablets" 16])) (661,(["Blood glucose biosensor testing strips" 55] ["Gluten free bread" 55] ["Macrogol compound oral powder sachets NPF sugar free" 24] ["Salbutamol 100micrograms/dose inhaler CFC free" 20] ["Colecalciferol 400unit / Calcium carbonate 1.5g chewable tablets" 19] ["Venlafaxine 75mg modified-release capsules" 18] ["Isosorbide mononitrate 25mg modified-release capsules" 18] ["Isosorbide mononitrate 60mg modified-release tablets" 18] ["Alginate raft-forming oral suspension sugar free" 18] ["Venlafaxine 150mg modified-release capsules" 18])) (17,(["Blood glucose biosensor testing strips" 55] ["Gluten free bread" 35] ["Macrogol compound oral powder sachets NPF sugar free" 29] ["Colecalciferol 400unit / Calcium carbonate 1.5g chewable tablets" 24] ["Gluten free biscuits" 22] ["Ispaghula husk 3.5g effervescent granules sachets gluten free sugar free" 21] ["Diclofenac 1.16% gel" 19] ["Sterile leg bags" 19] ["Glyceryl trinitrate 400micrograms/dose pump sublingual spray" 19] ["Ostomy skin protectives" 18]))
There we are, GP 610 prescribed 91 loaves of Gluten Free Bread over the six month period. The blood glucose testing strips are also high on the agenda but that would come as no surprise for any one who is diabetic.
So Which GP’s Are Prescribing What?
The first number in the raw data is the GP id. In the DetailData notes for the prescription data I read:
“Practices are identified by a Practice Number. These can be cross-referenced with the GP Practices lists.“
As with the NI Assembly data I can load in the GP listing data and join the two by their key. Sadly on this occasion though I can’t, the data just isn’t there on the page. I’m not sure if it’s broken or removed on purpose. Shame but I’m not going to create a scene.
DetailData Prescription Data: http://data.nicva.org/dataset/prescribing-statistics
Github Repo for this project: https://github.com/jasebell/niprescriptiondata
*** Note: I spelt “prescriptions” wrong in the Clojure project but as this is a throwaway kinda a thing I won’t be altering it…. ***