Would you like to know every domain name the UK Government has registered? Of course you would! There could be all sorts of interesting tit-bits hidden in there (ProtectAndSurvive.gov.uk? EbolaOutbreak2017.nhs.uk? MinistryOfTruth.police.uk?).
Rather than relying on Freedom of Information requests, or Open Data, we can go straight to the source of domain names – the DNS!
Shut Up And Give Me The Codez!
Download all UK Government domain names
.gov.uk 15,436 records
.nhs.uk 4,877 records
.police.uk 466 records
.mod.uk 268 records
.parliament.uk 91 records
That’s… quite a lot! I wonder how many are new?
Not intended snarkily, but has web rationalisation/no new govt domains been formally abandoned as a policy now?
— Steph Gray (@lesteph) November 9, 2015
The Gov.UK file is a CSV which also shows when the domain was first registered (if available).
The Domain Name System (DNS) lists every single domain name (example.com). It tells your computer which IP Address is associated with a Domain Name. If your local DNS doesn’t know where example.gov.uk lives, it goes to the ISP’s DNS. If they don’t know, they ask an upstream provider’s DNS. And so on, until someone asks the .gov.uk nameserver for an authoritative response.
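To make that chain of questions concrete, here is a toy sketch in Python. It only mirrors the simplified description above, not the real DNS protocol, and every name in it (the `Resolver` class, the resolver labels, the 192.0.2.10 documentation address) is made up for illustration.

```python
from typing import Optional


class Resolver:
    """A toy resolver: answers from its own table, otherwise asks upstream."""

    def __init__(self, name: str, records: dict, upstream: Optional["Resolver"] = None):
        self.name = name
        self.records = dict(records)
        self.upstream = upstream

    def lookup(self, domain: str) -> tuple:
        """Return (ip, answered_by); raise KeyError if nobody knows the domain."""
        if domain in self.records:
            return self.records[domain], self.name
        if self.upstream is None:
            raise KeyError(domain)
        ip, answered_by = self.upstream.lookup(domain)
        self.records[domain] = ip  # remember the answer for next time
        return ip, answered_by


# Hypothetical chain: local -> ISP -> upstream provider -> .gov.uk nameserver.
authoritative = Resolver("gov.uk nameserver", {"example.gov.uk": "192.0.2.10"})
upstream = Resolver("upstream", {}, authoritative)
isp = Resolver("isp", {}, upstream)
local = Resolver("local", {}, isp)

first = local.lookup("example.gov.uk")   # has to go all the way up the chain
second = local.lookup("example.gov.uk")  # now answered from the local cache
print(first)
print(second)
```

The first lookup is answered by the authoritative server; the second never leaves the local resolver, which is roughly why repeat lookups are fast.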
So, can you download every domain name in existence? No, not easily. It usually involves filling out lots of forms and giving some compelling reason why you want it.
However, Rapid7’s Project Sonar provides a sort of “best guess” for all the domain names which it can see.
The entire download is 12GB. That’s the zipped version.
Once unzipped, it’s a whopping 67GB.
A quick look at the file shows it contains 1,408,097,159 records. Youch! That’s a lot of domain names!
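A file that size will not fit in memory, so any counting has to be done as a stream. A minimal Python sketch, assuming the download is gzip-compressed (the function name and the `.gz` filename in the note below are my own, not from the Sonar project):

```python
import gzip


def count_records(path: str) -> int:
    """Count lines in a gzip-compressed file, reading one line at a time."""
    count = 0
    with gzip.open(path, "rb") as handle:
        for _ in handle:
            count += 1
    return count
```

Calling `count_records("20150926_dnsrecords_all.gz")` would then give the record count without unzipping 67GB to disk first, though it will still take a while.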
This is what the file looks like:
$ head 20150926_dnsrecords_all
cshengmei.com.h310.6dns.net,a,22.214.171.124
reseauocoz.cluster007.ovh.net,cname,cluster007.ovh.net
cse-web-cl.comunique-se.com.br,a,126.96.36.199
ext-cust.squarespace.com,a,188.8.131.52
ext-cust.squarespace.com,a,184.108.40.206
ext-cust.squarespace.com,a,220.127.116.11
ext-cust.squarespace.com,a,18.104.22.168
ghs.googlehosted.com,cname,googlehosted.l.googleusercontent.com
isutility.web9.hubspot.com,cname,a1049.b.akamai.net
sendv54sxu8f12g.ihance.net,a,22.214.171.124
sites.smarsh.io,a,126.96.36.199
www.triblocal.com.s3-website-us-east-1.amazonaws.com,cname,s3-website-us-east-1.amazonaws.com
*.01ete21.cn.cname.yunjiasu-cdn.net,a,188.8.131.52
*.01ete21.cn.cname.yunjiasu-cdn.net,a,184.108.40.206
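Each line appears to be three fields: the name, the record type, and the value. Here is a small Python sketch of pulling them apart, on the assumption that only the first two commas are separators (the `DnsRecord` and `parse_record` names are mine):

```python
from typing import NamedTuple


class DnsRecord(NamedTuple):
    name: str
    rtype: str
    value: str


def parse_record(line: str) -> DnsRecord:
    """Split one line into name, record type, and value.

    Only the first two commas are treated as separators, so a value that
    contains spaces or further commas is kept whole. A line with fewer
    than three fields raises ValueError.
    """
    name, rtype, value = line.rstrip("\n").split(",", 2)
    return DnsRecord(name, rtype, value)


record = parse_record("ghs.googlehosted.com,cname,googlehosted.l.googleusercontent.com")
print(record.name, record.rtype, record.value)
```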
Ok, so let’s get all the *.gov.uk records out of there…
grep "gov.uk" 20150926_dnsrecords_all
0-19insalford.info,soa,ns0.ictservices.co.uk postmaster.salford.gov.uk 2010022204 28800 7200 604800 86400
019186.gov.ukpfl.cn,a,220.127.116.11
100days.local.gov.uk,a,18.104.22.168
101.gov.uk,a,22.214.171.124
101.gov.uk,a,126.96.36.199
101.gov.uk,mx,20 sms2.101.gov.uk
101.gov.uk,ns,ns1.p08.dynect.net
Ah! Ok, we’re picking up some websites which are pointing to a gov.uk site (potentially useful) and some false positives like “019186.gov.ukpfl.cn”. Let’s just look at records where the first column ends with “.gov.uk”:
grep ".gov.uk," 20150926_dnsrecords_all
100days.local.gov.uk,a,188.8.131.52
101.gov.uk,a,184.108.40.206
101.gov.uk,a,220.127.116.11
101.gov.uk,mx,20 sms2.101.gov.uk
101.gov.uk,ns,ns1.p08.dynect.net
101.gov.uk,ns,ns2.p08.dynect.net
101.gov.uk,ns,ns3.p08.dynect.net
101.gov.uk,soa,ns1.p08.dynect.net hostmaster.cscdns.net 2014121100 3600 600 604800 1800
1901redirect.nationalarchives.gov.uk,a,18.104.22.168
1sttouch.powys.gov.uk,a,22.214.171.124
1t6c3c0p2r0m934.forestry.gov.uk,a,126.96.36.199
2011.census.gov.uk,a,188.8.131.52
2014.colneyheathparishcouncil.gov.uk,a,184.108.40.206
2050-calculator-tool-wiki.decc.gov.uk,cname,wiki.2050.org.uk
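One caveat with the grep approach: an unescaped dot in a grep pattern matches any character, and the pattern is not tied to the first column. If you want a stricter check, something like this Python sketch tests the first field explicitly (the `is_under` helper is hypothetical; the sample lines are taken from the output above):

```python
def is_under(name: str, suffix: str = "gov.uk") -> bool:
    """True if name is the suffix itself or a subdomain of it."""
    name = name.lower().rstrip(".")
    return name == suffix or name.endswith("." + suffix)


lines = [
    "0-19insalford.info,soa,ns0.ictservices.co.uk postmaster.salford.gov.uk 2010022204 28800 7200 604800 86400",
    "019186.gov.ukpfl.cn,a,220.127.116.11",
    "100days.local.gov.uk,a,18.104.22.168",
    "101.gov.uk,mx,20 sms2.101.gov.uk",
]

# Keep a line only when its first column is under gov.uk.
matches = [line for line in lines if is_under(line.split(",", 1)[0])]
print(matches)
```

Only the last two lines survive: the SOA record that merely mentions a gov.uk address and the “gov.ukpfl.cn” lookalike are both dropped.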
OK, so how do we de-duplicate these? The first thing to do is manipulate the data. We only want the first column. There are a number of ways to do this in Linux; I prefer to use the Python tool CSVfilter.
sudo pip install csvfilter
To grab only the first (zeroth) column:
cat 20150926_dnsrecords_all | csvfilter -f 0 > out.csv
Now, this doesn’t quite work. Why? Because some DNS records contain incredibly strange data! You could clean up the data manually, but that’s a bit boring, and the file is far too big to load into Excel or any other normal editor.
Here’s what I did…
- Copy all the lines containing gov.uk into a new file
grep ".gov.uk," 20150926_dnsrecords_all > govuk.csv
- Create a new file with only the first column
cat govuk.csv | csvfilter -f 0 > govuk0.csv
- Sort the file and make sure each line is unique
sort govuk0.csv | uniq > govuk.txt
Hey presto! A more-or-less complete list of every .gov.uk website which is registered. The same can be done for .nhs.uk, .police.uk, .mod.uk etc.
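If you would rather not install anything, the three steps can be approximated in a few lines of Python. This is a sketch rather than a drop-in replacement for the commands above: `unique_names` is a name I made up, and it splits on the first comma instead of doing full CSV parsing, which sidesteps most of the strange data.

```python
from typing import Iterable, List


def unique_names(lines: Iterable[str], suffix: str) -> List[str]:
    """Return the sorted, de-duplicated first columns that end with suffix."""
    names = set()
    for line in lines:
        name = line.split(",", 1)[0].strip().lower()
        if name.endswith(suffix):
            names.add(name)
    return sorted(names)


sample = [
    "101.gov.uk,a,22.214.171.124",
    "101.gov.uk,mx,20 sms2.101.gov.uk",
    "2011.census.gov.uk,a,188.8.131.52",
    "019186.gov.ukpfl.cn,a,220.127.116.11",
]
print(unique_names(sample, ".gov.uk"))
```

Passing an open file instead of a list streams it line by line, and changing the suffix to “.nhs.uk” or “.police.uk” covers the other zones. Only the matching names are held in memory, so the size of the full file should not matter much.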
Getting The Dates
Time to crack out the Ruby!
Using the WHOIS library, I wrote a simple script to parse the text records and query when the domain name was created.
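require "whois" # the whois gem; calling created_on on the record assumes the 3.x series
c = Whois::Client.new
File.open("govuk.txt").each do |line|
  r = c.lookup(line.chomp)
  puts "#{line.chomp},#{r.created_on}" # assumed output format: domain,creation date
rescue Whois::Error => e
  warn "#{line.chomp}: WHOIS error: #{e.message}"
rescue StandardError => e
  warn "#{line.chomp}: #{e.message}"
end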
This isn’t perfect – there are only records for the third level of gov.uk – and no records at all for Parliament, MOD, Police, and NHS. It is also a bit slow to run through the thousands of records – but we can see a few interesting bits and bobs.
Created in 2015
I suspect some of these are merely renewals, rather than brand new domains.
seemis.gov.uk,2015-10-29 00:00:00 +0000
yjb.gov.uk,2015-10-28 00:00:00 +0000
crbonline.gov.uk,2015-10-23 00:00:00 +0100
coi.gov.uk,2015-10-14 00:00:00 +0100
gibraltar.gov.uk,2015-07-29 00:00:00 +0100
dorsetforyou.gov.uk,2015-03-19 00:00:00 +0000
ico.gov.uk,2015-03-19 00:00:00 +0000
bridgnorthtowncouncil.gov.uk,2015-01-29 00:00:00 +0000
wdc.gov.uk,2003-06-03 00:00:00 +0100
west-dunbarton.gov.uk,2003-06-03 00:00:00 +0100
clacks.gov.uk,2003-06-02 00:00:00 +0100
bassetlaw.gov.uk,2003-04-29 00:00:00 +0100
dti.gov.uk,2003-03-13 00:00:00 +0000
Sadly, clacks.gov.uk has very little to do with Terry Pratchett!
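To see how registrations are spread over time, the “domain,date” lines can be tallied by year. A rough Python sketch, assuming every timestamp has the same shape as the ones listed above; the `creation_years` helper and the dateless example.gov.uk row are made up for illustration:

```python
from collections import Counter
from datetime import datetime


def creation_years(rows):
    """Count domains per creation year from 'domain,timestamp' lines."""
    years = Counter()
    for row in rows:
        domain, _, stamp = row.strip().partition(",")
        if not stamp:
            continue  # no creation date was available for this domain
        created = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S %z")
        years[created.year] += 1
    return years


rows = [
    "seemis.gov.uk,2015-10-29 00:00:00 +0000",
    "crbonline.gov.uk,2015-10-23 00:00:00 +0100",
    "clacks.gov.uk,2003-06-02 00:00:00 +0100",
    "example.gov.uk,",  # hypothetical row with no date
]
print(creation_years(rows))
```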
That’s all folks!
Spotted anything unusual? Found a better way to do things? Stick a comment in the box!