Scraping Tools – A quick round-up
I’ve written here before about using ScraperWiki to scrape content from websites that don’t publish their information as open data.
I have even used ScraperWiki to scrape our own website, to get badly-formed content out in a structured way for a hack.
Now, sadly, ScraperWiki is no longer free, and old scrapers are mostly frozen. So if you are about to embark on a hackathon, you might be looking for an alternative.
I’ve noticed four recently, but have yet to test any of them.
- Morph: a web-based tool from OpenAustralia. Write scrapers in Ruby, Python or PHP.
- Tabula: extracts tabular data from PDFs. Runs on Windows or Mac. Needs Java. See also: http://schoolofdata.org/2014/03/27/tackling-pdfs-with-tabula/
- Portia: released yesterday(!). Available via GitHub, and soon to be made available as a hosted service.
- Import.IO: a free web-hosted service that promises further (maybe paid-for) features. It looks to have a great Help section, with support, webinars etc.
Have you used any of these? Leave a comment with your experiences. Or let me know of other alternatives!