Most internet marketers who have been in the business for more than a year will be familiar with Technorati, a curated directory of blogs that eventually grew to hundreds of thousands of entries. For several years it was a mainstay resource for those involved in online PR and content-based digital marketing, but on 29 May 2014 Technorati permanently closed the directory to pursue other ventures. Whilst this was quite a loss for anyone who used Technorati to source high-quality publications for outreach, it is possible to partially recover the directory’s entries using a popular web-based tool.
Introducing the Wayback Machine
Popular with search marketers for a variety of reasons, particularly for broken link building and domain research, the Wayback Machine can also be used to gain access to previous versions of Technorati’s index which are not otherwise publicly available. Whilst not all of the directory’s pages will have been indexed by the Wayback Machine’s crawlers, most of its categories should have a reasonable number of pages which can still be accessed.
First, we’re going to visit the old homepage of the directory:
This is the last saved version of the directory before it was closed. From here, we can navigate to the subcategories and select the specific ‘niche’ of blogs that interests us. Technorati orders its sites by authority, and the sites displayed on the first pages will typically be the more powerful ones.
From here, it’s a simple matter of exploring which domains may be of interest. Bear in mind that any copied/opened URLs will be in the archive.org format, as above, so you’ll need to put them into Excel and use ‘text to columns’ to separate the website addresses from the archive ones. A browser tab-copying add-on may come in useful here.
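If you’d rather not wrangle ‘text to columns’, a short script can do the same job. A minimal sketch, assuming the copied links follow the standard Wayback Machine pattern of `/web/<timestamp>/<original URL>`:

```python
# Strip the Wayback Machine prefix from archived URLs to recover
# the original website addresses.
import re

def original_url(wayback_url):
    # Wayback URLs look like:
    # https://web.archive.org/web/20140529000000/http://example.com/page
    match = re.search(r"/web/\d+(?:[a-z_]+)?/(https?://.+)", wayback_url)
    # If the URL isn't in archive format, return it unchanged.
    return match.group(1) if match else wayback_url

archived = "https://web.archive.org/web/20140529000000/http://technorati.com/blogs/directory/travel/"
print(original_url(archived))
# -> http://technorati.com/blogs/directory/travel/
```

The same function can be mapped over a whole column of copied URLs before pasting the results back into your prospect sheet.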
Scaling the process - Using Scrapebox to assemble a list of outreach prospects
Extracting domains en masse from the Technorati archive can take a little while to set up, but it is well worth it for the end result. Although it would be possible to build a custom scraper for the job, it’s relatively easy to do with the cheap, effective ‘Swiss army knife’ of SEO, Scrapebox. Historically, Scrapebox has not been well regarded by a large section of the search marketing community due to its comment spam connotations, but it would be a misconception to assume it is purely a ‘black-hat’ tool. In the right hands, it can be used for a great number of tasks far more appropriate to the modern-day marketer. For a one-off fee of $97 (less with one of the discount codes available) you gain access to a tool that:
· Allows you to conduct keyword research across Google, Amazon, YouTube, and others
· Has advanced list management features to assist with managing large lists of domains
· Enables batch checking of links for 404 errors, expired domains, and more
· Is able to source Creative Commons images in bulk for content-based projects
· Can be very effectively used for broken link building
· Is able to bulk check social metrics, Moz Page and domain authority, as well as Google cache dates and index found/missing statuses
· Can function as a rank tracker with the paid rank tracker add-on; whilst not as full-featured as dedicated rank tracking software, it is nevertheless a cost-effective investment
· Makes competitive analysis easier at scale with the link extraction feature, particularly in conjunction with the Moz API
· In a similar vein, simplifies local SEO by discovering competitor NAP (name, address, phone number) local citations
In this instance, we’re going to use it to extract all the blogs from the ‘travel’ archive of Technorati. Whilst this sort of venture would normally require coding knowledge, Scrapebox makes it considerably simpler.
Firstly, we need to locate a page with a page number in its URL, this one for instance:
Then we need to remove the trailing slash and put it into Excel.
By dragging down, we can easily generate all 813 URLs that we need. Now we’ll put them into Scrapebox and insert them into the URL section.
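The Excel drag-down step can also be scripted. A sketch assuming the archived category pages follow a simple numeric pattern (the base URL and ‘page-N’ suffix below are illustrative placeholders; substitute the exact format you see in the Wayback Machine’s address bar):

```python
# Generate a numbered list of archived directory pages to feed into Scrapebox.
# The base URL and "page-{}" pattern are illustrative placeholders --
# use the actual format shown in the archive.
base = "https://web.archive.org/web/20140529000000/http://technorati.com/blogs/directory/travel/page-{}"

urls = [base.format(n) for n in range(1, 814)]  # pages 1 to 813

print(len(urls))   # 813
print(urls[0])     # first page URL
print(urls[-1])    # last page URL
```

Save the list to a text file and import it straight into Scrapebox’s URL section.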
For the next part, we’ll need the 64-bit Link Extractor add-on.
We need to select ‘External links’. The number of connections (under ‘settings’) can be anything up to 1,000, but realistically anything between 10 and 30 will be more than sufficient if private proxies are used. If no private proxies are available, keep the number of connections lower.
Once it’s finished, we need to click on ‘show save folder’ and open the notepad file. Running ‘text to columns’ on the forward slash in Excel will give us the root domains. Once we’ve done that, column F is what we’re interested in.
We can then take the 154,155 URLs, put them back into Scrapebox, and use the Remove/Filter > Remove Duplicate Domains option (a huge number of those URLs will be technorati.com).
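The deduplication step can be reproduced in a few lines of code as well: reduce each extracted link to its root domain, drop duplicates, and discard Technorati’s own pages. A minimal sketch, assuming the archive.org prefixes have already been stripped from the links:

```python
# Deduplicate a list of URLs down to unique root domains,
# filtering out technorati.com itself.
from urllib.parse import urlparse

def unique_domains(urls, exclude=("technorati.com",)):
    seen = set()
    result = []
    for url in urls:
        domain = urlparse(url).netloc.lower()
        if domain.startswith("www."):
            domain = domain[4:]  # treat www and bare domains as the same site
        if domain and domain not in seen and domain not in exclude:
            seen.add(domain)
            result.append(domain)
    return result

links = [
    "http://www.example-travel-blog.com/post/1",
    "http://example-travel-blog.com/about",
    "http://technorati.com/blogs/directory/travel/",
    "http://another-blog.net/",
]
print(unique_domains(links))
# -> ['example-travel-blog.com', 'another-blog.net']
```

The domain names shown are made-up examples; the same function applied to the full 154,155-URL export would leave only the unique blog domains.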
And there we have it! Not a complete list of all the travel blogs that were in the index, but a fair number.
Once you’ve got your list of domains, you’ll most probably want to perform metric checks on them. Importing them into Buzzstream and filtering them would work, as would the ‘bulk backlinks’ function in MajesticSEO. You may also wish to check cache dates and whether the domains are live and have not expired, as some of the sites found may not be regularly updated.
There are a number of reasonable-quality, up-to-date blog directories out there that marketing agencies can utilise to source prospects for outreach, but the old-time king of them all can still be worth investigating with the right approach.
Charlotte is chief whip when it comes to making sure words are in order at ICS-digital. You can get in touch with her directly at email@example.com