Alleged Data Leak Reveals New Intel on Google Ranking Factors

The SEO industry has been reeling today following the alleged leak of a number of internal Google docs, all of which pertain to its search ranking algorithm.

Much has already been written in the past 12 hours or so, and with that in mind we won't look to rehash the excellent investigative work of Moz and SparkToro founder Rand Fishkin here.

What was particularly interesting for us at ICS-digital, however, is that - as the more skeptical amongst us have suspected for some time - much of what Google has implemented at the core of search seems to contradict what it has said publicly in the not-too-distant past.

The documents, purportedly uploaded to GitHub by accident several months ago, come from Google's Content API Warehouse and provide a compelling look at the inner workings of the world's largest search engine.

What have we learned today?

Some of the main takeaways of interest include:

  • Newer sites do have a 'sandbox' effect, which means their rankings are placed in 'quarantine' while Google determines how trustworthy they are. This is why it's so difficult to see short-term SEO results for new domains. (Previously denied by Google.)
  • Similarly to the above, 'HostAge' is considered a ranking factor, which means more established, older domains may outperform newer sites when all other things are equal. (Previously denied by Google.)
  • Clickstream data is heavily used to influence rankings, i.e. what users search for both before and after visiting a particular webpage - to either promote or demote it. This click data and pattern detection is also used to combat spam. (Previously denied by Google.)
  • Google uses a lot of site-specific user / engagement data from its Chrome browser to determine a page's rankings, including the 'sitelinks' that appear underneath a main domain in search.
  • For specific events, Google uses a whitelist to promote sites that it considers authoritative / anti-'fake news', e.g. for information during the Covid pandemic, US elections and in travel more generally.
  • There is documentation suggesting Google can identify authors and treat them as entities in the system, thus measuring E-E-A-T and impacting rankings. (Previously denied by Google.)
  • Links are heavily weighted and Google uses a system similar to Ahrefs' URL Rating / Moz's Page Authority to determine their quality: links are judged as high, medium or low quality, based on criteria including the number of clicks to the referring (i.e. linking) page, as well as any authority-based metrics. (Previously denied by Google, at least in part.)
  • Domain names that are exact matches for unbranded keywords (e.g. houses-to-rent-in-leeds.co.uk) are heavily demoted in search, as part of Google's spam and low-effort-content detection.

What does this mean for the search industry?

ICS-digital's Marketing Director Martin Calvert was quick to explain earlier how the leak - if indeed it is legitimate - may impact the work we do for clients in highly regulated and competitive sectors such as iGaming, finance and law.

As you would expect, Google has been tight-lipped about the veracity of the documentation and we don't expect them to comment any time soon - if ever.

However, Fishkin and his team have confirmed that the style and naming conventions are very much in line with existing internal Google documents, and there is currently nothing to indicate that the leak isn't authentic.

For us at ICS-digital, it's largely business as usual today, particularly given our preference for basing our SEO strategies on what Google does rather than what it says.

Get in touch if you'd like to find out more about our practical approach to search marketing.