Huge Google Search Document Leak Reveals Inner Workings of Ranking Algorithm

Danny Goodwin reports via Search Engine Land: A trove of leaked Google documents has given us an unprecedented look inside Google Search and revealed some of the most important elements Google uses to rank content. Thousands of documents, which appear to come from Google’s internal Content API Warehouse, were released March 13 on Github by an automated bot called yoshi-code-bot. These documents were shared with Rand Fishkin, SparkToro co-founder, earlier this month.
What’s inside. Here’s what we know about the internal documents, thanks to Fishkin and [Michael King, iPullRank CEO]:
Current: The documentation indicates this information is accurate as of March.
Ranking features: 2,596 modules are represented in the API documentation with 14,014 attributes.
Weighting: The documents did not specify how any of the ranking features are weighted — just that they exist.
Twiddlers: These are re-ranking functions that "can adjust the information retrieval score of a document or change the ranking of a document," according to King.
Demotions: Content can be demoted for a variety of reasons, such as: a link doesn’t match the target site; SERP signals indicate user dissatisfaction; Product reviews; Location; Exact match domains; and/or Porn. Change history: Google apparently keeps a copy of every version of every page it has ever indexed. Meaning, Google can "remember" every change ever made to a page. However, Google only uses the last 20 changes of a URL when analyzing links.
Other interesting findings. According to Google’s internal documents: Freshness matters — Google looks at dates in the byline (bylineDate), URL (syntacticDate) and on-page content (semanticDate).
To determine whether a document is or isn’t a core topic of the website, Google vectorizes pages and sites, then compares the page embeddings (siteRadius) to the site embeddings (siteFocusScore).
Google stores domain registration information (RegistrationInfo).
Page titles still matter. Google has a feature called titlematchScore that is believed to measure how well a page title matches a query.
Google measures the average weighted font size of terms in documents (avgTermWeight) and anchor text.
What does it all mean? According to King: "[Y]ou need to drive more successful clicks using a broader set of queries and earn more link diversity if you want to continue to rank. Conceptually, it makes sense because a very strong piece of content will do that. A focus on driving more qualified traffic to a better user experience will send signals to Google that your page deserves to rank." […] Fishkin added: "If there was one universal piece of advice I had for marketers seeking to broadly improve their organic search rankings and traffic, it would be: ‘Build a notable, popular, well-recognized brand in your space, outside of Google search.’"