Data leak from the Google Content Warehouse API: can SEO still evolve?

On May 13, 2024, leaked Google documents, published on GitHub by yoshi-code-bot, gave us unprecedented insight into Google Search and revealed elements that matter for ranking content. Here is a summary of what we know about these documents.

Overview of the leak

The dates in the documents indicate that this documentation is current. The API documentation contains 2,596 modules and 14,014 attributes. It reveals only the existence of ranking criteria, not their relative importance. However, it does describe reranking functions that can "adjust the score" or "change the ranking" of a document, according to iPullRank's Michael King.

Content can therefore be demoted for various reasons, such as product reviews, bad links or other signals from the SERPs (search engine results pages) that indicate user dissatisfaction.
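To make the idea of a reranking function more concrete, here is a minimal sketch in Python. It is not Google's implementation: the Result class, the demote_bad_clicks function, the signal names and the 0.5 penalty factor are all hypothetical, chosen only to illustrate how a function applied after an initial scoring step could "adjust the score" and so "change the ranking".

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    score: float  # hypothetical base relevance score


def demote_bad_clicks(result, signals):
    """Hypothetical reranker: lower the score as the share of bad clicks grows."""
    clicks = signals.get(result.url, {})
    good = clicks.get("good_clicks", 0)
    bad = clicks.get("bad_clicks", 0)
    total = good + bad
    if total == 0:
        return result.score  # no click data: leave the score untouched
    # The 0.5 factor is an arbitrary illustration, not a leaked weight.
    return result.score * (1.0 - 0.5 * bad / total)


def rerank(results, signals, rerankers):
    """Apply each reranking function in turn, then sort by adjusted score."""
    adjusted = []
    for r in results:
        score = r.score
        for fn in rerankers:
            score = fn(Result(r.url, score), signals)
        adjusted.append(Result(r.url, score))
    return sorted(adjusted, key=lambda r: r.score, reverse=True)
```

In this sketch, a page that starts with the best score can still end up below a competitor if most of its clicks are unsatisfied ones, which is the kind of demotion the documentation suggests.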

 

Page retention and history

According to this documentation, Google keeps a copy of all versions of all pages it has ever indexed. This means that it "remembers" all changes made to a page, although it only uses the last 20 changes to a URL when analyzing links.
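The difference between what is stored and what is used can be sketched as follows. This is only an illustration under stated assumptions: the PageHistory class and its methods are hypothetical, and only the figure of 20 comes from the documentation as reported above.

```python
MAX_VERSIONS_USED = 20  # figure reported from the leak; everything else is illustrative


class PageHistory:
    """Hypothetical store: keeps every version, exposes only the most recent ones."""

    def __init__(self):
        self.all_versions = []  # every snapshot ever recorded for the URL

    def record(self, snapshot):
        self.all_versions.append(snapshot)

    def versions_for_link_analysis(self):
        # Only the latest MAX_VERSIONS_USED snapshots are considered.
        return self.all_versions[-MAX_VERSIONS_USED:]
```

Under this reading, older versions are never deleted, but they stop influencing link analysis once 20 newer changes have been recorded.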

 

Ranking criteria

Diversity and relevance of links

The documents show that:

  • the diversity and relevance of links remain essential;
  • PageRank is still very present in Google's ranking functions;
  • the PageRank of the home page is taken into account for each document.

Google uses various click metrics, including BadClicks, GoodClicks, LastLongestClicks and UnsquashedClicks. In short, successful clicks remain very important for Google!

Length and originality of content

When it comes to content, longer documents may be truncated, while shorter content receives a score (from 0 to 512) based on its originality. On the latter point, it is important to understand the notion of E-E-A-T (Experience, Expertise, Authoritativeness, and Trustworthiness).

Importance of brand and notoriety

Brand matters, and brand awareness should also be built outside of search engine results pages. Google uses named entities, as well as a criterion called "siteAuthority", revealed by the Panda update, which affected many sites with so-called "poor" content.

 

Chrome data usage

Without wanting to make Chrome users paranoid, the ChromeInTotal attribute indicates that Google uses data from its Chrome browser for ranking.

 

Interesting Features Revealed

Here are some of the interesting features present in the documentation:

  • BylineDate, SyntacticDate, SemanticDate: the freshness of content matters.
  • SmallPersonalSite, RegistrationInfo: it seems that small expert sites can be promoted.
  • SiteRadius, SiteFocusScore: indicate whether a document is central to the website's subject.
  • TitlematchScore, AvgTermWeight: the central information of the page counts.
  • AnchorMismatchDemotion, CompressedQualitySignals, TopicEmbeddingsVersionedData: the relevance of links.
  • LocalCountryCodes: local links (from the same country) are probably more valuable.
  • FullRightContext: the content surrounding a link provides context for the anchor text.

 

SEO implications

To answer the question of whether SEO can still evolve after this leak: yes, we think it will. Because SEO relies essentially on a test-and-learn approach, practitioners will be able to confirm or refute hypotheses and progress more quickly than before.

 

Conclusion

Dixon Jones, CEO of Inlinks, made Google's roughly 14,000 search variables searchable, so that we can see what Google stores and what each variable is used for. It is important to remember that these attributes are not weighted in the documentation. It is also impossible to know which ones are used in production and which ones might exist only for experimental purposes.

To rank well, you must remember that relevance and user experience remain the main objectives.

 

 

Rossitza Mavreau, Lead Traffic Manager SEO SEA Analytics at UX-Republic

 

 

sources: