SEO experiment with the Robots.txt file

1 August 2024

Some time ago, while reading Google’s guidelines, I noticed a small detail that may have gone unnoticed by many people.

It concerns how the Disallow and Allow rules should be written in the robots.txt file. Here is the fragment of the guidelines I am referring to; in short, it says that the path in each rule must start with “/”:

You can read that part of the documentation at this link.
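To illustrate (with hypothetical paths, not taken from any particular site), this is the form the guidelines describe, next to the slash-less variant this post examines:

    User-agent: *
    # As the guidelines describe: the path starts with "/"
    Disallow: /private-dir/

    User-agent: *
    # The variant under test: no leading slash
    Disallow: private-dir/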

It got me thinking: I had already seen many robots.txt files in which the rules did not begin with “/”, contrary to what Google says, and although I had wondered why, I never knew the reason.

Does Google respect those rules or not? Can Google’s explanation in its guidelines really be interpreted in any other way?
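Before looking at Googlebot itself, a standard parser gives a first hint. This is a minimal sketch using Python’s urllib.robotparser, which does plain prefix matching; it is not Googlebot, and the path is a hypothetical example, so it only shows how one spec-following parser treats each form:

    from urllib.robotparser import RobotFileParser

    # Two versions of the same rule: with and without the leading slash.
    # The path is a hypothetical example, not a rule from my actual file.
    for rule in ("Disallow: /private-dir/", "Disallow: private-dir/"):
        parser = RobotFileParser()
        parser.parse(["User-agent: *", rule])
        allowed = parser.can_fetch("Googlebot", "https://example.com/private-dir/page")
        print(rule, "->", "allowed" if allowed else "blocked")

With CPython’s parser, the rule with the slash blocks the URL and the slash-less rule leaves it allowed, which already suggests the slash is not a cosmetic detail.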

THE EXPERIMENT IN AND WITH MY WEBSITE

At the end of June, on 28 June at 22:17 to be exact, I modified my robots.txt file to remove the slash at the beginning of each rule. The file looked like this:

As you can see, the rules were left with no leading slash (/), contrary to what the guidelines describe.

What I wanted to check was:

– Whether those Disallow rules were going to be respected or not.

And I was going to check it like this:

– Checking Googlebot’s accesses in the server logs.
I ruled out using the Google Search Console crawl stats because they only give us a sample.

So if, after the date of the robots.txt change, any of the URL patterns included in the file received hits from Googlebot, that would mean the rule is not respected without the slash.
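For reference, this is a minimal sketch of that log check in Python, assuming a common Apache/Nginx combined log format; the file name, change date, and blocked patterns are placeholders to adapt to your own setup:

    import re
    from datetime import datetime

    # Placeholders, not my real data.
    CHANGE_DATE = datetime(2024, 6, 28)
    BLOCKED_PREFIXES = ("/private-dir/", "/category/")

    # Captures the date and the requested path from a combined-format log line.
    line_re = re.compile(r'\[(\d{2}/\w{3}/\d{4})[^\]]*\] "(?:GET|HEAD) (\S+)')

    with open("access.log") as log:
        for line in log:
            # User-agent match only; verifying it is really Googlebot
            # would need a reverse DNS check on the client IP.
            if "Googlebot" not in line:
                continue
            m = line_re.search(line)
            if not m:
                continue
            when = datetime.strptime(m.group(1), "%d/%b/%Y")
            if when > CHANGE_DATE and m.group(2).startswith(BLOCKED_PREFIXES):
                print(when.date(), m.group(2))

Any line this prints is a hit by (something claiming to be) Googlebot on a supposedly blocked pattern after the change.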

Now I just had to wait and see what would happen, although with a clear handicap: crawling of my website is very low, since I deleted a lot of content and I don’t usually publish many new pieces.

THE RESULTS

More than a month later, I am writing this post on 1 August, after reviewing the logs provided by my hosting provider.

Below are some of the blocked patterns and their impact in the logs, bearing in mind that certain pages have a tiny crawl budget, which may have affected the experiment or simply means I don’t have enough data for them yet.

Patterns that have received hits from Googlebot:

  – en/home-en/ has received several hits, some before the robots.txt change and some after.
  – CSS and JS resources, which are accessed more frequently, have continued to be accessed normally despite the slash-less Disallow rules.
  – Pages matching the “category” pattern, which in many cases are old pages, have received hits anyway; in this case it is the URL in the first row that was crawled despite the robots.txt.

CHECKING WITH THE GOOGLE SEARCH CONSOLE URL INSPECTOR

Despite the robots.txt rule being intended to block, because the line does not start with a slash the Google Search Console URL inspector treats it as non-blocking, i.e. as not complying with the required syntax.

Crawling is allowed if the Disallow line does NOT start with a slash.

The image is in Spanish, but it says “Crawling is allowed = Yes”.

SECOND EXPERIMENT TO VALIDATE

Despite the results, I’m starting a second test today to validate the first one. As my site’s crawl budget is so low, we’re going to run the test on a site with more activity to confirm whether the behaviour is the same.

The behaviour should be exactly the same. We will see in a few weeks.

IS IT NECESSARY TO DISCUSS THIS FURTHER IN ORDER TO AVOID TECHNICAL PROBLEMS?

Definitely YES.

In parallel to finding this detail in the guidelines, for years I have been seeing websites whose Disallow rules use patterns that do not begin with a slash. This has always raised doubts for me, as the guidelines are apparently clear.

After the experiment, I analysed about 1,000 websites from different sectors (in the Spanish market) and found that many of them use rules without the initial slash, so presumably those rules are not blocking what their owners intend, with the consequent waste of crawl budget.
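For anyone who wants to reproduce that check, this is a minimal sketch of the per-site measurement; the domains are placeholders, not the sites from my analysis:

    import urllib.request

    # Share of Disallow rules per site whose value does not start with "/".
    SITES = ["example.com", "example.org"]  # placeholders

    for domain in SITES:
        try:
            with urllib.request.urlopen(f"https://{domain}/robots.txt", timeout=10) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        values = [line.split(":", 1)[1].strip()
                  for line in body.splitlines()
                  if line.strip().lower().startswith("disallow:")]
        values = [v for v in values if v]  # an empty Disallow blocks nothing
        if not values:
            continue
        # Rules starting with a wildcard (*) would need separate treatment.
        no_slash = sum(1 for v in values if not v.startswith("/"))
        print(f"{domain}: {no_slash}/{len(values)} Disallow rules lack a leading slash")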

In order not to highlight specific cases, I simply add the sector tables.

  – Sectors that do so according to Google’s guidelines: I add the average per sector and the maximum percentage value of Disallows without a slash, to illustrate good usage.
  – Sectors that do not do so according to Google’s guidelines: I add the average per sector and the maximum percentage value of Disallows without a slash, to illustrate the current usage.

CONCLUSIONS AND NEXT STEPS

The conclusion is clear: we must know Google’s guidelines in detail and understand the technical implications of not applying its rules and directives in the format it requires.

In the coming weeks I will update the article with the results of the second test, which may validate the initial hypothesis.

If you have experiences in this regard, leave a comment and contribute to the conversation :)

I’m MJ Cachón

SEO consultant since 2008 and director of the SEO agency Laika. Focused on combining data analysis and strategic SEO with business intelligence, using R, Screaming Frog, SISTRIX, Sitebulb and other data sources. My philosophy: learn and share.
