Robot Directives in Real World Websites to Benefit SEO
Noindex, disallow and nofollow are directives that give us a degree of control over how bots access our website.
In the following article I’ll explain each directive and provide instances of how these may be applied to a website.
Just before we take a closer look at the robot directives, let’s just refresh ourselves on how Google finds our content before placing it within their index.
First, accessibility. This is a good term to know and means ensuring the website is accessible and usable for all, with a focus on easy navigation, improved UX and the right information being in the right place for the visitor.
Second, crawling. This is the process of looking for new or updated pages; Google uses links and sitemaps to discover URLs.
Third, indexing. This is the term used to describe Google fetching a web page, reading and analysing it, and then adding it to the index.
Once the above three steps have been executed, Google applies its algorithm to order the web pages by importance against specific keywords. This is the all-important ranking process, and achieving a good ranking usually indicates that good SEO has been performed.
The Noindex Meta Tag
This tag simply tells a search engine bot that it should not include the page in its search engine results pages. It works perfectly for a new web page, and an already indexed page will be de-indexed once the directive is picked up on the next crawl.
It is placed within the head section of a web page and in its simplest form will appear like this –
<meta name="robots" content="noindex">
Issues can arise where the search engines are prevented from seeing the page, resulting in it still appearing in search results. Makes sense, huh! Well, this occurs if the page is blocked within the robots.txt file. If it is blocked there, the noindex request in the head section of the page will not be seen, and therefore the page might still appear in the search engine results pages (SERPs).
There is another method to prevent a web page from being indexed, and that is to send the noindex directive in an X-Robots-Tag HTTP header. This is for another post which I’ll link to from here once complete.
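This conflict can be checked for before it causes trouble. Here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt content, the example.com URLs and the function name noindex_is_visible are all hypothetical, and Python's parser only approximates how a real search engine bot reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /old-offer/
"""


def noindex_is_visible(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if a bot obeying robots_txt may crawl url.

    A noindex meta tag can only be read on a page the bot is allowed to crawl.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for page in ("https://www.example.com/old-offer/", "https://www.example.com/new-offer/"):
    if noindex_is_visible(ROBOTS_TXT, page):
        print(f"{page}: crawlable, so a noindex tag on it can be read")
    else:
        print(f"{page}: blocked in robots.txt, so a noindex tag on it will never be seen")
```

If a page you want deindexed shows up as blocked, the disallow rule needs to come out of the robots.txt file first so the noindex tag can be picked up.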
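As a brief illustration of the header approach, here is a minimal Python sketch that checks a set of response headers for a noindex instruction. The headers and the function name header_requests_noindex are hypothetical, and the parsing is deliberately simplified: it ignores the optional user-agent prefix the header value can carry.

```python
def header_requests_noindex(headers: dict[str, str]) -> bool:
    """Return True if an X-Robots-Tag header asks for the response not to be indexed."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() != "x-robots-tag":
            continue
        directives = {part.strip().lower() for part in value.split(",")}
        # "none" is shorthand for "noindex, nofollow".
        if directives & {"noindex", "none"}:
            return True
    return False


# Hypothetical response headers for a PDF that should stay out of the index.
pdf_headers = {"Content-Type": "application/pdf", "X-Robots-Tag": "noindex, nofollow"}
print(header_requests_noindex(pdf_headers))
```

The header is handy for files such as PDFs, which have no head section in which to place a meta tag.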
Handling Links on a Page to Noindex
Often, a web page you wish to noindex will contain internal links to other web pages on the website. An example is a category page which contains direct links to sub-categories and/or product pages.
Such a page may hold PageRank which you’d like to continue to distribute across the pages it links out to. To ensure that these links are still followed, so that bots can reach the linked pages from it, simply place this directive in the head of the page:
<meta name="robots" content="noindex, follow">
The second value, ‘follow’, indicates to bots that all the links on the page should be followed. This means the pages linked to from the page you wish to noindex will not lose the benefit of those internal links.
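To confirm what a page is actually telling bots, the robots meta tag can be read back out of the HTML. The following is a rough Python sketch using the standard library's html.parser. The class name, function name and sample page are hypothetical, and it looks only at the generic "robots" meta name, not bot-specific ones.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma separated, e.g. "noindex, follow".
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def robots_directives(html: str) -> set[str]:
    """Return the set of robots meta directives present in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical category page used for illustration.
PAGE = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'
directives = robots_directives(PAGE)
print("indexable:", "noindex" not in directives)
print("links followed:", "nofollow" not in directives)
```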
Using the Robots.txt File to Noindex a Page
Using the robots.txt file to noindex gave more scope, covering not just one URL but entire folders of URLs. It was never an effective way to deindex a page that was already indexed, but rather a way to prevent a new page, not yet in the search engine’s database, from being indexed. That was the case prior to 2019; Google made this an unsupported rule back in July 2019. So read on to learn of the workaround.
Robots File Disallow Directive
Disallow directives placed within the robots.txt file tell search engine bots not to crawl whichever web page or folder you’ve chosen to disallow.
The syntax within the Robots txt file will look like this for disallowing a page –
Disallow: /your-page-url/
This directive cuts the page off from bots rather than removing it from our website. Because the page cannot be crawled, it will not pass PageRank to any other area of the website, and all links on a disallowed page are rendered useless. But if a page we have disallowed via the robots.txt file has external links or canonical tags pointing to it, it may still be indexed and so rank in the SERPs.
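A disallow rule matches by path prefix, so one line can cover a single page or everything inside a folder. Here is a small Python sketch using the standard library's urllib.robotparser to try rules out before publishing them. The rules, the /checkout/ folder and the example.com domain are hypothetical, and Python's parser is only an approximation of how a given search engine interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one single page and one whole folder are disallowed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /your-page-url/
Disallow: /checkout/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(path: str) -> bool:
    """Return True if a bot obeying ROBOTS_TXT may crawl the given path."""
    return parser.can_fetch("*", "https://www.example.com" + path)


for path in ("/your-page-url/", "/checkout/basket/", "/category/shoes/"):
    print(path, "crawlable" if is_crawlable(path) else "disallowed")
```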
Absolutely Removing a Page
Placing the URL of the page in the robots.txt file as a disallow rule, and also placing a noindex tag in the head section of that page, doesn’t guarantee the page will be removed from the search engine’s index. Because the page is blocked at robots.txt level, bots won’t be able to crawl it, so they will not discover the noindex tag and will be unaware of the directive to leave the page out of their index, let alone remove it.
A disallow rule, then, is only a means to prevent a web page from being crawled. Until Google updated this in 2019, it was also possible to noindex a URL from the robots.txt file, as explained below.
Back then, to fully ensure a web page would not be indexed, you would combine a noindex rule and a disallow rule for the page, both within the robots.txt file. This approach kept the page out of the search engine’s index, as the page was prevented from being crawled, and it also ensured no PageRank was passed anywhere from the page.
The syntax within the robots.txt file for this now unsupported approach looked like this:
Disallow: /web-page-1/
Disallow: /web-page-2/
Noindex: /web-page-1/
Noindex: /web-page-2/
Since Google made the noindex rule for the robots.txt file an unsupported rule in 2019, other methods are now used to prevent a web page from being indexed, or to have it deindexed.
To completely remove the option for a web page to be indexed, a noindex robots meta tag is the most reliable method. Placed in the head section of the web page, it will ensure the page is not indexed, and is removed from the index if already indexed. The page will still be crawled, and indeed must remain crawlable so the tag can be read, but it will not be placed within the index of the search engine.
The NoFollow Tag
Nofollow is implemented as an attribute within the link element. It instructs the bot that the page the link points to is not to be considered a factor of importance, and that the bot should not follow the link to discover more URLs.
We can also add nofollow to the head section of a web page, where it applies to every link on the page, or within an individual link element:
<meta name="robots" content="nofollow" />
<a href="example.html" rel="nofollow">example page</a>
A common use of the nofollow tag is on eCom sites where many pages can be generated by a faceted navigation. Here nofollow tags are often used to keep PageRank focused on the more important pages, rather than distributing it out to pages of far lesser importance. This approach was often referred to as funnelling.
This does not prevent the page the nofollow link points to from being crawled entirely, only from being crawled via the specific link on which the nofollow tag is placed.
Google updated this in September 2019 and stated that the nofollow tag is now merely a hint, as opposed to a directive.
They also added two more link attributes. Along with nofollow there’s now a rel="sponsored" attribute, which identifies links that are for advertisement purposes, and a rel="ugc" attribute for user generated content, used for links on busy UGC sites such as forums and big blog comment areas.
If you wish to ensure a page is fully blocked from being indexed, you can use a nofollow tag along with a noindex tag. The nofollow tag alone is no guarantee that the web page will not be indexed: presence in a sitemap, or other pages linking to it, are factors which may result in the page being indexed and so present in the SERPs. Adding a noindex tag to a page whose inbound links carry the nofollow tag ensures that, if the page is crawled, it will not be indexed. This approach is often used for login or admin pages of a website or application, and for websites that may generate lots of URLs from internal searches.
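To audit which links on a page carry these attributes, the rel values can be pulled out of the markup. The following Python sketch uses the standard library's html.parser. The class name, function name and sample links are hypothetical and for illustration only.

```python
from html.parser import HTMLParser

# The three rel values discussed above.
HINT_VALUES = {"nofollow", "sponsored", "ugc"}


class LinkRelParser(HTMLParser):
    """Collect (href, hint values) pairs for every link that carries a hint."""

    def __init__(self) -> None:
        super().__init__()
        self.hinted_links: list[tuple[str, set[str]]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # rel may hold several values, e.g. rel="ugc nofollow".
        rel_values = set((attributes.get("rel") or "").lower().replace(",", " ").split())
        hints = rel_values & HINT_VALUES
        if href and hints:
            self.hinted_links.append((href, hints))


def find_hinted_links(html: str) -> list[tuple[str, set[str]]]:
    """Return every link in html that has nofollow, sponsored or ugc in its rel."""
    parser = LinkRelParser()
    parser.feed(html)
    return parser.hinted_links


# Hypothetical markup for illustration.
SAMPLE = """
<a href="/about/">About us</a>
<a href="https://partner.example/deal" rel="sponsored">Partner deal</a>
<a href="https://forum-user.example/" rel="ugc nofollow">A commenter's site</a>
"""
for href, hints in find_hinted_links(SAMPLE):
    print(href, sorted(hints))
```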
Real World Application
I’ve worked on several eCom platforms, mostly Magento, which have utilised faceted navigation. Handling these pages appropriately creates a more sensible journey for bots. Ensuring bots are not wasting their time crawling thousands of URL variations benefits indexation and ensures the more prominent pages remain the central focus.
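One way to see how many URL variations a faceted navigation produces is to separate them from the core pages in a list of crawled URLs. Here is a minimal Python sketch, assuming facets appear as query string parameters. The parameter names (color, size, price), the URLs and the function name is_faceted_url are hypothetical; a real store would substitute its own filter names.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical facet parameters; a real store would use its own filter names.
FACET_PARAMS = {"color", "size", "price"}


def is_faceted_url(url: str) -> bool:
    """Return True if the URL's query string contains any facet parameter."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return any(param.lower() in FACET_PARAMS for param in query)


# Hypothetical crawl export for illustration.
CRAWLED_URLS = [
    "https://www.example.com/shoes/",
    "https://www.example.com/shoes/?color=red",
    "https://www.example.com/shoes/?color=red&size=9",
    "https://www.example.com/shoes/?page=2",
]

facet_variations = [url for url in CRAWLED_URLS if is_faceted_url(url)]
print(f"{len(facet_variations)} of {len(CRAWLED_URLS)} crawled URLs are facet variations")
```

The URLs flagged this way are the candidates for the nofollow and noindex handling described earlier.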