How To Control Web Crawlers With Robots.txt, Meta Robot Tags & SEOPressor

By zhiyuan on April 14, 2016

As the predominance of Google shows, the Internet – more specifically, the World Wide Web – is dominated by the presence and use of web crawlers. Specifically, these are the programs that search, index and rate the various websites that exist. In short, they tell us where to go when we search for a particular word or phrase.

Still, most people have no idea how they accomplish this goal. Here is a comprehensive guide on web crawlers and how to control them through a robots.txt file, robots meta tags, and our plugin, SEOPressor Connect:

What Are Web Crawlers?

Web crawlers are known by a variety of names – industry jargon labels them spiders or bots but technically they are referred to as web crawlers.

No matter the name, they are used to scan the web, “reading” everything they find. Specifically, they index what words are used on a website and in what context. The index produced is basically an enormous list. Then, when a “search” is made, the search engine checks the pre-made index and delivers the most relevant results, which appear at the top of the list.

Why Are They Valuable?

From the earliest search engines such as Lycos, AltaVista, and Yahoo! all the way up to more recent ones such as Bing and Google, web crawlers have essentially defined their existence.

In short, web crawlers are their sole raison d’etre. Innumerable bots are used to survey every site on the web – a Herculean task, to say the least – but one that is nevertheless incredibly lucrative. Just ask Mr. Page or Mr. Brin. Still, there is nothing to stop you from using their technology to your advantage.

What Is The Downside?

In an ideal world, the owner of a web page could specify the exact keywords and concepts under which the page will be indexed. This ability, of course, has been seriously abused by unscrupulous SEO purveyors who try to game the system. Innumerable sites have populated their pages with phrases such as “Gangnam Style,” “Mila Kunis,” and “Hillary Clinton” simply to increase their web traffic.

For better or worse, this tactic is now a way of life for the average “black hat” website promoter. A word of warning: stay away from them.

How Do Web Crawlers Actually Work?

The first assault on a website is always by a web crawler. In its simplest form, it merely catalogs everything on a site. Legitimate companies being “scanned” are aware of this and would like to provide as much information as possible. The goal is to provide access to as many pages as possible and thereby establish the worth of the site. By methodically going from link to link, the bots will systematically categorize a site to the benefit of all. Still, it is the obligation of the website owner or their designated agent to make the best of this automated categorization.
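The link-to-link traversal described above can be sketched as a breadth-first walk. This is only an illustration: the SITE_LINKS map and its page paths are hypothetical, and a real crawler would fetch pages over HTTP and extract links from the HTML rather than read them from a dictionary.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
SITE_LINKS = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/blog/post-2"],
    "/blog/post-1": ["/blog"],
    "/blog/post-2": ["/about"],
    "/orphan": [],  # no page links here, so the crawler never finds it
}


def crawl(start, links):
    """Visit pages breadth-first, following each link only once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


visited = crawl("/", SITE_LINKS)
```

In this sketch the /orphan page is never visited because nothing links to it, which is one reason site owners try to make every important page reachable through links.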

What Does That Really Mean Technically?

The first step in the process is to deploy a web crawler to thoroughly search a site. An index of words is thus created. Important in this process is the use of meta tags. With these “tags”, the programmer of the site can delineate the most important keywords, phrases and concepts for the web crawler and thus get indexed in the most appropriate way. There are also situations where a page owner does not want the page indexed, and a robot exclusion protocol can be included to divert the bots completely away from the page.

Next, the web crawler program produces a “weighted” index. A simple index is merely a list of words and URLs, which is not particularly good at delivering useful search results.

Instead, the best web crawlers use various factors – such as the number of times a word is used throughout the document, whether the word appears in subheadings or if the word or phrase is actually in the title – to assign a weight to the word. Then, when a search is performed by a user, the most heavily weighted websites will appear at the top of the returned results.
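A weighted index of this kind might look roughly like the sketch below. The weights (5 for a title, 3 for a subheading, 1 for body text) and the sample pages are assumptions made up for this illustration; real search engines do not publish their weighting and use many more signals.

```python
import re
from collections import defaultdict

# Hypothetical weights: the factors come from the text, the numbers do not.
TITLE_WEIGHT = 5
SUBHEADING_WEIGHT = 3
BODY_WEIGHT = 1


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_weighted_index(pages):
    """Map each word to a {url: weight} dictionary."""
    index = defaultdict(lambda: defaultdict(int))
    for url, page in pages.items():
        for word in tokenize(page["title"]):
            index[word][url] += TITLE_WEIGHT
        for heading in page["subheadings"]:
            for word in tokenize(heading):
                index[word][url] += SUBHEADING_WEIGHT
        for word in tokenize(page["body"]):
            index[word][url] += BODY_WEIGHT
    return index


def search(index, word):
    """Return URLs containing the word, most heavily weighted first."""
    scores = index.get(word.lower(), {})
    return sorted(scores, key=lambda url: scores[url], reverse=True)


# Hypothetical pages used only for this example.
pages = {
    "/robots-guide": {
        "title": "Robots.txt guide",
        "subheadings": ["Creating a robots file"],
        "body": "A robots file tells crawlers which pages to skip.",
    },
    "/news": {
        "title": "Site news",
        "subheadings": [],
        "body": "We mention robots once here.",
    },
}
index = build_weighted_index(pages)
results = search(index, "robots")
```

Here the guide page outranks the news page for the word “robots” because the word appears in its title and a subheading as well as in its body.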

You might wonder how searches across the vast Internet are accomplished so efficiently. The answer is a technique known as hashing. The relevant search terms on a website are organized into “hash tables”, which take the various ranked phrases and assign them a number. This process significantly reduces the average time that it takes to do a search, even if the search terms are somewhat complicated.
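The idea of turning a term into a number can be illustrated with a tiny hash table. This is a simplified sketch, not how any particular search engine stores its index: the bucket count, the use of CRC32 as the hash function, and the sample terms are all assumptions chosen for the example.

```python
import zlib

NUM_BUCKETS = 8  # arbitrary size chosen for this illustration


def bucket_for(term):
    """Turn a search term into a bucket number with a deterministic hash."""
    return zlib.crc32(term.encode("utf-8")) % NUM_BUCKETS


def build_table(postings):
    """Place each (term, urls) pair into the bucket its hash selects."""
    table = [[] for _ in range(NUM_BUCKETS)]
    for term, urls in postings.items():
        table[bucket_for(term)].append((term, urls))
    return table


def lookup(table, term):
    """Scan only the one bucket the term hashes to, not the whole index."""
    for stored_term, urls in table[bucket_for(term)]:
        if stored_term == term:
            return urls
    return []


# Hypothetical terms and pages.
postings = {
    "robots": ["/robots-guide", "/news"],
    "crawler": ["/robots-guide"],
    "sitemap": ["/sitemap-help"],
}
table = build_table(postings)
```

Because a lookup only inspects one bucket, the average search time stays low even as the number of indexed terms grows.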

What Does The Future Of Web Crawling Hold?

These days – although it is changing somewhat – most search engines perform a literal search. That is, they look for the phrases that a user enters into the query as exactly as possible. In addition, Boolean operators can be used quite effectively – if the user knows what they are doing – to narrow the search.

Newer search engine versions – not yet released – are currently being developed that will use natural-language and concept-based queries. The result will be search engines that deliver better results with less effort even if the user does not really know what they are doing.

Where Do I Go From Here?

Developing and generating interest in your website requires time, effort, and no small degree of experience. Some business owners will find the lucky combination without the help of professional website developers. Most others, however, will fail. Do not make this mistake.

However, if you do not have a web developer on your team, you could always learn how to do it yourself. There are several ways to control web crawlers: through a robots.txt file, robots meta tags, or third-party solutions. In this article, we’re going to lay it all out and teach you how to do it through all three options.

How Do I Control Web Crawlers With Robots.txt?

To direct search engine crawlers with a robots.txt, first, you’ll need to create a robots.txt file.

Creating or Editing Your Robots.txt File

The robots.txt file should be at the root of your website. For example, if your domain was example.com it should be found:

On your website:

    http://example.com/robots.txt

On your server:

    /home/username/public_html/robots.txt

When a search engine crawls a website, it requests the robots.txt file first and then follows the rules within.

  • Do note that web crawlers do not necessarily follow robots.txt rules; they are just a guideline for how crawlers should behave.
  • If you want to set a Crawl-delay for Google, you must do it in Google Webmaster Tools.
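As a rough sketch of how a well-behaved crawler consults robots.txt before fetching anything, the following uses Python's standard urllib.robotparser module. The robots.txt body, the ExampleBot user agent, and the /private/ path are made up for illustration; a real check would load the live file instead of a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body. A live check would instead call
# parser.set_url("http://example.com/robots.txt") followed by parser.read().
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask the parser what a crawler named "ExampleBot" may fetch.
can_fetch_home = parser.can_fetch("ExampleBot", "http://example.com/")
can_fetch_private = parser.can_fetch(
    "ExampleBot", "http://example.com/private/page.html"
)
delay = parser.crawl_delay("ExampleBot")
```

Nothing in this check is enforced by the server. It only reports what the rules ask for, which is why crawlers that ignore robots.txt can still reach the pages.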

Controlling Web Crawler’s Access via Robots.txt File

You have two main directives for controlling how crawlers crawl your site with a robots.txt file:

    User-agent: Determines which user agent the rule applies to; * is a wildcard matching any user agent.
    Disallow: Determines which files or folders shouldn’t be crawled.

Examples of How Robots.txt Are Used:

To restrict all web crawlers from crawling:

You can disallow any search engine from crawling your website with these rules:

    User-agent: *
    Disallow: /

To allow full access to all crawlers:

Search engines are able to crawl your website by default, so adding this code is not necessary.

    User-agent: *
    Disallow:

To restrict all crawlers from crawling multiple directories or pages:

If you have multiple directories or pages, such as /cgi-bin/, /print-ready/, and /refresh.htm, that you don’t want crawled, you could use the following code:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /print-ready/
    Disallow: /refresh.htm

You need to start a new line of “Disallow” for each URL you want to exclude.
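One way to confirm that each Disallow line does what you expect is to run the rules through Python's standard urllib.robotparser module. This is a minimal sketch: the ExampleBot name and the test paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The rule set shown above, one Disallow line per excluded path.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /print-ready/
Disallow: /refresh.htm
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Hypothetical paths to test against the rules.
paths = [
    "/cgi-bin/form.cgi",
    "/print-ready/page.html",
    "/refresh.htm",
    "/blog/post.html",
]
allowed = {path: parser.can_fetch("ExampleBot", path) for path in paths}
```

Only /blog/post.html comes back as allowed. Note that this parser matches by path prefix, so a rule for /refresh.htm would also cover a path such as /refresh.html.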

To exclude a single crawler:

If you only want to exclude a single crawler, such as Googlebot, from your entire site while leaving all other bots unaffected, you could use:

    User-agent: Googlebot
    Disallow: /

These instructions apply to Googlebot only.
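A quick check with Python's standard urllib.robotparser module shows the effect of this rule set. It is only a sketch, and the second crawler name (Bingbot) and the /index.html path are chosen purely for illustration.

```python
from urllib.robotparser import RobotFileParser

# The single-crawler rule set shown above.
RULES = """\
User-agent: Googlebot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Googlebot is named in the file; any other crawler has no matching group.
googlebot_allowed = parser.can_fetch("Googlebot", "/index.html")
other_allowed = parser.can_fetch("Bingbot", "/index.html")
```

Crawlers that are not named in the file, and have no wildcard group to fall back on, are treated as allowed.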

To allow only one robot access:

If we only wanted to allow Googlebot access to our site, and disallow all other bots, we could use:

    User-agent: *
    Disallow: /
    User-agent: Googlebot
    Disallow:

As you see, the rules of specificity apply, not inheritance.
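The specificity rule can be checked with a small sketch using Python's standard urllib.robotparser module. The OtherBot name and the /private/ path are hypothetical and used only to exercise the rules.

```python
from urllib.robotparser import RobotFileParser

# The rule set shown above: block everyone, then open the site to Googlebot.
RULES = """\
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow:
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Googlebot matches its own group, so the wildcard group is not applied to it.
googlebot_allowed = parser.can_fetch("Googlebot", "/private/")
other_allowed = parser.can_fetch("OtherBot", "/private/")
```

Googlebot is judged only by the group that names it, while every other crawler falls back to the wildcard group and is blocked.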

How Do I Control Web Crawlers With Robots Meta Tag?

Another way of controlling web crawlers is through the Robots Meta Tag. You can use this method if you do not have access to the root directory and therefore can’t upload a robots.txt file. It is also great if you want to keep web crawlers from indexing only certain pages on your site.

The Robots Meta Tag is similar to other meta tags, and it is added in the <head> section of your code.
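To check that a page actually carries the tag inside its <head>, a small script can parse the HTML. The sketch below uses Python's standard html.parser module; the RobotsMetaFinder class and the sample page are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() == "robots":
                content = attributes.get("content") or ""
                # Split "noindex,follow" into separate lowercase directives.
                self.directives.extend(
                    part.strip().lower()
                    for part in content.split(",")
                    if part.strip()
                )

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


# Hypothetical page used only for this illustration.
PAGE = """<html><head>
<title>Example</title>
<meta name="robots" content="noindex,follow">
</head><body><p>Hello</p></body></html>"""

finder = RobotsMetaFinder()
finder.feed(PAGE)
```

After feeding the page, finder.directives holds the directives a crawler would read from the tag.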

Examples of how Robot Meta Tags are used:

To disallow all crawlers from indexing a page:

    <meta name="robots" content="noindex,follow">
    <meta name="robots" content="noindex">

Even though the crawlers will not index the page, they will still follow links found on it.

To restrict crawlers from following links within the page:

    <meta name="robots" content="index,nofollow">
    <meta name="robots" content="nofollow">

To restrict crawlers from indexing the page and from following the links on it:

    <meta name="robots" content="noindex,nofollow">
    <meta name="robots" content="none">

To direct search engines not to cache your page:

    <meta name="robots" content="noarchive">

Most search engines cache your pages for a certain period of time, which might bring visitors to an older version of your page. If your page is dynamic, you should use this tag so that search engines do not cache your page and will always bring visitors to the latest version of your page.

To index and follow the page:

    <meta name="robots" content="index,follow">
    <meta name="robots" content="index">
    <meta name="robots" content="follow">
    <meta name="robots" content="all">

These commands instruct web crawlers to index the page and follow the links on it. They are unnecessary because the crawlers will do this by default.
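The tag values covered in this section can be summarized in a few lines of code. This is a simplified sketch of how the directives combine, assuming only the values discussed above; the interpret_robots function is hypothetical, and real search engines recognize additional directives.

```python
def interpret_robots(content):
    """Return (index, follow, archive) flags for a robots meta content string."""
    directives = {
        part.strip().lower() for part in content.split(",") if part.strip()
    }
    # "none" is shorthand for "noindex,nofollow".
    index = not ({"noindex", "none"} & directives)
    follow = not ({"nofollow", "none"} & directives)
    archive = "noarchive" not in directives
    return index, follow, archive


# An empty string behaves like a page with no robots meta tag at all.
defaults = interpret_robots("")
blocked = interpret_robots("none")
```

Anything not explicitly switched off stays on, which is why “index”, “follow” and “all” only restate the default behavior.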

How Do I Control Web Crawlers Without Any Coding?

Learning all these robots.txt rules or robots meta tags could be a lot of hassle for a small task, especially for people who don’t know how to code or don’t have access to the website’s backend, or for developers who are managing thousands of pages at the same time.

However, what needs to be done has to be done; instructing crawlers what to do cannot be skipped. There aren’t many solutions available right now that can automate this process for you, which is why we’ve included the “Robot Rules” function in our plugin.

With SEOPressor Connect, controlling web crawlers is as simple as a click.

SEOPressor Connect allows you to control web crawlers with just a few ticks.

All you have to do is tick the function you want and click update; SEOPressor Connect will generate the code and inject it into your website. This function wasn’t available in SEOPressor v5 or older versions of the plugin.

Along with many other features, it was added to SEOPressor Connect so that you can have all the functions in one plugin. With SEOPressor Connect, you don’t have to install a ton of plugins, get your WordPress site cluttered, and worry about incompatibility issues. You can have all the on-page SEO solutions in just one plugin: SEOPressor Connect.



    Updated: 19 September 2017

    Zhi Yuan

    About Zhi Yuan

    Zhi Yuan is an inbound marketer who has expertise in both SEO and content marketing. He is on a never-ending quest to understand the mysteries of the search engine world, and sometimes, on his way to the ice-cream parlor.
