How to Force Googlebot to Crawl Only New Content
You can force Googlebot to crawl only new content by adding certain parameters to the sitemap URL you submit to Google Search Console. Want to know how? Read the explanation below carefully.
A sitemap is a file in the root folder of your domain that lists all the URLs on your blog or website. Googlebot uses your blog's sitemap as a reference to visit, crawl, and index any URL it deems eligible for indexing. Unfortunately, many bloggers have recently experienced indexing problems whose cause is still unknown.
Why Doesn't Googlebot Crawl New URLs?
Some blogging experts speculate that non-indexed URLs are simply URLs that do not meet Google's quality standards. But this indexing issue affects not only low-value content but high-value content as well. Many bloggers complain that Googlebot doesn't even crawl their new URLs, and Google can only decide whether a URL should be indexed after it has been crawled.
Bloggers complain that Googlebot keeps crawling old, already-indexed content while new content is ignored. This keeps happening even after the blogger has submitted and updated the sitemap file and requested indexing. After a little research, I came to the conclusion that Googlebot uses the sitemap you submit as its main reference for crawling and indexing URLs on your blog or website.
By default, your sitemap file is at https://yourblog.blogspot.com/sitemap.xml, and this is usually the sitemap URL you submit in Search Console. However, this sitemap contains all the URLs on your blog, and Googlebot uses it as a reference for its visits. Googlebot starts from the first content on your blog, so already-indexed URLs are often visited again. Isn't this unnecessary, especially if you haven't updated the old content?
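To see this for yourself, you can download the default sitemap and list every URL it references. Below is a minimal sketch in Python, assuming the placeholder address yourblog.blogspot.com; note that on Blogger, sitemap.xml may itself be a sitemap index whose entries point to child sitemap pages rather than directly to posts.

```python
# Minimal sketch: download a blog's default sitemap and list every <loc> entry.
# "yourblog.blogspot.com" is a placeholder; replace it with your own blog address.
import requests
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://yourblog.blogspot.com/sitemap.xml"

response = requests.get(SITEMAP_URL, timeout=10)
response.raise_for_status()

# Sitemap files use this namespace for their <url>/<sitemap> and <loc> elements.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(response.content)
locations = [loc.text for loc in root.findall(".//sm:loc", ns)]

print(f"{len(locations)} <loc> entries found")
for url in locations:
    print(url)
```

Running something like this makes the problem visible: the default sitemap references everything, old and new alike, with no way to tell Googlebot which URLs actually need a visit.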
How to Force Googlebot to Crawl New URLs?
To force Googlebot to crawl only the most recent content on your blog, you first need to know Googlebot's visiting schedule. You can usually see the schedule and frequency of Googlebot's visits via Google Search Console.
Go to the Coverage section and check the Valid box. After that, scroll down the page and click Submitted and indexed. You will see a list of URLs that Google has indexed. What you need to pay attention to is the last crawl date at the top of the list; it tells you the schedule and frequency of crawling on your blog. Don't be surprised if the most recently crawled URL is one that is already indexed. Meanwhile, your new content sits in the Excluded tab, not yet crawled by Googlebot, even though you may have posted it a month ago.
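If you prefer to check crawl dates programmatically rather than clicking through the Coverage report, Google's URL Inspection API can return the coverage state and last crawl time for a single URL. The sketch below is only an illustration: it assumes you have the google-api-python-client and google-auth packages installed, plus a service-account key file (service-account.json is a hypothetical name) whose service account has been added as a user on your Search Console property.

```python
# Hedged sketch: look up the coverage state and last crawl time of one URL
# via the Search Console URL Inspection API. The key file name, blog address,
# and post URL below are placeholders.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]

credentials = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES
)
service = build("searchconsole", "v1", credentials=credentials)

body = {
    "inspectionUrl": "https://yourblog.blogspot.com/2023/01/your-new-post.html",
    "siteUrl": "https://yourblog.blogspot.com/",
}
result = service.urlInspection().index().inspect(body=body).execute()

status = result["inspectionResult"]["indexStatusResult"]
print("Coverage state:", status.get("coverageState"))
print("Last crawl time:", status.get("lastCrawlTime"))
```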
Googlebot is not to blame, of course. For your most recent posts to be crawled, you must point to them in your sitemap file. That way, Googlebot can be directed to crawl the URLs it hasn't crawled yet and ignore the URLs that are already indexed.
All you need to do is take your blog's Atom feed URL and add the start-index and max-results parameters. This part is a bit tricky for beginner bloggers, so pay close attention to the steps below for setting the values of these two parameters.
- Count the number of recent posts that haven't been crawled
You can see the number of recent uncrawled posts on the Posts tab of your Blogger dashboard. Then check in Google Search Console whether the URLs of your most recent posts have been crawled. Counting your most recent uncrawled posts gives you the number to use as the value of the max-results parameter. Bear in mind that your latest post is numbered 1.
Example: if the number of recent, uncrawled posts is 20, the max-results value is 20 (max-results=20), while the value of the start-index parameter is 1. Thus, your sitemap URL is https://yourblog.blogspot.com/atom.xml?redirect=false&start-index=1&max-results=20
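If you want to double-check your numbers before submitting anything, the short sketch below builds the parameterized feed URL and prints the post URLs it returns, so you can confirm that start-index and max-results cover exactly your uncrawled posts. The blog address and the two values are placeholders for your own.

```python
# Minimal sketch, assuming a Blogger-style Atom feed: build the parameterized
# feed URL and print the post URLs it contains.
import requests
import xml.etree.ElementTree as ET

BLOG = "https://yourblog.blogspot.com"  # placeholder blog address
START_INDEX = 1                         # 1 = your most recent post
MAX_RESULTS = 20                        # number of recent, uncrawled posts

feed_url = (
    f"{BLOG}/atom.xml?redirect=false"
    f"&start-index={START_INDEX}&max-results={MAX_RESULTS}"
)
print("Sitemap URL to submit:", feed_url)

response = requests.get(feed_url, timeout=10)
response.raise_for_status()

ns = {"atom": "http://www.w3.org/2005/Atom"}
entries = ET.fromstring(response.content).findall("atom:entry", ns)

print(f"{len(entries)} posts in this feed:")
for entry in entries:
    # Each entry's link with rel="alternate" points at the public post URL.
    for link in entry.findall("atom:link", ns):
        if link.get("rel") == "alternate":
            print(link.get("href"))
```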
- Update sitemap URL in Google Search Console
You need to update the sitemap URL in Google Search Console: go to the Sitemaps tab and add a new sitemap using the URL with the parameters above. Then remove all other sitemaps so that Googlebot has only one reference.
- Update sitemap URL in Custom robots.txt
You need to enable Custom robots.txt in the Settings tab of your Blogger dashboard, under Crawlers and Indexing. After that, add the sitemap URL at the bottom of your custom robots.txt (Sitemap: https://yourblog.blogspot.com/atom.xml?redirect=false&start-index=1&max-results=20), delete any other sitemap URLs, and save it. An update: if you get a sitemap error in Google Search Console, keep the parameterized Atom sitemap URL only in your custom robots.txt (this step), and submit the Atom sitemap URL without parameters in Google Search Console. An example of the finished robots.txt is shown below.
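For reference, here is roughly what the finished custom robots.txt might look like. The User-agent, Disallow, and Allow lines shown are Blogger's usual defaults and may differ on your blog; the part that matters for this method is the single Sitemap line carrying the parameters.

```
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap: https://yourblog.blogspot.com/atom.xml?redirect=false&start-index=1&max-results=20
```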
Well, now you can wait a few days and then check in Google Search Console whether your latest content has been crawled. How long this takes depends on Googlebot's schedule and frequency of visits to your blog.
Since your sitemap has changed, the previously indexed URLs will appear under "Indexed, not submitted in sitemap". You can fix this later by submitting your full sitemap once all the new content has been indexed. In the meantime, don't request indexing for any old content. Also, it may take more than one crawl cycle to index all your new content, so be patient.
I hope this article helps solve your indexing issue. If you found it useful, please share it with other bloggers who may be experiencing similar problems.