A robots.txt file tells search engine crawlers which URLs they can access on your site.
A robots.txt file is used mainly to manage crawl budget and keep crawlers from overloading your server with requests. However, it is not a way to keep a web page out of Google. To do that, block indexing with noindex or password-protect the page.
How Does Robots.txt Affect Different File Types?
1. Web Pages
For web pages (HTML, PDF, or other non-media formats that Google can read), a robots.txt file can keep crawler requests from overwhelming your server. It can also keep crawlers away from unimportant or near-duplicate pages on your site.
However, do not use a robots.txt file to hide your pages from Google search results. If other pages link to your page with descriptive text, Google can still index the URL without visiting the page. To keep a page out of search results, use noindex or password protection.
Note that if you block your web page with a robots.txt file, its URL can still appear in search results, but the result will not contain a description. Image files, video files, PDFs, and other non-HTML files embedded in the blocked page will not be crawled either, unless other crawlable pages reference them.
If you see such a search result for your page and want to fix it, remove the robots.txt entry blocking the page. To hide the page entirely from Search, use another method, such as noindex or password protection.
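To check which URLs a set of rules would keep crawlers away from, a minimal sketch using Python's standard urllib.robotparser module is shown here. The robots.txt content, the example.com URLs, and the helper name is_crawlable are illustrative assumptions, and the standard library's matching may differ in detail from Googlebot's.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url: str, agent: str = "Googlebot") -> bool:
    """Return True if the parsed rules let `agent` request `url`."""
    return parser.can_fetch(agent, url)


print(is_crawlable("https://example.com/blog/robots-guide"))  # True
print(is_crawlable("https://example.com/search/results"))     # False
```

A False result only means a compliant crawler will not request the URL. As described above, the URL itself can still show up in search results.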
2. Resource Files
If you think that pages loaded without resource files such as unimportant scripts, images, or style files will not be significantly affected by the loss, you can use a robots.txt file to block those resources. However, don’t block them if their absence will make the page harder for Google’s crawler to understand; otherwise, Google won’t adequately analyze pages that depend on those resources.
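Before blocking resources, it can help to list which of a page's resource files the rules would actually block. The following is a rough sketch with the standard urllib.robotparser module; the function name blocked_resources, the rules, and the asset URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocked_resources(robots_txt, resource_urls, agent="Googlebot"):
    """Return the resource URLs that `agent` may not fetch under the rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in resource_urls if not parser.can_fetch(agent, url)]


# Hypothetical rules and resource list for one page.
rules = """\
User-agent: *
Disallow: /assets/tracking/
Disallow: /assets/css/
"""
resources = [
    "https://example.com/assets/css/main.css",
    "https://example.com/assets/tracking/pixel.js",
    "https://example.com/assets/js/app.js",
]

for url in blocked_resources(rules, resources):
    print("blocked:", url)
```

In this example the stylesheet is blocked along with the tracking script. If the page layout depends on main.css, that rule is a candidate for removal.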
What Are the Limitations of Robots.txt File Blocking?
Not all search engines support robots.txt file directives
Googlebot and other reputable web crawlers comply with robots.txt file directives, but other crawlers might not, and a robots.txt file cannot force them to. To keep information away from such crawlers, use other blocking methods, like password-protecting private files on your server.
Different crawlers interpret syntax differently
While reputable web crawlers follow robots.txt file directives, each crawler may interpret the directives differently, and some may not understand certain directives at all. Therefore, make sure you know the correct syntax to address different crawlers.
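One concrete difference is how conflicting Allow and Disallow rules are resolved. Python's standard urllib.robotparser applies the first rule that matches, while Google documents that it uses the most specific (longest) matching rule. The sketch below compares the two; longest_match_allows is a simplified illustration written for this example (no wildcard support), not Google's implementation, and the rule paths are made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, in file order: a broad Disallow, then a narrower Allow.
RULES = [("disallow", "/docs/"), ("allow", "/docs/public/")]
ROBOTS_TXT = "User-agent: *\n" + "".join(
    f"{kind.capitalize()}: {path}\n" for kind, path in RULES
)


def first_match_allows(path: str) -> bool:
    """Standard-library behavior: the first matching rule decides."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch("AnyBot", path)


def longest_match_allows(path: str) -> bool:
    """Simplified longest-match: the longest matching prefix decides.

    On a tie, Allow wins, because (length, True) sorts above (length, False).
    """
    matches = [(len(p), kind == "allow") for kind, p in RULES if path.startswith(p)]
    if not matches:
        return True  # no rule applies, so crawling is allowed
    return max(matches)[1]


path = "/docs/public/guide.html"
print(first_match_allows(path))    # False
print(longest_match_allows(path))  # True
```

The same file yields opposite answers for the same path, which is why rules that rely on Allow overriding Disallow should be tested against each crawler you care about.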
Pages linked from other sites can still appear in Google search results
While Google won’t crawl or index content blocked by a robots.txt file, it may still find and index a disallowed URL if that URL is linked from other places on the web. As a result, the URL and other public information, such as anchor text in links to the page, can still appear in Google search results. To prevent this, password-protect the files on your server, use the noindex meta tag or response header, or remove the page entirely.
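The two noindex options can be sketched as follows: the robots meta tag placed in the page's head, and the X-Robots-Tag HTTP response header, which also works for non-HTML files such as PDFs. The helper names noindex_headers and add_noindex_meta are hypothetical, and how headers are attached to a response depends on your server or framework. Note that a crawler must be able to fetch the page to see either signal, so the page must not also be blocked in robots.txt.

```python
# Meta tag form of noindex, placed inside the page's <head>.
NOINDEX_META = '<meta name="robots" content="noindex">'


def noindex_headers(content_type: str = "text/html; charset=utf-8") -> dict:
    """Build response headers that ask compliant crawlers not to index the resource."""
    return {
        "Content-Type": content_type,
        "X-Robots-Tag": "noindex",
    }


def add_noindex_meta(html: str) -> str:
    """Insert the noindex meta tag right after the first <head> tag.

    Simplified: if the document has no literal "<head>", it is returned unchanged.
    """
    return html.replace("<head>", "<head>\n  " + NOINDEX_META, 1)


page = "<html><head><title>Private report</title></head><body>...</body></html>"
print(add_noindex_meta(page))
print(noindex_headers("application/pdf"))
```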