The behavior described below has recently changed. If you have Clouds created prior to October 19, 2017, please read this article.
What is robots.txt?
A /robots.txt is an optional file that lets a webmaster explicitly tell well-behaved web robots, such as search index spiders, how they should crawl a website. If no robots.txt file is present, most robots proceed with crawling and indexing the site.
This is useful when site owners use a dev or staging site for ongoing work, and an isolated production site where changes and updates are deployed. You can tell web robots to ignore the dev site, while allowing indexing on the production site.
robots.txt in MODX Cloud
Live or Production Sites
The robots.txt response for your sites using a Custom Domain (aka Production or Live sites) relies on the presence of robots.txt files on the filesystem. Therefore, you can customize the response by uploading a robots.txt file into the root of your site. If there is no robots.txt file, the web server returns a 404 error, which tells a robot that it's free to index the site.
MODX.dev Sites
Cloud sites that have a hostname ending in modx.dev are automatically served a Disallow: / response for robots.txt, to ensure your development sites are not indexed. If you have a use case where you want the robots.txt in your modx.dev subdomain site to function normally, please contact MODX Cloud Support.
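To illustrate what a Disallow: / response means to a well-behaved robot, here is a minimal Python sketch that uses the standard library's urllib.robotparser. The helper name is_allowed, the user agent ExampleBot, and the example URLs are hypothetical and chosen only for illustration; this shows how a compliant crawler might interpret the rules, not how MODX Cloud generates them.

```python
from urllib.robotparser import RobotFileParser

# The rules described above for modx.dev hostnames.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def is_allowed(robots_txt: str, url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if a compliant robot may fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical dev site: every path is off limits.
print(is_allowed(BLOCK_ALL, "https://example.modx.dev/about.html"))

# An empty rule set places no restrictions on the robot.
print(is_allowed("", "https://www.example.com/about.html"))
```

Robots that ignore robots.txt are not affected by these rules, so this is a request to crawlers rather than access control.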
Serve unique robots.txt files per hostname in MODX Cloud
Some organizations use MODX Revolution to run multiple websites from a single installation using Contexts. For example, a public-facing marketing site might be combined with landing page microsites and possibly a non-public intranet.
Most site owners want their sites indexed, and we have described above how to control the robots.txt response for all your Custom Domains. However, for a hypothetical intranet using intranet.example.com as its hostname, you wouldn’t want it indexed. Traditionally, this was tricky to accomplish on multisite installs because they shared the same web root.
In MODX Cloud, it’s easy. Simply upload an additional file to your webroot named robots-intranet.example.com.txt with the following content:
User-agent: *
Disallow: /
Note that the name of each hostname-specific file is "robots-" plus the full hostname plus ".txt". If you want to cover both a domain name (e.g. domain.name) and its "www" subdomain (www.domain.name), you will need a file for each. The example above blocks indexing of that hostname by well-behaved robots. Any hostname without its own hostname-specific file falls back to the standard robots.txt file, or to a 404 response if that file does not exist.
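The naming and fallback rules for Custom Domains can be sketched in a few lines of Python. This is only an illustration of the behavior described in this article, not MODX Cloud's actual implementation; the function name robots_source and the sample file set are hypothetical, and the sketch assumes the hostname is matched exactly as written.

```python
from typing import Optional, Set


def robots_source(hostname: str, webroot_files: Set[str]) -> Optional[str]:
    """Pick the file that would answer /robots.txt for `hostname`.

    Returns the file name, or None when nothing matches (a 404 response).
    """
    # Hostname-specific file: "robots-" + full hostname + ".txt"
    specific = f"robots-{hostname}.txt"
    if specific in webroot_files:
        return specific
    # Otherwise fall back to the standard file, if it exists.
    if "robots.txt" in webroot_files:
        return "robots.txt"
    return None


# Hypothetical webroot contents for a multisite install.
files = {"robots.txt", "robots-intranet.example.com.txt", "robots-domain.name.txt"}

print(robots_source("intranet.example.com", files))  # hostname-specific file
print(robots_source("www.example.com", files))       # standard robots.txt
print(robots_source("www.domain.name", files))       # "www" needs its own file
```

In the last call, www.domain.name falls back to robots.txt because only robots-domain.name.txt exists, which is why a separate file is needed for each hostname.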