The behavior described below has recently changed. If you have Clouds created prior to October 19, 2017, please read this article.
What is robots.txt?
A /robots.txt is an optional file that lets a webmaster explicitly tell well-behaved web robots, such as search index spiders, how they should crawl a website. If no robots.txt file is present, most robots will proceed to crawl and index the site.
This is useful when site owners use a dev or staging site for ongoing work and a separate production site where changes and updates are deployed. You can tell web robots to ignore the dev site while allowing indexing on the production site.
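For example, a staging site might serve a robots.txt that blocks all crawling, while the production site serves one that only excludes a private area. A minimal sketch, where the /private/ path is purely illustrative:

Staging site robots.txt:

User-agent: *
Disallow: /

Production site robots.txt:

User-agent: *
Disallow: /private/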
robots.txt in MODX Cloud
Each of your Clouds (sites) has multiple hostnames: the Cloud Address and the Web Address, which are created automatically. You can also assign one or more Custom Domains to a Cloud as needed.
The Cloud Address and Web Address (which both end in .modxcloud.com) will be automatically served a Disallow: / response for robots.txt, to ensure your development sites are not indexed.
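In other words, a request for robots.txt on one of these hostnames returns a blanket disallow to this effect:

User-agent: *
Disallow: /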
The robots.txt response for your Custom Domains relies on the presence of robots.txt files on the filesystem. You can therefore customize the response by uploading a robots.txt file to the root of your site. If there is no robots.txt file, the web server returns a 404 Not Found response, which tells a robot it is free to crawl and index the site.
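For example, to invite all robots to index a Custom Domain, you could upload a robots.txt like the following to the site root; the Sitemap line and the example.com URL are optional and purely illustrative:

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

An empty Disallow: value means nothing is excluded, so the entire site is open to crawling.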
Serve unique robots.txt files per hostname in MODX Cloud
Some organizations use MODX Revolution to run multiple websites from a single installation using Contexts. This might apply, for example, to a public-facing marketing site combined with landing-page microsites and a non-public intranet.
Most site owners want their sites indexed, and we have described above how to control the robots.txt response for all your Custom Domains. However, for a hypothetical intranet using intranet.example.com as its hostname, you wouldn’t want it indexed. Traditionally, this was tricky to accomplish on multisite installs because they shared the same web root.
In MODX Cloud, it’s easy. Simply upload an additional file to your webroot named robots-intranet.example.com.txt with the following content:

User-agent: *
Disallow: /
Note that the name for these hostname-specific files is "robots-" plus the full hostname plus ".txt". If you want to cover both a domain name, e.g. domain.name, and its "www" subdomain, www.domain.name, you will need a file for each. The example above blocks indexing on that hostname by well-behaved robots; all other hostnames fall back to the standard robots.txt file (or the lack thereof) if no hostname-specific file exists for them.
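For example, to cover both an apex domain and its "www" subdomain with the same rules, your webroot would need both of these files, each containing the desired directives (domain.name is a placeholder):

robots-domain.name.txt
robots-www.domain.name.txt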