A common Cloudflare pattern is to expose a direct origin hostname such as direct.domain.com for live or uncached access. That is useful operationally, but it creates duplicate-content risk if crawlers can index the same pages under both the public domain and the origin hostname.
Operational context
- When to use this: you operate a Cloudflare-proxied site and also expose a direct or uncached subdomain for testing or origin access.
- What it reduces: search engines indexing the same content under multiple hostnames.
- Tradeoff: this solves crawler exposure, not all duplicate-content cases. Canonical URLs, redirects, and sitemap hygiene still matter.
Recommended approach
Serve robots.txt dynamically based on the requested host. Block crawlers on the direct subdomain while preserving the normal robots.txt for the public site.
This works best when the direct hostname is operationally necessary but should never appear in search results.
Apache
# Serve robots.txt from a script to control crawler access by hostname.
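RewriteEngine On
# Match only the root-level /robots.txt, not any path that happens to end in robots.txt.
RewriteCond %{REQUEST_URI} ^/robots\.txt$ [NC]
RewriteRule ^ /robots.php [L]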
Nginx
rewrite ^/robots\.txt$ /robots.php last;
robots.php
<?php
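header('Content-Type: text/plain; charset=utf-8');
// Normalize the Host header: lowercase it and drop any :port suffix.
$host = strtolower(preg_replace('/:\d+$/', '', $_SERVER['HTTP_HOST'] ?? ''));
if ($host === 'direct.domain.com') {
    echo "User-agent: *\n";
    echo "Disallow: /\n";
} else {
    // readfile() sends the file verbatim; include would execute it as PHP.
    readfile(__DIR__ . '/robots.txt');
}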
Replace direct.domain.com with your origin or preview hostname.
Nginx map alternative
If you prefer not to route this through PHP, handle it in Nginx:
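# The map block belongs in the http context, outside any server block.
map $host $robots_body {
    default            "User-agent: *\nAllow: /\n";
    direct.domain.com  "User-agent: *\nDisallow: /\n";
}
server {
    server_name domain.com direct.domain.com;
    # Exact-match location so only /robots.txt is answered from the map.
    location = /robots.txt {
        default_type text/plain;
        return 200 $robots_body;
    }
}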
For the public hostname, serve your normal robots content instead of the minimal Allow: / example if you need sitemap declarations or crawl-delay rules.
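One way to keep the existing robots.txt file for the public hostname is to have the map produce a flag instead of the response body. The fragment below is a minimal sketch, not a drop-in configuration: the variable name $robots_blocked and the root path /var/www/html are assumptions, and it presumes the normal robots.txt sits in that root.

```nginx
# Assumed flag variable: 1 for hostnames that must not be crawled.
# The map block goes in the http context.
map $host $robots_blocked {
    default            0;
    direct.domain.com  1;
}

server {
    server_name domain.com direct.domain.com;
    root /var/www/html;  # assumption: the static robots.txt lives here

    location = /robots.txt {
        default_type text/plain;
        # Direct hostname: answer with a blanket disallow.
        if ($robots_blocked) {
            return 200 "User-agent: *\nDisallow: /\n";
        }
        # Public hostname: fall through and serve the static file from root,
        # including any Sitemap or Crawl-delay lines it contains.
    }
}
```

Using return inside if is one of the uses of if that Nginx handles predictably within a location, which is why the sketch keeps the conditional that small.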
Complementary controls
Robots rules reduce crawler access, but they do not consolidate SEO signals by themselves. Also use:
- canonical URLs pointing to the public hostname
- a sitemap that contains only public canonical URLs
- redirects from accidental alternate hostnames where possible
- Cloudflare DNS hygiene so direct origin hostnames are not easy to discover
- Search Console removal tools if the direct hostname has already been indexed
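For the redirect item in the list above, a catch-all server block is one possible approach in Nginx. This is a sketch under assumptions: it covers plain HTTP on port 80 only, domain.com stands in for the public hostname, and the direct hostname keeps its own server block so it is not redirected. Catching stray hostnames over HTTPS would additionally need a certificate on the default server.

```nginx
# Catch-all: requests whose Host header matches no other server block land
# here and are redirected to the public hostname (placeholder: domain.com).
server {
    listen 80 default_server;
    server_name _;
    return 301 https://domain.com$request_uri;
}
```

A permanent redirect also gives crawlers a consolidation signal that a robots block on the same hostname would not.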
When robots.txt is insufficient
Do not rely on robots.txt alone when the direct hostname exposes private, staging, or customer data. Robots rules are voluntary crawler instructions, not access control.
For sensitive environments, require authentication, restrict by source IP or VPN, or remove public DNS entirely. If the hostname exists only for origin checks, consider a health endpoint rather than exposing the full application.
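As an illustration of source-IP restriction combined with a narrow health endpoint, the fragment below shows one possible layout for the direct hostname. Everything specific in it is an assumption: 203.0.113.0/24 is a documentation-only range standing in for an office or VPN network, /healthz is a hypothetical path, and the application's own directives are left out.

```nginx
server {
    listen 80;
    server_name direct.domain.com;

    # Hypothetical health endpoint: reachable by anyone, exposes nothing else.
    location = /healthz {
        access_log off;
        default_type text/plain;
        return 200 "ok\n";
    }

    # Crawlers that do reach this hostname still see a blanket disallow.
    location = /robots.txt {
        default_type text/plain;
        return 200 "User-agent: *\nDisallow: /\n";
    }

    # All other paths: allow only the trusted range, reject the rest with 403.
    location / {
        allow 203.0.113.0/24;  # placeholder range, replace with your own
        deny  all;
        # The application's root, proxy_pass, or fastcgi_pass directives go here.
    }
}
```

The allow and deny directives are evaluated in order, so the trusted range has to appear before deny all.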
Verification
- Request https://direct.domain.com/robots.txt and confirm it returns Disallow: /.
- Request the public site's robots.txt and confirm normal crawl rules still apply.
- Inspect rendered pages on the direct hostname and confirm canonical URLs point to the public hostname.
- Confirm the sitemap lists only public URLs.
- Monitor Search Console or crawl logs to confirm the origin hostname is no longer being indexed.
Result
This pattern blocks crawlers on the direct subdomain while keeping the public site indexable, which removes a common source of duplicate-content exposure in Cloudflare setups.
Related work
This CDN and origin hygiene pattern supports high-traffic web platform delivery in Scaling and load-balancing architecture.