A common Cloudflare pattern is to expose a direct origin hostname, such as direct.domain.com, that bypasses the proxy for live or uncached access. That is useful operationally, but it creates duplicate-content risk: crawlers that discover the direct hostname can index the same pages under both it and the public domain.

Operational context

  • When to use this: you operate a Cloudflare-proxied site and also expose a direct or uncached subdomain for testing or origin access.
  • What it reduces: search engines indexing the same content under multiple hostnames.
  • Tradeoff: this solves crawler exposure, not all duplicate-content cases. Canonical URLs, redirects, and sitemap hygiene still matter.

Recommended approach

Serve robots.txt dynamically based on the requested host. Block crawlers on the direct subdomain while preserving the normal robots.txt for the public site.

This works best when the direct hostname is operationally necessary but should never appear in search results.

Apache

# Serve robots.txt from a script to control crawler access by hostname.
# The anchored pattern rewrites only the root /robots.txt, not every path ending in robots.txt.
RewriteEngine On
RewriteRule ^/?robots\.txt$ /robots.php [L,NC]

Nginx

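# Inside the server block. The dot is escaped and the pattern anchored so only /robots.txt matches.
rewrite ^/robots\.txt$ /robots.php last;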

robots.php

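<?php
// Serve a blocking robots.txt on the direct hostname and the normal file everywhere else.

header('Content-Type: text/plain; charset=utf-8');

// Normalize the Host header: lowercase it and drop any :port suffix.
$host = strtolower(preg_replace('/:\d+$/', '', $_SERVER['HTTP_HOST'] ?? ''));

if ($host === 'direct.domain.com') {
  echo "User-agent: *\n";
  echo "Disallow: /\n";
} else {
  // readfile() outputs the file verbatim; include would execute any PHP inside it.
  readfile(__DIR__ . '/robots.txt');
}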

Replace direct.domain.com with your origin or preview hostname.

Nginx map alternative

If you prefer not to route this through PHP, handle it in Nginx. The map block must sit in the http context, outside the server block:

map $host $robots_body {
    default "User-agent: *\nAllow: /\n";
    direct.domain.com "User-agent: *\nDisallow: /\n";
}

server {
    server_name domain.com direct.domain.com;

    location = /robots.txt {
        default_type text/plain;
        return 200 $robots_body;
    }
}

The default value above is a minimal placeholder. If the public hostname needs sitemap declarations or crawl-delay rules, replace the Allow: / example with your normal robots content.

Complementary controls

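Robots rules stop compliant crawlers from fetching pages, but they do not consolidate SEO signals by themselves, and a disallowed URL can still appear in search results if other pages link to it. Also use: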

  • canonical URLs pointing to the public hostname
  • a sitemap that contains only public canonical URLs
  • redirects from accidental alternate hostnames where possible
  • Cloudflare DNS hygiene so direct origin hostnames are not easy to discover
  • Search Console removal tools if the direct hostname has already been indexed
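
The canonical and sitemap items in the list above can be checked with a short script. The following is a minimal Python sketch using only the standard library. PUBLIC_HOST, canonical_hosts, and sitemap_hosts are hypothetical names introduced for illustration, and the functions take HTML or XML text you have already fetched rather than making requests themselves.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit
import xml.etree.ElementTree as ET

PUBLIC_HOST = "domain.com"  # assumption: the public hostname used in the examples


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if "canonical" in rel and attr.get("href"):
            self.hrefs.append(attr["href"])


def canonical_hosts(html):
    """Return the hostname of each canonical URL found in the page (None if relative)."""
    finder = CanonicalFinder()
    finder.feed(html)
    return [urlsplit(href).hostname for href in finder.hrefs]


def sitemap_hosts(xml_text):
    """Return the set of hostnames used by <loc> entries in a sitemap."""
    hosts = set()
    for el in ET.fromstring(xml_text).iter():
        # Sitemap elements are namespaced, so compare the local tag name only.
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text:
            hosts.add(urlsplit(el.text.strip()).hostname)
    return hosts
```

A page served from the direct hostname passes when canonical_hosts(html) equals [PUBLIC_HOST], and a sitemap passes when sitemap_hosts(xml_text) equals {PUBLIC_HOST}. Any other hostname in either result points at a URL that should be corrected.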

When robots.txt is insufficient

Do not rely on robots.txt alone when the direct hostname exposes private, staging, or customer data. Robots rules are voluntary crawler instructions, not access control.

For sensitive environments, require authentication, restrict by source IP or VPN, or remove public DNS entirely. If the hostname exists only for origin checks, consider a health endpoint rather than exposing the full application.

Verification

  1. Request https://direct.domain.com/robots.txt and confirm it returns Disallow: /.
  2. Request the public site robots.txt and confirm normal crawl rules still apply.
  3. Inspect rendered pages on the direct hostname and confirm canonical URLs point to the public hostname.
  4. Confirm the sitemap lists only public URLs.
  5. Monitor Search Console or crawl logs to confirm the origin hostname is no longer being indexed.
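
Steps 1 and 2 can be automated. The sketch below is one possible approach, not a definitive tool: it uses Python's standard urllib.robotparser to evaluate each robots.txt body the way a compliant crawler would. The function names allows, check, and fetch_robots are hypothetical, and the hostnames in the usage comment are the placeholder names from the examples above.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


def allows(robots_body, path="/", agent="*"):
    """Return True if the given robots.txt text lets `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch(agent, path)


def check(direct_body, public_body):
    """Return a list of problems; an empty list means both hostnames behave as intended."""
    problems = []
    if allows(direct_body):
        problems.append("direct hostname still allows crawling of /")
    if not allows(public_body):
        problems.append("public hostname blocks crawling of /")
    return problems


def fetch_robots(host, timeout=10):
    """Download robots.txt from a host over HTTPS and return it as text."""
    with urlopen(f"https://{host}/robots.txt", timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Example usage (performs network requests, so it is left commented out):
# problems = check(fetch_robots("direct.domain.com"), fetch_robots("domain.com"))
# print(problems or "robots.txt rules look correct on both hostnames")
```

Keeping the evaluation in allows and check separate from the download in fetch_robots means the rule logic can be exercised against literal strings in a test suite without touching the network.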

Result

This pattern blocks crawlers on the direct subdomain while keeping the public site indexable, which removes a common source of duplicate-content exposure in Cloudflare setups.

Related work

This CDN and origin hygiene pattern supports the high-traffic web platform delivery described in Scaling and load-balancing architecture.