Train your AI on Web Crawler

Auth: Public crawling (no auth) | Incremental sync | Full refresh

Open's web crawler indexes any public website you point it at (your docs site, help center, blog, knowledge base, or marketing pages) and feeds the content into your AI agent's knowledge. It recrawls on a schedule so the agent stays current.

Connect once, and Open automatically keeps your AI agent's knowledge up to date. When you update content on your website, the changes are picked up on the next scheduled recrawl, with no manual retraining required.

Web Crawler → syncs to → AI Agent

What can be synced

Pages — All discoverable pages with full HTML extraction.

Sitemaps — sitemap.xml-based discovery.

Selectors — CSS-selector-based content extraction.
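As an illustration of sitemap.xml-based discovery, here is a minimal sketch that pulls page URLs out of a sitemap document. It is not Open's implementation; the function name `urls_from_sitemap` and the sample URLs are assumptions made for the example. The only fixed detail is the XML namespace, which comes from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Illustrative sitemap; a real crawler would download this from the site.
SAMPLE = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/docs/</loc></url>"
    "<url><loc>https://example.com/docs/setup</loc></url>"
    "</urlset>"
)

if __name__ == "__main__":
    for url in urls_from_sitemap(SAMPLE):
        print(url)
```

The same function also works on a sitemap index file, because index entries use `<loc>` as well; a crawler would then fetch each listed sitemap in turn.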

Features

• Sitemap-aware: Reads sitemap.xml when present and follows discovered links recursively.
• Smart re-crawl: Recrawls on a schedule (daily/weekly/monthly) and only updates pages that changed.
• Selector-based extraction: Configure CSS selectors to skip nav, footer, and ads, so only the content the agent needs is indexed.
• JS rendering: Renders JavaScript-heavy SPAs so client-rendered content gets indexed too.
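One common way to "only update pages that changed" is to compare a content fingerprint from the previous crawl with one computed from the fresh crawl. The sketch below assumes that approach; how Open actually detects changes is not documented here, and it could equally rely on HTTP headers such as ETag or Last-Modified. The names `content_fingerprint` and `pages_to_update` are hypothetical.

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """Hash whitespace-normalized text so pure reflows do not count as changes."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def pages_to_update(previous: dict[str, str], crawled: dict[str, str]) -> list[str]:
    """Return URLs that are new or whose text differs from the last crawl.

    previous: URL -> fingerprint stored after the last crawl (assumed storage).
    crawled:  URL -> text extracted during the current crawl.
    """
    return [
        url
        for url, text in crawled.items()
        if previous.get(url) != content_fingerprint(text)
    ]


if __name__ == "__main__":
    previous = {"https://example.com/a": content_fingerprint("Hello world")}
    crawled = {
        "https://example.com/a": "Hello   world",  # same text, different spacing
        "https://example.com/b": "New page",       # not seen before
    }
    print(pages_to_update(previous, crawled))
```

Normalizing whitespace before hashing is a design choice: it avoids re-indexing a page whose markup was reformatted but whose text is unchanged.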

Requirements

• Public or basic-auth-protected website

How to connect

1. In Open, go to AI Training → Sources → Add Web Crawler
2. Enter the seed URL and (optionally) a sitemap URL
3. Configure CSS selectors for the content to include or exclude (for example nav, footer, ads)
4. Set the recrawl schedule (daily, weekly, or monthly)
5. Run the first crawl and review the indexed content
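To show what the exclude step in the list above achieves, here is a rough sketch that drops navigation and footer markup before text is indexed. It is a simplification: it matches elements by tag name only, using Python's standard `html.parser`, whereas full CSS selector support would need a dedicated selector engine. The class `ContentExtractor` and function `extract_text` are hypothetical names, not part of Open.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collect page text while skipping excluded elements (matched by tag name)."""

    def __init__(self, exclude_tags=("nav", "footer", "script", "style")):
        super().__init__()
        self.exclude_tags = set(exclude_tags)
        self.skip_depth = 0  # greater than 0 while inside an excluded element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.exclude_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.exclude_tags and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside every excluded element.
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str, exclude_tags=("nav", "footer", "script", "style")) -> str:
    parser = ContentExtractor(exclude_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


PAGE = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<main><h1>Setup</h1><p>Install the CLI.</p></main>"
    "<footer>Copyright Example</footer></body></html>"
)

if __name__ == "__main__":
    print(extract_text(PAGE))
```

When reviewing the first crawl, checking that menu labels and footer text are absent from the indexed content is a quick way to confirm the exclusions are doing their job.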

Good to know

• Respects robots.txt by default; can be overridden for sites you own
• JS-heavy SPAs supported via headless rendering
• Per-domain rate limits keep crawls polite
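The robots.txt and rate-limit behaviour noted above can be sketched with Python's standard library. This is an assumption-laden illustration, not Open's crawler: the user agent string `ExampleCrawler`, the helper `allowed_by_robots`, the class `DomainThrottle`, and the delay values are all hypothetical.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against the rules in an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class DomainThrottle:
    """Space out requests to the same host by at least min_delay seconds."""

    def __init__(self, min_delay: float = 1.0, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        self.next_allowed = {}  # host -> earliest time the next request may start

    def reserve(self, url: str) -> float:
        """Return how long to wait before fetching url, and book that slot."""
        host = urlsplit(url).netloc
        now = self.clock()
        ready_at = max(self.next_allowed.get(host, now), now)
        self.next_allowed[host] = ready_at + self.min_delay
        return ready_at - now


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    throttle = DomainThrottle(min_delay=2.0)
    for url in ("https://example.com/docs/", "https://example.com/private/x"):
        if allowed_by_robots(robots, "ExampleCrawler", url):
            wait = throttle.reserve(url)
            print(f"fetch {url} after waiting {wait:.1f}s")
        else:
            print(f"skip {url} (disallowed by robots.txt)")
```

Keeping one timer per host means a slow or strict site does not hold up crawling on other domains.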

Security: Open only reads the pages it crawls. We never write, modify, or delete your content. All data is encrypted in transit and at rest. Open is GDPR compliant and working toward SOC 2 Type II.

Ready to connect Web Crawler?

AI Training → Sources → Web Crawler

Other training sources

Notion

Confluence

Google Docs