Train your AI with the Web Crawler
Open's web crawler indexes any public website you point it at (your docs site, help center, blog, knowledge base, or marketing pages) and feeds the content into your AI agent's knowledge. It recrawls on a schedule so the agent stays current.
Connect once, and Open keeps your AI agent's knowledge up to date. When you update content on your website, the next scheduled recrawl picks up the changes, with no manual retraining required.
What can be synced
Pages — All discoverable pages with full HTML extraction.
Sitemaps — sitemap.xml-based discovery.
Selectors — CSS-selector-based content extraction.
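To illustrate sitemap-based discovery, here is a minimal sketch of reading page URLs out of a sitemap.xml document using Python's standard library. The function name `urls_from_sitemap` and the sample URLs are illustrative assumptions, not part of Open's product or API. Open's own discovery logic is not documented here and may differ.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(xml_text):
    """Return the <loc> values listed in a sitemap document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]

# Sample sitemap with made-up URLs.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/</loc></url>
  <url><loc>https://example.com/docs/install</loc></url>
</urlset>"""

for url in urls_from_sitemap(sample):
    print(url)
```

A crawler would typically use these URLs as the starting set and then follow links found on each page.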
Features
- Sitemap-aware: Reads sitemap.xml when present and follows discovered links recursively.
- Smart re-crawl: Recrawls on a schedule (daily, weekly, or monthly) and only updates pages that have changed.
- Selector-based extraction: Configure CSS selectors to skip nav, footer, and ads, so only the content the agent needs is indexed.
- JS rendering: Renders JavaScript-heavy SPAs so client-rendered content gets indexed too.
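One common way to update only the pages that changed is to store a content fingerprint per URL and compare it on each recrawl. The sketch below assumes that approach. Whether Open uses hashing, HTTP headers, or something else is not stated on this page, and the names `content_fingerprint` and `pages_to_update` are hypothetical.

```python
import hashlib

def content_fingerprint(text):
    """Hash whitespace-normalized text so formatting-only changes are ignored."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def pages_to_update(previous, crawled):
    """Return URLs that are new or whose content differs from the last crawl.

    previous: dict of url -> fingerprint stored after the last crawl
    crawled:  dict of url -> extracted text from the current crawl
    """
    return sorted(
        url
        for url, text in crawled.items()
        if previous.get(url) != content_fingerprint(text)
    )

# Made-up example data: page "a" only changed its spacing, page "b" is new.
previous = {"https://example.com/a": content_fingerprint("Hello world")}
crawled = {
    "https://example.com/a": "Hello   world",
    "https://example.com/b": "New page",
}
print(pages_to_update(previous, crawled))  # ['https://example.com/b']
```

Normalizing whitespace before hashing is a design choice in this sketch: it avoids re-indexing a page whose markup was reformatted but whose text is the same.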
Requirements
- Public or basic-auth-protected website
How to connect
1. In Open, go to AI Training → Sources → Add Web Crawler
2. Enter the seed URL and (optionally) a sitemap URL
3. Configure selectors to include / exclude (nav, footer, ads)
4. Set the recrawl schedule
5. Run the first crawl and review the indexed content
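To give a feel for what excluding nav, footer, and similar elements in step 3 does to a page, here is a simplified sketch. It matches on tag names only, not full CSS selectors, and uses Python's built-in `html.parser`. The class `ContentExtractor` and function `extract_text` are hypothetical and do not represent Open's extraction engine.

```python
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    """Collects page text while skipping everything inside excluded tags."""

    def __init__(self, exclude_tags):
        super().__init__()
        self.exclude_tags = set(exclude_tags)
        self.skip_depth = 0  # greater than 0 while inside an excluded element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.exclude_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.exclude_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside every excluded element.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html, exclude_tags=("nav", "footer", "script", "style")):
    """Return the visible text of html, minus the excluded elements."""
    parser = ContentExtractor(exclude_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<main><h1>Install</h1><p>Run the installer.</p></main>"
    "<footer>Example footer</footer></body></html>"
)
print(extract_text(page))  # Install Run the installer.
```

Reviewing the indexed content after the first crawl (step 5) is a good moment to check that the exclusions removed navigation and footer text without dropping the main content.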
Good to know
- Respects robots.txt by default; can be overridden for sites you own
- JS-heavy SPAs supported via headless rendering
- Per-domain rate limits keep crawls polite
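The robots.txt and rate-limit behaviour above can be sketched with Python's standard `urllib.robotparser`. This is an illustration under assumptions: the `respect_robots` flag, the `DomainRateLimiter` class, the user-agent string `ExampleBot`, and the two-second interval are all hypothetical and are not Open's actual settings.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def build_robots(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def may_crawl(robots, user_agent, url, respect_robots=True):
    """Honor robots.txt unless it is explicitly overridden (e.g. for a site you own)."""
    return (not respect_robots) or robots.can_fetch(user_agent, url)

class DomainRateLimiter:
    """Enforces a minimum interval between requests to the same host."""

    def __init__(self, min_interval_seconds):
        self.min_interval = min_interval_seconds
        self.next_slot = {}  # host -> earliest time the next request may start

    def wait_time(self, url, now=None):
        """Return seconds to wait before requesting url, and reserve that slot."""
        host = urlsplit(url).netloc
        now = time.monotonic() if now is None else now
        start = max(now, self.next_slot.get(host, now))
        self.next_slot[host] = start + self.min_interval
        return start - now

robots = build_robots("User-agent: *\nDisallow: /private/\n")
print(may_crawl(robots, "ExampleBot", "https://example.com/docs"))       # True
print(may_crawl(robots, "ExampleBot", "https://example.com/private/x"))  # False

limiter = DomainRateLimiter(min_interval_seconds=2.0)
print(limiter.wait_time("https://example.com/a", now=100.0))  # 0.0
print(limiter.wait_time("https://example.com/b", now=100.5))  # 1.5
```

Keeping one timer per host means a slow or strict site does not hold up requests to other domains in the same crawl.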
Security: Open only needs read access to your website. We never write, modify, or delete your content. All data is encrypted in transit and at rest. Open is GDPR compliant and working toward SOC 2 Type II.
Ready to connect the Web Crawler?
AI Training → Sources → Web Crawler