Version: v1.7.3-14 🚧

Import from URL

Overview

The Import from URL ingestion method imports documents directly from web URLs. It can import a single page or crawl linked pages on the same domain and extract their content, which makes it well suited to articles, documentation, and other online content.

When to use

Web content: Importing documents hosted on websites
Online resources: Importing articles, documentation, or reports
Dynamic content: Importing content that is regularly updated online
Public documents: Importing publicly available web documents
Research materials: Gathering content from multiple web sources

Configuration parameters

URL and crawling settings

Option	Default	Description	Use case
URL	- (text input)	The web page URL to import from	Specify the starting point for content import
Follow links	Off (toggle)	Whether to crawl linked pages from the same domain	Enable for importing entire websites or documentation sections
Max documents	1 (number input)	Maximum number of pages to import during crawling	1: import only the specified page; 5-10: import a small section of related pages; 50+: import large documentation sites or blog series

Note: For document processing options, see the Shared Document Processing Options section in the main documentation.

Website crawling behavior

Single page import

Follow links: Off
Max documents: 1 (limits crawling to only the specified page)
Result: Only the specified page is imported
Use case: Specific articles, documentation pages, or reports

Multi-page crawling

Follow links: On
Max documents: Set to desired limit (controls how many pages are imported)
Result: Multiple pages from the same domain are imported, up to the specified limit
Use case: Entire documentation sites, blog series, or website sections
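
As a rough illustration of how Follow links and Max documents interact, the sketch below plans a crawl over a small in-memory link map. It is not the product's implementation: the function name plan_crawl, the links map, the breadth-first visiting order, and the exact host comparison used for "same domain" are all assumptions made for this example.

```python
from collections import deque
from urllib.parse import urlparse


def plan_crawl(start_url, links, follow_links=False, max_documents=1):
    """Return the URLs that would be imported, in breadth-first order.

    `links` is a hypothetical map of page URL -> list of outgoing URLs.
    It stands in for fetching and parsing real pages.
    """
    start_host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    imported = []

    while queue and len(imported) < max_documents:
        url = queue.popleft()
        imported.append(url)
        if not follow_links:
            break  # single page import: outgoing links are never examined
        for target in links.get(url, []):
            # Assumption: "same domain" means an identical host component.
            if urlparse(target).netloc == start_host and target not in seen:
                seen.add(target)
                queue.append(target)
    return imported


# Hypothetical site structure used only for this example.
site = {
    "https://example.com/docs": [
        "https://example.com/docs/a",
        "https://other.org/x",
        "https://example.com/docs/b",
    ],
    "https://example.com/docs/a": ["https://example.com/docs/c"],
}

single = plan_crawl("https://example.com/docs", site)
section = plan_crawl("https://example.com/docs", site,
                     follow_links=True, max_documents=3)
```

With the defaults, only the starting page is planned. With follow_links=True and max_documents=3, the starting page and its first two same-domain links are planned, and the link to other.org is skipped.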

Crawling rules

Same domain: Only pages from the same domain are crawled
Respect robots.txt: The crawler honors each website's robots.txt rules
Rate limiting: Built-in delays to avoid overwhelming servers
Duplicate detection: Automatically avoids importing duplicate content
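
The rules above can be pictured with a few standard-library checks. This is a minimal sketch under stated assumptions, not a description of how the crawler is actually built: the helper names, the host-equality test for "same domain", the "example-bot" user agent, and the whitespace-and-case normalization used for duplicate detection are all hypothetical. Rate limiting is left out. In a real crawler it would typically be a delay between requests.

```python
import hashlib
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def same_domain(start_url, candidate_url):
    """Assumption: two URLs share a domain when their hosts match exactly."""
    return (urlparse(start_url).netloc.lower()
            == urlparse(candidate_url).netloc.lower())


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def content_fingerprint(text):
    """Hash whitespace- and case-normalized text to spot duplicate pages."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical robots.txt content used only for this example.
ROBOTS = """User-agent: *
Disallow: /private/
"""

on_site = same_domain("https://example.com/docs", "https://example.com/blog")
off_site = same_domain("https://example.com/docs", "https://other.org/blog")
docs_ok = allowed_by_robots(ROBOTS, "example-bot", "https://example.com/docs")
private_ok = allowed_by_robots(ROBOTS, "example-bot",
                               "https://example.com/private/report")
is_duplicate = (content_fingerprint("Hello  World\n")
                == content_fingerprint("hello world"))
```

A crawler built along these lines would skip a candidate page when same_domain or allowed_by_robots returns False, or when its fingerprint has already been seen.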

Legal and ethical considerations

Terms of service: Respect website terms of service and robots.txt
Copyright: Ensure you have permission to import and use content
Rate limiting: Avoid overwhelming servers with too many requests
Data privacy: Be mindful of personal information that may be present
