Crawl a website and ingest all pages into the knowledge base.
Args:
    collection: Target collection name
    start_url: Starting URL for crawling
    max_pages: Maximum number of pages to crawl
    max_depth: Maximum crawl depth
    url_patterns: Comma-separated URL match patterns (regex)
    exclude_patterns: Comma-separated exclusion patterns (regex)
    same_domain_only: Only crawl same domain
    content_selector: CSS selector for main content area
    remove_selectors: Comma-separated CSS selectors to remove
    concurrent_requests: Number of concurrent requests
    request_delay: Delay between requests in seconds
    timeout: Request timeout in seconds
    respect_robots_txt: Respect robots.txt rules
    parse_method: Parser for document ingestion
    chunk_strategy: Chunking strategy
    chunk_size: Chunk size in characters
    chunk_overlap: Chunk overlap in characters
    embedding_model_id: Embedding model ID
    embedding_batch_size: Batch size for embedding
    max_retries: Maximum retry attempts
    retry_delay: Delay between retries
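The url_patterns and exclude_patterns arguments take comma-separated regular expressions. As a sketch of how such filters typically behave (the exact matching semantics of this API are an assumption here, and `url_allowed` is a hypothetical helper, not part of the API):

```python
import re

def url_allowed(url: str, url_patterns: str = "", exclude_patterns: str = "") -> bool:
    """Apply comma-separated regex filters to a URL.

    Hypothetical helper: exclusions win over inclusions, and an empty
    include list allows everything. The API's real rules may differ.
    """
    includes = [p.strip() for p in url_patterns.split(",") if p.strip()]
    excludes = [p.strip() for p in exclude_patterns.split(",") if p.strip()]
    if any(re.search(p, url) for p in excludes):
        return False
    return not includes or any(re.search(p, url) for p in includes)
```

For example, url_patterns="/docs/,/guide/" would keep only documentation pages, while exclude_patterns=r"\.pdf$" would skip direct PDF links.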
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
Target collection name
Starting URL for crawling
Maximum number of pages to crawl (default: 100)
Maximum crawl depth (default: 3)
Comma-separated URL match patterns (regex)
Comma-separated exclusion patterns (regex)
Only crawl same domain (default: True)
CSS selector for main content area
Comma-separated CSS selectors to remove
Number of concurrent requests (default: 3, constraint: 1 <= x <= 10)
Delay between requests in seconds (default: 1.0, constraint: x >= 0)
Request timeout in seconds (default: 30, constraint: x >= 1)
Respect robots.txt rules (default: True)
Parser used during ingestion (options: default, pypdf, pdfplumber, unstructured, pymupdf, deepdoc)
Chunking strategy (options: recursive, fixed_size, markdown)
Chunk size in characters (default: 1000, constraint: x > 0)
Chunk overlap in characters (default: 200, constraint: x >= 0)
Embedding model ID
Batch size for embedding (default: 10, constraint: x > 0)
Maximum retries for embedding failures (default: 3, constraint: x >= 0)
Delay between retries in seconds (default: 1.0, constraint: x >= 0)
Successful Response
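Putting the request fields together, here is a minimal Python client sketch. The host and path (`/v1/crawl-website`) are placeholders, not documented values; only the field names, defaults, and the Bearer header come from the reference above:

```python
import json
import urllib.request

# Documented defaults for the optional request fields.
CRAWL_DEFAULTS = {
    "max_pages": 100,
    "max_depth": 3,
    "same_domain_only": True,
    "concurrent_requests": 3,
    "request_delay": 1.0,
    "timeout": 30,
    "respect_robots_txt": True,
    "chunk_size": 1000,
    "chunk_overlap": 200,
    "embedding_batch_size": 10,
    "max_retries": 3,
    "retry_delay": 1.0,
}

def build_crawl_request(collection: str, start_url: str, **options) -> dict:
    """Merge the two required fields with defaults and any overrides."""
    payload = {"collection": collection, "start_url": start_url}
    payload.update(CRAWL_DEFAULTS)
    payload.update(options)
    return payload

def crawl_website(token: str, payload: dict,
                  base_url: str = "https://api.example.com") -> dict:
    """POST the crawl request. Host and endpoint path are assumed placeholders."""
    req = urllib.request.Request(
        f"{base_url}/v1/crawl-website",  # placeholder path
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Splitting payload construction from the HTTP call keeps the defaults testable without touching the network.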
Result of website crawling and knowledge base ingestion.
This model provides comprehensive statistics about the crawling and ingestion process, including success/failure counts and timing.
Overall status: success|error|partial
Target collection name
Total number of unique URLs discovered (x >= 0)
Number of successfully crawled pages (x >= 0)
Number of pages that failed to crawl (x >= 0)
Number of documents created in the collection (x >= 0)
Total number of chunks created (x >= 0)
Total number of embeddings generated (x >= 0)
Human-readable summary message
Total elapsed time in milliseconds (x >= 0)
List of successfully crawled URLs
Map of failed URLs to error messages
Non-critical warnings
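Because the result carries both aggregate counts and per-URL detail, a client can cross-check them. A sketch, assuming snake_case JSON keys derived from the field descriptions above (the real key names are not confirmed by this reference):

```python
def check_crawl_result(result: dict) -> tuple[str, bool]:
    """Derive an overall status and verify counts against the URL lists.

    Key names (pages_crawled, crawled_urls, failed_urls, ...) are
    assumptions based on the documented field descriptions.
    """
    crawled = result.get("pages_crawled", 0)
    failed = result.get("pages_failed", 0)
    consistent = (
        crawled == len(result.get("crawled_urls", []))
        and failed == len(result.get("failed_urls", {}))
    )
    if failed == 0:
        status = "success"
    elif crawled > 0:
        status = "partial"  # some pages crawled, some failed
    else:
        status = "error"
    return status, consistent
```

A mix of crawled and failed pages maps to the documented "partial" status; a mismatch between counts and URL lists flags a response worth investigating.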