Ingest Web

POST /api/kb/ingest-web
curl --request POST \
  --url https://api.example.com/api/kb/ingest-web \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/x-www-form-urlencoded' \
  --data 'collection=<string>' \
  --data 'start_url=<string>' \
  --data max_pages=100 \
  --data max_depth=3 \
  --data 'url_patterns=<string>' \
  --data 'exclude_patterns=<string>' \
  --data same_domain_only=true \
  --data 'content_selector=<string>' \
  --data 'remove_selectors=<string>' \
  --data concurrent_requests=3 \
  --data request_delay=1 \
  --data timeout=30 \
  --data respect_robots_txt=true \
  --data parse_method=default \
  --data chunk_strategy=recursive \
  --data chunk_size=1000 \
  --data chunk_overlap=200 \
  --data embedding_model_id=text-embedding-v4 \
  --data embedding_batch_size=10 \
  --data max_retries=3 \
  --data retry_delay=1
{
  "status": "<string>",
  "collection": "<string>",
  "total_urls_found": 1,
  "pages_crawled": 1,
  "pages_failed": 1,
  "documents_created": 1,
  "chunks_created": 1,
  "embeddings_created": 1,
  "message": "<string>",
  "elapsed_time_ms": 1,
  "crawled_urls": [
    "<string>"
  ],
  "failed_urls": {},
  "warnings": [
    "<string>"
  ]
}

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
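
As an illustration of the header and body format, the following minimal sketch builds the same request in Python using only the standard library. It constructs the request without sending it. The helper name `build_ingest_request`, the token placeholder, and the collection and URL values are assumptions made for the example, not part of the API.

```python
import urllib.parse
import urllib.request

# Hypothetical values: replace with your own host and token.
API_URL = "https://api.example.com/api/kb/ingest-web"
TOKEN = "<token>"


def build_ingest_request(token, collection, start_url, **options):
    """Build (but do not send) a form-encoded POST request for this endpoint."""
    fields = {"collection": collection, "start_url": start_url}
    for name, value in options.items():
        if value is None:
            continue  # leave unset options out so the server defaults apply
        # Form values are strings; booleans are lowercased as in the curl sample.
        fields[name] = str(value).lower() if isinstance(value, bool) else str(value)
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


request = build_ingest_request(
    TOKEN, "docs", "https://example.com/docs", max_pages=50, same_domain_only=True
)
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

To actually send it, pass the request to `urllib.request.urlopen(request)` with a client-side timeout long enough for the crawl to finish.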

Body

application/x-www-form-urlencoded
collection
string
required

Target collection name

start_url
string
required

Starting URL for crawling

max_pages
integer | null
default:100

Maximum number of pages to crawl (default: 100)

max_depth
integer | null
default:3

Maximum crawl depth (default: 3)

url_patterns
string | null

Comma-separated URL match patterns (regex)

exclude_patterns
string | null

Comma-separated exclusion patterns (regex)
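
The exact matching rules for `url_patterns` and `exclude_patterns` are not documented here. The sketch below shows one plausible reading, useful for testing patterns locally before a crawl: a URL is kept if it matches at least one include pattern (when any are given) and no exclude pattern. The function names and the use of `re.search` semantics are assumptions.

```python
import re


def compile_patterns(value):
    """Split a comma-separated string into compiled regular expressions."""
    if not value:
        return []
    return [re.compile(part.strip()) for part in value.split(",") if part.strip()]


def url_allowed(url, url_patterns=None, exclude_patterns=None):
    """Assumed semantics: match any include pattern, match no exclude pattern."""
    includes = compile_patterns(url_patterns)
    excludes = compile_patterns(exclude_patterns)
    if includes and not any(p.search(url) for p in includes):
        return False
    return not any(p.search(url) for p in excludes)


print(url_allowed("https://example.com/docs/intro", r"/docs/", r"\.pdf$"))
print(url_allowed("https://example.com/docs/guide.pdf", r"/docs/", r"\.pdf$"))
```

Because the values are comma-separated, a pattern that itself contains a comma (for example a `{1,3}` quantifier) would be split under this reading, so it may be safer to avoid commas inside individual patterns.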

same_domain_only
boolean | null
default:true

Only crawl pages on the same domain as start_url (default: true)

content_selector
string | null

CSS selector for main content area

remove_selectors
string | null

Comma-separated CSS selectors to remove

concurrent_requests
integer | null
default:3

Number of concurrent requests (default: 3, max: 10)

Required range: 1 <= x <= 10
request_delay
number | null
default:1

Delay between requests in seconds (default: 1.0)

Required range: x >= 0
timeout
integer | null
default:30

Request timeout in seconds (default: 30)

Required range: x >= 1
respect_robots_txt
boolean | null
default:true

Respect the target site's robots.txt rules (default: true)

parse_method
enum<string> | null

Parser used during ingestion

Available options: default, pypdf, pdfplumber, unstructured, pymupdf, deepdoc
chunk_strategy
enum<string> | null

Chunking strategy

Available options: recursive, fixed_size, markdown
chunk_size
integer | null

Chunk size in characters (default: 1000)

Required range: x > 0
chunk_overlap
integer | null

Chunk overlap (default: 200)

Required range: x >= 0
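
To show how `chunk_size` and `chunk_overlap` interact, here is a minimal sketch of fixed-size chunking with overlap. It is not the service's chunker, whose implementation is not documented here. The function name `chunk_text` is hypothetical, and the requirement that the overlap be smaller than the chunk size is an assumption of the sketch, not a documented constraint.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be > 0")
    if chunk_overlap < 0 or chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

With the defaults, each window advances by 800 characters, so a larger overlap produces more chunks (and more embeddings) for the same page.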
embedding_model_id
string
default:text-embedding-v4

Embedding model ID

embedding_batch_size
integer | null

Batch size for embedding (default: 10)

Required range: x > 0
max_retries
integer | null

Maximum retries for embedding failures (default: 3)

Required range: x >= 0
retry_delay
number | null

Delay between retries in seconds (default: 1.0)

Required range: x >= 0
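
The documented ranges and enum options can be checked on the client before a request is sent. The sketch below encodes only the constraints listed above. The names `RANGE_RULES`, `ENUM_RULES`, and `validate_ingest_params` are hypothetical, and the server may enforce additional rules.

```python
# Documented numeric constraints: (check, human-readable range).
RANGE_RULES = {
    "concurrent_requests": (lambda x: 1 <= x <= 10, "1 <= x <= 10"),
    "request_delay": (lambda x: x >= 0, "x >= 0"),
    "timeout": (lambda x: x >= 1, "x >= 1"),
    "chunk_size": (lambda x: x > 0, "x > 0"),
    "chunk_overlap": (lambda x: x >= 0, "x >= 0"),
    "embedding_batch_size": (lambda x: x > 0, "x > 0"),
    "max_retries": (lambda x: x >= 0, "x >= 0"),
    "retry_delay": (lambda x: x >= 0, "x >= 0"),
}

# Documented enum options.
ENUM_RULES = {
    "parse_method": {"default", "pypdf", "pdfplumber", "unstructured", "pymupdf", "deepdoc"},
    "chunk_strategy": {"recursive", "fixed_size", "markdown"},
}


def validate_ingest_params(params):
    """Return a list of problems; the list is empty when all documented constraints hold."""
    problems = []
    for name in ("collection", "start_url"):
        if not params.get(name):
            problems.append(f"{name} is required")
    for name, (check, expected) in RANGE_RULES.items():
        value = params.get(name)
        if value is not None and not check(value):
            problems.append(f"{name}={value} is outside the range {expected}")
    for name, allowed in ENUM_RULES.items():
        value = params.get(name)
        if value is not None and value not in allowed:
            problems.append(f"{name}={value!r} is not one of {sorted(allowed)}")
    return problems


print(validate_ingest_params({
    "collection": "docs",
    "start_url": "https://example.com/docs",
    "concurrent_requests": 11,
    "chunk_strategy": "semantic",
}))
```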

Response

Successful Response

Result of website crawling and knowledge base ingestion.

The response reports how many URLs were found, how many pages were crawled or failed, how many documents, chunks, and embeddings were created, and the total elapsed time.

status
string
required

Overall status: one of success, error, or partial

collection
string
required

Target collection name

total_urls_found
integer
required

Total number of unique URLs discovered

Required range: x >= 0
pages_crawled
integer
required

Number of successfully crawled pages

Required range: x >= 0
pages_failed
integer
required

Number of pages that failed to crawl

Required range: x >= 0
documents_created
integer
required

Number of documents created in collection

Required range: x >= 0
chunks_created
integer
required

Total number of chunks created

Required range: x >= 0
embeddings_created
integer
required

Total number of embeddings generated

Required range: x >= 0
message
string
required

Human-readable summary message

elapsed_time_ms
integer
required

Total elapsed time in milliseconds

Required range: x >= 0
crawled_urls
string[]

List of successfully crawled URLs

failed_urls
object

Map of failed URLs to error messages

warnings
string[]

Non-critical warnings
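
As a sketch of how a client might read this response, the example below parses a JSON body and prints a short summary, including per-URL errors from `failed_urls`. The sample payload uses made-up values, and the helper name `summarize` is hypothetical. The optional fields are read defensively because they are not marked as required.

```python
import json

# Hypothetical response body; all values are invented for illustration.
SAMPLE_RESPONSE = """
{
  "status": "partial",
  "collection": "docs",
  "total_urls_found": 3,
  "pages_crawled": 2,
  "pages_failed": 1,
  "documents_created": 2,
  "chunks_created": 14,
  "embeddings_created": 14,
  "message": "Crawled 2 of 3 pages",
  "elapsed_time_ms": 5321,
  "crawled_urls": ["https://example.com/a", "https://example.com/b"],
  "failed_urls": {"https://example.com/c": "timeout"},
  "warnings": []
}
"""


def summarize(result):
    """Return summary lines: one headline, then one line per failure and warning."""
    lines = [
        f"{result['status']}: {result['pages_crawled']} crawled, "
        f"{result['pages_failed']} failed ({result['total_urls_found']} URLs found)"
    ]
    # crawled_urls, failed_urls, and warnings are optional, so default to empty.
    for url, error in (result.get("failed_urls") or {}).items():
        lines.append(f"  failed {url}: {error}")
    for warning in result.get("warnings") or []:
        lines.append(f"  warning: {warning}")
    return lines


for line in summarize(json.loads(SAMPLE_RESPONSE)):
    print(line)
```

A status of partial with a non-empty `failed_urls` map suggests that re-running the crawl for just those URLs may be worthwhile.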