Sitemap Scraper Node
Last updated
The Sitemap Scraper Node automates content extraction from websites by using their XML sitemap. It handles both single sitemaps and nested sitemap indexes, and can filter URLs by a specified prefix, making it well suited to comprehensive site content extraction and documentation workflows.
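The core of this processing can be sketched in Python. The snippet below is a minimal illustration, not the node's actual implementation: it parses one sitemap document and reports whether it is a sitemap index (which points at further sitemaps) or a plain URL set. The function and variable names are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_urls(xml_text: str):
    """Parse one sitemap document and return (kind, urls).

    kind is "index" for a nested sitemap index (its URLs point at
    further sitemaps) or "urlset" for a plain page sitemap.
    """
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag.endswith("sitemapindex") else "urlset"
    urls = [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]
    return kind, urls

# Hypothetical sample sitemap for illustration.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
</urlset>"""

kind, urls = extract_urls(SAMPLE)
```

For a nested sitemap, a scraper would call `extract_urls` again on each URL it returns until only page URLs remain.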
Usage cost: 0.5 credits per scraped page
Sitemap URL*
URL to the XML sitemap
Must include protocol (http/https)
Example:
Starting With
Optional URL prefix filter
Only processes URLs beginning with this prefix
Example:
documents
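The Starting With filter behaves like a simple prefix match. A minimal sketch of that behavior, with hypothetical names and example URLs:

```python
def filter_by_prefix(urls, starting_with=None):
    """Keep only URLs that begin with the optional prefix filter.

    When no prefix is given, all URLs are processed.
    """
    if not starting_with:
        return list(urls)
    return [u for u in urls if u.startswith(starting_with)]

# Hypothetical URLs discovered in a sitemap.
urls = [
    "https://example.com/docs/intro",
    "https://example.com/docs/api",
    "https://example.com/blog/post-1",
]

docs_only = filter_by_prefix(urls, "https://example.com/docs/")
```

Only the two `/docs/` URLs survive the filter; with no prefix, all three would be scraped.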
(List[Document]): List of documents containing:
Processed page content
Source URL metadata
Extraction timestamps
document_contents
(List[string]):
List of extracted text content
One entry per successfully scraped page
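The shape of the two outputs can be sketched as follows. The `Document` fields and metadata keys here are assumptions for illustration; the node's actual schema may differ.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Document:
    page_content: str              # processed page content
    metadata: dict = field(default_factory=dict)  # source URL, timestamps

def build_outputs(scraped_pages):
    """Turn (url, text) pairs into the node's two outputs:
    a list of Documents and a parallel list of raw text contents."""
    documents = [
        Document(
            page_content=text,
            metadata={
                "source": url,  # source URL metadata
                "scraped_at": datetime.now(timezone.utc).isoformat(),
            },
        )
        for url, text in scraped_pages
    ]
    document_contents = [d.page_content for d in documents]
    return documents, document_contents

# Hypothetical scrape results.
docs, contents = build_outputs([
    ("https://example.com/docs/intro", "Welcome to the docs."),
    ("https://example.com/docs/api", "API reference."),
])
```

Note that `document_contents` has one entry per successfully scraped page, in the same order as `documents`.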
URL Management
Use root sitemap URL for complete site scraping
Employ URL filters for targeted content extraction
Verify sitemap accessibility before execution
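A quick pre-flight check covers the protocol requirement before any fetch is attempted. This is a sketch with an assumed function name, not part of the node itself:

```python
from urllib.parse import urlparse

def is_valid_sitemap_url(url: str) -> bool:
    """Check the URL includes an http/https protocol and a host,
    as the Sitemap URL input requires."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

ok = is_valid_sitemap_url("https://example.com/sitemap.xml")
```

Verifying actual accessibility additionally means fetching the URL and checking for a 200 response, which is best done once before launching a full scrape.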
Resource Management
Monitor processing time for large sitemaps
Consider memory usage with many pages
Use filtering to limit scope when needed
Common Issues
Malformed XML in sitemaps
Rate limiting from target websites
Memory constraints with large sites
Invalid sitemap URLs
Access restrictions
Nested sitemap processing failures
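Two of these issues, malformed XML and rate limiting, have standard mitigations that can be sketched as follows. The helper names are assumptions; the node may handle these differently internally.

```python
import time
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap_safely(xml_text: str):
    """Return the sitemap's URLs, or [] when the XML is malformed,
    so one bad sitemap does not abort a nested-sitemap run."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []
    return [loc.text for loc in root.findall(".//sm:loc", NS)]

def fetch_with_backoff(fetch, retries=3, base_delay=1.0):
    """Retry a fetch callable with exponential backoff, a common
    response to rate limiting from target websites."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:  # in practice, catch the specific fetch error
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

Memory constraints on large sites are usually addressed differently: by narrowing the scope with the Starting With filter rather than by retrying.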