Sitemap Scraper Node

Overview

The Sitemap Scraper Node automates content extraction from websites by using their XML sitemaps. It can process both single and nested sitemaps, and it can filter URLs by a specific prefix. This makes it well suited to comprehensive website content extraction and documentation workflows.
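To make the flow concrete, here is a minimal sketch of how a scraper might collect page URLs from a sitemap and apply a prefix filter. This is an illustration only, not the node's actual implementation; the function name `extract_urls` and the handling shown are assumptions. Nested sitemaps (`<sitemap>` entries in a sitemap index) would require fetching each child sitemap and repeating the same extraction.

```python
# Illustrative sketch, not the node's implementation: collect <loc> URLs
# from a sitemap XML document and keep only those matching a prefix.
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_urls(sitemap_xml: str, prefix: str = "") -> list[str]:
    """Return page URLs from a sitemap document, filtered by prefix."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        url = (loc.text or "").strip()
        if url.startswith(prefix):
            urls.append(url)
    return urls

example = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
</urlset>"""

# With a prefix filter, only the /docs/ page is kept.
print(extract_urls(example, prefix="https://example.com/docs/"))
```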

Usage cost: 0.5 credits per scraped page

Configuration

Settings

  1. Sitemap URL* (required): the URL of the XML sitemap to process

  2. Starting With (optional): a URL prefix; only pages whose URLs begin with this string are scraped

Output Ports

  1. documents (List[Document]): List of documents containing:

    • Processed page content

    • Source URL metadata

    • Extraction timestamps

  2. document_contents (List[string]):

    • List of extracted text content

    • One entry per successfully scraped page
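The shape of a single entry on the documents port might look like the sketch below. The field names (`page_content`, `source_url`, `scraped_at`) are assumptions based on the description above, not the node's documented schema.

```python
# Hypothetical shape of one entry on the `documents` output port.
# Field names are assumptions inferred from the port description.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Document:
    page_content: str   # processed page content
    source_url: str     # source URL metadata
    scraped_at: str     # extraction timestamp (ISO 8601)

doc = Document(
    page_content="Welcome to the docs.",
    source_url="https://example.com/docs/intro",
    scraped_at=datetime.now(timezone.utc).isoformat(),
)

# `document_contents` is then just the extracted text of each document,
# one entry per successfully scraped page.
document_contents = [doc.page_content]
print(document_contents)
```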

Best Practices

  1. URL Management

    • Use root sitemap URL for complete site scraping

    • Employ URL filters for targeted content extraction

    • Verify sitemap accessibility before execution

  2. Resource Management

    • Monitor processing time for large sitemaps

    • Consider memory usage with many pages

    • Use filtering to limit scope when needed
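Because usage is billed at 0.5 credits per scraped page, one practical way to limit scope is to count the matching URLs before running the node and bound the cost up front. The helper below is a hypothetical sketch of that arithmetic, not part of the node itself.

```python
# Hypothetical cost estimate: the node bills 0.5 credits per scraped page,
# so the total is simply the number of matching URLs times that rate.
CREDITS_PER_PAGE = 0.5

def estimate_cost(url_count: int) -> float:
    """Return the credit cost of scraping `url_count` pages."""
    return url_count * CREDITS_PER_PAGE

# A sitemap with 120 matching URLs would cost 60.0 credits to scrape.
print(estimate_cost(120))
```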

Common Issues

  • Malformed XML in sitemaps

  • Rate limiting from target websites

  • Memory constraints with large sites

  • Invalid sitemap URLs

  • Access restrictions

  • Nested sitemap processing failures
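For rate limiting in particular, a common mitigation is to retry failed requests with exponential backoff. The sketch below illustrates that pattern with a simulated fetcher; the node's actual retry behavior is not documented here, and `fetch_with_backoff` is a hypothetical helper.

```python
# Illustrative retry-with-exponential-backoff pattern for rate-limited
# fetches. Not the node's implementation; `fetch_with_backoff` is hypothetical.
import time

def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0):
    """Call `fetch(url)`; on failure wait base_delay, 2x, 4x, ... then retry."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries, surface the error
            time.sleep(base_delay * 2 ** attempt)

# Simulated fetcher that fails twice (like a 429 response) then succeeds.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "<html>ok</html>"

print(fetch_with_backoff(flaky_fetch, "https://example.com", base_delay=0.01))
```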
