Sitemap Scraper Node

Overview

The Sitemap Scraper Node automates content extraction from websites by using their XML sitemaps. It can process both single and nested sitemaps, and it can filter URLs by a specific prefix. This makes it well suited to comprehensive website content extraction and documentation workflows.
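To make the flow concrete, here is a minimal sketch of how a scraper might collect page URLs from a sitemap and apply a prefix filter. This is an illustration only, not the node's actual implementation; the function name `extract_urls` and the handling shown are assumptions. Nested sitemaps (`<sitemap>` entries in a sitemap index) would require fetching each child sitemap and repeating the same extraction.

```python
# Illustrative sketch, not the node's implementation: collect <loc> URLs
# from a sitemap XML document and keep only those matching a prefix.
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_urls(sitemap_xml: str, prefix: str = "") -> list[str]:
    """Return page URLs from a sitemap document, filtered by prefix."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        url = (loc.text or "").strip()
        if url.startswith(prefix):
            urls.append(url)
    return urls

example = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
</urlset>"""

# With a prefix filter, only the /docs/ page is kept.
print(extract_urls(example, prefix="https://example.com/docs/"))
```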

Usage cost: 0.5 credits per scraped page

Configuration

Settings

  1. Sitemap URL* (required): the URL of the XML sitemap to process

  2. Starting With (optional): a URL prefix; only pages whose URLs begin with this string are scraped

Output Ports

  1. documents (List[Document]): List of documents containing:

    • Processed page content

    • Source URL metadata

    • Extraction timestamps

  2. document_contents (List[string]):

    • List of extracted text content

    • One entry per successfully scraped page
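The shape of a single entry on the documents port might look like the sketch below. The field names (`page_content`, `source_url`, `scraped_at`) are assumptions based on the description above, not the node's documented schema.

```python
# Hypothetical shape of one entry on the `documents` output port.
# Field names are assumptions inferred from the port description.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Document:
    page_content: str   # processed page content
    source_url: str     # source URL metadata
    scraped_at: str     # extraction timestamp (ISO 8601)

doc = Document(
    page_content="Welcome to the docs.",
    source_url="https://example.com/docs/intro",
    scraped_at=datetime.now(timezone.utc).isoformat(),
)

# `document_contents` is then just the extracted text of each document,
# one entry per successfully scraped page.
document_contents = [doc.page_content]
print(document_contents)
```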

Best Practices

  1. URL Management

    • Use root sitemap URL for complete site scraping

    • Employ URL filters for targeted content extraction

    • Verify sitemap accessibility before execution

  2. Resource Management

    • Monitor processing time for large sitemaps

    • Consider memory usage with many pages

    • Use filtering to limit scope when needed
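Because usage is billed at 0.5 credits per scraped page, one practical way to limit scope is to count the matching URLs before running the node and bound the cost up front. The helper below is a hypothetical sketch of that arithmetic, not part of the node itself.

```python
# Hypothetical cost estimate: the node bills 0.5 credits per scraped page,
# so the total is simply the number of matching URLs times that rate.
CREDITS_PER_PAGE = 0.5

def estimate_cost(url_count: int) -> float:
    """Return the credit cost of scraping `url_count` pages."""
    return url_count * CREDITS_PER_PAGE

# A sitemap with 120 matching URLs would cost 60.0 credits to scrape.
print(estimate_cost(120))
```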

Common Issues

  • Malformed XML in sitemaps

  • Rate limiting from target websites

  • Memory constraints with large sites

  • Invalid sitemap URLs

  • Access restrictions

  • Nested sitemap processing failures
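For rate limiting in particular, a common mitigation is to retry failed requests with exponential backoff. The sketch below illustrates that pattern with a simulated fetcher; the node's actual retry behavior is not documented here, and `fetch_with_backoff` is a hypothetical helper.

```python
# Illustrative retry-with-exponential-backoff pattern for rate-limited
# fetches. Not the node's implementation; `fetch_with_backoff` is hypothetical.
import time

def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0):
    """Call `fetch(url)`; on failure wait base_delay, 2x, 4x, ... then retry."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries, surface the error
            time.sleep(base_delay * 2 ** attempt)

# Simulated fetcher that fails twice (like a 429 response) then succeeds.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "<html>ok</html>"

print(fetch_with_backoff(flaky_fetch, "https://example.com", base_delay=0.01))
```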
