# Sitemap Scraper Node

### Overview

The Sitemap Scraper Node extracts content from a website by reading its XML sitemap. It handles both single and nested sitemaps, and can restrict scraping to URLs that begin with a given prefix. This makes it well suited to comprehensive website content extraction and documentation purposes.

Usage cost: 0.5 credits per scraped page
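
To budget a run before starting it, the per-page rate above can be multiplied by the number of URLs expected to match. The following is a minimal sketch; `estimate_credits` and `CREDITS_PER_PAGE` are hypothetical names used for illustration and are not part of the node itself.

```python
# Rate taken from the usage cost stated above.
CREDITS_PER_PAGE = 0.5


def estimate_credits(page_count: int) -> float:
    """Return the expected credit cost for scraping `page_count` pages."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return page_count * CREDITS_PER_PAGE


# A sitemap with 240 matching URLs would cost about 120 credits.
print(estimate_credits(240))
```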

### Configuration

#### Settings

1. **Sitemap URL**\*
   * URL to the XML sitemap
   * Must include protocol (http/https)
   * Example: <https://www.example.com/sitemap.xml>
2. **Starting With**
   * Optional URL prefix filter
   * Only processes URLs beginning with this prefix
   * Example: <https://www.example.com/features/>
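
The two settings above can be pictured with a short sketch of how sitemap URLs might be collected and filtered. This is not the node's actual implementation: `collect_urls` and the `fetch` callback are hypothetical, and the sketch assumes the sitemap follows the standard sitemaps.org schema, in which a `<sitemapindex>` lists nested sitemaps and a `<urlset>` lists pages.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    starting_with: Optional[str] = None,
    _seen: Optional[Set[str]] = None,
) -> List[str]:
    """Return page URLs from a sitemap, following nested sitemaps.

    `fetch` is a hypothetical callback that returns the XML text for a URL.
    `starting_with` mirrors the optional prefix filter described above.
    """
    seen = set() if _seen is None else _seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    urls: List[str] = []
    if root.tag.endswith("sitemapindex"):
        # Nested case: each <sitemap><loc> points at another sitemap.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            child = (loc.text or "").strip()
            if child:
                urls.extend(collect_urls(child, fetch, starting_with, seen))
    else:
        # Leaf case: each <url><loc> is a page to scrape.
        for loc in root.findall("sm:url/sm:loc", NS):
            url = (loc.text or "").strip()
            if url and (starting_with is None or url.startswith(starting_with)):
                urls.append(url)
    return urls


# In-memory stand-in for HTTP so the example runs without network access.
SITEMAPS = {
    "https://www.example.com/sitemap.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://www.example.com/sitemap-pages.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://www.example.com/sitemap-pages.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://www.example.com/features/search</loc></url>"
        "<url><loc>https://www.example.com/blog/hello</loc></url>"
        "</urlset>"
    ),
}

filtered = collect_urls(
    "https://www.example.com/sitemap.xml",
    SITEMAPS.__getitem__,
    starting_with="https://www.example.com/features/",
)
print(filtered)
```

With the prefix set, only the `/features/` page is kept; leaving `starting_with` unset returns every URL found in the nested sitemaps.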

#### Output Ports

1. `documents` (List\[Document]): List of documents containing:
   * Processed page content
   * Source URL metadata
   * Extraction timestamps
2. `document_contents` (List\[string]):
   * List of extracted text content
   * One entry per successfully scraped page
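
As an illustration of how the two ports relate, the sketch below models a document and derives the plain-text list from it. The `Document` class and its field names (`page_content`, `metadata`, `source`, `extracted_at`) are assumptions made for this example; the node's real schema may differ.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class Document:
    """Hypothetical shape of one entry on the `documents` port."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def to_document_contents(documents: List[Document]) -> List[str]:
    """Derive a `document_contents`-style list: one text entry per document."""
    return [doc.page_content for doc in documents]


documents = [
    Document(
        page_content="Search lets you find pages quickly.",
        metadata={
            "source": "https://www.example.com/features/search",
            "extracted_at": datetime.now(timezone.utc).isoformat(),
        },
    ),
]

document_contents = to_document_contents(documents)
print(document_contents)
```

Downstream nodes that only need text can consume `document_contents`, while nodes that need the source URL or timestamp would use `documents`.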

### Best Practices

1. **URL Management**
   * Use root sitemap URL for complete site scraping
   * Employ URL filters for targeted content extraction
   * Verify sitemap accessibility before execution
2. **Resource Management**
   * Monitor processing time for large sitemaps
   * Consider memory usage with many pages
   * Use filtering to limit scope when needed

### Common Issues

* Malformed XML in sitemaps
* Rate limiting from target websites
* Memory constraints with large sites
* Invalid sitemap URLs
* Access restrictions
* Nested sitemap processing failures
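
Several of these issues (invalid URLs, malformed XML) can be caught with a quick check before running the node. The sketch below is one possible pre-flight check, not a feature of the node; `check_sitemap` is a hypothetical helper that takes the sitemap URL and its already-downloaded XML text.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlparse


def check_sitemap(sitemap_url: str, xml_text: str) -> List[str]:
    """Return a list of problems found; an empty list means no issue detected."""
    problems: List[str] = []

    # The Sitemap URL setting must include the protocol (http/https).
    parsed = urlparse(sitemap_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("URL must include http:// or https:// and a host")

    # Malformed XML raises ParseError in the standard-library parser.
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        problems.append(f"malformed XML: {exc}")
        return problems

    # Strip the namespace, if any, and check the root element name.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name not in ("urlset", "sitemapindex"):
        problems.append(f"unexpected root element <{local_name}>")
    return problems


good = check_sitemap(
    "https://www.example.com/sitemap.xml",
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"></urlset>',
)
bad = check_sitemap("www.example.com/sitemap.xml", "<urlset>")
print(good)
print(bad)
```

Rate limiting and access restrictions depend on the target server's responses and are not covered by this check.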

