
Sitemap Scraper Node


Last updated 3 months ago

Overview

The Sitemap Scraper Node extracts content from a website by reading its XML sitemap and scraping the pages listed there. It handles both single sitemaps and nested sitemaps, and it can restrict scraping to URLs that begin with a specific prefix. This makes it well suited to extracting a whole site, or one section of it such as a documentation area.

Usage cost: 0.5 credit per scraped page
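
As a rough illustration of the pricing above, the cost of a run can be estimated from the number of pages that end up being scraped. The helper name `estimate_credits` is hypothetical and not part of Waterflai; only the 0.5 rate comes from this page, and the sketch assumes that only scraped pages are billed.

```python
from decimal import Decimal

# Rate stated on this page: 0.5 credit per scraped page.
CREDITS_PER_PAGE = Decimal("0.5")


def estimate_credits(scraped_pages: int) -> Decimal:
    """Estimate the credits for one run (hypothetical helper).

    Assumes only pages that are actually scraped are billed.
    """
    if scraped_pages < 0:
        raise ValueError("scraped_pages must be non-negative")
    return CREDITS_PER_PAGE * scraped_pages
```

Under that assumption, a sitemap that yields 240 pages after filtering would cost about 120 credits.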

Configuration

Settings

  1. Sitemap URL*

    • URL of the XML sitemap

    • Must include the protocol (http:// or https://)

    • Example: https://www.example.com/sitemap.xml

  2. Starting With

    • Optional URL prefix filter

    • When set, only URLs that begin with this prefix are processed

    • Example: https://www.example.com/features/
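
To make the two settings concrete, here is a minimal sketch of how a sitemap can be walked and filtered by prefix. It is not Waterflai's implementation: the function name `collect_urls` and the `fetch` callable (which returns the XML text for a URL) are assumptions made for illustration, and the sketch assumes sitemaps that use the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set

# Namespace used by standard sitemaps.org XML sitemaps.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    starting_with: Optional[str] = None,
    _seen: Optional[Set[str]] = None,
) -> List[str]:
    """Return page URLs from a sitemap, following nested sitemaps.

    `fetch` is a hypothetical callable that returns the XML text of a URL.
    """
    seen = _seen if _seen is not None else set()
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    locs = [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]

    # A sitemap index lists other sitemaps; recurse into each of them.
    if root.tag == f"{SITEMAP_NS}sitemapindex":
        urls: List[str] = []
        for child_sitemap in locs:
            urls.extend(collect_urls(child_sitemap, fetch, starting_with, seen))
        return urls

    # A regular sitemap lists pages; apply the optional prefix filter.
    return [u for u in locs if starting_with is None or u.startswith(starting_with)]
```

Passing the fetcher in as a parameter keeps the sketch testable without network access; in practice it could wrap any HTTP client.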

Output Ports

  1. documents (List[Document]): List of documents containing:

    • Processed page content

    • Source URL metadata

    • Extraction timestamps

  2. document_contents (List[string]):

    • List of extracted text content

    • One entry per successfully scraped page
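
The relationship between the two ports can be pictured with the sketch below. The `Document` class, its field names (`page_content`, `metadata`) and the metadata keys used in the usage note are hypothetical stand-ins, since this page does not specify the exact schema; the sketch also assumes that `document_contents` holds the text of each document in the same order.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for one entry of the `documents` port."""

    page_content: str  # processed page content
    metadata: Dict[str, str] = field(default_factory=dict)  # e.g. source URL, timestamp


def to_document_contents(documents: List[Document]) -> List[str]:
    """Derive a `document_contents`-style list: one text entry per document."""
    return [doc.page_content for doc in documents]
```

For instance, a single `Document("Hello", {"source": "https://www.example.com/features/", "extracted_at": "2024-01-01T00:00:00Z"})` would map to the one-element list `["Hello"]`.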

Best Practices

  1. URL Management

    • Use the root sitemap URL to scrape a complete site

    • Use the Starting With prefix filter to target a specific section

    • Verify that the sitemap is accessible before running the flow

  2. Resource Management

    • Monitor processing time for large sitemaps

    • Keep memory usage in mind when a sitemap lists many pages

    • Use filtering to limit scope when needed
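
Some of these checks can be done before the node runs. The sketch below shows two offline pre-flight checks: that a URL carries an http or https protocol, and that downloaded text parses as XML with a sitemap root element. The function names are hypothetical and the checks are illustrative only; they do not reproduce the node's own validation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def check_sitemap_url(url: str) -> bool:
    """Return True if the URL has an http/https protocol and a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def check_sitemap_xml(xml_text: str) -> bool:
    """Return True if the text is well-formed XML rooted at a sitemap element."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # malformed XML
    # Strip the "{namespace}" prefix that ElementTree adds to tag names.
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name in ("urlset", "sitemapindex")
```

Running such checks first avoids starting a flow on a sitemap URL that is missing its protocol or on a response that is not a sitemap at all, such as an HTML error page.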

Common Issues

  • Malformed XML in sitemaps

  • Rate limiting from target websites

  • Memory constraints with large sites

  • Invalid sitemap URLs

  • Access restrictions

  • Nested sitemap processing failures

Example Sitemap URL: https://www.example.com/sitemap.xml
Example Starting With prefix: https://www.example.com/features/