Webpage scraper Node

Overview

The Webpage scraper Node fetches a web page and converts its HTML into Markdown-formatted text. Two options control whether links and images are kept in the output, so the same node can produce either a faithful Markdown copy of the page or a cleaner text-only version.

Usage cost: 1 credit

Configuration

Settings

  1. URL Configuration

    • URL*: Web page address to scrape

    • Supports variable interpolation

    • Must be publicly accessible

  2. Content Processing Options

    • Ignore Links: Exclude hyperlinks from output text

    • Ignore Images: Exclude image content from output text
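The effect of the two content processing options can be illustrated with a small standalone sketch. This is not Waterflai's implementation: the class `MarkdownExtractor` and the function `html_to_markdown` are hypothetical names, only a handful of tags are handled, and the sketch assumes that ignoring links keeps the anchor text while dropping the URL.

```python
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    """Tiny HTML-to-Markdown converter (illustrative only, not Waterflai's code)."""

    def __init__(self, ignore_links=False, ignore_images=False):
        super().__init__()
        self.ignore_links = ignore_links
        self.ignore_images = ignore_images
        self.parts = []   # output fragments, joined at the end
        self.hrefs = []   # hrefs of the currently open <a> tags
        self.skip = 0     # depth inside <script>/<style>, whose text is dropped

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self.skip += 1
        elif tag == "a":
            self.hrefs.append(attrs.get("href"))
            if not self.ignore_links:
                self.parts.append("[")
        elif tag == "img" and not self.ignore_images:
            self.parts.append(f"![{attrs.get('alt') or ''}]({attrs.get('src') or ''})")
        elif tag in ("h1", "h2", "h3"):
            self.parts.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
        elif tag == "a" and self.hrefs:
            href = self.hrefs.pop()
            if not self.ignore_links:
                self.parts.append(f"]({href or ''})")
        elif tag in ("p", "h1", "h2", "h3"):
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)


def html_to_markdown(html, ignore_links=False, ignore_images=False):
    """Convert an HTML string to Markdown-style text, honoring the two flags."""
    parser = MarkdownExtractor(ignore_links, ignore_images)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

With both flags off, a paragraph containing a link and an image keeps its `[text](url)` and `![alt](src)` markup. With both flags on, only the readable text remains, which is usually the better input for an LLM or a splitter node.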

Output Ports

  1. document (Document): Complete document object containing:

    • Page content

    • Metadata (URL, timestamps)

  2. document_content (string):

    • Extracted text content

    • Processed according to link/image settings

Best Practices

  1. URL Management

    • Verify URL accessibility before execution

    • Use complete URLs including protocol (http/https)

    • Consider URL encoding for special characters

  2. Content Extraction

    • Enable Ignore Links and Ignore Images for cleaner text

    • Monitor content size for large pages
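The URL Management points above can be checked before a URL reaches the node, for example in an upstream step or in the calling application. The following is a minimal sketch using only the Python standard library; `prepare_url` is a hypothetical helper, not part of Waterflai, and the set of characters left unencoded is an assumption that may need adjusting for a given site.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def prepare_url(raw):
    """Return a cleaned URL, or raise ValueError if it is unusable."""
    parts = urlsplit(raw.strip())
    # Require a complete URL including the protocol.
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"URL must start with http:// or https://: {raw!r}")
    if not parts.netloc:
        raise ValueError(f"URL has no host: {raw!r}")
    # Percent-encode spaces and non-ASCII characters. '%' stays in the safe
    # set so that escapes already present are not encoded a second time.
    path = quote(parts.path, safe="/%:@!$&'()*+,;=~-._")
    query = quote(parts.query, safe="=&%+,;:@/?!$'()*~-._")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))
```

For example, `prepare_url("https://example.com/my page?q=a b")` returns `https://example.com/my%20page?q=a%20b`, while `prepare_url("example.com/page")` raises `ValueError` because the protocol is missing.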

Common Issues

  • JavaScript-rendered content not captured

  • Malformed or invalid URLs

  • Access restrictions (403 errors)

  • SSL/TLS certificate issues
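When debugging a URL outside the node, it can help to map a failed fetch onto the issues listed above. The sketch below assumes the fetch is attempted with Python's `urllib.request.urlopen`; `diagnose_fetch_error` is a hypothetical helper and the messages are illustrative. It does not detect JavaScript-rendered pages, which typically fetch without error but yield little or no text.

```python
import ssl
import urllib.error


def diagnose_fetch_error(exc):
    """Map an exception raised by a fetch attempt to a short hint."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        if exc.code == 403:
            return "access restricted (403): the site refuses this client"
        return f"HTTP error {exc.code}"
    if isinstance(exc, urllib.error.URLError):
        # urlopen wraps certificate failures in URLError.reason.
        if isinstance(exc.reason, ssl.SSLError):
            return "SSL/TLS certificate problem"
        return f"could not reach host: {exc.reason}"
    if isinstance(exc, ssl.SSLError):
        return "SSL/TLS certificate problem"
    # urlopen raises ValueError for strings it cannot parse as a URL.
    if isinstance(exc, ValueError):
        return "malformed or invalid URL"
    return "unexpected error"
```

A typical use is to wrap `urllib.request.urlopen(url, timeout=10)` in a `try` block and pass the caught exception to `diagnose_fetch_error`.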


Last updated 3 months ago