Webpage scraper Node

Overview

The Web Scraper Node extracts and processes content from web pages, converting the HTML into Markdown-formatted plain text. It provides options to control how links and images are handled during extraction, making it versatile for a wide range of web content extraction needs.
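
The node's internal implementation is not documented here, but its behavior can be approximated with off-the-shelf libraries. The sketch below is an illustration only, assuming a Python environment with requests and html2text installed: it fetches a page and converts the HTML to Markdown, mirroring the link/image options described under Settings.

```python
# Minimal sketch of the node's behavior using requests and html2text.
# These libraries are assumptions for illustration, not the node's actual internals.
import requests
import html2text

def scrape_page(url: str, ignore_links: bool = False, ignore_images: bool = False) -> str:
    """Fetch a web page and convert its HTML to Markdown-formatted text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    converter = html2text.HTML2Text()
    converter.ignore_links = ignore_links    # drop hyperlinks from the output
    converter.ignore_images = ignore_images  # drop image references from the output
    return converter.handle(response.text)

print(scrape_page("https://example.com", ignore_links=True))
```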

Usage cost: 1 credit

Configuration

Settings

  1. URL Configuration

    • URL*: Web page address to scrape

    • Supports variable interpolation

    • Must be publicly accessible

  2. Content Processing Options

    • Ignore Links: Exclude hyperlinks from output text

    • Ignore Images: Exclude image content from output text
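
To show what the two content processing toggles do, here is a small hedged example. It uses html2text as a stand-in for the node's converter (an assumption, since the converter is not specified) to render the same HTML with and without link and image stripping.

```python
# Effect of the "Ignore Links" / "Ignore Images" toggles on a small HTML snippet,
# with html2text standing in for the node's converter.
import html2text

html = ('<p>See the <a href="https://example.com/docs">docs</a> '
        '<img src="https://example.com/logo.png" alt="logo"></p>')

default = html2text.HTML2Text()
print(default.handle(html))
# Roughly: See the [docs](https://example.com/docs) ![logo](https://example.com/logo.png)

stripped = html2text.HTML2Text()
stripped.ignore_links = True   # equivalent to enabling "Ignore Links"
stripped.ignore_images = True  # equivalent to enabling "Ignore Images"
print(stripped.handle(html))
# Roughly: See the docs
```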

Output Ports

  1. document (Document): Complete document object containing:

    • Page content

    • Metadata (URL, timestamps)

  2. document_content (string):

    • Extracted text content

    • Processed according to link/image settings
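
As a rough illustration of what downstream nodes receive, the sketch below models the two ports as Python values. The key names (page_content, scraped_at) are assumptions for illustration, not documented fields.

```python
# Hypothetical shape of the two output ports, based on the fields listed above;
# exact key names may differ in the actual node.
from datetime import datetime, timezone

document = {
    "page_content": "# Example Domain\n\nThis domain is for use in illustrative examples...",
    "metadata": {
        "url": "https://example.com",
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    },
}

# document_content carries the extracted text on its own, already processed
# according to the Ignore Links / Ignore Images settings.
document_content = document["page_content"]
```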

Best Practices

  1. URL Management

    • Verify URL accessibility before execution

    • Use complete URLs including protocol (http/https)

    • Consider URL encoding for special characters (see the sketch after this list)

  2. Content Extraction

    • Enable Ignore Links and Ignore Images for cleaner text output

    • Monitor output size when scraping large pages
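
The pre-flight checks suggested above can be scripted. This is a hedged sketch using requests and Python's urllib (assumed available); the node itself does not require this step.

```python
# Encode special characters and verify the URL is reachable before the node runs.
import requests
from urllib.parse import quote, urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Percent-encode the path and query so special characters are safe."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        quote(parts.path, safe="/%"),
        quote(parts.query, safe="=&%"),
        parts.fragment,
    ))

def is_reachable(url: str) -> bool:
    """Return True if the URL answers a HEAD request with a non-error status."""
    try:
        response = requests.head(url, timeout=10, allow_redirects=True)
        return response.status_code < 400
    except requests.RequestException:
        return False

url = normalize_url("https://example.com/search?q=café au lait")
print(url, is_reachable(url))
```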

Common Issues

  • JavaScript-rendered content not captured

  • Malformed or invalid URLs

  • Access restrictions (403 errors)

  • SSL/TLS certificate issues
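
When one of these issues occurs, fetching the same URL outside the node can help pinpoint the cause. The example below is only a diagnostic aid using requests (an assumption; the node reports its own errors) and maps common exceptions to the failure modes listed above.

```python
# Classify common failure modes when fetching a URL with requests.
import requests

def diagnose(url: str) -> str:
    try:
        response = requests.get(url, timeout=15)
        response.raise_for_status()
    except requests.exceptions.MissingSchema:
        return "Malformed URL: include the protocol, e.g. https://"
    except requests.exceptions.SSLError:
        return "SSL/TLS certificate problem on the target site"
    except requests.exceptions.HTTPError as err:
        if err.response is not None and err.response.status_code == 403:
            return "Access restricted (403): the page blocks automated requests"
        return f"HTTP error: {err}"
    except requests.exceptions.RequestException as err:
        return f"Request failed: {err}"
    # A successful fetch can still be missing JavaScript-rendered content,
    # since the response only contains the initial HTML.
    return "Fetched OK (static HTML only)"

print(diagnose("https://example.com"))
```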
