Webpage scraper Node

Overview

The Web Scraper Node extracts and processes content from web pages, converting the HTML into Markdown-formatted plain text. It provides options to control how links and images are handled during extraction, making it versatile for a wide range of web content extraction needs.
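
The node's internal implementation is not documented here, but its behavior can be approximated with off-the-shelf libraries. The sketch below is an illustration only, assuming a Python environment with requests and html2text installed: it fetches a page and converts the HTML to Markdown, mirroring the link/image options described under Settings.

```python
# Minimal sketch of the node's behavior using requests and html2text.
# These libraries are assumptions for illustration, not the node's actual internals.
import requests
import html2text

def scrape_page(url: str, ignore_links: bool = False, ignore_images: bool = False) -> str:
    """Fetch a web page and convert its HTML to Markdown-formatted text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    converter = html2text.HTML2Text()
    converter.ignore_links = ignore_links    # drop hyperlinks from the output
    converter.ignore_images = ignore_images  # drop image references from the output
    return converter.handle(response.text)

print(scrape_page("https://example.com", ignore_links=True))
```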

Usage cost: 1 credit

Configuration

Settings

  1. URL Configuration

    • URL*: Web page address to scrape

    • Supports variable interpolation

    • Must be publicly accessible

  2. Content Processing Options

    • Ignore Links: Exclude hyperlinks from output text

    • Ignore Images: Exclude image content from output text
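
To show what the two content processing toggles do, here is a small hedged example. It uses html2text as a stand-in for the node's converter (an assumption, since the converter is not specified) to render the same HTML with and without link and image stripping.

```python
# Effect of the "Ignore Links" / "Ignore Images" toggles on a small HTML snippet,
# with html2text standing in for the node's converter.
import html2text

html = ('<p>See the <a href="https://example.com/docs">docs</a> '
        '<img src="https://example.com/logo.png" alt="logo"></p>')

default = html2text.HTML2Text()
print(default.handle(html))
# Roughly: See the [docs](https://example.com/docs) ![logo](https://example.com/logo.png)

stripped = html2text.HTML2Text()
stripped.ignore_links = True   # equivalent to enabling "Ignore Links"
stripped.ignore_images = True  # equivalent to enabling "Ignore Images"
print(stripped.handle(html))
# Roughly: See the docs
```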

Output Ports

  1. document (Document): Complete document object containing:

    • Page content

    • Metadata (URL, timestamps)

  2. document_content (string):

    • Extracted text content

    • Processed according to link/image settings
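
As a rough illustration of what downstream nodes receive, the sketch below models the two ports as Python values. The key names (page_content, scraped_at) are assumptions for illustration, not documented fields.

```python
# Hypothetical shape of the two output ports, based on the fields listed above;
# exact key names may differ in the actual node.
from datetime import datetime, timezone

document = {
    "page_content": "# Example Domain\n\nThis domain is for use in illustrative examples...",
    "metadata": {
        "url": "https://example.com",
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    },
}

# document_content carries the extracted text on its own, already processed
# according to the Ignore Links / Ignore Images settings.
document_content = document["page_content"]
```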

Best Practices

  1. URL Management

    • Verify URL accessibility before execution

    • Use complete URLs including protocol (http/https)

    • Consider URL encoding for special characters (see the sketch after this list)

  2. Content Extraction

    • Enable Ignore Links and Ignore Images for cleaner text output

    • Monitor output size when scraping large pages
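
The pre-flight checks suggested above can be scripted. This is a hedged sketch using requests and Python's urllib (assumed available); the node itself does not require this step.

```python
# Encode special characters and verify the URL is reachable before the node runs.
import requests
from urllib.parse import quote, urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Percent-encode the path and query so special characters are safe."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        quote(parts.path, safe="/%"),
        quote(parts.query, safe="=&%"),
        parts.fragment,
    ))

def is_reachable(url: str) -> bool:
    """Return True if the URL answers a HEAD request with a non-error status."""
    try:
        response = requests.head(url, timeout=10, allow_redirects=True)
        return response.status_code < 400
    except requests.RequestException:
        return False

url = normalize_url("https://example.com/search?q=café au lait")
print(url, is_reachable(url))
```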

Common Issues

  • JavaScript-rendered content not captured

  • Malformed or invalid URLs

  • Access restrictions (403 errors)

  • SSL/TLS certificate issues
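
When one of these issues occurs, fetching the same URL outside the node can help pinpoint the cause. The example below is only a diagnostic aid using requests (an assumption; the node reports its own errors) and maps common exceptions to the failure modes listed above.

```python
# Classify common failure modes when fetching a URL with requests.
import requests

def diagnose(url: str) -> str:
    try:
        response = requests.get(url, timeout=15)
        response.raise_for_status()
    except requests.exceptions.MissingSchema:
        return "Malformed URL: include the protocol, e.g. https://"
    except requests.exceptions.SSLError:
        return "SSL/TLS certificate problem on the target site"
    except requests.exceptions.HTTPError as err:
        if err.response is not None and err.response.status_code == 403:
            return "Access restricted (403): the page blocks automated requests"
        return f"HTTP error: {err}"
    except requests.exceptions.RequestException as err:
        return f"Request failed: {err}"
    # A successful fetch can still be missing JavaScript-rendered content,
    # since the response only contains the initial HTML.
    return "Fetched OK (static HTML only)"

print(diagnose("https://example.com"))
```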
