Character Splitter Node

Overview

The Character Splitter Node splits text or documents into smaller chunks using a recursive character-based approach. It's particularly useful for preparing text for LLMs with token limits, or for processing long documents in smaller segments.

Usage cost: 1 credit
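
Recursive character splitting tries the most specific separator first and falls back to more general ones until every piece fits within the chunk size. The node's internals aren't exposed, but the behavior matches widely used splitters such as LangChain's RecursiveCharacterTextSplitter; the sketch below uses that library as a stand-in, not as the node's confirmed implementation:

```python
# Sketch of the recursive splitting idea, assuming LangChain's
# RecursiveCharacterTextSplitter as a stand-in for the node's internals.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,    # maximum characters per chunk
    chunk_overlap=20,  # characters shared between consecutive chunks
    separators=["\n\n", "\n", " ", ""],  # tried in order, most to least specific
)

text = "First paragraph of a long document.\n\nSecond paragraph.\n\nThird paragraph."
for i, chunk in enumerate(splitter.split_text(text)):
    print(f"chunk {i}: {len(chunk)} chars")
```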

Configuration

Settings

  1. Chunk Configuration

    • Chunk Size*: Maximum number of characters per chunk

    • Chunk Overlap*: Number of characters shared between consecutive chunks to preserve context

    • Separators: Comma-separated list of strings that define where the text may be split

  2. Input Selection

    • Documents/Text to Split*: Select one or more inputs to process

    • Supports:

      • Single text strings

      • Document objects

      • Arrays of text strings

      • Arrays of documents
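
Whatever shape the input takes, it is normalized to a list of documents before splitting. A sketch of how each supported shape maps onto a splitter call, again using LangChain types as assumed stand-ins for the node's internal representation:

```python
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

# Single text string: wrap it in a one-element list
docs = splitter.create_documents(["one long string of text"])

# Array of text strings: one Document is created per string
docs = splitter.create_documents(["first text", "second text"])

# Document objects (single or array): split directly, metadata is kept
inputs = [Document(page_content="a long document", metadata={"source": "a.txt"})]
docs = splitter.split_documents(inputs)
```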

Output Ports

  • split_documents (Document[]): Array of split documents

    • Each document maintains original metadata

    • Chunks respect natural text boundaries based on separators
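
The metadata guarantee can be sanity-checked with the same stand-in splitter: every chunk produced from a document carries a copy of that document's metadata (this mirrors LangChain's behavior; whether the node adds extra fields of its own is not documented here):

```python
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=10)

source = Document(
    page_content="One section of a long report.\n\n" * 8,  # long enough for several chunks
    metadata={"source": "report.pdf", "page": 3},
)

for chunk in splitter.split_documents([source]):
    # every split document keeps a copy of the original metadata
    assert chunk.metadata == {"source": "report.pdf", "page": 3}
```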

Best Practices

  1. Chunk Size Selection

    • Consider model token limits

    • Balance information density: larger chunks keep more context together, while smaller chunks fit more easily into prompts

    • Account for desired context window

    • Test with representative content (a token-budget sketch follows this list)

  2. Overlap Configuration

    • Use overlap to maintain context

    • Consider semantic boundaries

    • Avoid overly large overlaps, which mostly duplicate content and waste tokens (see the overlap demonstration after this list)

  3. Separator Usage

    • Use natural text boundaries

    • Consider document structure

    • Common separators: "\n\n", "\n", ".", "!", "?"

    • Order separators from specific to general
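
For chunk size selection, a common rule of thumb is roughly 4 characters per token in English text, so a character budget can be derived from a model's token limit. A rough sketch; the 4:1 ratio and the safety margin are heuristics, not guarantees, so verify against the tokenizer of the model you actually use:

```python
def chars_for_token_budget(max_tokens: int, chars_per_token: float = 4.0,
                           safety_margin: float = 0.9) -> int:
    """Approximate a character chunk size that fits a token budget."""
    return int(max_tokens * chars_per_token * safety_margin)

# Leave headroom for the prompt template wrapped around each chunk
print(chars_for_token_budget(512))  # -> 1843
```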
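
Overlap and separators interact: separators decide where cuts may fall, and overlap repeats trailing text (aligned to those boundaries) at the start of the next chunk. A small demonstration with the same stand-in splitter:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=80,
    chunk_overlap=20,
    # ordered from most specific (paragraph break) to least specific (any character)
    separators=["\n\n", "\n", ". ", " ", ""],
)

text = (
    "Chunking works best at natural boundaries. "
    "Paragraph breaks beat sentence breaks. "
    "Sentence breaks beat arbitrary cuts in the middle of a word."
)

chunks = splitter.split_text(text)
for prev, nxt in zip(chunks, chunks[1:]):
    # trailing text from one chunk reappears at the start of the next
    print(repr(prev[-20:]), "->", repr(nxt[:20]))
```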

Common Issues

  • Memory pressure when splitting very large documents in a single pass (a batching sketch follows this list)

  • Inconsistent chunk sizes when separators are sparse or ordered from general to specific

  • Improper handling of special characters, e.g., non-breaking spaces or control characters that never match the configured separators
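
For the memory issue in particular, one workflow-level mitigation (an assumption about pipeline design, not a documented node feature) is to split large document sets in batches rather than in a single pass:

```python
from typing import Iterable, Iterator, List

from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_in_batches(docs: Iterable[Document],
                     batch_size: int = 50) -> Iterator[List[Document]]:
    """Yield split chunks batch by batch to bound peak memory use."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    batch: List[Document] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield splitter.split_documents(batch)
            batch = []
    if batch:
        yield splitter.split_documents(batch)
```

Each yielded batch can be handed downstream immediately, so peak memory stays proportional to the batch size rather than to the whole corpus.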
