OCR Node

Overview

The OCR Node empowers your flows to extract text content from PDF files and images. By leveraging either dedicated OCR APIs or multimodal Large Language Models, this node can process a variety of visual document formats. It handles single files or lists of files, providing structured text output for further analysis or integration within your AI applications.

Usage cost: 1 credit

Configuration Settings

Input Source

  • Input Source*: Select a variable holding the base64-encoded data URL (or a list of such URLs) of the PDF file(s) or image(s) you want to process.

    • Accepts: string (single base64 data URL), string[] (list of base64 data URLs), file (single file object), file[] (list of file objects), image (single image object), image[] (list of image objects), pdf (single PDF object).

    • Behind the scenes, file, image, and PDF objects are converted to their base64 data URL representations.
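
If you are preparing inputs outside the platform (for example, before calling a published flow through the API), a base64 data URL is simply the file's bytes, base64-encoded and prefixed with the file's MIME type. A minimal Python sketch, with illustrative file names and a hypothetical to_data_url helper:

```python
import base64
import mimetypes

def to_data_url(path: str) -> str:
    """Read a local file and return it as a base64-encoded data URL."""
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "application/octet-stream"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"

single = to_data_url("invoice.pdf")                           # string input
batch = [to_data_url(p) for p in ["scan1.png", "scan2.png"]]  # string[] input
```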

Model Selection

  • Model*: Choose the AI model that will perform the Optical Character Recognition.

    • You can select from models specifically designed for OCR tasks (dedicated OCR APIs) or Multimodal Large Language Models capable of analyzing visual content.

    • The platform routes processing through either the OCR API or the multimodal LLM path, depending on the selected model's capabilities.

Pages to Process (Optional)

  • Pages to Process: This optional field allows you to specify which pages of a single PDF input should be processed.

    • Enter page numbers and/or ranges using 1-based indexing (e.g., 1, 3-5, 7).

    • If left empty, all pages of the PDF will be processed.

    • This setting is ignored when the input source is a list of files or when the input is an image.
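
The exact parser the platform uses is not documented, but the sketch below (the parse_pages helper is illustrative, not a Waterflai API) shows how a 1-based spec such as 1, 3-5, 7 maps onto the 0-based page indices that appear in the output metadata:

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a 1-based page spec like "1, 3-5, 7" into sorted 0-based indices."""
    pages: set[int] = set()
    for part in spec.replace(" ", "").split(","):
        if "-" in part:
            start, end = (int(n) for n in part.split("-"))
            pages.update(range(start - 1, end))  # "3-5" is inclusive
        else:
            pages.add(int(part) - 1)
    return sorted(pages)

assert parse_pages("1, 3-5, 7") == [0, 2, 3, 4, 6]
```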

Output Ports

  • page_documents (List[Document]): A list of LangChain Document objects. Each document corresponds to a processed page and contains the extracted text as its page_content, along with metadata such as the original source filename, the input_index (if a list of inputs was provided), and the page number (0-based index within the source).

  • full_document (Document): A single LangChain Document containing the concatenated text from all processed pages across all input items. The metadata will reflect the details of the last processed page. This output is None if no pages were processed.

  • page_texts (List[str]): A simple list of strings, where each string contains the extracted text from a single processed page across all input items, maintaining the order of processing.

  • full_text (str): A single string containing all the extracted text concatenated together, with page breaks indicated by \n.

  • raw_api_responses (Optional[List[Dict[str, Any]]]): If an API was used for OCR, this output provides a list of the raw JSON responses received from the API for each processed input item or page. The structure of this output will vary depending on the specific API used. This will be None if a Multimodal LLM was used.
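
To make these shapes concrete, here is an illustrative reconstruction of the first few ports for a single two-page PDF, using the LangChain Document class. The sample text and metadata values are invented, and the exact metadata keys may vary with the selected model:

```python
from langchain_core.documents import Document

# page_documents: one Document per processed page
page_documents = [
    Document(page_content="Invoice #1042 ...",
             metadata={"source": "invoice.pdf", "input_index": 0, "page": 0}),
    Document(page_content="Total due: 129.00 EUR ...",
             metadata={"source": "invoice.pdf", "input_index": 0, "page": 1}),
]

# page_texts and full_text follow directly from the page contents
page_texts = [doc.page_content for doc in page_documents]
full_text = "\n".join(page_texts)  # page breaks indicated by \n
```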

Best Practices

Model Selection

  • When choosing a model, consider the complexity and quality of your input documents. Dedicated OCR API models might be optimized for speed and accuracy on standard document layouts.

  • Multimodal LLMs can be more versatile in handling complex layouts, tables, and even extracting information based on visual understanding in addition to text.

  • Experiment with different models to find the best balance of performance and cost for your specific use case.

Input Preparation

  • Ensure that the PDF files and images provided are of reasonable quality and resolution for optimal OCR results. Blurry or low-resolution images can significantly impact accuracy.

  • For large PDF documents, consider processing them in smaller chunks or specifying page ranges to manage processing time and potential API limitations.

Handling Lists of Inputs

  • When providing a list of files, the "Pages to Process" setting will be disregarded, and all compatible content within each file will be processed. If you need to process specific pages from multiple PDFs, you might need to split them into individual files or use a preceding node to handle page extraction.
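
If you do need per-page control across multiple PDFs, one option is to split them before they enter the flow. Below is a minimal sketch using the pypdf library (a third-party tool, not part of Waterflai; file names and page choices are illustrative):

```python
from pypdf import PdfReader, PdfWriter

def extract_pages(src: str, pages: list[int], dst: str) -> None:
    """Copy the given 0-based pages of src into a new PDF at dst."""
    reader = PdfReader(src)
    writer = PdfWriter()
    for i in pages:
        writer.add_page(reader.pages[i])
    with open(dst, "wb") as f:
        writer.write(f)

# Keep pages 1 and 3-5 (0-based: 0, 2, 3, 4) of each source file.
for name in ["report_a.pdf", "report_b.pdf"]:
    extract_pages(name, [0, 2, 3, 4], f"subset_{name}")
```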

Common Issues

  • Poor OCR accuracy: This can often be attributed to low-quality input images or complex document layouts. Consider using higher-resolution images or experimenting with different OCR models.

  • Timeout errors: Processing very large files or a large number of files can exceed processing time limits. Try breaking down the input into smaller chunks or optimizing the workflow.

  • Unexpected output format: The output structure might vary slightly depending on whether an OCR API or a Multimodal LLM is used. Refer to the "Output Ports" section for detailed descriptions.
