OCR Node

Overview

The OCR Node empowers your flows to extract text content from PDF files and images. By leveraging either dedicated OCR APIs or multimodal Large Language Models, this node can process a variety of visual document formats. It handles single files or lists of files, providing structured text output for further analysis or integration within your AI applications.

Usage cost: 1 credit

Configuration Settings

Input Source

  • Input Source*: Select a variable holding the base64-encoded data URL (or a list of such URLs) of the PDF file(s) or image(s) you want to process.

    • Accepts: string (single base64 data URL), string[] (list of base64 data URLs), file (single file object), file[] (list of file objects), image (single image object), image[] (list of image objects), pdf (single PDF object).

    • Behind the scenes, file, image, and PDF objects are converted to their base64 data URL representations.
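
If you are preparing inputs outside the platform (for example, before calling a published flow through the API), a base64 data URL is simply the file's bytes, base64-encoded and prefixed with the file's MIME type. A minimal Python sketch, with illustrative file names and a hypothetical to_data_url helper:

```python
import base64
import mimetypes

def to_data_url(path: str) -> str:
    """Read a local file and return it as a base64-encoded data URL."""
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "application/octet-stream"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"

single = to_data_url("invoice.pdf")                           # string input
batch = [to_data_url(p) for p in ["scan1.png", "scan2.png"]]  # string[] input
```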

Model Selection

  • Model*: Choose the AI model that will perform the Optical Character Recognition.

    • You can select from models specifically designed for OCR tasks (dedicated OCR APIs) or Multimodal Large Language Models capable of analyzing visual content.

    • The platform routes processing through either the OCR API or the multimodal LLM path, depending on the selected model's capabilities.

Pages to Process (Optional)

  • Pages to Process: This optional field allows you to specify which pages of a single PDF input should be processed.

    • Enter page numbers and/or ranges using 1-based indexing (e.g., 1, 3-5, 7).

    • If left empty, all pages of the PDF will be processed.

    • This setting is ignored when the input source is a list of files or when the input is an image.
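
The exact parser the platform uses is not documented, but the sketch below (the parse_pages helper is illustrative, not a Waterflai API) shows how a 1-based spec such as 1, 3-5, 7 maps onto the 0-based page indices that appear in the output metadata:

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a 1-based page spec like "1, 3-5, 7" into sorted 0-based indices."""
    pages: set[int] = set()
    for part in spec.replace(" ", "").split(","):
        if "-" in part:
            start, end = (int(n) for n in part.split("-"))
            pages.update(range(start - 1, end))  # "3-5" is inclusive
        else:
            pages.add(int(part) - 1)
    return sorted(pages)

assert parse_pages("1, 3-5, 7") == [0, 2, 3, 4, 6]
```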

Output Ports

  • page_documents (List[Document]): A list of LangChain Document objects. Each document corresponds to a processed page and contains the extracted text as its page_content, along with metadata such as the original source filename, the input_index (if a list of inputs was provided), and the page number (0-based index within the source).

  • full_document (Document): A single LangChain Document containing the concatenated text from all processed pages across all input items. The metadata will reflect the details of the last processed page. This output is None if no pages were processed.

  • page_texts (List[str]): A simple list of strings, where each string contains the extracted text from a single processed page across all input items, maintaining the order of processing.

  • full_text (str): A single string containing all the extracted text concatenated together, with page breaks indicated by \n.

  • raw_api_responses (Optional[List[Dict[str, Any]]]): If an API was used for OCR, this output provides a list of the raw JSON responses received from the API for each processed input item or page. The structure of this output will vary depending on the specific API used. This will be None if a Multimodal LLM was used.
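
To make these shapes concrete, here is an illustrative reconstruction of the first few ports for a single two-page PDF, using the LangChain Document class. The sample text and metadata values are invented, and the exact metadata keys may vary with the selected model:

```python
from langchain_core.documents import Document

# page_documents: one Document per processed page
page_documents = [
    Document(page_content="Invoice #1042 ...",
             metadata={"source": "invoice.pdf", "input_index": 0, "page": 0}),
    Document(page_content="Total due: 129.00 EUR ...",
             metadata={"source": "invoice.pdf", "input_index": 0, "page": 1}),
]

# page_texts and full_text follow directly from the page contents
page_texts = [doc.page_content for doc in page_documents]
full_text = "\n".join(page_texts)  # page breaks indicated by \n
```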

Best Practices

Model Selection

  • When choosing a model, consider the complexity and quality of your input documents. Dedicated OCR API models might be optimized for speed and accuracy on standard document layouts.

  • Multimodal LLMs can be more versatile in handling complex layouts, tables, and even extracting information based on visual understanding in addition to text.

  • Experiment with different models to find the best balance of performance and cost for your specific use case.

Input Preparation

  • Ensure that the PDF files and images provided are of reasonable quality and resolution for optimal OCR results. Blurry or low-resolution images can significantly impact accuracy.

  • For large PDF documents, consider processing them in smaller chunks or specifying page ranges to manage processing time and potential API limitations.

Handling Lists of Inputs

  • When providing a list of files, the "Pages to Process" setting will be disregarded, and all compatible content within each file will be processed. If you need to process specific pages from multiple PDFs, you might need to split them into individual files or use a preceding node to handle page extraction.
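
If you do need per-page control across multiple PDFs, one option is to split them before they enter the flow. Below is a minimal sketch using the pypdf library (a third-party tool, not part of Waterflai; file names and page choices are illustrative):

```python
from pypdf import PdfReader, PdfWriter

def extract_pages(src: str, pages: list[int], dst: str) -> None:
    """Copy the given 0-based pages of src into a new PDF at dst."""
    reader = PdfReader(src)
    writer = PdfWriter()
    for i in pages:
        writer.add_page(reader.pages[i])
    with open(dst, "wb") as f:
        writer.write(f)

# Keep pages 1 and 3-5 (0-based: 0, 2, 3, 4) of each source file.
for name in ["report_a.pdf", "report_b.pdf"]:
    extract_pages(name, [0, 2, 3, 4], f"subset_{name}")
```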

Common Issues

  • Poor OCR accuracy: This can often be attributed to low-quality input images or complex document layouts. Consider using higher-resolution images or experimenting with different OCR models.

  • Timeout errors: Processing very large files or a large number of files can exceed processing time limits. Try breaking down the input into smaller chunks or optimizing the workflow.

  • Unexpected output format: The output structure might vary slightly depending on whether an OCR API or a Multimodal LLM is used. Refer to the "Output Ports" section for detailed descriptions.
