AI Central

Configure the Preprocessing Pipeline and Ingest into the Knowledge Base

This tutorial is the first article in the "Enterprise Technical Documentation Intelligent Q&A System" series. It shows how to use the preprocessing capabilities of the RAG Pipeline to parse PDF technical documents in depth, enrich them along several dimensions, and ingest them into a knowledge base.

Internal enterprise technical documents (product white papers, API documentation, architecture design documents, and so on) usually contain large amounts of text, tables, and flowcharts. This article guides you through orchestrating a document preprocessing Pipeline and creating a knowledge base, which lays the data foundation for subsequent retrieval and Q&A.

💡 Tip: This tutorial requires AI Central V4.2 or later, and the operating user must have administrator privileges.

Scenario Description

Business Background

A technology company has a set of technical documents for a core product (in PDF format, about 80 pages). The document content includes:

  • A large number of text paragraphs: feature descriptions, technical principles, configuration guides, etc.

  • Data tables: parameter comparison tables, compatibility matrices, performance indicators, etc.

  • Flowcharts and architecture diagrams: system architecture, data flow, deployment topology, etc.

The company wants to build an intelligent Q&A system so that engineers and product managers can quickly obtain information from the documents, while also ensuring:

  • Accurate recall of relevant paragraphs for pure text content

  • Understanding of table structures and provision of structured answers for table content

  • Semantic retrieval for images/flowcharts based on image descriptions

  • Support for both full-text retrieval and vector retrieval to improve recall rate

Solution Overview

The preprocessing Pipeline used in this case includes the following core stages:

| Stage | Node | Description |
| --- | --- | --- |
| Text extraction | PDF File Text | Extract raw text and page structure from the PDF |
| Intelligent segmentation | Page Segmentation | Split content into semantic paragraphs by page |
| Summary enhancement | Segment Summary Generation | LLM generates a summary for each paragraph to enhance retrieval semantics |
| Image understanding | Image Description Generation | LLM generates textual descriptions for images in the document |
| Table understanding | Table Advanced Summary Generation | LLM generates structured summaries for tables |
| Data aggregation | Variable Aggregator | Merge multi-dimensional enhanced data |
| Document summary | Document Summary Generation | Generate a global summary of the entire document |
| Tokenization and indexing | Spacy Tokenizer | Tokenize paragraphs to support full-text retrieval |
| Persistent storage | Multiple storage nodes | Store raw text, paragraphs, enhanced data, tokens, and vectors separately |
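To make the stage ordering concrete, the following sketch models the Pipeline as a dependency graph and derives which nodes could run in the same stage. The node names come from this tutorial; the dependency edges are an interpretation of the flow described here, and the `PIPELINE` dictionary and `execution_stages` helper are hypothetical names that are not part of AI Central.

```python
from graphlib import TopologicalSorter

# Each key is a node; its value is the set of nodes it depends on.
# The edges reflect one reading of the tutorial, not an exported Pipeline definition.
PIPELINE = {
    "PDF File Text": set(),
    "Page Segmentation": {"PDF File Text"},
    "Segment Summary Generation": {"Page Segmentation"},
    "Image Description Generation": {"Page Segmentation"},
    "Table Advanced Summary Generation": {"Page Segmentation"},
    "Variable Aggregator": {
        "Segment Summary Generation",
        "Image Description Generation",
        "Table Advanced Summary Generation",
    },
    "Document Summary Generation": {"Segment Summary Generation"},
    "Spacy Tokenizer": {"Variable Aggregator", "Page Segmentation"},
}


def execution_stages(graph):
    """Group nodes into stages; nodes in the same stage have no mutual dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the output stable
        stages.append(ready)
        sorter.done(*ready)
    return stages


for number, stage in enumerate(execution_stages(PIPELINE), start=1):
    print(number, stage)
```

The third stage contains the three enhancement branches, which is why they can be processed in parallel once page segmentation has finished.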


Step 1: Configure the Preprocessing Pipeline

The preprocessing Pipeline determines how documents are parsed, understood, and stored, directly affecting subsequent retrieval quality. In this case, we will build a multi-branch parallel processing Pipeline to fully extract information from text, images, and tables in the document.

File Preparation Requirements

  • Technical document: PDF format, such as Technical_Whitepaper_v2.1.pdf

  • The document should contain title structure, tables, and images to fully leverage the capabilities of each Pipeline node

  • It is recommended that the file size not exceed 50MB and the page count not exceed 200 pages
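Before uploading, the size and page recommendations above can be checked locally. The sketch below is a hypothetical helper (`check_pdf` is not a platform API) that uses only the Python standard library; it assumes you already know the page count, for example from your PDF viewer, and passes it in as a number.

```python
import os

MAX_SIZE_BYTES = 50 * 1024 * 1024  # 50MB, the recommended upper bound
MAX_PAGES = 200                    # recommended page limit


def check_pdf(path, page_count=None):
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    # A well-formed PDF normally starts with the "%PDF-" signature.
    with open(path, "rb") as handle:
        if handle.read(5) != b"%PDF-":
            problems.append("missing %PDF- signature")
    size = os.path.getsize(path)
    if size > MAX_SIZE_BYTES:
        problems.append(f"file is {size} bytes, above the 50MB recommendation")
    if page_count is not None and page_count > MAX_PAGES:
        problems.append(f"{page_count} pages, above the 200-page recommendation")
    return problems
```

For example, `check_pdf("Technical_Whitepaper_v2.1.pdf", page_count=80)` returns an empty list when the file is a PDF within both limits.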

Create a Preprocessing Pipeline

  1. Go to the Knowledge module → click "RAG Pipeline" in the left menu bar.

  2. Switch to the "Preprocessing Pipeline" tab.

  3. Click "+ Create Pipeline" and fill in:

    • Name: Technical Document Deep Preprocessing

    • Description: Multi-dimensional enhancement processing for technical documents, including summaries, image descriptions, table parsing, tokenization, and vectorization

  4. Click "Confirm" to enter the orchestration interface.

image-20260518-091735.png

Orchestrate the Preprocessing Flow

In the orchestration interface, add nodes and connect them according to the following architecture. This Pipeline follows an "extract first, then branch in parallel" design pattern:

1. PDF File Text Extraction

Add the "PDF File Text" node as the entry point of the flow:

  • Input: file_bytes (uploaded PDF file byte stream)

  • Configuration: set extract_hyperlinks to true (extract hyperlink information)

  • Output: file_content (extracted text content) and page_content_items (page content list)

This node extracts text, image placeholders, table structures, and other information from the PDF into processable structured data.

https://docs.serviceme.com/en/assets/images/RP-3-b21317cd954af137a7b07f5a64c9e928.png

2. Page Segmentation

Add the "Page Segmentation" node:

  • Input: page_content_items (the page content list output by the previous step)

  • Output: segments (semantic paragraph list)

Based on page layout and content structure, the system intelligently splits the document into independent semantic paragraphs. Each paragraph preserves contextual integrity and avoids cross-topic splitting.

https://docs.serviceme.com/en/assets/images/RP-4-cfe65c34ebd760eb9053881d0379558c.png

3. Multi-dimensional Summary Enhancement (Parallel Branches)

After page segmentation is completed, the Pipeline splits into three parallel branches to process different types of content:

Branch A: Segment Summary Generation

  • Input: segments

  • Configuration: prompt (summary prompt, customizable)

  • Output: segment_summaries (summary of each paragraph) and sensitive_segments (identified sensitive paragraph list, used for compliance review or filtering)

The LLM generates a concise summary for each text paragraph to improve semantic matching during retrieval. Even when the user's wording differs from the original text, the summary provides an additional semantic bridge.

https://docs.serviceme.com/en/assets/images/RP-5-17b2d8469fa832988447e82a70e3070e.png

Branch B: Image Description Generation

  • Input: segments (paragraphs containing image references)

  • Configuration: prompt (image description prompt)

  • Output: segment_images (textual descriptions of images)

Using its multimodal capabilities, the LLM analyzes images in the document (such as architecture diagrams, flowcharts, and screenshots) and generates structured textual descriptions. This allows image content to participate in semantic retrieval as well.

https://docs.serviceme.com/en/assets/images/RP-6-ae8a8c3f020e53862e6683c2beda1f3d.png

Branch C: Table Advanced Summary Generation

  • Input: segments (paragraphs containing tables)

  • Configuration:

    • prompt: table summary prompt

    • group_size: group size

    • max_columns: maximum number of columns

    • sample_rows: number of sampled rows

    • group_summary_prompt: prompt used when merging group summaries

    • large_table_threshold: minimum row/column threshold for determining a large table

    • small_table_threshold: maximum row/column threshold for determining a small table

  • Output: segment_tables (structured summaries of tables)

This node parses table content in the document and generates textual summaries that are easy to understand and retrieve. For example, a parameter comparison table can be transformed into a description such as "This table lists Y parameters supported by feature X and their default values."
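The configuration names above suggest that tables are handled differently depending on their size. The sketch below shows one plausible way such thresholds could interact; it is an assumption for illustration, not the node's documented behavior. The function names and the threshold values used in the example are hypothetical.

```python
def classify_table(n_rows, n_cols, small_table_threshold, large_table_threshold):
    """Classify a table by its larger dimension (assumed rule, for illustration)."""
    largest = max(n_rows, n_cols)
    if largest <= small_table_threshold:
        return "small"   # small enough to summarize in a single LLM call
    if largest >= large_table_threshold:
        return "large"   # summarize in row groups, then merge the group summaries
    return "medium"


def row_groups(rows, group_size):
    """Split rows into consecutive groups of at most group_size rows."""
    return [rows[i:i + group_size] for i in range(0, len(rows), group_size)]


# Example with made-up thresholds: a 60-row table is treated as large
# and split into groups of 25 rows for per-group summaries.
kind = classify_table(60, 4, small_table_threshold=10, large_table_threshold=50)
groups = row_groups(list(range(60)), group_size=25)
print(kind, [len(g) for g in groups])
```

Under this reading, group_summary_prompt would be applied when the per-group summaries of a large table are merged into one summary.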

https://docs.serviceme.com/en/assets/images/RP-7-59b73738cb247ed277462c7a359ac3d4.png

4. Data Aggregation (Variable Aggregator)

Add the "Variable Aggregator" node:

  • Input: outputs from the three branches: segment_summaries (paragraph summaries), segment_images (image descriptions), segment_tables (table summaries)

  • Output: group1 (aggregated enhanced data)

This node merges paragraph summaries, image descriptions, and table summaries into a unified data structure for subsequent nodes.
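As a rough illustration of the merge, the sketch below combines three branch outputs into one record per segment. It assumes each branch output is a mapping from a segment id to text; the real data structure of group1 is not specified in this tutorial, and the `aggregate` function and its field names are hypothetical.

```python
def aggregate(segment_summaries, segment_images, segment_tables):
    """Merge per-segment branch outputs into one record per segment id."""
    branches = (
        ("summary", segment_summaries),
        ("image_description", segment_images),
        ("table_summary", segment_tables),
    )
    merged = {}
    for field, branch in branches:
        for segment_id, value in branch.items():
            # A segment only receives the fields its content actually produced.
            merged.setdefault(segment_id, {})[field] = value
    return merged


group1 = aggregate(
    {"s1": "Explains the timeout setting.", "s2": "Describes the deployment topology."},
    {"s2": "Diagram with a gateway in front of two service nodes."},
    {"s1": "Table of parameters and their default values."},
)
print(group1["s2"])
```

Keeping the three kinds of enhancement under one segment id means later nodes can store or index them together with the original paragraph.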

https://docs.serviceme.com/en/assets/images/RP-8-e24e261cbb8c86307a8741c2031117a5.png

5. Document-level Summary Generation

Add the "Document Summary Generation" node:

  • Input: seg_summaries (the full list of paragraph summaries from the "Segment Summary Generation" node)

  • Configuration: prompt (document summary prompt)

  • Output: document_description (global summary of the entire document)

This node generates a document-level summary from all paragraph summaries. The summary is stored as document metadata and supports coarse-grained retrieval.

https://docs.serviceme.com/en/assets/images/RP-9-dd59b9d4ad1c071a0b074c901b8daf75.png

6. Tokenization Processing (Spacy Tokenizer)

Add the "Spacy Tokenizer" node (it runs in parallel with the storage of the enhanced paragraph data):

  • Input: group1 (aggregated paragraph data) and the original segments from page segmentation (used to supplement the original information)

  • Output: segments_tokenized (tokenized segments)

This node uses spaCy to tokenize paragraphs and generate the token sequences required for inverted indexing, which supports full-text retrieval.
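To show why token sequences matter for full-text retrieval, here is a minimal inverted-index sketch. It uses a simple regular-expression tokenizer as a stand-in so that it runs without extra dependencies; the actual node uses spaCy, whose tokenization is more sophisticated. All function names here are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Stand-in tokenizer: lowercase alphanumeric runs (not spaCy's algorithm)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def build_inverted_index(segments):
    """Map each token to the set of segment ids that contain it."""
    index = defaultdict(set)
    for segment_id, text in segments.items():
        for token in tokenize(text):
            index[token].add(segment_id)
    return index


def search(index, query):
    """Return ids of segments that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


segments = {
    "s1": "The default timeout is 30 seconds.",
    "s2": "The timeout can be changed in the config file.",
}
index = build_inverted_index(segments)
print(search(index, "default timeout"))
```

Full-text retrieval of this kind matches exact terms such as parameter names, which complements vector retrieval when users search for specific identifiers.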

https://docs.serviceme.com/en/assets/images/RP-10-2cd7abeec0f790b704b5dfb6bfbefe6c.png

7. Persistent Storage (Multiple Storage Nodes)

The end of the Pipeline includes the following storage nodes to persist various processing results:

| Storage Node | Input | Description |
| --- | --- | --- |
| Stored File Text | content | Store the complete raw text of the original document |
| Store Document Metadata | document_description | Store the document-level summary as metadata |
| Store Segment Enhance Data | doc_extension | Store paragraph enhanced data (summaries, image descriptions, table summaries) |
| Embedding and Store | segments, batch_size, rehire_size | Vectorize paragraphs (Embedding) and store them in the vector database |
| Store Tokenized Segments | segments, batch_size | Store tokenization results to support full-text retrieval |
| Stored File Segmentation | segments | Store the original segmentation results |

After all storage nodes are completed, the flow converges to the End node, and the Pipeline execution ends.
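The batch_size input on the "Embedding and Store" and "Store Tokenized Segments" nodes suggests that segments are processed in batches rather than one at a time. The sketch below illustrates that general idea under stated assumptions: `embed_batch` stands for any embedding function you supply, `store` is a plain list standing in for the vector database, and none of these names are platform APIs.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_and_store(segments, embed_batch, store, batch_size=16):
    """Embed segments batch by batch and append (text, vector) pairs to store."""
    for batch in batched(segments, batch_size):
        vectors = embed_batch(batch)  # one embedding call per batch
        for text, vector in zip(batch, vectors):
            store.append((text, vector))
    return len(store)


def toy_embed(batch):
    """Toy embedding for demonstration only: a one-dimensional length vector."""
    return [[float(len(text))] for text in batch]


vector_store = []
count = embed_and_store(["alpha", "beta", "gamma"], toy_embed, vector_store, batch_size=2)
print(count, vector_store)
```

Larger batches reduce the number of embedding calls, while smaller batches limit how much work is lost if a single call fails.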

Complete Flow Connection Diagram

image-20260518-091939.png

Trial Run and Publish

  1. Click the "Trial Run" button in the upper-right corner.

  2. Select the prepared Technical_Whitepaper_v2.1.pdf as the test file.

  3. Check the run results and verify one by one:

    • Text extraction: confirm that the PDF content is completely extracted without garbled text

    • Page segmentation: confirm that paragraph splitting is reasonable without cross-topic mixing

    • Paragraph summaries: confirm that the summaries accurately capture the core content of each paragraph

    • Image descriptions: confirm that the image descriptions are consistent with the actual image content

    • Table summaries: confirm that table data is correctly interpreted

    • Vectorized storage: confirm that Embedding is successfully generated

  4. After confirming that the trial run results are correct, click "Publish".

Note: To keep trial runs fast, first test the run logic with a small sample file (no larger than 5MB and no more than 20 pages).

https://docs.serviceme.com/en/assets/images/RP-12-57d5d922f76dc71e14c10d745989ec24.png
https://docs.serviceme.com/en/assets/images/RP-13-add52f9d3bce1e5cf38012fc85b2a0d2.png

Step 2: Create a Knowledge Base and Upload Documents in RP Mode

After the preprocessing Pipeline is configured, the next step is to create a knowledge base and enable RP (RAG Pipeline) mode for it, so that documents are automatically processed by the published preprocessing Pipeline during ingestion.

Create a Knowledge Base and Bind the Preprocessing RP

  1. Go to the system's Knowledge module.

  2. Click "Enterprise Space" in the left menu bar to enter the knowledge base management interface.

  3. Click "+ Create Knowledge Base", select "Pipeline Mode", and fill in the following basic information:

    • Knowledge Base Name: Technical Documentation Library

    • Knowledge Base Description: Technical documentation for the enterprise's core products, with multi-dimensional enhancement for text, tables, and images

    • Sort Order: 1

    • Vector Database: keep the default and use the platform's built-in pgvector

    • Vector Model: select the platform's default AzureEmbedding model to generate vector representations for paragraphs and queries

  4. On the Pipeline configuration page, associate the published preprocessing Pipeline:

    • Supported File Formats: check "All" so that multiple document formats can be processed later

    • Retrieval Settings: enable the "File Preview" feature to allow direct preview of original text snippets in retrieval results

    • Pipeline Selection: select the published Technical Document Deep Preprocessing Pipeline and set the file type to PDF (to support other formats, add a Pipeline for each additional file type)

  5. After confirming that the configuration is correct, click "Create" to create the knowledge base and bind it to RP mode.

After creation, whenever a user uploads a PDF file to this knowledge base, the system automatically calls "Technical Document Deep Preprocessing" to process the file: text extraction, segmentation, multi-dimensional summary enhancement, tokenization, and vectorization.

https://docs.serviceme.com/en/assets/images/%E7%9F%A5%E8%AF%86-%E5%88%9B%E5%BB%BA%E7%9F%A5%E8%AF%86%E5%BA%931-dc6aa9def1616bb0e570c4993258ec75.png
https://docs.serviceme.com/en/assets/images/RP-14-e373cddba7d91a9a2d2dbf1b05ece643.png
https://docs.serviceme.com/en/assets/images/RP-15-1017ac813beaf4a43b6aac30a447ca23.png
https://docs.serviceme.com/en/assets/images/RP-16-e56d0a46c80c714c24834cf7a6f6ed98.png

Upload Technical Documents

  1. Enter the newly created knowledge base Technical Documentation Library.

  2. Click "Upload File" and upload Technical_Whitepaper_v2.1.pdf.

  3. The system will automatically call the Technical Document Deep Preprocessing Pipeline in RP mode to perform full-chain processing on the document.

  4. After the upload is completed, you can view the processing status and progress in the file list.

💡 Tip: Because this Pipeline contains multiple LLM invocation nodes (summary generation, image description, table parsing), processing takes longer than with a simple Pipeline. Please wait for it to finish.

https://docs.serviceme.com/en/assets/images/RP-17-ea729cfa23d2c502da7c3523d4d87e1e.png
https://docs.serviceme.com/en/assets/images/RP-18-4c20383627ea4e192ffb086ce5ca500c.png
https://docs.serviceme.com/en/assets/images/RP-19-a96202dae7f9a17e0b156eace6b38e75.png

Verify Knowledge Base Ingestion Results

  1. Go to the document details page of the knowledge base.

  2. Verify the following:

    • ✅ The document fragment list is displayed correctly

    • ✅ Each fragment contains enhanced data (summaries, image descriptions, etc.)

    • ✅ The document metadata contains a global summary description

    • ✅ Both vector data and tokenization indexes have been generated

https://docs.serviceme.com/en/assets/images/RP-20-856c008cf10139922013540c3a44bb57.png

Next Step

The preprocessing Pipeline and knowledge base are now ready. Next, read "Configure the Retrieval Pipeline and Agent Q&A" to complete retrieval chain orchestration and end-to-end Q&A validation.