Configure the Preprocessing Pipeline and Ingest into the Knowledge Base

This tutorial is the first article in the "Enterprise Technical Documentation Intelligent Q&A System" series. It shows how to use the preprocessing capabilities of the RAG Pipeline to parse PDF technical documents in depth, enrich them along several dimensions, and ingest them into a knowledge base.

Internal enterprise technical documents (such as product white papers, API documentation, and architecture design documents) usually contain a large amount of text, tables, and flowcharts. This article guides you through orchestrating a document preprocessing Pipeline and creating a knowledge base, laying the data foundation for subsequent retrieval and Q&A.

💡 Tip: This tutorial requires AI Central V4.2 or later, and the operating user must have administrator privileges.

Scenario Description

Business Background

A technology company has a set of technical documents for a core product (in PDF format, about 80 pages). The document content includes:

A large number of text paragraphs: feature descriptions, technical principles, configuration guides, etc.
Data tables: parameter comparison tables, compatibility matrices, performance indicators, etc.
Flowcharts and architecture diagrams: system architecture, data flow, deployment topology, etc.

The company wants to build an intelligent Q&A system so that engineers and product managers can quickly obtain information from the documents, while also ensuring:

Accurate recall of relevant paragraphs for pure text content
Understanding of table structures and provision of structured answers for table content
Semantic retrieval for images/flowcharts based on image descriptions
Support for both full-text retrieval and vector retrieval to improve recall rate

Solution Overview

The preprocessing Pipeline used in this case includes the following core stages:

Stage	Node	Description
Text extraction	PDF File Text	Extract raw text and page structure from PDF
Intelligent segmentation	Page Segmentation	Intelligently split content into semantic paragraphs by page
Summary enhancement	Segment Summary Generation	LLM generates a summary for each paragraph to enhance retrieval semantics
Image understanding	Image Description Generation	LLM generates textual descriptions for images in the document
Table understanding	Table Advanced Summary Generation	LLM generates structured summaries for tables
Data aggregation	Variable Aggregator	Merge multi-dimensional enhanced data
Document summary	Document Summary Generation	Generate a global summary of the entire document
Tokenization and indexing	Spacy Tokenizer	Tokenize paragraphs to support full-text retrieval
Persistent storage	Multiple storage nodes	Store raw text, paragraphs, enhanced data, tokens, and vectors separately

Step 1: Configure the Preprocessing Pipeline

The preprocessing Pipeline determines how documents are parsed, understood, and stored, directly affecting subsequent retrieval quality. In this case, we will build a multi-branch parallel processing Pipeline to fully extract information from text, images, and tables in the document.

File Preparation Requirements

Technical document: PDF format, such as Technical_Whitepaper_v2.1.pdf
The document should contain title structure, tables, and images to fully leverage the capabilities of each Pipeline node
It is recommended that the file size not exceed 50MB and the page count not exceed 200 pages
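
The recommendations above can be checked locally before a file is uploaded. The following is a minimal sketch and not part of the platform: the function name `check_upload_candidate` and the constant `MAX_SIZE_BYTES` are illustrative. The only format fact it relies on is that PDF files begin with the bytes `%PDF-`. Page count is not checked, because that would require a PDF parser.

```python
from pathlib import Path

MAX_SIZE_BYTES = 50 * 1024 * 1024  # recommended upper limit from this tutorial (50 MB)


def check_upload_candidate(name: str, data: bytes,
                           max_size: int = MAX_SIZE_BYTES) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    if Path(name).suffix.lower() != ".pdf":
        problems.append("file extension is not .pdf")
    if not data.startswith(b"%PDF-"):
        problems.append("content does not start with the PDF header %PDF-")
    if len(data) > max_size:
        problems.append("file is larger than the recommended size limit")
    return problems


if __name__ == "__main__":
    sample = b"%PDF-1.7\n"  # stand-in bytes, not a real document
    print(check_upload_candidate("Technical_Whitepaper_v2.1.pdf", sample))
```

Passing a smaller `max_size` (for example 5 MB) gives the same check for the small sample files used in a trial run.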

Create a Preprocessing Pipeline

Go to the Knowledge module → click "RAG Pipeline" in the left menu bar.
Switch to the "Preprocessing Pipeline" tab.
Click "+ Create Pipeline" and fill in:
- Name: Technical Document Deep Preprocessing
- Description: Multi-dimensional enhancement processing for technical documents, including summaries, image descriptions, table parsing, tokenization, and vectorization
Click "Confirm" to enter the orchestration interface.

Orchestrate the Preprocessing Flow

In the orchestration interface, add nodes and connect them according to the following architecture. This Pipeline follows an "extract first, then branch in parallel" design pattern:

1. PDF File Text Extraction

Add the "PDF File Text" node as the entry point of the flow:

Input: file_bytes (uploaded PDF file byte stream)
Configuration: set extract_hyperlinks to true (extract hyperlink information)
Output: file_content (extracted text content) and page_content_items (page content list)

This node extracts text, image placeholders, table structures, and other information from the PDF into processable structured data.

https://docs.serviceme.com/en/assets/images/RP-3-b21317cd954af137a7b07f5a64c9e928.png

2. Page Segmentation

Add the "Page Segmentation" node:

Input: page content list page_content_items output from the previous step
Output: segments (semantic paragraph list)

Based on page layout and content structure, the system intelligently splits the document into independent semantic paragraphs. Each paragraph preserves contextual integrity and avoids cross-topic splitting.
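
The node's actual segmentation logic is not described in this tutorial. To make the idea of page-bounded segments concrete, here is a deliberately simple sketch that splits each page on blank lines, so that no segment crosses a page boundary. The `Segment` class, the `segment_pages` function, and the input shape (one text string per page) are assumptions for illustration only; they are not the platform's data model.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    page: int   # 1-based page number the text came from
    index: int  # running position of the segment in the document
    text: str


def segment_pages(pages: list[str]) -> list[Segment]:
    """Split each page on blank lines; a segment never spans two pages."""
    segments = []
    for page_no, page_text in enumerate(pages, start=1):
        for block in re.split(r"\n\s*\n", page_text):
            block = block.strip()
            if block:  # skip empty blocks produced by leading/trailing blank lines
                segments.append(Segment(page_no, len(segments), block))
    return segments
```

A real segmenter would also use headings and layout to keep related content together; blank-line splitting is only the simplest possible stand-in.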

https://docs.serviceme.com/en/assets/images/RP-4-cfe65c34ebd760eb9053881d0379558c.png

3. Multi-dimensional Summary Enhancement (Parallel Branches)

After page segmentation is completed, the Pipeline splits into three parallel branches to process different types of content:

Branch A: Segment Summary Generation

Input: segments
Configuration: prompt (summary prompt, customizable)
Output: segment_summaries (summary of each paragraph) and sensitive_segments (identified sensitive paragraph list, used for compliance review or filtering)

The LLM generates a concise summary for each text paragraph to improve semantic matching during retrieval. Even if the user's wording differs from the original text, the summary provides an additional semantic bridge.
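
The "semantic bridge" effect can be illustrated with a toy example. The sketch below uses plain token overlap as a crude stand-in for real retrieval scoring, and the three texts are invented for the example. It only shows why storing a summary next to the original paragraph can help a query that uses different wording.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word tokens; a crude stand-in for real semantic matching."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def overlap(query: str, text: str) -> int:
    """Number of distinct query tokens that also occur in `text`."""
    return len(tokens(query) & tokens(text))


# Invented example texts, not taken from any real document.
original = "Set max_conn to 200 in server.yaml and restart the service."
summary = "Explains how to raise the maximum number of connections."
query = "maximum number of connections"

if __name__ == "__main__":
    print(overlap(query, original))  # the original wording shares no token with the query
    print(overlap(query, summary))   # the summary shares all four query tokens
```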

https://docs.serviceme.com/en/assets/images/RP-5-17b2d8469fa832988447e82a70e3070e.png

Branch B: Image Description Generation

Input: segments (paragraphs containing image references)
Configuration: prompt (image description prompt)
Output: segment_images (textual descriptions of images)

Based on multimodal capabilities, the LLM analyzes images in the document (such as architecture diagrams, flowcharts, and screenshots) and generates structured textual descriptions. This enables image content to participate in semantic retrieval as well.

https://docs.serviceme.com/en/assets/images/RP-6-ae8a8c3f020e53862e6683c2beda1f3d.png

Branch C: Table Advanced Summary Generation

Input: segments (paragraphs containing tables)
Configuration:
- prompt: table summary prompt
- group_size: group size
- max_columns: maximum number of columns
- sample_rows: number of sampled rows
- group_summary_prompt: prompt used when merging group summaries
- large_table_threshold: minimum row/column threshold for determining a large table
- small_table_threshold: maximum row/column threshold for determining a small table
Output: segment_tables (structured summaries of tables)

This node parses table content in the document and generates textual summaries that are easy to understand and retrieve. For example, a parameter comparison table can be transformed into a description such as "This table lists Y parameters supported by feature X and their default values."
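
The tutorial lists the configuration parameters without describing how they interact. The sketch below shows one plausible reading, stated as an assumption: small tables are summarized in a single pass, large tables are split into row groups whose summaries are later merged (which is where `group_summary_prompt` would apply), and tables in between are summarized from a sample of rows. The function `plan_table_summary` is hypothetical, it compares row counts only, and it makes no LLM call. Check the node's own documentation for the actual behavior.

```python
def plan_table_summary(rows: list[list[str]], *, group_size: int, max_columns: int,
                       sample_rows: int, small_table_threshold: int,
                       large_table_threshold: int) -> dict:
    """Decide how a table would be fed to the LLM (no LLM call is made here)."""
    trimmed = [row[:max_columns] for row in rows]  # cap the table width
    n = len(trimmed)
    if n <= small_table_threshold:
        # Small table: send every row in a single request.
        return {"mode": "single", "batches": [trimmed]}
    if n >= large_table_threshold:
        # Large table: summarize fixed-size row groups, then merge the group summaries.
        groups = [trimmed[i:i + group_size] for i in range(0, n, group_size)]
        return {"mode": "grouped", "batches": groups}
    # Medium table: summarize from the first `sample_rows` rows only.
    return {"mode": "sampled", "batches": [trimmed[:sample_rows]]}
```

Under this reading, a 10-row table with `group_size=4` and `large_table_threshold=8` is processed as three groups of 4, 4, and 2 rows.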

https://docs.serviceme.com/en/assets/images/RP-7-59b73738cb247ed277462c7a359ac3d4.png

4. Data Aggregation (Variable Aggregator)

Add the "Variable Aggregator" node:

Input: outputs from the three branches: segment_summaries (paragraph summaries), segment_images (image descriptions), segment_tables (table summaries)
Output: group1 (aggregated enhanced data)

This node merges paragraph summaries, image descriptions, and table summaries into a unified data structure for subsequent nodes.
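
Conceptually, the aggregation is a merge of three partial views of the same segments. The sketch below assumes, purely for illustration, that each branch output is a dictionary keyed by a segment index. The function name `aggregate` and the field names in the merged record are hypothetical and do not reflect the actual structure of `group1`.

```python
def aggregate(segment_summaries: dict[int, str],
              segment_images: dict[int, list[str]],
              segment_tables: dict[int, list[str]]) -> dict[int, dict]:
    """Merge the three branch outputs into one record per segment index."""
    merged = {}
    all_ids = set(segment_summaries) | set(segment_images) | set(segment_tables)
    for idx in sorted(all_ids):
        merged[idx] = {
            "summary": segment_summaries.get(idx, ""),
            "image_descriptions": segment_images.get(idx, []),
            "table_summaries": segment_tables.get(idx, []),
        }
    return merged
```

A segment that has no image or table simply gets empty lists, so downstream nodes can treat every record uniformly.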

https://docs.serviceme.com/en/assets/images/RP-8-e24e261cbb8c86307a8741c2031117a5.png

5. Document-level Summary Generation

Add the "Document Summary Generation" node:

Input: seg_summaries (the full list of paragraph summaries from the "Segment Summary Generation" node)
Configuration: prompt (document summary prompt)
Output: document_description (global summary of the entire document)

This node generates a document-level summary based on all paragraph summaries. The summary is used for document metadata storage and coarse-grained retrieval.

https://docs.serviceme.com/en/assets/images/RP-9-dd59b9d4ad1c071a0b074c901b8daf75.png

6. Tokenization Processing (Spacy Tokenizer)

Add the "Spacy Tokenizer" node (in parallel with storing paragraph enhanced data):

Input: group1 (aggregated paragraph data) and the original page-segmented segments (used to supplement original information)
Output: segments_tokenized (tokenized segments)

This node uses spaCy to tokenize paragraphs and generate the token sequences required for inverted indexing, which supports full-text retrieval.
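
To show why tokenization is a prerequisite for full-text retrieval, here is a minimal inverted-index sketch. It deliberately replaces the spaCy tokenizer with a simple regular-expression tokenizer so that the example has no dependencies. The function names are illustrative, and the platform's real index format is not shown in this tutorial.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase word tokenizer; a simple stand-in for a spaCy tokenizer."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(segments: dict[int, str]) -> dict[str, set[int]]:
    """Map each token to the set of segment ids that contain it."""
    index = defaultdict(set)
    for seg_id, text in segments.items():
        for token in tokenize(text):
            index[token].add(seg_id)
    return dict(index)


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of segments containing every query token (AND semantics)."""
    postings = [index.get(token, set()) for token in tokenize(query)]
    return set.intersection(*postings) if postings else set()
```

Each token points directly at the segments that contain it, so a keyword query is answered by intersecting a few small sets instead of scanning every paragraph.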

https://docs.serviceme.com/en/assets/images/RP-10-2cd7abeec0f790b704b5dfb6bfbefe6c.png

7. Persistent Storage (Multiple Storage Nodes)

The end of the Pipeline includes the following storage nodes to persist various processing results:

Storage Node	Input	Description
Stored File Text	`content`	Store the complete raw text of the original document
Store Document Metadata	`document_description`	Store the document-level summary as metadata
Store Segment Enhance Data	`doc_extension`	Store paragraph enhanced data (summaries, image descriptions, table summaries)
Embedding and Store	`segments`, `batch_size`, `rehire_size`	Vectorize paragraphs (Embedding) and store them in the vector database
Store Tokenized Segments	`segments`, `batch_size`	Store tokenization results to support full-text retrieval
Stored File Segmentation	`segments`	Store the original segmentation results

After all storage nodes are completed, the flow converges to the End node, and the Pipeline execution ends.
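
Two of the storage nodes take a `batch_size` input. Assuming it controls how many segments are processed per request (the tutorial does not define it), batching could be sketched as follows. `batched` and `embed_all` are hypothetical helpers, and `embed_batch` stands for whatever embedding call the caller supplies.

```python
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(segments: list[str], batch_size: int,
              embed_batch: Callable[[list[str]], list[list[float]]]) -> list[list[float]]:
    """Embed segments batch by batch; `embed_batch` is a caller-supplied function."""
    vectors = []
    for batch in batched(segments, batch_size):
        vectors.extend(embed_batch(batch))  # one request per batch
    return vectors
```

Larger batches mean fewer requests to the embedding model, while smaller batches keep each request lighter; the right value depends on the model's limits.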

Complete Flow Connection Diagram
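
The connections described in steps 1 to 7 can also be written down as a dependency graph. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to compute which nodes could run at the same time. Most edges follow the inputs and outputs listed in the steps above. The edges marked as assumptions are guesses, because the tutorial does not state which upstream node feeds those inputs.

```python
from graphlib import TopologicalSorter

# node -> set of nodes it depends on
PIPELINE = {
    "PDF File Text": set(),
    "Page Segmentation": {"PDF File Text"},
    "Segment Summary Generation": {"Page Segmentation"},
    "Image Description Generation": {"Page Segmentation"},
    "Table Advanced Summary Generation": {"Page Segmentation"},
    "Variable Aggregator": {
        "Segment Summary Generation",
        "Image Description Generation",
        "Table Advanced Summary Generation",
    },
    "Document Summary Generation": {"Segment Summary Generation"},
    "Spacy Tokenizer": {"Variable Aggregator", "Page Segmentation"},
    "Stored File Text": {"PDF File Text"},                 # assumption
    "Stored File Segmentation": {"Page Segmentation"},     # assumption
    "Store Document Metadata": {"Document Summary Generation"},
    "Store Segment Enhance Data": {"Variable Aggregator"},  # assumption
    "Embedding and Store": {"Page Segmentation"},           # assumption
    "Store Tokenized Segments": {"Spacy Tokenizer"},
    "End": {
        "Stored File Text", "Stored File Segmentation", "Store Document Metadata",
        "Store Segment Enhance Data", "Embedding and Store", "Store Tokenized Segments",
    },
}


def execution_waves(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group nodes into waves; all nodes in one wave have their inputs ready."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(PIPELINE), start=1):
        print(number, ", ".join(wave))
```

With these edges, the three enhancement branches land in the same wave right after Page Segmentation, which matches the "extract first, then branch in parallel" pattern.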

Trial Run and Publish

Click the "Trial Run" button in the upper-right corner.
Select the prepared Technical_Whitepaper_v2.1.pdf as the test file.
Check the run results and verify one by one:
- ✅ Text extraction: confirm that the PDF content is completely extracted without garbled text
- ✅ Page segmentation: confirm that paragraph splitting is reasonable without cross-topic mixing
- ✅ Paragraph summaries: confirm that the summaries accurately capture the core content of each paragraph
- ✅ Image descriptions: confirm that the image descriptions are consistent with the actual image content
- ✅ Table summaries: confirm that table data is correctly interpreted
- ✅ Vectorized storage: confirm that Embedding is successfully generated
After confirming that the trial run results are correct, click "Publish".

Note: To keep trial runs fast, first test the run logic with a small sample file of no more than 5MB and 20 pages.

https://docs.serviceme.com/en/assets/images/RP-12-57d5d922f76dc71e14c10d745989ec24.png

https://docs.serviceme.com/en/assets/images/RP-13-add52f9d3bce1e5cf38012fc85b2a0d2.png

Step 2: Create a Knowledge Base and Upload Documents in RP Mode

After completing the preprocessing Pipeline configuration, the next step is to create a knowledge base and enable RP (RAG Pipeline) mode in it, so that documents are automatically processed by the published preprocessing Pipeline during ingestion.

Create a Knowledge Base and Bind the Preprocessing RP

Go to the system's Knowledge module.
Click "Enterprise Space" in the left menu bar to enter the knowledge base management interface.
Click "+ Create Knowledge Base", select "Pipeline Mode", and fill in the following basic information:
- Knowledge Base Name: Technical Documentation Library
- Knowledge Base Description: Technical documentation for the enterprise's core products, with multi-dimensional enhancement for text, tables, and images
- Sort Order: 1
- Vector Database: keep the default and use the platform's built-in pgvector
- Vector Model: select the platform's default AzureEmbedding model to generate vector representations for paragraphs and queries
On the Pipeline configuration page, associate the published preprocessing Pipeline:
- Supported File Formats: check "All" so that multiple document formats can be processed later
- Retrieval Settings: enable the "File Preview" feature to allow direct preview of original text snippets in retrieval results
- Pipeline Selection: select the published Technical Document Deep Preprocessing Pipeline, and specify the file type as PDF (to support other formats, add a Pipeline for each additional type)
After confirming that the configuration is correct, click "Create" to complete the creation of the knowledge base and the binding of RP mode.

After the knowledge base is created, whenever a user uploads a PDF file to it, the system automatically calls "Technical Document Deep Preprocessing" to process the file: text extraction, segmentation, multi-dimensional summary enhancement, tokenization, and vectorization.
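
The binding between file types and Pipelines amounts to a lookup by file type. The sketch below models it as a dictionary keyed by file extension. The names `PIPELINE_BY_EXTENSION` and `select_pipeline` are illustrative, and how the platform actually handles a file type with no bound Pipeline is not covered in this tutorial; raising an error here is an assumption.

```python
from pathlib import Path

# file extension -> name of the preprocessing Pipeline bound to it
PIPELINE_BY_EXTENSION = {
    ".pdf": "Technical Document Deep Preprocessing",
}


def select_pipeline(filename: str,
                    bindings: dict[str, str] = PIPELINE_BY_EXTENSION) -> str:
    """Return the Pipeline bound to the file's type, or raise if none is bound."""
    ext = Path(filename).suffix.lower()
    try:
        return bindings[ext]
    except KeyError:
        label = ext or "files without an extension"
        raise LookupError(f"no preprocessing Pipeline is bound to {label}") from None
```

Supporting another format would then be a matter of adding one more entry to the mapping once a Pipeline for that type has been published.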

https://docs.serviceme.com/en/assets/images/%E7%9F%A5%E8%AF%86-%E5%88%9B%E5%BB%BA%E7%9F%A5%E8%AF%86%E5%BA%931-dc6aa9def1616bb0e570c4993258ec75.png

https://docs.serviceme.com/en/assets/images/RP-14-e373cddba7d91a9a2d2dbf1b05ece643.png

https://docs.serviceme.com/en/assets/images/RP-15-1017ac813beaf4a43b6aac30a447ca23.png

https://docs.serviceme.com/en/assets/images/RP-16-e56d0a46c80c714c24834cf7a6f6ed98.png

Upload Technical Documents

Enter the newly created knowledge base Technical Documentation Library.
Click "Upload File" and upload Technical_Whitepaper_v2.1.pdf.
The system automatically calls the Technical Document Deep Preprocessing Pipeline in RP mode and runs the full processing chain on the document.
After the upload is completed, you can view the processing status and progress in the file list.

💡 Since this Pipeline contains multiple LLM invocation nodes (summary generation, image description, table parsing), processing time will be longer than that of a simple Pipeline. Please wait patiently.

https://docs.serviceme.com/en/assets/images/RP-17-ea729cfa23d2c502da7c3523d4d87e1e.png

https://docs.serviceme.com/en/assets/images/RP-18-4c20383627ea4e192ffb086ce5ca500c.png

https://docs.serviceme.com/en/assets/images/RP-19-a96202dae7f9a17e0b156eace6b38e75.png

Verify Knowledge Base Ingestion Results

Go to the document details page of the knowledge base.
Verify the following:
- ✅ The document fragment list is displayed correctly
- ✅ Each fragment contains enhanced data (summaries, image descriptions, etc.)
- ✅ The document metadata contains a global summary description
- ✅ Both vector data and tokenization indexes have been generated

https://docs.serviceme.com/en/assets/images/RP-20-856c008cf10139922013540c3a44bb57.png

Next Step

The preprocessing Pipeline and knowledge base are now ready. Next, read "Configure the Retrieval Pipeline and Agent Q&A" to complete retrieval chain orchestration and end-to-end Q&A validation.