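How to Use RAG to Make Spring AI and OpenAI GPT Work Better With Your Own Documents

The AIDocumentLibraryChat project uses the Spring AI project with OpenAI to answer questions by searching a document library. To do that, the documents are processed with Retrieval Augmented Generation (RAG).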

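Retrieval Augmented Generation

The process looks like this:

Upload documents:

Store the document in the PostgreSQL database.
Split the document into chunks to create embeddings.
Create the embeddings with the OpenAI embedding model.
Store the document embeddings in the PostgreSQL vector database.

Search documents:

Create the search prompt.
Create an embedding of the search prompt with the OpenAI embedding model.
Query the PostgreSQL vector database for the documents with the nearest embedding distances.
Query the PostgreSQL database for the document.
Create a prompt from the search prompt and the document text chunk.
Request an answer from the GPT model and show the answer based on the search prompt and the document text chunk.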

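Document Upload

The uploaded document is stored in the database so that the source document of an answer is available. The document text has to be split into chunks to create an embedding per chunk. The embeddings are created by an OpenAI embedding model and are vectors with more than 1500 dimensions that represent the text chunk. Each embedding is stored in an AIDocument in the vector database, together with the chunk text and the ID of the source document.

Document Search

The document search takes the search prompt and uses the OpenAI embedding model to turn it into an embedding. The embedding is used to search the vector database for the nearest neighbor vector, that is, the text chunk whose embedding is most similar to the embedding of the search prompt. The ID in the AIDocument is used to read the document from the relational database. The document prompt is created from the search prompt and the text chunk of the AIDocument. Then the OpenAI GPT model is called with that prompt to create an answer based on the search prompt and the document context. This makes the model create answers that stay closer to the provided documents, which improves accuracy. The answer of the GPT model is returned and displayed with a link to the document to provide the source of the answer.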
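
To make the nearest neighbor step concrete, here is a minimal sketch in plain Java of what such a search does conceptually. It is not the project's code: the 'Chunk' record, the three-dimensional vectors, and the linear scan are assumptions for illustration. In the project the search runs inside the PostgreSQL vector database over 1536-dimensional OpenAI embeddings with a cosine index.

```java
import java.util.List;

public class NearestChunkSketch {
  // Hypothetical stand-in for a stored chunk: source document ID, text, embedding.
  record Chunk(String documentId, String text, double[] embedding) {}

  // Cosine distance = 1 - cosine similarity; 0 means the vectors point the same way.
  static double cosineDistance(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  // Linear scan for the chunk whose embedding is closest to the query embedding.
  static Chunk nearest(double[] query, List<Chunk> chunks) {
    Chunk best = null;
    double bestDistance = Double.MAX_VALUE;
    for (Chunk chunk : chunks) {
      double distance = cosineDistance(query, chunk.embedding());
      if (distance < bestDistance) {
        bestDistance = distance;
        best = chunk;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // Made-up toy embeddings; real ones come from the embedding model.
    var chunks = List.of(
      new Chunk("1000", "Spring AI intro", new double[] {1.0, 0.0, 0.0}),
      new Chunk("1050", "Angular routing", new double[] {0.0, 1.0, 0.0}));
    var hit = nearest(new double[] {0.9, 0.1, 0.0}, chunks);
    System.out.println(hit.documentId() + ": " + hit.text()); // prints "1000: Spring AI intro"
  }
}
```

The document ID carried by the winning chunk is what links the vector search result back to the stored source document.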

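Architecture

The architecture of the project is built around Spring Boot with Spring AI. The Angular UI provides the user interface to show the document list, upload documents, and enter the search prompt, and it displays the answer with the source document. It communicates with the Spring Boot backend via a REST interface. The Spring Boot backend provides the REST controllers for the frontend and uses Spring AI to communicate with the OpenAI models and the PostgreSQL vector database. The documents are stored with JPA in the PostgreSQL relational database. PostgreSQL was chosen because it combines the relational database and the vector database in one Docker image.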

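Implementation

Frontend

The frontend is built with lazy-loaded standalone components in Angular. The lazy-loaded standalone components are configured in app.config.ts: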

export const appConfig: ApplicationConfig = {
  providers: [provideRouter(routes), provideAnimations(), provideHttpClient()]
};

The configuration sets the routes and enables the HTTP client and the animations.

The lazy-loaded routes are defined in app.routes.ts:

export const routes: Routes = [
  {
    path: "doclist",
    loadChildren: () => import("./doc-list").then((mod) => mod.DOCLIST),
  },
  {
    path: "docsearch",
    loadChildren: () => import("./doc-search").then((mod) => mod.DOCSEARCH),
  },
  { path: "**", redirectTo: "doclist" },
];

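In 'loadChildren', the 'import("...").then((mod) => mod.XXX)' call lazily loads the module at the given path and sets the exported routes that are defined in the 'mod.XXX' constant.

The lazy-loaded route "docsearch" has an index.ts that exports the constant: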

export * from "./doc-search.routes";

This exports the content of doc-search.routes.ts:

export const DOCSEARCH: Routes = [
  {
    path: "",
    component: DocSearchComponent,    
  },
  { path: "**", redirectTo: "" },
];

It defines the route to the 'DocSearchComponent'.

The file upload can be found in the DocImportComponent with the template doc-import.component.html:

<h1 mat-dialog-title i18n="@@docimportImportFile">Import file</h1>
<div mat-dialog-content>
  <p i18n="@@docimportFileToImport">File to import</p>
  @if(uploading) {		
    <div class="upload-spinner"><mat-spinner></mat-spinner></div>    
  } @else {		
    <input type="file" (change)="onFileInputChange($event)">
  }
  @if(!!file) {
    <div>
      <ul>
        <li>Name: {{file.name}}</li>
        <li>Type: {{file.type}}</li>
        <li>Size: {{file.size}} bytes</li>
      </ul>    
    </div>
  }   
</div>
<div mat-dialog-actions>
  <button mat-button (click)="cancel()" i18n="@@cancel">Cancel</button>
  <button mat-flat-button color="primary" [disabled]="!file || uploading" 
    (click)="upload()" i18n="@@docimportUpload">Upload</button>
</div>

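The file upload is done with the '<input type="file" (change)="onFileInputChange($event)">' tag. It provides the file selection and calls the 'onFileInputChange(...)' method each time the selection changes.

The 'Upload' button calls the 'upload()' method on click to send the file to the server.

doc-import.component.ts contains the methods used by the template: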

@Component({
  selector: 'app-docimport',
  standalone: true,
  imports: [CommonModule,MatFormFieldModule, MatDialogModule,MatButtonModule, MatInputModule, FormsModule, MatProgressSpinnerModule],
  templateUrl: './doc-import.component.html',
  styleUrls: ['./doc-import.component.scss']
})
export class DocImportComponent {
  protected file: File | null  = null;
  protected uploading = false;
  private destroyRef = inject(DestroyRef); 
	
  constructor(private dialogRef: MatDialogRef<DocImportComponent>, 
    @Inject(MAT_DIALOG_DATA) public data: DocImportComponent, 
    private documentService: DocumentService) { }
	
  protected onFileInputChange($event: Event): void {
    const files = !$event.target ? null : 
      ($event.target as HTMLInputElement).files;
    this.file = !!files && files.length > 0 ? 
      files[0] : null;				
  }
	
  protected upload(): void {
    if(!!this.file) {
      const formData = new FormData();
      formData.append('file', this.file as Blob, this.file.name as string);
      // Show the spinner before the request starts; setting the flag in tap()
      // would only run when the response arrives.
      this.uploading = true;
      this.documentService.postDocumentForm(formData)
        .pipe(takeUntilDestroyed(this.destroyRef))
        .subscribe(result => {this.uploading = false; 
          this.dialogRef.close();});
    }
  }
	
  protected cancel(): void {
    this.dialogRef.close();
  }
}

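This is a standalone component with its module imports and the injected 'DestroyRef'.

The 'onFileInputChange(...)' method takes the event parameter and stores its 'files' property in the 'files' constant. Then the first file is checked and stored in the 'file' component property.

The 'upload()' method checks the 'file' property and creates the 'FormData()' for the file upload. The field name ('file'), the content ('this.file'), and the file name ('this.file.name') are appended to the 'formData' constant. Then 'documentService' is used to post the 'FormData()' object to the server. The 'takeUntilDestroyed(this.destroyRef)' function unsubscribes the RxJS pipeline after the component is destroyed. That makes unsubscribing pipelines very convenient in Angular.

Backend

The backend is a Spring Boot application with the Spring AI framework. Spring AI manages the requests to the OpenAI models and the vector database.

Liquibase Database Setup

The database setup is done with Liquibase, and the script can be found in db.changelog-1.xml: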

<databaseChangeLog
  xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog
    http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-3.8.xsd">
  <changeSet id="1" author="angular2guy">
    <sql>CREATE EXTENSION if not exists hstore;</sql>		
  </changeSet>
  <changeSet id="2" author="angular2guy">
    <sql>CREATE EXTENSION if not exists vector;</sql>		
  </changeSet>
  <changeSet id="3" author="angular2guy">
    <sql>CREATE EXTENSION if not exists "uuid-ossp";</sql>		
  </changeSet>
  <changeSet author="angular2guy" id="4">
    <createTable tableName="document">
      <column name="id" type="bigint">
        <constraints primaryKey="true"/> 
      </column>
      <column name="document_name" type="varchar(255)">
        <constraints notNullConstraintName="document_document_name_notnull" 
          nullable="false"/>
      </column>
      <column name="document_type" type="varchar(25)">
        <constraints notNullConstraintName="document_document_type_notnull" 
          nullable="false"/>
      </column>
      <column name="document_content" type="blob"/>
    </createTable>
  </changeSet>
  <changeSet author="angular2guy" id="5">
    <createSequence sequenceName="document_seq" incrementBy="50"   
      startValue="1000" />
  </changeSet>
  <changeSet id="6" author="angular2guy">
    <createTable tableName="vector_store">
      <column name="id" type="uuid" 
        defaultValueComputed="uuid_generate_v4 ()">
	<constraints primaryKey="true"/> 
      </column>
      <column name="content" type="text"/>			
      <column name="metadata" type="json"/>				
      <column name="embedding" type="vector(1536)">
        <constraints notNullConstraintName= 
          "vectorstore_embedding_type_notnull" nullable="false"/>
      </column>
    </createTable>
  </changeSet>
  <changeSet id="7" author="angular2guy">
    <sql>CREATE INDEX vectorstore_embedding_index ON vector_store 
      USING HNSW (embedding vector_cosine_ops);</sql>
  </changeSet>
</databaseChangeLog>

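In changeset 4, the table for the JPA document entity is created with the primary key 'id'. The content type and size are unknown, so the column type is 'blob'. In changeset 5, the sequence for the JPA entity is created with the default properties of the Hibernate 6 sequences used by Spring Boot 3.x. In changeset 6, the 'vector_store' table is created with a primary key 'id' of type 'uuid' that is generated by the 'uuid-ossp' extension. The 'content' column is of type 'text' ('clob' in other databases) and has a flexible size. The 'metadata' column stores the metadata of the AIDocuments as 'json'. The 'embedding' column stores the embedding vector with the number of OpenAI dimensions (1536). In changeset 7, the index for the fast search of the 'embedding' column is created. Because of the limited parameters of the Liquibase '<createIndex ...>' tag, '<sql>' is used directly to create it.

Spring Boot / Spring AI Implementation

The DocumentController for the frontend looks like this: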

@RestController
@RequestMapping("rest/document")
public class DocumentController {
  private final DocumentMapper documentMapper;
  private final DocumentService documentService;

  public DocumentController(DocumentMapper documentMapper, 
    DocumentService documentService) {
    this.documentMapper = documentMapper;
    this.documentService = documentService;
  }

  @PostMapping("/upload")
  public long handleDocumentUpload(
    @RequestParam("file") MultipartFile document) {
    var docSize = this.documentService
      .storeDocument(this.documentMapper.toEntity(document));
    return docSize;
  }

  @GetMapping("/list")
  public List<DocumentDto> getDocumentList() {
    return this.documentService.getDocumentList().stream()
      .map(this.documentMapper::toDto)
      .map(myDocument -> {
        // Drop the content to keep the list response small.
        myDocument.setDocumentContent(null);
        return myDocument;
      }).toList();
  }

  @GetMapping("/doc/{id}")
  public ResponseEntity<DocumentDto> getDocument(
    @PathVariable("id") Long id) {
    return ResponseEntity.ofNullable(this.documentService
      .getDocumentById(id).stream().map(this.documentMapper::toDto)
      .findFirst().orElse(null));
  }
	
  @GetMapping("/content/{id}")
  public ResponseEntity<byte[]> getDocumentContent(
    @PathVariable("id") Long id) {
    var resultOpt = this.documentService.getDocumentById(id).stream()
      .map(this.documentMapper::toDto).findFirst();
    var result = resultOpt.stream().map(this::toResultEntity)
      .findFirst().orElse(ResponseEntity.notFound().build());
    return result;
  }

  private ResponseEntity<byte[]> toResultEntity(DocumentDto documentDto) {
    var contentType = switch (documentDto.getDocumentType()) {
      case DocumentType.PDF -> MediaType.APPLICATION_PDF;
      case DocumentType.HTML -> MediaType.TEXT_HTML;
      case DocumentType.TEXT -> MediaType.TEXT_PLAIN;
      case DocumentType.XML -> MediaType.APPLICATION_XML;
      default -> MediaType.ALL;
    };
    return ResponseEntity.ok().contentType(contentType)
      .body(documentDto.getDocumentContent());
  }
	
  @PostMapping("/search")
  public DocumentSearchDto postDocumentSearch(@RequestBody 
    SearchDto searchDto) {
    var result = this.documentMapper
      .toDto(this.documentService.queryDocuments(searchDto));
    return result;
  }
}

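The 'handleDocumentUpload(...)' method handles the document upload at the '/rest/document/upload' path.

The 'getDocumentList()' method handles the GET requests for the document list and removes the document content to reduce the response size.

The 'getDocumentContent(...)' method handles the GET requests for the document content. It loads the document with 'documentService' and maps the 'DocumentType' to the 'MediaType'. Then it returns the content and the content type, and the browser opens the content based on the content type.

The 'postDocumentSearch(...)' method puts the request content in the 'SearchDto' object and returns the AI-generated result of the 'documentService.queryDocuments(...)' call.

The 'storeDocument(...)' method of the DocumentService looks like this: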

public Long storeDocument(Document document) {
  var myDocument = this.documentRepository.save(document);
  Resource resource = new ByteArrayResource(document.getDocumentContent());
  var tikaDocuments = new TikaDocumentReader(resource).get();
  record TikaDocumentAndContent(org.springframework.ai.document.Document    
    document, String content) {	}
  var aiDocuments = tikaDocuments.stream()
    .flatMap(myDocument1 -> this.splitStringToTokenLimit(
      myDocument1.getContent(), CHUNK_TOKEN_LIMIT)
    .stream().map(myStr -> new TikaDocumentAndContent(myDocument1, myStr)))
      .map(myTikaRecord -> new org.springframework.ai.document.Document(
        myTikaRecord.content(),	myTikaRecord.document().getMetadata()))
      .peek(myDocument1 -> myDocument1.getMetadata()
        .put(ID, myDocument.getId().toString())).toList();
  LOGGER.info("Name: {}, size: {}, chunks: {}", document.getDocumentName(),   
    document.getDocumentContent().length, aiDocuments.size());
  this.documentVsRepository.add(aiDocuments);
  return Optional.ofNullable(myDocument.getDocumentContent()).stream()
    .map(myContent -> Integer.valueOf(myContent.length).longValue())
    .findFirst().orElse(0L);
}

private List<String> splitStringToTokenLimit(String documentStr, 
  int tokenLimit) {
  List<String> splitStrings = new ArrayList<>();
  var tokens = new StringTokenizer(documentStr).countTokens();
  var chunks = Math.ceilDiv(tokens, tokenLimit);
  if (chunks == 0) {
    return splitStrings;
  }
  var chunkSize = Math.ceilDiv(documentStr.length(), chunks);
  var myDocumentStr = new String(documentStr);
  while (!myDocumentStr.isBlank()) {
    splitStrings.add(myDocumentStr.length() > chunkSize ?  
      myDocumentStr.substring(0, chunkSize) : myDocumentStr);
    myDocumentStr = myDocumentStr.length() > chunkSize ? 
      myDocumentStr.substring(chunkSize) : "";
  }
  return splitStrings;
}

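The 'storeDocument(...)' method saves the document to the relational database. Then the document content is wrapped in a 'ByteArrayResource' and read with the 'TikaDocumentReader' of Spring AI, which turns it into a list of AIDocuments. The AIDocument list is then flat-mapped to split the documents into chunks with the 'splitStringToTokenLimit(...)' method, and the chunks are turned into new AIDocuments that carry the 'id' of the stored document in their 'metadata' map. The 'id' in the metadata makes it possible to load the matching document entity for an AIDocument. Then the embeddings for the AIDocuments are created implicitly by the call to 'documentVsRepository.add(...)', which calls the OpenAI embedding model and stores the AIDocuments with their embeddings in the vector database. Finally, the size of the document content in bytes is returned.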
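
The chunking step can be tried in isolation. The sketch below is a standalone adaptation of 'splitStringToTokenLimit(...)' with no Spring dependencies; the class name 'ChunkSplitSketch' and the sample text are made up for illustration. Note that 'StringTokenizer' counts whitespace-separated words, which only approximates the tokens an OpenAI model counts, and that the split happens at character positions, so a chunk boundary can fall inside a word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class ChunkSplitSketch {
  // Splits the text into equally sized character chunks so that each chunk
  // holds roughly 'tokenLimit' whitespace-separated words at most.
  static List<String> splitStringToTokenLimit(String documentStr, int tokenLimit) {
    List<String> splitStrings = new ArrayList<>();
    int tokens = new StringTokenizer(documentStr).countTokens();
    int chunks = (tokens + tokenLimit - 1) / tokenLimit; // ceiling division
    if (chunks == 0) {
      return splitStrings; // empty or blank input produces no chunks
    }
    int chunkSize = (documentStr.length() + chunks - 1) / chunks;
    String rest = documentStr;
    while (!rest.isBlank()) {
      splitStrings.add(rest.length() > chunkSize ? rest.substring(0, chunkSize) : rest);
      rest = rest.length() > chunkSize ? rest.substring(chunkSize) : "";
    }
    return splitStrings;
  }

  public static void main(String[] args) {
    // 10 words with a limit of 4 words per chunk gives 3 chunks of 16 characters.
    var text = "one two three four five six seven eight nine ten";
    splitStringToTokenLimit(text, 4).forEach(part -> System.out.println("[" + part + "]"));
  }
}
```

Because the chunks are cut by character count, concatenating them restores the original text, which makes the behavior easy to check.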

The 'queryDocuments(...)' method looks like this:

public AiResult queryDocuments(SearchDto searchDto) {		
  var similarDocuments = this.documentVsRepository
    .retrieve(searchDto.getSearchString());
  var mostSimilar = similarDocuments.stream()
    .sorted((myDocA, myDocB) -> ((Float) myDocA.getMetadata().get(DISTANCE))
    .compareTo(((Float) myDocB.getMetadata().get(DISTANCE)))).findFirst();
  var documentChunks = mostSimilar.stream().flatMap(mySimilar -> 
    similarDocuments.stream().filter(mySimilar1 ->   
      mySimilar1.getMetadata().get(ID).equals(
        mySimilar.getMetadata().get(ID)))).toList();
  Message systemMessage = switch (searchDto.getSearchType()) {
    case SearchDto.SearchType.DOCUMENT -> this.getSystemMessage(
      documentChunks, (documentChunks.size() <= 0 ? 2000 
        : Math.floorDiv(2000, documentChunks.size())));
    case SearchDto.SearchType.PARAGRAPH ->  
      this.getSystemMessage(mostSimilar.stream().toList(), 2000);
  };
  UserMessage userMessage = new UserMessage(searchDto.getSearchString());
  Prompt prompt = new Prompt(List.of(systemMessage, userMessage));
  LocalDateTime start = LocalDateTime.now();
  AiResponse response = aiClient.generate(prompt);
  LOGGER.info("AI response time: {}ms",
    ZonedDateTime.of(LocalDateTime.now(),   
    ZoneId.systemDefault()).toInstant().toEpochMilli()
    - ZonedDateTime.of(start, ZoneId.systemDefault()).toInstant()
    .toEpochMilli());
  var documents = mostSimilar.stream().map(myGen -> 
    myGen.getMetadata().get(ID)).filter(myId ->  
      Optional.ofNullable(myId).stream().allMatch(myId1 -> 
        (myId1 instanceof String))).map(myId -> 
          Long.parseLong(((String) myId)))
        .map(this.documentRepository::findById)
        .filter(Optional::isPresent)
        .map(Optional::get).toList();
  return new AiResult(searchDto.getSearchString(), 
    response.getGenerations(), documents);
}

private Message  getSystemMessage(
  List<org.springframework.ai.document.Document> similarDocuments, 
  int tokenLimit) {
  String documents = similarDocuments.stream()
    .map(entry -> entry.getContent())
    .filter(myStr -> myStr != null && !myStr.isBlank())
    .map(myStr -> this.cutStringToTokenLimit(myStr, tokenLimit))
    .collect(Collectors.joining("\n"));
  SystemPromptTemplate systemPromptTemplate = 
    new SystemPromptTemplate(this.systemPrompt);
  Message systemMessage = systemPromptTemplate
    .createMessage(Map.of("documents", documents));
  return systemMessage;
}

private String cutStringToTokenLimit(String documentStr, int tokenLimit) {
  String cutString = new String(documentStr);
  while (tokenLimit < new StringTokenizer(cutString, " -.;,").countTokens()){
    cutString = cutString.length() > 1000 ? 
      cutString.substring(0, cutString.length() - 1000) : "";
  }
  return cutString;
}

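The method first loads the documents that best match 'searchDto.getSearchString()' from the vector database. To do that, the OpenAI embedding model is called to turn the search string into an embedding, and with that embedding the vector database is queried for the AIDocuments with the lowest distance between the search embedding and the stored embedding. The AIDocument with the lowest distance is stored in the 'mostSimilar' variable. Then all the AIDocuments that are chunks of the same document are collected by matching the document entity 'id' in their metadata. The 'systemMessage' is created with the content of 'documentChunks' or 'mostSimilar', depending on the search type. The 'getSystemMessage(...)' method takes them, cuts the content chunks to a size that the OpenAI GPT models can handle, and returns the 'Message'. Then the 'systemMessage' and the 'userMessage' are combined into a 'prompt' that is sent to the OpenAI GPT model with 'aiClient.generate(prompt)'. After that the AI answer is available, and the document entity is loaded with the 'id' from the metadata of the 'mostSimilar' AIDocument. The 'AiResult' is created with the search string, the GPT answer, and the document entity, and is returned.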
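
The selection and prompt-building logic can be sketched without Spring AI. The following standalone example is an approximation under stated assumptions: the 'Chunk' record, the 'TEMPLATE' text, and the 'firstWords(...)' helper are hypothetical, and the word-based cut is a simpler replacement for 'cutStringToTokenLimit(...)', not the project's implementation.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class PromptAssemblySketch {
  // Hypothetical stand-in for a retrieved chunk with its metadata.
  record Chunk(String documentId, double distance, String content) {}

  // Hypothetical template; the project's real system prompt is not shown here.
  static final String TEMPLATE = "Answer using only these documents:\n{documents}";

  // Keep every retrieved chunk that belongs to the same source document
  // as the chunk with the lowest distance.
  static List<Chunk> chunksOfBestDocument(List<Chunk> retrieved) {
    return retrieved.stream()
      .min(Comparator.comparingDouble(Chunk::distance))
      .map(best -> retrieved.stream()
        .filter(chunk -> chunk.documentId().equals(best.documentId()))
        .toList())
      .orElse(List.of());
  }

  // Split a fixed word budget evenly across the chunks, then fill the template.
  static String systemPrompt(List<Chunk> chunks, int totalBudget) {
    int perChunk = chunks.isEmpty() ? totalBudget : totalBudget / chunks.size();
    String documents = chunks.stream()
      .map(chunk -> firstWords(chunk.content(), perChunk))
      .collect(Collectors.joining("\n"));
    return TEMPLATE.replace("{documents}", documents);
  }

  // Keep at most 'limit' whitespace-separated words of the text.
  static String firstWords(String text, int limit) {
    String[] words = text.trim().split("\\s+");
    return String.join(" ", List.of(words).subList(0, Math.min(limit, words.length)));
  }

  public static void main(String[] args) {
    var retrieved = List.of(
      new Chunk("1000", 0.21, "alpha beta gamma delta"),
      new Chunk("1050", 0.35, "unrelated text"),
      new Chunk("1000", 0.40, "epsilon zeta eta theta"));
    var chunks = chunksOfBestDocument(retrieved); // both chunks of document 1000
    System.out.println(systemPrompt(chunks, 6));  // 3 words per chunk
  }
}
```

Dividing the budget by the number of chunks mirrors the 'Math.floorDiv(2000, documentChunks.size())' call in the DOCUMENT search type: the more chunks of one document are included, the less of each chunk fits into the prompt.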

The vector database repository DocumentVSRepositoryBean, built on the Spring AI 'VectorStore', looks like this:

@Repository
public class DocumentVSRepositoryBean implements DocumentVsRepository {    
  private final VectorStore vectorStore;
    
  public DocumentVSRepositoryBean(JdbcTemplate jdbcTemplate, 
    EmbeddingClient embeddingClient) {				
    this.vectorStore = new PgVectorStore(jdbcTemplate, embeddingClient);
  }
	
  public void add(List<Document> documents) {
    this.vectorStore.add(documents);
  }
	
  public List<Document> retrieve(String query, int k, double threshold) {
    return  new VectorStoreRetriever(vectorStore, k, 
      threshold).retrieve(query);
  }
	
  public List<Document> retrieve(String query) {
    return new VectorStoreRetriever(vectorStore).retrieve(query);
  }
}

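The repository has the 'vectorStore' property that is used to access the vector database. It is created in the constructor from the injected parameters with the 'new PgVectorStore(...)' call. The PgVectorStore class is the Spring AI integration for the PostgreSQL vector database extension. It uses the 'embeddingClient' to call the OpenAI embedding model and the 'jdbcTemplate' to access the database.

The 'add(...)' method calls the OpenAI embedding model and adds the AIDocuments to the vector database.

The 'retrieve(...)' methods query the vector database for the embeddings with the lowest distances.

Conclusion

Angular makes the creation of the frontend easy. The standalone components with lazy loading keep the initial load small. The Angular Material components helped a lot with the implementation and are easy to use.

Spring Boot with Spring AI makes it easy to use large language models. Spring AI provides a framework that hides the creation of the embeddings and offers an easy-to-use interface for storing AIDocuments in a vector database (several are supported). The creation of the embedding for the search prompt is handled as well, and the interface to the vector database is simple. The Spring AI prompt classes also make it easy to create the prompt for the OpenAI GPT models. The model is called with the injected 'aiClient', and the results are returned.

Spring AI is a very good framework from the Spring team. There have been no problems with the experimental version.

With Spring AI, large language models are now easy to use on our own documents.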
