6.1. RAG training programme
Data extraction & cleansing and the calculation & storage of embeddings are the two decisive steps in training a RAG model. At both stages, the quality of the processed content can still be influenced to achieve better retrieval and generation results.
Data extraction & cleansing
First, unstructured data must be converted into a uniform text format. Text can be extracted from PDFs with pdfplumber or PyMuPDF, while web pages are fetched and then parsed with BeautifulSoup. Data from databases can be retrieved via SQL queries or APIs.
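A minimal extraction sketch using pdfplumber and BeautifulSoup; the file path and URL passed in would be placeholders for real sources:

```python
import pdfplumber
import requests
from bs4 import BeautifulSoup

def extract_pdf_text(path: str) -> str:
    """Extract plain text from every page of a PDF with pdfplumber."""
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)

def extract_web_text(url: str) -> str:
    """Fetch a web page and strip its markup with BeautifulSoup."""
    html = requests.get(url, timeout=10).text
    return BeautifulSoup(html, "html.parser").get_text(separator="\n")
```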
After extraction, the text is segmented into sentences, paragraphs or thematic units. This creates smaller, more precise blocks of information for retrieval. The text is then cleaned by removing superfluous spaces, HTML tags and special characters.
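A sketch of paragraph-based segmentation followed by cleaning. It assumes raw_text holds the output of the extraction step above; the max_chars limit and the cleaning patterns are illustrative choices, not fixed rules:

```python
import re

def clean(text: str) -> str:
    """Strip residual HTML tags, special characters, and redundant whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)            # leftover HTML tags
    text = re.sub(r"[^\w\s.,;:!?()\-]", " ", text)  # special characters
    return re.sub(r"[ \t]+", " ", text).strip()     # superfluous spaces

def chunk_paragraphs(text: str, max_chars: int = 1000) -> list[str]:
    """Merge blank-line-separated paragraphs into chunks of at most max_chars."""
    chunks, current = [], ""
    for para in re.split(r"\n\s*\n", text):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

# Segment first, then clean each chunk, mirroring the order described above.
chunks = [clean(c) for c in chunk_paragraphs(raw_text)]
```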
To improve the semantic quality of the data, metadata such as title, author, date or categories can be integrated. This metadata later supports semantic search and better context weighting.
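One straightforward way to carry metadata alongside each chunk is a list of records; the field values below are hypothetical placeholders that would come from the source documents in practice:

```python
# Attach metadata to every chunk produced in the previous step.
chunks_with_meta = [
    {
        "text": chunk,
        "title": "2024 Annual Report",  # placeholder metadata
        "author": "Jane Doe",
        "date": "2024-03-01",
        "category": "finance",
    }
    for chunk in chunks
]
```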
Calculating & storing embeddings
The processed text segments are converted into vector representations. Pre-trained sentence-transformers models such as all-MiniLM, BGE or E5 can be used for this, or fine-tuned for specific domains using contrastive learning.
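A minimal sketch with the sentence-transformers package; all-MiniLM-L6-v2 stands in here for any of the models named above:

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 is one of the pre-trained models mentioned above.
model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [record["text"] for record in chunks_with_meta]
# Normalized vectors let cosine similarity be computed as a plain dot product.
embeddings = model.encode(texts, normalize_embeddings=True)
```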
Separate embeddings for metadata and content can be created and combined to capture semantic relationships more accurately. Alternatively, special tags ([TITLE], [DATE]) can be inserted into the text to make the model sensitive to the different kinds of context.
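Both approaches might look like this; the tag names follow the examples above, while the 0.3/0.7 weighting in the second variant is an arbitrary illustration rather than a recommendation:

```python
import numpy as np

# Variant 1: prepend tags so a single embedding carries metadata and content.
tagged = [
    f"[TITLE] {r['title']} [DATE] {r['date']} [TEXT] {r['text']}"
    for r in chunks_with_meta
]
tagged_embeddings = model.encode(tagged, normalize_embeddings=True)

# Variant 2: separate embeddings for metadata and content,
# combined as a weighted average.
title_emb = model.encode([r["title"] for r in chunks_with_meta],
                         normalize_embeddings=True)
text_emb = model.encode([r["text"] for r in chunks_with_meta],
                        normalize_embeddings=True)
combined = 0.3 * title_emb + 0.7 * text_emb
combined /= np.linalg.norm(combined, axis=1, keepdims=True)  # re-normalize
```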
The calculated embeddings are then saved in a vector database such as FAISS, Pinecone or ChromaDB. This allows texts to be searched and retrieved efficiently.
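A minimal FAISS sketch, assuming the normalized combined vectors from the previous step; the query string is a made-up example:

```python
import faiss
import numpy as np

vectors = np.asarray(combined, dtype="float32")
index = faiss.IndexFlatIP(vectors.shape[1])  # inner product = cosine for normalized vectors
index.add(vectors)

# Embed a query the same way and retrieve the three most similar chunks.
query = model.encode(
    ["What does the report say about revenue?"], normalize_embeddings=True
).astype("float32")
scores, ids = index.search(query, 3)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {chunks_with_meta[i]['text'][:80]}")
```

The exact IndexFlatIP used here searches exhaustively; for larger corpora, FAISS also offers approximate indexes that trade a little accuracy for much faster lookups.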