How it Works
This pipeline uses Pathway connectors to read data from local drive, Google Drive, and Microsoft SharePoint sources. These connectors poll for changes with low latency and track modifications: if something changes in the tracked files, the corresponding change is reflected in the internal collections. The contents are read into a single Pathway Table as binary objects.
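The change-tracking idea can be pictured with a minimal polling loop. This is a stdlib-only sketch of the concept, not Pathway's API — the real connectors do this incrementally and with much lower latency:

```python
import os


def snapshot(root):
    """Map each file path under `root` to its last-modification time."""
    state = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            state[path] = os.path.getmtime(path)
    return state


def diff(old, new):
    """Return (added, modified, deleted) paths between two snapshots."""
    added = [p for p in new if p not in old]
    modified = [p for p in new if p in old and new[p] != old[p]]
    deleted = [p for p in old if p not in new]
    return added, modified, deleted
```

In the real pipeline, each detected addition, modification, or deletion updates the internal Pathway Table instead of being returned as a list.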
After that, those binary objects are parsed with the `unstructured` library and split into chunks. The pipeline then embeds the obtained chunks using the OpenAI API.
Finally, the embeddings are indexed using Pathway's machine-learning library. You can then query the created index with simple HTTP requests to the endpoints mentioned above.
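The parse → chunk → embed → index flow can be illustrated with a toy in-memory version. This is a stdlib-only sketch of the concept: the real pipeline parses with the `unstructured` library, embeds via the OpenAI API, and indexes with Pathway — the bag-of-words "embedding" below is just a stand-in.

```python
import math
from collections import Counter


def chunk(text, size=40):
    """Split text into fixed-size character chunks (real splitters are smarter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed(text):
    """Toy bag-of-words vector; the pipeline uses OpenAI embeddings instead."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def build_index(chunks):
    """Pair each chunk with its embedding."""
    return [(c, embed(c)) for c in chunks]


def query(index, question, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]
```

A query then retrieves the chunks closest to the question in embedding space, which is exactly what the HTTP endpoints do against the live index.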
Understanding your RAG pipeline
This folder contains several objects:

- `app.py`, the application code using Pathway and written in Python;
- `config.yaml`, the file containing configuration stubs for the data sources, the OpenAI LLM model, and the web server. It needs to be customized if you want to change the LLM model, use the Google Drive data source, or change the filesystem directories that will be indexed;
- `requirements.txt`, the dependencies for your pipeline. It can be passed to `pip install -r ...` to install everything that is needed to launch the pipeline locally;
- `Dockerfile`, the Docker configuration for running the pipeline in the container;
- `.env`, a short environment-variables configuration file where the OpenAI key must be stored;
- `data/`, a folder with exemplary files that can be used for the test runs.
Let's understand your application code in app.py
Your `app.py` file follows a sequence of steps. Before diving into the code, here is an overview:
Set Up Your License Key: Ensure you have the necessary access to Pathway features.
Configure Logging: Set up logging to monitor what’s happening in your application.
Load Environment Variables: Manage sensitive data securely.
Define Data Sources Function: Handle data from various sources seamlessly.
Main Function with Click: Use command-line interaction to control your pipeline.
Initialize Embedder: Convert text to embeddings for further processing.
Initialize Chat Model: Set up your language model for generating responses.
Set Up Vector Store: Manage and retrieve document embeddings efficiently.
Set Up RAG Application: Combine retrieval and generation for effective question answering.
Build and Run Server: Start your server to handle real-time requests.
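The sequence above can be sketched as a stdlib-only skeleton. This is an illustration of the structure, not Pathway's actual API: `argparse` stands in for Click, and the embedder, chat model, vector store, and server wiring are represented by comments because they require Pathway and an OpenAI key.

```python
import argparse
import logging
import os


def data_sources(paths):
    """Step 4: collect input files from the configured folders (local-only sketch)."""
    files = []
    for root in paths:
        for dirpath, _, names in os.walk(root):
            files.extend(os.path.join(dirpath, n) for n in names)
    return files


def run(data_dirs, host="0.0.0.0", port=8000):
    # Step 1: the real app first sets the Pathway license key.
    # Step 2: configure logging to monitor the application.
    logging.basicConfig(level=logging.INFO)
    # Step 3: load secrets from the environment (populated from .env).
    api_key = os.environ.get("OPENAI_API_KEY", "")
    sources = data_sources(data_dirs)
    logging.info("indexing %d files, serving on %s:%d", len(sources), host, port)
    # Steps 6-10: in app.py this is where the embedder, chat model,
    # vector store, RAG application, and web server are wired together.
    return sources


if __name__ == "__main__":
    # Step 5: command-line entry point (app.py uses Click; argparse shown here).
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", action="append", default=["data/"])
    args = parser.parse_args()
    run(args.data_dir)
```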
Possible Modifications
Change Input Folders: Update paths to new data folders.
Modify LLM: Switch to a different language model.
Change Embedder: Use an alternative embedder from Pathway's embedders module.
Update Index: Configure a different indexing method.
Host and Port: Adjust the host and port settings for different environments.
Run Options: Enable or disable caching and specify a new cache folder.
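Most of these modifications live in `config.yaml`. The exact keys depend on your copy of the file, so the fragment below is only an illustrative shape, not the real schema:

```yaml
# Illustrative shape only - check the keys in your actual config.yaml.
sources:
  - local:
      path: data/          # Change Input Folders: point to new data folders
llm:
  model: gpt-3.5-turbo     # Modify LLM: swap in a different model
server:
  host: 0.0.0.0            # Host and Port: adjust per environment
  port: 8000
```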
You can also easily create new components by extending the `pw.UDF` class and implementing the `__wrapped__` function.
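The pattern looks roughly like this. Because Pathway may not be installed in your environment, the sketch defines a minimal stand-in base class so it is self-contained; in real code you would subclass `pw.UDF` directly and apply the component to table columns.

```python
# Stand-in for pw.UDF so this sketch runs without Pathway installed.
class UDF:
    def __call__(self, *args, **kwargs):
        # Delegate calls to the subclass-provided per-row logic.
        return self.__wrapped__(*args, **kwargs)


class UppercaseStub(UDF):
    """A hypothetical custom component: put the per-row logic in __wrapped__."""

    def __wrapped__(self, text: str) -> str:
        return text.upper()
```

For example, `UppercaseStub()("hello")` returns `"HELLO"`; a real Pathway UDF is applied to columns of a table in the same spirit.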
Conclusion
This tutorial demonstrated how to set up a powerful RAG pipeline with always up-to-date knowledge. While we've only scratched the surface, there's more to explore:
Re-ranking: Prioritize the most relevant results for your specific query.
Knowledge Graphs: Leverage relationships between entities to improve understanding.
Hybrid Indexing: Combine different indexing strategies for optimal retrieval.
Adaptive Reranking: Iteratively enlarge the context for optimal accuracy; see the next tutorial on adaptive RAG.
Stay tuned for future examples exploring these RAG techniques with Pathway!
Enjoy building your RAG project! If you have any questions or need further assistance, feel free to reach out to the Pathway team or check with your peers from the bootcamp cohort.
What if you want to use a Multimodal LLM like GPT-4o?
That's a great idea indeed. Multimodal LLMs like GPT-4o excel at parsing images, charts, and tables, which can significantly improve accuracy even for text-based use cases.
For example, imagine you're building a RAG project with Google Drive as a data source, and that Drive folder contains several financial documents with charts and tables. Below is an interesting example where you'll see how Pathway parsed tables as images and used GPT-4o to get a much more accurate response.