Part 5 - Creating Your Own AI-Powered Knowledge Base with Ollama
Now that you have your model up and running, it’s time to put it to work on something genuinely useful: a personal knowledge base Q&A system. Imagine an AI assistant that can retrieve, synthesize, and explain information from your personal or professional documents, research papers, or any specialized content you care about.
The Core Challenge: Context Is Everything
Large language models like Llama 3.1 come pre-trained with vast general knowledge, but they truly shine when provided with specific context relevant to your questions. The key to an effective knowledge base system is getting the right information into your model’s context window.
Here’s our approach:
- Organize your knowledge sources
- Structure effective prompts
- Create a specialized model
- Build simple retrieval mechanisms
Let’s walk through each step to create a system that gives you accurate, insightful answers based on your specialized knowledge.
Organizing Your Knowledge
Before we start querying, we need to organize our information. Create a dedicated directory for your knowledge base:
mkdir -p ~/knowledge_base/documents
Place your text files, markdown documents, or text-extracted PDFs in this directory. The cleaner and more structured your documents, the better your results will be.
For best results:
- Break large documents into smaller, topic-focused files
- Use clear filenames that describe the content
- Include headers and structured formatting where possible
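As a sketch, a topic-focused layout might look like the following (the filenames and contents here are purely illustrative; nothing in the tooling requires them):

```shell
# Illustrative layout: small, topic-focused files with descriptive names
mkdir -p ~/knowledge_base/documents/climate
printf '# Greenhouse Gases\n\nCO2, methane, and other heat-trapping gases.\n' \
  > ~/knowledge_base/documents/climate/greenhouse-gases.md
printf '# Ocean Warming\n\nHeat uptake by the oceans.\n' \
  > ~/knowledge_base/documents/climate/ocean-warming.md
ls ~/knowledge_base/documents/climate
```

Each file covers one narrow subject, which keeps individual documents small enough to fit in the model's context window.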
Structuring Effective Prompts for Knowledge Retrieval
The magic of a good knowledge base system lies in how you structure your prompts. Here’s a template that works well:
cat << EOF > knowledge-prompt.txt
DOCUMENT:
{{DOCUMENT_TEXT}}

Based on the information in the document above, please answer the following question.
If the answer cannot be found in the document, state that clearly rather than making up information.

QUESTION:
{{QUERY}}
EOF
This template:
- Clearly separates the reference document from the query
- Instructs the model to only use provided information
- Reduces hallucinations by asking the model to acknowledge knowledge gaps
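To make the template mechanics concrete, here is a minimal bash sketch of filling the two placeholders with parameter expansion (the inline template, sample document, and question are invented for illustration):

```shell
# Minimal demo: fill {{DOCUMENT_TEXT}} and {{QUERY}} placeholders in a template
TEMPLATE='DOCUMENT:
{{DOCUMENT_TEXT}}

Answer the question using only the document above.

QUESTION:
{{QUERY}}'

DOC="CO2 is the primary greenhouse gas emitted by human activities."
Q="What is the primary greenhouse gas?"

# Bash pattern substitution replaces every occurrence of the placeholder
PROMPT=${TEMPLATE//'{{DOCUMENT_TEXT}}'/$DOC}
PROMPT=${PROMPT//'{{QUERY}}'/$Q}
echo "$PROMPT"
```

The retrieval script later in this post uses this same substitution mechanism on the template file.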
Creating a Specialized Knowledge Assistant Model
Now let’s create a specialized model optimized for knowledge retrieval:
cat << EOF > KnowledgeAssistant
FROM llama3.1:latest
SYSTEM """You are a precise knowledge assistant. Your primary goal is to provide accurate information based solely on the documents provided to you. You should:
1. Focus only on the content in the provided documents
2. Cite specific sections when answering
3. Admit when you don't have enough information
4. Provide concise, well-structured answers
5. Never fabricate information"""
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
EOF
ollama create knowledge-assistant -f KnowledgeAssistant
This model prioritizes accuracy over creativity with the low temperature setting and has a larger context window to accommodate your documents.
Building a Simple Document Retrieval System
Now let’s create a basic shell script that will:
- Take a query from the user
- Select a relevant document
- Feed both to our model
cat << EOF > query-knowledge.sh
#!/bin/bash
# Directory containing knowledge documents
KNOWLEDGE_DIR=~/knowledge_base/documents/climate/
# Get query from arguments
QUERY="\$*"
if [ -z "\$QUERY" ]; then
echo "Please provide a query"
exit 1
fi
# Simple keyword-based document selection (can be improved)
echo "Searching for relevant documents..."
RELEVANT_DOCS=\$(grep -li "\$QUERY" "\$KNOWLEDGE_DIR"/* 2>/dev/null)
if [ -z "\$RELEVANT_DOCS" ]; then
echo "No directly relevant documents found. Using first 3 documents..."
RELEVANT_DOCS=\$(ls \$KNOWLEDGE_DIR/* | head -n 3)
fi
# Process each relevant document
for DOC in \$RELEVANT_DOCS; do
echo "Processing document: \$(basename \$DOC)"
# Prepare the prompt with document content and query
DOCUMENT_CONTENT=\$(cat "\$DOC")
PROMPT=\$(cat knowledge-prompt.txt)
PROMPT=\${PROMPT//\{\{DOCUMENT_TEXT\}\}/\$DOCUMENT_CONTENT}
PROMPT=\${PROMPT//\{\{QUERY\}\}/\$QUERY}
# Run the query through our knowledge assistant
echo "Analyzing document content..."
ollama run knowledge-assistant "\$PROMPT"
echo -e "\n---\n"
done
EOF
chmod +x query-knowledge.sh
This script searches for documents containing your query string (a simple literal match), then runs each matching document through your knowledge assistant model.
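One easy upgrade, sketched below, is to rank documents by how many individual query words each contains instead of requiring the whole phrase to match. This is still crude compared to real relevance scoring, and the directory path and query words are just examples:

```shell
# rank_docs DIR WORD... : print "score path" lines, best-matching document first
rank_docs() {
  local dir=$1; shift
  local doc word score
  for doc in "$dir"/*; do
    score=0
    for word in "$@"; do
      # count each query word at most once per document
      grep -qiw -- "$word" "$doc" 2>/dev/null && score=$((score + 1))
    done
    printf '%s %s\n' "$score" "$doc"
  done | sort -rn
}

# Example: rank the climate documents for a few query words
rank_docs ~/knowledge_base/documents/climate key factors climate change | head -n 3
```

You could drop this into query-knowledge.sh in place of the single grep, then feed only the top-scoring documents to the model.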
Download Some Sample Documents
Let’s populate our knowledge base with some climate change information from NASA:
mkdir -p ~/knowledge_base/documents/climate/
uvx --from inscriptis inscript https://science.nasa.gov/climate-change/causes/ > ~/knowledge_base/documents/climate/climate_change.txt
Note: We’re using uv’s uvx command (installed earlier) to run the inscriptis HTML-to-text converter without a separate install.
Using Your Knowledge Base System
Now you can query your knowledge base with natural language questions:
./query-knowledge.sh "What are the key factors affecting climate change according to the latest report?"
The script will:
- Search your knowledge directory for documents matching the query
- Feed each relevant document to your knowledge assistant
- Return answers based strictly on the content of those documents
Example Use Case: A SOC 2 Compliance Bot
Let’s look at a concrete example. Imagine you need to build a knowledge base about SOC 2 compliance:
- Create a directory for SOC2 documents:
mkdir -p ~/knowledge_base/documents/soc2/
- Point the script at the new documents by editing the KNOWLEDGE_DIR line in query-knowledge.sh:
KNOWLEDGE_DIR=~/knowledge_base/documents/soc2/
- Download some SOC2 documentation:
uvx --from inscriptis inscript https://www.vanta.com/collection/soc-2/what-is-a-soc-2-audit > ~/knowledge_base/documents/soc2/soc2-audit.txt
uvx --from inscriptis inscript https://www.vanta.com/collection/soc-2/why-is-soc-2-important > ~/knowledge_base/documents/soc2/soc2-important.txt
uvx --from inscriptis inscript https://www.vanta.com/collection/soc-2/introduction > ~/knowledge_base/documents/soc2/soc.txt
uvx --from inscriptis inscript https://www.vanta.com/collection/soc-2/what-is-soc-2 > ~/knowledge_base/documents/soc2/soc2.txt
- Run a query about SOC2:
./query-knowledge.sh "What is a SOC2 Audit?"
The system will search your documents, find the relevant SOC 2 material, and synthesize an answer.
Limitations and Improvement Opportunities
This simple system has some limitations:
- Basic keyword matching for document retrieval
- No semantic understanding of document relevance
- Limited context window (even 4096 tokens can be restrictive)
For more advanced capabilities, consider:
- Implementing vector embeddings for semantic search
- Creating document chunks instead of using full documents
- Building a simple RAG (Retrieval-Augmented Generation) system
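As a first step toward chunking, you can split each document into fixed-size pieces and feed the pieces to the model individually. This sketch uses GNU split (the 40-line chunk size is an arbitrary choice, and the paths are just examples):

```shell
# chunk_docs SRC OUT N : split every file in SRC into N-line chunks in OUT
chunk_docs() {
  local src=$1 out=$2 n=$3 doc base
  mkdir -p "$out"
  for doc in "$src"/*; do
    [ -f "$doc" ] || continue
    base=$(basename "$doc")
    # GNU split: numeric suffixes (-d) plus a .txt extension on each chunk
    split -l "$n" -d --additional-suffix=.txt "$doc" "$out/${base%.*}_"
  done
}

# Example: 40-line chunks of the climate documents
chunk_docs ~/knowledge_base/documents/climate ~/knowledge_base/chunks 40
```

Line-count chunking ignores topic boundaries; splitting on headings or blank lines would keep chunks more coherent, but this is enough to stay under the context limit.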
Next Steps for Your Knowledge Base
As you grow more comfortable with your knowledge base system, you might want to:
- Improve document retrieval by incorporating tools like sentence-transformers
- Automate document processing with text extraction tools for PDFs and other formats
- Create a simple web interface using the Ollama API instead of command-line interaction
- Build specialized knowledge models for different domains or document types
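For the web-interface route, the building block is Ollama's local HTTP API. Here is a minimal sketch of a request payload for the /api/generate endpoint, using the model name created earlier (the sample prompt is invented, and sending it assumes an Ollama server on the default port):

```shell
# Build a JSON payload for Ollama's /api/generate endpoint
MODEL="knowledge-assistant"
PROMPT="What is a SOC 2 audit?"
cat > /tmp/ollama_payload.json << JSON
{"model": "$MODEL", "prompt": "$PROMPT", "stream": false}
JSON

# Send it to a locally running Ollama server (default port 11434):
# curl -s http://localhost:11434/api/generate -d @/tmp/ollama_payload.json
```

With "stream": false the server returns a single JSON object whose "response" field holds the full answer, which is easier to consume from a simple web frontend than the default streamed chunks.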
The system we’ve built gives you a solid foundation – a personal AI lab that can answer questions based on your own knowledge sources, all running locally on your machine without sharing your sensitive data with third-party services.
In our final post, we’ll wrap everything up and explore some additional possibilities for your Ollama-powered AI lab.