Communicating Ontology Knowledge with an LLM in Graph RAG: The power of simplicity and standards
- Michael DeBellis

For the last few years, I've been focusing on the Retrieval Augmented Generation (RAG) architecture, using an OWL knowledge graph to store the document corpus. One problem I've noticed since the first system we developed is that a great deal of the information in the ontology is not available to the LLM. With the help of my colleague Bob Neches, I've developed some simple but, I think, highly useful Python functions that address this problem. I plan to write up a paper on this, but the functions are so easy to use (and they rely only on RDFLib, so they aren't dependent on any commercial tools) that I thought I would post about them now in case anyone else working on Graph RAG might find them useful.
I'm currently working on a Data Catalog that builds on the following ontologies: DCAT, PROV-O, ODRL, SKOS, and Dublin Core (more on the Data Catalog to come). I built a RAG front end for the data catalog using the RAG tools in AllegroGraph. That is excellent for documents such as reports; however, there is a great deal of information in the data catalog related to provenance, governance, interfaces, and technology that is not shared with the LLM. E.g., one very useful set of properties are the PROV-O properties such as prov:wasDerivedFrom and prov:wasGeneratedBy. I made those properties (and others) sub-properties of a new transitive property called dp:is_downstream_of. That way, information about which datasets are upstream or downstream can be automatically computed by the reasoner.
A few months ago I considered using a Python NLP library like spaCy or NLTK to analyze the structure of the ontology and generate English strings that capture things like sub- and superclasses, which class an Individual is an instance of, and especially values of properties. However, I haven't had the time to learn those NLP libraries (I've used spaCy, but only a very small subset of it), so at first I tabled the idea. But then my colleague Bob Neches asked a very simple but important question: "Have you tried just doing simple pattern matching?" I hadn't, but I tried it and it worked better than I expected.
Specifically, what I did was to create a new annotation property called dp:embedding_note. The "dp:" stands for Data Product; one of my goals for the Data Catalog is to include concepts from Dehghani's Data Mesh architecture, such as a Data Product. Then I wrote two Python functions: add_embedding_notes_for_classes.py and add_embedding_notes_for_properties.py. Each function takes as input an OWL file and a text file with a list of QNames: classes for the first function and properties for the second. I didn't want to generate embedding_notes for all classes and properties; e.g., there is no need to tell the LLM that all classes are subclasses of owl:Thing.
The first function processes each class name and:
Finds the immediate subclasses of the class and uses the pattern {Class Label} "is superclass of" {Subclass Label} to add a string to the embedding_note for that class.
Finds the immediate superclasses of the class and uses the pattern {Class Label} "is subclass of" {Superclass Label} to add a string to the embedding_note for that class.
Finds the instances of the class and uses the pattern {Individual Label} "is an instance of" {Class Label}.
The second function takes each property and finds all triples with that property as the predicate.
For object properties it uses the pattern: {Subject Label} {Property Label} {Object Label}.
For datatype and annotation properties it uses the pattern {Subject Label} {Property Label} {Object Literal Value Converted to String}.
The first version created a new embedding_note annotation for each value. Then I added embedding_note to the list of properties that the AllegroGraph Create Vector Embeddings tool creates vectors for, using the OpenAI API with the text-embedding-3-large model. I tried asking the LLM some questions such as "Do any data products have governance annotations related to GDPR?" and... it didn't work at all! I set the parameter that controls how close an embedding vector has to be to the vector created for a new prompt really low, but it still didn't match. Oh well, I thought, at least it didn't take much time to develop because it was so simple.
However, then I remembered that there is a "Goldilocks Zone" for strings you want to generate vectors for. They shouldn't be too long (hence dividing corpus documents into chunks is an essential pre-processing step), and reducing documents into smaller chunks had in the past always been the issue I focused on. However, they shouldn't be too short either. In the first version of the code, each short sentence was its own embedding_note, e.g., "GDPR Risk Flag for Customer Profile Data Product was generated by Create Annotation 1001 Activity." Such short strings would have been more appropriate for models like Word2Vec, but of course I had to use a model the OpenAI embedding API supports. So I tried something different: rather than creating a separate embedding_note for each string, I collected all the strings that had a given entity as the subject into one string. The example sentence no longer stood on its own but was part of a larger note:
"GDPR Risk Flag for Customer Profile Data Product is an instance of Risk Flag Annotation.
GDPR Risk Flag for Customer Profile Data Product has status Active.
GDPR Risk Flag for Customer Profile Data Product is downstream of Create Annotation 1001 Activity.
GDPR Risk Flag for Customer Profile Data Product requires review by Security Team.
GDPR Risk Flag for Customer Profile Data Product was attributed to Security Engineer 1.
GDPR Risk Flag for Customer Profile Data Product was generated by Create Annotation 1001 Activity.
GDPR Risk Flag for Customer Profile Data Product has label GDPR Risk Flag for Customer Profile Data Product.
GDPR Risk Flag for Customer Profile Data Product has definition This data product may contain fields that qualify as personally identifiable information (PII) under GDPR. A formal privacy review is required before expanding access beyond the analytics team."
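The consolidation itself is plain Python: group the generated sentences by their subject and join each group into one note. A stdlib sketch with made-up names and sample data:

```python
from collections import defaultdict

def consolidate_notes(fact_sentences):
    """Group per-fact sentences by subject into one multi-sentence note each,
    keeping strings in the embedding model's 'Goldilocks Zone'."""
    grouped = defaultdict(list)
    for subject, sentence in fact_sentences:
        grouped[subject].append(sentence)
    return {subject: "\n".join(sentences) for subject, sentences in grouped.items()}

facts = [
    ("dp:GDPR_Risk_Flag",
     "GDPR Risk Flag for Customer Profile Data Product is an instance of Risk Flag Annotation."),
    ("dp:GDPR_Risk_Flag",
     "GDPR Risk Flag for Customer Profile Data Product has status Active."),
    ("dp:Customer_Profile",
     "Customer Profile Data Product is an instance of Data Product."),
]
notes = consolidate_notes(facts)   # one consolidated note per subject
```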
This worked amazingly well! It had another benefit too: with the one-annotation-per-sentence approach, it had become very difficult to browse the annotation property values in Protege, because there were so many embedding_notes that the other annotations were hard to find. In fact, this eventually changed the way I generate vectors. In the past I would generate vectors for properties such as skos:definition and skos:example. That was useful but still often didn't work well: unless the string was explicitly written with embedding vectors in mind, it typically would not include the name of the entity in the property value.
For example, the skos:definition for the Data_Product class was: "An organizationally owned data artifact designed for consumption, governed by explicit policies, versioned over time, and realized through one or more datasets or services." If users see this in Protege they know it describes the Data_Product class, because the value is an annotation property on that class. In the past I would often rewrite strings like this to read: "A Data Product is an organizationally owned...". But that was a pain, and there were many definitions in the ontologies I was building on, such as PROV-O and DCAT, that I didn't want to change because that would make my version of the ontology diverge from the standard. A minor deviation, but any deviation was not good. However, if I simply include skos:definition in the text file with the names of properties that get embedding_notes, I then have an embedding_note that says: "Data Product has definition An organizationally owned data artifact...".
The one other issue I had to deal with was that many of the ontologies I reused didn't always follow label standards like "has property" and "is property of". E.g., DCAT has a property called dcat:catalog with the label "catalog", but when you read the definition, what it means is that the property points from one catalog to another catalog that is part of the first. I.e., the actual label should be "has sub-catalog". Again, I didn't want to change DCAT, so I created another annotation property called embedding_label. For any entity for which I wanted to generate embedding_notes but whose label didn't communicate the semantics clearly, I simply added an embedding_label. Then I changed the functions so that they first check each entity for an embedding_label and use it if one exists, falling back to rdfs:label otherwise. In this example I added an embedding_label to the dcat:catalog property with the value "has sub-catalog". I also did this for several of my datatype and annotation properties. I typically use the "has property" and "is property of" pattern for object properties because that is a good way to make inverses obvious; for datatype and annotation properties, I typically use shorter labels as a way to signal the kind of property it is. E.g., if Address is a class, then I label the object property that relates an Agent to their address has_address. However, if an address is simply a string, then I name the datatype property simply address.
The Python code that generates the embedding_notes, along with example before-and-after ontology files and example text files with the names of classes and properties, can be found here: Python Functions to Generate Embedding Notes. There are also readme files with more detailed explanations, and the Python code has examples of how to call each function using the sample ontologies in that directory. This repository is part of a bigger project that I will be discussing more in the future. Here is a screenshot from Protege showing the generated embedding_note on a Data Product:

There are two important lessons here:
The KISS principle: Keep It Simple, Stupid! I was considering a complex approach using an NLP library, but luckily Bob asked a simple question that pointed me toward a far simpler approach that worked extremely well. I've seen this more times than I can remember in client work. We tech people just love sophisticated solutions; however, the simpler solution often works just as well and takes far less time and effort.
The power of standards. This worked especially well for me because I follow the labelling standard of writing labels such as "has property" and "is property of" for object properties. That meant I could use the functions without changing most of the ontology. The only cases where I needed to add embedding_labels were imported ontologies such as DCAT and some of my datatype and annotation properties.
Here is a thread from the RAG system for the Data Catalog (I'm using a fictitious video streaming company called StreamForge for my example). I also have a Gruff visualization of the knowledge graph below so you can see that all the information came from the structure of the ontology which was conveyed to ChatGPT via embedding_notes:
Prompt: Are there any governance annotations in our data catalog relevant to the GDPR?
Response: There is a Governance Annotation specifically relevant to the GDPR called the 'GDPR Risk Flag for Customer Profile Data Product.' This annotation is an instance of Risk Flag Annotation and has several attributes that make it crucial for GDPR compliance:
Status: Active
Downstream Activity: It is downstream of the Create Annotation 1001 Activity.
Review Requirement: Requires review by the Security Team.
Attribution: Attributed to Security Engineer 1.
Generation Activity: Generated by Create Annotation 1001 Activity.
It is defined as: "This data product may contain fields that qualify as personally identifiable information (PII) under GDPR. A formal privacy review is required before expanding access beyond the analytics team."
Prompt: Which data products is the GDPR risk flag attached to?
Response: The GDPR Risk Flag is attached to the 'Customer Profile Data Product'.
Prompt: Which data products are downstream of Customer Profile Data Product?
Response: The Customer Profile Data Product serves as an upstream data source for:
1. Recommendation Data Product: This data product is downstream of Customer Profile Data Product and uses it.
2. SF Creator Insights Data Product: This data product also utilizes Customer Profile Data Product as one of its sources.
Prompt: Who needs to review the GDPR Risk Flag?
Response: The GDPR Risk Flag for the Customer Profile Data Product requires review by the Security Team.
Prompt: Are there any members of the security team who have GDPR as a skill and whose work ratio is less than 50%?
Response: Security Engineer 2 is a member of the security team who holds the GDPR Skill and has a workload ratio of less than 50%.
All this information was collected from embedding_notes that reflect the structure of the ontology. Here is a Gruff diagram with the relevant entities and properties:

There is still more work that could be done here. For example, right now information in rules and axioms is not converted to embedding_notes. This is harder because axioms can get arbitrarily complex, and inspecting the RDF graph for axioms is much more work than anything I've done so far. In addition, while these pattern-matched strings get the job done, an NLP library would produce more readable strings (and without NLP, simple pattern matching may often fail to yield readable sentences for axioms). However, this implementation has been very useful for my current project, and in hindsight I think it would have made the previous RAG systems I built even more useful.