Using SPARQL to Delete Duplicate Instances

Michael DeBellis
Apr 9, 2024
3 min read

I'm currently working on a system built around a knowledge graph of documents. I'll have more to say on this in future posts. We've been using the new AllegroGraph Neuro Symbolic AI features. These allow a user to integrate knowledge graphs with Large Language Models (LLMs) using SPARQL. We're using it to develop what's called a Retrieval Augmented Generation (RAG) system where we use ChatGPT to parse questions and create answers using our knowledge graph. The benefits of RAG are that it uses the knowledge graph to eliminate hallucinations, bias, and black-box reasoning which are limitations of just using an LLM. Stay tuned for more.

This post is a bit more mundane but I hope still interesting. I had a problem which probably comes up in other systems as well. We import documents by doing queries on sites like PubMed and then download the metadata for the documents that match the query. The issue is that since we are focusing on a specific domain it isn't uncommon that 2 or more queries will match the same document. I hadn't realized quite how common it was until I did a SPARQL query to find all the duplicate documents and saw there were hundreds! I wrote a SPARQL query that I thought would delete the duplicates but it didn't work. Or rather it worked too well and deleted both copies of the document rather than just one as I wanted. I think sometime we can learn from mistakes as well as their corrections so I'm going to walk through the bad way I started, why it was bad, and the correct way. The correct way was provided to me by Lorenz Buehmann in a response to a question I posted on the User Support for Protege mailing list. Thanks Lorenz!

So to start, I wrote a query that would find all the duplicates:

SELECT DISTINCT *

WHERE {?s1 rdfs:label ?label1.

       ?s2 rdfs:label ?label2.

FILTER(?s1 != ?s2 && ?label1 = ?label2)}

This worked but it took a long time. This was my first mistake. By using two variables for the rdfs:label I was requiring SPARQL to iterate over all the instances in the ontology twice. More or less all the permutations of the rdfs:label value of every instance compared with every other instance. This worked but it took a very long time and I had to increase the timeout on AllegroGraph's SPARQL engine. The first thing that Lorenz pointed out is a much better way to do this is simply:

SELECT DISTINCT *

WHERE {?s1 rdfs:label ?label.

       ?s2 rdfs:label ?label.}

The difference in speed between these two queries was amazing. This utilizes the graph to constrain the search and returned all the duplicates in less than a second compared to several minutes with the original query. Note: the DISTINCT in the query is so we won't see the duplicates twice. Without DISTINCT for each document: Doc1 and Doc2 that are duplicates, SPARQL will match them twice, once binding ?s1 to Doc1 and the second time binding ?s1 to Doc2.

Which brings me to my second error. I wanted to delete all the duplicates so I wrote the following query:

DELETE {?s2 ?p ?o}

WHERE {?s1 rdfs:label ?label1.

       ?s2 rdfs:label ?label2;
           ?p  ?o.

FILTER(?s1 != ?s2 && ?label1 = ?label2)}

I was thinking that when the first duplicate was deleted the second duplicate would no longer match the FILTER statement. However, that's not how SPARQL works. It first matches everything in the WHERE clause and then executes the action clause (DELETE, SELECT, INSERT, etc.). This query took forever and then deleted both copies of the duplicate. The final correct query that Lorenz gave me was:

DELETE {?s2 ?p ?o.}

WHERE {?s1 rdfs:label ?label.

       ?s2 rdfs:label ?label;

	        ?p ?o.

FILTER(STR(?s1) < STR(?s2))

This was very fast and it only deleted one duplicate. The filter converts the IRI to a string and then succeeds when the first duplicate is less than (i.e., sorts before alphabetically) the second duplicate. This still matches every combination twice but it only deletes one of them, the one where the first match is alphabetically before the second which will always happen exactly once.

This is not a general delete duplicate instances query since it doesn't delete all the triples that have this instance as an object. However, since I hadn't run the reasoner or done anymore post processing, I knew that there were no such triples yet. Deleting all the triples that have this instant as an object would be easy though. Just add the following triple to the WHERE clause as well as to the DELETE clause:

?o1  ?p1   ?s2