Article Open Access

Queryfy: from knowledge graphs to questions using open Large Language Models

Enabling finetuning by question generation on given knowledge
Felix Brei, Lars-Peter Meyer and Michael Martin
Published/Copyright: March 4, 2025

Abstract

When we look at the global knowledge graph landscape, we quickly find that there are billions of interconnected facts that have the potential to answer all kinds of questions. However, a persistent challenge lies in finding corresponding questions that align with these facts. The availability of these questions along with matching SPARQL queries is an important prerequisite for fine-tuning Large Language Models for domain-specific query generation, which is why we propose Queryfy, a novel framework that leverages Large Language Models to automate the task of deriving questions and queries from knowledge graphs, empowering users to harness their full potential.

1 Introduction

Knowledge graphs are an efficient and sensible way to store semi-structured data and have become increasingly prominent over the last years. They are integral parts of a diverse range of research areas, and many third-party funded projects have used them as a fundamental building block, for example the CoyPu project,[1] which focuses on resilience research, or MaterialDigital,[2] which covers the material sciences. Aside from these research projects, there are a number of publicly accessible knowledge bases like the Open Research Knowledge Graph[3] [1], DBpedia or Wikidata that make use of this very technology. Alas, querying these knowledge graphs directly can be daunting, as it requires proficiency with SPARQL, a query language similar to SQL but with a focus on knowledge graphs instead of tables.

Building on the work done by Banerjee et al. [2], we, among others, have already shown in [3], [4], [5] that Large Language Models (LLMs) can assist with the translation of natural language into SPARQL queries. Our experience in the field of resilience research has shown that such assistance is often sorely needed: answering a question like ‘Which of our suppliers are based in countries with ongoing armed conflicts?’ requires a multi-hop correlation of facts and leads to complex SPARQL queries. Yet this assistance is not easily obtained, because there is a lack of datasets for fine-tuning LLMs for the knowledge graphs in use there.

Custom datasets like LC-QuAD [6] were created for specific knowledge graphs (more details in Section 2), but to this day there is no way to derive a similar dataset automatically.

In this paper, we demonstrate how Queryfy can generate such a dataset from a given knowledge graph with the help of LLMs. We show which models and model sizes work best and what to look out for. We use LLMs not only to come up with meaningful questions tailored to a specific graph, but also to answer these questions and to create SPARQL queries for them. The resulting datasets can then be used as input for the fine-tuning of smaller language models, thereby closing a gap that was left in our previous paper and taking one step further towards a fully integrated pipeline from knowledge graph to specialized small-footprint LLMs.

To summarize, the contribution of this paper is two-fold:

  1. Queryfy, a novel tool that takes a knowledge graph as input and prompts a configurable list of LLMs to generate a KGQA dataset for it, saving all intermediary results and the final one in a JSON file (see Section 6 for information on where to obtain the source code). This makes the usage of knowledge graphs both easier and more resilient, since it helps avoid vendor lock-in by using publicly available and interchangeable LLMs, helps in deploying a tailored Text2SPARQL LLM to assist users in accessing their knowledge graphs, and can be re-run anytime after any kind of hardware outage thanks to its high degree of automation.

  2. An initial evaluation of the quality of the dataset that was generated from an example graph. Note that this is not part of the pipeline yet and was done by us separately.

The remainder of this paper is organized as follows: Section 2 gives an overview of the work that other researchers have already done in this field and how our research fits into the bigger picture. Section 3 describes how the models were selected and what our dataset generation pipeline actually does. Section 4 presents the results that we obtained, and Section 5 gives an overview of the directions that one can take from here.

2 Related work

Knowledge Graph Question Answering (KGQA) has been a challenge for quite some time now, and several benchmark datasets have been generated[4] [7]. Some of the recent and relevant ones are LC-QuAD [6], LC-QuAD 2.0 [8], QALD [9], SciQA [1], DBNQA [10], BESTIARY [11] and Spider4SPARQL [12]. These benchmark datasets are manually collected, translated, or generated using templates or tools like the SPARQL query generator [13]. We propose here an automated approach using LLMs.

Tackling the KGQA challenge, and especially generating SPARQL queries from text using neural approaches and LLMs, is a topic of recent research. This includes the work by Banerjee et al. [2], SparqlGen [11], COT-SPARQL [14], manual experiments on Text2SPARQL using LLMs [5], [15], our experiments on fine-tuning small LLMs for Text2SPARQL [3], and the LLM-KG-Bench framework [4], [16], [17], [18].

For some other areas, approaches for LLM-assisted training data generation already exist, e.g. SciQAG [19] for question answering.

3 Setup of the experiments

3.1 The dataset

For all experiments we used an extended version of the Organizational Graph, a small hand-crafted knowledge graph for demonstration purposes. The original graph was developed for [5] and its source is available from the corresponding GitHub repository.[5] Using Turtle syntax (ttl), it describes a fictitious organization called Wonder Org, which consists of two departments (research and marketing) and has several employees living in different countries. We expanded it by adding more employees to reinforce the organizational structure that is represented in the graph. The source can be found as org.ttl in the GitHub repository that accompanies this paper.[6] Its advantages are that it covers a large array of ttl syntax and that it is relatively small, making it an easy fit for all current context window sizes and making results easy to verify by hand.
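To give an impression of the modelling style, the following is a hypothetical Turtle excerpt in the spirit of the graph. The resource and property names are taken from the query example shown later in Table 3; the base prefix is invented here and the actual org.ttl differs in detail:

```turtle
@prefix :     <http://example.org/wonderOrg/> .
@prefix org:  <http://www.w3.org/ns/org#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

:wonderOrg a org:Organization .
:research  a org:OrganizationalUnit ;
    org:unitOf :wonderOrg .

:charles a foaf:Person ;
    foaf:firstName "Charles" ;
    foaf:surname   "Turner" .

# Membership modelled via an intermediate (blank node) resource
[] a org:Membership ;
    org:member       :charles ;
    org:organization :research ;
    org:role         :chiefResearchOfficer .
```

The blank-node membership pattern is what makes some of the generated questions non-trivial to translate into SPARQL, as discussed in Section 4.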

3.2 Model selection

Since Queryfy is supposed to run on a diverse set of hardware, each with different limitations, we divided its assessment into two runs. First, we set a limit of 10 billion parameters to analyze the performance of smaller language models. We then conducted a second run with models of up to 34B parameters, as this is the maximum that our current hardware can handle.

For each run, we used the LLM Leaderboard[7] to find the best performing models that fit within the respective range. If a model offered versions with different context window lengths, we chose the one with the longest window. We also included Microsoft’s Phi-3-Medium model as a canary in the first run because we wanted to see if and how the performance of the models improves with an increasing number of parameters and Phi-3 is one of the best performing models as of now, even in comparison to models far above the 10B parameter limit. The models that were ultimately chosen can be found in Table 1.

Table 1:

Names of the models that we selected for the first run of the pipeline, along with their parameter counts and context window sizes.

Model name No. parameters Size of context window
Phi-3-mini-128k-instruct [20] 3.82B 128k
Phi-3-medium-4k-instruct [20] 14B 4k
openchat-3.6-8b-20240522 [21] 8.03B 8k
gemma-7b-it [22] 8.54B 8k
Mistral-7B-Instruct-v0.3 [23] 7.25B 8k
mindy-7b-v2 7.24B 32k
Qwen1.5-7B-Chat [24] 7.72B 32k
Qwen2-7B-Instruct [25] 7.62B 128k
occiglot-7b-eu5-instruct 7.24B 8k
Yi-1.5-9B-Chat-16K [26] 8.83B 16k

We applied the same procedure to select models for the second run. Among these models are larger versions of the previous ones (like Gemma-2-27B), but also new ones like Yi-1.5-34B or InternLM-2.5-20B. The set of models that made it into our pipeline can be seen in Table 2.

Table 2:

Names of the models that we selected for the second run of the pipeline, along with their parameter counts and context window sizes.

Model name No. parameters Size of context window
Yi-1.5-34B-Chat-16K 34.4B 16k
gemma-2-27b-it 27.2B 8k
Internlm2.5-20b-chat [27] 19.9B 16k
Chocolatine-14B-Instruct-4k-DPO 14B 4k
blossom-v5.1-34b 34.4B 4k
Mistral-Nemo-Instruct-2407 12.2B 128k

3.3 The stages of Queryfy

Queryfy works according to the process visualized in Figure 1. It takes as input a list of model checkpoints to load from Huggingface as well as a knowledge graph in ttl format. In the five steps described in the following, it generates a JSON file as result, as explained in Figure 2.

  1. Graph2Question (G2Q): Each model is prompted to come up with five questions (this number is configurable). The ttl string of the knowledge graph is part of the prompt. The results are gathered in a list, along with metadata about which questions were generated by which model. An example generated by Qwen1.5-7b-Chat can be found in the first line of Table 3. The exact prompt we used can be found in Appendix A.1.

  2. Question2Answer (Q2A): Each model is prompted to find the answer to every single question directly from the knowledge graph, i.e. not only to the questions that it generated itself but also to those generated by all other models. The second line of Table 3 shows an example of an answer generated by Phi-3-mini-128k-instruct. See Appendix A.2 for the complete prompt for this stage.

  3. Question2SPARQL (Q2S): Iterating over all questions from the list, each model has to generate a SPARQL query via the prompt seen in Appendix A.3 that answers the question.

    An example for a SPARQL query generated by blossom-v5.1-34b can be found in the last line of Table 3.

  4. SPARQL2Answer (S2A): The graph is loaded via rdflib and all generated queries are executed against it. The results are stored as part of the resulting list, along with any errors that come up during execution. This step does not involve the use of LLMs.

  5. Output: This stage currently stores the generated questions, queries and answers, after which users have to manually filter the list based on the quality of each question and whether or not the result sets are correct. Research is being done to automate this process based on established criteria from the NLP domain.
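The flow of these stages can be sketched as follows. This is a minimal illustration, not the actual implementation: `ask(model, prompt)` stands in for whatever chat-completion call is used (e.g. a Huggingface pipeline), the prompt strings are abbreviated versions of the ones in Appendix A, and the field names are ours:

```python
import json

def run_queryfy(ask, models, ttl, n_questions=5):
    """Sketch of the Queryfy stages; ask(model, prompt) -> str stands in
    for a chat-completion call to an open LLM."""
    # G2Q: every model proposes questions based on the ttl serialization
    questions = []
    for model in models:
        reply = ask(model, f"Generate {n_questions} questions that fit the "
                           f"following knowledge graph in ttl format:\n{ttl}")
        for line in reply.splitlines():
            if line.strip():
                questions.append({"question": line.strip(), "source": model})

    # Q2A and Q2S: every model answers every question (also those that
    # other models generated) and translates it into a SPARQL query
    dataset = []
    for item in questions:
        for model in models:
            answer = ask(model, f"{ttl}\n{item['question']}\n"
                                "Answer as short as possible.")
            query = ask(model, f"{ttl}\nCreate a SPARQL query to answer: "
                               f"{item['question']}")
            dataset.append({**item, "model": model,
                            "answer": answer, "query": query})

    # S2A (executing each query against the graph via rdflib) and the
    # manual filtering of the output are omitted in this sketch.
    return json.dumps(dataset)
```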

Figure 1: 
Visualization of the Queryfy pipeline. The first three steps involve the creation of task specific prompts which are sent to a list of LLMs, using the output as the input for the next stage. S2A then executes the generated queries on the given knowledge graph, while output currently consists of formatting and saving the data for later fine-tuning (filtering will be applied at this stage in a future release).

Figure 2: 
Close-up of a single entry of the dataset that is produced by Queryfy. Each item consists of four distinct entries as seen in the image and is filtered based on the quality of the natural language question and whether or not the extracted answer matches the one returned by the SPARQL query (both were assessed manually for this paper).
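In concrete terms, one entry could look like the following JSON object. The field names are illustrative, not the exact keys used by Queryfy, and the query body is elided:

```json
{
  "question": "Which department does Charles Turner belong to?",
  "answer": "Research Department",
  "query": "SELECT ?department WHERE { ... }",
  "query_result": ["Research Department"]
}
```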

Table 3:

Examples for a generated question, answer and SPARQL query. They were generated by the models Qwen1.5-7b-Chat, Phi-3-mini-128k-instruct and blossom-v5.1-34b respectively.

Question Which department does Charles Turner belong to and what role does he have there?
Answer Department: Research Department Role: Chief Research Officer
SPARQL query SELECT ?department ?role WHERE { :charles foaf:surname "Turner" . ?membership org:member :charles . ?department org:unitOf :wonderOrg . ?membership org:organization ?department . ?role a org:Role . ?membership org:role ?role . }

Queryfy contains another optional step called generate_gpt_queries. This helps with testing, but involving GPT in production runs would jeopardize our goal of being vendor-independent and avoiding possible data leaks.

For further explanation of the prompts that were used, refer to Appendix A.

4 Results and discussions

4.1 First pipeline execution

The goal of the first run was to assess the current capabilities of models with up to 10B parameters regarding question answering and SPARQL generation over a given knowledge graph. Not all models managed to generate five questions: gemma-7b generated two, mindy-7b three, qwen2-7B three and occiglot-7b two, leaving us with 40 questions in total instead of the expected 50. Out of these 40 questions, 9 were discarded for being either too trivial (What is the first name of Charles Turner?), too under-defined (What is the foaf:role of :marketingManager?) or garbage (PREFIX: <https://abc.def/ghi/>), leaving us with 31 questions like:

  1. Which department does Charles Turner belong to and what role does he have there?

  2. Who is the chief research officer in the organization?

  3. What is the country where Francine resides, according to her home address?

The evaluation of the QA capabilities was done manually. For every question, we looked at the answers that the LLMs extracted directly from the graph. If a question referred to a person, both the full name and the resource designation were accepted as correct (both “Charles Turner” and “:charles”). For questions like Who is a member of the Research Department? we accepted mentions of single persons as well as full lists. For Who are the members of the Research Department?, on the other hand, the models had to provide all names. In any case, an answer was accepted as correct if the correct answer was part of the output that the LLM generated and there were no wrong statements within it.
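A simplified version of this acceptance rule, assuming the gold answer is available as a set of accepted surface forms (e.g. both the full name and the resource designation), could be written as:

```python
def is_accepted(llm_output, accepted_forms, wrong_statements=()):
    """Accept an answer if at least one accepted form appears in the
    LLM output and no known wrong statement does (case-insensitive).
    This mirrors the manual criterion only approximately."""
    text = llm_output.lower()
    has_answer = any(form.lower() in text for form in accepted_forms)
    has_error = any(w.lower() in text for w in wrong_statements)
    return has_answer and not has_error
```

Such a substring check is obviously cruder than a human judgment, which is one reason the filtering is still done manually.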

We can see from the results in Table 4 that most models are able to answer questions directly if a knowledge graph is given as a source of information. This is an important result for methods like Retrieval Augmented Generation (RAG), because they depend on LLMs being able to extract an answer from a given context. We can also see that models seem to have different capabilities in different areas. For example, while all questions generated by Yi-1.5 were just prefix definitions and therefore unusable, it did really well during question answering. The same can be said about openchat-3.6: all of its questions started with the string What is the full name of the person who holds the role of, which would make for a subpar dataset if this were the only LLM that we used, but it still managed to answer all questions correctly.

Table 4:

Results for the LLMs from the first run. A total of 31 questions was kept; the number of kept questions per LLM is between 0 and 5. QA designates the number of correct answers given directly from the graph. Syntax is the number of queries with syntactical errors. Empty is the number of times a query returned an empty result set, non-empty likewise. Correct counts how often a query returned the correct answer to a question.

Model name Kept QA Syntax Empty Non-empty Correct
Phi-3-mini-128k-instruct 2 30 13 17 1 1
Phi-3-medium-4k-instruct 5 28 25 5 1 0
openchat-3.6-8b-20240522 5 31 5 19 7 6
gemma-7b-it 2 23 18 13 0 0
Mistral-7B-Instruct-v0.3 5 26 17 10 4 3
mindy-7b-v2 2 27 28 3 0 0
Qwen1.5-7B-Chat 5 30 7 19 5 5
Qwen2-7B-Instruct 3 26 7 18 6 4
occiglot-7b-eu5-instruct 2 20 29 2 0 0
Yi-1.5-9B-Chat-16K 0 29 31 0 0 0

When it comes to generating SPARQL queries, though, the models struggle a lot. No model was able to generate more than six queries that produced the correct answer to a given question. This is partly due to the fact that these smaller LLMs (compared to ChatGPT, Gemini or Claude) tend to ignore commands like Do not generate any other text besides the query. The responses were very chatty, and in some cases it was not trivial to extract the raw SPARQL query from the generated text.
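Extracting the query from such chatty responses can be approximated with a small heuristic, e.g. preferring a fenced code block and falling back to the first SPARQL keyword. This is a best-effort sketch, not the exact extraction logic used in Queryfy:

```python
import re

def extract_sparql(text):
    """Best-effort extraction of a SPARQL query from an LLM response."""
    # Prefer a fenced code block, with or without a language tag
    m = re.search(r"```(?:sparql)?\s*(.*?)```", text,
                  re.DOTALL | re.IGNORECASE)
    if m:
        return m.group(1).strip()
    # Fall back to everything from the first SPARQL keyword onwards
    m = re.search(r"\b(SELECT|ASK|CONSTRUCT|PREFIX)\b.*", text,
                  re.DOTALL | re.IGNORECASE)
    return m.group(0).strip() if m else None
```

The keyword fallback can misfire on prose that happens to contain e.g. the word "ask", which illustrates why instruction-following models that emit only the query are preferable.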

Openchat-3.6 and the Qwen models produced the best results in this run. While they produced a semantically correct query only about 20 % of the time, the number of syntactic errors they made is surprisingly low (only five and seven out of 31, respectively). Qwen2-7B got the count query wrong and also failed to resolve a blank node in one query, returning the ID of the blank node instead. Mistral-7B and Openchat-3.6 also fumbled the count query.

4.2 Second pipeline execution

We kept the questions from the first run and asked the new models only to generate SPARQL queries for them. We omitted the question answering (QA) part during this run, because the smaller models had already shown very good results and we did not see much room for improvement here. Instead, we focused on the generation of SPARQL queries. The results are shown in Table 5.

Table 5:

Results for the second run, consisting of LLMs with up to 34B parameters. Syntax means the number of syntax errors, empty means the number of queries that could be executed but returned an empty result set. Non-empty likewise. Correct describes the number of queries that did indeed return the correct answer to a given question.

Model name Syntax Empty Non-empty Correct
Yi-1.5-34B-Chat-16K 26 2 3 3
gemma-2-27b-it 31 0 0 0
Internlm2.5-20b-chat 26 3 2 2
Chocolatine-14B-Instruct-4k-DPO 26 2 3 2
blossom-v5.1-34b 5 12 14 12
Mistral-Nemo-Instruct-2407 3 19 9 8

Again, no LLM managed to correctly answer the question that includes a count. However, the models were able to use CONCAT if the question asked for the full names of people, generating SPARQL snippets like:

BIND(CONCAT(?firstName,’ ’, ?surname) AS ?personName)
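Embedded in a complete query, such a snippet could look like this (an illustrative query using the FOAF vocabulary seen earlier, not one of the generated outputs):

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?personName WHERE {
  ?person foaf:firstName ?firstName ;
          foaf:surname   ?surname .
  BIND(CONCAT(?firstName, ' ', ?surname) AS ?personName)
}
```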

We can also very clearly see that increasing the number of parameters leads to an increase in performance. blossom is able to generate SPARQL queries for questions regarding membership in certain departments, showing that it is able to handle blank nodes.

5 Outlook

The dataset generation pipeline that we introduced here, and have already utilized to examine the capabilities of modern open-source LLMs, opens up a number of different directions to venture into.

Firstly, the pipeline could and should be run with even larger models to see if the SPARQL generation results can be improved further. Since the source code is freely available, anyone with a large enough cluster can continue our research and expand on our results.

Secondly, since the pipeline is supposed to generate training datasets for arbitrary knowledge graphs, a method must be devised to apply our approach to larger knowledge graphs that may no longer fit into the context window of the LLMs. Down-sampling larger knowledge graphs while keeping their structure intact is a challenge that we are actively working on. Another method we are investigating is splitting knowledge graphs into meaningful chunks, applying the pipeline introduced here to each chunk, and then merging the different datasets back into one large list that covers the graph as a whole.
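A naive sketch of such a chunking step, assuming the graph is held in memory as a list of (subject, predicate, object) triples, might group triples by subject so that local structure stays within one chunk. This is an illustration of the idea, not the method we will ultimately adopt:

```python
from collections import defaultdict

def chunk_triples(triples, max_triples_per_chunk):
    """Split (s, p, o) triples into chunks, keeping all triples that
    share a subject together so local structure stays intact."""
    by_subject = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((s, p, o))

    chunks, current = [], []
    for group in by_subject.values():
        # Start a new chunk once the size budget would be exceeded
        if current and len(current) + len(group) > max_triples_per_chunk:
            chunks.append(current)
            current = []
        current.extend(group)
    if current:
        chunks.append(current)
    return chunks
```

Grouping only by subject ignores cross-subject links (e.g. membership blank nodes pointing at both a person and a department), which is exactly why meaningful chunking is non-trivial.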

Meyer et al. [18] have shown that the data format used in prompting does have an impact on the outcome of the experiment, which is something that we have not touched on yet.

Another point to mention is that the quality of the generated questions was assessed manually. Automatic metrics should be used here, because otherwise the pipeline does not scale. Implementing these metrics and balancing them to act as a quality gate during the initial steps of the pipeline is another task that we are actively working on. Furthermore, once this step is automated, the resulting code can be fed back to LLM-KG-Bench, extending it with another task and metric.

And finally, we still have to evaluate how the resulting datasets can be used to expand on our original work in [3], which was the motivation behind this paper.

6 Online resources

The source code of the pipeline as well as the results used in this paper are available at our GitHub repository (https://github.com/AKSW/LLMDatasetGenerator) and on Zenodo (https://doi.org/10.5281/zenodo.14203176).


Corresponding author: Felix Brei, InfAI e.V., Leipzig, Germany, E-mail:

About the authors

Felix Brei

Felix Brei earned his Master of Science Degree from Leipzig University in 2019. After a brief period at the university, he joined InfAI, where he focuses on semantic web technologies and the application of large language models. His current research explores AI-assisted solutions to make the semantic web more accessible to non-experts, which is also the focus of his doctorate.

Lars-Peter Meyer

Lars-Peter Meyer, a graduate computer scientist, works as a project manager, researcher and scientific developer at the InfAI e.V. research institute in Leipzig. Since 2021, he has held leading roles in various cross-domain research projects focused on improving data organisation using Semantic Web techniques. In addition, he has been working on AI applications for many years and has been researching the integration of knowledge graphs with LLMs since 2023.

Michael Martin

Prof. Dr. Michael Martin heads the Department of Data Management at Chemnitz University of Technology. His academic career began in 2006 in the Agile Knowledge Engineering and Semantic Web (AKSW) research group, where he specialized in techniques for data conversion, data representation, as well as the efficient querying and processing of graph data. Even today, Michael remains deeply engaged in the practical application of knowledge graphs in industry. A particular focus of his current research is the identification of vulnerabilities in supply networks and production chains to systematically enhance their resilience.

Acknowledgment

This work was supported by grants from the German Federal Ministry for Economic Affairs and Climate Action (BMWK) to the CoyPu project (01MK21007A) as well as from the German Federal Ministry of Transport and Digital Infrastructure (BMDV) to the project ADA (19F2190B) and from the German Federal Ministry of Education and Research (BMBF) to the project ScaleTrust (16DTM312D).

  1. Research ethics: Not applicable.

  2. Informed consent: Not applicable.

  3. Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  4. Use of Large Language Models, AI and Machine Learning Tools: None declared.

  5. Conflict of interest: The authors state no conflict of interest.

  6. Research funding: This work was supported by grants from the German Federal Ministry for Economic Affairs and Climate Action (BMWK) to the CoyPu project (01MK21007A) as well as from the German Federal Ministry of Transport and Digital Infrastructure (BMDV) to the project ADA (19F2190B) and from the German Federal Ministry of Education and Research (BMBF) to the projects StahlDigital (13XP5116B) and ScaleTrust (16DTM312D).

  7. Data availability: The raw data can be obtained on request from the corresponding author.

  8. Software availability: The source code and raw data is available from our GitHub repository (https://github.com/AKSW/LLMDatasetGenerator) as well as Zenodo (https://doi.org/10.5281/zenodo.14203176).

Appendix A. Prompts for the individual pipeline stages

To improve readability, the turtle serialization of the knowledge graph was omitted from the following listings and a placeholder was used.

A.1 Question generation

Generate 5 questions that fit the following knowledge graph in ttl format:

<TTL Serialization of the knowledge graph>

One question per line. No additional line breaks. No enumeration.

Listing 1:

Example of a question generation prompt sent to the LLMs.

A.2 Answer extraction

You are given the following knowledge graph in ttl format:

<TTL Serialization of the knowledge graph>

How many marketing employees are there in WonderOrg?

Answer as short as possible. Give only facts, no full sentences.

Listing 2:

Example prompt that was used for direct answer extraction from the knowledge graph. This specific question was generated by blossom-v5.1-34b.

A.3 SPARQL query generation

You are given the following knowledge graph in ttl format:

<TTL Serialization of the knowledge graph>

Create a SPARQL query to answer the following question: How many marketing employees are there in WonderOrg?

Give only the query. Do not generate any other text. Wrap the query in code tags: ```

Listing 3:

Example prompt that was used for the translation of a natural language question into a SPARQL query. This specific question was generated by blossom-v5.1-34b.

References

[1] S. Auer, et al., “The SciQA scientific question answering benchmark for scholarly knowledge,” Sci. Rep., vol. 13, no. 1, p. 7240, 2023. https://doi.org/10.1038/s41598-023-33607-z.

[2] D. Banerjee, P. A. Nair, J. N. Kaur, R. Usbeck, and C. Biemann, “Modern baselines for SPARQL semantic parsing,” in Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, ACM, 2022. https://doi.org/10.1145/3477495.3531841.

[3] F. Brei, J. Frey, and L.-P. Meyer, “Leveraging small language models for Text2SPARQL tasks to improve the resilience of AI assistance,” in Proceedings of the Third International Workshop on Linked Data-driven Resilience Research 2024 (D2R2’24), co-located with ESWC 2024, volume 3707 of CEUR Workshop Proceedings, J. Holze, S. Tramp, M. Martin, S. Auer, R. Usbeck, and N. Krdzavac, Eds., 2024.

[4] L.-P. Meyer, et al., “Developing a scalable benchmark for assessing large language models in knowledge graph engineering,” in Proceedings of the Posters and Demo Track of the 19th International Conference on Semantic Systems (SEMANTICS 2023), volume 3526 of CEUR Workshop Proceedings, N. Keshan, S. Neumaier, A. L. Gentile, and S. Vahdati, Eds., CEUR-WS.org, 2023.

[5] L.-P. Meyer, et al., “LLM-assisted knowledge graph engineering: experiments with ChatGPT,” in First Working Conference on Artificial Intelligence Development for a Resilient and Sustainable Tomorrow (AITomorrow) 2023, Informatik aktuell, C. Zinke-Wehlmann and J. Friedrich, Eds., 2024, pp. 101–112.

[6] P. Trivedi, G. Maheshwari, M. Dubey, and J. Lehmann, “LC-QuAD: a corpus for complex question answering over knowledge graphs,” in The Semantic Web – ISWC 2017, C. d’Amato, M. Fernandez, V. Tamma, F. Lecue, P. Cudré-Mauroux, J. Sequeda, C. Lange, and J. Heflin, Eds., Cham, Springer International Publishing, 2017, pp. 210–218. https://doi.org/10.1007/978-3-319-68204-4_22.

[7] L. Jiang and R. Usbeck, “Knowledge graph question answering datasets and their generalizability: are they enough for future research?” in Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, New York, NY, USA, Association for Computing Machinery, 2022, pp. 3209–3218. https://doi.org/10.1145/3477495.3531751.

[8] M. Dubey, D. Banerjee, A. Abdelkawi, and J. Lehmann, “LC-QuAD 2.0: a large dataset for complex question answering over Wikidata and DBpedia,” in The Semantic Web – ISWC 2019, Cham, Springer International Publishing, 2019, pp. 69–78. https://doi.org/10.1007/978-3-030-30796-7_5.

[9] R. Usbeck, et al., “QALD-10 – the 10th challenge on question answering over linked data,” Semant. Web, vol. 16, pp. 1–15, 2023. Pre-print. https://doi.org/10.3233/sw-233471.

[10] A.-K. Hartmann, E. Marx, and T. Soru, “Generating a large dataset for neural question answering over the DBpedia knowledge base,” in Workshop on Linked Data Management, co-located with the W3C WEBBR 2018, 2018.

[11] L. Kovriguina, R. Teucher, D. Radyush, and D. Mouromtsev, “SPARQLGEN: one-shot prompt-based approach for SPARQL query generation,” in International Conference on Semantic Systems, volume 3526 of CEUR Workshop Proceedings, N. Keshan, S. Neumaier, A. L. Gentile, and S. Vahdati, Eds., CEUR-WS.org, 2023.

[12] C. Kosten, P. Cudré-Mauroux, and K. Stockinger, “Spider4SPARQL: a complex benchmark for evaluating knowledge graph question answering systems,” in 2023 IEEE International Conference on Big Data (BigData), 2023, pp. 5272–5281. https://doi.org/10.1109/BigData59044.2023.10386182.

[13] Y. Chen, M. M. Kokar, and J. J. Moskal, “SPARQL query generator (SQG),” J. Data Semant., vol. 10, no. 3, pp. 291–307, 2021. https://doi.org/10.1007/s13740-021-00133-y.

[14] H. M. Zahera, M. Ali, A. S. Mohamed, D. Moussallem, and A.-C. N. Ngomo, “Generating SPARQL from natural language using chain-of-thoughts prompting,” in Proceedings of the 20th International Conference on Semantic Systems, 17–19 September 2024, Amsterdam, The Netherlands, volume 60 of Studies on the Semantic Web, A. Salatino, M. Alam, F. Ongenae, S. Vahdati, A.-L. Gentile, T. Pellegrini, and S. Jiang, Eds., IOS Press, 2024, pp. 353–368.

[15] C. V. S. Avila, V. M. P. Vidal, W. Franco, and M. A. Casanova, “Experiments with text-to-SPARQL based on ChatGPT,” in 2024 IEEE 18th International Conference on Semantic Computing (ICSC), 2024, pp. 277–284. https://doi.org/10.1109/ICSC59802.2024.00050.

[16] J. Frey, L.-P. Meyer, N. Arndt, F. Brei, and K. Bulert, “Benchmarking the abilities of large language models for RDF knowledge graph creation and comprehension: how well do LLMs speak Turtle?” in Proceedings of the Workshop on Deep Learning for Knowledge Graphs (DL4KG 2023), co-located with the 21st International Semantic Web Conference (ISWC 2023), Athens, November 6–10, 2023, volume 3559 of CEUR Workshop Proceedings, M. Alam and M. Cochez, Eds., CEUR-WS.org, 2023.

[17] J. Frey, L.-P. Meyer, F. Brei, S. Gruender, and M. Martin, “Assessing the evolution of LLM capabilities for knowledge graph engineering in 2023,” in The Semantic Web: ESWC 2024 Satellite Events, A. Meroño Peñuela, et al., Eds., Cham, Springer Nature Switzerland, 2025, pp. 51–60. https://doi.org/10.1007/978-3-031-78952-6_5.

[18] L.-P. Meyer, J. Frey, F. Brei, and N. Arndt, “Assessing SPARQL capabilities of large language models,” in Proceedings of the 3rd International Workshop on Natural Language Processing for Knowledge Graph Creation, co-located with the 20th International Conference on Semantic Systems (SEMANTiCS 2024), vol. 3874, E. Vakaj, S. Iranmanesh, R. Stamartina, N. Mihindukulasooriya, S. Tiwari, F. Ortiz-Rodríguez, and R. Mcgranaghan, Eds., CEUR-WS.org, 2024, pp. 35–53.

[19] Y. Wan, et al., SciQAG: A Framework for Auto-Generated Science Question Answering Dataset with Fine-Grained Evaluation, arXiv, 2024.

[20] M. Abdin, et al., Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, arXiv, 2024.

[21] G. Wang, S. Cheng, X. Zhan, X. Li, S. Song, and Y. Liu, “OpenChat: advancing open-source language models with mixed-quality data,” in The Twelfth International Conference on Learning Representations, 2024.

[22] Gemini Team, Gemini: A Family of Highly Capable Multimodal Models, arXiv, 2024.

[23] A. Q. Jiang, et al., Mistral 7B, arXiv, 2023.

[24] J. Bai, et al., Qwen Technical Report, arXiv, 2023.

[25] A. Yang, et al., “Qwen2 technical report,” CoRR, abs/2407.10671, 2024.

[26] 01.AI, Yi: Open Foundation Models by 01.AI, arXiv, 2024.

[27] Z. Cai, et al., InternLM2 Technical Report, arXiv, 2024.

Received: 2024-08-31
Accepted: 2024-12-17
Published Online: 2025-03-04
Published in Print: 2025-02-25

© 2025 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
