AI-assisted evidence screening method for systematic reviews in environmental research: integrating ChatGPT with domain knowledge
Environmental Evidence volume 14, Article number: 5 (2025)
Abstract
Conducting systematic reviews (SRs) in environmental science is challenging due to the diverse methodologies, terminologies, and study designs used across disciplines. A major limitation is that inconsistent application of eligibility criteria during evidence screening affects the reproducibility and transparency of SRs. To explore the potential role of Artificial Intelligence (AI) in applying eligibility criteria, we developed and evaluated an AI-assisted evidence-screening framework using a case study SR on the relationship between stream fecal coliform concentrations and land use and land cover (LULC). The SR incorporates publications from hydrology, ecology, public health, landscape, and urban planning, reflecting the interdisciplinary nature of environmental research. We fine-tuned the ChatGPT-3.5 Turbo model with expert-reviewed training data for title, abstract, and full-text screening of 120 articles. The AI model demonstrated substantial agreement with expert reviewers at title/abstract screening and moderate agreement at full-text screening, and it maintained internal consistency, suggesting its potential for structured screening assistance. The findings provide a structured framework for applying eligibility criteria consistently, improving evidence-screening efficiency, reducing labor and costs, and informing the integration of large language models (LLMs) into environmental SRs. Combining AI with domain knowledge offers an exploratory step toward evaluating the feasibility of AI-assisted evidence screening, especially for large volumes of diverse, interdisciplinary studies. Additionally, AI-assisted screening has the potential to provide a structured approach for managing disagreement among researchers with diverse domain knowledge, though further validation is needed.
Introduction
Environmental science research employs diverse methodologies to investigate interactions between human and natural systems. It encompasses various study designs, data types, analytical methods, spatiotemporal scales, and research contexts [1,2,3]. The field requires interdisciplinary collaboration among experts in ecology, hydrology, biology, engineering, landscape, urban planning, and social science [4]. Researchers across disciplines apply distinct methodologies and terminologies to similar research questions [4, 5], complicating synthesis. The interdisciplinary nature of environmental science makes it harder to establish consistent eligibility criteria and synthesize evidence in systematic reviews (SRs) than in experimental fields such as clinical medicine [6].
Evidence screening, the process of identifying relevant studies for inclusion, is fundamental to the rigor, transparency, and reproducibility of SRs [7,8,9]. The first step in evidence screening involves defining eligibility criteria [10]. According to Cochrane guidelines, eligibility criteria should be applied by multiple reviewers working independently and in duplicate to ensure consistency [7]. Discrepancies should be resolved through arbitration and consensus exercises at each stage to achieve agreement among reviewers [7, 11]. This process is equally essential for SRs in environmental science, which aim to answer environmental research questions through comprehensive, rigorous, transparent, and reproducible syntheses of existing studies [7, 12, 13].
However, the interdisciplinary nature of environmental science complicates the evidence-screening process [14, 15]. Variability in study designs and analytical methods makes it difficult to define consistent eligibility criteria [16]. Further, reviewers from different disciplines may interpret the same eligibility criteria differently, causing inconsistent and unreliable evidence-screening outcomes [17, 18]. Traditionally, SRs in environmental science rely on manual screening, applying predefined criteria to ensure consistency. However, manual screening is not only time-consuming and labor-intensive but also prone to human error [16], especially when an SR involves a large volume of diverse and context-specific studies.
Artificial Intelligence (AI) offers potential to streamline SR processes, particularly by automating evidence screening through machine learning and natural language processing [19,20,21]. While many AI tools can classify evidence, summarize texts, and assist in screening, they are often limited in specialized domain knowledge [19, 22, 23]. Recent advances in Large Language Models (LLMs) enhance contextual understanding, and fine-tuning with domain knowledge adapts their performance to specific research needs [24,25,26]. Integrating LLMs with domain expertise therefore has the potential to make the screening process more structured and efficient.
Our case study SR investigates the influence of land use and land cover (LULC) on stream fecal coliform contamination, providing insights into the applicability of AI-assisted screening in environmental research. Specifically, we aim to quantify the relationship between different types of LULC and the levels of fecal coliform in streams, as well as to explore methodological considerations for AI-assisted evidence screening. The study reflects key challenges in environmental science SRs, including interdisciplinary differences in terminology, methods, and data interpretation across hydrology, public health, landscape, and urban planning. Variations in spatiotemporal scales lead to different research methodologies (e.g., statistical or mechanistic models) for studying fecal coliform contamination, and heterogeneity in natural and socio-economic conditions contributes to variation in research findings, sometimes producing contradictory conclusions. Therefore, this SR provides an opportunity to assess AI-assisted literature screening in environmental research and its potential for addressing inconsistencies in applying eligibility criteria in SRs. To assess AI-assisted evidence screening, we fine-tuned the ChatGPT-3.5 Turbo model with the expertise of environmental researchers to explore two questions:
1. How does ChatGPT-3.5 Turbo perform in evidence screening?
2. How does the consistency of ChatGPT-3.5 Turbo in evidence screening compare to that of human reviewers?
Method
The research team comprised six members: three domain expert reviewers responsible for literature screening and three technical specialists supporting model development and analysis. The three expert reviewers (‘R1’, ‘R2’, and ‘R3’) led the research by using domain knowledge to define eligibility criteria and independently screen sample articles, first by title and abstract and then by full text. R1 is a Ph.D. student in environmental science. R2 and R3 are environmental scientists with expertise in land use and hydrology, respectively. The technical specialists included an expert with prior SR experience, a Ph.D. student in data science, and a statistician. They were responsible for designing the ChatGPT fine-tuning process, applying the model for evidence selection, and conducting statistical analysis of the results.
We built on ChatGPT-3.5 Turbo and fine-tuned it using screening outcomes from the expert reviewers to enhance its ability to assess studies. Zotero (version 6.0.36) and Excel were used for article management, while R (version 4.1.2) in RStudio was used for statistical analysis. The study adhered to the PRISMA 2020 protocols [11] and Cochrane Handbook guidelines [7].
Search strategy
We conducted our article search using the Scopus, Web of Science, ProQuest, and PubMed databases. The search queries combined keywords for “land use” (and synonyms), “fecal coliform” (including spelling variants and specific fecal coliform bacteria), and “stream” (and synonyms) (Table 1). These keywords were combined using “AND” to create search queries for each database (details in Appendix Table A1).
Workflow
The workflow comprised three main stages: a pre-screening stage (literature identification), followed by a two-step screening process consisting of Step 1 (title and abstract screening) and Step 2 (full-text screening) (see Fig. 1). Initially, on March 19, 2024, a total of 1,361 articles were retrieved. After removing duplicates, non-English articles, and articles without abstracts, 711 articles remained (see Appendix Table A2) and entered the screening process. In Step 1, reviewers randomly selected 130 articles (using the “sample_n()” function from the “dplyr” package in R) and determined “include” or “exclude” through group discussion. This process, repeated over four rounds, helped establish the eligibility criteria. The criteria were then translated into a ChatGPT prompt as domain knowledge. The articles reviewed by human reviewers were used as a training dataset for fine-tuning ChatGPT-3.5 Turbo, which then screened the remaining articles. Step 2 followed a similar process but without additional model fine-tuning.
Literature screening
In Step 1 (title & abstract screening), three reviewers independently assessed 130 randomly selected articles to evaluate relevance, resolve discrepancies, and refine eligibility criteria. After four rounds of iterative discussion, reviewers established consensus-based final versions of the eligibility criteria (Table 2; the criteria’s four versions in Appendix Table A3) and created a binary-labeled dataset (i.e., “Yes” or “No” for relevance) for article inclusion (see Appendix Table A4).
Subsequently, ChatGPT-3.5 Turbo was fine-tuned for our research question. The final versions of the eligibility criteria were translated into a prompt (see Appendix A), and the binary-labeled set of 130 articles was split into a 70-article training set (randomly selecting 35 “Yes” and 35 “No” articles), a 20-article validation set, and a 40-article test set (see Appendix Table A4).
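For illustration, a minimal Python sketch of this packaging step is shown below; it is not the authors’ exact pipeline. It assumes a hypothetical CSV of expert-labeled articles (training_set.csv with title, abstract, and label columns) and a saved criteria prompt (criteria_prompt.txt), and it writes the JSONL chat format that OpenAI fine-tuning for gpt-3.5-turbo expects.

```python
import csv
import json

# Hypothetical inputs: the eligibility-criteria prompt (cf. Appendix A) and a CSV of
# expert-labeled training articles with "title", "abstract", and "label" columns.
CRITERIA_PROMPT = open("criteria_prompt.txt", encoding="utf-8").read()

def to_example(title: str, abstract: str, label: str) -> dict:
    """One fine-tuning example: criteria as the system message, the article as the
    user message, and the expert decision ("Yes"/"No") as the assistant message."""
    return {
        "messages": [
            {"role": "system", "content": CRITERIA_PROMPT},
            {"role": "user", "content": f"Title: {title}\nAbstract: {abstract}\nShould this article be included?"},
            {"role": "assistant", "content": label},
        ]
    }

with open("training_set.csv", newline="", encoding="utf-8") as f_in, \
        open("train.jsonl", "w", encoding="utf-8") as f_out:
    for row in csv.DictReader(f_in):
        f_out.write(json.dumps(to_example(row["title"], row["abstract"], row["label"])) + "\n")
```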
We applied a light fine-tuning process to the existing ChatGPT-3.5 Turbo model, adjusting key hyperparameters (specifically, epochs, batch size, learning rate, temperature, and top_p) to optimize performance on a specific dataset [27]. Epochs determine the number of passes through the data, where too few can lead to underfitting and too many to overfitting [28]. Batch size controls how many examples are processed before updates, with larger sizes speeding up training but requiring more memory, while smaller sizes add randomness and help escape local minima [29]. Learning rate dictates the step size for weight updates, where a high rate may lead to suboptimal solutions or divergence, and a low rate can slow down training [30]. Temperature controls the randomness of the model’s response [31]. Top_p sampling selects tokens based on cumulative probability, with lower values producing more focused outputs and higher values increasing variability [32].
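As a concrete illustration of how these settings can be passed to the API, the sketch below launches a fine-tuning job through the OpenAI Python SDK; the file names are hypothetical, and the epoch, batch-size, and learning-rate values are those reported in the Results. Note that temperature and top_p are not fine-tuning hyperparameters; they are applied later at inference time.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the (hypothetical) training and validation JSONL files.
train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
valid_file = client.files.create(file=open("validation.jsonl", "rb"), purpose="fine-tune")

# Launch the fine-tuning job with the hyperparameters reported in the Results
# (3 epochs, batch size 2, learning-rate multiplier 0.2).
job = client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo",
    training_file=train_file.id,
    validation_file=valid_file.id,
    hyperparameters={"n_epochs": 3, "batch_size": 2, "learning_rate_multiplier": 0.2},
)
print(job.id)  # the fine-tuned model ID becomes available once the job finishes
```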
After fine-tuning, we accounted for the model’s stochastic nature [33] by performing 15 runs per article and using the majority result as the final output (i.e., if a majority of the 15 runs returned “Yes,” the answer was “Yes”; otherwise, it was “No”). The fine-tuned model was then used to screen the 581-article task set (see Appendix Table A5) with a temperature setting of 0.4 and a top_p setting of 0.8. We evaluated the model’s performance using Cohen’s Kappa and Fleiss’s Kappa statistics on a 40-article test set (see Appendix Table A6). Cohen’s Kappa measures agreement between two raters [34], while Fleiss’s Kappa extends this to multiple raters [35].
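A minimal sketch of this repeated-run majority vote and the two agreement statistics is given below, assuming the OpenAI Python SDK, scikit-learn, and statsmodels; the fine-tuned model ID is a placeholder and the helper functions are illustrative rather than the authors’ code.

```python
from collections import Counter
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

client = OpenAI()
MODEL_ID = "ft:gpt-3.5-turbo-0125:example-org::abc123"  # placeholder fine-tuned model ID

def screen_once(criteria_prompt: str, article_text: str) -> str:
    """Single screening call with the inference settings used in the study (temperature 0.4, top_p 0.8)."""
    resp = client.chat.completions.create(
        model=MODEL_ID,
        temperature=0.4,
        top_p=0.8,
        messages=[
            {"role": "system", "content": criteria_prompt},
            {"role": "user", "content": article_text},
        ],
    )
    return "Yes" if resp.choices[0].message.content.strip().lower().startswith("yes") else "No"

def screen_majority(criteria_prompt: str, article_text: str, n_runs: int = 15) -> tuple[str, list[str]]:
    """Repeat the call n_runs times and take the majority label as the final decision."""
    votes = [screen_once(criteria_prompt, article_text) for _ in range(n_runs)]
    return Counter(votes).most_common(1)[0][0], votes

def evaluate(consensus: list[str], final_labels: list[str], run_matrix: list[list[str]]) -> tuple[float, float]:
    """Cohen's kappa vs. the expert consensus and Fleiss's kappa across the repeated runs."""
    cohen = cohen_kappa_score(consensus, final_labels)
    counts, _ = aggregate_raters(run_matrix)  # run_matrix: one row per article, one column per run
    return cohen, fleiss_kappa(counts)
```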
In Step 2, the 384 articles that passed screening in Step 1 and had full-text availability underwent full-text screening (see Appendix Tables A5 and A6). Because Step 2 involves full-text screening, focusing on the results and discussion sections for more comprehensive information, we updated our criteria prompt accordingly (see Appendix B). Three reviewers independently evaluated 45 randomly selected articles for inclusion based on the updated eligibility criteria through three rounds of an iterative process (Table 3; the criteria’s three versions are in Appendix Table A7). This process produced a binary-labeled dataset (see Appendix Table A8), which served as the test set for the full-text screening model; this model used the same fine-tuned ChatGPT model and the same temperature and top_p settings as in Step 1 but with the updated criteria prompt. The model was then used to screen the remaining 339 articles (see Appendix Table A5). Finally, 120 articles passed screening in Step 2 (see Appendix Tables A9 and A10). We also evaluated the model’s performance on the 45-article test set (see Appendix Table A10).
ROI analysis
We analyzed the Return on Investment (ROI) of AI-assisted versus manual screening by comparing costs and time savings. ROI was assessed by comparing costs and time for manually reviewing versus using ChatGPT for task set articles in Steps 1 and 2, respectively. In manual screening, reviewers screened each article in ~ 3 min (Step 1) and ~ 5 min (Step 2), with two 1-hour discussions per step to resolve disagreements (2 h total per step). The AI-assisted method used ChatGPT, with a reviewer supervising results and refining prompts, and a computer science expert fine-tuning the model. We tracked ChatGPT’s token usage, subscription fees, and labor costs.
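The bookkeeping behind this comparison can be sketched as follows. The hourly labor rate and the AI-side cost split are assumptions for illustration (they are not reported per hour in the paper), so the dollar amounts will not exactly reproduce the totals reported in the Results.

```python
def manual_screening(n_articles: int, minutes_per_article: float,
                     discussion_hours: float, hourly_rate: float) -> tuple[float, float]:
    """Person-hours and labor cost for manually screening one step."""
    hours = n_articles * minutes_per_article / 60 + discussion_hours
    return hours, hours * hourly_rate

def ai_screening(supervision_hours: float, token_fee: float,
                 subscription_fee: float, hourly_rate: float) -> tuple[float, float]:
    """Person-hours and cost for AI-assisted screening of one step."""
    return supervision_hours, supervision_hours * hourly_rate + token_fee + subscription_fee

# Illustrative Step 1 comparison: article count, per-article minutes, and discussion
# hours follow the Results; the $15/h rate and the AI-side labor split are assumptions.
manual_h, manual_usd = manual_screening(n_articles=581, minutes_per_article=3,
                                        discussion_hours=6, hourly_rate=15)
ai_h, ai_usd = ai_screening(supervision_hours=7.5, token_fee=55,
                            subscription_fee=20, hourly_rate=15)
print(f"Manual: {manual_h:.2f} h, ${manual_usd:.2f}; AI-assisted: {ai_h:.2f} h, ${ai_usd:.2f}")
print(f"Hours saved: {manual_h - ai_h:.2f}; cost difference: ${manual_usd - ai_usd:.2f}")
```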
Results
ChatGPT-3.5 Turbo fine-tuning
We selected a small batch size of 2 to update the model at a low frequency and a learning rate of 0.2 to minimize overfitting given the limited dataset size in the fine-tuning process. The model was trained for 3 epochs, meaning the dataset was processed in three full cycles. While there are no standard configurations for GPT fine-tuning, we selected these hyperparameters to balance domain-knowledge learning and model generalization.
During the fine-tuning process (Fig. 2), the training loss (blue line) showed a consistent decline, indicating effective learning. The validation loss (red ‘X’ marks) initially dropped but became unstable at around step 90, suggesting potential overfitting. We saved model checkpoints at steps 35, 70, and 105 (green, orange, and purple dashed lines) and selected Checkpoint 2 at step 70 (orange line) for the SR task because of its balance of training and validation performance. In summary, our fine-tuning approach balanced consistent training progress with generalization, with Checkpoint 2 providing the best performance for the SR task.
Evaluation of ChatGPT-3.5 Turbo’s agreement with human reviewers and internal consistency
The fine-tuned ChatGPT-3.5 Turbo’s predictions aligned closely with the human reviewers’ consensus conclusions, demonstrating substantial agreement (Cohen’s Kappa = 0.79) in Step 1 and moderate agreement (Cohen’s Kappa = 0.61) in Step 2 (Tables 4 and 5).
The model also demonstrated substantial internal consistency across the 15 runs, with Fleiss’s Kappa of 0.81 and 0.78 in Steps 1 and 2, respectively. Over 90% of articles received consistent answers in at least 10 of the 15 runs for both steps (Fig. 3a; the run results for screening articles in Steps 1 and 2 are in Appendix Tables A5 and A9). In the test set, over 75% of articles were correctly predicted in at least 10 runs (Fig. 3b; the run results for the test set in Steps 1 and 2 are in Appendix Tables A6 and A10).
We tested 16 parameter pairs by varying temperature and top_p (0.2–0.8 in 0.2 increments) using the test datasets from Steps 1 and 2. Most combinations produced high and consistent kappa values with minimal performance variation (Fig. 4; see Appendix Tables A11 and A12). ChatGPT demonstrated stable performance across a wide parameter range, suggesting strong internal consistency.
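A sketch of such a grid sweep is shown below; the fine-tuned model ID, the test-set variables, and the helper function are placeholders, and each (temperature, top_p) pair is scored against the expert consensus with Cohen’s kappa.

```python
import itertools
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()
MODEL_ID = "ft:gpt-3.5-turbo-0125:example-org::abc123"  # placeholder fine-tuned model ID

def screen(criteria_prompt: str, article_text: str, temperature: float, top_p: float) -> str:
    """One screening call at the given sampling settings."""
    resp = client.chat.completions.create(
        model=MODEL_ID,
        temperature=temperature,
        top_p=top_p,
        messages=[
            {"role": "system", "content": criteria_prompt},
            {"role": "user", "content": article_text},
        ],
    )
    return "Yes" if resp.choices[0].message.content.strip().lower().startswith("yes") else "No"

def parameter_sweep(criteria_prompt: str, test_articles: list[str], consensus: list[str]) -> dict:
    """Cohen's kappa against the expert consensus for each of the 16 (temperature, top_p) pairs."""
    grid = [0.2, 0.4, 0.6, 0.8]
    kappas = {}
    for temperature, top_p in itertools.product(grid, grid):
        predictions = [screen(criteria_prompt, a, temperature, top_p) for a in test_articles]
        kappas[(temperature, top_p)] = cohen_kappa_score(consensus, predictions)
    return kappas
```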
Comparison of ChatGPT-3.5 Turbo and human reviewers in evidence screening: variability and agreement
Overall, ChatGPT-3.5 Turbo’s performance remained comparable to that of the human reviewers in both steps. We evaluated the agreement of both individual reviewers and ChatGPT-3.5 Turbo against the consensus conclusions using the test sets in Steps 1 and 2 (Fig. 5; see Appendix Tables A6 and A10). Human reviewers’ performance varied considerably across individuals and the two steps. Specifically, in Step 1, Reviewer ‘R2’ performed better (Cohen’s Kappa = 0.90) than ‘R1’, ‘R3’, and ChatGPT; ChatGPT’s Kappa was 0.79, while the Kappa scores of ‘R1’ and ‘R3’ were lower, at approximately 0.60 and 0.59, respectively. In Step 2, however, the Kappa of ‘R2’ dropped to 0.72, though it still exceeded that of the other reviewers and ChatGPT. At this stage, ChatGPT’s Kappa was 0.61, closely matching ‘R1’ (0.58) and slightly below ‘R3’ (0.69).
Compared with the reviewers, ChatGPT demonstrated more stable performance, with Cohen’s Kappa scores ranging from 0.53 to 0.84 in Step 1 and from 0.46 to 0.66 in Step 2 (Fig. 6a; details in Appendix Table A13). In contrast, the Cohen’s Kappa scores of the three human reviewers ranged from 0.24 to 1.00 in Step 1 and from 0.35 to 0.84 in Step 2 (Fig. 6b). Specifically, Reviewer ‘R2’ started with a perfect Kappa score of 1.00 in Step 1 but had a lower score of 0.60 in Step 2, while reviewers ‘R1’ and ‘R3’ showed variable improvements (see Appendix Table A14). The relatively stable performance of GPT, compared with the fluctuating performance of the human reviewers, suggests that GPT can potentially produce more reliable evidence screening in SRs.
Our findings also highlight persistent variability in reviewer agreement, indicating that inter-reviewer agreement did not stabilize consistently over time. Although reviewer agreement improved overall, with Fleiss’s Kappa increasing from 0.44 in Step 1 to 0.80 in Step 2, considerable variability remained across all rounds (see Appendix Table A15). Cohen’s Kappa scores ranged from 0.22 to 0.73 across reviewer pairs (see Appendix Table A16). Additionally, the strong performance of ‘R2’ may reflect stronger domain knowledge of this topic, which influenced the review process disproportionately.
ROI analysis between human reviewers and ChatGPT-3.5 Turbo
We compared costs and labor hours for manual and AI-assisted screening (Fig. 7). In the manual approach, Step 1 required screening 581 articles (3 min each, 29.1 h) plus two one-hour group discussions with three reviewers (6 h), totaling 35.1 h and $526. Step 2 involved screening 339 articles (5 min each, 28.3 h) with two group discussions (6 h), totaling 34.3 h and $515. Overall, the manual method required 69.4 h and $1,041. In the AI-assisted workflow, Step 1 required 5 h for prompt refinement, 2 h for model fine-tuning, and 2.5 h for the ChatGPT screening process; costs included a $55 token fee and a $20 membership fee, totaling 7.5 h and $150. Step 2 required 1 h for prompt refinement and 2.5 h for screening, and token costs increased to $700. Overall, the AI-assisted method required 11 h and $925.
The ROI analysis indicates that AI-assisted screening enhances efficiency. AI reduced screening time per article from 4.5 min to 0.55 min, an 8× improvement and an 87.8% time saving. With a time-based ROI of 7.16, each AI-assisted hour saved over 7 manual hours. AI also increased screening throughput from 13 to 108 articles per hour, saved $0.11 per article, and reduced screening costs by 10%. The overall ROI was 10.7%, indicating that AI-assisted screening provided a net gain compared with manual screening.
Discussion
Potential role of ChatGPT-3.5 Turbo in structuring evidence screening for systematic reviews
This case study demonstrates how integrating ChatGPT-3.5 Turbo into the evidence-screening process provides a framework for AI-assisted SRs in environmental science (see Fig. 8). The AI model exhibited moderate to substantial agreement with reviewers, as reflected by Cohen’s Kappa scores, indicating its potential as a screening aid. However, while AI contributes to consistency in the screening process, we note that consistency does not inherently translate into greater accuracy. The overall reliability of SRs depends on human-defined criteria, validation protocols, and model fine-tuning decisions.
Our results indicate that fine-tuning ChatGPT-3.5 Turbo with domain-specific knowledge enables consistent application of eligibility criteria, reducing variability in screening decisions. Human reviewers’ judgments showed inconsistencies due to varying interpretations of eligibility criteria [36]. Differences in disciplinary perspectives may lead to variability in how eligibility criteria are applied, which can influence screening outcomes in SRs [37]. For instance, variations in defining LULC and key concepts such as “direct relationship” and “statistical relationship” with fecal coliform contamination create screening discrepancies. However, AI-assisted screening may reduce discrepancies in applying criteria, though its impact on bias reduction requires further investigation. In our case, Reviewer ‘R2’ exhibited higher agreement with the consensus than ‘R1’ and ‘R3’, indicating that individual expertise influenced screening decisions. This finding suggests that one expert’s interpretation of the eligibility criteria may outweigh others’, affecting screening outcomes and introducing potential bias [16, 36], and reinforcing the need for consistent application of eligibility criteria, which AI-assisted automation of repetitive screening can help achieve [38, 39].
Necessity of integrating domain knowledge with AI in environmental research
Several critical issues for defining eligibility criteria emerged in this case study SR, underscoring the importance of integrating human effort with ChatGPT-3.5 Turbo. First, the diverse classifications of LULC types across studies led to varying interpretations of LULC criteria (i.e., Criteria 1.1, 1.2, and 2.1). This issue was exacerbated by the reviewers’ different backgrounds and perspectives. For instance, some reviewers considered riparian zones as a type of land cover, while others did not define them as LULC. Reviewers’ opinions also varied on whether landscape ecology metrics (e.g., the number of patches, shape index) and best management practices related to land use meet the LULC criteria.
Second, the criteria of defining a “direct relationship” between LULC and fecal coliform contamination (Criteria 1.4) was also challenging for human reviewers to achieve consistency due to variations in research designs. For example, some studies focused on water quality index or ecological indicators like macroinvertebrates, where fecal coliform was not a direct indicator but rather a component of overall water quality assessment. Some research analyzed pollution sources associated with various LULC types without directly examining the impact of LULC on fecal coliform concentrations. Reviewers had different opinions on whether to include literature in these situations.
Third, defining a “statistical relationship” between LULC and fecal coliform contamination (Criteria 2.3) posed challenges in full-text screening. Some studies compared fecal coliform concentrations across different LULC types without quantifying LULC percentages or areas. Some studies summarized fecal coliform levels and LULC characteristics at individual sites but did not assess the quantitative relationship. Expert reviewers differed in judgments on whether these approaches can reflect a “statistical relationship” between LULC and fecal coliform contamination.
All three issues required consensus among all expert reviewers on how to define the eligibility criteria, and the discrepancies could only be resolved through iterative discussions and arbitration. Through this process, reviewers refined the eligibility criteria, which were then integrated into ChatGPT’s screening framework. Additionally, fine-tuning improved ChatGPT’s alignment with domain-specific eligibility criteria, reinforcing its ability to apply criteria consistently. Consequently, compared with the agreement between individual reviewers, ChatGPT-3.5 Turbo demonstrated a stronger alignment with the overall reviewer consensus, suggesting its potential as a supplementary screening tool.
Limitations
Several limitations exist in our methodology. First, fine-tuning with human expertise enhances ChatGPT’s alignment with eligibility criteria but narrows the model’s versatility, making it less effective for broader applications [40]. Other environmental science SRs would need to retrain or adapt the model using their own eligibility criteria to ensure relevance. Second, model performance depends on training dataset quality [25], which may not fully capture the complexity of the relevant literature, especially with a limited sample size. Also, the inherent stochastic nature of ChatGPT may introduce variability across multiple runs [33], affecting consistency. Third, the model is constrained to text-based data and cannot effectively process non-textual inputs [33]. Overall, while the ChatGPT model demonstrates significant potential in structuring evidence screening, careful consideration of model training and validation is essential to ensure consistent performance and generalizability.
Conclusion and future application in environmental science
This study presents an evidence-screening framework for SRs by integrating fine-tuned ChatGPT-3.5 Turbo with human expertise, though further validation is required for this exploratory step. AI-assisted methods can apply eligibility criteria consistently, improve efficiency, reduce labor demands, and manage disagreements in evidence screening for environmental SRs. To enhance AI-assisted screening across diverse environmental research domains, the following considerations should be addressed: (1) Improving training data quality and refining criteria with deeper domain expertise, (2) Advancing AI techniques to incorporate image data for more precise screening, and (3) Creating specialized language models tailored to the diverse methodologies and knowledge bases across different environmental science subfields. With continued collaboration between humans and AI, these advancements will enhance the model’s adaptability and effectiveness in SRs.
Data availability
The fine-tuned ChatGPT-3.5 Turbo model and raw data are available on GitHub (https://github.com/billbillbilly/GPT_Pytools.git).
References
Liu J, Dietz T, Carpenter SR, Alberti M, Folke C, Moran E, Pell AN, Deadman P, Kratz T, Lubchenco J, Ostrom E, Ouyang Z, Provencher W, Redman CL, Schneider SH, Taylor WW. Complexity of coupled human and natural systems. Science. 2007;317(5844):1513–6. https://doi.org/10.1126/science.1144004.
Rasmussen K, Arler F. Interdisciplinarity at the human-environment interface. Geografisk Tidsskrift-Danish J Geogr. 2010;110(1):37–45. https://doi.org/10.1080/00167223.2010.10669495.
Roy ED, Morzillo AT, Seijo F, Reddy SMW, Rhemtulla JM, Milder JC, Kuemmerle T, Martin SL. The elusive pursuit of interdisciplinarity at the human-environment interface. Bioscience. 2013;63(9):745–53. https://doi.org/10.1525/bio.2013.63.9.10.
Fortuin KPJ, van Koppen CSA, Leemans R. The value of conceptual models in coping with complexity and interdisciplinarity in environmental sciences education. Bioscience. 2011;61(10):802–14. https://doi.org/10.1525/bio.2011.61.10.10.
Chapman RL. How to think about environmental studies. J Philos Educ. 2007;41(1):59–74. https://doi.org/10.1111/j.1467-9752.2007.00544.x.
Haddaway NR, Macura B, Whaley P, Pullin AS. ROSES reporting standards for systematic evidence syntheses: pro forma, flow-diagram and descriptive summary of the plan and conduct of environmental systematic reviews and systematic maps. Environ Evid. 2018;7(1):7. https://doi.org/10.1186/s13750-018-0121-7.
Higgins JP. Cochrane handbook for systematic reviews of interventions. 1st ed. John Wiley & Sons, Ltd; 2008. https://doi.org/10.1002/9780470712184.
Khangura S, Konnyu K, Cushman R, Grimshaw J, Moher D. Evidence summaries: the evolution of a rapid review approach. Syst Rev. 2012;1(1):10. https://doi.org/10.1186/2046-4053-1-10.
Masic I, Miokovic M, Muhamedagic B. Evidence based medicine – new approaches and challenges. Acta Inform Med. 2008;16(4):219–25. https://doi.org/10.5455/aim.2008.16.219-225.
Frampton GK, Livoreil B, Petrokofsky G. Eligibility screening in evidence synthesis of environmental management topics. Environ Evid. 2017;6(1):27. https://doi.org/10.1186/s13750-017-0102-2.
Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE, Chou R, Glanville J, Grimshaw JM, Hróbjartsson A, Lalu MM, Li T, Loder EW, Mayo-Wilson E, McDonald S, Moher D. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. https://doi.org/10.1136/bmj.n71.
Tawfik GM, Dila KAS, Mohamed MYF, Tam DNH, Kien ND, Ahmed AM, Huy NT. A step by step guide for conducting a systematic review and meta-analysis with simulation data. Trop Med Health. 2019;47(1):46. https://doi.org/10.1186/s41182-019-0165-6.
Aromataris E, Pearson A. The systematic review: an overview. AJN Am J Nurs. 2014;114(3):53. https://doi.org/10.1097/01.NAJ.0000444496.24228.2c.
Berrang-Ford L, Pearce T, Ford JD. Systematic review approaches for climate change adaptation research. Reg Environ Chang. 2015;15(5):755–69. https://doi.org/10.1007/s10113-014-0708-7.
Wei CA, Burnside WR, Che-Castaldo JP. Teaching socio-environmental synthesis with the case studies approach. J Environ Stud Sci. 2015;5(1):42–9. https://doi.org/10.1007/s13412-014-0204-x.
Wang Z, Nayfeh T, Tetzlaff J, O’Blenis P, Murad MH. Error rates of human reviewers during abstract screening in systematic reviews. PLoS ONE. 2020;15(1):e0227742. https://doi.org/10.1371/journal.pone.0227742.
James KL, Randall NP, Haddaway NR. A methodology for systematic mapping in environmental sciences. Environ Evid. 2016;5(1):7. https://doi.org/10.1186/s13750-016-0059-6.
Whiting P, Savović J, Higgins JPT, Caldwell DM, Reeves BC, Shea B, Davies P, Kleijnen J, Churchill R. ROBIS: a new tool to assess risk of bias in systematic reviews was developed. J Clin Epidemiol. 2016;69:225–34. https://doi.org/10.1016/j.jclinepi.2015.06.005.
Giummarra MJ, Lau G, Grant G, Gabbe BJ. A systematic review of the association between fault or blame-related attributions and procedures after transport injury and health and work-related outcomes. Accid Anal Prev. 2020;135:105333. https://doi.org/10.1016/j.aap.2019.105333.
Goldkuhle M, Dimaki M, Gartlehner G, Monsef I, Dahm P, Glossmann J-P, Engert A, von Tresckow B, Skoetz N. Nivolumab for adults with Hodgkin’s lymphoma (a rapid review using the software RobotReviewer). Cochrane Database Syst Rev. 2018. https://doi.org/10.1002/14651858.CD012556.pub2.
Lam J, Howard BE, Thayer K, Shah RR. Low-calorie sweeteners and health outcomes: a demonstration of rapid evidence mapping (rEM). Environ Int. 2019;123:451–8. https://doi.org/10.1016/j.envint.2018.11.070.
Pinna F, Manchia M, Paribello P, Carpiniello B. The impact of alexithymia on treatment response in psychiatric disorders: a systematic review. Front Psychiatry. 2020;11. https://doi.org/10.3389/fpsyt.2020.00311.
Viner R, Russell S, Saulle R, Croker H, Stansfeld C, Packer J, Nicholls D, Goddings A-L, Bonell C, Hudson L, Hope S, Schwalbe N, Morgan A, Minozzi S. Impacts of school closures on physical and mental health of children and young people: a systematic review. medRxiv. 2021:2021.02.10.21251526. https://doi.org/10.1101/2021.02.10.21251526.
Mathav Raj J, Kushala VM, Warrier H, Gupta Y. Fine tuning LLM for enterprise: practical guidelines and recommendations. arXiv; 2024. arXiv:2404.10779. http://arxiv.org/abs/2404.10779.
Liu J, Yang M, Yu Y, Xu H, Li K, Zhou X. Large language models in bioinformatics: applications and perspectives. arXiv; 2024. arXiv:2401.04155. https://doi.org/10.48550/arXiv.2401.04155.
Taubenfeld A, Dover Y, Reichart R, Goldstein A. Systematic biases in LLM simulations of debates. arXiv; 2024. arXiv:2402.04049. http://arxiv.org/abs/2402.04049.
OpenAI Platform. (n.d.). Retrieved September 23, 2024, from https://platform.openai.com.
Goodfellow I, Bengio Y, Courville A. Deep learning. MIT Press; 2016. https://mitpress.mit.edu/9780262035613/deep-learning/.
Masters D, Luschi C. Revisiting small batch training for deep neural networks. arXiv; 2018. arXiv:1804.07612. https://doi.org/10.48550/arXiv.1804.07612.
Smith LN. Cyclical learning rates for training neural networks. arXiv; 2017. arXiv:1506.01186. https://doi.org/10.48550/arXiv.1506.01186.
Akamine A, Hayashi D, Tomizawa A, Nagasaki Y, Akamine C, Fukawa T, Hirosawa I, Saigo O, Hayashi M, Nanaoya M, Odate Y. Effects of temperature settings on information quality of ChatGPT-3.5 responses: a prospective, single-blind, observational cohort study. medRxiv. 2024:2024.06.11.24308759. https://doi.org/10.1101/2024.06.11.24308759.
Holtzman A, Buys J, Du L, Forbes M, Choi Y. The curious case of neural text degeneration. arXiv; 2020. arXiv:1904.09751. https://doi.org/10.48550/arXiv.1904.09751.
OpenAI, Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, et al. GPT-4 technical report. arXiv; 2024. arXiv:2303.08774. https://doi.org/10.48550/arXiv.2303.08774.
Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20(1):37–46. https://doi.org/10.1177/001316446002000104.
Fleiss JL. Measuring agreement between two judges on the presence or absence of a trait. Biometrics. 1975;31(3):651–9. https://doi.org/10.2307/2529549.
McDonagh M, Peterson K, Raina P, Chang S, Shekelle P. Avoiding bias in selecting studies. In: Norris BL, Carey MJ, Sanders AC, Chang GH, editors. Methods guide for effectiveness and comparative effectiveness reviews (AHRQ Publication No. 10(13)-EHC063-EF). Agency for Healthcare Research and Quality (US); 2008. http://www.ncbi.nlm.nih.gov/books/NBK126701/.
Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JPA, Clarke M, Devereaux PJ, Kleijnen J, Moher D. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate healthcare interventions: explanation and elaboration. BMJ (Clinical Res Ed). 2009;339:b2700. https://doi.org/10.1136/bmj.b2700.
Syriani E, David I, Kumar G. Screening articles for systematic reviews with ChatGPT. J Comput Lang. 2024;80:101287. https://doi.org/10.1016/j.cola.2024.101287.
Guimarães NS, Joviano-Santos JV, Reis MG, Chaves RRM, Observatory of Epidemiology, Nutrition and Health Research (OPENS). Development of search strategies for systematic reviews in health using ChatGPT: a critical analysis. J Transl Med. 2024;22(1):1. https://doi.org/10.1186/s12967-023-04371-5.
Ziegler DM, Stiennon N, Wu J, Brown TB, Radford A, Amodei D, Christiano P, Irving G. Fine-tuning language models from human preferences. arXiv; 2020. arXiv:1909.08593. https://doi.org/10.48550/arXiv.1909.08593.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Author information
Authors and Affiliations
Contributions
Chen Zuo: Conceptualization; data curation; formal analysis; methodology; project administration; validation; visualization; writing - original draft; writing - review and editing. Xiaohao Yang: data curation; fine-tuning model; formal analysis; validation; writing - review and editing. Josh Errickson: Conceptualization; methodology; writing - review and editing. Jiayang Li: Conceptualization; methodology; writing - review and editing. Yi Hong: Conceptualization; data curation; formal analysis; methodology; writing - review and editing. Runzi Wang: Conceptualization; data curation; methodology; supervision; writing - review and editing.
Corresponding author
Ethics declarations
Ethics and consent to participate declarations
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Below is the link to the electronic supplementary material.
13750_2025_358_MOESM1_ESM.docx
Supplementary Material 1: Text appendices provide the title & abstract screening prompt and the full-text screening prompt. Tables show the search queries for each database; the article identification results; the article screening results of Steps 1 and 2; the different versions of the criteria in Steps 1 and 2; the screening results of ChatGPT-3.5 Turbo in Steps 1 and 2; the review results of the human reviewers in Steps 1 and 2; the human reviewers’ and ChatGPT’s screening results for the test-set articles in Steps 1 and 2; the Cohen’s Kappa scores of ChatGPT-3.5 Turbo in Steps 1 and 2; and the Cohen’s Kappa scores of the human reviewers in Steps 1 and 2.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Zuo, C., Yang, X., Errickson, J. et al. AI-assisted evidence screening method for systematic reviews in environmental research: integrating ChatGPT with domain knowledge. Environ Evid 14, 5 (2025). https://doi.org/10.1186/s13750-025-00358-5
DOI: https://doi.org/10.1186/s13750-025-00358-5