Large Language Models for Rapid Instrument Prototyping: Design and Structural Optimization of the Dry Eye Disease in Pregnancy Questionnaire (DED-PREG)

Marta Jaruchowska,¹ Musa Aamir Qazi,¹ Muhammad Jalal Haidar,¹ Joanna Przybek-Skrzypecka,²,³ Janusz Skrzypecki¹

¹Department of Experimental Physiology and Pathophysiology, Medical University of Warsaw, Warsaw, Poland; ²Department of Ophthalmology, Medical University of Warsaw, Warsaw, Poland; ³SPKSO Ophthalmic University Hospital, Warsaw, Poland

Correspondence: Janusz Skrzypecki, Department of Experimental Physiology and Pathophysiology, Medical University of Warsaw, Banacha 1B, Warsaw, 02-097, Poland, Tel +0-22 116 6195, Fax +0-22 57 20 734, Email [email protected]

Introduction: Gestational Dry Eye Disease (DED) affects up to 50% of expectant mothers, yet current diagnostic tools are generic and fail to capture pregnancy-specific symptom patterns. Developing and validating new instruments in this population is logistically and ethically challenging due to recruitment barriers. This study describes the development and computational prototyping of the Dry Eye Disease in Pregnancy Questionnaire (DED-PREG) using a Generative Artificial Intelligence (GenAI) framework.
Methods: We utilized a multi-stage in silico framework involving two independent synthetic cohorts. First, a qualitative focus group cohort was generated to simulate clinical dialogues for content derivation, followed by semantic vectorization for algorithmic item reduction. Subsequently, an independent validation cohort of 500 pregnant personas was instantiated. We evaluated the resulting 20-item instrument for internal consistency, structural validity, and test-retest reliability via a longitudinal simulation engine utilizing temporal context injection to model gestational progression across five distinct timepoints (T1–T5).
Results: The DED-PREG mapped to three distinct domains: Ocular Symptoms, Functional Impact, and Lifestyle & Environmental Modulators. The instrument demonstrated satisfactory internal consistency (Cronbach’s alpha = 0.89) and excellent temporal stability in a strictly stable subsample (ICC = 0.99). Confirmatory Factor Analysis indicated acceptable model fit for synthetic high-dimensional data (CFI = 0.82; RMSEA = 0.11). Longitudinal analysis confirmed the instrument’s responsiveness to gestational change (Global Cohen’s d = 0.44), with Linear Mixed Models (LMM) revealing a significant interaction between low socioeconomic status and symptom exacerbation (β=0.053, p < 0.001).
Conclusion: This study presents the first pregnancy-specific DED instrument structurally optimized via AI simulation. While human validation remains the gold standard, this computational approach demonstrates that GenAI can serve as a rigorous “stress-test” for instrument design, enabling the rapid prototyping of robust clinical tools prior to in vivo deployment.

Plain Language Summary: Dry Eye Disease affects up to 50% of pregnant women, but current diagnostic tools fail to capture specific pregnancy-related symptoms. Developing new tests is difficult due to the logistical and ethical barriers in recruiting expectant mothers for research. This study used Generative AI to design and computationally test a new questionnaire, called DED-PREG. By testing the tool on 500 “digital personas” simulating pregnancy, researchers found that, in simulation, the survey was internally consistent, stable on retest, and responsive to symptom changes across trimesters. This approach suggests that AI can “stress-test” medical instruments, enabling the rapid creation of better clinical tools before testing on real patients.

Keywords: dry eye disease, PROM, large language models

Introduction

Pregnancy induces profound systemic adaptations that disrupt tear film homeostasis, leading to Gestational Dry Eye Disease (DED) in up to 50% of expectant mothers.¹ Driven by the interplay between estrogen-induced inflammation and androgen-mediated lipid deficiency, symptoms range from visual fluctuation to severe discomfort. Despite this, pregnancy remains an unclassified risk factor in major reports like Tear Film & Ocular Surface Society (TFOS) Dry Eye Workshop (DEWS) III.²

The absence of formal classification leaves clinicians without standardized diagnostic criteria, frequently leading to underdiagnosis or dismissal of maternal ocular symptoms as routine physiological changes. Consequently, pregnant patients are routinely excluded from targeted therapeutic interventions. To elevate gestational DED to a recognized pathological entity within global consensus frameworks, the scientific community requires robust epidemiological data. However, generating this evidence relies entirely on the prior availability of validated, condition-specific diagnostic instruments capable of isolating pregnancy-induced ocular surface disease from baseline somatic symptoms.

A critical barrier to classifying and treating this condition is the lack of specialized diagnostic instrumentation. Current patient-reported outcome measures (PROMs) like the Ocular Surface Disease Index 6 (OSDI-6) or Standardized Patient Evaluation of Eye Dryness (SPEED) were validated in general populations, typically older adults with chronic, stable disease.³,⁴ It may therefore be hypothesized that these tools underperform amid the symptom volatility of pregnancy. Furthermore, while simple unidimensional tools like Visual Analog Scales (VAS) allow for rapid symptom intensity reporting, they fail to capture the multidimensional impact of the disease on daily function and emotional well-being. Generic instruments often rely on long “recall periods”, which tend to average out symptoms.³,⁴ In addition, the OSDI places disproportionate emphasis on visual dysfunction, a domain often confounded in pregnancy by physiological refractive shifts rather than tear film instability.⁵

Traditionally, developing a sensitive, condition-specific questionnaire to fill this gap is a prohibitive undertaking. Standard psychometric validation requires rigorous iterative prototyping and pilot testing.⁶ While this “gold standard” approach ensures the instrument reflects the authentic patient voice, it presents significant logistical and ethical hurdles in obstetric research. This creates a methodological bottleneck: we need a better tool to study the population, but we cannot easily access the population to build the tool.

To circumvent this, recent advancements in artificial intelligence offer a novel solution for the instrument design and prototyping phase. Emerging literature introduces the concept of “silicon sampling”, demonstrating that Large Language Models (LLMs) can effectively simulate human psychological traits and survey response patterns.⁷ Some authors suggest that LLMs, when properly conditioned, can reproduce complex socio-demographic biases and persona-specific behaviors with high fidelity, often indistinguishable from human respondents in specific contexts.⁸,⁹

However, applying this technology requires significant methodological caution. While “silicon sampling” offers speed and scalability, it lacks the external validity of human research and carries the risk of “hallucinations”—where models generate plausible but factually incorrect correlations.⁷ Therefore, we acknowledge that LLMs operate on algorithmic logic that is fundamentally distinct from biological human cognition and cannot serve as a substitute for traditional clinical validation.

In this study, we use these “silicon subjects” solely for structural optimization of the tool. We leverage the model’s ability to simulate diverse patient personas to stress-test the questionnaire items for clarity, relevance, and ambiguity before they reach human participants. This study describes the development of the DED in Pregnancy Questionnaire (DED-PREG), demonstrating how AI-driven prototyping can streamline the initial design phases, resulting in a candidate instrument structure prepared for future empirical verification.

Materials and Methods

Study Design and in silico Framework

This study was conducted in accordance with the tenets of the Declaration of Helsinki.

To address logistical and ethical constraints associated with longitudinal research in pregnant populations, a Generative Artificial Intelligence (GenAI) framework was operationalized. The protocol for instrument development adhered to the iterative standards set forth by the US Food and Drug Administration (FDA) regarding Patient-Reported Outcome (PRO) measures. The study design followed a sequential mixed-methods approach, transitioning from qualitative content generation to quantitative psychometric validation. The workflow comprised three computational phases: (1) Stochastic Persona Modeling, (2) Semantic Item Generation and Reduction, and (3) Longitudinal Symptom Simulation.

Phase I: Stochastic Persona Modeling

A generative agent architecture was utilized to create a cohort of synthetic patient profiles (N=500). A stochastic parameter injection method was implemented to vary demographic and psychographic variables.

Persona Configuration: Personas were generated based on inclusion criteria: female sex, reproductive age (18–45 years), and varying gestational stages (trimesters 1–3). Socioeconomic status (SES) was stratified (Low 30%, Middle 45%, High 25%) and parameterized as a functional covariate. To model potential health disparities, a weighted coefficient (+0.15 SD to symptom magnitude) was conditionally applied to personas designated as “Low SES” during the gestational progression phases.
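The persona configuration described above can be illustrated with a minimal sketch. The age range, SES proportions, and the +0.15 SD shift come from the text; the field names, the uniform sampling of age and trimester, and the helper functions are assumptions made only for illustration and do not reproduce the authors' pipeline.

```python
import random

SES_LEVELS = ["Low", "Middle", "High"]
SES_WEIGHTS = [0.30, 0.45, 0.25]  # stratification stated in the text
LOW_SES_SHIFT_SD = 0.15           # shift stated in the text, in SD units


def sample_persona(rng: random.Random) -> dict:
    """Draw one synthetic persona (hypothetical field names)."""
    return {
        "age": rng.randint(18, 45),          # inclusive range; uniform is an assumption
        "trimester": rng.choice([1, 2, 3]),  # uniform is an assumption
        "ses": rng.choices(SES_LEVELS, weights=SES_WEIGHTS, k=1)[0],
    }


def symptom_shift_sd(persona: dict, in_progression_phase: bool) -> float:
    """Return the SD-unit offset, applied only to Low SES personas
    during the gestational progression phases."""
    if persona["ses"] == "Low" and in_progression_phase:
        return LOW_SES_SHIFT_SD
    return 0.0


rng = random.Random(42)  # fixed seed so the cohort is reproducible
cohort = [sample_persona(rng) for _ in range(500)]
```

Seeding the generator keeps the synthetic cohort reproducible across runs, which matters when the same personas are later reused across timepoints.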

Dialogue Simulation: Each persona was instantiated in a simulated clinical encounter. The model was prompted to simulate the symptom burden of pregnancy-associated DED, producing a corpus of synthetic doctor-patient dialogues to ground the questionnaire in patient vernacular.

Phase II: Semantic Item Generation and Reduction

A semi-supervised extraction pipeline was employed to transition from qualitative dialogue to quantitative metrics.

Item Extraction: The LLM analyzed the dialogue corpus to identify recurring symptom themes and formulated candidate Likert-scale items (1–5 scale).

Vector-Based Selection: Candidate items were vectorized using a pre-trained sentence transformer model. A cosine similarity matrix was calculated across item vectors to identify redundancy.

Optimization: An iterative selection algorithm was applied to maximize the semantic distance between items, resulting in a final item bank representing distinct symptoms.
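The text does not specify the selection algorithm, so the following is one possible sketch: a greedy farthest-point rule that repeatedly adds the candidate whose highest cosine similarity to the already selected items is lowest. The function names and the toy two-dimensional vectors are hypothetical; real item embeddings would come from the sentence transformer.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_diverse(vectors, k):
    """Greedy selection of k item indices that keeps pairwise similarity low.

    Assumption: the first item seeds the selection; ties go to the lower index.
    """
    n = len(vectors)
    sim = [[cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
           for i in range(n)]
    selected = [0]
    while len(selected) < min(k, n):
        best, best_score = None, None
        for i in range(n):
            if i in selected:
                continue
            # similarity of candidate i to its nearest already-selected item
            nearest = max(sim[i][j] for j in selected)
            if best is None or nearest < best_score:
                best, best_score = i, nearest
        selected.append(best)
    return selected


# Toy embeddings: items 0 and 1 are near-duplicates, item 2 is distinct.
toy_vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
chosen = select_diverse(toy_vectors, k=2)
```

In the toy example the near-duplicate of the seed item is skipped in favor of the orthogonal one, which is the redundancy-removal behavior the similarity matrix is meant to support.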

Phase III: Longitudinal Structural Optimization

The prototype instrument (DED-PREG) underwent an in silico longitudinal simulation across five timepoints to assess sensitivity to symptom fluctuation.

Time-Series Simulation: The cohort (N=500) was subjected to a multi-point temporal simulation: T1 (Baseline), T2 (7-day retest), T3 (2nd Trimester, approx. +10 weeks), T4 (3rd Trimester, approx. +22 weeks), and T5 (Postpartum, approx. +36 weeks).

Temporal Context Injection: At each interval, a temporal marker and physiological context were embedded into the persona’s state vector.

Dynamic Response Modeling: Agents completed the questionnaire under evolved temporal contexts to assess the instrument’s sensitivity to the simulated progression of pregnancy and SES-modulated trajectories.
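As a rough illustration of temporal context injection, the sketch below keeps the persona description fixed and varies only the temporal marker across the five timepoints. The timepoint labels follow the text; the prompt layout, tags, and function name are hypothetical and are not the prompts used in the study.

```python
TIMEPOINTS = {
    "T1": "Baseline",
    "T2": "7-day retest",
    "T3": "2nd trimester, approx. +10 weeks",
    "T4": "3rd trimester, approx. +22 weeks",
    "T5": "Postpartum, approx. +36 weeks",
}


def build_prompt(persona: dict, timepoint: str) -> str:
    """Compose a prompt whose persona block is identical at every timepoint."""
    persona_block = "; ".join(f"{key}={value}" for key, value in sorted(persona.items()))
    return (
        f"[PERSONA] {persona_block}\n"
        f"[TIME] {timepoint}: {TIMEPOINTS[timepoint]}\n"
        "[TASK] Answer each questionnaire item on a 1-5 scale."
    )


persona = {"age": 31, "ses": "Low", "parity": 1}  # illustrative values only
prompts = {tp: build_prompt(persona, tp) for tp in TIMEPOINTS}
```

Because only the `[TIME]` line changes, any difference in simulated responses between timepoints can be attributed to the injected context rather than to a regenerated persona.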

Reliability: Internal consistency was assessed using Cronbach’s alpha. Test-retest reliability was evaluated in the full cohort (N=500) and a stable subsample (n=112) using the Intraclass Correlation Coefficient (ICC, model 2,1). The stable subsample was defined by an absolute global score shift of ≤ 0.30 points between T1 and T2.
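For reference, the two reliability coefficients and the stability criterion can be written out with the standard library alone. This is a sketch of the textbook definitions (sample variances for alpha; two-way random effects, absolute agreement, single measurement for ICC(2,1)), not the pingouin-based code used in the study; the function names are hypothetical.

```python
import statistics


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent."""
    k = len(rows[0])
    item_vars = [statistics.variance([r[i] for r in rows]) for i in range(k)]
    total_var = statistics.variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def icc_2_1(rows):
    """rows: one list of repeated scores (e.g. [T1, T2]) per subject."""
    n, k = len(rows), len(rows[0])
    grand = sum(map(sum, rows)) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ms_rows = ss_rows / (n - 1)   # between-subject mean square
    ms_cols = ss_cols / (k - 1)   # between-occasion mean square
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def is_stable(t1_score, t2_score, threshold=0.30):
    """Stable-subsample criterion from the text: |T2 - T1| <= 0.30."""
    return abs(t2_score - t1_score) <= threshold
```

Note that selecting subjects by a small T1 to T2 shift constrains within-subject variance by construction, so an ICC computed in such a subsample is expected to be higher than in the full cohort.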

Structural Validity: Confirmatory Factor Analysis (CFA) was conducted specifying a three-factor oblique model using Maximum Likelihood estimation. Model fit was evaluated via the Comparative Fit Index (CFI), Tucker-Lewis Index (TLI), and Root Mean Square Error of Approximation (RMSEA).
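The fit indices named above have standard definitions in terms of the model and baseline chi-square statistics, sketched below. The degrees-of-freedom helper assumes a simple structure (each item loads on one factor), factor variances fixed to 1, no correlated residuals, and no mean structure; the RMSEA uses N − 1 in the denominator, whereas some software uses N. All function names are hypothetical.

```python
import math


def cfa_degrees_of_freedom(n_items: int, n_factors: int) -> int:
    """Model df under the simple-structure assumptions stated above."""
    observed_moments = n_items * (n_items + 1) // 2
    free_params = (
        n_items                                 # factor loadings
        + n_items                               # residual variances
        + n_factors * (n_factors - 1) // 2      # factor covariances
    )
    return observed_moments - free_params


def fit_indices(chi2_model, df_model, chi2_base, df_base, n):
    """Return (CFI, TLI, RMSEA) from model and baseline chi-square values."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, 0.0)
    denom = max(d_model, d_base)
    cfi = 1.0 if denom == 0 else 1.0 - d_model / denom
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    tli = (ratio_base - ratio_model) / (ratio_base - 1.0)
    rmsea = math.sqrt(d_model / (df_model * (n - 1)))
    return cfi, tli, rmsea


df_20_items = cfa_degrees_of_freedom(n_items=20, n_factors=3)
```

Under these assumptions a 20-item, three-factor oblique model has 210 observed variances and covariances and 43 free parameters, leaving 167 degrees of freedom.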

Responsiveness & Validity: Longitudinal responsiveness was tested using Cohen’s d effect sizes and Linear Mixed Models (LMM). Known-groups validity was assessed by comparing mean scores across trimesters and SES groups using Kruskal–Wallis tests.
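The text does not state which variant of Cohen’s d was used. The sketch below assumes the pooled-standard-deviation form, which treats the two timepoints as separate groups; a paired variant based on the standard deviation of change scores would give different values. The function name is hypothetical.

```python
import math
import statistics


def cohens_d_pooled(baseline, followup):
    """Standardized mean change (followup minus baseline) using the pooled SD."""
    n1, n2 = len(baseline), len(followup)
    v1 = statistics.variance(baseline)
    v2 = statistics.variance(followup)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(followup) - statistics.mean(baseline)) / pooled_sd


# Toy scores: a one-point mean increase with unit SD at both timepoints.
d_example = cohens_d_pooled([1, 2, 3], [2, 3, 4])
```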

Fairness: Differential Item Functioning (DIF) was examined across age and parity groups using ordinal logistic regression with Benjamini-Hochberg correction.
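The Benjamini-Hochberg step-up procedure applied to the DIF p-values can be sketched as follows. The implementation follows the standard definition; the significance level of 0.05 and the function name are assumptions, since the text does not state the false discovery rate that was targeted.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank whose p-value is below its step-up threshold.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    # Reject every hypothesis ranked at or below that largest rank.
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


flags = benjamini_hochberg([0.001, 0.04, 0.03, 0.5])
```

In the toy input only the smallest p-value survives correction, because 0.03 and 0.04 exceed their rank-scaled thresholds of 0.025 and 0.0375.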

Statistical Analysis

Statistical analyses were performed using Python v3.10 (Python Software Foundation, Wilmington, DE, USA) with SciPy, statsmodels, and pingouin packages.

Results

Instrument Structure and Content Generation

Qualitative analysis of simulated consultations identified themes of physical discomfort, visual fluctuation, and pregnancy-specific health anxiety. Semantic reduction resulted in the 20-item DED-PREG (Table 1). The analysis supported a three-domain structure:

Table 1 DED-PREG Domain Structure and Item Composition

● Ocular Symptoms (7 items): Symptomatic burden such as grittiness and redness.

● Functional Impact (3 items): Interference with daily life and visual tasks.

● Lifestyle & Environmental Modulators (10 items): Modifying factors and environmental triggers.

Psychometric Validation

The DED-PREG was validated in the synthetic longitudinal cohort (N=500).

Reliability: Internal consistency for the global scale yielded a Cronbach’s alpha of 0.891 (95% CI: 0.876–0.904) at baseline. Domain-specific analysis showed alpha values of 0.817 for Ocular Symptoms, 0.558 for Functional Impact, and 0.802 for Lifestyle/Environmental Modulators (Table 2).

Table 2 Reliability Metrics of the DED-PREG (Internal Consistency and Test-Retest Stability)

Temporal Stability: The ICC for the total score in the full cohort (N=500) was 0.949 between T1 and T2. In the stable subsample (n=112, absolute T1–T2 score shift ≤ 0.30), the ICC was 0.999. Bland-Altman analysis indicated a mean bias of −0.0045 with no significant proportional bias (p > 0.05) (Table 2).
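The Bland-Altman mean bias reported above is the average of the paired T2 minus T1 differences, usually reported with 95% limits of agreement. A minimal sketch follows; the sign convention (T2 − T1) and the function name are assumptions, and the proportional-bias test is not reproduced here.

```python
import statistics


def bland_altman(t1_scores, t2_scores):
    """Return (mean bias, lower limit, upper limit) for paired scores."""
    diffs = [b - a for a, b in zip(t1_scores, t2_scores)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    # 95% limits of agreement: bias +/- 1.96 SD of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


bias, lower, upper = bland_altman([1.0, 2.0, 3.0, 4.0], [1.2, 1.8, 3.4, 3.6])
```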

Structural Validity: CFA using Maximum Likelihood estimation supported the three-factor oblique model. Fit indices were: CFI = 0.824, TLI = 0.800, and RMSEA = 0.111. Factor loadings indicated that items were associated with their respective domains; for example, item Q16 showed a loading of 2.90 on the Ocular Symptoms factor (Table 3).

Table 3 Confirmatory Factor Analysis (CFA) Fit Indices for the Three-Factor Model

Discriminant and Known-Groups Validity: Kruskal–Wallis tests indicated significant differences in scores across groups based on trimester, DED severity, and screen time. DIF analysis flagged 8 out of 120 (6.7%) item-group combinations (Table 4).

Table 4 Known-Groups Validity: Mean DED-PREG Total Scores by Gestational Trimester

Longitudinal Responsiveness and Trajectory: The global score showed a change from T1 to T4 (3rd Trimester) with a Cohen’s d of 0.441. LMM analysis indicated a significant main effect of time (p < 0.001). A significant interaction was observed between Time and Low SES (β = 0.053, p < 0.001). This interaction indicates that personas parameterized as Low SES experienced a steeper trajectory of symptom increase in the third trimester compared to Middle and High SES groups (Table 5).

Table 5 Longitudinal Responsiveness of the DED-PREG

Discussion

This study establishes a computational prototyping framework for developing a pregnancy-specific PROM questionnaire. Harnessing a multi-stage pipeline, we successfully engineered the DED-PREG, a 20-item instrument designed to isolate the distinct phenomenology of gestational ocular surface disease. These findings provide empirical support for the emerging paradigm of “Silicon Sampling”, demonstrating that LLMs, when constrained by rigorous prompt engineering, can emulate the iterative item-generation phases traditionally reliant on labor-intensive qualitative research.⁷,¹⁰ However, we emphasize that these in silico metrics represent a “structural optimization” of the instrument’s logic, serving as a risk-mitigation precursor to, rather than a substitute for, prospective human clinical validation.

Current generic instruments, such as the OSDI or DEQ-5, were validated in stable, older populations and are susceptible to distinct measurement errors when applied to obstetric cohorts.¹¹–¹⁵ A primary failure mode of legacy tools is the confounding influence of “somatic noise” and refractive instability.

Legacy instruments conflate visual acuity deficits with DED symptomatology, a flaw that is magnified during gestation. This is most evident in the OSDI, where 50% of the items (6 out of 12) are directly dependent on visual function (eg, questions regarding “blurred vision”, “poor vision”, or difficulty “driving at night”).⁵,¹⁶ In the general population, blurred vision is a reliable proxy for tear film instability. However, in pregnancy, systemic fluid retention frequently induces corneal edema and transient myopic shifts, causing visual blurring unrelated to ocular surface desiccation.¹⁷ Consequently, generic instruments might yield high rates of false positives, misclassifying gestational refractive changes as severe DED. The DED-PREG mitigates this by structurally filtering these confounders, focusing instead on localized nociception (eg, grittiness, foreign body sensation) specifically attributable to surface hyperosmolarity.¹⁸

Crucially, the acceleration of the development timeline did not compromise psychometric integrity. The instrument exhibited satisfactory internal consistency (Global alpha = 0.89) and structural validity. Confirmatory Factor Analysis (CFA) confirmed a three-domain structure (Ocular Symptoms, Functional Impact, Lifestyle & Environmental Modulators) with fit indices (CFI = 0.824, RMSEA = 0.111) that fall within acceptable ranges for high-dimensional synthetic data, though they indicate potential for further item reduction in future clinical iterations. To address concerns regarding “algorithmic hallucination”, we employed a “State-Preserving Longitudinal Simulation” with temporal context injection. Unlike stateless approaches where the model generates new personas at every step, our pipeline maintained the persona’s biological identity (demographics, medical history) constant while injecting specific temporal markers (eg, T1 vs. T2) into the prompt context. The resulting high test–retest reliability demonstrates stability in silico, pending clinical confirmation. Furthermore, the model demonstrated significant responsiveness to change, yielding a Cohen’s d of 0.441 for the global score by the 3rd trimester. Notably, the Ocular Symptoms domain exhibited a large effect size (d=1.08), confirming the instrument’s sensitivity to physiological progression despite the stability of lifestyle factors. LMM analysis revealed that this trajectory is not uniform; specifically, it identified a significant interaction between low socioeconomic status (SES) and symptom exacerbation (β=0.053, p<0.001), suggesting the model successfully encoded complex socio-biological determinants of health.¹⁹

A distinguishing feature of the DED-PREG is the emergence of the “Lifestyle & Environmental Modulators” domain. In traditional ophthalmic instruments, environmental triggers (eg, wind, low humidity) are typically assessed passively—simply asking if they provoke symptoms.¹⁴ However, our factor analysis reveals that in the gestational context, these factors are intrinsically linked to active “health vigilance” and pharmacological hesitancy.

Pregnant patients often exhibit a heightened reluctance to use topical pharmacotherapy due to fears of systemic absorption and fetal teratogenicity. Consequently, their primary management strategy shifts from “treatment” to “behavioral compensation” (eg, rigid avoidance of air conditioning, strategic use of humidifiers, or limiting screen time to prevent desiccation). Generic tools like the OSDI fail to capture this burden; a patient who successfully avoids triggers might report low symptom scores, masking the significant functional restrictions she has imposed on her daily life to achieve that comfort.²⁰ The DED-PREG’s Lifestyle domain specifically operationalizes these behavioral nuances, moving beyond simple symptom counting toward a holistic assessment of the patient’s lived experience.

The implications of this framework extend beyond the DED-PREG. The underlying architecture—specifically the stochastic generation of patient personas—lays the groundwork for Obstetric Digital Twins. In the future, by inputting real-world maternal parameters (eg, hormone levels, pre-existing conditions) into such models, clinicians could simulate potential symptom trajectories to anticipate ocular complications before they manifest.

Limitations

This study must be interpreted within the context of its in silico design. While the synthetic cohort of 500 personas was generated to maximize demographic diversity, the data remains a probabilistic approximation of human experience. The high reliability coefficients observed likely represent an “upper bound” of psychometric performance, free from the cognitive fatigue and variable interpretation found in clinical settings. Consequently, the magnitude of the instrument’s responsiveness requires verification in a prospective “bridge study” involving real-world patients to calibrate diagnostic cut-offs against objective physiological biomarkers such as tear osmolarity or MMP-9 levels.

Conclusions

In conclusion, GenAI-assisted methodologies offer a promising supplementary framework for rapid PROM prototyping. While in silico simulations suggest the DED-PREG exhibits strong potential, these findings represent a computational proof-of-concept rather than clinical validation. The instrument currently serves as a scalable prototype, providing a foundation for future empirical studies required to standardize diagnosis in this population.

Disclosure

The authors report no conflicts of interest in this work.

References

1. Jaruchowska M, Przybek-Skrzypecka J, Skrzypecki J. Pregnancy and dry eye syndrome: a review for clinical practice. Int J Mol Sci. 2025;26(20):9990. doi:10.3390/ijms26209990

2. Perez VL, Chen W, Craig JP, et al. TFOS DEWS III: executive summary. Am J Ophthalmol. 2025;282:135–8. doi:10.1016/j.ajo.2025.09.035

3. Ngo W, Situ P, Keir N, Korb D, Blackie C, Simpson T. Psychometric properties and validation of the standard patient evaluation of eye dryness questionnaire. Cornea. 2013;32(9):1204–1210. doi:10.1097/ICO.0b013e318294b0c0

4. Pult H, Wolffsohn JS. The development and evaluation of the new Ocular Surface Disease Index-6. Ocul Surf. 2019;17(4):817–821. doi:10.1016/j.jtos.2019.08.008

5. Asiedu K, Kyei S, Mensah SN, Ocansey S, Abu LS, Kyere EA. Ocular Surface Disease Index (OSDI) versus the Standard Patient Evaluation of Eye Dryness (SPEED): a study of a nonclinical sample. Cornea. 2016;35(2):175–180. doi:10.1097/ICO.0000000000000712

6. Bull C, Teede H, Carrandi L, Rigney A, Cusack S, Callander E. Evaluating the development, woman-centricity and psychometric properties of maternity patient-reported outcome measures (PROMs) and patient-reported experience measures (PREMs): a systematic review protocol. BMJ Open. 2022;12(2):e058952. doi:10.1136/bmjopen-2021-058952

7. Sarstedt M, Adler SJ, Rau L, Schmitt B. Using large language models to generate silicon samples in consumer and marketing research. Psychology and Marketing. 2024;41(6):1254–1270. doi:10.1002/mar.21982

8. Argyle LP, Busby EC, Fulda N, Gubler JR, Rytting C, Wingate D. Out of one, many: using language models to simulate human samples. Political Analysis. 2023;31(3):337–351. doi:10.1017/pan.2023.2

9. Horton JJ. Large language models as simulated economic agents: what can we learn from homo silicus? NBER Working Papers. 2023:31122.

10. Random silicon sampling: simulating human sub-population opinion using a large language model based on group-level demographic information. arXiv preprint arXiv:2402.18144. 2024.

11. Dougherty BE, Nichols JJ, Nichols KK. Rasch analysis of the Ocular Surface Disease Index (OSDI). Invest Ophthalmol Vis Sci. 2011;52(12):8630–8635. doi:10.1167/iovs.11-8027

12. Okumura Y, Inomata T, Iwata N, et al. A review of dry eye questionnaires: measuring patient-reported outcomes and health-related quality of life. Diagnostics. 2020;10(8):559. doi:10.3390/diagnostics10080559

13. Coco G, Piccotti G, Rossi L, Fucci P, Manni G. Assessment of dry eye questionnaires in patients with and without glaucoma. Br J Ophthalmol. 2025.

14. Meng X, Geng R, Yang K, et al. Comparison of SPEED and OSDI questionnaires for dry eye symptom in Chinese college students: a cross-sectional study. BMC Ophthalmol. 2025;25(1):425. doi:10.1186/s12886-025-04255-w

15. Schiffman RM, Christianson MD, Jacobsen G, Hirsch JD, Reis BL. Reliability and validity of the ocular surface disease index. Arch Ophthalmol. 2000;118(5):615–621. doi:10.1001/archopht.118.5.615

16. Aljarousha M, Alghamdi WM, Attaallah S, Alhoot MA. Ocular Surface Disease Index questionnaire in different languages. Med Hypothesis Discov Innov Ophthalmol. 2025;13(4):190–200. doi:10.51329/mehdiophthal1510

17. Sunness JS. The pregnant woman’s eye. Surv Ophthalmol. 1988;32(4):219–238. doi:10.1016/0039-6257(88)90172-5

18. Harrell CR, Feulner L, Djonov V, Pavlovic D, Volarevic V. The molecular mechanisms responsible for tear hyperosmolarity-induced pathological changes in the eyes of dry eye disease patients. Cells. 2023;12(23):2755. doi:10.3390/cells12232755

19. Nkiru ZN, Stella O, Udeh N, Polycarp UA, Daniel CN, Ifeoma RE. Dry eye disease: a longitudinal study among pregnant women in Enugu, south east, Nigeria. Ocul Surf. 2019;17(3):458–463. doi:10.1016/j.jtos.2019.05.001

20. Abetz L, Rajagopalan K, Mertzanis P, et al. Development and validation of the impact of dry eye on everyday life (IDEEL) questionnaire, a patient-reported outcomes (PRO) measure for the assessment of the burden of dry eye on patients. Health Qual Life Outcomes. 2011;9(1):111. doi:10.1186/1477-7525-9-111

Creative Commons License © 2026 The Author(s). This work is published and licensed by Dove Medical Press Limited. The full terms of this license are available at https://www.dovepress.com/terms and incorporate the Creative Commons Attribution - Non Commercial (unported, 4.0) License. By accessing the work you hereby accept the Terms. Non-commercial uses of the work are permitted without any further permission from Dove Medical Press Limited, provided the work is properly attributed. For permission for commercial use of this work, please see paragraphs 4.2 and 5 of our Terms.
