Background: Online consultation platforms are central to maternal and infant healthcare in China, providing continuous professional support. However, the large volumes of unstructured clinician–patient dialogue they generate remain underused as a source of reproducible clinical patterns, which calls for computational approaches that balance model transparency with recent advances in natural language processing (NLP).
Objective: We aimed to systematically compare Latent Dirichlet Allocation (LDA) and BERTopic on large-scale online consultation data and to evaluate how well each model identifies breastfeeding-related inquiry patterns.
Methods: We analyzed 527,979 messages from the Internet Outpatient Platform of Guangzhou Women and Children’s Medical Center, Guangzhou, China (2021–2024). After removing nontextual elements, we generated 2,735 consultation-level records through segmentation, customized stopword refinement, synonym merging, and construction of domain-specific medical dictionaries. The number of LDA topics was determined using perplexity (≈12.4) and coherence (≈0.72), which led to the selection of 5 topics. BERTopic used all-MiniLM-L6-v2 embeddings, UMAP dimensionality reduction, HDBSCAN clustering, and c-TF-IDF weighting. The two models were compared on topic coherence, topic distinctiveness, and visualization outputs.
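The c-TF-IDF weighting step named above can be illustrated with a minimal sketch. It follows the formula given in the BERTopic documentation (L1-normalized term frequency within a cluster, multiplied by log(1 + A / f_t), where A is the average number of tokens per cluster and f_t is the frequency of term t across all clusters). The function names, the toy clusters, and their tokens are hypothetical and are not taken from the study data; the study's actual preprocessing and settings are not reproduced here.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_topic):
    """Class-based TF-IDF: one weight per (topic, term).

    docs_by_topic maps a topic id to a list of tokenized documents.
    Returns a dict mapping each topic id to a {term: weight} dict.
    """
    # Treat all documents of a topic as one long document and count terms.
    tf = {
        topic: Counter(tok for doc in docs for tok in doc)
        for topic, docs in docs_by_topic.items()
    }
    # Frequency of every term across all topics (f_t).
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # Average number of tokens per topic (A).
    avg_words = sum(total.values()) / len(tf)

    weights = {}
    for topic, counts in tf.items():
        n = sum(counts.values())
        if n == 0:
            weights[topic] = {}
            continue
        # L1-normalized term frequency times the class-based IDF term.
        weights[topic] = {
            t: (f / n) * math.log(1 + avg_words / total[t])
            for t, f in counts.items()
        }
    return weights


def top_terms(term_weights, k=3):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(term_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]


# Hypothetical toy clusters (illustrative tokens only, not study data).
clusters = {
    0: [["milk", "supply", "low"], ["milk", "weight", "gain"]],
    1: [["nipple", "pain", "latch"], ["pain", "milk"]],
}
weights = c_tf_idf(clusters)
for topic_id in sorted(weights):
    print(topic_id, top_terms(weights[topic_id]))
```

In this toy input, "milk" appears in both clusters, so its IDF term is smaller than that of cluster-specific words; this is the mechanism by which c-TF-IDF favors terms that distinguish one cluster from the others.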
Results: Both models identified 5 key thematic clusters in online breastfeeding consultations: (1) milk supply and infant weight concerns, (2) latch and sucking issues, (3) breast/nipple pain, (4) maternal diet, medication, and galactagogues, and (5) milk expression challenges. BERTopic achieved higher coherence (c_v = 0.78) and produced more compact, well-separated clusters, whereas LDA generated more stable macro-level topic structures. The comparison showed that the two models have complementary strengths in topic extraction: LDA offers macro-level stability, and BERTopic offers fine-grained semantic distinction.
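The abstract does not state how topic distinctiveness was measured. As one possible reading, the sketch below computes two common keyword-overlap measures: topic diversity (the share of unique words among all topics' top-k keywords) and mean pairwise Jaccard similarity between topics' keyword sets. Both function names and the example keyword lists are assumptions made for illustration; they are not the study's metric or its output.

```python
from itertools import combinations


def topic_diversity(topics, top_k=10):
    """Share of unique words among the top_k keywords of all topics.

    1.0 means no keyword is shared between topics; lower values mean overlap.
    """
    top_words = [w for words in topics for w in words[:top_k]]
    if not top_words:
        raise ValueError("topics must contain at least one keyword")
    return len(set(top_words)) / len(top_words)


def mean_pairwise_jaccard(topics, top_k=10):
    """Average Jaccard similarity over all pairs of topic keyword sets.

    0.0 means fully disjoint topics; higher values mean more overlap.
    """
    sets = [set(words[:top_k]) for words in topics]
    pairs = list(combinations(sets, 2))
    if not pairs:
        raise ValueError("need at least two topics")
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


# Hypothetical keyword lists (illustrative only, not model output).
example_topics = [
    ["milk", "supply", "weight"],
    ["latch", "sucking", "nipple"],
    ["pain", "nipple", "breast"],
]
diversity = topic_diversity(example_topics, top_k=3)
overlap = mean_pairwise_jaccard(example_topics, top_k=3)
print(f"diversity={diversity:.3f}, mean Jaccard={overlap:.3f}")
```

Applied to the top keywords of each model, higher diversity and lower mean Jaccard would be consistent with the "well-separated clusters" description, although keyword overlap does not capture separation in embedding space.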
Conclusions: Topic modeling of online consultation data enables systematic extraction of patterns in breastfeeding-related inquiries. Using LDA and BERTopic together supports scalable analysis of unstructured clinical dialogue, captures both broad and detailed thematic patterns, and advances the secondary use of digital health data for telehealth optimization. These findings support the use of online consultation platforms as a source for clinical information extraction.