Modelamiento de tópicos para identificar patrones en la investigación científica del Covid-19
Topic modeling to identify patterns in Covid-19 scientific research
Abstract (es)
Presentamos un modelo de tópicos basado en el método de asignación latente de Dirichlet (LDA, por sus siglas en inglés) con el objetivo de examinar patrones en la investigación científica del Covid-19 teniendo en cuenta las publicaciones indexadas en la base de datos especializada PubMed. Se toman 4928 resúmenes científicos publicados durante el primer semestre de 2020. Se ajusta un modelo LDA utilizando dos tópicos. El primer tópico corresponde a factores de riesgo, severidad y mortalidad por infección viral, mientras que el segundo al impacto de las infecciones respiratorias en la salud pública. La clasificación propuesta brinda una visión global sobre las dos tendencias de investigación presentes a la fecha en la que el análisis tiene lugar. Adicionalmente, los resultados señalan que la aplicación de la metodología propuesta provee un camino para direccionar y hacer más eficiente la revisión bibliográfica en el contexto académico.
Abstract (en)
We present a topic model based on latent Dirichlet allocation (LDA) to examine patterns in the scientific research on Covid-19, using publications indexed in the PubMed database. A total of 4928 scientific abstracts published during the first half of 2020 are considered. An LDA model is fitted using two topics. The first topic corresponds to risk factors, severity, and mortality due to viral infection, whereas the second corresponds to the impact of respiratory illnesses on public health. This classification provides a global overview of the two research trends present at the time of the analysis. Additionally, our findings suggest that applying the proposed methodology offers a way to guide literature reviews in academic contexts and make them more efficient.
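To make the two-topic fit described above concrete, the following is a minimal sketch of LDA estimated by collapsed Gibbs sampling in plain Python. It is not the authors' implementation: the function name fit_lda, the hyperparameter values (alpha, beta, iterations), and the toy corpus are all assumptions chosen for illustration, and the real analysis would run on the preprocessed PubMed abstracts rather than on these hypothetical token lists.

```python
import random

def fit_lda(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists.

    Hypothetical helper for illustration only (hyperparameters are assumptions).
    Returns the vocabulary, document-topic proportions (theta) and
    topic-word distributions (phi).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the current token from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Full conditional of the token's topic given all other tokens.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed point estimates taken from the final state of the chain.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + vocab_size * beta)
         for v in range(vocab_size)]
        for k in range(num_topics)
    ]
    return vocab, theta, phi

# Toy corpus: hypothetical tokens, not taken from the PubMed data.
toy_docs = [
    ["risk", "mortality", "severity", "patients", "risk"],
    ["mortality", "patients", "severity", "infection"],
    ["public", "health", "respiratory", "impact"],
    ["health", "public", "impact", "respiratory", "health"],
]
vocab, theta, phi = fit_lda(toy_docs, num_topics=2)

# Show the three most probable words of each topic.
for k, dist in enumerate(phi):
    top = sorted(range(len(vocab)), key=lambda v: dist[v], reverse=True)[:3]
    print(k, [vocab[v] for v in top])
```

Each row of theta gives the estimated topic proportions of one document, and each row of phi gives one topic's distribution over the vocabulary. Reading off the most probable words per topic is how labels such as those given to the two topics above are typically assigned. Topic indices are arbitrary, so which topic comes out as 0 or 1 depends on the random seed.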
License
Authors retain the rights to their articles and are therefore free to share, copy, distribute, perform, and publicly communicate the work under the following conditions:
Give credit for the work in the manner specified by the author or licensor (but not in a way that suggests they endorse you or your use of the work).
Comunicaciones en Estadística is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0).