
Automating the Retrieval of Dosage to Consolidate Conflicting Evidence from Genistein Literature and Implications for Breast Cancer

This study aims to resolve conflicting claims about the effects of genistein on cell death and cell proliferation by consolidating the dosages tested in empirical studies of genistein and breast cancer. Automating the extraction of dosage quantities from thousands of scientific articles can accelerate toxicology review without sacrificing comprehensiveness. Oracle SQL Developer was used to design a rule-based model that retrieves dosage quantities and their accompanying SI units from a pre-established dataset of 777 PubMed articles. Information extracted from the articles was divided into the following categories: number, SI unit, dose, time, number and dose, and number and time. The baseline model identified sentences from 85 PubMed articles, which were then manually evaluated. Overall, the system annotated 69.1% of sentences correctly. The error analysis revealed that the main sources of error were the combined number-and-dose and number-and-time categories, along with precision for SI units. After the SI unit database was expanded, the syntax for time cues was diversified, and rules were written to recognize variance and ranges, an improved model was constructed and evaluated. Its accuracy was 82.8%, an increase of 13.7 percentage points over the baseline, and the system improved in nearly all of the precision and recall categories. With further revision to its handling of listing syntax and dosage differentiation, the system could capture dosage in a variety of settings.
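The rule-based approach described above can be illustrated with a minimal sketch. The study built its rules in Oracle SQL Developer and does not publish them here, so the Python code below is only an assumption-laden illustration of the general technique: the unit lists, the regular expressions, and the `extract` function are all hypothetical and are not the study's actual rules or SI unit database. It pairs a number (optionally a range such as "10-50" or a value with variance such as "10 ± 2") with a dose unit or a time unit.

```python
import re

# Hypothetical unit lists for illustration only; the study's SI unit
# database is not reproduced in the abstract.
# "mg/kg" is listed before "mg" so the longer unit is tried first.
DOSE_UNITS = r"(?:[u\u00b5\u03bc]M|nM|mM|[nu\u00b5\u03bc]g/mL|mg/kg|mg)"
TIME_UNITS = r"(?:hours?|hrs?|h|days?|d|min(?:utes?)?|weeks?)"

NUMBER = r"\d+(?:\.\d+)?"
# A quantity is a single number, a range ("10-50", "10 to 50"),
# or a value with variance ("10 ± 2").
QUANTITY = rf"{NUMBER}(?:\s*(?:-|\u2013|to|\u00b1)\s*{NUMBER})?"

# The lookbehind stops a match from starting in the middle of a token
# such as "IC50" or "1.5".
DOSE_RE = re.compile(rf"(?<![\w.])({QUANTITY})\s*({DOSE_UNITS})\b")
TIME_RE = re.compile(rf"(?<![\w.])({QUANTITY})\s*({TIME_UNITS})\b")


def extract(sentence):
    """Return (quantity, unit) pairs for dose and time mentions in a sentence."""
    doses = [(m.group(1), m.group(2)) for m in DOSE_RE.finditer(sentence)]
    times = [(m.group(1), m.group(2)) for m in TIME_RE.finditer(sentence)]
    return {"dose": doses, "time": times}


if __name__ == "__main__":
    example = ("Cells were treated with 10-50 uM genistein for 48 h, "
               "and mice received 5 mg/kg for 2 weeks.")
    print(extract(example))
```

A sketch like this treats ranges and variance as part of the quantity, which mirrors the kind of rule the improved model is said to have added. It does not attempt the listing syntax (for example, "1, 5, and 10 uM") that the abstract identifies as needing further work.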

Melissa Seecharan
Pomona College
Molecular Biology/English
Research Advisor: Dr. Catherine Blake
Department of Research Advisor: Information Sciences
Year of Publication: