Expanding Your Vocabulary: A Framework for Topic Integration in Texts

Topic integration
Vocabularies
Constitution

Roy Gardner, Matthew Martin, Ashley Moran, Zachary Elkins, Andrés Cruz, and Guillermo Pérez. “Expanding Your Vocabulary: A Framework for Topic Integration in Texts.”

Authors
Affiliations

Comparative Constitutions Project, University of Texas at Austin

Department of Government, University of Texas at Austin

Department of Government, University of Texas at Austin

Department of Government, University of Texas at Austin

Department of Government, University of Texas at Austin

Published

September 2024

Abstract

Topic discovery and integration are vital for maintaining vocabularies that categorize textual corpora. Purely automated approaches, however, are often computationally expensive and lack domain-specific conceptual nuance; conversely, purely manual approaches are time-consuming and prone to bias. To address this dilemma, we introduce the segments-as-topic (SAT) methodology, a four-stage process that combines automation and human expertise to assess candidate topics for vocabulary inclusion. In the SAT generation stage, a distinct topic is formulated and refined through collaboration with domain experts; a sentence-level semantic similarity model then retrieves corpus segments semantically aligned with the topic, forming a seed set. The SAT expansion stage uses this seed set to find additional semantically similar segments, which are iteratively accepted or rejected to build a final segment set. During the review stage, a panel of scholars evaluates the topic for inclusion. Finally, in the integration stage, all segments in the final segment set are automatically tagged with the new topic. We apply this methodology to the Comparative Constitutions Project vocabulary, which tracks over 330 topics in national constitutions, resulting in the addition of three new topics to the vocabulary. The SAT approach balances computational efficiency with expert judgment, offering a systematic, user-friendly, and replicable framework for social scientists to expand domain-specific vocabularies.
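The generation, expansion, and integration stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name the sentence-level semantic similarity model, so a toy bag-of-words cosine similarity stands in for it here. All function names, the stopword list, the thresholds, the example segments, and the `healthy_environment` label are hypothetical. The `accept` callback stands in for the human accept/reject decision, and the review stage (a panel of scholars) is a human step that is not modeled.

```python
import math
import re
from collections import Counter

# Toy stopword list (an assumption made only for this illustration).
STOPWORDS = {"the", "a", "to", "of", "for", "and", "is", "has", "shall"}

def embed(text):
    """Stand-in for a sentence embedding: a bag-of-words count vector."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def generate_seed(topic, segments, threshold):
    """Generation stage: retrieve segments similar to the topic formulation."""
    query = embed(topic)
    return {i for i, s in enumerate(segments)
            if cosine(query, embed(s)) >= threshold}

def expand(seed, segments, threshold, accept):
    """Expansion stage: propose neighbours of accepted segments and let a
    reviewer (the `accept` callback) accept or reject each proposal."""
    accepted, rejected = set(seed), set()
    frontier = list(seed)
    while frontier:
        current = embed(segments[frontier.pop()])
        for i, s in enumerate(segments):
            if i in accepted or i in rejected:
                continue
            if cosine(current, embed(s)) >= threshold:
                if accept(s):
                    accepted.add(i)
                    frontier.append(i)  # accepted segments seed further search
                else:
                    rejected.add(i)
    return accepted

def integrate(final_set, tags, topic_label):
    """Integration stage: tag every segment in the final set with the topic."""
    for i in final_set:
        tags.setdefault(i, set()).add(topic_label)
    return tags

# Hypothetical corpus segments, written for this example only.
segments = [
    "Everyone has the right to a healthy environment.",
    "The State shall protect the environment.",
    "The State shall protect natural resources for future generations.",
    "The President is elected for a term of five years.",
]
seed = generate_seed("right to a healthy environment", segments, threshold=0.5)
final = expand(seed, segments, threshold=0.25, accept=lambda s: True)
# The review stage (a panel decision) would happen here, before integration.
tags = integrate(final, {}, "healthy_environment")
```

In this toy run the reviewer accepts every proposal, so the expansion spreads from the single seed segment to the two related environmental provisions and leaves the unrelated provision on presidential terms untagged. A stricter `accept` callback would stop the expansion earlier.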