Expanding Your Vocabulary: A Framework for Topic Integration in Texts

Abstract

Topic discovery and integration are vital for maintaining vocabularies that categorize textual corpora. Purely automated approaches, however, are often computationally expensive and lack domain-specific conceptual nuance; conversely, purely manual approaches are time-consuming and prone to bias. To address this dilemma, we introduce the segments-as-topic (SAT) methodology, a four-stage process that combines automation and human expertise to assess candidate topics for vocabulary inclusion. In the SAT generation stage, a distinct topic is formulated and refined in collaboration with domain experts; a sentence-level semantic similarity model then retrieves corpus segments semantically aligned with the topic, forming a seed set. The SAT expansion stage uses this seed set to find additional semantically similar segments, which are iteratively accepted or rejected to build a final segment set. In the review stage, a panel of scholars evaluates the topic for inclusion. Finally, in the integration stage, all segments in the final segment set are automatically tagged with the new topic. We apply this methodology to the Comparative Constitutions Project vocabulary, which tracks over 330 topics in national constitutions, resulting in the addition of three new topics to the vocabulary. The SAT approach balances computational efficiency with expert judgment, offering a systematic, user-friendly, and replicable framework for social scientists to expand domain-specific vocabularies.
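
To make the generation, expansion, and integration stages concrete, the following is a minimal illustrative sketch and not the implementation used in this work. It assumes a toy bag-of-words cosine similarity in place of a sentence-level semantic similarity model, and the function names (`embed`, `generate_seed`, `expand`, `integrate`), the seed size `k`, the similarity `threshold`, and the `accept` callback standing in for the human reviewer are all hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stopword list and word counts stand in for a real
# sentence-embedding model; any embedding function could be swapped in.
STOPWORDS = {"the", "a", "of", "to", "and", "is", "for", "has", "have", "shall"}

def embed(text):
    """Toy embedding: counts of lowercase non-stopword tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def generate_seed(topic, segments, k=2):
    """Generation stage: indices of the k segments most similar to the topic."""
    t = embed(topic)
    ranked = sorted(range(len(segments)),
                    key=lambda i: cosine(t, embed(segments[i])),
                    reverse=True)
    return set(ranked[:k])

def expand(seed, segments, accept, threshold=0.3):
    """Expansion stage: propose unseen segments similar to any accepted
    segment; `accept` (hypothetical) models the expert's accept/reject call."""
    accepted, rejected = set(seed), set()
    changed = True
    while changed:
        changed = False
        for i in range(len(segments)):
            if i in accepted or i in rejected:
                continue
            score = max((cosine(embed(segments[i]), embed(segments[j]))
                         for j in accepted), default=0.0)
            if score >= threshold:
                if accept(segments[i]):
                    accepted.add(i)
                    changed = True
                else:
                    rejected.add(i)
    return accepted

def integrate(final_set, topic_label):
    """Integration stage: tag every segment in the final set with the topic."""
    return {i: topic_label for i in sorted(final_set)}

# Invented example segments, for illustration only.
segments = [
    "Everyone has the right to a healthy environment.",
    "The state shall protect the environment and natural resources.",
    "The president is elected for a term of five years.",
    "Citizens have a duty to protect natural resources.",
    "Parliament consists of two chambers.",
]
seed = generate_seed("protection of the environment", segments, k=2)
final = expand(seed, segments, accept=lambda s: True)
tags = integrate(final, "environmental protection")
```

In this sketch the review stage is left outside the code, since it is a human judgment by a panel; only segments that survive expansion and a favorable review would be passed to `integrate`.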