So you have collected text data. Now what? If you want to avoid bad results, read our blog post about turning Text Into Topics and other ideas for useful topic modeling.
Intro
A popular way to understand text data without having humans label any of it is to let a computer model determine the categories, otherwise known as topics. This process is often referred to as topic modeling. Generative AI models like OpenAI’s ChatGPT and Anthropic’s Claude can divide text into topics without guidance. While this method can give results without humans reading any of the data, it often fails to create useful categories.
Data Organization
To understand why, consider that a chef and a supermarket manager will choose different ways to organize food storage. A chef will organize ingredients based on the cooking process. For instance, all items that will be grilled may be kept in the same part of the refrigerator, while all items that will be cooked on a rotisserie are kept in a different spot. In this scheme, chicken parts may be kept with the vegetables that will be grilled, and a whole chicken may be kept with the other rotisserie items. In contrast, the supermarket manager will likely group the whole chicken and the selection of chicken parts together, since shoppers find what they are looking for more easily when all the cuts of one type of meat are in the same place. In both arrangements, form follows function.
Yet when we rely on the computer model to organize the text data on its own, it does not optimize the organization for the user’s purpose, so its recommended arrangement can fall flat. For instance, the computer model may place the questions “Why are orange prices so high this year?” and “Why are apples so expensive now?” in the same topic, such as rising prices, when the user, who is trying to analyze which types of fruit people are talking about online, wants them placed in separate topics by fruit. Similarly, the auto-generated scheme may divide online comments about soccer so that each team is a separate topic, when the user needs the comments bucketed by player position.
Topic Creation
To ensure that the topics and their boundaries fit the use case, a subject matter expert must review the model’s suggested organizational arrangement. This person should determine whether the arrangement makes sense given the end goal of the data analysis project. If it doesn’t, the organization may want to cherry-pick from the auto-created topics and work with its experts to supplement them. This approach is referred to as “human in the loop”.
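As a rough illustration of that review step, here is a minimal Python sketch. The function name curate_topics, the topic names, and the descriptions are all hypothetical and chosen only to echo the soccer example; the real review is a human judgment call, and the code simply records its outcome.

```python
def curate_topics(auto_topics, keep, expert_topics):
    """Keep the auto-generated topics the expert approved, then add expert-written ones.

    auto_topics and expert_topics map a topic name to a short description.
    keep is the set of auto-generated topic names the expert wants to retain.
    """
    missing = set(keep) - set(auto_topics)
    if missing:
        raise KeyError(f"not in the auto-generated topics: {sorted(missing)}")
    curated = {name: desc for name, desc in auto_topics.items() if name in keep}
    curated.update(expert_topics)  # expert topics supplement (or override) the rest
    return curated


# Hypothetical output of an unguided model: comments split by team.
auto_topics = {
    "Team A": "Comments about Team A",
    "Team B": "Comments about Team B",
    "Refereeing": "Comments about referee decisions",
}

# Hypothetical expert input: the analysis needs player positions instead.
expert_topics = {
    "Goalkeepers": "Comments about goalkeepers",
    "Defenders": "Comments about defenders",
}

final_topics = curate_topics(auto_topics, keep={"Refereeing"}, expert_topics=expert_topics)
print(list(final_topics))
```

In this sketch the expert keeps one auto-created topic and replaces the team-based split with position-based topics that match the goal of the analysis.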
Gen AI
With the topic model complete, the organization can find samples for each topic within its data or write up new ones. Then, the organization can feed these samples via a prompt into a generative model, which will use them as a guide when identifying the topics contained within each text. The generative model’s topic choices should be assessed by having a subject matter expert spot-check them, by comparing the results with those of other generative AI models, or, ideally, by comparing the topics selected with those chosen by human labelers. Based on this assessment, the organization may want to improve the topic labeling by tweaking the prompt and then reassessing the revised results in an iterative fashion.
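The two steps described above, building a prompt from topic samples and checking the model’s labels against human labels, might be sketched in Python as follows. This is an illustration under stated assumptions rather than a recommended implementation: the topics, samples, prompt wording, and function names are hypothetical, and the call to the generative model is left out because it depends on the provider you use.

```python
# Hypothetical topics and sample texts, for illustration only.
TOPIC_SAMPLES = {
    "apples": ["Why are apples so expensive now?"],
    "oranges": ["Why are orange prices so high this year?"],
}


def build_prompt(topic_samples, text):
    """Assemble a prompt that lists each topic with its samples, then the text to label."""
    lines = ["Assign the text to exactly one of the topics below.", ""]
    for topic, samples in topic_samples.items():
        lines.append(f"Topic: {topic}")
        for sample in samples:
            lines.append(f"  Example: {sample}")
    lines += ["", f"Text: {text}", "Topic:"]
    return "\n".join(lines)


def agreement_rate(model_labels, human_labels):
    """Return the share of texts where the model's topic matches the human label."""
    if len(model_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one labeled text")
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches / len(human_labels)


def disagreements(texts, model_labels, human_labels):
    """List (text, model label, human label) for every mismatch, for expert review."""
    return [
        (text, m, h)
        for text, m, h in zip(texts, model_labels, human_labels)
        if m != h
    ]


prompt = build_prompt(TOPIC_SAMPLES, "Are apples cheaper at the farmers market?")
print(prompt)

# Hypothetical labels: in practice model_labels would come from the generative model.
texts = ["text 1", "text 2", "text 3", "text 4"]
model_labels = ["apples", "oranges", "apples", "apples"]
human_labels = ["apples", "oranges", "oranges", "apples"]
print(agreement_rate(model_labels, human_labels))
print(disagreements(texts, model_labels, human_labels))
```

A low agreement rate, or a cluster of disagreements around one topic, is a signal to revise the samples or the prompt wording and run the comparison again.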
The above process helps ensure that the results will meet the intended goals rather than leaving the organization with a group of distinctive, yet useless, topics. The key to a useful process is context for why you want the topics. Context is key!
© 2025 ORRO AI. All rights reserved.