This is the fourth article in the series ‘The Art and Science of Taxonomy Development for Market and Competitive Intelligence’.


The preliminary step in developing a taxonomy is gathering terms to be included in the taxonomy. The goal is to identify terms that users already use to describe the information. The same users will later use the taxonomy to organize and find information.

Selecting terms for the taxonomy should be based on the following principle:

A taxonomy is developed for a specific use-case and for specific users. A common mistake is to start taxonomy development from the perspective of what information is available, rather than from the user and use-case perspectives.

Based on the above principle, the three fundamental criteria for selecting a term are:

– The terms should belong to the taxonomy’s core use-case, rather than merely reflect what information is available

– The terms should be easy for taxonomy users to understand and recall

– The terms should have as little overlap as possible

Continuing with our example of job postings: if your competitive intelligence content includes your competitors’ job postings, should your taxonomy include terms for tagging those job postings? Yes, if your strategy team will use them for analysis and to spot new trends. No, if the users are sales team members. After all, they might just apply to those job openings.

Don’t include a term in the taxonomy just because there is some content available to tag with it. In the above example, even if your content has job postings, don’t include a term for them unless there is a clear use-case.

The process for selecting terms for the taxonomy should be bottom-up, and the process for creating the taxonomy should be top-down (discussed next). The bottom-up process for selecting terms should start with a manual review of sample content.

1. Manual Content Review

Start with a manual review of sample content (news articles, reports, web pages, marketing collateral, presentations, and so forth) that needs to be tagged (indexed) using the taxonomy.

Manual review is one of the most efficient ways to develop an intuitive understanding of the underlying content.

You’ll discover some of the fundamental reasons for the information chaos. You’ll find that different terms are used to describe the same thing and different things are referred to by the same name.

For example, some users write ‘attorneys’, while others refer to the same thing as ‘lawyers’ or ‘advocates’. Some users categorize ‘Layoffs’ as ‘Human Resources’, whereas others refer to the same information as ‘Cost Cutting’. Which ones should you include in your taxonomy?

You’ll have to make several such subjective decisions about which terms to include or exclude. Based on our experience, we recommend the following:

– Do not exclude any terms at this stage of gathering terms for your taxonomy. Whenever in doubt, include the candidate term for review in later stages. Even if it is not included in the final taxonomy, it can be used as a synonym of the main term.

– Resist the temptation to read everything. The goal is not to become a domain expert but to understand just enough to have productive conversations with the users about the content and how it should be organized.

2. Leverage Existing Taxonomies

Yes, there are existing taxonomies in and around your organization that you can refer to. These might not be called taxonomies, but they are valuable sources for identifying candidate terms. For example, you can refer to:

– Your company’s website navigation, sections, sub-sections, sitemap, etc., or your competitors’ websites.

– Analyst reports, regulatory bodies, industry associations, trade publications, and more.

– Existing folder structures within your intranet, or your users’ existing systems for organizing content on their local computers.

You can be creative and resourceful in finding more such sources. This and similar subjective aspects are referred to as ‘Art’ in the title of this article.

Tip: To develop intuition about the taxonomy and content, try to manually tag sample content with the candidate terms that you are considering for your taxonomy.

3. User Interviews

User interviews are a great setting for understanding your users’ use-case: why they need information and how they use it.
Avoid asking how they currently organize the information, because their current approach might be flawed.

Before asking your users what information they need, how they get that information, and how they will use it, ask why they need the information.

To understand how your users describe the information, ask them in which folder they would store a given piece of information, which folder they would go to for a specific kind of information, or what term they would use to search for it.

Pay special attention to their choice of words and terms as they describe the information. We have to train our ears to spot subtle differences. For example, do they say ‘Management Change’ or ‘Leadership Change’; ‘Recruitment’ or ‘Hiring’; ‘Sales’ or ‘Business Development’?

Only one term will be selected as the preferred term. However, the alternative words should be added as synonyms (metadata) in your taxonomy. This will allow users to find the information using the terms they prefer.
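As a rough illustration of preferred terms with synonyms stored as metadata, here is a minimal Python sketch. The choice of which term is "preferred" in each pair is an assumption made only for the example, as are the names `TAXONOMY`, `build_lookup`, and `resolve`; your users decide the real preferred terms.

```python
# Hypothetical mini-taxonomy: each preferred term carries its synonyms as metadata.
# Which term is "preferred" here is an assumption made for illustration only.
TAXONOMY = {
    "Leadership Change": ["Management Change"],
    "Hiring": ["Recruitment"],
    "Sales": ["Business Development"],
}


def build_lookup(taxonomy):
    """Map every preferred term and synonym (lowercased) to its preferred term."""
    lookup = {}
    for preferred, synonyms in taxonomy.items():
        for label in [preferred, *synonyms]:
            lookup[label.lower()] = preferred
    return lookup


LOOKUP = build_lookup(TAXONOMY)


def resolve(user_term):
    """Return the preferred term for the user's wording, or None if unknown."""
    return LOOKUP.get(user_term.strip().lower())


print(resolve("Recruitment"))        # Hiring
print(resolve("management change"))  # Leadership Change
```

With a lookup like this, a search for ‘Recruitment’ and a search for ‘Hiring’ both land on the same tagged content.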

Incorporating such minor differences in the taxonomy goes a long way: it makes communication efficient, reduces errors, and aligns the whole organization.

In taxonomy design, as in any other design, the users and their use-case are more important than anything else. When there are trade-offs to be made, select the terms that users use to describe the information, not the academically correct terms for the information.

4. Automated Methods

If you are faced with the challenging task of building a taxonomy for a large amount of information without any access to the users (for example, building the taxonomy for a news website), then consider the following automated processes to identify candidate terms.

– A. Machine Learning: Unsupervised

Unsupervised learning is a branch of machine learning that groups data that has not already been tagged, classified, or labeled. In unsupervised learning, an algorithm segregates the content into categories based on the underlying features in the content.

These algorithms, based on mathematical probabilities and statistics principles, group or segment content that has some common attributes.

By grouping content through unsupervised learning, you can develop an understanding of the underlying content, which would have been impossible to do manually.
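To make the idea of grouping untagged content concrete, here is a deliberately simple Python sketch. It is a stand-in rather than a production algorithm: it uses word counts, cosine similarity, and a greedy single-pass grouping instead of an established method such as k-means or topic modeling. The sample headlines, the `threshold` value, and the function names are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a document into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.3):
    """Greedy single-pass grouping: a document joins the first group whose
    first member is similar enough; otherwise it starts a new group."""
    vectors = [vectorize(d) for d in docs]
    groups = []  # each group is a list of document indices
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


# Made-up sample headlines, with no tags or labels attached.
docs = [
    "Acme announces layoffs and job cuts at its factory",
    "New chief executive appointed as leadership change continues",
    "More layoffs expected as Acme cuts factory jobs",
    "Board confirms leadership change with new chief executive",
]

print(cluster(docs))  # [[0, 2], [1, 3]]
```

The two groups that emerge (layoff stories and leadership stories) hint at candidate terms, even though no one labeled the documents in advance. The groups still need a human to name them.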

– B. Keyword Cloud

Candidate terms for the taxonomy may be identified automatically using software that can help you create a keyword cloud from the content.
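The data behind a keyword cloud is just a table of word frequencies. The sketch below shows one simple way to compute it in Python. The stop-word list is a tiny illustrative sample, and the documents and the function name `keyword_counts` are assumptions made for the example.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for",
              "its", "is", "on", "as", "at", "with"}


def keyword_counts(documents, top_n=5):
    """Count non-stop-words across all documents and return the most frequent."""
    counts = Counter()
    for doc in documents:
        words = re.findall(r"[a-z]+", doc.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


# Made-up sample content.
sample = [
    "Acme announces layoffs at its factory",
    "Layoffs continue as Acme restructures",
    "Acme opens a new factory",
]

for word, count in keyword_counts(sample, top_n=3):
    print(word, count)
```

The most frequent words are the ones a keyword cloud would draw largest, and they are reasonable starting points for candidate terms.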

– C. User Queries

Your software should not just return the search results but also save the search queries for analysis. Terms found in user queries may also be considered for inclusion in the taxonomy, especially the terms that occur in multiple queries.
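A saved query log can be mined with a few lines of Python. The sketch below counts how many distinct queries each word appears in and keeps the words that occur in more than one. The sample queries, the `min_queries` parameter, and the function name are assumptions made for illustration.

```python
import re
from collections import Counter


def terms_in_multiple_queries(queries, min_queries=2):
    """Return words that appear in at least `min_queries` distinct queries."""
    query_counts = Counter()
    for query in queries:
        # A set ensures a word repeated within one query is counted once.
        query_counts.update(set(re.findall(r"[a-z]+", query.lower())))
    return {term: n for term, n in query_counts.items() if n >= min_queries}


# Made-up sample of saved search queries.
saved_queries = [
    "acme layoffs",
    "layoffs 2023 layoffs",
    "acme hiring",
    "pricing",
]

print(terms_in_multiple_queries(saved_queries))
```

Counting per query rather than per occurrence keeps one user who repeats a word from inflating its importance.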

To recap:

– ‘Information Systems’ are the ‘Nervous System’ of organizations
– ‘Taxonomy’ is the foundational pillar of ‘Information Systems’
– ‘Terms’ are the building blocks of the ‘Taxonomy’

If the terms are selected without thoughtful consideration, the entire ‘Information System’ will be at the mercy of individual users’ preferences.

Therefore, before building the taxonomy, we need to follow a systematic approach to select terms that are consistent.