Recently, there have been many discussions in the news about metadata being used to protect citizens. This is easy to see and, frankly, metadata can do many things for the public good, from predicting disease outbreaks to predicting traffic accidents. The question most organizations ask themselves is how they can benefit from metadata to improve their own information gathering, data collection, content management, and more. For those who understand metadata, the answers can be obvious. However, many organizations don’t understand what metadata truly is. In its simplest form, metadata is “data about data”. Now take it further: the next natural step is to think of metadata as a higher level of data. The next evolution is to think about metadata as a class of its own, either separate from or embedded in documents, records, or objects.
There are many ways to use metadata; this post focuses on classifying content with metadata. Classification here means putting documents or other content into classes and/or categories. This information can be embedded in the content itself, or it can live externally in some form of managed metadata that is connected to, stored alongside, or points to the actual document or object. The content can also be examined in order to add this metadata, which can be machine generated, human generated, or both. Each of these methods has pros and cons that people must consider when implementing metadata schemes. Content-based automatic classification systems are very powerful, but they can be fooled into saying the content is something that it is not. User assignment or cataloging can have the same problem, whether the user tricks the classification process or just sets the value incorrectly.
Many people refer to the practice and/or science of classification as Taxonomy. This semantic approach to metadata classification lends itself to a highly ordered set of data and, if implemented properly, to highly effective content management. It also gives users a superior way to use and interact with data and people. Within the Taxonomy, we can have controlled vocabularies and/or user-generated vocabularies. An example of a user-generated vocabulary on your intranet may be hashtags in social collaboration. A controlled vocabulary allows only certain values for a machine or human to select from, thus providing more usable categorization and better retrieval of information than simple hashtags.
Now that we’ve set the stage for metadata classification, it’s time to begin. The first step would be defining your group or organization’s Taxonomy. If you are a product company, you may have something like this:
· Product, Controlled Vocabulary – CRM, SharePoint, or Other
· Description, Freeform – Allow entry of 10 to 250 characters
· Product Delivery, Controlled Vocabulary – On-premise or Cloud
· Document Type, Controlled Vocabulary – Specification, White Paper, Blog, or Social
· Department, Controlled Vocabulary – Sales, Marketing, Product Management, or Support
· Sensitivity, Controlled Vocabulary – Classified, Top Secret, or Public
From this simple Taxonomy, a company could now catalog all of its documents in any location.
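To make the example taxonomy concrete, here is a minimal sketch of how it might be expressed as data and used to validate a document's metadata. The names `TaxonomyField`, `TAXONOMY`, and `validate` are hypothetical and chosen only for illustration; they are not part of any product API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaxonomyField:
    """One taxonomy field: a controlled vocabulary or a freeform text field."""
    name: str
    allowed: Optional[frozenset] = None  # set only for controlled vocabularies
    min_len: int = 0                     # used only for freeform fields
    max_len: int = 0


# The example taxonomy from the list above.
TAXONOMY = [
    TaxonomyField("Product", frozenset({"CRM", "SharePoint", "Other"})),
    TaxonomyField("Description", min_len=10, max_len=250),
    TaxonomyField("Product Delivery", frozenset({"On-premise", "Cloud"})),
    TaxonomyField("Document Type",
                  frozenset({"Specification", "White Paper", "Blog", "Social"})),
    TaxonomyField("Department",
                  frozenset({"Sales", "Marketing", "Product Management", "Support"})),
    TaxonomyField("Sensitivity", frozenset({"Classified", "Top Secret", "Public"})),
]


def validate(metadata: dict) -> list:
    """Return a list of problems; an empty list means the metadata is valid."""
    problems = []
    for f in TAXONOMY:
        value = metadata.get(f.name)
        if value is None:
            problems.append(f"{f.name}: missing")
        elif f.allowed is not None:
            # Controlled vocabulary: only the listed values are accepted.
            if value not in f.allowed:
                problems.append(f"{f.name}: {value!r} is not an allowed value")
        elif not (f.min_len <= len(value) <= f.max_len):
            # Freeform field: only the length is checked in this sketch.
            problems.append(
                f"{f.name}: length must be {f.min_len} to {f.max_len} characters")
    return problems
```

Calling `validate` on a document's metadata dictionary returns an empty list when every field is present and valid, which is one simple way to block a submission until the taxonomy is satisfied.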
Getting it Done
If you have 10 documents, you could probably read every document and assign the proper metadata. Past that point, however, it gets harder and harder. Now imagine you have 100 content contributors to your document collection: the exercise becomes far harder still! A company could create a rule that all users must fill in all of the metadata in the taxonomy, and block submission until they do so. Even with metadata templates that prefill some values, this task will be hard and the quality of the metadata will naturally decline. If this were an internal search engine for an intranet there may be minimal risk, but in many cases information is being shared publicly with customers and prospects. Thus, there is a higher risk of having hard-to-use content or exposing sensitive content to the wrong sets of eyes.
Taking both points together, the best solution is a collaboration between system and user to properly catalog content and so increase searchability, usability, and security. One of the products AvePoint offers for creating metadata is Compliance Guardian. It allows the user to create metadata in the ways mentioned at the beginning of this blog post, in line with common practice in the science of classification (Taxonomy). The simple table below covers the main points.
Author and Automatic Classification Interactions
|Requirement||Description||Importance|
|Supports Embedded Metadata||Allow the embedding of data into documents.||While there are many ways to assign metadata, the metadata itself should also be embedded directly into content so that the metadata travels with the data.|
|Supports Custom Taxonomies||Allow users of content management systems to create their own taxonomy and implement controlled vocabularies.||Metadata itself should have structure. While crowdsourced metadata like hashtags is great for Twitter, in a company you can use a more refined approach to cataloging your data.|
|Auto Classification||Allow for content to be automatically classified based on rule types, taking into account advanced vocabulary support.||Auto Classification is not always black and white. It needs to be flexible and capable of looking at multiple rules beyond simple content in order to determine meaning. Take a product document as an example: the footer may contain the names of the company’s other products, so we cannot classify by product name alone but must look at other factors. The same is true of determining sensitivity. If we find an address by itself, does that constitute a PII violation, or do we need to classify based on complex rules? Of course we need complex rules, and you cannot auto classify without this capability.|
|Allow User to Enter Metadata||Users can add metadata to the document.||User-generated metadata is important. In our example taxonomy in this post, we have Description, which is an obvious reason to have user-entered metadata.|
|Ensure Quality of User Selected Metadata||Since users can add metadata, a classification system should be able to identify errors and correct them if necessary.||In any system where a user can enter metadata, there needs to be a way to assure quality. The system needs to validate that the selection from a controlled vocabulary is in line with the actual content itself. For example, what if an author selected Public for the Sensitivity setting and the automatic classification system found employee payroll data? The system should identify this user error, alert someone, and optionally correct the metadata automatically or quarantine the content for further review.|
|Ensure Quality of User-entered Metadata||In freeform text fields, users can enter metadata, and the system needs to validate that the text does not violate policy and matches the document.||Since the invention of the search engine, people have been trying to tune their metadata so that their content ranks higher and is read first, even if that means misrepresenting data or stuffing keywords wherever possible. From classification errors to keyword stuffing, the metadata management system must prevent these types of author innovations!|
|Transfer Security-enabled Metadata||Encryption level must be determined based on sensitivity of content to provide content with site-specific classification.||If we refer back to our Sensitivity controlled vocabulary, we can then examine further qualities of the transport to determine whether the content is being served properly. This way, we can take further action if the document is of a protected class and the system finds it is not protected. In that case, the system should be able to move the document to a correct location specified by the user.|
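The quality check on user-selected metadata described in the table can be sketched as follows. This is only an illustration under stated assumptions: the payroll keyword pattern, the mapping of payroll data to "Classified", and the names `detect_sensitivity` and `review_selection` are all hypothetical, and a real system would rely on far richer rules.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical detector pattern; real rules would be far more thorough.
PAYROLL_PATTERN = re.compile(r"\b(payroll|salary|gross pay|net pay)\b",
                             re.IGNORECASE)


class Review(NamedTuple):
    sensitivity: str      # value that ends up on the document
    action: str           # "accept", "correct", or "quarantine"
    alert: Optional[str]  # message for a reviewer, if any


def detect_sensitivity(text: str) -> Optional[str]:
    """Suggest a Sensitivity value, or None if nothing was detected."""
    # Assumption for illustration: payroll data maps to "Classified".
    return "Classified" if PAYROLL_PATTERN.search(text) else None


def review_selection(user_value: str, text: str,
                     auto_correct: bool = False) -> Review:
    """Compare the author's Sensitivity choice with what the content suggests."""
    detected = detect_sensitivity(text)
    # Only the case from the post is flagged: "Public" on sensitive content.
    if detected is None or user_value != "Public":
        return Review(user_value, "accept", None)
    alert = f"Author selected {user_value!r} but content suggests {detected!r}"
    if auto_correct:
        return Review(detected, "correct", alert)
    return Review(user_value, "quarantine", alert)
```

Whether to correct automatically or quarantine is left as a flag here because the post presents both as options; which default is right depends on how much the organization trusts its automatic rules.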
While that table provides the basic minimal requirements for your automatic classification system, we must now move on to explain the minimal rule types and methods for the automatic classification system, as well as their importance.
Classification Rule Types
|Rule Type||Description||Importance|
|Does Text Exist||The presence or location of text in a document, or in document metadata, can be used to assign a metadata element (Tag) value.||The existence of text or keywords has always been a prime way to determine meaning or relationships. With controlled vocabularies, this becomes even more important, as we can have rules both for classifying and for validating human classification based on what is actually found in the content versus the metadata.|
|Conditional Text||Like the rule above, the presence or location of text in a document, or in document metadata, can be used to assign a metadata element (Tag) value. It can also act as a condition that triggers a search for another word or string.||Building on the Does Text Exist rule, this provides a more complex way of looking at text relationships to determine classification.|
|Dictionaries||Another text-based rule that determines whether the system can assign a controlled vocabulary term based on the existence of one or many words.||Classifications and terms can get really complex, and there can be a shopping list of terms whose presence or absence determines classification.|
|Element Validation||Whether user-entered metadata or structural elements, one can find meaning from element-based data. This check looks into element and attribute data to find meaning.||Be it HTML, XML, or XAML, we can find meaning from data and/or location. Certain words may have more value if found in Headers, Footers, or H1 Tags. This data needs to be explored and evaluated in order to better define meaning.|
|Enhanced Elements||This rule type looks deeper into the structure of content to find information about elements nested within other elements.||Much like Conditional Text, elements within elements can be used to find meaning.|
|External Content||Searches for content within content that is actually only a reference to content hosted on external sites.||This is a special rule type that identifies mixed content, which can in turn be used to classify a document or assign a controlled vocabulary term.|
|RegEx||A regular expression match in a document, or in document metadata, can be used to assign a metadata element (Tag) value.||This provides a powerful method to classify content based on pattern matching.|
|Conditional RegEx||A regular expression match in a document, or in document metadata, can trigger a second regular expression search, and the result can be used to assign a metadata element (Tag) value.||In some organizations, it becomes essential to find not just one item but pairs of items in the same content or structure.|
|Transport||Tests whether, and at what level, the HTTPS protocol is being used to serve content.||Many classifications, and actions taken on classification, depend on the communication type. For example, we may transmit secure and sensitive information, but only if the connection to a site is secure. A classification system should include classification-based rules, and these rules must be able to ask questions about the site.|
|Cookie||Aligned or Connected Data attached to content can be tested to determine classification.||If content has or exposes PII, it is important to classify the document properly. Cookies can determine content sensitivity.|
|Custom||A custom check type allows the performance of external functions if a condition is found.||In some cases, classification rules become a bit more complex. Is something near something else or far from something else? Does a number match some checksum? Does some complex relationship exist? The custom check isn’t meant to find things that we thought of. It is meant to classify based on all of the content combinations that we could not know of in a classification system.|
|Batching/Super Rules||Combines one or more rule types in sequential logic to produce controlled vocabulary or simple classification outcomes.||Controlled vocabularies that exist in taxonomies can be classified easily by batching together one or more of the above rule types to create a super rule.|
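Several of the rule types in the table can be illustrated with a short sketch. Everything here is an assumption made for illustration: the function names are hypothetical, rules are modeled as simple functions over document text, and the Luhn checksum stands in for the kind of "does a number match some checksum" test a custom rule might run. It does not describe how any particular product implements its rules.

```python
import re
from typing import Callable, Iterable, Optional

Rule = Callable[[str], bool]  # a rule inspects document text and answers yes/no


def text_exists(term: str) -> Rule:
    """Does Text Exist: true when the term appears anywhere in the text."""
    return lambda text: term.lower() in text.lower()


def conditional_text(term: str, then_term: str) -> Rule:
    """Conditional Text: look for the second term only if the first is found."""
    first, second = text_exists(term), text_exists(then_term)
    return lambda text: first(text) and second(text)


def dictionary(terms: Iterable[str], minimum: int = 1) -> Rule:
    """Dictionaries: true when at least `minimum` of the listed terms appear."""
    checks = [text_exists(t) for t in terms]
    return lambda text: sum(check(text) for check in checks) >= minimum


def regex(pattern: str) -> Rule:
    """RegEx: true when the pattern matches somewhere in the text."""
    compiled = re.compile(pattern)
    return lambda text: compiled.search(text) is not None


def conditional_regex(pattern: str, then_pattern: str) -> Rule:
    """Conditional RegEx: a first match triggers a second pattern search."""
    first, second = regex(pattern), regex(then_pattern)
    return lambda text: first(text) and second(text)


def luhn_valid(digits: str) -> bool:
    """Luhn checksum, commonly used to validate payment card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def custom_card_number(text: str) -> bool:
    """Custom: true when any 13 to 19 digit run passes the Luhn check."""
    return any(luhn_valid(m) for m in re.findall(r"\b\d{13,19}\b", text))


def super_rule(value: str,
               rules: Iterable[Rule]) -> Callable[[str], Optional[str]]:
    """Batching: assign `value` only if every rule passes, checked in order."""
    batch = list(rules)
    return lambda text: value if all(rule(text) for rule in batch) else None
```

For example, `super_rule("Classified", [text_exists("cardholder"), custom_card_number])` yields a classifier that returns `"Classified"` only when both the keyword and a checksum-valid number are present, which is the kind of pairing that a single keyword rule cannot express.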
A classification system, to be complete, has to be flexible enough to work with or without one or more taxonomies and/or controlled vocabularies. It must also be able to work with content and, in some cases, with content and structure as well as enhanced site properties related to the content being classified. Additionally, author/user input is essential to proper classification: a system must allow users to override the classification, or allow the system to override the user, on a granular level. Last, a classification system should be capable of classifying based on undefined relationships in data, and should have a rule language that can be customized to do so.
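One way to picture granular overrides is a per-field policy that says who wins when the author and the system disagree. The sketch below is hypothetical throughout: the policy values, the default of letting the user win, and the names `OVERRIDE_POLICY`, `resolve`, and `merge` are assumptions for illustration only.

```python
from typing import Dict, Optional

# Hypothetical per-field policy: who wins when author and system disagree.
OVERRIDE_POLICY = {
    "Sensitivity": "system",   # assumption: system wins for security fields
    "Product": "user",
    "Document Type": "user",
}


def resolve(field: str, user_value: Optional[str],
            system_value: Optional[str]) -> Optional[str]:
    """Pick the final value for one field according to the per-field policy."""
    if system_value is None:
        return user_value
    if user_value is None:
        return system_value
    winner = OVERRIDE_POLICY.get(field, "user")  # default is an assumption
    return system_value if winner == "system" else user_value


def merge(user_md: Dict[str, str],
          system_md: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Merge author-entered and system-generated metadata field by field."""
    fields = set(user_md) | set(system_md)
    return {f: resolve(f, user_md.get(f), system_md.get(f)) for f in fields}
```

With this policy, an author's choice of Product stands even if the system disagrees, while a Sensitivity value found by the system replaces the author's.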
Based on classification, alerts and actions are necessary. While classification helps with search, it is also essential for Data Loss Prevention (DLP), content management, risk management, and more. Based on classification, a content owner should be able not only to add metadata (or, for Compliance Guardian versions that support Microsoft SharePoint, managed metadata) but also to dispose of content, whether that means moving, managing, or deleting it, or taking other quarantine or disposal-related actions.
This blog post should be seen as a simple primer on both metadata and the classification capabilities of Compliance Guardian; it’s by no means a comprehensive listing of features or definitions.
For more information on Compliance Guardian, or if you need help setting up your classification system, please contact AvePoint Sales. We have a group of world-renowned experts in this field who can work with you to develop your Taxonomy and Controlled Vocabularies as well as write your rules and actions.
It is obvious that data is important, but metadata is even more important. Everyone can benefit from learning more about metadata, and every company can benefit in some way by implementing a metadata policy, whether for customer service, employee education, or as part of a DLP effort. Please feel free to contact me as well if you have any questions related to this blog post!
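As a rough sketch of classification driving alerts and actions, the example below combines the Sensitivity value with a transport check on the URL serving the content. The action names, the `PROTECTED` set, and the function names are hypothetical; only `urllib.parse.urlsplit` is a real standard library call.

```python
from urllib.parse import urlsplit

# Assumption: every Sensitivity value other than "Public" is a protected class.
PROTECTED = {"Classified", "Top Secret"}


def is_https(url: str) -> bool:
    """Transport check: is the content served over HTTPS?"""
    return urlsplit(url).scheme == "https"


def decide_actions(sensitivity: str, url: str) -> list:
    """Return the ordered list of (hypothetical) actions for one document."""
    actions = ["add_metadata"]  # classification is always recorded
    if sensitivity in PROTECTED and not is_https(url):
        # Protected content on an insecure transport: alert, then quarantine.
        actions += ["alert_owner", "quarantine"]
    return actions
```

Returning a list of action names, instead of performing the actions directly, keeps the decision separate from the move, delete, or quarantine mechanics, which will differ from one content system to another.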