A Reliable Data Protection Strategy Hinges Upon Highly-Accurate Data Detection

Aug 20, 2021

The growth of data continues unabated throughout the world. In fact, IDC predicts that by 2025, 175 trillion gigabytes (or 175 zettabytes) of new data will be generated worldwide. In the coming decade, the exponential growth of data will continue to break boundaries, drive new innovation, and give birth to new economies of scale.

So far, big tech companies have led the way, proving to everyone else that collecting and creating data can support very large and profitable business models. In the near future, more and more enterprises will run on data as their primary asset. Its value will show in the many ways it is commoditized: purchased, sold, bartered, serviced, and incorporated into other products and services.

Not All Data is Created the Same

But while the data economy continues to pick up steam, companies need to make sense of their data wisely. Collecting or creating massive amounts of data does not by itself make that data useful. The fact of the matter is that not all data is created the same, so taking into account the variety and diversity of data available to the business is critically important.

Take, for example, the personally identifiable information (PII) of customers and employees, or the protected health information (PHI) of patients in healthcare delivery organizations. These types of datasets are classified as sensitive and are therefore highly confidential. Other examples of sensitive data include financial information, trade secrets, M&A documents, and intellectual property (e.g., engineering designs, source code, and patents).

Cookie IDs, hashed email addresses, mobile advertising IDs, and other technical identifiers that don’t directly identify individuals are examples of non-sensitive data. For businesses, any data that would not pose a risk to the company if released to a competitor or the general public can generally be classified as non-sensitive.

Structured vs. Unstructured Data

Maintaining visibility into all of the organization’s sensitive data, wherever it resides and wherever it moves, is fundamental to an effective data protection strategy. But automatically discovering and categorizing sensitive data is not an easy task for any classification tool. Apart from being broadly categorized as sensitive or non-sensitive, data can also be classified as structured or unstructured.

Structured data is clearly defined and searchable because it lives in predefined structures such as databases or spreadsheets (Excel files, for example), made of cells in named columns and rows that hold specific names, addresses, phone numbers, social security numbers, credit card numbers, and so on.

Unstructured data comes in a variety of formats and has no predefined layout. For example, a generic social security number is a nine-digit number, often written with two dashes, and it can appear in any construct: in a message among other words, in a Word file, in a PDF, and so on.
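To make the social security number example concrete, here is a minimal Python sketch of pattern-based detection in free text. The pattern and the function name `find_ssn_candidates` are illustrative assumptions, not part of any particular DLP product. The pattern only checks the shape of the number, nine digits with optional dashes in a 3-2-4 grouping.

```python
import re

# Nine digits, optionally grouped 3-2-4 with dashes. The lookarounds keep the
# pattern from matching inside a longer run of digits.
SSN_CANDIDATE = re.compile(r"(?<!\d)\d{3}-?\d{2}-?\d{4}(?!\d)")


def find_ssn_candidates(text: str) -> list[str]:
    """Return every substring that has the shape of a social security number."""
    return SSN_CANDIDATE.findall(text)


message = "Hi Bob, my SSN is 123-45-6789 and the order ID is 987654321."
print(find_ssn_candidates(message))  # ['123-45-6789', '987654321']
```

Note that the order ID is reported as well. A pattern alone cannot tell a real social security number from any other nine-digit number.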

In other words, structured data is organized and often formatted, while unstructured data is often raw data of various types.

Why Is Some Data Considered More Valuable?

Datasets labeled confidential, secret, or sensitive are considered more valuable than other datasets for several reasons. One is that such data is protected by privacy laws or industry regulations such as GDPR, CCPA, PCI-DSS, and HIPAA. Other types of data matter to an organization for business purposes: think of customer information, intellectual property, M&A documents, and financial data.

All of this information, if lost or, even worse, if it falls into the wrong hands, could cause reputational damage, fines for non-compliance, loss of competitive advantage, or even a lawsuit if it affects individuals protected by data privacy legislation.

What is Data Loss Prevention Used For?

Data Loss Prevention or DLP is a technology that helps organizations automatically locate all their sensitive information and protect it from intentional or unintentional loss and from theft.

DLP starts with automatic data discovery and classification. Once data is discovered, it can be protected. DLP can discover this data wherever it lives: stored in the organization’s sanctioned SaaS applications or public cloud storage, traversing its networks, or shared among its employees. It can then enable automatic or manual protective actions.

If a DLP solution can’t discover all sensitive data reliably, everywhere that data is, its outcomes matter very little: it offers only partial protection and, most importantly, generates large numbers of false positives. False positives interfere with standard business processes and make incident triage frustrating and time-consuming for the incident response team. A false positive can also block a data exchange between legitimate users that never needed to be stopped. A mediocre DLP solution that doesn’t offer accurate detection and creates too many false positives is not worth the investment.

To be highly efficient, and therefore best-in-class, data detection must leverage a variety of detection techniques to identify both structured and unstructured data.

DLP for Unstructured Data

First of all, let’s talk about unstructured data. A good DLP solution must rely on out-of-the-box, granularly customizable policies based on several hundred, if not thousands, of predefined data patterns. These allow it to identify standard data formats such as country-specific national IDs, banking numbers, passport numbers, tax IDs, localized address constructs, and other standard PII formats, as well as source code, secret keys, and even common blasphemous, homophobic, racist, or sexual language, along with many other types of commonly identifiable descriptive data.

It would also allow the use of compliance-related policies for regulations such as GDPR, CCPA, HIPAA, and many others. Descriptive content detection works well only if it is context-aware: only the text around a number makes it possible to accurately distinguish, for instance, an actual social security number from any generic nine-digit number.
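As a rough illustration of what context awareness means, the Python sketch below accepts a nine-digit candidate only when an SSN-related keyword appears nearby. The keyword list, the 40-character window, and the function name are all assumptions made for this example. Production-grade context analysis is far more sophisticated than a keyword check.

```python
import re

SSN_CANDIDATE = re.compile(r"(?<!\d)\d{3}-?\d{2}-?\d{4}(?!\d)")
# Hypothetical context keywords; a real policy would use a much richer model.
CONTEXT_KEYWORDS = ("ssn", "social security", "soc sec")
WINDOW = 40  # characters inspected on each side of a candidate (assumed value)


def find_ssns_with_context(text: str) -> list[str]:
    """Keep only nine-digit candidates that have an SSN keyword nearby."""
    lowered = text.lower()  # digits are unaffected by lowercasing
    hits = []
    for match in SSN_CANDIDATE.finditer(lowered):
        start = max(0, match.start() - WINDOW)
        end = min(len(lowered), match.end() + WINDOW)
        surrounding = lowered[start:end]
        if any(keyword in surrounding for keyword in CONTEXT_KEYWORDS):
            hits.append(match.group())
    return hits


print(find_ssns_with_context("Employee SSN: 123-45-6789"))
print(find_ssns_with_context("Tracking number 987654321 shipped today"))
```

The first call reports the number because "SSN" appears next to it. The second returns an empty list, which is exactly the kind of false positive that context is meant to suppress.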

False positives can be dramatically reduced if a DLP solution truly grasps the context around the content. This is a very important capability that many data protection vendors fail to provide, because it is by far the most challenging: developing highly sophisticated context-aware techniques requires significant engineering investment.

DLP for Structured Data

As for structured data, it’s very important for a DLP solution to be able to fingerprint large structured data sources with thousands of records, to detect matches quickly, and to make it easy to build detection policies that rely on combinations of fields, such as a name plus a credit card number.

A best-in-class DLP solution should scan many document and file types, and even extract information from images, such as pictures of IDs, passports, and credit cards, via advanced Optical Character Recognition (OCR) algorithms.

In addition, it should leverage Exact Data Matching (EDM) to detect and monitor specific sensitive data, and protect it from malicious exfiltration or loss. Designed to scale to very large data sets, EDM fingerprints an entire database of known personally identifiable information (PII), such as bank account numbers, credit card numbers, addresses, medical record numbers, and other personal information stored in a structured data source: a database, a directory server, or a structured data file such as a CSV file or spreadsheet. This data is then detected across the entire enterprise, as it traverses the network edge or is transferred by employees from remote locations.
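The idea behind fingerprinting and field combinations can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any specific EDM engine works. The normalization rules, the use of unsalted SHA-256, the field names `name` and `card_number`, and all function names are hypothetical, and a real system would likely use keyed hashing and much more robust tokenization.

```python
import hashlib
import re


def fingerprint(value: str) -> str:
    """Normalize a value (drop spaces and dashes, lowercase) and hash it."""
    normalized = re.sub(r"[\s-]", "", value).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_index(records: list[dict[str, str]]) -> list[dict[str, str]]:
    """Store only fingerprints, so the index never holds the raw values."""
    return [{k: fingerprint(v) for k, v in rec.items()} for rec in records]


def text_fingerprints(text: str) -> set[str]:
    """Fingerprint single words, adjacent word pairs, and long digit runs."""
    words = re.findall(r"[A-Za-z]+", text)
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    numbers = re.findall(r"\d[\d -]{6,}\d", text)
    return {fingerprint(t) for t in words + pairs + numbers}


def matching_records(text, index, required=("name", "card_number")) -> list[int]:
    """Return positions of records whose required fields all occur in the text."""
    seen = text_fingerprints(text)
    return [i for i, rec in enumerate(index)
            if all(rec[field] in seen for field in required)]


records = [
    {"name": "Jane Doe", "card_number": "4111-1111-1111-1111"},
    {"name": "John Roe", "card_number": "5500 0000 0000 0004"},
]
index = build_index(records)
print(matching_records("Please bill Jane Doe, card 4111 1111 1111 1111.", index))
print(matching_records("Jane Doe asked about card 5500 0000 0000 0004.", index))
```

The first message matches record 0 because both the name and the card number come from the same row. The second matches nothing, since the name and the card number belong to different records. Requiring a combination of fields is what keeps a policy like "name plus credit card number" precise.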

User-based document tagging and data classification is also an important factor. When available, DLP needs to be able to detect such classification, read the document properties and apply protective actions based on policy.

The Role of Machine Learning and AI

Machine learning and AI are the present and the future of data protection. Security practitioners are fed up with manually chasing a myriad of false incidents that require deep and time-consuming investigations.

So can DLP leverage user feedback to reliably detect true positives and learn continuously? Can it automatically understand the context and true meaning of a written conversation, weighing likelihood and tolerating misspellings?

Consistent protection is extremely important. Once you define these detection policies, you need to ensure that the DLP solution applies the same rules to detect sensitive data wherever data resides and flows.

In order to accomplish that, an enterprise DLP solution must deliver consistent policy from a single cloud-based engine, making it easy to define data protection policies and configurations in one place and apply them automatically to every location. By using a DLP solution with these capabilities, your IT security team will be spared from reinventing the wheel every time your organization adds new SaaS apps, new networks, and new users.

Learn more about Palo Alto Networks enterprise-grade DLP and its highly reliable detection.

