May 8, 2023

Data masking: Best practices for improving your security and compliance posture

Data masking involves removing or obscuring identifying values in your datasets without reducing the value of the data to your business. It’s commonly used for personally identifiable information (PII), protected health information (PHI), payment card data covered by PCI DSS, and intellectual property (IP). It suits a wide variety of applications because it can be applied across structured, semi-structured, and unstructured data.

The result is that analysts can still extract insights, while the data itself remains hidden. 

This is a business-critical balancing act: enabling access while maintaining your security and compliance posture. It matters especially now, as data privacy and protection laws evolve across multiple global jurisdictions, with 71% of countries having already adopted legislation.

Further complexity comes from managing industry-specific requirements. For example, HIPAA regulations go beyond simply making sure identifiers are removed from files and locations. Within the HIPAA Security Rule, organizations also have to provide administrative, physical, and technical safeguards to protect data when it’s being stored and transmitted. 

Meanwhile, PCI DSS 4.0 replaces version 3.2.1, which is due to be retired on March 31, 2024. The new version contains a series of enhancements around data control, access, and use. It adds requirements for processes that must be implemented, mechanisms for reporting incidents and vulnerabilities, documented role and responsibility assignments, and annual scoping of environments.

Most governments do recognize that, while data must be protected, organizations must also be free to harness technology that can use data to improve outcomes.

“A major goal of the Security Rule is to protect the privacy of individuals’ health information while allowing covered entities to adopt new technologies to improve the quality and efficiency of patient care.”

US Department of Health and Human Services (HHS)

Data masking plays an essential part in helping organizations achieve this balance.

What are some types of data masking?

  • Replacing or substituting
    Data is swapped out with unrelated characters, words, or symbols. To maintain integrity, all values should be masked in the same way, for example by substituting the same number of characters.
  • Randomizing or shuffling
    Data is moved around and unrelated data is inserted. However, this randomness can lead to inconsistencies within data if masking takes place in a relational database. 
  • Averaging or aggregating
    This technique is suitable for larger datasets. Individual values are replaced by an average value that still retains the overall insights. For example, a government might mask individuals’ medical results and publish averages for each city instead.
  • Scrambling
    Data values are swapped around, but in a way that keeps the data realistic and representative. This type of data obfuscation is particularly suitable for testing environments, keeping data usable yet untraceable.
  • Hashing
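    Data is converted into a fixed-length string (a digest) that cannot be directly converted back to the original. If you’re deciding between other forms of masking vs hashing, remember that hashing on its own is often less secure than other techniques: predictable values, such as phone numbers, can be recovered by hashing every candidate and comparing the results. This is why it is often used together with a secret key or salt, or with encryption.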
  • Pseudonymization
    Personal identifiers are replaced by fictional or generic data. However, unlike other masking techniques, pseudonymization can be reversed: users who are given additional information might be able to deduce the true identity of individuals in the dataset.
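
Three of the techniques above (substituting, shuffling, and aggregating) can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function names, the sample records, and the fixed shuffle seed are all assumptions made for the example.

```python
import random
from statistics import mean

def substitute(value: str, mask_char: str = "X") -> str:
    """Replace every letter and digit with mask_char, keeping length and punctuation."""
    return "".join(mask_char if ch.isalnum() else ch for ch in value)

def shuffle_column(values: list, seed: int = 0) -> list:
    """Return the same values in a randomly permuted order (seeded for repeatability)."""
    shuffled = values[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled

def aggregate_by(rows: list, group_key: str, value_key: str) -> dict:
    """Replace individual values with the average for each group."""
    groups: dict = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {group: mean(vals) for group, vals in groups.items()}

# Hypothetical sample records, invented for illustration
records = [
    {"name": "Ana Ruiz", "city": "Leeds", "result": 4.0},
    {"name": "Ben Okoro", "city": "Leeds", "result": 6.0},
    {"name": "Cy Tanaka", "city": "York", "result": 5.0},
]

masked_names = [substitute(r["name"]) for r in records]
shuffled_results = shuffle_column([r["result"] for r in records])
city_averages = aggregate_by(records, "city", "result")
```

Note that shuffling a single column independently is exactly what can break consistency in a relational database: the shuffled value no longer lines up with the same person's rows in other tables.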

“Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person.”

GDPR, Recital 26
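
The sketch below shows why pseudonymized data can still count as identifiable. It assumes pseudonyms are produced with a keyed hash (HMAC-SHA256 from the Python standard library); the key, the `user_` prefix, and the email addresses are hypothetical. Anyone who holds the key plus a list of candidate identifiers (the "additional information") can rebuild the mapping.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would be stored separately from the dataset
SECRET_KEY = b"example-key-kept-separately"

def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Map an identifier to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user_" + digest[:12]

patients = ["alice@example.com", "bob@example.com", "alice@example.com"]
pseudonyms = [pseudonymize(p) for p in patients]

# With the key and a list of candidates, the pseudonyms can be linked back
candidates = ["alice@example.com", "bob@example.com"]
lookup = {pseudonymize(c): c for c in candidates}
reidentified = [lookup[p] for p in pseudonyms]
```

The same input always yields the same pseudonym, which keeps joins and counts working, but it is also what makes re-identification possible once the key leaks.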

What are some data masking best practices?

The right platform should be able to manage data masking across the complete data lifecycle. This includes deploying rules for masking, configuring triggers and schedules, and monitoring performance trends across your operations and architecture.

A best practice framework should consider: 

  • Location
    Identify the systems that are storing sensitive data, and where they’re located. If different sources are involved, such as Hadoop or Databricks, different permissions and policies should be defined.
  • Components
    Classify PII and sensitive information in your dataset. This relies on correct cataloging and classification, so data can be surfaced at the right time to the right people.
  • Identities and privileges
    Examine how designated users are currently accessing data to ensure they are compliant and that your policies are up to date with regulations.
  • Thresholds and triggers
    Define criteria and use cases for how and when masking should be deployed. This should also factor in operational demands related to access and resources, along with existing and incoming industry regulations.
  • Scope
    Assess your existing data usage and the context of requests in relation to the relevant regulations. Also identify use cases, so you can define which type of masking you will need to use to remain compliant. 
  • Proofs of concept
    Assess whether your chosen masking techniques are suitable for your data and use cases (including your ability to adapt to evolving datasets, policies and legislation).
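
As a rough illustration of the "Components" step, the sketch below tags columns by matching sample values against simple patterns. The two patterns, the 80% threshold, and the column names are assumptions chosen for the example; real classification tools use far richer rules and metadata.

```python
import re
from typing import Optional

# Hypothetical patterns; a real catalog would cover many more data types
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def classify_column(values: list, threshold: float = 0.8) -> Optional[str]:
    """Return the first label whose pattern matches at least `threshold` of non-empty values."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.match(v))
        if hits / len(non_empty) >= threshold:
            return label
    return None

table = {
    "contact": ["ana@example.com", "ben@example.org", ""],
    "tax_id": ["123-45-6789", "987-65-4321"],
    "notes": ["called on Monday", "no answer"],
}
catalog = {column: classify_column(values) for column, values in table.items()}
```

Columns tagged this way can then be mapped to a masking technique, while untagged columns (here, `notes`) still need manual review because free text may contain identifiers the patterns miss.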

It’s also important to consider whether you are masking dynamic or static data. 

Dynamic data masking

Masking happens in real-time, when users request data access. This can be granular, down to the column level, rather than across an entire dataset. As a result, you can be more agile and fewer resources are required. This also makes it easier to scale processes and update policies. 
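
A minimal sketch of column-level dynamic masking follows, assuming a hypothetical policy that maps each sensitive column to the roles allowed to see it in clear text. The stored row is never modified; masking is applied to the view returned for each request.

```python
# Hypothetical policy: column name -> roles allowed to see clear values
COLUMN_POLICY = {
    "email": {"compliance"},
    "diagnosis": {"clinician", "compliance"},
}

def mask_value(value: str) -> str:
    """Hide a value while keeping its length."""
    return "*" * len(value)

def read_row(row: dict, role: str) -> dict:
    """Build the view of a row for one request, masking columns the role may not see."""
    view = {}
    for column, value in row.items():
        allowed = COLUMN_POLICY.get(column)
        if allowed is None or role in allowed:
            view[column] = value  # column is unrestricted, or the role is permitted
        else:
            view[column] = mask_value(str(value))
    return view

stored = {"patient_id": 17, "email": "ana@example.com", "diagnosis": "asthma"}
analyst_view = read_row(stored, "analyst")
clinician_view = read_row(stored, "clinician")
```

Because the policy lives in one place, changing who can see a column means editing `COLUMN_POLICY` rather than regenerating any data.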

Static data masking

A duplicated and masked version of the entire dataset is created. This new database can then be analyzed and shared. Bear in mind that static data masking makes it much harder to spot duplicated or erroneous data points.
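
By contrast, a static approach can be sketched as building a masked copy once and handing out only the copy. The function names and sample rows below are illustrative assumptions.

```python
import copy

def mask_email(email: str) -> str:
    """Keep the domain, hide the local part."""
    local, _, domain = email.partition("@")
    return "x" * len(local) + "@" + domain

def build_masked_copy(dataset: list, maskers: dict) -> list:
    """Deep-copy the dataset and apply a masking function to each configured column."""
    masked = copy.deepcopy(dataset)
    for row in masked:
        for column, masker in maskers.items():
            if column in row:
                row[column] = masker(row[column])
    return masked

production = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 2, "email": "ben@example.org"},
]
test_copy = build_masked_copy(production, {"email": mask_email})
```

The copy is a snapshot: any later change to the source data or to the masking rules means rebuilding it.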

Other considerations for data masking

Preservation

You’ll need to check that data masking doesn’t disrupt format-dependent data or strings. For example, a masked date of birth should keep the same layout as the original, with day, month, and year in the same order.
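
One way to preserve format, sketched here under assumptions, is to parse the value, alter it, and write it back with the same format string. The example shifts a date by a random number of days; the `%Y-%m-%d` format, the one-year window, and the seed are illustrative choices, not a recommendation.

```python
import random
from datetime import datetime, timedelta

def mask_date(value: str, fmt: str = "%Y-%m-%d", max_shift_days: int = 365, seed=None) -> str:
    """Shift a date by a random number of days and return it in the original format."""
    rng = random.Random(seed)
    original = datetime.strptime(value, fmt)
    shift = rng.randint(-max_shift_days, max_shift_days)
    return (original + timedelta(days=shift)).strftime(fmt)

# Downstream systems that expect YYYY-MM-DD can still parse the masked value
masked_dob = mask_date("1984-07-19", seed=42)
parsed_back = datetime.strptime(masked_dob, "%Y-%m-%d")
```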

Encryption

Many data masking techniques are irreversible. However, if you might want to revert the data to its original format later, you could consider encryption. 

Encryption is often more secure, but it can make data less usable. When choosing between encryption and masking, you should consider whether security or utility is more important to your specific project.

Integration

Data masking isn’t a standalone solution. It should be part of an integrated data access strategy that includes control based on roles (RBAC), attributes (ABAC), or policies (PBAC) in order to comply with some regulations and add an extra layer of protection. 
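
To show how masking can sit inside an access decision, here is a hedged sketch that combines a role check (RBAC) with an attribute check (ABAC). The roles, the `purpose` attribute, the `pii` tag, and the three outcomes are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    role: str        # used for the role-based check
    purpose: str     # attribute used for the attribute-based check
    column_tag: str  # classification of the requested column, e.g. "pii"

def decide(request: Request) -> str:
    """Return 'deny', 'masked', or 'clear' for a single column request."""
    if request.role not in {"analyst", "compliance"}:
        return "deny"  # unknown roles get no access at all
    if request.column_tag != "pii":
        return "clear"  # non-sensitive columns are returned as-is
    if request.role == "compliance" and request.purpose == "audit":
        return "clear"  # sensitive data in clear only for an approved purpose
    return "masked"

decisions = [
    decide(Request("analyst", "reporting", "pii")),
    decide(Request("compliance", "audit", "pii")),
    decide(Request("guest", "reporting", "public")),
]
```

Masking is the middle outcome here: it lets a permitted user keep working with a column without exposing the underlying values.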
