Data masking involves hiding or replacing identifying values in your datasets without reducing the value of the data to your business. It’s commonly used when managing personally identifiable information (PII), protected health information (PHI), payment card data covered by PCI DSS, and other sensitive data such as intellectual property (IP). Because it can be applied across structured, semi-structured, and unstructured data, it suits a wide variety of applications.
The result is that analysts can still extract insights, while the sensitive values themselves remain hidden.
This is a business-critical balancing act: enabling access while maintaining your security and compliance posture. It matters especially now, as data privacy and protection laws evolve across multiple global jurisdictions, with 71% of countries having already adopted legislation.
Further complexity comes from managing industry-specific requirements. For example, HIPAA regulations go beyond simply making sure identifiers are removed from files and locations. Within the HIPAA Security Rule, organizations also have to provide administrative, physical, and technical safeguards to protect data when it’s being stored and transmitted.
Meanwhile, changes to PCI DSS (PCI DSS 4.0) are expected to become effective in early 2024. These contain a series of enhancements around data control, access, and use. There are new requirements for processes that must be implemented and mechanisms for reporting incidents and vulnerabilities, documenting role and responsibility assignments, and annual scoping of environments.
Most governments do recognize that, while data must be protected, organizations must also be free to harness technology that can use data to improve outcomes.
“A major goal of the Security Rule is to protect the privacy of individuals’ health information while allowing covered entities to adopt new technologies to improve the quality and efficiency of patient care.”
US Department of Health and Human Services (HHS)
Data masking plays an essential part in helping organizations achieve this balance.
What are some types of data masking?
- Replacing or substituting
Data is swapped out with unrelated characters, words, or symbols. To maintain integrity, all values should be masked in the same way, for example by substituting the same number of characters.
- Randomizing or shuffling
Data is moved around and unrelated data is inserted. However, this randomness can lead to inconsistencies within data if masking takes place in a relational database.
- Averaging or aggregating
This technique is suitable for larger datasets. Individual values are replaced by an average value that still retains the overall insights. For example, a government might mask individuals’ medical results and instead publish averages for each city.
- Scrambling
Data values are swapped around, but in a way that keeps the data realistic and representative. This type of data obfuscation is particularly suitable for testing environments, keeping data usable yet untraceable.
- Hashing
Data is converted into a fixed-length string by a one-way function, so the output may be shorter than the original but cannot be read back directly. If you’re deciding between hashing and other forms of masking, remember that hashing is often less secure than other techniques: when the set of possible inputs is small, as with phone numbers or dates of birth, an attacker can hash every candidate and compare the results. This is why it is often used together with encryption.
- Pseudonymization
Personal identifiers are replaced by fictional or generic data. However, unlike other masking techniques, pseudonymization can be reversed; if users are given additional information, they might be able to deduce the true identity of individuals in the dataset.
“Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person.”
GDPR, Recital 26
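To make the techniques above more concrete, here is a minimal Python sketch of several of them using only the standard library. All function names, the `Pseudonymizer` class, and the sample key are hypothetical and chosen for illustration; a production platform would handle these steps very differently.

```python
import hashlib
import hmac
import random
import statistics


def substitute(value, mask_char="X"):
    """Replace every letter and digit with mask_char, keeping length and punctuation."""
    return "".join(mask_char if ch.isalnum() else ch for ch in value)


def shuffle_column(values, seed=None):
    """Return the same values in a random order, breaking the link to each row."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def aggregate_by_group(rows, group_key, value_key):
    """Replace individual values with the mean for each group (for example, per city)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {group: statistics.mean(vals) for group, vals in groups.items()}


def keyed_hash(value, secret_key):
    """HMAC-SHA256: a fixed-length, one-way digest that also depends on a secret key."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


class Pseudonymizer:
    """Swap identifiers for generic labels and keep the lookup table separately.

    Anyone holding the lookup table can reverse the mapping, which is why
    pseudonymized data is still treated as personal data.
    """

    def __init__(self):
        self._lookup = {}

    def pseudonym(self, identifier):
        if identifier not in self._lookup:
            self._lookup[identifier] = f"person-{len(self._lookup) + 1:04d}"
        return self._lookup[identifier]


# Example usage with made-up values.
DEMO_KEY = b"demo-key-not-for-production"  # hypothetical key for illustration only
print(substitute("4111-1111-1111-1111"))
print(keyed_hash("555-0100", DEMO_KEY))
```

The keyed hash is one way to address the weakness of plain hashing on small input sets: without the key, an attacker cannot simply hash every candidate value and compare.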
What are some data masking best practices?
The right platform should be able to manage data masking across the complete data lifecycle. This includes deploying rules for masking, configuring triggers and schedules, and monitoring performance trends across your operations and architecture.
A best practice framework should consider:
- Location
Identify the systems that are storing sensitive data, and where they’re located. If different sources are involved, such as Hadoop or Databricks, different permissions and policies should be defined.
- Components
Classify PII and sensitive information in your dataset. This relies on correct cataloging and classification, so data can be surfaced at the right time to the right people.
- Identities and privileges
Examine how designated users are currently accessing data to ensure they are compliant and that your policies are up to date with regulations.
- Thresholds and triggers
Define criteria and use cases for how and when masking should be deployed. This should also factor in operational demands related to access and resources, along with existing and incoming industry regulations.
- Scope
Assess your existing data usage and the context of requests in relation to the relevant regulations. Also identify use cases, so you can define which type of masking you will need to use to remain compliant.
- Proofs of concept
Assess whether your chosen masking techniques are suitable for your data and use cases (including your ability to adapt to evolving datasets, policies and legislation).
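As a rough illustration of the classification and criteria steps above, the sketch below labels columns by matching sample values against a few patterns, then looks up a masking technique for each label. The patterns, labels, and policy table are assumptions made for this example; real classification needs far broader coverage and human review.

```python
import re

# Hypothetical detection patterns, deliberately simplistic.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "card_number": re.compile(r"^(?:\d{4}[- ]?){3}\d{4}$"),
}

# Hypothetical policy: which masking technique applies to each label.
POLICY = {
    "email": "pseudonymize",
    "us_ssn": "substitute",
    "card_number": "substitute",
}


def classify_column(samples):
    """Return the first label whose pattern matches every non-empty sample, else None."""
    values = [s for s in samples if s]
    if not values:
        return None
    for label, pattern in PATTERNS.items():
        if all(pattern.match(v) for v in values):
            return label
    return None


def build_masking_plan(table):
    """Map each column name to a masking technique, or 'none' if unclassified."""
    plan = {}
    for column, samples in table.items():
        label = classify_column(samples)
        plan[column] = POLICY.get(label, "none")
    return plan


# Example usage with made-up sample values per column.
sample_table = {
    "contact": ["a@example.com", "b@example.org"],
    "ssn": ["123-45-6789"],
    "notes": ["called on Tuesday"],
}
print(build_masking_plan(sample_table))
```

Keeping the policy in a separate table from the detection logic makes it easier to update rules as regulations change, without touching the classifier.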
It’s also important to consider whether you are masking dynamic or static data.
Dynamic data masking
Masking happens in real-time, when users request data access. This can be granular, down to the column level, rather than across an entire dataset. As a result, you can be more agile and fewer resources are required. This also makes it easier to scale processes and update policies.
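A minimal sketch of the idea, assuming a hypothetical rule table that lists which roles may see each column in the clear. Masking is applied when a row is read, and the stored record is never changed.

```python
# Hypothetical column-level rules: roles listed here see the clear value.
# Columns without a rule are treated as unrestricted in this sketch; a
# stricter design would mask them by default.
COLUMN_RULES = {
    "name": {"clinician"},
    "ssn": {"compliance"},
    "diagnosis": {"clinician", "analyst"},
}


def read_row(row, role):
    """Return a copy of the row masked for `role`; the stored row is not modified."""
    result = {}
    for column, value in row.items():
        allowed_roles = COLUMN_RULES.get(column)
        if allowed_roles is None or role in allowed_roles:
            result[column] = value
        else:
            result[column] = "*" * len(str(value))
    return result


# Example usage with a made-up record.
record = {"name": "Ada Lovelace", "ssn": "123-45-6789", "diagnosis": "asthma", "visit_id": 17}
print(read_row(record, "analyst"))
print(read_row(record, "clinician"))
```

Because the rules live in one table, updating a policy means changing a single entry instead of regenerating a masked dataset.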
Static data masking
A duplicated and masked version of the entire dataset is created. This new database can then be analyzed and shared. Bear in mind that static data masking makes it much harder to spot duplicated or erroneous data points.
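By contrast, a static approach can be sketched as a one-off copy step. The example below is illustrative only: `find_duplicates` and `build_masked_copy` are hypothetical helpers, and the duplicate check runs on the source data because such problems are harder to spot once values are masked.

```python
import copy
from collections import Counter


def find_duplicates(rows, key):
    """Return values of `key` that appear more than once, for review before masking."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def build_masked_copy(rows, maskers):
    """Return a new list of rows with `maskers` applied; the source rows are untouched."""
    masked_rows = copy.deepcopy(rows)
    for row in masked_rows:
        for column, masker in maskers.items():
            if column in row:
                row[column] = masker(row[column])
    return masked_rows


# Example usage with made-up data.
source = [
    {"email": "a@example.com", "plan": "pro"},
    {"email": "a@example.com", "plan": "free"},
    {"email": "b@example.com", "plan": "pro"},
]
print(find_duplicates(source, "email"))  # review these before masking
masked = build_masked_copy(source, {"email": lambda v: "hidden@example.invalid"})
print(masked)
```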
Other considerations for data masking
Preservation
You’ll need to check that data masking doesn’t disrupt format-dependent data or strings. For example, dates of birth should still be recorded in the same format, with the day, month, and year in the same order.
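One way to preserve format, sketched here under simple assumptions, is to parse the value, alter it, and write it back with the same format string. The helper names and the default shift window are hypothetical.

```python
import random
from datetime import datetime, timedelta


def mask_date(value, fmt="%Y-%m-%d", max_shift_days=365, rng=None):
    """Shift a date by a random, non-zero number of days and keep the same format."""
    rng = rng or random.Random()
    original = datetime.strptime(value, fmt)
    shift = rng.randint(1, max_shift_days) * rng.choice((-1, 1))
    return (original + timedelta(days=shift)).strftime(fmt)


def mask_digits(value, rng=None):
    """Replace each digit with a random digit, leaving separators and layout intact."""
    rng = rng or random.Random()
    return "".join(str(rng.randint(0, 9)) if ch.isdigit() else ch for ch in value)


# Example usage with made-up values; output varies from run to run.
print(mask_date("1984-07-23"))
print(mask_date("23/07/1984", fmt="%d/%m/%Y"))
print(mask_digits("+44 20 7946 0958"))
```

Downstream systems that validate or parse these fields should keep working, because the masked values have the same shape as the originals.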
Encryption
Many data masking techniques are irreversible. However, if you might need to recover the original values later, you could consider encryption.
Encryption is often more secure, but it can make data less usable. When choosing between encryption and masking, you should consider whether security or utility is more important to your specific project.
Integration
Data masking isn’t a standalone solution. It should be part of an integrated data access strategy that includes control based on roles (RBAC), attributes (ABAC), or policies (PBAC) in order to comply with some regulations and add an extra layer of protection.