Introduction
This checklist will assist in-house counsel, private practitioners, and compliance personnel who are responsible for overseeing or managing the de-identification of data to be entered into artificial intelligence (AI) systems. By outlining the main steps and considerations involved, it helps practitioners guide their companies’ decision-making throughout the de-identification process.
This checklist addresses the following steps:
- Determining whether any data must, or should, be de-identified
- Choosing a mechanism for de-identification
- Verifying de-identification
- Understanding the risk of re-identification
This checklist can be used in conjunction with the How-to guides Understanding AI-driven risks, Understanding the risk of negligence claims when using AI, Risks and liabilities of algorithmic bias, and How to determine and apply relevant US privacy laws to your organization; the Checklists Steps to mitigate risks associated with AI use in business and Understanding privacy laws in the US; and the Quick views Overview of AI in business and Key AI terms.
Step 1 – Determining whether any data must, or should, be de-identified
| No. | Requirement |
| 1.1 | What kind of data will be used? |
| 1.2 | What are the implications of applicable laws/regulations? |
| 1.3 | Is de-identification required by contract? |
| 1.4 | Do other considerations make de-identification advisable? |
Step 2 – Choosing a mechanism for de-identification
| No. | Requirement |
| 2.1 | Determine whether a specific method needs to be adopted to fall within a safe harbor |
| 2.2 | Consider whether the data should be subject to anonymization |
| 2.3 | Consider whether the data should be subject to pseudonymization |
Step 3 – Verifying de-identification
| No. | Requirement |
| 3.1 | Choose a verification methodology |
| 3.2 | Review and follow up on results of verification |
Step 4 – Understanding the risk of re-identification
| No. | Requirement |
| 4.1 | Understand the risk of re-identification |
| 4.2 | Protect yourself from external re-identification |
Explanatory Notes
General notes
There is no single definition of AI or of AI systems, but the latter term generally refers to machine-based systems that use certain inputs, provided either by humans or machines, to produce a wide range of outputs. These can include predictions, recommendations, or decisions. An example is a credit-scoring system that produces credit scores for specific objectives, such as deciding whether to grant someone a loan, based on data about applicants and a set of rules determined by those developing the system. The output can also consist of texts and images, generated based on prompts entered by users. ChatGPT has recently dominated global headlines for its ability to generate an extremely broad and diverse range of these kinds of outputs.
AI systems can be external or internal. External systems refer to systems developed and overseen by another company. Internal systems refer to proprietary systems created in-house. Tailored to individual needs and not available for use by anyone outside of the company, these tend to focus on more specific business tasks.
While AI systems are incredibly diverse, a common thread is that they all rely on large amounts of data to perform their functions. Entering data into AI systems also inevitably raises privacy issues, regardless of whether these are internal or external systems, and regardless of whether you are entering data about consumers, your employees, or others.
Using external systems requires providing data to a system that is not part of one’s own company. However, a person who may have agreed to provide certain data to a specific company has not necessarily agreed to provide that data to anyone else. Moreover, the external systems using the data may produce an output that makes it possible for third parties to extract the original data, or the external system may experience a data breach. Even when using an internal system, privacy issues may arise, especially if the results generated by the AI system are shared with others internally or externally.
Depending on the data involved, such usage may be prohibited by laws or by contract. Even if not formally prohibited, such data sharing can be a public relations disaster, particularly if the individuals whose data was shared feel they were not clearly informed that their data may be entered into an AI system.
De-identifying the data at issue before using it as an input for an AI system can at least minimize such problems. De-identification refers to a wide range of processes and methods that can be used to make it less likely that certain data can be tied to specific individuals. It can be an important step to protect the privacy of data subjects.
Step 1 – Determining whether any data must, or should, be de-identified
This checklist assumes that the company has already made the decision to use an AI system and to enter data into that system. As the information presented below underscores, this is not a decision that should be made lightly.
Moreover, before you look more closely at whether de-identification is required or advisable, help reduce potential issues by making sure your company is only using data that it absolutely cannot do without. In other words, ensure that the scope of the data used has proactively been limited to truly essential data.
1.1 What kind of data will be used?
You must first identify what kind of data is being entered into the AI system. This is because the rules and considerations that apply will vary depending on the nature of the data at issue. Determining how to categorize the data you intend to use is not a simple step, given that specific laws and regulations may define the various categories in different ways.
For example, you may want to determine whether your data would be considered ‘sensitive.’ But there is no single definition of ‘sensitive’ data under US law. This is because there is no single, overarching privacy law in the US. Instead, several specific federal laws apply to certain industries or to select data, such as the Health Insurance Portability and Accountability Act (HIPAA) and the Children's Online Privacy Protection Act (COPPA). In addition, states have their own diverse privacy laws. Determining which of these laws apply to your situation, and by extension, what data definitions apply, is a challenging task.
Nonetheless, there are some types of data that are clearly likely to entail more legal scrutiny and regulation than others. A good first step would therefore be to check whether, broadly speaking, your data is likely to fall into one of the following categories:
- Personally identifiable information (PII) generally refers to any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means. Examples include assigned numbers, such as social security or driver's license numbers, as well as patient identification, financial account, and credit card numbers. Such information also includes addresses, phone numbers, and certain biometric records (a photo, an X-ray, a fingerprint, or even one’s facial geometry or voice signature). It can also include information that, when combined with other information, can be used to identify a specific individual, such as the date or place of their birth, their race or religion, or employment or education information.
- Medical/health information is especially heavily protected. This refers to data created or received by a wide range of actors - including healthcare providers, health plans, public health authorities, employers, life insurers, schools/universities, and healthcare clearinghouses - that relates to someone’s physical or mental health/condition, the provision of healthcare to someone, or payments for healthcare provided to an individual.
- Financial information is also subject to specific regulation and needs to be treated particularly carefully. It covers information about a person’s personal finances, including bank account balances, payment histories, and information contained in loan or credit card applications.
1.2 What are the implications of applicable laws/regulations?
Once you have properly identified the kind of data involved, you need to determine which laws or regulations apply, and consider whether these mandate de-identification, or, even if they do not, whether de-identification is advisable under the given legal regime. Again, the lack of a single overarching privacy law makes this a challenging endeavor and may warrant hiring specialized counsel.
For example, your company may be subject to HIPAA, and may have decided to enter customer health information into an AI system. If the data at issue qualifies as protected health information (PHI) - individually identifiable health information that is transmitted or maintained in electronic media or in any other form or medium - HIPAA and its Privacy Rule may require individual authorization from the data subjects, which can be costly and time-consuming.
However, these laws and rules will not apply if the data is de-identified in certain ways spelled out in the HIPAA Privacy Rule. For example, under the rule’s safe harbor, data will not be considered identifiable if the 18 identifiers listed in the rule are removed. Similarly, data will not be considered identifiable where an expert uses ‘generally accepted statistical and scientific principles and methods’ to leave only a ‘very small’ risk that the information could be used to identify the subject of the information. While de-identification is not legally mandated, it may thus be highly advisable. For further information including the list of identifiers, see 45 CFR section 164.514(b)(1), (2).
Meanwhile, the Gramm-Leach-Bliley Act (GLBA), which applies to companies that offer financial products or services, exempts from its definition of personally identifiable financial information, and so from its coverage, information that does not identify a consumer, such as ‘blind data’ that does not contain personal identifiers such as account numbers, names, or addresses (see 16 CFR section 313.3(o)(2)(ii)(B)). Again, for companies covered by the law, de-identification can bring significant legal benefits.
Notably, however, the US Department of Justice recently established a Data Security Program pursuant to a final rule, which went into effect on April 8, 2025. That rule, unlike most privacy and data broker laws, expressly covers de-identified data.
You will also need to consider relevant state privacy/data protection laws. Since these vary greatly, it is important to individually examine the laws in all state jurisdictions relevant to your business. In addition, many states are currently introducing new legislation, meaning the applicable legal frameworks are in a state of flux. For a broad overview of these state laws, see Q&A: US Data Protection and Privacy (state-by-state).
A look at Colorado, for example, makes clear why it is important to consider the legal implications. Obligations and limitations imposed on entities subject to the state’s new privacy law (in effect since mid-2023) vary greatly depending on the type of data at issue. For example, ‘sensitive’ data is defined as including, among other things, personal data that reveals someone’s racial or ethnic origin, religious beliefs, a mental or physical health condition or diagnosis, sex life or sexual orientation, or citizenship or citizenship status. Where ‘sensitive’ data is at issue, a covered entity must obtain consent before processing it. By contrast, data that has been de-identified does not even qualify as ‘personal data’ under the law; the applicable requirements merely involve exercising ‘reasonable oversight’ with respect to compliance with any applicable contractual commitments. Again, while de-identification is not mandated, the legal framework at issue makes it beneficial to pursue it.
1.3 Is de-identification required by contract?
In addition to checking relevant laws, review applicable contracts to determine whether these require de-identification (or prohibit the use of data as inputs to AI systems). Such contractual obligations may stem from your company’s agreements with the data subjects themselves (eg, de-identification may have been a condition of their agreeing to have their data collected). But contractual requirements may also stem from your company’s agreements with a data broker or seller.
1.4 Do other considerations make de-identification advisable?
Your review may indicate that de-identification is not mandated by law or by contract. But this should not end your analysis. Whether couched as an ethical matter or as a smart business decision, de-identification can be advisable even where it is arguably optional.
Consumers’ awareness of privacy issues is rising and any inadvertent release of personal data can result in a public relations disaster. By contrast, proactively showing a commitment to de-identification sends a signal that your company is pursuing its own business needs without compromising individuals’ personal privacy. This can enhance trust in your company among both your customers and your employees. For example, you may want to formalize your commitment to de-identification by including it in your company’s research protocols, in which you outline the background, goals, design, and methodology of your research.
An alternative, or additional, approach is to rely on informed consent. This involves clearly and thoroughly explaining how you collect data and what you intend to do with it. Depending on the circumstances, your decision to use personal data in a specific AI system, de-identified or not, could involve explicitly asking the data subjects to consent to that specific use.
Step 2 – Choosing a mechanism for de-identification
Once you have decided to de-identify a set of data, you will need to decide which exact mechanism you will use for this purpose. As noted, de-identification refers to a wide range of processes and methods that can be used to make it less likely that data can be tied to specific individuals.
Deciding which method to use usually involves a trade-off between privacy and utility, given that data that has been de-identified to a larger degree may be less useful as an input for AI systems. However, it will also depend on the definition of de-identification used by the laws and regulations that apply to your specific industry and company, so it is advisable to start with a review of those.
2.1 Determine whether a specific method needs to be adopted to fall within a safe harbor
In some cases, there may be little choice in terms of which method of de-identification to apply. This is because the applicable laws may define de-identification in a specific way, meaning you will only obtain the benefits that stem from de-identifying your data set (ie, exempting that data from the laws’ requirements) if your approach meets the specific definition.
HIPAA is the most obvious example of this. As noted, under the HIPAA Privacy Rule’s safe harbor, health information will no longer be considered protected if it has been de-identified in the way set out in the regulation; specifically, if the 18 identifiers listed in the regulation are removed (see 45 CFR section 164.514(b)(1), (2)). The identifiers include information such as names and email addresses as well as full-face photographs. For more information on this approach, see the US Department of Health and Human Services, Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.
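By way of illustration only, the suppression portion of this approach might look like the following minimal Python sketch. The record layout and the HIPAA_IDENTIFIER_FIELDS mapping are hypothetical; genuinely satisfying the safe harbor requires addressing all 18 identifier categories (including, eg, dates and small geographic units), not merely deleting a few columns.

```python
# Hypothetical mapping of this data set's column names to some of the 18
# identifier categories listed in 45 CFR section 164.514(b)(2). A real
# mapping must cover all 18 categories as they appear in your data.
HIPAA_IDENTIFIER_FIELDS = {"name", "email", "phone", "ssn", "mrn", "full_face_photo"}

def strip_identifier_fields(record: dict) -> dict:
    """Drop fields mapped to HIPAA identifier categories; keep the rest."""
    return {k: v for k, v in record.items() if k not in HIPAA_IDENTIFIER_FIELDS}

patient = {"name": "Jane Doe", "email": "jane@example.com", "age": 42, "diagnosis": "J45"}
print(strip_identifier_fields(patient))  # {'age': 42, 'diagnosis': 'J45'}
```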
2.2 Consider whether the data should be subject to anonymization
Anonymization refers to a process that removes the association between the identifying dataset and the data subject, meaning the data can no longer be linked to a specific individual. Techniques used to achieve this vary (see the sketch after this list) and include:
- removing identifying values from data (‘suppression’), such as deleting any references to age (possible only where this information is not relevant to the analysis);
- taking identifying values and making them less specific (‘generalization’), such as replacing someone’s age with an age range; and
- introducing ‘statistical noise,’ such as by adding or subtracting a certain value from individual entries in a data set, applying a set algorithmic formula to the entries and using those results rather than the actual entries, or swapping some of the data contained within individual entries.
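As a rough illustration of these three techniques, consider the following Python sketch. The record and field names are hypothetical, and a production implementation would calibrate the noise carefully (for example, via differential privacy methods) and verify that the perturbed data remains useful.

```python
import random

# Illustrative record; all field names are hypothetical.
record = {"name": "Jane Doe", "age": 42, "city": "Denver", "salary": 83000}

def anonymize(record):
    out = dict(record)
    del out["name"]                               # suppression: delete the identifying value
    out["age"] = f"{out['age'] // 10 * 10}s"      # generalization: exact age -> decade range
    out["salary"] += random.randint(-2000, 2000)  # statistical noise: perturb the value
    return out

print(anonymize(record))  # eg {'age': '40s', 'city': 'Denver', 'salary': 84213}
```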
2.2.1 What are the top data anonymization techniques?
One example of a specific anonymization technique is k-anonymization. This technique is commonly used in the US, particularly with respect to health data. It groups together similar individuals and either generalizes or suppresses the data fields that contain identifying information. A data set is considered k-anonymous if every combination of identifying attributes it contains is shared by at least k individuals, meaning no single individual can be picked out via those attributes. The level of anonymization reached via this process is expressed numerically as the value of k. If k=3, this means that within the data set, there are at least three separate records containing any particular combination of, for example, a set age and location. The higher the k value, the greater the extent of de-identification. However, the usefulness of the data may also be correspondingly lower.
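The following minimal Python sketch illustrates the concept under assumed data: it generalizes two hypothetical quasi-identifiers (age and ZIP code) and then computes k as the size of the smallest group sharing a combination of the generalized attributes. Real k-anonymization tooling would also apply suppression and optimize the generalization hierarchy rather than hard-coding it.

```python
from collections import Counter

# Hypothetical records; 'age' and 'zip' serve as the quasi-identifiers here.
records = [
    {"age": 34, "zip": "80202", "diagnosis": "A"},
    {"age": 36, "zip": "80206", "diagnosis": "B"},
    {"age": 38, "zip": "80209", "diagnosis": "A"},
    {"age": 52, "zip": "80301", "diagnosis": "C"},
    {"age": 55, "zip": "80305", "diagnosis": "B"},
]

def generalize(record):
    """Generalize quasi-identifiers: exact age -> decade band, ZIP -> 3-digit prefix."""
    decade = record["age"] // 10 * 10
    return (f"{decade}-{decade + 9}", record["zip"][:3] + "**")

def k_value(records):
    """Return k: the size of the smallest group sharing one quasi-identifier combination."""
    groups = Counter(generalize(r) for r in records)
    return min(groups.values())

print(k_value(records))  # 2 -> every attribute combination here covers at least two records
```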
Unlike pseudonymization, anonymization is considered irreversible. However (as addressed in more detail in step 4.1), analysts note that anonymization may be a misnomer, given that multiple studies have shown that it is possible to re-identify data presumed to have been anonymized – though often only with considerable effort and access to additional data. Moreover, it can be tricky to apply anonymization techniques while making sure the resulting data is indeed still useful. Deciding whether to use such a technique requires a careful balancing of these considerations.
2.3 Consider whether the data should be subject to pseudonymization
Pseudonymization generally refers to a de-identification technique that replaces an identifier for a data subject with a pseudonym or placeholder, to hide their identity (in the healthcare context, hiding a person's identity is often referred to as ‘coding’). The identifiers that are replaced are typically ‘direct identifiers’; that is, identifiers unique to a specific individual (eg, the person's driver's license number or Social Security number). ‘Indirect identifiers,’ on the other hand, are identifiers that do not refer to one specific individual, such as their eye color or race.
The pseudonym used is unique to the particular data subject, but cannot be linked to that subject without access to separately held information that links the pseudonym to the specific person. Examples of the kinds of masking functions used include encryption, which involves using an algorithm to take certain text and turn it into values that appear random without access to the decryption key. Pseudonymization can become anonymization if the separately held key that indicates the relationship between a pseudonym and a data subject is destroyed.
Pseudonymization reduces the risk of re-identification, but does not completely eliminate that risk. For example, when it is possible to access the information that clarifies the relationship between a pseudonym and a data subject, pseudonymization can be reversed.
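For illustration, one common masking approach is a keyed hash; the Python sketch below uses the standard library's hmac module. This is only one of several possible masking functions (the text above mentions encryption), and the key handling shown (a key generated in memory) is a placeholder assumption; in practice the key would be generated and stored separately, eg, in a key-management system.

```python
import hashlib
import hmac
import secrets

# Placeholder: in practice the key lives apart from the pseudonymized data
# set, eg, in a key-management system. Destroying the key can turn
# pseudonymization into anonymization.
SECRET_KEY = secrets.token_bytes(32)

def pseudonymize(direct_identifier: str) -> str:
    """Replace a direct identifier (eg, an SSN) with a keyed-hash pseudonym.

    The same identifier always yields the same pseudonym, so records remain
    linkable within the data set, but without the key the mapping cannot be
    reconstructed by an outsider.
    """
    return hmac.new(SECRET_KEY, direct_identifier.encode(), hashlib.sha256).hexdigest()

record = {"ssn": "123-45-6789", "diagnosis": "A"}
record["ssn"] = pseudonymize(record["ssn"])
print(record)  # the SSN field now holds an opaque 64-character hex pseudonym
```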
Step 3 – Verifying de-identification
Before you enter a data set into an AI system, you need to verify that the de-identification process you have chosen will adequately perform its intended functions.
Verification of de-identification typically focuses on two specific criteria: the completeness of the de-identification and the continued utility of the data. In other words, you need to verify that the method applied sufficiently protects the privacy of the data subjects whose data is included in the data set to be entered into the AI system and that the resulting data set nonetheless remains usable for the intended AI application. What kinds of results satisfy these two criteria will depend very much on the specific context in which you are operating.
3.1 Choose a verification methodology
Verification of your chosen de-identification method should occur before you actually enter your data set into the AI application. Typically, this means completing a test run of the process on a representative sample from your data set, or using synthetic (dummy) data. You must be sure to use an adequate sample size for this purpose.
Like choosing a de-identification process, deciding which methodology to use to verify that process requires considerable expertise. The verification methodology options will depend on the chosen de-identification approach. Input from individuals with statistical expertise will likely be indispensable. Dedicated privacy officers may also be qualified to perform the task.
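One very simple element of such a test run might be a pattern-based scan of a representative sample for residual direct identifiers, as in the Python sketch below. The patterns and records shown are illustrative assumptions only; a real methodology would be matched to your chosen de-identification approach and the identifiers relevant to your legal regime, and would also assess the continued utility of the data.

```python
import re

# Illustrative patterns for two kinds of residual direct identifiers. A real
# verification run would cover every identifier relevant to your legal regime.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def verify_sample(sample):
    """Return the indices of records in which direct identifiers survived de-identification."""
    failures = []
    for i, record in enumerate(sample):
        text = " ".join(str(v) for v in record.values())
        if SSN_PATTERN.search(text) or EMAIL_PATTERN.search(text):
            failures.append(i)
    return failures

sample = [{"note": "Patient follow-up scheduled"},
          {"note": "Contact jane.doe@example.com for records"}]
print(verify_sample(sample))  # [1] -> record 1 still contains an email address
```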
3.2 Review and follow up on results of verification
Once you have run the test, take adequate time to analyze the results. Clearly outline any lessons learned. If necessary, take remedial action. This may involve improving the process you have chosen, or completely rethinking it. Once you settle on a specific de-identification approach, be sure to regularly revisit that approach in light of newly emerging security and privacy concerns.
Step 4 – Understanding the risk of re-identification
When deciding to de-identify data in order to enter it into an AI system, you must address the risk of re-identification.
4.1 Understand the risk of re-identification
De-identification of data is not a ‘cure-all' that will resolve all privacy issues. De-identification processes do not eliminate the possibility of ‘re-identifying’ such data. Re-identification is typically achieved by analyzing the de-identified data in combination with public or private databases that contain additional information about the individuals whose data is included in the de-identified data set. This risk is growing, particularly due to AI systems that can analyze and combine huge volumes of data, making it possible to identify links that are not ascertainable with smaller data sets.
It is important to be aware of this when deciding to rely on de-identification to address potential privacy concerns, and when considering possible public relations issues that may arise from depending heavily on de-identification. One way to minimize the potential fallout is to obtain the informed consent of those affected. Where feasible, this may be the best protection for your company in case of unanticipated re-identification.
It is also important to keep in mind the risk of re-identification from a legal perspective. Commentators have noted that the legal definitions of terms like ‘anonymization’ and ‘de-identification’ may be outdated and may need to be amended in light of new technical possibilities by which data might be identified, so it is important to keep up to date with any related developments. Meanwhile, the US Federal Trade Commission has noted that claims that data ‘has been anonymized’ can amount to deceptive trade practices if those claims are not true.
4.2 Protect yourself from external re-identification
Choosing and optimizing a specific de-identification process can be complex and time-consuming. Re-identification of data appears to be a growing risk. No matter how diligently you try to avoid re-identification risks, all your efforts can be undone if you enter data into an external AI system managed by an entity that does not pay sufficient heed to privacy matters. If using that application results in re-identification of your data set, this can have legal ramifications and prompt harmful publicity.
To help shield yourself against such problems, insist, by contract, on a confirmation from the AI application provider that the application will not attempt to re-identify any de-identified data you submit to the application. In some cases, such as under California’s privacy law, this may actually be required for your efforts to qualify legally as de-identification.
Additional resources
International Association of Privacy Professionals (IAPP), Data protection issues for employers to consider when using generative AI (2023)
IAPP, Deidentification 201: A lawyer’s guide to pseudonymization and anonymization
National Institute of Standards and Technology (NIST), De-Identifying Government Datasets: Techniques and Governance
National Library of Medicine, Modes of De-identification
NIST, De-Identification of Personal Information
Network for Public Health Law, Statistical or Scientific De-Identification Fact Sheet
US Department of Health and Human Services, Guidance Regarding Methods for De-identification of Protected Health Information
Related Lexology Pro content
How-to guides:
Understanding AI-driven risks
Understanding the risk of negligence claims when using AI
Risks and liabilities of algorithmic bias
How to determine and apply relevant US privacy laws to your organization
Checklists:
Steps to mitigate risks associated with AI use in business
Understanding privacy laws in the US
Quick views:
Overview of AI in business
Key AI terms