AI and ML algorithms rely heavily on vast amounts of data for training and development. However, obtaining high-quality, diverse, and secure data can be a significant challenge. In one case, an MIT student who was not recognized by a facial recognition system found that the biometric company’s training dataset was 77% male and 83% white.
Synthetic data, which consists of artificially generated datasets that mimic real-world data, has emerged as a solution to this issue. Gartner already estimates that by 2030, synthetic data will overshadow real data in AI models. This will help reduce the resource-intensive process of manually collecting, cleaning, and labeling large datasets.
Instead, generating synthetic data involves manipulating variables and parameters to replicate data patterns, such as those seen when identifying rare fingerprints. Synthetic data can also alleviate data privacy concerns, as it doesn’t involve real individuals’ sensitive information. This is essential in criminal investigations, where protecting privacy is paramount.
In addition, synthetic data can help strengthen cybersecurity algorithms, such as facial recognition and fingerprint matching. These applications could potentially improve suspect identification in law enforcement and streamline processes for border control.
So, what about the ethics of using synthetic data and its potential to enhance law enforcement practices? Let’s dive in.
Ethical Challenges of Synthetic Data
As the EU keeps adding regulations to AI across several applications, biometric systems have made the cut as well. In its regulatory framework proposal, the EU has asserted that using AI in some areas of law enforcement, migration, asylum, and border control, such as assessing the reliability of evidence, is high-risk for its citizens. Furthermore, it prohibits remote biometric identification in publicly accessible spaces for law enforcement purposes.
However, the EU specifies narrow exceptions in which this use of AI is allowed, such as identifying and locating suspects or preventing an imminent terrorist threat. For the US, these guidelines can serve as a blueprint: they represent an opportunity to regulate AI while allowing the use of synthetic data to ensure its application in biometrics is safe for citizens.
Creating synthetic data should therefore adhere to ethical guidelines, especially when replicating sensitive information or scenarios, such as in criminal justice or healthcare applications. In doing so, there must be full disclosure about its use so as not to mislead users or stakeholders. For instance, in cybersecurity, when synthetic data is used to evaluate the effectiveness of security algorithms, there should be a disclaimer that the data is synthetic to prevent false expectations.
The Case for Data Privacy and Bias
Generating high-quality synthetic data is all about accurate modeling and validation to represent the target domain. As a result, it can enhance diversity in datasets and improve machine learning training to successfully identify individuals regardless of race and gender.
For example, the system that couldn’t identify an MIT student, which had a 34.7% error rate for darker-skinned women, could have been better trained with synthetic datasets to avoid such a mistake. The data can be manipulated to include more skin tones and represent genders evenly so the AI is sufficiently equipped to identify anyone without bias.
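As a rough illustration of the rebalancing idea, the sketch below counts how many synthetic samples each group would need to match the largest group in a dataset. The function name `synthetic_samples_needed` and the 1,000-image dataset are hypothetical and chosen only for illustration; the 77% male share comes from the example above. Equalizing group sizes is just one possible balancing target, not necessarily the one any vendor uses.

```python
from collections import Counter

def synthetic_samples_needed(labels):
    """Return how many synthetic samples each group needs to match the largest group."""
    counts = Counter(labels)
    target = max(counts.values())
    return {group: target - n for group, n in counts.items()}

# Hypothetical dataset of 1,000 face images, 77% of them labeled male.
labels = ["male"] * 770 + ["female"] * 230
deficits = synthetic_samples_needed(labels)
print(deficits)  # {'male': 0, 'female': 540}
```

In practice, the same count could be computed over combinations of attributes (for example, gender and skin tone together), since a dataset can look balanced on each attribute separately while still underrepresenting specific combinations.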
Synthetic data can also protect data privacy by concealing personally identifiable information in sectors like healthcare. Synthetic patient data that closely resembles real-world cases allows health-related information to be shared publicly without exposing the sensitive details of real patients.
A peer-reviewed study recently analyzed many instances that used synthetic data in healthcare and confirmed its advantages, concluding that the generated datasets benefit the sector as they help advance research and inform evidence-based policymaking.
Improving Cybersecurity Algorithms
Synthetic data allows cybersecurity teams to use up-to-date information, without the limitations of real data, to train AI models for fraud detection or biometric recognition.
For example, creating synthetic fingerprints can be complex as it involves the generation of oval-shaped fingerprints with minutiae points, like ridge endings and spurs, that mimic real-world variations. To simulate real fingerprints, systems must make them “latent” by introducing distortions and keeping essential minutiae points. This new and highly accurate data can help train AI models to match latent fingerprints with their scanned or photographed counterparts.
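The two steps described above can be sketched in a simplified form: first place minutiae points inside an oval region, then derive a "latent" version by keeping only a subset of points and jittering their positions. This is a toy model under stated assumptions, not how any production ABIS generates data. The function names, the ellipse proportions, the `keep` fraction, and the `jitter` value are all hypothetical, and real generators also synthesize ridge patterns and image-level noise.

```python
import math
import random

def make_minutiae(n, a=1.0, b=1.4, rng=None):
    """Sample n minutiae inside an ellipse with semi-axes a (x) and b (y)."""
    rng = rng if rng is not None else random.Random()
    points = []
    while len(points) < n:
        # Rejection sampling: draw from the bounding box, keep points inside the oval.
        x, y = rng.uniform(-a, a), rng.uniform(-b, b)
        if (x / a) ** 2 + (y / b) ** 2 <= 1.0:
            points.append({
                "x": x,
                "y": y,
                "angle": rng.uniform(0.0, 2 * math.pi),  # local ridge direction
                "kind": rng.choice(["ridge_ending", "spur"]),
            })
    return points

def make_latent(minutiae, keep=0.6, jitter=0.03, rng=None):
    """Simulate a latent print: keep a fraction of minutiae and distort positions."""
    rng = rng if rng is not None else random.Random()
    k = max(1, round(keep * len(minutiae)))
    kept = rng.sample(minutiae, k)
    return [
        {**m, "x": m["x"] + rng.gauss(0, jitter), "y": m["y"] + rng.gauss(0, jitter)}
        for m in kept
    ]

rng = random.Random(0)  # fixed seed so the example is reproducible
full_print = make_minutiae(40, rng=rng)
latent_print = make_latent(full_print, rng=rng)
print(len(full_print), len(latent_print))  # 40 24
```

Each pair of `full_print` and `latent_print` then forms a labeled training example for a matcher: the model sees the degraded print and learns to link it to its clean counterpart.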
This technology is already deployed in large-scale biometric criminal investigation systems across the US. A notable example is the Iowa Department of Public Safety’s Automated Biometric Identification System (ABIS), which scored 63,700 10-print “hits” in 2018. Internationally, the Indonesia National Police also operates an ABIS that uses synthetic data.
Similar techniques can be applied to facial recognition, enhancing the accuracy and reliability of facial matching algorithms, particularly when dealing with partial or degraded images.
In the financial sector, synthetic data can model customer behaviors to improve fraud detection algorithms without compromising user privacy. Likewise, it can help where conventional de-identification of financial transactions falls short. De-identified records usually blur account numbers and names but leave addresses and amounts visible, which is still enough to re-identify individuals and can become a liability. That risk is avoided with synthetic data, which does not correspond to real customers.
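To make the contrast with de-identification concrete, the following sketch generates entirely new transaction records that share only aggregate statistics with the originals. It assumes, purely for illustration, that transaction amounts are roughly log-normally distributed. The function name `synthesize_transactions`, the sample amounts, and the `synthetic-` account prefix are hypothetical. A simple parametric model like this offers no formal privacy guarantee, and production systems typically use richer generative models.

```python
import math
import random
import statistics

def synthesize_transactions(real_amounts, n, rng=None):
    """Sample n synthetic transactions from a log-normal fit of the real amounts."""
    rng = rng if rng is not None else random.Random()
    logs = [math.log(a) for a in real_amounts]
    mu, sigma = statistics.mean(logs), statistics.stdev(logs)
    return [
        {
            # Random identifier, not derived from any real account number.
            "account": f"synthetic-{rng.getrandbits(32):08x}",
            "amount": round(rng.lognormvariate(mu, sigma), 2),
        }
        for _ in range(n)
    ]

# Hypothetical real transaction amounts; only their mean and spread are used.
real_amounts = [12.50, 40.00, 8.99, 150.00, 23.75, 61.20]
synthetic = synthesize_transactions(real_amounts, 5, rng=random.Random(42))
for row in synthetic:
    print(row)
```

Because no output row is copied or masked from an input row, there is no original record to link back to, which is the property that blurring account numbers and names alone cannot provide.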
One last example of using synthetic data for cybersecurity in the financial industry is Amazon’s trailblazing “palm factory,” which the e-commerce giant built to train its palm-based biometric payment system, Amazon One.
Although AI is on the way to becoming heavily regulated and may be flawed in some instances, synthetic data is a highly reliable asset for safeguarding data privacy and improving processes in cybersecurity. Its successful use cases far outweigh worries stemming from inaccuracies and, indeed, help combat bias in AI training datasets. For law enforcement, it means better biometric identification and fraud detection systems.