AI-driven solutions are rapidly being adopted across diverse industries, services, and products every day. However, their effectiveness depends entirely on the quality of the data they are trained on – an aspect often misunderstood or overlooked in the dataset creation process.
As data protection authorities increase scrutiny on how AI technologies align with privacy and data protection regulations, companies face growing pressure to source, annotate, and refine datasets in compliant and ethical ways.
Is there truly an ethical approach to building AI datasets? What are companies’ biggest ethical challenges, and how are they addressing them? And how do evolving legal frameworks impact the availability and use of training data? Let’s explore these questions.
Data Privacy and AI
By its nature, AI often requires large amounts of personal data to perform its tasks, which has raised concerns about how this information is gathered, stored, and used. Many laws around the world regulate and limit the use of personal data, from the GDPR and the newly introduced AI Act in Europe to HIPAA in the US, which governs access to patient data in healthcare.
Reference: DLA Piper's comparison of how strict data protection laws are around the world.
For instance, fourteen U.S. states currently have comprehensive data privacy laws, with six more set to take effect in 2025 and early 2026. The new administration has signaled a shift in its approach to data privacy enforcement at the federal level. A key focus is AI regulation, with an emphasis on fostering innovation rather than imposing restrictions. This shift includes repealing previous executive orders on AI and introducing new directives to guide its development and application.
Data protection legislation varies by country: in Europe, the laws are stricter, while in parts of Asia and Africa they tend to be less stringent.
However, personally identifiable information (PII), such as facial images, official documents like passports, or other sensitive personal data, is restricted to some degree in most countries. According to UN Trade & Development, the collection, use, and sharing of personal information with third parties without consumers' notice or consent is a major concern across much of the world: 137 out of 194 countries have regulations ensuring data protection and privacy. As a result, most global companies take extensive precautions to avoid using PII for model training, since regulations such as those in the EU tightly restrict such practices, with rare exceptions in heavily regulated niches such as law enforcement.
Over time, data protection laws are becoming more comprehensive and globally enforced. Companies adapt their practices to avoid legal challenges and meet emerging legal and ethical requirements.
What Methods Do Companies Use to Get Data?
When examining data protection issues in model training, it is essential first to understand where companies obtain their data. There are three primary sources.
- Data Collection
This method involves gathering data from crowdsourcing platforms, stock media libraries, and open-source datasets.
It is important to note that public stock media are subject to different licensing agreements. Even a commercial-use license often explicitly states that the content cannot be used for model training. These terms vary from platform to platform, so businesses must confirm that they are permitted to use the content in the ways they intend.
Even when AI companies obtain content legally, they can still face some issues. The rapid advancement of AI model training has far outpaced legal frameworks, meaning the rules and regulations surrounding AI training data are still evolving. As a result, companies must stay informed about legal developments and carefully review licensing agreements before using stock content for AI training.
- Data Creation
One of the safest dataset preparation methods involves creating unique content, such as filming people in controlled environments like studios or outdoor locations. Before participating, individuals sign a consent form authorizing the use of their PII and specifying what data is being collected, how and where it will be used, and who will have access to it. This ensures full legal protection and gives companies confidence that they will not face claims of illegal data usage.
The main drawback of this method is its cost, especially when data is created for edge cases or large-scale projects. However, large companies and enterprises increasingly use this approach for at least two reasons. First, it ensures full compliance with all standards and legal regulations. Second, it provides companies with data fully tailored to their specific scenarios and needs, guaranteeing the highest accuracy in model training.
- Synthetic Data Generation
This method uses software tools to create images, text, or videos based on a given scenario. However, synthetic data has limitations: it is generated from predefined parameters and lacks the natural variability of real data.
This shortfall can negatively impact AI models. While it does not occur in every case, it is still important to remember "model collapse": a point at which excessive reliance on synthetic data causes the model to degrade, producing poor-quality outputs, as the sketch below illustrates.
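To make the idea concrete, here is a minimal, purely illustrative Python sketch (a toy statistical model, not any real training pipeline) of how repeatedly training a model on its own synthetic outputs can erode the variability of the data it produces:

```python
# Toy illustration (not a real training pipeline): a simple Gaussian "model" is
# repeatedly refit on synthetic samples drawn from its own previous fit. Over
# generations the estimated spread drifts and typically shrinks, a simplified
# analogue of the "model collapse" effect described above.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=50)   # small set of "real" data

for generation in range(30):
    mu, sigma = data.mean(), data.std()           # "train": fit the model to the data
    print(f"generation {generation:2d}: mean={mu:+.3f}, std={sigma:.3f}")
    data = rng.normal(mu, sigma, size=50)         # next round sees only synthetic samples
```

In this simplified setup, each generation learns only from the previous generation's synthetic output, so the diversity of the original data is gradually lost, which is the core intuition behind model collapse.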
Synthetic data can still be highly effective for basic tasks, such as recognizing general patterns, identifying objects, or distinguishing fundamental visual elements like faces.
However, it isn’t the best option when a company needs to train a model entirely from scratch or deal with rare or highly specific scenarios.
The most revealing situations occur in in-cabin environments, such as a driver distracted by a child, someone appearing fatigued behind the wheel, or even instances of reckless driving. These data points are not commonly available in public datasets, nor should they be, as they involve real individuals in private settings. Since AI models rely on training data to generate synthetic outputs, they struggle to accurately represent scenarios they have never encountered.
When synthetic data fails, created data — collected through controlled environments with real actors — becomes the solution.
Data solution providers like Keymakr place cameras in cars, hire actors, and record actions such as taking care of a baby, drinking from a bottle, or showing signs of fatigue. The actors sign contracts explicitly consenting to the use of their data for AI training, ensuring compliance with privacy laws.
Responsibilities in the Dataset Creation Process
Each participant in the process, from the client to the annotation company, has specific responsibilities outlined in their agreement. The first step is establishing a contract, which details the nature of the relationship, including clauses on non-disclosure and intellectual property.
Let's consider the first option for working with data: creating it from scratch. Under the intellectual property terms of such agreements, any data the provider creates belongs to the hiring company, since it is created on the company's behalf. This also means the provider must ensure the data is obtained legally and properly.
As a data solutions company, Keymakr ensures data compliance by first checking the jurisdiction in which the data is being created, obtaining proper consent from all individuals involved, and guaranteeing that the data can be legally used for AI training.
It is also important to note that once data has been used for AI model training, it becomes nearly impossible to determine which specific data contributed to the model, because AI blends it all together. As a result, no single piece of training data can be tied to a particular output, especially when millions of images are involved.
Because the field is developing so quickly, clear guidelines for distributing these responsibilities are still being established. This is similar to the complexities surrounding self-driving cars, where questions about liability, whether it lies with the driver, the manufacturer, or the software company, remain unresolved.
In other cases, when an annotation provider receives a dataset for annotation, it assumes that the client has obtained the data legally. If there are clear signs that the data has been obtained illegally, the provider must report it. However, such obvious cases are extremely rare.
It is also important to note that large companies, corporations, and brands that value their reputation are very careful about where they source their data, even if it was not created from scratch but taken from other legal sources.
In summary, each participant’s responsibility in the data work process depends on the agreement. You could consider this process part of a broader “sustainability chain,” where each participant has a crucial role in maintaining legal and ethical standards.
What Misconceptions Exist About the Back End of AI Development?
A major misconception about AI development is that AI models work similarly to search engines, gathering and aggregating information to present to users based on learned knowledge. However, AI models, especially language models, often function based on probabilities rather than genuine understanding. They predict words or terms based on statistical likelihood, using patterns seen in previous data. AI does not “know” anything; it extrapolates, guesses, and adjusts probabilities.
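As a purely illustrative sketch (the vocabulary and the score values below are invented, not taken from any real model), this is roughly how a language model turns raw scores for candidate next words into probabilities before making its "guess":

```python
# Minimal sketch of the idea that a language model scores candidate next tokens
# by probability rather than by "knowing" facts. The vocabulary and logit values
# are made up purely for illustration.
import numpy as np

vocab = ["Paris", "London", "banana", "the"]
logits = np.array([4.2, 2.1, -3.0, 0.5])    # hypothetical scores from a model

probs = np.exp(logits - logits.max())        # softmax: turn scores into probabilities
probs /= probs.sum()

for token, p in zip(vocab, probs):
    print(f"{token:>7}: {p:.3f}")

# The model then picks (or samples) the most likely continuation: an educated
# guess based on patterns in its training data, not a lookup of verified facts.
print("prediction:", vocab[int(np.argmax(probs))])
```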
Furthermore, many assume that training AI requires enormous datasets, but much of what AI needs to recognize, like dogs, cats, or humans, is already well established. The focus now is on improving accuracy and refining existing models; much of today's AI development revolves around closing the last small gaps rather than starting from scratch.
Ethical Challenges and How the European Union AI Act and the Rollback of US Regulations Will Impact the Global AI Market
When discussing the ethics and legality of working with data, it is also important to clearly understand what defines “ethical” AI.
The biggest ethical challenge companies face in AI today is determining what it is unacceptable for AI to do or be taught. There is broad consensus that ethical AI should help rather than harm humans and should avoid deception. However, AI systems can make errors or "hallucinate," which makes it difficult to determine whether these mistakes qualify as disinformation or harm.
AI ethics is the subject of major debate, with organizations like UNESCO getting involved; key principles center on the auditability and traceability of outputs.
Legal frameworks surrounding data access and AI training play a significant role in shaping AI's ethical landscape. Countries with fewer restrictions on data usage make training data more accessible, while nations with stricter data laws limit its availability for AI training.
For example, Europe, which adopted the AI Act, and the U.S., which has rolled back many AI regulations, offer contrasting approaches that illustrate the current global landscape.
The European Union AI Act is significantly impacting companies operating in Europe. It enforces a strict regulatory framework, making it difficult for businesses to use or develop certain AI models. Companies must obtain specific licenses to work with certain technologies, and in many cases the regulations are effectively too burdensome for smaller businesses to comply with.
As a result, some startups may choose to leave Europe or avoid operating there altogether, similar to the impact seen with cryptocurrency regulations. Larger companies that can afford the investment needed to meet compliance requirements may adapt. Still, the Act could drive AI innovation out of Europe in favor of markets like the U.S. or Israel, where regulations are less stringent.
The U.S.'s decision to invest major resources in AI development with fewer restrictions could have its own drawbacks, but it also invites more diversity in the market. While the European Union focuses on safety and regulatory compliance, the U.S. is likely to foster more risk-taking and cutting-edge experimentation.