Security researchers at Lasso Security have discovered more than 1,500 API tokens on the open source platform Hugging Face, including tokens belonging to tech giants such as Meta, Microsoft, Google and VMware. The exposure potentially puts 723 organizations at risk, including Meta, EleutherAI, and BigScience Workshop. The researchers found that most of the tokens had write permissions, which allowed files in the account repositories to be modified. Although the affected companies responded quickly and remediated the problem, the researchers point out the potential for serious consequences, including data theft, training data poisoning and model theft, which are among the critical threats to AI and machine learning.
The researchers successfully accessed and modified multiple datasets and accessed over 10,000 private models, demonstrating how vulnerable these systems are to such attacks. The leak was discovered through manual substring searches on Hugging Face, and the researchers stressed the need to raise awareness about API token protection. In response to the discovery, the major companies revoked the tokens and removed the code from their repositories. EleutherAI, one of the affected organizations, pointed to its collaboration with Hugging Face and Stability AI to strengthen security in machine learning research by developing a new checkpoint format that limits the risks associated with attacks of this type.
Hugging Face message about deprecated organization API tokens
Hugging Face is a platform that many LLM practitioners use as a source of tools and other resources for LLM projects. The company’s core offerings include Transformers, an open source library that provides APIs and tools for downloading and training pretrained models. In GitHub-like fashion, the company hosts more than 500,000 AI models and 250,000 datasets, including those from Meta, Google, Microsoft and VMware.
The platform lets users publish their own models and datasets and access others’ models for free via a Hugging Face API. The company has so far raised around $235 million from investors including Google and Nvidia. Given the widespread use and growing popularity of the platform, the Lasso researchers decided to take a closer look at the registry and its security mechanisms. In November 2023, as part of this exercise, they tried to find exposed API tokens that would allow them to access datasets and models on Hugging Face.
They looked for API tokens exposed on GitHub and Hugging Face. Initially, the searches returned only a very limited number of results, particularly on Hugging Face. However, by slightly modifying the search process, the researchers were able to find a relatively large number of exposed tokens, says Bar Lanyado, security researcher at Lasso Security.
“The implications of this error are significant, as we managed to gain full access, with both read and write permissions, to Meta Llama 2, BigScience Workshop and EleutherAI,” says Lanyado. All of these organizations have models with millions of downloads, a result “that leaves the organization vulnerable to exploitation by cybercriminals,” he adds.
“The seriousness of the situation cannot be overstated. By controlling an organization with millions of downloads, we now have the ability to manipulate existing models and potentially convert them into malicious entities. This poses a serious threat, as the injection of corrupted models could impact millions of users who rely on these foundational models for their applications,” Lanyado warns.
Access to high-profile organizations
Although Hugging Face has deprecated organization API tokens (org_api), the researchers found that exposed ones were still partly usable. “So we decided to take a closer look, and indeed the write function wasn’t working, but apparently, with small changes to the login function in the library, the read function still worked, and we were able to use the tokens we found to download private models with an exposed org_api token, for example from Microsoft,” explains Lanyado in his blog.
Research methodology
To begin the search for API tokens, the researchers searched the GitHub and Hugging Face repositories using the platforms’ own search functionality. On GitHub, they used the option to search code by regex, but ran into a problem: this type of search returns only the first 100 results. Searching for the regular Hugging Face token patterns (both user and org_api tokens) produced thousands of matches, but only 100 of them could be read. To overcome this obstacle, the researchers lengthened the token prefix they searched for, brute-forcing the first two letters of the token so that each query returned fewer results and all available matches could be reached.
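The prefix-extension step can be sketched as follows. This is only an illustration, not the researchers’ actual tooling: it assumes that user tokens start with `hf_` and organization tokens with `api_org_`, that the characters following the prefix are ASCII letters, and the helper name `extended_prefixes` is hypothetical. The code only generates the narrower search strings; submitting them to a search service is left out.

```python
import itertools
import string

# Assumed token prefixes; the article does not spell them out.
TOKEN_PREFIXES = ("hf_", "api_org_")


def extended_prefixes(prefix, extra_chars=2, alphabet=string.ascii_letters):
    """Yield `prefix` followed by every combination of `extra_chars` letters."""
    for combo in itertools.product(alphabet, repeat=extra_chars):
        yield prefix + "".join(combo)


# One narrow query per extended prefix, instead of one broad query per prefix.
queries = [q for p in TOKEN_PREFIXES for q in extended_prefixes(p)]

print(len(queries))   # 2 prefixes * 52 * 52 letter pairs = 5408
print(queries[:3])    # ['hf_aa', 'hf_ab', 'hf_ac']
```

Each generated string matches far fewer tokens than the bare prefix, so a search capped at 100 results per query is much more likely to return every match.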
A tip and a call to action
It is critical that organizations and developers understand that platforms like Hugging Face do not actively take steps to protect their users’ exposed API tokens.
Developers are strongly advised to avoid hard-coded tokens and to follow best practices. Doing so eliminates the need to check every commit to make sure that sensitive information or tokens are not accidentally pushed to the repositories.
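One common way to avoid a hard-coded token is to read it from the environment at run time. The sketch below assumes the token is stored in an environment variable named `HF_TOKEN`; the function name `load_hf_token` is hypothetical and not part of any library.

```python
import os


def load_hf_token(env=os.environ, var_name="HF_TOKEN"):
    """Return the token stored in the environment, or raise if it is missing.

    `var_name` is an assumption for this sketch; use whatever variable
    your deployment actually defines.
    """
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"{var_name} is not set; export it instead of hard-coding a token"
        )
    return token
```

A script would call `load_hf_token()` once at startup and pass the result to whatever client it uses, so the token value never appears in the source file or in the repository history.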
The Lasso Security researchers also advise Hugging Face to continuously scan for publicly exposed API tokens and to promptly revoke them or notify the affected users and organizations of the potential risks. GitHub takes a similar approach: it automatically revokes an OAuth token, GitHub App token, or personal access token when it is detected in a public repository or public gist.
In an ever-changing digital landscape, early detection is of paramount importance for preventing damage related to the security of large language models (LLMs). To address challenges such as exposed API tokens, training data poisoning, supply chain vulnerabilities, and theft of models and datasets, the researchers recommend applying token classification and implementing security solutions that inspect IDEs and code reviews and are designed specifically to protect these transformative models.
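As one illustration of what scanning code for exposed tokens and classifying them might look like, here is a minimal sketch that could run before a commit or during code review. The pattern (an `hf_` or `api_org_` prefix followed by at least 30 alphanumeric characters) and the names `TOKEN_PATTERN`, `TOKEN_KINDS` and `find_tokens` are assumptions made for this example; a production scanner would rely on a maintained rule set.

```python
import re

# Assumed token shape: known prefix + at least 30 alphanumeric characters.
TOKEN_PATTERN = re.compile(r"\b(hf_|api_org_)[A-Za-z0-9]{30,}\b")
TOKEN_KINDS = {"hf_": "user", "api_org_": "organization"}


def find_tokens(text):
    """Return (line_number, kind, redacted_token) for each suspected token."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in TOKEN_PATTERN.finditer(line):
            prefix = match.group(1)
            # Keep only the prefix and four characters so reports do not leak the token.
            redacted = match.group(0)[: len(prefix) + 4] + "..."
            findings.append((line_no, TOKEN_KINDS[prefix], redacted))
    return findings


sample = (
    "import os\n"
    'token = "hf_' + "A" * 34 + '"\n'
    'org = "api_org_' + "b" * 34 + '"\n'
)
for finding in find_tokens(sample):
    print(finding)
```

Redacting the match in the report is a deliberate choice: the scanner’s own output is often stored in logs or review comments, which should not become another place where the token is exposed.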
By quickly resolving these issues, organizations can strengthen their defenses and prevent imminent threats related to these vulnerabilities. Vigilance is essential in the digital security landscape, and this research represents an urgent call to action to secure the foundations of the language model space.
The exposure of API tokens, most of which have write permissions, points to possible negligence on the part of developers in managing data security. The ability to tamper with files in account repositories could have serious consequences, such as data theft and poisoning of training data, jeopardizing the confidentiality and integrity of the information.
Source: Lasso Security
And you?
Do you find the Lasso Security researchers’ findings relevant?
What lessons do you think researchers, developers and users of AI and machine learning can learn from this incident?