LLM01: Prompt Injection 프롬프트 주입 취약점은 공격자가 조작된 입력을 통해 LLM(대형 언어 모델)을 조작하여 LLM이 무의식적으로 공격자의 의도를 실행할 때 발생합니다. 이는 시스템 프롬프트를 "

작성자 bryan
작성일 2024.06.11 23:05

조회 4,248

LLM01: Prompt Injection

Prompt Injection Vulnerability occurs when an attacker manipulates a large language model (LLM) through crafted inputs, causing the LLM to unknowingly execute the attacker's intentions. This can be done directly by "jailbreaking" the system prompt or indirectly through manipulated external inputs, potentially leading to data exfiltration, social engineering, and other issues.

Direct Prompt Injections, also known as "jailbreaking", occur when a malicious user overwrites or reveals the underlying system prompt. This may allow attackers to exploit backend systems by interacting with insecure functions and data stores accessible through the LLM.
Indirect Prompt Injections occur when an LLM accepts input from external sources that can be controlled by an attacker, such as websites or files. The attacker may embed a prompt injection in the external content hijacking the conversation context. This would cause LLM output steering to become less stable, allowing the attacker to either manipulate the user or additional systems that the LLM can access. Additionally, indirect prompt injections do not need to be human-visible/readable, as long as the text is parsed by the LLM.

The results of a successful prompt injection attack can vary greatly - from solicitation of sensitive information to influencing critical decision-making processes under the guise of normal operation.

In advanced attacks, the LLM could be manipulated to mimic a harmful persona or interact with plugins in the user's setting. This could result in leaking sensitive data, unauthorized plugin use, or social engineering. In such cases, the compromised LLM aids the attacker, surpassing standard safeguards and keeping the user unaware of the intrusion. In these instances, the compromised LLM effectively acts as an agent for the attacker, furthering their objectives without triggering usual safeguards or alerting the end user to the intrusion.

Common Examples of Vulnerability

A malicious user crafts a direct prompt injection to the LLM, which instructs it to ignore the application creator's system prompts and instead execute a prompt that returns private, dangerous, or otherwise undesirable information.
A user employs an LLM to summarize a webpage containing an indirect prompt injection. This then causes the LLM to solicit sensitive information from the user and perform exfiltration via JavaScript or Markdown.
A malicious user uploads a resume containing an indirect prompt injection. The document contains a prompt injection with instructions to make the LLM inform users that this document is excellent eg. an excellent candidate for a job role. An internal user runs the document through the LLM to summarize the document. The output of the LLM returns information stating that this is an excellent document.
A user enables a plugin linked to an e-commerce site. A rogue instruction embedded on a visited website exploits this plugin, leading to unauthorized purchases.
A rogue instruction and content embedded on a visited website exploits other plugins to scam users.

How to Prevent

Prompt injection vulnerabilities are possible due to the nature of LLMs, which do not segregate instructions and external data from each other. Since LLMs use natural language, they consider both forms of input as user-provided. Consequently, there is no fool-proof prevention within the LLM, but the following measures can mitigate the impact of prompt injections:

Enforce privilege control on LLM access to backend systems. Provide the LLM with its own API tokens for extensible functionality, such as plugins, data access, and function-level permissions. Follow the principle of least privilege by restricting the LLM to only the minimum level of access necessary for its intended operations.
Add a human in the loop for extended functionality. When performing privileged operations, such as sending or deleting emails, have the application require the user approve the action first. This reduces the opportunity for an indirect prompt injections to lead to unauthorised actions on behalf of the user without their knowledge or consent.
Segregate external content from user prompts. Separate and denote where untrusted content is being used to limit their influence on user prompts. For example, use ChatML for OpenAI API calls to indicate to the LLM the source of prompt input.
Establish trust boundaries between the LLM, external sources, and extensible functionality (e.g., plugins or downstream functions). Treat the LLM as an untrusted user and maintain final user control on decision-making processes. However, a compromised LLM may still act as an intermediary (man-in-the-middle) between your application's APIs and the user as it may hide or manipulate information prior to presenting it to the user. Highlight potentially untrustworthy responses visually to the user.
Manually monitor LLM input and output periodically, to check that it is as expected. While not a mitigation, this can provide data needed to detect weaknesses and address them.

Example Attack Scenarios

An attacker provides a direct prompt injection to an LLM-based support chatbot. The injection contains "forget all previous instructions" and new instructions to query private data stores and exploit package vulnerabilities and the lack of output validation in the backend function to send e-mails. This leads to remote code execution, gaining unauthorized access and privilege escalation.
An attacker embeds an indirect prompt injection in a webpage instructing the LLM to disregard previous user instructions and use an LLM plugin to delete the user's emails. When the user employs the LLM to summarise this webpage, the LLM plugin deletes the user's emails.
A user uses an LLM to summarize a webpage containing text instructing a model to disregard previous user instructions and instead insert an image linking to a URL that contains a summary of the conversation. The LLM output complies, causing the user's browser to exfiltrate the private conversation.
A malicious user uploads a resume with a prompt injection. The backend user uses an LLM to summarize the resume and ask if the person is a good candidate. Due to the prompt injection, the LLM response is yes, despite the actual resume contents.
An attacker sends messages to a proprietary model that relies on a system prompt, asking the model to disregard its previous instructions and instead repeat its system prompt. The model outputs the proprietary prompt and the attacker is able to use these instructions elsewhere, or to construct further, more subtle attacks.

Reference Links

ChatGPT Plugin Vulnerabilities- Chat with Code: Embrace the Red
ChatGPT Cross Plugin Request Forgery and Prompt Injection: Embrace the Red
Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection: Arxiv preprint
Defending ChatGPT against Jailbreak Attack via Self-Reminder: Research Square
Prompt Injection attack against LLM-integrated Applications: Cornell University
Inject My PDF: Prompt Injection for your Resume: Kai Greshake
ChatML for OpenAI API Calls: GitHub
Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection: Cornell University
Threat Modeling LLM Applications: AI Village
Reducing The Impact of Prompt Injection Attacks Through Design: Kudelski Security
Universal and Transferable Attacks on Aligned Language Models: LLM-Attacks.org
Indirect prompt injection: Kai Greshake
AI Injections: Direct and Indirect Prompt Injections and Their Implications: Embrace the Red

LLM01: 신속한 주입

프롬프트 주입 취약점은 공격자가 조작된 입력을 통해 LLM(대형 언어 모델)을 조작하여 LLM이 무의식적으로 공격자의 의도를 실행할 때 발생합니다. 이는 시스템 프롬프트를 "탈옥"하여 직접 수행하거나 조작된 외부 입력을 통해 간접적으로 수행할 수 있으며 잠재적으로 데이터 유출, 사회 공학 및 기타 문제로 이어질 수 있습니다.

"탈옥"이라고도 알려진 직접 프롬프트 삽입은 악의적인 사용자가 기본 시스템 프롬프트를 덮어쓰거나 표시할 때 발생합니다. 이로 인해 공격자는 LLM을 통해 액세스할 수 있는 안전하지 않은 기능 및 데이터 저장소와 상호 작용하여 백엔드 시스템을 악용할 수 있습니다.
간접 프롬프트 삽입은 LLM이 웹사이트나 파일과 같이 공격자가 제어할 수 있는 외부 소스로부터 입력을 받아들일 때 발생합니다. 공격자는 대화 컨텍스트를 가로채는 외부 콘텐츠에 프롬프트 삽입을 삽입할 수 있습니다. 이로 인해 LLM 출력 조정이 덜 안정적이게 되어 공격자가 사용자 또는 LLM이 액세스할 수 있는 추가 시스템을 조작할 수 있게 됩니다. 또한 LLM에서 텍스트를 구문 분석하는 한 간접 프롬프트 삽입은 사람이 보거나 읽을 수 있을 필요가 없습니다.

성공적인 신속한 주입 공격의 결과는 민감한 정보 요청부터 정상적인 작동을 가장하여 중요한 의사 결정 프로세스에 영향을 미치는 것까지 매우 다양할 수 있습니다.

고급 공격에서는 LLM을 조작하여 유해한 인물을 모방하거나 사용자 설정의 플러그인과 상호 작용할 수 있습니다. 이로 인해 민감한 데이터 유출, 무단 플러그인 사용 또는 사회 공학이 발생할 수 있습니다. 이러한 경우 손상된 LLM은 공격자를 지원하여 표준 보호 조치를 능가하고 사용자가 침입을 인식하지 못하도록 합니다. 이러한 경우 손상된 LLM은 공격자의 에이전트 역할을 효과적으로 수행하여 일반적인 보호 조치를 실행하거나 최종 사용자에게 침입에 대해 경고하지 않고 목표를 달성합니다.

취약점의 일반적인 예

악의적인 사용자는 LLM에 직접 프롬프트 삽입을 만들어 응용 프로그램 작성자의 시스템 프롬프트를 무시하고 대신 개인 정보, 위험 또는 기타 바람직하지 않은 정보를 반환하는 프롬프트를 실행하도록 지시합니다.
사용자는 간접 프롬프트 삽입이 포함된 웹페이지를 요약하기 위해 LLM을 사용합니다. 그러면 LLM이 사용자로부터 민감한 정보를 요청하고 JavaScript 또는 Markdown을 통해 추출을 수행하게 됩니다.
악의적인 사용자가 간접 프롬프트 삽입이 포함된 이력서를 업로드합니다. 이 문서에는 LLM이 사용자에게 이 문서가 훌륭하다는 것을 알리도록 하는 지침이 포함된 즉각적인 삽입이 포함되어 있습니다. 직무에 대한 훌륭한 후보자입니다. 내부 사용자는 LLM을 통해 문서를 실행하여 문서를 요약합니다. LLM의 출력은 이것이 훌륭한 문서라는 정보를 반환합니다.
사용자가 전자상거래 사이트에 연결된 플러그인을 활성화합니다. 방문한 웹사이트에 삽입된 악성 명령이 이 플러그인을 악용하여 무단 구매로 이어집니다.
방문한 웹사이트에 포함된 악성 지침과 콘텐츠는 다른 플러그인을 악용하여 사용자를 사기합니다.

예방하는 방법

명령어와 외부 데이터를 서로 분리하지 않는 LLM의 특성으로 인해 프롬프트 주입 취약점이 발생할 수 있습니다. LLM은 자연어를 사용하므로 두 가지 입력 형식을 모두 사용자가 제공한 것으로 간주합니다. 결과적으로 LLM 내에는 완벽한 예방 방법이 없지만 다음 조치를 통해 즉각적인 주입의 영향을 완화할 수 있습니다.

백엔드 시스템에 대한 LLM 액세스에 대한 권한 제어를 시행합니다. 플러그인, 데이터 액세스 및 기능 수준 권한과 같은 확장 가능한 기능을 위해 자체 API 토큰을 LLM에 제공합니다. LLM을 의도된 작업에 필요한 최소한의 액세스 수준으로만 제한하여 최소 권한의 원칙을 따릅니다.
확장된 기능을 위해 루프에 사람을 추가하세요. 이메일 보내기 또는 삭제와 같은 권한 있는 작업을 수행할 때 애플리케이션에서 사용자가 먼저 해당 작업을 승인하도록 요구합니다. 이렇게 하면 사용자가 알지 못하거나 동의하지 않고 사용자를 대신하여 무단 작업으로 이어질 수 있는 간접적인 프롬프트 주입 가능성이 줄어듭니다.
사용자 프롬프트에서 외부 콘텐츠를 분리합니다. 사용자 프롬프트에 대한 영향을 제한하기 위해 신뢰할 수 없는 콘텐츠가 사용되는 위치를 구분하고 표시합니다. 예를 들어 OpenAI API 호출용 ChatML을 사용하여 LLM에 프롬프트 입력 소스를 표시합니다.
LLM, 외부 소스 및 확장 가능한 기능(예: 플러그인 또는 다운스트림 기능) 간의 신뢰 경계를 설정합니다. LLM을 신뢰할 수 없는 사용자로 취급하고 의사 결정 프로세스에 대한 최종 사용자 제어를 유지합니다. 그러나 손상된 LLM은 사용자에게 정보를 제공하기 전에 정보를 숨기거나 조작할 수 있으므로 여전히 애플리케이션의 API와 사용자 사이의 중개자(중간자) 역할을 할 수 있습니다. 잠재적으로 신뢰할 수 없는 응답을 사용자에게 시각적으로 강조합니다.
LLM 입력 및 출력을 주기적으로 수동으로 모니터링하여 예상대로인지 확인합니다. 완화는 아니지만 약점을 감지하고 해결하는 데 필요한 데이터를 제공할 수 있습니다.

공격 시나리오 예

공격자는 LLM 기반 지원 챗봇에 직접 프롬프트 주입을 제공합니다. 주입에는 "이전 지침을 모두 잊어버리세요"와 개인 데이터 저장소를 쿼리하고 패키지 취약점을 이용하며 이메일을 보내기 위한 백엔드 기능의 출력 유효성 검사가 부족하다는 새로운 지침이 포함되어 있습니다. 이로 인해 원격 코드가 실행되어 무단 액세스 및 권한 상승이 발생합니다.
공격자는 LLM에게 이전 사용자 지침을 무시하고 LLM 플러그인을 사용하여 사용자의 이메일을 삭제하도록 지시하는 간접적인 프롬프트 삽입을 웹페이지에 삽입합니다. 사용자가 LLM을 사용하여 이 웹페이지를 요약하면 LLM 플러그인이 사용자의 이메일을 삭제합니다.
사용자는 LLM을 사용하여 이전 사용자 지침을 무시하고 대화 요약이 포함된 URL에 연결되는 이미지를 삽입하도록 모델에 지시하는 텍스트가 포함된 웹페이지를 요약합니다. LLM 출력이 이를 준수하므로 사용자의 브라우저가 비공개 대화를 유출하게 됩니다.
악의적인 사용자가 프롬프트 삽입으로 이력서를 업로드합니다. 백엔드 사용자는 LLM을 사용하여 이력서를 요약하고 그 사람이 적합한 후보자인지 묻습니다. 프롬프트 인젝션으로 인해 실제 이력서 내용에도 불구하고 LLM 응답은 yes입니다.
공격자는 시스템 프롬프트에 의존하는 독점 모델에 메시지를 보내 모델에게 이전 지침을 무시하고 대신 시스템 프롬프트를 반복하도록 요청합니다. 모델은 고유한 프롬프트를 출력하고 공격자는 이러한 지침을 다른 곳에서 사용하거나 더 교묘한 공격을 구성할 수 있습니다.