`render_word_file` removes `<` and `&` from tag values

rkg · 28 June 2024 16:09

Which tool versions are you using?

SDK: v14.11.0
Platform: v24.06.6
Python: v3.10
Isolation mode: venv

Current Behavior

When using render_word_file(), all < and & characters in the values supplied to WordFileTag() are being removed.

I’m using this code

def download_test_word_file(self, **kwargs) -> DownloadResult:

    content = "A < B << C > D >> E & F &amp; G &lt; H &gt; I"

    components = [
        WordFileTag("test_123", content),
        # Testing some workarounds
        WordFileTag("test_456", content.replace("<", "\N{LESS-THAN SIGN}")),
        WordFileTag("test_789", content.replace("<", "\N{FULLWIDTH LESS-THAN SIGN}")),
    ]

    with open(Path(__file__).parent / "resources" / "simple.docx", "rb") as template:
        word_file = render_word_file(template, components)

    return DownloadResult(word_file, "dummy_test.docx")
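
A side note on the two workarounds in the snippet above: `\N{LESS-THAN SIGN}` is the ordinary ASCII `<` (U+003C), so the "test_456" replacement leaves the string unchanged, whereas `\N{FULLWIDTH LESS-THAN SIGN}` is a different character (U+FF1C). A quick standard-library check, independent of the SDK:

```python
import unicodedata

content = "A < B << C > D >> E & F &amp; G &lt; H &gt; I"

# "\N{LESS-THAN SIGN}" is the same code point as "<", so this replace is a no-op.
same = content.replace("<", "\N{LESS-THAN SIGN}")

# The fullwidth variant is a distinct, non-ASCII character.
fullwidth = content.replace("<", "\N{FULLWIDTH LESS-THAN SIGN}")

print(same == content)
print(unicodedata.name("<"))
print(hex(ord("\N{FULLWIDTH LESS-THAN SIGN}")))
```

So only the "test_789" row actually feeds a different character into the template.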

The Word template looks like:

The result looks like:

Expected Behavior

I expected the < and & characters to show up in the Word output, just like the > character does.

Context (optional, but preferred)

We’re generating a Word document that, among other things, includes names of machinery. These names sometimes include less-than or greater-than signs, like:

mobile crane < 80 tons

mslootweg · 2 July 2024 11:52

Hi @rkg ,

I haven’t tried it out myself, but maybe this could help you out?

github.com/elapouya/python-docx-template

Problem with data containing '< '

opened 07:33PM - 08 Apr 16 UTC

closed 09:37AM - 09 Apr 16 UTC

micabot

Whenever I try to create a document containing a minor sign followed by a space ('< ') I get the following error:

```
Traceback (most recent call last):
  File "simple.py", line 7, in <module>
    tpl.render(context)
  File "/usr/lib/python2.7/site-packages/docxtpl/__init__.py", line 105, in render
    self.map_xml(xml)
  File "/usr/lib/python2.7/site-packages/docxtpl/__init__.py", line 86, in map_xml
    root.replace(body,etree.fromstring(xml))
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77697)
  File "src/lxml/parser.pxi", line 1819, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116494)
  File "src/lxml/parser.pxi", line 1700, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115040)
  File "src/lxml/parser.pxi", line 1040, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109165)
  File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103404)
  File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105058)
  File "src/lxml/parser.pxi", line 613, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:103967)
lxml.etree.XMLSyntaxError: StartTag: invalid element name, line 1, column 827
```

I included a template and script to reproduce it in [simple.zip](https://github.com/elapouya/python-docx-template/files/210731/simple.zip).

I peeked several DOCX files which included a '<' char and the XML actually stores the HTML entity instead of the character, so maybe replacing beforehand?

Besides my particular need of using the combination of these chars, I'm pretty sure the lib could be vulnerable to an [XML external entities attack](https://www.owasp.org/index.php/XML_External_Entity_%28XXE%29_Processing).

rkg · 4 July 2024 15:07

I’ve tried several things, but none seems to be working.

In my code example I tried using &amp;amp;, &amp;lt; and &amp;gt; as proposed in the GitHub issue. But as shown in the screenshots, they are completely removed (on the “test_123” row, there’s nothing between F, G, H and I).

Taking a look inside the word\document.xml file inside the .docx reveals

<w:t xml:space="preserve">A  B  C &gt; D &gt;&gt; E  F  G  H  I</w:t>
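
For anyone who wants to repeat this inspection without unzipping the file by hand, here is a minimal sketch using only the standard library. The helper name `docx_text_runs` is made up for illustration; it reads `word/document.xml` from the .docx archive and returns the decoded text of each `<w:t>` element.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the <w:...> elements in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text_runs(docx_bytes: bytes) -> list[str]:
    """Return the decoded text of every <w:t> element in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    return [node.text or "" for node in root.iter(f"{{{W_NS}}}t")]
```

Note that the parser decodes entities, so a stored `&gt;` comes back as `>`; characters that were dropped during rendering simply do not appear in the returned strings.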

The python-docx-template documentation mentions escaping by means of {{ <var>|e }}. Unfortunately that doesn’t solve it either, and the result shows up in Word like:

And the xml:

<w:t xml:space="preserve">A  B  C  D  E  F amp; G lt; H gt; I</w:t>

I also tried using the suggested R() in the value sent to WordFileTag(), but that results in a TypeError: Object of type RichText is not JSON serializable.

I’m a bit confused by these results. Only > comes out correctly, and only when no escaping is performed in the template.

| In Python string | Output of `{{ <var> }}` | Output of `{{ <var>\|e }}` |
|---|---|---|
| `<` | nothing | nothing |
| `>` | `>` | nothing |
| `&` | nothing | nothing |
| `&lt;` | nothing | `lt;` |
| `&gt;` | nothing | `gt;` |
| `&amp;` | nothing | `amp;` |
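
For comparison, this is what standard XML escaping does to the test string, using `xml.sax.saxutils.escape` from the standard library. It is only a sketch of the text one would expect to find inside `<w:t>` if every character survived; it makes no claim about how `render_word_file` processes values internally.

```python
from xml.sax.saxutils import escape, unescape

content = "A < B << C > D >> E & F &amp; G &lt; H &gt; I"

# escape() replaces "&" first, then "<" and ">", so already-escaped
# sequences such as "&amp;" become "&amp;amp;" and survive a round trip.
escaped = escape(content)
print(escaped)

# Decoding the escaped text gives back the original string unchanged.
print(unescape(escaped) == content)
```

In a correctly escaped document, Word would therefore display the literal text `&amp;`, `&lt;` and `&gt;` for the last three items, rather than dropping them.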