As you can probably guess from the title, this post is a companion piece to another recent post dealing with the creation of simple TEI from HTML in Python (actually, using Python as a wrapper around several regular expressions). Once you have created such TEI files, have annotated them more richly by hand or by some other means, or have been lucky enough to find nicely encoded TEI files matching your interests, you will most likely want to put all that markup to some use.
One of the simplest ways of doing so is to extract selected parts of the text from the TEI files for further processing, such as stylometric analyses. (I have provided some arguments for why you would want to take this detour in the first place in the post mentioned above.) The traditional way would be to do this fully in the XML universe, that is, using XSLT and XPath directly with an XSLT processor or from within an XML editor such as oXygen. With your plain text collection in hand, you would then use some other tool, possibly the stylo package for R or some Python scripts, to do your quantitative text analysis.
In Würzburg, we are increasingly moving towards using Python as the central tool from which we can then call other tools and services. Quite simply, Python seems to be the programming language that is most accessible to humanities scholars, is very well suited for text processing, and has a large and vibrant community providing help and developing useful modules. For us, it makes a lot of sense to also move the text preparation itself closer to Python. So what I have done intermittently over the last few days (when not attending the wonderful Digital Humanities session at the April Conference on English and American Literature and Culture in Kraków, Poland) was to learn some first steps towards a pythonic way of dealing with XML.
The solution for this used here is lxml, a Python library for processing XML and HTML from within Python. To quote from the lxml website: “The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API. The latest release works with all CPython versions from 2.4 to 3.3.”
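To give a first impression of how this looks in practice, here is a minimal sketch of my own (not taken from the script discussed below) of loading a single TEI file and inspecting its root element; the filename "sample.xml" is a placeholder:

```python
from lxml import etree

# Parse the file and build lxml's tree representation of the document
tree = etree.parse("sample.xml")
root = tree.getroot()

# Element tags carry the namespace in Clark notation,
# e.g. "{http://www.tei-c.org/ns/1.0}TEI" for a TEI file
print(root.tag)
```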
What lxml basically does for me, so far, is two things: first, it parses several XML files from a specified folder and builds a tree representation for each of them that can then be queried as XML (although the tree representation of lxml is somewhat quirky at first); and second, it applies one of several XPath expressions to each of these tree representations to selectively extract parts of the XML documents and write them to plain text files (for this, the quirkiness of the internal tree structure does not matter). From there (or directly), these texts can then be submitted to other Python functions for text analysis, such as the ones described by Allen Riddell in his TAToM tutorial. Integrating the text preparation from the TEI master files directly into the workflow also means that the steps from the master files to the final result of the analysis become more transparent and more easily reproducible.
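The following is a rough sketch of what such a two-step workflow might look like; it is my own illustration rather than the tei2txt.py script itself, the folder names "tei" and "txt" and the XPath expression are placeholders, and Python 3 is assumed:

```python
import glob
import os

from lxml import etree

# Map the "tei" prefix to the TEI namespace so it can be used in XPath expressions
namespaces = {"tei": "http://www.tei-c.org/ns/1.0"}

os.makedirs("txt", exist_ok=True)

for xml_file in glob.glob(os.path.join("tei", "*.xml")):
    # Step 1: parse the file and build the tree representation
    tree = etree.parse(xml_file)
    # Step 2: select the text content of all paragraphs in the body;
    # adapt the expression to extract other parts of the document
    paragraphs = tree.xpath("//tei:body//tei:p//text()", namespaces=namespaces)
    plain_text = "\n".join(paragraphs)
    # Write one plain text file per TEI file
    basename = os.path.splitext(os.path.basename(xml_file))[0]
    with open(os.path.join("txt", basename + ".txt"), "w", encoding="utf-8") as outfile:
        outfile.write(plain_text)
```

The result is one plain text file per TEI file, ready to be passed on to stylo or to Python-based analysis functions.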
That’s really all there is to it for now, but lxml is much more powerful and flexible than that. You can find the little Python script called tei2txt.py, as well as some sample texts on which it has been tested, over at my GitHub toolbox, where I have also posted the earlier html2tei.py function. (Note that there is an issue with namespaces in the tei2txt script that still needs to be resolved.)
[Edit, 27.4.2014: The namespace issue is now fixed.]
[Edit, 19.3.2016: There is a more recent version of the script, now found in the CLiGS group’s toolbox and part of the submodule extract.py.]