Schedule for Saturday
9:00 | Conference site opens |
9:30 | Opening and sponsor presentation |
9:40 | XML preserved from the past and into the future or? | Karin Bredenberg
10:10 | Transparent Invisible XML | Nico Verwer
10:40 | Roundtripping Invisible XML | Steven Pemberton
11:10 | Coffee break |
11:30 | Toward RESTful XQuery 2.0 | Adam Retter
12:15 | Tutorial Development XML Mashup with XProc | Erik Siegel
12:45 | Enriching Data from Digital Libraries with XProc 3.0 | Boris Lehečka
13:00 | Lunch |
14:30 | Modern Benchmarking of XQuery and XML Databases | Alan Paxton and Adam Retter
15:00 | Simple Semantic Data Modeling in XML (SeMoX) | Renzo Kottmann, Cedric Pauken and Andreas Schmitz
15:30 | GEDCOM to RDF: Transforming Genealogical Data for use in a Personal Knowledge Graph | Robert Walpole
16:00 | Coffee break |
16:30 | Bridging XDM types in multiple native type systems | O’Neil Delpratt and Matt Patterson
17:00 | natural-xml-diff: an XML Diffing Library | Martijn Faassen
17:30 | It’s Useful After All – VIN Numbers, DITA, and iXML | Ari Nordström
18:00 | Closing of the conference |
Session details
XML preserved from the past and into the future or?
Karin Bredenberg
Some have been saying that XML is dead (we know that is not true), and the use we see for XML most often involves publications, but much more relies on XML, and on XML being a part of the future. You might not know that the final resting place for a lot of the information (which will be called data in the rest of this text) created by agencies, municipalities and others is an archive: national and other archives save it for the future with the aid of digital preservation, whether it is a physical artefact or digital data. Digital data is exported in different ways out of the system in which it was created, to facilitate saving and reusing the data without requiring the originating system. The export involves creating some type of itinerary and transforming the data into formats suitable for moving it to the archive, or whoever the receiver is. In almost all cases the itinerary is in XML, following a standard that provides an XML schema for describing a transfer and all of the components required besides the data itself. The data will come in many different forms, such as database dumps, images in various file formats, PDFs and XML documents.
Transparent Invisible XML
Nico Verwer
Invisible XML (ixml) is a language for specifying grammars that can be used to parse plain text and turn it into structured XML. This works when the input is just plain text with an implicit structure, but not when the input is already structured XML, to which we want to add more structure.
In this talk, we present a modification to the Markup Blitz ixml parser, which makes it possible to augment existing XML markup with XML elements from the parser output. In BaseX, this modification is available as a slightly modified version of the proposed fn:invisible-xml XQuery 4.0 function. The parser only sees the text content of the input document, hence the name ‘transparent’ invisible XML.
We will also present a modification to ixml itself, allowing existing XML elements to be recognized as non-terminals. This makes it possible to use ‘pre-parsed’ input that is generated by another ixml parser, or by another method like named entity recognition. An ixml grammar can refer to these ‘pre-parsed’ elements, and make them part of larger recognized structures.
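To give a flavour of the starting point, here is a minimal sketch of the proposed fn:invisible-xml XQuery 4.0 function applied to plain text; the grammar, input and result are invented for illustration, and assume the draft signature in which the function returns a parser:

    (: Sketch only: parse plain text with an ixml grammar via the proposed
       fn:invisible-xml. The 'transparent' variant discussed in this talk would
       instead feed the parser the text content of an existing XML document. :)
    let $grammar :=
      'greeting: word, -" ", word.
       word: [L]+.'
    let $parse := fn:invisible-xml($grammar)
    return $parse("hello world")
    (: expected result, roughly:
       <greeting><word>hello</word><word>world</word></greeting> :)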
Roundtripping Invisible XML
Steven Pemberton
Invisible XML (ixml) is a language and process that takes linear textual input, recognises the implicit structure in the input, and converts it to structured XML output. It does this by parsing the input using a grammar describing the format of the input document, and serialising the resultant parse tree as XML, using extra information in the grammar to drive the serialisation.
If that were all it did, then round-tripping the XML back to text would be trivial: it would be simply a case of concatenating the text nodes of the XML, and you’d be done.
However, there are issues with regard to ixml serialisation (a small sketch illustrating them follows this list):
* Input characters may be deleted from the parse tree on serialisation.
* Extra characters that weren’t in the input may be inserted.
* Some parse tree nodes may be serialised as attributes rather than elements, causing a reordering of the input text, since attributes appear before element content.
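As an illustration only (this grammar is not from the paper), the following sketch shows all three effects at once, again assuming the proposed fn:invisible-xml function:

    (: The -" " deletes the space from the parse tree, +"!" inserts a character
       that was never in the input, -day drops its element but keeps its text,
       and @month is serialised as an attribute, moving its text ahead of the
       element content. :)
    let $parse := fn:invisible-xml(
      'date: day, -" ", +"!", month.
       -day: [Nd]+.
       @month: [L]+.')
    return $parse("15 June")
    (: expected result, roughly: <date month="June">15!</date>;
       concatenating its text nodes yields "15!", not the original "15 June" :)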
As hinted at in earlier papers on ixml, round-tripping could be achieved by having a special-purpose general parser which attempts to recreate a parse-tree that could have produced the serialisation, and then concatenating the resulting text nodes.
This paper takes a different approach: by transforming the input grammar into a grammar that represents all possible serialisations of the input grammar, it can use the same parser as used by ixml, with some small additions, to parse the serialisation back into a parse tree that would have produced that serialisation.
This raises a number of technical issues similar to the normal ixml process, in particular what to do with ambiguity, where a serialisation could have been produced by more than one input.
Toward RESTful XQuery 2.0
Adam Retter
In 2012, RESTful XQuery proposed a set of standardised annotations and associated machinery for XQuery 3.0 that could be used by an XQuery implementation running within a Web context to service REST calls by invoking XQuery user-defined functions. RESTful XQuery became colloquially known as RESTXQ 1.0 and was rapidly implemented in a number of XQuery products.
RESTXQ can be considered a framework for developing Web applications in XQuery. It emphasises simplicity and ease of use and offers a “convention over configuration” approach. Whilst RESTXQ prescribes a RESTful approach and provides the primitives for building such RESTful APIs, the choice of how closely to adhere to REST principles remains with the user. It is in fact possible for the user to build HTTP 1.1 APIs with RESTXQ that are not RESTful at all.
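For readers unfamiliar with RESTXQ 1.0, a minimal resource function looks like the following; the module namespace, path and collection name are invented for this sketch:

    (: A RESTXQ 1.0 resource function: the annotations bind it to
       GET /books/{id}, and the path segment arrives as $id. :)
    module namespace app = "http://example.org/app";
    declare namespace rest = "http://exquery.org/ns/restxq";

    declare
      %rest:GET
      %rest:path("/books/{$id}")
      %rest:produces("application/xml")
    function app:book($id as xs:string) as element(book)? {
      collection("library")//book[@id = $id]
    };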
As RESTXQ 1.0 was adopted and used in the wild, it became apparent that there were some missing capabilities; in some products these perceived holes have been plugged by non-standard, vendor-specific extensions. Additionally, JAX-RS, which inspired RESTXQ, and the state of the art in Web communication protocols have both advanced independently since 2012.
Herein we undertake a literature review of selected works published since 2012, evaluate them, and use that knowledge to inform a proposal for an update to RESTXQ.
The literature review (a) reflects on the success of RESTXQ 1.0 and identifies any perceived shortcomings that held true at the time of its delivery, (b) describes the continued evolution of JAX-RS and identifies new features for which the development of RESTXQ equivalents may be desirable, and (c) reviews new developments in Web communication protocols and capabilities which may or may not be considered desirable to support in RESTXQ.
Finally we propose an update to RESTXQ 1.0, namely RESTXQ 2.0, which builds upon the success of its predecessor by introducing additional standardisation aimed at enhanced functionality, addressing various limitations, and supporting the latest in Web technology.
Key improvements considered for RESTXQ 2.0 in this paper include (a speculative sketch of one such binding follows this list):
* full URI templating, enabling more flexible request path matching;
* support for multipart content, enabling the seamless transfer of multiple data types within RESTXQ requests and responses;
* authentication support, introducing standardised mechanisms for authenticating requests and ensuring robust security in RESTXQ applications;
* improved support for various HTTP methods such as HEAD, OPTIONS, DELETE, and PATCH, streamlining interaction with RESTXQ endpoints and aligning with best practices in web service development;
* support for the latest communication protocols, such as WebSockets and Server-Sent Events;
* an expanded library of extension functions for processing requests and responses;
* mechanisms for recoverable errors, offering graceful error handling and recovery strategies to maintain application stability and user experience.
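As a purely speculative illustration of where such standardisation might lead, the following continues the module sketch given earlier; %rest:PATCH does not exist in RESTXQ 1.0, and the annotation name and body binding here are guesses modelled on the existing %rest:POST("{$body}"):

    (: Hypothetical RESTXQ 2.0-style PATCH binding; not from the paper. :)
    declare
      %rest:PATCH("{$patch}")
      %rest:path("/books/{$id}")
      %rest:consumes("application/xml")
    function app:update-book($id as xs:string, $patch as document-node()) {
      (: a real implementation would apply $patch to the stored book :)
      <updated id="{$id}"/>
    };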
Overall, the proposed enhancements in version 2.0 of the RESTXQ specification aim to advance the capabilities and usability of RESTful services, fostering greater flexibility, security, and interoperability in modern web applications.
Tutorial Development XML Mashup with XProc
Erik Siegel
When you create a tutorial or a course, you have to maintain a lot of information: slides, exercises, instructions, etc. Numbering and referencing must be correct. There is usually duplication: documents are used in multiple exercises and sometimes also shown on slides.
It’s rather a chore to keep everything consistent during development and maintenance. This presentation describes an attempt to make this easier using specific XML markup, and processing software built from XProc 3 components.
Enriching Data from Digital Libraries with XProc 3.0
Boris Lehečka
This paper describes an application developed in the XProc 3.0 programming language that downloads publicly available data from digital libraries: bibliographic data, images and recognized text. The text can be enriched with linguistic metadata, such as lemmata, part-of-speech tags and morphological categories of individual words, along with recognized entities, using web services. Communication is done via a REST API. A document in TEI format is generated from the extracted data. The current version of the application supports digital libraries based on the Kramerius system and the IIIF protocol.
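The application itself is written in XProc 3.0; purely as an illustration of the kind of REST retrieval involved, here is a hedged XQuery sketch that fetches a IIIF Presentation manifest from a placeholder URL and lifts its label into a TEI fragment (the property name follows IIIF Presentation 2.x and is an assumption here):

    (: Illustration only: fetch a IIIF manifest (JSON) and reuse its label. :)
    let $manifest := json-doc("https://example.org/iiif/book1/manifest")
    return
      <titleStmt xmlns="http://www.tei-c.org/ns/1.0">
        <title>{ $manifest?label }</title>
      </titleStmt>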
Modern Benchmarking of XQuery and XML Databases
Alan Paxton and Adam Retter
When considering the performance of XQuery implementations and/or XML Databases, the rich variety of forms in which XML documents may be structured can hugely influence whether an application performs adequately or not.
Previous work in XML benchmarking has yielded standalone tools which carefully craft realistic synthetic XML documents with which to perform benchmarks. These tools then typically have a small corpus of XQuery that is executed over collections of generated documents that vary in structure and size to ascertain performance metrics.
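For instance, the XMark suite generates synthetic auction documents and pairs them with twenty short queries; whether this paper uses XMark specifically is not stated here, but its first query is typical of such a corpus:

    (: XMark query 1: the name of person0 in a generated auction document. :)
    for $p in doc("auction.xml")/site/people/person[@id = "person0"]
    return $p/name/text()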
More recently, an enormous amount of engineering effort has been expended in the NoSQL Database market to enable products to compete based on their performance characteristics. This has yielded very powerful and general purpose “pluggable” benchmarking frameworks that can be used to flexibly configure and drive benchmarks over multiple different products for comparison.
Herein we firstly review the previous approaches to benchmarking XQuery implementations and XML Databases, secondly we examine the current approaches to NoSQL Database benchmarking, thirdly we propose an adapter for utilising a NoSQL database benchmarking framework for use with an XQuery implementation and XML Database, and finally we evaluate the capabilities of our adapter compared to the previous approaches, and its suitability for helping to diagnose and address application specific performance problems.
Simple Semantic Data Modeling in XML (SeMoX)
Renzo Kottmann, Cedric Pauken and Andreas Schmitz
The aim of Simple Semantic Data Modeling in XML (SeMoX) is to provide non-technical domain experts with a simple model and additional tooling for capturing the semantics of data in a technology-neutral approach. It is foremost designed for modeling data exchange standards between heterogeneous systems. SeMoX is simple because all it needs are five basic concepts: Terms, Semantic Datatypes, Rules, Structures and Syntax Bindings.
The core artifact of SeMoX is semo.xsd [https://projekte.kosit.org/semox/semox-model/-/blob/master/src/main/xsd/semo.xsd]. This XML Schema defines a concise and linearly unfolding XML structure for users to create their own SeMoX-based semantic data modeling projects. In contrast to many UML-based model-driven approaches in standardization, SeMoX is able to leverage the entirety of the fully interoperable XML technology stack. SeMoX is the designated modeling approach of the whole German XEinkauf and has already proved valuable in production for the development and maintenance of procurement standards such as eForms-DE and XRechnung. SeMoX is open source under a permissive MIT Licence and invites usage and participation.
For further details see the homepage at https://semo-xml.org and the project repository at https://projekte.kosit.org/semox/semox-model.
GEDCOM to RDF: Transforming Genealogical Data for use in a Personal Knowledge Graph
Robert Walpole
This paper describes a process for converting genealogical data in GEDCOM format to RDF suitable for loading into a Personal Knowledge Graph (PKG). As part of this process, we aim to retain as much metadata about individuals and their relationships as possible. Once loaded into the PKG, this dataset will provide an excellent base for further enrichment of the PKG following the principle of the Open World Assumption. New individuals and additional information about existing individuals can be added to the PKG over time.
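To make the shape of the problem concrete, consider a tiny GEDCOM fragment and the kind of RDF it might become; the vocabulary below is a placeholder, not necessarily what the paper uses:

    0 @I1@ INDI
    1 NAME John /Smith/
    1 BIRT
    2 DATE 12 MAR 1850

    # one possible Turtle rendering (ex: is an invented prefix)
    @prefix ex: <http://example.org/genealogy#> .
    <#I1> a ex:Person ;
        ex:name "John Smith" ;
        ex:birthDate "1850-03-12" .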
It should be noted that the PKG being created is intended purely for personal or household activities under the terms of GDPR and is not intended for use in a public space.
Bridging XDM types in multiple native type systems
O’Neil Delpratt and Matt Patterson
We explore the relationship between XDM types and the native types of a host language. In applications designed to support multiple languages, such as SaxonC, we find that there is no one-size-fits-all approach to mapping data of a native type onto what we require in the XDM type system. Secondly, we look at the complexities of handling strings: strings seem simple to represent across languages and within the XDM system, but this, we found, can become cumbersome and complicated. Lastly, we dive into the use case of handling XDM node objects, including traversal and cross-language memory management from Java to C++ and vice versa. On top of that, we discuss the further complexity added by layering C++ extensions to support APIs in Python and PHP, which again operate in a managed-code environment.
natural-xml-diff: an XML Diffing Library
Martijn Faassen
natural-xml-diff is a software library written in the Rust programming language that produces a structure-aware difference between two XML documents. The diff aims to be human-readable and is produced efficiently for typical document-style content. natural-xml-diff creates a document describing the difference, not just a sequence of edits, and this diff document is post-processed to further improve its human readability.
It’s Useful After All – VIN Numbers, DITA, and iXML
Ari Nordström
Vehicle Identification Numbers, VINs for short, are used in the automotive industry to identify a vehicle’s configuration as sold (model year, engine, body) but also the vehicle as an individual. For repair shops, the use cases are obvious: it becomes easy to plan a service, from understanding what needs to be serviced and when, to acquiring spare parts and consumables in time for the service occasion. But the VIN also helps locate the service documentation applicable to the vehicle, if the documentation is marked up accordingly. The VIN can be expressed in an iXML (Invisible XML) grammar and serialised as XML to provide filtering in an application built on top of an XML database, eXist-db. That database stores the service documentation as DITA topics, publishing them on the fly. This presentation examines expressing a VIN used to uniquely identify a car as an iXML grammar and then using the results to filter technical documentation for that car in an XML database setting, publishing the results on the fly.
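As a hypothetical sketch (the paper’s actual grammar may well differ), the fixed 17-character layout of ISO 3779 (world manufacturer identifier in positions 1-3, vehicle descriptor in 4-8, check digit at 9, model year at 10, plant at 11, serial in 12-17) maps naturally onto an ixml grammar, shown here via the proposed fn:invisible-xml:

    (: Hypothetical VIN grammar; the sample VIN is a commonly cited example. :)
    let $parse := fn:invisible-xml(
      'vin: wmi, vds, check, year, plant, serial.
       @wmi:    c, c, c.
       @vds:    c, c, c, c, c.
       @check:  c.
       @year:   c.
       @plant:  c.
       @serial: c, c, c, c, c, c.
       -c: ["A"-"Z"; "0"-"9"].')
    return $parse("1HGCM82633A004352")
    (: expected result, roughly:
       <vin wmi="1HG" vds="CM826" check="3" year="3" plant="A" serial="004352"/> :)

Attributes like these are the sort of thing a filtering application could then query on.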