Sparnatural demonstrator, Archives nationales: documentation

Authors: Florence Clavaud and Pauline Charbonnier (Lab of the Archives nationales of France).

Date of this version: July, 22th 2022.

This documentation will be improved in the coming weeks. It is licensed under Creative Commons « Attribution 4.0 international (CC-BY 4.0) » (https://creativecommons.org/licenses/by/4.0/deed.fr).

This demonstrator was created as part of the development work, between August 2021 and June 2022, of a new version of the open source visual SPARQL query editor Sparnatural, for which the French Ministry of Culture, the National Library of France, and the National Archives of France have partnered to contract with Sparna, Sparnatural’s lead developer.

This demonstrator, which uses the current version of Sparnatural, is the result of an exploratory work on the transition from archival metadata to data graphs, following the projects of qualitative proof of concept PIAAF (released in February 2018) and development of the RiC-O Converter software (version 1.0 released in April 2020). The main challenge of the project was to provide users with a relevant and accessible research interface to explore an archival metadata graph of significant size, which exploits its nature.

Objectives

The work to develop the current version of Sparnatural was carried out by the project team — i.e. representatives of the three partner institutions and Sparna — with the aim of providing answers to the following general questions:

  • How to value a knowledge graph as a knowledge graph?
    (and not behind a faceted search engine or clickable navigation links)
  • How to promote exploratory user interactions (test/error) to discover the data?
  • How to reconcile generic conceptual models with the specific points of view of the end users?

From the point of view of the National Archives, the demonstrator was to be used in particular:

  • to test the evolutions of Sparnatural and thereby contribute to the development work;
  • to set up a very first web application making it possible to query and consult a significant part of the institution’s archival metadata, converted to RDF/RiC-O.

In doing so, the objectives were also:

  • to demonstrate that large-scale semantization of “classic” archival metadata is possible, and that this opens up new search opportunities for users;
  • to show what such an operation allows to learn about these metadata;
  • to build a methodology and acquire skills on the implementation of Sparnatural, which was likely to be used in other projects.

This work is part of an overall strategy for the evolution of National Archives’ metadata and information system towards better quality, more accessible metadata and linked entity graphs.

Quick project history

  • Sparna set up an RDF database (using GraphDB Free software) on the French Very Large Research Infrastructure (TGIR) Huma-Num of CNRS service grid, and a GitHub repository to store the demonstrator code (summer 2021); the database has been provisionally filled with RDF/RiC-O data already available at the National Archives.
  • Start of development of Sparnatural version 2 (August 2021), in agile mode.
  • Workshops between Sparna and the Archives Lab for getting started with the Sparnatural configuration method.
  • Selection of the metadata set, conversion to RDF/RiC-O data (see below The production process of the datagraph) and import of the metadata set into the RDF database; first additional data processing (Lab, autumn 2021).
  • First Sparnatural configurations by the Lab in autumn 2021 (with the help of Sparna).
  • Following the developments of Sparnatural v2, a first version of the demonstrators was set up, which was tested in user workshops. Two workshops were held in November 2021, bringing together about 20 persons of various backgrounds (archivists, researchers, genealogists, DH engineers, etc.), during which the participants, after a presentation of the project and the Sparnatural tool, were able to test the interface. During these workshops, the project team collected their feedback and the needs raised by these tests, then presented the developments planned for Sparnatural.
  • Progressive integration by Sparna of Sparnatural developments into the demonstrator’s repository on GitHub; testing and implementation of these new features by the Lab in the demonstrator’s search interface.
  • Second metadata conversion to RDF/RiC-O, and consolidation of search interface configurations (May-June 2022).

The current content of the demonstrator

The metadata of the National Archives of France are today mainly:

  • more than 30,000 finding aids (conforming to the XML/EAD 2002 DTD),
  • more than 15,400 authority records (conforming to the XML/EAC-CPF schema) on archival creators,
  • vocabularies used to index these files,

to describe about 375 linear km of paper archives of all types, not including natively digital archives. The salle des inventaires virtuelle(online catalog) allows users to search all of these metadata.

Given this important volume, in the absence of a semantic application such as data.bnf.fr and given the human and financial resources available, a subset of these metadata had to be selected to feed the Sparnatural demonstrator.

Metadata selection

A part of the metadata describing the archives of Parisian notaries kept in the National Archives have been retained in agreement with the department responsible for them (the Department of Minutier Central des Notaires de Paris at the Direction des Fonds — DMC), for several reasons:

  • they form a coherent whole (in all, these are the archives of 122 notarial offices, produced and preserved continuously since the end of the 15th to the beginning of the 20th century, i.e. 194,500 boxes, more than 212,000 reference codes, more than 26 km linear and an estimated total of 20 million acts — here we take over some of the elements given by the research aid sheet on inventory work at the DMC;
  • their description is rich (about 1.6 million acts have already been analyzed) and is constantly enriched, while presenting overall the classic characteristics of the finding aids drawn up by the French public archival services;
  • they describe documents that are widely consulted by various types of audiences; the fonds placed under the responsibility of the DMC accounted for 26 % of the communications of the National Archives, all sites combined, 2019, according to the survey report entitled Presentation of resources in notarial archives and their valorisation on the websites of the Departmental and National Archives, carried out in April 2020 by Sandra Fullenbaum Lenfant, under the direction of Marie-Françoise Limon-Bonnet;
  • archives of the same nature are kept in each of the departmental archives services. The survey report quoted above indirectly gives a general overview of these fonds.

We therefore chose to work on the authority records of archival creators (in XML/EAC-CPF) and finding aids (in XML/EAD 2002) relating to the archives of the first 40 Paris notarial offices (out of the 122 kept in the National Archives, i.e. about one third of the offices). The relevant XML/EAC-CPF and EAD 2002 files were selected from a complete export of validated XML files, carried out on 11 March 2022 by the Information Systems Department of the National Archives. The selection was made by following the links established between the first forty EAC-CPF records of notarial offices and, on the one hand, the notaries’ records, on the other hand, the finding aids. This operation produced a corpus including:

  • the 40 records describing notarial offices I to XL (and the LIII office record, in which one of the notaries of office XV also worked between 1961 and 1973);
  • the 1079 records describing the notaries who worked in these forty offices;
  • the 1577 archival finding aids describing the archives of these offices (32 % of the 4,902 existing finding aids validated in March 2022 for the DMC), including 933 digitized registers; 899 of these finding aids describe acts or mentions of an act (based on the results of a SPARQL query executed on the obtained RDF graph).

To this set we have added the records (on persons and organizations which are the subjects of documents) and vocabularies (types of documents, subjects, activities, places of Paris...) used in these records and finding aids to index them. These records and vocabularies are published and available in a public repository of the National Archives on GitHub. Please note that the versions of the creator records in this repository are dated March 2022.

As part of the project, these metadata sets were somewhat enriched. Thereby, for the notaries from offices I and II, the chronological relations between successive notaries were added to the EAC-CPF records, paving the way for possible future work. Some new entries (such as ‘notarial register’, ‘list of acts’ and ‘mention of an act’) and definitions (e.g. definitions assigned to notarial deeds or notarial minutes) have also been added to the document types vocabulary, with the agreement of the DMC. The purpose of these additions was not to immediately enrich the source EAD files, but to make these additions possible afterwards, while adopting rigorous definitions that we could immediately use to construct the RDF graph.

The data graph production process

The selected metadata have been converted into linked data graphs, compliant to the new Records in Contexts Conceptual Model (RiC-CM) and its technical transposition, the Records in Contexts-Ontology (RiC-O, version 0.2, published in February 2021).

The conversion was carried out twice. The first iteration was performed in November 2021, the second in May 2022, after learning various lessons from the first one and from the initial configuration work on the demonstrator.

To convert the XML/EAD and XML/EAC-CPF files, we used the open source software RiC-O Converter,developed for the National Archives in 2019 and available on GitHub since April 2020. RiC-O Converter produces RDF files compliant with RiC-O 0.1. The resulting files were then made compliant with RiC-O 0.2 and slightly enriched or modified by specific scripts written by the Lab. Among the enrichments and changes made let us mention: the propagation of the provenance relation from the description of each fonds to the description of all the archival resources included in this fonds, the categorization (with the rico:hasDocumentaryFormType property and the document type vocabulary) of instances of rico:RecordResource that can be identified as registers, lists of acts and mentions of acts, and the replacement of association relations between notarial offices and notaries by more precise membership relations.
The vocabularies and records used to index EAD and EAC-CPF files have been converted using dedicated scripts, written and used by the Lab.
Finally, once imported into the RDF database chosen for the demonstrator (currently, an instance of the free GraphDB Free software, configured in a very simple way), the data graph has been further enriched, in order to create classes specific to notarial archives and to “populate” these classes by inference. The screenshot below shows an example: these are the SPARQL queries that were used to add the demoanonto:Repertoire class to the RDF database and to create the instances of this class. In other words, an ontology extending RiC-O 0.2 was produced and implemented. Even if this work was not essential, it has greatly facilitated the design of the demonstrator’s configurations and also made it possible to optimize its operation by reducing the execution time of certain queries.

Example of a class creation in the graph

Example of a SPARQL query to create a class in the graph.

Statistical elements concerning the resulting RDF graph

The graph obtained currently has about 57,5 million RDF triples, including about 37 million inferred triples. We can therefore already estimate that the RDF graph resulting from the semantization of all metadata from the notarial archives of the National Archives would amount to about 170 million triples. In addition, let us recall that the approximately 4900 finding aids of the DMC make up 16 % of the total number of 30,000 finding aids. Even if simplistic extrapolations are to be avoided, especially since these EAD files are among the most indexed and precise in the National Archives, it is now proven that semantizing all of the institution’s XML metadata would produce an extremely massive dataset.

This dataset is the first large-scale semantization operation (in RDF compliant with RiC-O 0.2) carried out by the National Archives (and in France as far as we know) of “classic” archival metadata. It should be noted that another less massive dataset was also released in December 2021 (in a public repository on GitHub) and that it is also accessible via the SPARQL endpoint of the ALEGORIA project).

In this dataset, are used, among the components defined in RiC-O 0.2:

  • 34 classes (entity categories) out of 106 (48 with inferred triples);
  • 25 datatype properties (relations whose target is a string) out of 62 (28 counting inferred triples);
  • 77 object properties (relations whose target is an entity) out of 423 (160 with inferred triples).

SPARQL queries executed directly in the SPARQL endpoint of the RDF database, in order to go beyond the limit of 1000 results set for the Sparnatural interface, give the following counts for the main categories of objects described in the graph:

  • 72,661 groups of records (fonds, series, bundles, clients files, etc.);
  • 400,570 notarial acts (described to date) within the bundles;
  • 104,383 mentions of acts recorded in the registers;
  • 3135 registers;
  • 13,127 lists of acts within the registers, including 11,670 digitized lists;
  • 629,246 persons (including 3,213 persons with an IRI);
  • 530 organizations (including 528 with an IRI);
  • 15,559 places (including 15,463 with an IRI).

We will soon include in this documentation a diagram providing an overview of the graph data model.

As explained below, if the quantities found for notarial acts, mentions of acts and lists of acts are to be regarded as a good approximation (given the fairly reliable reasoning that we used to generate the data), the numbers found for persons and registers are not reliable and can only be considered orders of magnitude. However, these statistics, on their own, show in our opinion the interest of the technologies used (since it is currently impossible for an end-user to obtain them via the Salle des inventaires virtuelle, and it is not easy to calculate them for a professional competent in the field of XML technologies). They show above all, if it is still necessary to prove it, the very richness of these archival holdings.

We will continue to work on these data, as explained below.

We have currently set up a private repository on GitHub to deposit and update the RDF dataset. Do not hesitate to contact us if you wish to have access to this repository or a copy of the dataset.

The graph exploration interfaces

Choices and design methods

Several Sparnatural configuration options are possible: configuration via a Google spreadsheet or an OWL ontology (see the documentation). The Lab chose to configure the demonstrator via an ontology, with the help of Sparna, and used Protégé, a free software, to edit the ontology.

Example of Sparnatural class edition in Protégé

Sparnatural class edition in Protégé.

The configuration ontology of a Sparnatural search interface imports and uses components defined in two generic ontologies included in the source code of the software.
The aim is to specify an ontological model for research and its correspondence with the classes and properties of the domain ontology.
This approach makes it possible to retain certain categories of objects or relationships rather than others that are considered secondary to search, to specify new ones to group several existing components in order to simplify the exploration of the graph or to represent text type nodes, to give them understandable labels in the language you want, to define a display order for the main entry points, to specify how the target nodes of a relationship are displayed, to use existing SKOS vocabularies, etc. If you know HTML, the data contained in the graph, the business model used and its implementation, SPARQL and the OWL language, you can build and develop your interface independently.

The aim for the Lab was therefore to achieve a reasonable compromise between the complexity of the domain model and the need to produce an understandable and efficient interface for users. It took a long time to achieve the current result, through successive iterations. Feedback from users during the two workshops held at the end of 2021 were invaluable in this respect and really guided both the developments of the software and the choices we made.

The design work of the search interfaces also made it possible to test the features developed as part of the Sparnatural evolution project, including:

  • the support of the OPTIONAL clause (which facilitates the discovery, as the granularity and the accuracy of the data are variable and not determined in advance in a graph of cultural metadata, in particular in a archival metadata graph) — we used this possibility for many relationships, e.g. ‘has digital surrogate’, ‘has title’, ‘have subject’;
  • FILTER NOT EXISTS support (to express the negation of a property) — we also retained this possibility in the same cases as for OPTIONAL;
  • the ability to select values from a tree list (of type thesaurus) — we used this feature to allow selection in the document type thesaurus;
  • the ability to add contextual help (tooltips) — this allowed us to attach to each of the categories of entities usable in the interface a short definition useful in the context of the project;
  • the ability to pre-record queries, in order to guide the user in his first queries;
  • the ability to choose the result columns to display — this allows the user to define the contents of the result list as he wishes, so that he can immediately have textual data (rather than only the URIS of entities that a classic SPARQL query returns by default), operate sorts and then make an export in CSV format;
  • the ability to translate the interface into as many languages as necessary — so far we have been able to deploy the search interfaces of the National Archives demonstrator in French and English.

Results

Warnings

The demonstrator only consists of two search interfaces and this documentation. The National Archives, unlike the BnF (with data.bnf.fr), does not have a website for viewing data graphs. A feature has therefore been added to the Sparnatural demonstrator, to associate with each result entity a page allowing to view its description in a simple way (via the Lodview application). This feature is not entirely optimal. Moreover, RDF resources IRIs are not currently dereferencable; in particular, no webpage is associated with these IRIs, nor can we guarantee the sustainability of the root segment (http://data.archives-nationales.culture.gouv.fr/).

The results obtained still show some small technical flaws that we will try to correct. Furthermore, we take any comments or suggestions for improvement. We are aware that after the user workshops held at the end of 2021, these interfaces need to be tested by more users. Just like the RDF queried data, the interfaces are likely to evolve significantly.

A Configuration

The first configuration (A interface) is simple and generalistic; it can almost be applied to any RiC-O compliant archival dataset, insofar as contextual entities such as archival creators, document subject agents, locations are indexed. It is a priori rather intended for the discovery of metadata. .

The main categories of objects (classes) defined in the configuration ontology and usable in this first interface are:

  • Records (a project-specific class, which gathers all archival resources, thus all instances of the RiC-O RecordResource class, with the exception of finding aids and authority records);
  • Person (which corresponds to class rico:Person);
  • Organization (which corresponds to class rico:CorporateBody);
  • Place (which corresponds to class rico:Place);
  • Documentary form type (class of SKOS concepts defined in the vocabulary of the document types of the National Archives, also instances of class rico:DocumentaryFormType);
  • Subject (class of SKOS concepts defined in the vocabulary of the national archives subjects, also instances of class rico:Thing);
  • Occupation (class of SKOS concepts defined in the vocabulary of occupations and functions of persons in the National Archives, also instances of class rico:OccupationType);
  • Digital surrogate (the instances of class rico:Instantiation having JPEG format, generated from the daogrp elements of EAD files);
  • and other associated ‘entities’, which correspond to literal values of RiC-O properties: date, title, name.
Inventories after death of goldsmiths in the 17th century

Pre-recorded query in the interface A : inventories after death of goldsmiths in the 17th century

B Configuration

The second configuration (B interface) exploits the specificities of notarial archives. It is a priori rather adapted to the needs of a person familiar with this universe, archivist or accustomed user. This is the configuration that has had the greatest number of changes in the history of the project.
In addition to the categories of objects listed above, this configuration defines and uses new categories of objects, and establishes for these new categories correspondences with classes of the ontology extending RiC-O mentioned above:

  • Record set (which corresponds to class rico:RecordSet), Notarial register, List of acts, Notarial act, Mention of an act, Reference Code;
  • Notarial office, Notarial office number, Notary;
  • Parisian district or parish, Street in Paris, Parisian building, Address.

Most of these entities can be used from the beginning of writing a query.

New relationships are also present, for example between notarial office and notaries (Notary is member of Notarial office), between notaries (Notary has successor or has family relation with Notary), between archival resources (such as Notarial register includes or mentions Mention of an act or List of acts).

Bilan

The demonstrator as an exploration and research device

The results are very positive in terms of the possibilities for exploring the graph and the new research opportunities offered by Sparnatural.
Among the main points identified by users during the November 2021 workshops, once the surprise effect has passed, one can mention the intuitive, flexible and interactive nature of this type of interface, which also requires a greater intellectual commitment for the user than to fill out a classic search form. The user builds his questionnaire by choosing among the options available to him and discovers the content of the graph at the same time as he constructs his search; he’s really taking over. It is also easy for him to go back by undoing a step, then the previous one. Finally, he can share the executed query or export the results of his search.

We cannot, of course, replace the non-archivist user or the one who does not know the RiC-O data model at all. We just want to present here an example of the searches that a classic search form in EAD files currently does not allow to do, and which Sparnatural makes possible. This example is the same as the third pre-recorded query in the B search interface.

Pre-recorded query in the B interface : Documentary form types of the notarial
              actes dated 1848, with their reference codes

Pre-recorded query in the B interface : documentary form types of the notarial actes dated 1848, with their reference codes

In this example, we first see that the entry point is a context entity, which is currently rare in our archival search interfaces, in which the focus is mostly placed at the outset on archival documents. The query editor then allows you to choose a category of entity related to document types (here, the choice fell on notarial acts), before going through the graph step by step. The user is still on the same HTML page, and remains in the context of the construction of a questionnaire (without having to work on search results in which you would then click on hyperlinks, to find out what to expect). He could thus, for example, also be interested in the archival creator of the acts (notary or notarial office). Such a question, even if it is not formulated in natural language, seems to us close to the formulation that a researcher in history might make of it, who would like, for example, to form a corpus and to get a first idea of it. One of the notable features here is also that the user can, for many relationships, either select the target precisely, or request all the targets of a given relationship (here, all notarial acts, or all dates), at least initially. The ability to determine which information is included in the results table, and to have textual values (entity labels), not just URIs, in this table, is also particularly valuable. Finally, the user can easily modify his questionnaire in stages, or share the query and save the link in his own files, in order to replay the query later.

The functional limits of the demonstrator are essentially those of Sparnatural. The first, already mentioned above, is that it is a tool for bulding an exploration and search interface in a knowledge graph, not for displaying the data listed in the result tables and consulting them. Moreover, the National Archives does not yet have a way to de-reference URIs. The HTML page display feature achieved using Lodview, obviously does not replace a real design work of a consultation interface, that could also help visualize parts of the obtained graph, in line with what the PIAAF prototype proposed for a much smaller amount of data. This was not the purpose of the work done under this project.

Such a search interface is not, moreover, exclusive of a ‘full text’ search allowing, from the word or expression entered, to have a list of suggestions giving for each of them the type of entity concerned; or tab-type navigation options (one tab per entity category) as the PIAAF prototype also offers.

Of course, in addition to these limitations, the dataset used does not provide a homogeneous coverage of the archives of the first 40 notarial offices: the work of analyzing the notarial acts held within these fonds is continuing at the DMC, as in the other fonds.

The RDF graph obtained, an example of large-scale use of RiC-O

The result of this large-scale conversion of XML/EAD and XML/EAC-CPF source data to RDF compliant with RiC-O clearly showed:

  • that this conversion is entirely possible without necessarily making any changes to this data;
  • that the graph obtained from classic archival metadata of this type will offer all the more different entry points and will be all the more rich in nodes (in entities of various types) as the source metadata use reference data (authority records, vocabularies) to describe the contextual entities of the archives;
  • that, even from well-indexed metadata, such an operation will only use a subset of the RiC-O classes and properties, which is normal since RiC-O is the technical transposition of an overall conceptual framework;
  • that, while employing only a small part of RiC-O, carrying out such a project may encourage, in particular in the case of a homogeneous archival corpus such as ours, to extend this ontology, which is a very generic international model which cannot take into account the variety of corpuses or all local specificities and needs.

Such lessons may, in our view, be useful to any institution or team that would consider following the same path. We would of course be interested in any comparable feedback.

A new perspective on the metadata used

At this early stage of the work on metadata that the demonstrator allows to explore, as we have already mentioned above, thanks to semantic technologies, we have new quantitative elements on the content of the described archives. We will be able to refine this mapping in the coming months.
Moreover, as in any project for the production and publication of graphs of linked entities from pre-existing data, where a more or less significant leap is made in precision, the work on the data and their exploration as a graph have highlighted problems in the source metadata, although not unknown but so far very little visible and difficult to quantify. Here we will take two examples.

629 246 people in the graph?

The essentially free indexing, via the element EAD <persname>, of persons (except for 3213 of them with authority records) in the finding aids processed results in a very large number of entities of the type rico:Person in the graph, about which little is often known (e.g., we know the occupation of 34 925 of these persons only) and which have not been the subject of aggregation or identification work. In fact, entities with the same name can be aggregated as a single entity Person in this graph only if there is absolute certainty or quasi-certainty that it is the same natural person — when, for example, that person is found at about the same date, with the reference to the same occupation or function (or the same relations with other persons), in the acts of the same notary or mentioning the same property. This is a very interesting field of investigation. Regardless of the results, there should be a very large cohort of highly blurred individuals whose numbers are not expected to differ significantly from the initial number.

3135 registers?

This number is much higher than the observed reality. It results from the fact that in the finding aids treated, there are usually two descriptions (two EAD <c>elements between which no relationship is currently established), created in two different finding aids, about the same register: one describes the fonds of a given notary, thus its registers — and sometimes, references to acts in the registers — and minutes; the other, produced as part of the full digitisation of registers, describes the notary’s registers as material objects and the lists of acts contained therein. This redundancy is not visible enough in the source metadata in the Salle des inventaires virtuelle; it is much more so in the RDF graph.

Further analysis has led us to decide to carry out, in the coming months, a test on a representative sample of these finding aids, of linking or even merging the information they contain, in order to produce, ideally, a single description of each of the registers. This example is very representative of other situations, related to the long history of the National Archives metadata and the phenomenon of stratification of finding aids.

Other tangible results of the project

The preparation of the RDF data for the Sparnatural demonstrator made it possible to carry out a work prior to the update of RiC-O Converter, during which the various actions to be undertaken were first identified and coded. This work will save some time for the rest. Indeed, we plan to develop a second version of RiC-O Converter to make it compatible with version 0.2 of RiC-O.

The translation into French of the English labels of the RiC-O classes and properties used in the graph, carried out during the configuration of the Sparnatural demonstrator, will be integrated into the RiC-O sources in order to have a complete translation over time.

In conclusion: prospects

In a possible continuation of the project, it is envisaged that the two configurations A and B should be merged into one, by hierarchizing classes and properties, to give to the user the choice, within a single search interface, of a quick or more precise exploration (e.g. select Person > Notary in a single interface). This will involve changing the configuration possibilities offered by Sparnatural.

The Lab of National Archives pans to produce and publish a generic Sparnatural search ontology for RiC-O compliant RDF data, starting with configuration A.

Finally, the National Archives Lab is now able on its own to design a Sparnatural search interface for other data sets. The reuse of Sparnatural is therefore foreseen in other research projects, for example to explore data from the ORESM project (Works and References of Students, Supôts and Masters of the University of Paris in the Middle Ages). See the presentation in November 2021 of the proof of concept already carried out.