Team Text - Software Architecture Overview

Introduction

This document presents the wider architecture developed by Team Text at the KNAW Humanities Cluster. All in-house software mentioned in this documentation can be found via https://tools.huc.knaw.nl.

The document serves both as an internal reference and as a technical showcase for external parties.

1. Service Oriented Architecture for Text Collections

We have ample experience publishing diverse scientific text collections. These may be literary text editions, historical manuscripts, linguistically-annotated collections or large corpora from automatic OCR or Handwritten Text Recognition.

1.1. Current SOA for Text Collections

This is our current Service Oriented Architecture for making (enriched) Text Collections available; it still relies on TextRepo. TextAnnoViz is the front-end that end-users will mostly interact with, via their web browsers, to browse and search texts, their original scans, and annotations on either.

%%{init: {"flowchart": {"htmlLabels": true},
    'themeVariables': {
      'edgeLabelBackground': 'transparent'
    }
}}%%
flowchart TD


    user@{ shape: sl-rect, label: "End-user (Researcher)
in web browser"}
    user -- "HTTPS (UI)" --> textannoviz
    subgraph frontend
        textannoviz[/"TextAnnoViz
(web front-end)"/]
        mirador@{shape: subproc, label: "Mirador
IIIF Image viewer"}
        textannoviz --> mirador
    end
    techuser@{ shape: sl-rect, label: "Technical user/machine
via a web client"}
    techuser -- "HTTPS + Broccoli API" --> broccoli
    subgraph middleware
        textannoviz -- "HTTPS + Broccoli API" --> broccoli
        broccoli[/"Broccoli
(broker)"/]
        broccoli_annorepoclient@{shape: subproc, label: "annorepo-client (java)"}
        broccoli --> broccoli_annorepoclient
    end
    subgraph backend
        annorepo[/"Annorepo
(web annotation server)"/]
        mongodb[/"MongoDB
(NoSQL database server)"/]
        annorepo_db[("Annotation Database")]
        annorepo -- "HTTP(S) + MongoDB Query API" --> mongodb --> annorepo_db
        textscans@{ shape: docs, label: "Text Scans
(image files)"}
        textdb@{ shape: database, label: "Texts (with metadata) database"}
        textrepo[/"Textrepo
(text server)"/]
        postgresql[/"Postgresql
(Database System)"/]
        subgraph brinta
            broccoli -- "HTTP(S) + ElasticSearch API" --> elasticsearch
            elasticsearch[/"ElasticSearch
(Search engine)"/]
            searchindex[("Text and annotation index
(for full-text search and faceted search)")]
            elasticsearch --> searchindex
        end
        textrepo -- "Postgresql" --> postgresql
        postgresql --> textdb
        cantaloupe[/"Cantaloupe
(IIIF Image server)"/]
        manifests@{ shape: docs, label: "IIIF Manifests"}
        cantaloupe --> textscans
        broccoli_annorepoclient -- "HTTP(S) + W3C Web Annotation Protocol" --> annorepo
        broccoli -- "HTTP(S) + TextRepo API" --> textrepo
        manifest_server[/"nginx
(static manifest server)"/]
        manifest_server --> manifests
        mirador -- "HTTPS + IIIF Image API" --> cantaloupe
        mirador -- "HTTPS" --> manifest_server
    end
    classDef thirdparty fill:#ccc,color:#111
    class cantaloupe,mongodb,elasticsearch,postgresql,mirador,manifest_server thirdparty
    linkStyle default background:transparent,color:#009

Legend

  • Arrows follow the caller (or loader) direction; response data flows in the opposite direction. Edge labels denote communication protocols.
  • Rectangles represent processes.
  • Parallelograms represent networked processes (i.e. services).
  • Rectangles with an extra marked block on the left and right represent software libraries.
  • Third-party software is grayed out.
  • All components (in any of frontend, middleware, and backend) are configurable via external configuration files. These are not explicitly drawn in the schema.

Notes

  • Web annotations produced by this pipeline have custom selectors for TextRepo that are not part of the W3C Web Annotation Data model.
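For contrast, the following is a minimal sketch of a fully standard W3C Web Annotation using a TextPositionSelector, the kind of selector the custom TextRepo selectors deviate from. All IDs and URLs here are hypothetical and only serve to illustrate the data model:

```python
import json

# A minimal annotation following the W3C Web Annotation Data Model,
# targeting a character range in a plain-text resource. The source URL
# is hypothetical; the current pipeline uses custom TextRepo selectors
# instead of the standard TextPositionSelector shown here.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "body": {
        "type": "TextualBody",
        "value": "A person name",
        "purpose": "tagging",
    },
    "target": {
        "source": "https://example.org/texts/letter-042.txt",  # hypothetical
        "selector": {
            "type": "TextPositionSelector",  # standard W3C selector
            "start": 100,  # character offset, inclusive
            "end": 113,    # character offset, exclusive
        },
    },
}

serialised = json.dumps(annotation, indent=2)
```

Such an annotation can be POSTed as-is to any server implementing the W3C Web Annotation Protocol.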

1.2. New proposed SOA for Text Collections

This is our new proposed Service Oriented Architecture for making (enriched) Text Collections available; it replaces TextRepo with Textsurf and adds a query expansion service (Kweepeer).

%%{init: {"flowchart": {"htmlLabels": true},
    'themeVariables': {
      'edgeLabelBackground': 'transparent'
    }
}}%%
flowchart TD


    user@{ shape: sl-rect, label: "End-user (Researcher)
in web browser"}
    user -- "HTTPS (UI)" --> textannoviz
    user -- "HTTPS (UI)" --> annorepodashboard
    subgraph frontend
        textannoviz[/"TextAnnoViz
(web front-end)"/]
        mirador@{shape: subproc, label: "Mirador
IIIF Image viewer"}
        kweepeerfrontend@{shape: subproc, label: "Kweepeer Frontend
(Query expansion UI)"}
        textannoviz --> mirador
        textannoviz --> kweepeerfrontend
        annorepodashboard[/"AnnoRepo Dashboard
(explorative and administrative front-end for annotations)"/]
    end
    techuser@{ shape: sl-rect, label: "Technical user/machine
via a web client"}
    techuser -- "HTTPS + Broccoli API" --> broccoli
    subgraph middleware
        textannoviz -- "HTTPS + Broccoli API" --> broccoli
        broccoli[/"Broccoli
(broker)"/]
        broccoli_annorepoclient@{shape: subproc, label: "annorepo-client (java)"}
        broccoli_elasticclient@{shape: subproc, label: "elasticsearch-java
(client)"}
        broccoli --> broccoli_annorepoclient
        broccoli --> broccoli_elasticclient
    end
    subgraph backend
        annorepo[/"Annorepo
(web annotation server)"/]
        mongodb[/"MongoDB
(NoSQL database server)"/]
        annorepo_db[("Annotation Database")]
        annorepo --> mongodb --> annorepo_db
        annorepodashboard --> annorepo
        subgraph brinta
            broccoli_elasticclient -- "HTTP(S) + ElasticSearch API" --> elasticsearch
            elasticsearch[/"ElasticSearch
(Search engine)"/]
            searchindex[("Text and annotation index
(for full-text search and faceted search)")]
            elasticsearch --> searchindex
        end
        texts@{ shape: docs, label: "Text files
(plain text, UTF-8)"}
        textscans@{ shape: docs, label: "Text Scans
(image files)"}
        textsurf[/"Textsurf
(text server)"/]
        textframe@{shape: subproc, label: "Textframe
(text referencing library)"}
        cantaloupe[/"Cantaloupe
(IIIF Image server)"/]
        manifests@{ shape: docs, label: "IIIF Manifests"}
        cantaloupe --> textscans
        kweepeer[/"Kweepeer
(Query Expansion server)"/]
        broccoli_annorepoclient -- "HTTP(S) + W3C Web Annotation Protocol" --> annorepo
        broccoli -- "HTTP(S) + Textsurf API" --> textsurf
        mirador -- "HTTPS" --> manifest_server
        mirador -- "HTTPS + IIIF Image API" --> cantaloupe
        kweepeerfrontend -- "HTTP(S) + Kweepeer API" --> kweepeer
        manifest_server[/"nginx
(static manifest server)"/]
        manifest_server --> manifests
        textsurf --> textframe --> texts
        textsurf --> texts
    end
    classDef thirdparty fill:#ccc,color:#111
    class cantaloupe,mongodb,elasticsearch,postgresql,mirador,manifest_server,broccoli_elasticclient thirdparty
    linkStyle default background:transparent,color:#009

Notes

  • Kweepeer is not expanded further in this schema; see https://github.com/knaw-huc/kweepeer/blob/master/README.md#architecture for its internal architecture.
  • Web annotations produced by this pipeline no longer have custom selectors but fully adhere to the standard.
  • All components (in any of frontend, middleware, and backend) are configurable via external configuration files. These are not explicitly drawn in the schema.
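To illustrate how query expansion can feed the faceted full-text search in this architecture, the sketch below OR-s a term's expansion variants into an ElasticSearch bool query. The variants dict is a stand-in for what an expansion service like Kweepeer would return; Kweepeer's actual API and response format are not assumed here:

```python
# Sketch: fold query expansion variants into an ElasticSearch bool query.
# The variants list is a stand-in for the output of a query expansion
# service; the field names ("text", "document_type") are hypothetical.

def build_expanded_query(term: str, variants: list[str],
                         facets: dict[str, str]) -> dict:
    """Build an ElasticSearch query body from a search term, its
    expansion variants, and facet filters (field -> required value)."""
    should = [{"match": {"text": v}} for v in [term, *variants]]
    filters = [{"term": {field: value}} for field, value in facets.items()]
    return {
        "query": {
            "bool": {
                "should": should,            # any variant may match
                "minimum_should_match": 1,   # at least one must match
                "filter": filters,           # facets restrict, don't score
            }
        }
    }

# Example: historical Dutch spelling variants (illustrative)
query = build_expanded_query(
    "wijf",
    ["wyf", "wijff"],
    {"document_type": "letter"},
)
```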

1.3. Potential SOA for Text Collections with STAM

This is a potential and highly experimental architecture that replaces various components with STAM-based solutions. Although STAM is implemented, it is not yet integrated into such a wider architecture. It is presented here merely as an option for consideration.

%%{init: {"flowchart": {"htmlLabels": true},
    'themeVariables': {
      'edgeLabelBackground': 'transparent'
    }
}}%%
flowchart TD


    user@{ shape: sl-rect, label: "End-user (Researcher)
in web browser"}
    user -- "HTTPS (UI)" --> textannoviz
    subgraph frontend
        textannoviz[/"TextAnnoViz
(web front-end)"/]
        mirador@{shape: subproc, label: "Mirador
IIIF Image viewer"}
        textannoviz --> mirador
        tavconf@{ shape: doc, label: "TextAnnoViz Configuration
(project specific)"}
        textannoviz --> tavconf
    end
    techuser@{ shape: sl-rect, label: "Technical user/machine
via a web client"}
    techuser -- "HTTPS + Broccoli API" --> broccoli
    subgraph middleware
        textannoviz -- "HTTPS + Broccoli API" --> broccoli
        broccoli[/"Broccoli
(broker)"/]
        broccoli --> brocconf
        brocconf@{ shape: doc, label: "Broccoli Configuration
(project specific)"}
    end
    textannoviz -. "HTTP(S) + W3C Web Annotation protocol (subset)
, STAM text referencing API
and/or STAM Query Language
" -.-> stamd
    subgraph backend
        stamd[/"stamd
(text and annotation server)"/]
        stamrust@{shape: subproc, label: "stam-rust
(STAM library)"}
        stamrust --> textframe
        texts@{ shape: docs, label: "Text files
(plain text, UTF-8)"}
        annotations@{ shape: docs, label: "STAM Annotations
(STAM JSON/CBOR)"}
        textscans@{ shape: docs, label: "Text Scans
(image files)"}
        textframe@{shape: subproc, label: "Textframe
(text referencing library)"}
        cantaloupe[/"Cantaloupe
(IIIF Image server)"/]
        manifests@{ shape: docs, label: "IIIF Manifests"}
        cantaloupe --> textscans
        broccoli -- "HTTP(S) + W3C Web Annotation protocol (subset)
, STAM text referencing API
and/or STAM Query Language
" --> stamd
        mirador -- "HTTPS" --> manifest_server
        mirador -- "HTTPS + IIIF Image API" --> cantaloupe
        manifest_server[/"nginx
(static manifest server)"/]
        manifest_server --> manifests
        stamd --> stamrust
        textframe --> texts
        stamrust --> annotations
    end
    classDef thirdparty fill:#ccc,color:#111
    class cantaloupe,mirador,manifest_server,broccoli_elasticclient thirdparty
    linkStyle default background:transparent,color:#009

Notes

  • There are three major caveats here currently:
    • The STAM library does not yet provide a full-text index, so it is not a drop-in replacement for ElasticSearch.
    • The STAM implementation is currently memory-bound: all annotations are loaded into memory (which makes it very fast), but this will not scale to huge corpora.
      • The same goes for the texts themselves, but a solution to that is already proposed in this architecture (though not yet implemented): using Textframe in stamd.
    • The caller logic in Broccoli (or potentially in TextAnnoViz, see next point) would change drastically.
      • The entire middleware layer (the broker) can be omitted if the caller logic is implemented in TextAnnoViz instead. The dotted line represents this option.
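The core idea this architecture builds on, stand-off annotation of immutable plain text by character offsets, can be sketched in miniature as follows. This is purely illustrative and deliberately does not mimic the actual stam-rust API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TextSelector:
    """Refers to a character range in an immutable plain-text resource."""
    resource: str  # identifier of the text resource
    begin: int     # 0-indexed, inclusive
    end: int       # exclusive

@dataclass(frozen=True)
class Annotation:
    """Stand-off annotation: the data is kept apart from the text itself,
    which is never modified."""
    selector: TextSelector
    data: dict  # e.g. {"type": "entity", "class": "person"}

# Hypothetical corpus: resource id -> plain text (UTF-8)
resources = {"letter-1": "Vincent wrote to Theo from Arles."}

anno = Annotation(
    selector=TextSelector("letter-1", 0, 7),
    data={"type": "entity", "class": "person"},
)

def resolve(a: Annotation) -> str:
    """Resolve an annotation to the text span it targets."""
    return resources[a.selector.resource][a.selector.begin:a.selector.end]
```

Because annotations only hold offsets, arbitrarily many annotation layers can target the same text, and annotations can even target other annotations' spans, without ever touching the text files.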

2. Data Conversion Pipelines

2.1. Current conversion pipeline for Text Collections

Text Fabric [Factory] and un-t-ann-gle are used in the Suriano, Translatin, Van Gogh and Mondriaan projects. un-t-ann-gle is used standalone in Republic (CAF data) and Globalise (PageXML data).

%%{init: {"flowchart": {"htmlLabels": true}} }%%
flowchart TD

    subgraph sources["Sources (pick one)"]
        direction LR
        teisource@{ shape: docs, label: "Enriched texts
(TEI XML)"}
        pagexmlsource@{ shape: docs, label: "Enriched texts
(Page XML)"}
        cafsource@{ shape: docs, label: "Enriched texts
(CAF)"}
    end
    user@{ shape: sl-rect, label: "End-user (Data manager)
in web browser"}
    user -- "HTTPS (UI)" --> peen
    subgraph frontend
        peen[/"Preview Editor's ENvironment (PEEN)"/]
    end
    subgraph conversion
        direction TB
        peen --> with_textfabric_factory
        subgraph with_textfabric_factory["With Text Fabric Factory (Python API)"]
            direction TB
            teisource ==> tff_fromtei_validation
            pagexmlsource ==> tff_fromxml
            tff_fromtei_validation["tff.convert.tei
(Validation step)"]
            tff_fromtei_validation ==> tff_fromtei_conversion
            tff_fromtei_conversion["tff.convert.tei
(Conversion step, from TEI to TF)"]
            tff_iiif["tff.convert.iiif
(IIIF Manifest generation)"]
            tff_fromxml["tff.convert.xml"]
            tfdata@{ shape: docs, label: "Text Fabric Data"}
            tff_watm["tff.convert.watm
(Conversion step)"]
            watm@{ shape: docs, label: "WATM (Web Annotation Text Model)
(internal intermediary representation)"}
            tff_fromtei_conversion ==> tfdata
            tff_fromxml ==> tfdata
            tfdata ==> tff_iiif ==> manifests
            tfdata ==> tff_watm ==> watm
            manifests@{ shape: docs, label: "IIIF manifests
(to be served statically)"}
        end
        with_textfabric_factory ~~~ editem_apparatus
        peen --> editem_apparatus
        teisource ==> editem_apparatus
        editem_apparatus["editem-apparatus
(Extract structured data from apparatus TEI)"]
        apparatus_json@{ shape: docs, label: "Apparatus data (JSON)
(to be served statically)"}
        editem_apparatus ==> apparatus_json
        peen --> with_untanngle
        subgraph with_untanngle["With un-t-ann-gle"]
            direction TB
            watm ==> untanngle_tf
            untanngle_tf["untanngle.tf
(Create texts and web annotations from WATM/TF joined data)"]
            untanngle_tf ==> untanngle_uploader
            untanngle_conversion["un-t-ann-gle
(Project specific conversion pipelines to create texts and web annotations from joined data)"]
            pagexmlsource == (in globalise project) ==> untanngle_conversion
            cafsource == (in republic project) ==> untanngle_conversion
            untanngle_conversion ==> untanngle_uploader
            untanngle_uploader["un-t-ann-gle uploader
(Generic uploader for annorepo/textrepo)"]
            untanngle_annorepo_client@{shape: subproc, label: "annorepo-client (python)"}
            untanngle_textrepo_client@{shape: subproc, label: "textrepo-client (python)"}
            webanno@{ shape: docs, label: "W3C Web Annotations
(Stand-off annotations, JSONL)"}
            textsegments@{ shape: docs, label: "Text Segments
(JSON for Textrepo)"}
            untanngle_uploader ==> textsegments ==> untanngle_textrepo_client
            untanngle_uploader ==> webanno ==> untanngle_annorepo_client
        end
    end
    subgraph ingest
        direction TB
        untanngle_annorepo_client -- "HTTPS POST/PUT + W3C Web Annotation Protocol" --> annorepo
        untanngle_textrepo_client -- "HTTPS POST/PUT + TextRepo API" --> textrepo
        techuser@{ shape: sl-rect, label: "Technical user/machine
via a web client"}
        techuser --> indexer
        annorepo[/"Annorepo
(web annotation server)"/]
        mongodb[/"MongoDB
(NoSQL database server)"/]
        annorepo_db[("Annotation Database")]
        annorepo --> mongodb --> annorepo_db
        indexer["Indexer
(project-specific, multiple implementations exist)"]
        peen --> indexer
        indexer -- "HTTP(S) GET" --> annorepo
        indexer -- "HTTP(S) GET" --> textrepo
        textrepo[/"Textrepo
(text server)"/]
        postgresql[/"Postgresql
(Database System)"/]
        textrepo -- "Postgresql" --> postgresql
        indexer -- "HTTP(S) POST + ElasticSearch API" --> elasticsearch
        subgraph brinta
            elasticsearch[/"ElasticSearch
(Search engine)"/]
            searchindex[("Text and annotation index
(for full-text search and faceted annotation search)")]
            elasticsearch --> searchindex
        end
    end
    sources ~~~ conversion ~~~ ingest
    classDef thirdparty fill:#ccc,color:#111
    class mongodb,elasticsearch,postgresql thirdparty
    classDef abstract color:#f00
    class indexer abstract
    linkStyle default background:transparent,color:#009

Legend

  • Thick lines represent data flow rather than caller direction.
  • Nodes with red text denote abstractions rather than specific software.
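The "untangling" at the heart of this pipeline can be illustrated in miniature: inline XML markup is separated into plain text plus stand-off annotations with character offsets. This is a simplified sketch of the principle, not the actual un-t-ann-gle code:

```python
import xml.etree.ElementTree as ET

def untangle(xml_string: str) -> tuple[str, list[dict]]:
    """Separate inline markup into plain text + stand-off annotations.
    Each annotation records an element's tag and the character span its
    content occupies in the extracted plain text."""
    root = ET.fromstring(xml_string)
    text_parts: list[str] = []
    annotations: list[dict] = []
    offset = 0

    def walk(elem: ET.Element) -> None:
        nonlocal offset
        start = offset
        if elem.text:
            text_parts.append(elem.text)
            offset += len(elem.text)
        for child in elem:
            walk(child)
            if child.tail:  # text following the child, inside elem
                text_parts.append(child.tail)
                offset += len(child.tail)
        annotations.append({"tag": elem.tag, "start": start, "end": offset})

    walk(root)
    return "".join(text_parts), annotations

text, annos = untangle("<p>Dear <name>Theo</name>, greetings.</p>")
```

The resulting plain text can be uploaded to the text server while the offset-based annotations go to the annotation server, keeping both independently queryable.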

2.3. Conversion with STAM

This pipeline with STAM is currently used in the Brieven van Hooft project (with FoLiA source) and has also been tested for Van Gogh. This pipeline does not involve scans.

%%{init: {"flowchart": {"htmlLabels": true}} }%%
flowchart TD

    subgraph sources["Sources (pick one)"]
        direction LR
        teisource@{ shape: docs, label: "Enriched texts
(TEI XML)"}
        foliasource@{ shape: docs, label: "Enriched texts
(FoLiA XML)"}
        pagexmlsource@{ shape: docs, label: "Enriched texts
(Page XML)"}
    end
    techuser@{ shape: sl-rect, label: "Technical user
(command-line)"}
    techuser --> conversion
    subgraph conversion
        direction LR
        subgraph with_folia_tools["with FoLiA-tools (CLI)"]
            direction TB
            foliasource ==> folia2stam
            folia2stam["folia2stam
converts/untangles FoLiA XML to STAM"]
        end
        subgraph with_stam_tools["with stam-tools (CLI)"]
            direction TB
            stam_xmlconfig@{ shape: docs, label: "XML Format Configuration
(XML format specifications,
e.g. for TEI)
"}
            stam_xmlconfig ==> stam_fromxml
            stam_fromxml["stam fromxml
converts/untangles XML to STAM"]
            stam_webanno["stam webanno
Conversion to W3C Web Annotations"]
            stam_annotations@{ shape: docs, label: "STAM Annotations
(stand-off annotations with references to texts
STAM JSON/CSV/CBOR or non-serialised in memory)
"}
            teisource ==> stam_fromxml
            pagexmlsource ==> stam_fromxml
            folia2stam ==> stam_annotations
            stam_fromxml ==> stam_annotations
            stam_annotations ==> stam_webanno
        end
    end
    subgraph targets
        direction LR
        folia2stam ==> texts
        stam_fromxml ==> texts
        stam_webanno ==> webanno
        texts@{ shape: docs, label: "Texts
Plain texts (UTF-8)"}
        webanno@{ shape: docs, label: "W3C Web Annotations
Stand-off annotations (JSONL + JSON-LD)"}
    end
    subgraph ingest
        direction TB
        texts ==> uploader
        webanno ==> uploader
        uploader["Uploader
(Simple project-specific uploader script, python)"]
        uploader --> uploader_annorepo_client
        uploader --> uploader_textrepo_client
        uploader_annorepo_client[["annorepo-client (python)"]]
        uploader_textrepo_client[["textrepo-client (python)"]]
        uploader_annorepo_client -- "HTTPS POST/PUT + W3C Web Annotation Protocol" --> annorepo
        uploader_textrepo_client -- "HTTPS POST/PUT + TextRepo API" --> textrepo
        annorepo[/"Annorepo
(web annotation server)"/]
        mongodb[/"MongoDB
(NoSQL database server)"/]
        annorepo_db[("Annotation Database")]
        annorepo --> mongodb --> annorepo_db
        indexer["Indexer
(project-specific, multiple implementations exist)"]
        indexer -- "HTTP(S) GET" --> annorepo
        indexer -- "HTTP(S) GET" --> textrepo
        textrepo[/"Textrepo
(text server)"/]
        postgresql[/"Postgresql
(Database System)"/]
        textrepo -- "Postgresql" --> postgresql
        indexer -- "HTTP(S) POST + ElasticSearch API" --> elasticsearch
        subgraph brinta
            elasticsearch[/"ElasticSearch
(Search engine)"/]
            searchindex[("Text and annotation index
(for full-text search and faceted annotation search)")]
            elasticsearch --> searchindex
        end
        uploader -. "(manually invoked afterwards)" -.-> indexer
    end
    linkStyle default background:transparent,color:#009

Notes

  • TextRepo may be substituted with TextSurf in the future
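The uploader's annotation POSTs follow the W3C Web Annotation Protocol: an annotation is added by POSTing JSON-LD to an annotation container with the protocol's media type. The sketch below only constructs such a request (it is never sent); the container URL is hypothetical:

```python
import json
import urllib.request

# Media type mandated by the W3C Web Annotation Protocol for annotations
ANNO_CONTENT_TYPE = 'application/ld+json; profile="http://www.w3.org/ns/anno.jsonld"'

def make_upload_request(container_url: str,
                        annotation: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that adds one annotation
    to an annotation container, per the Web Annotation Protocol."""
    body = json.dumps(annotation).encode("utf-8")
    return urllib.request.Request(
        container_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": ANNO_CONTENT_TYPE,
            "Accept": ANNO_CONTENT_TYPE,
        },
    )

req = make_upload_request(
    "https://annorepo.example.org/w3c/my-container/",  # hypothetical URL
    {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "body": {"type": "TextualBody", "value": "example"},
        "target": "https://example.org/texts/1.txt",
    },
)
```

On success the server responds with 201 Created and a Location header holding the new annotation's IRI.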

3. Data Enrichment pipelines

Most data enrichment pipelines are documented elsewhere (direct links to schemas or READMEs/documentation with schemas):

4. Data Models

Data models can be found elsewhere as well (direct links to schemas or READMEs/documentation with schemas):