Team Text - Software Architecture Overview
Introduction
This document presents the wider architecture developed by Team Text at the KNAW Humanities Cluster. All in-house software mentioned in this documentation can be found via https://tools.huc.knaw.nl.
The document serves both as an internal reference and as a technical showcase for external parties.
1. Service Oriented Architecture for Text Collections
We have ample experience publishing diverse scientific text collections. These may be literary text editions, historical manuscripts, linguistically annotated collections, or large corpora produced by automatic OCR or Handwritten Text Recognition.
1.1. Current SOA for Text Collections
This is our current Service Oriented Architecture for making (enriched) text collections available; it still relies on TextRepo. TextAnnoViz is the front-end that end-users mostly interact with, via their web browsers, to browse and search texts, their original scans, and annotations on either.
%%{init: {"flowchart": {"htmlLabels": true},
'themeVariables': {
'edgeLabelBackground': 'transparent'
}
}}%%
flowchart TD
user@{ shape: sl-rect, label: "End-user (Researcher)
in web browser"}
user -- "HTTPS (UI)" --> textannoviz
subgraph frontend
textannoviz[/"TextAnnoViz
(web front-end)"/]
mirador@{shape: subproc, label: "Mirador
IIIF Image viewer"}
textannoviz --> mirador
end
techuser@{ shape: sl-rect, label: "Technical user/machine
via a web client"}
techuser -- "HTTPS + Broccoli API" --> broccoli
subgraph middleware
textannoviz -- "HTTPS + Broccoli API" --> broccoli
broccoli[/"Broccoli
(broker)"/]
broccoli_annorepoclient@{shape: subproc, label: "annorepo-client (java)"}
broccoli --> broccoli_annorepoclient
end
subgraph backend
annorepo[/"Annorepo
(web annotation server)"/]
mongodb[/"MongoDB
(NoSQL database server)"/]
annorepo_db[("Annotation Database")]
annorepo -- "MongoDB Wire Protocol + Query API" --> mongodb --> annorepo_db
textscans@{ shape: docs, label: "Text Scans
(image files)"}
textdb@{ shape: database, label: "Texts (with metadata) database"}
textrepo[/"Textrepo
(text server)"/]
postgresql[/"PostgreSQL
(Database System)"/]
subgraph brinta
broccoli -- "HTTP(S) + ElasticSearch API" --> elasticsearch
elasticsearch[/"ElasticSearch
(Search engine)"/]
searchindex[("Text and annotation index
(for full-text search and faceted search)")]
elasticsearch --> searchindex
end
textrepo -- "PostgreSQL" --> postgresql
postgresql --> textdb
cantaloupe[/"Cantaloupe
(IIIF Image server)"/]
manifests@{ shape: docs, label: "IIIF Manifests"}
cantaloupe --> textscans
broccoli_annorepoclient -- "HTTP(S) + W3C Web Annotation Protocol" --> annorepo
broccoli -- "HTTP(S) + TextRepo API" --> textrepo
manifest_server[/"nginx
(static manifest server)"/]
manifest_server --> manifests
mirador -- "HTTPS + IIIF Image API" --> cantaloupe
mirador -- "HTTPS" --> manifest_server
end
classDef thirdparty fill:#ccc,color:#111
class cantaloupe,mongodb,elasticsearch,postgresql,mirador,manifest_server thirdparty
linkStyle default background:transparent,color:#009
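In the diagram above, Mirador fetches image regions from Cantaloupe over the IIIF Image API. A request URL follows the fixed pattern `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}` defined by the IIIF Image API specification. The sketch below illustrates that URL composition; the base URL and image identifier are hypothetical examples, not an actual deployment:

```python
# Sketch of IIIF Image API URL composition (IIIF Image API 3.0 URL syntax).
# Base URL and identifier are illustrative; real values depend on the
# Cantaloupe deployment.

def iiif_image_url(base: str, identifier: str, region: str = "full",
                   size: str = "max", rotation: str = "0",
                   quality: str = "default", fmt: str = "jpg") -> str:
    """Build {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}"""
    return f"{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

# Full image at maximum size:
url = iiif_image_url("https://example.org/iiif/3", "scan0001.jpg")
# A 200x200 pixel region starting at (100,50), scaled to 64 pixels wide:
detail = iiif_image_url("https://example.org/iiif/3", "scan0001.jpg",
                        region="100,50,200,200", size="64,")
```

This is how a viewer like Mirador can request only the visible tiles of a large scan instead of downloading the whole image.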
Legend
- Arrows follow caller (or loader) direction, response data flows in opposite direction. Edge labels denote communication protocols.
- Rectangles represent processes.
- Parallelograms represent networked processes (i.e. services).
- Rectangles with an extra marked block left and right represent software libraries.
- Third-party software is grayed out.
- All components (in any of frontend, middleware, and backend) are configurable via external configuration files. These are not explicitly drawn in the schema.
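The search index in the brinta subgraph serves both full-text and faceted search. As a rough illustration, a minimal ElasticSearch request body combining a full-text match with a terms aggregation (which drives a facet) might look as follows; the index mapping, field names, and values here are hypothetical, not the actual project mappings:

```python
import json

# Hypothetical ElasticSearch query body: full-text search on a "text" field,
# filtered and faceted on an "annotationType" field. Field names are
# illustrative only; real mappings are project-specific.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"text": "amsterdam"}}],
            "filter": [{"term": {"annotationType": "letter"}}],
        }
    },
    "aggs": {
        "types": {"terms": {"field": "annotationType", "size": 10}}
    },
    "size": 20,
}

# Broccoli would POST a body like this to /{index}/_search on ElasticSearch.
body = json.dumps(query)
```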
Notes
- Web annotations produced by this pipeline have custom selectors for TextRepo that are not part of the W3C Web Annotation Data Model.
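For contrast, a standards-conformant Web Annotation targets a text via selectors defined in the W3C Web Annotation Data Model, such as a TextPositionSelector over character offsets. The example below shows that standard shape (the custom TextRepo selectors mentioned above are not shown); all URLs and offsets are illustrative:

```python
import json

# A minimal standards-conformant W3C Web Annotation using a
# TextPositionSelector. The current pipeline deviates from this by using
# custom TextRepo selectors instead. URLs and offsets are illustrative only.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "body": {"type": "TextualBody", "value": "a person name", "purpose": "tagging"},
    "target": {
        "source": "https://example.org/texts/letter-42",
        "selector": {
            "type": "TextPositionSelector",
            "start": 120,  # character offset, inclusive
            "end": 133,    # character offset, exclusive
        },
    },
}
serialized = json.dumps(annotation, indent=2)
```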
1.2. New proposed SOA for Text Collections
This is our new proposed Service Oriented Architecture for making (enriched) text collections available; it replaces TextRepo with Textsurf and adds a query-expansion service (Kweepeer).
%%{init: {"flowchart": {"htmlLabels": true},
'themeVariables': {
'edgeLabelBackground': 'transparent'
}
}}%%
flowchart TD
user@{ shape: sl-rect, label: "End-user (Researcher)
in web browser"}
user -- "HTTPS (UI)" --> textannoviz
user -- "HTTPS (UI)" --> annorepodashboard
subgraph frontend
textannoviz[/"TextAnnoViz
(web front-end)"/]
mirador@{shape: subproc, label: "Mirador
IIIF Image viewer"}
kweepeerfrontend@{shape: subproc, label: "Kweepeer Frontend
(Query expansion UI)"}
textannoviz --> mirador
textannoviz --> kweepeerfrontend
annorepodashboard[/"AnnoRepo Dashboard
(explorative and administrative front-end for annotations)"/]
end
techuser@{ shape: sl-rect, label: "Technical user/machine
via a web client"}
techuser -- "HTTPS + Broccoli API" --> broccoli
subgraph middleware
textannoviz -- "HTTPS + Broccoli API" --> broccoli
broccoli[/"Broccoli
(broker)"/]
broccoli_annorepoclient@{shape: subproc, label: "annorepo-client (java)"}
broccoli_elasticclient@{shape: subproc, label: "elasticsearch-java
(client)"}
broccoli --> broccoli_annorepoclient
broccoli --> broccoli_elasticclient
end
subgraph backend
annorepo[/"Annorepo
(web annotation server)"/]
mongodb[/"MongoDB
(NoSQL database server)"/]
annorepo_db[("Annotation Database")]
annorepo --> mongodb --> annorepo_db
annorepodashboard --> annorepo
subgraph brinta
broccoli_elasticclient -- "HTTP(S) + ElasticSearch API" --> elasticsearch
elasticsearch[/"ElasticSearch
(Search engine)"/]
searchindex[("Text and annotation index
(for full-text search and faceted search)")]
elasticsearch --> searchindex
end
texts@{ shape: docs, label: "Text files
(plain text, UTF-8)"}
textscans@{ shape: docs, label: "Text Scans
(image files)"}
textsurf[/"Textsurf
(text server)"/]
textframe@{shape: subproc, label: "Textframe
(text referencing library)"}
cantaloupe[/"Cantaloupe
(IIIF Image server)"/]
manifests@{ shape: docs, label: "IIIF Manifests"}
cantaloupe --> textscans
kweepeer[/"Kweepeer
(Query Expansion server)"/]
broccoli_annorepoclient -- "HTTP(S) + W3C Web Annotation Protocol" --> annorepo
broccoli -- "HTTP(S) + Textsurf API" --> textsurf
mirador -- "HTTPS" --> manifest_server
mirador -- "HTTPS + IIIF Image API" --> cantaloupe
kweepeerfrontend -- "HTTP(S) + Kweepeer API" --> kweepeer
manifest_server[/"nginx
(static manifest server)"/]
manifest_server --> manifests
textsurf --> textframe --> texts
textsurf --> texts
end
classDef thirdparty fill:#ccc,color:#111
class cantaloupe,mongodb,elasticsearch,mirador,manifest_server,broccoli_elasticclient thirdparty
linkStyle default background:transparent,color:#009
Notes
- Kweepeer is not expanded further in this schema; see https://github.com/knaw-huc/kweepeer/blob/master/README.md#architecture for its internal architecture.
- Web annotations produced by this pipeline no longer have custom selectors but fully adhere to the standard.
- All components (in any of frontend, middleware, and backend) are configurable via external configuration files. These are not explicitly drawn in the schema.
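Query expansion, as provided by Kweepeer, rewrites a user query into a broader one, for example by adding historical spelling variants or synonyms per term. The toy sketch below only conveys the idea; it is not Kweepeer's actual API or algorithm (its real expansion modules are documented in its README), and the variant lexicon is invented:

```python
# Toy query expansion: map each query term to known variants and build an
# OR-group per term. Conceptual illustration only; the lexicon below is
# a hypothetical example, not real data.
VARIANTS = {
    "amsterdam": ["amsteldam", "amstelredam"],  # hypothetical spelling variants
}

def expand_query(query: str) -> str:
    groups = []
    for term in query.lower().split():
        alts = [term] + VARIANTS.get(term, [])
        groups.append("(" + " OR ".join(alts) + ")" if len(alts) > 1 else term)
    return " AND ".join(groups)

expanded = expand_query("Amsterdam canal")
```

A front-end like the Kweepeer Frontend would let the user inspect and adjust such expansions before the broadened query is sent to the search engine.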
1.3. Potential SOA for Text Collections with STAM
This is a potential and highly experimental architecture that swaps out various components for STAM-based solutions. Though STAM is implemented, it is not yet integrated into a wider architecture like this one. It is presented here merely as an option for consideration.
%%{init: {"flowchart": {"htmlLabels": true},
'themeVariables': {
'edgeLabelBackground': 'transparent'
}
}}%%
flowchart TD
user@{ shape: sl-rect, label: "End-user (Researcher)
in web browser"}
user -- "HTTPS (UI)" --> textannoviz
subgraph frontend
textannoviz[/"TextAnnoViz
(web front-end)"/]
mirador@{shape: subproc, label: "Mirador
IIIF Image viewer"}
textannoviz --> mirador
tavconf@{ shape: doc, label: "TextAnnoViz Configuration
(project specific)"}
textannoviz --> tavconf
end
techuser@{ shape: sl-rect, label: "Technical user/machine
via a web client"}
techuser -- "HTTPS + Broccoli API" --> broccoli
subgraph middleware
textannoviz -- "HTTPS + Broccoli API" --> broccoli
broccoli[/"Broccoli
(broker)"/]
broccoli --> brocconf
brocconf@{ shape: doc, label: "Broccoli Configuration
(project specific)"}
end
textannoviz -. "HTTP(S) + W3C Web Annotation protocol (subset),
STAM text referencing API
and/or STAM Query Language" .-> stamd
subgraph backend
stamd[/"stamd
(text and annotation server)"/]
stamrust@{shape: subproc, label: "stam-rust
(STAM library)"}
stamrust --> textframe
texts@{ shape: docs, label: "Text files
(plain text, UTF-8)"}
annotations@{ shape: docs, label: "STAM Annotations
(STAM JSON/CBOR)"}
textscans@{ shape: docs, label: "Text Scans
(image files)"}
textframe@{shape: subproc, label: "Textframe
(text referencing library)"}
cantaloupe[/"Cantaloupe
(IIIF Image server)"/]
manifests@{ shape: docs, label: "IIIF Manifests"}
cantaloupe --> textscans
broccoli -- "HTTP(S) + W3C Web Annotation protocol (subset),
STAM text referencing API
and/or STAM Query Language" --> stamd
mirador -- "HTTPS" --> manifest_server
mirador -- "HTTPS + IIIF Image API" --> cantaloupe
manifest_server[/"nginx
(static manifest server)"/]
manifest_server --> manifests
stamd --> stamrust
textframe --> texts
stamrust --> annotations
end
classDef thirdparty fill:#ccc,color:#111
class cantaloupe,mirador,manifest_server thirdparty
linkStyle default background:transparent,color:#009
Notes
- There are currently three major caveats:
    - The STAM library does not yet provide a full-text index; it is not a drop-in replacement for ElasticSearch.
    - The STAM implementation is currently memory-bound: all annotations are loaded into memory (which makes it very fast), but this will not scale to huge corpora.
        - The same goes for the texts themselves, but a solution for that is already proposed (though not yet implemented) in this architecture: using textframe in stamd.
    - The caller logic in Broccoli (or potentially in TextAnnoViz, see the next point) would change drastically.
- The entire middleware layer (the broker) can be omitted entirely if the caller logic is implemented in TextAnnoViz. The dotted line represents this option.
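STAM models annotations as stand-off references to character offsets in immutable plain texts, held in memory. The sketch below conveys that data model in a few lines of Python; it is a conceptual illustration of the stand-off principle, not stam-rust's actual API, and all names and texts are invented:

```python
from dataclasses import dataclass

# Conceptual illustration of STAM's stand-off model: an annotation references
# a span of characters in a plain-text resource; nothing is stored inline in
# the text. This also shows why the implementation is memory-bound: both
# texts and annotations live in RAM. (Not the actual stam-rust API.)

@dataclass
class TextSelector:
    resource: str   # identifier of the text resource
    begin: int      # character offset, inclusive
    end: int        # character offset, exclusive

@dataclass
class Annotation:
    selector: TextSelector
    data: dict      # annotation data, e.g. {"type": "person"}

resources = {"letter-42": "Vincent wrote to Theo from Arles."}

ann = Annotation(TextSelector("letter-42", 0, 7), {"type": "person"})

def text_of(ann: Annotation) -> str:
    """Resolve an annotation's selector back to the covered text."""
    sel = ann.selector
    return resources[sel.resource][sel.begin:sel.end]
```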
2. Data Conversion Pipelines
2.1. Current conversion pipeline for Text Collections
Text Fabric [Factory] and un-t-ann-gle are used in the Suriano, Translatin, Van Gogh and Mondriaan projects. Un-t-ann-gle is used standalone in Republic (CAF data) and Globalise (PageXML data).
%%{init: {"flowchart": {"htmlLabels": true}} }%%
flowchart TD
subgraph sources["Sources (pick one)"]
direction LR
teisource@{ shape: docs, label: "Enriched texts
(TEI XML)"}
pagexmlsource@{ shape: docs, label: "Enriched texts
(Page XML)"}
cafsource@{ shape: docs, label: "Enriched texts
(CAF)"}
end
user@{ shape: sl-rect, label: "End-user (Data manager)
in web browser"}
user -- "HTTPS (UI)" --> peen
subgraph frontend
peen[/"Preview Editor's ENvironment (PEEN)"/]
end
subgraph conversion
direction TB
peen --> with_textfabric_factory
subgraph with_textfabric_factory["With Text Fabric Factory (Python API)"]
direction TB
teisource ==> tff_fromtei_validation
pagexmlsource ==> tff_fromxml
tff_fromtei_validation["tff.convert.tei
(Validation step)"]
tff_fromtei_validation ==> tff_fromtei_conversion
tff_fromtei_conversion["tff.convert.tei
(Conversion step, from TEI to TF)"]
tff_iiif["tff.convert.iiif
(IIIF Manifest generation)"]
tff_fromxml["tff.convert.xml"]
tfdata@{ shape: docs, label: "Text Fabric Data"}
tff_watm["tff.convert.watm
(Conversion step)"]
watm@{ shape: docs, label: "WATM (Web Annotation Text Model)
(internal intermediary representation)"}
tff_fromtei_conversion ==> tfdata
tff_fromxml ==> tfdata
tfdata ==> tff_iiif ==> manifests
tfdata ==> tff_watm ==> watm
manifests@{ shape: docs, label: "IIIF manifests
(to be served statically)"}
end
with_textfabric_factory ~~~ editem_apparatus
peen --> editem_apparatus
teisource ==> editem_apparatus
editem_apparatus["editem-apparatus
(Extract structured data from apparatus TEI)"]
apparatus_json@{ shape: docs, label: "Apparatus data (JSON)
(to be served statically)"}
editem_apparatus ==> apparatus_json
peen --> with_untanngle
subgraph with_untanngle["With un-t-ann-gle"]
direction TB
watm ==> untanngle_tf
untanngle_tf["untanngle.tf
(Create texts and web annotations from WATM/TF joined data)"]
untanngle_tf ==> untanngle_uploader
untanngle_conversion["un-t-ann-gle
(Project specific conversion pipelines to create texts and web annotations from joined data)"]
pagexmlsource == (in globalise project) ==> untanngle_conversion
cafsource == (in republic project) ==> untanngle_conversion
untanngle_conversion ==> untanngle_uploader
untanngle_uploader["un-t-ann-gle uploader
(Generic uploader for annorepo/textrepo)"]
untanngle_annorepo_client@{shape: subproc, label: "annorepo-client (python)"}
untanngle_textrepo_client@{shape: subproc, label: "textrepo-client (python)"}
webanno@{ shape: docs, label: "W3C Web Annotations
(Stand-off annotations, JSONL)"}
textsegments@{ shape: docs, label: "Text Segments
(JSON for Textrepo)"}
untanngle_uploader ==> textsegments ==> untanngle_textrepo_client
untanngle_uploader ==> webanno ==> untanngle_annorepo_client
end
end
subgraph ingest
direction TB
untanngle_annorepo_client -- "HTTPS POST/PUT + W3C Web Annotation Protocol" --> annorepo
untanngle_textrepo_client -- "HTTPS POST/PUT + TextRepo API" --> textrepo
techuser@{ shape: sl-rect, label: "Technical user/machine
via a web client"}
techuser --> indexer
annorepo[/"Annorepo
(web annotation server)"/]
mongodb[/"MongoDB
(NoSQL database server)"/]
annorepo_db[("Annotation Database")]
annorepo --> mongodb --> annorepo_db
indexer["Indexer
(project-specific, multiple implementations exist)"]
peen --> indexer
indexer -- "HTTP(S) GET" --> annorepo
indexer -- "HTTP(S) GET" --> textrepo
textrepo[/"Textrepo
(text server)"/]
postgresql[/"PostgreSQL
(Database System)"/]
textrepo -- "PostgreSQL" --> postgresql
indexer -- "HTTP(S) POST + ElasticSearch API" --> elasticsearch
subgraph brinta
elasticsearch[/"ElasticSearch
(Search engine)"/]
searchindex[("Text and annotation index
(for full-text search and faceted annotation search)")]
elasticsearch --> searchindex
end
end
sources ~~~ conversion ~~~ ingest
classDef thirdparty fill:#ccc,color:#111
class mongodb,elasticsearch,postgresql thirdparty
classDef abstract color:#f00
class indexer abstract
linkStyle default background:transparent,color:#009
Legend
- Thick lines represent data flow rather than caller direction.
- Nodes with red text denote abstractions rather than specific software.
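The conversion steps above all perform some form of "untangling": inline markup is separated into plain text plus stand-off annotations over character offsets. The following is a minimal sketch of that operation on a TEI-like fragment; real pipelines (Text Fabric Factory, un-t-ann-gle) handle far more structure, and the fragment here is invented for illustration:

```python
import xml.etree.ElementTree as ET

# Minimal 'untangling' sketch: walk an inline-annotated XML fragment,
# accumulate the plain text, and record each element as a stand-off
# annotation over character offsets into that text.

def untangle(xml_fragment: str):
    root = ET.fromstring(xml_fragment)
    text_parts, annotations = [], []

    def walk(elem):
        start = sum(len(p) for p in text_parts)
        if elem.text:
            text_parts.append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:
                text_parts.append(child.tail)
        end = sum(len(p) for p in text_parts)
        annotations.append({"tag": elem.tag, "start": start, "end": end})

    walk(root)
    return "".join(text_parts), annotations

text, anns = untangle("<p>Dear <persName>Theo</persName>, hello.</p>")
```

After untangling, the markup survives only as stand-off spans, so the plain text can be stored and versioned independently of its annotations.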
2.3. Conversion with STAM
This pipeline with STAM is currently used in the Brieven van Hooft project (with FoLiA sources) and has also been tested for Van Gogh. This pipeline does not involve scans.
%%{init: {"flowchart": {"htmlLabels": true}} }%%
flowchart TD
subgraph sources["Sources (pick one)"]
direction LR
teisource@{ shape: docs, label: "Enriched texts
(TEI XML)"}
foliasource@{ shape: docs, label: "Enriched texts
(FoLiA XML)"}
pagexmlsource@{ shape: docs, label: "Enriched texts
(Page XML)"}
end
techuser@{ shape: sl-rect, label: "Technical user
(command-line)"}
techuser --> conversion
subgraph conversion
direction LR
subgraph with_folia_tools["with FoLiA-tools (CLI)"]
direction TB
foliasource ==> folia2stam
folia2stam["folia2stam
converts/untangles FoLiA XML to STAM"]
end
subgraph with_stam_tools["with stam-tools (CLI)"]
direction TB
stam_xmlconfig@{ shape: docs, label: "XML Format Configuration
(XML format specifications,
e.g. for TEI)"}
stam_xmlconfig ==> stam_fromxml
stam_fromxml["stam fromxml
converts/untangles XML to STAM"]
stam_webanno["stam webanno
Conversion to W3C Web Annotations"]
stam_annotations@{ shape: docs, label: "STAM Annotations
(stand-off annotations with references to texts
STAM JSON/CSV/CBOR or non-serialised in memory)"}
teisource ==> stam_fromxml
pagexmlsource ==> stam_fromxml
folia2stam ==> stam_annotations
stam_fromxml ==> stam_annotations
stam_annotations ==> stam_webanno
end
end
subgraph targets
direction LR
folia2stam ==> texts
stam_fromxml ==> texts
stam_webanno ==> webanno
texts@{ shape: docs, label: "Texts
Plain texts (UTF-8)"}
webanno@{ shape: docs, label: "W3C Web Annotations
Stand-off annotations (JSONL + JSON-LD)"}
end
subgraph ingest
direction TB
texts ==> uploader
webanno ==> uploader
uploader["Uploader
(Simple project-specific uploader script, python)"]
uploader --> uploader_annorepo_client
uploader --> uploader_textrepo_client
uploader_annorepo_client[["annorepo-client (python)"]]
uploader_textrepo_client[["textrepo-client (python)"]]
uploader_annorepo_client -- "HTTPS POST/PUT + W3C Web Annotation Protocol" --> annorepo
uploader_textrepo_client -- "HTTPS POST/PUT + TextRepo API" --> textrepo
annorepo[/"Annorepo
(web annotation server)"/]
mongodb[/"MongoDB
(NoSQL database server)"/]
annorepo_db[("Annotation Database")]
annorepo --> mongodb --> annorepo_db
indexer["Indexer
(project-specific, multiple implementations exist)"]
indexer -- "HTTP(S) GET" --> annorepo
indexer -- "HTTP(S) GET" --> textrepo
textrepo[/"Textrepo
(text server)"/]
postgresql[/"PostgreSQL
(Database System)"/]
textrepo -- "PostgreSQL" --> postgresql
indexer -- "HTTP(S) POST + ElasticSearch API" --> elasticsearch
subgraph brinta
elasticsearch[/"ElasticSearch
(Search engine)"/]
searchindex[("Text and annotation index
(for full-text search and faceted annotation search)")]
elasticsearch --> searchindex
end
uploader -. "(manually invoked afterwards)" .-> indexer
end
linkStyle default background:transparent,color:#009
Notes
- TextRepo may be substituted with Textsurf in the future.
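The final serialization step in this pipeline emits W3C Web Annotations as JSONL, one annotation per line, ready for upload to AnnoRepo. The sketch below shows that shape starting from simple stand-off records; it is an illustration of the output format, not the actual `stam webanno` implementation, and the base URL and records are invented:

```python
import json

# Sketch of JSONL serialization of W3C Web Annotations from simple stand-off
# records. Base URL and records are hypothetical; the real conversion is done
# by 'stam webanno' in stam-tools.
standoff = [
    {"resource": "letter-1.txt", "start": 0, "end": 7, "type": "person"},
    {"resource": "letter-1.txt", "start": 23, "end": 28, "type": "place"},
]

def to_webanno(rec: dict, base: str = "https://example.org/texts/") -> dict:
    """Wrap one stand-off record as a W3C Web Annotation."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "body": {"type": "TextualBody", "purpose": "classifying",
                 "value": rec["type"]},
        "target": {
            "source": base + rec["resource"],
            "selector": {"type": "TextPositionSelector",
                         "start": rec["start"], "end": rec["end"]},
        },
    }

# One JSON object per line (JSONL), as consumed by the uploader.
jsonl = "\n".join(json.dumps(to_webanno(r)) for r in standoff)
```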
3. Data Enrichment Pipelines
Most data enrichment pipelines are documented elsewhere (direct links to schemas or READMEs/documentation with schemas):
4. Data Models
Data models can be found elsewhere as well (direct links to schemas or READMEs/documentation with schemas):