“Interactive Knowledge Stack” (IKS) is an
integrating project targeting small to medium Content Management
Systems (CMS) providers in Europe providing technology platforms for
content and knowledge management to thousands of end user
organizations. Current CMS technology platforms lack the capability for
semantic web enabled, intelligent content, and therefore lack the
capacity for users to interact with the content at the user’s
knowledge level. The objective of IKS, therefore, is to bring semantic
capabilities to current CMS frameworks. IKS puts forward the
“Semantic CMS Technology Stack” which merges the
advances in semantic web infrastructure and services with CMS industry
needs of coherent architectures that fit into existing technology
landscapes. IKS will provide the specifications and at least one Open
Source Reference Implementation of the full IKS Stack. To validate the
IKS Stack, prototype solutions will be developed for industrial use cases
ranging from ambient intelligence infotainment and project management and
controlling to an online holiday booking system.
This document describes the current status of the IKS development release 6.0. The
IKS 6.0 release is planned to be the graduation release of Apache Stanbol, the main component
of the IKS. In addition to the official Apache Stanbol software, the IKS 6.0 release contains
further components to complete the stack. With this release the IKS has reached a
stable status and is ready for use by early adopters.
In this deliverable we will focus on the IKS reference architecture and describe its
implementation layer by layer. The aim is to give an overview of the provided technologies
and the concepts behind them. We will use pointers to further readings and sources for the
interested reader who wants to dive deeper into the details. Early adopters can use this
overview to decide which concepts may be useful in their environment and find links to the
concrete sources.
This deliverable is structured as follows: First, we give a motivating introduction to what the
IKS and its technology are about and for whom they are of interest. Then, we briefly recap
the applied development process. The next major part is the introduction of the IKS
reference architecture and its reference implementation. To give concrete advice about how to
use and integrate the different parts of the IKS, we describe a set of IKS service
integration patterns. The last part of this deliverable serves as a reference by
describing the IKS reference implementation layer by layer and providing links to further
sources.
There are several hundred CMS and KMS providers for SMEs in Europe. Most of them are currently
not able to leverage semantics-based technology for use in their systems. This has a negative
impact downstream on thousands of end user organisations which are served by these providers.
Ultimately, tens or hundreds of thousands of knowledge workers in the downstream organisations
are prevented from leveraging their skills.
The technology base of many CMS providers is the LAMP stack (Linux, Apache, MySQL, PHP).
The LAMP stack is not well suited to using semantic web technologies such as RDF or OWL.
Conversely, it is still quite hard to make Semantic Web technologies work in operational
content management systems. However, many new web applications are making use of structuring
mechanisms such as RDFa and this is a first step towards getting semantics into CMS.
Improving the status quo and bringing semantic technologies to CMS vendors is the main
challenge of the IKS project. With the IKS 6.0 release the project delivers a stack of
software components that can be used to extend existing CMS with additional semantic features.
The IKS software goes beyond the simple usage of RDFa and demonstrates which concepts and
semantic technologies are available to create the semantic CMS of tomorrow.
The IKS software was designed to be as unintrusive as possible. Extending existing CMS
without the requirement to make fundamental changes to such systems had high priority. Therefore,
most parts of the IKS software can be used as standalone server-side extensions that can
be integrated via RESTful web service interfaces.
The foundation of the IKS software is a reference architecture for semantic CMS that was developed
as part of the project. This reference architecture gives an overview of the required concepts
that should be reflected by modern semantic CMS. This architecture also reflects the fact
that existing CMS architectures need to be adapted and extended with these new concepts.
Therefore, the IKS reference architecture is based on two columns: the well-known content
column that is already present in existing CMS architectures and the new knowledge column that
contains the new semantic features for modern CMS.
Before we introduce the IKS reference architecture for semantic CMS and its implementation
in detail, we recap the applied development process that led us to this
solution.
The applied development process in this project is a combination of
planning the IKS top-down and learning how the system should be built
by bottom-up prototyping. The IKS Alpha version was the first IKS
version bringing together the results of both approaches. The IKS Alpha
consisted of components developed from the bottom-up approach which
are heavily inspired by the design ideas of the top-down
approach. With IKS Alpha the whole project was at a stage where both
top-down and bottom-up approaches met and the final IKS design and
implementation phase started middle-out. This situation is
depicted in Figure 1.1 and was already discussed in [D5.0 Alpha].
Figure 1.1: Development Process Overview
The IKS 6.0 release reflected in this document is the result of the iterative development process started
with the middle-out approach. The development was organized in a distributed fashion with
occasional face-to-face meetings. As the IKS implementation is mainly driven by the two open-source
projects called VIE and Apache Stanbol, the development is heavily inspired by best practices
learned from experienced open-source developers of the IKS consortium. In particular, the decision
to transfer the implementation to the Apache Software Foundation (ASF) had consequences for how
the software was developed. The ASF rules had to be adopted inside the IKS consortium to ensure
an open development process that is transparent to developers from outside the IKS project.
With the incubation of Apache Stanbol in November 2010, the software components matured from
prototypes to a more stable status. The project is visible inside the open-source community and
attracted a number of new developers and early adopters. With IKS 6.0 the Apache Stanbol project
is on the cusp of leaving the incubation phase and becoming a top-level project at the ASF.
In the following we will have a look at the design of the IKS architecture and how it was
implemented as part of the different open-source projects.
The IKS reference architecture (RA) for semantic CMS (SCMS) was published in [Christ11] and
described in [D4.2]. We will focus on the main aspects of this work and refer the interested reader
to the published work. We define an SCMS as follows:
An SCMS is a CMS with the capability of interacting with, extracting,
managing, and storing semantic metadata about content.
Based on this definition an SCMS is designed to manage two types of data.
The first data type is 'content' as the kind of data that is managed by
traditional CMS. Content can be text or any other kind of binary data
like images, video, or sound. A CMS architecture is designed to
organize and store content in an efficient way. For example, textual
content is indexed to become searchable. Many features of a CMS focus on
the management aspect of content, that is, they implement e.g. content
(business) lifecycles and manage responsibilities, access controls, and
output channels. The second kind of data is typically missing in a CMS
or managed only in a rudimentary way. It is 'knowledge' about the content
that is stored within the CMS. An SCMS manages the knowledge explicitly
and offers features to gain knowledge from the available content.
Figure 2.1: IKS Reference Architecture for a Semantic CMS
The IKS RA shown in Figure 2.1 consists of two columns. The left column is already known from
existing CMS systems and represents the features required to handle content. We call it the
'content column'. The right column, called the 'knowledge column', is an extension of the content
column and adds the semantic features to the architecture. On top of both columns we have user
interface features which are based on traditional user interface features plus the new semantic
interaction features.
The knowledge column is divided into
four main layers: Presentation and Interaction, Semantic Lifting, Knowledge Representation
and Reasoning, and Persistence. These four high-level layers are refined into a set of feature
layers. Each feature layer encapsulates required features at the different high-level layers for
an SCMS.
Although they are called 'feature layers', this is not a strictly layered stack architecture, i.e. the
layers are not meant to be necessarily dependent on each other.
There is a logical dependency from top to bottom in this architecture that reflects
the idea of having functionality at lower layers that is used in higher layers. But this does not
imply that each layer can only access the next layer below. The possible combination of features
from each layer will be described by means of so-called IKS service integration patterns in
section 4.
Table 2.1 briefly describes each feature layer.
Table 2.1: Feature Layers of the IKS Reference Architecture
Feature
Short Description
Semantic User Interface
A semantic user interface is able to use the available semantic metadata and adapt
its behaviour based on the provided information. The semantic context of user interactions
affects the user interface.
Semantic User Interaction
A semantic user interface needs information about the semantic context of the current
user interaction. This context has to be provided by the semantic user interaction features
that control how a user interacts with the system.
Knowledge Access
The knowledge access layer encapsulates the access to all knowledge features. This interface
layer is required to ensure a standardized access to all participating services within the
knowledge column.
Content Integration
To integrate existing content from the CMS column, a bridge between the content and knowledge
columns is required. Such a bridge can be used either once, to make all the available
content known to the knowledge column, or incrementally. New content needs to be enriched
and linked to the information provided by the knowledge column. For bidirectional access,
the content access needs to support a standardized interface.
Knowledge Extraction Pipelines
The content stored in the content column is typically not enhanced with additional semantic
metadata. To lift the content to a more semantic level, automatic knowledge extraction
features based on natural language processing are required. Different knowledge extraction
pipelines are used to extract different semantic metadata from the content.
Reasoning
Based on the available semantic metadata and the defined knowledge models it is possible
to infer new knowledge by following semantic relations. Automatic reasoning features are
used to evaluate the available metadata in combination with the knowledge models.
Knowledge Models
Knowledge models are used to internally represent the semantic metadata and to define
the available semantic relations. Such knowledge models are often defined in terms of an
ontology. The knowledge models need to be customizable in order to support different
usage scenarios.
Knowledge Repository
To persist the semantic metadata defined by the knowledge models, a knowledge repository is
used. In contrast to a content repository, the knowledge repository is optimized for storing
semantic metadata and its relations. It supports efficient features to query this
information.
Knowledge Administration
The different knowledge features need to be administered and configured in order to use
them in different and perhaps changing usage scenarios. Each feature has to provide an
administration interface. These administration interfaces are bundled in a centralized
administration console to configure the whole stack.
In the following, we will introduce the IKS reference implementation of
this reference architecture.
The IKS reference implementation (RI) is an instance of the reference architecture (RA) for
semantic CMS (SCMS) presented in the previous section. The RI combines the ideas of the RA with
the implementation provided by our bottom-up driven open-source software projects.
The IKS RI is based on two open-source projects. The
VIE project is focused on the semantic
user interface level. VIE is built on top of modern JavaScript-based web frameworks. Those
frameworks are designed to create interactive web applications that are executed in a web browser.
On top of those web frameworks, the VIE project started to implement the
semantic foundation for future semantic web applications.
The objective of the VIE project is to ease the development of semantic web
applications on the user interaction level. The project started in February 2011 during the IKS
semantic interaction framework workshop in Vienna and now comprises several sub-projects:
the core VIE library, a number of semantic user interface widgets and a set of applications that
build upon these technologies. VIE's service-architecture provides communication to different
backend services directly from the
browser. In our context the most important of these services is the Apache Stanbol service. It mirrors
a segment of the Stanbol REST API that's relevant for using it directly from the user interface
layer. The main features of the VIE Stanbol service are text enhancement, lookup for entities in
the Entity Hub, getting metadata for specific entities, and storing entities that were created or
changed during user interaction. Hiding the complexity of the back-end engines is essential for
building long-lasting front-end applications that are also more resilient to future changes
in the different back-end components.
The second project is the Apache Stanbol project. This project focuses on flexible services on
the server-side of a semantic CMS. Therefore, Apache Stanbol components implement the knowledge
column of the reference architecture starting from the Knowledge Access layer. Apache Stanbol
complements the VIE project as the user interface functionality provided through VIE makes use of
the offered services from Apache Stanbol.
The Apache Stanbol incubating project was created in November 2010. The goal of this project is
to create an open-source community around semantic technologies for CMS. The IKS software components
that started with FISE [D5.0 Alpha] were migrated into the new project. Since then most of the IKS software was
directly developed under Apache Stanbol and is therefore freely available. Having a project at the
Apache Software Foundation ensures that the software developed within the IKS project is freely
available and will be maintained even after the project's end. Through Apache Stanbol the IKS
project creates significant impact on the availability of semantic technologies for CMS vendors.
The Apache Stanbol project is not only driven by IKS members but also attracts increasing attention
from independent open-source developers who contribute to Apache Stanbol.
An overview of all IKS RI components that implement the IKS reference architecture is given by
Figure 2.2.
Figure 2.2: IKS Reference Implementation
Each component of the IKS RI offers or requires a RESTful web service interface as part of the
Knowledge Access layer. As one can see, those RESTful interfaces are not connected to each other
a priori. Instead, each offered service can be used on its own or in a customized combination with
other services. This architecture is realized by the Apache Stanbol project.
Apache Stanbol is a modular set of components and HTTP services for semantic CMS (SCMS). It can
be used to extend traditional CMS with features for content enhancement. The content enhancement
features add semantic data about the plain content available in the CMS. Apache Stanbol's
modularity ensures that, depending on the customer scenario, only the required components need to be
used. In line with the RA, Apache Stanbol provides the components that
are listed in Table 2.2.
The Enhancer and its Enhancement Engines (formerly known as the FISE component) are the
components to enhance given content with additional semantic metadata.
Reasoners can be used for gaining additional knowledge by following the semantic relations
defined in the knowledge base. An example is to infer the additional knowledge that
Bob is the grandfather of Kate from knowing that Pete is the son of Bob and the father of Kate.
Inference Rules, also known as transformation rules, are syntactic rules which take premises and
return a conclusion. These rules can be used, e.g. to transform the metadata into
other vocabularies.
Ontologies are used for defining the knowledge models that describe the metadata of content.
Additionally, the semantics of your metadata can be defined through an ontology. The
reasoners and rule features are based on such ontology definitions.
The Content Hub is the component which provides persistent document storage whose back-end is
Apache Solr. On top of the store, it enables semantic indexing during text-based
document submission and semantic search together with faceted search on the
documents.
The Entity Hub is Apache Stanbol's component to deal with entities and their metadata.
It is a generic component that is able to connect to a configurable list of open-linked
databases. Using these data sources, the Entity Hub
is able to provide information about entities from different sources.
The Fact Store is used to store relations between entities. The Fact Store only stores the
relations and not the entities themselves. It refers to entities only by
their URIs. The entities should be handled by the Entity Hub.
The CMS Adapter component acts as a bridge between JCR/CMIS-compliant content management systems
and Apache Stanbol. It can be used to map existing node structures from JCR/CMIS content
repositories to RDF models or vice versa.
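The grandfather example given for the Reasoners component can be sketched in a few lines. This is a minimal illustration in plain JavaScript over an in-memory list of triples; it does not use the Apache Stanbol Reasoners API, and the predicate names (sonOf, fatherOf, grandfatherOf) are hypothetical.

```javascript
// Facts as [subject, predicate, object] triples; names and predicates are illustrative.
const facts = [
  ["Pete", "sonOf", "Bob"],
  ["Pete", "fatherOf", "Kate"]
];

// Rule: if X is the son of Y and X is the father of Z, then Y is a grandfather of Z.
// (The sketch assumes Y is male, as in the Bob example.)
function inferGrandfathers(triples) {
  const inferred = [];
  for (const [child, p1, parent] of triples) {
    if (p1 !== "sonOf") continue;
    for (const [father, p2, grandchild] of triples) {
      if (p2 === "fatherOf" && father === child) {
        inferred.push([parent, "grandfatherOf", grandchild]);
      }
    }
  }
  return inferred;
}

const inferredFacts = inferGrandfathers(facts);
console.log(inferredFacts);
```

A real reasoner derives such statements from the semantic relations defined in an ontology rather than from a hard-coded rule.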
In the next section, we describe the possibilities to customize and combine the different
IKS RI services, i.e. Apache Stanbol services.
The IKS is implemented as a set of services that can be used and integrated in a customized way.
This service-oriented architecture makes the use of the IKS very flexible on the one hand, but
on the other hand it is more difficult to understand which components are needed in different
use case scenarios. To make it easier for early adopters
to decide and understand which components may be useful for integration with their CMS, we will
describe a set of so-called service integration patterns. These patterns describe the typical use
of IKS components in concrete use cases.
The VIE technology is provided by the IKS at the user interface level. VIE is designed to be
a JavaScript framework that can be extended to implement custom user interface widgets. The VIE
framework is not dependent on Apache Stanbol, but can use the Apache Stanbol RESTful API as an
example to implement semantic interaction in web applications. VIE can work together with any CMS
with the goal of decoupling CMS and user interface. Integrating VIE means including the VIE
JavaScript library and using it to implement the web user interface.
IKS service integration for the knowledge column is a matter of using the API provided by the
Knowledge Access layer. This API is implemented by the Apache Stanbol RESTful API. The only
requirement for a CMS that wants to integrate IKS technology at this level is the ability to
handle HTTP requests and responses. Depending on the functionality used, different data formats,
such as RDF+XML or JSON-LD, are supported to exchange information.
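As a minimal sketch of what "handling HTTP requests" can mean in practice, the following function only describes a request for content enhancement; it does not send it. The base URL, the "/enhancer" path and the default media type are assumptions about a local installation and should be checked against the deployed service.

```javascript
// Sketch: build (but do not send) an HTTP request to a content enhancement endpoint.
// The "/enhancer" path and the base URL are assumptions, not taken from this document.
function buildEnhancerRequest(baseUrl, text, accept) {
  return {
    url: baseUrl.replace(/\/+$/, "") + "/enhancer",
    options: {
      method: "POST",
      headers: {
        // The plain text content to be enhanced is sent as the request body.
        "Content-Type": "text/plain",
        // Requested serialization of the returned metadata (RDF+XML by default here).
        "Accept": accept || "application/rdf+xml"
      },
      body: text
    }
  };
}

const request = buildEnhancerRequest(
  "http://localhost:8080/",
  "Paris is the capital of France."
);
console.log(request.url, request.options.headers);
// With a running server, the request could be sent via fetch(request.url, request.options).
```

A different serialization such as JSON-LD would be requested by passing another Accept value; the exact media type accepted depends on the server.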
In the following, we briefly describe each service integration pattern and link to the required
features provided by the IKS RI. First, we present a list of available service integration
patterns:
Though VIE is fully capable of working with Apache Stanbol, the service API also allows developers to
include other services (e.g., DBPedia). With VIE one can directly load information from
DBPedia (or a local service to be implemented by the developer).
VIE provides methods to query for ontological inheritance. Once an ontology has either been
loaded from external sources or built up internally using the provided data structures,
queries like "is this entity of type 'Person'" or "is an entity of type 'Country' also a
'Place'" can be performed on the client side without server-side interactions (e.g., in a
scenario where no Internet connection is available).
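The kind of client-side inheritance query described above can be illustrated with a small, self-contained sketch. It is plain JavaScript and does not use the VIE API; the type hierarchy and the function name isOfType are made up for this example.

```javascript
// A tiny type hierarchy: each type lists its direct supertypes (multiple inheritance allowed).
// The hierarchy is invented for illustration.
const supertypes = {
  Thing: [],
  Person: ["Thing"],
  Place: ["Thing"],
  AdministrativeArea: ["Place"],
  Country: ["AdministrativeArea"]
};

// Returns true if `type` equals `ancestor` or inherits from it, directly or indirectly.
function isOfType(type, ancestor, seen = new Set()) {
  if (type === ancestor) return true;
  if (seen.has(type)) return false; // guard against cycles in the hierarchy
  seen.add(type);
  return (supertypes[type] || []).some((parent) => isOfType(parent, ancestor, seen));
}

console.log(isOfType("Country", "Place")); // true
console.log(isOfType("Person", "Place"));  // false
```

Because the hierarchy is held in memory, such checks need no server round trip once the ontology has been loaded.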
Include VIE and the VIE widget in the source of the HTML file. Depending on the used
vocabulary, a mapping of the semantic information to the related service queries needs to
be implemented.
For English content these engines already work out of the box. If your document is in
a different language, you have to check whether your language is supported by a
corresponding model for the OpenNLP
algorithms used.
You want to link your annotations, e.g. extracted entities, to possibly external datasets
in order to suggest more information about each annotation. An example is to link found
entities to corresponding DBPedia datasets.
You have to define your own referenced sites for the Named Entity Tagging Engine in
the Entity Hub. By default a small snapshot of the DBPedia database is provided as a local
referenced site.
You want to use a different vocabulary than the default in your output format. For example,
you want to refactor the output to the Google Rich Snippet format.
Once you have run the Enhancer on the content item, store in OntoNet an ontology containing
the inference statements (e.g. 'the located-in relationship is transitive') and run
the Reasoner 'enrich' service.
You want to ensure that your knowledge base does not contain contradictory or incomplete
statements (i.e. there are no consistency or integrity violations).
Run the Reasoner Check service on the ontology that represents your knowledge model.
Then define integrity constraints in an Apache Stanbol Rules recipe and run it on the Rules
Refactor service along with the ontology. The 'surviving' statements are those that
abide by the rules.
You want each CMS user to manage their own content hierarchy (e.g. a SKOS thesaurus or
WebDAV file system) concurrently and without tampering with each other's hierarchy.
You also want to minimize redundancies in the knowledge base and highlight privileged criteria
in the 'user-ring' and 'system-ring' (e.g. if SKOS itself or another metamodel were to change,
the changes should be reflected on the content hierarchies of all users. Not so for hierarchy
modifications performed by unprivileged users on shared resources).
Use the Content Hub API to store contents, their metadata and enhancements.
Then use OntoNet to open a scope and load SKOS into its core space and load the default content
hierarchy. Then open a session for each user. Whenever users modify their local hierarchy, write
the changes in their session only.
You want to select contents, users or other entities based on constraints more
complex than traditional search facets. For example, filter 'trusted' users after setting
trust rules (e.g. at least X trusted followers; at least one third of their comments and
reviews with a star rating of 4 or more, etc.)
Write recipes in Stanbol Rule syntax where the constraints for an entity to be 'viable' are written.
Query the Entity Hub for the signature of each entity and load them into an OntoNet session.
Run the Rules 'refactor' service to obtain an RDF graph with only the set of viable entities.
You want to aggregate ontologies and datasets from across the Web, organize them
conveniently into libraries and load a library only when a service requires it.
You have an ontology describing certain data in a given vocabulary, e.g. DBPedia+FOAF,
and want the same data to be described in another vocabulary, e.g. schema.org.
You can directly use RESTful services of Content Hub for storage and search. In an OSGi
environment, those services are accessible through OSGi components of Content Hub.
You want to store textual content items together with additional metadata and afterwards
perform faceted search based on the metadata of stored content items.
You should provide additional metadata along with the content item while submitting it to
Content Hub. After the initial keyword-based search, you should parse the facet information
returned from the Content Hub services.
You want to store textual content items, possibly together with additional metadata about
the content item, together with the enhancements obtained by the Enhancer. Afterwards,
you want to be able to search documents based on the keywords that are included in the enhancements
or do faceted search based on the three basic types of enhancements i.e. Person, Place, and Organization.
You can directly use RESTful services of Content Hub for storage and search. In an OSGi
environment, those services are accessible through OSGi components of Content Hub. See also
the related patterns.
You need to learn the syntax of LDPath and create an LDPath program based on your needs. You can create
a semantic index through RESTful or OSGi services of Content Hub.
You want to populate your semantic index which was created previously through an
LDPath program. The semantic index can be populated with the information obtained by
querying the named entities that have been detected for the content item with the
corresponding LDPath program.
In this section we will describe the IKS RI layer by layer and describe the implemented
technologies. The aim is to give an introduction to the available components with links
to further readings and sources.
Like jQuery does for the normal web developer, VIE aims to reduce the lines of code for the
day-to-day recurring tasks of a semantic web developer. However, there are higher-level user
interactions that go beyond the scope of VIE. These "semantic interaction patterns" also need
support to allow an easier adoption and development. We started to implement VIE widgets that
implement these interaction patterns on a broad scale but with a flexible and extensible user
interface for application-specific usage. In the following, we list the currently available VIE
widgets and applications where we combined such widgets.
Autotagger: This widget displays a list of found entities in a tag-cloud to be processed
further.
Related Content: Displays related content for the entity in focus. The challenge here
is to transform the knowledge about the entity-type into entity-specific queries, e.g., photos
of persons should usually contain a face, whereas images of a place will rather show a
landscape.
VIE.autocomplete: VIE.autocomplete uses the VIE.find() service method
to make autocomplete suggestions. The VIE.find method can query different backend
or frontend data sources.
The development of web-applications for the semantic web is a hard process that usually involves
several experts working together, especially when the benefits of the various semantic data need
to be visible to the user. However, when looking at such applications, we can easily identify
different classes of semantic interaction patterns that are re-used in one way or another
throughout the applications (e.g., "querying a database", "accessing a semantic service"). We
analyzed these patterns and collaborated closely with partners from the CMS world to support
web-applications when relying on such patterns. The outcome of this work is an open-source,
MIT-licensed JavaScript library called VIE (Vienna IKS Editables).
Version 1.0 of VIE concentrated on the
development of decoupled Content Management Systems based on semantic annotations on a
web-page. This new concept of detaching the front-end editing framework from the content repository
has been embraced by several CMS providers, and working implementations within big CMS
could be developed quickly (e.g.,
Midgard Create,
WordPress,
TYPO3,
KaraCos,
Drupal).
The underlying principle is to encode knowledge about the content directly in the content to
allow the user interface to know how to deal with different parts of the content. As a side
effect, this gives search engines a deeper understanding of the pages, which is a direct
benefit for SEO. Here is an example:
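A minimal sketch of the principle, assuming a content fragment annotated with RDFa attributes (about, typeof, property). The "ex:" vocabulary prefix, the URI and the helper extractProperties are hypothetical, and the regular-expression extraction is a toy that only handles simple, non-nested elements; it is not how VIE parses RDFa.

```javascript
// A content fragment annotated with RDFa attributes; vocabulary and URI are illustrative.
const annotatedHtml =
  '<div about="http://example.org/people/alice" typeof="ex:Person">' +
  '<span property="ex:name">Alice Example</span> works at ' +
  '<span property="ex:employer">Example Corp</span>.' +
  '</div>';

// Toy extraction: collect property/value pairs from simple, non-nested elements.
// A real implementation would walk the DOM instead of using a regular expression.
function extractProperties(html) {
  const result = {};
  const pattern = /property="([^"]+)"[^>]*>([^<]*)</g;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    result[match[1]] = match[2];
  }
  return result;
}

// The subject that the properties describe is given by the "about" attribute.
const subject = /about="([^"]+)"/.exec(annotatedHtml)[1];
const properties = extractProperties(annotatedHtml);
console.log(subject, properties);
```

Because the annotations travel with the markup, a user interface can tell which part of the page is the name and which is the employer without asking the content repository.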
With version 2.0 of the VIE library, we
extended its capabilities to ease the development of semantic interaction. The API now offers a
DSL to handle different namespaces seamlessly, to maintain ontological hierarchies (including
fully typed, multiple inheritance) and to access semantic backend services:
VIE.analyze()
Analyzes the given DOM element depending on the registered engines (e.g., RDFa-parsing,
Apache Stanbol Enhancer, Zemanta) and returns an array of found entities.
VIE.load()
Loads all properties for the given entity from external services (e.g., Apache Stanbol
Entityhub, DBPedia) into VIE.
VIE.save()
Saves knowledge about an entity to a service. This service can be the Entity Hub of Apache
Stanbol but also simply the local storage of the browser, using the JSON-LD representation of the
model.
VIE.find()
Queries semantic services for, e.g., all Persons whose names start with "Bar".
The most frequent answer that developers return when asked for their first association with the
term "semantic web" is "triples". However, at the same time, this object-centric representation
of entities is accepted rather poorly and often confuses developers. To address this issue, VIE
contains a subject-centric representation of entities using
Backbone.js models. Here is an example:
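// Assumes `vie` is an existing VIE instance in which the type "Person" is known.
var Person = vie.types.get("Person");
// Create an entity of type Person, described by subject-centric attributes.
var BarackObama = Person.instance({
    "@subject" : "<#Barack_Obama>",
    "name" : "Barack Hussein Obama",
    "dbpedia:birthDate" : "\"1961-08-04\"^^xsd:date"
});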
What at first looks like more lines of code actually comes with a clean API to get(),
set() and maintain attributes of entities.
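To illustrate the difference between a flat list of triples and a subject-centric view, the following minimal sketch groups triples by subject so that attributes can be read and updated per entity. It uses plain JavaScript Map objects, not VIE or Backbone.js models; the function name toEntities is hypothetical.

```javascript
// Triples about one subject, as [subject, predicate, object]; values follow the example above.
const triples = [
  ["<#Barack_Obama>", "name", "Barack Hussein Obama"],
  ["<#Barack_Obama>", "dbpedia:birthDate", "1961-08-04"]
];

// Group triples by subject into one attribute map per entity.
function toEntities(list) {
  const entities = new Map();
  for (const [s, p, o] of list) {
    if (!entities.has(s)) entities.set(s, new Map());
    entities.get(s).set(p, o);
  }
  return entities;
}

const obama = toEntities(triples).get("<#Barack_Obama>");
console.log(obama.get("name"));
obama.set("name", "Barack Obama"); // update an attribute of the entity in place
console.log(obama.get("name"));
```

The developer works with one object per entity and its get()/set() accessors instead of searching a triple list for matching subjects and predicates.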
By default, we ship VIE with the ontology provided by
http://schema.org. However, VIE is ontology-agnostic and makes it
easy to extend, remove or change the ontology.
Sources
VIE Homepage: http://viejs.org
In order to offer one centralized address where developers can
learn more about the VIE project
and have access to the variety of sources, we've created an official web-page. This page
contains all the documentation about
VIE, links to resources like widgets and applications that rely on VIE and provides an easy way
to play with code examples to experience how VIE works and what VIE has to offer.
The Apache Stanbol Enhancer and its Enhancement
Engines (formerly known as the FISE component) are the
components to enhance given content with additional
semantic metadata. The Enhancer takes the content and delivers it to a configurable chain of
enhancement engines. Each enhancement engine in this chain is used for a specific purpose.
There are preprocessing engines, e.g. to convert the content into the correct format, and
engines that automatically extract semantic metadata about the content. As an example,
Stanbol provides an engine to automatically determine the language of a text. Other engines
are able to extract entities like persons and places from the text. The found entities can
then be enriched with background information found in open-linked databases like DBPedia.
Language Identification Engine:
The Language Identification Engine determines the language of a text.
In the default configuration, language profiles of about 19 major
European languages are provided. Language identification is a
prerequisite for other components to activate and use
language-specific resources.
Metaxa Engine:
The Metaxa Engine provides a generic framework for extracting plain
text and embedded metadata from documents, images and audio files.
A large number of standard document formats are supported in
the default configuration, ranging from office documents from
the major vendors to standard image and audio
formats. The framework is easily extensible with
alternative or additional extractors for specific file or
data formats. Special attention was given to HTML documents. In
addition to text extraction, the extractor supports the extraction
of structured annotations embedded in HTML content that have
emerged in recent years, such as RDFa and microformats. Extracted
metadata are uniformly represented as RDF structures using appropriate
standard OWL ontologies and vocabularies.
Named Entity Extraction Engine: This engine is based on the NLP features of Apache
OpenNLP. It uses its Maximum Entropy models to detect Persons, Places and Organizations.
Keyword Linking Engine: The Keyword Linking Engine supports the extraction of keywords
in multiple languages.
Named Entity Tagging Engine: This engine uses Referenced Sites of the
Entity Hub to search for entities based on given text annotations.
Geonames Engine: This engine creates fise:EntityAnnotations based on the
http://geonames.org dataset.
OpenCalais Engine:
OpenCalais provides a free high-quality online
service for Named Entity Recognition and Relation Extraction in the
news domain. The Stanbol OpenCalais Engine provides an interface to that
service. The engine also provides means for mapping the OpenCalais
entity categories to user-specified categories.
Zemanta Engine: Enhancement engine that uses the Zemanta API. You need a Zemanta API
key to run this engine.
Refactor Engine: It refactors the RDF graphs of recognized entities to a target
vocabulary. The engine is provided with a default set of rules (a recipe) for the refactoring,
which produces an RDF graph according to the Google vocabulary. That default recipe
makes it possible to produce Google rich snippets.
The Apache Stanbol Reasoners component provides a set of services that take advantage of automatic inference engines.
The module implements a common API for reasoning services, making it possible to plug in different reasoners and configurations in parallel.
Currently the module includes OWL API
and Jena
based abstract services, with concrete implementations for the Jena RDFS, OWL, OWLMini and HermiT
reasoning services.
The Reasoners can be used to automatically infer additional knowledge. They are used to obtain new facts in the knowledge
base, e.g. if your enhanced content tells you about a shop located in "Montparnasse", you can infer via a "located-in"
relation that the same shop is located in "Paris", in the "Île-de-France" and in "France".
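The "located-in" example amounts to following a transitive relation. The sketch below shows this in plain JavaScript under simplifying assumptions: each place has at most one asserted parent, the data is illustrative, and inferLocations is a hypothetical helper. A real reasoner would derive the same facts from an axiom declaring the property transitive rather than from hand-written code.

```javascript
// Asserted "located-in" facts (place -> containing place).
const locatedIn = new Map([
  ["Shop", "Montparnasse"],
  ["Montparnasse", "Paris"],
  ["Paris", "Île-de-France"],
  ["Île-de-France", "France"],
]);

// Hypothetical helper: follows the relation transitively and returns
// every place the starting point is located in, nearest first.
function inferLocations(place, facts) {
  const inferred = [];
  const seen = new Set([place]); // guards against cycles in the data
  let current = facts.get(place);
  while (current !== undefined && !seen.has(current)) {
    inferred.push(current);
    seen.add(current);
    current = facts.get(current);
  }
  return inferred;
}

console.log(inferLocations("Shop", locatedIn));
// → [ 'Montparnasse', 'Paris', 'Île-de-France', 'France' ]
```

Only the first fact about the shop was asserted; the remaining three are new facts obtained by inference.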
Apache Stanbol Rules is a component that supports the construction and execution of inference rules.
An inference rule, or transformation rule, is a syntactic rule or function which takes premises and returns a conclusion.
Stanbol Rules adds a layer for expressing business logic by means of axioms, which encode the inference rules.
These axioms can be organized into a container called a recipe, which identifies a set of rules that share
the same business logic and interprets them as a whole.
Apache Stanbol can provide rules to other components, e.g. Apache Stanbol Reasoners,
or to third parties in three different formats.
SWRL.
The Semantic Web Rule Language (SWRL) is a rule language which combines OWL DL with the Unary/Binary Datalog RuleML
sublanguages of the Rule Markup Language and enables Horn-like rules to be combined with an OWL knowledge base.
Providing Stanbol Rules as SWRL rules means that they can be interpreted in classical DL reasoning. This makes it possible,
for instance, to use Stanbol Rules with any of the OWL 2 reasoners configured in the Stanbol Reasoners component.
Jena Rules.
This format enables compatibility with inference engines based on the Jena inference and rule language. Internally,
the Stanbol Reasoners component provides a reasoning profile based on Jena inference.
SPARQL.
SPARQL is a W3C recommendation for a query language for RDF. A natural way to represent inference transformation
rules in SPARQL is by using the CONSTRUCT query form. Stanbol Rules can be converted to SPARQL CONSTRUCTs
and executed by any SPARQL engine. Stanbol provides a particular SPARQL engine, namely the Refactor,
which performs transformations of RDF graphs based on transformation rules defined in Stanbol.
The latter allows, for instance, the vocabulary harmonization of RDF graphs retrieved from different sources in
Linked Data.
The Apache Stanbol Rules component thus adds a layer which enables Stanbol to express business logic by means of axioms, i.e., rules.
These axioms can be organized into a container called a Recipe, which groups and identifies a set of rules that share the same
business logic and interprets them as a whole.
The following sub-components are used to implement the Apache Stanbol Rules features:
Rule language specifies the syntax used in Stanbol in order to represent rules.
Stanbol rules can be expressed as SWRL, Jena rules or SPARQL CONSTRUCT.
Rule Store allows rules to be persisted.
Rules are stored in sets called recipes, which are designed to aggregate rules by their
functionality.
Refactor performs RDF graph transformations to specific target
vocabularies or ontologies by means of rules. This allows the harmonization and the alignment
of RDF graphs expressed with different vocabularies, e.g., DBpedia, schema.org etc.
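The effect of a refactoring recipe can be sketched as follows. This is a simplified illustration in plain JavaScript, not the Stanbol rule syntax: the recipe format, the chosen property mappings and the function refactor are assumptions made for this example. As with a SPARQL CONSTRUCT query, only the triples matched by a rule appear in the output graph.

```javascript
// Hypothetical recipe: each rule maps a source predicate (premise)
// to a predicate of the target vocabulary (conclusion).
const recipe = [
  { from: "dbpedia:birthDate", to: "schema:birthDate" },
  { from: "foaf:name", to: "schema:name" },
];

// Applies all rules of the recipe as a whole to a list of
// [subject, predicate, object] triples and builds a new graph.
function refactor(triples, rules) {
  const mapping = new Map(rules.map((rule) => [rule.from, rule.to]));
  return triples
    .filter(([, predicate]) => mapping.has(predicate))
    .map(([subject, predicate, object]) => [
      subject,
      mapping.get(predicate),
      object,
    ]);
}

const source = [
  ["<#Barack_Obama>", "foaf:name", "Barack Hussein Obama"],
  ["<#Barack_Obama>", "dbpedia:birthDate", "1961-08-04"],
  ["<#Barack_Obama>", "ex:internalId", "42"], // no rule: not in the output
];

const harmonized = refactor(source, recipe);
console.log(harmonized);
```

Graphs from different sources that are passed through the same recipe end up using one vocabulary, which is the harmonization described above.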
The Apache Stanbol Ontology Manager provides a controlled environment for managing ontologies,
ontology networks and user sessions for semantic data modeled after them. It provides full
access to ontologies stored in the Stanbol persistence layer. Managing an ontology network means that you
can activate or deactivate parts of a complex model from time to time, so that your data can be viewed and
classified under different "logical lenses".
This is especially useful in reasoning operations.
A Web Ontology in computer and information science is a shareable conceptual model of a part
of the world [1]. This model describes concepts
in terms of their characteristics and their relations with other concepts. By means of OntoNet, it is possible to
improve ontology management as follows:
Setup multiple Ontology networks simultaneously, by interconnecting the knowledge contained in
ontologies that normally would not be aware of one another.
Dynamic (de-)activation of parts of any ontology network, as needed by specific reasoning, rule
execution, or other knowledge processing tasks.
Organize ontologies into ontology libraries, which can be populated by setting up simple RDF
graphs called registries.
Use Stanbol as a central ontology repository that mirrors the ontologies scattered around the Web,
so that there will be no need to query more than a single server for all the formal knowledge managed by the CMS.
OntoNet allows subsets of the knowledge base managed by Stanbol to be constructed as
OWL/OWL2 ontology networks.
Stanbol OntoNet implements the API section for managing OWL and OWL2 ontologies, in order to prepare them for
consumption by reasoning services, refactorers, rule engines and the like. Ontology management in OntoNet is sparse
and not connected: once loaded internally from their remote locations, ontologies live and are known within the realm
they were loaded in. This allows loose coupling and (de-)activation of ontologies in order to scale the data sets for
reasoners to process and optimize them for efficiency.
Figure 4.1: An example of OntoNet setup for multiple ontology networks, showing the orthogonal layering
of sessions, scopes and spaces.
The following concepts have been introduced with OntoNet:
Ontology scope: a "logical realm" for all the ontologies that encompass a certain CMS-related
set of concepts (such as "User", "ACL", "Event", "Content", "Domain", "Reengineering", "Community", "Travelling" etc.).
Scopes never inherit from each other, though they can load the same ontologies if need be.
Ontology space: an access-restricted container for synchronized access to ontologies within a scope.
The ontologies in a scope are loaded within its set of spaces. An ontology scope contains: (a) one core space,
which contains the immutable set of essential ontologies that describe the scope; (b) one (possibly empty) custom space,
which extends the core space according to specific CMS needs (e.g. the core space for the User scope may contain alignments to FOAF).
Session: a container of (supposedly volatile) semantic data which need to be intercrossed with one or
more Scopes, for stateful management of ontology networks. It can be used to load instances and reason on them using
different models (one per scope). An OntoNet Session is not equivalent to an HTTP session (since it can live persistently
across multiple HTTP sessions), although its behaviour can reflect the one of the HTTP session that created it, if
required by the implementation.
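The relationship between scopes, spaces and sessions can be sketched as a small data structure. The following plain JavaScript is an illustration only and does not reflect the OntoNet Java API: the classes Scope and Session, their methods and the ontology names are assumptions made for this example.

```javascript
// A scope holds one immutable core space and one custom space.
class Scope {
  constructor(id, coreOntologies) {
    this.id = id;
    this.core = Object.freeze([...coreOntologies]); // core space: immutable
    this.custom = []; // custom space: possibly empty
    this.active = true;
  }
  addCustom(ontology) {
    this.custom.push(ontology);
  }
  ontologies() {
    return [...this.core, ...this.custom];
  }
}

// A session holds volatile data and is attached to one or more scopes.
class Session {
  constructor() {
    this.scopes = [];
    this.data = [];
  }
  attach(scope) {
    this.scopes.push(scope);
  }
  // Ontologies a reasoner would see: only those of active scopes.
  visibleOntologies() {
    return this.scopes
      .filter((scope) => scope.active)
      .flatMap((scope) => scope.ontologies());
  }
}

const userScope = new Scope("User", ["user-core"]);
userScope.addCustom("cms-user-extension");
const contentScope = new Scope("Content", ["content-core"]);

const session = new Session();
session.attach(userScope);
session.attach(contentScope);

contentScope.active = false; // deactivate one part of the network
console.log(session.visibleOntologies());
// → [ 'user-core', 'cms-user-extension' ]
```

Deactivating a scope shrinks the set of ontologies handed to a reasoner without unloading anything, which is how the amount of data to be processed can be scaled.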
Apache Stanbol Ontology Registry Manager manages ontology libraries for bootstrapping the network using both external and internal ontologies.
Registry management is a facility for Stanbol administrators to pre-configure sets of ontologies that
Stanbol should load and store, or simply be aware of, before they are included in a part of the ontology network
(e.g. a scope or session). Via the registry manager, it is possible to configure whether these ontologies should be loaded
immediately when Stanbol is initialized, or only when explicitly requested. The Ontology Registry Manager is essentially an
ontology bookmarker with caching support. It is also possible to cache multiple versions of the same ontology if needed.
The following concepts have been introduced with Registry:
A Library is a collection of references to ontologies, which can be located anywhere on the Web.
CMS administrators and knowledge managers can create a library by any criterion, e.g. a library of all
W3C ontologies, a library of all the ontologies that describe a social network
(which can include SIOC, FOAF etc.),
a library of ontology alignments (which includes ontologies that align DBPedia to Schema.org, GeoNames to DBPedia,
or a custom product ontology to GoodRelations).
A Registry is an RDF resource (i.e. an ontology itself) that describes one or more libraries. It is the
physical object that has to be accessed to gain knowledge about libraries.
The Content Hub of Apache Stanbol is a document repository which provides semantic storage for
the content items and semantic search services on top of them. Text-based documents can be
submitted, semantically indexed and searched through the services of Content Hub.
As presented in Figure 4.2, two main services are provided by the Content Hub. One is the
collection of storage-related services; content items can be submitted through the semantic
enhancement facilities of Apache Stanbol. The other is the collection of search services, which
offer a powerful mechanism for searching content items.
Figure 4.2: The Content Hub Architecture
A document is referred to as a Content Item within the Content Hub after its submission. In addition
to the actual text-based content of the document, a content item consists of the metadata of the
document. Metadata can be supplied by the clients along with the document during the submission.
In addition, metadata is generated by Apache Stanbol units through several linguistic analysis and enhancement
processes. Semantics within the Content Hub start at this point. Meta-information about the
content, retrieved from several external and internal sources, is indexed and stored along with
the content according to the indexing directives supplied earlier. Indexing directives are
supplied to the Content Hub through a formal language.
The Content Hub makes use of Apache Solr as its backend to store the content items.
Solr provides powerful indexing and text-based search mechanisms. It supports a rich and highly
flexible schema specification, and has an extensive search plug-in API for developing custom
search behavior.
The default semantic index of the Content Hub considers several generic semantic relations among entities
and meets several horizontal requirements. However, the indexing mechanism of the Content Hub needs to be
adjusted to different domains to meet different indexing and search criteria. The ability to create
Solr cores that direct the system while indexing and searching the content items is therefore
important.
The LMF (Linked Media Framework) project provides this functionality as-is. The LMF Semantic
search module creates Solr indexes with the help of so-called "RDF Path Programs". Recently, the LMF
team has provided a standalone library for the evaluation of RDF Path Programs and named it
"LDPath". LDPath is a simple path-based query language over RDF (similar to XPath or SPARQL
Property Paths) which is particularly designed for querying the Linked Data Cloud by following
RDF links between resources.
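The principle of a path-based query, following RDF links from resource to resource, can be sketched in plain JavaScript. This is not LDPath and does not use its syntax; the in-memory graph, its prefixed names and the function followPath are illustrative assumptions.

```javascript
// Tiny in-memory graph: subject -> predicate -> list of objects.
const graph = {
  "dbpedia:Mongolia": { "dbp:capital": ["dbpedia:Ulaanbaatar"] },
  "dbpedia:Ulaanbaatar": { "rdfs:label": ["Ulaanbaatar"] },
};

// Hypothetical helper: starts at one resource and follows the given
// predicates one after another, collecting every node that is reached.
function followPath(graph, start, path) {
  let nodes = [start];
  for (const predicate of path) {
    nodes = nodes.flatMap(
      (node) => (graph[node] && graph[node][predicate]) || []
    );
  }
  return nodes;
}

console.log(followPath(graph, "dbpedia:Mongolia", ["dbp:capital", "rdfs:label"]));
// → [ 'Ulaanbaatar' ]
```

For indexing, the value reached at the end of such a path would be stored in an index field, so that a content item mentioning Mongolia could also be found through the name of its capital.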
The Content Hub can be used to implement the following features at the user interface level.
The Entity Hub is Apache Stanbol's component for dealing with entities and their metadata.
Some Enhancement Engines rely on the ability to extract entities from plain text. The Entity Hub
is used to reference these entities and to provide additional metadata about them from
linked open data sources. The Entity Hub is a generic component that is able to
connect to a configurable list of such data sources. Even user-defined data sources
are possible. Using these data sources, called yards in the Entity Hub, the Entity Hub
is able to provide information about entities from different sources. To improve
performance and to avoid relying on unstable (or slow) internet connections, the Entity Hub is
able to cache this information locally.
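The caching behaviour can be sketched as a small wrapper around a remote lookup. This is a conceptual illustration in plain JavaScript, not the Entity Hub implementation: the class CachingSite, the lookup function and the entity data are hypothetical, and the lookup is synchronous to keep the sketch short.

```javascript
// Wraps a (slow, possibly unreliable) remote lookup with a local cache.
class CachingSite {
  constructor(lookupRemote) {
    this.lookupRemote = lookupRemote;
    this.cache = new Map();
  }
  get(uri) {
    if (!this.cache.has(uri)) {
      // Only the first request for a URI reaches the remote source.
      this.cache.set(uri, this.lookupRemote(uri));
    }
    return this.cache.get(uri);
  }
}

// Stand-in for a remote data source; counts how often it is called.
let remoteCalls = 0;
const fakeRemote = (uri) => {
  remoteCalls += 1;
  return { "@subject": uri, label: "Paris" };
};

const site = new CachingSite(fakeRemote);
site.get("http://dbpedia.org/resource/Paris");
site.get("http://dbpedia.org/resource/Paris");
console.log(remoteCalls);
// → 1
```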
The Fact Store is used to store relations between entities. The Fact Store only stores the
relation and not the entities themselves. It only references entities by their
URIs. The entities themselves should be handled by the Entity Hub.
A relation between entities is called a fact. A fact is defined by a fact schema which is
defined over types of entities.
A fact schema can be defined between an arbitrary number of entities. In most
cases a fact schema is defined between two or three entities. For example, the fact schema
'works-for' can be defined as a relation between entities of type 'Person' and
'Organization'. The Fact Store interface allows custom fact schemata to be created and
facts to be stored according to these custom schemata.
The Fact Store provides a simple but efficient way to define and store facts. This component
is meant to be used in scenarios where a simple solution is sufficient and it is not
required to define a complex ontology with reasoning support.
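The 'works-for' example can be sketched as a minimal in-memory store. The code below is a conceptual illustration in plain JavaScript and not the Fact Store interface: the class, its method names, the role names and the example.org URIs are assumptions made for this sketch.

```javascript
class FactStore {
  constructor() {
    this.schemata = new Map(); // schema name -> { role: entity type }
    this.facts = [];
  }
  // A fact schema is defined over types of entities.
  defineSchema(name, roles) {
    this.schemata.set(name, roles);
  }
  // A fact only references entities by URI; it never stores the entities.
  addFact(schemaName, fact) {
    const roles = this.schemata.get(schemaName);
    if (!roles) throw new Error(`Unknown fact schema: ${schemaName}`);
    for (const role of Object.keys(roles)) {
      if (typeof fact[role] !== "string") {
        throw new Error(`Missing URI for role: ${role}`);
      }
    }
    this.facts.push({ schema: schemaName, ...fact });
  }
  query(schemaName, role, uri) {
    return this.facts.filter(
      (fact) => fact.schema === schemaName && fact[role] === uri
    );
  }
}

const store = new FactStore();
store.defineSchema("works-for", {
  person: "Person",
  organization: "Organization",
});
store.addFact("works-for", {
  person: "http://example.org/person/alice",
  organization: "http://example.org/org/acme",
});

console.log(store.query("works-for", "organization", "http://example.org/org/acme"));
```

Everything beyond the URIs, such as the name of the person or the address of the organization, stays with the component that manages the entities.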
The RESTful API is at the heart of integrating all available IKS services into an existing
CMS. All services are required to offer a RESTful interface to make integration as easy
and technology independent as possible. The aim of the IKS is to support a wide range of
available CMS and technologies, e.g. Java-based and PHP-based CMS. Therefore, the RESTful API
is a key factor in making semantic technology available to a broad audience.
Apart from the RESTful services through which each component provides direct access to itself,
the CMS Adapter component also acts as a bridge between content management systems and Apache
Stanbol. The CMS Adapter interacts with content management systems through the JCR and CMIS specifications.
In other words, any content repository compliant with the JCR or CMIS specification can make use of the CMS
Adapter functionalities. One of the main features of this component is the bidirectional mapping between
RDF data and the content repository structure. That is, it is possible to transform the content repository
structure into RDF format or to populate the content repository based on external RDF data. Furthermore,
this component can submit content repository items, together with their properties and the
enhancements obtained through the Enhancer, to the Content Hub in order to make use of the semantic indexing and
search capabilities of the Content Hub over the content items.
To get started with the IKS software, or even to get involved in its development,
three main entry points exist. The first entry point is the IKS project itself. The
second and third entry points are the open-source projects founded as part of the IKS project.
This organizational structure is depicted in Figure 5.1.
Figure 5.1: The IKS and its open-source projects
The first open-source project is the VIE project which provides an implementation for the
presentation and interaction layer of the reference architecture.
The second open-source project is the Apache Stanbol project which implements the lower layers
starting with semantic lifting features. Both are independent projects which have managed to
attract their own communities. They are aware of each other's goals but are developed
separately. It is important to note that both projects are not bound to the IKS project and
will live on after the official IKS project phase has ended. It was one goal of the IKS project
to create such independent communities and to place the software development in the hands of
such vital open-source communities.
In the following we give a short guide through these projects.
VIE is a JavaScript library that has to be included in your web page. Once VIE is included, it
searches that web page for RDFa annotations on any element. In this way, VIE is
automatically connected to those elements. If you want VIE support for any element in
your web page, you have to annotate it using RDFa. For example, you could annotate the element
that represents the author of an article with a corresponding RDFa annotation. VIE will find this
element and connect to the semantic information about who the author of this page is. Based on
this infrastructure, VIE can be used to design semantic user interfaces. For example, it is
then possible to change the author within the browser and sync this information back to the CMS.
If more elements of the web page are annotated with RDFa, VIE will support changes to those elements
as well. In summary, to get started using VIE, you have to:
Mark up your pages with RDFa annotations
Include VIE into the pages
Implement Backbone.sync
As you can see, VIE depends on the Backbone.js library, because you have to implement
the Backbone.sync method as part of your VIE integration. Backbone.js is a library for using
the Model-View-Controller design pattern in JavaScript-based web applications. More information
can be found on the Backbone.js homepage.
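Backbone.sync is called with the arguments (method, model, options), where method is one of "create", "read", "update", "patch" or "delete". The sketch below shows the core of such an integration in plain JavaScript: it only translates a sync call into a description of the HTTP request and leaves the sending to the integrator. The endpoint /cms/entities and the helper buildSyncRequest are assumptions for illustration; the actual URL scheme depends on your CMS.

```javascript
// The same verb mapping that Backbone uses by default.
const methodMap = {
  create: "POST",
  read: "GET",
  update: "PUT",
  patch: "PATCH",
  delete: "DELETE",
};

// Assumed CMS endpoint; replace with the URL scheme of your CMS.
const CMS_BASE = "/cms/entities";

// Hypothetical helper: describes the HTTP request for one sync call.
function buildSyncRequest(method, model) {
  const verb = methodMap[method];
  if (!verb) throw new Error(`Unsupported sync method: ${method}`);
  const request = {
    method: verb,
    // New entities are posted to the collection, all others to their own URL.
    url:
      method === "create"
        ? CMS_BASE
        : `${CMS_BASE}/${encodeURIComponent(model.id)}`,
  };
  if (method === "create" || method === "update" || method === "patch") {
    request.body = JSON.stringify(model.toJSON());
  }
  return request;
}

// Stand-in for an entity model; real VIE entities are Backbone models.
const author = {
  id: "<#author>",
  toJSON: () => ({ "@subject": "<#author>", name: "Jane Doe" }),
};

console.log(buildSyncRequest("update", author));
```

In a real page you would assign a function with this signature to Backbone.sync, send the described request to the CMS, and call options.success with the response once it arrives.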
The VIE development is coordinated on the GitHub social
coding platform. There you can find the so-called
blessed repository for stable releases and the
development repository for the latest development
version.
The VIE documentation consists of a comprehensive API documentation. For further readings we
refer to a couple of blog posts and articles that were published about VIE.
Apache Stanbol is an incubating project at the Apache Software Foundation (ASF). This means that the
project has not yet matured enough and proved that it can live on its own as a top-level Apache
project. Nevertheless, Apache Stanbol uses the infrastructure and best practices of open-source
software development that the ASF is known for. The first entry point for getting started with
Apache Stanbol is its homepage. The Apache Stanbol project members have put a lot of
effort into improving the documentation available on the homepage.
The second source of information is the Apache Stanbol mailing list.
Everything related to Apache Stanbol is discussed on this mailing list. Developers use this
list to discuss new features and users can ask questions, getting in direct contact with the
developers. All e-mails sent to the mailing list are archived and can be referenced if
necessary. If you have an idea of improving Apache Stanbol or found a bug, then you should
create an issue in the issue tracker system.
Mailing List:
To subscribe to the Apache Stanbol mailing list, you have to send an e-mail to
stanbol-dev-subscribe@incubator.apache.org. In return you will get an automatically
generated confirmation e-mail to ensure you are the owner of this e-mail address.
To get started with Apache Stanbol, you should have a look at the documentation of usage
scenarios for Apache Stanbol. The first usage scenario most people come into contact with
is the content enhancement feature of Apache Stanbol. It allows you to send a piece of text
to the Apache Stanbol Enhancer web service, and in return you will get semantic annotations for
this text. Such annotations describe entities found in the text, like persons or organizations,
plus additional background data for those entities that was retrieved from the DBPedia
datasource.
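A request to the Enhancer can be sketched as follows. The sketch assumes a Stanbol instance running locally at http://localhost:8080 with the Enhancer available under the /enhancer path; both the address and the helper buildEnhancerRequest are assumptions for this illustration. The function only builds the request so that the sketch can be run without a server.

```javascript
// Hypothetical helper: describes a POST of plain text to the Enhancer.
// The Accept header selects the RDF serialization of the annotations.
function buildEnhancerRequest(baseUrl, text, accept = "text/turtle") {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/enhancer`,
    options: {
      method: "POST",
      headers: { "Content-Type": "text/plain", Accept: accept },
      body: text,
    },
  };
}

const request = buildEnhancerRequest(
  "http://localhost:8080/",
  "Paris is the capital of France."
);
console.log(request.url);

// With a running Stanbol instance, the request could be sent like this:
// fetch(request.url, request.options)
//   .then((response) => response.text())
//   .then((annotations) => console.log(annotations));
```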
To get involved in the development of Apache Stanbol, you should start to follow and use the
Apache Stanbol mailing list. It is the central point to ask questions, get help, or suggest
ideas for improvements. You should check out the source code from the Apache Stanbol source code
repository and familiarize yourself with software development based on OSGi. If you would like to
fix a bug or implement a new feature, you should implement it locally, create an issue in the
issue tracker and upload a patch to this issue that contains your changes to the source code.
Apache Stanbol developers will review the patch and apply it to the source code. After some
time, if you have worked on some issues and provided good patches, you could be invited to
become an Apache Stanbol committer with full access rights and duties to the source code.
Query for the name of the capital of Mongolia, directly from DBPedia.
var vie = new VIE();
vie.use(new vie.DBPediaService());
var mongoliaURI = "<http://dbpedia.org/resource/Mongolia>";
var capitalPropURI = "<http://dbpedia.org/property/capital>";
var namePropURI = "rdfs:label";
// First request: load the entity that describes Mongolia.
vie
    .load({entity : mongoliaURI})
    .using('dbpedia')
    .execute()
    .done(function (mongolia) {
        var capitalURI = mongolia.get(capitalPropURI);
        // Second request: load the entity the capital property points to.
        vie
            .load({entity : capitalURI})
            .using('dbpedia')
            .execute()
            .done(function(capital) {
                // Strip the surrounding angle brackets from the URI.
                var url = capital.id.substr(1, capital.id.length - 2);
                var label;
                // Pick the English label and strip quotes and language tag.
                _.each(capital.get(namePropURI), function(labelLang) {
                    if (labelLang.substr(-2) === 'en') {
                        label = labelLang.substr(2, labelLang.length - 7);
                    }
                });
                jQuery('#mongolia .resultsholder').append(
                    jQuery('<p>The capital of Mongolia is ' +
                        '<a href="' + url + '">' + label + '</a>.</p>')
                );
            });
    });
Use ontological hierarchies
var vie = new VIE();
vie.loadSchema("http://schema.rdfs.org/all.json",
    {
        baseNS : "http://schema.org",
        success: function () {
            jQuery('#schemaOrg .resultsholder')
                .append('<div class="msg">Successfully loaded the ontology!</div>');
            jQuery('#schemaOrg .resultsholder')
                .append('<div class="msg">We now have '
                    + this.types.list().length
                    + ' classes loaded!</div>');
            var Place = this.types.get("Place");
            var City = this.types.get("City");
            var Person = this.types.get("Person");
            // isof() checks the type hierarchy, including inherited types.
            jQuery('#schemaOrg .resultsholder')
                .append('<div class="msg">BTW (1): A schema:City is <b>' +
                    ((City.isof(Place))? ' ' : 'not ') +
                    '</b>of type schema:Place, but <b>' +
                    ((City.isof(Person))? ' ' : 'not ') +
                    '</b>of schema:Person!</div>');
            jQuery('#schemaOrg .resultsholder')
                .append('<div class="msg">BTW (2): A schema:City has <b>'
                    + City.attributes.list().length
                    + '</b> attributes, including all inherited!</div>');
        },
        error: function () {
            jQuery('#schemaOrg .resultsholder')
                .append('<div class="msg">Error while loading the ontology!</div>');
        }
    });
Christ, F. and Nagel, B. A Reference Architecture for Semantic Content Management Systems.
In M. Nüttgens, O. Thomas, B. Weber (eds.): Proceedings of the Enterprise Modelling and
Information Systems Architectures Workshop 2011 (EMISA'11), Hamburg (Germany). GI, LNI, vol.
P-190, pp. 135-148 (2011)
D4.2
Fabian Christ, Gregor Engels, Benjamin Nagel, Stefan Sauer, Suat Gonul, Ali Anil Sinaci,
Olivier Grisel, Rüdiger Kurz, IKS Deliverable 4.2: Horizontal Industrial Case – Design and
Implementation, 2012
D5.0 Alpha
Fabian Christ, Gregor Engels, Benjamin Nagel, Stefan Sauer, Sebastian Germesin, Enrico Daga
and Ozgur Kilic, 2010, IKS Alpha Development