Extensible Metadata for Your CMS

Preamble

I am a metadata enthusiast, especially when it comes to Dublin Core. When it comes to the Web, I don't just want to see metadata for pages, I want to see metadata that conforms to a formal vocabulary (like Dublin Core.) A quick read of my article Metadata, Meta Tags, Meta What? may help the reader get up to speed on this.

Content Management Systems (CMS) can provide a perfect framework for the creation, maintenance and presentation of metadata. Unfortunately, for most CMS software, this functionality is limited – often to informal, 'legacy,' terms – if it exists at all.

In my ideal world, a CMS would provide a ready-to-use means of associating Dublin Core metadata with all pages and be extensible so that the vocabulary could be extended or extra vocabularies added. Compared with some CMS functionality, this is not something that is difficult to achieve, so I can only assume that the general lack of implementation speaks of a total lack of interest in metadata on the part of the CMS developers.

Some time ago, I presented a set of notes on how to achieve this to a developer working on the Mambo CMS. This work never came to anything at the time as the project forked shortly thereafter and said developer left the project. Subsequent to this, I started working through my notes to produce an extensible metadata extension for the Drupal CMS and also described a toolkit that could be used to work with other CMS. Due to ill-health and lack of time, neither of these bore fruit.

The only progress I have really made on this to date has been in advising the developer of mojoPortal on my metadata concepts; a Dublin Core implementation for mojoPortal is being worked on at the time of writing.

Now, some three years on, I will try to make amends through this article by describing my concepts for adding an extensible metadata management system to a CMS.

I will attempt to keep this article as technology-neutral as possible by describing only the SQL table schemata and queries required to implement the system. However, it should be borne in mind that I am writing from a MySQL perspective and that changes may be required if working with other database technologies – especially when it comes to stored procedures.

One assumption that I am making, which is key to the whole concept, is that every page in the CMS is identified internally by a unique integer field. In the Drupal CMS, this would be the Node ID (nid.) If some other system is employed, a lookup table may need to be employed to implement my concepts.

The Simple, Inflexible Approach

To add metadata functionality to our CMS, we first need to extend the database schema. We could do this either by adding new fields to the table where we store our page content or by creating a new table where we can store our metadata.

Our extended table or new table can have a column for every term. This keeps queries and management very simple – but is highly inflexible as adding terms would require modification of the table schema and the queries that relate to it. I find this approach somewhat distasteful – using a flat and fixed data structure when we have the power of a relational database to work with.

Key Metadata Concepts: Triples, n-Tuples

Metadata are data describing data. In the Web context, metadata are various pieces of data that describe properties of a page or media object.

In its simplest form, a metadata statement comprises three elements, the thing we are talking about, the property we are describing, and the value of that property. This set-of-three may be described as a triple or 3-tuple.

Consider the following example of the 'legacy' description metadata element:


<meta name="description" content="an article about metadata" />

Do you see the three elements of the triple? No; that's confusing, isn't it? This is because we are presenting the metadata on the page we are describing; the name attribute of the meta element tells us the property we are describing, the content attribute the value of that property; the subject – the thing we are talking about – is implied. (Those who deal in the grammar of human languages may wish to compare this with the concept of an imperative sentence, where the subject is implied rather than expressed. The name and content attributes thus form the predicate of that sentence.)

Now, who says we can only present metadata about a page on that page? Nobody. If we are storing this metadata in our CMS, we can present it elsewhere, such as in an external RDF file. Our store of metadata may be used to create a library-catalogue of our entire site.

For this simple case, where our metadata can be represented by triples, we might create database tables like these to accommodate it. (Note that more detailed descriptions of fields will be given for the "real-life" schema later in this document.)


/*
Metadata is stored here.

subject - unique ID of page we are describing.

term - refers to metaterms.term; we look up metaterms.term_name
to find the value that goes in the name attribute of the meta
element.

termvalue - what goes in the content attribute of the meta element.
*/
create table metadata
(
subject int unsigned not null,
term int unsigned not null,
unique index(subject,term),
termvalue text
);

/*
Terms are stored here.
*/
create table metaterms
(
term int unsigned not null auto_increment primary key,
term_name varchar(64)
);

/*
Set up some terms.
*/
insert into metaterms (term_name) values ('description'), ('keywords');

/*
Now create description and keywords records for our page which
has unique ID of 1.
*/
insert into metadata (subject, term, termvalue) values
(1,1,'an article about metadata'),(1,2,'metadata; Dublin Core; blah blah;');

See? All nice and simple for triples.
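
As a rough illustration of how the implied subject plays out in practice, the following Python sketch loads the two tables above into an in-memory SQLite database and renders the triples for page 1 as meta elements. SQLite is only a stand-in for MySQL here, so the column types are simplified, and the function name render_page_meta is hypothetical.

```python
import sqlite3
from html import escape

# Assumption: SQLite stands in for MySQL, so column types are simplified.
db = sqlite3.connect(":memory:")
db.executescript("""
create table metaterms (term integer primary key autoincrement, term_name text);
create table metadata (subject integer not null, term integer not null,
                       termvalue text, unique (subject, term));
insert into metaterms (term_name) values ('description'), ('keywords');
insert into metadata (subject, term, termvalue) values
  (1, 1, 'an article about metadata'),
  (1, 2, 'metadata; Dublin Core; blah blah;');
""")

def render_page_meta(conn, subject):
    """Return one meta element per triple stored for the given page."""
    rows = conn.execute(
        "select t.term_name, m.termvalue from metadata m "
        "join metaterms t on t.term = m.term "
        "where m.subject = ? order by m.term",
        (subject,),
    )
    # The subject never appears in the output: it is implied by the page
    # on which the elements are placed.
    return [
        '<meta name="%s" content="%s" />' % (escape(name), escape(value))
        for name, value in rows
    ]

for line in render_page_meta(db, 1):
    print(line)
```

The same rows could just as easily be written to some other document, with the subject stated explicitly, which is the point made earlier about presenting metadata away from the page it describes.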

Dublin Core Complicates the Issue

Let's have a look at a couple of meta elements containing Dublin Core metadata assertions:


<meta name="DC.title" lang="en" content="Extensible Metadata for Your CMS" />
<meta name="DCTERMS.created" scheme="DCTERMS.W3CDTF" content="2009-12-05" />

Our DC.title has an extra property, lang, and DCTERMS.created has an extra property, scheme. This somewhat complicates matters and means that the triple is no longer capable of holding all the bits we need. We are now moving up into the world of the n-tuple (a triple is a tuple with 3 components; an n-tuple is a tuple with n components.) Our triple, or 3-tuple, has now become a 6-tuple.

If you are now wondering how I came up with a 6-tuple, let's have a count:

  1. Subject (this page, implied)
  2. Vocabulary – the first part of the name attribute. From our example, this is either DC or DCTERMS.
  3. Term name – the second part of the name attribute.
  4. Scheme
  5. Language
  6. Value of the content attribute.

So, the mysterious extra member of the n-tuple occurs because we are overloading the name attribute of the meta element.
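
The count above can be sketched in Python as a six-field record, with the overloaded name attribute split at the first full stop. The names MetaStatement and to_statement are hypothetical, and the sketch assumes that every name carries a vocabulary prefix.

```python
from typing import NamedTuple, Optional

class MetaStatement(NamedTuple):
    subject: int            # unique page ID (implied in the HTML)
    vocabulary: str         # e.g. "DC" or "DCTERMS"
    term: str               # e.g. "title" or "created"
    scheme: Optional[str]   # e.g. "DCTERMS.W3CDTF", or None
    lang: Optional[str]     # e.g. "en", or None
    value: str              # the content attribute

def to_statement(subject, name, content, scheme=None, lang=None):
    """Split the overloaded name attribute and build the 6-tuple."""
    # Assumption: name always has the form "VOCAB.term".
    vocabulary, _, term = name.partition(".")
    return MetaStatement(subject, vocabulary, term, scheme, lang, content)

title = to_statement(1, "DC.title", "Extensible Metadata for Your CMS", lang="en")
created = to_statement(1, "DCTERMS.created", "2009-12-05",
                       scheme="DCTERMS.W3CDTF")
print(title)
print(created)
```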

Our database structure just got a bit more complicated. How much more complicated is up to the developer; we can either stand up as purists and use a fully relational model, or we can cheat, simplify things and hope they don't come back to bite us. If we plan things carefully and consider the scenarios in which we are going to use our CMS, hopefully being bitten by the results of Bad Decisions will not be amongst our worries.

The Fully Relational Method

This method is actually not quite fully relational. I have cheated a little even in this method to make metadata searches a little more efficient. Let's have a look at the new schema of our metadata table:

Metadata Table


create table metadata
(
subject int unsigned not null,
index(subject),
termid int unsigned not null,
index(termid),
scheme int unsigned not null,
index(scheme),
lang char(8) not null,
index(lang),
termvalue text,
fulltext(termvalue)
);

Metadata Table Fields

subject
The unique ID of the page in question (eg: nid for Drupal.)
termid
foreign key – refers to the primary key of the metaterms table, described below.
scheme
foreign key – refers to the primary key of the schemes table, described below.
lang
The language of the content of the meta element (eg: EN-US, FR, DE, etc.)
termvalue
The actual value of the content attribute of the meta element.

You will note that this table does not have a field to store the vocabulary. This is not necessary as this may be looked up from the metaterms table.

The columns scheme and lang are designated NOT NULL for purposes of indexing. As values for these are not always present, we would populate these with 0 (zero) and the literal string 'NULL' respectively when no values are given. The software generating the meta element for the HTML document would skip creation of the respective attributes if these defined null values were found.
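
A minimal Python sketch of that skipping logic might look like the following. The function name meta_element and its parameters are hypothetical; it assumes the scheme name has already been looked up from the schemes table and that the two placeholder values are exactly those described above.

```python
from html import escape

NO_SCHEME = 0      # placeholder stored in metadata.scheme
NO_LANG = "NULL"   # placeholder string stored in metadata.lang

def meta_element(name, scheme_id, scheme_name, lang, value):
    """Build a meta element, omitting scheme and lang when not set."""
    attrs = [("name", name)]
    if scheme_id != NO_SCHEME and scheme_name:
        attrs.append(("scheme", scheme_name))
    if lang != NO_LANG:
        attrs.append(("lang", lang))
    attrs.append(("content", value))
    body = " ".join('%s="%s"' % (key, escape(val)) for key, val in attrs)
    return "<meta %s />" % body

print(meta_element("DC.title", 0, None, "en",
                   "Extensible Metadata for Your CMS"))
print(meta_element("DCTERMS.created", 18, "W3CDTF", "NULL", "2009-12-05"))
```

The scheme ID of 18 used here corresponds to W3CDTF only if the schemes table is populated in the order given in Appendix C.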

Metaterms Table

The metaterms table is where we define all the metadata terms that we can use.


create table metaterms
(
termid int unsigned not null auto_increment primary key,
vocabterm varchar(32) not null,
unique index(vocabterm),
vocab int unsigned,
defscheme int unsigned
);

Metaterms Table Fields

termid
The primary key for this table.
vocabterm
The value of the name attribute of the meta element. It is here that a bit of "cheating" takes place. You will recall that the name attribute of the meta element is overloaded by combining both vocabulary and term, as in DC.title. The metaterms table would have a field that contains just the term – at least it would if we were doing things nicely. For the sake of efficiency, however, the vocabterm field contains the same vocabulary+term value that appears in the name attribute of the meta element. (The alternative would be to look up the vocabulary [the DC part of DC.title] from the vocabs table.)
vocab
foreign key – refers to the primary key of the vocabs table.
defscheme
foreign key – refers to the primary key of the schemes table; this is the default scheme for this term. If we want our system to be flexible, we should let the user override this on a per-use basis, if they so wish.

See Appendix A for a dataset that can be used to pre-populate this table.
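
One way the default scheme and its per-use override might be resolved is sketched below in Python, with SQLite standing in for MySQL and a cut-down metaterms table holding just two of the Appendix A rows. The function name scheme_for is hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; only two terms are loaded.
db = sqlite3.connect(":memory:")
db.executescript("""
create table metaterms (termid integer primary key autoincrement,
                        vocabterm text not null unique,
                        vocab integer, defscheme integer);
insert into metaterms (vocabterm, vocab, defscheme) values
  ('DC.title', 1, null),
  ('DCTERMS.created', 2, 18);
""")

def scheme_for(conn, vocabterm, override=None):
    """Return the scheme ID to store in metadata.scheme for a term."""
    if override is not None:
        return override  # the user's per-use choice wins
    row = conn.execute(
        "select defscheme from metaterms where vocabterm = ?", (vocabterm,)
    ).fetchone()
    if row is None:
        raise KeyError(vocabterm)
    # 0 is the "no scheme" placeholder used in the metadata table.
    return row[0] if row[0] is not None else 0

print(scheme_for(db, "DCTERMS.created"))
print(scheme_for(db, "DCTERMS.created", override=17))
print(scheme_for(db, "DC.title"))
```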

Vocabs Table

The vocabs table is where we set up master records for the different vocabularies that we will use. One of the key functions of this table is to provide the URIs that should be linked in our HTML document <head></head>. For a full Dublin Core implementation, these would be:

<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
<link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" />

And here's the schema:


create table vocabs
(
vocab int unsigned not null primary key,
vocabname varchar(8),
vocaburi varchar(128)
);

Vocabs Table Fields

vocab
The primary key for this table.
vocabname
This is the first of the values that are joined in the name attribute of the meta element – the DC of DC.title or DCTERMS of DCTERMS.created.
vocaburi
URI of the schema for this vocabulary, for instance http://purl.org/dc/elements/1.1/ for the DC vocabulary.

Appendix B provides a dataset that can be used to pre-populate this table.

Schemes Table

The schemes table provides a list of possible values that can be used in the scheme attribute of the meta element.


create table schemes
(
scheme int unsigned not null auto_increment primary key,
schemename varchar(32) not null,
index(schemename)
);

Schemes Table Fields

scheme
The primary key for this table.
schemename
The actual value that will appear in the scheme attribute of the meta element.

Appendix C provides a dataset that can be used to pre-populate this table.

Simple/Cheats' Method

If we are prepared to sacrifice flexibility and accept the default scheme in the metaterms table as being the only one that may be used for each term, we can do away with the schemes table altogether and replace the integer column metaterms.defscheme with a varchar column containing that default scheme.

Another option would be to abandon the vocabs table and hard-code the links shown in the vocabs table section into the document template. If additional vocabularies were to be added, any corresponding schema links would also need to be added to the template.

SQL Queries

Whilst the database structure described here should provide what is required to implement a metadata repository for a CMS, I will provide some example queries to help get the ball rolling.

Vocabulary Links

select concat('schema.',vocabname), vocaburi from vocabs
where vocaburi is not null and vocaburi!='';

This will provide values ready to put in the rel and href attributes of link elements. These links could also be added as static text to the page template, as described in the Simple/Cheats' Method section.
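
A Python sketch of turning those rows into link elements follows. SQLite stands in for MySQL, so the concat() call becomes the || operator, and the function name schema_links is hypothetical. The data is that of Appendix B.

```python
import sqlite3
from html import escape

# Assumption: SQLite stands in for MySQL; data as in Appendix B.
db = sqlite3.connect(":memory:")
db.executescript("""
create table vocabs (vocab integer primary key, vocabname text, vocaburi text);
insert into vocabs values
  (1, 'DC', 'http://purl.org/dc/elements/1.1/'),
  (2, 'DCTERMS', 'http://purl.org/dc/terms/'),
  (3, 'HTML', null),
  (4, 'OTHER', null);
""")

def schema_links(conn):
    """Return a link element for every vocabulary that has a URI."""
    rows = conn.execute(
        # SQLite spells concat('schema.', vocabname) with the || operator.
        "select 'schema.' || vocabname, vocaburi from vocabs "
        "where vocaburi is not null and vocaburi != '' order by vocab"
    )
    return [
        '<link rel="%s" href="%s" />' % (escape(rel), escape(href))
        for rel, href in rows
    ]

for line in schema_links(db):
    print(line)
```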

Retrieving Metadata for a Page

Assuming that our page/node ID is 1234:

select t.vocabterm, s.schemename, m.lang, m.termvalue
from metadata m
join metaterms t on t.termid=m.termid
left join schemes s on s.scheme=m.scheme
where m.subject=1234;

This will return values for the meta element attributes name, scheme, lang, and content respectively. Values for scheme and lang are not always present: schemename is returned as NULL when metadata.scheme=0, and lang holds the placeholder string 'NULL' when no language was given. Creation of these attributes should be suppressed in those cases.

Note that the schemes table is attached with a left join so that a NULL may be returned if the value of metadata.scheme=0. (See Metadata Table Fields.)
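
To show the effect of the left join, here is a Python sketch that runs the query against a cut-down copy of the schema in SQLite (a stand-in for MySQL) and normalises the results. The function name page_metadata is hypothetical, and in this reduced schemes table W3CDTF has ID 1 rather than the 18 it would have with the Appendix C data.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; tables are cut down to the
# columns the query needs.
db = sqlite3.connect(":memory:")
db.executescript("""
create table metaterms (termid integer primary key autoincrement,
                        vocabterm text not null unique);
create table schemes (scheme integer primary key autoincrement,
                      schemename text not null);
create table metadata (subject integer not null, termid integer not null,
                       scheme integer not null, lang text not null,
                       termvalue text);
insert into metaterms (vocabterm) values ('DC.title'), ('DCTERMS.created');
insert into schemes (schemename) values ('W3CDTF');
insert into metadata values
  (1234, 1, 0, 'en', 'Extensible Metadata for Your CMS'),
  (1234, 2, 1, 'NULL', '2009-12-05');
""")

def page_metadata(conn, subject):
    """Fetch a page's metadata; missing scheme or lang becomes None."""
    rows = conn.execute(
        "select t.vocabterm, s.schemename, m.lang, m.termvalue "
        "from metadata m "
        "join metaterms t on t.termid = m.termid "
        "left join schemes s on s.scheme = m.scheme "
        "where m.subject = ? order by m.termid",
        (subject,),
    ).fetchall()
    return [
        {"name": name, "scheme": scheme,
         "lang": None if lang == "NULL" else lang, "content": value}
        for name, scheme, lang, value in rows
    ]

for row in page_metadata(db, 1234):
    print(row)
```

With an inner join in place of the left join, the DC.title row would be lost, as no scheme has ID 0.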

Further queries may be added to this section if I think of anything
else that might be useful.

Implementation

Here is a toolkit; how it is implemented is the choice of the developer. Here are some pointers that may assist.

It may be sufficient for many to implement only Dublin Core metadata. When this is the case, no provision need be made in the CMS for maintaining the metaterms, vocabs and schemes tables – the values provided in the appendices should provide all that is needed. If another vocabulary were identified that might be useful to a reasonable number of CMS users, this too could be added to the inserts in the appendices and no provision be made for maintaining it through the CMS.

If a form, or section of a form, in the CMS page maintenance area is provided for adding/maintaining metadata for pages, some fields could be pre-populated. DC.title could take the existing page title (I cannot see any reasonable situation where these would be different;) DC.identifier – the page URI – could be calculated; DCTERMS.created and DCTERMS.modified could certainly be derived automatically; DC.rights could be taken from a site default; DC.type and DC.format would generally be fixed. And the list goes on. Pre-population of fields would make the task of maintaining metadata less onerous and encourage compliance, which may be an issue in some organisations where provision of metadata is mandated.
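
Pre-population along these lines might be sketched in Python as follows. The shapes of the page and site records, the function name default_metadata, and the fixed values "Text" and "text/html" are all assumptions made for illustration; a real CMS would supply its own.

```python
from datetime import datetime

def default_metadata(page, site):
    """Suggest initial values for a page's metadata form."""
    base = site["base_url"].rstrip("/")
    return {
        "DC.title": page["title"],
        "DC.identifier": base + "/" + page["path"].lstrip("/"),
        "DCTERMS.created": page["created"].date().isoformat(),
        "DCTERMS.modified": page["modified"].date().isoformat(),
        "DC.rights": site["rights"],
        "DC.type": "Text",         # assumed fixed value (DCMIType)
        "DC.format": "text/html",  # assumed fixed value (IMT)
    }

# Hypothetical page and site records.
page = {
    "title": "Extensible Metadata for Your CMS",
    "path": "/node/1234",
    "created": datetime(2009, 12, 5, 9, 30),
    "modified": datetime(2009, 12, 6, 14, 0),
}
site = {"base_url": "http://example.org/", "rights": "CC BY 3.0"}

defaults = default_metadata(page, site)
for name, value in defaults.items():
    print(name, "=", value)
```

The user would remain free to change any of these values in the form before saving.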

Search facilities could be built that could identify lists of documents by author (DC.creator,) creation date, etcetera. I created an experimental metadata repository a while back – some four million pseudo-pages, each with three items of metadata. Searches on unique metadata values all completed under a second, much to my surprise. The repository used nearly the same table schemata (including indexing) presented here, so a powerful search engine would not be hard to implement for a CMS holding very large numbers of pages. I am currently unable to find the search queries I used, but will append them to the SQL Queries section, should I come across them at any point.
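
The queries from that experiment are not reproduced here. As a stand-in, the following Python sketch shows the general shape an exact-match search by term might take, using SQLite in place of MySQL and a handful of made-up rows. The function name find_pages is hypothetical. In MySQL, the fulltext index on termvalue could be searched with MATCH ... AGAINST instead of an equality test.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; rows are invented sample data.
db = sqlite3.connect(":memory:")
db.executescript("""
create table metaterms (termid integer primary key autoincrement,
                        vocabterm text not null unique);
create table metadata (subject integer not null, termid integer not null,
                       scheme integer not null, lang text not null,
                       termvalue text);
create index metadata_termid on metadata (termid);
insert into metaterms (vocabterm) values ('DC.creator'), ('DC.title');
insert into metadata values
  (1, 1, 0, 'NULL', 'A. Author'),
  (1, 2, 0, 'en', 'First Page'),
  (2, 1, 0, 'NULL', 'B. Writer'),
  (3, 1, 0, 'NULL', 'A. Author');
""")

def find_pages(conn, vocabterm, value):
    """Return the IDs of pages whose given term has exactly this value."""
    rows = conn.execute(
        "select m.subject from metadata m "
        "join metaterms t on t.termid = m.termid "
        "where t.vocabterm = ? and m.termvalue = ? "
        "order by m.subject",
        (vocabterm, value),
    )
    return [row[0] for row in rows]

print(find_pages(db, "DC.creator", "A. Author"))
```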

Sitemaps and other machine-readable (RDF) views of the repository could be generated – either as the results of search queries, or just dumps of the entire repository.
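
As one possible machine-readable view, the Python sketch below writes a page's statements as N-Triples, a simple line-based RDF serialisation. The function names, the example page URI and the choice of N-Triples are assumptions for illustration; schemes are ignored, and vocabularies without a URI (HTML and OTHER) are skipped.

```python
def escape_literal(text):
    """Escape the characters N-Triples does not allow raw in a literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))

def to_ntriples(page_uri, statements, vocab_uris):
    """Serialise (name, lang, value) statements about one page."""
    lines = []
    for name, lang, value in statements:
        vocab, _, term = name.partition(".")
        base = vocab_uris.get(vocab)
        if not base:
            continue  # no URI for this vocabulary, so no predicate
        literal = '"%s"' % escape_literal(value)
        if lang:
            literal += "@" + lang
        lines.append("<%s> <%s%s> %s ." % (page_uri, base, term, literal))
    return "\n".join(lines)

vocab_uris = {
    "DC": "http://purl.org/dc/elements/1.1/",
    "DCTERMS": "http://purl.org/dc/terms/",
}
statements = [
    ("DC.title", "en", "Extensible Metadata for Your CMS"),
    ("OTHER.keywords", None, "metadata"),
]
# Hypothetical page URI.
print(to_ntriples("http://example.org/node/1234", statements, vocab_uris))
```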

License

The contents of this document are released under the Creative Commons Attribution 3.0 Unported License. If you make use of the material presented here, I require attribution as a contributor to your work. A link back to this page would be nice, too. Yes, you can use it commercially; if you make heaps of money out of it, I'm rather partial to full-bodied reds. Hint, hint.

If you do make use of this material in your project, I'd love to hear from you and link to your project from this page.

Appendices

Appendix A

Values for the metaterms table.

/*
defscheme is NULL where a term has no default scheme.
*/
insert into metaterms (vocabterm,vocab,defscheme) values
('DC.contributor',1,NULL),
('DC.coverage',1,NULL),
('DC.creator',1,NULL),
('DC.date',1,18),
('DC.description',1,NULL),
('DC.format',1,4),
('DC.identifier',1,17),
('DC.language',1,13),
('DC.publisher',1,NULL),
('DC.relation',1,NULL),
('DC.rights',1,NULL),
('DC.source',1,NULL),
('DC.subject',1,NULL),
('DC.title',1,NULL),
('DC.type',1,2),
('DCTERMS.abstract',2,NULL),
('DCTERMS.accessRights',2,NULL),
('DCTERMS.accrualMethod',2,NULL),
('DCTERMS.accrualPeriodicity',2,NULL),
('DCTERMS.accrualPolicy',2,NULL),
('DCTERMS.alternative',2,NULL),
('DCTERMS.audience',2,NULL),
('DCTERMS.available',2,NULL),
('DCTERMS.bibliographicCitation',2,NULL),
('DCTERMS.conformsTo',2,NULL),
('DCTERMS.created',2,18),
('DCTERMS.dateAccepted',2,NULL),
('DCTERMS.dateCopyrighted',2,NULL),
('DCTERMS.dateSubmitted',2,NULL),
('DCTERMS.educationLevel',2,NULL),
('DCTERMS.extent',2,NULL),
('DCTERMS.hasFormat',2,NULL),
('DCTERMS.hasPart',2,NULL),
('DCTERMS.hasVersion',2,NULL),
('DCTERMS.instructionalMethod',2,NULL),
('DCTERMS.isFormatOf',2,NULL),
('DCTERMS.isPartOf',2,NULL),
('DCTERMS.isReferencedBy',2,NULL),
('DCTERMS.isReplacedBy',2,NULL),
('DCTERMS.issued',2,NULL),
('DCTERMS.isVersionOf',2,NULL),
('DCTERMS.license',2,17),
('DCTERMS.mediator',2,NULL),
('DCTERMS.medium',2,NULL),
('DCTERMS.modified',2,18),
('DCTERMS.provenance',2,NULL),
('DCTERMS.references',2,NULL),
('DCTERMS.replaces',2,NULL),
('DCTERMS.requires',2,NULL),
('DCTERMS.rightsHolder',2,NULL),
('DCTERMS.spatial',2,NULL),
('DCTERMS.tableOfContents',2,NULL),
('DCTERMS.temporal',2,NULL),
('DCTERMS.valid',2,NULL),
('HTML.title',3,NULL),
('HTML.doctype',3,NULL),
('OTHER.description',4,NULL),
('OTHER.keywords',4,NULL),
('OTHER.author',4,NULL),
('OTHER.copyright',4,NULL),
('OTHER.generator',4,NULL);

Appendix B

Values for the vocabs table.

insert into vocabs values
('1','DC','http://purl.org/dc/elements/1.1/'),
('2','DCTERMS','http://purl.org/dc/terms/'),
('3','HTML',NULL),
('4','OTHER',NULL);

Note the inclusion of vocabs HTML and OTHER. I have provided these so that our metadata repository can store the title and doctype of the HTML document (vocab=HTML,) and various 'legacy' metadata terms (vocab=OTHER,) if so required. If these are excluded, the corresponding entries should also be excluded from the end of the insert in Appendix A, above.

Appendix C

Values for the schemes table.

insert into schemes (schemename) values
('Box'), ('DCMIType'), ('DDC'), ('IMT'),
('ISO3166'), ('ISO639-2'), ('LCC'), ('LCSH'),
('MESH'), ('NLM'), ('Period'), ('Point'),
('RFC1766'), ('RFC3066'), ('TGN'), ('UDC'),
('URI'), ('W3CDTF');

TinyURL for this page: http://tinyurl.com/ybgr2ds