Download the full OPERAS Design Study here: OPERAS Design Study
The technical mapping is a deliverable of WP3, ‘Technical and services requirements’, of the OPERAS-D project, whose objective is to identify the services the OPERAS Consortium would have to develop in the future and the method of implementing them in a fully distributed infrastructure. To achieve this objective, OPERAS must first better understand its own technical environment, which is very diverse and uneven, and then involve users to identify clearly which services the stakeholder communities need.
The technical mapping of the OPERAS environment is meant to provide a global description of the technical, organisational and information systems within the OPERAS Consortium. More precisely, the mapping has collected detailed information about workflows, software, development languages, data and metadata management, and dissemination and distribution tools.
The technical mapping was carried out through a questionnaire sent to the different partners. Each of them was sent a table structured along the most common types of digital publishing activities. As digital publishing is not yet standardized enough, a draft was proposed to various individuals and profiles from the Consortium and then collectively validated. Ten OPERAS members answered the questionnaire.
This work represents a first identification of practices, workflows and tools within the OPERAS Consortium. It is mainly a basic inventory. The categories used in the survey are going to be improved during the second semester 2017 through a collaborative process.
The categories used in the survey can and must be improved later through a collaborative process. The responses are detailed and represent a reliable collection of all the information needed. Nevertheless, some answers indicate that the categories used for the survey were somewhat too loose or too abstract. For instance, the questions about publishing on the one hand and workflow on the other created some confusion, and the same response could be found in both fields. The metadata questions were difficult to classify because of the different types and uses of metadata, but this aspect has to be better formalized in order to better describe the data management process within the Consortium. Compared to this first attempt, the main activities of the partners should therefore be defined anew in order to offer a better articulation between concepts and real practices.
For these reasons, we have decided not to follow the order of the tables but to reorder the content of this report on the basis of the schema in Annex 1. This schema represents in a circular way the various activities and missions of the digital publishers involved in the OPERAS Consortium.
The sections below are an adaptation of this schema to our technical content (see table ‘Functional architecture’ in Annex 2). We will present the various functions from the more technical to the more abstract.
Development language, Database, Size limit, Hardware
Leaving aside the front-end languages (HTML, CSS, JS), the general information collected regarding the development languages is two-fold:
- A first group of participants relies on an external IT system managed by their organization or a partner and has no information on the topic;
- Another group is characterized by in-house IT, that is, an independent IT department or an operationally autonomous set of IT skills (EKT, OAPEN, OBP, OE, SHARE, UGOE, UP).
In this second group, it will be useful, when many languages are involved, to better understand how each language is used. In this way, it will be easier to identify potential collaborations.
It is interesting to note, however, that a majority of partners are PHP/MySQL users. With the exception of MWS (Python/Zope Object Database) and UGOE (XML publishing with Apache Cocoon), all the others use PHP alone or in combination with other languages.
The database and data size limit give us information about the present data management status and its possible evolution. For books and/or journals only, here are the database sizes:
- Less than 1 GB (OBP, SHARE books, UGOE)
- Around 2 GB (SHARE journals)
- Around 15 GB (OE Books)
- Around 30 GB (EKT, OE journals)
- 100 GB (MWS), 240 GB (UP)
These data should nevertheless be completed with additional information on the purpose of each database and on whether several databases exist for each DBMS.
Some partners indicated a data size input limit (EKT, OAPEN, UGOE, UP), ranging from 20 MB to 4 GB, and it would be interesting to know whether this limit affects their practices, and in which way.
As for the hardware, here is the essential distribution:
- Virtual Machines: OBP (2 VMs)
- Servers: MWS (2 rented servers), SHARE (3 servers), UGOE (1 server), UP (6 servers)
- Servers and VMs: EKT (2 servers, n VMs), OE (21 servers, 40 VMs)
DATA AND METADATA PROCESSING
Indexing, Search functionality, Reference sets, Metadata standards, Identifiers
The processes which will create access points to the content or allow for its referencing are gathered in this section.
The indexing of the content is mainly handled in an automated way by the participants. A certain number of them use the full-text search provided by their publishing tool or repository application: OJS, OMP, EPrints or DSpace (EKT, SHARE, UniTo). Others use a dedicated search engine like Solr (OE, UGOE) or Lucene (OAPEN). Some manual indexing is nevertheless used to complete the work of the application (UGOE, OBP) or for specific purposes (SHARE for Worldcat). Automated indexing also allows for faceted search, but another set of questions could be useful in assessing the quality of the search functionality, especially by evaluating the results for each facet. Indeed, one participant reports poor results from the embedded search functionality of OJS/OMP.
A minority of participants also enrich their content with referenced subject headings: BIC, BISAC, VLB, LCSH (OAPEN, OE, UCL, UGOE). It is hard to assess how much these reference sets help the discoverability and if they are easy to maintain but more information on this question will be sought from the relevant partners.
Despite the similarities one would expect, the metadata standards used by participants vary (no two participants use exactly the same set of standards); this will be looked at more closely from an interoperability perspective. As we lack information on the way these metadata are generated, it is hard to tell how difficult an adjustment would be; it is worth mentioning, though, that some publishing tools allow for this generation (e.g. OJS). The main generated standards are DC, MARC and ONIX; DCQ and MARC XML are rarer. Alternative standards are METS, NLM, RFC1807, ESE and PICA XML. Leaving aside the various functions of the standards (DC for PMH, ONIX for distribution, etc.), it might be appropriate to gather more information about the specific use of each standard to check how far they are effectively interoperable.
Identifiers are another kind of metadata and we wish to highlight the rather wide use of interoperable identifiers. Alongside the HIRMEOS group (EKT, OAPEN, OE, UGOE), where DOI, ORCID and Funding registry are being implemented, others already have DOI (OBP, OLH, SHARE, UCL, UniTo, UP and soon MWS) or ORCID (OLH, SHARE, UniTo, UP).
On a related topic, which could have been investigated in the survey, it is interesting to mention that one partner is providing persistent URLs for its content (MWS).
Types, Number of documents, Printed copy, Publishing tools, Single source publishing
This section gathers the various elements of the OPERAS Consortium central activity of digital publishing.
The majority of the participants publish more than one type of document. Far from being limited to the more traditional journals and monographs, the types of documents handled by the participants cover almost the whole range of academic production. Even if all the different kinds of documents are not taken care of in the same way, it is interesting to note, in the perspective of the scholarly communication evolution, that some participants have expertise with different sorts of data. Alongside conference proceedings, textbooks and theses, we also find blogs, images, audio/video files, software or, potentially, any kind of data. It should be noted that sometimes the different types are handled with specific software, but this seems more related to the size of the organization (e.g. SHARE, UniTo).
The overall published content of the participants clearly gives a strategic position to the OPERAS Consortium. One partner remains isolated by its size and its variety (OE), but it would be interesting to know the trends and perspectives of each partner.
Print-on-demand services among the participants are more present than one might have expected (OBP, SHARE, UCL, UGOE, UniTo). If needed, this could allow for collaborative work or counsel.
As for the publishing tools, the first observation is the rather wide use of PKP’s software (OJS, OMP) among the partners (EKT, SHARE, UCL, UniTo and soon MWS). This obviously opens the possibility of collaborations, and it already does for some of them. As some participants in this group do not use PKP’s software for all their content (UniTo, MWS) and others use different tools for their content (Lodel and WordPress for OE), it might be interesting to investigate in more detail the relation between tool and purpose and the reasons for these choices.
Another important aspect regarding the publishing tools is the development. Two partners are managing an entire publication process with their own software: OE (Lodel), UP (Rua/Jura). Others have a strong development activity (OBP) or have produced plugins (EKT, MWS). This could lead to fruitful technical collaborations useful to the OPERAS Consortium.
The publishing tools analysis can also include the question of single-source publishing. While it seems easier to have a single pivot format with only one publishing software (XML-TEI / Lodel for OE), other participants also use XML (MWS) or PDF (UGOE) as a pivot format. This aspect couldn’t be detailed within the survey table, but it surely must be developed by these partners.
A final observation remains to be clarified in the future: it wasn’t always easy to tell what use the participants made of each software or application. Detailed benchmarking in this area would help to understand the different uses better.
Distribution, Referencing, Harvesting, Metrics
The majority of the participants use their own platform(s) to distribute their content (EKT, MWS, OAPEN, SHARE, UGOE, UniTo, UP). A smaller group uses other channels and, apart from one (OLH), this seems directly or partly related to their sales activity (OBP, OE, UCL, UP). For three of them (OBP, OE, UP), the number of distribution channels is logically very high. Even if it is of minor importance, we can note that one of these (OE) outsources the distribution process to electronic bookstores.
As for the referencing, it is more difficult to identify specificities. The main referencing entities among the partners are: DOAJ, DOAB, EBSCO. Nevertheless, not every participant has its contents referenced in each one and some referencing is sometimes more limited (MWS, UCL, OLH). Moving towards more uniform referencing throughout the Consortium would bring clear benefits.
On the other hand, almost every participant maintains an OAI repository that supports the harvesting protocol (OAI-PMH). Even if differences obviously exist between the sets or the standards used, this remains a solid basis for effective interoperability.
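To illustrate why a shared harvesting protocol is a basis for interoperability, here is a minimal sketch of how a harvester could build an OAI-PMH `ListRecords` request. The `verb`, `metadataPrefix` and `set` parameters and the `oai_dc` prefix come from the OAI-PMH specification; the base URL and set name are hypothetical and do not refer to any partner's actual endpoint.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    base_url is a hypothetical repository endpoint (an assumption for
    illustration); metadata_prefix defaults to unqualified Dublin Core,
    which every OAI-PMH repository is required to support.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optional selective harvesting by set, where the repository defines sets.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical endpoint, for illustration only.
    print(build_list_records_url("https://example.org/oai", set_spec="books"))
```

The same request shape works against any compliant repository; only the base URL, the available sets and the additional metadata formats differ from one partner to another.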
The situation regarding metrics appears rather disparate, even if some synergies seem possible. A certain number of partners are using or will use Google Analytics (OBP, OLH, SHARE, UCL, UP). Others provide COUNTER statistics (EKT, OAPEN, OE, UniTo), but more information would be useful here, as the production of COUNTER statistics is rather complex for OE, while it seems automatic for UniTo with OJS. Finally, some partners use other applications: Piwik (MWS, OE, UP), Awstats (OE, soon to be completely replaced by Piwik) and ALM metrics (SHARE).
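The possible synergies can be made easier to see by inverting the survey answers, from "tool to partners" into "partner to tools". The sketch below is only an illustration: the data are transcribed from the paragraph above (including tools that partners "will use"), and the function names are hypothetical.

```python
from collections import defaultdict

# Survey answers as reported above: metrics tool -> partners using (or planning) it.
METRICS_TOOLS = {
    "Google Analytics": ["OBP", "OLH", "SHARE", "UCL", "UP"],
    "COUNTER": ["EKT", "OAPEN", "OE", "UniTo"],
    "Piwik": ["MWS", "OE", "UP"],
    "Awstats": ["OE"],
    "ALM": ["SHARE"],
}


def tools_by_partner(tools):
    """Invert the mapping: partner -> list of metrics tools, in survey order."""
    result = defaultdict(list)
    for tool, partners in tools.items():
        for partner in partners:
            result[partner].append(tool)
    return dict(result)


def shared_tools(tools, min_partners=2):
    """Keep only the tools used by at least min_partners partners."""
    return {
        tool: sorted(partners)
        for tool, partners in tools.items()
        if len(partners) >= min_partners
    }


if __name__ == "__main__":
    for partner, used in sorted(tools_by_partner(METRICS_TOOLS).items()):
        print(partner, "->", ", ".join(used))
    print("Shared tools:", ", ".join(sorted(shared_tools(METRICS_TOOLS))))
```

Tools shared by several partners are the natural candidates for common practices; partners combining several tools may be able to explain how the resulting figures compare.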
Peer-reviewing, Proofreading, Typesetting
In this ‘editing’ section we bring together peer-reviewing, proofreading and typesetting as parts of the traditional publishing activity. Although not always directly involved in this editing work, most of the participants have it integrated into their own workflow. The situations are quite diverse and range between two extremes: from the participants who are not involved in editing (UniTo) to those who are traditional publishers (OBP and UCL). In between, we find different levels of involvement.
As for peer-reviewing, we can observe that the publishers amongst the participants perform peer-reviewing more or less directly (UGOE, UCL, OBP). In the other cases (dissemination platforms), peer-reviewing is a requirement or a recommendation (OE, EKT); the difference between the two may have to be clarified in future surveys. The peer-reviewing of journals and of books tends to be the same (e.g. two academic referees), but this may also need to be confirmed by each participant concerned.
Proofreading and typesetting are mainly undertaken by the editor and the author. Nevertheless, the same participants involved in the peer-reviewing also do the proofreading and the typesetting (OBP, MWS), but some also outsource these activities (UCL, OLH).
Process steps, Formats management, Access rights
Since the status, services and organization of the Consortium partners are very different, the workflows used by the partners cannot be identical. It was in fact difficult to give a clear and schematic representation of this section. Nevertheless, it should be possible to identify the tasks defining their mission, and more precisely their types, number and complexity.
The answers led to a first observation: those partners who use PKP publication tools (OJS, OMP) are greatly helped in structuring and formalizing their workflow. Although this gives a clear representation of the workflow, it is mainly ‘author-oriented’ and doesn’t really focus on the digital publisher’s work (the ‘layout editor’ in the OJS schema). Even if such a schema isn’t necessary for the OPERAS Consortium, a short list of the main publishing activities would be useful to better assess the strengths and weaknesses of the partners’ workflows. This list could be more or less the list of sections used in this report and is reflected in the various answers. For a better focus on the ‘who does what when?’, the list can be summarized in these specific digital publishing steps:
- Editing: peer-reviewing (partly carried out, verified, requested?); copy-editing/typesetting (outsourced or not?); linear or circular process; access rights to the platform for authors or editors?
- Admission: document taken as it is sent; document modified (another format? Which one(s) with which tool?).
- Enrichment: adding metadata (for search, for dissemination, for archiving?).
- Dissemination: production of the output formats for the platforms; specific tasks related to the distribution outside the
These various aspects can of course be amended or completed, but they would give some sound elements to evaluate the length, the complexity and the efficiency of the digital publishing process and would be useful for the training programs of the infrastructure which help new publishers to set up their press.
Status, Funding, Budget
Although these activities are, strictly speaking, outside the perimeter of the technical mapping, organizational characteristics have technical implications: IT autonomy and size, ability to scale, HR availability, etc. Essentially, one dominant organizational model emerges from the survey: public status with institutional funding.
However, there are a few exceptions:
- OAPEN: a not-for-profit foundation with public institutional funding;
- OLH: a charitable company whose funding comes from library subscriptions;
- OpenEdition: a public organization which receives institutional funding and freemium sales revenue;
- OBP: a CIC (specific UK status allowing profits for public good) funded by grants, membership and sales;
- UP: a private limited company financed by APC/BPC and fees for books and journals.
The information on budgets was rather poor; as it was somewhat peripheral to the technical investigation, it will be collected in full on another occasion.
A last set of questions tried to identify the interest of the partners in each other’s features and tools or outside the OPERAS Consortium. It was probably a bit too soon to ask the participants which technical interactions were possible for them with or within the OPERAS Consortium; this report might help to identify possible collaborations.
Among the few suggested collaborations, however, we can note the interest in the HIRMEOS implementations: identification, annotation, entity recognition (OBP, SHARE, UniTo). One partner would be interested in changing its method of publication by using OJS (OBP), which is already used by other partners. As another potential development for the entire OPERAS Consortium, some participants would like to enrich their system with data mining or text analysis (SHARE, UGOE).
Read the full Report: OPERAS Technical Mapping