Thursday, January 12, 2017

The open access aggregators challenge — how well do they identify free full text?

Bielefeld Academic Search Engine (BASE) created by Bielefeld University Library in Bielefeld, Germany is probably one of the largest and most advanced aggregator of open access articles (hitting over 100 million records), others on roughly the same level are CORE (around 60 million records) and OAIster (owned by OCLC).

One way of seeing this class of open access aggregators is to see them as similar to web scale discovery search engines like Summon, EDS, Primo and WorldCat Discovery service. but focusing mainly in the open access context.
How well do web scale discovery engines cover open access?
It seems natural to think that index based solutions like Summon, Primo, EDS should cover both paywall contents as well as open access content, particularly since they typically can use OAI-PMH to harvest the institution’s own Institution Repository. In reality, their coverage of open access material can be spotty. The best ones have indexed OAIster or BASE. But even when open access sources are available in the index, many institution’s choose not to turn them on  for various reasons. This includes unstable links, inability to correctly show only open access material as well as flooding of results by inappropriate data (e.g foreign language or irrelevant subjects).

A unique challenge for open access aggregators

One area where BASE and CORE may differ from Summon and Primo is in that open access aggregators need to be able to tell if an article they harvest from a subject or institutional repository has free full text and this isn’t that easy.

This seems odd if you do not understand the history of open access repositories, but suffice to say when OAI-PMH (which is the standard way of harvesting open access repositories and was established as a way of harvesting metadata only and not full text) was established it was envisioned that most if not all items in such open access repositories would be open access (following the example of Arxiv), so no provision was made to have a standard way or of a mandatory field to indicate if the item is free to access.

In today’s world, of course subject and in particular institutional repositories are a mix of free full text and metadata only records. This happens in particular for institutional repositories because they have multiple goals beyond just supporting open access.
What are the multiple purposes of Institutional repositories?
While most librarians are familiar with Institutional repositories mission to support open access they may not be aware that it is not their only purpose (I also argue even advocates who support self archiving in the open access agenda can have different ultimate aims). Other purposes include
a) “to serve as tangible indicators of a university's quality and to demonstrate the scientific, societal, and economic relevance of its research activities, thus Increasing the institution's visibility, status, and public value” (Crow 2002)
b)"Nurture new forms of scholar communication beyond traditional publishing (e.g ETD,  grey literature, data archiving" – (Clifford 2003)
It is purpose A, tracking the institutional’s output that results in Institutional repositories hosting more than just full text items. Many institutional repositories have in fact more metadata only items than full text. It’s a rare Institutional respository that has more than a third full text records. 

Truth be told, most open access aggregators I have seen simply give up on this problem and just aggregate the contents of whole institutional repositoriesgiving users a mistaken idea that everything is free.

This leads to users wondering if something is wrong when they click through and get led to a metadata only record in the repository. This btw was the reason why I and I suspect many librarians tend not to turn on open access repositories available via Summon/Primo because it doesn’t really show only open access items and it’s a rare few that is say 99% free items (typically ETD or electronic thesis dissertations collections but even though has the occasionally embargoed ones), while many have in fact more metadata only records then full text records particularly if they blindly pull in metadata content via their institution’s research publication systems and/or Scopus/Web of Science.

There are of course ways to identify full text in repositories and Google Scholar seems to do it beautifully on an item level (via intelligent spidering to detect pdfs?) but that doesn’t seem common for non-google systems. As it stands, Google Scholar is current my #1 choice whenever I need to check if free articles exist.

One possibility is for institutional repositories to create “collections” that are 100% or near 100% full text and pull in such items by collections. This usually is what happens for ETD.

The other way of course is to set a metadata tag for each item that has full text but I’m not sure if there is 100% universal standard for this. A good start might be OpenAire’s standard.

BASE indeed does suggest you to support this for optimal indexing. I am not sure how wide spread this is outside the EU.

I’m not a repository manager so I’m not sure how this works, but I get the distinct impression that Digital Commons repositories can definitely reliably identify full text records, given that there can be full-text PDF RSS feeds, I’m just not sure how a third party aggregator can exploit that to identify full text and whether it can be generalised to all Digital commons repositories.

In any case, I think one can probably “hack” and create workarounds to reliably detect full text for one repository the trick is to do it without much work for most of them.
In a sense centralised Subject repositories have the advantage over institutional ones here because by the virtue of their mass, there is great incentive for aggregators to tweak compatibility with them compared to any individual institutional repository.

In any case, both BASE and CORE are capable of identifying full text records in their results, the question is how accurate are they?

How well does BASE and CORE do for identifying full text?

The nice thing about BASE is that it allows you to run a “blank search” which gives you everything that meets the criteria (similar to Summon). So one can easily segment the index based on criteria you desire without crude workarounds like searching for common words that all records would have.

Base results restricted to Source: Singapore

The above shows that when restricted to Singapore sources, BASE knows of

66,934 records from National University of Singapore’s IR — dubbed ScholarBank@NUS (using Dspace)

records from Nanyang Technological University’s IR — dubbed DR-NTU (using Dspace)

records from Singapore Management University’s IR — dubbed INK (using digital commons). [Disclosure I’m a staff of this institution]

Based on my colleague's recent Singapore update on open access figures for total records in each of the repositories — this shows a rough coverage of 67%, 89%, 98% respectively in BASE.

Take this figures with a pinch of salt because the total records I am getting are based on different times, e.g the NUS total record is as of 30 Sept 2016, NTU total record is of 18 October 2016. NUS also has fairly substantial non-traditional records eg. patents and music recordings so that might affect the result. Lastly, I did the search in BASE in early Jan 2017 while the total records are from a quarter earlier, so the actual coverage is probably a bit lower.

Overall, the coverage shown isn’t too bad, but the more important point is how well does BASE identify full text? Let us filter to Access : Open Access

Full text identified by BASE

Not very well it seems.

It is only able to identify 75 free records in National University of Singapore’s IR, 654 free records in Nanyang Technology University’s IR, 143 free records in Singapore Management University’s IR

I did not do a check to see if there were false positives in BASE’s identification of full text but in the best case scenario they are 100% correct, we see only a full text identification ratio of 0.6%, 3.8% and 2.7% respectively!

If you consider the case of Singapore Management University (disclosure again I am staff there), BASE is able to index practically every record in our Repository and yet only identifies 2.7% of our free full text. It’s in the same ballpark for the other Singapore repositories.

Let’s do the same for CORE. How many records does it index for the 3 Singapore repositories?

Here are the results :

National University of Singapore’s Scholarbank.

Records (100,657) + Full text (12)

Keyword : repository: (“Scholarbank@NUS”)

Singapore Management University — INK

Records (18,312) + Full text (166)

Keyword : repository: (“Institutional Knowledge at Singapore Management University”)

Interestingly enough I was unable to find any articles indexed in CORE from Nanyang Technological University’s IR, it’s possible I might have missed them somehow.

In any case, I won’t calculate the percentages for the other 2 IRs, there are broadly similar to the case in BASE, except CORE seems to show substantially more records (including metadata only records) indexed than in BASE.

In fact, CORE is showing more records indexed for both universities then the total records listed in the Singapore update on open access figures (e.g 100k vs 99k in NUS and 18k vs 16k for SMU). This possible because the total records from the Singapore update on open access figures generally refer to 3Q 2016 figures so since then the number of records would have grown.

Still I suspect that’s not the full reason, there could be duplicates archived in CORE inflating the result.

More importantly in terms of records identified as free full text the results for CORE are as dismal as BASE.


Both BASE and CORE are extremely sophisticated open access aggregators. For example they offer APIs (BASE, CORE), are indexed by some web scale discovery services, are doing various interesting things with ORCID, here also, creating recommendation systems or working with OADOI to help surface green open access articles hiding in respositories.

A difference is that BASE currently doesn’t search through full text while I believe CORE does.

However identifying which articles they have harvested has free full text is still problematic, BASE claims to be able to reliably identify 40% of their index as full text though the other 60% is still unknown due to lack of metadata. My own quick tests shows that it’s accuracy is quite bad for certain repositories. My hunch is that BASE either works very well with some respositories or not at all with others.

So this is a major challenge for the open access community and in particular institutional repositories to answer. The alternative is to shrug one’s shoulder’s and let Google Scholar be the default open access aggregator.

Friday, December 30, 2016

Library Discovery and the Open Access challenge - Take 2

Earlier this year, over at medium , I blogged about the Library Discovery and the Open Access challenge and asked librarians to consider how library discovery should react to the increasing pool of free material due to the inevitable rise of open access.

At the limit when nearly everything is freely available it is possible to consider whether library will have a place in the discovery business. After all, if all researchers have access to the same bulk of journal articles, does it really make sense for each institutional library to provide a separate discovery solution? Even today, many researchers prefer using Google Scholar and other non-institutional discovery solutions that operate at web scale and some (mostly students) grudgingly use our discovery systems to restrict discovery to things they have immediate access to.

This of course is the library discovery will be dead scenario when (almost) everything is free  and not everyone agrees. Some argue, that libraries can add value by providing superior and customized personalized discovery experiences because we know our users better (e.g what courses they taking/teaching, their demographics etc). Then there are plans to leverage linked data etc but I know regretfully little of that.

But the day when open access is dominant is still not here. We live in the world where there is a mix of toll based access and rising but uneven free access, Scihub notwithstanding. I opined that for now "if we really want to stay in the discovery business we need to be able to efficiently and effectively cover the increasing pool of open access resources".

So how does you ensure the library discovery system includes as much discovery of free open access articles as possible?

The idea of a open content discovery matrix by Pascal Calarco, Christine Stohn and John Dove comes to mind.

For most academic libraries who subscribe to commercial discovery indexes (Worldcat Discovery, EBSCO Discovery Service, Summon and Primo, with the later two having merged indexes), there isn't much libraries can do beyond hoping that discovery vendors include such content in their index.

Well I recently came across services like  1science's oaFindr that claims to have a high quality 20 million database of open access papers that perhaps could help? There's also a oafinder+ product that can identify green and gold OA articles for your institution only.

Even if  you can find open access metadata for content that is available for indexing, delivery issues still might occur in index based discovery services as link resolvers are infamously bad at linking to hybrid journals and practically ignore Green Open access articles. 

A alternative approach to such "pull" approaches is a push approach. The new service  (and an earlier service DOAI) is one of the more interesting things to emerge from this year's open access week and it can used together with discovery services.

The idea is simple. One of the challenges of discovery of open access journals in particular Green open access articles archived at subject repositories and institutional repositories is that in general there is no systematic easy way to find them.

 With the service, you can feed the service a doi and it will attempt to locate a free version of the paper, and this includes both articles made free via the Green or Gold roads.

Here's an example say you land on this article page,  Grandchild care, intergenerational transfers, and grandparents’ labor supply on Springer and you have no access.

Quick as a flash, you grab the DOI 10.1007/s11150-013-9221-x and look it up like this . And you get autoredirected to the preprint full text on our institutional repository.

Looks like magic! How does it work? The oadoi service uses a variety of means to try to detect if a open access version of an article is available (see below), but it looks to me that the main source for detecting articles on institutional repository in particular is via the aggregator BASE, so make sure your institutional repository is indexed in BASE.

My own limited testing with Oadoi was initially pretty disappointing as it failed to find most of the articles hosted on my Institutional repository (hosted on Bepress digital commons). It's possible that the way our institutional repository exposes the doi was not correctly picked up by BASE, but this seems to have been resolved somewhat. More testing required.


Savvy readers of this blog might already be screaming, why bother? Just use Google Scholar or plugins like Google Scholar button or Lazy Scholar buttons (which use Google Scholar in background) and all your problems are solved.

It's true that Google Scholar is pretty much unbeatable for finding free articles but the value in OADOI is that it offers a API.

Already many have been quick to use it to provide all kinds of services. For enable Zotero uses it as a lookup engine, librarians have created widgets etc.

But it's greatest value lies in the fact that it can be embedded into discovery services and link resolvers.

Here's work done on SFX doi service and alma libraries like Lincoln University have not been slow to include it either.

These are fairly basic uses of oaidoi and enable users to help direct users to open access content. Still such implementations are usually a "last resort, try it if it works " kind of deal currently and there is no guarantee clicking on the link will work. If you are Exlibris customer on Primo do consider supporting this the feature request "Add as an option in uresolver" which proposes " displays as an option if the API's value of is_free_to_read is true".

To DOI or not to DOI?

A lot of the problems about discovery and delivery of open access content lies in the fact that there are different variants of the same content.

In the old days it was pretty straight forward the only thing that we tracked and access was the article that appears in the journal.

Today, we make accessible a wide variety of content (data, blog posts, conference papers, working papers) and even worse different versions of the same content at different stages of the research lifecycle (preprint/postprint/final published version).

This leads to a great challenge for discovery.

It doesn't help there is a terminology muddle (despite NISO's best efforts at standardising terminology on Journal article Version names and license and access indicators), with people using terms like preprint/postprint/final published versions while others use author submitted manuscripts, author accepted manuscripts and version fo record.

But I think even beyond that, the question I always wonder is , how do we identity/address each version and these days it means assigning dois. The final version of record will have a DOI of course but what about the rest?

As such, I've always been confused about the practice of assigning dois to non peer reviewed papers. For example, should one assign dois to preprints? post-prints? working papers? Should they be a different doi from the final published version? It doesn't help that when you upload items to ResearchGate it offers to create a doi.

I could be wrong, but up to recently I don't think there was a clear guide. But in recent months there seems to be two developments that seemingly clarify this.

First crossref announced they are allowing members to register preprints. The intention here seems to be that the doi of the final version of record is to be a different doi, though there are ways to crosslink both papers, There's even a way to show a relationship between preprints and the later versions as explained in the crossref webinar.

The oadoi service mentioned earlier seems to be pushing in the other direction , encouraging postprints listed in repositories to be added using the same doi as the final version of record to make the postprint findable (but does it mean the preprint isn't since it will have a different doi?). This allows you to find the postprint using the oadoi service as both postprint and final version of record as the same doi.

I'm not quite sure if this is a good idea, while studies show most postprints are not that different then the final published version, it does seem to be a good idea to be able to track the two versions differently. Still mulling over this.


This will probably be my last post for 2016. This year I was particularly inspired towards the end of the year with many ideas but didn't have the time to craft them so expect a flood in the coming year.

I also would like to thank all my loyal readers for following this blog and reading my long winded posts. Next year this blog will be celebrating it's 8th anniversary and my 10th anniversary in the library industry and I might do something special.

Till then, stay happy and healthy and have a great new year's day!

Saturday, December 10, 2016

Aggregating institutional repositories - A rethink

In recently months, I've become increasingly concerned about the competition faced by individual siloed institutional repository versus bigger more centralised repositories like subject repositories and commercial competitors like ResearchGate.

In a way the answer seems simple, just get someone to aggregate all the institutional repositories on one site and start building services on top of that to compete. Given that all institutional repositories already support OAI-PMH, so this seems to be a trivial thing to do. Yet I'm coming to believe that in most cases, creating such an aggregator is pointless. Or rather if your idea of a aggregator is simply getting a OAI-PMH harvester , point it at the OAI PMH endpoints of the repositories of your members and dumping everything into a search interface like VUFIND or even using something commercial like Summon or EDS without any other attempt to standardise metadata, and call it a day, you might want to back off a bit and rethink. For the aggregator to add value, you will need to do more work.....

A simplistic history of aggregation in libraries

Let me tell you a story...

In the 90s - libraries began to offer online catalogues to allow users to help themselves find out what was available (in their mostly print) collections. These sources of informations were siloed and while they were on the web, they were mostly invisible to web crawlers. The only way you could find out what libraries had in their collections would be to go to each of their catalogues and searched.

So, someone said "Why not we aggregate them all together" and Union catalogues (including virtual Union catalogues based on federated searching) were built e.g Copac. People could now search across various silos in one place and all was well.

Librarians and Scholars used such union catalogues to decide what and who to do ILL from and to make collection decisions. Many were still invisible to Google and web search engines (except for a few innovators like OCLC), but it was still better than nothing.

By the late 90s and early 2000s, libraries began to create "digital libraries" (e.g Greenstone digital library software). It was the wild west and digital libraries at the time build up digital collections consisting of practically anything of interest such as digitized images of music scores, maps, photographs -  anything except for peer reviewed material. Most material on digital libraries was often difficult to find or invisible via web search engines for various reasons (e.g. non-text nature of content, lack of support of web standards etc) and it made sense for some degree of aggregation at various levels such as national or regional levels.

Today larger collections like Europeana exist and all was well.

Then came the rise of the Institutional repositories, and by 2010s, most universities had one.

Unlike it's predecessors, the main distinguishing point of institutional repositories was that for many it was designed around distributing Scholarly peer reviewed (or likely to be peer reviewed) content.

While it's true many institutional repositories do contain a healthy electronic thesis collection and some even inherited the mission of what would be earlier called digital libraries and carried grey literature and other digital objects such as data the main focus was always on textual journal articles.

The other major difference is that by then all Institutional Repositories worth the name supports the OAI-PMH standard which making harvesting and aggregating metadata of content in them easy....

And of course , the same logic seem to suggest itself again, why not aggregate all the contents together? And today, we have global aggregators like CORE (not this other CORE - Common Opens Repository Exchange) , BASE and OAISTER as well as regional aggregators built around associations and organizations both national and regional.

In my region for example there's the AUNILO (ASEAN University Network inter-library online) institutional repository discovery service that aggregates content from 20 over institutional repositories in ASEAN. Most University libraries in Singapore are also part of PRRLA (Pacific Rim Research Library Alliance) formerly PRDLA., which also has a Pacific Rim Library (PRL) project built around OAI-PMH harvesting.

I'm sure similar projects exist all around the world based on aggregating data by basically harvesting via OAI-PMH harvestors. And yet, I'm coming to believe that in most cases, creating such an aggregator is pointless, unless additional work is done.

Or rather if your idea of a aggregator is simply getting a OAI-PMH harvestor , point it at the OAI PMH endpoints of the repositories of your members and dumping everything into a search interface like VUFIND or even using something commercial like Summon or EDS, and call it a day, you might want to back off a bit and rethink.

I argue that unlike UNION catalogues or aggregation of digital libraries (by this I mean not the traditional Institutional repository of text based scholarly articles), aggregation of institutional repositories is likely to be pointless, unless you bring more to the table.

Here's why.

1. Items in your institutional repository are already easily discoverable

Unlike in the case of most library catalogues, items in your institutional repository are already easily findable in Google and Google Scholar. There is little value in creating an aggregator when such an excellent and popular one as Google and Google Scholar exist.

101 Innovations in Scholarly Communication - 89% use Google Scholar to search for literature/data

Given the immense popularity of Google Scholar, what would your simple aggregator based around OAI-PMH offer that Google Scholar does not that would make people come to your site to search?

2. Most simple repository aggregators don't link reliably to full text or even index full text

Union catalogues existed in a time, where it was acceptable for users to find items that had no full text online. You used it to find which libraries had the print holdings and either went down there to view it, or used Interlibrary loan to get it.

In today's world, direct to full text is the expected paradigm and you get undergraduates wondering why libraries bother to subscribe to online subject indexes that show items the library may not have access to.

Now how much worse do you think they feel when they search one of your repository aggregators and realise they can't figure out which item has full text or not until they click on it? This is where a glaring weakness in OAI-PMH rears its head.

I first encountered this problem when setting up my Web Scale Discovery Service - Summon a few years back, and I was surprised to realise that while I could easily harvest entries from my Institutional Repository (Dspace) into Summon via OAI-PMH, I couldn't easily get Summon to recognise if an item from the Dspace repository had full text or not.

I remember been stunned to be told that there was no field in the default Dspace fields that indicated full text or not.

This sounds crazy by today's standards. But a little understanding of the context of the time (1999) when OAI-PMH came about helps. It's a long story, but correct me I'm wrong but it was conceived at a time where preprint server Arxiv was the model and it was envisioned repositories would be 100% full text items, so there was no need for such a standard field. Today, this is of course not what happened, due to varying goals on what an Institutional repository should be and reluctance of researchers to self deposit we have a mix of both full text and metadata only items.

Another quirk about OAI-PMH that might surprise many is that it only allows harvesting of metadata only not full-text. Again in today's world where full-text is king and people are accustomed to web search engines (and library full text databases that have followed their lead) matching in the whole document and have search habits designed for that, they find aggregators based around OAI-PMH that only contain metadata odd to use. This is the same problem many students have with using traditional catalogues.

I understand there can be algorithmic workarounds to try to determine if full text exists and some aggregators try to do so with varying results but many don't and just display everything they grab via OAI-PMH.

To top it all off, Google Scholar actually has none of these problems. They can pretty reliably identify if the full text exists and where and combine that with the library links program you can easily tell if you have access to the item.

They crawl and index the full text, and can find items based on matching in full text and can often provide helpful search snippets before you even click into the result.

A vanity search of myself allows me to see where my name appears in context in the full text not just in abstracts

3. Aggregation doesn't have much point due to lack of consistency in standards

Think back to Union catalogues of traditional catalogues back then called OPACs. The nice thing about them was most of them were created using the same consistent standards.

There was MARC, Call number schemes like LCC/DDC/UDC, subject headings standards like LCSH/MeSH that you could crosswalk etc. So you could browse by subject headings or call numbers etc.

I'm probably painting a too positive view of how consistent standards are, but I think it's fair to say that in comparison institutional repositories are in an even worse state.

Under the heading for "Minimal Repository Implementation" in "Implementation Guidelines for the Open Archives Initiative Protocol for Metadata Harvesting" we see it advises that "It is important to stress that there are many optional concepts in the OAI-PMH. The aim is to allow for high fidelity communication between repositories and harvesters when available and desirable."

Also under the section on dublin core which today is pretty much the default we see "Dublin Core (DC) is the resource discovery lingua franca for metadata. Since all DC fields are optional and repeatable, most repositories should have no trouble creating at least a minimal mapping of their native metadata to unqualified DC. "

Clearly, we see the original framers of OAI-PMH decided to give repositories a lot of flexibility on what was mandatory and what wasn't and only specified a minimum set.

In addition the "lingua franca for metadata", unqualified dublin core perhaps on hindsight was not the best option, not when most of your content is journal articles.

Even Google Scholar recommends against the use of Dublin core in favour of other metadata schemes like  Highwire Press tags, Eprints tags or BEpress tags.

On the section of getting indexed on Google Scholar, they advise repository owners to "use Dublin Core tags (e.g., DC.title) as a last resort - they work poorly for journal papers because Dublin Core doesn't have unambiguous fields for journal title, volume, issue, and page numbers."

Even something as fundamental today as doi (and in the future ORCID) isn't mandated.

I recently found that out when I realised the very useful service that allows you to input a DOI and find a copy in repositories (among other ways it works is that it searches for items indexed in BASE) failed for our institutional repository because doi indentifer wasn't in our unqualified Dublin core feed and that was picked up by BASE. The lack of standards is holding repositories back.

Leaving this aside, I'm not sure why this happened (I have a feeling that up to recently the same people working on institutional repositories were not the same people working on cataloguing) but most institutional repositories content do not use controlled vocabulary for subject headings or for subject classification, though they could easily do so.

As a result, unlike in catalogues, once you have aggregated all the content, you can easily slice the content by discipline (e.g. LC call range) or by subject headings (e.g. LCSH).

With aggregators of repositories you get a mass of inconsistent data. Your subjects are the equivalent of author supplied keywords and there is no standardised way to filter to specific disciplines like Economics or Physics.

 The more I think about it the more this lack of standardisation is hurting repositories.

For example, I love the digital commons network that allows me to compare and benchmark performance across all papers posted via digital commons repositories in the same discipline. This is possible only because digital commons has a hosted service has a standardised set of disciplines.

What should your aggregator of repositories do?

So if you read all this and are undeterred but still want to create a aggregator of institutional repository what should you do?

Here's some of the things I think you should shoot for beyond just aggregating everything and dumping it into one search box.

a) Try to detect reliably if an entry you harvested has full text

b) Try to index full text not just metadata

CORE seems to match full text in my search?

One way to detect reliably if full text exists or not is to decide on a metadata field that all repositories you are harvesting from has a metadata field indicating full text. But that won't scale currently at a global level. Another way is to try to crawl repositories to extract pdf full text.

Ideally the world should be moving away from OAI-PMH and start exposing content using new methods like resource-sync so not just metadata alone is synced. I understand that the PRRLA is working on a next generation repository among it's member that will use Resource-Sync.

c) Create consistent standards among repositories you are going to harvest

If you are going to aggregate repositories from say a small set of member institutions, it is important to not just focus on the tech but also focus on metadata standards. It's going to be tough, but if all institution members can agree on mapping to a standard (hint look at this), perhaps even something as simple as providing a mapping to Disciplines, the value of your aggregator increases a lot.

d) Value added services and infrastructure beyond user driven keyword discovery

Frankly, aggregating content just for discovery isn't something that is going to be a game changer even if one provides the best experience with consistent metadata allowing browsing, indexes full text etc as services like Google Scholar are good enough already.

So what else should you do when you aggregate a big bunch of institutional repositories? This is where it gets vague, but the ambitions of SHARE . while big show that aggregators should go beyond just supporting keyword based discovery.

See for example this description of SHARE

"For these reasons, a large focus of SHARE’s current grant award is on metadata enhancement at scale, through statistical and computational interventions, such as machine learning and natural language processing, and human interventions, such as LIS professionals participating in SHARE’s Curation Associates Program. The SHARE Curation Associates Program increases technical, curation confidence among a cohort of library professionals from a diverse range of backgrounds. Through the year-long program, associates are working to enhance their local metadata and institutional curatorial practices or working directly on the SHARE curation platform to link related assets (such as articles and data) to improve machine-learning algorithms."

SHARE isn't along there are other "repository networks" include OpenAIRE (Europe), LA Referencia (Latin America) and Nii (Japan), that work along similar lines , trying to standardise metadata etc.

Others have talked about layering a social layer over aggregated data similar to ResearchGate/, or provide a infrastructure for new forms of scholarly review and evaluation.

Towards a next generation repository?

In past posts on institutional repositories I've been trying to work out my thinking on institutional repositories and it's a complicated subject particularly with competition from larger more centralised subject and social repositories like ResearchGate.

I'm coming to think that to counter this individual smaller repositories need to link up together but yet this cannot be currently done in an effective way.

This is where "next generation repositories" comes in and they may have probably heard about this most prominently under the umbrella of COAR (Confederation of Open Access Repositories).

What I have described above is in fact my layman's understanding of what the next generation repositories must achieve (For a more official definition see this) and why.

Officially the next generation repositories focus on Repository Interoperability (See The Case for Interoperability and The Current State of Repository Interoperability )- which includes working groups on controlled vocabulary and open metrics and even linked data.

All this is necessary for institutional repositories to take their place as necessary and equal partners in the scholarly communication network.


I had the opportunity to attend the Asian Open Access Submit in November at Kuala Lumpur and learned a lot, particularly the talk by Kathleen Shearer from COAR, the Confederation of Open Access Repositories on repository networks helped clarify my thinking on the subject.

Friday, November 18, 2016

5 reasons why library analytics is on the rise

Since I've joined my new institution more than a year ago, I've focused a lot on this thing called "library analytics".

It's a new emerging field and books like "Library Analytics and Metrics, using data to drive decisions and services" , following by others are starting to emerge.

Still the definition and scope of  anything new is always hazy and as such my thoughts on the matter are going to be pretty unrefined, so please let me think aloud.

But why library analytics? Libraries have always collected data and analysed them (hopefully), so what's new this time around?

In many ways, interest in library analytics can be seen to arise from a confluence of many factors both from within and outside the academic libraries. Here are some reasons why.

Trend 1 :Rising interest in big data, data science and AI in general

I don't like to say what we libraries deal in is really big data (probably the biggest data sets we deal with is in ezproxy logs which can be manageable depending on the size of your institution) , but we are increasingly told that data scientists are sexy and we are seeing more and more data mining, machine learning, deep learning and all that to generate insights and aid decision making. 

Think glamour projects like IBM Watson and Google's AlphaGo. In Singapore, we have the Smart Nation initiative which leads to many opportunities to work with researchers and students who see the library has a rich source of data for collaboration. 

In case you think these are sky in the pie projects - already IBM Watson is threatening to replace Law librarians , and I've read of libraries starting projects to use IBM Watson at reference desks.

Academic libraries are unlikely to draw hard core data scientists as employees, but we are usually blessed to be situated near pockets of talent and research scientists who can collaborate with the library. 

As Universities start offering courses focusing on Analytics and data science, you will get hordes of students looking for clients to practice on and the academic library is a very natural target as a client to practice on.

Trend 2: Library systems are becoming more open and more capable at analytics

Recently, I saw someone tweeting that Jim Tallman who is CEO of Innovative Interfaces declaring that libraries are 8-10 years behind other industries in analytics.

Well if we are, a big culprit is the  integrated library system (ILS) that libraries have been using for decades. I haven't had much experience poking at the back-end of systems like Millennium (owned by Innovative), but I'm always been told that report generation is pretty much a pain beyond fixed standard reports.

As a sidenote, I always enjoy watching conventionally trained IT people come into the library industry and then hear them rant about ILS. :)

In any case, with the rise of Library Open service platforms like Alma, Sierra (though someone told me that all it does is basically adds SQL but that's a big improvement) etc more and more data is capable of being easily uncovered and exposed.

A good example is Ex Libris's Alma analytics system. Unlike in the old days where most library systems were black boxes and you had great difficulty generating all but the most simple reports, systems like Alma and other Library Service Platforms of its class, are built almost ground up to support analytics.

You don't even have to be a hard core IT person to drill into the data, though you can still use SQL commands if you want.

With Alma you can access COUNTER usage statistics uploaded with Ustat (eventually Ustat is to be absorbed into Alma) using Alma analytics. Add Primo Analytics, Google analytics or similar that most Universities use and a big part of the digital footprints of users is captured.

Alma analytics - COUNTER usage of Journals from one Platform 

Want to generate users and the number of loans by school made in Alma? A couple of clicks and you have it.

Unfortunately there still seems to be no easy way to track usage of electronic resources by users as COUNTER statistics are not granular enough. The only way is by mining ezproxy logs which can get complicated particularly if you are interested in downloads not just sessions.

This is still early days of course, but things will only get better with open APIs etc. 

Trend 3 : Assessment and increasing demand to show value are hot trends

A common trend on Top trends list for academic libraries in recent years (whether lists by ACRL or Horizon reports) is assessment and/or showing value and library analytics has potential to allow academic libraries to do so.

Both assessment (understanding to improve or make decisions) or advocacy (showing value) require data and analytics

For me, the most stereotypical way for a academic library to show value would be to run correlations showing high usage of library services would be highly correlated with good grades GPA. 

But that's not the only way. 

ACRL has led the way with reports like Value of academic libraries report , projects like Assessment in Action (AIA): Academic Libraries and Student Success to help librarians on the road towards showing value.

But as noted in the Assessment in Action: Academic Libraries and Student Success report, a lot of the value in such projects comes from the experience of collaborating with units outside the library.

Academic libraries that do such studies in isolation are likely to experience less success.

Trend 4 : Rising interest in learning analytics

A library focus on analytics also ties in nicely as universities themselves are starting to focus on learning analytics (with UK supported by JISC probably in the lead).

A lot of current learning analytics field focus on the LMS  (Learning management systems) data, as vendors such as  Blackboard, Desire2Learn, Moodle provide learning analytics modules that can be used.

But as libraries are themselves a rich store of data on learning (the move towards Reading list management software like Leganto, Talis Aspire and Rebus:List help too), many libraries like Nottingham Trent University find themselves involved in learning analytics approaches. 

So for example  Nottingham Trent University , provides all students with a engagement dashboard allowing them to benchmark themselves against others . Sources used to make up the engagement score include access of learning management systems, use of library and university buildings. 

Trend 5 : Increasing academic focus on managing research data provides synergy 

From the academic library side , we increasingly focus on the challenges of collecting, curating , managing and storing research data. There are rising fields like GIS, Digital Humanties that put the spotlight on data. We no longer focus not just on open access for articles, but on open data if not open science. 

While library analytics is a separate job from librarians who are involved in research data management , there is synergy to be had between the two job functions as both deal with data. Both jobs requires skills in  handling of large data sets, protection of sensitive data,  data visualization etc.

For example the person doing library analytics can act as a client for the research data management librarian to practice on when producing reports and research papers. In return, the later can gain experience handling relatively large datasets by doing analytics projects.

But what does library analytics entail?  Here are some common types of activities that might fall into that umbrella.

Assisting with operational aspects of decision making. 

Traditionally a large part of this involves collection development and evaluation.

In many institutions like mine it involves using alma analytics,Ezproxy logs, Google analytics, Gate counts and other systems that track user behavior etc.

This in many ways isn't anything new, though these days there are typically more of such systems to use and products are starting to compete on the quality of analytics available.

This type of activity can be opportunistic, ad hoc and in some libraries siloed within individual library areas.

Implementation and operational aspects of library dashboard projects 

A increasing hot trend, many libraries are starting to pull all their data together from diverse systems into one central dashboard using systems like Qlikview, Tableau, or free javascript libraries like D3.js

Typically such dashboard can be setup for public view or more commonly for internal users (usually within-library, ideally institution wide) but the main characteristic is that they go beyond showing data from one library system or function (so for example a Alma dashboard or a Google Analytics dashboard doesn't quite qualify as a library dashboard the way I defined it here).

Remember I mentioned above that library systems are becoming more "open" with APIs? This helps to keep dashboards up-to date without much manual work.

I'm aware of many academic libraries in Singapore and internationally creating library dashboards using commercial or opensource systems like Tableau, Qlikview etc but they tend to be private.

Here are my google sheet list of public ones.

Setting up the dashboard is relatively straightforward technically speaking, more important is sustaining it. What data should we present? How should we visualize the data? Is the data presented useful to decision makers? How can we tell? At what levels of decision makers are we targeting it at? Should the data be made public?

This type of activity breaks down barriers between library functions though it can still be siloed in the sense that it is just the work of a University Library separate from the rest of the University. 

Implementation or involvement in correlation studies, impact studies for value of libraries.

The idea of showing library impact by doing correlation studies of student success (typically GPA) and library usage seems to be popular these days with pioneers like libraries at the University of Huddersfield (with other UK libraries by JISC)University of WollongongUniversity of Minnesota etc leading the way.

Such studies could be one off studies, in which case arguably the value is much less as compared to a approach like University of Wollongong's Library Cube where a data warehouse is setup to provide dynamic uptodate data that people can use to explore the data.

Predictive analytics/learning analytics

Studies that show impact of library services on student success are well and good, but the next step beyond it I believe is getting involved in predictive analytics or learning analytics which will help people whether it be students, lecturers or librarians use the data to improve their own performance.

I've already mentioned Nottingham Trent University's engagement scores, where students can log into the learning management system to look at how well they do compared to their peers.

The dashboard also is able to tell them things like "Historically 80% of people who scored XYZ in engagement scores get Y results".

This type of analytics I believe is going to be the most impactful of all.

Hierarchy of analytics use in libraries

I propose that the activities I list above are listed in increasing levels of capability and perhaps impact.

It goes from

Level 1 - Any analysis done is library function specific. Typically ad-hoc analytics but there might be dashboard systems created for only one specific area (e.g collection dashboard for Alma or web dashboard for Google analytics)

Level 2 - A centralised library wide dashboard is created covering most functional areas in the library

Level 3 - Library "shows value" runs correlation studies etc

Level 4 - Library ventures into predictive analytics or learning analytics

Many academic libraries are at Level 1 or 2 and a few leaders are at level 3 or even level 4.

Analytics requires deep collaboration 

This way of looking at things I think misses a important element. I believe as you move up the levels, increasingly silos get broken & collaboration increases.

For instance while you can easily do analytics for specific library functions in a silos way (level 1), by building a library dashboard that covers library wide areas would break down the silos between library functions (level 2).

In fact, there are two ways to reach level 2.

Firstly, libraries can go their own way and implement a solution specific to just their library. Even better is if there is a University wide platform that the University is pushing for and the library is just one among various departments implementing dashboards.

The reason why the latter is better is if there is a University wide push for dashboards, the next stage is much easier to achieve because data is already on the University dashboard and University wide there is already familiarity with thinking about and handling of data.

Similarly at level 3, where you show value and run correlation studies and assessment studies you could do it in two ways. You could request for one off access to student data (particularly you need cooperation for many student outcome variables like GPA, though there can be public accessible data like class of degree and Honours' lists) or if there is already a University wide push towards a common dashboard platform, you could connect the data together creating a data warehouse. The later is more desirable of course.

By the time you reach level 4, it would be almost impossible for the library to go it alone.

Obviously I've presented a rosy picture of library analytics. But as always new emerging areas in libraries tend to be at the mercy of the hype cycle. Though conditions seem to be ripe for a focus on library analytics, it's unclear the best way to organize the library to push for it.

Should the library highlight one person who's sole responsibility is analytics? But beware of the Co-ordinator syndrome! Should it be a team? a standing committee? a taskforce? a intergroup? It's unclear.

Monday, November 7, 2016

Learning about proposed changes in copyright law - text data mining exceptions

Recently, a researcher I was talking to remarked to me that University staff can be jumpy around copyright questions and some would immediately duck for cover the moment they heard the word "copyright". I'm not that bad, but as a academic librarian my knowledge of copyright is not as good as I want it to be.

But last month, I attended a great engagement session at my library by  Intellectual Property Office of Singapore (IPOS) and Ministry of Law where the speakers gave a great talk on copyright in Singapore and addressed some of these proposed changes. They managed to concisely summarize the copyright law in Singapore, the current situation  (the irony of how the copyright law in Singapore pretty much copied the Australia one which itself is based on UK was not lost on the speaker) and the rationale for change.

Given that understanding basic copyright is going to be increasingly one of the fundamental skill sets needed by academic librarians, I benefited a great deal from attending.

There were many interesting and beneficial proposed changes for the education section but I was most captivated by proposed changes in the copyright laws with respect to Text data mining in Singapore designed to support the smart nation initiative in Singapore. 

This proposed change I believe is very similar to the one already in the UK , except that covers only non-commercial use. EU is also I believe mulling over a similar law.

Like in the UK law, I believe the proposed change will also disallow restriction of text data mining via contract.

Why is this proposed change important?

One of the most common issues we face today is the fact that increasingly many researchers are starting to do text data mining on content in our subscribed databases, they could be doing it in newspaper databases (e.g. Factiva) or journals (e.g. Sciencedirect) or other resources.

Many researchers I find aren't quite aware that for most part when the library signs an agreement for access, such rights exclude TDM (or do not state TDM as a allowed use).

Most databases we subscribe to also have a system to detect "mass downloads" and as such any TDM eis most likely going to be detected (though I believe some researchers may try to bypass this by scripting human-like behavior).

Businesses are never one to forgo a revenue opportunity and many databases require we pay an additional known expensive fee on top to allow TDM.

Others have a more "come talk to me and we will see" style policy and the rare few enlightened ones like JSTOR actually allow it up to certain limits. Many academic libraries have created guides like this and this  to try to keep track of things.

As text data mining can be more easily done via API through than scraping data, another approach is to offer a guide of the APIs that can be used. One example is MIT's libguide

The proposed law would have two effects. Firstly, the status of researcher's doing data mining of the open web was always hazy. In theory if you mine say reviews on blogger say and use it for your research, I understand content owners of the blog could possibly sue you for copyright infringement. The proposed changes clarify this and allow TDM of such data  (but not merely aggregation) of such data.

More interestingly for data that researchers have legitimate access to aka subscribed databases, there is no longer any distinction between reading an article and doing text data mining. And such a right cannot be excluded by contract by the vendors.
The data/position paper set out by the ministry of law/ipos here is a great read, and it points out that if such a change comes into effect, it is likely vendors who already charge for TDM will "price in" the cost of  TDM because they can no longer exclude these rights.

Will the exception disadvantage libraries that don't have users that won't do TDM?

There was an interesting Q&A afterwards mostly centering around the TDM exemption.

One of the more obvious points made was, is it necessarily desirable to put in these exemptions when it will lead to vendors "pricing-in" TDM rights for database packages automatically? While the bigger Universities and institutions would probably have staff that would do TDM, the smaller institutions would be unfairly affected resulting in higher prices for no benefit. Why not allow each institution to negotiate with vendors and allow exclusion of TDM depending on each institution's need?

I am sympathetic to this view point.

But my current gut feel is that overall this will be beneficial.

Let me try out this line of argument.

Libraries tend to be in a far weaker negotiation positions than the vendors (due to the fact that a lot of vendor material is unique) and what often happens is that under current law many libraries will simply play it safe, pay only for basic read access but not TDM because it's very hard to predict who will want to TDM even for big Universities. Some librarians will even refuse out of principle to pay for TDM.

So vendors will not be sure at first how much they are losing by not charging for TDM as whatever they getting now is probably less than true demand. 

The proposed changes package everything into one, and it turns the game into a game of chicken. While the vendor might want to price things as high as possible and to even recapture all the possible TDM revenue but there is a need to compromise (anchored around current prices that exclude TDM) or they will end up earning nothing.

That should put a cap on too exorbitant price increases at least initially (though in future periods they might be able to properly estimate the real TDM demand and price accordingly). I suspect the net effect is while prices will go up ,overall a lot more TDM will occur and if the intent is to encourage TDM that is a win and TDM generates sufficient benefits it will be a win.

But this is a wild guess.

I'm also wondering once the law forbids vendors from preventing TDM once libraries have paid for lawful access to the database, can they say "Okay, you can now do TDM but only via method A (probably API) and not via scraping or trying a script to do automatic download via the usual human facing interface?". This seems to suggest No.

It would be great if we could learn from the UK experience and I started asking around my usual international network of librarians but came up empty.

One librarian pointed out to me that even though the law was passed in 2014, given subscriptions cycles of 1 year or more, and research lag time, any such research probably is still in the works!

Still I ask readers of my blog, if you work in UK as a academic librarian what was your experience like? Did you find prices of databases that are most often targets of data mining start to rise even faster? Did the sales people reference the change in law as a reason? If you are a researcher in UK who has done TDM under this law, what was your experience like?

Even anecdotes would be nice. You can comment below or send me emails privately if you like and I will preserve any anonymity.

What law are the contracts signed under?

Another point that was brought up that was more damaging was that when libraries sign contracts with database vendors which jurisdiction of law will the contract be under? If the contract is to be under US law (fairly common?), the changes in the copyright act would have no sway over the breach of contract, effectively making it toothless. 

I'm not a lawyer so I do not know what will happen if a library was sued for breach of contract overseas outside Singapore and awarded damages.  

Other comments and questions

The Q&A was a good exchange of opinions and views between both the speakers and the audience (made up of faculty and librarians). Topics covered included open access (Gold open access is usually frowned upon by librarians in Asia which I think is quite different compared to the west), copyright for MOOs and more. 

One interesting point made by the speaker was that he was a bit surprised to see while there was organization on the author /creator side with organizations like The Copyright Licensing and Administration Society of Singapore Limited (CLASS),  Composers and Authors Society of Singapore (COMPASS)  representing the author rights, there wasn't such a group on the user side.

He suggested perhaps the Universities in Singapore band together to negotiate collectively on some agreed core content? Is this what we call a library consortium?

Then again Singapore is a really small market, so who knows perhaps the law would make little difference and vendors might just let it go?  

Saturday, October 29, 2016

5 thoughts on open access, Institutional and Subject repositories

Despite writing a bit more on open access and repositories in the last few years, I find the issues incredibly deep and nuanced and I am always thinking and learning about them. As this is open access week, here are 5 new thoughts that occurred to me recently.

They probably seem obvious to many open access specialists but I set them out here anyway in case they are not obvious to others.

1.   There are multiple goals for institutional repositories and supporting open access by accumulating full text of published output is just one goal. 

I suspect like many librarians, I first heard of institutional repositories in the context of open access. In particular, we were told to aim to support Green OA by getting copies of published output by faculty (final published version if possible, if not postprint or preprint). But in fact, looking back at the beginning of IRs and Open Access things were not so straight forward.

There seem to have been two seminal papers released at the beginning of the history of IRs. First there was Crow's The Case for Institutional Repositories: A SPARC Position Paper in 2002 and Lynch's Institutional Repositories: Essential Infrastructure for Scholarship in the Digital Age in 2003.  (See also the great talk "Dialectic: The Aims of Institutional Repositories" for a breakdown)

Between them, several goals were identified. Two of which were

a) “to serve as tangible indicators of a university's quality and to demonstrate the scientific, societal, and economic relevance of its research activities, thus Increasing the institution's visibility, status, and public value” (Crow 2002)

b)"Nurture new forms of scholar communication beyond traditional publishing (e.g ETD,  grey literature, data archiving" – (Clifford 2003)

All these goals are not mutually exclusive with the mission of supporting open access by accumulating published scholarly output but they are not necessarily complementary either.

For example, one can showcase the university output by merely depositing metadata without free full text, something that is occurring in many Institutional Repositories today that are filled with metadata of the scholarly output of their researchers with precious little full text.

Similarly, systems like Converis, or Pure or systems like Vivo that showcase institutional and reseaarch expertise do not necessarily need to support open access.

It also seems that at the time Clifford envisioned an alternative route for IRs to focus on collecting non-traditional scholarly outputs which includes grey literature instead of collecting published scholarly output. Following that vision, today most University IRs collect Electronic thesis and dissertations at the very least, others collect learning objects, Open Education resources and many are beginning to collect datasets.

2. Self archiving can differ in terms of timing , purpose and there are multiple views on how high rates of self archiving will eventually impact the scholar communication system

Even if you agree the goal of IRs is to collected deposits of published scholarly output there are still more nuances to why you are doing so and what your ultimate aims are.

At what stage is the papers deposited?

As a librarian with little disciplinary connections, I never gave much thought to subject repositories and focused more on institutional ones.

Reading Richard Poynder's somewhat disputed recounting of history of what was to be the first OAI conference at Santa Fe, New Mexico in 2009, I finally realized that while subject repositories and institutional repositories both could collect preprints/postprints the two were very different in terms of timing of deposit and reason for deposit.

Most researchers who submit to subject repositories do so primarily with the goal of getting feedback and this also leads up to the speeding up of scientific communication. While many papers in subject repositories are deposited and immediately submitted to journals for consideration, many are put up in more raw form and are replaced by new versions many times before finally being submitted for publication and many that don't end up been submitted in any journal at all, hence making the term "preprint server" a bit leading. All this is discipline specific of course.

Contrast this with IRs, where rarely researchers put up copies of their papers in IRs until the paper is accepted for publication or more likely already published. The goal here is to provide access for the scholarly poor of published or near published scholarly output and the carrot for researchers is citation advantage of open access papers.

However as the papers in the IR are placed much later in the research cycle, they generally are already in finalised form and nothing much happens to them.

As Dorothea Salo's memorable paper Innkeeper at the Roach Motel states “[The institutional repository] is like a roach motel. Data goes in, but it doesn’t come out.”  This line might also refer to point #4 below....

I am told that there really isn't any obstacle functionally for IRs to accept preprints (in the sense of papers that are going through peer review but haven't been accepted yet or haven't even yet been submitted for consideration for publication), but in actual fact this seldom occurs (though I'm sure there are examples perhaps with say CRIS systems).

Two views of Green Open access

The motivation and final end game for self archiving in IRs also differ among people.

Even if one agrees IRs should only collect post prints (or the final published version if allowed) and the main aim is to provide access to published scholarly material, but what is the ultimate goal or vision here?

Some would envision , green open access working thriving alongside the traditional publishing system today and for all time. In this view, green open access is not a threat to traditional publishing, and that a status quo would result, where there is both green open access self archiving in IRs and libraries continue to subscribe to journals as usual and they point to the effect (or lack of) of high rates of self archiving for high energy physics on subscriptions in that area.

Another view doesn't see self archiving just for the sake of access, they actually aim to eventually disrupt the current scholarly system. They believe that when "universal green OA" is achieved , then we can leverage a favorable transition (in terms of costs/prices) to Gold open access (because there is an alternative to getting the final published version in the post-print version). Without achieving universal green OA, flipping to Gold OA leads to "fool's Gold" and even if open access is achieved it is of very high cost.

This is of course the Steven Harnard view. It's usually paired with the idea of a immediate deposit/optional access mandate, where all researchers will need to deposit their paper at the moment of acceptance. In response to critics that publishers will not sit back and allow Green OA to prevail if it really catches on and they will start imposing embargos, Harnard suggests countering that with a "Request a copy" button on such embargoed item.

I'm not qualified to assess the merits of these arguments but it does seem to me that these two camps are essentially in conflict, as one camp is telling publishers that are in no threat to green open access and there is no likely disruption in the future and the Harnard camp which is trumpeting loudly what they intend to do once Green OA becomes dominant.

Some have suggested supporting Green OA is hypocritical (if for example one tells publishers that they are under no threat, yet secretly hopes for a Harnard disruption eventually), and yet others claim Green OA is flawed and will never succeed because essentially it is "parasitic" on the existing system and survives only because it relys on the current traditional publishing systems

A more radical form of Green open access (if it is considered one)

There is a even more radical purpose to collecting papers in repositories. If you read  Crow's The Case for Institutional Repositories: A SPARC Position Paper, he actually suggests a far more radical idea then just collecting post-prints that have been published by publishers and be happy with the status quo, or even the Harnard idea of flipping to Gold OA on favourable terms eventually,

The future he suggests actually involves competing with traditional publishers. In such a model, researchers would submit papers into IRs, reviewers as per usual would review them, but the key thing is that everything would be done through the repository, and universities, researchers could "take back" the scholarly publication system from traditional publishers.

This sounds a lot like the overlay journals we see done with arxiv. For a institutional repository version, we have the journals on Digital Commons system.

3. Much of the disadvantages in local institutional repositories vs more centralised subject repositories or academic social networks like ResearchGate hinges on the lack of network effects due to poor interoperability

In Are institutional repositories a dead end? , one way to summaries many of the strengths of centralised subject repositories vs institutional repositories is that "size matters".

As I noted in a talk recently, academic social networks like ResearchGate are not new, and there were a flood of them in 2007-2009, including now defunct attempts by Elsevier and Nature.

Yet it is only in recent years it seems ResearchGate and seem to become dominant.

The major reason why this is happening only in the last 2 years or so, is that the field of competition as now narrowed to two major systems left standing ResearchGate and (if you count Mendeley that's a third) and network effects are starting to dominate.

While it is true that if you consider the "denominator" of subject repositories (all scholarly output from a specific subject) or of say ResearchGate (all scholarly output?), they aren't necessarily doing better than institutional repositories (all scholarly output of that institution), in absolute terms the material centralised repositories have dwarfs that of most individual Institutional repositories.

As more papers appear in ResearchGate or a subject repositories network effects kick in. More people will visit the site to search, if there are any social aspects and functionality (which ResearchGate has a ton of) they will start becoming even more useful, and even statistics become more useful.

How so? Put your paper in a IR like Dspace, and even if you have the most innovative developer working on it, with the most interesting statistics, you still are limited to benchmarking your papers against the pitiful number of papers (by standards of centralised repositories) in your isolated institutional repositories.

Put it on SSRN, or ResearchGate and you can compare yourself easily with tons more researchers, papers or institutions.

Above shows ranking of university departments in the field of Acccounting.

In this way, the hosted network of repositories on Bepress Digital commons actually seems the way to go compared to isolated Dspace repositories because one can actually do the same types of comparison on the Digital Common Network that aggregates all the data across various repositories using Digital Commons.

So my institution is currently on Bepress Digital commons and faculty put their papers on it.

So in the above example, I can see how well Faculty from the School of Accountancy here are doing versus various peers in the same field who also put their papers on their IR. Happily I can report, the dean of the accountancy school here is one of September's most popular authors in terms of downloads.

4. interoperability among repositories is the only way to make network effects matter less

My merger understanding of OAI-PMH was that it was indeed designed to ensure all repositories could work together . The ideas was that individual repositories could host papers but others could build services that sat on top of them all and harvest and aggregate all the output into one service.

I know it's fashionable to bash OAI-PMH these days  and I would not like to jump on the band wagon.

Still it strikes me that a protocol that works only on metadata was on hindsight a mistake. Perhaps it was understandable to assume that all records in IRs would have full text as the model back then was arxiv which was full text. But as mentioned above, there were in fact multiple goals and objectives for IRs, and many became filled with metadata only records due to this.

This made it really painful for aggregators to work when they tried to pull all the records together from various IRs using OAI-PMH as they couldn't tell for sure whether there was full text or not. This is the main reason why systems like BASE can't 100% tell for sure a record they harvested has full text (I understand there can be rough algorithmic methods to try to guess if there is full text attached), and it's also the same reason why many libraries running web scale discovery service can't tell if a record they have from their own IR has full text or not. (Also they don't turn on in their discovery index other IRs that are available in the index for the same reason).

In truth making repositories work together involves a host of issues, from having standardized metadata (including subject, content type etc) so aggregators like BASE or CORE and offer better searching, browsing and slicing features, ensuring that full text can easily "flow" from one repository to another or ensuring usage statistics are standardized (or can be combined?).

In fact, there are protocols like  OAI-ORE and SWORD (Simple Web-service Offering Repository Deposit) that try to solve some of these problems. For example SWORD allow one to deposit to multiple repositories at the same time etc and do a repository to repository deposit, but I am unsure how well supported they are in practice.

Fortunately this is indeed what COAR (Confederation of open access repositories) is working on, and they have several working groups working on these issues now.

If individual repositories are to thrive, these issues need to be solved, allowing easy flow and aggregation of metadata, full text and perhaps usage statistics, allowing them to counter the network and size effects of big centralised repositories.

5. There seems to be a move towards integration among the full research cycle and or into author workflows. 

The pitch we have always made is this to researchers, give us your full text, we will put it online and you will gain the benefits (e.g more visibility, the satisfaction of knowing you are helping science progress, or that you are pushing back against commercial publishers), but sadly that doesn't seem to be enough for most to motivate them.

So what can we do?

Integration with University Research management systems from and to repositories

Firstly, we can tell them we are going to reuse all the data they are already giving us. Among other things, we can use their data to populate cv/resume systems like Vivo. Since all the data is already there we can use it for performance assessment at the individual, department and university levels by combining the data with citation metrics.

We can make it easier on the other end too. Instead of getting researchers to enter metadata manually, we can pull them into our systems using Scopus, Web of Science, ORCID or other systems that allow us to pull in researchers by institution.

What I describe above is indeed the idea of a class of software currently known as CRIS (Current research information systems) or RIMS (Research Information Management system). It is basically a faculty/research management workflow that can track the whole life cycle of research system, often including things typically done by other systems such as grants management and integrates with other institution systems like HR or Finance systems.

The three main systems out there are Pure, Converis and Symplectic elements. The point to notice is that these systems are not mainly about supporting open access, but it can be one of their functions.

For example while Converis's publication module accepts publication full text, this full text isn't necessarily available online publicly if you do not get the Research portal module (this isn't mandatory). In the case of Symplectic, I understand it doesn't even have a public facing component but there are integrations with IRs like Dspace available.

But we can have more integrations than this.

Integration with  Publisher systems to repositories

How about considering a integration between a publisher and a IR system? Sounds impossible?

The University of Florida has a tie up with Elsevier where using the Sciencedirect API, metadata from Sciencedirect will automatically populate their IR with articles from their institution. Unfortunately the links on the IR will point to articles on the Sciencedirect plaform. While a few will be open access , most will not be so.

I can hear librarians and open access activists screaming already, this isn't open access. What is interesting is, there is a phase II / pilot project listed where the aim is for Elsevier to provide "author accepted manuscripts" to IRs!

If you have ever tried to get a researcher to find the right version of the paper for depositing into IRs, you know how much of a game changer this will be.

Logically it makes so much sense, the publishers have the postprints already in their publication/manuscript submission systems, so why not give it to IRs? Well the obvious reason is we don't believe publishers would want to do that as it's not in their best interest? Yet ...........

Integration with  Publisher systems from repositories

Besides an integration from post-print to IR, the logical counterpart to that would be an integration from pre-print to publisher submission systems and where pre-prints are sitting is often in Subject repositories.

Indeed this is happening as PLOS as announced a link with their submission system and Bioarxiv.

In the same vein, the earlier mentioned overlay journals, can be said to be having the same idea.

Integration with reference managers?

What other types of integration could occur? Another obvious one would be from Reference managers.

Elsevier happens to own Mendeley, so a obvious route would be people collaborating via Mendeley groups and with a click push it to journal submission system.

Proquest which now owns a pretty big part of the author work flow including various discovery services, reference managers like Flow and refworks could do something similar, for example I remember some vague talk about interacting say Flow their reference manager with depositing thesis into ETD say.

Will a day come where I can push my paper from my reference manager or preprint server to the journal submission system and when accepted the post-print seamlessly goes into my IR of choice and in my IR the data further furthers into other systems for populating my cv profile and/or expert system?

I doubt it, but I can dream.

A 100% friction-less world?


This post has been a whirlwind of different ideas and thoughts, reflecting my still evolving thoughts on open access and repositories. I welcome any discussions, corrections of any misconceptions or errors in my post.

Share this!

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Related Posts Plugin for WordPress, Blogger...