Tuesday, December 30, 2008

Google's Gift of Books

As part of the settlement between the Association of American Publishers (AAP) and Google, each public library in the U.S. can get one free subscription to the Public Access Service for Google Books. (That's defined as one "terminal" per library building.)

I'm sure that many folks are quite impressed at this generosity: free access to the public! What's not to like? Well, keep reading.

Nothing's Really Free
Some of you may remember the late 1990's when Microsoft donated computers (running Microsoft software) and modems to public libraries, making it possible for them to offer free Internet access to the public. This was a great boon for the libraries, but there were numerous hidden costs. First, the libraries had to scramble to find space for the workstation, and finding "extra" space in a library is enough to make one hate the law of physics that precludes two objects occupying the same place at the same time. Then they had to get phone line access to the place where the computer would sit, and this had to be a dedicated line because it would be in use most of the hours that the library was open. The librarians had to learn about the Internet so they could help the public, which was especially difficult because the same computer that served the public was the only one that the librarians could learn on. As the Internet access became more popular, the libraries had to manage the demand for the service, setting up ways for patrons to sign up for time on the computer and mediating disagreements about whose time it was. In libraries where often the staff didn't have printers attached to their own work computers, they also had to find a way to manage the fact that users who didn't have a computer at home needed a way to take away what they found online.

All of these were costs for the libraries. They may seem like minor costs, but if you're thinking that then you're probably not working in a public library. I often say that public libraries are like old-age pensioners: they're on a fixed income that doesn't keep up with inflation, much less the demand for more services. (And my impression is that they've already been living on dog food for a number of years now.)

Some costs were not so minor, however. For example, I discovered that the small branch of my public library nearest my home was paying for the phone line that this "free" internet access used. The problem was that library phone lines were considered business lines and they were being charged per-minute rates. This library was paying $2000 a month or more for the use of the phone line attached to its one public Internet workstation. That's more each month than Microsoft paid initially for the equipment it provided to the library. Yet Microsoft was considered "generous," while the story in the press ignored the costs to the library; costs we taxpayers were all bearing.

Nothing in this should be construed to demean the gift from Microsoft or the value of adding public Internet access to libraries. The story here is that free has costs, and those costs could be considerable. The story is also that some of those costs, perhaps many of those costs, get passed on to the public, even though the public doesn't have a say in the choice to support this service.

The First One is Always Free

That same small library that started with one of the Microsoft computers now has something like six public Internet access workstations that are rarely sitting unused. That initial gift led to the development of what today is an essential public library service. It is often the case, however, that in cash-strapped times libraries have to make trade-offs, dropping old services (like magazine subscriptions) to pay for new ones (like Internet access). Should the single access to the Google Books Public Access Service not suffice, libraries will need to add more subscriptions to meet the demand. It isn't known what this will cost, but unless it is ridiculously cheap, it eats into the already strained budgets of the libraries. Eventually, the cost will be absorbed into the budget as part of normal expenses, but there will be a painful phase at the beginning. Before they introduce this free service, libraries need to know what the costs will be in 2, 3 or possibly 5 years so they can begin the budget planning process that will allow them to provide full service to their users, if that's what they wish to do.

Just Say No

If taking advantage of the Google Books Public Access Service is going to strain library budgets, why don't the libraries just say no? They aren't being forced to accept the free service, after all. This creates a real dilemma for public libraries, the same dilemma that was created by the initial free Internet access: the mission of public libraries is to level the information access playing field for everyone. To do so, public libraries need to keep up with new information resources and services as they become available; to purchase or license these; and to give equal access to all. Generally, public libraries lag behind their richer cousins, the academic and research libraries, in providing information services. Academic libraries had access to the current crop of online versions of abstracting and information services about a decade before public libraries began to provide these to their users. But if public libraries don't provide these services as they become affordable, we end up with a two-tiered world of information haves, those with a connection to an academic institution, and information have-nots, the remainder of the public.

Equal Access for All

One option that libraries must consider when new services arise that are outside of their budget capabilities is whether they will choose to provide the service with a user fee attached. I remember this in academic libraries when the first article indexing services were available through Dialog in the 1970's. These services were quite expensive (they billed by the minute, if not the second, as I recall). Libraries tried budgeting a set amount to provide the service to their users, but the appeal of the "free" service was such that the entire year's budget was exhausted within months. For the remainder of the budget year, users had to foot the costs for the searches. Academic institutions can decide to give some users (professors, researchers) services that are not available to others (undergraduate students). They also can decide to charge fees, looking on this as part of the cost of attending the institution and making use of its facilities.

The public library mission of equal access to all, however, argues against requiring fees for services, other than those nominal fees designed to prevent squandering of resources (e.g. 25 cents for each book put on hold), or cost recovery for consumable materials, like photocopy services. But generally speaking, once a user has entered the library, it's an "all you can eat" situation. This is not the nature of Google's online book service. The settlement agreement is incredibly complex in terms of what is free and what is pay-for. For certain works, a certain number of pages can be viewed for free, after which one must purchase the book to see the rest. The number of pages that can be printed may be limited, and there may be charges for printing.

We do know that public libraries will not be able to offer remote access to their free subscription, only on-site access. That, of course, excludes many users. We also know that there may be advertising included in the service, and it may include the ability to purchase books (online or in hard copy) and additional services. In other words, the library's users become the service's customers.

Product Placement

When Microsoft began giving away software to libraries (actually, making them pay a pittance for the licenses), an article in Salon stated:
In the case of computer companies, giving away free product is a way to increase market share, influence future purchases, create good will at relatively low cost, and get a tax write-off for your efforts.
While possibly cynical, it's also true. Giving away samples of your product is a time-worn approach to building a customer base.

Charity is giving people what they need, not what you want them to have or what you would like them to buy in the future. While the provision of a free, one-user license to libraries may be generous, it is not charitable. It should be viewed in the same way that free samples of cereal are. Actually, the better analogy harks back to the days when cigarette companies gave away free packs of cigarettes on city streets, hoping to encourage non-smokers to become smokers. It is best to look on the free access to Google Books as part of an advertising campaign; it is definitely not Google and the AAP following in the footsteps of Carnegie. It's as if Carnegie had given each city enough steel (his product) to build part of a bridge.

Did Anyone Ask Public Libraries Before Deciding This?

One of the great difficulties that we have in understanding the Google/AAP settlement is that none of the participants can reveal the nature of the negotiations; they are all bound by a non-disclosure agreement. So we don't know who represented the libraries nor what they asked for. We don't know if the Google Public Access Service was offered by Google or demanded by library participants. We don't even know who the library participants were. A logical assumption would be that the library representatives in the discussions were limited to representatives of the current Google library partners. If that is the case, then they are all representatives of research and academic libraries. We don't know if any of them surveyed public libraries, even informally, about the desirability of this service, or about the burdens it might place on those libraries as it has been formulated. Could there have been a different deal that was better for public libraries and equally acceptable to the major players?
There is very little in the settlement that would allow one to imagine the precise nature of this service and how this service will be implemented and managed.

Public librarians I have talked to are very concerned about this matter. There is still plenty of time to work out details, but is there a plan to engage a representative group of public libraries to do the planning? What happens if the service, as envisioned in the negotiations, doesn't meet the needs of public libraries, or doesn't fit in with their current online systems? Are there different needs and capabilities in large urban public libraries and small rural ones? Will it be possible to serve these equally?

Where is the Public's Voice?

At the negotiations there were lawyers representing the AAP, lawyers representing Google, and lawyers and librarians representing libraries. But the public had no lawyer at that table, no representative. While each of the parties could have desires to serve the public, they each had a primary self-interest that they were there to serve. Without public representation, it is not possible to say that the public's interest has been served. Without public representation, the public's interest has not even been solicited, much less heard. Yet, this settlement has a great effect on the public and its relationship with a major public resource, the collective wisdom contained in hundreds of years of publication of text on paper. I will address this in a forthcoming post.

Saturday, December 27, 2008

FRBR and Group 2 & 3 Oddities

You've probably realized by now that I cycle back to FRBR frequently, each time discovering something new. New to me, at least. Perhaps because of not being a cataloger it seems that I have missed some key concepts in earlier readings. This might help explain some misunderstandings between me and more catalog-savvy folks.

This time I was thinking about the way that the entities are used with the subject relationship. But before I get to that, there's always the publisher to torment me.

Creators and Publishers in FRBR and RDA

The Group 2 entities have what are called "responsibility relationships" with the Group 1 entities. The diagram (Figure 3.2, p. 14) shows the two Group 2 (G2) entities, person and corporate body, to be related to the Group 1 entities in the following way:
Work is created by... G2
Expression is realized by ... G2
Manifestation is produced by ... G2
Item is owned by ... G2
(Note that I find it odd that FRBR limits the Group 1 to Group 2 relationships to only four, and only one per Group 1 entity, but that is how it is written. It makes me wonder what one does with, say, an illustrator of a particular expression of a book. Surely the addition of illustrations doesn't make it a new work?)
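The four relationships listed above can be sketched as a small data model. This is only an illustration: the class names, field names, and the `relate` helper are hypothetical and do not come from FRBR; only the four relationship labels are taken from Figure 3.2.

```python
from dataclasses import dataclass

# The four Group 1 entity types and the single responsibility
# relationship that FRBR defines for each (Figure 3.2).
RESPONSIBILITY = {
    "work": "created by",
    "expression": "realized by",
    "manifestation": "produced by",
    "item": "owned by",
}


@dataclass(frozen=True)
class Group2Entity:
    kind: str  # "person" or "corporate body"
    name: str


@dataclass(frozen=True)
class Group1Entity:
    kind: str  # one of the keys in RESPONSIBILITY
    label: str


def relate(g1: Group1Entity, g2: Group2Entity) -> str:
    """Describe the one relationship available between these two entities."""
    return f"{g1.label} ({g1.kind}) is {RESPONSIBILITY[g1.kind]} {g2.name}"
```

Because the lookup is keyed only on the Group 1 entity type, an illustrator attached to an expression can be recorded in this sketch only as "realized by"; there is no slot for a more specific role, which is the limitation noted above.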

In section 4 of FRBR, the Group 2 entities are not included in the lists of attributes of the Group 1 entities. In other words, when you read the list of attributes of a work, there is no mention of creator, and the list of attributes of an item does not include owner.

I was therefore surprised to find among the attributes of a manifestation:
4.4.5 Publisher/Distributor
The publisher/distributor of the manifestation is the individual, group, or organization named in the manifestation as being responsible for the publication, distribution, issuing, or release of the manifestation. A manifestation may be associated with one or more publishers or distributors.
Since Group 2 entities are not listed as attributes in the Group 1 attribute lists, this pretty clearly states that publisher is not a person or corporate body entity.
Yet, the section on relationships between Group 1 and Group 2 entities says:
5.2.2 Relationships to Persons and Corporate Bodies
The entities in the second group (person and corporate body) are linked to the first group by four relationship types: the “created by” relationship that links both person and corporate body to work; the “realized by” relationship that links the same two entities to expression; the “produced by” relationship that links them to manifestation; and the “owned by” relationship that links them to item.
Essentially, this apparent inconsistency between the definitions of the entities and the attribute list for the manifestation has to do with the practice of transcribing data from the manifestation:
At first glance certain of the attributes defined in the model may appear to duplicate objects of interest that have been separately defined in the model as entities and linked to the entity in question through relationships. For example, the manifestation attribute “statement of responsibility” may appear to parallel the entities person and corporate body and the “responsibility” relationships that link those entities with the work and/or expression embodied in the manifestation. However, the attribute defined as “statement of responsibility” pertains directly to the labeling information appearing in the manifestation itself, as distinct from the relationship between the work contained in the manifestation and the person and/or corporate body responsible for the creation or realization of the work. (Section 4.1)
What this points out is that while FRBR supposedly puts forth an entity-relation model, in fact it is no more ER than our current bibliographic model with its mixture of transcribed data, cataloger supplied data, and controlled headings.

Then Comes Group 3

This is easier to explain, because it is very simple: The Group 3 entities (concept, object, event, place) can ONLY be used as subjects, e.g.:
For the purposes of this study places are treated as entities only to the extent that they are the subject of a work (e.g., the subject of a map or atlas, or of a travel guide, etc.). (section 3.2.10)
This eliminates any thought of using place as in "place of publication." Not to mention that each of these has a very limited attribute list; in fact, they each have exactly one attribute:
term for the concept/object/event/place

The Upshot

The upshot is that FRBR does not give us a true entity-relation model for our bibliographic data. This is frustrating for those of us trying to move library data in an ER direction, and it means that to achieve the ER model we will have to go beyond what exists today in FRBR, and beyond the version of FRBR that has been realized in RDA. I've kind of known this, but it's discouraging to have it confirmed in the FRBR document itself. Even more frustrating that it's been there the whole time and I missed it.

I've looked again at FRBR in RDF and the Scholarly Works Application Profile, and both make some interesting extensions to the FRBR concepts, taking them further along the ER road. It seems to me that the DC/RDA work will need also to deviate from FRBR in order to achieve its goals. The big question is: how far can we go and still be compatible with library data?

Tuesday, December 23, 2008

Monday, December 22, 2008

Google Replies on OCA Blog

The Open Content Alliance blog has a post on the Google/AAP agreement with a lengthy reply from Dan Clancy of Google Books, and my reply to Dan's reply.

LC forces take-down of lcsh.info

I am beside myself with fury. I hardly know where to begin. Not long ago, Ed Summers took the LCSH authority file and created an online site with the LC Subject Heading authority file re-formatted as a SKOS vocabulary. For the first time, Web services could link directly to LC subjects as represented in the authority file. And some did.

But the Library of Congress, our Federal, if not National, library, has required Ed to take down the site. A site that contained nothing more than LCSH in a usable form. Data that SHOULD be in the public domain, for anyone to use as they wish. This is an assault against libraries everywhere, an act of censorship.

You can read Ed's statement on lcsh.info.

I would very much like to hear LoC's statement about this. They should not be allowed to control the use of this data, data that belongs to all of us.

Ed couldn't refuse the Library's demand, but anyone who isn't an employee of LoC should have greater freedom. Let's gather around and find a new home for LCSH, one that can't be removed from the public.

Thursday, December 04, 2008

Google and Fair Use

There's some background to the Google/AAP settlement that I believe is key to understanding the subtext around it. This won't be news to most folks, but I thought it would be good to re-articulate it in the context of the settlement, lest we forget.

Google's first business is that of indexing resources that are on the web. I'll talk about them as if they were all texts because it's easier, but the same thing could be said for images and other resources.

To do the indexing, Google must make a copy of the web page or document. Using this copy, it adds the page to its search engine. As a good citizen, Google pays attention to the robots.txt file, and does not index pages where the site owner has opted out of being included in search engines.
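For readers who haven't seen the opt-out in action, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before copying a page. It uses Python's standard `urllib.robotparser` module; the robots.txt content, the `ExampleBot` user agent, and the `may_index` helper are hypothetical, and nothing here is meant to describe Google's actual crawler.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: everything under /private/ is off limits
# to all crawlers; the rest of the site may be indexed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_index(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the rules above permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

Note the direction of the default: a page is considered indexable unless the site owner has said otherwise.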

This is all fine and unremarkable until you look at it from the point of view of copyright law. Copyright is specifically about... making copies, and it gives the right to make copies, or to authorize the making of copies, to the copyright holder. That can be the author, or someone to whom the author has passed along the right. Copyright holders must opt in to the making of copies: they have to give permission. The default in copyright law is that copies cannot be made unless the copyright holder gives approval.

So the big question is: Is Google violating copyright law by making copies of web pages without the permission of the copyright holders? There are two main ways of looking at this:
  1. The web is different from the print environment. Anyone who has put their works out on the web has agreed to copying because no one can even view the work without making a copy. If they don't want people copying, they need to hide their works behind a security screen. However, there is no such exception or wording in copyright law that would support this.
  2. The web is not different from the print environment. But Google is just producing an index and there is nothing in copyright law that would prevent someone from producing an index of words in texts. The incidental copies that Google makes in order to produce the index are allowed under the Fair Use aspects of the copyright law.
So then we move on to the Google Books project. Initially, Google claimed that it was doing the same thing with books as it does with the web: making incidental copies in order to create keyword indexes to the texts. In terms of copyright law, argument #1 is pretty much out because these works can be read without making a copy, so the copyright holders haven't agreed to let their works be copied. This leaves us with argument #2: it must be fair use.

In fact, Google did and does make the fair use argument. The libraries that partnered with Google also came to the fair use conclusion in at least some cases. The CIC project FAQ says:

University of Michigan said this in 2007:

Does this project comply with copyright law?

Yes. This project was undertaken with careful attention to the law and to the rights and responsibilities of the various parties involved. The purpose of copyright law is to promote progress in society. We are confident that the Books Library project is fully consistent with the fair use doctrine under U.S. copyright law and the principles underlying copyright law itself. Copyright law strikes a balance between rewarding creators of intellectual property for their creations and facilitating public access to these works in ways that do not create a business harm. For books, this means ensuring authors write books, publishers sell them and libraries lend them. By making books more discoverable, Google is enhancing the ability of authors and publishers to sell books to an audience beyond the traditional book market.

What was at stake with the AAP lawsuit was exactly this decision about Fair Use. If copying the books for the purpose of indexing were determined to not be fair use, then this decision could bleed over into the web. And of course it would mean the end of Google Book Search (which has now become Google Book Store). Although Google has always provided a confident posture to the public, declaring unwaveringly that what it does as a search engine is perfectly within copyright law, the idea of going to court over the issue would have put their entire operation at risk.

Now back to libraries. Fair use is not a list of things you can do but a judgment call relating to some complex factors. Some key factors have to do with whether your use is commercial in nature or could compete with the exploitation of works by the copyright holders. There are, in addition, exceptions in the copyright law relating to research and study, and special exceptions for libraries. In fact, in relation to copyright law, libraries and educational institutions get considerably more latitude in using works than do commercial enterprises. As an example, a teacher can make copies of an article for her students as part of a lesson, and that is generally considered fair use. A company manager who wants his staff to read an article cannot rely on fair use for copying, but must apply to the copyright holder (usually through an intermediary such as CCC) and pay a fee. (See the Texaco case.)

What happened with Google Book Search and the AAP is that the digitization of the libraries' books and subsequent use of those was judged not by the criteria that would be used normally for libraries, of course, but by the criteria that would be used for a commercial entity. That's totally logical, since although Google was partnered with the libraries, the primary use of the materials was to fuel Google Book Search, an obviously for-profit activity.

Libraries have gotten the short end of the stick because their use of their own materials became commercialized through their partnership with Google. If instead libraries had managed to digitize the books on their own, the outcome would likely have been entirely different (if any lawsuit had been brought, which might not have happened). I believe that libraries could be found to have a fair use case for digitizing their works for the purposes of searching, and could be allowed to use those digitized copies for the exceptions spelled out in section 108 of the copyright law (such as providing access to the sight impaired, or for replacement of deteriorated originals). Unfortunately, the concept of digitization of the contents of libraries has now been tainted with the air of commercialization and has earned the wrath of the publishers and authors. The Google/AAP settlement has created a mechanism that ignores the inherent rights of the libraries, but also makes it more difficult for them to justify undertaking their own digitization project.

This is why I disagree heartily when I hear statements like:

We're delighted that this agreement creates new opportunities for libraries and universities to offer their patrons and students access to millions of books beyond their own collections. (from Google)
The settlement might look good from the point of view of a commercial entity facing copyright law, but it binds the non-profit educational and cultural heritage community to legal decisions designed for the for-profit sector. This is not only not a win for libraries, but it will hinder libraries in their efforts to make use of current technologies to further the arts and sciences.

Friday, November 28, 2008

OCLC Use Policy Details: Use and Transparency

An interesting aspect of this policy is that it is entirely about the use of WorldCat records. That may seem obvious from its title, but what I am interpreting from the policy language is that the policy covers all WorldCat records currently in existence, regardless of when they were created or the policy in force at the time that they were first used. Creation or update of records takes place at a particular time, while use is an ongoing activity. I'd like to cover some possible consequences of that.

Agreement to the policy

OCLC has stated that the Policy will go into effect in mid-February. It appears that current Members will be "grandfathered" in under the policy, their continued use of OCLC being their agreement to the terms. The Policy also covers Non-OCLC Members, who will not have made any agreement with OCLC, and I am hard pressed to understand why those organizations would abide by the terms of the Policy. 

Versioning and records already "in play"

Section E.7 says that OCLC can make changes to the policy, and that those changes will apply to use from that point on, essentially what is happening now with this Policy. Although they have agreed to place a version indication in the policy statement field in the WorldCat MARC records, I'm unclear as to what role that version would play. Instead, it seems to me that the policy implies that all WorldCat records will be covered by the current policy, whatever version that is. If this is not the case, then it isn't clear how the new policy can apply to records obtained from WorldCat before the Policy was in force. Yet this is exactly what is implied in the section on adding 996 fields on page 8 of the FAQ:
B. Retrospectively. For records that already exist in your local system, we encourage you to add the 996 field to WorldCat records transferred to others. Should you choose to use it, the field should have an explicit note like the examples below:

MARC:
996 $aOCLCWCRUP $iUse and transfer of this record is governed by the OCLC® Policy for Use and Transfer of WorldCat® Records. $uhttp://purl.org/oclc/wcrup/1.0
"Retrospectively" in this case means for records that were created before OCLC began adding 996 fields, and thus before the Policy goes into effect.
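As a rough sketch of what adding the field retrospectively could look like in a local system: the field text below is copied from the FAQ example, while the list-of-strings record representation and the `add_policy_field` helper are assumptions made purely for illustration. A real system would work on actual MARC records through its own tools.

```python
# Field text copied from the FAQ example quoted above.
WCRUP_996 = (
    "996 $aOCLCWCRUP "
    "$iUse and transfer of this record is governed by the OCLC® Policy "
    "for Use and Transfer of WorldCat® Records. "
    "$uhttp://purl.org/oclc/wcrup/1.0"
)


def add_policy_field(fields: list[str]) -> list[str]:
    """Return a copy of `fields` with the 996 policy note appended once.

    `fields` is a hypothetical representation of a record: one display
    string per MARC field, tag first.
    """
    if any(f.startswith("996 $aOCLCWCRUP") for f in fields):
        return list(fields)  # already carries the note; leave unchanged
    return list(fields) + [WCRUP_996]
```

The sketch also shows the problem: the URL in `$u` names one version of the policy, yet nothing in the record says which version governed it when it was first obtained.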

With this control over the use of all WorldCat records in existence, OCLC could become a highly disruptive force for anyone with ongoing relationships around bibliographic records. Because the policy could change again regarding records that have already been transmitted, anyone developing applications around use of WorldCat records is left with great uncertainty. Absent a good survey of the OCLC record use landscape, it is hard to know how many organizations and uses could be affected by this because we don't know all of the many ways that organizations are transmitting, receiving and using WorldCat records. However, with a policy based on use, possession of WorldCat records is like having a ticking time bomb since you have no assurance that your use will be permitted in the future.

Transparency

The "out" for all of these areas where it isn't clear what use is or is not allowed is to file a WorldCat Record Use Form with OCLC.  OCLC will then determine if the use is allowed. Section E.6 says:
OCLC has the sole discretion to determine whether any Use and/or Transfer of WorldCat Records complies with this Policy.
If I were an OCLC Member organization, I would want this process to be as clearly defined and as transparent as possible, if for no other reason than to avoid any semblance of discrimination against parties making requests. For publicly funded libraries, participation in a process that even appears to some to exhibit prejudices could be a public relations disaster. The only way to demonstrate fairness is to have a process that is open and auditable. The same section says:
In the event OCLC identifies a Use and/or Transfer which does not comply with this Policy, OCLC shall notify the relevant OCLC Member(s) and/or Non-OCLC Member(s) and such parties agree to work with OCLC to resolve the noncompliance.
I would go further and ask for the development of a publicly available set of guidelines for use of the records, and a formal appeals process that has member input. 

OCLC Use Policy Details: Your Records

There has been a lot of excellent commentary about the proposed OCLC record use policy. What I want to do here is highlight a few details about the policy that I haven't seen discussed elsewhere. The first is...

Your original cataloging


There are two areas where it becomes important to identify "your records." The first is in section B.3 where "WorldCat record" is defined. In the final paragraph (top of p.2) it states:

An OCLC Member or Non-OCLC Member may Use or Transfer the following without complying with this policy: (i) a WorldCat Record designated in WorldCat as the Original Cataloging of the OCLC Member or Non-OCLC member...
In other words, your own original cataloging is not covered by this policy. That's good news, but the practical application of this may not be simple. The way to determine this is by reading the MARC 040 $a subfield, presuming that the system you used at the time set this correctly. There is also the fact that OCLC merges duplicate records, so two instances of original cataloging could become one in OCLC...
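A minimal sketch of that 040 $a check, assuming the field is available as a display string with `$`-delimited subfields. The function name and the institution symbols are hypothetical. The result is only as reliable as the data: it cannot detect a 040 that was set incorrectly, or a record whose original cataloging was merged into another library's record.

```python
import re


def is_original_cataloging(field_040: str, my_symbol: str) -> bool:
    """True if subfield $a (original cataloging agency) equals `my_symbol`."""
    match = re.search(r"\$a([^$]*)", field_040)
    return match is not None and match.group(1).strip() == my_symbol
```

For example, given a field such as `040 $aXYZ $cXYZ $dABC`, the record would count as the original cataloging of the (made-up) library XYZ but not of ABC, which merely modified it.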

Then there's the issue of how this affects down-stream users. For example, if Library A gives a copy of all of its original cataloging to Library B, and says: "no restraints on use," is Library B still held to the policy in terms of its use of WorldCat Records? According to the policy (E.5):

Regardless of the source from which WorldCat Records are received, Use and Transfer of WorldCat Records is authorized solely by OCLC pursuant to this Policy.
This seems to contradict the "your original cataloging is not covered" clause, although perhaps contract law deals with these kinds of apparent conflicts in some neat way. I would say that your original cataloging is not considered a WorldCat Record (as defined in the policy) except that the language of the exception refers to the original cataloging records as WorldCat records.

Also not clear is how this relates to the request to include the OCLC policy field in exported records. Although it isn't stated here, it would seem that original cataloging records should not contain the statement. (Those records could, however, be given a CC license by the originating library.)

Your holdings

Another key area relating to a library's own records is section D on the transfer of WorldCat Records. Section D.1.a states that libraries can transfer WorldCat records of their own holdings to other Members and Non-Members. Holdings is defined in the glossary as the OCLC institutional symbol on the record.

Section D.3 gives the logical converse of that: that to transfer WorldCat records that aren't of your own holdings, you must obtain permission from OCLC. This places restrictions on any institution that has received records from others, and could have implications for union and consortial catalogs. There isn't any mention of consortial agreements in the policy, yet many libraries already share their records in one or more such databases.

-------

Even if we work out the conceptual issues, both of these pose some real challenges in implementation since our bibliographic data today often does not clearly define either the origin or the source of the record, especially data that is not transmitted in MARC format. I'm really not at all sure that we could actually do what the policy requires.

Saturday, November 22, 2008

More on Google/AAP

Here are some more bits and thoughts on the agreement between Google and the AAP.

Library Involvement

Some librarians were involved in the settlement talks. The only one I have found so far who has come out about this is Georgia Harper. The librarians were working under a non-disclosure agreement (NDA), and therefore will not be able to reveal any details of the discussions. I have heard statements from others who I believe were privy to the negotiations, and they all seem to feel that the outcome was better for libraries due to the involvement of members of our "class." (Note that Google and AAP had high-end lawyers arguing their side, and we had hard-working librarians. I don't know how many of "our" representatives were also lawyers, but you can just imagine how greatly out-gunned they were.) Unfortunately that doesn't change my mind about the bait and switch move.

Google Books as Library

Some have begun to refer to Google Books as a library. We have to do some serious thinking about what the Google Book database really is. To begin with, it's not a research collection, at least not at this point. It's really a somewhat odd, almost random bunch of book "stuff." As you know, neither Google nor the libraries are selecting particular books for digitization. This is a "mass digitization" project that starts at one end of a library and plows through blindly to the other end. Some libraries have limited Google to public domain works, so in terms of any area of study there is an artificial cut-off of knowledge. Not to mention that some libraries, mainly the University of California, have been working with Google primarily to digitize books in their two storage facilities; that is, they have been digitizing the low use books that were stored remotely.

So the main reason why Google Books is not a library is that it isn't what we would call a "collection." The books have not been chosen to support a particular discipline or research area. Yet it will become a de facto collection because people will begin using it for research. Thus "all human knowledge" becomes something more like the elephant and the blind man: research in online resources and research that uses print materials will get very different views of human knowledge. (This is not a new phenomenon. I wrote about this in terms of some early digital projects I was involved in.) One of the big gaps in Google Books will be current materials, those that are still in print. Google will need to convince the publishers that it can increase their revenue stream for current books in order to get them to participate.

Subscribing to Google Books: Just Say No?


Beyond the (undoubtedly hard-won by library representatives) single terminal access in each public library in the US, libraries will be asked to subscribe to the Google Book service in order to give their users access to the text of the books (not just the search capability). This is one of the more painful aspects of the agreement because it seems to ignore the public costs that went into the purchase, organization, and storage of those works by libraries. (I'm not including privately funded libraries here, but many of the participants are publicly funded.) The parallels with the OCLC mess are ironic: libraries paying for access to their own materials. So, couldn't the libraries just refuse to subscribe? Not really. Publicly funded libraries have a mission to provide access to the world's intellectual output in a way that best serves their users. When something new comes along -- films on DVD, music on CD, the Internet -- libraries must do what they can to make sure that their users are not informationally underprivileged. Google now has the largest body of digitized full text, and there will be a kind of "information arms race" as institutions work to make sure that their users can compete using these new resources.

The (Somewhat Hidden) Carrot

I can't imagine that anyone thought that libraries and Google were digitizing books primarily so that people could read what are essentially photographs of book pages on a computer screen. Google initially stated that they were only interested in searching the full text of books. While interesting in itself, keyword searching of rather poor OCR text is not a killer app. What we gain by having a large number of digitized books is a large corpus on which we can do computational research. We can experiment with ideas like: can we follow the flow of knowledge through these texts? Can we create topic maps of fields of study? Can we identify the seminal works in some area? The ability to do this research is included in the agreement (section 7.2(d), The Research Corpus). There will be two copies of this corpus allowed under the agreement, although I don't see any detail as to what the "corpus" will consist of. Will it just be a huge file of digitized books and OCR? Will it be a set of services?

I have suspected for a while that Google was already doing research on the digital files that it holds. It only makes sense. For academics in areas like statistics, computer science, and linguistics, this corpus opens up a whole range of possibilities for research; and research means grants, and grants mean jobs (or tenure, as the case may be). This will be a strong motivation for institutions to want to participate in the Google Book product. Research will NOT be limited to participants; others can request access. What I haven't yet found is anything relating to pricing for the use of the research collection, nor if being a participating library grants less expensive access for your institution. If the latter is the case, then one motivation for libraries to agree to allow Google to scan their books (at some continuing cost to the library) will be that it favors the institution's researchers in this new and exciting area. Full participant libraries (the ones that get to keep the digital copies of their works) can treat their own corpus as research fodder. The other costs of being a full participant are such that I'll still be surprised if any libraries go that route, but if they do I think that this "hidden carrot" will be a big part of it.

----

There's lots of good blogging going on out there on this topic. It needs a cumulative page to help people find the posts. Please tell me you have time to work on that, so I don't have to take it on! (Or that it exists already and I've missed it.) (The PureInformation Blog has a good list.)

Note: the Internet Archive/OCA may take this on. I'll post if/when they do.


Friday, November 21, 2008

Fork WorldCat



Done in haste - hopefully someone can improve.

Also, as stimulus for those with better art skills:

Tuesday, November 18, 2008

Google Giveth ... and Taketh Away

Some additions, amendments.

The agreement between Google and the AAP is of great significance for libraries. It is also very long, written in "legalese", and contains conclusions of a lengthy negotiation without revealing the nature of the discussion. Given that many lawyers were involved, we may never get the back story of this historic settlement, yet it has the potential to change the landscape on rights, digitization, and libraries.

I am basing much of my analysis on the summary of the agreement produced by ARL. This unfortunately means that some errors may be introduced between their summary and my interpretation. I have gone to the original document to check some particulars, such as definitions, but much of that document goes unread for now.

Key Points

(... or, a summary of the summary)

  • The agreement is primarily about books that are presumed to be in copyright but which are no longer in print. In-print books continue to be managed directly by the rights holders, who can make agreements with Google (or anyone else) for uses of those items.

  • The agreement has some odd limitations that baffle me: it only covers books published in the US that have been registered with the Copyright Office. It does not include any books published after January 5, 2009. The settlement does cover non-US books (e.g. Berne countries); I'm still unclear on the statement about registration for US books, but it was cited in the ARL document.

  • The agreement trades off Google's liability with payment to rights holders. That is, as long as Google requires payment from users for displays and copies, and passes 2/3 of those monies to the rights holder, Google is exempt from copyright infringement claims by rights owners. So users of the digital files will pay to keep Google legal.

  • The agreement does not answer the all-important question of whether scanning for the purposes of searching is an allowed use under copyright law.

  • The agreement flouts the concept of Fair Use by quantifying the amount of an in-copyright book that users can view for free ("20% of the text," "five adjacent pages," but not the final 5% of a fiction book, to keep the endings a surprise.) The ARL document has Google saying that it will not interfere with fair use. I can't find that statement in the actual settlement. These quantities are contractual, and I'm assuming that the technology will not allow users to exercise fair use rights, only what the contract permits.

  • Google will sell digital copies of in-copyright books to users, who will have perpetual access to the book online. Some printing will be allowed but all printed pages will have a watermark that identifies the user. (I'm calling this "ratwear," software that rats you out.) Users will be able to make notes on the book's pages, but they will only be able to share those notes with other purchasers of the book. (Thus buying a Google book is like joining a secret reading club.) The settlement states that the watermark will either identify the user or carry other information "which could be used to identify the authorized user that printed the material or the access point from which the material was printed" (Agreement, p. 47).

Key Points Relating to Libraries

This is the hard part for me. Hard in that it really hurts.

  • After digitizing books held in libraries, Google will then turn around and become a library vendor, supplying those same books back to libraries under Google's control. Each public library in the US will get a single "terminal" provided (and presumably controlled) by Google that allows users to view (but not copy and paste from) books in the Google database. Some printing is allowed, but there will be a per-page fee charged.

  • Libraries and institutions can also subscribe to all or part of the database of out of print books. Access is not perpetual, but limited to the life of the subscription.

  • There is verbiage about how users in these institutions can share their "annotations." In other words, if you take notes on your own, obviously those are yours. But if you use the capabilities of the system to make your notes in the system, you cannot share your own notes freely.

Now for the Clincher


... this is the pact with the devil.

  • A library can partner with Google for digitization of its collection and get the same release from liability that Google has. The library can keep copies of these digitized books, however, it must follow security standards set by Google and the AAP and must submit its security plan for review and allow yearly auditing. (The security measures are formidable and quite possibly not affordable for all but the wealthiest institutions. There are huge penalties up to millions of dollars for not getting security right.)

  • Libraries that make this pact with the devil are thereby allowed to preserve the files, print replacement copies for deteriorating books, and provide access for people with disabilities. Note that all of these uses by libraries are already allowed by copyright law.

  • The libraries that make this pact with the devil cannot let their users read the digitized books. Well, they can let them read up to five (5!) pages in any digitized book. Presumably if the library wants to provide other uses it must subscribe to Google's service. Libraries are expressly forbidden from using their copies of the books for interlibrary loan, e-reserves, or in course management systems.

... and if you refuse to negotiate with the devil...

  • Current Google library partners who do not choose to become party to this must delete all copies of digitizations of in-copyright works made by the Google project in order to obtain a release from liability. If they choose not to delete the copies, they are on their own in terms of liability for the in-copyright books that Google did digitize (and Google knows exactly which books are involved.)

  • Even if the library was only allowing Google to digitize public domain works, those libraries must destroy all of their copies to get release from liability in case they mis-judged the copyright status of one of those books.
In other words, this agreement is making the assumption that if anyone sues Google for copyright infringement, the library will be a party to that suit.

They say that "the devil is in the details." In this case that is not true: the devil is right up front, in the main message. That message is that Google has agreed with the publishers, and is selling out the libraries that it has been working with. The deal that Google and the libraries had was that in exchange for working with Google to digitize books in their collections, the libraries received a copy of the digital file. After that, it was up to the libraries to do the right thing based on their understanding of copyright law. Participating with Google has been an expensive proposition for the libraries in terms of their own staff time and in the development of digital storage facilities. Part of the appeal of working with Google was the assumption that partnering with the search giant gave the entire project clout and provided some protection for the libraries. With Google and the AAP now in cahoots, the libraries must join them or try to stand alone in an unclear legal situation; an unclear situation that Google invited the libraries into in the first place.

This is classic bait and switch. And it is bait and switch with powerful commercial interests against public institutions. There is no question about it...

THIS IS EVIL

Note: I've added more comment and info in the comments area as things pop up. So read on....

Tuesday, November 11, 2008

The Importance of FRBR Expression

Most of the talk about "FRBR-ization" (a terrible mis-nomer, but now common terminology) is about creating clusters of records that represent the same work. In fact, I'm of the opinion that the work level is of interest only to a few (for example, literary critics) -- what most users would like to see is the expression level. The expression is also the level that is needed for the various efforts to associate copyright information with bibliographic data.

In many cases, the work and expression are one and the same because the item has only been issued in one expression. For those, the distinction isn't of consequence.

Where there is more than one expression for the work, those expressions tend to take particular forms, at least for books: new editions, mainly for non-fiction; and translations. In both of these cases, I maintain that the expression level is what users want, not the work. (Non-book experts: does this carry through to other formats?)

My usual example of a translated work is Thomas Mann's Der Zauberberg. According to the cataloging rules, the work's title is Der Zauberberg, while expressions in our libraries may have the title in the language of the translation, e.g. The magic mountain. A FRBR-based work display would be something like:

Mann, Thomas
Der Zauberberg. 1924

This would be the work entry into The magic mountain for users in English-language catalogs, and I assume that many of those users would not recognize the German language title, nor want to go through this level to reach the translated version that they seek.

WorldCat has finessed this by keeping the translations separate -- in other words, WorldCat responds to a search with FRBR expression-level records. And I think this is more user-friendly than the work-level record would be.

The other case, that of editions, also argues for the importance of the FRBR expression-level, but the user needs may be different. In this case, the work level will be recognizable to the user, but the information about which is the latest edition/expression needs to be very clear so that the user does not mistakenly select an item that has been replaced or updated by a later edition. Using the Dewey decimal classification and relative index as our example of a work with many editions, WorldCat shows a single edition on its 'work' page, and I assumed that it was the latest, listed as "Ed. 20," "1989." In fact this isn't the latest edition -- there is a 22nd edition from 2003. Users would only find this by going to what seems to be the expression level where all editions are listed.

This shows how hard it is to create a single grouping for all records that serve the users' needs.

Meanwhile, I have another project that will be attempting to connect copyright information to bibliographic items, including linking to entries in the renewal database. Oddly enough, RDA lists the "copyright notice" element as being at the manifestation level, which seems wrong to me. Copyright is determined on the expression, at least for the two cases I have mentioned so far: each translation receives its own copyright, as does each distinct edition. That these may be republished in a variety of manifestations (hard back, paperback, large print, etc.) does not change their copyright status.

We cannot, however, link copyright information to works. There is no copyright in Der Zauberberg or in the Decimal Classification as a work; copyright will instead be on each expression. So for the purposes of linking to copyright information, it seems that we would ideally have a way to group items by expression. If not, then the only proper link would be on the manifestation, even though that means some repetition. What will make all of this difficult is that we won't often have a date that we can associate with the expression, only with the manifestation, and that isn't necessarily the copyright date. (Except when it is, of course. You librarians reading this know what I mean.)
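To make the idea of grouping by expression a little more concrete, here is a minimal Python sketch. The record fields ("work", "expression", "format") and the sample values are invented for illustration; real catalog data rarely labels the expression this cleanly, which is exactly the difficulty described above.

```python
from collections import defaultdict

# Hypothetical manifestation records; the field names are illustrative only.
manifestations = [
    {"work": "Der Zauberberg", "expression": "German original", "format": "hardback"},
    {"work": "Der Zauberberg", "expression": "English translation", "format": "paperback"},
    {"work": "Der Zauberberg", "expression": "English translation", "format": "large print"},
]

def group_by_expression(items):
    """Group manifestation records under a (work, expression) key."""
    groups = defaultdict(list)
    for item in items:
        groups[(item["work"], item["expression"])].append(item)
    return dict(groups)

groups = group_by_expression(manifestations)

# Copyright information would be attached once per key, not once per format.
for key, members in groups.items():
    print(key, [m["format"] for m in members])
```

With a grouping like this, the paperback and large print versions of the translation would share one link to copyright information instead of repeating it on each manifestation.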

It still baffles me that we don't include a transcription of the copyright statement on the book or item when we create library bibliographic data, considering how useful that could be. Yet, when I proposed the copyright statement field for the MARC record there was great opposition. Some things I just don't get.

Monday, November 03, 2008

Determining Copyright Status

Among the many interesting bits in the Google/AAP agreement is Section E which essentially lays out in detail what steps Google must take to determine if an item is or is not in the public domain. As we know, this is not easy. The agreement states that two people must view the title page of the work (yes, it says "two people") to determine if the item has a copyright notice, and to check the place of publication. To determine if copyright has been renewed, "Google shall search either the United States Copyright Renewal Records or a copy thereof." If a renewal record isn't found, and the work has a copyright date before 1964, then it is presumed to be in the public domain.

I decided to try this out, at least the part about checking the renewal. I did my searches in two databases: Stanford's and Rutgers'.

I happen to have a copy of Orwell's 1984 with detailed copyright notices. It lists the first copyright as 1949, by Harcourt, Brace and Jovanovich, Inc. It then says "Copyright renewed 1977 by Sonia Brownell Orwell." It also includes "Copyright 1984 by Virgin Cinema Films Limited" although I must say that I'm not sure why that latter copyright notice is in the book.

A search on '1984' in Rutgers' database yields no hits, but using the author's name I find 37 items, of which one reads:
AUTH: George Orwell, translation: Amelie Audiberti. NM: translation.
TITL: 1984.
ODAT: 1Jul50; DREG: 7Nov77 RREG: R678090. RCLM: AFO-2377. Amelie Audiberti, nee Elisabeth Savane (A)
A search in the Stanford database gets me:
Title    1984 NM: translation
Author George Orwell, translation: Amelie Audiberti
Registration Date 1Jul50
Renewal Date 7Nov77
Registration Number AFO-2377
Renewal Id R678090
Renewing Entity Amelie Audiberti, nee Elisabeth Savane (A)
Both of these seem to be for the same item, and it's a translation of the book 1984. The renewal listed in the book for the English text is not in the databases. The instructions to Google say nothing about taking renewal dates from the book, so this one would appear to be in the public domain by the agreement's criteria.
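The renewal test as I have applied it here can be written down in a few lines. This is only a sketch of the one sentence quoted from the agreement (copyright date before 1964 and no renewal record found); the function and parameter names are my own, it leaves out the title page checks for notice and place of publication, and it says nothing about the actual legal status of a book.

```python
def presumed_public_domain(copyright_year, renewal_record_found):
    """Apply only the renewal step: pre-1964 date and no renewal record found."""
    return copyright_year < 1964 and not renewal_record_found

# Orwell's English text: copyright 1949, and the search above found no
# renewal record for it (only one for the translation).
orwell_english_text = presumed_public_domain(1949, renewal_record_found=False)
print(orwell_english_text)
```

The weak point is obviously the second argument: the answer is only as good as the search that produced it.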

Picking up another book of the right age, I have Proust's "The Captive" in the Modern Library edition, the "C. K. Scott Moncrieff" translation, with "Copyright, 1929, by Random House, Inc." on the title page.

In Stanford's database I get:

Title    The captive. Translated by C. K. Scott Monorieff
Author PROUST, MARCEL
Registration Date 27Jun29
Renewal Date 7Sep56
Registration Number A9965
Renewal Id R176423
Renewing Entity Random House, Inc. (PWH)

In Rutgers I get:
CLNA: RANDOM HOUSE, INC.
TITL: The captive.
XREF: Proust, Marcel.
Unfortunately, this latter doesn't include a date, so I'm not sure that this record provides sufficient information. Fortunately, the Stanford database gives more information. Unfortunately, the Stanford record gives the title and what we librarians would call the "statement of responsibility" in the same field, and misspells the name of the translator. This may make it more difficult for any automated matching of the records. (I am assuming that Google will be doing automated matching, not hand searching of the database. That may be a mistaken assumption, especially since they have agreed that two humans will view the title page.)
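To show why the misspelling matters for automated matching, here is a small Python sketch using the standard library's difflib. I have no idea how Google actually matches records; the function names and the 0.9 threshold are arbitrary choices of mine for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def probably_same(a, b, threshold=0.9):
    """Treat two strings as a match if they are similar enough (threshold is a guess)."""
    return similarity(a, b) >= threshold

# What one might build from the book, versus the Stanford title field.
from_book = "The captive. Translated by C. K. Scott Moncrieff"
from_database = "The captive. Translated by C. K. Scott Monorieff"

print(from_book == from_database)               # exact comparison
print(probably_same(from_book, from_database))  # fuzzy comparison
```

An exact comparison misses this record, while a fuzzy comparison finds it. The catch is that any threshold loose enough to forgive typos will also produce some false matches.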

This next (and last) one is an especially interesting case. I have a copy of Rebecca West's "Black Lamb and Grey Falcon: A Journey through Yugoslavia" printed by Penguin books in 1994. It gives the copyright date as "1940, 1941" and the renewal date as "1968, 1969", both under the name of Rebecca West.

A search on the title in Rutgers' database gets me these three records:

CLNA: WEST, ROBERT.
TITL: Black lamb and grey falcon. (In Atlantic monthly, Feb.-May 1941)
ODAT: 21Jan41 OREG: B482882; 19Feb41 RREG: Rebecca West ; 12Aug68; R441634-441631.

CLNA: WEST, REBECCA.
TITL: Black lamb and grey falcon; a journey through Yugoslavia. Pub. serially in the Atlantic monthly, Dec. 17, 1940-Apr. 17, 1941. NM: additions.
ODAT: 20Oct41; A158501 RREG: Rebecca West ; 10Jan69; R453530.

CLNA: WEST PUB. CO.
TITL: Black lamb and grey falcon. (In The Atlantic monthly, Jan. 1941)
ODAT: 20Dec40; B479489 RREG: Rebecca West ; 2Jan68; R426137.

As you can tell, some part of the book was originally published in the Atlantic Monthly as a serial. From these records it's difficult to tell exactly what issues of the monthly it was included in, and the "Claimants" are all different. In the Stanford database it's a bit more clear. There are five records; four are duplicates for the original articles in the Atlantic Monthly and one more called "Additions." Each of the four duplicate records is like this one:
Title    Black lamb and grey falcon. (In Atlantic monthly, Feb.-May 1941)
Author WEST, REBECCA.
Registration Date 21Jan41, 19Feb41,21Mar41 21Apr41
Renewal Date 12Aug68
Registration Number B482882, B488595, , B492319,, B495868
Renewal Id R441633
Renewing Entity Rebecca West (A)
I suppose that the four renewal records are one for each item in the Atlantic Monthly, but they each have the same information. Only the fifth record, the one for "additions," includes the subtitle that appears on the book. The presence of the article records is puzzling because Stanford claims to have included only records for the renewal of books. In fact, it is easy to find records for articles in the database, so it's probably best to assume that the database covers text in general.

Even for the human searcher, it may be difficult to connect the book and the records because there is nothing in the book itself to indicate that it was previously published in a journal. In fact, the introduction merely mentions that the book itself was first published in two volumes in 1941.

The book was published in two volumes because it is nearly 1200 pages long. The archives of the Atlantic Monthly list the four articles with this same name as containing 24, 24, 26, and 24 pages, respectively. It's rather hard to understand how those articles, as copyrighted, could be the same as a 1200 page book. We are left only with the record that claims to be "Additions" and that has the same subtitle as the book:

Title   Black lamb and grey falcon; a journey through Yugoslavia.
Pub. serially in the Atlantic monthly, Dec. 17, 1940-Apr. 17, 1941.
NM: additions
Author WEST, REBECCA
Registration Date 20Oct41
Renewal Date 10Jan69
Registration Number A158501
Renewal Id R453530
Renewing Entity Rebecca West (A)
Again, the title field contains quite a bit of information beyond the title, and it just isn't crystal clear to me that this record is for the book and not for the articles. If it is for the book, then the idea that 1200 pages were published serially over four journal issues is quite a stretch. Plus, the Monthly archive claims that the dates are Jan, Feb, Apr and May, 1941.

Underlying the statement that, to determine if copyright has been renewed, "Google shall search either the United States Copyright Renewal Records or a copy thereof" is a great deal more complexity than that one sentence implies. It makes me wonder if the negotiators for the AAP are fully aware of how inaccurate the results might be. (An example: the author field in a record for an article by George Orwell reads: "Author George Orwell. U. S. ed. pub. as Shooting an elephant, 26Oct50, A49135".) If they are aware of it, then I must commend them for taking the practical path and allowing Google to make books available based on this evidence. If a copyright holder notifies Google that a book has been determined to be public domain in error, Google is obliged to change the status of the work from public domain to "in copyright," but is not held liable for infringement if the steps for determining public domain were followed and documented as laid out in the agreement.

It will be hard to determine, however, if Google should happen to err on the side of copyright, and lists as under copyright works that are actually in the public domain. While copyright holders can be expected to make sure that their works are properly protected, works in the public domain have no rights holder to monitor their status, and no one assigned to protect the public interest.

One other caveat, which appears in Section E, is:
Any determination by Google that a work is a Public Domain Book is solely for the purposes of Section 3.2(d)(v) and is not to be relied on or invoked for any other purposes, including determining whether a work is in fact in the public domain under the Copyright Act.
Basically, this means that just because Google determines that a book is in the public domain doesn't mean that's the legal status of the book. It also means that the rest of us can't use the excuse: "But Google says it's in the public domain." I have not heard whether Google will make the documentation of its copyright search available, and it's that documentation that has the real value. It's kind of like algebra: the answer is important, but what really matters is how you got the answer.

[Note: keep an eye on the Open Library and Creative Commons for some work on copyright determination that will be openly accessible.]

Google/AAP settlement

This Google/AAP settlement has hit my brain like a steel ball in a pinball machine, careening around and setting off bells and lights in all directions. In other words, where do I start?

Reading the FAQ (not the full 140+ page document), it seems to go like this:

Google makes a copy of a book.
Google lets people search on words in the book.
Google lets people pay to see the book, perhaps buy the book, with some money going to the rights holder.
Google manages all of this with a registry of rights.

Now, replace the word "Google" above with "Kinko's."

Next, replace the word "Google" above with "A library."

TILT! If Google is allowed to do this, shouldn't anyone be allowed to do it? Is Jeff Bezos kicking himself right now for playing by the rules? Did Google win by going ahead and doing what no one else dared to do? Can they, like Microsoft, flout the law because they can buy their way out of any legal pickle?


Ping! Next thought: we already have vendors of e-books who provide this service for libraries. They serve up digital, encoded versions of the books, not scans of pages. These digital books often have some very useful features, such as allowing the user to make notes, copy quotes of a certain length, create bookmarks, etc. The current Google Books offering is very feature poor. Also, because it is based on scans, there is no flowing of pages to fit the screen. The OCR is too poor to be useful to the sight-impaired. And if they sell books, what will the format be?


TILT! Will it even be legal for a publicly-funded library to provide Google books if they aren't ADA compliant?


Ping! This one I have to quote:

"Public libraries are eligible to receive one free Public Access Service license for a computer located on-site at each of their library buildings in the United States. Public libraries will also be able to purchase a subscription which would allow them to offer access on additional terminals within the library building and would eliminate the requirement of a per page printing fee. Higher education institutions will also be eligible to receive free Public Access Service licenses for on-site computers, the exact number of which will depend on the number of students enrolled."


TILT! Were any public libraries asked about this? Does anyone have an idea of what it will cost them to 1) manage this limited access and pay-per-page printing 2) obtain more licenses when demand rises? Remember when public libraries only had one machine hooked up to the Internet? Is this the free taste that leads to the Google Books habit?


Ping! The e-book vendors only provide books where they have an agreement with the publishers, thus no orphan works are included. So, will Google's niche mainly consist of providing access to orphan works? Or will the current e-book vendors be forced out of the market because Google's total base is larger, even though the product may be inferior?


Ping! We already have a licensor of rights, the Copyright Clearance Center, and it was founded with the support of the very folks (the AAP) who have now agreed to create another organization, funded initially by Google and responding only to the licensing of Google-held content.


TILT! Google books gets its own licensing service, its own storefront... can anyone compete with that? And what happens to anything that Google doesn't have?


Ping! It looks like Google will collect fees on all books that are not in the public domain. This means that users will pay to view orphan works, even though a vast number of them are actually in the public domain. Unclaimed fees will go to pay for the licensing service. Thus, users will be paying for the service itself, and will be paying to view books they should be able to access freely and for free.


Ping! We have a copyright office run by the US government. I'm beginning to wonder what that Copyright Office does, however, since we now have two non-profit organizations in the business of managing rights, plus others getting into the game, such as OCLC with its rights assessment registry, and folks like Creative Commons. Shouldn't the Copyright Office be the go-to place to find out who owns the rights to a work? Shouldn't we be scanning the documents held by the Copyright Office that tell us who has rights? (Note: the famed renewal database is actually a scan of the INDEX to the copyright renewal documents, not the full information about renewal.) Even if we had access to every copyright registration document in the Copyright Office, would we know who owns various rights? I think not. And how much of this will change with the Google opt-in system? I get the feeling that we'll maybe resolve some small percentage of rights questions, somewhere in the order of 2-5%. And it will, in the end, all be paid for by readers, or by libraries on behalf of readers.


TILT! Rights holders can opt-out of the Google Books database. If (when) Google has the monopoly on books online, opt-out will be a nifty form of censorship. Actually, censorship aimed directly at Google will be a nifty form of censorship.


GAME OVER. All your book belong to us.

Saturday, October 18, 2008

The Semantics of Semantic

I've always had a hard time with the Semantic Web because it didn't appear to me to be semantic at all. Thus I started calling it the Syntactic Web since it seemed to be mainly about structure, more like diagramming sentences than having a conversation. I now think I understand why that is.

If you are like me, you assume the term "semantic" is about meaning, and in particular the meaning in words and language. That's what you'll find in the dictionary. It is only recently that I learned that there is another use of the term "semantic" and that is in an area of mathematics called "formal semantics". Basically, formal semantics are about formal languages, such as mathematics, programming languages, and such. You can perform operations, often called "inferences," based on the rules of formal languages. This is an example:

A ☠ B
B ☠ C
therefore, A ☠ C

Using some set of rules, this statement is true even though A, B and C are black boxes, as is the relationship "☠". The statement after "therefore" can be calculated without ever considering that A, B or C have any meaning beyond being the symbols A, B and C.

This is a whole different meaning of "meaning" compared to the meaning of words in the human language sense. This is the kind of meaning that works with machines and algorithms and is therefore quite suited to automation. And it is this meaning of semantic that is meant for the Semantic Web.
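
To make the point concrete, here is a minimal sketch in Python of a machine drawing the inference above without knowing what any symbol means. It assumes the "set of rules" includes one rule saying the relationship chains (if A relates to B and B relates to C, then A relates to C). The function name `infer_transitive` is made up for this illustration and does not come from any Semantic Web toolkit.

```python
def infer_transitive(facts):
    """Return the facts plus everything derivable by one assumed rule:
    (a REL b) and (b REL c) together imply (a REL c)."""
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        # Iterate over snapshots so new facts can be added inside the loops.
        for (a, rel1, b) in list(closure):
            for (b2, rel2, c) in list(closure):
                if rel1 == rel2 and b == b2 and (a, rel1, c) not in closure:
                    closure.add((a, rel1, c))
                    changed = True
    return closure


# The symbols are opaque strings; the code never looks inside them.
facts = {("A", "☠", "B"), ("B", "☠", "C")}
derived = infer_transitive(facts) - facts  # the new statement: A ☠ C
```

Nothing in the code depends on what A, B, C or "☠" stand for, which is exactly the sense of "semantic" at work here.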

Well, no wonder those of us who aren't mathematicians (or more specifically, involved in the use of formal languages) have been confused by the Semantic Web! It isn't semantic at all in the human language sense, it is mainly about structure and syntax. It's a shame that its developers used the confusing term "semantic" in its name. Although not incorrect, it is definitely a minority view of the meaning (that is, the semantics) of the term, and leads to confusion.

With this new knowledge, I would characterize the semantic web as a basic structure within which one could insert human-meaningful data, and a set of rules that make that structure operational in a computing environment. It greatly resembles the structure of simple human utterances, like "Moby Dick is the title of this book," although the semantic web would express this something like:

URI:abcd (has relation) URI:1234 (with) URI:xxzz

While Semantic Web devotees may be enthusiastic about that statement, most of us are going to get more out of: "Moby Dick is the title of this book." In other words, we only connect with the statement when it has human-understandable meaning; the formal language alone just doesn't do it. It's rather like the difference between the architectural drawings and the actual building: architects can understand what the drawings represent, but most of us will have to wait until the building is completed and we can walk through it in order to experience what the drawings mean.
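
The gap between the two forms can be sketched in a few lines of Python. The three identifiers are the placeholders from the statement above. The label table is my own assumption about what each one might stand for, and `render` is a hypothetical helper, not part of any RDF library.

```python
# The formal statement: three opaque identifiers, in the order given above.
triple = ("URI:abcd", "URI:1234", "URI:xxzz")

# Assumed human-readable labels; a real system would have to look these up.
labels = {
    "URI:abcd": "this book",
    "URI:1234": "has the title",
    "URI:xxzz": "Moby Dick",
}


def render(triple, labels):
    """Replace each identifier with its label, falling back to the raw identifier."""
    return " ".join(labels.get(term, term) for term in triple)


sentence = render(triple, labels)
```

With the label table the statement reads like a sentence; without it, `render` can only hand back the identifiers, which is all the formal structure knows.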

Now I have to ask myself what to do with this knowledge. I don't think it makes sense to require everyone to be an architect in order to walk through a building, and I know for sure that all of us can speak in sentences even if we aren't experts in linguistics. So it should be possible to interact with, nay even be an active creator of, the semantic web without being conversant with the field of formal semantics. At the moment, though, I don't know how that's going to work, but I do know that if it doesn't work that way the Semantic Web isn't going very far. It just has to become wysiwim - what you see is what I mean. I suggest something Pipes-like might do the trick.

Tuesday, September 30, 2008

More Puzzling over FRBR

FRBR came up often in our discussions at DC2008. In particular, there were many attempts to clarify what FRBR means in a technical environment. Since FRBR is about entities and relationships, it seems to be perfectly positioned as the first step in the transformation of library data to the semantic web.

After each such event where ideas about FRBR are thrown around, I go back to FRBR and try to understand it. Each time it's as if I'm reading and thinking about an entirely different model. So here's this week's entry into the multiple personalities of FRBR.

During this reading I focused on the relationships between the entities, such as the relationship between expression and work, and the various work/work, expression/expression (etc.) relationships. What struck me immediately is that there is a fair amount of detail in the explication of the relationships between different Group 1 entities (work/work, etc.). These turn out to be the richest set of relationships in FRBR. At the same time, the relationships between Work-Expression-Manifestation-Item are covered by a single sentence each:

Work: a distinct intellectual or artistic creation

Expression: the intellectual or artistic realization of a work

Manifestation: physical embodiment of an expression of a work

Item: exemplar of a manifestation


Only one example is given for each. Compare that to table 5.1 on page 63 of the FRBR document, which gives these relationships between works:

successor

supplement

complement

summarization

adaptation

transformation

imitation


It seems much easier to see the real world applicability of these relationships than "intellectual...realization of a work." The sum of the lists of the relationships inherent in the Group 1 entities embodies much of the network of bibliographic interactions that will interest us in the semantic web. In fact, these relationships are probably closer to the needs of users than those of WEMI. I would like to explore these relationships further to understand what they reveal in terms of the development of a navigable route through the bibliographic world.

Meanwhile, I have some comments on those relationships. To begin with, I find it very interesting that there is no privileged "first expression" of a work. Admittedly the first expression may not be known, but in fact the expressions are all equal and all have the same relationship with the work. This means that you can indicate that one expression is a translation of the other, and the translation then has the same relationship to the work as does the expression in the original language (which may - or may not - be the original expression). This seems to defy the concept of the uniform title or work title, which represents the original language of expression, and therefore says something about the "originality" of the first expression.

Another thing is that the relationships I list above are also valid between expressions. This adds to the obscurity of the difference between works and expressions. There are, however, times when it makes sense to me: you could have a film version of Romeo and Juliet that is based on a particular expression of Shakespeare's work. The "workness" of the film also has some relationship with the "workness" of the play (adaptation? transformation?). Yet in general I have trouble with work-work relationships since there really is no work without an expression, therefore it's hard to say that work A adapts work B. I suspect this is just the general uneasiness with the abstractness of the work, but it seems amplified when you try to add relationships to this very fuzzy concept. The relationships between expressions make more sense to me.

Something else that occurs to me is that the transformative relationships make sense between expressions (translation, adaptation) and the intellectual relationships make sense between works (imitation, successor, others?). That an imitation is based on a particular expression (whatever the imitator had in hand) is almost a secondary relationship. What this means is that the work-work relationships and the expression-work or expression-expression relationships with the same name may not be identical. In fact, they couldn't possibly be identical because they refer to different types of entities. So although they have the same names, I would argue that they are not the same relationships, in the same way that the work title and the manifestation title are distinct even though they are both called 'title'.

I end this lengthy, rambling brain dump with the thought that we might be able to create a rich network of expressions, linked handily to their respective works, that would be very useful for those seeking information. And that the network of expressions could help us identify the appropriate work for each expression, because once an expression is found to be a translation of another, then they must logically be expressions of the same work.
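
That last idea can be sketched as a grouping problem. The Python below is only an illustration: it assumes we already hold a list of "is a translation of" pairs, uses made-up expression identifiers, and puts linked expressions into the same work with a union-find structure. Nothing here comes from FRBR itself.

```python
def group_expressions(expressions, translation_pairs):
    """Group expressions into presumed works: any two expressions linked
    by a chain of translation relationships land in the same group."""
    parent = {e: e for e in expressions}

    def find(e):
        # Follow parent pointers to the group's representative,
        # shortening the path along the way.
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    for a, b in translation_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for e in expressions:
        groups.setdefault(find(e), set()).add(e)
    return list(groups.values())


# Hypothetical identifiers for four expressions.
works = group_expressions(
    ["moby_en", "moby_it", "moby_fr", "hamlet_en"],
    [("moby_it", "moby_en"), ("moby_fr", "moby_it")],
)
```

The French expression is recorded only as a translation of the Italian one, yet it ends up with the English expression too, because the links are followed transitively.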

Forgive me if this is all a re-hash of the obvious. For some reason, nothing in FRBR comes to me easily.

Monday, September 29, 2008

DC2008

I recently attended the annual Dublin Core conference in Berlin. I would have blogged the sessions but in fact I spent most of my time in the hallways chatting with folks. The main message of the conference was: Semantic Web. This included an interesting talk by Martin Malmsten on turning MARC records into RDF triples. (See also the work at Talis in this area.)

For me the big deal of the conference was a meeting with some of the Dublin Core folks who developed the DC Abstract Model and the DC Application Profile model. We had a nice long talk about the distance between those views and the actual production of library metadata. What we concluded was that we will work together to bridge this gap, in part by creating simple, re-usable modules that are easy to understand and that, when hooked together, provide the information necessary to engineer a fully functioning, DCAM-compliant application profile.

Yes, I know we need more of an explanation, and I'll be working on that very soon. Don't go too far away.

Wednesday, September 17, 2008

Functional Requirements for App Profiles

In preparation for DC2008 (9/22-26, Berlin), I've been thinking about application profiles. The DC folks have developed a structure for application profiles which I have attempted to use for the DC-RDA work. I ran into some difficulties, in part because the library community has its own particular needs. So I thought it would be a good idea to articulate these needs in preparation for discussions I hope we will have next week.

I'm going to use some terminology from FRBR, and some from the DC work. Mainly, I'll use "entity" in the FRBR sense. I'll use "property" in the sense that it is used in the DCAM and in RDF.

Here's my first pass at what we need to express in an application profile for the library community:

entities


We need to define the entities that will be in our metadata environment. It would be ideal to be able to re-use entities where possible. So if two APs can use the same Person entity, they just need to be able to identify it. At the same time, it must be possible to create a different person entity and to give it a new identifier.

relationships between entities

In some cases it will be desirable to constrain the relationships that can exist between entities. Both RDA and FRBR constrain which Group 2 entities have relationships with a Work as opposed to an Expression. This is an area of some disagreement among sub-communities, so there will be some APs that will define the relationships differently.

properties of entities

Entities have properties. These are metadata elements that have been defined outside of the AP. Each property must have a unique identifier. It is the detailed information about the properties that will make up the bulk of the AP. Here is a first list of what that information needs to be:
  • property identifier
  • property is mandatory/optional
  • property is repeatable/not (within entity)
  • properties are cumulative/mutually exclusive -- a way to say that you can use property A or B or C, or that you can use any combination of A, B, C.
  • property value is controlled/uncontrolled -- this distinguishes between free text (e.g. an abstract, user tags) and a constrained set of values (authority list, or a designated format). If controlled, then there needs to be a way to give some information on the type of control: URI of a list of terms; URI of a standard format for the data (e.g. date type format, or AACR2 name heading format).
  • property value is transcribed/supplied -- transcribed data are taken directly from the resource itself; supplied means source of the information is not the resource. (Title can be transcribed; subject headings are supplied.)
  • for controlled property values that use a set list of values, it has to be possible to state the vocabularies that are valid, and whether or not they are mandatory or optional. It may also be necessary to define whether one can extend the vocabulary in the metadata (e.g. use an unlisted value if a new value is needed). It needs to be stated whether the entire vocabulary is to be used. If not, the AP needs to define which values from the full vocabulary are valid. It also needs to be possible to create a list of values within the AP for any element. In this case there is no external controlled list.
Other?
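
To see whether the list above hangs together, here is a rough Python sketch of how a few of these items might be recorded and checked. It is not the DCAM or any existing application profile format. The class and field names are my own, and it leaves out the cumulative/mutually exclusive rule and vocabulary extension.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PropertyConstraint:
    """One property as an application profile might constrain it (hypothetical shape)."""
    property_id: str                          # unique identifier of the property
    mandatory: bool = False
    repeatable: bool = True                   # within one entity
    controlled: bool = False                  # False means free text
    vocabulary_uri: Optional[str] = None      # where the controlled list lives, if any
    transcribed: bool = False                 # taken directly from the resource itself
    allowed_values: frozenset = frozenset()   # subset of the vocabulary valid in this AP


def check_entity(values, constraints):
    """values maps property_id to a list of values; return a list of problems found."""
    problems = []
    for c in constraints:
        found = values.get(c.property_id, [])
        if c.mandatory and not found:
            problems.append(f"{c.property_id}: mandatory property is missing")
        if not c.repeatable and len(found) > 1:
            problems.append(f"{c.property_id}: property is not repeatable")
        if c.controlled and c.allowed_values:
            for v in found:
                if v not in c.allowed_values:
                    problems.append(f"{c.property_id}: value {v!r} is not in the vocabulary")
    return problems


# Two made-up constraints: a transcribed title and a controlled language code.
constraints = [
    PropertyConstraint("title", mandatory=True, repeatable=False, transcribed=True),
    PropertyConstraint("language", controlled=True,
                       allowed_values=frozenset({"eng", "fre"})),
]
problems = check_entity({"title": ["A", "B"], "language": ["xyz"]}, constraints)
```

Here `problems` reports the repeated title and the unlisted language code, which is the kind of feedback an AP-aware system could give at input time.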

I'm musing over whether we need to be able to define a "record," mainly to say what the minimum is that someone could expect to receive.

I'm also considering the need to define relationships between records -- like the FRBR work/work and work/part relationships. As I said in my post on linking, I see a difference between dependent and independent links, and these, in my mind, would be independent links, and may point beyond a particular database or system. I'll think more about this, and welcome comments.

Saturday, September 13, 2008

Thinking About Linking

In my previous post on affordances, I included inter- and intra-metadata links. I feel like there's a lot of confusion in this area (some of which I may have contributed to myself), so I'm going to do a bit of a disorganized brain dump here as an attempt to start a conversation in this area, and see if maybe we (or I) can't arrive at some clarity.

In the FRBR vision that RDA has embraced, there is something called the "relational/object-oriented model." I have some basic problems with this because I perceive relational and object-oriented designs to be quite distinct. This concept of relational/object-oriented gives me one of those "blank brain" moments -- when something sounds like it should make sense but I just can't make sense out of it. So I'm going to treat it as a set of relationships within a bibliographic record.

In the FRBR/RDA model there are entities: Work, Expression, Manifestation, Item (WEMI), and Person, Corporate body, Concept, Object, Event, Place. The interesting thing about these is that none of them is intended to stand alone. This is a very inter-dependent group of entities, not a set of separate records. This is hard for us to imagine because today's model is indeed of separate records for bibliographic data and authority data (covering names and subjects). However, our view is colored by the fact that the bibliographic record carries headings from the authority records, and therefore is complete in itself. Authority records, if you think about them, even those for names, are of the nature of a controlled vocabulary. The view of these vocabularies as contributing to the bibliographic description means that we have to have a way to express both the entities themselves and the links between them.

In addition, we have to decide what one defines as a record. If, to describe a work, one must also describe the creator, then it does seem that the Work entity and Person (or Corporate) entity must be part of the same record. Otherwise, the record cannot stand alone. So what does it mean to include the Person entity, and where does that entity reside? Or is an unresolved link to a (presumed) entity sufficient to complete the bibliographic record? In other words, if the bibliographic record has, as part of the work, a link to a Person entity that resides elsewhere, is that bibliographic record complete?

Note: I read back through FRBR and FRANAR regarding the Person entity. FRBR includes only the "name heading" in its Person entity, while the FRANAR Person entity has many more elements. This parallels today's difference between the personal name field and the name authority record.
There are other kinds of relationships that are between bibliographic entities. To my mind there are two types of relationships here: dependent and independent. The dependent relationships are between the WEMI entities, none of which is considered complete in itself. In fact, I consider the WEMI to be a single entity with dependent parts. (Admittedly, this is how current library cataloging views it, with a single flat record that contains information on all of these bibliographic levels which exist simultaneously in a single object.) To me, these are indivisible -- you can't have any one of them without the others.
[Note that I consider the WEMI to be a single entity in terms of library cataloging records. The levels of this entity do have meaning on their own. For example, a literary critic will often refer to the Work, perhaps to the Expression. A publisher or bookstore advertises the Manifestation. A library identifies and circulates the Item, and a rare book seller deals almost exclusively in Items.]

The independent relationships are those between different bibliographic entities -
  • Work-Work, two works that reflect or reference each other (cited, cites; works based on other works, like parodies or sequels)
  • Whole-Part, works in which one can be contained in the other (article and journal, chapter and book, volume and series)
  • Item-Item, reproductions of all types
To a large degree, these relationships can all be expressed as properties: isCreatorOf, isExpressionOf, isCitedBy. But I can't shake the feeling that there are at least two distinct kinds of relationships: those that fill in what otherwise would be gaps in a metadata record, and those that inform relationships between bibliographic items. I also wonder about links with and between complex entities. For example, imagine a bibliographic record that links to a member of a subject vocabulary that is stored in SKOS format. The SKOS record has numerous fields covering preferred and alternate headings, definitions, links to broader and narrower terms, and all of this in various languages. What if the property in the bibliographic record has the meaning "definition of term in French"? What does one link to? Or is the only possible link to the vocabulary member as a whole?

So these are a few of the questions I have. Hopefully some of them can be cleared up quickly. I'm interested in hearing how others think about these issues. For those attending DC2008, if this interests you I'm game for some discussion.

Monday, September 08, 2008

Metadata Affordances

In my last post, I promised to spend some time thinking about metadata affordances -- that is, a view of metadata based on what you can do with it. My hope is that this will inform a metadata model that serves our needs (whoever "we" are, but admittedly this will tend toward the metadata needs of the library community). Here are the categories that I have come up with, all open to comment, discussion, correction, etc., so please comment freely.

None (opaque text)

Some metadata will necessarily be of this category, with no particular affordances inherent in the contents. At times plain text is used because that is the nature of the particular metadata element, like the recording of the first paragraph of a text, or transcribing a title from the piece. At other times plain text is used because the metadata community has chosen not to exercise control over the particular metadata element. An example of this is user-input tags. Although human intelligence may be applied to plain text fields, it requires knowledge that is not inherent in the metadata structure itself.

Structure and rules (typed strings)

Typed strings are things like formatted dates (YYYYMMDD) and currency formats ($9,999.99). There are other possible formatted strings, such as the common identifiers like ISBN and ISSN. The affordance of these strings is that you can exercise control over their input, enforcing the consistency of the values. With consistent values you can perform accurate operations, like adding up a set of figures, sorting or searching by date, etc. Some controlled list values may also have structure: the standard format for personal names used by libraries includes structural rules ("family name followed by comma, then forenames") that facilitate the use of alphabetically ordered lists of names.
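
As an illustration of this affordance, here is a small Python sketch that checks two typed strings on input: a YYYYMMDD date and a 13-digit ISBN (using the standard alternating 1 and 3 weights for the check digit). The function names are my own, and the sample ISBN is just a commonly cited example of a valid number.

```python
from datetime import datetime


def parse_yyyymmdd(text):
    """Return a date for a valid YYYYMMDD string; raise ValueError otherwise."""
    if len(text) != 8 or not (text.isascii() and text.isdigit()):
        raise ValueError(f"not an 8-digit date: {text!r}")
    # strptime also rejects impossible dates such as month 13.
    return datetime.strptime(text, "%Y%m%d").date()


def is_valid_isbn13(text):
    """Check the length and check digit of an ISBN-13, ignoring hyphens."""
    digits = text.replace("-", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weights alternate 1, 3, 1, 3, ...; the weighted sum must be a multiple of 10.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


# Consistent values sort correctly once they are parsed as dates.
dates = sorted(parse_yyyymmdd(d) for d in ["20080930", "20080908", "20081018"])
```

Once the strings pass these checks, operations like sorting by date or matching on ISBN can be trusted in a way that free text never allows.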

List membership/vocabulary control

One way to assure consistency in metadata is to require that the metadata value be selected from a fixed list of values, rather than being open to free text. This tends to take the form of a list of like terms: languages of text, country names, colors, physical formats.

Although it provides consistency, list membership alone does not provide much in terms of capabilities for data processing. Other information is needed to provide affordances for list members:
  • access to display and indexing forms of the term
  • access to alternate forms, including other languages
  • access to definitions of terms

The information that is needed, therefore, for any list and its members is:
  • list identifier
  • member identifier
  • location of services relating to this list/member, and what services are available

If there are no automated services, then a system will need to provide its own, which is what we generally do today by creating a copy of the list within the system and serving display forms and other features from that internal list. In a web-enabled environment, however, one could imagine lists with web services interfaces that can be queried as needed.
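
Here is a minimal sketch, in Python, of the "copy of the list within the system" approach, carrying the three pieces of information named above. The class names, the list identifier and the sample language entry are all invented for the illustration. The `service_url` field simply marks where a web service location would go if one existed.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ListMember:
    member_id: str                      # identifier of the member within the list
    display_form: str                   # default display form of the term
    alternate_forms: Dict[str, str] = field(default_factory=dict)  # language -> label
    definition: str = ""


@dataclass
class ControlledList:
    list_id: str                        # identifier of the list itself
    service_url: Optional[str] = None   # location of services, if any are offered
    members: Dict[str, ListMember] = field(default_factory=dict)

    def add(self, member):
        self.members[member.member_id] = member

    def label(self, member_id, language=None):
        """Return the label for a member, preferring the requested language."""
        member = self.members[member_id]
        if language is not None:
            return member.alternate_forms.get(language, member.display_form)
        return member.display_form


# A local copy of a hypothetical list of languages of text.
languages = ControlledList(list_id="example:languages")
languages.add(ListMember("ita", "Italian", {"fr": "italien", "it": "italiano"}))
```

The metadata record only needs to store the list identifier and the member identifier ("ita"); display forms in any language are served from the list, whether that list is local, as here, or behind a web service.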


Inter- and intra-metadata links

There is a need to create functional links within metadata segments to other metadata segments or records. For example, the use of name and subject authority records implies a link between those records and the bibliographic metadata records that contain the names and subjects as values. There are also links needed between bibliographic records themselves. These latter represent a number of different relationships, which have been articulated in the FRBR documentation. Some examples are: work-work relationships, work-expression relationships, and part-whole relationships (chapters within books, articles within journals).

There may be other kinds of links that are needed as well, but I think that the main need is to distinguish between identifiers and links. Some identifiers, like ISBNs, can be used to retrieve metadata in a variety of situations, but those should be seen as searches, not links. Searching is appropriate in some circumstances, but the ability to create stable links is a separate affordance and should be treated as such.
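
The difference can be shown with a toy example in Python. The record identifiers and records below are invented: a search on an ISBN may match no records, one, or several, while a link names exactly one record.

```python
# Hypothetical records keyed by a stable record identifier.
records = {
    "rec:1": {"isbn": "9780306406157", "note": "first record carrying this ISBN"},
    "rec:2": {"isbn": "9780306406157", "note": "second record carrying the same ISBN"},
}


def search_by_isbn(records, isbn):
    """A search: returns every record identifier whose ISBN matches (possibly none)."""
    return sorted(rid for rid, rec in records.items() if rec["isbn"] == isbn)


def follow_link(records, record_id):
    """A link: resolves to exactly one record, or fails loudly with KeyError."""
    return records[record_id]


hits = search_by_isbn(records, "9780306406157")   # two matches
target = follow_link(records, "rec:2")             # exactly one record
```

A system that stores "rec:2" knows precisely what it points to; a system that stores only the ISBN has to decide what to do with each result set every time it searches.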

Note: These categories of affordances are not mutually exclusive. Some metadata values will provide more than one type of affordance. Each should be clearly and separately articulated, however, and we should think about the advantages and disadvantages of having metadata values serve multiple functions.