Showing posts with label identifier.

Friday, 26 March 2010

Usage Statistics parsing and querying with redis and python

This is an update of my previous dabblings with chomping through log files. To summarise where I am now:

I have a distributable workflow, loosely coordinated using Redis and Supervisord. Redis is used in two fashions: firstly, its lists serve as queues, buffering the communication between the workers; secondly, it acts as a store, counting and associating the usage with the items and with the metadata entities (people, subjects, etc.) of those items.

I have written a very small Python logger that pushes loglines directly onto a Redis list, giving me live updating abilities as well as manual log file parsing. This is currently switched on for testing in the live repository.
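As a rough sketch of the idea (the real implementation is redislogger.py, discussed below), a logging handler only needs to LPUSH each formatted logline onto the 'q:loglines' list. This assumes the redis-py client:

# Minimal sketch of a Redis logging handler, assuming the redis-py client.
# The real implementation is redislogger.py in the repo; the core idea is
# just to LPUSH every formatted logline onto the 'q:loglines' list.
import logging
import redis

class RedisListHandler(logging.Handler):
    def __init__(self, key="q:loglines", host="localhost", port=6379):
        logging.Handler.__init__(self)
        self.key = key
        self.client = redis.Redis(host=host, port=port)

    def emit(self, record):
        try:
            self.client.lpush(self.key, self.format(record))
        except Exception:
            self.handleError(record)

logger = logging.getLogger("access")
logger.setLevel(logging.INFO)
logger.addHandler(RedisListHandler())
logger.info('1.2.3.4 - - [01/Jan/2010] "GET /objects/pid:1 HTTP/1.1" 200 -')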

The current code base is here: http://github.com/benosteen/UsageLogAnalysis - it has a good number of things hardcoded to the peculiarities of my log files and repository. However, as part of the PIRUS 2 project, I am turning this into an easily reusable codebase, adding the ability to push out OpenURLs to PIRUS statistics gatherers.

Overview:

Loglines -- lpush'd to 'q:loglines'

Workers ('debot.py') pull lines from this queue and parse them, separating them into four categories:

  1. Any hit by a recognised Bot or spider

  2. Any view or download made by a real person on an item in the repository

  3. Any 404, etc

  4. And anything else


The lines are then moved onto four (five) queues respectively: q:bothits, q:objectviews (and q:count simultaneously), q:fof, and q:other. I am using prefixes as a convention when working with Redis keys - "q:" will almost always be a queue of some sort. These four queues are consumed by loggers, which commit the logs to disc, segregated into their categories.
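A minimal sketch of that worker loop follows. The real debot.py carries hardcoded bot signatures and URL patterns; is_bot, is_objectview and is_404 below are hypothetical stand-ins for that parsing logic:

import redis

r = redis.Redis(decode_responses=True)

# Hypothetical stand-ins for the real parsing logic (debot.py hardcodes
# bot signatures and repository URL patterns).
def is_bot(line):
    return "bot" in line.lower() or "spider" in line.lower()

def is_objectview(line):
    return "/objects/" in line          # illustrative repository URL pattern

def is_404(line):
    return " 404 " in line

def route(line):
    if is_bot(line):
        return ["q:bothits"]
    if is_objectview(line):
        return ["q:objectviews", "q:count"]   # q:count feeds the analysis
    if is_404(line):
        return ["q:fof"]
    return ["q:other"]

while True:
    _, line = r.blpop("q:loglines")     # blocks until a logline arrives
    for queue in route(line):
        r.lpush(queue, line)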

The q:count queue is consumed by a further worker, count.py. This does a number of jobs and is the part that actually does the analysis.

For each logged event on a repository item, it finds the ID of the item and whether the event was a download of the item's files. With my repository, both facts are deducible from the URL itself.

Given the ID, it checks redis to see if this item has had its metadata analysed before. If it hasn't, it grabs the metadata for the item from the repository's index (hosted by an instance of Apache Solr) and starts to add connections between the metadata entities and the ID to the redis index:

eg say item "pid:1" has the simple metadata of author_name='Ben' and subjects='foo, bar'

Create unique IDs from the text by hashing it and prefixing the hash with the type of the field it came from:

Prefixes:

  • name => "n:"

  • institution => "i:"

  • faculty => "f:"

  • subjects => "s:"

  • keyphrases => "k:"

  • content type => "type:"

  • collection => "col:"

  • thesis type => "tt:"


eg

>>> from hashlib import md5

>>> md5(b"Ben").hexdigest()

'092f2ba9f39fbc2876e64d12cd662f72'

So, the hashkey of the 'name' 'Ben' is 'n:092f2ba9f39fbc2876e64d12cd662f72'

Now to make the connections in Redis:

  • Add ID to the set 'objectitems' - to keep track of all the IDs (SADD objectitems {ID})

  • Set 'n:092f2....' to 'Ben' (so we can keep a reverse mapping)

  • Add 'n:092f2...' to 'names' set (to make it clearer. KEYS n:* should return an equivalent set)

  • Add 'n:092f2...' to 'e:{id}' eg "e:pid:1" - (e -> prefix for collections of entities. e:{id} is a set of all entities that occur in id)

  • Add 'e:pid:1' to 'e:n:092f2....' (gathers the set of item ids in which this entity 'Ben' occurs)


Repeat for any entity you wish to track.
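As a sketch, the steps above in Python with redis-py, for the example item 'pid:1':

# Sketch of the indexing steps above for item 'pid:1', using redis-py.
from hashlib import md5
import redis

r = redis.Redis(decode_responses=True)

def entity_key(prefix, text):
    return prefix + md5(text.encode("utf-8")).hexdigest()

def index_entity(item_id, prefix, text, setname):
    key = entity_key(prefix, text)       # eg 'n:092f2...'
    r.sadd("objectitems", item_id)       # keep track of all the IDs
    r.set(key, text)                     # reverse mapping back to 'Ben'
    r.sadd(setname, key)                 # eg the 'names' set
    r.sadd("e:" + item_id, key)          # entities occurring in this item
    r.sadd("e:" + key, "e:" + item_id)   # items in which this entity occurs

index_entity("pid:1", "n:", "Ben", "names")
for subject in ("foo", "bar"):
    index_entity("pid:1", "s:", subject, "subjects")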

To make this more truth-manageable, you should include the ID of the record with the text when you generate the hashkey. That way, 'Ben' appearing in one record will have a different key from 'Ben' occurring in another. The assertion that these two entities are the same can then easily take place in a separate set (I'm using 'b:' as the prefix for these bundles of asserted equivalence).
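Sketched, with an illustrative bundle key:

# Record-scoped hashkeys: the same text in two records yields two keys;
# equivalence is then asserted separately in a 'b:' bundle set.
from hashlib import md5
import redis

r = redis.Redis(decode_responses=True)

def scoped_entity_key(prefix, item_id, text):
    # include the record id, so the same text hashes differently per record
    return prefix + md5((item_id + ":" + text).encode("utf-8")).hexdigest()

k1 = scoped_entity_key("n:", "pid:1", "Ben")
k2 = scoped_entity_key("n:", "pid:2", "Ben")   # different key, same text

bundle = "b:" + md5(b"Ben").hexdigest()        # bundle naming is illustrative
r.sadd(bundle, k1)                             # assert: these two are the same
r.sadd(bundle, k2)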

Once you have made these assertions, you can set about counting :)

Conventions for tracking hits:

d[v|d|o]:{id} - set of the dates on which {id} was viewed (v), downloaded from (d) or any other page action (o)

eg dv:pid:1 -> set of dates on which pid:1 had page views.


YYYY-MM-DD:{id}:[v|d|o] - set of IP clients that accessed a particular item on a given day - v,d,o as above

eg 2010-02-03:pid:1:d - set of IP clients that downloaded a file from pid:1 on 2010-02-03


t:views:{hashkey}, t:dls:{hashkey}, t:other:{hashkey}

Grand totals of views, downloads or other accesses on a given entity or id. Good for quick lookups.


Let's walk through an example: consider a client at IP 1.2.3.4 visiting the record page for 'pid:1' on 2010-01-01 (a sketch implementing these steps follows the list):

ID = pid:1

Add the User Agent string ("mozilla... etc") to the 'ua:{IP}' set, to keep track of the fingerprints of the visitors.

Try to add the IP address to the set - in this case "2010-01-01:pid:1:v"

If the IP isn't already in this set (the client hasn't accessed this page already today) then:

  • make sure that "2010-01-01" is a part of the 'dv:pid:1' set

  • go through all the entities that are part of pid:1 (n:092... etc) and increment their totals by one.

    • INCR t:views:n:092...

    • INCR t:views:pid:1
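Sketched in Python over the keys built earlier (SADD returns 1 only when the member is new, which gives us the once-per-IP-per-day check):

import redis

r = redis.Redis(decode_responses=True)

def record_view(item_id, ip, user_agent, date):
    r.sadd("ua:" + ip, user_agent)                 # fingerprint the visitor
    # SADD returns 1 only if this IP hasn't hit the page today
    if r.sadd(date + ":" + item_id + ":v", ip):
        r.sadd("dv:" + item_id, date)              # note the active day
        r.incr("t:views:" + item_id)               # grand total for the item
        for entity in r.smembers("e:" + item_id):  # ...and for its entities
            r.incr("t:views:" + entity)

record_view("pid:1", "1.2.3.4", "Mozilla/5.0 ...", "2010-01-01")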




Now, what about querying?

Say we wish to look up the activity on a given entity - 'Ben', for example.

First, find the equivalent hashkey(s) - either directly, using the simple md5 hash, or by checking which bundles cover this entity.

You can get the grand totals by simply querying "t:views:key", "t:dls..." for each key and summing them together.

You can get more refined answers by getting the set of IDs that this entity is associated with, gathering the daily IP sets for each of those IDs, and summing the counts. This gives me a nice way to generate data for a daily activity sparkline, like the one below:
[daily activity sparkline image]
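A sketch of both lookups, over the keys built above:

import redis

r = redis.Redis(decode_responses=True)

def grand_total(entity_key):
    keys = ("t:views:" + entity_key, "t:dls:" + entity_key,
            "t:other:" + entity_key)
    return sum(int(r.get(k) or 0) for k in keys)

def daily_view_counts(entity_key):
    """Distinct client IPs per day, across every item this entity occurs in."""
    counts = {}
    for item_key in r.smembers("e:" + entity_key):  # members like 'e:pid:1'
        item_id = item_key[2:]                      # strip the 'e:' prefix
        for date in r.smembers("dv:" + item_id):
            key = date + ":" + item_id + ":v"
            counts[date] = counts.get(date, 0) + r.scard(key)
    return sorted(counts.items())                   # sparkline-ready series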
I have added another set of keys to the store, of the form 'geocode:{IP}', recording the country code for each IP address. This gives me a nice way to plot out graphs like the following, using the Google Chart API:

[country-by-IP graph image, drawn with the Google Chart API]

Python logging to Redis

This functionality is mainly in one file in the github repo: redislogger.py

As you can see, most of that file is taken up with a demonstration of how to invoke it! The file that holds the logging configuration which this demo uses is in logging.conf.example.

NB: the usage analysis code and UI are very much a work in progress, but I wanted to post a quick overview of roughly how it is set up and working.


Monday, 18 August 2008

The four rules of the web and compound documents

A quirk that really interests me is the difference in aims between the way documents are typically published and the way the information within them is reused.

A published document is normally in a single 'format' - a paginated layout - which may comprise text, numerical charts, diagrams, tables of data and so on.

My assumption is that, to support a given view or argument, a reference to the entirety of an article is not necessary; the full paper gives the context to the information, but it is much more likely that a small part of the paper contains the novel insight being referenced.

In the paper-based method, it is difficult to uniquely identify parts of an article as items in their own right. You could reference a page number, give line numbers, or quote a table number, but this doesn't get around the fact that the author never set out to make a chart, a table or a section of text reusable on its own.

So, on the web, where multiple representations of the same information are becoming commonplace (mashups, RSS, microblogs, etc.), what can we do to fulfil both aims better: to show a paginated final version of a document, and also to allow each of its components to exist as an item in its own right, with its own URI (or better, a URL containing some notion of the context, e.g. if /store/article-id gets to the splash page of the article, /store/article-id/paragraph-id resolves to the text of that paragraph)?

Note that the four rules of the web (well, of Linked Data) are in essence:
  • give everything a name,
  • make that name a URL ...
  • ...which results in data about that thing,
  • and have it link to other related things.
[From TimBL's originating article. Also, see this presentation - a remix of presentations from TimBL and the speaker, Kingsley Idehen - given at the recent Linked Data Planet conference]

I strongly believe that applying this to the individual components of a document is a very good and useful thing.

One thing first: we have to get over the legal issue of only storing and presenting a bitwise-perfect copy of what an author gives us. We need to let authors know that we may present alternate versions, based on a user's demands. This needs to be the case for preservation anyway, and the repository needs to make format migrations, accessibility requirements and so on part of its submission policy.

The system holding the articles needs to be able to clearly indicate versions and show multiple versions for a single record.

When a compound document is submitted to the archive, a second, parallel version should be made by fragmenting the document into paragraphs of text, individual diagrams, tables of data, and other natural elements. One issue that has already come up in testing is that documents tend to clump multiple, separate diagrams together into a single physical image. The only solution to breaking these up is likely to be a human one: either author/publisher education (unlikely) or splitting them by hand.

I would suggest using a very lightweight, hierarchical structure to record the document's logical structure. I have yet to settle between basing it on the content XML format inside the OpenDocument format and something very lightweight using HTML elements, which would have the added benefit that it could be sent directly to a browser to roughly 'recreate' the document.

Summary:

1) Break apart any compound document into its constituent elements (paragraph level is suggested for text).
2) Make sure that each of these parts is clearly expressed in its context, using hierarchical URLs: /article/paragraph or, even better, /article/page/chart (see the sketch below).
3) On the article's splash page, make a clear distinction between the real article and the broken-up version. I would suggest a scheme like Google search's 'View [PDF, PPT, etc] as HTML'; I would assert that many people intuitively understand that this view is not like the original and will look or act differently.
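As a purely hypothetical sketch of point 2, resolving hierarchical component URLs against a fragmented document (the names and structure here are illustrative, not from any repository code):

# Hypothetical resolver for hierarchical component URLs such as
# /store/article-id and /store/article-id/paragraph-id.
documents = {
    "article-42": {
        "splash": "<html>...splash page for the whole article...</html>",
        "para-3": "<p>...the paragraph holding the novel insight...</p>",
        "chart-1": "<img src='...'/>",
    }
}

def resolve(path):
    parts = path.strip("/").split("/")[1:]   # drop the 'store' prefix
    doc = documents.get(parts[0], {})
    if len(parts) == 1:
        return doc.get("splash")   # /store/article-42 -> splash page
    return doc.get(parts[1])       # /store/article-42/para-3 -> component

print(resolve("/store/article-42/para-3"))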

Some related video blogs from the Crigshow trip
Finding and reusing algorithms from published articles
OCR'ing documents; Real documents are always complex
Providing a systematic overview of how a research paper is written - giving each component, and each version of a component, its own identifier would have major benefits here

Wednesday, 9 January 2008

Conclusions on UUIDs and local ids in Fedora

I mentioned earlier the possibility of using UUIDs as Fedora identifiers. I'll write my conclusion first and the reasoning later, for all you lazy people out there :)

Conclusions

Fedora repositories that wish to use UUIDs as identifiers should have the namespace 'uuid' added to the list of <retainPids> in fedora.fcfg.

The hyphenated hex string representing the UUID is entered as the object id in the uuid namespace. For example:

Fedora pid - uuid:34b706b4-f080-4655-8695-641a0a8acb25
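Minting such a pid is a one-liner with Python's standard uuid module (uuid4 being the random variant):

# Minting a Fedora pid in the uuid namespace with a random (version 4) UUID.
import uuid

pid = "uuid:" + str(uuid.uuid4())
print(pid)   # eg uuid:34b706b4-f080-4655-8695-641a0a8acb25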

Benefits

  1. Using the uuid scheme for identifiers (and retaining the 'uuid' namespace), administrators will be able to painlessly transfer objects from one instance of Fedora to another, or even to run a distributed set of Fedora instances as a single 'repository': no fiddling with pid changes, no rewriting RELS-EXT or other datastreams, and no changing metadata datastream identifiers.
  2. Fedora pids will fit easily into the RFC 4122 mechanism for the uuid urn namespace -> prefixing a pid with 'urn:' yields a valid URI.
  3. The command 'getNextPID()' could be used to provide a local 'id', which can be added as a FOXML field such as the label, or even as an Alternate ID, on ingest/migration.
  4. Given a URI resolver that is updated when an object migrates from one fedora repository to another, a distributed set of Fedora instances could have cross-repository relationships in RDF that stay valid regardless of where the objects reside.
  5. Use of a distributed set of Fedora instances with a federated search tool such as Apache Solr is quite an attractive prospect for large-scale implementations.
Reasoning

I've thought about the logistics of actually using them, and also about the fact that some people are happier with object ids that they can type (although for the life of me, I can't work out why; when was the last time that you, as a normal user, typed in a full URL rather than going to a discovery tool like Google or a site search to get to a specific item? 99% of the time, I rely on my browser's address bar defaulting to Google for anything that doesn't look like a URL). But I digress...

I prefer to deal with situations as they are, as opposed to what might be possible later, so let's recap what pids Fedora allows or needs:

Fedora pid = namespace : id

or more formally (from http://www.fedora.info/definitions/identifiers/):
object-pid = namespace-id ":" object-id
namespace-id = ( [A-Z] / [a-z] / [0-9] / "-" / "." ) 1+
object-id = ( [A-Z] / [a-z] / [0-9] / "-" / "." / "~" / "_" / escaped-octet ) 1+
escaped-octet = "%" hex-digit hex-digit
hex-digit = [0-9] / [A-F]
i.e. anything that fits the following regular expression:
^([A-Za-z0-9]|-|\.)+:(([A-Za-z0-9])|-|\.|~|_|(%[0-9A-F]{2}))+$

As I said before, I am interested in using UUIDs (or something like them) because they need no central scheme to be unique, and to stay unique; UUIDs are designed so that the chance of creating two identical ids is vanishingly small. So, what does one look like?

(from wikipedia:)

In its canonical form, a UUID consists of 32 hexadecimal digits, displayed in 5 groups separated by hyphens, in the form 8-4-4-4-12 for a total of 36 characters. For example:
550e8400-e29b-41d4-a716-446655440000
Regular expressions:

[A-Fa-f0-9]{8}-[A-Fa-f0-9]{4}-[A-Fa-f0-9]{4}-[A-Fa-f0-9]{4}-[A-Fa-f0-9]{12}
matches: 550e8400-e29b-41d4-a716-446655440000

^(0x)?[A-Fa-f0-9]{32}$
matches: 0x550e8400e29b41d4a716446655440000

So a UUID can't be used as it is in place of a Fedora pid. However, according to RFC 4122, there is a uuid urn namespace which makes me more hopeful. The above uuid can be represented as urn:uuid:550e8400-e29b-41d4-a716-446655440000 for example.

So, how about if we make the reasonable assumption that a pid is a "valid" urn namespace, just one that may or may not be registered yet? For example, I am currently using the Fedora namespace ora for items in the Oxford repository. Would it be too far-fetched to say that ora:1234 is understandable as urn:ora:1234?

So, all we need to do is make sure that the namespace 'uuid' is one of the ones in the <retainPids> element of fedora.fcfg and we are set to go. Looks like the pid format 'restriction', as I thought of it, was quite handy after all :) So to state it clearly:

Fedora pids that follow the UUID scheme should be in the form of:
object-pid = "uuid:" object-id
object-id = 8-digit-hex '-' 4-digit-hex '-' 4-digit-hex '-' 4-digit-hex '-' 12-digit-hex
8-digit-hex = ( hex-digit ) 8
4-digit-hex = ( hex-digit ) 4
12-digit-hex = ( hex-digit ) 12
hex-digit = [0-9] / [A-F] / [a-f]
e.g. "uuid:34b706b4-f080-4655-8695-641a0a8acb25"

(NB forgive any syntactic slips above, I hope it's clear as it is.)
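The grammar translates directly into a regular expression; a quick sanity check in Python:

# Validate a pid against the uuid scheme above.
import re

UUID_PID = re.compile(r"^uuid:[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}"
                      r"-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}$")

print(bool(UUID_PID.match("uuid:34b706b4-f080-4655-8695-641a0a8acb25")))  # True
print(bool(UUID_PID.match("ora:1234")))                                   # False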

I mentioned before that some people want human-typeable Fedora pids.... urgh. I'm not really sure what purpose it serves. In fact, let me have a little rant...

<rant>

A 'Cool URL' is one that doesn't change. Short pids make for pretty URLs and guarantee little more than that.

</rant>

Right, that's out of my system. Now to accommodate the request...

Firstly, I'd just like to point out that I will ignore any external search and discovery services - essentially any type of resolver, from search engines to the Handle system. This is because I feel that the format of the Fedora pid is quite irrelevant to these services. (I am aware that certain systems made use of a nasty hack where the object id part of the Fedora pid was used as the Handle id, after the institution namespace, and believe me, this hack does have me quite worried. I can understand the reasoning behind it, as the Handle system doesn't seem to have a simple way to query for the next available id, but I think this is a potential problem on the horizon.)
My suggestion is that the pid itself is the uuid as defined above, but that the repository system has a notion of a local 'id'; the Fedora call 'getNextPID()' could be used to create these 'tinypids' with whatever namespace is deemed pleasant.
They can be stored in the FOXML in fields such as the label, or as an Alternate ID (fields which I personally have no other use for). Fedora will index these for its basic search service, and they could be used as a mechanism to look up the real pid given the local id.

For example, with the RDF triplestore turned on, the following iTQL query should be enough:
"select $object from <#ri>
where $object <info:fedora/fedora-system:def/model#label> 'ora:1234'"
The tuple that is returned will be something like "uuid:34b706b4-f080-4655-8695-641a0a8acb25"
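As a sketch, the same query run over HTTP, assuming the Resource Index search interface is enabled at /fedora/risearch (the host and label value here are illustrative):

# Resolve a local 'tinypid' (stored in the object label) to the real uuid pid
# via the Resource Index search endpoint. Host and label are illustrative.
import urllib.parse
import urllib.request

query = ("select $object from <#ri> where $object "
         "<info:fedora/fedora-system:def/model#label> 'ora:1234'")
params = urllib.parse.urlencode({"type": "tuples", "lang": "itql",
                                 "format": "CSV", "query": query}).encode()
with urllib.request.urlopen("http://localhost:8080/fedora/risearch",
                            params) as resp:
    print(resp.read().decode())   # header row, then eg info:fedora/uuid:34b7...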

But it still isn't great; I don't think the benefits outweigh the work involved in implementing it. It's a workable solution, though, for those that need it.

Monday, 10 December 2007

Object 'PID's and UUID, why not?

Handles, DOIs... schemes to provide unique, persistent identifiers. But what's the one flaw that unites all of these schemes?

They only work for as long as the people involved want them to.

If the money dries up behind the Handle resolver, what then? What happens to attempts to assign the same handle to different items? What about duplication?

So, step one is to acknowledge that there is no perfect way to uniquely identify something. Step two is making do with something that is less than perfect.

Which is where I started thinking about UUIDs. From the Wikipedia page:

A UUID is essentially a 16-byte (128-bit) number. The number of theoretically possible UUIDs is therefore 2^(16×8) = 2^128 = 256^16, or about 3.4 × 10^38. This means that 1 trillion UUIDs would have to be created every nanosecond for 10 billion years to exhaust the number of UUIDs.
So, it's fair to say that there are plenty of these ids to go around.

But if we randomly assign these ids to anything, what is the likelihood of an id being assigned twice? I am lazy and loath to do the calculations myself, but luckily I don't have to. The bottom line is that by the time 70,368,744,177,664 (2^46) ids have been randomly assigned, the chance of any two of them being the same is still only about one in 2.5 billion.
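The birthday-bound arithmetic behind that figure, sketched (a random version-4 UUID has 122 random bits, so n ids collide with probability of roughly n²/2^123):

import math

def collision_probability(n, random_bits=122):
    # birthday bound: P(collision) ~= 1 - exp(-n(n-1) / 2^(bits+1))
    return -math.expm1(-n * (n - 1) / 2 / 2**random_bits)

p = collision_probability(2**46)
print(p)        # ~4.7e-10
print(1 / p)    # ~2.1e9 -- roughly the billions-to-one odds quoted above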

I like those odds.

In my Fedora-centric view, this means that objects and related URLs go from:

ora:909 -> http://ora.ouls.ox.ac.uk:8080/fedora/get/ora:909

to:

ora:0ddfa057-d673-4ed3-9186-e141c50bf58f -> http://ora.ouls.ox.ac.uk:8080/fedora/get/ora:0ddfa057-d673-4ed3-9186-e141c50bf58f

So, we now have something that is citable and unique all by itself. It needs no scheme or organising body or agency to remain unique.

It's not very human-readable though, is it? But in all seriousness, when was the last time you typed in an address by hand to go directly to a resource? Should I be embarrassed to admit that I find myself typing things as trivial as 'google maps' into my address bar on occasion, because I know it will be sent to Google as a search and give me results I can click on?

For something that needs to be permanent, citable and resolvable, I think UUIDs work as object ids. And as for more human-focused URLs - URLs that can be read in a mobile browser or in an email, perhaps - what's wrong with the semi-permanent URLs from services such as tinyurl.com?