Maybe I am too much of a data mining geek, but I found the video below to be funny. It also talks about a super cool feature ODM introduced in 11.2: the ability to score data mining models at the disk controller level in Exadata. This is a significant performance booster. It also makes it feasible to produce actionable insights from massive amounts of data extremely fast. More on this in a future post. For now enjoy the video.
Labels: Funny, Viral Marketing
2/18/2010 01:08:00 PM | Permalink |
Imagine standing under an avalanche of data - 2,500 variables, 10 times per second - and a sailing team demanding answers to design and sailing variations immediately. This was the challenge facing the BMW ORACLE Racing Performance Analysis Team every sailing day as they refined and improved their giant 90-foot-wide, 115-foot-long trimaran sporting the largest hard-sail wing ever made. Using ORACLE DATA MINING accessing an ORACLE DATABASE, and presenting results in real time using ORACLE APPLICATION EXPRESS, the performance team managed to provide the information required to optimise the giant multihull to the point that it not only beat the reigning America's Cup champions Alinghi in their giant catamaran but resoundingly crushed them in a power display of high-speed sailing. After two races - and two massive winning margins - the America's Cup was heading back to America - a triumph for the team, ORACLE and American technology.
— Ian Burns, Performance Director, BMW ORACLE Racing Team
Here is a video showcasing the boat and giving some interesting information about it:
Labels: Applications, Sailing
2/18/2010 12:57:00 PM | Permalink |
Rexer Analytics has just issued a last call for its annual data mining survey. This is a pretty nice survey that provides a great deal of valuable information about how data mining is used and who is doing it.
To participate, please click on the link below and enter the access code in the space provided. The survey should take approximately 20 minutes to complete. At the end of the survey you can request to have the results sent to you as well as get a copy of last year's survey.
Survey Link: www.RexerAnalytics.com/Data-Miner-Survey.html
Access Code: RS2008
Labels: Event
3/24/2009 09:41:00 AM | Permalink |
The Oracle Data Mining group has two openings for talented software developers and machine learning experts. Take a look at these links (IRC1133401, IRC1143559) for more details.
Labels: Jobs
11/03/2008 10:46:00 AM | Permalink |
Labels: Event
10/28/2008 05:10:00 AM | Permalink |
For a long time I have thought that we needed data mining books written for developers. Most data mining books are written for business or data analysts. Given that, it was a pleasant surprise to read Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran. The book provides a good discussion on data mining concepts anchored with interesting examples. It also provides guidance on when to use the techniques. I still think that there is room for improvement on how data mining and analytics should be presented to developers. However, the book is a great step forward in enabling developers to use analytics.
Toby's book has many Web 2.0 examples illustrated with data obtained from different web sites (e.g., blogs, Facebook, Zebo). The examples are coded in Python or rely on freely available software for the data gathering and analytics. This got me thinking. If one wanted to illustrate how to create scalable and robust applications exploiting collective intelligence, would it not be interesting to replicate some of those examples utilizing the analytical and XML processing capabilities in the Oracle database? So I decided to do exactly that. This is the first post in a series showing how we can tackle the problems discussed in the book using technology in the Oracle database. Although most of what is described in the book can be done using the Oracle database, this series will only showcase some of its examples. Implementing all of them would be like writing a new book.
Before we can start mining we need to collect data and store it in a database. There are many Web 2.0 websites with interesting data for mining. XML is the de facto data format returned by these websites. This post covers the basic steps on how to get this data by showing how to build an archive of entries from a list of RSS feeds. In later posts I will describe the mining.
First, the good news: the Oracle RDBMS has a great set of features supporting XML and HTTP processing. Now the bad news: before writing this post I knew nothing about those features. So, after some digging around the web, I found lots of good information in a couple of articles. I have to say that I am really impressed by what can be done with the functionality in the database.
Below I describe some preparatory steps needed, how to query multiple RSS feeds at once, and finally how to automate the whole process to build an RSS feed archive.
Preparatory Steps
Lucas Jellema has a nice series of articles describing how to build an RSS feed reader and archive (1, 2). I used them as a starting point. The comments in the first article pointed me in the right direction to move from a PL/SQL implementation, as described in the article, to the SQL version given below.
Here is a list of things that one needs to do before running the code in Lucas' articles or the code below:
CONN sys/manager AS SYSDBA

GRANT EXECUTE ON UTL_HTTP TO dmuser;

BEGIN
  DBMS_NETWORK_ACL_ADMIN.create_acl (
    acl         => 'mining_the_web.xml',
    description => 'ACL functionality for mining the web',
    principal   => 'DMUSER',
    is_grant    => TRUE,
    privilege   => 'connect',
    start_date  => SYSTIMESTAMP,
    end_date    => NULL);
  COMMIT;
END;
/
BEGIN
  DBMS_NETWORK_ACL_ADMIN.assign_acl (
    acl        => 'mining_the_web.xml',
    host       => '*',
    lower_port => 80,
    upper_port => NULL);
  COMMIT;
END;
/

The assign_acl call above is very generous. It allows the DMUSER user to connect to any server (*) through port 80. For more details take a look at this nice article by Tim Hall.
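If you only plan to read a known set of feeds, a tighter assignment than '*' is probably a better idea. Here is a sketch of what that could look like, using one of the feed hosts from the examples below (one assign_acl call per host):

BEGIN
  -- Limit the ACL to a single feed host instead of every server
  DBMS_NETWORK_ACL_ADMIN.assign_acl (
    acl        => 'mining_the_web.xml',
    host       => 'rss.cnn.com',
    lower_port => 80,
    upper_port => 80);
  COMMIT;
END;
/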
EXEC utl_http.set_proxy('proxy.my-company.com', 'corp.my-company.com');
For more details see the documentation for the UTL_HTTP package (link).

DROP TABLE rss_feeds PURGE;

CREATE TABLE rss_feeds
  (title    varchar2(200),
   url      varchar2(500),
   category varchar2(50));

Next we insert a couple of RSS feeds in the table assigning a category to each feed, such as: news, technology, etc.
INSERT INTO rss_feeds VALUES ('iTunes Top 25 Songs', 'http://ax.phobos.apple.com.edgesuite.net/WebObjects/MZStore.woa/wpa/MRSS/topsongs/limit=25/rss.xml','music');
INSERT INTO rss_feeds VALUES ('Apple Hot News', 'http://images.apple.com/main/rss/hotnews/hotnews.rss','Apple');
INSERT INTO rss_feeds VALUES ('Oracle Press Releases', 'http://www.oracle.com/rss/rss_ocom_pr.xml','Oracle');
INSERT INTO rss_feeds VALUES ('OTN Articles', 'http://www.oracle.com/technology/syndication/rss_otn_articles.xml','Oracle');
INSERT INTO rss_feeds VALUES ('CNN', 'http://rss.cnn.com/rss/cnn_topstories.rss','news');
INSERT INTO rss_feeds VALUES ('BBC', 'http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml','news');
INSERT INTO rss_feeds VALUES ('NYT', 'http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml','news');
INSERT INTO rss_feeds VALUES ('WPost Technology', 'http://www.washingtonpost.com/wp-dyn/rss/technology/index.xml','technology');
INSERT INTO rss_feeds VALUES ('Slashdot', 'http://rss.slashdot.org/Slashdot/slashdot','technology');
Now we can use the SQL statement below to query the RSS feeds and map the XML documents retrieved to a relational schema:

SELECT t.source, p.scategory, it.title, it.link, it.pubdate,
it.author,
XMLCAST(
XMLQUERY('let $b := count(//category)
for $i at $pos in //category
return ($i, if ($pos = $b) then () else ", ")'
PASSING it.category RETURNING CONTENT
) AS VARCHAR2(4000)
) category,
it.description
FROM (SELECT HTTPURITYPE(url).getXML() doc, category scategory
FROM rss_feeds) p,
XMLTABLE('/rss/channel' PASSING p.doc
COLUMNS source VARCHAR2(4000) PATH 'title',
item XMLTYPE PATH 'item') t,
XMLTABLE('item' PASSING t.item
COLUMNS title VARCHAR2(4000) PATH 'title',
link VARCHAR2(4000) PATH 'link',
pubDate VARCHAR2(4000) PATH 'pubDate',
author VARCHAR2(4000) PATH '*[namespace-uri()="http://purl.org/dc/elements/1.1/" and local-name()="creator"]',
category XMLTYPE PATH 'category',
description CLOB PATH 'description') it;
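As an aside, the XMLQUERY expression above builds the comma-separated category list with an explicit FLWOR loop. The standard XQuery fn:string-join function should produce the same result more compactly. Here is a sketch of that variation (I have not tested it; it assumes the database's XQuery processor applies the usual atomization rules to the category nodes):

XMLCAST(
  XMLQUERY('string-join(//category, ", ")'
           PASSING it.category RETURNING CONTENT
  ) AS VARCHAR2(4000)
) category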
The full query above has a number of interesting features. It uses the HTTPURITYPE function to extract the XML for each RSS feed in the rss_feeds table. The XMLTABLE statements then process the XML document returned from each feed to extract the desired pieces of information and convert the document into a relational source. The first XMLTABLE statement extracts the title and the item elements from each XML document returned by p. Because each XML document contains multiple items, a second XMLTABLE statement is used to return the pieces of each item element extracted by the first XMLTABLE. This XMLTABLE statement also returns the category as XMLTYPE. This is needed because some sites have multiple category elements for a given item. The category elements are then extracted using the XMLQUERY function so that we have all these elements concatenated as a single string separated by commas. For alternative ways of doing the last bit take a look at this post by Mark Volkmann.

To archive the feed entries we need a table to store them:

DROP TABLE rss_history PURGE;

CREATE TABLE rss_history
  (source      varchar2(4000),
   scategory   varchar2(200),
   title       varchar2(4000),
   link        varchar2(4000),
   pubDate     varchar2(200),
   author      varchar2(4000),
   category    varchar2(4000),
   description CLOB);

Next we define a function to validate an RSS feed's URL and return the XML for the feed.
CREATE OR REPLACE FUNCTION rss_getXML(url IN varchar2)
  RETURN XMLType
IS
  req  UTL_HTTP.REQ;
  resp UTL_HTTP.RESP;
  res  XMLType := null;
BEGIN
  req  := UTL_HTTP.BEGIN_REQUEST(url);
  resp := UTL_HTTP.GET_RESPONSE(req);
  IF (resp.status_code >= 400 AND resp.status_code <= 599) THEN
    res := null;
  ELSE
    res := httpuritype(url).getXML();
  END IF;
  UTL_HTTP.END_RESPONSE(resp);
  RETURN res;
EXCEPTION WHEN OTHERS THEN
  UTL_HTTP.END_RESPONSE(resp);
  RETURN res;
END;
/

We can now use the following MERGE statement to merge only newer entries into the archive:
MERGE
INTO rss_history rh
USING (SELECT t.source, p.scategory, it.title, it.link, it.pubdate,
it.author,
XMLCAST(
XMLQUERY('let $b := count(//category)
for $i at $pos in //category
return ($i, if ($pos = $b) then () else ", ")'
PASSING it.category RETURNING CONTENT
) AS VARCHAR2(4000)
) category,
it.description
FROM (SELECT rss_getXML(url) doc, category scategory
FROM rss_feeds) p,
XMLTABLE('/rss/channel' PASSING p.doc
COLUMNS source VARCHAR2(4000) PATH 'title',
item XMLTYPE PATH 'item') t,
XMLTABLE('item' PASSING t.item
COLUMNS title VARCHAR2(4000) PATH 'title',
link VARCHAR2(4000) PATH 'link',
pubDate VARCHAR2(4000) PATH 'pubDate',
author VARCHAR2(4000) PATH '*[namespace-uri()="http://purl.org/dc/elements/1.1/" and local-name()="creator"]',
category XMLTYPE PATH 'category',
description CLOB PATH 'description') it
) rl
ON (rh.link = rl.link)
WHEN NOT MATCHED
THEN INSERT (title, source, scategory, link, pubDate,
author, category, description)
VALUES (rl.title, rl.source, rl.scategory, rl.link, rl.pubDate,
rl.author, rl.category, rl.description);
It is very impressive that a single SQL statement can retrieve entries from multiple RSS feeds, parse the XML, and merge the newer entries into a table for archival. To automate the process we can wrap the MERGE statement in a procedure that can later be scheduled to run at regular intervals:

CREATE OR REPLACE PROCEDURE update_rss_history
(p_proxy IN VARCHAR2 DEFAULT NULL,
p_no_proxy_domains IN VARCHAR2 DEFAULT NULL)
IS
BEGIN
-- When going through a firewall, pass requests through p_proxy host.
-- Specify sites (p_no_proxy_domains) inside the firewall that don't
-- need the proxy host.
IF (p_proxy is not null) AND
(p_no_proxy_domains is not null)
THEN
UTL_HTTP.SET_PROXY(p_proxy, p_no_proxy_domains);
END IF;
MERGE
INTO rss_history rh
USING ( SELECT t.source, p.scategory, it.title, it.link, it.pubdate,
it.author,
XMLCAST(
XMLQUERY('let $b := count(//category)
for $i at $pos in //category
return ($i, if ($pos = $b) then ()
else ", ")'
PASSING it.category RETURNING CONTENT
) AS VARCHAR2(4000)
) category,
it.description
FROM (SELECT rss_getXML(url) doc,
category scategory
FROM rss_feeds) p,
XMLTABLE('/rss/channel' PASSING p.doc
COLUMNS source VARCHAR2(4000) PATH 'title',
item XMLTYPE PATH 'item') t,
XMLTABLE('item' PASSING t.item
COLUMNS title VARCHAR2(4000) PATH 'title',
link VARCHAR2(4000) PATH 'link',
pubDate VARCHAR2(4000) PATH 'pubDate',
author VARCHAR2(4000) PATH '*[namespace-uri()="http://purl.org/dc/elements/1.1/" and local-name()="creator"]',
category XMLTYPE PATH 'category',
description CLOB PATH 'description') it
) rl
ON (rh.link = rl.link)
WHEN NOT MATCHED
THEN INSERT (title, source, scategory, link, pubDate,
author, category, description)
VALUES (rl.title, rl.source, rl.scategory, rl.link, rl.pubDate,
rl.author, rl.category, rl.description);
END;
/
Besides the MERGE statement, the above procedure also includes some code for handling proxy servers as explained above. To keep the archive up to date we can schedule the procedure to run every 15 minutes using DBMS_SCHEDULER:

BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'UPDATE_RSS_ARCHIVE',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN update_rss_history(); COMMIT; END;',
    enabled         => TRUE,
    start_date      => SYSDATE,
    repeat_interval => 'FREQ=MINUTELY;INTERVAL=15');
END;
/

For more details on DBMS_SCHEDULER see the Oracle documentation (link).
We can check on the job with the following query:

SELECT job_name, enabled,
       TO_CHAR(last_start_date,'YY-MM-DD HH24:MI:SS') last_run,
       TO_CHAR(last_run_duration,'HH:MI:SS')
  FROM user_scheduler_jobs;

To manually stop and/or drop the job we can use the following calls:
-- Stop job
EXECUTE DBMS_SCHEDULER.stop_job (job_name => 'update_rss_archive');

-- Drop job
EXECUTE DBMS_SCHEDULER.drop_job (job_name => 'update_rss_archive');

After running for a few days the archive has accumulated quite a few entries:
SELECT COUNT(*) FROM rss_history;

  COUNT(*)
----------
      1558

In later posts I will show how to mine the archive.
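As a quick teaser before the mining posts, here is the kind of summary one can already get from the archive (a simple sketch using only the rss_history columns defined above):

SELECT source, scategory, COUNT(*) entries
  FROM rss_history
 GROUP BY source, scategory
 ORDER BY entries DESC;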
Labels: Collective Intelligence, XML
9/08/2008 11:10:00 PM | Permalink |
I firmly believe that a major trend in applications is the incorporation of analytics-enabled functionality. Users want more than just reports or a replay of the past. Users want to have insights and their attention directed to key points. This is where analytics can make a big impact across all types of applications. Notice that I am not proposing exposing analytical capabilities (e.g., data mining and statistical modeling) to users. That, I think, is only effective for a small number of users. What I believe is more meaningful is to present functionality and information to users that are the result of analytics taking place behind the scenes.
In line with this trend, Oracle has recently released a new application: Oracle Sales Prospector. This application is targeted at sales representatives. It indicates which accounts they should target with specific products or services. It also indicates which deals are most likely to close within specific time frames, and provides accompanying corporate profiles as well as likely customer references.
This is a very cool application and I have wanted to talk about it for some time now. My group has worked closely with the Social CRM group in developing the data mining driving the product, and now that it has launched I can discuss some of the data mining details.
Oracle Sales Prospector is as analytics-driven as an application can be. Almost every aspect of what you see in its very nice user interface is the product of data mining models. Oracle Data Mining (ODM) provides the algorithms and the real-time scoring driving the application.
Labels: Applications, Oracle Sales Prospector, Social CRM
8/22/2008 01:10:00 PM | Permalink |
Recently someone asked me how to invert a matrix using the UTL_NLA PL/SQL package. This can be done by solving a system of linear equations AX = B like the ones I described in this post but setting B to the identity matrix. I thought that the question was interesting enough to deserve its own post. So I am replicating the answer here for those interested.
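In other words, if X solves AX = I then X = inv(A); each column of X is the solution of a linear system whose right-hand side is the corresponding column of the identity matrix.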
The UTL_NLA package has many different procedures for solving systems of linear equations. Most of them are designed to take advantage of special form matrices (e.g., triangular, banded, etc.). For the example below, I picked the procedure LAPACK_GELS. This procedure solves general systems of linear equations in a robust fashion. It uses QR or LQ decompositions as needed.
DECLARE
A utl_nla_array_dbl := utl_nla_array_dbl(
5, 153, 352,
153, 5899, 9697,
352, 9697, 26086);
B utl_nla_array_dbl := utl_nla_array_dbl(
1, 0, 0,
0, 1, 0,
0, 0, 1);
ipiv utl_nla_array_int := utl_nla_array_int(0,0);
info integer;
n NUMBER := 3;
i NUMBER;
j NUMBER;
BEGIN
UTL_NLA.LAPACK_GELS (
trans => 'N', -- solve for A instead of A'
m => n, -- A number of rows
n => n, -- A number of columns
nrhs => n, -- B number of columns
a => A, -- matrix A
lda => n, -- max(1, m)
b => B, -- matrix B
ldb => n, -- ldb >= max(1, m, n)
info => info, -- operation status (0=success)
pack => 'R' -- how the matrices are stored
-- (R=row-wise)
);
dbms_output.put_line('inv(A)');
dbms_output.put_line('-------------');
dbms_output.put_line('info= '||info);
IF info=0 THEN
FOR l IN 1..B.count LOOP
i := ceil(l/n);
j := l-(i-1)*n;
dbms_output.put_line('x[' || i ||']['||j||']= ' ||
TO_CHAR(B(l),'99.9999'));
END LOOP;
END IF;
END;
/

Here is the output for the above code:

inv(A)
-------------
info= 0
x[1][1]=  27.5307
x[1][2]=   -.2658
x[1][3]=   -.2727
x[2][1]=   -.2658
x[2][2]=    .0030
x[2][3]=    .0025
x[3][1]=   -.2727
x[3][2]=    .0025
x[3][3]=    .0028
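A quick way to sanity-check the result is to multiply the original matrix by the computed inverse and confirm that the product is numerically close to the identity. The sketch below does the multiplication with plain PL/SQL loops over row-major packed arrays; since LAPACK routines generally overwrite their input arrays, it starts from a fresh copy of A and uses the (rounded) inverse printed above:

DECLARE
  n    PLS_INTEGER := 3;
  -- Fresh copy of the original matrix A (row-major)
  A    utl_nla_array_dbl := utl_nla_array_dbl(
          5,   153,   352,
        153,  5899,  9697,
        352,  9697, 26086);
  -- Inverse computed above (row-major, rounded to 4 decimals)
  Ainv utl_nla_array_dbl := utl_nla_array_dbl(
        27.5307, -.2658, -.2727,
         -.2658,  .0030,  .0025,
         -.2727,  .0025,  .0028);
  v    NUMBER;
BEGIN
  -- Compute A * Ainv; the result should be close to the identity matrix
  FOR i IN 1..n LOOP
    FOR j IN 1..n LOOP
      v := 0;
      FOR k IN 1..n LOOP
        v := v + A((i-1)*n + k) * Ainv((k-1)*n + j);
      END LOOP;
      dbms_output.put_line('AxAinv['||i||']['||j||']= '||TO_CHAR(v,'990.9999'));
    END LOOP;
  END LOOP;
END;
/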
Labels: Linear Algebra, UTL_NLA
8/18/2008 08:18:00 AM | Permalink |
It is that time of the year again. In about a week I am going to be attending the KDD (Knowledge Discovery in Databases) 2008 conference (conference website) along with some other Oracle colleagues. KDD is one of the primary conferences on data mining. This year it will take place in Las Vegas, Nevada, from August 24 to 27.
Oracle is a Gold sponsor for the event and will again have a large presence at the conference. Among other things, Oracle is the sponsor of the first poster reception and demonstration session on Monday.
This year I will not try to post daily conference highlights. After managing to do a poorer job last year than in the year before I gave up on the idea. Instead I will settle for a single post after the conference.
Here are some pictures from the Oracle booth at last year's KDD:
If you plan to attend KDD this year, stop by the Oracle booth to say hello. Last year I spent a great deal of time at the Oracle booth. There were lots of interesting questions and problems. Here are some of the topics we talked about:
8/15/2008 11:34:00 PM | Permalink |
For the past couple of months the blog took a back seat. Basically, since KDD, I have had very little time to write. I have been on the road quite a bit and my trip to KDD unleashed a number of research ideas that I have been following up. I will post on the latter over time as the results mature.
I have also dropped the ball answering many of the emails and comments I have received. I have caught up with the blog email now. I usually try to reply very quickly, and I apologize to those who have had their questions unanswered for a very long time. Below are some of the things that have been going on in the past couple of months. I will blog about them in more detail very soon.
Labels: Other
12/06/2007 01:11:00 PM | Permalink |
For the next couple of days I am going to be attending the KDD (Knowledge Discovery in Databases) 2007 conference (conference website) along with some other Oracle colleagues. KDD is one of the primary conferences on data mining. This year it will take place in San Jose, CA, from August 12 to 15.
Oracle is a Gold sponsor for the event and will have a large presence at the conference. Among other things, Oracle is the sponsor of the first poster reception and demonstration session on Monday.
I will try to post daily conference highlights and do a better job than last year when I only covered the first two days. If you plan to attend KDD this year, stop by the Oracle booth.
Labels: KDD
8/12/2007 09:34:00 AM | Permalink |
This week I am going to be in San Francisco. I have been invited to give a talk at the San Francisco Bay ACM Data Mining SIG on Wednesday. The title of the talk is In-Database Analytics: A Disruptive Technology. Here is a link with information on the talk.
On Friday morning, I am presenting at the ST Seminar at Oracle's headquarters. The title of that talk is In-Database Mining: The I in BI. If you are in the area for the external event on Wednesday or at headquarters on Friday, stop by.
I will be posting links to both presentations later on.
Labels: Talk
7/09/2007 09:51:00 PM | Permalink |
This post shows how to implement Principal Components Analysis (PCA) with the UTL_NLA package. It covers some of the uses of PCA for data reduction and visualization with a series of examples. It also provides details on how to build attribute maps and chromaticity diagrams, two powerful visualization techniques.
This is the second post in a series on how to do Linear Algebra in the Oracle Database using the UTL_NLA package. The first illustrated how to solve a system of linear equations. As promised in part 1, I am posting the code for a couple of procedures for reading and writing data from tables to arrays and vice-versa. Links to the data and code used in the examples are at the end of the post.
Principal Components Analysis
In data sets with many attributes, groups of attributes often move together. This is usually due to the fact that more than one attribute may be measuring the same underlying force controlling the behavior of the system. In most systems there are only a few such driving forces. In these cases, it should be possible to eliminate the redundant information by creating a new set of attributes that extract the essential characteristics of the system. In effect we would like to replace a group of attributes with a single new one. PCA is a method that does exactly this. It creates a new set of attributes, the principal components, where each one of them is a linear combination of the original attributes. It is basically a special transformation of the data. It transforms numerical data to a new coordinate system where there is no redundant information recorded in the attributes.
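For readers who like to see the linear algebra spelled out, here is a compact summary of what the SVD-based implementation given at the end of this post computes. With X the centered data matrix (m rows, n attributes) and X = U S V^T its singular value decomposition:

  projections (scores) = X V = U S
  variance captured by PC j = s_j^2 / (m - 1)

Each principal component j is the linear combination of the original attributes whose weights are given by column j of V.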
Figure 1: A two-dimensional data distribution.
It is easier to understand what PCA does by looking at a couple of pictures (see Technical Note 1 below for a more technical explanation). Let's assume that we have a table (data set) with two columns (X1 and X2) and that all the rows (observations) in this table fall within the gray ellipse in Figure 1. The double-edged arrows in the picture indicate the range for the X1 and X2 attributes. A couple of things to notice in the picture:

Figure 2: PC1 and PC2 axes and their ranges.
Figure 2 shows the new axes PC1 and PC2 created by PCA for the example in Figure 1. As indicated by the sizes of the double-edged arrows, there is more variability (greater variance) on the PC1 axis than on the PC2 axis. Also the PC1 variance is bigger than that in the original attributes (X1 and X2). This is not surprising as PC1 is constructed to lie along the direction of the greatest variance in the data. The fact that PC2 has less variability means that it is less important for differentiating between observations. If there were more PCs, the data projection on these axes would have less and less variability. That is, they would capture less and less important aspects of the data. This is what allows compressing the data. By ignoring higher-order PCs we can create a new version of the data with fewer attributes than the original data.

Figure 3: Coordinates for a point on the original axes and on the PC axes.
Figure 3 shows the coordinates for a point in the original coordinate system (attributes X1 and X2) and in the PC coordinate system (PC1 and PC2). We can see that, for the PC coordinate system, the majority of points can be distinguished from each other by just looking at the value of PC1. This is not true for the original coordinate system. For example, there are many fewer points with coordinate b1 than with coordinate a1.

Because the data span a couple of orders of magnitude, I applied the LOG transformation before centering the data using the mk_xfm_query function:

DECLARE
v_source long;
v_score_query long;
v_npc NUMBER;
BEGIN
v_source :=
'SELECT name, log(10,distance) distance, ' ||
'log(10,diameter) diameter, '||
'log(10,density) density FROM planet';
v_source :=
dmt_projection.mk_xfm_query(v_source,
dmt_matrix.array('NAME'),'C');
-- Compute PCA
dmt_projection.pca(
p_source => v_source,
p_iname => 'NAME',
p_s_tab_name => 'planet_s',
p_v_tab_name => 'planet_v',
p_u_tab_name => 'planet_u',
p_scr_tab_name => 'planet_scr',
p_score_query => v_score_query,
p_num_pc => v_npc
);
END;
/
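The data behind the Scree and Log-Likelihood plots discussed below is persisted in the planet_scr table and can be inspected directly with a query like this (the column names are the ones created by the array2table call in the code at the end of the post):

SELECT id, pct_var, cum_pct_var, log_likelihood
  FROM planet_scr
 ORDER BY id;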
Figure 4: Scree plot for the planetary data.
The planet_scr table (see Table 1) contains data for the Scree and Log-Likelihood plots. The Scree plot (Figure 4) charts all the eigenvalues in their decreasing order (from lower to higher PCs). I was curious, so I checked out why it is called a Scree plot. The word scree comes from the Old Norse term for landslide. The term is used in this context because the plot resembles the side of a mountain with debris from the mountain lying at its base. The Log-Likelihood plot is covered in Example 3 below.

Figure 5: Projection of planet data onto the first two PCs.
Figure 5 shows the distribution of the planets based on their PC1 (horizontal) and PC2 (vertical) coordinates. These numbers (scores) are persisted in the planet_u table. The red arrows show the projection of the original attributes in the PC coordinate system. The arrows point in the direction of increasing values of the original attribute. The direction of the arrows can be obtained from the V matrix returned by the pca procedure in the planet_v table. Each row in this table gives the PC coordinates for one of the original attributes.

Figure 6: Country data projection onto the first two PCs.
Figure 6 shows the projection of the LOG of the country data onto the first two PCs (PC1 is the horizontal axis). From the picture we see that countries with large populations are in the upper left corner. Countries with large areas and large average incomes are closer to the lower right corner. It is harder to get insight on the countries in the middle. Also, this type of analysis becomes more and more difficult as the number of attributes increases (more red arrows in the picture).

The code below has two parts. First it does PCA on the data. Next it creates the maps. The maps are persisted in the COUNTRY_ATTR_MAP table. The bin coordinates for each country are returned in the COUNTRY_BVIEW view. The query used for creating the maps joins the original data with the PCA projections from the country_u table. To create a map for attributes not in the original data we just need to edit this query and replace the original table in the join with the new table containing the data we want to plot.

DECLARE
v_source long;
v_score_query long;
v_npc NUMBER;
BEGIN
v_source :=
'SELECT name, log(10,population) log_pop, ' ||
'log(10,avgincome) log_avgincome, log(10,area) ' ||
'log_area FROM country';
v_source := dmt_projection.mk_xfm_query(v_source,
dmt_matrix.array('NAME'),'Z');
-- Compute PCA
dmt_projection.PCA(
p_source => v_source,
p_iname => 'NAME',
p_s_tab_name => 'country_s',
p_v_tab_name => 'country_v',
p_u_tab_name => 'country_u',
p_scr_tab_name => 'country_scr',
p_score_query => v_score_query,
p_num_pc => v_npc
);
DBMS_OUTPUT.PUT_LINE('Example 2 - NumPC= ' || v_npc);
-- Define query to create maps
v_source := 'SELECT a.name, log(10,population) population, '||
'log(10,avgincome) avgincome, '||
'log(10,area) area, '||
'b.pc1, b.pc2 ' ||
'FROM country a, country_u b ' ||
'WHERE a.name=b.name';
-- Create maps
dmt_projection.mk_attr_map(
p_source => v_source,
p_nbins => 4,
p_iname => 'NAME',
p_coord => dmt_matrix.array('PC2','PC1'),
p_attr => dmt_matrix.array('POPULATION',
'AVGINCOME','AREA'),
p_op => 'AVG',
p_bin_tab => 'COUNTRY_BTAB',
p_bin_view => 'COUNTRY_BVIEW',
p_res_tab => 'COUNTRY_ATTR_MAP');
END;
/
Figure 7: Population attribute map.
Figure 8: Average income attribute map.
Figures 6-8 display the three maps using MS Excel's surface chart. The maps place the countries on a regular grid (PC1 vs. PC2, where PC1 is the horizontal axis) by binning the values of the coordinates PC1 and PC2 into four equi-width buckets. The colors provide a high-to-low landscape with peaks (purple) and valleys (blue). Each point in the grid can contain more than one country. I labeled a few countries per point for reference purposes. The value of the attribute at a grid node represents the average (AVG) value for the countries in that grid node. Values (and colors) between grid points are created automatically by interpolation by Excel's surface chart routine.

Next we need to determine if a small number of PCs (2 or 3) captures enough of the characteristics of the data that we can feel confident that our visual approach has merit. There are many heuristics that can be used to estimate the number of PCs to keep (see [ZHU06] for a review). A common approach is to find a cut-off point (a PC number) in the Scree plot where the change from PC to PC levels off. That is, where the mountain ends and the debris start. Until not long ago, there were no principled, low computational cost approaches to select the cut-off point on the Scree plot. A recent paper by Zhu and Ghodsi [ZHU06] changed that. The paper introduces the profile log-likelihood, a metric that can be used to estimate the number of PCs to keep. Basically one should keep all the PCs up to and including the one for which the Profile Log-Likelihood plot peaks.

DECLARE
v_source long;
v_score_query long;
v_npc NUMBER;
BEGIN
v_source :=
'SELECT name, x1, x2, x3, x4, x5, x6, x7, x8 FROM utility';
v_source := dmt_projection.mk_xfm_query(v_source,
dmt_matrix.array('NAME'),
'Z');
-- Compute PCA
dmt_projection.pca(
p_source => v_source,
p_iname => 'NAME',
p_s_tab_name => 'utility_s',
p_v_tab_name => 'utility_v',
p_u_tab_name => 'utility_u',
p_scr_tab_name => 'utility_scr',
p_score_query => v_score_query,
p_num_pc => v_npc
);
DBMS_OUTPUT.PUT_LINE('Example 3 - NumPC= ' || v_npc);
END;
/
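Once the example above runs, the estimated number of PCs is printed via p_num_pc. The same answer can be read off the utility_scr table, where the profile log-likelihood values are persisted (a sketch; the column names are the ones created by the array2table call at the end of the post, and the result should agree with the printed NumPC value):

SELECT id
  FROM utility_scr
 WHERE log_likelihood = (SELECT MAX(log_likelihood) FROM utility_scr);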
Figure 10: Scree plot for electric utility data.
Figure 11: Log-likelihood profile plot for electric utility data.
Figures 10 and 11 show, respectively, the Scree and the Log-Likelihood plots using the data persisted in the utility_scr table. The Profile Log-Likelihood plot indicates that we should keep the first three PCs (the plot peaks at PC3). This might not have been immediately obvious from the Scree plot itself. The cumulative percentage explained variance for the first three PCs is 67.5%. PC3 accounts for about 16.5% of the variance. By keeping only the first three PCs we capture a good deal of the variance in the data.

Figure 12: Electric utility data projection onto PC1 and PC2 axes.
Let's start our analysis by looking at the projection of the data on the first two PCs. This is represented in Figure 12. From the picture we can see that the values in PC1 (horizontal axis) divide the data into three sections: a left one with the Green group (San Diego), a right one with the Blue group, and a middle section with the remainder of the utilities. Along the PC2 axis the middle section separates into three subsections, colored red, yellow, and black in the picture. Finally, the Black group seems to have two subgroups separated by a line in the picture. This visual partition of the data is somewhat arbitrary but it can be quite effective.

Figure 13: Electric utility data projection onto PC1 and PC3 axes.
Figure 13 illustrates the projection of the data using the first three PCs (see Technical Note 4 for details on how to construct this plot). Each vertex of the triangle represents one of the first three PCs; the closer to the vertex, the higher the value for the associated PC. The numbers have been scaled so that at a vertex the associated PC has a value of 1 and the other two PCs are zero. Also, points on the line opposite a vertex have a value of zero for the PC associated with that vertex. For example, San Diego has a value of almost 0 for PC1.

The code below uses Singular Value Decomposition (SVD) for computing PCA. It has three parts. First it does the SVD by calling a single UTL_NLA procedure (UTL_NLA.LAPACK_GESVD). Then it scales the U matrix to obtain the projections. Finally it scales and squares the S matrix to get the variances.

PROCEDURE pca ( p_a IN OUT NOCOPY UTL_NLA_ARRAY_DBL,
p_m IN PLS_INTEGER,
p_n IN PLS_INTEGER,
p_pack IN VARCHAR2 DEFAULT 'R',
p_s IN OUT NOCOPY UTL_NLA_ARRAY_DBL,
p_u IN OUT NOCOPY UTL_NLA_ARRAY_DBL,
p_vt IN OUT NOCOPY UTL_NLA_ARRAY_DBL,
p_jobu IN BOOLEAN DEFAULT true
)
AS
v_indx PLS_INTEGER;
v_alpha NUMBER;
v_info PLS_INTEGER;
v_dim PLS_INTEGER := p_n;
v_ldvt PLS_INTEGER := p_n;
v_jobu VARCHAR2(1) := 'N';
BEGIN
IF p_jobu THEN
v_jobu := 'S';
END IF;
-- Get the correct dimensions to use
IF (p_m <= p_n) THEN
v_dim := p_m - 1;
v_ldvt := p_m;
END IF;
-- Compute SVD of matrix A: A = U * S * Vt
UTL_NLA.LAPACK_GESVD(jobu => v_jobu,
jobvt => 'S',
m => p_m,
n => p_n,
a => p_a,
lda => p_n,
s => p_s,
u => p_u,
ldu => v_dim,
vt => p_vt,
ldvt => v_ldvt,
info => v_info,
pack => p_pack);
-- Scale U to obtain projections
IF p_jobu THEN
v_indx := 0;
FOR i IN 1..p_m LOOP
FOR j IN 1..v_dim LOOP
v_indx := v_indx + 1;
p_u(v_indx) := p_u(v_indx) * p_s(j);
END LOOP;
END LOOP;
END IF;
-- Scale and square S
v_alpha := 1 / SQRT(p_m - 1);
FOR i IN 1..v_dim LOOP
p_s(i) := v_alpha * p_s(i);
p_s(i) := p_s(i) * p_s(i);
END LOOP;
END pca;
The table-level pca procedure below drives the whole computation. Its main steps, also marked by the comments in the code, are: read the source table into an array, compute the PCA, build a scoring query, estimate the number of PCs and the scree plot information, and persist the eigenvalues, the principal components (V), the scores (U), and the scree and log-likelihood plot data.

PROCEDURE pca ( p_source IN VARCHAR2,
p_iname IN VARCHAR2 DEFAULT NULL,
p_s_tab_name IN VARCHAR2,
p_v_tab_name IN VARCHAR2,
p_u_tab_name IN VARCHAR2,
p_scr_tab_name IN VARCHAR2,
p_score_query OUT NOCOPY VARCHAR2,
p_num_pc OUT NOCOPY PLS_INTEGER
)
AS
v_query LONG;
v_m PLS_INTEGER;
v_n PLS_INTEGER;
v_dim PLS_INTEGER;
v_a UTL_NLA_ARRAY_dbl := utl_nla_array_dbl();
v_s UTL_NLA_ARRAY_dbl := utl_nla_array_dbl();
v_u UTL_NLA_ARRAY_DBL := utl_nla_array_dbl();
v_vt UTL_NLA_ARRAY_DBL := utl_nla_array_dbl();
v_sc UTL_NLA_ARRAY_DBL := utl_nla_array_dbl();
v_col_names dmt_matrix.array;
v_row_ids dmt_matrix.array;
BEGIN
-- Read table into array
dmt_matrix.table2array(
p_source => p_source,
p_a => v_a,
p_m => v_m,
p_n => v_n,
p_col_names => v_col_names,
p_row_ids => v_row_ids,
p_iname => p_iname,
p_force => true);
-- Compute PCA. It only returns information for
-- PCs with non-zero eigenvalues
pca (p_a => v_a,
p_m => v_m,
p_n => v_n,
p_pack => 'R',
p_s => v_s,
p_u => v_u,
p_vt => v_vt,
p_jobu => false);
-- Get the correct number of components returned.
-- This is the min(v_n, v_m-1)
IF (v_m <= v_n) THEN
v_dim := v_m - 1;
ELSE
v_dim := v_n;
END IF;
-- Make scoring query
p_score_query := mk_pca_query(v_col_names, p_iname, v_dim, v_vt);
-- Estimate number of PCs and scree plot information
pca_eval(v_s, v_sc, p_num_pc, v_dim);
-- Persist eigenvalues
-- Number of cols is v_dim
dmt_matrix.array2table(
p_a => v_s,
p_m => 1,
p_n => v_dim,
p_tab_name => p_s_tab_name,
p_col_pfx => 'PC',
p_iname => NULL);
-- Persist PCs (V) (using pack = C transposes the matrix
-- as it was originally R packed)
-- Number of cols in V is v_dim
dmt_matrix.array2table(
p_a => v_vt,
p_m => v_n,
p_n => v_dim,
p_tab_name => p_v_tab_name,
p_col_pfx => 'PC',
p_iname => 'ATTRIBUTE',
p_row_ids => v_col_names,
p_pack => 'C');
-- Persist scores
-- Number of cols in U is v_dim
-- Using the scoring query so that we project all the data
IF p_u_tab_name IS NOT NULL THEN
v_query := 'CREATE TABLE ' || p_u_tab_name ||
' AS ' || p_score_query || '( ' ||
p_source ||')';
EXECUTE IMMEDIATE v_query;
END IF;
-- Persist scree and log-likelihood plot information
dmt_matrix.array2table(
p_a => v_sc,
p_m => v_dim,
p_n => 4,
p_iname => NULL,
p_col_names => array('ID','PCT_VAR','CUM_PCT_VAR',
'LOG_LIKELIHOOD'),
p_tab_name => p_scr_tab_name);
END pca;
Labels: Linear Algebra, Statistics, UTL_NLA
6/04/2007 01:05:00 PM | Permalink |