Monday, June 15, 2009

Building a Scraper Using UTL_HTTP

Someone set me the task of embedding pages from another one of our online applications into regions of an APEX app, so we could maintain a single navigation and look and feel. My first thought was that it couldn't be done. As far as I was aware you couldn't perform a GET request from PL/SQL (I'd briefly checked in the past and that was the conclusion I'd arrived at). So instead I quickly threw something together in PHP that scraped the application and then used str_replace to embed additional style/navigation, which achieved a pretty good compromise. It was quick and dirty, and getting it to work on the PHP4 version that comes with APEX was a bit of a challenge. It wasn't really an ideal solution, so I kept digging and found the UTL_HTTP package.

It's not quite as simple as calling a function that returns the page content, but it's possible. I followed the documentation as well as an O'Reilly example. The simple package I built doesn't handle proxies or authentication, but it does do a better job of handling large pages as CLOB data. I've pulled out the exception handling to make it simpler to understand. Refer to O'Reilly's example when you are ready for that.

CREATE OR REPLACE PACKAGE WEB_SCRAPER AS
FUNCTION HTTP_GET_PAGE(p_url in VARCHAR2) RETURN CLOB;

FUNCTION DOCUMENT_BODY(p_document in CLOB) RETURN CLOB;

FUNCTION HTTP_GET_PAGE_BODY(p_url in VARCHAR2) RETURN CLOB;

PROCEDURE PRINT_CLOB(p_content in CLOB);
END WEB_SCRAPER;
/

CREATE OR REPLACE PACKAGE BODY WEB_SCRAPER AS
FUNCTION HTTP_GET_PAGE(p_url in VARCHAR2) RETURN CLOB AS
/* Performs a simple HTTP GET request and returns the content */
l_request Utl_Http.req;
l_response Utl_Http.resp;
l_content_buffer VARCHAR2 (32767);
l_page CLOB;
BEGIN
DBMS_LOB.createtemporary(l_page, FALSE);

l_request := Utl_Http.begin_request (url => p_url, method => 'GET');

Utl_Http.set_header (r => l_request, NAME => 'User-Agent', VALUE => 'APEX-Scraper/0.1');
l_response := Utl_Http.get_response (r => l_request);

BEGIN
LOOP
UTL_HTTP.read_text(l_response, l_content_buffer, 32767);
DBMS_LOB.writeappend (l_page, LENGTH(l_content_buffer), l_content_buffer);
END LOOP;
EXCEPTION WHEN Utl_Http.end_of_body THEN
NULL;
END;

Utl_Http.end_response (r => l_response);
RETURN l_page;
END HTTP_GET_PAGE;

FUNCTION DOCUMENT_BODY(p_document in CLOB) RETURN CLOB AS
/* Return everything between the body tags */
BEGIN
RETURN REGEXP_REPLACE(p_document,'^.*<[[:space:]]*BODY[^>]*>(.*)<[[:space:]]*/[[:space:]]*BODY[[:space:]]*>.*$','\1',1,0,'imn');
END DOCUMENT_BODY;

FUNCTION HTTP_GET_PAGE_BODY(p_url in VARCHAR2) RETURN CLOB AS
/* Convenience function combining DOCUMENT_BODY and HTTP_GET_PAGE */
BEGIN
RETURN DOCUMENT_BODY(HTTP_GET_PAGE(p_url));
END HTTP_GET_PAGE_BODY;

PROCEDURE PRINT_CLOB(p_content in CLOB) AS
/* Use this instead of htp.p, which only supports VARCHAR2 */
l_buffer_size Number := 8191; --Can't always use 32767. See bug #5896994
l_clob CLOB := p_content;
l_chunks Number;
l_cloblen Number;
BEGIN
l_cloblen := dbms_lob.getlength(l_clob);
l_chunks := ceil(l_cloblen / l_buffer_size);
FOR i IN 0 .. l_chunks - 1 LOOP
htp.p(dbms_lob.substr(l_clob, l_buffer_size, (i * l_buffer_size) + 1));
END LOOP;
END PRINT_CLOB;

END WEB_SCRAPER;
/
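To use it from an APEX region, set the region source type to PL/SQL and call the two functions together. A minimal sketch (the URL below is just a placeholder for whatever application you're embedding):

BEGIN
WEB_SCRAPER.PRINT_CLOB(WEB_SCRAPER.HTTP_GET_PAGE_BODY('http://www.example.com/'));
END;

One caveat: if you're on Oracle 11g, outbound UTL_HTTP calls are blocked by default, so you may need to grant your schema a network access control list via DBMS_NETWORK_ACL_ADMIN before this will run.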
