Showing posts with label XML. Show all posts

Wednesday, September 30, 2009

Proxies PHP and XML OH MY

I like PHP. I like XML. But like anything, they sometimes just flat out piss me off. Such was the case when a host I had been gleefully fetching an RSS feed from decided to move their website to a provider with DDoS detection and protection.

The happy little code snippet below, which had been chugging along for almost a year and continued to work on other hosts, suddenly broke...

$rssfeedurl = 'http://www.somdomain.com/somerss.xml';

if (!$xmlDoc = new DOMDocument()){
    echo 'Could not initiate feed '.$rssfeedurl;
}

if (!@$xmlDoc->load($rssfeedurl)){
    echo 'Could not fetch or translate '.$rssfeedurl.' into XML';
} else {
    // PARSE AWAY ! ! ! !
    // get elements from <channel>
    $channel = $xmlDoc->getElementsByTagName('channel')->item(0);
    $channel_title = @$channel->getElementsByTagName('title')->item(0)->childNodes->item(0)->nodeValue;
    $channel_link = $channel->getElementsByTagName('link')->item(0)->childNodes->item(0)->nodeValue;
    $channel_desc = $channel->getElementsByTagName('description')->item(0)->childNodes->item(0)->nodeValue;
}

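The @ in front of load() hides the reason a fetch or parse fails, which makes a breakage like this one harder to diagnose. As a rough sketch (the helper names firstText and readChannel are made up for illustration, and it parses a string rather than fetching a URL), libxml's own error list can be collected instead:

```php
<?php
// Hypothetical helper: text of the first <$tag> descendant of $parent, or null.
function firstText(DOMElement $parent, string $tag): ?string
{
    $node = $parent->getElementsByTagName($tag)->item(0);
    return $node === null ? null : trim($node->textContent);
}

// Hypothetical helper: parse an RSS string and return the channel fields,
// or an array with an 'errors' key listing what libxml complained about.
function readChannel(string $xml): array
{
    $previous = libxml_use_internal_errors(true); // collect errors, no warnings
    $doc = new DOMDocument();
    $ok = $xml !== '' && $doc->loadXML($xml);     // an empty string is never valid XML
    $errors = array_map(function ($e) {
        return trim($e->message);
    }, libxml_get_errors());
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the earlier setting

    $channel = $ok ? $doc->getElementsByTagName('channel')->item(0) : null;
    if ($channel === null) {
        return array('errors' => $errors ?: array('no <channel> element found'));
    }
    return array(
        'title'       => firstText($channel, 'title'),
        'link'        => firstText($channel, 'link'),
        'description' => firstText($channel, 'description'),
    );
}
```

Calling readChannel() on a string that is not a feed returns the libxml messages under 'errors', so the failure can be logged rather than guessed at.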

After beating my head against my already battered desk for far too long, I asked the owners of the domain, what up?

They in turn conveyed my question to their new hosting company, who promptly responded with "The HTTP request header is invalid" and hinted that their external proxy couldn't forward such a request to the target server.

So, I tried using stream_context_create() and libxml_set_streams_context() to create valid HTTP request headers, and it seemed to work....


$opts = array(
    'http' => array(
        'user_agent' => 'xml fetcher 1.0',
    )
);

$context = stream_context_create($opts);
libxml_set_streams_context($context);
if (!@$xmlDoc->load($rssfeedurl)){
    echo 'Could not fetch or translate '.$rssfeedurl.' into XML';
} else {
    // PARSE AWAY ! ! ! !
}

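For what it's worth, the http stream context accepts more than a user agent. Here is a sketch of a fuller request built the same way. Which header the proxy actually objected to was never established, so the Accept and Connection values are guesses, and makeFeedContext is a made-up helper name:

```php
<?php
// Hypothetical helper: build a stream context with a fuller set of request headers.
function makeFeedContext(string $userAgent)
{
    $opts = array(
        'http' => array(
            'method'           => 'GET',
            'user_agent'       => $userAgent,
            'protocol_version' => 1.1,
            'timeout'          => 30, // seconds
            'header'           => array(
                // Guessed values; adjust to whatever the remote proxy expects.
                'Accept: application/rss+xml, application/xml;q=0.9, */*;q=0.8',
                'Connection: close',
            ),
        ),
    );
    return stream_context_create($opts);
}

$context = makeFeedContext('xml fetcher 1.0');
libxml_set_streams_context($context); // used by later DOMDocument::load() calls
```

The Host header is added by PHP's http wrapper on its own, so it is left out here.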

I finally gave up on using it though, because I was still getting intermittent failures, likely because creating a fully valid HTTP request header is trickier than I thought... So I switched to cURL, and Eureka! CURLOPT_HTTPPROXYTUNNEL to the rescue!

Why didn't I try this before? If you do a Google search for "PHP CURL proxy" you will find tons of info on posting through a local or intermediary proxy. But I could find almost nothing on dealing with the nuances of a remote proxy sitting directly in front of the target server, and besides, $xmlDoc->load($rssfeedurl) is just so damn elegant...

After some trial and error, I came up with this.

$cookie = "./.cookiefile";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $rssfeedurl);
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_0); // 1.1 likely works as well, but 1.0 seems fine
curl_setopt($ch, CURLOPT_HEADER, 0); // Don't want response headers
curl_setopt($ch, CURLOPT_REFERER, 'http://'.$_SERVER['SERVER_NAME'].$_SERVER['REQUEST_URI']);
curl_setopt($ch, CURLOPT_USERAGENT, 'Dew-Code XML Grabber 1.0');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // to handle redirects
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // assign response to a variable
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie); // .cookiefile is just an empty file, just in case
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie); // ditto
curl_setopt($ch, CURLOPT_TIMEOUT, 30); // seconds to wait for response
curl_setopt($ch, CURLOPT_AUTOREFERER, 1); // in case of redirect, convey referrer
curl_setopt($ch, CURLOPT_MAXREDIRS, 5); // how many times to allow redirect
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1); // what finally worked for me; documented as tunneling through the proxy given in CURLOPT_PROXY
$rssblob = curl_exec($ch); // make it go!!

if (curl_errno($ch)) {
    echo curl_error($ch); // handy for debugging
}
curl_close($ch); // Save the binviroment! clean up after yourself, error or not

// curl_exec() returns false on failure, so check before handing it to loadXML()
if (!$rssblob || !@$xmlDoc->loadXML($rssblob)){
    echo 'Could not fetch or translate '.$rssfeedurl.' into XML';
} else {
    // PARSE AWAY ! ! ! !
}

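One thing the snippet above does not catch: curl_errno() only reports transport failures, so an HTTP error such as a 403 still comes back as a "successful" body. Here is a minimal sketch of a wrapper that also checks the status code. The function name fetchFeed and the trimmed-down option set are my own choices for illustration, not the exact configuration above:

```php
<?php
// Hypothetical helper: fetch $url with cURL and return the body, or null
// (with the reason in $error) when the transfer or the HTTP status fails.
function fetchFeed(string $url, ?string &$error = null): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_TIMEOUT        => 30,   // seconds
        CURLOPT_USERAGENT      => 'Dew-Code XML Grabber 1.0',
    ));
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $error  = curl_errno($ch) ? curl_error($ch) : null;
    curl_close($ch);

    if ($body === false) {
        return null; // transport failure, $error says why
    }
    if ($status >= 400) {
        $error = 'HTTP status ' . $status; // e.g. an error page, not a feed
        return null;
    }
    return $body;
}

// Usage sketch:
// $rssblob = fetchFeed($rssfeedurl, $why);
// if ($rssblob === null) { echo 'Fetch failed: ' . $why; }
```

Returning null for both kinds of failure keeps the calling code to a single check before loadXML().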

I hope you find it useful..

Wednesday, July 22, 2009

When PHP stands for Possibly Hopeless Program

So, here I am, almost broke. Drank my last pot of coffee, down to 6 cigarettes, and I get this seemingly simple gig to fix some XML fetch/parse/present script. Woot, I can do that for like $20 USD... (I found if I don't specify USD I'm likely to be paid in pesos, rubles or some other currency, and exchange rates rarely work in my favor.)

Anyways.. It looked like a pretty well written script, using PHP's DOMDocument support, which I've used in some of my own scripts. The crux of the script was...


$rssfeedurl = 'http://someXMLurl';

if (!$xmlDoc = new DOMDocument()){
    echo 'Could not initiate feed '.$rssfeedurl; exit();
}

if (!$xmlDoc->load($rssfeedurl)){
    echo 'Could not fetch feed '.$rssfeedurl; exit();
} else {
    $channel = $xmlDoc->getElementsByTagName('channel')->item(0);
    $channel_title = @$channel->getElementsByTagName('title')->item(0)->childNodes->item(0)->nodeValue;
    $channel_link = $channel->getElementsByTagName('link')->item(0)->childNodes->item(0)->nodeValue;
    $channel_desc = $channel->getElementsByTagName('description')->item(0)->childNodes->item(0)->nodeValue;

    //etc etc . . . .
}

Of course the script ran for months just fine, then suddenly stopped for no apparent reason. Turns out it was getting a 403 Forbidden response, but the IP was not blocked by the host serving the XML. Actually, they couldn't figure it out either.

I tested every cause I could think of. Shelling into the server, I could wget the XML just fine. So I wrote a quick PHP script using curl to fetch the feed, thinking it's gotta be the user agent they are blocking...

I tried all sorts of different user agent names, even some real nasty ones that any network would ban on sight. All were allowed to fetch the feed.

Then I accidentally fired the script off with no user agent set. WHOOP THERE IT IS.. 403 Forbidden.. BINGO. They reject requests with no user agent set.. but how the hell do you set an agent name for DOMDocument::load?

Luckily I found the answer. The trick is adding a few lines before you initialize DOMDocument:


$opts = array(
    'http' => array(
        'user_agent' => 'xml fetcher 1.0',
    )
);

$context = stream_context_create($opts);
libxml_set_streams_context($context);

if (!$xmlDoc = new DOMDocument()){
    echo 'Could not initiate feed '.$rssfeedurl;
}


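A possible shortcut, sketched under an assumption I have not verified against this particular host: DOMDocument::load() fetches http:// URLs through PHP's http stream wrapper, and that wrapper falls back to the user_agent ini setting when the context does not supply one. If that holds, setting the ini value should have the same effect as the context above:

```php
<?php
// Assumption: the http stream wrapper falls back to this ini setting
// whenever no 'user_agent' context option is given.
ini_set('user_agent', 'xml fetcher 1.0');

$xmlDoc = new DOMDocument();
// $xmlDoc->load($rssfeedurl); // should now go out with "User-Agent: xml fetcher 1.0"
```

The ini setting applies to every http:// stream the script opens, while libxml_set_streams_context() only affects libxml loads, so the context version is the narrower of the two.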


So.. now I got $20 USD YAY! A pound of coffee, maybe some top ramen and a pack of smokes!! NEXT !

PS. If you were actually curious what PHP stands for, it's "PHP: Hypertext Preprocessor", a recursive acronym. So, in essence, the first P stands for nothing; it is just there to create a cooler sounding 3 letter acronym, and besides, HP was taken.