Wednesday, July 22, 2009

When PHP stands for Possibly Hopeless Program

So, here I am, almost broke. Drank my last pot of coffee, down to 6 cigarettes, and I get this seemingly simple gig to fix some XML fetch/parse/present script. Woot, I can do that for like $20 USD. (I found that if I don't specify USD I'm likely to be paid in pesos, rubles or some other currency, and exchange rates rarely work in my favor.)

Anyways.. It looked like a pretty well-written script, using PHP's DOMDocument support, which I've used in some of my own scripts. The crux of the script was...


$rssfeedurl = 'http://someXMLurl';

if (!$xmlDoc = new DOMDocument()){
    echo 'Could not initiate feed '.$rssfeedurl ; exit();
}

if (!$xmlDoc->load($rssfeedurl)){
    echo 'Could not fetch feed '.$rssfeedurl ; exit();
} else {
    $channel = $xmlDoc->getElementsByTagName('channel')->item(0);
    $channel_title = @$channel->getElementsByTagName('title')->item(0)->childNodes->item(0)->nodeValue;
    $channel_link = $channel->getElementsByTagName('link')->item(0)->childNodes->item(0)->nodeValue;
    $channel_desc = $channel->getElementsByTagName('description')->item(0)->childNodes->item(0)->nodeValue;

    //etc etc . . . .
}
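Side note on those long chains: if the feed has no such element at all, item(0) hands back null and the next method call kills the script, and the @ can't prevent that. Here's a rough sketch of a safer way to pull the values out. The first_text() helper and the sample XML are made up for illustration, not part of the original script, and textContent returns all the text inside the element rather than just the first child node.

```php
<?php
// Hypothetical helper (not from the original script): returns the text of the
// first <$tag> element found under $parent, or null if there is no such element.
function first_text(DOMElement $parent, string $tag): ?string
{
    $node = $parent->getElementsByTagName($tag)->item(0);
    return $node === null ? null : trim($node->textContent);
}

// Made-up sample feed so the sketch runs without touching the network.
$doc = new DOMDocument();
$doc->loadXML('<rss><channel><title>Example feed</title><link>http://example.com/</link></channel></rss>');

$channel = $doc->getElementsByTagName('channel')->item(0);
$title = first_text($channel, 'title');        // element present
$desc  = first_text($channel, 'description');  // element missing, so null instead of a crash
```

With something like this, a missing description shows up as null that you can check for, instead of a dead script.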

Of course the script ran for months just fine, then suddenly stopped for no apparent reason. Turns out it was getting a 403 Forbidden response.. but the IP was not blocked by the host serving the XML.. actually they couldn't figure it out either.

I tested every cause I could think of. Shelling into the server, I could wget the XML just fine.. So I wrote a quick PHP script using curl to fetch the feed, thinking it's gotta be the user agent they are blocking...

I tried all sorts of different user agent names, even some really nasty ones that any network would ban on sight. All were allowed to fetch the feed.

Then I accidentally fired the script off with no user agent set. WHOOP THERE IT IS.. 403 Forbidden.. BINGO. They reject requests with no user agent set.. but how the hell do you set an agent name for DOMDocument::load()?
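For the curious, the test script was along these lines. This is a reconstruction, not the exact code I ran: probe_status() is a name I made up, and the timeouts are arbitrary. It needs PHP's curl extension. The point is that PHP's curl sends no User-Agent header at all unless you set CURLOPT_USERAGENT, which is how I tripped over the 403.

```php
<?php
// Hypothetical helper: fetch $url and return the HTTP status code,
// or 0 if no HTTP response came back at all (e.g. connection refused).
// Passing null as $userAgent sends the request with no User-Agent header.
function probe_status(string $url, ?string $userAgent = null): int
{
    $ch = curl_init($url);

    $options = [
        CURLOPT_RETURNTRANSFER => true, // keep the body out of the output
        CURLOPT_CONNECTTIMEOUT => 5,    // arbitrary timeouts for the sketch
        CURLOPT_TIMEOUT        => 15,
    ];
    if ($userAgent !== null) {
        $options[CURLOPT_USERAGENT] = $userAgent;
    }
    curl_setopt_array($ch, $options);

    curl_exec($ch);
    return (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
}
```

Call it once as probe_status($rssfeedurl, 'xml fetcher 1.0') and once as probe_status($rssfeedurl) and compare the two status codes.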

Luckily I found the answer. The trick is adding a few lines before you initialize DOMDocument (what actually matters is that they run before the load() call).


$opts = array(
    'http' => array(
        'user_agent' => 'xml fetcher 1.0',
    )
);

// Use this stream context for the next libxml document load, e.g. DOMDocument::load()
$context = stream_context_create($opts);
libxml_set_streams_context($context);

if (!$xmlDoc = new DOMDocument()){
    echo 'Could not initiate feed '.$rssfeedurl ; exit();
}
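If you want the whole fix in one place, here is a minimal sketch that wraps it in a function. The name load_feed() and the choice to return null on failure are mine, not something from the client's script, so treat it as one way to arrange it rather than the way.

```php
<?php
// Hypothetical wrapper: set a User-Agent for libxml's HTTP fetches, then load
// the document. Returns the DOMDocument, or null if load() failed.
function load_feed(string $url, string $userAgent = 'xml fetcher 1.0'): ?DOMDocument
{
    $context = stream_context_create([
        'http' => ['user_agent' => $userAgent],
    ]);
    // Must happen before load(); the 'http' options only affect http:// URLs.
    libxml_set_streams_context($context);

    $doc = new DOMDocument();
    // @ hides the warning load() emits on failure; the return value tells us the result.
    return @$doc->load($url) ? $doc : null;
}
```

Then the calling code becomes $xmlDoc = load_feed($rssfeedurl); followed by a null check before digging into the channel.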




So.. now I got $20 USD YAY! A pound of coffee, maybe some top ramen and a pack of smokes!! NEXT !

PS. If you were actually curious what PHP stands for, it's PHP: Hypertext Preprocessor. So in essence the first P stands for nothing; it is just there to create a cooler-sounding 3 letter acronym, and besides, HP was taken.
