The happy little code snippet below had been chugging along for almost a year, and it still worked on other hosts, when it suddenly broke...
$rssfeedurl = 'http://www.somdomain.com/somerss.xml';
if (!$xmlDoc = new DOMDocument()){
echo 'Could not initiate feed '.$rssfeedurl ;
}
if (!@$xmlDoc->load($rssfeedurl)){
echo 'Could not fetch or translate '.$rssfeedurl.' into XML';
} else {
// PARSE AWAY ! ! ! !
// get elements from <channel>
$channel = $xmlDoc->getElementsByTagName('channel')->item(0);
$channel_title = @$channel->getElementsByTagName('title') ->item(0)->childNodes->item(0)->nodeValue;
$channel_link = $channel->getElementsByTagName('link') ->item(0)->childNodes->item(0)->nodeValue;
$channel_desc = $channel->getElementsByTagName('description') ->item(0)->childNodes->item(0)->nodeValue;
}
After beating my head against my already battered desk for far too long, I asked the owners of the domain what was up.
They in turn conveyed my question to their new hosting company, who promptly responded with "The HTTP request header is invalid" and hinted that their external proxy couldn't forward such a request to the target server.
So I tried using stream_context_create() and libxml_set_streams_context() to send valid HTTP request headers, and it seemed to work...
$opts = array(
'http' => array(
'user_agent' => 'xml fetcher 1.0',
)
);
$context = stream_context_create($opts);
libxml_set_streams_context($context);
if (!@$xmlDoc->load($rssfeedurl)){
echo 'Could not fetch or translate '.$rssfeedurl.' into XML';
} else {
// PARSE AWAY ! ! ! !
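The http stream wrapper accepts more options than user_agent, so the context above can spell out more of the request. What follows is only a minimal sketch: the helper name build_feed_context() and the particular header values, timeout and redirect limit are illustrative assumptions, not anything the hosting company asked for.

```php
<?php
// Hypothetical helper (not from the original snippet): build a stream context
// that states the HTTP request details explicitly instead of relying on defaults.
function build_feed_context(string $userAgent)
{
    $opts = array(
        'http' => array(
            'method'           => 'GET',
            'user_agent'       => $userAgent,
            'protocol_version' => 1.0,   // assumption: mirror a plain HTTP/1.0 request
            'header'           => array( // assumption: example header values
                'Accept: application/rss+xml, application/xml;q=0.9, */*;q=0.8',
                'Connection: close',
            ),
            'timeout'          => 30,    // seconds to wait for a response
            'follow_location'  => 1,     // follow redirects
            'max_redirects'    => 5,
        ),
    );
    return stream_context_create($opts);
}

// Register the context so DOMDocument::load() uses it for URL fetches.
$context = build_feed_context('xml fetcher 1.0');
libxml_set_streams_context($context);
```

After this, $xmlDoc->load($rssfeedurl) is called exactly as before; whether any of these extra options satisfies a given proxy is something to test against that proxy.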
I eventually gave up on that approach, though, because I was still getting intermittent failures, likely because creating a fully valid HTTP request header is trickier than I thought... So I switched to cURL, and Eureka! CURLOPT_HTTPPROXYTUNNEL to the rescue!
Why didn't I try this before? If you do a Google search for "PHP CURL proxy" you will find tons of info on posting through a local or intermediary proxy. But I could find almost nothing on dealing with the nuances of a remote proxy sitting directly in front of the target server, and besides, $xmlDoc->load($rssfeedurl) is just so damn elegant...
After some trial and error, I came up with this.
$cookie="./.cookiefile";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $rssfeedurl);
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_0); // CURL_HTTP_VERSION_1_1 likely works as well, but 1.0 seems fine
curl_setopt($ch, CURLOPT_HEADER, 0); // Don't want response headers
curl_setopt($ch, CURLOPT_REFERER, 'http://'.$_SERVER['SERVER_NAME'].$_SERVER['REQUEST_URI']);
curl_setopt($ch, CURLOPT_USERAGENT, 'Dew-Code XML Grabber 1.0');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1); // to handle redirects
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // assign response to a variable
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie); // .cookiefile is just an empty file, just in case
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie); // ditto
curl_setopt($ch, CURLOPT_TIMEOUT, 30); // seconds to wait for response
curl_setopt($ch, CURLOPT_AUTOREFERER, 1); // in case of redirect, convey referrer
curl_setopt($ch, CURLOPT_MAXREDIRS, 5); // how many times to allow redirect
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1); // absolute must when a proxy is involved
$rssblob = curl_exec($ch); // make it go!!
if (curl_errno($ch)) {
echo curl_error($ch); // handy for debugging
}
curl_close($ch); // Save the environment! Clean up after yourself, even when the request failed
if (!@$xmlDoc->loadXML($rssblob)){
echo 'Could not fetch or translate '.$rssfeedurl.' into XML';
} else {
// PARSE AWAY ! ! ! !
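Once the feed arrives as a string, the parsing step can be separated from the fetching step. The sketch below is one possible way to do that, not the code from this post: parse_channel() is a hypothetical helper name, and it uses libxml_use_internal_errors() in place of the @ operator so that a bad response does not spray warnings.

```php
<?php
// Hypothetical helper: parse an RSS string that has already been fetched
// (for example with curl_exec) and return the channel title, link and description.
// Returns null when the string is empty, is not well-formed XML, or has no <channel>.
function parse_channel(string $xml): ?array
{
    if ($xml === '') {
        return null; // loadXML() rejects an empty string, so bail out early
    }

    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$ok) {
        return null;
    }

    $channel = $doc->getElementsByTagName('channel')->item(0);
    if ($channel === null) {
        return null;
    }

    $result = array();
    foreach (array('title', 'link', 'description') as $name) {
        // First matching descendant of <channel>, as in the snippets above.
        $node = $channel->getElementsByTagName($name)->item(0);
        $result[$name] = $node !== null ? trim($node->textContent) : null;
    }
    return $result;
}
```

With the cURL code above, this would be called as parse_channel($rssblob) after checking that curl_exec() did not return false.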
I hope you find it useful.