Simple PHP Feed Aggregator

Today we will look at a simple PHP feed aggregator. Recently I was working on a project, Disaster.net, where I had to pull in RSS feeds from multiple sources and present them as a single feed, a process otherwise known as PHP feed parsing. It is a simple process, and we use PHP's object-oriented features to accomplish it with minimal overhead. You can see the end result at Disaster.net.

Now let's look at the PHP files needed for the task.
> We create two PHP files: 1) feedparser.php and 2) feeds.php

feedparser.php holds the feed parsing logic, and feeds.php is the page where the parsed feeds are shown. Let's look at the code of feedparser.php first.

<code>
<?php
class FeedParser{
		
	private $xmlParser      = null;
	private $insideItem     = array();                  // Keep track of current position in tag tree
	private $currentTag     = null;                     // Last entered tag name      
	private $currentAttr    = null;                     // Attributes array of last entered tag
	
	private $namespaces     = array(
							'http://purl.org/rss/1.0/'                  => 'RSS 1.0',
							'http://purl.org/rss/1.0/modules/content/'  => 'RSS 2.0',
							'http://www.w3.org/2005/Atom'               => 'ATOM 1',
							);                          // Namespaces to detect feed version
	private $itemTags       = array('ITEM','ENTRY');    // List of tag names which holds a feed item
	private $channelTags    = array('CHANNEL','FEED');  // List of tag names which holds all channel elements
	private $dateTags       = array('UPDATED','PUBDATE','DC:DATE');  
	private $hasSubTags     = array('IMAGE','AUTHOR');  // List of tag names which have sub tags
	private $channels       = array();                  
	private $items          = array();
	private $itemIndex      = 0;

	private $url            = null;                     // The parsed url
	private $version        = null;                     // Detected feed version 
	
	   
	/**
	* Constructor - Initialize and set event handler functions to xmlParser
	*/    
	function __construct()
	{
		$this->xmlParser = xml_parser_create();
		
		xml_set_object($this->xmlParser, $this);
		xml_set_element_handler($this->xmlParser, "startElement", "endElement");
		xml_set_character_data_handler($this->xmlParser, "characterData");
	}   

	/*-----------------------------------------------------------------------+
	|  Public functions. Use to parse feed and get information.              |
	+-----------------------------------------------------------------------*/
   
	/**
	* Get all channel elements   
	* 
	* @access   public
	* @return   array   - All channels as associative array
	*/
	public function getChannels()
	{
		return $this->channels;
	}
   
	/**
	* Get all feed items   
	* 
	* @access   public
	* @return   array   - All feed items as associative array
	*/
	public function getItems()
	{
		return $this->items;
	}

	/**
	* Get total number of feed items
	* 
	* @access   public
	* @return   number  
	*/   
	public function getTotalItems()
	{
		return count($this->items);
	}

	/**
	* Get a feed item by index
	* 
	* @access   public
	* @param    number  index of feed item
	* @return   array   feed item as associative array of its elements
	*/   
	public function getItem($index)
	{
		if($index < $this->getTotalItems())
		{
			return $this->items[$index];
		}
		else
		{
			throw new Exception("Item index is larger than total items.");
		}
	}
   
	/**
	* Get a channel element by name
	* 
	* @access   public
	* @param    string  the name of channel tag
	* @return   string
	*/   
	public function getChannel($tagName)
	{ 
		if(array_key_exists(strtoupper($tagName), $this->channels))
		{
			return $this->channels[strtoupper($tagName)];
		}
		else
		{
			throw new Exception("Channel tag $tagName not found.");
		}
	}
   
	/**
	* Get the parsed URL
	* 
	* @access   public
	* @return   string
	*/   
	public function getParsedUrl()
	{
		if(empty($this->url))
		{
			throw new Exception("Feed URL is not set yet.");
		}
		else
		{
			return $this->url;
		}
		
		
	}

	/**
	* Get the detected Feed version
	* 
	* @access   public
	* @return   string
	*/   
	public function getFeedVersion()
	{
		return $this->version;
	}
   
	/**
	* Parses a feed url
	* 
	* @access   public
	* @param    string  the feed url
	* @return   void
	*/   
	public function parse($url)
	{
		$this->url  = $url;
		$URLContent = $this->getUrlContent();
		
		if($URLContent)
		{   
			$segments   = str_split($URLContent, 4096);
			foreach($segments as $index=>$data)
			{
				// Tell the parser when it receives the final chunk
				$lastPiece = ((count($segments)-1) == $index);
				xml_parse($this->xmlParser, $data, $lastPiece)
				   or die(sprintf("XML error: %s at line %d",  
				   xml_error_string(xml_get_error_code($this->xmlParser)),  
				   xml_get_current_line_number($this->xmlParser)));
			}
			xml_parser_free($this->xmlParser);   
		}
		else
		{
			die('Sorry! cannot load the feed url.');	
		}
		
		if(empty($this->version))
		{
			die('Sorry! cannot detect the feed version.');
		}
	}   
   
   // End public functions -------------------------------------------------
   
   /*-----------------------------------------------------------------------+
   | Private functions. Edit them with care.                                |
   +-----------------------------------------------------------------------*/

   /**
	* Load the whole contents of a RSS/ATOM page
	* 
	* @access   private
	* @return   string
	*/ 
	private function getUrlContent()
	{
		if(empty($this->url))
		{
			throw new Exception("URL to parse is empty!");
		}
	
		if($content = @file_get_contents($this->url))
		{
			return $content;
		}
		else
		{
			$ch         = curl_init();
			
			curl_setopt($ch, CURLOPT_URL, $this->url);
			curl_setopt($ch, CURLOPT_HEADER, false);
			curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

			$content    = curl_exec($ch);
			$error      = curl_error($ch);
			
			curl_close($ch);
			
			if(empty($error))
			{
				return $content;	
			}
			else
			{
				throw new Exception("Error occurred while loading url by cURL. <br />\n" . $error);
			}
		}
	
	}
	
	/**
	* Handle the start event of a tag while parsing
	* 
	* @access   private
	* @param    object  the xmlParser object
	* @param    string  name of currently entering tag
	* @param    array   array of attributes
	* @return   void
	*/ 
	private function startElement($parser, $tagName, $attrs) 
	{
		if(!$this->version)
		{
			$this->findVersion($tagName, $attrs);
		}

		array_push($this->insideItem, $tagName);

		$this->currentTag  = $tagName;
		$this->currentAttr = $attrs;
	}   

	/**
	* Handle the end event of a tag while parsing
	* 
	* @access   private
	* @param    object  the xmlParser object
	* @param    string  name of currently ending tag
	* @return   void
	*/    
	private function endElement($parser, $tagName) 
	{   
		if (in_array($tagName, $this->itemTags))
		{
		   $this->itemIndex++;
		}

		array_pop($this->insideItem);
		// Once the root element closes the stack is empty, so there is no current tag
		$this->currentTag = empty($this->insideItem) ? null : end($this->insideItem);
	}   

	/**
	* Handle character data of a tag while parsing
	* 
	* @access   private
	* @param    object  the xmlParser object
	* @param    string  tag value
	* @return   void
	*/
	private function characterData($parser, $data) 
	{
		//Converting all date formats to timestamp
		if(in_array($this->currentTag, $this->dateTags))
		{
			$data = strtotime($data);
		}
				 
	   if($this->inChannel())
	   {
			// If has subtag, make current element an array and assign subtags as its elements
			if(in_array($this->getParentTag(), $this->hasSubTags))
			{
				if(! is_array($this->channels[$this->getParentTag()]))
				{
					$this->channels[$this->getParentTag()] = array();
				}

				$this->channels[$this->getParentTag()][$this->currentTag] .= strip_tags($this->unhtmlentities((trim($data))));
				return;
			}
			else
			{
				if(! in_array($this->currentTag, $this->hasSubTags))
				{
					$this->channels[$this->currentTag] .= strip_tags($this->unhtmlentities((trim($data))));
				}
			}

			if(!empty($this->currentAttr))
			{
				$this->channels[$this->currentTag . '_ATTRS'] = $this->currentAttr;

				//If the tag has no value
				if(strlen($this->channels[$this->currentTag]) < 2)
				{
					//If there is only one attribute, assign the attribute value as channel value
					if(count($this->currentAttr) == 1)
					{
						foreach($this->currentAttr as $attrVal)
						{
							$this->channels[$this->currentTag] = $attrVal;
						}
					}
					//If there are multiple attributes, assign the attributes array as channel value
					else
					{
						$this->channels[$this->currentTag] = $this->currentAttr;
					}
				}
			}
	   }
	   elseif($this->inItem())
	   {
			// If has subtag, make current element an array and assign subtags as its elements
			if(in_array($this->getParentTag(), $this->hasSubTags))
			{
				if(! is_array($this->items[$this->itemIndex][$this->getParentTag()]))
				{
					$this->items[$this->itemIndex][$this->getParentTag()] = array();
				}

				$this->items[$this->itemIndex][$this->getParentTag()][$this->currentTag] .= strip_tags($this->unhtmlentities((trim($data))));
				return;
			}
			else
			{
				if(! in_array($this->currentTag, $this->hasSubTags))
				{
					$this->items[$this->itemIndex][$this->currentTag] .= strip_tags($this->unhtmlentities((trim($data))));
				}
			}

			if(!empty($this->currentAttr))
			{
				$this->items[$this->itemIndex][$this->currentTag . '_ATTRS'] = $this->currentAttr;

				//If the tag has no value
				if(strlen($this->items[$this->itemIndex][$this->currentTag]) < 2)
				{
					//If there is only one attribute, assign the attribute value as feed element's value
					if(count($this->currentAttr) == 1)
					{
						foreach($this->currentAttr as $attrVal)
						{
							$this->items[$this->itemIndex][$this->currentTag] = $attrVal;
						}
					}
					//If there are multiple attributes, assign the attribute array as feed element's value
					else
					{
						$this->items[$this->itemIndex][$this->currentTag] = $this->currentAttr;
					}
				}
			}
	   }
	}

	/**
	* Find out the feed version
	* 
	* @access   private
	* @param    string  name of current tag
	* @param    array   array of attributes
	* @return   void
	*/   
	private function findVersion($tagName, $attrs)
	{
		$namespace = array_values($attrs);
		foreach($this->namespaces as $value => $version)
		{
			if(in_array($value, $namespace))
			{
				$this->version = $version;
				return;
			}    
		}
	}
	
	private function getParentTag()
	{
		return $this->insideItem[count($this->insideItem) - 2];
	}

	/**
	* Detect if current position is in channel element
	* 
	* @access   private
	* @return   bool
	*/   
	private function inChannel()
	{
		if($this->version == 'RSS 1.0')
		{
			if(in_array('CHANNEL', $this->insideItem) && $this->currentTag != 'CHANNEL')
			return TRUE;
		}
		elseif($this->version == 'RSS 2.0')
		{
			if(in_array('CHANNEL', $this->insideItem) && !in_array('ITEM', $this->insideItem) && $this->currentTag != 'CHANNEL')
			return TRUE;
		}
		elseif($this->version == 'ATOM 1')
		{
			if(in_array('FEED', $this->insideItem) && !in_array('ENTRY', $this->insideItem) && $this->currentTag != 'FEED')
			return TRUE;
		}
		
		return FALSE;
	}

	/**
	* Detect if current position is in Item element
	* 
	* @access   private
	* @return   bool
	*/    
	private function inItem()
	{
		if($this->version == 'RSS 1.0' || $this->version == 'RSS 2.0')
		{
			if(in_array('ITEM', $this->insideItem) && $this->currentTag != 'ITEM')
			return TRUE;
		}
		elseif($this->version == 'ATOM 1')
		{
			if(in_array('ENTRY', $this->insideItem) && $this->currentTag != 'ENTRY')
			return TRUE;
		}
		
		return FALSE;
	}   

	//This function is taken from lastRSS
	/**
	* Replace HTML entities &something; by real characters
	* 
	* 
	* @access   private
	* @author   Vojtech Semecky 
	* @link     http://lastrss.oslab.net/
	* @param    string
	* @return   string
	*/   
	private function unhtmlentities($string) 
	{
		// Get HTML entities table
		$trans_tbl = get_html_translation_table (HTML_ENTITIES, ENT_QUOTES);
		// Flip keys and values
		$trans_tbl = array_flip ($trans_tbl);
		// Add support for &apos; entity (missing in HTML_ENTITIES)
		$trans_tbl += array('&apos;' => "'");
		// Replace entities by values
		return strtr ($string, $trans_tbl);
	}
} //End class FeedParser
?>
</code>

Let's look at how this works. I assume you have a basic understanding of how an RSS/ATOM XML file is structured, because feed parsing here means reading multiple XML files, extracting their content and headers as arrays, and displaying them together.
The public functions above give you the following information from the XML feeds:
	1. $parser->getChannels()        - To get all channel elements as array
	2. $parser->getItems()           - To get all feed elements as array
	3. $parser->getChannel($name)    - To get a channel element by name
	4. $parser->getItem($index)      - To get a feed element as array by its index
	5. $parser->getTotalItems()      - To get the number of total feed elements
	6. $parser->getFeedVersion()     - To get the detected version of parsed feed
	7. $parser->getParsedUrl()       - To get the parsed feed URL
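
To make the shape of these return values concrete, here is a stripped-down sketch of the same event-driven approach. It is not the class itself. It parses a made-up inline RSS snippet (the feed content and the variable names $stack, $items and $index are my assumptions for illustration) and collects each item into an array keyed by upper-cased tag name, which is also how the arrays returned by getItems() are keyed.

```php
<?php
// Hypothetical sample feed, inlined so the sketch runs without network access.
$xml = <<<XML
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>
XML;

$stack = array();   // currently open tags, innermost last
$items = array();   // collected feed items
$index = 0;         // index of the item being filled

// Case folding is on by default, so tag names arrive upper-cased.
$parser = xml_parser_create();

xml_set_element_handler(
    $parser,
    // Start of a tag: remember that we are inside it.
    function ($p, $tag, $attrs) use (&$stack) {
        $stack[] = $tag;
    },
    // End of a tag: leave it, and move to the next item when an ITEM closes.
    function ($p, $tag) use (&$stack, &$index) {
        array_pop($stack);
        if ($tag === 'ITEM') {
            $index++;
        }
    }
);

xml_set_character_data_handler(
    $parser,
    // Text inside a tag: keep it only when the tag is a child of an ITEM.
    function ($p, $data) use (&$stack, &$items, &$index) {
        $current = end($stack);
        $text    = trim($data);
        if ($text !== '' && in_array('ITEM', $stack, true) && $current !== 'ITEM') {
            $previous = isset($items[$index][$current]) ? $items[$index][$current] : '';
            $items[$index][$current] = $previous . $text;
        }
    }
);

if (!xml_parse($parser, $xml, true)) {
    throw new Exception(xml_error_string(xml_get_error_code($parser)));
}

print_r($items);
```

Each element of $items ends up with TITLE and LINK keys. The channel title is skipped because it is not inside an ITEM; the real class stores it separately in its channels array.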

* One clarification: the class handles Dublin Core (the DC:DATE tag in the code above) because one of the feeds I was aggregating had DC elements, so don't let that confuse you.
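
As a quick illustration of why the three date tags can share one code path, the sketch below passes one sample value per tag through strtotime(), the same function the class uses. The sample date strings are invented; they follow the formats usually seen in RSS 2.0 (pubDate), Atom (updated) and Dublin Core (dc:date).

```php
<?php
// Hypothetical sample values, one per date tag the parser recognises.
// All three describe the same moment in different notations.
$samples = array(
    'PUBDATE' => 'Mon, 06 Sep 2010 16:45:00 +0000',   // RFC 822 style
    'UPDATED' => '2010-09-06T16:45:00Z',              // RFC 3339 style
    'DC:DATE' => '2010-09-06T16:45:00+00:00',         // W3C date-time style
);

// strtotime() turns each string into a Unix timestamp (or false on failure).
$timestamps = array_map('strtotime', $samples);

foreach ($timestamps as $tag => $timestamp) {
    echo str_pad($tag, 8), ' ', gmdate('Y-m-d H:i:s', $timestamp), " UTC\n";
}
```

Because every date becomes a plain timestamp, items from feeds of different formats can later be compared and sorted with ordinary integer comparison.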

Now let's have a look at the feeds.php file, which is a practical use of the FeedParser class created above.
<code>
<?php
include('feedparser.php');
$parser = new FeedParser();
$parser->parse('http://www.sitepoint.com/rss');
print_r($parser->getItems());   // all parsed items as an array
?>
</code>

The above is the most basic use of the class. To aggregate, create an array of feed URLs and call parse() for each one; because parse() frees the underlying XML parser when it finishes, use a new FeedParser instance per feed. You can go further by writing the aggregated items out as a new XML feed, and we will see how to do that next time. I look forward to your suggestions and feedback.
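
Until then, here is a minimal sketch of the merging step only, under some assumptions of mine: the function name aggregateItems(), the added SOURCE and TIMESTAMP keys, and the sample data are all hypothetical. The input arrays imitate what getItems() returns (upper-cased tag names as keys, date tags already converted to timestamps), so the sketch runs without fetching anything.

```php
<?php
/**
 * Merge the items of several feeds into one list, newest first.
 *
 * @param array $feeds  feed label => list of items shaped like getItems() output
 * @return array        all items, each with SOURCE and TIMESTAMP keys added
 */
function aggregateItems(array $feeds)
{
    $dateTags = array('PUBDATE', 'UPDATED', 'DC:DATE');
    $merged   = array();

    foreach ($feeds as $source => $items) {
        foreach ($items as $item) {
            $item['SOURCE']    = $source;
            $item['TIMESTAMP'] = 0;   // items without a date sort last

            // Use the first date tag the item actually carries.
            foreach ($dateTags as $tag) {
                if (!empty($item[$tag])) {
                    $item['TIMESTAMP'] = (int) $item[$tag];
                    break;
                }
            }
            $merged[] = $item;
        }
    }

    // Newest first.
    usort($merged, function ($a, $b) {
        if ($a['TIMESTAMP'] == $b['TIMESTAMP']) {
            return 0;
        }
        return ($a['TIMESTAMP'] > $b['TIMESTAMP']) ? -1 : 1;
    });

    return $merged;
}

// Made-up data standing in for two parsed feeds.
$feeds = array(
    'Feed A' => array(
        array('TITLE' => 'A old', 'PUBDATE' => 1000),
        array('TITLE' => 'A new', 'PUBDATE' => 3000),
    ),
    'Feed B' => array(
        array('TITLE' => 'B mid', 'UPDATED' => 2000),
    ),
);

$all = aggregateItems($feeds);

foreach ($all as $item) {
    echo $item['SOURCE'], ': ', $item['TITLE'], "\n";
}
```

With the real class, each inner array would come from calling getItems() on a separate FeedParser instance after parse().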