Crawl Your Site for PageRanks
By dave

This tutorial is kinda like a review. It requires an assortment of PHP knowledge, covered in various tutorials already. So, before we get going, be sure you're familiar with these tutorials:
- Intro to PHP Sessions
- Regular Expressions in PHP
- Output While Script Is Running

Alright, let's DOOO IT!!

First, below is the code. It's almost identical to the site_pageranks function I put up, with some minor modifications to make it easier to use. You will want to understand how this function works, so that you can make your own tweaks.

<?php
session_start();

// BULK OF CODE START ***************
/*
This code is released unto the public domain
Raistlin Majere euclide@email.it
*/

define('GOOGLE_MAGIC', 0xE6359A60);

//unsigned shift right
function zeroFill($a, $b)
{
    $z = 0x80000000; // the 32-bit sign bit
    if ($z & $a)
    {
        $a = ($a>>1);
        $a &= (~$z);
        $a |= 0x40000000;
        $a = ($a>>($b-1));
    }
    else
    {
        $a = ($a>>$b);
    }
    return $a;
}

function mix($a,$b,$c) {
    $a -= $b; $a -= $c; $a ^= (zeroFill($c,13));
    $b -= $c; $b -= $a; $b ^= ($a<<8);
    $c -= $a; $c -= $b; $c ^= (zeroFill($b,13));
    $a -= $b; $a -= $c; $a ^= (zeroFill($c,12));
    $b -= $c; $b -= $a; $b ^= ($a<<16);
    $c -= $a; $c -= $b; $c ^= (zeroFill($b,5));
    $a -= $b; $a -= $c; $a ^= (zeroFill($c,3));
    $b -= $c; $b -= $a; $b ^= ($a<<10);
    $c -= $a; $c -= $b; $c ^= (zeroFill($b,15));
    
    return array($a,$b,$c);
}

function GoogleCH($url, $length=null, $init=GOOGLE_MAGIC) {
    if(is_null($length)) {
        $length = sizeof($url);
    }
    $a = $b = 0x9E3779B9;
    $c = $init;
    $k = 0;
    $len = $length;
    while($len >= 12) {
        $a += ($url[$k+0] +($url[$k+1]<<8) +($url[$k+2]<<16) +($url[$k+3]<<24));
        $b += ($url[$k+4] +($url[$k+5]<<8) +($url[$k+6]<<16) +($url[$k+7]<<24));
        $c += ($url[$k+8] +($url[$k+9]<<8) +($url[$k+10]<<16)+($url[$k+11]<<24));
        $mix = mix($a,$b,$c);
        $a = $mix[0]; $b = $mix[1]; $c = $mix[2];
        $k += 12;
        $len -= 12;
    }

    $c += $length;
    switch($len) /* all the case statements fall through */
    {
        case 11: $c+=($url[$k+10]<<24);
        case 10: $c+=($url[$k+9]<<16);
        case 9 : $c+=($url[$k+8]<<8);
        /* the first byte of c is reserved for the length */
        case 8 : $b+=($url[$k+7]<<24);
        case 7 : $b+=($url[$k+6]<<16);
        case 6 : $b+=($url[$k+5]<<8);
        case 5 : $b+=($url[$k+4]);
        case 4 : $a+=($url[$k+3]<<24);
        case 3 : $a+=($url[$k+2]<<16);
        case 2 : $a+=($url[$k+1]<<8);
        case 1 : $a+=($url[$k+0]);
        /* case 0: nothing left to add */
    }
    $mix = mix($a,$b,$c);
    /*-------------------------------------------- report the result */
    return $mix[2];
}

//converts a string into an array of integers containing the numeric value of the char
function strord($string) {
    $result = array(); // stays an empty array for an empty string
    for($i=0;$i<strlen($string);$i++) {
        $result[$i] = ord($string[$i]);
    }
    return $result;
}

function get_pr($url) {
    $result=array("",-1);
    
    if (($url.""!="")&&($url.""!="http://")):
    // check for protocol
        if (substr(strtolower($url),0,7)!="http://"):
            $url="http://".$url;
        endif;
        
        $url="info:".$url;
        $checksum=GoogleCH(strord($url));
        $google_url=sprintf("http://www.google.com/search?client=navclient-auto&ch=6%u&features=Rank&q=".$url,$checksum); // url to get from google
        
        $contents="";
        // let's get ranking
        // this way could cause problems because the Browser Useragent is not set...
        if ($handle=fopen($google_url,"rb")):
            while(true):
                $data=fread($handle,8192);
                if (strlen($data)==0):
                    break;
                endif;
                $contents.=$data;
            endwhile;
            fclose($handle);
        else:
            $contents="Connection unavailable";
        endif;
    
        $result[0]=$contents;
        // Rank_1:1:0 = 0
        // Rank_1:1:5 = 5
        // Rank_1:1:9 = 9
        // Rank_1:2:10 = 10 etc
        $p=explode(":",$contents);
        if (isset($p[2])):
            $result[1]=$p[2];
        endif;
    endif;
    
    return $result;
}

function google_pagerank($url) {
    $pr = get_pr($url);
    if($pr[1] == -1)
        $pr[1] = 0;
    return $pr[1];
}
// BULK OF CODE END ****************


// initialization
$_SESSION['urls'] = Array();
$_SESSION['pageranks'] = Array( Array(), Array(), Array(), Array(), Array(), Array(), Array() );

// recursive function
function site_pageranks($url, $domain) {

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    $html = curl_exec($ch);
    curl_close($ch);
    
    $html = str_replace("\n", '', $html);
    preg_match_all('/href="?\'?(.*?)"?\'?[ >]/i', $html, $m);
        
    foreach($m[1] AS $url) {
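        // skip anchors and non-HTTP schemes such as mailto: or javascript:
        if(preg_match('/^(#|mailto:|javascript:)/i', $url))
            continue;
        // local (relative) link: make it absolute before the domain check
        if(!preg_match('/^https?:\/\//i', $url))
            $url = "http://www.$domain/" . ltrim($url, '/');
        // only keep links inside $domain
        if(preg_match("/^https?:\/\/.*$domain/i", $url)) {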
            // get rid of PHPSESSID
            if(preg_match('/(\?PHPSESSID=\w+)$/i', $url, $m2))
                $url = str_replace($m2[1], '', $url);

            // check if url checked
            if(!in_array($url, $_SESSION['urls'])) {
                $_SESSION['urls'][] = $url;
                $pr = google_pagerank($url);
                $pr = trim($pr);
                $_SESSION['pageranks'][$pr][] = $url;          

                //echo "$pr:  $url\n";
                //if($pr > 1) // dont bother crawling is pr 0 or 1
                    site_pageranks($url, $domain);
            }
        }    
    }
}
?>

Don't worry about the bulk of the code. (It is based off this function: google_pagerank.) Just note that we start off by starting a session using session_start. This is because we will be storing values in some $_SESSION variables. Now, let's skip down to the bottom of the code, which is where the cool stuff happens. Here it is, copied again below:

<?php
// initialization
$_SESSION['urls'] = Array();
$_SESSION['pageranks'] = Array( Array(), Array(), Array(), Array(), Array(), Array(), Array() );

// recursive function
function site_pageranks($url, $domain) {

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    $html = curl_exec($ch);
    curl_close($ch);
    
    $html = str_replace("\n", '', $html);
    preg_match_all('/href="?\'?(.*?)"?\'?[ >]/i', $html, $m);
        
    foreach($m[1] AS $url) {
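        // skip anchors and non-HTTP schemes such as mailto: or javascript:
        if(preg_match('/^(#|mailto:|javascript:)/i', $url))
            continue;
        // local (relative) link: make it absolute before the domain check
        if(!preg_match('/^https?:\/\//i', $url))
            $url = "http://www.$domain/" . ltrim($url, '/');
        // only keep links inside $domain
        if(preg_match("/^https?:\/\/.*$domain/i", $url)) {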
            // get rid of PHPSESSID
            if(preg_match('/(\?PHPSESSID=\w+)$/i', $url, $m2))
                $url = str_replace($m2[1], '', $url);

            // check if url checked
            if(!in_array($url, $_SESSION['urls'])) {
                $_SESSION['urls'][] = $url;
                $pr = google_pagerank($url);
                $pr = trim($pr);
                $_SESSION['pageranks'][$pr][] = $url;          

                //echo "$pr:  $url\n";
                //if($pr > 1) // dont bother crawling is pr 0 or 1
                    site_pageranks($url, $domain);
            }
        }    
    }
}
?>

First, we need to initialize some variables. $_SESSION['urls'] stores all of the URLs that have already been crawled. This is so we don't waste our time redoing work, and also so that this script will eventually stop. $_SESSION['pageranks'] organizes all of the URLs by PageRank. Note that it is a 2-level array. The first key is the PageRank. The second level is the list of URLs that have that PageRank. Here is an example output of this array:

Array
(
[0] => Array
(
[0] => http://www.icemelon.com/php/strtok.htm
[1] => http://www.icemelon.com/php/chunk_split.htm
[2] => http://www.icemelon.com/php/array_values.htm
)
[1] => Array
(
[0] => http://www.icemelon.com/php/rtrim.htm
[1] => http://www.icemelon.com/php/mb_strcut.htm
[2] => http://www.icemelon.com/php/mcal_expunge.htm
)
[2] => Array
(
[0] => http://www.icemelon.com/php/spliti.htm
[1] => http://www.icemelon.com/php/split.htm
[2] => http://www.icemelon.com/php/explode.htm
[3] => http://www.icemelon.com/php/implode.htm
...

[4] => Array
(
[0] => http://www.icemelon.com/tutorials.php
[1] => http://www.icemelon.com/headlines.php
[2] => http://www.icemelon.com/coolsites.php
)
[5] => Array
(
[0] => http://www.icemelon.com/index.php
)

[6] => Array
(
)

)
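
Since the first key is the PageRank and the second level is just a list of URLs, counting the pages at each PageRank takes one short loop. Here is a minimal sketch, assuming an array shaped like the output above. The function name count_by_pagerank and the example.com sample data are made up for illustration:

```php
<?php
// Hypothetical helper: returns array(PageRank => number of URLs).
function count_by_pagerank($pageranks) {
    $counts = array();
    foreach ($pageranks as $pr => $urls) {
        $counts[$pr] = count($urls);
    }
    return $counts;
}

// Sample data in the same shape as $_SESSION['pageranks'] (illustrative only).
$sample = array(
    0 => array('http://www.example.com/a.htm', 'http://www.example.com/b.htm'),
    1 => array(),
    2 => array('http://www.example.com/c.htm'),
);

foreach (count_by_pagerank($sample) as $pr => $n) {
    echo "PR $pr: $n page(s)\n";
}
```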

Next, we define a function called site_pageranks. This is a recursive function that will do the following:
1. extract all internal links within the domain (determined by parameter 2) on a specified webpage (determined by parameter 1)
2. calculate the PageRank for each of these links
3. recursively call site_pageranks for each of these links

If you're unsure, a recursive function is a function that calls itself. That then begs the question of how will this function ever stop running? Note that the last IF clause in this function checks to see if the current link has already been crawled. If it has, then it will not recursively call site_pageranks. Thus, once all links have been crawled, this script will stop running.
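
To see why the "already crawled" check guarantees the recursion ends, here is a small sketch that swaps the real web pages for a hard-coded link graph. The $links array and the crawl function are assumptions for illustration only. The pages link to each other in a cycle, yet each one is visited exactly once:

```php
<?php
// Hypothetical link graph standing in for real pages: page => links found on it.
$links = array(
    'a' => array('b', 'c'),
    'b' => array('a', 'c'),   // links back to 'a' (a cycle)
    'c' => array('a'),
);

$visited = array();

// Looks at every link on $page and recurses into the ones not seen before.
function crawl($page, $links, &$visited) {
    foreach ($links[$page] as $next) {
        if (!in_array($next, $visited)) {
            $visited[] = $next;   // mark it first, so the cycle is cut
            crawl($next, $links, $visited);
        }
    }
}

crawl('a', $links, $visited);
print_r($visited);
```

Every recursive call adds one new entry to $visited, and there are only so many pages, so the calls have to run out.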

At the start of this function, we grab the HTML for a particular page (i.e. the URL specified by parameter 1). This is done using these cURL functions: curl_init, curl_setopt, curl_exec, and curl_close. (CURL is pretty cool stuff. I may cover it in another tutorial, but not right here. Note, though, you may need to install cURL yourself. It is not installed by default. Find help here.)
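
If you want to know when a page could not be fetched, you can wrap those same four cURL calls in a small function and check the result. This is only a sketch: the name fetch_html and the 10-second timeout are my own assumptions, not part of the code above.

```php
<?php
// Hypothetical wrapper around the same four cURL calls, plus a failure check.
// Returns the page body as a string, or null if the request failed.
function fetch_html($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);       // assumed value: give up after 10 seconds
    $html = curl_exec($ch);                      // false on failure
    curl_close($ch);
    return ($html === false) ? null : $html;
}

$html = fetch_html('http://www.icemelon.com');
if ($html === null) {
    echo "Could not fetch the page\n";
} else {
    echo strlen($html) . " bytes fetched\n";
}
```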

Once we have the HTML, stored in $html, we can parse it for the internal domain links. This can be easily done using regular expressions, namely the preg_match_all and preg_match functions. (This tutorial may also prove useful: Create Your Own BBCode). Note that we need to check for the PHPSESSID=string that is often appended to the end of link URLs. This is done using regular expressions as well:
if(preg_match('/(\?PHPSESSID=\w+)$/i', $url, $m2))
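
Here is a quick way to try both regular expressions on their own, outside the crawler. The HTML snippet is made up for illustration; the two patterns are the same ones used in site_pageranks:

```php
<?php
// Made-up HTML snippet for illustration.
$html = '<a href="http://www.example.com/page.htm?PHPSESSID=abc123">Page</a> '
      . "<a href='/about.htm'>About</a>";

// Same pattern as in site_pageranks: capture whatever follows href=.
preg_match_all('/href="?\'?(.*?)"?\'?[ >]/i', $html, $m);

$clean = array();
foreach ($m[1] as $url) {
    // Strip a trailing ?PHPSESSID=... so the same page is not counted twice.
    if (preg_match('/(\?PHPSESSID=\w+)$/i', $url, $m2)) {
        $url = str_replace($m2[1], '', $url);
    }
    $clean[] = $url;
}

print_r($clean);
```

The first link comes out without its session ID, and the second one is left alone because it has none.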

You probably can already tell that this script may take a while to run, especially if your site is huge! If this is the case, you may want to skip over crawling pages that have low PRs. If so, then uncomment this first line:
//if($pr > 1) // dont bother crawling is pr 0 or 1
     site_pageranks($url, $domain);
Doing this will stop the script from crawling pages that have a PR of 0 or 1.

Now, let's look at the last several lines of the code, which go at the very end of the script, after the site_pageranks function:
set_time_limit(0);
ob_implicit_flush(true);
echo '<xmp>';
site_pageranks('http://www.icemelon.com', 'icemelon.com');
print_r($_SESSION['pageranks']);

It's very important to use set_time_limit so that your script won't time out after 30 seconds. We also set ob_implicit_flush to true. This is because we echo the URL-PR pairs as they are being crawled. This is done in this line, which is commented out in the code above:
echo "$pr:  $url\n";

Finally, we call the function to get things started:
site_pageranks('http://www.icemelon.com', 'icemelon.com');

When all is said and done, we print out the array where everything is being stored:
print_r($_SESSION['pageranks']);

It may be a good idea to extract $_SESSION['pageranks'] using a different script and manipulate the data there.
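
One way that second script might look is sketched below. It assumes it runs in the same browser session as the crawler, so that $_SESSION['pageranks'] is still there. The helper name pageranks_to_rows and the "PR,URL" line format are my own choices for illustration:

```php
<?php
session_start();

// Hypothetical helper: flattens array(PR => array(URL, ...)) into "PR,URL" lines,
// highest PageRank first.
function pageranks_to_rows($pageranks) {
    krsort($pageranks);   // sort by key (the PageRank), descending
    $rows = array();
    foreach ($pageranks as $pr => $urls) {
        foreach ($urls as $url) {
            $rows[] = "$pr,$url";
        }
    }
    return $rows;
}

// Fall back to an empty array if the crawler has not run in this session yet.
$pageranks = isset($_SESSION['pageranks']) ? $_SESSION['pageranks'] : array();

header('Content-Type: text/plain');
echo implode("\n", pageranks_to_rows($pageranks));
```

From there the rows can be pasted into a spreadsheet or written to a file.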

Well, good luck, my friend. Please link back to IceMelon.com to increase our PR! =)

P.S. Check out my new site TheManWhoSoldtheWeb.com, where I publish guides and scripts on Internet Marketing and SEO. Here is a limited time freebie: the Rapid Google Indexer.


  © 2005-2010 Icemelon.com   Email: dave[AT]icemelon[D0T]c0m