Scraping Data from Website in php - Stack Of Codes

Tuesday 19 December 2017

Scraping Data from a Website in PHP

One option is PHP Simple HTML DOM Parser. It's fast, easy to use and very flexible.
It loads an entire HTML page into an object, and you can then access any element of the page through that object.

Documentation: http://simplehtmldom.sourceforge.net/

For example, to get all images and links on the main Google page:

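// Load the library (adjust the path to wherever simple_html_dom.php lives)
require_once 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
if (!$html) {
    die('Could not load the page.');
}

// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}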
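
The same load-then-query idea can also be sketched without a third-party library. The example below swaps in PHP's bundled DOM extension (DOMDocument) in place of Simple HTML DOM, and parses an inline HTML string so that it runs without network access. The sample markup and the helper name collectAttributes() are assumptions made for illustration only.

```php
<?php
// Sketch using PHP's bundled DOM extension instead of Simple HTML DOM.
// collectAttributes() is a hypothetical helper, not part of any library.
function collectAttributes(string $html, string $tag, string $attribute): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Walk every <$tag> element and keep the attribute value when it is present.
    $values = [];
    foreach ($doc->getElementsByTagName($tag) as $element) {
        if ($element->hasAttribute($attribute)) {
            $values[] = $element->getAttribute($attribute);
        }
    }
    return $values;
}

// Sample markup (an assumption; a real scraper would fetch this from a URL).
$html = '<html><body>'
      . '<img src="logo.png"><a href="/about">About</a>'
      . '<a href="https://example.com/">Example</a><a>No link</a>'
      . '</body></html>';

$images = collectAttributes($html, 'img', 'src');
$links  = collectAttributes($html, 'a', 'href');

foreach ($images as $src) {
    echo $src, PHP_EOL;
}
foreach ($links as $href) {
    echo $href, PHP_EOL;
}
```

To scrape a live page with this sketch, the string could be obtained with file_get_contents() first, checking for a false return value before parsing.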
 
 

 
Alternatively, we can use the PHPPowertools/DOM-Query library.
Documentation: https://github.com/PHPPowertools/DOM-Query

Under the hood, it uses a customized version of Masterminds/html5-php to parse an HTML5 string into a DomDocument, and symfony/DomCrawler to convert CSS selectors to XPath selectors.
It always reuses the same DomDocument, even when passing one object to another, to ensure decent performance.
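
To make the CSS-to-XPath step more concrete, here is a minimal sketch that uses only PHP's bundled DOMDocument and DOMXPath classes. The XPath expression is a hand-written equivalent of the CSS selector div.foo; it illustrates the idea and is not claimed to be the exact expression the converter in symfony generates. The sample markup and the helper name selectByClass() are assumptions.

```php
<?php
// selectByClass() is a hypothetical helper that mimics the CSS selector "tag.class".
// $tag and $class are interpolated directly, so they must come from trusted input.
function selectByClass(DOMDocument $doc, string $tag, string $class): array
{
    $xpath = new DOMXPath($doc);

    // Pad the class attribute with spaces so "foo" matches "foo bar" but not "foobar".
    $query = sprintf(
        "//%s[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $tag,
        $class
    );

    // Collect the text content of every matching node.
    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Sample markup (an assumption made for this sketch).
$doc = new DOMDocument();
$doc->loadHTML(
    '<html><body><div class="foo">first</div><div class="foo bar">second</div>'
    . '<div class="foobar">third</div><p class="foo">fourth</p></body></html>'
);

// Equivalent in spirit to selecting 'div.foo' with a CSS selector.
$matches = selectByClass($doc, 'div', 'foo');
print_r($matches);
```

Only the first two div elements match: the third has the class foobar rather than foo, and the fourth element is a p, not a div.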


For example:
namespace PowerTools;

// Get file content
$pagecontent = file_get_contents( 'http://www.4wtech.com/csp/web/Employee/Login.csp' );

// Define your DOMCrawler based on file string
$H = new DOM_Query( $pagecontent );

// Define your DOMCrawler based on an existing DOM_Query instance
$H = new DOM_Query( $H->select('body') );

// Passing a string (CSS selector)
$s = $H->select( 'div.foo' );

// Passing an element object (DOM Element);
// $documentBody is assumed to hold a DOM element obtained elsewhere
$s = $H->select( $documentBody );

// Passing a DOM Query object
$s = $H->select( $H->select('p + p') );

// Select the body tag
$body = $H->select('body');

// Combine different classes as one selector to get all site blocks
$siteblocks = $body->select('.site-header, .masthead, .site-body, .site-footer');

// Nest your methods just like you would with jQuery
$siteblocks->select('button')->add('span')->addClass('icon icon-printer');

// Use a lambda function to set the text of all site blocks
$siteblocks->text(function( $i, $val) {
    return $i . " - " . $val->attr('class');
});

// Append the following HTML to all site blocks
$siteblocks->append('<div class="site-center"></div>');

// Use a descendant selector to select the site's footer
$sitefooter = $body->select('.site-footer > .site-center');

// Set some attributes for the site's footer
$sitefooter->attr(array('id' => 'aweeesome', 'data-val' => 'see'));

// Use a lambda function to set the attributes of all site blocks
$siteblocks->attr('data-val', function( $i, $val) {
    return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
});

// Select the parent of the site's footer
$sitefooterparent = $sitefooter->parent();

// Remove the class of all i-tags within the site's footer's parent
$sitefooterparent->select('i')->removeAttr('class');

// Wrap the site's footer within two new selectors
$sitefooter->wrap('<section><div class="footer-wrapper"></div></section>');



