[DUG] web scraping using IHTMLDocument2
Cameron Hart
Cameron.Hart at flowsoftware.co.nz
Fri Jan 29 15:27:16 NZDT 2010
Given you mention the .FileName property I assume you are using
TXMLDocument. Forget that and use MSXML directly - it's much better, and you
can load a URL directly without first downloading it to a file. Import the
MSXML 6.0 type library to create MSXML2_TLB. You will probably find that most
web sites have xhtml tags but are still not valid XML. Try extracting from the
opening html tag down to the closing html tag and processing only that piece
as XML. With the website used in the sample below, if you download it to a
file and strip everything before the html tag, it will load properly. You
might be able to find a way around this; I haven't looked any further.
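A rough sketch of that strip-down idea (the helper name and the
TStringList-based file read here are illustrative only, not part of the
sample unit that follows; it assumes the same uses clause and a lowercase
<html> tag, as XHTML requires):

  // Sketch: keep only the <html>...</html> slice of a saved page and feed
  // it to loadXML instead of load.
  function LoadHtmlSliceAsXml(const aFileName: string): DOMDocument60;
  var
    sPage: string;
    iStart, iStop: Integer;
    oLines: TStringList;
  begin
    oLines := TStringList.Create;
    try
      oLines.LoadFromFile(aFileName);
      sPage := oLines.Text;
    finally
      oLines.Free;
    end;
    iStart := Pos('<html', sPage);
    iStop := Pos('</html>', sPage);
    if (iStart = 0) or (iStop = 0) then
      raise Exception.Create('No <html> element found in ' + aFileName);
    sPage := Copy(sPage, iStart, iStop - iStart + Length('</html>'));
    Result := CoDOMDocument60.Create;
    Result.async := False;
    Result.validateOnParse := False;
    if not Result.loadXML(sPage) then
      raise Exception.CreateFmt('loadXML failed: %s',
        [Result.parseError.reason]);
  end;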
unit Unit5;

interface

uses
  Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,
  Dialogs, StdCtrls, MSXML2_TLB;

type
  TForm5 = class(TForm)
    Button1: TButton;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

  EValidateXMLError = class(Exception)
  private
    FErrorCode: Integer;
    FReason: string;
  public
    constructor Create(aErrorCode: Integer; const aReason: string;
      const aLine, aChar, aFilePos: Integer;
      const aSrcText, aURL, aXPath: string);
    property ErrorCode: Integer read FErrorCode;
    property Reason: string read FReason;
  end;

var
  Form5: TForm5;

implementation

{$R *.dfm}

resourcestring
  RsValidateError = 'XML Validation Error (%.8x) Reason: %s XPath: %s Line: %d Char: %d File Pos: %d URL: %s Src Text: %s';

constructor EValidateXMLError.Create(aErrorCode: Integer; const aReason: string;
  const aLine, aChar, aFilePos: Integer; const aSrcText, aURL, aXPath: string);
begin
  inherited CreateResFmt(@RsValidateError, [aErrorCode, aReason, aXPath,
    aLine, aChar, aFilePos, aURL, aSrcText]);
  FErrorCode := aErrorCode;
  FReason := aReason;
end;

procedure TForm5.Button1Click(Sender: TObject);
var
  oXMLDoc: DOMDocument60;
  oError: IXMLDOMParseError2;
begin
  oXMLDoc := CoDOMDocument60.Create;
  oXMLDoc.async := False;
  oXMLDoc.setProperty('ProhibitDTD', True);
  oXMLDoc.resolveExternals := False;
  oXMLDoc.validateOnParse := False;
  // load() also accepts file paths; use loadXML to load XML from a string
  oXMLDoc.load('http://w3future.com/weblog/gems/xhtml2.xml');
  // Validation is off above, but you should still check for load errors.
  // This is different to validation though - check out schemacache if you
  // want to validate against an xsd.
  if oXMLDoc.parseError.errorCode <> S_OK then
  begin
    oError := oXMLDoc.parseError as IXMLDOMParseError2;
    raise EValidateXMLError.Create(oError.errorCode, oError.reason,
      oError.line, oError.linepos, oError.filepos,
      oError.srcText, oError.url, oError.errorXPath);
  end;
  ShowMessage(oXMLDoc.xml);
end;

end.
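On the schemacache comment above - if you do want XSD validation, a minimal
sketch looks something like the below (the namespace URI and .xsd path are
placeholders, not from anything above):

  // Sketch: validate against an XSD via XMLSchemaCache60.
  procedure LoadAndValidate(const aURL: string);
  var
    oXMLDoc: DOMDocument60;
    oSchemas: XMLSchemaCache60;
  begin
    oSchemas := CoXMLSchemaCache60.Create;
    oSchemas.add('http://example.com/mynamespace', 'c:\schemas\mydoc.xsd');
    oXMLDoc := CoDOMDocument60.Create;
    oXMLDoc.async := False;
    oXMLDoc.validateOnParse := True;            // validation wanted this time
    oXMLDoc.schemas := oSchemas as IDispatch;   // schemas is an OleVariant property
    oXMLDoc.load(aURL);
    if oXMLDoc.parseError.errorCode <> S_OK then
      raise Exception.CreateFmt('Validation failed: %s',
        [oXMLDoc.parseError.reason]);
  end;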
cameron
From: delphi-bounces at delphi.org.nz [mailto:delphi-bounces at delphi.org.nz]
On Behalf Of Alister Christie
Sent: Friday, 29 January 2010 2:40 p.m.
To: NZ Borland Developers Group - Delphi List
Subject: Re: [DUG] web scraping using IHTMLDocument2
Thanks Cameron,
It does indeed have that header. How do I make this work?
XMLDocument1.FileName := 'c:\temp\test.htm';
XMLDocument1.Active := True;
This gives me various errors. I suspect that the file is not valid XML - or
is there some other way of parsing it?
Alister Christie
Computers for People
Ph: 04 471 1849 Fax: 04 471 1266
http://www.salespartner.co.nz
PO Box 13085
Johnsonville
Wellington
Cameron Hart wrote:
Do you know if the websites are XHTML - do they have anything like the below
at the start of the page?
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
If they are, it would be easier to load them into XML documents and process
them that way using the MSXML DOMDocument60.
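For example, once a page has loaded into a DOMDocument60, something like the
sketch below could pull out the propertyInfo divs from your original question
(the "x" prefix is just a name I've bound to the XHTML namespace):

  // Sketch: query XHTML with XPath. XHTML elements live in the
  // http://www.w3.org/1999/xhtml namespace, so bind a prefix first.
  procedure ShowPropertyInfoDivs(const oXMLDoc: DOMDocument60);
  var
    oNodes: IXMLDOMNodeList;
    i: Integer;
  begin
    oXMLDoc.setProperty('SelectionNamespaces',
      'xmlns:x="http://www.w3.org/1999/xhtml"');
    oNodes := oXMLDoc.selectNodes('//x:div[@class="propertyInfo"]');
    for i := 0 to oNodes.length - 1 do
      ShowMessage(oNodes.item[i].text);   // e.g. 'Price: Negotiation'
  end;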
cameron
From: delphi-bounces at delphi.org.nz [mailto:delphi-bounces at delphi.org.nz]
On Behalf Of Alister Christie
Sent: Friday, 29 January 2010 12:22 p.m.
To: NZ Borland Developers Group - Delphi List
Subject: [DUG] web scraping using IHTMLDocument2
I'm trying to do some web page scraping using IHTMLDocument2, which is
working fairly well, and I can grab the second paragraph on a web page by
doing something like:
p := iDoc.all.tags('P');
if p.Length >= 2 then
  Result := p.Item(1).InnerText;
where iDoc is an instance of IHTMLDocument2.
However, say there is an HTML element like
<div class="propertyInfo">Price: <span>Negotiation</span></div>
How would I be able to find the divs where class="propertyInfo"? (if
anyone has much experience with IHTMLDocument2)
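A sketch of one way this might be done with IHTMLDocument2 itself (assuming
MSHTML is in the uses clause and iDoc is already loaded):

  // Sketch: walk the DIV collection and match on className.
  function FindPropertyInfoText(const iDoc: IHTMLDocument2): string;
  var
    oDivs: IHTMLElementCollection;
    oElem: IHTMLElement;
    i: Integer;
  begin
    Result := '';
    oDivs := iDoc.all.tags('DIV') as IHTMLElementCollection;
    for i := 0 to oDivs.length - 1 do
    begin
      oElem := oDivs.item(i, 0) as IHTMLElement;
      if SameText(oElem.className, 'propertyInfo') then
      begin
        Result := oElem.innerText;   // e.g. 'Price: Negotiation'
        Exit;
      end;
    end;
  end;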
--
Alister Christie
Computers for People
Ph: 04 471 1849 Fax: 04 471 1266
http://www.salespartner.co.nz
PO Box 13085
Johnsonville
Wellington