[DUG] web scraping using IHTMLDocument2

Fri Jan 29 18:35:59 NZDT 2010

There is a lot on MSDN:
http://msdn.microsoft.com/en-us/library/aa752574%28VS.85%29.aspx
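
For the original question (finding the divs where class="propertyInfo"), a minimal, untested sketch along these lines might work. It assumes the MSHTML unit that ships with Delphi; the unit name, the GetDivTextByClass name and the TStrings output parameter are made up for illustration.

```delphi
unit HtmlClassFilter; // hypothetical unit name

interface

uses
  Classes, MSHTML;

// Adds the inner text of every <div> whose class attribute is exactly
// aClassName to aResults. aDoc must already hold a loaded document.
procedure GetDivTextByClass(const aDoc: IHTMLDocument2;
  const aClassName: string; aResults: TStrings);

implementation

procedure GetDivTextByClass(const aDoc: IHTMLDocument2;
  const aClassName: string; aResults: TStrings);
var
  oDivs: IHTMLElementCollection;
  oElem: IHTMLElement;
  i: Integer;
begin
  // tags() returns an IDispatch, so cast it back to a collection.
  oDivs := aDoc.all.tags('DIV') as IHTMLElementCollection;
  for i := 0 to oDivs.length - 1 do
  begin
    oElem := oDivs.item(i, 0) as IHTMLElement;
    // Exact match only: an element with class="propertyInfo other"
    // would be skipped by this comparison.
    if oElem.className = aClassName then
      aResults.Add(oElem.innerText);
  end;
end;

end.
```

Calling GetDivTextByClass(iDoc, 'propertyInfo', Memo1.Lines) would then list the text of each matching div, assuming a TMemo named Memo1 is on the form.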

Cameron Hart | Development Manager | Flow Software Limited
P: +64 9 476 3579 | M: +64 21 222 3569 | E:
cameron.hart at flowsoftware.co.nz
PO Box 305-237, Triton Plaza, Auckland 0757, New Zealand |
www.flowsoftware.co.nz


-----Original Message-----
From: delphi-bounces at delphi.org.nz [mailto:delphi-bounces at delphi.org.nz]
On Behalf Of Alister Christie
Sent: Friday, 29 January 2010 4:26 p.m.
To: NZ Borland Developers Group - Delphi List
Subject: Re: [DUG] web scraping using IHTMLDocument2

Thanks, although it looks like the HTML documents are not XML compliant,
so I'll probably have to either parse the file manually or continue
experimenting with IHTMLDocument2 (and hopefully find some documentation
for it).

Alister Christie
Computers for People
Ph: 04 471 1849 Fax: 04 471 1266
http://www.salespartner.co.nz
PO Box 13085
Johnsonville
Wellington 

Cameron Hart wrote:
>
> Given you mention the .Filename property, I assume you are using
> TXMLDocument. Forget that and use MSXML directly; it's much better, and
> you can load a URL directly without first downloading it to a file.
> Import the MSXML 6.0 type library to create MSXML2_TLB. You will
> probably find that most web sites have XHTML tags but are still not
> valid XML. Try extracting from the html opening tag down to the closing
> tag and processing only that piece as XML. For the website used in the
> sample below, if you download it to a file and strip the headings
> before the html tag, it will load properly. You might be able to find
> a way around this; I haven't looked any further.
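>
> The "extract from the html opening tag down to the closing tag" step
> could be sketched roughly as follows. This is only an illustration:
> ExtractHtmlElement is a hypothetical helper name, it needs SysUtils
> for LowerCase, and it assumes the page has already been downloaded
> into a string.
>
> ```delphi
> // Hypothetical helper: returns the text from the first '<html' up to
> // and including the first '</html>', or '' if either tag is missing.
> function ExtractHtmlElement(const aSource: string): string;
> var
>   sLower: string;
>   iStart, iEnd: Integer;
> begin
>   Result := '';
>   // LowerCase only changes 'A'..'Z', so positions found in sLower
>   // are also valid positions in aSource.
>   sLower := LowerCase(aSource);
>   iStart := Pos('<html', sLower);
>   iEnd := Pos('</html>', sLower);
>   if (iStart > 0) and (iEnd > iStart) then
>     Result := Copy(aSource, iStart, iEnd - iStart + Length('</html>'));
> end;
> ```
>
> The result could then be passed to oXMLDoc.loadXML instead of calling
> oXMLDoc.load with a URL. It will still fail to parse if the markup
> between the tags is not well-formed XML.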
>
> unit Unit5;
>
> interface
>
> uses
>   Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,
>   Dialogs, StdCtrls, MSXML2_TLB;
>
> type
>   TForm5 = class(TForm)
>     Button1: TButton;
>     procedure Button1Click(Sender: TObject);
>   private
>     { Private declarations }
>   public
>     { Public declarations }
>   end;
>
>   EValidateXMLError = class(Exception)
>   private
>     FErrorCode: Integer;
>     FReason: string;
>   public
>     constructor Create(aErrorCode: Integer; const aReason: string;
>       const aLine, aChar, aFilePos: Integer;
>       const aSrcText, aURL, aXPath: string);
>     property ErrorCode: Integer read FErrorCode;
>     property Reason: string read FReason;
>   end;
>
> var
>   Form5: TForm5;
>
> implementation
>
> {$R *.dfm}
>
> resourcestring
>   RsValidateError = 'XML Validation Error (%.8x) Reason: %s XPath: %s ' +
>     'Line: %d Char: %d File Pos: %d URL: %s Src Text: %s';
>
> constructor EValidateXMLError.Create(aErrorCode: Integer;
>   const aReason: string; const aLine, aChar, aFilePos: Integer;
>   const aSrcText, aURL, aXPath: string);
> begin
>   inherited CreateResFmt(@RsValidateError, [aErrorCode, aReason, aXPath,
>     aLine, aChar, aFilePos, aURL, aSrcText]);
>   FErrorCode := aErrorCode;
>   FReason := aReason;
> end;
>
> procedure TForm5.Button1Click(Sender: TObject);
> var
>   oXMLDoc: DOMDocument60;
>   oError: IXMLDOMParseError2;
> begin
>   oXMLDoc := CoDOMDocument60.Create;
>   oXMLDoc.async := False;
>   oXMLDoc.setProperty('ProhibitDTD', True);
>   oXMLDoc.resolveExternals := False;
>   oXMLDoc.validateOnParse := False;
>
>   // load() accepts a URL and also loads file paths.
>   // Use oXMLDoc.loadXML to load XML held in a string.
>   oXMLDoc.load('http://w3future.com/weblog/gems/xhtml2.xml');
>
>   // Validation is off above, but you should still check for load errors.
>   // This is different from validation; check out schemacache if you
>   // want to validate against an XSD.
>   if oXMLDoc.parseError.errorCode <> S_OK then
>   begin
>     oError := oXMLDoc.parseError as IXMLDOMParseError2;
>     raise EValidateXMLError.Create(oError.errorCode, oError.reason,
>       oError.line, oError.linepos, oError.filepos,
>       oError.srcText, oError.url, oError.errorXPath);
>   end;
>
>   ShowMessage(oXMLDoc.xml);
> end;
>
> end.
>
> cameron
>
> *From:* delphi-bounces at delphi.org.nz 
> [mailto:delphi-bounces at delphi.org.nz] *On Behalf Of *Alister Christie
> *Sent:* Friday, 29 January 2010 2:40 p.m.
> *To:* NZ Borland Developers Group - Delphi List
> *Subject:* Re: [DUG] web scraping using IHTMLDocument2
>
> Thanks Cameron,
>
> It does indeed have that header. How do I make this work?
>
> XMLDocument1.FileName := 'c:\temp\test.htm';
> XMLDocument1.Active := True;
>
> This gives me various errors. I suspect that the file is not valid
> XML. Is there some other way of parsing it?
>
>
> Cameron Hart wrote:
>
> Do you know if the websites are XHTML? Do they have anything like the
> lines below at the start of the page?
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
>
> <html xmlns="http://www.w3.org/1999/xhtml">
>
> If they do, it would be easier to load them into XML documents and
> process them that way using the MSXML DOMDocument60.
>
> cameron
>
> *From:* delphi-bounces at delphi.org.nz
> [mailto:delphi-bounces at delphi.org.nz] *On Behalf Of *Alister Christie
> *Sent:* Friday, 29 January 2010 12:22 p.m.
> *To:* NZ Borland Developers Group - Delphi List
> *Subject:* [DUG] web scraping using IHTMLDocument2
>
> I'm trying to do some web page scraping using IHTMLDocument2, which is
> working fairly well, and I can grab the second paragraph on a web page
> by doing something like:
>
> p := iDoc.all.tags('P');
> if p.Length >= 2 then
>   result := p.Item(1).InnerText;
>
> where iDoc is an instance of IHTMLDocument2.
>
> However, say there is an HTML element like
>
> <div class="propertyInfo">Price: <span>Negotiation</span></div>
>
> How would I find the divs where class="propertyInfo"? (If
> anyone has much experience with IHTMLDocument2.)
>
_______________________________________________
NZ Borland Developers Group - Delphi mailing list
Post: delphi at delphi.org.nz
Admin: http://delphi.org.nz/mailman/listinfo/delphi
Unsubscribe: send an email to delphi-request at delphi.org.nz with Subject:
unsubscribe