[DUG] web scraping using IHTMLDocument2

Alister Christie alister at salespartner.co.nz
Fri Jan 29 16:25:51 NZDT 2010


Thanks, although it looks like the HTML documents are not valid XML, 
so I'll probably have to either parse the file manually or continue 
experimenting with IHTMLDocument2 (and hopefully find some documentation 
for it).
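
For the original question (finding the divs where class="propertyInfo"), 
a minimal, untested sketch of one possible approach with IHTMLDocument2 
is to walk the DIV collection and compare each element's className. It 
assumes the early-bound interfaces from Delphi's MSHTML unit; the unit 
name HtmlScrapeSketch and the procedure CollectDivTextByClass are made 
up for illustration.

```delphi
unit HtmlScrapeSketch; // hypothetical unit name, for illustration only

interface

uses
  Classes, SysUtils, MSHTML;

// Adds the innerText of every <div> whose class attribute equals
// AClassName to AResults. CollectDivTextByClass is an illustrative helper,
// not part of MSHTML.
procedure CollectDivTextByClass(const ADoc: IHTMLDocument2;
  const AClassName: string; AResults: TStrings);

implementation

procedure CollectDivTextByClass(const ADoc: IHTMLDocument2;
  const AClassName: string; AResults: TStrings);
var
  Divs: IHTMLElementCollection;
  Elem: IHTMLElement;
  I: Integer;
begin
  // tags() returns an IDispatch, so cast it back to a collection.
  Divs := ADoc.all.tags('DIV') as IHTMLElementCollection;
  for I := 0 to Divs.length - 1 do
  begin
    // With a numeric first argument, item() returns the element at that index.
    Elem := Divs.item(I, 0) as IHTMLElement;
    // className holds the whole class attribute. SameText compares without
    // regard to case, which is an assumption made for this sketch.
    if SameText(Elem.className, AClassName) then
      AResults.Add(Elem.innerText);
  end;
end;

end.
```

Something like CollectDivTextByClass(iDoc, 'propertyInfo', Memo1.Lines) 
would then list the text of each matching div, where Memo1 is an assumed 
TMemo. An element carrying several classes (class="propertyInfo other") 
would not match this exact comparison and would need the attribute split 
on spaces first.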


Alister Christie
Computers for People
Ph: 04 471 1849 Fax: 04 471 1266
http://www.salespartner.co.nz
PO Box 13085
Johnsonville
Wellington 



Cameron Hart wrote:
>
> Given you mention the .FileName property, I assume you are using 
> TXMLDocument. Forget that and use MSXML directly; it's much better, and 
> you could load a URL directly without first downloading it to a file. 
> Import the MSXML 6.0 type library to create MSXML2_TLB. You will 
> probably find that most web sites use XHTML tags but are still not 
> valid XML. Try extracting from the opening html tag down to the closing 
> tag and processing only that piece as XML. For the website used in the 
> sample below, if you download it to a file and strip the headings 
> before the html tag, it will load properly. You might be able to find a 
> way around this; I haven't looked any further.
>
> unit Unit5;
>
> interface
>
> uses
>   Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,
>   Dialogs, StdCtrls, MSXML2_TLB;
>
> type
>   TForm5 = class(TForm)
>     Button1: TButton;
>     procedure Button1Click(Sender: TObject);
>   private
>     { Private declarations }
>   public
>     { Public declarations }
>   end;
>
>   EValidateXMLError = class(Exception)
>   private
>     FErrorCode: Integer;
>     FReason: string;
>   public
>     constructor Create(aErrorCode: Integer; const aReason: string;
>       const aLine, aChar, aFilePos: Integer;
>       const aSrcText, aURL, aXPath: string);
>     property ErrorCode: Integer read FErrorCode;
>     property Reason: string read FReason;
>   end;
>
> var
>   Form5: TForm5;
>
> implementation
>
> {$R *.dfm}
>
> resourcestring
>   RsValidateError = 'XML Validation Error (%.8x) Reason: %s XPath: %s ' +
>     'Line: %d Char: %d File Pos: %d URL: %s Src Text: %s';
>
> constructor EValidateXMLError.Create(aErrorCode: Integer;
>   const aReason: string; const aLine, aChar, aFilePos: Integer;
>   const aSrcText, aURL, aXPath: string);
> begin
>   inherited CreateResFmt(@RsValidateError, [aErrorCode, aReason, aXPath,
>     aLine, aChar, aFilePos, aURL, aSrcText]);
>   FErrorCode := aErrorCode;
>   FReason := aReason;
> end;
>
> procedure TForm5.Button1Click(Sender: TObject);
> var
>   oXMLDoc: DOMDocument60;
>   oError: IXMLDOMParseError2;
> begin
>   oXMLDoc := CoDOMDocument60.Create;
>   oXMLDoc.async := False;
>   oXMLDoc.setProperty('ProhibitDTD', True);
>   oXMLDoc.resolveExternals := False;
>   oXMLDoc.validateOnParse := False;
>   // load() also accepts file paths; use loadXML() to load XML held in a string.
>   oXMLDoc.load('http://w3future.com/weblog/gems/xhtml2.xml');
>   // Validation is off above, but you should still check for load errors.
>   // That is different from validation; check out the schema cache if you
>   // want to validate against an XSD.
>   if oXMLDoc.parseError.errorCode <> S_OK then
>   begin
>     oError := oXMLDoc.parseError as IXMLDOMParseError2;
>     raise EValidateXMLError.Create(oError.errorCode, oError.reason,
>       oError.line, oError.linepos, oError.filepos,
>       oError.srcText, oError.url, oError.errorXPath);
>   end;
>   ShowMessage(oXMLDoc.xml);
> end;
>
> end.
>
> cameron
>
> *From:* delphi-bounces at delphi.org.nz 
> [mailto:delphi-bounces at delphi.org.nz] *On Behalf Of *Alister Christie
> *Sent:* Friday, 29 January 2010 2:40 p.m.
> *To:* NZ Borland Developers Group - Delphi List
> *Subject:* Re: [DUG] web scraping using IHTMLDocument2
>
> Thanks Cameron,
>
> It does indeed have that header, how do I make this work?
> XMLDocument1.FileName := 'c:\temp\test.htm';
> XMLDocument1.Active := True;
> This gives me various errors. I suspect that the file is not valid 
> XML; is there some other way of parsing it?
>
>
> Cameron Hart wrote:
>
> Do you know if the websites are XHTML? Do they have anything like the 
> below at the start of the page?
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
>
> <html xmlns="http://www.w3.org/1999/xhtml">
>
> If they do, it would be easier to load them into XML documents and 
> process them that way using the MSXML DOMDocument60.
>
> cameron
>
> *From:* delphi-bounces at delphi.org.nz 
> [mailto:delphi-bounces at delphi.org.nz] *On Behalf Of *Alister Christie
> *Sent:* Friday, 29 January 2010 12:22 p.m.
> *To:* NZ Borland Developers Group - Delphi List
> *Subject:* [DUG] web scraping using IHTMLDocument2
>
> I'm trying to do some web page scraping using IHTMLDocument2, which is 
> working fairly well and I can grab the second paragraph on a web page 
> by doing something like:
>
> p := iDoc.all.tags('P');
> if p.Length >= 2 then
> result := p.Item(1).InnerText;
>
> where iDoc is an instance of IHTMLDocument2.
>
> However, say there is an HTML element like
>
> <div class="propertyInfo">Price: <span>Negotiation</span></div>
>
> How would I be able to find the divs where class="propertyInfo"? (if 
> anyone has much experience with IHTMLDocument2)
>
> _______________________________________________
> NZ Borland Developers Group - Delphi mailing list
> Post: delphi at delphi.org.nz
> Admin: http://delphi.org.nz/mailman/listinfo/delphi
> Unsubscribe: send an email to delphi-request at delphi.org.nz with Subject: unsubscribe
