[DUG] web scraping using IHTMLDocument2
Cameron Hart
Cameron.Hart at flowsoftware.co.nz
Fri Jan 29 15:27:16 NZDT 2010
Given you mention the .FileName property I assume you are using
TXMLDocument. Forget that and use MSXML directly - it's much better, and you
can load a URL directly without first downloading it to a file. Import the
MSXML 6.0 type library to create MSXML2_TLB. You will probably find that most
web sites have xhtml tags but are still not valid XML. Try extracting from the
opening html tag down to the closing html tag and processing only that piece
as XML. With the website used in the sample below, if you download it to a
file and strip everything before the html tag, it will load properly. You
might be able to find a way around this; I haven't looked any further.
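A rough sketch of that strip-down idea (the helper name and the
TStringList-based file read here are illustrative only, not part of the
sample unit that follows; it assumes the same uses clause and a lowercase
<html> tag, as XHTML requires):

  // Sketch: keep only the <html>...</html> slice of a saved page and feed
  // it to loadXML instead of load.
  function LoadHtmlSliceAsXml(const aFileName: string): DOMDocument60;
  var
    sPage: string;
    iStart, iStop: Integer;
    oLines: TStringList;
  begin
    oLines := TStringList.Create;
    try
      oLines.LoadFromFile(aFileName);
      sPage := oLines.Text;
    finally
      oLines.Free;
    end;
    iStart := Pos('<html', sPage);
    iStop := Pos('</html>', sPage);
    if (iStart = 0) or (iStop = 0) then
      raise Exception.Create('No <html> element found in ' + aFileName);
    sPage := Copy(sPage, iStart, iStop - iStart + Length('</html>'));
    Result := CoDOMDocument60.Create;
    Result.async := False;
    Result.validateOnParse := False;
    if not Result.loadXML(sPage) then
      raise Exception.CreateFmt('loadXML failed: %s',
        [Result.parseError.reason]);
  end;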
unit Unit5;

interface

uses
  Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,
  Dialogs, StdCtrls, MSXML2_TLB;

type
  TForm5 = class(TForm)
    Button1: TButton;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

  EValidateXMLError = class(Exception)
  private
    FErrorCode: Integer;
    FReason: string;
  public
    constructor Create(aErrorCode: Integer; const aReason: string;
      const aLine, aChar, aFilePos: Integer;
      const aSrcText, aURL, aXPath: string);
    property ErrorCode: Integer read FErrorCode;
    property Reason: string read FReason;
  end;

var
  Form5: TForm5;

implementation

{$R *.dfm}

resourcestring
  RsValidateError = 'XML Validation Error (%.8x) Reason: %s XPath: %s Line: %d Char: %d File Pos: %d URL: %s Src Text: %s';

constructor EValidateXMLError.Create(aErrorCode: Integer; const aReason: string;
  const aLine, aChar, aFilePos: Integer; const aSrcText, aURL, aXPath: string);
begin
  inherited CreateResFmt(@RsValidateError, [aErrorCode, aReason, aXPath,
    aLine, aChar, aFilePos, aURL, aSrcText]);
  FErrorCode := aErrorCode;
  FReason := aReason;
end;

procedure TForm5.Button1Click(Sender: TObject);
var
  oXMLDoc: DOMDocument60;
  oError: IXMLDOMParseError2;
begin
  oXMLDoc := CoDOMDocument60.Create;
  oXMLDoc.async := False;
  oXMLDoc.setProperty('ProhibitDTD', True);
  oXMLDoc.resolveExternals := False;
  oXMLDoc.validateOnParse := False;
  // load() also accepts file paths; use loadXML to load XML from a string
  oXMLDoc.load('http://w3future.com/weblog/gems/xhtml2.xml');
  // Validation is off above, but you should still check for load errors.
  // This is different to validation though - check out schemacache if you
  // want to validate against an xsd.
  if oXMLDoc.parseError.errorCode <> S_OK then
  begin
    oError := oXMLDoc.parseError as IXMLDOMParseError2;
    raise EValidateXMLError.Create(oError.errorCode, oError.reason,
      oError.line, oError.linepos, oError.filepos,
      oError.srcText, oError.url, oError.errorXPath);
  end;
  ShowMessage(oXMLDoc.xml);
end;

end.
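On the schemacache comment above - if you do want XSD validation, a minimal
sketch looks something like the below (the namespace URI and .xsd path are
placeholders, not from anything above):

  // Sketch: validate against an XSD via XMLSchemaCache60.
  procedure LoadAndValidate(const aURL: string);
  var
    oXMLDoc: DOMDocument60;
    oSchemas: XMLSchemaCache60;
  begin
    oSchemas := CoXMLSchemaCache60.Create;
    oSchemas.add('http://example.com/mynamespace', 'c:\schemas\mydoc.xsd');
    oXMLDoc := CoDOMDocument60.Create;
    oXMLDoc.async := False;
    oXMLDoc.validateOnParse := True;            // validation wanted this time
    oXMLDoc.schemas := oSchemas as IDispatch;   // schemas is an OleVariant property
    oXMLDoc.load(aURL);
    if oXMLDoc.parseError.errorCode <> S_OK then
      raise Exception.CreateFmt('Validation failed: %s',
        [oXMLDoc.parseError.reason]);
  end;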
cameron
From: delphi-bounces at delphi.org.nz [mailto:delphi-bounces at delphi.org.nz]
On Behalf Of Alister Christie
Sent: Friday, 29 January 2010 2:40 p.m.
To: NZ Borland Developers Group - Delphi List
Subject: Re: [DUG] web scraping using IHTMLDocument2
Thanks Cameron,
It does indeed have that header. How do I make this work?
XMLDocument1.FileName := 'c:\temp\test.htm';
XMLDocument1.Active := True;
This gives me various errors. I suspect that the file is not valid XML - or
is there some other way of parsing it?
Alister Christie
Computers for People
Ph: 04 471 1849 Fax: 04 471 1266
http://www.salespartner.co.nz
PO Box 13085
Johnsonville
Wellington
Cameron Hart wrote:
Do you know if the websites are XHTML - do they have anything like the below
at the start of the page?
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
If they are, it would be easier to load them into XML documents and process
them that way using the MSXML DOMDocument60.
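For example, once a page has loaded into a DOMDocument60, something like the
sketch below could pull out the propertyInfo divs from your original question
(the "x" prefix is just a name I've bound to the XHTML namespace):

  // Sketch: query XHTML with XPath. XHTML elements live in the
  // http://www.w3.org/1999/xhtml namespace, so bind a prefix first.
  procedure ShowPropertyInfoDivs(const oXMLDoc: DOMDocument60);
  var
    oNodes: IXMLDOMNodeList;
    i: Integer;
  begin
    oXMLDoc.setProperty('SelectionNamespaces',
      'xmlns:x="http://www.w3.org/1999/xhtml"');
    oNodes := oXMLDoc.selectNodes('//x:div[@class="propertyInfo"]');
    for i := 0 to oNodes.length - 1 do
      ShowMessage(oNodes.item[i].text);   // e.g. 'Price: Negotiation'
  end;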
cameron
From: delphi-bounces at delphi.org.nz [mailto:delphi-bounces at delphi.org.nz]
On Behalf Of Alister Christie
Sent: Friday, 29 January 2010 12:22 p.m.
To: NZ Borland Developers Group - Delphi List
Subject: [DUG] web scraping using IHTMLDocument2
I'm trying to do some web page scraping using IHTMLDocument2, which is
working fairly well, and I can grab the second paragraph on a web page by
doing something like:
p := iDoc.all.tags('P');
if p.Length >= 2 then
  Result := p.Item(1).InnerText;
where iDoc is an instance of IHTMLDocument2.
However, say there is an HTML element like
<div class="propertyInfo">Price: <span>Negotiation</span></div>
How would I be able to find the divs where class="propertyInfo"? (if
anyone has much experience with IHTMLDocument2)
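A sketch of one way this might be done with IHTMLDocument2 itself (assuming
MSHTML is in the uses clause and iDoc is already loaded):

  // Sketch: walk the DIV collection and match on className.
  function FindPropertyInfoText(const iDoc: IHTMLDocument2): string;
  var
    oDivs: IHTMLElementCollection;
    oElem: IHTMLElement;
    i: Integer;
  begin
    Result := '';
    oDivs := iDoc.all.tags('DIV') as IHTMLElementCollection;
    for i := 0 to oDivs.length - 1 do
    begin
      oElem := oDivs.item(i, 0) as IHTMLElement;
      if SameText(oElem.className, 'propertyInfo') then
      begin
        Result := oElem.innerText;   // e.g. 'Price: Negotiation'
        Exit;
      end;
    end;
  end;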
--
Alister Christie
Computers for People
Ph: 04 471 1849 Fax: 04 471 1266
http://www.salespartner.co.nz
PO Box 13085
Johnsonville
Wellington