Xmlparser error...

ezmoney
Junior Member
Posts: 69

Post by ezmoney on Feb 23, 2013 0:03:41 GMT -5


I was trying out the xmlparser.

The input code was...
url$ = "http://url.com/" ' <--- not the real one but representative.

xml$ = httpget$(url$)

xmlparser #rxx, xml$

The error printout was...


Runtime Error in program 'untitled': xmlparser #rxx, xml$HttpClientError

What am I doing wrong? ???
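
One quick way to separate a fetch problem from a parse problem is to hand xmlparser a tiny well-formed document first; if that parses cleanly, the trouble is in what httpget$ returned rather than in the parser itself. A minimal sketch (the test string is made up):

xml$ = "<root><item>hello</item></root>"
xmlparser #test, xml$
print "parsed ok"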

neal
Full Member
Posts: 104

Post by neal on Feb 25, 2013 0:13:54 GMT -5

Try this:

url$ = "http://url.com/" ' <--- not the real one but representative.

xml$ = httpget$(url$)
xml$ = mid$(xml,ドル instr(xml,ドル "<html")) ' keep everything from "<html" onward

xmlparser #rxx, xml$

The extra line removes the <!DOCTYPE ...> line - this seems to overcome the HttpClientError, but I then get other errors - HTML is not XML, I guess :)
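
If the page ever comes back without an "<html" tag, instr returns 0 and the mid$ would start at an invalid position, so a guard may be worth adding. A minimal sketch of the same idea with the check:

url$ = "http://url.com/" ' <--- not the real one but representative.

xml$ = httpget$(url$)
p = instr(xml,ドル "<html")
if p > 0 then xml$ = mid$(xml,ドル p) ' strip the <!DOCTYPE ...> header only when <html is present

xmlparser #rxx, xml$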
ezmoney
Junior Member
Posts: 69

Post by ezmoney on Mar 1, 2013 7:42:20 GMT -5


That works well... I understand better now... :D

Now I have a funny thing happening... :o

I put in the url string for page 1 and retrieved the data..

Then I put in the url string for page 2 but got back the page 1 links..

All 6 pages came back as page 1... only the page number
was different in the get string...

Why it does this has me stumped... ???

Just so I would not make a typo, I used copy and paste to
replace the URL links... I don't understand this one.

Also I thought maybe the previous return was staying in memory..
So just before I did the httpget

I put in
ret$ = ""
ret$ = httpget$(url$)

Thus the return had to come from the url$ that httpget$ fetched..

Anybody got any ideas? Why does it do this, and how do I fix it?

Things to try, or ideas that will show why it is not picking up the
correct url$ page?

There is a bug in the works someplace... It has me stumped. >:(

Thanks..
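
One possibility worth checking is that something between the program and the site (a proxy, or the site itself) is caching the first page. A minimal sketch of that test, assuming the site ignores unknown query parameters (the URL pattern and parameter name here are made up):

for page = 1 to 6
    git$ = "http://url.com/list?page=" + str$(page) ' placeholder pattern, not the real link
    git$ = git$ + "&nocache=" + str$(int(rnd(1) * 1000000)) ' throwaway value so a cache cannot reuse page 1
    print git$ ' confirm the URL really does change each time
    ret$ = lower$(httpget$(git$))
    print "page "; page; " length="; len(ret$) ' identical lengths on every page point back at caching
next page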

meerkat
Senior Member
Posts: 250

Post by meerkat on Mar 1, 2013 16:05:54 GMT -5

Not sure exactly what you are trying to do, but maybe this will help.

Extracting stuff from html is a common practice with a language like RB that deals with the web.
Remember that web pages are free format.
So don't expect to see perfect tags. If you look for "<a href" you may not find it. It could be written with extra spaces or a tab, as "<a  href" or "<a(tab)href", or split across lines such as
"<a
href", or with different capitalization like "<A href", "<a Href", or "<A HrEf".

You need to get as decent a web page as possible.
First, change <CR> and <LF> to a single space.
Reduce multiple spaces to a single space. Since letter case may be a problem, you should either make it all upper or all lower case.
If you want to maintain the original case, write the case-changed text to a new variable. That way you can search the case-changed copy but use the offsets to get data from the unchanged web page.
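
A minimal sketch of that cleanup, assuming webPage$ already holds the raw page (the character loop is slow on large pages, but it shows the idea):

clean$ = ""
for k = 1 to len(webPage$)
    c$ = mid$(webPage,ドル k, 1)
    if c$ = chr$(13) or c$ = chr$(10) then c$ = " " ' turn CR and LF into spaces
    clean$ = clean$ + c$
next k
while instr(clean,ドル "  ") > 0 ' collapse runs of spaces down to one
    p = instr(clean,ドル "  ")
    clean$ = left$(clean,ドル p) + mid$(clean,ドル p + 2)
wend
searchPage$ = lower$(clean$) ' search this copy; pull the real data from clean$ at the same offsets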

When finding all "<a href" matches you need to start each instr() just past the last one.
For example:
i = instr(webPage,ドル "<a href")
while i > 0
    j = instr(webPage,ドル ">", i)           ' find the closing > from the current match, not from the start
    hrefData$ = mid$(webPage,ドル i, j - i)  ' the tag text up to, but not including, the >
    ....

    i = instr(webPage,ドル "<a href", i + 1) ' next match, starting just past this one
wend

HTH
ezmoney
Junior Member
Posts: 69

Post by ezmoney on Mar 2, 2013 6:34:27 GMT -5

I convert everything to lower case at the fetch:

ret$ = lower$(httpget$(git$))

The problem is that when I change git$ from page 1 to page 2 I get the same return...
Thus, for some reason, page 1 is returned regardless of the page number.. that is the issue..

Any operations below the httpget$(git$) fetch would not affect what was returned.

I search for the URL link, which is rather lengthy... as it contains the URL address..
Generally the web site address is in the link.
And if there are none found.. I'm done... I'm out of that search.

Each record begins with "http://www.abcdefghijklmnopqrst....xyz"
and generally ends with "</a>".

Somewhere in that record between "http" and "</a>" is a quote mark, thus the link is everything before the quote:

search$ = "http://www.thesearchsiteaddress"

v = 1
[again]
ifr = instr(url,ドル search,ドル v)
if ifr < 1 then [nxturl] ' if not found then I'm done, the search is over...
v = ifr + len(search$) + 1 ' this moves the search on to the next search point
ito = instr(url,ドル "</a>", v) ' find the end string from the last point v
if ito < 1 then [nxturl] ' there is no end to the record; malformed for some reason, or..
v = ito ' move the v pointer up to the last match found.
print " found end of record at="; ito


' the return is the link plus whatever comes after it..

ret$ = mid$(url,ドル ifr, ito - ifr) ' assign to ret$


' find the quote mark..
iqt = instr(ret,ドル chr$(34)) ' quote mark position
if iqt < 1 then [skipprt] ' no quote mark, skip it; it must be a malformed link.
lnk$ = left$(ret,ドル iqt - 1) ' the web link is everything left$ of the quote mark
dis$ = mid$(ret,ドル iqt + 1) ' the text or description is everything to the right.
dis$ = ascii$(dis$) ' this function cleans up all the unprintables
print st$; " "; lnk$; space$(5); dis$ ' this forms the new record to save.
rc = rc + 1 ' this counts the records for the input link page
trc = trc + 1 ' this counts the total records found
print #4, st$; " "; lnk$; space$(5); dis$; " "; date$("yyyy/mm/dd") ' this saves the output to file #4
[skipprt]
goto [again] ' look for the next record
After development, when all is working, I turn the extra prints into comments; if anything looks funny later I can quickly bring them back into the printing and see more of what is happening.
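
An alternative to commenting the prints in and out is to gate them behind a flag (the name debug here is made up); then one line at the top switches all the tracing on or off:

debug = 1 ' set to 0 to silence the tracing prints
....
if debug then print " found end of record at="; ito
if debug then print st$; " "; lnk$; space$(5); dis$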