[RELEASE] Web Scraper Device

A simple device driver to scrape data from a website. The driver reads the site, searches for a user-provided string, and returns data based on offsets from that string. It can be used for a one-time scrape or set up as a periodic website check.
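
To make the offset idea concrete, here is a minimal Python sketch of "find a string, then return data at offsets from it". It is not the driver's Groovy code. The function name `scrape_by_offset` and the choice to measure both offsets from the start of the match are assumptions made for illustration, so the driver's actual offset semantics may differ.

```python
def scrape_by_offset(page, search, begin_offset, end_offset):
    """Return (found, text) for the slice located relative to `search`.

    Assumption: both offsets count characters from the START of the match.
    """
    idx = page.find(search)
    if idx == -1:
        # Search string is not on the page.
        return False, ""
    # Clamp the slice so bad offsets cannot run past either end of the page.
    start = max(0, idx + begin_offset)
    end = min(len(page), idx + end_offset)
    return True, page[start:end]


page = '<span id="temp">72F</span>'
found, value = scrape_by_offset(page, 'id="temp">', 10, 13)
print(found, value)
```

Here the search string is 10 characters long, so a begin offset of 10 starts the slice right after it.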

Warning: Depending on the site and the search criteria, this can be a long-running task. The successful attribute will display running while the scrape is in progress, and then either true or false depending on whether the search finds the requested string.

Device code is available through HPM or for import from:
https://raw.githubusercontent.com/thebearmay/hubitat/main/webScraper.groovy

@thebearmay

Thanks for the driver. Can you search for items inside an element's tag, i.e., inside the <>, like the item below?

<a class="widget-link badge-notification unread-notifications" href="" title="1 unseen notification">1</a>

That is, can I search for "badge-notification"?

Also, check line 107; "debug" is misspelled.

Alan

The driver reads the page in raw mode, so it allows searches not only on what is seen on the screen but also on all of the HTML/JS/CSS code that is not seen.

Fixed line 107 if you want to do an HPM repair. Thanks for the heads-up.

Is there a way of seeing the raw data behind the returned data? It's a little frustrating trying to figure out the start and end offsets.

How about an option to pull between > and <?

Alan

If you look at the source code brought back, there are several hundred < and > characters (many of which you don't see unless you view the raw code), which makes using them somewhat problematic. I might be able to do a raw code preview; let me play a little and see. In the meantime, if you pull down the File Manager Device driver and use the xferFile command to save the page you're interested in as a *.txt file on the hub, you could then look at the raw source from the hub File Manager.
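
One way to keep the many < and > characters manageable is to anchor on the search string first and only then take the text between the next > and the following <. The Python sketch below illustrates that idea only. It is not a feature of the driver, and the function name `text_after_match` is hypothetical.

```python
def text_after_match(page, search):
    """Find `search`, then return the text between the next '>' and the following '<'.

    Returns None if the search string or either delimiter is missing.
    """
    idx = page.find(search)
    if idx == -1:
        return None
    # Look for the '>' that closes the tag containing the match.
    gt = page.find(">", idx + len(search))
    if gt == -1:
        return None
    # The element text ends at the next '<'.
    lt = page.find("<", gt + 1)
    if lt == -1:
        return None
    return page[gt + 1:lt]


html = ('<a class="widget-link badge-notification unread-notifications" '
        'href="" title="1 unseen notification">1</a>')
print(text_after_match(html, "badge-notification"))
```

This only works when the search string sits inside the opening tag of the element whose text is wanted, and when no attribute value contains a literal > character.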

@alan564923 v0.0.3 gives an option that will open a new browser window to show the raw source code. It does create a temporary file on the hub and then zeros it back out after displaying it (file entry for scrapeWork.txt will remain with a size of ~0.001 KB).

I think I have a use for this but I want to double check....

I have an ESP32-based hot tub controller with a local GUI. One section of the GUI mirrors exactly what is on the physical display. We sometimes get a pesky error code on the display, and it would be really awesome if I could pull that from the GUI. The virtual display is where it says [99]; an error changes that to [e02]. Can I use this to detect that?

Feature request:
Give it one URL plus a username and password to log into a site and get a cookie,
then a second URL to do the scraping once logged in.
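
As a rough illustration of that two-step flow, here is a Python sketch using the standard library's cookie-aware opener: post the login form, let the cookie jar capture the session cookie, then fetch the second URL with the same opener. The helper names and the form field names (`username`, `password`, `submit`) are hypothetical, and real sites often need different fields. A login page that depends on JavaScript would not work with a plain form post like this.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_cookie_opener():
    """Build an opener that stores cookies from responses and resends them."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def login_then_fetch(opener, login_url, fields, page_url):
    """POST `fields` to `login_url`, then GET `page_url` with the same opener."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    # Step 1: submit the login form; the opener keeps any session cookie.
    with opener.open(login_url, data=body, timeout=30) as resp:
        resp.read()
    # Step 2: fetch the page to scrape; stored cookies are sent automatically.
    with opener.open(page_url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Usage would look like `login_then_fetch(make_cookie_opener(), login_url, {"username": "...", "password": "...", "submit": "Login"}, page_url)`, with the returned text then handed to whatever search is needed.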

Should be able to check for the e02 without too much effort…

That may be doable…

@kahn-hubitat Looking at it, it should function pretty much the way I handle Hub Security, so very doable. Just need to be able to keep my cookies straight when dealing with Hub Security and this function.

Dumb question chit on the table...

What were the use cases that you were thinking of when you created this, @thebearmay?

The code grew out of getting the Z-Wave status for a small driver I wrote, but there have been a couple of requests over the last few months where someone wanted to be alerted when a phrase like “Red Flag Warning” or similar appeared on an official website.
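
For the phrase-alert use case, the core check can be sketched in a few lines of Python. This illustrates the idea and is not the driver's code. The function names are hypothetical, and matching case-insensitively is an assumption. The `fetch` parameter exists only so the check can be tried without a network call.

```python
import urllib.request


def contains_phrase(page, phrase):
    """Case-insensitive check for `phrase` anywhere in the raw page text."""
    return phrase.casefold() in page.casefold()


def check_site(url, phrase, fetch=None):
    """Download `url` (or use the supplied `fetch` callable) and look for `phrase`."""
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=30) as resp:
                return resp.read().decode("utf-8", errors="replace")
    return contains_phrase(fetch(url), phrase)


sample = "<h1>RED FLAG WARNING in effect until 8 PM</h1>"
print(contains_phrase(sample, "Red Flag Warning"))
```

Running `check_site` on a schedule and acting when the result flips to true would give the periodic alert described.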

Yeah, I was looking at logging into TeslaFi and scraping my Tesla token, but on further research the login page returns a JavaScript function. Same problem I had going directly to Tesla's website.

Beta version that might give you what you're looking for:

https://raw.githubusercontent.com/thebearmay/hubitat/main/development/webScraperBeta.groovy

It adds a path, username, and password for the site login page, but it also needs the submit value for the login button on the site.

Thanks, tried it. It doesn't seem to be trying to log in. I believe the function I need is validateForm.

May want to remove the second image, but try:

http://teslafi.com/userLogin.php
for the URL and
submit
for the site submit function

I don't think it is going to work if HTTPS is the issue, as the site automatically redirects to HTTPS.
I even tried with http:....:80:

Thanks, tried that. Found an error in your code, fixed it, and tried again. I just get not found.


Is there supposed to be debug output during the login?

Yeah, I turned on raw. It does not appear to be trying to log in; the tokennew URL just returns a rehash of the login page.

Tried submit, Login, and validateForm; no difference.