This is our alternative to ImportXML, although it is formally very different. Our one used, as a mechanism for parsing, regular expressions, while the function of Google Spreadsheet is using XPath.
The difference has two motivations:
- Excel has a native function for the application of XPath, FILTERXML, which is available from only Excel 2013 and has the limitation that only accepts valid and well-formed XML input. Which is not the HTML of most of the pages that you find on the web.
- We find more versatile using of regular expressions then XPath.
In any case, we are thinking of an alternative to also use XPath in a future release of this project.
Exceptions. If there's a network issue or the URL can not be reached, the function returns # VALUE.
GetHTMLAndApplyGroupRegEx(Url, Regex, NumberOfOccurences, Separator)
This function allows to download the contents of a url and apply a regular expression containing groups to html downloaded.
Parameters:
- Url: String. Any url you want to download the content. Ex: http://www.ideativi.it
- Regex: String. Any regular expressione. Ex: http[^"]*
- NumberOfOccurences (optional): integer. The maximum number of occurrences that the regex can return. This field is optional and has, as default value, -1. -1 permits not to have a maximum number of results.
- Separator (optional): boolean. It specifies the separator to be applied to separate the results of the regex in case they are more than one. This field is optional and has, as default value, vbCrLf. vbCrLf is the equivalent of \r\n and corresponds to a carriage return.
This function is similar to the previous GetHTMLAndApplyRegEx and shares goals and objectives. The only difference is that instead of extracting the whole pattern, extracts only the
groups in it.
Exceptions. If there's a network issue or the URL can not be reached, the function returns # VALUE.
GoogleLookup(TextToQuery)
This function allows to query Google and grab the contents of the One Box, if present for the query.
Parameters:
- TextToQuery: String. Any text for which Google has an answer in the One Box. Ex: Italy Population
Google OneBox
All the queries, that generate a box of this kind, are analyzed and the data downloaded and returned by this function. I do not have a complete list of what you can ask Google, but in the file that you can download at the bottom of this page, you can find several working examples. You can find other examples on the page where Google announced
GoogleLookup, now abandoned.
Exceptions. If there's a network issue or the URL can not be reached, the function returns # VALUE.
BingLookup(TextToQuery)
This function allows to query Bing grab the contents of the One Box, if present for the query.
Parameters:
- TextToQuery: String. Any text for which Bing has an answer in the One Box. Ex: Italy Population
Bing OneBox
The function is similar to the previous GoogleLookup, with the difference that uses, as a supplier of information, Bing. All queries that generate a box. as one of the photo. are analyzed and the data downloaded and returned by this function. Again, the examples are in the excel file that you can download at the bottom of this page. For now it would seem that Bing returns less results then Google, but it is plausible that they will increase since the supplier of the information is
Freebase.
Exceptions. If there's a network issue or the URL can not be reached, the function returns # VALUE.
GoogleFinanceLookup(TextToQuery)
This function allows to query Google Finance and withdraw the value of the searched company's stock.
Parameters:
- TextToQuery: String. The name of any company or ticker of the share. Ex: Microsoft
The function relies on
Google Finance, and returns the value of shares of a company. This function is partly redundant since you can get the same information through the GoogleLookup. The problem of GoogleLookup is that it doesn't return the value of each share, while if you use the stock ticker (ex BIT:F, for Fiat) in this function there is a wider range of stocks to be queried.
Exceptions. If there's a network issue or the URL can not be reached, the function returns # VALUE.
GetTitleFromUrl(Url)
This function allows to query any URL and extract the Title tag if present.
Parameters:
- Url: String. Any url you want to download the content. Ex: http://www.ideativi.it
The function is a simple application of GetHTMLAndApplyRegEx with a regular expression that extracts the contents of the Title tag if present.
Exceptions. If there's a network issue or the URL can not be reached, the function returns # VALUE.
GetMetaDescriptionFromUrl(Url)
This function allows to query any URL and extract the contents of the META Description if present.
Parameters:
- Url: String. Any url you want to download the content. Ex: http://www.ideativi.it
The function is a simple application of GetHTMLAndApplyRegEx with a regular expression that extracts the contents of the META Description if present.
Exceptions. If there's a network issue or the URL can not be reached, the function returns # VALUE.
GetBodyTextFromUrl(Url)
This function allows you to query any URL and extract the text within the body.
Parameters:
- Url: String. Any url you want to download the content. Ex: http://www.ideativi.it
The function downloads the HTML of the URL parameter passed by and lextracts the text from the Body tag, eliminating any tag so far.
Exceptions. If there's a network issue or the URL can not be reached, the function returns # VALUE.
Limitations
After so much good news, we must speak of the limitations that are present in the use of these functions, which depend on the same Excel limitations and where we collect the information.
Almost all the function, we have created, start downloading a page from an URL. VBA, the language with which you can extend the capabilities of Excel has the feature of being single threaded. Which means you can only do one thing at a time, and the worst thing is that Excel shares the same thread between programming in VBA and the user interface. This means that while it is running one of the custom functions, the interface is blocked.
We tried to minimize this problem to the maximum, but it is still quite appreciable. Which means that if you are running a few hundred functions simultaneously the interface might have blocked for a few minutes, and the processing speed also depends on the speed of the sites you are downloading. My advice is once you have downloaded the information you need, copy it to a different excel file. The reason is that Excel may decide at a random time while you work to rerun all the functions, and when it happens you find the sheet blocked. Much better to take the data and place them on a different sheet where to work them.
Another useful tip is to not apply more than 100 functions on the sheet at a time, for two reasons:
- you find the sheet stuck for a while
- if you are using GoogleLookup or BingLookup, know that these do not rely to any public API but recall the same page that you would use in your browser. This means that if there are too many calls, the server would block them to save resources. Google, when there are too many requests, shows a captcha to make sure that they came from a human operator and not from asoftware.
If you want to know how much you can push, I called up to 2000 functions at once, it worked, but I had to wait several minutes before I got control back of the excel sheet. However, you can get the status of the calculation, because when applied several heavy function, Excel provides some information on the status bar.
From the status bar we know that we are using 8 processors, the calculation is at 2% and we applied 1850 functions
We also have another limitation regarding GoogleLookup and BingLookup. Since the information that is returned only come from the American versions of Google and Bing, queries and their results are in English only.
Last but not least, these functions work only on Microsoft Excel for Windows. It doesn't work on Windows RT and on Mac OS.
Credits
Thanks Enrico. :-)