
Using git-history to build a web scraper in less than a minute

In his blog post, Simon Willison describes git-history, a Python tool that builds on a pattern he described earlier for scraping web content via GitHub Actions. Basically, the pattern scrapes a web resource and commits the differences to a git repository.
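The core of the pattern can be sketched in a few lines of Python. This is a minimal illustration and not Simon's implementation: the helper name `write_if_changed` and the simulated snapshots are made up for this example. In a real setup, a scheduled GitHub Actions workflow would download the resource and run `git commit` only when the file actually changed.

```python
import tempfile
from pathlib import Path


def write_if_changed(path: Path, content: bytes) -> bool:
    """Store a freshly scraped snapshot and report whether it is new.

    Hypothetical helper: returns True when the file was created or updated,
    which is the signal to create a new commit.
    """
    if path.exists() and path.read_bytes() == content:
        return False  # nothing new, so no commit is needed
    path.write_bytes(content)
    return True


# Simulate three scraper runs against a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "parkdaten_dyn.xml"
    runs = [
        b"<occupancy>10</occupancy>",
        b"<occupancy>10</occupancy>",  # unchanged, would be skipped
        b"<occupancy>12</occupancy>",
    ]
    changed = [write_if_changed(target, snapshot) for snapshot in runs]

print(changed)
```

Only the runs that report a change would end up as commits, so the git log itself becomes the time series.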

I wanted to try this pattern on a source of open data that has been static until now, so that a time series of this formerly static source is available in the future. I did so with the parking data of Frankfurt, and you can inspect the data in this repository.

The following plot shows the occupancy of parking spaces in Frankfurt for the last 1000 rows generated from the repository's history.

Random collection of parking blocks
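As a rough sketch of how the rows behind such a plot might be pulled out of a SQLite database, the following example selects the newest rows and groups them per parking area. The table `reading`, its columns, and the sample values are assumptions made up for illustration; they are not the schema that git-history produces.

```python
import sqlite3
from collections import defaultdict

# In-memory stand-in for the database; table and column names are assumptions.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE reading (commit_at TEXT, id TEXT, occupancy REAL)")
db.executemany(
    "INSERT INTO reading VALUES (?, ?, ?)",
    [
        ("2021-12-01T10:00:00", "A1", 0.25),
        ("2021-12-01T10:00:00", "B2", 0.80),
        ("2021-12-01T10:05:00", "A1", 0.30),
        ("2021-12-01T10:10:00", "B2", 0.75),
    ],
)

LIMIT = 3  # the plot above uses the last 1000 rows

# Take the newest rows first, then restore insertion order for plotting.
rows = db.execute(
    "SELECT commit_at, id, occupancy FROM ("
    "  SELECT rowid AS rid, commit_at, id, occupancy"
    "  FROM reading ORDER BY rowid DESC LIMIT ?"
    ") ORDER BY rid",
    (LIMIT,),
).fetchall()

# One list of (timestamp, occupancy) points per parking area.
series = defaultdict(list)
for commit_at, area_id, occupancy in rows:
    series[area_id].append((commit_at, occupancy))

print(dict(series))
```

Each entry in `series` can then be handed to the plotting library of your choice as one line per parking area.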

If the time-series approach, which is the default for git-history, feels odd to you, you can alternatively import full (non-diffed) versions via a dedicated flag. So you can customize the invocation to your own needs. By default, the import expects its input in JSON format. If you are importing something else, you define a custom input-conversion routine and pass it with the flag --convert. For the Frankfurt parking data it looks like the following (just some bits of the final code):

git-history file ffm-parking.db parkdaten_dyn.xml --convert 'tree = xml.etree.ElementTree.fromstring(content)
# ...
areas = []
for el in tree[1][3][0].findall("{}parkingAreaStatus"):
    areas.append({
        "id": e(el, "parkingAreaReference").get("id"),
        "occupancy": e(el, "parkingAreaOccupancy").text,
        # ...
    })
return areas # ...
' --id id --import xml.etree.ElementTree
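To make the conversion step easier to follow, here is a self-contained sketch of the same idea outside of git-history. The XML sample is a simplified, made-up stand-in for the real feed (which uses namespaces and deeper nesting), and the function name `convert` is hypothetical. Only the element names `parkingAreaStatus`, `parkingAreaReference` and `parkingAreaOccupancy` are taken from the snippet above.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up sample; the real feed has namespaces and more nesting.
SAMPLE = """<payload>
  <parkingAreaStatus>
    <parkingAreaReference id="A1"/>
    <parkingAreaOccupancy>0.25</parkingAreaOccupancy>
  </parkingAreaStatus>
  <parkingAreaStatus>
    <parkingAreaReference id="B2"/>
    <parkingAreaOccupancy>0.80</parkingAreaOccupancy>
  </parkingAreaStatus>
</payload>"""


def convert(content: str) -> list:
    """Turn one XML snapshot into a list of dicts, one per parking area."""
    tree = ET.fromstring(content)
    areas = []
    for el in tree.findall("parkingAreaStatus"):
        areas.append({
            # The "id" key is what --id id uses to track an item across commits.
            "id": el.find("parkingAreaReference").get("id"),
            "occupancy": el.find("parkingAreaOccupancy").text,
        })
    return areas


areas = convert(SAMPLE)
print(areas)
```

The returned list of dictionaries has the same shape git-history would otherwise read from a JSON file, which is why a conversion routine is all that is needed for other formats.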

Feel free to download the data from the repository and visualize it. Searching for git-scraping turns up a collection of many more examples using this technique.

Update 12/05/2022: Unfortunately, the open data portal removed the dataset from its platform on the 17th of December 2021, and it has not been restored since. I suppose this is no coincidence in light of the Log4Shell vulnerability.