
Webscraping with Binarylang

Alright! Let's move on to the next problem, shall we? A common real-world problem that many people run into is extracting information from a webpage. This is (usually) known as webscraping. There are a few ways to do it, such as matching the raw text with regular expressions, or parsing the DOM and using query selectors.

Anyway, I've got some HTML that looks like this:

<ul class="cardDeck cardGrid" data-type="anime"><li data-type="anime" data-id="14109" data-episode-type="episodes" data-episodes="" data-total-episodes="6" class="card  ">
<a title="<h5 class='theme-font'>Bottom-tier Character Tomozaki</h5><h6 class='theme-font tooltip-alt'>Alt title: Jaku-Chara Tomozaki-kun</h6><ul class='entryBar'><li class='type'>TV (6+ eps)</li><li>Project No. 9</li><li class='iconYear'>2021 - ?</li><li><div class='ttRating'>3.6</div></li></ul><p>Expert gamer Tomozaki Fumiya doesn&rsquo;t exactly fit in, but he wishes he did. With no written rules for success and gameplay that doesn&rsquo;t work in his favor, the real world seems impossible for someone like him. But, like any noob, all he really needs are some strategies and a seasoned player like Aoi Hinami to help him. Hopefully with her guidance, Tomozaki will gain the experience he needs.</p><div class='tooltip notes'><p>Source: Funimation</p></div><div class='tags'><h4>Tags</h4><ul><li>Comedy</li><li>Drama</li><li>Romance</li><li>Shounen</li><li>School Life</li><li>Based on a Light Novel</li></ul></div>        <div class='myListBar theirList sep'>
            <h4>their anime:</h4>
             <span class='status2'></span> Watching - 5/6 eps        </div>

    " href="/anime/bottom-tier-character-tomozaki" class="tooltip anime14109">
<div class="crop"><img alt="Bottom-tier Character Tomozaki" data-src="/images/anime/covers/thumbs/bottom-tier-character-tomozaki-14109.jpg?t=1610367923" src="/inc/img/card-load.svg" /></div><div class="statusArea"><span class='status2'></span> 5 eps</div> <h3 class='cardName'>Bottom-tier Character Tomozaki</h3>
</a>
</li><li data-type="anime" data-id="14295" data-episode-type="episodes" data-episodes="" data-total-episodes="5" class="card  ">
<a title="<h5 class='theme-font'>Dr. Stone: Stone Wars</h5><ul class='entryBar'><li class='type'>TV (5+ eps)</li><li>TMS Entertainment</li><li class='iconYear'>2021 - ?</li><li><div class='ttRating'>4.6</div></li></ul><p>Second season of <a href=&quot;https://www.anime-planet.com/anime/dr-stone&quot; >Dr. Stone</a>.</p><div class='tags'><h4>Tags</h4><ul><li>Adventure</li><li>Comedy</li><li>Sci Fi</li><li>Shounen</li><li>Modern Knowledge</li><li>Person in a Strange World</li><li>Post-apocalyptic</li><li>Prehistoric</li><li>Survival</li><li>Based on a Manga</li></ul></div>        <div class='myListBar theirList sep'>
            <h4>their anime:</h4>
             <span class='status2'></span> Watching - 4/5 eps        </div>

    " href="/anime/dr-stone-stone-wars" class="tooltip anime14295">
<div class="crop"><img alt="Dr. Stone: Stone Wars" data-src="/images/anime/covers/thumbs/dr-stone-stone-wars-14295.jpg?t=1599268423" src="/inc/img/card-load.svg" /></div><div class="statusArea"><span class='status2'></span> 4 eps</div> <h3 class='cardName'>Dr. Stone: Stone Wars</h3>
</a>
</li><li data-type="anime" data-id="15781" data-episode-type="episodes" data-episodes="" data-total-episodes="5" class="card  ">
<a 
...

Oh boy, this is a mess. The title attribute is doing something unusual: its value is an entire block of tooltip HTML, with headings, lists, and paragraphs all stuffed inside a single attribute.

What we want: a list of shows, and the watch status for each one (how many episodes have been watched). Hm, it looks like the title sits between <h5 class='theme-font'> and </h5>, and the watch status sits between "Watching - " and " eps". Let's try it!

createParser(show):
  s: _ # skip until we see the next field
  s: _ = "<h5 class='theme-font'>" # marker right before the title
  s: title # read up to the next marker
  s: _ = "</h5>"
  s: _ # skip the rest of the tooltip
  s: _ = "Watching - "
  s: seen # episodes watched so far
  s: _ = "/"
  s: total # total number of episodes
  s: _ = " eps"
  s: _
  s: _ = "</li>" # Read until the end of the item

print website.toShow

toShow(website)=Show(title:"Bottom-tier Character Tomozaki", seen:"5", total:"6")
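If you're wondering what "skip until this marker, then read until that marker" amounts to, here's a rough hand-written sketch of the same idea using only std/strutils. To be clear, this is not how binarylang works internally. The helper names (skipPast, readUntil, parseShow) and the trimmed-down sample string are made up for illustration.

```nim
import std/strutils

type Show = object
  title, seen, total: string

proc skipPast(src: string, pos: var int, mark: string) =
  ## Moves `pos` just past the next occurrence of `mark`.
  let i = src.find(mark, pos)
  if i < 0:
    raise newException(ValueError, "marker not found: " & mark)
  pos = i + mark.len

proc readUntil(src: string, pos: var int, mark: string): string =
  ## Returns the text between `pos` and the next `mark`, then skips the mark.
  let i = src.find(mark, pos)
  if i < 0:
    raise newException(ValueError, "marker not found: " & mark)
  result = src[pos ..< i]
  pos = i + mark.len

proc parseShow(src: string, pos: var int): Show =
  ## Same order of markers as the parser above.
  src.skipPast(pos, "<h5 class='theme-font'>")
  result.title = src.readUntil(pos, "</h5>")
  src.skipPast(pos, "Watching - ")
  result.seen = src.readUntil(pos, "/")
  result.total = src.readUntil(pos, " eps")
  src.skipPast(pos, "</li>") # consume the rest of the item

# A heavily trimmed stand-in for one card from the page (not the real markup).
const sample = """<li class="card"><a title="<h5 class='theme-font'>Dr. Stone: Stone Wars</h5><ul><li>Comedy</li></ul> Watching - 4/5 eps </div>" href="/anime/dr-stone-stone-wars"></a>
</li>"""

var pos = 0
let show = parseShow(sample, pos)
echo show
```

It works, but every marker needs its own find call and its own error check, which is the bookkeeping the declarative version hides.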

Wasn't that super easy? You don't need to parse the HTML DOM or figure out any regex, and you get a normal Nim type to work with! Now, let's generalize this to all of the shows.

createParser(information):
  *show: {shows} # repeat the show parser until the marker below
  s: _ = "</ul>" # Ends when the list ends

print website.toInformation

toInformation(website)=Information(
  shows:@[
    Show(title:"Bottom-tier Character Tomozaki", seen:"5", total:"6"),
    Show(title:"Dr. Stone: Stone Wars", seen:"4", total:"5"),
    Show(title:"Horimiya", seen:"5", total:"5"),
    Show(title:"Mushoku Tensei: Jobless Reincarnation", seen:"5", total:"5"),
    Show(title:"Re:ZERO -Starting Life in Another World- Season 2: Part II", seen:"6", total:"6"),
    Show(title:"So I\'m a Spider, So What?", seen:"5", total:"6"),
    Show(title:"Suppose a Kid from the Last Dungeon Boonies Moved to a Starter Town", seen:"6", total:"6"),
    Show(title:"That Time I Got Reincarnated as a Slime Season 2", seen:"5", total:"5")
  ]
)

And that's it! The only tricky part is figuring out when to stop parsing, but as long as the website has some sort of structure, this is pretty doable.
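To make the "when to stop" part a bit more concrete: on this page, </ul> also shows up inside every title attribute, so "stop at the first </ul>" would quit halfway through the first show. One possible rule, sketched by hand below with std/strutils, is to consume each item all the way to its closing </li> and only then check whether the list closes before the next title starts. The parseTitles proc and the sample string are hypothetical, and the rule assumes the page keeps the structure shown above.

```nim
import std/strutils

proc parseTitles(src: string): seq[string] =
  ## Collects show titles until the list-closing marker comes before the next title.
  const
    titleOpen = "<h5 class='theme-font'>"
    titleClose = "</h5>"
    itemClose = "</li>"
    listEnd = "</ul>"
  var pos = 0
  while true:
    let nextTitle = src.find(titleOpen, pos)
    let nextEnd = src.find(listEnd, pos)
    # Stop when there are no more items, or the list closes before the next one.
    if nextTitle < 0 or (nextEnd >= 0 and nextEnd < nextTitle):
      break
    let titleStart = nextTitle + titleOpen.len
    let titleStop = src.find(titleClose, titleStart)
    if titleStop < 0: break # truncated input: give up rather than guess
    result.add src[titleStart ..< titleStop]
    # Skip the rest of the item, including the lists nested in the title attribute.
    let watching = src.find("Watching - ", titleStop)
    if watching < 0: break
    let itemStop = src.find(itemClose, watching)
    if itemStop < 0: break
    pos = itemStop + itemClose.len

# Trimmed stand-in for the card list, plus a stray heading after the list ends.
const sample = """<ul class="cardDeck">
<li><a title="<h5 class='theme-font'>Bottom-tier Character Tomozaki</h5><ul><li>Comedy</li></ul> Watching - 5/6 eps"></a>
</li><li><a title="<h5 class='theme-font'>Dr. Stone: Stone Wars</h5><ul><li>Comedy</li></ul> Watching - 4/5 eps"></a>
</li>
</ul>
<h5 class='theme-font'>Not part of the list</h5>"""

let titles = parseTitles(sample)
echo titles
```

The nested </ul> inside each title never triggers the stop, because by the time the check runs the position is already past that item's closing </li>. The stray heading after the real </ul> is ignored for the same reason.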