Webscraping with Binarylang
Alright! Let's move on to the next problem, shall we? A common real-world problem that many people run into is extracting information from a webpage. This is (usually) known as webscraping. There are a few ways to do it, such as writing regular expressions, or parsing the DOM and using query selectors.
Anyway, I've got some HTML that looks like this:
<ul class="cardDeck cardGrid" data-type="anime"><li data-type="anime" data-id="14109" data-episode-type="episodes" data-episodes="" data-total-episodes="6" class="card ">
<a title="<h5 class='theme-font'>Bottom-tier Character Tomozaki</h5><h6 class='theme-font tooltip-alt'>Alt title: Jaku-Chara Tomozaki-kun</h6><ul class='entryBar'><li class='type'>TV (6+ eps)</li><li>Project No. 9</li><li class='iconYear'>2021 - ?</li><li><div class='ttRating'>3.6</div></li></ul><p>Expert gamer Tomozaki Fumiya doesn’t exactly fit in, but he wishes he did. With no written rules for success and gameplay that doesn’t work in his favor, the real world seems impossible for someone like him. But, like any noob, all he really needs are some strategies and a seasoned player like Aoi Hinami to help him. Hopefully with her guidance, Tomozaki will gain the experience he needs.</p><div class='tooltip notes'><p>Source: Funimation</p></div><div class='tags'><h4>Tags</h4><ul><li>Comedy</li><li>Drama</li><li>Romance</li><li>Shounen</li><li>School Life</li><li>Based on a Light Novel</li></ul></div> <div class='myListBar theirList sep'>
<h4>their anime:</h4>
<span class='status2'></span> Watching - 5/6 eps </div>
" href="/anime/bottom-tier-character-tomozaki" class="tooltip anime14109">
<div class="crop"><img alt="Bottom-tier Character Tomozaki" data-src="/images/anime/covers/thumbs/bottom-tier-character-tomozaki-14109.jpg?t=1610367923" src="/inc/img/card-load.svg" /></div><div class="statusArea"><span class='status2'></span> 5 eps</div> <h3 class='cardName'>Bottom-tier Character Tomozaki</h3>
</a>
</li><li data-type="anime" data-id="14295" data-episode-type="episodes" data-episodes="" data-total-episodes="5" class="card ">
<a title="<h5 class='theme-font'>Dr. Stone: Stone Wars</h5><ul class='entryBar'><li class='type'>TV (5+ eps)</li><li>TMS Entertainment</li><li class='iconYear'>2021 - ?</li><li><div class='ttRating'>4.6</div></li></ul><p>Second season of <a href="https://www.anime-planet.com/anime/dr-stone" >Dr. Stone</a>.</p><div class='tags'><h4>Tags</h4><ul><li>Adventure</li><li>Comedy</li><li>Sci Fi</li><li>Shounen</li><li>Modern Knowledge</li><li>Person in a Strange World</li><li>Post-apocalyptic</li><li>Prehistoric</li><li>Survival</li><li>Based on a Manga</li></ul></div> <div class='myListBar theirList sep'>
<h4>their anime:</h4>
<span class='status2'></span> Watching - 4/5 eps </div>
" href="/anime/dr-stone-stone-wars" class="tooltip anime14295">
<div class="crop"><img alt="Dr. Stone: Stone Wars" data-src="/images/anime/covers/thumbs/dr-stone-stone-wars-14295.jpg?t=1599268423" src="/inc/img/card-load.svg" /></div><div class="statusArea"><span class='status2'></span> 4 eps</div> <h3 class='cardName'>Dr. Stone: Stone Wars</h3>
</a>
</li><li data-type="anime" data-id="15781" data-episode-type="episodes" data-episodes="" data-total-episodes="5" class="card ">
<a
...
Oh boy, this is a mess. There is some non-standard stuff going on within the title attribute: it contains an entire block of HTML markup of its own.
What we want: a list of shows, along with the watch status of each one (how many episodes have been watched).
Hm, it looks like the title is between <h5 class='theme-font'> and </h5>, and the watch status also sits between fixed strings: "Watching - ", "/" and " eps". Let's try it!
createParser(show):
  s: _ # skip until we see the next field
  s: _ = "<h5 class='theme-font'>"
  s: title
  s: _ = "</h5>"
  s: _
  s: _ = "Watching - "
  s: seen
  s: _ = "/"
  s: total
  s: _ = " eps"
  s: _
  s: _ = "</li>" # Read until the end of the item
print website.toShow
toShow(website)=Show(title:"Bottom-tier Character Tomozaki", seen:"5", total:"6")
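To make the skip-and-capture idea concrete, here is a rough hand-written equivalent in plain Nim. It is only a sketch of the same steps, not what binarylang generates: the `readUntil` and `parseShow` helpers, the field types of `Show`, and the `sample` string are all made up for illustration.

```nim
import std/strutils

type Show = object
  title, seen, total: string   # assumed field types, mirroring the output above

proc readUntil(s, marker: string, pos: var int): string =
  ## Hypothetical helper: returns the text between `pos` and the next
  ## `marker`, then moves `pos` just past that marker.
  let i = s.find(marker, pos)
  if i < 0:
    raise newException(ValueError, "marker not found: " & marker)
  result = s[pos ..< i]
  pos = i + marker.len

proc parseShow(s: string, pos: var int): Show =
  ## Same order of steps as the parser definition: skip, capture, skip, ...
  discard s.readUntil("<h5 class='theme-font'>", pos)  # skip to the title
  result.title = s.readUntil("</h5>", pos)
  discard s.readUntil("Watching - ", pos)              # skip to the watch status
  result.seen = s.readUntil("/", pos)
  result.total = s.readUntil(" eps", pos)
  discard s.readUntil("</li>", pos)                    # consume the rest of the item

# A heavily trimmed stand-in for one card from the page (illustrative only).
let sample = "<li><a title=\"<h5 class='theme-font'>Dr. Stone: Stone Wars</h5>" &
             " Watching - 4/5 eps </div>\"></a></li>"
var pos = 0
let show = parseShow(sample, pos)
echo show
```

The parser definition above expresses the same sequence declaratively, without any of the index bookkeeping.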
Wasn't that super easy? You don't need to parse the HTML DOM, you don't need to figure out any regex, and you get a normal Nim type to work with! Now, let's generalize this to all of the shows.
createParser(information):
  *show: {shows}
  s: _ = "</ul>" # Ends when the list ends
print website.toInformation
toInformation(website)=Information(
  shows:@[
    Show(title:"Bottom-tier Character Tomozaki", seen:"5", total:"6"),
    Show(title:"Dr. Stone: Stone Wars", seen:"4", total:"5"),
    Show(title:"Horimiya", seen:"5", total:"5"),
    Show(title:"Mushoku Tensei: Jobless Reincarnation", seen:"5", total:"5"),
    Show(title:"Re:ZERO -Starting Life in Another World- Season 2: Part II", seen:"6", total:"6"),
    Show(title:"So I\'m a Spider, So What?", seen:"5", total:"6"),
    Show(title:"Suppose a Kid from the Last Dungeon Boonies Moved to a Starter Town", seen:"6", total:"6"),
    Show(title:"That Time I Got Reincarnated as a Slime Season 2", seen:"5", total:"5")
  ]
)
And that's it! The only tricky part is figuring out when to stop parsing, but as long as the website has some sort of consistent structure, this is pretty doable.
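The "when to stop" question can be sketched in plain Nim too. This assumes the rule is "keep reading items until the terminator is the next thing in the input", which is my reading of the parser above rather than a statement about binarylang's internals. The `readItems` proc and the simplified `<li>` input are hypothetical.

```nim
import std/strutils

const
  itemOpen = "<li>"
  itemClose = "</li>"
  listEnd = "</ul>"   # the terminator that ends the whole list

proc readItems(s: string): seq[string] =
  ## Hypothetical helper: collects the text of each <li>...</li> and stops
  ## as soon as the input continues with the list terminator.
  var pos = 0
  while pos < s.len and not s.continuesWith(listEnd, pos):
    let startIdx = s.find(itemOpen, pos)
    if startIdx < 0: break
    let endIdx = s.find(itemClose, startIdx)
    if endIdx < 0: break
    result.add s[(startIdx + itemOpen.len) ..< endIdx]
    pos = endIdx + itemClose.len

# Anything after the terminator is never looked at.
let items = readItems("<li>Horimiya</li><li>Dr. Stone</li></ul><li>ignored</li>")
echo items
```

Checking for the terminator only at item boundaries matters on a page like this one, where "</ul>" also shows up inside each title attribute.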