I use pup for this. People who tried both, any difference? https://github.com/Er...

zamubafoo · on April 13, 2022

Not sure if pup supports this, but something I use fairly often in htmlq (and have copied into my own internal tooling) is the ability to filter out results with a CLI flag.

For example, something I usually do is:

  curl --include --location https://example.com | tee /tmp/example-com.html | htmlq --base https://example.com a --attribute href --remove-nodes 'a[href*="#"],a[href^="javascript"],a[href*="?"]'

This grabs the page, shunts a copy to /tmp for subsequent, iterative testing, then grabs all the links while filtering out any whose href contains a '#' or a '?', or starts with 'javascript'. This is super helpful when I'm just exploring a scraped page and trying to build a graph of links without having to reach for a proper programming language just yet.
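For a tool without such a flag, roughly the same filtering can be approximated after extraction with plain grep. This is only a sketch: `filter_links` is a hypothetical helper name, and it matches on the printed href text rather than on DOM nodes, so it is not an exact equivalent of `--remove-nodes`.

```shell
# Hypothetical helper (not part of htmlq or pup): reads one href per line on
# stdin and drops any that contain '#' or '?', or that start with "javascript".
# In grep's basic regex syntax '#' and '?' are literal characters.
filter_links() {
  grep -v -e '#' -e '?' -e '^javascript'
}

# Assumed usage against the copy saved by tee in the command above:
#   htmlq a --attribute href < /tmp/example-com.html | filter_links
```

Because the filter only sees text, it works the same on relative and absolute hrefs, but it cannot drop a link based on anything other than the href string itself.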