April 29, 2020
Web Scraping with Beautiful Soup
Object Types
Accessing a tag name as an attribute of the BeautifulSoup object returns the first tag of that type in the document:
print(soup.div)
You can get the name of the tag using .name and a dictionary of the tag's attributes using .attrs:
print(soup.div.name)
print(soup.div.attrs)
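To make the pieces above concrete, here is a minimal, self-contained sketch. The HTML string and the variable names are made up for illustration, and the built-in "html.parser" is just one possible parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration.
html = '<div id="main" class="banner wide"><p>Hello</p></div><div>Second</div>'
soup = BeautifulSoup(html, "html.parser")

first_div = soup.div         # only the first <div> in the document
tag_name = first_div.name    # the tag's name as a string
tag_attrs = first_div.attrs  # dict of attributes; class is stored as a list

print(first_div)
print(tag_name)
print(tag_attrs)
```

Note that class comes back as a list because an HTML element can carry several classes at once.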
NavigableStrings are the pieces of text inside the HTML tags on the page. We can get the string inside a tag by accessing its .string attribute. If the tag contains more than one child, .string is None.
print(soup.div.string)
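The behavior of .string is easier to see with a small example. This is a sketch on made-up HTML; get_text() is shown as one way to collect text when a tag has several children.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical HTML, used only for illustration.
html = "<div><h1>Title</h1><p>One <b>two</b></p></div>"
soup = BeautifulSoup(html, "html.parser")

title_text = soup.h1.string   # <h1> has a single text child
mixed_text = soup.p.string    # <p> has two children, so this is None
all_text = soup.p.get_text()  # joins all the text inside <p>

print(type(title_text).__name__, title_text)
print(mixed_text)
print(all_text)
```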
We can get the children of a tag by accessing the .children attribute:
for child in soup.ul.children:
    print(child)
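One thing to watch for with .children is that whitespace between tags shows up as text children too. The sketch below, on made-up HTML, filters those out with an isinstance check; that filter is an illustrative choice, not the only way to do it.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical HTML with newlines between the list items.
html = "<ul>\n<li>one</li>\n<li>two</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")

all_children = list(soup.ul.children)  # includes the "\n" strings
tag_children = [c for c in all_children if isinstance(c, Tag)]
item_texts = [c.get_text() for c in tag_children]

print(len(all_children))
print(item_texts)
```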
We can also navigate up the tree from a tag by accessing the .parents attribute:
for parent in soup.li.parents:
    print(parent)
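Printing each parent in full can get noisy, since every parent contains the one before it. A small sketch on made-up HTML that collects just the names instead:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration.
html = "<html><body><ul><li>item</li></ul></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk from the <li> up to the top of the tree, innermost parent first.
parent_names = [parent.name for parent in soup.li.parents]
print(parent_names)
```

The walk ends at the BeautifulSoup object itself, whose name is "[document]".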
If we want to find all the occurrences of a tag, instead of just the first one, we can use .find_all():
print(soup.find_all("h1"))
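Here is a short sketch, on made-up HTML, comparing the attribute shortcut with .find_all(). The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration.
html = "<h1>First</h1><p>text</p><h1>Second</h1>"
soup = BeautifulSoup(html, "html.parser")

first_h1 = soup.h1             # a single tag
all_h1 = soup.find_all("h1")   # a list-like collection of every match
titles = [h.get_text() for h in all_h1]
missing = soup.find_all("h2")  # no matches gives an empty result, not an error

print(first_h1.get_text())
print(titles)
print(len(missing))
```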
With .find_all(), we can use regexes, attributes, or even functions to select HTML elements more precisely.
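The three kinds of filter might look like the following sketch. The HTML and the helper has_href are made up for illustration; the regex is matched against tag names.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration.
html = '<h1>Title</h1><h2 id="sub">Sub</h2><p class="note">Body</p><a href="/x">link</a>'
soup = BeautifulSoup(html, "html.parser")

# Regex: every heading tag from h1 to h6.
headings = soup.find_all(re.compile(r"^h[1-6]$"))

# Attributes: tags whose id is "sub".
by_id = soup.find_all(attrs={"id": "sub"})

# Function: called once per tag; keep the tag when it returns True.
def has_href(tag):
    return tag.has_attr("href")

links = soup.find_all(has_href)

heading_names = [t.name for t in headings]
print(heading_names)
print(by_id)
print(links)
```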
If we want to find all of the elements with the "banner" class, we could use the command:
soup.find_all(attrs={'class': 'banner'})
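As a sketch on made-up HTML, the attrs form can be compared with the class_ keyword argument that Beautiful Soup also accepts (class itself is a reserved word in Python). Both match a tag when any one of its classes is "banner".

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration.
html = (
    '<div class="banner">A</div>'
    '<p class="banner big">B</p>'
    '<div class="footer">C</div>'
)
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find_all(attrs={"class": "banner"})
by_keyword = soup.find_all(class_="banner")
# Combine a tag name with the class filter to narrow the search further.
banner_divs = soup.find_all("div", class_="banner")

attrs_texts = [t.get_text() for t in by_attrs]
keyword_texts = [t.get_text() for t in by_keyword]
div_texts = [t.get_text() for t in banner_divs]
print(attrs_texts, keyword_texts, div_texts)
```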