Extract Comments From Reddit Rss Feed Feedparser Python


In this post, we take a closer look at how to fetch and parse syndicated feeds with the Feedparser library in Python. The library has detailed documentation of its own. As usual, the post begins with an introduction to RSS. The next section briefly explains what Feedparser is. Then we jump straight into how to install and use the library. For a complete demonstration, we present an example in which BioModels' Models of The Month RSS feed is fetched and processed according to the client's needs. The post closes with the conclusions and references.

What is RSS?

RSS stands for Rich Site Summary, also known as Really Simple Syndication. It lets the audience of a website or web application access the latest updates in a standardised, computer-readable format, namely XML. An RSS document (also called a feed, web feed or channel) often includes full or summarised text plus metadata such as the publishing date and the author's name.

To read RSS feeds, users usually rely on a program or a browser plugin/extension, a so-called RSS reader or news aggregator, to keep track of the many websites they want to follow. On the other side, common programming languages let developers parse RSS content through libraries. At the time of writing, the ROME API, written in Java, and the Python library Feedparser, presented in this post, are among the most widely used.

Figure: BioModels' Models of The Month RSS feed as displayed in Firefox

Common structure of an RSS document

Look at the link below:

https://www.ebi.ac.uk/biomodels/modelOfTheMonth/rss

That is BioModels' Models of The Month RSS feed, where you can keep track of all models published monthly by BioModels' Data Curators. The XML-based document includes a set of tags that are concisely explained below.

  • The document begins with the rss tag, and its channel contains a title, link and description as mandatory fields. The language, copyright, managingEditor and image are optional properties. These properties are the common information/metadata of the feed.
  • Each channel can have multiple items.
  • Each item typically includes title, link, description, pubDate and guid.
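To make the structure above concrete, here is a minimal sketch that reads a hand-written RSS 2.0 document with Python's standard xml.etree.ElementTree module. The sample feed, its example.org URLs and the variable names are invented for illustration; they are not taken from the BioModels feed.

```python
import xml.etree.ElementTree as ET

# A hand-written RSS 2.0 document (illustrative content, not the real BioModels feed).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>https://example.org/</link>
    <description>A tiny feed used to show the RSS layout.</description>
    <language>en-GB</language>
    <item>
      <title>First post</title>
      <link>https://example.org/posts/1</link>
      <description>Summary of the first post.</description>
      <pubDate>Sun, 16 Dec 2018 10:00:00 GMT</pubDate>
      <guid>https://example.org/posts/1</guid>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.org/posts/2</link>
      <description>Summary of the second post.</description>
      <pubDate>Mon, 17 Dec 2018 10:00:00 GMT</pubDate>
      <guid>https://example.org/posts/2</guid>
    </item>
  </channel>
</rss>
"""

root = ET.fromstring(SAMPLE_RSS)   # the top-level <rss> element
channel = root.find("channel")     # the single <channel> child

# Channel-level metadata, then the title of every <item>.
channel_title = channel.findtext("title")
items = channel.findall("item")
item_titles = [item.findtext("title") for item in items]

print(root.tag, root.get("version"))
print(channel_title)
print(item_titles)
```

Feedparser hides this tree walking behind a dictionary-like result, which is why the rest of the post never touches the XML directly.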

What is Feedparser?

Feedparser is a Python library for parsing feeds in a variety of known formats, such as Atom, RSS and RDF. At the time of writing, it runs on Python 2.4 and later, up to Python 3.6, as stated in its development repository (see the tox.ini file).

Install Feedparser

I use conda to manage Python packages. The command to install Feedparser is

conda install feedparser

Many Python developers use pip instead. The command is similar to the conda one.

pip install feedparser

Verify the installation

To verify whether a package is installed on your system, run the command conda list or pip list. The command displays the list of installed packages, in which you can check whether the feedparser package is present.

Alternatively, you can enter import feedparser in Python's interactive mode. If the statement prints nothing and raises no error, the feedparser library was installed successfully.

I won't keep you waiting any longer: it's time to play with Feedparser.
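The same check can be scripted. The sketch below uses the standard importlib.util.find_spec function, which looks a module up without importing it. The helper name is_installed is introduced here for illustration, and the json module is listed only because it ships with Python and so always reports as installed.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Report on feedparser and, for comparison, a module that is always available.
for name in ("feedparser", "json"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Once the import succeeds, feedparser.__version__ should report which release you have.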

Fetch and parse BioModels' Models of The Month Feeds

As explained above, we will familiarise ourselves with Feedparser by learning how to fetch and extract information from BioModels' Models of The Month feed.

Start your program by importing the feedparser package.

import feedparser

Fetch the document

Fetching a document means calling the parse function with the feed link as its only required argument. The call returns the parsed feed.

bm_mom_feeds_link = "https://www.ebi.ac.uk/biomodels/modelOfTheMonth/rss"
d = feedparser.parse(bm_mom_feeds_link)
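Before reading any fields, it can help to check how the fetch went. Feedparser does not raise on a bad feed; instead the result carries a bozo flag, a bozo_exception when that flag is set, and an HTTP status when the feed came over the network. The following is a minimal sketch under those assumptions: describe_result and fetch_and_describe are hypothetical helper names, and the two dictionaries at the end are stand-ins with made-up values rather than real feedparser output.

```python
def describe_result(parsed):
    """Build a one-line summary of a feedparser result (or any dict with the same keys)."""
    status = parsed.get("status", "n/a")       # only present when fetched over HTTP
    entries = len(parsed.get("entries", []))
    if parsed.get("bozo"):
        # bozo is truthy when feedparser hit a problem, e.g. malformed XML;
        # some entries may still have been recovered.
        problem = parsed.get("bozo_exception", "unknown problem")
        return f"status={status}, entries={entries}, problem: {problem}"
    return f"status={status}, entries={entries}, well-formed"


def fetch_and_describe(url):
    """Fetch a feed and summarise the outcome. Needs feedparser and network access."""
    import feedparser  # imported here so describe_result stays usable without it

    return describe_result(feedparser.parse(url))


# Stand-in dictionaries that mimic the keys feedparser sets (illustrative values).
ok = {"status": 200, "bozo": 0, "entries": [{}, {}]}
broken = {"bozo": 1, "bozo_exception": ValueError("not well-formed"), "entries": []}
print(describe_result(ok))
print(describe_result(broken))
```

With the library installed, fetch_and_describe(bm_mom_feeds_link) would run the same summary against the live feed.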

Access parsed data

As described in the section on RSS structure, the channel carries the common information/metadata of the feed. In the parsed result this is available as d['feed']. The output looks like this:

{
    'title': 'Models of The Month',
    'title_detail': {
        'type': 'text/plain',
        'language': None,
        'base': 'https://www.ebi.ac.uk/biomodels/modelOfTheMonth/rss',
        'value': 'Models of The Month'
    },
    'links': [
        {
            'rel': 'alternate',
            'type': 'text/html',
            'href': 'https://www.ebi.ac.uk/biomodels/content/model-of-the-month?all=yes'
        }
    ],
    'link': 'https://www.ebi.ac.uk/biomodels/content/model-of-the-month?all=yes',
    'subtitle': 'Every month, a scientist from the BioModels Database team selects a model to further investigate and writes a synopsis to explain that model in details.',
    'subtitle_detail': {
        'type': 'text/html',
        'language': None,
        'base': 'https://www.ebi.ac.uk/biomodels/modelOfTheMonth/rss',
        'value': 'Every month, a scientist from the BioModels Database team selects a model to further investigate and writes a synopsis to explain that model in details.'
    },
    'language': 'en-GB',
    'rights': 'Copyright 2005-2018, EMBL-EBI',
    'rights_detail': {
        'type': 'text/plain',
        'language': None,
        'base': 'https://www.ebi.ac.uk/biomodels/modelOfTheMonth/rss',
        'value': 'Copyright 2005-2018, EMBL-EBI'
    },
    'authors': [
        {
            'name': 'BioModels Team',
            'email': 'biomodels-developers@lists.sf.net'
        }
    ],
    'author': 'biomodels-developers@lists.sf.net (BioModels Team)',
    'author_detail': {
        'name': 'BioModels Team',
        'email': 'biomodels-developers@lists.sf.net'
    },
    'image': {
        'title': 'Models of The Month',
        'title_detail': {
            'type': 'text/plain',
            'language': None,
            'base': 'https://www.ebi.ac.uk/biomodels/modelOfTheMonth/rss',
            'value': 'Models of The Month'
        },
        'href': 'https://www.ebi.ac.uk/biomodels/images/biomodels/logo_small.png'
    }
}

The output is a dictionary-like structure that reads much like JSON, so it is easy to see which field you want to retrieve.

Now let's move a bit further and get the news/feed entries. To find out the number of entries/items, we can run the following statement.

print(len(d['entries']))

You can access each item either by index or in a loop. Below is a snippet that prints the title and link of every entry.

for entry in d.entries:
    print(entry.title + "\n" + entry.link)

The output looks like this:

Chen2004 - An integrated yeast cell cycle model
https://www.ebi.ac.uk/biomodels/content/model-of-the-month?year=2018&month=12
Proctor2016 - Circadian rhythm of PTH and the dynamics of signalling molecules on bone remodelling
https://www.ebi.ac.uk/biomodels/content/model-of-the-month?year=2018&month=11
Liebal et al., (2012). Proteolysis of beta-galactosidase following SigmaB activation in Bacillus subtilis
https://www.ebi.ac.uk/biomodels/content/model-of-the-month?year=2018&month=10
Heldt2018 - Proliferation-quiescence decision in response to DNA damage.
https://www.ebi.ac.uk/biomodels/content/model-of-the-month?year=2018&month=08
Gould2013 - Network balance via CRY signalling controls the Arabidopsis circadian clock over ambient temperatures.
https://www.ebi.ac.uk/biomodels/content/model-of-the-month?year=2018&month=07
Rateitschak et al (2012). Parameter Identifiability and Sensitivity Analysis Predict Targets for Enhancement of STAT1 Activity in Pancreatic Cancer and Stellate Cells
https://www.ebi.ac.uk/biomodels/content/model-of-the-month?year=2018&month=06
Costa et al (2014). An extended dynamic model of Lactococcus lactis metabolism for mannitol and 2,3-butanediol production
https://www.ebi.ac.uk/biomodels/content/model-of-the-month?year=2018&month=05

The complete example can be found below.
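The steps above can be assembled into one script along the following lines. This is a sketch rather than the original listing: format_entry and main are names introduced here, the placeholder strings for missing fields are an assumption, and feedparser is imported inside main so that the formatting helper can be tried without the library or a network connection.

```python
FEED_URL = "https://www.ebi.ac.uk/biomodels/modelOfTheMonth/rss"


def format_entry(entry):
    """Return the title and link of one entry on two lines, tolerating missing fields."""
    title = entry.get("title", "(no title)")
    link = entry.get("link", "(no link)")
    return f"{title}\n{link}"


def main(url=FEED_URL):
    """Fetch the feed, then print its title, the number of entries and each entry."""
    import feedparser  # third-party; install with conda or pip as shown earlier

    d = feedparser.parse(url)
    print(d.feed.get("title", "(untitled feed)"))
    print(len(d.entries))
    for entry in d.entries:
        print(format_entry(entry))


# Offline check of the helper, with a plain dictionary standing in for an entry.
print(format_entry({"title": "Example model", "link": "https://example.org/model"}))
```

Calling main() performs the actual fetch. Using entry.get instead of attribute access is a design choice made here so that an item lacking a title or link does not stop the loop.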

Conclusions

We have gone through an introduction to RSS feeds and an explanation of Feedparser, then tried the library directly with hands-on examples. The library is powerful enough that we can also use it to parse other feed formats. Apart from this library, the ROME API serves Java developers. A post on how to use the ROME API to work with RSS feeds will be published soon.

References

[1] Using Feedparser in Python

aguilaranonton99.blogspot.com

Source: https://www.itersdesktop.com/2018/12/16/using-feedparser-in-python-to-read-rss/
