Python - Corpora Access

Neha Kumawat

8 months ago

Python - Corpora Access | insideAIML
Table of Contents
  • Introduction
  • Accessing Raw Text

Introduction

          Corpora is a group presenting multiple collections of text documents. A single collection is called corpus. One such famous corpus is the Gutenberg Corpus which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/. In the below example we access the names of only those files from the corpus which are plain text with filename ending as .txt.

from nltk.corpus import gutenberg
fields = gutenberg.fileids()

print(fields)
When we run the above program, we get the following output −

[austen-emma.txt', austen-persuasion.txt', austen-sense.txt', bible-kjv.txt', 
blake-poems.txt', bryant-stories.txt', burgess-busterbrown.txt',
carroll-alice.txt', chesterton-ball.txt', chesterton-brown.txt', 
chesterton-thursday.txt', edgeworth-parents.txt', melville-moby_dick.txt',
milton-paradise.txt', shakespeare-caesar.txt', shakespeare-hamlet.txt',
shakespeare-macbeth.txt', whitman-leaves.txt']

Accessing Raw Text

          We can access the raw text from these files using sent_tokenize function which is also available in nltk. In the below example we retrieve the first two paragraphs of the blake poen text.

from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg

sample = gutenberg.raw("blake-poems.txt")

token = sent_tokenize(sample)

for para in range(2):
    print(token[para])
When we run the above program we get the following output −

[Poems by William Blake 1789]

 
SONGS OF INNOCENCE AND OF EXPERIENCE
and THE BOOK of THEL


 SONGS OF INNOCENCE
 
 
 INTRODUCTION
 
 Piping down the valleys wild,
   Piping songs of pleasant glee,
 On a cloud I saw a child,
   And he laughing said to me:
 
 "Pipe a song about a Lamb!"
So I piped with merry cheer.
       
Liked what you read? Then don’t break the spree. Visit our insideAIML blog page to read more awesome articles. 
Or if you are into videos, then we have an amazing Youtube channel as well. Visit our InsideAIML Youtube Page to learn all about Artificial Intelligence, Deep Learning, Data Science and Machine Learning. 
Keep Learning. Keep Growing. 

Submit Review