|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "attachments": {}, |
| 5 | + "cell_type": "markdown", |
| 6 | + "metadata": {}, |
| 7 | + "source": [ |
| 8 | + "The PyPDF2 library can be used to read text data from a PDF file.\n", |
| 9 | + "\n", |
| 10 | + "PyPDF2 is designed to extract text from PDF files that were created directly from a word processor." |
| 11 | + ] |
| 12 | + }, |
| 13 | + { |
| 14 | + "cell_type": "code", |
| 15 | + "execution_count": 2, |
| 16 | + "metadata": {}, |
| 17 | + "outputs": [], |
| 18 | + "source": [ |
| 19 | + "import PyPDF2" |
| 20 | + ] |
| 21 | + }, |
| 22 | + { |
| 23 | + "cell_type": "code", |
| 24 | + "execution_count": 7, |
| 25 | + "metadata": {}, |
| 26 | + "outputs": [], |
| 27 | + "source": [ |
| 28 | + "# reading a PDF\n", |
| 29 | + "myfile = open('C:\\\\Users\\\\gmi\\\\Documents\\\\NLP using Python Course\\\\Notebook Files\\\\00-Python-Text-Basics\\\\US_Declaration.pdf', mode='rb')\n", |
| 30 | + "\n", |
| 31 | + "# 'rb' opens the file in read-binary mode, which is needed because a PDF is a binary file rather than a plain-text file" |
| 32 | + ] |
| 33 | + }, |
| 34 | + { |
| 35 | + "cell_type": "code", |
| 36 | + "execution_count": 9, |
| 37 | + "metadata": {}, |
| 38 | + "outputs": [], |
| 39 | + "source": [ |
| 40 | + "# wrap the open file in a PDF reader object\n", |
| 41 | + "pdf_reader = PyPDF2.PdfReader(myfile)" |
| 42 | + ] |
| 43 | + }, |
| 44 | + { |
| 45 | + "cell_type": "code", |
| 46 | + "execution_count": 12, |
| 47 | + "metadata": {}, |
| 48 | + "outputs": [ |
| 49 | + { |
| 50 | + "data": { |
| 51 | + "text/plain": [ |
| 52 | + "5" |
| 53 | + ] |
| 54 | + }, |
| 55 | + "execution_count": 12, |
| 56 | + "metadata": {}, |
| 57 | + "output_type": "execute_result" |
| 58 | + } |
| 59 | + ], |
| 60 | + "source": [ |
| 61 | + "len(pdf_reader.pages) # to display the number of pages in the PDF document" |
| 62 | + ] |
| 63 | + }, |
| 64 | + { |
| 65 | + "cell_type": "code", |
| 66 | + "execution_count": 14, |
| 67 | + "metadata": {}, |
| 68 | + "outputs": [], |
| 69 | + "source": [ |
| 70 | + "# Select the first page of the PDF so its text can be extracted\n", |
| 71 | + "page_one = pdf_reader.pages[0] # index 0 is the first page" |
| 72 | + ] |
| 73 | + }, |
| 74 | + { |
| 75 | + "cell_type": "code", |
| 76 | + "execution_count": 18, |
| 77 | + "metadata": {}, |
| 78 | + "outputs": [ |
| 79 | + { |
| 80 | + "name": "stdout", |
| 81 | + "output_type": "stream", |
| 82 | + "text": [ |
| 83 | + "Declaration of Independence\n", |
| 84 | + "IN CONGRESS, July 4, 1776. \n", |
| 85 | + "The unanimous Declaration of the thirteen united States of America, \n", |
| 86 | + "When in the Course of human events, it becomes necessary for one people to dissolve thepolitical bands which have connected them with another, and to assume among the powers of theearth, the separate and equal station to which the Laws of Nature and of Nature's God entitlethem, a decent respect to the opinions of mankind requires that they should declare the causeswhich impel them to the separation. We hold these truths to be self-evident, that all men are created equal, that they are endowed bytheir Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit\n", |
| 87 | + "of Happiness.— \u0014That to secure these rights, Governments are instituted among Men, derivingtheir just powers from the consent of the governed,— \u0014That whenever any Form of Government\n", |
| 88 | + "becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to\n", |
| 89 | + "institute new Government, laying its foundation on such principles and organizing its powers in\n", |
| 90 | + "such form, as to them shall seem most likely to effect their Safety and Happiness. Prudence,indeed, will dictate that Governments long established should not be changed for light andtransient causes; and accordingly all experience hath shewn, that mankind are more disposed to\n", |
| 91 | + "suffer, while evils are sufferable, than to right themselves by abolishing the forms to which theyare accustomed. But when a long train of abuses and usurpations, pursuing invariably the same\n", |
| 92 | + "Object evinces a design to reduce them under absolute Despotism, it is their right, it is their duty,\n", |
| 93 | + "to throw off such Government, and to provide new Guards for their future securit y.— \u0014Such has\n", |
| 94 | + "been the patient sufferance of these Colonies; and such is now the necessity which constrainsthem to alter their former Systems of Government. The history of the present King of GreatBritain is a history of repeated injuries and usurpations, all having in direct object the\n", |
| 95 | + "establishment of an absolute Tyranny over these States. To prove this, let Facts be submitted to a\n", |
| 96 | + "candid world. \n", |
| 97 | + "He has refused his Assent to Laws, the most wholesome and necessary for the\n", |
| 98 | + "public good.He has forbidden his Governors to pass Laws of immediate and pressingimportance, unless suspended in their operation till his Assent should be obtained;and when so suspended, he has utterly neglected to attend to them.He has refused to pass other Laws for the accommodation of large districts of\n", |
| 99 | + "people, unless those people would relinquish the right of Representation in theLegislature, a right inestimable to them and formidable to tyrants only. He has called together legislative bodies at places unusual, uncomfortable, and distantfrom the depository of their public Records, for the sole purpose of fatiguing them into\n", |
| 100 | + "compliance with his measures.\n" |
| 101 | + ] |
| 102 | + } |
| 103 | + ], |
| 104 | + "source": [ |
| 105 | + "print(page_one.extract_text())" |
| 106 | + ] |
| 107 | + }, |
| 108 | + { |
| 109 | + "cell_type": "code", |
| 110 | + "execution_count": null, |
| 111 | + "metadata": {}, |
| 112 | + "outputs": [], |
| 113 | + "source": [] |
| 114 | + } |
| 115 | + ], |
| 116 | + "metadata": { |
| 117 | + "kernelspec": { |
| 118 | + "display_name": "Python 3", |
| 119 | + "language": "python", |
| 120 | + "name": "python3" |
| 121 | + }, |
| 122 | + "language_info": { |
| 123 | + "codemirror_mode": { |
| 124 | + "name": "ipython", |
| 125 | + "version": 3 |
| 126 | + }, |
| 127 | + "file_extension": ".py", |
| 128 | + "mimetype": "text/x-python", |
| 129 | + "name": "python", |
| 130 | + "nbconvert_exporter": "python", |
| 131 | + "pygments_lexer": "ipython3", |
| 132 | + "version": "3.10.9" |
| 133 | + }, |
| 134 | + "orig_nbformat": 4 |
| 135 | + }, |
| 136 | + "nbformat": 4, |
| 137 | + "nbformat_minor": 2 |
| 138 | +} |
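
The notebook above opens the PDF with a bare `open()` call and reads only the first page. A minimal sketch of the same steps applied to every page, with the file closed automatically, might look like the following. The function names `extract_all_text` and `read_pdf_text` are hypothetical and do not appear in the notebook; only `PyPDF2.PdfReader`, `.pages`, and `.extract_text()` are taken from it.

```python
def extract_all_text(reader):
    """Return the text of every page, joined by newlines.

    `reader` is any object exposing `.pages`, where each page has an
    `.extract_text()` method (as the notebook's PyPDF2.PdfReader does).
    """
    # Fall back to an empty string if a page yields no text
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def read_pdf_text(pdf_path):
    """Open `pdf_path` in binary mode and return the text of all pages."""
    import PyPDF2  # imported here so extract_all_text has no dependency

    # The with-block closes the file even if extraction raises an error
    with open(pdf_path, mode="rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        # Extract before leaving the block, while the file is still open
        return extract_all_text(pdf_reader)
```

Calling `read_pdf_text` with the same path passed to `open()` in the notebook would return the text of all five pages as one string instead of only page one.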