Get the nth column out of a text document (Python 3)

I'm trying to write code that will allow Linux Mint users to install all recommended packages for any software that is already installed on their machine. To get the list of packages already installed, I run the following in bash:

grep 'install' /var/log/dpkg.log

This returns something like this:

2015-09-24 19:39:01 install libportsmf0:amd64 <none> 0.1~svn20101010-4
2015-09-24 19:39:02 install libsbsms10:amd64 <none> 2.0.2-1
2015-09-24 19:39:03 install libsoxr0:amd64 <none> 0.1.1-1
2015-09-24 19:39:04 install libwxbase3.0-0:amd64 <none> 3.0.2-1+b1
2015-09-24 19:39:05 install libwxgtk3.0-0:amd64 <none> 3.0.2-1+b1
2015-09-24 19:39:07 install libvamp-hostsdk3:amd64 <none> 1:2.5-dmo6
2015-09-24 19:39:08 install audacity-data:all <none> 2.0.6-2
2015-09-24 19:39:10 install audacity:amd64 <none> 2.0.6-2
2015-09-25 11:47:36 install hardinfo:amd64 <none> 0.5.1-1.4
2015-09-25 12:14:35 install libstdc++6:i386 <none> 4.9.2-10
2015-09-25 12:14:36 install libudev1:i386 <none> 215+12+betsy
2015-09-25 12:14:37 install libtinfo5:i386 <none> 5.9+20140913-1+b1
2015-09-25 12:14:38 install libbsd0:i386 <none> 0.7.0-2
2015-09-25 12:14:39 install libedit2:i386 <none> 3.1-20140620-2
2015-09-25 12:14:40 install nvidia-installer-cleanup:amd64 <none> 20141201+1

What I need is to grab the fourth column of each line, which holds the package name: libportsmf0:amd64, libsbsms10:amd64, and so on. So far I've tried piping the output of grep 'install' to a file, opening the file with Python 3, and using a for loop to grab the fourth column, like this:

import os
def recommends():
    os.system("grep 'install' /var/log/dpkg.log >> ~/irFiles.txt")

file1 = '~/irFiles.txt'

But I haven't been able to figure out how to set up the for loop yet. Thanks!

Answers

To get the n-th space-separated column from a text file in Python 3:

packages = set() # a set of unique package names
with open('/var/log/dpkg.log') as file:
    for line in file:
        column = line.split() # split on any whitespace
        if "install" in column[2]:  # 3rd column
            packages.add(column[3]) # 4th column
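
The same idea can be generalized to any column. Here is a minimal sketch, assuming whitespace-separated fields; the helper name nth_column and its keyword parameter are made up for illustration and are not part of the answer above:

```python
def nth_column(lines, n, keyword=None):
    """Yield the n-th (1-based) whitespace-separated field of each line.

    Lines with fewer than n fields are skipped. If keyword is given,
    only lines that contain it as a whole field are considered.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < n:
            continue  # too short (e.g. a blank line): avoid IndexError
        if keyword is not None and keyword not in fields:
            continue
        yield fields[n - 1]


sample = [
    "2015-09-24 19:39:10 install audacity:amd64 <none> 2.0.6-2",
    "2015-09-25 11:47:36 install hardinfo:amd64 <none> 0.5.1-1.4",
    "",  # blank lines are skipped
]
print(list(nth_column(sample, 4, keyword="install")))
# ['audacity:amd64', 'hardinfo:amd64']
```

Since an open file is itself an iterable of lines, it can be passed in place of sample.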


You don't need to parse dpkg.log; you can get the information directly. For example, to get a list of previously recommended packages and install them:

#!/usr/bin/env python
import subprocess

def recommended_packages():
    # see http://serverfault.com/a/382231
    return iterlines(["aptitude", "search", '~RBrecommends:~i', '-F', '%p'])

for package in recommended_packages():
    subprocess.check_call(b"sudo apt-get install".split() + [package])

where iterlines() yields the subprocess's stdout line by line:

def iterlines(cmd):
    # no bufsize=1 here: line buffering is not supported for binary pipes
    # and triggers a RuntimeWarning on Python 3.8+
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    with p.stdout:
        for line in iter(p.stdout.readline, b''):
            yield line.rstrip(b'\n') # yield package name (as bytes)
    if p.wait() != 0:
        raise subprocess.CalledProcessError(p.returncode, cmd)

Or to install "recommends" for all manually installed packages:

def manually_installed_packages():
    # see http://superuser.com/a/6932
    return iterlines(["aptitude", "search", '~i !~M', "-F", "%p"])

for package in manually_installed_packages():
    subprocess.check_call(b"sudo apt-get install --install-recommends".split()
                          + [package])

You could probably avoid spawning subprocesses and import apt_pkg instead, to get the necessary info and manage the corresponding packages.

Posted by jfs

Why not do it directly in bash?

Using cut

# something like that
$ cat /var/log/dpkg.log | grep 'install' | cut -f4 -d" "

The field number passed to -f<number> may differ on your system: my log has a status field in between, so for me it's -f5. The -d" " parameter tells cut that the fields are separated by spaces rather than tabs (the default).
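
If you later move this to Python, note that cut -d" " starts a new field at every single space, while Python's split() without arguments collapses runs of whitespace. A small sketch of the difference; cut_field is a hypothetical helper and the sample line (with a doubled space) is made up for illustration:

```python
def cut_field(line, n, delimiter=" "):
    """Roughly mimic `cut -f<n> -d<delimiter>` for a line that contains the delimiter."""
    fields = line.rstrip("\n").split(delimiter)
    return fields[n - 1] if len(fields) >= n else ""


# Made-up line with two spaces after "install", to show the difference.
line = "2015-09-24 19:39:10 install  audacity:amd64 <none> 2.0.6-2"

print(repr(cut_field(line, 4)))  # '' (the empty field between the two spaces)
print(repr(cut_field(line, 5)))  # 'audacity:amd64' (the name moved to field 5)
print(repr(line.split()[3]))     # 'audacity:amd64' (split() collapses whitespace)
```

For logs with exactly one space between fields, both approaches give the same result.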

Exclude unwanted output via grep -v

And if you want to exclude something like <none> in the output, you can extend the command with inverted grep (grep -v) like this:

# something like that
$ cat /var/log/dpkg.log | grep 'install' | cut -f4 -d" " | grep -v '<none>'

It's easy to pipe more grep -v commands onto the end of the command to exclude more entries. (This could also be done with a single regular expression, but this way is easier to understand.)

Removing duplicates at the end with sort and uniq

If you have duplicates in the output, you can also remove them using sort and uniq.

# something like that
$ cat /var/log/dpkg.log | grep 'install' | cut -f4 -d" " | grep -v '<none>' | sort | uniq
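
For comparison, the whole pipeline (filter, take field 4, drop <none>, sort, de-duplicate) could look roughly like this in Python. This is only a sketch: unique_packages is a made-up name, and Python sorts by code point, so the order may differ from a locale-aware sort:

```python
def unique_packages(lines):
    """Return the sorted, de-duplicated 4th fields of lines mentioning 'install'."""
    found = set()  # a set drops duplicates, like `sort | uniq`
    for line in lines:
        fields = line.split()
        if "install" in line and len(fields) >= 4 and fields[3] != "<none>":
            found.add(fields[3])
    return sorted(found)


sample = [
    "2015-09-25 12:14:38 install libbsd0:i386 <none> 0.7.0-2",
    "2015-09-24 19:39:10 install audacity:amd64 <none> 2.0.6-2",
    "2015-09-25 12:14:38 install libbsd0:i386 <none> 0.7.0-2",
]
print(unique_packages(sample))
# ['audacity:amd64', 'libbsd0:i386']
```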

Python

If you really want to do it with Python, you can do something like this:

# the with statement is not strictly necessary, but recommended.
with open("/var/log/dpkg.log") as logfile:
    for line in logfile:
        columns = line.split()
        # index 2 holds the action in the question's format; use index 3 for
        # "status installed" lines. The substring test also matches
        # 'installed', 'half-installed', …; use the re module for stricter matching.
        if len(columns) > 3 and "install" in columns[2]:  # or columns[3]
            # your code here
            print(line)

Posted by colidyre

Using the sample input as shown in the question:

awk '/install/{print $4;}' /var/log/dpkg.log

Note that there is no need for grep here.

If you prefer Python, the following bash-python one-liner does the same thing:

python -c $'import sys\nfor line in sys.stdin:\n  if "install" in line: print(line.split()[3])' </var/log/dpkg.log

Other file formats

For the above to work on the /var/log/dpkg.log on my system, some changes were required:

awk '/status installed/{print $5;}' /var/log/dpkg.log
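
If you want one Python function that copes with both layouts, something along these lines might work. It is a sketch under the assumption that lines look like either "date time install package ..." or "date time status installed package version"; installed_package is a made-up name and the status lines below are illustrative, not copied from a real log:

```python
def installed_package(line):
    """Return the package name from an 'install' or 'status installed' line, else None."""
    fields = line.split()
    if len(fields) >= 4 and fields[2] == "install":
        return fields[3]  # date time install <package> ...
    if len(fields) >= 5 and fields[2:4] == ["status", "installed"]:
        return fields[4]  # date time status installed <package> <version>
    return None


print(installed_package("2015-09-24 19:39:10 install audacity:amd64 <none> 2.0.6-2"))
# audacity:amd64
print(installed_package("2015-09-24 19:39:11 status installed audacity:amd64 2.0.6-2"))
# audacity:amd64
print(installed_package("2015-09-24 19:39:11 status half-installed audacity:amd64 2.0.6-2"))
# None
```

Comparing whole fields rather than searching for the substring "install" keeps half-installed and similar states out of the result.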

Working with the pastebin file format

For the file format as shown in the pastebin example:

$ awk '/ install /{print $4;}' /var/log/dpkg.log
libecj-java:all
i2p-router:all
libjbigi-jni:amd64
libservice-wrapper-java:all
libservice-wrapper-jni:amd64
[...snip...]

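Posted by John1024

import glob
import re

installed_modules = []
for line in glob.glob(directory):  # directory is a glob pattern; each match is a path string
    mo = re.search(r'install\s*(\S+):', line)
    if mo:  # skip entries that don't match
        installed_modules.append(mo.group(1))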

or, with a file ...

with open('data') as f:
    for l in f:
        mo = re.search(r'install\s*(\S+):', l)
        if mo:  # skip lines without an install entry
            installed_modules.append(mo.group(1))

print(installed_modules)

['libportsmf0', 'libsbsms10', 'libsoxr0', 'libwxbase3.0-0', 'libwxgtk3.0-0', 'libvamp-hostsdk3', 'audacity-data', 'audacity', 'hardinfo', 'libstdc++6', 'libudev1', 'libtinfo5', 'libbsd0', 'libedit2', 'nvidia-installer-cleanup']

Posted by LetzerWille

In Python:

for line in open(file1):
    package = line.split()[3]

or

dpkg_log_content = open(file1).read()
for line in dpkg_log_content.splitlines():
    package = line.split()[3]

In package you have the package name, so you can save it to a list or do whatever you want with it.
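
For instance, collecting the names into a list could be sketched as below. Note that open() does not expand a leading '~', so a path like the question's '~/irFiles.txt' needs os.path.expanduser first; the function name packages_from_file is just an assumption for this example:

```python
import os


def packages_from_file(path):
    """Collect the 4th whitespace-separated field of every line in the file at path."""
    packages = []
    # open() does not understand '~', so expand it explicitly
    with open(os.path.expanduser(path)) as log:
        for line in log:
            fields = line.split()
            if len(fields) >= 4:  # skip blank or short lines
                packages.append(fields[3])
    return packages
```

With the file from the question this would be called as packages_from_file('~/irFiles.txt').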

Also, you can filter the 'install' lines in Python:

for line in open("/var/log/dpkg.log"):
    line_columns = line.split()
    action = line_columns[2]
    if action == "install":
        package = line_columns[3]

Posted by Anonymous